<h2 class="wp-block-heading">Choosing Your Analysis Database</h2>



<p>There are <em>a lot</em> of databases out there. Many of them work well for their users and were well chosen for their specific analysis use case. But there’s one antipattern that frustrates analysts again and again: MySQL.</p>



<p>This post is an attempt to do some good in the world. If you’re choosing a database for analysis purposes and you stumble across this post, read on and learn why MySQL is the wrong choice.</p>



<h2 class="wp-block-heading">1. No Window Functions</h2>



<p>Window functions are one of the greatest tools in an analyst’s tool belt. Their superpower is their flexibility in letting you aggregate across groupings without restructuring your query.</p>



<p>Let’s take an example: Daily revenue with day-over-day deltas. Here’s the SQL:</p>



<pre class="wp-block-code"><code>select
  dt, 
  revenue, 
  (revenue - rev_yesterday) / rev_yesterday as daily_delta
from (
  select
    date(created_at) as dt,
    sum(price) as revenue,
    lag(sum(price), 1) over (order by date(created_at)) as rev_yesterday
  from purchases
  group by 1
) t</code></pre>



<p>This simple query is enabled by the lag window function, which lets us compute yesterday’s revenue inline in the query. All together, it gives us a very nice revenue graph, complete with growth rate, seen here:</p>



<figure class="wp-block-image size-full fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-01-25.png" alt="Revenue graph with growth chart" class="wp-image-80377"/></figure>
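


<p>The same pattern extends beyond lag. As a rough sketch, assuming the same purchases table with created_at and price columns, a running total and a trailing average can sit alongside daily revenue in one query. The subquery alias daily and the output column names are illustrative choices, not part of the original example:</p>



<pre class="wp-block-code"><code>select
  dt,
  revenue,
  -- Cumulative revenue from the first day through the current row.
  sum(revenue) over (order by dt) as running_revenue,
  -- Average over the current row and the six rows before it.
  -- This counts rows, not calendar days, so days with no
  -- purchases are skipped rather than treated as zero.
  avg(revenue) over (
    order by dt
    rows between 6 preceding and current row
  ) as trailing_avg_revenue
from (
  select
    date(created_at) as dt,
    sum(price) as revenue
  from purchases
  group by 1
) daily
order by dt</code></pre>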



<p>Let’s do the same computation in MySQL, which had no window functions before version 8.0:</p>



<pre class="wp-block-code"><code>select
  today.dt,
  today.revenue,
  (today.revenue - yesterday.revenue) / yesterday.revenue as daily_delta
from (
  select
    date(created_at) as dt,
    sum(price) as revenue
  from purchases
  group by 1
) today
left join (
  select
    date(created_at) as dt,
    sum(price) as revenue
  from purchases
  group by 1
) yesterday
  on datediff(today.dt, yesterday.dt) = 1</code></pre>



<p>This query is both cumbersome and expensive. We have to scan and aggregate a potentially large table like purchases twice, once for each side of the self-join, and then run a datediff across pairs of days to match each day with the one before it. Joining the raw purchases rows to each other directly would be even worse: every purchase would pair with every purchase from the previous day, inflating both sums. All for a simple daily delta.</p>



<h2 class="wp-block-heading">2. No Set-Returning Functions</h2>



<p>In Postgres, a <a href="https://www.postgresql.org/docs/9.4/functions-srf.html" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">set-returning function</a> is a function that returns a table which you can join to the rest of your query. These functions are useful in many ways, but for a simple example, let’s look at a common function: <a href="https://www.sisense.com/blog/use-generate-series-to-get-continuous-results/">generate_series</a>.</p>



<p>Let’s say we’ve only had purchases on three of the days in the last week. If we want to graph purchases per day, a simple group-and-count will give these results:</p>



<figure class="wp-block-image size-full fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-02-24.png" alt="Purchases per day" class="wp-image-80386"/></figure>



<p>The empty days are not showing up at all! To get meaningful results, we have to find a way to get zeroes on the other days. We do that by joining to a list of dates created by generate_series:</p>



<pre class="wp-block-code"><code>select d, coalesce(sum(price), 0)
from 
  generate_series(
    date(now() - interval '7 day'), 
    date(now()), 
    '1 day'
  ) d
  left join purchases on date(purchases.created_at) = d
group by 1</code></pre>



<p>As we can see, <strong>d</strong> becomes a table of dates that is left-joined to purchases. Here are the new results in a pretty graph:</p>



<figure class="wp-block-image size-full fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-03-20.png" alt="Purchases by day" class="wp-image-80416"/></figure>



<p>This time, MySQL doesn’t even offer a clunky in-query workaround. Before version 8.0 added recursive common table expressions, there was <strong>no built-in way</strong> to generate the list of dates as part of the query! Instead, many analysts build a table filled with numbers or dates and update it nightly with a script.</p>
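


<p>A minimal sketch of that helper-table workaround in MySQL syntax might look like the following. The table name calendar_dates is hypothetical, historical dates would need a one-time backfill, and the nightly statement is assumed to run from whatever scheduler you already use:</p>



<pre class="wp-block-code"><code>-- One-time setup: a helper table holding one row per calendar day.
create table calendar_dates (
  d date not null primary key
);

-- Nightly job: add today's date. "insert ignore" skips the row
-- if the job has already run today.
insert ignore into calendar_dates (d) values (curdate());

-- Report query: the counterpart of the generate_series join above.
select
  calendar_dates.d,
  coalesce(sum(purchases.price), 0) as revenue
from calendar_dates
left join purchases
  on date(purchases.created_at) = calendar_dates.d
where calendar_dates.d &gt;= curdate() - interval 7 day
group by calendar_dates.d
order by calendar_dates.d</code></pre>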



<h2 class="wp-block-heading">3. No Strictness on Groupings</h2>



<p>Few things ruin an analyst’s day like building a bunch of reports that appear to be right, only to find out later that they are incorrect. Yet that is exactly what this gap allows.</p>



<p>To understand, let’s take a look at a broken query:</p>



<pre class="wp-block-code"><code>select date(created_at), platform, sum(price) 
from purchases
group by 1</code></pre>



<p>At first glance, this appears to be revenue by platform by day. However, it’s missing a second grouping column! Postgres-family databases will return an error insisting that the query also group by platform:</p>



<figure class="wp-block-image size-full fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-04-15.png" alt="" class="wp-image-80423"/></figure>



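<p>MySQL, however, will <strong>return the wrong results</strong> unless the ONLY_FULL_GROUP_BY SQL mode is enabled, which has been the default only since version 5.7.5. Here they are:</p>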



<figure class="wp-block-image size-full fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-05-11.png" alt="Wrong results" class="wp-image-80430"/></figure>



<p>Look closely: There’s only one row per date! One of the platforms is chosen arbitrarily for each day, and that row carries the revenue for every platform on that date. You’re left to realize later that the per-platform numbers were never what they appeared to be.</p>
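


<p>For reference, the query the analyst presumably meant groups by both selected columns. This sketch assumes the same purchases table; the column aliases and the ordering are added only for readability:</p>



<pre class="wp-block-code"><code>select
  date(created_at) as dt,
  platform,
  sum(price) as revenue
from purchases
-- Group by every non-aggregated column in the select list.
group by 1, 2
order by 1, 2</code></pre>



<p>Written this way, the query returns one row per date and platform in both Postgres and MySQL.</p>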



<h2 class="wp-block-heading">4. No JSON Support</h2>



<p>After three increasingly unacceptable gaps, this one is more of a nice-to-have. But the fact is: App developers will often log semistructured data in JSON fields, and analysts will be left to pick up the pieces if they want to analyze that data.</p>



<p>Postgres provides a <a href="https://www.sisense.com/blog/the-lazy-analysts-guide-to-postgres-json/">rich set of functions</a> for getting at that data. For example, it’s increasingly common to store app events like pageviews in JSON blobs. You might have an object each day that looks like this:</p>



<pre class="wp-block-code"><code>{
  "pageview": 12,
  "shopping_cart_click": 7,
  "purchase": 3
}</code></pre>



<p>In Postgres, you’d count the daily pageviews this way:</p>



<pre class="wp-block-code"><code>select date(event_time) as d, sum(value::float) as ct
from events, json_each_text(events.event_data)
where key = 'pageview'
group by 1</code></pre>
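


<p>When only one key matters, the -&gt;&gt; operator is an alternative to expanding every key. A sketch, assuming event_data is a json or jsonb column on the same events table:</p>



<pre class="wp-block-code"><code>select
  date(event_time) as d,
  -- The operator returns the value for the key as text, so cast
  -- it before summing. Rows without the key yield null, which
  -- sum ignores.
  sum((event_data -&gt;&gt; 'pageview')::float) as ct
from events
group by 1
order by 1</code></pre>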



<p>In MySQL versions before 5.7, which have no JSON functions, you’re out of luck. If your data’s formatted this way, counting pageviews by day is impossible without writing an ETL in another language.</p>



<h2 class="wp-block-heading">When to Use MySQL</h2>



<p>MySQL is a solid choice as a production serving database, especially in high-load, highly replicated environments. MySQL’s replication is well understood and battle-tested, whereas Postgres gained built-in replication later, so it has had less time to mature.</p>



<p>But when choosing a database for analysis, don’t miss out on all these great features. Postgres itself is a solid choice for small datasets. For larger ones, we recommend a warehouse in the Postgres family.</p>

