<h2 class="wp-block-heading"><strong>Beyond group and count</strong></h2>



<p>Window functions are a wonderfully useful SQL technique. They make complex aggregations simple to build.</p>



<p>After using them to great effect for <a href="https://www.sisense.com/blog/use-subqueries-and-window-functions-to-compute-running-averages/">selecting only one row</a>, <a href="https://www.sisense.com/blog/use-subqueries-and-window-functions-to-compute-running-averages/">computing running averages</a>, and breaking out <a href="https://www.sisense.com/blog/computing-day-over-day-changes-with-window-functions/">day-over-day changes</a>, we thought it was high time to explain them in more detail.</p>



<h2 class="wp-block-heading"><strong>Simple aggregations and percentages</strong></h2>



<p>Let’s start with some data from a video game company. For each platform, we want to know how many times a user played a game on that platform and what percent of all gameplays that platform has.</p>



<pre class="wp-block-code"><code>select 
  platform, 
  count(1) as plays,
  count(1) / (sum(count(1)) over ())::float as "% of plays"
from gameplays
group by 1</code></pre>



<p>The window function in this query is sum(count(1)) over ().</p>



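<p>The inner count(1) is the ordinary aggregate, computed once per platform by the group by. sum(...) over () then adds up those per-platform counts, and the empty over () tells it to look at every row of the grouped result without collapsing them. Thus this function gives us the total number of gameplays across all platforms, repeated on every row.</p>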



<p>The count(1) in the numerator is not part of the window function, and so it applies to all rows in the group, giving us a per-platform count.</p>



<p>Putting it all together, here are the results:</p>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Platform-plays-table.png" alt="Platform plays" class="wp-image-75852"/></figure>
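
<p>To watch the nested aggregate work on something small, here is a sketch of the same query run against a few inline rows. It assumes PostgreSQL syntax, and the sample_gameplays CTE and its five rows are made up for illustration; they stand in for the real gameplays table.</p>

<pre class="wp-block-code"><code>-- sample_gameplays is hypothetical data, one row per gameplay
with sample_gameplays (platform) as (
  values ('iOS'), ('iOS'), ('iOS'), ('Android'), ('Web')
)
select
  platform,
  count(1) as plays,                      -- per-platform count from the group by
  sum(count(1)) over () as total_plays,   -- sum of the per-platform counts: 5 on every row
  count(1) / (sum(count(1)) over ())::float as "% of plays"
from sample_gameplays
group by 1
order by plays desc, platform</code></pre>

<p>With this sample data, iOS comes back with 3 plays and 0.6, while Android and Web each have 1 play and 0.2. The total_plays column is only there to make the window function's value visible.</p>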



<h2 class="wp-block-heading"><strong>Calculating ntiles</strong></h2>



<p>Quartiles or deciles can be a very useful way to split a dataset. Windows are by far the easiest way to do this in SQL. Let’s look at user spend quartiles, and the min and max spend within each quartile.</p>



<pre class="wp-block-code"><code>select 
  quartile, 
  min(spend) as min, 
  max(spend) as max
from (
  select 
    spend, 
    ntile(4) over (order by spend asc) as quartile
  from (
    select user_id, sum(price) as spend
    from purchases
    group by 1
  ) user_spend
) user_spend_quartiles
group by 1 
order by quartile asc</code></pre>



<p>The inner query gives us a table of spend per user. The middle query annotates each row with the quartile — ntile(4) — of spend. Finally, the outer query aggregates the rows into just the min and max of each quartile.</p>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Ntile-table.png" alt="Ntile table" class="wp-image-75857"/></figure>
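
<p>One detail worth knowing is what ntile does when the rows do not divide evenly: it makes the buckets as equal as possible and gives the extra rows to the lowest-numbered buckets. The sketch below assumes PostgreSQL and uses a made-up user_spend CTE with six users in place of the aggregated purchases table.</p>

<pre class="wp-block-code"><code>-- user_spend is hypothetical data: six users, so four buckets cannot be equal
with user_spend (user_id, spend) as (
  values (1, 5), (2, 10), (3, 20), (4, 40), (5, 80), (6, 160)
)
select
  quartile,
  count(1) as users,
  min(spend) as min,
  max(spend) as max
from (
  select
    spend,
    ntile(4) over (order by spend asc) as quartile
  from user_spend
) user_spend_quartiles
group by 1
order by quartile asc</code></pre>

<p>Here quartiles 1 and 2 hold two users each (spend 5 to 10 and 20 to 40), and quartiles 3 and 4 hold one user each (80 and 160). On a large table the difference in bucket size is at most one row, so it rarely matters, but it is visible on small samples like this one.</p>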



<h2 class="wp-block-heading"><strong>Cumulative metrics</strong></h2>



<p>Say what you will about cumulative metrics — they are certainly to be used sparingly — but they are easy to calculate with window functions. Here we’ll compute a running sum of all revenue.</p>



<pre class="wp-block-code"><code>select 
  day, 
  sum(spend) over (
    order by day asc 
    rows between unbounded preceding and current row
  ) as cumulative_spend
from (
  select 
    date(created_at) as day, 
    sum(price) as spend
  from purchases
  group by 1
) daily_revenue</code></pre>



<p>The inner query defines a simple daily sum of all revenue. The outer query makes it cumulative, summing all the values between the first day and the current day.</p>



<p>That’s accomplished with <strong>rows between unbounded preceding and current row</strong>. For each row, <strong>unbounded preceding</strong> begins the sum at the first row of the window (here, the earliest day in the table), and <strong>current row</strong> halts the sum at, well, the current row.</p>



<p>Here are the results of both the inner and outer queries:</p>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Revenue-by-Day.png" alt="Revenue by day" class="wp-image-75862"/></figure>
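
<p>The same running sum can be tried without a purchases table. This is a minimal sketch, assuming PostgreSQL; the daily_revenue CTE and its three days of figures are invented for the example.</p>

<pre class="wp-block-code"><code>-- daily_revenue is hypothetical data standing in for the inner query
with daily_revenue (day, spend) as (
  values
    (date '2024-01-01', 10),
    (date '2024-01-02', 25),
    (date '2024-01-03', 5)
)
select
  day,
  spend,
  sum(spend) over (
    order by day asc
    rows between unbounded preceding and current row
  ) as cumulative_spend   -- 10, then 10 + 25, then 10 + 25 + 5
from daily_revenue
order by day</code></pre>

<p>The cumulative_spend column comes back as 10, 35, and 40, one value per day, while the spend column keeps the original daily figures next to it.</p>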



<h2 class="wp-block-heading"><strong>Determining the position of a row</strong></h2>



<p>Window functions can also tell us where a row falls in an ordering. Let’s take the platform query from the first section and add a ranking column showing which platform has the highest number of plays:</p>



<pre class="wp-block-code"><code>select 
  platform, 
  plays, 
  plays / (sum(plays) over ())::float as "% of plays",
  rank() over (order by plays desc)
from (
  select platform, count(1) as plays
  from gameplays
  group by 1
) plays_by_platform</code></pre>



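<p>rank() gives each row’s position in the ordering, with tied rows sharing the same rank and a gap left after them (row_number() would instead number the rows strictly 1, 2, 3). over (order by plays desc) specifies the order in which to apply the rank.</p>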



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Platform-Plays-2-table.png" alt="Platform plays 2" class="wp-image-75867"/></figure>
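
<p>The difference between the ranking functions only shows up when there are ties. The following sketch, which assumes PostgreSQL and a made-up plays_by_platform CTE in which two platforms tie, puts rank(), dense_rank(), and row_number() side by side.</p>

<pre class="wp-block-code"><code>-- plays_by_platform is hypothetical data; iOS and Android tie on purpose
with plays_by_platform (platform, plays) as (
  values ('iOS', 50), ('Android', 50), ('Web', 30)
)
select
  platform,
  plays,
  rank() over (order by plays desc) as plays_rank,              -- ties share a rank, then a gap: 1, 1, 3
  dense_rank() over (order by plays desc) as plays_dense_rank,  -- ties share a rank, no gap: 1, 1, 2
  -- platform is added as a tie-breaker so the numbering is deterministic
  row_number() over (order by plays desc, platform) as plays_row_number  -- 1, 2, 3
from plays_by_platform
order by plays desc, platform</code></pre>

<p>Without the tie-breaker, row_number() would still return 1, 2, 3, but which of the tied platforms gets 1 would be up to the database.</p>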



<h2 class="wp-block-heading"><strong>Multiple windows with partition</strong></h2>



<p>Often we want a separate ordering for different parts of the table. This is what the partition feature enables. It splits the window function, applying it separately to each specified partition.</p>



<p>For example, let’s rank players by number of gameplays within each platform, so that the most active player on each platform gets rank 1:</p>



<pre class="wp-block-code"><code>select 
  platform, 
  user_id,
  plays, 
  rank() over (partition by platform order by plays desc)
from (
  select platform, user_id, count(1) as plays
  from gameplays
  group by 1, 2
) plays_by_user_and_platform</code></pre>



<p>Our <strong>partition by platform</strong> makes the rank() function give us a separate rank for each platform.</p>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Platform-user-ID-table.png" alt="Platform user ID table" class="wp-image-75872"/></figure>
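
<p>To keep only the top player per platform, one common approach is to wrap the ranking query in a subquery and filter on the rank. The sketch below assumes PostgreSQL; the inline plays_by_user_and_platform rows and the platform_rank alias are made up for the example.</p>

<pre class="wp-block-code"><code>-- plays_by_user_and_platform is hypothetical data standing in for the grouped gameplays
with plays_by_user_and_platform (platform, user_id, plays) as (
  values
    ('iOS', 1, 12),
    ('iOS', 2, 30),
    ('Android', 1, 7),
    ('Android', 3, 7),
    ('Web', 2, 4)
)
select platform, user_id, plays
from (
  select
    platform,
    user_id,
    plays,
    rank() over (partition by platform order by plays desc) as platform_rank
  from plays_by_user_and_platform
) ranked
where platform_rank = 1   -- the filter has to live outside the query that computes the rank
order by platform, user_id</code></pre>

<p>This returns user 2 for iOS, user 2 for Web, and both users 1 and 3 for Android, because they tie at 7 plays. Swapping rank() for row_number() would return exactly one row per platform, at the cost of picking arbitrarily between tied players.</p>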



<h2 class="wp-block-heading"><strong>How It All Works</strong></h2>



<p>Superficially, window functions are similar to your basic “group by” functionality. However, rather than subdividing tables into exclusive “groups” of rows and collapsing them, window functions can look at arbitrary “windows” and do so without collapsing the windows into a single row.</p>



<h3 class="wp-block-heading"><strong>Pieces of a Window Function</strong></h3>



<p>Dissecting our last example, rank() over (partition by platform order by plays desc), we can pull out four pieces:</p>



<ul><li>rank(): the function, which aggregates, ranks, or filters the rows in the window</li><li>over(&#8230;): the window definition, which specifies which rows the function applies to</li><li>partition by platform: which subset of rows is considered (in this case, all rows with the same platform are in one partition)</li><li>order by plays desc: the order of the rows in the window; this is especially useful for functions like first_value() or row_number(), which depend on ordering</li></ul>



<p>Finally, the over() window definition can also have a frame clause, such as rows between unbounded preceding and current row, which further restricts which rows of the partition are in the window. The cumulative metrics section above goes into this in more detail.</p>
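
<p>A frame does not have to start at the beginning of the partition. As a sketch of a narrower frame, the query below computes a moving average over the current row and the two rows before it. It assumes PostgreSQL, and the daily_revenue CTE, its values, and the moving_avg_3_rows alias are invented for illustration.</p>

<pre class="wp-block-code"><code>-- daily_revenue is hypothetical data with exactly one row per day
with daily_revenue (day, spend) as (
  values
    (date '2024-01-01', 10),
    (date '2024-01-02', 20),
    (date '2024-01-03', 30),
    (date '2024-01-04', 40)
)
select
  day,
  spend,
  avg(spend) over (
    order by day asc
    rows between 2 preceding and current row   -- at most three rows per window
  ) as moving_avg_3_rows
from daily_revenue
order by day</code></pre>

<p>The averages come out as 10, 15, 20, and 30. The first two windows are shorter because there are not yet two preceding rows. Note that a rows frame counts rows, not calendar days, so this only equals a three-day average when every day has exactly one row.</p>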



<h3 class="wp-block-heading"><strong>Evaluation Order</strong></h3>



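<p>Window functions are evaluated after the join, where, group by, and having clauses, at the same time as the other expressions in the select list.</p>



<p>That, unfortunately, means a window function can’t refer to aliases defined elsewhere in the same select list, and the where, group by, and having clauses can’t refer to a window function’s result. To do either, you’ll need a subquery: compute the fields in an inner query and put your window function in the outer query, or compute the window function in the inner query and filter on it in the outer one.</p>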
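
<p>As a sketch of the subquery workaround, the query below computes a window total once in the inner query and then reuses it by name in two outer expressions. It assumes PostgreSQL; the plays_by_platform rows and the total_plays, share, and plays_elsewhere aliases are made up for the example.</p>

<pre class="wp-block-code"><code>-- plays_by_platform is hypothetical data
with plays_by_platform (platform, plays) as (
  values ('iOS', 50), ('Android', 30), ('Web', 20)
)
select
  platform,
  plays,
  plays / total_plays::float as share,       -- total_plays is a plain column out here
  total_plays - plays as plays_elsewhere
from (
  select
    platform,
    plays,
    sum(plays) over () as total_plays         -- window function evaluated in the inner query
  from plays_by_platform
) with_totals
order by plays desc</code></pre>

<p>Writing plays / total_plays in the same select list that defines total_plays would fail, because the alias does not exist yet at that point. With the subquery, the outer query returns shares of 0.5, 0.3, and 0.2, and 50, 70, and 80 plays elsewhere.</p>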



<h3 class="wp-block-heading"><strong>Window Function Availability</strong></h3>



<p>Window functions were defined in SQL:2003 and are available in PostgreSQL, SQL Server, Redshift (which supports a subset of Postgres’s functions) and Oracle (which calls them “analytic functions”).</p>



<p>MySQL added window functions in version 8.0. On older versions, which lack them, you can still get a lot of mileage out of <a href="https://dev.mysql.com/doc/refman/8.0/en/user-variables.html" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">variables</a> and <a href="https://dev.mysql.com/doc/refman/5.5/en/group-by-functions.html#function_group-concat" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">group_concat</a>.</p>



<h3 class="wp-block-heading"><strong>More Neat Tricks</strong></h3>



<p>As you can see, we’re big fans of window functions! Here are some times we’ve used them to great effect:</p>



<ul><li><a href="https://www.sisense.com/blog/predicting-exponential-growth-with-sql/">Predicting Exponential Growth with SQL</a>, in which they calculate a regression of exponential data</li><li><a href="https://www.sisense.com/blog/use-window-functions-for-local-percentages/">Use window functions for time-series percentages</a>, in which they make a time series proportional</li><li><a href="https://www.sisense.com/blog/generate-series-in-redshift-and-mysql/">Generate Series in Redshift and MySQL</a>, in which they replace Redshift’s unfortunate lack of generate_series</li><li><a href="https://www.sisense.com/blog/4-ways-to-join-only-the-first-row-in-sql/">4 Ways to Join Only The First Row in SQL</a>, in which (spoiler alert!) one of the ways is to use a window function</li><li><a href="https://www.sisense.com/blog/ascii-art-charts-in-the-terminal/">ASCII Art Charts in the Terminal</a>, in which they auto-scale our ascii charts</li></ul>

