Fun with Window Functions
Beyond Group And Count
Window functions are a wonderfully useful SQL technique. They make complex aggregations simple to build.
After using them to great effect for everything from selecting only one row, to computing running averages, to breaking out day-over-day changes, we thought it was high time to explain them in more detail.
Simple Aggregations and Percentages
Let’s start with some data from a video game company. For each platform, we want to know how many times a user played a game on that platform, and what percent of all gameplays that platform has.
count(1) as plays,
count(1) / (sum(count(1)) over ())::float as "% of plays"
group by 1
The window function in this query is sum(count(1)) over ().
sum(count(1)) gives us the total number of gameplays. over () specifies to aggregate over all the rows without collapsing them. Thus this function gives us the total number of gameplays across all platforms.
The count(1) in the numerator is not part of the window function, and so it applies to all rows in the group, giving us a per-platform count.
Putting it all together, here are the results:
Quartiles or deciles can be a very useful way to split a dataset. Window are by far the easiest way to do this in SQL. Let’s look at user spend quartiles, and the min and max spend within each quartile.
min(spend) as min,
max(spend) as max
ntile(4) over (order by spend asc) quartile
select user_id, sum(price) as spend
group by 1
group by 1
order by ntile asc
The inner query gives us a table of spend per user. The middle query annotates each row with the quartile — ntile(4) — of spend. Finally, the outer query aggregates the rows into just the min and max of each quartile.
Say what you will about cumulative metrics — they are certainly to be used sparingly — but they are easy to calculate with window functions. Here we’ll compute a running sum of all revenue.
sum(spend) over (
order by day asc
rows between unbounded preceding and current row
date(created_at) as day,
sum(price) as spend
group by 1
The inner query defines a simply daily sum of all revenue. The outer query makes it cumulative, summing all the values between the first day and the current day.
That’s accomplished with rows between unbounded preceding and current row. For each row, unbounded preceding begins the sum at the beginning of the table, and current row halts the sum at, well, the current row.
Here are the results of both the inner and outer queries:
Determining The Position of a Row
Ordering information is another useful trick window functions give us. Let’s take the previous query, and also add a ranking column for which platform has the highest number of plays:
plays / (sum(plays) over ())::float as "% of plays",
rank() over (order by plays desc)
select platform, count(1) as plays
group by 1
rank() gives the row’s number, and over (order by plays desc) specifies the order in which to apply the rank.
Multiple Windows With Partition
Often we want a separate ordering for different parts of the table. This is what the partition feature enables. It splits the window function, applying it separately to each specified partition.
For example, let’s find the players with the most gameplays for each platform:
rank() over (partition by platform order by plays desc)
select platform, user_id, count(1) as plays
group by 1, 2
Our partition by platform makes the rank() function give us a separate rank for each platform.
How It All Works
Superficially, window functions are similar to your basic “group by” functionality. However, rather than subdividing tables into exclusive “groups” of rows and collapsing them, window functions can look at arbitrary “windows”, and do so without collapsing the windows into a single row.
Pieces of a Window Function
Dissecting our last example, rank() over (partition by platform order by plays desc), we can pull out three pieces:
- rank() — The function, which aggregates, ranks, or filters the rows in the partition.
- over(...) — The window, which specifies which rows the function applies to.
- partition by platform — Which subset of rows are considered. In this case, all rows with the same platform are in this partition.
- order by plays desc — The order of the rows in the window. This is especially useful for functions like first() or row_number() which depend on ordering.
Finally, the over() window definition can also have a row specifier, which further restricts which rows are in the window. The Cumulative Metrics section above goes into this in more detail.
Window functions are evaluated after the join, group & having clauses, at the same time as other select statements.
That unfortunately means your window functions can’t refer to other fields in the select statement. To do this, you’ll need to wrap the select in a subquery and put your window function in the outer query.
Window Function Availability
Window functions were defined in SQL:2003 and are available in PostgreSQL, SQL Server, Redshift (which suppports a subset of Postgres’s functions) and Oracle (which calls them “analytic functions”).
More Neat Tricks
As you can see, we’re big fans of window functions! Here are some times we’ve used them to great effect:
- Predicting Exponential Growth with SQL, in which they calculate a regression of exponential data.
- Use window functions for time-series percentages, in which they make a time series proportional.
- Generate Series in Redshift and MySQL, in which they replace Redshift’s unfortunate lack of generate_series.
- 4 Ways to Join Only The First Row in SQL, in which (spoiler alert!) one of the ways is to use a window function.
- ASCII Art Charts in the Terminal, in which they auto-scale our ascii charts.