Want to make data analysis fast for everyone?Join Us!
For today’s daily report, we need a list of users and the most recent widget each user has created. We have a users table and a widgets table, and each user has many widgets.
users.id is the primary key on
widgets.user_id is the corresponding foreign key in
To solve this problem, we need to join only the first row. There are several ways to do this. Here are a few different techniques and when to use them.
Use Correlated Subqueries when the foreign key is indexed
Correlated subqueries are subqueries that depend on the outer query. It’s like a for loop in SQL. The subquery will run once for each row in the outer query:
where widgets.user_id = users.id clause in the subquery. It queries the
widgets table once for each
user row and selects that user’s most recent
widget row. It’s very efficient if
user_id is indexed and there are few users.
Use a Complete Subquery when you don’t have indexes
Correlated subqueries break down when the foreign key isn’t indexed, because each subquery will require a full table scan.
In that case, we can speed things up by rewriting the query to use a single subquery, only scanning the
widgets table once:
This new subquery returns a list of the most recent widgets, one for each user. We then join it to the
users table to get our list.
We’ve used Postgres’
DISTINCT ON syntax to easily query for only one widget per
user_id. If your database doesn’t support something like
DISTINCT ON, you have two options:
Use Nested Subqueries if you have an ordered ID column
In our example, the most recent row always has the highest
id value. This means that even without
DISTINCT ON, we can cheat with our nested subqueries like this:
We start by selecting the list of IDs repreenting the most recent widget per user. Then we filter the main
widgets table to those IDs. This gets us the same result as
DISTINCT ON since sorting by
created_at happen to be equivalent.
Use Window Functions if you need more control
If your table doesn’t have an
id column, or you can’t depend on its min or max to be the most recent row, use
row_number with a window function. It’s a little more complicated, but a lot more flexible:
The interesting part is here:
over (partition by user_id order by created_at desc specifies a sub-table, called a window, per user_id, and sorts those windows by
row_number() returns a row’s position within its window. Thus the first widget for each
user_id will have
In the outer subquery, we select only the rows with a
row_number of 1. With a similar query, you could get the 2nd or 3rd or 10th rows instead.
In a future post we’ll go deeper on window functions and how they can make queries like this one even more powerful!