Periscope Data
The world’s fastest visualization platform for data analysts.

Getting the First Row per Group 5X Faster

May 20, 2015

Previously we’ve written about selecting and joining the first row in each group in the context of analysis queries. While it’s OK for analysis queries to take a few minutes — or a few seconds if you use Periscope’s cache — production queries need to run in tens of milliseconds.

At Periscope we have a job queuing system that needs to know the next job per customer. That means we need to get the first row per customer_id, really fast. Our fastest query uses distinct on with Postgres’s ordered index.

The Wrong Way: Subselects

Our job queuing system needs the highest-priority job per customer. When two jobs have the same priority, we pick the earliest sorted by created_at. We can get close with a subselect:

select *
from jobs
where id in (
  select min(id)
  from jobs
  group by customer_id

While fast, this job list is incorrect. Not only does it ignore priority, it assumes id and created_at will always increase monotonically together.

The Slow Way: Window Functions

To get around these issues, we could use a window function. It gives us the ability to sort the rows by priority and created_at for each customer. Once they’re sorted, we filter down to first row per customer to get the jobs list:

select *
from jobs
where id in (
  select id
  from (
      row_number() over (partition by customer_id
        order by priority desc, created_at) as row_num
    from jobs
  ) as ordered_jobs
  where row_num = 1

On a jobs table with 50,000 rows this takes 351 ms, and here’s why:

This query does a full table scan every time! It needs to group, sort, and then filter all the rows in the entire table. This query is too slow to use in production, so let’s speed it up.

The Fast Way: Distinct On and Ordered Indexes

The distinct on clause in Postgres allows us to select complete rows while specifying which columns to use when distincting. For this query, we want distinct rows by customer_id. When using distinct on, we specify sort order of rows so that we can keep the first distinct row of each group:

select distinct on (customer_id) *
from jobs
order by customer_id, priority desc, created_at

This simple query yields the same results as the large window functionpreviously, but it’s just as slow since it also needs a full table scan:

To speed it up, we’ll use an ordered index. When creating an index in Postgres you can specify the order of each column in the index. By default, indexes are ordered ascending. For this query, we need priority descending and created_at ascending, per customer.

The create statement for this index looks like any other, with the addition of desc after priority:

create index jobs_per_customer_index on
jobs (customer_id, priority desc, created_at)

Notice that the order by in our query and the columns list in the index are identical. With the index, the query runs a lot faster because it skips the table scan:

On that same table this indexed query takes 65 milliseconds, over 5 times faster! Now that this query is able to pluck the right rows and ignore everything else, it’s ready for production.

Bonus Round: Clustering

On disk, table row data is typically ordered by insertion. In this example, we have a relatively expensive query hitting this table quite often. We can make that query less expensive by re-sorting the table on disk to be organized like our jobs_per_customer_index.

To do this in Postgres, use the cluster command:

cluster jobs using jobs_per_customer_index;

With the freshly clustered jobs table, the 65 millisecond query is down to 43 milliseconds!

Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?