Previously we’ve written about <a href="https://www.sisense.com/blog/selecting-only-one-row-per-group/">selecting</a> and <a href="https://www.sisense.com/blog/4-ways-to-join-only-the-first-row-in-sql/">joining the first row</a> in each group in the context of analysis queries. While it’s OK for analysis queries to take a few minutes — or a few seconds if you use a cache — production queries need to run in tens of milliseconds.



We have a job queuing system that needs to know the next job per customer. That means we need to get the first row per customer_id really fast. Our fastest query combines distinct on with an ordered index in Postgres.
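
For concreteness, the queries in this post can be read against a table shaped roughly like the one below. This is only an illustrative sketch: the column types, the payload column, and the convention that a larger priority value means a more urgent job are assumptions, not our actual production schema.

<pre class="wp-block-code"><code>-- Hypothetical schema, for illustration only.
create table jobs (
  id          bigserial primary key,
  customer_id bigint      not null,
  priority    integer     not null,  -- assumed: larger value = more urgent
  created_at  timestamptz not null default now(),
  payload     text                   -- hypothetical placeholder for job data
);</code></pre>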



<h2 class="wp-block-heading">The Wrong Way: Subselects</h2>



Our job queuing system needs the highest-priority job per customer. When two jobs have the same priority, we pick the earliest sorted by created_at. We can get close with a subselect:



<pre class="wp-block-code"><code>select *
from jobs
where id in (
 select min(id)
 from jobs
 group by customer_id
)</code></pre>



While fast, this job list is incorrect. Not only does it ignore priority, it assumes id and created_at will always increase <a href="https://en.wikipedia.org/wiki/Monotonic_function" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">monotonically</a> together.
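
A tiny example makes the priority problem visible. The sketch below uses a common table expression named jobs with two made-up rows for a single customer, so it runs without touching a real table. The ids, customer number, and timestamps are invented for illustration.

<pre class="wp-block-code"><code>-- Two hypothetical jobs for customer 42: a low-priority job inserted
-- first, and a high-priority job inserted five minutes later.
with jobs (id, customer_id, priority, created_at) as (
  values
    (1, 42, 1, timestamp '2024-01-01 09:00'),
    (2, 42, 5, timestamp '2024-01-01 09:05')
)
select *
from jobs
where id in (
  select min(id)
  from jobs
  group by customer_id
);
-- Returns the row with id = 1, even though id = 2 has the higher priority.</code></pre>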



<h2 class="wp-block-heading">The Slow Way: Window Functions</h2>



To get around these issues, we could use a window function. It gives us the ability to sort the rows by priority and created_at for each customer. Once they’re sorted, we filter down to the first row per customer to get the job list:



<pre class="wp-block-code"><code>select *
from jobs
where id in (
  select id
  from (
    select
      id,
      row_number() over (partition by customer_id
                         order by priority desc, created_at) as row_num
    from jobs
  ) as ordered_jobs
  where row_num = 1
)</code></pre>



On a jobs table with 50,000 rows, this takes 351 ms, and here’s why:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Jobs-table-map.png" alt="Jobs table map" class="wp-image-75748"/></figure>



This query does a full table scan every time! It needs to group, sort, and then filter all the rows in the entire table. This query is too slow to use in production, so let’s speed it up.
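
If you want to check this on your own table, one option is to prefix the query with explain analyze, which runs the query and prints the plan Postgres actually executed, with timings. Without a suitable index, we would expect a sequential scan on jobs feeding a sort and a window aggregate step, though the exact plan depends on your Postgres version and table statistics.

<pre class="wp-block-code"><code>-- explain analyze executes the query, so the timings it reports are real.
explain analyze
select *
from jobs
where id in (
  select id
  from (
    select
      id,
      row_number() over (partition by customer_id
                         order by priority desc, created_at) as row_num
    from jobs
  ) as ordered_jobs
  where row_num = 1
);</code></pre>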



<h2 class="wp-block-heading">The Fast Way: Distinct On and Ordered Indexes</h2>



The distinct on clause in Postgres lets us select complete rows while specifying which columns determine distinctness. For this query, we want one row per customer_id. When using distinct on, we also specify the sort order of the rows, so that the first row of each group, the one Postgres keeps, is the job we want:



<pre class="wp-block-code"><code>select distinct on (customer_id) *
from jobs
order by customer_id, priority desc, created_at</code></pre>



This simple query yields the same results as the larger window function query above, but it’s just as slow because it also needs a full table scan:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/window-function-.png" alt="" class="wp-image-75753"/></figure>
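
To make the row selection concrete, here is a small sketch that runs distinct on over four made-up rows supplied by a common table expression named jobs. The ids, customers, and timestamps are invented for illustration. Note that the distinct on expression has to match the leftmost order by expression, which is why customer_id comes first in the sort.

<pre class="wp-block-code"><code>with jobs (id, customer_id, priority, created_at) as (
  values
    (1, 42, 1, timestamp '2024-01-01 09:00'),
    (2, 42, 5, timestamp '2024-01-01 09:05'),
    (3, 42, 5, timestamp '2024-01-01 09:10'),
    (4, 7,  2, timestamp '2024-01-01 09:02')
)
select distinct on (customer_id) *
from jobs
order by customer_id, priority desc, created_at;
-- Customer 7 keeps job 4, its only job.
-- Customer 42 keeps job 2: jobs 2 and 3 tie on priority 5,
-- and job 2 has the earlier created_at.</code></pre>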



To speed it up, we’ll use an ordered index. When creating an index in Postgres you can specify the order of each column in the index. By default, indexes are ordered ascending. For this query, we need priority descending and created_at ascending, per customer.



The create statement for this index looks like any other, with the addition of desc after priority:



<pre class="wp-block-code"><code>create index jobs_per_customer_index on
jobs (customer_id, priority desc, created_at)</code></pre>



Notice that the order by in our query and the column list in the index are identical, including the sort direction of each column. With the index, the query runs a lot faster because it skips the table scan:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Jobs-unique-table.png" alt="Jobs unique table" class="wp-image-75758"/></figure>
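
To confirm that the planner is really using the new index on your own data, you can inspect the plan with explain analyze. What we would hope to see is an index scan on jobs_per_customer_index feeding a unique step, with no separate sort. Treat that as an expectation rather than a guarantee, since the planner's choice depends on table statistics and Postgres version.

<pre class="wp-block-code"><code>-- Look for jobs_per_customer_index in the output and check that
-- no explicit Sort node appears above the scan.
explain analyze
select distinct on (customer_id) *
from jobs
order by customer_id, priority desc, created_at;</code></pre>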



On that same table, the indexed query takes 65 milliseconds, over 5 times faster than the 351 ms window function query! Now that this query is able to pluck the right rows and ignore everything else, it’s ready for production.
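
The same index can also serve a lookup for a single customer, because an equality filter on the leading customer_id column leaves the remaining index columns in exactly the order we sort by. The sketch below is a variation on the query above, not something we benchmarked for this post, and the customer id 42 is a placeholder.

<pre class="wp-block-code"><code>-- Next job for one customer (42 is a hypothetical customer id).
select *
from jobs
where customer_id = 42
order by priority desc, created_at
limit 1;</code></pre>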



<h2 class="wp-block-heading">Bonus Round: Clustering</h2>



On disk, table row data is typically ordered by insertion. In this example, we have a relatively expensive query hitting this table quite often. We can make that query less expensive by re-sorting the table on disk to be organized like our jobs_per_customer_index.



To do this in Postgres, use the cluster command:



<pre class="wp-block-code"><code>cluster jobs using jobs_per_customer_index;</code></pre>



With the freshly clustered jobs table, the 65 millisecond query is down to 43 milliseconds!
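
Two caveats are worth keeping in mind. Clustering is a one-time operation: rows inserted or updated afterwards are not kept in index order, so the table needs to be re-clustered periodically if the benefit is to last. The cluster command also takes an exclusive lock on the table while it runs, which blocks reads and writes. A minimal maintenance sketch, assuming you schedule it during a quiet period:

<pre class="wp-block-code"><code>-- Re-cluster using the index remembered from the earlier cluster command.
cluster jobs;

-- Refresh planner statistics so they reflect the new physical ordering.
analyze jobs;</code></pre>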
