<h2 class="wp-block-heading">Calculating Lifetime Metrics</h2>



Lifetime metrics are a great way to get a long term sense of the health of a business. They’re particularly useful for seeing how healthy a segment is by comparing it to others over the long term.



For example, here’s a fictional graph of lifetime game players by platform:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Lifetime-Players-by-Platform.png" alt="Lifetime players by platform" class="wp-image-75905"/></figure>



We can see that iOS is our fastest growing platform, quickly surpassing the others in lifetime players despite being launched months later!



However, metrics like these can be incredibly slow, since they’re typically calculated with distinct counts over asymmetric joins. In this post, we’ll show you how to calculate these metrics in milliseconds instead of minutes.



<h2 class="wp-block-heading">Self-Joining</h2>



The easiest way to write the query is to join gameplays onto itself. The first gameplays, which we name dates, provides the dates that will go on our X axis. The second gameplays, called plays, supplies every gameplay that happened on or before each date.



The SQL is nice and simple:



<pre class="wp-block-code"><code>select 
 date(dates.created_at), 
 plays.platform, 
 count(distinct plays.user_id)
from gameplays dates join gameplays plays
group by 1, 2</code></pre>



This query has miserable performance: 37 minutes on a 10,000-row table!



To see why, let’s look at our explain:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/logs10-map.png" alt="Logs 10" class="wp-image-75910"/></figure>



By asymmetrically joining gameplays to itself, we’re bogging down in an n2 loop. Then we’re sorting the whole blown out table!



Fortunately, there’s a better way.



<h2 class="wp-block-heading">Generate Series</h2>



Astute readers will notice that we’re only using the first gameplays table in the join for its dates. It conveniently has the exact dates we need for our graph — the dates of every gameplay — but we can select those separately and then drop them into a <a href="https://www.sisense.com/blog/use-generate-series-to-get-continuous-results/">generate_series</a>.



The resulting SQL is still pretty simple:



<pre class="wp-block-code"><code>select d, platform, count(distinct user_id)
from generate_series(
 (select date(min(created_at)) from gameplays),
 (select date(max(created_at)) from gameplays),
 '1 day'
)
join gameplays on created_at &lt;= d
group by 1, 2</code></pre>



This puppy runs in 26 seconds, achieving an 86X speedup over the previous query!



Here’s why:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/logs10-table-2.png" alt="Logs 10 Table 2" class="wp-image-75915"/></figure>



Don’t be fooled by the scans at the top. Those are just to get the single min and max values for generate_series.



Notice that instead of joining and nested-looping over gameplays multiplied by gameplays, we are joining with gameplays and a Postgres function, in this case, generate_series. Since the series has only one row per date, the result is a lot faster to loop over and sort.



But we can still do a lot better.



<h2 class="wp-block-heading">Window Functions To The Rescue</h2>



Every user started playing on a particular platform on one day. Once they started, they count forever. So let’s start with the first gameplay for each user on each platform:



<pre class="wp-block-code"><code>select date(min(created_at)) dt, platform, user_id
from gameplays
group by platform, user_id</code></pre>



Now that we have that, we can aggregate it into how many first gameplays there were on each platform on each day:



<pre class="wp-block-code"><code>with first_gameplays as (
 select date(min(created_at)) dt, platform, user_id
 from gameplays
 group by platform, user_id
)
select dt, platform, count(1) user_ct
from first_gameplays
group by dt, platform</code></pre>



We’ve included each user’s first gameplay per platform in a with clause, and then simply counted the number of users for each date and platform.



Now that we have the data organized this way, we can simply sum the values for each platform over time to get the lifetime numbers! Because we need each date’s sum to include previous dates but not new dates, and we need the sums to be separated by platform, we’ll use a window function. Here’s the full query:



<pre class="wp-block-code"><code>with
first_gameplays as (
 select date(min(created_at)) dt, platform, user_id
 from gameplays
 group by platform, user_id
),
daily_first_gameplays as (
 select dt, platform, count(1) user_ct
 from first_gameplays
 group by dt, platform
)
select
 dt,
 platform,
 sum(user_ct) over (
 partition by platform 
 order by dt 
 rows between unbounded preceding and current row
 ) users
from daily_first_gameplays</code></pre>



The window function is worth a closer look:



<pre class="wp-block-code"><code>sum(user_ct) over (
 partition by platform 
 order by dt 
 rows between unbounded preceding and current row
)</code></pre>



partition by platform means we want a separate sum for each platform. order by dt specifies the order of the rows, in this case, by date. That’s important because rows between unbounded preceding and current row specifies that each row’s sum will include all previous rows (e.g. all previous dates), but no future rows. Thus it becomes a rolling sum, which is exactly what we want.



With the full query in hand, let’s look at the explain:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Logs10-table-3.png" alt="" class="wp-image-75920"/></figure>



No joins at all! And when we’re sorting, it’s only over the relatively small daily_first_gameplays with-clause, and the final aggregated result.



As a result, this version runs in 28 milliseconds, a whopping 928X speedup over the previous version, and an unbelievable 79,000X speedup over the original self-join. Window functions to the rescue!

How To Optimize Lifetime Distinct Counts Using Window Functions

LinkedIn

Twitter

GitHub

curve-image-unique-image-unique

curve

3-dark-2-image-unique-image-unique

3 DARK 2

Get the latest in analytics right in your inbox.

Article