Calculating Lifetime Metrics
Lifetime metrics are a great way to get a long-term sense of the health of a business. They’re particularly useful for judging the health of one segment by comparing it to others over the long term.
For example, here’s a fictional graph of lifetime game players by platform:
We can see that iOS is our fastest-growing platform, quickly surpassing the others in lifetime players despite being launched months later!
However, metrics like these can be incredibly slow, since they’re typically calculated with distinct counts over asymmetric joins. In this post, we’ll show you how to calculate these metrics in milliseconds instead of minutes.
The easiest way to write the query is to join gameplays onto itself. The first gameplays, which we name dates, provides the dates that will go on our X axis. The second gameplays, called plays, supplies every gameplay that happened on or before each date.
The SQL is nice and simple:
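Here’s a sketch of that self-join, assuming a gameplays table with user_id, platform, and created_at columns (the column names are illustrative):

```sql
-- Every (date, platform) pair, counting distinct players
-- whose gameplays fall on or before that date.
select
  dates.created_at as dt,
  plays.platform,
  count(distinct plays.user_id) as lifetime_players
from gameplays dates
join gameplays plays
  on plays.created_at <= dates.created_at
group by 1, 2
order by 1, 2;
```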
This query has miserable performance: 37 minutes on a 10,000-row table!
To see why, let’s look at our explain:
By asymmetrically joining gameplays to itself, we’re bogged down in an O(n²) nested loop. Then we’re sorting the whole blown-out table!
Fortunately, there’s a better way.
Astute readers will notice that we’re only using the first gameplays table in the join for its dates. It conveniently has the exact dates we need for our graph — the dates of every gameplay — but we can select those separately and then drop them into a generate_series.
The resulting SQL is still pretty simple:
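A sketch of that version, again assuming gameplays(user_id, platform, created_at) — the subselects feed the table’s min and max dates into generate_series:

```sql
-- One row per day from generate_series replaces the blown-out
-- dates side of the self-join.
select
  dt,
  plays.platform,
  count(distinct plays.user_id) as lifetime_players
from generate_series(
    (select min(created_at) from gameplays),
    (select max(created_at) from gameplays),
    interval '1 day'
  ) as dt
join gameplays plays
  on plays.created_at <= dt
group by 1, 2
order by 1, 2;
```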
This puppy runs in 26 seconds, achieving an 86X speedup over the previous query!
Don’t be fooled by the scans at the top. Those are just to get the single min and max dates for generate_series.
Notice that instead of joining and nested-looping over gameplays multiplied by gameplays, we are joining with gameplays and a Postgres function, in this case generate_series. Since the series has only one row per date, the result is a lot faster to loop over and sort.
But we can still do a lot better.
Window Functions To The Rescue
Every user started playing on a particular platform on one day. Once they started, they count forever. So let’s start with the first gameplay for each user on each platform:
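With the same assumed schema, each user’s first gameplay per platform is a simple group-by:

```sql
-- One row per (user, platform): the date they first played.
select
  user_id,
  platform,
  min(created_at) as first_gameplay
from gameplays
group by user_id, platform;
```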
Now that we have that, we can aggregate it into how many first gameplays there were on each platform on each day:
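Rolling that up by day might look like this — the with clause holds the per-user first gameplays, and the outer query counts them per date and platform:

```sql
with first_gameplays as (
  select
    user_id,
    platform,
    min(created_at) as first_gameplay
  from gameplays
  group by user_id, platform
)
-- One row per (date, platform): how many users started that day.
select
  first_gameplay as dt,
  platform,
  count(*) as first_plays
from first_gameplays
group by 1, 2;
```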
We’ve included each user’s first gameplay per platform in a with clause, and then simply counted the number of users for each date and platform.
Now that we have the data organized this way, we can simply sum the values for each platform over time to get the lifetime numbers! Because we need each date’s sum to include previous dates but not new dates, and we need the sums to be separate by platform, we’ll use a window function. Here’s the full query:
The window function is worth a closer look:
partition by platform means we want a separate sum for each platform.
order by dt specifies the order of the rows, in this case by date. That’s important because the frame in the next clause is defined relative to this order.
rows between unbounded preceding and current row specifies that each row’s sum will include all previous rows (e.g. all previous dates), but no future rows. Thus it becomes a rolling sum, which is exactly what we want.
With the full query in hand, let’s look at the explain:
No joins at all! And when we’re sorting, it’s only over the relatively small daily_first_gameplays with-clause and the final aggregated result.
As a result, this version runs in 28 milliseconds, a whopping 928X speedup over the previous version, and an unbelievable 79,000X speedup over the original self-join. Window functions to the rescue!