## How To Optimize Lifetime Distinct Counts Using Window Functions

### Calculating Lifetime Metrics

Lifetime metrics are a great way to get a long-term sense of the health of your business. They’re particularly useful for comparing how healthy one segment is against the others over the long term.

For example, here’s a fictional graph of lifetime game players by platform:

We can see that iOS is our fastest-growing platform, quickly surpassing the others in lifetime players despite being launched months later!

However, metrics like these can be incredibly slow, since they’re typically calculated with distinct counts over asymmetric joins. In this post, we’ll show you how to calculate these metrics in milliseconds instead of minutes.

### Self-Joining

The easiest way to write the query is to join gameplays onto itself. The first copy of gameplays, aliased as dates, provides the dates that will go on our X axis. The second copy, aliased as plays, supplies every gameplay that happened on or before each date.

The SQL is nice and simple:
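The original query isn’t shown here, so here’s a sketch of what such a self-join typically looks like. It assumes a gameplays table with user_id, platform, and created_at columns (the column names are my assumption):

```sql
-- Assumed schema: gameplays(user_id, platform, created_at)
-- Asymmetric self-join: for each date, count every user who played on or before it
select
  date(dates.created_at) as dt,
  plays.platform,
  count(distinct plays.user_id) as lifetime_players
from gameplays dates
join gameplays plays
  on date(plays.created_at) <= date(dates.created_at)
group by 1, 2
order by 1, 2
```

The `<=` join condition is what makes this asymmetric: each row of dates matches a growing prefix of plays, so the joined table blows up quadratically.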

This query has miserable performance: 37 minutes on a 10,000-row table!

To see why, let’s look at our explain:

By asymmetrically joining gameplays to itself, we’re bogging down in an O(n²) nested loop. Then we’re sorting the whole blown-out table!

Fortunately, there’s a better way.

### Generate Series

Astute readers will notice that we’re only using the first gameplays table in the join for its dates. It conveniently has the exact dates we need for our graph, the dates of every gameplay, but we can select the min and max dates separately and generate the rest with generate_series.

The resulting SQL is still pretty simple:
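Again the exact query isn’t reproduced, but under the same assumed schema it would look roughly like this, with generate_series standing in for the first copy of gameplays:

```sql
-- Replace the dates side of the join with one row per day
select
  dt,
  plays.platform,
  count(distinct plays.user_id) as lifetime_players
from generate_series(
  (select min(created_at) from gameplays)::date,
  (select max(created_at) from gameplays)::date,
  '1 day'
) as dt
join gameplays plays
  on plays.created_at::date <= dt
group by 1, 2
order by 1, 2
```

The join is still asymmetric, but its left side now has one row per day instead of one row per gameplay.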

This puppy runs in 26 seconds, achieving an 86X speedup over the previous query!

Here’s why:

Don’t be fooled by the scans at the top. Those are just to get the single min and max values for generate_series.

Notice that instead of nested-looping over gameplays joined to gameplays, we are joining gameplays to a Postgres set-returning function, in this case generate_series. Since the series has only one row per date, the result is a lot faster to loop over and sort.

But we can still do a lot better.

### Window Functions To The Rescue

Every user started playing on a particular platform on one day. Once they started, they count forever. So let’s start with the first gameplay for each user on each platform:
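Sketching that step under the same assumed schema, a simple group-by gets each user’s first gameplay date per platform:

```sql
-- Each user's first gameplay date on each platform
select
  user_id,
  platform,
  min(created_at)::date as dt
from gameplays
group by 1, 2
```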

Now that we have that, we can aggregate it into how many first gameplays there were on each platform on each day:
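Wrapping the previous step in a with clause, the aggregation might look like:

```sql
with first_gameplays as (
  -- Each user's first gameplay date on each platform
  select
    user_id,
    platform,
    min(created_at)::date as dt
  from gameplays
  group by 1, 2
)
-- Number of first-time players per platform per day
select
  dt,
  platform,
  count(1) as first_plays
from first_gameplays
group by 1, 2
```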

We’ve included each user’s first gameplay per platform in a `with` clause, and then simply counted the number of users for each date and platform.

Now that we have the data organized this way, we can simply sum the values for each platform over time to get the lifetime numbers! Because we need each date’s sum to include previous dates but not new dates, and we need the sums to be separate by platform, we’ll use a window function. Here’s the full query:
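Assembling the pieces above, and keeping the daily_first_gameplays name the explain below refers to, the full query would look something like this:

```sql
with daily_first_gameplays as (
  select
    dt,
    platform,
    count(1) as first_plays
  from (
    -- Each user's first gameplay date on each platform
    select
      user_id,
      platform,
      min(created_at)::date as dt
    from gameplays
    group by 1, 2
  ) first_gameplays
  group by 1, 2
)
select
  dt,
  platform,
  -- Rolling sum per platform: all dates up to and including this one
  sum(first_plays) over (
    partition by platform
    order by dt
    rows between unbounded preceding and current row
  ) as lifetime_players
from daily_first_gameplays
order by 1, 2
```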

The window function is worth a closer look:
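Isolated from the rest of the query, it reads:

```sql
sum(first_plays) over (
  partition by platform
  order by dt
  rows between unbounded preceding and current row
)
```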

`partition by platform` means we want a separate sum for each platform. `order by dt` specifies the order of the rows, in this case by date. The order matters because `rows between unbounded preceding and current row` specifies that each row’s sum will include all previous rows (i.e. all previous dates), but no future rows. Thus it becomes a rolling sum, which is exactly what we want.

With the full query in hand, let’s look at the explain:

No joins at all! And when we’re sorting, it’s only over the relatively small daily_first_gameplays `with` clause, and the final aggregated result.

As a result, this version runs in 28 milliseconds, a whopping 928X speedup over the previous version, and an unbelievable 79,000X speedup over the original self-join. Window functions to the rescue!

Thank you