Periscope Data
The world’s fastest visualization platform for data analysts.

The High-Performance SQL Blog

How To Optimize Lifetime Distinct Counts Using Window Functions

June 24, 2014

Calculating Lifetime Metrics

Lifetime metrics are a great way to get a long-term sense of the health of a business. They’re particularly useful for seeing how healthy one segment is by comparing it to others over the long term.

For example, here’s a fictional graph of lifetime game players by platform:

We can see that iOS is our fastest-growing platform, quickly surpassing the others in lifetime players despite being launched months later!

However, metrics like these can be incredibly slow, since they’re typically calculated with distinct counts over asymmetric joins. In this post, we’ll show you how to calculate these metrics in milliseconds instead of minutes.
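Throughout this post we’ll query a gameplays table. Its exact schema isn’t shown in the original, but the queries below imply a shape like this (a minimal sketch; column types are our assumption):

```sql
-- Assumed shape of the gameplays table: one row per game play,
-- with the player, their platform, and when the play happened.
create table gameplays (
  user_id    integer,
  platform   varchar(255),
  created_at timestamp
);
```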

Self-Joining

The easiest way to write the query is to join gameplays onto itself. The first copy of gameplays, aliased as dates, provides the dates that will go on our X axis. The second copy, aliased as plays, supplies every gameplay that happened on or before each date.

The SQL is nice and simple:

select
  date(dates.created_at),
  plays.platform,
  count(distinct plays.user_id)
from gameplays dates
join gameplays plays
  on plays.created_at <= dates.created_at
group by 1, 2

This query has miserable performance: 37 minutes on a 10,000-row table!

To see why, let’s look at our explain:

By asymmetrically joining gameplays to itself, we’re bogging down in an O(n²) nested loop — on a 10,000-row table, that’s on the order of 50 million intermediate rows. Then we’re sorting the whole blown-out table!

Fortunately, there’s a better way.

Generate Series

Astute readers will notice that we’re only using the first gameplays table in the join for its dates. It conveniently has the exact dates we need for our graph — the dates of every gameplay — but we can select those separately and then drop them into a generate_series.

The resulting SQL is still pretty simple:

select d, platform, count(distinct user_id)
from generate_series(
  (select date(min(created_at)) from gameplays),
  (select date(max(created_at)) from gameplays),
  interval '1 day'
) d
join gameplays on created_at <= d
group by 1, 2

This puppy runs in 26 seconds, achieving an 86X speedup over the previous query!

Here’s why:

Don’t be fooled by the scans at the top. Those are just to get the single min and max values for generate_series.

Notice that instead of joining and nested-looping over gameplays multiplied by gameplays, we are joining with gameplays and a Postgres function, in this case generate_series. Since the series has only one row per date, the result is a lot faster to loop over and sort.
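For a sense of just how small that series is, here’s a standalone three-day example (purely illustrative, not part of the optimization itself):

```sql
-- generate_series emits one row per day between the bounds, inclusive.
select d
from generate_series(
  '2014-06-01'::date,
  '2014-06-03'::date,
  interval '1 day'
) d;
-- 2014-06-01 00:00:00
-- 2014-06-02 00:00:00
-- 2014-06-03 00:00:00
```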

But we can still do a lot better.

Window Functions To The Rescue

Every user started playing on a particular platform on one day. Once they started, they count forever. So let’s start with the first gameplay for each user on each platform:

select date(min(created_at)) dt, platform, user_id
from gameplays
group by platform, user_id

Now that we have that, we can aggregate it into how many first gameplays there were on each platform on each day:

with first_gameplays as (
  select date(min(created_at)) dt, platform, user_id
  from gameplays
  group by platform, user_id
)
select dt, platform, count(1) user_ct
from first_gameplays
group by dt, platform

We’ve included each user’s first gameplay per platform in a with clause, and then simply counted the number of users for each date and platform.

Now that we have the data organized this way, we can simply sum the values for each platform over time to get the lifetime numbers! Because we need each date’s sum to include previous dates but not new dates, and we need the sums to be separate by platform, we’ll use a window function. Here’s the full query:

with
first_gameplays as (
  select date(min(created_at)) dt, platform, user_id
  from gameplays
  group by platform, user_id
),
daily_first_gameplays as (
  select dt, platform, count(1) user_ct
  from first_gameplays
  group by dt, platform
)
select
  dt,
  platform,
  sum(user_ct) over (
    partition by platform 
    order by dt 
    rows between unbounded preceding and current row
  ) users
from daily_first_gameplays

The window function is worth a closer look:

sum(user_ct) over (
  partition by platform 
  order by dt 
  rows between unbounded preceding and current row
)

partition by platform means we want a separate sum for each platform. order by dt specifies the order of the rows, in this case by date. That’s important because rows between unbounded preceding and current row specifies that each row’s sum will include all previous rows (i.e. all previous dates), but no future rows. Thus it becomes a rolling sum, which is exactly what we want.
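To see the frame in action, here’s a toy run over a few hand-written rows (the counts are made up for illustration):

```sql
-- Each row's sum covers its own partition from the first date up to
-- that row, so the iOS rows accumulate to 2, 5, 6 and the lone web row stays 4.
select dt, platform,
  sum(user_ct) over (
    partition by platform
    order by dt
    rows between unbounded preceding and current row
  ) users
from (values
  (date '2014-06-01', 'iOS', 2),
  (date '2014-06-02', 'iOS', 3),
  (date '2014-06-03', 'iOS', 1),
  (date '2014-06-01', 'web', 4)
) as daily (dt, platform, user_ct);
```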

With the full query in hand, let’s look at the explain:

No joins at all! And when we’re sorting, it’s only over the relatively small daily_first_gameplays with-clause, and the final aggregated result.

As a result, this version runs in 28 milliseconds, a whopping 928X speedup over the previous version, and an unbelievable 79,000X speedup over the original self-join. Window functions to the rescue!

