Periscope Data
The world’s fastest visualization platform for data analysts.

Cohorted Engagement with Comparable Time Windows in SQL

June 16, 2016

Cohort analysis is a powerful tool for evaluating the behavior of groups of users over time. One of the most common ways to cohort users is by when they started using your product or playing your game.

Here’s a graph of how many games the average user plays for a fictitious game company, cohorted by the month the user started playing:

select
  [users.created_at:month]
  , count(distinct gameplays.id) / count(distinct users.id)
from
  [users+gameplays]
group by
  1

This chart makes it look like the game is getting dramatically less engaging over time - but it’s simply not true! Earlier users have had more time to play the game, so their average games per user is much higher.

To correct for this bias, we can introduce a time bound on each user’s gameplays. This way all users will have the same amount of time to play the game in this chart, and simply being an earlier user won’t bias the data.

Avoiding Overcounting

We’ll add a 45-day window to the where clause. This ensures that the average games per user will only ever be calculated off the first 45-days of each user’s lifetime:

select
  [users.created_at:month]
  , count(distinct gameplays.id) / count(distinct users.id)
from
  [users+gameplays]
where
  gameplays.created_at <= users.created_at + interval '45 days'
group by
  1

December through April look good, but something is wrong with May and June due to a second bias.

Avoiding Undercounting

Like our earlier users having more time to play the game, our newest users have had less time. These users haven’t had 45 days to play the game, so their average games per user is unfairly lower.

To correct for that, we can add a second where clause to exclude users created in the last 45 days, ensuring that all users in the metric have had exactly 45 days of play time.

select
  [users.created_at:month]
  , count(distinct gameplays.id) / count(distinct users.id)
from
  [users+gameplays]
where
  gameplays.created_at <= users.created_at + interval '45 days'
  and users.created_at < getdate() - interval '45 days'
group by
  1

And finally we have a correct gameplays per user chart. Users are chorted by the month they started and exactly 45 days of gameplays counted for each of them.

To further this analysis we could start segmenting the data to see if certain dimensions were far less stable, or experiment with different cohorting windows (daily, weekly, yearly) and time bounds to achieve the right granularity.

Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?