Periscope Data
The world’s fastest visualization platform for data analysts.

The High-Performance SQL Blog

How To Calculate Cohort Retention in SQL

June 4, 2014

‍Losing users sucks. Losing customers really sucks. If you’re a startup you know that Retention is King. You should always be measuring and improving your user retention, so you can keep more users over time. In this post we’ll show you how to calculate user retention on your own data in SQL.

Defining retention

If Gloria used the product on Monday, and used the product again on Tuesday, she is a retained user. If Bill used the product on Monday, but not on Tuesday, he is a lapsed user. Retention for Monday is the number of retained users divided by the number of total users. If Gloria and Bill were the only two users on Monday, then retention for Monday is 50%.

Calculating basic user retention

The key to calculating retention is counting users who were active at time #1, then counting how many were active at time #2. An easy way to do this in SQL is to left join your user activity table to itself like so:

select * 
from activity
left join activity as future_activity on
  activity.user_id = future_activity.user_id
  and activity.date = future_activity.date - interval '1 day'

Now, for every row of user activity, we have — in that same row — their activity 1 day in the future. This gives us an ideal table for calculating retention with some simple counts:

select 
  activity.date, 
  count(distinct activity.user_id) as active_users, 
  count(distinct future_activity.user_id) as retained_users,
  count(distinct future_activity.user_id) / 
    count(distinct activity.user_id)::float as retention
from activity
left join activity as future_activity on
  activity.user_id = future_activity.user_id
  and activity.date = future_activity.date - interval '1 day'
group by 1

In our data analysis tool we get this chart:

For extra credit, change the 1-day retention to 7-day or 30-day to capture a sense of longer-term user engagement.

Calculating retention of new vs. existing users

Often retention is quite different for users who just signed up as compared to loyal long-term users. To calculate new-user retention, simply join in your users table and only look at activity rows that occurred on the user’s join date:

select 
  users.date as date,
  count(distinct activity.user_id) as new_users, 
  count(distinct future_activity.user_id) as retained_users,
  count(distinct future_activity.user_id) / 
    count(distinct activity.user_id)::float as retention
from activity
-- Limits activity to activity from new users
join users on
  activity.user_id = users.id 
  and users.date = activity.date
left join activity as future_activity on
  activity.user_id = future_activity.user_id
  and activity.date = future_activity.date - interval '1 day'
group by 1

We see that while overall retention is 46%, new user retention is only 5.8%! Now we see why it’s so useful to split out new users. Improving new-user retention should clearly be a priority.

To look at returning-user retention, simply change:

users.date = activity.date 

to:

users.date != activity.date

This effectively excludes activity from users who joined that day. Our query now looks like:

select 
  activity.date as date,
  count(distinct activity.user_id) as new_users, 
  count(distinct future_activity.user_id) as retained_users,
  count(distinct future_activity.user_id) / 
    count(distinct activity.user_id)::float as retention
from activity
-- Limits activity to activity from existing users
join users on 
  activity.user_id = users.id 
  and users.date != activity.date
left join activity as future_activity on
  activity.user_id = future_activity.user_id
  and activity.date = future_activity.date - interval '1 day'
group by 1

As expected, existing-user retention is higher than the overall average: 66% vs. 46%.

Calculating retention in cohorts

It can be very helpful to compare the retention of users who joined in week A with those who joined in week B. This lets us see if our product changes are improving our retention rate. Ideally we’d end up with a chart like this:

We’ll start by defining a few handy subqueries to simplify the problem. new_user_activity restricts user activity to new users:

with new_user_activity as (
  select activity.* from activity
  join users on
    users.id = activity.user_id
    and users.date = activity.date
)

cohort_active_user_count calculates the total number of active users — the denominator in our retention calculation — in each daily cohort:

, cohort_active_user_count as (
  select 
    date, count(distinct user_id) as count 
  from new_user_activity
  group by 1
)

On top of that, we’ll make a few smaller changes to the main query:

Calculate the retention period — the number days retained — as  future_activity.date - new_user_activity.date and group by it.With this grouping, we lose the simple count of active users in the cohort. Fortunately we thought of this and made our cohort_active_user_count subquery, which we can join in and use as the denominator.Finally, we’ll wrap the query in an outer query that makes nice-looking output, excludes bogus rows, and sorts.

Without further ado:

select date, 'Day ' || to_char(period, 'DD') as period,
  new_users, retained_users, retention 
from (
  select 
    new_user_activity.date as date,
    (future_activity.date 
      - new_user_activity.date) as period,
    max(cohort_size.count) as new_users, -- all equal in group
    count(distinct future_activity.user_id) as retained_users,
    count(distinct future_activity.user_id) / 
      max(cohort_size.count)::float as retention
  from new_user_activity
  left join activity as future_activity on
    new_user_activity.user_id = future_activity.user_id
    and new_user_activity.date < future_activity.date
    and (new_user_activity.date + interval '10 days')
      >= future_activity.date
  left join cohort_active_user_count as cohort_size on 
    new_user_activity.date = cohort_size.date 
  group by 1, 2) t
where period is not null
order by date, period

Notice also the range join, one of our favorite SQL tricks, to get multiple days of retention in one chart.

This results in the table:

With Periscope, we can automatically pivot the result then color by percentile:

Making retention more specific

Remember how different new-user and existing-user retention were? You can see the same variations across lots of user segments. It’s worth breaking out retention by:

DemographicsUser acquision channelPaying vs. non-paying usersDifferent kinds of activity — viewing, creating, buying, etc.

Or click the ‘Retention’ button

Finally, if you sign up for a free trial of Periscope, we’ll do the heavy-lifting for you! Periscope is a tool for easily creating charts and dashboards from your SQL database. You can use it to calculate lots of metrics like retention, and share them with your teammates, customers, developers and more!


Want to discuss this article? Join the Periscope Data Community!
Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?

Subscribe to our Newsletter