Want to make data analysis fast for everyone?Join Us!
Joining a table to itself
Most sophisticated user analysis looks at repeat usage: How many users come back? How many don’t? How many come back after an extended absence?
To get at these answers, we have to look at what our current users have done in the past. This is tricky in SQL, which doesn’t have explicit ways to bucket users based on past behavior.
The answer is to join tables to themselves. By using a template like
select ... from events past_events join events current_events, you can get users’ past activity and current activity in a single row, making it much easier to reason about both at the same time.
In this post, we’ll take you through a few common metrics — Retention, Churn, and Reactivation — and show you how to build them in SQL.
We’ll start with the most common metric: User retention. Simply put, we want to know how many users were active both last month and this month.
Let’s start by aggregating our
events table into a monthly rollup, like so:
Now that we have this table, we’ll join it to itself. The left-hand-side of the join will represent this month, and the right-hand side will represent last month. We want to make sure that there’s only one row per
user_id, and that row only gets included in the join if the user was active this month and last.
Here’s the query:
Our two join conditions are:
this_month.month = last_month.month + interval '1 month': This sets up how the join works. We want rows that include activity from this month and activity from last month, so we can reason about both time periods in the same statement.
this_month.user_id = last_month.user_id: Ensures one row per
user_id, and importantly, excludes rows where the user wasn’t present both months.
Together, these two conditions give us a table containing only retained users every month, which we simply group and count over!
Now we’ll take retention and turn it on its head: How many users last month did not come back this month?
We’ll use the same self join, except this time, we want to only include rows in last_month that do not have equivalents in this_month. Here’s the query:
We’ve changed our query in two ways:
left join: This includes every row from last month, not just the ones with users who were active this month. This sets up the next step:
where this_month.user_id is null: After the join, we filter out rows where the user was active this month.
This leaves us with a table of users who were active last month but not this month, which once again we can group and count over!
Reactivated Users are users who previously churned, but have now come back. Perhaps they had a renewed need for the product, or perhaps your compelling email retention campaign swayed them.
For reactivated users, we’ll introduce a new table, containing users’ first active months. Here it is:
We’re going to include all active users each month for whom:
- This is not their first month (because then they’d be new).
- They were not active the previous month (because then they’d be retained).
Here’s how we do it:
Similar to our Churn query, we employ a couple things in tandem:
left join: We want every activity from the current month, even if they weren’t active last month.
where last_month.user_id is null: This is the reverse of the trick we used for our Churn query. We want only users who were active this month and not last month.
join first_activity ... on first_activity.month != this_month.month: This clause excludes new users who joined this month.
Combined together, we get users who were active this month, were not active last month, and are not new: Reactivated users!
Let Periscope Optimize Monthly Activity
For reasons we discussed yesterday, self joins can be really slow. One way to optimize them is to sign up for Periscope and set up your
first_activity tables as views in Periscope!
We’ll keep them hot in our cache for you, making the joins really fast. And you won’t even have to specify them in every query. Learn more at our homepage.