Want to make data analysis fast for everyone?Join Us!
We recently launched the instant-classic underwater warfare game Scope: Battle for the High Seas, and our investors are excited to see how the game’s subscriptions are growing. We are using a payment processor to handle collections of subscriptions and have access to a few tables. While this makes analyzing the data possible, it is initially not in the best format to do so. Let’s investigate how we can normalize our date ranges to compute the growth of subscriptions since our game launched.
Starting from the raw data
Our primary table is a subscription table that has the company id, plan id, and the start and end dates of each plan:
This data structure allows us to easily analyze some date slices of our data, for example we can look at the number of plans started on a given date:
But it doesn’t allow us to see inside of our start and end ranges, such as how many plans were active on a specific date. In order to do this, we must first get our data into a format friendly for aggregating across our ranges.
Getting the Relevant Dates
We first want to generate date series for each of our ranges, and we can reference these posts for best practices on doing so: Generate Series in Redshift and MySQL and Use generate_series to get continuous results
In our case we will opt for Redshift syntax and use a window function to get our date series. Since we only want a limited series of data, we can preform a check using a nested select to make sure that our dates only cover our periods with a plan.
Accounting for Null Dates
In our subscriptions table we currently have the dates where they have started and ended.
We like to pre-process our data to ensure data consistency when other analysts re-use our work.
To ensure that active subscriptions are included in our analysis, we’ll want to give our end dates a value as well.
We can accomplish this by using a
coalesce statement to evaluate the
plan_end value and inserting the current date in the event that it is null:
Joining on Inequalities
Now that we have our cleaned subscription data in one temp table and our relevant dates in another, we can join them together. We typically join two tables by looking for equal attributes, but we are certainly not limited to the equality operator. For our analysis we want our result set to have a set of rows for every date that a plan was active.
To do this, we will left join
cleaned_subs where the
plan_date is greater than or equal to the
plan_date is less than or equal to the
Now that we’ve got our data merged together, we are able to see a record in the result set tying every subscription
id to every day it was active.
Aggregating Our Data
Now that our data is joined across date ranges, we can perform the next step in our analysis and aggregate the data!
We will use the
count function to determine the number of each type of plan that we have by day.
With that, we are able to see the growth of plans over time:
This is great information, but won’t tell the full story to our investors.
What we really want to display is how
Scope is growing and how its users are signing up and upgrading over time.
To get this picture of our data, we can join our aggregated data with our plan data to understand the value within each plan:
Now we are able to see that our users are in fact moving from our Starter plan to a Pro Plan over time and that the growth in revenue is promising!
Pushing joins beyond their normal use cases opens the door to analyze your data beyond its initial state. Let us know how else you take advantage of joins by tweeting us @Periscopedata!