Common Statistical Operations
Coalesce Down using Correlated Subqueries
Filling in gaps in your dataset usually takes one of two forms. You can interpolate, where intermediate points are generated based on their neighbors, or you can carry values forward.
In this post we’re going to carry values forward. Below we have a log of when a customer changed payment plans. Since the customer didn’t change plans everyday, we only have data for certain days.
Most daily revenue metrics work best when you have daily data, so this isn’t the most convenient form for data analysis. Using a correlated subquery we’ll convert this list of events to a time series showing the customer’s current plan on each date.
This customer upgraded from a $10/day plan to $15 and then $20, and then downgraded to $15:
date(created_at) as d,
plan_amount as plan
customer_id = 1
Filling in the Gaps
We want the $10 plan value to fill down to 2016-03-24 and 2016-03-25, for $15 to fill in 2016-03-27 and 2016-03-28, etc.
It looks like window functions could work well here, but they won’t. The lag window function can look at previous rows, but it cannot look at previous versions of itself, so you’ll end up with a result like this:
, coalesce(plan, lag(plan) over(order by d))
With the lag function we only fill in one new value. To fill all of them in, we need a correlated subquery.
Correlate Subqueries for Multi-Row Lag
Be careful: correlated subqueries can be very slow. They execute once for each row in your result set, so make sure they’re using indices or the data set is small.
In our case, we want to find the the most recent plan, at or before the current date. Using a correlated subquery here is similar to a self join or getting the first row from a join, but with the logic in the select statement as a new column:
, (select plan
from spotty_data as backfill
where backfill.d <= original.d
and backfill.plan is not null
order by backfill.d desc
limit 1) as coalesce_down_plan
spotty_data as original
We define a new column that’s selecting a single value out of the same table, renamed backfill. The original spotty_data we’ll call original.
The condition backfill.d <= original.d coupled with order by backfill.d desc ensures that the top row of our subselect is the most recent plan. The limit 1 is required as the expression can only return one row to be used as the value for the new column.
With our correlated subquery, we’re able to fill in the blanks for each date, making our revenue analysis much easier.