Periscope Data
The world’s fastest visualization platform for data analysts.

Coalesce Down using Correlated Subqueries

April 21, 2016

Filling in gaps in your dataset usually takes one of two forms. You can interpolate, where intermediate points are generated based on their neighbors, or you can carry values forward.

In this post we’re going to carry values forward. Below we have a log of when a customer changed payment plans. Since the customer didn’t change plans everyday, we only have data for certain days.

Most daily revenue metrics work best when you have daily data, so this isn’t the most convenient form for data analysis. Using a correlated subquery we’ll convert this list of events to a time series showing the customer’s current plan on each date.

This customer upgraded from a $10/day plan to $15 and then $20, and then downgraded to $15:

select
  date(created_at) as d,
  plan_amount as plan
from
  playment_plans
where
  customer_id = 1
order by
  1

By using generate series or joining to a list of dates, we can build most of the daily plan dataset we’re looking for:

select
  *
from
  spotty_data
order by
  1

Filling in the Gaps

We want the $10 plan value to fill down to 2016-03-24 and 2016-03-25, for $15 to fill in 2016-03-27 and 2016-03-28, etc.

It looks like window functions could work well here, but they won’t. The lag window function can look at previous rows, but it cannot look at previous versions of itself, so you’ll end up with a result like this:

select
  *
  , coalesce(plan, lag(plan) over(order by d))
from
  spotty_data
order by
  1

With the lag function we only fill in one new value. To fill all of them in, we need a correlated subquery.

Correlate Subqueries for Multi-Row Lag

Be careful: correlated subqueries can be very slow. They execute once for each row in your result set, so make sure they’re using indices or the data set is small.

In our case, we want to find the the most recent plan, at or before the current date. Using a correlated subquery here is similar to a self join or getting the first row from a join, but with the logic in the select statement as a new column:

select
  *
  , (select plan
      from spotty_data as backfill
      where backfill.d <= original.d
        and backfill.plan is not null
      order by backfill.d desc
      limit 1) as coalesce_down_plan
from
  spotty_data as original
order by
  1

We define a new column that’s selecting a single value out of the same table, renamed backfill. The original spotty_data we’ll call original.

The condition backfill.d <= original.d coupled with order by backfill.d desc ensures that the top row of our subselect is the most recent plan. The limit 1 is required as the expression can only return one row to be used as the value for the new column.

With our correlated subquery, we’re able to fill in the blanks for each date, making our revenue analysis much easier.

Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?