Filling in gaps in your dataset usually takes one of two forms. You can interpolate, where intermediate points are generated based on their neighbors, or you can carry values forward.



In this post we’re going to carry values forward. Below we have a log of when a customer changed payment plans. Since the customer didn’t change plans everyday, we only have data for certain days.



Most daily revenue metrics work best when you have daily data, so this isn’t the most convenient form for data analysis. Using a correlated subquery, we’ll convert this list of events to a time series showing the customer’s current plan on each date.



This customer upgraded from a $10/day plan to $15 and then $20, and then downgraded to $15:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/payment-plan-changes.png" alt="Payment plan changes" class="wp-image-77715"/></figure>



<pre class="wp-block-code"><code>select
 date(created_at) as d,
 plan_amount as plan
from
 playment_plans
where
 customer_id = 1
order by
 1</code></pre>



By <a href="https://www.sisense.com/blog/use-generate-series-to-get-continuous-results/">using generate series</a> or joining a list of dates, we can build most of the daily plan dataset we’re looking for:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/date-list-empty.png" alt="Date list empty" class="wp-image-77721"/></figure>



<pre class="wp-block-code"><code>select
 *
from
 spotty_data
order by
 1</code></pre>



<h2 class="wp-block-heading">Filling in the Gaps</h2>



We want the $10 plan value to fill down to 2016-03-24 and 2016-03-25, for $15 to fill in 2016-03-27 and 2016-03-28, etc.



It looks like window functions could work well here, but they won’t. The lag window function can look at previous rows, but it cannot look at previous versions of itself, so you’ll end up with a result like this:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/window-coalesce.png" alt="Window coalesce" class="wp-image-77727"/></figure>



<pre class="wp-block-code"><code>select
 *
 , coalesce(plan, lag(plan) over(order by d))
from
 spotty_data
order by
 1</code></pre>



With the lag function, we only fill in one new value. To fill all of them in, we need a correlated subquery.



<h2 class="wp-block-heading">Correlated Subqueries for Multi-Row Lag</h2>



Be careful: correlated subqueries can be very slow. They execute once for each row in your result set, so make sure they’re using indices or the data set is small.



In our case, we want to find the most recent plan, at or before the current date. Using a correlated subquery here is similar to a <a href="https://www.sisense.com/blog/use-self-joins-to-calculate-your-retention-churn-and-reactivation-metrics/">self join </a>or <a href="https://www.sisense.com/blog/4-ways-to-join-only-the-first-row-in-sql/">getting the first row from a join</a>, but with the logic in the select statement as a new column:



<pre class="wp-block-code"><code>select
 *
 , (select plan
 from spotty_data as backfill
 where backfill.d &lt;= original.d
 and backfill.plan is not null
 order by backfill.d desc
 limit 1) as coalesce_down_plan
from
 spotty_data as original
order by
 1</code></pre>



We define a new column that’s selecting a single value out of the same table, renamed&nbsp;backfill. The original&nbsp;spotty_data&nbsp;we’ll call&nbsp;original.



The condition&nbsp;backfill.d &lt;= original.d&nbsp;coupled with&nbsp;order by backfill.d desc&nbsp;ensures that the top row of our subselect is the most recent plan. The&nbsp;limit 1&nbsp;is required as the expression can only return one row to be used as the value for the new column.



With our correlated subquery, we’re able to fill in the blanks for each date, making our revenue analysis much easier.



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/coalesce-down.png" alt="Coalesce down" class="wp-image-77733"/></figure>

Coalesce Down using Correlated Subqueries

LinkedIn

Twitter

GitHub

curve-image-unique-image-unique

curve

3-dark-2-image-unique-image-unique

3 DARK 2

Get the latest in analytics right in your inbox.

Article