Periscope Data
The world’s fastest visualization platform for data analysts.

Range joins give you accurate histories

April 30, 2014

Sliding Date Ranges

One fine morning, your investor Leo asks you: How many charts are there across all of Periscope? Because he’s an investor, he also wants to know: How fast is that number growing?

We’ll start simply, by pulling the number of daily new charts:

select date(created_at), count(1)
from charts
group by 1

The results look straightforward:

Now let’s answer the question of how many charts have ever been created as of a given date:

select d, count(charts.id)
from generate_series(
  current_date - interval '12 months',
  current_date,
  '1 day'
) d
left join charts on charts.created_at <= d
group by 1

We start with our old friend generate_series to get a continuous list of dates. (You can also join charts to itself as long as there’s at least one widget for each day.)

The range join comes into play when we join on charts.created_at <= d. For each date d, any row in charts with created_at <= d will be included in the join. This means any given row in charts will be included once for each date after its creation date. Most rows in charts will be included many times!

Esesentially, we’ve made our previous graph cumulative.

Excluding Deleted Charts

Unfortunately for our simple analysis, users can delete charts. In fact, they do it all the time. That makes this graph pretty misleading.

Fixing this should be simple, right? Just filter out the deleted charts!

select d, count(charts.id)
from generate_series(
  current_date - interval '12 months',
  current_date,
  '1 day'
) d
left join charts on charts.created_at <= d
where charts.deleted_at is null
group by 1

Sure enough, we see a lot fewer charts:

Don’t Rewrite History

There’s one more wrinkle: That graph changes history! If I make a chart in April and delete it in June, that chart is retroactively gone from the April bucket. This is not ideal. That chart could be present in April and May, but not in June. Once April is over, the April number should never change.

Rather than filtering all deleted charts after the join, we’ll use the deleted_at timestamp in the join itself:

select d, count(charts.id)
from generate_series(
  current_date - interval '12 months'
  current_date
  '1 day'
) d
left join charts on charts.created_at <= d and (
  charts.deleted_at is null or
  charts.deleted_at > d
)
group by 1

In our join, we specify that for each date, we want to include charts that were created before that date and deleted after that date (or never deleted at all).

This looks much better. Using range joins, we can always get an accurate picture of any previous point in time.

Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?