
Predicting Exponential Growth with SQL
June 11, 2014

SQL is great at grouping and counting the data you already have, and with a little help from regressions, SQL can help you project that data into the future.

Let’s estimate total users over time for a rapidly growing fictitious mobile app. The early days of this app were pretty messy, so we’ll chop them off and add them back in as a starting sum, only plotting dates after October 2013. Instead of a range join to get the rolling sum, we’ll use a window function (a range-join version is sketched after the query for comparison). To keep things organized, we’ll put each step into a with query:

with
daily_new_users as (
  select created_at::date dt, count(1) daily_ct 
  from users where created_at > '2013-10-01' group by 1),
daily_user_volume as (
  select 
    dt,
    dt - '2013-10-01' as dt_id, -- integer version of date
    84066 -- users before October 2013
      + sum(daily_ct) over (
        order by dt
        rows between unbounded preceding and current row
      ) 
      as user_ct
  from daily_new_users
)
select u.dt, u.daily_ct, v.user_ct
from daily_new_users u
join daily_user_volume v using (dt)
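
For comparison, here’s what the same rolling sum looks like as a range join, reusing the daily_new_users CTE from above (a sketch, not from the original post). It returns the same user_ct, but self-joins the table, which is why the window function is both simpler and cheaper:

with
...
select
  a.dt,
  84066 + sum(b.daily_ct) as user_ct -- users before October 2013
from daily_new_users a
join daily_new_users b on b.dt <= a.dt
group by a.dt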

Here are the two curves:

The daily_user_volume data has the look of an exponential growth curve, making it an ideal candidate for an exponential projection.

Linearizing Exponential Data

The easiest kind of regression is a linear regression. Of course, fitting a line to exponential data would yield a terrible fit. Instead, we can linearize the exponential data by taking its log, fit a line to the logged data, and then invert the process on the predicted points to get the projected future growth.
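
To see why the log trick works: if growth is exponential, then user_ct = C * 10^(B * dt_id) for some constants C and B, and taking the base-10 log of both sides gives log(user_ct) = log(C) + B * dt_id, which is a straight line with slope B and intercept A = log(C).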

The logged version of daily_user_volume is quite linear, so this will be a great fit:

To make a linear regression, we need to find the best estimates for A and B (intercept and slope) that minimize the error in this formula:
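
log(user_ct) = A + B * dt_id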

We’ll use the Ordinary Least Squares method to minimize the error of our estimates, which lets us solve for A and B like this:
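
B = sum((dt_id - avg(dt_id)) * (log(user_ct) - avg(log(user_ct)))) / sum((dt_id - avg(dt_id))^2)

A = avg(log(user_ct)) - B * avg(dt_id)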

Solving for B first, we’ll define estimate_b as:

with
...
estimate_b as (
  select sum(covar.s) / sum(var.s) b
  from (
    select (
      dt_id - avg(dt_id::float8) over ()) * (
        log(user_ct) - avg(log(user_ct)) over ()
      ) as s -- per-row covariance term
    from daily_user_volume
  ) covar
  join (
    select pow(dt_id - avg(dt_id::float8) over (), 2) as s -- per-row variance term
    from daily_user_volume
  ) var
  on true -- cross join; the repeated rows cancel out in the ratio
),

Critically, we’re taking the log of user_ct to linearize those exponential data points! (In Postgres, log on a numeric value is base 10, which is why we’ll invert it later with pow(10, ...).)

Our window functions use over () so that the window is applied to the whole result set. It’s very convenient in situations like this, where you want to compare each row to an aggregation over every row.
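
For instance, given a hypothetical table t with a numeric column x, this computes each row’s deviation from the mean of the whole result set in one pass:

-- avg(x) over () repeats the global mean on every row
select x, x - avg(x) over () as deviation
from t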

Getting estimate_a is more straightforward:

with
...
estimate_a as (
  select 
    avg(log(user_ct)) - avg(dt_id::float) * 
      (select b from estimate_b) a
  from daily_user_volume
),

Now that we have our A and B for the regression, it’s time to project forward.

Projecting and De-linearizing

With the estimates computed, we can simply generate the y-values for the current and future dates and then invert the logarithm! We’ll generate a series of day numbers that starts alongside and then extends past the dt_ids in daily_user_volume, use them as x-values to predict log(y), and invert the logarithm with pow.

with
...
predictions as (
  select
    '2013-10-01'::date + i as dt,
    coalesce(user_ct, 7111884) as user_ct, -- last real count
    pow(10, (select a from estimate_a) + (
        select b from estimate_b
      ) * i) estimate
  from
    -- make more dt_ids for the projection
    generate_series(1, 275, 1) i
      left join daily_user_volume
        on i = daily_user_volume.dt_id
)
select * from predictions

Look at that beautiful fit! We’ve fit an exponential curve to our cumulative user counts so that we can project the counts into the future.
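
To put a number on the fit (this check isn’t part of the queries above), you can square Postgres’s built-in corr aggregate to get R² for the linearized regression:

with
...
select pow(corr(dt_id::float8, log(user_ct)::float8), 2) as r_squared
from daily_user_volume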

Here’s the full SQL for all the steps together:

with
daily_new_users as (
  select created_at::date dt, count(1) daily_ct 
  from users where created_at > '2013-10-01' group by 1
),
daily_user_volume as (
  select 
    dt,
    dt - '2013-10-01' as dt_id, -- integer version of date
    84066 -- users before October 2013
      + sum(daily_ct) over (
        order by dt
        rows between unbounded preceding and current row
      ) as user_ct
  from daily_new_users
),
estimate_b as (
  select sum(covar.s) / sum(var.s) b
  from (
    select (
      dt_id - avg(dt_id::float8) over ()) * (
        log(user_ct) - avg(log(user_ct)) over ()
      ) as s
    from daily_user_volume
  ) covar
  join (
    select pow(dt_id - avg(dt_id::float8) over (), 2) as s
    from daily_user_volume
  ) var
  on true
),
estimate_a as (
  select 
    avg(log(user_ct)) - avg(dt_id::float) * (
      select b from estimate_b
    ) a
  from daily_user_volume
),
predictions as (
  select
    '2013-10-01'::date + i as dt,
    coalesce(user_ct, 7111884) as user_ct, -- last real count
    pow(10, (select a from estimate_a) + (
        select b from estimate_b) * i
      ) estimate
  from
    -- make more dt_ids for the projection
    generate_series(1, 275, 1) i
      left join daily_user_volume
        on i = daily_user_volume.dt_id
)
select * from predictions