###### Common Statistical Operations

# Predicting Exponential Growth with SQL

SQL is great at grouping and counting the data you already have, and with a little help from regressions, SQL can help you project that data into the future.

Let’s estimate total users over time for a rapidly growing fictitious mobile app. The early days of this app were pretty messy, so we’ll chop them off add them in as the starting sum, only plotting dates after October 2013. Instead of a range join to get the rolling sum we’ll use a window function. To keep things organized we’ll put each step into a * with* query:

with

daily_new_usersas(

selectcreated_at::date dt,count(1) daily_ct

fromuserswherecreated_at>'2013-10-01'groupby1),

daily_user_volumeas(

select

dt,

dt-'2013-10-01'asdt_id,-- integer version of date

84066-- users before October 2013

+sum(daily_ct) over (

orderbydt

rowsbetweenunbounded precedingandcurrentrow

)

asuser_ct

fromdaily_new_users

)

selectdt, daily_ctfromdaily_new_users;

selectdt, user_ctfromdaily_user_volume

Here are the two curves:

The * daily_user_volume* data has the look of an exponential growth curve, making it an ideal candidate for an exponential projection.

### Linearizing Exponential Data

The easiest kind of regression is a linear regression. Of course, fitting a line to exponential data would yield a terrible fit. Instead, we can linearize the exponential data by taking the log of the data, fit that to a linear regression, and then inverse process on the future data points to get the predicted future growth.

The log’ed version of * daily_user_volume* is quite linear, so this will be a great fit:

To make a linear regression, we need to find the best estimates for * A* and

*(intercept and slope) that minimize the error in this formula:*

**B**We’ll use the Ordinary Least Squares method for minimizing error of our estimates, which lets us solve for A and B like this:

Solving for B first, we’ll define * estimate_b* as:

with

...

estimate_bas(

selectsum(covar.s)/sum(var.s) b

from(

select(

dt_id-avg(dt_id::float8) over ())*(

log(user_ct)-avg(log(user_ct)) over ()

)ass

fromdaily_user_volume) covar

join(

selectpow(dt_id-avg(dt_id::float8) over (),2)ass

fromdaily_user_volume

) var

ontrue

),

Critically, we’re taking the * log* of

*to linearize those exponential data points!*

**user_ct**Our window functions use * over ()* so that the window is applied to the whole result set. It’s very convenient in situations like this, where you want to compare each row to an aggregation over every row.

Getting * estimate_a* is more straightforward:

with

...

estimate_aas(

select

avg(log(user_ct))-avg(dt_id::float)*

(selectbfromestimate_b) a

fromdaily_user_volume

),

Now that we have our A and B for the regression, it’s time to project forward.

### Projecting and De-linearizing

With the estimates computed, we can simply generate the y-values for the current and future dates and then invert the logarithm! We’ll generate a series of dates that will start with and then exceed the * dt* in

*, use them as x-values to predict the*

**daily_user_volume***, and invert the logarithm with*

**log(y)***.*

**pow**with

...

predictionsas(

select

::date'2013-10-01'+i,

coalesce(user_ct,7111884)asuser_ct,-- last real count

pow(10, (selectafromestimate_a)+(

selectbfromestimate_b

)*i) estimate

from

-- make more dt_ids for the projection

generate_series(1,275,1) i

leftjoindaily_user_volume

oni=daily_user_volume.dt_id

),

select*frompredictions

Look at that beautiful fit! We’ve fit an exponential curve to our cumulative user counts so that we can project the counts into the future.

Here’s the full SQL for all the steps together:

with

daily_new_usersas(

selectcreated_at::date dt,count(1) daily_ct

fromuserswherecreated_at>'2013-10-01'groupby1

),

daily_user_volumeas(

select

dt,

dt-'2013-10-01'asdt_id,-- integer version of date

84066-- users before October 2013

+sum(daily_ct) over (

orderbydt

rowsbetweenunbounded precedingandcurrentrow

)asuser_ct

fromdaily_new_users

),

estimate_bas(

selectsum(covar.s)/sum(var.s) b

from(

select(

dt_id-avg(dt_id::float8) over ())*(

log(user_ct)-avg(log(user_ct)) over ()

)ass

fromdaily_user_volume

) covar

join(

selectpow(dt_id-avg(dt_id::float8) over (),2)ass

fromdaily_user_volume

) var

ontrue

),

estimate_aas(

select

avg(log(user_ct))-avg(dt_id::float)*(

selectbfromestimate_b

) a

fromdaily_user_volume

),

predictionsas(

select

::date'2013-10-01'+i,

coalesce(user_ct,7111884)asuser_ct,-- last real count

pow(10, (selectafromestimate_a)+(

selectbfromestimate_b)*i

) estimate

from

-- make more dt_ids for the projection

generate_series(1,275,1) i

leftjoindaily_user_volume

oni=daily_user_volume.dt_id

)

select*frompredictions