
Predicting Exponential Growth with SQL
June 11, 2014

SQL is great at grouping and counting the data you already have, and with a little help from regressions, SQL can help you project that data into the future.

Let’s estimate total users over time for a rapidly growing fictitious mobile app. The early days of this app were pretty messy, so we’ll chop them off and add them back in as a starting sum, only plotting dates after October 2013. Instead of a range join to get the rolling sum, we’ll use a window function (a range-join version is sketched after the query for comparison). To keep things organized, we’ll put each step into a with query:

with
daily_new_users as (
  select created_at::date dt, count(1) daily_ct 
  from users where created_at > '2013-10-01' group by 1),
daily_user_volume as (
  select 
    dt,
    dt - '2013-10-01' as dt_id, -- integer version of date
    84066 -- users before October 2013
      + sum(daily_ct) over (
        order by dt
        rows between unbounded preceding and current row
      ) 
      as user_ct
  from daily_new_users
)
select u.dt, u.daily_ct, v.user_ct
from daily_new_users u
join daily_user_volume v using (dt)
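
For comparison, here’s what the same rolling sum looks like as a range join, reusing the daily_new_users CTE from above (a sketch, not from the original post). It returns the same user_ct, but self-joins the table, which is why the window function is both simpler and cheaper:

with
...
select
  a.dt,
  84066 + sum(b.daily_ct) as user_ct -- users before October 2013
from daily_new_users a
join daily_new_users b on b.dt <= a.dt
group by a.dt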

Here are the two curves:

The daily_user_volume data has the look of an exponential growth curve, making it an ideal candidate for an exponential projection.

Linearizing Exponential Data

The easiest kind of regression is a linear regression. Of course, fitting a line to exponential data would yield a terrible fit. Instead, we can linearize the exponential data by taking its log, fit a line to the logged data, and then invert the process on the predicted points to get the projected future growth.
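
To see why the log trick works: if growth is exponential, then user_ct = C * 10^(B * dt_id) for some constants C and B, and taking the base-10 log of both sides gives log(user_ct) = log(C) + B * dt_id, which is a straight line with slope B and intercept A = log(C).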

The logged version of daily_user_volume is quite linear, so this will be a great fit:

To make a linear regression, we need to find the best estimates for A and B (intercept and slope) that minimize the error in this formula:
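
log(user_ct) = A + B * dt_id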

We’ll use the Ordinary Least Squares method to minimize the error of our estimates, which lets us solve for A and B like this:
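
B = sum((dt_id - avg(dt_id)) * (log(user_ct) - avg(log(user_ct)))) / sum((dt_id - avg(dt_id))^2)

A = avg(log(user_ct)) - B * avg(dt_id)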

Solving for B first, we’ll define estimate_b as:

with
...
estimate_b as (
  select sum(covar.s) / sum(var.s) b
  from (
    select (
      dt_id - avg(dt_id::float8) over ()) * (
        log(user_ct) - avg(log(user_ct)) over ()
      ) as s -- per-row covariance term
    from daily_user_volume
  ) covar
  join (
    select pow(dt_id - avg(dt_id::float8) over (), 2) as s -- per-row variance term
    from daily_user_volume
  ) var
  on true -- cross join; the repeated rows cancel out in the ratio
),

Critically, we’re taking the log of user_ct to linearize those exponential data points! (In Postgres, log on a numeric value is base 10, which is why we’ll invert it later with pow(10, ...).)

Our window functions use over () so that the window is applied to the whole result set. It’s very convenient in situations like this, where you want to compare each row to an aggregation over every row.
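
For instance, given a hypothetical table t with a numeric column x, this computes each row’s deviation from the mean of the whole result set in one pass:

-- avg(x) over () repeats the global mean on every row
select x, x - avg(x) over () as deviation
from t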

Getting estimate_a is more straightforward:

with
...
estimate_a as (
  select 
    avg(log(user_ct)) - avg(dt_id::float) * 
      (select b from estimate_b) a
  from daily_user_volume
),

Now that we have our A and B for the regression, it’s time to project forward.

Projecting and De-linearizing

With the estimates computed, we can simply generate the y-values for the current and future dates and then invert the logarithm! We’ll generate a series of day numbers that starts alongside and then extends past the dt_ids in daily_user_volume, use them as x-values to predict log(y), and invert the logarithm with pow.

with
...
predictions as (
  select
    '2013-10-01'::date + i as dt,
    coalesce(user_ct, 7111884) as user_ct, -- last real count
    pow(10, (select a from estimate_a) + (
        select b from estimate_b
      ) * i) estimate
  from
    -- make more dt_ids for the projection
    generate_series(1, 275, 1) i
      left join daily_user_volume
        on i = daily_user_volume.dt_id
)
select * from predictions

Look at that beautiful fit! We’ve fit an exponential curve to our cumulative user counts so that we can project the counts into the future.
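
To put a number on the fit (this check isn’t part of the queries above), you can square Postgres’s built-in corr aggregate to get R² for the linearized regression:

with
...
select pow(corr(dt_id::float8, log(user_ct)::float8), 2) as r_squared
from daily_user_volume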

Here’s the full SQL for all the steps together:

with
daily_new_users as (
  select created_at::date dt, count(1) daily_ct 
  from users where created_at > '2013-10-01' group by 1
),
daily_user_volume as (
  select 
    dt,
    dt - '2013-10-01' as dt_id, -- integer version of date
    84066 -- users before October 2013
      + sum(daily_ct) over (
        order by dt
        rows between unbounded preceding and current row
      ) as user_ct
  from daily_new_users
),
estimate_b as (
  select sum(covar.s) / sum(var.s) b
  from (
    select (
      dt_id - avg(dt_id::float8) over ()) * (
        log(user_ct) - avg(log(user_ct)) over ()
      ) as s
    from daily_user_volume
  ) covar
  join (
    select pow(dt_id - avg(dt_id::float8) over (), 2) as s
    from daily_user_volume
  ) var
  on true
),
estimate_a as (
  select 
    avg(log(user_ct)) - avg(dt_id::float) * (
      select b from estimate_b
    ) a
  from daily_user_volume
),
predictions as (
  select
    '2013-10-01'::date + i as dt,
    coalesce(user_ct, 7111884) as user_ct, -- last real count
    pow(10, (select a from estimate_a) + (
        select b from estimate_b) * i
      ) estimate
  from
    -- make more dt_ids for the projection
    generate_series(1, 275, 1) i
      left join daily_user_volume
        on i = daily_user_volume.dt_id
)
select * from predictions