Periscope Data
The world’s fastest visualization platform for data analysts.

Extrapolating Data with Day-of-Week Effects

May 14, 2015

Predicting the Future, Monday through Sunday

At Periscope, we need to plan ahead to make sure we have enough server capacity. This means predicting future server loads.

To get there, we can extrapolate our average load as a prediction while being sure to account for weekend vs. weekday traffic patterns. Ideally, we’d plot a predicted daily server load with day-of-week effects.

This chart is our goal:

We’ll start with extrapolating weekly data, then add in day-of-week effects for daily predictions.

Week Totals

First we’ll grab the weekly totals:

select
  date_trunc('week', created_at)::date as week,
  count(1) as total
from requests
group by 1

This produces the following chart:

Average Weekly Growth

Now we’ll calculate the average weekly growth with a lag window function to reference the previous week for the growth rate. For the average growth rate, we’ll use a geometric mean. Rather than a simple avg(...), the geometric mean is exp(avg(ln(...))).

We also exclude the current week since it doesn’t have complete data.

with weekly_totals as (
  select
    date_trunc('week', created_at)::date as week,
    count(1) as total
  from requests
  where date_trunc('week', created_at) < date_trunc('week', now())
  group by 1
), avg_weekly_growth as (
  select exp(avg(ln(growth))) as growth
  from (
    select 
      total::float / lag(total, 1) over (order by week) as growth
    from weekly_totals
  ) t
)
select * from avg_weekly_growth

This shows traffic rising by about 6% per week. We can use this rate as the basis for weekly extrapolations.

Weekly Predictions

We now need to extrapolate the average historical rate to the future. We’ll assume a consistent monthly percentage growth rate and use an exponential model.

We’ll add more with clauses for the most recent week, and a list of the future_dates we want to predict server utilization for.

In Postgres we could use generate_series to create the future dates. For Redshift compatibility, though, we’ll instead re-use the dates from an existing table we know is densely packed, and shift the dates forward six weeks.

With those with clauses, we’ll union together the historical and predicted data. The prediction formula is the avg_weekly_growth raised to the number of elapsed weeks since last_week, times last_week’s total.

with weekly_totals as (
  select
    date_trunc('week', created_at)::date as week,
    count(1) as total
  from requests
  where date_trunc('week', created_at) < date_trunc('week', now())
  group by 1
), avg_weekly_growth as (
  select exp(avg(ln(growth))) as growth
  from (
    select 
      total::float 
        / lag(total, 1) over (order by week) as growth
    from weekly_totals
  ) t
), last_week as (
  select * 
  from weekly_totals 
  order by week desc 
  limit 1 
), future_dates as (
  select distinct 
    (date_trunc('week', created_at) 
      + interval '6 weeks')::date as week
  from requests
  where 
    date_trunc('week', created_at) 
      + interval '6 weeks' > date_trunc('week', now())
)
select 
  week, 
  'historical'
  total 
from weekly_totals
union (
  select 
    future_dates.week, 
    'prediction'
    last_week.total 
      * pow(avg_weekly_growth.growth, 
            datediff('week', last_week.week, future_dates.week))
  from future_dates, last_week, avg_weekly_growth
  order by 1 asc
)

This produces the following chart:

Day-of-Week Trends

We can use the predicted weekly growth as our baseline, and separately calculate how much higher or lower each day of the week tends to be vs. the weekly average. Combining these two will give us a daily estimate that follows overall growth and reflects day-of-week effects.

We need a table mapping day of the week to the day-of-week multiplier. We’d expect weekends to have a multiplier less than 1 since weekend traffic is lower than average, and weekdays to have a multiplier greater than one.

with weekly_totals as (
  select
    date_trunc('week', created_at) as week,
    count(1) as total
  from requests
  where date_trunc('week', created_at) < date_trunc('week', now())
  group by 1
), daily_totals AS (
  select 
    date_trunc('day', created_at) as day,
    count(1) as total
  from requests
  where date_trunc('week', created_at) < date_trunc('week', now())
  group by 1
), dow_multiplier AS (
  select
    extract(dow from daily_totals.day) as dow,
    exp(avg(ln(
      daily_totals.total 
      / (weekly_totals.total::float / 7)))) as multiplier
  from daily_totals
  join weekly_totals
    on date_trunc('week', daily_totals.day) = weekly_totals.week
  group by 1
  order by 1
)
select * from dow_multiplier

These numbers are in range of our expectations. Weekend days are about 60% of the average, while each week day is 10-20% above the average.

Daily Predictions

Now to combine the weekly and day-of-week factors, and see both the overall growth trend and daily variance:

We evaluated the accuracy by predicting the two most recent weeks based upon past data. We only evaluated weekday accuracy, as weekends are volatile and not our peak traffic. The average absolute error was 2.7%, which is plenty accurate for our capacity planning.

This query is a doozy, so let’s break it down. We still need the weekly average growth rate we calculated above:

with weekly_totals as (
  select
    date_trunc('week', created_at)::date as week,
    count(1) as total
  from requests
  where date_trunc('week', created_at) < date_trunc('week', now())
  group by 1
), avg_weekly_growth as (
  select exp(avg(ln(growth))) as growth
  from (
    select 
      total::float / lag(total, 1) over (order by week) as growth
    from weekly_totals
  ) t
)

We also need the daily request totals and day-of-week trends:

, daily_totals AS (
  select 
    date_trunc('day', created_at)::date as day,
    count(1) as total
  from requests
  where date_trunc('week', created_at) < date_trunc('week', now())
  group by 1
), dow_multipliers AS (
  select
    extract(dow from daily_totals.day) as dow,
    exp(avg(ln(
      daily_totals.total 
      / (weekly_totals.total::float / 7)))) as multiplier
  from daily_totals
  join weekly_totals
    on extract(dow from daily_totals.day) = weekly_totals.week
  group by 1
  order by 1
)

Now we need another simple with clause — last_week in our query — to use as the base for our predictions.

, last_week as (
  select *
  from weekly_totals
  order by week desc
  limit 1
)

In order to re-use the same trick we did for finding future weeks, we need a list of future days we can create predictions for:

, future_dates as (
  select distinct 
    (date_trunc('day', created_at) 
      + interval '30 days')::date as week
  from requests.
  where 
    date_trunc('week', created_at) 
      + interval '30 days' > date_trunc('week', now())
)

Finally, we can union together the historical data and predictions. Each prediction is a product of last week’s daily average, times the correct dow_multiplier, times the avg_weekly_growth raised to the power of the number of weeks since last_week.

select 
  day
  'historical'
  total
from daily_requests
union (
  select 
    future_dates.day,
    'prediction',
     last_week.total / 7
      * dow_multipliers.multiplier
      * pow(avg_weekly_growth.growth,
            datediff('week', last_week.week, 
                             future_dates.day))
  from future_dates
  join dow_multipliers 
    on dow_multipliers.dow = extract(dow from future_dates.day)
  cross join last_week, avg_weekly_growth
  order by 1 asc
)

We now have a rough prediction for server capacity planning that can be implemented in half a page of SQL. There are more robust models that could be used, but this model is both simple and accurate enough for our purposes.

Now that we’re done, go forth and extrapolate your daily data!

Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?