
Impact or No? -- Simple Event Study in SQL Part I

March 24, 2016

Timeseries charts can be quite powerful for depicting changes in a metric over time. For instance, this chart shows how revenue has grown over time:

From a cursory glance, it appears that revenue really skyrocketed after our mobile launch on 12/24/14, but how can we be sure of the impact? In this post, we dissect and quantify the effect of a single event on timeseries data.

Timeseries Data

A timeseries is a collection of measurements of one metric over time. Timeseries data comes in a variety of shapes and sizes. For instance, the height of one person measured over time is expected to increase, with each data point dependent on the previous ones, whereas Apple's stock price is expected to be dynamic and hard to predict.

If the past is a good predictor of the future, we can model timeseries data using an autoregressive model, which means tomorrow’s value depends linearly on today’s value plus some imperfectly predictable term.
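
In equation form, an AR(1) model looks like the sketch below, where x_t is the metric on day t, beta0 and beta1 are coefficients to estimate, and e_t is the imperfectly predictable noise term:

  x_t = beta0 + beta1 * x_(t-1) + e_t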

Autoregressive models can be used to model revenue since, as an example, tomorrow’s revenue is probably very similar to today’s revenue with some noise added in. What complicates this analysis is the impact of a singular event on revenue.

Mobile Launch Scenario

Revisiting our earlier example, it's hard to measure precisely how much impact our mobile launch has had on revenue just by looking at the chart. However, we can build an autoregressive model that includes the launch as a variable and then use the fitted coefficients to quantify its effect on revenue.

To measure the impact of a particular event, we will use a dummy variable D to indicate whether or not the mobile launch has happened yet. D will take on the value 1 when the mobile launch has already happened or 0 when it hasn’t. Thus, to generate the complete dataset for this series, we use the following in Postgres/Redshift SQL:

with revenue as
  (
  select
    date
    , sum(price) as dailyrev
    , case when date >= '2014-12-24' then 1 else 0 end as D
  from purchases
  group by 1, 3
  )

We first build an AR(1) model – an autoregressive model which uses the first lag of the metric to predict its current value. As with most timeseries models, we will be using the log of returns, where the return r is calculated as the proportional change in our metric x.
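
Concretely, writing x_t for the metric on day t, a sketch of that definition is:

  r_t = x_t / x_(t-1),   so   log(r_t) = log(x_t) - log(x_(t-1))

which is why the SQL below works with the log of daily revenue and the log of its lag.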

Doing the same in SQL, we create the CTE:

with
...
, log_revenue as
  (
  select
    date
    -- use ln(): Postgres/Redshift log() is base 10, and we exponentiate with e later
    , ln(dailyrev) as logrev
    , ln(lag(dailyrev) over (order by date)) as laglogrev
    , D
  from revenue
  )

Putting this together with our dummy variable, we get an AR(1) model with the launch dummy as an additional regressor.
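
Writing y_t for daily revenue and D_t for the dummy, the model is (reconstructed here in the post's plain notation; e_t is the noise term):

  log(y_t) = beta0 + beta1 * log(y_(t-1)) + beta2 * D_t + e_t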

Calculating Coefficients for the Regression

Now that we have the model, we want to estimate the beta coefficients by running a regression. To do this, we can use the ordinary least squares (OLS) method.
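
For a regression with two predictors, the OLS estimates have a closed form. The sketch below uses S_ab as shorthand for the centered sum sum(a*b) - sum(a)*sum(b)/n, which is what the Sig* columns in the SQL below compute (y = logrev, y1 = laglogrev, D = the dummy, n = the number of rows):

  beta1 = (S_DD * S_y1y - S_y1D * S_Dy) / (S_y1y1 * S_DD - S_y1D^2)
  beta2 = (S_y1y1 * S_Dy - S_y1D * S_y1y) / (S_y1y1 * S_DD - S_y1D^2)
  beta0 = mean(y) - beta1 * mean(y1) - beta2 * mean(D)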

Within SQL, we can create a few more CTEs, one for the sigma calculations and one for each beta:

with
...
, sigmas as
  (
  select
    -- cast to float so the all-integer D terms don't hit integer division
    sum(D*D) - sum(D)::float * sum(D) / count(*) as SigSqD
    , sum(laglogrev*logrev) - sum(laglogrev) * sum(logrev) / count(*) as Sigy1y
    , sum(laglogrev*D) - sum(laglogrev) * sum(D) / count(*) as Sigy1D
    , sum(D*logrev) - sum(D) * sum(logrev) / count(*) as SigDy
    , sum(laglogrev*laglogrev) - sum(laglogrev) * sum(laglogrev) / count(*) as SigSqy1
    , pow(sum(laglogrev*D) - sum(laglogrev) * sum(D) / count(*), 2) as SqSigy1D
  from log_revenue
  -- drop the first row, whose lag is null, so count(*) matches the rows being summed
  where laglogrev is not null
  )
, beta1 as 
  (
  select 
    (SigSqD*Sigy1y-Sigy1D*SigDy)/(SigSqy1*SigSqD-SqSigy1D) 
      as beta1 
  from 
    sigmas
  )
, beta2 as 
  (
  select (SigSqy1*SigDy-Sigy1D*Sigy1y)/(SigSqy1*SigSqD-SqSigy1D) 
    as beta2 
  from 
    sigmas
  )
, beta0 as 
  (
  select 
    avg(logrev) - (select beta1 from beta1) * avg(laglogrev) 
      - (select beta2 from beta2) * avg(D)
      as beta0
  from 
    log_revenue
  -- keep the averages on the same rows the sigmas used
  where laglogrev is not null
  )

Now that we have the finalized regression coefficients, we can draw a nice trendline through the data:

with
...
, finalized_regression as 
  (
  select 
    log_revenue.*
    , beta0
    , beta1
    , beta2
  from 
    log_revenue, beta0, beta1, beta2
  )
, predicted_revenue as 
  (
  select 
    date
    , logrev as actual_log_revenue
    , beta0 + beta1*laglogrev + beta2*D as predicted_log_revenue 
  from finalized_regression
  )
select * 
from predicted_revenue

Looks like the trendline is a great fit!

Translating an Event into $$$

The beta values from our above regression turned out to be:

  • beta0 = 2.609
  • beta1 = 0.724
  • beta2 = 0.553

So what do these numbers tell us about our mobile launch on 12/24/14? The key number is beta2, the coefficient on the dummy variable that marks post-launch days. Since the model is fit on the natural log of revenue, we can simply exponentiate beta2 to find its multiplicative impact:

e^(0.553*1) = 1.738x

From the above analysis, launching the mobile version of the game produced roughly 74% higher daily revenue than pre-launch! The day after the launch, that translated into revenue about $25k higher according to our model.
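
If we want a dollar figure for every post-launch day rather than a single multiplier, one option is to compare the model's prediction with and without the launch term. The CTE below is a sketch along those lines, not part of the original analysis; it reuses the predicted_revenue and beta2 CTEs from above (the launch_lift name and its column names are ours), and it assumes the natural-log version of the model:

with
...
, launch_lift as
  (
  select
    date
    -- the model's prediction, back on the dollar scale
    , exp(predicted_log_revenue) as predicted_rev
    -- the same prediction with the launch effect removed
    , exp(predicted_log_revenue - beta2) as predicted_rev_no_launch
    , exp(predicted_log_revenue) - exp(predicted_log_revenue - beta2)
      as estimated_lift_dollars
  from predicted_revenue, beta2
  where date >= '2014-12-24'
  )
select *
from launch_lift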

In part 2 of our event study series, we will discuss validating the results of this regression, along with checking for statistical significance.

