Common Statistical Operations
Impact or No? -- Simple Event Study in SQL
Timeseries charts can be quite powerful for depicting changes of a metric over time. For instance, this chart shows how revenue has grown over time:
From a cursory glance, it appears that revenue has really skyrocketed after our mobile launch on 12/24/14, but how can we be sure of the impact? In this blog, we dissect and quantify the effect of a single event on timeseries data.
A timeseries is a collection of measurements for one metric over time. Timeseries data comes in a variety of shapes and sizes. For instance, the height of one person collected over time is expected to increase with each data point dependent on the previous data points; whereas stock prices for Apple are expected to be dynamic and hard to predict.
If the past is a good predictor of the future, we can model timeseries data using an autoregressive model, which means tomorrow’s value depends linearly on today’s value plus some imperfectly predictable term.
Autoregressive models can be used to model revenue since, as an example, tomorrow’s revenue is probably very similar to today’s revenue with some noise added in. What complicates this analysis is the impact of a singular event on revenue.
Mobile Launch Scenario
Revisiting our earlier example, it’s really hard to precisely measure the impact that our mobile launch has had on revenue. However, we can actually build out an autoregressive model to measure this impact and then use the data to calculate the effect of the launch on revenue.
To measure the impact of a particular event, we will use a dummy variable D to indicate whether or not the mobile launch has happened yet. D will take on the value 1 when the mobile launch has already happened or 0 when it hasn’t. Thus, to generate the complete dataset for this series, we use the following in Postgres/Redshift SQL:
with revenue as
, sum(price) as dailyrev
, case when date >= '2014-12-24'
end as D
group by 1, 3
We first build an AR(1) model – an auto-regressive model which uses the first lag of the metrics to predict its value. As with most timeseries models, we will be using the log of returns, where the return r is calculated as the proportional change in our metric x:
Doing the same in SQL, we create the CTE:
, log_revenue as
select date, log(dailyrev) as logrev
, log(lag(dailyrev) over (order by date))
Putting this together with our dummy variable, we get the following AR(1) model:
Calculating Coefficients for the Regression
Now that we have the model, we want to calculate the coefficients for all the beta values by running a regression analysis. To do this, we can use the ordinary least squares method to estimate these values:
Within SQL, we can create a few more CTEs — one for doing all the sigma calculations and one each for each beta:
, sigmas as
sum(D*D)-sum(D)*sum(D)/count(*) as SigSqD
/count(*) as Sigy1y
, sum(laglogrev*D)-sum(laglogrev)*sum(D)/count(*) as Sigy1D
, sum(D*logrev)-sum(D)*sum(logrev)/count(*) as SigDy
/count(*) as SigSqy1
as SqSigy1D from log_revenue
, beta1 as
, beta2 as
, beta0 as
avg(logrev)-(select beta1 from beta1) * avg(laglogrev)
- (select beta2 from beta2)*avg(D)
Now that we have the finalized regression coefficients, we can draw a nice trendline through the data:
, finalized_regression as
, predicted_revenue as (
, finalized_regression.logrev as log_revenue
, (beta0 + beta1*log_revenue.laglogrev + beta2*log_revenue.D)
join log_revenue using (date)
Looks like the trendline is a great fit!
Translating an Event into $$$
The beta values from our above regression turned out to be:
- beta0 = 2.609
- beta1 = 0.724
- beta2 = 0.553
So what do these numbers tell us about our mobile launch on 12/24/14? The key number we’re looking at is beta2, which is the coefficient in front of the dummy variable we created to indicate post-launch metrics. Since we were examining the logarithm of revenue returns, we can simply take the exponential to find its impact:
e^(0.553*1) = 1.738x
From the above analysis, launching the mobile version of the game produced 74% higher daily revenue than pre-launch! The day after our launch, our revenue is $25k higher according to our model.