# Impact or No? -- Simple Event Study in SQL

Timeseries charts can be quite powerful for depicting changes of a metric over time. For instance, this chart shows how revenue has grown over time:

At a glance, revenue appears to have skyrocketed after our mobile launch on 12/24/14, but how can we be sure of the impact? In this post, we dissect and quantify the effect of a single event on timeseries data.

### Timeseries Data

A timeseries is a collection of measurements of one metric over time. Timeseries data comes in a variety of shapes and sizes. For instance, a person's height measured over time is expected to increase, with each data point dependent on the previous ones, whereas Apple's stock price is expected to be volatile and hard to predict.

If the past is a good predictor of the future, we can model timeseries data using an autoregressive model, which means tomorrow’s value depends linearly on today’s value plus some imperfectly predictable term.
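In symbols, a first-order version of this idea can be sketched as follows. The symbol names are our own choice for illustration: $x_t$ is the metric on day $t$, $c$ and $\phi$ are constants, and $\varepsilon_{t+1}$ is the imperfectly predictable term.

```latex
x_{t+1} = c + \phi \, x_t + \varepsilon_{t+1}
```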

Autoregressive models can be used to model revenue since, as an example, tomorrow’s revenue is probably very similar to today’s revenue with some noise added in. What complicates this analysis is the impact of a singular event on revenue.

### Mobile Launch Scenario

Revisiting our earlier example, it’s really hard to precisely measure the impact that our mobile launch has had on revenue. However, we can actually build out an autoregressive model to measure this impact and then use the data to calculate the effect of the launch on revenue.

To measure the impact of a particular event, we will use a dummy variable D to indicate whether or not the mobile launch has happened yet. D will take on the value 1 when the mobile launch has already happened or 0 when it hasn’t. Thus, to generate the complete dataset for this series, we use the following in Postgres/Redshift SQL:

    with revenue as
    (
      select
        date
        , sum(price) as dailyrev
        , case when date >= '2014-12-24'
            then 1
            else 0
          end as D
      from purchases
      group by 1, 3
    )

We first build an AR(1) model, an autoregressive model that uses the first lag of the metric to predict its current value. As with most timeseries models, we will use the log of returns, where the return r is the proportional change in our metric x from one day to the next.
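One common convention, assumed here because it lines up with the two logged columns computed in the query that follows, treats the return as the ratio of consecutive values, so that its log is a difference of logs:

```latex
r_t = \frac{x_t}{x_{t-1}}, \qquad \ln r_t = \ln x_t - \ln x_{t-1}
```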

Doing the same in SQL, we create the CTE:

    with
    ...
    , log_revenue as
    (
      select
        date
        -- ln() is the natural log; log() is base 10 in Postgres/Redshift
        , ln(dailyrev) as logrev
        , ln(lag(dailyrev) over (order by date)) as laglogrev
        , D
      from revenue
    )

Putting this together with our dummy variable, we get the following AR(1) model:
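Written in terms of the columns from the query above (`logrev` is the log of revenue on day $t$, `laglogrev` the log of revenue on day $t-1$), the model can be sketched as below. The notation is inferred from the `beta0`, `beta1`, and `beta2` coefficients used in the later queries, and $\varepsilon_t$ is an assumed name for the error term.

```latex
\ln(\mathrm{rev}_t) = \beta_0 + \beta_1 \ln(\mathrm{rev}_{t-1}) + \beta_2 D_t + \varepsilon_t
```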

### Calculating Coefficients for the Regression

Now that we have the model, we want to calculate the coefficients for all the beta values by running a regression analysis. To do this, we can use the ordinary least squares method to estimate these values:
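For a regression with two predictors, the least squares estimates have a closed form. The sketch below uses shorthand of our own: $y$ is the log of revenue, $y_1$ is its first lag, $D$ is the launch dummy, $n$ is the number of rows, and $S_{ab}$ is a centered sum of products. These expressions correspond term by term to the sigma and beta CTEs that follow.

```latex
\begin{aligned}
S_{ab} &= \sum_t a_t b_t - \frac{\left(\sum_t a_t\right)\left(\sum_t b_t\right)}{n} \\
\beta_1 &= \frac{S_{DD}\, S_{y_1 y} - S_{y_1 D}\, S_{D y}}{S_{y_1 y_1}\, S_{DD} - S_{y_1 D}^2} \\
\beta_2 &= \frac{S_{y_1 y_1}\, S_{D y} - S_{y_1 D}\, S_{y_1 y}}{S_{y_1 y_1}\, S_{DD} - S_{y_1 D}^2} \\
\beta_0 &= \bar{y} - \beta_1 \bar{y}_1 - \beta_2 \bar{D}
\end{aligned}
```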

Within SQL, we create a few more CTEs, one for all the sigma calculations and one for each beta:

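    with
    ...
    , sigmas as
    (
      select
        -- cast the count so the all-integer D terms avoid integer division
        sum(D * D) - sum(D) * sum(D) / count(*)::float as SigSqD
        , sum(laglogrev * logrev) - sum(laglogrev) * sum(logrev)
            / count(*) as Sigy1y
        , sum(laglogrev * D) - sum(laglogrev) * sum(D) / count(*) as Sigy1D
        , sum(D * logrev) - sum(D) * sum(logrev) / count(*) as SigDy
        , sum(laglogrev * laglogrev) - sum(laglogrev) * sum(laglogrev)
            / count(*) as SigSqy1
        , pow(sum(laglogrev * D) - sum(laglogrev) * sum(D) / count(*), 2)
            as SqSigy1D
      from log_revenue
      -- the first day has no lag; drop it so every sum covers the same rows
      where laglogrev is not null
    )
    , beta1 as
    (
      select
        (SigSqD * Sigy1y - Sigy1D * SigDy) / (SigSqy1 * SigSqD - SqSigy1D)
          as beta1
      from sigmas
    )
    , beta2 as
    (
      select
        (SigSqy1 * SigDy - Sigy1D * Sigy1y) / (SigSqy1 * SigSqD - SqSigy1D)
          as beta2
      from sigmas
    )
    , beta0 as
    (
      select
        avg(logrev) - (select beta1 from beta1) * avg(laglogrev)
          -- cast D so avg() is not truncated to an integer on Redshift
          - (select beta2 from beta2) * avg(D::float)
          as beta0
      from log_revenue
      where laglogrev is not null
    )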
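To sanity-check the closed-form expressions outside the database, the same arithmetic can be sketched in plain Python. This is a minimal illustration, not part of the original analysis: the function names and the toy data are hypothetical, and the lagged column is supplied directly rather than derived from `y`. The data is generated without noise from known coefficients, so the fit should recover them.

```python
def centered_sum(a, b):
    """sum(a*b) - sum(a)*sum(b)/n, mirroring the sigma terms in the SQL."""
    n = len(a)
    return sum(x * y for x, y in zip(a, b)) - sum(a) * sum(b) / n


def fit_ar1_with_dummy(y, y_lag, d):
    """Closed-form OLS for y = beta0 + beta1 * y_lag + beta2 * d."""
    s_dd = centered_sum(d, d)
    s_ll = centered_sum(y_lag, y_lag)
    s_ly = centered_sum(y_lag, y)
    s_ld = centered_sum(y_lag, d)
    s_dy = centered_sum(d, y)
    denom = s_ll * s_dd - s_ld ** 2
    beta1 = (s_dd * s_ly - s_ld * s_dy) / denom
    beta2 = (s_ll * s_dy - s_ld * s_ly) / denom
    n = len(y)
    beta0 = sum(y) / n - beta1 * sum(y_lag) / n - beta2 * sum(d) / n
    return beta0, beta1, beta2


# Made-up, noise-free data built from known coefficients (hypothetical values).
true_b0, true_b1, true_b2 = 2.0, 0.7, 0.5
y_lag = [9.0, 9.2, 9.1, 9.4, 10.1, 10.3, 10.2, 10.5]
d = [0, 0, 0, 0, 1, 1, 1, 1]
y = [true_b0 + true_b1 * l + true_b2 * k for l, k in zip(y_lag, d)]

b0, b1, b2 = fit_ar1_with_dummy(y, y_lag, d)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Because the toy data contains no noise, the recovered values should match 2.0, 0.7, and 0.5 up to floating-point error. On real data the estimates will of course differ from any "true" values.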

Now that we have the finalized regression coefficients, we can draw a nice trendline through the data:

    with
    ...
    , finalized_regression as
    (
      select
        log_revenue.*
        , beta1
        , beta2
        , beta0
      from
        log_revenue
        , beta0
        , beta1
        , beta2
    )
    , predicted_revenue as
    (
      select
        finalized_regression.date
        , finalized_regression.logrev as log_revenue
        , (beta0 + beta1 * log_revenue.laglogrev + beta2 * log_revenue.D)
            as predicted_log_revenue
      from finalized_regression
      join log_revenue using (date)
    )
    select *
    from predicted_revenue

Looks like the trendline is a great fit!

### Translating an Event into $$$

The beta values from our above regression turned out to be:

**beta0** = 2.609, **beta1** = 0.724, **beta2** = 0.553

So what do these numbers tell us about our mobile launch on 12/24/14? The key number is *beta2*, the coefficient on the dummy variable we created to indicate post-launch days. Since the model is fit on the logarithm of revenue, we can simply take the exponential to find its impact:

e^(0.553*1) = 1.738x

From the above analysis, launching the mobile version of the game produced about 74% higher daily revenue than the model would predict without the launch, holding the previous day's revenue fixed! According to our model, revenue on the day after launch is $25k higher.
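The arithmetic behind the 74% figure can be checked in a few lines of Python. This is only a sketch of the calculation; the variable names are our own, and the coefficient is the beta2 value reported above.

```python
import math

beta2 = 0.553  # coefficient on the launch dummy, as reported above

multiplier = math.exp(beta2 * 1)       # D = 1 once the launch has happened
percent_lift = (multiplier - 1) * 100  # lift relative to D = 0

print(f"{multiplier:.3f}x, {percent_lift:.1f}% lift")  # 1.738x, 73.8% lift
```

The multiplier applies to predicted revenue, not to log revenue, which is why the coefficient is exponentiated before it is read as a percentage.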