Calculating Significance of A/B Tests in Redshift
“A/B test” and “statistically significant” have quickly become part of the standard business vocabulary as running tests has become more popular in growing businesses.
When reviewing an A/B test for significance, it’s typical to run a query and then plug the numbers into a significance calculator.
But with Amazon Redshift’s new user defined functions, we can calculate significance right inside our query!
Math for calculating significance
We’ll start with a walk through of the math, then we’ll dive into the actual implementation.
We assume you are familiar with the basics of A/B testing math, but if you are new to it or need a refresher, take a look at Amazon’s explainer.
There are many options for calculating significance, but here we’ll use the simplest and most common method — the normal approximation.
Normal approximation works by estimating the standard errors of the conversion rates as if they are from a normal distribution, instead of the actual binomial distribution.
We then need to know how many standard errors apart from each other the control and experiment conversion rates are. We can average the standard errors, and divide the difference in conversion rates by the average standard error to get a Z-score, which can be mapped to a final probability.
If you need more accuracy than the method above offers, and don’t mind increased complexity, there are many alternative methods for estimating confidence intervals of binomial distributions.
That’s all the math we need, now let’s implement it.
Code for calculating significance
While we could easily calculate Z-scores in SQL, mapping from a Z-score to a probability is not straightforward. Fortunately, Redshift’s user defined functions have access to many numeric libraries. SciPy can map from a Z-score to a probability, which is the last piece we need to write our function.
Here’s the user defined function:
create or replace function
stable as $$
from scipy.stats import norm
def standard_error(sample_size, successes):
p = float(successes) / sample_size
return ((p * (1 - p)) / sample_size) ** 0.5
def zscore(size_a, successes_a, size_b, successes_b):
p_a = float(successes_a) / size_a
p_b = float(successes_b) / size_b
se_a = standard_error(size_a, successes_a)
se_b = standard_error(size_b, successes_b)
numerator = (p_b - p_a)
denominator = (se_a ** 2 + se_b ** 2) ** 0.5
return numerator / denominator
exp_zscore = zscore(control_size, control_conversion,
$$ language plpythonu;
Once we run this create or replace function in a Redshift console, we can test it out. Imagine we have an experiment with 1000 users in the control with 100 conversions, and 1000 users in the treatment with 125 conversions.
We could measure the signifiance by calling: signifiance(1000, 100, 1000, 125). Here’s a full example:
'first_experiment' as name,
significance(1000, 100, 1000, 125)
'second_experiment' as name,
significance(500, 30, 500, 38)
Now we can easily calculate significance directly from our database.
With great power, comes great responsibility
Now that we could check significance of our A/B tests all the time, we need to be aware of the dangers of checking too much. The technical term is “repeated significance testing errors”.
If we repeatedly check for significance, we’ll increase the risk of checking when the experiment looks significant because of a random wobble. If we also launch the experiment as soon as it first looks significant, we’ll accidentally launch a lot of wobbles.
An explanation of this can be found in How Not To Run An A/B Test by Evan Miller.
We can protect ourselves from this risk by not launching an experiment as soon as it looks significant. If we choose how large of an experiment we need, then only check significance at the end of the experiment, we’ll be in the clear.
Go launch some experiments
Good luck on evaluating your A/B tests results. Happy shipping!