Periscope Data
The world’s fastest visualization platform for data analysts.

Calculating Significance of A/B Tests in Redshift

October 15, 2015

“A/B test” and “statistically significant” have quickly become part of the standard business vocabulary as running tests has become more popular in growing businesses.

When reviewing an A/B test for significance, it’s typical to run a query and then plug the numbers into a significance calculator.

But with Amazon Redshift’s new user defined functions, we can calculate significance right inside our query!

Math for calculating significance

We’ll start with a walk through of the math, then we’ll dive into the actual implementation.

We assume you are familiar with the basics of A/B testing math, but if you are new to it or need a refresher, take a look at Amazon’s explainer.

There are many options for calculating significance, but here we’ll use the simplest and most common method — the normal approximation.

Normal approximation works by estimating the standard errors of the conversion rates as if they are from a normal distribution, instead of the actual binomial distribution.

We then need to know how many standard errors apart from each other the control and experiment conversion rates are. We can average the standard errors, and divide the difference in conversion rates by the average standard error to get a Z-score, which can be mapped to a final probability.

If you need more accuracy than the method above offers, and don’t mind increased complexity, there are many alternative methods for estimating confidence intervals of binomial distributions.

That’s all the math we need, now let’s implement it.

Code for calculating significance

While we could easily calculate Z-scores in SQL, mapping from a Z-score to a probability is not straightforward. Fortunately, Redshift’s user defined functions have access to many numeric libraries. SciPy can map from a Z-score to a probability, which is the last piece we need to write our function.

Here’s the user defined function:

create or replace function 
  significance(control_size integer, 
               control_conversion integer, 
               experiment_size integer, 
               experiment_conversion integer)
  returns float
  stable as $$
    from scipy.stats import norm
    def standard_error(sample_size, successes):
      p = float(successes) / sample_size
      return ((p * (1 - p)) / sample_size) ** 0.5
    def zscore(size_a, successes_a, size_b, successes_b):
      p_a = float(successes_a) / size_a
      p_b = float(successes_b) / size_b
      se_a = standard_error(size_a, successes_a)
      se_b = standard_error(size_b, successes_b)
      numerator = (p_b - p_a)
      denominator = (se_a ** 2 + se_b ** 2) ** 0.5
      return numerator / denominator
    def percentage_from_zscore(zscore):
      return norm.sf(abs(zscore))
    exp_zscore = zscore(control_size, control_conversion, 
                        experiment_size, experiment_conversion)
    return percentage_from_zscore(exp_zscore)
  $$ language plpythonu;

This code implements the math from the Amazon article on A/B testing math. The call to scipy.stats.norm.sf lib is from SciPy’s norm library.

Once we run this create or replace function in a Redshift console, we can test it out. Imagine we have an experiment with 1000 users in the control with 100 conversions, and 1000 users in the treatment with 125 conversions.

We could measure the signifiance by calling: signifiance(1000, 100, 1000, 125). Here’s a full example:

  'first_experiment' as name, 
  significance(1000, 100, 1000, 125)
  'second_experiment' as name, 
  significance(500, 30, 500, 38)

Now we can easily calculate significance directly from our database.

With great power, comes great responsibility

Now that we could check significance of our A/B tests all the time, we need to be aware of the dangers of checking too much. The technical term is “repeated significance testing errors”.

If we repeatedly check for significance, we’ll increase the risk of checking when the experiment looks significant because of a random wobble. If we also launch the experiment as soon as it first looks significant, we’ll accidentally launch a lot of wobbles.

An explanation of this can be found in How Not To Run An A/B Test by Evan Miller.

We can protect ourselves from this risk by not launching an experiment as soon as it looks significant. If we choose how large of an experiment we need, then only check significance at the end of the experiment, we’ll be in the clear.

Go launch some experiments

Good luck on evaluating your A/B tests results. Happy shipping!

Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?