“A/B test” and “statistically significant” have quickly become part of the standard business vocabulary as running tests has become more popular in growing businesses.



When reviewing an A/B test for significance, it’s typical to run a query and then plug the numbers into a significance calculator.



But with <a href="https://www.sisense.com/blog/redshift-user-defined-functions-python/">Amazon Redshift’s user defined functions</a>, we can calculate significance right inside our query!



<h2 class="wp-block-heading">Math for calculating significance</h2>



We’ll start with a walkthrough of the math, then dive into the actual implementation.



We assume you are familiar with the basics of A/B testing math, but if you are new to it or need a refresher, take a look at Amazon’s explainer.



There are many options for calculating significance, but here we’ll use the simplest and most common method: <a href="https://revisionmaths.com/advanced-level-maths-revision/statistics/normal-approximations" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">the normal approximation</a>.



Normal approximation works by estimating the standard errors of the conversion rates as if they are from a normal distribution, instead of the actual <a href="https://en.wikipedia.org/wiki/Binomial_distribution" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">binomial distribution</a>.



We then need to know how many standard errors apart the control and experiment conversion rates are. We combine the two standard errors by taking the square root of the sum of their squares, then divide the difference in conversion rates by this combined standard error to get a <a href="https://en.wikipedia.org/wiki/Standard_score" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">Z-score</a>, which can be mapped to a final probability.
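
The Z-score calculation can be sketched in a few lines of plain Python. This is a minimal illustration rather than the final implementation: it assumes the two standard errors are combined as the square root of the sum of their squares, and the experiment numbers (2,000 users per group, 200 versus 240 conversions) are hypothetical.

```python
import math


def standard_error(sample_size, successes):
    """Standard error of a conversion rate under the normal approximation."""
    p = successes / sample_size
    return math.sqrt(p * (1 - p) / sample_size)


def z_score(size_a, successes_a, size_b, successes_b):
    """Difference in conversion rates, measured in combined standard errors."""
    p_a = successes_a / size_a
    p_b = successes_b / size_b
    se_a = standard_error(size_a, successes_a)
    se_b = standard_error(size_b, successes_b)
    combined_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return (p_b - p_a) / combined_se


# Hypothetical experiment: 2000 users per group, 200 vs. 240 conversions.
z = z_score(2000, 200, 2000, 240)
print(round(z, 2))
```

With these made-up numbers the conversion rates are 10% and 12%, and the Z-score comes out at a little over 2 combined standard errors.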



If you need more accuracy than the method above offers, and don’t mind increased complexity, there are many alternative methods for <a href="https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">estimating confidence intervals of binomial distributions</a>.



That’s all the math we need. Now let’s implement it.



<h2 class="wp-block-heading">Code for calculating significance</h2>



While we could easily calculate Z-scores in SQL, mapping from a Z-score to a probability is not straightforward. Fortunately, Redshift’s user defined functions have access to many numeric libraries. <a href="https://www.scipy.org/" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">SciPy</a> can map from a Z-score to a probability, which is the last piece we need to write our function.



Here’s the user defined function:



<pre class="wp-block-code"><code>create or replace function
  significance(control_size integer,
               control_conversion integer,
               experiment_size integer,
               experiment_conversion integer)
  returns float
  stable as $$
    from scipy.stats import norm

    def standard_error(sample_size, successes):
        # Standard error of a conversion rate under the normal approximation
        p = float(successes) / sample_size
        return ((p * (1 - p)) / sample_size) ** 0.5

    def zscore(size_a, successes_a, size_b, successes_b):
        # Difference in conversion rates, measured in combined standard errors
        p_a = float(successes_a) / size_a
        p_b = float(successes_b) / size_b
        se_a = standard_error(size_a, successes_a)
        se_b = standard_error(size_b, successes_b)
        numerator = (p_b - p_a)
        denominator = (se_a ** 2 + se_b ** 2) ** 0.5
        return numerator / denominator

    def percentage_from_zscore(zscore):
        # One-tailed probability of a Z-score at least this extreme
        return norm.sf(abs(zscore))

    exp_zscore = zscore(control_size, control_conversion,
                        experiment_size, experiment_conversion)
    return percentage_from_zscore(exp_zscore)
$$ language plpythonu;</code></pre>



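This code implements the math from the Amazon article on A/B testing math. The norm.sf call is the survival function (one minus the cumulative distribution function) of the standard normal distribution in SciPy’s scipy.stats module. It returns the probability of seeing a Z-score at least as large as the one we pass in.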
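
To sanity-check the Z-score-to-probability step outside Redshift, the same survival function can be written with Python’s standard library. This is a sketch, not part of the function above: it relies on the identity that the standard normal survival function equals half the complementary error function of z divided by the square root of two. The helper names one_tailed_p and two_tailed_p are made up for illustration.

```python
import math


def one_tailed_p(z):
    """Upper-tail probability of a standard normal beyond |z|.

    Matches scipy.stats.norm.sf(abs(z)), using only the standard library.
    """
    return 0.5 * math.erfc(abs(z) / math.sqrt(2))


def two_tailed_p(z):
    """Probability of a deviation at least this large in either direction."""
    return 2 * one_tailed_p(z)


print(one_tailed_p(1.96))  # roughly 0.025
print(two_tailed_p(1.96))  # roughly 0.05
```

Because the function takes the absolute value of the Z-score, the probability it returns is one-tailed. Doubling it gives the two-tailed value if that is the convention you compare against.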



Once we run this create or replace function statement in a Redshift console, we can test it out. Imagine we have an experiment with 1,000 users in the control group with 100 conversions, and 1,000 users in the experiment group with 125 conversions.



We could measure the significance by calling: significance(1000, 100, 1000, 125). Here’s a full example:



<pre class="wp-block-code"><code>select 
 'first_experiment' as name, 
 significance(1000, 100, 1000, 125)
union 
select 
 'second_experiment' as name, 
 significance(500, 30, 500, 38)</code></pre>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-01-12.png" alt="Significance chart" class="wp-image-78355"/></figure>



Now we can easily calculate significance directly from our database.



<h2 class="wp-block-heading">With great power comes great responsibility</h2>



Now that we can check the significance of our A/B tests at any time, we need to be aware of the dangers of checking too often. The technical term is “repeated significance testing errors”.



If we repeatedly check for significance, we increase the risk of checking at a moment when the experiment looks significant only because of a random wobble. If we also launch the experiment as soon as it first looks significant, we’ll accidentally launch a lot of wobbles.
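
A small simulation makes the effect visible. The sketch below runs hypothetical A/A tests, where both groups share the same true conversion rate and any “significant” result is therefore a false positive. It compares peeking every 100 users against a single check at the end. The trial count, group size, conversion rate, and the 0.025 threshold on the one-tailed probability are all assumptions chosen for illustration.

```python
import math
import random


def p_value(n_a, conv_a, n_b, conv_b):
    """One-tailed probability from the normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    variance = p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b
    if variance == 0:
        return 0.5  # degenerate case (e.g. no conversions yet): no evidence
    z = (p_b - p_a) / math.sqrt(variance)
    return 0.5 * math.erfc(abs(z) / math.sqrt(2))


def simulate(trials=300, users_per_arm=2000, check_every=100,
             rate=0.1, alpha=0.025, seed=0):
    """Return (false-positive rate when peeking, rate when checking once)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        looked_significant = False
        for n in range(1, users_per_arm + 1):
            # Both groups convert at the same true rate (an A/A test).
            conv_a += int(rng.random() < rate)
            conv_b += int(rng.random() < rate)
            if n % check_every == 0 and p_value(n, conv_a, n, conv_b) < alpha:
                looked_significant = True
        peeking_hits += int(looked_significant)
        final_hits += int(
            p_value(users_per_arm, conv_a, users_per_arm, conv_b) < alpha
        )
    return peeking_hits / trials, final_hits / trials


peeking_rate, final_rate = simulate()
print("false positives when peeking:", peeking_rate)
print("false positives with one final check:", final_rate)
```

Because the final check is also one of the peeks, the peeking rate can never be lower than the final-only rate. In runs like this one it is typically several times higher.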



An explanation of this can be found in <a href="https://www.evanmiller.org/how-not-to-run-an-ab-test.html" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">How Not To Run An A/B Test</a> by Evan Miller.



We can protect ourselves from this risk by not launching an experiment as soon as it looks significant. If we decide up front how large the experiment needs to be, and only check significance once it has reached that size, we’ll be in the clear.
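
One way to pick the experiment size in advance is a standard sample-size formula for comparing two proportions. The sketch below is an assumption-laden illustration, not a method prescribed by this post: it assumes a two-sided 5% significance level, 80% power, and a hypothetical baseline rate of 10% that we hope to lift to 12.5%. The function name users_per_arm is made up.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate users needed in each group to detect the given lift."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical goal: detect a lift from 10% to 12.5% conversion.
print(users_per_arm(0.10, 0.125))
```

Smaller expected lifts require much larger groups, since the required size grows with the inverse square of the difference in rates.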



<h2 class="wp-block-heading">Go launch some experiments</h2>



Good luck evaluating your A/B test results. Happy shipping!



Sisense for Cloud Data Teams and <a href="https://www.sisense.com/get/sisense-and-aws/">Amazon Web Services</a> combine to provide the fastest and easiest way to deliver scalable, high-performance, and secure cloud analytics.

Calculating Significance of A/B Tests in Redshift
