
How to Calculate Confidence Intervals in SQL

May 9, 2014

Statistical overconfidence: Dangerous and easy

Imagine you have a small online business. This month 200 users signed up on your website, and 10 of them bought your $800 service. Great! You’ve made $8k of income. How much should you expect to make this year?

The straightforward answer is $8k * 12 = $96k. But how confident should you be? Will your conversion rate always be so close to 5%? You could pad the estimate ±20% for safety, guessing at $77k to $115k. If $77k would cover all your expenses, should you feel secure?

This is a question of binomial probability. Using our favorite binomial confidence interval calculator, the 95% confidence interval for your conversion rate is about 2.5% to 9%.

With a confidence interval that wide, you should expect to make somewhere between $48k and $172k. Yikes! You could end up with half of your simple guess, and that’s if your business doesn’t change.
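
Plugging the interval's endpoints back into the same back-of-the-envelope arithmetic shows where that range comes from, as a quick sanity check:

-- 200 signups per month * conversion rate * $800 * 12 months
select
  200 * 0.025 * 800 * 12 as low_estimate,  -- 48,000
  200 * 0.090 * 800 * 12 as high_estimate; -- 172,800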

Automating statistics: Calculating confidence intervals in SQL

These confidence intervals are very informative, but turning to a calculator for every metric is tedious. If you’ve got hundreds of metrics across dozens of dashboards, it’s downright unsustainable.

Fortunately, the math for calculating confidence intervals is simple to implement:

The Normal Approximation Interval formula for binomial confidence intervals

n = number of users
x = number of conversions
p = probability of conversion = (x / n)
se = standard error of p = sqrt((p * (1 - p)) / n)
confidence interval = p ± (1.96 * se)

See Normal approximation interval on Wikipedia. Note that the 1.96 constant specifies a 95% interval on a two-tailed normal distribution.
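
To see the formula in action before wiring it into a query, here's a quick sanity check on the introduction's numbers (10 conversions out of 200 users); the nested subqueries mirror the structure we'll use below:

-- Normal approximation on the introduction's numbers: x = 10, n = 200
select
  p - 1.96 * se as low,  -- ~0.020
  p as mid,              -- 0.05
  p + 1.96 * se as high  -- ~0.080
from (
  select p, sqrt(p * (1 - p) / 200) as se
  from (select 10 / 200.0 as p) conversions
) rates;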

Implementing the formula in SQL

Let’s start with a table of the total number of users, and how many converted. Any data that represents a rate — conversions per user, server errors per request, etc. — will also work.

select 
  count(1) as n, 
  sum(case when converted then 1 else 0 end) as x
from users
group by date_trunc('month', created_at);
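
For reference, this query only assumes a users table with a created_at timestamp and a boolean converted flag. A minimal schema along those lines (hypothetical, since your column names and types may differ) would look like:

-- Hypothetical minimal schema for the users table assumed above
create table users (
  id         serial primary key,
  created_at timestamp not null,
  converted  boolean not null default false
);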

With our basic data in hand, we want to implement the above formula in SQL. To keep things clear, we wrap each step of the calculation separately:

  1. Calculate the conversion rate, p.
  2. Using p, calculate the standard error, se.
  3. Compute the low and high confidence intervals.
  4. Include the original p conversion rate as our mid estimate.
select 
  intervals.n as users, 
  intervals.x as conversions, 
  p - se * 1.96 as low, 
  intervals.p as mid, 
  p + se * 1.96 as high 
from (
  select 
    rates.*, 
    sqrt(p * (1 - p) / n) as se -- calculate se
  from (
    select 
      conversions.*, 
      x / n::float as p -- calculate p
    from (
      -- Our conversion rate table from above
      select 
        count(1) as n, 
        sum(case when converted then 1 else 0 end) as x
      from users
      group by date_trunc('month', created_at)
    ) conversions
  ) rates
) intervals;

You might be wondering why we’re seeing 8% on the high end, rather than the 9% mentioned in the introduction. We used the Adjusted Wald method in the introduction, which produces more accurate estimates for small amounts of data.

A refinement for little data: The Adjusted Wald method

The math explained above, though quite accurate with hundreds of users and a healthy conversion rate, becomes increasingly biased with less data or extremely high or low rates. A rule of thumb is to avoid using it with fewer than 5 conversions or 100 users.

One way to adjust for these shortcomings is to use a more robust binomial proportion confidence interval technique like the Adjusted Wald method. In short, it adds a bit of fuzziness to the estimated probability to smooth out the extremely high or low rates which are more common with few datapoints.

Given the z-score needed to reach a certain confidence level (1.96 for 95% confidence), add 0.5 * z^2 to the number of conversions, and z^2 to the number of users. This is roughly +2 and +4 for the 1.96 z-score for 95%. You can read the original journal paper for a deeper explanation.
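
Applied to the introduction's 10 conversions out of 200 users, this adjustment recovers the roughly 2.5% to 9% interval the calculator gave us. Here's a quick check, using the same nested shape as the full query below:

-- Adjusted Wald on the introduction's numbers: x = 10, n = 200
-- x' = 10 + 0.5 * 1.96^2 = 11.92, n' = 200 + 1.96^2 = 203.84
select
  p - 1.96 * se as low,  -- ~0.026
  p as mid,              -- ~0.058
  p + 1.96 * se as high  -- ~0.091
from (
  select p, sqrt(p * (1 - p) / 200) as se
  from (select (10 + 1.92) / (200 + 3.84) as p) conversions
) rates;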

select 
  intervals.n as users, 
  intervals.x as conversions, 
  p - se * 1.96 as low, 
  intervals.p as mid, 
  p + se * 1.96 as high 
from (
  select 
    rates.*, 
    sqrt(p * (1 - p) / n) as se -- calculate se
  from (
    select 
      conversions.*, 
      (x + 1.92) / (n + 3.84)::float as p -- calculate p
    from (
      -- Our conversion rate table from above
      select 
        count(1) as n, 
        sum(case when converted then 1 else 0 end) as x
      from users
      group by date_trunc('month', created_at)
    ) conversions
  ) rates
) intervals;

The important adjustment is here, where we add the constants to the numerator and denominator when calculating p:

(x + 1.92) / (n + 3.84)::float as p -- calculate p

This isn't a magical solution for too little data: If you have an expected 1% conversion rate and only 100 users, this adjustment will nearly triple the estimated conversion rate, giving you a confidence interval of 0% to 6%. More data is the answer. At 10 conversions and 1,000 users, the interval shrinks to 0.5% to 1.9%.
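
Both of those figures can be checked with the same query shape, swapping in the example counts (the values below are approximate):

-- Adjusted Wald at 1 conversion / 100 users and 10 conversions / 1,000 users
select
  x, n,
  p - 1.96 * se as low,  -- ~0 (just below zero) and ~0.005
  p as mid,              -- ~0.028 and ~0.012
  p + 1.96 * se as high  -- ~0.061 and ~0.019
from (
  select x, n, p, sqrt(p * (1 - p) / n) as se
  from (
    select x, n, (x + 1.92) / (n + 3.84) as p
    from (values (1, 100), (10, 1000)) as t(x, n)
  ) conversions
) rates;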

In general, the more data you have, the more helpful statistical approaches like these will be.

Who are we?

We’re Periscope, and we make a tool that makes data analysis on large SQL databases fast and easy. In Periscope, you could use our Snippets feature to implement this logic once, and apply it to any dataset.

If you have a database with many millions or billions of rows, and running hundreds of analyses is getting slow and cumbersome, we think you’ll really love it. Sign up on our homepage for a free demo. If you like it, we’ll set you up with a free 7-day trial the same day!

