<h2 class="wp-block-heading">Reusing Calculations</h2>



Reusing parts of computations has long been a wart in SQL. Let’s say you want to compute confidence intervals for your signup rate. You’ll have to use the standard error calculation twice: once for the upper bound, and once for the lower bound.



Your options are to just do the math twice, or to put the calculation in an ugly — and possibly <a href="https://www.sisense.com/blog/different-types-of-databases-for-modern-data-challenges/">inefficient</a> — subquery.



This is where lateral joins shine. Postgres’s <a href="https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.3#LATERAL_JOIN" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">latest join type</a> solves this problem by letting you make the calculation in one join, and then reference the results of that calculation in subsequent joins.



<h2 class="wp-block-heading">Computed Columns with Duplication or Subqueries</h2>



Let’s start with our input data, a rollup table of signups by month:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Month-Signups-tables.png" alt="Signups by month" class="wp-image-75765"/></figure>



Now let’s compute the confidence intervals around our signup rate each month. The math is:



<pre class="wp-block-code"><code>n = number of users
x = number of signups
p = probability of signup = (x / n)
se = standard error of p = sqrt((p * (1 - p)) / n)
confidence interval = p ± (1.96 * se)</code></pre>



If we just recalculate the signup rate and standard error each time, the naive SQL is:



<pre class="wp-block-code"><code>select
 month,
 signups, 
 visitors,
 (signups / visitors) as avg,
 (signups / visitors) 
 - (1.94 * sqrt((signups / visitors) 
 * (1 - (signups / visitors)) / visitors)) as low,
 (signups / visitors) 
 + (1.94 * sqrt((signups / visitors) 
 * (1 - (signups / visitors)) / visitors)) as high
from signups_by_month</code></pre>



What a mess! This is way too hard to read, and we’re obviously duplicating a bunch of work in the low and high calculations.



Nonetheless, the results are handy:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Month-Signups-table-2.png" alt="Monthly signups 2" class="wp-image-75770"/></figure>



We can see those same results visually:



<figure class="wp-block-image"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Monthly-signup-rate.png" alt="" class="wp-image-75775"/></figure>



Let’s pick apart the issues in our naive SQL. We calculate the probability five times and the standard error twice. That’s way too much duplication for a simple calculation.



We can remove needless duplication by adding subqueries and having each layer build upon previous calculations:



<pre class="wp-block-code"><code>select 
 signups, 
 visitors, 
 p as avg,
 p - 1.94 * se as low, 
 p + 1.94 * se as high
from (
 select 
 *, 
 sqrt(p * (1 - p) / visitors) as se
 from (
 select 
 *, 
 signups / visitors as p
 from signups_by_month
 ) probability
) standard_error</code></pre>



This query is probably more efficient because it only does each stage of the calculation once. But despite removing all the duplication, it’s arguably even harder to read.



Until now, these were our only two options for a calculation like this. But as of Postgres 9.3, there’s a better way!



<h2 class="wp-block-heading">Computed Columns with Lateral Joins</h2>



A <a href="https://www.postgresql.org/docs/9.4/queries-table-expressions.html#QUERIES-LATERAL" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">lateral join</a> is a join that allows subqueries in the right-hand side of the join to reference columns defined in the left-hand side of the join.



For our confidence interval calculation, we’ll define probability first, then reference probability when defining standard error, and then reference standard error when defining lower and upper bounds.



Here’s the new query:



<pre class="wp-block-code"><code>select 
 month, 
 signups, 
 visitors, 
 p as avg, 
 low, 
 high
from signups_by_month,
lateral (select signups / visitors as p) probability,
lateral (select sqrt(p * (1 - p) / visitors) as se) std_error,
lateral (select p - 1.94 * se as low) lower_bound,
lateral (select p + 1.94 * se as high) upper_bound</code></pre>



Careful readers will notice we’re using a <a href="https://www.tutorialspoint.com/sql/sql-cartesian-joins.htm" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">cartesian join</a>. This is a join of the form from a, b, c, which is shorthand for from a join b on true join c on true. It creates a row for every possible combination of rows in the joined table. In our case, where each joined table computes a single value, this just has the effect of appending calculations and then reusing them for the next calculation.



The final version is almost as easy to read as a mathematical formula! We simply define a variable, use it in the next calculation, rinse, and repeat.



All in all, a much cleaner, easier query. Thanks, Postgres!

Reuse Calculations in the Same Query with Lateral Joins

LinkedIn

Twitter

GitHub

curve-image-unique-image-unique

curve

3-dark-2-image-unique-image-unique

3 DARK 2

Get the latest in analytics right in your inbox.

Article