Want to make data analysis fast for everyone?Join Us!
Reusing parts of computations has long been a wart in SQL. Let’s say you want to compute confidence intervals for your signup rate. You’ll have to use the standard error calculation twice: once for the upper bound, and once for the lower bound.
Your options are to just do the math twice, or to put the calculation in an ugly — and possibly inefficient — subquery.
This is where lateral joins shine. Postgres’s latest join type solves this problem by letting you make the calculation in one join, and then reference the results of that calculation in subsequent joins.
Computed Columns with Duplication or Subqueries
Let’s start with our input data, a roll-up table of signups by month:
Now let’s compute the confidence intervals around our signup rate each month. The math is:
n = number of users x = number of signups p = probability of signup = (x / n) se = standard error of p = sqrt((p * (1 - p)) / n) confidence interval = p ± (1.96 * se)
If we just recalculate the signup rate and standard error each time, the naive SQL is:
What a mess! This is way too hard to read, and we’re obviously duplicating a bunch of work in the
Nonetheless, the results are handy:
In Periscope, we can see those same results visually:
Let’s pick apart the issues in our naive SQL. We calculate the probability five times and the standard error twice. That’s way too much duplication for a simple calculation.
We can remove needless duplication by adding subqueries and having each layer build upon previous calculations:
This query is probably more efficient because it only does each stage of the calculation once. But despite removing all the duplication, it’s arguably even harder to read.
Until now, these were our only two options for a calculation like this. But as of Postgres 9.3, there’s a better way!
Computed Columns with Lateral Joins
A lateral join is a join that allows subqueries in the right-hand-side of the join to reference columns defined in the left-hand-side of the join.
For our confidence interval calculation, we’ll define probability first, then reference probability when defining standard error, and then reference standard error when defining lower and upper bounds.
Here’s the new query:
Careful readers will notice we’re using a cartesian join. This is a join of the form
from a, b, c, which is shorthand for
from a join b on true join c on true. It creates a row for every possible combination of rows in the joined table. In our case, where each joined table computes a single value, this just has the effect of appending calculations and then reusing them for the next calculation.
The final version is almost as easy to read as a mathematical formula! We simply define a variable, use it in the next calculation, rinse and repeat.
All in all, a much cleaner, easier query. Thanks Postgres!