Periscope Data
The world’s fastest visualization platform for data analysts.

The High-Performance SQL Blog

Beyond Random() — Normal Distributions in SQL

September 11, 2014

Leaving random() Behind

If you’ve ever had to generate sample data in your database, odds are you’ve started with random() or its equivalent. It’s fast and it’s reliable. Unfortunately, it’s also evenly distributed. You’re just as likely to get 0.5 as 0.9. Most real-world data is not like this.

After an even distribution, your most basic distribution of random numbers is a normal distribution. In this post, we’ll show you how to use the Marsaglia Polar Method to convert random()’s uniform distribution into a normal distribution that looks like this:

Two Random Numbers Are Better Than One

The Marsaglia method converts a pair of uniformly distributed random numbers into a pair of normally distributed random numbers.

First we need a list of random numbers. We’ll use generate_series for brevity. If you’re on Redshift or MySQL, check out the alternatives to generate_series. The Marsaglia method requires numbers between -1 and 1. We’ll generate 100,000 of them:

with
numbers as (
  select
    2 * random() - 1 as x,
    2 * random() - 1 as y
  from generate_series(0, 100000)
),

The Marsaglia method is a type of rejection sampling, which is a fancy way of saying that it throws away inputs that don’t fit the distribution.

Since it’s based on polar coordinates, the sum of the squares of the numbers needs to be less than one. This ends up filtering out about 𝜋/4 (21%) of the data, so make you you generate enough numbers at the outset.

polar_points as (
  select 
    x, 
    y, 
    x*x + y*y as s
  from numbers 
  where x*x + y*y < 1
),

Now we know we have the right set of inputs, so we can turn them into normally distributed pairs. We’ll compute the new pairs separately, and then union the lists together to get one long list of normally distributed numbers.

The formulas are straightforward:

Translating these to SQL, we get:

marsaglia as (
  select 
    x * sqrt((-2 * ln(s))/s) as
  from polar_points
  union select 
    y * sqrt((-2 * ln(s))/s) as
  from polar_points
)

 

Bringing It All Together

To make a pretty graph of your distribution like the one above, combine the previous steps and then bucket and count the numbers (by rounding them) like so:

with
numbers as (
  select
    2 * random() - 1 as x,
    2 * random() - 1 as y
  from generate_series(0, 100000)
),
polar_points as (
  select 
    x, 
    y, 
    x*x + y*y as s
  from numbers 
  where x*x + y*y < 1
),
marsaglia as (
  select 
    x * sqrt((-2 * ln(s))/s) as
  from polar_points
  union select y * sqrt((-2 * ln(s))/s) as
  from polar_points
)
select 
  round(n::numeric, 1), 
  count(1
from marsaglia group by 1 order by 1

Now your randomly generated test data can look a lot more normal!


Want to discuss this article? Join the Periscope Data Community!
Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?

Subscribe to our Newsletter