###### Common Statistical Operations

# HyperLogLog in Pure SQL

Earlier we showed how to make count distinct 50x faster with subqueries. Using probabilistic counting we’ll make count distinct even faster, trading a little accuracy for the increase in speed.

We’ll optimize a very simple query, which calculates the daily distinct sessions for 5,000,000 gameplays (~150,000/day):

selectdate(created_at),count(distinctsession_id)

fromgameplays

The original query takes 162.2s. The **HyperLogLog version is 5.1x faster** (31.5s) with a 3.7% error, and uses a small fraction of the RAM.

### Why HyperLogLog?

Databases often implement * count(distinct)* in two ways: When there are few distinct elements, the database makes a hashset in RAM and then counts the keys. When there there are too many elements to fit in RAM, the database writes them to disk, sorts the file, and then counts the number of element groups. The second case — writing the intermediate data to disk — is

*very*slow. Probabilistic counters are designed to use as little RAM as possible, making them ideal for large data sets that would otherwise page to disk.

The HyperLogLog Probabilistic Counter is an algorithm for determining the approximate number of distinct elements in a set using minimal RAM. Distincting a set that has 10 million unique 100-character strings can take over gigabyte of RAM using a hash table, while HyperLogLog uses less than a megabyte (the “log log” in HyperLogLog refers to its space efficiency). Since the probabilistic counter can stay entirely in RAM during the process, it’s much faster than any alternative that has to write to disk and usually faster than alternatives using a lot more RAM.

### Hashing

The core of the HyperLogLog algorithm relies on one simple property of uniform hashes: The probability of the position of the leftmost set bit in a random hash is 1/2n, where n is the position. We call the position of the leftmost set bit the most significant bit, or MSB.

Here are some hash patterns and the positions of their MSBs:

We'll use the MSB position soon, so here it is in SQL:

select

31-floor(log(2, hashtext(session_id)&~(1<<31))))

asbucket_hash

fromgameplays

* **`Hashtext`** is an undocumented hashing function in postgres. It hashes strings to 32-bit numbers. We could use `md5` and convert it from a hex string to an integer, but this is faster. * We use **`~(1 << 31)`** to clear the leftmost bit of the hashed number. Postgres uses that bit to determine if the number is positive or negative, and we only want to deal with positive numbers when taking the logarithm. * The **`floor(log(2,...))`** does the heavy lifting: The integer part of base-2 logarithm tells us the position (from the right) of the MSB. Subtracting that from 31 gives us the position of the MSB from the left, starting at 1. With that line we've got our MSB per-hash of the `session_id` field! ### Bucketing The maximum MSB for our elements is capable of crudely estimating the number of distinct elements in the set. If the maximum MSB we've seen is 3, given the probabilities above we'd expect around 8 (i.e. 23) distinct elements in our set. Of course, this is a terrible estimate to make as there are many ways to skew the data. The HyperLogLog algorithm divides the data into evenly-sized buckets and takes the harmonic mean of the maximum MSBs of those buckets. The harmonic mean is better here since it discounts outliers, reducing the bias in our count. Using more buckets reduces the error in the distinct count calculation, at the expense of time and space. The function for determining the number of buckets needed given a desired error is:

We'll aim for a +/- 5% error, so plugging in 0.05 for the error rate gives us 512 buckets. Here's the SQL for grouping MSBs by date and bucket:

select

date(created_at)ascreated_date,

hashtext(session_id)&(512-1)asbucket_num,

31-floor(log(2,min(hashtext(session_id)&~(1<<31))))

asbucket_hash

fromsessions

groupby1,2orderby1,2

* The **`hashtext(...) & (512 - 1)`** gives us the rightmost 9 bits , 511 in binary is 111111111), and we're using that for the bucket number. * The **`bucket_hash`** line uses a min inside the logarithm instead of something like this **`max(31 - floor(log(...)))`** so that we can compute the logarithm once - greatly speeding up the calculation. Now we've got 512 rows for each date - one for each bucket - and the maximum MSB for the hashes that fell into that bucket. In future examples we'll call this select **`bucketed_data`**.

### Counting

It's time to put together the buckets and the MSBs. The paper linked above has a lengthy discussion on the derivation of this function, so we'll only recreate the result here. The new variables are **`m`** (the number of buckets, 512 in our case) and **`M`** (the list of buckets indexed by j, the rows of SQL in our case). The denominator of of this equation is the harmonic mean mentioned earlier:

In SQL, it looks like this:

select

created_date,

((pow(512,2)*(0.7213/(1+1.079/512)))/

((512-count(1))+sum(pow(2,-1*bucket_hash))))::int

asnum_uniques,

512-count(1)asnum_zero_buckets

frombucketed_data

groupby1orderby1

* We add in **`(512 - count(1))`** to account for missing rows. If no hashes fell into a bucket it won't be present in the SQL, but by adding 1 per missing row to the result of the sum we achieve the same effect. * The **`num_zero_buckets`** is pulled out for the next step where we account for sparse data. Almost there! We have distinct counts that will be right most of the time - now we need to correct for the extremes. In future examples we'll call this select **`counted_data`**. ### Correcting The results above work great when most of the buckets have data. When a lot of the buckets are zeros (missing rows), then the counts get a heavy bias. To correct for that we apply the formula below only when the estimate is likely biased, with this equation:

And the SQL for that looks like this:

select

counted_data.created_date,

casewhennum_uniques<2.5*512andnum_zero_buckets>0then

((0.7213/(1+1.079/512))*(512*

log(2, (512::numeric)/num_zero_buckets)))::int

elsenum_uniquesendasapprox_distinct_count

fromcounted_data

orderby1

Now putting it all together:

select

counted_data.created_date,

case

whennum_uniques<2.5*512andnum_zero_buckets>0then

((0.7213/(1+1.079/512))*(512*

log(2, (512::numeric)/num_zero_buckets)))::int

elsenum_uniquesendasapprox_distinct_count

from(

select

created_date,

((pow(512,2)*(0.7213/(1+1.079/512)))/

((512-count(1))+sum(pow(2,-1*bucket_hash))))::int

asnum_uniques,

512-count(1)asnum_zero_buckets

from(

select

date(created_at)ascreated_date,

hashtext(session_id)&(512-1)asbucket_num,

31-floor(log(2,min(hashtext(session_id)&~(1<<31))))

asbucket_hash

fromgameplays

groupby1,2

)asbucketed_data

groupby1orderby1

)ascounted_dataorderby1

And that's the HyperLogLog probabilistic counter in pure SQL! ### Bonus: Parallelizing The HyperLogLog algorithm really shines when you're in an environment where you can count distinct in parallel. The results of the **`bucketed_data`** step can be combined from multiple nodes into one superset of data, greatly reducing the cross-node overhead usually required when counting distinct elements across a cluster of nodes. You can also preserve the results of the **`bucketed_data`** step for later, making it nearly free to update the distinct count of a set on the fly!

Hash MSB Position Hashes like this

1xxxxx 1 50%

01xxxx 2 25%

001xxx 3 12.5%

0001xx 4 6.25%