<p>Earlier, we showed how to make <a href="https://www.sisense.com/blog/use-subqueries-to-count-distinct-50x-faster/">count distinct 50x faster with subqueries</a>. Using probabilistic counting, we’ll make count distinct even faster, trading a little accuracy for the increase in speed.</p>



<p>We’ll optimize a very simple query, which calculates the daily distinct sessions for 5,000,000 gameplays (~150,000/day):</p>



<pre class="wp-block-code"><code>select date(created_at), count(distinct session_id)
from gameplays
group by 1</code></pre>



<p>The original query takes 162.2s. The HyperLogLog version is 5.1x faster (31.5s) with a 3.7% error and uses a small fraction of the RAM.</p>



<h2 class="wp-block-heading">Why HyperLogLog?</h2>



<p>Databases often implement <code>count(distinct)</code> in two ways: When there are a few distinct elements, the database makes a HashSet in RAM and then counts the keys. When there are too many elements to fit in RAM, the database writes them to disk, sorts the file, and then counts the number of element groups. The second case — writing the intermediate data to disk — is very slow. Probabilistic counters are designed to use as little RAM as possible, making them ideal for large data sets that would otherwise page to disk.</p>
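<p>As a rough illustration of those two strategies, here is a minimal Python sketch. It is not how any particular database implements them, and the function names are hypothetical. The first function keeps every distinct value in a hash set; the second sorts the values and counts where the value changes, which is the approach that spills to disk when the data doesn&#8217;t fit in RAM:</p>

```python
def count_distinct_hash(values):
    """Hash-set strategy: memory grows with the number of distinct values."""
    return len(set(values))


def count_distinct_sorted(values):
    """Sort-based strategy: sort, then count the points where the value changes."""
    count = 0
    previous = object()  # sentinel that is not equal to any real value
    for value in sorted(values):
        if value != previous:
            count += 1
            previous = value
    return count


session_ids = ["a", "b", "a", "c", "b", "a"]
print(count_distinct_hash(session_ids))    # 3
print(count_distinct_sorted(session_ids))  # 3
```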



<p>The <a href="http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">HyperLogLog Probabilistic Counter</a> is an algorithm for determining the approximate number of distinct elements in a set using minimal RAM. Counting the distinct elements of a set that has 10 million unique 100-character strings can take over a gigabyte of RAM using a hash table, while HyperLogLog uses less than a megabyte (the “log log” in HyperLogLog refers to its space efficiency). Since the probabilistic counter can stay entirely in RAM during the process, it’s much faster than any alternative that has to write to disk and usually faster than alternatives using a lot more RAM.</p>



<h2 class="wp-block-heading">Hashing</h2>



<p>The core of the HyperLogLog algorithm relies on one simple property of uniform hashes: the probability that the leftmost set bit of a random hash is at position n is 1/2<sup>n</sup>. We call the leftmost set bit the most significant bit, or MSB, and we&#8217;ll mostly care about its position.</p>



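<p>For example, hashes of the form <code>1xxxxx</code> have their MSB at position 1 and make up 50% of all hashes. Hashes like <code>01xxxx</code> have it at position 2 (25%), <code>001xxx</code> at position 3 (12.5%), and <code>0001xx</code> at position 4 (6.25%).</p>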



<p>We&#8217;ll use the MSB position soon, so here it is in SQL:</p>



<pre class="wp-block-code"><code>select
  31 - floor(log(2, hashtext(session_id) &amp; ~(1 &lt;&lt; 31)))
   as bucket_hash
from gameplays</code></pre>



<p><code>hashtext</code> is an undocumented hashing function in Postgres that hashes strings to 32-bit numbers. We could use <code>md5</code> and convert it from a hex string to an integer, but this is faster. We use <code>~(1 &lt;&lt; 31)</code> to clear the leftmost bit of the hashed number. Postgres uses that bit to determine if the number is positive or negative, and we only want to deal with positive numbers when taking the logarithm. The <code>floor(log(2,...))</code> does the heavy lifting: the integer part of the base-2 logarithm tells us the position (from the right) of the MSB. Subtracting that from 31 gives us the position of the MSB from the left, starting at 1. With that line, we&#8217;ve got the MSB position for each hash of the <code>session_id</code> field!</p>
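<p>To check the arithmetic outside the database, here is a minimal Python sketch of the same calculation. <code>msb_position</code> is a hypothetical helper, and it uses <code>bit_length()</code> in place of <code>floor(log(2, ...))</code>, which gives the same result for positive integers without floating-point rounding:</p>

```python
def msb_position(hashed: int) -> int:
    """Position of the leftmost set bit, counted from the left, starting at 1."""
    # Keep the low 31 bits: same effect as & ~(1 << 31) on a 32-bit integer.
    positive = hashed & 0x7FFFFFFF
    # bit_length() - 1 equals floor(log2(x)) for positive x, so this is
    # 31 - floor(log2(x)). A value of 0, where the SQL logarithm would fail,
    # yields 32 here.
    return 31 - (positive.bit_length() - 1)


print(msb_position(1 << 30))            # 1
print(msb_position(1 << 29))            # 2
print(msb_position((1 << 28) + 12345))  # 3
print(msb_position(1))                  # 31
print(msb_position(-1))                 # 1 (sign bit cleared, all other bits set)
```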



<h2 class="wp-block-heading">Bucketing </h2>



<p>The maximum MSB position across our elements can crudely estimate the number of distinct elements in the set. If the maximum MSB we&#8217;ve seen is 3, given the probabilities above we&#8217;d expect around 8 (i.e. 2<sup>3</sup>) distinct elements in our set. Of course, this is a terrible estimate to make, as there are many ways to skew the data. The HyperLogLog algorithm instead divides the data into evenly-sized buckets and takes the harmonic mean of the maximum MSBs of those buckets.</p>
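<p>To see how fragile the single-maximum estimate is, consider this small Python sketch. The MSB positions are made up for illustration and <code>crude_estimate</code> is a hypothetical helper. One unlucky hash is enough to change the estimate by orders of magnitude:</p>

```python
def crude_estimate(msb_positions):
    """Estimate the distinct count as 2 to the power of the largest MSB position."""
    return 2 ** max(msb_positions)


typical = [1, 2, 1, 3, 1, 2, 1, 1]       # illustrative MSB positions for 8 hashes
print(crude_estimate(typical))           # 8
print(crude_estimate(typical + [12]))    # 4096, caused by a single outlier
```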



<p>The harmonic mean is better here since it discounts outliers, reducing the bias in our count. Using more buckets reduces the error in the distinct count calculation, at the expense of time and space. The function for determining the number of buckets needed given the desired error is:</p>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-01-18.png" alt="bucket function" class="wp-image-78724"/></figure>
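<p>A minimal Python sketch of this relationship, assuming the paper&#8217;s standard error of roughly 1.04 / sqrt(m) for m buckets and rounding up to the next power of two so that the bucket number can be taken with a bit mask (<code>buckets_for_error</code> is a hypothetical helper name):</p>

```python
import math


def buckets_for_error(error: float) -> int:
    """Smallest power-of-two bucket count whose expected error is at most `error`."""
    needed = (1.04 / error) ** 2  # solve error = 1.04 / sqrt(m) for m
    return 2 ** math.ceil(math.log2(needed))


print(buckets_for_error(0.05))  # 512
print(buckets_for_error(0.01))  # 16384
```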



<p>We&#8217;ll aim for a +/- 5% error, so plugging in 0.05 for the error rate gives us 512 buckets. Here&#8217;s the SQL for grouping MSBs by date and bucket:</p>



<pre class="wp-block-code"><code>select 
  date(created_at) as created_date,
  hashtext(session_id) &amp; (512 - 1) as bucket_num,
  31 - floor(log(2, min(hashtext(session_id) &amp; ~(1 &lt;&lt; 31))))
   as bucket_hash
from gameplays
group by 1, 2 order by 1, 2</code></pre>



<p>The <code>hashtext(...) &amp; (512 - 1)</code> gives us the rightmost 9 bits (511 in binary is 111111111), and we&#8217;re using that for the bucket number. The <code>bucket_hash</code> line uses a <code>min</code> inside the logarithm instead of something like <code>max(31 - floor(log(...)))</code> so that we compute the logarithm once per bucket rather than once per row, greatly speeding up the calculation. Now we&#8217;ve got up to 512 rows for each date, one for each bucket, with the maximum MSB for the hashes that fell into that bucket. In future examples we&#8217;ll call this select <code>bucketed_data</code>.</p>
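<p>The grouping can be mirrored in Python to check the logic on a small sample. This sketch assumes md5 as a stand-in for <code>hashtext</code>, so the bucket contents will differ from what Postgres produces. <code>hash32</code> and <code>bucketize</code> are hypothetical helpers, and the session ids are synthetic:</p>

```python
import hashlib

NUM_BUCKETS = 512


def hash32(value: str) -> int:
    """Assumed stand-in for hashtext: the first 4 bytes of an md5 digest."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")


def bucketize(session_ids) -> dict:
    """Map bucket number -> maximum MSB position among the hashes in that bucket."""
    smallest = {}
    for session_id in session_ids:
        h = hash32(session_id)
        bucket = h & (NUM_BUCKETS - 1)  # rightmost 9 bits
        positive = h & 0x7FFFFFFF       # clear the top bit
        # Track the minimum hash per bucket, like min(...) in the SQL.
        smallest[bucket] = min(positive, smallest.get(bucket, positive))
    # The smallest hash has the leftmost-set bit furthest to the right, so one
    # conversion per bucket gives the maximum MSB position.
    # 32 - bit_length() equals 31 - floor(log2(x)) for positive x.
    return {bucket: 32 - value.bit_length() for bucket, value in smallest.items()}


session_ids = [f"session-{i}" for i in range(10_000)]
buckets = bucketize(session_ids)
print(len(buckets), "buckets filled, largest MSB position:", max(buckets.values()))
```

<p>Feeding the same ids in twice produces identical buckets, which is the property that makes this a distinct count rather than a row count.</p>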



<h2 class="wp-block-heading">Counting</h2>



<p>It&#8217;s time to put together the buckets and the MSBs. The paper linked above has a lengthy discussion on the derivation of this function, so we&#8217;ll only recreate the result here. The new variables are m (the number of buckets, 512 in our case) and M (the list of buckets indexed by j, the rows of SQL in our case). The sum in the denominator of this equation is what produces the harmonic mean mentioned earlier:</p>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-02-19.png" alt="Harmonic mean" class="wp-image-78730"/></figure>



<p>In SQL, it looks like this:</p>



<pre class="wp-block-code"><code>select
  created_date,
  ((pow(512, 2) * (0.7213 / (1 + 1.079 / 512))) / 
  ((512 - count(1)) + sum(pow(2, -1 * bucket_hash))))::int 
    as num_uniques,
  512 - count(1) as num_zero_buckets
from bucketed_data
group by 1 order by 1</code></pre>



<p>We add in <code>(512 - count(1))</code> to account for missing rows. If no hashes fell into a bucket, it won&#8217;t have a row in <code>bucketed_data</code>. An empty bucket has an MSB of 0 and so contributes 2<sup>0</sup> = 1 to the sum, and adding 1 per missing row achieves the same effect. The <code>num_zero_buckets</code> is pulled out for the next step, where we account for sparse data.</p>



<p>Almost there! We have distinct counts that will be right most of the time &#8211; now we need to correct for the extremes. In future examples, we&#8217;ll call this select counted_data. </p>
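<p>The counting step can be sketched in Python as follows, using the same constants as the SQL. <code>raw_estimate</code> is a hypothetical helper that takes a dictionary of bucket number to maximum MSB position, with empty buckets simply absent, just as they are absent from the SQL rows:</p>

```python
NUM_BUCKETS = 512
ALPHA = 0.7213 / (1 + 1.079 / NUM_BUCKETS)  # bias constant used in the SQL


def raw_estimate(buckets: dict) -> float:
    """Uncorrected HyperLogLog estimate from {bucket number: max MSB position}."""
    num_zero_buckets = NUM_BUCKETS - len(buckets)
    # Each missing bucket counts as 2**0 == 1, like (512 - count(1)) in the SQL.
    total = num_zero_buckets + sum(2.0 ** -msb for msb in buckets.values())
    return ALPHA * NUM_BUCKETS ** 2 / total


every_bucket_at_8 = {bucket: 8 for bucket in range(NUM_BUCKETS)}
print(round(raw_estimate(every_bucket_at_8)))  # ALPHA * 512 * 2**8
print(round(raw_estimate({})))                 # 369, even though nothing was counted
```

<p>The second call shows the bias at the sparse extreme: with no data at all, the raw formula still reports a few hundred elements.</p>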



<h2 class="wp-block-heading">Correcting </h2>



<p>The results above work great when most of the buckets have data. When a lot of the buckets are zeros (missing rows), the counts get a heavy bias. To correct for that, we apply the formula below only when the estimate is likely biased:</p>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-03-14.png" alt="Correcting formula" class="wp-image-78736"/></figure>



<p> And the SQL for that looks like this: </p>



<pre class="wp-block-code"><code>select
  counted_data.created_date,
  case when num_uniques &lt; 2.5 * 512 and num_zero_buckets > 0 then
    ((0.7213 / (1 + 1.079 / 512)) * (512 * 
      log(2, (512::numeric) / num_zero_buckets)))::int
  else num_uniques end as approx_distinct_count
from counted_data
order by 1</code></pre>
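<p>The same rule as a Python sketch, mirroring the SQL above term for term (<code>sql_style_correction</code> and <code>paper_linear_counting</code> are hypothetical names). One hedge worth stating: the linked paper writes its small-range correction as m&middot;ln(m/V) with a natural logarithm, where V is the number of empty buckets, so the second function is included only for comparison with the SQL&#8217;s form:</p>

```python
import math

NUM_BUCKETS = 512
ALPHA = 0.7213 / (1 + 1.079 / NUM_BUCKETS)


def sql_style_correction(num_uniques: float, num_zero_buckets: int) -> int:
    """Apply the sparse-data correction in the same form as the SQL case expression."""
    if num_uniques < 2.5 * NUM_BUCKETS and num_zero_buckets > 0:
        return round(ALPHA * NUM_BUCKETS * math.log2(NUM_BUCKETS / num_zero_buckets))
    return round(num_uniques)


def paper_linear_counting(num_zero_buckets: int) -> int:
    """Small-range correction as written in the HyperLogLog paper: m * ln(m / V)."""
    return round(NUM_BUCKETS * math.log(NUM_BUCKETS / num_zero_buckets))


# Half of the buckets empty: the two forms differ by about 4%.
print(sql_style_correction(1000.0, 256))  # 369
print(paper_linear_counting(256))         # 355
# Large estimates pass through unchanged.
print(sql_style_correction(150000.0, 0))  # 150000
```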



<p>Now, putting it all together: </p>



<pre class="wp-block-code"><code>select
  counted_data.created_date,
  case 
    when num_uniques &lt; 2.5 * 512 and num_zero_buckets > 0 then
      ((0.7213 / (1 + 1.079 / 512)) * (512 * 
        log(2, (512::numeric) / num_zero_buckets)))::int
  else num_uniques end as approx_distinct_count
from (
  select
    created_date,
    ((pow(512, 2) * (0.7213 / (1 + 1.079 / 512))) / 
    ((512 - count(1)) + sum(pow(2, -1 * bucket_hash))))::int
      as num_uniques,
    512 - count(1) as num_zero_buckets
  from (
    select 
      date(created_at) as created_date,
      hashtext(session_id) &amp; (512 - 1) as bucket_num,
      31 - floor(log(2, min(hashtext(session_id) &amp; ~(1 &lt;&lt; 31))))
        as bucket_hash
    from gameplays
    group by 1, 2
  ) as bucketed_data
  group by 1 order by 1
) as counted_data order by 1</code></pre>



<p>And that&#8217;s the HyperLogLog probabilistic counter in pure SQL! </p>
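<p>For a quick sanity check away from the database, the whole pipeline fits in a short Python sketch. It follows the three SQL steps but assumes md5 as a stand-in for <code>hashtext</code> and uses synthetic session ids, so it says nothing about the timings or error quoted above. All function names are hypothetical:</p>

```python
import hashlib
import math

NUM_BUCKETS = 512
ALPHA = 0.7213 / (1 + 1.079 / NUM_BUCKETS)


def hash32(value: str) -> int:
    """Assumed stand-in for hashtext: the first 4 bytes of an md5 digest."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")


def approx_distinct_count(values) -> int:
    # bucketed_data: smallest 31-bit hash seen per bucket
    smallest = {}
    for value in values:
        h = hash32(value)
        bucket = h & (NUM_BUCKETS - 1)
        positive = h & 0x7FFFFFFF
        smallest[bucket] = min(positive, smallest.get(bucket, positive))
    # 32 - bit_length() equals 31 - floor(log2(x)) for positive x
    msbs = [32 - v.bit_length() for v in smallest.values()]

    # counted_data: raw estimate plus the number of empty buckets
    num_zero_buckets = NUM_BUCKETS - len(msbs)
    total = num_zero_buckets + sum(2.0 ** -m for m in msbs)
    num_uniques = ALPHA * NUM_BUCKETS ** 2 / total

    # correction for sparse buckets, in the same form as the SQL
    if num_uniques < 2.5 * NUM_BUCKETS and num_zero_buckets > 0:
        return round(ALPHA * NUM_BUCKETS * math.log2(NUM_BUCKETS / num_zero_buckets))
    return round(num_uniques)


# 50,000 distinct synthetic session ids, each seen three times
session_ids = [f"session-{i % 50_000}" for i in range(150_000)]
exact = len(set(session_ids))
approx = approx_distinct_count(session_ids)
print(exact, approx, f"{abs(approx - exact) / exact:.1%} error")
```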



<h2 class="wp-block-heading">Bonus: Parallelizing </h2>



<p>The HyperLogLog algorithm really shines when you&#8217;re in an environment where you can count distinct in parallel. The results of the bucketed_data step can be combined from multiple nodes into one superset of data, greatly reducing the cross-node overhead usually required when counting distinct elements across a cluster of nodes. You can also preserve the results of the bucketed_data step for later, making it nearly free to update the distinct count of a set on the fly!</p>
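<p>One way to picture the merge is the Python sketch below, under the same assumptions as the SQL (512 buckets, each holding its maximum MSB position). Combining two nodes&#8217; buckets only requires keeping the larger value per bucket. Here md5 stands in for <code>hashtext</code>, the session ids are synthetic, and all helper names are hypothetical:</p>

```python
import hashlib

NUM_BUCKETS = 512


def hash32(value: str) -> int:
    """Assumed stand-in for hashtext: the first 4 bytes of an md5 digest."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")


def bucketize(values) -> dict:
    """Map bucket number -> maximum MSB position, as in bucketed_data."""
    buckets = {}
    for value in values:
        h = hash32(value)
        bucket = h & (NUM_BUCKETS - 1)
        msb = 32 - (h & 0x7FFFFFFF).bit_length()
        buckets[bucket] = max(msb, buckets.get(bucket, 0))
    return buckets


def merge_buckets(left: dict, right: dict) -> dict:
    """Combine two nodes' buckets by keeping the larger MSB position per bucket."""
    merged = dict(left)
    for bucket, msb in right.items():
        merged[bucket] = max(msb, merged.get(bucket, 0))
    return merged


node_a = [f"session-{i}" for i in range(0, 6_000)]
node_b = [f"session-{i}" for i in range(4_000, 10_000)]  # overlaps with node_a

merged = merge_buckets(bucketize(node_a), bucketize(node_b))
print(merged == bucketize(node_a + node_b))  # True
```

<p>Because the merged buckets are identical to the buckets built from all the data at once, each node only has to ship at most 512 small integers per date instead of its full list of session ids.</p>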



<pre class="wp-block-code"><code>Hash     MSB Position     Hashes like this
1xxxxx   1                50%
01xxxx   2                25%
001xxx   3                12.5%
0001xx   4                6.25%</code></pre>


HyperLogLog in Pure SQL
