
Hashing Tables to Ensure Consistency in Postgres, Redshift and MySQL

Maintaining cache consistency

One of the main things Periscope does to speed up queries is maintain a cache of customer data. The cache is optimized to serve certain kinds of queries, and benefits from economies of scale.

Maintaining this cache leads us to a critical question: How do I know if the cache is still valid? Put another way: How do I know if the data in a table in database A matches the data in a table in database B?

Enter hashing, a general technique to detect if two datasets are the same. We can use a hash to validate that our cache is fresh without needing to understand any application-specific logic.

What we need now is a query that returns the exact same hash given the same table structure and data on all databases.

The algorithm

For each row:

  1. Take the MD5 of each column. Use a space for NULL values.
  2. Concatenate those results, and MD5 this result.
  3. Split into 4 8-character hex strings.
  4. Convert into 32-bit integers and sum.
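The four steps above can be sketched in ordinary code. This is an illustrative Python sketch, not the open-sourced tool mentioned later; the function names are made up, and it assumes each column value has already been rendered into a common string form (the job the SQL casts below handle):

```python
import hashlib

def md5_hex(s: str) -> str:
    """MD5 of a string, as a 32-character hex digest."""
    return hashlib.md5(s.encode("utf-8")).hexdigest()

def table_hash(rows):
    """Hash a table (a list of rows, each a list of values, None for NULL)
    down to four summed 32-bit integers, mirroring the four steps above."""
    sums = [0, 0, 0, 0]
    for row in rows:
        # 1. MD5 each column; use a single space for NULLs.
        col_hashes = [md5_hex(str(v)) if v is not None else " " for v in row]
        # 2. Concatenate the column hashes and MD5 the result.
        row_hash = md5_hex("".join(col_hashes))
        # 3-4. Split into four 8-character hex chunks, parse, and sum.
        for i in range(4):
            sums[i] += int(row_hash[i * 8 : (i + 1) * 8], 16)
    return sums
```

Because the chunks are summed, the result is independent of row order, which is exactly what we want when comparing tables across databases that may return rows differently.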

We choose MD5 as our hash function because it’s fast and supported on all databases.

We break the row hashes into integer-sized chunks and sum them to get four bigints, in order to save memory and transfer time. Ideally we’d hash the row hashes themselves down to a single value, but this isn’t possible on all databases.

Finally, note that we must convert the columns into the same format before encoding in step (1) to ensure cross-database consistency.
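That normalization is what the casts in the SQL below do: datetimes become epoch seconds, booleans become 1/0, and everything else its plain text form. A hedged Python sketch of the same idea (the function name and type handling are illustrative, not exhaustive):

```python
import datetime

def normalize(value):
    """Render a value the way the SQL queries below do, so every
    database hashes identical strings."""
    if value is None:
        return None  # NULLs are handled separately: they hash to a single space
    # Check bool before anything numeric: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return str(int(value))  # True -> "1", False -> "0"
    if isinstance(value, datetime.datetime):
        # Epoch seconds, floored, matching floor(extract(epoch from ...))
        return str(int(value.timestamp()))
    return str(value)
```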

In Postgres

Taking the MD5 of a column looks like this:

md5("column name"::text)

Some extra massaging may be required for more complex types. Examples of integers, text columns, and datetime columns are below.

Now we’ll layer on spaces for NULL values:

coalesce(md5("column name"::text), ' ')

Concatenating and hashing those results is a simple matter:

select md5(
  coalesce(md5("column name"::text), ' ') || 
  coalesce(md5("second column name"::text), ' ')
) as "hash"
from "my_schema"."my_table"

We then wrap this all in a subquery so we can split the result into four 8-character hex strings, which are each converted into 32-bit integers and summed.

As we add that in, we get the final query:

select
  sum(('x' || substring(hash, 1, 8))::bit(32)::bigint),
  sum(('x' || substring(hash, 9, 8))::bit(32)::bigint),
  sum(('x' || substring(hash, 17, 8))::bit(32)::bigint),
  sum(('x' || substring(hash, 25, 8))::bit(32)::bigint)
from (
  select md5(
    coalesce(md5("integer column"::text), ' ') ||
    coalesce(md5(floor(extract(epoch from "datetime column"))::text), ' ') ||
    coalesce(md5("string column"::text), ' ') ||
    coalesce(md5("boolean column"::integer::text), ' ')
  ) as "hash"
  from "my_schema"."my_table"
) as t;

Note the ‘x’ prepended to the hash strings, which tells Postgres to interpret them as hex strings when casting to a number.
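For intuition, the chunk-to-integer step in any language is just a base-16 parse; casting through bigint rather than int presumably keeps chunks with the high bit set from wrapping negative. A minimal Python illustration:

```python
# What the ('x' || chunk)::bit(32)::bigint cast computes, in Python terms:
chunk = "89abcdef"        # one 8-character slice of an MD5 hex digest
value = int(chunk, 16)    # parse as base 16 -> an unsigned 32-bit integer
assert 0 <= value < 2**32 # always fits in 32 bits, so the sums fit in a bigint
```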

In Redshift

Redshift supports the handy strtol function, making our hash-string-to-integer conversion a bit easier. Otherwise the full query is the same:

select
  sum(trunc(strtol(substring(hash, 1, 8), 16))),
  sum(trunc(strtol(substring(hash, 9, 8), 16))),
  sum(trunc(strtol(substring(hash, 17, 8), 16))),
  sum(trunc(strtol(substring(hash, 25, 8), 16)))
from (
  select md5(
    coalesce(md5("integer column"::text), ' ') ||
    coalesce(md5(floor(extract(epoch from "datetime column"))::text), ' ') ||
    coalesce(md5("string column"::text), ' ') ||
    coalesce(md5("boolean column"::integer::text), ' ')
  ) as "hash"
  from "my_schema"."my_table"
) as t;


In MySQL

MySQL sports a few changes from the Postgres and Redshift variants:

First, the syntax for casting many of the columns to helpful strings is different, e.g. for datetimes:

floor(unix_timestamp(`datetime column`))

Second, an explicit concat call is required to concatenate the column hashes, since we’re missing Postgres’s || syntax.

Finally, we use conv to read each hex chunk as a number, and cast to turn the result into an unsigned integer:

cast(conv(substring(hash, 1, 8), 16, 10) as unsigned)

Putting it all together, we get this final query:

select
  sum(cast(conv(substring(hash, 1, 8), 16, 10) as unsigned)),
  sum(cast(conv(substring(hash, 9, 8), 16, 10) as unsigned)), 
  sum(cast(conv(substring(hash, 17, 8), 16, 10) as unsigned)), 
  sum(cast(conv(substring(hash, 25, 8), 16, 10) as unsigned)) 
from (
  select md5(concat(
    coalesce(md5(`integer column`), ' '),
    coalesce(md5(floor(unix_timestamp(`datetime column`))), ' '),
    coalesce(md5(`string column`), ' '),
    coalesce(md5(cast(`boolean column` as unsigned)), ' ')
  )) as `hash`
  from my_table
) as t;

In practice

The end result will be four bigints representing the state of the table. Changing any row will change the results.

All this being said, we can’t recommend that you write this all by hand! We’ve open-sourced a simple Go script to build the hash query given a database type and a list of column names and types.

Your other options include letting your DB or DB host handle replication, and, of course, signing up for Periscope and letting us cache data for you automatically.

Happy analyzing!

Want to discuss this article? Join the Periscope Data Community!
