Periscope Data
The world’s fastest visualization platform for data analysts.

Hashing Tables to Ensure Consistency in Postgres, Redshift and MySQL

June 3, 2014

Maintaining cache consistency

One of the main things Periscope does to speed up queries is maintain a cache of customer data. The cache is optimized to serve certain kinds of queries, and benefits from economies of scale.

Maintaining this cache leads us to a critical question: How do I know if the cache is still valid? Put another way: How do I know if the data in a table in database A matches the data in a table in database B?

Enter hashing, a general technique to detect if two datasets are the same. We can use a hash to validate that our cache is fresh without needing to understand any application-specific logic.

What we need now is a query that returns the exact same hash given the same table structure and data on all databases.

The algorithm

For each row:

  1. Take the MD5 of each column. Use a space for NULL values.
  2. Concatenate those results, and MD5 this result.
  3. Split into 4 8-character hex strings.
  4. Convert into 32-bit integers and sum.

We choose MD5 as our hash function because it’s fast and supported on all databases.

We break the column hashes into integer-sized chunks and sum them to get a single value (4 bigints) in order to save memory and transfer time. Ideally we’d hash all the individual column hashes, but this isn’t possible on all databases.

Finally, note that we must convert the columns into the same format before encoding in step (1) to ensure cross-database consistency.

In Postgres

Taking the MD5 of a column looks like this:

md5("column name"::text)

Some extra massaging may be required for more complex types. Examples of integers, text columns, and datetime columns are below.

Now we’ll layer on spaces for NULL values:

coalesce(md5("column name"::text), ' ')

Concatenating and hashing those results is a simple matter:

select md5(
  coalesce(md5("column name"::text), ' ') || 
  coalesce(md5("second column name"::text), ' ')
) as "hash"
from "my_table"."my_schema"

We then wrap this all in a subquery so we can split the result into 4 8-character hex strings, which are each converted into 32-bit integers and summed.

As we add that in, we get the final query:

select
  sum(('x' || substring(hash, 1, 8))::bit(32)::bigint),
  sum(('x' || substring(hash, 9, 8))::bit(32)::bigint),
  sum(('x' || substring(hash, 17, 8))::bit(32)::bigint),
  sum(('x' || substring(hash, 25, 8))::bit(32)::bigint)
from (
  select md5 (
    coalesce(md5("integer column"::text), ' ') ||
    coalesce(md5(floor(
      extract(epoch from "datetime column")
    )::text), ' ') ||
    coalesce(md5("string column"::text), ' ') ||
    coalesce(md5("boolean column"::integer::text), ' ')
  ) as "hash"
  from "my_schema"."my_table";
) as t;

Note the ‘x’ prepended to the hash strings, which tells Postgres to interpret them as hex strings when casting to a number.

In Redshift

Redshift supports the handy strtol function, making our hash-string-to-integer conversion a bit easier. Otherwise the full query is the same:

select
  sum(trunc(strtol(substring(hash, 1, 8), 16))),
  sum(trunc(strtol(substring(hash, 9, 8), 16))),
  sum(trunc(strtol(substring(hash, 17, 8), 16))),
  sum(trunc(strtol(substring(hash, 25, 8), 16)))
from (
  select md5(
    coalesce(md5("integer column"::text), ' ') ||
    coalesce(md5(floor(
      extract(epoch from "datetime column")
    )::text), ' ') ||
    coalesce(md5("string column"::text), ' ') ||
    coalesce(md5("boolean column"::integer::text), ' ')
  ) as "hash"
  from "my_schema"."my_table"
) as t;

In MySQL

MySQL sports a few changes from the Postgres and Redshift variants:

First, the syntax for casting many of the columns to helpful strings is different, e.g. for datetimes:

floor(unixtimestamp("datetime column"))

Second, an explicit concat call is required to concatenate the column hashes, since we’re missing Postgres’s || syntax.

Finally, we use conv to get our 32-bit numbers, and cast to cast them to integers:

cast(conv(substring(hash, 1, 8), 16, 10) as unsigned)

Putting it all together, we get this final query:

select 
  sum(cast(conv(substring(hash, 1, 8), 16, 10) as unsigned)), 
  sum(cast(conv(substring(hash, 9, 8), 16, 10) as unsigned)), 
  sum(cast(conv(substring(hash, 17, 8), 16, 10) as unsigned)), 
  sum(cast(conv(substring(hash, 25, 8), 16, 10) as unsigned)) 
from (
  select md5(
    concat(
      coalesce(md5("integer column"), ' '),
      coalesce(md5(floor(
        unix_timestamp("datetime column")
      )), ' '), 
      coalesce(md5("string column"), ' '), 
      coalesce(md5(cast("boolean column" as integer)), ' ')
    )
  ) as "hash"
  from "my_table"
) as t;

In practice

The end result will be four bigints representing the state of the table. Changing any row will change the results.

All this being said, we can’t recommend that you write this all by hand! We’ve open-sourced a simple Go script to build the hash query given a database type and a list of columns names and types.

Your other options include letting your DB or DB host handle replication, and, of course, signing up for Periscope and letting us cache data for you automatically.

Happy analyzing!

Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?