No items found.

Count Distinct Compared on Top 4 SQL Databases

The Gauntlet

After reading our last post, Use Subqueries to Count Distinct 50X Faster, some good folks at Hacker News and r/programming asked to see our results compared across RDBMS’s. A few hypotheses were thrown out, including:

  • Postgres’s query planner is bad at count(distinct ...)
  • MySQL’s query planner is bad at subqueries, so these optimizations would have the opposite effect.
  • MS SQL Server and Oracle’s planners are very good, and ought to generate optimal plans even for the naive query.

Truly, the gauntlet had been thrown, and we are here to answer. We ran the queries on Postgres 9.3MySQL 5.6SQL Server 2012 SE 11.0, and Oracle SE1 11.2.

The Setup

All three databases were hosted at Amazon RDS on m1.xlarge boxen with 15GB of RAM. The dataset features 14M time_on_site_logs, 1200 dashboards and 1700 distinct dashboard_id, user_id pairstime_on_site_logs is indexed on iduser_id and dashboard_id; dashboards is indexed on id.

Each query was run 6 times on each database. The first run was discarded and the subsequent five were averaged.

The Results

For consistency we used the same queries as our original post:

  • Query 1 naively joins and then groups-and-counts-distinct.
  • Query 2 uses a subselect to group-and-count-distinct before the join.
  • Query 3 explicitly computes distincts before grouping-and-counting.

            MySQL    Postgres    SQL Server    Oracle
Query 1     27.5s    348s        4.47s         6.09s
Query 2     25.9s    10.6s       2.13s         3.70s
Query 3     13.3s    7.13s       2.21s         3.71s

Holy Commercial Databases, Batman!

Our hats off to Microsoft and Oracle for superior performance across the board. Interestingly, both came up with naive query plans for the naive query, scanning both tables before joining:

Oracle Naive Explain

Thanks to Oracle’s SQLDeveloper for the graphics. SQL Server’s plans were similar.

Yet performance was excellent compared to MySQL and Postgres despite the naive plans.

On both SQL Server and Oracle, plans for queries 2 and 3 were identical, computing the distincts in time_on_site_logs before joining and sorting:

Oracle Fancy Explain

On Oracle and SQL Server, adding a subselect to your count distinct is worth a 1.6-2X improvement. In all three examples, SQL Server outperformed Oracle by about 25-40%.

MySQL and Subselects

We did not see the hypothesized slowdown in MySQL when introducing subselects. It’s worth noting, though, that our subselect was written specifically to be faster. If MySQL just executes the queries as written, we should see equal or better performance because of subselects.

In cases where subselects are introduced for readability or other reasons, they might still slow MySQL down.

Postgres and Count Distinct

The biggest news in our comparison is Postgres’s remarkable slowdown on the naive query. Thursday’s advertised 50X speedup is repeatable today, but only on Postgres.

Speeding up your count distincts with subqueries will net you a more modest, yet significant, 2X improvement on other databases.


Want to discuss this article? Join the Periscope Data Community!

Sean Cook
Sean has chased the best tech from San Diego to NYC to SF working in GIS, data science and marketing. You can find him alternatively creating dazzling new analyses and driving content strategy at Periscope Data.