Count Distinct Compared on Top 4 SQL Databases
After reading our last post, Use Subqueries to Count Distinct 50X Faster, some good folks at Hacker News and r/programming asked to see our results compared across RDBMS’s. A few hypotheses were thrown out, including:
- Postgres’s query planner is bad at count(distinct ...)
- MySQL’s query planner is bad at subqueries, so these optimizations would have the opposite effect.
- MS SQL Server and Oracle’s planners are very good, and ought to generate optimal plans even for the naive query.
Truly, the gauntlet had been thrown, and we are here to answer. We ran the queries on Postgres 9.3, MySQL 5.6, SQL Server 2012 SE 11.0, and Oracle SE1 11.2.
All three databases were hosted at Amazon RDS on m1.xlarge boxen with 15GB of RAM. The dataset features 14M time_on_site_logs, 1200 dashboards and 1700 distinct dashboard_id, user_id pairs. time_on_site_logs is indexed on id, user_id and dashboard_id; dashboards is indexed on id.
Each query was run 6 times on each database. The first run was discarded and the subsequent five were averaged.
For consistency we used the same queries as our original post:
- Query 1 naively joins and then groups-and-counts-distinct.
- Query 2 uses a subselect to group-and-count-distinct before the join.
- Query 3 explicitly computes distincts before grouping-and-counting.
MySQL Postgres SQL Server Oracle
Query 1 27.5s 348s 4.47s 6.09s
Query 2 25.9s 10.6s 2.13s 3.70s
Query 3 13.3s 7.13s 2.21s 3.71s
Holy Commercial Databases, Batman!
Our hats off to Microsoft and Oracle for superior performance across the board. Interestingly, both came up with naive query plans for the naive query, scanning both tables before joining:
Thanks to Oracle’s SQLDeveloper for the graphics. SQL Server’s plans were similar.
Yet performance was excellent compared to MySQL and Postgres despite the naive plans.
On both SQL Server and Oracle, plans for queries 2 and 3 were identical, computing the distincts in time_on_site_logs before joining and sorting:
On Oracle and SQL Server, adding a subselect to your count distinct is worth a 1.6-2X improvement. In all three examples, SQL Server outperformed Oracle by about 25-40%.
MySQL and Subselects
We did not see the hypothesized slowdown in MySQL when introducing subselects. It’s worth noting, though, that our subselect was written specifically to be faster. If MySQL just executes the queries as written, we should see equal or better performance because of subselects.
In cases where subselects are introduced for readability or other reasons, they might still slow MySQL down.
Postgres and Count Distinct
The biggest news in our comparison is Postgres’s remarkable slowdown on the naive query. Thursday’s advertised 50X speedup is repeatable today, but only on Postgres.
Speeding up your count distincts with subqueries will net you a more modest, yet significant, 2X improvement on other databases.