Amazon Redshift Software Updates Improve Query Speed by 2.4x
At Periscope Data, we support data teams by helping them and their line-of-business partners manage the entire analytics process, from ingestion and storage to modeling and analysis, through to visualization and reporting. We manage thousands of Redshift nodes across hundreds of Redshift clusters so our customers can run tens of millions of analytics queries each day. One of the best ways to help our customers succeed is to enable them to run even more queries each day, and the latest update from Redshift does just that.
Last year, we compared Redshift’s dc1 nodes to their newer dc2 nodes and saw a 5x increase in query speed on the new hardware. But performance is more than just hardware; the software that runs on top of the hardware is as important as the hardware itself. To that end, Amazon has continued to release software updates to improve performance on Redshift — so we ran a test to quantify the level of improvement.
For the test, we used two clusters: one was a 2-node dc2.large cluster running the latest software, the other was identical hardware running software from six months ago (courtesy of the Redshift team). We copied one of Periscope Data's existing datasets to both clusters and sampled the past week's queries from the STL_QUERY system table as our test set, to simulate a week's worth of queries. In total, we ran approximately 6,000 queries on each cluster. To recreate a real-world scenario, we used queries that varied in nature: some included joins, some used aggregate functions, and some simply selected all the data in a table.
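A minimal sketch of how a test set like this could be assembled. The SQL text, the `userid > 1` filter for skipping internal queries, and the 6,000-query down-sampling target are illustrative assumptions, not Periscope's actual tooling:

```python
import random

def build_sample_sql(days: int = 7) -> str:
    # STL_QUERY is Redshift's system table of recently executed queries;
    # this pulls the text and timing of the last `days` days of user queries.
    return (
        "SELECT query, querytxt, starttime, endtime "
        "FROM stl_query "
        f"WHERE starttime >= DATEADD(day, -{days}, GETDATE()) "
        "AND userid > 1;"  # userid 1 is the internal rdsdb user
    )

def sample_queries(rows, k=6000, seed=42):
    # Down-sample the week's queries to roughly the ~6,000 used in the test.
    random.seed(seed)
    return random.sample(rows, min(k, len(rows)))
```

Replaying the same sampled set against both clusters keeps the workloads comparable.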
We measured the average query time on both clusters at 5-minute intervals. Here’s what we saw:
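The bucketing itself is straightforward; a sketch of the 5-minute averaging, with the data shape (start time, duration) as an assumption:

```python
from collections import defaultdict

def average_by_interval(queries, interval_s=300):
    # queries: list of (start_epoch_seconds, duration_seconds) pairs.
    buckets = defaultdict(list)
    for start, duration in queries:
        # Group each query into its 5-minute (300-second) bucket.
        buckets[int(start // interval_s)].append(duration)
    # Average query duration per bucket, keyed by bucket start time.
    return {b * interval_s: sum(d) / len(d) for b, d in sorted(buckets.items())}
```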
The current Redshift software performed significantly better than the older software: query times were more consistent, with fewer spikes in both number and magnitude. Looking at the average query time on each cluster, we found this performance boost:
The set of queries run on the older software took an average of 22 seconds per query, whereas the same set of queries on the current software took an average of 9 seconds per query — equating to a 2.4x increase in average performance.
When we looked at the distribution of query times, we also noticed that a higher volume of queries returned in under a second on the current software than on the older software:
The current software also showed a tighter distribution of query times overall. To visualize this, we used a box plot, which marks the 25th, 50th and 75th percentiles of our query times as a box, with any outlying query times plotted as points:
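The statistics behind a box plot like this can be sketched in a few lines. This uses linear interpolation between ranks and the conventional 1.5×IQR fence for outliers; the exact conventions of the plotting tool we used may differ:

```python
def percentile(values, p):
    # Percentile by linear interpolation between the two nearest ranks.
    s = sorted(values)
    k = (len(s) - 1) * p / 100
    lo, hi = int(k), min(int(k) + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def box_stats(times):
    q1, q2, q3 = (percentile(times, p) for p in (25, 50, 75))
    iqr = q3 - q1
    # Points beyond 1.5x the interquartile range are drawn as outliers.
    outliers = [t for t in times if t < q1 - 1.5 * iqr or t > q3 + 1.5 * iqr]
    return {"q1": q1, "median": q2, "q3": q3, "outliers": outliers}
```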
Even though the 25th, 50th and 75th percentile query times were about the same, the newer software had a significantly lower maximum query time than its older counterpart. We visualized the 50th, 75th, 90th, 95th and 99th percentiles to illustrate this difference:
This tells us that the top 5% longest-running queries experience improvements of 2.5x on the latest software — and queries in the top 1% nearly doubled that at 4.6x.
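A per-percentile comparison like this can be computed directly from the two sets of query times. A sketch using Python's standard-library `statistics.quantiles` (the function and data here are illustrative, not our actual analysis pipeline):

```python
from statistics import quantiles

def speedup_at_percentiles(old_times, new_times, pcts=(50, 75, 90, 95, 99)):
    # quantiles(..., n=100) returns the 1st..99th percentile cut points,
    # so index p - 1 is the p-th percentile.
    old_q = quantiles(old_times, n=100, method="inclusive")
    new_q = quantiles(new_times, n=100, method="inclusive")
    # Ratio of old to new query time at each percentile; > 1 means faster.
    return {p: old_q[p - 1] / new_q[p - 1] for p in pcts}
```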
We compared Redshift’s latest software to its software from 6 months ago and found the latest software improved average query speed by 2.4x, which we attribute to a more consistent query experience with fewer spikes in query time.
Without changing anything, Redshift customers can now run 2.4x more analyses per day. We're happy to see the Redshift team continue to improve and iterate on the product, because it means customers spend less time waiting on queries and more time analyzing critical business questions to make decisions.