At Periscope Data we are huge fans of AWS Redshift because it empowers our users to run queries blazingly fast. Beyond the success of Redshift as a product, there is the ecosystem that has grown around the platform: dozens of companies helping you deploy and optimize your Redshift clusters, and a myriad of resources for maximizing your query speed.
Here are our six favorite blog posts on optimizing Redshift performance to come out of 2016.
Behind the Scenes with the Redshift Team
To kick things off, we have this post on throughput and vacuuming performance from Maor Kleider, a Senior PM on the AWS Redshift team. It serves as a great introduction to how columnar data stores make for great analytics databases, and as a reminder that your queries will get faster without your intervention thanks to all the work the engineering team puts into Redshift.
Staying up to date on the patches announced in the developer forum helps provide insight into what happens during your cluster’s maintenance window.
It just misses our list with a publish date of December 2015, but absolutely check out this additional post from the AWS team on performance tuning techniques, a comprehensive introduction to performance optimizations on your Redshift cluster.
A Full Comparison of Redshift and BigQuery
The awesome folks at Panoply.io are building a comprehensive analytics infrastructure to make data management simple. When they set off to build their data management layer they had an old-fashioned bake-off between their two finalists, Redshift and BigQuery. Not only does this post win the award for best blog cover art, it provides a third-party comparison between two modern data solutions driven by real world business needs. Read how they compared performance, price, and usability to make their decision.
Spoiler alert: they came to the same conclusion as we did when comparing Redshift and Postgres on a fixed budget. Google BigQuery is a stellar option when you need a pay-as-you-go pricing model on gigantic datasets, which can lead to huge cost savings, but Redshift holds the top spot among always-on cloud-service data warehouses.
Still undecided? You can find several insightful comparisons between Redshift and BigQuery in this post on Quora.
Why did Airbnb switch from Amazon Redshift to Presto?
Airbnb was one of the first major players to speak publicly about Redshift in 2013, but reportedly switched to Presto for ad-hoc queries in 2015. This Quora response from the pervasive Kiyoto Tamura provides an outsider’s speculation on why Airbnb made the migration from Redshift to Presto.
Not only does his response utilize the beautiful turn of phrase “data lake,” it provides insight into rising costs and data management overhead when operating at Airbnb’s multi-petabyte scale.
For a deeper dive into Airbnb’s architecture mindset, read their overview here.
Lessons Learned Loading Data into Redshift
Our friends at Amplitude turbo-charge your web and mobile analytics by storing event logs in Redshift. Earlier this year they moved from storing events in a single large table to distributing events across different tables partitioned by event type; they tell their story in this blog post.
For excellent tips on loading data efficiently, scroll down to the “Redshift Lessons Learned” section in the conclusion.
Improving Performance through Effective Zone Maps
Redshift introduced interleaved sort keys in 2015, providing increased flexibility for optimizing the prefix metadata stored in zone maps, which permit Redshift to bypass blocks of data when performing scans.
You may now be wondering what a zone map is, and how the heck do you know if it needs changing? This post on performance gains from analyzing zone maps from 47 Lining provides an introduction to reviewing alerts in a query’s explain plan within the Redshift console.
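To make the idea of skipping blocks concrete, here is a toy sketch in Python. It is not how Redshift is implemented; it only assumes that each block of a column keeps a minimum and maximum value, and that a scan can bypass any block whose range cannot contain the value being searched for. The names `Block`, `build_blocks`, and `scan_equals` are made up for this illustration, and the block size of 100 is arbitrary.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A chunk of one column plus its zone-map entry (min and max value)."""
    values: List[int]
    min_value: int
    max_value: int


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into blocks in storage order and record min/max per block."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_equals(blocks: List[Block], target: int) -> Tuple[int, int]:
    """Return (matching rows, blocks actually read) for a `column = target` filter."""
    matches = 0
    blocks_read = 0
    for block in blocks:
        # The min/max range rules this block out, so it is never read.
        if target < block.min_value or target > block.max_value:
            continue
        blocks_read += 1
        matches += sum(1 for value in block.values if value == target)
    return matches, blocks_read


if __name__ == "__main__":
    # Same 1,000 values, stored in two different orders.
    sorted_column = list(range(1000))
    scattered_column = [v for start in range(10) for v in range(start, 1000, 10)]

    print(scan_equals(build_blocks(sorted_column, 100), 250))     # (1, 1)
    print(scan_equals(build_blocks(scattered_column, 100), 250))  # (1, 10)
```

Both scans find the same single row, but the sorted layout reads one block while the scattered layout reads all ten. That gap is the intuition behind checking whether your sort keys line up with the filters your queries actually use.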
Redshift at Lightning Speed with Better Dist and Sort Keys
We cap off this list by deviating from our list of blog posts and returning to Panoply.io for this fantastic slide deck on data layout optimizations: Redshift at Lightning Speed.
In thirty quick slides they provide an introduction to working with changing data and several great strategies when working with dist keys, which can have huge impacts on your query run times.
After you brush up on the Query Plan language around distribution styles from the Amazon Developer Guide’s section on Designing Tables, we particularly recommend slides 16 through 24 for a visual overview to cement the importance of setting good dist keys.
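If you want a feel for why the choice of dist key matters before opening the slides, the rough Python sketch below may help. It assumes a simplified model in which each row is assigned to a slice by hashing its dist key value. The SHA-256 hash, the four slices, and the helper names (`slice_for`, `distribute`, `skew_ratio`, `sample_rows`) are all stand-ins chosen for this illustration, not Redshift's actual hash function or API.

```python
import hashlib
from typing import Dict, List


def slice_for(key: object, num_slices: int) -> int:
    """Map a dist key value to a slice with a stable hash (a stand-in, not Redshift's)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def distribute(rows: List[Dict], dist_column: str, num_slices: int) -> List[int]:
    """Count how many rows land on each slice when distributing by `dist_column`."""
    counts = [0] * num_slices
    for row in rows:
        counts[slice_for(row[dist_column], num_slices)] += 1
    return counts


def skew_ratio(counts: List[int]) -> float:
    """Largest slice divided by the average slice; 1.0 means perfectly even."""
    return max(counts) / (sum(counts) / len(counts))


def sample_rows(n: int = 10000) -> List[Dict]:
    """Hypothetical event rows: unique user_id, but 90% of rows share one country."""
    return [{"user_id": i, "country": "US" if i % 10 else "CA"} for i in range(n)]


if __name__ == "__main__":
    rows = sample_rows()
    for column in ("user_id", "country"):
        counts = distribute(rows, column, num_slices=4)
        print(column, counts, round(skew_ratio(counts), 2))
```

In this toy model a high-cardinality key such as `user_id` spreads rows almost evenly, while a low-cardinality key such as `country` piles most rows onto a single slice, so one slice ends up doing most of the work for every query.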
For more guidance on deploying and optimizing your Redshift cluster, check out The Lazy Analyst’s Guide to Amazon Redshift. What are your favorite Redshift references? We’d love to hear about the blog post that led to 10x-ing your Redshift performance @PeriscopeData.