Periscope is all about speed — we’re always looking for new ways to make our customers’ queries run faster.
Over the last few years, we've tried all kinds of databases, and even $15k of custom hardware, in search of more speed. Of everything we've tried, Amazon Redshift won out every time.
As we’ve worked with Redshift, we’ve gathered a lot of knowledge and developed strong opinions around optimal setups.
And we’ve gathered all that knowledge into The Lazy Analyst’s Guide to Amazon Redshift!
Why We Chose Redshift
We've used Redshift to speed up our customers' queries by 150x. Features like Redshift's columnar storage and distributed architecture made that possible.
Many databases store data row by row, so summing a single column, for example, requires reading every row of the table.
Redshift stores its data organized by column. Since columns are stored separately, Redshift can ignore columns that aren't referenced by the current query. The less data you need to read from disk, the faster your query will run.
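As a sketch of why this matters, consider a query against the benchmark's transactions table that touches only one column. (The query below is illustrative; it isn't one of the benchmark queries.)

```sql
-- A row store must read entire rows to compute this;
-- Redshift reads only the blocks for the amount column
-- and never touches the table's other columns.
select sum(amount) from transactions;
```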
Redshift stores each table's data in thousands of chunks called blocks. Each block can be read in parallel. This allows the Redshift cluster to use all of its resources on a single query.
When reading from disk, a Redshift cluster can achieve much higher IOPS: each node reads from its own disks, so IOPS scale with the number of nodes in the cluster.
We've talked about why Redshift is fast — now it's time we showed you our proof.
To make the comparison as fair as possible, we benchmarked the largest RDS Postgres instance (db.r3.8xlarge) against a similarly priced and spec'd Redshift cluster (16 dw2.large nodes).
Both our RDS Postgres instance and our Redshift cluster used default settings, and each costs about $3,000/month.
We ran each test query 3 times on an otherwise idle setup. The reported time is the average of the second and third runs, discarding the first to avoid cold-cache effects.
Each query was run against a transactions table comprising:

- 1 billion rows
- 50 million unique users
- 10 thousand unique products
- Timestamps spanning one year
- A dozen extra columns representing various attributes of each transaction
The RDS Postgres version of this table had indexes on user_id and created_at. The Redshift table used user_id as the distribution key, (user_id, created_at) as the sort key, and the compression encodings recommended by ANALYZE COMPRESSION. Both tables were vacuumed before running any queries.
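The Redshift table definition can be sketched roughly as follows. The column types and encodings here are illustrative assumptions, not the exact schema or the encodings ANALYZE COMPRESSION actually recommended:

```sql
create table transactions (
  user_id    bigint encode raw,     -- leading sort key column, often left uncompressed
  created_at timestamp encode lzo,  -- illustrative; use what analyze compression suggests
  product_id integer encode lzo,
  amount     decimal(10,2) encode lzo
  -- ...plus a dozen extra attribute columns
)
distkey (user_id)
compound sortkey (user_id, created_at);
```

Running `analyze compression transactions;` after loading a representative sample of data reports the encoding Redshift recommends for each column.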
Many of our customers look at metrics like Daily Revenue, Daily Active Users, and Daily ARPU. On average, Redshift was 500x faster than RDS Postgres:
Here are the metrics queries we tested:
```sql
-- Daily Revenue
select date(created_at), sum(amount)
from transactions
group by 1

-- Daily Active Users
select date(created_at), count(distinct user_id)
from transactions
group by 1

-- Daily ARPU
select date(created_at), sum(amount) / count(distinct user_id)
from transactions
group by 1
```
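Because user_id is both the distribution key and the leading sort key column, per-user versions of these metrics touch only a handful of blocks on a single node. A sketch, with a placeholder user id:

```sql
-- The sort key lets Redshift skip blocks whose user_id ranges
-- don't include 12345; the distkey sends the work to one node.
select date(created_at), sum(amount)
from transactions
where user_id = 12345
group by 1;
```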
Whether it's 30-day retention or unique sessions, many analytics queries depend on quickly counting the distinct elements in a set. On average, Redshift was 200x faster than RDS Postgres for these queries.
Here are the distincting queries we tested:
```sql
-- Users per Product
select product_id, count(distinct user_id)
from transactions
group by 1

-- Products per User
select user_id, count(distinct product_id)
from transactions
group by 1

-- Products per Date
select date(created_at), count(distinct product_id)
from transactions
group by 1
```
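When exact counts aren't required, Redshift also supports a HyperLogLog-based approximation that trades a small relative error (on the order of 2%) for additional speed. For example, an approximate version of Users per Product:

```sql
-- Approximate distinct count: typically faster than the exact
-- version, at the cost of a small (~2%) relative error.
select product_id, approximate count(distinct user_id)
from transactions
group by 1;
```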
Convinced Redshift is the database for you? Head over to our next section and we'll show you how to set up your cluster.