Introduction

Periscope is all about speed — we’re always looking for new ways to make our customers’ queries run faster.

Over the last few years, we’ve tried all kinds of databases in search of more speed, including $15k of custom hardware. Of everything we’ve tried, Amazon Redshift won out each time.

As we’ve worked with Redshift, we’ve gathered a lot of knowledge and developed strong opinions around optimal setups.

And we’ve gathered all that knowledge into The Lazy Analyst’s Guide to Amazon Redshift!

Why We Chose Redshift

We've used Redshift to speed up our customers' queries by 150x. We were able to do this with features like Redshift’s columnar storage and distributed architecture.

Columnar Storage

Many databases store data by row, requiring you to read a whole table to sum a column for example.

Redshift stores its data organized by column. Since columns are stored separately, Redshift can ignore columns that aren't referenced by the current query. The less data you need to read from disk, the faster your query will run.

Distributed Architecture

Redshift stores each table's data in thousands of chunks called blocks. Each block can be read in parallel. This allows the Redshift cluster to use all of its resources on a single query.

When reading from disk, a Redshift cluster can achieve much higher IOPS since each node reads from a different disk and the IOPS sums across the cluster.

Benchmarks

We've talked about why Redshift is fast — now it's time we showed you our proof.

We benchmarked Amazon Redshift against Amazon RDS Postgres and found Redshift to be 100-1000 times faster on common analytics queries.

The Specs

To make the comparison as fair as possible, we benchmarked the largest RDS Postgres box (DB.R3.8XLarge) against a similarly priced and spec'd Redshift cluster (16 DW2.Large nodes).

Both our RDS Postgres box and our Redshift cluster used default settings and each costs about $3000/month:

We ran each test query 3 times on an otherwise idle setup. The reported time is the average of the second two executions.

Each query was run against a transactions table that's comprised of:

  • 1 billion rows
  • 50 million unique users in user_id
  • 10 thousand unique products in product_id
  • Timestamps spanning one year in created_at
  • And a dozen extra columns representing various attributes of the transaction

The RDS Postgres version of this table had indexes on created_at, user_id, and product_id.

The Redshift table used user_id as the dist key, (user_id, created_at) as the sort key, and the compression encodings recommended by analyze compression.

Both tables were analyzed and vacuumed before running any queries.

Metrics Queries

Many of our customers look at metrics like Daily Revenue, Daily Active Users, and Daily ARPU. On average, Redshift was 500x faster than RDS Postgres:

Here are the metrics queries we tested:

-- Daily Revenue
select date(created_at), sum(amount)
from transactions group by 1

-- Daily Active Users
select date(created_at), count(distinct user_id)
from transactions group by 1

-- Daily ARPU
select date(created_at), sum(amount) / count(distinct user_id)
from transactions group by 1

Distinct Queries

Whether it's 30-day retention or unique sessions, many analytics queries rely on being able to count the distinct number of elements in a set very fast. On average, Redshift was 200x faster than RDS Postgres for these queries.

Here are the distincting queries we tested:

-- Users per Product
select product_id, count(distinct user_id)
from transactions group by 1

-- Products per User
select user_id, count(distinct product_id)
from transactions group by 1

-- Products per Date
select date(created_at), count(distinct product_id)
from transactions group by 1

Convinced Redshift is the database for you? Head over to our next section and we'll show you how to set up your cluster.

Next Section: Cluster Configuration