Performance Tips

Redshift is usually very fast. But even with all the power of a large Redshift cluster, queries can still get slow.

This might be because the query is generating so much temporary data that it needs to write its intermediate state to disk. Or perhaps the query shares a common core with several other queries that also need to run, and resources are wasted recomputing that common data for every query.

The most common cause of query underperformance is failing to use the tables' sort and dist keys. These two keys are the closest thing to indices in Redshift, so skipping them can cost a lot of time.

In this section we'll show you how to find wasteful queries, materialize common data that many queries share, and choose the right sort and dist keys.

Materialized Views

The best way to make your SQL queries run faster is to have them do less work. A great way to do less work is to query a materialized view that's already done the heavy lifting.

Materialized views are particularly nice for analytics queries, where many queries do math on the same basic atoms, data changes infrequently (often as part of hourly or nightly ETLs), and those ETL jobs provide a convenient home for view creation and maintenance logic.

Redshift doesn't yet support materialized views out of the box, but with a few extra lines in your import script (or a tool like Periscope), creating and maintaining materialized views as tables is a breeze.

Lifetime Daily ARPU (average revenue per user) is a common metric and often takes a long time to compute. Let's speed it up with materialized views.

Calculating Lifetime Daily ARPU

This common metric shows the changes in how much money you're making per user over the lifetime of your product.

Lifetime ARPU(date) = (sum of purchases up to date) / (unique user count up to date)
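
For example, with made-up numbers: if players have spent $10,000 in total by a given date and 5,000 distinct users have played by then, lifetime ARPU for that date is $2.00.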

For that we'll need a purchases table and a gameplays table, and the lifetime accumulated values for each date. Here's the SQL for calculating lifetime gameplays:

with
lifetime_gameplays as (
  select
    dates.d,
    count(distinct gameplays.user_id) as count_users
  from (
    select distinct date(created_at) as d
    from gameplays
  ) as dates
  inner join gameplays
    on date(gameplays.created_at) <= dates.d
  group by d
),

The range join against the derived table of dates lets us recalculate the distinct number of users for each date.

Here's the SQL for lifetime purchases in the same format:

lifetime_purchases as (
  select
    dates.d,
    sum(price) as sum_purchases
  from (
    select distinct date(created_at) as d
    from purchases
  ) as dates
  inner join purchases
    on date(purchases.created_at) <= dates.d
  group by d
)

Now that the setup is done, we can calculate lifetime daily ARPU:

with
lifetime_gameplays as (...),
lifetime_purchases as (...)

select
  lifetime_gameplays.d as date,
  round(
      lifetime_purchases.sum_purchases /
      lifetime_gameplays.count_users
    , 2
  ) as arpu
from lifetime_purchases inner join lifetime_gameplays
  on lifetime_purchases.d = lifetime_gameplays.d
order by lifetime_gameplays.d

That's a monster query, and it takes minutes to run on a database with 2B gameplays and 3M purchases. That's way too slow, especially if we want to quickly slice by dimensions like what platform the game was played on. Plus, similar lifetime metrics will need to recalculate the same data over and over again!

Easy View Materialization on Redshift

Conveniently, we wrote our query in a format that makes it obvious which parts can be extracted into materialized views: lifetime_gameplays and lifetime_purchases.

We'll fake view materialization in Redshift by creating tables, and Redshift makes it easy to create tables from snippets of SQL:

create table lifetime_purchases as (
  select
    dates.d,
    sum(price) as sum_purchases
  from (
    select distinct date(created_at) as d
    from purchases
  ) as dates
  inner join purchases
    on date(purchases.created_at) <= dates.d
  group by d
)

Do the same thing for lifetime_gameplays, and calculating Lifetime Daily ARPU now takes less than a second to complete!
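
The final select is unchanged; it simply reads the physical tables instead of the CTEs:

select
  lifetime_gameplays.d as date,
  round(
      lifetime_purchases.sum_purchases /
      lifetime_gameplays.count_users
    , 2
  ) as arpu
from lifetime_purchases inner join lifetime_gameplays
  on lifetime_purchases.d = lifetime_gameplays.d
order by lifetime_gameplays.d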

Redshift is especially great for this kind of optimization because data on a cluster usually changes infrequently, often as a result of hourly or nightly ETLs.

Remember to drop and recreate these tables every time you upload data to your Redshift cluster to keep them fresh. Or create views in Periscope instead, and we'll keep them up to date automatically!
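
A minimal refresh step in your import script might look like the sketch below; wrapping the drop and recreate in a transaction keeps the table available to readers until the new version is committed. The exact pattern will depend on your load process.

begin;

drop table if exists lifetime_purchases;

create table lifetime_purchases as (
  select
    dates.d,
    sum(price) as sum_purchases
  from (
    select distinct date(created_at) as d
    from purchases
  ) as dates
  inner join purchases
    on date(purchases.created_at) <= dates.d
  group by d
);

commit;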

Disk-Based Temporary Tables

Columnar stores like Amazon Redshift achieve high speeds by reading only the columns needed to complete a query. The best speedups are achieved when these columns and the intermediate results fit in RAM. These speedups degrade if the intermediate results exceed the available RAM and get written to disk.

Your query is likely exceeding the available RAM if it causes spikes in your disk usage graph:

The disk space spikes as temporary tables are created and destroyed, slowing our queries in the process.

Redshift keeps detailed statistics on each query execution, available in the system views svl_query_report and svl_query_summary.

These views are keyed on query, the ID found in the Redshift console or by selecting from svl_qlog. The svl_query_summary view is a summarized version of svl_query_report.
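
If you don't have the query ID handy, a quick look at svl_qlog will surface recent queries and their IDs (the substring column holds the beginning of each query's text):

select query, starttime, substring
from svl_qlog
order by starttime desc
limit 10;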

To find recent queries that are hitting disk, run:

select query, substring
from svl_qlog join svl_query_summary using(query)
where starttime > date(getdate()) - interval '1 day'
  and is_diskbased = 't';

Digging In

To dig in, we'll open up our Redshift console and check the run time for our sample query.

To see the execution details, run:

select query, step, rows, workmem, label, is_diskbased
from svl_query_summary
where query = 127387
order by workmem desc

The workmem is the upper bound on memory the query planner requires for each step, not the amount actually used. If workmem exceeds the available RAM the step will use disk for intermediate results. This is indicated by the is_diskbased column.

The amount of RAM available is based on the number of query slots allocated to this query. By default, Redshift uses 5 query slots and allocates one slot per query. Many of the steps are using disk so we need to optimize this query.

Mitigation

There are always two approaches to solving memory constraints: add more memory or use less. Adding more memory is expensive and using less can be difficult or even impossible. Some queries will always need to use disk.

For the queries that could fit in RAM, here are some mitigation strategies:

Update Database Statistics

workmem is only an estimate of the space needed by each step and is based on the statistics the database collects about each table.

Make sure to run analyze on your tables to keep their internal statistics up to date. This reduces extraneous padding from outdated row count estimates.
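
For example, to refresh statistics on a single table after loading new data:

analyze events;     -- or just "analyze;" to update statistics for every table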

Making More RAM Available

Adding more memory is done by either adding more nodes to your cluster or by increasing the wlm_query_slot_count.

This can be done per session using set wlm_query_slot_count or per user using parameter groups.
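
For example, to claim all five slots in the default queue for the current session (matching the change described below):

set wlm_query_slot_count to 5;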

Increasing the wlm_query_slot_count from 1 to 5 gives this query access to all of the cluster's RAM. The workmem has increased for all steps, and most are no longer disk-based.

Requiring all of the query slots means that this query needs to wait until all 5 slots are available before it can run. This waiting can eliminate the performance improvements of the additional RAM on a busy cluster.

More efficient queries

This query counts the distinct number of users who have post or visit events each day. It takes 75 seconds and four of its steps are disk-based. Instead of increasing the query slots, let's make the query more efficient.

select
  date(event_time) as date,
  count(distinct case when event_type = 'Post'
    then user_id else null end) as posts_created,
  count(distinct case when event_type = 'Visit'
    then user_id else null end) as visits
from events
where event_time >= '2014-05-21'
group by 1

The best way to improve the performance of any query is to reduce the amount of data that is stored and read during its execution. This query only cares about events with an event_type of 'Post' or 'Visit'. Everything else will resolve to null and not be counted in the distinct.

Adding a where clause to filter for only these two event types greatly speeds up this query. It runs in about 5 seconds and doesn't hit the disk at all.
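
Here's the revised query with that filter added; everything else is unchanged:

select
  date(event_time) as date,
  count(distinct case when event_type = 'Post'
    then user_id else null end) as posts_created,
  count(distinct case when event_type = 'Visit'
    then user_id else null end) as visits
from events
where event_time >= '2014-05-21'
  and event_type in ('Post', 'Visit')
group by 1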

Sort Keys and Dist Keys

Like a lot of folks in the data community, we've been impressed with Redshift. Yet at first, we couldn't figure out why performance was so variable on seemingly-simple queries.

The key is carefully planning each table's sort key and distribution key.

A table's distkey is the column on which it's distributed to each node. Rows with the same value in this column are guaranteed to be on the same node.

A table's sortkey is the column by which it's sorted within each node.

A Naive Table

Our 1B-row activity table is set up this way:

create table activity (
  id integer primary key,
  created_at_date date,
  device varchar(30)
);

A common query is: How much activity was there on each day, split by device?

select created_at_date, device, count(1)
from activity
group by created_at_date, device
order by created_at_date;

In Periscope, this would give you a graph like this:

[Graph: daily activity by device, charted in Periscope]

On a cluster with 8 dw2.large nodes, this query takes 10 seconds. To understand why, let's turn to Redshift's handy CPU Utilization graph:

[Graph: CPU utilization during the naive query]

That is a ton of CPU usage for a simple count query!

The problem is our table has no sortkey and no distkey. This means Redshift has distributed our rows to each node round-robin as they're created, and the rows within each node aren't sorted at all.

As a result, each node must maintain thousands of counters — one for each (date, device) pair. Each time it counts a row, it looks up the right counter and increments it. On top of that, the leader must aggregate all the counters. This is where all of our CPU time is going.

Smarter Distribution and Sorting

Let's remake our table with a simple, intentional sortkey and distkey:

create table activity (
  id integer primary key,
  created_at_date date sortkey distkey,
  device varchar(30)
);

Now our table will be distributed according to created_at_date, and each node will be sorted by created_at_date. The same query runs on this table in 8 seconds, a solid 20% improvement.

Because each node is sorted by created_at_date, it only needs one counter per device. As soon as the node is done iterating over each date, the values in each device counter are written out to the result, because the node knows it will never see that date again.

Even better, because dates are unique to each node, the leader doesn't have to do any math over the results. It can just concatenate them and send them back.

Our approach is validated by lower CPU usage across the board:

[Graph: CPU utilization with a single sortkey and distkey]

The Specifics

But what if there were a way to require only one counter? Fortunately Redshift allows multi-key sorting:

create table activity (
  id integer primary key,
  created_at_date date distkey,
  device varchar(30)
)
sortkey (created_at_date, device);

Our query runs on this table in 5 seconds, a 38% improvement over the previous table, and a 2X improvement from the naive query!

Once again, the CPU chart will show us how much work is required:

[Graph: CPU utilization with the compound sortkey]

As promised, only one counter is required. As soon as a node is done with a (date, device) pair it can be written to the result, because that pair will never be seen again.

Of course, choosing a table's sortkey for one query might seem like overkill. But choosing intentionally always beats the naive case. And if you have a few dimensions you use a lot in your analysis, it's worth including them.
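
If you want to double-check which keys an existing table uses, the pg_table_def catalog view flags the distkey and sortkey columns (the table has to be on your search_path):

select "column", type, distkey, sortkey
from pg_table_def
where tablename = 'activity';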

Next Section: Maintenance