Changing Dist and Sort Keys on Giant Tables in Redshift
Keeping your Redshift fast requires maintenance. As your query patterns change you’ll want to update the dist and sort keys to improve performance.
And if you’re appending to or updating a table, you need to vacuum it every so often to resort the data and remove dead space.
Smaller is Easier
All of this is easy on small tables. vacuums are fast and changing keys is as easy as creating a new table and using select into to transfer the data.
But this doesn’t work on large tables. Either there simply isn’t enough extra space to duplicate the data, or the vacuum locks the table for far too long.
One Periscope customer recently came across this issue. They had an events table that was far too large to work with as a single table. So we broke it up into many smaller tables and wrapped them in a view.
Break up Your Table
First, create the smaller tables (we’re using date ranges since it’s an events table):
create table events_20140101_20140201
And insert the relevant data into the sub-table:
insert into events_20140601_20140701 (
select * from events
where event_time between '2014-06-01'
Delete the data from events once it’s been transferred and repeat the process until events is a manageable size.
Now that we have many smaller tables, rename events to events_current and only add new data to events_current. When events_current gets large, archive data from it to a dated events table.
Create a View and Merge Your Sub-tables
Finally, create events as a view using union all to merge the sub-tables:
create or replace view events as (
select * from events_current
select * from events_20140601_20140701
select * from events_20140701_20140801
select * from events_20140901_20141001
select * from events_20141001_20141101
select * from events_20141101_20141201
Queries working with events will work just like before, and the smaller tables make the maintenance much easier since you can work with them one at a time.
Changing dist and sort keys is back to being a simple select into, and you never need to vacuum the dated sub-tables if they never get changed.