Redshift Maintenance 101Redshift Maintenance 101
Keeping your Redshift clusters running well requires maintenance. Updating and deleting data creates dead rows that need to be vacuumed, and even append-only tables need to be resorted if the append order is not consistent with the sort key.
The best compression encodings for your tables can change as the data changes, and you’ll want to resize your cluster before it gets too full to run queries.
When Not to Vacuum
Most guidance around vacuuming says to do it as often as necessary. When in doubt, we recommend nightly. But vacuum operations can be very expensive on the cluster, greatly reducing query performance. You can skip vacuuming tables in certain situations:
Data is loaded in sort key order
When new rows are added to a Redshift table, they’re appended to the end of the table in an “unsorted region”. For most tables, this means you have a bunch of rows at the end of the table that need to be merged into the sorted region of the table by a vacuum.
You don’t need to vacuum when appending rows in sort key order: If you’re adding new rows to an events table that is sorted by the event’s time, the rows are already sorted when they’re added! In this case, you don’t need to resort this table with a vacuum because it’s never unsorted.
A lot of data is unsorted
If it’s been a long time since you vacuumed the table or if you’ve appended a ton of unsorted data, it can be faster to copy the table than to vacuum it.
You can recreate the table with all the same columns, compression encodings, and dist and sort keys with create table like:
create table events_copy (like events);
insert into events_copy (select * from events);
drop table events;
alter table events_copy rename to events
A lot of data was deleted
Unlike Postgres, the default vacuum operation in Redshift is vacuum full. This operation reclaims dead rows and resorts the table.
If you’ve recently deleted a lot of rows from a table, you might just want to get the space back. You can use a delete-only vacuum to compact the table without spending the time to resort the remaining rows:
vacuum delete only events
You can see how many rows were deleted or resorted from the most recent vacuums by querying svv_vacuum_summary:
select * from svv_vacuum_summary
where table_name = 'events'
And it’s always a good idea to analyze a table after a major change to its contents:
Rechecking Compression Settings
When you copy data into an empty table, Redshift chooses the best compression encodings for the loaded data. As data is added and deleted from that table, the best compression encoding for any column might change.
What used to make sense as a bytedict might now be better off as a delta encoding if the number of unique values in the column has grown substantially.
To see the current compression encodings for a table, query pg_table_def:
select "column", type, encoding
where tablename = 'events'
And to see what Redshift recommends for the current data in the table, run analyze compression:
analyze compression events
Then simply compare the results to see if any changes are recommended.
Redshift doesn’t currently have a way to alter the compression encoding of a column. You can add a new column to the table with the new encoding, copy over the data, and then drop the old column:
alter table events add column device_id_new integer delta;
update events set device_id_new = device_id;
alter table events drop column device_id;
alter table events rename column device_id_new to device_id;
Monitoring Disk Space
If your cluster gets too full, queries will start to fail because there won’t be enough space to create the temp tables used during query execution. Vacuums can also fail if there isn’t enough free space to store the intermediate data while it’s getting resorted.
To keep an idea on how much space is available in your cluster via SQL, query stv_partitions:
select sum(used)::float / sum(capacity) as pct_full
And to see individual table sizes:
select t.name, count(tbl) / 1000.0 as gb
select distinct datname id, name
join pg_database on pg_database.oid = db_id
join stv_blocklist on tbl=t.id
group by t.name order by gb desc
And then you can either drop unnecessary tables or resize your cluster to have more capacity!