To keep your Redshift cluster running at peak performance, you'll occasionally need to run some maintenance tasks.
If you're familiar with other kinds of databases, you already know about vacuuming. You'll need to vacuum your Redshift tables if their data has changed, or if you've added a lot of new rows.
And as the data of your tables changes, you'll want to recheck the compression settings. If a table isn't using the best possible compression, it'll take up too much space and slow queries down.
In this section we'll show you when (and when not) to vacuum, how to recheck compression settings, and how to keep an eye on disk usage so you know when to upgrade to a larger Redshift cluster.
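The examples in this section all use a single table, events. In its simplest form, a vacuum just names the table you want to maintain:

vacuum events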
When Not to Vacuum
Most guidance around vacuuming says to do it as often as necessary. When in doubt, we recommend nightly. But vacuum operations can be very expensive on the cluster, greatly reducing query performance. You can skip vacuuming tables in certain situations:
Data is Loaded in Sort Key Order
When new rows are added to a Redshift table, they're appended to the end of the table in an "unsorted region". For most tables, this means you have a bunch of rows at the end of the table that need to be merged into the sorted region of the table by a vacuum.
You don't need to vacuum when appending rows in sort key order: if you're adding new rows to an events table that is sorted by the event's time, the rows are already sorted when they're added! In this case, you don't need to resort this table with a vacuum because it's never unsorted.
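If you want to double-check that such a table really has no unsorted region, svv_table_info reports the percent of unsorted rows for each table:

select "table", tbl_rows, unsorted
from svv_table_info
where "table" = 'events'

A value at or near zero in the unsorted column means there's nothing for a vacuum to resort.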
A Lot of Data is Unsorted
If it's been a long time since you vacuumed the table or if you've appended a ton of unsorted data, it can be faster to copy the table than to vacuum it. You can recreate the table with all the same columns, compression encodings, and sort keys with create table like:
create table events_copy (like events);
insert into events_copy (select * from events);
drop table events;
alter table events_copy rename to events;
A Lot of Data was Deleted
Unlike Postgres, the default vacuum operation in Redshift is vacuum full. This operation reclaims dead rows and resorts the table.
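In other words, these two commands do the same thing:

vacuum events;
vacuum full events;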
If you've recently deleted a lot of rows from a table, you might just want to get the space back. You can use a delete-only vacuum to compact the table without spending the time to resort the remaining rows:
vacuum delete only events
You can see how many rows were deleted or resorted from the most recent vacuums by querying
select * from svv_vacuum_summary where table_name = 'events'
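And if a vacuum is still running, svv_vacuum_progress will show its status and a rough estimate of the time remaining:

select * from svv_vacuum_progress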
And it's always a good idea to analyze a table after a major change to its contents:

analyze events
Rechecking Compression Settings
When you copy data into an empty table, Redshift chooses the best compression encodings for the loaded data. As data is added and deleted from that table, the best compression encoding for any column might change.
What used to make sense as a bytedict might now be better off as a delta encoding if the number of unique values in the column has grown substantially.
To see the current compression encodings for a table, query
select "column", type, encoding from pg_table_def where tablename = 'events'
And to see what Redshift recommends for the current data in the table, run
analyze compression events
Then simply compare the results to see if any changes are recommended.
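On a very large table, the analysis itself can take a while; you can restrict it to specific columns, or adjust the number of rows sampled with comprows. For example, something like:

analyze compression events (device_id) comprows 1000000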
Redshift doesn't currently have a way to alter the compression encoding of a column. You can add a new column to the table with the new encoding, copy over the data, and then drop the old column:
alter table events add column device_id_new integer encode delta;
update events set device_id_new = device_id;
alter table events drop column device_id;
alter table events rename column device_id_new to device_id;
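Afterwards, it's worth confirming that the column picked up the new encoding:

select "column", type, encoding
from pg_table_def
where tablename = 'events' and "column" = 'device_id'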
Monitoring Disk Space
If your cluster gets too full, queries will start to fail because there won't be enough space to create the temp tables used during query execution.
Vacuums can also fail if there isn't enough free space to store the intermediate data while it's getting resorted.
To keep an eye on how much space is available in your cluster via SQL, query
select sum(used)::float / sum(capacity) as pct_full from stv_partitions
And to see individual table sizes:
select t.name, count(tbl) / 1000.0 as gb
from (
  select distinct datname, id, name
  from stv_tbl_perm
  join pg_database on pg_database.oid = db_id
) t
join stv_blocklist on tbl = t.id
group by t.name
order by gb desc
And then you can either drop unnecessary tables or resize your cluster to have more capacity!