How to ETL Data into and out of Amazon Redshift
Loading and Unloading
A key question about any data warehouse, often overlooked in the purchasing process, is how the data gets in and out. While performance is obviously critical, you will be loading data in and dumping data out all day long. A simple, scalable process is critical.
Here at Periscope, we use Amazon Redshift as a high-speed cache of customer data, so we have a lot of battle scars from getting our loads and unloads running smoothly. These are our recommendations for a silky-smooth, terabyte-scale pipeline into and out of Redshift.
Getting Data In: The COPY Command
Let’s take this step by step:
The first couple of lines are straightforward. Immediately after COPY comes the schema name and table name. Immediately after FROM comes the path to your data’s manifest file in S3.
The WITH CREDENTIALS line contains your AWS account credentials. These credentials must have read access to the S3 bucket in question.
The next three keywords clarify some things about the data:
- REGION specifies the AWS region of your S3 bucket. The default is the AWS region of your Redshift cluster.
- MANIFEST specifies that the path after FROM is to a manifest file.
- GZIP indicates that the data is gzipped.
The final three keywords serve to maximize the success rate of the import:
- ACCEPTANYDATE allows any date format to be parsed in datetime columns.
- TRUNCATECOLUMNS will truncate text columns rather than error if the data is too long.
- ACCEPTINVCHARS will replace invalid Unicode characters with ? rather than erroring.
You can adjust these toggles to taste, but in our experience, failed loads are quite frustrating. We recommend some flexibility on the data rather than endless ETL headaches.
Getting Data Out: The UNLOAD Command
Nearly as common as getting data in is getting data out. Sometimes, the results of hard computations done in Redshift are necessary for serving systems. Other times, a large export is needed for analysis in Excel or other tools.
To get lots of data out, you’ll want the UNLOAD command. Here’s an example:
UNLOAD ('select * from my_table')
NULL AS '\\N'
Immediately after the UNLOAD keyword, enter the query whose results you want to export, as a string surrounded by parentheses.
Similarly to COPY, you must use WITH CREDENTIALS to specify credentials that may write to your S3 bucket.
The next three keywords modify the format of the export itself:
- MANIFEST includes a file listing the dumped files. Redshift will export two files per node (one per slice), and a master list can be helpful for reloading via COPY as well as for other programs reading the data.
- GZIP compresses the files, making them much easier to work with.
- ALLOWOVERWRITE proceeds with the export even if the file already exists.
The last two keywords deal with the data:
- ESCAPE puts an escape backslash before newlines, quotes, etc.
- NULL AS uses a special format for null values instead of writing whitespace.
That’s all there is to it! We hope this has been helpful. Go forth and copy large amounts of data.