Importing and Exporting Data

Now that you have your cluster, it's time to load some data in! As Redshift is primarily used as a data warehouse, the data usually comes from another system. In this section, we'll cover imports from primary serving databases like MySQL and Postgres, as well as more general load strategies and query patterns.

Frequently the outputs of data processing or modeling done in Redshift will be used in other systems. Later on in the section, we'll cover the best ways to get data out of Redshift in bulk.

Creating an IAM User

The easiest way to get data into Redshift begins with uploading CSVs to Amazon S3. In the AWS Identity and Access Management (IAM) console, you'll create a user with access to an S3 bucket.

Create a new user and save the Access Key ID and Secret Access Key. Redshift needs these keys to retrieve files from your S3 bucket.

Now your user needs an inline policy to define what resources it can access. The JSON blob below is a sample policy we created with the policy generator for S3 resources.

Change “my-s3-bucket” in the JSON blob to your S3 bucket name and add it as a Custom Policy under Inline Policies.

{"Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-s3-bucket/*"
      ],
      "Sid": "Stmt1234567"
    }
  ]}

Then make sure your bucket and Redshift cluster are in the same region, or you'll incur data transfer charges.

Once your bucket is ready, it's time to create a table.

Creating the Table

Redshift doesn't support as many column types as most row-store databases. For supported types and their value ranges, take a look at Redshift's documentation on data types.

With a simple table, the column type translation is pretty straightforward. For JSON, BINARY, and other column types not supported by Redshift, you can store the values as NULL to keep the schemas consistent, or as varchar(max) if you need the data.
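
For example, a hypothetical MySQL events table might translate like the sketch below; the table and column names are made up for illustration, and the JSON payload is kept as raw text in a varchar(max) column.

-- Hypothetical translation of a MySQL "events" table into Redshift types.
CREATE TABLE my_schema.events (
  id          BIGINT NOT NULL,   -- MySQL BIGINT
  user_id     INTEGER,           -- MySQL INT
  occurred_at TIMESTAMP,         -- MySQL DATETIME
  event_name  VARCHAR(256),      -- MySQL VARCHAR(256)
  payload     VARCHAR(MAX)       -- MySQL JSON, stored as plain text
);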

"No delimiter" or "missing column" errors can indicate an unescaped newline or control character. Truncated lines in the dump file can indicate an unescaped NULL character, which Redshift cannot process even inside quotes.

Getting Data In: The COPY Command

To get data into Redshift, start by staging it in Amazon S3. Once your data's there, you'll take advantage of the all-purpose COPY command. Here's an example:

COPY my_schema.my_table
  FROM 's3://bucket_name/path/to/my_manifest'
  WITH CREDENTIALS
    'aws_access_key_id=<my_access_key>;aws_secret_access_key=<my_secret_key>'
  REGION 'us-east-1'
  MANIFEST
  GZIP
  ACCEPTANYDATE
  TRUNCATECOLUMNS
  ACCEPTINVCHARS;

Let's take this step by step:

The first couple of lines are straightforward. Immediately after COPY comes the schema name and table name. Immediately after FROM comes the path to your data's manifest file in S3.

The WITH CREDENTIALS line contains your AWS account credentials. These credentials must have read access to the S3 bucket in question.

The next three keywords clarify some things about the data:

  • REGION specifies the AWS region of your S3 bucket. The default is the AWS region of your Redshift cluster.
  • MANIFEST specifies that the path after FROM points to a manifest file rather than to the data files themselves (a sample manifest follows this list).
  • GZIP indicates that the data is gzipped.
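
For reference, the manifest itself is just a small JSON file listing the S3 objects to load. A minimal sketch, with placeholder bucket and file names:

{
  "entries": [
    {"url": "s3://my-s3-bucket/path/to/part_0000.gz", "mandatory": true},
    {"url": "s3://my-s3-bucket/path/to/part_0001.gz", "mandatory": true}
  ]
}

Setting "mandatory" to true makes the COPY fail if that file is missing.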

The final three keywords serve to maximize the success rate of the import:

  • ACCEPTANYDATE allows any date format to be parsed in datetime columns.
  • TRUNCATECOLUMNS will truncate text columns rather than error if the data is too long.
  • ACCEPTINVCHARS will replace invalid Unicode characters with ? rather than erroring.

You can adjust these toggles to taste, but in our experience, failed loads are quite frustrating. We recommend allowing some flexibility in the data rather than suffering endless ETL headaches.
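
If you would rather keep the coercion options off and simply tolerate a bounded number of bad rows, COPY also accepts a MAXERROR threshold. A sketch of that stricter variant (the limit of 10 is arbitrary):

COPY my_schema.my_table
  FROM 's3://bucket_name/path/to/my_manifest'
  WITH CREDENTIALS
    'aws_access_key_id=<my_access_key>;aws_secret_access_key=<my_secret_key>'
  REGION 'us-east-1'
  MANIFEST
  GZIP
  MAXERROR 10;  -- skip up to 10 bad rows, then fail the load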

stl_load_errors

As you load tables, you might run into an error or two. The Redshift stl_load_errors table contains most of the recent errors that occurred during a COPY. Since stl_load_errors is a very wide table, we recommend you use \x auto in psql to enable extended display.

For example, this handy query will show you the most recent two errors:

select starttime, filename, err_reason, line_number,
  colname, type, col_length, position, raw_field_value,
  raw_line, err_code
from stl_load_errors
order by starttime desc limit 2;

If you don't see your COPY command in the results, check the Loads tab in the AWS console or try again. The errors are fairly useful, but if you don't see what you're looking for, look up the err_code value in Amazon's documentation.
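
If several loads are running at once, it can also help to narrow the errors to a single table. Since stl_load_errors records the table by numeric id, one way is to join through svv_table_info; a sketch, assuming your table is named my_table:

select le.starttime, le.filename, le.err_reason, le.line_number, le.raw_line
from stl_load_errors le
join svv_table_info ti on le.tbl = ti.table_id
where ti."table" = 'my_table'
order by le.starttime desc
limit 5;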

Most of the bad data format issues can be resolved by going through the options in our troubleshooting section below. Some errors, though, point to deeper problems in the file.

Dumping from MySQL

To get our data out of MySQL, we start by escaping control characters and delimiters with a backslash and separating fields with a comma.

Note that with these escape settings, MySQL will output NULLs as \N. Use mysqldump to get the data out:

mysqldump --tab . \
  --fields-escaped-by=\\ \
  --fields-terminated-by=, \
  dbname tablename

This creates two dump files: tablename.sql with the schema and tablename.txt with the data. Upload the data file to S3, create the table in Redshift, and load the data using the following command (the DELIMITER option matches the comma used in the dump):

COPY schema.table FROM 's3://path/to/dump.csv'
  WITH CREDENTIALS
    'aws_access_key_id=<your-access-key-goes-here>;aws_secret_access_key=<your-secret-key-goes-here>'
  DELIMITER ','
  NULL AS '\\N'
  ESCAPE;

Dumping from Postgres

Postgres can dump CSV files from psql using \copy, which creates a local file. Just run this command in psql:

\copy table to 'filename' csv header null as '\N'

Upload your dump file to S3, create the table in Redshift, and load the data with the following command (IGNOREHEADER 1 skips the header row that \copy wrote):

COPY schema.table FROM 's3://path/to/dump.csv'
  WITH CREDENTIALS
    'aws_access_key_id=<your-access-key-goes-here>;aws_secret_access_key=<your-secret-key-goes-here>'
  CSV
  IGNOREHEADER 1;

Troubleshooting

A few common gotchas you might run into while loading:

  • For invalid characters, add ACCEPTINVCHARS to the COPY command. ACCEPTINVCHARS replaces invalid characters with a "?" per byte, so the length is unchanged.

  • For out-of-range datetimes, use ACCEPTANYDATE. These are common when exporting from MySQL, since "0000-00-00 00:00:00" is a valid datetime there, but not in Redshift.

  • If your CSV has column headers, use IGNOREHEADER 1.

  • For out-of-range numbers, NULL ('\0') characters, or other data that Redshift cannot ingest, you have to fix it at the source.

Now you've got your data in Redshift.

Exporting Data

Getting Data Out: The UNLOAD Command

Nearly as common as getting data in is getting data out. Sometimes, the results of hard computations done in Redshift are necessary for serving systems. Other times, a large export is needed for analysis in Excel or other tools.

To get lots of data out, you'll want the UNLOAD command. Here's an example:

UNLOAD ('select * from my_table')
  TO 's3://bucket_name/path/to/my_filename_prefix'
  WITH CREDENTIALS
    'aws_access_key_id=<my_access_key>;aws_secret_access_key=<my_secret_key>'
  MANIFEST
  GZIP
  ALLOWOVERWRITE
  ESCAPE
  NULL AS '\\N';

Immediately after the UNLOAD keyword, enter the query whose results you want to export, as a string surrounded by parentheses.

Similarly to COPY, you must use WITH CREDENTIALS to specify credentials that may write to your S3 bucket.

The next three keywords modify the format of the export itself:

  • MANIFEST includes a file listing the dumped files. Redshift will export two files per node (one per slice), and a master list can be helpful for reloading via COPY (see the reload sketch below) as well as for other programs reading the data.
  • GZIP compresses the files, making them much easier to work with.
  • ALLOWOVERWRITE proceeds with the export even if the file already exists.

The last two keywords deal with the data:

  • ESCAPE puts an escape backslash before newlines, quotes, etc.
  • NULL AS writes null values as the given string (here \N) instead of leaving them as empty fields.
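
Because the UNLOAD above wrote a manifest, escaped its output, and used \N for nulls, the export can be reloaded with a COPY that mirrors those options. A sketch, assuming a destination table (my_schema.my_table_copy here, as a placeholder) already exists and that UNLOAD named the manifest by appending "manifest" to the file prefix, which is its documented behavior:

COPY my_schema.my_table_copy
  FROM 's3://bucket_name/path/to/my_filename_prefixmanifest'
  WITH CREDENTIALS
    'aws_access_key_id=<my_access_key>;aws_secret_access_key=<my_secret_key>'
  MANIFEST
  GZIP
  ESCAPE
  NULL AS '\\N';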

That's all there is to it! We hope this has been helpful. Go forth and copy large amounts of data.

Next Section: Helpful Admin Queries