How to Get Started With AWS Spectrum in Minutes
It’s no secret that we’re huge fans of AWS Redshift. We’ve written popular guides to setting it up, tested it against other databases, deployed it for hundreds of customers as their analytics warehouse solution, and manage more than 1,000 nodes. For always-on analytics, it provides great performance at a great price. When a new product is released that has the ability to interact with Redshift, we notice.
Redshift Spectrum is Amazon’s newest database technology, allowing exabyte-scale data in S3 to be accessed through Redshift. This is not simply file access; Spectrum uses Redshift's brain. It deploys workers by the thousands to filter, project and aggregate your data before sending the minimum amount of data needed back to your Redshift cluster to finish your query and deliver your output. If your data is stored in a columnar-friendly format—such as Parquet or RCFile—Spectrum will use a full columnar model, providing radically increased performance over text files.
Engineers and analysts will find Spectrum useful in a number of scenarios:
- Large, infrequently used datasets can be stored more economically in S3 than in Redshift. For instance, clickstream tables often run to the tens of billions of rows. With Spectrum, you can store your most recent data in Redshift and your full history in S3.
- You can collaborate using the same data across multiple Redshift clusters. Often several databases use copies of the same data retrieved from a central datastore. A copy in analytics and copy in finance for example. Spectrum allows all users to have a single, canonical source of truth across all their relational databases. More impressively, Spectrum allows unlimited concurrency on the dataset.
- ETL processes from the central datastore into Redshift databases become incredibly easy. Cron jobs and Python ETL scripts can be replaced by SQL scripts copying data from S3 to Redshift directly with a simple create table statement. This empowers analysts to perform work traditionally managed by engineers.
All of these could be transformational for an organization, especially for companies already operating on Redshift. To give one example: The ability to move data from S3 to Redshift will be hugely impactful for a number of our own customers, many of whom use our Redshift data warehouse to run all of their analytics.
We set up Spectrum from the ground floor, starting with creating a new Redshift cluster and S3 bucket. This guide should be used as a complement to Amazon’s own.
Set up Spectrum for the First Time
Using Spectrum begins with Redshift, and the first thing to do is set up a Redshift database. (Use our guide if you would like more details on this step.) Spectrum’s speed is based on the size of your Redshift cluster. Each node in a cluster has a fixed number of “slices,” or share of that node’s I/O. Each slice sends multiple requests to the Redshift Spectrum layer, so the more slices you have, the more compute parallelism you’ll get from Spectrum.
I set up my Redshift instance in the US West (Oregon) region. The Redshift cluster needs no special functionality inherently, but will be inheriting permissions from its Identity and Access Management (IAM) role. IAM roles allow a database to have its own user identity. In this case it’s used to grant our Redshift cluster access to an S3 bucket. You can navigate to the IAM management section on the AWS console and create a new role, giving it the following two permissions:
Once created, grab your role’s Amazon Resource Name (ARN).
Return to the Redshift management console, select your newly created database, and then click Manage IAM roles, where you can associate your role with your cluster.
The next step is to create your new S3 bucket for data storage. Spectrum will only work when buckets and databases are in the same region, so we set this up in the US West (Oregon) region as well. Now you can upload data. Some important notes here:
- Spectrum works on directories, not files. A directory will be selected, and all files in the directory will be treated as a table. This means the files must have identical columns and data types.
- Breaking up one table into multiple files allows for parallel processing. So, for example, you can write daily logs in individual files but work with them as a single table. This makes adding new rows extremely simple—all you need to do is add a file of new data and rerun the query. No data loading is needed and you can query your data as soon as it’s in S3!
- If you use Parquet- or RCFile-formatted files, you can have access to Spectrum’s columnar capabilities, significantly enhancing performance.
- Spectrum can read compressed files. This can save large amounts of space in S3, reducing costs.
Once your S3 bucket is created, it must be associated with your IAM role as well. Navigate to the Permissions tab, and select Bucket Policy, where you’ll find a text editor.
Use the below permissions, replacing the Principle and Resource with your own role and S3 bucket.
The first step to using Spectrum is to define your external schema. This is a command run a single time to allow Redshift to access S3.
create external schema spectrum
from data catalog
create external database if not exists;
You now can add directories in S3 to this schema. We’re including a table using Amazon’s sample data, but the column definitions and bucket location can be easily replaced with your own. If you encounter any challenges at this step or in the resulting table, try creating all columns as char(64) to evaluate.
create external table spectrum.sales(
row format delimited
fields terminated by '\t'
stored as textfile -- or better, ‘stored as parquet’
And you’re done! When your table is created you can immediately begin to query it!
select count(*) from spectrum.sales
Spectrum in Periscope Data
All of this work can be quickly tied together and used in Periscope Data. The only additional note is that Periscope Data executes code in “transaction blocks,” which must be closed with a commit statement.
Here is a screenshot of the above work put into action. The commented portion of the code only has to be run once to generate the table and schema, and then you can begin building charts with your S3 data.
A Powerful New Tool
Spectrum creates a powerful new way to interact with your data, creating a true data lake model that is simple, fast, and easy to maintain. It extends the functionality of your relational database to scale to truly massive amounts of data, allows rapid data ingestion, and reduces the complexity of ETL processes. With this step-by-step walkthrough, you will be able to start exploring Spectrum more for yourself.
We certainly aren't finished with Spectrum yet. Stay tuned for more, including performance testing across a range of scenarios and optimization techniques for Spectrum queries in upcoming blog posts!