The first step to using Redshift is to set up your cluster. The most important choices you'll make during cluster setup are the types of nodes to use (we recommend Dense Compute) and the network security settings for your cluster.
Your security settings will determine how to connect you your cluster once it's running. If you choose to make your Redshift publicly accessible, you'll need to whitelist the IPs in your cluster's network security group. If your cluster has a private IP in a VPC, you'll need to set up and connect through a bastion host.
In this section we'll show you how to get your cluster up and running and then how to connect to it.
Setting up your cluster
Setting up a Redshift cluster is easy. The details of connecting to your Redshift cluster vary depending on how you set it up, but the basics are the same.
First, decide what type of node you'll use — Dense Compute or Dense Storage.
Compute nodes have more ECU and memory per dollar than storage nodes, but come with far less storage. We highly value speed at Periscope, so we’ve found these to be the most effective. The more data you are querying, the more compute you need to keep queries fast.
Storage nodes can work well if you have too much data to fit on SSD nodes within your budget, or you want to store a lot more data than you expect to query.
Number of Nodes
Now you need to figure out how many nodes to use. This depends somewhat on your dataset, but for single query performance, the more the merrier.
The size of your data will determine the smallest cluster you can have. Compute nodes only come with 160GB drives. Even if your row count is in the low billions, you may still require 10+ nodes.
The last step is network setup. Clusters in US East (North Virginia) do not require a VPC, while the rest do. For any production usage, we suggest using a VPC, as you’ll get better network connectivity to your EC2 instances.
A default VPC is created if one doesn’t exist. If you want to access Redshift from outside of AWS, then add a public IP by setting Publicly Accessible to true. Whether you want a public IP on your cluster is up to you — the rest of this post explains how to connect to both public and private IPs.
In either case, take note of the VPC Security group. You’ll need to allow access to the cluster through it later.
We’ll start with the simplest cluster setup possible — a cluster in Virginia not in any VPC. This kind of setup is best used for prototyping.
Once the cluster boots, the Configuration tab in the AWS Redshift console will show you the endpoint address.
Before connecting, we need to allow the IP in the Cluster Security Group. Click the link, then click Add Connection Type. The default is your current IP.
Now connect directly to your cluster:
psql -h \ periscope-test.us-east-1.redshift.amazonaws.com \ -p 5439 -U periscope dev
And we’re in!
If your cluster is in a VPC with a public IP, there's one more step: Head to the VPC's security group for this clusters, and whitelist port 5439 for your IP address.
VPC: Public IP
If your cluster is in a VPC with a public IP (whether it's an Elastic IP or not), the setup is almost identical to the steps above.
Follow the same steps, then go to the VPC security group for this cluster, and allow port 5439 from your IP.
VPC: Private IP
In a VPC with a private IP setup, only connections from inside the network are allowed. This is the most common setup we've seen.
There are a few ways to connect to these clusters. The easiest is to SSH into a box in the VPC with a public IP address, often called a Bastion host, and run psql there. The SSH program creates an encrypted connection which lets you run commands and forward network data on remote machines.
To use graphical tools or a client-side
copy command, you’ll need a way to forward traffic through the Bastion host. If you already have a VPN running in AWS, you can connect through there. Just make sure the VPN instance is in a security group that can talk to the Redshift cluster.
Otherwise, you can create your own instance to forward through. Either Windows or Linux will work as your Bastion host, but Linux is much easier to set up.
Linux SSH Server
Log in by clicking the Connect button in the AWS console and following these instructions:
Create a user account.
-m creates a home directory, which you need to store your public key for connecting.
$ sudo useradd periscope -m -s /bin/false
Become the user to install your public key.
-s sets a shell that quits, so the user can forward ports, but not run commands.
$ sudo su -s /bin/bash periscope $ mkdir ~/.ssh $ cat - >> ~/.ssh/authorized_keys
Then paste your public key, press
Enter, then press
Ctrl-d. Alternatively, you can copy the file there.
Permissions are very important for the
authorized_keys file. Its contents allow someone to connect to this machine as your user, so it's only valid if editing is restricted to your user.
Make sure that only your user has write permissions to your
home directory and
.ssh folder. For good measure, remove all permissions from the
$ chmod 600 ~/.ssh/authorized_keys $ chmod go-w ~/ ~/.ssh
If you want to lock down only that tunnel, you can in the
no-pty, permitopen="foo.us-east1.amazonaws.com:5439" ssh-rsa AAAAB3NzaC1y...Rdo/R [email protected]
no-pty is a step beyond using
/sbin/false as the shell — it restricts the user from even opening a virtual terminal, so you have to connect with
For more information and help with troubleshooting, visit the Ubuntu community site.
Windows SSH Server
Get started by installing freeSSHd and running it as an administrator. Then click the icon in the system tray to open your settings.
In the Server Status tab, make sure the SSH server is running. In the Users tab, create a user. Set Authorization to Public Key and make sure to allow Tunneling.
In the Tunneling tab, enable Local Port Forwarding. In the Authentication tab, set Public Key Authentication to Required, then open the public key folder.
Copy your public key to a file with the same name as the user. The name has to match exactly, so take out any file extension.
Make sure the public key is in the correct folder and has the correct name. You may also need to restrict it to administrator only. If your changes don’t seem to be taking effect, make sure you are running as an administrator.
SSH has an option called local port forwarding, which causes your SSH client to open a port on your computer and forward any network traffic received to the server. In this case, the server forwards that connection to the database.
On Mac/Linux, invoke local port forwarding with
-L local-port:redshift-host:redshift-port. You can choose any port greater than 1024 — for this example we chose 5439.
-nNT stops the ssh client from opening a shell, and allows you to background the connection by starting with an
& at the end.
$ ssh bastion.us-east1.amazonaws.com \ -L5439:foo.us-east-1.amazonaws.com:5439 -nNT
After the connection starts working, connect using localhost as the hostname and 5439 as the port.
Using psql type the following:
psql -h 127.0.0.1 -p 5439 database_name
If you're using SQL Workbench/J instead, use the following settings:
For more details on port forwarding, and cools tricks like the reverse tunnel, check the Ubuntu wiki.
Windows SSH Client
PuTTY is our go-to Windows SSH client. To set up tunneling in PuTTY, expand the SSH section of the menu on the left, then open the Tunnels menu.
Source port can be anything you'd like — we’ve chosen 5439 in this example. For Destination, use your Redshift hostname and port, separating the two with a colon. Click Add and save the profile by clicking Open.
Then connect, using SQL Workbench/J just as above.
And you’re in!