Usually, the easiest way to upload a CSV is to use Sisense&#8217;s CSV functionality. It&#8217;s fast, easy, lets you join the data with all your databases, and automatically casts types. However, sometimes it’s useful to interact directly with a Redshift cluster — usually for complex data transformations and modeling in Python. When interacting directly with a database, it can be a pain to write a create table statement and load your data. When the table is wide, you have two choices while writing your create table — spend the time to figure out the correct data types, or lazily import everything as text and deal with the type casting in SQL. The first is slow, and the second will get you in trouble down the road.



You can see a great example of this looking at Stack Overflow’s <a href="https://insights.stackoverflow.com/survey/2017">survey results</a> for industry data. They’re available in a CSV format that’s a daunting 158 columns wide. If you want to load the data into Redshift and rather than be generous with the data types, use the proper columns, you can speed up the load process by writing a Python script.



<h2 class="wp-block-heading">Importing Libraries and Reading Data in Python</h2>



The first step is to load the data, import libraries, and load the data into a CSV reader object. The CSV library will be used to iterate over the data, and the AST library will be used to determine data type.



The process requires a few lists. &#8220;Longest&#8221; will be a list of the longest values in character length to specify varchar column capacity, &#8220;headers&#8221; will be a list of the column names, and &#8220;type_list&#8221; will be the updating list of column types as we iterate over our data.



<pre class="wp-block-code"><code>import csv, ast, psycopg2

f = open('/path/to/survey/data/survey_data.csv', 'r')
reader = csv.reader(f)

longest, headers, type_list = [], [], []</code></pre>



<h2 class="wp-block-heading">Finding the Data Type</h2>



Once the data is in, we need to find the data type for each row. This means evaluating every value and cast to the most restrictive option, from decimalized numbers to integers, and from integers to strings.



The function dataType does this. First, it evaluates to see if the value is text or a number, and then for the appropriate type of number if needed. This function consumes both the new data, and the current best type to evaluate against.



<pre class="wp-block-code"><code>def dataType(val, current_type):
 try:
 # Evaluates numbers to an appropriate type, and strings an error
 t = ast.literal_eval(val)
 except ValueError:
 return 'varchar'
 except SyntaxError:
 return 'varchar'
 if type(t) in [int, long, float]:
 if (type(t) in [int, long]) and current_type not in ['float', 'varchar']:
 # Use smallest possible int type
 if (-32768 &lt; t &lt; 32767) and current_type not in ['int', 'bigint']:
 return 'smallint'
 elif (-2147483648 &lt; t &lt; 2147483647) and current_type not in ['bigint']:
 return 'int'
 else:
 return 'bigint'
 if type(t) is float and current_type not in ['varchar']:
 return 'decimal'
 else:
 return 'varchar'</code></pre>



It iterates over the rows in our CSV, calls the function above, and populates the lists.



<pre class="wp-block-code"><code>for row in reader:
 if len(headers) == 0:
 headers = row
 for col in row:
 longest.append(0)
 type_list.append('')
 else:
 for i in range(len(row)):
 # NA is the csv null value
 if type_list[i] == 'varchar' or row[i] == 'NA':
 pass
 else:
 var_type = dataType(row[i], type_list[i])
 type_list[i] = var_type
 if len(row[i]) > longest[i]:
 longest[i] = len(row[i])
f.close()</code></pre>



Then use those lists to write the SQL statement:



<pre class="wp-block-code"><code>statement = 'create table stack_overflow_survey ('

for i in range(len(headers)):
 if type_list[i] == 'varchar':
 statement = (statement + '\n{} varchar({}),').format(headers[i].lower(), str(longest[i]))
 else:
 statement = (statement + '\n' + '{} {}' + ',').format(headers[i].lower(), type_list[i])

statement = statement[:-1] + ');'</code></pre>



Finally, the output!



<pre class="wp-block-code"><code>create table stack_overflow_survey_data (
 respondent int,
 , professional varchar(56)
 , programhobby varchar(45)
 , country varchar(34)
 ....
 , expectedsalary decimal);</code></pre>



<h2 class="wp-block-heading">Finishing the Job</h2>



Of course, the job isn&#8217;t done—the data needs to get into Redshift! This can be done using the psycopg2 library (imported above). To use the copy command, first load the data to S3. The access key ID and secret access key can be found under users in your AWS console.



(Find additional details about the copy command used below in the post <a href="https://www.sisense.com/blog/etl-data-into-and-out-of-amazon-redshift/">How to ETL Data into and out of Amazon Redshift</a>.)



<pre class="wp-block-code"><code>conn = psycopg2.connect(
 host='mydb.mydatabase.us-west-2.redshift.amazonaws.com',
 user='user',
 port=5439,
 password='password',
 dbname='example_db')

cur = conn.cursor()

cur.execute(statement)
conn.commit()

sql = """copy stack_overflow_survey_data from 's3://an-example-bucket/survey_data.csv'
 access_key_id '&lt;access_key_id>'
 secret_access_key '&lt;secret_access_key>'
 region 'us-west-1'
 ignoreheader 1
 null as 'NA'
 removequotes
 delimiter ',';"""
cur.execute(sql)
conn.commit()</code></pre>



Now you can begin examining the Stack Overflow data.



Start with a few quick charts: Stack Overflow asked people what attributes are the most important when making technical hires—including knowledge of algorithms, communications skills, titles, and education.&nbsp;



Compare a couple of those questions with the average salary of the respondents:



<pre class="wp-block-code"><code>select
 importantHiringAlgorithms
 , avg(salary) as salary
 , count(*) as respondents
from
 stack_overflow_survey_data
group by
 1
order by
 1</code></pre>



You can see that the average respondent&#8217;s salary actually decreases with a focus on data structures and algorithms when hiring! In contrast, a heavier emphasis on communication skills is associated with higher salaries.



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/salary-table.png" alt="Salary table" class="wp-image-79740"/></figure>



Looking deeper, you can see why—students aren&#8217;t worried about communications skills yet.&nbsp;



<figure class="wp-block-image"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/communication-skills-770x208.png" alt="Salary table 2" class="wp-image-79734"/></figure>



While Stack Overflow didn&#8217;t ask for respondent&#8217;s titles, it’s reasonable to suspect that an emphasis on communication skills grows as people advance their careers. It turns out communications skills are highly valued. See what else you can find out using the neatly ordered data you’ve imported.

Using Python to Write a Create Table Statement and Load a CSV into Redshift

Hack Your Analyst Skillset

LinkedIn

Twitter

GitHub

curve-image-unique-image-unique

curve

3-dark-2-image-unique-image-unique

3 DARK 2

Get the latest in analytics right in your inbox.

Article