More Rows, Not More Columns



Many a MySQL or Postgres DBA, with responsibility for serving a fast and responsive website, will design a schema for a survey that looks like this:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/User-ID-table.png" alt="User table" class="wp-image-75807"/></figure>



This is a great schema for serving a live site. You only need to show one user her responses, and in a row-based DB like MySQL or Postgres, select * from survey_responses where user_id = 5 is very fast.



When this table is moved to the data warehouse, however, tables like this make your data analyst crazy. The analyst would much rather see a table like this:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Table-2-1.png" alt="Table 2" class="wp-image-75812"/></figure>



For one thing, the analyst will tend to run queries that scan all the rows, like this one calculating the “yes” rate on Question 2:



<pre class="wp-block-code"><code>select 
 sum(
 case 
 when response is true then 1 
 else 0 
 end
 )::float / count(1) 
from survey_responses 
where question = 'question 2'</code></pre>



For another, the analyst will be using a data warehouse, which makes looking at all the rows in a couple of columns very efficient, but looking at all the columns — even only a couple hundred — quite expensive.



<h2 class="wp-block-heading">Making the transition</h2>



We know we want a row-based table design for serving our website, and a columnar design for analyzing the data. So how do we get from one to the other? How do we transpose the first table into the second table?



A naive solution is a set of unions:



<pre class="wp-block-code"><code>select user_id, 'question 1', question_1_response
from survey_responses where user_id = 1
union select user_id, 'question 2', question_2_response
from survey_responses where user_id = 1
union select user_id, 'question d', question_3_response
from survey_responses where user_id = 1
...</code></pre>



This approach has a number of drawbacks. First and foremost, it is a pain in the ass to write.



Moreover, it is expensive. It asks for every whole row n times if there are n questions. And since this transformation typically takes place on the data warehouse, asking for whole rows over and over again is especially brutal.



<h2 class="wp-block-heading">Row numbering and full joins</h2>



We’ll start with a simple list of numbers:



<pre class="wp-block-code"><code>with numbers as (
 select row_number() over (order by true) i from survey_responses
)</code></pre>



Now, for every row in our row-based table, we’ll create a row per question:



<pre class="wp-block-code"><code>expanded_responses as (
 select * from survey_responses join numbers on true
)</code></pre>



Notice survey_responses join numbers on true. This is a full join, meaning a row is created in the resulting table for every possible combination of rows survey_responses and numbers.



Our table so far looks like so:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Table-3.png" alt="Table 3" class="wp-image-75817"/></figure>



Notice the i column on the far right. We now have one row per user per question, which is about to come in very handy.



Now we will use our row numbering to select responses to question 1 from rows numbered 1, responses to question 2 from rows numbered 2, etc.:



<pre class="wp-block-code"><code>select 
 user_id,
 'question ' || i as question,
 case
 when i = 1 then question_1_response
 when i = 2 then question_2_response
 when i = 3 then question_3_response
 when i = 4 then question_4_response
 when i = 5 then question_5_response
 end as response
from expanded_responses</code></pre>



This gives us our fully-transposed table!



The row numbers gave us a handy way of picking which row to use for selecting the answer to each question. And since this is running on a columnar data warehouse, blowing out the row count is not a big deal.



Putting it all together, the query looks like this:



<pre class="wp-block-code"><code>with numbers as (
 select row_number() over (order by true) i from survey_responses
),
expanded_responses as (
 select * from survey_responses join numbers on true
)
select 
 user_id,
 'question ' || i as question
 case
 when i = 1 then question_1_response
 when i = 2 then question_2_response
 when i = 3 then question_3_response
 when i = 4 then question_4_response
 when i = 5 then question_5_response
 end as response
from expanded_responses</code></pre>



When transposing a 111-column, 132,000-row table on Amazon Redshift, this technique ran in 7 seconds. We’d love to be able to tell you how long the union-based approach took for comparison, but it crashed the cluster!



<h2 class="wp-block-heading">You guessed it!</h2>



There’s an easier way. Using Sisense for Cloud Data Teams, use the Transpose Table checkbox in the Settings tab to transpose your whole table in a single click:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/Table-menu.png" alt="Table menu" class="wp-image-75822"/></figure>

Using Row Numbering And Full Joins To Transpose Wide Tables

LinkedIn

Twitter

GitHub

curve-image-unique-image-unique

curve

3-dark-2-image-unique-image-unique

3 DARK 2

Get the latest in analytics right in your inbox.

Article