<h2 class="wp-block-heading">Non-Relational Data</h2>



Every once in a great while, the enterprising SQL analyst is confronted with data that is not relational in nature. Often this data has been imported from an event-tracking system or a NoSQL database. Such data commonly takes the form of comma-separated values.



For example, we may have data on our top-purchasing users per product that looks like this:



<figure class="wp-block-image size-full fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/top_purchasing_users.png" alt="Top purchasing users" class="wp-image-80905"/></figure>



Dealing with this data can be a pain in the SQL. Let’s say we want to count the number of purchasers for each product in this table. Depending on your DB system, there are a number of ways to do it.



<h2 class="wp-block-heading">Postgres: Using regexp_split Functions</h2>



As always, Postgres’s solution is straightforward. The functions regexp_split_to_table and regexp_split_to_array split a string on a regular expression, returning either a set of rows or a <a href="https://www.postgresql.org/docs/9.4/arrays.html" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">Postgres array.</a>



The table-based solution is more intuitive:



<pre class="wp-block-code"><code>select 
 product_name, 
 regexp_split_to_table(top_purchasing_users, ',')
from top_purchasers_per_product</code></pre>



This breaks each user out onto its own row:



<figure class="wp-block-image size-full fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/regexp_split_users.png" alt="Regexp split users" class="wp-image-80911"/></figure>



From here it’s a simple matter of grouping and counting:



<pre class="wp-block-code"><code>select product_name, count(1) from (
 select 
 product_name, 
 regexp_split_to_table(top_purchasing_users, ',')
 from top_purchasers_per_product
) purchasers
group by 1</code></pre>



To avoid expanding the string into rows and aggregating them again, we can instead split it into an array and take the array’s length:



<pre class="wp-block-code"><code>select 
 product_name, 
 array_length(
 regexp_split_to_array(top_purchasing_users, ',')
 , 1
 )
from top_purchasers_per_product</code></pre>



Both strategies give us our hoped-for final result: number of top purchasers per product!



<figure class="wp-block-image size-full fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/num_purchasers.png" alt="Number of top purchasers" class="wp-image-80917"/></figure>
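
To try both strategies without the real table, here is a minimal sketch that runs them side by side on inline data. The sample_purchasers rows and the via_table and via_array column names are made up for illustration and are not part of the original dataset.

<pre class="wp-block-code"><code>-- Hypothetical sample rows standing in for top_purchasers_per_product
with sample_purchasers (product_name, top_purchasing_users) as (
  values
    ('widget', 'picard,riker,data'),
    ('gadget', 'troi')
)
select
  product_name,
  -- strategy 1: one row per user, then count the rows
  (select count(*)
   from regexp_split_to_table(top_purchasing_users, ',')) as via_table,
  -- strategy 2: split into an array and take its length
  array_length(
    regexp_split_to_array(top_purchasing_users, ','), 1
  ) as via_array
from sample_purchasers</code></pre>

Both columns should agree for every row: 3 for the three-user value and 1 for the single-user value.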



<h2 class="wp-block-heading">MySQL and Redshift: Remove Commas and Compare String Lengths</h2>



MySQL and Redshift lack Postgres’s regexp_split functions, so we’ll fall back on a hack: remove all the commas from the string, and see how much shorter it is!



Remarkably, the syntax on both these databases is exactly the same:



<pre class="wp-block-code"><code>select 
 product_name, 
 length(top_purchasing_users) 
 - length(replace(top_purchasing_users, ',', '')) + 1
from top_purchasers_per_product</code></pre>



length gives us the length of a string, and replace(top_purchasing_users, ',', '') replaces commas with empty strings, effectively removing them from the string! The difference between the two lengths is therefore the number of commas. If there are 3 commas in the string then there are 4 purchasers, so we add 1 to the result.
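
As a worked example, the string 'picard,riker,data' has 17 characters and 15 once the commas are removed, so the formula gives 17 - 15 + 1 = 3. One caveat is that an empty string yields 0 - 0 + 1 = 1 and a NULL yields NULL. The sketch below guards against both, assuming such values should count as zero purchasers; the num_purchasers alias is ours.

<pre class="wp-block-code"><code>select
  product_name,
  case
    -- assumption: a missing or empty list means zero purchasers
    when top_purchasing_users is null
      or top_purchasing_users = '' then 0
    -- otherwise: number of commas, plus one
    else length(top_purchasing_users)
       - length(replace(top_purchasing_users, ',', '')) + 1
  end as num_purchasers
from top_purchasers_per_product</code></pre>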



<h2 class="wp-block-heading">Bonus Round: Recursive CTEs in SQL Server</h2>



The comma-replacement trick works in SQL Server as well. But if, like us, you always enjoy an excuse to try a recursive solution, then read on!



In SQL Server, we can use a <a href="https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms186243(v=sql.105)?redirectedfrom=MSDN" target="_blank" rel="noreferrer noopener">recursive CTE</a> (a recursive with clause) to record the positions of the commas in the string. Let’s take a look:



<pre class="wp-block-code"><code>with comma_positions as (
  -- base case: position of the first comma (0 if there is none)
  select 
    product_name, 
    top_purchasing_users, 
    charindex(',', top_purchasing_users) as comma_pos 
  from top_purchasers_per_product
 
  union all
 
  -- recursive step: search the text after the previous comma
  select 
    product_name, 
    top_purchasing_users, 
    comma_pos + charindex(',', 
      substring(top_purchasing_users, comma_pos + 1, len(top_purchasing_users))
    ) as comma_pos
  from comma_positions
  where charindex(',', 
    substring(top_purchasing_users, comma_pos + 1, len(top_purchasing_users))
  ) > 0
)</code></pre>



The select above union all is our base case: We start with the product name, the list of top purchasing users, and the position of the first comma.



The clause immediately after the union all is where the magic happens. Each time we recurse, we keep the product_name and top_purchasing_users intact. But when finding the comma position, we start our search just after the previous recursion’s comma position!



substring(top_purchasing_users, comma_pos + 1, len(top_purchasing_users)) is the part of the string that starts right after the previous comma position, and the charindex surrounding it finds the first comma position in that substring. Because that position is relative to the substring, we add it to the previous comma_pos to get the position in the full string, and assign the sum to comma_pos to set up the next recursion.



Finally, our where clause terminates the recursion if there are no more commas.



The resulting table looks like this:



<figure class="wp-block-image size-large"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/comma_positions.png" alt="Comma positions" class="wp-image-80923"/></figure>
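
To see the recursion at work on a single value, here is a minimal sketch that runs the same CTE over one inline row. The sample_purchasers CTE and its values are hypothetical stand-ins for the real table. For 'picard,riker,data', the base case finds the first comma at position 7; the recursive step then searches 'riker,data', finds a comma 6 characters in, and records 7 + 6 = 13; the next pass searches 'data', finds nothing, and the recursion stops.

<pre class="wp-block-code"><code>-- Hypothetical one-row stand-in for top_purchasers_per_product
with sample_purchasers as (
  select *
  from (values ('widget', 'picard,riker,data'))
    as v (product_name, top_purchasing_users)
),
comma_positions as (
  -- base case: first comma
  select
    product_name,
    top_purchasing_users,
    charindex(',', top_purchasing_users) as comma_pos
  from sample_purchasers

  union all

  -- recursive step: next comma after the previous one
  select
    product_name,
    top_purchasing_users,
    comma_pos + charindex(',',
      substring(top_purchasing_users, comma_pos + 1, len(top_purchasing_users))
    ) as comma_pos
  from comma_positions
  where charindex(',',
    substring(top_purchasing_users, comma_pos + 1, len(top_purchasing_users))
  ) > 0
)
-- one row per comma: positions 7 and 13 for this sample
select product_name, comma_pos
from comma_positions
order by comma_pos</code></pre>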



From here, it’s a simple group-and-count to get the total number of commas:



<pre class="wp-block-code"><code>with comma_positions as (
  select 
    product_name, 
    top_purchasing_users, 
    charindex(',', top_purchasing_users) as comma_pos 
  from top_purchasers_per_product
 
  union all
 
  select 
    product_name, 
    top_purchasing_users, 
    comma_pos + charindex(',', 
      substring(top_purchasing_users, comma_pos + 1, len(top_purchasing_users))
    ) as comma_pos
  from comma_positions
  where charindex(',', 
    substring(top_purchasing_users, comma_pos + 1, len(top_purchasing_users))
  ) > 0
)
select 
  product_name, 
  -- comma_pos is 0 when a value has no comma, so count only real commas
  count(case when comma_pos > 0 then 1 end) + 1 as num_purchasers
from comma_positions
group by product_name</code></pre>



As before, there’s one more purchaser than comma, so we add one to the result! Our resulting table is exactly as we expect:



<figure class="wp-block-image size-full fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/num_purchasers1.png" alt="Number of purchasers" class="wp-image-80929"/></figure>
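
For comparison, here is a sketch of the comma-replacement trick in SQL Server. It is not identical to the MySQL and Redshift query: SQL Server’s string-length function is len rather than length, and len ignores trailing spaces. The sketch assumes values are non-empty and do not end in a comma or trailing whitespace; the num_purchasers alias is ours.

<pre class="wp-block-code"><code>select
  product_name,
  -- number of commas, plus one
  len(top_purchasing_users)
    - len(replace(top_purchasing_users, ',', '')) + 1 as num_purchasers
from top_purchasers_per_product</code></pre>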



As always, Captain Picard wins the day.

