Basic text analysis on <a href="https://en.wikipedia.org/wiki/N-gram" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">unigram and bigram frequencies</a> can be helpful when digging into datasets of unstructured text. The most frequent bigrams, or pairs of adjacent words, tell you which phrases are most common in your corpus.



We’ll use bigrams to find the most common phrases from users in our user_comments table.



<h2 class="wp-block-heading">Simple Lists of Words</h2>



The first step in making our bigrams is to convert our paragraphs of text into lists of words. We could use the handy regexp_split_to_table function like this:



<pre class="wp-block-code"><code>select
 regexp_split_to_table(
 lower(comment),
 E'[^a-z0-9_]+'
 )
from user_comments
order by id</code></pre>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/bigram-1.png" alt="Regexp split to table" class="wp-image-76641"/></figure>



The E'[^a-z0-9_]+' regular expression parameter lets us split the comments on anything that isn’t a letter, number, or underscore. This takes care of punctuation and differences in spacing, helping to clean up the data.



<h3 class="wp-block-heading">Arrays of Words</h3>



Unfortunately, regexp_split_to_table used this way won’t work for us: it returns only the words, with no position column to keep them in order, and that order will be critical for constructing the bigrams later on. Instead we’ll convert the comments into arrays, and then work up to ordered lists of words.



Making the comments into arrays of words is straightforward (we’ll be building on this CTE):



<pre class="wp-block-code"><code>with word_list as (
 select
 id as comment_id,
 string_to_array(
 regexp_replace(
 lower(comment),
 E'[^a-z0-9_]+', ' ', 'g'),
 ' ') as word_array
 from user_comments
)</code></pre>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-021.png" alt="word array" class="wp-image-76703"/></figure>



First we use <a href="https://www.postgresql.org/docs/current/functions-string.html">regexp_replace</a> to clean up the text, converting all the characters we don’t care about to spaces. The 'g' at the end tells Postgres to replace all the matches, not just the first.



Then we use <a href="https://www.postgresql.org/docs/current/functions-array.html">string_to_array</a> with a space as its split parameter to convert the cleaned comments into arrays. At the same time we’ll select the id of the original comment, as that will be helpful later.
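
One edge case is worth checking on your own data: when a comment starts or ends with punctuation, the cleaned text starts or ends with a space, and string_to_array then produces an empty-string element. The sketch below uses a made-up sample string (not a row from user_comments) and adds btrim, which is not part of the CTE above, to show one possible way to avoid that:

<pre class="wp-block-code"><code>-- 'Thank you, team!' is a hypothetical sample comment
select
 -- same steps as the word_list CTE: the trailing '!' leaves a
 -- trailing space, so the array ends with an empty string
 string_to_array(cleaned, ' ') as word_array,
 -- assumption: trimming the cleaned text first drops that element
 string_to_array(btrim(cleaned), ' ') as trimmed_word_array
from (
 select
 regexp_replace(
 lower('Thank you, team!'),
 E'[^a-z0-9_]+', ' ', 'g') as cleaned
) sample</code></pre>

Here word_array comes back with four elements, the last one empty, while trimmed_word_array contains only thank, you, and team.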



<h2 class="wp-block-heading">Ordered Lists of Words</h2>



Now that we have our comments as arrays, we can break them out into rows and keep the order:



<pre class="wp-block-code"><code>word_indexes as (
 select
 comment_id,
 word_array,
 generate_subscripts(word_array, 1)
 as word_id
 from word_list
)</code></pre>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-031.png" alt="Ordered lists of words" class="wp-image-76708"/></figure>



We’re using <a href="https://www.postgresql.org/docs/current/functions-srf.html" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">generate_subscripts</a> to output one row per word in the array, each holding that word’s index. The index is just a number, not the word itself, so we need to bring the word_array and comment_id values along for the ride.



Then we’ll use the array indexes returned by generate_subscripts to pull out the word for each index:



<pre class="wp-block-code"><code>numbered_words as (
 select
 comment_id,
 word_array[word_id] word,
 word_id
 from word_indexes
)</code></pre>



<figure class="wp-block-image"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-041.png" alt="Array indexes" class="wp-image-76713"/></figure>



Now we have one row for each word containing its original comment_id, the word itself, and word_id, the word’s position within the array (and therefore within the original comment).
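
To see the two steps in isolation, here is a minimal sketch that runs them against a hard-coded array instead of the word_list CTE. The array literal and the sample and indexed aliases are made up for illustration:

<pre class="wp-block-code"><code>select
 -- step 2: look up the word stored at each index
 word_array[word_id] as word,
 word_id
from (
 select
 word_array,
 -- step 1: one row per index along dimension 1 of the array
 generate_subscripts(word_array, 1) as word_id
 from (
 -- hypothetical sample array standing in for one comment
 select array['thank', 'you', 'so', 'much'] as word_array
 ) sample
) indexed
order by word_id</code></pre>

This returns four rows, pairing thank with 1, you with 2, so with 3, and much with 4.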



<h2 class="wp-block-heading">Making Bigrams</h2>



From here it’s easy to make bigrams: we only need to join numbered_words to itself for each comment!



<pre class="wp-block-code"><code>select
 nw1.word,
 nw2.word
from numbered_words nw1
 join numbered_words nw2 on
 nw1.word_id = nw2.word_id - 1
 and nw1.comment_id = nw2.comment_id</code></pre>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-051.png" alt="Bigram" class="wp-image-76718"/></figure>



Notice how we joined each comment on itself (nw1.comment_id = nw2.comment_id) since bigrams cannot span comments. And joining adjacent words is simply making sure their positions within the array are off by one: nw1.word_id = nw2.word_id - 1.
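
The self-join is the approach used in this article. As an alternative sketch, not the method used here, the lead window function can pair each word with the one that follows it in the same comment. The numbered_words rows below are hard-coded sample data so the query runs on its own:

<pre class="wp-block-code"><code>with numbered_words (comment_id, word, word_id) as (
 -- hypothetical rows in the shape produced by numbered_words
 values
 (1, 'thank', 1),
 (1, 'you', 2),
 (1, 'so', 3),
 (2, 'thank', 1),
 (2, 'you', 2)
)
select
 word as first_word,
 next_word as second_word
from (
 select
 comment_id,
 word,
 word_id,
 -- the next word in the same comment, or null for the last word
 lead(word) over (
 partition by comment_id
 order by word_id
 ) as next_word
 from numbered_words
) pairs
-- the last word of each comment has no partner, so drop it
where next_word is not null
order by comment_id, word_id</code></pre>

Partitioning by comment_id plays the same role as the nw1.comment_id = nw2.comment_id join condition: no pair can span two comments.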



And with this list of bigrams, adding in the count(1) and group by gives us our bigram frequencies:



<pre class="wp-block-code"><code>select
 nw1.word,
 nw2.word,
 count(1)
from numbered_words nw1
 join numbered_words nw2 on
 nw1.word_id = nw2.word_id - 1
 and nw1.comment_id = nw2.comment_id
group by 1, 2
order by 3 desc</code></pre>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-061.png" alt="Thank you" class="wp-image-76723"/></figure>



With these bigram frequencies you’ll be able to see which phrases are most frequent in your data!
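
For reference, here is a sketch of the pieces chained into one runnable query. The user_comments CTE holds three made-up rows in place of the real table, the output column aliases are illustrative, and the btrim call is an addition that is not in the steps above; it keeps a trailing punctuation mark from turning into an empty word:

<pre class="wp-block-code"><code>with user_comments (id, comment) as (
 -- hypothetical sample rows standing in for the real table
 values
 (1, 'Thank you so much'),
 (2, 'Thank you, it works now'),
 (3, 'It works. Thank you!')
),
word_list as (
 select
 id as comment_id,
 string_to_array(
 -- btrim is an assumption added here, see the note above
 btrim(regexp_replace(
 lower(comment),
 E'[^a-z0-9_]+', ' ', 'g')),
 ' ') as word_array
 from user_comments
),
word_indexes as (
 select
 comment_id,
 word_array,
 generate_subscripts(word_array, 1)
 as word_id
 from word_list
),
numbered_words as (
 select
 comment_id,
 word_array[word_id] as word,
 word_id
 from word_indexes
)
select
 nw1.word as first_word,
 nw2.word as second_word,
 count(1) as frequency
from numbered_words nw1
 join numbered_words nw2 on
 nw1.word_id = nw2.word_id - 1
 and nw1.comment_id = nw2.comment_id
group by 1, 2
order by 3 desc, 1, 2</code></pre>

On this sample data the top row is thank, you with a frequency of 3, followed by it, works with a frequency of 2. Note that the CTEs are separated by commas and only the first one is introduced by the with keyword.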
