Periscope Data
The world’s fastest visualization platform for data analysts.

The High-Performance SQL Blog

Bigram Frequencies in Pure SQL

February 18, 2016

Basic text analysis on unigram and bigram frequencies can be helpful when digging into datasets of unstructured text. The most frequent bigrams, or pairs of adjacent words, tell you which phrases are most common in your corpus.

We’ll use bigrams to find the most common phrases from users in our user_comments table.

Simple Lists of Words

The first step in making our bigrams is to convert our paragraphs of text into lists of words. We could use the handy regexp_split_to_table function like this:

select
  regexp_split_to_table(
    lower(comment), E'[^a-z0-9_]+') as word
from user_comments
order by id

The E'[^a-z0-9_]+' regular expression parameter lets us split the comments on anything that isn’t a letter, number, or underscore. This takes care of punctuation and differences in spacing, helping to clean up the data.
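Outside the database, the same tokenization can be sketched with Python's re module (the function name split_comment here is just illustrative):

```python
import re

def split_comment(comment):
    """Split lowercased text on runs of characters that are not
    letters, digits, or underscores, mirroring E'[^a-z0-9_]+'."""
    # filter out empty strings produced by leading/trailing separators
    return [w for w in re.split(r'[^a-z0-9_]+', comment.lower()) if w]

print(split_comment("Great product -- works well!"))
# ['great', 'product', 'works', 'well']
```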

Arrays of Words

Unfortunately, we cannot use regexp_split_to_table because it doesn’t give us a way to keep the words in order, which will be critical for constructing the bigrams later on. Instead we’ll convert the comments into arrays, and then work up to an ordered list of words.

Making the comments into arrays of words is straightforward (we’ll be building on this CTE):

with word_list as (
  select
    id as comment_id,
    string_to_array(
      regexp_replace(lower(comment),
        E'[^a-z0-9_]+', ' ', 'g'),
      ' ') as word_array
  from user_comments
)

First we use regexp_replace to clean up the text, converting all the characters we don’t care about to spaces. The 'g' at the end tells Postgres to replace all the matches, not just the first.

Then we use string_to_array with a space as its split parameter to convert the cleaned comments into arrays. At the same time we’ll select the id of the original comment as that will be helpful later.

Ordered Lists of Words

Now that we have our comments as arrays, we can break them out into rows and keep the order:

word_indexes as (
  select
    comment_id,
    word_array,
    generate_subscripts(word_array, 1) as word_id
  from word_list
)

We’re using generate_subscripts to output one row for each word in the array containing the index of the word. It’s just a number, not the word itself, so we need to bring the word_array and comment_id values along for the ride.

Then we’ll use the array indexes outputted by generate_subscripts to pull out the word for each index:

numbered_words as (
  select
    comment_id,
    word_array[word_id] as word,
    word_id
  from word_indexes
)

Now we have one line for each word containing its original comment_id, the word itself, and word_id, the word’s position within the array (and also the original comment).
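In procedural terms, the word_indexes and numbered_words steps together behave like enumerating each word array with a 1-based index, which matches Postgres’s default array subscripts. A minimal Python sketch (number_words is an illustrative name, not part of the SQL):

```python
def number_words(comment_id, word_array):
    """Return (comment_id, word, word_id) rows, with word_id starting
    at 1 to match Postgres's 1-based array subscripts."""
    return [(comment_id, word, i)
            for i, word in enumerate(word_array, start=1)]

print(number_words(7, ["great", "product"]))
# [(7, 'great', 1), (7, 'product', 2)]
```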

Making Bigrams

From here it’s easy to make bigrams: we only need to join numbered_words to itself for each comment!

select
  nw1.word as word_1,
  nw2.word as word_2
from numbered_words nw1
  join numbered_words nw2 on
    nw1.word_id = nw2.word_id - 1
    and nw1.comment_id = nw2.comment_id

Notice how we joined each comment on itself (nw1.comment_id = nw2.comment_id) since bigrams cannot span comments. And joining adjacent words is simply making sure their positions within the array are off by one: nw1.word_id = nw2.word_id - 1.
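The effect of that self-join can be sketched as pairing each word with its successor within a single comment:

```python
def bigrams(words):
    """Pair each word with the next one; the same pairing the SQL
    self-join on word_id = word_id - 1 produces within one comment."""
    return list(zip(words, words[1:]))

print(bigrams(["i", "love", "this", "product"]))
# [('i', 'love'), ('love', 'this'), ('this', 'product')]
```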

And with this list of bigrams, adding in the count(1) and group by gives us our bigram frequencies:

select
  nw1.word as word_1,
  nw2.word as word_2,
  count(1)
from numbered_words nw1
  join numbered_words nw2 on
    nw1.word_id = nw2.word_id - 1
    and nw1.comment_id = nw2.comment_id
group by 1, 2
order by 3 desc

With these bigram frequencies you’ll be able to see which phrases are most frequent in your data!
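As an end-to-end sanity check, the whole pipeline, clean, split, pair, and count, can be mirrored in a few lines of Python (the sample comments below are made up for illustration):

```python
import re
from collections import Counter

def bigram_frequencies(comments):
    """Count adjacent word pairs across a list of comments,
    mirroring the SQL pipeline above."""
    counts = Counter()
    for comment in comments:
        # clean and split, as regexp_replace + string_to_array do in SQL
        words = [w for w in re.split(r'[^a-z0-9_]+', comment.lower()) if w]
        # pair adjacent words, as the self-join does
        counts.update(zip(words, words[1:]))
    return counts.most_common()

sample = ["I love this product", "this product is great"]
print(bigram_frequencies(sample))
```

Note that bigrams never span comments here either: each comment is split and paired on its own before the counts are merged.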
