Andreas Djokic
Andreas moved to San Francisco to chase his startup dreams. Following his backend work with algorithm design, he joined Periscope Data to help others find the fun side of data.

The High-Performance SQL Blog

Creating a Histogram in Redshift

December 3, 2015

One of the more popular questions our customers ask is how to make histograms in SQL. In this blog post, we’ll take a look at a couple of quick implementations of histograms before showing how you can automate the queries through a fun Periscope Data feature.

A Basic Histogram Example

The defining feature of a histogram is using bins for x-axis values to represent ranges, rather than having set values. Depending on our schema, we can use Amazon Redshift’s case expression to quickly and easily create bins.

Let’s start off with a basic example. We have a users table, and we want to create bins by age groups. We can use a case expression to sort our users into different bins by their ages.

Then, we can just use a simple count for our y-axis values, grouping by the bins we created. Our query would look something like this:

select
  case
    when age between 0 and 9
      then '0-9'
    when age between 10 and 19
      then '10-19'
    when age between 20 and 29
      then '20-29'
    when age between 30 and 39
      then '30-39'
    when age between 40 and 49
      then '40-49'
    else '50+' 
  end
  count(1)
from
  users
group by
  1

You can see that this translates into a bar chart, where our x-axis values represent age ranges by 10 year increments.

The major takeaway from this example is that you can create the range of values that the bins represent by modifying what values you are checking in the case expression. While we only checked for integer values between two numbers in this example, you can also check strings or results of entire subqueries!

Segmenting on the Bins

Let’s say we want our age bins to be segmented by gender as well. All that we would have to do is include our gender column in the query, and group by it. Our case expression has already done all of the work of creating the bins, and a simple group by allows us to segment the individual bins.

select
  case
    when age between 0 and 9
      then '0-9'
    when age between 10 and 19
      then '10-19'
    when age between 20 and 29
      then '20-29'
    when age between 30 and 39
      then '30-39'
    when age between 40 and 49
      then '40-49'
    else '50+' 
    end
    gender, 
    count(1)
from
  users
group by
  1, 2

This query retains the bins we created, and merely segments them by the new gender column that we added. The resulting histogram looks like this:

Automating Histogram Creation

Let’s take a look at another popular use case we’ve come across. Say that you already have a column that you can group by to create your bins. And now you want to create a histogram using the counts of the different values in that column.

There’s a quick and easy way to automate creating these kinds of histograms!

We recently introduced Parameterized SQL Snippets as a core feature of Periscope Data. These snippets let you pass values directly into them when used, and can add a new dynamic to your queries.

Ryan Iyengar from Ziprecruiter generously shared this Parameterized SQL Snippet he created to help generate histograms:

-- histogram(table_name, column)
with
  counts as (
    select
      [column]
      , count(*) as "count_columns"
    from
      [table_name]
    where
      [column] is not null
    group by
      1
  )
  , histogram as (
    select
      count_columns
      , count(*)
    from
      counts
    group by
      1
    order by 
      1
  )
select
  *
from
  histogram

Ryan’s snippet essentially takes the table and column that you pass to it, and generates a histogram of how often each of the counts occurs for that column.

Let’s try it out with our users table, and we’ll say that we’re looking for a histogram of how often the same zipcodes appear.

Our query would simply look like this:

[histogram(users, zipcode)]

Finally, you can look at the underlying SQL query to see how we’re interpreting the Parameterized SQL Snippet for building a histogram:

with
  counts as (
    select
      zipcode
      , count(*) as "count_columns"
    from
      users
    where
      zipcode is not null
    group by
      1
  )
  , histogram as (
    select
      count_columns
      , count(*)
    from
      counts
    group by
      1
    order by 
      1
  )
select
  *
from
  histogram
limit 5000

And now you can try making a histogram yourself in SQL!


Want to discuss this article? Join the Periscope Data Community!
Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?

Subscribe to our Newsletter