Periscope Data
The world’s fastest visualization platform for data analysts.

The High-Performance SQL Blog

SQL Joins For Data Analysis

November 19, 2015

It’s rare for one table to have all the information needed for a query. Across all our customers, the average chart joins 4 tables to get its result!

Many of these charts use the standard inner join. However, there are many other kinds of joins that are frequently used in data analysis. We’ll take a look at the most common joins we see:

  • Inner Joins
  • Left (and Right) Joins
  • Self-Range Joins

To illustrate each of these joins, we’ll work with a database for a fictitious concierge app used at the front office of an apartment complex. It stores information about the community’s residents and their activity.

Inner Joins

The inner join is the workhorse of SQL joins. This is the default and most common way of combining the data from two tables. When your data has a one-to-many or many-to-many relationship, use inner joins.

For example, we have two tables: residents and packages. Packages arrive for residents at the front office, and get logged in the app. To see which residents we should notify about their packages, we use an inner join:

select *
from residents
  join packages on
    residents.id = packages.residents_id

id     name     resident_id     carrier
1      Harry    1               UPS
3      Rya      7               FedEx
13     Jon      13              UPS
...    ...      ...             ...

This inner join will show us residents with packages. It will not show residents without packages, or packages without residents.

Left (and Right) Joins

The inner join above will not include residents without packages because inner joins require rows from both tables involved to have data satisfying the join condition.

Since residents without packages won’t have any rows in the packages table, those residents are excluded from the results of an inner join.

We can use a left join to include rows from the residents table even if they don’t have any matching rows in the packages table.

Left joins will include every row from the left (residents) table at least one time. If there are no matching results from the right (packages) table, the columns in packages will simply be null.

select *
from residents
  left join packages on
    residents.id = packages.residents_id

id     name     resident_id     carrier
1      Harry    1               UPS
2      Andreas  (null)          (null)
3      Rya      7               FedEx
...    ...      ...             ...

This left join includes all residents whether or not they have a package.

Likewise, if we wanted to see all packages including those without residents, we’d use a right join or swap the order of residents and packages in the select statement.

And we can use the fact that non-matching rows will return nulls to find packages without residents. This can be useful to find packages stamped with an incorrect resident id, because the residents.id field will be null:

select *
from residents
  right join packages on
    residents.id = packages.residents_id
where residents.id is null

id          name     resident_id     carrier
(null)      (null)   -5              UPS
(null)      (null)   (null)          FedEx

Range Joins

The range join is very common when computing lifetime metrics. In our case, we want to calculate the cumulative distinct number of residents receiving packages over time.

Our packages table has a record for each package showing when it was received, making a daily count of distinct package recipients very easy:

select date(received_at), count(distinct recipient_id)
from packages
group by 1

However, this doesn’t get us close to the lifetime metric. Simply making the results cumulative will double-count residents, which we don’t want.

Instead, we’ll use a range join to join each date in our dates table to the packages that came on or before it. Our dates table is just a list of dates, so you could also use generate series or a subquery to produce the same thing.

The important difference with this join is that it doesn’t use equality, it uses greater than / less than instead. This makes it possible for a different number of rows to join to each date:

select date(dates.d), count(distinct recipient_id)
from dates
  join packages on
    packages.received_at <= dates.d
group by 1

With this range join, each date will be joined to all the rows in the packages table that have a received_at on that date, or any previous date. Now the count(distinct recipient_id) will be cumulative and correct!

More Joins!

The joins above are only the most common joins. A future post will talk about cross joinsouter joins, and self joins and when they are most useful!


Want to discuss this article? Join the Periscope Data Community!
Tags: 
Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?

Subscribe to our Newsletter