Periscope Data
The world’s fastest visualization platform for data analysts.

One Dimensional Clustering in Postgres

December 10, 2015

Many events tables contain low level information that isn’t always easy to reason about. In this post we’ll cluster low level delivery event data into trips, making further analysis much easier.

Suppose you operate a delivery business 10 minutes from the city, and the only information you have are the driver IDs and the delivery times:

You would like to determine the number of trips each driver took to complete their deliveries. To do so, we’ll cluster the drivers’ deliveries based on time of delivery. ​

It takes 10 minutes to go from the warehouse to the city. All deliveries made within 10 minutes of the previous delivery are part of the same trip, since there isn’t enough time to head back to the warehouse.

With the help of window functions, we can easily cluster the delivery times based on these 10 minutes gaps.

First, we want to determine the difference in time between deliveries for each driver using the lag window function.

with delivery_difference as (
  select
    driver_id
  , time_of_delivery
  , lag(time_of_delivery) over (
      partition by driver_id 
      order by driver_id, time_of_delivery
    ) as previous_delivery
  , extract(epoch from (time_of_delivery - 
      lag(time_of_delivery) over (
        partition by driver_id
        order by driver_id, time_of_delivery
      )
    )) as difference
from
  driver_delivery
)

Now that we have the difference in time between deliveries, we can cluster deliveries that are fewer than 10 minutes apart. To do this, we calculate if each row is the beginning of a new cluster or belongs to the current cluster.

clustering as (
  select
    *
    , case when difference > 600
        or difference is null then true
      else null
    end as new_cluster
  from
    delivery_difference
)

To assign a cluster ID for each row, we utilize count as part of a window function. Since count leaves out null values, it will only count incrementally when non-null values appear. This identifies each cluster by the same value.

assigned_clustering as (
  select
    *
    , count(new_cluster) over (
      order by driver_id, time_of_delivery
      rows unbounded preceding
    ) as cluster_id
  from
    clustering
)

Putting all of the above together, we’ve successfully clustered deliveries into trips. This allows us to finally calculate the deliveries per driver’s trip.

select
  driver_id
  , cluster_id
  , count(*)
from
  assigned_clustering
group by driver_id, cluster_id
order by driver_id, cluster_id
Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?