Periscope Data
The world’s fastest visualization platform for data analysts.

Multi-dimensional Clustering Using K-Means in PostgreSQL

December 17, 2015

In our previous blog post on one-dimensional clustering, we used the known distance between two points to cluster data along a single dimension. In many cases, however, data is more complex and needs to be clustered across multiple dimensions.

In this post, we calculate clusters with the k-means clustering algorithm. To iteratively refine each of the k cluster means, we use recursive CTEs, as discussed in our post on Solving the Traveling Salesman Problem.

Suppose you have a mobile game and you’re trying to figure out which of your users should see a new feature first. You decide to break your users into groups based on their spending habits.

Looking at the chart above, we can guess that the users fall into two clusters, based on the natural cutoffs pictured. We will therefore use the k-means algorithm with two clusters.

To implement the algorithm, we need to start with initial values for each mean. These initial values can be chosen randomly from the existing data, which we do by creating the following table that samples two rows from our data:

create table initial_guess as
select * from (
  select
    1.0*total_spend tot1
    , 1.0*number_of_purchases num1
  from
    data
  where random() < 0.5
  limit 1
) a cross join (
  select
    1.0*total_spend tot2
    , 1.0*number_of_purchases num2
  from
    data
  where random() > 0.5
  limit 1
) b;

Now, we need to write a function to calculate the squared Euclidean distance between each data point and the current iteration's means. (We can skip the square root, since it does not change which mean is closest.) This will determine which cluster, or mean, each data point is assigned to:

create or replace function sqr_euclid_dis(
  tot float, num float, meantot float, meannum float
  ) returns float as $$
begin
  return (tot-meantot)*(tot-meantot)+(num-meannum)*(num-meannum);
end;
$$ language plpgsql;
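As a sanity check, the same squared-distance computation can be expressed in a few lines of Python (a sketch for illustration, not part of the original post):

```python
def sqr_euclid_dis(tot, num, meantot, meannum):
    # Squared Euclidean distance between a data point and a mean.
    # Omitting the square root is safe: it preserves the ordering
    # of distances, which is all the cluster assignment needs.
    return (tot - meantot) ** 2 + (num - meannum) ** 2

print(sqr_euclid_dis(3.0, 4.0, 0.0, 0.0))  # -> 25.0
```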

Next, we implement a recursive CTE to iteratively calculate each new set of means. At each step, we first assign every data point to the nearest cluster (mean), starting from our initial guesses. We then average the points in each cluster to produce the means for the next iteration.

Since we want each row of the output to represent one iteration of the k-means algorithm, each row needs to output the iteration number and the coordinates of the first and second means. We define this as (iter, tot1, num1, tot2, num2) in the recursive CTE:

with recursive kmeans(iter, tot1, num1, tot2, num2) as (
  select 1, *
  from
    initial_guess
  union all
  select
    kmeans.iter + 1
    , (
      select avg(total_spend)
      from (
        select
          total_spend
          , number_of_purchases
          , case when sqr_euclid_dis(
              total_spend, number_of_purchases
              , kmeans.tot1, kmeans.num1) <
                sqr_euclid_dis(total_spend
                , number_of_purchases, kmeans.tot2, kmeans.num2)
              then 1
              else 2
            end as cluster_id
        from
          data
      ) s
      where cluster_id = 1
    )
    , (
      select avg(number_of_purchases)
      from (
        select
          total_spend
          , number_of_purchases
          , case when sqr_euclid_dis(
              total_spend, number_of_purchases
              , kmeans.tot1, kmeans.num1) <
                sqr_euclid_dis(total_spend
                , number_of_purchases, kmeans.tot2, kmeans.num2)
              then 1
              else 2
            end as cluster_id
        from
          data
      ) s
      where cluster_id = 1
    )
    , (
      select avg(total_spend)
      from (
        select
          total_spend
          , number_of_purchases
          , case when sqr_euclid_dis(
              total_spend, number_of_purchases
              , kmeans.tot1, kmeans.num1) <
                sqr_euclid_dis(total_spend
                , number_of_purchases, kmeans.tot2, kmeans.num2)
              then 1
              else 2
            end as cluster_id
        from
          data
      ) t
      where cluster_id = 2
    )
    , (
      select avg(number_of_purchases)
      from (
        select
          total_spend
          , number_of_purchases
          , case when sqr_euclid_dis(
              total_spend, number_of_purchases
              , kmeans.tot1, kmeans.num1) <
                sqr_euclid_dis(total_spend
                , number_of_purchases, kmeans.tot2, kmeans.num2)
              then 1
              else 2
            end as cluster_id
        from
          data
      ) t
      where cluster_id = 2
    )
  from
    kmeans
  where kmeans.iter < 10
)

In the code above, the anchor step seeds the recursion with our initial guess. Each recursive step increments the iteration counter, then recomputes the first mean by comparing each data point's squared Euclidean distance to both current means.

After assigning the closest mean (cluster_id) to each data point, we take an average of the data points within the same cluster. We then perform those same steps for the second mean. These averages become the new means for the iteration.
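The logic of one recursive step can also be sketched outside of SQL. The Python below mirrors the CTE: assign each point to the nearer of the two means by squared Euclidean distance, then average each cluster to get the next pair of means. The function name and sample points here are illustrative, not from the post:

```python
def kmeans_step(points, mean1, mean2):
    # One iteration of the recursive CTE: assign each
    # (total_spend, number_of_purchases) point to the nearer mean
    # by squared Euclidean distance, then average each cluster.
    def nearer_first(p):
        d1 = (p[0] - mean1[0]) ** 2 + (p[1] - mean1[1]) ** 2
        d2 = (p[0] - mean2[0]) ** 2 + (p[1] - mean2[1]) ** 2
        return d1 < d2

    cluster1 = [p for p in points if nearer_first(p)]
    cluster2 = [p for p in points if not nearer_first(p)]

    def centroid(cluster):
        return (sum(p[0] for p in cluster) / len(cluster),
                sum(p[1] for p in cluster) / len(cluster))

    return centroid(cluster1), centroid(cluster2)

# Toy data: two low spenders and two high spenders.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 11.0)]
m1, m2 = (1.0, 1.0), (9.0, 11.0)
for _ in range(9):  # iterations 2..10, as in "where kmeans.iter < 10"
    m1, m2 = kmeans_step(points, m1, m2)
print(m1, m2)  # -> (1.25, 1.5) (8.5, 9.5)
```

On this toy data the means converge after a single step, which is typical for well-separated clusters; the SQL version simply runs a fixed ten iterations rather than checking for convergence.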

Finally, we attach a query to the CTE above, assigning each data point to a cluster using the means from the last iteration:

select
  data.*
  , case when sqr_euclid_dis(total_spend
        , number_of_purchases, kmeans.tot1
        , kmeans.num1) < sqr_euclid_dis(
          total_spend, number_of_purchases
        , kmeans.tot2, kmeans.num2)
      then 1
      else 2
  end as cluster_id
from
  data
  , kmeans
where iter = 10;

Success! We’ve assigned clusters to the data set using the k-means algorithm. Now you know which cluster of users should see the new feature first!
