No items found.
Tips and Tricks

Aggregating into Strings without String_agg or Array_agg in Amazon Redshift

Getting the Top Purchases for Each Product

From time to time, any analyst will want to know the “top n instances” of something. For example, as the holidays approach, a toy store may want to know who the top customers of certain products are, so they can prepare special marketing for those customers.

As usual, Postgres makes this easy with a couple of special-purpose functions: string_agg and array_agg.

However, those of us on other databases have to do without. In particular, Amazon Redshift doesn’t yet support these functions. In this post, we’ll show you how to use window functions and self joins to find the top 5 purchasers of each product.

Counting the Purchasers

Let’s start with a simple group-and-count that gives us a table of how many times any customer purchased any product:

select
  products.name as product_name,
  customers.name as customer_name,
  count(1) purchase_count
from products 
  join purchases on purchases.product_id = products.id
group by 1, 2

This gives us the simple table we were expecting:

Ranking Purchasers

Now, for each product and purchaser, we want to know that purchaser’s rank for that product. Are they the top purchaser? The second to the top?

The row_number window function makes this easy. Here’s how:

select
  product_name,
  customer_name,
  row_number() over (
    partition by product_name order by purchase_count desc
  ) as rank
from (
  select
    products.name product_name,
    customers.name customer_name,
    count(1) purchase_count
  from products 
    join purchases on purchases.product_id = products.id
  group by 1, 2
) purchase_counts

We’ve put our previous query into a subquery named purchase_counts.

Then, in our row_number window function, we partitioned by product_name because we want a rank per product, and we ordered by purchase_count desc so that the customer with the most purchases has rank 1.

This gives us a handy customer rank for each product, seen here:

Putting the Top 5 Purchasers into One Row

Now that we have a unique rank for each purchaser on each product, if we want the top 5 purchasers, all we have to do is join this table to itself 5 times!

Assuming we have the above results in a purchase_ranks table, here’s the query:

select 
  pr0.product_name,
  pr1.customer_name 
    || ', ' || pr2.customer_name 
    || ', ' || pr3.customer_name 
    || ', ' || pr4.customer_name 
    || ', ' || pr5.customer_name as top_purchasing_customers
from
  purchase_ranks pr0
  join purchase_ranks pr1 
    on pr0.product_name = pr1.product_name and pr1.rank = 1
  join purchase_ranks pr2 
    on pr0.product_name = pr2.product_name and pr2.rank = 2
  join purchase_ranks pr3 
    on pr0.product_name = pr3.product_name and pr3.rank = 3
  join purchase_ranks pr4 
    on pr0.product_name = pr4.product_name and pr4.rank = 4
  join purchase_ranks pr5 
    on pr0.product_name = pr5.product_name and pr5.rank = 5

The first key for each self join is product_name. That keeps each row about one product.

The second key selects which customer we are bringing in. The first self join brings the top purchaser by requiring pr1.rank = 1. The second self join brings in the second-to-the-top purchaser by requiring pr2.rank = 2, and so on.

In the select clause, we concatenate these customers’ names together, separated by commas. Here’s the result:

Paring down the Results

As you can see, since there are 5 self joins, there are 5 resulting identical rows for each product. To bring it down to one row per product, all we need to do is wrap our query in a select distinct:

select distinct product_name, top_purchasing_customers from (
  select 
    pr0.product_name,
    pr1.customer_name 
      || ', ' || pr2.customer_name 
      || ', ' || pr3.customer_name 
      || ', ' || pr4.customer_name 
      || ', ' || pr5.customer_name as top_purchasing_customers
  from
    purchase_ranks pr0
    join purchase_ranks pr1 
      on pr0.product_name = pr1.product_name and pr1.rank = 1
    join purchase_ranks pr2 
      on pr0.product_name = pr2.product_name and pr2.rank = 2
    join purchase_ranks pr3 
      on pr0.product_name = pr3.product_name and pr3.rank = 3
    join purchase_ranks pr4 
      on pr0.product_name = pr4.product_name and pr4.rank = 4
    join purchase_ranks pr5 
      on pr0.product_name = pr5.product_name and pr5.rank = 5
) products_with_top_purchasers

This gives us the final results we’re looking for!

Putting It all Together

Pulling out original purchase counts into a with clause, we get a final query that looks like this:

with purchase_ranks as (
  select
    product_name,
    customer_name,
    row_number() over (
      partition by product_name order by purchase_count desc
    ) as rank
  from (
    select
      products.name as product_name,
      customers.name as customer_name,
      count(1) as purchase_count
    from products 
      join purchases on purchases.product_id = products.id
    group by 1, 2
  ) purchase_counts
)
select distinct product_name, top_purchasing_customers from (
  select 
    pr0.product_name,
    pr1.customer_name 
      || ', ' || pr2.customer_name 
      || ', ' || pr3.customer_name 
      || ', ' || pr4.customer_name 
      || ', ' || pr5.customer_name as top_purchasing_customers
  from
    purchase_ranks pr0
    join purchase_ranks pr1 
      on pr0.product_name = pr1.product_name and pr1.rank = 1
    join purchase_ranks pr2 
      on pr0.product_name = pr2.product_name and pr2.rank = 2
    join purchase_ranks pr3 
      on pr0.product_name = pr3.product_name and pr3.rank = 3
    join purchase_ranks pr4 
      on pr0.product_name = pr4.product_name and pr4.rank = 4
    join purchase_ranks pr5 
      on pr0.product_name = pr5.product_name and pr5.rank = 5
) products_with_top_purchasers

Now that we have the top 5 purchasers for each product, we can tailor our marketing specifically to them for the holiday season!

Periscope Data and Amazon Web Services combine to provide the fastest and easiest way to deliver scalable, high-performance and secure cloud analytics.

Want to discuss this article? Join the Periscope Data Community!

Periscope Data
The world’s fastest visualization platform for data analysts.