Join our "Building Machine Learning Models" interactive demo with Amazon SageMaker in New York July 18th

# Rolling Classifications by X Previous Aggregations

Rolling averages are useful for comparing a specific point with those around it. However, in some cases we might want to instead classify and group a period’s data based on its contemporary periods.

For each period, we examine those that came before or after it, and in doing so, perform rolling classifications. In this post, we’ll show how to determine the window of rolling, and how to use the different periods to classify data.

### Defining the Window and Classifications

We’ll start off with an orders table that has two columns we care about: customer_id and order_date. For this example we’ll aggregate the order dates by quarters. Our problem is to find the number of users that fall into the following classifications for each quarter:

• “New Customer”: If a customer made a purchase in the same quarter that they were created
• “Active”: If a customer made a purchase in the current quarter and were not created in the same quarter
• “Pending”: If a customer has not made an order in the current quarter, but made an order in the previous two quarters
• “Non-Active”: If a customer made a purchase two quarters ago, but not in the previous quarter or the current quarter
• “Churned”: If a customer has not made a purchase in the current quarter or the last two quarters

Based on these rules, our window will always cover the previous two quarters from the current quarter being evaluated. For each entry in the orders table, we’ll calculate whether or not the same customer made a purchase in the previous two quarters.

### Grabbing All Possible Combinations

We’ll first want to create a table that, for each distinct customer_id, has rows for all the possible year and quarter combinations. To do that, we can extract the distinct customer_id’s, years, and quarters from our orders table into their own subqueries. Then we cross-join the three tables like so:

with year_table as
(
select distinct extract(year from order_date)
as year from orders
)
, quarter_table as
(
select distinct extract(qtr from  order_date)
as quarter from orders
)
, customer_table as
(
select distinct customer_id from orders
)
, all_possible_orders as
(
select * from customer_table, year_table, quarter_table
)

### Get Column for Current Quarter

In order to avoid repeated data where the same customer makes more than one purchase in a quarter, we’ll create a subquery from the orders table for distinct customer purchases.

distinct_orders as
(
select distinct
customer_id
, extract(year from order_date) as year
, extract(qtr from order_date) as quarter
from orders
)

Now that we have a table of orders without repeats in a quarter, we can join it to our all_possible_orders table. We’ll use a left join in order to not lose any of the necessary rows.

orders_with_current_quarter as
(
select
all_orders.customer_id
, all_orders.year
, all_orders.quarter
, distinct_orders.year as curr_year
distinct_orders.quarter as curr_quarter
from all_orders
left join distinct_orders on
all_orders.customer_id = distinct_orders.customer_id and
all_orders.year = distinct_orders.year and
all_orders.quarter = distinct_orders.quarter
)

The curr_year and curr_quarter columns will be null when there was not an order in that period, and will otherwise display the year and quarter for the user_id’s order. These columns will be used later for classifying the data, and are the key to checking the previous quarters.

### Calculating Previous Quarters

The next step is to add a column for each of the two previous quarters. We’ll start by adding a column for the previous quarter first, as the logic for the second would be similar. To do this, we’ll want to left join the orders_with_current_quarter table onto itself, lagging the quarters by one.

Creating the join condition is the trickiest part of the query, and we have to consider the two possible cases:

1. The previous quarter is in the same year (e.g. 2016 Quarter 2 → 2016 Quarter 1)
2. The previous quarter is in the past year (e.g. 2016 Quarter 1 → 2015 Quarter 4)

The first possibility is relatively straightforward, and we can compare the values directly:

t1.year = t2.year
and t1.quarter = t2.quarter + 1

The second possibility is a bit more complicated. We’ll have to use the modulus operator to account for the year potentially changing and the quarter resetting. To do this, we can use the modulus operator. Since we’re aggregating by quarters, our modulus check would be with four:

t1.year = t2.year + 1
and t1.quarter = (t2.quarter + 1) % 4
and t2.quarter = 4

Now we’ll apply this logic to the query we’ve built so far.

includes_one_previous_period as
(
select
t1.*
, t2.quarter as one_prev_period
from orders_with_current_quarter t1
left join orders_with_current_quarter t2 on
t1.user_id = t2.user_id and
(
(
t1.curr_year = t2.curr_year
and t1.curr_quarter = t2.curr_quarter + 1
) or
(
t1.curr_year = t2.curr_year + 1
and t1.curr_quarter = (t2.curr_quarter + 1) % 4
and t2.curr_quarter = 3
)
)
)

Now we have a table with columns for both the current quarter and the previous quarter for all orders. To add a column for two quarters ago, we write a query that’s nearly identical to the one above. We modify it by using the includes_one_previous_period table, and adding two instead of one in the quarter join conditions.

includes_two_previous_period as
(
select
t1.*
, t2.quarter as two_prev_period
from includes_one_previous_period t1
left join includes_one_previous_period t2 on
t1.user_id = t2.user_id and
(
(
t1.curr_year = t2.curr_year
and t1.curr_quarter = t2.curr_quarter + 2
)
or (
t1.curr_year = t2.curr_year + 1
and t1.curr_quarter = (t2.curr_quarter + 2) % 4
and t2.curr_quarter = 3
)
)
)

And our resulting table now has columns for the current quarter, one quarter ago, and two quarters ago.

We’re now one join condition away (We promise!) from being able to classify the data using our rules!

We now have the necessary columns to count the number of customers who fall into Categories 2-5 for each quarter. We’ll add one more column that lets us determine whether or not someone was a new customer in a specific Quarter. To do this, we can conveniently employ our customers table that has two columns of value: customer_id and created_at.

All we have to do is left join this onto our includes_two_previous_period subquery using the customer_id fields.

master_table as
(
select
t1.*
, extract(year from created_at) as customer_created_year
, extract(qtr from created_at) as customer_created_quarter
from includes_two_previous_period
left join customers on
includes_two_previous_period.customer_id
= customers.customer_id
)

Now we have everything we need to build our classification rules. The finish line is in sight!

### Putting It All Together

With all the necessary pieces, we can employ a case when statement to determine whether or not a row meets one of the 5 established classifications.

select
year
, quarter
, case
when year = customer_created_year and
quarter = customer_created_quarter
then New Customer
when current_year is not null
then Active
when current_year is null
and one_prev_period is not null
and two_prev_period is not null
then Pending
when current_year is null
and one_prev_period is null
and two_previs_period is not null
then Non-Active
when current_year is null
and one_prev_period is null
and two_prev_period is null
then Churned
end
, count(1)
from master_table
group by 1,2,3

And voila! We now have a beautiful chart that groups customers into different categories each quarter depending on their activity in past quarters.

The same logic can be applied to go even further back or forward to grab the periods needed for classification. Now that you can classify data by rolling through aggregated periods, the world is your oyster!…or at least your dataset.