Periscope Data
The world’s fastest visualization platform for data analysts.

Adding Summary Statistics to your SQL Table

November 25, 2015

At Periscope Data we continuously help people build, diagnose, and update their data. When approaching a table of data for the first time, it helps our understanding to see summarizing statistics.

By looking at such staples as the range, sum, and average, you can get a quick glance at the shape and distribution of the data before taking a deep dive.

Today we will demonstrate how we append this data to our SQL tables. ​ 

Understanding Blog Post Metrics

Whenever we write a blog post, we want to understand if readers are visiting the Periscope Data homepage and signing up for a trial. We have tables blog_pingshomepage_pings, and signup_pings that let us observe when users visit the homepage after reading a blog post.

We want to count visitors to the blog even if they did not visit the homepage. Since blog_pings includes everyone in homepage_pings, which in turn includes all of signup_pings, we use left join to merge these records without losing any of the rows.

select
  post
  , count(distinct blog_pings.cookie) as blog
  , count(distinct homepage_pings.cookie) as homepage
  , count(distinct signup_pings.cookie) as signups
from
    blog_pings
    left join homepage_pings on
      blog_pings.cookie = homepage_pings.cookie
    left join signup_pings on
      blog_pings.cookie = signup_pings.cookie
group by
  1

Running this query gives us the counts of visitors who visited the blog page, the home page, and those who signed up for a trial of Periscope Data.

This table makes it very easy to analyze the conversion rates of the most active and least active blog posts, but not how an average post performed just by scrolling through the table. We can add the average and standard deviation as headers to the table to guide the analyst when checking up on a blog post.

First, we bundle the previous query into a temporary table using with blog_signups as (...)

with blog_signups as (
  select
    post
    , count(distinct blog_pings.cookie) as blog
    , count(distinct homepage_pings.cookie) as homepage
    , count(distinct signup_pings.cookie) as signups
  from
      blog_pings
      left join homepage_pings on
        blog_pings.cookie = homepage_pings.cookie
      left join signup_pings on
        blog_pings.cookie = signup_pings.cookie
  group by
    1
)

Now we can pull statistics like the total, mean, and standard deviation from blog_signups.

select 'Total', sum(blog), sum(homepage),
  sum(signups) from blog_signups
union select 'Average', avg(blog), avg(homepage),
  avg(signups) from blog_signups
union select 'Standard Deviation', stddev(blog),
  stddev(homepage), stddev(signups) from blog_signups

Having these statistics is great, but it becomes tedious to check multiple data sources. We can address this by simply folding our statistics into the main table by unioning them together.

select 'Total', sum(blog), sum(homepage),
  sum(signups) from blog_signups
union select 'Average', avg(blog),
  avg(homepage), avg(signups) from blog_signups
union select 'Standard Deviation', stddev(blog),
   stddev(homepage), stddev(signups) from blog_signups
union select * from blog_signups

This will make sorting the table difficult, since the Total, Average, and Standard Deviation are on either ends of the range of the data. To help us in sorting the data, we can add an index column that we will use as a primary sorting key. First, we add an index to blog_signups:

with blog_signups as (
  select
    1 as index
    , post
    , count(distinct blog_pings.cookie) as blog
    , count(distinct homepage_pings.cookie) as homepage
    , count(distinct signup_pings.cookie) as signups
  from
      blog_pings
      left join homepage_pings on
        blog_pings.cookie = homepage_pings.cookie
      left join signup_pings on
        blog_pings.cookie = signup_pings.cookie
  group by
    1, 2
)

Then we index the statistics in our order preference:

select 4, 'Total', sum(blog), sum(homepage),
  sum(signups) from blog_signups
union select 3, 'Average', avg(blog), avg(homepage),
  avg(signups) from blog_signups
union select 2, 'Standard Deviation', stddev(blog),
  stddev(homepage), stddev(signups) from blog_signups

Now we wrap this entire query into another with block which we call signup_statistics:

with signup_statistics as (
  with blog_signups as (...)
  select 4, 'Total', sum(blog), sum(homepage),
    sum(signups) from blog_signups
  union select 3, 'Average', avg(blog), avg(homepage),
    avg(signups) from blog_signups
  union select 2, 'Standard Deviation', stddev(blog),
    stddev(homepage), stddev(signups) from blog_signups
  union select * from blog_signups
)
select
  post
  , blog
  , homepage
  , signups
from
  signup_statistics
order by
  index desc
  , signups desc

And there we have it! Now whenever an author wants to check up on their post, they will be able to more easily understand how their post faired against the distribution.

Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?