Periscope Data
The world’s fastest visualization platform for data analysts.

Understanding Outliers with Skew and Kurtosis in SQL

May 19, 2016

When you approach a distribution of data for the first time, it’s often helpful to pull out summary statistics to understand the domain of the data.

Mean and variance are certainly helpful for understanding the scope of the dataset, but to understand the shape of the data we often turn to generating the histogram and manually evaluating the curve of the distribution.

Distributions can be difficult to grok only by looking at the curve. Two additional summary statistics, skew and kurtosis, are a good next step for evaluating the shape of a distribution. ​

A Tale of Two Cities

Our motivating example will be analyzing housing prices from polygons I drew on Trulia. Our first example straddles the 101 covering parts of Palo Alto and East Palo Alto, and the second is the Pac Heights area defined by the streets Divisadero, Lombard, Van Ness, and Geary in San Francisco.

Let’s pretend we are researching the nature of affordable versus affluent housing across the Bay Area. Suppose we want to understand the shape of the distribution - is one of these neighborhoods more sharply polarized than the other?

To build intuition for our datasets, we can first compute the mean and standard deviation:

select
  'Pac Heights' as Neighborhood
  , round(avg(price), 2) as Mean
  , round(stddev(price), 2) as Stddev
from
  pac_heights

Our two datasets are very similar, with the average housing price being $4.8 million dollars and a standard deviation near $2.8 million in both cases.

Our next step is frequently to create the histograms that represent this data. Let’s create a simple histogram by rounding to the nearest million dollars:

select
  round(price, 0)
  , count(1)
from
  palo_alto
group by
  1
order by
  1

And we can see the histograms for our Palo Alto neighborhood:

And Pac Heights neighborhood:

What can we conclude about the distribution of housing prices between our two neighborhoods? Since we are grouping linearly, we naturally have a more voluminous grouping on the left side, and a broader dispersion among the more expensive houses - but it is not totally straightforward for us to evaluate the shape just from our histograms.

Should we identify Palo Alto as more sharply divided for its abundance of homes at the relatively inexpensive $1 million dollar range, or Pac Heights for its column at $9 million?

Fortunately we can help our understanding by pulling out more information through the third and forth moments of our distributions.

Enter Skew and Kurtosis

There is a solution that doesn’t involve a judgment call. We can compute the skew, or skewness to understand if the outliers are biased towards the low or high end of our spectrum.

We can then compute the kurtosis of our distributions to understand if the variance in our distributions is more readily attributed to a few, extreme outliers (high kurtosis) or several, modest deviations from the mean (low kurtosis).

Let’s compute the sample skewness from our distributions:

select
  sum(Skewness) * (n / ((n-1) * (n-2)))
    as Skewness,
  sum(Skewness) * ((n+1) * n / ((n-1) * (n-2) * (n-3)))
    as Kurtosis
from
  (
    select
      ((price - mean) / stddev)^3 as Skewness
      , ((price - mean) / stddev)^4 as Kurtosis
      , count(1) as n
    from
      pac_heights
      cross join stats
  )

Since our Palo Alto neighborhood has lower skewness but higher kurtosis than our Pac Heights neighborhood, we can conclude that there is less of a right-skew, i.e. there is a greater mixture of less expensive houses, but the most (and least) expensive houses tend to be further out along the extrema.

Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?