Understanding Outliers with Skew and Kurtosis in SQL
When you approach a distribution of data for the first time, it’s often helpful to pull out summary statistics to understand the domain of the data.
Mean and variance are certainly helpful for understanding the scope of the dataset, but to understand the shape of the data we often turn to generating the histogram and manually evaluating the curve of the distribution.
Distributions can be difficult to grok only by looking at the curve. Two additional summary statistics, skew and kurtosis, are a good next step for evaluating the shape of a distribution.
A Tale of Two Cities
Our motivating example will be analyzing housing prices from polygons I drew on Trulia. Our first example straddles the 101 covering parts of Palo Alto and East Palo Alto, and the second is the Pac Heights area defined by the streets Divisadero, Lombard, Van Ness, and Geary in San Francisco.
Let’s pretend we are researching the nature of affordable versus affluent housing across the Bay Area. Suppose we want to understand the shape of the distribution - is one of these neighborhoods more sharply polarized than the other?
To build intuition for our datasets, we can first compute the mean and standard deviation:
'Pac Heights' as Neighborhood
, round(avg(price), 2) as Mean
, round(stddev(price), 2) as Stddev
Our two datasets are very similar, with the average housing price being $4.8 million dollars and a standard deviation near $2.8 million in both cases.
Our next step is frequently to create the histograms that represent this data. Let’s create a simple histogram by rounding to the nearest million dollars:
And we can see the histograms for our Palo Alto neighborhood:
And Pac Heights neighborhood:
What can we conclude about the distribution of housing prices between our two neighborhoods? Since we are grouping linearly, we naturally have a more voluminous grouping on the left side, and a broader dispersion among the more expensive houses - but it is not totally straightforward for us to evaluate the shape just from our histograms.
Should we identify Palo Alto as more sharply divided for its abundance of homes at the relatively inexpensive $1 million dollar range, or Pac Heights for its column at $9 million?
Fortunately we can help our understanding by pulling out more information through the third and forth moments of our distributions.
Enter Skew and Kurtosis
There is a solution that doesn’t involve a judgment call. We can compute the skew, or skewness to understand if the outliers are biased towards the low or high end of our spectrum.
We can then compute the kurtosis of our distributions to understand if the variance in our distributions is more readily attributed to a few, extreme outliers (high kurtosis) or several, modest deviations from the mean (low kurtosis).
Let’s compute the sample skewness from our distributions:
sum(Skewness) * (n / ((n-1) * (n-2)))
sum(Skewness) * ((n+1) * n / ((n-1) * (n-2) * (n-3)))
((price - mean) / stddev)^3 as Skewness
, ((price - mean) / stddev)^4 as Kurtosis
, count(1) as n
cross join stats
Since our Palo Alto neighborhood has lower skewness but higher kurtosis than our Pac Heights neighborhood, we can conclude that there is less of a right-skew, i.e. there is a greater mixture of less expensive houses, but the most (and least) expensive houses tend to be further out along the extrema.