When you approach the distribution of data for the first time, it’s often helpful to pull out <a href="https://www.sisense.com/blog/sql-summary-statistics/">summary statistics</a> to understand the domain of the data.



Mean and variance are certainly helpful for understanding the scope of the dataset, but to understand the shape of the data we often turn to generating the histogram and manually evaluating the curve of the distribution.



Distributions can be difficult to grok only by looking at the curve. Two additional summary statistics, skew and kurtosis, are a good next step for evaluating the shape of a distribution.



<h2 class="wp-block-heading">A Tale of Two Cities</h2>



Our motivating example will be analyzing housing prices from polygons I drew on <a href="https://www.trulia.com/" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">Trulia</a>. Our first example straddles the 101, covering parts of Palo Alto and East Palo Alto, and the second is the Pac Heights area bounded by the streets Divisadero, Lombard, Van Ness, and Geary in San Francisco:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/GS1lEcg.png" alt="Trulia map" class="wp-image-77623"/></figure>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/7IUegaw.png" alt="Trulia map 2" class="wp-image-77629"/></figure>



Let’s pretend we are researching the nature of affordable versus affluent housing across the Bay Area. Suppose we want to understand the shape of the distribution &#8211; is one of these neighborhoods more sharply polarized than the other?



To build intuition for our datasets, we can first compute the mean and standard deviation:



<pre class="wp-block-code"><code>select
 'Pac Heights' as Neighborhood
 , round(avg(price), 2) as Mean
 , round(stddev(price), 2) as Stddev
from
 pac_heights</code></pre>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/mean.png" alt="Mean" class="wp-image-77635"/></figure>



Our two datasets are very similar, with an average housing price of $4.8 million and a standard deviation near $2.8 million in both cases.
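
To put both neighborhoods in a single result, the same query can be stacked with union all. This is only a sketch: it assumes the Palo Alto listings live in a palo_alto table with the same price column, as the histogram query in this post suggests.

<pre class="wp-block-code"><code>select
 'Pac Heights' as Neighborhood
 , round(avg(price), 2) as Mean
 , round(stddev(price), 2) as Stddev
from
 pac_heights
union all
-- assumed: palo_alto has the same price column as pac_heights
select
 'Palo Alto' as Neighborhood
 , round(avg(price), 2) as Mean
 , round(stddev(price), 2) as Stddev
from
 palo_alto</code></pre>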



Our next step is frequently to create the histograms that represent this data. Let’s create a simple histogram by rounding each price to the nearest million dollars (the query below assumes price is stored in millions):



<pre class="wp-block-code"><code>select
 round(price, 0)
 , count(1)
from
 palo_alto
group by
 1
order by
 1</code></pre>



And we can see the histogram for our Palo Alto neighborhood:



<figure class="wp-block-image"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/3tKEDqg-770x226.png" alt="Palo Alto histograms" class="wp-image-77641"/></figure>



And Pac Heights neighborhood:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/I6DZT1x.png" alt="Pac Heights histogram" class="wp-image-77647"/></figure>



What can we conclude about the distribution of housing prices between our two neighborhoods? Since we are grouping linearly, we naturally have a more voluminous grouping on the left side, and a broader dispersion among the more expensive houses &#8211; but it is not totally straightforward for us to evaluate the shape just from our histograms.



Should we identify Palo Alto as more sharply divided for its abundance of homes in the relatively inexpensive $1 million range, or Pac Heights for its column at $9 million?



Fortunately, we can help our understanding by pulling out more information through the third and fourth <a href="https://en.wikipedia.org/wiki/Moment_(mathematics)" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">moments</a> of our distributions.



<h2 class="wp-block-heading">Enter Skew and Kurtosis</h2>



There is a solution that doesn’t involve a judgment call. We can compute the skew, or skewness, to understand whether the outliers are biased towards the low or high end of our spectrum.



We can then compute the <a href="https://en.wikipedia.org/wiki/Kurtosis" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">kurtosis</a> of our distributions to understand whether the variance is better attributed to a few extreme outliers (high kurtosis) or to many modest deviations from the mean (low kurtosis).
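
One way to see which listings drive these statistics is to rank each house by its standardized deviation from the mean. The sketch below is an illustration rather than part of the original analysis. It assumes a PostgreSQL-style dialect, and the stats CTE and the column names z_score, skew_term, and kurtosis_term are hypothetical.

<pre class="wp-block-code"><code>with stats as (
 -- hypothetical helper: one row holding the mean and standard deviation
 select
 avg(price) as mean
 , stddev(price) as stddev
 from
 pac_heights
)
select
 price
 , (price - mean) / stddev as z_score
 -- cubing keeps the sign, so this shows which tail a listing pulls toward
 , ((price - mean) / stddev)^3 as skew_term
 -- the fourth power is never negative and grows fastest for extreme listings
 , ((price - mean) / stddev)^4 as kurtosis_term
from
 pac_heights
 cross join stats
order by
 kurtosis_term desc
limit 10</code></pre>

If a handful of rows at the top account for most of the summed kurtosis_term, the variance is being driven by a few extreme listings rather than by many moderate ones.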



Let’s compute the sample skewness and kurtosis from our distributions:



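<pre class="wp-block-code"><code>select
 -- n * 1.0 avoids integer division, which would truncate the factor to 0
 sum(Skewness) * (n * 1.0 / ((n-1) * (n-2)))
 as Skewness,
 -- kurtosis sums the fourth powers of the standardized prices
 sum(Kurtosis) * ((n+1) * n * 1.0 / ((n-1) * (n-2) * (n-3)))
 as Kurtosis
from
 (
 select
 ((price - mean) / stddev)^3 as Skewness
 , ((price - mean) / stddev)^4 as Kurtosis
 -- window count attaches the total row count to every row
 , count(1) over () as n
 from
 pac_heights
 cross join stats
 ) moments
group by
 n</code></pre>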



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/skew.png" alt="Skew" class="wp-image-77653"/></figure>
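
The query above reads mean and stddev from a stats relation that is not defined in this post. A minimal sketch of one way to supply it, assuming a PostgreSQL-style dialect, is a CTE that also carries the row count n. The CTE body is an assumption for illustration, shown here against the palo_alto table to produce the second neighborhood's figures.

<pre class="wp-block-code"><code>with stats as (
 -- assumed definition: one row of summary statistics for the table
 select
 avg(price) as mean
 , stddev(price) as stddev
 -- multiplying by 1.0 keeps the later divisions out of integer arithmetic
 , count(1) * 1.0 as n
 from
 palo_alto
)
select
 sum(((price - mean) / stddev)^3) * (n / ((n-1) * (n-2)))
 as Skewness,
 sum(((price - mean) / stddev)^4) * ((n+1) * n / ((n-1) * (n-2) * (n-3)))
 as Kurtosis
from
 palo_alto
 cross join stats
group by
 n</code></pre>

Because stats has exactly one row, the cross join only attaches the same three values to every listing, and grouping by n yields a single result row.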





Since our Palo Alto neighborhood has lower skewness but higher kurtosis than our Pac Heights neighborhood, we can conclude that it has less of a right skew, i.e. a greater mixture of less expensive houses, but that its most (and least) expensive houses tend to sit further out at the extremes.

Understanding Outliers with Skew and Kurtosis in SQL
