<p><strong>A Brief Tutorial</strong></p>



<p>Group by is one of the most frequently used SQL clauses. It allows you to collapse a field into its distinct values. This clause is most often used with aggregations to show one value per grouped field or combination of fields.</p>



<p>Consider the following table:</p>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-1-groupby-blog.png" alt="Country chart" class="wp-image-73387"/></figure>



<p>We can use an SQL group by and aggregates to collect multiple types of information. For example, an SQL group by can quickly tell us the number of countries on each continent.</p>



<pre class="wp-block-code"><code>-- How many countries are in each continent?
select
  continent
  , count(*)
from 
  countries
group by 
  continent</code></pre>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-2-groupby-blog.png" alt="" class="wp-image-73392"/></figure>



<p>Keep in mind when using SQL GROUP BY:</p>



<ul><li>Group by X means put all rows with the same value for X into the same output row.</li><li>Group by X, Y means put all rows with the same values for both X and Y into the same output row.</li></ul>
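
<p>As a sketch of grouping by two fields, assuming the countries table also has a currency column (it is used in later queries), the following returns one row per distinct combination of continent and currency:</p>

<pre class="wp-block-code"><code>-- How many countries share each continent and currency combination?
select
  continent
  , currency
  , count(*)
from
  countries
group by
  continent
  , currency</code></pre>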






<h2 class="wp-block-heading"><strong>More Interesting Things About SQL GROUP BY</strong></h2>



<h3 class="wp-block-heading">1. Aggregations Can Be Filtered Using The HAVING Clause</h3>



<p>You will quickly discover that the where clause cannot be used on an aggregation. For instance:</p>



<pre class="wp-block-code"><code>select 
  continent
  , max(area)
from
  countries
where 
  max(area) &gt;= 1e7
group by 
  1</code></pre>



<p>will not work, and will throw an error. This is because the where clause is evaluated before any aggregations take place. The alternative, having, is placed after the group by and allows you to filter the returned data by an aggregated column.</p>



<p>Using having, you can return the aggregate filtered results!</p>
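
<p>A minimal sketch of the failing query rewritten with having, assuming the same countries table:</p>

<pre class="wp-block-code"><code>-- Keep only continents whose largest country has an area of at least 1e7
select
  continent
  , max(area)
from
  countries
group by
  1
having
  max(area) &gt;= 1e7</code></pre>

<p>A where clause can still appear before the group by to filter individual rows; having then filters the grouped results.</p>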



<h3 class="wp-block-heading">2. You Can Often GROUP BY Column Number</h3>



<p>In many databases, you can group by column number as well as column name. Our first query could have been written:</p>



<pre class="wp-block-code"><code>select 
  continent
  , count(*)
from 
  countries
group by 
  1</code></pre>



<p>and returned the same results. This is called ordinal notation, and its use is debated. It is a long-standing shorthand rather than something every database guarantees, and it has two commonly cited drawbacks:</p>



<ul><li>It is less explicit, which can reduce legibility for some users.</li><li>It can be more brittle. The select list can be reordered or have a column swapped and the query will continue to run, now grouping by a different column and producing an unexpected result.</li></ul>



<p>On the other hand, it has a few benefits.</p>



<ul><li>SQL coders tend toward a consistent pattern of selecting dimensions first and aggregates second. This makes reading SQL more predictable.</li><li>It is easier to maintain on large queries. When writing long ETL statements, I have had group by clauses that were many, many lines long, and I found them difficult to maintain.</li><li>Some databases allow using an aliased column in the group by. This allows a long case expression to be grouped without repeating the full expression in the group by clause. Using ordinal positions can be cleaner and prevent you from unintentionally grouping by an alias that matches a column name in the underlying data. For example, the following query will return the correct values:</li></ul>



<pre class="wp-block-code"><code>-- How many countries use a currency called the dollar?
select
  case when currency = 'Dollar' then currency
    else 'Other'
  end as currency --bad alias
  , count(*)
from
  countries
group by
  1</code></pre>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-3-groupby-blog.png" alt="Currency count table" class="wp-image-73397"/></figure>



<p>But this one will not. It will segment by the <strong>base table&#8217;s</strong> currency field <em>while accepting the new alias column labels</em>, so the label 'Other' can appear on several rows:</p>



<pre class="wp-block-code"><code>select
  case when currency = 'Dollar' then currency 
    else 'Other' 
  end as currency --bad alias
  , count(*)
from 
  countries
group by 
  currency</code></pre>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-4-groupby-blog.png" alt="Dollar and others chart" class="wp-image-73402"/></figure>



<p>This is &#8216;expected&#8217; behavior, but remain vigilant.</p>



<p>A common practice is to use ordinal positions for ad hoc work and column names for production code. This will ensure you are being completely explicit for future users who need to change your code.</p>
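
<p>One way to write the currency query explicitly, sketched here under the assumption that renaming the output column is acceptable (the alias currency_group is a hypothetical name), is to use an alias that does not collide with a base column and to repeat the full expression in the group by:</p>

<pre class="wp-block-code"><code>-- How many countries use a currency called the dollar?
select
  case when currency = 'Dollar' then currency
    else 'Other'
  end as currency_group -- hypothetical alias, distinct from the base column
  , count(*)
from
  countries
group by
  case when currency = 'Dollar' then currency
    else 'Other'
  end</code></pre>

<p>Because the group by repeats the expression itself, the result does not depend on how a given database resolves a name that is both an alias and a column.</p>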



<h3 class="wp-block-heading">3. The Implicit GROUP BY</h3>



<p>There is one case where you can take an aggregation without using a group by. When you aggregate over every row the query selects, there is an implied SQL group by that treats those rows as a single group. This is known as the &lt;grand total&gt; in SQL standards documentation.</p>



<pre class="wp-block-code"><code>-- What is the largest and average country size in Europe?
select
  max(area) as largest_country
  , avg(area) as avg_country_area
from 
  countries
where 
  continent = 'Europe'</code></pre>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-5-groupby-blog.png" alt="Largest country table" class="wp-image-73407"/></figure>
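
<p>One consequence of the implied single group is easy to miss: an aggregate query without a group by returns exactly one row even when no rows match the filter, while the same query with an explicit group by returns no rows at all. A small sketch, assuming 'Atlantis' is a hypothetical continent value that matches nothing in the countries table:</p>

<pre class="wp-block-code"><code>-- No group by: one row, with a count of 0 and a null maximum
select
  count(*) as country_count
  , max(area) as largest_country
from
  countries
where
  continent = 'Atlantis' -- hypothetical value assumed to match no rows
;

-- Explicit group by: one row per group, so zero rows when nothing matches
select
  continent
  , count(*) as country_count
from
  countries
where
  continent = 'Atlantis'
group by
  continent
;</code></pre>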



<h3 class="wp-block-heading">4. GROUP BY Treats Null as a Groupable Value, and That Is Strange</h3>



<p>When your data set contains multiple null values, group by will treat them as a single value and aggregate for the set.</p>



<p>This does not conform to the standard use of null, which is never equal to anything, including itself.</p>



<pre class="wp-block-code"><code>select null = null
-- returns null, not True</code></pre>



<p>From the SQL standards guidelines in SQL:2008:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p> “Although the null value is neither equal to any other value nor not equal to any other value — it is unknown whether or not it is equal to any given value — in some contexts, multiple null values are treated together; for example, the &lt;group by&gt; treats all null values together.”</p></blockquote>
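
<p>To see this grouping in practice, here is a sketch that assumes some rows in the countries table have a null currency (an assumption about the data, not something shown in the tables above). All of those rows come back as a single group:</p>

<pre class="wp-block-code"><code>-- Assumes some countries have a null currency (hypothetical data)
select
  currency
  , count(*) as country_count          -- counts every row in the group
  , count(currency) as non_null_count  -- counts only non-null values
from
  countries
group by
  currency</code></pre>

<p>For the null group, count(*) reports how many rows were collected, while count(currency) is 0 because count on a column skips nulls.</p>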



<h3 class="wp-block-heading">5. MySQL Allows You to GROUP BY Without Specifying All Your Non-Aggregate Columns</h3>



<p>In MySQL, when the ONLY_FULL_GROUP_BY SQL mode is disabled (it is enabled by default as of MySQL 5.7.5), you can run queries with only a subset of the select dimensions grouped and still get results. As an example, with that mode disabled, this will return an answer, populating the state column with an arbitrary value from those available in each group.</p>



<pre class="wp-block-code"><code>select 
  country
  , state
  , count(*)
from
  countries
group by 
  country</code></pre>
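
<p>If you want a result that does not depend on which value the database happens to pick, one option is to choose the state explicitly with an aggregate. The sketch below uses min, and the alias example_state is a hypothetical name. MySQL also offers an any_value() function for cases where an arbitrary value really is acceptable.</p>

<pre class="wp-block-code"><code>-- Portable alternative: pick the state value explicitly
select
  country
  , min(state) as example_state -- hypothetical alias
  , count(*)
from
  countries
group by
  country</code></pre>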



<p>That&#8217;s all for today! Group by is a commonly used keyword, but hopefully you now have a clearer understanding of some of its more nuanced uses. </p>




