So, after reading Helen's great post here:
Article No Longer Available
I decided it was high time I should contribute a post or two to DEV, so here goes with my first effort - a look at an underused feature of SQL, the ability to add totals and subtotals to results using the ROLLUP and CUBE operators.
First, a disclaimer. The SQL language is defined by an ISO standard (9075), but as with many "standards" few implementations adhere completely to the standard. Some database systems omit some features, others add their own extensions. That said, what follows is applicable to many of the common database platforms you might use, such as Microsoft SQL Server, Oracle, PostgreSQL and Presto/AWS Athena.
SQL GROUP BY 101
A simple SELECT * FROM <table>;
SQL statement will return rows of data from a table in a form that conceptually matches how those rows and columns of data are stored in the table. When designing an application that works with a database, you'll typically want to work with the data like this, querying rows for display or adding or updating rows with new information.
However, when using SQL for analytics we often don't want the information in every single row of a table. We want to aggregate it, for example retrieving the SUM or AVERAGE of the values in one of the columns:
SELECT SUM(UnitsSold) AS TotalUnitsSold
FROM Orders;
TotalUnitsSold |
---|
550 |
Great! So now we have an aggregate value for the whole table. But more realistically we don't want just the total for the table, we want the total broken down for each value in one of the other columns, such as Supplier
. So we do something like this:
SELECT Supplier, SUM(UnitsSold) AS TotalUnitsSold
FROM Orders
GROUP BY Supplier;
Supplier | TotalUnitsSold |
---|---|
Wonka's Chocolate Factory | 100 |
Trotters Independent Trading | 200 |
Kwik-E-Mart | 250 |
We've used the GROUP BY
clause of the SQL statement to tell the database to SUM
the values of the UnitsSold
column for each unique value of Supplier
.
We can extend this to group by two or more columns, so we get a total for each unique combination of values in the grouping columns:
SELECT Supplier, City, SUM(UnitsSold) AS TotalUnitsSold
FROM Orders
GROUP BY Supplier, City;
Supplier | City | TotalUnitsSold |
---|---|---|
Wonka's Chocolate Factory | Munich | 100 |
Trotters Independent Trading | New York | 10 |
Trotters Independent Trading | Paris | 20 |
Trotters Independent Trading | Peckham | 170 |
Kwik-E-Mart | Springfield | 150 |
Kwik-E-Mart | Shelbyville | 100 |
Nice. Now we've got the basis of a quick report or data extract.
Hold on though, how many units did "Trotters Independent Trading" sell? Now we've added multiple levels of grouping (Supplier -> City) we have to add those up again to get totals for the higher levels of grouping (e.g. Supplier).
If you're going to put this query into a reporting or visualisation tool such as Tableau, Power BI or MicroStrategy then your work here is probably done, as those tools can be configured to calculate totals for you when the data gets displayed.
However, if you're just trying to get some answers quickly out of a database (the boss wants those monthly sales figures stat!) then you're either going to export the data to another tool like Excel to work that out, or, horror of horrors, manually calculate the totals yourself to append to the results (I've seen it done). If you're a bit more confident you might try creating multiple SQL statements with different levels of grouping an appending them together using UNION ALL
.
ROLLUP, ROLLUP! The GROUP BY circus is in town
Don't panic, SQL is here to save your sanity with the often overlooked ROLLUP and CUBE clauses. Essentially what these SQL statement clauses allow you to do is get the database to do the heavy lifting for you and calculate totals and subtotals in the same set of results.
Let's go back to our example and say we want to get subtotals for each Supplier
and also an overall total for all orders, so we add ROLLUP
to our GROUP BY
clause:
SELECT Supplier, City, SUM(UnitsSold) AS TotalUnitsSold
FROM Orders
GROUP BY ROLLUP(Supplier, City);
Supplier | City | TotalUnitsSold |
---|---|---|
Wonka's Chocolate Factory | Munich | 100 |
Wonka's Chocolate Factory | NULL | 100 |
Trotters Independent Trading | New York | 10 |
Trotters Independent Trading | Paris | 20 |
Trotters Independent Trading | Peckham | 170 |
Trotters Independent Trading | NULL | 200 |
Kwik-E-Mart | Springfield | 150 |
Kwik-E-Mart | Shelbyville | 100 |
Kwik-E-Mart | NULL | 250 |
NULL | NULL | 550 |
Notice we now have an extra row for each Supplier
, with a NULL
value in the City
column. This represents the sum of all UnitsSold for each Supplier. There's also a row at the bottom that has a NULL
value for both Supplier and City, this is the total UnitsSold for all Suppliers and all Cities (i.e. the total for the whole table) - notice this matches our total from the first simple SUM query we wrote.
So that's it?
Well that's great in that we now have our subtotals, but all those NULL
values make this a bit hard to read (and confusing to any non-SQL people you show this to). Luckily most database platforms also implement a function called GROUPING (sorry Presto/Athena users, you're in the unlucky camp on this one). This function accepts a column name and simply returns 1 (true) or 0 (false) to indicate if the current row contains a subtotal for that column e.g. GROUPING(Supplier)
will return 1 if the row contains a subtotal for Supplier or 0 if it's a regular row.
One of the things we can do with this GROUPING function is to make our results look at bit clearer by replacing those NULL
values with something a bit more meaningful:
SELECT
CASE
WHEN GROUPING(Supplier) = 1
THEN '-All Suppliers-'
ELSE Supplier
END AS Supplier
, CASE
WHEN GROUPING(City) = 1
THEN '-All Cities-'
ELSE City
END AS City
, SUM(UnitsSold) AS TotalUnitsSold
FROM Orders
GROUP BY ROLLUP(Supplier, City);
Supplier | City | TotalUnitsSold |
---|---|---|
Wonka's Chocolate Factory | Munich | 100 |
Wonka's Chocolate Factory | -All Cities- | 100 |
Trotters Independent Trading | New York | 10 |
Trotters Independent Trading | Paris | 20 |
Trotters Independent Trading | Peckham | 170 |
Trotters Independent Trading | -All Cities- | 200 |
Kwik-E-Mart | Springfield | 150 |
Kwik-E-Mart | Shelbyville | 100 |
Kwik-E-Mart | -All Cities- | 250 |
-All Suppliers- | -All Cities- | 550 |
Now that is much easier to read! Time to email the numbers to the boss and nip down to the pub for a celebratory pint.
Wait, didn't you mention something about CUBE?
Ah yes, CUBE. To be honest, the use cases for CUBE are rarer than for ROLLUP but it's good to know about it just in case.
CUBE is basically ROLLUP on steroids - ROLLUP creates subtotals for each sublevel of grouping based on the order of the grouping columns (i.e. in our example there are subtotals for Supplier
and then "all rows", but not City
because that's the "bottom" level of the grouping).
CUBE on other hand creates subtotals for every combination of values in the grouping columns.
How is this useful? Let's look at our example again. Up to now these examples have been assuming that there's a natural hierarchy between Supplier and City i.e. one supplier to many cities (a one-to-many or 1:N relationship), but let's say that actually these are independent attributes of an order so that one Supplier can have many Cities and one City can be related to many Suppliers (a many-to-many or M:N relationship). In that case we don't just want subtotals for Supplier, we want subtotals for each City as well.
Simply replace ROLLUP
with CUBE
in our SQL statement and we now get this result:
SELECT
CASE
WHEN GROUPING(Supplier) = 1
THEN '-All Suppliers-'
ELSE Supplier
END AS Supplier
, CASE
WHEN GROUPING(City) = 1
THEN '-All Cities-'
ELSE City
END AS City
, SUM(UnitsSold) AS TotalUnitsSold
FROM Orders
GROUP BY CUBE(Supplier, City);
Supplier | City | TotalUnitsSold |
---|---|---|
Wonka's Chocolate Factory | Munich | 100 |
Wonka's Chocolate Factory | -All Cities- | 100 |
Trotters Independent Trading | New York | 10 |
Trotters Independent Trading | Paris | 20 |
Trotters Independent Trading | Peckham | 170 |
Trotters Independent Trading | -All Cities- | 200 |
Kwik-E-Mart | Springfield | 150 |
Kwik-E-Mart | Shelbyville | 100 |
Kwik-E-Mart | -All Cities- | 250 |
-All Suppliers- | Munich | 100 |
-All Suppliers- | New York | 10 |
-All Suppliers- | Paris | 20 |
-All Suppliers- | Peckham | 170 |
-All Suppliers- | Springfield | 150 |
-All Suppliers- | Shelbyville | 100 |
-All Suppliers- | -All Cities- | 550 |
Now we have subtotals for each Supplier
and also separate subtotals for each City
, plus the overall total!
Fantastic! Now I'm going to add ROLLUP to all my queries!
Woah there cowboy! So as I briefly mentioned, if you're writing a query that's going to be used by a reporting tool like Tableau or whatever then this technique isn't really helpful as those tools will usually do the totalling for you. In fact, this is usually an anti-pattern and not advisable because your reporting tool won't be able to differentiate the subtotal rows from the regular rows and that will make formatting your report a lot harder.
However, if you're working directly in SQL and just want to generate some data for a quick ad-hoc report or analysis then ROLLUP or CUBE can save you some time and effort in getting the results you need.
Top comments (5)
I love sql tut like this and @helen Anderson. I started digging into sql as I did for excel and VBA to make myself faster and help collegue who are not tech familiar with loads of data. Even if I am not going use this is the next future, is always good to see there is a vanilla way to do these things. Thank you for share!
Great post! simple explanations with easy-to-understand samples - keep it up!
Whoa, amazing! Those are great tools to have in our repertoire. It is always great to learn a new useful SQL construction. Thank you!
I have a query that I run with a roll up and was having trouble getting rid of the NULLs. I'm trying this tomorrow!
Great first post!