Intro

The SQL clause GROUP BY takes the results of a SELECT command, divides it into groups, and produces a result set with one row per group. Aggregate functions can then be used to make summary info on each of the groups. Without GROUP BY, aggregate functions would be used on the whole result set made by the SELECT statement. The WITH ROLLUP and WITH CUBE clause provide additional detail or summary information to the result set produced by the SELECT and GROUP BY combination.

Basic Syntax

The syntax is as follows. The parts in bold were not included in my section on the SELECT Command.

SELECT [ALL | DISTINCT] [TOP n [PERCENT]] SelectList
[INTO NewTable]
[FROM TableSources]
[WHERE WhereRowSearchCondition]
[ GROUP BY [ALL] GroupList [WITH [ROLLUP | CUBE]]
 [HAVING HavingRowSearchCondition] ]
[ORDER BY OrderList [ASC | DESC]]

If a SELECT command has any combination of WHERE, GROUP BY, HAVING, and ORDER BY clauses, then they are applied in that order.

The GROUP BY clause takes the result set of the SELECT command so far and reduces it into aggregate rows according to the columns listed in the GroupList. The ALL keyword returns all records instead of eliminating duplicates, the default behavior. Here are a few particulars about the GROUP BY clause:

  • The GroupList has columns or non-aggregate expressions that reference a column. EG:
    SELECT CONVERT(varchar, Started_on, 112) TheDate, COUNT(*) Cnt
    FROM LogTable
    WHERE LogEvent = 'Pigs fly'
    GROUP BY CONVERT(varchar, Started_on, 112)
    ORDER BY 1 DESC
    
  • All non-computed items in the SelectList must exist in the GroupList.
  • A GROUP BY clause is best used with aggregate functions in the SelectList and/or the HavingRowSearchCondition. Some aggregate functions include the following:
    • Avg([distinct] expr)
    • Binary Checksum
    • Checksum
    • Checksum Agg
    • Count([distinct] expr)
    • Count Big([distinct] expr)
    • Grouping
    • Max([distinct] expr). A cheesy trick: Even though GROUP BY is used, you can select almost any column by using max or min. Careful.
    • Min([distinct] expr)
    • Sum([distinct] expr)
    • Stdev
    • Stdevp
    • Var
    • Varp

The HAVING clause is sort of like a WHERE clause for the results of a SELECT commands that make use of the GROUP BY clause. The difference is that the WHERE clause applies before aggregation occurs, where as the HAVING clause applies on the results after aggregation has occurred.

The ROLLUP clause specifies that in addition to the aggregate rows produced by the SELECT and GROUP BY combination, there will also be aggregates for the hierarchy of values in the columns (aka dimensions in data warehouse terminology) in the GroupList. This is sort of like grand totals. The order of the columns in the GroupList can affect the number of rows produced.

The WITH CUBE clause specifies that in addition to the aggregate rows produced by the SELECT and GROUP BY combination, there will also be aggregates for all combinations of values in the columns (aka dimensions in data warehouse terminology) in the GroupList. The order of the columns in the GroupList is irrelevant.

EGs

GROUP BY 0.0

Assume that the titles table has these values:

title_id title type pub_id price
BU1032 The Busy Executive's Database Guide business 1389 19.99
BU1111 Cooking with Computers: Surreptitious Balance Sheets business 1389 11.95
BU2075 You Can Combat Computer Stress! business 736 2.99
BU7832 Straight Talk About Computers business 1389 19.99
MC2222 Silicon Valley Gastronomic Treats mod_cook 877 19.99
MC3021 The Gourmet Microwave mod_cook 877 2.99
MC3026 The Psychology of Computer Cooking UNDECIDED 877
PC1035 But Is It User Friendly? popular_comp 1389 22.95
PC8888 Secrets of Silicon Valley popular_comp 1389 20
PC9999 Net Etiquette popular_comp 1389
PS1372 Computer Phobic AND Non-Phobic Individuals: Behavior Variations psychology 877 21.59
PS2091 Is Anger the Enemy? psychology 736 10.95
PS2106 Life Without Fear psychology 736 7
PS3333 Prolonged Data Deprivation: Four Case Studies psychology 736 19.99
PS7777 Emotional Security: A New Algorithm psychology 736 7.99
TC3218 Onions, Leeks, and Garlic: Cooking Secrets of the Mediterranean trad_cook 877 20.95
TC4203 Fifty Years in Buckingham Palace Kitchens trad_cook 877 11.95
TC7777 Sushi, Anyone? trad_cook 877 14.99

This SELECT command and GROUP BY clause might aggregate 18 rows into 3 aggregate rows.

SELECT pub_id
FROM titles
GROUP BY pub_id

Results:

pub_id
------
0736
0877
1389

GROUP BY 0.1

This SELECT command and GROUP BY clause might distinguish 18 rows via 2 columns for 8 aggregate rows. This will show two other things: (1) How many rows were aggregated into each aggregate row. (2) The average price from all the rows that made each aggregate row.

SELECT pub_id, type, cnt = count(*), avg(price) as avg
FROM titles
GROUP BY pub_id, type

Results:

pub_id  type            cnt     avg
---------------------------------------
0736    business        1       2.9900
1389    business        3       17.3100
0877    mod_cook        2       11.4900
1389    popular_comp    3       21.4750
0736    psychology      4       11.4825
0877    psychology      1       21.5900
0877    trad_cook       3       15.9633
0877    UNDECIDED       1       NULL

GROUP BY 0.2

This HAVING clause will show the same except but only for those whose average price is above a certain level. It will then order the result set by the average.

SELECT pub_id, type, count(*), avg(price)
FROM titles
GROUP BY pub_id, type
HAVING avg(price) > 10
ORDER BY avg(price)

Results:

pub_id  type
---------------------------------------
0736    psychology      4   11.4825
0877    mod_cook        2   11.4900
0877    trad_cook       3   15.9633
1389    business        3   17.3100
1389    popular_comp    3   21.4750
0877    psychology      1   21.5900

Duplicates 1.0

This example has a table (t1), which has more than one row per fkID, and returns a list of fkIDs that have more than one row and how many rows that fkID has.

    SELECT fkID, count(fkID) as rowCount
FROM t1
GROUP BY fkID
HAVING count(fkID) > 1
ORDER BY rowCount desc

Duplicates 1.1

This example has a table (t1) that has more than one row per fkID, and returns a list of rows for fkIDs that have only one row.

SELECT * FROM t1 WHERE fkID NOT IN(
SELECT fkID
FROM t1
GROUP BY fkID
HAVING count(fkID) > 1)

Duplicates 1.2

This example has a table (t1) that has more than one row per fkID, and returns one row for each distinct fkID.

SELECT * FROM t1 WHERE ID IN (
    SELECT MAX(ID)
    FROM t1
    GROUP BY fkID
)

First or last in a GROUP BY

Let us say that you have the following table Games and that it has other rows with other UserIDs.

ID      UserID   Score  Time
============================
1       1       7       9:01
2       1       8       9:02
3       1       6       9:03

Finding the max/min scores and times per user is easy:

SELECT MIN(Score) as MaxScore, MAX(Time) as MaxTime
FROM Games
GROUP BY UserID

If we want all the columns for a given UserID for the latest time?

-- In PostgreSQL this is trivial!
SELECT DISTINCT ON (UserID), *
FROM Games
ORDER BY UserID

Unfortunately, DISTINCT ON() is not standard SQL. Standard SQL has to use a much more limited and harder workaround.

SELECT *
FROM Games G JOIN (
    SELECT MAX(ID) as G2ID
    FROM Games
    GROUP BY UserID
) G2 ON G2.G2ID = G.ID
ORDER BY UserID

This cheesy solution assumes that Games.ID is in chronological order. If you wanted to do something like get the time of the max score, then you have to be much more creative.

Running total (cumulative sum)

How to do a running total. Assume a table with two fields (strDate and intPatient).

First we'll show how many patients there are per date.

SELECT SUM(intPatient) AS Patients, strDate AS [Date]
FROM fact_Patient
GROUP BY strDate
HAVING strDate IS NOT NULL
ORDER BY strDate

/* That makes something like this:
Date      Patients
==================
200501    1
200502    2
200503    1
*/

Second we'll do the running total.

SELECT
    P.strDate as [Date],
    (
        SELECT SUM(intPatient) FROM fact_Patient
        WHERE strDate <= P.strDate
    ) AS 'Cumulative Patients'
FROM fact_Patient AS P
GROUP BY strDate
HAVING strDate IS NOT NULL
ORDER BY strDate

/* That makes something like this:
Date      Cumulative Patients
=============================
200501    1
200502    3
200503    4
*/

For more, see "SQL Server: Calculating Running Totals, Subtotals and Grand Total Without a Cursor" [databasejournal.com/features/mssql/article.php/3112381] and "Calculating Running Totals" [http://www.sqlteam.com/article/calculating-running-totals].

ROLLUP

If an Inventory contains this list:

Item                 Color                Quantity
-------------------- -------------------- --------------------------
Table                Blue                 124
Table                Red                  223
Chair                Blue                 101
Chair                Red                  210

Then ROLLUP can generate aggregates for the hierarchy of values in Item and Color:

SELECT
    CASE WHEN (GROUPING(Item) = 1) THEN 'ALL'
        ELSE ISNULL(Item, 'UNKNOWN')
    END AS Item,
    CASE WHEN (GROUPING(Color) = 1) THEN 'ALL'
        ELSE ISNULL(Color, 'UNKNOWN')
    END AS Color,
    SUM(Quantity) AS QtySum
FROM Inventory
GROUP BY Item, Color WITH ROLLUP

Item                 Color                QtySum
-------------------- -------------------- --------------------------
Chair                Blue                 101.00
Chair                Red                  210.00
Chair                ALL                  311.00
Table                Blue                 124.00
Table                Red                  223.00
Table                ALL                  347.00
ALL                  ALL                  658.00

(7 row(s) affected)

Note that ROLLUP returns NULL for the column/dimension being aggregated but the GROUPING function was used to convert the NULL to ALL.

CUBE

If the same table is used as in the ROLLUP example, then CUBE can generate aggregates for all combinations of values in Item and Color:

SELECT
    CASE WHEN (GROUPING(Item) = 1) THEN 'ALL'
        ELSE ISNULL(Item, 'UNKNOWN')
    END AS Item,
    CASE WHEN (GROUPING(Color) = 1) THEN 'ALL'
        ELSE ISNULL(Color, 'UNKNOWN')
    END AS Color,
    SUM(Quantity) AS QtySum
FROM Inventory
GROUP BY Item, Color WITH ROLLUP

Item                 Color                QtySum
-------------------- -------------------- --------------------------
Chair                Blue                 101.00
Chair                Red                  210.00
Chair                ALL                  311.00
Table                Blue                 124.00
Table                Red                  223.00
Table                ALL                  347.00
ALL                  ALL                  658.00
ALL                  Blue                 225.00
ALL                  Red                  433.00 

(9 row(s) affected)

Note that that CUBE produced the two extra rows marked in red, i.e. summaries based on values in the second column Color.

OVER()

The OVER() clause actually does NOT have to use the GROUP BY clause, and yet it it allows you to use aggregate functions.

SELECT RunnerID, RaceID, MilesRun, DateOfRun
    ,SUM(MilesRun) OVER(PARTITION BY RunnerID) AS 'Total'
    ,AVG(MilesRun) OVER(PARTITION BY RunnerID) AS 'Avg'
    ,COUNT(MilesRun) OVER(PARTITION BY RunnerID) AS 'Count'
    ,MIN(MilesRun) OVER(PARTITION BY RunnerID) AS 'Min'
    ,MAX(MilesRun) OVER(PARTITION BY RunnerID) AS 'Max'
FROM RaceResults
WHERE RunnerID in (7,8);

The OVER() clause can also be used with various ranking functions.

SELECT RunnerID, Age
    ,ROW_NUMBER() OVER (ORDER BY a.Age) AS 'Row Number'
    ,RANK() OVER (ORDER BY a.Age) AS 'Rank'
    ,DENSE_RANK() OVER (ORDER BY a.Age) AS 'Dense Rank'
    ,NTILE(4) OVER (ORDER BY a.Age) AS 'Quartile'
FROM Runners;

See also "SQL Sever 2005: Using OVER() with Aggregate Functions" [http://www.sqlteam.com/article/sql-sever-2005-using-over-with-aggregate-functions].



GeorgeHernandez.comSome rights reserved