GROUP BY, HAVING, and aggregates

Every business runs on KPIs — Key Performance Indicators. Runehold’s leadership team gets a Monday-morning dashboard with “revenue per country last week,” “average order value per channel,” “top 10 SKUs by gross margin,” “customer count by acquisition source.” All of those are GROUP BY queries.

If you understand GROUP BY, you can turn 10 million rows of orders into a 20-row executive summary. If you don’t, you spend hours building things in Excel that should have taken 30 seconds in SQL. Today’s lesson teaches the pattern.

What aggregation means

An aggregate function takes many rows and returns one value. The big five:

SUM(expr) — total
COUNT(*) / COUNT(expr) — row count (we covered the NULL difference in lesson 7)
AVG(expr) — arithmetic mean
MIN(expr) / MAX(expr) — smallest / largest

Plus:

STDEV(expr), VAR(expr) — standard deviation, variance. Rare but useful.
STRING_AGG(expr, separator) — concatenate strings. We’ll cover in lesson 14.

Without a GROUP BY, aggregate functions collapse the whole result set to one row:

SELECT SUM(Total)      AS total_revenue,
       AVG(Total)      AS average_order_value,
       COUNT(*)        AS order_count,
       MIN(OrderDate)  AS first_order_ever,
       MAX(OrderDate)  AS most_recent_order
FROM Sales.Orders;

That query answers “what do we know in aggregate about all orders?” One row out. Clean.

Add a GROUP BY and you get one row per group:

SELECT CountryCode,
       SUM(Total)  AS revenue,
       COUNT(*)    AS orders,
       AVG(Total)  AS avg_order_value
FROM Sales.Orders
GROUP BY CountryCode
ORDER BY revenue DESC;

That’s “revenue, order count, and average order value, broken down by country.” Four or five rows out. Marketing loves this query. This is how your dashboard’s “revenue per country” tile is computed.

The GROUP BY rule

Every column in your SELECT list must be either:

Inside an aggregate function, or
Listed in the GROUP BY clause.

Why? Because if you try to project a column that isn’t in GROUP BY and isn’t aggregated, the engine has no idea which of the many grouped rows’ values to show you. SQL Server returns an error: “Column ‘X’ is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.”

-- ERROR
SELECT CountryCode, Name, SUM(Total)
FROM Sales.Orders
JOIN Sales.Customer ON ...
GROUP BY CountryCode;
-- Name is not grouped and not aggregated.

-- FIX 1: add it to the group
SELECT CountryCode, Name, SUM(Total)
FROM Sales.Orders
JOIN Sales.Customer ON ...
GROUP BY CountryCode, Name;

-- FIX 2: aggregate it
SELECT CountryCode, MIN(Name), SUM(Total)
FROM Sales.Orders
JOIN Sales.Customer ON ...
GROUP BY CountryCode;

MySQL and Postgres in “loose” modes let you break this rule; SQL Server does not. This is a feature. “Loose” group-by always breaks in subtle ways when data changes. SQL Server saves you from yourself.

WHERE vs HAVING

WHERE filters rows before grouping. HAVING filters groups after aggregation. They are not interchangeable:

-- Countries with more than 5 orders AND total revenue above 500 EUR
SELECT CountryCode,
       COUNT(*)    AS orders,
       SUM(Total)  AS revenue
FROM Sales.Orders
WHERE OrderDate >= '2026-01-01'          -- per-row filter: before grouping
GROUP BY CountryCode
HAVING COUNT(*) > 5                       -- group filter: after aggregation
   AND SUM(Total) > 500;

Rule of thumb:

If your filter can be evaluated on a single row of the source table, use WHERE.
If your filter needs an aggregate (COUNT(*), SUM(...), etc.), it has to be HAVING.

Whenever you see a condition that could go in WHERE, put it in WHERE — the engine filters earlier, which is cheaper. Only aggregate-based filters belong in HAVING.

`DISTINCT` inside aggregates

Another useful trick. You can wrap DISTINCT around the argument to an aggregate:

-- How many unique customers placed an order last month?
SELECT COUNT(DISTINCT CustomerId) AS unique_customers
FROM Sales.Orders
WHERE OrderDate >= '2026-03-01' AND OrderDate < '2026-04-01';

-- Average order value per unique SKU per country
-- "On average, across products we sold in each country, what was the per-SKU revenue?"
SELECT CountryCode,
       SUM(Total) / COUNT(DISTINCT ProductId) AS avg_rev_per_sku
FROM Sales.OrderLine
JOIN Sales.Orders ON ...
GROUP BY CountryCode;

COUNT(DISTINCT customer_id) is the KPI most teams refer to as “active customers” or “MAU/DAU” (monthly/daily active users). Different from COUNT(*) when any customer has multiple orders.

`GROUP BY` with multiple columns

Group by as many columns as you want; you get one row per unique combination:

-- Revenue per country per month
SELECT CountryCode,
       YEAR(OrderDate)  AS yr,
       MONTH(OrderDate) AS mo,
       SUM(Total)       AS revenue,
       COUNT(*)         AS orders
FROM Sales.Orders
WHERE OrderDate >= '2026-01-01'
GROUP BY CountryCode, YEAR(OrderDate), MONTH(OrderDate)
ORDER BY yr, mo, revenue DESC;

Note: using YEAR(OrderDate) and MONTH(OrderDate) in GROUP BY is fine, but it prevents an index on OrderDate from being used for range seeks. Common alternative: group by a date-truncated expression, or pre-compute a YearMonth column in the schema. We’ll do that refactor properly in lesson 15.

The business KPIs, in SQL

Here are some actual KPIs Runehold’s finance and marketing teams run, written out in full. They’re useful as a reference when you’re asked “can you get me…?” by someone non-technical.

Total revenue

SELECT SUM(Total) AS gmv_eur          -- GMV = Gross Merchandise Value
FROM Sales.Orders
WHERE OrderDate >= '2026-01-01'
  AND OrderDate  < '2026-04-01';

Average Order Value (AOV)

SELECT SUM(Total) * 1.0 / COUNT(*) AS aov_eur
FROM Sales.Orders
WHERE OrderDate >= '2026-01-01' AND OrderDate < '2026-04-01';

The * 1.0 forces the division to return a decimal instead of integer division. Tiny detail, big difference: 15 / 7 is 2; 15 * 1.0 / 7 is 2.142857....

Conversion rate

“Sessions that resulted in a purchase, divided by total sessions.”

-- Schema sketch: Marketing.Session tracks visits, has HasPurchase bit
SELECT
    CountryCode,
    COUNT(*)                                           AS sessions,
    SUM(CAST(HasPurchase AS INT))                      AS purchasing_sessions,
    SUM(CAST(HasPurchase AS INT)) * 100.0 / COUNT(*)   AS conversion_rate_pct
FROM Marketing.Session
WHERE SessionStart >= '2026-03-01'
GROUP BY CountryCode
ORDER BY conversion_rate_pct DESC;

SUM(CAST(HasPurchase AS INT)) is the idiomatic way to count rows matching a condition when you also want the total row count in the same aggregate. COUNT(CASE WHEN HasPurchase = 1 THEN 1 END) is equivalent and more readable in complex cases.

Customer count by country — “DAU proxy”

SELECT CountryCode,
       COUNT(DISTINCT CustomerId) AS unique_customers_last_30d
FROM Sales.Orders
WHERE OrderDate >= DATEADD(DAY, -30, GETDATE())
GROUP BY CountryCode
ORDER BY unique_customers_last_30d DESC;

Customer Acquisition Cost (CAC), a two-step KPI

CAC = marketing spend for a period / new customers acquired in that period.

DECLARE @start DATE = '2026-03-01';
DECLARE @end   DATE = '2026-04-01';

WITH spend AS (
    SELECT Channel, SUM(Amount) AS total_spend
    FROM Marketing.Spend
    WHERE SpendDate >= @start AND SpendDate < @end
    GROUP BY Channel
),
new_customers AS (
    SELECT AcquisitionChannel AS Channel, COUNT(*) AS new_count
    FROM Sales.Customer
    WHERE CreatedAt >= @start AND CreatedAt < @end
    GROUP BY AcquisitionChannel
)
SELECT
    s.Channel,
    s.total_spend,
    n.new_count,
    s.total_spend / NULLIF(n.new_count, 0) AS cac_eur
FROM spend AS s
LEFT JOIN new_customers AS n ON n.Channel = s.Channel
ORDER BY cac_eur DESC;

NULLIF(n.new_count, 0) prevents divide-by-zero. If a channel spent money and acquired nobody, you’d otherwise get a T-SQL error; NULLIF turns the zero into NULL and X / NULL is NULL. The row shows up with NULL CAC, which is correct — it’s undefined.

Lifetime Value (LTV), basic form

LTV per customer = sum of all their order totals.

SELECT c.CustomerId,
       c.Name,
       c.Country,
       COALESCE(SUM(o.Total), 0) AS lifetime_value_eur,
       COUNT(o.OrderId)          AS order_count
FROM Sales.Customer AS c
LEFT JOIN Sales.Orders AS o ON o.CustomerId = c.CustomerId
GROUP BY c.CustomerId, c.Name, c.Country
ORDER BY lifetime_value_eur DESC;

The LEFT JOIN + COALESCE(SUM(...), 0) pattern handles customers who’ve never ordered (they get LTV = 0 instead of NULL). Leadership always wants the zero-LTV customers in their exports — they’re the biggest untapped population.

`ROLLUP`, `CUBE`, `GROUPING SETS` — subtotals without pain

Want a report that shows per-country revenue and a grand total in the same result? WITH ROLLUP:

SELECT CountryCode,
       SUM(Total) AS revenue
FROM Sales.Orders
GROUP BY CountryCode
WITH ROLLUP;

Produces one row per country plus an extra row where CountryCode is NULL and revenue is the grand total.

CUBE gives you all combinations of subtotals (every column × every other). GROUPING SETS gives you explicit control:

SELECT CountryCode, YEAR(OrderDate) AS yr, SUM(Total) AS revenue
FROM Sales.Orders
GROUP BY GROUPING SETS (
    (CountryCode, YEAR(OrderDate)),   -- per country per year
    (CountryCode),                      -- per country (all years)
    (YEAR(OrderDate)),                   -- per year (all countries)
    ()                                   -- grand total
);

These are invaluable for pivot-style reports. Use GROUPING() to tell the difference between “true NULL” and “null from a rollup”:

SELECT CASE WHEN GROUPING(CountryCode) = 1 THEN 'TOTAL'
            ELSE CountryCode
       END AS label,
       SUM(Total) AS revenue
FROM Sales.Orders
GROUP BY CountryCode WITH ROLLUP;

Run this on your own machine

USE Runehold;
GO

-- Revenue per country last month
SELECT CountryCode,
       SUM(Total)   AS revenue_eur,
       COUNT(*)     AS order_count,
       AVG(Total)   AS aov_eur,
       MIN(Total)   AS smallest_order,
       MAX(Total)   AS biggest_order
FROM Sales.Orders
WHERE OrderDate >= DATEADD(MONTH, -1, GETDATE())
GROUP BY CountryCode
ORDER BY revenue_eur DESC;

-- With HAVING: only countries with 2+ orders
SELECT CountryCode,
       COUNT(*)   AS orders,
       SUM(Total) AS revenue
FROM Sales.Orders
GROUP BY CountryCode
HAVING COUNT(*) >= 2
ORDER BY revenue DESC;

-- Active customers (unique) per country
SELECT CountryCode,
       COUNT(DISTINCT CustomerId) AS unique_customers,
       COUNT(*)                   AS orders,
       SUM(Total)                 AS revenue
FROM Sales.Orders
GROUP BY CountryCode;

-- Customer LTV ranked
SELECT c.CustomerId,
       c.Name,
       c.Country,
       COALESCE(SUM(o.Total), 0) AS ltv_eur,
       COUNT(o.OrderId)          AS orders
FROM Sales.Customer AS c
LEFT JOIN Sales.Orders AS o ON o.CustomerId = c.CustomerId
GROUP BY c.CustomerId, c.Name, c.Country
ORDER BY ltv_eur DESC;

-- ROLLUP for grand total
SELECT CASE WHEN GROUPING(CountryCode) = 1 THEN 'ALL COUNTRIES'
            ELSE CountryCode
       END AS country_label,
       SUM(Total) AS revenue
FROM Sales.Orders
GROUP BY CountryCode WITH ROLLUP
ORDER BY GROUPING(CountryCode), revenue DESC;

Notice how each query maps to a question a business person would actually ask: “what’s our revenue last month?”, “which countries have repeat customers?”, “who are our best customers?”. That’s the target. Writing these in your sleep is what separates a data engineer from someone who occasionally runs SELECT queries.

Next lesson: subqueries vs CTEs. When to use which, how to write readable ones, and why a recursive CTE isn’t nearly as scary as it sounds.