Grouping Data

Understanding Data Grouping

Grouping lets you divide data into logical sets so that you can perform aggregate calculations on each group.

Creating Groups

Groups are created using the GROUP BY clause in your SELECT statement. The best way to understand this is to look at an example:

SELECT vend_id, COUNT(*) AS num_prods
FROM Products
GROUP BY vend_id;

The above SELECT statement specifies two columns, _vendid, which contains the ID of a product’s vendor, and _numprods, which is a calculated field (created using the COUNT(*) function). The GROUP BY clause instructs the DBMS to sort the data and group it by vend_id. This causes num_prods to be calculated once per vend_id rather than once for the entire table. As you can see in the output, vendor BRS01 has 3 products listed, vendor DLL01 has 4 products listed, and vendor FNG01 has 2 products listed.

Because you used GROUP BY, you did not have to specify each group to be evaluated and calculated. That was done automatically. The GROUP BY clause instructs the DBMS to group the data and then perform the aggregate on each group rather than on the entire result set.

Filtering Groups

In addition to being able to group data using GROUP BY, SQL also allows you to filter which groups to include and which to exclude. The HAVING clause is used, it works similar to the WHERE keyword but for groups.

Look at the following example:

SELECT cust_id, COUNT(*) AS orders
FROM Orders
GROUP BY cust_id
HAVING COUNT(*) >= 2;

The first three lines of this SELECT statement are similar to the statements seen above. The final line adds a HAVING clause that filters on those groups with a _COUNT(*) >= 2_—two or more orders.

As you can see, a WHERE clause does not work here because the filtering is based on the group aggregate value, not on the values of specific rows.

The Difference Between HAVING and WHERE

Here’s another way to look it: WHERE filters before data is grouped, and HAVING filters after data is grouped. This is an important distinction; rows that are eliminated by a WHERE clause will not be included in the group. This could change the calculated values which in turn could affect which groups are filtered based on the use of those values in the HAVING clause.

Using both WHERE and HAVING

SELECT vend_id, COUNT(*) AS num_prods
FROM Products
WHERE prod_price >= 4
GROUP BY vend_id
HAVING COUNT(*) >= 2;

The first line is a basic SELECT using an aggregate function—much like the examples thus far. The WHERE clause filters all rows with a prod_price of at least 4. Data is then grouped by vend_id, and then a HAVING clause filters just those groups with a count of 2 or more. Without the WHERE clause an extra row would have been retrieved (vendor DLL01 who sells four products all priced under 4) as seen here:

SELECT vend_id, COUNT(*) AS num_prods
FROM Products
GROUP BY vend_id
HAVING COUNT(*) >= 2;

Grouping and Sorting

Lets look at ordering the results from a grouping. The following statement retrieves the order number and number of items ordered for all orders containing three or more items:

SELECT order_num, COUNT(*) AS items
FROM OrderItems
GROUP BY order_num
HAVING COUNT(*) >= 3;

To sort the output by number of items ordered, all you need to do is add an ORDER BY clause, as follows:

SELECT order_num, COUNT(*) AS items
FROM OrderItems
GROUP BY order_num
HAVING COUNT(*) >= 3
ORDER BY items, order_num;