eliminate rollup’s null confusion (hint: grouping keyword)

This quick blog post is presented to help you distinguish aggregate null from stored null values when using ROLLUP, CUBE, and GROUPING_SETS in your SQL analytical queries.

To help visualize the problem, I will focus on ROLLUP. This scenario (and solution) should work with any SQL-compliant database, but I will use Starburst Galaxy for this tutorial. Why? Well, it is FREE (and yes, I work there!!).

Once you get all setup with Starburst Galaxy, find your way to the Query editor and under the Cluster explorer you should be able to find the astronauts table.

Let’s focus in on a small number of records to calculate our ROLLUP results from.

SELECT year_of_birth, name, 
       military_civilian, occupation
  FROM sample.demo.astronauts
 WHERE nationality = 'U.S.' 
   AND year_of_birth BETWEEN 1936 AND 1937
 ORDER BY nationality, year_of_birth, name, 
          military_civilian, occupation;

Let’s now see a ROLLUP working like we would expect.

SELECT year_of_birth,
       military_civilian,
       COUNT() AS nbr_trips
  FROM sample.demo.astronauts
 WHERE nationality = 'U.S.'
   AND year_of_birth BETWEEN 1936 AND 1937
 GROUP BY ROLLUP(year_of_birth, military_civilian)
 ORDER BY year_of_birth, military_civilian;

The totals are identified below.

  • 27 total records
    • 14 records for 1936
    • 13 records for 1937

That use of null works fine until we have a scenario where there are actually null values in the columns being rolled up. We can see this when we change the ROLLUP to function on the occupation column instead of military_civilian.

SELECT year_of_birth,
       occupation,
       COUNT() AS nbr_trips
  FROM sample.demo.astronauts
 WHERE nationality = 'U.S.'
   AND year_of_birth BETWEEN 1936 AND 1937
 GROUP BY ROLLUP(year_of_birth, occupation)
 ORDER BY year_of_birth, occupation;

The totals are calculated similarly as before.

  • 27 total records
    • 14 records for 1936
    • 13 records for 1937
    NOTE: The confusing thing is that we have two rows with 1936 & null. When looking at these results (and the original data shown earlier in this post) you can deduce what’s going on. There are two records for 1936 that have null values and those 2 rollup into the 14 total for 1936.

By adding GROUPING into the query below, for the column in question, we can get some more insights into what is going on.

SELECT year_of_birth,
       occupation,
       GROUPING(occupation) AS occ_tot_ind,
       COUNT() AS nbr_trips
  FROM sample.demo.astronauts
 WHERE nationality = 'U.S.'
   AND year_of_birth BETWEEN 1936 AND 1937
 GROUP BY ROLLUP(year_of_birth, occupation)
 ORDER BY year_of_birth, occupation, occ_tot_ind;

If the GROUPING function returns 0 then the returned null is for a regular null value. If it returns a 1 then it means it is the special use case for the appropriate rollup total.

Published by lestermartin

Developer advocate, trainer, blogger, and data engineer focused on data lake & streaming frameworks including Trino, Hive, Spark, Flink, Kafka and NiFi.

Leave a Reply

Discover more from Lester Martin (l11n)

Subscribe now to keep reading and get access to the full archive.

Continue reading