iceberg materialized views in galaxy (no más storage_schema)

In my prior posts about using Iceberg materialized views on Galaxy, you might remember I was pushing the storage_schema property in the WITH clause. This was to prevent the ugly storage table (the one above looks pretty sweet) from showing up in the same schema. You gotta love the SaaS model as features show up all the time. This one I stumbled on isn’t in the Starburst Galaxy release notes either, but I’m glad it was rolled out.

Basically, you can just NOT call out where you want the storage table to be created at as it won’t be visible in the schema you create a materialized view for.

CREATE MATERIALIZED VIEW sf1_cust_mv
AS SELECT * FROM tpch.sf1.customer;

REFRESH MATERIALIZED VIEW sf1_cust_mv;

Only the materialized view shows up in the GUI as well as from SHOW TABLES.

But after running the SHOW CREATE ... command from above, you see that it is actually being persisted in the same schema. It just doesn’t show up in the list of tables & views which is a very cool thing.

Let’s run a query on the MV and then click on the query details icon (the “interesting” eye ball icon) to learn more.

Once the new page renders, click on the Advanced tab (it starts on General).

There’s a lot packed in that table about tables labeled as Tables (I just wanted to put “table” in a single sentence 3 times!!). From the bottom up it is showing that we queried the MV called sf1_cust_mv, but that no bytes were read from it. Keep going up, we can see that the table used in the MV definition was tpch.sf1.customer, but it also didn’t really get read since I did refresh the MV after creating it.

It then shows the actual name of the underlying (and hidden-ish) storage table. Mouse over it to see the full name.

Basically it is your_mv_name followed by $materialized_view_storage. You can verify they are returning the same data with a simple query like this which returns 0 rows.

SELECT * FROM sf1_cust_mv
 WHERE custkey BETWEEN 75001 AND 75010
EXCEPT
SELECT * FROM "sf1_cust_mv$materialized_view_storage"
 WHERE custkey BETWEEN 75001 AND 75010;

As one might expect, the cool metadata tables like $files won’t really work against the MV name.

It also doesn’t work when you TRY to get a bit fancy…

Fortunately, the Hive metadata columns still work to get at some of the information you might want and need, but only when you reference $materialized_view_storage.

SELECT "$path", min(custkey), max(custkey)
  FROM "sf1_cust_mv$materialized_view_storage"
 GROUP BY "$path";

HAPPY MATERIALIZING!

Published by lestermartin

Software development & data engineering trainer/evangelist/consultant currently focused on data lake frameworks including Hive, Spark, Kafka, Flink, NiFi, and Trino/Starburst.

Leave a comment