better iceberg materialized views in galaxy (no staleness check)

Ok, YOU try to find a cool “staleness” image to put at the top of YOUR blog post about how Apache Iceberg materialized views handle potential staleness of underlying datasets they are created from. Yep… not so easy!!

The image above comes from a blog post with the same title, How to Make Bread Stale Fast?, and it is appropriate for this post as it calls out that “staleness” (even with bread) doesn’t always mean “bad”. Alright, let’s go a bit deeper into my topic for today as this is not a cooking show by any means. I do make a mean lettuce wrap appetizer, but I digress.

Recently, I published starburst galaxy’s materialized views (using apache iceberg) where I made a big deal about how “staleness” was addressed. In this context, I’m talking about when the situation arises (and it does VERY OFTEN) where the underlying dataset(s) that were used to populate a materialized view are updated. Materialized views are still relatively new in the big data world and generally they don’t (today at least!) automatically update any time a change occurs to the underlying dataset(s) they are populated from.

That said, as you’ll see in the Addressing Staleness section of the link above that I called out that the underlying Trino Iceberg Connector describes in detail how it DOES actively check for such a situation and then treats the materialized view as a basic view (i.e. queries the underlying table(s) instead).

The incredible Tom Nats recently pointed out to me that this wasn’t acting that way any longer and I had a mixed bag of test results when I did a double-check. Then when the tenacious Kyle (Major) Payne pinged me about it this morning I decided to take yet another look.

Low and behold, the Apache Iceberg materialized views running on Starburst Galaxy are NOT checking for staleness and always returning the values from the last time the REFRESH command was run!!

This, my friends, is a GOOD thing in my book. Maybe I’ll write a blog post on why I feel that way, but the nutshell is that data changes fast in the big data world and refreshing materialized views isn’t free. My personal belief is that data engineering pipeline which is populating/modifying the data is the best place to include, at appropriate ingestion points, the directive to rebuild the materialized view. Comments ALWAYS welcomed. 😉

If that’s enough of an explanation, then thanks for attending my TED Talk and I’ll see you next time. If you wanna SEE this in action, then come on in… the water’s feels fine!

Behind the Curtain

I was able to perform the same steps from starburst galaxy’s materialized views (using apache iceberg) and see the staleness condition ignored. Much of below is the same stuff from that post.

I created a couple of schemas then created, populated, and queried an Iceberg table.

CREATE SCHEMA mycloud.vmv_storage;
CREATE SCHEMA mycloud.vmv;
USE mycloud.vmv;

CREATE TABLE dune_characters (
  id integer,
  name varchar(55),
  notes varchar(255)
)
WITH (type = 'iceberg');
 
INSERT INTO dune_characters 
  (id, name, notes)
VALUES
  (101, 'Leto', 'Ruler of House Atreides'),
  (102, 'Jessica', 'Concubine of the Duke');

SELECT name, notes 
  FROM dune_characters;

Looking into the query details you can see these 2 rows came from the base table as you would expect.

Next, I created a materialized view, but did not perform an initial REFRESH on it before I queried it.

CREATE MATERIALIZED VIEW mat_view
WITH (storage_schema = 'vmv_storage')
AS SELECT name, notes 
     FROM dune_characters;

SELECT * FROM mat_view;

Looking into the query details you can see the engine was aware of mat_view, but ended up querying the dune_characters table for these initial 2 rows.

Refreshed the materialized view with the command below and queried it again to see the same (i.e. only) 2 rows.

REFRESH MATERIALIZED VIEW mat_view;

Notice below that the results actually came from the materialized view’s storage table.

Now, the big test. I added another row and queried the materialized view again.

INSERT INTO dune_characters 
  (id, name, notes)
VALUES
  (103, 'Paul', 'Son of Leto');
 
SELECT * FROM mat_view;

Like before, I only received the original 2 records and could verify the storage table was read. I added another record and checked again.

INSERT INTO dune_characters 
  (id, name, notes)
VALUES
  (103, 'Paul', 'Son of Leto');

Again, just got the 2 records and same read on the storage table. Final test is to refresh the materialized view and query it again.

REFRESH MATERIALIZED VIEW mat_view;

SELECT * FROM mat_view;

And yep, now all 4 rows are returned!!

I’m GLAD this change was made as the “check for staleness” was technically cool, but practically made Iceberg materialized views worthless since the underlying source table(s) are very likely changing all the time.

better iceberg materialized views in galaxy (no staleness check)

Behind the Curtain

Like this:

Related

Published by lestermartin

One thought on “better iceberg materialized views in galaxy (no staleness check)”

Leave a ReplyCancel reply

Behind the Curtain

Share this:

Like this:

Related

Published by lestermartin

One thought on “better iceberg materialized views in galaxy (no staleness check)”

Leave a ReplyCancel reply

Discover more from Lester Martin (l11n)