recap of the inaugural iceberg summit (my top 5 observations)

I was a little late to the party by not really focusing on Apache Iceberg until the beginning of 2022, but my 10+ years working with Hive’s de-facto table format firmly place me in the believer camp. I hope some of my Iceberg posts capture this enthusiasm, too. I’m not suggesting the other modern table formats are bad, but I believe the openness, the adoption, and the features make Iceberg the best answer we have today.

It really is time for this first Iceberg Summit, not just Iceberg talks at all the other conferences. It is also time for (yet another) “top 5 observations” blog post! I’m waiting until I publish mine before I go read others, as I don’t want to be influenced. I hope you enjoy my recap & I encourage you to share your comments with me.

  1. Iceberg is pervasive
  2. The real fight is for the catalog
  3. Concurrent transactional writes are a bitch
  4. Append-only tables still rule
  5. Trino is widely adopted

Iceberg is pervasive

Every OSS-based processing engine supports Iceberg, which isn’t a big surprise. This includes the big data grandpa (Cloudera), but it doesn’t stop there. Even the extremely successful goliaths, Databricks & Snowflake (or, as I refer to them, Dataflakes & Snowbrick; hehe), can’t stop talking about how they are embracing Iceberg, despite the fact that both of them have valid business reasons to prefer ignoring it completely.

As a trainer & dev advocate at Starburst, I would be remiss not to mention that we are ALL-IN with Iceberg. Our CEO even coined the term Icehouse, which basically means a data lakehouse implemented with Iceberg & Trino.

“Iceberg is Ubiquitous at Apple”

Russell Spitzer

More importantly, real end users are leveraging Iceberg. Consumer-focused giants like Apple and Netflix, for sure, but vendors, service providers, and internal IT teams are also adopting Iceberg in droves. I encourage everyone to watch some of the session recordings for the feedback on features and performance/scale, which has been incredible.

The real fight is for the catalog

Years ago, the fight was for the file format. Columnar stores won, and despite my affinity for ORC, it is fair to say the winner is Parquet. Knowing there are other table formats, one could argue that the current fight is for the table format. For the folks that attended this summit, and for me, that fight is over and the leader is Iceberg. What many haven’t realized is that the real fight is centered on the catalog.

There are several different approaches to implementing an Iceberg catalog (don’t worry, I still say metastore at times, too). All the Iceberg committers are clearly on the REST catalog bandwagon, but in fairness, this conversation revolves solely around Iceberg tables. There are plenty of us out there that still need a catalog for something like an external Hive table backed by CSV files. That ain’t gonna live in the REST catalog.

The “real fight” is over who RUNS the catalog and how open they are to other frameworks and engines. Iceberg relies on the catalog for the atomic swap to a new version/snapshot of a table. You can’t have two catalogs simultaneously managing an Iceberg table, so you end up in one of these three situations (and only the last one is ideal).

  1. The catalog provider does not allow any other query engine to even access it. This means the engine hosting the catalog is the only engine that can read & write to the table.
  2. The catalog provider allows external engines to read from it. Yep, that’s exactly what it sounds like for queries: SELECT only.
  3. The catalog provider allows external engines first-class access. When you have a shared catalog/metastore, then, and only then, do we get the optionality that Iceberg promises (a minimal sketch of what this looks like follows this list).
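To make option #3 concrete, here is a minimal PyIceberg sketch of what first-class access to a shared catalog means in practice: any engine or library that can reach the same catalog endpoint sees the exact same table state. The endpoint and table name below are hypothetical placeholders, not anything from the summit.

```python
# A minimal sketch of option #3: pointing at a shared REST catalog.
# The endpoint and table name below are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "shared",  # a local nickname for this catalog connection
    **{
        "type": "rest",                        # the REST catalog implementation
        "uri": "https://catalog.example.com",  # hypothetical shared endpoint
    },
)

# Trino, Spark, Flink, or plain Python all see the same current snapshot,
# because the catalog (not any single engine) owns the table pointer.
table = catalog.load_table("analytics.page_views")
print(table.current_snapshot())
```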

I could easily pick on Snowbrick & Dataflakes here for offering the #2 option for their Iceberg tables, but to be totally transparent and fair, Starburst Galaxy is still stuck at #1. The big difference is that those big guys likely will NOT open theirs up, while it is far more likely that this Starburst feature will move off the implementation backlog and into production just as soon as a customer asks for it.

Concurrent transactional writes are a bitch

No matter which catalog is managing your Iceberg tables, having multiple writers concurrently trying to make changes to the content (or structure) of an Iceberg table is tough. The spec calls for the well-known optimistic locking strategy. In a nutshell, two writers can create all the changes they want on disk to support their updates, but once each is finally ready to commit, the snapshot ID listed in the catalog has to be the same as it was when that writer started; this is what tackles the Atomicity aspect of the ACID test.

This is a good thing, as it takes care of the Consistency property as well. It also ensures readers will only see committed versions. The bad news is for the other writer, who finds out that their commit failed because the snapshot they started from is no longer the current snapshot. If there’s some good news, it is that Iceberg is smart enough to check whether the failed commit overlaps with the partitions changed by any snapshots committed since it started processing; if there is no overlap, it can simply re-apply its changes against the new snapshot and attempt the commit again.
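For intuition, here is a toy sketch of that commit protocol. This is not PyIceberg or any real Iceberg API; `catalog` and its methods are hypothetical stand-ins, and in reality the compare-and-swap is a single atomic operation performed inside the catalog service rather than client-side code.

```python
# Toy sketch of Iceberg's optimistic commit protocol. `catalog` is a
# hypothetical client; the real compare-and-swap happens atomically
# inside the catalog service, not in user code like this.

class CommitConflict(Exception):
    """Another writer committed first; our base snapshot is stale."""

def commit_snapshot(catalog, table_name, base_snapshot_id, new_snapshot):
    # All data and metadata files for `new_snapshot` were already written
    # to object storage; this pointer swap alone decides who "wins".
    swapped = catalog.compare_and_swap(
        table_name,
        expected=base_snapshot_id,  # the snapshot we started from
        new=new_snapshot,           # the snapshot we want to publish
    )
    if not swapped:
        # The table moved on while we were working. Nothing we wrote is
        # visible to readers yet, so atomicity is preserved by refusing
        # the swap instead of clobbering the other writer's snapshot.
        raise CommitConflict(f"{table_name}: snapshot {base_snapshot_id} is stale")
```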

More good news is that, oftentimes, the batch or real-time ingestion pipeline is the primary piece of code making additions & modifications to a table, which greatly reduces this concern. It also still allows applications (such as a GDPR request to erase a user’s data) to commit or, worst case, to retry until they can commit.
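Continuing the toy sketch from above, that retry-until-commit pattern for an occasional writer (like the GDPR erasure job) could look something like this; again, `build_snapshot` and the catalog methods are hypothetical stand-ins:

```python
import time

def retry_until_committed(catalog, table_name, build_snapshot, max_attempts=5):
    # `build_snapshot` is a hypothetical callback that re-applies our
    # changes (e.g., the GDPR erasure) on top of the current snapshot.
    for attempt in range(max_attempts):
        base_id = catalog.current_snapshot_id(table_name)
        new_snapshot = build_snapshot(base_id)
        try:
            commit_snapshot(catalog, table_name, base_id, new_snapshot)
            return  # committed successfully
        except CommitConflict:
            time.sleep(2 ** attempt)  # back off, then rebase and retry
    raise RuntimeError(f"gave up committing to {table_name}")
```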

This is all some pretty great stuff, but it also reinforces the (hopefully) well-known theme that we shouldn’t see Iceberg (or any of the data lake table formats) as a replacement for a general-purpose RDBMS. Those things still have an important role today, tomorrow, and probably forever.

Append-only tables still rule

I hope this observation, and especially the previous one, doesn’t make me sound like I’m against Iceberg. I ❤ Iceberg, but again, this is still an OLAP-oriented, not OLTP-oriented, framework. Having ACID-compliant transactions (even if only single-statement/single-table) is awesome, even though the majority of the largest tables really don’t need it.

These ludicrous-scale tables still predominantly house time-series, immutable data. I was thrilled to see that many of the presenters even stated they are declaring these giant tables as Version 1 (Analytic Data Tables), which can only be appended to, as opposed to Version 2 (Row-level Deletes), as identified in the Format Version section of the spec. And yes, even v1 tables have snapshots and get the cool features of time-travel querying and table rollbacks.
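To illustrate that last point, here is a small hedged sketch of time travel with PyIceberg; the catalog endpoint and table name are hypothetical, and I’m assuming a reasonably current PyIceberg release.

```python
# Hedged sketch: time travel against a v1 (append-only) table.
# The endpoint and table name are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "shared",
    **{"type": "rest", "uri": "https://catalog.example.com"},
)
table = catalog.load_table("telemetry.sensor_readings")

# Every commit, even a plain append, left a snapshot behind.
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)

# Read the table as of its oldest recorded snapshot.
oldest_id = table.history()[0].snapshot_id
old_rows = table.scan(snapshot_id=oldest_id).to_arrow()
```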

The UPDATE, DELETE, and MERGE commands are awesome, especially for the tables that need them. My major point is that, in practice, we are building solutions where the biggest tables of gigantic scale are fundamentally INSERT-only, and it is cool that Iceberg lets you lock a table down to allow only that when it is all you need (see the sketch below).
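Here is what pinning a table to format version 1 might look like with PyIceberg; the schema, names, and endpoint are hypothetical, and most engines expose a similar knob as a table property in their CREATE TABLE DDL.

```python
# Hedged sketch: creating a v1 table (no row-level deletes) with PyIceberg.
# Schema, names, and endpoint are hypothetical placeholders.
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, NestedField, StringType, TimestamptzType

catalog = load_catalog(
    "shared",
    **{"type": "rest", "uri": "https://catalog.example.com"},
)

schema = Schema(
    NestedField(field_id=1, name="event_ts", field_type=TimestamptzType(), required=True),
    NestedField(field_id=2, name="device_id", field_type=StringType(), required=True),
    NestedField(field_id=3, name="reading", field_type=DoubleType(), required=False),
)

# "format-version": "1" keeps the table on the v1 spec, so row-level
# delete files are simply not possible; the table stays append-only friendly.
catalog.create_table(
    "telemetry.sensor_readings",
    schema=schema,
    properties={"format-version": "1"},
)
```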

Trino is widely adopted

Let me be crystal clear… EVERY session talked about the big guy on the block, Spark, but the majority of the sessions also talked about Trino. Usually, these were the ONLY two distributed processing engines being discussed. In a few sessions, it was even cool to see that folks were using Trino without any reference to Spark at all.

We can’t discount Spark (regardless of where you run it), but Trino can’t be discounted either as a competitive (performance, scale, and cost) solution for interactive queries and dashboards. In fact, with fault-tolerant execution and PyStarburst, a SaaS Icehouse offering like Starburst Galaxy (as well as Starburst Enterprise) is a potentially viable option for moving away from Spark, if it can solve all your performance/scalability & feature needs.


Please forgive the product promotion, but I just love me some Starburst Galaxy!!

Lastly, I double-checked with some others who attended the conference to see if it was only me, but they concurred: I heard Trino mentioned a TON of times, but I did not hear Presto mentioned a single time. Therefore, I declare the Trino/Presto war officially over: Trino wins!

