determining # of splits w/trino/starburst/galaxy (iceberg table format)

At the end of determining # of splits w/trino/starburst/galaxy (hive table format), I raised the question of how Trino decides the number of splits for the Apache Iceberg table format. I was hoping to find some cool properties like with Hive, but found this instead…

Splits-size while querying Iceberg data is static #10874

Based on what I discovered in the Hive blog post, this suggests that any file <= 128MB will be a single split and any file > 128MB will be broken up into 128MB increments (plus whatever is left over). Let’s verify.
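To make that rule concrete, here is a minimal Python sketch of how a single file would be carved up under it. This is not Trino's code. The function name `split_lengths` and the choice to treat 128MB as 128 * 1024 * 1024 bytes are my own assumptions for illustration.

```python
# Assumption: a static split size of 128MB, treated here as 128 * 1024 * 1024 bytes.
MB = 1024 * 1024
SPLIT_SIZE = 128 * MB


def split_lengths(file_size, split_size=SPLIT_SIZE):
    """Return the byte length of each split for a file of file_size bytes.

    Hypothetical helper: full split_size chunks, plus one smaller chunk
    for whatever is left over.
    """
    if file_size <= 0:
        return []
    full_chunks, remainder = divmod(file_size, split_size)
    lengths = [split_size] * full_chunks
    if remainder:
        lengths.append(remainder)
    return lengths


# A file at or under the limit stays whole; a larger one is broken up.
print(len(split_lengths(100 * MB)))
print(len(split_lengths(128 * MB)))
print([n // MB for n in split_lengths(200 * MB)])
```

Under these assumptions a 200MB file becomes one 128MB split and one 72MB split, while anything at or below 128MB stays a single split.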

Here are the S3 contents of an Iceberg table that I created and loaded some data into.

There are a total of 15 files — 4 of which are bigger than 128MB. That suggests that there would be 19 total splits when this table is queried since the ones > 128MB (and < 256MB) would be broken into two splits (i.e. 15 original files + 4 more splits = 19 total). Let’s verify.
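The same arithmetic can be sketched in a few lines of Python. The file sizes below are made up (the real ones were in the S3 listing). They only preserve the shape described above: 15 files, 4 of them between 128MB and 256MB. `total_splits` is a hypothetical helper, not anything from Trino.

```python
import math

SPLIT_SIZE_MB = 128  # assumed static split size from the issue referenced above


def total_splits(file_sizes_mb, split_size_mb=SPLIT_SIZE_MB):
    """Sum the expected split count over a list of file sizes given in MB."""
    return sum(max(1, math.ceil(size / split_size_mb)) for size in file_sizes_mb)


# Hypothetical sizes: 11 files under the limit, 4 files between 128MB and 256MB.
sizes_mb = [90] * 11 + [180] * 4

# 11 files * 1 split + 4 files * 2 splits
print(total_splits(sizes_mb))
```

The four larger files each contribute one extra split, which is where the 15 + 4 comes from.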

Here is the Trino console query details page that indicates the 4 workers in my cluster operated on a total of 19 splits.

To further verify this, I added more data to the table, which produced more files on both sides of the 128MB split size. I ended up with 29 files. Their sizes are shown below, along with the per-file number of splits and the overall total number of splits.

Here is the Trino console query details page that indicates the 4 workers in my cluster operated on these 51 splits.

After seeing all of that, my initial suggestion is that you are probably best served by files just under that split limit, since a file just over 128MB becomes one full split plus a small leftover one. Something in the 110-120MB range is likely your best bet, if you are lucky enough to control file sizes at that level for most files.

Published by lestermartin

Developer advocate, trainer, blogger, and data engineer focused on data lake & streaming frameworks including Trino, Hive, Spark, Flink, Kafka and NiFi.
