
Parallel processing engines like Trino/Starburst, Hive, and Spark perform and scale well because they can work on multiple “chunks” of the data at the same time. These frameworks use different terminology for these chunks, as shown below, but the terms generally mean the same thing.
| Engine | Term |
| --- | --- |
| Trino / Starburst | splits |
| Hive | blocks |
| Spark | partitions |
This article is geared toward those who understand, at least conceptually, how Hive and Spark work when consuming data from HDFS and/or object stores. And, of course, toward those Trino / Starburst folks who don’t know exactly how the number of splits is determined when using the Hive connector.
HDFS uses block sizes (default is 128MB) to control the “chunks” being processed in parallel.
In a nutshell
To help show how “small files” hurt everyone in the big data space, especially with Trino / Starburst (which I’m currently teaching), I created a couple of tables containing the same information, but with different file layouts.
| Table | File size | # of files | Notes |
| --- | --- | --- | --- |
| 5min_ingest | 500KB | 10,080 | New file created every 5 mins |
| daily_rollup | 50MB | 105 | Each day compacted into 3 files |
Of course, I got a massive performance improvement with the fewer, reasonably-sized files than with the 10,080 small files. That said, the behind-the-scenes breakdown of the number of splits was not exactly what I had imagined, and I wanted to find out more.
The 5min_ingest table’s behavior was easy enough to decipher; it had 10,080 splits which aligned to the 10,080 files.
Hive, which incurs a large amount of overhead spinning up a JVM to read data, has config properties that allow each process to read more than one of these small files. Trino’s worker daemons, by contrast, are always running. This dramatically lowers the scheduling overhead for processing splits (similar to what happens when a Spark program acquires, and holds onto, its resources), so Trino doesn’t have the same concern.
Additionally, and I think this is a good thing, Trino does not try to mask the small-files problem, which itself needs to be remedied at the source.
The mystery came when I queried daily_rollup. Ideally, the compaction could have created larger (thus fewer) files each day. I imagined these 105 somewhat reasonably-sized files would become 105 splits (i.e. one per file, as I saw with the small files). I ended up getting 205, which puzzled me. I (correctly) figured that the engine was breaking the files into 2 sections, thus giving me 2 splits per file, but the math didn’t really add up. That should be 210, not 205. My head began to hurt wondering what happened.
Thankfully, someone pointed me in the right direction (thanks, Anne!). Yep, RTFM time again! Nice warning, too!

These 3 properties are what I needed to look at a bit more.
| Property name | Description | Default value |
| --- | --- | --- |
| `hive.max-initial-splits` | For each table scan, the coordinator first assigns file sections of up to `max-initial-split-size`. After `max-initial-splits` have been assigned, `max-split-size` is used for the remaining splits. | 200 |
| `hive.max-initial-split-size` | The size of a single file section assigned to a worker until `max-initial-splits` have been assigned. Smaller splits result in more parallelism, which gives a boost to smaller queries. | 32MB |
| `hive.max-split-size` | The largest size of a single file section assigned to a worker. Smaller splits result in more parallelism and thus can decrease latency, but also have more overhead and increase load on the system. | 64MB |
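To make sure I had the rules straight, here’s a small Python sketch (my own illustration of the documented behavior, not Trino’s actual code) that applies these three properties to a list of file sizes:

```python
def count_splits(file_sizes_bytes,
                 max_initial_splits=200,
                 max_initial_split_size=32 * 1024 * 1024,
                 max_split_size=64 * 1024 * 1024):
    """Count splits per the rules above: sections are cut at
    max-initial-split-size until max-initial-splits have been
    assigned, and at max-split-size after that."""
    splits = 0
    for size in file_sizes_bytes:
        remaining = size
        while remaining > 0:
            # pick the section-size cap for the current phase
            cap = (max_initial_split_size if splits < max_initial_splits
                   else max_split_size)
            remaining -= min(remaining, cap)
            splits += 1
    return splits

small = count_splits([500 * 1024] * 10_080)      # 5min_ingest layout
big = count_splits([50 * 1024 * 1024] * 105)     # daily_rollup layout
print(small, big)  # 10080 205
```

This reproduces the split counts I saw for both tables, under the assumption that each file is simply carved front-to-back into capped sections.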
I ran through my two table setups from above and YAY!… IT ALL MADE SENSE! Let’s do the math…
5min_ingest table (10,080 files; 500KB each)
- Table scan yielded 10,080 files, all smaller than `hive.max-initial-split-size`, so the first 200 splits were 200 of these files
- Once the `hive.max-initial-splits` threshold was reached, each of the remaining 9,880 files was still smaller than the `hive.max-split-size` limit, so each was assigned to its own split
- 200 + 9,880 = 10,080 splits
daily_rollup table (105 files; 50MB each)
- Table scan yielded 105 files, all larger than `hive.max-initial-split-size`
- For the first 100 of these 50MB files, the initial 32MB (ref: `hive.max-initial-split-size`) was used to create a split, and the remaining 18MB (which was not larger than `hive.max-split-size`) became a second split; these 100 files consumed the 200-split `hive.max-initial-splits` budget
- The remaining 5 files were all under the `hive.max-split-size` limit, so each was assigned to a single split
- 200 + 5 = 205 splits
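For the arithmetic-check crowd, the daily_rollup count works out with nothing more than integer math (again, just my own back-of-the-envelope restatement of the behavior above):

```python
MAX_INITIAL_SPLITS = 200       # hive.max-initial-splits default
FILE_COUNT = 105               # 50MB files in daily_rollup
SPLITS_PER_FILE_INITIALLY = 2  # one 32MB section + one 18MB remainder

# The initial-split budget covers the first 100 files (100 * 2 = 200).
files_in_initial_phase = MAX_INITIAL_SPLITS // SPLITS_PER_FILE_INITIALLY

# Each leftover 50MB file fits within the 64MB hive.max-split-size,
# so each becomes exactly one split.
leftover_files = FILE_COUNT - files_in_initial_phase

total_splits = MAX_INITIAL_SPLITS + leftover_files
print(total_splits)  # 205
```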

Now that this all makes sense (aka don’t be too proud of this technological terror you constructed), I’m wondering how Trino & Starburst determine splits with Iceberg!
See related post at https://lestermartin.wordpress.com/2023/07/18/determining-of-splits-w-trino-starburst-galaxy-iceberg-table-format/ for how this is tackled with Apache Iceberg.