
Parallel processing engines like Trino/Starburst, Hive, and Spark perform and scale well because they can work on multiple “chunks” of the data at the same time. These frameworks use different terminology for these chunks, as shown below, but the terms generally mean the same thing.
| Engine | Term |
| --- | --- |
| Trino / Starburst | splits |
| Hive | blocks |
| Spark | partitions |
This article is geared toward those who understand, at least conceptually, how Hive and Spark work when consuming data from HDFS and/or object stores. And, of course, toward those Trino / Starburst folks who don’t know exactly how the number of splits is determined when using the Hive connector.
HDFS uses block sizes (default is 128MB) to control the “chunks” being processed in parallel.
In a nutshell
To help show how “small files” hurt everyone in the big data space, especially with Trino / Starburst (which I’m currently teaching), I created a couple of tables containing the same information, but with different file layouts.
| Table | File size | # of files | Notes |
| --- | --- | --- | --- |
| 5min_ingest | 500KB | 10,080 | New file created every 5 mins |
| daily_rollup | 50MB | 105 | Each day compacted into 3 files |
Of course, I got a massive performance improvement with the fewer, reasonably-sized files than with the 10,080 small files. That said, the behind-the-scenes breakdown of the number of splits was not exactly what I had imagined, and I wanted to find out more.
The 5min_ingest table’s behavior was easy enough to decipher; it had 10,080 splits which aligned to the 10,080 files.
Hive, which incurs a large amount of overhead spinning up a JVM to read data, has config properties that allow each process to read more than one of these small files. Trino’s worker daemons, by contrast, are always running. This dramatically lowers the scheduling overhead for processing splits (similar to what happens when a Spark program acquires, and holds onto, its resources), so Trino doesn’t have the same concern.
Additionally, and I think this is a good thing, Trino does not try to mask the small-files problem, which itself needs to be remedied at the source.
The mystery came when I queried daily_rollup. Ideally, the compaction could have created larger (thus fewer) files each day. I imagined these 105 somewhat reasonably-sized files would become 105 splits (i.e. one per file, as I saw with the small files). I ended up getting 205, which puzzled me. I (correctly) figured that the engine was breaking the files into 2 sections, thus giving me 2 splits per file, but the math didn’t really add up. That should be 210, not 205. My head began to hurt wondering what happened.
Thankfully, someone pointed me in the right direction (thanks, Anne!). Yep, RTFM time again! Nice warning, too!

These 3 properties are what I needed to look at a bit more.
| Property name | Description | Default value |
| --- | --- | --- |
| `hive.max-initial-splits` | For each table scan, the coordinator first assigns file sections of up to `max-initial-split-size`. After `max-initial-splits` have been assigned, `max-split-size` is used for the remaining splits. | 200 |
| `hive.max-initial-split-size` | The size of a single file section assigned to a worker until `max-initial-splits` have been assigned. Smaller splits result in more parallelism, which gives a boost to smaller queries. | 32MB |
| `hive.max-split-size` | The largest size of a single file section assigned to a worker. Smaller splits result in more parallelism and thus can decrease latency, but also have more overhead and increase load on the system. | 64MB |
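To make sure I had the rules straight, here’s a small Python sketch (my own illustration of the documented behavior, not Trino’s actual code) that applies these three properties to a list of file sizes:

```python
def count_splits(file_sizes_bytes,
                 max_initial_splits=200,
                 max_initial_split_size=32 * 1024 * 1024,
                 max_split_size=64 * 1024 * 1024):
    """Count splits per the rules above: sections are cut at
    max-initial-split-size until max-initial-splits have been
    assigned, and at max-split-size after that."""
    splits = 0
    for size in file_sizes_bytes:
        remaining = size
        while remaining > 0:
            # pick the section-size cap for the current phase
            cap = (max_initial_split_size if splits < max_initial_splits
                   else max_split_size)
            remaining -= min(remaining, cap)
            splits += 1
    return splits

small = count_splits([500 * 1024] * 10_080)      # 5min_ingest layout
big = count_splits([50 * 1024 * 1024] * 105)     # daily_rollup layout
print(small, big)  # 10080 205
```

This reproduces the split counts I saw for both tables, under the assumption that each file is simply carved front-to-back into capped sections.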
I ran through my two table setups from above and YAY!… IT ALL MADE SENSE! Let’s do the math…
5min_ingest table (10,080 files; 500KB each)
- Table scan yielded 10,080 files, all smaller than `hive.max-initial-split-size`, so the first 200 splits were 200 of these files
- Once the `hive.max-initial-splits` threshold was reached, each of the remaining 9,880 files was still smaller than the `hive.max-split-size` limit, so each was assigned to its own split
- 200 + 9,880 = 10,080 splits
daily_rollup table (105 files; 50MB each)
- Table scan yielded 105 files, all larger than `hive.max-initial-split-size`
- For the first 100 of these 50MB files, the initial 32MB (ref: `hive.max-initial-split-size`) was used to create a split, and the remaining 18MB (which was not larger than `hive.max-split-size`) became a second split; these 100 files consumed the 200-split `hive.max-initial-splits` budget
- The remaining 5 files were all under the `hive.max-split-size` limit, so each was assigned to a single split
- 200 + 5 = 205 splits
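For the arithmetic-check crowd, the daily_rollup count works out with nothing more than integer math (again, just my own back-of-the-envelope restatement of the behavior above):

```python
MAX_INITIAL_SPLITS = 200       # hive.max-initial-splits default
FILE_COUNT = 105               # 50MB files in daily_rollup
SPLITS_PER_FILE_INITIALLY = 2  # one 32MB section + one 18MB remainder

# The initial-split budget covers the first 100 files (100 * 2 = 200).
files_in_initial_phase = MAX_INITIAL_SPLITS // SPLITS_PER_FILE_INITIALLY

# Each leftover 50MB file fits within the 64MB hive.max-split-size,
# so each becomes exactly one split.
leftover_files = FILE_COUNT - files_in_initial_phase

total_splits = MAX_INITIAL_SPLITS + leftover_files
print(total_splits)  # 205
```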

Now that this all makes sense (aka don’t be too proud of this technological terror you constructed), I’m wondering how Trino & Starburst determine splits with Iceberg!
See related post at https://lestermartin.wordpress.com/2023/07/18/determining-of-splits-w-trino-starburst-galaxy-iceberg-table-format/ for how this is tackled with Apache Iceberg.