
Just like every other query engine, you’ve made it to the big leagues when you can actually decipher the query plan (aka explain plan) that gets produced. I’ve been explaining how data lake analytics engines such as Hive, Spark, and Trino break their work down into stages that operate in massively parallel fashions — and detailing all the fun & expense of the data exchange happening between those stages.
These following high-level (and rather universal) concepts are explained in the 1st of 3 videos.
- Understand how & why datasets are divided into splits
- Explain the decomposition of a query into tasks, stages, and exchanges
- Visualize multi-stage queries that tackle sorting, aggregation, and multiple types of joins
- See the conceptual visualization of a parallelized query aligns with Trino’s generated visual plan at runtime
The middle video is focused on the particular syntax Trino’s CBO uses in the textual query plan. This video hits on topics such as the following.
- Optimizations available with table statistics
- Generating a textual and a visual query plan
- Understand the format of the query plan output to include fragments, exchanges, distribution, estimates, and performance
- Performance with, and without, accurate table statistics
The final video in the series is the FUN one! I squarely focus on walking through query plans since you have all that background from the earlier videos, with the mindset of spotting potential performance issues and remedies to overcome them. You’ll see first-hand topics such as the ones listed below.
- Column pruning, predicate pushdown, and partition pruning
- Trino’s approach to tackle sorting and aggregation across two stages
- Join optimizations including broadcast, reordering, and dynamic filtering
In addition to the comments sections of the videos above and this blog post’s comments area, feel free to bring your concerns to Starburst Community Forum.
I love this post!