query plan analysis is critical for getting every single ounce of performance & scalability out of your trino cluster; my 3-part video series will get you started with the basics
Tag Archives: tuning
iceberg snapshots affect storage footprint (not performance)
it is easy to understand why most folks initially imagine that iceberg’s ability to maintain a long history of snapshots will cause performance problems, but that is not the case — the real gotcha is that keeping many versions can quickly consume 2-10+ times the amount of data lake storage space
well designed partitions aid iceberg compaction (call them ice cubes)
despite what you may have heard, partitions are not dead (yes, there are multiple tools in the shed) and using a well-defined partitioning strategy with apache iceberg can help prevent concurrency issues when compacting files
apache iceberg table maintenance (is_current_ancestor part deux)
as a follow-on to my earlier post about iceberg versioning (and the is_current_ancestor flag), i thought it would be useful to show working examples of the maintenance activities that are needed to manage the sprawl of data lake files that come with more and more versions
z-order (visualized)
when asked to compare sort-by with z-order for data lake tables, i realized i finally needed a better understanding of what z-order is all about; my goal with this blog post is to present a simplified visualization of what’s going on and how it can help
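to make the z-order idea a little more concrete, here is a minimal python sketch of bit interleaving (often called a morton code) for two integer columns. this is only an illustration of the general technique, not the code any particular engine or table format uses; the function name `interleave_bits` and the 16-bit default are assumptions made up for this example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return a z-order (morton) value by interleaving the bits of x and y.

    hypothetical helper for illustration: x contributes the even bit
    positions and y contributes the odd bit positions of the result.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bit i of x -> even position
        z |= ((y >> i) & 1) << (2 * i + 1)   # bit i of y -> odd position
    return z


# sort a small 4x4 grid of (x, y) points by their z-order value
points = [(x, y) for y in range(4) for x in range(4)]
ordered = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
```

sorting by the interleaved value keeps rows that are close in both columns near each other, whereas a plain sort-by on (x, y) only clusters well on the leading column.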
configuring the cache service (starburst enterprise)
showcasing a video walk-through of configuring and validating the caching service for starburst enterprise which enables table scan redirection, materialized views, and data products
determining # of splits w/trino/starburst/galaxy (iceberg table format)
a prior post tackled this same quest of understanding how trino decides how many splits to use in a query with the hive table format; it ended with the question of how iceberg tackles the same problem, which is answered in this post
determining # of splits w/trino/starburst/galaxy (hive table format)
ever wondered how trino decides how many splits to use in a query when reading files from your data lake? if so, come along and ride on a fantastic voyage
presenting at hadoop summit (archiving evolving databases in hive)
overview of, and links to related artifacts for, my presentation at hadoop summit about strategies to handle changing data in hive’s immutable architecture
small files and hadoop’s hdfs (bonus: an inode formula)
a walk-thru of the infamous small files problem for hadoop coupled with a unique problem with inode usage for mass quantities of extremely small files on hdfs