iceberg snapshots affect storage footprint (not performance)

it is easy to see why most folks initially assume that iceberg’s ability to maintain a long history of snapshots will cause performance problems, but that is not the case; the real gotcha is that keeping many versions can quickly consume 2-10+ times as much data lake storage space

well designed partitions aid iceberg compaction (call them ice cubes)

despite what you may have heard, partitions are not dead (yes, there are multiple tools in the shed) and using a well-defined partitioning strategy with apache iceberg can help prevent concurrency issues when compacting files

apache iceberg table maintenance (is_current_ancestor part deux)

as a follow-on to my earlier post about iceberg versioning (and the is_current_ancestor flag), i thought it would be useful to show working examples of the maintenance activities needed to manage the sprawl of data lake files that comes with more and more versions

determining # of splits w/trino/starburst/galaxy (iceberg table format)

a prior post tackled this same quest of understanding how trino decides how many splits to use in a query with the hive table format; it ended with a question of how iceberg tackles the same problem, which is answered in this post

presenting at hadoop summit (archiving evolving databases in hive)

overview of, and links to related artifacts for, my presentation at hadoop summit about strategies to handle changing data in hive’s immutable architecture