for a given iceberg snapshot, there can be 0 or 1 deletion vector per data file & a deletion vector cannot span more than one data file
Tag Archives: open_source
building trino data pipelines (with sql or python)
trino is well-known as a fast query engine, but it is also a robust transformation processing engine that allows data engineers to develop in sql and/or python
trino query plan analysis (video series)
query plan analysis is critical for getting every single ounce of performance & scalability out of your trino cluster; my 3-part video series will get you started with the basics
logo to company match game (data engineering open-source projects)
can you match the open-source data engineering project logos with the names of the companies most affiliated with each?
delta lake time-travel (just reference the version)
trino’s delta lake connector offers versioning features, including comparing versions and time-travel querying
optionality and common sense (why i returned to starburst)
i’m so excited to have returned to starburst and be focused on rebooting the devrel function, not to mention staying active in the trino and iceberg communities — long live the icehouse
apache spark (yet another overview)
an overview of apache spark presented from 20,000 feet, on the surface, and below the waterline
iceberg acid transactions with partitions (a behind the scenes perspective)
a port of my prior post taking a deeper look at what happens under the hood of hive with “acid” transactions — this time on iceberg tables with parquet files
develop, deploy, execute & monitor in one tool (welcome to apache nifi)
for those not familiar with apache nifi, come along for a short overview of how this framework rather uniquely spans so many of the phases of the typical software development lifecycle
iceberg snapshots affect storage footprint (not performance)
it is easy to understand why most folks initially imagine that iceberg’s ability to maintain a long history of snapshots will cause performance problems, but that is not the case — the real gotcha is that keeping many versions can quickly consume 2-10+ times the amount of data lake storage space