super stoked to announce that the first early release of my upcoming o’reilly book, optimizing your apache iceberg lakehouse, has been published; pull down the pdf and let me know what you think
Tag Archives: galaxy
understanding iceberg deletion vectors (and enjoying some humble pie)
for a given iceberg snapshot, there can be 0 or 1 deletion vector per data file & a deletion vector cannot span more than one data file
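the 0-or-1 rule above can be sketched as a small check in plain python. this is only an illustration under stated assumptions: `DeletionVector` and `check_snapshot` are hypothetical names made up for this sketch, not part of iceberg or any iceberg library, and the file names are invented. each vector carries exactly one data file reference, so it cannot span more than one file by construction.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DeletionVector:
    # hypothetical stand-in for illustration; not iceberg's metadata model
    data_file: str                # the single data file this vector applies to
    deleted_positions: frozenset  # row positions marked as deleted


def check_snapshot(data_files, deletion_vectors):
    """Return a list of violations of the 0-or-1 rule for one snapshot."""
    known = set(data_files)
    # count how many vectors point at each data file
    counts = Counter(dv.data_file for dv in deletion_vectors)
    problems = []
    for path, n in sorted(counts.items()):
        if path not in known:
            problems.append(f"{path}: deletion vector references unknown data file")
        elif n > 1:
            problems.append(f"{path}: {n} deletion vectors, expected at most 1")
    return problems


# made-up file names, purely for the example
files = ["a.parquet", "b.parquet"]
ok = [DeletionVector("a.parquet", frozenset({0, 3}))]
bad = ok + [DeletionVector("a.parquet", frozenset({5}))]

print(check_snapshot(files, ok))
print(check_snapshot(files, bad))
```

in this sketch `b.parquet` has no deletion vector at all, which is the "0" case and is perfectly valid; only the second vector on `a.parquet` gets flagged.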
building trino data pipelines (with sql or python)
trino is well-known as a fast query engine, but it is also a robust transformation processing engine that allows data engineers to develop in sql and/or python
yarp: yet another rag post (this time using sql)
you don’t have to know python or bother your data scientists to start exploring genai concepts like rag; you just need a tool that offers these features in a familiar sql interface
trino query plan analysis (video series)
query plan analysis is critical for getting every single ounce of performance & scalability out of your trino cluster; my 3-part video series will get you started with the basics
optionality and common sense (why i returned to starburst)
i’m so excited to have returned to starburst and be focused on rebooting the devrel function, not to mention staying active in the trino and iceberg communities — long live the icehouse
iceberg acid transactions with partitions (a behind the scenes perspective)
a port of my prior post taking a deeper look at what happens under the hood of hive with “acid” transactions — this time on iceberg tables with parquet files
iceberg materialized views in galaxy (no más storage_schema)
starburst galaxy, as a saas offering, just keeps slipping in nice bits of features & functionality — this one tackles hiding the underlying storage table of an iceberg materialized view
recap of the inaugural iceberg summit (my top 5 observations)
tl;dr – iceberg is pervasive, the real fight is for the catalog, concurrent transactional writes are a bitch, append-only tables still rule, and trino is widely adopted
joining spark dataframes with identical column names (an easier way)
presenting an easier solution to colliding column names when joining spark dataframes than the one i offered in my most popular post, which just happens to be four years old; some things do age well