as a follow-on to my earlier post about iceberg versioning (and the is_current_ancestor flag), i thought it would be useful to show working examples of the maintenance activities needed to manage the sprawl of data lake files that accumulates as more and more versions pile up
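for a taste of what those maintenance activities look like with trino's iceberg connector, here's a minimal sketch (the table name and the retention/size values are just placeholders for your own):

```sql
-- expire old snapshots so the files they reference become eligible for cleanup
ALTER TABLE mycat.myschema.orders
    EXECUTE expire_snapshots(retention_threshold => '7d');

-- delete files that no remaining snapshot references
ALTER TABLE mycat.myschema.orders
    EXECUTE remove_orphan_files(retention_threshold => '7d');

-- compact small files into fewer, larger ones
ALTER TABLE mycat.myschema.orders
    EXECUTE optimize(file_size_threshold => '100MB');
```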
Tag Archives: open_source
iceberg snapshot is_current_ancestor flag (what does it tell us)
i’d been noticing the is_current_ancestor column of the apache iceberg $history metadata table for a while now, but it wasn’t until i got a direct question about it that i realized it was time to find out for sure what it actually tells us
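for context, trino exposes iceberg metadata tables via a $ suffix on the table name, so you can eyeball the flag yourself (table name below is hypothetical):

```sql
-- is_current_ancestor is true for snapshots that sit on the
-- lineage of the table's current snapshot
SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
FROM mycat.myschema."orders$history"
ORDER BY made_current_at;
```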
dbt cloud & starburst galaxy workshop (beta testers welcome)
interested in building a data pipeline with dbt cloud and starburst galaxy? if so, then this post presents recorded videos of 7 lab exercises plus the lab guide itself so you can work through them on your own & at your own pace
ibis & trino (dataframe api part deux)
this is a port of the dataframe api code from my original pystarburst posting – this time i implemented the same scenarios with ibis, the portable python dataframe library, and had a blast doing it
hive acid transactions work on trino (can even update a partitioned column)
it seems that folks who haven’t used hive in production are always quick to say that hive doesn’t have classic crud operations, much less the merge statement, but that simply isn’t true – this post shows you that you can create a hive acid table and mutate its contents with trino
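as a quick sketch of the idea (catalog/schema/table names here are placeholders, not from the post): trino's hive connector lets you flag a table as transactional, after which the classic mutation statements work against it.

```sql
-- hive acid tables must be stored as ORC and marked transactional
CREATE TABLE hive.myschema.customers (
    id     bigint,
    status varchar
)
WITH (format = 'ORC', transactional = true);

INSERT INTO hive.myschema.customers VALUES (1, 'new'), (2, 'new');

-- classic crud: mutate rows in place
UPDATE hive.myschema.customers SET status = 'active' WHERE id = 1;
DELETE FROM hive.myschema.customers WHERE id = 2;
```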
pystarburst (the dataframe api)
the dataframe api is finally available for trino and starburst galaxy thanks to the pystarburst libraries — take a peek at some example usages in this quick validation run
building a sql-based data pipeline with trino & starburst (5 slick videos)
a collection of videos presenting an overview of how you could build a sql-based data transformation pipeline using trino/starburst and automating it with dbt
better iceberg materialized views in galaxy (no staleness check)
i’m happy to report that some code changes have been made since my last post on materialized views in starburst galaxy and the (mostly useless) “staleness check” is no longer being executed
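for reference, the basic materialized view lifecycle in trino/galaxy looks like this (view and table names below are made up for illustration):

```sql
-- define the view once; its results are stored as a table behind the scenes
CREATE MATERIALIZED VIEW mycat.myschema.daily_sales AS
SELECT order_date, sum(amount) AS total
FROM mycat.myschema.orders
GROUP BY order_date;

-- repopulate the stored data on whatever schedule suits you
REFRESH MATERIALIZED VIEW mycat.myschema.daily_sales;
```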
determining # of splits w/trino/starburst/galaxy (iceberg table format)
a prior post tackled this same quest of understanding how trino decides how many splits to use in a query with the hive table format — it ended with an open question of how iceberg tackles the same problem, which is answered in this post
delta lake in starburst galaxy (intro & integration)
delta lake is a popular data lake table format, and the trino engine and starburst galaxy integrate with it easily, all while using your favorite cloud provider’s object store thanks to galaxy’s great lakes connectivity