iceberg materialized views in galaxy (no más storage_schema)

starburst galaxy, as a saas offering, just keeps slipping in nice bits of features & functionality — this one tackles hiding the underlying storage table of an iceberg materialized view

recap of the inaugural iceberg summit (my top 5 observations)

tl;dr – iceberg is pervasive, the real fight is for the catalog, concurrent transactional writes are a bitch, append-only tables still rule, and trino is widely adopted

joining spark dataframes with identical column names (an easier way)

presenting an easier solution to colliding column names when joining spark dataframes than the one i offered in my most popular post, which just happens to be four years old; some things do age well

pystarburst in 90 seconds (try it)

still thinking about getting a pystarburst code stub up and running? starburst galaxy makes it pain-free, and you can even create your first dataframe via python in under 90 seconds. why not give it a try?

hive to iceberg migration tool (rev1)

they had a need for an iceberg migration tool, i wrote an iceberg migration tool; i committed it as a github project, then i promoted a github project (i’ve got macklemore’s thrift shop in my head as i write this excerpt)

pystarburst via a jupyter notebook (exploring the tpc-h dataset)

ready to explore pystarburst via a jupyter notebook? this post points you to a single-click solution that spins up jupyter with sample notebooks ready to run. you’re welcome!

building scalar udfs w/sql for trino (aka sql routines)

check out this quick set of simple examples showing how easily you can create sql-based user-defined functions (udfs), formally referred to as trino sql routines, which make queries more succinct and reusable

apache iceberg table maintenance (is_current_ancestor part deux)

as a follow-on to my earlier post about iceberg versioning (and the is_current_ancestor flag), i thought it would be useful to show working examples of the maintenance activities that are needed to manage the sprawl of data lake files that come with more and more versions