well, shoot… (i guess i do like writing queries via chat)

i actually like writing code and did not imagine i would enjoy just asking my business data analysis questions with natural language, but i’m nothing if not flexible and open to reevaluating my opinions

building trino data pipelines (with sql or python)

trino is well-known as a fast query engine, but it is also a robust transformation processing engine that allows data engineers to developer in sql and/or python

unstructured docs in ai (the wild west)

rag ai apps can only be as good as the parsed and chunked data that fuels them – testing, testing, and more testing the outputs of all the various available libraries with the front-end apps is critical

joining spark dataframes with identical column names (an easier way)

presenting an easier solution to the problem of colliding column names when joining spark dataframes than i previously offered in my most popular post that just happens to be four years old — some things do age well

pystarburst in 90 seconds (try it)

still thinking about trying to get a pystarburst code stub up/n/running? starburst galaxy makes it pain free and you can even get your first dataframe created via python in under 90 seconds — why not give it a try?

hive to iceberg migration tool (rev1)

they had a need for an iceberg migration tool, I wrote an iceberg migration tool — i committed it as a github project, then i promoted a github project (i’ve got macklemore’s thrift shop in my head as i write this excerpt)

pystarburst via a jupyter notebook (exploring the tpc-h dataset)

ready to explore pystarburtst via a jupyter notebook? this post points you to a single-click solution to spin up jupyter that has sample notebooks ready to run — you’re welcome!

becoming a data engineer (yet another top 10 list)

after a recent class i was asked what skills someone needs to become a data engineer – there are plenty of these lists all over the internet, yet here i go assuming i know enough to jot down yet another; at least i put mine all in a single picture 😉

ibis & trino (dataframe api part deux)

this is a port of the dataframe api code from my original pystarburst posting – this time i implemented the same scenarios with ibis, the portable python dataframe library, and had a blast doing it