Big Data – Lester Martin (l11n)

don’t lead your chat-based llm (it wants to please)

ai tools want to please us, but their overly agreeable responses are tuned to make us happy, not necessarily to provide the right, or best, answer; don’t trust a response at face value!

understanding iceberg deletion vectors (and enjoying some humble pie)

for a given iceberg snapshot, there can be 0 or 1 deletion vector per data file & a deletion vector cannot span more than one data file
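
the 0-or-1 rule can be sketched as a small check. this is a toy model, not iceberg's metadata format or any library's api: the `DeletionVector` record, its field names, and the `violations` helper are all hypothetical, and a snapshot is assumed to be just a list of data file paths plus a list of vectors.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DeletionVector:
    # Hypothetical, simplified record: each vector names exactly one data
    # file, so "cannot span more than one data file" holds by construction.
    data_file: str
    deleted_positions: frozenset


def violations(data_files, vectors):
    """Check one snapshot; return (files with >1 vector, vectors' unknown files)."""
    known = set(data_files)
    counts = Counter(v.data_file for v in vectors)
    too_many = sorted(f for f, n in counts.items() if n > 1)
    unknown = sorted(f for f in counts if f not in known)
    return too_many, unknown


# Illustrative snapshot: three data files, one of them with a deletion vector.
files = ["a.parquet", "b.parquet", "c.parquet"]
ok = [DeletionVector("a.parquet", frozenset({0, 3}))]
print(violations(files, ok))
```

a second vector for `a.parquet` in the same snapshot would show up in the first returned list; in this toy model, the fix is to merge the two position sets into a single vector for that file.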

building trino data pipelines (with sql or python)

trino is well known as a fast query engine, but it is also a robust transformation processing engine that allows data engineers to develop in sql and/or python

yarp: yet another rag post (this time using sql)

you don’t have to know python or bother your data scientists to start exploring genai concepts like rag; you just need a tool that offers these features in a familiar sql interface

trino query plan analysis (video series)

query plan analysis is critical for getting every single ounce of performance & scalability out of your trino cluster; my 3-part video series will get you started with the basics

logo to company match game (data engineering open-source projects)

can you match the open-source data engineering project logos with the companies most affiliated with each?

delta lake time-travel (just reference the version)

trino’s delta lake connector offers versioning features, including comparing versions and time-travel querying

unstructured docs in ai (the wild west)

rag ai apps can only be as good as the parsed and chunked data that fuels them – testing, testing, and more testing of the outputs of the various available libraries with the front-end apps is critical

optionality and common sense (why i returned to starburst)

i’m so excited to have returned to starburst and be focused on rebooting the devrel function, not to mention staying active in the trino and iceberg communities — long live the icehouse

apache spark (yet another overview)

an overview of apache spark presented from 20,000 feet, on the surface, and below the waterline