building trino data pipelines (with sql or python)

trino is well-known as a fast query engine, but it is also a robust transformation processing engine that allows data engineers to developer in sql and/or python

yarp: yet another rag post (this time using sql)

you don’t have to know python or bother your data scientists to start exploring genai concepts like rag; you just need a tool that offers these features in a familiar sql interface

unstructured docs in ai (the wild west)

rag ai apps can only be as good as the parsed and chunked data that fuels them – testing, testing, and more testing the outputs of all the various available libraries with the front-end apps is critical

develop, deploy, execute & monitor in one tool (welcome to apache nifi)

for those not familiar with apache nifi, come on a short overview of how this framework rather uniquely spans so many of the phases of the typical software development lifecycle

exploring ai data pipelines (hands-on with datavolo)

after explaining what rag ai apps are all about & showing what a typical ai data engineering pipeline looks like, i wanted to offer a hands-on lab exercise actually building a simple pipeline use datavolo cloud

understanding rag ai apps (and the pipelines that feed them)

i’m learning all about rag ai apps and wanted to try to explain, at a high-level, what these are all about plus do the same for the etl pipelines that are key to their success

iceberg snapshots affect storage footprint (not performance)

it is easy to understand why most folks initially imagine that iceberg’s ability to maintain a long history of snapshots will cause performance problems, but that is not the case — the real gotcha is that keeping many versions can quickly consume 2-10+ times the amount of data lake storage space

becoming a data engineer (yet another top 10 list)

after a recent class i was asked what skills someone needs to become a data engineer – there are plenty of these lists all over the internet, yet here i go assuming i know enough to jot down yet another; at least i put mine all in a single picture 😉

dbt cloud & starburst galaxy workshop (beta testers welcome)

interested in building a data pipeline with dbt cloud and starburst galaxy? if so, then this post presents recorded videos of 7 lab exercises plus the lab guide itself so you work through them on your own & at your pace