Lester Martin (l11n)

lakehouse lab (short-form video series)

Happy to share that the Lakehouse Lab with Lester Martin short-form video series has been established. A new episode drops each week and I already have the following first 6 of them up on the playlist.

As I called out my “other experiment“, I’m currently the one-person DevRel show at Starburst. That absolutely means that I am an advocate for all things Starburst, but I am ALSO an advocate for open-source Trino and Apache Iceberg. In fact, I see myself as an all around advocate for all things data engineering.

With that backdrop, it should not come as a surprise that as I grow out the Lakehouse Lab short-form video “show”, my episodes will often be about things you don’t need to buy from Starburst and are available & applicable for all. This includes me sharing some of the knowledge (and doubts) I’ve built up over my 35 year career, not just pitching product A & B or open-source effort Y & Z.

I do think the show will get a bit more polished as I go, but it is purposefully NOT trying to be something TOO polished. I hope you like the style of it all. I did work on the virtual background and really like the imaginative, and raw, look of “the lab” I settled on. I hope you’ll watch and then like/comment/follow the series.

well, shoot… (i guess i do like writing queries via chat)

NGL; I actually LIKE to write code (real code and even SQL) and did not think I was going to actually enjoy just writing out my requests in natural language, but today I had a little “stop fighting it” moment. Let me show you…

Setup

I held a hands-on, workshop-styled, webinar showing the basics of Building Data Pipelines with Ibis and Starburst which focused on Starburst Galaxy‘s sample burstbank dataset with these entities & relationships.

Details on the exercises are in my Starburst DevRel GitHub repo, specifically here, but one of the analysis questions I wanted to find the answer to was as follows.

Define a view that groups customers by their medallion status and the state/province they live in and calculate the following for the group; number of auto loans and average auto payment.

Taking a peek at the tables themselves ensured this was a softball question, but just enough to showcase implementing it with the Ibis Expression (aka dataframe) API using the Trino backend, as well as comparing it with straight SQL. Both implementations got the following results.

Check out the Jupyter notebook if you want to run any of this yourself.

Ibis

I got the results above from this bit of Python code using Ibis.

			
# grab needed fields from customer
all_cust = con.table("customer", database=("sample", "burstbank")) \
    .select("custkey", "state")
# grab needed fields from customer_profile
cust_p = con.table("customer_profile", database=("sample", "burstbank")) \
    .select("custkey", "customer_segment")
# css -> Customer State (and) Segment
css = all_cust.join(cust_p, all_cust.custkey == cust_p.custkey)
# grab needed fields from account
all_acct = con.table("account", database=("sample", "burstbank")) \
    .select("custkey", "auto_loan_id", "auto_loan_status", "auto_loan_balance")
# only keep those with ACTIVE auto loans
active_auto = all_acct.filter(
    [all_acct.auto_loan_status == "open",
     all_acct.auto_loan_balance > 0])
# grab needed fields from loan_payment
auto_pymt = con.table("auto_loan_payment", database=("sample", "burstbank")) \
    .select("auto_loan_id", "payment_amount")
# put it all together
the_aggs = active_auto.join(css, active_auto.custkey == css.custkey) \
    .join(auto_pymt, active_auto.auto_loan_id == auto_pymt.auto_loan_id) \
    .group_by(css.state, css.customer_segment) \
    .aggregate(
        nbr_auto_loans = active_auto.auto_loan_id.nunique(),
        avg_payment = auto_pymt.payment_amount.mean()
    )
# few touch-ups to be DONE
the_view = the_aggs.mutate(
    avg_payment = the_aggs.avg_payment.round()
).order_by("customer_segment", "state")
print(the_view)

		

I can’t lie, it was fun touching this dataframe API again, but for folks that aren’t familiar with Python or the concept of a dataframe, this might look a bit intimidating at first glance.

SQL

In that same notebook I show that this could also be tackled with some straight-forward SQL (maybe I went overboard with my CTEs; haha).

			
WITH
active_auto_loans AS (
SELECT custkey, auto_loan_id, auto_loan_balance
  FROM sample.burstbank.account
 WHERE auto_loan_status = 'open'
   AND auto_loan_balance > 0
),
cust_state_and_segment AS (
SELECT c.custkey, c.state, cp.customer_segment
  FROM sample.burstbank.customer c
  JOIN sample.burstbank.customer_profile cp
    ON c.custkey = cp.custkey
),
cust_active_auto AS (
SELECT *
  FROM active_auto_loans aal
  JOIN cust_state_and_segment csas
    ON aal.custkey = csas.custkey
)
SELECT caa.state, caa.customer_segment,
       count(distinct alp.auto_loan_id) AS nbr_auto_loans,
       round(avg(alp.payment_amount)) AS avg_payment
  FROM cust_active_auto caa
  JOIN sample.burstbank.auto_loan_payment alp
    ON caa.auto_loan_id = alp.auto_loan_id
 GROUP BY caa.state, caa.customer_segment
 ORDER BY caa.customer_segment, caa.state

		

As I stated before, this was a pretty simple ask and it was almost as much fun as the Python code, but it did take “more than a minute” to get it working perfectly. In fact, I’ll let you know that I actually had an error in my Ibis code and my query that are in the video recording of the webinar (can you find it?); but more on that in a bit.

Claude

Like most everyone else, I work for Initech and we are ALL being tasked with finding out how to work more efficiently using Gen AI tools. As always, new technology isn’t here to make work life easier for anyone… it is here for us to work even harder (my hot take for the day). One of the supported tools we have is everyone’s good friend; Claude.

Following Starburst’s documentation, which was clearly where Claude got its answer to this same question, I wired it up to the MCP Server available on Starburst Galaxy. I then asked the usual “tell me about” questions to learn more about the burstbank tables. Heck, I even had Claude build me that ERD at the top of this post and I’m surely going to use that little trick again.

Ultimately, I broke down and asked the million dollar question…

Find customers with active auto loans, group them by their medallion status and state they live in, provide the total number of these accounts by the grouping as well as an average auto payment amount, rounded to the nearest dollar. Order the results by medallion status and state.

And yes… it got the answer perfectly right. Of course, this is a softball question as I mentioned at the top, but I guess that even this COP (Cranky Old Programmer) can admit this might be the place to start when crafting any non-trivial query.

But remember, as I pointed out in don’t lead your chat-based llm (it wants to please), DON’T TRUST THE RESPONSE AT FACE VALUE! Verify & validate, early & often!! That said, my initial results from Claude actually were different than I received from Ibis and my SQL query implementations. Initially I thought, “see, it got it wrong!”, but in my verify/validate efforts I realized it was ME that got it wrong earlier.

I quickly realized the problem when I compared results and was able to update the Jupyter notebook to be correct. Side quest: if YOU can see what my error was in the webinar recording let me know and I’ll send you some Trino, Iceberg, and Starburst stickers! Heck, Claude even showed me the query it wrote.

			
SELECT 
  cp.customer_segment AS medallion_status,
  c.state,
  COUNT(DISTINCT a.auto_loan_id) AS total_accounts,
  ROUND(AVG(alp.payment_amount)) AS avg_auto_payment
FROM sample.burstbank.customer c
JOIN sample.burstbank.account a ON c.custkey = a.custkey
JOIN sample.burstbank.customer_profile cp ON c.custkey = cp.custkey
JOIN sample.burstbank.auto_loan_payment alp ON a.auto_loan_id = alp.auto_loan_id
WHERE a.auto_loan_status = 'open'
GROUP BY cp.customer_segment, c.state
ORDER BY cp.customer_segment, c.state

		

Yes, seems I do like to write little CTEs instead of just banging out all the joins at once. Hey, it’s a matter of style! 😉

Epilogue

At a personal level, I still have a LOT of concerns about the impact of Gen AI including environmental harm and the effect of ai on intelligence (behold the idiocracy), but these tools are here and I would be lying to suggest they are not making me go faster. I guess I’d just say, find what works for you, don’t be completely closed-minded, NEVER forget that everything has consequences, and absolutely DON’T TRUST THE RESPONSE AT FACE VALUE!

Note: I did NOT use the help of any LLM-based Gen AI tool to write this blog post. I actually hope that shows! It’s still our time. Long live the writers!

what is driving the semantic layer revival? (ai can’t live without it)

TL;DR: We have been trying to define meaningful context layers for decades, but the humans we created them for just wouldn’t embrace them. Along come AI agents and they are thirsty (yes, I’m personifying AI) for this clarifying metadata in an attempt to provide more accurate answers.

Same old problem

As database, table, and column names are defined by technologists, use a variety of abbreviation patterns, and their initial intention is rarely rigorously aligned with the data that eventually gets stored within them, it should not come as a surprise that we have always needed a well-documented and regularly-maintained business translation layer. Nowadays, we call this the ‘semantic layer’, but 30+ years ago when I started in software we called it a ‘data dictionary’.

There have been many other names along the way (data catalog, business object universe, glossary, metadata repository, etc), but the goal is still the same — to create an abstraction between raw data and the people or systems consuming it. Why have we never really solved this in a pervasive way? As is usually the case, I blame us humans.

Skinning elbows

Humans have a tendency to want to learn everything themselves. Imagine how advanced of a civilization we would be if we could just take a face-value all of the knowledge and life-lessons that those before us built up in their lifetime. Unfortunately, we have to ‘skin our elbows’ on the same problems that have been solved so many times before. That unwillingness to learn from others is the heart of why we don’t already have incredibly accurate semantic layers.

Without that desire to just accept knowledge others are trying to share, it is also no wonder the ‘data czar’ role never really hit at the vast majority of enterprises. When it did, it was more like an intellectual society writing for themselves and nobody else wanted to listen. Then came the self-service BI world that exclaimed there was no need for a semantic layer.

Awesome BI tools brought forward the ‘story tellers’ of the data. These artists sure didn’t need a business glossary to help them tell their stories around the campfire!

Interested clients

So what’s new? Yep, GenAI is here; like it or not. Along with it is the expectation that somehow these probabilistic pattern-matching models will magically be cognizant and be able to provide truly deterministic answers. That isn’t going to happen, but as with the earlier phases of ‘prompt engineering’ and RAG, the current round of AI ‘agents’ (yep, they are still heavily using the LLMs) return better answers when provided with additional definitions, business meanings, and example scenarios (aka CONTEXT).

Will it all work this time? Will this generation of data systems actually have accurate semantic layers representing them? The pessimistic-optimist in me says it might just so, but the pragmatic programmer in me thinks it probably won’t. Costs will continue to rise for AI tooling, unrealistic expectations will continue to drive AI adoption, and ultimately I fear folks don’t want to maintain these data dictionaries and executives don’t want to pay for their creation and upkeep.

The semantic layer — it is STILL a good idea!

my iceberg book (early release 1)

While still very early in the process, I’m tickled pink to announce that the 1st early release (ER1) of my upcoming O’Reilly book named Optimizing Your Apache Iceberg Lakehouse (Improving Performance & Scalability) has been published. It only features Chapter 2 at this time and the final product will have 7 chapters in all.

I’d love any & all feedback that anyone has. Here is the PDF for ER1 for your reading enjoyment!

Optimizing Your Apache Iceberg Lakehouse_ER1 Download

don’t lead your chat-based llm (it wants to please)

There’s a whole lot of talk about “AI slop” nowadays (deservedly, IMO), but what does it mean? In the earliest days of GenAI & RAG (not that long ago!) there was a LOT of talk about “hallucinations”, but what does that mean? Good eye; you noticed I answered a question with another question.

Hallucinations

Google’s Gemini answers the “what is genai hallucination” question this way…

That’s a decent answer, but what I personally don’t like about chat-based LLM (Large Language Models) interfaces is that too many, maybe most, people think they are actually interacting with some kind of intelligent being. In reality, these tools, like well-known ChatGPT, are really just doing some (very) fancy math that lets them find similarities in previously consumed content available on the internet and then predict how best to assemble their reply.

Probabilistic

Oh… and the answers are PROBABILISTIC, not DETERMINISTIC. Fancy terms that basically say it doesn’t “know” what the exact answer is, and it is very likely that the next time you ask the same question your answer could be different. Let’s ask Google’s Gemini that same question again.

The good news is that these two responses are pretty much saying the same thing. Maybe that’s not a big deal for this question, but “hallucinations” can be far more impactful.

I keep air-quoting “hallucinations” because the LLM isn’t just messing up sometimes and making stuff up — “hallucinations” are really how they work all the time (i.e. finding similarities in what they’ve seen before and generating a response mushing all of that back together). The LLM doesn’t know if it is right or wrong, much less is aware of the impact.

AI slop

These hallucinations (eh-hem, “normal output of the LLM”) have rolled into a larger phrase now; “AI slop”. If you’ve heard the term you might know it from the crappy looking videos on social media. In a business context it often refers to someone presenting you information from their GenAI tool of choice (aka these chat-based LLM interfaces like ChatGPT that we’re talking about) without performing any oversight activities.

You read it and think, “what the H. E. double-hockeysticks was that!?” I’m going to come back to it, but basically the warning is… DON’T TRUST THE RESPONSE AT FACE VALUE!

If you are going to use these tools to help you, then you have to accept that the output has YOUR name on it. If it is wrong, misleading, rambling (like my own blog posts that you can surely tell are not AI-generated), or even disorienting, it reflects badly on you. Repeat… DON’T TRUST THE RESPONSE AT FACE VALUE!

I miss the 80’s

All of that is not inherently bad or good (okay, because most folks don’t really know what’s going on… it IS bad). For myself, I can handle this and despite wishing I was back in the 1980’s sometimes (Gen X folks know what I mean), these tools are here AND I find myself leveraging them, too.

Yes, that’s despite my grave concerns of the environmental impact of what all of this tech is doing to us, but I think we have to save that for another post — this one is already rambling way too much.

Repeating myself

If you’re still with me, let’s review…

GenAI is NOT really an intelligence; it is just amazingly sophisticated pattern matching
GenAI isn’t guessing/hallucinating SOMETIMES, it is just doing what it does ALL THE TIME
GenAI will likely predict a different answer to the same question each time you ask it

I can live with that (i.e. repeat… DON’T TRUST THE RESPONSE AT FACE VALUE!) as I understand I am the owner of the final output I share with anyone. Then what am I all frustrated about? It comes down to two things.

Misleading humans

First, I wish I could count on everyone to understand that, but the tech companies who make all of this have realized it would be much more popular if they let people believe they were talking to some super-mind. Who doesn’t want a super-mind who only wants to help us be more productive and hopefully more accurate? Heck, I’d pay a LOT for that…

Whose at fault for #1? Like most things, there’s a big bowl of blame and yes, we OUGHT to be able to rely on the average person to know what they are using, but this stuff is a whole bunch more complicated that driving a car. Okay, I do blame the GenAI companies for misleading consumers.

Leading GenAI (we finally got to the title of this thing!)

Second, I absolutely HATE that the default behavior of these chat-based LLM tools is to do just that — chat it up with us. Who doesn’t like to chat? And if we are going to chat with someone/something, we sure want it to be entertaining, relatively nice, maybe even interesting, but absolutely FLATTERING! There’s the rub I have.

Couple a misunderstanding of what is happening under the covers with a overly-agreeable tool that is going to make sure it writes its responses in a way that makes you feel good — maybe even makes you feel smart (ex: “the super-mind agrees with me!”) and we’ve got a deadly combination.

An example

Lead GenAI

I work in the data engineering world (yes, yet another role that is figuring out our future with these GenAI tools) and someone sent me some GenAI output suggesting that Apache Iceberg would be a good option for an operational database. If you don’t know what either of these mean, that’s OK, they are just the backdrop (reference links intentionally left out). I consider myself very well versed in this particular technology domain and my immediate response was “that’s incorrect – it would NOT be a good option.”

I wanted to reproduce this AI slop/hallucination/inaccuracy/lie/normal-output so I went to Google Gemini and asked, “why is apache iceberg a good choice for an operational database” and was flabbergasted by the opening paragraph that said it was STRONG choice…

Yep, the darn thing was telling me what I wanted to hear; being overly-agreeable. Not because that’s really what it surmised, but because it could fit the narrative I wanted. I was “leading the witness” and there was no opposing attorney to object. It stroked my ego and rambled on and on of why this was a good idea.

Interestingly, the correct answer was in there if one could read between the lines & had enough domain knowledge to notice the nuance. Even more interestingly, waaaaaaaay down at the very bottom, it finally, buried within a long run-on AI slop paragraph, mentioned that it was NOT a replacement for an operational database which is exactly what the user need to hear at the very top of the misdirecting manifesto.

Lead it in the opposing direction

You guessed it; I decided to flip the way I was leading my witness by suggesting it was a good idea with, “why is apache iceberg not a good choice for an operational database”, and the darn thing did it again — it told me what I wanted to hear!!

Stop acting like a human! Stop stroking our egos! Stop being so overly-agreeable! Since most folks don’t really know what’s really going on, stop tricking people just to make a buck! Be more like Sergeant Friday!

Repeat… DON’T TRUST THE RESPONSE AT FACE VALUE!

Be neutral

If you want to get a better answer, it helps to not lead the tool. Just ask the base question without suggesting the answer. It wants to please you (the “it” is an algorithm and some software, not a conscious being — it doesn’t have feelings; even if it sounds like it does). It can easily take you down the wrong path.

I asked it one last time, but went with just “is apache iceberg a good choice for an operational database” which let the simulated-intelligence not have to act in a simulated-flattering way or be simulated-defensive with its pretty solid response.

Repeat… DON’T TRUST THE RESPONSE AT FACE VALUE!

What do I want?

From tool vendors	From end users
Take away all this nice/polite/overly-agreeable/human-ISH nonsense and just act like a too — be a darn computer and give me a computer-ISH answer.	Assemble the best response you can by possibly asking the question a few ways & using more than one particular tool. Remember that you own the final product.

Repeat… DON’T TRUST THE RESPONSE AT FACE VALUE!

Thanks for reading

understanding iceberg deletion vectors (and enjoying some humble pie)

image source: https://www.starburst.io/blog/iceberg-v3/

My confession

While digging deeper into Apache Iceberg v3 deletion vectors, I’ve realized that one of my initial assumptions was wrong. Furthermore, letting the great Danica Fine record me pontificating on how it all works is forcing me to eat a big piece of humble pie! I was close, but surely not close enough. Here’s my previous thinking…

I sure SOUNDED confident, but I did mess up one very important concept. I ass-u-me-d that a deletion vector could span more than one data file. Unfortunately, that’s not the case!

TL;DR

For a given snapshot, there can be 0 or 1 deletion vector file per data file AND an Iceberg deletion vector file cannot span more than one data file. That said, having writers logically merge, & physically write, an updated deletion vector file for a new snapshot still goes a long way to make this a better solution than the v2 positional delete files.

Should you continue reading?

Okay, if that explains it all to you — you are free to go (thanks for your attention to this matter), but if it didn’t make sense let’s explore some more. AND IF the whole concept of Apache Iceberg v2 “positional delete files” is something new, please check out Alex Merced‘s Understanding Apache Iceberg Delete Files first.

Heck, if you need an Iceberg overview in general, the good folks at Starburst Academy put a nice & short video together, Understanding Apache Iceberg architecture, to fill in the blanks.

For the Missourians (the show-me state)

Yep, time to get into it and SEE what’s going on for those, like me, who want to see an example or three. I’m going to use Starburst Galaxy as it has the new Iceberg v3 implementation available to anyone — check out the free trial if you haven’t used it before.

Test data

We can use the tpch.tiny.customer table from the Trino TPC-H connector as a source for our testing data.

This has 1500 records whose custkey values so nicely range from 1 – 1500. As this is a very small example (and we want to create multiple data files), we can insert them into our test tables in 3 bands of 500 records each.

Additionally, when we delete records, we can delete evenly distributed batches of records based on how nicely the customer key value divides by 5 and 2. You’ll see!

Version 2 positional delete files

Set up a table

Create a test table, populate it with the 1500 rows, and verify 3 data files of 500 records each are present.

create table cust_v2
with (type='iceberg', format_version=2)
as select * from tpch.tiny.customer
    where custkey <= 500;

insert into cust_v2
  select * from tpch.tiny.customer 
   where custkey between 501 and 1000;

insert into cust_v2
  select * from tpch.tiny.customer 
   where custkey > 1000;

select substring(file_path, length(file_path) - 20) 
         as end_of_file_name, 
       file_format, record_count
from "cust_v2$files";

Delete some records

Delete the 300 records whose customer keys are evenly divisible by 5, then query the $files metadata table again.

delete from cust_v2
 where mod(custkey, 5) = 0;

The bottom 3 files are the v2 positional delete files. One for each of the data files. Each referencing 100 records from the 300 total that have been deleted.

Delete some more

Now, delete the 600 records whose customer keys are even numbers and list the $files metadata table again.

delete from cust_v2
 where mod(custkey, 2) = 0;

The middle 3 files are the next round of v2 positional delete files. Like before, one for each of the data files. Each referencing 200 records from the 600 total that have been deleted this time.

What happens when we delete more?

Each time we run another DELETE command a new positional delete file will be created for each affected data file. This will continue to grow and sprawl until we can run a compaction operation to clean it all up.

Version 3 deletion vectors

Set up a table

Create a v3 test table, populate it with the same 1500 rows, and verify (just as before) 3 data files of 500 records each are present.

create table cust_v3
with (type='iceberg', format_version=3)
as select * from tpch.tiny.customer
    where custkey <= 500;

insert into cust_v3
  select * from tpch.tiny.customer 
   where custkey between 501 and 1000;

insert into cust_v3
  select * from tpch.tiny.customer 
   where custkey > 1000;

select substring(file_path, length(file_path) - 20) 
         as end_of_file_name, 
       file_format, record_count
from "cust_v3$files";

Delete some records

Delete the 300 records whose customer keys are evenly divisible by 5, then query the $files metadata table again.

delete from cust_v3
 where mod(custkey, 5) = 0;

As before, the bottom 3 files represent the 300 records deleted. What is different is that these delete files are persisted as Puffin files, not Parquet ones.

Delete some more

Now, delete the 600 records with even numbers for their customer keys then list the $files metadata table again.

delete from cust_v3
 where mod(custkey, 2) = 0;

Notice that there are still only 3 deletion vector (Puffin) files. That’s because the compute engine merged the contents of the previous 100 affected rows per file with the new change deleting another 200 rows. The deletion vectors now include pointers to the comprehensive 300 records for each of their associated 500 record data files.

What happens when we delete more?

For any new deletions aligned to a data file with an existing deletion vector file, the compute engine will merge the comprehensive contents into a new Puffin file. This approach keeps a smaller number of delete files around that need to be read when querying the table. It will prevent the delete file spawl that occurs in the v2 implementation and will also be rolled up when a compaction process is run.

Why do we care?

With the sprawl of v2 positional delete files comes more and more file I/O which decreases performance and increases costs. The v3 deletion vector files ensure that for a given snapshot there will never be more than a single deletion file aligned to each data file. Performance doesn’t keep degrading as more and more deletions occur and we prevent cost spikes for data lake solutions who charge on the number of GET operations.

Did we have fun?

I don’t know about you, but I sure had a blast exploring!

If you want to go digging for even more, we didn’t even compare the difference in what is being stored in the Parquet-based v2 positional delete files vs the Puffin-based v3 vector deletion files. But hey… let’s get into that another time!

Happy Iceberging!!

my freewrite alternative (under $200)

I don’t know how Freewrite can use the word “free” in their company name. I adore the look & promise of their original Smart Typewriter, but $700 is far from free AND way too much!

Even their mid-tier Traveler comes in at $550 (I feel like Steve Martin in Father of the Bride when he declared, “A cake, Franck, is made of flour and water. My first car didn’t cost $1200” — THIS IS A CLAMSHELL KEYBOARD & A TINY E-INK SCREEN!!). I was actually hoping the Alpha would list for a reasonable $200 instead of its $350 sticker shock.

I’m probably digging the Alpha as it is the closest to the used Alphaworks Neo I picked up on eBay after remembering it while reading 9 Distraction-Free Writing Devices. In fact, it would have been exactly what I wanted, but we’re talking about some pretty old hardware and we all know how tough folks can be on keyboards. Mine had a few keys that didn’t work and thankfully the seller gave me a full refund.

The original Freewrite Smart Typewriter has been out there for a few years now and I know I’m not the only person drooling over it. I simply can’t understand why others haven’t entered into this market, including those who want to solely compete on price and bring this thing down. Again, the target price in my head is $200 and the manufacturer can still make some money.

There is the Zerowriter Ink with a pretty decent price of $230, but as far as I can tell it isn’t a REAL product I can buy and play with today. So, I went back to the “9 devices” post I referenced earlier and got me thinking about a small Android-based e-ink device coupled with a Bluetooth keyboard. Some digging around on Amazon ultimately got me to this setup — AND AT THE RIGHT PRICE!

Yes, you are right… this is NOT a completely dedicated device, but for me it is darn close as once I got it all setup I’m in the Google Docs app almost immediately and have a full-sized keyboard underneath my fingertips.

I also verified it worked great being online or offline as the Google app took care of all the magic. It is working great for me and already banged out my first body of work on it. Plus, it is so cute!!

The added bonus to me is the very nice feeling $30 keyboard as I can be as rough as I want on it without worry I might destroy my device.

building trino data pipelines (with sql or python)

When people think of Trino they naturally hone in the benefits of querying data with SQL in this “engine that runs at ludicrous speed”. When asked if Trino would be useful as a transformation processing engine for ETL/ELT workloads, not all would immediately answer YES!

Usually, the answer depends on if someone has only heard of Trino as opposed to those who have actually used Trino. The latter will usually say ABSOLUTELY it can! This post focused on ensuring all know Trino is well suited for constructing data pipelines AND that data engineers have their choice of SQL and/or Python when constructing them.

History of Trino for ETL workloads

The project website is a great place to start if Trino is new to you. The Starburst DevCenter is another useful resource. And, of course, I’m proud to offer my own blog posts discussing Trino and/or Starburst to the community. Please spend a little bit of time at some (or all) of these resources if needed before continuing.

Within a year after Trino (then known as Presto) was released into production at Facebook for interactive use cases, users started scheduling batch/ETL queries with Trino instead of Hive. Within a few more years, 50% of the existing ETL workloads were running on Trino as well as 85% of all new workloads. Soon others caught on, and companies like Salesforce and Lyft began utilizing Trino for their batch/ETL workloads, too.

Architectural concerns & remedies

Trino came out of the gate blowing Hive away on query execution speed because Trino focused more on performance as opposed to Hive’s focus on reliability. On a side note, hive, trino & spark features (their journeys to sql, performance & durability) points out that both of these engines are always improving. To understand all of that, let’s talk about the original architecture of Trino.

Parallel processing fundamentals

Fundamentally, the Trino engine works in a distributed processing mechanism similar to Hive and Spark. Meaning, it breaks work down into multiple stages organized in a directed acyclic graph (DAG). The stages are where the same processing can be done on many slices (Trino calls them splits, Spark calls them partitions, and Hive calls them blocks) of data simultaneously.

When needed to move to another stage of processing, the data has to be redistributed to allow processing to continue in a highly parallel fashion. Examples of when this is needed are common SQL operations such as GROUP BY, JOIN, and ORDER BY. If these concepts are new to you, I recommend you watch, at least, the first video in trino query plan analysis (video series) which I’m embedding next.

Initial Trino architecture

With that DAG / query plan knowledge fresh in your mind, take a look at the following image which represents the base architecture of why Trino runs fast.

Data leaving one stage to another is redistributed into the next stage (often referred to as shuffling). Trino’s speed benefit comes from not persisting that intermediary data to disk — it acts more like a streaming engine internally and sends the data on to the next stage directly. The consequence of this is if the query fails before completion, the initiator is returned an error message.

The design goals for interactive querying performance did not provide sufficient support for long-running and memory-intensive queries.

Long running queries unreliable: the all-or-nothing architecture makes it really hard to tolerate faults
Distributed memory limit: with streaming shuffle, aggregations and joins have to process all at once

Also, with the original architecture, it is really hard to apply classic techniques like adaptive query execution, speculative execution, and skew handling.

Fault-tolerant execution

As hive, trino & spark features (their journeys to sql, performance & durability) calls out, platforms continue to grow over time to address as many features and limitations as possible. Trino has been no different and Project Tardigrade added fault-tolerant execution (FTE) benefits into the architectural design.

The earlier identified issues with the initial architecture were overcome by storing the intermediary data produced by the parallel tasks within a stage. This model enforces the stage-by-stage execution model that engines like Hive were famous for (and gave it the reliability we are discussing now). The following diagram shows the flow of that intermediary data to configurable storage.

The same work is still accomplished within each stage, but since all stages are not actively running at the same time and the data needed for a subsequent stage is being persisted durably, the following benefits surface with FTE.

Tolerate individual task failures: if a worker fails, then only the work in the affected tasks of the currently executing stage need to be restarted
Reliability run queries of any size: all tasks within a stage do not have to be executing at the same time thus the all-in-memory constraint is eliminated
More flexible resource management: the Trino engine can be flexible enough to allow higher-priority queries to the front of the queue without cancelling others
Adapt query execution dynamically midstream: Trino has the ability to adjust according to different running conditions and optimize its execution strategy midstream

FTE is a cluster-wide setting, so in some organizations it would make sense to run 1+ FTE clusters as well as 1+ (classically) ‘interactive’ clusters.

Pipelines across the medallion architecture

As the diagram below shows, the medallion architecture is a common pattern we see in data lake environments and is focused on the flow of data from ingestion all the way to consumption for analytical, ML/AI, or other data application needs.

Data pipeline construction options

The next diagram suggests that this end-to-end flow of data constructs a comprehensive data pipeline. The diagram also shows that the E (extract) & L (load) from ELT is often referred to, in aggregate, as ingestion. Interestingly, the T (transform) output can be input for another transformation; or even more than one.

More interesting to note is that those series of transforms; those data pipelines, can be constructed with classical SQL (no surprise for the Trino query engine), but also with Python.

SQL

Probably the best argument for building your transforms and other data pipeline activities such as validating, enriching, and aggregating, is that almost everyone knows SQL.

Pros for SQL

Widespread knowledge
Compact
Easy to understand for simple logic

Cons for SQL

No real time debugging
Can’t perform other actions on results
No unit tests
Version diffs often hard to read
As complexity grows, SQL gets more and more difficult

Python

Some folks want to implement their data pipelines with a programming approach, and currently Python is the most popular language for data processing. For Python transformation jobs with Trino, we really are talking about using a Dataframe API.

A Dataframe is a two-dimensional, tabular data structure that can be conceptually understood as a table. The following example is implemented with the very popular PySpark distributed processing tool.

Pros for Python Dataframes

Reusable functions
Version diffs are easy to read
Easily create unit tests
Could combine Python libraries like Numpy & Scipy
Perform real-time debugging
Easy to follow complex logic
Allows for logging

Cons for Python Dataframes

A bit verbose
Need to be a Python programmer

For a Python Dataframe application to run on Trino, the framework has to convert the code into SQL that Trino can understand. This process can end up producing SQL that looks different than what you might create directly with SQL.

Here is a simple multi-table join query written by a human.

Programming that same functionality in Python using methods for each operation such as filtering, joining, or aggregating, can end up creating a query that looks more exhaustive.

Fortunately, Trino has a feature-rich Cost-Based Optimizer (CBO) which optimizes whatever query it receives to determine the best approach to run it quickly & efficiently. Back to our earlier ‘parallel processing’ discussion, this means that we expect the CBO to create the same query plan / DAG regardless of the SQL presented if both are trying to answer the same question.

Dataframe API alternatives

The primary options for programming to a Dataframe API using Python with the goal of running the work in a Trino cluster are PyStarburt and Ibis.

PyStarburst

PyStarburst features a PySpark-like syntax using the same lazy execution model (waits until an input/output ‘action’ to be called before actually submitting work to the cluster). At that point, the Python code gets converted to SQL so that the heavy lifting is done by Trino.

To learn more about PyStarburst, including seeing some programming examples, check out my PyStarburst tagged content.

Ibis

Ibis provides a common Dataframe API for data manipulation in Python, and compiling that API into a variety of compute engines.

With this tooling, Trino is just one of the options of where your Dataframe code could be executed. Check out ibis & trino (dataframe api part deux) for code examples running in Trino.

Comparing Dataframe APIs

The APIs themselves for these two frameworks are similar, but surely not identical, as the following shows.

While the SQL that gets generated by each tool might look different, they are both attempting to solve the same problem so we rely on the CBO to generate the same, or very similar, query plan / DAG.

Orchestration & scheduling

Trino does not provide a native orchestration and/or scheduling subsystem so you still need to provide your own solution to this space. Tools like Airflow, Dagster, and Prefect are popular and can have a mix of SQL, Python, and other tooling steps that need to be executed appropriately and with dependencies.

Summary

Key points

Trino is, and has been, an appropriate clustering technology for running your transformation processing jobs.

Fault-tolerant execution mode makes it even more robust
Data engineers can leverage SQL and/or Python when constructing their pipelines

Dataframe API recommendations

If you use Starburst products >> Choose PyStarburst

More similar to PySpark Dataframe API (easier to port from, or even back to, Spark)
More likely to be optimized better for Trino than Ibis since development community knows Trino very well

If you are using OSS Trino >> Choose Ibis

PyStarburst is not supported here
Gain optionality for the backend execution engine

If you could benefit from additional commentary on this topic, please check out my YouTube video below.

yarp: yet another rag post (this time using sql)

Sure, you’ve read a BUNCH of retrieval augmented generation (RAG) blog posts by now, but surely NOT ENOUGH of them focused on using SQL, instead of Python, to get it all working.

If you could use a refresher (or even a first-time explanation) of what RAG is, I hope my video helps.

Starburst’s new AI functions

For this post, I’m going to explore Starburst Enterprise‘s new AI functions in the examples below. They are broken up into 3 categories.

Task functions such as analyzing sentiment, classification, fixing grammar, masking text, and translating languages. The docs show good examples of these.
Embedding functions are all about creating mathematical representations of text that can be clumped together near similar text (see my video above if this is new to you).
Prompt functions — well, at this version there really is only one function. It is called prompt() and it does what it sounds like; you can send in an input prompt and get a response back from an LLM (think ChatGPT) with the answer.

I’ll show you how the generate_embeddings() function works shortly, but let’s just play with prompt() first. Remember, we are using SQL in this blog post! Let’s give it a try with a very important question.

SELECT starburst.ai.prompt(
  'What underlying technology does Starburst use?',
  'bedrock_claude35');

Yep, that’s right! Oh, did I say I am a Developer Advocate for Trino & Starburst? Well… I did now (disclaimer posted).

You probably noticed that the method I used has a second argument. The prompt() docs show that this is the name of an LLM model. This brings up a super cool feature in Starburst Enterprise; the ability to configure (and manage use of) multiple models. In fact, on-prem and/or cloud-based ones. But… let’s save all of that for another post.

Our use case

Imagine if you will that the cartoon character Fred Flintstone was keeping a personal diary.

If we asked things of a public LLM about Fred’s interests or experiences that are only in this diary, we will not get an accurate answer. For example, I happen to know Fred, like me, is an active Postcrossing member and sends/receives postcards all over the world.

Let’s see what happens if we ask an LLM if he likes “snailmail”.

SELECT starburst.ai.prompt(
  'Does Fred Flintstone like to send and receive mail?',
  'bedrock_claude35');

remember, results are probabilistic and can be different each time

Let’s see if we can build a data engineering pipeline to prepare Fred’s diary for use in a simple RAG workflow.

Data pipeline

As unstructured docs in ai (the wild west) identifies, ETL pipelines can get rather complicated, rather quickly, depending on the input data. Furthermore, exploring ai data pipelines (hands-on with datavolo), gives you some insight into some of the tooling options you have in this problem domain.

Test data

For our example, we have it pretty easy. Fred uses his stone-aged mobile device (hey, he’s a fictional character so why not give him cellular service and internet access!) to record his journal entries. The good news for this blog post is that we can just pick up at the point where there is an Apache Iceberg table that holds all the entries.

-- create a table for diary entries
CREATE TABLE diary_entries (
    owner varchar,
    entry_date timestamp,
    entry_text varchar
);
show create table diary_entries;

Let’s populate it with some example entries (again, this would have already been done for us — we just need some test data).

-- add initial entries 
INSERT INTO diary_entries (owner, entry_date, entry_text)
VALUES 
  ('Fred Flintstone', date_add('day', -99, current_timestamp), 'It is a new year and today I decided to start keeping a diary. Some may imagine that life is pretty boring here in Bedrock, but my family and friends make sure life is NEVER boring! As today is a Monday, I''m off to work at the quarry. I work there everyday until 5pm. Despite it being physical work, I do enjoy it. I was happy to play with Dino after work.'),
  ('Fred Flintstone', date_add('day', -98, current_timestamp), 'Ok, keeping a diary is something I''ll need to get used to. It seems I missed yesterday, but I''m still here. It is mid-week and I''m finally jotting down my notes this Wednesday evening. Barney and I just got back from our Loyal Order of Water Buffaloes meeting. Barney is running for Vice President this year and I fully support him.'),
  ('Fred Flintstone', date_add('day', -97, current_timestamp), 'Ahh... the weekend is finally here. I''ve been so busy (and well... I guess I simply forgot as well) to enter my thoughts into my daily journal. Off to the drive in movie tonight.'),
  ('Fred Flintstone', date_add('day', -96, current_timestamp), 'The weekends always go so fast. Family & friends continue to be the thing I spend my time on. I''m thinking I need a new hobby, too.'),
  ('Fred Flintstone', date_add('day', -95, current_timestamp), 'I have decided to try sending & receiving postcards as a new hobby. I found this website called Postcrossings that looks pretty interesting. The site says that over 800,000 people enjoy this hobby. Postcrossers, that''s what these hobbyists are called, have sent over 78 million postcards so far.'),
  ('Fred Flintstone', date_add('day', -94, current_timestamp), 'I was able to send 3 postcards today to people the system provided addresses to me for. This really is the hobby for me.'),
  ('Fred Flintstone', date_add('day', -93, current_timestamp), 'Fun Thursday night getting 2 more postcards in the mail.  I am loving this hobby even though Wilma thinks I''m nuts.'),
  ('Fred Flintstone', date_add('day', -92, current_timestamp), 'I went hiking over the weekend for one of my existing hobbies; waterfall viewing. I saw two of the largest ones in Bedrock; Big Falls and Prettybig Falls. I was even able to buy a few postcards at the gift shop.'),
  ('Fred Flintstone', date_add('day', -91, current_timestamp), 'I had so much fun looking at waterfalls that I visited the artificial one we have at the mall; Shopping Shoal. Silly name, but I did find out that shoals were usually rocks in rough water that paddlers enjoy traversing through.'),
  ('Fred Flintstone', date_add('day', -90, current_timestamp), 'This   was   a    second   journal entry for Jan 15. It shows    some   extra   spacing    that can    be cleaned up.'),
  ('Fred Flintstone', date_add('day', -89, current_timestamp), 'Woke up, fell out of bed. Dragged a comb across my head. Found my way downstairs and drank a cup and looking up, I noticed I was late. Found my coat and grabbed my hat. Made the bus in seconds flat. Found my way upstairs and had a smoke and somebody spoke and I went into a dream.'),
  ('Fred Flintstone', date_add('day', -88, current_timestamp), 'Woke up, fell out of bed. Dragged a comb across my head. Found my way downstairs and drank a cup and looking up, I noticed I was late. Found my coat and grabbed my hat. Made the bus in seconds flat. Found my way upstairs and had a smoke and somebody spoke and I went into a dream.'),
  ('Fred Flintstone', date_add('day', -87, current_timestamp), 'These last days have been a complete blur. So busy just life that I haven''t been able to focus on my new hobby of mailing postcards.');
select * from diary_entries;

Transformation logic

In our situation, we do not need to do complex parsing and chunking steps in our ETL pipeline since we already have text for each journal entry stored in a varchar field of an Iceberg table. We still do need to create embeddings from these (naturally created) text chunks.

We need a place to store these embeddings. You could always use a purpose-built vector database like I used in exploring ai data pipelines (hands-on with datavolo). Instead we will store our AI embeddings “lakeside”. Meaning as a column in our Iceberg table. Let’s add a column for them to be stored.

-- prep to store vector embeddings 
--  'lakeside' right in the Iceberg table
ALTER TABLE diary_entries 
  ADD COLUMN entry_text_embeddings array(double);

Now, we can leverage generate_embeddings() to create the vector embeddings and then just store them into the new column.

-- create and persist the embeddings
UPDATE diary_entries
   SET entry_text_embeddings = 
         starburst.ai.generate_embedding(
             entry_text, 'bedrock_titan'
         )
 WHERE entry_text_embeddings IS NULL;
select * from diary_entries;

Taking a deeper look into one of the entry_text_embeddings column’s values, you’ll notice this is just a very long array of numbers ranging from -1 to 1. Again, the video at the top of this post attempts to give a very high-level understanding of what these are representing.

For sake of this post, this array of values from -1 to 1 is a multi-dimensional mathematical representation of the contents in the corresponding entry_text column for each row. These can be used in our RAG workflow later when trying to find chunks of text (journal entries in our use case) that are similar to a request that will be asked about Fred Flintstone.

It is worthy of noticing 2 cool things we’ve just done.

We used SQL to interact with an AI model to generate vector embeddings!
We didn’t need a separate vector database to store those embeddings!! We stored them “lakeside” to use the phrase I hear everyday at Starburst.

RAG workflow

Now that we have our data prepared, we can focus on the SQL-based RAG workflow we are interested in.

The “R”

RAG starts with “R” so let’s see if we can use Starburst’s AI functions to retrieve diary entries that align with the initial question we asked in the earlier use case description.

SELECT
  owner,
  entry_date,
  entry_text,
  cosine_similarity(
    starburst.ai.generate_embedding (
      'Does Fred Flintstone like to send and receive mail?',
      'bedrock_titan'
    ),
    entry_text_embeddings
  ) AS similarity_score
FROM
  diary_entries
ORDER BY
  similarity_score DESC
LIMIT
  5;

Notice that the entry_text values below with the highest similarity_score values reference postcards. Our question didn’t mention postcards specifically, but that’s the power of the similarity searching based on embeddings, not just a simple LIKE operator. The cosine_similarity() function already existed in Trino.

The “AG”

The more full query below surfaces the query above as a CTE and uses it to create a second one, json_results, which just packages up the journal entries into a JSON document. Finally, the prompt() function is used after including the similar journal entries retrieved and adding some additional context directing the LLM to use them in the question being asked.

-- Retrieve the top 5 most relevant journal
--  entries based on semantic similarity
WITH vector_search AS(
    SELECT
        owner, entry_date, entry_text,
        cosine_similarity(
            starburst.ai.generate_embedding(
                'Does Fred Flintstone like to send and receive mail?', 
                'bedrock_titan'
            ),
            entry_text_embeddings
        ) AS similarity_score
    FROM diary_entries
    ORDER BY similarity_score DESC
    LIMIT 5
),

-- Augment the results by converting  
--  them into a JSON object
json_results AS (
SELECT CAST(map_agg(to_iso8601(entry_date), json_object(
    key 'journal entry date' VALUE entry_date,
    key 'journal entry text' VALUE entry_text)) AS JSON) AS json_data
FROM
    vector_search
)

-- Generate an augmented response 
--  using the LLM
SELECT
    starburst.ai.prompt(concat(
        'Using the list of journal entries provided in JSON, ',
        'Does Fred Flintstone like to send and receive mail?',
        json_format(json_data)), 'bedrock_claude35')
FROM json_results;

Check out the results this time!

maybe a little overly chatty, but the response is more correct

Another question: How does Fred Flintstone feel about public transportation?

Base LLM response

Enhanced RAG response

Another question: What hobbies or activities is Fred Flintstone interested in?

Base LLM response

Better prompt engineering would merge these answers together in the RAG response.

Enhanced RAG response

Another question: How often does Fred Flintstone go to the movies?

Base LLM response

It is kind of funny that the animated theme song does show him going to the movies!

Enhanced RAG response

It is fair to say that even with this very primitive RAG workflow (again, written using SQL!), the responses are better after being augmented with non-public information.

Video version

As we all learn different ways, I put a quick demo together of the RAG workflow for those who prefer watching videos.

Parting thoughts

Clearly more work would need to go into this exercise to get it to production level quality, but that isn’t the point. The intention of this article was to show that data engineers and data analysts who know SQL already have a chance to do some investigatory GenAI efforts without having to know Python and/or to engage their over-committed data science team.

Furthermore, a SQL interface to the underlying AI tooling offers application developers and AI engineers even more options for when they are assembling production-grade GenAI applications.

To me, that all sounds pretty darn cool. What do you think?

trino query plan analysis (video series)

Just like every other query engine, you’ve made it to the big leagues when you can actually decipher the query plan (aka explain plan) that gets produced. I’ve been explaining how data lake analytics engines such as Hive, Spark, and Trino break their work down into stages that operate in massively parallel fashions — and detailing all the fun & expense of the data exchange happening between those stages.

These following high-level (and rather universal) concepts are explained in the 1st of 3 videos.

Understand how & why datasets are divided into splits
Explain the decomposition of a query into tasks, stages, and exchanges
Visualize multi-stage queries that tackle sorting, aggregation, and multiple types of joins
See the conceptual visualization of a parallelized query aligns with Trino’s generated visual plan at runtime

The middle video is focused on the particular syntax Trino’s CBO uses in the textual query plan. This video hits on topics such as the following.

Optimizations available with table statistics
Generating a textual and a visual query plan
Understand the format of the query plan output to include fragments, exchanges, distribution, estimates, and performance
Performance with, and without, accurate table statistics

The final video in the series is the FUN one! I squarely focus on walking through query plans since you have all that background from the earlier videos, with the mindset of spotting potential performance issues and remedies to overcome them. You’ll see first-hand topics such as the ones listed below.

Column pruning, predicate pushdown, and partition pruning
Trino’s approach to tackle sorting and aggregation across two stages
Join optimizations including broadcast, reordering, and dynamic filtering

In addition to the comments sections of the videos above and this blog post’s comments area, feel free to bring your concerns to Starburst Community Forum.