don’t lead your chat-based llm (it wants to please)

There’s a whole lot of talk about “AI slop” nowadays (deservedly, IMO), but what does it mean? In the earliest days of GenAI & RAG (not that long ago!) there was a LOT of talk about “hallucinations”, but what does that mean? Good eye; you noticed I answered a question with another question.

  1. Hallucinations
  2. Probabilistic
  3. AI slop
  4. I miss the 80’s
  5. Repeating myself
  6. Misleading humans
  7. Leading GenAI (we finally got to the title of this thing!)
    1. An example
      1. Lead GenAI
      2. Lead it in the opposing direction
      3. Be neutral
    2. What do I want?
  8. Thanks for reading

Hallucinations

Google’s Gemini answers the “what is genai hallucination” question this way…

That’s a decent answer, but what I personally don’t like about chat-based LLM (Large Language Model) interfaces is that too many, maybe most, people think they are actually interacting with some kind of intelligent being. In reality, these tools, like the well-known ChatGPT, are really just doing some (very) fancy math that lets them find similarities in previously consumed content from the internet and then predict how best to assemble their reply.

Probabilistic

Oh… and the answers are PROBABILISTIC, not DETERMINISTIC. Fancy terms that basically say it doesn’t “know” what the exact answer is, and it is very likely that the next time you ask the same question your answer could be different. Let’s ask Google’s Gemini that same question again.

The good news is that these two responses are pretty much saying the same thing. Maybe that’s not a big deal for this question, but “hallucinations” can be far more impactful.

I keep air-quoting “hallucinations” because the LLM isn’t just messing up sometimes and making stuff up — “hallucinations” are really how they work all the time (i.e. finding similarities in what they’ve seen before and generating a response mushing all of that back together). The LLM doesn’t know if it is right or wrong, much less is aware of the impact.

AI slop

These hallucinations (ahem, “normal output of the LLM”) have rolled into a larger phrase now: “AI slop”. If you’ve heard the term you might know it from the crappy-looking videos on social media. In a business context it often refers to someone presenting you information from their GenAI tool of choice (aka these chat-based LLM interfaces like ChatGPT that we’re talking about) without performing any oversight activities.

You read it and think, “what the H. E. double-hockeysticks was that!?” I’m going to come back to it, but basically the warning is… DON’T TRUST THE RESPONSE AT FACE VALUE!

If you are going to use these tools to help you, then you have to accept that the output has YOUR name on it. If it is wrong, misleading, rambling (like my own blog posts that you can surely tell are not AI-generated), or even disorienting, it reflects badly on you. Repeat… DON’T TRUST THE RESPONSE AT FACE VALUE!

I miss the 80’s

All of that is not inherently bad or good (okay, because most folks don’t really know what’s going on… it IS bad). For myself, I can handle this and despite wishing I was back in the 1980’s sometimes (Gen X folks know what I mean), these tools are here AND I find myself leveraging them, too.

Yes, that’s despite my grave concerns of the environmental impact of what all of this tech is doing to us, but I think we have to save that for another post — this one is already rambling way too much.

Repeating myself

If you’re still with me, let’s review…

  • GenAI is NOT really an intelligence; it is just amazingly sophisticated pattern matching
  • GenAI isn’t guessing/hallucinating SOMETIMES, it is just doing what it does ALL THE TIME
  • GenAI will likely predict a different answer to the same question each time you ask it

I can live with that (i.e. repeat… DON’T TRUST THE RESPONSE AT FACE VALUE!) as I understand I am the owner of the final output I share with anyone. Then what am I all frustrated about? It comes down to two things.

Misleading humans

First, I wish I could count on everyone to understand that, but the tech companies who make all of this have realized it would be much more popular if they let people believe they were talking to some super-mind. Who doesn’t want a super-mind who only wants to help us be more productive and hopefully more accurate? Heck, I’d pay a LOT for that…

Who’s at fault for #1? Like most things, there’s a big bowl of blame and yes, we OUGHT to be able to rely on the average person to know what they are using, but this stuff is a whole bunch more complicated than driving a car. Okay, I do blame the GenAI companies for misleading consumers.

Leading GenAI (we finally got to the title of this thing!)

Second, I absolutely HATE that the default behavior of these chat-based LLM tools is to do just that — chat it up with us. Who doesn’t like to chat? And if we are going to chat with someone/something, we sure want it to be entertaining, relatively nice, maybe even interesting, but absolutely FLATTERING! There’s the rub I have.

Couple a misunderstanding of what is happening under the covers with an overly-agreeable tool that is going to make sure it writes its responses in a way that makes you feel good — maybe even makes you feel smart (ex: “the super-mind agrees with me!”), and we’ve got a deadly combination.

An example

Lead GenAI

I work in the data engineering world (yes, yet another role that is figuring out our future with these GenAI tools) and someone sent me some GenAI output suggesting that Apache Iceberg would be a good option for an operational database. If you don’t know what either of these mean, that’s OK, they are just the backdrop (reference links intentionally left out). I consider myself very well versed in this particular technology domain and my immediate response was “that’s incorrect – it would NOT be a good option.”

I wanted to reproduce this AI slop/hallucination/inaccuracy/lie/normal-output so I went to Google Gemini and asked, “why is apache iceberg a good choice for an operational database” and was flabbergasted by the opening paragraph that said it was a STRONG choice…

Yep, the darn thing was telling me what I wanted to hear; being overly-agreeable. Not because that’s really what it surmised, but because it could fit the narrative I wanted. I was “leading the witness” and there was no opposing attorney to object. It stroked my ego and rambled on and on about why this was a good idea.

Interestingly, the correct answer was in there if one could read between the lines & had enough domain knowledge to notice the nuance. Even more interestingly, waaaaaaaay down at the very bottom, buried within a long run-on AI slop paragraph, it finally mentioned that it was NOT a replacement for an operational database, which is exactly what the user needed to hear at the very top of the misdirecting manifesto.

Lead it in the opposing direction

You guessed it; I decided to flip the way I was leading my witness by suggesting it was a bad idea with, “why is apache iceberg not a good choice for an operational database”, and the darn thing did it again — it told me what I wanted to hear!!

Stop acting like a human! Stop stroking our egos! Stop being so overly-agreeable! Since most folks don’t really know what’s really going on, stop tricking people just to make a buck! Be more like Sergeant Friday!

Repeat… DON’T TRUST THE RESPONSE AT FACE VALUE!

Be neutral

If you want to get a better answer, it helps to not lead the tool. Just ask the base question without suggesting the answer. It wants to please you (the “it” is an algorithm and some software, not a conscious being — it doesn’t have feelings; even if it sounds like it does). It can easily take you down the wrong path.

I asked it one last time, but went with just “is apache iceberg a good choice for an operational database”, which let the simulated intelligence skip the simulated flattery (and the simulated defensiveness) and give a pretty solid response.

Repeat… DON’T TRUST THE RESPONSE AT FACE VALUE!

What do I want?

From tool vendors: Take away all this nice/polite/overly-agreeable/human-ISH nonsense and just act like a tool — be a darn computer and give me a computer-ISH answer.

From end users: Assemble the best response you can by possibly asking the question a few ways & using more than one particular tool. Remember that you own the final product.

Repeat… DON’T TRUST THE RESPONSE AT FACE VALUE!

Thanks for reading

understanding iceberg deletion vectors (and enjoying some humble pie)

image source: https://www.starburst.io/blog/iceberg-v3/

My confession

While digging deeper into Apache Iceberg v3 deletion vectors, I’ve realized that one of my initial assumptions was wrong. Furthermore, letting the great Danica Fine record me pontificating on how it all works is forcing me to eat a big piece of humble pie! I was close, but surely not close enough. Here’s my previous thinking…

I sure SOUNDED confident, but I did mess up one very important concept. I ass-u-me-d that a deletion vector could span more than one data file. Unfortunately, that’s not the case!

TL;DR

For a given snapshot, there can be 0 or 1 deletion vector file per data file AND an Iceberg deletion vector file cannot span more than one data file. That said, having writers logically merge, & physically write, an updated deletion vector file for a new snapshot still goes a long way to make this a better solution than the v2 positional delete files.

Should you continue reading?

Okay, if that explains it all to you — you are free to go (thanks for your attention to this matter), but if it didn’t make sense let’s explore some more. AND IF the whole concept of Apache Iceberg v2 “positional delete files” is something new, please check out Alex Merced’s Understanding Apache Iceberg Delete Files first.

Heck, if you need an Iceberg overview in general, the good folks at Starburst Academy put a nice & short video together, Understanding Apache Iceberg architecture, to fill in the blanks.

For the Missourians (the show-me state)

Yep, time to get into it and SEE what’s going on for those, like me, who want to see an example or three. I’m going to use Starburst Galaxy as it has the new Iceberg v3 implementation available to anyone — check out the free trial if you haven’t used it before.

Test data

We can use the tpch.tiny.customer table from the Trino TPC-H connector as a source for our testing data.

This has 1500 records whose custkey values so nicely range from 1 – 1500. As this is a very small example (and we want to create multiple data files), we can insert them into our test tables in 3 bands of 500 records each.

Additionally, when we delete records, we can delete evenly distributed batches of records based on how nicely the customer key value divides by 5 and 2. You’ll see!

Version 2 positional delete files

Set up a table

Create a test table, populate it with the 1500 rows, and verify 3 data files of 500 records each are present.

create table cust_v2
with (type='iceberg', format_version=2)
as select * from tpch.tiny.customer
    where custkey <= 500;

insert into cust_v2
  select * from tpch.tiny.customer 
   where custkey between 501 and 1000;

insert into cust_v2
  select * from tpch.tiny.customer 
   where custkey > 1000;

select substring(file_path, length(file_path) - 20) 
         as end_of_file_name, 
       file_format, record_count
from "cust_v2$files";

Delete some records

Delete the 300 records whose customer keys are evenly divisible by 5, then query the $files metadata table again.

delete from cust_v2
 where mod(custkey, 5) = 0;

The bottom 3 files are the v2 positional delete files. One for each of the data files. Each referencing 100 records from the 300 total that have been deleted.

Delete some more

Now, delete the 600 records whose customer keys are even numbers and list the $files metadata table again.

delete from cust_v2
 where mod(custkey, 2) = 0;

The middle 3 files are the next round of v2 positional delete files. Like before, one for each of the data files. Each referencing 200 records from the 600 total that have been deleted this time.

What happens when we delete more?

Each time we run another DELETE command a new positional delete file will be created for each affected data file. This will continue to grow and sprawl until we can run a compaction operation to clean it all up.
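For reference, compaction in Trino is exposed as a table procedure. Here is a minimal sketch against our test table (the retention threshold value is just an illustration):

alter table cust_v2 execute optimize;

-- optionally, age out old snapshots and the files they reference
alter table cust_v2 execute expire_snapshots(retention_threshold => '7d');

Running optimize rewrites the data files and folds the positional delete files in, so the $files listing shrinks back down.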

Version 3 deletion vectors

Set up a table

Create a v3 test table, populate it with the same 1500 rows, and verify (just as before) 3 data files of 500 records each are present.

create table cust_v3
with (type='iceberg', format_version=3)
as select * from tpch.tiny.customer
    where custkey <= 500;

insert into cust_v3
  select * from tpch.tiny.customer 
   where custkey between 501 and 1000;

insert into cust_v3
  select * from tpch.tiny.customer 
   where custkey > 1000;

select substring(file_path, length(file_path) - 20) 
         as end_of_file_name, 
       file_format, record_count
from "cust_v3$files";

Delete some records

Delete the 300 records whose customer keys are evenly divisible by 5, then query the $files metadata table again.

delete from cust_v3
 where mod(custkey, 5) = 0;

As before, the bottom 3 files represent the 300 records deleted. What is different is that these delete files are persisted as Puffin files, not Parquet ones.

Delete some more

Now, delete the 600 records with even numbers for their customer keys then list the $files metadata table again.

delete from cust_v3
 where mod(custkey, 2) = 0;

Notice that there are still only 3 deletion vector (Puffin) files. That’s because the compute engine merged the contents of the previous 100 affected rows per file with the new change deleting another 200 rows. The deletion vectors now include pointers to the comprehensive 300 records for each of their associated 500 record data files.

What happens when we delete more?

For any new deletions aligned to a data file with an existing deletion vector file, the compute engine will merge the comprehensive contents into a new Puffin file. This approach keeps a smaller number of delete files around that need to be read when querying the table. It prevents the delete file sprawl that occurs in the v2 implementation, and the vectors will still be rolled up when a compaction process is run.

Why do we care?

With the sprawl of v2 positional delete files comes more and more file I/O, which decreases performance and increases costs. The v3 deletion vector files ensure that for a given snapshot there will never be more than a single deletion file aligned to each data file. Performance doesn’t keep degrading as more and more deletions occur, and we prevent cost spikes on data lake solutions that charge by the number of GET operations.

Did we have fun?

I don’t know about you, but I sure had a blast exploring!

If you want to go digging for even more, we didn’t even compare the difference in what is being stored in the Parquet-based v2 positional delete files vs the Puffin-based v3 deletion vector files. But hey… let’s get into that another time!

Happy Iceberging!!

my freewrite alternative (under $200)

I don’t know how Freewrite can use the word “free” in their company name. I adore the look & promise of their original Smart Typewriter, but $700 is far from free AND way too much!

Even their mid-tier Traveler comes in at $550 (I feel like Steve Martin in Father of the Bride when he declared, “A cake, Franck, is made of flour and water. My first car didn’t cost $1200” — THIS IS A CLAMSHELL KEYBOARD & A TINY E-INK SCREEN!!). I was actually hoping the Alpha would list for a reasonable $200 instead of its $350 sticker shock.

I’m probably digging the Alpha as it is the closest to the used AlphaSmart Neo I picked up on eBay after remembering it while reading 9 Distraction-Free Writing Devices. In fact, it would have been exactly what I wanted, but we’re talking about some pretty old hardware and we all know how tough folks can be on keyboards. Mine had a few keys that didn’t work and thankfully the seller gave me a full refund.

The original Freewrite Smart Typewriter has been out there for a few years now and I know I’m not the only person drooling over it. I simply can’t understand why others haven’t entered into this market, including those who want to solely compete on price and bring this thing down. Again, the target price in my head is $200 and the manufacturer can still make some money.

There is the Zerowriter Ink with a pretty decent price of $230, but as far as I can tell it isn’t a REAL product I can buy and play with today. So, I went back to the “9 devices” post I referenced earlier, which got me thinking about a small Android-based e-ink device coupled with a Bluetooth keyboard. Some digging around on Amazon ultimately got me to this setup — AND AT THE RIGHT PRICE!

Yes, you are right… this is NOT a completely dedicated device, but for me it is darn close as once I got it all set up I’m in the Google Docs app almost immediately and have a full-sized keyboard underneath my fingertips.

I also verified it worked great being online or offline as the Google app took care of all the magic. It is working great for me and I’ve already banged out my first body of work on it. Plus, it is so cute!!

The added bonus to me is the very nice-feeling $30 keyboard, as I can be as rough as I want on it without worrying I might destroy my device.

building trino data pipelines (with sql or python)

When people think of Trino they naturally home in on the benefits of querying data with SQL in this “engine that runs at ludicrous speed”. When asked if Trino would be useful as a transformation processing engine for ETL/ELT workloads, not all would immediately answer YES!

Usually, the answer depends on if someone has only heard of Trino as opposed to those who have actually used Trino. The latter will usually say ABSOLUTELY it can! This post focuses on ensuring everyone knows Trino is well suited for constructing data pipelines AND that data engineers have their choice of SQL and/or Python when constructing them.

History of Trino for ETL workloads

The project website is a great place to start if Trino is new to you. The Starburst DevCenter is another useful resource. And, of course, I’m proud to offer my own blog posts discussing Trino and/or Starburst to the community. Please spend a little bit of time at some (or all) of these resources if needed before continuing.

Within a year after Trino (then known as Presto) was released into production at Facebook for interactive use cases, users started scheduling batch/ETL queries with Trino instead of Hive. Within a few more years, 50% of the existing ETL workloads were running on Trino as well as 85% of all new workloads. Soon others caught on, and companies like Salesforce and Lyft began utilizing Trino for their batch/ETL workloads, too.

Architectural concerns & remedies

Trino came out of the gate blowing Hive away on query execution speed because Trino focused more on performance as opposed to Hive’s focus on reliability. On a side note, hive, trino & spark features (their journeys to sql, performance & durability) points out that both of these engines are always improving. To understand all of that, let’s talk about the original architecture of Trino.

Parallel processing fundamentals

Fundamentally, the Trino engine works as a distributed processing engine similar to Hive and Spark. Meaning, it breaks work down into multiple stages organized in a directed acyclic graph (DAG). The stages are where the same processing can be done on many slices (Trino calls them splits, Spark calls them partitions, and Hive calls them blocks) of data simultaneously.

When data needs to move to another stage of processing, it has to be redistributed to allow processing to continue in a highly parallel fashion. Examples of when this is needed are common SQL operations such as GROUP BY, JOIN, and ORDER BY. If these concepts are new to you, I recommend you watch, at least, the first video in trino query plan analysis (video series), which I’m embedding next.

Initial Trino architecture

With that DAG / query plan knowledge fresh in your mind, take a look at the following image which represents the base architecture of why Trino runs fast.

Data leaving one stage for another is redistributed into the next stage (often referred to as shuffling). Trino’s speed benefit comes from not persisting that intermediary data to disk — it acts more like a streaming engine internally and sends the data on to the next stage directly. The consequence of this is that if the query fails before completion, the initiator is returned an error message.

The design goals for interactive querying performance did not provide sufficient support for long-running and memory-intensive queries.

  • Long-running queries are unreliable: the all-or-nothing architecture makes it really hard to tolerate faults
  • Distributed memory limit: with streaming shuffle, aggregations and joins have to process all at once

Also, with the original architecture, it is really hard to apply classic techniques like adaptive query execution, speculative execution, and skew handling.

Fault-tolerant execution

As hive, trino & spark features (their journeys to sql, performance & durability) calls out, platforms continue to grow over time to address as many features and limitations as possible. Trino has been no different and Project Tardigrade added fault-tolerant execution (FTE) benefits into the architectural design.

The earlier identified issues with the initial architecture were overcome by storing the intermediary data produced by the parallel tasks within a stage. This model enforces the stage-by-stage execution model that engines like Hive were famous for (and gave it the reliability we are discussing now). The following diagram shows the flow of that intermediary data to configurable storage.

The same work is still accomplished within each stage, but since all stages are not actively running at the same time and the data needed for a subsequent stage is being persisted durably, the following benefits surface with FTE.

  • Tolerate individual task failures: if a worker fails, then only the work in the affected tasks of the currently executing stage need to be restarted
  • Reliably run queries of any size: all tasks within a stage do not have to be executing at the same time, thus the all-in-memory constraint is eliminated
  • More flexible resource management: the Trino engine can be flexible enough to allow higher-priority queries to jump to the front of the queue without cancelling others
  • Adapt query execution dynamically midstream: Trino has the ability to adjust according to different running conditions and optimize its execution strategy midstream

FTE is a cluster-wide setting, so in some organizations it would make sense to run 1+ FTE clusters as well as 1+ (classically) ‘interactive’ clusters.
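For the curious, enabling FTE in open-source Trino is a configuration exercise rather than a per-query one. A minimal sketch based on the documented properties (the bucket name below is made up):

# etc/config.properties -- retry individual tasks instead of failing the whole query
retry-policy=TASK

# etc/exchange-manager.properties -- durable home for the intermediary exchange data
exchange-manager.name=filesystem
exchange.base-directories=s3://example-fte-exchange-bucket

With retry-policy=TASK, the engine spools the stage-to-stage data to the exchange storage configured above, which is what makes restarting only the failed tasks possible.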

Pipelines across the medallion architecture

As the diagram below shows, the medallion architecture is a common pattern we see in data lake environments and is focused on the flow of data from ingestion all the way to consumption for analytical, ML/AI, or other data application needs.

Data pipeline construction options

The next diagram suggests that this end-to-end flow of data constructs a comprehensive data pipeline. The diagram also shows that the E (extract) & L (load) from ELT is often referred to, in aggregate, as ingestion. Interestingly, the T (transform) output can be input for another transformation; or even more than one.

More interesting to note is that those series of transforms, those data pipelines, can be constructed with classical SQL (no surprise for the Trino query engine), but also with Python.

SQL

Probably the best argument for building your transforms, and other data pipeline activities such as validating, enriching, and aggregating, with SQL is that almost everyone knows SQL.

Pros for SQL

  • Widespread knowledge
  • Compact
  • Easy to understand for simple logic

Cons for SQL

  • No real-time debugging
  • Can’t perform other actions on results
  • No unit tests
  • Version diffs often hard to read
  • As complexity grows, SQL gets more and more difficult

Python

Some folks want to implement their data pipelines with a programming approach, and currently Python is the most popular language for data processing. For Python transformation jobs with Trino, we really are talking about using a Dataframe API.

A Dataframe is a two-dimensional, tabular data structure that can be conceptually understood as a table. The following example is implemented with the very popular PySpark distributed processing tool.

Pros for Python Dataframes

  • Reusable functions
  • Version diffs are easy to read
  • Easily create unit tests
  • Could combine Python libraries like Numpy & Scipy
  • Perform real-time debugging
  • Easy to follow complex logic
  • Allows for logging

Cons for Python Dataframes

  • A bit verbose
  • Need to be a Python programmer

For a Python Dataframe application to run on Trino, the framework has to convert the code into SQL that Trino can understand. This process can end up producing SQL that looks different than what you might create directly with SQL.

Here is a simple multi-table join query written by a human.
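Something along these lines, using the TPC-H sample tables as a stand-in for the original example:

select c.name,
       sum(o.totalprice) as total_spend
  from tpch.tiny.customer c
  join tpch.tiny.orders o
    on c.custkey = o.custkey
 where c.mktsegment = 'BUILDING'
 group by c.name
 order by total_spend desc;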

Programming that same functionality in Python using methods for each operation such as filtering, joining, or aggregating, can end up creating a query that looks more exhaustive.
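As an illustration only (this is hand-mocked, not actual PyStarburst or Ibis output), the same question might arrive at Trino shaped more like this:

SELECT "name", "total_spend"
FROM (
  SELECT "t2"."name" AS "name", SUM("t3"."totalprice") AS "total_spend"
  FROM (
    SELECT "custkey", "name", "mktsegment"
    FROM (SELECT * FROM "tpch"."tiny"."customer") AS "t0"
    WHERE "mktsegment" = 'BUILDING'
  ) AS "t2"
  INNER JOIN (
    SELECT "custkey", "totalprice"
    FROM (SELECT * FROM "tpch"."tiny"."orders") AS "t1"
  ) AS "t3"
    ON "t2"."custkey" = "t3"."custkey"
  GROUP BY "t2"."name"
) AS "t4"
ORDER BY "total_spend" DESC;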

Fortunately, Trino has a feature-rich Cost-Based Optimizer (CBO) which optimizes whatever query it receives to determine the best approach to run it quickly & efficiently. Back to our earlier ‘parallel processing’ discussion, this means that we expect the CBO to create the same query plan / DAG regardless of the SQL presented if both are trying to answer the same question.

Dataframe API alternatives

The primary options for programming to a Dataframe API using Python with the goal of running the work in a Trino cluster are PyStarburst and Ibis.

PyStarburst

PyStarburst features a PySpark-like syntax using the same lazy execution model (it waits for an input/output ‘action’ to be called before actually submitting work to the cluster). At that point, the Python code gets converted to SQL so that the heavy lifting is done by Trino.

To learn more about PyStarburst, including seeing some programming examples, check out my PyStarburst tagged content.

Ibis

Ibis provides a common Dataframe API for data manipulation in Python and compiles that API for execution on a variety of compute engines.

With this tooling, Trino is just one of the options of where your Dataframe code could be executed. Check out ibis & trino (dataframe api part deux) for code examples running in Trino.

Comparing Dataframe APIs

The APIs themselves for these two frameworks are similar, but surely not identical, as the following shows.

While the SQL that gets generated by each tool might look different, they are both attempting to solve the same problem so we rely on the CBO to generate the same, or very similar, query plan / DAG.

Orchestration & scheduling

Trino does not provide a native orchestration and/or scheduling subsystem so you still need to provide your own solution to this space. Tools like Airflow, Dagster, and Prefect are popular and can have a mix of SQL, Python, and other tooling steps that need to be executed appropriately and with dependencies.

Summary

Key points

Trino is, and has been, an appropriate clustering technology for running your transformation processing jobs.

  • Fault-tolerant execution mode makes it even more robust
  • Data engineers can leverage SQL and/or Python when constructing their pipelines

Dataframe API recommendations

If you use Starburst products >> Choose PyStarburst

  • More similar to PySpark Dataframe API (easier to port from, or even back to, Spark)
  • More likely to be better optimized for Trino than Ibis, since the development community knows Trino very well

If you are using OSS Trino >> Choose Ibis

  • PyStarburst is not supported here
  • Gain optionality for the backend execution engine

If you could benefit from additional commentary on this topic, please check out my YouTube video below.

yarp: yet another rag post (this time using sql)

Sure, you’ve read a BUNCH of retrieval augmented generation (RAG) blog posts by now, but surely NOT ENOUGH of them focused on using SQL, instead of Python, to get it all working.

If you could use a refresher (or even a first-time explanation) of what RAG is, I hope my video helps.

Starburst’s new AI functions

For this post, I’m going to explore Starburst Enterprise’s new AI functions in the examples below. They are broken up into 3 categories.

  • Task functions such as analyzing sentiment, classification, fixing grammar, masking text, and translating languages. The docs show good examples of these.
  • Embedding functions are all about creating mathematical representations of text that can be clumped together near similar text (see my video above if this is new to you).
  • Prompt functions — well, at this version there really is only one function. It is called prompt() and it does what it sounds like; you can send in an input prompt and get a response back from an LLM (think ChatGPT) with the answer.

I’ll show you how the generate_embedding() function works shortly, but let’s just play with prompt() first. Remember, we are using SQL in this blog post! Let’s give it a try with a very important question.

SELECT starburst.ai.prompt(
  'What underlying technology does Starburst use?',
  'bedrock_claude35');

Yep, that’s right! Oh, did I say I am a Developer Advocate for Trino & Starburst? Well… I did now (disclaimer posted).

You probably noticed that the method I used has a second argument. The prompt() docs show that this is the name of an LLM model. This brings up a super cool feature in Starburst Enterprise: the ability to configure (and manage use of) multiple models. In fact, on-prem and/or cloud-based ones. But… let’s save all of that for another post.

Our use case

Imagine if you will that the cartoon character Fred Flintstone was keeping a personal diary.

If we ask a public LLM about Fred’s interests or experiences that are only in this diary, we will not get an accurate answer. For example, I happen to know Fred, like me, is an active Postcrossing member and sends/receives postcards all over the world.

Let’s see what happens if we ask an LLM if he likes “snailmail”.

SELECT starburst.ai.prompt(
  'Does Fred Flintstone like to send and receive mail?',
  'bedrock_claude35');
remember, results are probabilistic and can be different each time

Let’s see if we can build a data engineering pipeline to prepare Fred’s diary for use in a simple RAG workflow.

Data pipeline

As unstructured docs in ai (the wild west) identifies, ETL pipelines can get rather complicated, rather quickly, depending on the input data. Furthermore, exploring ai data pipelines (hands-on with datavolo) gives you some insight into some of the tooling options you have in this problem domain.

Test data

For our example, we have it pretty easy. Fred uses his stone-aged mobile device (hey, he’s a fictional character so why not give him cellular service and internet access!) to record his journal entries. The good news for this blog post is that we can just pick up at the point where there is an Apache Iceberg table that holds all the entries.

-- create a table for diary entries
CREATE TABLE diary_entries (
    owner varchar,
    entry_date timestamp,
    entry_text varchar
);
show create table diary_entries;

Let’s populate it with some example entries (again, this would have already been done for us — we just need some test data).

-- add initial entries 
INSERT INTO diary_entries (owner, entry_date, entry_text)
VALUES 
  ('Fred Flintstone', date_add('day', -99, current_timestamp), 'It is a new year and today I decided to start keeping a diary. Some may imagine that life is pretty boring here in Bedrock, but my family and friends make sure life is NEVER boring! As today is a Monday, I''m off to work at the quarry. I work there everyday until 5pm. Despite it being physical work, I do enjoy it. I was happy to play with Dino after work.'),
  ('Fred Flintstone', date_add('day', -98, current_timestamp), 'Ok, keeping a diary is something I''ll need to get used to. It seems I missed yesterday, but I''m still here. It is mid-week and I''m finally jotting down my notes this Wednesday evening. Barney and I just got back from our Loyal Order of Water Buffaloes meeting. Barney is running for Vice President this year and I fully support him.'),
  ('Fred Flintstone', date_add('day', -97, current_timestamp), 'Ahh... the weekend is finally here. I''ve been so busy (and well... I guess I simply forgot as well) to enter my thoughts into my daily journal. Off to the drive in movie tonight.'),
  ('Fred Flintstone', date_add('day', -96, current_timestamp), 'The weekends always go so fast. Family & friends continue to be the thing I spend my time on. I''m thinking I need a new hobby, too.'),
  ('Fred Flintstone', date_add('day', -95, current_timestamp), 'I have decided to try sending & receiving postcards as a new hobby. I found this website called Postcrossings that looks pretty interesting. The site says that over 800,000 people enjoy this hobby. Postcrossers, that''s what these hobbyists are called, have sent over 78 million postcards so far.'),
  ('Fred Flintstone', date_add('day', -94, current_timestamp), 'I was able to send 3 postcards today to people the system provided addresses to me for. This really is the hobby for me.'),
  ('Fred Flintstone', date_add('day', -93, current_timestamp), 'Fun Thursday night getting 2 more postcards in the mail.  I am loving this hobby even though Wilma thinks I''m nuts.'),
  ('Fred Flintstone', date_add('day', -92, current_timestamp), 'I went hiking over the weekend for one of my existing hobbies; waterfall viewing. I saw two of the largest ones in Bedrock; Big Falls and Prettybig Falls. I was even able to buy a few postcards at the gift shop.'),
  ('Fred Flintstone', date_add('day', -91, current_timestamp), 'I had so much fun looking at waterfalls that I visited the artificial one we have at the mall; Shopping Shoal. Silly name, but I did find out that shoals were usually rocks in rough water that paddlers enjoy traversing through.'),
  ('Fred Flintstone', date_add('day', -90, current_timestamp), 'This   was   a    second   journal entry for Jan 15. It shows    some   extra   spacing    that can    be cleaned up.'),
  ('Fred Flintstone', date_add('day', -89, current_timestamp), 'Woke up, fell out of bed. Dragged a comb across my head. Found my way downstairs and drank a cup and looking up, I noticed I was late. Found my coat and grabbed my hat. Made the bus in seconds flat. Found my way upstairs and had a smoke and somebody spoke and I went into a dream.'),
  ('Fred Flintstone', date_add('day', -88, current_timestamp), 'Woke up, fell out of bed. Dragged a comb across my head. Found my way downstairs and drank a cup and looking up, I noticed I was late. Found my coat and grabbed my hat. Made the bus in seconds flat. Found my way upstairs and had a smoke and somebody spoke and I went into a dream.'),
  ('Fred Flintstone', date_add('day', -87, current_timestamp), 'These last days have been a complete blur. So busy just life that I haven''t been able to focus on my new hobby of mailing postcards.');
select * from diary_entries;

Transformation logic

In our situation, we do not need to do complex parsing and chunking steps in our ETL pipeline since we already have text for each journal entry stored in a varchar field of an Iceberg table. We still do need to create embeddings from these (naturally created) text chunks.

We need a place to store these embeddings. You could always use a purpose-built vector database like I used in exploring ai data pipelines (hands-on with datavolo). Instead we will store our AI embeddings “lakeside”, meaning as a column in our Iceberg table. Let’s add a column for them to be stored in.

-- prep to store vector embeddings 
--  'lakeside' right in the Iceberg table
ALTER TABLE diary_entries 
  ADD COLUMN entry_text_embeddings array(double);

Now, we can leverage generate_embedding() to create the vector embeddings and then just store them into the new column.

-- create and persist the embeddings
UPDATE diary_entries
   SET entry_text_embeddings = 
         starburst.ai.generate_embedding(
             entry_text, 'bedrock_titan'
         )
 WHERE entry_text_embeddings IS NULL;
select * from diary_entries;

Taking a deeper look into one of the entry_text_embeddings column’s values, you’ll notice this is just a very long array of numbers ranging from -1 to 1. Again, the video at the top of this post attempts to give a very high-level understanding of what these are representing.

For the sake of this post, this array of values from -1 to 1 is a multi-dimensional mathematical representation of the contents in the corresponding entry_text column for each row. These can be used in our RAG workflow later when trying to find chunks of text (journal entries in our use case) that are similar to a request that will be asked about Fred Flintstone.

It is worth noticing 2 cool things we’ve just done.

  • We used SQL to interact with an AI model to generate vector embeddings!
  • We didn’t need a separate vector database to store those embeddings!! We stored them “lakeside”, to use the phrase I hear every day at Starburst.

RAG workflow

Now that we have our data prepared, we can focus on the SQL-based RAG workflow we are interested in.

The “R”

RAG starts with “R” so let’s see if we can use Starburst’s AI functions to retrieve diary entries that align with the initial question we asked in the earlier use case description.

SELECT
  owner,
  entry_date,
  entry_text,
  cosine_similarity(
    starburst.ai.generate_embedding (
      'Does Fred Flintstone like to send and receive mail?',
      'bedrock_titan'
    ),
    entry_text_embeddings
  ) AS similarity_score
FROM
  diary_entries
ORDER BY
  similarity_score DESC
LIMIT
  5;

Notice that the entry_text values below with the highest similarity_score values reference postcards. Our question didn’t mention postcards specifically, but that’s the power of the similarity searching based on embeddings, not just a simple LIKE operator. The cosine_similarity() function already existed in Trino.
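If you want a feel for the function on its own, here is a toy example (its behavior with array(double) arguments is inferred from the usage above):

select cosine_similarity(array[1.0, 2.0, 3.0], array[1.0, 2.0, 3.0]) as identical_vectors, -- 1.0
       cosine_similarity(array[1.0, 0.0], array[0.0, 1.0]) as orthogonal_vectors;          -- 0.0

Scores closer to 1.0 mean the two vectors (and therefore the two pieces of text they represent) are more similar.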

The “AG”

The fuller query below turns the query above into a CTE and uses it to create a second one, json_results, which just packages up the journal entries into a JSON document. Finally, the prompt() function is used after including the similar journal entries retrieved and adding some additional context directing the LLM to use them in the question being asked.

-- Retrieve the top 5 most relevant journal
--  entries based on semantic similarity
WITH vector_search AS(
    SELECT
        owner, entry_date, entry_text,
        cosine_similarity(
            starburst.ai.generate_embedding(
                'Does Fred Flintstone like to send and receive mail?', 
                'bedrock_titan'
            ),
            entry_text_embeddings
        ) AS similarity_score
    FROM diary_entries
    ORDER BY similarity_score DESC
    LIMIT 5
),

-- Augment the results by converting  
--  them into a JSON object
json_results AS (
SELECT CAST(map_agg(to_iso8601(entry_date), json_object(
    key 'journal entry date' VALUE entry_date,
    key 'journal entry text' VALUE entry_text)) AS JSON) AS json_data
FROM
    vector_search
)

-- Generate an augmented response 
--  using the LLM
SELECT
    starburst.ai.prompt(concat(
        'Using the list of journal entries provided in JSON, ',
        'Does Fred Flintstone like to send and receive mail?',
        json_format(json_data)), 'bedrock_claude35')
FROM json_results;

Check out the results this time!

maybe a little overly chatty, but the response is more correct

Another question: How does Fred Flintstone feel about public transportation?

Base LLM response

Enhanced RAG response

Another question: What hobbies or activities is Fred Flintstone interested in?

Base LLM response

Better prompt engineering would merge these answers together in the RAG response.

Enhanced RAG response

Another question: How often does Fred Flintstone go to the movies?

Base LLM response

It is kind of funny that the animated theme song does show him going to the movies!

Enhanced RAG response

It is fair to say that even with this very primitive RAG workflow (again, written using SQL!), the responses are better after being augmented with non-public information.

Video version

As we all learn different ways, I put a quick demo together of the RAG workflow for those who prefer watching videos.

Parting thoughts

Clearly more work would need to go into this exercise to get it to production level quality, but that isn’t the point. The intention of this article was to show that data engineers and data analysts who know SQL already have a chance to do some investigatory GenAI efforts without having to know Python and/or to engage their over-committed data science team.

Furthermore, a SQL interface to the underlying AI tooling offers application developers and AI engineers even more options for when they are assembling production-grade GenAI applications.

To me, that all sounds pretty darn cool. What do you think?

trino query plan analysis (video series)

Just like with every other query engine, you’ve made it to the big leagues when you can actually decipher the query plan (aka explain plan) that gets produced. I’ve been explaining how data lake analytics engines such as Hive, Spark, and Trino break their work down into stages that operate in a massively parallel fashion — and detailing all the fun & expense of the data exchange happening between those stages.

The following high-level (and rather universal) concepts are explained in the 1st of 3 videos.

  • Understand how & why datasets are divided into splits
  • Explain the decomposition of a query into tasks, stages, and exchanges
  • Visualize multi-stage queries that tackle sorting, aggregation, and multiple types of joins
  • See how the conceptual visualization of a parallelized query aligns with Trino’s generated visual plan at runtime

The middle video is focused on the particular syntax Trino’s CBO uses in the textual query plan. This video hits on topics such as the following.

  • Optimizations available with table statistics
  • Generating a textual and a visual query plan (see the sketch after this list)
  • Understanding the format of the query plan output, including fragments, exchanges, distribution, estimates, and performance
  • Performance with, and without, accurate table statistics
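To ground that query plan bullet, the textual plans come straight from SQL; a minimal sketch (any query works here):

explain
select custkey, count(*)
  from tpch.tiny.orders
 group by custkey;

-- execute the query and report actual costs per operation
explain analyze
select custkey, count(*)
  from tpch.tiny.orders
 group by custkey;

The visual plan, on the other hand, comes from the query details page of Trino’s Web UI.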

The final video in the series is the FUN one! I squarely focus on walking through query plans since you have all that background from the earlier videos, with the mindset of spotting potential performance issues and remedies to overcome them. You’ll see first-hand topics such as the ones listed below.

  • Column pruning, predicate pushdown, and partition pruning
  • Trino’s approach to tackle sorting and aggregation across two stages
  • Join optimizations including broadcast, reordering, and dynamic filtering

In addition to the comments sections of the videos above and this blog post’s comments area, feel free to bring your concerns to the Starburst Community Forum.

logo to company match game (data engineering open-source projects)

Just having a little bit of #FridayFun — can you match the open-source data engineering project logos on the left to the company names who are most affiliated with each?




Well… could you do it? If not, or if you want to check your answers with mine…

Scroll way down here to see my answers…

((and let me know in the comments if you think I got anything wrong!))

delta lake time-travel (just reference the version)

This quick blog post is a follow-up to my prior delta lake in starburst galaxy (intro & integration) one. I wanted to capture the additional feature of time travel, which is possible due to how Delta Lake creates new versions each time content is changed. Please check the prior post to see what steps were taken to get us to version 2 of the dune_characters table we will work with below.

Starting point

Let’s make sure you are seeing the same results before we move forward.

SELECT * FROM dune_characters;
SELECT version, timestamp, operation
  FROM "dune_characters$history";

The table_changes function is pretty cool as it can identify the changes that have occurred between two versions. The example below compares the initial version with the most recent version.

SELECT
  *
FROM
  TABLE(
    system.table_changes(
      schema_name => 'dlblog1',
      table_name => 'dune_characters',
      since_version => 0
    )
  );

Make changes

Run some SQL to change the table’s contents.

INSERT INTO dune_characters 
  (id, name, notes)
VALUES
  (104, 'Wellington', 'Physician of Duke');

-- get rid of those with odd numbered id's
DELETE FROM dune_characters
 WHERE mod(id,2) = 1;

Verify the results.

SELECT * FROM dune_characters;

Notice 2 more versions are present (we ran 2 SQL statements).

SELECT version, timestamp, operation
  FROM "dune_characters$history";

Time-travel

Query the version of the table that we started this blog post with.

SELECT * FROM dune_characters 
FOR VERSION AS OF 2;

Yep, that’s right!

Additional features

Unfortunately, as of Trino 471, the Delta Lake connector does not yet allow for time-travel queries to be based on a timestamp as the Spark API does.

Additionally, the Trino connector for Delta Lake does not yet offer a rollback function like the Spark API has.

unstructured docs in ai (the wild west)

To say there is a LOT of confusion around GenAI systems among data engineers is the understatement of the year. Today’s “excitement” centers heavily around “agents” and “agentic workflows”, but most of the current information & projects out there are still heavily oriented on RAG systems. That information is centered on the application side of RAG, not necessarily on the data prep aspects. Those apps heavily rely on good information being stored in vector databases to provide additional context before an LLM query is made.

Note: If some/most/all of that above didn’t make much sense, please check out my understanding rag ai apps (and the pipelines that feed them) post I published last year as the developer advocate of a company who was focusing on the unstructured doc ETL pipeline side of the equation. I hope it will fill in any of the blanks you may have before proceeding with this post.

It’s about the chunking

I am NOT suggesting that constructing high-quality RAG applications is easy — on the contrary, they are incredibly hard to make “good enough”. I am saying that if the “chunks” of text that were created during the ETL pipeline and turned into embeddings (plus stored in a vector database) were garbage, then the answers your LLM will provide will also be garbage. AKA GIGO.

Of course, before you can chunk docs you have to parse them and there are many different approaches to this. Some solid articles are surfacing and it only takes a cursory read through solid ones like Extracting Text and Table Contents from PDF Documents and Stop Copy-Pasting. Turn PDFs into Data in Seconds to point out 1) there are many libraries out there, each with their own strengths & weaknesses, 2) this is not the easiest thing to do for classical DEs who have focused on structured data, and 3) it is easy to come away believing that parsing is the part that matters most.

The output of parsing is actually pretty easy to validate. We either did, or did not, accurately bring the text forward as well as convert the embedded tabular data correctly. Probably the hardest part of this is around images where we are reliant on other models to turn them into accurate enough text. Oh wait, there are many of these, too, each with their sweet spots. Ok, maybe parsing isn’t that easy after all… 😉

This was SUPPOSED to be a horror story about CHUNKING (not to be confused with Chucky who is pretty scary himself). If we address the parsing steps well and are able to validate the parsed document is indeed a good computer-ready representation of the original document as if we consumed it as humans, then chunking comes up next.

As you can see in articles such as 7 Chunking Strategies in RAG You Need to Know and 8 Types of Chunking for RAG Systems, this is NOT a one-solution pony ride. If this is all new to you, I encourage you to just read about & compare simple fixed-size vs semantic chunking to start to understand how this can all go sideways in a hurry.
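Since this blog leans SQL elsewhere, here is a rough sketch of what naive fixed-size chunking amounts to, written as Trino SQL (the parsed_docs table and the 500-character window are made up, and a real pipeline would chunk on tokens and add overlap):

-- one row per 500-character window of each parsed document
select doc_id,
       substr(doc_text, start_pos, 500) as chunk_text
  from parsed_docs
 cross join unnest(sequence(1, greatest(length(doc_text), 1), 500)) as t(start_pos);

Semantic chunking, by contrast, cannot be expressed as a one-liner; it needs the document’s structure (headings, lists, sections) to decide where a chunk should begin and end.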

An example

Assume a part of one of the documents you have parsed has this text inside it.

Our company does not support, in the strongest terms, these behaviors & beliefs.
– Ice cream bans are appropriate at work.
– Employees should feel comfortable returning to work without washing their hands.
– Managers should never allow for any DEI policies to influence their department hiring.
– There is nothing wrong with selling your kid’s fundraisers at the office.

This next table shows how this might become chunked with a fixed-size vs semantic chunking strategy.

fixed-size chunks:

chunk 1:
Our company does not support, in the strongest terms, these behaviors & beliefs.
– Ice cream bans are appropriate at work.
– Employees should feel comfortable returning to work without washing their hands.

chunk 2:
– Managers should never allow for any DEI policies to influence their department hiring.
– There is nothing wrong with selling your kid’s fundraisers at the office.

semantic chunks:

chunk 1:
Our company does not support, in the strongest terms, these behaviors & beliefs.
– Ice cream bans are appropriate at work.
– Employees should feel comfortable returning to work without washing their hands.

chunk 2:
Our company does not support, in the strongest terms, these behaviors & beliefs.
– Managers should never allow for any DEI policies to influence their department hiring.
– There is nothing wrong with selling your kid’s fundraisers at the office.

The second semantic chunk carries forward the semantically-related heading, which means that the eventual retrieval of this text to be used as context in an LLM prompt will have the nuance needed to better prepare it to not misunderstand these positions.

For example, if someone was asking if it is acceptable for employees of your company to push cookies (sign me up for some Thin Mints, please!) on their coworkers, the fixed-size strategy would likely drive the LLM to say, “go for it”, while the semantic chunking gives the LLM more information that hopefully (remember it is probabilistic, not deterministic) lets it identify this is not a supported policy.

Wrap-up

This clearly wasn’t a how-to guide, but I hope the warning to not misunderstand the complexities of even the ETL sub-step of chunking will be heard. With today’s tools & technologies, these data pipelines aren’t simply plug/n/play. They need devoted AI engineers to construct them AND test their outputs via the GenAI applications they are fueling.

From my experience, this will be different for every “type” of document (financial 10K’s for example, not Word vs PDF) – especially in your early projects where you are building the heuristics to use in future efforts. Interestingly enough, if we can capture those heuristics (maybe by recording the starting effort, the iterations, and the final version) they might be the input of a future LLM model that eventually figures out how to make this plug/n/play.

Moral of the story… don’t assume this is simple OR easy and know that testing, testing, and more testing is required (and will need to be re-validated frequently) while we all figure this stuff out. Good luck on your efforts and good luck to us all that we don’t create the idiocracy I’m expecting soon enough. 😉

success (ability, fortitude & luck)

TL;DR — I absolutely agree that you need raw abilities along with the fortitude to stick with it, but I firmly believe that the 3rd leg on the “stool of success” is being fortunate enough to have luck on your side.

The ingredients of success

The parts you can control

I doubt many will disagree with my first two call-outs for what makes someone successful: ability & fortitude.

Sure, we are born with inherent abilities (and intelligence), but we can gain new abilities through effort. Unless you are an old dog, you can be taught new tricks.

I thought long & hard about what word was best suited for this next characteristic. I went with fortitude because strength itself is not enough. We have to be able to dig deep when we need it the most – even when the outcome looks grim.

Let’s not forget the power of luck

Call it luck, fate, fortune, blessings, destiny, karma, or even happenstance, luck is a powerful thing.

We all know incredibly talented people with unstoppable motivation & drive who were thrown a twist in life (positive or negative) that catapulted them to a meteoric rise, or a devastating fall.

Is everyone else wrong?

Some quick googling for the ingredients of success lets us know there are many, many, many thoughts on this topic. Many of them adamantly profess that luck has absolutely nothing to do with it. My $0.02 says they actually haven’t done enough self-reflection to see (and accept) that they had a dose of good luck.

So do I know it all? Of course not, but I’m also not foolish enough to believe that my own trivial successes in life weren’t benefited by a sprinkling of good luck, too.

We could all benefit from a bit more self-reflection and absolutely could be just a bit more understanding for those with a shed-full of bad luck. Y’all be kind out there…