
Sure, you’ve read a BUNCH of retrieval augmented generation (RAG) blog posts by now, but surely NOT ENOUGH of them focused on using SQL, instead of Python, to get it all working.
If you could use a refresher (or even a first-time explanation) of what RAG is, I hope my video helps.
Starburst’s new AI functions
For this post, I’m going to explore Starburst Enterprise’s new AI functions in the examples below. They are broken up into three categories.
- Task functions such as analyzing sentiment, classification, fixing grammar, masking text, and translating languages. The docs show good examples of these.
- Embedding functions are all about creating mathematical representations of text that can be clumped together near similar text (see my video above if this is new to you).
- Prompt functions — well, as of this version there is really only one function. It is called prompt() and it does what it sounds like: you send in an input prompt and get a response back from an LLM (think ChatGPT) with the answer.
I’ll show you how the generate_embedding() function works shortly, but let’s just play with prompt() first. Remember, we are using SQL in this blog post! Let’s give it a try with a very important question.
SELECT starburst.ai.prompt(
    'What underlying technology does Starburst use?',
    'bedrock_claude35');

Yep, that’s right! Oh, did I say I am a Developer Advocate for Trino & Starburst? Well… I did now (disclaimer posted).
You probably noticed that the function I called has a second argument. The prompt() docs show that this is the name of an LLM. This brings up a super cool feature in Starburst Enterprise: the ability to configure (and manage the use of) multiple models, both on-prem and cloud-based ones. But… let’s save all of that for another post.
Our use case
Imagine if you will that the cartoon character Fred Flintstone was keeping a personal diary.

If we ask a public LLM about Fred’s interests or experiences that are only in this diary, we will not get an accurate answer. For example, I happen to know that Fred, like me, is an active Postcrossing member and sends/receives postcards all over the world.
Let’s see what happens if we ask an LLM if he likes “snailmail”.
SELECT starburst.ai.prompt(
    'Does Fred Flintstone like to send and receive mail?',
    'bedrock_claude35');

Let’s see if we can build a data engineering pipeline to prepare Fred’s diary for use in a simple RAG workflow.
Data pipeline
As unstructured docs in ai (the wild west) points out, ETL pipelines can get rather complicated, rather quickly, depending on the input data. Furthermore, exploring ai data pipelines (hands-on with datavolo) gives you some insight into the tooling options you have in this problem domain.
Test data
For our example, we have it pretty easy. Fred uses his stone-age mobile device (hey, he’s a fictional character so why not give him cellular service and internet access!) to record his journal entries. The good news for this blog post is that we can just pick up at the point where there is an Apache Iceberg table that holds all the entries.
-- create a table for diary entries
CREATE TABLE diary_entries (
    owner varchar,
    entry_date timestamp,
    entry_text varchar
);
show create table diary_entries;

Let’s populate it with some example entries (again, this would have already been done for us — we just need some test data).
-- add initial entries
INSERT INTO diary_entries (owner, entry_date, entry_text)
VALUES
('Fred Flintstone', date_add('day', -99, current_timestamp), 'It is a new year and today I decided to start keeping a diary. Some may imagine that life is pretty boring here in Bedrock, but my family and friends make sure life is NEVER boring! As today is a Monday, I''m off to work at the quarry. I work there everyday until 5pm. Despite it being physical work, I do enjoy it. I was happy to play with Dino after work.'),
('Fred Flintstone', date_add('day', -98, current_timestamp), 'Ok, keeping a diary is something I''ll need to get used to. It seems I missed yesterday, but I''m still here. It is mid-week and I''m finally jotting down my notes this Wednesday evening. Barney and I just got back from our Loyal Order of Water Buffaloes meeting. Barney is running for Vice President this year and I fully support him.'),
('Fred Flintstone', date_add('day', -97, current_timestamp), 'Ahh... the weekend is finally here. I''ve been so busy (and well... I guess I simply forgot as well) to enter my thoughts into my daily journal. Off to the drive in movie tonight.'),
('Fred Flintstone', date_add('day', -96, current_timestamp), 'The weekends always go so fast. Family & friends continue to be the thing I spend my time on. I''m thinking I need a new hobby, too.'),
('Fred Flintstone', date_add('day', -95, current_timestamp), 'I have decided to try sending & receiving postcards as a new hobby. I found this website called Postcrossings that looks pretty interesting. The site says that over 800,000 people enjoy this hobby. Postcrossers, that''s what these hobbyists are called, have sent over 78 million postcards so far.'),
('Fred Flintstone', date_add('day', -94, current_timestamp), 'I was able to send 3 postcards today to people the system provided addresses to me for. This really is the hobby for me.'),
('Fred Flintstone', date_add('day', -93, current_timestamp), 'Fun Thursday night getting 2 more postcards in the mail. I am loving this hobby even though Wilma thinks I''m nuts.'),
('Fred Flintstone', date_add('day', -92, current_timestamp), 'I went hiking over the weekend for one of my existing hobbies; waterfall viewing. I saw two of the largest ones in Bedrock; Big Falls and Prettybig Falls. I was even able to buy a few postcards at the gift shop.'),
('Fred Flintstone', date_add('day', -91, current_timestamp), 'I had so much fun looking at waterfalls that I visited the artificial one we have at the mall; Shopping Shoal. Silly name, but I did find out that shoals were usually rocks in rough water that paddlers enjoy traversing through.'),
('Fred Flintstone', date_add('day', -90, current_timestamp), 'This was a second journal entry for Jan 15. It shows some extra spacing that can be cleaned up.'),
('Fred Flintstone', date_add('day', -89, current_timestamp), 'Woke up, fell out of bed. Dragged a comb across my head. Found my way downstairs and drank a cup and looking up, I noticed I was late. Found my coat and grabbed my hat. Made the bus in seconds flat. Found my way upstairs and had a smoke and somebody spoke and I went into a dream.'),
('Fred Flintstone', date_add('day', -88, current_timestamp), 'Woke up, fell out of bed. Dragged a comb across my head. Found my way downstairs and drank a cup and looking up, I noticed I was late. Found my coat and grabbed my hat. Made the bus in seconds flat. Found my way upstairs and had a smoke and somebody spoke and I went into a dream.'),
('Fred Flintstone', date_add('day', -87, current_timestamp), 'These last days have been a complete blur. So busy just life that I haven''t been able to focus on my new hobby of mailing postcards.');
select * from diary_entries;

Transformation logic
In our situation, we do not need complex parsing and chunking steps in our ETL pipeline since we already have the text for each journal entry stored in a varchar column of an Iceberg table. We still need to create embeddings from these (naturally created) text chunks.
We need a place to store those embeddings. You could always use a purpose-built vector database like I used in exploring ai data pipelines (hands-on with datavolo). Instead, we will store our AI embeddings “lakeside”, meaning as a column in our Iceberg table. Let’s add a column for them.
-- prep to store vector embeddings
-- 'lakeside' right in the Iceberg table
ALTER TABLE diary_entries
ADD COLUMN entry_text_embeddings array(double);
Now, we can leverage generate_embedding() to create the vector embeddings and then just store them in the new column.
-- create and persist the embeddings
UPDATE diary_entries
SET entry_text_embeddings =
starburst.ai.generate_embedding(
entry_text, 'bedrock_titan'
)
WHERE entry_text_embeddings IS NULL;
select * from diary_entries;

Taking a deeper look into one of the entry_text_embeddings column’s values, you’ll notice this is just a very long array of numbers ranging from -1 to 1. Again, the video at the top of this post attempts to give a very high-level understanding of what these are representing.

For the sake of this post, this array of values from -1 to 1 is a multi-dimensional mathematical representation of the contents of the corresponding entry_text column for each row. These can be used in our RAG workflow later when trying to find chunks of text (journal entries in our use case) that are similar to a question being asked about Fred Flintstone.
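Although this post is deliberately all SQL, a tiny Python sketch may help build intuition for how these embedding arrays get compared. This is not Starburst code; the three-element vectors below are made-up toy values (real embeddings have hundreds or thousands of dimensions), and the measure computed is the same cosine similarity we will lean on in the RAG workflow shortly:

```python
import math

def cosine_similarity(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|), ranging from -1 to 1
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" -- invented values for illustration only
postcard_entry = [0.9, 0.1, 0.2]
hiking_entry = [0.1, 0.9, 0.3]
mail_question = [0.8, 0.2, 0.1]

# Text about similar topics lands near 1; unrelated text scores much lower
print(cosine_similarity(mail_question, postcard_entry))  # higher
print(cosine_similarity(mail_question, hiking_entry))    # lower
```

Notice that an entry about postcards ends up closer to a question about mail than an entry about hiking does, even when the texts share no keywords.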
It is worth noting two cool things we’ve just done.
- We used SQL to interact with an AI model to generate vector embeddings!
- We didn’t need a separate vector database to store those embeddings!! We stored them “lakeside” to use the phrase I hear every day at Starburst.
RAG workflow
Now that we have our data prepared, we can focus on the SQL-based RAG workflow we are interested in.
The “R”
RAG starts with “R” so let’s see if we can use Starburst’s AI functions to retrieve diary entries that align with the initial question we asked in the earlier use case description.
SELECT
    owner,
    entry_date,
    entry_text,
    cosine_similarity(
        starburst.ai.generate_embedding(
            'Does Fred Flintstone like to send and receive mail?',
            'bedrock_titan'
        ),
        entry_text_embeddings
    ) AS similarity_score
FROM diary_entries
ORDER BY similarity_score DESC
LIMIT 5;
Notice that the entry_text values below with the highest similarity_score values reference postcards. Our question didn’t mention postcards specifically, but that’s the power of similarity searching based on embeddings rather than a simple LIKE operator. The cosine_similarity() function already existed in Trino.
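Conceptually, the retrieval query above boils down to the following Python sketch. Again, this is not Starburst code; the two-element embedding vectors are invented toy values standing in for what generate_embedding() would return:

```python
import math

def cosine_similarity(a, b):
    # same measure as Trino's cosine_similarity() function
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# (entry_text, entry_text_embeddings) pairs -- toy values for illustration
entries = [
    ("Sent 3 postcards today", [0.9, 0.1]),
    ("Went hiking to see waterfalls", [0.1, 0.9]),
    ("Got 2 more postcards in the mail", [0.8, 0.3]),
]
question_embedding = [0.9, 0.2]  # "Does Fred like to send and receive mail?"

# ORDER BY similarity_score DESC LIMIT 2, expressed in Python
top = sorted(
    entries,
    key=lambda e: cosine_similarity(question_embedding, e[1]),
    reverse=True,
)[:2]
print([text for text, _ in top])
```

The two postcard entries win the ranking, just like the SQL version’s top results.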

The “AG”
The fuller query below wraps the query above in a CTE and uses it to build a second one, json_results, which packages the retrieved journal entries into a JSON document. Finally, the prompt() function is called with those similar journal entries plus some additional context directing the LLM to use them when answering the question.
-- Retrieve the top 5 most relevant journal
-- entries based on semantic similarity
WITH vector_search AS (
    SELECT
        owner, entry_date, entry_text,
        cosine_similarity(
            starburst.ai.generate_embedding(
                'Does Fred Flintstone like to send and receive mail?',
                'bedrock_titan'
            ),
            entry_text_embeddings
        ) AS similarity_score
    FROM diary_entries
    ORDER BY similarity_score DESC
    LIMIT 5
),
-- Augment the results by converting
-- them into a JSON object
json_results AS (
    SELECT CAST(map_agg(to_iso8601(entry_date), json_object(
        KEY 'journal entry date' VALUE entry_date,
        KEY 'journal entry text' VALUE entry_text)) AS JSON) AS json_data
    FROM vector_search
)
-- Generate an augmented response
-- using the LLM
SELECT
    starburst.ai.prompt(concat(
        'Using the list of journal entries provided in JSON, ',
        'Does Fred Flintstone like to send and receive mail?',
        json_format(json_data)), 'bedrock_claude35')
FROM json_results;
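If it helps to see the augmentation step outside of SQL, here is a rough Python equivalent. The retrieved_entries list stands in for the output of the vector search, and send_to_llm is a hypothetical placeholder for the prompt() call:

```python
import json

# Pretend these are the top entries returned by the vector search
# (abbreviated, made-up stand-ins for Fred's actual diary rows)
retrieved_entries = [
    {"journal entry date": "2025-01-10",
     "journal entry text": "I was able to send 3 postcards today."},
    {"journal entry date": "2025-01-11",
     "journal entry text": "Fun Thursday night getting 2 more postcards in the mail."},
]

question = "Does Fred Flintstone like to send and receive mail?"

# Mirrors the concat(...) in the SQL above: instructions + question + JSON context
augmented_prompt = (
    "Using the list of journal entries provided in JSON, "
    + question
    + json.dumps(retrieved_entries)
)
print(augmented_prompt)
# send_to_llm(augmented_prompt)  # hypothetical stand-in for starburst.ai.prompt()
```

The LLM now answers the same question, but with Fred’s private diary entries sitting right inside the prompt.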
Check out the results this time!

Another question: How does Fred Flintstone feel about public transportation?
Base LLM response

Enhanced RAG response

Another question: What hobbies or activities is Fred Flintstone interested in?
Base LLM response

Better prompt engineering would merge these answers together in the RAG response.
Enhanced RAG response

Another question: How often does Fred Flintstone go to the movies?
Base LLM response

It is kind of funny that the animated theme song does show him going to the movies!
Enhanced RAG response

It is fair to say that even with this very primitive RAG workflow (again, written using SQL!), the responses are better after being augmented with non-public information.
Video version
As we all learn in different ways, I put together a quick demo of the RAG workflow for those who prefer watching videos.
Parting thoughts
Clearly more work would need to go into this exercise to get it to production-level quality, but that isn’t the point. The intention of this article was to show that data engineers and data analysts who know SQL can already do some investigatory GenAI work without having to know Python and/or engage their over-committed data science team.
Furthermore, a SQL interface to the underlying AI tooling offers application developers and AI engineers even more options for when they are assembling production-grade GenAI applications.
To me, that all sounds pretty darn cool. What do you think?