
To say there is a LOT of confusion around GenAI systems among data engineers is the understatement of the year. Today’s “excitement” centers heavily on “agents” and “agentic workflows”, but most of the current information & projects out there are still heavily oriented toward RAG systems. That information centers on the application side of RAG, not necessarily on the data prep aspects. Those apps rely heavily on good information being stored in vector databases to provide additional context before an LLM query is made.
Note: If some/most/all of the above didn’t make much sense, please check out my understanding rag ai apps (and the pipelines that feed them) post, which I published last year as the developer advocate of a company that was focusing on the unstructured doc ETL pipeline side of the equation. I hope it will fill in any blanks you may have before proceeding with this post.
It’s about the chunking
I am NOT suggesting that constructing high-quality RAG applications is easy — on the contrary, they are incredibly hard to make “good enough”. I am saying that if the “chunks” of text that were created during the ETL pipeline, turned into embeddings, and stored in a vector database were garbage, then the answers your LLM provides will also be garbage. AKA garbage in, garbage out (GIGO).

Of course, before you can chunk docs you have to parse them, and there are many different approaches to this. Some solid articles are surfacing, and it only takes a cursory read through ones like Extracting Text and Table Contents from PDF Documents and Stop Copy-Pasting. Turn PDFs into Data in Seconds to see that 1) there are many libraries out there, each with their own strengths & weaknesses, 2) this is not the easiest thing to do for classical DEs who have focused on structured data, and 3) somehow we come away believing that parsing is the part that matters most.
The output of parsing is actually pretty easy to validate. We either did, or did not, accurately bring the text forward and convert the embedded tabular data correctly. Probably the hardest part is images, where we rely on other models to turn them into accurate-enough text. Oh wait, there are many of these, too, each with their sweet spots. Ok, maybe parsing isn’t that easy after all… 😉
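As a sketch of what that validation might look like, one crude approach is to diff the parsed output against a hand-verified excerpt of the source and flag anything below a similarity threshold. The strings and threshold below are illustrative assumptions, not a production recipe:

```python
# Crude parse-validation sketch using only the standard library.
import difflib

def parse_fidelity(parsed: str, ground_truth: str) -> float:
    """Return a 0..1 similarity ratio between parser output and a verified excerpt."""
    return difflib.SequenceMatcher(None, parsed, ground_truth).ratio()

truth = "Our company does not support these behaviors."
parsed = "Our company does not support these behaviors ."  # stray space from the parser
score = parse_fidelity(parsed, truth)
assert score > 0.95  # illustrative threshold; tune per document type
```

In practice you would sample a handful of pages per document type, hand-verify them once, and run a check like this on every pipeline change.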
This was SUPPOSED to be a horror story about CHUNKING (not to be confused with Chucky, who is pretty scary himself). If we address the parsing steps well and can validate that the parsed document is indeed a good computer-ready representation of the original document as a human would consume it, then chunking comes next.
As you can see in articles such as 7 Chunking Strategies in RAG You Need to Know and 8 Types of Chunking for RAG Systems, this is NOT a one-solution pony ride. If this is all new to you, I encourage you to just read about & compare simple fixed-size vs semantic chunking to start to understand how this can all go sideways in a hurry.
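To make the contrast concrete, fixed-size chunking is little more than sliding a character window across the text. Here is a minimal sketch in plain Python; the size and overlap values are illustrative, and real systems usually count tokens rather than characters:

```python
# Minimal fixed-size chunking with overlap (character-based for simplicity).
def fixed_size_chunks(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Slice text into fixed windows, sliding forward by (size - overlap)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

policy = ("Our company does not support, in the strongest terms, these "
          "behaviors & beliefs. - Ice cream bans are appropriate at work. "
          "- Employees should feel comfortable returning to work without "
          "washing their hands.")
for chunk in fixed_size_chunks(policy, size=100, overlap=20):
    print(repr(chunk))
```

Note how the window boundaries ignore sentence and section structure entirely, which is exactly how a heading gets separated from the bullets it governs.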
An example
Assume a part of one of the documents you have parsed has this text inside it.
> Our company does not support, in the strongest terms, these behaviors & beliefs.
> – Ice cream bans are appropriate at work.
> – Employees should feel comfortable returning to work without washing their hands.
> – Managers should never allow for any DEI policies to influence their department hiring.
> – There is nothing wrong with selling your kid’s fundraisers at the office.
This next table shows how this might become chunked with a fixed-size vs semantic chunking strategy.
| fixed-size chunks | semantic chunks |
| --- | --- |
| Our company does not support, in the strongest terms, these behaviors & beliefs. – Ice cream bans are appropriate at work. – Employees should feel comfortable returning to work without washing their hands. | Our company does not support, in the strongest terms, these behaviors & beliefs. – Ice cream bans are appropriate at work. – Employees should feel comfortable returning to work without washing their hands. |
| – Managers should never allow for any DEI policies to influence their department hiring. – There is nothing wrong with selling your kid’s fundraisers at the office. | Our company does not support, in the strongest terms, these behaviors & beliefs. – Managers should never allow for any DEI policies to influence their department hiring. – There is nothing wrong with selling your kid’s fundraisers at the office. |
The second semantic chunk carries forward the semantically related heading, which means that when this text is eventually retrieved as context for an LLM prompt, the model has the nuance it needs to avoid misunderstanding these positions.
For example, if someone were asking whether it is acceptable for employees of your company to push cookies (sign me up for some Thin Mints, please!) on their coworkers, the fixed-size strategy would likely lead the LLM to say, “go for it”, while semantic chunking gives the LLM the extra context that hopefully (remember, it is probabilistic, not deterministic) lets it recognize this is not a supported policy.
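One way to approximate that semantic behavior is a structure-aware heuristic that keeps the governing statement attached to every group of bullets it covers. This is a toy stand-in (real semantic chunkers typically rely on embeddings or document structure from the parser), but it shows why the second column of the table above never loses its context:

```python
# Toy heading-propagating chunker: the negating policy statement is
# prepended to every chunk so no bullet group is retrieved without it.
def chunk_with_heading(heading: str, bullets: list[str], per_chunk: int = 2) -> list[str]:
    """Group bullets and prepend the governing heading to every chunk."""
    return [
        heading + " " + " ".join(bullets[i:i + per_chunk])
        for i in range(0, len(bullets), per_chunk)
    ]

heading = "Our company does not support, in the strongest terms, these behaviors & beliefs."
bullets = [
    "- Ice cream bans are appropriate at work.",
    "- Employees should feel comfortable returning to work without washing their hands.",
    "- Managers should never allow for any DEI policies to influence their department hiring.",
    "- There is nothing wrong with selling your kid's fundraisers at the office.",
]
for chunk in chunk_with_heading(heading, bullets):
    print(chunk)
```

A retriever that surfaces either chunk now hands the LLM the negation along with the behavior in question.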
Wrap-up
This clearly wasn’t a how-to guide, but I hope the warning not to underestimate the complexities of even the chunking sub-step of ETL will be heard. With today’s tools & technologies, these data pipelines aren’t simply plug/n/play. They need devoted AI engineers to construct them AND to test their outputs via the GenAI applications they are fueling.
From my experience, this will be different for every “type” of document (a content type like financial 10-Ks, for example, not a file format like Word vs PDF), especially in your early projects where you are building the heuristics to use in future efforts. Interestingly enough, if we can capture those heuristics (maybe by recording the starting effort, the iterations, and the final version), they might be the input to a future LLM that eventually figures out how to make this plug/n/play.
Moral of the story… don’t assume this is simple OR easy and know that testing, testing, and more testing is required (and will need to be re-validated frequently) while we all figure this stuff out. Good luck on your efforts and good luck to us all that we don’t create the idiocracy I’m expecting soon enough. 😉