Building a RAG-Based SQL Agent on Messy Enterprise Data
2025 - present
NLP, cloud, enterprise, LLM, agentic AI
Mature enterprise data is messy. Architected before the AI era, often by multiple architects over many years, it takes substantial effort to produce clean, LLM-ready data, especially without breaking existing pipelines. Beyond this, while LLMs are powerful at general tasks, they often stumble over internal jargon, vocabulary, and definitions. In this project, I propose a solution to both of these issues.
Here's the problem I had
Enterprise data is messy and domain-specific. Without proper context, LLMs will hallucinate definitions.
Here's the obvious solution I considered
Building a business glossary and semantic layer by hand
In breaking down this issue, I defined two sub-issues that guided the planning. The first is response usefulness: can the agent understand what you are asking? Does it understand your internal business jargon and definitions? The second is response accuracy, specifically of the returned data: even if the agent understands what you are asking, can it actually find the correct data to answer your query?
There are several things required to give an AI agent proper context about your data and its role. In general, the idea is to give the agent a kind of knowledge bank to draw from when providing answers, almost like an employee handbook. It needs to know all of the jargon the organization's members have come to be familiar with: jargon that isn't readily available on the internet and that is crucial for providing useful answers.
My first approach was to build this by hand. I made a table in our database containing business terms, their definitions, synonyms, and some other metadata, and appended the table's contents to the bottom of each user prompt. This did make the agent's answers more useful, but the manual approach was highly error-prone and risked introducing bias.
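A minimal sketch of that hand-built approach: glossary rows (here plain dicts standing in for the database table; all terms and field names are illustrative, not the actual schema) serialized and appended to the bottom of the user prompt.

```python
# Hypothetical glossary rows standing in for the hand-built database table.
GLOSSARY = [
    {"term": "ARR",
     "definition": "Annual recurring revenue, summed over active contracts.",
     "synonyms": ["annual recurring revenue"]},
    {"term": "logo churn",
     "definition": "Count of customers lost in a period, regardless of contract size.",
     "synonyms": ["customer churn"]},
]

def build_prompt(user_question: str) -> str:
    """Append the full glossary to the bottom of the user prompt."""
    lines = [
        f"- {g['term']}: {g['definition']} (synonyms: {', '.join(g['synonyms'])})"
        for g in GLOSSARY
    ]
    return (
        f"{user_question}\n\n"
        "Business glossary (use these definitions, not general knowledge):\n"
        + "\n".join(lines)
    )

print(build_prompt("What was our ARR last quarter?"))
```

Note the scaling problem this sketch makes obvious: every term goes into every prompt, whether relevant or not, which is part of why the manual approach broke down.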
Next, before writing any SQL, the agent needs more context about the data itself. It needs to know the ins and outs of the data: the required joins, the specific table names, and the actual formulas used to calculate pertinent values. The agent also needs some examples of queries against your data, a prompting technique known as few-shot prompting. This helps steer the agent's output toward more correct results.
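Few-shot prompting for SQL generation can be sketched like this: a handful of validated question-to-SQL pairs are prepended so the model imitates the house schema and join conventions. The example questions, queries, and table names below are hypothetical.

```python
# Hypothetical validated (question, SQL) pairs; table names are illustrative.
FEW_SHOT_EXAMPLES = [
    ("How many active customers do we have?",
     "SELECT COUNT(*) FROM dim_customer WHERE status = 'ACTIVE';"),
    ("What was total ARR in 2024?",
     "SELECT SUM(arr) FROM fct_contract WHERE year = 2024 AND is_active;"),
]

def few_shot_prompt(question: str) -> str:
    """Assemble a prompt that ends where the model should write SQL."""
    parts = ["You translate business questions into SQL for our warehouse.\n"]
    for q, sql in FEW_SHOT_EXAMPLES:
        parts.append(f"Q: {q}\nSQL: {sql}\n")
    parts.append(f"Q: {question}\nSQL:")
    return "\n".join(parts)

print(few_shot_prompt("How many customers churned last month?"))
```

The prompt deliberately ends at `SQL:` so the model's completion is the query itself, which keeps parsing of the response trivial.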
Here's why I did it differently
Manually building a custom business glossary is a lot of work. Gaps in personal knowledge become gaps in the LLM's knowledge. Traditional semantic layering doesn't provide enough signal to ward off hallucinations. Data discovery must also be documented and maintained for future sources.
Here's what I built
I scraped our account_usage.query_history and parsed it into a knowledge base (business glossary, target table definitions, validated queries), embedded it, and provided it as retrieval context to the LLM
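The retrieval step above can be sketched as follows. Knowledge-base documents are embedded and the closest matches to the user's question are pulled into the prompt. A bag-of-words cosine similarity stands in for a real embedding model here, and the documents are hypothetical; in practice you would swap in an actual embedding API and a vector store.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge-base entries: glossary lines, schema notes,
# and validated queries mined from query history.
KNOWLEDGE_BASE = [
    "ARR: annual recurring revenue, summed over active contracts in fct_contract.",
    "dim_customer joins to fct_contract on customer_id.",
    "Validated query: SELECT SUM(arr) FROM fct_contract WHERE is_active;",
]

def embed(text: str) -> Counter:
    """Toy embedding: a bag of lowercase word tokens (stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k knowledge-base documents most similar to the question."""
    q = embed(question)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

print(retrieve("What is our total ARR?"))
```

Only the retrieved snippets go into the prompt, which is what lets the knowledge base grow without the every-term-in-every-prompt problem of the hand-built glossary.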