Fine-tuning is not the answer to a bad retrieval system.
When a RAG system gives wrong answers, the instinct is to reach for the model. Almost every time, the problem is somewhere upstream — in how the documents were chunked, how the query was routed, or what the retrieval layer was actually asked to find.
ML Engineering · November 2025 · 13 min read
The failure mode arrives quietly. A RAG system goes live. The early demos looked good — clean, confident answers with cited sources. Two months in, the support tickets start. Users report that the system is getting things wrong. Not obviously wrong — the answers are fluent and well-structured and they cite a real document. They are just not quite right. Missing a clause. Conflating two policies. Giving the procedure for the wrong machine model.
The team looks at the model. They try a larger one, or a fine-tuned variant, or a different prompt template that adds more instruction about accuracy. Some of these changes move the numbers slightly. None of them fix the problem sustainably, because the problem was never in the model.
It was in the retrieval layer — specifically, in what got passed to the model in the first place. A model can only generate from what it receives. If the context it receives contains the wrong chunk, a truncated procedure, or a fragment that happens to be semantically close to the query but is missing the critical clause, the model will produce a confident, fluent, wrong answer. It will do so regardless of its size, its fine-tuning history, or how carefully you wrote the system prompt.
Fine-tuning a model on top of broken retrieval does not fix the retrieval. It teaches the model to be more articulate about the wrong information.
Why retrieval fails silently
The most dangerous property of retrieval failure is that it looks like success. When a dense retrieval system returns the wrong chunk, it does not return an error. It returns a similarity score — typically somewhere between 0.7 and 0.9 — that the generation layer treats as a confidence signal. A score of 0.84 means "this is probably relevant." It does not mean "this is the correct chunk for this query." The generation layer does not know the difference. It receives context and generates from it.
This is what makes retrieval failures so insidious in production. A system that errors out visibly can be debugged. A system that returns fluent, cited, wrong answers erodes trust gradually — users stop relying on it not because they saw an obvious failure, but because they stopped trusting the answers after two or three quiet misses they only noticed later.
A similarity score near 0.84 is not a confidence signal. It is a proximity signal. The two are not the same thing, and the generation layer treats them as if they are.
The research is unambiguous on the scale of this problem. Dense retrievers underperform BM25 on out-of-domain corpora by 11.7 NDCG points on average across the BEIR benchmark. Single-vector retrieval surfaces all the evidence needed to answer a multi-hop question only 44% of the time on HotpotQA. These are not edge cases — they are the normal operating conditions of most enterprise RAG deployments, which run on heterogeneous corpora with specialised terminology that the embedding model was not trained on, and where a significant share of real user queries require evidence from more than one document.
A model fine-tuned under these conditions learns one thing: how to produce fluent outputs when the context is incomplete. That is not a capability. It is a liability.
Where retrieval actually breaks — and what to look at first
Retrieval failures in production RAG systems are not random. They cluster around a small set of structural causes, and each cause has a diagnostic signature that is visible if you look at the right layer. The instinct to jump to fine-tuning usually happens because teams are looking at the model's output — the thing that is wrong — rather than the retrieval layer that produced the input the model was given.
The causes below are ordered by how frequently they appear in the systems I have worked on, with the most common last. They are also ordered, roughly, by how easily they are fixed. Check the last one on the list first: it is both the most common root cause and the one most often overlooked.
Where RAG systems actually fail — causes, signatures, and what they are not
Chunking strategy mismatched to document structure
Uniform token splitting is the default in most RAG implementations. It is also wrong for most enterprise document types. A policy document split on 512-token boundaries will routinely cut a numbered clause in half — the first chunk contains the condition, the second contains the consequence. Neither chunk retrieves well for a query about the rule, because neither chunk contains the complete logical unit. The retrieval system returns the best available fragment. The model generates from it. The answer is incomplete. This is not a model problem. It is a chunking problem, and the fix is a content-aware chunking strategy: section-aware splits for policy documents, step-preserving splits for procedures, QA-pair preservation for knowledge base articles. The chunk is the unit of retrieval. Its boundaries must respect the unit of meaning in the source document.
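As a minimal sketch of the section-aware split described above, the following assumes clauses are introduced by a number at the start of a line ("4.1", "4.2", "10.3.1"). The function name `split_on_clauses` and the numbering pattern are illustrative assumptions, not a general-purpose parser; real policy documents will need a pattern tuned to their own layout.

```python
import re
from typing import List

# Assumed convention: a clause starts on a line beginning with a number
# such as "4", "4.2" or "10.3.1", followed by whitespace.
CLAUSE_START = re.compile(r"^\d+(?:\.\d+)*\.?\s+", re.MULTILINE)


def split_on_clauses(text: str) -> List[str]:
    """Split a document at numbered-clause boundaries, keeping each clause whole."""
    starts = [m.start() for m in CLAUSE_START.finditer(text)]
    if not starts:
        # No recognisable clause structure: return the text as one chunk.
        return [text.strip()] if text.strip() else []

    chunks: List[str] = []
    preamble = text[: starts[0]].strip()
    if preamble:
        chunks.append(preamble)  # title or intro text before the first clause

    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        chunks.append(text[begin:end].strip())
    return chunks


policy = (
    "Vendor SLA\n"
    "4.1 If uptime falls below 99.5% in a calendar month,\n"
    "the vendor must issue a service credit.\n"
    "4.2 Credits are capped at 10% of monthly fees.\n"
)
for chunk in split_on_clauses(policy):
    print(repr(chunk))
```

The condition and its consequence in clause 4.1 stay in the same chunk, which a fixed 512-token boundary cannot guarantee. A body line that happens to begin with a bare number would be misread as a clause start, and very long clauses would still need a secondary split; both are left out of this sketch.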
Dense-only retrieval on entity-heavy queries
Dense retrievers are trained to capture semantic similarity. They are not trained to capture exact term overlap. When a user asks about a specific contract clause number, a machine model identifier, or a named regulation ("what does section 4.2 of the vendor SLA require?"), dense retrieval looks for semantically similar content rather than exact matches. It finds something close. Close is wrong. BM25, which scores documents on the literal query terms they contain (term frequency, weighted by how rare each term is across the corpus), handles exact entity queries precisely and quickly. A retrieval system that routes all queries through a single dense encoder is systematically failing on a class of queries that constitutes a significant fraction of real enterprise traffic. The fix is a hybrid retrieval layer: dense for semantic queries, BM25 for entity and exact-term queries, with a weighted fusion that lets each strategy contribute where it is strong.
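One way to sketch the weighted fusion step is shown below. It assumes each retriever has already returned a mapping from chunk ID to raw score; the helper names `min_max` and `fuse`, the chunk IDs, and the default weight are all hypothetical. Min-max normalisation is used here only because cosine similarities and BM25 scores live on different scales; it is one common choice, not the only one.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1] so the two lists are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    dense_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalised dense and BM25 scores, best first.

    A chunk missing from one retriever's results contributes 0 from that side.
    """
    d, s = min_max(dense), min_max(sparse)
    fused = {
        doc: dense_weight * d.get(doc, 0.0) + (1.0 - dense_weight) * s.get(doc, 0.0)
        for doc in set(d) | set(s)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative scores for "what does section 4.2 of the vendor SLA require?"
dense_scores = {"sla_4_1": 0.86, "sla_4_2": 0.84, "msa_2": 0.80}
bm25_scores = {"sla_4_2": 12.0, "msa_9": 3.0}

print(fuse(dense_scores, bm25_scores)[0][0])
```

In this made-up example the dense retriever alone ranks clause 4.1 first, while the exact-term match from BM25 pulls clause 4.2 to the top once the two are fused. The weight would normally be tuned on a labelled set of real queries.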
Single-vector retrieval on multi-hop questions
Some questions require evidence from more than one document. "What are the approval thresholds for vendor contracts that involve data processing?" requires the procurement policy, the data governance framework, and possibly the legal addendum that overrides both. No single chunk contains that answer. A single query vector finds the most similar chunk — probably the procurement policy — and the model generates an answer that omits the data governance requirements entirely. The answer sounds complete. It is not. The fix is query decomposition: breaking the multi-hop question into atomic sub-queries, running retrieval independently on each, and fusing the candidate sets before generation. This is architecturally more complex than single-vector retrieval. It is also the only approach that reliably handles multi-hop queries, which constitute roughly 40% of enterprise knowledge-base traffic.
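The retrieve-per-sub-query-then-fuse step can be sketched as follows. The decomposition itself is assumed to have happened already (by rules or by a model), so the function takes the sub-queries as input. The `retrieve` callable is a hypothetical stand-in for whatever retriever the system uses, and Reciprocal Rank Fusion is swapped in as the fusion method because it needs only ranks, not comparable scores; the article does not prescribe a particular fusion scheme.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def retrieve_multi_hop(
    sub_queries: List[str],
    retrieve: Callable[[str], List[str]],
    top_k: int = 5,
    rrf_k: int = 60,
) -> List[str]:
    """Run retrieval once per sub-query and fuse the ranked lists with RRF.

    `retrieve` returns chunk IDs ordered best first. A chunk that appears
    for several sub-queries accumulates score from each appearance.
    """
    scores: Dict[str, float] = defaultdict(float)
    for query in sub_queries:
        for rank, chunk_id in enumerate(retrieve(query), start=1):
            scores[chunk_id] += 1.0 / (rrf_k + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]


# Hypothetical canned results standing in for a real index.
fake_index = {
    "approval thresholds for vendor contracts": ["procurement_3", "procurement_4"],
    "data processing requirements for vendors": ["data_gov_7", "procurement_3"],
    "legal addendum overriding procurement policy": ["legal_add_1"],
}

context_ids = retrieve_multi_hop(list(fake_index), lambda q: fake_index[q], top_k=3)
print(context_ids)
```

The fused candidate set covers the procurement policy, the data governance framework, and the legal addendum, where a single query vector would most likely have returned procurement chunks only.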
Stale or missing metadata on indexed documents
When a policy is updated, the old version frequently remains in the index alongside the new one. Without date metadata, version tags, or supersession signals on the chunks, the retrieval system has no way to prefer the current version. Both chunks compete on semantic similarity alone, and the older version wins whenever its wording happens to sit closer to the query, because nothing in a similarity score reflects which version is in force. The model generates from the outdated policy. The answer was sourced from a real document. It was the wrong document. The fix is not a model fix. It is an indexing pipeline fix: mandatory date metadata on every chunk, version-aware retrieval that boosts recency, and an ingestion workflow that marks superseded documents as deprecated rather than simply adding the new version alongside them.
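A minimal sketch of version-aware reranking, assuming the ingestion pipeline has already attached an effective date and a supersession marker to each chunk. The `Chunk` fields, the function `prefer_current`, and the boost parameters are hypothetical; the decay shape and weight are placeholders that would need tuning against real traffic.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Chunk:
    chunk_id: str
    similarity: float
    effective_date: date
    superseded_by: Optional[str] = None  # set at ingestion when a newer version lands


def prefer_current(
    chunks: List[Chunk],
    today: date,
    recency_weight: float = 0.05,
    half_life_days: int = 365,
) -> List[Chunk]:
    """Drop superseded chunks, then add a small recency boost to similarity."""
    live = [c for c in chunks if c.superseded_by is None]

    def score(c: Chunk) -> float:
        age_days = max((today - c.effective_date).days, 0)
        # The boost halves every `half_life_days`, so it only breaks near-ties.
        return c.similarity + recency_weight * 0.5 ** (age_days / half_life_days)

    return sorted(live, key=score, reverse=True)


candidates = [
    Chunk("pto_v1#3", 0.88, date(2022, 1, 1), superseded_by="pto_v2"),
    Chunk("pto_v2#3", 0.85, date(2025, 6, 1)),
    Chunk("leave#1", 0.84, date(2020, 1, 1)),
]
ranked = prefer_current(candidates, today=date(2025, 11, 1))
print([c.chunk_id for c in ranked])
```

In this example the superseded chunk has the highest raw similarity and would have reached the model first. Filtering on the deprecation marker removes it before any scoring happens, which is why the marker matters more than the boost.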
No query complexity classification before routing
Not all queries need the same retrieval strategy. A factual lookup — "what is the PTO entitlement for full-time employees?" — needs a single dense retrieval call. Running query decomposition on it adds latency and produces redundant sub-queries that retrieve the same chunk multiple times. A multi-hop comparative query needs decomposition, hybrid retrieval, and reranking. Running only dense retrieval on it produces an incomplete answer. A retrieval system that applies the same strategy to every query is simultaneously over-engineering simple lookups and under-engineering complex ones. The fix is a query classifier — lightweight, deterministic, operating before any retrieval call — that reads the structural signals in the query and routes it to the appropriate retrieval strategy. This is the architectural decision that makes everything else scalable.
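A rule-based classifier of the kind described above might look like the following sketch. The three routes, the regular expressions, and the cue phrases are illustrative assumptions; a production classifier would derive its signals from the query log of the system it serves.

```python
import re
from enum import Enum


class Route(Enum):
    EXACT = "bm25"           # entity or exact-term lookup
    SIMPLE = "dense"         # single-fact semantic lookup
    MULTI_HOP = "decompose"  # decomposition, hybrid retrieval, reranking


# Assumed structural signals; both lists are examples, not an exhaustive set.
SECTION_REF = re.compile(r"\b(?:section|clause|article)\s+\d+(?:\.\d+)*", re.IGNORECASE)
MODEL_ID = re.compile(r"\b[A-Z]{2,}-?\d{2,}[A-Z0-9-]*\b")
MULTI_HOP_CUES = re.compile(
    r"\b(?:compare|versus|difference between|that involve|both)\b", re.IGNORECASE
)


def classify(query: str) -> Route:
    """Deterministic routing based on surface patterns in the query text."""
    if MULTI_HOP_CUES.search(query):
        return Route.MULTI_HOP
    if SECTION_REF.search(query) or MODEL_ID.search(query):
        return Route.EXACT
    return Route.SIMPLE


for q in (
    "what is the PTO entitlement for full-time employees?",
    "what does section 4.2 of the vendor SLA require?",
    "What are the approval thresholds for vendor contracts that involve data processing?",
):
    print(classify(q).value, "<-", q)
```

Because the classifier is a handful of compiled patterns, it adds negligible latency ahead of retrieval and its decisions are easy to audit when a query is misrouted.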
What all five of these have in common is that they are invisible at the model output layer. The model receives bad context and generates a fluent, confident response from it. An engineer looking only at the output sees a wrong answer and reaches for the model. An engineer who looks at what the retrieval layer returned before the model saw it sees a chunking failure, a routing failure, or a metadata failure — and fixes the right thing.
The diagnostic audit — where to look before you touch the model
The retrieval diagnostic is not a sophisticated process. It requires logging the full context that was passed to the model on failed queries (the actual chunks, their source documents, their similarity scores, and the query that retrieved them) and then reading that context with the question in mind. This is tedious. It is the work. Every retrieval failure I have investigated was visible in this log to anyone reading it with the right question: is the answer to this query present in the context the model was given?
If the answer is no, the problem is retrieval. No amount of model tuning will fix a problem that is not caused by the model.
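The log described above can be sketched as one JSON line per query. The record fields mirror the items listed in the audit (chunks, sources, scores, query); the class names and the `missing_evidence` helper are hypothetical. The phrase check is a crude triage aid for sorting failed queries, assuming a reviewer has written down what a correct answer must contain, and it does not replace reading the context.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RetrievedChunk:
    source: str
    score: float
    text: str


@dataclass
class RetrievalRecord:
    query: str
    chunks: List[RetrievedChunk]


def to_log_line(record: RetrievalRecord) -> str:
    """Serialise exactly what the model was given, one JSON object per line."""
    return json.dumps(asdict(record))


def missing_evidence(record: RetrievalRecord, required_phrases: List[str]) -> List[str]:
    """Return the required phrases that appear nowhere in the retrieved context."""
    context = " ".join(chunk.text for chunk in record.chunks).lower()
    return [p for p in required_phrases if p.lower() not in context]


record = RetrievalRecord(
    query="what happens if vendor uptime falls below the SLA threshold?",
    chunks=[
        RetrievedChunk(
            "vendor_sla.pdf", 0.84, "4.1 If uptime falls below 99.5% in a calendar month,"
        )
    ],
)
print(to_log_line(record))
print(missing_evidence(record, ["99.5%", "service credit"]))
```

Here the retrieved chunk holds the condition but not the consequence, so the check reports "service credit" as missing. That points at chunking or retrieval, not at the model.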