The failure mode arrives quietly. A RAG system goes live. The early demos looked good — clean, confident answers with cited sources. Two months in, the support tickets start. Users report that the system is getting things wrong. Not obviously wrong — the answers are fluent and well-structured and they cite a real document. They are just not quite right. Missing a clause. Conflating two policies. Giving the procedure for the wrong machine model.

The team looks at the model. They try a larger one, or a fine-tuned variant, or a different prompt template that adds more instruction about accuracy. Some of these changes move the numbers slightly. None of them fix the problem sustainably, because the problem was never in the model.

It was in the retrieval layer — specifically, in what got passed to the model in the first place. A model can only generate from what it receives. If the context it receives contains the wrong chunk, a truncated procedure, or a fragment that happens to be semantically close to the query but is missing the critical clause, the model will produce a confident, fluent, wrong answer. It will do so regardless of its size, its fine-tuning history, or how carefully you wrote the system prompt.

Fine-tuning a model on top of broken retrieval does not fix the retrieval. It teaches the model to be more articulate about the wrong information.


Why retrieval fails silently

The most dangerous property of retrieval failure is that it looks like success. When a dense retrieval system returns the wrong chunk, it does not return an error. It returns a similarity score — typically somewhere between 0.7 and 0.9 — that the generation layer treats as a confidence signal. A score of 0.84 means "this is probably relevant." It does not mean "this is the correct chunk for this query." The generation layer does not know the difference. It receives context and generates from it.

This is what makes retrieval failures so insidious in production. A system that errors out visibly can be debugged. A system that returns fluent, cited, wrong answers erodes trust gradually — users stop relying on it not because they saw an obvious failure, but because they stopped trusting the answers after two or three quiet misses they only noticed later.

A similarity score near 0.84 is not a confidence signal. It is a proximity signal. The two are not the same thing, and the generation layer treats them as if they are.

The research is unambiguous on the scale of this problem. Dense retrievers underperform BM25 on out-of-domain corpora by 11.7 NDCG points on average across the BEIR benchmark. Single-vector retrieval surfaces all the evidence needed to answer a multi-hop question only 44% of the time on HotpotQA. These are not edge cases — they are the normal operating conditions of most enterprise RAG deployments, which run on heterogeneous corpora with specialised terminology that the embedding model was not trained on, and where a significant share of real user queries require evidence from more than one document.

A model fine-tuned under these conditions learns one thing: how to produce fluent outputs when the context is incomplete. That is not a capability. It is a liability.


Where retrieval actually breaks — and what to look at first

Retrieval failures in production RAG systems are not random. They cluster around a small set of structural causes, and each cause has a diagnostic signature that is visible if you look at the right layer. The instinct to jump to fine-tuning usually happens because teams are looking at the model's output — the thing that is wrong — rather than the retrieval layer that produced the input the model was given.

The causes below all recur in the systems I have worked on. They are ordered roughly by how easily they are fixed, not by frequency: the last one on the list is both the most common root cause and the one most often overlooked, which makes it the most important to check first.

Where RAG systems actually fail — causes, signatures, and what they are not
Chunking strategy mismatched to document structure
Uniform token splitting is the default in most RAG implementations. It is also wrong for most enterprise document types. A policy document split on 512-token boundaries will routinely cut a numbered clause in half — the first chunk contains the condition, the second contains the consequence. Neither chunk retrieves well for a query about the rule, because neither chunk contains the complete logical unit. The retrieval system returns the best available fragment. The model generates from it. The answer is incomplete. This is not a model problem. It is a chunking problem, and the fix is a content-aware chunking strategy: section-aware splits for policy documents, step-preserving splits for procedures, QA-pair preservation for knowledge base articles. The chunk is the unit of retrieval. Its boundaries must respect the unit of meaning in the source document.
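
A minimal sketch of the difference, in Python. The sample policy text, the clause-numbering pattern, and both function names are illustrative assumptions rather than anything from a specific system, and words stand in for tokens to keep the example dependency-free.

```python
import re

# Hypothetical policy excerpt, used only for illustration.
POLICY = """\
4.1 Eligibility. Vendors must hold a current security certification.
4.2 Termination. If the vendor misses two consecutive SLA targets,
the customer may terminate the contract with 30 days notice.
4.3 Renewal. Contracts renew annually unless cancelled in writing.
"""

def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Uniform splitting: cut every `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Assumption: a clause starts at the beginning of a line with a number like "4.2".
CLAUSE_START = re.compile(r"^(?=\d+\.\d+\s)", re.MULTILINE)

def clause_chunks(text: str) -> list[str]:
    """Section-aware splitting: one chunk per numbered clause."""
    parts = CLAUSE_START.split(text)
    return [" ".join(part.split()) for part in parts if part.strip()]

fixed = fixed_size_chunks(POLICY, size=16)
clauses = clause_chunks(POLICY)

for chunk in fixed:
    print("FIXED :", chunk)
for chunk in clauses:
    print("CLAUSE:", chunk)
```

With a 16-word window, the condition of clause 4.2 lands in one chunk and its consequence in the next. The clause-aware splitter keeps the two together, which is the property the retrieval layer needs.
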
Dense-only retrieval on entity-heavy queries
Dense retrievers are trained to capture semantic similarity. They are not trained to capture exact term overlap. When a user asks about a specific contract clause number, a machine model identifier, or a named regulation — "what does section 4.2 of the vendor SLA require?" — dense retrieval looks for semantically similar content rather than exact matches. It finds something close. Close is wrong. BM25, which operates on term frequency, handles exact entity queries precisely and quickly. A retrieval system that routes all queries through a single dense encoder is systematically failing on a class of queries that constitutes a significant fraction of real enterprise traffic. The fix is a hybrid retrieval layer — dense for semantic queries, BM25 for entity and exact-term queries, with a weighted fusion that lets each strategy contribute where it is strong.
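
One way to fuse the two result lists is reciprocal rank fusion (Cormack et al., listed in the references), which needs only ranks, not comparable scores. The sketch below is an assumption about how a fusion step could look, not a description of any particular system. The per-retriever weights are an optional extension of the original unweighted method, and the chunk ids are made up.

```python
from collections import defaultdict
from typing import Optional

def reciprocal_rank_fusion(
    rankings: list[list[str]],
    weights: Optional[list[float]] = None,
    k: int = 60,
) -> list[str]:
    """Fuse ranked lists of chunk ids; each list adds weight / (k + rank) per item."""
    if weights is None:
        weights = [1.0] * len(rankings)
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += weight / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

# Hypothetical results for "what does section 4.2 of the vendor SLA require?"
dense_hits = ["sla_overview", "sla_4_1", "sla_4_2"]  # semantically close
bm25_hits = ["sla_4_2", "sla_4_1", "pricing"]        # exact-term match first

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

In this toy case the exact clause, ranked third by the dense list, moves to the top once the BM25 ranking contributes.
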
Single-vector retrieval on multi-hop questions
Some questions require evidence from more than one document. "What are the approval thresholds for vendor contracts that involve data processing?" requires the procurement policy, the data governance framework, and possibly the legal addendum that overrides both. No single chunk contains that answer. A single query vector finds the most similar chunk (probably the procurement policy) and the model generates an answer that omits the data governance requirements entirely. The answer sounds complete. It is not. The fix is query decomposition: breaking the multi-hop question into atomic sub-queries, running retrieval independently on each, and fusing the candidate sets before generation. This is architecturally more complex than single-vector retrieval. It is also the only approach that reliably handles multi-hop queries, which fall within the roughly 42% of enterprise query traffic that is not a simple factual lookup.
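
The retrieve-per-sub-query-then-merge step can be sketched as follows. Everything here is hypothetical: the corpus, the word-overlap retriever that stands in for a real one, and the hand-written sub-queries (in practice the decomposition itself would come from a model or a rule set, which this sketch does not show).

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]

# Toy corpus: document id -> text.
CORPUS = {
    "procurement_policy": "approval thresholds for vendor contracts by contract value",
    "data_governance": "requirements for contracts that involve data processing",
    "legal_addendum": "addendum overriding approval rules for data processing vendors",
    "travel_policy": "expense limits for employee travel",
}

def toy_retrieve(query: str, k: int) -> list[str]:
    """Stand-in retriever: rank documents by number of shared words."""
    q = set(query.lower().split())
    scored = [(len(q & set(text.split())), doc_id) for doc_id, text in CORPUS.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:k]]

def retrieve_decomposed(sub_queries: list[str], retrieve: Retriever, per_query: int = 2) -> list[str]:
    """Retrieve for each sub-query, then interleave the results and drop duplicates."""
    result_lists = [retrieve(q, per_query) for q in sub_queries]
    merged: list[str] = []
    seen: set[str] = set()
    for rank in range(per_query):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

full_query = "approval thresholds for vendor contracts that involve data processing"
sub_queries = [
    "approval thresholds for vendor contracts",
    "data processing requirements for contracts",
    "addendum overriding approval rules",
]

single = toy_retrieve(full_query, 1)
merged = retrieve_decomposed(sub_queries, toy_retrieve)
print("single query :", single)
print("decomposed   :", merged)
```

The single query returns one document. The decomposed run puts each sub-query's top hit into the candidate set, so all three sources reach the generation step.
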
Stale or missing metadata on indexed documents
When a policy is updated, the old version frequently remains in the index alongside the new one. Without date metadata, version tags, or supersession signals on the chunks, the retrieval system has no way to prefer the current version. Both chunks compete on semantic similarity alone, and the older version wins whenever its wording happens to sit closer to the query, because a similarity score carries no notion of recency. The model generates from the outdated policy. The answer was sourced from a real document. It was the wrong document. The fix is not a model fix. It is an indexing pipeline fix: mandatory date metadata on every chunk, version-aware retrieval that boosts recency, and an ingestion workflow that marks superseded documents as deprecated rather than simply adding the new version alongside them.
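
A sketch of what version-aware selection could look like once the metadata exists. The field names and the policy of keeping only the newest version per document are assumptions. This version filters rather than boosts, which is the simplest variant of the idea.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Chunk:
    doc_id: str
    version: int
    effective: date      # date metadata attached at ingestion
    deprecated: bool     # set when a newer version supersedes this one
    similarity: float

def prefer_current(candidates: list[Chunk]) -> list[Chunk]:
    """Drop deprecated chunks, keep the newest version per document, sort by similarity."""
    latest: dict[str, Chunk] = {}
    for chunk in candidates:
        if chunk.deprecated:
            continue
        best = latest.get(chunk.doc_id)
        if best is None or (chunk.version, chunk.effective) > (best.version, best.effective):
            latest[chunk.doc_id] = chunk
    return sorted(latest.values(), key=lambda c: -c.similarity)

# Hypothetical candidates: the superseded policy scores higher on similarity.
candidates = [
    Chunk("pto_policy", 1, date(2022, 1, 1), deprecated=True, similarity=0.86),
    Chunk("pto_policy", 2, date(2024, 1, 1), deprecated=False, similarity=0.84),
    Chunk("leave_faq", 1, date(2023, 6, 1), deprecated=False, similarity=0.80),
]

current = prefer_current(candidates)
print([(c.doc_id, c.version) for c in current])
```

Without the metadata, the 0.86 chunk wins. With it, the superseded version never reaches the model, even if the ingestion workflow forgot the deprecation flag, because the version comparison still catches it.
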
No query complexity classification before routing
Not all queries need the same retrieval strategy. A factual lookup — "what is the PTO entitlement for full-time employees?" — needs a single dense retrieval call. Running query decomposition on it adds latency and produces redundant sub-queries that retrieve the same chunk multiple times. A multi-hop comparative query needs decomposition, hybrid retrieval, and reranking. Running only dense retrieval on it produces an incomplete answer. A retrieval system that applies the same strategy to every query is simultaneously over-engineering simple lookups and under-engineering complex ones. The fix is a query classifier — lightweight, deterministic, operating before any retrieval call — that reads the structural signals in the query and routes it to the appropriate retrieval strategy. This is the architectural decision that makes everything else scalable.
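
A deterministic classifier of this kind can be very small. The sketch below is illustrative only: the cue patterns, the strategy labels, and the `Route` structure are assumptions, and a production cue list would need to be built and tested against real query logs.

```python
import re
from dataclasses import dataclass

# Hypothetical cue patterns.
ENTITY_CUE = re.compile(
    r"\b(?i:section|clause|article)\s+\d+(?:\.\d+)*"  # "section 4.2"
    r"|\b[A-Z]{2,}-?\d+\b"                            # identifiers like "XR-200"
)
MULTI_HOP_CUE = re.compile(
    r"\b(?:compare|versus|difference between|that involve)\b", re.IGNORECASE
)

@dataclass(frozen=True)
class Route:
    strategy: str
    signals: tuple[str, ...]  # which cues fired, kept for later inspection

def classify(query: str) -> Route:
    """Pick a retrieval strategy from structural cues in the query text."""
    signals = []
    if MULTI_HOP_CUE.search(query):
        signals.append("multi_hop_cue")
    if ENTITY_CUE.search(query):
        signals.append("entity_cue")
    if "multi_hop_cue" in signals:
        strategy = "decompose+hybrid+rerank"
    elif "entity_cue" in signals:
        strategy = "hybrid"
    else:
        strategy = "dense"
    return Route(strategy, tuple(signals))

for q in [
    "what is the PTO entitlement for full-time employees?",
    "what does section 4.2 of the vendor SLA require?",
    "What are the approval thresholds for vendor contracts that involve data processing?",
]:
    print(classify(q).strategy, "<-", q)
```

Returning the signals alongside the strategy costs nothing and means every routing decision can be explained from the log.
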

What all five of these have in common is that they are invisible at the model output layer. The model receives bad context and generates a fluent, confident response from it. An engineer looking only at the output sees a wrong answer and reaches for the model. An engineer who looks at what the retrieval layer returned before the model saw it sees a chunking failure, a routing failure, or a metadata failure — and fixes the right thing.


The diagnostic audit — where to look before you touch the model

The retrieval diagnostic is not a sophisticated process. It requires logging the full context that was passed to the model on failed queries — the actual chunks, their source documents, their similarity scores, and the query that retrieved them — and then reading that context with the question in mind. This is tedious. It is the work. Every retrieval failure I have investigated was visible in this log if you looked at it with the right question: is the answer to this query present in the context the model was given?
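
As a sketch of what that log could capture, the record below stores the query, the strategy, and every chunk that reached the model. The class and field names are assumptions. The substring check at the end is a crude stand-in for a person reading the context, useful only as a first filter.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class RetrievedChunk:
    chunk_id: str
    source_doc: str
    similarity: float
    text: str

@dataclass
class RetrievalLogRecord:
    query: str
    strategy: str
    chunks: list[RetrievedChunk] = field(default_factory=list)

    def to_json(self) -> str:
        """One JSON line per query, suitable for an append-only log file."""
        return json.dumps(asdict(self), ensure_ascii=False)

def answer_present(record: RetrievalLogRecord, expected_phrase: str) -> bool:
    """Rough check: does any chunk given to the model contain the expected text?"""
    needle = expected_phrase.lower()
    return any(needle in chunk.text.lower() for chunk in record.chunks)

# Hypothetical failed query: the model was handed clause 4.1, not 4.2.
record = RetrievalLogRecord(
    query="what does section 4.2 of the vendor SLA require?",
    strategy="dense",
    chunks=[RetrievedChunk("c17", "vendor_sla.pdf", 0.84, "4.1 Eligibility. Vendors must hold a current certification.")],
)

print(record.to_json())
print("answer in context:", answer_present(record, "4.2 Termination"))
```
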

If the answer is no, the problem is retrieval. No amount of model tuning will fix a problem that is not caused by the model.

Is the correct information present anywhere in the index? This is the baseline check. Search the index directly for the answer to the failed query, bypassing the retrieval pipeline. If the answer is not in the index at all, the problem is ingestion — a document that was not indexed, a chunk that was too large and drowned out the relevant sentence, or a file format that the ingestion pipeline failed to parse. The model cannot generate from information it was never given.

If the information is indexed, is it being retrieved? Retrieve the top-10 chunks for the failed query and read them. Is the correct chunk in the top 10? If not, is it in the top 50? The position tells you which retrieval failure you are looking at. Not in top 10 but in top 50 suggests a reranking problem. Not in top 50 at all suggests a strategy mismatch — the query type and the retrieval strategy are incompatible.
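
The rule of thumb in this check translates directly into code. The thresholds of 10 and 50 come from the paragraph above; the function names and the chunk ids are illustrative assumptions.

```python
from typing import Optional

def rank_of(chunk_id: str, ranked_ids: list[str]) -> Optional[int]:
    """1-based position of the correct chunk, or None if it was not returned."""
    try:
        return ranked_ids.index(chunk_id) + 1
    except ValueError:
        return None

def diagnose(rank: Optional[int]) -> str:
    """Map the position of the correct chunk to the likely failure type."""
    if rank is None or rank > 50:
        return "strategy mismatch"
    if rank > 10:
        return "reranking problem"
    return "retrieved in top 10"

# Hypothetical ranked list of 60 chunk ids for one failed query.
ranked = [f"chunk_{i}" for i in range(1, 61)]
for gold in ["chunk_3", "chunk_27", "chunk_58", "chunk_999"]:
    print(gold, "->", diagnose(rank_of(gold, ranked)))
```
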

If the correct chunk is being retrieved, is it intact? Read the retrieved chunk in isolation, without the document it came from. Does it contain a complete logical unit — a full policy clause, a complete procedure step, a self-contained QA pair? Or does it cut off at an arbitrary token boundary, leaving the critical part of the rule in the adjacent chunk that was not retrieved? If the chunk is a fragment, the problem is chunking strategy.

Is the retrieved chunk current? Check the metadata on the retrieved chunk — the document date, the version, the ingestion timestamp. If the answer was sourced from a document that has been superseded, the problem is index freshness. The correct information exists in the index. The wrong version of it was ranked higher.

Is the query type matched to the retrieval strategy? Log the retrieval strategy that was applied to the failed query — dense only, hybrid, decomposed. Was the query a multi-hop question that was routed to dense-only retrieval? Was it an entity lookup that was sent through a semantic encoder? A mismatch between query type and retrieval strategy will produce systematic failures on a class of queries, not random ones. Systematic failures are the diagnostic signal. They tell you exactly which part of the routing logic to fix.
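
Systematic failures show up as soon as failed queries are grouped by query type and applied strategy. A minimal sketch, assuming each failure log entry carries `query_type` and `strategy` fields (both hypothetical names):

```python
from collections import Counter

def failure_clusters(failed: list[dict]) -> list[tuple[tuple[str, str], int]]:
    """Count failed queries per (query_type, strategy) pair, largest cluster first."""
    counts = Counter((f["query_type"], f["strategy"]) for f in failed)
    return counts.most_common()

# Hypothetical failure log entries.
failed_queries = [
    {"query_type": "multi_hop", "strategy": "dense"},
    {"query_type": "multi_hop", "strategy": "dense"},
    {"query_type": "entity", "strategy": "dense"},
    {"query_type": "multi_hop", "strategy": "dense"},
    {"query_type": "factual", "strategy": "dense"},
]

clusters = failure_clusters(failed_queries)
for (query_type, strategy), count in clusters:
    print(f"{count:>3}  {query_type:<10} via {strategy}")
```

A cluster that dominates the list, such as multi-hop questions sent through dense-only retrieval, points at the routing rule to change.
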

None of these checks require model expertise. They require reading logs and thinking about what the retrieval layer was asked to do and whether it was capable of doing it. The model question comes after all five checks come back clean — which means the context was correct, complete, and current, and the model still produced a wrong answer. That is a narrow failure mode. It happens. But it is not the usual case in a system that has not had its retrieval layer audited.
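
The ordering of the five checks matters, and it can be written down as a short triage function. This is a sketch under assumptions: the findings are booleans filled in by whoever read the logs, and the layer labels are illustrative.

```python
from dataclasses import dataclass

@dataclass
class AuditFindings:
    in_index: bool                # check 1: is the information indexed at all?
    retrieved_in_top_k: bool      # check 2: did the correct chunk come back?
    chunk_intact: bool            # check 3: is it a complete logical unit?
    chunk_current: bool           # check 4: is it the current version?
    strategy_matches_query: bool  # check 5: was the query routed sensibly?

def first_failing_layer(findings: AuditFindings) -> str:
    """Return the first layer that failed; blame the model only if all five pass."""
    checks = [
        (findings.in_index, "ingestion"),
        (findings.retrieved_in_top_k, "retrieval ranking"),
        (findings.chunk_intact, "chunking"),
        (findings.chunk_current, "index freshness"),
        (findings.strategy_matches_query, "query routing"),
    ]
    for passed, layer in checks:
        if not passed:
            return layer
    return "model"

print(first_failing_layer(AuditFindings(True, True, False, True, True)))
print(first_failing_layer(AuditFindings(True, True, True, True, True)))
```
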


What QueryForge is actually solving

When I designed QueryForge, the starting observation was straightforward: the majority of enterprise RAG deployments between 2023 and 2024 used a single retrieval strategy — embed the query, find the nearest chunks, pass them to the model. That pattern works well for simple factual lookups, which make up about 58% of enterprise query traffic. It fails on the other 42%: comparative queries, multi-hop questions, temporal queries, and entity lookups that require exact term matching rather than semantic approximation.

The failure is silent. Dense retrieval returns something. The model generates from it. The answer looks right until the user checks it against the source.

The retrieval diagnosis: what to fix vs. what to ignore

Retrieval problems: fix upstream
The context was wrong before the model saw it

These failures are invisible at the output layer. They require log inspection at the retrieval layer. No model change will fix them.

Wrong chunk retrieved: similarity ≠ relevance
Incomplete chunk: token boundary cuts a logical unit
Stale chunk: outdated version ranked above current one
Missing evidence: multi-hop answer needs more than one document
Strategy mismatch: entity query sent through semantic encoder

Model problems: fix the model
The context was correct but the output was wrong

These failures are rare in a well-designed retrieval layer. They are the only failures that fine-tuning or prompt engineering can actually address.

Correct context retrieved, answer draws wrong inference
Correct context retrieved, answer ignores a constraint in it
Correct context retrieved, domain terminology misinterpreted
Correct context retrieved, numerical reasoning fails

QueryForge addresses this by putting a query classifier in front of the retrieval layer. The classifier reads structural signals in each query — entity pairs, temporal cues, document domain signals, query complexity — and routes it to the appropriate strategy: dense-only for simple factual lookups, hybrid for entity and exact-term queries, decomposition plus hybrid plus reranking for multi-hop questions. Each routing decision is logged with full signal provenance so the diagnostic audit is not a manual exercise — the system records why each query was routed the way it was.

The outcome is not a smarter model. It is a retrieval layer that systematically gives the model the right context for each query type. A model with correct context is a different system from a model with approximate context. The difference is not in the model. It is in what the model was given to work with.


The right time to reach for fine-tuning

This is not an argument that fine-tuning is never the answer. It is an argument that fine-tuning is the answer to a specific and narrow set of RAG system failures: the ones where the retrieval layer is working correctly and the model is still underperforming.

Those cases exist. A base model that has not been exposed to specialised domain terminology will misinterpret retrieved chunks that use that terminology correctly. A model without financial reasoning capabilities will receive correct numerical data and make arithmetic errors. A model without domain-specific instruction following will receive complete, accurate context and produce outputs that violate the format or structure the application requires.

In each of those cases, fine-tuning is the right response — because the input is correct and the model's processing of it is the failure point. But these cases are distinguishable from retrieval failures if you look at the context that was provided. Correct context, wrong output: model problem. Incomplete or incorrect context, wrong output: retrieval problem. The diagnostic distinction is not subtle. It is visible in the logs.

Fix the retrieval layer first. When the model is receiving the right information and still producing wrong answers — that is when you reach for fine-tuning.

The reason teams skip this sequence is not laziness. It is that fine-tuning feels like the kind of technical intervention that should fix a quality problem. It is sophisticated work. It produces measurable output. It has a clear narrative: we improved the model, and now the system performs better. The retrieval audit is less satisfying — it involves reading logs, finding the chunking strategy that was configured two months ago, and changing a YAML parameter. The fix is unglamorous. The impact is real.

Most RAG systems in production today are not limited by their models. They are limited by what their retrieval layers give those models to work with. That is where the leverage is, and that is where the work should start.

References & Further Reading

Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021. arXiv:2104.08663. — The 11.7 NDCG point gap between dense and BM25 on out-of-domain corpora. The benchmark result that makes hybrid retrieval architecturally necessary, not optional.

Yang, Z. et al. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. EMNLP 2018. arXiv:1809.09600. — The 44.3% single-vector retrieval success rate on multi-hop questions. The result that makes query decomposition necessary for a significant fraction of enterprise queries.

Cormack, G.V., Clarke, C.L.A., & Buettcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR 2009. — The original RRF paper. The fusion method that makes multi-strategy retrieval practical without a labelled training set to tune fusion weights.

Gao, L., Ma, X., Lin, J., & Callan, J. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels. arXiv:2212.10496. — On HyDE and the conditions under which hypothetical document embedding improves retrieval. Also documents the failure mode where LLM hallucination in the hypothetical degrades recall — the case for running it in parallel with standard dense retrieval rather than as a replacement.
