Everyone talks about RAG as something you add to a system. Nobody talks about the four architectural trade-offs you lock in the moment you choose it — usually without realising that's what you're doing.
ML Engineering · November 2025 · 13 min read
When I started designing VaultRAG — a retrieval system for factory floor use, running entirely on-prem — the first question I was asked was which embedding model to use. It was a reasonable question from someone who had read the right things. Embedding model quality has a real impact on retrieval performance. The benchmarks are published. The comparisons are legible. It felt like the right place to start.
It was not the right place to start.
Before the embedding model mattered, I had to decide whether the index could ever leave the facility network. Before that, I had to decide whether the system needed to answer questions about documents that were updated monthly, or weekly, or in real time as equipment status changed. Before that, I had to understand whether a query from a technician standing next to a machine with an alarm going off could afford a 4-second response, or whether 4 seconds was already too slow.
Each of those questions was a constraint. And each constraint eliminated certain architectures before any model comparison was relevant. By the time I got to embedding model selection, half the options were already off the table — not because they underperformed, but because the constraints had made them architecturally impossible.
That sequence — constraints first, components second — is the thing most RAG discussions skip. And skipping it is why so many RAG systems are built competently, deployed correctly, and still fail to do what the business needed.
What RAG actually is, before the marketing gets to it
Retrieval-Augmented Generation is a pattern for grounding a language model's responses in a specific document corpus at inference time. Instead of relying on what the model learned during training, you retrieve relevant chunks from your own documents and pass them to the model as context. The model generates its answer from that context rather than from memory.
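To make the pattern concrete, here is a minimal sketch of the retrieve-then-prompt flow. It is deliberately a toy: `embed` is a bag-of-words stand-in for a real embedding model, the corpus in `CHUNKS` is invented for illustration, and the function names are my own rather than any library's API. The model call itself is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Assemble the context-grounded prompt that would be sent to the model."""
    retrieved = retrieve(query, chunks, k)
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical corpus, invented for illustration.
CHUNKS = [
    "Press 4 hydraulic pump: replace the filter every 500 operating hours.",
    "Conveyor B belt tension is checked at the start of each shift.",
    "Alarm E42 on press 4 indicates low hydraulic pressure.",
]
```

Calling `build_prompt("What does alarm E42 on press 4 mean?", CHUNKS, k=1)` puts the alarm chunk into the context as source `[1]`. Everything the model can cite comes from `CHUNKS`, so whatever is or is not in the index bounds what the answer can say.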
That is a genuinely useful capability. It solves a real problem: language models are trained at a point in time, on data that isn't yours, and they hallucinate when asked about things they don't know. RAG addresses the first two limitations directly and reduces the third, because the model answers from retrieved text rather than from whatever it half-remembers.
The problem is the way RAG gets sold — as a capability you bolt on. A feature. You have a model, you add a vector store, you write some chunking logic, and now your model knows about your documents. This framing is technically accurate and practically misleading, because it presents RAG as a component decision when it is actually a systems decision. The moment you choose RAG, you have committed to a position on freshness, latency, data residency, and cost — whether you realised you were deciding those things or not.
Choosing RAG is not a decision about retrieval. It is four decisions made at once, whether you intended to make them or not.
The teams that build RAG systems well understand this going in. The teams that build RAG systems and then spend six months patching them usually discover it the other way around.
The four decisions you make when you choose RAG
None of these are model decisions. They are architecture decisions — and they have to be made deliberately, because the defaults are almost never right for a production environment.
The four implicit decisions inside every RAG architecture choice
Decision 1 — Freshness
RAG retrieves from an index. An index is a snapshot of your documents at a point in time. The moment a document changes, your index is stale — and it stays stale until you re-embed and re-index that document. For a manufacturing plant with equipment manuals that are revised quarterly, this is manageable. For a system that needs to answer questions about yesterday's inventory or today's machine status, a snapshot-based index is architecturally wrong. You have decided, implicitly, that your answers can be as stale as your last ingestion run. If that cadence doesn't match the business requirement, you will have built the right system for the wrong problem.
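One way to make the freshness decision explicit is to write it down as a staleness limit and check the index against it. The sketch below assumes you can get two timestamps per document, when the source last changed and when it was last indexed. The `DocRecord` type, the document names, and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DocRecord:
    doc_id: str
    modified_at: datetime  # last change in the source system
    indexed_at: datetime   # last time the document was embedded and indexed


def is_stale(rec: DocRecord) -> bool:
    """True when the index holds an older version than the source."""
    return rec.modified_at > rec.indexed_at


def breaches_contract(rec: DocRecord, now: datetime, max_staleness: timedelta) -> bool:
    """True when the document is stale and the unindexed change is older than the agreed limit."""
    return is_stale(rec) and (now - rec.modified_at) > max_staleness


def report(records: list[DocRecord], now: datetime, max_staleness: timedelta) -> list[str]:
    """IDs of documents that breach the freshness contract."""
    return [r.doc_id for r in records if breaches_contract(r, now, max_staleness)]


# Hypothetical records, invented for illustration.
NOW = datetime(2025, 11, 10, 8, 0)
RECORDS = [
    # Unchanged since it was last indexed.
    DocRecord("press-4-manual", datetime(2025, 9, 1), datetime(2025, 11, 3)),
    # Changed on 6 November, after the last indexing run on 3 November.
    DocRecord("lockout-procedure", datetime(2025, 11, 6), datetime(2025, 11, 3)),
]
```

With a one-day limit, `report(RECORDS, NOW, timedelta(days=1))` flags `lockout-procedure`. With a seven-day limit it flags nothing, even though the index is serving an outdated procedure. The limit is the freshness decision, stated as a number someone has to agree to.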
Decision 2 — Latency budget
A RAG pipeline adds latency relative to a model call alone. You are embedding the query, running a similarity search, retrieving chunks, stuffing them into the context window, and then generating. On a well-tuned cloud setup, this can be fast. On a constrained on-prem server running a smaller model, each of those steps has a cost. For VaultRAG, the target was under eight seconds from voice query to cited response — a technician with a machine alarm going off cannot wait longer than that. That constraint shaped every component choice: the model size, the vector store, the chunk size, the number of retrieved documents. The latency budget is not a performance requirement you optimise for at the end. It is a constraint you design to from the start.
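Designing to a latency budget can start as simple arithmetic: give each stage an allowance and see what is left. In the sketch below only the eight-second target comes from the VaultRAG requirement. The per-stage figures are placeholders I made up for illustration, not measurements, and the stage names are assumptions about how a voice pipeline might be split.

```python
BUDGET_MS = 8_000  # end-to-end target: voice query to cited response

# Hypothetical per-stage estimates in milliseconds; replace with measurements.
STAGES_MS = {
    "speech_to_text": 1_500,
    "embed_query": 100,
    "vector_search": 200,
    "generate_answer": 5_000,
}


def headroom_ms(stages: dict[str, int], budget_ms: int) -> int:
    """Milliseconds left in the budget; negative means the design is over budget."""
    return budget_ms - sum(stages.values())


def dominant_stage(stages: dict[str, int]) -> str:
    """The stage that consumes the largest share of the budget."""
    return max(stages, key=stages.get)


def fits(stages: dict[str, int], budget_ms: int) -> bool:
    """True when the estimated pipeline fits inside the budget."""
    return headroom_ms(stages, budget_ms) >= 0
```

With these placeholder numbers the pipeline has 1,200 ms of headroom and generation dominates. That is the kind of result that pushes the design towards a smaller model or fewer retrieved chunks before anyone compares vector stores.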
Decision 3 — Data residency
To use a cloud RAG service, your documents leave your network. They go to an embedding API, they get stored in a managed vector store, they pass through a cloud inference endpoint. For most organisations, this is fine. For manufacturing firms in aerospace, automotive, and defence supply chains, it is often contractually prohibited. Proprietary process documentation, maintenance procedures, and equipment specifications are not documents that can be sent to a third-party API — not because of security anxiety, but because of legally binding data sovereignty requirements. This constraint does not make cloud RAG wrong. It makes it unavailable. The architecture has to be designed around what is actually permitted, not what is most convenient.
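A residency constraint can also be checked mechanically, at least as a first line of defence. The sketch below inspects a pipeline's configured endpoints and flags any that point outside the facility network. The endpoint URLs and the `.plant.internal` suffix are hypothetical, and a real deployment would enforce this at the network layer as well rather than rely on a config check alone.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical internal DNS suffix; substitute your own.
INTERNAL_SUFFIXES = (".plant.internal",)


def stays_on_network(url: str) -> bool:
    """True when the endpoint host is a private or loopback IP, or an internal hostname."""
    host = urlparse(url).hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: fall back to the hostname allow-list.
        return host == "localhost" or host.endswith(INTERNAL_SUFFIXES)
    return ip.is_private or ip.is_loopback


def residency_violations(endpoints: dict[str, str]) -> list[str]:
    """Names of pipeline components whose endpoint leaves the network."""
    return [name for name, url in endpoints.items() if not stays_on_network(url)]


# Hypothetical pipeline configuration, invented for illustration.
ENDPOINTS = {
    "embedding": "https://api.example.com/v1/embeddings",
    "vector_store": "http://vectors.plant.internal:6333",
    "inference": "http://10.0.0.5:8080/generate",
}
```

Here `residency_violations(ENDPOINTS)` returns `["embedding"]`. A single external component is enough to make the whole pipeline undeployable under this constraint.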
Decision 4 — Ongoing cost structure
Cloud RAG services price on API calls, storage, and compute. At low volume this is negligible. At the volume of a facility with hundreds of machines, dozens of technicians, and queries being run continuously during production hours, the cost compounds quickly. More importantly, the cost is ongoing and variable — it scales with usage, which means a successful deployment costs more than a quiet one. On-prem inference changes the cost structure entirely: higher upfront capital cost, near-zero marginal cost per query. For a facility that plans to run this system for five years, the total cost of ownership calculation is completely different from the cloud version. That difference is not a preference. It is a financial constraint that determines which architecture is viable.
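The difference in cost structure is easiest to see as a break-even calculation. The sketch below compares cumulative per-query cloud spend with an upfront on-prem purchase plus a monthly running cost. Every figure in `SCENARIO` is a made-up placeholder, and the model ignores things a real total-cost-of-ownership analysis would include, such as hardware refresh, staffing, and cloud storage fees.

```python
from typing import Optional


def cloud_cost(months: int, queries_per_month: int, cost_per_query: float) -> float:
    """Cumulative cloud spend: scales linearly with usage."""
    return months * queries_per_month * cost_per_query


def onprem_cost(months: int, capex: float, opex_per_month: float) -> float:
    """Cumulative on-prem spend: upfront capital plus a flat running cost."""
    return capex + months * opex_per_month


def break_even_month(
    horizon_months: int,
    queries_per_month: int,
    cost_per_query: float,
    capex: float,
    opex_per_month: float,
) -> Optional[int]:
    """First month in which on-prem is no more expensive than cloud, or None within the horizon."""
    for month in range(1, horizon_months + 1):
        cloud = cloud_cost(month, queries_per_month, cost_per_query)
        if onprem_cost(month, capex, opex_per_month) <= cloud:
            return month
    return None


# Hypothetical figures, invented for illustration.
SCENARIO = dict(
    queries_per_month=155_000,
    cost_per_query=0.02,
    capex=60_000,
    opex_per_month=500,
)
```

Under these assumptions `break_even_month(60, **SCENARIO)` returns 24: on-prem overtakes cloud two years into a five-year horizon. Drop the volume to 10,000 queries a month and it returns `None`, because cloud never costs more than the on-prem running cost alone. The same architecture is viable or not depending on usage, which is why the volume question has to be answered first.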
None of these decisions appear in the average RAG tutorial. The tutorials start at the embedding model, because that is where the interesting technical work begins. But the constraints above determine whether the system you build can ever be deployed — and whether, once deployed, it will last.
What happens when constraints are discovered late
One version of this failure looks like this: a team builds a RAG proof of concept on cloud infrastructure, demonstrates it to stakeholders, gets approval to productionise, and then discovers that the documents it needs to index are classified under a data governance policy that prohibits sending them to an external API. The proof of concept is not a foundation to build on. It is a demonstration of something that cannot be deployed as built.
A less dramatic but more common version looks like this: a team builds a RAG system that works well in testing, deploys it, and then finds that users stop trusting it after a few weeks. The answers are technically grounded in real documents. They are just regularly several days out of date. The index refresh cadence was set to weekly because nobody thought carefully about how frequently the underlying documents change. Users who discover that the system confidently cites the old version of a procedure that was updated four days ago do not go back and ask different questions. They stop using the system.
A RAG system that users cannot trust is not a retrieval problem or a generation problem. It is a freshness contract that was never defined.
In both cases, the failure is traceable to the same root: the constraint conversation happened after the architecture was chosen rather than before it. Once the architecture is chosen, the constraints either fit or they don't — and retrofitting an architecture to constraints it was not designed for is expensive work with uncertain results.
The constraint audit — before architecture is chosen
The right time to surface these constraints is before any component is selected. Not in the design review. Not in the proof of concept. Before the whiteboard has anything on it. The questions below are not technical questions — they are business and operational questions, and they need to be answered by the people who own the documents, the infrastructure, the budget, and the user experience.