There is a version of the on-prem versus cloud conversation that I used to have a lot, and it always went the same way. Someone would propose building an AI system using a cloud API. Someone else would raise on-prem as an alternative, and the debate would immediately become a cost conversation: API costs at scale, GPU infrastructure costs, total cost of ownership models, break-even query volumes. Numbers would get traded back and forth, and the team would eventually land on whichever option had the better spreadsheet.
What that conversation almost never included was the question that should have come first: is cloud even an option?
For a manufacturing firm operating under an aerospace supply chain contract with data sovereignty clauses, the answer is no. Not "no, it's expensive." Just no — the contract prohibits sending certain document categories to third-party APIs, and that prohibition does not have a price. It is not a preference to be weighed against per-query savings. It is a legal constraint that removes cloud from the decision table before the cost model is opened.
The same is true for organisations operating under GDPR or PDPA with personal data in queries, for financial institutions where model weight ownership is a regulatory requirement, for defence supply chain contractors under ITAR, and for healthcare organisations whose patient data classification makes cloud inference a compliance violation rather than a cost item. In every one of these cases, the architecture is not chosen. It is determined. The debate about cost comes later, if it comes at all.
What sovereignty actually means in practice
Sovereignty, in the context of AI deployment, means control over three things: where your data goes, who can access your model weights, and whether your system's behaviour can be audited and demonstrated on demand. Each of these has a practical consequence for architecture, and each of them is determined by legal and contractual constraints rather than technical preference.
Data sovereignty is the most commonly understood of the three. When a user submits a query to an AI system, that query — along with any context it contains — travels to wherever the inference endpoint is hosted. For a cloud API, that means it leaves your network, passes through a third-party infrastructure provider's systems, and generates a response that is returned to you. If the query contains proprietary process documentation, personal data, or information subject to a contractual confidentiality clause, you have just sent that information outside your legal boundary. The question is not whether that is acceptable on a cost basis. The question is whether it is permitted at all.
Model weight sovereignty is less often discussed but equally important in regulated environments. The EU AI Act's high-risk provisions require operators to demonstrate weight ownership, training data lineage, and the ability to freeze and version the deployed model for audit purposes. A proprietary API satisfies none of these requirements. The weights are inaccessible. Any training or fine-tuning data you supplied went to a third-party server. The model version is controlled by the vendor, who may update it without notice. For a system that must be demonstrably compliant to a regulator, these are not inconveniences; they are disqualifying properties of the architecture.
Regulatory defensibility is not a property you add to a system after it is built. It is a property the architecture either has or does not have from the first design decision.
Audit sovereignty — the ability to explain, on demand, exactly what the system did and why — is the third dimension. In a cloud deployment, the inference process is a black box you do not control. You can log inputs and outputs. You cannot log the internal reasoning pathway, the version of the model that was active at the time of a specific decision, or the exact weights that produced a given output. For a system used in credit scoring, insurance risk assessment, or clinical decision support, the inability to reconstruct a specific inference for audit purposes is not an operational inconvenience. It is a legal exposure.
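To make the contrast concrete, here is a minimal sketch of the kind of record a self-hosted deployment can write for every inference. It is an illustration under assumptions, not the logging used in any system described here: the function names, the field names, and the JSON-lines log format are all hypothetical. The point is only that when you hold the weights file, you can tie each input and output to the exact weights and version that produced it.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def weights_digest(weights_path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the weights file, read in chunks so large files are not loaded whole."""
    digest = hashlib.sha256()
    with weights_path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_record(model_version: str, weights_sha256: str, prompt: str, output: str) -> dict:
    """Build one audit entry linking a prompt/output pair to a specific frozen model."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,    # hypothetical label for the frozen release
        "weights_sha256": weights_sha256,  # identifies the exact weights on disk
        "prompt": prompt,
        "output": output,
    }


def append_record(log_path: Path, record: dict) -> None:
    """Append the record as one JSON line to a local log file."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

Computing the digest once at model load and reusing it for every record keeps the per-query overhead small. Reconstructing an inference for an auditor then means reloading the weights whose hash matches the record and replaying the logged prompt, which also assumes decoding settings are fixed or logged.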
The three forcing functions — and what they actually force
Across the systems I have designed, the forcing functions that mandate on-prem architecture fall into three categories. They are distinct in their mechanism, but they share the same property: they are not negotiable on cost grounds. No cost model makes them go away. The architecture must accommodate them before any other design decision is made.
What these three forcing functions have in common is that they are discovered by asking the right questions at the start of a project, not by running a cost comparison. A team that opens with "cloud or on-prem?" and immediately goes to spreadsheets will miss them entirely until the architecture is already built and a legal or compliance review surfaces them. At that point, the architecture has to be rebuilt. That is the expensive version of the lesson.
When cost does become the argument
The forcing functions above establish when on-prem is mandatory. There is a separate class of situations where on-prem is not mandatory but is architecturally superior — and in those situations, cost is the legitimate argument. It is just not the only argument, and it is not the first one.
At sufficient query volume, the cost structure of on-prem inference becomes decisive. A fully-loaded total cost of ownership analysis across the projects I have worked on shows on-prem running at roughly a tenth of the cost of proprietary API inference at production scale. That is not because on-prem is cheap, but because cloud API cost scales with query volume and on-prem cost does not. The marginal cost of a query against a locally-running model is approximately zero. The marginal cost of a cloud API query is the per-token price multiplied by the number of tokens in the query, and that cost accumulates across millions of daily queries and across the full planned lifespan of the system.
For a manufacturing facility running continuous queries across a production floor for five years, this arithmetic is not close. The on-prem architecture has a higher upfront cost and a far lower total cost over the deployment horizon. But that argument only lands when the sovereignty questions have been asked and answered first, because if on-prem is mandatory regardless, the cost comparison is academic. And if on-prem is not mandatory, the cost argument needs to be made on the full TCO of both options, with the on-prem side including the engineering labour to maintain the infrastructure, not just the GPU hardware cost.
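The shape of that arithmetic can be sketched in a few lines. Everything here is an assumption made for illustration: the function names are hypothetical, the model treats cloud cost as strictly linear in tokens and on-prem cost as an upfront amount plus a fixed annual running cost (which is where engineering labour belongs), and the numbers in the usage example are placeholders rather than figures from any project described in this article.

```python
def cloud_cost(queries_per_day: float, tokens_per_query: float,
               price_per_1k_tokens: float, years: float) -> float:
    """Cloud API cost over the horizon: linear in total token volume."""
    total_tokens = queries_per_day * 365 * years * tokens_per_query
    return total_tokens / 1000 * price_per_1k_tokens


def onprem_cost(hardware_upfront: float, annual_running: float, years: float) -> float:
    """On-prem cost over the horizon: fixed, independent of query volume.

    annual_running should include engineering labour, power, and maintenance.
    """
    return hardware_upfront + annual_running * years


def break_even_queries_per_day(tokens_per_query: float, price_per_1k_tokens: float,
                               hardware_upfront: float, annual_running: float,
                               years: float) -> float:
    """Daily query volume at which the two options cost the same over the horizon."""
    cost_per_query = tokens_per_query / 1000 * price_per_1k_tokens
    fixed = onprem_cost(hardware_upfront, annual_running, years)
    return fixed / (cost_per_query * 365 * years)


if __name__ == "__main__":
    # Placeholder inputs only; substitute real quotes and salaries.
    params = dict(tokens_per_query=2000, price_per_1k_tokens=0.01, years=5)
    print(cloud_cost(queries_per_day=1_000_000, **params))
    print(onprem_cost(hardware_upfront=100_000, annual_running=80_000, years=5))
    print(break_even_queries_per_day(hardware_upfront=100_000,
                                     annual_running=80_000, **params))
```

The model deliberately ignores capacity limits: it assumes the on-prem hardware can absorb the query volume, which is what makes the marginal cost approximately zero. Above that capacity, on-prem cost steps up with each additional server.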
If the answer to any sovereignty question is yes, the architecture is on-prem regardless of cost. These questions must be asked before the cost model is opened.
If all sovereignty questions are answered no, the architecture is a genuine choice. Cost, capability, and operational complexity are the legitimate variables.
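The sequencing can be expressed as a gate that runs before any cost model. The sketch below is hypothetical throughout: the five questions are paraphrased from the examples earlier in this article and are not a complete or authoritative checklist, which in practice comes from legal and compliance review. The names `SovereigntyAnswers`, `Decision`, and `decide` are invented for the illustration.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import List, Tuple


class Decision(Enum):
    ON_PREM_MANDATORY = "on-prem, determined by constraint"
    OPEN_CHOICE = "open choice: compare cost, capability, operational complexity"


@dataclass(frozen=True)
class SovereigntyAnswers:
    # Illustrative questions only; True means "yes, this constraint applies".
    contract_prohibits_third_party_transfer: bool
    personal_data_in_queries_under_gdpr_or_pdpa: bool
    regulator_requires_weight_ownership: bool
    export_controlled_data: bool  # for example ITAR
    data_classification_forbids_cloud_inference: bool


def decide(answers: SovereigntyAnswers) -> Tuple[Decision, List[str]]:
    """Return the decision and the constraints that triggered it.

    A single yes settles the architecture; cost is never consulted here.
    """
    triggered = [name for name, yes in asdict(answers).items() if yes]
    if triggered:
        return Decision.ON_PREM_MANDATORY, triggered
    return Decision.OPEN_CHOICE, []
```

Returning the list of triggering constraints alongside the decision gives the project a written record of why cloud was removed from the table, which is useful when the question is reopened later.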
The reason this sequencing matters is practical. A team that treats the decision as primarily a cost question will often land on cloud for the right cost reasons — lower upfront investment, managed infrastructure, no GPU procurement cycle. That is a rational choice under the assumption that cloud is available. But the assumption is doing enormous work, and it is the assumption that most teams do not check explicitly.
What this looked like for VaultRAG and AlignR
For VaultRAG — the local-first RAG system for manufacturing — the sovereignty question was settled in the first conversation. The target deployment environment was a manufacturing facility in an aerospace supply chain. The documents to be indexed were ISO-controlled SOPs, equipment procedures, and maintenance records — categories subject to contractual confidentiality provisions that prohibited transmission to third-party services. Cloud RAG was not a consideration. The architecture was on-prem before any component was selected.
What that constraint determined downstream was substantial. The embedding model had to run locally — which meant a smaller model than would be used in a cloud deployment, with the quality implications that entails. The vector store had to be embedded, with persistent local storage, no external dependencies at query time. The language model had to run on commodity hardware — a single facility server, not a GPU cluster — which meant the 3B parameter Llama variant rather than a 70B model. The speech-to-text had to run locally via Whisper rather than through a cloud STT API. Every component choice flowed from the original sovereignty constraint.
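A constraint like "no external dependencies at query time" can also be checked mechanically rather than by convention. The following is a minimal sketch under assumptions, not VaultRAG's actual configuration: the `Component` type, the component names, and the endpoints are hypothetical, and treating private-range addresses as inside the boundary is itself an assumption that depends on how the facility network is laid out.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class Component:
    name: str
    endpoint: Optional[str]  # None means in-process, no network call at query time


def is_local(endpoint: Optional[str]) -> bool:
    """True if the endpoint is in-process, loopback, or a private-range IP address."""
    if endpoint is None:
        return True
    host = urlparse(endpoint).hostname
    if host is None:
        return False  # unparseable: treat as external
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name: treat as external until explicitly reviewed
    return addr.is_loopback or addr.is_private


def external_components(stack: List[Component]) -> List[str]:
    """Names of components that would send data outside the boundary."""
    return [c.name for c in stack if not is_local(c.endpoint)]


if __name__ == "__main__":
    # Hypothetical stack mirroring the component roles described above.
    stack = [
        Component("embedding_model", None),
        Component("vector_store", None),
        Component("language_model", "http://127.0.0.1:8080/v1"),
        Component("speech_to_text", None),
    ]
    assert external_components(stack) == []
```

Treating DNS names as external by default is the conservative choice: a hostname that resolves inside the network today can be repointed tomorrow, so it should pass only after someone has looked at it.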
For AlignR — the alignment pipeline for regulated enterprise AI — the forcing function was different in mechanism but identical in effect. The EU AI Act's high-risk provisions require documented weight ownership and a frozen, versioned model that can be reconstructed for audit purposes. A proprietary API cannot satisfy those requirements. The architecture is open-weight, self-hosted, with all training and inference running inside a private VPC boundary. No model weights or preference data leave the enterprise network.
In both cases the architecture was not chosen. It was determined. The sovereignty constraint was the first design decision, and every component choice that followed was downstream of it.
The cost story, in both cases, is favourable — on-prem at production scale over a multi-year deployment is significantly cheaper than cloud API equivalents at the query volumes involved. But that is a consequence of the architecture, not the reason for it. If the cost arithmetic had run the other way, the architecture would have been the same. The sovereignty constraints do not have a price.
The question that changes the conversation
Most cloud versus on-prem debates are resolved by the cost model. The better question, asked before the cost model is opened, is simpler: can the data leave the network?
Not should it. Not is it preferable that it doesn't. Can it, under the legal, contractual, and regulatory obligations that govern this organisation and this data? If the answer is no, the architecture is on-prem. If the answer is yes, the architecture is a genuine choice and cost is a legitimate primary variable.
The organisations that get this wrong are almost always the ones that assumed the answer was yes without checking. Not because they were careless — but because the question is easy to overlook when the technology conversation is moving quickly and the cloud options are genuinely good. The forcing functions are buried in contracts and regulatory frameworks that ML teams are not always close to, and they do not surface naturally in a project kickoff unless someone specifically goes and finds them.