On-prem AI is not a cost decision. It's a sovereignty decision.

There is a version of the on-prem versus cloud conversation that I used to have a lot, and it always went the same way. Someone would propose building an AI system using a cloud API. Someone else would raise on-prem as an alternative, and the debate would immediately become a cost conversation — API costs at scale, GPU infrastructure costs, total cost of ownership models, break-even query volumes. The right numbers would get traded back and forth, and the team would eventually land on whichever option had the better spreadsheet.

What that conversation almost never included was the question that should have come first: is cloud even an option?

For a manufacturing firm operating under an aerospace supply chain contract with data sovereignty clauses, the answer is no. Not "no, it's expensive." Just no — the contract prohibits sending certain document categories to third-party APIs, and that prohibition does not have a price. It is not a preference to be weighed against per-query savings. It is a legal constraint that removes cloud from the decision table before the cost model is opened.

The same is true for organisations operating under GDPR or PDPA with personal data in queries, for financial institutions where model weight ownership is a regulatory requirement, for defence supply chain contractors under ITAR, and for healthcare organisations whose patient data classification makes cloud inference a compliance violation rather than a cost item. In every one of these cases, the architecture is not chosen. It is determined. The debate about cost comes later, if it comes at all.

What sovereignty actually means in practice

Sovereignty, in the context of AI deployment, means control over three things: where your data goes, who can access your model weights, and whether your system's behaviour can be audited and demonstrated on demand. Each of these has a practical consequence for architecture, and each of them is determined by legal and contractual constraints rather than technical preference.

Data sovereignty is the most commonly understood of the three. When a user submits a query to an AI system, that query — along with any context it contains — travels to wherever the inference endpoint is hosted. For a cloud API, that means it leaves your network, passes through a third-party infrastructure provider's systems, and generates a response that is returned to you. If the query contains proprietary process documentation, personal data, or information subject to a contractual confidentiality clause, you have just sent that information outside your legal boundary. The question is not whether that is acceptable on a cost basis. The question is whether it is permitted at all.

Model weight sovereignty is less often discussed but equally important in regulated environments. The EU AI Act's high-risk provisions require operators to demonstrate weight ownership, training data lineage, and the ability to freeze and version the deployed model for audit purposes. A proprietary API satisfies none of these requirements. The weights are inaccessible. The training data lineage is not yours to document, and any fine-tuning data you supply goes to a third-party server. The model version is controlled by the vendor, who may update it without notice. For a system that must be demonstrably compliant to a regulator, these are not inconveniences. They are disqualifying properties of the architecture.

Regulatory defensibility is not a property you add to a system after it is built. It is a property the architecture either has or does not have from the first design decision.

Audit sovereignty — the ability to explain, on demand, exactly what the system did and why — is the third dimension. In a cloud deployment, the inference process is a black box you do not control. You can log inputs and outputs. You cannot log the internal reasoning pathway, the version of the model that was active at the time of a specific decision, or the exact weights that produced a given output. For a system used in credit scoring, insurance risk assessment, or clinical decision support, the inability to reconstruct a specific inference for audit purposes is not an operational inconvenience. It is a legal exposure.
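To make the audit gap concrete: when the weights sit on hardware you control, each inference can be logged next to a fingerprint of the exact weights file that produced it. What follows is a minimal sketch under stated assumptions, not the logging used in any system described here. The function names, the record fields, and the JSON-lines log format are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def fingerprint_weights(weights_path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a weights file, read in chunks."""
    digest = hashlib.sha256()
    with weights_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_audit_record(prompt: str, output: str, model_version: str,
                       weights_sha256: str) -> dict:
    """Assemble one audit entry tying an inference to a frozen model."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,    # hypothetical version label
        "weights_sha256": weights_sha256,  # identifies the exact weights file
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    }


def append_audit_record(log_path: Path, record: dict) -> None:
    """Append the record as one JSON line to a local audit log."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```

The record stores hashes of the prompt and output rather than the text itself, which is one possible design choice when the log must not duplicate sensitive content. Reconstructing a specific inference would also need the decoding parameters and the retained inputs, which this sketch leaves out.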

The four forcing functions, and what they actually force

Across the systems I have designed, the forcing functions that mandate on-prem architecture fall into four categories. They are distinct in their mechanism, but they share the same property: they are not negotiable on cost grounds. No cost model makes them go away. The architecture must accommodate them before any other design decision is made.

The four sovereignty forcing functions: what they are and what they eliminate

Contractual data classification

Supply chain contracts in aerospace, automotive, and defence routinely contain clauses that prohibit transmitting specified document categories — manufacturing procedures, process parameters, quality specifications — to third-party systems. These are not security preferences. They are contractual obligations with defined consequences for breach. When the documents that need to be indexed and queried fall into a prohibited category, cloud RAG services are not an option to be weighed. They are contractually excluded. The architecture is on-prem by definition, and the only design questions are about which on-prem components are viable given the available hardware and network constraints.

Regulatory data residency

GDPR Article 44 restricts transfers of personal data to third countries without adequate protections. Singapore's PDPA Section 26 imposes similar restrictions on cross-border data transfers. When user queries contain personal data (names, account numbers, health identifiers, employment records), sending those queries to a cloud inference endpoint in a non-compliant jurisdiction is a regulatory violation. Choosing a cloud provider with a local data centre does not necessarily resolve this, because the data may still pass through that provider's wider infrastructure for processing, and you cannot independently verify that it does not. The only architecture that guarantees data residency with certainty is one where inference runs on hardware you control, in a network boundary you control, with data that never leaves it. On-prem is not the conservative choice in this context. It is the compliant one.

Model weight ownership for high-risk AI

The EU AI Act classifies AI systems in credit scoring, insurance, healthcare decision support, and employment management as high-risk. High-risk systems require documented evidence of five obligations: data governance, technical documentation, transparency, human oversight, and accuracy verification. Every one of these requires access to the model weights — to demonstrate that the training data was appropriate, that the model version was frozen at a specific point, that the inference process can be reconstructed and explained. A proprietary API satisfies none of these requirements because the weights are inaccessible and vendor-controlled. The obligation is not to use a good AI system. It is to use one whose behaviour can be demonstrated on demand to a regulator. That property only exists in an architecture where you own and control the weights.

Air-gap network requirements

Certain operational environments are physically isolated from external networks — not as a security preference but as an engineering requirement or a contractual mandate. A factory floor with classified production processes, a military logistics system, a nuclear facility's maintenance infrastructure. In these environments, cloud inference is not expensive or slow. It is physically impossible. The architecture must operate without any external network dependency, at inference time and at index update time. Every component — the embedding model, the vector store, the language model, the speech interface — must run locally. This constraint is binary: either the system runs entirely on-prem, or it does not run at all.

What these four forcing functions have in common is that they are discovered by asking the right questions at the start of a project — not by running a cost comparison. A team that opens with "cloud or on-prem?" and immediately goes to spreadsheets will miss them entirely until the architecture is already built and a legal or compliance review surfaces them. At that point, the architecture has to be rebuilt. That is the expensive version of the lesson.

When cost does become the argument

The forcing functions above establish when on-prem is mandatory. There is a separate class of situations where on-prem is not mandatory but is architecturally superior — and in those situations, cost is the legitimate argument. It is just not the only argument, and it is not the first one.

At sufficient query volume, the cost structure of on-prem inference becomes decisive. A fully-loaded total cost of ownership analysis across the projects I have worked on shows on-prem running at roughly a tenth of the cost of proprietary API inference at production scale. That is not because on-prem is cheap, but because cloud API cost scales with query volume and on-prem cost, within the capacity of the installed hardware, does not. The marginal cost of a query against a locally-running model is approximately zero. The marginal cost of a cloud API query is the per-token price multiplied by the number of tokens in the query and its response, accumulated across millions of daily queries and across the full planned lifespan of the system.

For a manufacturing facility running continuous queries across a production floor for five years, this arithmetic is not close. The on-prem architecture has a higher upfront cost and a far lower total cost over the deployment horizon. But that argument only lands when the sovereignty questions have been asked and answered first, because if on-prem is mandatory regardless, the cost comparison is academic. And if on-prem is not mandatory, the comparison against cloud has to use the full TCO of on-prem, including the engineering labour to maintain the infrastructure, not just the GPU hardware cost.
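The shape of that arithmetic can be sketched in a few lines. This is an illustrative model only: the class names, the cost categories, and every number in the usage example are hypothetical placeholders, not figures from the projects discussed here. It also assumes the installed hardware has enough capacity for the query volume, so on-prem cost stays flat as volume grows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudCosts:
    price_per_1k_tokens: float  # hypothetical blended per-token price
    tokens_per_query: int       # prompt plus response tokens


@dataclass(frozen=True)
class OnPremCosts:
    upfront_hardware: float     # servers, GPUs, installation
    annual_operations: float    # power, maintenance, engineering labour


def cost_per_cloud_query(costs: CloudCosts) -> float:
    """Per-token price multiplied by the tokens in one query."""
    return costs.price_per_1k_tokens * costs.tokens_per_query / 1000


def cloud_total(costs: CloudCosts, queries_per_day: float, years: float) -> float:
    """Cloud cost grows linearly with query volume and deployment length."""
    return cost_per_cloud_query(costs) * queries_per_day * 365 * years


def on_prem_total(costs: OnPremCosts, years: float) -> float:
    """On-prem cost is upfront plus recurring, independent of query volume."""
    return costs.upfront_hardware + costs.annual_operations * years


def break_even_queries_per_day(cloud: CloudCosts, on_prem: OnPremCosts,
                               years: float) -> float:
    """Daily volume at which both options cost the same over the horizon."""
    return on_prem_total(on_prem, years) / (cost_per_cloud_query(cloud) * 365 * years)


# Usage with made-up numbers: above the break-even volume, on-prem is cheaper.
cloud = CloudCosts(price_per_1k_tokens=0.01, tokens_per_query=2000)
on_prem = OnPremCosts(upfront_hardware=100_000.0, annual_operations=50_000.0)
print(round(break_even_queries_per_day(cloud, on_prem, years=5)))
```

Putting engineering labour inside annual_operations is deliberate: leaving it out is the most common way an on-prem estimate ends up looking better than it is.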

Cloud vs. on-prem — the real decision framework, in order

Start here — sovereignty questions

These are binary. They determine what is possible.

If any of these answers is yes, the architecture is on-prem regardless of cost. These questions must be asked before the cost model is opened.

Do contractual clauses prohibit sending these documents to a third party?

Does regulatory data residency apply to the queries this system will process?

Does the deployment context require model weight ownership for audit or compliance?

Is the network environment air-gapped or otherwise isolated from external APIs?

Only then — cost and capability questions

These are trade-offs. They determine what is optimal.

If all sovereignty questions are answered no, the architecture is a genuine choice. Cost, capability, and operational complexity are the legitimate variables.

What is the projected query volume over the deployment lifespan?

What is the fully-loaded TCO of on-prem including engineering labour?

Does the team have the infrastructure capability to maintain on-prem serving?

What latency budget does the application require, and can local hardware meet it?

The reason this sequencing matters is practical. A team that treats the decision as primarily a cost question will often land on cloud for the right cost reasons — lower upfront investment, managed infrastructure, no GPU procurement cycle. That is a rational choice under the assumption that cloud is available. But the assumption is doing enormous work, and it is the assumption that most teams do not check explicitly.
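The ordering of the framework above can be written as a gate that runs before any cost input is read. The sketch below is one way to express it, not a prescribed implementation; the type names and field names are hypothetical, chosen to mirror the four sovereignty questions.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Outcome(Enum):
    ON_PREM_REQUIRED = "on-prem required"
    OPEN_CHOICE = "open choice"


@dataclass(frozen=True)
class SovereigntyAnswers:
    """Yes/no answers to the four sovereignty questions."""
    contract_prohibits_third_party: bool
    data_residency_applies: bool
    weight_ownership_required: bool
    network_air_gapped: bool


def decide(answers: SovereigntyAnswers) -> tuple[Outcome, list[str]]:
    """Apply the sovereignty gate before any cost question is considered."""
    triggered = [f.name for f in fields(answers) if getattr(answers, f.name)]
    if triggered:
        # Any single yes removes cloud from the table; cost is not consulted.
        return Outcome.ON_PREM_REQUIRED, triggered
    # All answers are no: cost, capability, and latency become the variables.
    return Outcome.OPEN_CHOICE, []


outcome, reasons = decide(SovereigntyAnswers(
    contract_prohibits_third_party=True,
    data_residency_applies=False,
    weight_ownership_required=False,
    network_air_gapped=False,
))
print(outcome.value, reasons)
```

The function takes no cost parameters at all. That is the point of the sequencing: the cost model cannot influence a decision it is never shown.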

What this looked like for VaultRAG and AlignR

For VaultRAG — the local-first RAG system for manufacturing — the sovereignty question was settled in the first conversation. The target deployment environment was a manufacturing facility in an aerospace supply chain. The documents to be indexed were ISO-controlled SOPs, equipment procedures, and maintenance records — categories subject to contractual confidentiality provisions that prohibited transmission to third-party services. Cloud RAG was not a consideration. The architecture was on-prem before any component was selected.

What that constraint determined downstream was substantial. The embedding model had to run locally — which meant a smaller model than would be used in a cloud deployment, with the quality implications that entails. The vector store had to be embedded, with persistent local storage, no external dependencies at query time. The language model had to run on commodity hardware — a single facility server, not a GPU cluster — which meant the 3B parameter Llama variant rather than a 70B model. The speech-to-text had to run locally via Whisper rather than through a cloud STT API. Every component choice flowed from the original sovereignty constraint.

For AlignR — the alignment pipeline for regulated enterprise AI — the forcing function was different in mechanism but identical in effect. The EU AI Act's high-risk provisions require documented weight ownership and a frozen, versioned model that can be reconstructed for audit purposes. A proprietary API cannot satisfy those requirements. The architecture is open-weight, self-hosted, with all training and inference running inside a private VPC boundary. No model weights or preference data leave the enterprise network.

In both cases the architecture was not chosen. It was determined. The sovereignty constraint was the first design decision, and every component choice that followed was downstream of it.

The cost story, in both cases, is favourable — on-prem at production scale over a multi-year deployment is significantly cheaper than cloud API equivalents at the query volumes involved. But that is a consequence of the architecture, not the reason for it. If the cost arithmetic had run the other way, the architecture would have been the same. The sovereignty constraints do not have a price.

The question that changes the conversation

Most cloud versus on-prem debates are resolved by the cost model. The better question, asked before the cost model is opened, is simpler: can the data leave the network?

Not should it. Not is it preferable that it doesn't. Can it, under the legal, contractual, and regulatory obligations that govern this organisation and this data? If the answer is no, the architecture is on-prem. If the answer is yes, the architecture is a genuine choice and cost is a legitimate primary variable.

The organisations that get this wrong are almost always the ones that assumed the answer was yes without checking. Not because they were careless — but because the question is easy to overlook when the technology conversation is moving quickly and the cloud options are genuinely good. The forcing functions are buried in contracts and regulatory frameworks that ML teams are not always close to, and they do not surface naturally in a project kickoff unless someone specifically goes and finds them.

What is the data classification of the documents and queries this system will process? Not a general description — the specific classification used by your legal or compliance team, with reference to the policies and contracts that define it. If the classification includes categories subject to confidentiality provisions, data residency requirements, or cross-border transfer restrictions, those provisions define the network boundary the architecture must operate within.

Does any governing contract prohibit sending these data categories to a third-party service? This is a legal question, and it requires a legal answer — not an inference by the ML team. Supply chain contracts, NDAs, and data processing agreements are the primary sources. The answer is binary. If it is yes, the architecture is on-prem regardless of the cost model.

Does the regulatory framework governing this deployment require model weight ownership or the ability to reconstruct a specific inference for audit? EU AI Act high-risk provisions, financial services model risk management frameworks, and clinical AI governance requirements all impose audit obligations that proprietary APIs cannot satisfy. If the deployment falls within scope of any of these frameworks, the architecture must support weight-level auditability.

What is the network environment the system must operate in? Is external network access available at inference time? At index update time? At all? The answer to this question determines not just cloud versus on-prem, but which components can have external dependencies and which must be entirely self-contained. An air-gapped environment eliminates not just cloud APIs but any component that phones home for updates, licensing checks, or telemetry.
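A component inventory makes the network question checkable rather than assumed. The following is a minimal sketch: the Component type, the phase labels, and the example inventory are hypothetical, and the dependency entries are placeholders rather than statements about any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    # Phases in which this component reaches outside the network boundary.
    external_phases: frozenset = frozenset()


def air_gap_violations(components, permitted_phases=frozenset()):
    """Return {component name: sorted phases} for every disallowed external call."""
    violations = {}
    for component in components:
        blocked = component.external_phases - frozenset(permitted_phases)
        if blocked:
            violations[component.name] = sorted(blocked)
    return violations


# Hypothetical inventory; the dependency entries are illustrative only.
inventory = [
    Component("embedding_model"),
    Component("vector_store", frozenset({"telemetry"})),
    Component("language_model", frozenset({"weight_download"})),
    Component("speech_to_text", frozenset({"licence_check", "telemetry"})),
]

for name, phases in air_gap_violations(inventory).items():
    print(f"{name}: needs external access for {', '.join(phases)}")
```

In a fully air-gapped environment permitted_phases stays empty, so any non-empty result means a component has to be replaced or reconfigured. An isolated but not fully air-gapped site might, as an assumption, permit a phase such as weight_download during a controlled maintenance window.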

The cost model comes after all four questions. If the answers leave cloud as a viable option, the cost model is the right tool for the final decision. If they don't, the cost model is irrelevant — and building it first, before the sovereignty questions are asked, is how teams end up redesigning a system that was already built.

On-prem AI is not a conservative choice or a cost optimisation. In the environments where it is the right architecture, it is the only architecture. The question is not whether to build on-prem. The question is whether you discovered that early enough to design for it from the start.

References & Further Reading

European Parliament and Council. (2024). Regulation (EU) 2024/1689 — Artificial Intelligence Act. Official Journal of the European Union. — The high-risk classification and conformity assessment obligations that make model weight ownership architecturally mandatory for certain deployment contexts. Articles 9–17 are the operative provisions.

Personal Data Protection Commission (Singapore). (2012, amended 2020). Personal Data Protection Act 2012. Section 26: Transfer Limitation Obligation. — The cross-border data transfer restrictions that create data residency requirements for systems processing personal data of Singapore residents.

U.S. Department of State. (2022). International Traffic in Arms Regulations (ITAR), 22 CFR Parts 120–130. — The export control framework that governs technical data in defence supply chains. The relevant constraint for on-prem AI is the prohibition on transferring controlled technical data to foreign nationals or foreign cloud infrastructure without a licence.

Nygard, M. (2011). Documenting Architecture Decisions. thinkrelevance.com. — The Architecture Decision Record format used across the projects referenced here. When the forcing function for an architectural decision is a legal constraint rather than a technical preference, documenting it as an ADR is the mechanism that makes the constraint traceable and auditable over the system's lifespan.

