There is a phrase that appears in almost every enterprise AI governance document: human in the loop. It is usually followed by a description of a process — a workflow step, a review queue, an approval button in a UI. The process is documented. The documentation is filed. The project proceeds. And in that sequence of events, without anyone intending it, the human oversight requirement has been converted from an architectural constraint into a procedural hope.
This article is about the difference between those two things. Not because the difference is subtle — it is not — but because the cost of confusing them is substantial, specific, and increasingly measurable. EU AI Act Article 14 does not require a documented process. It requires that a natural person be technically able to oversee, interpret, and override AI outputs. Those are different requirements. The first can be satisfied with a policy document. The second requires something in the code.
The architectural pattern that satisfies the second requirement has a name: the write-path lock. Understanding what it is, how it works, and what it costs to not have it is the substance of what follows.
What Article 14 actually requires from an architect
The EU AI Act's Article 14 is a human oversight requirement for high-risk AI systems. It is frequently misread by engineering teams as a UX requirement — add a feedback mechanism, surface model confidence, include an override button. These are not wrong, but they are insufficient. The article requires that human oversight be technically enabled in a way that is proportionate to the risks and consequences of the AI system's outputs. The word technically is load-bearing.
A feedback button is bypassable. A review queue is skippable under deadline pressure. A UI approval step can be automated away by a downstream process that someone built to "improve efficiency." None of these satisfies Article 14 for a high-risk system — which includes AI used in employment decisions, credit scoring, financial services, critical infrastructure, and the administration of justice — because all of them can be removed from the execution path without a code change, a review, or a record.
What Article 14 actually requires, translated into architectural terms, is that the system be incapable of taking a consequential action without a committed human oversight record. Not unlikely to. Not required by policy to. Incapable of. The oversight is not a step in a workflow. It is a precondition on a function. The function either has the precondition satisfied or it does not execute.
You cannot explain a decision after the fact and count that as Article 14 compliance. The explanation is the precondition, not the postscript. And the oversight record is the lock, not the log.
This framing has a second implication that is equally important. The explainability record (the SHAP values, the feature importance scores, or whatever method the system uses to make its reasoning legible) must also be a precondition, not a postscript. The human reviewer cannot exercise meaningful oversight of a decision they cannot understand. Writing the explanation after the approval has been granted is not compliance. It is paperwork. The SHAP record must be written to an immutable store before the human-in-the-loop (HITL) checkpoint is created, so that the human reviewing the decision has access to the reasoning that produced it.
Process compliance versus structural compliance
The distinction between process compliance and structural compliance is the central architectural question in AI governance. It determines whether a system is actually governed or merely documented as governed. The difference does not show up in demos. It shows up in regulatory examinations, post-mortems, and the quiet accumulation of non-compliant decisions that nobody noticed until someone looked.
Consider how process compliance fails in practice. The compliance document states that all AI-assisted financial decisions above a certain threshold require human review. A review queue exists. The queue is monitored. For the first six months, every decision is reviewed. Then the business unit reports that the queue is causing a three-day delay in payment processing. A workaround is implemented: the queue threshold is raised. Then raised again. Eighteen months after deployment, a regulatory examination finds that 40% of decisions in a category covered by Annex III have no human oversight record. There was no breach. There was no malicious intent. There was a process that eroded under operational pressure, and an architecture that made that erosion possible.
Structural compliance makes this failure mode impossible. Not unlikely. Impossible. The payment processing function requires a hitl_id parameter. The parameter can only be supplied by the approval service. The approval service only issues an ID after a human reviewer has committed a decision. The chain is unbroken. There is no threshold to raise. There is no queue to bypass. If the queue is causing delay, the delay is visible, measured, and must be addressed by adding reviewers or renegotiating the autonomy boundary — not by removing the requirement.
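A minimal sketch of that chain, under stated assumptions, might look like the following. The names `ApprovalService`, `HitlRecord`, `OversightRequired`, and `process_payment` are hypothetical, the in-memory dictionary stands in for a durable store, and the HMAC signature is one possible way to make a `hitl_id` unforgeable outside the approval service, not something the article prescribes.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass


class OversightRequired(Exception):
    """Raised when a write is attempted without a committed, approving human decision."""


@dataclass(frozen=True)
class HitlRecord:
    hitl_id: str
    decision_id: str
    reviewer: str
    outcome: str  # "approved" or "rejected"


class ApprovalService:
    """Hypothetical approval service: the only component able to mint a hitl_id."""

    def __init__(self):
        self._key = secrets.token_bytes(32)  # signing key never leaves this service
        self._records = {}                   # stand-in for a durable, append-only store

    def _sign(self, decision_id, nonce):
        message = f"{decision_id}:{nonce}".encode()
        return hmac.new(self._key, message, hashlib.sha256).hexdigest()

    def commit_decision(self, decision_id, reviewer, outcome):
        """Called only after a human reviewer has committed a decision."""
        nonce = secrets.token_hex(8)
        hitl_id = f"{nonce}.{self._sign(decision_id, nonce)}"
        self._records[hitl_id] = HitlRecord(hitl_id, decision_id, reviewer, outcome)
        return hitl_id

    def verify(self, hitl_id, decision_id):
        """Return the record, or raise if the ID is forged, mismatched, or a rejection."""
        nonce, _, signature = hitl_id.partition(".")
        expected = self._sign(decision_id, nonce)
        record = self._records.get(hitl_id)
        if record is None or not hmac.compare_digest(signature.encode(), expected.encode()):
            raise OversightRequired(f"no committed oversight record for {decision_id}")
        if record.outcome != "approved":
            raise OversightRequired(f"reviewer rejected {decision_id}")
        return record


def process_payment(decision_id, amount, hitl_id, approvals):
    """The write path: verification runs before anything is written."""
    record = approvals.verify(hitl_id, decision_id)
    return {"decision_id": decision_id, "amount": amount, "approved_by": record.reviewer}
```

In this sketch a `hitl_id` issued for one decision cannot be replayed against another, because the signature binds the ID to the `decision_id`. There is no configuration value that relaxes the check; removing it requires changing `process_payment` itself, which is a reviewable code change.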
The HITL as a state machine node, not a UI feature
The most precise way to understand what structural compliance requires is to think about where the HITL lives in the execution graph. In a process-compliance architecture, the HITL lives in the UI: a button, a queue, a page that a human navigates to. In a structural-compliance architecture, the HITL lives in the agent state machine: it is a node on a conditional edge, and the condition that determines whether the edge fires is not a UI event but a computed confidence score compared against a calibrated per-category threshold.
This matters for a reason that is easy to miss. A UI feature can be bypassed by anything that does not go through the UI. A downstream process, a batch job, or an API call that skips the front end can each circumvent a UI-layer HITL without the oversight system registering that anything unusual has happened. A state machine node cannot be bypassed in the same way, provided the write operation is reachable only through the graph. The graph topology has no edge that goes around the node, so the only path from the decision to the write is through it.
The diagram below shows the LangGraph state machine topology for a single agent module — the kind that handles, say, a revenue recognition classification or a procurement exception triage. The HITL is a first-class interrupt node on the conditional edge between inference and action. The condition is not "the user clicked approve." The condition is "a human reviewer committed a decision record that satisfies the Article 14 oversight requirement." Those are different conditions with different security properties.
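The same topology can be sketched in plain Python. This is an illustration of the idea only: it does not use LangGraph's API, and the `State` fields, node names, and `run` helper are hypothetical. The point it tries to show is that the transition out of the `hitl` node depends on a committed record in the state, not on a UI event. Handling of reviewer rejections is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class State:
    category: str
    confidence: float                  # assumed to be a calibrated probability
    threshold: float                   # per-category threshold
    hitl_record: Optional[str] = None  # set only when a reviewer commits a decision
    written: bool = False


def infer(state: State) -> str:
    # Conditional edge: the next node is computed from the state.
    return "write" if state.confidence >= state.threshold else "hitl"


def hitl(state: State) -> str:
    # Interrupt node: execution stops here until a committed record exists.
    return "write" if state.hitl_record is not None else "PAUSED"


def write(state: State) -> str:
    state.written = True
    return "END"


NODES: Dict[str, Callable[[State], str]] = {"infer": infer, "hitl": hitl, "write": write}


def run(state: State, resume: bool = False) -> str:
    """Walk the graph. A resumed run re-enters at the hitl node, never at write."""
    node = "hitl" if resume else "infer"
    while node in NODES:
        node = NODES[node](state)
    return node  # "END" or "PAUSED"


pending = State(category="revenue_classification", confidence=0.80, threshold=0.92)
status = run(pending)                   # pauses at the hitl node, nothing is written
pending.hitl_record = "hitl-001"        # hypothetical ID of a committed review
status = run(pending, resume=True)      # now proceeds through hitl to write
```

Resuming without a committed record simply pauses again, because the graph offers no entry point that lands directly on `write`.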
The calibration problem: one threshold is not a compliance posture
There is a subtlety in the confidence threshold that matters more than most teams realise when they first implement a HITL system. A single global threshold (say, routing everything below 0.85 confidence to human review) is not a compliance posture. It is a starting point that will, over time, produce two failures. Over-interception routes decisions that a competent agent handles correctly to a human, adding unnecessary friction and cost. Under-interception lets the agent proceed autonomously on decision categories whose epistemic risk warrants a stricter cutoff than 0.85.
EU AI Act Article 14 requires that the oversight mechanism be proportionate to the risks and consequences of the system's outputs. Proportionality means the threshold must be calibrated per decision category, not set globally. A query that influences a career-relevant outcome, such as a hiring recommendation or a performance assessment, carries different epistemic risk than a query that retrieves a contract clause for a standard transaction. These should not share a threshold. The first requires a higher bar; the second can proceed with a lower one without compromising the proportionality requirement.
The confidence score itself also requires attention. Raw model scores, whether cosine similarities, classifier margins, or anomaly scores, are not probabilities. A raw score of 0.87 does not mean the model is 87% likely to be correct. Platt scaling, a post-hoc calibration technique that fits a logistic function mapping raw scores to probabilities on held-out labelled data, or an equivalent calibration step, is a prerequisite for using confidence scores as HITL threshold inputs. Without calibration, the threshold is arbitrary in a way that cannot be defended in a regulatory examination.
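As an illustration of what Platt scaling does, here is a small standard-library sketch that fits p = sigmoid(a·score + b) by gradient descent on log loss. The data is invented for the example, the function names are hypothetical, and Platt's original target smoothing is left out. In practice a team would more likely rely on an established implementation, such as the sigmoid method of scikit-learn's `CalibratedClassifierCV`, fitted on a held-out set.

```python
import math
from typing import Sequence, Tuple


def fit_platt(scores: Sequence[float], labels: Sequence[int],
              lr: float = 0.5, epochs: int = 2000) -> Tuple[float, float]:
    """Fit p = sigmoid(a * score + b) by batch gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s   # d(log loss)/da for this sample
            grad_b += (p - y)       # d(log loss)/db for this sample
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score: float, a: float, b: float) -> float:
    """Map a raw score to a calibrated probability using fitted parameters."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Invented held-out data: raw scores and whether the output was actually correct.
raw_scores = [0.95, 0.90, 0.87, 0.85, 0.80, 0.70, 0.60, 0.50]
was_correct = [1, 1, 0, 1, 1, 0, 0, 0]

a, b = fit_platt(raw_scores, was_correct)
calibrated = calibrate(0.87, a, b)  # the value to compare against the HITL threshold
```

The threshold comparison then happens on `calibrated`, not on the raw score, so a cutoff such as 0.85 carries the meaning the compliance document says it does.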
| Decision category | Trigger type | Confidence metric | Threshold | Rationale |
|---|---|---|---|---|
| Revenue classification (ASC 606 multi-element arrangement) | Confidence-triggered | XGBoost probability × SHAP completeness score, Platt-scaled | 0.92 | Directly affects GL entry and reported revenue. Highest epistemic risk. EU AI Act Annex III high-risk classification applies. |
| Contract risk flag (clause anomaly detection) | Confidence-triggered | Composite: Document AI confidence × XGBoost risk score, Platt-scaled to [0,1] | 0.88 | Legal consequence if a genuine risk clause is missed. Lower than revenue classification but above standard retrieval given downstream consequence. |
| Payment exception triage (AP anomaly routing) | Confidence-triggered | Isolation Forest anomaly score, calibrated to probability | 0.85 | Financial control requirement. Dual-reviewer HITL above a second value threshold regardless of confidence score. |
| Standard document retrieval (policy, contract, procedure lookup) | Confidence-triggered | Retrieval recall@10, no synthesis component | 0.75 | Factual retrieval with no inferential synthesis step. Lower threshold appropriate; HITL reserved for genuine knowledge gaps. |
| Cross-jurisdiction data access (geo-exclusion OPA policy path) | Policy-triggered | Not confidence-based: OPA policy evaluation | Always HITL | Data sovereignty obligations. No confidence score can override a policy-layer intercept. OPA policy rule fires unconditionally. |
| GDPR erasure request (data subject rights, cascade scope approval) | Policy-triggered | Not confidence-based: lifecycle event trigger | Always HITL | GDPR Art. 17 retention review obligation. DPO must approve cascade scope before any deletion. SLA: 72h (GDPR Art. 12(3) itself allows up to one month). |
Two intercept types in the table above deserve particular attention because they illustrate a principle that confidence-threshold frameworks miss entirely: some decisions must always be reviewed by a human, regardless of how confident the model is. Cross-jurisdiction data access and GDPR erasure requests are not routed to HITL because the model might be wrong. They are routed to HITL because the consequence of proceeding without human review is a category of harm — data sovereignty violation, unlawful erasure — that no model confidence level can authorise. The threshold for these categories is not a calibrated number. It is a hard requirement expressed as a policy rule that fires unconditionally.
This distinction — between confidence-triggered and policy-triggered HITL — is the one that compliance officers understand and engineering teams sometimes miss. The confidence-triggered HITL manages epistemic uncertainty. The policy-triggered HITL manages categorical risk. Both are necessary. Neither can substitute for the other.
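The two trigger types can be combined in a single routing function. The sketch below uses the thresholds from the table above; the category keys, the `Routing` type, and the `route` function are hypothetical names, and the fail-closed branch for unknown categories is an assumption added for the example. The dual-reviewer rule for high-value payment exceptions is not modelled.

```python
from dataclasses import dataclass
from typing import Optional

# Calibrated-confidence thresholds per decision category (values from the table).
CONFIDENCE_THRESHOLDS = {
    "revenue_classification": 0.92,
    "contract_risk_flag": 0.88,
    "payment_exception_triage": 0.85,
    "standard_document_retrieval": 0.75,
}

# Categories that always go to a human, whatever confidence the model reports.
POLICY_TRIGGERED = {"cross_jurisdiction_data_access", "gdpr_erasure_request"}


@dataclass(frozen=True)
class Routing:
    needs_hitl: bool
    trigger: str  # "policy", "confidence", "unknown", or "none"


def route(category: str, calibrated_confidence: Optional[float]) -> Routing:
    # The policy check runs first so that no confidence value can override it.
    if category in POLICY_TRIGGERED:
        return Routing(True, "policy")
    threshold = CONFIDENCE_THRESHOLDS.get(category)
    if threshold is None or calibrated_confidence is None:
        # Fail closed: an unmapped category or a missing score goes to a human.
        return Routing(True, "unknown")
    if calibrated_confidence < threshold:
        return Routing(True, "confidence")
    return Routing(False, "none")


decision = route("gdpr_erasure_request", 0.99)  # routed to HITL by policy
```

Evaluating the policy set before the threshold lookup is the design choice that matters here: the confidence-triggered path manages epistemic uncertainty, and it is never consulted for a category whose risk is categorical.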
The SHAP record as a precondition, not a dashboard
The explainability requirement in Article 14 — that the human reviewer be capable of understanding and evaluating the AI's output — has an architectural implication that is not obvious from the text of the regulation. The explanation must be available to the reviewer at the moment of review. Not reconstructable after the fact. Not generatable on demand. Available, in a committed record, when the reviewer opens the HITL queue.
This matters because model outputs are not deterministic across calls. If the SHAP explanation is generated when the reviewer requests it, rather than when the model made the decision, there is no guarantee that the explanation corresponds to the decision being reviewed. The model may have been updated. The feature values may have changed. The retrieved context may differ. The explanation that the reviewer sees may be accurate as of the moment it was generated, but not accurate as of the moment the decision was made.
The correct pattern is to write the SHAP record to BigQuery as an atomic, timestamped, append-only operation before the HITL node is created. The HITL queue entry contains a reference to the SHAP record, not a generation request. The reviewer sees exactly the explanation that corresponds to the decision they are reviewing, and the same immutable record remains retrievable for audit purposes at any point in the future.
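The ordering can be made structural rather than conventional. In the sketch below, `AppendOnlyStore` is an in-memory stand-in for an append-only table and is not the BigQuery client; `HitlQueueEntry` and `open_checkpoint` are hypothetical names. A queue entry can only be built from the ID returned by a completed write, so an entry without a committed explanation cannot exist.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Tuple


class AppendOnlyStore:
    """In-memory stand-in for an append-only table. Exposes no update or delete."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, str]] = []

    def append(self, payload: dict) -> str:
        # Serialise once; the stored string is what the reviewer will later see.
        body = json.dumps(payload, sort_keys=True)
        record_id = hashlib.sha256(body.encode()).hexdigest()
        self._rows.append((record_id, body))
        return record_id

    def get(self, record_id: str) -> dict:
        for rid, body in self._rows:
            if rid == record_id:
                return json.loads(body)
        raise KeyError(record_id)


@dataclass(frozen=True)
class HitlQueueEntry:
    decision_id: str
    shap_record_id: str  # a reference to a committed record, not a generation request


def open_checkpoint(store: AppendOnlyStore, decision_id: str, model_version: str,
                    shap_values: Dict[str, float]) -> HitlQueueEntry:
    """Write the explanation first; only then create the HITL queue entry."""
    record_id = store.append({
        "decision_id": decision_id,
        "model_version": model_version,
        "shap_values": shap_values,
        "written_at": datetime.now(timezone.utc).isoformat(),
    })
    return HitlQueueEntry(decision_id=decision_id, shap_record_id=record_id)
```

Because the record is serialised at write time, later changes to the model, the features, or the retrieved context do not alter what the reviewer is shown for this decision.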