The €40M problem: why the most expensive engineering failures are invisible until they're not

There is a category of engineering failure that never appears in incident reports. It generates no alerts. It produces no error logs. It does not manifest as a system outage or a degraded service. It manifests as a line in the annual accounts: a reserve against warranty claims that is larger than it should be, held with greater uncertainty than it needs to be, because the engineering organisation cannot predict something that is, in principle, entirely predictable.

The specific version of this failure I want to examine is a €40M annual warranty over-reserve held by a mid-market medical imaging company against unplanned failures in a fleet of 12,000 MRI and CT units deployed across 34 countries. The reserve is not the product of unusual equipment unreliability or poor engineering. It is the product of an architectural gap: the absence of a unified telemetry pipeline that would make the failure prediction problem solvable. The units generate data. The data exists. It is just sitting in six different regional systems with six different schemas, inaccessible to any model that could act on it.

This problem is interesting to me not because €40M is an unusually large number — for a €1.2B revenue company running a complex deployed fleet, it is actually a fairly contained overrun — but because it is representative of a pattern I have seen across multiple industries. The most expensive engineering failures are usually not system failures. They are architectural gaps: missing components in a data or intelligence stack that would be straightforward to build, if the decision had been made to build them. The cost is not the cost of a catastrophic event. It is the accumulated cost of not being able to predict, and therefore prevent, a class of events that happens continuously and quietly across a large deployed base.

The diagnosis before the design: why the reserve exists

The first step in fixing an architectural gap is being precise about why it exists. "We don't have predictive maintenance" is a symptom, not a cause. The question an architect asks is: what is the structural reason that predictive maintenance is not possible here, and what would need to change for it to become possible?

In this case, the structural reason is the absence of a unified telemetry schema across the fleet. Each MRI unit generates DICOM service events — standardised medical imaging protocol messages that include operational telemetry alongside clinical data. These events are valuable. They contain operating temperature readings, gradient coil usage counts, RF amplifier power measurements, cryogen top-up records, and error codes that correlate with future failure modes. The data is rich and the fleet is large enough that a regression model trained on it would have substantial statistical power.

But the events land in six different regional service management systems (one each for Central Europe, Southern Europe, APAC, North America, the Middle East, and the rest of the world), each with its own field naming conventions, its own schema for error codes, and its own aggregation and filtering logic, applied before the data was stored. The signals that would feed a Remaining Useful Life (RUL) model are present. They are just not comparable across regions, and no single system sees the whole fleet.

The reserve exists not because failures are unpredictable in principle, but because they are unpredictable in practice given the current data architecture. The finance team, unable to model failure distribution from the data, applies a conservative actuarial estimate based on industry failure rate statistics and the company's historical claims pattern. The reserve is larger than it needs to be because the uncertainty is larger than it needs to be. The uncertainty is larger than it needs to be because the data architecture was never designed with cross-fleet predictive analytics in mind.

The €40M is not the cost of a failure. It is the cost of an architectural decision that was never made — the decision to treat field telemetry as an engineering asset rather than a service management side-effect. That decision, made differently at the start, would have cost a fraction of what the reserve costs annually.

Figure 1 — Current state: six disconnected regional systems, no unified signal

The current state is not a data absence problem. The DICOM service events contain rich operational telemetry — coil usage, temperature profiles, RF amplifier readings, error codes — that correlates directly with failure modes. The problem is that six incompatible schemas mean no model can see the whole fleet. The reserve compensates for the architectural gap with financial slack.

The architecture review: what needs to change and in what order

When I look at a problem like this from an architecture perspective, the first question is not "what model should we train?" It is "what does the data path need to look like before any model is worth training?" The model is the last thing to specify. The schema, the ingestion pipeline, the normalisation logic, the quality controls, the feature engineering — all of this comes before the model and constrains what the model can do.

In this case, the architecture review produces a clear sequence. The starting point is not predictive maintenance. It is a unified canonical event schema — a single, versioned definition of what a telemetry event from a deployed unit looks like, regardless of which regional system it originated from. This schema needs to be agreed before any code is written, because it is the contract that every downstream component depends on. Changing it after the pipeline is live requires migrating all historical data, which is expensive and error-prone.
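To make the idea of a versioned contract concrete, here is a minimal sketch of what a canonical event definition could look like. Every field name, the version string, and the validation rules are illustrative assumptions, not the schema the fleet actually uses; a real definition would be agreed with each regional team first.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

SCHEMA_VERSION = "1.0.0"  # hypothetical version identifier


@dataclass(frozen=True)
class TelemetryEvent:
    """Canonical telemetry event. All field names are assumptions."""
    unit_id: str
    region: str
    event_time: datetime                   # must be timezone-aware
    error_code: Optional[str]              # canonical form is always a string
    temperature_c: Optional[float]         # degrees Celsius
    gradient_coil_cycles: Optional[int]
    rf_amplifier_power_w: Optional[float]
    cryogen_topup: bool = False
    schema_version: str = SCHEMA_VERSION

    def __post_init__(self):
        # Enforce the parts of the contract that downstream code relies on.
        if not self.unit_id:
            raise ValueError("unit_id is required")
        if self.event_time.tzinfo is None:
            raise ValueError("event_time must be timezone-aware")
        if self.error_code is not None and not isinstance(self.error_code, str):
            raise TypeError("error_code must be a string in the canonical schema")


event = TelemetryEvent(
    unit_id="MR-0001",
    region="apac",
    event_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
    error_code="E0042",
    temperature_c=21.5,
    gradient_coil_cycles=1200,
    rf_amplifier_power_w=None,
)
```

Carrying the schema version on every event is what makes later migration tractable: a consumer can tell which contract a historical record was written under.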

The second step is the ingestion pipeline: a Pub/Sub topic per region, with a normalisation layer (Dataflow) that transforms each regional schema to the canonical format and performs TFX schema validation before the event reaches the Feature Store. The normalisation logic needs to be tested against real historical data from each regional system before it goes live, because the schema differences between regions are not just naming conventions — some regions encode error codes as integers, others as strings; some aggregate temperature readings before emitting, others emit raw sensor values. Getting this wrong silently means getting predictions wrong downstream.
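The two kinds of regional difference mentioned above (integer versus string error codes, aggregated versus raw temperatures) can be sketched as per-region normaliser functions. This is plain Python rather than a Dataflow job, and the raw field names, the `E0042` code format, and the two regions chosen are all hypothetical.

```python
from statistics import mean


def normalise_central_europe(raw: dict) -> dict:
    # Assumed regional format: integer error codes, raw temperature samples.
    return {
        "unit_id": str(raw["unit"]),
        "region": "central_europe",
        "error_code": f"E{int(raw['err']):04d}",
        "temperature_c": mean(raw["temps"]),
        "temperature_pre_aggregated": False,
    }


def normalise_apac(raw: dict) -> dict:
    # Assumed regional format: string error codes, one pre-averaged temperature.
    return {
        "unit_id": str(raw["UnitID"]),
        "region": "apac",
        "error_code": raw["ErrorCode"].strip().upper(),
        "temperature_c": float(raw["AvgTemp"]),
        "temperature_pre_aggregated": True,
    }


NORMALISERS = {
    "central_europe": normalise_central_europe,
    "apac": normalise_apac,
}


def normalise(region: str, raw: dict) -> dict:
    """Transform one regional record into the canonical shape."""
    try:
        normaliser = NORMALISERS[region]
    except KeyError:
        raise ValueError(f"no normaliser registered for region {region!r}") from None
    return normaliser(raw)


ce = normalise("central_europe", {"unit": 17, "err": 42, "temps": [20.0, 22.0]})
ap = normalise("apac", {"UnitID": "MR-9", "ErrorCode": " e0042 ", "AvgTemp": "21.0"})
```

The `temperature_pre_aggregated` flag is a design choice worth keeping: averaging raw samples makes the two regions comparable, but a model should still be able to tell which values lost resolution before they were stored.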

The third step — and the one most teams reach for first, when they should reach for it last — is the model. A Remaining Useful Life regression model trained on the unified feature set: operating hours since last service, error code frequency over a rolling 90-day window, temperature deviation from baseline, gradient coil usage rate, cryogen top-up interval trend. With 12,000 units and several years of historical data from the regional systems, the training set is large enough that this is a well-posed regression problem.
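Three of the features listed above can be sketched in a few lines. The function names, the input shapes, and the choice to measure the top-up trend as the average change in days between successive intervals are assumptions made for illustration; the real feature definitions would live in the Feature Store.

```python
from datetime import datetime, timedelta
from statistics import mean


def error_count_in_window(error_times, as_of, window_days=90):
    """Number of error events in the window (as_of - window_days, as_of]."""
    start = as_of - timedelta(days=window_days)
    return sum(1 for t in error_times if start < t <= as_of)


def temperature_deviation(recent_readings, baseline_mean):
    """Mean of recent readings minus the unit's baseline mean."""
    return mean(recent_readings) - baseline_mean


def topup_interval_trend(topup_times):
    """Average change in days between successive cryogen top-up intervals.

    Negative values mean top-ups are becoming more frequent.
    """
    ordered = sorted(topup_times)
    if len(ordered) < 3:
        return 0.0  # fewer than two intervals: no trend to measure
    intervals = [(b - a).total_seconds() / 86400 for a, b in zip(ordered, ordered[1:])]
    return (intervals[-1] - intervals[0]) / (len(intervals) - 1)


as_of = datetime(2024, 6, 30)
features = {
    "error_count_90d": error_count_in_window(
        [datetime(2024, 1, 1), datetime(2024, 5, 1), datetime(2024, 6, 15)], as_of
    ),
    "temperature_deviation": temperature_deviation([21.0, 23.0], baseline_mean=20.0),
    "topup_interval_trend": topup_interval_trend(
        [datetime(2024, 1, 1), datetime(2024, 3, 1), datetime(2024, 4, 10)]
    ),
}
```

In this example the unit has two errors inside the window, runs 2.0 degrees above baseline, and its top-up interval has shortened from 60 to 40 days.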

The pipeline in detail: from raw event to RUL prediction

The diagram below shows the target architecture: the unified telemetry pipeline from raw DICOM event to RUL prediction to Field Service Manager alert, with the schema validation, quarantine, anomaly detection, and human-in-the-loop (HITL) layers that govern each step. The critical design decision at every stage is the same one that appears throughout this series: governance is built into the pipeline, not bolted on after it.

Figure 2 — Target architecture: unified telemetry pipeline, from raw DICOM event to RUL prediction

The pipeline has five stages. The schema normalisation and TFX validation stages are the unglamorous ones that most teams underspecify, and the ones whose failure produces the most expensive downstream consequences. The quarantine-then-review design is critical: in a sensor system, a schema-violating record is not necessarily a bad record. It is a record whose provenance is uncertain. Discarding it means losing a data point that might be the signal that predicts the failure you were trying to prevent. ISO 13485, the quality management standard for medical devices, requires human approval on any maintenance scheduling decision, so the HITL step is a regulatory requirement, not an optional one.

The three architectural decisions that determine whether this works

Building this pipeline is not technically complex. The GCP components are mature and well-documented. The model architecture for RUL prediction on time-series sensor data is well-understood. What determines whether a pipeline like this actually delivers on its promise — actually reduces the reserve, actually predicts failures before they occur — is three architectural decisions that are easy to get wrong and expensive to fix after the fact.

Decision one: quarantine, not discard

The first and most important architectural decision is what happens to a schema-violating record at the TFX validation stage. The obvious answer — the one that keeps the pipeline simple — is to discard it and emit an error metric. This is wrong in a sensor system, and the reason is irreversibility. A telemetry event from an MRI unit carries a timestamp and a unit identifier that cannot be reconstructed. If that event carries a novel error code that indicates a new failure mode, and it was discarded because the error code field had a type mismatch with the current schema version, then that signal is gone. The unit will fail. The failure will be unexpected. The warranty claim will be processed. And somewhere in a discarded log will be the event that would have predicted it.

The correct behaviour is quarantine: the record is held in a quarantine queue, a data steward is notified, the schema violation is reviewed, and if the record is valid but the schema needs updating (as opposed to the record being genuinely malformed), the record is reinstated and the schema is evolved. This adds operational overhead. It is worth it, because the operational overhead of managing a quarantine queue is small relative to the financial consequence of losing the signal that would have predicted an unplanned failure on a deployed medical imaging unit.
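A minimal sketch of the quarantine path, assuming a three-field slice of the canonical schema and an in-memory queue. The names `find_violations`, `QuarantineQueue`, and `route` are hypothetical, and a production version would sit behind the TFX validation step rather than replace it.

```python
from dataclasses import dataclass, field

# Hypothetical slice of the canonical schema: field name -> expected type.
EXPECTED_TYPES = {"unit_id": str, "event_time": str, "error_code": str}


def find_violations(record: dict) -> list:
    """Return a description of every schema violation in the record."""
    violations = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            violations.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            violations.append(f"{name}: expected {expected.__name__}, got {actual}")
    return violations


@dataclass
class QuarantineQueue:
    entries: list = field(default_factory=list)

    def hold(self, record: dict, violations: list) -> int:
        """Keep the record, with its violations, until a steward reviews it."""
        self.entries.append(
            {"record": record, "violations": violations, "status": "pending_review"}
        )
        return len(self.entries) - 1

    def reinstate(self, entry_id: int) -> dict:
        """Mark a reviewed entry as valid and hand the record back for reprocessing."""
        entry = self.entries[entry_id]
        entry["status"] = "reinstated"
        return entry["record"]


def route(record, accepted, quarantine, notify_steward):
    """Accept valid records; quarantine (never drop) the rest and notify a steward."""
    violations = find_violations(record)
    if not violations:
        accepted.append(record)
        return "accepted"
    entry_id = quarantine.hold(record, violations)
    notify_steward(entry_id, violations)
    return "quarantined"
```

The point of the design is that `route` has no branch that throws a record away: a type mismatch on a novel error code ends up in the queue with its original timestamp and unit identifier intact, ready to be reinstated once the schema has been evolved.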

Decision two: the RUL model is not the anomaly detector, and the anomaly detector is not the RUL model

The second decision is model separation. There is a temptation, in a pipeline that is already complex, to build one model that does both jobs: predicts remaining useful life and flags anomalous behaviour. This is a mistake for the same reason that combining episodic and semantic memory is a mistake in agent systems: the two problems have different training objectives, different feature relevance profiles, and different failure modes.

The RUL regression model is predicting a continuous outcome — how many operating hours remain before a maintenance event is likely to be needed. Its training signal is historical maintenance records paired with the telemetry features observed in the weeks before each service event. It needs a large training set and tolerates gradual feature drift well.

The Isolation Forest anomaly detector is identifying unit behaviour that deviates significantly from the fleet baseline — not predicting failure, but flagging that something unusual is happening that warrants inspection, even if the RUL model does not predict an imminent failure. It is trained on the normal operating distribution and is sensitive to sudden deviations. It should be re-evaluated more frequently and on a different feature subset than the RUL model.
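To show why this is a different kind of model from a regressor, here is a compact pure-Python sketch of the Isolation Forest scoring idea from Liu et al. (2008). It is for illustration only: in practice a maintained implementation such as scikit-learn's `IsolationForest` would be the usual choice, and the two features and their values below are invented.

```python
import math
import random


def _c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def _build_tree(points, depth, max_depth, rng):
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}  # cannot split on a constant feature
    split = rng.uniform(lo, hi)
    return {
        "dim": dim,
        "split": split,
        "left": _build_tree([p for p in points if p[dim] < split], depth + 1, max_depth, rng),
        "right": _build_tree([p for p in points if p[dim] >= split], depth + 1, max_depth, rng),
    }


def _path_length(point, node):
    depth = 0
    while "dim" in node:
        node = node["left"] if point[node["dim"]] < node["split"] else node["right"]
        depth += 1
    # Unsplit leaves are credited with the expected depth of the subtree not built.
    return depth + _c(node["size"])


def fit_forest(points, n_trees=100, sample_size=64, seed=0):
    if len(points) < 2:
        raise ValueError("need at least two points to build a forest")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        _build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    return trees, sample_size


def anomaly_score(point, forest):
    """Score in (0, 1]; higher means the point is isolated in fewer splits."""
    trees, sample_size = forest
    average = sum(_path_length(point, tree) for tree in trees) / len(trees)
    return 2.0 ** (-average / _c(sample_size))


# Invented fleet baseline: (temperature, coil usage rate) per unit.
data_rng = random.Random(1)
fleet = [(data_rng.gauss(40.0, 2.0), data_rng.gauss(1.0, 0.1)) for _ in range(200)]
forest = fit_forest(fleet)
typical = anomaly_score((40.0, 1.0), forest)
unusual = anomaly_score((90.0, 5.0), forest)
```

Nothing here uses maintenance records or a failure label. The detector only learns what the fleet normally looks like, which is why it can flag a unit the RUL model considers healthy.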

Running them separately, with separate inference pipelines and separate SHAP records, means that the HITL reviewer sees two distinct signals with distinct explanations — and can act on one without the other being confounded by it.

Decision three: the HITL is ISO 13485, not a UX convenience

The third decision is the nature of the HITL checkpoint in the maintenance scheduling workflow. This is a medical device system operating under ISO 13485 — the quality management standard for medical devices — which requires that any decision affecting the maintenance schedule of a deployed medical device involve a qualified human reviewer who can evaluate the recommendation against their knowledge of the specific unit and the operational context. The HITL is not a UX feature that can be removed to improve latency. It is a regulatory requirement that must be implemented as a non-bypassable step in the workflow.

This means the same pattern described throughout this series applies here: the HITL is a state machine node, not a UI feature. The maintenance scheduling API requires a maintenance_approval_id parameter that can only be supplied after a qualified Field Service Manager has reviewed the RUL prediction and its SHAP explanation and committed an approval. The approval is logged in Chronicle. The maintenance schedule is issued only after the approval record exists. This is not optional, and it is not something that can be added after the system is deployed. It has to be in the architecture from day one.
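The state machine can be sketched as follows. Only `maintenance_approval_id` comes from the design above; the class, the state names, and the in-memory `audit_log` (standing in for the Chronicle record) are assumptions, and the check that the reviewer is actually qualified is left out for brevity.

```python
import uuid
from enum import Enum


class State(Enum):
    AWAITING_REVIEW = "awaiting_review"
    APPROVED = "approved"
    SCHEDULED = "scheduled"


class ApprovalRequired(Exception):
    """Raised when a step is attempted without the approval it depends on."""


class MaintenanceWorkflow:
    def __init__(self):
        self.states = {}      # unit_id -> State
        self.approvals = {}   # maintenance_approval_id -> unit_id
        self.audit_log = []   # stand-in for the external audit record

    def submit_prediction(self, unit_id, rul_hours, explanation):
        self.states[unit_id] = State.AWAITING_REVIEW
        self.audit_log.append(("prediction", unit_id, rul_hours, explanation))

    def approve(self, unit_id, reviewer):
        """Record a reviewer's approval and return the id that unlocks scheduling."""
        if self.states.get(unit_id) is not State.AWAITING_REVIEW:
            raise ApprovalRequired(f"no prediction awaiting review for {unit_id}")
        approval_id = str(uuid.uuid4())
        self.approvals[approval_id] = unit_id
        self.states[unit_id] = State.APPROVED
        self.audit_log.append(("approval", unit_id, reviewer, approval_id))
        return approval_id

    def issue_schedule(self, unit_id, maintenance_approval_id):
        """Issue a schedule only if a matching approval record exists."""
        approved_unit = self.approvals.get(maintenance_approval_id)
        if approved_unit != unit_id or self.states.get(unit_id) is not State.APPROVED:
            raise ApprovalRequired(f"no valid approval for {unit_id}")
        self.states[unit_id] = State.SCHEDULED
        self.audit_log.append(("schedule", unit_id, maintenance_approval_id))
        return {"unit_id": unit_id, "maintenance_approval_id": maintenance_approval_id}
```

Because the approval id is generated inside `approve` and checked inside `issue_schedule`, there is no code path from prediction to schedule that skips the reviewer. That is what makes the checkpoint a node in the workflow rather than a screen in front of it.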

The architecture review framework: from P&L problem to architecture problem

The pattern in this article is one that repeats across every industry where large deployed assets generate telemetry: manufacturing equipment, commercial vehicles, energy infrastructure, telecommunications hardware. The specific numbers differ. The structure is the same. A €40M warranty reserve, an 18% unplanned downtime rate, a 3.2× reactive-to-preventive maintenance cost ratio — these are all P&L expressions of the same underlying condition: the data exists, the models are trainable, but the architecture that would connect the data to the decisions has not been built.

Figure 3 — Symptom-to-cause-to-fix: the architectural translation of a P&L problem

P&L symptom	Architectural cause	What the architecture review must specify
€40M warranty over-reserve: annual reserve larger than actuarial necessity.	No unified telemetry schema: six regional systems, six incompatible schemas. Fleet-wide failure distribution cannot be modelled. Reserve is set conservatively to cover the uncertainty.	Canonical event schema defined before any pipeline is built. Schema versioned and governed as a first-class artefact. TFX validation enforced at ingestion. Quarantine-not-discard for violations.
3.2× reactive/preventive cost ratio: reactive maintenance costs 3.2× preventive at equivalent maintenance scope.	No predictive signal: Field Service dispatched reactively after failure because no system predicts imminent failures. Route optimisation and parts pre-positioning impossible.	RUL regression model trained on unified Feature Store. Inference on 30-day rolling horizon. Field Service Manager HITL before schedule issuance. Parts pre-positioning triggered at 60-day RUL threshold.
Low anomaly detection coverage: novel failure modes not caught until unit failure occurs.	No fleet-wide baseline: without unified telemetry, there is no normal operating envelope to deviate from. Anomaly detection requires a baseline; baseline requires unified data.	Isolation Forest anomaly detector trained on fleet-wide normal distribution. Separate from RUL model; separate inference pipeline. SHAP records for every flagged anomaly. Threshold calibrated quarterly.
Post-failure compliance exposure: inability to demonstrate due diligence in maintenance scheduling under ISO 13485.	No structured HITL audit trail: maintenance scheduling decisions made informally by Field Service Managers with no structured record of what prediction they were given and what they decided.	HITL as state machine node. Field Service Manager reviews RUL prediction + SHAP explanation. Approval logged in Chronicle with `maintenance_approval_id`. Maintenance schedule unreachable without approval record. ISO 13485 satisfied structurally.
Data loss on schema evolution: new error codes discarded because they don't match current schema version.	Discard-on-violation design: pipeline drops records that fail schema validation rather than quarantining them. Novel signals lost at the validation boundary.	Quarantine queue with steward notification on every schema violation. Reinstatement workflow. Schema evolution governed: new error codes require Architecture Board review before addition to canonical schema.

The lesson that transfers beyond fleet telemetry

The most important generalisation from this problem is not about fleet telemetry architecture. It is about the relationship between P&L problems and architectural decisions. Every large financial line item in an engineering organisation that cannot be explained by unavoidable market conditions or genuine technical complexity deserves an architectural diagnosis. Not a software development response, not a data science response, but an architectural one. The question is: what structural absence in the current system produces this cost, and what would need to be true about the architecture for that cost to be eliminated or substantially reduced?

In the €40M case, the structural absence is a unified telemetry schema and the pipeline that enforces it. In other organisations I have worked with, the equivalent absences produce different P&L symptoms: a customer churn rate that is higher than it should be because there is no system that sees the early-warning signals in support interactions; a working capital position that is larger than it needs to be because accounts receivable has no model that predicts which invoices will be paid late; an inventory overrun because the demand planning system sees product-level data but not the customer-level leading indicators that would make the forecast more accurate.

These are all the same problem in different clothes. The data that would make the decision better exists. The model that could act on it is trainable. The architecture that would connect the data to the decision has not been built. The P&L absorbs the cost of the missing connection in the form of reserves, buffers, and overruns that exist precisely to cover the uncertainty that the missing intelligence layer would have resolved.

Identifying that connection — naming the architectural gap that produces a specific financial cost — is the skill that distinguishes an architect who creates business value from one who creates technical solutions. The technical solution comes second. The architectural diagnosis comes first. And the diagnosis starts with a P&L line item and works backwards to the structural absence that produced it.

The data existed. The models were trainable. The architecture that would have connected them was never built. The reserve is what that decision cost, year after year, until someone translated the P&L problem into an architecture problem and decided to solve it.

References & Further Reading

ISO 13485:2016, Medical devices — Quality management systems — Requirements for regulatory purposes. The standard that mandates structured human review of decisions affecting maintenance scheduling for deployed medical devices. The HITL requirement described in this article is a structural expression of this standard's Section 7.5 (Production and service provision) requirements.

DICOM Standard, Digital Imaging and Communications in Medicine — The medical imaging standard that defines the service event format referenced in this article. DICOM service events include structured operational telemetry alongside clinical imaging metadata; the specific fields described (error codes, gradient coil usage, cryogen records) are standard DICOM attributes in the service log domain.

Google Cloud, Pub/Sub and Dataflow documentation — The streaming ingestion and normalisation pattern described in Figure 2. The schema normalisation approach (Dataflow transforming regional schemas to a canonical format) follows the standard Pub/Sub-to-BigQuery streaming pipeline pattern.

TensorFlow Extended (TFX) documentation — The schema validation and quarantine pattern described in the article. TFX's ExampleValidator component implements the schema drift detection that drives the quarantine queue decision.

Liu, F.T., Ting, K.M., and Zhou, Z-H. (2008). "Isolation Forest." IEEE International Conference on Data Mining (ICDM). — The foundational paper for the Isolation Forest anomaly detection algorithm used in the fleet telemetry anomaly layer. The key property — anomalies are isolated in fewer splits than normal observations — makes it well-suited to telemetry data with high-dimensional feature spaces and non-parametric anomaly distributions.

Saxena, A. et al. (2008). "Damage propagation modeling for aircraft engine run-to-failure simulation." International Conference on Prognostics and Health Management. — The NASA CMAPSS benchmark that established the standard evaluation methodology for Remaining Useful Life regression on sensor data. The feature engineering approach described in this article follows the RUL literature convention of rolling-window statistics over raw sensor readings.

