A few years into doing this work, I sat in a project kickoff where a stakeholder arrived with a deck already built. Slide three had the solution on it: a recommendation engine. The business problem, buried somewhere in the appendix, was that sales conversion from their product catalogue had dropped 18% over two quarters. The recommendation engine was going to fix it.

Nobody in the room asked why.

We spent six weeks on data pipelines, embedding models, and similarity scoring before someone finally pulled the CRM data and noticed that the drop was concentrated in a single customer segment — mid-market accounts that had recently been reassigned to a new sales team. The problem wasn't product discovery. It was onboarding. A recommendation engine was not going to move that number by a single percentage point, and we had just spent six weeks proving it the expensive way.

That experience changed how I start every ML project. Not with the data. Not with the model. With a much more uncomfortable question: is the thing we are being asked to build actually connected to the outcome we are being asked to improve?


The solution arrives before the problem is understood

This is the most common failure mode in applied ML, and it is so pervasive that most teams have stopped noticing it. A business stakeholder reads about a technology — a large language model, a churn prediction system, a computer vision pipeline — and arrives at the project with a solution in mind. The ML team's job, as it gets handed to them, is to build it.

The business problem is treated as the context for the build, not the thing the build is supposed to solve. And those are not the same thing.

What makes this particularly difficult is that the stakeholder is not being irrational. They have read the case studies. They have seen what these systems can do at other companies. The technology is real, the capability is real, and the desire to apply it is understandable. The failure is not in the enthusiasm. It is in the skipped step — the step where you ask whether the causal chain between the proposed solution and the stated problem actually holds.

The question is never "can we build this?" The question is "if we build this perfectly, does the number we care about actually move?"

When you cannot answer that question with confidence before the project starts, you are not doing machine learning. You are doing expensive prototyping in the direction of a hypothesis that nobody has verified.


Why the business problem and the ML problem are not the same thing

Every ML project involves two distinct translations, and both of them can go wrong independently. The first is translating a business outcome into a well-posed problem. The second is translating that well-posed problem into an ML task. Most teams are reasonably careful about the second translation. Almost no teams are careful enough about the first.

A business outcome is something like: reduce customer churn, increase operational throughput, catch defective units before they ship, shorten the time it takes a new employee to become productive. These are real, measurable outcomes with financial weight behind them. They are also not ML problems. They are business problems, and the path from business problem to ML problem requires a causal model — an explicit claim about which mechanism, if modified, would produce the desired outcome.

That causal model is almost never made explicit. It is assumed, usually by the person who proposed the solution, and then the assumption gets carried forward into the project like a piece of undocumented debt. Everything built on top of it is technically sound. The foundation is a guess.

The Two Translations: Where ML Projects Actually Break

What teams skip: business outcome → well-posed problem

The causal claim. If we intervene on X, does Y move? This is a business question, not an ML question, and it must be answered before any data is touched.

What is the actual mechanism driving the outcome?
Does the proposed intervention sit on that mechanism?
Is there evidence, even weak evidence, that this link exists?
Who in the business can falsify this assumption?

What teams focus on: well-posed problem → ML task

The technical translation. Classification or regression? Supervised or unsupervised? What is the label? What is the feature space? Teams are thorough here, on top of a foundation they never verified.

Task framing: classification, regression, retrieval, generation
Label definition and quality assessment
Feature engineering and availability at inference time
Evaluation metric selection and baseline definition

The uncomfortable truth is that the second translation — the one teams are careful about — is largely recoverable. A wrong task framing can be corrected in the third week of a project. A wrong causal assumption, discovered in the third week, usually means starting over.


What a bad question looks like

After enough projects, the signs of a poorly framed question become recognisable before any data is pulled. These are not technical red flags. They show up in the language used in the first meeting, in the way the problem gets handed over, in what happens when you ask why.

Question framing failures: what they sound like and what they produce

The solution is named before the problem
When a stakeholder opens with "we need an LLM" or "we want to build a recommendation engine," the question has already been pre-answered. The ML team is being asked to justify a conclusion rather than reach one. The actual problem, if it surfaces at all, gets retrofitted to the solution someone already decided on.

The metric and the outcome are not the same thing
A team is asked to maximise click-through rate. The business wants to increase revenue. These are not the same objective, and optimising one can actively harm the other. When the metric handed to the ML team is a proxy, and it almost always is, the distance between the proxy and the actual outcome needs to be interrogated explicitly. That interrogation rarely happens.

"More data" is the proposed solution to a bad model
When a model underperforms and the first instinct is to collect more data, it often means the question being asked of the model is wrong. More data answers the question you are already asking; it does not tell you whether that is the right question. A model trained on ten million examples of the wrong problem is not better than a model trained on ten thousand examples of the right one.

The baseline is not defined
If you cannot describe what the business does today, without ML, without any model, with the simplest possible rule, then you do not understand the problem well enough to solve it. The baseline is not just a benchmark for the model. It is the check on whether ML is the right tool at all. Sometimes a lookup table and two business rules outperform a fine-tuned transformer. The project should discover that early, not at deployment.

Nobody can say what "good" looks like
If you ask the stakeholder "how will you know, six months after deployment, whether this worked?" and they cannot answer with a specific number that moves in a specific direction, rather than vaguely or with a feeling, the problem definition is not finished. You are being asked to build something with no agreed-upon definition of success. That is not an ML problem. It is an unresolved business conversation that has been handed to the wrong team.
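
The baseline point above can be made concrete with a small comparison. What follows is a minimal sketch, not a prescribed method: the Account fields, the two business rules, the model scores and the five-row holdout are all invented for illustration, and any real check would use the team's own outcome data and existing heuristic.

```python
from dataclasses import dataclass


@dataclass
class Account:
    days_since_last_order: int  # hypothetical feature
    open_support_tickets: int   # hypothetical feature
    model_score: float          # score from some already-trained model (assumed)
    churned: bool               # observed outcome on a held-out period


def rule_baseline(a: Account) -> bool:
    """Two business rules and no model: the kind of heuristic a team may already use."""
    return a.days_since_last_order > 60 or a.open_support_tickets >= 3


def model_prediction(a: Account, threshold: float = 0.5) -> bool:
    return a.model_score >= threshold


def precision_recall(predictions, labels):
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    fn = sum((not p) and y for p, y in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def compare(accounts):
    """Score the rule and the model on the same held-out accounts."""
    labels = [a.churned for a in accounts]
    return {
        "rule_baseline": precision_recall([rule_baseline(a) for a in accounts], labels),
        "model": precision_recall([model_prediction(a) for a in accounts], labels),
    }


# Invented toy holdout, for illustration only.
holdout = [
    Account(90, 0, 0.8, True),
    Account(10, 4, 0.3, True),
    Account(5, 0, 0.6, False),
    Account(70, 1, 0.2, False),
    Account(3, 0, 0.1, False),
]

for name, (precision, recall) in compare(holdout).items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

The design choice that matters is scoring the rule and the model on the same held-out rows with the same metric. If the rule is not measured this way, any claimed improvement from the model has nothing to stand on.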

None of these require a bad-faith stakeholder to produce. They are the natural output of organisations that have learned to be enthusiastic about ML faster than they have learned to think carefully about what they want it to do. The patterns are structural, not personal — which means they will repeat on every project until someone builds a different habit at the front of the process.


The question audit — before any data is touched

The fix is not a longer discovery phase. It is a specific set of questions, asked early, that force the causal assumptions into the open where they can be examined. These are not questions the ML team can answer alone; they require the people who own the business outcome to be in the room. That is the point. If the ML team discovers it cannot answer them without a business owner present, it has also discovered that the project was about to proceed on assumptions nobody had checked.

What is the outcome you are trying to move, and how is it currently measured? Not the problem statement — the actual number on the dashboard that will look different if this project succeeds. If there is no number, the outcome is not defined. Define it before proceeding.

What do you believe is causing that outcome to be where it is today? This is the causal assumption. It should be stated explicitly, in plain language, and it should be specific enough to be falsifiable. "Customers are not engaging with the right products" is not specific. "Customers in the mid-market segment cannot find relevant products because the search function does not account for their industry context" is specific.

If an ML system addressed that specific mechanism perfectly, what is the maximum possible impact on the outcome? This is the ceiling check. If the mechanism you are targeting accounts for roughly 10% of the gap between where the outcome is and where it should be, a perfect solution closes at most that 10%. Is that worth the project? Sometimes yes. But the ceiling should be known before the project starts, not after it delivers.

What does the business do today, without any model, to address this problem? Describe the current state completely — the manual process, the rule, the heuristic, the human judgment call. This is your baseline. Every claim of model improvement is a claim relative to this baseline, and this baseline needs to be measured, not assumed.

How will you know, in production, whether the deployed system is working? The evaluation does not end at model accuracy. It ends when the outcome number moves. Define the measurement plan — how frequently, on what segment, with what comparison group — before the model is built. If this cannot be designed upfront, the problem definition has a gap.
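
One way to pin down the measurement plan in the last question is to write the comparison before the model exists. The sketch below assumes a simple setup that is not from any specific project: a treated group that gets the system, a comparison group that does not, and a conversion count for each over the same window. GroupResult and lift_report are hypothetical names, and the two-proportion z-test is only one reasonable choice of comparison.

```python
import math
from dataclasses import dataclass


@dataclass
class GroupResult:
    name: str
    accounts: int     # accounts in the group during the window
    conversions: int  # accounts that converted during the window

    @property
    def rate(self) -> float:
        return self.conversions / self.accounts


def lift_report(treated: GroupResult, control: GroupResult) -> dict:
    """Absolute lift in conversion rate with a two-proportion z-test."""
    diff = treated.rate - control.rate
    pooled = (treated.conversions + control.conversions) / (
        treated.accounts + control.accounts
    )
    se = math.sqrt(
        pooled * (1 - pooled) * (1 / treated.accounts + 1 / control.accounts)
    )
    z = diff / se if se > 0 else 0.0
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"absolute_lift": diff, "z": z, "p_value": p_value}


# Invented numbers, for illustration only.
report = lift_report(
    GroupResult("with_system", accounts=1000, conversions=120),
    GroupResult("comparison", accounts=1000, conversions=100),
)
print(report)
```

Writing this function first forces the open questions into view: which accounts form the comparison group, over what window, and how large a lift would count as success.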

These questions are not additions to the project plan. They are replacements for the assumptions the project plan was making silently. The assumption that the proposed solution addresses the right mechanism. The assumption that the metric being optimised is connected to the outcome being chased. The assumption that improvement over the current state is even possible from the angle being taken.
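
The ceiling check from the third question is plain arithmetic, and writing it down makes the assumption visible. This is a sketch under stated assumptions: the function name, the mechanism_share parameter and the example figures (a conversion rate that fell from 5.0% to 4.1%, which is an 18% relative drop) are invented for illustration. The share attributed to the mechanism is itself an estimate that someone in the business has to defend.

```python
def ceiling(current: float, target: float, mechanism_share: float) -> dict:
    """Upper bound on what a perfect fix to one mechanism can recover.

    current:         the outcome metric today
    target:          where the business wants the metric to be
    mechanism_share: assumed fraction (0 to 1) of the gap caused by the
                     mechanism the project targets
    """
    if not 0.0 <= mechanism_share <= 1.0:
        raise ValueError("mechanism_share must be between 0 and 1")
    gap = target - current
    recoverable = gap * mechanism_share
    return {
        "gap": gap,
        "max_recoverable": recoverable,
        "best_case": current + recoverable,
    }


# Hypothetical figures: conversion fell from 5.0% to 4.1%, and the targeted
# mechanism is assumed to explain 10% of that gap.
print(ceiling(current=4.1, target=5.0, mechanism_share=0.10))
```

If the best case is too small to justify the project, that is worth knowing before the first pipeline is built.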

Every one of those assumptions, left unexamined, is a project risk that accrues compound interest until it surfaces at the wrong moment, usually in a review meeting, six weeks in, when someone finally pulls the data that should have been pulled in week one.
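
The week-one data pull can be very small. As a sketch, assuming conversion counts are available per customer segment for the period before and after the drop, a per-segment breakdown is enough to show whether the decline is broad or concentrated. The SegmentPeriod fields and the figures are invented, chosen so that the overall rate falls by 18% as in the opening example.

```python
from dataclasses import dataclass


@dataclass
class SegmentPeriod:
    segment: str
    opportunities_before: int
    conversions_before: int
    opportunities_after: int
    conversions_after: int


def rate_change_by_segment(rows):
    """Conversion rate before and after per segment, largest drop first."""
    report = []
    for r in rows:
        before = r.conversions_before / r.opportunities_before
        after = r.conversions_after / r.opportunities_after
        report.append((r.segment, before, after, after - before))
    return sorted(report, key=lambda item: item[3])


# Invented counts: overall conversion goes from 600/3000 to 492/3000.
rows = [
    SegmentPeriod("enterprise", 1000, 200, 1000, 198),
    SegmentPeriod("mid_market", 1000, 200, 1000, 96),
    SegmentPeriod("smb", 1000, 200, 1000, 198),
]

for segment, before, after, change in rate_change_by_segment(rows):
    print(f"{segment:12s} before={before:.3f} after={after:.3f} change={change:+.3f}")
```

In this toy data almost all of the decline sits in one segment, which points toward something specific to that segment and away from a catalogue-wide discovery problem.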


What this looks like in practice — FederaQ

One of the systems I built that taught me this most directly was a query federation engine for sales intelligence. The original brief was a RAG system — a retrieval-augmented generation setup that would let sales reps ask questions about deals, accounts, and pipeline in natural language.

The question audit surfaced something uncomfortable early: the sales reps did not have a retrieval problem. They knew where the information was. They had a synthesis problem. The data they needed to answer a pre-call question — account history, open opportunities, recent email threads, CRM notes — existed in four different systems, none of which talked to each other. A RAG system would have indexed static documents and given them answers that were out of date by the time the question was asked.

The right ML problem was not retrieval. It was live federation — routing a natural language question to the right operational systems, generating structured sub-queries against each, and synthesising a coherent answer from data that was current at the moment of the ask.
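
To make the shape of live federation concrete, here is a deliberately simplified sketch of the three steps named above: route, generate sub-queries, and combine. It is not the FederaQ implementation. The Source type, the keyword-based routing, the sub-query string format and the stub fetch functions are all assumptions made for illustration, and the synthesis step here only collects results where a real system would turn them into a coherent answer.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Source:
    name: str
    keywords: frozenset           # crude routing signal (assumption)
    fetch: Callable[[str], list]  # runs a sub-query against the live system


def route(question: str, sources: list) -> list:
    """Pick the sources whose keywords appear in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return [s for s in sources if words & s.keywords]


def build_subquery(source: Source, account: str) -> str:
    # Placeholder format; a real system would generate a structured query
    # in whatever form each operational system accepts.
    return f"{source.name}:account={account}"


def answer(question: str, account: str, sources: list) -> dict:
    """Query each routed source at ask time and collect the results."""
    return {
        source.name: source.fetch(build_subquery(source, account))
        for source in route(question, sources)
    }


# Stub sources standing in for live systems.
demo_sources = [
    Source("crm", frozenset({"deals", "pipeline", "opportunities"}),
           lambda q: [f"crm result for {q}"]),
    Source("email", frozenset({"email", "emails", "thread"}),
           lambda q: [f"email result for {q}"]),
]

print(answer("Which deals are open and what was the last email?", "acme", demo_sources))
```

The point of the sketch is the data flow: nothing is indexed ahead of time, so every answer is built from whatever the source systems return at the moment of the question.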

RAG was the answer to the question everybody assumed was being asked. The actual question, once examined, required a different architecture entirely.

That distinction — between a static retrieval problem and a live federation problem — is invisible if you skip the question audit. It becomes very visible when you ask "what do sales reps actually need to do their job better?" and sit with that question long enough to get a real answer rather than a convenient one.

The resulting system, FederaQ, is built as a federated query synthesis engine, not a RAG system. The architectural difference is not cosmetic. It determines what data sources can be connected, how fresh the answers are, and what failure modes exist. The entire design follows from getting the question right.


The thing that makes this hard

There is a category of resistance that no amount of methodological rigour addresses, and it is worth naming honestly: the stakeholder who already knows the answer and needs the ML team to confirm it.

This is not bad intent. It is how large organisations often work. A senior leader has committed to a technology direction — publicly, in a board update, in a budget line item. The project team's job, as it lands on the ML practitioner's desk, is to execute that direction, not interrogate it. The question audit feels adversarial in this context even when it is not meant to be.

The framing that has worked for me in these situations is not to challenge the solution but to test the assumption behind it. "I want to make sure we build this in the way that gives it the best chance of moving the outcome. Can we spend an hour confirming the mechanism before we start on the architecture?" That is not a rejection of the proposal. It is a request to strengthen it. Most reasonable stakeholders will take that conversation.

The ones who won't are also telling you something important — not about the technology, but about the conditions under which the project will be evaluated. That information is worth having before you start, not after you deliver.


What changes when you get the question right

If the question were just a formality — a box to check before the real work begins — the solution would be a longer discovery meeting. More stakeholder interviews, a better requirements document, a more detailed project brief.

But the question is not a formality. It is the load-bearing structure underneath every technical decision that follows. The task framing, the feature selection, the evaluation metric, the deployment design, the success criteria — all of it is downstream of the question. Get the question wrong and you are building a structurally sound system on a misaligned foundation. It will work. It will not help.

When you get the question right, something changes in the project. The data exploration becomes purposeful rather than open-ended. The model selection is constrained by the actual problem rather than by what is fashionable. The evaluation is tied to a business outcome rather than a benchmark. And the conversation at deployment, the one where you explain what the system does and why it should be trusted, becomes straightforward because everyone already agreed on what success looked like before the build started.

The stakeholder who walked in with a recommendation engine on slide three had a real problem. The 18% conversion drop was real, the pressure to fix it was real, and the desire to use the technology available to them was entirely rational. The failure was not the recommendation engine. The failure was the assumption — shared and unexamined — that the engine was connected to the problem. That assumption lived in the space between the business outcome and the proposed solution, in the gap that the question audit is designed to close.

That gap is where most ML projects fail. Not in the model, not in the data, not in the engineering. In the step before all of it, where the question either gets asked or gets assumed. It usually gets assumed. And the project pays for it later, in the currency of weeks.

References & Further Reading

Sculley, D. et al. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015. — On the compounding cost of undocumented assumptions in ML pipelines. The business problem framing layer is the earliest and most expensive form of this debt.

Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux. — On substitution bias: the tendency to answer an easier question than the one being asked. Most ML solution proposals are substitution answers to questions that were never properly posed.

Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books. — The clearest available treatment of why correlation-based systems fail to produce business outcomes when the causal model is wrong. Directly relevant to the mechanism problem described in §2.

Huyen, C. (2022). Designing Machine Learning Systems. O'Reilly Media. — Chapter 2 on framing ML problems from business objectives is the most practical treatment of the first translation problem I have found. Required reading before any project kickoff.
