There is a conversation I have had, in different forms, on almost every ML project I have been brought into. It usually starts with someone asking which model the team is planning to use. Sometimes it is GPT-4 versus an open-source alternative. Sometimes it is which embedding model, or which fine-tuning approach, or whether to use a 7-billion or 13-billion parameter variant. The question is always asked with genuine interest and technical seriousness. It is also almost always the wrong first question.

Not because the model doesn't matter. It does, eventually. But by the time a model decision is actually consequential, there are usually five or six earlier decisions that have already determined whether the project can succeed at all — and none of them involve picking a model. They involve understanding the problem well enough to know what you are building, assembling data that is fit for purpose, defining what success looks like before you start, and thinking through how the system fails before it is in front of users.

The model is the part of the project that gets the most attention and the least leverage. The parts that actually determine outcomes are quieter, less glamorous, and almost entirely invisible in the way the industry talks about this work.


Why the model gets all the attention

The model selection conversation is appealing for understandable reasons. Models are comparable. They have benchmarks. There is published research, leaderboards, community consensus. Choosing a model feels like making a decision you can defend — you can point to numbers, cite papers, reference deployment case studies from companies with engineering teams far larger than yours.

The earlier work — problem framing, data quality, evaluation design, failure mode analysis — is murkier. There are no leaderboards for whether your problem statement is correct. Nobody publishes a benchmark for label quality. The decisions are harder to make and harder to justify because they are judgment calls rather than comparisons. They require sitting with uncertainty rather than resolving it by pointing at a chart.

So the conversation gravitates toward the model, where the ground feels firmer, while the actual load-bearing decisions accumulate unexamined underneath it.

The model is the most legible part of an ML project. Legibility and importance are not the same thing.

This is not a criticism of teams that fall into this pattern. It is a structural feature of how the technology is marketed, discussed, and written about. The fix is not better judgment in individual projects — it is a different sequence. Start with the things that determine whether the project is viable. Get to the model when it is actually the thing that needs deciding.


What actually determines whether an ML project succeeds

Across the systems I have designed and built — a local-first RAG for manufacturing, a federated query engine for enterprise sales, a model merging architecture for financial analytics, an alignment pipeline for regulated output — the pattern that determines outcome is consistent. It is not the model. It is the work that happens before the model is selected, and the infrastructure that surrounds it once it is deployed.

That work falls into five categories. Each one is upstream of model selection. Each one, if it fails, cannot be fixed by a better model.

What determines ML project outcomes — in order of leverage
1 — Problem framing
The most consequential decision in any ML project is whether the problem being solved is the right one. A model trained perfectly on the wrong problem does not produce partial value — it produces confident, useless outputs that erode trust faster than a system that simply didn't exist. Problem framing is not a kickoff activity. It is an ongoing discipline that should be revisited every time the data tells you something unexpected about what is actually going on.
2 — Data quality and fitness
Data quality is not a data engineering problem. It is an ML problem. The model you eventually train will learn from whatever signal is in the data — including noise, bias, and labelling inconsistency. A model trained on clean, well-labelled data that accurately represents the deployment distribution will outperform a model trained on a larger, noisier dataset almost every time. The time spent on data quality is not preparation for the real work. It is the real work.
3 — Baseline definition
Before any model is trained, you need to know what you are trying to beat. Not a published benchmark, but the baseline for your specific problem, in your specific environment, with your specific data. This is usually a simple rule, a lookup, a heuristic, or human performance on the task. Every claim of model improvement is a claim relative to this baseline. If the baseline is not defined and measured, the claim is not verifiable. In several projects I have seen, a well-designed baseline outperformed the first three model iterations. That is not a failure. It is the process working correctly: the baseline showed that those iterations were not yet adding value.
4 — Evaluation design
The evaluation metric you choose is what the system ends up optimised for, because every training run, model comparison, and ship decision is judged against it. Choose the wrong metric and you will build a system that is excellent at something that does not matter. For every project I have built, the evaluation design came before any model was trained, not as a formality but as a constraint. The metric had to be connected to the business outcome, measurable in production, and defined precisely enough that two people looking at the same output would agree on whether it passed. Where those conditions could not be met, the problem definition was not finished.
5 — Failure mode analysis
Every ML system fails in specific ways. The question is whether those failure modes are understood before deployment or discovered after. For a factory floor system, a wrong answer delivered with confidence is not a retrieval miss — it is a safety incident waiting to happen. For a financial analytics system, a hallucinated figure in a client memo is not a quality issue — it is a regulatory exposure. Understanding the failure modes in advance shapes the guardrail design, the confidence thresholds, the human-in-the-loop decisions, and the deployment scope. None of that is possible if the failure modes are only discovered after the model is in production.

The model sits below all five of these. It is not irrelevant: a model that cannot perform the task at all will fail regardless of how well everything else is designed. But within the range of models that are capable of the task, the difference between them is usually smaller than the difference between good and poor execution on any of the five categories above.


The AlignR lesson — the pipeline is the demonstration

One of the clearest examples of this in my own work is AlignR, an alignment pipeline for regulated enterprise AI. It is a seven-stage pipeline whose stages include preference data collection, reward model training, a DPO loop, red-team evaluation, and compliance artefact generation. The model it runs on in the MVP is a small distilled model that fits on a free Kaggle T4 GPU.

There is a line in the AlignR documentation that captures the point exactly: the pipeline is the demonstration. The model is a parameter.

What that means in practice is that every stage of the pipeline — the preference data collection methodology, the reward model training approach, the red-team suite, the compliance artefact bundle — was designed and built to production standards regardless of model size. The same scripts, the same configuration files, the same evaluation harness, and the same artefact schemas run unchanged when the model is upgraded from a small distilled variant to a production-grade 7B or 13B model on proper GPU infrastructure.
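
The idea of the model as a parameter can be sketched in a few lines. This is a minimal illustration, not AlignR's actual code: the config fields, stage names, and model names below are hypothetical placeholders, and the stages are stubs that only record which model they ran with. It uses only the stages named in the text.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PipelineConfig:
    # All field names and default values here are hypothetical.
    model_name: str                       # the only field that changes on upgrade
    eval_suite: str = "redteam-v1"        # evaluation harness stays fixed
    artefact_schema: str = "compliance-v1"  # artefact schema stays fixed


Stage = Callable[[PipelineConfig, Dict[str, str]], Dict[str, str]]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that records that it ran, and on which model."""
    def stage(cfg: PipelineConfig, state: Dict[str, str]) -> Dict[str, str]:
        return {**state, name: f"ran with {cfg.model_name}"}
    return stage


STAGE_NAMES = [
    "preference_data",
    "reward_model",
    "dpo_loop",
    "red_team",
    "compliance_artefacts",
]
STAGES: List[Stage] = [make_stage(name) for name in STAGE_NAMES]


def run_pipeline(cfg: PipelineConfig) -> Dict[str, str]:
    """Run every stage in order; the stage list never depends on the model."""
    state: Dict[str, str] = {}
    for stage in STAGES:
        state = stage(cfg, state)
    return state


small = PipelineConfig(model_name="small-distilled-model")
large = replace(small, model_name="production-7b-model")  # swap one field only

small_run = run_pipeline(small)
large_run = run_pipeline(large)
```

Both runs pass through the same stages with the same evaluation suite and artefact schema. Only the value of `model_name` differs.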

The model is swappable. The pipeline is not. The pipeline is where the value lives.

A well-designed pipeline with a small model will outperform a poorly designed pipeline with a large one. The pipeline is the system. The model is a component.

This is not a novel insight. It is what every experienced ML practitioner discovers after their first few production deployments. What is novel is how rarely it changes the conversation — which keeps starting at the model rather than at the pipeline.


The data quality problem that no model can fix

On one of the fine-tuning projects I built, the team arrived with a training dataset of approximately 2,000 question-answer pairs. The pairs had been generated synthetically by a large teacher model on a product knowledge base. The generation quality looked good on inspection — fluent, detailed, apparently accurate. The team wanted to jump to fine-tuning and model selection.

The data quality pipeline told a different story.

MinHash locality-sensitive hashing identified that roughly 18% of the training examples were near-duplicates — different phrasings of the same underlying question with the same answer. The effective training set was smaller than it appeared. More significantly, the teacher model had generated a number of answers that were factually incorrect on specific SKU-level product details — details where the correct answer required exact recall rather than plausible reasoning. The teacher model was reasoning to a confident, wrong conclusion.

A model fine-tuned on that dataset would have learned to be confidently wrong about specific facts at a predictable rate. No model choice would have fixed that. The only fix was upstream — catching the teacher hallucinations in the quality filter before they entered the training set, running deduplication before the training split was created, and verifying fact-level accuracy on the subset of questions where exact recall mattered.
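
For readers who have not used it, the deduplication step can be sketched with the standard library alone. This is a simplified illustration of MinHash signatures with LSH banding, not the pipeline used on the project: the shingle size, signature length, band count, threshold, and example questions are all assumptions. It catches lexical near-duplicates; heavier paraphrases may need a different similarity measure.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64  # signature length (assumed value)
BANDS = 16     # LSH bands; rows per band = NUM_PERM // BANDS


def shingles(text, n=3):
    """Lower-cased word n-grams; very short texts become a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, shingle):
    """One member of a family of hash functions, selected by the seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def signature(text):
    """MinHash signature: the minimum hash of the shingle set per hash function."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM)]


def candidate_pairs(signatures):
    """LSH banding: texts sharing any identical band become candidate pairs."""
    rows = NUM_PERM // BANDS
    pairs = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for idx, sig in enumerate(signatures):
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(idx)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates Jaccard similarity of shingle sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates(texts, threshold=0.7):
    """Return sorted index pairs whose estimated similarity meets the threshold."""
    sigs = [signature(t) for t in texts]
    return sorted(
        pair for pair in candidate_pairs(sigs)
        if estimated_jaccard(sigs[pair[0]], sigs[pair[1]]) >= threshold
    )


questions = [
    "what is the maximum operating temperature of the unit",
    "What is the maximum operating temperature of the unit",
    "how do I reset the admin password on the device",
    "what is the maximum operating temperature of the unit please",
]
dupes = near_duplicates(questions)
```

The first two questions differ only in capitalisation, so they produce identical signatures and are always paired. The third shares no shingles with the others and is left alone. Running this before the train/test split is created, as described above, keeps near-copies from landing on both sides of the split.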

Where project time is actually spent vs. where it should be spent

Where time gets spent: model selection and tuning
The visible, comparable, benchmark-driven work. Easy to discuss in reviews. Hard to justify stopping. Typical activities:
Model comparison and benchmark evaluation
Hyperparameter tuning and ablation runs
Architecture decisions between similar-performing options
Prompt engineering on top of a poorly framed problem

Where leverage actually is: pipeline design and data quality
The invisible, judgment-driven work. Hard to show in a demo. Determines whether the demo matters. Typical activities:
Data deduplication, coverage analysis, quality filtering
Evaluation metric design tied to business outcome
Failure mode taxonomy before deployment
Baseline measurement before any model is trained

The uncomfortable version of this observation is that most ML project timelines are inverted. The first half is spent on data wrangling that should have been planned upfront. The second half is spent on model iteration that is trying to compensate for problems that live in the data and the evaluation design. The model is being asked to solve problems that it cannot solve, because the problems are upstream of it.


Evaluation is not the last step

One of the most consistent mistakes in ML project design is treating evaluation as something that happens after the model is built. You train the model, you evaluate it, you iterate. The evaluation tells you how good the model is. That is the sequence, and it seems logical, and it almost guarantees that your evaluation metric will drift away from what actually matters.

The reason is straightforward. When evaluation is designed after training, it tends to be designed around what the model produces — which means it ends up measuring what the model is good at rather than what the business needs. The metric gets optimised, the metric improves, and the business outcome stays flat. The team is confused. The model is not confused. It did exactly what it was told.

The evaluation design for every project I have built comes before the first training run, not as a checkbox but as a constraint on what the model is permitted to optimise for. For the fine-tuning system, this meant three independent metrics: exact match for factual recall, ROUGE-L for coverage, and a held-out judge model for hallucination detection. The judge model was deliberately drawn from a different model family than the teacher. Otherwise the fine-tuned model could game the evaluation simply by reproducing the kind of output its own teacher prefers.

Each of those design decisions was made before any training data was generated. The evaluation was not measuring how good the model was. It was defining what good meant. That definition had to be in place before the model was trained, or the model would define it by default — in its own favour.
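
Two of the three metrics are simple enough to pin down in code before any data exists. The sketch below is an assumption about how they might be defined, not the project's harness: the normalisation rule (lower-casing and whitespace collapsing) and the choice of the F1 variant of ROUGE-L are mine. The judge model is omitted because it depends on an external model.

```python
def normalise(text):
    """Lower-case and collapse whitespace so formatting noise is not scored."""
    return " ".join(text.lower().split())


def exact_match(prediction, reference):
    """Strict factual-recall check: normalised strings must be identical."""
    return normalise(prediction) == normalise(reference)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L (F1 variant): LCS-based precision and recall over word tokens."""
    pred_tokens = normalise(prediction).split()
    ref_tokens = normalise(reference).split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

In this example the longest common subsequence is five tokens out of six on each side, so `score` is 5/6. Writing the functions down first forces decisions such as whether case and whitespace count, which is the kind of ambiguity the evaluation design is meant to remove.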


What the pre-model checklist looks like in practice

Across the projects in my portfolio, the sequence before any model is selected or trained follows a consistent pattern. It is not a methodology — it is a set of conditions that have to be true before a model choice is consequential. Until they are true, the model choice is a guess on top of unverified assumptions.

The problem is framed as an ML task with a specific input, output, and success criterion. Not a business goal — an ML task. "Improve customer retention" is not an ML task. "Predict, at 30-day lead time, which accounts in the mid-market segment have a greater than 60% probability of churning, using the features available in the CRM at the time of prediction" is an ML task. The specificity is the point. Vague task definitions produce models that solve something adjacent to the actual problem.

A baseline exists, has been measured, and is documented. What does the business do today without a model? What is the performance of the simplest possible rule on the task? That number is the floor. Every model claim is relative to it. A model that does not beat the baseline is not worth deploying, regardless of how impressive its absolute performance looks.
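
A minimal sketch of that comparison, assuming a classification task scored by accuracy and a majority-class rule as the simplest baseline. The labels, predictions, and the `min_gain` margin are invented for illustration, and a real project would use the metric chosen in its evaluation design rather than plain accuracy.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Share of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def majority_baseline(train_labels):
    """Simplest possible rule: always predict the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]


def compare_to_baseline(model_preds, baseline_preds, labels, min_gain=0.02):
    """Report both scores and whether the model clears the baseline by min_gain."""
    base = accuracy(baseline_preds, labels)
    model = accuracy(model_preds, labels)
    return {
        "baseline": base,
        "model": model,
        "deploy_candidate": model - base >= min_gain,
    }


# Hypothetical churn data: 80% of accounts stay.
train_labels = ["stay"] * 8 + ["churn"] * 2
test_labels = ["stay", "stay", "churn", "stay", "churn",
               "stay", "stay", "stay", "stay", "stay"]

rule = majority_baseline(train_labels)
baseline_preds = [rule] * len(test_labels)
model_preds = ["stay", "churn", "churn", "stay", "stay",
               "stay", "stay", "stay", "stay", "stay"]

report = compare_to_baseline(model_preds, baseline_preds, test_labels)
```

Here the model scores 80%, which sounds respectable until it is set beside the always-predict-stay rule, which also scores 80%. The report marks the model as not yet a deployment candidate.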

The training data has been audited for quality, coverage, and leakage. Quality: are the labels correct? Coverage: does the data represent the full distribution of inputs the model will see in production, including edge cases and low-frequency events? Leakage: is there any signal in the training data that would not be available at inference time? Each of these, left unexamined, is a silent failure waiting to surface after deployment.
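
Two of the cheaper leakage checks can be automated. The sketch below assumes text examples and a record of when each feature value was captured. The feature names, dates, and questions are hypothetical, and passing these checks does not prove the absence of leakage.

```python
from datetime import date


def normalise(text):
    """Lower-case and collapse whitespace before comparing examples."""
    return " ".join(text.lower().split())


def split_overlap(train_texts, test_texts):
    """Return test examples whose normalised text also appears in training."""
    seen = {normalise(t) for t in train_texts}
    return [t for t in test_texts if normalise(t) in seen]


def late_features(feature_dates, prediction_date):
    """Return names of features recorded after the moment of prediction."""
    return sorted(
        name for name, recorded in feature_dates.items()
        if recorded > prediction_date
    )


# Hypothetical examples.
train = ["What is the torque rating for SKU 1042?"]
test = ["what is the torque rating for sku 1042?", "How long is the warranty?"]
overlap = split_overlap(train, test)

feature_dates = {
    "seats_purchased": date(2024, 1, 10),
    "support_tickets_90d": date(2024, 2, 1),
    "cancellation_reason": date(2024, 3, 15),  # only known after the outcome
}
leaky = late_features(feature_dates, prediction_date=date(2024, 2, 1))
```

In this example `overlap` contains the repeated torque question and `leaky` contains only `cancellation_reason`, a field that could not have been known at prediction time.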

The evaluation metric is tied to the business outcome and defined precisely enough to be unambiguous. Two people looking at the same model output should be able to agree, independently, on whether it passes. If that is not possible, the metric is underspecified. An underspecified metric will be interpreted differently by different people, which means the team is not actually optimising toward a shared goal.

The three most likely failure modes are documented, and a response to each is designed. Not an exhaustive failure analysis — the three most likely ones. What happens when the model is wrong and confident? What happens when it is asked a question outside its training distribution? What happens when the underlying data changes and the model's training becomes stale? Each of these needs a designed response before deployment, not a reactive patch after an incident.

When all five conditions are met, the model selection decision becomes relatively straightforward. You have a well-defined task, a baseline to beat, clean data with known coverage, an unambiguous evaluation metric, and designed responses to the most likely failure modes. At that point you can compare models against each other meaningfully, because you know what you are comparing them on, and you know what the stakes are if they fail.

Before those conditions are met, model comparison is noise. You are benchmarking candidates for a role whose requirements have not been written.


The model question, asked at the right time

None of this means the model is unimportant. There are genuine decisions to make — between open-weight and proprietary models when data sovereignty matters, between retrieval-augmented and fine-tuned approaches when knowledge freshness is the constraint, between a 3-billion and 7-billion parameter variant when the latency budget and hardware envelope are fixed. These are real decisions with real consequences.

The point is that they are decisions about fitting a tool to a well-understood problem. When the problem is well-understood — when the task is precise, the baseline is measured, the data is clean, the evaluation is defined, and the failure modes are documented — the model decision is constrained enough to be made well. The space of viable options is smaller, the evaluation is credible, and the judgment call is actually a judgment call rather than a guess.

The conversation about which model to use is not the wrong conversation. It is the right conversation, asked at the wrong time. Ask it first, and you are picking a tool before you know what you are building. Ask it after the five conditions are met, and you are making a decision that the project has actually earned.

In my experience, most teams that are rigorous about the pre-model work find that the model decision is less exciting than they expected — because the problem is specific enough that only a few options are viable, the evaluation is tight enough that differences between them are measurable, and the baseline is clear enough that the bar for what counts as good is not ambiguous. The model gets selected efficiently and the team moves on to the harder work of making the system reliable in production.

That is what it should feel like. The model is a parameter. The system is the work.
