There is a conversation I have had, in different forms, on almost every ML project I have been brought into. It usually starts with someone asking which model the team is planning to use. Sometimes it is GPT-4 versus an open-source alternative. Sometimes it is which embedding model, or which fine-tuning approach, or whether to use a 7-billion or 13-billion parameter variant. The question is always asked with genuine interest and technical seriousness. It is also almost always the wrong first question.
Not because the model doesn't matter. It does, eventually. But by the time a model decision is actually consequential, there are usually five or six earlier decisions that have already determined whether the project can succeed at all — and none of them involve picking a model. They involve understanding the problem well enough to know what you are building, assembling data that is fit for purpose, defining what success looks like before you start, and thinking through how the system fails before it is in front of users.
The model is the part of the project that gets the most attention and offers the least leverage. The parts that actually determine outcomes are quieter, less glamorous, and almost entirely invisible in the way the industry talks about this work.
Why the model gets all the attention
The model selection conversation is appealing for understandable reasons. Models are comparable. They have benchmarks. There are research papers, leaderboards, and community consensus to draw on. Choosing a model feels like making a decision you can defend: you can point to numbers, cite papers, and reference deployment case studies from companies with engineering teams far larger than yours.
The earlier work — problem framing, data quality, evaluation design, failure mode analysis — is murkier. There are no leaderboards for whether your problem statement is correct. Nobody publishes a benchmark for label quality. The decisions are harder to make and harder to justify because they are judgment calls rather than comparisons. They require sitting with uncertainty rather than resolving it by pointing at a chart.
So the conversation gravitates toward the model, where the ground feels firmer, while the actual load-bearing decisions accumulate unexamined underneath it.
The model is the most legible part of an ML project. Legibility and importance are not the same thing.
This is not a criticism of teams that fall into this pattern. It is a structural feature of how the technology is marketed, discussed, and written about. The fix is not better judgment in individual projects — it is a different sequence. Start with the things that determine whether the project is viable. Get to the model when it is actually the thing that needs deciding.
What actually determines whether an ML project succeeds
Across the systems I have designed and built — a local-first RAG for manufacturing, a federated query engine for enterprise sales, a model merging architecture for financial analytics, an alignment pipeline for regulated output — the pattern that determines outcome is consistent. It is not the model. It is the work that happens before the model is selected, and the infrastructure that surrounds it once it is deployed.
That work falls into five categories. Each one is upstream of model selection. Each one, if it fails, cannot be fixed by a better model.
The model sits below all five of these. It is not irrelevant: a model that cannot perform the task at all will fail regardless of how well everything else is designed. But within the range of models that are capable of the task, the difference between them is usually smaller than the difference between good and poor execution in any of the five categories above.
The AlignR lesson — the pipeline is the demonstration
One of the clearest examples of this in my own work is AlignR, an alignment pipeline for regulated enterprise AI. It is a seven-stage pipeline whose stages include preference data collection, reward model training, a DPO loop, red-team evaluation, and compliance artefact generation. The model it runs on in the MVP is a small distilled model that fits on a free Kaggle T4 GPU.
There is a line in the AlignR documentation that captures the point exactly: the pipeline is the demonstration. The model is a parameter.
What that means in practice is that every stage of the pipeline — the preference data collection methodology, the reward model training approach, the red-team suite, the compliance artefact bundle — was designed and built to production standards regardless of model size. The same scripts, the same configuration files, the same evaluation harness, and the same artefact schemas run unchanged when the model is upgraded from a small distilled variant to a production-grade 7B or 13B model on proper GPU infrastructure.
The model is swappable. The pipeline is not. The pipeline is where the value lives.
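As a rough illustration of "the model is a parameter", here is a minimal sketch. It is not AlignR's actual code: the class, the field names, the stage function names, and the model names are all hypothetical placeholders, and only the stages named in the text are listed. The point it shows is that upgrading the model changes one configuration field while the stage list and artefact schema stay the same.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical configuration; only model_name changes on upgrade."""
    model_name: str
    eval_suite: str = "red_team_v1"          # placeholder identifier
    artefact_schema: str = "compliance_v1"   # placeholder identifier


# Stage names follow the stages mentioned in the text; the real
# pipeline's stage boundaries and names are not known here.
STAGES: List[str] = [
    "collect_preferences",
    "train_reward_model",
    "dpo_loop",
    "red_team_eval",
    "generate_compliance_artefacts",
]


def run_pipeline(cfg: PipelineConfig) -> List[Dict[str, str]]:
    """Walk every stage in order and return one record per stage.

    A real stage would do work here; this sketch only records which
    model and schema each stage would have been run with.
    """
    return [
        {"stage": stage, "model": cfg.model_name, "schema": cfg.artefact_schema}
        for stage in STAGES
    ]


# Swapping the model is a one-field change; nothing else is touched.
small = PipelineConfig(model_name="small-distilled")   # placeholder name
large = replace(small, model_name="production-7b")     # placeholder name
```

Because the configuration is frozen and the stages never branch on model size, anything that passes with the small model exercises the same code path the larger model will use.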
A well-designed pipeline with a small model will outperform a poorly designed pipeline with a large one. The pipeline is the system. The model is a component.
This is not a novel insight. It is what every experienced ML practitioner discovers after their first few production deployments. What is novel is how rarely it changes the conversation — which keeps starting at the model rather than at the pipeline.
The data quality problem that no model can fix
On one of the fine-tuning projects I built, the team arrived with a training dataset of approximately 2,000 question-answer pairs. The pairs had been generated synthetically by a large teacher model on a product knowledge base. The generation quality looked good on inspection — fluent, detailed, apparently accurate. The team wanted to jump to fine-tuning and model selection.
The data quality pipeline told a different story.
MinHash locality-sensitive hashing identified that roughly 18% of the training examples were near-duplicates — different phrasings of the same underlying question with the same answer. The effective training set was smaller than it appeared. More significantly, the teacher model had generated a number of answers that were factually incorrect on specific SKU-level product details — details where the correct answer required exact recall rather than plausible reasoning. The teacher model was reasoning to a confident, wrong conclusion.
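For readers who have not used MinHash before, the sketch below shows the general technique using only the standard library. It is not the project's pipeline: the signature length, the band count, the three-word shingles, and the 0.5 similarity threshold are all assumptions chosen for illustration, and a production run would more likely use a tested library than hand-rolled hashing.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64                 # signature length (assumed)
BANDS = 16                    # number of LSH bands (assumed)
ROWS = NUM_PERM // BANDS      # signature rows per band


def shingles(text, k=3):
    """Return the set of k-word shingles of the lower-cased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens):
    """Signature: the minimum hash value under each seeded hash function."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_PERM))


def candidate_pairs(signatures):
    """LSH banding: documents that share any band bucket become candidates."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            buckets[(b, sig[b * ROWS:(b + 1) * ROWS])].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates(docs, threshold=0.5):
    """Return candidate pairs whose estimated similarity meets the threshold."""
    sigs = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
    return {
        pair for pair in candidate_pairs(sigs)
        if estimated_jaccard(sigs[pair[0]], sigs[pair[1]]) >= threshold
    }
```

Calling `near_duplicates` on a dictionary that maps example IDs to question text returns the ID pairs worth reviewing. The banding step is what keeps this cheaper than comparing every pair, and the threshold decides how aggressive the filter is, so both are worth tuning against a hand-checked sample.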
A model fine-tuned on that dataset would have learned to be confidently wrong about specific facts at a predictable rate. No model choice would have fixed that. The only fix was upstream — catching the teacher hallucinations in the quality filter before they entered the training set, running deduplication before the training split was created, and verifying fact-level accuracy on the subset of questions where exact recall mattered.
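The fact-level check can be sketched in a few lines, under some loud assumptions: that each exact-recall question is tagged with the SKU and attribute it asks about, and that a trusted catalogue of reference values exists. The `QAPair` fields, the catalogue layout, and the function names are hypothetical, not the project's actual schema.

```python
import re
from typing import Dict, List, NamedTuple


class QAPair(NamedTuple):
    """Hypothetical record for a question that needs exact recall."""
    question: str
    answer: str
    sku: str     # which product the question is about
    field: str   # which attribute the answer must state


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so formatting does not matter."""
    return re.sub(r"\s+", " ", text.strip().lower())


def states_value(answer: str, expected: str) -> bool:
    """True if the expected value appears in the answer as a whole token run.

    The lookarounds stop "24 months" from matching inside "124 months".
    """
    pattern = r"(?<!\w)" + re.escape(normalise(expected)) + r"(?!\w)"
    return re.search(pattern, normalise(answer)) is not None


def failed_fact_checks(
    pairs: List[QAPair], catalogue: Dict[str, Dict[str, str]]
) -> List[QAPair]:
    """Return pairs whose answer does not state the catalogue value.

    Pairs with no reference value are also returned, because an
    unverifiable answer should not enter the training set unreviewed.
    """
    failures = []
    for pair in pairs:
        expected = catalogue.get(pair.sku, {}).get(pair.field)
        if expected is None or not states_value(pair.answer, expected):
            failures.append(pair)
    return failures
```

A string match like this only confirms that the right value is present; it does not catch an answer that states the right value alongside a wrong one, so it works as a filter before human review rather than a replacement for it.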
Model work is the visible, comparable, benchmark-driven work. It is easy to discuss in reviews and hard to justify stopping.
Pre-model work is the invisible, judgment-driven work. It is hard to show in a demo, and it determines whether the demo matters.
The uncomfortable version of this observation is that most ML project timelines are inverted. The first half is spent on data wrangling that should have been planned upfront. The second half is spent on model iteration that is trying to compensate for problems that live in the data and the evaluation design. The model is being asked to solve problems that it cannot solve, because the problems are upstream of it.
Evaluation is not the last step
One of the most consistent mistakes in ML project design is treating evaluation as something that happens after the model is built. You train the model, you evaluate it, you iterate. The evaluation tells you how good the model is. That is the sequence, and it seems logical, and it almost guarantees that your evaluation metric will drift away from what actually matters.
The reason is straightforward. When evaluation is designed after training, it tends to be designed around what the model produces — which means it ends up measuring what the model is good at rather than what the business needs. The metric gets optimised, the metric improves, and the business outcome stays flat. The team is confused. The model is not confused. It did exactly what it was told.
The evaluation design for every project I have built comes before the first training run — not as a checkbox, but as a constraint on what the model is permitted to optimise for. For the fine-tuning system, this meant three independent metrics: exact match for factual recall, ROUGE-L for coverage, and a held-out judge model for hallucination detection. The judge model was explicitly a different model family from the teacher to prevent the evaluation from being gamed by a model that had learned to produce outputs that scored well on its own teacher's preferences.
Each of those design decisions was made before any training data was generated. The evaluation was not measuring how good the model was. It was defining what good meant. That definition had to be in place before the model was trained, or the model would define it by default — in its own favour.
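Two of the three metrics are simple enough to write down before any training run, which is part of the point. The sketch below is illustrative only: it uses whitespace tokenisation and the F1 variant of ROUGE-L, both of which are assumptions, and the judge-family guard is a hypothetical helper that encodes the rule described above rather than the project's actual check.

```python
from typing import List


def _tokens(text: str) -> List[str]:
    """Whitespace tokenisation on lower-cased text (an assumption)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    """Factual recall: the answers must match token for token."""
    return _tokens(prediction) == _tokens(reference)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L as the harmonic mean of LCS-based precision and recall."""
    pred, ref = _tokens(prediction), _tokens(reference)
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def check_judge_independence(teacher_family: str, judge_family: str) -> None:
    """Fail fast if the judge shares a model family with the teacher."""
    if teacher_family.strip().lower() == judge_family.strip().lower():
        raise ValueError("judge model must come from a different family than the teacher")
```

Fixing these functions, and the thresholds applied to them, in version control before data generation starts is one way to keep the definition of "good" from drifting toward whatever the model happens to produce.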
What the pre-model checklist looks like in practice
Across the projects in my portfolio, the sequence before any model is selected or trained follows a consistent pattern. It is not a methodology — it is a set of conditions that have to be true before a model choice is consequential. Until they are true, the model choice is a guess on top of unverified assumptions.