The problem arrived as a deployment cost complaint, which is usually how the interesting architectural problems arrive. An enterprise finance team had two fine-tuned models in production: one trained on earnings transcripts, 10-Ks, and financial covenants — the quantitative reasoning workhorse — and one trained on client communication, meeting summaries, and investor-facing language. Both models were good at their respective jobs. Neither could do the other's job reliably.
The workflow that broke it was a simple one on paper: an analyst needed to draft a client memo that explained a DCF analysis. To do that well, you need quantitative precision and readable prose in the same output — the kind of thing that feels natural when a skilled analyst writes it, and falls apart when you pipeline two models through a string hand-off. The quant model produces precise numbers. The comms model receives a text string and generates fluent language around it. What it does not receive is the intermediate attention states and logits that encode the quantitative reasoning. The boundary between the two models destroys exactly the thing the output needed most.
The obvious answer was to fine-tune a single model that could do both. That answer came with a cost: curating a training dataset that covered both domains adequately, running a training job on a sufficiently large model, evaluating the result, iterating. Weeks of work and thousands of dollars in GPU compute — for a problem where both of the required capabilities already existed, in trained form, in two models that shared the same base architecture.
That is when the merge became worth examining seriously.
What weight merging actually is
Model merging is the practice of combining the weights of two or more fine-tuned models into a single model — without any training, without any labelled data, and without any gradient updates. The result is a single model that, when the conditions are right, retains meaningful capability from each of its source models.
The intuition behind why this works starts with task vectors. When you fine-tune a model, the fine-tuning changes the weights relative to the base model. The difference between the fine-tuned weights and the base model weights — the task vector — encodes everything the fine-tuning taught the model. If two models were fine-tuned from the same base, you can represent each of them as a base model plus a task vector. Adding those task vectors together and applying the sum to the base model produces a merged model that, in principle, inherits both fine-tunings.
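The arithmetic described above can be sketched in a few lines. This is a minimal illustration, assuming each checkpoint is a plain dictionary of NumPy arrays with identical keys and shapes; the names `task_vector`, `apply_task_vectors`, and the toy `layer.weight` tensor are hypothetical and not taken from any particular library.

```python
import numpy as np

def task_vector(finetuned, base):
    """Per-tensor difference between fine-tuned and base weights."""
    return {name: finetuned[name] - base[name] for name in base}

def apply_task_vectors(base, vectors, scale=1.0):
    """Add the scaled sum of task vectors back onto the base weights."""
    merged = {}
    for name, w in base.items():
        merged[name] = w + scale * sum(v[name] for v in vectors)
    return merged

# Toy example: one 2x2 tensor stands in for a whole checkpoint (assumption).
base = {"layer.weight": np.zeros((2, 2))}
quant = {"layer.weight": np.array([[0.5, 0.0], [0.0, 0.0]])}
comms = {"layer.weight": np.array([[0.0, 0.0], [0.0, -0.25]])}

tv_quant = task_vector(quant, base)
tv_comms = task_vector(comms, base)
merged = apply_task_vectors(base, [tv_quant, tv_comms])
```

In this toy case the two task vectors touch different parameters, so plain addition keeps both updates intact. Real fine-tunes overlap, which is where the trouble starts.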
In practice, it is more complicated than addition. Task vectors interfere with each other. When two models disagree on the sign of a weight update for the same parameter — one pushes the weight up, the other pushes it down — naive averaging produces cancellation. Neither specialisation is preserved. The mid-to-late transformer layers, where domain-specific representations concentrate, show the highest conflict rates. Managing that interference is where the interesting engineering lives.
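One way to make the cancellation concrete is to measure how often two task vectors disagree on sign. The sketch below assumes a simple definition of the sign conflict rate (the share of parameters changed by both models whose updates point in opposite directions); the exact metric used in any given merge tool may differ, and the function name is hypothetical.

```python
import numpy as np

def sign_conflict_rate(delta_a, delta_b):
    """Fraction of parameters, among those both models changed, with opposite-sign updates."""
    both_nonzero = (delta_a != 0) & (delta_b != 0)
    if not both_nonzero.any():
        return 0.0
    opposite = np.sign(delta_a) != np.sign(delta_b)
    return float((opposite & both_nonzero).sum() / both_nonzero.sum())

# Toy deltas for four parameters (illustrative values only).
delta_quant = np.array([0.4, -0.2, 0.3, 0.1])
delta_comms = np.array([-0.4, -0.1, 0.2, -0.3])

rate = sign_conflict_rate(delta_quant, delta_comms)  # parameters 0 and 3 conflict
averaged = (delta_quant + delta_comms) / 2           # parameter 0 cancels to zero
```

Here the first parameter is pushed up by one model and down by the other with equal strength, so the average erases both updates.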
Fine-tuning encodes new capabilities into a model. Merging combines capabilities that already exist. These are solutions to different problems — and confusing them is expensive.
SLERP, TIES-Merging, and DARE are three merge techniques that treat interference with increasing sophistication. SLERP (spherical linear interpolation) interpolates between two models geometrically, layer by layer, with no explicit interference handling; it is the simplest merge and establishes the quality floor. TIES-Merging adds sign conflict resolution: it trims low-magnitude delta parameters and elects a consensus sign for each parameter before merging, reducing the cancellation that naive averaging produces. DARE adds a sparsification step ahead of that: it randomly zeroes out a fraction of each model's delta parameters and rescales the survivors, which reduces the number of parameters competing for the same weight slot before TIES resolves the conflicts. The combination, DARE-TIES, consistently outperforms either technique independently on domains with significant inter-model interference.
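The first two techniques can be sketched on single tensors. This is a simplified illustration, not a production merge: `slerp` interpolates two weight tensors along the arc between them, and `ties_merge` follows the trim, elect-sign, and agreeing-average steps described above. The function names, the per-tensor trimming, and the toy values are assumptions made for the example.

```python
import numpy as np

def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical interpolation between two weight tensors of the same shape."""
    a, b = w_a.ravel(), w_b.ravel()
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < eps:  # (anti)parallel tensors: fall back to linear interpolation
        return (1 - t) * w_a + t * w_b
    return (np.sin((1 - t) * omega) / sin_omega) * w_a + (np.sin(t * omega) / sin_omega) * w_b

def ties_merge(deltas, density):
    """Trim small deltas, elect a sign per parameter, average the deltas that agree."""
    trimmed = []
    for d in deltas:
        k = max(1, int(round(density * d.size)))        # how many entries to keep
        threshold = np.sort(np.abs(d).ravel())[-k]       # k-th largest magnitude
        trimmed.append(np.where(np.abs(d) >= threshold, d, 0.0))
    stacked = np.stack(trimmed)
    elected = np.sign(stacked.sum(axis=0))               # sign carrying more total mass
    agrees = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agrees.sum(axis=0), 1)           # avoid division by zero
    return (stacked * agrees).sum(axis=0) / counts

# Toy deltas with sign conflicts on the first two parameters.
d_quant = np.array([0.5, -0.1, 0.3, 0.0])
d_comms = np.array([-0.2, 0.4, 0.3, 0.05])
merged_delta = ties_merge([d_quant, d_comms], density=0.75)
naive_delta = (d_quant + d_comms) / 2
```

On the conflicting parameters, the naive average shrinks both updates toward zero, while the sign election keeps the stronger update at full strength and drops the weaker, opposing one.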
The four conditions that make merging viable
Merging is not a universal substitute for fine-tuning. It works under a specific set of conditions, and when those conditions do not hold, it fails in predictable ways. The decision to merge rather than fine-tune should be an explicit check against these conditions, not a default or an optimisation shortcut.
When all four conditions are met — shared base, existing capabilities, orthogonal domains, acceptable quality ceiling — merging is not just a viable alternative to fine-tuning. It is the faster, cheaper, and operationally simpler choice. The merge runs in roughly twenty minutes on a CPU. It produces a static model file that deploys identically to any other model. There is no training pipeline to maintain, no retraining cadence to manage, no GPU budget to allocate for future training runs.
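One way to keep the check explicit is to record the four conditions as a small checklist that reports which ones fail. This is only a sketch: the class and field names are hypothetical, and how each condition is actually assessed (for example, what counts as an acceptable quality ceiling) is left to the team making the decision.

```python
from dataclasses import dataclass, fields

@dataclass
class MergeChecklist:
    shared_base: bool                 # both models fine-tuned from the same base
    capabilities_exist: bool          # each capability is already trained somewhere
    domains_orthogonal: bool          # the specialisations do not heavily overlap
    quality_ceiling_acceptable: bool  # some loss versus the specialists is tolerable

    def failed(self):
        """Names of the conditions that do not hold."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def merge_is_viable(self):
        """True only when all four conditions hold."""
        return not self.failed()

check = MergeChecklist(
    shared_base=True,
    capabilities_exist=True,
    domains_orthogonal=False,
    quality_ceiling_acceptable=True,
)
```

With the values above, `check.failed()` names the one unmet condition, which makes the reason for falling back to fine-tuning visible in a review.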
The problem that merging actually solves
The core problem that merging is designed to address is the multi-specialist deployment pattern — the situation where an organisation has invested in fine-tuning separate models for separate tasks, and now needs outputs that require both specialisations simultaneously.
This pattern is more common than it might seem. Between 2023 and 2025, the standard practice at AI-forward financial institutions was to fine-tune one model per workflow: one for quantitative analysis, one for client communication, one for regulatory language, one for internal reporting. Each model is good at its task. Real analyst work crosses all of them constantly. The pipeline that tries to chain these models together loses coherence at every hand-off, because a text string carries far less information than the intermediate attention states it replaced.
The alternative approaches to this problem, the ones usually tried before merging is considered, each have significant costs. A logit-level ensemble, where both models decode simultaneously and their log-probabilities are combined, preserves the quality of both specialists but doubles the serving cost and requires synchronised decoding infrastructure. A task-routing classifier sends each query to the best specialist for that task, but produces no hybrid output and fails on queries that genuinely require both domains. Multiple LoRA adapters on a shared base can be swapped per query, but that adds 80–200 milliseconds of adapter loading latency and requires stateful serving infrastructure.
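To illustrate why the logit-level ensemble is expensive, the sketch below shows a single decoding step under one simple combination rule: a weighted sum of the two models' log-probabilities followed by a greedy pick. The function names, the equal weighting, and the three-token vocabulary are assumptions for the example; real ensembles may combine distributions differently.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def ensemble_next_token(logits_a, logits_b, weight_a=0.5):
    """Greedy next-token choice from a weighted sum of two models' log-probabilities."""
    combined = weight_a * log_softmax(logits_a) + (1 - weight_a) * log_softmax(logits_b)
    return int(np.argmax(combined))

# Both models must produce logits for the same step over the same vocabulary.
logits_quant = np.array([2.0, 1.0, 0.0])
logits_comms = np.array([0.0, 1.5, 0.0])
token = ensemble_next_token(logits_quant, logits_comms)
```

Every generated token requires a forward pass through both models on the same prefix, which is where the doubled serving cost and the synchronised decoding requirement come from.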
Fine-tuning adds what a model does not have. It is the right tool when the domain knowledge, task format, or output style you need has not been trained into any model you can access.
Merging combines what already exists. It is the right tool when the required capabilities are already trained, just distributed across models that share a common base.
Weight merging produces a single static model file. It deploys like any other model. It has no runtime dependency on a second model, no adapter swap latency, no synchronised decoding requirement. At the quality level FusionLM achieves — 98.5% of pipeline baseline at half the serving cost, with no training run — merging is not a compromise. For the specific conditions it was designed for, it is a better architecture than the alternatives it replaced.
The interference problem — and why layer position matters
The reason DARE-TIES outperforms simpler merge techniques is not arbitrary. It is grounded in the structure of where interference occurs in a transformer model, and understanding that structure is what makes the technique selection principled rather than trial-and-error.
Early transformer layers — roughly the first quarter of the model's blocks — encode general-purpose representations that both specialists share. These layers are not fighting over the same weight space, because both models learned similar things there. The sign conflict rate in early layers is relatively low. Aggressive sparsification in these layers is unnecessary and potentially harmful, because you are pruning parameters that are not actually interfering.
Mid-to-late layers — roughly blocks 12 through 24 in a 32-block model — are where domain-specific representations concentrate. This is where the quant specialist learns that a particular attention pattern corresponds to a financial ratio calculation, and where the comms specialist learns that a different attention pattern corresponds to professional register. These representations are legitimately different. They compete for the same weight space. Sign conflict rates in these layers peak at 0.44 to 0.52 — nearly half the parameters in these blocks have task vectors pointing in opposite directions.
DARE addresses this by sparsifying before sign election: it randomly zeroes out 40% of the delta parameters in each model (density=0.6), then rescales the survivors by 1/density, roughly 1.67×. The zeroing is random rather than targeted, but the rescaling keeps the expected value of each delta parameter unchanged, so the sparsified task vector matches the original on average. The effect is to reduce the number of parameters in active conflict before TIES resolves the remaining conflicts by sign election. Fewer conflicts going into TIES means less cancellation, and less cancellation means both specialisations survive the merge more intact.
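The drop-and-rescale step is short enough to sketch directly. This is a minimal illustration on a single flat tensor, assuming NumPy's `Generator` for the random mask; the function name `dare` and the constant toy delta are hypothetical.

```python
import numpy as np

def dare(delta, density, rng):
    """Keep each delta entry with probability `density`, rescale survivors by 1/density."""
    keep = rng.random(delta.shape) < density
    return np.where(keep, delta / density, 0.0)

rng = np.random.default_rng(0)
delta = np.full(100_000, 0.01)             # toy task vector: every delta is 0.01
sparse = dare(delta, density=0.6, rng=rng)

kept_fraction = np.count_nonzero(sparse) / sparse.size  # close to 0.6
mean_after = sparse.mean()                               # close to the original 0.01
```

About 40% of the entries become zero, each survivor grows to 0.01 / 0.6, and the mean of the sparsified vector stays near the original value. That is the sense in which the rescaling compensates for the dropped parameters.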
The interference profile across transformer layers is not uniform. A merge configuration that treats all layers identically is leaving quality on the table in the layers where interference is highest.
The planned next step for FusionLM is per-layer density: applying a lower density (more aggressive sparsification) to the high-conflict mid-to-late blocks and a higher density to the early layers where interference is low. The current uniform density=0.6 is the value a grid search identified as best when a single density is applied to every layer. Per-layer density would push the quality of the merge closer to the joint retraining upper bound without any training cost.
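A per-layer schedule could be as simple as a function from block index to density. The sketch below is purely illustrative: the densities 0.8 and 0.5 are placeholder values, not tuned results, and the block range reuses the rough 12-through-24 span mentioned earlier for a 32-block model. The function name is hypothetical.

```python
def density_for_block(block_idx, low_conflict=0.8, high_conflict=0.5, conflict_range=(12, 24)):
    """Return a lower density (more sparsification) inside the high-conflict block range."""
    start, end = conflict_range
    return high_conflict if start <= block_idx <= end else low_conflict

# One density per transformer block, replacing a single uniform value.
schedule = [density_for_block(i) for i in range(32)]
```

In practice the boundaries and values would come from measured per-block conflict rates and a search over candidate schedules rather than from fixed constants.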
The decision audit — merge or fine-tune
The question that organises this decision is not "which technique is better?" It is "what kind of problem do I actually have?" Fine-tuning solves a capability gap — something the model cannot do that it needs to learn. Merging solves a capability consolidation problem — something two models can do separately that needs to happen in a single model. Diagnosing which problem you have determines which tool is appropriate.