mergekit: Combining LLMs Without Training by Operating in Weight Space

Hook

What if you could create a new LLM that combines the coding skills of one model with the creative writing abilities of another—without a single GPU-hour of training?

Context

The traditional path to a specialized language model is expensive: start with a base model, gather domain-specific data, and fine-tune for days or weeks on expensive hardware. But the LLM landscape has evolved into a rich ecosystem where hundreds of models exist, each fine-tuned for different capabilities—coding, creative writing, mathematics, specific languages, or domain expertise. This raises an intriguing question: can we combine the learned capabilities of multiple models directly, bypassing the training process entirely?

Model merging operates on a counterintuitive principle: the weights of neural networks often lie in regions of parameter space where interpolation is meaningful. If two models started from the same base and were fine-tuned differently, averaging their weights (or combining them through more sophisticated methods) can sometimes yield a model that exhibits both sets of capabilities. The challenge has been practical: holding several 70B-parameter models in memory at once takes hundreds of gigabytes of RAM, since each one is roughly 140 GB at 16-bit precision. mergekit solves this by streaming model weights from disk, performing operations lazily, and writing the merged result without ever holding complete models in memory. This puts merging within reach of consumer hardware: many merges run entirely on CPU, or can be accelerated with as little as 8 GB of VRAM.
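For a rough sense of scale, the arithmetic behind the memory problem can be sketched as follows. The helper name `checkpoint_size_gb` is made up for illustration, and the figures count raw weights only (no activations, optimizer state, or framework overhead).

```python
def checkpoint_size_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Two 70B-parameter models held in memory at the same time, weights only.
fp16_total = 2 * checkpoint_size_gb(70e9, 2)  # 2 bytes per parameter
fp32_total = 2 * checkpoint_size_gb(70e9, 4)  # 4 bytes per parameter
print(f"fp16: {fp16_total:.0f} GB, fp32: {fp32_total:.0f} GB")
```

Each additional source model adds the same amount again, which is why a tensor-by-tensor approach matters.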

Technical Insight

At its core, mergekit is an out-of-core tensor processing engine designed specifically for the structure of transformer language models. Instead of loading entire model checkpoints, it uses lazy loading to fetch only the tensors needed for each operation, processes them, and streams the results to disk. This architecture enables merging models that would be impossible to hold in memory simultaneously.
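The out-of-core idea can be illustrated with a toy sketch. This is not mergekit's implementation: it assumes a made-up on-disk layout with one JSON file per tensor (a stand-in for sharded checkpoint files), and the function names are hypothetical. The point is only that the loop touches one tensor pair at a time.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(directory: Path, tensors: dict) -> None:
    """Write each tensor to its own file (toy stand-in for a sharded checkpoint)."""
    directory.mkdir(parents=True, exist_ok=True)
    for name, values in tensors.items():
        (directory / f"{name}.json").write_text(json.dumps(values))


def load_tensor(directory: Path, name: str) -> list:
    """Read a single tensor from disk, on demand."""
    return json.loads((directory / f"{name}.json").read_text())


def streaming_average(dir_a: Path, dir_b: Path, out_dir: Path) -> None:
    """Average two checkpoints while holding only one tensor pair in memory."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(dir_a.glob("*.json")):
        name = path.stem
        a = load_tensor(dir_a, name)
        b = load_tensor(dir_b, name)
        merged = [(x + y) / 2 for x, y in zip(a, b)]
        (out_dir / path.name).write_text(json.dumps(merged))
        # a, b, and merged are rebound on the next iteration, so peak memory
        # is bounded by the largest single tensor, not by the whole model.


root = Path(tempfile.mkdtemp())
save_checkpoint(root / "a", {"layer0.weight": [1.0, 2.0], "layer1.weight": [3.0, 4.0]})
save_checkpoint(root / "b", {"layer0.weight": [3.0, 6.0], "layer1.weight": [5.0, 0.0]})
streaming_average(root / "a", root / "b", root / "merged")
print(load_tensor(root / "merged", "layer0.weight"))
```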

The simplest merge is linear interpolation, defined in a YAML configuration:

merge_method: linear
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  weight: 0.5

This configuration merges two 7B models by averaging their weights. The layer_range specifies which transformer layers to take from each model. The same field underpins 'Frankenmerging', where slices from different models are stacked rather than averaged (mergekit's passthrough method), for example the first 16 layers from one model and the last 16 from another. For the linear merge above, mergekit walks through each layer, loads the corresponding tensors from both models, computes 0.5 * weights_a + 0.5 * weights_b, and writes the result.
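The per-tensor computation is small enough to sketch in plain Python. This is an illustration on flat lists of floats, not mergekit code; `linear_merge` and its `normalize` flag are names chosen for this example.

```python
def linear_merge(tensors, weights, normalize=True):
    """Weighted sum of same-shaped flat tensors, given as lists of floats."""
    if len(tensors) != len(weights):
        raise ValueError("need one weight per tensor")
    if normalize:
        # Rescale so the weights sum to 1, turning the sum into an average.
        total = sum(weights)
        weights = [w / total for w in weights]
    return [
        sum(w * t[i] for w, t in zip(weights, tensors))
        for i in range(len(tensors[0]))
    ]


weights_a = [1.0, -2.0, 4.0]
weights_b = [3.0, 2.0, 0.0]
print(linear_merge([weights_a, weights_b], [0.5, 0.5]))  # [2.0, 0.0, 2.0]
```

With normalization on, weights of 1.0 and 1.0 give the same result as 0.5 and 0.5; only their ratio matters.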

But linear interpolation is just the beginning. mergekit implements several more sophisticated algorithms, each suited to different scenarios. SLERP (Spherical Linear Interpolation) treats weight vectors as points on a hypersphere and interpolates along the arc between them rather than along the straight line. A straight-line average of two vectors that point in different directions can shrink the norm; following the arc avoids that shrinkage and often produces more stable results:

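merge_method: slerp
base_model: base-model  # SLERP takes exactly two models; one is named as the base
slices:
  - sources:
      - model: base-model
        layer_range: [0, 32]
      - model: fine-tuned-model
        layer_range: [0, 32]
parameters:
  t: 0.5  # interpolation factor: 0 returns the base model, 1 the other model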
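The textbook SLERP formula can be sketched on flat vectors as below. This is a generic illustration, not mergekit's code; the fallback to linear interpolation for nearly parallel or zero-length vectors is a common convention assumed here.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors."""
    def lerp():
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    if norm0 == 0.0 or norm1 == 0.0:
        return lerp()
    # Angle between the two vectors, computed from their unit directions.
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    theta = math.acos(max(-1.0, min(1.0, dot)))
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        # Nearly parallel vectors: the arc and the straight line coincide.
        return lerp()
    s0 = math.sin((1 - t) * theta) / sin_theta
    s1 = math.sin(t * theta) / sin_theta
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


midpoint = slerp(0.5, [1.0, 0.0], [0.0, 1.0])
print(midpoint, math.hypot(*midpoint))
```

For these two orthogonal unit vectors the SLERP midpoint keeps a norm of about 1.0, whereas the linear midpoint [0.5, 0.5] has a norm of about 0.71.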

For combining multiple task-specific fine-tunes, TIES (Trim, Elect Sign & Merge) and DARE (Drop And REscale) build on task arithmetic. These methods compute delta weights (the difference between each fine-tuned model and the base), resolve conflicts between deltas, and add them back to the base. TIES, for example, keeps only the largest delta values by magnitude, resolves sign conflicts through voting, and merges the result:

merge_method: ties
base_model: base-model  # deltas are computed relative to this model
models:
  - model: base-model
    # no parameters, this is the base
  - model: math-tuned-model
    parameters:
      density: 0.6
      weight: 1.0
  - model: code-tuned-model
    parameters:
      density: 0.6
      weight: 1.0
parameters:
  normalize: true
  int8_mask: true

The density parameter controls what fraction of each model's delta weights to keep: 0.6 retains the largest 60% by magnitude. Lower values prune small changes more aggressively, which can reduce interference between models.
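A minimal pure-Python sketch of the three TIES steps, on flat lists of floats. The assumptions are loud ones: every model gets weight 1.0, the sign is elected by the sign of the summed trimmed deltas, and `trim` and `ties_merge` are illustrative names rather than mergekit functions.

```python
def trim(delta, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = max(1, round(density * len(delta)))
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(order[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(delta)]


def ties_merge(base, finetuned, density=0.6):
    """Trim each task delta, elect a sign per parameter, average agreeing values."""
    deltas = [trim([f - b for f, b in zip(ft, base)], density) for ft in finetuned]
    merged = []
    for i, b in enumerate(base):
        column = [d[i] for d in deltas]
        # Elect the sign carrying the larger total magnitude across models.
        elected = 1.0 if sum(column) >= 0 else -1.0
        agreeing = [c for c in column if c * elected > 0]
        update = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + update)
    return merged


base = [0.0, 0.0, 0.0, 0.0, 0.0]
math_tuned = [1.0, 0.1, -0.5, 0.0, 0.4]
code_tuned = [0.8, -0.6, 0.3, 0.05, -0.2]
print(ties_merge(base, [math_tuned, code_tuned], density=0.6))
```

In this toy input the third parameter is a genuine conflict (-0.5 versus 0.3): the negative side wins the vote, so only the math model's delta is applied there, while the small 0.1 and 0.05 deltas are pruned before they can interfere.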

Perhaps most ambitious is mergekit's ability to create Mixture of Experts (MoE) architectures from dense models. This transforms multiple models into experts that are selectively activated:

base_model: mistralai/Mistral-7B-v0.1
experts:
  - source_model: math-expert-7B
    positive_prompts:
      - "solve"
      - "calculate"
      - "mathematics"
  - source_model: code-expert-7B
    positive_prompts:
      - "code"
      - "function"
      - "implement"
gate_mode: hidden  # or: cheap_embed, random (random ignores positive_prompts)

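This creates a model where different experts handle different types of queries. The routing gates are not trained during the merge; they are initialized heuristically. With gate_mode hidden they come from hidden-state representations of the positive prompts, cheap_embed uses raw token embeddings instead, and random initializes them randomly (a reasonable starting point if you plan to fine-tune afterwards). The resulting model can be more efficient than running multiple models separately, though the gating quality varies depending on the method.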
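The idea of deriving gates from prompts can be shown with a deliberately crude toy. Everything here is an assumption for illustration: `embed` uses normalized letter counts as a stand-in for a real model's hidden states, and `init_gate` and `route` are hypothetical names, not mergekit functions.

```python
import math
import string


def embed(text):
    """Toy stand-in for a hidden state: unit-normalized letter counts."""
    counts = [text.lower().count(ch) for ch in string.ascii_lowercase]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def init_gate(prompts):
    """Gate vector for one expert: the mean embedding of its positive prompts."""
    vectors = [embed(p) for p in prompts]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def route(query, gates, top_k=1):
    """Score every expert against the query and return the top-k expert names."""
    q = embed(query)
    scores = {name: sum(a * b for a, b in zip(q, g)) for name, g in gates.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


gates = {
    "math-expert-7B": init_gate(["solve", "calculate", "mathematics"]),
    "code-expert-7B": init_gate(["code", "function", "implement"]),
}
print(route("calculate", gates))             # ['math-expert-7B']
print(route("implement a function", gates))  # ['code-expert-7B']
```

A real router scores per-token hidden states inside every layer, so the quality of the prompts and of the representation used to encode them matters far more than this sketch suggests.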

mergekit also ships specialized tools: mergekit-extract-lora extracts LoRA adapters from full fine-tunes, and mergekit-tokensurgeon transplants a tokenizer onto a model, which is critical when combining models trained with different tokenizers. The latter uses embedding space interpolation to create reasonable initializations for tokens that exist in one vocabulary but not another.

Gotcha

Model merging is fundamentally experimental, and mergekit makes no guarantees about output quality. A configuration that works brilliantly for one set of models might produce incoherent outputs for another. The merged model might hallucinate more, lose capabilities from the source models, or exhibit unexpected behaviors. You need robust evaluation pipelines—perplexity metrics, benchmark suites, and human evaluation—to validate results. There's no substitute for testing.
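As one concrete piece of such a pipeline, perplexity can be computed from per-token log-probabilities. The sketch below assumes you already have natural-log probabilities from scoring the same held-out text with each model; the numbers are hypothetical placeholders, not measurements.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probabilities for the same held-out text.
source_logprobs = [-1.2, -0.4, -2.0, -0.9]
merged_logprobs = [-1.5, -0.6, -2.4, -1.1]
print(perplexity(source_logprobs), perplexity(merged_logprobs))
```

A merged model whose perplexity rises sharply on text the source models handled well is a warning sign, though a low perplexity alone does not show that task capabilities survived.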

The constraint of architectural compatibility is real. You can only merge models with the same structure: same number of layers, same hidden dimensions, same attention head configuration. Llama 2 7B and Mistral 7B look similar on paper, but they differ in key-value head count and feed-forward width, so their tensors do not line up and a direct merge fails. Even models with compatible dimensions might use different positional encoding schemes or normalization approaches that make merging suboptimal. The tokenizer problem compounds this, because models with different vocabularies require surgery that introduces approximations.

And while mergekit handles the mechanics, understanding which merge method and hyperparameters to use requires experimentation and intuition. The documentation provides examples, but optimal configurations are model-specific and task-dependent. You'll spend time tweaking density parameters, interpolation factors, and layer ranges, running evaluations, and iterating. This isn't a one-command solution. It's a toolkit that requires skill to use effectively.
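A quick pre-flight check can catch shape mismatches before any weights are read. The sketch below compares a handful of standard Hugging Face config.json fields; the helper name is made up, and the values are what I understand the published configs for Llama 2 7B and Mistral 7B v0.1 to contain, so verify them against the actual files.

```python
SHAPE_KEYS = (
    "hidden_size",
    "num_hidden_layers",
    "num_attention_heads",
    "num_key_value_heads",
    "intermediate_size",
    "vocab_size",
)


def shape_mismatches(config_a, config_b, keys=SHAPE_KEYS):
    """Return {key: (a_value, b_value)} for every key on which the configs differ."""
    return {
        k: (config_a.get(k), config_b.get(k))
        for k in keys
        if config_a.get(k) != config_b.get(k)
    }


# Assumed values; check each model's config.json before relying on them.
llama2_7b = {"hidden_size": 4096, "num_hidden_layers": 32, "num_attention_heads": 32,
             "num_key_value_heads": 32, "intermediate_size": 11008, "vocab_size": 32000}
mistral_7b = {"hidden_size": 4096, "num_hidden_layers": 32, "num_attention_heads": 32,
              "num_key_value_heads": 8, "intermediate_size": 14336, "vocab_size": 32000}
print(shape_mismatches(llama2_7b, mistral_7b))
```

An empty result is necessary but not sufficient: it says the tensors have matching shapes, not that the models share a base or that a merge will be coherent.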

Verdict

Use if: You want to combine capabilities from multiple fine-tuned models without training compute, you're working with memory constraints that make traditional multi-model approaches impractical, you're comfortable with experimental tools and have evaluation infrastructure to validate outputs, or you're exploring creative model compositions and rapid prototyping matters more than guaranteed results. mergekit excels at democratizing access to model combination techniques and enabling experiments that would otherwise require datacenter resources.

Skip if: You need production-ready models with quality guarantees (fine-tuning is more reliable), your models have incompatible architectures or vastly different tokenizers, you lack the evaluation infrastructure to validate merged outputs, or you're working with models where any quality degradation is unacceptable.

Model merging is powerful but unpredictable. Use it for exploration and specialization, not as a substitute for proper training when stakes are high.
