
Mergekit: Combining LLMs Without Training by Operating in Weight Space


Hook

What if you could combine a coding-specialized LLM with a creative writing model to get both capabilities in a single model—without any training, datasets, or GPU clusters?

Context

The traditional path to a capable language model involves massive datasets, extensive compute resources, and weeks of training time. When you need a model that’s good at multiple tasks—say, both technical documentation and creative storytelling—you typically face three options: train from scratch on combined data (expensive), run multiple models as an ensemble (slow and resource-intensive), or fine-tune an existing model on new data (requires access to quality datasets and compute). Each approach has significant drawbacks.

Model merging offers a fourth path: operating directly on pretrained model weights to mathematically combine their capabilities. This isn’t a new concept—researchers have explored weight averaging and model interpolation for years—but it gained mainstream attention when the open-source LLM community discovered that merging fine-tuned models could produce surprisingly capable results. The challenge was that existing tools either required extensive programming knowledge or consumed prohibitive amounts of memory. Mergekit emerged to solve this problem, providing a production-ready toolkit that can merge multi-billion parameter models on consumer hardware through clever out-of-core processing.

Technical Insight

[Figure: system architecture (auto-generated). A YAML config file feeds a config parser, which passes the merge method and parameters to algorithm selection and model references to a lazy tensor loader. The loader streams weight chunks into the selected merge algorithm: a SLERP blender (interpolation), a DARE/TIES handler (delta weights), a frankenmerge assembler (layer slicing), or a MoE builder (expert routing). Their outputs (blended tensors, pruned deltas, assembled layers, expert weights) flow into a streaming merge engine that performs out-of-core processing and writes the merged model weights.]

Mergekit’s architecture centers on lazy tensor loading and streaming operations. Instead of loading entire models into memory, it processes weights in chunks, making it possible to merge multi-billion-parameter models with as little as 8GB of VRAM. The system reads YAML configuration files that specify merge methods, source models, and parameters, then applies the selected algorithm to combine weights.
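The streaming idea can be sketched in a few lines of plain Python. This is an illustrative toy, not mergekit's internals: each tensor is loaded, merged, and released before the next one is touched, which is why peak memory stays roughly one tensor pair rather than two full models.

```python
def streaming_merge(loaders_a, loaders_b, t=0.5):
    """Merge two checkpoints tensor-by-tensor (linear interpolation here),
    so only one pair of tensors is ever resident in memory at a time.
    `loaders_*` map tensor names to zero-argument callables that read
    that tensor from disk on demand (here: toy lists of floats)."""
    for name, load_a in loaders_a.items():
        a, b = load_a(), loaders_b[name]()   # load just this one pair
        yield name, [(1 - t) * x + t * y for x, y in zip(a, b)]
        # both tensors go out of scope before the next iteration

# toy "checkpoints": two single-tensor models
merged = dict(streaming_merge(
    {"mlp.weight": lambda: [0.0, 2.0, 4.0]},
    {"mlp.weight": lambda: [4.0, 2.0, 0.0]},
))
print(merged["mlp.weight"])  # [2.0, 2.0, 2.0]
```

In the real toolkit the loaders read shards from disk (e.g. safetensors files support this kind of per-tensor lazy access), and the merge function is whichever algorithm the config selects.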

One supported merge method is SLERP (spherical linear interpolation), which blends weights along the arc between two models rather than along a straight line, better preserving the geometry of each model’s weight space. Here’s a basic configuration:

slices:
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
merge_method: slerp
base_model: mistralai/Mistral-7B-Instruct-v0.2
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16

This configuration demonstrates interpolated gradients: the t parameter varies across layers, with each list of anchor values interpolated over the layer range, so attention and MLP sublayers receive different blend ratios at different depths. This granular control lets you preserve certain capabilities while blending others.
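Both pieces of this config can be sketched in plain Python (hypothetical helper names, not mergekit's API): slerp blends two flattened weight vectors along the arc between them, and interp_anchors turns the five anchor values into a per-layer t.

```python
import math

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two weight vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    cos = sum(x * y for x, y in zip(a, b)) / (na * nb)
    theta = math.acos(max(-1.0, min(1.0, cos)))
    if theta < eps:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(theta)
    wa, wb = math.sin((1 - t) * theta) / s, math.sin(t * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

def interp_anchors(anchors, frac):
    """Map a layer's fractional depth (0..1) onto the anchor list,
    e.g. [0, 0.5, 0.3, 0.7, 1], by linear interpolation between anchors."""
    pos = frac * (len(anchors) - 1)
    i = min(int(pos), len(anchors) - 2)
    u = pos - i
    return (1 - u) * anchors[i] + u * anchors[i + 1]
```

For the self_attn filter above, a layer a quarter of the way through the stack would get t = interp_anchors([0, 0.5, 0.3, 0.7, 1], 0.25) = 0.5, while the mlp filter's reversed anchors give that same layer the complementary ratio.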

For more sophisticated merging, mergekit supports DARE (Drop And REscale) and TIES (TrIm, Elect Sign, and Merge), which address interference between the parameters of different fine-tunes. Both methods operate on delta weights (each model’s difference from a shared base model):

models:
  - model: mistralai/Mistral-7B-v0.1
    # No parameters required for base model
  - model: samir-fama/SamirGPT-v1
    parameters:
      density: 0.53
      weight: 0.4
  - model: abacusai/Slerp-CM-mist-dpo
    parameters:
      density: 0.53
      weight: 0.3
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  normalize: true
  int8_mask: true
dtype: bfloat16

The density parameter controls what fraction of each model’s delta weights to keep (0.53 means 53%; the rest are pruned to zero), while weight determines each model’s influence in the final combination. The normalize option rescales the model weights so they sum to 1, keeping the merged parameters at a sensible magnitude.
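The trim, elect, and merge steps can be illustrated with a toy pure-Python version (a sketch of the idea, not mergekit's code): work in delta space, zero out all but the top-density fraction of each delta by magnitude, elect a sign per parameter by total magnitude, then average only the deltas that agree with the elected sign.

```python
def ties_merge(base, models, density=0.5):
    """Toy TIES merge over flat weight lists."""
    # 1. Trim: per model, keep only the top-`density` deltas by magnitude
    trimmed = []
    for m in models:
        delta = [mv - bv for mv, bv in zip(m, base)]
        k = max(1, int(len(delta) * density))
        cutoff = sorted((abs(d) for d in delta), reverse=True)[k - 1]
        trimmed.append([d if abs(d) >= cutoff else 0.0 for d in delta])
    # 2. Elect + 3. Merge: pick the majority sign per parameter by total
    #    magnitude, then average only the deltas that agree with it
    merged = []
    for i, bv in enumerate(base):
        pos = sum(t[i] for t in trimmed if t[i] > 0)
        neg = sum(-t[i] for t in trimmed if t[i] < 0)
        sign = 1.0 if pos >= neg else -1.0
        agree = [t[i] for t in trimmed if t[i] * sign > 0]
        merged.append(bv + (sum(agree) / len(agree) if agree else 0.0))
    return merged

# two "fine-tunes" that disagree on parameter 1
print(ties_merge([0.0, 0.0, 0.0, 0.0],
                 [[1.0, 1.0, 0.1, 0.0], [1.0, -1.0, 0.0, 0.1]],
                 density=0.5))  # [1.0, 1.0, 0.0, 0.0]
```

Note how the conflicting second parameter is resolved by sign election rather than averaged toward zero; that is the interference TIES is designed to avoid.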

Mergekit also enables “Frankenmerging”—assembling models layer by layer from different sources. You can take the first 16 layers from one model and the last 16 from another, creating hybrid architectures:

slices:
  - sources:
    - model: teknium/OpenHermes-2.5-Mistral-7B
      layer_range: [0, 16]
  - sources:
    - model: mistralai/Mistral-7B-Instruct-v0.2
      layer_range: [16, 32]
merge_method: passthrough
dtype: bfloat16

The passthrough method directly concatenates layer slices without blending. This approach has produced some of the community’s most capable merged models, though results can be unpredictable—the layers must be compatible in hidden dimension and architectural details.
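Conceptually, passthrough is just bookkeeping: copy each slice's layers into the output model and renumber them. A toy planner (a hypothetical helper, not mergekit's API) makes the renumbering explicit:

```python
def plan_passthrough(slices):
    """slices: list of (model_name, start, end) layer ranges.
    Returns (model, source_layer, output_layer) copy instructions."""
    plan, out_layer = [], 0
    for model, start, end in slices:
        for src in range(start, end):
            plan.append((model, src, out_layer))
            out_layer += 1
    return plan

plan = plan_passthrough([
    ("teknium/OpenHermes-2.5-Mistral-7B", 0, 16),
    ("mistralai/Mistral-7B-Instruct-v0.2", 16, 32),
])
print(len(plan), plan[0], plan[-1])
```

Overlapping ranges are also allowed, which is how community frankenmerges grow past their sources' depth: stacking [0, 24) from one model and [8, 32) from another yields a 48-layer hybrid.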

Beyond basic merging, mergekit includes specialized tools. The Mixture of Experts merging capability converts merged models into MoE architectures, where each expert is a specialized model and a router network determines which expert handles each token. The toolkit also supports LoRA extraction (creating a LoRA adapter from the difference between two models) and tokenizer transplantation for swapping vocabulary between models.
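The routing idea behind the MoE output can be sketched in plain Python. This is illustrative only: the gate vectors here are arbitrary toys, whereas mergekit's MoE tooling initializes router gates from hints you supply in the config. Each token's hidden state is scored against every expert's gate, and a softmax over the top-k scores weights the chosen experts.

```python
import math

def route(hidden, gate_vectors, k=2):
    """Pick the top-k experts for one token's hidden state and
    return (expert_index, softmax_weight) pairs."""
    scores = [sum(h * g for h, g in zip(hidden, gv)) for gv in gate_vectors]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# three experts with toy 2-d gate vectors; this token leans toward expert 0
print(route([2.0, 1.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]))
```

Only the selected experts' feed-forward blocks run for that token, which is why an MoE built from several 7B experts can be cheaper at inference than running the experts as an ensemble.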

Gotcha

Mergekit’s biggest limitation is architectural compatibility. Models must share the same fundamental architecture—you can merge two Mistral-7B derivatives, but not a Mistral model with a Llama model without manual intervention. Even models that appear compatible can fail if they differ in hidden dimensions, number of attention heads, or other structural details. The error messages when this happens can be cryptic, often manifesting as tensor shape mismatches deep in the merge process.
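A cheap sanity check before launching a long merge is to diff the structural fields of each model's config.json. A minimal sketch (hypothetical helper; the key names follow Hugging Face transformers configs, and the example values are the published Mistral-7B and Llama-2-7B dimensions):

```python
STRUCTURAL_KEYS = (
    "hidden_size", "num_hidden_layers",
    "num_attention_heads", "intermediate_size", "vocab_size",
)

def check_compat(cfg_a, cfg_b, keys=STRUCTURAL_KEYS):
    """Return (ok, mismatches) comparing two parsed config.json dicts."""
    mismatches = {k: (cfg_a.get(k), cfg_b.get(k))
                  for k in keys if cfg_a.get(k) != cfg_b.get(k)}
    return not mismatches, mismatches

mistral = {"hidden_size": 4096, "num_hidden_layers": 32,
           "num_attention_heads": 32, "intermediate_size": 14336,
           "vocab_size": 32000}
llama2 = dict(mistral, intermediate_size=11008)
ok, diff = check_compat(mistral, llama2)
print(ok, diff)  # False {'intermediate_size': (14336, 11008)}
```

Catching a mismatch like this up front is far friendlier than the tensor-shape error you would otherwise hit partway through the merge.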

Merge quality is highly configuration-dependent, with limited theoretical guarantees. While TIES and DARE include principled mechanisms for resolving parameter conflicts, there is no guarantee your merged model will actually perform well. A merge that seems logical, such as combining a math-focused model with a code-focused model, can produce a model worse at both tasks due to parameter interference. Finding good merge parameters often requires extensive experimentation or using the evolutionary search tools. The community has developed intuition about what works (lower density values for DARE, specific SLERP ratios for certain model pairs), but these are empirical findings rather than theoretical certainties. Plan to evaluate merged models thoroughly before deploying them: they can exhibit unexpected behaviors such as format confusion, capability regression, or novel failure modes present in neither source model.

Verdict

Use mergekit if you want to experiment with combining fine-tuned models without retraining, need to transfer capabilities between models on limited hardware, or want to create Mixture of Experts architectures from existing models. It’s particularly valuable when you have multiple domain-specific fine-tunes and want to explore whether their capabilities can coexist in a single model. The out-of-core processing makes sophisticated experiments accessible even on consumer hardware, and the YAML configuration approach means you can iterate quickly. Skip it if you need guaranteed performance characteristics or are working with production systems where unpredictable behavior is unacceptable—traditional fine-tuning or ensemble methods provide more reliable results. Also skip if you’re trying to merge fundamentally different architectures or if you have sufficient resources to train from scratch, as purpose-built training will generally outperform merged models for specific tasks.
