Running Mixtral-8x7B on Consumer Hardware: Expert Offloading with LRU Caching

Hook

When Mixtral-8x7B launched in December 2023, its 47 billion parameters seemed to demand expensive A100 GPUs. Within days, this project had it running on free Google Colab instances.

Context

Large language models have been locked behind hardware paywalls for years. While quantization techniques like GPTQ and AWQ made dense models like Llama-70B accessible on consumer GPUs, Mixture-of-Experts architectures presented a different challenge. Mixtral-8x7B’s architecture—where each layer contains 8 separate expert networks and activates only 2 per token—creates a unique memory profile: the model is technically 47B parameters, but only processes tokens through about 13B active parameters at any given time.

Traditional offloading strategies treat models as monolithic blocks, shuffling entire layers between CPU RAM and GPU VRAM. This works poorly for MoE models because it ignores their sparsity patterns. If you’re going to swap data between CPU and GPU constantly, you need granularity that matches how the model actually executes. The dvmazur/mixtral-offloading project recognizes this fundamental mismatch and rebuilds the offloading strategy from first principles, treating each expert as an independent unit that can be cached, evicted, and reloaded based on actual usage patterns.

Technical Insight

Memory Hierarchy

[System architecture — auto-generated diagram] Input tokens flow into the router network, which emits top-2 expert IDs per token. Each of the 32 blocks keeps its MoE layer's eight expert FFNs in system RAM storage. An LRU cache check then decides the path: on a hit, the cached expert runs directly in GPU expert compute; on a miss, the expert is loaded from RAM and the least-recently-used cached expert is evicted. The routing weight mixer combines the expert outputs into the weighted output hidden states.

The core architectural insight is treating experts as first-class citizens in the memory hierarchy. In Mixtral’s 32 decoder layers, each MoE block contains 8 expert feed-forward networks. During inference, a learned router selects the top-2 experts for each token. Rather than keeping all experts in VRAM or naively offloading entire layers, this implementation keeps experts in system RAM and maintains an LRU (Least Recently Used) cache on the GPU.

The caching strategy exploits token locality—consecutive tokens in a sequence often trigger the same experts. When the router selects experts 3 and 7 for a token, those experts are loaded to GPU (if not already cached), used for computation, and retained for subsequent tokens. Only when the cache fills and a new expert is needed does the least-recently-used expert get evicted back to RAM. Here’s how the expert loading mechanism works:

import torch
import torch.nn.functional as F

def forward(self, hidden_states):
    # hidden_states: (num_tokens, hidden_dim)
    # Router determines which experts to activate
    router_logits = self.gate(hidden_states)          # (num_tokens, num_experts)
    routing_weights, selected_experts = torch.topk(
        router_logits, self.top_k, dim=-1
    )                                                 # both (num_tokens, top_k)
    routing_weights = F.softmax(routing_weights, dim=-1)

    # Process through selected experts with caching
    final_hidden_states = torch.zeros_like(hidden_states)
    for expert_idx in selected_experts.unique():
        # Check LRU cache, load from system RAM if needed
        expert = self.get_expert(expert_idx)
        # token_idx: which tokens picked this expert; slot_idx: at which top-k slot
        token_idx, slot_idx = torch.where(selected_experts == expert_idx)

        # Apply the expert only to the tokens that routed to it
        expert_output = expert(hidden_states[token_idx])
        final_hidden_states[token_idx] += (
            expert_output * routing_weights[token_idx, slot_idx, None]
        )

    return final_hidden_states
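The `get_expert` call hides the cache logic. A minimal sketch of how such an LRU expert cache could work (class and attribute names are hypothetical; the project's actual cache also handles asynchronous copies and per-layer bookkeeping):

```python
from collections import OrderedDict

class ExpertLRUCache:
    """Illustrative LRU cache for expert modules (not the project's exact code).

    cpu_experts maps expert_idx -> module resident in system RAM;
    at most `capacity` experts live on the GPU at any time.
    """
    def __init__(self, cpu_experts, capacity, device="cuda"):
        self.cpu_experts = cpu_experts
        self.capacity = capacity
        self.device = device
        self.gpu_cache = OrderedDict()  # expert_idx -> module on the GPU

    def get(self, expert_idx):
        if expert_idx in self.gpu_cache:
            # Cache hit: mark as most recently used, no transfer needed
            self.gpu_cache.move_to_end(expert_idx)
            return self.gpu_cache[expert_idx]
        # Cache miss: evict the least-recently-used expert if the cache is full
        if len(self.gpu_cache) >= self.capacity:
            _, evicted = self.gpu_cache.popitem(last=False)
            evicted.to("cpu")  # weights go back to system RAM
        expert = self.cpu_experts[expert_idx].to(self.device)
        self.gpu_cache[expert_idx] = expert
        return expert
```

Because `OrderedDict` preserves insertion order and `move_to_end` refreshes it, the front of the dict is always the eviction candidate, so both hit and miss paths stay O(1).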

The second innovation is heterogeneous quantization. Not all model components compress equally well. Attention layers contain most of the model’s “knowledge” and degrade quickly under aggressive quantization, while MLP experts are more redundant. The project uses HQQ (Half-Quadratic Quantization) with different schemes: 2-bit quantization for the experts (which can tolerate more compression) and 4-bit quantization for attention weights, with careful per-group scaling to preserve critical information.

This mixed precision strategy is implemented at the module level during model loading:

import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

def quantize_model(model):
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if 'self_attn' in name or 'attention' in name:
            # Less aggressive for attention projections
            quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
        elif 'expert' in name or 'mlp' in name:
            # More aggressive for the redundant expert FFNs
            quant_config = BaseQuantizeConfig(nbits=2, group_size=32)
        else:
            continue  # leave other linears (e.g. the router gate) unquantized

        # Replace the module with its quantized version
        quantized = HQQLinear(module, quant_config)
        replace_module(model, name, quantized)  # helper from the project codebase

The memory savings are substantial. An unquantized Mixtral-8x7B requires roughly 94GB (47B parameters × 2 bytes for FP16). With 4-bit quantization across the board, you’d need about 24GB. The heterogeneous approach pushes this down to approximately 16GB total—but crucially, not all in VRAM. With expert offloading, you might keep only 4-6GB in GPU memory (attention layers plus cached experts) while the remaining 10-12GB lives in system RAM.
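These figures follow directly from the parameter counts. A back-of-envelope check, using the numbers quoted above (the ~45B expert-parameter share is a rough estimate, and the sketch ignores quantization metadata such as scales and zero points, which is why the real heterogeneous footprint lands closer to ~16GB than this idealized number):

```python
TOTAL_PARAMS = 47e9
FP16_BYTES = 2

fp16_gb = TOTAL_PARAMS * FP16_BYTES / 1e9   # FP16: 2 bytes/param -> ~94 GB
int4_gb = TOTAL_PARAMS * 0.5 / 1e9          # uniform 4-bit: 0.5 bytes/param -> ~24 GB

# Heterogeneous scheme: expert FFNs (the bulk of the parameters, ~45B here
# as a rough assumption) at 2-bit, everything else at 4-bit
expert_params = 45e9
other_params = TOTAL_PARAMS - expert_params
mixed_gb = (expert_params * 0.25 + other_params * 0.5) / 1e9
```

The idealized mixed-precision result comes out around 12GB of weights; per-group scales and zero points add the remaining few gigabytes.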

The performance characteristics reveal the tradeoff landscape. On a consumer setup with an RTX 3090 (24GB VRAM) and 32GB system RAM, generation speed drops to roughly 2-4 tokens per second compared to 15-30 tokens/second for a fully GPU-resident quantized model. The bottleneck is PCIe bandwidth: even on PCIe 4.0, transferring a 200MB expert between RAM and VRAM takes 10-20ms. The LRU cache helps—if the same two experts handle 70% of tokens (common in coherent text), most forward passes avoid transfers entirely.
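The quoted speeds are consistent with a simple transfer-bound model of decoding. The sketch below uses the figures from the text plus an assumed ~16GB/s of practical PCIe 4.0 x16 throughput, and ignores compute time entirely, so it is a ceiling on tokens/second rather than a prediction:

```python
NUM_LAYERS = 32        # Mixtral decoder layers
EXPERTS_PER_TOKEN = 2  # top-2 routing
EXPERT_SIZE_MB = 200   # quantized expert size quoted in the text
PCIE_GB_PER_S = 16     # assumed practical PCIe 4.0 x16 throughput

# Time to move one expert from system RAM to VRAM
transfer_ms = EXPERT_SIZE_MB / (PCIE_GB_PER_S * 1000) * 1000  # -> 12.5 ms

def tokens_per_second_ceiling(hit_rate):
    """Transfer-only throughput ceiling for a given LRU cache hit rate."""
    misses_per_token = (1 - hit_rate) * EXPERTS_PER_TOKEN * NUM_LAYERS
    return 1000 / (misses_per_token * transfer_ms)
```

With a 70% hit rate this gives roughly 4 tokens/second of transfer budget, and a cold cache (0% hits) about 1.3, bracketing the observed 2-4 tokens/second once compute is added back in.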

The technical report backing this implementation (arXiv:2312.17238) explores additional optimizations not yet in the codebase. Speculative expert prefetching would predict which experts upcoming tokens will need based on the current generation context, preloading them during computation. This could hide transfer latency behind useful work, potentially doubling throughput. The gap between paper and implementation is typical for research code but worth noting for production users.
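The simplest version of that idea can be sketched: while layer i computes, guess which experts layer i+1 will need and start copying them early. The toy heuristic below (illustrative only, and not the paper's actual prediction method) just assumes the next layer favors the experts the current tokens used most:

```python
import torch

def predict_next_experts(selected_experts, num_experts=8, top_k=2):
    # Naive prefetch heuristic: rank experts by how often the current
    # layer's tokens routed to them, and prefetch the top-k of those.
    counts = torch.bincount(selected_experts.flatten(), minlength=num_experts)
    return torch.topk(counts, k=top_k).indices

# A real prefetcher would use these IDs to issue non-blocking host-to-device
# copies (tensor.to(device, non_blocking=True) on a separate CUDA stream)
# so the transfers overlap with the current layer's compute.
```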

Gotcha

The most important limitation is one of expectations: this is not a speed optimization. If you have sufficient VRAM to run Mixtral conventionally (say, 24GB for a quantized version), this approach will make inference 5-10x slower, not faster. The constant CPU-GPU traffic creates a throughput ceiling that no amount of caching can fully overcome. You’re trading speed for accessibility—running a model that wouldn’t fit at all, albeit slowly.

System RAM becomes your new bottleneck. While you’ve escaped VRAM constraints, you now need 32GB+ of system memory to hold offloaded experts comfortably. On systems with 16GB RAM, you’ll hit swap, and performance will crater—disk-based paging is orders of magnitude slower than even PCIe transfers. The tool also assumes relatively modern hardware with PCIe 3.0 or better; on older systems with PCIe 2.0, transfer speeds halve and the already-slow generation becomes painful. Finally, batch processing is essentially broken—the implementation focuses on single-sequence generation, and trying to process multiple sequences simultaneously thrashes the expert cache, destroying any locality benefits.

Verdict

Use if: you need to run Mixtral-8x7B on consumer hardware (12-16GB VRAM) for research, experimentation, or low-throughput applications where waiting a few extra seconds per response is acceptable. This is ideal for hobbyists who want to work with cutting-edge MoE models without cloud costs, or researchers exploring Mixtral’s behavior without access to datacenter GPUs. It’s also valuable for understanding MoE architectures—reading and modifying this codebase teaches you how expert routing and sparsity actually work in practice.

Skip if: you have access to 24GB+ VRAM GPUs, need production-grade inference speeds, or are serving user-facing applications where latency matters. In those cases, use standard implementations like vLLM or Hugging Face Transformers with proper quantization, or use commercial APIs. Also skip if you’re running on systems with limited RAM (under 32GB) or need batch processing capabilities—the architecture isn’t designed for those use cases and performance will be unacceptable.
