
Running a 397B Parameter Model on a Laptop: How Flash-MoE Streams Experts from SSD

[ View on GitHub ]

Hook

A 397 billion parameter AI model should need a data center rack. Dan Veloper made it run on a MacBook Pro at 5.5 tokens per second—by treating his SSD like extra RAM.

Context

The trajectory of large language models has created an uncomfortable gap: models keep getting larger while most developers work on laptops. A 70B parameter model barely fits in 48GB of memory with 4-bit quantization. By 400B parameters, you’re firmly in multi-GPU territory—or so conventional wisdom suggests.

Mixture-of-Experts (MoE) architectures offer a theoretical loophole. Models like Qwen3.5-397B don’t activate all 397 billion parameters for every token. Instead, they route each token to a small subset of specialized “expert” networks. For any given computation, you might only need 8 out of 128 experts loaded. The rest sit idle. Flash-MoE exploits this sparsity ruthlessly: if you only need 6% of the model active at once, why keep the other 94% in RAM? Modern NVMe SSDs can deliver 17.5 GB/s of sequential reads. What if you just… streamed the experts on-demand from disk?

Technical Insight

[System architecture (auto-generated diagram): input tokens pass through the attention layer and router network (CPU inference), which emits top-K expert IDs; a parallel SSD loader (pread + F_NOCACHE, on-demand reads) streams 2-bit quantized expert weights from 120GB NVMe storage; dequantization (2-bit to FP16) and expert computation run on the Metal GPU; LayerNorm + output go through Accelerate BLAS, yielding hidden states and the next generated token.]

Flash-MoE’s architecture is a carefully orchestrated pipeline of three overlapping processes: SSD I/O, CPU routing, and GPU computation. The core insight is that modern Apple Silicon systems have fast enough interconnects to make this overlap nearly seamless.

The quantization scheme is aggressive but targeted. While attention weights use standard approaches, expert weights employ custom 2-bit affine quantization that reduces storage from 209GB to 120GB, roughly 43% smaller than the 4-bit baseline. The author's experiments showed this extreme compression maintains RMSE around 0.001, preserving quality because expert weights exhibit different statistical properties than attention weights. Here's the quantization structure for a single expert:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

extern off_t expert_offsets[];  // Byte offset of each expert in the weight blob
extern size_t expert_sizes[];   // Packed size of each expert
extern int ssd_fd;              // Read-only fd for the weight file

// Each expert stores its weights in 2-bit packed format
// with scale/zero-point per channel
typedef struct {
    uint8_t* packed_weights;  // 4 weights per byte
    float* scales;            // Per-channel scale factors
    uint8_t* zero_points;     // Per-channel offsets
    size_t input_dim;
    size_t output_dim;
} Expert2BitWeights;

// On-demand loading uses parallel pread() calls
void load_expert_async(int expert_id, Expert2BitWeights* dest) {
    off_t offset = expert_offsets[expert_id];
    size_t size = expert_sizes[expert_id];

    // F_NOCACHE (macOS-specific) tells the kernel not to pollute the
    // page cache; counterintuitively faster for this workload
    fcntl(ssd_fd, F_NOCACHE, 1);
    if (pread(ssd_fd, dest->packed_weights, size, offset) != (ssize_t)size) {
        perror("pread");  // Short or failed read: don't compute on garbage
    }
}

The pipeline orchestration is where Flash-MoE truly shines. For each transformer layer, three command buffers execute in overlapping stages: while one GPU command buffer computes expert outputs for layer N, the CPU is already calculating routing decisions for layer N+1, and background threads are prefetching experts for layer N+2. The author's notes reveal this took 37 experimental iterations to tune; naive approaches left the GPU idle 40% of the time.

Metal compute shaders handle the matrix multiplications after dequantization happens on-GPU. This is critical: moving dequantized FP16 weights around would saturate memory bandwidth, while moving the compressed 2-bit data and expanding it in GPU SRAM keeps traffic manageable. (On unified-memory Apple Silicon there is no PCIe bus between CPU and GPU; the shared memory bus is the constraint.) The attention mechanism, however, uses CPU-side Accelerate BLAS because, at single-token batch sizes, attention doesn't expose enough parallelism to justify GPU dispatch overhead.

The SSD I/O strategy abandons mmap entirely. The author’s experiment log (visible in commit messages) shows mmap performed 5x slower than raw pread() calls for this access pattern. The reason: mmap triggers page faults that serialize on the kernel’s VM locks, while parallel pread() calls can pipeline through the NVMe queue depth. By issuing multiple pread() calls for different experts simultaneously, Flash-MoE keeps the SSD’s internal parallelism saturated—modern SSDs have 8+ parallel channels that go unused with single-threaded access.

Perhaps most impressive is the deferred execution model. Expert computation doesn’t happen immediately after routing. Instead:

// Routing happens on CPU, returns top-k expert IDs
int* active_experts = route_token(hidden_state, layer);

// Immediately start loading next experts while GPU is busy
for (int i = 0; i < top_k; i++) {
    prefetch_expert_async(active_experts[i]);
}

// Encode GPU work into a command buffer, but don't commit yet
// (encodeComputeWithExperts: is a helper, not a stock Metal API)
id<MTLCommandBuffer> expertBuffer = [queue commandBuffer];
[expertBuffer encodeComputeWithExperts:active_experts];

// Commit immediately; the command queue serializes this buffer
// behind the previous layer's in-flight GPU work
[expertBuffer commit];  // Returns without blocking the CPU

// While GPU runs, CPU prepares next layer routing
// This overlap achieves ~zero bubble time

This architecture achieves 5.55 tokens/second on a 48GB M2 MacBook Pro with the full 397B Qwen3.5 model—genuinely usable for interactive generation. The cold-start penalty (dropping to 2.83 tok/s when SSD caches are cold) reveals how dependent the system is on macOS’s sophisticated SSD caching, but subsequent runs maintain full speed.

Gotcha

Flash-MoE is unapologetically Apple Silicon-only. The entire architecture assumes unified memory (so GPU can see CPU allocations with zero copy), Metal compute shaders, and Apple Fabric SSD controllers with their specific performance characteristics. Porting to NVIDIA/CUDA would require rewriting the compute kernels, and porting to Linux would mean losing the tight integration with macOS’s SSD I/O scheduling. This isn’t a limitation—it’s an intentional design choice to optimize for one platform rather than compromise for portability.

The single-token batch size is a harder constraint. This architecture cannot efficiently serve multiple users simultaneously or handle long prompt prefill quickly. Batching would require keeping multiple expert sets in memory simultaneously, defeating the entire streaming approach. If your workload involves processing hundreds of prompts through the same model, or you need consistent low-latency for production serving, Flash-MoE’s architecture doesn’t scale. The cold-start latency unpredictability (nearly 2x slower until SSD caches warm) also makes this unsuitable for user-facing APIs where p99 latency matters. This is a research tool and power-user interface, not a production inference server.

Verdict

Use Flash-MoE if you’re doing AI research on Apple Silicon hardware and need to experiment with massive MoE models locally without renting GPU clusters. It’s perfect for prototyping agentic systems, testing prompts against state-of-the-art models, or learning low-level inference optimization—the codebase is remarkably readable C with extensive commit messages documenting the 90+ experiments that shaped it. The 5+ tok/s generation speed makes it genuinely usable for interactive work. Skip if you need cross-platform support, production-grade serving with batching, consistent latency guarantees, or you’re working primarily with dense models where the MoE-specific optimizations don’t apply. Also skip if you’re on Intel Macs or don’t have at least 48GB unified memory—the architecture assumes Apple Silicon’s specific hardware characteristics and won’t gracefully degrade.
