ds4: The SSD-Streaming Inference Engine That Treats Your Mac's NVMe Like RAM

Hook

Most inference engines treat SSD streaming as a desperate fallback when you run out of RAM. ds4 flips the script: it puts your KV cache on NVMe by design, turning your Mac's storage speed into the primary constraint instead of RAM size.

Context

When DeepSeek V4 dropped in early 2025, the AI community faced a familiar frustration: a legitimately capable open-source model that was nearly impossible to run locally. The PRO variant spreads 671B parameters across a mixture-of-experts architecture, and even with aggressive quantization you're looking at 200GB+ of working memory for meaningful context windows. Most developers don't have a 512GB Mac Studio sitting around, and renting cloud GPUs defeats the entire purpose of open weights.

The existing inference landscape wasn't built for this problem. llama.cpp optimizes for broad model compatibility but treats giant MoE models as edge cases. vLLM assumes you're running in a datacenter with proper tensor parallelism. ExLlamaV2 is CUDA-only. Apple's MLX framework is elegant but generic, treating DeepSeek the same as any other transformer. Salvatore Sanfilippo—better known as antirez, creator of Redis—looked at this situation and made a characteristically opinionated bet: what if you stopped pretending RAM was infinite and instead designed around the actual hardware people own? What if you treated a Mac's 7GB/s NVMe as a legitimate storage tier rather than emergency overflow? The result is ds4, an inference engine that hardcodes everything about DeepSeek V4's architecture to squeeze maximum performance from unified memory systems.

Technical Insight

[Figure: System architecture (auto-generated). Components shown: HTTP client and OpenAI API; ds4 server (tool calling + speculative decode); model loader with a hardcoded DeepSeek layout (routed experts at IQ2_XXS/Q2_K, shared experts + routing at Q8_0); inference engine (Metal/CUDA kernels, chunked prefill, LRU expert cache in RAM, NVMe SSD holding expert weights + mmap KV cache); distributed mode (transformer layers 0-N on machine 1, layers N+1-M on machine 2, activations forwarded over TCP).]

ds4's architecture rests on a single controversial bet: that DeepSeek V4's routed experts can tolerate 2-bit quantization while the shared experts and routing logic cannot. This isn't a general-purpose heuristic; it's a specific wager on how DeepSeek's MoE implementation handles precision loss. The project is built around pre-quantized GGUF files in which only the routed experts get crushed to IQ2_XXS or Q2_K, while shared experts, attention projections, and the routing mechanism itself stay at Q8_0 or higher.

Here's what that looks like in practice. When ds4 loads a model, it doesn't parse arbitrary tensor configurations—it expects a hardcoded layout:

// Simplified from ds4's model loader
// Routed experts: aggressive quantization
for (int layer = 0; layer < num_layers; layer++) {
    for (int expert = 0; expert < num_routed_experts; expert++) {
        load_tensor_quantized(
            layer, expert, 
            QUANT_IQ2_XXS,  // 2.06 bits per weight
            STORAGE_SSD      // Stream from NVMe
        );
    }
    // Shared experts: keep precision
    load_tensor_quantized(
        layer, SHARED_EXPERT_0,
        QUANT_Q8_0,         // ~8.5 bits per weight (8-bit values plus a per-block scale)
        STORAGE_RAM          // Keep in unified memory
    );
}
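To see why the split matters, a back-of-the-envelope size calculation helps. The sketch below is not taken from ds4; the function names are hypothetical, and the only inputs drawn from the article are the bits-per-weight figures. Any parameter counts you feed it are your own assumptions.

```c
#include <assert.h>

/* Bytes needed to store n_weights at a given average bits-per-weight.
 * Hypothetical helper for illustration; not part of ds4. */
double quantized_bytes(double n_weights, double bits_per_weight) {
    assert(n_weights >= 0.0 && bits_per_weight > 0.0);
    return n_weights * bits_per_weight / 8.0;
}

/* Total footprint when routed and shared weights use different formats. */
double mixed_footprint_bytes(double routed_weights, double routed_bpw,
                             double shared_weights, double shared_bpw) {
    return quantized_bytes(routed_weights, routed_bpw)
         + quantized_bytes(shared_weights, shared_bpw);
}
```

As a purely hypothetical example, 600 billion routed weights come to about 154.5 GB at 2.06 bits per weight, versus about 637.5 GB at 8.5 bits per weight. That gap is the reason the routed experts are the ones worth squeezing.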

The KV cache strategy is where things get interesting. Instead of allocating a giant contiguous buffer in RAM, ds4 writes key-value tensors directly to memory-mapped NVMe files. When processing a prompt, it chunks the prefill phase into segments small enough to fit in available RAM, streams intermediate activations to disk, and keeps the most recently accessed KV blocks in an LRU cache. On a Mac Studio with a fast SSD, this means you can process 128K-token contexts with only 64GB of RAM; the bottleneck becomes sequential read speed, not capacity.
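The core mechanism here is an ordinary file-backed memory mapping. The following is a minimal POSIX sketch of that idea, not ds4's actual code: the `kv_file_map` type and both function names are hypothetical, and it maps a single flat file rather than whatever block layout ds4 really uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* A file-backed buffer: the OS pages data between RAM and the SSD on demand. */
typedef struct {
    int    fd;
    size_t bytes;
    float *data;
} kv_file_map;

/* Create (or reuse) a file of `bytes` bytes and map it read/write.
 * Returns 0 on success, -1 on failure. */
int kv_map_open(kv_file_map *m, const char *path, size_t bytes) {
    assert(m != NULL && path != NULL && bytes > 0);
    m->fd = open(path, O_RDWR | O_CREAT, 0600);
    if (m->fd < 0) return -1;
    if (ftruncate(m->fd, (off_t)bytes) != 0) { close(m->fd); return -1; }
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, m->fd, 0);
    if (p == MAP_FAILED) { close(m->fd); return -1; }
    m->bytes = bytes;
    m->data = (float *)p;
    return 0;
}

/* Flush dirty pages to storage, then release the mapping and the file. */
int kv_map_close(kv_file_map *m) {
    int rc = msync(m->data, m->bytes, MS_SYNC);
    if (munmap(m->data, m->bytes) != 0) rc = -1;
    if (close(m->fd) != 0) rc = -1;
    m->data = NULL;
    return rc;
}
```

Writes through `m->data` land in the page cache first and reach the SSD when the kernel flushes them or when `msync` is called. With `MAP_SHARED`, the mapping can be far larger than free RAM, because the kernel evicts cold pages back to the file.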

The expert cache operates on a budget system. You configure ds4 with something like --expert-cache 32GB, and it keeps a working set of the most frequently routed experts in RAM while the full 200+ expert set lives on SSD. During inference, if a token routes to an expert that's been evicted, ds4 blocks on a disk read (typically 4-8ms on modern NVMe) and swaps it into the cache. The routing distribution is heavily skewed—DeepSeek's training causes certain experts to activate far more often than others—so a 32GB cache captures 80%+ of expert accesses even though the full expert parameter set is 150GB+.
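A byte-budgeted cache of this kind can be sketched in a few dozen lines. The version below assumes least-recently-used eviction (the architecture diagram labels the component an LRU cache), uses a small fixed slot table with linear scans for brevity, and invents every name in it; ds4's real data structures may look nothing like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS 64

typedef struct {
    int      expert_id;   /* -1 means the slot is empty */
    size_t   bytes;
    uint64_t last_used;   /* logical clock value of the latest access */
} cache_slot;

typedef struct {
    cache_slot slots[MAX_SLOTS];
    size_t     budget_bytes;
    size_t     used_bytes;
    uint64_t   clock;
} expert_cache;

void cache_init(expert_cache *c, size_t budget_bytes) {
    for (int i = 0; i < MAX_SLOTS; i++) c->slots[i].expert_id = -1;
    c->budget_bytes = budget_bytes;
    c->used_bytes = 0;
    c->clock = 0;
}

/* Returns 1 on a hit, 0 on a miss. On a miss, a real engine would block
 * on the SSD read here before inserting the expert. */
int cache_access(expert_cache *c, int expert_id, size_t bytes) {
    assert(expert_id >= 0 && bytes > 0 && bytes <= c->budget_bytes);
    c->clock++;
    for (int i = 0; i < MAX_SLOTS; i++) {
        if (c->slots[i].expert_id == expert_id) {
            c->slots[i].last_used = c->clock;
            return 1;
        }
    }
    /* Miss: evict least recently used entries until the expert fits. */
    for (;;) {
        int free_slot = -1, lru = -1;
        for (int i = 0; i < MAX_SLOTS; i++) {
            if (c->slots[i].expert_id < 0) {
                if (free_slot < 0) free_slot = i;
            } else if (lru < 0 || c->slots[i].last_used < c->slots[lru].last_used) {
                lru = i;
            }
        }
        if (free_slot >= 0 && c->used_bytes + bytes <= c->budget_bytes) {
            c->slots[free_slot] = (cache_slot){ expert_id, bytes, c->clock };
            c->used_bytes += bytes;
            return 0;
        }
        /* Either no slot is free or the budget is exceeded, so at least
         * one slot is occupied and lru is valid. */
        c->used_bytes -= c->slots[lru].bytes;
        c->slots[lru].expert_id = -1;
    }
}
```

Taking the article's figures at face value, a 20% miss rate at 4-8ms per miss averages out to roughly 0.8-1.6ms of stall per expert access, which is why the skew in the routing distribution matters so much.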

Distributed mode uses pipeline-style layer partitioning rather than tensor parallelism. Instead of splitting individual matrix multiplications across GPUs, ds4 assigns contiguous ranges of transformer layers to different machines and forwards activations between them over TCP. In the example below, machine A runs layers 0-30 and machine B runs layers 31-60:

# On machine A (coordinator)
./ds4-server --model deepseek-v4-pro \
             --layers 0-30 \
             --distributed-peer 192.168.1.100:9001

# On machine B (worker)
./ds4-worker --model deepseek-v4-pro \
             --layers 31-60 \
             --listen 0.0.0.0:9001

The coordinator exposes the full OpenAI-compatible API, intercepts requests, runs its layer subset, serializes the activation tensor, sends it to the peer over TCP, receives the processed result, and continues. This is conceptually simpler than Megatron-style tensor sharding but requires low-latency networking—each token generation involves a full round trip. On a local 10GbE link, overhead is 20-30ms per token. Over the internet, it's unusable.
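The "serialize the activation tensor" step amounts to framing a float array so the peer knows how many bytes to read. Here is one plausible framing, offered as an assumption rather than ds4's actual wire format: an 8-byte header (next layer index and element count, both in network byte order) followed by the raw floats. The function names are hypothetical, and the raw-float payload assumes both machines share the same float representation and endianness.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_HEADER_BYTES 8

/* Pack a header plus n floats into out. Returns the frame size in bytes,
 * or 0 if the output buffer is too small. */
size_t frame_encode(uint8_t *out, size_t cap, uint32_t next_layer,
                    const float *act, uint32_t n) {
    assert(out != NULL && act != NULL);
    size_t total = FRAME_HEADER_BYTES + (size_t)n * sizeof(float);
    if (cap < total) return 0;
    uint32_t hdr[2] = { htonl(next_layer), htonl(n) };
    memcpy(out, hdr, FRAME_HEADER_BYTES);
    memcpy(out + FRAME_HEADER_BYTES, act, (size_t)n * sizeof(float));
    return total;
}

/* Unpack a frame. Returns 0 on success, -1 if the frame is truncated
 * or holds more elements than the destination can take. */
int frame_decode(const uint8_t *in, size_t len, uint32_t *next_layer,
                 float *act, uint32_t act_cap, uint32_t *n) {
    uint32_t hdr[2];
    if (len < FRAME_HEADER_BYTES) return -1;
    memcpy(hdr, in, FRAME_HEADER_BYTES);
    *next_layer = ntohl(hdr[0]);
    *n = ntohl(hdr[1]);
    if (*n > act_cap) return -1;
    if (len < FRAME_HEADER_BYTES + (size_t)*n * sizeof(float)) return -1;
    memcpy(act, in + FRAME_HEADER_BYTES, (size_t)*n * sizeof(float));
    return 0;
}
```

The latency figures also set a hard ceiling: 20-30ms of network overhead per token caps generation at roughly 33-50 tokens per second before any compute time is counted.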

The quantization quality claim hinges on validation scripts that compare ds4's outputs against DeepSeek's official inference API. The repo includes regression tests that run identical prompts through both engines, compute perplexity differences, and fail CI if the divergence exceeds thresholds. For tool-calling tasks specifically—the use case antirez cares about—the 2-bit routed experts apparently preserve enough precision that function selection accuracy remains within 2% of the full-precision baseline. This hasn't been independently replicated at scale, but the validation methodology is at least transparent.
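A divergence check of this sort reduces to comparing per-token log-probabilities from the two engines. The sketch below is an assumption about the general shape of such a test, not the repo's script: the names are hypothetical, the threshold is whatever you choose, and it works in log-perplexity (mean negative log-likelihood) so that the gap between engines is a simple subtraction.

```c
#include <assert.h>
#include <stddef.h>

/* Mean negative log-likelihood over natural-log token probabilities.
 * Perplexity is exp() of this value, so this is log-perplexity. */
double mean_nll(const double *logprobs, size_t n) {
    assert(logprobs != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += logprobs[i];
    return -sum / (double)n;
}

/* Returns 1 if the candidate's log-perplexity exceeds the reference's
 * by no more than max_gap, else 0. A gap g corresponds to a perplexity
 * ratio of exp(g). */
int divergence_ok(const double *ref_logprobs, const double *cand_logprobs,
                  size_t n, double max_gap) {
    double gap = mean_nll(cand_logprobs, n) - mean_nll(ref_logprobs, n);
    return gap <= max_gap;
}
```

A CI job built on this would run the same prompts through both engines, collect the log-probabilities each assigns to the reference tokens, and fail when `divergence_ok` returns 0.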

The codebase proudly declares it was built with 'strong GPT 5.5 assistance,' meaning significant portions are AI-generated. You can see this in the code style—verbose comments, defensive null checks everywhere, occasional algorithmic oddities that a human would refactor. The Metal kernels are adapted from llama.cpp but the orchestration layer is original C, with no libggml dependency. Whether this is 'the future of open source' or 'technical debt as a feature' depends entirely on your priors about AI-assisted coding.

Gotcha

The beta-quality warnings aren't hypothetical. The CPU fallback path has a documented issue that triggers macOS kernel panics on certain M3 Ultra configurations when swapping large expert tensors. The recommendation is to either avoid CPU mode entirely or limit expert cache size to prevent thrashing—not exactly confidence-inspiring for production use. Model switching is destructive: changing from Flash to PRO requires deleting the existing KV cache directory and re-downloading different quantized files. There's no version pinning, no migration tooling.

SSD streaming is Metal-only. Despite claims of CUDA and ROCm support, the disk-backed KV cache path only works on macOS unified memory systems. CUDA users get standard in-memory caching, which defeats the entire premise on anything smaller than a DGX.

The distributed mode documentation admits feature parity issues: certain API endpoints don't properly forward through the TCP layer, and error handling is primitive. If the worker crashes mid-generation, the coordinator hangs indefinitely. The speculative decoding feature using MTP (multi-token prediction, acting here as a separate draft model) is implemented but acknowledged as 'not a meaningful win', because the overhead from loading two models erases most of the speedup.

And all of this assumes you're running exactly the GGUF files antirez built and uploaded. You can't bring your own quantization schemes, can't A/B test different bit depths for routed experts, can't even easily verify the provenance of the weights you're downloading. The entire system is a benevolent dictatorship where Salvatore's opinions about quantization quality are baked into the artifacts you're trusting.

Verdict

Use ds4 if you own a high-end Mac (128GB+ unified memory) and need DeepSeek V4 running locally for tool-calling workflows where you've validated the output quality meets your bar, or if you're specifically trying to maximize context length on hardware with fast NVMe and are comfortable treating storage speed as your primary bottleneck. The SSD-streaming approach is genuinely novel for Metal systems and the pre-validated quantization schemes save you weeks of experimentation.

Skip if you need model flexibility beyond DeepSeek V4, require production-grade stability or error recovery, run anything other than recent Apple Silicon or high-end NVIDIA hardware, or have philosophical objections to AI-generated codebases. Also skip if you're evaluating multiple inference engines: ds4's dictatorial approach to quantization and model formats makes it impossible to compare apples-to-apples with llama.cpp or vLLM.

This is a tool for practitioners who've already decided DeepSeek V4 is their target and are willing to accept a narrow, opinionated implementation in exchange for making 'impossible' hardware configurations merely slow instead of non-functional.
