Running Gemma-4 26B on DGX Spark: Why Speculative Decoding Falls Apart at Scale

Hook

Speculative decoding is supposed to make inference faster. So why does this carefully tuned Gemma-4 deployment become slower than stock vLLM the moment you scale past 128 concurrent requests?

Context

DGX Spark is NVIDIA's entry into edge AI, built on the GB10 chip: SM 12.1 architecture on ARM64 with 273 GB/s of memory bandwidth. That's respectable for a workstation, but it is roughly one-twelfth of an H100's 3.35 TB/s. When you run a 26-billion-parameter model on bandwidth-starved hardware, every memory access becomes a bottleneck.

The AEON-7 team attacked this problem by combining three techniques: NVFP4 quantization to compress weights to 4 bits, a Mixture-of-Experts architecture that activates only 4B of 26B parameters per token, and DFlash speculative decoding that drafts up to 10 tokens per target forward pass. The repository claims 144 tok/s for single-stream coding workloads and up to 158 tok/s for extraction tasks, a 20-50% improvement over the team's previous approach. But the repository reveals something more interesting than the headline numbers: it's a forensic document of everything that breaks when you push quantized MoE models onto bleeding-edge silicon with a patched inference stack.
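A back-of-envelope check puts these numbers in context. The sketch below assumes decode is purely bandwidth-bound and that each generated token reads every active weight exactly once; it ignores quantization scales, KV-cache traffic, and activations, so treat the result as a rough ceiling rather than a measurement. The function name and the BF16 comparison are illustrative and do not come from the repository.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_params_billions, bits_per_weight):
    """Rough upper bound on single-stream tokens/s if decode only streams active weights."""
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


SPARK_GB_S = 273.0   # GB10 memory bandwidth quoted in the article
H100_GB_S = 3350.0   # H100 memory bandwidth quoted in the article
ACTIVE_B = 4.0       # ~4B active parameters per token

nvfp4_ceiling = decode_ceiling_tok_s(SPARK_GB_S, ACTIVE_B, bits_per_weight=4)
bf16_ceiling = decode_ceiling_tok_s(SPARK_GB_S, ACTIVE_B, bits_per_weight=16)
bandwidth_gap = H100_GB_S / SPARK_GB_S

print(f"4-bit ceiling on Spark:  {nvfp4_ceiling:.1f} tok/s")
print(f"16-bit ceiling on Spark: {bf16_ceiling:.1f} tok/s")
print(f"H100 / Spark bandwidth:  {bandwidth_gap:.1f}x")
```

Under these simplifying assumptions, a 4-bit model that emits one token per target pass tops out below the reported 144 tok/s, which suggests why accepting several drafted tokens per pass matters on this hardware.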

Technical Insight

The core technical challenge isn't running Gemma-4; it's making compressed-tensors NVFP4 quantization work with vLLM's FusedMoE kernel on SM 12.1 hardware. Gemma-4 26B uses a hybrid architecture: 25 MoE layers with 1024-token sliding windows and 5 dense layers with global attention, interleaved throughout the model. Each MoE layer routes through 128 experts using top-8 selection, meaning only ~4B parameters are active per token.
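To make top-8 routing concrete, here is a minimal pure-Python sketch of selecting 8 of 128 experts for one token. It assumes the router weights are a softmax over the selected experts only; the article does not say how Gemma-4 normalizes its router output, and the logits here are synthetic placeholders.

```python
import math


def route_top_k(router_logits, k=8):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Normalizing over only the selected experts is an assumption made for
    illustration, not a confirmed detail of Gemma-4's router.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


NUM_EXPERTS = 128
synthetic_logits = [math.sin(i) for i in range(NUM_EXPERTS)]  # placeholder scores

selected = route_top_k(synthetic_logits, k=8)
for expert_id, weight in selected:
    print(f"expert {expert_id:3d}  weight {weight:.3f}")
```

Only the 8 selected experts run for that token, so the expert weights that must be read from memory are a small fraction of the layer's total.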

The problem emerges when you try to load compressed-tensors NVFP4 weights into vLLM. The quantization format stores four tensors per weight: weight_packed (4-bit compressed), weight_scale (per-channel scales), weight_global_scale (layer-wise adjustment), and input_global_scale (activation quantization). But vLLM's FusedMoE expects ModelOpt's NVFP4 format with different naming: weight, weight_scale_inverse, and input_scale. The mismatch goes deeper than names—the expert path construction differs fundamentally:

# What compressed-tensors produces:
model.layers.0.block_sparse_moe.experts.0.w1.weight_packed
model.layers.0.block_sparse_moe.experts.0.w1.weight_scale

# What vLLM's FusedMoE expects:
model.layers.0.moe.experts.0.w1_weight
model.layers.0.moe.experts.0.w1_weight_scale_inverse

The team's container patches this in the weight loader by inserting the .moe. segment and collapsing the _weight. suffix during tensor name resolution. But there's a second incompatibility: vLLM's NVFP4 kernel includes dimension assertions that fail on packed 4-bit tensors because the shape inference doesn't account for bit-packing. With two 4-bit values stored per byte, a 4096-dimension weight reports shape [4096, 2048] instead of the expected [4096, 4096], breaking the GEMM kernel dispatcher.
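The renaming step can be sketched as a small function. This is not the container's actual patch: the regular expression and the suffix table are inferred only from the two name listings above, `remap_expert_name` is a hypothetical helper, and the sketch renames tensors without transforming their values.

```python
import re

# Suffix pairs taken from the two listings above; other tensors
# (e.g. the global scales) are deliberately left unmapped here.
SUFFIX_MAP = {
    "weight_packed": "weight",
    "weight_scale": "weight_scale_inverse",
}

_EXPERT_NAME = re.compile(
    r"^(?P<prefix>.*)\.block_sparse_moe\.experts\.(?P<expert>\d+)"
    r"\.(?P<proj>w\d+)\.(?P<suffix>\w+)$"
)


def remap_expert_name(name):
    """Translate a compressed-tensors expert tensor name to the FusedMoE-style name."""
    match = _EXPERT_NAME.match(name)
    if match is None:
        return name  # not an expert tensor; leave untouched
    suffix = match.group("suffix")
    if suffix not in SUFFIX_MAP:
        raise KeyError(f"no known mapping for suffix {suffix!r}")
    return (
        f"{match.group('prefix')}.moe.experts.{match.group('expert')}"
        f".{match.group('proj')}_{SUFFIX_MAP[suffix]}"
    )


print(remap_expert_name("model.layers.0.block_sparse_moe.experts.0.w1.weight_packed"))
print(remap_expert_name("model.layers.0.block_sparse_moe.experts.0.w1.weight_scale"))
```

Raising on unknown suffixes, rather than passing them through, makes a missed mapping fail loudly at load time instead of surfacing later as a missing-weight error.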

The heterogeneous attention architecture creates another constraint. Gemma-4 uses 256-dimensional heads for sliding-window layers but 512-dimensional heads for global attention layers. Standard FlashAttention kernels assume uniform head dimensions, so the team splits attention backends: triton_attn for the target model and flex_attention for the DFlash drafter. This mixing is unusual because speculative decoding typically uses identical attention implementations for target and draft models to minimize backend switching overhead.

Here's where it gets interesting: the DFlash configuration that achieves 144 tok/s uses n=10 speculative tokens, not the n=15 default from z-lab's original implementation. The team discovered through empirical testing that 10-token drafts have higher acceptance rates on coding workloads, reducing wasted computation from rejected speculations. But at high concurrency, the entire approach inverts:
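The trade-off between draft length and acceptance can be illustrated with the standard estimate for speculative decoding, which assumes each drafted token is accepted independently with the same probability. Real acceptance is not independent across positions, and the probabilities below are hypothetical values rather than measurements from the repository.

```python
def expected_tokens_per_step(accept_prob, draft_len):
    """Expected tokens emitted per target verification step.

    Counts the accepted draft prefix plus the one token the target model
    always contributes. Assumes i.i.d. per-token acceptance.
    """
    if accept_prob >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


# Hypothetical acceptance probabilities, for illustration only.
for accept_prob in (0.6, 0.8, 0.9):
    n10 = expected_tokens_per_step(accept_prob, 10)
    n15 = expected_tokens_per_step(accept_prob, 15)
    print(f"alpha={accept_prob:.1f}  n=10 -> {n10:.2f}  n=15 -> {n15:.2f}")
```

In this model the extra five draft positions add little expected yield unless acceptance is very high, so a shorter draft with a better acceptance rate can come out ahead.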

# Single-stream (c=1): DFlash wins
# Target: 144 tok/s with n=10 speculation
# Stock vLLM: ~90-100 tok/s

# High concurrency (c=256): Stock vLLM wins
# DFlash: 1,724 tok/s aggregate (collapses, server unstable)
# Stock vLLM: 3,000-3,700 tok/s aggregate

The collapse happens because speculative decoding adds overhead to every request: the draft model forward pass, verification against the target model's logits, and acceptance bookkeeping. Speculation pays off when the GPU would otherwise sit underused, as it does at c=1. Once batches grow large, vLLM's CUDA graph optimizations and continuous batching already keep the hardware busy, so at c=128 and above you pay the draft overhead while the target model could have been serving more concurrent requests instead.
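The size of the inversion is easier to see as ratios. The snippet below only does arithmetic on the throughput figures quoted above; the helper name is illustrative.

```python
# Throughput figures quoted in the article (tok/s).
SINGLE_STREAM = {"dflash": 144.0, "stock_low": 90.0, "stock_high": 100.0}
C256 = {"dflash": 1724.0, "stock_low": 3000.0, "stock_high": 3700.0}


def ratio_range(dflash, stock_low, stock_high):
    """Return (min, max) of DFlash throughput divided by stock vLLM throughput."""
    return dflash / stock_high, dflash / stock_low


single_min, single_max = ratio_range(**SINGLE_STREAM)
c256_min, c256_max = ratio_range(**C256)

print(f"c=1:   DFlash is {single_min:.2f}x to {single_max:.2f}x stock vLLM")
print(f"c=256: DFlash is {c256_min:.2f}x to {c256_max:.2f}x stock vLLM")
print(f"c=256: stock vLLM is {1 / c256_max:.2f}x to {1 / c256_min:.2f}x DFlash")
```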

The repository also documents a quantization surgery problem. The original llmcompressor quantization pass accidentally quantized the vision tower, MoE routers, and vision embedding layers because Gemma-4's component naming doesn't match standard ignore patterns. Vision tower layers are named vision_tower.* instead of .*visual.*, and routers use router.proj instead of .*gate.*. Quantizing these components crashes vLLM's Linear layer allocation, so the team manually extracted 221 BF16 tensors from the base model and replaced 884 NVFP4 tensors in the quantized safetensors files, a count consistent with four NVFP4 tensors per quantized weight. The surgical fix is described in prose but not automated in executable scripts.
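The naming mismatch is easy to reproduce with plain regular expressions. The sketch below uses Python's `re` module as a stand-in for the quantizer's ignore matching; the module names and the replacement patterns are hypothetical examples built from the name fragments mentioned above, not the repository's actual configuration.

```python
import re

STANDARD_IGNORE = [r".*visual.*", r".*gate.*"]          # patterns that miss Gemma-4's names
GEMMA_IGNORE = [r".*vision_tower.*", r".*router.*"]     # hypothetical replacements

# Hypothetical module names using the fragments the article mentions.
MODULES = [
    "vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.layers.0.router.proj",
    "model.layers.0.self_attn.q_proj",
]


def is_ignored(name, patterns):
    """True if any ignore pattern matches the full module name."""
    return any(re.fullmatch(pattern, name) for pattern in patterns)


for name in MODULES:
    print(f"{name:50s} standard={is_ignored(name, STANDARD_IGNORE)!s:5s} "
          f"gemma={is_ignored(name, GEMMA_IGNORE)}")
```

Running a check like this against the model's full module list before quantizing would flag components that slip past the ignore list.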

One subtle architectural detail: Gemma-4 includes internal control tokens (IDs 98, 100, 101) for multi-channel generation that coordinate thought processes across model components. Without proper EOS configuration in generation_config.json, these tokens leak into the output stream as plaintext, creating infinite thought loops visible to end users. The fix patches stop token lists, but it reveals that Gemma-4's control flow isn't just a prompting convention—it's embedded in the architecture.
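A stop-token patch of this kind might look like the sketch below, which merges extra IDs into the `eos_token_id` field of a generation config dictionary. The article does not say which IDs the real fix adds, so treating all three control tokens as stop tokens is an assumption made purely for illustration, and the starting config value is hypothetical.

```python
import json

CONTROL_TOKEN_IDS = [98, 100, 101]  # control-token IDs named in the article


def add_stop_tokens(generation_config, extra_ids):
    """Return a copy of the config with extra_ids merged into eos_token_id.

    eos_token_id may be a single int or a list; the result is always a list
    with duplicates removed and original order preserved.
    """
    current = generation_config.get("eos_token_id", [])
    if isinstance(current, int):
        current = [current]
    merged = list(dict.fromkeys(list(current) + list(extra_ids)))
    return {**generation_config, "eos_token_id": merged}


config = {"eos_token_id": 1}  # hypothetical starting value
patched = add_stop_tokens(config, CONTROL_TOKEN_IDS)
print(json.dumps(patched, indent=2))
```

Before applying anything like this, check each ID against the tokenizer: stopping on a token that opens a channel, rather than one that closes it, would truncate valid output.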

Gotcha

The DFlash approach has a fatal scaling problem: it becomes slower than stock vLLM at the exact concurrency level where production inference gets interesting. At c=256, stock vLLM delivers 3,000-3,700 tok/s while DFlash peaks at 1,724 tok/s and destabilizes the server. The repository's primary value proposition, speculative decoding for performance, only applies to single-stream or low-concurrency workloads. If you're building a service that handles batch inference or multi-user traffic, you would disable DFlash entirely, which leaves the container's compatibility patches as its only remaining value.

The compressed-tensors compatibility patches will bitrot against upstream vLLM. This container freezes vLLM at 0.22.1 with surgical modifications to expert path construction and NVFP4 dimension assertions. Every upstream vLLM release adds features (better CUDA graphs, expanded model support, performance optimizations) that you'll miss unless you manually forward-port the patches. The team hasn't provided automated patch scripts or upstreamed the fixes, so maintaining this becomes a manual merge process.

Two further limits apply. The hardcoded BF16 KV cache requirement (you cannot use fp8) wastes memory because the drafter's non-causal attention prevents standard vLLM KV quantization. And the performance benchmarks are specific to DGX Spark, with no tensor parallelism testing, so scaling behavior on multi-GPU setups is unknown.

Verdict

Use if: You're running single-stream or low-concurrency inference (<128 concurrent requests) on DGX Spark hardware and need immediate Gemma-4 26B deployment without debugging vLLM compatibility yourself. The 144 tok/s coding performance is legitimately impressive for bandwidth-constrained edge hardware, and the container eliminates weeks of dependency resolution. This is essentially a 'DGX Spark owner's manual' that codifies months of trial and error into a reproducible recipe.

Skip if: You need batch inference throughput (at c=256, stock vLLM is roughly 1.7-2.1x faster by the figures above), multi-GPU scaling (tensor parallelism is untested), or want to stay current with upstream vLLM improvements (patches will diverge). Skip entirely if you're not on GB10 silicon, because the entire value proposition is specific to DGX Spark's 273 GB/s memory bandwidth constraint. For datacenter GPUs where bandwidth isn't the bottleneck, run unquantized Gemma-4 on H100 or use TensorRT-LLM's native NVFP4 support instead.
