SGLang: How RadixAttention and Prefix Caching Achieve 5x Faster LLM Inference

Hook

A single architectural decision, caching computation prefixes in a radix tree, can cut GPU costs substantially for organizations serving LLMs at scale. SGLang's RadixAttention delivers speedups of up to 3-5x on workloads with heavy prefix sharing, and the project reports that it runs on over 400,000 GPUs worldwide.

Context

The LLM serving landscape has been dominated by a tension between throughput and latency. As models grew from GPT-3's 175B parameters to today's frontier mixture-of-experts models like the 671B-parameter DeepSeek-V3, serving infrastructure struggled to keep pace. Traditional serving frameworks treat each request independently, recomputing attention over the prompt even when multiple requests share identical prefixes. That is a serious waste when you're running chatbots with system prompts, batch processing with shared context, or agentic workflows with repeated tool calls.

SGLang emerged from research at UC Berkeley and LMSYS (the team behind Chatbot Arena) to address this fundamental inefficiency. While vLLM introduced PagedAttention and helped popularize continuous batching for LLM serving, SGLang introduced RadixAttention, a prefix caching mechanism that stores and reuses computed key-value (KV) caches across requests. When combined with a disaggregated architecture separating prefill (processing input tokens) from decode (generating output tokens), SGLang performs especially well on production workloads where prompt overlap is common. The project reports that the framework now generates trillions of tokens daily across diverse hardware, from NVIDIA Blackwell GPUs to AMD GPUs and TPUs.

Technical Insight

At SGLang's core is RadixAttention, which organizes cached key-value entries in a radix tree (a compressed prefix tree) rather than discarding them after each request. When a new prompt arrives, the system traverses the tree to find the longest matching prefix and reuses all cached computation up to that point. For a chatbot with a 500-token system prompt serving 10,000 requests, this means running prefill for those 500 tokens once instead of 10,000 times, provided the entry stays in the cache: a 10,000x reduction in redundant computation for that prefix.
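To make the longest-prefix lookup concrete, here is a minimal sketch in plain Python. It uses an uncompressed trie keyed on token IDs rather than a true radix tree, tracks only which prefixes are cached (no actual KV tensors), and every name in it (`TrieNode`, `PrefixCache`, the token IDs) is hypothetical rather than taken from SGLang's code.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Maps the next token ID to the child node that extends this prefix.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy prefix index: records which token prefixes already have cached KV entries."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Record that KV entries exist for every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def match_prefix(self, tokens):
        """Return how many leading tokens of `tokens` are already cached."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


cache = PrefixCache()
system_prompt = [101, 7, 42, 9]  # made-up token IDs for a shared system prompt
first_request = system_prompt + [55, 56]
second_request = system_prompt + [77, 78, 79]

cold_hit = cache.match_prefix(first_request)   # nothing is cached yet
cache.insert(first_request)
warm_hit = cache.match_prefix(second_request)  # the shared system prompt is found
tokens_to_prefill = len(second_request) - warm_hit
print(cold_hit, warm_hit, tokens_to_prefill)
```

In this toy run the second request only needs prefill for its three new tokens. A real radix tree additionally merges single-child chains into one edge, which keeps the structure compact for long prompts.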

The implementation elegance shows in SGLang's request handling. Here's how you'd deploy a model with automatic prefix caching:

import sglang as sgl

# System prompt shared across requests
system = "You are a helpful AI assistant specialized in Python."
queries = ["Explain decorators", "What are generators?", "How does asyncio work?"]

if __name__ == "__main__":
    # Offline engine; RadixAttention prefix caching is enabled by default.
    # Argument names follow SGLang's Engine interface; verify against your installed version.
    llm = sgl.Engine(
        model_path="meta-llama/Llama-3.1-8B-Instruct",
        tp_size=1,  # Tensor parallelism degree
        mem_fraction_static=0.8,  # Fraction of GPU memory for model weights plus the KV cache pool
    )
    sampling_params = {"max_new_tokens": 256, "temperature": 0.7}

    # Multiple requests with a shared prefix
    for user_query in queries:
        response = llm.generate(
            f"{system}\n\nUser: {user_query}\nAssistant:",
            sampling_params,
        )
        print(response["text"])

    llm.shutdown()

Behind the scenes, SGLang's scheduler maintains a global radix tree. The first request processes the full prompt, storing KV cache entries in paged memory blocks. Subsequent requests with the same system prompt hit the cache and skip prefill computation for the shared prefix tokens. Cached entries persist across requests and are evicted with an LRU policy when memory pressure builds.
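The LRU behaviour can be illustrated with a small sketch. This is a simplified model under stated assumptions: entries are whole prefixes keyed by token tuples, sizes are counted in abstract "blocks", and the class name `LRUPrefixStore` is hypothetical. SGLang's actual eviction operates on nodes of its radix tree, which this toy does not attempt to reproduce.

```python
from collections import OrderedDict


class LRUPrefixStore:
    """Toy LRU store: keys are token-prefix tuples, values are KV block counts."""

    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        self.used_blocks = 0
        self.entries = OrderedDict()  # oldest entry first, newest last

    def touch(self, prefix):
        """Mark a prefix as most recently used; return True on a cache hit."""
        if prefix in self.entries:
            self.entries.move_to_end(prefix)
            return True
        return False

    def add(self, prefix, blocks):
        """Insert a prefix, evicting least recently used entries until it fits."""
        evicted = []
        while self.entries and self.used_blocks + blocks > self.capacity_blocks:
            old_prefix, old_blocks = self.entries.popitem(last=False)
            self.used_blocks -= old_blocks
            evicted.append(old_prefix)
        self.entries[prefix] = blocks
        self.used_blocks += blocks
        return evicted


store = LRUPrefixStore(capacity_blocks=4)
store.add((1, 2, 3), blocks=2)            # system prompt A
store.add((9, 8), blocks=2)               # system prompt B
hit = store.touch((1, 2, 3))              # A is reused, so B becomes least recent
evicted = store.add((5, 5, 5), blocks=2)  # no room left: B is evicted, A survives
print(hit, evicted, list(store.entries))
```

The point of the sketch is that a frequently reused system prompt keeps getting moved to the "recent" end and therefore survives memory pressure, while one-off prefixes are the first to go.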

The disaggregated architecture takes this further by running prefill and decode on separate groups of GPUs. Prefill is compute-bound (it processes the whole input prompt in parallel), while decode is memory-bandwidth-bound (it generates one token at a time). By separating these phases, SGLang can optimize each independently:

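# Disaggregated deployment configuration (illustrative sketch only).
# node_type and scheduler_address are placeholder names, not confirmed
# sgl.Runtime parameters. In practice, prefill-decode disaggregation is set up
# when launching SGLang servers (for example via a disaggregation mode flag);
# check the current documentation for the exact options.
prefill_runtime = sgl.Runtime(
    model_path="deepseek-ai/DeepSeek-V3",
    tp_size=8,  # Large TP for compute-heavy prefill
    node_type="prefill",
    scheduler_address="http://scheduler:8000"
)

decode_runtime = sgl.Runtime(
    model_path="deepseek-ai/DeepSeek-V3",
    tp_size=4,  # Smaller TP, more instances for memory-bound decode
    node_type="decode",
    scheduler_address="http://scheduler:8000"
)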

This architecture shines for long-context models where prefill latency dominates. A 128K token context might take 30 seconds to prefill but only 50ms per decode step. Disaggregation lets you throw more compute at prefill while running many lean decode instances.
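A quick back-of-the-envelope calculation shows why prefill dominates in that scenario. The prefill and decode timings come from the example above; the 200-token response length is an assumption added purely for illustration.

```python
# Figures from the example above
context_tokens = 128 * 1024      # 128K-token prompt
prefill_seconds = 30.0           # time to prefill the whole prompt
decode_step_seconds = 0.050      # 50 ms per generated token

# Assumed response length (not from the article)
output_tokens = 200

prefill_tokens_per_second = context_tokens / prefill_seconds
decode_seconds = output_tokens * decode_step_seconds
total_seconds = prefill_seconds + decode_seconds
prefill_share = prefill_seconds / total_seconds

print(f"prefill throughput: {prefill_tokens_per_second:.0f} tokens/s")
print(f"decode time: {decode_seconds:.1f} s")
print(f"prefill share of end-to-end latency: {prefill_share:.0%}")
```

Under these assumptions, prefill accounts for roughly three quarters of the end-to-end latency, which is why giving the prefill pool more compute pays off for long-context workloads.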

SGLang's structured output generation uses a compressed finite state machine (FSM), which the project reports gives 3x faster JSON decoding than naive constrained sampling. Instead of checking every possible token against a regex at each step, SGLang precompiles the JSON schema into a compressed FSM and uses jump-forward decoding: wherever the schema allows only one continuation (fixed keys, quotes, punctuation), those tokens are appended directly without a model call:

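import json

import sglang as sgl
from pydantic import BaseModel


class CodeReview(BaseModel):
    language: str
    issues: list[str]
    severity: str


if __name__ == "__main__":
    llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

    # Constrained generation: the schema is passed as a JSON string in the sampling parameters
    review = llm.generate(
        "Review this code: def foo(): pass",
        {
            "max_new_tokens": 512,
            "json_schema": json.dumps(CodeReview.model_json_schema()),
        },
    )
    # Decoding is constrained to the schema, but output can still be truncated
    # at max_new_tokens, so validate before using it
    print(CodeReview.model_validate_json(review["text"]))
    llm.shutdown()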
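The idea of skipping model calls wherever the grammar leaves only one choice can be sketched with a toy character-level FSM. This is a simplification under several assumptions: SGLang works on tokens and compiles the FSM from a schema-derived regex, whereas this sketch builds a trie-shaped FSM from an explicit list of allowed strings, assumes no allowed string is a prefix of another, and replaces the model with a hypothetical `choose` callback.

```python
def build_fsm(strings):
    """Build a character-level, trie-shaped FSM accepting exactly the given strings."""
    transitions = {0: {}}
    for s in strings:
        state = 0
        for ch in s:
            nxt = transitions[state].get(ch)
            if nxt is None:
                nxt = len(transitions)
                transitions[state][ch] = nxt
                transitions[nxt] = {}
            state = nxt
    return transitions


def decode(transitions, choose):
    """Walk the FSM, calling `choose` only where more than one character is allowed."""
    state, out, model_calls = 0, [], 0
    while transitions[state]:
        options = transitions[state]
        if len(options) == 1:
            ch = next(iter(options))      # deterministic step: jump forward, no model call
        else:
            ch = choose(sorted(options))  # real branch: the "model" must decide
            model_calls += 1
        out.append(ch)
        state = options[ch]
    return "".join(out), model_calls


allowed = ['{"severity": "low"}', '{"severity": "high"}']
fsm = build_fsm(allowed)

# Stand-in for the model: always pick the first option alphabetically.
text, model_calls = decode(fsm, choose=lambda options: options[0])
print(text, model_calls)
```

Here the whole `{"severity": "` scaffold and the closing `"}` are emitted without consulting the stand-in model at all; only the single genuine branch costs a call. Naive constrained sampling would have made one call per character.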

For frontier models like DeepSeek-V3 with multi-head latent attention (MLA) and sparse mixture-of-experts (MoE), SGLang includes model-specific kernels. The MLA kernel fuses low-rank projection and attention computation, while sparse MoE routing uses expert parallelism to distribute the 256 routed experts in each MoE layer across GPUs. These optimizations helped SGLang offer day-0 support for DeepSeek-V3 and GPT-OSS, in some cases ahead of hosted API providers.
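As a rough illustration of expert parallelism, the sketch below shards 256 experts contiguously across GPUs and groups each token's top-k experts by the GPU that owns them. The GPU count, the top-k value, the contiguous sharding scheme, and all function names are assumptions chosen for readability, not a description of SGLang's kernels.

```python
import heapq
from collections import defaultdict

NUM_EXPERTS = 256  # routed experts per MoE layer, as in the text
NUM_GPUS = 8       # assumed cluster size
TOP_K = 2          # assumed small value to keep the example readable
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS


def gpu_for_expert(expert_id):
    """Contiguous sharding: experts 0-31 on GPU 0, 32-63 on GPU 1, and so on."""
    return expert_id // EXPERTS_PER_GPU


def route(scores, top_k):
    """Return the indices of the top_k highest-scoring experts for one token."""
    return heapq.nlargest(top_k, range(len(scores)), key=scores.__getitem__)


def dispatch(token_scores):
    """Group (token, expert) pairs by the GPU that hosts each selected expert."""
    plan = defaultdict(list)
    for token_id, scores in enumerate(token_scores):
        for expert_id in route(scores, TOP_K):
            plan[gpu_for_expert(expert_id)].append((token_id, expert_id))
    return dict(plan)


# Two tokens with hand-written router scores (all other experts score 0.0).
scores_a = [0.0] * NUM_EXPERTS
scores_a[3], scores_a[200] = 0.9, 0.8
scores_b = [0.0] * NUM_EXPERTS
scores_b[40], scores_b[41] = 0.7, 0.6

plan = dispatch([scores_a, scores_b])
print(plan)
```

Each GPU then runs only its local experts on the tokens sent to it. In a real system this dispatch step is an all-to-all exchange between GPUs, and keeping the per-GPU load balanced is a large part of the engineering effort.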

The framework also unifies diffusion model serving (Flux, Stable Diffusion) and reinforcement learning post-training workflows. For RL, SGLang's zero-overhead scheduler enables rollout generation at inference-level performance, critical for PPO and online DPO where you're generating millions of responses for policy updates.

Gotcha

SGLang's rapid development pace is both its strength and weakness. The project moved from v0.2 to v0.4 in under six months, with breaking API changes in each release. If you're building production systems requiring API stability, this churn is painful—expect to refactor integration code every few months. The documentation struggles to keep up, particularly for advanced features like prefill-decode disaggregation and TPU deployment. You'll find yourself reading source code and GitHub issues more than polished guides.

The complexity ceiling is high for sophisticated deployments. Setting up disaggregated serving requires understanding GPU cluster networking, load balancing between prefill and decode pools, and tuning memory fractions for optimal cache hit rates. Expert parallelism for MoE models demands careful tensor sharding configuration. While simple single-GPU deployments work out of the box, extracting maximum performance at scale requires deep expertise. AMD and TPU support lags behind NVIDIA—expect rough edges and missing optimizations on non-CUDA backends. If you're not running H100s or A100s, you're on the bleeding edge.

Verdict

Use SGLang if you're serving LLMs at scale with high prefix overlap (chatbots, batch processing, agentic systems), need maximum throughput for frontier models like DeepSeek or Qwen, or require integrated RL post-training infrastructure. It's the best choice when GPU cost optimization matters and you have engineering resources to handle API evolution. The RadixAttention speedups alone justify the investment for workloads processing millions of requests with shared context.

Skip it if you need API stability for long-term production deployments without dedicated ML infrastructure teams, are prototyping without performance requirements, or are running simple single-model serving on non-NVIDIA hardware. In those cases, stick with vLLM's mature ecosystem or HuggingFace TGI's simplicity until your scale demands SGLang's advanced optimizations.