Headroom: The Three-Layer Compression Stack That Makes LLM Context Windows 60% Cheaper

Hook

Your Claude coding agent just spent $47 processing a 200KB log file it never fully read. Headroom gives the LLM a retrieval function instead of the whole file, compressing context by up to 85% while keeping accuracy on par with the uncompressed baseline.

Context

LLM context windows have exploded from 4K tokens in 2022 to 200K+ in 2024, but cost scales linearly: Anthropic charges $3 per million input tokens for Claude 3.5 Sonnet, and OpenAI charges $2.50 for GPT-4o. Developer tools compound this. Cursor sends entire file trees on every autocomplete request. Aider includes full git diffs. RAG systems stuff 50 document chunks into every query. A single debugging session with kubectl logs piped to Claude can burn through $20 in minutes.

Existing solutions attack the wrong layer. Prompt caching (Anthropic's native feature) only helps with exact prefix matches: change one line in your system prompt and the cache misses. LLMLingua uses token-level deletion models but requires training and lacks reversibility, so you can't recover what was compressed. RTK and lean-ctx focus on CLI output compression but break on structured data like JSON API responses or AST-heavy codebases. The fundamental tension: compress too aggressively and the LLM hallucinates; compress too conservatively and you're still burning tokens. Headroom resolves this with reversible compression. It gives the LLM a retrieval tool to request uncompressed segments on demand, shifting the problem from 'does compressed context suffice' to 'does the LLM know when to ask for more,' which is empirically easier.

Technical Insight

[Figure: System architecture (auto-generated). Nodes: Agent/SDK Client (Entry Point); Headroom Core containing CacheAligner and ContentRouter; Compressors containing SmartCrusher (JSON), CodeCompressor (AST/Tree-sitter), and Kompress-base (Prose); CCR Store (Reversible Layer); Cross-Agent Memory (Content-Addressed); LLM API (Anthropic/OpenAI). Edge labels: HTTP Request, Library Mode, Proxy Mode :8787, MCP Server, Stabilize Prefix, Detect Type, Code Detected, Prose Detected, Compressed, Inject Retrieval Tool, Store Original, Response + Tool Calls, Retrieve Segments, Final Response.]

Headroom's architecture is a three-layer interception system: library (direct SDK integration), proxy (OpenAI-compatible endpoint), and MCP server (Model Context Protocol tool integration). The core intelligence lives in ContentRouter, which fingerprints payloads and dispatches to specialized compressors. Here's how it works in proxy mode:

import os

import anthropic
from headroom import HeadroomProxy

# Start the compression proxy on localhost:8787
proxy = HeadroomProxy(
    target="https://api.anthropic.com",
    compression_config={
        "json": {"compressor": "SmartCrusher", "threshold": 1000},
        "code": {"compressor": "CodeCompressor", "languages": ["python", "typescript"]},
        "prose": {"compressor": "Kompress", "ratio": 0.4}
    },
    ccr_enabled=True  # Reversible compression with retrieval tool
)

# Your agent code doesn't change: just point the client at localhost
client = anthropic.Anthropic(
    base_url="http://localhost:8787",
    api_key=os.environ["ANTHROPIC_API_KEY"]
)

huge_log = "..."  # placeholder for the captured kubectl output

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # required by the Anthropic Messages API
    messages=[{
        "role": "user",
        "content": f"Debug this 50KB kubectl log:\n{huge_log}"
    }]
)
# Headroom intercepts, compresses huge_log to 8KB, injects retrieval tool

ContentRouter performs content-type detection using a cascade: tree-sitter AST parsing for code (fails fast on non-code), JSON schema fingerprinting for structured data, then falls back to Kompress-base for prose. SmartCrusher handles JSON by flattening arrays and deduplicating object keys—a 10KB AWS CloudWatch response with 200 repeated {"timestamp": ..., "message": ...} entries compresses to 2KB by extracting the schema once and storing values as tuples. CodeCompressor parses source with tree-sitter and prunes docstrings, comments, and unreferenced imports while preserving structure—critical for code where whitespace matters.
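To make the "extract the schema once and store values as tuples" idea concrete, here is a minimal sketch of that kind of columnar JSON packing. It is not SmartCrusher's actual implementation; the function names crush_records and expand_records are hypothetical, and the sketch only handles arrays whose objects all share the same keys in the same order.

```python
import json

def crush_records(records):
    """Collapse a list of same-shaped dicts into one schema plus value rows."""
    if not records:
        return {"schema": [], "rows": []}
    schema = list(records[0].keys())
    # Only uniform arrays are safe to collapse; anything else is rejected here.
    if any(list(r.keys()) != schema for r in records):
        raise ValueError("records do not share one schema")
    return {"schema": schema, "rows": [[r[k] for k in schema] for r in records]}

def expand_records(crushed):
    """Inverse of crush_records: rebuild the original list of dicts."""
    return [dict(zip(crushed["schema"], row)) for row in crushed["rows"]]

# 200 log entries with the same two keys, similar to the CloudWatch example.
records = [
    {"timestamp": 1700000000 + i, "message": f"request {i} ok"}
    for i in range(200)
]
crushed = crush_records(records)

original_size = len(json.dumps(records))
crushed_size = len(json.dumps(crushed))
print(f"{original_size} bytes -> {crushed_size} bytes")
```

Because expand_records restores the input exactly, this particular transform is lossless. The savings come purely from not repeating the key names in every entry, so the ratio depends on how long the keys are relative to the values.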

The reversible compression layer (CCR) is where Headroom gets clever. Instead of sending compressed content directly, it stores the original in a local content-addressed cache and sends a compressed summary with metadata. It then injects a retrieval tool into the LLM's function schema:

{
  "name": "retrieve_original_content",
  "description": "Fetch uncompressed segments when compressed context is insufficient",
  "parameters": {
    "content_id": "string (hash of original content)",
    "line_range": "optional tuple for code segments"
  }
}

If the LLM determines it needs more detail—say, it sees a compressed stack trace but needs the full variable values—it calls retrieve_original_content. The proxy fetches from local cache and streams the full segment in a follow-up turn. Benchmarks show Claude 3.5 Sonnet invokes retrieval on 12% of compressed requests, and when it does, task accuracy matches uncompressed baselines (94.3% vs 94.1% on HumanEval debugging tasks).
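The store behind a tool like retrieve_original_content can be sketched in a few lines. This is an illustration under stated assumptions, not Headroom's code: the ContentStore class is hypothetical, SHA-256 is an assumed choice of hash, and line_range is treated as 1-indexed and inclusive, which the article does not specify.

```python
import hashlib

class ContentStore:
    """In-memory content-addressed store: the key is a hash of the content."""

    def __init__(self):
        self._blobs = {}

    def put(self, text):
        # Identical content always maps to the same ID, so duplicates cost nothing.
        content_id = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._blobs[content_id] = text
        return content_id

    def retrieve(self, content_id, line_range=None):
        text = self._blobs[content_id]
        if line_range is None:
            return text
        start, end = line_range  # assumed 1-indexed and inclusive
        return "\n".join(text.splitlines()[start - 1:end])

store = ContentStore()
log = "\n".join(f"line {n}: status=ok" for n in range(1, 101))
content_id = store.put(log)

# What a tool call with a line range would return to the model.
segment = store.retrieve(content_id, line_range=(40, 42))
print(segment)
```

Serving a line range rather than the whole blob matters for cost: a retrieval that returns the full original would undo the compression for that request.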

CacheAligner is pure cost engineering. Anthropic and OpenAI use KV cache for prompt prefixes—if your first 5K tokens match a prior request, you pay $0.10/1M instead of $3/1M. But cache hits require exact prefix matches. CacheAligner rewrites prompts to move dynamic content (timestamps, request IDs, variable log segments) to suffixes:

from datetime import datetime

log_file = "..."  # placeholder for the log contents
now = datetime.now().isoformat()

# Before CacheAligner (cache miss every time: the timestamp changes the prefix)
messages = [
    {"role": "system", "content": f"You are a debugger. Time: {now}"},
    {"role": "user", "content": f"Analyze {log_file}"}
]

# After CacheAligner (cache hit on system + stable user prefix)
messages = [
    {"role": "system", "content": "You are a debugger."},  # Stable prefix
    {"role": "user", "content": f"Analyze this log. Context metadata: {now}, {log_file}"}
]

This turns random-access cache behavior into sequential hits, compounding token savings with billing savings. On a 50-request debugging session, CacheAligner + compression reduced costs from $8.40 to $1.20 (85% reduction).
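A rewrite of this kind could be automated roughly as follows. This is a sketch, not CacheAligner itself: the align_for_cache function, the VOLATILE pattern, and the "<dynamic>" placeholder are all assumptions made for illustration, and a real aligner would need a much broader definition of what counts as dynamic content.

```python
import re

# Hypothetical definition of "volatile" content: ISO-8601 timestamps and
# request IDs shaped like req-1a2b3c4d.
VOLATILE = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?|req-[0-9a-f]{8}"
)

def align_for_cache(messages):
    """Move volatile values out of system prompts into the last user message."""
    aligned = [dict(m) for m in messages]  # shallow copies; input is not mutated
    moved = []
    for m in aligned:
        if m["role"] == "system":
            moved.extend(VOLATILE.findall(m["content"]))
            m["content"] = VOLATILE.sub("<dynamic>", m["content"])
    if not moved:
        return aligned
    suffix = "Context metadata: " + ", ".join(moved)
    for m in reversed(aligned):
        if m["role"] == "user":
            m["content"] = m["content"] + "\n\n" + suffix
            return aligned
    # No user message to attach to, so carry the metadata in a new one.
    aligned.append({"role": "user", "content": suffix})
    return aligned

def build(timestamp):
    return [
        {"role": "system", "content": f"You are a debugger. Time: {timestamp}"},
        {"role": "user", "content": "Analyze the attached log."},
    ]

first = align_for_cache(build("2024-11-02T10:15:00"))
second = align_for_cache(build("2024-11-02T10:16:30"))
print(first[0]["content"] == second[0]["content"])
```

The two requests now share an identical system message and an identical start to the user message, so a prefix cache can match up to the point where the metadata suffix begins.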

The MCP server mode integrates with Claude Desktop and other MCP clients as a tool provider. Instead of compressing messages in-flight, it exposes compress_file, compress_command_output, and compress_rag_chunks as callable tools. This gives the agent explicit control:

// In Claude Desktop with Headroom MCP server
const tools = [
  {
    name: "compress_file",
    input: { path: "/var/log/nginx/access.log", method: "auto" }
  }
];
// Returns compressed summary + retrieval handle

Cross-agent memory is the under-documented feature that should be Headroom's lead pitch. It stores compressed content with agent provenance (Cursor vs Claude vs Codex) and deduplicates across sessions. If Cursor compresses a 500-file repository at 10 AM, Claude Desktop at 2 PM reuses the compressed representation instead of re-compressing. This requires a shared content-addressed store (SQLite by default, Postgres for teams) and agent ID tracking. The README buries this in a subsection, but for teams running multiple coding agents, it's the killer feature—it's a primitive agentic operating system with shared memory.
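The reuse behavior described here can be sketched with a shared SQLite table keyed by content hash. Everything below is an assumption for illustration: the table layout, the get_or_compress function, and the toy_compress stand-in are hypothetical and do not reflect Headroom's real schema.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")  # a shared file or Postgres in a real setup
conn.execute(
    "CREATE TABLE blobs ("
    "content_id TEXT PRIMARY KEY, compressed TEXT NOT NULL, created_by TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE uses ("
    "content_id TEXT, agent_id TEXT, PRIMARY KEY (content_id, agent_id))"
)

compress_calls = []

def toy_compress(text):
    """Stand-in for a real compressor; records each call so reuse is visible."""
    compress_calls.append(len(text))
    return text[:40]

def get_or_compress(agent_id, text):
    """Return (content_id, compressed), compressing only on first sight."""
    content_id = hashlib.sha256(text.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT compressed FROM blobs WHERE content_id = ?", (content_id,)
    ).fetchone()
    if row is None:
        compressed = toy_compress(text)
        conn.execute(
            "INSERT INTO blobs VALUES (?, ?, ?)", (content_id, compressed, agent_id)
        )
    else:
        compressed = row[0]
    # Provenance: remember every agent that has used this content.
    conn.execute("INSERT OR IGNORE INTO uses VALUES (?, ?)", (content_id, agent_id))
    return content_id, compressed

repo_snapshot = "def main():\n    print('hello')\n" * 100
id_a, out_a = get_or_compress("cursor", repo_snapshot)
id_b, out_b = get_or_compress("claude-desktop", repo_snapshot)
print(len(compress_calls))
```

The second agent hits the existing row and never invokes the compressor, which is the whole benefit. The trade-off is that any change to the content produces a new hash, so reuse only applies to byte-identical inputs.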

Gotcha

Local-only execution is a deployment blocker for production systems. The proxy runs on localhost, which works for individual developers but fails for Lambda functions, Cloud Run services, or distributed agent architectures. The privacy pitch ('your data stays here') rings hollow when agents already send context to Anthropic/OpenAI—enterprises subject to SOC 2 or HIPAA can't audit a local compression proxy more easily than they audit API calls. For production use, you'd need to containerize Headroom and run it as a sidecar, which adds latency and operational complexity.

Kompress-base, the prose compression model, is a black box. The HuggingFace repo shows benchmarks (40% compression at 91% semantic similarity on 'agentic traces') but no training data, architecture details, or ablation studies. You can't debug why it mangles specific log formats or audit for data contamination. When compression fails—say, it drops a critical error code—you're stuck treating it as a magic oracle. For regulated industries or teams that need explainability, this opacity is a non-starter.

Reversible compression assumes the LLM will invoke retrieval correctly. If Claude hallucinates context instead of calling retrieve_original_content, or if it doesn't realize compressed data is insufficient, CCR becomes lossy. The benchmarks don't measure retrieval failure rates or token budgets consumed by round-trips. If retrieval adds 2K tokens per invocation and happens on 20% of requests, savings evaporate on complex tasks.

The headroom wrap command is brittle shell script generation. It works for Claude Code and Cursor because they read ANTHROPIC_API_KEY and respect ANTHROPIC_BASE_URL overrides, but GitHub Copilot CLI uses SSO auth flows that ignore environment variables, and custom agents with hardcoded endpoints require per-version patches.
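The round-trip concern raised above can be put into rough numbers. The model below is a deliberate simplification and an assumption of this sketch: it counts only the extra tokens returned by retrieval, and ignores both the cost of re-sending the conversation on the follow-up turn and the output tokens spent on the tool call. The 85% ratio, 20% retrieval rate, and 2K tokens per retrieval are the figures quoted in the article.

```python
def net_tokens_saved(original_tokens, compression_ratio, retrieval_rate, retrieval_tokens):
    """Expected input tokens saved per request after paying for retrievals.

    compression_ratio is the fraction of tokens removed (0.85 means 85% smaller).
    """
    saved = original_tokens * compression_ratio
    overhead = retrieval_rate * retrieval_tokens
    return saved - overhead

def break_even_tokens(compression_ratio, retrieval_rate, retrieval_tokens):
    """Payload size at which expected retrieval overhead equals the savings."""
    return (retrieval_rate * retrieval_tokens) / compression_ratio

large = net_tokens_saved(50_000, 0.85, 0.20, 2_000)
small = net_tokens_saved(400, 0.85, 0.20, 2_000)
threshold = break_even_tokens(0.85, 0.20, 2_000)

print(f"50K-token payload: {large:.0f} tokens saved per request")
print(f"400-token payload: {small:.0f} tokens saved per request")
print(f"break-even payload: about {threshold:.0f} tokens")
```

Under these assumptions the overhead is negligible for large payloads and only turns savings negative for small ones. The risk the paragraph points at is the part this model leaves out: multi-turn tasks where each retrieval triggers another full request.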

Verdict

Use if: You're an individual developer or small team spending $300+/month on Claude/GPT-4 with tool-heavy workflows (debugging, RAG, codebase search), your agents run locally (Cursor, Aider, Claude Code), and you're willing to run a local proxy for 60-85% token savings. The reversible compression story is technically sound: giving the LLM a retrieval function sidesteps the accuracy debate better than competitors do. If you're using multiple coding agents (Cursor + Claude Desktop + Codex), evaluate Headroom for the cross-agent memory alone; shared compressed context across tools is genuinely novel.

Skip if: You're running production LLM services in Lambda/Cloud Run/Fargate (local-only execution is a non-starter), you need auditability for regulated industries (Kompress-base lacks transparency), or you're working with non-OpenAI-compatible agents (GitHub Copilot, custom internal tools). The 19K GitHub stars smell like Hacker News hype: the enterprise docs are placeholder-tier, and the single-maintainer risk is high. For teams that can't run local infrastructure, Anthropic's native prompt caching at $0.10/1M cached tokens is simpler than operating Headroom, even if cache hit rates are lower.
