Shard: Proving LLM Inference Can Work Across Scattered GPUs and Terrible Internet
Hook
A research project just demonstrated 30 tokens per second of inference on a 671B-parameter model by connecting random GPUs across the internet. It got there not by making the network faster, but by designing around the assumption that it's impossibly slow.
Context
The frontier model problem is getting worse: GPT-4-scale models need hundreds of gigabytes of VRAM, pricing out anyone without datacenter access. The standard solutions (tensor parallelism, pipeline parallelism, and serving frameworks like vLLM and TGI that build on them) all assume your GPUs sit in the same rack, connected by NVLink or 400 Gbps InfiniBand. They fall apart the moment you introduce WAN latency.
Meanwhile, there's massive unused GPU capacity scattered across the world: gaming rigs, university clusters, hobbyist setups. The economic incentive is obvious: aggregate consumer hardware to serve frontier models without hyperscaler infrastructure costs. But the technical challenge seemed insurmountable: how do you pipeline activations through model layers when each hop takes 50-150 ms instead of microseconds? Naive pipeline parallelism over WAN gives you 1.87 tokens per second, which is completely unusable. Shard is an architectural proof-of-concept from the c0mpute decentralized inference project that tackles this exact constraint, demonstrating that with the right topology and aggressive speculation, WAN-distributed inference isn't just possible but viable.
Technical Insight
Shard's core innovation is ring direct-return topology with asynchronous pipeline chunking. Traditional pipeline parallelism flows activations forward through stages (GPU0 → GPU1 → GPU2) and returns results backward (GPU2 → GPU1 → GPU0). Over WAN, this doubles your latency. Shard breaks the return path: activations still flow sequentially through the ring, but the final stage sends results directly back to the coordinator in one hop. This seemingly minor change unlocks asynchronous pipelining—the coordinator can stream multiple verification chunks into the pipeline simultaneously without waiting for round-trips.
Here's how the coordinator manages speculative verification:
# From phase0/coordinator.py - async pipeline chunk submission
verify_tasks = []
for chunk_idx, chunk_start in enumerate(range(0, K+1, chunk_size)):
    chunk_end = min(chunk_start + chunk_size, K+1)
    chunk_tokens = all_tokens[prompt_len + chunk_start : prompt_len + chunk_end]
    # Fire chunk into pipeline without blocking
    verify_task = asyncio.create_task(
        self._verify_chunk_distributed(
            chunk_tokens,
            chunk_start,
            stage_clients
        )
    )
    verify_tasks.append(verify_task)
# Await all chunks in parallel - pipeline hides WAN latency
chunk_results = await asyncio.gather(*verify_tasks)
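The effect of the direct return and of non-blocking chunk submission can be estimated with a back-of-envelope model. This is a sketch under simplifying assumptions, not code from the Shard repository: every hop costs the same one-way latency, every stage spends the same compute time per chunk, sending a chunk is instantaneous, and all function names are hypothetical.

```python
def hops_backward_return(num_stages: int) -> int:
    """Hops when results retrace the pipeline: N hops out, N hops back."""
    return 2 * num_stages


def hops_direct_return(num_stages: int) -> int:
    """Hops when the last stage replies straight to the coordinator."""
    return num_stages + 1


def blocking_time_ms(num_stages, num_chunks, hop_ms, compute_ms):
    """Coordinator waits for each chunk to return before sending the next."""
    per_chunk = hops_direct_return(num_stages) * hop_ms + num_stages * compute_ms
    return num_chunks * per_chunk


def pipelined_time_ms(num_stages, num_chunks, hop_ms, compute_ms):
    """All chunks are in flight at once: the network path is paid once and
    each later chunk finishes one stage-compute interval after the previous."""
    fill = hops_direct_return(num_stages) * hop_ms + num_stages * compute_ms
    drain = (num_chunks - 1) * compute_ms
    return fill + drain
```

With 4 stages, 3 chunks, 100 ms per hop, and 20 ms of compute per stage, the model gives 8 hops versus 5 for the return path, and 1740 ms versus 620 ms for blocking versus pipelined submission. The numbers are illustrative inputs, not measurements from the project.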
The draft model runs locally on the coordinator using CUDA graphs with a static KV cache. The challenge: speculative decoding requires rolling back cache state when the verifier rejects tokens, but CUDA graphs capture the entire execution graph—you can't conditionally resize tensors. Shard's solution is brilliant: use a static-size cache with position-tensor-driven writes. The cache never shrinks; instead, a position tensor tracks which slots are valid. On rejection, decrement the position counter—the stale slots stay in memory but get overwritten on the next forward pass. This produces byte-identical output to eager execution (validated by research/glm_swarm_nvfp4_cg_diff.py) while eliminating per-token launch overhead:
# Adapted from draft/static_graph.py - rollback-safe cache writes
def forward_graphed(self, input_ids, position_id_tensor):
    # position_id_tensor is a 1-element int64 tensor on the GPU that the
    # host updates in place between graph replays. Reading it with .item()
    # would force a host sync and bake a single position into the captured
    # graph, so the cache write is indexed by the tensor itself.
    # new_key / new_value: this step's K/V projections (computation elided)
    self.k_cache.index_copy_(2, position_id_tensor, new_key)
    self.v_cache.index_copy_(2, position_id_tensor, new_value)
    # On rejection, caller just decrements position_id_tensor
    # Next forward pass overwrites the rejected slot
This CUDA-graphed draft delivers a 3.8× speedup (49.7ms → 13.1ms per token) compared to eager execution. Combined with speculation accepting ~5 tokens per round-trip, the system shifts from WAN-latency-bound (1.87 tok/s) to throughput-bound (30 tok/s)—WAN latency drops to ~5% of total loop time.
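The interaction between greedy acceptance and the position-counter rollback can be shown with a toy model. This is a minimal sketch, not Shard's implementation: token ids stand in for K/V tensors, the class and function names are hypothetical, and the acceptance rule is the standard one for greedy speculative decoding (keep the matching prefix, then take the verifier's token at the first mismatch).

```python
from typing import List, Optional


class StaticCache:
    """Fixed-size cache; `pos` marks how many leading slots are valid."""

    def __init__(self, capacity: int) -> None:
        self.slots: List[Optional[int]] = [None] * capacity  # never resized
        self.pos = 0

    def write(self, value: int) -> None:
        self.slots[self.pos] = value  # overwrite whatever was there before
        self.pos += 1

    def rollback(self, n: int) -> None:
        self.pos -= n  # stale entries stay in memory but are no longer valid

    def valid(self) -> List[Optional[int]]:
        return self.slots[: self.pos]


def accept_greedy(draft: List[int], verified: List[int]) -> List[int]:
    """Keep the longest prefix where draft and verifier agree, then append
    the verifier's token at the first disagreement (or its extra token when
    every draft token matched). `verified` has len(draft) + 1 entries."""
    n = 0
    while n < len(draft) and draft[n] == verified[n]:
        n += 1
    return draft[:n] + [verified[n]]


cache = StaticCache(capacity=16)
draft = [11, 12, 13, 14, 15]        # K = 5 drafted tokens
verified = [11, 12, 13, 99, 0, 0]   # verifier disagrees at index 3
for tok in draft:
    cache.write(tok)

accepted = accept_greedy(draft, verified)     # [11, 12, 13, 99]
rejected = len(draft) - (len(accepted) - 1)   # 2 drafted tokens discarded
cache.rollback(rejected)
cache.write(accepted[-1])                     # correction reuses a stale slot
```

After the round, the valid region holds the four accepted tokens, the buffer still has 16 slots, and slot 4 still contains the stale value 15 until a later write replaces it. No allocation or resize happens at any point, which is the property a captured graph needs.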
The transport layer (phase0/wire.py) deserves attention for what it doesn't do: it refuses to use pickle. Pickle is the Python ML ecosystem's default serialization format, and unpickling attacker-crafted bytes can execute arbitrary code, which is a catastrophic security hole when accepting network traffic from untrusted nodes. Shard instead uses ChaCha20-Poly1305 authenticated encryption with length-prefixed framing:
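# Simplified from phase0/wire.py
import secrets

# cipher is an AEAD object exposing encrypt(nonce, data, aad) and
# decrypt(nonce, ciphertext, aad), for example cryptography's ChaCha20Poly1305.

async def send_frame(writer, data: bytes, cipher):
    nonce = secrets.token_bytes(12)  # fresh 96-bit nonce for every frame
    ciphertext = cipher.encrypt(nonce, data, None)  # ciphertext plus auth tag
    # Layout: 4-byte big-endian ciphertext length | 12-byte nonce | ciphertext
    frame = len(ciphertext).to_bytes(4, 'big') + nonce + ciphertext
    writer.write(frame)
    await writer.drain()

async def recv_frame(reader, cipher):
    length_bytes = await reader.readexactly(4)
    length = int.from_bytes(length_bytes, 'big')
    nonce = await reader.readexactly(12)
    ciphertext = await reader.readexactly(length)
    # Raises if the frame was tampered with or encrypted under another key
    return cipher.decrypt(nonce, ciphertext, None)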
A compromised pre-shared key (PSK) still lets attackers decrypt traffic, and lets them forge frames that pass authentication. But because payloads are never unpickled, a malicious frame can't achieve remote code execution. It just triggers a parse error.
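What a pickle-free payload can look like is easy to sketch with the standard library. The header layout below is hypothetical and is not Shard's wire format: one byte for a dtype code, one byte for the number of dimensions, then each dimension as a big-endian uint32, followed by the raw tensor bytes. The point is that the parser only ever reads fixed-width integers and rejects anything that does not add up.

```python
import struct

MAX_DIMS = 8
DTYPES = {0: "float16", 1: "bfloat16", 2: "float32"}  # hypothetical codes
ITEMSIZE = {"float16": 2, "bfloat16": 2, "float32": 4}


def pack_tensor(dtype_code: int, shape: tuple, raw: bytes) -> bytes:
    """Header: dtype code (1 byte), ndim (1 byte), then ndim uint32 dims."""
    header = struct.pack(f">BB{len(shape)}I", dtype_code, len(shape), *shape)
    return header + raw


def unpack_tensor(buf: bytes):
    """Parse a payload, raising ValueError on anything unexpected."""
    if len(buf) < 2:
        raise ValueError("truncated header")
    dtype_code, ndim = struct.unpack_from(">BB", buf, 0)
    if dtype_code not in DTYPES or ndim > MAX_DIMS:
        raise ValueError("unknown dtype or too many dimensions")
    if len(buf) < 2 + 4 * ndim:
        raise ValueError("truncated shape")
    shape = struct.unpack_from(f">{ndim}I", buf, 2)
    raw = buf[2 + 4 * ndim:]
    expected = ITEMSIZE[DTYPES[dtype_code]]
    for dim in shape:
        expected *= dim
    if len(raw) != expected:
        raise ValueError("payload size does not match shape")
    return DTYPES[dtype_code], shape, raw
```

A garbage payload such as `b"\xff\x01"` fails the dtype check and raises `ValueError`. Nothing in the parser can construct arbitrary objects or call attacker-chosen functions, which is the property pickle lacks.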
Finally, the system generates verifiable execution receipts embedding GPU UUIDs, public IPs, measured WAN RTTs, and output token hashes. This is infrastructure for proving provenance: demonstrating that inference actually happened across distributed nodes rather than trusting a single datacenter's claim. The receipt format is simple but auditable—any observer can verify that the declared topology matches network measurements.
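A receipt of this kind could be assembled roughly as follows. This is an illustrative sketch only: the field names, the JSON encoding, and the choice of SHA-256 are assumptions made for the example, not Shard's actual receipt format, and the GPU UUID and IP address used with it should be treated as placeholders.

```python
import hashlib
import json


def hash_tokens(token_ids):
    """SHA-256 over the token ids, each encoded as 4 big-endian bytes."""
    h = hashlib.sha256()
    for t in token_ids:
        h.update(t.to_bytes(4, "big"))
    return h.hexdigest()


def _digest(body):
    """Hash a canonical JSON encoding so key order cannot change the result."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def build_receipt(stages, token_ids):
    """`stages` is a list of dicts with gpu_uuid, public_ip and rtt_ms."""
    body = {"stages": stages, "output_sha256": hash_tokens(token_ids)}
    return {**body, "receipt_sha256": _digest(body)}


def check_receipt(receipt, token_ids):
    """Recompute both hashes and compare them with the recorded values."""
    body = {k: v for k, v in receipt.items() if k != "receipt_sha256"}
    return (
        receipt["output_sha256"] == hash_tokens(token_ids)
        and receipt["receipt_sha256"] == _digest(body)
    )


receipt = build_receipt(
    [{"gpu_uuid": "GPU-placeholder-0", "public_ip": "192.0.2.10", "rtt_ms": 87.5}],
    [101, 202, 303],
)
```

Here `check_receipt(receipt, [101, 202, 303])` succeeds, while changing an output token or editing a recorded RTT makes it fail. A bare hash only detects accidental or naive modification. Proving who produced a receipt would additionally require a signature from each participating node.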
Gotcha
The limitations are disqualifying for production use and the codebase is refreshingly honest about them. There's no NAT traversal—every worker must expose a public port. This blocks the entire target audience of residential GPU owners behind home routers. The README acknowledges this explicitly: hole-punching and relay fallback are listed as Phase 1 future work, not present in the codebase today.
Privacy is fundamentally leaky. Any node that processes activations sees those activations in plaintext—they have to decrypt them to run inference. The README correctly notes that intermediate layers leak partial token information to malicious workers, and proposes 'trusted node pinning' (routing through verified operators) rather than overclaiming zero-knowledge. But trusted routing isn't implemented; it's documented intent. If you join a Shard swarm today, every participant can see the activations they process. Confidential computing (AMD SEV, Intel TDX) could solve this with hardware-enforced memory encryption, but requires datacenter hardware and isn't compatible with Shard's consumer-GPU focus.
Fault tolerance is nonexistent. A single node failure crashes the entire pipeline. There's no checkpointing, no fallback routing, no dynamic rebalancing if a worker goes offline. The Phase 3 roadmap mentions heterogeneous GPU support with adaptive layer allocation, but that's vaporware—today you manually assign shard boundaries and pray nothing crashes. The single-request-at-a-time design with greedy decoding only (no batching, no sampling, no beam search) means this can't compete with production serving frameworks on utilization or flexibility.
Verdict
Use if: You're building decentralized inference infrastructure and need architectural proof that WAN-distributed pipeline parallelism can achieve usable speeds. You're willing to treat this as a research artifact (a reference implementation demonstrating ring direct-return topology, CUDA-graphed speculation, and the measured path from 1.87 tok/s to 30 tok/s) and build the remaining 80% of a production system yourself. You value the honest documentation of what doesn't work as much as what does.
Skip if: You need production-ready serving (use vLLM or TGI), actual privacy guarantees beyond trust (wait for confidential computing integration), fault tolerance or dynamic rebalancing (Petals handles this but doesn't prevent single-node-sees-all-weights), or anything beyond single-request greedy decoding.
This is a proof-of-concept that solves one very specific problem, proving frontier LLM inference can work across scattered consumer GPUs, and deliberately punts everything else to future phases. The value is the architectural insight, not a working inference engine.