Candle: Running LLMs in Rust Without the Python Tax

Hook

A 7MB WebAssembly binary running Whisper speech recognition entirely in your browser—no servers, no Python, no Docker. That's what happens when you rebuild ML infrastructure in Rust.

Context

Machine learning deployment has a Python problem. Every production ML system eventually hits the same wall: you train in Python because the ecosystem is unbeatable, but deploying Python means bundling a runtime, managing dependencies, and accepting higher memory overhead and slower cold starts. For edge devices, embedded systems, and WebAssembly targets, shipping a Python interpreter is often a non-starter.

The typical workaround—ONNX export and specialized runtimes—forces you into a lowest-common-denominator world where modern architectures like transformers require workarounds and debugging becomes an exercise in tracing through opaque runtime layers. Candle takes a different approach: rebuild the ML framework itself in Rust with inference as the primary use case, maintaining API familiarity for PyTorch users while unlocking Rust's compile-time optimizations, memory safety, and target flexibility. It's Hugging Face's bet that the future of ML deployment looks more like systems programming than notebook-driven development.

Technical Insight

Candle's architecture centers on a Tensor type that runs on CPU, CUDA, and Metal backends, selected through a unified Device enum. As in PyTorch, tensor shapes and device placement are checked at runtime rather than by the compiler. What Rust adds is that every fallible operation returns a Result, so a dimension mismatch or a cross-device operation surfaces as an explicit error value you have to handle. The API will feel immediately familiar if you've written PyTorch:

use candle_core::{Device, Result, Tensor, Var};

fn main() -> Result<()> {
    // Falls back to the CPU when no CUDA device is available
    let device = Device::cuda_if_available(0)?;

    // Create tensors with PyTorch-like syntax: randn(mean, std, shape, device)
    let a = Tensor::randn(0f32, 1.0, (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1.0, (3, 4), &device)?;

    // Matrix multiplication; both operands must live on the same device
    let c = a.matmul(&b)?;
    println!("c shape: {:?}", c.shape());

    // Automatic differentiation: gradients are tracked for `Var`s
    let x = Var::new(&[3f32], &device)?;
    let y = (x.as_tensor() * x.as_tensor())?; // y = x²
    let grads = y.backward()?;

    println!("dy/dx: {:?}", grads.get(&x));
    Ok(())
}
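
Every fallible call in that snippet ends in `?` because tensor operations report problems such as incompatible shapes as `Err` values instead of panicking. The sketch below shows the same pattern in plain standard-library Rust. `Matrix`, `ShapeError`, and this `matmul` are hypothetical stand-ins written for illustration, not Candle's types or internals.

```rust
// Hypothetical stand-in types: this illustrates shape checking via Result,
// it is not how Candle is implemented.
#[derive(Debug, PartialEq)]
struct ShapeError {
    lhs: (usize, usize),
    rhs: (usize, usize),
}

#[derive(Debug, Clone, PartialEq)]
struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>, // row-major
}

impl Matrix {
    fn new(rows: usize, cols: usize, data: Vec<f32>) -> Self {
        assert_eq!(data.len(), rows * cols, "data length must equal rows * cols");
        Matrix { rows, cols, data }
    }

    // Returns Err when the inner dimensions disagree.
    fn matmul(&self, rhs: &Matrix) -> Result<Matrix, ShapeError> {
        if self.cols != rhs.rows {
            return Err(ShapeError {
                lhs: (self.rows, self.cols),
                rhs: (rhs.rows, rhs.cols),
            });
        }
        let mut out = vec![0f32; self.rows * rhs.cols];
        for i in 0..self.rows {
            for k in 0..self.cols {
                let a = self.data[i * self.cols + k];
                for j in 0..rhs.cols {
                    out[i * rhs.cols + j] += a * rhs.data[k * rhs.cols + j];
                }
            }
        }
        Ok(Matrix::new(self.rows, rhs.cols, out))
    }
}

fn main() {
    let a = Matrix::new(2, 3, vec![1.0; 6]);
    let b = Matrix::new(3, 4, vec![1.0; 12]);
    let bad = Matrix::new(2, 4, vec![1.0; 8]);

    // Compatible shapes yield Ok, incompatible ones yield Err.
    println!("{:?}", a.matmul(&b).map(|m| (m.rows, m.cols)));
    println!("{:?}", a.matmul(&bad).map(|m| (m.rows, m.cols)));
}
```

The practical upshot is that a shape bug shows up as an error you can log or propagate at the call site, not as a crash deep inside a kernel.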

The killer feature is what Candle doesn't include: a Python interpreter. A typical candle-based inference binary for a quantized LLaMA model clocks in under 20MB including the runtime (that is the executable itself; model weights ship separately). Compare that to a PyTorch deployment requiring 1GB+ of dependencies. This size difference isn't just academic: it translates directly to faster cold starts in serverless environments and viability on embedded targets.

Candle's model loading leverages the safetensors format and integrates directly with Hugging Face Hub, meaning you can load pretrained weights without leaving Rust:

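use candle_core::{DType, Device};
use candle_transformers::models::llama;
use hf_hub::api::sync::Api;

let device = Device::cuda_if_available(0)?;

// Gated repos such as this one need a Hugging Face access token configured
let api = Api::new()?;
let repo = api.model("meta-llama/Llama-2-7b-hf".to_string());
// Large checkpoints are usually sharded; fetch every shard in the repo
// and pass them all to the VarBuilder below
let weights = repo.get("model.safetensors")?;

// Constructor name may differ between candle-transformers versions
let config = llama::Config::config_7b_v2(false); // false = no flash attention
// Safety: memory-mapping assumes the file is not modified while in use
let vb = unsafe { 
    candle_nn::VarBuilder::from_mmaped_safetensors(
        &[weights], 
        DType::F32, 
        &device
    )? 
};

let model = llama::Llama::load(vb, &config)?;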
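
Part of why memory-mapped loading is cheap is the file layout itself: a safetensors file starts with an 8-byte little-endian header length, followed by a JSON header giving each tensor's dtype, shape, and byte offsets, followed by the raw tensor bytes. The sketch below splits such a buffer using only the standard library. `split_safetensors` is a hypothetical helper for illustration; real code should go through the safetensors crate, which also parses and validates the JSON.

```rust
// Splits a safetensors byte buffer into its JSON header and the raw tensor data.
// Layout: 8-byte little-endian u64 header length, JSON header, tensor bytes.
// Hypothetical helper: it does not parse or validate the JSON itself.
fn split_safetensors(bytes: &[u8]) -> Result<(&str, &[u8]), String> {
    if bytes.len() < 8 {
        return Err("buffer shorter than the 8-byte length prefix".to_string());
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&bytes[..8]);
    let header_len = u64::from_le_bytes(len_bytes) as usize;

    // Guard against overflow and against a header that runs past the buffer.
    let header_end = 8usize
        .checked_add(header_len)
        .filter(|&end| end <= bytes.len())
        .ok_or_else(|| "header length exceeds buffer".to_string())?;

    let header = std::str::from_utf8(&bytes[8..header_end])
        .map_err(|e| format!("header is not valid UTF-8: {}", e))?;
    Ok((header, &bytes[header_end..]))
}

fn main() {
    // A tiny in-memory "file": one F32 tensor named "w" with two elements.
    let header = r#"{"w":{"dtype":"F32","shape":[2],"data_offsets":[0,8]}}"#;
    let mut file = (header.len() as u64).to_le_bytes().to_vec();
    file.extend_from_slice(header.as_bytes());
    file.extend_from_slice(&1.0f32.to_le_bytes());
    file.extend_from_slice(&2.0f32.to_le_bytes());

    let (parsed_header, data) = split_safetensors(&file).expect("well-formed buffer");
    println!("header: {}", parsed_header);
    println!("tensor bytes: {}", data.len());
}
```

Because the tensor bytes sit at known offsets after the header, a loader can map the file and hand out views into it without copying the weights up front.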

The quantization story is particularly compelling. Candle implements GGML-style quantization schemes (Q4_0, Q4_K, Q8_0) that reduce memory bandwidth and enable running 7B parameter models on consumer GPUs. The quantization is exposed through the type system—a QTensor type that maintains the quantization parameters alongside the data:

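use candle_core::quantized::{gguf_file, QMatMul};
use candle_core::Module;

// Load a quantized model directly
let mut file = std::fs::File::open("model-q4_0.gguf")?;
let content = gguf_file::Content::read(&mut file)?;

// Pull one quantized weight out by name (the tensor name is illustrative;
// `device` and `input` are assumed to be defined earlier)
let qtensor = content.tensor(&mut file, "blk.0.attn_q.weight", &device)?;

// QMatMul wraps the quantized weight and works on the quantized blocks
// directly, so the full-precision matrix is never materialized
let qmatmul = QMatMul::from_qtensor(qtensor)?;
let result = qmatmul.forward(&input)?;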
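
To make the GGML-style schemes less abstract, here is a minimal sketch of Q8_0-style block quantization in plain Rust: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit values. `BlockQ8`, `quantize_block`, `dequantize_block`, and `weight_bytes` are hypothetical names for illustration, and the scale is kept as `f32` here for simplicity where GGML stores it as `f16`. This is not Candle's implementation.

```rust
const BLOCK: usize = 32; // GGML-style schemes quantize weights in blocks of 32

// One Q8_0-style block: a shared scale plus 32 signed 8-bit values.
// Assumption: scale kept as f32 for simplicity; GGML stores it as f16.
struct BlockQ8 {
    scale: f32,
    quants: [i8; BLOCK],
}

fn quantize_block(values: &[f32; BLOCK]) -> BlockQ8 {
    // The largest magnitude in the block maps to +/-127.
    let max_abs = values.iter().fold(0f32, |m, v| m.max(v.abs()));
    let scale = max_abs / 127.0;
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut quants = [0i8; BLOCK];
    for (q, v) in quants.iter_mut().zip(values.iter()) {
        *q = (v * inv).round() as i8;
    }
    BlockQ8 { scale, quants }
}

fn dequantize_block(block: &BlockQ8) -> [f32; BLOCK] {
    let mut out = [0f32; BLOCK];
    for (o, q) in out.iter_mut().zip(block.quants.iter()) {
        *o = f32::from(*q) * block.scale;
    }
    out
}

// Approximate weight storage in bytes, given the bytes used per 32-weight block.
fn weight_bytes(params: u64, bytes_per_block: u64) -> u64 {
    params / BLOCK as u64 * bytes_per_block
}

fn main() {
    let mut values = [0f32; BLOCK];
    for (i, v) in values.iter_mut().enumerate() {
        *v = (i as f32 - 16.0) * 0.05;
    }
    let block = quantize_block(&values);
    let restored = dequantize_block(&block);
    let max_err = values
        .iter()
        .zip(restored.iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0f32, f32::max);
    println!("max round-trip error: {}", max_err);

    // Block sizes with an f16 scale: Q4_0 = 2 + 16 bytes, Q8_0 = 2 + 32 bytes.
    let params = 7_000_000_000u64;
    println!("F16 : {} bytes", params * 2);
    println!("Q8_0: {} bytes", weight_bytes(params, 34));
    println!("Q4_0: {} bytes", weight_bytes(params, 18));
}
```

The arithmetic at the end is the reason 7B models fit on consumer GPUs: at 18 bytes per 32 weights, Q4_0 costs 4.5 bits per weight, so the weights take roughly a quarter of their F16 size.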

WebAssembly support is where Candle truly diverges from traditional frameworks. Because Rust compiles to WASM with predictable performance characteristics and no garbage collector, you can run models like Whisper, Segment Anything, or even LLaMA2 entirely client-side. The wasm32-unknown-unknown target works out of the box with feature flags controlling SIMD optimizations. Hugging Face maintains a collection of online demos proving this isn't vaporware—real models running at interactive speeds in Chrome.

The neural network building blocks in candle-nn provide familiar abstractions: Linear, Conv2d, LayerNorm, and optimization primitives like AdamW. Defining a model looks structurally similar to PyTorch's nn.Module:

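use candle_core::{Result, Tensor};
use candle_nn::{LayerNorm, Module};

// MultiHeadAttention and Mlp are assumed to be defined elsewhere, each
// exposing forward(&self, &Tensor) -> Result<Tensor>
struct TransformerBlock {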
    attention: MultiHeadAttention,
    ln1: LayerNorm,
    mlp: Mlp,
    ln2: LayerNorm,
}

impl TransformerBlock {
    fn forward(&self, x: &Tensor) -> Result<Tensor> {
        let residual = x;
        let x = self.ln1.forward(x)?;
        let x = self.attention.forward(&x)?;
        let x = (x + residual)?;
        
        let residual = &x;
        let x = self.ln2.forward(&x)?;
        let x = self.mlp.forward(&x)?;
        x + residual
    }
}
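
If the pre-norm residual pattern in that block is unfamiliar, the following standard-library sketch shows what each half computes on a single vector: normalize, run a sublayer, add the input back. `layer_norm` and `residual_block` are hypothetical helpers for illustration, with the learned scale and bias that a real LayerNorm carries left out.

```rust
// Normalizes a vector to zero mean and (approximately) unit variance.
// Learned scale and bias are omitted in this sketch.
fn layer_norm(x: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let denom = (var + eps).sqrt();
    x.iter().map(|v| (v - mean) / denom).collect()
}

// Pre-norm residual step: x + f(layer_norm(x)), where f is any sublayer.
fn residual_block<F>(x: &[f32], sublayer: F) -> Vec<f32>
where
    F: Fn(&[f32]) -> Vec<f32>,
{
    let normed = layer_norm(x, 1e-5);
    let y = sublayer(&normed);
    x.iter().zip(y.iter()).map(|(a, b)| a + b).collect()
}

fn main() {
    let x: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0];
    // Stand-in sublayer that doubles every element; a real block would
    // use attention or an MLP here.
    let out = residual_block(&x, |h| h.iter().map(|v| v * 2.0).collect());
    println!("{:?}", out);
}
```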

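Under the hood a Tensor is a reference-counted handle, so cloning is cheap and the underlying buffer is released deterministically as soon as the last handle goes out of scope. There is no garbage collector or lingering interpreter state to delay it. Combined with Rust's ownership rules, which make every clone and mutation explicit, this tends to make memory usage more predictable than in a Python process, though a clone held longer than needed will still keep GPU memory alive.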
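
The release behavior is ordinary Rust drop semantics, which the sketch below demonstrates with `std::sync::Arc`. `Buffer` is a hypothetical stand-in for the storage behind a tensor, and the global counter exists only to make frees observable. It illustrates the general mechanism, not Candle's internals.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Counts how many buffers are currently alive, so frees are observable.
static LIVE_BUFFERS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for the device memory backing a tensor.
struct Buffer {
    bytes: Vec<u8>,
}

impl Buffer {
    fn new(len: usize) -> Arc<Buffer> {
        LIVE_BUFFERS.fetch_add(1, Ordering::SeqCst);
        Arc::new(Buffer { bytes: vec![0; len] })
    }
}

impl Drop for Buffer {
    fn drop(&mut self) {
        // Runs exactly once, when the last Arc handle goes away.
        LIVE_BUFFERS.fetch_sub(1, Ordering::SeqCst);
    }
}

fn live_buffers() -> usize {
    LIVE_BUFFERS.load(Ordering::SeqCst)
}

fn main() {
    let a = Buffer::new(1024);
    let b = Arc::clone(&a); // second handle, same storage
    println!("handles: {}, live: {}", Arc::strong_count(&a), live_buffers());

    drop(a); // storage is still held by `b`
    println!("after dropping a: live = {}", live_buffers());
    println!("b sees {} bytes", b.bytes.len());

    drop(b); // last handle gone, Drop runs here
    println!("after dropping b: live = {}", live_buffers());
}
```

The point to take away is that the free happens at a specific, visible place in the program: the scope exit or `drop` call of the last handle.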

Gotcha

Candle is brutally honest about what it's not: a training framework. While autodiff exists and backpropagation works, the ergonomics and performance optimizations heavily favor inference. Don't expect PyTorch Lightning's training loop abstractions or extensive learning rate scheduling options. The examples directory contains basic training scripts, but you'll find yourself fighting the type system to implement custom training procedures that would be straightforward in Python. Mixed precision training, gradient accumulation, and distributed training primitives are either missing or immature.
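
As a concrete taste of what "write it yourself" means, here is a minimal sketch of a learning rate schedule (linear warmup followed by cosine decay) in plain Rust. `lr_at` and its parameters are hypothetical names chosen for illustration, not something Candle ships. In a training loop you would compute the value each step and pass it to whatever learning-rate setter your optimizer exposes.

```rust
use std::f64::consts::PI;

// Linear warmup to `base_lr`, then cosine decay to `min_lr` at `total_steps`.
// Hypothetical helper for illustration.
fn lr_at(step: usize, warmup_steps: usize, total_steps: usize, base_lr: f64, min_lr: f64) -> f64 {
    // Warmup: ramp linearly, reaching base_lr on the last warmup step.
    if warmup_steps > 0 && step < warmup_steps {
        return base_lr * (step + 1) as f64 / warmup_steps as f64;
    }
    // Past the end of training: hold at the floor.
    if step >= total_steps {
        return min_lr;
    }
    // Cosine decay over the remaining steps.
    let decay_steps = (total_steps - warmup_steps).max(1) as f64;
    let progress = (step - warmup_steps) as f64 / decay_steps;
    min_lr + 0.5 * (base_lr - min_lr) * (1.0 + (PI * progress).cos())
}

fn main() {
    for step in [0usize, 9, 10, 55, 99, 100].iter() {
        println!("step {:>3}: lr = {:.6}", step, lr_at(*step, 10, 100, 1e-3, 1e-5));
    }
}
```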

The ecosystem gap is real. When you hit an error in PyTorch, Stack Overflow has 50,000 answers. With Candle, you're reading source code and opening GitHub issues. Ready-made model implementations are largely limited to what has already been ported to Candle, so if you need a niche architecture from a recent paper, you're implementing it yourself. Documentation exists but lacks the depth of tutorials, gotcha guides, and war stories that make Python ML frameworks approachable. The Rust learning curve also can't be ignored: if your team hasn't internalized borrowing, lifetimes, and trait bounds, Candle will teach those lessons the hard way during a production deadline.

Verdict

Use if: You're deploying inference-only workloads where Python's overhead is measurable (serverless cold starts, edge devices, embedded systems), you need WebAssembly ML for privacy-preserving client-side inference, or you're building Rust-native applications where a Python bridge would introduce unacceptable complexity. Candle excels when binary size, memory efficiency, and predictable performance matter more than development velocity.

Skip if: Your primary workflow involves model training and experimentation, your team lacks Rust expertise and the timeline doesn't accommodate learning it, or you need the long tail of community-contributed models and utilities that only Python frameworks provide. For research and prototyping, PyTorch remains the better choice; use Candle when it's time to ship.
