Running 70B LLMs on a 4GB GPU: How AirLLM Trades Speed for Accessibility

Hook

What if the bottleneck preventing you from running a 70B-parameter model isn't your GPU, but the assumption about where a model needs to live during inference?

Context

The democratization of large language models hit a hard wall: memory. While open-source models like Llama 2 70B and Mixtral 8x7B became freely available, actually running them required enterprise-grade GPUs with 80GB+ VRAM or complex multi-GPU setups costing tens of thousands of dollars. Quantization methods like GPTQ and quantized formats like GGUF helped, but even an 8-bit quantized 70B model needs roughly 70GB of memory for its weights alone (70 billion parameters at one byte each), still far beyond consumer hardware.
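
The memory figures above follow from simple arithmetic. The snippet below is a minimal sketch that counts weights only (no activations or KV cache) and uses decimal gigabytes; the helper name weight_memory_gb is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


# 70 billion parameters at common precisions
for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{weight_memory_gb(70e9, bits):.0f} GB")
```

Even the 4-bit case lands at roughly 35GB, which is why quantization alone does not bring a 70B model within reach of a 4GB card.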

Researchers and independent developers faced a frustrating paradox: the models were open and free, but the hardware to run them wasn't. This created a new form of gatekeeping where only well-funded labs could experiment with state-of-the-art models. AirLLM approaches this problem from a radically different angle: instead of compressing the model to fit in GPU memory, it asks why the entire model needs to be in memory at all. By treating GPU VRAM as a cache and disk as the primary storage, it inverts the traditional inference architecture—accepting slower speeds in exchange for running on hardware you already own.

Technical Insight

AirLLM's core innovation is layer-wise model sharding combined with dynamic loading. Traditional inference loads the entire model into GPU memory before processing begins. AirLLM instead splits transformer models into individual layers—each containing attention mechanisms, feed-forward networks, and normalization—and stores them as separate files on disk. During inference, only the currently executing layer resides in GPU memory.

The architecture uses a sliding-window approach: while the GPU computes layer N, the system prefetches layer N+1 from disk, then swaps the layers in GPU memory. This creates a pipeline where disk I/O overlaps with computation, hiding some of the latency cost. Here's how you initialize and run inference:

from airllm import AutoModel

# First run splits model into layers (one-time cost)
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    compression='4bit',  # Optional: apply quantization
    profiling_mode=False,
    delete_original=True  # Remove source model after sharding
)

input_text = "Explain quantum entanglement in simple terms:"
# Move the input IDs to the GPU before generating (assumes a CUDA device)
input_tokens = model.tokenizer(input_text, return_tensors="pt").input_ids.cuda()

# Generation loads one layer at a time
output = model.generate(
    input_tokens,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7
)

print(model.tokenizer.decode(output[0]))

The first initialization triggers model decomposition: AirLLM downloads the model from Hugging Face, splits it layer by layer, and saves each layer to disk. This process can take 30-60 minutes for a 70B model and temporarily requires storage for both the original and sharded versions (passing delete_original=True removes the original afterward). The sharded structure looks like this in your cache directory:

~/.cache/huggingface/airllm/
├── model_layer_0.safetensors
├── model_layer_1.safetensors
├── model_layer_2.safetensors
...
├── model_layer_79.safetensors
└── config.json
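
To illustrate what "splitting layer by layer" means, here is a rough sketch that groups parameter names by decoder layer. It assumes Hugging Face Llama-style names such as model.layers.0.self_attn.q_proj.weight. The function group_by_layer and the shard labels are hypothetical and are not AirLLM's actual splitter or file naming.

```python
import re
from collections import defaultdict

# Matches Llama-style parameter names and captures the layer index
LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def group_by_layer(param_names):
    """Map each (hypothetical) shard label to the parameter names it would hold."""
    shards = defaultdict(list)
    for name in param_names:
        match = LAYER_RE.match(name)
        # Embeddings, final norm, and LM head sit outside the decoder stack
        key = f"layer_{match.group(1)}" if match else "non_layer"
        shards[key].append(name)
    return dict(shards)


names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]
shards = group_by_layer(names)
for shard, members in shards.items():
    print(shard, len(members))
```

In a real splitter each group's tensors would be written to its own file, so that a single layer can later be read without touching the rest of the checkpoint.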

Each layer file is typically 1-2GB for a 70B model without quantization (about 140GB of fp16 weights spread over 80 layers). During generation, AirLLM uses memory-mapped files to avoid loading entire layers into RAM, transferring only the necessary weights to GPU memory. The prefetching mechanism uses a background thread to load layer N+1 while the GPU processes layer N, providing roughly a 10% speedup over naive sequential loading.
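
The prefetching idea can be sketched with nothing but the standard library. This is a toy model, not AirLLM's code: load_layer stands in for a disk read, each "layer" is a function that adds its index, and both names are invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_layer(index):
    """Stand-in for reading one layer's weights from disk."""
    time.sleep(0.01)  # simulated I/O latency
    return lambda x: x + index  # toy "layer": adds its own index


def run_with_prefetch(num_layers, x):
    """Apply layers in order, loading layer i+1 while layer i computes."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            layer = pending.result()  # block until layer i is loaded
            if i + 1 < num_layers:
                # Start loading the next layer in the background
                pending = pool.submit(load_layer, i + 1)
            x = layer(x)  # compute overlaps with the background load
    return x


print(run_with_prefetch(4, 0))  # 0 + 0 + 1 + 2 + 3 = 6
```

At any moment the loop holds at most the current layer and the one being prefetched, which is the property that keeps peak memory close to one or two layers rather than the whole model.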

The optional compression parameter applies block-wise quantization (4-bit or 8-bit) to weights before sharding. Unlike activation quantization, which struggles with outliers, weight-only quantization is deterministic and reduces both disk storage and transfer time. A 4-bit quantized 70B model compresses to roughly 35GB on disk and loads approximately 3x faster during inference. The quantization happens during the initial sharding phase:

# With 4-bit quantization
model_4bit = AutoModel.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    compression='4bit'
)

# With 8-bit quantization  
model_8bit = AutoModel.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    compression='8bit'
)

# No quantization (slowest but highest quality)
model_fp16 = AutoModel.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    compression=None
)
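
To make "block-wise quantization" concrete, the following sketch applies a simplified absmax scheme: each block of weights gets its own scale, and values are rounded to integer codes in [-7, 7], which fit in 4 bits. It is an illustration under those assumptions, not the scheme AirLLM or its quantization backend uses; real libraries typically use more refined code books and pack two 4-bit codes per byte.

```python
def quantize_blockwise(weights, block_size=4, max_code=7):
    """Return a list of (scale, codes) pairs, one per block of weights."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        largest = max(abs(w) for w in block)
        # One scale per block maps the largest magnitude onto max_code
        scale = largest / max_code if largest > 0 else 1.0
        codes = [round(w / scale) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Rebuild approximate floats from (scale, codes) pairs."""
    return [code * scale for scale, codes in blocks for code in codes]


weights = [0.5, -1.0, 0.25, 0.0, 8.0, -2.0, 4.0, 1.0]
blocks = quantize_blockwise(weights)
restored = dequantize_blockwise(blocks)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"worst-case reconstruction error: {worst:.3f}")
```

Because the second block contains a much larger value (8.0), it gets a coarser scale than the first. Per-block scaling confines that loss of precision to the block with the outlier instead of spreading it across the whole tensor.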

AirLLM supports automatic architecture detection for Llama, Mistral, Qwen, ChatGLM, and Mixtral variants. For Mixture-of-Experts models like Mixtral, it shards expert layers independently, loading only activated experts per token. The system also works on Apple Silicon Macs, using Apple's MLX framework for GPU acceleration while maintaining the same disk-based layer swapping strategy.

The tradeoff is stark: where traditional in-memory 70B inference might generate 10-20 tokens per second on an A100, AirLLM on a 4GB GPU typically produces 0.5-2 tokens per second, depending on disk speed and quantization settings. You're essentially converting a memory bottleneck into a disk I/O bottleneck, which is feasible because NVMe SSD sequential reads (2-7 GB/s) can keep a GPU fed at reduced throughput.
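
A rough way to sanity-check throughput figures like these is to ask how long it takes to stream the whole model from disk once. The sketch below assumes every weight is read from the SSD exactly once per forward pass, with no caching in RAM and no compute time; the function name seconds_per_full_pass is hypothetical, and the sizes are the fp16 and 4-bit figures for a 70B model.

```python
def seconds_per_full_pass(model_size_gb: float, read_gb_per_s: float) -> float:
    """Time to stream all weights from disk once, ignoring compute and caching."""
    return model_size_gb / read_gb_per_s


for label, size_gb in (("fp16", 140), ("4-bit", 35)):
    for speed in (2, 7):
        t = seconds_per_full_pass(size_gb, speed)
        print(f"{label} at {speed} GB/s: {t:.1f} s per pass")
```

Under these assumptions a 4-bit model on a 7 GB/s drive needs about 5 seconds per pass. By that estimate, rates in the range quoted above would need a sizable share of the weights to be served from RAM (for example the operating system's page cache) rather than re-read from the SSD for every token.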

Gotcha

The elephant in the room is speed—or rather, the lack of it. AirLLM's disk-swapping approach makes inference painfully slow compared to any traditional method. Generating a 500-token response might take 5-10 minutes instead of 30 seconds. This isn't just an inconvenience; it fundamentally changes how you can use the model. Interactive chat applications become frustrating. Batch processing becomes overnight jobs. Real-time applications are completely off the table. The architecture is brilliant for making models accessible, but that accessibility comes with a temporal cost that makes most production use cases impractical.

Storage requirements also catch users off guard. The initial model decomposition needs space for both the original HuggingFace model and the sharded output simultaneously—potentially 150GB+ for a 70B model before you can clean up the original. Subsequent runs only need the sharded version, but you still need substantial SSD space. Using a slower HDD instead of SSD makes inference even more glacial, sometimes dropping below 0.1 tokens per second. The system also lacks advanced features like continuous batching, speculative decoding, or KV cache optimization that production inference engines provide. You're getting raw model access, nothing more.
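
Before starting a long download and split, it can help to check free disk space first. This is a minimal preflight sketch using only the standard library; has_room_for_sharding is a made-up helper, the 140GB and 35GB inputs are the fp16 and 4-bit estimates for a 70B model, and the 10GB headroom is an arbitrary assumption.

```python
import shutil


def has_room_for_sharding(path, original_gb, sharded_gb, headroom_gb=10):
    """Return (ok, free_gb): whether `path` can hold both copies at once."""
    free_gb = shutil.disk_usage(path).free / 1e9
    needed_gb = original_gb + sharded_gb + headroom_gb
    return free_gb >= needed_gb, free_gb


ok, free_gb = has_room_for_sharding(".", original_gb=140, sharded_gb=35)
print(f"free: {free_gb:.0f} GB, enough for sharding: {ok}")
```

Point the path at the drive that holds your model cache, since that is where both copies will live during the split.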

Verdict

Use if: You're a researcher, hobbyist, or student who needs to experiment with large models but only has consumer-grade hardware (4-8GB GPU). Perfect for prototyping, testing prompts, fine-tuning validation, educational exploration, or generating small amounts of text where you can walk away and come back. The ability to run a 70B model at all, even slowly, beats not running it. Also valuable if you're doing occasional model evaluation where overnight batch processing is acceptable and you can't justify cloud GPU costs.

Skip if: You need any form of production deployment, real-time inference, or interactive applications. If you have access to proper GPU infrastructure (24GB+ VRAM), use standard inference engines like vLLM or TGI instead; they'll be 20-50x faster. Also skip if you're doing heavy experimentation requiring hundreds of generations; cloud GPU spot instances become more cost-effective than waiting hours for local results. AirLLM is a clever hack for accessibility, not a performance solution.