
Parameter Golf: Training Language Models Under Extreme Memory Constraints

Hook

What if the constraint wasn’t how fast you could train a model, or how little data you could use, but how few bytes of memory you could possibly consume? OpenAI is betting $1M in compute credits that this seemingly arbitrary limit will unlock fundamentally new approaches to language model design.

Context

The deep learning community has spent the past decade in an arms race of scale. GPT-3 brought 175 billion parameters. GPT-4 allegedly uses over a trillion. Even “small” models like Llama-7B dwarf what most developers can run locally. This trajectory has created a monoculture: bigger is better, and architectural innovation takes a backseat to simply adding more layers and parameters.

Parameter Golf inverts this paradigm entirely. Instead of optimizing for speed (like the NanoGPT Speedrun) or data efficiency, it asks a deceptively simple question: what’s the best language model you can train that fits in exactly 16 megabytes? That’s roughly the size of a single high-resolution photo, or about 4 million float32 parameters. The challenge provides a baseline transformer implementation, a standardized evaluation pipeline measuring compression performance on FineWeb validation data, and strict rules: your model must train in under 10 minutes on 8xH100 GPUs. OpenAI is providing compute grants and using this as both a research initiative and recruiting mechanism—a technical olympiad where the prize is advancing our understanding of neural scaling laws at the extreme low end of the parameter spectrum.
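The arithmetic behind that budget is worth making explicit. A back-of-envelope sketch (plain Python, nothing challenge-specific), assuming the 16MB cap is measured in raw bytes and weights are stored as float32:

```python
# 16 MiB of raw bytes, 4 bytes per float32 weight
budget_bytes = 16 * 1024 * 1024
params_float32 = budget_bytes // 4
print(f"{params_float32:,} float32 parameters")  # 4,194,304
```

That 4,194,304 figure is the "roughly 4 million" parameter ceiling every design decision has to fit under.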

Technical Insight

System architecture (auto-generated diagram, described in text): raw text from the FineWeb Dataset is encoded by a SentencePiece Tokenizer (1024 vocab) into token sequences, which the Data Loader batches for a decoder-only transformer of configurable depth/width. The model comprises Token Embeddings (vocab_size × d_model), N transformer layers (attention + FFN), and an Output Projection (d_model × vocab_size). Its predictions feed a Bits-per-Byte Evaluator, whose compression score passes through a Validation Gate enforcing the 16MB size and 10-minute training limits; a submission either passes or is rejected.

The repository structure reveals the core architectural decisions. The baseline model is a decoder-only transformer with configurable depth, width, and vocabulary size. What makes this interesting isn’t the architecture itself—it’s what you’re forced to optimize when size becomes the binding constraint.

The evaluation metric is bits-per-byte rather than traditional loss, which has profound implications. By measuring raw compression independent of tokenization, participants can treat vocabulary size as an optimization variable. The baseline uses a 1024-token SentencePiece vocabulary, but nothing stops you from using 256 tokens or 4096 tokens—each choice trades off embedding table size against sequence length and expressiveness:

# From the baseline config
from dataclasses import dataclass

@dataclass
class ModelConfig:
    vocab_size: int = 1024  # Embedding table: vocab_size * d_model params
    d_model: int = 256      # Model width
    n_layers: int = 6       # Depth
    n_heads: int = 4        # Attention heads

    def count_params(self):
        # Embedding table
        params = self.vocab_size * self.d_model
        # Each transformer layer: attention + FFN
        per_layer = (
            4 * self.d_model ** 2 +  # Q, K, V, O projections
            8 * self.d_model ** 2    # FFN (assuming 4x expansion)
        )
        params += per_layer * self.n_layers
        # Output projection
        params += self.d_model * self.vocab_size
        return params
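Plugging the formula above into a standalone helper makes the vocabulary trade-off concrete. Note this mirrors the arithmetic in count_params and omits LayerNorm and bias terms, so the repo's real totals may differ slightly:

```python
def params(vocab, d_model=256, n_layers=6):
    # Same formula as count_params: embeddings + per-layer attention/FFN + output head
    per_layer = 4 * d_model ** 2 + 8 * d_model ** 2
    return vocab * d_model + n_layers * per_layer + d_model * vocab

for vocab in (256, 1024, 4096):
    print(f"vocab={vocab:5d}: {params(vocab):,} params")
# vocab=  256: 4,849,664 params
# vocab= 1024: 5,242,880 params
# vocab= 4096: 6,815,744 params
```

Quadrupling the vocabulary from 1024 to 4096 adds roughly 1.6 million parameters to the embedding and output tables alone, nearly 40% of the entire 4 million parameter budget.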

With 16MB (about 4 million float32 parameters), every architectural choice becomes a zero-sum game. Want deeper models? Reduce width. Need more attention heads? Shrink the FFN. The baseline implements parameter tying between input and output embeddings, immediately saving vocab_size * d_model parameters (262,144, roughly 5% of the baseline's total by the count above, when vocabulary is 1024 and d_model is 256).
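Tying itself is typically a one-line aliasing of the two matrices. A minimal PyTorch-style sketch, using a hypothetical module for illustration rather than the repo's actual code:

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Toy decoder head demonstrating input/output embedding tying."""
    def __init__(self, vocab_size=1024, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie the output projection to the embedding matrix,
        # saving vocab_size * d_model parameters outright.
        self.lm_head.weight = self.embed.weight

    def forward(self, hidden):
        # hidden: (batch, seq, d_model) -> logits over the vocabulary
        return self.lm_head(hidden)
```

Because PyTorch deduplicates shared tensors, the tied model reports a single vocab_size × d_model table instead of two.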

The MLX implementation for Apple Silicon makes fast local iteration practical. Most participants won’t have 8xH100s sitting around, so the repo includes a Metal-optimized version that runs on M1/M2 Macs, letting you prototype architectures locally before submitting official runs:

# MLX-specific optimization
import mlx.core as mx
import mlx.nn as nn

class TinyAttention(nn.Module):
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.scale = (dim // n_heads) ** -0.5
        # Fused QKV projection: one matmul instead of three
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def __call__(self, x):
        B, L, D = x.shape
        # (B, L, 3*D) -> (B, L, 3, H, hd) -> (3, B, H, L, hd)
        qkv = self.qkv(x).reshape(B, L, 3, self.n_heads, -1).transpose(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # Scaled dot-product attention; MLX runs this efficiently on Metal
        attn = mx.softmax((q @ k.transpose(0, 1, 3, 2)) * self.scale, axis=-1)
        out = (attn @ v).transpose(0, 2, 1, 3).reshape(B, L, D)
        return self.proj(out)

The data pipeline uses FineWeb, a massive web corpus, but samples only a tiny fraction for evaluation. This democratizes the challenge—you’re not competing on who has the best data cleaning pipeline, but purely on architectural efficiency. The validation script measures bits-per-byte by computing the negative log likelihood and converting to information-theoretic units, which naturally accounts for different tokenization schemes.
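That conversion is standard information theory: sum the model's negative log-likelihood over the text in nats, convert to bits, and divide by the raw UTF-8 byte count. A sketch of such a helper (my own naming, not the repo's evaluator):

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed NLL (natural log) over a text to bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)

# Example: a model assigning each of 500 tokens probability ~e^-2,
# evaluated over a 1,000-byte text:
score = bits_per_byte(total_nll_nats=500 * 2.0, n_bytes=1000)  # ~1.44 bpb
```

Because the denominator is bytes of raw text rather than tokens, a model with a larger vocabulary gets no free credit for shorter token sequences; the metric is tokenizer-agnostic by construction.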

What’s fascinating is what this constraint encourages: test-time compute techniques (like speculative sampling or chain-of-thought built into the architecture), extreme parameter sharing (weight tying across layers), novel attention patterns (linear attention to save parameters), and even exploring whether transformers are the right architecture at all at this scale. You could reasonably submit an LSTM, state-space model, or completely novel architecture if it compresses better.
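As one concrete instance of extreme parameter sharing, cross-layer weight tying (ALBERT-style) reuses a single block for every "layer" of depth, so stacking deeper no longer multiplies parameter count. A toy sketch with the block left abstract:

```python
import torch
import torch.nn as nn

class SharedDepthModel(nn.Module):
    """Apply one block n_passes times instead of stacking n distinct layers."""
    def __init__(self, block: nn.Module, n_passes: int = 6):
        super().__init__()
        self.block = block        # one set of weights...
        self.n_passes = n_passes  # ...reused for effective depth

    def forward(self, x):
        for _ in range(self.n_passes):
            x = self.block(x)
        return x
```

Six passes through one transformer layer cost the parameters of a single layer, freeing the budget for a wider model or a larger vocabulary, at the price of less expressive depth.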

Gotcha

The most obvious limitation is accessibility. Despite OpenAI offering compute grants, the official evaluation requires 8xH100 GPUs—roughly $20-30 per hour on cloud providers. The 10-minute training window means you need both the hardware allocation and enough preliminary experimentation to know what architecture to train. This creates a two-tier system: those with existing research compute budgets can iterate rapidly, while individual developers are bottlenecked by grant approval processes and unfamiliarity with multi-GPU training.

More fundamentally, optimizing for bits-per-byte on web text may not correlate with what makes a language model useful. A model that achieves 1.2 bpb might be worse at following instructions, reasoning, or staying coherent than one scoring 1.3 bpb. The challenge optimizes for a proxy metric—compression—rather than downstream task performance. As of the repository’s current state, there’s only one baseline submission on the leaderboard, suggesting either very early-stage adoption or that the barrier to meaningful participation is higher than anticipated. The compressed model size also makes knowledge distillation approaches difficult—you can’t easily transfer knowledge from a larger teacher model without fundamentally changing the game. This is training from scratch in hard mode, which is scientifically interesting but practically limiting.

Verdict

Use if: you’re fascinated by architectural efficiency and want to explore the boundaries of what’s possible with minimal parameters; you have access to multi-GPU compute (through the grants or institutional resources); you’re interested in extreme compression techniques that might inform future edge deployment; or you’re genuinely interested in working at OpenAI and want a technical portfolio piece that demonstrates creative problem-solving under constraints.

Skip if: you need models for production use cases where task performance matters more than parameter count; you lack the compute infrastructure and don’t want to navigate grant processes; you’re more interested in applied ML than research-oriented optimization challenges; or you prefer working on problems with established best practices rather than largely unexplored territory.

This is a research competition first and a practical tool second: treat it as an intellectual exercise that might yield interesting insights rather than a path to deployable models.
