makemore: Understanding Language Models by Implementing Them Seven Different Ways
Hook
The simplest language model in makemore isn't a neural network at all—it's a 27x27 matrix of character counts normalized into probabilities. Yet it can generate plausible baby names, and understanding why reveals what every LLM is actually doing.
Context
Most developers encounter language models through high-level APIs: OpenAI's completions endpoint, HuggingFace's pipeline abstraction, or LangChain's prompt templates. These tools hide the fundamental mechanism behind every autoregressive model, from GPT-4 to your phone's autocorrect: predicting the next token based on previous context, sampling from that distribution, then feeding the result back as input.
makemore exists because reading papers doesn't build intuition the way writing code does. Andrej Karpathy created this repository as a companion to his Neural Networks: Zero to Hero lecture series, implementing a progression from trivial statistical models to Transformers. Each of the seven models—bigram, MLP, BatchNorm MLP, RNN, GRU, LSTM, and Transformer—is a self-contained lesson in autoregressive modeling, trained on the same tiny dataset of 32,000 baby names. The repository isn't trying to achieve state-of-the-art performance or production readiness. It's optimizing for clarity: showing you exactly what changes architecturally as you move from counting character pairs to computing self-attention across a learned embedding space.
Technical Insight
The progression starts with bigram.py, which demonstrates that language modeling predates neural networks entirely. The model is a literal count matrix: for every character pair in the training data, increment counts[char1][char2]. Normalize each row into probabilities, and you have a language model:
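# Not a neural network - just counting
import torch

# Assumes `names` is a list of lowercase strings, one per name in the dataset
alphabet = sorted(set(''.join(names)))
stoi = {ch: i + 1 for i, ch in enumerate(alphabet)}
stoi['.'] = 0  # special start/end token
itos = {i: ch for ch, i in stoi.items()}

# Build 27x27 matrix (26 letters + special token)
N = torch.zeros((27, 27), dtype=torch.int32)
for name in names:
    chars = ['.'] + list(name) + ['.']
    for ch1, ch2 in zip(chars, chars[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1

# Convert counts to probabilities
P = (N+1).float()  # +1 for smoothing
P /= P.sum(1, keepdim=True)  # each row now sums to 1

# Sample: pick next char based on current char's row
for _ in range(5):
    out = []
    ix = 0  # start at the special token
    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1).item()
        if ix == 0: break  # special token ends the name
        out.append(itos[ix])
    print(''.join(out))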
This produces outputs like "mor", "axx", and "minaymoryles": nonsensical but structurally name-like. The model is scored with average negative log-likelihood, which you can calculate by looking up the probability the matrix assigns to each actual next character, taking its log, and averaging the negated values. This model has exactly 729 parameters (27×27), no backpropagation, and trains in milliseconds.
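To make the negative log-likelihood calculation concrete, here is a minimal pure-Python sketch that fits an add-one-smoothed bigram table and scores a list of names against it. The function name `bigram_nll` and the four-name toy list are illustrative assumptions, not code or data from the repository.

```python
import math
from collections import Counter

def bigram_nll(train_names, eval_names, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Average negative log-likelihood (nats per character) of eval_names
    under an add-one-smoothed bigram model fitted on train_names."""
    vocab = ["."] + list(alphabet)  # '.' marks both the start and end of a name
    counts = Counter()
    for name in train_names:
        chars = ["."] + list(name) + ["."]
        counts.update(zip(chars, chars[1:]))

    def prob(ch1, ch2):
        # Smoothed probability of ch2 following ch1
        row_total = sum(counts[(ch1, c)] + 1 for c in vocab)
        return (counts[(ch1, ch2)] + 1) / row_total

    log_likelihood, n = 0.0, 0
    for name in eval_names:
        chars = ["."] + list(name) + ["."]
        for ch1, ch2 in zip(chars, chars[1:]):
            log_likelihood += math.log(prob(ch1, ch2))
            n += 1
    return -log_likelihood / n

toy_names = ["emma", "olivia", "ava", "mia"]  # hypothetical stand-in for the dataset
print(f"{bigram_nll(toy_names, toy_names):.3f} nats per character")
```

With no training data every transition gets probability 1/27, so the loss equals ln(27); any counts that match the evaluation data pull it below that baseline.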
The MLP implementation (mlp.py) replaces the count matrix with Bengio's 2003 neural probabilistic language model. Instead of looking at one previous character, it concatenates embeddings for the last three characters and feeds them through a neural network:
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, vocab_size, emb_dim=10, hidden_dim=200, context_length=3):
        super().__init__()
        self.C = nn.Embedding(vocab_size, emb_dim)  # one learned vector per character
        self.layers = nn.Sequential(
            nn.Linear(context_length * emb_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, vocab_size)
        )

    def forward(self, x):
        # x: (batch, context_length)
        emb = self.C(x)  # (batch, context_length, emb_dim)
        emb = emb.view(emb.shape[0], -1)  # flatten to (batch, context_length * emb_dim)
        logits = self.layers(emb)  # (batch, vocab_size)
        return logits
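The model above expects each input row to hold the indices of the previous three characters. One common way to build such rows is a sliding window over each name, padded with the special token. The sketch below shows that approach in plain Python; `build_dataset` is a hypothetical helper, not a function taken from the repository.

```python
def build_dataset(names, stoi, context_length=3):
    """Return (X, Y): each X row holds context_length character indices,
    and the matching Y entry is the index of the character that follows."""
    X, Y = [], []
    for name in names:
        context = [0] * context_length  # pad with the '.' token (index 0)
        for ch in name + ".":
            ix = stoi[ch]
            X.append(list(context))
            Y.append(ix)
            context = context[1:] + [ix]  # slide the window one step
    return X, Y

stoi = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
stoi["."] = 0
X, Y = build_dataset(["emma"], stoi)
for x, y in zip(X, Y):
    print(x, "->", y)
```

A single five-step name yields five training examples, which is why a dataset of short names still produces a few hundred thousand rows.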
The training loop is deliberately imperative—no PyTorch Lightning abstractions:
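import torch.nn.functional as F

# Assumes model is the MLP above and X, Y are tensors of contexts and targets
for i in range(100000):
    # Sample a minibatch of 32 random examples
    ix = torch.randint(0, X.shape[0], (32,))
    # Forward pass
    logits = model(X[ix])
    loss = F.cross_entropy(logits, Y[ix])
    # Backward pass
    model.zero_grad()
    loss.backward()
    # Update: plain gradient step with learning rate 0.1
    for p in model.parameters():
        p.data += -0.1 * p.grad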
No optimizer objects, no lr_scheduler, just naked gradient descent where you can watch every parameter update. This MLP achieves ~2.4 nats loss compared to the bigram's ~2.5, a modest improvement that demonstrates the value of learned representations and nonlinear transformations.
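Loss values in nats are easier to compare once converted to perplexity, the effective number of equally likely choices the model is picking from at each step. The sketch below uses only the approximate losses quoted above plus the uniform baseline for a 27-symbol vocabulary; the `perplexity` helper is an illustrative name.

```python
import math

def perplexity(loss_nats):
    """Convert an average cross-entropy loss in nats into perplexity."""
    return math.exp(loss_nats)

vocab_size = 27
uniform_loss = math.log(vocab_size)  # loss of a model that guesses uniformly

for label, loss in [("uniform", uniform_loss), ("bigram", 2.5), ("MLP", 2.4)]:
    print(f"{label:8s} loss={loss:.2f} nats  perplexity={perplexity(loss):.1f}")
```

Read this way, a drop of 0.1 nats means the model is choosing among roughly one fewer plausible character per step.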
The Transformer implementation (transformer.py) shows the full GPT architecture in ~200 lines. The key innovation is replacing the MLP's fixed treatment of context, where each position's embedding is concatenated and multiplied by its own fixed weights, with self-attention, which dynamically weights the context based on content:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # qkv in one
        self.c_proj = nn.Linear(n_embd, n_embd)
        self.n_head = n_head
        self.n_embd = n_embd
        # Causal mask: can only attend to past
        self.register_buffer("bias", torch.tril(
            torch.ones(block_size, block_size)
        ).view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.size()  # batch, sequence length, embedding size
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape each to (B, n_head, T, head_size)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Scaled dot-product scores: (B, n_head, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v  # (B, n_head, T, head_size)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-merge heads
        return self.c_proj(y)
The causal mask ensures position i can only attend to positions ≤i, maintaining the autoregressive property. The repository doesn't hide this behind attention libraries: you see the matrix multiplications, the scaling factor, the softmax, the masking. This Transformer achieves ~2.0 nats loss, suggesting that content-dependent attention weights outperform the MLP's fixed per-position weights on this dataset.
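The effect of the mask is easiest to see on a tiny example. The sketch below applies the same two steps, masking future positions to negative infinity and then taking a row-wise softmax, to a 4×4 score matrix in plain Python. The function name and the all-zero scores are illustrative assumptions.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix, with every entry
    above the diagonal (a future position) masked to -inf first."""
    T = len(scores)
    weights = []
    for i in range(T):
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        row_max = max(masked)  # finite, because position i always sees itself
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # hypothetical equal scores for 4 positions
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

With equal scores, row i spreads its weight evenly over positions 0 through i and assigns exactly zero to everything after it, so no information flows backward from the future.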
Critically, all seven models share the same sampling loop, making the architectural differences isolated to the forward pass. The generation code manually implements the autoregressive loop:
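# Assumes model(context) returns logits of shape (1, block_size, vocab_size)
for _ in range(num_samples):
    context = torch.zeros((1, block_size), dtype=torch.long)  # all start tokens
    out = []
    for i in range(max_length):
        logits = model(context)
        probs = F.softmax(logits[:, -1, :], dim=-1)  # next-character distribution
        next_char = torch.multinomial(probs, num_samples=1)  # shape (1, 1)
        if next_char.item() == 0:  # special token ends the name
            break
        out.append(itos[next_char.item()])
        context = torch.cat([context[:, 1:], next_char], dim=1)  # slide the window
    print(''.join(out))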
No .generate() method, no beam search, no temperature scheduling—just the raw loop that every autoregressive model executes under the hood.
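The reason one loop can serve every architecture is that it only needs a function from context to next-character probabilities. The sketch below shows that contract in plain Python, with a hypothetical `toy_model` standing in for a trained network; none of these names come from the repository.

```python
import random

def sample_name(next_char_probs, itos, block_size=3, max_length=20, rng=random):
    """Generate one name. next_char_probs is a stand-in for any model:
    it maps a context (list of indices) to next-index probabilities."""
    context = [0] * block_size  # start with all special tokens
    out = []
    for _ in range(max_length):
        probs = next_char_probs(context)
        ix = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if ix == 0:  # special token ends the name
            break
        out.append(itos[ix])
        context = context[1:] + [ix]  # feed the sample back in as input
    return "".join(out)

itos = {0: ".", 1: "a", 2: "b"}

def toy_model(context):
    # Hypothetical rule: start with 'a', follow 'a' with 'b', stop after 'b'
    last = context[-1]
    return {0: [0.0, 1.0, 0.0], 1: [0.0, 0.0, 1.0], 2: [1.0, 0.0, 0.0]}[last]

print(sample_name(toy_model, itos))
```

Swapping `toy_model` for a bigram lookup, an MLP, or a Transformer changes nothing in `sample_name`, which is the point the shared loop makes.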
Gotcha
The repository optimizes for pedagogical clarity at the expense of everything else. There's no batching strategy beyond manually slicing tensors, so training on anything much larger than the 32K-name dataset is likely to be slow and memory-hungry. The Transformer implementation also lacks KV-caching: every generation step recomputes attention over the whole context, which costs O(n²) per new token instead of the O(n) a cache would allow. That is fine for 10-character baby names and unusable for 1000-token documents.
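To put rough numbers on the missing KV cache, the sketch below counts query-key dot products per head and per layer when generating one token at a time. It assumes the context grows by one token each step and ignores the block_size cap, so treat it as back-of-envelope arithmetic rather than a measurement of the repository's code.

```python
def attention_dot_products(n_tokens, kv_cache):
    """Count query-key dot products needed to generate n_tokens one at a
    time, assuming the context grows by one token at each step."""
    total = 0
    for t in range(1, n_tokens + 1):
        if kv_cache:
            total += t      # only the newest query attends to t cached keys
        else:
            total += t * t  # all t queries are recomputed against all t keys
    return total

for n in (10, 1000):
    without = attention_dot_products(n, kv_cache=False)
    cached = attention_dot_products(n, kv_cache=True)
    print(f"n={n}: without cache={without}, with cache={cached}")
```

At 10 tokens the gap is 385 versus 55 dot products, which nobody will notice. At 1000 tokens it is about 334 million versus about 500 thousand.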
The evaluation story is almost nonexistent. You get loss curves and generated samples, but no diversity metrics, no perplexity breakdowns by character position, no systematic comparison of sample quality across models. This makes it hard to develop intuition about which architectural changes actually improve results: you're eyeballing whether "dariela" looks more name-like than "mxaahjl" rather than quantifying generation quality.
The code also skips every modern training practice: no learning rate warmup, no gradient clipping, no weight decay, no dropout. Models will happily overfit by memorizing the training set, and the code won't warn you or show you how to diagnose it. This is deliberate, since adding these would obscure the core lesson, but it means you can't use makemore as a template for real projects without rewriting the training infrastructure entirely.
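As one example of the kind of check that is missing, the sketch below computes two crude statistics over a batch of generated names: the fraction that are unique and the fraction that do not appear in the training set. The function `sample_stats` and both lists are hypothetical. A low novel fraction would be one hint that a model is memorizing rather than generalizing.

```python
def sample_stats(samples, training_names):
    """Return (unique_fraction, novel_fraction) for a list of generated names."""
    if not samples:
        return 0.0, 0.0
    train = set(training_names)
    unique = set(samples)
    novel = [s for s in samples if s not in train]
    return len(unique) / len(samples), len(novel) / len(samples)

training_names = ["emma", "olivia", "ava"]        # hypothetical training data
samples = ["emma", "emma", "dariela", "mxaahjl"]  # hypothetical model output
unique_frac, novel_frac = sample_stats(samples, training_names)
print(f"unique={unique_frac:.2f} novel={novel_frac:.2f}")
```

Neither number says whether a novel sample is any good, so this complements held-out loss rather than replacing it.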
Verdict
Use if: You're an ML engineer or data scientist who's implemented scikit-learn models but never built a neural network training loop from scratch, you're trying to understand what HuggingFace's transformers library is abstracting away when you call .forward(), you're reading the Attention Is All You Need paper and want executable reference code for the equations, or you learn better by modifying working code than reading documentation. makemore is the best educational implementation ladder for autoregressive language modeling: the progression from counts to Transformers builds intuition that no amount of blog posts can replicate.
Skip if: You need a production text generation system (use transformers or llama.cpp), you're working with datasets larger than 100MB (no data pipeline or memory optimization), you want to experiment with architectural variants without editing monolithic scripts (nanoGPT has better modularity), or you're already comfortable implementing attention mechanisms and just need a fast reference (the Annotated Transformer is more concise).
This is a teaching tool, not a library. Its value is showing you what libraries hide.