Autoresearch: When AI Agents Write Their Own Training Code
Hook
What if the output of your ML research wasn't a trained model, but a system that writes its own training code? Andrej Karpathy's autoresearch flips the entire paradigm: you write markdown, AI writes Python.
Context
Machine learning research has always followed the same pattern: researchers read papers, form hypotheses, write training code, run experiments, analyze results, and repeat. Automation has crept in gradually (hyperparameter tuning libraries like Optuna, experiment tracking with Weights & Biases, neural architecture search), but the fundamental loop remains human-driven. Researchers still write the Python code that defines model architectures, loss functions, and optimization schedules.
Autoresearch represents a fundamentally different approach. Instead of automating optimization within fixed code boundaries, it allows AI agents to modify the training implementation itself. Born from Karpathy's nanochat project (a minimal, hackable codebase for training a ChatGPT-style model), autoresearch demonstrates what happens when you give an AI agent permission to rewrite train.py based on validation performance. The human's role shifts from writing PyTorch to writing program.md, a markdown file containing research directions and constraints. This isn't just a productivity tool; it's an exploration of what ML research could look like when the description of how research should be run becomes the primary artifact humans maintain.
Technical Insight
The architecture of autoresearch is deceptively simple, which is precisely the point. The system has three layers that interact in a continuous loop. The preparation layer (prepare.py) handles data loading, tokenization, and evaluation metrics—this code is immutable and provides a stable foundation. The mutable layer is train.py, containing the complete GPT model implementation and training loop that agents can modify freely. The instruction layer is program.md, where humans specify research objectives, constraints, and domain knowledge in natural language.
Here's how the agent loop works. In each iteration, an LLM-based coding agent reads the current train.py, reviews recent experimental results, and proposes modifications. It then writes the modified code, and the system executes a training run with a hard 5-minute time limit. Evaluation uses bits-per-byte (BPB) on the validation set, where lower is better. The metric is deliberately independent of vocabulary size, so runs stay comparable even when the agent changes the model architecture fundamentally. If validation BPB improves, the changes are kept; otherwise, they're discarded and the agent tries something else.
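The keep-or-discard cycle described above can be sketched as a greedy search. This is a minimal illustration under stated assumptions, not autoresearch's actual code: `research_loop`, `ExperimentLog`, `propose`, and `run_and_score` are hypothetical names, and the toy "agent" below just nudges a number instead of calling an LLM or training a model.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ExperimentLog:
    """Hypothetical record of every run: (run id, score, whether it was kept)."""
    entries: List[Tuple[int, float, bool]] = field(default_factory=list)


def research_loop(
    initial_code: str,
    propose: Callable[[str, ExperimentLog], str],  # stand-in for the LLM agent
    run_and_score: Callable[[str], float],         # stand-in for train + evaluate
    iterations: int,
) -> Tuple[str, float, ExperimentLog]:
    """Greedy keep-or-discard search over versions of a training script."""
    log = ExperimentLog()
    best_code = initial_code
    best_bpb = run_and_score(best_code)
    for run_id in range(1, iterations + 1):
        candidate = propose(best_code, log)
        bpb = run_and_score(candidate)
        kept = bpb < best_bpb  # lower bits-per-byte is better
        if kept:
            best_code, best_bpb = candidate, bpb
        log.entries.append((run_id, bpb, kept))
    return best_code, best_bpb, log


# Toy demonstration: the "code" is a string holding a number and the
# score is that number, so the control flow is easy to follow.
def toy_propose(code: str, log: ExperimentLog) -> str:
    return str(float(code) + random.uniform(-0.05, 0.05))


def toy_score(code: str) -> float:
    return float(code)


random.seed(0)
final_code, final_bpb, history = research_loop("1.30", toy_propose, toy_score, 20)
print(f"best score after {len(history.entries)} runs: {final_bpb:.3f}")
```

Because a candidate is only kept when it beats the current best, the best score never gets worse from one iteration to the next; discarded candidates still land in the log, where they can inform the next proposal.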
The fixed time budget is a clever constraint. Instead of "train for 10,000 iterations," experiments run for exactly 5 minutes of wall-clock time. If an agent implements a more efficient architecture or a faster optimizer, it automatically gets more training iterations in the same window. The metric being optimized is literally "how good can you get in 5 minutes," which naturally rewards efficiency improvements alongside accuracy gains.
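To make the wall-clock idea concrete, here's a minimal sketch of a loop that stops on a deadline rather than a step count. It is not autoresearch's training loop: `train_for_budget`, `slow_step`, and `fast_step` are hypothetical names, the sleeps stand in for real optimizer steps, and the budget is shortened to a fraction of a second so the example finishes quickly.

```python
import time
from typing import Callable


def train_for_budget(step_fn: Callable[[int], None], budget_seconds: float) -> int:
    """Run step_fn repeatedly until the wall-clock budget is spent.

    Returns the number of completed steps. A step that starts before the
    deadline is allowed to finish, so a run can overshoot slightly.
    """
    deadline = time.monotonic() + budget_seconds
    steps = 0
    while time.monotonic() < deadline:
        step_fn(steps)
        steps += 1
    return steps


# Two hypothetical "training steps" with different per-step costs.
def slow_step(_: int) -> None:
    time.sleep(0.020)


def fast_step(_: int) -> None:
    time.sleep(0.002)


BUDGET = 0.3  # seconds for the demo; the real budget would be 5 * 60
slow_steps = train_for_budget(slow_step, BUDGET)
fast_steps = train_for_budget(fast_step, BUDGET)
print(f"slow: {slow_steps} steps, fast: {fast_steps} steps in {BUDGET}s")
```

The cheaper step completes more iterations inside the same budget, which is the mechanism by which a fixed time window rewards efficiency.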
Here's a simplified example of what program.md might look like:
# Research Objective
Improve validation bits-per-byte on TinyStories dataset within 5-minute training budget.
# Constraints
- Single GPU (CUDA required)
- Modify only train.py
- Do not change prepare.py or evaluation logic
- All changes must complete training within 5 minutes
# Current Focus
Explore whether rotary positional embeddings improve sample efficiency
compared to learned absolute positions in this tiny-scale regime.
# Knowledge Base
Previous experiments showed:
- Batch size 64 performs better than 128 (run #43)
- Learning rate 3e-4 with cosine decay is current best (run #51)
- Model size is currently ~10M parameters
The agent reads this context, examines the current train.py implementation, and might generate a modification like this:
# Agent-modified section of train.py
import torch
import torch.nn as nn


class RotaryEmbedding(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Inverse frequencies, one per pair of channels in a head
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)

    def forward(self, seq_len, device):
        t = torch.arange(seq_len, device=device).type_as(self.inv_freq)
        freqs = torch.einsum('i,j->ij', t, self.inv_freq)  # (seq_len, dim // 2)
        emb = torch.cat((freqs, freqs), dim=-1)  # (seq_len, dim)
        return emb.cos(), emb.sin()


class GPTAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # One rotary table sized to the per-head dimension
        self.rotary = RotaryEmbedding(config.n_embd // config.n_head)
        # ... rest of attention implementation modified to use rotary
This gets executed, trained for 5 minutes, evaluated, and the result is appended to the experimental log. The agent sees "Run #64: 1.23 BPB (previous best: 1.28 BPB)" and keeps the change, or sees worse performance and reverts.
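For readers wondering where a BPB figure comes from, here's a minimal sketch using the standard definition: total cross-entropy converted from nats to bits, divided by the number of UTF-8 bytes in the evaluated text. This is an assumption about the general formula, not a copy of what prepare.py does, and `bits_per_byte` plus both toy "tokenizers" are hypothetical.

```python
import math
from typing import Sequence


def bits_per_byte(token_nll_nats: Sequence[float], text: str) -> float:
    """Convert per-token negative log-likelihoods (nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / n_bytes


text = "once upon a time"  # 16 bytes

# Hypothetical tokenizer A: 4 tokens, each predicted with probability 1/16.
coarse = bits_per_byte([math.log(16)] * 4, text)

# Hypothetical tokenizer B: 16 single-byte tokens, each predicted with probability 1/2.
fine = bits_per_byte([math.log(2)] * 16, text)

print(f"coarse: {coarse:.3f} bpb, fine: {fine:.3f} bpb")
```

The two setups have very different per-token losses (about 2.77 nats versus about 0.69 nats), yet both work out to roughly 1 bit per byte. Normalizing by bytes instead of tokens is what lets runs with different vocabularies be compared on the same scale.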
The single-file constraint is crucial for keeping this manageable. Allowing agents to create new files, restructure the codebase, or modify the evaluation harness would create an unbounded search space. By constraining modifications to train.py only, the system remains debuggable and human-reviewable. You can diff any experimental run against main and see exactly what changed.
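Since every experiment is just a different version of one file, reviewing a run can be as simple as a text diff. Here's a small sketch using Python's standard `difflib`; the two script snippets and the `diff_training_script` helper are made up for illustration, and in practice you would more likely run `git diff` on the repository.

```python
import difflib


def diff_training_script(baseline: str, candidate: str) -> str:
    """Return a unified diff between two versions of a training script."""
    lines = difflib.unified_diff(
        baseline.splitlines(keepends=True),
        candidate.splitlines(keepends=True),
        fromfile="train.py (baseline)",
        tofile="train.py (candidate)",
    )
    return "".join(lines)


# Hypothetical before/after contents of a training script.
baseline_src = "lr = 3e-4\nbatch_size = 128\n"
candidate_src = "lr = 3e-4\nbatch_size = 64\n"

print(diff_training_script(baseline_src, candidate_src))
```

An empty result means the agent's candidate is identical to the baseline, which is a cheap sanity check before spending a training run on it.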
The meta-research insight here is profound: the research organization (program.md) becomes a living document that humans curate, while the actual implementation code becomes an output artifact. When you discover that rotary embeddings help, you don't manually implement them everywhere—you update program.md with that knowledge, and future agent runs incorporate it automatically. The institutional knowledge lives in markdown, not in Slack messages or paper notes.
Gotcha
The platform constraints are real and deliberate. Autoresearch currently works only on NVIDIA GPUs with CUDA. There's no CPU fallback, no Apple Silicon MPS support, no multi-GPU distribution. This isn't an oversight—adding these would require significant abstraction layers that would obscure the educational clarity of the codebase. Karpathy has explicitly chosen to keep the code minimal and readable over broadly compatible. If you're on a MacBook or running in a CPU-only environment, you're blocked from experimenting.
The fixed 5-minute training window creates a reproducibility challenge across different hardware. The same code run on a 4090 isn't comparable to a run on a 3090 or an A100, because each GPU completes a different number of training steps in 5 minutes. The agent might conclude that a larger model works better simply because your GPU is fast enough, or has enough memory, to train it within the budget, not because it's a better architecture. This means the research findings are somewhat hardware-specific, and you can't easily share experimental results with collaborators on different setups. For a tool designed around autonomous research, this friction around reproducibility is ironic. The focus on single-GPU simplicity means this is fundamentally a personal exploration tool, not infrastructure for a research team.
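A bit of arithmetic shows why results drift across hardware. The throughput figures below are invented for illustration, not measurements of any real GPU, and `tokens_seen` is a hypothetical helper that assumes constant throughput for the whole run.

```python
BUDGET_SECONDS = 5 * 60  # the fixed training window


def tokens_seen(tokens_per_second: float, budget_seconds: float = BUDGET_SECONDS) -> int:
    """Tokens processed in one run, assuming constant throughput."""
    return int(tokens_per_second * budget_seconds)


# Made-up throughputs for two unnamed GPUs running identical code.
gpu_a = tokens_seen(100_000)
gpu_b = tokens_seen(250_000)

print(f"GPU A: {gpu_a:,} tokens, GPU B: {gpu_b:,} tokens, ratio {gpu_b / gpu_a:.1f}x")
```

With identical code, the faster card in this toy example trains on 2.5 times as much data before the clock runs out, so the same train.py can produce different validation scores and a different "best" architecture on each machine.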
Verdict
Use if: You want to experiment with AI-driven research automation on a conceptual level, you're learning about LLM training internals and want a sandbox that's actually modifiable by agents, or you're exploring meta-research ideas about how AI systems could accelerate ML development. This is perfect for educational deep-dives into both GPT architecture and autonomous agent design patterns.
Skip if: You need production-grade training infrastructure, require multi-GPU or cross-platform support, or want state-of-the-art training performance rather than exploring the research automation paradigm. If your goal is training the best possible model rather than exploring how AI agents conduct research, use the parent nanochat project directly or graduate to production frameworks like PyTorch Lightning with traditional hyperparameter optimization.