Autoresearch: Teaching AI Agents to Optimize Neural Networks by Rewriting Their Own Training Code
Hook
What if instead of tuning hyperparameters yourself, you wrote instructions in markdown telling an AI agent how to iteratively rewrite the training loop—and then let it run experiments overnight while you sleep?
Context
Neural network research typically involves researchers manually tweaking architectures, adjusting hyperparameters, and running experiments in a tedious cycle that can take days or weeks. Even with tools like Ray Tune or Optuna, humans still define the search space and write the experimental framework. The promise of autonomous AI research—where agents not only run experiments but modify the code itself—has remained largely theoretical or limited to toy problems.
Autoresearch, from Andrej Karpathy, tackles this challenge by radically constraining the problem space. Instead of attempting to automate all of ML research, it focuses on a single tractable question: can an AI agent iteratively improve a GPT training implementation on a single GPU? The system combines three minimalist components—fixed utilities for data prep, a modifiable training script, and markdown instructions that guide agent behavior—to create a complete autonomous research loop. By fixing the training time budget at exactly 5 minutes per experiment and using vocabulary-independent metrics, it ensures that wildly different architectural choices remain comparable.
Technical Insight
The core architectural insight of autoresearch is separating what’s fixed from what’s explorable. The prepare.py file contains immutable utilities for tokenization, data loading, and evaluation. The AI agent never touches this code. Instead, it exclusively modifies train.py, which contains the complete GPT implementation and training loop in a single, self-contained file. This constraint makes the system’s behavior predictable and diffs reviewable.
The most novel aspect is the “programming via markdown” paradigm. Rather than writing Python, you iterate on program.md—a natural language specification that instructs the agent how to conduct research. A typical instruction might look like:
```markdown
## Research Objective
Improve validation bits-per-byte (BPB) on the FineWeb-Edu dataset by modifying train.py.

## Constraints
- Training runs for exactly 5 minutes wall-clock time
- Single NVIDIA GPU only
- Maintain compatibility with prepare.py evaluation functions

## Exploration Strategy
1. Baseline: Run current train.py and record validation BPB
2. Generate 3 hypotheses about potential improvements (architecture, optimizer, learning rate schedule)
3. For each hypothesis, modify train.py and run a 5-minute experiment
4. Compare results using validation BPB (lower is better)
5. Keep the best-performing change and iterate

## Success Criteria
Achieve validation BPB below 1.05 (current baseline: 1.12)
```
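The exploration strategy above is essentially a greedy keep-best search. A minimal sketch in plain Python (the function names `run_experiment` and `generate_hypotheses` are hypothetical stand-ins for the agent's actions, not autoresearch APIs):

```python
def research_loop(run_experiment, generate_hypotheses, baseline_code, rounds=3):
    """Greedy keep-best loop: evaluate candidate train.py variants and keep
    whichever one achieves the lowest validation BPB."""
    best_code = baseline_code
    best_bpb = run_experiment(baseline_code)  # 5-minute baseline run
    for _ in range(rounds):
        for candidate in generate_hypotheses(best_code):
            bpb = run_experiment(candidate)   # one 5-minute experiment
            if bpb < best_bpb:                # lower BPB is better
                best_code, best_bpb = candidate, bpb
    return best_code, best_bpb
```

The agent's real loop also records its reasoning and diffs, but the control flow reduces to this kind of propose-measure-keep cycle.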
The fixed 5-minute training budget is crucial for fair comparisons. Traditional ML experiments fix steps or epochs, but this breaks when comparing different architectures: a 50M-parameter model might complete several times as many optimizer steps in 5 minutes as a 200M-parameter model. By fixing wall-clock time instead, autoresearch ensures that architectural changes don’t invalidate comparisons—a faster architecture that trains more steps has a legitimate advantage.
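A time-budgeted loop of this kind is a few lines of Python. This is a generic sketch of the pattern, not autoresearch's actual training loop (`step_fn` is a hypothetical placeholder for one forward/backward/update):

```python
import time

def train_for_budget(step_fn, budget_s=300.0):
    """Run optimizer steps until the wall-clock budget is exhausted.
    Fixing time rather than step count keeps fast and slow
    architectures comparable: a faster model simply gets more steps."""
    start = time.monotonic()  # monotonic clock is immune to system-time jumps
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()  # one optimizer step on one batch
        steps += 1
    return steps
```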
The system uses bits-per-byte (BPB) as its optimization target rather than perplexity or loss. This vocabulary-size-independent metric allows the agent to experiment with tokenization changes or vocabulary modifications without breaking the comparison framework. A BPB of 1.0 means the model spends one bit to encode each byte of raw text (an 8× compression of the uncompressed bytes); lower values indicate better modeling.
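The conversion from a token-level cross-entropy loss to BPB is a small calculation. A sketch of the standard formula (sum the loss in nats over all tokens, convert nats to bits, divide by the byte length of the raw text):

```python
import math

def bits_per_byte(sum_ce_nats: float, n_bytes: int) -> float:
    """Convert cross-entropy summed over all tokens (in nats) into
    bits per byte of the raw text. Dividing by the byte count, not the
    token count, is what makes the metric tokenizer-independent."""
    return sum_ce_nats / math.log(2) / n_bytes
```

Because the denominator is bytes of raw text, an agent that swaps in a different tokenizer still produces directly comparable numbers.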
Here’s a simplified view of how the agent might modify train.py:
```python
# Original configuration in train.py
class GPTConfig:
    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Agent modification after hypothesis "deeper model may help"
class GPTConfig:
    n_layer: int = 8   # Increased from 6
    n_head: int = 8    # Proportionally increased
    n_embd: int = 384

# Agent also adjusts learning rate based on scaling laws
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
```
The training loop itself is built on the nanochat codebase, which pairs a standard GPT architecture with a hybrid Muon+AdamW optimizer: Muon (a momentum-based optimizer that orthogonalizes its updates) handles the hidden weight matrices, while AdamW handles biases and normalization parameters. This gives the agent a reasonable starting point while leaving room for improvement.
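The routing rule behind such hybrid setups is simple: matrices go to Muon, one-dimensional parameters go to AdamW. A dependency-free sketch of that rule (the real code operates on torch parameter tensors, and nanochat's exact grouping may differ; the parameter names here are hypothetical):

```python
def split_param_groups(named_shapes):
    """Partition parameters for a Muon+AdamW hybrid.
    named_shapes: iterable of (name, ndim) pairs.
    Weight matrices (ndim >= 2) are routed to Muon; 1D parameters
    such as biases and norm gains are routed to AdamW."""
    muon, adamw = [], []
    for name, ndim in named_shapes:
        if ndim >= 2:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw
```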
Because everything runs on a single GPU with no distributed training complexity, the entire system remains comprehensible. An experienced developer can read through train.py in 10 minutes and understand exactly what the agent is modifying. This transparency is essential for trusting autonomous research—you can git diff each change and understand the agent’s reasoning.
Gotcha
The NVIDIA-only requirement is autoresearch’s most significant limitation. It won’t run on Apple Silicon (MPS), AMD GPUs, or CPU-only machines. This isn’t a temporary constraint—the fixed 5-minute timing assumption means results are inherently hardware-dependent. An experiment that achieves 1.08 BPB on an H100 might reach only 1.15 BPB on an RTX 3090 in the same timeframe, not because the approach is worse but because fewer training steps completed. This makes sharing results across different hardware configurations essentially meaningless. If you’re running on a laptop GPU, you’re establishing a completely different baseline than someone with datacenter hardware.
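The arithmetic behind this hardware dependence is worth making explicit: with a fixed wall-clock budget, per-step throughput directly determines how many optimizer steps a run completes. The per-step times below are hypothetical, chosen only to illustrate the scaling:

```python
def steps_in_budget(step_time_s: float, budget_s: float = 300.0) -> int:
    """Approximate optimizer steps completed in a fixed wall-clock budget.
    A GPU with 4x the throughput simply gets 4x the steps."""
    return round(budget_s / step_time_s)

fast_gpu = steps_in_budget(0.05)  # 50 ms/step -> 6000 steps in 5 minutes
slow_gpu = steps_in_budget(0.20)  # 200 ms/step -> 1500 steps in the same budget
```

Two runs of the same train.py can therefore land at very different loss values for reasons that have nothing to do with the agent's changes.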
The experimental nature of the project is also worth emphasizing. This is demonstration-quality code exploring what autonomous research could look like, not a production framework. There’s no versioning strategy, no stability guarantees, and Karpathy has explicitly positioned it as an experiment rather than a maintained tool. The agent’s ability to modify arbitrary Python code also means it can introduce subtle bugs—type errors, shape mismatches, or logic errors that only surface during training. You’ll need to review each change carefully rather than blindly accepting agent modifications. The single-file constraint helps here, but it’s still possible for an agent to generate syntactically valid but semantically broken code.
Verdict
Use autoresearch if you have an NVIDIA GPU and want to explore the frontier of autonomous AI research in a constrained, understandable domain. It’s perfect for researchers investigating agentic workflows, hobbyists who enjoy meta-level experimentation, or anyone curious about how AI might eventually automate parts of ML engineering. The overnight experiment loop—where you refine markdown instructions before bed and wake up to results—is genuinely compelling for exploring the “programming via natural language” paradigm. Skip it if you need multi-GPU training, production reliability, cross-platform support, or are focused purely on achieving state-of-the-art results rather than exploring research automation methodology. This is a telescope pointed at the future of AI-driven research, not a tool for today’s production ML workflows.