
TinyZero: Reproducing DeepSeek R1-Zero's Emergent Reasoning for Under $30

Hook

A 3B-parameter language model can learn to reason, verify its own answers, and perform tree search—without a single labeled reasoning step. This isn’t theoretical; it’s reproducible on accessible hardware for under $30.

Context

When DeepSeek released R1-Zero, they demonstrated something remarkable: large language models could develop sophisticated reasoning capabilities purely through reinforcement learning, without step-by-step supervision or human-labeled reasoning traces. The model learned to think through problems, backtrack when wrong, and verify its own solutions—all from simple reward signals indicating right or wrong final answers. The catch? Reproducing these results required infrastructure most researchers don’t have access to.

TinyZero exists to democratize this breakthrough. Developed by Jiayi Pan, Junjie Zhang, Xingyao Wang, Lifan Yuan, Hao Peng, and Alane Suhr, it’s a minimal reproduction of the R1-Zero approach on focused tasks like countdown problems and multiplication. The goal isn’t production deployment—the repository now directs users to the veRL library for active development—but education and accessibility. It demonstrates that the core insights about emergent reasoning aren’t locked behind massive compute budgets. According to the authors, for less than $30 in cloud GPU costs (depending on your provider and configuration), individual researchers can experience what they call the “Aha moment”: watching a model spontaneously develop reasoning patterns through nothing but trial and error.

Technical Insight

TinyZero is built on veRL, a reinforcement learning library for language models, using a policy gradient approach to train base models on reasoning tasks. The architecture is deliberately minimal: a base language model generates multiple reasoning traces for each prompt, receives binary rewards based on whether the final answer is correct, and updates its policy to favor successful reasoning patterns. There’s no supervised fine-tuning on reasoning chains, no manually crafted prompts showing how to think step-by-step—just raw RL from outcome feedback.

The training pipeline leverages vLLM for efficient inference during rollout generation and Ray for distributed computing. For the countdown task—where the model must use arithmetic operations to reach a target number—data preparation is straightforward:

python ./examples/data_preprocess/countdown.py --local_dir {path_to_your_dataset}
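To make the outcome-based reward concrete, a correctness check for a countdown answer might look like the sketch below. This is a hypothetical illustration, not TinyZero's actual reward code: the function name, parsing, and tolerance are all assumptions, but the shape—binary reward, verifiable by simple computation—matches what the training setup relies on.

```python
import re
from collections import Counter

def countdown_reward(expression: str, numbers: list[int], target: int) -> float:
    """Binary outcome reward: 1.0 if the expression uses exactly the
    provided numbers and evaluates to the target, else 0.0."""
    # Only digits, arithmetic operators, parentheses, and spaces allowed.
    if not re.fullmatch(r"[\d+\-*/() ]+", expression):
        return 0.0
    # Each provided number must be used exactly once.
    used = [int(tok) for tok in re.findall(r"\d+", expression)]
    if Counter(used) != Counter(numbers):
        return 0.0
    try:
        value = eval(expression)  # charset is restricted by the regex above
    except (SyntaxError, ZeroDivisionError):
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0
```

Because the reward only inspects the final answer, the model is free to produce any intermediate reasoning it likes—which is exactly what allows reasoning patterns to emerge rather than be imitated.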

Training a 3B model requires careful resource management. The key configuration variables control tensor parallelism and GPU allocation:

export N_GPUS=2
export BASE_MODEL={path_to_your_model}
export DATA_DIR={path_to_your_dataset}
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS

bash ./scripts/train_tiny_zero.sh

The ROLLOUT_TP_SIZE=2 setting splits the model across GPUs during inference, which is critical for fitting 3B+ parameter models in VRAM during the generation-heavy rollout phase. If you hit out-of-memory errors, gradient checkpointing can be enabled with critic.model.enable_gradient_checkpointing=True, trading extra compute time for lower memory usage.

What makes TinyZero’s results compelling is the apparent threshold effect in model capacity. Models with 1.5B parameters or fewer don’t successfully develop reasoning capabilities in this framework—the README notes that “for Qwen2.5-0.5B base, we know it fails to learn reasoning.” At 3B parameters, the models begin generating reasoning traces where they propose solutions, check their work, backtrack when they detect errors, and explore alternative paths. This mirrors DeepSeek’s findings about minimum model capacity for emergent reasoning.

The codebase includes an ablation study comparing base models to instruct-tuned variants. The README mentions experiments with both Qwen-2.5-3B base and instruct versions. For the instruct variant, the data preprocessing follows the model’s chat template:

python examples/data_preprocess/countdown.py --template_type=qwen-instruct --local_dir={path_to_your_dataset}
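The practical difference between the two preprocessing modes is how the prompt is wrapped. The sketch below is illustrative rather than TinyZero's actual code: the function and its arguments are hypothetical, though the ChatML-style tokens shown are the format the Qwen chat models are documented to use.

```python
def format_prompt(question: str, template_type: str = "base") -> str:
    """Illustrative prompt formatting: base models receive raw text,
    while instruct models receive a ChatML-style chat wrapper."""
    if template_type == "base":
        return question
    if template_type == "qwen-instruct":
        return (
            "<|im_start|>user\n" + question + "<|im_end|>\n"
            "<|im_start|>assistant\n"
        )
    raise ValueError(f"unknown template_type: {template_type}")

prompt = format_prompt(
    "Using the numbers [25, 5, 4], create an equation that equals 80.",
    template_type="qwen-instruct",
)
```

Matching the template to the model matters: an instruct model prompted without its chat wrapper tends to behave erratically, which would confound any base-vs-instruct comparison.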

Under the hood, TinyZero’s simplicity is its strength. The training loop generates multiple rollouts per prompt, evaluates final answers against ground truth, and uses policy gradient methods to increase the probability of action sequences that led to correct answers. The model learns that certain patterns—explicit calculation steps, checking intermediate results, trying multiple approaches—correlate with success, and these patterns solidify into consistent reasoning behavior.
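The loop described above can be boiled down to a toy example. The following is a deliberately minimal REINFORCE-style sketch over a handful of discrete "strategies"—not TinyZero's implementation, which operates on token sequences with veRL's trainer—but it shows the same mechanism: sample, score with a binary reward, and shift probability mass toward rewarded behavior.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, reward_fn, lr=0.5, n_rollouts=8):
    """One policy-gradient update: sample actions, score each with the
    binary reward, and nudge logits toward rewarded actions.
    For a softmax policy, d(log p(a))/d(logit_i) = indicator(i==a) - p_i."""
    probs = softmax(logits)
    grads = [0.0] * len(logits)
    for _ in range(n_rollouts):
        a = random.choices(range(len(logits)), weights=probs)[0]
        r = reward_fn(a)
        baseline = 0.5  # fixed baseline to reduce gradient variance
        for i in range(len(logits)):
            indicator = 1.0 if i == a else 0.0
            grads[i] += (r - baseline) * (indicator - probs[i]) / n_rollouts
    return [l + lr * g for l, g in zip(logits, grads)]

# Toy setup: only action 2 ("verify, then answer") earns reward.
random.seed(0)
reward = lambda a: 1.0 if a == 2 else 0.0
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits = reinforce_step(logits, reward)
```

After a few hundred updates, nearly all probability mass sits on the rewarded action—the same dynamic by which self-checking and backtracking patterns, once they start correlating with correct answers, solidify in the real model.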

Gotcha

The most critical limitation is right at the top of the README: TinyZero is no longer actively maintained. The authors explicitly state: “This repo is no longer actively maintained. For running RL experiments, please directly use the latest veRL library.” This positions TinyZero as a reference implementation and educational tool rather than production infrastructure. If you’re planning ongoing research or need support for evolving requirements, you’re building on archived code.

The second major constraint is model size requirements. Despite the promise of accessibility, you still need hardware capable of running 3B+ parameter models with enough headroom for RL training dynamics. Models at smaller scales simply don’t develop reasoning in this framework—the README explicitly confirms that Qwen2.5-0.5B “fails to learn reasoning.” This creates a real barrier for researchers with limited GPU access—you can’t scale down to truly resource-constrained environments and still observe the emergent reasoning phenomenon. The promise of “under $30” assumes you have access to appropriate cloud instances or local hardware, and actual costs will vary based on your specific cloud provider and instance configuration.

VRAM management is another persistent challenge. Even with two GPUs and tensor parallelism, 3B models require careful configuration. The need for gradient checkpointing flags and XFORMERS attention backends indicates you’re operating near memory limits. If you’re experimenting with modifications or different model architectures, expect to spend time debugging out-of-memory errors. The task scope is also narrowly constrained: countdown problems and multiplication are focused domains that demonstrate the principle but don’t translate to real-world reasoning benchmarks. The full experiment logs on Weights & Biases show interesting emergent behaviors, but they’re all within these controlled environments where correct answers are unambiguous and verifiable through simple computation.

Verdict

Use TinyZero if you’re a researcher or student trying to understand how reinforcement learning can induce reasoning in language models, particularly if you want hands-on experience with the DeepSeek R1-Zero approach without massive infrastructure. It’s valuable for education, for experimenting with RL training dynamics on reasoning tasks, and for reproducing the core insight that base models can develop self-verification and search purely from outcome supervision. The modest cost makes the fundamental phenomenon accessible even if you’re working with limited research budgets.

Skip it if you need actively maintained infrastructure—the deprecation notice means you should go directly to veRL for ongoing development work. Also skip it if you’re constrained to GPUs that can’t handle 3B+ models, since smaller models won’t develop the reasoning capabilities that make the project interesting. If your goal is tackling real-world reasoning benchmarks rather than understanding the fundamental mechanisms, the focused tasks here won’t meet your needs. TinyZero is a teaching tool and proof of concept, powerful for those purposes but explicitly not positioned as ongoing research infrastructure.
