BabyAGI 2o: Building an Autonomous Agent That Writes Its Own Tools

Hook

Most autonomous agents come with pre-built tools. BabyAGI 2o takes a different approach: it writes its own Python functions from scratch every time you give it a task, installs whatever packages it needs, and debugs itself when things break.

Context

The autonomous agent space has exploded with frameworks like AutoGPT, LangChain Agents, and Microsoft’s Semantic Kernel. These systems typically come with extensive pre-built tool libraries—web scrapers, file handlers, API wrappers—that the agent selects from to complete tasks. This approach works, but it creates a ceiling: the agent can only be as capable as its pre-configured toolbox.

BabyAGI 2o, created by Yohei Nakajima, explores a more radical idea: what if the agent could build its own tools? Instead of selecting from a menu of existing functions, the agent generates Python code dynamically based on the task requirements, executes it, learns from any errors, and continues iterating towards task completion. It’s a minimalist experiment in self-building agents that prioritizes autonomy over safety, simplicity over features. The project is a sibling to BabyAGI 2, which focuses on storing and executing functions from a database for reuse, while BabyAGI 2o emphasizes the on-the-fly creation loop itself.

Technical Insight

System architecture (diagram, auto-generated): User Task Input → LLM with Function Calling → Function Exists? If no, Generate Python Tool Code and add it to the Function Registry → Detect Dependencies → pip install packages → Execute Generated Function → Execution Success? On success, Return Result to User; on error, Update Function Registry and loop back to the LLM.

At its core, BabyAGI 2o operates through a deceptively simple loop: receive a user task, use the LLM’s function-calling capability to generate a Python tool as a string, execute that code, and respond to any errors by regenerating the tool. The architecture leverages litellm as a unified interface across LLM providers, which means you can swap between GPT-4, Claude, or other models as long as they support tool/function calling.
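That loop can be sketched roughly as follows. This is a minimal approximation, not the repository's actual implementation; the `ask_llm` callable stands in for the litellm function-calling request, and the convention that the model returns source defining a function named `tool` is an assumption for illustration:

```python
# Minimal sketch of a BabyAGI 2o-style loop (illustrative, not the repo's code):
# ask the model for tool source, exec() it into a registry, run it, and feed
# any traceback back to the model for another attempt.
import traceback

def run_task(task, ask_llm, max_iterations=5):
    """ask_llm(prompt) -> Python source defining a function named `tool`."""
    registry = {}
    prompt = f"Write a Python function `tool()` that accomplishes: {task}"
    for _ in range(max_iterations):
        source = ask_llm(prompt)
        try:
            exec(source, registry)     # register the generated function
            return registry["tool"]()  # execute it and return the result
        except Exception:
            # On failure, hand the traceback back to the model and retry.
            prompt = f"{prompt}\nPrevious attempt failed:\n{traceback.format_exc()}"
    raise RuntimeError(f"Could not complete task after {max_iterations} attempts")
```

In the real agent the prompt, registry, and retry bookkeeping are richer, but the shape is the same: generate, execute, observe the error, regenerate.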

The agent maintains a dynamic function registry during its session. When the LLM decides it needs a capability—say, scraping a website or processing an image—it generates not just the logic but also identifies required dependencies. BabyAGI 2o then installs those packages via pip automatically before executing the function. Here’s the kind of workflow you’d see from the repository’s examples:

# User task: "Scrape techmeme and provide a summary of headlines"

# The agent might generate something like this internally:
def scrape_techmeme():
    import requests
    from bs4 import BeautifulSoup
    
    response = requests.get('https://www.techmeme.com')
    soup = BeautifulSoup(response.content, 'html.parser')
    headlines = soup.find_all('a', class_='ourh')
    
    return [h.get_text() for h in headlines[:10]]

# BabyAGI 2o would:
# 1. Detect it needs 'requests' and 'beautifulsoup4'
# 2. Run: pip install requests beautifulsoup4
# 3. Execute the function
# 4. If it fails (wrong CSS class, connection error), regenerate with corrections
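The dependency-detection step can be approximated entirely with the standard library: parse the generated source's imports with `ast`, map import names to PyPI names where the two differ, and pip-install anything that isn't already importable. This is a hedged sketch of the idea; the repository's actual detection logic may differ:

```python
# Hedged sketch of automatic dependency installation for generated code.
import ast
import importlib.util
import subprocess
import sys

# Import names often differ from PyPI package names; a small alias map helps.
PYPI_ALIASES = {"bs4": "beautifulsoup4", "PIL": "pillow"}

def find_imports(source):
    """Return the set of top-level module names imported anywhere in source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules

def install_missing(source):
    """pip-install any imported module that is not currently importable."""
    for module in find_imports(source):
        if importlib.util.find_spec(module) is None:
            package = PYPI_ALIASES.get(module, module)
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
```

Because `ast.walk` visits nodes inside function bodies, this catches the in-function imports shown in the scraper example above.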

The error-handling mechanism is what makes the loop viable: rather than aborting on failure, the agent takes what each error reveals, regenerates the tool, and keeps iterating toward a working solution.

The setup process reveals the model-agnostic design. You configure BabyAGI 2o entirely through environment variables:

export LITELLM_MODEL=gpt-4
export OPENAI_API_KEY=your-key-here

Or using a .env file for persistence:

LITELLM_MODEL=claude-2
ANTHROPIC_API_KEY=your-anthropic-key

This abstraction means you can experiment with different models’ coding abilities without touching the codebase.
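Concretely, the tool definition the agent hands to the model is an OpenAI-style function-calling payload, and litellm routes it to whichever provider the environment names. The schema and the `request_tool_code` helper below are illustrative, not the repository's exact definitions:

```python
# Illustrative sketch: one OpenAI-style tool schema, any litellm-supported model.
import os

TOOL_SCHEMA = {
    "type": "function",
    "function": {
        "name": "create_tool",  # illustrative name
        "description": "Define a new Python tool for the current task",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "code": {"type": "string", "description": "Python source"},
                "packages": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["name", "code"],
        },
    },
}

def request_tool_code(messages):
    # Deferred import: requires `pip install litellm` and a provider API key.
    from litellm import completion
    return completion(
        model=os.environ.get("LITELLM_MODEL", "gpt-4"),
        messages=messages,
        tools=[TOOL_SCHEMA],
    )
```

Swapping `LITELLM_MODEL` from `gpt-4` to `claude-2` changes nothing in this code; litellm maps the same request onto Anthropic's API.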

What makes this approach compelling is the meta-programming aspect. The agent isn’t just calling APIs or running shell commands; it’s writing procedural Python that can do anything Python can do. One of the documented examples involves generating a Halloween flyer: calling DALL-E for a background image, overlaying a Halloween message in large letters, and saving the result. That’s multiple capabilities coordinated through generated code, not pre-built integrations.

The tradeoff is reliability. Pre-built tools have been tested and hardened. Generated code is only as good as the LLM’s current output, which can vary between runs.

Gotcha

The elephant in the room is security. BabyAGI 2o installs packages and executes code based entirely on LLM output. There are no apparent sandboxing mechanisms, no package whitelists, no code review steps. If the LLM decides it needs a package, it gets installed. If it generates code to make network requests or modify files, that code runs. The repository README includes a prominent caution about executing in safe environments, specifically recommending Replit for testing, for exactly this reason. This isn’t a tool you can run in environments with access to sensitive data or systems.

Persistence is another significant limitation. Functions are registered dynamically during a session but are not stored between runs. If the agent creates a useful web scraper on Monday, it has to recreate it from scratch on Tuesday. The README mentions that the goal is to integrate with the BabyAGI 2 framework for persistence of created tools, but that’s not implemented in this repository. For now, every session starts from zero.
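Until that integration lands, a rough stopgap is easy to sketch yourself: serialize each generated function's source to disk and re-`exec` it at startup. This is purely illustrative; `tool_registry.json` and these helpers are not part of BabyAGI 2o:

```python
# Illustrative session cache for generated tools (not part of BabyAGI 2o):
# save each function's source to JSON, reload it into a registry on startup.
import json
import os

REGISTRY_FILE = "tool_registry.json"  # illustrative path

def save_tool(name, source, path=REGISTRY_FILE):
    """Append or overwrite one tool's source in the on-disk registry."""
    tools = {}
    if os.path.exists(path):
        with open(path) as f:
            tools = json.load(f)
    tools[name] = source
    with open(path, "w") as f:
        json.dump(tools, f, indent=2)

def load_tools(path=REGISTRY_FILE):
    """Re-exec saved sources into a fresh in-memory registry dict."""
    registry = {}
    if os.path.exists(path):
        with open(path) as f:
            for name, source in json.load(f).items():
                exec(source, registry)
    return registry
```

Note that re-executing stored LLM-generated code inherits every security caveat from the previous paragraph, which is presumably why the real design points at BabyAGI 2 rather than a flat file.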

Reliability is probabilistic, not deterministic. Success depends on the LLM’s ability to reason through the task, generate correct Python syntax, handle edge cases, and debug its own errors. Complex tasks—especially those requiring domain knowledge or multi-step reasoning—may encounter difficulties. The repository’s examples are explicitly described as things that “sometimes work,” which reflects the experimental nature of the approach. You’re working with emergent capabilities, not deploying stable automation.

Verdict

Use BabyAGI 2o if you’re researching autonomous agent architectures, want to prototype self-modifying task-solving systems, or need to experiment with how different LLMs handle code generation in iterative loops. It’s ideal for educational exploration of meta-programming and for understanding the current boundaries of LLM autonomy. Run it in Replit or other sandboxed environments where arbitrary code execution won’t cause damage, as the creator himself recommends.

Skip it if you need production reliability, security-conscious automation, or persistent tool development. Don’t use it in any environment with access to sensitive data or systems, or where arbitrary package installation could create supply-chain risks. This is an experimental exploration that demonstrates what’s possible with self-building agents, not a hardened framework for real-world deployment. If you need production-ready autonomous task completion, look at LangChain Agents or similar frameworks instead—they trade the meta-programming novelty for safety rails and stability.
