BabyAGI 2o: The Autonomous Agent That Writes Its Own Tools
Hook
What if your AI agent didn't just use tools, but wrote them from scratch every time it needed them? BabyAGI 2o takes autonomous agents to their logical extreme: a system that bootstraps its own capabilities through pure code generation.
Context
The autonomous agent landscape has exploded since GPT-4's release. Projects like AutoGPT and the original BabyAGI captured imaginations by chaining LLM calls to accomplish multi-step tasks. But they shared a common limitation: they relied on pre-built tools and functions that developers had to create and maintain by hand.
BabyAGI 2o takes a radically different approach. Created by Yohei Nakajima as a minimalist evolution of his earlier work, it asks a deceptively simple question: what if the agent could write its own tools? Instead of maintaining a library of pre-built functions or persisting tools in a database (like BabyAGI 2), this framework generates Python functions on the fly, executes them, feeds failures back to the model, and iterates until the task is complete. It's metaprogramming meets autonomous agents: a system that extends itself at runtime and treats code generation as its primary problem-solving mechanism. The result is simultaneously elegant and terrifying: an agent that can bootstrap almost any capability it needs, limited only by the creativity of the underlying LLM and your willingness to execute arbitrary AI-generated code.
Technical Insight
At its core, BabyAGI 2o implements a short loop: analyze task, generate tool, install dependencies, execute, handle errors, repeat. The interesting part is how it orchestrates LLM function calling to turn natural language goals into executable Python code.
The architecture uses litellm as its model abstraction layer, which means you can swap between OpenAI, Anthropic, or any provider that supports function calling. When you give the agent a task, it first analyzes what it needs to accomplish, then uses function calling to invoke a create_tool function. The model never sees an implementation of the tool it is about to build: it is given only the create_tool schema, and it supplies the new function's source code as the function_code argument:
# Simplified sketch of the create_tool pattern
import json
import subprocess
import sys

from litellm import completion  # returns OpenAI-style response objects

task = "Count the words in a sentence"  # example task

# The agent receives function definitions like this
tools = [
    {
        "type": "function",
        "function": {
            "name": "create_tool",
            "description": "Create a new Python function to accomplish a specific task",
            "parameters": {
                "type": "object",
                "properties": {
                    "function_name": {"type": "string"},
                    "function_code": {"type": "string"},
                    "required_packages": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["function_name", "function_code"],
            },
        },
    }
]

# The LLM responds with actual code
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": task}],
    tools=tools,
)

# Execute the generated function
message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)

    # Install dependencies automatically, into the interpreter running this script
    for package in args.get("required_packages", []):
        subprocess.run([sys.executable, "-m", "pip", "install", package], check=True)

    # Define the generated function in its own namespace, then call it by name
    namespace = {}
    exec(args["function_code"], namespace)
    result = namespace[args["function_name"]]()
This pattern inverts the traditional tool-use paradigm. Instead of the agent selecting from a menu of pre-built capabilities, it synthesizes exactly what it needs in the moment. Need to scrape a website? It generates a function using BeautifulSoup or requests. Need to process images? It creates a Pillow-based function and installs the dependency automatically. The agent isn't constrained by what you anticipated it might need—it's only constrained by what the LLM can generate.
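To make the "tool as data" idea concrete, here is a minimal offline sketch of what happens after the model answers. The payload is hypothetical and hand-written, shaped like the create_tool schema above; `load_tool` and `word_count` are illustrative names, not part of BabyAGI 2o.

```python
import json

# Hypothetical arguments, shaped like the create_tool schema.
# In a real run this JSON string would come from the model's tool call.
raw_arguments = json.dumps({
    "function_name": "word_count",
    "function_code": (
        "def word_count(text):\n"
        "    return len(text.split())\n"
    ),
    "required_packages": [],
})


def load_tool(arguments_json, namespace):
    """Define the generated function in `namespace` and return it."""
    args = json.loads(arguments_json)
    exec(args["function_code"], namespace)  # runs model-written code: sandbox only
    return namespace[args["function_name"]]


tools_namespace = {}
word_count = load_tool(raw_arguments, tools_namespace)
print(word_count("tools written on demand"))  # prints 4
```

Until `exec` runs, the tool is just a string. That is why the agent can synthesize it mid-conversation, and also why everything the model writes runs with the same privileges as the agent process.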
The error handling mechanism is equally clever. When a generated function fails—whether due to incorrect logic, missing imports, or runtime errors—the agent captures the exception, feeds it back to the LLM with context about what went wrong, and asks for a corrected version. This creates a self-healing loop where the agent iteratively refines its approach:
# Fragment: execute_tool, generated_function, function_code, and tools
# come from the surrounding agent loop
from litellm import completion

try:
    result = execute_tool(generated_function)
except Exception as e:
    error_context = (
        f"Function failed with error: {e!r}\n\n"
        f"Generated code:\n{function_code}"
    )
    # Feed error back to LLM for correction
    correction_response = completion(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Fix the error in this function"},
            {"role": "user", "content": error_context},
        ],
        tools=tools,
    )
    # Retry with the corrected function returned in correction_response
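The retry loop can be exercised offline by swapping the model for a stub. The sketch below is an illustration under that assumption: `fake_model` and `run_with_repair` are hypothetical names, and the attempt cap is a choice made for the example, not something taken from the project.

```python
import traceback


def fake_model(task, last_error):
    """Stand-in for the LLM call (hypothetical): buggy first, fixed after feedback."""
    if last_error is None:
        return "def average(values):\n    return sum(values) / count(values)\n"
    return "def average(values):\n    return sum(values) / len(values)\n"


def run_with_repair(task, function_name, call_args, generate, max_attempts=3):
    """Generate, execute, and regenerate on failure, up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        code = generate(task, last_error)
        namespace = {}
        try:
            exec(code, namespace)
            return namespace[function_name](*call_args), attempt
        except Exception:
            # Keep the full traceback plus the code so the next prompt has context
            last_error = f"{traceback.format_exc()}\nGenerated code:\n{code}"
    raise RuntimeError(f"No working tool after {max_attempts} attempts:\n{last_error}")


result, attempts = run_with_repair(
    "average a list", "average", ([2, 4, 6],), fake_model
)
print(result, attempts)  # prints 4.0 2
```

Capping the attempts matters in practice: without a limit, a model that keeps producing the same broken function would burn tokens indefinitely.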
The dependency management deserves special attention. By allowing the agent to specify required_packages and automatically installing them via subprocess calls to pip, BabyAGI 2o removes one of the biggest friction points in agent development. Traditional agents either require comprehensive environment setup upfront or fail when they encounter missing libraries. This approach treats the Python ecosystem as an infinite toolkit that can be assembled on-demand.
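On-demand installation can be sketched as a check-then-install step. This is an assumption-laden illustration rather than the project's code: `ensure_packages`, `pip_install`, and the `import_names` mapping are hypothetical, and the mapping exists because a PyPI name and its import name often differ.

```python
import importlib.util
import subprocess
import sys


def pip_install(package):
    """Install into the interpreter that is running this script."""
    subprocess.run([sys.executable, "-m", "pip", "install", package], check=True)


def ensure_packages(packages, import_names=None, installer=pip_install):
    """Install only the packages whose import name cannot be found.

    `import_names` maps a PyPI name to its import name where they differ,
    e.g. {"beautifulsoup4": "bs4", "Pillow": "PIL"}.
    """
    import_names = import_names or {}
    installed = []
    for package in packages:
        module = import_names.get(package, package)
        if importlib.util.find_spec(module) is None:
            installer(package)
            installed.append(package)
    return installed


# Dry run with a recording installer, so nothing is actually downloaded
requested = []
ensure_packages(["json", "surely_not_a_real_module_123"], installer=requested.append)
print(requested)  # prints ['surely_not_a_real_module_123']
```

Invoking pip through `sys.executable -m pip` targets the environment the agent is actually running in, which a bare `pip` on the PATH does not guarantee.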
What makes this architecture particularly interesting is its ephemerality. Unlike BabyAGI 2, which persists generated functions to a database for reuse, 2o throws everything away after execution. This seems wasteful until you consider the implications: no state management, no schema migrations, no consistency concerns. Each task starts fresh, which paradoxically makes the system more robust—there's no accumulated technical debt from poorly generated functions in previous runs. The trade-off is obvious: you're paying the token cost to regenerate common utilities every time, but you're gaining simplicity and isolation.
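The isolation that ephemerality buys can be shown in a few lines. In this sketch (hypothetical `run_task` helper, hand-written tool code), each task gets a fresh namespace that is discarded on return, so a tool from one task is invisible to the next.

```python
def run_task(tool_code, entry_point, *args):
    """Run one task in a throwaway namespace; nothing outlives the call."""
    namespace = {}  # fresh for every task, discarded on return
    exec(tool_code, namespace)
    return namespace[entry_point](*args)


first = run_task("def double(x):\n    return x * 2\n", "double", 21)

# A second task cannot see `double`; it would have to regenerate it.
try:
    run_task("def quadruple(x):\n    return double(double(x))\n", "quadruple", 3)
    leaked = True
except NameError:
    leaked = False

print(first, leaked)  # prints 42 False
```

The `NameError` is the trade-off in miniature: no stale tool can corrupt a later run, and no useful tool can be reused either.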
Gotcha
Let's address the elephant in the room: this system executes arbitrary code generated by an LLM with automatic package installation. From a security perspective, this is somewhere between 'concerning' and 'absolutely terrifying' depending on your threat model. You're not just running LLM-generated code—you're giving it permission to install any package from PyPI, which means a compromised or misbehaving model could theoretically install malicious packages, exfiltrate data, or worse. This is explicitly a sandbox-first architecture. The repository README suggests running it in Replit for a reason—you need containerization or VM isolation unless you're comfortable with the risk profile.
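As a rough illustration of reducing blast radius, generated code can at least be moved out of the agent's own process. The sketch below is hypothetical (`run_isolated` is not part of BabyAGI 2o) and is explicitly not a security boundary; it only adds a separate interpreter, a scratch working directory, and a timeout.

```python
import subprocess
import sys
import tempfile


def run_isolated(code, timeout_seconds=5):
    """Run code in a child interpreter with a timeout and a scratch working dir.

    This limits blast radius (separate process, no shared globals, bounded
    runtime) but it is NOT a security boundary: the child can still read files
    and use the network. Use a container or VM for real isolation.
    """
    with tempfile.TemporaryDirectory() as scratch:
        try:
            completed = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: ignore env vars and user site
                cwd=scratch,
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timed out"}
    return {
        "ok": completed.returncode == 0,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


good = run_isolated("print(6 * 7)")
hung = run_isolated("while True: pass", timeout_seconds=1)
print(good["stdout"].strip(), hung["stderr"])  # prints 42 timed out
```

A wrapper like this catches runaway loops and keeps generated code from touching the agent's globals, but the containerization or VM isolation described above is still what addresses the actual threat.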
The lack of persistence creates practical limitations beyond just token efficiency. If you're building an agent that needs to learn and improve over time, throwing away all generated tools after each run is counterproductive. There's no way to build up a library of battle-tested functions, no mechanism for the agent to learn which tools work reliably, and no opportunity to optimize frequently-used utilities. Every task is a cold start. This makes BabyAGI 2o excellent for one-off explorations but frustrating for repetitive workflows where you'd want the efficiency gains of reusable components.
The requirement for function-calling capable models is also more restrictive than it appears. While litellm provides compatibility with many providers, the quality of function calling varies dramatically. GPT-4 and Claude 3+ handle it well, but many open-source models either don't support it at all or produce unreliable function calls. In practice this pushes you toward frontier model providers, which means ongoing API costs and a dependency on external services. litellm can route to locally hosted models, but the local-first story stays weak unless the model you run handles tool calls reliably.
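One way to cope with unreliable function calls is to validate the arguments before executing anything. The following is a sketch under assumptions: `validate_create_tool_call` is a hypothetical helper, and the field names are the ones from the create_tool schema shown earlier in this article.

```python
import ast
import json


def validate_create_tool_call(raw_arguments):
    """Return (args, None) if the call is usable, else (None, reason)."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return None, f"arguments are not valid JSON: {exc}"
    if not isinstance(args, dict):
        return None, "arguments must be a JSON object"

    for key in ("function_name", "function_code"):
        if not isinstance(args.get(key), str):
            return None, f"missing or non-string field: {key}"

    try:
        tree = ast.parse(args["function_code"])
    except SyntaxError as exc:
        return None, f"function_code does not parse: {exc}"

    # The code must define a top-level function with the promised name
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    if args["function_name"] not in defined:
        return None, f"function_code does not define {args['function_name']}"
    return args, None


ok_args, ok_reason = validate_create_tool_call(
    '{"function_name": "f", "function_code": "def f():\\n    return 1\\n"}'
)
_, bad_reason = validate_create_tool_call(
    '{"function_name": "f", "function_code": "def g(): pass"}'
)
print(ok_reason, "|", bad_reason)  # prints None | function_code does not define f
```

The rejection reason can be sent back to the model as the next message, which turns a malformed call into one more iteration of the repair loop instead of a crash.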
Verdict
Use if: You're prototyping agent capabilities and want to explore the boundaries of LLM-driven metaprogramming. BabyAGI 2o excels in research contexts, educational environments, or when you need an agent to tackle genuinely novel tasks where pre-built tools would be insufficient. It's perfect for controlled sandbox experiments where you want to see what's possible when you remove constraints on tool availability. The simplicity of the codebase also makes it an excellent learning resource for understanding how autonomous agents work at a fundamental level.

Skip if: You need production-grade reliability, care about security, or want to build agents that improve over time through tool reuse. The automatic code execution makes it unsuitable for any context where you can't fully isolate the runtime environment, and the lack of persistence means you'll waste tokens and time regenerating the same utilities repeatedly. If you're building actual products rather than exploring possibilities, look at LangChain Agents or the OpenAI Assistants API instead. They provide the guardrails and persistence mechanisms that real applications demand.