Building Self-Planning AI Agents: How open-multi-agent Turns Goals Into Task Graphs
Hook
Most multi-agent frameworks make you define task graphs manually. What if your coordinator agent could decompose "build a user dashboard" into parallelizable subtasks automatically, deciding which agents run when?
Context
The explosion of LLM-powered applications has created a new problem: orchestrating multiple AI agents to accomplish complex goals. Early solutions like LangChain and AutoGen typically required developers to define task sequences explicitly: which agent runs first, what data gets passed between them, and when parallel execution happens. This works for fixed workflows but breaks down when goal complexity varies widely. A simple "summarize this document" shouldn't need the same orchestration machinery as "analyze this codebase, identify performance issues, generate fixes, and create a PR."
open-multi-agent approaches this differently. Instead of requiring developers to construct task graphs manually, it uses a coordinator agent that interprets natural language goals, decomposes them into discrete tasks, identifies dependencies, and orchestrates parallel execution across specialized worker agents. It is written in TypeScript, keeps to three runtime dependencies, and is designed to drop into existing Node.js backends without the framework bloat that characterizes older Python-centric solutions. With support for 10+ LLM providers, MCP tool integration, and built-in observability, it belongs to a newer group of goal-driven orchestration frameworks that prioritize developer experience and aim for production readiness.
Technical Insight
The core architectural idea in open-multi-agent is a coordinator-worker pattern with automatic DAG generation. When you submit a goal, the coordinator agent does not split work arbitrarily. It reasons about task dependencies, identifies parallelization opportunities, and constructs a Directed Acyclic Graph (DAG) that aims to maximize concurrent execution while respecting data flow constraints.
Here's how this looks in practice. Instead of manually defining a pipeline, you declare your intent:
import { createTeam, createAgent } from 'open-multi-agent';

// Worker agents: each has its own model, instruction, and tool allowlist.
const researcher = createAgent({
  name: 'researcher',
  model: 'gpt-4',
  instruction: 'You research topics and gather information.',
  tools: ['web_search', 'read_file']
});

const writer = createAgent({
  name: 'writer',
  model: 'claude-3-5-sonnet',
  instruction: 'You write comprehensive articles based on research.',
  tools: ['write_file']
});

// The coordinator model plans the task graph; the workers execute it.
const team = createTeam({
  name: 'content-team',
  agents: [researcher, writer],
  coordinatorModel: 'gpt-4-turbo'
});

// Submit the goal in natural language; no task graph is defined by hand.
const result = await team.run(
  'Write a technical article about WebAssembly security. Research current vulnerabilities, analyze mitigation strategies, and produce a 2000-word article.'
);
Behind the scenes, the coordinator decomposes this into tasks like "research WebAssembly vulnerabilities," "research mitigation strategies," "synthesize findings," and "write article." The first two tasks can run in parallel since they have no dependencies. The synthesis task waits for both research tasks to complete. The writing task depends on the synthesis output. The coordinator handles all scheduling, retry logic, and data passing through shared memory.
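To make the scheduling concrete, here is a minimal sketch of how a dependency list like the one above could be grouped into parallel "waves". The `Task` shape, the task ids, and the `planWaves` function are illustrative assumptions, not open-multi-agent's internals; the real coordinator may schedule tasks differently.

```typescript
// Hypothetical task shape for illustration; not the library's actual type.
interface Task {
  id: string;
  dependsOn: string[];
}

// Group tasks into waves. Every task in a wave has all of its dependencies
// satisfied by earlier waves, so the tasks within one wave can run concurrently.
function planWaves(tasks: Task[]): string[][] {
  const remaining = new Map<string, Set<string>>();
  for (const t of tasks) remaining.set(t.id, new Set(t.dependsOn));

  const waves: string[][] = [];
  while (remaining.size > 0) {
    // A task is ready once none of its dependencies are still pending.
    const ready: string[] = [];
    for (const [id, deps] of remaining) {
      if (deps.size === 0) ready.push(id);
    }
    if (ready.length === 0) {
      throw new Error('Cycle or unknown dependency in task graph');
    }
    for (const id of ready) remaining.delete(id);
    for (const deps of remaining.values()) {
      for (const id of ready) deps.delete(id);
    }
    waves.push(ready);
  }
  return waves;
}

const articlePlan: Task[] = [
  { id: 'research-vulnerabilities', dependsOn: [] },
  { id: 'research-mitigations', dependsOn: [] },
  { id: 'synthesize', dependsOn: ['research-vulnerabilities', 'research-mitigations'] },
  { id: 'write-article', dependsOn: ['synthesize'] },
];

// Three waves: both research tasks, then 'synthesize', then 'write-article'.
const waves = planWaves(articlePlan);
```

An executor built on this idea could run each wave with `Promise.all` and move on once the whole wave settles. Wave-based scheduling is simpler than starting each task the moment its own dependencies finish, at the cost of some idle time when tasks in a wave have uneven durations.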
What makes this TypeScript-native approach powerful is the provider flexibility combined with structured outputs. Each agent can use a different LLM provider—your researcher might use Perplexity for web search capabilities while your writer uses Claude for long-form generation. The framework uses Zod schemas for validation:
import { z } from 'zod';
import { createAgent } from 'open-multi-agent';

const analysisAgent = createAgent({
  name: 'analyzer',
  model: 'gemini-2.0-flash',
  instruction: 'Analyze code for security issues.',
  // The agent's reply must match this schema before it is returned.
  outputSchema: z.object({
    vulnerabilities: z.array(z.object({
      severity: z.enum(['low', 'medium', 'high', 'critical']),
      location: z.string(),
      description: z.string(),
      recommendation: z.string()
    })),
    overallRisk: z.enum(['low', 'medium', 'high'])
  })
});

// middlewareCode is a string holding the middleware source, loaded elsewhere.
const analysis = await analysisAgent.run(
  'Analyze this Express.js authentication middleware',
  { context: { code: middlewareCode } }
);

// analysis is fully typed and validated
// The array may be empty, so guard the first element before reading it.
console.log(analysis.vulnerabilities[0]?.severity);
If the LLM returns invalid JSON or misses required fields, open-multi-agent automatically retries with error feedback, asking the model to correct the output. This eliminates the fragile string parsing and manual validation that plague many AI integrations.
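The retry-with-feedback loop described above can be sketched in a few lines. This is an assumption about the general technique, not the framework's actual code: `ModelCall`, `runWithValidation`, and `parseRisk` are hypothetical names, and the validator is hand-written so the example runs without dependencies. With Zod, a schema's `parse` method (which throws on invalid input) could fill the same role.

```typescript
// Hypothetical stand-in for any LLM call: takes a prompt, resolves to raw text.
type ModelCall = (prompt: string) => Promise<string>;

// Calls the model, parses the reply as JSON, and validates it. On failure,
// the error message is appended to the original prompt and the call is retried.
async function runWithValidation<T>(
  callModel: ModelCall,
  prompt: string,
  parse: (value: unknown) => T, // must throw on invalid input
  maxAttempts = 3,
): Promise<T> {
  let currentPrompt = prompt;
  let lastError = 'no attempts made';
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel(currentPrompt);
    try {
      // JSON.parse and the validator both throw, so one catch covers both cases.
      return parse(JSON.parse(raw));
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      currentPrompt =
        `${prompt}\n\nYour previous reply was rejected: ${lastError}\n` +
        'Reply with corrected JSON only.';
    }
  }
  throw new Error(`Output still invalid after ${maxAttempts} attempts: ${lastError}`);
}

// Example validator for a one-field schema.
type Risk = { overallRisk: 'low' | 'medium' | 'high' };

function parseRisk(value: unknown): Risk {
  const risk = (value as { overallRisk?: unknown } | null)?.overallRisk;
  if (risk === 'low' || risk === 'medium' || risk === 'high') {
    return { overallRisk: risk };
  }
  throw new Error('overallRisk must be "low", "medium", or "high"');
}
```

Capping the number of attempts matters: each retry is another paid model call, and a model that cannot satisfy the schema after a few rounds of feedback is unlikely to succeed on the tenth.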
The MCP (Model Context Protocol) integration extends capabilities beyond the six built-in tools. You can expose external APIs, databases, or custom business logic as tools that any agent can invoke:
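import { createAgent } from 'open-multi-agent';

// The reference Postgres MCP server reads its connection string from a
// command-line argument, so fail fast if the variable is unset.
const databaseUrl = process.env.DATABASE_URL;
if (!databaseUrl) throw new Error('DATABASE_URL is not set');

const dataAgent = createAgent({
  name: 'data-processor',
  model: 'deepseek-chat',
  mcpServers: [{
    command: 'npx',
    args: ['-y', '@modelcontextprotocol/server-postgres', databaseUrl]
  }]
});

// Agent now has access to postgres query tools
await dataAgent.run(
  'Query the users table for accounts created in the last 24 hours and calculate churn risk'
);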
Observability is baked into the runtime. Every agent execution produces streaming events with token-by-token outputs, and the post-run dashboard visualizes the actual task DAG that executed—showing which tasks ran in parallel, how long each took, and which agent handled it. This isn't an afterthought or separate tracing service; it's built into the core with minimal overhead.
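As a rough illustration of what such a trace makes possible, the sketch below summarizes a run from per-task timing records. The `TaskSpan` shape and `summarizeRun` function are hypothetical and do not reflect the framework's actual event format; they only show the kind of numbers a DAG view is built from.

```typescript
// Hypothetical trace record: one entry per executed task.
interface TaskSpan {
  taskId: string;
  agent: string;
  startMs: number;
  endMs: number;
}

interface RunSummary {
  wallClockMs: number;     // first start to last end
  totalWorkMs: number;     // sum of all task durations
  peakConcurrency: number; // most tasks running at the same moment
}

function summarizeRun(spans: TaskSpan[]): RunSummary {
  if (spans.length === 0) {
    return { wallClockMs: 0, totalWorkMs: 0, peakConcurrency: 0 };
  }
  const firstStart = Math.min(...spans.map((s) => s.startMs));
  const lastEnd = Math.max(...spans.map((s) => s.endMs));
  const totalWorkMs = spans.reduce((sum, s) => sum + (s.endMs - s.startMs), 0);

  // Sweep over start/end boundaries in time order. At equal timestamps an
  // end (-1) sorts before a start (+1), so back-to-back tasks do not overlap.
  const edges = spans
    .flatMap((s) => [
      { at: s.startMs, delta: 1 },
      { at: s.endMs, delta: -1 },
    ])
    .sort((a, b) => a.at - b.at || a.delta - b.delta);

  let running = 0;
  let peakConcurrency = 0;
  for (const edge of edges) {
    running += edge.delta;
    peakConcurrency = Math.max(peakConcurrency, running);
  }
  return { wallClockMs: lastEnd - firstStart, totalWorkMs, peakConcurrency };
}
```

If `totalWorkMs` is close to `wallClockMs` and `peakConcurrency` is 1, the plan ran sequentially, which is a useful signal when judging whether the coordinator found the parallelism you expected.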
The three-dependency constraint is intentional architectural discipline. The framework depends only on the AI SDK (for provider adapters), Zod (for schema validation), and a UUID generator. Everything else—retry logic, streaming, memory stores, trace collection—is implemented directly. This minimizes supply chain risk and makes the framework auditable for security-conscious teams.
Gotcha
The coordinator-driven approach has a critical weakness: your entire task decomposition quality depends on the reasoning capabilities of your coordinator model. If you choose a weaker or cheaper model as coordinator to save costs, you'll get suboptimal DAGs—tasks that should run in parallel get sequenced unnecessarily, or the coordinator creates redundant tasks that duplicate work. There's no explicit validation that the generated DAG is optimal, and debugging why the coordinator made certain planning decisions requires inspecting LLM reasoning chains, which can be opaque.
The Node.js-only limitation is real. If your backend is Python, Go, or Rust, you can't use open-multi-agent without introducing JavaScript into your stack or building external service boundaries. There are no language bindings, and the TypeScript-native design means idioms don't translate cleanly to other ecosystems.
The shared memory implementation defaults to an in-process key-value store, which breaks down under high concurrency or distributed deployments. You'll need to wire up Redis or Postgres adapters, and the documentation around production memory store configuration is sparse.
Error recovery strategies for coordinator failures (what happens if the coordinator hallucinates an invalid DAG or crashes mid-planning?) aren't well documented, leaving production teams to build safety nets themselves.
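One such safety net is a structural check on the plan before anything executes. The sketch below assumes you can obtain the coordinator's plan as a list of tasks; the `PlannedTask` shape and `validatePlan` function are hypothetical, and whether the framework exposes the plan in this form is not confirmed by its documentation.

```typescript
// Hypothetical shape of one coordinator-planned task.
interface PlannedTask {
  id: string;
  agent: string;
  dependsOn: string[];
}

// Returns a list of problems; an empty list means the plan is structurally sound.
function validatePlan(tasks: PlannedTask[], knownAgents: string[]): string[] {
  const errors: string[] = [];
  const ids = new Set<string>();

  for (const t of tasks) {
    if (ids.has(t.id)) errors.push(`Duplicate task id: ${t.id}`);
    ids.add(t.id);
    if (!knownAgents.includes(t.agent)) {
      errors.push(`Task ${t.id} targets unknown agent: ${t.agent}`);
    }
  }
  for (const t of tasks) {
    for (const dep of t.dependsOn) {
      if (!ids.has(dep)) errors.push(`Task ${t.id} depends on unknown task: ${dep}`);
    }
  }

  // Depth-first search: reaching a task that is still being visited means
  // we followed dependencies back to where we started, which is a cycle.
  const byId = new Map(tasks.map((t) => [t.id, t] as const));
  const state = new Map<string, 'visiting' | 'done'>();
  const hasCycleFrom = (id: string): boolean => {
    if (state.get(id) === 'done') return false;
    if (state.get(id) === 'visiting') return true;
    state.set(id, 'visiting');
    const deps = byId.get(id)?.dependsOn ?? [];
    const cyclic = deps.some((dep) => byId.has(dep) && hasCycleFrom(dep));
    state.set(id, 'done');
    return cyclic;
  };
  if (tasks.some((t) => hasCycleFrom(t.id))) {
    errors.push('Plan contains a dependency cycle');
  }
  return errors;
}
```

A check like this catches malformed plans but says nothing about plan quality: a structurally valid DAG can still sequence work that could have run in parallel.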
Verdict
Use if: You're building TypeScript/Node.js backends that need complex multi-step AI workflows where goal complexity varies unpredictably, you want automatic task decomposition and parallel execution without manual graph construction, you need to mix different LLM providers across agents for cost or capability reasons, and you value minimal dependencies and built-in observability over ecosystem maturity. It's particularly strong for internal tools, API backends, and automation scripts where TypeScript is already your primary language.
Skip if: You're working in Python or other languages and don't want to introduce Node.js, you need battle-tested production features like distributed tracing (OpenTelemetry) and sophisticated failure recovery that mature frameworks provide, your workflows have strict task ordering requirements that you don't trust an LLM coordinator to plan correctly, or you're building mission-critical systems where coordinator reasoning failures could cause significant business impact. For Python-first teams or enterprises requiring proven resilience patterns, LangGraph or Semantic Kernel remain safer choices despite their complexity.