MetaGPT: Building Software Companies from Prompts with Multi-Agent SOPs

Hook

What if you could hire an entire software company—complete with product managers, architects, and engineers—for the cost of a few API calls? MetaGPT's 67,000+ GitHub stars suggest thousands of developers are already experimenting with exactly that.

Context

The explosion of LLM-powered code generation tools has created a curious gap. While GitHub Copilot and ChatGPT excel at writing individual functions or classes, they falter when asked to architect complete applications. You get code snippets, not coherent systems. The missing ingredient isn't more powerful models—it's structure.

MetaGPT attacks this problem by encoding something most AI tools ignore: the organizational knowledge embedded in how real software companies operate. Its central thesis—expressed in the formula 'Code = SOP(Team)'—argues that software quality emerges not from individual brilliance but from well-defined Standard Operating Procedures executed by specialized roles. Instead of throwing a monolithic prompt at an LLM and hoping for coherent output, MetaGPT orchestrates multiple AI agents playing distinct roles (Product Manager, Architect, Engineer, QA) that collaborate through structured handoffs, mimicking the workflows that have produced successful software for decades.

Technical Insight

The architectural elegance of MetaGPT lies in its role-based agent system and publish-subscribe communication pattern. Each agent inherits from a base Role class that encapsulates not just LLM interaction but behavioral patterns specific to job functions. When you submit a requirement like "Create a snake game," the framework instantiates a team and begins a sequential workflow.

Here's how you'd initialize a basic software company:

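import asyncio

from metagpt.roles import Architect, Engineer, ProductManager
from metagpt.team import Team  # named SoftwareCompany in early releases


async def generate_project():
    company = Team()
    company.hire([
        ProductManager(),
        Architect(),
        Engineer(),
    ])

    company.invest(investment=3.0)  # Budget cap in USD for LLM calls
    company.run_project(idea="Create a CLI snake game with score tracking")
    await company.run(n_round=5)  # Nothing happens until the team is actually run


if __name__ == "__main__":
    # Top-level await only works in notebooks/REPLs; scripts need an event loop.
    asyncio.run(generate_project())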

Behind this clean interface, each role executes a carefully choreographed sequence. The ProductManager agent doesn't just generate requirements—it produces structured artifacts following templates. It analyzes the input requirement, conducts competitive analysis, and outputs a Product Requirements Document (PRD) with user stories, acceptance criteria, and success metrics. This PRD gets published to a shared message environment.

The Architect agent subscribes to PRD publications. When triggered, it doesn't receive raw text but a structured Message object containing the PRD. This separation of concerns is critical: agents don't share context indiscriminately but communicate through defined interfaces, preventing the context-window explosion that plagues naive multi-agent implementations.

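# Simplified sketch of the pattern; the shipped Architect differs in detail.
from metagpt.actions import WriteDesign, WritePRD
from metagpt.roles import Role
from metagpt.schema import Message


class Architect(Role):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.set_actions([WriteDesign])  # What this role can do
        self._watch([WritePRD])  # Subscribe to PM outputs

    async def _act(self) -> Message:
        # Take the most recent PRD from this role's memory
        prd = self.rc.memory.get_by_action(WritePRD)[-1]
        system_design = await WriteDesign().run(prd)
        # Tag the result so downstream roles watching WriteDesign are triggered
        return Message(
            content=system_design,
            role=self.profile,
            cause_by=WriteDesign
        )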

The _watch() mechanism implements a reactive pattern where agents activate only when relevant artifacts appear. This prevents unnecessary LLM calls and creates a natural dependency graph. The Architect won't hallucinate designs for non-existent requirements because it literally cannot act until a PRD exists.
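To make the reactive pattern concrete, here is a toy, dependency-free sketch of watch-based routing. It does not use MetaGPT at all: the `Message`, `Role`, and `Environment` classes below are hypothetical stand-ins written for this example, and the "LLM call" is a string transformation. The point is only to show how subscriptions produce an ordering without anyone scripting it.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    content: str
    cause_by: str  # name of the action that produced this message


@dataclass
class Role:
    name: str
    action: str  # the single action this toy role performs
    watches: set = field(default_factory=set)  # actions it reacts to

    def react(self, msg: Message) -> Message:
        # Stand-in for an LLM call: wrap the input in this role's action name.
        return Message(content=f"{self.action}({msg.content})", cause_by=self.action)


class Environment:
    """Shared message bus: routes each message to roles watching its cause."""

    def __init__(self, roles):
        self.roles = roles
        self.history = []

    def publish(self, msg: Message) -> None:
        queue = [msg]
        while queue:
            current = queue.pop(0)
            self.history.append(current)
            for role in self.roles:
                if current.cause_by in role.watches:
                    queue.append(role.react(current))


def run_pipeline(idea: str) -> list:
    # Roles are deliberately registered out of order; the watch
    # relationships, not the list order, decide who acts when.
    roles = [
        Role("Engineer", "WriteCode", {"WriteDesign"}),
        Role("Architect", "WriteDesign", {"WritePRD"}),
        Role("ProductManager", "WritePRD", {"UserRequirement"}),
    ]
    env = Environment(roles)
    env.publish(Message(content=idea, cause_by="UserRequirement"))
    return [m.cause_by for m in env.history]
```

Calling `run_pipeline("snake game")` walks the chain requirement, PRD, design, code, even though the roles were registered in reverse. Remove the ProductManager from the list and nothing downstream fires, which is the "cannot act until a PRD exists" property in miniature.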

What makes this more than clever prompt chaining is the action abstraction layer. Each role composes multiple Action objects—reusable, testable units of LLM interaction with specific prompts and output schemas. WriteDesign isn't just a prompt template; it's a class that validates outputs, handles retries, and can be swapped or extended. This modularity means you can replace the architecture generation strategy without rewriting role logic.
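As a rough illustration of what "validates outputs, handles retries" can mean, the sketch below defines a hypothetical `Action` class that is not MetaGPT's own. It assumes the model is asked for JSON, that the schema is just a set of required keys, and that `llm` is any callable taking a prompt and returning text. Real implementations will be more involved.

```python
import json


class Action:
    """Hypothetical reusable unit: one prompt, one output schema, bounded retries."""

    def __init__(self, name, prompt_template, required_keys, llm, max_retries=2):
        self.name = name
        self.prompt_template = prompt_template
        self.required_keys = set(required_keys)
        self.llm = llm  # any callable: prompt text in, reply text out
        self.max_retries = max_retries

    def run(self, requirement: str) -> dict:
        prompt = self.prompt_template.format(requirement=requirement)
        last_error = None
        for _ in range(self.max_retries + 1):
            raw = self.llm(prompt)
            try:
                data = json.loads(raw)
            except json.JSONDecodeError as exc:
                last_error = exc  # malformed reply: try again
                continue
            if not isinstance(data, dict):
                last_error = ValueError("reply is not a JSON object")
                continue
            missing = self.required_keys - set(data)
            if not missing:
                return data
            last_error = ValueError(f"missing keys: {sorted(missing)}")
        raise RuntimeError(f"{self.name} failed validation") from last_error


def make_fake_llm(replies):
    """Scripted stand-in for a model: returns the given replies in order."""
    iterator = iter(replies)
    return lambda prompt: next(iterator)


def demo() -> dict:
    llm = make_fake_llm(["not json", '{"modules": ["game", "ui"], "api": "cli"}'])
    action = Action(
        name="WriteDesign",
        prompt_template="Design a system for: {requirement}",
        required_keys={"modules", "api"},
        llm=llm,
    )
    return action.run("CLI snake game")
```

Because the role only calls `action.run(...)`, swapping in a different prompt, schema, or model leaves the role's own logic untouched.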

The framework extends beyond software generation with the Data Interpreter role, which handles autonomous data analysis. Given a dataset and an analytical question, it writes Python code, executes it in a sandboxed environment, generates visualizations, and iterates based on results—essentially implementing a ReAct loop specialized for data science workflows. This demonstrates MetaGPT's SOP philosophy applied beyond its software company metaphor: define a role, formalize its workflow, and let the LLM execute within those guardrails.
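The write, execute, iterate cycle can be sketched in a few lines. This is not the Data Interpreter's code: `execute`, `interpret`, and `scripted_writer` are hypothetical names, the "model" is a scripted function, and `exec` in a fresh namespace is used purely for illustration and is not a sandbox.

```python
import traceback


def execute(code: str) -> tuple[bool, str]:
    """Run code in a fresh namespace. Illustration only: exec is NOT a sandbox."""
    namespace = {}
    try:
        exec(code, namespace)
    except Exception:
        return False, traceback.format_exc(limit=1)
    return True, repr(namespace.get("result"))


def interpret(question: str, write_code, max_steps: int = 3) -> dict:
    """Ask for code, run it, and feed any error back until it works."""
    feedback = ""
    for step in range(1, max_steps + 1):
        code = write_code(question, feedback)
        ok, output = execute(code)
        if ok:
            return {"steps": step, "result": output}
        feedback = output  # the traceback becomes context for the next attempt
    raise RuntimeError("no working code within the step budget")


def scripted_writer(question: str, feedback: str) -> str:
    """Stand-in for the LLM: the first attempt has a bug, the second fixes it."""
    if not feedback:
        return "result = sum(data) / len(data)"  # fails: data is undefined
    return "data = [2, 4, 6]\nresult = sum(data) / len(data)"
```

With the scripted writer, `interpret("What is the mean?", scripted_writer)` succeeds on the second step after the first attempt's `NameError` is fed back. The step budget plays the same guardrail role as the SOP: the model iterates, but only inside a bounded loop.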

Recent research contributions from the MetaGPT team reveal deeper architectural thinking. Their AFlow framework (presented at ICLR 2025) automatically optimizes agent workflows by searching the space of possible SOP configurations—essentially using LLMs to discover better LLM orchestration patterns. This meta-optimization approach hints at where multi-agent systems are heading: not hand-crafted workflows but evolved collaboration patterns.

Gotcha

The Python version constraint (3.9 through 3.11, with 3.12 and later explicitly excluded) is an early sign of technical debt. It appears to stem from dependency chains in the LangChain ecosystem and specific pinned library versions that haven't caught up with Python's evolution. For teams standardized on current Python, this creates friction: you end up maintaining a separate virtual environment or container just for MetaGPT.
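If you want a setup script to fail early rather than midway through dependency installation, a small guard like the following works. The function name is made up for this example, and the bounds simply restate the 3.9 to 3.11 range given above.

```python
import sys


def supported(version=None) -> bool:
    """Return True if the interpreter version falls in the 3.9 to 3.11 range."""
    major, minor = (version or sys.version_info)[:2]
    return (3, 9) <= (major, minor) < (3, 12)
```

A launcher could call `supported()` and exit with a clear message before attempting any install.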

More fundamentally, the sequential SOP workflow that gives MetaGPT its structure also creates brittleness. Real software teams iterate messily: architects reconsider designs mid-implementation, product managers adjust requirements when technical constraints emerge, engineers push back on infeasible specs. MetaGPT's linear pipeline doesn't naturally accommodate this. Once the Product Manager publishes a PRD, there's no built-in mechanism for the Engineer to say "this requirement is ambiguous" and trigger a refinement cycle. You get one pass through the assembly line.

The cost model deserves honest scrutiny. Generating a complete project might invoke GPT-4 dozens of times across multiple agents, and each call carries a substantial context window filled with prior artifacts. A moderately complex application could easily consume $5-20 in API costs. The budget cap you pass to invest() exists precisely because runaway costs are a real concern. For experimentation, this is manageable. For production use cases generating dozens of projects, it becomes a budget line item that needs justification.
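A back-of-the-envelope estimate shows how quickly the calls add up. Everything numeric below is an assumption for illustration: the per-1K-token prices are placeholders rather than current rates for any provider, and the call counts and token sizes are invented. `estimate_cost` and `Budget` are hypothetical helpers, not part of MetaGPT.

```python
def estimate_cost(calls, price_in_per_1k, price_out_per_1k):
    """Sum the cost of (input_tokens, output_tokens) pairs, in dollars."""
    total = 0.0
    for tokens_in, tokens_out in calls:
        total += tokens_in / 1000 * price_in_per_1k
        total += tokens_out / 1000 * price_out_per_1k
    return round(total, 2)


class Budget:
    """Minimal spending cap: refuses any charge that would exceed the limit."""

    def __init__(self, limit: float):
        self.limit = limit
        self.spent = 0.0

    def charge(self, amount: float) -> None:
        if self.spent + amount > self.limit:
            raise RuntimeError(
                f"budget exceeded: {self.spent + amount:.2f} > {self.limit:.2f}"
            )
        self.spent += amount


# Placeholder scenario: 40 calls, each with 6,000 input and 1,500 output tokens,
# priced at an assumed $0.03 / $0.06 per 1K tokens.
example_calls = [(6000, 1500)] * 40
example_total = estimate_cost(example_calls, price_in_per_1k=0.03, price_out_per_1k=0.06)
```

Under these placeholder figures the run comes to $10.80, so a cap of 3.0 like the one in the earlier example would stop it well before completion. Plug in your provider's actual prices and observed token counts before drawing conclusions.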

Verdict

Use MetaGPT if you're generating greenfield project scaffolds where comprehensive documentation matters as much as code, prototyping business applications in well-understood domains, or researching multi-agent architectures and need a sophisticated reference implementation. It shines when you want structured outputs (PRDs, architecture diagrams, API specs), not just code dumps.

Skip it if you're working with existing codebases that need contextual modifications, operating under tight LLM API budgets, relying on Python 3.12+ features, or iterating rapidly enough that the overhead of multi-agent orchestration slows you down. For quick scripts or exploratory coding, simpler tools like GPT-Engineer or direct LLM interaction will feel less ceremonious. MetaGPT's power is its structure; that structure is also its constraint.