LLM Council: Building Consensus Through Multi-Agent Deliberation
Hook
What if the best way to get a reliable answer from an LLM isn’t to pick the smartest model, but to make several models argue with each other first?
Context
Single-model interactions have a fundamental weakness: you’re betting everything on one system’s interpretation of your question. Ask GPT-4 about a nuanced technical decision and you get one perspective. Ask Claude, and you might get a completely different—equally confident—answer. For trivial queries, this doesn’t matter. But for high-stakes questions where you’d normally consult multiple experts, relying on a single model feels uncomfortably like asking one person and calling it research.
The obvious solution is to query multiple models manually, but that’s tedious and doesn’t scale. You end up with a browser full of tabs, manually synthesizing responses in your head. LLM Council, a weekend project from Andrej Karpathy, automates this multi-expert consultation pattern through a three-stage deliberation process: parallel querying, anonymous peer review, and synthesis. It’s not trying to be a production platform—it’s a reference implementation that demonstrates how orchestrating model disagreement can surface better answers than any single model provides alone.
Technical Insight
The architecture mirrors how human expert panels actually work. When you ask LLM Council a question, it fans out your query to multiple models simultaneously via OpenRouter’s unified API. Each model generates its response in isolation—no peeking at others’ answers. This parallel execution is handled by FastAPI’s async capabilities:
import asyncio
import aiohttp

async def get_council_opinions(query: str, models: list[str]) -> list[str]:
    """Fan the query out to every council model in parallel."""
    async with aiohttp.ClientSession() as session:
        tasks = [
            call_openrouter(session, model, query)
            for model in models
        ]
        return await asyncio.gather(*tasks)
The interesting part comes next: peer review. Each model receives all the other responses (anonymized, labeled as Response A, B, C, etc.) and ranks them. The prompt for this stage is carefully designed to force critical evaluation rather than politeness. Models must explain their rankings, which creates an audit trail of what criteria they considered important. This isn’t just voting—it’s argumentation.
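The anonymization step is simple to sketch. The helper below is an illustration, not the repo's actual code: the letter-labeling scheme and function name are assumptions, though they match the "Response A, B, C" convention the article describes.

```python
import string

def anonymize(responses: list[str]) -> tuple[str, dict[str, str]]:
    """Label responses A, B, C, ... and strip model attribution."""
    label_map = {}  # letter -> original response, for de-anonymizing later
    blocks = []
    for letter, text in zip(string.ascii_uppercase, responses):
        label_map[letter] = text
        blocks.append(f"Response {letter}:\n{text}")
    return "\n\n".join(blocks), label_map
```

The returned `label_map` lets the orchestrator translate rankings like "Best: B" back to the model that actually wrote response B.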
The anonymization is crucial. If models could identify their own responses or recognize writing patterns from specific models, they might exhibit favoritism or tribal behavior. By stripping attribution, you force evaluation based on content quality alone. The ranking prompt looks something like this:
PEER_REVIEW_PROMPT = """
You are evaluating responses to this question: {original_query}
Here are the responses to evaluate:
{anonymized_responses}
Rank these responses from best to worst. For each ranking, explain:
1. Factual accuracy and completeness
2. Clarity and structure
3. Potential weaknesses or gaps
Provide your ranking as: Best: [letter], Second: [letter], etc.
"""
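Because the prompt demands a fixed "Best: [letter], Second: [letter]" format, extracting the ranking can be a one-line regex. This parser is a sketch under that assumption; the actual repo may parse reviews differently.

```python
import re

def parse_ranking(review_text: str) -> list[str]:
    """Extract ranked letters from 'Best: [A], Second: [B], ...' style output."""
    return re.findall(
        r"(?:Best|Second|Third|Fourth|Fifth):\s*\[?([A-Z])\]?", review_text
    )
```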
The final stage hands everything to a “Chairman” model—typically the most capable model in your council—to synthesize the original responses and peer reviews into a single answer. This synthesis isn’t just averaging; the Chairman sees which responses were highly rated and why, which criticisms were raised, and where models disagreed. The output acknowledges uncertainty when models diverged significantly.
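The Chairman stage amounts to one more prompt that bundles everything together. The prompt wording below is an assumption for illustration, and `synthesize` assumes the same `call_openrouter` helper used in the fan-out code earlier:

```python
CHAIRMAN_PROMPT = """You are the Chairman of an LLM council.
Original question: {query}

Council responses:
{responses}

Peer reviews and rankings:
{reviews}

Synthesize a single final answer. Weight highly-ranked responses more
heavily, address criticisms raised in review, and explicitly note
where the models disagreed."""

async def synthesize(session, chairman_model: str, query: str,
                     responses: str, reviews: str) -> str:
    """Ask the Chairman model for the final, synthesized answer."""
    prompt = CHAIRMAN_PROMPT.format(
        query=query, responses=responses, reviews=reviews
    )
    return await call_openrouter(session, chairman_model, prompt)
```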
The frontend implementation uses React with a tab-based UI that exposes the entire deliberation process. You can inspect each model’s original response, see how every model ranked the others, and read the final synthesis—all before deciding whether to trust the output. This transparency is pedagogically valuable; you learn which models excel at which types of reasoning by watching them critique each other.
Data persistence is deliberately simple: conversations serialize to JSON files in a local directory. No database, no cloud storage, no user accounts. This weekend-project minimalism means you can audit exactly what’s being stored and where. For developers who want to extend the system—maybe adding RAG capabilities or custom scoring algorithms—the codebase is small enough to understand in an afternoon.
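The whole persistence layer fits in a few lines. The filename scheme and record shape here are assumptions, but they capture the JSON-files-in-a-directory approach:

```python
import json
import time
from pathlib import Path

DATA_DIR = Path("conversations")

def save_conversation(record: dict) -> Path:
    """Write one deliberation (query, responses, reviews, synthesis) to a JSON file."""
    DATA_DIR.mkdir(exist_ok=True)
    path = DATA_DIR / f"{int(time.time() * 1000)}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

Auditing what's stored is then just `cat conversations/*.json`.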
The OpenRouter integration is the architectural choice that makes this practical. Instead of managing API keys for Anthropic, OpenAI, Google, and xAI separately, you configure one OpenRouter key and select from their entire model catalog. This abstraction layer means experimenting with different council compositions (maybe three Claudes with different temperature settings, or a mix of frontier models) requires only configuration changes, not code changes.
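A minimal `call_openrouter` helper against OpenRouter's OpenAI-compatible chat endpoint might look like the sketch below. The endpoint URL and payload shape follow OpenRouter's published API; error handling and retries are deliberately elided.

```python
import os
import aiohttp

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

async def call_openrouter(session: aiohttp.ClientSession,
                          model: str, prompt: str) -> str:
    """Send one chat completion request and return the assistant's text."""
    headers = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    async with session.post(OPENROUTER_URL, headers=headers, json=payload) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["choices"][0]["message"]["content"]
```

Because every model in the catalog goes through this one function, swapping council members really is just a change to the `models` list.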
Gotcha
The cost structure gets expensive fast. If you run a five-model council, you’re making five initial API calls, then five peer review calls (each receiving multiple responses as context), then one synthesis call (receiving everything). That’s 11 API calls per question, many with large context windows. With frontier models costing $10-15 per million tokens, a complex question could easily cost $0.50-$2.00. This isn’t a tool you’d use for casual queries or integrate into a high-traffic application.
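The back-of-envelope arithmetic is easy to make concrete. The token counts below are illustrative assumptions, not measurements, but they reproduce the call structure described above: n initial calls, n review calls that each re-read every response, and one synthesis call that re-reads everything.

```python
def estimate_cost(n_models: int = 5, resp_tokens: int = 1000,
                  price_per_m: float = 12.0) -> float:
    """Rough total cost in dollars for one council deliberation."""
    initial = n_models * resp_tokens
    # each reviewer reads all n responses and writes one review
    review = n_models * (n_models * resp_tokens + resp_tokens)
    # chairman reads all responses + all reviews, writes one answer
    synthesis = 2 * n_models * resp_tokens + resp_tokens
    total_tokens = initial + review + synthesis
    return total_tokens / 1_000_000 * price_per_m
```

With these assumptions a five-model council burns ~46k tokens per question, around $0.55 at $12/M, and the review stage scales quadratically in the council size.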
The peer review anonymization, while clever, might not be robust against models recognizing their own writing styles. If Claude consistently structures responses with numbered lists and GPT-4 tends toward flowing prose, models could potentially identify authorship despite the anonymization. Karpathy acknowledges this limitation—the implementation prioritizes simplicity over cryptographic guarantees. For some use cases this matters; for others it’s fine if models have hints about authorship.
Most importantly, this is explicitly a reference implementation, not a supported product. The README warns: “This is a weekend project provided as-is for inspiration. Don’t expect updates, bug fixes, or responses to issues.” If you’re not comfortable reading and modifying Python/React code yourself, you’ll struggle when something breaks or when you need features that don’t exist. There’s no authentication system, no rate limiting, no error recovery beyond basic try/except blocks. It’s vibe coding—functional enough to be useful, minimal enough to understand, but definitely not production-grade infrastructure.
Verdict
Use if: You’re tackling research questions, architectural decisions, or creative work where cross-validation genuinely adds value and you’re comfortable with the API costs. You should be a developer willing to fork and customize the code, because that’s the entire point—this is a starter kit, not a SaaS product. It’s perfect for consultants billing clients for high-quality analysis, researchers exploring complex technical questions, or teams making expensive technology decisions where spending $2 on LLM consensus beats spending $2000 on the wrong choice.

Skip if: You need production reliability, user authentication, or cost efficiency. Skip if your questions are routine enough that a single model suffices—most queries don’t benefit from deliberation. Skip if you’re not prepared to maintain and extend the code yourself, because community support will be minimal. For casual multi-model comparison without customization, just use Poe.com’s multi-bot chat feature instead.