LLM Council: Multi-Model Consensus Through Blind Peer Review
Hook
What if the best way to get a reliable answer from an LLM isn’t to pick the smartest model, but to make them debate each other blindly and vote on the results?
Context
LLM Council emerged from a practical need: Andrej Karpathy wanted to read books alongside multiple LLMs and compare their interpretations side-by-side. Instead of toggling between different LLM interfaces in separate browser tabs, he built a unified interface that queries multiple models simultaneously through OpenRouter. But the project goes beyond simple parallel querying—it implements a three-stage consensus mechanism where models respond independently, anonymously review each other’s work, and then a designated ‘Chairman’ synthesizes the final answer.
The underlying insight is borrowed from ensemble methods in traditional machine learning: diverse models making independent predictions often outperform any single model. LLM Council applies this to conversational AI by combining different frontier models through a single API. The anonymization during peer review is particularly clever: by stripping model identities before the ranking phase, the system keeps models from favoring responses that come from their own provider or from a well-known competitor.
Technical Insight
The architecture is refreshingly straightforward: a Python backend orchestrates LLM interactions through OpenRouter’s unified API, while a React frontend displays responses in a tabbed interface. The entire conversation history persists as JSON files in a local data/conversations/ directory—no database required. This is ‘vibe code’ optimized for rapid prototyping rather than enterprise deployment.
The three-stage pipeline leverages async HTTP requests via Python’s httpx library to parallelize the expensive LLM calls. Stage 1 fans out the user’s query to all council members simultaneously. You configure the roster in backend/config.py with model identifiers from OpenRouter’s catalog:
COUNCIL_MODELS = [
    "openai/gpt-5.1",  # Replace with actual model IDs
    "google/gemini-3-pro-preview",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",
]
CHAIRMAN_MODEL = "google/gemini-3-pro-preview"
Note: The model identifiers shown in the README are examples—you’ll need to replace them with actual model IDs available through OpenRouter.
OpenRouter acts as the abstraction layer, providing a single API endpoint that routes requests to multiple providers. This means you need only one API key and one HTTP client implementation, regardless of how many different model families you’re querying. It’s vendor-agnostic orchestration without the complexity of maintaining separate SDKs for OpenAI, Anthropic, Google, and xAI.
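The Stage 1 fan-out pattern can be sketched in a few lines. This is illustrative, not the project's actual code: the function names are invented, and the network call is stubbed with `asyncio.sleep` so the concurrency shape is visible; the real app would `await client.post(...)` each payload to OpenRouter's OpenAI-compatible chat completions endpoint via httpx.

```python
import asyncio

COUNCIL_MODELS = [
    "openai/gpt-5.1",
    "google/gemini-3-pro-preview",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",
]

def build_payload(model: str, user_query: str) -> dict:
    # One OpenAI-style chat payload per council member; OpenRouter routes
    # the request to the right provider based on the "model" field.
    return {"model": model,
            "messages": [{"role": "user", "content": user_query}]}

async def ask_model(model: str, user_query: str) -> tuple[str, str]:
    payload = build_payload(model, user_query)
    # Stub: a real implementation would POST `payload` with httpx here.
    await asyncio.sleep(0)
    return model, f"<answer from {payload['model']}>"

async def stage1(user_query: str) -> dict[str, str]:
    # Fan the same question out to every council member concurrently.
    results = await asyncio.gather(
        *(ask_model(m, user_query) for m in COUNCIL_MODELS))
    return dict(results)

responses = asyncio.run(stage1("What is entropy?"))
```

The key point is that `asyncio.gather` makes the slowest council member, not the sum of all of them, the latency floor for Stage 1.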
Stage 2 is where the design gets interesting. Each model receives a prompt containing all responses from Stage 1, but with identifying information stripped out. The responses are labeled ‘Response A’, ‘Response B’, and so on, forcing reviewers to evaluate substance rather than source. Each LLM then ranks the responses by accuracy and insight. This blind review mechanism mirrors academic peer review and theoretically reduces bias: a model can’t automatically favor responses from its own provider if it doesn’t know which model produced them.
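The anonymization step amounts to a relabeling plus a server-side key for de-anonymizing the rankings afterward. A minimal sketch, with invented function names (in practice you would likely also shuffle the ordering so ‘Response A’ isn't always the same model):

```python
import string

def anonymize(responses: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    # responses maps model ID -> answer text. Returns the labeled responses
    # shown to reviewers, plus a key (label -> model) kept server-side.
    labeled, key = {}, {}
    for letter, (model, text) in zip(string.ascii_uppercase, responses.items()):
        label = f"Response {letter}"
        labeled[label] = text
        key[label] = model
    return labeled, key

def review_prompt(user_query: str, labeled: dict[str, str]) -> str:
    # The same prompt goes to every reviewer; no model names appear in it.
    body = "\n\n".join(f"{label}:\n{text}" for label, text in labeled.items())
    return (f"Question: {user_query}\n\n{body}\n\n"
            "Rank these responses from best to worst by accuracy and insight.")
```

Only the backend ever sees the `key` mapping; reviewers see labels alone.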
Stage 3 hands everything to the Chairman model, which receives both the original responses and the peer rankings. The Chairman’s job is synthesis: distill the collective wisdom into a coherent answer while accounting for what the other models thought were the strongest points. The user sees the final synthesized response prominently, but can click through tabs to inspect each individual model’s original answer and review comments.
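The Chairman's input is just the concatenation of both earlier stages. A hedged sketch of the prompt assembly (names and wording are assumptions, not the repo's actual prompt):

```python
def chairman_prompt(user_query: str,
                    labeled: dict[str, str],
                    rankings: dict[str, str]) -> str:
    # labeled: {"Response A": answer_text, ...} from Stage 1 (anonymized)
    # rankings: {reviewer_model: ranking_text} from Stage 2
    answers = "\n\n".join(f"{lab}:\n{txt}" for lab, txt in labeled.items())
    reviews = "\n\n".join(f"Reviewer {i + 1}:\n{r}"
                          for i, r in enumerate(rankings.values()))
    return (f"Question: {user_query}\n\n"
            f"Candidate answers:\n{answers}\n\n"
            f"Peer rankings:\n{reviews}\n\n"
            "Synthesize the single best answer, weighing the rankings "
            "and keeping the points the reviewers agreed were strongest.")
```

One synthesis call with this prompt replaces the n parallel calls of the earlier stages.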
The frontend uses react-markdown for rendering, which means models can return formatted text with headers, lists, and code blocks that display properly. Conversation state lives in React components and syncs to the backend’s JSON storage after each interaction. It’s stateful enough to maintain conversation history but simple enough that you can inspect or manually edit the JSON files if needed.
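The file-per-conversation persistence described above is simple enough to sketch directly. Function names are illustrative; the real app writes under data/conversations/, while this demo uses a temp directory:

```python
import json
import tempfile
from pathlib import Path

def save_conversation(root: Path, conv_id: str, messages: list[dict]) -> Path:
    # One human-readable JSON file per conversation, so state can be
    # inspected or hand-edited with any text editor.
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{conv_id}.json"
    path.write_text(json.dumps({"id": conv_id, "messages": messages}, indent=2))
    return path

def load_conversation(root: Path, conv_id: str) -> list[dict]:
    return json.loads((root / f"{conv_id}.json").read_text())["messages"]

# Demo against a temp dir; the real app would use data/conversations/.
root = Path(tempfile.mkdtemp())
save_conversation(root, "demo", [{"role": "user", "content": "hi"}])
history = load_conversation(root, "demo")
```

The trade-off is the obvious one: no concurrent writers, no queries, but zero operational overhead for a single-user tool.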
Because this is explicitly a weekend hack, there’s no authentication, no rate limiting, no error recovery beyond basic try/except blocks. The .env file holds your OpenRouter API key in plain text. The startup process is either a convenience script (./start.sh) or manually launching the backend and Vite dev server in separate terminals. It’s the kind of architecture you’d never ship to production but works perfectly for personal tooling and rapid experimentation.
Gotcha
The README opens with a giant disclaimer: this is ‘99% vibe coded’ with no support intentions. Karpathy explicitly states he won’t improve or maintain it. That’s honest, but it means you’re inheriting code that was optimized for speed-to-first-working-demo rather than robustness. Expect sharp edges. Error handling is minimal. If OpenRouter is down or you hit rate limits, the app will likely break ungracefully.
More critically, the token economics can be expensive. A single query hits every council member for initial responses, then hits them all again for peer reviews, then hits the Chairman for synthesis. If you have four council members (as shown in the example config), that’s nine LLM calls per user question (four initial + four reviews + one synthesis). At frontier-model pricing, those calls add up quickly. This isn’t necessarily a tool for casual browsing—it’s designed for questions where you genuinely want multiple expert perspectives.
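The call-count arithmetic generalizes: for an n-member council, each question costs n initial answers, n peer reviews, and one chairman synthesis.

```python
def calls_per_question(n_members: int) -> int:
    # Stage 1: n initial answers; Stage 2: n peer reviews;
    # Stage 3: 1 chairman synthesis.
    return 2 * n_members + 1

# Four members, as in the example config: 2*4 + 1 = 9 calls.
```

Note that cost grows faster than linearly in practice, since each Stage 2 review prompt also carries all n Stage 1 responses as input tokens.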
There’s also zero empirical validation that the council approach actually produces better answers than just using a single capable model. The peer review stage assumes models can accurately judge each other’s outputs, but we know LLMs can struggle with self-evaluation and can be confidently wrong. The anonymization is clever but doesn’t eliminate all bias—models might still recognize their own writing style or favor responses that align with their training. The Chairman synthesis could amplify consensus errors rather than correcting them. Without A/B testing or human evaluation benchmarks, the quality improvement remains theoretical.
Verdict
Use LLM Council if you’re researching multi-model collaboration patterns, need to compare frontier LLMs side-by-side with structured cross-evaluation, or have specific questions where consulting multiple models simultaneously adds value. It’s excellent for AI researchers exploring ensemble methods in generative models, developers who want a working reference implementation of OpenRouter integration, or power users willing to fork and customize the code for their own workflows. The tabbed interface alone is valuable for anyone who regularly compares model outputs manually.
Skip it if you need production-ready infrastructure with support and maintenance, want a polished user experience with comprehensive error handling and edge case coverage, or expect the tool to work reliably without reading and potentially modifying the source code. Also skip if you’re looking for proven quality improvements—this is an experiment, not a validated technique.
The real value isn’t the deployed app, it’s the implementation pattern: a working example of async multi-LLM orchestration, blind peer review mechanics, and OpenRouter-based provider abstraction that you can adapt for your own use cases. At 16,434 stars, it’s clearly resonated with developers looking for exactly this kind of reference implementation.