LLM Council: Building Consensus Through Multi-Agent Deliberation
Hook
What if the best way to get a reliable answer from an LLM isn’t to pick the smartest model, but to make several models argue with each other first?
Context
Single-model interactions have a fundamental weakness: you’re betting everything on one system’s interpretation of your question. Ask GPT-4 about a nuanced technical decision and you get one perspective. Ask Claude, and you might get a completely different—equally confident—answer. For trivial queries, this doesn’t matter. But for high-stakes questions where you’d normally consult multiple experts, relying on a single model feels uncomfortably like asking one person and calling it research.
The obvious solution is to query multiple models manually, but that’s tedious and doesn’t scale. You end up with a browser full of tabs, manually synthesizing responses in your head. LLM Council, a weekend project from Andrej Karpathy, automates this multi-expert consultation pattern through a three-stage deliberation process: parallel querying, anonymous peer review, and synthesis. It’s not trying to be a production platform—it’s a reference implementation that demonstrates how orchestrating model disagreement can surface better answers than any single model provides alone.
Technical Insight
The architecture mirrors how human expert panels actually work. When you ask LLM Council a question, it fans out your query to multiple models simultaneously via OpenRouter’s unified API. Each model generates its response in isolation—no peeking at others’ answers. This parallel execution is handled by FastAPI’s async capabilities:
import asyncio
import aiohttp

async def get_council_opinions(query: str, models: list[str]) -> list[str]:
    """Fan the query out to every council model in parallel."""
    async with aiohttp.ClientSession() as session:
        tasks = [
            call_openrouter(session, model, query)
            for model in models
        ]
        return await asyncio.gather(*tasks)
The interesting part comes next: peer review. Each model receives all the other responses (anonymized, labeled as Response A, B, C, etc.) and ranks them. The prompt for this stage is carefully designed to force critical evaluation rather than politeness. Models must explain their rankings, which creates an audit trail of what criteria they considered important. This isn’t just voting—it’s argumentation.
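The anonymization step is simple to sketch. The helper below is an illustration, not the repo's actual code: the letter-labeling scheme and function name are assumptions, though they match the "Response A, B, C" convention the article describes.

```python
import string

def anonymize(responses: list[str]) -> tuple[str, dict[str, str]]:
    """Label responses A, B, C, ... and strip model attribution."""
    label_map = {}  # letter -> original response, for de-anonymizing later
    blocks = []
    for letter, text in zip(string.ascii_uppercase, responses):
        label_map[letter] = text
        blocks.append(f"Response {letter}:\n{text}")
    return "\n\n".join(blocks), label_map
```

The returned `label_map` lets the orchestrator translate rankings like "Best: B" back to the model that actually wrote response B.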
The anonymization is crucial. If models could identify their own responses or recognize writing patterns from specific models, they might exhibit favoritism or tribal behavior. By stripping attribution, you force evaluation based on content quality alone. The ranking prompt looks something like this:
PEER_REVIEW_PROMPT = """
You are evaluating responses to this question: {original_query}
Here are the responses to evaluate:
{anonymized_responses}
Rank these responses from best to worst. For each ranking, explain:
1. Factual accuracy and completeness
2. Clarity and structure
3. Potential weaknesses or gaps
Provide your ranking as: Best: [letter], Second: [letter], etc.
"""
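Because the prompt demands a fixed "Best: [letter], Second: [letter]" format, extracting the ranking can be a one-line regex. This parser is a sketch under that assumption; the actual repo may parse reviews differently.

```python
import re

def parse_ranking(review_text: str) -> list[str]:
    """Extract ranked letters from 'Best: [A], Second: [B], ...' style output."""
    return re.findall(
        r"(?:Best|Second|Third|Fourth|Fifth):\s*\[?([A-Z])\]?", review_text
    )
```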
The final stage hands everything to a “Chairman” model—typically the most capable model in your council—to synthesize the original responses and peer reviews into a single answer. This synthesis isn’t just averaging; the Chairman sees which responses were highly rated and why, which criticisms were raised, and where models disagreed. The output acknowledges uncertainty when models diverged significantly.
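The Chairman stage amounts to one more prompt that bundles everything together. The prompt wording below is an assumption for illustration, and `synthesize` assumes the same `call_openrouter` helper used in the fan-out code earlier:

```python
CHAIRMAN_PROMPT = """You are the Chairman of an LLM council.
Original question: {query}

Council responses:
{responses}

Peer reviews and rankings:
{reviews}

Synthesize a single final answer. Weight highly-ranked responses more
heavily, address criticisms raised in review, and explicitly note
where the models disagreed."""

async def synthesize(session, chairman_model: str, query: str,
                     responses: str, reviews: str) -> str:
    """Ask the Chairman model for the final, synthesized answer."""
    prompt = CHAIRMAN_PROMPT.format(
        query=query, responses=responses, reviews=reviews
    )
    return await call_openrouter(session, chairman_model, prompt)
```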
The frontend implementation uses React with a tab-based UI that exposes the entire deliberation process. You can inspect each model’s original response, see how every model ranked the others, and read the final synthesis—all before deciding whether to trust the output. This transparency is pedagogically valuable; you learn which models excel at which types of reasoning by watching them critique each other.
Data persistence is deliberately simple: conversations serialize to JSON files in a local directory. No database, no cloud storage, no user accounts. This weekend-project minimalism means you can audit exactly what’s being stored and where. For developers who want to extend the system—maybe adding RAG capabilities or custom scoring algorithms—the codebase is small enough to understand in an afternoon.
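The whole persistence layer fits in a few lines. The filename scheme and record shape here are assumptions, but they capture the JSON-files-in-a-directory approach:

```python
import json
import time
from pathlib import Path

DATA_DIR = Path("conversations")

def save_conversation(record: dict) -> Path:
    """Write one deliberation (query, responses, reviews, synthesis) to a JSON file."""
    DATA_DIR.mkdir(exist_ok=True)
    path = DATA_DIR / f"{int(time.time() * 1000)}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

Auditing what's stored is then just `cat conversations/*.json`.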
The OpenRouter integration is the architectural choice that makes this practical. Instead of managing API keys for Anthropic, OpenAI, Google, and xAI separately, you configure one OpenRouter key and select from their entire model catalog. This abstraction layer means experimenting with different council compositions (maybe three Claudes with different temperature settings, or a mix of frontier models) requires only configuration changes, not code changes.
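A minimal `call_openrouter` helper against OpenRouter's OpenAI-compatible chat endpoint might look like the sketch below. The endpoint URL and payload shape follow OpenRouter's published API; error handling and retries are deliberately elided.

```python
import os
import aiohttp

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

async def call_openrouter(session: aiohttp.ClientSession,
                          model: str, prompt: str) -> str:
    """Send one chat completion request and return the assistant's text."""
    headers = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    async with session.post(OPENROUTER_URL, headers=headers, json=payload) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["choices"][0]["message"]["content"]
```

Because every model in the catalog goes through this one function, swapping council members really is just a change to the `models` list.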
Gotcha
The cost structure gets expensive fast. If you run a five-model council, you’re making five initial API calls, then five peer review calls (each receiving multiple responses as context), then one synthesis call (receiving everything). That’s 11 API calls per question, many with large context windows. With frontier models costing $10-15 per million tokens, a complex question could easily cost $0.50-$2.00. This isn’t a tool you’d use for casual queries or integrate into a high-traffic application.
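The back-of-envelope arithmetic is easy to make concrete. The token counts below are illustrative assumptions, not measurements, but they reproduce the call structure described above: n initial calls, n review calls that each re-read every response, and one synthesis call that re-reads everything.

```python
def estimate_cost(n_models: int = 5, resp_tokens: int = 1000,
                  price_per_m: float = 12.0) -> float:
    """Rough total cost in dollars for one council deliberation."""
    initial = n_models * resp_tokens
    # each reviewer reads all n responses and writes one review
    review = n_models * (n_models * resp_tokens + resp_tokens)
    # chairman reads all responses + all reviews, writes one answer
    synthesis = 2 * n_models * resp_tokens + resp_tokens
    total_tokens = initial + review + synthesis
    return total_tokens / 1_000_000 * price_per_m
```

With these assumptions a five-model council burns ~46k tokens per question, around $0.55 at $12/M, and the review stage scales quadratically in the council size.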
The peer review anonymization, while clever, might not be robust against models recognizing their own writing styles. If Claude consistently structures responses with numbered lists and GPT-4 tends toward flowing prose, models could potentially identify authorship despite the anonymization. Karpathy acknowledges this limitation—the implementation prioritizes simplicity over cryptographic guarantees. For some use cases this matters; for others it’s fine if models have hints about authorship.
Most importantly, this is explicitly a reference implementation, not a supported product. The README warns: “This is a weekend project provided as-is for inspiration. Don’t expect updates, bug fixes, or responses to issues.” If you’re not comfortable reading and modifying Python/React code yourself, you’ll struggle when something breaks or when you need features that don’t exist. There’s no authentication system, no rate limiting, no error recovery beyond basic try/except blocks. It’s vibe coding—functional enough to be useful, minimal enough to understand, but definitely not production-grade infrastructure.
Verdict
Use if: You’re tackling research questions, architectural decisions, or creative work where cross-validation genuinely adds value and you’re comfortable with the API costs. You should be a developer willing to fork and customize the code, because that’s the entire point—this is a starter kit, not a SaaS product. It’s perfect for consultants billing clients for high-quality analysis, researchers exploring complex technical questions, or teams making expensive technology decisions where spending $2 on LLM consensus beats spending $2000 on the wrong choice.

Skip if: You need production reliability, user authentication, or cost efficiency. Skip if your questions are routine enough that a single model suffices—most queries don’t benefit from deliberation. Skip if you’re not prepared to maintain and extend the code yourself, because community support will be minimal. For casual multi-model comparison without customization, just use Poe.com’s multi-bot chat feature instead.