Berry: Hallucination Detection as a First-Class Development Primitive
Hook
What if we treated AI hallucinations in generated code the same way we treat memory leaks—as measurable bugs with dedicated tooling, not just unfortunate accidents to catch in code review?
Context
AI coding assistants have evolved from autocomplete curiosities to essential development tools. GitHub Copilot, Cursor, and Claude Code now generate substantial portions of production codebases. But there’s a fundamental problem the industry hasn’t solved: these tools hallucinate. They confidently suggest APIs that don’t exist, reference documentation that’s wrong, and synthesize plausible-looking code patterns that fail at runtime.
Traditional quality gates—linting, type checking, testing—catch some of these errors, but they’re reactive. By the time your test suite flags that an AI-generated function calls a non-existent method, you’ve already merged the suggestion, built context around it, and possibly propagated the pattern elsewhere. Berry approaches this differently: it treats hallucination detection as a first-class development workflow primitive, giving you tools to verify AI suggestions before they pollute your codebase. Built on the Model Context Protocol (MCP), it integrates with major AI coding assistants to provide real-time verification, audit trails, and evidence notebooks for tracking exactly what your AI assistant proposed and how it was validated.
Technical Insight
Berry’s architecture centers on three core components: an MCP server that exposes verification tools, a Strawberry-based hallucination detection service, and an evidence notebook system for audit trails. The MCP integration is the clever bit—rather than building yet another IDE plugin or proprietary assistant, Berry operates as a protocol-level service that any MCP-compatible tool can consume.
The installation model reflects this universality. Berry injects configuration into your AI coding environment without requiring fork maintenance:
# Install Berry globally
pipx install hallbayes
# Initialize in a repository
berry init
# This creates .berry/config.json with MCP server settings
# and optionally updates ~/.cursor/mcp.json or equivalent
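The docs don't publish the generated file's schema, so the following is a guess at what `.berry/config.json` might contain, modeled on the common `mcpServers` convention used by Cursor-style `mcp.json` files. Every field name here is an assumption, not Berry's documented format:

```json
{
  "mcpServers": {
    "berry": {
      "command": "berry",
      "args": ["serve", "--repo", "."],
      "env": { "STRAWBERRY_API_KEY": "${STRAWBERRY_API_KEY}" }
    }
  }
}
```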
Once configured, Berry exposes tools through MCP that your AI assistant can invoke. The primary verification workflow looks like this:
# In your AI assistant's context, Berry tools become available:
# Tool: detect_hallucination
# Checks if a code suggestion references non-existent APIs
result = detect_hallucination(
    code_snippet="requests.get(url, verify_ssl=True)",
    context={"library": "requests", "version": "2.31.0"}
)
# Returns:
# {
#   "hallucinated": true,
#   "issue": "Parameter 'verify_ssl' does not exist. Did you mean 'verify'?",
#   "confidence": 0.94,
#   "evidence_span_id": "span_7a3f9b"
# }
The evidence notebook system is where Berry differentiates itself from simple API validation. Every verification creates a span—a structured record of what was checked, when, and with what result. These spans become part of your repository’s verification history:
berry audit --since="2024-01-01"
# Shows all verification events, hallucination detections,
# and which suggestions were accepted vs. rejected
This creates a feedback loop. Over time, you build a corpus of your AI assistant’s accuracy patterns for your specific codebase. Did it consistently hallucinate pandas API changes between versions? That’s captured. Does it reliably suggest correct patterns for your internal libraries? That’s documented too.
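Since spans are stored as JSON records, aggregating them into per-library accuracy stats is straightforward. This sketch assumes a span shape with `library` and `hallucinated` fields; those names are my invention, not Berry's documented schema:

```python
from collections import defaultdict

def hallucination_rate_by_library(spans):
    """Fold verification spans into a per-library hallucination rate."""
    totals = defaultdict(lambda: [0, 0])  # library -> [flagged, total]
    for span in spans:
        lib = span.get("library", "unknown")
        totals[lib][1] += 1
        if span.get("hallucinated"):
            totals[lib][0] += 1
    return {lib: flagged / total for lib, (flagged, total) in totals.items()}

spans = [
    {"library": "pandas", "hallucinated": True},
    {"library": "pandas", "hallucinated": False},
    {"library": "internal.billing", "hallucinated": False},
]
print(hallucination_rate_by_library(spans))
# → {'pandas': 0.5, 'internal.billing': 0.0}
```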
The Strawberry verification service (Berry’s external dependency) performs the actual hallucination detection through a combination of static analysis, API schema validation, and LLM-based semantic checking. When your AI assistant suggests code, Berry can intercept the suggestion and run it through Strawberry before you see it. The workflow playbooks Berry ships with show concrete patterns:
# .berry/workflows/strict-verification.yaml
triggers:
  - on: "code_suggestion"
    when: "file_matches('src/critical/**/*.py')"
    actions:
      - verify:
          tool: "detect_hallucination"
          block_on_failure: true
          confidence_threshold: 0.85
      - record:
          to: "evidence_notebook"
          include_context: true
This policy-as-code approach means you can enforce strict verification for production code while allowing more permissive flows for prototyping. The repo-scoped configuration model keeps these policies versioned alongside your code.
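The evaluation semantics of such a policy can be approximated in a few lines. This Python sketch uses `fnmatch` for path matching and a simplified single-glob pattern; the field names mirror the playbook above, but the decision logic is a guess at Berry's behavior, not its actual code:

```python
from fnmatch import fnmatch

def should_block(policy, file_path, confidence, hallucinated):
    """Decide whether a suggestion is blocked under a strict policy."""
    if not fnmatch(file_path, policy["pattern"]):
        return False  # policy doesn't apply to this file
    if not policy.get("block_on_failure"):
        return False
    # Block only when the detector is confident enough in the flag
    return hallucinated and confidence >= policy["confidence_threshold"]

policy = {
    "pattern": "src/critical/*.py",
    "block_on_failure": True,
    "confidence_threshold": 0.85,
}
print(should_block(policy, "src/critical/payments.py", 0.94, True))  # → True
print(should_block(policy, "src/prototypes/demo.py", 0.94, True))    # → False
```

Note that a confident-but-wrong detection (say, 0.5 confidence) passes through unblocked, which is exactly the permissive-by-default behavior the threshold is meant to encode.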
The MCP architecture also enables composability. Berry’s tools can chain with other MCP servers—you might combine hallucination detection with automated test generation, or integrate evidence spans with your observability platform. Because it’s protocol-based rather than tightly coupled to a specific IDE, the same verification workflow works whether your team uses Cursor, Claude Code, or builds custom tooling on MCP.
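The composition idea can be sketched as a pipeline of tools that each take and return a context dict. This is purely illustrative: real MCP servers exchange JSON-RPC messages over a transport, and the `verify`/`record` stubs here are toy stand-ins, not Berry's tools:

```python
def chain(*tools):
    """Compose tool functions left to right over a shared context dict."""
    def pipeline(ctx):
        for tool in tools:
            ctx = tool(ctx)
        return ctx
    return pipeline

def verify(ctx):
    # Toy check standing in for a real hallucination detector
    ctx["verified"] = "verify_ssl" not in ctx["code"]
    return ctx

def record(ctx):
    # Append an evidence span, as the notebook component might
    ctx.setdefault("spans", []).append({"verified": ctx["verified"]})
    return ctx

run = chain(verify, record)
result = run({"code": "requests.get(url, timeout=5)"})
print(result["verified"], len(result["spans"]))
# → True 1
```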
Gotcha
Berry’s biggest limitation is its dependency on the external Strawberry verification service. This isn’t just an API key inconvenience—it’s a fundamental architectural constraint. Every verification request leaves your local environment, which means latency, cost per verification, and potential privacy concerns if you’re working with proprietary code. The docs don’t clearly specify Strawberry’s pricing model or rate limits, and there’s no documented fallback for offline development. If Strawberry’s service degrades or shuts down, Berry’s core value proposition evaporates.
The Python-only installation (pipx/pip) creates friction for polyglot teams. While Berry theoretically works with any MCP-compatible AI assistant regardless of language, actually getting it installed requires Python tooling. If you’re a Go shop using Cursor, you’re asking developers to maintain a Python environment solely for Berry. The evidence notebook also stores verification results as JSON files in your repository, which could bloat repo size over time—there’s no documented cleanup or retention policy. Finally, the documentation structure suggests complexity: multiple workflow playbooks, configuration files in different locations (.berry/config.json, global MCP settings), and concepts like ‘trace budget auditing’ that aren’t immediately intuitive. Teams will need to invest time understanding the verification model before it becomes effective, which may be a hard sell if hallucinations haven’t yet caused production incidents.
Verdict
Use Berry if you’re working on production systems where AI-generated code needs systematic verification before merge, especially in regulated industries or critical infrastructure where audit trails matter. It’s valuable for teams already experiencing hallucination problems that slip through traditional testing, and the evidence notebook system provides exactly the kind of reproducible verification workflow that mature engineering organizations need. The MCP integration means you’re not locked into a specific IDE, and the policy-as-code model scales across repositories. Skip Berry if you’re in early prototyping phases where code churn makes verification overhead counterproductive, working in non-Python environments where the installation friction outweighs benefits, or your team hasn’t yet felt pain from AI hallucinations (which means traditional code review is catching them). The external API dependency and setup complexity make it overkill for individual developers or small teams who can manually verify AI suggestions faster than configuring workflow policies.