Berry: Detecting AI Hallucinations in Code with Token Probability Analysis
Hook
Your AI coding assistant just confidently generated a function that looks perfect but silently breaks edge cases. What if you could measure its uncertainty token-by-token before running the code?
Context
AI coding assistants have become indispensable for many developers, but they share a dangerous trait: they hallucinate with confidence. A model might generate a plausible-looking API call to a method that doesn't exist, reach for command-line flags that were deprecated three versions ago, or assert an algorithmic complexity that is flat-out wrong. The problem isn't just the factual errors; it's that these tools present hallucinated content in the same confident tone as verified information.
Traditional approaches to hallucination detection focus on semantic analysis or fact-checking against knowledge bases, but these methods are either too slow for interactive coding workflows or miss subtle fabrications in domain-specific code. Berry (hallbayes) takes a fundamentally different approach: it analyzes the token-level log probabilities that language models generate internally. When a model is uncertain about what comes next, the probability distribution across possible tokens flattens. By exposing these uncertainty signals, Berry transforms AI assistance from a black box into a measurable, auditable process. It runs as a Model Context Protocol (MCP) server, integrating directly into tools like Cursor, Claude Code, and Gemini CLI to provide real-time verification without disrupting your workflow.
Technical Insight
Berry's architecture centers on MCP, which lets it act as a verification layer between your IDE and the underlying language models. Instead of building yet another IDE extension, it implements a server that any MCP-compatible client can query. This design choice means you configure it once and it works across multiple tools.
The core insight is simple: language models compute a probability distribution over their vocabulary for every token they generate. High-entropy distributions (probability spread across many tokens) signal uncertainty, while low-entropy distributions (probability concentrated on one token) suggest confidence. Berry retrieves these log probabilities (logprobs) through the model API and applies statistical thresholds to flag suspicious outputs. Here's what a typical integration looks like in your Cursor configuration:
{
  "mcpServers": {
    "berry": {
      "command": "python",
      "args": ["-m", "hallbayes.server"],
      "env": {
        "OPENAI_API_KEY": "your-key",
        "BERRY_BACKEND": "openai",
        "BERRY_MODEL": "gpt-4"
      }
    }
  }
}
Once running, Berry exposes its tools over MCP. The primary tool is detect_hallucination, which accepts generated text and returns uncertainty scores. The implementation uses a sliding-window approach across tokens, computing perplexity metrics and flagging regions where the model's confidence drops. For example, when generating a Python function that uses a third-party library, Berry might flag the exact parameter names as high-uncertainty while marking the function structure as confident.
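To make the sliding-window idea concrete, here is a minimal sketch of windowed perplexity over a list of token log probabilities. It is not Berry's actual code: the function names, the window size of 4, and the perplexity cutoff of 2.0 are all assumptions chosen for illustration.

```python
import math


def windowed_perplexity(logprobs, window=4):
    """Perplexity of each contiguous window of token log probabilities."""
    if window <= 0:
        raise ValueError("window must be positive")
    scores = []
    for start in range(max(len(logprobs) - window + 1, 0)):
        chunk = logprobs[start:start + window]
        # Perplexity is exp of the negative mean log probability.
        scores.append(math.exp(-sum(chunk) / len(chunk)))
    return scores


def flag_windows(logprobs, window=4, max_perplexity=2.0):
    """Start indices of windows whose perplexity exceeds the cutoff (hypothetical value)."""
    scores = windowed_perplexity(logprobs, window)
    return [i for i, ppl in enumerate(scores) if ppl > max_perplexity]


# Four confident tokens (p = 0.9) followed by four uncertain ones (p = 0.2).
example_logprobs = [math.log(0.9)] * 4 + [math.log(0.2)] * 4
flagged = flag_windows(example_logprobs)
```

In this toy input, only the windows that overlap the low-probability tail by at least two tokens cross the cutoff, which is the kind of localized flagging the paragraph above describes.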
The evidence notebook system is where Berry's workflow transformation becomes apparent. Instead of just flagging uncertain outputs, it maintains a trace of verification decisions across "runs" and "spans." A run might represent a debugging session, while spans track individual verification events—code generation, hallucination detection, human review. This creates an audit trail that's invaluable for understanding which AI suggestions you trusted and why. The audit_trace_budget tool lets you query this history, effectively building institutional knowledge about your AI assistance patterns.
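One way to picture the run/span structure is as a small nested record. The sketch below is purely illustrative: the class names, fields, and span kinds are hypothetical and do not reflect Berry's actual storage schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    kind: str      # hypothetical labels: "generation", "detection", "human_review"
    summary: str
    trusted: bool  # whether the suggestion was accepted at this step


@dataclass
class Run:
    name: str
    spans: List[Span] = field(default_factory=list)

    def record(self, kind: str, summary: str, trusted: bool) -> Span:
        """Append one verification event to this run's trace."""
        span = Span(kind, summary, trusted)
        self.spans.append(span)
        return span

    def untrusted(self) -> List[Span]:
        """Events where the suggestion was not trusted, for later review."""
        return [s for s in self.spans if not s.trusted]


run = Run("debug-session")
run.record("generation", "generated retry helper", trusted=True)
run.record("detection", "parameter names flagged as uncertain", trusted=False)
run.record("human_review", "checked parameters against the docs", trusted=True)
```

Querying `run.untrusted()` here returns the single flagged detection event, which is the sort of question an audit trail exists to answer.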
Berry supports multiple backends (OpenAI, Gemini, Vertex AI) through a unified interface, but the implementation reveals an important constraint: it needs raw logprobs from the model API. This is why Anthropic models are excluded: their OpenAI-compatible endpoint doesn't expose probability data. When you call detect_hallucination, Berry makes a secondary API call with logprobs=True, then analyzes the returned probability distributions:
# Conceptual example of Berry's detection logic
import math


def analyze_uncertainty(logprobs_response, threshold=0.3):
    """Return the tokens whose probability falls below the threshold."""
    uncertainty_spans = []
    for i, token_data in enumerate(logprobs_response['content']):
        # Log probability of the token the model actually emitted
        chosen_logprob = token_data['logprob']
        # Convert to probability (logprob is log(p))
        confidence = math.exp(chosen_logprob)
        if confidence < threshold:  # Threshold for uncertainty
            uncertainty_spans.append({
                'token_index': i,
                'token': token_data['token'],
                'confidence': confidence,
                'alternatives': token_data.get('top_logprobs', [])
            })
    return uncertainty_spans
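The conceptual example above thresholds the probability of a single emitted token. The entropy signal described earlier can be approximated from the alternatives list instead. The sketch below assumes each alternative is a dict with a "logprob" key, as in OpenAI-style responses. Because the API returns only the top few candidates, the result is a truncated estimate over renormalized probabilities, not the true entropy over the full vocabulary.

```python
import math


def top_k_entropy(top_logprobs):
    """Shannon entropy (in nats) of the renormalized top-k alternatives."""
    probs = [math.exp(alt["logprob"]) for alt in top_logprobs]
    total = sum(probs)
    if total == 0:
        return 0.0
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical alternatives for one token position.
peaked = [
    {"token": "get", "logprob": math.log(0.97)},
    {"token": "fetch", "logprob": math.log(0.02)},
    {"token": "load", "logprob": math.log(0.01)},
]
flat = [
    {"token": "get", "logprob": math.log(1 / 3)},
    {"token": "fetch", "logprob": math.log(1 / 3)},
    {"token": "load", "logprob": math.log(1 / 3)},
]

peaked_entropy = top_k_entropy(peaked)  # low: mass concentrated on one token
flat_entropy = top_k_entropy(flat)      # high: mass spread evenly
```

A flat distribution over three candidates reaches the maximum possible entropy for three outcomes, ln(3), while the peaked one stays far below it.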
The "classic" toolpack bundles hallucination detection with standard MCP capabilities like code execution and file operations, making Berry a one-stop verification server. This is crucial because hallucination detection in isolation isn't useful—you need to act on uncertain outputs, whether that's running tests, checking documentation, or escalating to human review. The integration means your AI assistant can detect uncertainty, propose verification steps, and execute them within the same context.
One subtle architectural choice is the repo-scoped execution model. Berry runs locally with access to your codebase, which raises security questions about AI-generated commands. MCP includes permission boundaries (tools must be explicitly enabled), but Berry trusts that your local environment is already secured. This is the right tradeoff for a verification tool: it needs full context to assess hallucinations effectively, and restricting it to sandboxed environments would undermine that capability.
Gotcha
The logprob dependency is Berry's Achilles' heel. If you're using Claude (Anthropic's models) as your primary coding assistant, Berry simply won't work because Anthropic's API doesn't expose token probabilities. This isn't Berry's fault, since the limitation sits on the provider side, but it's a dealbreaker for a large segment of AI-assisted developers. Even with supported models, you're making an additional API call for every verification, which roughly doubles latency and token costs. For rapid-fire completions where you're accepting dozens of suggestions per minute, this overhead becomes prohibitive.
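A back-of-the-envelope estimate shows why the extra call hurts at high suggestion rates. The numbers below (30 suggestions per minute, one second per API call) are made-up illustrative values, not measurements of Berry or of any provider, and the helper function is hypothetical.

```python
def waiting_seconds_per_minute(suggestions_per_minute,
                               generation_latency_s,
                               verification_latency_s=0.0):
    """Seconds per minute spent waiting on API calls, assuming calls run sequentially."""
    return suggestions_per_minute * (generation_latency_s + verification_latency_s)


# Illustrative assumptions only.
baseline = waiting_seconds_per_minute(30, 1.0)       # generation only
verified = waiting_seconds_per_minute(30, 1.0, 1.0)  # generation plus verification
```

Under these assumptions, verification pushes waiting time from half of each minute to all of it, so sustained rapid-fire use is impractical unless verification is applied selectively.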
The token-level analysis also has blind spots. A model can hallucinate with high confidence—imagine it invents a plausible function name that's consistent with a library's naming conventions but doesn't actually exist. The tokens will have high probability because they fit the pattern, but the semantic content is fabricated. Berry catches uncertainty, not incorrectness. You still need domain knowledge and testing to catch confident hallucinations. The evidence notebook helps here by creating a feedback loop, but it's manual labor to review and annotate traces.
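Because confident hallucinations slip past probability thresholds, a cheap complementary check is to ask the installed environment whether a referenced name exists at all. The helper below is a hypothetical illustration using only the Python standard library; it is not part of Berry, and it only proves that a name resolves, not that it is used correctly.

```python
import importlib


def attribute_exists(module_name, dotted_attr):
    """Return True if module_name.dotted_attr resolves in the current environment."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    for part in dotted_attr.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


real_call = attribute_exists("json", "dumps")            # a real function
invented_call = attribute_exists("json", "dump_pretty")  # plausible name, does not exist
```

A name like `dump_pretty` fits the library's conventions well enough to earn high token probabilities, yet a single lookup exposes it as fabricated.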
Documentation is sparse beyond the workflow playbooks shown in the README. There's no API reference for customizing detection thresholds, no guide for interpreting uncertainty scores in different contexts (is 0.3 confidence bad for API names but acceptable for variable names?), and limited examples of integrating Berry with testing frameworks. For a tool that's fundamentally about building trust in AI outputs, the lack of transparent documentation about its own detection mechanics is ironic. You'll need to read the source code to understand what it's actually measuring.
Verdict
Use if: You're doing high-stakes AI-assisted development where hallucinations are costly: infrastructure code, security-sensitive logic, or research implementations where subtle bugs waste hours. The quantitative uncertainty metrics give you objective criteria for when to trust AI suggestions versus when to verify manually, and the evidence notebook builds institutional knowledge about AI assistance patterns in your codebase. It's especially valuable if you're using OpenAI or Gemini models and already running MCP-compatible tools like Cursor.
Skip if: You rely on Anthropic's Claude models, work on low-stakes projects where occasional AI errors are easily caught in testing, or need real-time completions without verification latency. Berry adds meaningful overhead (API costs, response time, cognitive load of reviewing uncertainty scores) that only pays off when hallucinations are expensive. For exploratory coding or boilerplate generation, the native IDE features are simpler and sufficient.