
Pythea: Detecting When Your LLM Shows Its Work But Ignores It


Hook

Your language model just counted the r’s in ‘strawberry’ correctly in its chain-of-thought reasoning, then confidently told you there are two. This isn’t a knowledge problem—it’s a routing failure, and Pythea is designed to catch it.

Context

The hallucination problem in large language models has been framed almost entirely around factual accuracy: the model doesn’t know something, so it makes up an answer. But there’s a more insidious failure mode emerging as models get better at reasoning: procedural hallucination. These are cases where the model generates correct intermediate work—citing the right sources in RAG, showing valid scratch calculations, or producing sound logical steps—but then completely fails to incorporate that information into its final answer. It’s like watching a student solve a math problem correctly on scratch paper, then write down a different answer on the test.

This gap matters enormously for agent workflows, formal verification systems, and RAG pipelines where we’ve built elaborate scaffolding to make models ‘show their work.’ We assumed that if we could get models to generate reasoning traces, they’d naturally use them. Pythea, a multi-component reliability toolkit from leochlon, challenges that assumption with a framework for detecting and mitigating these routing failures. Its core insight: if you remove the evidence a model claims to have used and its confidence doesn’t drop, it never actually used that evidence in the first place.

Technical Insight

[Architecture diagram: Strawberry Core. A user query plus RAG context feeds the EvidenceScrubber. A baseline generation with citations is produced, and its citations and first-token logprobs are extracted; the cited evidence is then scrubbed from the prompt and a second generation runs without that context. The confidence delta between the two runs is compared via JS divergence: below the threshold, a procedural hallucination is flagged; otherwise, the evidence was used properly.]

System architecture — auto-generated

Pythea’s architecture revolves around three interconnected components, but its signature contribution is Strawberry—a procedural hallucination detector built on evidence-scrubbing methodology. The approach is elegantly simple: when a model cites sources or generates intermediate reasoning, Strawberry makes a second API call with those citations removed, then measures the confidence delta between the two responses. If the model claims to have used evidence from document X but produces essentially the same answer with the same confidence when document X is removed, that’s a red flag for citation confabulation.

The implementation works by analyzing first-token logprobs as a proxy for model confidence. Here’s what a basic Strawberry verification flow looks like in practice:

from pythea.strawberry import EvidenceScrubber

# Initialize with your model endpoint
scrubber = EvidenceScrubber(model="thea-mini-reasoning")

# Original prompt with RAG context
full_prompt = """
Context: [Document A discusses Q1 revenue of $2.3M]
Question: What was Q1 revenue?
"""

# Get baseline response with full context
baseline = scrubber.generate(full_prompt, return_logprobs=True)
print(f"Answer: {baseline.text}")
print(f"Confidence: {baseline.first_token_prob}")

# Scrub the cited evidence and regenerate
scrubbed_prompt = scrubber.remove_citations(full_prompt, baseline.citations)
scrubbed = scrubber.generate(scrubbed_prompt, return_logprobs=True)

# Calculate procedural hallucination score
ph_score = scrubber.compute_delta(
    baseline.first_token_prob,
    scrubbed.first_token_prob,
    method="js_divergence"
)

if ph_score < 0.15:  # Threshold for suspected routing failure
    print(f"⚠️ Procedural hallucination detected (δ={ph_score:.3f})")
    print("Model may not be using cited evidence")

The JS divergence calculation here is crucial—it’s measuring the information-theoretic distance between the model’s confidence distributions with and without evidence. A low divergence means the evidence didn’t actually change the model’s internal probability landscape, even if the model explicitly cited it.
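The divergence itself is easy to reproduce outside Pythea. Under the assumption that each first-token probability is treated as a Bernoulli distribution (answer token vs. everything else), the Jensen-Shannon divergence reduces to a few lines; this is a standalone sketch, not Pythea's implementation:

```python
import math

def js_divergence_bernoulli(p, q):
    """Jensen-Shannon divergence between Bernoulli(p) and Bernoulli(q).

    p, q: probability assigned to the first answer token with and
    without the cited evidence in the prompt.
    """
    def kl(a, b):
        # KL divergence between Bernoulli(a) and Bernoulli(b), in nats
        eps = 1e-12  # guard against log(0) at the extremes
        a, b = min(max(a, eps), 1 - eps), min(max(b, eps), 1 - eps)
        return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

    m = 0.5 * (p + q)  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Evidence barely moved the model's confidence: tiny divergence
print(js_divergence_bernoulli(0.91, 0.89))  # ~0.0006 nats: suspicious
# Evidence clearly mattered: large divergence
print(js_divergence_bernoulli(0.95, 0.40))  # ~0.19 nats: evidence was used
```

The asymmetry-free, bounded nature of JS divergence is what makes a single fixed threshold (like the 0.15 in the snippet above) workable across prompts.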

The second component, the Thea API Client, is a lightweight wrapper around a proprietary reasoning API that supports ensemble and mixture-of-models inference. This is where Pythea’s research origins show through—the client is designed to work with a specific reasoning endpoint that isn’t publicly documented in the repo. For teams with their own model infrastructure, this would need to be adapted, but the interface provides useful patterns for confidence-aware inference.
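If you do adapt it to your own stack, the contract the scrubbing flow needs is small: a generation call that returns text plus a first-token probability. Here is a minimal adapter sketch against an OpenAI-style chat-completions response shape (the field names mirror that schema but are assumptions to verify against your provider; none of this is Pythea's API):

```python
import math
from dataclasses import dataclass

@dataclass
class Generation:
    text: str
    first_token_prob: float

def generate_with_confidence(client, model, prompt):
    """Map an OpenAI-style logprobs response onto the (text,
    first_token_prob) pair that evidence-scrubbing needs."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
        top_logprobs=1,
    )
    choice = resp.choices[0]
    first = choice.logprobs.content[0]  # first generated token
    return Generation(
        text=choice.message.content,
        first_token_prob=math.exp(first.logprob),
    )

# Smoke test with a stub standing in for the real SDK client
class _Stub:
    class chat:
        class completions:
            @staticmethod
            def create(**kw):
                from types import SimpleNamespace as NS
                tok = NS(logprob=math.log(0.8))
                return NS(choices=[NS(
                    message=NS(content="$2.3M"),
                    logprobs=NS(content=[tok]),
                )])

gen = generate_with_confidence(_Stub(), "any-model", "What was Q1 revenue?")
print(gen.text, round(gen.first_token_prob, 2))  # $2.3M 0.8
```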

The third piece, Offline QMV Probing, implements a permutation-mixture evaluator that’s genuinely model-agnostic. It works by permuting the order of candidate responses and using first-token logprobs to estimate quality bounds without fine-tuning or ground truth labels. This is particularly useful for domains like formal verification where you have multiple proof attempts but no clear oracle for correctness:

from pythea.qmv import PermutationProbe

# You have 5 different proof attempts for the same theorem
proofs = [proof_1, proof_2, proof_3, proof_4, proof_5]

probe = PermutationProbe(model="your-model", n_permutations=20)

# QMV evaluates by permuting order and measuring first-token consistency
quality_scores = probe.evaluate(
    candidates=proofs,
    eval_prompt="Is this proof valid? Answer: ",
    aggregation="bernoulli_mean"
)

# Higher scores indicate more consistent first-token confidence
best_proof = proofs[quality_scores.argmax()]
print(f"Most reliable proof: {best_proof.id}")
print(f"Quality bound: {quality_scores.max():.3f}")
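The aggregation that snippet requests can be sketched independently of Pythea. Assuming `bernoulli_mean` averages the model's first-token P("Yes") for each candidate across shuffled presentation orders (a plausible reading, not confirmed by the repo), the probe looks like this; `yes_prob` is a hypothetical stand-in for the model call:

```python
import random
import statistics

def permutation_bernoulli_mean(candidates, yes_prob, n_permutations=20, seed=0):
    """Sketch of a permutation-mixture probe.

    yes_prob(candidate, context) -> the model's first-token P("Yes")
    when judging `candidate` after seeing `context` (the candidates
    presented before it, in some order). Averaging over shuffled
    orders washes out position bias in the judgments.
    """
    rng = random.Random(seed)
    scores = {c: [] for c in candidates}
    for _ in range(n_permutations):
        order = candidates[:]
        rng.shuffle(order)
        for i, cand in enumerate(order):
            scores[cand].append(yes_prob(cand, order[:i]))
    return {c: statistics.mean(v) for c, v in scores.items()}

# Toy probe with canned probabilities: "p2" is judged valid most consistently
probs = {"p1": 0.55, "p2": 0.90, "p3": 0.40}
means = permutation_bernoulli_mean(list(probs), lambda c, ctx: probs[c])
best = max(means, key=means.get)
print(best)  # p2
```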

Where Pythea really shines is in its pre-built agent workflows, shipped as Codex skills via the included MCP (Model Context Protocol) server. The proof repair agent is particularly impressive: it attempts formal verification steps, runs Strawberry checks after each micro-step to detect routing failures, and backtracks when procedural hallucinations are detected. This creates a verification-gated reasoning loop that prevents the cascading errors common in multi-step agent workflows. The evidence-first debugging workflow uses similar gating—each diagnostic step is validated to ensure the agent is actually using the debug information it claims to have gathered, rather than confabulating based on priors.
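The gating pattern itself generalizes beyond proof repair. A skeletal version of the loop, with all names hypothetical (`strawberry_check` stands in for the scrub-and-compare call, `attempt_step` for whatever the agent does next):

```python
def verification_gated_loop(attempt_step, strawberry_check,
                            max_backtracks=3, threshold=0.15):
    """Sketch of a verification-gated reasoning loop.

    attempt_step(history) -> (step_text, evidence_cited), or (None, None)
        when the agent signals completion.
    strawberry_check(step_text, evidence_cited) -> confidence delta.
    A delta below `threshold` means the step likely ignored its own
    evidence, so it is discarded and retried rather than built on.
    """
    history, backtracks = [], 0
    while backtracks <= max_backtracks:
        step, evidence = attempt_step(history)
        if step is None:  # agent is done
            return history
        if strawberry_check(step, evidence) < threshold:
            backtracks += 1  # procedural hallucination: drop the step
            continue
        history.append(step)  # verified step: safe to build on
    raise RuntimeError("too many routing failures; aborting")

# Toy run: the second step fakes its citation and is discarded
steps = iter([("cite A; conclude X", "A"), ("cite B; ignore B", "B"),
              ("cite C; conclude Y", "C"), (None, None)])
trace = verification_gated_loop(
    attempt_step=lambda history: next(steps),
    strawberry_check=lambda s, e: 0.05 if "ignore" in s else 0.5,
)
print(trace)  # the "ignore B" step never enters the history
```

The key design choice is that failed steps cost a backtrack budget rather than poisoning the history, which is exactly what stops cascading errors in multi-step runs.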

The architectural decision to use first-token logprobs as the confidence signal is both clever and limiting. It’s lightweight (doesn’t require full sequence generation for verification) and often correlates well with model uncertainty. But it assumes that token-level confidence maps cleanly to reasoning-level confidence, which isn’t always true for complex multi-step outputs. Pythea partially addresses this by supporting Bernoulli probes across permutations, but you’re still ultimately measuring something indirect.

Gotcha

The elephant in the repository is the proprietary Thea Mini Reasoning API dependency. The entire Strawberry module and agent workflows are built around this endpoint, which isn’t publicly available and has no documented pricing, rate limits, or stability guarantees. The codebase includes a citation to a 2026 paper (yes, future-dated), which strongly signals this is research-stage work that may have been developed with internal or academic access to models not yet released. If you’re planning to use Pythea in production, you’ll need to either wait for Thea API access or spend significant engineering effort adapting the evidence-scrubbing methodology to work with OpenAI, Anthropic, or your own model endpoints. The logprob analysis is theoretically model-agnostic, but the implementation is tightly coupled to Thea’s response format.

The agent workflows are similarly constrained to the Codex/Claude ecosystem via the MCP server. While this provides excellent integration if you’re already in that world, adapting these workflows to LangChain, AutoGPT, or custom agent frameworks would require substantial refactoring. The proof repair skills in particular make assumptions about the structure of formal verification environments (Lean/Coq) that may not transfer cleanly to other domains. And while the evidence-scrubbing approach is novel for detecting procedural hallucinations, it doubles your API costs and latency for every verified inference—you’re making two model calls instead of one. For high-throughput production systems, that overhead matters.

Verdict

Use Pythea if: you’re debugging RAG systems where models cite sources but seem to ignore them, building agent workflows where reasoning drift is killing reliability (especially in mathematical or formal verification domains), or researching procedural hallucination detection and want a framework that goes beyond factual accuracy metrics. The evidence-scrubbing methodology is genuinely novel, and the proof repair agents demonstrate that verification-gated reasoning can work.

Skip it if: you need production-ready tooling with clear SLAs and public API access, your hallucination problems are factual rather than procedural (the model lacks knowledge vs. fails to route knowledge), or you’re working outside the Codex/Claude ecosystem and don’t want to rewrite agent integrations.

The research-stage maturity and proprietary dependencies make this an exploration tool, not a drop-in production solution. Consider it a preview of where LLM reliability tooling is headed, but budget for adaptation work if you want to deploy it seriously.
