SelfCheckGPT: Catching LLM Hallucinations by Making Models Contradict Themselves

Hook

The best lie detector for an LLM isn't a fact-checking database or a human reviewer. It's asking the same model to tell the story again and watching it change the details.

Context

Large language models have become remarkably fluent, but their confidence doesn't correlate with accuracy. GPT-4 will assertively tell you that the capital of Brazil changed to Brasília in 1963 (it was 1960) with the same certainty it uses for correct facts. Traditional hallucination detection requires expensive external knowledge bases, human-annotated datasets, or white-box access to model internals—resources that most developers don't have when integrating third-party LLM APIs.

SelfCheckGPT, published at EMNLP 2023, offers an elegant alternative based on a key observation: hallucinations are inconsistent. When an LLM invents facts, the stochastic sampling process produces different fabrications across multiple generations. Real facts, anchored in training data, appear consistently. This zero-resource approach requires nothing but the ability to sample from the model multiple times—no knowledge graphs, no labeled datasets, no access to internal probabilities. For developers building fact-sensitive applications on top of commercial LLM APIs, this represents a practical path to reliability without infrastructure overhead.

Technical Insight

The core architecture is deceptively simple: generate a response to your prompt, then sample 3-5 additional responses using non-zero temperature. Each sentence in the original response gets scored against these samples using consistency metrics. High consistency means likely factual; low consistency signals potential hallucination.

SelfCheckGPT provides five scoring variants with different computational profiles. The BERTScore variant measures semantic similarity between each sentence and the most similar sentence in each sample. The N-gram approach fits a simple n-gram model on the samples and scores each sentence by the probability of its tokens under that model. The MQAG (Multi-Question Answering and Generation) variant generates questions from each sentence, answers them using the samples, then compares answers for consistency. The NLI (Natural Language Inference) variant, recommended by the authors, uses an entailment model to detect contradictions. Finally, the LLM-Prompt variant uses another LLM call to judge whether each sample supports the sentence.
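
To make the n-gram idea concrete, here is a simplified unigram sketch. It is an illustration written for this article, not the library's implementation: the regex tokenizer, the add-one smoothing, and the function name `unigram_inconsistency` are all assumptions.

```python
import math
import re
from collections import Counter
from typing import List


def unigram_inconsistency(sentence: str, samples: List[str]) -> float:
    """Average negative log-probability of the sentence's tokens under a
    unigram model fitted on the sampled passages (add-one smoothing).
    Higher means the samples offer less support for the sentence."""
    def tokenize(text: str) -> List[str]:
        return re.findall(r"\w+", text.lower())

    counts = Counter(tok for sample in samples for tok in tokenize(sample))
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    nll = [-math.log((counts[tok] + 1) / (total + vocab)) for tok in tokens]
    return sum(nll) / len(nll)


# Hypothetical sampled passages, for illustration only
samples = [
    "Curie was born in Warsaw in 1867.",
    "Marie Curie was born in Warsaw, Poland.",
]
supported = unigram_inconsistency("Curie was born in Warsaw.", samples)
unsupported = unigram_inconsistency("Curie was born in Vienna.", samples)
print(f"supported={supported:.3f} unsupported={unsupported:.3f}")
```

The sentence containing a token the samples never produced ("Vienna") gets the higher score, which is the whole signal this variant relies on.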

Here's how to implement basic hallucination detection with the NLI scorer:

import re

import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

# Initialize with a natural language inference model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device)

# Your original LLM response (e.g., from GPT-4)
response = """Marie Curie won the Nobel Prize in Physics in 1903. 
She later won a second Nobel Prize in Chemistry in 1911. 
She was born in Warsaw in 1867 and discovered radium in her Paris laboratory."""

# Generate multiple samples (typically 3-5) with temperature > 0
samples = [
    "Marie Curie received the Nobel Prize in Physics in 1903 alongside her husband Pierre. She won another Nobel in Chemistry in 1911. Born in Warsaw, she conducted groundbreaking research on radioactivity in France.",
    "In 1903, Marie Curie became a Nobel laureate in Physics. She earned a second Nobel Prize in Chemistry in 1911. She was born in Warsaw, Poland in 1867.",
    "Marie Curie won her first Nobel Prize in 1903 for Physics. Her second Nobel came in 1911 for Chemistry. She discovered polonium and radium through her research."
]

# Split after sentence-ending punctuation, keeping the punctuation itself.
# This naive regex breaks on abbreviations; use spacy/nltk for real text.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s.strip()]

# Score each sentence against every sampled passage
scores = selfcheck_nli.predict(
    sentences=sentences,
    sampled_passages=samples
)

# Higher scores = more likely hallucination
for sentence, score in zip(sentences, scores):
    flag = "⚠️ SUSPICIOUS" if score > 0.5 else "✓ Likely factual"
    print(f"{flag} [{score:.3f}] {sentence}")

The NLI scorer treats each sampled passage as the premise and the sentence under test as the hypothesis, then checks whether the sample entails, contradicts, or is neutral toward that sentence. The final score aggregates the contradiction signal across all samples. If the original response says "Marie Curie discovered radium in 1902" while the samples say 1898, the NLI model flags a contradiction in each comparison, raising that sentence's hallucination score.
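
The aggregation step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's code: `contradiction_prob` is a hypothetical stand-in for a real NLI model, and `toy_nli` is a deliberately crude stub that only compares four-digit years so the example runs without downloading anything.

```python
import re
from typing import Callable, List


def average_contradiction(
    sentences: List[str],
    samples: List[str],
    contradiction_prob: Callable[[str, str], float],
) -> List[float]:
    """Per-sentence score: mean P(contradiction) over all sampled passages.

    contradiction_prob(premise, hypothesis) is a hypothetical stand-in for
    an NLI model; the sample is the premise, the sentence the hypothesis.
    """
    if not samples:
        raise ValueError("need at least one sampled passage")
    scores = []
    for sentence in sentences:
        probs = [contradiction_prob(sample, sentence) for sample in samples]
        scores.append(sum(probs) / len(probs))
    return scores


def toy_nli(premise: str, hypothesis: str) -> float:
    """Toy stub: reports a contradiction only when both texts mention
    four-digit years and share none. A real NLI model replaces this."""
    years_p = set(re.findall(r"\b\d{4}\b", premise))
    years_h = set(re.findall(r"\b\d{4}\b", hypothesis))
    return 1.0 if years_p and years_h and not (years_p & years_h) else 0.0


sentences = ["Marie Curie discovered radium in 1902.", "She was born in Warsaw."]
samples = [
    "Curie discovered radium in 1898.",
    "Radium was discovered by Curie in 1898.",
    "Curie, born in Warsaw, studied radioactivity.",
]
scores = average_contradiction(sentences, samples, toy_nli)
for sentence, score in zip(sentences, scores):
    print(f"[{score:.3f}] {sentence}")
```

Two of the three samples disagree with the 1902 claim, so that sentence scores well above the uncontested one. A sample that says nothing about the claim contributes no contradiction signal, which is why silence in the samples does not by itself flag a sentence.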

The tradeoff between variants is substantial. BERTScore is cheap to run (embedding comparisons only, no generation step) but misses contradictions that embedding similarity doesn't capture; two sentences that differ only in a date can still look nearly identical. NLI offers the best balance, with accuracy comparable to MQAG at a fraction of the computational cost. MQAG is most thorough but requires question generation and answering for every sentence, making it 5-10x slower. The LLM-Prompt variant seems appealing for its simplicity but introduces dependency on another model's reasoning capabilities and multiplies API costs.

For production use, you'll want to implement caching and batching. Load the NLI model once, generate the samples once per response, then reuse them for every sentence in that response. A helper that scores a list of responses might look like this:

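# Batch processing for efficiency
from typing import Callable, List, Tuple

import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

def detect_hallucinations_batch(
    prompts: List[str],
    responses: List[str],
    generate_samples: Callable[[str, int], List[str]],
    segment_sentences: Callable[[str], List[str]],
    samples_per_response: int = 5,
    threshold: float = 0.5,
) -> List[List[Tuple[str, float, bool]]]:
    """
    Returns: List of (sentence, score, is_hallucination) tuples per response

    generate_samples(prompt, n) and segment_sentences(text) are supplied by
    the caller: wrap your own LLM API and sentence splitter.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    selfcheck = SelfCheckNLI(device=device)  # load the NLI model once
    results = []

    for prompt, response in zip(prompts, responses):
        # Re-sample from the same prompt that produced the response
        samples = generate_samples(prompt, samples_per_response)
        sentences = segment_sentences(response)

        scores = selfcheck.predict(
            sentences=sentences,
            sampled_passages=samples
        )

        # Cast to plain Python types so results serialize cleanly
        annotated = [
            (sent, float(score), bool(score > threshold))
            for sent, score in zip(sentences, scores)
        ]
        results.append(annotated)

    return results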

The sampling strategy matters significantly. Temperature too low (< 0.5) produces nearly identical samples, reducing discriminative power. Temperature too high (> 1.0) makes even factual content inconsistent. A reasonable starting range is 0.7-0.9, though the paper's own experiments sampled at temperature 1.0. Use at least 3 samples; 2 gives too little signal to tell genuine disagreement from noise. Returns tend to diminish beyond 5 for everyday content, but the paper drew 20 samples per response, so budget more when you're dealing with highly technical domains where factual variation is subtle.
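
A small wrapper keeps those sampling choices in one place. This is a sketch under assumptions: `generate` is a hypothetical callable wrapping whatever LLM API you use, taking a prompt and a temperature and returning text, and `fake_llm` is a stub that exists only so the example runs offline.

```python
import random
from typing import Callable, List


def draw_samples(
    generate: Callable[[str, float], str],
    prompt: str,
    n: int = 5,
    temperature: float = 0.8,
) -> List[str]:
    """Call the generator n times with the same prompt and temperature."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if temperature <= 0:
        raise ValueError("temperature must be > 0 so the samples can differ")
    return [generate(prompt, temperature) for _ in range(n)]


def fake_llm(prompt: str, temperature: float) -> str:
    """Stub generator: wavers on a date only at higher temperatures."""
    year = random.choice(["1898", "1902"]) if temperature > 0.5 else "1898"
    return f"{prompt} Radium was discovered in {year}."


random.seed(0)
samples = draw_samples(fake_llm, "Tell me about Marie Curie.", n=5, temperature=0.8)
for sample in samples:
    print(sample)
```

Rejecting a non-positive temperature up front is a design choice: deterministic decoding yields samples that are effectively copies of each other, and the consistency check then has nothing to measure.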

Gotcha

The fundamental assumption—that hallucinations produce inconsistency—breaks down in predictable ways. When models confidently hallucinate the same false information across all samples (mode collapse), SelfCheckGPT scores it as factual. This happens frequently with obscure topics where the model has limited training data but strong priors. Ask about a fictional person who "sounds real" and watch the model consistently invent plausible but identical biographical details across samples.

The cost and latency implications are substantial. Detecting hallucinations in a 10-sentence response with 5 samples and the NLI scorer means making 6 generation calls (1 original + 5 samples) and then running 50 sentence-passage pairs through a BERT-scale model. The LLM-Prompt variant costs more still, because each of those 50 pairs becomes its own judge call on top of the 6 generations. In production, this could translate to seconds of latency and costs that make marginal interactions uneconomical. There's also no universal threshold calibration: a score of 0.5 might work for GPT-4 on biographical content but fail completely for GPT-3.5 on technical documentation. You'll need domain-specific validation datasets to tune thresholds, which somewhat undermines the "zero-resource" promise. The paper provides benchmarks on specific datasets (WikiBio, etc.) but your mileage will vary significantly based on model, domain, and even prompt structure.
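
The call counts are easy to work out ahead of time. The helper below is a back-of-the-envelope sketch, not part of the library; the function name and the assumption of one judge call per sentence-sample pair for an LLM-based scorer are mine.

```python
from typing import Dict


def selfcheck_call_budget(
    num_sentences: int, num_samples: int, llm_judge: bool = False
) -> Dict[str, int]:
    """Count the work needed to check one response.

    generation_calls: the original response plus every extra sample.
    pair_checks: each sentence compared against each sampled passage.
    total_llm_calls: generations, plus one judge call per pair when the
    scorer is itself an LLM (assumed here); a local NLI model adds none.
    """
    generation_calls = 1 + num_samples
    pair_checks = num_sentences * num_samples
    judge_calls = pair_checks if llm_judge else 0
    return {
        "generation_calls": generation_calls,
        "pair_checks": pair_checks,
        "total_llm_calls": generation_calls + judge_calls,
    }


nli_budget = selfcheck_call_budget(num_sentences=10, num_samples=5)
judge_budget = selfcheck_call_budget(num_sentences=10, num_samples=5, llm_judge=True)
print(nli_budget)
print(judge_budget)
```

For the 10-sentence, 5-sample case this gives 6 generation calls and 50 pair checks; with an LLM judge the total rises to 56 LLM calls per checked response.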

Verdict

Use SelfCheckGPT if you're building fact-sensitive applications (content verification, medical/legal summaries, educational tools) where hallucinations carry real consequences and you can absorb the cost of 3-5 extra generations per response. The NLI variant offers production-ready performance for most use cases, especially when you're stuck with API-only access to models and lack the infrastructure for retrieval-augmented approaches. It's particularly valuable during development and testing to identify which prompts or topics produce unreliable outputs. Skip if you need real-time single-pass inference, have access to curated knowledge bases (where RAG is more cost-effective), or your LLM shows mode collapse on your domain. If your model exposes logits, uncertainty measures based on token probabilities are a cheaper first check because they need no extra sampling, although the paper reports SelfCheckGPT outperforming those grey-box baselines on its benchmark. For high-stakes production systems, consider SelfCheckGPT as one signal in an ensemble alongside citation requirements and human review rather than a standalone solution.
