
PaperQA2: How an Agentic RAG System Achieves Superhuman Performance on Scientific Literature

Hook

A RAG system just beat human experts at answering questions from scientific papers. The secret wasn't just better embeddings—it was teaching the LLM to think like a researcher conducting a literature review.

Context

Scientific literature is a uniquely challenging domain for RAG systems. Unlike corporate documents or general web content, academic papers are dense with technical jargon, mathematical notation, contradictory findings across studies, and citation networks that matter as much as the content itself. A claim from a highly-cited Nature paper carries different weight than one from an unreviewed preprint, and knowing whether a paper has been retracted is critical.

Traditional RAG implementations fail spectacularly in this environment. Chunk a physics paper into 512-token segments and you'll split equations from their context. Use semantic search alone and you'll miss the nuanced differences between similar methodologies. Generate an answer without tracking which specific passage supported each claim and you've created an unverifiable hallucination machine. PaperQA2, developed by FutureHouse and validated in a 2024 research paper, was built specifically to solve these problems. According to the team's own benchmarks, it now exceeds human performance on scientific question answering, summarization, and contradiction detection tasks.

Technical Insight

PaperQA2's architecture is a masterclass in task-specific RAG optimization. At its core, it combines three search strategies: full-text search via tantivy (a Rust-based engine), semantic search using OpenAI embeddings with metadata awareness, and LLM-based re-ranking with what they call Retrieval-augmented Contextual Summarization (RCS). This hybrid approach means a query like "What methods were used to measure protein folding?" hits both the exact term "protein folding" in full-text and semantically similar concepts like "structural dynamics."
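
To make the hybrid idea concrete, here is a minimal sketch of merging a keyword ranking with a semantic ranking. It is not PaperQA2's code: the fusion step uses reciprocal rank fusion, a generic technique swapped in for illustration, and the corpus, the 2-d vectors, and every function name are assumptions.

```python
import math
from collections import defaultdict


def keyword_rank(query: str, docs: dict[str, str]) -> list[str]:
    """Rank document ids by how often the query terms occur in each text."""
    terms = query.lower().split()
    counts = {
        doc_id: sum(text.lower().split().count(term) for term in terms)
        for doc_id, text in docs.items()
    }
    return sorted(counts, key=counts.get, reverse=True)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_rank(query_vec: list[float], doc_vecs: dict[str, list[float]]) -> list[str]:
    """Rank document ids by cosine similarity to the query vector."""
    sims = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(sims, key=sims.get, reverse=True)


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several rankings; a document ranked high in any list floats up."""
    fused: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + position)
    return sorted(fused, key=fused.get, reverse=True)


# Toy corpus with made-up 2-d "embeddings" (illustrative values only)
docs = {
    "a": "protein folding measured by circular dichroism",
    "b": "structural dynamics probed with NMR relaxation",
    "c": "market trends in consumer electronics",
}
doc_vecs = {"a": [0.7, 0.7], "b": [0.9, 0.2], "c": [0.0, 1.0]}

fused = reciprocal_rank_fusion([
    keyword_rank("protein folding", docs),
    semantic_rank([1.0, 0.2], doc_vecs),
])
```

In this toy run, document "a" wins on exact terms and "b" wins on vector similarity, so both outrank the unrelated "c" after fusion. That is the behavior the protein-folding example describes.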

The metadata-aware embeddings are particularly clever. When indexing documents, PaperQA2 doesn't just embed the text—it enriches each chunk with citation counts from Semantic Scholar, publication venue, author information, and retraction status from multiple providers (Crossref, Unpaywall). This metadata becomes part of the search context, allowing the system to weight heavily-cited findings appropriately and flag retracted research.
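
One plausible way to picture metadata-aware embedding is to prepend a short metadata header to each chunk before it reaches the embedder. The sketch below assumes that approach; the `PaperMetadata` fields and the header layout are hypothetical and are not taken from PaperQA2's source.

```python
from dataclasses import dataclass


@dataclass
class PaperMetadata:
    """Hypothetical per-paper metadata gathered from external providers."""
    title: str
    venue: str
    year: int
    citation_count: int
    is_retracted: bool = False


def enrich_chunk(chunk_text: str, meta: PaperMetadata) -> str:
    """Prefix a chunk with a metadata header so the embedder sees both."""
    header = (
        f"Title: {meta.title}\n"
        f"Venue: {meta.venue} ({meta.year})\n"
        f"Citations: {meta.citation_count}\n"
        f"Retracted: {'yes' if meta.is_retracted else 'no'}"
    )
    return f"{header}\n\n{chunk_text}"


meta = PaperMetadata(
    title="An example study",  # placeholder values, not a real paper
    venue="Example Journal",
    year=2020,
    citation_count=120,
    is_retracted=True,
)
enriched = enrich_chunk("We measured folding rates at 25 C.", meta)
```

The enriched string, rather than the bare chunk, would then be passed to whichever embedding model you configure, so retraction status and citation counts travel with the text.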

Here's what a basic workflow looks like in code:

from paperqa import Settings, ask

# Configure with your preferred LLM backend
settings = Settings(
    llm="gpt-4o-mini",
    summary_llm="gpt-4o-mini",
    embedding="text-embedding-3-small",
    paper_directory="my_papers",  # Local folder of papers to index
)

# ask() is synchronous and runs the agent loop itself, so no await is needed
response = ask(
    "What are the main mechanisms of mRNA vaccine efficacy?",
    settings=settings,
)

# Attribute names follow recent releases and may differ in older versions
print(response.session.answer)  # Answer with inline citations
print(response.session.references)  # Full citation details

Under the hood, this simple call triggers a sophisticated pipeline. If you're in agentic mode (the default), an LLM agent iteratively decides what to search for, evaluates whether retrieved passages are sufficient, and may reformulate queries multiple times before synthesizing an answer. This is fundamentally different from one-shot RAG systems—it mimics how a human researcher would approach a literature review.
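
The search, evaluate, reformulate cycle can be sketched as a plain loop. This is a minimal illustration with stubbed tools, not PaperQA2's agent: `run_agent`, the toy corpus, and all four callables are hypothetical stand-ins for LLM-driven decisions.

```python
from typing import Callable

# Toy corpus keyed by lower-cased query (made-up passages)
CORPUS = {
    "mrna vaccine": ["mRNA vaccines deliver instructions for the spike protein."],
    "mrna vaccine immune response": [
        "Spike protein expression triggers neutralizing antibodies."
    ],
}


def run_agent(
    question: str,
    search: Callable[[str], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    refine: Callable[[str, list[str]], str],
    answer: Callable[[str, list[str]], str],
    max_steps: int = 5,
) -> tuple[str, int]:
    """Search, check the evidence, and reformulate until it is good enough."""
    evidence: list[str] = []
    query = question
    steps_used = 0
    for _ in range(max_steps):
        steps_used += 1
        evidence.extend(search(query))
        if is_sufficient(question, evidence):
            break
        query = refine(question, evidence)  # try a sharper query next round
    return answer(question, evidence), steps_used


# Stub tools; in a real system each of these would be an LLM or index call
def toy_search(query: str) -> list[str]:
    return CORPUS.get(query.lower(), [])


def toy_is_sufficient(question: str, evidence: list[str]) -> bool:
    return len(evidence) >= 2


def toy_refine(question: str, evidence: list[str]) -> str:
    return "mRNA vaccine immune response"


def toy_answer(question: str, evidence: list[str]) -> str:
    return " ".join(evidence)


result, steps = run_agent(
    "mRNA vaccine", toy_search, toy_is_sufficient, toy_refine, toy_answer
)
```

Here the first search returns a single passage, the sufficiency check fails, and the refined query supplies the second passage before an answer is assembled. The `max_steps` cap matters in practice because every extra round costs tokens.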

The RCS component is where things get interesting. After initial retrieval, PaperQA2 doesn't just dump context into the LLM. Instead, it passes each retrieved passage through a summarization step that's aware of the original query. For a question about vaccine mechanisms, a 10-page methods section gets compressed into a targeted 200-word summary highlighting only the relevant immunological pathways. This dramatically reduces context window waste while maintaining the information density needed for accurate answers.
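
A rough sketch of the summarize-then-rank step follows. It assumes the summarizer returns a relevance score alongside each summary; `stub_summarize` is a keyword-overlap stand-in for the LLM call, and all names here are hypothetical rather than PaperQA2's API.

```python
from typing import Callable, NamedTuple


class ScoredSummary(NamedTuple):
    source: str
    summary: str
    score: int


def contextual_summaries(
    question: str,
    passages: dict[str, str],
    summarize: Callable[[str, str], tuple[str, int]],
    top_n: int = 2,
) -> list[ScoredSummary]:
    """Summarize each passage against the question, then keep the best few."""
    scored = [
        ScoredSummary(source, *summarize(question, text))
        for source, text in passages.items()
    ]
    relevant = [s for s in scored if s.score > 0]  # drop irrelevant passages
    return sorted(relevant, key=lambda s: s.score, reverse=True)[:top_n]


def stub_summarize(question: str, passage: str) -> tuple[str, int]:
    """Stand-in for an LLM: keep sentences sharing a word with the question."""
    q_terms = set(question.lower().split())
    kept = [
        sentence.strip()
        for sentence in passage.split(".")
        if q_terms & set(sentence.lower().split())
    ]
    return ". ".join(kept), len(kept)


passages = {
    "p1": "Lipid nanoparticles deliver mRNA. The study was funded by a grant.",
    "p2": "Weather was mild that year.",
}
top = contextual_summaries("how is mRNA delivered", passages, stub_summarize)
```

Only the query-relevant sentence from "p1" survives, and "p2" is dropped entirely. That is the compression effect described above, on a much smaller scale.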

For production deployments, you'll likely want more control over the document indexing pipeline:

import asyncio
from pathlib import Path

from paperqa import Docs, Settings
from paperqa.settings import AnswerSettings


async def main() -> None:
    docs = Docs()

    # Add papers from various sources
    await docs.aadd_url(
        "https://www.nature.com/articles/s41586-020-123",
        citation="Smith et al. Nature 2020",
    )
    await docs.aadd(
        Path("local_papers/important_study.pdf"),
        citation="Jones et al. Science 2021",
    )

    # Gather evidence with specific retrieval limits
    session = await docs.aget_evidence(
        query="What contradictions exist in the literature about X?",
        settings=Settings(
            answer=AnswerSettings(
                evidence_k=15,  # Retrieve top 15 passages
                answer_max_sources=5,  # Use max 5 sources in final answer
            )
        ),
    )
    print(len(session.contexts))  # Number of evidence passages gathered


asyncio.run(main())

The returned session holds a structured breakdown of which passages from which papers were judged relevant to the question, so every claim in a later answer can be traced back to a specific passage in a specific paper. This citation granularity is essential for scientific applications where verifiability isn't optional.

PaperQA2's tool integration for agentic mode is built on a plugin architecture. The agent has access to tools for searching papers, gathering evidence, and generating an answer, and it can invoke them in any order. Watching the agent work is instructive: it often searches multiple times with progressively refined queries, just as a human would. The system uses LiteLLM under the hood, meaning you can swap OpenAI for Anthropic, Gemini, or local models by changing only the settings:

from paperqa import Settings

settings = Settings(
    llm="claude-3-5-sonnet-20241022",
    summary_llm="gpt-4o-mini",  # Mix and match models
    temperature=0.1,  # Low temp for factual accuracy
)

The vector database abstraction is minimal but effective. By default it uses Numpy for small collections, but swapping to Postgres with pgvector or Qdrant for production scale is a configuration change, not a rewrite.

Gotcha

PaperQA2's Achilles' heel is its scientific literature specialization. The system's entire pipeline—from metadata enrichment to citation extraction—assumes you're working with academic papers. Point it at legal contracts, technical manuals, or internal corporate docs and you'll get suboptimal results because the metadata providers (Semantic Scholar, Crossref) won't find anything, and the citation extraction logic expects academic reference formats.

API costs are the other elephant in the room. The default configuration hits OpenAI for embeddings, summarization, and answer generation. On a corpus of 100 papers with 10 questions, expect to burn through thousands of tokens per query when operating in agentic mode—the agent might search 5 times, summarize 20 passages, and generate multiple answer candidates. At GPT-4 pricing, this gets expensive fast for high-volume applications. You can mitigate this by using cheaper models for summarization or switching to local embeddings, but there's an accuracy trade-off the maintainers haven't fully benchmarked with non-OpenAI models.
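
A back-of-the-envelope estimate helps when budgeting. The sketch below is plain arithmetic: the call counts echo the example above, while the token counts per call and the price per million tokens are placeholder assumptions, not measured figures or real price quotes.

```python
def estimate_query_cost(
    n_summaries: int,
    tokens_per_summary_call: int,
    n_answer_calls: int,
    tokens_per_answer_call: int,
    price_per_million_tokens: float,
) -> tuple[int, float]:
    """Return (total tokens, cost) for one agentic query."""
    total_tokens = (
        n_summaries * tokens_per_summary_call
        + n_answer_calls * tokens_per_answer_call
    )
    cost = total_tokens * price_per_million_tokens / 1_000_000
    return total_tokens, cost


# Placeholder inputs: 20 summarized passages and 2 answer attempts per query
tokens, cost = estimate_query_cost(
    n_summaries=20,
    tokens_per_summary_call=3_000,  # assumed prompt + completion size
    n_answer_calls=2,
    tokens_per_answer_call=6_000,  # assumed prompt + completion size
    price_per_million_tokens=10.0,  # assumed blended price, check your provider
)
```

Plug in your provider's current rates and your own measured token counts. Because the summarization calls dominate the total in this setup, moving `summary_llm` to a cheaper model is usually the first lever to try.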

The versioning situation is genuinely confusing. The repo recently switched from semantic versioning to CalVer (2025.1.10 format), and the distinction between "PaperQA" and "PaperQA2" isn't always clear in documentation. The maintainers acknowledge this complexity. If you're integrating this into production systems, pin your versions carefully and test upgrades thoroughly—breaking changes have happened between point releases.

Verdict

Use if: You're building research tools where accuracy and proper citations are non-negotiable, you're working specifically with scientific literature (papers, preprints, academic articles), you have budget for OpenAI API calls or resources to properly tune alternative models, and you need state-of-the-art RAG performance backed by published benchmarks. The agentic mode's iterative refinement genuinely produces better answers than one-shot retrieval, especially for complex multi-paper questions.

Skip if: You're working with general documents outside academia, you need a purely offline/on-premise solution with zero external API dependencies, you're building a simple Q&A bot where basic RAG with LlamaIndex would suffice, or you're cost-sensitive and can't justify the token consumption of agentic workflows. For non-scientific use cases, you'll spend more time fighting the tool's assumptions than benefiting from its specialized features.
