
PaperQA2: The Agentic RAG System That Beat Humans at Scientific Literature Analysis

Hook

A RAG system recently achieved superhuman performance on scientific question answering, summarization, and contradiction detection. It’s open source, and it runs on your machine.

Context

Traditional RAG systems treat all documents equally—a fatal flaw when working with scientific literature. A peer-reviewed paper in a high-impact journal like Nature carries more epistemic weight than an unreviewed preprint, and a retracted study shouldn't inform your answer at all. Yet most RAG implementations blindly chunk PDFs and stuff nearest-neighbor results into context windows, ignoring the rich metadata that scientists instinctively use to evaluate sources.

PaperQA2 was built by Future House to solve this problem. Their 2024 research paper demonstrated superhuman performance on scientific tasks, not through larger models or more documents, but through a fundamentally different architecture: one that treats scientific documents as structured knowledge artifacts with metadata, citations, and provenance. The system automatically enriches papers with citation counts, journal quality metrics, and retraction checks from Semantic Scholar, Crossref, and Unpaywall—then uses that metadata throughout the retrieval and generation pipeline.

Technical Insight

System architecture (auto-generated diagram). Ingestion path: PDF documents → document ingestion and parsing → metadata enrichment via scholarly APIs (Semantic Scholar/Crossref; citation counts and retraction checks) → chunking and embedding (OpenAI/LiteLLM) → numpy-based vector store. Query path: user query → agent loop (LLM decision engine) → hybrid search (vector + Tantivy) → LLM re-ranking → context assembly → answer with citations, with the agent iterating and refining search queries before synthesizing.

PaperQA2’s architecture centers on metadata-aware embeddings and a multi-stage retrieval pipeline that mirrors how scientists actually read papers. When you ingest a PDF, the system doesn’t just chunk text. It reaches out to multiple scholarly APIs simultaneously, pulls citation counts and journal metadata, checks for retractions, then embeds chunks alongside this context. A sentence from a highly-cited Nature paper gets semantically different treatment than identical text from an obscure preprint.
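To make the idea concrete, here is a minimal sketch in plain Python (not PaperQA2's API; every class and function name here is illustrative) of how provenance can be folded into what gets embedded: prepend source metadata to each chunk's text, so identical sentences from different sources produce different vectors, and drop retracted sources entirely.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    journal: str
    citation_count: int
    retracted: bool

def embeddable_text(chunk: Chunk) -> str:
    """Prepend provenance so identical sentences from different
    sources embed differently; retracted sources are excluded."""
    if chunk.retracted:
        raise ValueError("retracted sources should not inform answers")
    return f"[{chunk.journal}; cited {chunk.citation_count} times] {chunk.text}"

nature = Chunk("Tau tangles correlate with cognitive decline.", "Nature", 2400, False)
preprint = Chunk("Tau tangles correlate with cognitive decline.", "bioRxiv preprint", 3, False)

print(embeddable_text(nature))
print(embeddable_text(preprint))
```

The same chunk text now lands at different points in embedding space depending on where it was published, which is the effect the paragraph above describes.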

The simplest workflow looks deceptively minimal:

from paperqa import Settings, ask

# Agentic mode: the LLM decides what to search and read.
# ask() is synchronous and returns an AnswerResponse.
answer_response = ask(
    "What is the role of tau protein in Alzheimer's?",
    settings=Settings(paper_directory="./papers"),
)

print(answer_response.session.answer)   # Full response with in-text citations
print(answer_response.session.context)  # Source excerpts used

Under the hood, this triggers an agent loop. The LLM formulates search queries, decides which papers to read based on titles and abstracts, requests specific sections, and iteratively refines its understanding. It’s not a single retrieval-then-generate pass—it’s a conversation between the agent and the document index.
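The shape of that loop can be sketched in a few lines of plain Python with stubbed-out tools. None of these names are PaperQA2 internals; this is just the iterate/refine/synthesize cycle in miniature:

```python
def agent_answer(question, search, read, decide, synthesize, max_steps=5):
    """Generic agentic retrieval loop: search, read, refine, answer."""
    evidence = []
    query = question
    for _ in range(max_steps):
        hits = search(query)                 # formulate a search query
        evidence += [read(h) for h in hits]  # read promising documents
        action, query = decide(question, evidence)  # refine or stop
        if action == "answer":
            break
    return synthesize(question, evidence)

# Toy tools standing in for the real index and LLM
corpus = {"tau": "Tau protein forms tangles in Alzheimer's."}
answer = agent_answer(
    "What is tau's role?",
    search=lambda q: [k for k in corpus if k in q.lower()],
    read=lambda k: corpus[k],
    decide=lambda q, ev: ("answer", q) if ev else ("search", q),
    synthesize=lambda q, ev: " ".join(ev),
)
print(answer)  # → "Tau protein forms tangles in Alzheimer's."
```

In PaperQA2, each of the stubbed callables is an LLM call or an index query, which is why a single question can fan out into many model invocations.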

For more control, you can bypass the agent and orchestrate retrieval manually:

from paperqa import Docs, Settings

docs = Docs()
await docs.aadd("path/to/paper.pdf")  # Async PDF ingestion

# Manual retrieval with custom settings
settings = Settings(
    llm="gpt-4o-mini",
    summary_llm="gpt-4o",  # Stronger model for the contextual summarization step
    embedding="text-embedding-3-small",
    answer={  # Nested answer settings
        "evidence_k": 15,         # Retrieve 15 chunks
        "answer_max_sources": 5,  # Cite at most 5 sources
    },
)

response = await docs.aquery(
    "How do neurons die in Alzheimer's?",
    settings=settings,
)

The retrieval pipeline implements what the authors call RCS: re-ranking and contextual summarization. After initial vector search retrieves candidates, an LLM re-ranks them based on relevance to the specific question. Then, instead of dumping raw chunks into the final prompt, PaperQA2 generates concise summaries of each chunk in the context of the query. This compression step is critical—it lets you retrieve more documents (k=15 or higher) without blowing past context windows, while preserving the signal.
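The RCS shape, score-then-compress, can be sketched with stub functions standing in for the two LLM calls (hypothetical helpers, not PaperQA2's implementation):

```python
def rcs(question, candidates, rerank_score, summarize, top_n=3):
    """Re-ranking and contextual summarization: score each candidate
    chunk against the question, keep the best, compress the survivors."""
    ranked = sorted(candidates, key=lambda c: rerank_score(question, c), reverse=True)
    return [summarize(question, c) for c in ranked[:top_n]]

# Stub scorer (word overlap) and stub summarizer (truncation) stand in for LLM calls
score = lambda q, c: len(set(q.lower().split()) & set(c.lower().split()))
summarize = lambda q, c: c[:60]

chunks = [
    "Amyloid plaques accumulate between neurons.",
    "Tau protein stabilizes microtubules in healthy neurons.",
    "Unrelated methods section about centrifuge settings.",
]
context = rcs("How does tau protein affect neurons?", chunks, score, summarize, top_n=2)
print(context[0])  # the tau chunk ranks first
```

The compression step is what lets `top_n` (or `k` in retrieval) grow without the assembled context growing linearly with raw chunk size.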

PaperQA2 uses tantivy for full-text search, which runs locally and indexes the complete parsed content of your PDFs. This hybrid approach—vector search for semantic similarity, full-text for exact phrase matching—catches both conceptual connections and specific terminology. You can also persist a populated Docs object so later sessions skip re-parsing and re-embedding:

import pickle
from pathlib import Path

from paperqa import Docs

docs = Docs()
for pdf in Path("./my_papers").glob("*.pdf"):
    await docs.aadd(pdf)

# Save the embedded corpus to disk
with open("my_docs.pkl", "wb") as f:
    pickle.dump(docs, f)

# Later, reload without re-embedding
with open("my_docs.pkl", "rb") as f:
    docs2 = pickle.load(f)
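The hybrid-scoring idea can be illustrated without either library by blending a semantic similarity score with a lexical match score. The toy vectors and the alpha weighting below are assumptions for illustration only; Tantivy actually uses BM25, and PaperQA2 uses real embedding models:

```python
import math

def keyword_score(query, doc):
    """Crude full-text signal: fraction of query terms appearing verbatim."""
    terms = query.lower().split()
    return sum(t in doc.lower() for t in terms) / len(terms)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def hybrid_score(query, doc, q_vec, d_vec, alpha=0.5):
    """Blend semantic similarity with exact-term matching; alpha sets the trade-off."""
    return alpha * cosine(q_vec, d_vec) + (1 - alpha) * keyword_score(query, doc)

# Toy 3-dimensional "embeddings" for illustration only
s = hybrid_score("tau tangles", "Tau tangles appear in the cortex.", [1, 0, 1], [1, 0, 0.8])
```

A document that matches both semantically and verbatim scores near 1.0, while a purely paraphrased or purely keyword-stuffed document is penalized on one of the two terms.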

Multimodal support is available if you use a vision-capable model. The system also handles source code, Microsoft Office documents, and plain text—useful when your “literature” includes software documentation or technical reports. The embedding layer abstracts over LiteLLM, so you can swap in any model from OpenAI, Anthropic, Cohere, or locally-hosted options like Ollama without changing code.

Gotcha

PaperQA2’s reliance on external APIs is both a strength and a liability. Every document you add triggers requests to Semantic Scholar, Crossref, and Unpaywall for metadata enrichment. If you’re ingesting hundreds of papers, you’ll hit rate limits or need API keys. The automatic metadata fetching is powerful until you encounter papers that aren’t indexed in these databases (common for very recent preprints or non-English literature), where you lose that enrichment.
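If you do hit rate limits during bulk ingestion, wrapping your own metadata lookups in a retry-with-backoff helper is a reasonable mitigation. This generic sketch is not part of PaperQA2, which manages its own API clients:

```python
import time

def with_backoff(fn, retries=4, base_delay=0.5, sleep=time.sleep):
    """Retry fn() with exponential backoff, e.g. around a Crossref
    or Semantic Scholar metadata request that returned 429."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Simulated lookup that fails twice, then succeeds
calls = {"n": 0}
def flaky_lookup():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("429 Too Many Requests")
    return {"citation_count": 2400}

meta = with_backoff(flaky_lookup, sleep=lambda _: None)  # no real sleeping in the demo
```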

Cost is the other consideration. The agentic mode is powerful but can incur significant API costs. Each agent loop iteration calls your LLM multiple times: to decide next actions, to re-rank chunks, to summarize context, and finally to generate the answer. On a complex question spanning multiple papers, token usage can add up quickly. The README recommends using cheaper models like GPT-4o-mini for intermediate steps and reserving GPT-4 for final synthesis, but even this hybrid approach requires budgeting for API costs if you’re building a user-facing application.
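A back-of-envelope budget helps before pointing the agent at a large corpus: each iteration pays for an action decision plus per-chunk re-ranking and summarization, and synthesis is paid once at the end. Every number in this sketch (calls per iteration, tokens per chunk) is a hypothetical placeholder, not a measured PaperQA2 cost:

```python
def estimate_tokens(iterations, chunks_per_iter, tokens_per_chunk,
                    decision_tokens=500, synthesis_tokens=2000):
    """Back-of-envelope token count for one agentic query."""
    per_iteration = decision_tokens + chunks_per_iter * tokens_per_chunk
    return iterations * per_iteration + synthesis_tokens

total = estimate_tokens(iterations=4, chunks_per_iter=15, tokens_per_chunk=400)
print(total)  # 4 * (500 + 15 * 400) + 2000 = 28000
```

The linear scaling in `iterations` is why routing intermediate steps to a cheaper model pays off quickly.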

In December 2025, the project switched from semantic versioning to calendar versioning: earlier releases used semantic versions (e.g., v5), while newer releases use date-based versions (e.g., 2025.12.x). The README explains that version 5 onward is termed “PaperQA2” to mark the achievement of superhuman performance on key metrics, while earlier versions are retrospectively called “PaperQA1.”

Verdict

Use PaperQA2 if you’re building research tools where accuracy and attribution matter more than speed or cost. It excels at synthesis tasks—literature reviews, contradiction detection, evidence gathering for grant proposals—where you need verifiable citations and can’t afford hallucinations. The metadata-aware retrieval genuinely improves quality on scientific corpora in ways generic RAG frameworks don’t match, and the agentic mode shines for exploratory research where you don’t know exactly what you’re looking for.

Skip it if you’re working with non-scientific documents (the scholarly metadata enrichment becomes less useful), need real-time query performance (the multi-stage pipeline is thorough but not optimized for speed), have strict API cost constraints (agentic loops can consume significant tokens), or want true offline operation (the metadata APIs are integral to the system). For those cases, LlamaIndex or a simpler vector search setup will serve you better.

But if you’re a researcher, a research engineer, or building tools for scientists, PaperQA2 is the best open-source option available.
