PageIndex: Why Reasoning-Based RAG Beats Vector Search for Financial Documents
Hook
Vector databases promised to solve RAG, but PageIndex just hit 98.7% accuracy on FinanceBench without using a single embedding. The secret? Trading similarity search for actual reasoning.
Context
Traditional RAG systems have a fundamental flaw: they confuse similarity with relevance. When you ask a vector database to find relevant passages about “revenue recognition policy changes,” it returns chunks with similar word patterns—not necessarily the sections that answer your question through logical reasoning. This breaks down catastrophically with professional documents like financial reports, where understanding context requires following chains of reasoning across multiple sections. A query about revenue might need you to first check the accounting policies section, then cross-reference specific line items in the income statement, then validate against footnote disclosures. Vector similarity can’t replicate this reasoning process.
PageIndex takes a radically different approach inspired by AlphaGo’s tree search algorithms. Instead of embedding documents into vector space, it builds a hierarchical tree structure that mirrors how documents are actually organized—like a detailed table of contents with nested sections. During retrieval, an LLM navigates this tree through multi-step reasoning, simulating how a human expert would skim through a document’s structure to find relevant information. The system proved itself by achieving state-of-the-art 98.7% accuracy on FinanceBench, a benchmark specifically designed to test RAG systems on complex financial document analysis.
Technical Insight
PageIndex implements retrieval as a two-phase process that completely bypasses vector embeddings. During the indexing phase, it parses documents into a semantic tree structure. Instead of splitting text into arbitrary chunks, it preserves natural document hierarchy—sections, subsections, paragraphs—maintaining semantic coherence. The system supports both text-based parsing and a vision-based mode that works directly on page images without OCR, which is particularly valuable for documents with complex layouts like tables and charts.
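The resulting index can be pictured as a tree of section nodes. Here is a minimal sketch of such a structure; the class and field names are illustrative, not PageIndex's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class SectionNode:
    """One node in the document tree: a section, subsection, or paragraph."""
    title: str       # e.g. "Note 2: Revenue Recognition"
    summary: str     # short gist used to steer tree search
    page_start: int
    page_end: int
    children: list["SectionNode"] = field(default_factory=list)

# Toy tree for an annual-report-style filing
root = SectionNode("FY2024 Annual Report", "full annual report", 1, 120, children=[
    SectionNode("Financial Statements", "income statement, balance sheet", 40, 80,
                children=[SectionNode("Income Statement", "revenue, expenses", 41, 43)]),
    SectionNode("Notes to Financial Statements", "policies and disclosures", 81, 110,
                children=[SectionNode("Note 2: Revenue Recognition",
                                      "policy change adopted in Q3", 84, 86)]),
])
```

Note that each node carries a summary and page range rather than raw embeddings: everything the retriever needs to decide whether a branch is worth descending into.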
The retrieval phase is where PageIndex diverges most dramatically from traditional RAG. Instead of computing cosine similarities, it performs LLM-guided tree search. Given a query, the system uses the LLM to reason about which branches of the document tree are most likely to contain relevant information, progressively narrowing down to specific sections. This is fundamentally agentic retrieval—the LLM makes decisions about where to look next based on what it’s already seen, just like a human expert flipping through a report.
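The traversal can be sketched as a short recursive loop. In this toy version, a keyword-overlap heuristic stands in for the LLM's branch-selection call; in the real system, a model reads the query plus child titles and summaries at each step:

```python
def retrieve(node, query, pick_children, path=(), max_depth=5):
    """LLM-guided tree search: at each internal node, ask the model which
    child branches are worth exploring, then recurse only into those."""
    path = path + (node["title"],)
    children = node.get("children", [])
    if not children or len(path) > max_depth:
        return [{"section": node["title"], "path": list(path)}]
    choices = [(c["title"], c["summary"]) for c in children]
    hits = []
    for i in pick_children(query, choices):  # the LLM decision point
        hits += retrieve(children[i], query, pick_children, path, max_depth)
    return hits

# Toy stand-in for the LLM: pick branches with any keyword overlap
def keyword_pick(query, choices):
    q = set(query.lower().split())
    return [i for i, (title, summary) in enumerate(choices)
            if q & set((title + " " + summary).lower().split())]

tree = {
    "title": "FY2024 Annual Report", "summary": "full annual report",
    "children": [
        {"title": "Financial Statements",
         "summary": "revenue expenses income statement balance sheet",
         "children": [{"title": "Income Statement",
                       "summary": "revenue and expense line items"}]},
        {"title": "Notes to Financial Statements",
         "summary": "accounting policy disclosures",
         "children": [{"title": "Note 2: Revenue Recognition",
                       "summary": "policy change adopted in Q3"}]},
    ],
}

hits = retrieve(tree, "revenue recognition policy change", keyword_pick)
for h in hits:
    print(" > ".join(h["path"]))
```

Each hit comes back with the full path of decisions that led to it, which is exactly what makes the retrieval traceable rather than a bare similarity score.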
Here’s a simplified example of the workflow (the client API below is paraphrased for illustration; see the repository for the exact interface):
from pageindex import PageIndexClient

# Initialize client (self-hosted or cloud)
client = PageIndexClient(api_key="your_key")

# Index a document - builds hierarchical tree structure
index_result = client.index_document(
    file_path="financial_report.pdf",
    mode="vision",  # OCR-free, works on page images
)

# Reasoning-based retrieval via tree search
response = client.query(
    index_id=index_result.index_id,
    query="What was the revenue recognition policy change in Q3?",
    reasoning_depth="deep",  # controls tree-search thoroughness
)

# Get traceable results with page references
for result in response.results:
    print(f"Section: {result.section_title}")
    print(f"Page: {result.page_number}")
    print(f"Reasoning path: {result.reasoning_trace}")
    print(f"Content: {result.content}")
The architecture enables full explainability—every retrieval decision comes with a reasoning trace showing which tree branches were explored and why. This solves the “black box” problem of vector similarity where you never really know why a chunk was retrieved beyond having a high cosine similarity score. With PageIndex, you get page numbers, section titles, and a logical explanation of the retrieval path.
The system can be deployed self-hosted or accessed via their cloud service through API or MCP (Model Context Protocol) integration. The repository includes an agentic vectorless RAG example using OpenAI’s Agents SDK, demonstrating how to integrate PageIndex into multi-agent workflows. The core innovation is treating retrieval as a reasoning task rather than a similarity matching task—a conceptual shift that requires LLM inference at retrieval time but delivers dramatically better relevance for complex queries.
One particularly clever design choice is the hierarchical indexing strategy. By preserving document structure, PageIndex avoids the chunk-boundary problem that plagues traditional RAG. When you chunk a financial statement arbitrarily, you might split a table across two chunks or separate a disclosure from its referenced line item. PageIndex's tree structure keeps related information together by respecting the document's inherent organization, the very structure its authors designed to be navigable.
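The chunk-boundary problem is easy to demonstrate. A toy illustration of what naive fixed-size chunking does to a short disclosure:

```python
passage = (
    "Note 2: Revenue Recognition. Effective Q3, the Company adopted a new "
    "revenue standard for contract revenue. See the Income Statement, "
    "line 4, for the impact."
)

# Naive fixed-size chunking: the policy change and the line item it
# references land in different chunks, so a similarity search can return
# one without the other.
chunk_size = 60
chunks = [passage[i:i + chunk_size] for i in range(0, len(passage), chunk_size)]
for c in chunks:
    print(repr(c))

# A structure-aware index instead keeps the whole note as one leaf node,
# so the disclosure and its cross-reference are retrieved together.
```

No single 60-character chunk here contains both the effective date and the cross-referenced line item, which is exactly the failure mode tree-structured indexing avoids.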
Gotcha
PageIndex makes a deliberate trade-off: higher accuracy for higher latency and cost. Retrieval operations require LLM inference as the system traverses the document tree, reasoning about which branches to explore. This is fundamentally slower and more expensive than a single vector similarity computation. The exact performance characteristics depend on your deployment configuration and document complexity.
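A back-of-envelope comparison makes the trade-off concrete. All numbers below are illustrative placeholders, not measured benchmarks:

```python
# One query against a vector index: embed the query, then an
# approximate-nearest-neighbor lookup.
embed_query_ms = 20
ann_lookup_ms = 10
vector_rag_ms = embed_query_ms + ann_lookup_ms

# LLM-guided tree search: one reasoning call per level descended.
tree_depth = 4
llm_call_ms = 800
tree_search_ms = tree_depth * llm_call_ms

print(f"vector RAG:  ~{vector_rag_ms} ms per query")
print(f"tree search: ~{tree_search_ms} ms per query "
      f"({tree_search_ms / vector_rag_ms:.0f}x slower)")
```

Even with generous assumptions, the reasoning approach pays a per-level LLM tax that similarity search never incurs, which is why it targets high-stakes queries rather than high-throughput chat.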
Another critical limitation is the dependency on LLM quality. PageIndex's retrieval performance is bounded by the reasoning capabilities of the underlying language model. If you use a weaker model to reduce costs, you may not see the dramatic accuracy improvements the FinanceBench results demonstrate. The system requires LLM access for both indexing and retrieval, whether self-hosted or cloud-based, so budget either for API calls or for infrastructure capable of hosting a sufficiently strong local model. This isn't a lightweight solution.
Verdict
Use PageIndex if you’re working with professional long-form documents where accuracy is non-negotiable and you can tolerate higher latency—financial analysis, legal contract review, regulatory compliance, technical due diligence. The 98.7% FinanceBench accuracy isn’t marketing fluff; it represents a fundamental improvement for domains where vector search’s approximate retrieval causes real problems. Use it when explainability matters, when you need to cite specific pages and sections, when your queries require multi-hop reasoning across document sections. Use it when a wrong answer costs more than the LLM inference costs. Skip it for simple Q&A over short documents, chatbots needing sub-second response times, or applications where vector search already performs adequately. Skip it if LLM inference costs are a primary constraint for your use case, or if you’re working with documents that lack clear hierarchical structure. This is a specialized tool for high-stakes document analysis, not a drop-in replacement for every RAG system.