
PageIndex: Why This RAG System Ditched Vector Databases for LLM Reasoning



Hook

What if your RAG system’s biggest problem isn’t the embeddings—it’s the assumption that ‘similar’ and ‘relevant’ mean the same thing? PageIndex throws out vector databases entirely and lets LLMs reason their way through documents instead.

Context

Traditional RAG systems follow a predictable pattern: chunk documents into overlapping segments, generate embeddings, store them in a vector database, then retrieve the ‘most similar’ chunks at query time. This approach has powered countless AI applications, but it has a fundamental flaw: semantic similarity doesn’t equal relevance. When you ask ‘What was the revenue growth in Q3?’, a vector database might return chunks mentioning ‘Q3’ and ‘revenue’ without understanding that you need the specific calculation, not just paragraphs where those words co-occur. Chunking makes this worse—arbitrary splits break tables, separate context from data, and force you to tune chunk sizes and overlap parameters that feel more like dark magic than engineering.

PageIndex emerged from VectifyAI’s work on financial document analysis, where these limitations became painfully obvious. Financial analysts don’t scan documents for similar text—they navigate to specific sections, understand document structure, and reason about where information should be. PageIndex replicates this human approach: it parses documents into hierarchical tree structures that mirror natural organization (sections, subsections, tables), then uses LLM agents to navigate these trees through multi-step reasoning. Instead of asking ‘which embeddings are closest?’, it asks ‘where would an expert look for this information?’ The result is a system that achieves 98.7% accuracy on FinanceBench, a benchmark designed for complex financial question answering where traditional RAG systems struggle.

Technical Insight

[Architecture diagram, auto-generated: PDF/text/images feed the Document Parser, which extracts hierarchy and stores nodes with metadata as a semantic tree (root, section, and subsection nodes) in the Document Index. A user query goes to the Reasoning Retriever: an LLM agent receives the tree context, asks ‘which section?’, navigates the tree for up to reasoning_depth iterations, locates the relevant nodes, and returns the answer plus reasoning trace as the query result with metadata.]

PageIndex’s architecture centers on two components: the Document Parser that builds semantic trees, and the Reasoning Engine that navigates them. The parser processes documents (PDFs, text files, or even page images) and extracts hierarchical structure. Unlike traditional chunkers that slice text every N tokens, PageIndex identifies natural boundaries—headings, sections, tables—and preserves them as tree nodes. Each node contains metadata: title, page numbers, parent-child relationships, and content summaries. For vision-based parsing, it can process PDF page images directly without OCR, allowing the LLM to reason over visual layouts and tables that OCR often mangles.
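The node metadata described above can be pictured as a small recursive structure. The sketch below is illustrative only; `TreeNode` and its fields are hypothetical stand-ins, not PageIndex's actual classes:

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    # Hypothetical shape of one index node: title, page span,
    # an LLM-written summary, and child sections.
    title: str
    pages: tuple[int, int]
    summary: str = ""
    children: list["TreeNode"] = field(default_factory=list)

    def outline(self, depth: int = 0) -> list[str]:
        """Flatten the tree into the indented outline an LLM agent would see."""
        lines = [f"{'  ' * depth}{self.title} (pages {self.pages[0]}-{self.pages[1]})"]
        for child in self.children:
            lines.extend(child.outline(depth + 1))
        return lines

root = TreeNode("Annual Report", (1, 15), children=[
    TreeNode("Financial Highlights", (1, 3), children=[
        TreeNode("Revenue Summary", (2, 2)),
        TreeNode("Expense Breakdown", (3, 3)),
    ]),
    TreeNode("Risk Factors", (11, 15)),
])
print("\n".join(root.outline()))
```

The point of the outline form is that the whole hierarchy fits into a prompt compactly, so the LLM can reason over structure without seeing full section contents.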

Here’s what retrieval looks like in practice:

from pageindex import DocumentIndex, ReasoningRetriever

# Build the index (one-time operation)
index = DocumentIndex.from_pdf(
    "annual_report.pdf",
    mode="vision",  # OCR-free processing
    llm="gpt-4-turbo"
)

# The index is a tree structure:
# Root
# ├── Financial Highlights (pages 1-3)
# │   ├── Revenue Summary (page 2)
# │   └── Expense Breakdown (page 3)
# ├── Operations Review (pages 4-10)
# └── Risk Factors (pages 11-15)

# Query with reasoning-based retrieval
retriever = ReasoningRetriever(index, llm="gpt-4-turbo")
result = retriever.query(
    "What was the primary driver of increased operating expenses?",
    reasoning_depth=3  # Allow 3 levels of tree navigation
)

print(result.answer)
print(f"Found in: {result.sections}")  # e.g., ["Financial Highlights > Expense Breakdown"]
print(f"Pages: {result.page_numbers}")  # Full traceability
print(f"Reasoning trace: {result.reasoning_steps}")  # See the agent's decision process

The magic happens during retrieval. When you submit a query, the ReasoningRetriever doesn’t compute embeddings. Instead, it presents the LLM with the tree structure and asks: ‘Which section should we explore?’ The LLM might respond: ‘The query asks about operating expenses, so we should look in Financial Highlights > Expense Breakdown.’ The system retrieves that section’s content, and the LLM can request deeper navigation: ‘I see mentions of R&D spending increases. Let me check the Operations Review > R&D Initiatives section for details.’ This multi-step reasoning continues until the LLM determines it has sufficient context, then generates the final answer.
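The navigation loop described above reduces to plain control flow. In this sketch, `choose_section` stands in for an LLM call, and the toy tree and keyword-based chooser are illustrative stand-ins rather than PageIndex internals:

```python
def navigate(tree, query, choose_section, max_depth=3):
    """Walk the tree by repeatedly asking an LLM-like chooser which
    child section to descend into. Returns the visited path."""
    node, path = tree, [tree["title"]]
    for _ in range(max_depth):
        if not node.get("children"):
            break  # leaf reached: its content goes to answer generation
        choice = choose_section(query, [c["title"] for c in node["children"]])
        if choice is None:
            break  # chooser decides current context is sufficient
        node = next(c for c in node["children"] if c["title"] == choice)
        path.append(choice)
    return path

tree = {"title": "Root", "children": [
    {"title": "Financial Highlights", "children": [
        {"title": "Expense Breakdown", "children": []}]},
    {"title": "Operations Review", "children": []},
]}

# A stub chooser that keys on keywords instead of calling an LLM.
def chooser(query, titles):
    for t in titles:
        if "expense" in t.lower() or "financial" in t.lower():
            return t
    return None

path = navigate(tree, "What drove operating expenses?", chooser)
```

Swapping the stub for a real LLM call turns the visited path into exactly the kind of reasoning trace the article describes.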

The reasoning trace is fully transparent. You can inspect every navigation decision, which sections were considered and rejected, and why the final sections were chosen. This is transformative for high-stakes applications where ‘the embedding said so’ isn’t an acceptable explanation.

For documents without clear hierarchical structure, PageIndex includes a hybrid mode that creates semantic sections by having the LLM analyze content and propose logical divisions:

index = DocumentIndex.from_text(
    long_article_text,
    structure_mode="llm_inferred",  # Let LLM propose structure
    section_prompt="Organize this into logical sections for a technical reader"
)

The system also handles multi-document scenarios elegantly. Traditional RAG systems throw all chunks into one vector space, making cross-document reasoning difficult. PageIndex maintains separate trees per document and uses a meta-reasoning step to decide which documents to search:

multi_doc_index = DocumentIndex.from_directory(
    "./financial_docs/",
    index_strategy="per_document"
)

result = retriever.query(
    "Compare revenue growth between Q1 and Q3 reports",
    cross_document=True  # Enable cross-document reasoning
)
# The LLM first identifies relevant documents, then navigates each

Performance-wise, PageIndex trades latency for accuracy. Building an index requires LLM calls to generate section summaries and structure metadata—expect 30-60 seconds for a 50-page document with GPT-4. Retrieval involves 2-5 LLM calls depending on reasoning depth, adding 3-10 seconds compared to sub-second vector lookups. For applications where getting the right answer matters more than instant response, this is a worthwhile trade-off. You can also cache indexes and use cheaper models for simple queries while reserving frontier models for complex reasoning.
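One way to implement the cheap-model routing mentioned above is a small dispatcher. The model names come from the examples earlier in this article; the complexity heuristic is a deliberately naive placeholder you would replace with something smarter:

```python
def pick_model(query: str, cheap: str = "gpt-3.5-turbo",
               frontier: str = "gpt-4-turbo") -> str:
    """Route simple lookups to a cheap model and reserve the frontier
    model for queries likely to need multi-step reasoning.
    The heuristic (keywords + length) is a naive stand-in."""
    needs_reasoning = any(
        kw in query.lower() for kw in ("compare", "why", "driver", "trend", "across")
    )
    return frontier if needs_reasoning or len(query.split()) > 15 else cheap
```

A router like this keeps the expensive model on the queries where reasoning quality actually pays for itself.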

Gotcha

The LLM-first approach creates real cost and latency challenges. Every retrieval operation burns tokens—potentially thousands if your document trees are large or queries require deep navigation. At $0.01 per 1K tokens (GPT-4 pricing), a query that explores 5 sections with 2K tokens each costs $0.10. Scale that to thousands of daily queries and you’re looking at serious API bills. Traditional vector retrieval costs pennies per million queries by comparison. Latency compounds the problem: users expecting Google-speed responses will find 5-10 second retrieval times frustrating. You can optimize with smaller models (GPT-3.5-turbo, Claude Haiku), but then reasoning quality suffers—the whole value proposition depends on sophisticated LLM capabilities.
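The cost arithmetic above generalizes to a back-of-envelope helper. The $0.01 per 1K tokens figure is the article's example rate, not a current price list:

```python
def query_cost(sections: int, tokens_per_section: int,
               price_per_1k: float = 0.01) -> float:
    """Estimated retrieval cost for one query: every explored
    section's tokens pass through the LLM at the given rate."""
    return sections * tokens_per_section / 1000 * price_per_1k

# The article's example: 5 sections x 2K tokens -> $0.10 per query
per_query = query_cost(5, 2000)
# At thousands of daily queries the bill scales linearly:
daily = 5000 * per_query  # 5,000 queries/day -> $500/day
```

Running the same numbers at your own traffic and token rates is worth doing before committing to reasoning-based retrieval.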

PageIndex also inherits all the reliability issues of agentic systems. LLMs sometimes hallucinate during tree navigation, claiming sections exist that don’t, or getting stuck in reasoning loops where they revisit the same nodes. The system includes guardrails (maximum reasoning steps, visited node tracking), but these feel like band-aids on LLM non-determinism. Documents without clear structure pose problems too. If you’re indexing social media posts, chat logs, or stream-of-consciousness writing, there’s no hierarchical tree to extract. The LLM-inferred structure mode helps, but results are inconsistent. Finally, queries requiring information synthesis across many disconnected sections can overwhelm context windows. ‘Summarize all risk factors mentioned anywhere in this 200-page document’ might require retrieving dozens of sections, blowing past token limits.
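Both guardrails mentioned above, a step budget and visited-node tracking, reduce to a few lines of bookkeeping. This sketch shows the idea only, not PageIndex's implementation:

```python
def guarded_step(proposed: str, visited: set, steps_left: int):
    """Reject a navigation step that revisits a node or exceeds the
    step budget. Returns (accepted, remaining_steps)."""
    if steps_left <= 0 or proposed in visited:
        return False, steps_left
    visited.add(proposed)
    return True, steps_left - 1

visited: set = set()
ok1, left = guarded_step("Expense Breakdown", visited, 3)     # accepted
ok2, left = guarded_step("Expense Breakdown", visited, left)  # loop blocked
```

The guard stops infinite loops, but it cannot make the agent pick a better section; a rejected step usually just ends navigation early.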

Verdict

Use if: You’re building applications where accuracy and explainability trump speed and cost—financial analysis, legal document review, medical literature research, compliance checking. PageIndex excels with professional documents that have clear structure (reports, manuals, academic papers) and with queries that require domain expertise to identify the relevant sections. The transparent reasoning traces are essential when humans need to verify AI decisions, and the 98.7% FinanceBench accuracy suggests it’s genuinely better than vector RAG for complex analytical tasks. Also use it if you’re already paying for frontier LLM APIs and can absorb the extra token costs, or if your documents have complex visual layouts that OCR destroys.

Skip if: You need sub-second response times, are cost-constrained (the LLM calls add up fast), are working with unstructured text that lacks clear sections, or are building consumer-facing features where traditional RAG’s ‘good enough’ accuracy at a tenth of the cost is the smarter choice. Also skip if your queries are simple keyword lookups, or if you’re using smaller models that can’t reason reliably—PageIndex’s value disappears without strong LLM capabilities behind it.
