
LLM Sherpa: How Smart PDF Chunking Fixes Broken RAG Pipelines


Hook

Your RAG system returns nonsense because your PDF parser is splitting tables mid-row and orphaning list items from their context. Traditional PDF parsers treat documents like raw text dumps, losing the structural context that makes retrieval work.

Context

Retrieval-Augmented Generation lives or dies by chunk quality. Feed an LLM a paragraph ripped from page 47 with no section context, and it hallucinates. Show it a table where half the rows are missing, and it invents data. The problem isn’t your vector database or your prompt engineering—it’s that the parser discarded the document’s structure before the text ever reached them.

Most PDF extraction tools use arbitrary chunking: split every 512 tokens, break on double newlines, or worse, preserve the random line breaks that PDFs inject mid-sentence. When you vectorize these fragments, you lose critical context. A paragraph about “Model Performance” becomes meaningless without knowing it’s from the “BERT Fine-tuning” section. A list item about dosage instructions becomes dangerous without the preceding paragraph explaining which medication. LLM Sherpa emerged from nlmatics to solve this structural blindness by parsing PDFs with layout awareness, preserving hierarchical relationships between sections, tables, lists, and paragraphs.
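To see why arbitrary splits are harmful, here is a toy fixed-size splitter (illustrative only, not LLM Sherpa code) applied to the dosage example above:

```python
# Hypothetical illustration: a structure-blind, fixed-size splitter
# severs a list item from the heading and sentence that give it meaning.
def fixed_size_chunks(text, size):
    """Split text into chunks of at most `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = (
    "Dosage Instructions\n"
    "Amoxicillin, adults:\n"
    "- Take 500 mg every 8 hours\n"
)
chunks = fixed_size_chunks(doc, 40)
# The chunk containing the dosage no longer names the medication.
print(chunks[-1])
```

Vectorized on its own, that last chunk will match queries about dosage for any drug in the corpus.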

Technical Insight

System architecture (auto-generated diagram, summarized): on the client side, the Python client’s LayoutPDFReader sends a PDF URL or path to the API endpoint of the backend nlm-ingestor service, which runs in a Docker container. The backend’s PDF parser performs layout analysis and smart chunking, producing context-aware splits returned as hierarchical JSON. The client exposes this as a structured document whose doc.chunks yield semantic chunks; those feed a vector store (e.g. LlamaIndex) that answers queries from the LLM application.

LLM Sherpa operates as a Python client that calls a backend parsing service called nlm-ingestor. You send a PDF URL or file path to the API endpoint, and it returns a hierarchical document object with preserved structure. The backend service was recently open-sourced under Apache 2.0 and can be self-hosted using Docker.

Here’s the basic workflow:

from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

# Access hierarchical structure
for section in doc.sections():
    print(f"Section: {section.title}")

The key feature is the chunks() method. Instead of arbitrary 512-token splits, LLM Sherpa creates semantically coherent chunks:

from llama_index.core import Document, VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    # Each chunk includes section headers as context
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))
    
query_engine = index.as_query_engine()
response = query_engine.query("what is the bart performance score on squad")
print(response)  # "The BART performance score on SQuAD is 88.8 for EM and 94.6 for F1."

The chunking strategy keeps tables intact, bundles list items with their introductory paragraphs, and threads section headers through nested subsections. When you vectorize a chunk deep in a document, the text includes contextual breadcrumbs from parent sections.
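The breadcrumb idea can be sketched in a few lines. The class and method below are hypothetical stand-ins, not the llmsherpa API, but they show the shape of what to_context_text() produces:

```python
# Illustration of "contextual breadcrumbs": prefixing a chunk's text with
# its parent section titles before vectorizing. Names here are invented
# for illustration; llmsherpa's own objects differ.
class Chunk:
    def __init__(self, text, parent_titles):
        self.text = text
        self.parent_titles = parent_titles  # outermost section first

    def to_context_text(self):
        # Thread the section hierarchy through the chunk so the embedding
        # carries where in the document this text lives.
        return "\n".join(self.parent_titles + [self.text])

chunk = Chunk(
    "BART achieves 88.8 EM on SQuAD.",
    ["3 Experiments", "3.2 Fine-tuning on SQuAD"],
)
print(chunk.to_context_text())
```

A query like “BART fine-tuning results” now matches on the section titles as well as the sentence itself.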

The backend service handles edge cases: content spanning page breaks gets rejoined, repeating headers and footers get stripped, and watermarks get removed. It also exposes bounding box coordinates via the bbox property on blocks like sections, enabling layout-aware processing.
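One use of those coordinates is filtering blocks by position on the page. The bbox layout below ([x0, y0, x1, y1] per block, y growing downward) is an assumption for illustration; consult the nlm-ingestor documentation for the exact schema:

```python
# Sketch of layout-aware filtering with bounding boxes (assumed format:
# [x0, y0, x1, y1] in points, origin at top-left). Not the real schema;
# check the nlm-ingestor docs before relying on field names.
blocks = [
    {"text": "Abstract", "bbox": [72, 90, 540, 110]},
    {"text": "Footer: page 1", "bbox": [72, 760, 540, 775]},
]

def in_body(block, page_height=792, margin=60):
    """Keep blocks whose bottom edge sits above the footer margin."""
    return block["bbox"][3] < page_height - margin

body = [b["text"] for b in blocks if in_body(b)]
print(body)  # the footer block is filtered out
```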

You can extract specific sections programmatically:

# Find and process a specific section
for section in doc.sections():
    if 'fine-tuning' in section.title.lower():
        # Use include_children=True and recurse=True to fully expand
        full_text = section.to_html(include_children=True, recurse=True)

The nlm-ingestor backend supports DOCX, PPTX, HTML, TXT, and XML beyond PDFs. OCR support is built into nlm-ingestor. The Docker deployment model means you control data privacy—critical for processing confidential documents.

Gotcha

The free public API at readers.llmsherpa.com is being decommissioned, which means you must self-host the nlm-ingestor Docker container using instructions at github.com/nlmatics/nlm-ingestor. This isn’t a “pip install and forget” experience anymore.
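A minimal self-hosting sketch, based on the nlm-ingestor README at the time of writing (the image tag and port mapping below are assumptions; verify them against the repo):

```shell
# Pull and run the parsing backend locally. Image name and ports follow
# the nlm-ingestor README; confirm the current tag before relying on it.
docker pull ghcr.io/nlmatics/nlm-ingestor:latest
docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest

# Then point LayoutPDFReader at the local service instead of the
# decommissioned public endpoint:
#   http://localhost:5010/api/parseDocument?renderFormat=all
```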

Parsing accuracy varies across PDF types. The README explicitly warns: “it is still challenging to get every PDF parsed correctly.” The LayoutPDFReader currently does not support OCR—only PDFs with a text layer are supported (though the nlm-ingestor backend has OCR capabilities). If your use case involves scanned documents or PDFs without text layers, you’ll need to use the nlm-ingestor backend directly.

The community is relatively small (1,749 stars). The nlm-ingestor backend is actively developed, but you’ll need to read both repositories to understand the full feature set. Some capabilities (like bounding box coordinates and OCR) are backend features that require diving into the ingestor documentation.

The README notes that both the free server and paid Azure Marketplace offering are being decommissioned, with users encouraged to spawn their own servers.

Verdict

Use LLM Sherpa if you’re building RAG systems that process research papers, technical documentation, financial reports, or any PDFs where section hierarchy and table integrity directly impact answer quality. The smart chunking alone can justify the self-hosting setup for production applications. It’s especially valuable if you need to extract specific sections programmatically or maintain layout relationships for citation purposes.

Skip it if you’re working with simple text documents where structure doesn’t matter, if your PDFs lack text layers (scanned documents require the nlm-ingestor backend with OCR), or if you want a zero-maintenance hosted API. For quick prototypes, the self-hosting requirement may add unwanted complexity. For enterprise use cases with complex documents and the capacity to run Docker containers, LLM Sherpa’s structural intelligence addresses real RAG quality problems that simpler parsers ignore.
