LLM Sherpa: How Smart Chunking Fixes RAG's Biggest Problem

Hook

Your RAG system retrieves individual sentences from a 10-step procedure, missing critical context from steps 1-3. This is why naive PDF chunking breaks retrieval quality—and why document-aware parsing matters.

Context

Retrieval Augmented Generation has become the de facto pattern for grounding LLMs in domain-specific knowledge, but there's a dirty secret: most RAG pipelines destroy document structure before they even start. The typical workflow extracts raw text from PDFs, splits it into fixed-size chunks (say, 512 tokens), embeds those chunks, and hopes semantic search finds the right context. This approach falls apart the moment you encounter real-world documents.
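
To make the failure mode concrete, here is a minimal sketch of fixed-size chunking. It counts whitespace-separated words rather than real tokens, and both the `naive_chunks` helper and the sample procedure text are invented for illustration.

```python
def naive_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` words, ignoring structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Hypothetical procedure text; a real pipeline would get this from a PDF extractor.
procedure = (
    "3.2 Installation Procedure "
    "Step 1: Disconnect power. "
    "Step 2: Mount the bracket. "
    "Step 3: Attach the cable. "
    "Step 4: Verify connections."
)

chunks = naive_chunks(procedure, size=8)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The last chunk holds Step 4 along with a stray fragment of Step 3, and the section heading is nowhere in it. A retriever that matches on "verify connections" hands the LLM exactly that fragment.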

Consider a technical manual with numbered procedures, nested subsections, and reference tables. When you chunk naively by character count, you split mid-paragraph, separate tables from their captions, and orphan list items from their parent context. A user asks "What are the safety requirements?" and your retriever returns Step 4 of a procedure without Steps 1-3, or a table row without column headers. The LLM hallucinates or refuses to answer because the retrieved context is incoherent. LLM Sherpa emerged from nlmatics to solve this structural awareness problem—not by improving embeddings or retrieval algorithms, but by fixing what happens before chunking ever begins.

Technical Insight

LLM Sherpa's architecture consists of a lightweight Python client library that communicates with nlm-ingestor, a backend PDF parsing service that extracts hierarchical document structure. Unlike traditional PDF libraries that treat documents as flat streams of text, nlm-ingestor builds a tree representation where sections contain paragraphs, lists, and tables, preserving parent-child relationships and heading levels.

The client API is deliberately simple. You point it at a PDF URL or local file, and it returns a Document object with hierarchical access:

from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
reader = LayoutPDFReader(llmsherpa_api_url)

# Parse document into hierarchical structure
doc = reader.read_pdf("https://example.com/technical-manual.pdf")

# Access sections as a tree
for section in doc.sections():
    print(f"Section: {section.title} (Level {section.level})")
    for child in section.children:
        print(f"  - {child.to_text()[:100]}")

# Smart chunking: chunks include parent context
for chunk in doc.chunks():
    print(f"Chunk with context: {chunk.to_context_text()}")
    # Includes parent section headers automatically

The magic happens in the chunks() method. Instead of splitting text at arbitrary boundaries, it creates semantically coherent units. A table stays together as one chunk. A numbered list with five items becomes a single retrievable unit. Crucially, each chunk includes its parent section headers as context, so when you retrieve "Step 4: Verify connections," the chunk text also includes "Section 3.2: Installation Procedure" as a prefix.
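
The header-prefix idea can be sketched without the library. The `Node` class below is a hypothetical stand-in for a parsed block, not llmsherpa's actual implementation; it only shows how walking up the parent links yields a context prefix.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed block; not llmsherpa's own class."""
    tag: str  # e.g. "header", "para", "list_item"
    text: str
    parent: Node | None = field(default=None, repr=False)
    children: list[Node] = field(default_factory=list)

    def add(self, child: Node) -> Node:
        """Attach a child block and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def header_path(self) -> list[str]:
        """Collect ancestor header titles, outermost first."""
        path = []
        node = self.parent
        while node is not None:
            if node.tag == "header":
                path.append(node.text)
            node = node.parent
        return list(reversed(path))

    def context_text(self) -> str:
        """Prefix the block's own text with its header path."""
        prefix = " > ".join(self.header_path())
        return f"{prefix}\n{self.text}" if prefix else self.text


chapter = Node("header", "Chapter 3")
section = chapter.add(Node("header", "Section 3.2: Installation Procedure"))
step = section.add(Node("list_item", "Step 4: Verify connections"))

print(step.context_text())
```

Because the prefix is computed from the tree at read time, the same block can be emitted with or without context depending on what the downstream embedder handles best.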

The backend nlm-ingestor service (which you'll need to self-host now that the public API is shutting down) uses rule-based layout analysis over the PDF's text layer, working from text coordinates, font data, and graphics rather than a vision model. It identifies heading levels through font size and styling, recognizes table boundaries, detects multi-column layouts, and filters repeating headers and footers that would otherwise pollute your chunks. Each parsed element includes bounding box coordinates, page numbers, and spatial relationships:

# Access layout information for custom processing
for table in doc.tables():
    print(f"Table on page {table.page_idx}")
    print(f"Bounding box: {table.bbox}")
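    # Column count is taken from the first data row; merged cells may make rows uneven
    print(f"Rows: {len(table.rows)}, Cols: {len(table.rows[0].cells) if table.rows else 0}")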
    
    # Table already parsed into structured format
    for row in table.rows:
        print([cell.to_text() for cell in row.cells])
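
Filtering repeated headers and footers, mentioned above, is easy to picture with a toy version. The sketch below is not nlm-ingestor's algorithm. It assumes each page has already been reduced to a list of text lines, and it drops a page's first or last line when that exact line recurs at the edge of most pages. The `strip_repeated_lines` helper and the sample pages are made up for illustration.

```python
from collections import Counter


def strip_repeated_lines(pages: list[list[str]], min_fraction: float = 0.6) -> list[list[str]]:
    """Drop first/last lines that recur on at least `min_fraction` of pages."""
    edge_counts: Counter[str] = Counter()
    for lines in pages:
        # Count each distinct edge line once per page.
        edge_counts.update(set(lines[:1] + lines[-1:]))
    threshold = min_fraction * len(pages)
    repeated = {line for line, n in edge_counts.items() if n >= threshold}
    return [
        [
            line
            for i, line in enumerate(lines)
            # Only remove a repeated line when it sits at the top or bottom of the page.
            if not (line in repeated and i in (0, len(lines) - 1))
        ]
        for lines in pages
    ]


pages = [
    ["ACME Manual v2", "3.2 Installation Procedure", "Step 1: Disconnect power.", "Confidential"],
    ["ACME Manual v2", "Step 2: Mount the bracket.", "Confidential"],
    ["ACME Manual v2", "Step 3: Attach the cable.", "Confidential"],
]
cleaned = strip_repeated_lines(pages)
print(cleaned)
```

Exact string matching misses footers that change per page, such as "Page 3 of 40"; a more robust filter would normalize digits and compare positions on the page as well.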

The LlamaIndex integration demonstrates the intended use case. Instead of feeding raw text to vector stores, you pass LLM Sherpa's structured chunks, which maintain document hierarchy through the entire RAG pipeline:

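# llama-index 0.10+ exposes these under llama_index.core;
# older releases used `from llama_index import ...`
from llama_index.core import Document as LlamaDocument
from llama_index.core import VectorStoreIndex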

# Convert LLM Sherpa chunks to LlamaIndex documents
llama_docs = []
for chunk in doc.chunks():
    llama_docs.append(LlamaDocument(
        text=chunk.to_context_text(),  # Includes parent headers
        metadata={
            'section_path': chunk.parent_text(),  # ancestor headers joined with " > "
            'page': chunk.page_idx,
            'bbox': chunk.bbox
        }
    ))

index = VectorStoreIndex.from_documents(llama_docs)
query_engine = index.as_query_engine()
response = query_engine.query("What are the installation steps?")

When a user queries "What are the installation steps?" the retriever finds chunks that include both the relevant content AND the hierarchical context ("Chapter 3 > Installation > Prerequisites"). The LLM sees properly scoped information instead of orphaned fragments.

The section navigation API enables another powerful pattern: selective document processing. The parser still processes the whole PDF, but once you have the tree you can pick out specific sections and ignore the rest:

# Find a specific section by title (sections() walks the whole tree)
installation_section = next(
    (s for s in doc.sections() if s.title == "Installation"), None
)
if installation_section:
    steps = []
    # Process only this section's direct children
    for child in installation_section.children:
        if child.tag == 'table':
            # Table rows are already parsed into cells
            structured_data = [[cell.to_text() for cell in row.cells] for row in child.rows]
        elif child.tag == 'list_item':
            # Collect list items in document order
            steps.append(child.to_text())

This architectural choice—separating structure extraction from chunk creation—means you can experiment with different chunking strategies while preserving the underlying document tree. Want bigger chunks? Combine sibling sections. Need finer granularity? Split at the paragraph level but keep section headers attached.
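
As a rough sketch of the "bigger chunks" direction, the helper below merges adjacent chunks that share a section path until a word budget is reached, emitting the path once per merged chunk. It works on plain `(section_path, text)` tuples; in a real pipeline those would plausibly come from each chunk's parent headers and text, but `merge_chunks` itself is a hypothetical helper and not part of llmsherpa.

```python
def merge_chunks(chunks: list[tuple[str, str]], max_words: int) -> list[str]:
    """Merge adjacent (section_path, text) chunks from the same section, up to max_words.

    The section path is written once at the top of each merged chunk so the
    header stays attached to the text it scopes.
    """
    merged: list[str] = []
    current_path, current_texts, current_len = None, [], 0
    for path, text in chunks:
        n = len(text.split())
        # Start a new merged chunk on a section change or when the budget would overflow.
        if path != current_path or current_len + n > max_words:
            if current_texts:
                merged.append(f"{current_path}\n" + "\n".join(current_texts))
            current_path, current_texts, current_len = path, [], 0
        current_texts.append(text)
        current_len += n
    if current_texts:
        merged.append(f"{current_path}\n" + "\n".join(current_texts))
    return merged


chunks = [
    ("Installation > Prerequisites", "Use a grounded outlet."),
    ("Installation > Prerequisites", "Wear insulated gloves."),
    ("Installation > Procedure", "Step 1: Disconnect power."),
    ("Installation > Procedure", "Step 2: Mount the bracket."),
]

coarse = merge_chunks(chunks, max_words=10)
fine = merge_chunks(chunks, max_words=4)
print(len(coarse), len(fine))
```

A single chunk that exceeds the budget on its own is still emitted whole, which keeps tables and list items intact at the cost of an occasional oversized chunk.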

Gotcha

The most immediate gotcha: you must self-host the nlm-ingestor backend service. The public demo API mentioned in many tutorials is being decommissioned, so budget time for Docker deployment. The official nlm-ingestor repository documents a Docker-based deployment, but you'll need infrastructure with sufficient memory (recommended 4GB+) and storage for processing large documents. This isn't a deal-breaker for production systems, but it adds operational overhead compared to managed PDF parsing APIs.

Parsing quality varies significantly across PDF types. The maintainers are transparent about this: "it is still challenging to get every PDF parsed correctly." PDFs generated from LaTeX or structured authoring tools generally parse well because they have clean internal structure. But legacy scanned documents, PDFs with complex multi-column layouts, or files with embedded forms can produce inconsistent results. You'll encounter cases where tables aren't properly detected, heading levels are misidentified, or content order gets scrambled. The repository includes examples and test cases, but expect to manually validate parsing on your specific document types before going to production. There's no substitute for testing with your actual corpus—academic papers parse differently than legal contracts or technical manuals. The OCR capability exists for scanned documents but quality depends heavily on image resolution and document condition. This isn't unique to LLM Sherpa (PDF parsing is genuinely hard), but it means you should plan for a fallback strategy when parsing fails or produces low-quality output.
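
One way to structure such a fallback is sketched below. The two chunkers are passed in as plain callables, so the routing logic stays independent of any particular parser. The thresholds, function names, and stub parsers are all assumptions made for illustration, not recommended values.

```python
from typing import Callable

Chunker = Callable[[str], list[str]]


def chunk_with_fallback(
    source: str,
    structured: Chunker,
    fallback: Chunker,
    min_chunks: int = 1,
    min_avg_words: float = 5.0,
) -> tuple[list[str], str]:
    """Try the structure-aware chunker; fall back if it fails or looks degenerate."""
    try:
        chunks = structured(source)
    except Exception:
        # Broad on purpose: any parser or network failure routes to the fallback.
        return fallback(source), "fallback"
    non_empty = [c for c in chunks if c.strip()]
    avg_words = (
        sum(len(c.split()) for c in non_empty) / len(non_empty) if non_empty else 0.0
    )
    # Too few chunks or very short chunks suggest the layout was misread.
    if len(non_empty) < min_chunks or avg_words < min_avg_words:
        return fallback(source), "fallback"
    return non_empty, "structured"


# Stub chunkers standing in for a structure-aware parser and a plain-text splitter.
def broken_parser(path: str) -> list[str]:
    raise RuntimeError("parser unavailable")


def good_parser(path: str) -> list[str]:
    return ["Installation > Procedure\nStep 1: Disconnect power before opening the case."]


def plain_text_chunker(path: str) -> list[str]:
    return ["raw text chunk one from the fallback extractor"]


_, route_a = chunk_with_fallback("manual.pdf", broken_parser, plain_text_chunker)
_, route_b = chunk_with_fallback("manual.pdf", good_parser, plain_text_chunker)
print(route_a, route_b)
```

Returning the route alongside the chunks lets you log how often each document type falls back, which is useful input when validating the parser against your own corpus.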

Verdict

Use if: You're building RAG applications that consume structured documents (technical documentation, research papers, legal contracts, reports) where maintaining hierarchy and context is critical for retrieval quality. The smart chunking and section-aware parsing should make it far more likely that your RAG system returns coherent, contextually complete answers. You have the infrastructure capacity to self-host the nlm-ingestor service or are already running containerized services. You're working with PDFs that have inherent structure (headings, lists, tables) rather than purely narrative text.

Skip if: Your documents are simple, unstructured prose where naive chunking works fine, or you need guaranteed parsing accuracy across arbitrary PDFs without manual validation. If you can't self-host services or need a fully managed solution with SLAs, look at commercial alternatives like Unstructured.io's hosted API or Adobe Extract. Also skip if you're processing mainly plain text, Markdown, or HTML where structure is already explicit: LLM Sherpa's value proposition is specifically about extracting hidden structure from PDFs.
