Marker: How a Multi-Stage CV Pipeline Achieves 25 Pages/Second PDF Parsing

Hook

While most developers reach for regex and hope when extracting tables from PDFs, Marker processes 200 million pages per week in production using a completely different approach: treating document conversion as a computer vision problem, not a text parsing one.

Context

Anyone who's tried to extract structured data from PDFs knows the pain. Simple text extraction with PyPDF2 fails the moment you encounter a multi-column layout. Cloud services like Adobe PDF Services or Mathpix work but lock you into vendor pricing and rate limits. LLM-based solutions like GPT-4 Vision are accurate but prohibitively expensive at scale—processing a 100-page technical report could cost several dollars in API calls.

Marker emerged from this gap: the need for high-accuracy document conversion that could handle real-world complexity (equations, tables, mixed layouts) while remaining fast and cost-effective enough for production workloads. Built by datalab.to, it's already processing hundreds of millions of pages weekly for RAG pipelines, document indexing, and data extraction workflows. The tool supports not just PDFs but DOCX, PPTX, XLSX, EPUB, HTML, and images, all converted to clean markdown, JSON, or HTML output.

Technical Insight

Marker's architecture is a multi-stage pipeline that fundamentally treats document understanding as a computer vision task, not text parsing. The first stage uses specialized OCR models from the Surya family for text detection and recognition, followed by layout analysis to identify document structure. Unlike simple PDF parsers that rely on embedded text streams, Marker renders pages as images and applies deep learning models to understand spatial relationships.
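To see why spatial information matters, consider a two-column page. The sketch below is not Marker's layout model; it is a deliberately simple heuristic, with made-up bounding boxes, that contrasts a naive top-to-bottom ordering of text boxes with one that first groups boxes by column.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text) in page coordinates; the values are illustrative
Box = Tuple[float, float, float, float, str]


def naive_order(boxes: List[Box]) -> List[str]:
    """Sort by vertical position, then horizontal: interleaves the columns."""
    return [b[4] for b in sorted(boxes, key=lambda b: (b[1], b[0]))]


def column_aware_order(boxes: List[Box], page_width: float) -> List[str]:
    """Assign each box to the left or right half of the page, then read each column top to bottom."""
    mid = page_width / 2
    left = [b for b in boxes if (b[0] + b[2]) / 2 < mid]
    right = [b for b in boxes if (b[0] + b[2]) / 2 >= mid]
    by_top = lambda b: b[1]
    return [b[4] for b in sorted(left, key=by_top) + sorted(right, key=by_top)]


boxes = [
    (50, 100, 290, 120, "L1"),
    (320, 100, 560, 120, "R1"),
    (50, 130, 290, 150, "L2"),
    (320, 130, 560, 150, "R2"),
]

naive = naive_order(boxes)
ordered = column_aware_order(boxes, page_width=612)
print(naive)
print(ordered)
```

A fixed two-column split breaks on full-width headings, figures, and three-column layouts, which is the kind of case a learned layout model is meant to handle.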

The pipeline's power comes from its processor chain. After initial extraction, format-specific processors handle different element types: tables get parsed into structured markdown, LaTeX equations are detected and preserved, code blocks are identified and formatted, and common PDF artifacts (headers, footers, watermarks) are stripped. Each processor operates on block-level elements, allowing surgical precision in handling complex documents.
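The processor-chain idea can be illustrated with a toy model. Everything in this sketch (the `Block` dataclass, the two processors, `run_chain`) is hypothetical and far simpler than Marker's own classes; it only shows the pattern of passing block-level elements through a sequence of single-purpose steps.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Block:
    block_type: str  # e.g. "Text", "Table", "PageHeader"
    text: str


Processor = Callable[[List[Block]], List[Block]]


def strip_artifacts(blocks: List[Block]) -> List[Block]:
    """Drop running headers and footers."""
    return [b for b in blocks if b.block_type not in {"PageHeader", "PageFooter"}]


def format_tables(blocks: List[Block]) -> List[Block]:
    """Turn tab-separated table rows into pipe-delimited rows."""
    for b in blocks:
        if b.block_type == "Table":
            rows = [line.split("\t") for line in b.text.splitlines()]
            b.text = "\n".join("| " + " | ".join(row) + " |" for row in rows)
    return blocks


def run_chain(blocks: List[Block], processors: List[Processor]) -> List[Block]:
    """Apply each processor in order; each one sees the previous one's output."""
    for processor in processors:
        blocks = processor(blocks)
    return blocks


page = [
    Block("PageHeader", "ACME Quarterly Report"),
    Block("Text", "Revenue grew in Q3."),
    Block("Table", "Quarter\tRevenue\nQ3\t12"),
]
result = run_chain(page, [strip_artifacts, format_tables])
for block in result:
    print(block.block_type, repr(block.text))
```

Because each step touches only the block types it cares about, steps can be added, removed, or reordered without rewriting the others.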

Here's how you'd use Marker for basic conversion:

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.config.parser import ConfigParser

# Build the configuration; an empty dict keeps the defaults
config_parser = ConfigParser({})
config = config_parser.generate_config_dict()

# Load the models (weights download on first run)
models = create_model_dict()

# Convert a PDF
converter = PdfConverter(
    config=config,
    artifact_dict=models,
    processor_list=None,  # None selects the default processors
)

rendered = converter("technical_report.pdf")

# Access structured output
markdown = rendered.markdown
meta = rendered.metadata  # Document-level metadata
images = rendered.images  # Extracted images as a dict keyed by name
The real architectural innovation is the hybrid LLM mode, activated with --use_llm. This combines the speed of the CV/OCR pipeline with LLM-based refinement for edge cases. Instead of sending entire documents to an LLM (expensive), Marker uses the vision pipeline first, then selectively applies LLM processing to problematic regions—merged table cells, inline mathematical notation, or complex forms. You can configure this with custom prompts:

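from marker.config.parser import ConfigParser

# Key names follow the options in Marker's README at the time of writing;
# verify them against your installed version.
config_parser = ConfigParser({
    "use_llm": True,
    # Full class path of the LLM backend. Gemini is the default;
    # this selects a local Ollama instance instead.
    "llm_service": "marker.services.ollama.OllamaService",
    # Optional prompt used when the LLM corrects blocks
    "block_correction_prompt": "Analyze this table region and merge cells that span multiple rows/columns..."
})
config = config_parser.generate_config_dict()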
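The cost argument behind the hybrid mode can be sketched in a few lines. This is an illustration of selective refinement in general, not Marker's implementation: the `Region` type, the confidence score, the threshold, and the `llm` callable are all assumptions made for the example, and Marker's actual selection criteria may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Region:
    kind: str          # e.g. "text", "table", "inline_math", "form"
    text: str
    confidence: float  # hypothetical score from the vision stage


def refine_selectively(
    regions: List[Region],
    llm: Callable[[str], str],
    threshold: float = 0.8,
    hard_kinds: Tuple[str, ...] = ("table", "inline_math", "form"),
) -> Tuple[List[Region], int]:
    """Send only low-confidence regions of hard kinds to the LLM; pass the rest through."""
    refined, llm_calls = [], 0
    for region in regions:
        if region.kind in hard_kinds and region.confidence < threshold:
            refined.append(Region(region.kind, llm(region.text), region.confidence))
            llm_calls += 1
        else:
            refined.append(region)
    return refined, llm_calls


def fake_llm(text: str) -> str:
    """Stand-in for a real model call: just collapses whitespace."""
    return " ".join(text.split())


regions = [
    Region("text", "Hello", 0.99),
    Region("table", "a  |  b", 0.55),
    Region("table", "c | d", 0.95),
]
refined, calls = refine_selectively(regions, fake_llm)
print(calls, [r.text for r in refined])
```

Only one of the three regions reaches the model here, which is where the savings over whole-document LLM processing come from.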

The extensibility goes deeper. You can inject custom processors into the pipeline to handle domain-specific requirements. Say you're processing legal documents with citation patterns that need special handling:

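import re

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.processors import BaseProcessor
from marker.schema import BlockTypes

class CitationProcessor(BaseProcessor):
    block_types = (BlockTypes.Text,)
    # Illustrative pattern for citations such as "410 U.S. 113"
    citation_re = re.compile(r"\b\d+ U\.S\. \d+\b")

    def __call__(self, document):
        # Assumes the document-based processor signature of recent Marker
        # releases; check BaseProcessor in your installed version
        for page in document.pages:
            for block in page.contained_blocks(document, self.block_types):
                text = block.raw_text(document)
                if self.citation_re.search(text):
                    self.handle_citation(block, text)

    def handle_citation(self, block, text):
        # Placeholder: tag or reformat the block as your domain requires
        pass

# Add to processor list. PdfConverter resolves processors from import paths,
# so the class must be importable; "my_processors" is a hypothetical module name.
default_paths = [f"{p.__module__}.{p.__name__}" for p in PdfConverter.default_processors]
converter = PdfConverter(
    artifact_dict=create_model_dict(),
    processor_list=["my_processors.CitationProcessor", *default_paths],
)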

Throughput depends heavily on hardware. On CPU, expect 1-2 pages per second for typical documents. A consumer GPU (RTX 3090) pushes this to 5-8 pages/second. The architecture does best with batch processing on enterprise hardware: the project's documentation projects 25 pages/second on an H100 GPU when documents are processed in parallel batches. Note that this is a projection rather than a measured single-document speed. It relies on PyTorch's CUDA support and batched inference across the vision models.
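A quick back-of-the-envelope calculation puts these rates in perspective. The sketch below uses only figures quoted in this article (midpoints of the CPU and RTX 3090 ranges, the 25 pages/second H100 projection, and the 200 million pages per week production volume) and assumes perfect utilization with no queueing or I/O overhead, so treat the results as lower bounds.

```python
import math

SECONDS_PER_WEEK = 7 * 24 * 3600


def hours_to_process(pages: int, pages_per_second: float) -> float:
    """Wall-clock hours for one device to process `pages` at a steady rate."""
    return pages / pages_per_second / 3600


def devices_needed(pages_per_week: int, pages_per_second: float) -> int:
    """Devices required to sustain a weekly volume at a given per-device rate."""
    sustained_rate = pages_per_week / SECONDS_PER_WEEK
    return math.ceil(sustained_rate / pages_per_second)


# Midpoints of the ranges quoted above; the H100 figure is a projection
rates = {"CPU": 1.5, "RTX 3090": 6.5, "H100 (batched)": 25.0}

for label, rate in rates.items():
    print(f"{label}: {hours_to_process(1_000_000, rate):.1f} hours per million pages")

print(devices_needed(200_000_000, 25.0), "H100s to sustain 200M pages/week")
```

At 25 pages/second, a million pages takes about 11 hours on one device, and 200 million pages per week works out to an average of roughly 331 pages/second across the fleet.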

The output format flexibility matters for real applications. Beyond markdown, you can request JSON with complete structural metadata (bounding boxes, confidence scores, element hierarchies) or HTML with preserved styling. This makes Marker suitable not just for RAG pipelines but for document preview systems, accessibility tools, or data extraction workflows where you need programmatic access to document structure.
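Programmatic access to structure usually means walking a tree of blocks. The sketch below runs on a hand-written sample rather than real Marker output; the field names (`block_type`, `bbox`, `children`) are assumptions about the JSON layout and should be checked against what your version actually emits.

```python
from typing import Any, Dict, Iterator, Tuple

# Hand-written sample; real output will contain more fields and deeper nesting
sample: Dict[str, Any] = {
    "block_type": "Document",
    "children": [
        {
            "block_type": "Page",
            "bbox": [0, 0, 612, 792],
            "children": [
                {"block_type": "SectionHeader", "bbox": [72, 72, 540, 96], "children": None},
                {"block_type": "Table", "bbox": [72, 200, 540, 400], "children": None},
            ],
        }
    ],
}


def iter_blocks(node: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Depth-first walk that yields (depth, block) for every node in the tree."""
    yield depth, node
    for child in node.get("children") or []:
        yield from iter_blocks(child, depth + 1)


table_boxes = [n["bbox"] for _, n in iter_blocks(sample) if n["block_type"] == "Table"]
max_depth = max(depth for depth, _ in iter_blocks(sample))
print(table_boxes, max_depth)
```

The same walk can feed a preview overlay (draw each `bbox`) or an extraction job (collect every table), which is the kind of use the JSON format enables beyond plain markdown.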

Gotcha

The licensing model will stop many commercial users in their tracks. While the code is GPL-3.0, the model weights use a modified OpenRAIL-M license that's free only for research, personal use, and startups under $2M revenue. Any larger commercial use requires a paid license from datalab.to. This dual-licensing approach means you can't just clone the repo and deploy it in your SaaS product without legal review—something that's bitten teams who assumed "open source" meant unrestricted commercial use.

Resource requirements are another consideration. Marker runs on CPU, but at the 1-2 pages per second quoted above that is roughly 10-20x slower than the batched H100 projection, and still several times slower than a consumer GPU. The full feature set with LLM mode requires external API access to Gemini or a local Ollama instance, adding infrastructure complexity. Memory usage scales with document complexity; processing a 500-page technical manual with many images can easily consume 8GB+ RAM. The managed platform (datalab.to) handles this infrastructure, but it introduces a vendor dependency that some teams want to avoid. The LLM features, while powerful, are still maturing: inline math processing is marked beta, and some edge cases with complex table merging still require manual review.

Verdict

Use if: You're building RAG systems, document search, or data extraction pipelines that need accurate structure preservation from PDFs at scale. Marker excels with technical documents (academic papers, reports, manuals) containing equations, tables, and complex layouts where simpler parsers fail. It fits best when the GPL and model-weight licensing works for you (internal tools, research projects, or early-stage startups) and when you have GPU resources available or can justify the managed platform cost. The extensible processor architecture is invaluable if you need domain-specific customization.

Skip if: You're doing simple text extraction from clean PDFs (PyMuPDF or pdfplumber will be faster and lighter), can't meet the commercial licensing requirements, or need guaranteed processing of every edge case without review (no tool is perfect, and Marker's LLM features are still evolving). Also skip if zero-dependency deployment is critical: the PyTorch and model weight requirements make this a heavyweight solution.
