Marker: How Computer Vision and LLMs Team Up to Parse PDFs at 25 Pages Per Second
Hook
Most PDF parsers force you to choose between speed and accuracy. Marker’s hybrid architecture combines computer vision models with optional LLM correction to deliver both—processing 25 pages per second on an H100 while outperforming services that cost 4x more.
Context
Anyone who’s built a RAG pipeline or document processing system knows the pain: PDFs are a presentation format, not a data format. Text doesn’t flow in reading order. Tables fragment across columns. Equations render as gibberish. Multi-column layouts confuse extraction tools into producing nonsense. Commercial services like Mathpix and LlamaParse emerged to solve this with proprietary ML models, but they’re expensive ($0.10+ per page) and lock you into their infrastructure. Pure open-source alternatives like PyMuPDF extract text fast but struggle with anything beyond single-column documents; they can’t understand layout semantics. Marker enters this space with a different approach: an extensible processor pipeline built on custom computer vision models (including Surya OCR), optionally enhanced with LLM correction for complex structures. It’s designed for developers who need to convert thousands of technical documents—research papers, textbooks, financial reports—into clean markdown or structured JSON without paying per-page fees or sacrificing accuracy on equations and tables.
Technical Insight
Marker’s architecture is a multi-stage pipeline that processes documents through specialized processors. Each processor handles one aspect of document understanding: layout detection, table extraction, equation recognition, code block formatting, header/footer removal. This modular design means you can inject custom logic between stages without rewriting core functionality. The pipeline uses PyTorch models running on GPU, CPU, or Apple’s MPS, with the heavy lifting done by Surya—a custom OCR system optimized for document structure detection.
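The modular pipeline can be pictured as a chain of functions that each transform a shared document object. This is a toy sketch of the pattern only, not Marker’s actual classes; the processor names and document shape below are invented for illustration:

```python
# Sketch of a processor-chain pipeline (hypothetical names, not Marker's API):
# each processor takes a document dict, annotates or mutates it, and passes it on.
from typing import Callable

Document = dict
Processor = Callable[[Document], Document]

def detect_layout(doc: Document) -> Document:
    # Stand-in for a CV layout model: wrap each raw page in a typed block.
    doc["blocks"] = [{"type": "text", "text": t} for t in doc["raw_pages"]]
    return doc

def strip_page_numbers(doc: Document) -> Document:
    # Stand-in for artifact removal: drop blocks that are bare page numbers.
    doc["blocks"] = [b for b in doc["blocks"] if not b["text"].strip().isdigit()]
    return doc

def run_pipeline(doc: Document, processors: list[Processor]) -> Document:
    for proc in processors:  # custom processors slot in anywhere in the chain
        doc = proc(doc)
    return doc

doc = run_pipeline({"raw_pages": ["Intro text", "42", "Conclusion"]},
                   [detect_layout, strip_page_numbers])
print([b["text"] for b in doc["blocks"]])  # ['Intro text', 'Conclusion']
```

The design point is that each stage only needs to agree on the document representation, which is what lets you inject custom logic without touching the core stages.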
Here’s how you’d convert a PDF with basic CLI usage:
marker_single research_paper.pdf \
--output_format markdown \
--force_ocr # Use OCR for inline math formatting
The --force_ocr flag is particularly interesting. Even “digital” PDFs often embed poorly formatted text, especially in equations. Forcing OCR means Marker re-reads the document visually and converts inline math to proper LaTeX, producing cleaner output than trusting embedded text streams.
The hybrid LLM mode takes this further. When you pass --use_llm, Marker runs its standard CV pipeline first, then sends problematic sections to an LLM (Gemini 2.0 Flash by default, or any Ollama model) for correction. This isn’t naive “dump the whole PDF into GPT”—it’s surgical. The LLM sees only the sections where CV models are uncertain: fragmented tables that span pages, inline math notation, form field extraction. According to their benchmarks, this hybrid approach scores higher on table extraction than pure CV or pure LLM approaches, because it combines the speed and structure understanding of vision models with the semantic reasoning of language models.
marker_single research_paper.pdf \
--use_llm \
--output_format json \
--page_range "5-10"
This command processes only pages 5-10, uses LLM enhancement, and outputs structured JSON instead of markdown. Based on the codebase, the JSON output appears to include the document’s hierarchical structure, which is useful if you’re building a search index or training downstream models.
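If you consume the JSON output, a recursive walk over nested blocks is the natural access pattern. The field names below (`block_type`, `children`, `text`) are assumptions for illustration; check the schema your Marker version actually emits:

```python
import json

# Hypothetical shape for hierarchical JSON output: nested blocks, each with
# a block_type, optional text, and children. Field names are assumptions.
sample = json.loads("""
{"block_type": "Document", "text": "", "children": [
  {"block_type": "SectionHeader", "text": "1. Introduction", "children": []},
  {"block_type": "Table", "text": "| a | b |", "children": []}
]}
""")

def collect(block, wanted, out=None):
    """Depth-first walk collecting every block of the given type."""
    out = [] if out is None else out
    if block.get("block_type") == wanted:
        out.append(block)
    for child in block.get("children", []):
        collect(child, wanted, out)
    return out

tables = collect(sample, "Table")
print(len(tables))  # 1
```

A walk like this is all it takes to, say, route every table into a search index while skipping boilerplate blocks.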
The processor chain extensibility is where Marker shines for production use. You can write custom processors that hook into the pipeline using the --processors flag by providing their full module paths. For example, if you’re processing financial documents and need to extract specific form fields, you can inject a processor that runs regex patterns or calls a fine-tuned model on detected form regions. The README mentions beta support for structured extraction via JSON schemas, which integrates with the LLM mode.
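The financial-forms example might boil down to logic like the following. This is a standalone sketch of the idea only; Marker’s real processors subclass its own base class and operate on its document objects, which this deliberately omits:

```python
import re

# Sketch of form-field extraction logic a custom processor could inject
# (hypothetical, not Marker's processor API): regex over detected text blocks.
FIELD_RE = re.compile(r"(?P<label>[A-Za-z ]+):\s*(?P<value>\S.*)")

def extract_form_fields(text_blocks):
    """Pull label/value pairs out of text blocks from detected form regions."""
    fields = {}
    for block in text_blocks:
        m = FIELD_RE.match(block)
        if m:
            fields[m.group("label").strip()] = m.group("value").strip()
    return fields

blocks = ["Account Number: 123-456", "Date: 2024-01-15", "Narrative paragraph."]
print(extract_form_fields(blocks))
# {'Account Number': '123-456', 'Date': '2024-01-15'}
```

In a real deployment you would wrap logic like this in a processor class and pass its module path via --processors, so it runs inside the pipeline rather than as a post-processing script.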
Batch mode is critical for throughput. When you process multiple documents using the marker command on a folder, Marker batches inference calls across them, saturating GPU utilization. The claimed 25 pages/second on an H100 assumes batch processing; single-document conversion is slower because you pay GPU kernel launch overhead on every operation. This makes Marker ideal for overnight batch jobs processing document archives, but a poor fit for real-time web API scenarios where users upload one PDF and expect instant results.
One subtle detail: Marker removes headers, footers, and page numbers automatically. Most PDF parsers include this junk in output, forcing you to write brittle regex cleanup. Marker’s layout detection identifies these artifacts by position and repetition patterns, filtering them out. For textbooks and papers with complex formatting, this alone saves hours of post-processing work.
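The repetition heuristic is easy to illustrate outside Marker. This is a minimal sketch of the general technique, not Marker’s implementation: normalize digits so “Page 1” and “Page 2” count as the same line, then drop first and last lines that recur on most pages:

```python
import re
from collections import Counter

def _norm(line):
    # Treat "Page 1" and "Page 2" as the same line for repetition counting.
    return re.sub(r"\d+", "#", line)

def strip_repeated_lines(pages, threshold=0.6):
    """Drop first/last lines that repeat (modulo digits) on most pages."""
    n = len(pages)
    firsts = Counter(_norm(p[0]) for p in pages if p)
    lasts = Counter(_norm(p[-1]) for p in pages if p)
    junk = {line for line, c in (firsts + lasts).items() if c / n >= threshold}
    return [[ln for ln in page if _norm(ln) not in junk] for page in pages]

pages = [
    ["ACME Quarterly Report", "Revenue grew 12%.", "Page 1"],
    ["ACME Quarterly Report", "Costs fell 3%.", "Page 2"],
    ["ACME Quarterly Report", "Outlook is stable.", "Page 3"],
]
cleaned = strip_repeated_lines(pages)
print(cleaned[0])  # ['Revenue grew 12%.']
```

Marker layers positional information from its layout model on top of this kind of frequency signal, which is why it can catch headers that a pure text heuristic would miss.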
Gotcha
Marker’s licensing is a landmine. The model weights use AI Pubs Open Rail-M with commercial restrictions: free for research and startups under $2M revenue/funding, but requires paid licensing beyond that. The code itself is GPL, which means if you integrate Marker into a closed-source product, you may need to open-source your integration layer or negotiate a commercial exception. This isn’t unusual for ML tools, but it’s stricter than Apache-2.0 alternatives like Unstructured.io or IBM’s Docling. If you’re at a mid-size company building a proprietary document pipeline, budget for licensing fees or plan to keep Marker behind an API boundary where GPL doesn’t propagate.
The LLM hybrid mode adds latency and cost. Gemini Flash API calls aren’t free, and even fast models add 2-5 seconds per page when they’re invoked. Marker tries to minimize this by only calling the LLM on complex sections, but a 100-page document with lots of tables could rack up noticeable API costs. The Ollama integration helps if you can self-host an LLM, but then you’re managing another service and need enough VRAM for both Marker’s CV models and the language model—easily 24GB+ for good results. For many use cases, the standalone CV mode is accurate enough, and you should benchmark whether the LLM uplift justifies the complexity.
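A back-of-envelope model makes the cost question concrete. Every number below is an illustrative assumption, not published pricing or a measured LLM-invocation rate:

```python
# Back-of-envelope cost/latency model for hybrid mode. All inputs are
# assumptions for illustration; plug in your own measurements.
pages = 100
llm_fraction = 0.25          # assumed share of pages with tables/equations sent to the LLM
seconds_per_llm_call = 3.5   # midpoint of the 2-5 s per-page range above
cost_per_call = 0.002        # assumed dollars per Gemini Flash call

llm_calls = int(pages * llm_fraction)
added_latency = llm_calls * seconds_per_llm_call
added_cost = llm_calls * cost_per_call
print(f"{llm_calls} LLM calls, +{added_latency:.0f}s, +${added_cost:.2f}")
```

Even with cheap per-call pricing, the latency term dominates for interactive use, which is the real argument for benchmarking the CV-only mode first.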
Force OCR has a trade-off: it improves math formatting but slows processing significantly because every text region gets re-recognized. If your PDFs are clean digital documents without equations, skip it. The --strip_existing_ocr option exists because some PDFs have both good digital text and bad OCR embedded (usually from scanning printed pages)—you want to keep the digital text but discard the OCR garbage. Figuring out the right OCR flags for your document corpus takes experimentation.
Verdict
Use Marker if you’re building document intelligence features for a startup (under the revenue threshold), processing research papers or technical docs at scale, or need markdown conversion quality that beats commercial APIs without recurring per-page fees. The hybrid LLM mode is worth enabling when you’re handling complex tables or equations and accuracy matters more than speed—think legal discovery or scientific literature indexing. The extensible processor chain makes it ideal for custom document types where you need to inject domain-specific logic. Skip it if you need permissive licensing for a commercial product over $2M (evaluate Docling or pay for Marker’s commercial license instead), are processing simple single-column documents where PyMuPDF would suffice, or want a zero-infrastructure SaaS solution—Marker requires hosting Python services and managing GPU resources. Also skip if you need sub-second conversion times for user-facing features; the batch throughput is impressive but single-document latency is measured in seconds, not milliseconds.