
Building a Notion OCR Pipeline with Azure Computer Vision: A Polling Architecture Breakdown

Hook

Notion stores over 20 million images uploaded by users, yet offers no native way to make the text inside them searchable. For anyone building a knowledge base from scanned documents, this isn’t just inconvenient—it’s a dealbreaker.

Context

The promise of Notion as a unified workspace breaks down the moment you upload a screenshot of a code snippet, a photo of handwritten notes, or a scanned receipt. That text becomes locked inside pixels, unsearchable and unindexable. While cloud providers like Microsoft, Google, and AWS have solved optical character recognition at scale, Notion—despite its powerful API—hasn’t integrated these capabilities natively.

This gap creates friction for researchers archiving papers, developers collecting visual documentation, and anyone managing receipt-heavy expense workflows. The manual alternative—transcribing text yourself or using external OCR tools then copy-pasting results—scales poorly. MichielvanBeers/notion-auto-ocr addresses this by creating an automation bridge: a Python service that monitors your Notion database, detects images marked for processing, sends them to Azure’s Computer Vision API, and writes extracted text back into your pages. It’s narrow in scope but solves a real workflow pain point for the subset of users who need automated text extraction without leaving their Notion environment.

Technical Insight

[System architecture diagram, auto-generated: a polling loop queries the Notion database at a configurable interval for new or checked pages; a marker scanner finds 'ocr_text' markers in the returned pages' image blocks; an image extractor pulls the image data; and Azure Computer Vision OCR returns the recognized text to the service.]

The architecture centers on a polling loop that queries Notion’s API at configurable intervals, checking for images that need OCR processing. The tool uses a marker-based detection system: it scans image blocks for specific text patterns—either ‘ocr_text’ in the image caption or as placeholder text—to determine what should be processed. This explicit opt-in approach prevents accidental processing of decorative images or diagrams where OCR would be meaningless.

The scanning trigger mechanism offers two modes. In checkbox mode, you add a property to your Notion database (typically called ‘OCR’ or similar), and the service only processes pages where this box is checked. In timestamp mode, it uses Notion’s ‘Created time’ property to process new entries since the last run. Here’s how the timestamp-based filtering works in practice:

# Simplified version of the scanning logic
def get_pages_to_process(notion_client, database_id, last_run_time):
    # Only fetch pages created after the previous run
    query_filter = {
        "timestamp": "created_time",
        "created_time": {
            "after": last_run_time.isoformat()
        }
    }

    # NOTE: Notion paginates query results; production code should
    # follow `next_cursor` until `has_more` is False
    results = notion_client.databases.query(
        database_id=database_id,
        filter=query_filter
    )

    pages_with_ocr_markers = []
    for page in results['results']:
        blocks = notion_client.blocks.children.list(block_id=page['id'])
        for block in blocks['results']:
            if block['type'] == 'image':
                # Caption is a list of rich-text objects; join their plain text
                caption = block['image'].get('caption', [])
                caption_text = ''.join(t['plain_text'] for t in caption)
                if 'ocr_text' in caption_text.lower():
                    pages_with_ocr_markers.append((page, block))

    return pages_with_ocr_markers
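Checkbox mode swaps the timestamp filter for a property filter. A minimal sketch of building that filter, assuming the database defines a checkbox property named 'OCR' (the actual property name is whatever you configure):

```python
def build_checkbox_filter(property_name="OCR"):
    """Build a Notion query filter matching pages whose checkbox is ticked.

    The 'OCR' property name is an assumption; use whatever checkbox
    property your database actually defines.
    """
    return {
        "property": property_name,
        "checkbox": {"equals": True}
    }

# Usage with the same client as above:
# results = notion_client.databases.query(
#     database_id=database_id,
#     filter=build_checkbox_filter("OCR"),
# )
```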

Once an image is identified, the service downloads it from Notion’s CDN (or from external URLs if the image is externally hosted), then sends it to Azure’s Computer Vision API endpoint. The Azure SDK handles authentication via subscription keys and returns structured JSON containing detected text regions, bounding boxes, and confidence scores. The tool extracts the raw text and appends it to the Notion page as a new text block.
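The Azure call itself is asynchronous: you submit the image, receive an Operation-Location header, and poll that URL until the analysis completes. A hedged sketch against the public Read 3.2 REST endpoint using only the standard library (the repo uses the Azure SDK, which wraps this same flow; endpoint and key come from your own Azure resource):

```python
import json
import time
import urllib.request

def operation_id_from_location(operation_location):
    """The Operation-Location header ends in the operation's ID."""
    return operation_location.rstrip("/").rsplit("/", 1)[-1]

def read_image_text(endpoint, subscription_key, image_url):
    """Submit an image URL to the Read API and poll for the result."""
    submit = urllib.request.Request(
        f"{endpoint}/vision/v3.2/read/analyze",
        data=json.dumps({"url": image_url}).encode(),
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(submit) as resp:
        result_url = resp.headers["Operation-Location"]

    while True:  # poll until Azure finishes the async analysis
        poll = urllib.request.Request(
            result_url,
            headers={"Ocp-Apim-Subscription-Key": subscription_key},
        )
        with urllib.request.urlopen(poll) as resp:
            body = json.load(resp)
        if body["status"] in ("succeeded", "failed"):
            break
        time.sleep(1)

    # Flatten the detected lines into a single text blob for the Notion block
    lines = []
    for page in body.get("analyzeResult", {}).get("readResults", []):
        lines.extend(line["text"] for line in page["lines"])
    return "\n".join(lines)
```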

The Docker containerization is straightforward but effective. The Dockerfile is a simple single-stage build: a Python 3.9 slim base image, pip installation of dependencies (notion-client, azure-cognitiveservices-vision-computervision, python-dotenv), and environment variable configuration for API keys and database IDs. The container runs as a long-lived process when configured for continuous polling, or as a one-shot job for manual triggering:

# Simplified Dockerfile structure
FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENV NOTION_TOKEN="" \
    NOTION_DATABASE_ID="" \
    AZURE_ENDPOINT="" \
    AZURE_SUBSCRIPTION_KEY="" \
    POLLING_INTERVAL="300"

CMD ["python", "main.py"]

The polling interval is configurable (default 300 seconds), but this introduces inherent latency—images won’t be processed until the next poll cycle completes. For workflows where immediate OCR is critical, this delay matters. The architecture also lacks idempotency guarantees: if the service crashes mid-processing, it might re-process the same images or miss some entirely. A more robust design would track processing state in a separate database or use Notion properties as processing flags.
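One way to add the missing idempotency is to flip a dedicated Notion checkbox once a page has been written back, so a crashed run can safely re-scan without duplicating work. A sketch of the page-update payload, assuming a hypothetical 'Processed' checkbox property (the repo itself tracks no such state):

```python
def build_processed_update(property_name="Processed"):
    """Payload for PATCH /v1/pages/{page_id} marking a page as done.

    The 'Processed' property name is hypothetical; add it to your
    database schema before relying on it as a processing flag.
    """
    return {
        "properties": {
            property_name: {"checkbox": True}
        }
    }

# With notion-client, after appending the OCR text block:
# notion_client.pages.update(page_id=page["id"], **build_processed_update())
```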

Error handling is minimal in the current implementation. Network failures, Azure API rate limits, or malformed Notion responses could cause the polling loop to crash. Production deployments would need retry logic with exponential backoff, dead-letter queues for failed images, and monitoring hooks. The single-file codebase mentioned in the repo’s limitations makes these extensions harder—refactoring into separate modules for Notion interaction, Azure communication, and orchestration logic would improve testability and maintainability.
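A retry wrapper with exponential backoff is the smallest of those hardening steps. A sketch under stated assumptions (the delay schedule and blanket exception handling are illustrative, not the repo's behavior):

```python
import time

def with_retries(fn, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on failure with delays of base_delay * 2**n.

    `sleep` is injectable so the schedule can be tested without waiting.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error to the caller
            sleep(base_delay * (2 ** attempt))

# Usage: text = with_retries(lambda: read_image_text(endpoint, key, url))
```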

Gotcha

The biggest limitation is the mandatory Azure dependency. You can’t run this tool without a Microsoft Azure account and a provisioned Computer Vision resource. While Azure offers a free F0 tier (5,000 transactions per month), you’re still locked into Microsoft’s ecosystem and subject to their API availability, pricing changes, and regional restrictions. If Azure’s Computer Vision service goes down or introduces breaking API changes, your OCR pipeline breaks.

The polling-based architecture is fundamentally inefficient at scale. If you're monitoring a database with thousands of pages, every poll cycle queries Notion's API, retrieves page metadata, and scans blocks, even when nothing has changed. Notion's API enforces rate limits (an average of three requests per second per integration), so aggressive polling could hit throttling. The tool doesn't implement webhook support, which would enable event-driven processing where Notion pushes notifications when pages are created or updated, eliminating wasteful polling and reducing latency to near zero. However, Notion's API doesn't currently expose webhooks, making this limitation partly a platform constraint rather than purely a design choice. The workaround of routing events through third-party automation platforms like Zapier or Make adds another layer of complexity and cost.
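Staying under that rate limit comes down to spacing requests by a minimum interval. A small sketch of the spacing calculation, written as a pure function (timestamps in seconds) so the schedule is easy to test; the repo itself does no such pacing:

```python
def seconds_to_wait(last_request_time, now, max_rps=3.0):
    """How long to sleep before the next API call to stay under max_rps."""
    min_interval = 1.0 / max_rps
    elapsed = now - last_request_time
    return max(0.0, min_interval - elapsed)

# In the polling loop:
#   time.sleep(seconds_to_wait(last_call, time.monotonic()))
#   last_call = time.monotonic()
```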

Verdict

Use if: You maintain a Notion database with regular inflows of scanned documents, screenshots with embedded text, or images from research sources, and you need automated text extraction without manual copy-paste workflows. This tool is ideal for personal knowledge bases, academic research collections, or small-team documentation hubs where you already have Azure credits or can stay within the free tier limits. It's also valuable if you prefer explicit control over which images get processed via marker-based triggers.

Skip if: You only occasionally need OCR (browser extensions or one-off tools like Adobe Acrobat would be simpler), you want to avoid cloud provider dependencies beyond Notion, or you're working with high-volume image processing that would exceed Azure's free tier. Also skip if you need advanced image analysis beyond text extraction: this tool doesn't leverage Azure's object detection, face recognition, or image categorization capabilities. For large-scale deployments, the polling overhead and lack of production-grade error handling make it unsuitable without significant refactoring.
