
Building a Notion OCR Pipeline with Azure Computer Vision: A Polling Architecture Breakdown

Hook

Notion stores over 20 million images uploaded by users, yet offers no native way to make the text inside them searchable. For anyone building a knowledge base from scanned documents, this isn’t just inconvenient—it’s a dealbreaker.

Context

The promise of Notion as a unified workspace breaks down the moment you upload a screenshot of a code snippet, a photo of handwritten notes, or a scanned receipt. That text becomes locked inside pixels, unsearchable and unindexable. While cloud providers like Microsoft, Google, and AWS have solved optical character recognition at scale, Notion—despite its powerful API—hasn’t integrated these capabilities natively.

This gap creates friction for researchers archiving papers, developers collecting visual documentation, and anyone managing receipt-heavy expense workflows. The manual alternative—transcribing text yourself or using external OCR tools then copy-pasting results—scales poorly. MichielvanBeers/notion-auto-ocr addresses this by creating an automation bridge: a Python service that monitors your Notion database, detects images marked for processing, sends them to Azure’s Computer Vision API, and writes extracted text back into your pages. It’s narrow in scope but solves a real workflow pain point for the subset of users who need automated text extraction without leaving their Notion environment.

Technical Insight

[System architecture diagram, auto-generated: a polling loop queries the Notion database at a configurable interval for new or checked pages; a marker scanner finds 'ocr_text' markers in the returned pages' image blocks; an image extractor pulls the image data; and Azure Computer Vision OCR returns the recognized text to the service.]

The architecture centers on a polling loop that queries Notion’s API at configurable intervals, checking for images that need OCR processing. The tool uses a marker-based detection system: it scans image blocks for specific text patterns—either ‘ocr_text’ in the image caption or as placeholder text—to determine what should be processed. This explicit opt-in approach prevents accidental processing of decorative images or diagrams where OCR would be meaningless.

The scanning trigger mechanism offers two modes. In checkbox mode, you add a property to your Notion database (typically called ‘OCR’ or similar), and the service only processes pages where this box is checked. In timestamp mode, it uses Notion’s ‘Created time’ property to process new entries since the last run. Here’s how the timestamp-based filtering works in practice:

# Simplified version of the scanning logic
def get_pages_to_process(notion_client, database_id, last_run_time):
    # Only fetch pages created after the previous run
    query_filter = {
        "timestamp": "created_time",
        "created_time": {
            "after": last_run_time.isoformat()
        }
    }

    # NOTE: Notion paginates query results; production code should
    # follow `next_cursor` until `has_more` is False
    results = notion_client.databases.query(
        database_id=database_id,
        filter=query_filter
    )

    pages_with_ocr_markers = []
    for page in results['results']:
        blocks = notion_client.blocks.children.list(block_id=page['id'])
        for block in blocks['results']:
            if block['type'] == 'image':
                # Caption is a list of rich-text objects; join their plain text
                caption = block['image'].get('caption', [])
                caption_text = ''.join(t['plain_text'] for t in caption)
                if 'ocr_text' in caption_text.lower():
                    pages_with_ocr_markers.append((page, block))

    return pages_with_ocr_markers
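Checkbox mode swaps the timestamp filter for a property filter. A minimal sketch of building that filter, assuming the database defines a checkbox property named 'OCR' (the actual property name is whatever you configure):

```python
def build_checkbox_filter(property_name="OCR"):
    """Build a Notion query filter matching pages whose checkbox is ticked.

    The 'OCR' property name is an assumption; use whatever checkbox
    property your database actually defines.
    """
    return {
        "property": property_name,
        "checkbox": {"equals": True}
    }

# Usage with the same client as above:
# results = notion_client.databases.query(
#     database_id=database_id,
#     filter=build_checkbox_filter("OCR"),
# )
```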

Once an image is identified, the service downloads it from Notion’s CDN (or from external URLs if the image is externally hosted), then sends it to Azure’s Computer Vision API endpoint. The Azure SDK handles authentication via subscription keys and returns structured JSON containing detected text regions, bounding boxes, and confidence scores. The tool extracts the raw text and appends it to the Notion page as a new text block.
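The Azure call itself is asynchronous: you submit the image, receive an Operation-Location header, and poll that URL until the analysis completes. A hedged sketch against the public Read 3.2 REST endpoint using only the standard library (the repo uses the Azure SDK, which wraps this same flow; endpoint and key come from your own Azure resource):

```python
import json
import time
import urllib.request

def operation_id_from_location(operation_location):
    """The Operation-Location header ends in the operation's ID."""
    return operation_location.rstrip("/").rsplit("/", 1)[-1]

def read_image_text(endpoint, subscription_key, image_url):
    """Submit an image URL to the Read API and poll for the result."""
    submit = urllib.request.Request(
        f"{endpoint}/vision/v3.2/read/analyze",
        data=json.dumps({"url": image_url}).encode(),
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(submit) as resp:
        result_url = resp.headers["Operation-Location"]

    while True:  # poll until Azure finishes the async analysis
        poll = urllib.request.Request(
            result_url,
            headers={"Ocp-Apim-Subscription-Key": subscription_key},
        )
        with urllib.request.urlopen(poll) as resp:
            body = json.load(resp)
        if body["status"] in ("succeeded", "failed"):
            break
        time.sleep(1)

    # Flatten the detected lines into a single text blob for the Notion block
    lines = []
    for page in body.get("analyzeResult", {}).get("readResults", []):
        lines.extend(line["text"] for line in page["lines"])
    return "\n".join(lines)
```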

The Docker containerization is straightforward but effective. The Dockerfile is a simple single-stage build: a Python 3.9 slim base image, pip installation of dependencies (notion-client, azure-cognitiveservices-vision-computervision, python-dotenv), and environment variable configuration for API keys and database IDs. The container runs as a long-lived process when configured for continuous polling, or as a one-shot job for manual triggering:

# Simplified Dockerfile structure
FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENV NOTION_TOKEN="" \
    NOTION_DATABASE_ID="" \
    AZURE_ENDPOINT="" \
    AZURE_SUBSCRIPTION_KEY="" \
    POLLING_INTERVAL="300"

CMD ["python", "main.py"]

The polling interval is configurable (default 300 seconds), but this introduces inherent latency—images won’t be processed until the next poll cycle completes. For workflows where immediate OCR is critical, this delay matters. The architecture also lacks idempotency guarantees: if the service crashes mid-processing, it might re-process the same images or miss some entirely. A more robust design would track processing state in a separate database or use Notion properties as processing flags.
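One way to add the missing idempotency is to flip a dedicated Notion checkbox once a page has been written back, so a crashed run can safely re-scan without duplicating work. A sketch of the page-update payload, assuming a hypothetical 'Processed' checkbox property (the repo itself tracks no such state):

```python
def build_processed_update(property_name="Processed"):
    """Payload for PATCH /v1/pages/{page_id} marking a page as done.

    The 'Processed' property name is hypothetical; add it to your
    database schema before relying on it as a processing flag.
    """
    return {
        "properties": {
            property_name: {"checkbox": True}
        }
    }

# With notion-client, after appending the OCR text block:
# notion_client.pages.update(page_id=page["id"], **build_processed_update())
```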

Error handling is minimal in the current implementation. Network failures, Azure API rate limits, or malformed Notion responses could cause the polling loop to crash. Production deployments would need retry logic with exponential backoff, dead-letter queues for failed images, and monitoring hooks. The single-file codebase mentioned in the repo’s limitations makes these extensions harder—refactoring into separate modules for Notion interaction, Azure communication, and orchestration logic would improve testability and maintainability.
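A retry wrapper with exponential backoff is the smallest of those hardening steps. A sketch under stated assumptions (the delay schedule and blanket exception handling are illustrative, not the repo's behavior):

```python
import time

def with_retries(fn, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on failure with delays of base_delay * 2**n.

    `sleep` is injectable so the schedule can be tested without waiting.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error to the caller
            sleep(base_delay * (2 ** attempt))

# Usage: text = with_retries(lambda: read_image_text(endpoint, key, url))
```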

Gotcha

The biggest limitation is the mandatory Azure dependency. You can’t run this tool without a Microsoft Azure account and a provisioned Computer Vision resource. While Azure offers a free F0 tier (5,000 transactions per month), you’re still locked into Microsoft’s ecosystem and subject to their API availability, pricing changes, and regional restrictions. If Azure’s Computer Vision service goes down or introduces breaking API changes, your OCR pipeline breaks.

The polling-based architecture is fundamentally inefficient at scale. If you're monitoring a database with thousands of pages, every poll cycle queries Notion's API, retrieves page metadata, and scans blocks, even when nothing has changed. Notion's API enforces rate limits (an average of three requests per second per integration), so aggressive polling could hit throttling. The tool doesn't implement webhook support, which would enable event-driven processing where Notion pushes notifications when pages are created or updated, eliminating wasteful polling and reducing latency to near zero. However, Notion's API doesn't currently expose webhooks, making this limitation partly a platform constraint rather than purely a design choice. The workaround of routing events through third-party automation platforms like Zapier or Make adds another layer of complexity and cost.
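Staying under that rate limit comes down to spacing requests by a minimum interval. A small sketch of the spacing calculation, written as a pure function (timestamps in seconds) so the schedule is easy to test; the repo itself does no such pacing:

```python
def seconds_to_wait(last_request_time, now, max_rps=3.0):
    """How long to sleep before the next API call to stay under max_rps."""
    min_interval = 1.0 / max_rps
    elapsed = now - last_request_time
    return max(0.0, min_interval - elapsed)

# In the polling loop:
#   time.sleep(seconds_to_wait(last_call, time.monotonic()))
#   last_call = time.monotonic()
```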

Verdict

Use if: You maintain a Notion database with regular inflows of scanned documents, screenshots with embedded text, or images from research sources, and you need automated text extraction without manual copy-paste workflows. This tool is ideal for personal knowledge bases, academic research collections, or small-team documentation hubs where you already have Azure credits or can stay within the free tier limits. It's also valuable if you prefer explicit control over which images get processed via marker-based triggers.

Skip if: You only occasionally need OCR (browser extensions or one-off tools like Adobe Acrobat would be simpler), you want to avoid cloud provider dependencies beyond Notion, or you're working with high-volume image processing that would exceed Azure's free tier. Also skip if you need advanced image analysis beyond text extraction: this tool doesn't leverage Azure's object detection, face recognition, or image categorization capabilities. For large-scale deployments, the polling overhead and lack of production-grade error handling make it unsuitable without significant refactoring.
