Dangerzone: Converting Untrusted Documents Through Pixel Buffers
Hook
Every PDF reader has vulnerabilities: Adobe Acrobat alone had 244 CVEs between 2019 and 2023. Dangerzone's answer? Stop trusting the parser. Render dangerous documents inside a sandbox and let only pixels out.
Context
Document-based exploits remain a primary vector for targeted attacks. Journalists receive leaked PDFs from anonymous sources, legal teams process submissions from adversarial parties, security researchers analyze malware samples, and activists handle documents in high-surveillance environments. Traditional approaches—scanning with antivirus, opening in sandboxed readers, or manually reviewing file internals—all share a fatal flaw: they assume you can safely parse untrusted input to determine if it's malicious.
The PDF specification alone spans over 1,000 pages and supports JavaScript execution, embedded files, launch actions, and font parsing—each a potential exploit surface. Office documents add VBA macros, external data connections, and ActiveX controls. Even image formats like TIFF have historically harbored vulnerabilities in their parsers. Dangerzone, developed by Freedom of the Press Foundation, takes a fundamentally different approach: assume every document is malicious and convert it through a medium that cannot carry executable code—raw pixel data. This isn't sanitization through filtering; it's destruction and reconstruction.
Technical Insight
Dangerzone's architecture splits document conversion into two isolated stages. Stage one executes inside a container with no network access, running under gVisor's application kernel. Depending on the release, this container has used tools such as GraphicsMagick, LibreOffice, and PyMuPDF to convert input documents to PDF format and then render each page to raw RGB pixel data. Stage two, running outside the sandbox, takes this pixel buffer and reconstructs a clean PDF (the simplified sketch below uses Python's PIL and reportlab libraries for illustration), optionally applying Tesseract OCR to restore a searchable text layer.
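To make the hand-off between the two stages concrete, here is a minimal sketch of how a page could cross the sandbox boundary as nothing more than dimensions plus raw RGB bytes. The framing (two big-endian 16-bit integers followed by pixels) and the names `pack_page` and `unpack_page` are assumptions made for illustration, not Dangerzone's documented wire format.

```python
import struct

HEADER = struct.Struct(">HH")  # hypothetical header: width, height as unsigned 16-bit
BYTES_PER_PIXEL = 3            # 8-bit R, G, B


def pack_page(width: int, height: int, pixels: bytes) -> bytes:
    """Sandbox side: frame one rendered page as header + raw RGB bytes."""
    if len(pixels) != width * height * BYTES_PER_PIXEL:
        raise ValueError("pixel buffer does not match dimensions")
    return HEADER.pack(width, height) + pixels


def unpack_page(frame: bytes) -> tuple[int, int, bytes]:
    """Trusted side: accept only a well-formed header and an exact-size buffer."""
    if len(frame) < HEADER.size:
        raise ValueError("frame too short for header")
    width, height = HEADER.unpack_from(frame)
    pixels = frame[HEADER.size:]
    if width == 0 or height == 0:
        raise ValueError("empty page")
    if len(pixels) != width * height * BYTES_PER_PIXEL:
        raise ValueError("unexpected pixel buffer length")
    return width, height, pixels
```

The point of the design is that the trusted side never interprets anything richer than two integers and a byte count it can check arithmetically.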
The gVisor integration is critical here. Standard Docker containers share the host kernel, meaning a container escape exploit could compromise the entire system. gVisor implements Linux system calls in userspace Go code, creating an application kernel that intercepts syscalls before they reach the host. When malicious code inside the container attempts operations like file access or process creation, the request is handled by gVisor's sandboxed kernel, not the host OS. Separately, a December 2023 security audit of Dangerzone by Include Security reported no high-risk findings.
Here's how the pixel conversion pipeline handles a potentially malicious PDF:
# Simplified illustration of the pixel pipeline (not Dangerzone's actual code)
import io
import subprocess

from PIL import Image
from reportlab.lib.utils import ImageReader
from reportlab.pdfgen import canvas


def convert_page_to_pixels(pdf_path, page_num, dpi=150):
    """Stage 1: render one PDF page to raw RGB pixels (runs in the container)."""
    # pdftoppm (poppler-utils) writes a PPM image to stdout when no
    # output prefix is given; PPM data is already 8-bit RGB.
    result = subprocess.run(
        ['pdftoppm', '-f', str(page_num), '-l', str(page_num),
         '-r', str(dpi), pdf_path],
        capture_output=True, check=True,
    )
    # Drop the PPM header so only dimensions and raw pixels leave this stage
    page = Image.open(io.BytesIO(result.stdout)).convert('RGB')
    return page.width, page.height, page.tobytes()


def pixels_to_safe_pdf(pixel_data, width, height, dpi=150):
    """Stage 2: rebuild a PDF from pixels (runs outside the container)."""
    # Image.frombytes raises ValueError if the buffer is too short for the size
    img = Image.frombytes('RGB', (width, height), pixel_data)
    # Size the page in points (1/72 inch) so the image fills it exactly
    page_w, page_h = width * 72 / dpi, height * 72 / dpi
    pdf_buffer = io.BytesIO()
    c = canvas.Canvas(pdf_buffer, pagesize=(page_w, page_h))
    c.drawImage(ImageReader(img), 0, 0, width=page_w, height=page_h)
    c.save()
    return pdf_buffer.getvalue()
This pixel buffer approach is beautifully simple: malicious JavaScript, embedded executables, font exploits, and parser vulnerabilities cannot survive rasterization. The output PDF contains only image objects—no interactive elements, no fonts, no embedded files. It's the visual representation of the document without any of the semantic structure that could harbor exploits.
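One way to sanity-check the "only image objects" claim on an output file is a crude keyword scan in the spirit of tools like pdfid. The sketch below is an assumption-laden illustration: the list of names is not exhaustive, the function name `count_active_names` is made up here, and a byte scan cannot see names hidden inside compressed object streams or written with hex escapes, so a clean result is a smoke test and not a security guarantee.

```python
import re

# PDF name tokens commonly associated with active content (assumed, non-exhaustive)
ACTIVE_NAMES = (b"/JavaScript", b"/JS", b"/Launch", b"/EmbeddedFile", b"/OpenAction", b"/AA")


def count_active_names(pdf_bytes: bytes) -> dict[str, int]:
    """Count occurrences of each suspicious name token in the raw PDF bytes."""
    counts = {}
    for name in ACTIVE_NAMES:
        # Only match the token when it is not the prefix of a longer name
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode()] = len(re.findall(pattern, pdf_bytes))
    return counts
```

Running it on the original and on the converted file gives a quick before-and-after comparison: the original may show non-zero counts, while a pixel-rebuilt PDF should normally show zeros (random matches inside uncompressed image data are possible but unlikely).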
The container configuration is equally paranoid. Dangerzone launches conversion containers with --network=none to prevent data exfiltration, mounts input files read-only, and limits resource consumption through cgroup constraints. The conversion process runs as a non-root user inside the container, and the container itself is ephemeral—destroyed immediately after processing completes. For developers integrating Dangerzone, the CLI interface is straightforward:
# Flags differ between releases; confirm against `dangerzone-cli --help`
# Basic conversion: the result is written next to the input with a "-safe.pdf" suffix
dangerzone-cli suspicious-document.pdf
# Choose the output path and enable OCR for a given language
dangerzone-cli --output-filename safe.pdf --ocr-lang eng untrusted.docx
# Process multiple documents
for doc in *.pdf; do
    dangerzone-cli --output-filename "safe_${doc}" "$doc"
done
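The container hardening described before the CLI examples can be sketched as an argument list for `docker run` (or `podman run`). This is a generic illustration of the technique, not Dangerzone's actual invocation: the function name, the image name in the usage note, the mount path, the user ID, and the resource limits are all placeholders, and `--runtime=runsc` only works where gVisor has been installed and registered with the container engine.

```python
from pathlib import Path


def build_sandbox_command(image: str, input_path: str, runtime: str = "docker",
                          use_gvisor: bool = False) -> list[str]:
    """Return an argv list for a locked-down, throwaway conversion container."""
    host_file = Path(input_path).resolve()
    cmd = [
        runtime, "run",
        "--rm",                      # ephemeral: remove the container on exit
        "--network=none",            # no network, so nothing can be exfiltrated
        "--read-only",               # read-only root filesystem
        "--cap-drop=ALL",            # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "--user", "1000:1000",       # non-root user inside the container (placeholder IDs)
        "--memory", "1g",            # cgroup memory limit (arbitrary value)
        "--pids-limit", "256",       # cgroup process limit (arbitrary value)
        "--volume", f"{host_file}:/input/document:ro",  # input mounted read-only
    ]
    if use_gvisor:
        cmd.append("--runtime=runsc")  # run under gVisor instead of the host kernel
    cmd.append(image)
    return cmd
```

Something like `subprocess.run(build_sandbox_command("example/converter:latest", "leak.pdf"), check=True)` would then launch it; passing the arguments as a list avoids shell quoting problems with hostile file names.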
The OCR stage deserves special attention. After pixel conversion, documents lose their embedded text and become unsearchable image PDFs. Dangerzone optionally runs Tesseract OCR on each rasterized page to reconstruct a text layer. This happens in stage two, outside the container: OCR is computationally expensive, and although Tesseract itself has had security vulnerabilities, at this point it only ever sees rasterized pixels, never the original file. The trade-off is accepting OCR inaccuracies (typically 95-98% accuracy on clean documents) rather than carrying over the original, potentially malicious, embedded text. For threat models involving nation-state adversaries or zero-day exploits, this trade-off is acceptable.
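As a rough sketch of what an OCR pass over one rasterized page looks like, the snippet below shells out to the `tesseract` command-line tool, whose trailing `pdf` config asks for a searchable PDF. The helper names `build_ocr_command` and `ocr_page` are hypothetical, and Dangerzone's own OCR integration may work differently; this only shows the general technique.

```python
import subprocess


def build_ocr_command(image_path: str, output_base: str, lang: str = "eng",
                      dpi: int = 150) -> list[str]:
    """argv for the tesseract CLI; the trailing 'pdf' config emits a searchable PDF."""
    return ["tesseract", image_path, output_base, "-l", lang, "--dpi", str(dpi), "pdf"]


def ocr_page(image_path: str, output_base: str, lang: str = "eng", dpi: int = 150) -> str:
    """Run OCR on one page image and return the path of the PDF tesseract writes."""
    subprocess.run(build_ocr_command(image_path, output_base, lang, dpi), check=True)
    return output_base + ".pdf"
```

The input here is the page image produced from the pixel buffer, so the OCR engine is never handed the untrusted source document.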
The codebase also handles edge cases like password-protected PDFs, multi-page TIFFs, and documents with embedded fonts that may contain exploits. By converting everything through pixels, Dangerzone sidesteps the entire class of parser vulnerabilities. The price is file size—a 500KB vector PDF might become a 5MB rasterized version, though JPEG compression in the output stage mitigates this somewhat.
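The file-size cost follows directly from the arithmetic of raw RGB buffers. A small worked example, assuming US Letter pages (8.5 x 11 inches) and 3 bytes per pixel; the function name is made up for this sketch:

```python
def raw_rgb_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Size in bytes of an uncompressed 8-bit RGB rendering of one page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3


letter_150 = raw_rgb_bytes(8.5, 11, 150)   # 1275 x 1650 pixels
letter_300 = raw_rgb_bytes(8.5, 11, 300)   # 2550 x 3300 pixels
print(letter_150, letter_300)  # 6311250 25245000
```

So a single Letter page at 150 DPI is about 6.3 MB before compression, and doubling the DPI quadruples that. Compression in the output stage is what brings the final PDF back down to a manageable size.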
Gotcha
Pixel conversion is destructive in ways that matter for many workflows. Vector graphics become rasterized bitmaps, losing scalability. Embedded fonts disappear as well: glyphs become pixels, and any text you can select or search afterwards is OCR output rather than the original characters. Interactive PDF forms, fillable fields, digital signatures, and annotations are completely stripped, leaving purely visual output. For documents requiring these features, Dangerzone isn't sanitization; it's format conversion that preserves only appearance.
Performance and file size are genuine concerns. Converting a 100-page PDF at 150 DPI takes 2-3 minutes on modern hardware: the container has to start, and every page then goes through rendering, OCR, and PDF reconstruction. Output files are typically 3-10x larger than vector originals despite compression. For batch processing thousands of documents, you'll need to architect a queuing system around Dangerzone, as concurrent container spawning can exhaust system resources. The project recommends dedicated hardware for high-volume scenarios, and the container overhead makes it impractical for real-time or latency-sensitive applications.
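A minimal version of such a queue is a worker pool with a hard cap on concurrent conversions. The sketch below uses only the standard library; `convert_batch` and the `convert_one` callback are hypothetical names, and in practice `convert_one` would wrap whatever launches a single conversion (for example a subprocess call to the CLI).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def convert_batch(paths: Iterable[str], convert_one: Callable[[str], str],
                  max_workers: int = 2) -> tuple[dict[str, str], dict[str, Exception]]:
    """Run convert_one over paths with at most max_workers conversions in flight."""
    done: dict[str, str] = {}
    failed: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                done[path] = future.result()
            except Exception as exc:  # one bad document must not stop the batch
                failed[path] = exc
    return done, failed
```

Threads are enough here because each worker mostly waits on an external container; `max_workers` is the knob that keeps container spawning within what the machine can handle.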
Verdict
Use if: You're handling documents from untrusted or adversarial sources where document exploits are a realistic threat: journalists receiving anonymous leaks, security researchers analyzing malware, legal teams processing external submissions, or organizations with sophisticated adversaries. Also use it when operating in air-gapped environments or when you need audited, open-source document sanitization that never exposes your own machine's parsers to potentially malicious files.
Skip if: You're processing trusted internal documents, need to preserve interactive PDF features like forms or signatures, require perfect text fidelity for legal or archival purposes, or operate at scale where 2-3 minutes per long document is prohibitive.
For low-threat scenarios, Dangerzone's security guarantees come at too high a cost in functionality, file size, and processing time. For high-threat scenarios, it's the most practical open-source solution that doesn't require Qubes OS.