Photon: Building an OSINT Crawler That Extracts Intelligence, Not Just Links
Hook
Most web crawlers collect URLs. Photon collects ammunition: API keys leaked in JavaScript, employee emails scattered across pages, AWS buckets exposed in HTML comments. It's the difference between indexing a website and running reconnaissance on it.
Context
Traditional web crawlers were built for search engines and data mining—broad, shallow passes across the web to build indexes. But OSINT (Open Source Intelligence) researchers and security professionals need something fundamentally different: depth over breadth, intelligence over volume. When you’re mapping an organization’s digital footprint for a penetration test or bug bounty, you don’t need every page crawled. You need the pages that leak information.
Before tools like Photon, security researchers cobbled together scripts with Beautiful Soup and regex patterns, manually categorizing findings, or relied on commercial tools with opaque implementations. Photon emerged from s0md3v’s security research toolkit as a purpose-built intelligence extractor. It’s not trying to compete with Scrapy’s production-grade scraping capabilities or Puppeteer’s JavaScript rendering. Instead, it optimizes for a specific workflow: rapidly extract structured intelligence from websites with minimal configuration.
Technical Insight
Photon’s architecture revolves around pattern-based extraction rather than semantic understanding. At its core, it’s a multi-threaded breadth-first crawler, but the real engineering is in how it processes each fetched page. Instead of just parsing links, Photon runs every HTML response through a battery of regex patterns designed to identify intelligence indicators.
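In Python, that battery of patterns can be sketched in a few lines. The regexes below are simplified illustrations of the approach, not Photon's actual patterns:

```python
import re

# Illustrative intelligence patterns in the spirit of Photon's
# extraction battery; simplified stand-ins, not the tool's own regexes.
PATTERNS = {
    "emails": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "aws_keys": re.compile(r"AKIA[0-9A-Z]{16}"),
    "links": re.compile(r'href=["\'](.*?)["\']'),
}

def extract_intel(html: str) -> dict:
    """Run every pattern over the raw response body and bucket matches."""
    return {name: sorted(set(p.findall(html))) for name, p in PATTERNS.items()}

page = '<a href="/about">About</a> Contact: jane@example.com AKIAABCDEFGHIJKLMNOP'
print(extract_intel(page))
```

The point is that a single pass over the response body feeds every bucket at once, which is why Photon can categorize findings without a second parsing stage.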
The extraction pipeline is deceptively simple but powerful. Here’s how you’d use Photon to extract intelligence from a target:
# Basic crawl with intelligence extraction
python photon.py -u https://example.com -l 3 -t 10
# Use Wayback Machine to avoid hitting target directly
python photon.py -u https://example.com --wayback
# Extract with custom regex patterns and export to JSON
python photon.py -u https://example.com --regex '\bAPI[_-]?KEY\b' --export=json
Each crawled page gets categorized into separate output directories: urls/, intel/, files/, secret/, scripts/, and external/. This organizational model reflects the OSINT mindset—you’re not analyzing a single data stream, you’re triaging multiple intelligence types simultaneously. The intel/ directory might contain extracted emails and social media handles, while secret/ captures potential API keys and authentication tokens using patterns for common formats (AWS keys, Google API keys, private keys).
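Format-based secret detection like this is easy to illustrate. The patterns below are well-known public credential formats (AWS access key IDs start with AKIA, Google API keys with AIza), not necessarily the exact patterns Photon ships:

```python
import re

# Well-known credential formats, simplified for illustration.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "google_api_key": re.compile(r"\bAIza[0-9A-Za-z_-]{35}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def find_secrets(text: str) -> dict:
    """Return only the pattern names that matched, with their hits."""
    hits = {}
    for name, pat in SECRET_PATTERNS.items():
        found = pat.findall(text)
        if found:
            hits[name] = found
    return hits

sample = "config = { key: 'AIzaSyA1234567890abcdefghijklmnopqrstuv' }"
print(find_secrets(sample))
```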
The JavaScript endpoint extraction is particularly clever. When Photon encounters a .js file, it doesn’t execute it—it parses it as text and extracts strings that look like API endpoints using regex patterns for common route structures (/api/, /v1/, etc.). This static analysis approach means you can discover endpoints in React or Angular apps without needing a headless browser:
# Photon automatically extracts endpoints from JavaScript
# Results appear in example.com/scripts/ directory
# Format: filename.js with extracted endpoints listed
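A static pass of this kind can be approximated as follows. The route heuristics here are assumptions for illustration, not Photon's exact extraction logic:

```python
import re

# Heuristic: quoted strings that look like API routes. A simplified
# approximation of static endpoint extraction from bundled JavaScript.
ENDPOINT = re.compile(r'["\'](/(?:api|v\d+)/[A-Za-z0-9/_.-]*)["\']')

def endpoints_from_js(source: str) -> list:
    return sorted(set(ENDPOINT.findall(source)))

bundle = 'fetch("/api/users");axios.get("/v1/orders/42");var x="/static/app.css";'
print(endpoints_from_js(bundle))  # the CSS path is ignored, routes survive
```

Because this is plain text matching, it works identically on minified bundles and source maps; the trade-off is that endpoints assembled at runtime from string concatenation are invisible to it.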
The thread management system balances speed with resource constraints. Photon uses standard Python threading with a worker count set via -t. This design choice prioritizes simplicity and debuggability over raw throughput. For OSINT workflows, where you're typically crawling dozens of targets rather than millions of pages, the trade-off makes sense.
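The shape of that model, level-by-level breadth-first crawling with a fixed worker pool, can be sketched like this (a simplification, not Photon's actual scheduler):

```python
from concurrent.futures import ThreadPoolExecutor

# Simplified sketch of a level-limited threaded crawl: each level's
# URLs are fetched concurrently, and their links seed the next level.
def crawl(seed_urls, fetch, levels=3, threads=10):
    seen, frontier = set(seed_urls), list(seed_urls)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for _ in range(levels):
            results = pool.map(fetch, frontier)
            frontier = [u for links in results for u in links if u not in seen]
            seen.update(frontier)
    return seen

# Toy "fetch" over an in-memory link graph instead of real HTTP.
graph = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}
pages = crawl(["/"], lambda u: graph.get(u, []))
```

The `levels` parameter here plays the role of Photon's -l flag: the loop count, not the page count, bounds the crawl.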
The Wayback Machine integration (--wayback flag) showcases domain-specific optimization. Instead of hitting the target server, Photon queries archive.org for historical URLs, then crawls those archived snapshots. This is invaluable for reconnaissance where you want to avoid detection or when investigating defunct websites. The plugin architecture is minimal but functional—plugins live in the plugins/ directory and follow a simple interface for extending extraction capabilities.
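The underlying lookup uses the Wayback CDX API at web.archive.org/cdx/search/cdx, which returns known captures for a URL prefix. A minimal sketch of building such a query (no network call here; the parameter choices are documented CDX options, but Photon's exact query is an assumption):

```python
from urllib.parse import urlencode

# Build a Wayback CDX API query for every captured URL under a domain.
def cdx_query(domain: str, limit: int = 1000) -> str:
    params = urlencode({
        "url": f"{domain}/*",   # every captured path under the domain
        "output": "json",
        "fl": "original",       # only the original-URL column
        "collapse": "urlkey",   # de-duplicate repeat captures
        "limit": limit,
    })
    return f"https://web.archive.org/cdx/search/cdx?{params}"

print(cdx_query("example.com"))
```

Fetching that URL yields a JSON array of historical URLs, which a crawler can then visit as archived snapshots instead of touching the live host.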
Scope control is handled through level limiting (-l flag) and regex exclusion patterns, giving you precision control over crawl depth without complex configuration files. The --seeds option lets you inject additional starting URLs, useful when you’ve already identified interesting subdirectories through other reconnaissance.
Gotcha
Photon’s focus on static HTML parsing means it’s blind to modern single-page applications. If your target is a React app that renders content client-side, Photon will see an empty div and miss everything. There’s no JavaScript execution, no DOM manipulation, no waiting for AJAX calls. You’re crawling what the server sends in the initial HTTP response, nothing more. For reconnaissance against modern web apps, you may need to combine Photon with tools that include headless browser capabilities.
Python's GIL-bound threading model hits performance ceilings on large crawls. While it works for typical OSINT targets (a few thousand pages), it lags behind compiled or async crawler implementations on massive sites. Memory consumption can also climb on deep crawls, since Photon appears to maintain URL state in memory rather than spilling to external storage. And the regex patterns for secret detection will produce false positives: not every string matching the AWS key format is a live credential, so findings need manual triage.
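One common triage heuristic (not part of Photon) is to rank candidate secrets by Shannon entropy, since genuinely random keys score high while English-like strings that merely match a format score low:

```python
import math
from collections import Counter

# Shannon entropy in bits per character of a non-empty string; real
# random keys score high, repetitive or word-like strings score low.
def entropy(s: str) -> float:
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

# Both match the AKIA + 16-char format, but only one looks random.
candidates = ["AKIAAAAAAAAAAAAAAAAA", "AKIAJ4X9ZQ2R7TB8WLP6"]
ranked = sorted(candidates, key=entropy, reverse=True)
```

Sorting findings this way doesn't eliminate manual review, but it puts the candidates most likely to be real credentials at the top of the pile.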
Verdict
Use Photon if you’re conducting security reconnaissance, bug bounty research, or OSINT investigations where you need to quickly map a target’s attack surface and extract structured intelligence from static HTML content. It excels at the initial discovery phase—finding employee emails, exposed files, potential secrets, and JavaScript endpoints without complex configuration. The Wayback Machine integration makes it perfect for historical analysis or stealth reconnaissance. Skip it if you’re scraping modern SPAs that rely on JavaScript rendering, need production-grade web scraping with extensive politeness controls and anti-bot evasion, or require crawling at massive scale. Also skip it for general-purpose crawling needs—Photon is a specialized intelligence tool, not a replacement for Scrapy or general web spiders.