Photon: The Multi-Threaded OSINT Crawler That Weaponizes Web Archives

Hook

Most web crawlers start from your homepage and work forward. Photon starts from Archive.org's Wayback Machine and works backward through time, discovering endpoints that disappeared years ago but still expose live API keys.

Context

Security researchers face a fundamental problem during reconnaissance: modern websites hide their most interesting endpoints behind JavaScript routers, authentication walls, and labyrinthine navigation structures. Traditional crawlers dutifully follow links from the homepage, but they miss the /api/v1/admin endpoint that was removed from the navigation three years ago yet still accepts unauthenticated requests. They overlook the backup file someone uploaded in 2019 that contains database credentials.

Photon emerged from the OSINT community's need for speed and intelligence extraction during bug bounty programs and penetration testing. Created by Somdev Sangwan (s0md3v), who also built XSStrike and other security tools, Photon treats crawling not as an indexing exercise but as an intelligence-gathering operation. It assumes every website leaks secrets—emails in HTML comments, API keys in JavaScript files, subdomains in certificates, AWS buckets in image URLs—and structures its entire architecture around rapidly extracting and categorizing these assets. Unlike general-purpose frameworks like Scrapy that provide building blocks, Photon is opinionated and purpose-built: get in, extract everything security-relevant, categorize it, and get out.

Technical Insight

[Figure: System architecture (auto-generated). Components: Seed URLs + Archive.org; URL Queue with Deduplication; ThreadPoolExecutor Workers; HTTP Fetcher with Delay/Timeout; HTML Parser + Regex Extraction; Intelligence Categorizer; Output Directories by Type; External APIs: DNSDumpster, Wayback; JSON Export. Edge labels: Initial URLs, Historical URLs, Distribute, Pull URL, HTML Content, New URLs, Raw Matches, "emails, files, API keys, subdomains", Optional.]

Photon's architecture revolves around three intelligent design decisions that separate it from typical crawlers: multi-threaded queue management with URL deduplication, regex-based intelligence extraction pipelines, and what I call "temporal seeding" through Archive.org integration.

The crawling engine uses Python's concurrent.futures.ThreadPoolExecutor to spawn a configurable number of worker threads (default 2, but you can push it to 10+ for internal networks). Each thread pulls URLs from a shared queue, fetches content, extracts new URLs, and feeds them back. The clever part is the deduplication logic: Photon maintains a set of processed URLs to avoid redundant requests, and it normalizes URLs before comparison, stripping query parameters that don't affect content. Here's the core crawling pattern, simplified so that it deduplicates on exact URL strings only:

# Simplified sketch of Photon-style crawling logic (illustrative, not Photon's source)
import re
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

import requests


class PhotonCrawler:
    def __init__(self, seed_url, threads=2):
        self.seed = seed_url
        self.host = urlparse(seed_url).netloc
        self.processed = set()
        self.queue = [seed_url]
        self.threads = threads
        self.lock = threading.Lock()  # guards state shared by worker threads
        self.results = {
            'emails': set(),
            'files': set(),
            'intel': set()
        }

    def extract_intel(self, html, url):
        # Email extraction
        emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', html)

        # API key patterns (simplified)
        api_patterns = [
            r'api[_-]?key["\']?\s*[:=]\s*["\']([a-zA-Z0-9_\-]{20,})',
            r'AKIA[0-9A-Z]{16}'  # AWS access key IDs
        ]
        intel = [m for pattern in api_patterns for m in re.findall(pattern, html)]

        # File discovery. The extension group is non-capturing so that findall
        # returns the whole href value instead of (href, extension) tuples.
        file_extensions = r'\.(?:pdf|xlsx?|docx?|zip|sql|log|backup)'
        files = re.findall(f'href=["\']([^"\'>]+{file_extensions})', html, re.I)

        with self.lock:
            self.results['emails'].update(emails)
            self.results['intel'].update(intel)
            self.results['files'].update(urljoin(url, f) for f in files)

    def crawl_url(self, url):
        # Check and mark under the lock so two threads never fetch the same URL
        with self.lock:
            if url in self.processed:
                return []
            self.processed.add(url)

        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            return []  # unreachable or timed out: skip this URL

        html = resp.text

        # Extract intelligence
        self.extract_intel(html, url)

        # Extract new URLs, resolve relative links, keep only the seed's host
        new_urls = re.findall(r'href=["\']([^"\'>]+)', html)
        normalized = [urljoin(url, u) for u in new_urls]
        return [u for u in normalized if urlparse(u).netloc == self.host]

    def run(self):
        with ThreadPoolExecutor(max_workers=self.threads) as executor:
            while self.queue:
                # Process the queue in batches of 10 URLs
                batch, self.queue = self.queue[:10], self.queue[10:]

                futures = [executor.submit(self.crawl_url, url) for url in batch]
                for future in futures:
                    for u in future.result():
                        # Skip URLs already fetched or already waiting
                        if u not in self.processed and u not in self.queue:
                            self.queue.append(u)
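
The class above compares raw URL strings, so the normalization step described earlier is not shown. Here is a minimal sketch of one way to do it, using only the standard library. The function name normalize_url and the IGNORED_PARAMS list are assumptions made for illustration; they are not taken from Photon's source, and which parameters are safe to ignore depends on the target.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters assumed not to change page content
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize_url(url):
    """Return a canonical form of url suitable for use as a dedup key."""
    parts = urlsplit(url)
    # Drop ignored parameters and sort the rest so parameter order is irrelevant
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in IGNORED_PARAMS
    )
    path = parts.path or "/"
    # Lowercase scheme and host, and discard the fragment
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


seen = set()
for candidate in [
    "https://Example.com/page?b=2&a=1",
    "https://example.com/page?a=1&b=2&utm_source=mail#top",
]:
    seen.add(normalize_url(candidate))

print(seen)  # both candidates collapse to a single entry
```

Storing normalize_url(url) in the processed set instead of the raw string would let the crawler treat both candidates as the same page.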

The regex extraction pipelines are where Photon shows its OSINT DNA. It doesn't just extract links—it has dozens of regex patterns tuned for security artifacts: AWS S3 buckets, Google API keys, JSON Web Tokens, database connection strings, social security numbers, and even custom JavaScript variable assignments that might contain credentials. These patterns evolved from real-world bug bounty findings, making Photon a living catalog of what leaks where.
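
To make the idea of a pattern catalog concrete, here is a small sketch of a categorizing extractor. The patterns are illustrative approximations of common token formats and are not copied from Photon; the PATTERNS table and the categorize function are hypothetical names. Real-world patterns need tuning to keep false positives down.

```python
import re

# Illustrative patterns only; these are not Photon's actual regexes
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "google_api_key": re.compile(r"\bAIza[0-9A-Za-z_\-]{35}"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    "s3_bucket": re.compile(r"\b[a-z0-9][a-z0-9.-]+\.s3\.amazonaws\.com\b"),
}


def categorize(text):
    """Return {category: set of matches} for every pattern that matched."""
    found = {}
    for name, pattern in PATTERNS.items():
        matches = set(pattern.findall(text))
        if matches:
            found[name] = matches
    return found


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation
sample = (
    '<script>var cfg = {key: "AKIAIOSFODNN7EXAMPLE"};</script>\n'
    '<img src="https://assets-demo.s3.amazonaws.com/logo.png">\n'
)
result = categorize(sample)
print(result)
```

Keeping each pattern under a named category is what makes it possible to write one output file per artifact type later on.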

But the real innovation is Archive.org integration. When you enable the --wayback flag, Photon first queries Archive.org for URLs it has archived under your target domain. These historical URLs seed the initial crawl queue before Photon even touches the live site. This means you're not just crawling the current site structure; you're also requesting endpoints that the live site no longer links to. That /admin/backup.php page removed in 2018? If Archive.org captured it and it's still deployed (and you'd be surprised how often it is), Photon can find it. The output gets organized into a directory per target, with files such as domain.com/emails.txt, domain.com/files.txt, and domain.com/intel.txt, making post-processing trivial.
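
The seeding step can be sketched against the Wayback Machine's public CDX endpoint. This is an assumption about how such a lookup might be written, not a description of Photon's own code: the function names are hypothetical, and the query parameters shown (matchType, fl, collapse, output) are just one reasonable choice.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain):
    """Build a CDX query URL listing distinct archived URLs under domain."""
    params = {
        "url": domain,
        "matchType": "domain",  # include subdomains
        "fl": "original",       # return only the originally captured URL
        "collapse": "urlkey",   # one row per distinct URL
        "output": "json",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows):
    """Turn CDX JSON rows into an ordered, de-duplicated seed list."""
    seeds = []
    for row in rows[1:]:  # with output=json the first row holds field names
        if row and row[0] not in seeds:
            seeds.append(row[0])
    return seeds


def fetch_wayback_seeds(domain, timeout=10):
    """Network call: fetch and parse archived URLs for domain."""
    with urlopen(build_cdx_query(domain), timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return parse_cdx_rows(json.loads(body)) if body.strip() else []
```

Calling fetch_wayback_seeds("example.com") and placing the result at the front of the crawl queue reproduces the seeding idea; parse_cdx_rows is kept separate so it can be exercised without network access.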

Photon also implements politeness controls that most OSINT tools ignore. The --delay flag adds a pause, specified in seconds, between requests, and --timeout prevents hanging on slow endpoints. You can exclude URL patterns with a regex via --exclude, keeping the crawler focused on in-scope targets during bug bounties. Alongside the human-readable text files, the --export option can write JSON for pipeline integration with other tools like Nuclei or custom scripts.
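
As a rough illustration of how a delay and an exclusion regex can be enforced around a fetch loop, consider the sketch below. PoliteScheduler is a hypothetical helper written for this article, not part of Photon; the clock and sleep parameters exist only so the behavior can be checked without real waiting, and the class is not thread-safe as written.

```python
import re
import time


class PoliteScheduler:
    """Keeps requests a minimum interval apart and filters excluded URLs."""

    def __init__(self, delay=0.0, exclude=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.delay = delay  # minimum seconds between requests
        self.exclude = re.compile(exclude) if exclude else None
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def in_scope(self, url):
        """Return False for URLs matching the exclusion regex."""
        return not (self.exclude and self.exclude.search(url))

    def wait_turn(self):
        """Sleep just long enough to keep requests `delay` seconds apart."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
```

A fetch loop would call in_scope(url) before queueing a URL and wait_turn() immediately before each request; with several worker threads, wait_turn would also need a lock.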

Gotcha

Photon's biggest blind spot is the modern web. It uses plain requests.get() calls with no JavaScript execution, meaning it misses most of what single-page applications built with React, Vue, or Angular expose, along with any content loaded via AJAX. If your target uses client-side routing, Photon sees the skeleton HTML and nothing else. This isn't a bug; it's an architectural decision for speed. But it means you need a crawler with headless browser support, such as Katana, for modern web apps. I've seen Photon extract 50 URLs from a React application that actually has 500+ routes, all invisible without JavaScript rendering.
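
The gap is easy to reproduce with the same kind of href regex used in the crawler sketch earlier. Both HTML snippets below are made up for illustration: one stands in for the shell a single-page application serves before any JavaScript runs, the other for a server-rendered page.

```python
import re

HREF_PATTERN = re.compile(r'href=["\']([^"\'>]+)')

# Hypothetical SPA shell: what the server returns before any JavaScript runs
SPA_SHELL = """<!doctype html>
<html>
  <head><link href="/static/main.css" rel="stylesheet"></head>
  <body>
    <div id="root"></div>
    <script src="/static/bundle.js"></script>
  </body>
</html>"""

# Hypothetical server-rendered page with ordinary anchor tags
MULTI_PAGE = """<html><body>
  <a href="/about">About</a>
  <a href="/admin/login">Admin</a>
  <a href="/reports/2019.pdf">Report</a>
</body></html>"""

spa_links = HREF_PATTERN.findall(SPA_SHELL)
page_links = HREF_PATTERN.findall(MULTI_PAGE)
print(spa_links)   # ['/static/main.css']
print(page_links)  # ['/about', '/admin/login', '/reports/2019.pdf']
```

The SPA shell yields only its stylesheet, because every route lives inside bundle.js and never appears as an href in the served HTML.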

The GPL v3.0 license is also a practical limitation for security consultancies and commercial tool developers. If you embed Photon in a scanning platform that you distribute, the GPL's copyleft terms require you to release the combined work under the GPL as well. This has pushed many commercial security vendors toward alternatives with permissive licenses.

Finally, aggressive Photon settings can trigger modern web application firewalls (WAFs) and rate limiting. The defaults are modest (2 threads), but running with 10 threads and no delay against a Cloudflare-protected site will likely get you blocked quickly. Photon lets you supply your own cookies, headers, and user agents (--cookie, --headers, --user-agent), but it does little beyond that to evade detection. You'll need to manually tune delays and thread counts for each target, and even then, determined anti-bot systems will catch you. It's built for speed on cooperative targets, not stealth on hardened ones.

Verdict

Use if: You're conducting reconnaissance on traditional multi-page websites during bug bounties or penetration tests where speed and automated intelligence extraction matter more than exhaustive coverage. Photon excels at quickly categorizing exposed assets across domains with multiple subdomains and historical endpoints. It's perfect for the initial 30-minute reconnaissance phase where you need a broad overview of what's exposed before diving deeper with specialized tools. The Archive.org integration alone makes it invaluable for finding forgotten endpoints on long-lived domains.

Skip if: Your target is a modern JavaScript-heavy SPA, you need commercial licensing flexibility, or you're scanning heavily protected targets with aggressive WAFs where stealth matters. For those scenarios, invest in headless browser-based crawlers like Katana or commercial tools like Burp Suite. Also skip if you're doing general web scraping for non-security purposes: Photon's opinionated OSINT focus makes it a poor fit, and you'll get better results with Scrapy or Beautiful Soup.