Eyeballer: Training Neural Networks to Triage Thousands of Pentest Screenshots
Hook
During a large enterprise penetration test, security researchers might capture 5,000+ screenshots of web applications. A human reviewer can analyze maybe 200 per hour. Eyeballer processes all 5,000 in minutes.
Context
Penetration testing reconnaissance follows a predictable pattern: scan networks for web services, capture screenshots of everything discovered, then manually review each one to identify interesting targets. Tools like EyeWitness and GoWitness excel at the first two steps, programmatically visiting URLs and saving visual snapshots. But the third step—the actual analysis—remains stubbornly manual.
This creates a bottleneck in large assessments. An enterprise network scan might discover thousands of web interfaces: employee portals, forgotten development servers, IoT device admin panels, login pages, parking pages from decommissioned services. Pentesters need to identify which targets deserve investigation, but staring at screenshot grids for hours is mind-numbing work prone to fatigue-induced errors. You might miss that one outdated login page running a vulnerable CMS version simply because it appeared during hour three of screenshot review. Bishop Fox built Eyeballer to solve this specific problem: automated visual triage of web reconnaissance data using convolutional neural networks.
Technical Insight
Eyeballer implements a multi-label CNN classifier using TensorFlow/Keras, trained to detect five categories simultaneously: old-looking sites, login pages, custom 404 pages, homepages, and parked domains. The multi-label approach is crucial—a single screenshot might represent both an 'old-looking site' AND a 'login page', which would be impossible with traditional single-label classification.
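To make the multi-label point concrete, here is a minimal sketch contrasting the two output styles. It assumes the common design where a multi-label network ends in one independent sigmoid per category while a single-label network ends in a softmax; whether Eyeballer's output layer is built exactly this way is an assumption, and the logits below are made-up numbers rather than real model output.

```python
import math

# Category names taken from the article; the logits further down are invented.
LABELS = ["Old Looking", "Login Page", "Custom 404", "Homepage", "Parked Domain"]

def sigmoid(x):
    """Squash one logit into an independent 0..1 score."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Turn logits into scores that compete and sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multi_label(logits, threshold=0.5):
    """Every label whose own sigmoid score clears the threshold."""
    return [name for name, z in zip(LABELS, logits) if sigmoid(z) >= threshold]

def single_label(logits):
    """Only the single highest-scoring label."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))]

# Hypothetical logits for a dated-looking login page.
logits = [2.0, 3.0, -3.0, -1.0, -4.0]
print(multi_label(logits))   # both 'Old Looking' and 'Login Page' survive
print(single_label(logits))  # only 'Login Page' survives
```

With per-label sigmoids, a high score for one category does not push the others down, which is what lets one screenshot carry two labels at once.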
The architecture expects 224x224 RGB images as input, though screenshots should be captured at 1440x900 (16:10 aspect ratio) before being resized. This preprocessing detail matters more than you might expect. The model was trained on screenshots captured at that ratio, so a capture with a different aspect ratio ends up squished or stretched in a way the model never saw during training, and accuracy degrades. Basic usage, with example output:
# Basic usage after installation ("predict" is the mode that classifies screenshots)
python eyeballer.py --weights YOUR_WEIGHTS.h5 predict YOUR_SCREENSHOTS_DIR
# Example output shows probability scores per category
# Screenshot: admin-login.png
# Old Looking: 0.23
# Login Page: 0.94
# Custom 404: 0.05
# Homepage: 0.67
# Parked Domain: 0.02
The training pipeline leverages transfer learning concepts, though it implements a custom architecture rather than using pre-trained ImageNet weights. The model achieves 93.52% binary accuracy across all labels, meaning that when each label on each screenshot is scored as its own yes/no decision, 93.52% of those decisions are correct. However, the all-or-nothing accuracy, where every label on a screenshot must be correct simultaneously, drops to 76.09%. This gap reveals the complexity of multi-label problems: getting one decision right is relatively easy, but getting five separate decisions all correct on the same image is significantly harder.
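The two metrics can be sketched in a few lines of plain Python. The label matrices below are invented toy data, not Eyeballer's evaluation set, and the final calculation assumes (purely as a back-of-the-envelope baseline) that every label is right 93.52% of the time independently of the others.

```python
def binary_accuracy(y_true, y_pred):
    """Fraction of individual label decisions that are correct."""
    total = sum(len(row) for row in y_true)
    correct = sum(
        t == p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return correct / total

def all_or_nothing_accuracy(y_true, y_pred):
    """Fraction of screenshots where every label is correct at once."""
    exact = sum(true_row == pred_row for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)

# Toy data: 4 screenshots x 5 labels (hypothetical values).
y_true = [[0, 1, 0, 0, 0], [1, 0, 0, 1, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 1]]
y_pred = [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 1]]

print(binary_accuracy(y_true, y_pred))          # 18 of 20 decisions correct
print(all_or_nothing_accuracy(y_true, y_pred))  # 2 of 4 screenshots fully correct

# Independence baseline: five decisions, each right 93.52% of the time.
independent_baseline = 0.9352 ** 5
print(round(independent_baseline, 4))
```

Under that independence assumption the baseline comes out near 71.5%. The reported 76.09% is somewhat higher, which would be consistent with errors clustering on the same difficult screenshots rather than being spread evenly, though the article's figures alone cannot confirm that.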
Integrating Eyeballer into reconnaissance workflows is straightforward. After screenshot collection, you run Eyeballer to generate both HTML and CSV outputs:
# Capture screenshots with EyeWitness
eyewitness --web -f urls.txt -d screenshots/
# Run Eyeballer's predict mode on the screenshot directory
# (it writes results.html and results.csv to the working directory)
python eyeballer.py --weights pretrained.h5 predict screenshots/screens/
# Filter high-confidence login pages from the CSV, skipping the header row
# (assumes the login-page score is the third column; check the header first)
awk -F',' 'NR > 1 && $3 > 0.90 {print $1}' results.csv > priority-targets.txt
The CSV output enables pipeline automation. You might filter for login pages with >90% confidence, or flag anything marked as 'old-looking' with >70% confidence as potentially running outdated software. The HTML report provides a visual interface for human reviewers, displaying thumbnails with color-coded confidence scores.
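The same filtering can be done by column name, which avoids depending on column order. This is a rough sketch: the column headers and sample rows below are hypothetical stand-ins, so check the header row of the CSV Eyeballer actually produces and adjust the names to match.

```python
import csv
import io

# Hypothetical CSV contents; the real header names may differ.
SAMPLE_CSV = """filename,old looking,login page,custom404,homepage,parked
admin-login.png,0.23,0.94,0.05,0.67,0.02
legacy-portal.png,0.81,0.35,0.03,0.70,0.01
parked.png,0.10,0.02,0.04,0.08,0.97
"""

def select(rows, column, threshold):
    """Filenames whose score in `column` is strictly above `threshold`."""
    return [row["filename"] for row in rows if float(row[column]) > threshold]

# DictReader consumes the header row, so it never reaches the comparison.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

priority_logins = select(rows, "login page", 0.90)
possibly_outdated = select(rows, "old looking", 0.70)

print(priority_logins)
print(possibly_outdated)
```

To run it against a real report, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle for the CSV.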
Category performance varies significantly. Login page detection achieves 83.82% recall with 88.24% precision, strong enough to catch most authentication interfaces while keeping false positives manageable. But 'old-looking' sites show only 62.20% recall, meaning the model misses nearly 4 out of 10 dated websites. This variance likely stems from training data characteristics: login pages have consistent visual patterns (input fields, submit buttons, specific layouts), while 'old-looking' is more subjective and visually diverse.
For teams wanting to retrain on custom data, Eyeballer includes training scripts that expect a specific directory structure with labeled screenshots. Training needs GPU acceleration to finish in a reasonable time, but the repository explicitly treats GPU setup as out of scope. This creates a practical barrier for customization: you'll need existing ML infrastructure or a willingness to work through TensorFlow GPU configuration yourself.
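For readers less familiar with the two metrics, here is a small sketch of how precision and recall are computed and what a recall figure implies in practice. The confusion counts are invented for illustration; only the 62.20% recall value comes from the text above.

```python
def precision(true_pos, false_pos):
    """Of everything flagged with a label, the share that really has it."""
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    """Of everything that really has a label, the share that was flagged."""
    return true_pos / (true_pos + false_neg)

def expected_missed(actual_positives, recall_rate):
    """Expected number of true cases the model fails to flag."""
    return actual_positives * (1.0 - recall_rate)

# Hypothetical counts: 3 correct flags, 1 wrong flag, 2 missed cases.
print(precision(true_pos=3, false_pos=1))  # 0.75
print(recall(true_pos=3, false_neg=2))     # 0.6

# With the article's 62.20% recall, out of 100 genuinely dated sites:
print(round(expected_missed(100, 0.6220), 1))
```

The last line works out to roughly 38 missed sites per 100, which is where the "nearly 4 out of 10" figure comes from.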
Gotcha
The 76.09% all-or-nothing accuracy means roughly 1 in 4 screenshots will have at least one incorrect label. In a batch of 1,000 screenshots, expect around 240 to be partially mislabeled. This doesn't mean the tool is useless, since it still narrows thousands of images down to hundreds for human review, but you cannot blindly trust the classifications. A screenshot marked with low 'login page' confidence might still be a login interface; the model might have been confused by unusual styling or missing typical visual markers.
The aspect ratio dependency creates practical friction. If your screenshot tool generates 1920x1080 (16:9) captures, you'll need preprocessing to crop or pad them to 16:10 before feeding them to Eyeballer. The repository doesn't include this preprocessing, so you're building that pipeline yourself. Tools like ImageMagick can handle the conversion, but it's an extra step that adds complexity.
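As one possible approach to the cropping option, the sketch below computes a centered 16:10 crop box for an arbitrary capture size using only integer arithmetic. The function name `center_crop_box` is hypothetical and not part of Eyeballer; center-cropping is also just one choice, and it discards the left and right edges of a 16:9 capture, which may or may not matter for your targets.

```python
def center_crop_box(width, height, ratio_w=16, ratio_h=10):
    """Return (left, upper, right, lower) for the largest centered crop
    of a width x height image that has a ratio_w:ratio_h aspect ratio."""
    if width * ratio_h > height * ratio_w:
        # Image is wider than the target ratio: trim the sides.
        new_width = height * ratio_w // ratio_h
        left = (width - new_width) // 2
        return (left, 0, left + new_width, height)
    # Image is taller than (or already at) the target ratio: trim top and bottom.
    new_height = width * ratio_h // ratio_w
    top = (height - new_height) // 2
    return (0, top, width, top + new_height)

# A 1920x1080 (16:9) capture loses 96 pixels from each side.
print(center_crop_box(1920, 1080))  # (96, 0, 1824, 1080)

# A capture already at 16:10 is left untouched.
print(center_crop_box(1440, 900))   # (0, 0, 1440, 900)
```

The returned tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so it can be passed straight to that method if Pillow is your image library.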
Category limitations are hardcoded into the model. If you care about identifying specific frameworks (WordPress, Joomla, SharePoint) or vulnerability indicators beyond the five categories, Eyeballer won't help. Retraining requires collecting labeled training data, which for security-specific categories might mean manually labeling thousands of screenshots, essentially the same tedious work you're trying to avoid. The parked domain category also has weak recall (66.43%, only slightly better than the 62.20% for 'old-looking' sites), making the tool unreliable for identifying dead or placeholder sites.
Verdict
Use if: You're conducting network penetration tests that generate 200+ screenshots and need to quickly identify login pages, outdated-looking sites, or custom error pages. The tool excels at first-pass triage, cutting manual review time by 60-70% even accounting for false positives. It's especially valuable when combined with EyeWitness or GoWitness in automated reconnaissance pipelines where you need machine-readable classification data.
Skip if: Your assessments typically involve fewer than 50 web targets where manual review takes under an hour, or if you need perfect accuracy (at a 76.09% all-or-nothing rate, roughly a quarter of screenshots carry at least one wrong label, and some of those errors are missed targets). Also skip if your screenshot tooling can't easily produce 16:10 aspect ratios, or if your analysis requires categories beyond the five built-in labels. For small-scale work or when you need complete confidence in results, traditional manual review remains more reliable despite being slower.