nmap-scrap: Mining HTTP Gold from Nmap XML Without the Manual Grind

Hook

You've just scanned 10,000 hosts with Nmap and found 847 open HTTP/HTTPS ports. Now comes the soul-crushing part: figuring out which ones actually matter. This is the problem nmap-scrap was built to solve—and where it reveals the messy reality of security tooling.

Context

Nmap remains the gold standard for network reconnaissance, but its XML output is notoriously verbose and designed for comprehensive data capture rather than actionable intelligence. A typical enterprise network scan generates thousands of lines of XML detailing every port, service version, and protocol quirk. For penetration testers and security researchers, the real work begins after Nmap finishes: you need to identify which discovered HTTP services are worth investigating, determine their response characteristics, and prioritize targets based on status codes, redirects, and visual appearance.

The traditional workflow involves manually parsing XML, extracting HTTP services, writing custom scripts to probe each endpoint, and somehow organizing the results. Tools like Burp Suite and OWASP ZAP excel at deep application testing but aren't designed for initial mass triage. Screenshot tools like EyeWitness emerged to fill this gap, but they often require specific input formats or lack the flexibility to quickly filter and categorize results. nmap-scrap positions itself as a lightweight, focused solution: take Nmap XML, extract HTTP services, probe them in parallel, and categorize by HTTP response—nothing more, nothing less. It's the kind of single-purpose utility that gets built when someone gets tired of running the same bash one-liners before every engagement.

Technical Insight

At its core, nmap-scrap implements a straightforward pipeline architecture that prioritizes speed over sophistication. The tool parses Nmap's XML output using Python's built-in xml.etree.ElementTree, extracts services running on HTTP/HTTPS ports (with configurable port filtering), then dispatches concurrent HTTP requests using a thread pool. The interesting architectural decision here is the emphasis on parallel execution—pentesting often involves hundreds or thousands of targets, and sequential probing would be impractical.

The threading model uses Python's ThreadPoolExecutor with a default concurrency of 20 threads, configurable via command-line arguments. This isn't the async/await pattern you'd find in modern Python tools, but rather old-school threading that's simple to reason about and sufficient for I/O-bound HTTP requests:

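# Conceptual implementation based on tool behavior; helper names and defaults are assumptions
import re
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
import requests

def extract_title(html):
    match = re.search(r'<title[^>]*>(.*?)</title>', html, re.I | re.S)
    return match.group(1).strip() if match else None

def parse_xml_for_http_services(xml_path, ports=('80', '443', '8080', '8443')):
    # Yield (host, port, scheme) for each open port in the assumed HTTP port set
    for host in ET.parse(xml_path).getroot().iter('host'):
        addr = host.find('address').get('addr')
        for port in host.iter('port'):
            portid = port.get('portid')
            if portid in ports and port.find('state').get('state') == 'open':
                yield addr, portid, 'https' if portid.endswith('443') else 'http'

def categorize_by_status(results):
    buckets = {}  # keys like '2xx' or '4xx'; failed probes go under 'error'
    for r in results:
        key = f"{r['status_code'] // 100}xx" if 'status_code' in r else 'error'
        buckets.setdefault(key, []).append(r)
    return buckets

def probe_http_service(host, port, scheme='http'):
    url = f"{scheme}://{host}:{port}"
    try:
        # verify=False skips certificate validation, so self-signed hosts still answer
        response = requests.get(url, timeout=5, allow_redirects=True, verify=False)
        return {
            'url': url,
            'status_code': response.status_code,  # final response, after redirects
            'redirect_chain': [r.url for r in response.history],
            'final_url': response.url,
            'title': extract_title(response.text)
        }
    except requests.RequestException as e:
        return {'url': url, 'error': str(e)}

def process_nmap_xml(xml_path, threads=20):
    services = parse_xml_for_http_services(xml_path)
    with ThreadPoolExecutor(max_workers=threads) as executor:
        results = list(executor.map(lambda s: probe_http_service(*s), services))
    return categorize_by_status(results)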

The categorization logic groups results by HTTP status class (2xx, 3xx, 4xx, 5xx), which is deceptively useful for triage. A wall of 200 OK responses suggests standard web services worth deeper inspection, 401 responses mark authentication prompts that might warrant credential attacks, and 403 responses point to access-controlled content worth a second look. Redirects (3xx codes) often reveal infrastructure details: a redirect from HTTP to HTTPS on a non-standard port, or from an IP to a hostname, can expose internal DNS names or load balancer behavior. Note that when redirects are followed automatically, as in the sketch above, the recorded status is that of the final response, and the 3xx hops appear only in the redirect chain.
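The redirect observations above can be checked mechanically. The following is a minimal sketch, not nmap-scrap's own code: `classify_redirect` and `is_ip_literal` are hypothetical helpers that compare the requested URL with the final URL and report a scheme upgrade or an IP-to-hostname change.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_literal(host):
    """Return True if host is an IPv4 or IPv6 literal rather than a DNS name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def classify_redirect(requested_url, final_url):
    """Return a list of notable differences between the requested and final URL."""
    start, end = urlsplit(requested_url), urlsplit(final_url)
    findings = []
    if start.scheme == "http" and end.scheme == "https":
        findings.append("scheme-upgrade")
    # An IP that redirects to a name leaks a hostname the scan did not know about
    if (is_ip_literal(start.hostname or "")
            and end.hostname
            and not is_ip_literal(end.hostname)):
        findings.append(f"hostname-revealed:{end.hostname}")
    return findings


# Made-up example values for illustration
print(classify_redirect("http://10.0.0.5:8080",
                        "https://intranet.example.local/login"))
```

Feeding each result's `url` and `final_url` through a helper like this turns the redirect bucket into a short list of newly discovered hostnames.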

The screenshot functionality is the tool's most ambitious feature, though it appears to depend on external tooling. The README mentions 'massws' (likely short for something like "mass web screenshots"; the name is not explained), so capture is probably delegated to a headless browser such as Chromium or the now-unmaintained PhantomJS rather than implemented natively. Delegating complex tasks to specialized tools instead of reimplementing them is a pragmatic architectural choice, but it introduces deployment dependencies that aren't well documented.
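Because the README does not document how screenshots are taken, the following is only a sketch of what shelling out to a headless browser could look like. It assumes a Chromium binary is on the PATH under one of the listed names. `--headless`, `--screenshot`, `--window-size`, and `--ignore-certificate-errors` are Chromium command-line switches; the function names and output layout are hypothetical and have nothing to do with massws.

```python
import shutil
import subprocess
from pathlib import Path

# Binary names vary by distribution; this list is an assumption.
CANDIDATE_BINARIES = ("chromium", "chromium-browser", "google-chrome")


def build_screenshot_command(binary, url, out_path, width=1280, height=800):
    """Build the argument list for a one-shot headless Chromium screenshot."""
    return [
        binary,
        "--headless",
        "--ignore-certificate-errors",  # recon targets often use self-signed certs
        f"--window-size={width},{height}",
        f"--screenshot={out_path}",
        url,
    ]


def capture(url, out_dir="shots", timeout=30):
    """Screenshot url into out_dir; return the PNG path, or None on failure."""
    binary = next((b for b in CANDIDATE_BINARIES if shutil.which(b)), None)
    if binary is None:
        return None  # no supported browser installed
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Derive a filesystem-safe file name from the URL
    name = "".join(c if c.isalnum() else "_" for c in url) + ".png"
    out_path = (out_dir / name).resolve()
    try:
        subprocess.run(build_screenshot_command(binary, url, out_path),
                       check=True, timeout=timeout,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except (subprocess.SubprocessError, OSError):
        return None  # browser crashed, timed out, or could not be started
    return out_path if out_path.exists() else None
```

Keeping command construction separate from execution makes it easy to swap in a different browser or to log the exact command when a capture fails.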

One clever feature is the port filtering mode, which flips the tool's purpose from HTTP probing to simple service extraction:

# Extract all hosts running SSH (port 22) from Nmap results
python nmap-scrap.py -x scan.xml --filter-port 22

# Get all hosts with port 445 (SMB) in an open state
python nmap-scrap.py -x scan.xml --filter-port 445 --state open

This transforms nmap-scrap from a specialized HTTP tool into a general-purpose Nmap result query utility. It's the kind of feature that emerges from real-world usage—someone needed to quickly grep for specific services and realized they already had XML parsing logic in place.
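The same query can be expressed directly against Nmap's XML. This sketch approximates the idea and is not the tool's implementation: `hosts_with_port` is a hypothetical name, and the sample document is made up, though the element and attribute names (`host`, `address`, `port`, `portid`, `state`) are those of Nmap's XML output.

```python
import xml.etree.ElementTree as ET


def hosts_with_port(xml_text, port, state="open"):
    """Return addresses of hosts whose given port is in the given state."""
    root = ET.fromstring(xml_text)
    matches = []
    for host in root.iter("host"):
        address = host.find("address")
        if address is None:
            continue
        for port_el in host.iter("port"):
            state_el = port_el.find("state")
            if (port_el.get("portid") == str(port)
                    and state_el is not None
                    and state_el.get("state") == state):
                matches.append(address.get("addr"))
                break  # one match per host is enough
    return matches


SAMPLE = """<nmaprun>
  <host><address addr="10.0.0.5" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="22"><state state="open"/></port>
      <port protocol="tcp" portid="445"><state state="filtered"/></port>
    </ports>
  </host>
  <host><address addr="10.0.0.9" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="445"><state state="open"/></port>
    </ports>
  </host>
</nmaprun>"""

print(hosts_with_port(SAMPLE, 22))    # ['10.0.0.5']
print(hosts_with_port(SAMPLE, 445))   # ['10.0.0.9']
```

The sketch ignores the `protocol` attribute, so a TCP and a UDP port with the same number are treated alike.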

The data flow emphasizes ephemeral processing: read XML, probe services, display results. There's no database, no persistent state beyond optional file output. This aligns with the reconnaissance phase workflow where you're generating disposable data to inform the next steps rather than building a long-term asset inventory. The tool outputs to stdout in a human-readable format, which works well for interactive use but might frustrate attempts at automation or integration with other toolchains.
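If machine-readable output is needed, one workaround is to serialize results yourself. A minimal sketch, assuming result dictionaries shaped like those in the conceptual code above; `to_jsonl` is a hypothetical helper, not an nmap-scrap feature, and the sample values are made up.

```python
import json


def to_jsonl(results):
    """Serialize an iterable of result dicts as JSON Lines (one object per line)."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in results)


results = [
    {"url": "http://10.0.0.5:8080", "status_code": 200, "title": "Jenkins"},
    {"url": "http://10.0.0.9:80", "error": "connection timed out"},
]
print(to_jsonl(results))
```

One JSON object per line is easy to filter with standard tools and can be appended to safely as results arrive.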

Gotcha

The repository's low adoption rate (5 stars) isn't just about discoverability—it signals real maturity concerns. The README explicitly mentions dependency on unreleased python-requests features, which is a red flag for production use. You might clone this repository, run it, and encounter import errors or unexpected behavior because the author developed against a development branch of a core dependency. This isn't uncommon in early-stage pentesting tools where developers prioritize functionality over stable dependency management, but it means you should expect to troubleshoot environment issues.

Documentation gaps present practical barriers. The 'massws' installation is listed as TODO, so if you want screenshot functionality (arguably the tool's most valuable feature for reporting), you're left reverse-engineering what massws is, where to get it, and how to configure it. There's no discussion of error handling strategies—what happens when 20 threads simultaneously encounter connection timeouts? Does the tool retry? Rate-limit? Simply log and continue? For large-scale scans against production infrastructure, these details matter enormously. Aggressive parallel requests without rate limiting can trigger IDS/IPS alerts or even cause stability issues on fragile web services.
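Whatever the tool does internally, a shared rate limit is easy to bolt onto a thread pool. The following is a sketch under assumptions, not documented nmap-scrap behavior: `RateLimiter` is a hypothetical class that spaces request starts at least `1 / rate_per_second` seconds apart across all worker threads.

```python
import threading
import time


class RateLimiter:
    """Space calls to wait() at least `interval` seconds apart across threads."""

    def __init__(self, rate_per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate_per_second
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def wait(self):
        """Block until this caller's reserved time slot arrives."""
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self.interval  # reserve the following slot
        delay = slot - now
        if delay > 0:
            self._sleep(delay)  # sleep outside the lock so other threads can queue
```

Calling `limiter.wait()` at the top of each probe function caps the overall request rate regardless of how many threads the pool runs.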

The threading model, while simple, won't scale to truly massive scans. If the defaults match the sketch above (20 threads, 5-second timeouts), every 1,000 unresponsive services cost roughly four minutes in the worst case (1,000 / 20 × 5 s = 250 s), so a /16 range with thousands of dead or filtered HTTP ports can add tens of minutes, and more when redirect chains run into further timeouts. Tools built for high concurrency, such as ProjectDiscovery's httpx (written in Go, and not to be confused with the Python HTTP client of the same name), can handle orders of magnitude more concurrent connections efficiently. The tool also doesn't appear to do anything with SSL/TLS certificates beyond disabling verification, missing opportunities to fingerprint services based on certificate details.
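As an illustration of the certificate fingerprinting mentioned above, here is a standard-library sketch. It is not something nmap-scrap does, and the function names are hypothetical. It retrieves the server's certificate without validating it and hashes the DER bytes, so hosts presenting the same certificate can be grouped together.

```python
import hashlib
import socket
import ssl


def fetch_certificate_der(host, port=443, timeout=5.0):
    """Return the server's leaf certificate as DER bytes, without validating it."""
    context = ssl.create_default_context()
    context.check_hostname = False       # must be disabled before CERT_NONE
    context.verify_mode = ssl.CERT_NONE  # accept self-signed and expired certs
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)


def fingerprint_sha256(der_bytes):
    """Return the hex SHA-256 digest of a DER-encoded certificate."""
    return hashlib.sha256(der_bytes).hexdigest()
```

A typical use would be `fingerprint_sha256(fetch_certificate_der("10.0.0.5", 8443))`, with the address here being a placeholder; identical fingerprints across many IPs usually indicate a shared appliance image or a load-balanced service.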

Verdict

Use if: You're conducting penetration tests or bug bounty reconnaissance where you regularly process moderate-scale Nmap scans (hundreds to low thousands of hosts), need quick HTTP service triage organized by status codes, and are comfortable troubleshooting Python dependency issues when they arise. The tool fills a genuine workflow gap between raw Nmap output and deeper application testing, and its simplicity means you can fork and modify it easily for custom requirements. It's particularly valuable if you're already scripting your recon workflow and need a component that does one thing reasonably well without the overhead of heavyweight frameworks.

Skip if: You need production-ready tooling with active maintenance and comprehensive documentation, you're working at scale where async I/O matters (10,000+ targets), you require reliable screenshot functionality without hunting down undocumented dependencies, or you're in an environment where using tools with unclear dependency states creates compliance issues. Consider mature alternatives like EyeWitness for screenshot-focused workflows, httpx for modern high-performance HTTP probing, or even just writing your own Python script using the requests library—nmap-scrap's core value proposition isn't complex enough to justify wrestling with an apparently abandoned tool unless it already fits your exact workflow.
