
nmap-scrap: Turning Nmap XML Into HTTP Reconnaissance Pipelines


Hook

Every penetration tester has been there: Nmap finishes scanning 10,000 hosts, reports 847 open HTTP ports, and now you need to manually figure out which ones are actually worth investigating.

Context

The reconnaissance phase of penetration testing follows a predictable pattern: scan the network with Nmap, identify open ports, then manually sift through results to find interesting services. For small engagements, this works fine. But when you’re dealing with enterprise networks spanning hundreds or thousands of hosts, the gap between “ports are open” and “these web applications are worth examining” becomes a productivity black hole.

Nmap excels at port discovery but stops short of deep HTTP enumeration. You might know that port 8080 is open on 200 hosts, but you don’t know which ones return 200 OK, which redirect to login pages, or which expose interesting administrative interfaces. Tools like masscan can identify open ports faster, but they create the same downstream problem: a massive list of potential targets with no prioritization. nmap-scrap positions itself as the bridge between port scanning and HTTP reconnaissance, parsing Nmap’s XML output to automatically validate and categorize web services across your scan results.

Technical Insight

The project's auto-generated architecture diagram reduces to a straightforward pipeline:

Nmap XML file → XML parser → port filter → URL queue → thread pool (20 workers) → HTTP/HTTPS requests against target web services → response handler (status, size, redirects) → results output, plus an optional screenshot-capture stage.

At its core, nmap-scrap is a purpose-built XML parser married to a concurrent HTTP client. The tool reads Nmap’s structured XML output (generated with the -oX flag), extracts hosts with specific open ports, constructs HTTP URLs, and fires requests using a thread pool. This design choice—parsing XML rather than implementing its own scanning logic—keeps the codebase lean and lets Nmap handle the complexity of network scanning.

The architecture centers around a simple workflow: filter ports, build URLs, make requests, capture responses. When you run nmap-scrap against an XML file, it iterates through host elements, identifies services matching your port filter, and queues HTTP requests. By default, it spins up 20 worker threads to parallelize operations, which provides reasonable throughput without overwhelming targets or triggering rate limiting on smaller networks.
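The parsing step can be sketched with Python's standard library. Element and attribute names (`host`, `port`, `state`, `portid`) follow Nmap's documented XML output schema; the function name is illustrative, and nmap-scrap's actual internals may differ.

```python
# Minimal sketch: extract (address, port) pairs for open TCP ports from
# Nmap XML (-oX output). Illustrative, not nmap-scrap's exact code.
import xml.etree.ElementTree as ET

def open_http_ports(xml_text, wanted_ports):
    root = ET.fromstring(xml_text)
    targets = []
    for host in root.iter("host"):
        addr_el = host.find("address")
        if addr_el is None:
            continue
        addr = addr_el.get("addr")
        for port in host.iter("port"):
            state = port.find("state")
            portid = int(port.get("portid"))
            if state is not None and state.get("state") == "open" \
                    and portid in wanted_ports:
                targets.append((addr, portid))
    return targets

SAMPLE = """<nmaprun>
  <host><address addr="192.168.1.45" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="8080"><state state="open"/></port>
      <port protocol="tcp" portid="22"><state state="open"/></port>
    </ports>
  </host>
</nmaprun>"""

print(open_http_ports(SAMPLE, {8080, 8443}))  # [('192.168.1.45', 8080)]
```

From these pairs, URL construction is a trivial string join, which is exactly why delegating discovery to Nmap keeps the tool small.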

Here’s a typical workflow showing how nmap-scrap processes scan data:

# First, run an Nmap scan and save XML output
nmap -p 80,443,8000,8080,8443 -oX scan_results.xml 192.168.1.0/24

# Then process with nmap-scrap to filter for specific ports
python nmap-scrap.py -x scan_results.xml -p 8080,8443

# The tool constructs URLs and validates HTTP services:
# http://192.168.1.45:8080 [200] (15234 bytes)
# https://192.168.1.67:8443 [302] -> https://192.168.1.67:8443/login
# http://192.168.1.89:8080 [401] (unauthorized)

The real value emerges when you combine nmap-scrap with Nmap’s service detection. If your initial scan used -sV to fingerprint services, the XML output contains service banners and version information. nmap-scrap can leverage this metadata to make smarter decisions about protocol selection (HTTP vs HTTPS) and filter out false positives where a port is open but not actually serving HTTP.
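A sketch of what that metadata-driven protocol selection could look like: with `-sV`, Nmap annotates each port with a `<service>` element whose `name` attribute identifies the protocol and whose `tunnel="ssl"` attribute flags TLS wrapping. The helper name below is illustrative, not the tool's actual API.

```python
# Sketch: choose http vs https from Nmap -sV service metadata.
# Nmap marks TLS-wrapped services with tunnel="ssl" on <service>.
import xml.etree.ElementTree as ET

def scheme_for(port_el):
    svc = port_el.find("service")
    if svc is None:
        return None               # no fingerprint: caller must try both schemes
    name = svc.get("name", "")
    if svc.get("tunnel") == "ssl" or name == "https":
        return "https"
    if name.startswith("http"):
        return "http"
    return None                   # port is open, but not serving HTTP

port_xml = ('<port portid="8443"><state state="open"/>'
            '<service name="http" tunnel="ssl"/></port>')
print(scheme_for(ET.fromstring(port_xml)))  # https
```

The same check doubles as a false-positive filter: an open port whose service name is `ssh` or `smtp` never enters the URL queue.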

The multi-threading implementation uses Python’s concurrent.futures.ThreadPoolExecutor, a straightforward approach that works well for I/O-bound HTTP operations. Each thread pulls URLs from a queue, makes a request using the python-requests library, captures the status code and response size, and optionally follows redirects. The default of 20 threads represents a middle ground—aggressive enough to process medium-sized scans quickly but conservative enough to avoid connection exhaustion.
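The concurrency pattern this implies looks roughly like the following. The `probe` function here is a stand-in so the sketch stays self-contained; in the real tool each worker would call `requests.get()` and record the status code and body size.

```python
# Concurrency sketch: a fixed pool of workers draining a URL list.
# probe() is a placeholder for the per-URL HTTP request.
from concurrent.futures import ThreadPoolExecutor, as_completed

def probe(url):
    # Stand-in for requests.get(url); returns (url, status, size).
    return (url, 200, 1234)

def run_pool(urls, workers=20):
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(probe, u): u for u in urls}
        for fut in as_completed(futures):
            results.append(fut.result())
    return results

urls = [f"http://192.168.1.{i}:8080" for i in range(1, 6)]
print(len(run_pool(urls)))  # 5
```

Because HTTP probing is I/O-bound, threads (rather than processes) are the right tool here: workers spend most of their time blocked on sockets, so the GIL is not a bottleneck.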

One interesting architectural decision is the tool’s dependency on a custom fork of python-requests. The standard requests library handles most HTTP scenarios, but pentesting workflows often require modifications: custom SSL/TLS verification behavior, extended timeout handling, or specialized redirect logic. This fork introduces changes that make the library more suitable for reconnaissance scenarios where you’re deliberately interacting with potentially misconfigured or hostile services.

The screenshot functionality integrates with massws (Mass Web Screenshots), though the documentation notes this as a TODO. The concept is sound: after validating HTTP services, automatically capture visual snapshots for later review. This becomes invaluable during large engagements where you need to triage hundreds of discovered services and prioritize which ones to investigate manually. A screenshot of a default Apache landing page gets lower priority than an exposed administrative dashboard.

Output persistence allows you to save results for further processing or integration with other tools. Since nmap-scrap operates as a post-processor, it fits naturally into pipeline workflows where you might feed results into vulnerability scanners, content discovery tools, or custom analysis scripts. The tool’s simplicity—read XML, make requests, output results—makes it easy to chain with other reconnaissance utilities.
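One natural way to make that chaining concrete is line-oriented output, one record per service, that downstream tools can consume. The JSON Lines layout below is illustrative, not nmap-scrap's actual output format.

```python
# Sketch: persist validated services as JSON Lines for downstream tools
# (content discovery, vuln scanners). Record layout is illustrative.
import json
import io

def write_results(results, fh):
    for url, status, size in results:
        fh.write(json.dumps({"url": url, "status": status, "bytes": size}) + "\n")

buf = io.StringIO()
write_results([("http://192.168.1.45:8080", 200, 15234)], buf)
print(buf.getvalue().strip())
```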

Gotcha

The elephant in the room is the custom python-requests dependency. The repository requires a modified version that isn’t published to PyPI, forcing you to manually install it from a specific GitHub repository. This creates friction in deployment workflows and breaks standard Python packaging conventions. If you’re building automated pentesting pipelines with reproducible environments, this dependency management quirk will cause headaches. Virtual environment isolation helps, but it’s still a red flag for production use.

The project’s maintenance status raises concerns. With only 5 stars and incomplete documentation (the massws integration is marked TODO in the README), this appears to be an early-stage or possibly abandoned project. There’s no clear roadmap, no recent commit activity, and no community momentum. For one-off penetration tests where you need a quick solution, this might be acceptable. For building reliable infrastructure or tools you’ll use repeatedly, the lack of active development is problematic.

Error handling appears minimal based on the architecture description. When dealing with hundreds or thousands of HTTP requests against potentially misconfigured servers, you need robust error recovery: connection timeouts, SSL certificate validation failures, malformed HTTP responses, and rate limiting. The tool’s simple architecture suggests it may not gracefully handle edge cases that appear regularly in real-world pentesting. A crashed thread or unhandled exception could cause you to miss results or require restarting the entire analysis.
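The kind of defensive wrapper the tool would need is simple: one bad host must yield an error record, not a dead worker thread. The exception handling below is generic so the sketch stays self-contained; a requests-based version would catch `requests.exceptions.Timeout`, `ConnectionError`, and `SSLError` instead.

```python
# Sketch: wrap any probe callable so failures become records, not crashes.
def safe_probe(probe, url, timeout=5):
    try:
        status, size = probe(url, timeout=timeout)
        return {"url": url, "status": status, "bytes": size}
    except TimeoutError:
        return {"url": url, "error": "timeout"}
    except OSError as exc:        # connection refused/reset, SSL failures, ...
        return {"url": url, "error": str(exc)}

def flaky(url, timeout):
    raise TimeoutError            # simulate an unresponsive host

print(safe_probe(flaky, "http://192.168.1.99:8080"))
# {'url': 'http://192.168.1.99:8080', 'error': 'timeout'}
```

With every result reduced to a dict, a failed host still appears in the final report, which matters when you have to prove scan coverage to a client.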

Verdict

Use if: You’re conducting time-boxed penetration tests with small-to-medium network scopes (under 1,000 hosts), already have Nmap XML results, and need a quick way to validate which HTTP services are actually responding. The tool excels at turning “port 8080 is open” into actionable intelligence about service availability and response codes, and the multi-threaded approach provides reasonable performance without requiring complex setup. It’s particularly useful for engagements where you need to document HTTP service discovery in a repeatable way and don’t mind managing the custom dependency.

Skip if: You need production-ready reliability, active maintenance, or plan to integrate this into automated workflows that run regularly. The custom python-requests dependency creates unnecessary deployment complexity, and the incomplete state of features like screenshot integration suggests you’ll hit walls quickly. For larger scopes or enterprise environments, mature alternatives like EyeWitness or httpx from ProjectDiscovery offer better error handling, active development communities, and more comprehensive feature sets. Also skip if you need HTTPS-specific options like custom certificate validation or client authentication—the tool’s simple architecture doesn’t support these advanced scenarios.
