Building an Unofficial API by Scraping DNS Reconnaissance Sites: A Study of PaulSec’s DNSDumpster Wrapper
Hook
What do you do when a valuable reconnaissance service doesn’t offer an API? You build one yourself by reverse-engineering their web forms—CSRF tokens, session cookies, and all.
Context
DNSDumpster.com has been a go-to resource for security researchers and penetration testers since its launch, offering free subdomain enumeration, DNS record mapping, and network visualization. The service aggregates data from multiple sources and presents it in a user-friendly interface complete with visual network maps. But there’s a catch: for years, there was no official API for programmatic access.
This posed a problem for security professionals who needed to integrate subdomain reconnaissance into automated workflows, CI/CD pipelines, or custom tooling. Manual web browsing doesn’t scale when you’re assessing dozens of domains or need to correlate DNS data with other intelligence sources. PaulSec’s API-dnsdumpster.com emerged to fill this gap—a Python library that treats the DNSDumpster website itself as an API by scraping its HTML responses and extracting structured data. It’s a perfect case study in pragmatic reverse engineering: when the official interface doesn’t exist, you build an unofficial one.
Technical Insight
The architecture of this wrapper is deceptively simple but demonstrates several key techniques for building robust web scrapers. At its core, it’s a session-aware HTTP client that mimics browser behavior to bypass basic bot protection.
The entry point is the DNSDumpsterAPI class, which maintains a requests session to preserve cookies across multiple HTTP calls. The critical challenge is handling CSRF protection—DNSDumpster’s form submission requires a valid token that’s embedded in the initial page load. Here’s how the library tackles this:
```python
def search(self, domain: str) -> dict:
    # First request: fetch the form page and extract the CSRF token
    req = self.session.get('https://dnsdumpster.com')
    soup = BeautifulSoup(req.content, 'html.parser')
    csrf_token = soup.find('input', attrs={'name': 'csrfmiddlewaretoken'})['value']

    # Second request: submit the form with the token
    data = {
        'csrfmiddlewaretoken': csrf_token,
        'targetip': domain,
        'user': 'free'
    }
    req = self.session.post('https://dnsdumpster.com/',
                            cookies={'csrftoken': csrf_token},
                            data=data,
                            headers={'Referer': 'https://dnsdumpster.com'})

    # Parse the results
    return self._parse_results(req.content)
```
This two-step dance is essential: the first GET request establishes a session and retrieves the CSRF token, while the second POST submits the target domain along with that token. The library even sets the Referer header to further mimic legitimate browser traffic.
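The token-extraction half of that dance can be exercised in isolation against static HTML. In this sketch, the input name `csrfmiddlewaretoken` matches the Django-style field the library looks for, but the surrounding form markup and the helper name `extract_csrf_token` are illustrative, not taken from the library:

```python
from bs4 import BeautifulSoup

# Hypothetical, minimal form page standing in for the real DNSDumpster HTML
FORM_HTML = """
<form method="post">
  <input type="hidden" name="csrfmiddlewaretoken" value="abc123token">
  <input type="text" name="targetip">
</form>
"""

def extract_csrf_token(html: str) -> str:
    """Pull the Django-style CSRF token out of a form page."""
    soup = BeautifulSoup(html, 'html.parser')
    field = soup.find('input', attrs={'name': 'csrfmiddlewaretoken'})
    if field is None:
        raise ValueError('no CSRF token field found')
    return field['value']

print(extract_csrf_token(FORM_HTML))  # abc123token
```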
The parsing logic is where things get fragile but interesting. BeautifulSoup traverses the HTML DOM looking for the section headings and tables that contain DNS records. For example, extracting MX records means locating the ‘MX Records’ heading, walking forward to the adjacent table, then iterating through its rows to pull out mail server hostnames and their associated IP addresses:
```python
def _parse_mx_records(self, soup):
    mx_records = []
    mx_section = soup.find('div', text=re.compile('MX Records'))
    if mx_section:
        table = mx_section.find_next('table')
        for row in table.find_all('tr')[1:]:  # Skip header
            cols = row.find_all('td')
            if len(cols) >= 3:
                mx_records.append({
                    'host': cols[0].text.strip(),
                    'ip': cols[1].text.strip(),
                    'provider': cols[2].text.strip()
                })
    return mx_records
```
The library extends this pattern to extract A records, TXT records, and nameserver data. Each record type has its own parsing method that knows exactly where in the HTML structure to look.
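Because every per-record-type parser follows the same heading-then-table shape, the pattern can be captured in one generic helper. This is a sketch of that pattern, not the library's actual code; the sample HTML and column names are made up for illustration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking a section heading followed by a table
SAMPLE_HTML = """
<div>TXT Records</div>
<table>
  <tr><th>Record</th></tr>
  <tr><td>v=spf1 include:example.com ~all</td></tr>
</table>
"""

def parse_section_table(html: str, section_title: str, columns: list) -> list:
    """Find a section heading, walk to the next table, map cells to column names."""
    soup = BeautifulSoup(html, 'html.parser')
    heading = soup.find('div', string=re.compile(section_title))
    if heading is None:
        return []  # section missing: return empty rather than crash
    table = heading.find_next('table')
    records = []
    for row in table.find_all('tr')[1:]:  # skip the header row
        cols = row.find_all('td')
        if len(cols) >= len(columns):
            records.append({name: cols[i].text.strip()
                            for i, name in enumerate(columns)})
    return records

print(parse_section_table(SAMPLE_HTML, 'TXT Records', ['record']))
```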
One clever feature is the retrieval of DNSDumpster’s auto-generated network visualization. The service creates a PNG image showing the relationships between discovered hosts, and this image is embedded as a base64-encoded data URI in the HTML. The wrapper extracts this string and can optionally decode it to save as a proper image file:
```python
def retrieve_image(self, domain: str, output_path: str = None) -> bytes:
    # Reuse the parsed results page from the preceding search() call;
    # storing it on the instance as self.soup is an assumption of this sketch
    soup = self.soup
    # The map is in an <img> tag with src starting with 'data:image/png;base64,'
    img_tag = soup.find('img', src=re.compile(r'^data:image/png;base64,'))
    if img_tag:
        base64_data = img_tag['src'].split(',')[1]
        image_bytes = base64.b64decode(base64_data)
        if output_path:
            with open(output_path, 'wb') as f:
                f.write(image_bytes)
        return image_bytes
```
The library also supports downloading the Excel export that DNSDumpster generates, which contains a more complete dataset. It does this by making an additional POST request to a different endpoint after the initial search completes.
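Conceptually, that export download is just a second authenticated POST reusing the session's CSRF token. A sketch of how such a request might be assembled; the endpoint path, field names, and the `build_export_request` helper are all placeholders, not DNSDumpster's real interface:

```python
def build_export_request(domain: str, csrf_token: str) -> dict:
    """Assemble the follow-up POST that would fetch the Excel export.

    The URL and field names below are illustrative placeholders only.
    """
    return {
        'url': 'https://dnsdumpster.com/static/xls-export',  # hypothetical path
        'data': {
            'csrfmiddlewaretoken': csrf_token,
            'targetip': domain,
        },
        'headers': {'Referer': 'https://dnsdumpster.com'},
    }

req = build_export_request('example.com', 'abc123')
# req['url'], req['data'], req['headers'] would feed a session.post() call
```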
Error handling distinguishes between network failures (DNSDumpsterRequestError) and parsing failures (DNSDumpsterParseError). This separation is crucial because it tells you whether DNSDumpster is down or whether their HTML structure changed and broke your scraper. The addition of type hints in recent versions makes the API contract explicit—you know exactly what dictionary structure to expect from search().
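The value of that split is easiest to see in a caller. The two exception names come from the library, but the shared base class and the `classify_failure` helper below are assumptions made for this sketch:

```python
class DNSDumpsterError(Exception):
    """Shared parent; having a common base is an assumption of this sketch."""

class DNSDumpsterRequestError(DNSDumpsterError):
    """Network-level failure: timeout, non-200 response, connection refused."""

class DNSDumpsterParseError(DNSDumpsterError):
    """The HTML came back, but it no longer matches the expected structure."""

def classify_failure(exc: Exception) -> str:
    # Branching on the type tells you whether the site is down
    # or whether your scraper has gone stale
    if isinstance(exc, DNSDumpsterRequestError):
        return 'retry later: DNSDumpster may be down'
    if isinstance(exc, DNSDumpsterParseError):
        return 'fix the parser: the HTML structure changed'
    raise exc

print(classify_failure(DNSDumpsterParseError('no MX table found')))
```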
What makes this implementation educational is how it balances pragmatism with maintainability. Yes, it’s scraping HTML which is inherently brittle. But by isolating parsing logic into separate methods, using well-defined error types, and maintaining comprehensive tests, the code becomes as robust as a web scraper can be.
Gotcha
The elephant in the room: this is an unofficial scraper that depends entirely on DNSDumpster’s HTML structure remaining stable. When (not if) DNSDumpster redesigns their interface, updates their CSS classes, or restructures their tables, this library breaks. There’s no SLA, no deprecation warnings, and no guarantee of compatibility. You’re essentially reverse-engineering someone else’s private implementation details.
There are also legal and ethical considerations. While DNSDumpster doesn’t explicitly prohibit automated access in their robots.txt, aggressive scraping could violate their terms of service or be considered abusive behavior. The library includes no rate limiting, exponential backoff, or request throttling. If you loop this over hundreds of domains, you risk getting your IP blocked or causing performance issues for a free service that benefits the security community. For production use cases requiring reliability and compliance, DNSDumpster now offers an official paid API that should be used instead. This wrapper is best suited for occasional, manual reconnaissance tasks where you’re willing to accept the fragility in exchange for not managing API credentials.
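If you do loop the wrapper over many domains, bolting on your own throttle is straightforward. A minimal sketch of a rate-limiting decorator; the one-second default is an arbitrary courtesy value, not anything DNSDumpster specifies, and `lookup` stands in for a real search call:

```python
import time
from functools import wraps

def throttled(min_interval: float = 1.0):
    """Decorator enforcing a minimum delay between successive calls."""
    def decorator(func):
        last_call = [0.0]  # mutable cell shared across calls
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@throttled(min_interval=0.2)
def lookup(domain: str) -> str:
    # Stand-in for a real DNSDumpsterAPI().search(domain) call
    return domain

start = time.monotonic()
results = [lookup(d) for d in ['a.com', 'b.com', 'c.com']]
elapsed = time.monotonic() - start
# three calls with a 0.2 s floor between them take at least ~0.4 s total
```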
Verdict
Use if: You’re conducting ad-hoc penetration testing or security research, need quick subdomain enumeration without account signup friction, and can tolerate occasional breakage when DNSDumpster updates their site. It’s perfect for one-off scripts, educational projects learning web scraping techniques, or situations where you’re already handling errors gracefully and can fall back to manual checks.

Skip if: You need production-grade reliability, are building commercial tooling, require high-volume automated scanning, or need to maintain compliance with service terms. In those scenarios, invest in DNSDumpster’s official API or use active enumeration tools like subfinder and amass that don’t depend on scraping third-party websites. This is a hobbyist’s reconnaissance utility, not enterprise infrastructure.