dcrawl: Building Domain Lists at Scale with Go's Concurrency Primitives

Hook

Most web crawlers try to map entire websites. dcrawl does something smarter: it skims the surface of millions of pages to extract one thing—domain names—and does it fast enough to gather 100,000+ unique domains in hours.

Context

Building comprehensive domain lists is a fundamental task in security research, threat intelligence, and web topology analysis. Whether you're creating blocklists, conducting reconnaissance, or studying internet infrastructure, you need a way to discover domains beyond what DNS enumeration or certificate transparency logs can provide. Traditional crawling frameworks like Scrapy or Nutch are built for deep site scraping: indexing every page of a website, following sitemaps, and extracting structured content. When your goal is breadth over depth, these tools are overkill. You don't need to scrape product catalogs or parse article metadata. You need to hop across the web's hyperlink graph, extract domain names, and move on.

This is where dcrawl finds its niche. Created by Kuba Gretzky (known for evilginx2), dcrawl is purpose-built for domain discovery. It treats the web as a graph where each page is just a stepping stone to more domains. By capping links per hostname, filtering on content type, and limiting subdomains per domain, it avoids the traps that catch naive crawlers, such as infinite calendars, platforms with unbounded subdomains, and massive media galleries. It compiles to a single Go binary with no runtime dependencies, designed to run on a VPS for days, steadily building domain lists without babysitting.

Technical Insight

[Figure: System architecture (auto-generated). Nodes: Seed URLs, URL Queue, Worker Pool (goroutines), Content Type Validation, HTTP GET (max 1MB), HTML Parser (extract links), Filter Logic, Hostname Limit (default 5/host), Subdomain Limit (default 10/domain), State Tracker, Domain Output. Edge labels: HEAD request, valid, invalid, check, pass, reject, new URLs, discovered.]

dcrawl's architecture is built around doing one thing well. At its core, it's a breadth-first crawler with a URL queue serviced by goroutines. The main loop starts a pool of worker goroutines (sized via -t) that pull URLs from the queue, make HTTP requests, parse HTML for links, and feed new URLs back into the queue. The real intelligence is in what it doesn't crawl.

The crawler implements two heuristics to prevent runaway crawling. First, it limits the number of links followed per hostname (default 5, via the -l flag), which stops it from crawling every page of a single site. Second, it caps the number of subdomains per domain (default 10, via the -s flag). This matters because platforms like WordPress.com or Blogspot host an effectively unbounded number of subdomains. Without this limit, you'd waste resources enumerating user1.wordpress.com, user2.wordpress.com, and so on, instead of discovering new registrable domains.
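To make the two caps concrete, here is a minimal sketch of how such a limiter could be written. The type and function names (limiter, newLimiter, allow) are hypothetical and not taken from dcrawl's source, and the caller is assumed to have already split a URL into its hostname and registrable domain.

```go
package main

import (
	"fmt"
	"sync"
)

// limiter enforces two caps: URLs accepted per hostname, and distinct
// hostnames (subdomains) accepted per registrable domain.
// Hypothetical type for illustration only.
type limiter struct {
	mu            sync.Mutex
	maxPerHost    int
	maxSubdomains int
	hostLinks     map[string]int                 // hostname -> URLs accepted
	domainHosts   map[string]map[string]struct{} // domain -> hostnames seen
}

func newLimiter(maxPerHost, maxSubdomains int) *limiter {
	return &limiter{
		maxPerHost:    maxPerHost,
		maxSubdomains: maxSubdomains,
		hostLinks:     make(map[string]int),
		domainHosts:   make(map[string]map[string]struct{}),
	}
}

// allow reports whether one more URL on host (which belongs to domain) may
// be queued, and records it if so.
func (l *limiter) allow(domain, host string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	hosts := l.domainHosts[domain]
	_, known := hosts[host]
	if !known && len(hosts) >= l.maxSubdomains {
		return false // too many distinct hostnames under this domain
	}
	if l.hostLinks[host] >= l.maxPerHost {
		return false // this hostname has used up its link budget
	}
	if hosts == nil {
		hosts = make(map[string]struct{})
		l.domainHosts[domain] = hosts
	}
	hosts[host] = struct{}{}
	l.hostLinks[host]++
	return true
}

func main() {
	l := newLimiter(5, 10) // the defaults described above
	for i := 1; i <= 6; i++ {
		// The sixth link on the same hostname is rejected.
		fmt.Println(i, l.allow("example.com", "www.example.com"))
	}
}
```

A single mutex keeps both counters consistent with each other; a lock-free design is possible but harder to get right when one decision depends on two maps.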

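Here's a simplified illustration of how dcrawl-style link extraction and filtering works (extractDomain and isSubdomainExplosion are placeholder helpers, not shown):

// Simplified sketch of the link extraction logic.
import (
    "bytes"
    "net/url"

    "golang.org/x/net/html"
)

// extractDomains returns the domain of every http(s) link found in an HTML body.
// extractDomain and isSubdomainExplosion are placeholder helpers (not shown).
func extractDomains(body []byte, baseURL *url.URL) []string {
    doc, err := html.Parse(bytes.NewReader(body))
    if err != nil {
        return nil // unreadable input: nothing to extract
    }

    var domains []string
    var traverse func(*html.Node)
    traverse = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "a" {
            for _, attr := range n.Attr {
                if attr.Key != "href" {
                    continue
                }
                link, err := url.Parse(attr.Val)
                if err != nil {
                    continue // skip malformed hrefs
                }
                // Resolve relative URLs against the page URL
                absolute := baseURL.ResolveReference(link)

                // Only HTTP/HTTPS schemes
                if absolute.Scheme == "http" || absolute.Scheme == "https" {
                    // Hostname() drops any port that Host would include
                    domain := extractDomain(absolute.Hostname())
                    if domain != "" && !isSubdomainExplosion(domain) {
                        domains = append(domains, domain)
                    }
                }
            }
        }
        // Recurse into child nodes
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            traverse(c)
        }
    }
    traverse(doc)
    return domains
}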

Before downloading a full page body, dcrawl sends a HEAD request and checks the Content-Type header. If the response isn't text/html, it skips the download. This costs one extra round trip per URL but saves substantial bandwidth: no PDFs, images, or video files are fetched only to find they contain no links. The body download is also capped at 1MB, which bounds memory use when a crawl hits a huge or malformed HTML page.
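The HEAD-then-GET pattern with a size cap can be sketched with the standard library alone. The function names isHTML and fetchHTML are hypothetical, and the local httptest server is there only so the example runs without network access; none of this is taken from dcrawl's source.

```go
package main

import (
	"fmt"
	"io"
	"mime"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// isHTML reports whether a Content-Type header value denotes text/html,
// ignoring parameters such as charset.
func isHTML(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	return err == nil && mediaType == "text/html"
}

// fetchHTML sends a HEAD request first and only issues the GET when the
// server reports text/html. At most limit bytes of the body are read.
func fetchHTML(client *http.Client, url string, limit int64) ([]byte, error) {
	head, err := client.Head(url)
	if err != nil {
		return nil, err
	}
	head.Body.Close()
	if !isHTML(head.Header.Get("Content-Type")) {
		return nil, fmt.Errorf("skipping %s: not text/html", url)
	}

	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// LimitReader stops after limit bytes even if the server sends more.
	return io.ReadAll(io.LimitReader(resp.Body, limit))
}

func main() {
	// Local test server that returns 2048 bytes of "HTML".
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		io.WriteString(w, strings.Repeat("x", 2048))
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second}
	body, err := fetchHTML(client, srv.URL, 1024) // use 1<<20 for a 1MB cap
	fmt.Println(len(body), err)
}
```

Parsing the header with mime.ParseMediaType rather than comparing strings matters in practice, because servers commonly append a charset parameter to the media type.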

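The threading model uses goroutines with a shared channel for URL distribution. Each worker goroutine pulls from the queue, processes URLs, and pushes discovered URLs back. A sync.Map tracks visited URLs to prevent duplicate requests, and another tracks domain/subdomain counts to enforce the heuristics. This design is simple but effective: Go's runtime handles the scheduling, and a bounded channel provides natural backpressure. One caveat applies to any design where workers both read from and write to the same queue: if the channel fills while every worker is blocked on a send, nothing is left to receive, so the queue needs enough slack or a separate enqueue path.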
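As a rough sketch of this worker-pool pattern, the example below crawls a toy in-memory link graph instead of the real web. The crawl function, its fetch callback, and the WaitGroup-based shutdown are assumptions made for illustration, not dcrawl's actual implementation. It also sends each URL from its own goroutine, which avoids the worker deadlock at the cost of giving up backpressure.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// crawl visits every URL reachable from seeds using a fixed pool of workers
// (workers must be at least 1). fetch stands in for "download the page and
// extract its links".
func crawl(seeds []string, workers int, fetch func(string) []string) []string {
	var visited sync.Map       // URL -> struct{}: dedupes across workers
	var pending sync.WaitGroup // counts URLs queued but not yet processed
	queue := make(chan string)

	enqueue := func(u string) {
		if _, seen := visited.LoadOrStore(u, struct{}{}); seen {
			return // already queued or processed
		}
		pending.Add(1)
		// Send from a separate goroutine so a worker never blocks on the
		// queue it is also responsible for draining.
		go func() { queue <- u }()
	}

	for i := 0; i < workers; i++ {
		go func() {
			for u := range queue {
				for _, next := range fetch(u) {
					enqueue(next)
				}
				pending.Done()
			}
		}()
	}

	for _, s := range seeds {
		enqueue(s)
	}
	pending.Wait() // every queued URL has been processed
	close(queue)   // lets the workers exit

	var out []string
	visited.Range(func(k, _ any) bool {
		out = append(out, k.(string))
		return true
	})
	sort.Strings(out)
	return out
}

func main() {
	// Toy link graph in place of real HTTP fetches.
	graph := map[string][]string{
		"a.example": {"b.example", "c.example"},
		"b.example": {"a.example", "d.example"},
		"c.example": {"d.example"},
	}
	fetch := func(u string) []string { return graph[u] }
	fmt.Println(crawl([]string{"a.example"}, 4, fetch))
	// prints [a.example b.example c.example d.example]
}
```

LoadOrStore checks and marks a URL in one atomic step, so two workers that discover the same link at the same time cannot both queue it.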

One clever feature is resumable crawling. You can point dcrawl at its own output file with -r, and it'll load previously discovered domains to skip already-seen URLs. This enables iterative discovery campaigns where you run the crawler for a few hours, stop, analyze results, and continue without re-crawling:

# Initial crawl
dcrawl -t 50 -o domains.txt -seed https://example.com

# Resume and continue discovering
dcrawl -t 50 -o domains.txt -r domains.txt -seed https://another-seed.com
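Resuming boils down to rebuilding the "already seen" set from the previous output file before crawling starts. The sketch below shows one way to do that. The names loadSeen and loadSeenString are hypothetical, and the trimming and lower-casing are assumptions about how entries should be normalized.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// loadSeen reads a newline-delimited domain list and returns it as a set.
// Blank lines are skipped; entries are trimmed and lower-cased.
func loadSeen(r io.Reader) (map[string]struct{}, error) {
	seen := make(map[string]struct{})
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		d := strings.ToLower(strings.TrimSpace(sc.Text()))
		if d != "" {
			seen[d] = struct{}{}
		}
	}
	return seen, sc.Err()
}

// loadSeenString is a convenience wrapper for in-memory input.
func loadSeenString(list string) (map[string]struct{}, error) {
	return loadSeen(strings.NewReader(list))
}

func main() {
	// In real use the reader would come from os.Open on the output file.
	seen, err := loadSeenString("example.com\n\nExample.COM \nexample.org\n")
	if err != nil {
		fmt.Println("load failed:", err)
		return
	}
	_, known := seen["example.org"]
	fmt.Println(len(seen), known) // prints 2 true
}
```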

The output is a simple newline-delimited text file of unique domain names. No JSON, no metadata—just domains. This makes it trivial to pipe into other tools for DNS resolution, HTTP probing, or database insertion. The simplicity is refreshing in an ecosystem of over-engineered tools that output gigabytes of structured data you'll never use.

Gotcha

dcrawl's biggest limitation is that it cannot handle JavaScript-heavy websites. Single-page applications that load content via AJAX, or that render links client-side with frameworks like React, won't yield their full link graphs: if a page needs JavaScript execution to produce its links, dcrawl won't see them. This is a deliberate trade-off, since adding a headless browser would sharply increase resource requirements and complexity. For domain discovery, static HTML parsing covers enough of the web to be useful, but you'll miss domains that are linked only from JS-rendered content.

The tool also ignores robots.txt and implements no per-domain rate limiting. The links-per-hostname cap bounds the total number of requests sent to any one host, but it does not space those requests out, so this is not polite crawling by web standards. For large-scale operations, you risk getting IP-banned or causing server strain. There's no built-in proxy rotation or user-agent randomization, so defensive sites will detect and block you quickly. Ethically, this makes dcrawl a research tool, not something to point at production infrastructure without permission. You're also responsible for filtering out domains you shouldn't interact with; there's no built-in blocklist for honeypots or illegal content.
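If you build your own crawler or wrapper and want per-host politeness, one common approach is to enforce a minimum interval between requests to the same host. The sketch below is a hypothetical design (hostPacer, reserve, and delaysFor are invented names) and is not something dcrawl provides.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hostPacer spaces out requests to the same host by a minimum interval.
type hostPacer struct {
	mu       sync.Mutex
	interval time.Duration
	next     map[string]time.Time // host -> earliest time of the next request
}

func newHostPacer(interval time.Duration) *hostPacer {
	return &hostPacer{interval: interval, next: make(map[string]time.Time)}
}

// reserve books the next free slot for host and returns how long the caller
// should wait, measured from now, before sending the request.
func (p *hostPacer) reserve(host string, now time.Time) time.Duration {
	p.mu.Lock()
	defer p.mu.Unlock()

	slot := p.next[host]
	if slot.Before(now) {
		slot = now // host is idle: the request can go out immediately
	}
	p.next[host] = slot.Add(p.interval)
	return slot.Sub(now)
}

// delaysFor reserves a slot for each host in order, all at the same instant,
// and returns the resulting waits. It exists only to drive the demo.
func delaysFor(p *hostPacer, hosts []string) []time.Duration {
	now := time.Now()
	delays := make([]time.Duration, 0, len(hosts))
	for _, h := range hosts {
		delays = append(delays, p.reserve(h, now))
	}
	return delays
}

func main() {
	p := newHostPacer(2 * time.Second)
	hosts := []string{"example.com", "example.com", "example.org", "example.com"}
	fmt.Println(delaysFor(p, hosts)) // prints [0s 2s 0s 4s]
}
```

A worker would call reserve and then sleep for the returned duration before issuing its request. Passing the current time in as a parameter keeps the scheduling logic deterministic and easy to test.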

Deduplication across multiple runs requires manual merging of output files. If you run dcrawl with different seeds, you'll get duplicate domains across files unless you post-process with sort -u. For one-off discovery tasks this is fine, but for continuous crawling operations, you'll want to pipe output into a database with unique constraints.
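For cases where a database is more than you need, merging several runs in code is straightforward. This sketch assumes each run's output has already been read into a slice of lines, and mergeUnique is a hypothetical helper name. Unlike plain sort -u, it also folds case and trims whitespace.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// mergeUnique combines several domain lists into one sorted list with
// duplicates removed.
func mergeUnique(lists ...[]string) []string {
	seen := make(map[string]struct{})
	for _, list := range lists {
		for _, d := range list {
			d = strings.ToLower(strings.TrimSpace(d))
			if d != "" {
				seen[d] = struct{}{}
			}
		}
	}
	out := make([]string, 0, len(seen))
	for d := range seen {
		out = append(out, d)
	}
	sort.Strings(out)
	return out
}

func main() {
	runA := []string{"example.org", "example.com"}
	runB := []string{"example.com", "example.net"}
	fmt.Println(mergeUnique(runA, runB))
	// prints [example.com example.net example.org]
}
```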

Verdict

Use if: You need a lightweight, fast tool for building domain lists from web crawling, you're conducting security research or reconnaissance where breadth matters more than depth, you want a single binary with zero dependencies that runs anywhere Go compiles, or you're starting from known seed URLs and want to discover connected domains across the web graph. It's perfect for building custom blocklists, initial recon phases, or academic web topology research.

Skip if: You need to crawl JavaScript-heavy modern web applications comprehensively, you require production-grade features like robots.txt respect and polite rate-limiting, you want persistent state management across distributed crawlers, or you need rich metadata beyond just domain names.

For ethical use, pair dcrawl with your own rate-limiting wrapper, proxy rotation, and filtering logic: the tool gives you speed and simplicity, but you're responsible for using it responsibly.