collectlinks: The 50-Line Link Extractor That Shows Why Less Is More in Go

Hook

Many developers reach for a large DOM manipulation library when they need to extract links from HTML. collectlinks does it in about 50 lines on top of a single dependency, the tokenizer in golang.org/x/net/html, and that constraint is exactly what makes it interesting.

Context

Web scraping in Go presents a common dilemma: you need to extract links from HTML, but Go offers no convenient query interface like JavaScript's querySelector or Python's BeautifulSoup. The standard library's html package only escapes and unescapes text; the tokenizer and parser live in golang.org/x/net/html, and they work at the level of tokens and nodes rather than selectors. The typical response is to reach for goquery, a jQuery-like library that builds a full DOM tree and provides familiar CSS selectors.

But what if you only need href attributes? Building an entire DOM tree to extract strings feels like using a bulldozer to plant flowers. You walk the document twice (once to build the tree, once to traverse it), allocate memory for nodes you'll never use, and pull in a larger dependency tree than the task requires. collectlinks emerged from this realization: for the specific use case of link extraction, a streaming parser that processes HTML tokens sequentially is sufficient, and it is both simpler and lighter on memory than a tree-based approach.

Technical Insight

The strength of collectlinks lies in its direct use of html.Tokenizer from golang.org/x/net/html, which reads HTML as a stream of tokens without constructing an in-memory representation of the document structure. With this streaming approach, memory use grows with the number of links collected rather than with the size of the document.

Here's the core implementation pattern:

package main

import (
    "fmt"
    "strings"
    "github.com/JackDanger/collectlinks"
)

func main() {
    htmlContent := `
        <html>
            <body>
                <a href="https://example.com">Example</a>
                <a href="/relative/path">Local</a>
                <a href="mailto:test@example.com">Email</a>
            </body>
        </html>
    `
    
    links := collectlinks.All(strings.NewReader(htmlContent))
    
    for _, link := range links {
        fmt.Println(link)
    }
    // Output:
    // https://example.com
    // /relative/path
    // mailto:test@example.com
}

Under the hood, collectlinks iterates through HTML tokens, checks if each token is a StartTagToken for an anchor element, then extracts the href attribute value. The implementation leverages Go's io.Reader interface, which means it works seamlessly with HTTP response bodies, file handles, or any byte stream—no need to load the entire document into a string first.

The package's API surface is deliberately minimal: a single function that accepts an io.Reader and returns []string. This design choice reflects a Unix philosophy—do one thing well and compose with other tools. If you need URL normalization, pipe the output through a URL parser. If you need filtering, wrap it with your business logic. The package doesn't try to be everything to everyone.

For real-world scraping, you'd typically combine it with net/http:

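package main

import (
    "fmt"
    "log"
    "net/http"
    "strings"

    "github.com/JackDanger/collectlinks"
)

func main() {
    resp, err := http.Get("https://news.ycombinator.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // resp.Body is an io.Reader, so it can be passed in directly.
    links := collectlinks.All(resp.Body)

    // Keep only absolute HTTP(S) links.
    var httpLinks []string
    for _, link := range links {
        if strings.HasPrefix(link, "http://") || strings.HasPrefix(link, "https://") {
            httpLinks = append(httpLinks, link)
        }
    }
    fmt.Println(len(httpLinks), "HTTP(S) links found")
}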

The streaming tokenizer approach has another subtle advantage: it is indifferent to malformed HTML. Real-world web pages are rarely well-formed, and a tree builder has to decide how to repair unclosed or misnested tags, which can produce structures you did not expect. A streaming tokenizer never checks nesting. It reports each anchor start tag as it appears, extracts what it can, and moves on, which suits the messy reality of web scraping.

One architectural decision worth noting: collectlinks returns hrefs exactly as they appear in the HTML source. This means you get relative URLs like /about or ../index.html without any resolution. While this might seem like a limitation, it's actually a defensible design choice—the package can't know your base URL context, and attempting to guess would introduce errors. Instead, you handle URL resolution downstream using net/url's ResolveReference method with your known base URL.
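A minimal sketch of that downstream resolution step, using only net/url. The helper name resolveAll and the example base URL are assumptions made for illustration; neither is part of collectlinks.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveAll is a hypothetical helper (not part of collectlinks). It resolves
// each raw href against base and returns absolute URLs. Hrefs that fail to
// parse are skipped.
func resolveAll(base string, hrefs []string) ([]string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, href := range hrefs {
		ref, err := url.Parse(href)
		if err != nil {
			continue
		}
		out = append(out, baseURL.ResolveReference(ref).String())
	}
	return out, nil
}

func main() {
	hrefs := []string{"/about", "../index.html", "https://other.example/x"}
	abs, err := resolveAll("https://example.com/blog/post/", hrefs)
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	for _, u := range abs {
		fmt.Println(u)
	}
	// Prints:
	// https://example.com/about
	// https://example.com/blog/index.html
	// https://other.example/x
}
```

The base URL would normally be the address the page was fetched from. After redirects, that is the final request URL rather than the one originally requested.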

Gotcha

The simplicity that makes collectlinks elegant also defines its boundaries. It extracts href attributes from anchor tags, period. If you need link elements in the head (<link rel="stylesheet">), image sources, script tags, or CSS url() references, you're out of luck. Modern web applications often load content dynamically, and collectlinks will only see the initial HTML—no JavaScript execution means no single-page app routes.

The lack of URL filtering or validation can also trip you up. You'll get everything: javascript:void(0), mailto: addresses, # fragments, and completely malformed URLs. There's no deduplication either—if a link appears fifty times on the page, you'll get fifty entries in your slice. For production use, you'll need to build filtering logic around it. The package also doesn't handle URL resolution—relative paths stay relative, which means you need to track the base URL yourself and manually resolve references using net/url.Parse and ResolveReference. This isn't necessarily bad, but it's more plumbing than you might expect from a "link extraction" library.
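Assuming the behavior described above, the filtering and deduplication plumbing might look something like this sketch. The helper name cleanLinks is hypothetical, and the policy it applies (HTTP and HTTPS only, fragments ignored when comparing) is one reasonable choice rather than a recommendation from the package.

```go
package main

import (
	"fmt"
	"net/url"
)

// cleanLinks is a hypothetical helper (not part of collectlinks). It keeps
// only absolute http/https URLs, strips fragments, and drops duplicates
// while preserving first-seen order.
func cleanLinks(hrefs []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, href := range hrefs {
		u, err := url.Parse(href)
		if err != nil {
			continue // malformed URL
		}
		if u.Scheme != "http" && u.Scheme != "https" {
			continue // javascript:, mailto:, relative paths, bare #fragments
		}
		u.Fragment = "" // treat page#a and page#b as the same page
		key := u.String()
		if seen[key] {
			continue
		}
		seen[key] = true
		out = append(out, key)
	}
	return out
}

func main() {
	raw := []string{
		"https://example.com/a",
		"https://example.com/a#section",
		"javascript:void(0)",
		"mailto:test@example.com",
		"#top",
		"://bad",
		"http://example.com/b",
	}
	for _, link := range cleanLinks(raw) {
		fmt.Println(link)
	}
	// Prints:
	// https://example.com/a
	// http://example.com/b
}
```

Note that this filter drops relative paths because they have no scheme, so any links you want to keep must be resolved against a base URL before they reach it.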

Verdict

Use if: You're building a focused crawler or scraper where you'll handle URL processing, filtering, and validation yourself anyway. The small dependency footprint and streaming efficiency make it a good fit for embedded systems, AWS Lambda functions with size constraints, or situations where you want complete control over the link processing pipeline. It's also well suited to learning projects: the source code is short enough to read in five minutes and understand completely.

Skip if: You need comprehensive link extraction beyond anchor hrefs, URL normalization and validation out of the box, or you're already using goquery for other DOM manipulation tasks (in which case, just use its Find("a[href]") capability). For complex production scrapers with sophisticated crawling logic, frameworks like colly provide better value despite their larger footprint.
