xioc: Extracting Defanged Indicators of Compromise from Security Intelligence Reports

Hook

Security analysts intentionally break URLs and domains in their reports—writing 'hxxp://evil[.]com' instead of 'http://evil.com'—to prevent accidental infections. This creates a parsing nightmare for automation tools that need to extract these indicators.

Context

In cybersecurity threat intelligence sharing, there's a peculiar practice that drives parsers insane: analysts deliberately mangle indicators of compromise. When documenting malicious domains, IP addresses, or URLs in reports, they'll write 'hxxp://malware[.]example[.]com' or '192(.)168(.)1(.)1' instead of the actual indicators. This 'defanging' prevents accidental clicks, stops security tools from flagging the report itself as malicious, and keeps web crawlers from visiting dangerous sites.

The problem emerges when you need to operationalize this intelligence. Security teams receive hundreds of threat reports from vendors, ISACs, and open-source feeds—all filled with defanged IOCs. Manually copying and 'refanging' each indicator is tedious and error-prone. Standard regex-based extraction fails because 'hxxp[:]//evil[.]com' doesn't match URL patterns. You need a preprocessing layer that understands the obfuscation conventions before applying standard extraction patterns. This is precisely what xioc solves: it normalizes defanged text back to parseable formats, then extracts IOCs using well-tested regex patterns for IPv4/IPv6 addresses, domains, URLs, email addresses, and file hashes.

Technical Insight

xioc's architecture is elegantly simple: a two-stage pipeline that first translates defanging patterns to their original forms, then applies standard IOC extraction regex. The tool is written in Go, compiles to a static binary with zero runtime dependencies, and exposes both CLI and library interfaces.

The defanging translation happens through string replacement maps. Before any extraction, xioc runs the input text through a series of substitutions that recognize common obfuscation patterns. The code in xioc.go defines these mappings:

// Common defanging patterns security analysts use
var defangPatterns = map[string]string{
    "hxxp":     "http",
    "hXXp":     "http",
    "h__p":     "http",
    "[://]":    "://",
    "[.]":      ".",
    "(.)":      ".",
    "[dot]":    ".",
    "(dot)":    ".",
    "[@]":      "@",
    "(@)":      "@",
}

This preprocessing is critical. A URL like 'hxxps[://]malware[.]example[dot]com/payload' gets normalized to 'https://malware.example.com/payload' before the URL extraction regex ever sees it. The approach handles mixed obfuscation too: when a single indicator combines several defanging styles, each recognized pattern is replaced in turn until the text is clean.
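The two stages described above can be sketched with the standard library alone. This is a minimal illustration, not xioc's actual source: `refanger`, `refang`, `extractURLs`, and the deliberately simple `urlPattern` are all hypothetical names, and the replacement pairs simply mirror the table shown earlier.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// refanger is a hypothetical stand-in for the normalization stage.
// strings.NewReplacer scans the text once and, at each position,
// tries the pairs in the order they are listed here.
var refanger = strings.NewReplacer(
	"hxxp", "http",
	"hXXp", "http",
	"h__p", "http",
	"[://]", "://",
	"[.]", ".",
	"(.)", ".",
	"[dot]", ".",
	"(dot)", ".",
	"[@]", "@",
	"(@)", "@",
)

// urlPattern is a simplified URL matcher used only for this sketch.
var urlPattern = regexp.MustCompile(`https?://[^\s"'<>]+`)

// refang rewrites defanged text into its parseable form.
func refang(text string) string {
	return refanger.Replace(text)
}

// extractURLs runs both stages: normalize first, then match.
func extractURLs(text string) []string {
	return urlPattern.FindAllString(refang(text), -1)
}

func main() {
	report := "Beacon traffic went to hxxps[://]malware[.]example[dot]com/payload yesterday."
	fmt.Println(len(urlPattern.FindAllString(report, -1))) // 0: the raw text has no matchable URL
	fmt.Println(extractURLs(report))                        // [https://malware.example.com/payload]
}
```

Using an ordered replacer rather than ranging over a Go map keeps the substitution order deterministic, which starts to matter as soon as two patterns can overlap.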

After normalization, xioc applies targeted regex patterns for each IOC type. The IPv4 extractor, for instance, matches dotted-quad notation and then discards candidates whose octets fall outside 0-255:

import "regexp"

// ipv4Pattern matches four dot-separated groups of one to three digits.
// Three-part strings such as 1.2.3 do not match; out-of-range octets
// are rejected afterwards by isValidIPv4.
var ipv4Pattern = regexp.MustCompile(`\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b`)

func ExtractIPv4s(text string) []string {
    matches := ipv4Pattern.FindAllString(text, -1)

    // Keep only candidates whose octets are all in the range 0-255
    var validIPs []string
    for _, match := range matches {
        if isValidIPv4(match) {
            validIPs = append(validIPs, match)
        }
    }
    return validIPs
}
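The `isValidIPv4` helper is referenced but not shown. One plausible shape for it, written here as an assumption rather than as xioc's real implementation, splits the candidate on dots and range-checks each octet:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isValidIPv4 is a hypothetical version of the helper referenced above.
// It accepts exactly four decimal octets, each in the range 0-255.
func isValidIPv4(candidate string) bool {
	octets := strings.Split(candidate, ".")
	if len(octets) != 4 {
		return false
	}
	for _, octet := range octets {
		n, err := strconv.Atoi(octet)
		if err != nil || n < 0 || n > 255 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isValidIPv4("192.168.1.1"))  // true
	fmt.Println(isValidIPv4("999.10.10.10")) // false
}
```

This version tolerates leading zeros such as '010.1.1.1'. If that is not wanted, a stricter parser like `netip.ParseAddr` from the standard library could be swapped in.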

The CLI interface is designed for Unix pipeline integration. You can pipe curl output, lynx dumps, or plain text files directly into xioc:

# Extract IOCs from a security blog post
curl -s https://threatreport.example/analysis.html | xioc

# Pull IOCs from a PDF converted to text
pdftotext malware-report.pdf - | xioc

# Extract only domains from defanged text
echo 'Contacted evil[.]example[.]com and backup[dot]malicious[dot]org' | xioc -o domain

The -o flag lets you filter output to specific IOC types (ipv4, ipv6, domain, url, email, md5, sha1, sha256), which is invaluable when feeding extracted indicators into downstream tools. You might pipe domains to a DNS resolver, IP addresses to a geolocation service, or file hashes to VirusTotal lookups.
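The three hash types in that list differ only in length: an MD5 digest is 32 hexadecimal characters, SHA-1 is 40, and SHA-256 is 64. The sketch below shows how extraction by type could work on that basis. The names `hexRun` and `classifyHashes` are hypothetical, and the pattern is an assumption for illustration, not necessarily the one xioc uses.

```go
package main

import (
	"fmt"
	"regexp"
)

// hexRun matches standalone runs of hexadecimal characters.
var hexRun = regexp.MustCompile(`\b[0-9a-fA-F]+\b`)

// classifyHashes buckets candidate hashes by digest length:
// 32 hex characters for MD5, 40 for SHA-1, 64 for SHA-256.
func classifyHashes(text string) map[string][]string {
	byType := map[string][]string{}
	for _, run := range hexRun.FindAllString(text, -1) {
		switch len(run) {
		case 32:
			byType["md5"] = append(byType["md5"], run)
		case 40:
			byType["sha1"] = append(byType["sha1"], run)
		case 64:
			byType["sha256"] = append(byType["sha256"], run)
		}
	}
	return byType
}

func main() {
	report := "Dropper: d41d8cd98f00b204e9800998ecf8427e, payload: " +
		"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
	hashes := classifyHashes(report)
	fmt.Println(len(hashes["md5"]), len(hashes["sha1"]), len(hashes["sha256"])) // 1 0 1
}
```

Length alone cannot tell an MD5 digest from any other 32-character hex string, so results of this kind are best treated as candidates until a lookup confirms them.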

For programmatic use, xioc exposes individual extraction functions as a Go library. This lets you embed IOC extraction into security automation platforms, SOAR tools, or custom threat intelligence pipelines:

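import (
    "log"

    "github.com/assafmo/xioc"
)

func processThreatReport(reportText string) {
    // Extract each IOC type of interest
    domains := xioc.ExtractDomains(reportText)
    ips := xioc.ExtractIPv4s(reportText)
    urls := xioc.ExtractURLs(reportText)
    hashes := xioc.ExtractSHA256s(reportText)

    // Feed to security tools. checkReputation and addToBlocklist are
    // placeholders for your own integrations.
    for _, domain := range domains {
        checkReputation(domain)
        addToBlocklist(domain)
    }

    // Go rejects unused variables, so the other result sets are consumed here
    log.Printf("extracted %d IPs, %d URLs, %d SHA-256 hashes", len(ips), len(urls), len(hashes))
}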

The library approach gives you fine-grained control over extraction and post-processing. You can deduplicate results, validate indicators against threat feeds, or apply custom filtering logic that xioc's CLI doesn't provide.

One architectural decision worth noting: xioc doesn't attempt to understand context or validate whether extracted strings are actually malicious. It's purely an extraction tool. If your text contains '192.168.1.1' in a benign networking example, xioc will extract it. This design makes the tool fast and predictable—you control filtering and validation in your pipeline rather than having xioc make decisions about indicator relevance.

Gotcha

xioc's regex-based extraction will produce false positives in certain text contexts. Version numbers like '1.2.3.4' match IPv4 patterns. Email signatures containing legitimate domains will be extracted alongside malicious ones. The tool has no semantic understanding of whether an indicator appears in a threat context versus a benign reference. You'll need post-processing filters if you're working with noisy input.
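One common post-processing filter drops addresses that cannot be external indicators, such as the benign '192.168.1.1' case. The sketch below uses Go's standard `net/netip` package; `dropNonRoutable` is a hypothetical helper name, and which ranges count as noise is an assumption you would tune for your own feeds.

```go
package main

import (
	"fmt"
	"net/netip"
)

// dropNonRoutable removes addresses that are unlikely to be useful as
// external indicators: private, loopback, link-local, and unspecified.
func dropNonRoutable(ips []string) []string {
	var kept []string
	for _, raw := range ips {
		addr, err := netip.ParseAddr(raw)
		if err != nil {
			continue // not a parseable address, skip it
		}
		if addr.IsPrivate() || addr.IsLoopback() || addr.IsLinkLocalUnicast() || addr.IsUnspecified() {
			continue
		}
		kept = append(kept, raw)
	}
	return kept
}

func main() {
	extracted := []string{"192.168.1.1", "203.0.113.7", "127.0.0.1", "10.0.0.5"}
	fmt.Println(dropNonRoutable(extracted)) // [203.0.113.7]
}
```

A filter like this does nothing for version-number lookalikes such as '1.2.3.4', which parse as ordinary public addresses. Those still need an allowlist or surrounding-text heuristics.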

The defanging pattern list is hardcoded and finite. If an analyst invents a novel obfuscation scheme (say, 'h_t_t_p' or '<>' in place of the common patterns xioc recognizes), you'll get no extraction unless you fork the code and add those patterns. Partially defanged indicators, such as 'hxxp://example.com' with the domain left intact, are handled, but heavily mixed patterns can still confuse the preprocessor.

xioc also provides no deduplication: if an IOC appears multiple times in the text, you'll get multiple extractions. For production use in threat intelligence platforms, pipe xioc's output through 'sort -u' or build deduplication into your wrapper code.
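Deduplication in wrapper code can be very small. The helper below, with the hypothetical name `dedupe`, is one way to do it. Unlike 'sort -u', it keeps indicators in the order they first appeared in the report.

```go
package main

import "fmt"

// dedupe returns the input with repeated values removed,
// preserving the position of each value's first occurrence.
func dedupe(iocs []string) []string {
	seen := make(map[string]struct{}, len(iocs))
	var unique []string
	for _, ioc := range iocs {
		if _, ok := seen[ioc]; ok {
			continue
		}
		seen[ioc] = struct{}{}
		unique = append(unique, ioc)
	}
	return unique
}

func main() {
	fmt.Println(dedupe([]string{"evil.example", "bad.example", "evil.example"}))
	// [evil.example bad.example]
}
```

The comparison is exact, so 'Evil.example' and 'evil.example' count as different values. Lowercasing domains before deduplicating is a reasonable extra step.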

Verdict

Use if: You're processing security reports, threat intelligence feeds, or analyst-written documentation where defanging is standard practice. xioc excels at quick CLI extraction for incident response, building security automation pipelines that consume unstructured threat data, or integrating IOC extraction into Go-based security tools. The static binary deployment and zero dependencies make it ideal for air-gapped networks or Docker containers in security operations centers.

Skip if: You need context-aware extraction that understands whether an indicator is malicious based on surrounding text, require validation or enrichment of IOCs (reputation checks, WHOIS lookups), or work primarily with structured data formats like STIX/TAXII where dedicated parsers are more appropriate. Also skip if your threat sources use highly custom defanging patterns that would require constant code modifications: xioc's pattern set is comprehensive for common practices but not infinitely extensible without forking.
