HTTP Garden: Finding Request Smuggling Bugs by Pitting Servers Against Each Other
Hook
Nginx accepts bare LF characters in chunked request bodies—a clear spec violation that maintainers openly refuse to fix. How many other parsing disagreements are lurking between your proxy and backend server?
Context
HTTP request smuggling attacks exploit a fundamental problem: when a proxy and backend server disagree about where one request ends and another begins, attackers can inject requests that the proxy never saw. Traditional security testing approaches miss these vulnerabilities because they test servers in isolation. A payload might be perfectly valid to your WAF but get parsed completely differently by your application server, creating a blind spot where attackers slip through.
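To see why a boundary disagreement is exploitable, consider the classic CL.TE desync. The sketch below is purely illustrative (the path, lengths, and parsing shortcuts are made up for the example): a front end that trusts Content-Length forwards the entire body as inert data, while a back end that trusts Transfer-Encoding stops at the zero-length chunk and treats the leftover bytes as the start of a second request the front end never inspected.

```python
# Illustrative CL.TE desync payload (hypothetical, for demonstration only).
# A Content-Length parser forwards all 35 body bytes as one request; a
# chunked parser stops at the 0-chunk and sees the rest as a *new* request.
payload = (
    b"POST / HTTP/1.1\r\n"
    b"Host: a\r\n"
    b"Content-Length: 35\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"0\r\n"
    b"\r\n"
    b"GET /admin HTTP/1.1\r\nHost: a\r\n"
)

# Body as seen by a Content-Length parser: everything after the blank line.
cl_body = payload.split(b"\r\n\r\n", 1)[1]

# Crude model of a chunked parser: the body ends at the terminating 0-chunk,
# and whatever follows is interpreted as the start of the next request.
te_body, _, smuggled = cl_body.partition(b"0\r\n\r\n")

print(len(cl_body))  # 35, matching the Content-Length the front end trusts
print(smuggled)      # bytes the back end reads as a brand-new request
```

The front end believes it forwarded one harmless POST; the back end sees a trailing `GET /admin` that no security control ever examined.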
The challenge isn’t just finding malformed HTTP; it’s finding HTTP that’s malformed in interesting ways, where different implementations disagree about what it means. You need to send the same payload to dozens of servers simultaneously, normalize their responses to ignore benign quirks, and surface only the disagreements that matter. That’s the insight behind HTTP Garden, a DARPA-supported differential testing framework from researchers at Narf Industries, Galois, Trail of Bits, and Dartmouth College. Instead of asking “is this valid HTTP?”, it asks “do these twelve servers agree on what this HTTP means?”, and when the answer is no, you’ve found an attack surface.
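The core loop is simple to state. As a minimal sketch with hypothetical helper names (HTTP Garden’s real fanout and grid run inside its REPL against Dockerized targets, and normalize responses before comparing), it looks roughly like this:

```python
import socket
from collections import defaultdict

# Minimal differential-fanout sketch. send_raw/fanout/grid are hypothetical
# stand-ins for the concepts, not HTTP Garden's actual implementation.

def send_raw(host: str, port: int, payload: bytes, timeout: float = 2.0) -> bytes:
    """Send raw bytes to one server and collect whatever it answers."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        chunks = []
        try:
            while data := sock.recv(4096):
                chunks.append(data)
        except TimeoutError:
            pass  # server kept the connection open; keep what we got
        return b"".join(chunks)

def fanout(targets: dict[str, tuple[str, int]], payload: bytes) -> dict[str, bytes]:
    """Send the identical payload to every configured target."""
    return {name: send_raw(host, port, payload)
            for name, (host, port) in targets.items()}

def grid(responses: dict[str, bytes]) -> dict[bytes, list[str]]:
    """Bucket servers by interpretation; a disagreement shows as >1 bucket."""
    buckets: dict[bytes, list[str]] = defaultdict(list)
    for name, response in responses.items():
        buckets[response].append(name)
    return dict(buckets)
```

With raw bytes as the comparison key, `grid` would flag every cosmetic difference; the interesting engineering, discussed below, is normalizing responses so only meaningful disagreements surface.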
Technical Insight
HTTP Garden’s architecture revolves around a composable pipeline that makes differential testing almost trivial. Each HTTP server and proxy runs in its own Docker container, built from source at a specific commit for reproducibility. The REPL interface gives you four core operations that chain together: payload creates the test input, transduce routes it through a proxy, fanout sends it to all configured servers simultaneously, and grid compares their interpretations.
Here’s what discovering a real vulnerability looks like:
garden> payload 'POST / HTTP/1.1\r\nHost: a\r\nTransfer-Encoding: chunked\r\n\r\n0\n\r\n' | fanout | grid
gunicorn: [
    HTTPResponse(version=b'1.1', code=b'400', reason=b'Bad Request'),
]
hyper: [
]
nginx: [
    HTTPRequest(
        method=b'POST', uri=b'/', version=b'1.1',
        headers=[
            (b'transfer-encoding', b'chunked'),
            (b'host', b'a'),
            (b'content-length', b'0'),
            (b'content-type', b''),
        ],
        body=b'',
    ),
]
         g
         u
         n
         i h n
         c y g
         o p i
         r e n
         n r x
        +-----
gunicorn|✓ ✓ X
hyper   |  ✓ X
nginx   |    ✓
That 0\n instead of 0\r\n is the smoking gun—a bare LF line ending in a chunked body that’s explicitly disallowed by RFC 9112. Gunicorn rejects it with a 400, Hyper silently drops the connection, but Nginx happily accepts it. In a proxy-backend configuration where HAProxy sits in front of Nginx, this disagreement becomes an attack vector.
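A toy model makes the disagreement concrete. The strict and lenient checks below are stand-ins written for illustration, not Gunicorn’s or Nginx’s actual parsing code:

```python
# Toy strict vs. lenient chunk-terminator check, standing in for the real
# parsers (Gunicorn rejects the bare LF; Nginx tolerates it).
def reads_chunk_terminator(body: bytes, lenient: bool) -> bool:
    """Return True if this parser accepts `body` as a finished chunked body."""
    if body.endswith(b"0\r\n\r\n"):
        return True          # RFC 9112 requires CRLF line endings
    if lenient and body.endswith(b"0\n\r\n"):
        return True          # bare LF after the last-chunk size: tolerated
    return False

malformed = b"0\n\r\n"
print(reads_chunk_terminator(malformed, lenient=False))  # False (rejects)
print(reads_chunk_terminator(malformed, lenient=True))   # True  (accepts)
```

When the lenient implementation sits behind a strict-but-forwarding proxy (or vice versa), each side places the request boundary at a different byte, which is the precondition for a desync.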
The framework’s intelligence lies in its equivalence checking. Notice how Gunicorn’s 400 response and Hyper’s empty response both show as agreeing in the grid? That’s because HTTP Garden understands that both represent rejection of the malformed input: semantically equivalent even if technically different. Similarly, when Nginx synthesizes content-length and content-type headers the client never sent (a known quirk), the Garden doesn’t flag it as a disagreement:
garden> payload 'GET / HTTP/1.1\r\nHOST: a\r\n\r\n' | transduce haproxy | fanout | grid
# All three servers show as agreeing despite Nginx's extra headers
         g
         u
         n
         i h n
         c y g
         o p i
         r e n
         n r x
        +-----
gunicorn|✓ ✓ ✓
hyper   |  ✓ ✓
nginx   |    ✓
This normalization happens through quirk detection mechanisms that identify each implementation’s benign idiosyncrasies. The result is a signal-to-noise ratio that makes real parser disagreements jump out immediately.
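A sketch of what quirk-aware normalization can look like; the quirk table and function below are hypothetical illustrations, not HTTP Garden’s internal data structures:

```python
# Sketch of quirk-aware normalization before comparison. The quirk table is
# illustrative; HTTP Garden maintains its own catalog of each server's
# known-benign idiosyncrasies.
KNOWN_QUIRKS: dict[str, set[bytes]] = {
    # Nginx synthesizes content-length/content-type headers the client never
    # sent, so ignore those names when comparing against other servers.
    "nginx": {b"content-length", b"content-type"},
}

Header = tuple[bytes, bytes]

def normalize(server: str, headers: list[Header]) -> list[Header]:
    """Lowercase names, drop quirk headers, sort for order-insensitive compare."""
    ignored = KNOWN_QUIRKS.get(server, set())
    return sorted((k.lower(), v) for k, v in headers if k.lower() not in ignored)

nginx_view = [(b"transfer-encoding", b"chunked"), (b"host", b"a"),
              (b"content-length", b"0"), (b"content-type", b"")]
gunicorn_view = [(b"Host", b"a"), (b"Transfer-Encoding", b"chunked")]

print(normalize("nginx", nginx_view) == normalize("gunicorn", gunicorn_view))  # True
```

After normalization, the two views collapse to the same header list, so only disagreements that survive quirk filtering reach the grid.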
The transducer pipeline is where things get especially powerful. You can chain a payload through HAProxy, see how it transforms the request, then fan that transformed version out to your backend servers. This directly models real-world deployment architectures where requests pass through multiple layers of proxies, CDNs, and load balancers before reaching your application. Each layer might normalize, reject, or subtly modify the HTTP—and those modifications are exactly where desynchronization attacks hide.
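Conceptually, each proxy hop is a function from bytes to bytes, and a deployment is their composition. A minimal sketch of that idea (the function names and the line-ending rewrite are assumptions for illustration, not the behavior of any specific proxy):

```python
from functools import reduce
from typing import Callable

# Each transducer models one proxy hop rewriting the raw request bytes.
Transducer = Callable[[bytes], bytes]

def chain(*layers: Transducer) -> Transducer:
    """Compose layers left-to-right, the order a request traverses them."""
    return lambda payload: reduce(lambda p, layer: layer(p), layers, payload)

def normalize_line_endings(payload: bytes) -> bytes:
    """Toy proxy behavior (an assumption, not a specific proxy): rewrite
    bare LF line endings to CRLF before forwarding."""
    return payload.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

pipeline = chain(normalize_line_endings)
print(pipeline(b"GET / HTTP/1.1\nHost: a\n\n"))
```

The security-relevant point is that the bytes a backend receives are the output of this composition, not the original payload, so desync hunting has to test the chain, not each parser in isolation.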
Under the hood, each target is parameterized with APP_REPO, APP_BRANCH, and APP_VERSION in its Dockerfile, making it trivial to test different versions of the same software. This means you can verify whether a parser bug exists in older versions, test proposed patches, or compare behavior across major version bumps. The project currently covers 37 HTTP servers (from Apache and Nginx to Rust’s Hyper and Go’s stdlib) and 13 proxies, all built from source at pinned commits.
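Concretely, a target Dockerfile might look roughly like this. This is a sketch built only from the parameterization the project describes; the base image, package list, and build commands are assumptions that vary per target:

```dockerfile
# Sketch only; real Garden targets differ in base image and build steps.
FROM debian:bookworm
ARG APP_REPO
ARG APP_BRANCH
ARG APP_VERSION
RUN apt-get update && apt-get install -y git build-essential
# Clone the pinned source and check out the exact commit under test.
RUN git clone --branch "${APP_BRANCH}" "${APP_REPO}" /app \
 && cd /app \
 && git checkout "${APP_VERSION}" \
 && make
```

Rebuilding with a different `--build-arg APP_VERSION=...` is then enough to bisect a parser bug across releases or to try a proposed patch.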
Gotcha
HTTP Garden runs on x86_64 and AArch64 Linux; the README explicitly notes it is untested on other platforms, so compatibility on macOS or Windows is uncertain. The dependency footprint is also hefty: you need Python 3.12 or newer (ruling out many production environments still on 3.11 or earlier), Docker, and the willingness to build dozens of servers from source. Initial setup takes significant time and disk space because each container compiles its target from scratch.
More fundamentally, the framework stops at discovery. When the grid shows disagreements, you get a clear visualization of which implementations differ, but no automated classification of severity, no proof-of-concept exploit generation, and no mapping to CVE categories. You still need deep HTTP knowledge to understand whether a disagreement is exploitable request smuggling, a harmless parser quirk that slipped through normalization, or something in between. The REPL is powerful but appears to have limited documentation for integration into CI/CD pipelines or automated security testing workflows. The design is optimized for interactive exploration, which makes sense for security research but less so for regression testing.
Verdict
Use HTTP Garden if you’re hunting for HTTP desynchronization vulnerabilities in production architectures, validating RFC compliance for your own HTTP implementation, or researching parser security. It’s the only tool that makes differential testing across dozens of implementations this accessible, and the transducer pipeline perfectly models real proxy-backend chains. Skip it if you’re on non-Linux platforms (it’s untested elsewhere), need automated exploit generation rather than discovery (it only finds disagreements), want lightweight compliance checking (the OWASP or httpwg test suites are simpler), or require well-documented integration into automated testing pipelines (the REPL-centric design may not facilitate this). This is a specialized research instrument for finding subtle parser discrepancies that matter—when two implementations disagree about what a byte stream means, attackers take notice.