WAVSEP: The Forgotten Benchmark That Exposed Security Scanner Snake Oil

Hook

Between 2010 and 2014, WAVSEP benchmarks revealed that expensive enterprise security scanners routinely missed 40-60% of vulnerabilities while cheap tools sometimes outperformed them. The vendors weren't happy.

Context

Before WAVSEP, choosing a web application security scanner was an expensive gamble wrapped in vendor marketing. Companies would drop $50,000+ on enterprise scanners based on feature lists and sales pitches, only to discover critical vulnerabilities slipping through in production. The problem wasn't just that scanners missed vulnerabilities—it was that no one had a standardized way to measure how badly they missed them.

The landscape of vulnerable web applications in the early 2010s focused on training: DVWA taught developers about XSS, and WebGoat provided security lessons. But none of these answered the fundamental question security teams were asking: "If I point Scanner A and Scanner B at the same application, which one actually finds more real vulnerabilities without drowning me in false positives?" WAVSEP emerged from this gap. Security researcher Shay Chen created it not as another training platform but as a forensic benchmark: a controlled experiment where every vulnerability was catalogued, every edge case documented, and scanner performance could be measured with scientific rigor.

Technical Insight

WAVSEP's architecture reveals a deceptively simple but powerful design philosophy: organize vulnerable web pages into a structured taxonomy that separates true positives from false positive test cases. The application deploys as a standard Java WAR file on Tomcat, but its real innovation lies in its methodical categorization of vulnerability variants.

The project structure organizes test cases by vulnerability type (XSS, SQL injection, file inclusion, etc.) with subdirectories for different complexity levels. Each vulnerability category contains both vulnerable pages that should trigger scanner alerts and safe pages that implement proper input validation. For example, the reflected XSS test cases follow patterns like these (simplified for illustration):

// Case 1: Basic reflected XSS (should be detected)
String rawInput = request.getParameter("input");
out.println("<div>" + rawInput + "</div>");

// Case 2: Reflected XSS with URL encoding (tests parser robustness)
// getParameter() returns the decoded value, so %3Cscript%3E arrives as <script
String decodedInput = request.getParameter("input");
out.println("<div>" + decodedInput + "</div>");  // Input: %3Cscript%3E

// Case 3: False positive test - properly sanitized
// StringEscapeUtils comes from Apache Commons (commons-text or commons-lang3)
String userInput = request.getParameter("input");
String sanitized = StringEscapeUtils.escapeHtml4(userInput);
out.println("<div>" + sanitized + "</div>");

The genius is in the granularity. WAVSEP doesn't just test if a scanner can find basic XSS—it tests whether the scanner can detect XSS through event handlers, through JavaScript eval contexts, through CSS expressions, and through various encoding schemes. Each variant exists as a separate endpoint, allowing researchers to map exactly which detection techniques work and which fail.

The SQL injection test cases demonstrate this comprehensively. WAVSEP includes test cases for numeric injection, string injection, second-order injection, and blind SQL injection across multiple database contexts:

// Numeric parameter injection (no quotes needed)
String id = request.getParameter("id");
String numericQuery = "SELECT * FROM users WHERE id = " + id;
statement.executeQuery(numericQuery);

// String parameter with single quotes
String name = request.getParameter("name");
String stringQuery = "SELECT * FROM users WHERE name = '" + name + "'";
statement.executeQuery(stringQuery);

// False positive control - parameterized query
// The driver binds the value separately, so input never alters the SQL text
String safeId = request.getParameter("id");
PreparedStatement stmt = conn.prepareStatement(
    "SELECT * FROM users WHERE id = ?");
stmt.setString(1, safeId);
stmt.executeQuery();

But WAVSEP's most underappreciated feature is its focus on false positive measurement. While most vulnerable apps only test if scanners can find vulnerabilities, WAVSEP dedicates entire sections to pages that look vulnerable but aren't. This tests whether scanners actually parse application logic or just pattern-match on suspicious code. A scanner that flags the parameterized query example above as vulnerable reveals a fundamental inability to understand basic security controls—yet WAVSEP benchmarks exposed that many expensive enterprise tools did exactly this.

The test case naming convention follows a strict pattern: the vulnerability type, the input vector, the context, and whether it's a true positive (vulnerable) or false positive (safe) case. For example: XSS-Reflected-GET-AttributeContext-SingleQuote-Vulnerable.jsp versus XSS-Reflected-GET-AttributeContext-SingleQuote-Safe.jsp. This systematic naming allows automated result parsing and statistical analysis across thousands of test cases.
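The naming pattern described above lends itself to mechanical parsing. The following is a minimal sketch, assuming file names follow the hyphen-separated form of the example (vulnerability class and subtype, input vector, one or more context segments, then a Vulnerable or Safe suffix). The class name TestCaseName and its fields are hypothetical and are not part of WAVSEP itself.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical parser for names shaped like the example above; not part of WAVSEP. */
final class TestCaseName {
    final String vulnerabilityType;   // e.g. "XSS-Reflected"
    final String inputVector;         // e.g. "GET"
    final List<String> context;       // e.g. [AttributeContext, SingleQuote]
    final boolean expectedVulnerable; // true for "Vulnerable", false for "Safe"

    private TestCaseName(String type, String vector, List<String> context, boolean vulnerable) {
        this.vulnerabilityType = type;
        this.inputVector = vector;
        this.context = Collections.unmodifiableList(context);
        this.expectedVulnerable = vulnerable;
    }

    static TestCaseName parse(String fileName) {
        // Drop the .jsp extension if present, then split on hyphens.
        String base = fileName.endsWith(".jsp")
                ? fileName.substring(0, fileName.length() - ".jsp".length())
                : fileName;
        String[] parts = base.split("-");
        if (parts.length < 4) {
            throw new IllegalArgumentException("Unexpected test case name: " + fileName);
        }
        String verdict = parts[parts.length - 1];
        if (!verdict.equals("Vulnerable") && !verdict.equals("Safe")) {
            throw new IllegalArgumentException("Missing Vulnerable/Safe suffix: " + fileName);
        }
        // Assumed layout: <class>-<subtype>-<vector>-<context...>-<verdict>
        String type = parts[0] + "-" + parts[1];
        List<String> context = Arrays.asList(Arrays.copyOfRange(parts, 3, parts.length - 1));
        return new TestCaseName(type, parts[2], context, verdict.equals("Vulnerable"));
    }

    public static void main(String[] args) {
        TestCaseName tc = parse("XSS-Reflected-GET-AttributeContext-SingleQuote-Vulnerable.jsp");
        System.out.println(tc.vulnerabilityType + " via " + tc.inputVector
                + " in " + tc.context + ", vulnerable=" + tc.expectedVulnerable);
    }
}
```

Rejecting names without a recognized suffix, rather than guessing, keeps mislabeled pages out of the ground truth, where they would silently skew both detection and false positive counts.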

The benchmark methodology that emerged from WAVSEP involved running multiple scanners against the entire test suite, then calculating detection rates (true positives found / total true positives) and false positive rates (safe pages flagged / total safe pages). The results were brutal: some scanners with impressive marketing detected under 30% of vulnerabilities while flagging 40% of safe code as vulnerable. The data exposed a market where vendor claims bore little relationship to actual capability.
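The two ratios above are simple to compute once the ground truth is known. The sketch below assumes three sets of page identifiers: the pages known to be vulnerable, the pages known to be safe, and the pages a scanner flagged. The class ScannerScore and its method names are illustrative assumptions, not a reproduction of the scoring code used in the published benchmarks.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical scorer comparing scanner alerts against known ground truth. */
final class ScannerScore {
    final int truePositives;   // vulnerable pages the scanner flagged
    final int falseNegatives;  // vulnerable pages the scanner missed
    final int falsePositives;  // safe pages the scanner flagged
    final int trueNegatives;   // safe pages the scanner left alone

    private ScannerScore(int tp, int fn, int fp, int tn) {
        this.truePositives = tp;
        this.falseNegatives = fn;
        this.falsePositives = fp;
        this.trueNegatives = tn;
    }

    /** Assumes vulnerablePages and safePages are disjoint. */
    static ScannerScore score(Set<String> vulnerablePages, Set<String> safePages,
                              Set<String> flaggedPages) {
        int tp = 0;
        int fp = 0;
        for (String page : flaggedPages) {
            if (vulnerablePages.contains(page)) {
                tp++;
            } else if (safePages.contains(page)) {
                fp++;
            }
            // Alerts on pages outside the benchmark are ignored.
        }
        return new ScannerScore(tp, vulnerablePages.size() - tp, fp, safePages.size() - fp);
    }

    /** True positives found / total true positives. */
    double detectionRate() {
        int total = truePositives + falseNegatives;
        return total == 0 ? 0.0 : (double) truePositives / total;
    }

    /** Safe pages flagged / total safe pages. */
    double falsePositiveRate() {
        int total = falsePositives + trueNegatives;
        return total == 0 ? 0.0 : (double) falsePositives / total;
    }

    public static void main(String[] args) {
        Set<String> vulnerable = new HashSet<>(Arrays.asList("v1", "v2", "v3", "v4"));
        Set<String> safe = new HashSet<>(Arrays.asList("s1", "s2"));
        Set<String> flagged = new HashSet<>(Arrays.asList("v1", "v2", "s1"));
        ScannerScore s = score(vulnerable, safe, flagged);
        System.out.println("detection=" + s.detectionRate()
                + " falsePositive=" + s.falsePositiveRate());
    }
}
```

Reporting the two rates separately matters: a scanner that flags every page would score a perfect detection rate, and only the false positive rate exposes it.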

Gotcha

WAVSEP's primary limitation is its 2014 vintage—it's a time capsule of mid-2010s web security. Modern vulnerability classes like Server-Side Request Forgery (SSRF), insecure deserialization, JWT attacks, and framework-specific issues in React, Angular, or Vue.js are completely absent. If you're evaluating scanners for contemporary applications built with modern JavaScript frameworks, GraphQL APIs, or serverless architectures, WAVSEP's test coverage has significant gaps.

The project's minimal documentation also creates friction. Most of the context lives in archived blog posts from the now-defunct Google Code site and scattered conference presentations. Setting up the environment requires Java 7 or 8 (newer versions may have compatibility issues), specific Tomcat configurations, and manual database setup; there is no Docker container or automated deployment script. The repository is essentially frozen, with no issues being triaged and no pull requests being merged. Organizations that want to extend the test suite with modern vulnerability types are effectively forking and maintaining their own version. The 241 GitHub stars tell the story: this is a niche tool used primarily by security researchers and tool vendors, not a thriving open-source community.

Verdict

Use WAVSEP if you're conducting formal evaluations of web application security scanners and need standardized, reproducible test cases with known ground truth for both vulnerabilities and false positives. It's invaluable for security tool vendors validating their detection logic, enterprise teams comparing scanners before purchase, or academic researchers studying scanner effectiveness. The structured taxonomy and comprehensive coverage of classic vulnerability variants (XSS, SQLi, command injection, path traversal) remain unmatched for these use cases. Skip it if you need a platform for security training (WebGoat or Juice Shop are better), want to test modern vulnerability classes introduced after 2014, or lack the Java/Tomcat expertise to deploy and maintain it. Also skip it if you're evaluating scanners exclusively for modern single-page applications or API-first architectures, because WAVSEP's traditional server-rendered JSP architecture won't reflect your actual attack surface. For most developers, WAVSEP is a historical artifact worth understanding but not actively deploying unless scanner benchmarking is specifically your mission.
