PMD: Building a Multi-Language Static Analyzer on AST Parsing and Rule Engines

Hook

While most linters hard-code their rules in the language they analyze, PMD lets you write bug detectors in either Java or XPath—treating your codebase as queryable data structures rather than text.

Context

Static analysis tools emerged from a simple observation: many bugs follow predictable patterns. In the early 2000s, tools like FindBugs and Checkstyle gained traction in the Java ecosystem, but they solved different problems—FindBugs analyzed bytecode for runtime bugs, while Checkstyle focused on formatting conventions. PMD entered this landscape in 2002 with a different philosophy: treat source code as structured data that can be queried and analyzed through abstract syntax trees, not regex patterns on text.

The insight was powerful. Rather than writing brittle string-matching logic, developers could traverse the actual structure of their code, the same representation compilers use. This approach enabled sophisticated checks like "find all methods that catch Exception but don't log it" or "detect database queries inside loops." Over two decades, PMD evolved from a Java-only tool into a polyglot analyzer supporting 16+ languages, from Apex to Swift, built on an extensible architecture that lets teams codify their own coding standards.

Technical Insight

[Figure: System architecture (auto-generated). Source files (Java, Apex, JS, etc.) enter the parsing layer, where grammar-based JavaCC and ANTLR parsers produce language ASTs. The rule engine traverses those trees with XPath rules (pattern matching) and Java rules (complex logic, visitor pattern) and emits violation reports, which the renderers output as XML, JSON, or HTML. Separately, source files are tokenized for the CPD engine, which finds token-based duplicates and produces a duplication report.]

PMD's architecture centers on a three-stage pipeline: parsing source code into abstract syntax trees, executing rules against those ASTs, and rendering violations in various formats. The parsing stage uses language-specific parsers (JavaCC grammars for Java and PL/SQL, ANTLR grammars for languages like Swift and Kotlin, and third-party parsers for others such as Apex) to transform source files into tree structures where each node represents a syntactic element (classes, methods, expressions).

What makes PMD distinctive is its dual rule authoring model. You can write rules in Java for complex logic, or use XPath queries for simpler pattern matching. Here's a practical example of an XPath rule that detects empty catch blocks:

<rule name="EmptyCatchBlock"
      language="java"
      message="Avoid empty catch blocks"
      class="net.sourceforge.pmd.lang.rule.xpath.XPathRule">
    <description>Empty catch blocks hide exceptions</description>
    <priority>3</priority>
    <properties>
        <property name="xpath">
            <value>
<![CDATA[
//CatchClause[Block[count(*) = 0]]
]]>
            </value>
        </property>
    </properties>
</rule>

This XPath expression navigates the AST looking for CatchClause nodes whose Block child has no child nodes. The names shown are the PMD 7 ones; PMD 6 called the node CatchStatement and kept the rule class at net.sourceforge.pmd.lang.rule.XPathRule, so match the names to the PMD version you run. The rule is declarative and concise, requiring no Java compilation. For more complex scenarios, Java-based rules give you programmatic control:

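import net.sourceforge.pmd.lang.java.ast.ASTForStatement;
import net.sourceforge.pmd.lang.java.ast.ASTMethodCall;
import net.sourceforge.pmd.lang.java.rule.AbstractJavaRule;

// Written against the PMD 7 Java API (ASTMethodCall does not exist in PMD 6).
public class AvoidDatabaseCallsInLoopsRule extends AbstractJavaRule {
    @Override
    public Object visit(ASTForStatement node, Object data) {
        // Walk every method call nested anywhere inside this for loop
        for (ASTMethodCall call : node.descendants(ASTMethodCall.class)) {
            // Report only from the innermost enclosing for loop, so nested
            // loops do not flag the same call twice
            boolean innermost = call.ancestors(ASTForStatement.class).first() == node;
            if (innermost && isDatabaseCall(call)) {
                asCtx(data).addViolationWithMessage(call, "Avoid database calls in loops");
            }
        }
        return super.visit(node, data); // keep visiting nested statements
    }

    private boolean isDatabaseCall(ASTMethodCall call) {
        // Name-based heuristic only: matches JDBC/ORM-style method names
        // without checking the receiver's type
        return call.getMethodName().matches("executeQuery|executeUpdate|save|persist");
    }
}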

This Java rule demonstrates AST traversal—visiting ForStatement nodes, searching descendants for method calls, and applying custom logic to determine violations. The tradeoff is clear: XPath rules are faster to write but limited to structural patterns, while Java rules require compilation but support arbitrary complexity.
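To make the "arbitrary complexity" point concrete without depending on PMD itself, here is a minimal sketch of a depth-first traversal over a toy tree. The Node class, the node kinds ("Method", "ForLoop", "Call"), and callsInLoops are all hypothetical names invented for this illustration; none of them are PMD types. The walk carries the current loop depth as state, which is the kind of bookkeeping that is natural in Java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy AST and traversal for illustration; none of these types are PMD classes.
class LoopCallSketch {

    static final class Node {
        final String kind;          // e.g. "Method", "ForLoop", "Call"
        final String name;          // method name for "Call" nodes, otherwise ""
        final List<Node> children;

        Node(String kind, String name, Node... children) {
            this.kind = kind;
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Depth-first walk that carries the current loop nesting depth as state.
    private static void collect(Node node, int loopDepth, List<String> out) {
        int depth = node.kind.equals("ForLoop") ? loopDepth + 1 : loopDepth;
        boolean suspicious = node.name.matches("executeQuery|executeUpdate|save|persist");
        if (node.kind.equals("Call") && depth > 0 && suspicious) {
            out.add(node.name);
        }
        for (Node child : node.children) {
            collect(child, depth, out);
        }
    }

    // Returns the names of database-looking calls that sit inside at least one loop.
    static List<String> callsInLoops(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, 0, out);
        return out;
    }
}
```

Because each node is visited exactly once, a call nested inside two loops is still reported a single time, and a call outside any loop is ignored.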

PMD's architecture separates concerns through interfaces. The RuleViolation interface abstracts how violations are reported, allowing output renderers (XML, JSON, HTML) to operate independently of rule execution. Similarly, the Language interface defines how PMD handles each programming language—specifying its parser, version handling, and AST node types. This design enabled PMD to grow from 1 language to 16+ without rewriting the core engine.
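The separation described above can be sketched in a few lines of plain Java. This is an illustration of the design idea only: Violation, Renderer, TextRenderer, and XmlRenderer are hypothetical stand-ins, and the output shapes are made up rather than PMD's real report formats.

```java
import java.util.List;

// Illustration of decoupling rule output from rendering; not PMD's actual interfaces.
class RendererSketch {

    // What a rule produces: plain data, with no knowledge of output formats.
    static final class Violation {
        final String file;
        final int line;
        final String rule;
        final String message;

        Violation(String file, int line, String rule, String message) {
            this.file = file;
            this.line = line;
            this.rule = rule;
            this.message = message;
        }
    }

    // What an output format implements: violations in, text out.
    interface Renderer {
        String render(List<Violation> violations);
    }

    static final class TextRenderer implements Renderer {
        @Override
        public String render(List<Violation> violations) {
            StringBuilder sb = new StringBuilder();
            for (Violation v : violations) {
                sb.append(v.file).append(':').append(v.line).append(": ")
                  .append(v.rule).append(": ").append(v.message).append('\n');
            }
            return sb.toString();
        }
    }

    static final class XmlRenderer implements Renderer {
        // Escapes the characters that are unsafe in XML text and attribute values.
        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;")
                    .replace(">", "&gt;").replace("\"", "&quot;");
        }

        @Override
        public String render(List<Violation> violations) {
            StringBuilder sb = new StringBuilder("<violations>\n");
            for (Violation v : violations) {
                sb.append("  <violation file=\"").append(escape(v.file))
                  .append("\" line=\"").append(v.line)
                  .append("\" rule=\"").append(escape(v.rule)).append("\">")
                  .append(escape(v.message)).append("</violation>\n");
            }
            return sb.append("</violations>\n").toString();
        }
    }
}
```

Adding another format means adding another Renderer implementation; nothing on the rule side has to change.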

The Copy-Paste Detector (CPD) component uses a different algorithm entirely. Instead of AST-based analysis, CPD tokenizes source code and applies the Rabin-Karp string matching algorithm to find duplicated token sequences. It's language-agnostic at its core—as long as PMD has a tokenizer for your language, CPD can detect duplicates. This explains why CPD supports 30+ languages while PMD's rule engine covers only 16: tokenization is simpler than full AST parsing.
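For readers who have not met Rabin-Karp before, the following is a minimal sketch of the idea applied to tokens instead of characters. It is not CPD's code: the class name, the BASE and MOD constants, and the fixed window length are assumptions chosen for illustration. A rolling hash is updated in constant time as the window slides by one token, and equal hashes are confirmed token by token because different windows can collide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative Rabin-Karp over token windows; not CPD's actual implementation.
class TokenDuplicateSketch {
    private static final long BASE = 257L;
    private static final long MOD = 1_000_000_007L;

    // Maps a token to a non-negative number below MOD for hashing.
    private static long code(String token) {
        return (token.hashCode() & 0x7fffffffL) % MOD;
    }

    // True if the two windows of length `window` hold the same tokens.
    private static boolean sameWindow(String[] tokens, int a, int b, int window) {
        for (int i = 0; i < window; i++) {
            if (!tokens[a + i].equals(tokens[b + i])) return false;
        }
        return true;
    }

    // Returns {earlierStart, laterStart} for every pair of identical token windows.
    static List<int[]> findDuplicates(String[] tokens, int window) {
        List<int[]> pairs = new ArrayList<>();
        if (window <= 0 || tokens.length < window) return pairs;

        long highPower = 1; // BASE^(window - 1) mod MOD: weight of the oldest token
        for (int i = 1; i < window; i++) highPower = highPower * BASE % MOD;

        Map<Long, List<Integer>> startsByHash = new HashMap<>();
        long hash = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (i >= window) {
                // Slide the window: drop the token that just left it
                hash = (hash - code(tokens[i - window]) * highPower % MOD + MOD) % MOD;
            }
            hash = (hash * BASE + code(tokens[i])) % MOD;
            if (i < window - 1) continue; // first full window not reached yet

            int start = i - window + 1;
            List<Integer> candidates = startsByHash.computeIfAbsent(hash, k -> new ArrayList<>());
            for (int earlier : candidates) {
                // Equal hashes can collide, so confirm token by token
                if (sameWindow(tokens, earlier, start, window)) {
                    pairs.add(new int[] {earlier, start});
                }
            }
            candidates.add(start);
        }
        return pairs;
    }
}
```

This sketch reports every matching pair of windows, including overlapping ones; a production detector would also merge adjacent matches into longer duplicated regions.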

Integration points reveal PMD's maturity. The Maven plugin exposes configuration through XML:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-pmd-plugin</artifactId>
    <version>3.21.0</version>
    <configuration>
        <rulesets>
            <ruleset>/custom-rules.xml</ruleset>
        </rulesets>
        <failOnViolation>true</failOnViolation>
        <printFailingErrors>true</printFailingErrors>
    </configuration>
</plugin>

This declarative approach works because PMD's rule engine is designed as a library, not just a CLI tool. Each plugin release bundles a specific PMD version, so custom rules must use the node names and classes of that version. The same architecture powers IDE integrations: Eclipse and IntelliJ plugins run PMD's analysis engine inside the editor and surface violations alongside your code.

Gotcha

PMD's comprehensive approach comes with performance costs. Because it builds a complete abstract syntax tree for every analyzed file, large codebases can take minutes to analyze. A 500,000-line Java project might require 3-5 minutes for a full PMD run; treat that figure, and any comparison with bytecode analyzers like SpotBugs, as a rough estimate that varies with rule set and hardware. The AST parsing overhead is unavoidable, since it is the foundation of PMD's power, but it makes incremental analysis critical in CI/CD pipelines. PMD ships an analysis cache for this purpose, and teams often go further by analyzing only changed files or running different rule sets based on context (strict rules in PR validation, lighter checks in pre-commit hooks).
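The core of "analyze only what changed" is small enough to sketch. The version below is an assumption-laden illustration, not PMD's cache: IncrementalSketch and filesToAnalyze are hypothetical names, sources are passed as in-memory strings to keep the example self-contained, and the cache is just a map from path to content digest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative change detection by content hash; not PMD's analysis cache.
class IncrementalSketch {

    // Hex-encoded SHA-256 of the given source text.
    static String sha256(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every JVM", e);
        }
    }

    // Returns the paths (sorted) whose content differs from the cached digest,
    // and records the new digests in the cache for the next run.
    static List<String> filesToAnalyze(Map<String, String> sources, Map<String, String> cache) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> entry : new TreeMap<>(sources).entrySet()) {
            String digest = sha256(entry.getValue());
            if (!digest.equals(cache.get(entry.getKey()))) {
                changed.add(entry.getKey());
                cache.put(entry.getKey(), digest);
            }
        }
        return changed;
    }
}
```

A real cache also has to be discarded when the rule set or the analyzer version changes, since unchanged files can produce different results under different rules.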

The multi-language promise has asterisks. While PMD supports 16+ languages, rule coverage varies dramatically. Java has hundreds of built-in rules covering everything from security to performance, while languages like Scala have parser support but almost no built-in rules. You'll need to write custom rules for less-popular languages, which requires understanding both the language's AST structure and either XPath or Java rule development. The XPath approach, while powerful, has a steep learning curve: you're essentially querying tree structures with a syntax designed for XML documents. Debugging XPath expressions against ASTs often involves trial-and-error with PMD's designer tool, and complex queries can become unreadable. Documentation for AST node types is scattered, requiring source code inspection to understand what properties and children each node exposes.
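One way to build intuition for XPath over trees is to practice on a tree that really is XML. The sketch below uses only the JDK's built-in XPath 1.0 engine against a hand-written document; the element names are illustrative stand-ins for AST nodes and may not match the PMD version you run, and XPathOverTreeSketch and countMatches are hypothetical names. PMD evaluates XPath against its own node objects, not against serialized XML like this.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Toy illustration: a hand-written XML stand-in for an AST, queried with the JDK's XPath engine.
class XPathOverTreeSketch {

    // Hypothetical, heavily simplified tree for a method with two catch clauses.
    static final String TREE =
          "<MethodDeclaration Name='load'>"
        +   "<TryStatement>"
        +     "<Block><ExpressionStatement/></Block>"
        +     "<CatchClause Line='4'><Block/></CatchClause>"
        +     "<CatchClause Line='6'><Block><ExpressionStatement/></Block></CatchClause>"
        +   "</TryStatement>"
        + "</MethodDeclaration>";

    // Counts the nodes that the XPath expression selects in the given XML text.
    static int countMatches(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Selects catch clauses whose block has no child elements
        System.out.println(countMatches(TREE, "//CatchClause[Block[count(*) = 0]]"));
    }
}
```

Changing the predicate and rerunning is a cheap way to learn how axes and predicates behave before moving to PMD's designer tool and its real node names.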

Verdict

Use PMD if you're working with Java, Apex, or PL/SQL codebases where you need extensive built-in rules and the ability to create custom organizational standards. It excels when you need both static analysis and copy-paste detection in one tool, or when your architecture involves multiple JVM languages that PMD supports well. The extensibility shines for teams with specific domain requirements: financial services enforcing transaction safety rules, or healthcare applications requiring HIPAA-related checks.

Skip PMD if you need fast analysis for large monorepos in CI/CD where every minute counts, or if you're working primarily with languages outside its core competency (JavaScript, Python, Go all have better-optimized native tools). Also skip it if you want quick wins without investment: PMD's power requires learning its rule model, and simpler linters like Checkstyle might suffice for basic code style enforcement.

For polyglot teams, consider whether SonarQube's broader language support justifies its complexity, or whether language-specific tools (ESLint, Pylint) provide better ergonomics for your stack.
