PMD: Building Multi-Language Static Analysis with AST Traversal and XPath Rules
Hook
Most static analyzers lock you into one language. PMD parses 16+ languages into abstract syntax trees and lets you write detection rules in either Java or XPath—turning code quality checks into queryable data structures.
Context
Static analysis tools have traditionally been language-specific islands. If you work with Java, you use Checkstyle or SpotBugs. JavaScript gets ESLint. Python has pylint. This fragmentation creates operational overhead: different tools for different languages, each with unique configuration formats, CI/CD integrations, and rule customization approaches. Enterprise teams working with polyglot codebases—think Java backends with Apex on Salesforce, Swift mobile clients, and PL/SQL databases—end up managing a patchwork of analyzers.
PMD provides a unified framework that parses multiple languages into abstract syntax trees and runs analysis rules against those trees. The key architectural insight: if you can represent different languages as traversable tree structures, you can apply similar analysis patterns across all of them. PMD supports 16+ languages for static analysis, with Java and Apex getting the most attention, and 30+ languages for its copy-paste detector (CPD). It ships over 400 built-in rules and has an active open-source community (5,351 GitHub stars at the time of writing).
Technical Insight
PMD’s architecture centers on separating parsing from analysis. For each supported language, PMD uses either JavaCC or ANTLR to generate parsers that transform source code into abstract syntax trees. Once you have an AST, the language-specific details fade away—you’re working with nodes, children, and attributes. This abstraction enables the dual rule engine, PMD’s most distinctive feature.
The first rule engine runs Java-based rules. You write a class that extends a base rule type and implements visitor methods for specific AST node types. This visitor pattern gives you full programmatic control—access to type resolution, symbol tables, and complex logic that spans multiple AST nodes. You can implement sophisticated checks like detecting circular dependencies or validating architectural layer boundaries.
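To make the visitor idea concrete, here is a minimal, self-contained sketch. It does not use PMD's real classes: ToyAst, its node types, and EmptyCatchRule are hypothetical names invented for illustration, and a real PMD rule would extend one of PMD's own base rule classes and visit PMD's own AST node types instead.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are hypothetical and are NOT PMD's real API.
class ToyAst {

    /** One visit method per node type; the defaults simply keep walking the tree. */
    interface Visitor {
        default void visit(MethodNode node) { node.visitChildren(this); }
        default void visit(CatchNode node) { node.visitChildren(this); }
        default void visit(BlockNode node) { node.visitChildren(this); }
    }

    static abstract class Node {
        final List<Node> children = new ArrayList<>();
        Node(Node... kids) { for (Node k : kids) children.add(k); }
        abstract void accept(Visitor v);   // double-dispatch entry point
        void visitChildren(Visitor v) { for (Node c : children) c.accept(v); }
    }

    static final class MethodNode extends Node {
        final String name;
        MethodNode(String name, Node... kids) { super(kids); this.name = name; }
        void accept(Visitor v) { v.visit(this); }
    }

    static final class CatchNode extends Node {
        final String exceptionType;
        CatchNode(String exceptionType, Node... kids) { super(kids); this.exceptionType = exceptionType; }
        void accept(Visitor v) { v.visit(this); }
    }

    static final class BlockNode extends Node {
        BlockNode(Node... kids) { super(kids); }
        void accept(Visitor v) { v.visit(this); }
    }

    /** A "rule": reports catch clauses whose body block has no children. */
    static final class EmptyCatchRule implements Visitor {
        final List<String> violations = new ArrayList<>();
        private String currentMethod = "<none>";

        @Override public void visit(MethodNode node) {
            currentMethod = node.name;     // state carried across nodes
            node.visitChildren(this);
        }

        @Override public void visit(CatchNode node) {
            for (Node child : node.children) {
                if (child instanceof BlockNode && child.children.isEmpty()) {
                    violations.add("Empty catch of " + node.exceptionType + " in " + currentMethod);
                }
            }
            node.visitChildren(this);
        }
    }
}
```

The rule only overrides the visit methods it cares about and keeps state (the current method name) between nodes, which is the kind of multi-node logic that is awkward to express in a single XPath query.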
The second rule engine uses XPath queries, treating the AST as an XML-like document. This declarative approach dramatically lowers the barrier for custom rules. You write an XPath expression, drop it in a ruleset XML file, and PMD executes it during analysis. This makes iterative rule development fast—edit the XPath, rerun PMD, see results. For teams that need to enforce organization-specific conventions (like ensuring database calls happen in service layer classes), XPath rules offer a low-friction path from policy to enforcement.
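The idea of querying a tree with XPath can be tried without PMD at all. The sketch below uses the JDK's built-in XPath 1.0 engine on a hand-written XML string that stands in for an AST. The element and attribute names (ClassDeclaration, MethodCall, SimpleName, MethodName) and the "name contains Service" convention are assumptions made for this example, not a dump of PMD's actual node model.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathOverToyAst {

    // Hand-written XML standing in for an AST; all names here are illustrative.
    static final String TOY_AST =
        "<CompilationUnit>"
      + "<ClassDeclaration SimpleName='OrderController'>"
      + "<MethodDeclaration Name='list'><MethodCall MethodName='executeQuery'/></MethodDeclaration>"
      + "</ClassDeclaration>"
      + "<ClassDeclaration SimpleName='OrderService'>"
      + "<MethodDeclaration Name='find'><MethodCall MethodName='executeQuery'/></MethodDeclaration>"
      + "</ClassDeclaration>"
      + "</CompilationUnit>";

    // Hypothetical policy: executeQuery calls are only allowed in classes
    // whose name contains "Service".
    static final String RULE =
        "//ClassDeclaration[not(contains(@SimpleName, 'Service'))]"
      + "//MethodCall[@MethodName = 'executeQuery']";

    /** Evaluates the XPath against the toy AST and returns "Class.method" for each hit. */
    static List<String> violations(String astXml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(astXml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Element call = (Element) hits.item(i);
                Element method = (Element) call.getParentNode();
                Element cls = (Element) method.getParentNode();
                out.add(cls.getAttribute("SimpleName") + "." + method.getAttribute("Name"));
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath rule", e);
        }
    }
}
```

Calling XPathOverToyAst.violations(XPathOverToyAst.TOY_AST, XPathOverToyAst.RULE) flags the call in OrderController.list and leaves OrderService.find alone. Changing the policy means editing one string, which is the fast edit-and-rerun loop the XPath engine is meant to give you.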
The CPD component takes a different approach entirely. Rather than building ASTs, it tokenizes source code and looks for duplicate token sequences. This is computationally cheaper than comparing ASTs and works across languages without needing full parsers. As a result, CPD can detect copy-paste violations in 30+ languages, including ones such as Dart, Go, and TypeScript for which PMD doesn’t have comprehensive rulesets yet.
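The token-based idea can be sketched in a few lines. This is a naive sliding-window comparison written for illustration, not CPD's actual algorithm or API: ToyDuplicateFinder, its regex tokenizer, and the minTokens parameter are all hypothetical, and a real tool would use language-aware lexers and a far more efficient matching strategy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive illustration of token-based duplicate detection; not CPD's implementation.
class ToyDuplicateFinder {

    // Crude tokenizer: a run of word characters is one token,
    // any other non-whitespace character is a token on its own.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\s\\w]");

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    /**
     * Slides a window of minTokens tokens over the stream and returns
     * {firstIndex, laterIndex} pairs where the same window occurs again.
     */
    static List<int[]> findDuplicates(List<String> tokens, int minTokens) {
        Map<String, Integer> firstSeen = new HashMap<>();
        List<int[]> duplicates = new ArrayList<>();
        for (int i = 0; i + minTokens <= tokens.size(); i++) {
            // Tokens never contain whitespace, so a space is a safe separator.
            String window = String.join(" ", tokens.subList(i, i + minTokens));
            Integer earlier = firstSeen.putIfAbsent(window, i);
            if (earlier != null) duplicates.add(new int[] {earlier, i});
        }
        return duplicates;
    }
}
```

Because the comparison happens on tokens, differences in whitespace and line breaks disappear before matching, and nothing in the matcher needs to know the grammar of the language being scanned.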
Integration options include plugins for Maven and Gradle as well as for various IDEs. PMD also provides a Docker image (pmdcode/pmd) for consistent CI/CD execution across different environments. The project maintains reproducible builds, which are verified through the Reproducible Builds project—this ensures that published binaries can be independently verified against the source code.
Gotcha
PMD’s biggest limitation appears in the README’s language support table: Scala is listed as supported, but there are currently no Scala rules available. The parser exists, the AST infrastructure works, but you get zero out-of-the-box rules. If you’re analyzing Scala code, you’d need to write every rule yourself—a significant investment that undermines PMD’s value proposition of comprehensive built-in rules.
Configuration complexity hits hard for newcomers. Understanding what makes a good XPath query requires mental models of AST structure. The documentation shows the tree representations, but there’s cognitive overhead in mapping source code constructs to tree nodes. For Java rules, you need to understand the visitor pattern and PMD’s specific AST node hierarchy. While this investment pays off for teams building custom rules, it’s a steep learning curve compared to ESLint’s simple JSON configuration. The JVM requirement also introduces friction in polyglot environments—if your frontend team primarily uses Node.js, requiring a Java runtime just to run a linter feels heavyweight.
Verdict
Use PMD if you’re working in enterprise Java environments, especially those incorporating Salesforce Apex, or if you maintain polyglot codebases where a single analysis tool across multiple languages provides operational value. The 400+ built-in rules, concentrated on Java and Apex, represent years of accumulated wisdom about common defects, and the dual rule engine (Java + XPath) gives you an escape hatch for organization-specific policies. The CPD component alone justifies adoption if you need multi-language duplicate detection.
Skip PMD if you work exclusively in languages with mature specialized tooling (JavaScript/ESLint, Python/pylint, Go/staticcheck) where those tools offer better ecosystem integration and faster performance. Also skip it if you prioritize configuration simplicity over extensibility: if you just want to enforce a code style guide without writing custom rules, Checkstyle or Prettier will frustrate you less. The JVM requirement and setup complexity make PMD overkill for small projects or teams that don’t need custom rule development.