Recog: The XML Fingerprint Database Powering Metasploit's Service Detection

Hook

When Metasploit scans a network and tells you exactly which version of OpenSSH is running on port 22, it is relying on Recog, a framework that identifies services through 773+ community-maintained regex patterns stored in XML files. You can use the same database in your own tools.

Context

Network security tools face a fundamental challenge: when you connect to a service on port 22 and receive "SSH-2.0-OpenSSH_7.4", how do you parse that banner into structured data like vendor (OpenBSD), product (OpenSSH), and version (7.4)? Multiply this problem across dozens of protocols—HTTP headers, FTP banners, SMTP greetings, SNMP responses—and you need thousands of patterns to identify what's actually running on discovered services.

Before centralized fingerprint databases, each security tool maintained its own hardcoded patterns, leading to duplicated effort and inconsistent results. Recog emerged from Rapid7's need to share fingerprint knowledge across their security product line (Metasploit, Nexpose, and others). Rather than embedding patterns in code, they externalized them into XML files with a standardized format. This architectural decision transformed fingerprints from proprietary logic into shareable, testable, version-controlled data that any tool in any language could consume.

Technical Insight

Recog's architecture separates data from matching logic: fingerprint definitions live in XML files organized by protocol, while language-specific implementations (Ruby, Java, Go) provide the matching engine. Each XML file targets a specific data source: ssh_banners.xml for SSH protocol banners, http_servers.xml for HTTP Server headers, ftp_banners.xml for FTP welcome messages.

Here's a fingerprint from ssh_banners.xml that demonstrates the pattern structure:

<fingerprint pattern="^SSH-([\d.]+)-OpenSSH[_-]([\d.p]+)[_-]?(.*)?$">
  <description>OpenSSH</description>
  <example service.version="7.4">SSH-2.0-OpenSSH_7.4</example>
  <example service.version="6.6.1p1" os.version="Ubuntu-2ubuntu2">SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2</example>
  <param pos="0" name="service.vendor" value="OpenBSD"/>
  <param pos="0" name="service.family" value="OpenSSH"/>
  <param pos="0" name="service.product" value="OpenSSH"/>
  <param pos="1" name="ssh.protocol"/>
  <param pos="2" name="service.version"/>
  <param pos="0" name="service.cpe23" value="cpe:/a:openbsd:openssh:{service.version}"/>
</fingerprint>

The pattern attribute contains a regex with capture groups. The param elements define how to map those groups into structured fields. When pos="0", it's a static value. When pos="2", it references capture group 2 from the regex. The curly brace syntax {service.version} enables variable interpolation—capture group 2 becomes the version field, which then gets embedded into the CPE (Common Platform Enumeration) string.
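To make the mapping concrete, here is a minimal sketch in plain Ruby that mimics the behavior described above without using the Recog gem. The hash layout and the `apply_fingerprint` helper are hypothetical names invented for illustration, and the sketch assumes that static and interpolated values use position 0 while positive positions refer to capture groups.

```ruby
# Hypothetical, simplified model of one fingerprint; not the Recog gem's API.
openssh = {
  pattern: /^SSH-([\d.]+)-OpenSSH[_-]([\d.p]+)[_-]?(.*)?$/,
  params: [
    # [pos, name, static value (nil for capture-group params)]
    [0, 'service.vendor',  'OpenBSD'],
    [0, 'service.product', 'OpenSSH'],
    [1, 'ssh.protocol',    nil],
    [2, 'service.version', nil],
    [0, 'service.cpe23',   'cpe:/a:openbsd:openssh:{service.version}']
  ]
}

def apply_fingerprint(fingerprint, banner)
  md = fingerprint[:pattern].match(banner)
  return nil unless md

  result = {}
  # Pass 1: positional params copy a capture group into a named field.
  fingerprint[:params].each do |pos, name, _value|
    result[name] = md[pos] if pos > 0
  end
  # Pass 2: pos 0 params are static; {field} placeholders are filled from pass 1.
  fingerprint[:params].each do |pos, name, value|
    next unless pos.zero?

    result[name] = value.gsub(/\{([^}]+)\}/) { result.fetch(Regexp.last_match(1), '') }
  end
  result
end

fields = apply_fingerprint(openssh, 'SSH-2.0-OpenSSH_7.4')
puts fields['service.version'] # prints 7.4
puts fields['service.cpe23']   # prints cpe:/a:openbsd:openssh:7.4
```

Running the capture-group pass first means an interpolated value can reference any extracted field regardless of the order in which the params are listed.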

The Ruby implementation loads these XML files and provides a matching API:

require 'recog'

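# Load SSH banner fingerprints
db = Recog::DB.new('xml/ssh_banners.xml')

# Try each fingerprint in file order and stop at the first hit.
# Fingerprint#match returns a hash of fields, or nil when the pattern
# does not match. The API changes between releases, so check your version.
banner = 'SSH-2.0-OpenSSH_7.4'
match = nil
db.fingerprints.each do |fingerprint|
  match = fingerprint.match(banner)
  break if match
end

if match
  puts match['service.vendor']   # => "OpenBSD"
  puts match['service.product']  # => "OpenSSH"
  puts match['service.version']  # => "7.4"
  puts match['service.cpe23']    # => "cpe:/a:openbsd:openssh:7.4"
end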

What makes this architecture powerful is the testing framework. Each fingerprint includes <example> tags with real-world data and expected field values. The test suite validates every pattern:

# Run tests for a specific fingerprint file
recog_verify xml/ssh_banners.xml

# Illustrative output after a pattern change breaks one example
# (the exact format depends on the Recog version)
FAIL: 'SSH-2.0-OpenSSH_7.4' failed to match
PASS: 'SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2' matched with all attributes

This test-driven approach prevents regex regressions. When contributors add fingerprints, they must include examples. When patterns break existing examples, tests fail. The XML format also supports <example _encoding="base64"> for binary protocols and <example _filename="data/protocols/ftp.bin"> for large payloads stored externally.
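The idea behind example-driven verification can be sketched in a few lines of plain Ruby. This is not how recog_verify is implemented. The hash layout and the `verify_examples` helper are hypothetical, and the base64 entry is only an assumed stand-in for an example carrying `_encoding="base64"`.

```ruby
# Hypothetical stand-in for a parsed fingerprint: a pattern, the capture-group
# position of each field, and its examples with their expected attributes.
fingerprint = {
  pattern: /^SSH-([\d.]+)-OpenSSH[_-]([\d.p]+)/,
  fields: { 'ssh.protocol' => 1, 'service.version' => 2 },
  examples: [
    { text: 'SSH-2.0-OpenSSH_7.4', expected: { 'service.version' => '7.4' } },
    # Stored base64-encoded; pack('m0') is core Ruby's strict base64 encoder.
    { text: ['SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2'].pack('m0'), encoding: 'base64',
      expected: { 'service.version' => '6.6.1p1' } }
  ]
}

# Returns one message per example that fails to match or whose extracted
# fields differ from the expected attributes. An empty array means success.
def verify_examples(fingerprint)
  fingerprint[:examples].each_with_object([]) do |example, failures|
    # unpack1('m') decodes base64 without requiring any library.
    text = example[:encoding] == 'base64' ? example[:text].unpack1('m') : example[:text]
    md = fingerprint[:pattern].match(text)
    if md.nil?
      failures << "#{text.inspect} failed to match"
      next
    end
    example[:expected].each do |name, want|
      got = md[fingerprint[:fields].fetch(name)]
      failures << "#{text.inspect}: #{name} is #{got.inspect}, expected #{want.inspect}" unless got == want
    end
  end
end

failures = verify_examples(fingerprint)
puts failures.empty? ? 'all examples pass' : failures
```

Because the examples travel with the pattern, tightening a regex so that it no longer matches a documented banner shows up as a failure message rather than as a silent regression.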

The multi-language strategy works because XML is language-agnostic. Java and Go implementations include the XML database as a git submodule, so all implementations draw on the same fingerprint definitions. When someone adds a new HTTP Server pattern to http_servers.xml, every implementation picks it up as soon as it updates its submodule reference.

Recog also handles ambiguity through fingerprint ordering. The XML files are evaluated sequentially, and the first match wins. More specific patterns appear earlier, with generic fallbacks at the end. This prevents "Apache" from matching before "Apache Tomcat/8.5.23" gets a chance.
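A short sketch shows why ordering matters under first-match-wins evaluation. The two-entry list and the `first_match` helper below are hypothetical and stand in for one XML file's fingerprints. They are not taken from the Recog database.

```ruby
# Hypothetical ordered list: the specific pattern precedes the generic fallback.
fingerprints = [
  { description: 'Apache Tomcat', pattern: %r{^Apache Tomcat/([\d.]+)$} },
  { description: 'Apache (generic fallback)', pattern: /^Apache/ }
]

# Returns the description of the first fingerprint whose pattern matches,
# or nil when nothing matches.
def first_match(fingerprints, header)
  hit = fingerprints.find { |fp| fp[:pattern].match?(header) }
  hit && hit[:description]
end

puts first_match(fingerprints, 'Apache Tomcat/8.5.23')         # prints Apache Tomcat
puts first_match(fingerprints.reverse, 'Apache Tomcat/8.5.23') # prints Apache (generic fallback)
```

Reversing the list lets the generic pattern shadow the specific one, which is the failure mode that careful ordering in the XML files avoids.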

Gotcha

The documentation explicitly warns that the Ruby library is "fairly new and subject to change quickly" and recommends consulting the maintainers before production use. This isn't false modesty—the API has evolved significantly, and upgrades can break integration code. The Java and Go implementations have more stable APIs, but this creates a documentation gap where Ruby examples don't translate directly to other languages.

Regex-based fingerprinting inherits all the brittleness of pattern matching. When vendors change banner formats (Microsoft loves doing this), fingerprints break until someone contributes an update. If you're identifying a newly released product, it won't be in the database yet. The framework also struggles with intentionally obfuscated banners—services that randomize version strings or remove identifying information won't match. Unlike machine learning approaches, Recog can't generalize from similar patterns; it only knows what's explicitly defined in the XML files. For rapidly evolving services like modern web frameworks that release weekly, the fingerprint database lags behind reality.

Verdict

Use if: You're building security scanning tools that need battle-tested service identification across multiple protocols, you want a fingerprint database backed by Rapid7's production security products, you need multi-language support with synchronized patterns, or you're extending existing tools like Metasploit that already integrate Recog. The XML fingerprint database itself is production-grade and actively maintained by the security community.

Skip if: You need a stable programmatic API for Ruby (consult maintainers first), you're identifying rapidly changing web technologies (Wappalyzer is better), you want passive fingerprinting without active probes (use p0f), or you need machine learning-based identification that adapts to new patterns automatically.

For quick command-line identification, the included tools work well, but integrating the library requires understanding its maturity level in your target language.
