Trailmark: Building Security-Focused Code Graphs Across 22+ Languages

Hook

Most code analysis tools speak one language fluently. Trailmark speaks 22+, building a unified graph database where you can query "show me all paths from HTTP handlers to SQL queries" across Python, JavaScript, and C in the same codebase.

Context

Security auditors face a persistent problem: modern applications are polyglot by default. Your typical web service might have a Python FastAPI backend calling Rust utilities, a Go microservice handling auth, JavaScript running in the browser, and C libraries doing the heavy cryptographic lifting. Traditional static analysis tools work one language at a time: CodeQL builds a separate database and needs separate queries for each language, Semgrep needs different rules per language, and neither follows a call across a language boundary.

This fragmentation creates blind spots. An attacker doesn't care that your input validation is in TypeScript while your database layer is Python—they care about the path connecting them. Trail of Bits built Trailmark to solve this exact problem: create a language-agnostic graph representation of code where nodes are functions, classes, and modules, edges are relationships like calls and inheritance, and queries can traverse the entire codebase regardless of which languages are involved. The goal isn't to replace deep semantic analysis tools but to provide the 10,000-foot view that security work demands.

Technical Insight

Trailmark's architecture rests on a three-phase pipeline that prioritizes breadth over depth. First, it parses source code using tree-sitter grammars—the same incremental parsing library that powers syntax highlighting in editors like Neovim and Zed. Tree-sitter provides ASTs for 22+ languages with a consistent API, which means Trailmark doesn't need language-specific parsers. This is the crucial trade-off: you get wide language support at the cost of semantic precision.

The second phase extracts nodes and edges from these ASTs and indexes them into a rustworkx graph. Rustworkx is a Rust-based graph library with Python bindings, chosen specifically for performance: traversals run in compiled Rust rather than in the Python interpreter, which is what makes whole-program analysis of multi-million-line codebases practical. Nodes carry metadata like cyclomatic complexity, type annotations (when present), and thrown exceptions. Edges capture calls, inheritance, containment ("this function is inside this class"), and imports.
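To make the node-and-edge idea concrete, here is a minimal sketch of the extraction step. It is not Trailmark's code: it uses Python's built-in ast module as a stand-in for tree-sitter and plain dicts and sets in place of rustworkx, and the names extract_graph, SAMPLE, and BRANCH_NODES are hypothetical. It only records direct calls by bare name and a rough complexity count.

```python
import ast

# Hypothetical sample source to index.
SAMPLE = '''
def load(path):
    return open(path).read()

def handler(request):
    data = load(request)
    if data:
        return render(data)
    return None
'''

# Constructs counted as branches for a rough cyclomatic complexity estimate.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)


def extract_graph(source):
    """Return (nodes, edges): per-function metadata and (caller, callee) pairs."""
    tree = ast.parse(source)
    nodes, edges = {}, set()
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        # Rough cyclomatic complexity: 1 + number of branching constructs.
        branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(func))
        nodes[func.name] = {"kind": "function", "complexity": 1 + branches}
        for n in ast.walk(func):
            # Only direct calls by bare name, e.g. load(x); attribute calls are skipped.
            if isinstance(n, ast.Call) and isinstance(n.func, ast.Name):
                edges.add((func.name, n.func.id))
    return nodes, edges


nodes, edges = extract_graph(SAMPLE)
print(nodes)
print(sorted(edges))
```

Note that the attribute call `.read()` produces no edge here. That is the same kind of gap the article describes: a purely syntactic pass records what it can name and leaves the rest out or marks it as uncertain.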

Here's where it gets interesting: Trailmark tags edges with confidence levels. When it sees a direct function call like authenticate_user(request.body), that edge is marked certain. When it resolves a cross-file import unambiguously, that's also certain. But when it encounters polymorphism, dynamic dispatch, or ambiguous module resolution, it creates an uncertain edge. This explicit uncertainty tracking is rare in static analysis tools and critical for security work—you need to know where your analysis might have gaps.
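One way to picture confidence-tagged edges is the sketch below. The Confidence enum, the Edge dataclass, and path_confidence are assumptions made for illustration, not Trailmark's actual data model. The point is only that if confidence values are ordered, the confidence of a whole path is the minimum over its edges.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    # Lower value = weaker evidence, so min() picks the weakest link.
    UNCERTAIN = 0
    CERTAIN = 1


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: str
    confidence: Confidence


def path_confidence(edges):
    """A path is only as trustworthy as its least certain edge."""
    return min(edge.confidence for edge in edges)


# Hypothetical path: one edge went through dynamic dispatch.
path = [
    Edge("handle_login", "authenticate_user", "calls", Confidence.CERTAIN),
    Edge("authenticate_user", "backend.verify", "calls", Confidence.UNCERTAIN),
    Edge("backend.verify", "db.execute", "calls", Confidence.CERTAIN),
]
weakest = path_confidence(path)
print(weakest.name)
```

Using an ordered enum rather than string labels matters here: comparing the strings "certain" and "uncertain" would sort alphabetically and report the wrong end of the scale.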

Here's a practical example of querying the graph. Suppose you want to find all paths from HTTP request handlers to database queries in a mixed Python/JavaScript codebase:

from trailmark import CodeGraph

# Build the graph from source directories
graph = CodeGraph()
graph.index_directory("./backend", language="python")
graph.index_directory("./services", language="javascript")

# Find entry points (functions decorated with @app.route or express handlers)
entrypoints = graph.query_nodes(
    node_type="function",
    has_annotation="route|get|post"
)

# Find database interaction points (calls to execute, query, etc.)
db_sinks = graph.query_nodes(
    node_type="function",
    name_matches="execute|query|find|insert|update"
)

# Traverse all paths between them
for entry in entrypoints:
    for sink in db_sinks:
        paths = graph.find_paths(entry, sink, max_depth=10)
        for path in paths:
            # Report the weakest edge on the path (assumes confidence values sort with uncertain below certain)
            confidence = min(edge.confidence for edge in path.edges)
            print(f"Path: {' -> '.join(node.name for node in path.nodes)}")
            print(f"Confidence: {confidence}")
            print(f"Languages: {set(node.language for node in path.nodes)}")

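This query spans language boundaries in the sense that nodes from both languages live in one graph and are searched together. An edge between the Python backend and a JavaScript service only exists where Trailmark can resolve the call statically; calls made over HTTP or through a subprocess are not detected automatically (see Gotcha).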

The third phase is where Trailmark distinguishes itself from generic call graph tools: semantic annotation ingestion. You can import findings from SARIF (the Static Analysis Results Interchange Format) or weAudit (Trail of Bits' audit markup tool) and merge them directly into the graph as node or edge annotations. This means you can run Semgrep to find SQL injection patterns, import those results into Trailmark, then query "show me all functions with high cyclomatic complexity that lie on a path from an HTTP handler to a SARIF-flagged SQL sink." You're combining structural analysis with external findings in a way that traditional tools don't support.
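The ingestion step can be sketched with nothing but the standard library. The field names below (runs, results, ruleId, locations, physicalLocation, artifactLocation.uri, region.startLine) follow the SARIF 2.1.0 format; everything else, including the NODES table and the attach_findings function, is hypothetical and says nothing about how Trailmark itself stores annotations.

```python
import json

# A tiny SARIF 2.1.0 document, inlined so the sketch is self-contained.
SARIF_TEXT = json.dumps({
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {"name": "example-scanner"}},
        "results": [{
            "ruleId": "sql-injection",
            "level": "error",
            "message": {"text": "Query built from request data"},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": "backend/db.py"},
                    "region": {"startLine": 42},
                }
            }],
        }],
    }],
})

# Hypothetical node table: name -> (file, first line, last line).
NODES = {
    "run_query": ("backend/db.py", 30, 55),
    "render_page": ("backend/views.py", 10, 25),
}


def attach_findings(sarif_text, nodes):
    """Map each SARIF result onto the node whose line range contains it."""
    findings = {name: [] for name in nodes}
    for run in json.loads(sarif_text).get("runs", []):
        for result in run.get("results", []):
            for loc in result.get("locations", []):
                phys = loc.get("physicalLocation", {})
                uri = phys.get("artifactLocation", {}).get("uri")
                line = phys.get("region", {}).get("startLine")
                if line is None:
                    continue
                for name, (path, start, end) in nodes.items():
                    if path == uri and start <= line <= end:
                        findings[name].append(result.get("ruleId"))
    return findings


findings = attach_findings(SARIF_TEXT, NODES)
print(findings)
```

Once findings are keyed by node, a query such as "flagged sinks reachable from a handler" becomes a filter over the same graph rather than a separate report.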

The graph also tracks trust boundaries. You can tag nodes as "untrusted entry points" (HTTP handlers, CLI argument parsers, deserialization functions) and mark sensitive sinks (SQL execution, eval, shell commands), then query for paths from the former to the latter that do not pass through a validation function. This is attack surface enumeration as a graph query:

# Tag untrusted entry points
graph.annotate_nodes(
    query={"node_type": "function", "decorators": ["app.route"]},
    annotation={"trust_level": "untrusted"}
)

# Tag validation functions
graph.annotate_nodes(
    query={"name_matches": "validate|sanitize|escape"},
    annotation={"is_validator": True}
)

# Find paths from untrusted to sensitive that skip validation
sensitive_sinks = graph.query_nodes(name_matches="execute|eval|system")
for entry in graph.query_nodes(trust_level="untrusted"):
    for sink in sensitive_sinks:
        risky_paths = graph.find_paths(
            entry,
            sink,
            exclude_nodes={"is_validator": True}
        )
        if risky_paths:
            print(f"Unvalidated path found: {entry.name} -> {sink.name}")

This is fundamentally different from pattern-based tools like Semgrep. You're not searching for local patterns—you're analyzing global program structure.
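What such a query does underneath can be shown without any Trailmark API at all. The sketch below is a plain depth-first search over a hypothetical call graph (CALLS, VALIDATORS, and paths_avoiding are all made up for illustration) that enumerates simple paths while refusing to step onto a validator node.

```python
# Hypothetical call graph: caller -> callees.
CALLS = {
    "upload_handler": ["validate_upload", "save_raw"],
    "validate_upload": ["store"],
    "save_raw": ["store"],
    "store": ["db_execute"],
    "db_execute": [],
}
VALIDATORS = {"validate_upload"}


def paths_avoiding(graph, start, goal, blocked, max_depth=10):
    """Yield simple paths from start to goal that never enter a blocked node."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            yield path
            continue
        # Stop extending paths that already hold more than max_depth nodes.
        if len(path) > max_depth:
            continue
        for callee in graph.get(node, []):
            # Skip validators and anything already on the path (avoids cycles).
            if callee in blocked or callee in path:
                continue
            stack.append((callee, path + [callee]))


risky = list(paths_avoiding(CALLS, "upload_handler", "db_execute", VALIDATORS))
for path in risky:
    print(" -> ".join(path))
```

This treats a validator as a node that sits on the path. A handler that calls a validator and then separately calls the sink would still be reported, because a call graph carries no ordering or data flow information.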

Gotcha

Trailmark's breadth comes with inherent precision limits that security-conscious users need to understand. The core issue is that tree-sitter parses syntax, not semantics. It doesn't perform type checking, doesn't resolve generics, and doesn't understand most forms of dynamic dispatch. When you call a method on an interface in TypeScript or Java, Trailmark sees the call but can't definitively determine which implementation gets invoked unless there's only one possibility. It tags the edge as uncertain and moves on.
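The "only one possibility" rule can be illustrated with a name-based resolver. This is an assumption about how a syntax-only tool might behave, not a description of Trailmark's resolver: DEFINITIONS and resolve_call are hypothetical, and keeping every candidate as an uncertain edge is just one possible policy.

```python
# Hypothetical index of method definitions discovered while parsing:
# method name -> classes that define it.
DEFINITIONS = {
    "serialize": ["JsonEncoder"],
    "save": ["FileStore", "S3Store", "MemoryStore"],
}


def resolve_call(caller, method, definitions):
    """Return (caller, target, confidence) edges for a method call by name."""
    candidates = definitions.get(method, [])
    # Exactly one definition: the target is unambiguous.
    # Several: syntax alone cannot pick one, so keep them all as uncertain.
    confidence = "certain" if len(candidates) == 1 else "uncertain"
    return [(caller, f"{cls}.{method}", confidence) for cls in candidates]


edges = resolve_call("export", "serialize", DEFINITIONS)
edges += resolve_call("export", "save", DEFINITIONS)
edges += resolve_call("export", "notify", DEFINITIONS)  # no known definition
for edge in edges:
    print(edge)
```

The call to notify yields no edge at all, which is the worse failure mode for an audit: an uncertain edge at least shows up in query results, while a missing one silently shortens the attack surface.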

This creates coverage gaps in object-oriented and functional codebases. Polymorphic method calls, higher-order functions passed as callbacks, and dynamic module imports all result in uncertain or missing edges. For security audits, this is dangerous—an attacker only needs one real path to succeed, and if that path involves dynamic dispatch that Trailmark couldn't resolve, your query results will give false confidence.

The cross-language call resolution suffers similar limitations. If your Python backend spawns a subprocess to run a Go binary, that's invisible to Trailmark unless you manually annotate the relationship. If you make HTTP calls between services, those aren't edges in the graph unless you build custom extractors to detect them. The tool works best on monolithic codebases with static call structures, and those are increasingly rare in an era of microservice architectures.

Finally, mutation testing and fuzzing integration—some of the most exciting features mentioned in the roadmap—aren't implemented yet. The vision is to use graph queries to guide fuzzing toward high-value paths and prioritize mutation testing on complexity hotspots, but that's future work. Right now, you're getting graph construction and queries, not automated test generation.

Verdict

Use Trailmark if: you're auditing polyglot codebases and need to map attack surfaces across language boundaries, you want to combine structural analysis with external tool findings (SARIF, custom annotations) in a unified graph, or you're doing exploratory security research where breadth matters more than semantic precision. It's particularly valuable when you need to ask architectural questions like "what's the blast radius of this component" or "which high-complexity functions are reachable from network input."

Skip if: you need precise data flow analysis with full type inference (CodeQL or language-specific tools will catch vulnerabilities Trailmark misses), your codebase relies heavily on dynamic dispatch and runtime polymorphism where static analysis struggles, or you need production-ready mutation testing and fuzzing integration right now rather than as a future roadmap item.

For deep single-language audits, use dedicated tools; for whole-system architecture analysis, Trailmark fills a real gap.
