Building an Autonomous Reverse-Engineering Agent with Multi-Stage Verification
Hook
What if you could reverse-engineer an entire C++ binary while you sleep, with an AI agent that checks its own work through three layers of verification before committing a single line?
Context
Reverse engineering is notoriously time-consuming. A senior engineer might spend weeks manually reversing a moderately sized binary: loading it into Ghidra or IDA Pro, analyzing each function's decompiled output, cross-referencing symbols, inferring data structures, and painstakingly reconstructing source code that matches the binary's behavior. The process is repetitive, error-prone, and doesn't scale. Large game engines, proprietary SDKs, and legacy systems contain thousands of functions, many following similar patterns that humans excel at recognizing but hate implementing repeatedly.
LLMs changed the equation. Tools like ChatGPT and Claude can read decompiled code and produce plausible C++ reconstructions. But "plausible" isn't "correct." LLMs hallucinate, miss edge cases, and confidently generate code that compiles but behaves differently than the original binary. Dryxio's auto-re-agent tackles this trust problem head-on by wrapping LLM-based reverse engineering in a multi-stage verification pipeline. It's not trying to replace human reverse engineers—it's automating the tedious 80% while providing verification signals so you can focus your expertise on the tricky 20%.
Technical Insight
The architecture is a four-stage pipeline: decompilation retrieval, LLM generation with source context, objective verification, and parity analysis. At the foundation sits ghidra-ai-bridge, a backend that maintains a headless Ghidra instance, accepting queries for decompiled code, cross-references, and struct definitions. The orchestrator queries this bridge for a target function, retrieves nearby source code from the project repository for context, then feeds both to an LLM provider (Claude, OpenAI, or a local Codex CLI endpoint).
Here's where it gets interesting. The system doesn't trust the LLM's first output. Generation is followed by a checker loop: the generated code goes back to the LLM with a prompt asking "does this implementation match the decompiled code?" This self-checking catches obvious mistakes like wrong parameter types or missing conditionals. But LLMs can hallucinate consistency, so stage three introduces objective verification: static analysis that counts function calls, measures control flow complexity (number of branches and loops), and compares these metrics between the decompiled function and the generated source. If the decompiled function has 7 calls and 3 conditional branches but the generated code has only 5 calls and 2 branches, it fails verification regardless of what the LLM claims.
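To make the objective check concrete, here is a minimal sketch of comparing call and branch counts between two function bodies. It is not the tool's actual implementation: the names count_metrics and verify_structure, the regex-based counting, and the tolerance parameter are all assumptions for illustration, and a real checker would more likely work from a parsed representation than from regular expressions.

```python
import re

# Hypothetical heuristics: keywords that open a branch or loop in C-like code.
BRANCH_KEYWORDS = re.compile(r'\b(if|for|while|switch|case)\b')
# An identifier (optionally namespace-qualified) directly followed by "(".
CALL_PATTERN = re.compile(r'\b([A-Za-z_][\w:]*)\s*\(')
# Keywords that look like calls to the regex above but are not.
NOT_CALLS = {'if', 'for', 'while', 'switch', 'return', 'sizeof', 'catch'}


def count_metrics(body: str) -> dict:
    """Count call sites and branch keywords in a function body (no signature)."""
    calls = [name for name in CALL_PATTERN.findall(body) if name not in NOT_CALLS]
    branches = BRANCH_KEYWORDS.findall(body)
    return {'calls': len(calls), 'branches': len(branches)}


def verify_structure(decompiled_body: str, generated_body: str, tolerance: int = 0) -> dict:
    """Fail when any metric differs by more than `tolerance` (an assumed knob)."""
    expected = count_metrics(decompiled_body)
    actual = count_metrics(generated_body)
    mismatches = {
        name: (expected[name], actual[name])
        for name in expected
        if abs(expected[name] - actual[name]) > tolerance
    }
    return {'passed': not mismatches, 'mismatches': mismatches}
```

The point of the design is that the verdict depends only on counted structure, so a confident but wrong LLM self-assessment cannot override it.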
The fourth stage is the parity engine, which runs 11 heuristic signals comparing generated code against known source patterns in your codebase. These signals include stub detection (does it just return a constant or throw an exception?), call density (calls per line of code), floating-point usage, string literal patterns, and domain-specific markers. For instance, if you're reversing a game engine that heavily uses a plugin architecture, the parity engine can flag functions that don't call the expected plugin registration APIs:
# Simplified parity signal example from the engine
import re


def check_plugin_api_usage(generated_code, function_metadata):
    # Patterns a plugin function is expected to contain
    expected_patterns = [
        r'RegisterPlugin\(',
        r'PluginManager::GetInstance\(',
        r'PLUGIN_EXPORT'
    ]
    if function_metadata.get('is_plugin_function'):
        # Count how many of the expected patterns appear at least once
        matches = sum(1 for pattern in expected_patterns
                      if re.search(pattern, generated_code))
        if matches == 0:
            return {'signal': 'RED', 'reason': 'Plugin function missing API calls'}
        elif matches < 2:
            return {'signal': 'YELLOW', 'reason': 'Incomplete plugin registration'}
        else:
            return {'signal': 'GREEN', 'reason': 'Expected plugin patterns found'}
    return {'signal': 'GREEN', 'reason': 'Not a plugin function'}
Each signal returns RED (fail), YELLOW (manual review), or GREEN (pass). The engine aggregates these into a final confidence score. A function needs at least 8/11 GREEN signals to auto-accept. YELLOW signals queue the function for human review with context about which heuristics failed.
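The aggregation step could look roughly like the sketch below. The 8-GREEN threshold comes from the description above. Everything else is an assumption: the function name aggregate_signals, the decision labels, and in particular the rule that a single RED rejects the function outright, which the description does not spell out.

```python
from collections import Counter

GREEN_THRESHOLD = 8  # from the text: at least 8 of the 11 signals must be GREEN


def aggregate_signals(results: list) -> dict:
    """Turn per-signal results ({'signal': ..., 'reason': ...}) into one decision."""
    counts = Counter(r['signal'] for r in results)
    # Keep the reasons for every non-GREEN signal so a reviewer sees what failed.
    flagged = [r['reason'] for r in results if r['signal'] != 'GREEN']
    if counts['RED'] > 0:
        decision = 'reject'          # assumption: any RED blocks acceptance
    elif counts['GREEN'] >= GREEN_THRESHOLD:
        decision = 'auto_accept'
    else:
        decision = 'manual_review'   # too many YELLOWs to accept automatically
    return {
        'decision': decision,
        'green': counts['GREEN'],
        'total': len(results),
        'flagged': flagged,
    }
```

Returning the flagged reasons alongside the decision mirrors the idea that a function queued for review should arrive with context about which heuristics were unhappy.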
The source-aware aspect is critical. When reversing a function, the system searches your repository for similar function names, related classes, or file context. If you're reversing EntityManager::SpawnEntity, it retrieves your existing EntityManager source files and includes them in the LLM prompt. This dramatically improves accuracy because the LLM sees your coding conventions, naming patterns, and architecture decisions. It's the difference between asking someone to write code in a vacuum versus showing them your existing codebase first.
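As a rough illustration of source-aware retrieval, the sketch below picks context files for a qualified name such as EntityManager::SpawnEntity by matching the class name against file names first and file contents second. The function names, the suffix list, the ranking order, and the prompt layout are all hypothetical; the actual tool's search strategy is not documented here.

```python
from pathlib import Path

# Assumed set of C++ source and header extensions worth searching.
SOURCE_SUFFIXES = {'.h', '.hpp', '.cpp', '.cc'}


def find_context_files(repo_root, qualified_name, max_files=5):
    """Return source files likely related to `qualified_name`, best matches first."""
    parts = qualified_name.split('::')
    # For "EntityManager::SpawnEntity" search for the class, "EntityManager".
    class_name = parts[-2] if len(parts) > 1 else parts[0]
    candidates = [
        p for p in sorted(Path(repo_root).rglob('*'))
        if p.is_file() and p.suffix in SOURCE_SUFFIXES
    ]
    # Files named after the class rank ahead of files that merely mention it.
    by_name = [p for p in candidates if class_name.lower() in p.stem.lower()]
    by_content = [
        p for p in candidates
        if p not in by_name and class_name in p.read_text(errors='ignore')
    ]
    return (by_name + by_content)[:max_files]


def build_prompt(decompiled, context_files):
    """Assemble a prompt that shows existing project code before the target."""
    sections = [f'// {p.name}\n{p.read_text(errors="ignore")}' for p in context_files]
    return ('Existing project source:\n' + '\n\n'.join(sections)
            + '\n\nDecompiled function to reconstruct:\n' + decompiled)
```

Capping the number of files keeps the prompt within a predictable size, which matters when each function may be retried several times.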
Session management is deliberately conservative. Every run creates a JSON session file tracking which functions were processed, their verification scores, and retry counts. The system has bounded retry loops (default 3 attempts per function) and never performs git operations automatically. It won't commit, push, or modify your working tree. Generated code goes into a separate output directory. This safety-first design means you can kill the process anytime without data loss or corrupted state. For long-running reverse engineering projects (imagine processing 2,000 functions overnight), this predictability is essential.
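A bounded retry loop backed by a JSON session file might be sketched as follows. The limit of 3 attempts is the default mentioned above. The session schema, the function names, and the write-then-rename step used to keep the file intact if the process is killed mid-write are assumptions, not details confirmed for the tool.

```python
import json
import os
import tempfile
from pathlib import Path

MAX_ATTEMPTS = 3  # default retry bound from the text


def load_session(path):
    """Load an existing session, or start an empty one (assumed schema)."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    return {'functions': {}}


def save_session(path, session):
    """Write to a temp file, then swap it in so a kill never leaves half a file."""
    p = Path(path)
    fd, tmp = tempfile.mkstemp(dir=p.parent, suffix='.tmp')
    with os.fdopen(fd, 'w') as f:
        json.dump(session, f, indent=2)
    os.replace(tmp, p)


def process_function(session, name, attempt_fn):
    """Call attempt_fn(name) -> (score, passed) until it passes or attempts run out."""
    entry = session['functions'].setdefault(
        name, {'attempts': 0, 'status': 'pending', 'score': None})
    while entry['status'] != 'accepted' and entry['attempts'] < MAX_ATTEMPTS:
        entry['attempts'] += 1
        score, passed = attempt_fn(name)
        entry['score'] = score
        entry['status'] = 'accepted' if passed else 'failed'
    return entry
```

Because the attempt count lives in the session file, a resumed run does not grant an exhausted function a fresh set of retries.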
The Ghidra integration uses capability flags for graceful degradation. If your Ghidra project lacks certain metadata (maybe structs weren't fully analyzed), the bridge reports reduced capabilities, and the orchestrator adjusts its verification strictness. It might skip struct layout verification if that data isn't available but still verify call counts and control flow. This flexibility lets you start reversing early in your Ghidra analysis rather than waiting for perfect decompilation.
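One way to picture capability-based degradation is a table mapping each verification check to the bridge capabilities it needs, as in the sketch below. The capability names, the check names, and the BridgeCapabilities type are invented for illustration; ghidra-ai-bridge's real flags may be named and structured differently.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BridgeCapabilities:
    """Hypothetical flags a bridge might report for the current Ghidra project."""
    decompile: bool = True
    xrefs: bool = True
    structs: bool = True


# Assumed mapping: which capabilities each verification check depends on.
CHECK_REQUIREMENTS = {
    'call_count': {'decompile'},
    'control_flow': {'decompile'},
    'xref_consistency': {'xrefs'},
    'struct_layout': {'structs'},
}


def plan_checks(caps):
    """Split checks into those that can run and those skipped for missing data."""
    available = {name for name, enabled in asdict(caps).items() if enabled}
    enabled_checks = [c for c, needs in CHECK_REQUIREMENTS.items() if needs <= available]
    skipped_checks = [c for c in CHECK_REQUIREMENTS if c not in enabled_checks]
    return enabled_checks, skipped_checks
```

With structs unavailable, plan_checks(BridgeCapabilities(structs=False)) would skip only the struct layout check and still run the call count and control flow checks, matching the behavior described above.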
Gotcha
The system's effectiveness is bounded by Ghidra's decompilation quality and your reference source code. If Ghidra produces garbage decompilation—common with heavily optimized binaries, hand-written assembly, or unusual compiler flags—the LLM receives garbage input and produces garbage output. No amount of verification fixes this foundational problem. You'll burn through API tokens generating plausible-looking but fundamentally wrong code. Similarly, if you don't have reference source material (you're reversing a completely unknown binary with no similar codebases), the parity engine's heuristics become guesswork. The 11 signals are tuned for specific patterns; a binary that doesn't follow those conventions will generate false positives or false negatives.
The heuristic verification is also not proof of correctness. A function could pass all 11 signals and still have subtle bugs—off-by-one errors, incorrect edge case handling, or wrong floating-point precision. The tool provides confidence signals, not guarantees. For security-critical reverse engineering (vulnerability research, malware analysis, anti-cheat systems), you cannot rely solely on these heuristics. Manual verification remains mandatory. Additionally, LLM costs add up quickly. Processing thousands of functions through Claude or GPT-4 with multiple retries and checker loops can easily cost hundreds of dollars per project. Budget for this if you're considering automation at scale.
Verdict
Use if: you're reverse-engineering a large C++ binary (5,000+ functions) with existing partial source code or a similar codebase for reference, you have budget for LLM API costs ($200-500 per major reversing project), and you need to automate repetitive function reconstruction while maintaining verification guardrails. It's ideal for game modding, legacy system reconstruction, or SDK reversing where patterns repeat and you're comfortable with 85-90% accuracy requiring human spot-checks.
Skip if: you're doing one-off or small-scale reversing (under 500 functions) where manual work is faster, you need guaranteed correctness for security research or legal compliance, you're working with heavily obfuscated or hand-optimized assembly that Ghidra struggles with, or you don't have reference source material for the parity engine to learn patterns from. In those cases, stick with manual Ghidra analysis or invest in custom Binary Ninja scripting tailored to your specific binary's quirks.