Building a Self-Correcting Reverse Engineering Agent with Dual LLMs
Hook
What if your decompiler could argue with itself until the code was right? auto-re-agent runs two competing LLMs in a loop—one reconstructing source code from binaries, the other tearing it apart looking for mistakes—then uses an 11-signal quality engine to decide when to stop.
Context
Reverse engineering at scale has always been a grinding war of attrition. Anyone who’s worked on game preservation projects, malware analysis pipelines, or legacy binary modernization knows the pattern: open Ghidra or IDA, decompile a function, spend twenty minutes untangling pointer arithmetic and inlined templates, write equivalent source code, verify it matches the binary’s behavior, then move to the next of 3,000 functions. Even experienced reverse engineers can only handle 10-15 complex functions per day.
LLMs promised to change this. Tools like GitHub Copilot and GPT-4 can read decompiled output and generate surprisingly plausible source reconstructions. But anyone who’s tried feeding Ghidra decompilation into ChatGPT hits the same wall: the LLM confidently produces code that looks right but introduces subtle behavioral differences—wrong alignment assumptions, missing volatile keywords, hallucinated helper functions that don’t exist in the binary. For hobbyist projects you can hand-verify each function. For production RE work on thousands of functions, you need automation that knows when it’s wrong. That’s the problem auto-re-agent tackles: creating a reverse engineering pipeline with LLMs that can self-correct and self-assess quality at scale.
Technical Insight
The architecture centers on what the author calls a “dual-LLM verify-fix loop.” Instead of asking one model to both generate and validate code, auto-re-agent splits the cognitive load. The ‘reverser’ LLM receives Ghidra decompilation output along with cross-references, type information, and project-specific context (like known stub patterns or hook conventions). It generates a source code reconstruction. That output immediately goes to a separate ‘checker’ LLM with a different system prompt optimized for finding discrepancies—missing edge cases, type mismatches, behavioral differences from the decompilation. The checker returns critiques, the reverser tries again, and the loop continues for N rounds (configurable, typically 3-5).
This adversarial setup exploits a quirk of LLM behavior: models are often better at critique than generation. A model that produces code with subtle bugs will frequently spot those same bugs when asked to review them in isolation. By forcing the critique into a separate inference pass with fresh context, you reduce the “consistency bias” where models defend their own output.
The system’s integration with Ghidra happens through an external bridge component (ghidra-ai-bridge). Here’s what a typical orchestration flow looks like:
```python
# Simplified conceptual example based on the architecture
class REOrchestrator:
    def __init__(self, reverser_llm, checker_llm, parity_engine):
        self.reverser = reverser_llm
        self.checker = checker_llm
        self.parity = parity_engine
        self.ghidra = GhidraAIBridge()

    def process_function(self, func_address, max_rounds=5):
        # Extract from Ghidra: decompilation, xrefs, types
        context = self.ghidra.extract_function_context(func_address)
        source_code, critique = None, None
        for round_num in range(max_rounds):
            # Reverser generates a source reconstruction, seeded with the
            # previous attempt and critique on later rounds
            source_code = self.reverser.generate(
                decompiled=context.decompilation,
                xrefs=context.cross_references,
                prior_attempt=source_code,
                critique=critique,
            )
            # Checker validates against original binary behavior
            critique = self.checker.validate(
                original=context.decompilation,
                reconstructed=source_code,
                type_info=context.types,
            )
            # Parity engine decides if quality is acceptable
            quality = self.parity.evaluate(source_code, context)
            if quality == "GREEN":
                return source_code, round_num
            if quality == "RED" and round_num == max_rounds - 1:
                return source_code, "FAILED"
        return source_code, "MAX_ROUNDS"
```
The real magic is in the parity engine’s 11 signals. Instead of just checking “does it compile,” it runs heuristics that catch reverse-engineering-specific issues. Signal examples include:
- Stub detection: Scans for markers like `STUB()`, `__debugbreak()`, or project-specific incomplete patterns. Functions with stubs shouldn't pass as GREEN.
- Plugin call density: Measures the ratio of calls to known hooks/plugins versus total calls. Deviations from project norms flag RED (they might indicate hallucinated helper functions).
- Floating-point presence: Checks whether FP operations in the decompilation match FP usage in the reconstruction. A missing `__int64` cast can change behavior.
- Trivial wrapper detection: Flags functions that just call one other function, which might indicate the reverser gave up and produced a placeholder.
- Type consistency: Cross-references reconstructed types against Ghidra’s type recovery. Mismatched struct layouts break at runtime.
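Two of those signals can be sketched concretely. The function names, regex, and thresholds below are illustrative guesses, not auto-re-agent's actual implementation:

```python
import re

# Hypothetical sketch of two parity signals; markers and the expected
# density band are assumptions, mirroring the examples described above.
STUB_MARKERS = re.compile(r"\b(STUB\s*\(|__debugbreak\s*\(|NOT_IMPLEMENTED)")

def stub_signal(source_code: str) -> float:
    """Return 0.0 if any stub marker is present, else 1.0."""
    return 0.0 if STUB_MARKERS.search(source_code) else 1.0

def plugin_density_signal(calls: list, hook_prefixes=("Hook_", "Original_"),
                          expected=(0.15, 0.40)) -> float:
    """Score how close the hook-call ratio sits to the project's normal band."""
    if not calls:
        return 1.0
    ratio = sum(c.startswith(hook_prefixes) for c in calls) / len(calls)
    lo, hi = expected
    if lo <= ratio <= hi:
        return 1.0
    # Penalize proportionally to distance from the expected band
    return max(0.0, 1.0 - abs(ratio - (lo if ratio < lo else hi)) * 2)
```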
Each signal returns a score; the engine combines them into GREEN/YELLOW/RED classifications. YELLOW means “probably needs human review.” This interpretable quality gate is crucial for production use—you can tune thresholds per project and understand why something failed.
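A minimal score combiner consistent with that description might look like the following; the thresholds and the hard-fail rule are assumptions, not the tool's actual values:

```python
def classify(scores: dict, hard_fail=("stub_detection",),
             green_floor=0.85, red_ceiling=0.60) -> str:
    """Map per-signal scores in [0, 1] to a GREEN/YELLOW/RED gate.

    Any hard-fail signal scoring 0.0 forces RED regardless of the average,
    mirroring rules like "functions with stubs shouldn't pass as GREEN".
    """
    if any(scores.get(name, 1.0) == 0.0 for name in hard_fail):
        return "RED"
    avg = sum(scores.values()) / len(scores)
    if avg >= green_floor:
        return "GREEN"
    if avg <= red_ceiling:
        return "RED"
    return "YELLOW"
```

The point of keeping the combiner this simple is interpretability: every RED can be traced back to either a hard-fail signal or a low average, which is what makes per-project threshold tuning tractable.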
Session management is equally thoughtful. The orchestrator persists state to JSON after each function, storing the final source code, quality classification, round count, and token usage. If the process crashes or you hit API rate limits, you can resume exactly where you left off. For large projects (think decompiling an entire game engine), this resumability is the difference between “interesting experiment” and “production tool.”
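A minimal version of that resumable store might look like this; the class name and JSON schema are my own, since the repo's actual format isn't shown:

```python
import json
import os

class SessionStore:
    """Illustrative resumable-session store keyed by function address."""

    def __init__(self, path="session.json"):
        self.path = path
        self.state = {}
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)

    def done(self, func_address: str) -> bool:
        return func_address in self.state

    def record(self, func_address, source, quality, rounds, tokens):
        self.state[func_address] = {
            "source": source, "quality": quality,
            "rounds": rounds, "tokens": tokens,
        }
        # Write after every function so a crash loses at most one result
        with open(self.path, "w") as f:
            json.dump(self.state, f)
```

On restart, the orchestrator can simply skip any address for which `done()` returns True.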
The configuration system deserves mention: CLI flags override environment variables, which override YAML profiles, which override hardcoded defaults. You can define per-project profiles that encode conventions:
```yaml
# project_config.yaml for a game engine
project: "engine_alpha"
stub_markers:
  - "STUB"
  - "NOT_IMPLEMENTED"
  - "__debugbreak"
hook_patterns:
  - "Hook_*"
  - "Original_*"
max_rounds: 5
parity_thresholds:
  stub_tolerance: 0
  plugin_density_range: [0.15, 0.40]
  fp_mismatch_tolerance: 2
```
This lets teams adapt the tool to their codebase’s quirks without forking the core logic.
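The precedence chain itself (CLI over env over YAML over defaults) can be sketched as a small resolver; the `AUTORE_` env prefix and key names here are hypothetical:

```python
import os

def resolve(key, cli_args: dict, yaml_profile: dict, defaults: dict,
            env_prefix="AUTORE_"):
    """Return the value for `key`, honoring CLI > env > YAML > defaults."""
    if cli_args.get(key) is not None:
        return cli_args[key]
    env_val = os.environ.get(env_prefix + key.upper())
    if env_val is not None:
        return env_val
    if key in yaml_profile:
        return yaml_profile[key]
    return defaults[key]
```

One subtlety worth noting in any such chain: environment variables arrive as strings, so numeric options still need type coercion before use.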
Gotcha
The dual-LLM approach sounds elegant but brings practical friction. First, cost: you’re running two inference passes per round, typically 3-5 rounds per function. For a complex 200-line function, expect 4,000-8,000 tokens per round across both models. At GPT-4 pricing that’s $0.50-$1.00 per function. Decompiling 1,000 functions costs $500-$1,000 in API fees. The author mentions using cheaper models for the checker, which helps, but you’re still paying for two models’ worth of compute on every iteration.
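That arithmetic is easy to package as a back-of-envelope estimator; the default round count, token volume, and per-token price below are illustrative placeholders, not current API rates:

```python
def estimate_cost(functions, rounds=4, tokens_per_round=6000,
                  usd_per_1k_tokens=0.03):
    """Rough API cost in USD for a dual-LLM batch run (illustrative rates)."""
    total_tokens = functions * rounds * tokens_per_round
    return total_tokens * usd_per_1k_tokens / 1000
```

With these assumed defaults, 1,000 functions comes out to roughly $720, inside the $500-$1,000 range cited above; swapping a cheaper checker model in effectively lowers the blended per-token rate.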
Second, the parity engine’s heuristics can’t catch semantic bugs that compile cleanly. If the reverser misunderstands a bitwise trick and produces code that compiles, passes type checks, and has similar call density, but behaves differently on edge case inputs—GREEN classification, wrong output. The system has no runtime verification step. You’re trusting static analysis heuristics to proxy for correctness, which works maybe 80-90% of the time based on the design.
Third, setup complexity is real. You need a pre-configured Ghidra project with the binary already loaded and analyzed, ghidra-ai-bridge installed and working (which itself requires compatible Ghidra and Python versions), and API keys configured. The repo’s documentation is minimal—9 stars suggests this is early-stage tooling without mature onboarding guides. Expect to spend half a day reading code to understand configuration options and workflow.
Finally, latency makes this unsuitable for interactive use. Each function takes 30-90 seconds depending on complexity and rounds. You can batch process overnight, but you can’t sit in a debugger and iteratively improve one function with instant feedback like you would in IDA Pro with Hex-Rays.
Verdict
Use if: You're working on large-scale reverse engineering projects (500+ functions) where manual decompilation is the bottleneck, you have budget for LLM API costs ($500-$2,000 depending on project size), and you can tolerate 10-20% of output needing human review. The tool shines for game preservation, firmware modernization, or malware family analysis, where you need to process thousands of functions and can batch overnight. It's also a good fit if your project has consistent patterns (hooks, stubs, plugin architectures) that the parity engine can be tuned to.

Skip if: You're reverse engineering small binaries (under 100 functions) where manual work is faster, you need provably correct output for safety-critical or security applications, you lack LLM API budget, or you prefer interactive RE workflows where you refine one function at a time in a debugger. Also skip if your Ghidra setup is flaky; this tool's value proposition depends on reliable decompilation input. For those cases, stick with native Ghidra scripting for full control, or Binary Ninja's tighter IDE integration for interactive LLM-assisted analysis.