Anamnesis: The LLM Exploit Generation Benchmark That Rewrites Security Assumptions
Hook
GPT-5.2 discovered an exit handler chaining technique to bypass Control Flow Integrity, Shadow Stack protections, and a sandbox—without human hints. The exploit consumed 50 million tokens and took over three hours to generate, demonstrating autonomous capability on a challenge that would test experienced exploit developers.
Context
Exploit development has traditionally been the domain of elite security researchers who spend weeks reverse-engineering binaries, crafting ROP chains, and manually bypassing mitigations like ASLR and NX. It’s painstaking work that requires deep systems knowledge, assembly fluency, and creative problem-solving. The skill barrier has been one of cybersecurity’s few remaining moats—if attackers couldn’t exploit vulnerabilities efficiently, even publicly known bugs posed limited practical risk.
Anamnesis challenges that assumption with reproducible experimental evidence. Built by Sean Heelan, this evaluation framework tests whether frontier LLMs can autonomously generate working exploits from vulnerability reports. Using a use-after-free bug in QuickJS (which the README states was itself discovered by an Opus 4.5 agent), Anamnesis provides agents with bug reports and proof-of-concept triggers, then measures their ability to produce exploits that bypass progressively harder security mitigations: ASLR, NX, Full RELRO, Control Flow Integrity, Shadow Stack, and sandboxing. The results show both Claude Opus 4.5 and GPT-5.2 succeeded at tasks that would challenge experienced exploit developers, independently discovering techniques like GOT overwrites, FSOP attacks, pointer mangling defeats, and multi-stage ROP chains. The repository includes complete experiment logs, working exploits, and detailed technical analysis of each bypass technique.
Technical Insight
Anamnesis is structured as a Python evaluation harness that orchestrates LLM agents through the exploit development lifecycle against a vulnerable QuickJS (JavaScript engine) binary. Agents interact with the binary in a controlled environment, using the Claude Agent SDK for Opus 4.5 and the OpenAI Agents SDK for GPT-5.2. The framework configures token budgets (30-60M per run), reasoning settings (a 31,999-token thinking budget for Opus; ‘high’ or ‘xhigh’ reasoning effort for GPT-5.2), and mitigation profiles, then logs every interaction as agents analyze binaries, craft exploits, and iterate on failures.
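As a rough picture of that orchestration, an experiment reduces to a configuration object plus a batch of independent runs. The names below (ExperimentConfig and its fields) are illustrative assumptions for exposition, not the framework’s actual API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an experiment configuration. Field names are
# assumptions; the budget/reasoning values come from the README.
@dataclass
class ExperimentConfig:
    model: str                         # e.g. "opus-4.5" or "gpt-5.2"
    mitigations: list = field(default_factory=list)
    token_budget: int = 30_000_000     # 30-60M tokens per run
    reasoning: str = "high"            # 'high' or 'xhigh' for GPT-5.2
    runs: int = 10                     # independent agent instances

# The hardest configuration described in the write-up.
hardest = ExperimentConfig(
    model="gpt-5.2",
    mitigations=["full-relro", "cfi", "shadow-stack", "sandbox"],
    token_budget=60_000_000,
    reasoning="xhigh",
)
```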
The baseline challenge—Partial RELRO with ASLR and NX enabled—demonstrates the agents’ fundamental capabilities. Both models independently discovered the classic GOT overwrite technique. According to the README, GPT-5.2 produced exploits that overwrite free@GOT with the address of system() and trigger shell execution by calling free on a buffer containing “/bin/sh”.
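The idea behind that overwrite can be sketched abstractly: under Partial RELRO the GOT stays writable, so redirecting the free slot at system turns a later free of a “/bin/sh” buffer into a shell. The simulation below models the GOT as a writable dispatch table; the toy “memory” and function stand-ins are purely illustrative, not the actual exploit:

```python
# Toy model of a Partial RELRO GOT overwrite. The GOT is modeled as a
# writable dict mapping PLT stubs to target functions; nothing here
# touches a real process.
def fake_system(arg):
    return f"shell: {arg}"      # stands in for libc system()

def fake_free(arg):
    return "freed"              # stands in for libc free()

got = {"free": fake_free}       # writable under Partial RELRO

def plt_call(name, arg):
    return got[name](arg)       # every libc call resolves via the GOT

# The attacker's write primitive replaces free@GOT with system's
# address...
got["free"] = fake_system

# ...so freeing a buffer holding "/bin/sh" calls system("/bin/sh").
print(plt_call("free", "/bin/sh"))   # shell: /bin/sh
```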
What’s significant is how agents adapted when mitigations increased. Under Full RELRO, the GOT becomes read-only, blocking direct overwrites. Opus 4.5 pivoted to FSOP attacks, constructing fake FILE structures to hijack glibc’s cleanup routines. GPT-5.2 took a different approach: the README describes how it traversed DT_DEBUG -> r_debug -> link_map structures to enumerate loaded libraries, read __libc_stack_end from ld-linux, then built ROP chains targeting execve. The README notes these techniques weren’t provided as hints—the agents discovered them by analyzing ELF structures and glibc internals.
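GPT-5.2’s library enumeration can be sketched against a mock arbitrary-read primitive. Only the struct link_map field offsets below reflect real 64-bit glibc layout (l_addr at +0, l_name at +8, l_next at +24); the flat memory model, addresses, and helpers are assumptions for illustration:

```python
import struct

# Mock flat memory (addr -> 8 bytes) standing in for the agent's
# arbitrary-read primitive inside the exploited process.
MEM, STRINGS = {}, {}

def write64(addr, val):
    MEM[addr] = struct.pack("<Q", val)

def read64(addr):
    return struct.unpack("<Q", MEM[addr])[0]

def read_cstr(addr):
    return STRINGS[addr].split(b"\x00")[0].decode()

def make_node(base, l_addr, name_ptr, l_next):
    write64(base + 0, l_addr)     # object's load base
    write64(base + 8, name_ptr)   # char *l_name
    write64(base + 24, l_next)    # struct link_map *l_next

# Two fake link_map nodes: libc and the dynamic loader.
STRINGS[0x5000] = b"/usr/lib/libc.so.6\x00"
STRINGS[0x5100] = b"/lib64/ld-linux-x86-64.so.2\x00"
make_node(0x1000, 0x7F0000000000, 0x5000, 0x1100)
make_node(0x1100, 0x7F1000000000, 0x5100, 0)

# r_debug.r_map (reached from the main binary's DT_DEBUG entry) points
# at the head of this chain; following l_next enumerates every object.
def walk_link_map(head):
    libs, node = [], head
    while node:
        libs.append((read64(node), read_cstr(read64(node + 8))))
        node = read64(node + 24)
    return libs
```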
The architecture reveals careful engineering in the evaluation methodology. Each experiment runs 10 independent agent instances with no communication between runs, so each attempt is an independent sample. The framework captures complete work logs showing the agent’s reasoning process, failed attempts, and iterative refinement. For example, in Full RELRO + CFI experiments, Opus consistently used stack corruption: leaking libc addresses, scanning memory for return addresses, then overwriting them with ROP chains. This works because Clang’s CFI only protects forward edges (indirect calls), not backward edges (returns). GPT-5.2 also used this approach, but additionally discovered that glibc’s exit handlers could be hijacked by locating the pointer mangling key.
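The pointer mangling being defeated here is glibc’s x86-64 PTR_MANGLE scheme: XOR a pointer with a secret per-process guard, then rotate left by 17 bits. The sketch below uses made-up constants; the point is that a single leaked mangled pointer with known plaintext reveals the guard, after which attacker-chosen pointers can be mangled correctly:

```python
# glibc x86-64 pointer mangling: XOR with a per-process guard value,
# then rotate left by 17 bits (0x11). All constants here are made up.
MASK = (1 << 64) - 1

def rol64(v, n): return ((v << n) | (v >> (64 - n))) & MASK
def ror64(v, n): return ((v >> n) | (v << (64 - n))) & MASK

def mangle(ptr, guard):   return rol64(ptr ^ guard, 0x11)
def demangle(enc, guard): return ror64(enc, 0x11) ^ guard

guard  = 0xDEADBEEFCAFEF00D      # normally a random per-process secret
target = 0x00007F0123456789      # e.g. address the attacker wants called

enc = mangle(target, guard)
assert demangle(enc, guard) == target

# Defeating the scheme: given a leaked mangled pointer whose plaintext
# is known (e.g. a default exit handler), recover the guard directly.
known_plain = 0x00007F00AABBCCDD
leaked = mangle(known_plain, guard)
recovered_guard = ror64(leaked, 0x11) ^ known_plain
assert recovered_guard == guard
```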
The hardest challenge—Full RELRO + CFI + Shadow Stack + Sandbox—blocked traditional ROP chains entirely. Shadow Stack maintains a protected copy of return addresses, detecting any corruption. According to the README, GPT-5.2’s solution chained multiple exit handlers together by corrupting the exit handler list with carefully crafted pointers, each mangled with the correct key. This achieved multi-call execution without touching the return stack, remaining both CFI-compliant and Shadow Stack-safe. The README states this took over 3 hours and 50M tokens.
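Why exit handler chaining evades both defenses can be sketched with the same mangling scheme: forge a list of ef_cxa-style entries whose mangled pointers demangle to attacker-chosen targets, then let the normal exit path dispatch them as ordinary forward calls. Everything below (the dispatch table, addresses, simplified handler loop) is a simulation, not glibc’s real __exit_funcs layout:

```python
calls = []   # records what the simulated "process" executes

MASK = (1 << 64) - 1
def rol64(v, n): return ((v << n) | (v >> (64 - n))) & MASK
def ror64(v, n): return ((v >> n) | (v << (64 - n))) & MASK
def mangle(p, g):   return rol64(p ^ g, 0x11)
def demangle(e, g): return ror64(e, 0x11) ^ g

GUARD = 0x1337C0DE12345678   # per-process pointer guard, here recovered

# Pretend address space: address -> callable, standing in for real code.
CODE = {
    0x400100: lambda arg: calls.append(("mprotect", arg)),
    0x400200: lambda arg: calls.append(("execve", arg)),
}

# Forged ef_cxa-style entries: (mangled function pointer, argument).
# Every pointer must be mangled with the correct guard or dispatch fails.
forged = [
    (mangle(0x400100, GUARD), "rwx"),
    (mangle(0x400200, GUARD), "/bin/sh"),
]

def run_exit_handlers(entries, guard):
    # Simplified model of glibc's exit handler walk: demangle each
    # pointer and invoke it as an ordinary forward call. No return
    # address is ever corrupted, so Shadow Stack stays consistent.
    # (Real glibc runs registered handlers in LIFO order.)
    for enc, arg in entries:
        CODE[demangle(enc, guard)](arg)

run_exit_handlers(forged, GUARD)
```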
The experiment data lives in experiment-results/, organized by mitigation configuration and model. Each run directory contains subdirectories for achieved_primitives with working exploit PoCs, agent work logs showing the full reasoning trace, and metadata about token consumption and timing. Running run_experiments.py reproduces the evaluation, though the hardest challenges consume up to 60M tokens and require 3+ hours on high-reasoning settings.
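Inspecting those artifacts programmatically can look like the sketch below. The config/model/run nesting and directory names are assumptions inferred from the layout described above, not a documented interface:

```python
from pathlib import Path

# Hedged sketch: count exploit PoCs per run under experiment-results/.
# The <config>/<model>/<run> nesting is an assumption, not an API.
def summarize(results_root="experiment-results"):
    summary = {}
    for run_dir in sorted(Path(results_root).glob("*/*/*")):
        poc_dir = run_dir / "achieved_primitives"
        n_pocs = len(list(poc_dir.glob("*"))) if poc_dir.is_dir() else 0
        summary[str(run_dir)] = n_pocs   # run -> number of exploit PoCs
    return summary
```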
Gotcha
Anamnesis is an academic evaluation framework demonstrating capability, not a production security tool. The framework focuses exclusively on a single vulnerability—a use-after-free in QuickJS. While this provides controlled reproducibility for scientific evaluation, it limits generalizability to other vulnerability types or targets. The README explicitly acknowledges that 10 runs per experiment is ‘too low to make definitive statements about the relative capabilities of the models,’ which should inform interpretation of the comparative results.
The computational requirements are substantial. The baseline Partial RELRO challenge consumes around 30 million tokens per run; the hardest configuration required 60 million. The three-hour runtime for complex challenges means this isn’t something you iterate on quickly. The framework also requires specific SDK configurations (Claude Agent SDK, OpenAI Agents SDK) that may not translate cleanly to other LLM providers or self-hosted models. The README notes that for the hardest experiment, only GPT-5.2 was run, a deliberate decision to concentrate resources.
Verdict
Use Anamnesis if you’re researching AI-assisted security, need reproducible benchmarks for evaluating LLM exploitation capabilities, or want to study the technical details of how frontier models approach binary exploitation. It’s valuable for security researchers who need concrete evidence when discussing AI capabilities in offensive security contexts—the experiment logs and working exploits in the experiment-results directory provide specific examples that general claims cannot match. This is also relevant reading if you’re in vulnerability disclosure or defensive security; understanding what LLMs can demonstrate on controlled challenges should inform threat models. Skip it if you want a general-purpose exploit development framework, need production tooling for security testing, or expect coverage beyond single-vulnerability case studies. This is narrowly scoped research infrastructure with explicit limitations acknowledged in the README, not a practical security toolkit. Also consider your resource constraints—the token budgets (30-60M) and runtime requirements (3+ hours for hardest challenges) make casual exploration computationally expensive.