Inside the Risk Bubble: How Princeton’s Framework Measures AI Agent Capabilities in Offensive Security
Hook
When you give an AI agent multiple attempts to hack a vulnerable system, does it get better at exploiting it—or just expose the ceiling of its capabilities? Princeton researchers built a framework to answer exactly that question.
Context
As large language models become autonomous agents capable of executing penetration testing tasks, the security community faces questions about assessing the risk these systems pose. Traditional cybersecurity evaluations test static capabilities, but AI agents can improve through iteration—they refine prompts, optimize workflows, and even train on their own successful attempts. This creates what researchers call a ‘risk bubble’: agents appear to improve dramatically through iteration, but may hit fundamental capability ceilings that static benchmarks can’t detect.
The Dynamic Risk Assessment framework from Princeton University (with collaborators from UC Irvine) addresses this gap by implementing rigorous evaluation methodologies that account for iterative improvement. Rather than running a model once against a CTF challenge and recording pass/fail, the framework runs 12 independent rollouts by default, applies iterative prompt refinement across 20 rounds, and evaluates self-training on successful trajectories. It’s designed for researchers who need to distinguish between genuine capability growth and statistical noise—critical for understanding whether an AI agent that fails multiple times but succeeds once represents a real security risk or a lucky guess.
Technical Insight
The framework implements three complementary improvement strategies, each revealing different aspects of agent capability. The foundation is repeated sampling with statistical rigor: for each CTF challenge, the system runs 12 independent rollouts by default, each capped at N=20 interaction rounds, then calculates pass@k metrics with confidence intervals. This isn’t just about success rates—it’s about understanding the probability distribution of agent capabilities.
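The README doesn’t spell out the estimator, but pass@k is conventionally computed with the unbiased combinatorial estimator from the Codex paper, and confidence intervals can be bootstrapped over tasks. A minimal sketch under those assumptions (not the framework’s actual code; bootstrap_ci and its defaults are illustrative):

```python
from math import comb
import random

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k rollouts drawn without replacement from n
    total rollouts (of which c succeeded) solves the task."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sized all-fail sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def bootstrap_ci(per_task_scores, iters=10_000, alpha=0.05, seed=0):
    """Illustrative percentile-bootstrap CI over per-task pass@k scores."""
    rng = random.Random(seed)
    n = len(per_task_scores)
    means = sorted(
        sum(rng.choices(per_task_scores, k=n)) / n for _ in range(iters)
    )
    return means[int(alpha / 2 * iters)], means[int((1 - alpha / 2) * iters) - 1]

# 12 rollouts with 3 successes: pass@1 is just the plain success rate
rate = pass_at_k(12, 3, 1)  # 0.25
```

With n=12 rollouts, pass@12 answers the “lucky guess” question directly: it is 1.0 as soon as any single rollout succeeded.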
The evaluation workflow starts by hosting your model via vLLM API (the framework assumes you’re running inference on GPU infrastructure), then launching parallel Docker containers for isolated agent execution. Here’s how you’d evaluate a model on the Intercode CTF test set:
bash scripts/launch_evaluation_base.sh
# Key parameters:
# N=20 (max interaction rounds)
# dataset=intercode_ctf
# task_mask=analysis/test_tasks.txt (test set only)
# model_name=Qwen2.5-Coder-32B-Instruct
# parallelism=10
# i=1..12 (repetition id for statistical analysis)
# After collecting logs, grade the benchmark:
python analysis/grade_benchmark.py \
--task_name intercode_ctf \
--N 20 \
--output_file acc_repeated_sampling.csv \
--k0 12 \
--test_set
This generates a CSV with pass@1 through pass@12 scores and confidence intervals—critical for distinguishing signal from noise.
The second strategy, iterative prompt refinement, identifies failed tasks from initial runs and automatically refines instructions based on failure patterns. This reveals whether agents fail due to poor prompting or fundamental capability gaps. The system runs up to 20 refinement iterations for each failed task:
bash scripts/launch_evaluation_iter_prompt_refinement.sh
# k0=12 (number of rollouts)
# iter_prompt_round=1..20 (refinement iterations)
python analysis/grade_benchmark.py \
--iter_prompt \
--k0 12 \
--test_set \
--output_file iter_prompt_refinement.csv
Crucially, iterative prompt refinement depends on logs from the base repeated sampling run—it needs to know which tasks failed initially. This dependency chain (base evaluation → prompt refinement → self-training) structures the entire assessment pipeline.
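To make that dependency concrete, here’s a hedged sketch of the chain. The log layout (one JSON list of rollout records per task, each with a success flag) and both helper names are hypothetical; the actual scripts organize their logs differently:

```python
import json
import tempfile
from pathlib import Path

def failed_tasks(base_log_dir):
    """Tasks with no successful rollout in the base repeated-sampling run
    (hypothetical layout: one JSON list of rollout dicts per task)."""
    failed = []
    for log in sorted(Path(base_log_dir).glob("*.json")):
        rollouts = json.loads(log.read_text())
        if not any(r.get("success") for r in rollouts):
            failed.append(log.stem)
    return failed

def refine_and_rerun(tasks, run_round, max_rounds=20):
    """Skeleton of the refinement loop: only tasks that failed the base
    evaluation enter it, and each round re-runs the still-failing set.
    run_round(tasks, round_id) stands in for refining the prompt and
    re-running the agent; it returns the set of tasks solved this round."""
    remaining = list(tasks)
    for round_id in range(1, max_rounds + 1):
        if not remaining:
            break
        solved = run_round(remaining, round_id)
        remaining = [t for t in remaining if t not in solved]
    return remaining

# Toy demonstration with fabricated logs
with tempfile.TemporaryDirectory() as d:
    Path(d, "task_a.json").write_text(json.dumps([{"success": False}] * 3))
    Path(d, "task_b.json").write_text(
        json.dumps([{"success": False}, {"success": True}]))
    still_failing = failed_tasks(d)  # only task_a enters refinement
```

The shrinking `remaining` set is what makes refinement cheaper than the base run: tasks leave the loop as soon as one refined attempt succeeds.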
The third strategy, self-training via supervised fine-tuning, is the most computationally intensive approach. The framework collects successful trajectories from development set runs, then fine-tunes the base model on these examples. This reveals whether agents can bootstrap from their own successes—a key question for understanding autonomous improvement. The implementation requires a separate Python environment (a design decision that adds operational friction but keeps the S1 self-training pipeline’s dependencies isolated from the base environment):
# Switch to a separate self-training environment (e.g. a fresh venv)
python -m venv .venv-selftrain && source .venv-selftrain/bin/activate
pip install -r self_training/requirements.txt
# Launch fine-tuning on successful trajectories
sbatch scripts/launch_self_training.slurm
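Conceptually, the data-collection step turns successful dev-set trajectories into supervised examples. A sketch with a hypothetical trajectory schema (the real pipeline’s field names and chat format will differ):

```python
import json

def to_sft_examples(trajectories):
    """Keep only successful trajectories and flatten each into a
    prompt/completion pair for supervised fine-tuning.
    The 'task_prompt'/'steps'/'action' fields are assumptions,
    not the framework's real schema."""
    examples = []
    for traj in trajectories:
        if not traj.get("success"):
            continue  # self-training bootstraps only from successes
        examples.append({
            "prompt": traj["task_prompt"],
            "completion": "\n".join(step["action"] for step in traj["steps"]),
        })
    return examples

sft_examples = to_sft_examples([
    {"success": True, "task_prompt": "Recover the flag from the binary.",
     "steps": [{"action": "ls"}, {"action": "strings ./chall | grep flag"}]},
    {"success": False, "task_prompt": "Recover the flag.", "steps": []},
])
# One JSON object per line is a common input format for SFT tooling
jsonl = "\n".join(json.dumps(e) for e in sft_examples)
```

Filtering to successes only is the crux of the method—and also its bias: the model never sees counterexamples of what failure looks like.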
The framework evaluates across three CTF benchmarks used in the paper—Intercode CTF, CyBench, and NYU CTF—each presenting different challenge types. Intercode CTF separates train/test sets using task masks, enabling proper development/evaluation splits. The agents operate in Docker containers (built via setup.sh), providing isolation and consistent environments across runs. This architecture choice prevents state leakage between rollouts but requires managing container lifecycle and cleanup.
Under the hood, agents interact with vLLM-hosted models via API calls, generating trajectories that include tool invocations, command executions, and reasoning chains. The grading pipeline analyzes these trajectories against ground-truth solutions, computing not just binary success/failure but also tracking which tools were used, how many interaction rounds were required, and where agents got stuck. This granular analysis powers the prompt refinement strategy—failed attempts inform what guidance to add in subsequent iterations.
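The README doesn’t document the trajectory format, but the kind of per-trajectory accounting described above can be sketched like this (every field name here is an assumption):

```python
from collections import Counter

def summarize_trajectory(traj):
    """Granular stats of the sort the grading pipeline tracks:
    success, interaction rounds consumed, and tool-usage counts.
    The 'steps'/'tool'/flag fields are a hypothetical schema."""
    tools = Counter(step["tool"] for step in traj["steps"])
    return {
        "success": traj["flag_submitted"] == traj["ground_truth_flag"],
        "rounds": len(traj["steps"]),
        "tools": dict(tools),
    }

stats = summarize_trajectory({
    "ground_truth_flag": "picoCTF{example}",
    "flag_submitted": "picoCTF{example}",
    "steps": [{"tool": "bash"}, {"tool": "bash"}, {"tool": "submit_flag"}],
})
```

Aggregated over 12 rollouts, summaries like this show not just whether an agent solved a task but where the unsuccessful rollouts stalled—exactly the signal the refinement strategy consumes.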
Gotcha
The computational requirements are substantial and non-negotiable. You need GPU infrastructure to host models via vLLM, parallel Docker execution for reasonable evaluation times (the default parallelism=10 means 10 concurrent containers), and patience—12 rollouts of up to 20 interaction rounds each, for every task, adds up fast. The README doesn’t specify exact hardware requirements or expected runtimes, so plan for significant compute when evaluating large models.
The self-training workflow introduces operational complexity that demands careful attention. It requires a completely separate Python environment with its own dependencies (self_training/requirements.txt instead of the base requirements.txt), a consequence of its integration with the S1 pipeline. The documentation shows running sbatch scripts/launch_self_training.slurm to produce self-trained checkpoints, but you’ll need to host those checkpoints and evaluate them yourself in a separate pass. This is research code, and the workflows assume familiarity with training infrastructure.
More fundamentally, the framework’s evaluation focuses on CTF benchmarks—controlled environments with known solutions and clearly defined success criteria. Real-world penetration testing involves ambiguous objectives, evolving defenses, and ethical constraints that CTFs don’t capture. An agent that performs well on Intercode CTF might behave completely differently against production systems with monitoring, incident response, and legal consequences. The framework measures capability ceilings in controlled settings, but extrapolating to operational risk requires careful judgment.
Verdict
Use this framework if you’re conducting academic research on AI agent capabilities in offensive security, need rigorous statistical evaluation of iterative improvement strategies, or are developing risk assessment methodologies for autonomous systems. It’s particularly valuable if you’re studying the ‘risk bubble’ phenomenon—understanding where iteration helps versus where agents hit capability ceilings. The statistical rigor (12 rollouts by default, confidence intervals, pass@k metrics) and complementary improvement strategies (repeated sampling, prompt refinement, self-training) provide insights that single-run evaluations can’t capture. Skip it if you need production-ready penetration testing tools, lack GPU infrastructure for model hosting and parallel evaluation, or require assessment beyond CTF environments. The framework provides SLURM scripts for cluster environments alongside bash scripts for other setups. This is a research instrument for understanding AI capabilities, not a turnkey security tool, and you should expect to work through some implementation details yourself.