Inside the Framework Measuring How Good AI Agents Are at Hacking
Hook
When researchers needed to measure how dangerous autonomous hacking agents could become, they couldn't find a benchmark that captured the real risk profile—so they built one that treats AI capability improvement itself as the threat model.
Context
The cybersecurity community faces an uncomfortable reality: large language models are becoming increasingly competent at offensive security tasks. While we've seen demos of GPT-4 solving CTF challenges and autonomous agents probing for vulnerabilities, we've lacked a rigorous methodology for quantifying these capabilities and, more critically, for understanding how quickly they improve under different enhancement strategies.
The stakes are uniquely high because improvement curves matter more than point-in-time performance. An agent with a 15% success rate today might seem harmless, but if iterative refinement pushes it to 60% within hours of compute time, we're looking at a different threat landscape entirely. Princeton's Polaris Lab built Dynamic Risk Assessment to answer the question security researchers and AI safety teams are quietly asking: how do we measure not just what offensive AI agents can do, but how rapidly they can get better at it?
Technical Insight
The architecture treats capability improvement as a first-class measurement dimension rather than an afterthought. The framework isolates three distinct enhancement strategies—repeated sampling, prompt refinement, and self-training—and measures their effectiveness across standardized CTF benchmarks with statistical rigor.
At the core sits a Docker-based isolation layer that spins up challenge environments from InterCode CTF, CyBench, and NYU CTF datasets. Each challenge runs in its own container, providing the agent a sandboxed shell to interact with vulnerable systems. The agent—typically Qwen2.5-Coder-32B-Instruct served via vLLM—engages through multi-turn dialogue, receiving observations from the environment and issuing commands until it either solves the challenge or exhausts its interaction budget.
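To make the interaction loop concrete, here is a minimal sketch of the observe-act cycle described above. Everything in it is a stand-in: `ToyChallenge` replaces the Docker container, `scripted_policy` replaces the vLLM-served model, and none of these names come from the repository. It only illustrates the control flow of "issue commands until solved or out of budget."

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyChallenge:
    """Hypothetical stand-in for a containerized CTF challenge."""
    flag: str
    files: Dict[str, str] = field(default_factory=dict)

    def step(self, command: str) -> Tuple[str, bool]:
        """Run one command and return (observation, solved)."""
        if command == "ls":
            return " ".join(sorted(self.files)), False
        if command.startswith("cat "):
            name = command[4:].strip()
            return self.files.get(name, f"cat: {name}: No such file"), False
        if command.startswith("submit "):
            solved = command[7:].strip() == self.flag
            return ("correct" if solved else "incorrect"), solved
        return "sh: command not found", False


History = List[Tuple[str, str]]  # (command, observation) pairs


def run_episode(policy: Callable[[str, History], str],
                env: ToyChallenge,
                max_interactions: int = 10) -> Tuple[bool, History]:
    """Loop until the challenge is solved or the interaction budget runs out."""
    history: History = []
    observation = "You have a shell. Find the flag."
    for _ in range(max_interactions):
        command = policy(observation, history)
        observation, solved = env.step(command)
        history.append((command, observation))
        if solved:
            return True, history
    return False, history


def scripted_policy(observation: str, history: History) -> str:
    """Fixed script standing in for the LLM: list files, read one, submit it."""
    script = ["ls", "cat flag.txt"]
    if len(history) < len(script):
        return script[len(history)]
    return f"submit {history[-1][1]}"


env = ToyChallenge(flag="flag{demo}", files={"flag.txt": "flag{demo}"})
solved, history = run_episode(scripted_policy, env)
```

In the real framework the policy would be a model call and the environment a sandboxed shell, but the budgeted loop has the same shape.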
The repeated sampling strategy implements pass@k evaluation where k ranges from 1 to 12 attempts. This measures raw capability ceiling when you throw compute at the problem:
# Simplified conceptual implementation of pass@k evaluation
def evaluate_pass_at_k(agent, challenge, k_values=(1, 2, 4, 8, 12)):
    results = {}
    trajectories = []
    for attempt in range(max(k_values)):
        trajectory = agent.run_challenge(
            challenge=challenge,
            max_interactions=10,
            temperature=0.7,  # Sampling for diversity
        )
        trajectories.append(trajectory)
        # Once the attempt count reaches a k threshold, record whether
        # any of the first k attempts solved the challenge
        k = attempt + 1
        if k in k_values:
            results[f'pass@{k}'] = any(t.solved for t in trajectories)
    return results, trajectories
What makes this meaningful is the confidence interval calculation across challenge sets. The framework doesn't just report "the agent solved 12 out of 50 challenges"; it provides statistically grounded estimates of capability with uncertainty bounds, which matters for risk assessment because an overconfident point estimate can understate how capable an agent really is.
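For a sense of what "pass@k with uncertainty bounds" can look like in practice, here is a small sketch. It uses the widely cited unbiased pass@k estimator from the Codex paper (Chen et al., 2021) and a plain percentile bootstrap over challenges. Both are assumptions for illustration: the repository may compute its intervals differently, and the success counts below are made up.

```python
import random
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts, c of which succeeded."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-challenge scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    m = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=m)) / m for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Illustrative (invented) success counts out of n=12 attempts per challenge
successes = [0, 0, 3, 12, 1, 0, 0, 6]
scores = [pass_at_k(12, c, 4) for c in successes]
mean_pass_at_4 = sum(scores) / len(scores)
lower, upper = bootstrap_ci(scores)
print(f"pass@4 = {mean_pass_at_4:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Resampling at the challenge level, not the attempt level, reflects that the quantity of interest is performance on new challenges, so the spread across challenges is what drives the interval.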
The prompt refinement strategy gets more sophisticated. It implements an iterative loop that analyzes failed trajectories, identifies failure modes, and generates refined prompts:
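# Conceptual prompt refinement pipeline
# (llm.analyze, llm.generate, and inject_improvements are illustrative
# placeholders, assumed to be defined elsewhere)
def refine_from_failures(base_prompt, failed_trajectories, llm):
    # Step 1: ask the model to diagnose why the attempts failed
    failure_analysis = llm.analyze(
        prompt="Analyze why these cybersecurity attempts failed",
        trajectories=failed_trajectories,
        focus_areas=["command errors", "logic gaps", "missing techniques"]
    )
    # Step 2: turn the diagnosis into short, concrete prompt changes
    refinement_suggestions = llm.generate(
        prompt="Suggest specific prompt improvements",
        analysis=failure_analysis,
        constraints="Must be actionable, specific, under 50 tokens"
    )
    # Step 3: merge the suggestions into the base prompt
    refined_prompt = inject_improvements(
        base_prompt,
        suggestions=refinement_suggestions
    )
    return refined_prompt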
This mirrors how human penetration testers iterate—try an approach, analyze what failed, adjust tactics. The framework quantifies how much this metacognitive loop improves success rates compared to naive repeated attempts.
The self-training pipeline is the most computationally intensive strategy. It adapts the S1 framework: collect successful trajectories, format them as training data, and fine-tune the base model via supervised learning on these "expert demonstrations." The key idea is to treat the agent's own successful attempts as the training signal:
# Self-training data preparation
def prepare_self_training_data(successful_trajectories):
    training_examples = []
    for traj in successful_trajectories:
        # Convert multi-turn interaction into training format
        conversation = []
        for interaction in traj.interactions:
            conversation.append({
                "role": "user",
                "content": interaction.observation
            })
            conversation.append({
                "role": "assistant",
                "content": interaction.action
            })
        training_examples.append({
            "messages": conversation,
            "challenge_id": traj.challenge_id,
            "success": True
        })
    return training_examples
The framework then measures how this fine-tuned model performs on held-out challenges, quantifying whether the agent learned generalizable offensive techniques or merely memorized specific exploits. For risk assessment, this distinction is everything—memorization is bounded, generalization scales.
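The memorization-versus-generalization question only has a clean answer if no held-out challenge contributes trajectories to training. One way to guarantee that is to split by challenge ID, not by trajectory. The sketch below is an assumption about how such a split could be done, not the repository's actual procedure (which may use a fixed development/test list); `is_held_out` and `split_examples` are hypothetical names, and the input follows the `challenge_id` field used in the training examples above.

```python
import hashlib


def is_held_out(challenge_id: str, held_out_fraction: float = 0.2) -> bool:
    """Deterministically assign a challenge to the held-out set via hashing."""
    digest = hashlib.sha256(challenge_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < held_out_fraction


def split_examples(examples, held_out_fraction: float = 0.2):
    """Split examples so every trajectory of a challenge lands on one side."""
    train, held_out = [], []
    for example in examples:
        if is_held_out(example["challenge_id"], held_out_fraction):
            held_out.append(example)
        else:
            train.append(example)
    return train, held_out


# Illustrative input: three trajectories for each of 50 made-up challenge IDs
examples = [
    {"challenge_id": f"challenge-{i}", "messages": [], "success": True}
    for i in range(50)
    for _ in range(3)
]
train, held_out = split_examples(examples)
train_ids = {e["challenge_id"] for e in train}
held_out_ids = {e["challenge_id"] for e in held_out}
```

Because the assignment depends only on the ID, the split stays stable as more trajectories are collected across self-training rounds.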
The SLURM integration enables running these experiments at scale, distributing challenges across cluster nodes and aggregating results with proper statistical methodology. The entire pipeline from environment setup through final analysis is designed for reproducibility, with Docker ensuring consistent challenge environments and fixed random seeds controlling sampling variation.
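As a rough illustration of how challenges might be distributed across cluster nodes, the sketch below shards a challenge list across a SLURM job array using the `SLURM_ARRAY_TASK_ID`, `SLURM_ARRAY_TASK_MIN`, and `SLURM_ARRAY_TASK_COUNT` environment variables that SLURM sets for array jobs. The function name and sharding scheme are assumptions, not the repository's actual scripts, and the code assumes contiguous array indices (for example `--array=0-3`).

```python
import os
from typing import List, Mapping


def shard_for_this_task(challenge_ids: List[str],
                        env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the slice of challenges this array task should evaluate."""
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", 0))
    task_min = int(env.get("SLURM_ARRAY_TASK_MIN", 0))
    task_count = int(env.get("SLURM_ARRAY_TASK_COUNT", 1))
    # Sort first so every task sees the same ordering regardless of input order
    ordered = sorted(challenge_ids)
    # Round-robin assignment: task i takes items i, i + count, i + 2 * count, ...
    return ordered[task_id - task_min::task_count]


# Simulate a 4-task array locally by passing a fake environment
challenge_ids = [f"challenge-{i:02d}" for i in range(10)]
shards = [
    shard_for_this_task(
        challenge_ids,
        env={
            "SLURM_ARRAY_TASK_ID": str(i),
            "SLURM_ARRAY_TASK_MIN": "0",
            "SLURM_ARRAY_TASK_COUNT": "4",
        },
    )
    for i in range(4)
]
```

Outside of SLURM the defaults make the function return the full list, so the same script can run on a single machine for debugging.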
Gotcha
The computational requirements are prohibitive outside well-funded research labs. You're looking at multi-GPU setups to serve Qwen2.5-Coder-32B-Instruct via vLLM, potentially dozens of Docker containers running simultaneously for parallel evaluation, and SLURM cluster access for the self-training experiments. A single full evaluation pass across all three datasets with all improvement strategies could easily consume hundreds of GPU-hours.
The documentation assumes substantial expertise in multiple domains simultaneously: cybersecurity fundamentals to understand CTF challenges, ML infrastructure to debug vLLM serving issues, and research methodology to interpret the statistical results properly. There's no gentle onboarding path. If you're not already comfortable with concepts like "pass@k metrics," "supervised fine-tuning pipelines," and "Docker networking for isolated environments," expect a steep learning curve before you generate meaningful results. The repository is a research artifact first and a usable tool second, and that shows in everything from setup complexity to error messages that assume you understand the underlying papers.
Verdict
Use if you're conducting AI safety research focused on dual-use capabilities, need to benchmark offensive security agents with statistical rigor, or want to understand concrete risk profiles before deploying autonomous security tools in controlled environments. This framework excels at answering "how quickly could this agent improve" rather than just "what can it do now." Skip if you're a security practitioner looking for penetration testing automation, lack multi-GPU infrastructure and SLURM cluster access, or need something production-ready rather than a research evaluation harness. This tool measures danger; it doesn't help you exploit or defend systems in operational contexts.