AgentLens: Replaying AI Agent Decisions Like Git Commits
Hook
What if every decision your AI agent made was recorded like a git commit, complete with diffs, branches, and the ability to checkout any point in its reasoning? That’s exactly what AgentLens does with an invisible git repository the agent never knows exists.
Context
AI safety researchers face a reproducibility crisis. When a language model agent makes a decision—calling a tool, editing a file, or spawning a subagent—that moment vanishes unless you’ve instrumented everything perfectly. Traditional logging captures text but misses the full state: file changes, execution context, the exact branching point where behavior diverged. Researchers studying alignment, interpretability, or behavioral variance need more than logs; they need time machines.
AgentLens emerged from this gap. Built by Dreadnode as a research harness around Anthropic’s Claude Agent SDK, it treats agent trajectories as first-class experimental data. The core insight: agents are stateful systems that modify environments over time, so observability tooling should use version control primitives, not just append-only logs. The result is a framework that captures every step of multi-session agent interactions in ATIF (Agent Trajectory Interchange Format), tracks all file modifications with git under the hood, and enables researchers to branch, replay, and resample from any decision point in an agent’s execution history.
Technical Insight
The architecture centers on three instrumentation layers that wrap Claude Code execution without the agent’s awareness. First is ATIF trajectory capture, which logs every step as structured JSON including tool calls, observations, and model responses. Second is shadow git tracking, the cleverest piece: AgentLens creates an invisible bare git repository that monitors the agent’s working directory using GIT_DIR and GIT_WORK_TREE environment variables. The agent sees a normal directory; git sees a tracked workspace. Every turn triggers a commit with attribution metadata linking back to the trajectory step.
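The shadow-repository trick is plain git underneath: point GIT_DIR at a repository stored outside the workspace while GIT_WORK_TREE names the agent’s directory, so the workspace never contains a `.git` entry. A minimal standalone sketch of the technique—the paths, commit-message format, and `git` helper here are illustrative, not AgentLens internals:

```python
import os
import subprocess
import tempfile

workspace = tempfile.mkdtemp()  # what the agent sees: an ordinary directory
shadow = os.path.join(tempfile.mkdtemp(), "shadow.git")  # invisible metadata store

env = {**os.environ, "GIT_DIR": shadow, "GIT_WORK_TREE": workspace}

def git(*args):
    # Inline identity flags keep the sketch self-contained; AgentLens's real
    # attribution metadata is richer than a commit message.
    return subprocess.run(
        ["git", "-c", "user.name=lens", "-c", "user.email=lens@example.com", *args],
        env=env, cwd=workspace, check=True, capture_output=True, text=True,
    ).stdout

subprocess.run(["git", "init", "--quiet", "--bare", shadow], check=True)

# Simulate one agent turn: the agent writes a file, never seeing any .git.
with open(os.path.join(workspace, "auth.py"), "w") as f:
    f.write("def login(): ...\n")

# The harness snapshots the turn with a commit linking back to the trajectory.
git("add", "-A")
git("commit", "--quiet", "-m", "turn-1: Edit(auth.py)")

print(git("log", "--oneline").strip())   # one commit in the shadow repo
print(os.listdir(workspace))             # workspace still shows no .git
```

This is the same pattern popularized for dotfiles management; the harness owns the environment variables, so the agent’s own shell commands never touch the shadow repository.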
Here’s what a basic experiment definition looks like:
```yaml
name: code-refactor-study
provider: anthropic
model: claude-sonnet-4-20250514
sessions:
  - id: initial-refactor
    mode: isolated
    system_prompt: "You are a Python refactoring expert."
    task: "Refactor the authentication module for better testability."
    max_turns: 20
  - id: security-review
    mode: chained
    parent: initial-refactor
    task: "Review the refactored code for security vulnerabilities."
    max_turns: 10
```
The mode parameter controls state management: isolated sessions start fresh, chained sessions inherit the working directory from their parent via git worktrees (lightweight checkouts sharing the same git object database), and forked sessions enable branching for A/B trajectory comparisons. This orchestration model supports the multi-session, multi-agent patterns common in safety research.
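As a rough sketch of how those modes could map onto git primitives—a hypothetical helper with invented branch and path names, since AgentLens’s real orchestration is internal:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Session:
    id: str
    mode: str                    # "isolated" | "chained" | "forked"
    parent: Optional[str] = None

def workspace_plan(session: Session) -> list[str]:
    """Return the git command that would provision this session's workspace.

    Illustrative only: the work/ and sessions/ naming is an assumption,
    not AgentLens's layout.
    """
    if session.mode == "isolated":
        # Fresh directory, fresh history.
        return ["git", "init", f"work/{session.id}"]
    if session.mode == "chained":
        # Continue from the parent's final commit via a lightweight worktree
        # sharing the same object database.
        return ["git", "worktree", "add",
                f"work/{session.id}", f"sessions/{session.parent}"]
    if session.mode == "forked":
        # New branch from the parent tip, so trajectories can diverge
        # and later be diffed against each other.
        return ["git", "worktree", "add", "-b", f"fork/{session.id}",
                f"work/{session.id}", f"sessions/{session.parent}"]
    raise ValueError(f"unknown mode: {session.mode}")

print(workspace_plan(Session("security-review", "chained", parent="initial-refactor")))
```

Worktrees are a good fit here because every session’s checkout shares one object database, which is what makes branch-and-compare experiments cheap.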
The resampling architecture operates at four levels. At the top, stateless API resampling simply re-runs the same prompt with a new seed, useful for measuring variance. Intervention testing lets researchers edit assistant responses or tool results mid-trajectory to test counterfactuals—“what if the agent had chosen a different file to edit?” Session-level resampling reruns an entire session from a checkpoint. Turn-level replay is the deepest: AgentLens checks out the exact git worktree state from that turn and re-executes all tools from that branch point forward.
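Intervention testing in particular is easy to picture as list surgery on a trajectory. A simplified sketch, assuming steps are plain dicts rather than full ATIF records:

```python
import copy

# A trajectory as a list of ATIF-style steps (heavily simplified; real ATIF
# records carry far more provenance than shown here).
trajectory = [
    {"turn": 1, "type": "tool_call", "tool": "read_file", "result": "def f(): ..."},
    {"turn": 2, "type": "assistant", "text": "I'll edit utils.py."},
    {"turn": 3, "type": "tool_call", "tool": "edit_file", "result": "ok"},
]

def intervene(steps, turn, **overrides):
    """Counterfactual edit: keep history up to `turn`, patch that step,
    and drop everything after it so the agent regenerates from there."""
    prefix = copy.deepcopy([s for s in steps if s["turn"] <= turn])
    prefix[-1].update(overrides)
    return prefix

# "What if the agent had read a different file?"
counterfactual = intervene(trajectory, turn=1, result="def g(): ...")
print(len(counterfactual))            # 1 step kept; turns 2-3 get re-rolled
print(counterfactual[0]["result"])    # def g(): ...
```

Turn-level replay then adds the missing half: alongside the edited message history, the working directory is restored from the matching shadow-repo commit before re-execution.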
The provider abstraction deserves scrutiny because it’s simultaneously elegant and constrained. AgentLens supports four backends—Anthropic, OpenRouter, AWS Bedrock, and GCP Vertex—through a unified interface:
```python
from agent_lens import AgentLens
from agent_lens.providers import AnthropicProvider, BedrockProvider

# Works identically across providers
lens = AgentLens(
    provider=BedrockProvider(
        model="anthropic.claude-sonnet-4-20250514-v1:0",
        region="us-west-2",
    ),
    experiment_path="experiments/alignment-test.yaml",
)
result = lens.run()
print(f"Tokens used: {result.usage.total_tokens}")
print(f"Git SHA: {result.final_commit}")
```
But here’s the catch: despite the abstraction, only Claude models work, because AgentLens depends entirely on the Claude Agent SDK, which speaks only Anthropic’s Messages API. The provider layer essentially changes the billing endpoint while keeping the wire format identical.
The output format combines ATIF trajectories (JSON with full provenance metadata), unified diffs showing all file changes across the entire experiment, and git SHAs for every decision point. For researchers, this means experiments are fully reproducible. Share the ATIF file and the git repository, and another team can replay your agent’s exact trajectory, branch at turn 47, inject a different tool result, and measure how behavior diverges.
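To make that pairing concrete, here is a hypothetical shape for one step record linking a trajectory action to its shadow-repo commit; the field names are illustrative, not the published ATIF schema:

```python
import json

# Hypothetical single-step record: ATIF is JSON with provenance metadata,
# but every field name and value below is invented for illustration.
step = {
    "step_id": 47,
    "session": "initial-refactor",
    "model": "claude-sonnet-4-20250514",
    "action": {"type": "tool_call", "tool": "edit_file", "path": "auth.py"},
    "observation": "ok",
    "provenance": {
        "git_sha": "0f3a9c1",      # shadow-repo commit made after this turn
        "parent_sha": "8d21b44",
        "diff": "--- a/auth.py\n+++ b/auth.py\n@@ ...",
    },
}

record = json.dumps(step, indent=2)
print(record)

# Replaying turn 47 elsewhere then reduces to: check out provenance.git_sha
# in the shared shadow repo, feed the step prefix back in, and resume.
restored = json.loads(record)
print(restored["provenance"]["git_sha"])   # 0f3a9c1
```

Because the record round-trips through JSON and the commit SHA pins the exact filesystem state, the trajectory file plus the git repository is the complete experimental artifact.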
Gotcha
The Claude-only limitation is the biggest gotcha. Despite the provider abstraction suggesting model flexibility, you’re locked into Anthropic’s API protocol. If you want to study Gemini, GPT-4, or open models, AgentLens won’t help—the entire architecture assumes Claude Agent SDK’s tool use patterns and message structures. This isn’t a simple provider adapter problem; the whole trajectory format is shaped around Claude’s specific agentic behavior.
Turn-level replay is marked experimental in the documentation, which should give researchers pause. This is arguably the most valuable feature—the ability to checkout any decision point and re-execute—but stability isn’t guaranteed. Cost reporting on non-Anthropic providers is purely informational and doesn’t reflect actual billing, making budget controls unreliable for large-scale experiments on Bedrock or Vertex. And with only 78 GitHub stars, this is firmly research-phase software. Expect breaking changes, incomplete documentation, and the need to read source code when things break.
Verdict
Use AgentLens if you’re conducting AI safety or interpretability research on Claude-based agents and need reproducible multi-session trajectories with granular replay capabilities. The shadow git tracking and ATIF logging are purpose-built for academic research, behavioral variance studies, and ablation experiments where you need to branch agent behavior at specific decision points. The multi-session orchestration handles complex scenarios like agent-spawned subagents or chained reasoning tasks. Skip if you need production-grade reliability, want model flexibility beyond Claude, require accurate cross-provider cost tracking, or expect stable APIs—this is explicitly experimental research tooling. Also skip if you’re building production agent systems; tools like LangSmith offer broader model support and operational maturity, even if they lack AgentLens’s deep git-based change tracking.