AgentLens: Observing AI Agents Through an Invisible Git Time Machine

Hook

What if you could replay any AI agent session from an arbitrary decision point, fork the timeline, and watch what changes? AgentLens treats agent behavior as a git history you never asked for—but desperately need.

Context

AI agents that write code, modify files, and orchestrate complex workflows are moving from demos to reality. But when Claude rewrites your config file for the third time or mysteriously deletes a directory, how do you figure out why? Traditional observability tools log API calls and token counts, but they miss the filesystem layer where agents actually do their work. Researchers studying AI safety face an even thornier problem: how do you analyze whether an agent's behavior is aligned when you can't reproducibly replay its decisions or track causality between actions and outcomes?

AgentLens emerged from the Machine Learning Alignment Theory Scholars (MATS) program under Neel Nanda to solve this trajectory capture problem. It's not trying to be LangSmith or production monitoring—it's a research harness that treats agent sessions as experimental trials you can dissect. The core insight: wrap Claude's agent SDK with an invisible git repository that shadows every file operation, then serialize the entire trajectory into a standardized interchange format (ATIF) that preserves not just what the agent said, but what it actually changed and when.

Technical Insight

The architectural centerpiece is the shadow git system, and it's more elegant than it sounds. AgentLens spawns a bare git repository using GIT_DIR and GIT_WORK_TREE environment variables to create an invisible tracking layer. Your agent workspace stays pristine—no .git directory cluttering things up—while every file write gets committed under the hood with per-step attribution.

Here's the pattern in practice:

# AgentLens orchestrates sessions with transparent change tracking
from agentlens import AgentLens

lens = AgentLens(
    shadow_git=True,  # Invisible git tracking enabled
    session_mode="chained",  # Carry state between sessions
    output_dir="./trajectories"
)

# Run a multi-step agent session
trajectory = lens.run(
    user_prompt="Refactor this API to use async/await",
    workspace="./my_project",
    model="claude-3-5-sonnet-20241022"
)

# Every file change is now in git history with turn-level granularity
for turn in trajectory.turns:
    print(f"Turn {turn.id}: {turn.action}")
    print(f"Files changed: {turn.filesystem_diff.files}")
    # Each turn has a git commit hash you can checkout

The shadow git approach solves a fundamental observability problem: filesystem operations are side effects that standard LLM tracing can't capture. By making every agent action a git commit, AgentLens gives you time-travel debugging for AI behavior. Want to see exactly what the agent changed when it "fixed" your database schema? Check out that turn's commit hash. Need to replay from turn 7 with a different temperature setting? Git worktrees let you fork execution timelines in parallel.
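To make the mechanism concrete, here is a minimal sketch of a shadow repository built with nothing but the standard library and the git CLI. It is not AgentLens's implementation: the function names (`shadow_env`, `init_shadow`, `commit_step`, `files_changed`) and the placeholder commit identity are assumptions made for illustration. The only things taken from the article are the `GIT_DIR` and `GIT_WORK_TREE` environment variables and the idea of one commit per step.

```python
import os
import subprocess
from pathlib import Path


def shadow_env(git_dir: Path, workspace: Path) -> dict:
    """Environment that points git at a repository stored outside the workspace."""
    env = dict(os.environ)
    env.update({
        "GIT_DIR": str(git_dir),          # where the repository data lives
        "GIT_WORK_TREE": str(workspace),  # the directory being tracked
        # Placeholder identity so commits succeed without global git config.
        "GIT_AUTHOR_NAME": "shadow",
        "GIT_AUTHOR_EMAIL": "shadow@example.invalid",
        "GIT_COMMITTER_NAME": "shadow",
        "GIT_COMMITTER_EMAIL": "shadow@example.invalid",
    })
    return env


def _git(args, git_dir: Path, workspace: Path) -> str:
    """Run one git command against the shadow repository and return its stdout."""
    result = subprocess.run(
        ["git", "-c", "commit.gpgsign=false", *args],
        cwd=workspace,
        env=shadow_env(git_dir, workspace),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def init_shadow(git_dir: Path, workspace: Path) -> None:
    """Create the shadow repository; nothing is written into the workspace."""
    git_dir.mkdir(parents=True, exist_ok=True)
    _git(["init", "--quiet"], git_dir, workspace)


def commit_step(git_dir: Path, workspace: Path, step: int) -> str:
    """Snapshot the workspace as one commit and return the commit hash."""
    _git(["add", "--all"], git_dir, workspace)
    # --allow-empty keeps one commit per step even when no file changed.
    _git(["commit", "--quiet", "--allow-empty", "-m", f"step {step}"],
         git_dir, workspace)
    return _git(["rev-parse", "HEAD"], git_dir, workspace)


def files_changed(git_dir: Path, workspace: Path, commit: str) -> list:
    """List the paths touched by a single commit (works for the first commit too)."""
    out = _git(["diff-tree", "--root", "--no-commit-id", "--name-only", "-r", commit],
               git_dir, workspace)
    return out.splitlines()
```

Calling `init_shadow` once and `commit_step` after each agent step gives you one hash per step, and the workspace never gains a `.git` directory because the repository data sits wherever `git_dir` points.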

ATIF (Agent Trajectory Interchange Format) is the serialization layer that makes trajectories portable. It's structured JSON capturing user prompts, assistant responses, tool calls, filesystem diffs, and, crucially, subagent hierarchies. When your agent spawns another agent (yes, this happens), ATIF links the parent and child trajectories via SubagentTrajectoryRef objects:

{
  "trajectory_id": "traj_abc123",
  "turns": [
    {
      "turn_id": 3,
      "user_prompt": "Optimize this function",
      "assistant_response": "...",
      "tool_calls": [...],
      "subagent_refs": [
        {
          "trajectory_id": "traj_xyz789",
          "spawn_reason": "complexity_delegation"
        }
      ],
      "filesystem_diff": "diff --git a/optimizer.py..."
    }
  ]
}

This standardization is critical for interpretability research. You can ingest ATIF trajectories into analysis pipelines, compare behavioral patterns across experiments, or build statistical models of agent decision-making. The unified diff format means you can grep through thousands of trajectories to find patterns like "when does Claude choose to delete files versus modify them?"
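As a rough illustration of that kind of analysis, the sketch below counts deleted, added, and modified files per trajectory and collects the subagent trajectory IDs. It assumes the field names shown in the JSON example above (`turns`, `filesystem_diff`, `subagent_refs`, `trajectory_id`) and assumes the diff is standard `git diff` output. The helper names `classify_diff` and `summarize` are hypothetical, and paths containing spaces are not handled.

```python
import re

# Matches the per-file header line that git emits in unified diffs.
_FILE_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$")


def classify_diff(diff_text: str) -> dict:
    """Map each path in a git-style unified diff to 'deleted', 'added', or 'modified'."""
    kinds = {}
    current = None
    for line in diff_text.splitlines():
        match = _FILE_HEADER.match(line)
        if match:
            current = match.group(2)
            kinds[current] = "modified"  # default until a mode line says otherwise
        elif current and line.startswith("deleted file mode"):
            kinds[current] = "deleted"
        elif current and line.startswith("new file mode"):
            kinds[current] = "added"
    return kinds


def summarize(trajectory: dict) -> dict:
    """Count file operations across all turns and collect linked subagent IDs."""
    counts = {"deleted": 0, "added": 0, "modified": 0}
    subagents = []
    for turn in trajectory.get("turns", []):
        for kind in classify_diff(turn.get("filesystem_diff") or "").values():
            counts[kind] += 1
        for ref in turn.get("subagent_refs", []):
            subagents.append(ref["trajectory_id"])
    return {"counts": counts, "subagents": subagents}


# Hand-written sample in the shape of the ATIF example; not real agent output.
sample = {
    "trajectory_id": "traj_abc123",
    "turns": [{
        "turn_id": 3,
        "subagent_refs": [{"trajectory_id": "traj_xyz789"}],
        "filesystem_diff": (
            "diff --git a/old.py b/old.py\n"
            "deleted file mode 100644\n"
            "diff --git a/optimizer.py b/optimizer.py\n"
            "--- a/optimizer.py\n"
            "+++ b/optimizer.py\n"
            "@@ -1 +1 @@\n"
            "-x = 1\n"
            "+x = 2\n"
        ),
    }],
}
summary = summarize(sample)
```

In a real pipeline you would load each trajectory file with `json.load` and aggregate the `counts` dictionaries across experiments.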

The session chaining modes (isolated/chained/forked) enable experimental designs you can't get elsewhere. Isolated sessions start fresh each time—useful for measuring baseline performance. Chained sessions carry filesystem state forward, letting you study how agents handle technical debt accumulation. Forked sessions let you run counterfactual experiments: "What if the agent had used a different library at turn 5?" The git worktree mechanism makes forking practically free, spinning up parallel universes from any decision point.
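The forking idea can be sketched with plain `git worktree`. For simplicity this example uses an ordinary repository rather than a shadow one, and the helpers `git`, `record_turn`, and `fork_at` are hypothetical names, not AgentLens functions. It shows only the git mechanics: commit once per turn, then materialize an earlier turn in a separate directory while the main timeline stays untouched.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` with a placeholder commit identity."""
    result = subprocess.run(
        ["git",
         "-c", "user.name=fork-demo",
         "-c", "user.email=fork-demo@example.invalid",
         "-c", "commit.gpgsign=false",
         *args],
        cwd=repo,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def record_turn(repo: Path, turn: int, files: dict) -> str:
    """Write the files for one turn, commit them, and return the commit hash."""
    for name, content in files.items():
        (repo / name).write_text(content)
    git(repo, "add", "--all")
    git(repo, "commit", "--quiet", "--allow-empty", "-m", f"turn {turn}")
    return git(repo, "rev-parse", "HEAD")


def fork_at(repo: Path, commit: str, fork_dir: Path) -> Path:
    """Check out `commit` into a new worktree so a counterfactual run can start there."""
    # --detach avoids creating a branch; the fork simply sits at that commit.
    git(repo, "worktree", "add", "--detach", str(fork_dir), commit)
    return fork_dir
```

A counterfactual run would call `fork_at` with the hash of the turn just before the decision of interest, then point a fresh agent session at the returned directory. The worktree shares object storage with the original repository, which is why forking is cheap.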

Resampling operates at several granularities, each suited to a different research question. Stateless API resampling regenerates responses without filesystem context, which measures pure LLM variance. Session-level resampling reruns entire interactions to test reproducibility. Turn-level resampling (experimental) replays from specific decision points, which makes it ideal for intervention testing: you could inject a safety constraint at turn 10 and measure behavioral divergence.
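One simple way to quantify behavioral divergence between an original run and a resampled one is to compare their per-turn action sequences. The sketch below is an assumption about how such a comparison might look, not a metric AgentLens provides. It treats each turn as a single action label and uses the standard library's `difflib.SequenceMatcher` for a similarity ratio.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def first_divergence(original: Sequence[str], replay: Sequence[str]) -> Optional[int]:
    """Index of the first turn where the two runs differ, or None if they are identical."""
    for index, (a, b) in enumerate(zip(original, replay)):
        if a != b:
            return index
    if len(original) != len(replay):
        # One run is a strict prefix of the other; they diverge where the shorter ends.
        return min(len(original), len(replay))
    return None


def similarity(original: Sequence[str], replay: Sequence[str]) -> float:
    """Ratio in [0, 1] of matching actions, computed over whole action labels."""
    return SequenceMatcher(None, list(original), list(replay)).ratio()


# Illustrative action labels only; real trajectories would supply their own.
baseline = ["read", "edit", "test", "commit"]
resampled = ["read", "edit", "delete", "commit"]
diverged_at = first_divergence(baseline, resampled)
score = similarity(baseline, resampled)
```

Here `diverged_at` identifies the turn to inspect first, while `score` gives a coarse number you can aggregate over many resamples.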

The provider abstraction layer is interesting because it's simultaneously flexible and constrained. AgentLens supports Anthropic, OpenRouter, AWS Bedrock, and GCP Vertex as API endpoints, but you're still locked to Claude models because of the underlying SDK dependency. This means you can route requests through your preferred infrastructure (crucial for research budgets) while maintaining model consistency:

import os

from agentlens import AgentLens

lens = AgentLens(
    provider="openrouter",  # Cheaper API routing
    api_key=os.environ["OPENROUTER_KEY"],
    model="anthropic/claude-3-5-sonnet"  # Still Claude
)

The cost reporting tracks cumulative token usage and estimates costs, but the documentation is upfront that these figures are informational only. The estimates are especially unreliable under OpenRouter's dynamic pricing or Bedrock's enterprise contracts. For research accounting, you're better off reconciling against actual bills.
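The arithmetic behind such an estimate is simple, which is also why it drifts from real bills. In this sketch the `Usage` dataclass and both functions are hypothetical, and the per-million-token prices are caller-supplied parameters rather than real rates. It sums tokens across turns, multiplies by the prices, and computes the gap against a billed amount.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Usage:
    """Token counts for one turn (field names are assumptions for this sketch)."""
    input_tokens: int
    output_tokens: int


def estimate_cost(turns: Iterable[Usage],
                  input_price_per_mtok: float,
                  output_price_per_mtok: float) -> float:
    """Cumulative cost estimate given prices per one million tokens."""
    turns = list(turns)
    total_in = sum(u.input_tokens for u in turns)
    total_out = sum(u.output_tokens for u in turns)
    return (total_in * input_price_per_mtok
            + total_out * output_price_per_mtok) / 1_000_000


def discrepancy(estimated: float, billed: float) -> float:
    """Signed difference between the billed amount and the estimate."""
    return billed - estimated


# Placeholder numbers purely for illustration; substitute your provider's rates.
session = [Usage(1000, 500), Usage(2000, 1500)]
estimate = estimate_cost(session, input_price_per_mtok=2.0, output_price_per_mtok=10.0)
```

Tracking `discrepancy` per billing period is one way to see how far the informational estimate drifts from what you actually pay.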

Gotcha

The Claude-only constraint is the elephant in the room. Despite supporting multiple providers, you can't compare GPT-4's agent behavior against Claude's because the Agent SDK hard-depends on Claude's tool-use format. If your interpretability research requires cross-model analysis—and most serious alignment work does—you'll need to maintain separate harnesses. This isn't AgentLens's fault (it's transparent about the limitation), but it's a research blocker worth knowing upfront.

Turn-level replay is marked experimental for good reason. The documentation actively requests bug reports, which in research tooling means "this works in our specific use cases but probably breaks in yours." Git-based replay is conceptually sound, but edge cases around file permissions, symlinks, or non-deterministic tool behavior can cause divergence between original and replayed trajectories. Budget time for debugging replay failures if you're building analysis pipelines that depend on it. The 104 GitHub stars suggest a small user community, so you're more likely to encounter undocumented issues than with mature tools.

Verdict

Use AgentLens if you're conducting AI safety or interpretability research on Claude-based agents where you need reproducible filesystem tracking, trajectory standardization for analysis pipelines, or multi-session behavioral studies. The shadow git architecture and ATIF format are genuinely novel approaches to agent observability that fill a gap in the research toolkit. It's particularly valuable if you're affiliated with alignment research programs and need to share trajectories with collaborators.

Skip it if you need production-grade reliability or aren't specifically focused on research use cases: this is a research prototype with research-grade stability. Also skip it if you need multi-model comparisons or work outside the Claude ecosystem. The SDK dependency makes cross-model studies a non-starter despite the provider flexibility.
