OpenSRE: Building the SWE-bench for Production Incidents
Hook
Every agent framework claims it can debug production incidents, but none can prove it. OpenSRE is building the benchmark that will finally separate marketing from reality.
Context
The AI agent space has a validation problem. SWE-bench established a rigorous benchmark for coding agents, built from real GitHub issues with test suites that verify each fix, but production incident response has remained a wild west of demos and anecdotes. Companies launch 'AI SRE' tools with cherry-picked examples of resolved alerts, but no one can answer the fundamental question: how often does this actually find the root cause?
OpenSRE emerges from this gap. It's not just another LLM wrapper around your observability APIs—though it is that too. The project's real ambition is to become the standardized training and evaluation infrastructure for incident response agents. Think of it as building the gym where AI SREs learn to diagnose cascading failures, not just shipping a pre-trained agent. The Tracer-Cloud team is betting that the tooling for agent development matters more than any single agent implementation, and they're open-sourcing the entire scaffolding: synthetic incident scenarios with labeled root causes, fault injection harnesses for real cloud infrastructure, and a plugin system that normalizes 60+ observability tools into a unified query interface. It's infrastructure for researchers and AI engineers iterating on agent architectures, wrapped in a usable CLI for practitioners who just want to wire up Datadog and start experimenting.
Technical Insight
OpenSRE's architecture strips away the abstraction layers that plague most agent frameworks. There's no LangGraph orchestration and no chain-of-thought library, just a direct observe-reason-act loop built on top of FastAPI. When an alert fires (webhook or CLI trigger), the agent pulls context from connected observability platforms through thin adapter interfaces, feeds everything to an LLM with tool-use capabilities, and either suggests or executes remediation based on user-configured safety thresholds.
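The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not OpenSRE's actual code: `Alert`, `Proposal`, `run_incident_loop`, the `reason` callable standing in for the LLM call, and the `auto_execute_threshold` parameter are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    service: str
    message: str


@dataclass
class Proposal:
    action: str        # e.g. "restart orders-api"
    confidence: float  # model's self-reported confidence, 0.0 to 1.0


def run_incident_loop(
    alert: Alert,
    observers: Dict[str, Callable[[Alert], List[dict]]],
    reason: Callable[[Alert, Dict[str, List[dict]]], Proposal],
    execute: Callable[[str], None],
    auto_execute_threshold: float = 0.9,  # hypothetical safety threshold
) -> str:
    # Observe: pull context from every connected provider.
    evidence = {name: fetch(alert) for name, fetch in observers.items()}
    # Reason: hand the alert and evidence to the model (stubbed below).
    proposal = reason(alert, evidence)
    # Act: execute only above the threshold, otherwise just suggest.
    if proposal.confidence >= auto_execute_threshold:
        execute(proposal.action)
        return f"executed: {proposal.action}"
    return f"suggested: {proposal.action}"


# Stub provider and stub "LLM" so the sketch runs without any external service.
def fake_logs(alert: Alert) -> List[dict]:
    return [{"level": "error", "msg": "connection pool exhausted"}]


def fake_reason(alert: Alert, evidence: Dict[str, List[dict]]) -> Proposal:
    return Proposal(action=f"restart {alert.service}", confidence=0.6)


executed: List[str] = []
result = run_incident_loop(
    Alert("orders-api", "High latency"),
    {"logs": fake_logs},
    fake_reason,
    executed.append,
)
print(result)  # suggested: restart orders-api
```

With the default threshold the stubbed confidence of 0.6 only produces a suggestion; lowering the threshold below 0.6 would make the same call execute the action.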
The integration layer is where most production agent projects collapse under their own weight. OpenSRE sidesteps this with a brutally simple provider interface. Every observability tool—whether Datadog, Grafana, CloudWatch, or Sentry—gets normalized into three operations: fetch logs, query metrics, retrieve traces. Here's what a custom integration looks like:
    from typing import Dict, List, Optional

    from opensre.providers.base import ObservabilityProvider


    class CustomMetricsProvider(ObservabilityProvider):
        def __init__(self, api_key: str, base_url: str):
            self.api_key = api_key
            self.base_url = base_url

        def query_metrics(
            self,
            service: str,
            metric: str,
            start_time: int,
            end_time: int,
            tags: Optional[Dict] = None,
        ) -> List[Dict]:
            # Your vendor API call here. The _api_call and _build_filter
            # helpers are not shown in this example.
            response = self._api_call(
                "/metrics/query",
                params={
                    "metric": metric,
                    "start": start_time,
                    "end": end_time,
                    "filter": self._build_filter(service, tags),
                },
            )
            # Return the normalized schema: one dict per data point
            return [
                {
                    "timestamp": point["time"],
                    "value": point["val"],
                    "labels": point.get("tags", {}),
                }
                for point in response["data"]
            ]
This thin-wrapper approach means vendor-specific optimizations (query pushdown, materialized views, custom aggregations) get punted to the LLM's prompt engineering. It's architecturally honest but shifts complexity: the agent needs to know, for example, that CloudWatch's standard-resolution metrics have 1-minute granularity while Datadog can return finer-grained data, and that knowledge lives entirely in tool descriptions and few-shot examples.
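To make "knowledge lives in tool descriptions" concrete, here is a hypothetical sketch of a tool definition whose description text carries the vendor constraint. The `VENDOR_NOTES` table, the `build_tool_description` function, the tool naming scheme, and the Unix-seconds convention are assumptions for illustration; the layout follows the generic JSON-Schema-style function-calling format, not a documented OpenSRE schema.

```python
# Hypothetical: vendor constraints are written into the text the LLM reads,
# not enforced by the adapter code.
VENDOR_NOTES = {
    "cloudwatch": (
        "Standard-resolution metrics have 1-minute granularity; "
        "do not request shorter periods."
    ),
    "datadog": "Finer-grained data may be available; check the metric's interval.",
}


def build_tool_description(vendor: str) -> dict:
    """Build a function-calling tool spec with vendor notes in the description."""
    note = VENDOR_NOTES.get(vendor, "")
    return {
        "name": f"{vendor}_query_metrics",
        "description": f"Query time-series metrics for a service. {note}".strip(),
        "parameters": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "metric": {"type": "string"},
                "start_time": {"type": "integer", "description": "Unix seconds"},
                "end_time": {"type": "integer", "description": "Unix seconds"},
            },
            "required": ["service", "metric", "start_time", "end_time"],
        },
    }


tool = build_tool_description("cloudwatch")
print(tool["description"])
```

The trade-off is visible here: nothing stops the model from ignoring the note, so a bad query fails at the vendor API rather than in the adapter.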
The real innovation is the dual testing strategy. Most agent projects have unit tests and maybe some integration tests. OpenSRE maintains two parallel validation tracks: synthetic RCA scenarios and end-to-end fault injection. The synthetic suite defines incident scenarios with labeled ground truth:
    # tests/synthetic/database_connection_pool.py
    from typing import Dict, List

    # SyntheticIncident is the project's scenario base class; its import
    # and the _generate_* / _spike_* helpers are omitted in this excerpt.


    class ConnectionPoolExhaustion(SyntheticIncident):
        name = "database-connection-pool-exhaustion"
        ground_truth = "Connection pool size (50) exceeded due to leaked connections in OrderService v2.3.1"

        def generate_evidence(self) -> Dict:
            return {
                "alerts": [{"service": "orders-api", "message": "High latency"}],
                "logs": self._generate_app_logs() + self._generate_red_herrings(),
                "metrics": {
                    "db.connections.active": self._spike_to_50(),
                    "db.connections.max": [50] * 100,
                    "api.latency.p99": self._gradual_increase(),
                },
                "traces": self._generate_stuck_queries(),
            }

        def _generate_red_herrings(self) -> List[Dict]:
            # Inject plausible but incorrect signals
            return [
                {"level": "error", "msg": "Redis timeout", "count": 3},
                {"level": "warn", "msg": "GC pause 200ms", "count": 12},
            ]
This is the piece that doesn't exist elsewhere. The agent runs against these scenarios, and OpenSRE is designed to score whether it identified the labeled root cause within a configurable evidence budget (number of tool calls). The adversarial red herrings, such as the Redis timeouts and GC pauses above, test whether the agent can filter noise rather than just pattern-match common failure modes.
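Budgeted scoring of this kind could look roughly like the sketch below. It is an assumption-heavy illustration: `AgentRun`, `score_run`, and the keyword-matching rule are hypothetical, and keyword matching is a deliberately crude stand-in for whatever comparison against the labeled root cause a real scorer would use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentRun:
    diagnosis: str   # the agent's final root-cause statement
    tool_calls: int  # how many evidence queries it spent


def score_run(run: AgentRun, required_terms: List[str], budget: int) -> dict:
    """Pass only if every required term appears and the budget was respected."""
    text = run.diagnosis.lower()
    matched = all(term.lower() in text for term in required_terms)
    within_budget = run.tool_calls <= budget
    return {
        "matched": matched,
        "within_budget": within_budget,
        "passed": matched and within_budget,
    }


# Terms derived by hand from the scenario's ground_truth label.
required = ["connection pool", "leaked connections", "OrderService"]

good = AgentRun(
    "Leaked connections in OrderService v2.3.1 exhausted the connection pool",
    tool_calls=7,
)
fooled = AgentRun("Redis timeouts are causing high latency", tool_calls=4)
over_budget = AgentRun(good.diagnosis, tool_calls=25)

for run in (good, fooled, over_budget):
    print(score_run(run, required, budget=10))
```

The `fooled` run shows why the red herrings matter: it is cheap and confident, but it names the Redis noise instead of the labeled cause and fails.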
The e2e harness goes further: it actually provisions AWS resources, deploys intentionally buggy services, injects faults (network partitions, resource exhaustion, cascading retries), and measures whether the agent can diagnose real infrastructure. This is expensive to run but provides ground truth that synthetic scenarios can't: does your agent handle API rate limits? Does it recognize when Terraform state drift causes deployment failures? These are the questions you can't answer with mocked test data.
The CLI exposes a pragmatic UX decision most agent tools hide: reasoning depth as a first-class parameter. The /effort command lets you set low (fast, cheap, basic pattern matching), medium (default), high (multi-hop reasoning, more tool calls), or max (exhaustive search). This directly trades LLM API costs against investigation quality, which is honest about the economics of agent workloads. A high-severity incident might justify max effort; a noisy non-critical alert gets low.
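One way to picture the effort levels is as a lookup from level name to investigation limits. Only the four level names and the medium default come from the CLI as described above. The `EffortProfile` fields, the specific numbers, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EffortProfile:
    max_tool_calls: int      # evidence budget for the investigation
    max_reasoning_hops: int  # how many follow-up hypotheses to chase


# Hypothetical limits; only the level names come from the CLI.
EFFORT_PROFILES = {
    "low": EffortProfile(max_tool_calls=3, max_reasoning_hops=1),
    "medium": EffortProfile(max_tool_calls=10, max_reasoning_hops=2),
    "high": EffortProfile(max_tool_calls=25, max_reasoning_hops=4),
    "max": EffortProfile(max_tool_calls=100, max_reasoning_hops=8),
}


def resolve_effort(level: str = "medium") -> EffortProfile:
    """Map an effort level name to its limits, rejecting unknown names."""
    try:
        return EFFORT_PROFILES[level]
    except KeyError:
        valid = ", ".join(EFFORT_PROFILES)
        raise ValueError(f"unknown effort level {level!r}; expected one of: {valid}") from None


def worst_case_cost(level: str, cost_per_tool_call: float) -> float:
    """Upper bound on spend if the agent uses its whole tool-call budget."""
    return resolve_effort(level).max_tool_calls * cost_per_tool_call


print(resolve_effort("high"))
print(resolve_effort())  # falls back to the medium default
```

Framing the level as a hard cap makes the cost trade-off explicit: a worst-case bound can be computed before the investigation starts.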
Gotcha
The project's biggest limitation is the gap between ambition and execution. OpenSRE claims to provide 'scored root-cause accuracy' and a benchmark for AI SRE agents, but the repository shows no published metrics, validation methodology, or leaderboard. The synthetic test suite exists as Python fixtures, but there's no documented scoring logic, no reproducibility instructions, and no evidence that anyone has actually run these scenarios against multiple agent architectures to compare performance. It's pre-alpha infrastructure for building a benchmark, not a usable benchmark yet.
The 60+ integrations are marketing speak masking shallow coverage. Most are trivial API wrappers—often under 100 lines—with no production validation. Real observability APIs have quirks: Datadog's query language changes between metric types, Grafana's authentication varies by datasource, CloudWatch has arcane dimension limits. OpenSRE's thin adapters don't handle these edge cases, which means your agent will fail on queries that work fine in the vendor's UI. You're expected to debug this yourself or contribute fixes upstream.
Remediation safety is the elephant in the room. The agent can execute fixes—restart pods, scale deployments, modify configs—but there's no approval workflow, no rollback mechanism, no audit trail. The documentation hand-waves this with 'configure safety thresholds' but provides no implementation. Giving an LLM write access to production without guardrails is a liability most SRE teams won't accept, and OpenSRE doesn't solve it. The remediation mode is effectively a research feature, not production-ready automation.
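For a sense of what the missing guardrails would involve, here is a minimal sketch of an approval gate with rollback and an audit trail. Everything in it is hypothetical (`guarded_execute`, its parameters, the JSON log format); it shows the shape of the three missing pieces, not a production design and not anything OpenSRE provides.

```python
import json
import time
from typing import Callable, List, Optional


def guarded_execute(
    action: str,
    run: Callable[[], None],
    rollback: Callable[[], None],
    approver: Optional[str],
    audit_log: List[str],
) -> bool:
    """Run a remediation only with human approval; log every step."""

    def record(event: str, **extra: str) -> None:
        entry = {"ts": time.time(), "action": action, "event": event, **extra}
        audit_log.append(json.dumps(entry))

    # Approval workflow: no named approver, no execution.
    if approver is None:
        record("blocked", reason="no human approval")
        return False
    record("approved", approver=approver)

    try:
        run()
    except Exception as exc:
        # Rollback mechanism: undo on failure and record both events.
        record("failed", error=str(exc))
        rollback()
        record("rolled_back")
        return False

    record("executed")
    return True


log: List[str] = []
ok = guarded_execute(
    "restart orders-api",
    run=lambda: None,
    rollback=lambda: None,
    approver=None,
    audit_log=log,
)
print(ok, len(log))  # False 1
```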
Verdict
Use if: You're an AI engineer or researcher building incident response agents and need structured evaluation infrastructure. OpenSRE gives you reusable test scenarios, a plugin system for observability integrations, and a conceptual framework (synthetic RCA + e2e fault injection) that's genuinely novel. It's valuable as a development environment for iterating on agent architectures, especially if you're exploring reinforcement learning approaches where you need scored training data. The CLI makes it easy to wire up your existing observability stack and experiment with different LLM reasoning strategies.
Skip if: You're an SRE team looking for production-ready incident automation. The agent reasoning is too opaque to trust, remediation safety is unproven, and the integrations are too shallow for real vendor complexity. Stick with deterministic runbooks (PagerDuty Automation, Rundeck) for known failure modes and mature observability platforms (Datadog, Grafana) for unknown ones. Also skip if you need a usable benchmark today: the evaluation infrastructure exists but isn't validated or documented well enough for research publication.