GUARDIAN: Detecting When Your AI Agents Start Gaslighting Each Other

Hook

When multiple LLM agents collaborate, they don’t just share information—they amplify each other’s hallucinations. A single agent’s factual error can cascade through a network of collaborators, becoming an accepted ‘truth’ that corrupts the entire system’s output.

Context

Multi-agent LLM systems are becoming the architecture of choice for complex AI applications. Rather than relying on a single monolithic model, developers are building systems where specialized agents collaborate: one agent researches, another analyzes, a third synthesizes. This pattern mirrors human team dynamics and often produces superior results. But it also introduces a systemic risk that single-agent systems don’t face: error propagation.

When Agent A hallucinates a fact and passes it to Agent B, who incorporates it into a summary for Agent C, who uses it as a premise for recommendations to Agent D, you don’t have four independent agents anymore—you have a corruption pipeline. Traditional LLM safety measures focus on individual model outputs, but they’re blind to these cascading failures. GUARDIAN, emerging from NeurIPS 2025 research, tackles this problem by modeling multi-agent collaborations as temporal graphs, where it can detect both compromised agents and corrupted communication channels as they emerge across multi-turn dialogues.
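A deliberately minimal, framework-free sketch of that corruption pipeline (the agents and claims are made up for illustration; real agents exchange free-form text, not tagged claim sets):

```python
# Toy illustration (not GUARDIAN code): each agent merges what it receives
# into its context and forwards everything, so one hallucinated claim from
# the first agent reaches every downstream agent unchallenged.

def run_pipeline(agent_outputs):
    """agent_outputs: ordered {agent: claims it originates}. Returns the
    claims each agent ends up holding after receiving everything upstream."""
    context = set()
    held = {}
    for agent, own_claims in agent_outputs.items():
        context |= own_claims       # agent contributes its own claims...
        held[agent] = set(context)  # ...and accepts all upstream ones as given
    return held

held = run_pipeline({
    "A": {"capital(France)=Paris", "HALLUCINATION:capital(Australia)=Sydney"},
    "B": {"summary"},
    "C": {"analysis"},
    "D": {"recommendation"},
})

corrupted = [a for a, claims in held.items()
             if "HALLUCINATION:capital(Australia)=Sydney" in claims]
print(corrupted)  # A's single bad claim contaminates B, C, and D as well
```

No agent ever re-verifies upstream claims, which is exactly the failure mode GUARDIAN targets.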

Technical Insight

[Figure: system architecture (auto-generated). Agent Interactions → Graph Abstraction Module (graphs compressed via Information Bottleneck) → Temporal Graph Encoder (temporal edges, incremental training) → Decoder (reconstruction attempt) → Reconstruction Error (high error = anomaly) → Anomaly Detection → Safety Response (flag corrupted agents/edges)]

GUARDIAN’s core insight is treating agent collaboration as a dynamic graph problem rather than a sequence of independent LLM calls. Each agent becomes a node, each information exchange becomes an edge, and time becomes a critical dimension for tracking how errors propagate. This isn’t just graph theory for its own sake—it’s the only way to capture the topology of trust and information flow that determines whether a single hallucination becomes a systemic failure.

The architecture implements an unsupervised encoder-decoder framework with incremental training. The encoder processes temporal attributed graphs—snapshots of agent interactions over time, with attributes capturing the semantic content and confidence levels of exchanges. The decoder attempts to reconstruct these patterns, and reconstruction error serves as the anomaly signal. When reconstruction fails, it indicates the system has encountered interaction patterns it hasn’t seen in ‘healthy’ multi-agent collaborations.

What makes this approach powerful is the graph abstraction module based on Information Bottleneck Theory. Raw agent interactions generate massive, noisy graphs. The abstraction layer compresses these temporal structures while preserving patterns that matter for safety. It’s lossy compression with a purpose: throw away everything except the signal that indicates an agent is spreading misinformation or an edge is carrying corrupted data. Here’s how the temporal graph structure looks in practice:

import torch
from guardian.models import TemporalGraphEncoder

# Define agent interaction over time
# nodes: [agent_id, embedding_dim]
# edges: [source, target, timestamp, interaction_type]
agent_embeddings = torch.randn(5, 128)  # 5 agents
interactions = [
    (0, 1, 0.0, 'query'),      # Agent 0 queries Agent 1
    (1, 0, 0.1, 'response'),   # Agent 1 responds
    (0, 2, 0.2, 'forward'),    # Agent 0 forwards to Agent 2
    (2, 3, 0.3, 'synthesize'), # Agent 2 synthesizes for Agent 3
]

# Build temporal graph snapshot
temporal_graph = TemporalGraphEncoder.build_snapshot(
    nodes=agent_embeddings,
    edges=interactions,
    window_size=0.5  # Consider interactions within 0.5 time units
)

# Encode and detect anomalies
encoder = TemporalGraphEncoder(input_dim=128, hidden_dim=64)
latent_repr = encoder(temporal_graph)
reconstruction = encoder.decode(latent_repr)
anomaly_score = torch.norm(temporal_graph - reconstruction)

threshold = 1.0  # calibration value: tune on traces of known-healthy collaborations
if anomaly_score > threshold:
    # Identify which nodes/edges contributed most to anomaly
    suspect_agents = encoder.localize_anomaly(temporal_graph, reconstruction)
    print(f"Potentially compromised: {suspect_agents}")
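The abstraction stage that feeds this encoder can be sketched in a few lines. This is an illustrative reduction, not GUARDIAN's implementation: the edge set and relevance scores below are hypothetical, and a real Information Bottleneck objective optimizes a trade-off between compression and preserved task-relevant information rather than a simple top-k cut.

```python
# Illustrative sketch of lossy graph compression in the spirit of the
# Information Bottleneck: keep only the edges most relevant to the safety
# signal and discard the rest.

def abstract_graph(edges, relevance, keep_ratio=0.5):
    """Keep the top fraction of edges ranked by a relevance score."""
    ranked = sorted(edges, key=lambda e: relevance[e], reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]

edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
# Hypothetical relevance scores, e.g. estimated relevance to the anomaly signal
relevance = {("A", "B"): 0.9, ("B", "C"): 0.2, ("A", "C"): 0.7, ("C", "D"): 0.1}

compressed = abstract_graph(edges, relevance)
print(compressed)  # [('A', 'B'), ('A', 'C')]
```

The design point survives the simplification: throw away everything except what helps separate corrupted interactions from healthy ones.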

The framework supports both temporal (GUARDIAN) and static (GUARDIAN.s) variants. The temporal version maintains history across dialogue turns, tracking how patterns evolve. If Agent 3 suddenly starts generating responses with different statistical properties after interacting with Agent 1, the temporal model catches this behavioral shift. The static variant analyzes single snapshots—useful when you need faster inference or when historical context isn’t available.
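The behavioral-shift detection that the temporal variant enables can be approximated with a toy statistic. Everything here is a simplification: GUARDIAN works on learned graph representations, not a hand-picked confidence feature, but the idea of comparing the latest behavior against a rolling window of history is the same.

```python
import statistics

# Hypothetical sketch: track one scalar feature of an agent's outputs
# (say, mean response confidence) and score the latest value against a
# rolling window. A static check would only ever see the latest snapshot.

def temporal_shift_score(history, latest, window=5):
    """z-score of the latest value against the rolling window."""
    recent = history[-window:]
    mu = statistics.mean(recent)
    sigma = statistics.stdev(recent) or 1e-6
    return abs(latest - mu) / sigma

# Agent 3's output statistics before and after interacting with Agent 1
history = [0.82, 0.80, 0.83, 0.81, 0.82]  # stable behavior across turns
latest = 0.45                             # sudden behavioral shift

score = temporal_shift_score(history, latest)
print(f"shift score: {score:.1f}")  # a large z-score flags the change
```

A static snapshot of `latest` alone looks unremarkable; only the history makes the shift visible, which is the argument for the temporal variant when you can afford it.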

Crucially, GUARDIAN operates unsupervised. You don’t need labeled examples of ‘corrupted’ vs. ‘healthy’ agent interactions, which is essential because attack patterns in multi-agent systems are largely unknown. The model learns the manifold of normal collaboration during training, then flags deviations during inference. This is fundamentally different from supervised safety classifiers that can only catch threats they’ve been explicitly trained to recognize.
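The unsupervised "learn normal, flag deviations" principle can be shown with a drastically simplified stand-in for the encoder-decoder: here "normal" is the centroid of unlabeled healthy feature vectors, and distance from it plays the role of reconstruction error. The vectors and threshold rule are invented for illustration.

```python
import math

# Simplified stand-in for unsupervised anomaly detection: fit a model of
# "normal" from unlabeled healthy traces, then flag anything too far away.
# No corrupted examples are needed at training time.

def fit_normal(samples):
    """Centroid of the healthy feature vectors (our 'model of normal')."""
    dim = len(samples[0])
    return [sum(s[i] for s in samples) / len(samples) for i in range(dim)]

def anomaly_score(centroid, x):
    """Distance from normal: a crude analogue of reconstruction error."""
    return math.dist(centroid, x)

healthy = [[1.0, 0.9], [1.1, 1.0], [0.9, 1.1], [1.0, 1.0]]
centroid = fit_normal(healthy)
# Calibrate the threshold from the healthy data itself
threshold = max(anomaly_score(centroid, s) for s in healthy) * 1.5

normal_point = [1.05, 0.95]
corrupted_point = [3.0, -1.0]
print(anomaly_score(centroid, normal_point) > threshold)     # False
print(anomaly_score(centroid, corrupted_point) > threshold)  # True
```

The key property carries over: the detector never saw a labeled attack, yet anything off the learned manifold of normal behavior scores high.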

The implementation uses incremental training, updating the model as it observes new interaction patterns. This addresses concept drift: as your agents’ collaboration strategies evolve (through prompt engineering, model updates, or task changes), GUARDIAN adapts rather than throwing false positives on legitimate behavioral changes. The challenge is distinguishing genuine evolution from emerging corruption, which the temporal modeling helps resolve by tracking the rate and topology of change.
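Incremental adaptation can be sketched with an exponential moving average (my simplification, not GUARDIAN's training procedure): gradual drift keeps deviations small because the baseline tracks along, while an abrupt jump spikes before the baseline can follow.

```python
# Sketch of drift-tolerant incremental updating: the baseline is updated
# a little on every observation, so slow legitimate evolution is absorbed
# while sudden corruption still stands out against the current baseline.

def update(baseline, value, alpha=0.1):
    """Exponential moving average step."""
    return (1 - alpha) * baseline + alpha * value

baseline = 1.0
# Gradual, legitimate evolution: the baseline follows along.
for v in [1.02, 1.04, 1.06, 1.08, 1.10]:
    deviation = abs(v - baseline)
    baseline = update(baseline, v)
print(f"deviation after slow drift: {deviation:.3f}")   # stays small

# Abrupt corruption: the deviation spikes before the baseline can adapt.
spike = 2.5
print(f"deviation on sudden jump: {abs(spike - baseline):.3f}")
```

Distinguishing the two regimes is easy here because the rates differ by an order of magnitude; in practice the boundary is fuzzier, which is why GUARDIAN leans on the rate and topology of change rather than any single statistic.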

Gotcha

GUARDIAN is research code, not a production safety system. The repository provides scripts to reproduce paper experiments, but you won’t find a clean API, integration examples for popular multi-agent frameworks like AutoGen or LangChain, or deployment guidance. Expect to spend significant engineering effort adapting this to your infrastructure. The code assumes you can represent your agent interactions as graph structures with temporal attributes—if your agents communicate through complex, nested tool calls or have highly asynchronous interaction patterns, the graph abstraction may not capture all relevant information.

The computational requirements are unclear from the documentation. Temporal graph encoding with incremental training isn’t free—you’re adding a monitoring layer on top of already expensive LLM inference. For large agent networks (dozens of collaborating agents), the graph operations could introduce noticeable latency. The paper likely includes performance analysis, but the repository doesn’t surface these practical constraints for developers evaluating deployment feasibility. You’ll need to benchmark against your specific setup to determine if real-time anomaly detection is viable or if you’re limited to post-hoc analysis.

Verdict

Use GUARDIAN if you’re building or researching multi-agent LLM systems where error propagation poses real risks—think autonomous decision-making pipelines, collaborative fact-checking networks, or agent swarms handling critical tasks. It’s especially valuable if you’re trying to understand the failure modes of your agent architecture, need to demonstrate safety considerations for stakeholder trust, or want to push the boundaries of LLM safety research. Skip it if you need plug-and-play monitoring for production systems, lack the engineering resources to adapt research code, or primarily work with single-agent applications where traditional output validation suffices. Also skip if you need immediate answers about scalability, latency, and operational costs—those questions require hands-on experimentation this early-stage project doesn’t yet answer.
