
Testing LLM Agents Before They Hallucinate in Production: Inside Giskard's Component-Level Evaluation


Hook

Your RAG application works perfectly in testing but hallucinates in production. The problem isn’t your generator—it’s your retriever returning irrelevant documents. But traditional end-to-end testing can’t tell you that.

Context

The LLM operations landscape has a glaring gap: while DevOps teams have decades of testing infrastructure for traditional software, teams deploying RAG agents and LLM-based systems are flying blind. You can measure end-to-end accuracy, but when your chatbot starts making up facts or leaking sensitive data, pinpointing whether the issue lives in your retriever, your prompt, or your knowledge base becomes investigative work. Giskard emerged from this reality—a Python testing framework that treats LLM applications as composite systems with testable components, not black boxes. Instead of asking “did the agent answer correctly,” it asks “which specific component failed, and how?” The project gained significant traction (5,190 GitHub stars) because it addresses the operational nightmare of debugging multi-component AI systems where a single failure mode—hallucination, bias, prompt injection—can cascade from any layer. The v3 rewrite signals the team’s recognition that modern AI systems aren’t just models; they’re agents with multiple turns, routing logic, and retrieval pipelines that demand a fundamentally different testing approach than v2’s ML model validation roots.

Technical Insight

[System architecture diagram, auto-generated: the user application exposes its LLM/agent/RAG system through a model wrapper interface with a standardized predict function. Giskard's evaluation engine drives test generators that produce adversarial inputs, sends the resulting test cases (query plus context) through that interface, collects the responses, and passes them to evaluators and scorers, which draw on the knowledge base and compile a vulnerability report.]

Giskard’s architecture centers on two core abstractions: model wrapping and automated vulnerability scanning. The framework is deliberately model-agnostic—you wrap any LLM or agent behind a prediction interface, then Giskard applies domain-specific test generators without needing access to model internals. Here’s how you wrap a LangChain RAG agent:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain.chains import RetrievalQA
import giskard as gsk

# Your existing RAG setup ("documents" is your loaded document list)
db = FAISS.from_documents(documents, OpenAIEmbeddings())
rag_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=db.as_retriever()
)

# Wrap for Giskard evaluation: a prediction function over a DataFrame
def model_predict(df):
    return [rag_chain.run(question) for question in df["question"]]

gsk_model = gsk.Model(
    model_predict,
    model_type="text_generation",
    name="Climate QA Agent",
    description="RAG agent for IPCC climate reports",
    feature_names=["question"],  # columns the prediction function reads
)

Once wrapped, the vulnerability scanner goes to work. It’s not running a static checklist—it’s generating adversarial inputs tailored to your domain. The scanner detects hallucinations by asking questions your knowledge base can’t answer, tests for prompt injection by embedding malicious instructions in user queries, and probes for bias by systematically varying demographic terms. You don’t write these test cases manually; Giskard’s test generators synthesize them based on your model’s behavior and the data it’s seen.
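The bias probe described above can be sketched in a few lines of plain Python. Everything here is illustrative, not Giskard's internals: a real scanner derives its templates and terms from your model's description and observed behavior rather than a fixed list, but the mechanism of systematic demographic variation producing paired test cases is the same:

```python
# Hypothetical probe templates; a real scanner would synthesize these
# from the model's domain description rather than hard-code them.
TEMPLATES = [
    "Should {group} applicants receive priority for climate relief funding?",
    "Summarize climate risks for {group} communities.",
]
GROUPS = ["urban", "rural", "immigrant", "elderly"]

def generate_bias_probes(templates, groups):
    """Expand each template with every group term, keeping the
    template id so answers can later be compared across groups."""
    probes = []
    for template_id, template in enumerate(templates):
        for group in groups:
            probes.append({
                "template_id": template_id,
                "group": group,
                "question": template.format(group=group),
            })
    return probes

probes = generate_bias_probes(TEMPLATES, GROUPS)
# 2 templates x 4 groups -> 8 paired probes
```

Responses sharing a `template_id` but differing in `group` should be semantically equivalent; large divergences flag potential bias.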

The real differentiation comes with RAGET (RAG Evaluation Toolkit). Traditional RAG evaluation asks: “Is the final answer correct?” RAGET decomposes the question: “Did the retriever fetch relevant documents? Did the generator use them faithfully? Did the rewriter improve the query?” It automatically generates question-answer-context triples from your knowledge base, then scores each component independently:

import pandas as pd
from giskard.rag import KnowledgeBase, generate_testset, evaluate

# RAGET expects a KnowledgeBase built from a DataFrame of text chunks
knowledge_base = KnowledgeBase.from_pandas(
    pd.DataFrame({"text": [d.page_content for d in documents]})
)

# Auto-generate evaluation dataset from your knowledge base
testset = generate_testset(
    knowledge_base,
    num_questions=100,
    agent_description="Answers questions about climate change using IPCC reports"
)

# evaluate() drives your agent through a plain answer function
def answer_fn(question, history=None):
    return rag_chain.run(question)

# Evaluate with component-level scoring
report = evaluate(
    answer_fn,
    testset=testset,
    knowledge_base=knowledge_base
)

# Results broken down by component:
# Generator score: 0.82 (hallucination issues on edge cases)
# Retriever score: 0.64 (missing relevant chunks)
# Knowledge Base coverage: 0.91

This component-level granularity means when your RAG fails, you know exactly where to optimize. A low retriever score points to chunking strategy or embedding model issues. A low generator score with high retriever score suggests prompt engineering problems or model capability limits. This is fundamentally different from frameworks that only give you an aggregate accuracy number.
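The interpretation logic in the paragraph above can be made mechanical. The thresholds and messages below are illustrative choices, not part of Giskard's API, but they show how component-level scores turn debugging from guesswork into a lookup:

```python
def triage_rag_scores(retriever, generator, kb_coverage, threshold=0.75):
    """Map component-level scores to the most likely place to optimize.
    The 0.75 threshold is illustrative; tune it per application."""
    findings = []
    if kb_coverage < threshold:
        findings.append("knowledge base: add documents for uncovered topics")
    if retriever < threshold:
        findings.append("retriever: revisit chunking strategy or embedding model")
    if generator < threshold and retriever >= threshold:
        findings.append("generator: prompt engineering or model capability limits")
    return findings or ["all components above threshold"]

# Scores from the example report above
triage_rag_scores(retriever=0.64, generator=0.82, kb_coverage=0.91)
# -> ["retriever: revisit chunking strategy or embedding model"]
```

Note the guard on the generator branch: a low generator score is only meaningful if the retriever was feeding it relevant context in the first place.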

The v3 rewrite specifically targets multi-turn agent testing. Where v2 evaluated single predictions, v3 handles conversational context and state. The architecture strips out heavy dependencies, making it lighter and easier to integrate into CI/CD pipelines. The vulnerability scanner appears designed to handle agentic behaviors, reflecting the shift from “test this model” to “test this autonomous system.”
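Multi-turn testing means test cases carry conversational state rather than single predictions. A minimal harness for replaying scripted conversations against an agent might look like the sketch below; the `agent_fn(message, history)` signature is an assumption for illustration, not Giskard v3's interface:

```python
def run_conversation(agent_fn, turns):
    """Replay a scripted multi-turn conversation, accumulating history
    so each turn sees prior context, and collect per-turn transcripts."""
    history, transcript = [], []
    for user_msg in turns:
        reply = agent_fn(user_msg, history)
        history.append((user_msg, reply))
        transcript.append({"user": user_msg, "agent": reply})
    return transcript

# Toy agent standing in for a real conversational system
def echo_agent(message, history):
    return f"turn {len(history) + 1}: {message}"

transcript = run_conversation(echo_agent, ["hi", "what changed?"])
# transcript[1]["agent"] == "turn 2: what changed?"
```

Evaluators can then score the whole transcript, catching failures that only appear after context accumulates, such as an agent contradicting its own earlier answer.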

Gotcha

The v2-to-v3 transition creates real uncertainty for production adoption. The README explicitly states v2 is no longer actively maintained, while v3 remains under active development per the roadmap. Documentation is in flux, and breaking changes are expected—risky if you’re building long-term evaluation infrastructure. The wrapping requirement, while enabling framework-agnosticism, adds friction. You can’t just point Giskard at a deployed API endpoint; you need to write prediction functions that match its interface, which means integration overhead for every model you test. The README lists official support for Python 3.9, 3.10, and 3.11 only, which may block teams standardized on other Python versions and completely locks out non-Python teams. The automated test generation is powerful but opaque—when Giskard flags a vulnerability, understanding why it generated that specific adversarial input requires digging into test generator internals. There’s an inherent trust-the-framework element that may not satisfy teams requiring full audit trails. RAGET’s component scoring relies on automatic question generation from knowledge bases, which works well for factual documents but may face challenges with creative or subjective content where “correct” answers are ambiguous.

Verdict

Use Giskard if you’re deploying RAG applications to production and need to systematically identify which pipeline component is failing—retriever precision issues versus generator hallucinations versus knowledge base gaps. It’s particularly valuable when you’re working across multiple LLM frameworks (LangChain, raw OpenAI, Hugging Face) and want consistent evaluation without rewriting tests for each. The automated vulnerability scanning makes sense for regulated industries (healthcare, finance) where you must demonstrate testing for bias, security, and robustness before deployment. Skip it if you’re in rapid prototyping mode where wrapping overhead slows iteration, if you need stable APIs and can’t tolerate v3’s active development status, if your team works outside Python, or if you prefer manual test case creation with full control over evaluation logic. Also skip if you’re testing traditional ML models—Giskard technically supports tabular models, but the tooling and community momentum are clearly LLM-focused now.
