
Testing LLM Agents Before They Hallucinate in Production: Inside Giskard's Component-Level Evaluation


Hook

Your RAG application works perfectly in testing but hallucinates in production. The problem isn’t your generator—it’s your retriever returning irrelevant documents. But traditional end-to-end testing can’t tell you that.

Context

The LLM operations landscape has a glaring gap: while DevOps teams have decades of testing infrastructure for traditional software, teams deploying RAG agents and LLM-based systems are flying blind. You can measure end-to-end accuracy, but when your chatbot starts making up facts or leaking sensitive data, pinpointing whether the issue lives in your retriever, your prompt, or your knowledge base becomes investigative work. Giskard emerged from this reality—a Python testing framework that treats LLM applications as composite systems with testable components, not black boxes. Instead of asking “did the agent answer correctly,” it asks “which specific component failed, and how?” The project gained significant traction (5,190 GitHub stars) because it addresses the operational nightmare of debugging multi-component AI systems where a single failure mode—hallucination, bias, prompt injection—can cascade from any layer. The v3 rewrite signals the team’s recognition that modern AI systems aren’t just models; they’re agents with multiple turns, routing logic, and retrieval pipelines that demand a fundamentally different testing approach than v2’s ML model validation roots.

Technical Insight

[System architecture diagram, auto-generated: the user application exposes its LLM/agent/RAG system through a model wrapper interface with a standardized predict function. Giskard's evaluation engine drives test generators that produce adversarial inputs, sends the resulting test cases (query plus context) through that interface, collects the responses, and passes them to evaluators and scorers, which draw on the knowledge base and compile a vulnerability report.]

Giskard’s architecture centers on two core abstractions: model wrapping and automated vulnerability scanning. The framework is deliberately model-agnostic—you wrap any LLM or agent behind a prediction interface, then Giskard applies domain-specific test generators without needing access to model internals. Here’s how you wrap a LangChain RAG agent:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain.chains import RetrievalQA
import giskard as gsk

# Your existing RAG setup ("documents" is your loaded document list)
db = FAISS.from_documents(documents, OpenAIEmbeddings())
rag_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=db.as_retriever()
)

# Wrap for Giskard evaluation: a prediction function over a DataFrame
def model_predict(df):
    return [rag_chain.run(question) for question in df["question"]]

gsk_model = gsk.Model(
    model_predict,
    model_type="text_generation",
    name="Climate QA Agent",
    description="RAG agent for IPCC climate reports",
    feature_names=["question"],  # columns the prediction function reads
)

Once wrapped, the vulnerability scanner goes to work. It’s not running a static checklist—it’s generating adversarial inputs tailored to your domain. The scanner detects hallucinations by asking questions your knowledge base can’t answer, tests for prompt injection by embedding malicious instructions in user queries, and probes for bias by systematically varying demographic terms. You don’t write these test cases manually; Giskard’s test generators synthesize them based on your model’s behavior and the data it’s seen.
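The bias probe described above can be sketched in a few lines of plain Python. Everything here is illustrative, not Giskard's internals: a real scanner derives its templates and terms from your model's description and observed behavior rather than a fixed list, but the mechanism of systematic demographic variation producing paired test cases is the same:

```python
# Hypothetical probe templates; a real scanner would synthesize these
# from the model's domain description rather than hard-code them.
TEMPLATES = [
    "Should {group} applicants receive priority for climate relief funding?",
    "Summarize climate risks for {group} communities.",
]
GROUPS = ["urban", "rural", "immigrant", "elderly"]

def generate_bias_probes(templates, groups):
    """Expand each template with every group term, keeping the
    template id so answers can later be compared across groups."""
    probes = []
    for template_id, template in enumerate(templates):
        for group in groups:
            probes.append({
                "template_id": template_id,
                "group": group,
                "question": template.format(group=group),
            })
    return probes

probes = generate_bias_probes(TEMPLATES, GROUPS)
# 2 templates x 4 groups -> 8 paired probes
```

Responses sharing a `template_id` but differing in `group` should be semantically equivalent; large divergences flag potential bias.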

The real differentiation comes with RAGET (RAG Evaluation Toolkit). Traditional RAG evaluation asks: “Is the final answer correct?” RAGET decomposes the question: “Did the retriever fetch relevant documents? Did the generator use them faithfully? Did the rewriter improve the query?” It automatically generates question-answer-context triples from your knowledge base, then scores each component independently:

import pandas as pd
from giskard.rag import KnowledgeBase, generate_testset, evaluate

# RAGET expects a KnowledgeBase built from a DataFrame of text chunks
knowledge_base = KnowledgeBase.from_pandas(
    pd.DataFrame({"text": [d.page_content for d in documents]})
)

# Auto-generate evaluation dataset from your knowledge base
testset = generate_testset(
    knowledge_base,
    num_questions=100,
    agent_description="Answers questions about climate change using IPCC reports"
)

# evaluate() drives your agent through a plain answer function
def answer_fn(question, history=None):
    return rag_chain.run(question)

# Evaluate with component-level scoring
report = evaluate(
    answer_fn,
    testset=testset,
    knowledge_base=knowledge_base
)

# Results broken down by component:
# Generator score: 0.82 (hallucination issues on edge cases)
# Retriever score: 0.64 (missing relevant chunks)
# Knowledge Base coverage: 0.91

This component-level granularity means when your RAG fails, you know exactly where to optimize. A low retriever score points to chunking strategy or embedding model issues. A low generator score with high retriever score suggests prompt engineering problems or model capability limits. This is fundamentally different from frameworks that only give you an aggregate accuracy number.
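The interpretation logic in the paragraph above can be made mechanical. The thresholds and messages below are illustrative choices, not part of Giskard's API, but they show how component-level scores turn debugging from guesswork into a lookup:

```python
def triage_rag_scores(retriever, generator, kb_coverage, threshold=0.75):
    """Map component-level scores to the most likely place to optimize.
    The 0.75 threshold is illustrative; tune it per application."""
    findings = []
    if kb_coverage < threshold:
        findings.append("knowledge base: add documents for uncovered topics")
    if retriever < threshold:
        findings.append("retriever: revisit chunking strategy or embedding model")
    if generator < threshold and retriever >= threshold:
        findings.append("generator: prompt engineering or model capability limits")
    return findings or ["all components above threshold"]

# Scores from the example report above
triage_rag_scores(retriever=0.64, generator=0.82, kb_coverage=0.91)
# -> ["retriever: revisit chunking strategy or embedding model"]
```

Note the guard on the generator branch: a low generator score is only meaningful if the retriever was feeding it relevant context in the first place.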

The v3 rewrite specifically targets multi-turn agent testing. Where v2 evaluated single predictions, v3 handles conversational context and state. The architecture strips out heavy dependencies, making it lighter and easier to integrate into CI/CD pipelines. The vulnerability scanner appears designed to handle agentic behaviors, reflecting the shift from “test this model” to “test this autonomous system.”
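Multi-turn testing means test cases carry conversational state rather than single predictions. A minimal harness for replaying scripted conversations against an agent might look like the sketch below; the `agent_fn(message, history)` signature is an assumption for illustration, not Giskard v3's interface:

```python
def run_conversation(agent_fn, turns):
    """Replay a scripted multi-turn conversation, accumulating history
    so each turn sees prior context, and collect per-turn transcripts."""
    history, transcript = [], []
    for user_msg in turns:
        reply = agent_fn(user_msg, history)
        history.append((user_msg, reply))
        transcript.append({"user": user_msg, "agent": reply})
    return transcript

# Toy agent standing in for a real conversational system
def echo_agent(message, history):
    return f"turn {len(history) + 1}: {message}"

transcript = run_conversation(echo_agent, ["hi", "what changed?"])
# transcript[1]["agent"] == "turn 2: what changed?"
```

Evaluators can then score the whole transcript, catching failures that only appear after context accumulates, such as an agent contradicting its own earlier answer.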

Gotcha

The v2-to-v3 transition creates real uncertainty for production adoption. The README explicitly states v2 is no longer actively maintained, while v3 remains under active development per the roadmap. Documentation is in flux, and breaking changes are expected—risky if you’re building long-term evaluation infrastructure. The wrapping requirement, while enabling framework-agnosticism, adds friction. You can’t just point Giskard at a deployed API endpoint; you need to write prediction functions that match its interface, which means integration overhead for every model you test. The README lists official support for Python 3.9, 3.10, and 3.11 only, which may block teams standardized on other Python versions and completely locks out non-Python teams. The automated test generation is powerful but opaque—when Giskard flags a vulnerability, understanding why it generated that specific adversarial input requires digging into test generator internals. There’s an inherent trust-the-framework element that may not satisfy teams requiring full audit trails. RAGET’s component scoring relies on automatic question generation from knowledge bases, which works well for factual documents but may face challenges with creative or subjective content where “correct” answers are ambiguous.

Verdict

Use Giskard if you’re deploying RAG applications to production and need to systematically identify which pipeline component is failing—retriever precision issues versus generator hallucinations versus knowledge base gaps. It’s particularly valuable when you’re working across multiple LLM frameworks (LangChain, raw OpenAI, Hugging Face) and want consistent evaluation without rewriting tests for each. The automated vulnerability scanning makes sense for regulated industries (healthcare, finance) where you must demonstrate testing for bias, security, and robustness before deployment. Skip it if you’re in rapid prototyping mode where wrapping overhead slows iteration, if you need stable APIs and can’t tolerate v3’s active development status, if your team works outside Python, or if you prefer manual test case creation with full control over evaluation logic. Also skip if you’re testing traditional ML models—Giskard technically supports tabular models, but the tooling and community momentum are clearly LLM-focused now.
