TruLens: The RAG Triad and Real-Time Evaluation for LLM Applications

Hook

Most teams discover their RAG system is hallucinating only after users complain. TruLens scores every query as it runs, so you can catch failures before deployment, but at a cost you need to understand.

Context

The LLM development cycle is broken. You modify a prompt, adjust retrieval parameters, or switch embedding models, then manually test a handful of queries to see if things improved. Did answer quality actually get better? Is the new retriever finding more relevant context? You're flying blind, relying on spot checks rather than systematic measurement.

This evaluation gap exists because traditional ML observability tools weren't designed for the unique challenges of LLM applications. Unlike supervised learning, where you have labeled test sets, LLM outputs are open-ended and context-dependent. Retrieval-augmented generation (RAG) systems add another layer of complexity: you need to evaluate not just the final answer, but whether the retriever found relevant chunks, whether the LLM stayed grounded in that context, and whether the response actually addresses the question. TruLens emerged to solve this by introducing a framework-agnostic instrumentation layer that captures execution traces and runs configurable evaluation metrics, which the project calls 'feedback functions', in real time as your application runs.

Technical Insight

TruLens uses a middleware-style pattern: it wraps your LLM application in a recorder instead of asking you to rewrite code for a specific framework. There are dedicated recorders for LangChain (TruChain) and LlamaIndex (TruLlama), plus more generic wrappers for custom Python apps such as raw OpenAI calls. Once the app is wrapped, TruLens intercepts calls at each step: prompt construction, retrieval, LLM invocation, and output processing.
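To make the wrapping idea concrete, here is a minimal sketch of call interception in plain Python. It is not how TruLens is implemented; the `traced` decorator, the `TRACE` list, and the stand-in `retrieve` and `generate` functions are all hypothetical names chosen for illustration.

```python
import functools
import time

TRACE = []  # hypothetical in-memory trace store, not TruLens's storage


def traced(step_name):
    """Record the inputs, output, and latency of the wrapped step."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({
                "step": step_name,
                "args": args,
                "kwargs": kwargs,
                "output": result,
                "seconds": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator


@traced("retrieval")
def retrieve(query):
    # Stand-in retriever returning canned chunks.
    return ["Heat exhaustion causes heavy sweating.", "Drink water."]


@traced("generation")
def generate(query, chunks):
    # Stand-in for an LLM call.
    return f"Based on {len(chunks)} chunks: heavy sweating."


question = "What are the symptoms of heat exhaustion?"
answer = generate(question, retrieve(question))
```

After the call, `TRACE` holds one entry per step in execution order, which is the kind of record an evaluator can later score without the application code knowing about it.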

Here's what instrumenting a basic RAG application looks like:

# Written against the `trulens_eval` package; newer TruLens releases use different import paths.
from trulens_eval import TruChain, Feedback, Tru
from trulens_eval.feedback.provider import OpenAI
import numpy as np

# `chain` is assumed to be an existing LangChain RAG chain that includes a retriever.

# Initialize TruLens
tru = Tru()

# Define feedback functions for the RAG Triad
provider = OpenAI()

# Selector for the retrieved chunks; select_context needs the app to locate its retriever
context = TruChain.select_context(chain)

# Context Relevance: Are retrieved chunks relevant to the query?
f_context_relevance = (
    Feedback(provider.context_relevance)
    .on_input()
    .on(context)
    .aggregate(np.mean)  # one score per chunk, averaged into a single value
)

# Groundedness: Is the answer supported by the retrieved context?
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context.collect())  # pass all chunks together as one list
    .on_output()
)

# Answer Relevance: Does the answer address the original question?
f_answer_relevance = (
    Feedback(provider.relevance)
    .on_input()
    .on_output()
)

# Wrap your LangChain application
tru_recorder = TruChain(
    chain,
    app_id='my_rag_v1',
    feedbacks=[f_context_relevance, f_groundedness, f_answer_relevance]
)

# Run as normal - instrumentation happens automatically
with tru_recorder as recording:
    response = chain.invoke("What are the symptoms of heat exhaustion?")

The RAG Triad—context relevance, groundedness, and answer relevance—is TruLens's core evaluation framework. Context relevance measures whether your retriever is pulling chunks that actually relate to the user's query. Groundedness checks if the LLM's response is supported by the retrieved context or if it's hallucinating information. Answer relevance evaluates whether the final output addresses what was asked. This three-dimensional view catches different failure modes: a retriever finding irrelevant documents, an LLM inventing facts not in the context, or responses that ignore the question.
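One way to read the three scores together is to map each low score to the failure mode it points at. The sketch below assumes scores normalized to the 0.0 to 1.0 range; the `diagnose` function and the 0.5 threshold are illustrative choices, not part of TruLens.

```python
def diagnose(context_relevance, groundedness, answer_relevance, threshold=0.5):
    """Map low RAG Triad scores (0.0-1.0) to the likely failure mode."""
    problems = []
    if context_relevance < threshold:
        problems.append("retriever returned irrelevant context")
    if groundedness < threshold:
        problems.append("answer not supported by retrieved context")
    if answer_relevance < threshold:
        problems.append("answer does not address the question")
    return problems or ["no failure detected at this threshold"]


# Good retrieval and a relevant answer, but the answer is not grounded.
print(diagnose(0.9, 0.2, 0.8))
```

A response can score well on two dimensions and still fail the third, which is why the triad is checked as a set rather than collapsed into one number.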

The feedback functions themselves are typically LLM-based evaluators. TruLens sends the relevant inputs (query, context chunks, response) to an LLM with a carefully crafted evaluation prompt that asks it to score the output on a specific dimension. This is the LLM-as-a-judge approach: using LLMs to judge LLM outputs. The provider abstraction lets you swap between OpenAI, Anthropic, or local models as evaluators, and you can customize prompts for domain-specific criteria.
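The mechanics of an LLM-based evaluator can be sketched in a few lines. This is a simplified illustration, not TruLens's actual prompt or parsing logic: `JUDGE_PROMPT`, `judge_relevance`, and the `fake_llm` stand-in are hypothetical, and a real evaluator would call a model where `fake_llm` returns a canned reply.

```python
import re

JUDGE_PROMPT = (
    "Rate from 0 to 10 how well the RESPONSE answers the QUESTION.\n"
    "Reply in the form 'Score: <number>'.\n\n"
    "QUESTION: {question}\nRESPONSE: {response}"
)


def judge_relevance(question, response, call_llm):
    """Ask an evaluator LLM for a 0-10 rating and normalise it to 0.0-1.0.

    `call_llm` is any callable that takes a prompt string and returns text.
    """
    reply = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from: {reply!r}")
    raw = float(match.group(1))
    return min(max(raw, 0.0), 10.0) / 10.0  # clamp, then scale to 0.0-1.0


def fake_llm(prompt):
    # Stand-in for a real model call so the sketch runs offline.
    return "Score: 8"


score = judge_relevance("What is RAG?", "Retrieval-augmented generation.", fake_llm)
```

The parsing step matters in practice: judge models do not always follow the requested output format, so an evaluator needs an explicit failure path rather than a silent default score.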

What makes this powerful for experimentation is the comparison interface. TruLens stores all execution traces and feedback scores in a local SQLite database (or configurable backend). The built-in dashboard lets you compare metrics across different app versions side-by-side:

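# `feedbacks` is assumed to be a list of feedback functions like the three defined above;
# `chain_v1`, `chain_v2`, and `test_queries` are assumed to exist. If the two chains are
# structured differently, the context selector may need to be rebuilt for each chain.

# Record two different RAG configurations
with TruChain(chain_v1, app_id='rag_v1', feedbacks=feedbacks) as rec:
    for query in test_queries:
        chain_v1.invoke(query)

with TruChain(chain_v2, app_id='rag_v2', feedbacks=feedbacks) as rec:
    for query in test_queries:
        chain_v2.invoke(query)

# Launch dashboard to compare
tru.run_dashboard()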

The dashboard shows aggregate metrics, per-query breakdowns, and lets you drill into individual traces to see exactly where failures occurred. Did groundedness drop because the retriever changed? Is answer relevance higher but at the cost of longer latency? You get quantitative answers instead of guessing.
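The comparison the dashboard performs boils down to aggregating per-query scores by app version. A rough sketch of that arithmetic follows, using made-up scores; the `scores` dictionary and `compare` function are hypothetical and do not reflect how TruLens stores or exposes its records.

```python
from statistics import mean

# Hypothetical per-query scores, as if exported from two recorded versions.
scores = {
    "rag_v1": {"groundedness": [0.9, 0.8, 0.7], "answer_relevance": [0.6, 0.7, 0.8]},
    "rag_v2": {"groundedness": [0.6, 0.7, 0.5], "answer_relevance": [0.9, 0.8, 1.0]},
}


def compare(scores, baseline, candidate):
    """Return the change in mean score per metric (candidate minus baseline)."""
    return {
        metric: round(
            mean(scores[candidate][metric]) - mean(scores[baseline][metric]), 3
        )
        for metric in scores[baseline]
    }


deltas = compare(scores, "rag_v1", "rag_v2")
print(deltas)
```

In this invented data the second version answers more relevantly but is less grounded, the kind of trade-off that a single aggregate quality number would hide.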

Beyond the RAG Triad, TruLens supports custom feedback functions. You can write Python functions that take execution data and return scores, enabling domain-specific evaluation. For a medical chatbot, you might add a feedback function that checks if responses include appropriate medical disclaimers. For a code generation tool, you could run unit tests as feedback. The framework is extensible while providing battle-tested defaults for common use cases.
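As an illustration of the medical disclaimer idea, a deterministic custom check might look like the sketch below. The phrase list and the `has_medical_disclaimer` name are assumptions made for this example; a real deployment would need a list reviewed by domain experts.

```python
# Hypothetical phrases; a production list would be reviewed by domain experts.
DISCLAIMER_PHRASES = (
    "not medical advice",
    "consult a doctor",
    "consult a healthcare professional",
    "seek medical attention",
)


def has_medical_disclaimer(response: str) -> float:
    """Return 1.0 if the response contains a disclaimer phrase, else 0.0."""
    text = response.lower()
    return 1.0 if any(phrase in text for phrase in DISCLAIMER_PHRASES) else 0.0


print(has_medical_disclaimer("Rest and hydrate. This is not medical advice."))
```

A function like this would most likely be attached to the app's output through the same `Feedback(...).on_output()` pattern used for the built-in evaluators, though the exact wiring is worth confirming against the documentation for your TruLens version. Because it makes no LLM call, it adds essentially no cost per query.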

Gotcha

The elephant in the room is cost and latency. Every LLM-based feedback function makes at least one LLM API call, so a single user query triggers at least four LLM requests: the original application call plus three RAG Triad evaluations, and more if context relevance is scored once per retrieved chunk. For development with small query volumes, this is manageable. In production at scale, you're potentially multiplying your LLM request volume several times over and, if evaluations run inline, adding evaluation latency to the critical path.
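The request count can be estimated with simple arithmetic. The sketch below assumes one judge call per retrieved chunk for context relevance and a single call each for groundedness and answer relevance; actual counts depend on the evaluator implementation and your TruLens version, and `llm_calls_per_query` is a hypothetical helper.

```python
def llm_calls_per_query(num_chunks: int, app_calls: int = 1) -> int:
    """Estimate LLM requests for one user query with the RAG Triad enabled.

    Assumes one judge call per retrieved chunk for context relevance, and one
    call each for groundedness and answer relevance.
    """
    context_relevance_calls = num_chunks
    groundedness_calls = 1
    answer_relevance_calls = 1
    return (
        app_calls
        + context_relevance_calls
        + groundedness_calls
        + answer_relevance_calls
    )


for k in (1, 4, 8):
    print(k, "chunks ->", llm_calls_per_query(k), "LLM requests")
```

Under these assumptions, retrieving more chunks raises evaluation cost linearly, so the retriever's top-k setting is also an evaluation budget setting.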

TruLens offers sampling strategies to mitigate this—evaluate 10% of requests, or only evaluate during specific time windows—but then you're back to incomplete visibility. The documentation suggests running feedback functions asynchronously to avoid blocking user responses, but this requires infrastructure setup and doesn't help with cost. You need to carefully consider whether to run TruLens in production at all, or limit it to staging environments and development iterations.
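The two mitigations, sampling and running evaluation off the critical path, can be sketched at the application level as follows. This is not a TruLens API: `should_evaluate`, `handle_request`, and the callables passed in are hypothetical, and the thread pool stands in for whatever background worker infrastructure you actually run.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def should_evaluate(request_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministically select roughly `sample_rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


executor = ThreadPoolExecutor(max_workers=2)


def handle_request(request_id, query, answer_fn, evaluate_fn, sample_rate=0.10):
    """Answer immediately; schedule evaluation in the background if sampled."""
    answer = answer_fn(query)
    future = None
    if should_evaluate(request_id, sample_rate):
        # The user response does not wait for this call to finish.
        future = executor.submit(evaluate_fn, query, answer)
    return answer, future
```

Hashing the request ID instead of calling a random number generator keeps the decision reproducible, so the same request is always either in or out of the evaluated sample.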

The second limitation is that LLM-based evaluators inherit all the problems of LLMs themselves: inconsistency, bias, and the inability to catch certain types of errors. A groundedness evaluator might miss subtle hallucinations or score inconsistently across runs. The meta-evaluation problem runs deep—who evaluates the evaluators? TruLens provides some deterministic feedback functions and lets you bring your own, but the most convenient options rely on LLM judgment, which isn't ground truth. For critical applications, you'll still need human evaluation as the ultimate arbiter, treating TruLens scores as signals rather than definitive verdicts.
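Two quick checks help quantify how much to trust a judge: how much its scores vary across repeated runs on the same item, and how often it agrees with a small human-labeled set. The sketch below uses made-up numbers; `judge_stability` and `agreement_with_humans` are hypothetical helpers, not TruLens functions.

```python
from statistics import mean, pstdev


def judge_stability(scores_per_item):
    """Mean per-item standard deviation of repeated judge scores (0 = stable)."""
    return mean(pstdev(runs) for runs in scores_per_item)


def agreement_with_humans(judge_scores, human_labels, threshold=0.5):
    """Fraction of items where the thresholded judge score matches the human label."""
    matches = sum(
        (score >= threshold) == bool(label)
        for score, label in zip(judge_scores, human_labels)
    )
    return matches / len(human_labels)


# Three repeated judge runs on each of two responses (invented numbers).
repeated = [[0.8, 0.6, 0.7], [0.2, 0.2, 0.3]]
print("instability:", round(judge_stability(repeated), 3))

# Judge scores versus human pass/fail labels on four responses.
print("agreement:", agreement_with_humans([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 0]))
```

Even a few dozen human-labeled examples are enough to show whether a judge's scores track your own notion of quality closely enough to act on.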

Verdict

Use if: You're actively iterating on an LLM application (especially RAG systems) and need systematic comparison across prompt variations, model changes, or retrieval configurations. The RAG Triad framework and comparison dashboard excel during development sprints where understanding why performance changed matters more than API costs. Also ideal for teams building evaluation culture from scratch: the pre-built feedback functions provide immediate value without requiring evaluation expertise.

Skip if: You're running high-volume production workloads where evaluation overhead would significantly impact costs or latency. Also skip if you already have robust offline evaluation pipelines with human-labeled test sets; TruLens's LLM-based evaluators won't be more accurate than your labeled data. Finally, skip for simple applications where manual testing suffices. The instrumentation complexity isn't worth it if you're just logging prompts and responses.

TruLens shines brightest in the messy middle of LLM development where you're experimenting rapidly and need quantitative feedback loops.
