LangWatch: OpenTelemetry-Native LLM Observability Without the Vendor Lock-In

Hook

Most LLM observability platforms trap you with proprietary SDKs and data formats. LangWatch bets everything on OpenTelemetry, making it one of the few platforms where your instrumentation code outlives the vendor.

Context

The explosion of LLM applications has created a new operational nightmare. Unlike traditional software where you can reliably trace execution paths, LLM apps are probabilistic black boxes. A prompt that worked yesterday might hallucinate today. An agent that successfully completes a task in testing might loop infinitely in production. Token costs spiral unexpectedly. Users report vague ‘weird responses’ that you can’t reproduce.

The first wave of LLM observability tools solved the immediate visibility problem but created a new one: vendor lock-in through proprietary instrumentation. Switching providers meant rewriting all your tracing code. LangWatch emerged from this frustration, built around OpenTelemetry’s OTLP standard from day one. This architectural choice means your instrumentation is portable across any OTel-compatible platform, and you can self-host everything when compliance requires it. But the real innovation isn’t just standards compliance—it’s integrating the complete LLM development loop (capture traces, build datasets, run evaluations, optimize prompts, simulate agents) into a single platform that doesn’t force you to glue together five different tools with custom scripts.

Technical Insight

System architecture (auto-generated diagram): the LLM Application emits OTLP spans via the LangWatch SDK and OpenTelemetry to an OTLP Collector, which stores traces in OpenSearch and metadata in PostgreSQL. The Next.js Dashboard queries both stores and triggers the Evaluation Engine, which pulls config and datasets, writes results back, and uses Redis for job queuing.

LangWatch’s architecture centers on OpenTelemetry spans as the fundamental data unit. When your LLM application makes a call—whether through LangChain, LlamaIndex, OpenAI’s SDK, or a custom implementation—OTel instrumentation captures it as a trace with spans for each operation. These spans flow through the OTLP protocol to LangWatch’s collector, which routes them to OpenSearch for storage and querying.
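Stripped of transport details, each of those spans is just a named bundle of attributes. A sketch of what one might carry for a chat completion call (attribute names follow the OpenTelemetry GenAI semantic conventions; exactly which of these LangWatch records is an assumption here):

```python
# Illustrative shape of one OTLP span for an LLM call.
# Attribute names follow the OTel GenAI semantic conventions;
# the exact set LangWatch stores is an assumption.
llm_span = {
    "name": "chat gpt-4",
    "trace_id": "7f6a8c...",  # shared by every span in the same request
    "attributes": {
        "gen_ai.system": "openai",
        "gen_ai.request.model": "gpt-4",
        "gen_ai.usage.input_tokens": 42,
        "gen_ai.usage.output_tokens": 311,
    },
}

# Any OTel-compatible backend can aggregate the same attributes --
# this is what makes the instrumentation portable across vendors.
total_tokens = sum(
    v for k, v in llm_span["attributes"].items()
    if k.startswith("gen_ai.usage.")
)
print(total_tokens)  # 353
```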

Here’s what basic instrumentation looks like with the Python SDK:

import langwatch
from openai import OpenAI

# Initialize with your project API key
langwatch.init(api_key="lwk_...")

client = OpenAI()

# Wrap your LLM call - creates OTel spans automatically
with langwatch.trace("user_query") as trace:
    trace.set_user_id("user_123")
    trace.set_metadata({"session_id": "abc", "route": "/chat"})
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Explain quantum entanglement"}]
    )
    
    # Attach custom spans for business logic
    # (check_scientific_accuracy stands in for your own verification code)
    with trace.span("fact_check") as fact_span:
        verification = check_scientific_accuracy(response.choices[0].message.content)
        fact_span.set_attribute("accuracy_score", verification.score)

The magic happens after traces are captured. LangWatch’s dataset builder lets you filter traces by metadata, error conditions, or user feedback to create evaluation datasets. You might filter for “all traces where users clicked ‘thumbs down’” or “traces where token cost exceeded $0.50” and export them as a dataset. These datasets feed into the evaluation engine, which runs multiple evaluators (both LLM-as-judge and rule-based) against your prompts.
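Conceptually, that filtering is just predicates over trace metadata. A minimal sketch (field names like user_feedback and cost_usd are illustrative, not LangWatch's actual trace schema):

```python
# Sketch of building an evaluation dataset from production traces.
# Field names (user_feedback, cost_usd) are illustrative stand-ins,
# not LangWatch's actual schema.
traces = [
    {"id": "t1", "user_feedback": "thumbs_down", "cost_usd": 0.12},
    {"id": "t2", "user_feedback": "thumbs_up", "cost_usd": 0.10},
    {"id": "t3", "user_feedback": "thumbs_up", "cost_usd": 0.74},
]

def build_dataset(traces, predicate):
    """Filter traces into an evaluation dataset."""
    return [t for t in traces if predicate(t)]

# "thumbs down OR token cost over $0.50" -> evaluation candidates
dataset = build_dataset(
    traces,
    lambda t: t["user_feedback"] == "thumbs_down" or t["cost_usd"] > 0.50,
)
print([t["id"] for t in dataset])  # ['t1', 't3']
```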

The evaluation system supports DSPy integration for prompt optimization. You define success criteria (factual accuracy, conciseness, sentiment), and DSPy automatically iterates through prompt variations while LangWatch tracks performance metrics. This closes the loop: production traces → problematic examples → optimization → validation → deployment.
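The shape of that loop, independent of DSPy's own API, is easy to sketch. Here the scorer and candidate prompts are toy stand-ins; in practice DSPy generates the variants and your configured evaluators produce the scores:

```python
# Stand-in for the optimization step: score prompt variants, keep
# the best. score() is a placeholder for whatever evaluators you
# configured (LLM-as-judge, rule-based); DSPy automates the variant
# generation that is hard-coded in `candidates` below.
def score(prompt: str) -> float:
    # toy scorer: reward explicit instructions (illustration only)
    return prompt.count("must") + 0.1 * len(prompt.split())

candidates = [
    "Answer the question.",
    "You must answer factually. You must be concise.",
    "Answer, citing sources where possible.",
]

best = max(candidates, key=score)
print(best)  # the second, most explicit variant wins under this scorer
```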

What sets LangWatch apart is its agent simulation framework. Instead of just evaluating single LLM calls, you can define complete agent behaviors with tools, state management, and multi-turn conversations. Here’s a simulation configuration:

// Define an agent simulation scenario
const simulation = {
  name: "customer_support_agent",
  agent_config: {
    system_prompt: "You are a helpful support agent with access to order data.",
    tools: ["search_orders", "update_order_status", "send_email"],
    model: "gpt-4o"
  },
  test_cases: [
    {
      initial_message: "I need to cancel order #12345",
      expected_tools_called: ["search_orders", "update_order_status"],
      success_criteria: [
        { type: "tool_call_sequence", pattern: ["search", "update"] },
        { type: "response_contains", text: "cancellation confirmed" },
        { type: "max_turns", value: 3 }
      ]
    }
  ]
};

The simulator executes these scenarios, capturing full traces of tool calls, state transitions, and LLM responses. It validates that agents follow expected paths, use tools correctly, and handle edge cases without infinite loops or hallucinated tool arguments. This catches issues like “agent repeatedly calls the same tool with identical arguments” or “agent fabricates order IDs” before production deployment.
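Checking a run against the success_criteria above reduces to a few predicates over the captured trace. A sketch (criterion types mirror the config shown earlier; the captured-run shape with tool_calls, final_response, and turns is an assumption):

```python
# Sketch of validating one simulation run against success criteria.
# Criterion types mirror the config above; the run's shape
# (tool_calls, final_response, turns) is assumed for illustration.
def check(criterion, run):
    kind = criterion["type"]
    if kind == "tool_call_sequence":
        # each pattern element must match some tool call, in order
        calls = iter(run["tool_calls"])
        return all(any(p in c for c in calls) for p in criterion["pattern"])
    if kind == "response_contains":
        return criterion["text"] in run["final_response"]
    if kind == "max_turns":
        return run["turns"] <= criterion["value"]
    raise ValueError(f"unknown criterion type: {kind}")

run = {
    "tool_calls": ["search_orders", "update_order_status"],
    "final_response": "Your cancellation confirmed for order #12345.",
    "turns": 2,
}
criteria = [
    {"type": "tool_call_sequence", "pattern": ["search", "update"]},
    {"type": "response_contains", "text": "cancellation confirmed"},
    {"type": "max_turns", "value": 3},
]
passed = all(check(c, run) for c in criteria)
print(passed)  # True
```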

The platform’s Model Context Protocol (MCP) integration is particularly clever. You can expose LangWatch’s evaluation capabilities as MCP servers, letting you run evaluations directly from Claude Desktop or any MCP-compatible IDE. This means developers can test prompt variations without leaving their development environment, with results automatically synced to the LangWatch dashboard for team visibility.

Under the hood, the architecture uses PostgreSQL for structured data (projects, users, evaluations), OpenSearch for time-series trace data and full-text search across LLM inputs/outputs, and Redis for job queuing and caching. The Next.js frontend communicates with a TypeScript backend that handles trace ingestion, evaluation orchestration, and dataset management. For self-hosting, everything deploys via Docker Compose or Kubernetes with clear separation between stateless services and persistent data stores.
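A self-hosted deployment matching those components might look like the Compose sketch below. Service names, image tags, and ports here are assumptions based on the stack described above, not LangWatch's published configuration; consult the project's own docker-compose.yml for the real values.

```yaml
# Hedged sketch of a self-hosted stack -- names and tags are assumed.
services:
  langwatch:
    image: langwatch/langwatch:latest   # assumed image name
    ports: ["3000:3000"]                # assumed port
    depends_on: [postgres, opensearch, redis]
  postgres:
    image: postgres:16
    volumes: [pgdata:/var/lib/postgresql/data]   # persistent store
  opensearch:
    image: opensearchproject/opensearch:2
    environment:
      - discovery.type=single-node      # dev-only; cluster for prod
  redis:
    image: redis:7
volumes:
  pgdata:
```

Note the separation the article describes: the langwatch service is stateless and scales horizontally, while the three data stores hold all persistent state.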

Gotcha

The OpenTelemetry requirement is both LangWatch’s strength and its friction point. If you’re using a framework without native OTel support—or worse, a custom in-house LLM wrapper—you’ll need to manually instrument your code with span creation and attribute setting. This isn’t conceptually hard, but it’s more setup work than solutions like Helicone where you just proxy requests through a different URL. Teams not already invested in OpenTelemetry face a steeper learning curve understanding the tracing model before they can effectively use LangWatch.

Self-hosting comes with real operational complexity. You’re running PostgreSQL, OpenSearch, Redis, and the LangWatch application services. OpenSearch in particular can be resource-hungry and requires tuning for production workloads. If a trace spike causes OpenSearch to fall behind, you’ll experience ingestion delays that make real-time debugging difficult. The Docker Compose setup works for development and small deployments, but scaling to handle millions of traces requires Kubernetes expertise and careful capacity planning. Cloud-hosted LangWatch eliminates this complexity but at the cost of sending potentially sensitive LLM data to a third party—acceptable for many use cases but a non-starter in heavily regulated industries without careful data classification.

The evaluation and simulation features, while powerful, assume you can define clear success criteria. For creative or open-ended LLM applications, writing automated evaluations that meaningfully capture quality is genuinely hard. LLM-as-judge evaluators can be inconsistent or biased toward certain response styles. You’ll still need human review for many use cases, and LangWatch doesn’t replace the need for thoughtful evaluation design—it just gives you better tools to execute evaluations you’ve defined.

Verdict

Use if: You’re building production LLM applications where observability and systematic evaluation are critical, especially multi-step agents where emergent behavior needs testing. You value avoiding vendor lock-in through open standards, or you have compliance requirements that mandate self-hosting. You’re willing to invest in OpenTelemetry instrumentation upfront for long-term portability.

Skip if: You’re prototyping and just need quick visibility into a handful of LLM calls—simpler proxy-based tools will get you observability faster. You’re deeply embedded in an existing observability ecosystem (Datadog, New Relic) and don’t want to manage another platform. Your organization can’t support running OpenSearch and the associated infrastructure for self-hosting, and you can’t use cloud-hosted solutions due to data policies. You need mature enterprise features like advanced RBAC or SOC2 compliance documentation that younger projects haven’t fully built out yet.
