Jaeger: How Uber's Distributed Tracing System Became the OpenTelemetry Standard-Bearer

Hook

When Uber needed to debug a request spanning 100+ microservices across multiple data centers, traditional logging failed spectacularly. Jaeger was born from that chaos—and now powers distributed tracing for thousands of organizations.

Context

In a monolithic application, debugging is relatively straightforward: add log statements, reproduce the issue, grep through a single log file. But in distributed systems with dozens or hundreds of services communicating asynchronously, a single user request might traverse 20+ services, spawn background jobs, write to message queues, and touch multiple databases. Traditional logging breaks down here because nothing ties together log entries from different services when many requests are in flight concurrently.

Distributed tracing emerged as the solution to this problem, inspired by Google's Dapper paper from 2010. The core concept is elegant: instrument your code to generate spans (representing units of work) that form a directed acyclic graph called a trace. Each span includes timing information, tags, and a trace ID that flows through every service call. Uber began building Jaeger in 2015 to solve its own microservices debugging nightmare, open-sourced it in 2017, and donated it to the Cloud Native Computing Foundation, where it reached graduated status in 2019. That puts it in a small group of graduated projects alongside Kubernetes and Prometheus.
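To make the span-and-trace model concrete, here is a minimal sketch in Go. The `Span` struct, its fields, and the helper names are hypothetical and much simpler than what a real tracing SDK records; the point is only that parent IDs are enough to rebuild the request tree from a flat list of spans.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a simplified, hypothetical span record; real SDKs carry more fields.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for the root span
	Name     string
	Duration time.Duration
}

// groupByParent indexes spans by parent span ID so the trace can be walked as a tree.
func groupByParent(spans []Span) map[string][]Span {
	children := make(map[string][]Span)
	for _, s := range spans {
		children[s.ParentID] = append(children[s.ParentID], s)
	}
	return children
}

// printTree prints every span below parentID, indented by its depth in the trace.
func printTree(children map[string][]Span, parentID string, depth int) {
	for _, s := range children[parentID] {
		fmt.Printf("%*s%s (%v)\n", depth*2, "", s.Name, s.Duration)
		printTree(children, s.SpanID, depth+1)
	}
}

// exampleTrace returns made-up spans that all share one trace ID.
func exampleTrace() []Span {
	return []Span{
		{TraceID: "t1", SpanID: "a", ParentID: "", Name: "GET /checkout", Duration: 120 * time.Millisecond},
		{TraceID: "t1", SpanID: "b", ParentID: "a", Name: "auth.Verify", Duration: 15 * time.Millisecond},
		{TraceID: "t1", SpanID: "c", ParentID: "a", Name: "orders.Create", Duration: 90 * time.Millisecond},
		{TraceID: "t1", SpanID: "d", ParentID: "c", Name: "db.Insert", Duration: 40 * time.Millisecond},
	}
}

func main() {
	printTree(groupByParent(exampleTrace()), "", 0)
}
```

Because every span carries the same trace ID, a backend can collect spans emitted by different services and reassemble them this way, regardless of arrival order.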

Technical Insight

Jaeger's architecture separates four concerns: instrumentation, collection, storage, and querying. The v2 release is a fundamental shift: instead of maintaining its own collection components, Jaeger now builds directly on OpenTelemetry Collector components and accepts OTLP natively, which makes it a natural backend for OTLP-based tracing.

At the instrumentation layer, applications use OpenTelemetry SDKs to create spans. Here's what minimal instrumentation looks like in a Go HTTP handler:

import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

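// initTracer configures an OTLP/gRPC exporter and registers a global tracer provider.
// The caller should call Shutdown on the returned provider at exit to flush buffered spans.
func initTracer() (*trace.TracerProvider, error) {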
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("jaeger-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(0.1))),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(r.Context(), "handle-request")
    defer span.End()
    
    span.SetAttributes(
        attribute.String("user.id", r.Header.Get("X-User-ID")),
        attribute.String("request.path", r.URL.Path),
    )
    
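    // processBusinessLogic is a placeholder for your own code; it is assumed
    // to return a slice of results and an error.
    result, err := processBusinessLogic(ctx)
    if err != nil {
        span.RecordError(err)
        w.WriteHeader(http.StatusInternalServerError)
        return
    }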
    
    span.SetAttributes(attribute.Int("response.items", len(result)))
    w.Write([]byte("Success"))
}

The Collector is where Jaeger's architectural sophistication shines. It receives traces via OTLP (both HTTP and gRPC), applies sampling decisions, performs batch processing for efficiency, and writes to storage backends. The collector pipeline consists of receivers, processors, and exporters—a pattern borrowed directly from OpenTelemetry Collector. This means you can extend Jaeger with any OTEL-compatible component.
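The receiver, processor, and exporter pattern can be sketched in a few lines of Go. This is a conceptual model only, not the OpenTelemetry Collector's actual API: `Pipeline`, `Processor`, `Exporter`, and `dropHealthChecks` are hypothetical names, and the receiver is reduced to a direct call to `Consume`.

```go
package main

import "fmt"

// Span is a simplified, hypothetical span record.
type Span struct {
	Service   string
	Operation string
}

// Processor transforms a batch of spans (filtering, enriching, batching, ...).
type Processor func([]Span) []Span

// Exporter sends a processed batch somewhere, e.g. to a storage backend.
type Exporter func([]Span) error

// Pipeline wires processors and exporters together; a receiver would call Consume.
type Pipeline struct {
	Processors []Processor
	Exporters  []Exporter
}

// Consume runs the batch through every processor in order, then hands the
// result to every exporter, stopping at the first export error.
func (p Pipeline) Consume(batch []Span) error {
	for _, proc := range p.Processors {
		batch = proc(batch)
	}
	for _, exp := range p.Exporters {
		if err := exp(batch); err != nil {
			return err
		}
	}
	return nil
}

// dropHealthChecks is an example processor that removes noisy health-check spans.
func dropHealthChecks(batch []Span) []Span {
	kept := make([]Span, 0, len(batch))
	for _, s := range batch {
		if s.Operation != "/healthz" {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	p := Pipeline{
		Processors: []Processor{dropHealthChecks},
		Exporters: []Exporter{func(b []Span) error {
			fmt.Println("exporting", len(b), "spans")
			return nil
		}},
	}
	_ = p.Consume([]Span{
		{Service: "api", Operation: "/healthz"},
		{Service: "api", Operation: "/checkout"},
	})
}
```

The value of the pattern is that each stage only depends on the batch type, so stages can be added, removed, or reordered without touching the others.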

One of Jaeger's standout features is adaptive sampling. Instead of hard-coding a sampling probability in each service, which over-samples busy endpoints and can starve quiet ones, Jaeger implements a feedback loop. The collector observes traffic and calculates a sampling probability per service and operation, then serves these probabilities through a remote sampling endpoint that SDKs poll. This lets you target a specific throughput (e.g., "sample roughly 10 traces per second for checkout operations") while automatically reducing sampling for high-volume, low-value operations. The decision is still made when a trace starts, so adaptive sampling remains a form of head-based sampling.
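The feedback loop can be illustrated with a deliberately simplified calculation. This is a sketch of the general idea, not Jaeger's actual algorithm: the function name, its parameters, and the clamping bounds are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// nextProbability scales the current sampling probability so that the observed
// rate of sampled traces moves toward the target rate.
//   current     - probability currently in use for this service/operation
//   observedTPS - sampled traces per second seen at that probability
//   targetTPS   - desired sampled traces per second
//   minP        - lower bound so that rare operations are never sampled at zero
func nextProbability(current, observedTPS, targetTPS, minP float64) float64 {
	if observedTPS <= 0 {
		// No sampled traffic observed: keep the current probability unchanged.
		return current
	}
	p := current * targetTPS / observedTPS
	return math.Max(minP, math.Min(1.0, p))
}

func main() {
	// A busy endpoint producing 100 sampled traces/s at p=0.5, with a target of 10/s.
	fmt.Println(nextProbability(0.5, 100, 10, 0.001))
	// A quiet endpoint producing 2 sampled traces/s at p=0.5, with the same target.
	fmt.Println(nextProbability(0.5, 2, 10, 0.001))
}
```

A production implementation would also smooth changes between rounds so probabilities do not oscillate, which this sketch omits.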

The storage layer uses a gRPC-based plugin architecture. Jaeger ships with built-in support for Cassandra, Elasticsearch, and Kafka (for buffering), but you can implement the storage interface in any language and run it as a separate process. This design choice reflects lessons learned at scale: your storage requirements might be unique, and forcing everyone into a single storage backend would be fatal for adoption.
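To show what "implementing the storage interface" amounts to, here is a minimal sketch in Go. The `SpanStore` interface and its method names are hypothetical and far smaller than Jaeger's real storage API, which is defined over gRPC; the in-memory implementation stands in for a real backend.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Span is a simplified, hypothetical span record.
type Span struct {
	TraceID   string
	SpanID    string
	Operation string
}

// ErrTraceNotFound is returned when no spans exist for a trace ID.
var ErrTraceNotFound = errors.New("trace not found")

// SpanStore is a hypothetical, minimal storage contract: write spans, read traces.
type SpanStore interface {
	WriteSpan(s Span) error
	GetTrace(traceID string) ([]Span, error)
}

// memoryStore keeps spans in a map; suitable for a demo, not for production.
type memoryStore struct {
	mu     sync.Mutex
	traces map[string][]Span
}

func newMemoryStore() *memoryStore {
	return &memoryStore{traces: make(map[string][]Span)}
}

// WriteSpan appends the span to its trace.
func (m *memoryStore) WriteSpan(s Span) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.traces[s.TraceID] = append(m.traces[s.TraceID], s)
	return nil
}

// GetTrace returns a copy of all spans stored for traceID.
func (m *memoryStore) GetTrace(traceID string) ([]Span, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	spans, ok := m.traces[traceID]
	if !ok {
		return nil, ErrTraceNotFound
	}
	return append([]Span(nil), spans...), nil
}

func main() {
	var store SpanStore = newMemoryStore()
	_ = store.WriteSpan(Span{TraceID: "t1", SpanID: "a", Operation: "GET /checkout"})
	spans, err := store.GetTrace("t1")
	fmt.Println(len(spans), err)
}
```

Anything that satisfies the contract can sit behind the collector and query service, which is what lets a backend run as a separate process.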

The Query service exposes both a gRPC API and a REST API, serving the React-based UI. Query operations are optimized for specific access patterns: finding traces by service and operation, looking up traces by ID, and retrieving dependencies between services. The UI provides flamegraph-style visualizations showing the entire request flow, timing breakdowns, and the ability to drill into individual span attributes—invaluable when debugging latency issues.
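The service dependency view mentioned above can be derived from span parent/child links. The following Go sketch shows one way to do that; the types and the `dependencies` function are hypothetical and are not taken from Jaeger's code.

```go
package main

import "fmt"

// Span is a simplified, hypothetical span record.
type Span struct {
	SpanID   string
	ParentID string
	Service  string
}

// Edge is a caller-to-callee relationship between two services.
type Edge struct {
	Parent string
	Child  string
}

// dependencies counts parent/child span pairs that cross a service boundary.
func dependencies(spans []Span) map[Edge]int {
	serviceOf := make(map[string]string, len(spans))
	for _, s := range spans {
		serviceOf[s.SpanID] = s.Service
	}
	deps := make(map[Edge]int)
	for _, s := range spans {
		parent, ok := serviceOf[s.ParentID]
		if !ok || parent == s.Service {
			continue // root span, missing parent, or an in-process child span
		}
		deps[Edge{Parent: parent, Child: s.Service}]++
	}
	return deps
}

func main() {
	spans := []Span{
		{SpanID: "1", ParentID: "", Service: "frontend"},
		{SpanID: "2", ParentID: "1", Service: "checkout"},
		{SpanID: "3", ParentID: "2", Service: "checkout"},
		{SpanID: "4", ParentID: "2", Service: "payments"},
	}
	for edge, calls := range dependencies(spans) {
		fmt.Printf("%s -> %s: %d\n", edge.Parent, edge.Child, calls)
	}
}
```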

Jaeger v2's embrace of OpenTelemetry components means you're no longer locked into Jaeger-specific tooling. You can swap in different processors, add custom exporters, or even use Jaeger primarily as a storage backend for traces collected by standalone OTEL collectors. This flexibility is why Jaeger remains relevant even as the observability landscape fragments.

Gotcha

Jaeger's flexibility comes with operational complexity that catches teams off guard. The storage backend decision is critical and irreversible in practice. Cassandra offers excellent write throughput and horizontal scaling but requires significant operational expertise—you're essentially running a complex distributed database. Elasticsearch is more familiar to many teams and offers powerful querying, but storage costs can spiral out of control with high trace volumes. The in-memory storage is only for development, and Badger (local disk) doesn't support distributed deployments.

Sampling is another gotcha that bites production deployments. Even with adaptive sampling, you face an inherent trade-off: sample too little and you'll miss the one trace that would have explained your production incident; sample too much and you'll drown in storage costs while query performance degrades. There's no magic solution: you need to monitor your sampling decisions themselves (Jaeger provides metrics) and tune continuously. Additionally, head-based sampling (Jaeger's default) means sampling decisions happen before you know whether a trace is interesting. If you need to capture all error traces or all slow traces, you'll need tail-based sampling, which requires stateful processors that buffer spans before making decisions and adds significant complexity. In Jaeger v2 this is available through the OpenTelemetry Collector's tail sampling processor.
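The statefulness of tail-based sampling is easier to see in code. The sketch below is a hypothetical, single-process illustration: the `tailSampler` type and its policy (keep a trace if any span errored or was slow) are assumptions, and a real processor also needs timeouts, memory limits, and routing so that all spans of a trace reach the same instance.

```go
package main

import "fmt"

// Span is a simplified, hypothetical span record.
type Span struct {
	TraceID    string
	DurationMs int
	IsError    bool
}

// tailSampler buffers spans per trace until a keep/drop decision is made.
type tailSampler struct {
	buffer map[string][]Span
	slowMs int // spans at or above this duration make the trace "interesting"
}

func newTailSampler(slowMs int) *tailSampler {
	return &tailSampler{buffer: make(map[string][]Span), slowMs: slowMs}
}

// Add stores a span until its trace is decided.
func (t *tailSampler) Add(s Span) {
	t.buffer[s.TraceID] = append(t.buffer[s.TraceID], s)
}

// Decide is called once a trace is considered complete (for example after a
// timeout). It keeps the whole trace if any span errored or was slow, and
// frees the buffered spans either way.
func (t *tailSampler) Decide(traceID string) (bool, []Span) {
	spans := t.buffer[traceID]
	delete(t.buffer, traceID)
	for _, s := range spans {
		if s.IsError || s.DurationMs >= t.slowMs {
			return true, spans
		}
	}
	return false, nil
}

func main() {
	ts := newTailSampler(500)
	ts.Add(Span{TraceID: "fast", DurationMs: 20})
	ts.Add(Span{TraceID: "slow", DurationMs: 900})
	for _, id := range []string{"fast", "slow"} {
		keep, spans := ts.Decide(id)
		fmt.Println(id, keep, len(spans))
	}
}
```

The buffer is the cost: every in-flight trace is held in memory until the decision point, which is why tail-based sampling is heavier to operate than head-based sampling.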

Configuration compatibility between versions can break unexpectedly. While the maintainers try to provide migration paths, the v1 to v2 transition requires rewriting configuration because the underlying component model changed entirely. If you're running Jaeger in production, budget time for testing upgrades thoroughly. The project's three-month deprecation window sounds reasonable until you're managing deployments across dozens of environments with change freezes and compliance requirements.

Verdict

Use if: You're building or operating a microservices architecture where debugging cross-service interactions is painful, you've already adopted or plan to adopt OpenTelemetry instrumentation, you need production-grade distributed tracing with flexible storage options, or you want a CNCF-graduated project with strong governance and community support. Jaeger excels when you need control over your observability infrastructure and have the operational maturity to run it.

Skip if: You're a small team without dedicated platform engineers (the operational overhead isn't worth it; use a hosted solution), you need tight integration between traces, metrics, and logs in a single UI (consider Grafana stack with Tempo instead), you want built-in anomaly detection or AI-powered insights (look at commercial APM tools), or your application is a monolith or uses fewer than 5-10 services (traditional logging and APM tools will serve you better with less complexity).
