
Jaeger: Building Production-Grade Distributed Tracing with OpenTelemetry


Hook

When Uber’s engineering team needed to debug performance issues across their microservices mesh, they built Jaeger—and it’s now a CNCF graduated project used across organizations worldwide.

Context

Before distributed tracing platforms like Jaeger, debugging performance issues in microservices architectures was like trying to find a needle in a haystack while blindfolded. A single user request might traverse dozens of services, each adding latency, and traditional logging offered no way to correlate these interactions. You'd see that Service A took 500ms, but understanding whether the bottleneck was in Service B, C, or a database call three hops away required manual log aggregation, timestamp comparison, and pure detective work.

Uber created Jaeger to solve this exact problem at scale. Inspired by Google’s Dapper and Twitter’s Zipkin, Jaeger introduced a production-ready tracing solution designed for high-throughput environments where instrumenting thousands of services was non-negotiable. In 2017, Uber donated it to the CNCF, and by October 2019, it became only the seventh project to achieve graduated status—joining Kubernetes, Prometheus, and Envoy in the elite tier of cloud-native infrastructure. Today, Jaeger is the reference implementation for distributed tracing, especially after its v2 release that deepened OpenTelemetry integration.

Technical Insight

[System architecture diagram, auto-generated] Instrumented applications (OpenTelemetry SDKs) send OTLP traces over gRPC/HTTP to the Jaeger Collector (ingestion & processing), which writes spans to storage backends (Cassandra/ES/Kafka). The Jaeger Query Service (API layer) reads traces from storage and serves data to the Jaeger UI (visualization), while the collector serves sampling config back to the SDKs.

Jaeger’s architecture is deceptively simple but deeply thought-out. Applications instrumented with OpenTelemetry SDKs send trace spans via gRPC or HTTP to the Jaeger Collector, which validates, processes, and writes them to pluggable storage backends. The Jaeger Query Service reads from this storage and serves data to the UI, where developers visualize request flows, identify latency hotspots, and diagnose failures. What makes this powerful is the separation of concerns: collectors scale horizontally for ingestion throughput, while query services scale based on read patterns.

Getting started is genuinely effortless. The all-in-one Docker image bundles the collector, query service, UI, and in-memory storage into a single container:

docker run --rm --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/jaeger:latest

This exposes the UI on port 16686 and accepts OTLP traces via gRPC (4317) or HTTP (4318). Within minutes, you can start sending traces using OpenTelemetry instrumentation. The OTLP support is critical—Jaeger natively speaks the same protocol as the OpenTelemetry Collector, meaning you can swap tracing backends without touching application code.
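To make this concrete, here is a minimal sketch of what "sending traces" looks like at the wire level, assuming the all-in-one container above is running locally. It builds an OTLP/JSON payload by hand using only the standard library and posts it to the standard OTLP/HTTP path `/v1/traces` on port 4318; in practice you would use an OpenTelemetry SDK instead, and the service and span names here are hypothetical.

```python
import json
import secrets
import time
import urllib.request

# Standard OTLP/HTTP traces path; assumes the all-in-one container is running.
OTLP_ENDPOINT = "http://localhost:4318/v1/traces"

def build_payload(service_name: str, span_name: str) -> dict:
    """Build a minimal OTLP/JSON trace payload containing a single span."""
    now = time.time_ns()
    return {
        "resourceSpans": [{
            "resource": {"attributes": [{
                "key": "service.name",
                "value": {"stringValue": service_name},
            }]},
            "scopeSpans": [{
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 16 random bytes, hex-encoded
                    "spanId": secrets.token_hex(8),    # 8 random bytes, hex-encoded
                    "name": span_name,
                    "kind": 2,                         # SPAN_KIND_SERVER
                    "startTimeUnixNano": str(now),
                    "endTimeUnixNano": str(now + 50_000_000),  # span lasts 50ms
                }],
            }],
        }],
    }

def send_trace(payload: dict) -> int:
    """POST the payload to the collector and return the HTTP status code."""
    req = urllib.request.Request(
        OTLP_ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# With Jaeger running, this should succeed and the trace appears in the UI:
# send_trace(build_payload("demo-service", "GET /checkout"))
```

After sending, searching for the service name in the UI at http://localhost:16686 should show the span.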

The pluggable storage architecture is where Jaeger shines in production. Through its storage plugin interface, Jaeger supports backends such as Cassandra and Elasticsearch, letting you choose the right storage for your needs. The choice matters: Cassandra favors sustained write throughput, while Elasticsearch offers richer querying over span tags at higher indexing cost, and each carries its own operational complexity. A typical production setup also places Kafka between collectors and storage as a buffer, preventing trace loss during backend outages and smoothing ingestion spikes.
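As a concrete illustration of the buffered topology, a common v1-era deployment runs collectors that produce spans to Kafka and a separate ingester that drains Kafka into Elasticsearch. The sketch below uses image names and environment variables from the v1 deployment docs; the `kafka` and `elasticsearch` hostnames are placeholders for your environment.

```shell
# Collectors write spans into Kafka instead of directly to storage
docker run -d --name jaeger-collector \
  -e SPAN_STORAGE_TYPE=kafka \
  -e KAFKA_PRODUCER_BROKERS=kafka:9092 \
  jaegertracing/jaeger-collector

# A separate ingester consumes from Kafka and writes to Elasticsearch
docker run -d --name jaeger-ingester \
  -e SPAN_STORAGE_TYPE=elasticsearch \
  -e ES_SERVER_URLS=http://elasticsearch:9200 \
  -e KAFKA_CONSUMER_BROKERS=kafka:9092 \
  jaegertracing/jaeger-ingester
```

Because the collector and ingester scale independently, a Kafka outage-buffer sized for your peak span rate is what keeps traces from being dropped when Elasticsearch falls behind.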

Jaeger v2, released in 2024, represents a fundamental architectural shift. Instead of maintaining separate collector and query binaries, v2 consolidates components and leverages the OpenTelemetry Collector’s plugin ecosystem. This means Jaeger can now use any OTEL Collector receiver, processor, or exporter—giving you access to the broader OpenTelemetry ecosystem for metrics correlation, tail sampling, and more. The configuration moved from CLI flags to YAML-based OpenTelemetry Collector config, improving consistency but requiring migration effort for existing deployments.
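A minimal v2 configuration, modeled on the example configs shipped with Jaeger v2 (extension, exporter, and field names may differ slightly between releases, so treat this as a sketch), wires the OTLP receiver through a batch processor into in-memory storage:

```yaml
extensions:
  jaeger_query:
    storage:
      traces: memstore
  jaeger_storage:
    backends:
      memstore:
        memory:
          max_traces: 100000

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  jaeger_storage_exporter:
    trace_storage: memstore

service:
  extensions: [jaeger_storage, jaeger_query]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger_storage_exporter]
```

The structure is standard OpenTelemetry Collector YAML: swapping `memory` for a Cassandra or Elasticsearch backend, or adding a tail-sampling processor to the pipeline, follows the same conventions as any other Collector deployment.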

Remote sampling is an underappreciated feature that becomes essential at scale. Instead of hardcoding sampling rates in application config, Jaeger collectors can dynamically serve sampling strategies via gRPC. This lets you configure sampling centrally and update it without redeploying services—the difference between drowning in irrelevant traces and having just enough data to debug issues.
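The strategies the collector serves can be defined in a JSON file (the field names below follow the Jaeger sampling documentation; the service names are hypothetical). Here, one noisy service is rate-limited while everything else is sampled probabilistically:

```json
{
  "service_strategies": [
    {"service": "checkout", "type": "probabilistic", "param": 0.8},
    {"service": "search", "type": "ratelimiting", "param": 100}
  ],
  "default_strategy": {"type": "probabilistic", "param": 0.01}
}
```

Updating this file changes sampling behavior fleet-wide on the collector's next reload, with no application redeploys.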

The storage plugin interface deserves special attention. Jaeger defines a gRPC contract for reading and writing spans, allowing you to implement custom backends without forking the codebase. The interface is stable, documented, and battle-tested—if your organization has strong opinions about data storage, this is your escape hatch.

Gotcha

Storage is Jaeger’s Achilles’ heel, both in cost and complexity. At meaningful scale, you’re not running in-memory storage—you’re operating a production-grade storage cluster with significant data volumes and strict retention policies. Storage choices like Elasticsearch can become expensive as trace volume grows, because indexing on span tags consumes significant resources. You’ll spend real engineering time tuning storage configurations, managing disk I/O, and debugging query performance across large datasets.

The v2 migration introduces breaking changes that teams need to plan for. CLI flags are deprecated in favor of YAML configuration files following OpenTelemetry Collector conventions. While Jaeger provides at least a 3-month grace period or two minor version bumps (whichever is later) for deprecated options and clear migration guides, converting existing infrastructure-as-code setups isn’t trivial. If you have Kubernetes deployments, Helm charts, or Terraform modules referencing v1 flags, budget time for updates. Some v1 configurations changed format in v2, requiring careful testing before production cutover. The architectural consolidation is worth it long-term, but the transition isn’t free.

Verdict

Use Jaeger if you're building or operating microservices architectures where understanding request flows is critical, you're already invested in OpenTelemetry instrumentation (or planning to be), or you need flexible storage options to balance cost and query capabilities. It's particularly compelling for platform teams supporting multiple development teams who need centralized tracing with minimal application-side complexity. The CNCF graduated status means this isn't a risky bet—it's production-proven infrastructure with active development and broad ecosystem support.

Skip Jaeger if you're working on monolithic applications where distributed tracing is overkill, you want an all-in-one observability platform that combines metrics, logs, and traces with zero infrastructure (use a managed APM service instead), or you lack the operational bandwidth to run production storage clusters at scale. For teams wanting tracing without storage overhead, consider managed alternatives or newer systems optimized for object storage.

Jaeger is a power tool—incredibly effective when you need its capabilities, but it requires commitment to operate properly.
