Day 2: Distributed Tracing with OpenTelemetry and Zipkin Integration
Welcome back, future troubleshooting maestros! Yesterday, we broke free from the println shackles and embraced Micrometer to understand single-service performance. We saw how a well-instrumented application can tell you what is happening internally. But here's the kicker: in a distributed system, "internally" isn't just one box. It's a symphony of services, each playing its part. What happens when the melody goes sour, and you can't tell if the problem is with the violins, the flutes, or the conductor?
That, my friends, is the chasm we bridge today. We're diving deep into Distributed Tracing, the bedrock of understanding how requests flow, and fail, across complex microservice landscapes.
Why Your Micrometer Metrics Aren't Enough (Yet)
Imagine a user clicks "Buy Now." This single action might touch a Frontend Service, then an Authentication Service, a User Profile Service, an Order Service, a Payment Service, and finally an Inventory Service. Each interaction is a hop. If the "Buy Now" button hangs, Micrometer might tell you the Order Service is slow, but it won't tell you why it's slow, which specific downstream call inside the Order Service is the culprit, or whether the Authentication Service was already struggling at the start, causing a domino effect.
This is where distributed tracing shines. It paints a complete picture, a narrative thread that follows a single request through every service it touches, showing you latency, errors, and context at each step.
Core Concepts: The Story of a Request
Traces, Spans, and Context Propagation
At its heart, a trace is the journey of a single request or transaction as it propagates through a distributed system. Think of it as a story.
Each step in that story (a call to a service, a database query, a method execution) is a span. Spans are the individual chapters, containing details like:
- Operation Name: What happened (e.g., `/api/users/{id}`, `saveToDatabase`).
- Start/End Time: How long it took.
- Tags/Attributes: Key-value pairs providing context (e.g., `http.status_code: 200`, `user.id: 123`).
- Span ID: A unique identifier for this specific operation.
- Parent Span ID: Links this span to its parent, forming a hierarchy.
- Trace ID: A unique identifier for the entire request journey.
The magic happens with context propagation. When Service A calls Service B, Service A injects the current trace and span IDs into the outgoing request headers (e.g., traceparent HTTP header). Service B then extracts these IDs, understands it's part of an ongoing trace, and creates a new child span linked to Service A's span. This is how the entire story, across service boundaries, is stitched together.
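The `traceparent` header follows the fixed W3C Trace Context format `version-traceId-spanId-flags`. In practice the agent injects and extracts it for you, but a minimal sketch of the extraction side (plain Java, no OTel dependency; class and method names here are illustrative, not from any library) makes the mechanics concrete:

```java
// Sketch: what Service B conceptually does with an incoming W3C
// traceparent header. Format: version-traceId-spanId-flags.
public class TraceparentDemo {
    record SpanContext(String traceId, String parentSpanId, boolean sampled) {}

    static SpanContext extract(String traceparent) {
        String[] parts = traceparent.split("-");
        if (parts.length != 4) throw new IllegalArgumentException("malformed traceparent");
        // parts[0] = version, parts[1] = 32-hex trace ID,
        // parts[2] = 16-hex parent span ID, parts[3] = trace flags
        boolean sampled = (Integer.parseInt(parts[3], 16) & 0x01) == 1;
        return new SpanContext(parts[1], parts[2], sampled);
    }

    public static void main(String[] args) {
        SpanContext ctx =
            extract("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(ctx.traceId());      // the shared journey ID
        System.out.println(ctx.parentSpanId()); // the caller's span ID
    }
}
```

Service B would then create its own child span carrying the same trace ID, with `parentSpanId` as its parent.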
OpenTelemetry: The Universal Translator
Before OpenTelemetry (OTel), every tracing vendor (Jaeger, Zipkin, Datadog, New Relic) had its own SDKs and data formats. This meant vendor lock-in and painful migrations. OpenTelemetry changed the game. It's a vendor-neutral observability framework for generating, collecting, and exporting telemetry data (traces, metrics, and logs).
Why OpenTelemetry?
- Standardization: Write your instrumentation once, export to any OTel-compatible backend.
- Rich Ecosystem: Supports dozens of languages and frameworks.
- Auto-Instrumentation: For many popular frameworks (like Spring Boot), you can often get basic tracing with zero code changes, thanks to language agents. This is where we'll start!
Zipkin: Your Trace Storyboard
Zipkin is an open-source distributed tracing system. It collects and visualizes trace data, allowing you to see the full request flow, identify latency bottlenecks, and understand dependencies. We're using Zipkin because it's lightweight, easy to set up, and provides an excellent visual interface for understanding traces.
System Architecture: Our Traced Microservices
We'll set up a simple two-service architecture: UserService and OrderService.
1. A client makes a request to `UserService`.
2. `UserService` processes it and then makes an internal HTTP call to `OrderService`.
3. Both services are instrumented using the OpenTelemetry Java Agent.
4. The OTel Agent in each service captures spans and sends them to an OpenTelemetry Collector.
5. The OTel Collector, in turn, exports these spans to Zipkin.
6. You'll then view the full trace in the Zipkin UI.
This architecture is robust: the OTel Collector acts as a buffer and processor, decoupling your application from the tracing backend.
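A minimal Docker Compose sketch of this topology might look like the following (image names are the standard published ones; ports are the conventional defaults, so adjust to taste):

```yaml
services:
  zipkin:
    image: openzipkin/zipkin
    ports:
      - "9411:9411"          # Zipkin UI and API
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"          # OTLP gRPC receiver
      - "4318:4318"          # OTLP HTTP receiver
    depends_on:
      - zipkin
```

The Java services themselves can run on the host and point their OTLP exporter at `localhost:4317`, or join this network and use `otel-collector:4317`.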
Control Flow & Data Flow: Following the Breadcrumbs
1. Client Request: `GET /api/users/{id}` to `UserService`.
2. UserService Entry: The OpenTelemetry Java Agent automatically intercepts this request, starts a new root span (e.g., `GET /api/users/{id}`), and assigns a unique Trace ID and Span ID.
3. UserService Internal Call: `UserService` makes an HTTP call to `OrderService` (e.g., `GET /api/orders/user/{id}`). Before sending, the OTel Agent automatically injects the current Trace ID and Span ID into the outgoing HTTP headers (`traceparent`, `tracestate`).
4. OrderService Entry: The OTel Agent in `OrderService` intercepts the incoming request and extracts the Trace ID and Parent Span ID from the headers. It then starts a new child span for this `OrderService` operation, linking it to the `UserService` span.
5. Span Completion & Export: As operations complete in both services, their respective spans are finalized (duration calculated, status set) and sent to the OpenTelemetry Collector.
6. Collector to Zipkin: The OTel Collector receives spans from both services, batches them, and exports them to Zipkin.
7. Zipkin Visualization: Zipkin reconstructs the entire trace from Trace ID, Span ID, and Parent Span ID, letting you see the nested calls and their timings.
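To make that reconstruction concrete, here is a small self-contained sketch (plain Java; span names, IDs, and timings are made up for illustration) of how a backend like Zipkin stitches a flat list of spans into a tree using `spanId`/`parentSpanId`:

```java
import java.util.*;

public class TraceTreeDemo {
    // A stripped-down span: just the fields needed for stitching.
    record Span(String spanId, String parentSpanId, String name, long durationMs) {}

    // Group spans by parent ID, then walk down from the root (parent == null).
    static String render(List<Span> spans) {
        Map<String, List<Span>> children = new HashMap<>();
        Span root = null;
        for (Span s : spans) {
            if (s.parentSpanId() == null) root = s;
            else children.computeIfAbsent(s.parentSpanId(), k -> new ArrayList<>()).add(s);
        }
        StringBuilder sb = new StringBuilder();
        render(root, children, 0, sb);
        return sb.toString();
    }

    static void render(Span s, Map<String, List<Span>> children, int depth, StringBuilder sb) {
        sb.append("  ".repeat(depth))
          .append(s.name()).append(" (").append(s.durationMs()).append(" ms)\n");
        for (Span c : children.getOrDefault(s.spanId(), List.of()))
            render(c, children, depth + 1, sb);
    }

    public static void main(String[] args) {
        System.out.print(render(List.of(
            new Span("a1", null, "GET /api/users/{id}", 120),
            new Span("b2", "a1", "GET /api/orders/user/{id}", 80),
            new Span("c3", "b2", "SELECT orders", 35))));
        // Prints the nested call hierarchy:
        // GET /api/users/{id} (120 ms)
        //   GET /api/orders/user/{id} (80 ms)
        //     SELECT orders (35 ms)
    }
}
```

This is exactly the waterfall view the Zipkin UI renders, with bars instead of indentation.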
Production-Grade Insights: Beyond the Basics
Sampling is Your Friend (and Foe): At 100M RPS, you cannot trace every request. It's too much data, too much overhead. Sampling is crucial.
- Head-based sampling: Decides whether to sample a trace at its very beginning. Simple, but you might miss interesting errors downstream if the initial decision was to discard.
- Tail-based sampling: Collects all spans for a trace, then decides after the trace completes whether to keep it (e.g., if it had an error or was unusually slow). This is far more intelligent but requires more processing power in your collector. At high scale, tail-based sampling with intelligent rules (e.g., always sample errors, always sample requests above N ms) is the way to go.
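With the Java agent, head-based ratio sampling is configured through the documented OTel autoconfigure system properties; a sketch (the 10% ratio is an arbitrary example value, and the agent path is a placeholder) looks like this. Tail-based sampling, by contrast, lives in the Collector (the contrib distribution ships a tail sampling processor), not in the agent:

```shell
java -javaagent:/path/to/opentelemetry-javaagent.jar \
     -Dotel.traces.sampler=parentbased_traceidratio \
     -Dotel.traces.sampler.arg=0.1 \
     -jar user-service.jar
```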
High-Cardinality Attributes are Expensive: Adding too many unique attributes (e.g., `session.id` for every request) to spans inflates your data volume and storage costs. Be judicious. Focus on attributes that help you filter and understand performance, not unique identifiers for every single user interaction.

The Collector is Not Optional: For production, always use an OpenTelemetry Collector. It provides:

- Batching: Reduces network calls from your services.
- Retries: Buffers and retries exports if your tracing backend is temporarily unavailable.
- Processing: Filters, samples, adds attributes, and transforms data before sending it to the backend.
- Security: Can handle authentication/authorization to your backend.
Connecting Traces to Logs (Teaser for Day 3): A trace tells you what happened and how long it took. Logs tell you why. The ultimate power comes from linking your logs to your traces using Trace ID and Span ID. This allows you to jump from a slow span in Zipkin directly to the relevant log entries in your logging system.
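As a preview: the Java agent's Logback MDC instrumentation injects `trace_id` and `span_id` keys into the logging context, so a pattern like the following (a sketch; verify the exact key names against your agent version) stamps every log line with its trace:

```xml
<!-- logback.xml: include the MDC keys populated by the OTel Java agent -->
<pattern>%d{HH:mm:ss.SSS} %-5level [trace_id=%X{trace_id} span_id=%X{span_id}] %logger{36} - %msg%n</pattern>
```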
Assignment: Trace Your Distributed Java System
Your mission, should you choose to accept it, is to build a two-service Spring Boot application, instrument it with OpenTelemetry, and visualize the traces in Zipkin.
Steps:
1. Setup Docker Compose: Get Zipkin and an OpenTelemetry Collector running.
2. Create Two Spring Boot Services:
   - `user-service`: Exposes an endpoint `/api/users/{id}`. This service will call `order-service`.
   - `order-service`: Exposes an endpoint `/api/orders/user/{id}`.
3. Download the OpenTelemetry Java Agent: This magical `.jar` will do most of the heavy lifting.
4. Run Services with the Agent: Start your Java applications, attaching the OpenTelemetry Java Agent and configuring it to export traces to your OTel Collector.
5. Trigger Traces: Use `curl` to hit your `user-service` endpoint, which in turn calls `order-service`.
6. Verify in Zipkin: Open the Zipkin UI, find your traces, and marvel at the end-to-end story of your request.
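Triggering and finding a trace might look like this (assuming `user-service` listens on Spring Boot's default port 8080 and Zipkin on its default 9411; both are choices you can change):

```shell
# Hit the entry-point service; the agent propagates context downstream.
curl http://localhost:8080/api/users/42

# Then open the Zipkin UI at http://localhost:9411 and search for
# serviceName=user-service to find the end-to-end trace.
```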
Solution Hints: Your Compass
Docker Compose:

- `zipkin`: Use `openzipkin/zipkin`.
- `otel-collector`: Use `otel/opentelemetry-collector-contrib`. You'll need an `otel-collector-config.yaml` to configure it to receive OTLP traces and export them to Zipkin:
  - `receivers`: `otlp` (gRPC and HTTP).
  - `exporters`: `zipkin` (pointing to the `zipkin` service in Docker Compose).
  - `service`: `pipelines` -> `traces` -> `receivers: [otlp]`, `exporters: [zipkin]`.

Java Agent Download: Find the latest `opentelemetry-javaagent.jar` release on the OpenTelemetry Java Agent GitHub page.

Running Java Services:

`java -javaagent:/path/to/opentelemetry-javaagent.jar -Dotel.service.name=user-service -Dotel.exporter.otlp.endpoint=http://localhost:4317 -Dotel.resource.attributes="service.version=1.0" -jar user-service.jar`

Remember to change `user-service` to `order-service` for the second service and adjust `server.port` in `application.properties`. `localhost:4317` (or `otel-collector:4317` if running the services inside the Docker network) is the default OTLP gRPC endpoint.

Spring Boot `WebClient`: Use `WebClient` in `UserService` to make the call to `OrderService`. OpenTelemetry automatically instruments it.
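Putting those collector hints together, a minimal `otel-collector-config.yaml` might look like this (the `zipkin` exporter endpoint assumes your Compose service is named `zipkin`):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:                      # batch spans before export
exporters:
  zipkin:
    endpoint: http://zipkin:9411/api/v2/spans
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [zipkin]
```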
This hands-on journey will solidify your understanding of distributed tracing. You won't just read about it; you'll build it, see it, and feel its power. This is the difference between knowing a concept and truly mastering it for production. Good luck, and happy tracing!