Day 2: Distributed Tracing with OpenTelemetry and Zipkin Integration
Welcome back, future troubleshooting maestros! Yesterday, we broke free from the println shackles and embraced Micrometer to understand single-service performance. We saw how a well-instrumented application can tell you what is happening internally. But here's the kicker: in a distributed system, "internally" isn't just one box. It's a symphony of services, each playing its part. What happens when the melody goes sour, and you can't tell if the problem is with the violins, the flutes, or the conductor?
That, my friends, is the chasm we bridge today. We're diving deep into Distributed Tracing, the bedrock of understanding how requests flow, and fail, across complex microservice landscapes.
Why Your Micrometer Metrics Aren't Enough (Yet)
Imagine a user clicks "Buy Now." This single action might touch a Frontend Service, then an Authentication Service, a User Profile Service, an Order Service, a Payment Service, and finally an Inventory Service. Each interaction is a hop. If the "Buy Now" button hangs, Micrometer might tell you the Order Service is slow, but it won't tell you why it's slow, which specific downstream call inside the Order Service is the culprit, or whether the Authentication Service was already struggling at the start, causing a domino effect.
This is where distributed tracing shines. It paints a complete picture, a narrative thread that follows a single request through every service it touches, showing you latency, errors, and context at each step.
Core Concepts: The Story of a Request
Traces, Spans, and Context Propagation
At its heart, a trace is the journey of a single request or transaction as it propagates through a distributed system. Think of it as a story.
Each step in that story (a call to a service, a database query, a method execution) is a span. Spans are the individual chapters, containing details like:
- Operation Name: What happened (e.g., `/api/users/{id}`, `saveToDatabase`).
- Start/End Time: How long it took.
- Tags/Attributes: Key-value pairs providing context (e.g., `http.status_code: 200`, `user.id: 123`).
- Span ID: A unique identifier for this specific operation.
- Parent Span ID: Links this span to its parent, forming a hierarchy.
- Trace ID: A unique identifier for the entire request journey.
The magic happens with context propagation. When Service A calls Service B, Service A injects the current trace and span IDs into the outgoing request headers (e.g., traceparent HTTP header). Service B then extracts these IDs, understands it's part of an ongoing trace, and creates a new child span linked to Service A's span. This is how the entire story, across service boundaries, is stitched together.
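The `traceparent` header follows the fixed W3C Trace Context format `version-traceId-spanId-flags`. In practice the agent injects and extracts it for you, but a minimal sketch of the extraction side (plain Java, no OTel dependency; class and method names here are illustrative, not from any library) makes the mechanics concrete:

```java
// Sketch: what Service B conceptually does with an incoming W3C
// traceparent header. Format: version-traceId-spanId-flags.
public class TraceparentDemo {
    record SpanContext(String traceId, String parentSpanId, boolean sampled) {}

    static SpanContext extract(String traceparent) {
        String[] parts = traceparent.split("-");
        if (parts.length != 4) throw new IllegalArgumentException("malformed traceparent");
        // parts[0] = version, parts[1] = 32-hex trace ID,
        // parts[2] = 16-hex parent span ID, parts[3] = trace flags
        boolean sampled = (Integer.parseInt(parts[3], 16) & 0x01) == 1;
        return new SpanContext(parts[1], parts[2], sampled);
    }

    public static void main(String[] args) {
        SpanContext ctx =
            extract("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(ctx.traceId());      // the shared journey ID
        System.out.println(ctx.parentSpanId()); // the caller's span ID
    }
}
```

Service B would then create its own child span carrying the same trace ID, with `parentSpanId` as its parent.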
OpenTelemetry: The Universal Translator
Before OpenTelemetry (OTel), every tracing vendor (Jaeger, Zipkin, Datadog, New Relic) had its own SDKs and data formats. This meant vendor lock-in and painful migrations. OpenTelemetry changed the game. It's a vendor-neutral observability framework for generating, collecting, and exporting telemetry data (traces, metrics, and logs).
Why OpenTelemetry?
- Standardization: Write your instrumentation once, export to any OTel-compatible backend.
- Rich Ecosystem: Supports dozens of languages and frameworks.
- Auto-Instrumentation: For many popular frameworks (like Spring Boot), you can often get basic tracing with zero code changes, thanks to language agents. This is where we'll start!
Zipkin: Your Trace Storyboard
Zipkin is an open-source distributed tracing system. It collects and visualizes trace data, allowing you to see the full request flow, identify latency bottlenecks, and understand dependencies. We're using Zipkin because it's lightweight, easy to set up, and provides an excellent visual interface for understanding traces.
System Architecture: Our Traced Microservices
We'll set up a simple two-service architecture: UserService and OrderService.
1. A client makes a request to `UserService`.
2. `UserService` processes it and then makes an internal HTTP call to `OrderService`.
3. Both services are instrumented using the OpenTelemetry Java Agent.
4. The OTel Agent in each service captures spans and sends them to an OpenTelemetry Collector.
5. The OTel Collector, in turn, exports these spans to Zipkin.
6. You'll then view the full trace in the Zipkin UI.
This architecture is robust: the OTel Collector acts as a buffer and processor, decoupling your application from the tracing backend.
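A minimal Docker Compose sketch of this topology might look like the following (image names are the standard published ones; ports are the conventional defaults, so adjust to taste):

```yaml
services:
  zipkin:
    image: openzipkin/zipkin
    ports:
      - "9411:9411"          # Zipkin UI and API
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"          # OTLP gRPC receiver
      - "4318:4318"          # OTLP HTTP receiver
    depends_on:
      - zipkin
```

The Java services themselves can run on the host and point their OTLP exporter at `localhost:4317`, or join this network and use `otel-collector:4317`.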
Control Flow & Data Flow: Following the Breadcrumbs
1. Client Request: `GET /api/users/{id}` to `UserService`.
2. UserService Entry: The OpenTelemetry Java Agent automatically intercepts this request, starts a new root span (e.g., `GET /api/users/{id}`), and assigns a unique Trace ID and Span ID.
3. UserService Internal Call: `UserService` makes an HTTP call to `OrderService` (e.g., `GET /api/orders/user/{id}`). Before sending, the OTel Agent automatically injects the current Trace ID and Span ID into the outgoing HTTP headers (`traceparent`, `tracestate`).
4. OrderService Entry: The OTel Agent in `OrderService` intercepts the incoming request and extracts the Trace ID and Parent Span ID from the headers. It then starts a new child span for this `OrderService` operation, linking it to the `UserService` span.
5. Span Completion & Export: As operations complete in both services, their respective spans are finalized (duration calculated, status set) and sent to the OpenTelemetry Collector.
6. Collector to Zipkin: The OTel Collector receives spans from both services, batches them, and exports them to Zipkin.
7. Zipkin Visualization: Zipkin reconstructs the entire trace from Trace ID, Span ID, and Parent Span ID, letting you see the nested calls and their timings.
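To make that reconstruction concrete, here is a small self-contained sketch (plain Java; span names, IDs, and timings are made up for illustration) of how a backend like Zipkin stitches a flat list of spans into a tree using `spanId`/`parentSpanId`:

```java
import java.util.*;

public class TraceTreeDemo {
    // A stripped-down span: just the fields needed for stitching.
    record Span(String spanId, String parentSpanId, String name, long durationMs) {}

    // Group spans by parent ID, then walk down from the root (parent == null).
    static String render(List<Span> spans) {
        Map<String, List<Span>> children = new HashMap<>();
        Span root = null;
        for (Span s : spans) {
            if (s.parentSpanId() == null) root = s;
            else children.computeIfAbsent(s.parentSpanId(), k -> new ArrayList<>()).add(s);
        }
        StringBuilder sb = new StringBuilder();
        render(root, children, 0, sb);
        return sb.toString();
    }

    static void render(Span s, Map<String, List<Span>> children, int depth, StringBuilder sb) {
        sb.append("  ".repeat(depth))
          .append(s.name()).append(" (").append(s.durationMs()).append(" ms)\n");
        for (Span c : children.getOrDefault(s.spanId(), List.of()))
            render(c, children, depth + 1, sb);
    }

    public static void main(String[] args) {
        System.out.print(render(List.of(
            new Span("a1", null, "GET /api/users/{id}", 120),
            new Span("b2", "a1", "GET /api/orders/user/{id}", 80),
            new Span("c3", "b2", "SELECT orders", 35))));
        // Prints the nested call hierarchy:
        // GET /api/users/{id} (120 ms)
        //   GET /api/orders/user/{id} (80 ms)
        //     SELECT orders (35 ms)
    }
}
```

This is exactly the waterfall view the Zipkin UI renders, with bars instead of indentation.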
Production-Grade Insights: Beyond the Basics
Sampling is Your Friend (and Foe): At 100M RPS, you cannot trace every request. It's too much data, too much overhead. Sampling is crucial.
- Head-based sampling: Decides whether to sample a trace at its very beginning. Simple, but you might miss interesting errors downstream if the initial decision was to discard.
- Tail-based sampling: Collects all spans for a trace, then decides after the trace completes whether to keep it (e.g., if it had an error or was unusually slow). This is far more intelligent but requires more processing power in your collector. At high scale, tail-based sampling with intelligent rules (e.g., always sample errors, always sample requests above N ms) is the way to go.
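With the Java agent, head-based ratio sampling is configured through the documented OTel autoconfigure system properties; a sketch (the 10% ratio is an arbitrary example value, and the agent path is a placeholder) looks like this. Tail-based sampling, by contrast, lives in the Collector (the contrib distribution ships a tail sampling processor), not in the agent:

```shell
java -javaagent:/path/to/opentelemetry-javaagent.jar \
     -Dotel.traces.sampler=parentbased_traceidratio \
     -Dotel.traces.sampler.arg=0.1 \
     -jar user-service.jar
```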
High-Cardinality Attributes are Expensive: Adding too many unique attributes (e.g., `session.id` for every request) to spans inflates your data volume and storage costs. Be judicious. Focus on attributes that help you filter and understand performance, not unique identifiers for every single user interaction.

The Collector is Not Optional: For production, always use an OpenTelemetry Collector. It provides:

- Batching: Reduces network calls from your services.
- Retries: Buffers and retries exports if your tracing backend is temporarily unavailable.
- Processing: Filters, samples, adds attributes, and transforms data before sending it to the backend.
- Security: Can handle authentication/authorization to your backend.
Connecting Traces to Logs (Teaser for Day 3): A trace tells you what happened and how long it took. Logs tell you why. The ultimate power comes from linking your logs to your traces using Trace ID and Span ID. This allows you to jump from a slow span in Zipkin directly to the relevant log entries in your logging system.
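As a preview: the Java agent's Logback MDC instrumentation injects `trace_id` and `span_id` keys into the logging context, so a pattern like the following (a sketch; verify the exact key names against your agent version) stamps every log line with its trace:

```xml
<!-- logback.xml: include the MDC keys populated by the OTel Java agent -->
<pattern>%d{HH:mm:ss.SSS} %-5level [trace_id=%X{trace_id} span_id=%X{span_id}] %logger{36} - %msg%n</pattern>
```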
Assignment: Trace Your Distributed Java System
Your mission, should you choose to accept it, is to build a two-service Spring Boot application, instrument it with OpenTelemetry, and visualize the traces in Zipkin.
Steps:
1. Setup Docker Compose: Get Zipkin and an OpenTelemetry Collector running.
2. Create Two Spring Boot Services:
   - `user-service`: Exposes an endpoint `/api/users/{id}`. This service will call `order-service`.
   - `order-service`: Exposes an endpoint `/api/orders/user/{id}`.
3. Download the OpenTelemetry Java Agent: This magical `.jar` will do most of the heavy lifting.
4. Run Services with the Agent: Start your Java applications, attaching the OpenTelemetry Java Agent and configuring it to export traces to your OTel Collector.
5. Trigger Traces: Use `curl` to hit your `user-service` endpoint, which in turn calls `order-service`.
6. Verify in Zipkin: Open the Zipkin UI, find your traces, and marvel at the end-to-end story of your request.
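Triggering and finding a trace might look like this (assuming `user-service` listens on Spring Boot's default port 8080 and Zipkin on its default 9411; both are choices you can change):

```shell
# Hit the entry-point service; the agent propagates context downstream.
curl http://localhost:8080/api/users/42

# Then open the Zipkin UI at http://localhost:9411 and search for
# serviceName=user-service to find the end-to-end trace.
```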
Solution Hints: Your Compass
Docker Compose:

- `zipkin`: Use `openzipkin/zipkin`.
- `otel-collector`: Use `otel/opentelemetry-collector-contrib`. You'll need an `otel-collector-config.yaml` to configure it to receive OTLP traces and export them to Zipkin:
  - `receivers`: `otlp` (gRPC and HTTP).
  - `exporters`: `zipkin` (pointing to the `zipkin` service in Docker Compose).
  - `service`: `pipelines` -> `traces` -> `receivers: [otlp]`, `exporters: [zipkin]`.

Java Agent Download: Find the latest `opentelemetry-javaagent.jar` release on the OpenTelemetry Java Agent GitHub page.

Running Java Services:

`java -javaagent:/path/to/opentelemetry-javaagent.jar -Dotel.service.name=user-service -Dotel.exporter.otlp.endpoint=http://localhost:4317 -Dotel.resource.attributes="service.version=1.0" -jar user-service.jar`

Remember to change `user-service` to `order-service` for the second service and adjust `server.port` in `application.properties`. `localhost:4317` (or `otel-collector:4317` if running the services inside the Docker network) is the default OTLP gRPC endpoint.

Spring Boot `WebClient`: Use `WebClient` in `UserService` to make the call to `OrderService`. OpenTelemetry automatically instruments it.
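Putting those collector hints together, a minimal `otel-collector-config.yaml` might look like this (the `zipkin` exporter endpoint assumes your Compose service is named `zipkin`):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:                      # batch spans before export
exporters:
  zipkin:
    endpoint: http://zipkin:9411/api/v2/spans
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [zipkin]
```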
This hands-on journey will solidify your understanding of distributed tracing. You won't just read about it; you'll build it, see it, and feel its power. This is the difference between knowing a concept and truly mastering it for production. Good luck, and happy tracing!