Day 3: Structured Logging with MDC Context Propagation – The Digital Thread in Your Distributed Fabric

Lesson 3 · 60 min

Welcome back, fellow architects of resilience.

Yesterday, we laid the groundwork for understanding the journey of a request across services with Distributed Tracing. Today, we’re going to weave an even finer thread into that fabric: Structured Logging with MDC Context Propagation. This isn't just about dumping text; it's about crafting a digital narrative for every single interaction in your system, a narrative that tells a complete story, not just fragmented sentences.

In ultra-high-scale systems, where a single user request might touch dozens, if not hundreds, of microservices, traditional logging quickly becomes a liability. Imagine trying to debug why a payment failed when you have millions of logs per second across thousands of containers. Without a coherent thread connecting all these disparate log lines back to that one specific payment request, you're not debugging; you're playing a frustrating game of "Where's Waldo?" in a digital haystack.

This is where MDC (Mapped Diagnostic Context) steps in. It's not just a nice-to-have; it's a critical enabler for effective troubleshooting in production.

Why Your Logs Need a Digital Thread

Think about how you usually log: log.info("Processing order for user: " + userId);. This is fine for a monolithic application. But in a distributed system, that userId might be processed by Service A, then Service B, then Service C, all on different machines, potentially handled by different threads. How do you tie all these log lines together?

The answer is context propagation. We need to carry a unique identifier (like a correlationId or traceId) through the entire request lifecycle, from the moment it hits your API Gateway until it completes its journey. MDC allows us to attach this context to every log statement made by the current thread, automatically.

Core Concepts: MDC and Structured Logging

Component Architecture

[Diagram: Gateway Service (JVM thread) holds MDC {correlationId: "abc-123"}; a JSON encoder serializes each log line and ships it asynchronously to a log aggregator (ELK/Splunk), where entries such as {"level":"INFO", "id":"abc-123"}, {"level":"DEBUG", "id":"abc-123"}, and {"level":"ERROR", "id":"abc-123"} are indexed by id.]
  1. Mapped Diagnostic Context (MDC):

  • MDC is a thread-local map managed by logging frameworks (like Logback or Log4j2).

  • You put key-value pairs into MDC, and these pairs are then automatically included in all log messages emitted by that thread.

  • The "Aha!" Moment: When a request comes in, a unique correlationId is generated or extracted. This ID is then placed into MDC. As the request flows through your service, any log statement made by the handling thread will automatically include this correlationId. When this service calls another downstream service, the correlationId is passed in the HTTP headers. The downstream service then extracts it and places it into its MDC, continuing the chain.

  • The Nuance: MDC is ThreadLocal. This is crucial. If your application uses asynchronous processing (e.g., CompletableFuture, ExecutorService, reactive frameworks like Project Reactor), the context does not automatically flow to new threads. This is a common pitfall. You need explicit mechanisms (such as copying the context with MDC.getCopyOfContextMap()/MDC.setContextMap(), Micrometer's ContextSnapshot, or custom Reactor Context operators) to propagate MDC across thread boundaries. This is where the real production-grade difference lies.
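
The nuance above is easiest to see in code. Below is a minimal, dependency-free sketch of the pattern: `CONTEXT` and `withContext` are stand-ins invented for this example so it runs without SLF4J on the classpath; with the real MDC you would capture `MDC.getCopyOfContextMap()` on the submitting thread and call `MDC.setContextMap(...)` inside the task in exactly the same way.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MdcPropagationSketch {
    // Thread-local map, mirroring how MDC stores context per thread.
    static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    // Wraps a task so the submitting thread's context is copied into the
    // worker thread for the duration of the task, then cleared afterwards.
    static Runnable withContext(Runnable task) {
        // Snapshot is taken on the CALLER's thread, at wrap time.
        Map<String, String> snapshot = new HashMap<>(CONTEXT.get());
        return () -> {
            CONTEXT.set(new HashMap<>(snapshot)); // install on the worker thread
            try {
                task.run();
            } finally {
                CONTEXT.remove(); // avoid leaking context into pooled threads
            }
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CONTEXT.get().put("correlationId", "abc-123");

        // Without wrapping: the worker thread has its own (empty) context.
        Future<String> lost = pool.submit(() -> CONTEXT.get().get("correlationId"));
        System.out.println("unwrapped: " + lost.get());  // prints "unwrapped: null"

        // With wrapping: the snapshot travels with the task.
        CompletableFuture<String> kept = new CompletableFuture<>();
        pool.submit(withContext(() -> kept.complete(CONTEXT.get().get("correlationId"))));
        System.out.println("wrapped: " + kept.get());    // prints "wrapped: abc-123"

        pool.shutdown();
    }
}
```

The same decorator idea is what Spring's `TaskDecorator` hook and Reactor's context operators give you in framework form.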

  2. Structured Logging:

  • Instead of plain text logs, structured logging outputs logs in a machine-readable format, typically JSON.

  • Each log entry becomes a structured object with fields like timestamp, level, message, serviceName, and crucially, our MDC-propagated correlationId.

  • Why it matters for 100M RPS: When you're dealing with immense log volumes, you can't grep your way to a solution. You need to query, filter, and aggregate logs efficiently using tools like Elasticsearch, Splunk, or Datadog. Structured logs make this trivial. You can ask: "Show me all logs for correlationId='abc-123' across all services, where level='ERROR'." This transforms debugging from a manual hunt to a targeted query.
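
As a concrete illustration, a single structured log entry might look like this (field names follow common logstash-logback-encoder defaults; the values and service name are made up):

```json
{
  "@timestamp": "2024-05-14T10:32:07.123Z",
  "level": "ERROR",
  "logger_name": "com.example.payment.PaymentService",
  "message": "Payment authorization failed",
  "correlationId": "abc-123",
  "serviceName": "payment-service"
}
```

Because `correlationId` is a first-class field rather than a substring buried in the message, a query such as `correlationId:"abc-123" AND level:"ERROR"` in Kibana or Splunk returns exactly the failing hops, across every service, in one shot.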

Flowchart

Request Inbound → INTERCEPT: generate/extract ID, MDC.put("correlationId", id) → EXECUTE: business logic (logs automatically include the ID) → PROPAGATE: downstream call, add header X-Correlation-ID → CLEANUP: afterCompletion, MDC.clear() / MDC.remove() → Request Outbound

Architecture & Control Flow

Imagine a user request hitting your system:

  1. API Gateway / Edge Service:

  • A HandlerInterceptor (Spring) or Filter (Servlet API) intercepts the incoming request.

  • It generates a unique correlationId (if not present in headers) or extracts an existing one.

  • It places this correlationId into the current thread's MDC (MDC.put("correlationId", id)).

  • It adds the correlationId to outgoing HTTP headers for any downstream calls.

  • Crucially, it ensures MDC.clear() is called when the request completes to prevent context leaking to other requests handled by the same recycled thread (the "sticky context" problem).

  2. Downstream Microservices (e.g., Order Service, Inventory Service):

  • Similarly, a HandlerInterceptor or Filter intercepts the incoming request.

  • It extracts the correlationId from the incoming HTTP headers.

  • It places this correlationId into its own thread's MDC.

  • It also adds the correlationId to outgoing HTTP headers for its downstream calls.

  • Again, MDC.clear() on request completion.

Data Flow: The correlationId starts as a unique string, flows through HTTP headers between services, and lives temporarily in the MDC of each processing thread, getting injected into every log line.

Real-World Application: The Netflix Debugging Story

Consider Netflix. When you click "Play" on a movie, that single action triggers a cascade of events across services: authentication, authorization, content delivery network (CDN) routing, billing checks, recommendation updates, analytics tracking, and more. If your video buffers, how do their SREs pinpoint the bottleneck? They don't just look at a single service's logs. They use a requestId (their version of correlationId) that is propagated across every service. With structured logs and a powerful log aggregation system (like their internal variant of ELK), they can query for all events related to your specific "Play" request, instantly seeing its journey, latency at each hop, and any errors, regardless of which of their thousands of services generated the log. This is the power you're building.

Assignment: Weaving the Digital Thread

State Machine

[State machine: CLEAN (MDC is empty) → MDC.put() → ACTIVE (logs = message + ID) → MDC.clear() → CLEAN. From ACTIVE, an async thread spawn loses the context unless the task is wrapped.]

You will build a simple distributed system consisting of two Spring Boot services: a GatewayService and a BackendService. Your goal is to implement structured logging with MDC context propagation, ensuring a correlationId is present in all log lines across both services for a given request.

Steps:

  1. Project Setup: Create two Spring Boot projects: gateway-service and backend-service.

  2. Logback Configuration: Configure both services to use Logback with JSON structured logging, including a correlationId field.

  3. Gateway Service:

  • Implement a HandlerInterceptor to generate a correlationId for incoming requests (if not present) and put it into MDC.

  • Ensure the correlationId is added to outgoing HTTP headers when the Gateway calls the Backend. Use a RestTemplate interceptor for this.

  • Implement a simple /greet endpoint that calls the Backend's /hello endpoint.

  4. Backend Service:

  • Implement a HandlerInterceptor to extract the correlationId from incoming HTTP headers and put it into MDC.

  • Implement a /hello endpoint that returns a simple message.

  5. Clean Up: Crucially, ensure MDC is cleared after each request in both services to prevent context leaks. Spring invokes the interceptor's afterCompletion hook automatically, but calling MDC.clear() or MDC.remove() inside it is your responsibility; it's vital to understand why.

  6. Verification: Make requests to the Gateway and observe the structured logs in both services. Confirm the correlationId is identical across related log lines.
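
When everything is wired up, a single call to the Gateway's /greet should produce correlated lines like these in the two services' consoles (timestamps, messages, and the UUID are illustrative; the point is the identical correlationId):

```json
{"@timestamp":"2024-05-14T10:32:07.101Z","level":"INFO","serviceName":"gateway-service","message":"Calling backend /hello","correlationId":"9f3c7a1e-0d2b-4c5f-8a6e-1b2c3d4e5f60"}
{"@timestamp":"2024-05-14T10:32:07.118Z","level":"INFO","serviceName":"backend-service","message":"Handling /hello","correlationId":"9f3c7a1e-0d2b-4c5f-8a6e-1b2c3d4e5f60"}
```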

Solution Hints

  • Logback JSON: Use the net.logstash.logback:logstash-logback-encoder dependency. In logback-spring.xml, configure the LogstashEncoder.

```xml
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
```

LogstashEncoder includes MDC properties in the output by default, so no extra configuration is needed for the correlationId field.
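LogstashEncoder also supports static custom fields, which is a convenient way to stamp every line with the serviceName field mentioned earlier. A sketch (the service name is illustrative):

```xml
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
        <!-- Static fields added to every JSON log line from this service -->
        <customFields>{"serviceName":"gateway-service"}</customFields>
    </encoder>
</appender>
```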
  • MDC Interceptor (Gateway/Backend):

```java
// Example: CorrelationIdInterceptor.java
// Note: jakarta.servlet.* is for Spring Boot 3+; on Boot 2 use javax.servlet.* instead.
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.slf4j.MDC;
import org.springframework.web.servlet.HandlerInterceptor;

import java.util.UUID;

public class CorrelationIdInterceptor implements HandlerInterceptor {
    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
        // Reuse an inbound ID if present; otherwise start a new trace.
        String correlationId = request.getHeader("X-Correlation-ID");
        if (correlationId == null || correlationId.isEmpty()) {
            correlationId = UUID.randomUUID().toString();
        }
        MDC.put("correlationId", correlationId);
        return true;
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) throws Exception {
        MDC.remove("correlationId"); // or MDC.clear() if this is the only MDC entry
    }
}
```
  • RestTemplate Interceptor (Gateway):

```java
// Example: RestTemplateConfig.java
import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestTemplateConfig {
    @Bean
    public RestTemplate restTemplate(RestTemplateBuilder builder) {
        return builder.additionalInterceptors(new RestTemplateCorrelationIdInterceptor()).build();
    }
}
```

```java
// Example: RestTemplateCorrelationIdInterceptor.java
import org.slf4j.MDC;
import org.springframework.http.HttpRequest;
import org.springframework.http.client.ClientHttpRequestExecution;
import org.springframework.http.client.ClientHttpRequestInterceptor;
import org.springframework.http.client.ClientHttpResponse;

import java.io.IOException;

public class RestTemplateCorrelationIdInterceptor implements ClientHttpRequestInterceptor {
    @Override
    public ClientHttpResponse intercept(HttpRequest request, byte[] body, ClientHttpRequestExecution execution) throws IOException {
        // Copy the current thread's correlationId into the outgoing request headers.
        String correlationId = MDC.get("correlationId");
        if (correlationId != null) {
            request.getHeaders().add("X-Correlation-ID", correlationId);
        }
        return execution.execute(request, body);
    }
}
```
  • WebMvcConfigurer (to register interceptors):

```java
// Example: WebConfig.java
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.InterceptorRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

@Configuration
public class WebConfig implements WebMvcConfigurer {
    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        registry.addInterceptor(new CorrelationIdInterceptor());
    }
}
```

This setup provides a robust foundation for observability. You're not just logging; you're creating a forensic trail that can cut through the noise of a high-volume distributed system, allowing you to quickly diagnose and resolve issues. This is how the pros do it.

Need help?