Day 3: Building a Distributed Log Collector Service

Lesson 3 (2-3 hours)

What We're Building Today

Today we're constructing a production-grade log collector service that forms the backbone of any distributed logging system. Our implementation will deliver:

File System Watcher Service - Real-time detection of log file changes using Java NIO.2 WatchService with configurable polling strategies

Event-Driven Stream Processing - Kafka-backed message streaming that handles log entries as they're discovered, with guaranteed delivery semantics

Resilient Collection Pipeline - Circuit breaker patterns and retry mechanisms that gracefully handle file system failures and temporary outages

Distributed State Management - Redis-backed offset tracking and deduplication to ensure exactly-once processing across service restarts


Why This Matters: The Foundation of Observable Systems

Every major technology company processes billions of log entries daily. Netflix's logging infrastructure ingests over 8 trillion events per day, while Uber processes 100+ terabytes of logs daily for real-time decision making. The log collector is the critical first component that determines whether your observability stack can scale.

The patterns we're implementing today directly address the fundamental challenges of distributed log processing: how do you reliably capture, deduplicate, and stream log data from thousands of services without losing events or overwhelming downstream systems? The architectural decisions we make in the collector service cascade through the entire observability pipeline, affecting everything from alert accuracy to debugging capabilities.


System Design Deep Dive: Core Distributed Patterns

Component Architecture

[System architecture diagram] Production-ready event streaming with resilience patterns. Application services layer: a Log Generator (Spring Boot, ~10 events/sec), an API Gateway (Spring Boot, load balancing), and the Log Collector (Spring Boot, file watcher). Events stream into Apache Kafka (topic `log-events`, ~1,000 msg/sec) with Redis providing deduplication caching, PostgreSQL for persistent storage, and Prometheus/Grafana for observability.

1. File System Watching Strategy: Push vs Pull Trade-offs

Traditional log collection relies on periodic file scanning, but this approach introduces latency and resource waste. Modern systems need sub-second detection capabilities. We'll implement the WatchService pattern with intelligent fallbacks:

java
// poll() with a timeout (rather than a blocking take()) keeps the watcher
// loop responsive so it can also run the periodic validation scans
WatchKey key = watchService.poll(100, TimeUnit.MILLISECONDS);

Trade-off Analysis: Event-driven watching provides immediate notification but consumes file descriptors and can overwhelm systems during log bursts. Our hybrid approach combines immediate watching with periodic validation scans to handle edge cases like file rotation and missed events.

Failure Mode: File system events can be lost during high I/O scenarios. We implement a reconciliation pattern that periodically validates our in-memory state against actual file system state, ensuring no log entries are permanently lost.
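
That reconciliation pass is straightforward to sketch in plain Java. The class and method names below are illustrative rather than part of the final service; the idea is simply to compare the offsets we believe we have committed against actual on-disk sizes.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Reconciliation sketch: any file whose current size exceeds its committed
// offset was either appended to while an event was lost, or never seen at all.
class WatchReconciler {
    private final Map<Path, Long> committedOffsets = new HashMap<>();

    void recordOffset(Path file, long offset) {
        committedOffsets.put(file, offset);
    }

    // Files that still have unread bytes and need a catch-up read
    List<Path> filesNeedingCatchUp(Path watchDir) throws IOException {
        try (Stream<Path> files = Files.list(watchDir)) {
            return files
                .filter(Files::isRegularFile)
                .filter(f -> {
                    try {
                        return Files.size(f) > committedOffsets.getOrDefault(f, 0L);
                    } catch (IOException e) {
                        return false; // rotated away between list and stat
                    }
                })
                .toList();
        }
    }
}
```

Any file returned by `filesNeedingCatchUp` is re-read from its committed offset, which covers both missed modification events and files created before the watcher started.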

2. Exactly-Once Semantics: The Offset Management Challenge

Distributed log collection faces the classic exactly-once delivery problem. Services restart, networks partition, and files rotate unexpectedly. Our solution implements distributed offset management using Redis as a coordination layer:

java
@Component
public class OffsetManager {

    private final RedisTemplate<String, Long> redisTemplate;

    public OffsetManager(RedisTemplate<String, Long> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    // Write the offset ahead of processing; the 24h TTL expires stale entries
    public void commitOffset(String filePath, long position) {
        redisTemplate.opsForValue().set(
            "offset:" + filePath,
            position,
            Duration.ofHours(24)
        );
    }
}

Architectural Insight: The offset management pattern must balance consistency with availability. We accept eventual consistency for performance but implement write-ahead logging to Redis before processing events, ensuring we can recover from any point of failure.
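
A minimal sketch of the reading side of this pattern, with an in-memory map standing in for the Redis offset store (the class and map names are illustrative):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Incremental reader sketch: resume from the committed position and commit
// the new position only after the bytes are successfully handed off.
class IncrementalReader {
    private final Map<String, Long> offsets = new ConcurrentHashMap<>();

    String readNewContent(String filePath) throws IOException {
        long committed = offsets.getOrDefault(filePath, 0L);
        try (RandomAccessFile file = new RandomAccessFile(filePath, "r")) {
            long length = file.length();
            if (length < committed) {
                committed = 0L; // file truncated or rotated: start over
            }
            file.seek(committed);
            byte[] chunk = new byte[(int) (length - committed)];
            file.readFully(chunk);
            offsets.put(filePath, length); // commit after a successful read
            return new String(chunk, StandardCharsets.UTF_8);
        }
    }
}
```

Committing only after a successful read gives at-least-once behavior on crash recovery; the deduplication layer then upgrades that to effectively exactly-once.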

3. Backpressure and Flow Control: Handling Log Bursts

Production systems experience unpredictable log volume spikes. A single service deployment can generate gigabytes of logs in minutes. Our collector implements adaptive backpressure using Kafka's built-in flow control combined with local buffering:

java
@Service
public class LogEventBuffer {
    private final BlockingQueue<LogEvent> buffer =
        new ArrayBlockingQueue<>(10000);

    // Returns false when the buffer stays full past the timeout,
    // signalling the caller to apply backpressure upstream
    public boolean offer(LogEvent event) {
        try {
            return buffer.offer(event, 100, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

Scale Consideration: The buffer size directly impacts memory usage and latency. We implement dynamic buffer sizing based on downstream processing rates, preventing out-of-memory conditions while maintaining low latency during normal operations.
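
The load-shedding behavior is easy to demonstrate with plain JDK types. The names and the drop policy below are illustrative; a production collector would pause the file reader rather than drop events:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Bounded buffering sketch: once the queue is full, offers time out and the
// caller can throttle the reader instead of exhausting the heap.
class BoundedBuffer {
    private final BlockingQueue<String> queue;
    private long dropped = 0;

    BoundedBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    boolean offer(String event) throws InterruptedException {
        boolean accepted = queue.offer(event, 10, TimeUnit.MILLISECONDS);
        if (!accepted) {
            dropped++; // a real collector would apply backpressure here
        }
        return accepted;
    }

    long droppedCount() { return dropped; }

    String poll() { return queue.poll(); }
}
```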

4. Circuit Breaker Pattern: Preventing Cascade Failures

When downstream services (Kafka, Redis) become unavailable, naive retry logic can amplify problems. We implement the circuit breaker pattern using Resilience4j to fail fast and prevent resource exhaustion:

java
@CircuitBreaker(name = "kafka-producer",
                fallbackMethod = "fallbackLogEvent")
@Retry(name = "kafka-producer")
public void sendToKafka(LogEvent event) {
    kafkaTemplate.send("log-events", event);
}

// Resilience4j invokes this when the breaker is open or retries are exhausted
public void fallbackLogEvent(LogEvent event, Throwable cause) {
    // Persist locally for replay once Kafka recovers (see Network Partition Recovery)
}

Production Reality: Circuit breakers must be tuned based on actual failure patterns. Our implementation includes adaptive thresholds that adjust based on historical success rates, preventing false positives during normal load variations.
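
Resilience4j does this work for us, but the state machine it implements is small enough to sketch by hand, which makes the tuning knobs concrete. The thresholds below are illustrative:

```java
// Hand-rolled breaker sketch: CLOSED counts failures, OPEN fails fast until
// a cooldown passes, then a HALF_OPEN trial call decides whether to close.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private final int failureThreshold;
    private final long cooldownMillis;

    SimpleCircuitBreaker(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    synchronized boolean allowRequest(long now) {
        if (state == State.OPEN && now - openedAt >= cooldownMillis) {
            state = State.HALF_OPEN; // let one trial call through
        }
        return state != State.OPEN;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void recordFailure(long now) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = now;
        }
    }
}
```

Passing the clock in explicitly keeps the state machine deterministic and testable; Resilience4j adds sliding windows and half-open call limits on top of the same idea.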

5. Distributed Tracing: Observing the Observer

A log collector that can't be debugged is useless in production. We implement end-to-end tracing using Micrometer and Zipkin, creating a trace for each log entry from file detection through Kafka delivery:

java
public void processLogEntry(String filePath, String content) {
    Span span = tracer.nextSpan()
        .name("log-collection")
        .tag("file.path", filePath)
        .tag("content.size", String.valueOf(content.length()))
        .start();
    try (Tracer.SpanInScope scope = tracer.withSpan(span)) {
        // Processing logic runs inside the span
    } finally {
        span.end();
    }
}

Observability Insight: The collector's own metrics become critical for debugging collection issues. We emit custom metrics for file discovery rate, processing latency, and error frequencies that directly correlate with business impact.


Implementation Walkthrough: Building Production-Ready Components

Our implementation centers around the LogCollectorService that orchestrates file watching, event processing, and downstream delivery. The architecture follows Spring Boot's reactive patterns while maintaining explicit error boundaries.

Core Service Architecture

The LogCollectorService implements a producer-consumer pattern with multiple worker threads handling different aspects of collection:

java
@Service
public class LogCollectorService {
    private final ExecutorService watcherPool = 
        Executors.newFixedThreadPool(4);
    private final ExecutorService processingPool = 
        Executors.newFixedThreadPool(8);
}

Architectural Decision: Separate thread pools prevent file watching from blocking event processing. The 1:2 ratio (4 watchers, 8 processors) reflects the I/O-bound nature of file operations versus CPU-bound log parsing.

Event Processing Pipeline

Each log entry flows through a structured pipeline that handles deduplication, formatting, and reliable delivery:

  1. File Event Detection - NIO.2 WatchService detects file modifications

  2. Offset-based Reading - Read only new content using stored file positions

  3. Event Enrichment - Add metadata (hostname, service name, collection timestamp)

  4. Deduplication Check - Redis-based duplicate detection using content hashing

  5. Kafka Delivery - Asynchronous delivery with retry and dead letter handling
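
Step 4 can be sketched with JDK crypto primitives. In the real pipeline the seen-hash set lives in Redis (for example a `SET ... NX` with a TTL) so every collector instance shares one view; the in-memory set below is only illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.HexFormat;
import java.util.Set;

// Deduplication sketch: key a set by the SHA-256 hash of the log line,
// so replayed lines after a crash are recognized and skipped.
class ContentDeduplicator {
    private final Set<String> seenHashes = new HashSet<>();

    boolean isDuplicate(String logLine) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        String hash = HexFormat.of().formatHex(
            digest.digest(logLine.getBytes(StandardCharsets.UTF_8)));
        // Set.add returns false when the hash was already present
        return !seenHashes.add(hash);
    }
}
```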

Configuration-Driven Scalability

Production deployments require tunable behavior without code changes. Our implementation externalizes all critical parameters:

yaml
log-collector:
  watch-directories:
    - /var/log/applications
    - /opt/services/logs
  batch-size: 100
  flush-interval: 5s
  max-file-size: 100MB
  circuit-breaker:
    failure-threshold: 50
    timeout: 30s

Production Considerations: Performance and Failure Scenarios

Memory Management and GC Pressure

Log collection is inherently memory-intensive. Large log files and high-frequency events can cause significant GC pressure. Our implementation includes explicit memory management:

  • Streaming file reading to avoid loading entire files into memory

  • Object pooling for frequently allocated LogEvent instances

  • Configurable batch sizes to balance latency versus memory usage

File Rotation Handling

Production systems rotate logs unpredictably. Our service handles rotation scenarios through inode tracking and graceful transition logic that maintains offset accuracy across file system changes.
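
On POSIX file systems, `BasicFileAttributes.fileKey()` exposes the underlying inode, which makes a minimal rotation check possible. The class name is illustrative, and note that `fileKey()` may return null on platforms that don't support it:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Objects;

// Rotation detection sketch: a path whose file key changes between checks
// now points at a different underlying file, so its offset must be reset.
class RotationDetector {
    private Object lastKnownKey;

    boolean wasRotated(Path path) throws IOException {
        Object currentKey =
            Files.readAttributes(path, BasicFileAttributes.class).fileKey();
        boolean rotated = lastKnownKey != null
            && !Objects.equals(lastKnownKey, currentKey);
        lastKnownKey = currentKey;
        return rotated;
    }
}
```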

Network Partition Recovery

When Kafka becomes unavailable, the collector implements local persistence with automatic replay once connectivity is restored. This prevents log loss during infrastructure maintenance or outages.

Performance Benchmarks

Under normal load, the collector processes 10,000 log entries per second per instance with sub-100ms latency. Memory usage remains stable under 512MB with proper configuration, making it suitable for deployment alongside application services.


Scale Connection: FAANG-Level Systems

This exact pattern scales to Netflix's unified logging platform that handles 8 trillion events daily. The key insight is that, with proper offset management and circuit breaker patterns in place, collection complexity grows far more slowly than data volume. By implementing those patterns today, you're building a foundation that can scale toward petabyte-level processing primarily through configuration changes rather than architectural rewrites.

Amazon's CloudWatch Logs and Google's Cloud Logging both implement these same fundamental patterns, proving that today's implementation directly translates to hyperscale production environments.


Next Steps: Tomorrow's Advanced Concepts

Day 4 will build on today's collector by implementing intelligent log parsing. We'll add pattern recognition, structured data extraction, and schema validation - transforming raw log streams into queryable, analytics-ready data structures that feed business intelligence systems.

The parsing layer will introduce new distributed patterns around schema evolution, data validation, and transformation pipelines that maintain backward compatibility while enabling rich analytics capabilities.

Day 3: Understanding ThreadPoolTaskScheduler and TaskExecutor – The Engine Behind Your Scheduled Tasks

Lesson 3 (2-3 hours)

The Story Behind the Threads

Picture this: You're running a busy restaurant kitchen. Yesterday (Day 2), we learned how to schedule different cooking tasks using @Scheduled - like preparing appetizers every 5 minutes, checking main dishes every 30 seconds, and cleaning counters at midnight. But here's the thing - who's actually doing this work?

In Spring Boot's scheduling world, @Scheduled is just the "recipe book" that tells us WHAT to do and WHEN. But the actual "chefs" doing the work are threads managed by something called ThreadPoolTaskScheduler. Today, we're going behind the scenes to understand how Spring Boot manages these worker threads and how we can optimize them for better performance.

Why This Matters in Real Systems

Companies like Netflix process millions of scheduled tasks daily - from updating user recommendations to cleaning up temporary files. Reddit schedules hundreds of thousands of comment processing tasks. Discord handles millions of message cleanup operations. The difference between a system that handles 100 users versus 100,000 users often comes down to how well you manage your thread pools.

Core Concepts: The Threading Architecture

ThreadPoolTaskScheduler vs TaskExecutor

Think of these as two different types of workforce management:

ThreadPoolTaskScheduler: Your specialized timing crew

  • Handles scheduled tasks with precise timing

  • Manages recurring operations (cron, fixed-rate, fixed-delay)

  • Built specifically for time-based task execution

  • Includes scheduling capabilities beyond just execution

TaskExecutor: Your general-purpose workforce

  • Handles asynchronous task execution

  • Focuses on "do this work now" rather than "do this work at a specific time"

  • Often used for @Async methods

  • Pure execution without scheduling logic

The Default Behavior (Why It's Not Enough)

By default, Spring Boot uses a single-threaded scheduler. Imagine our restaurant kitchen with only one chef - no matter how many dishes you schedule, they all wait in line. This creates a bottleneck that becomes critical as your application scales.
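
The single-chef problem is easy to reproduce with the JDK's own `ScheduledExecutorService`, which is what `ThreadPoolTaskScheduler` wraps. With one thread, a slow task delays an unrelated task that was due at the same moment; with two, it doesn't. The timings below are illustrative:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Measures how long a fast task waits past its due time when scheduled
// right after a slow (300 ms) task, for a given pool size.
class SchedulerBottleneckDemo {
    static long secondTaskDelayMillis(int poolSize) throws InterruptedException {
        ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(poolSize);
        CountDownLatch done = new CountDownLatch(1);
        long[] startedAt = new long[1];
        long begin = System.nanoTime();

        // Slow task due immediately: occupies a thread for 300 ms
        scheduler.schedule(() -> {
            try { Thread.sleep(300); } catch (InterruptedException ignored) { }
        }, 0, TimeUnit.MILLISECONDS);

        // Fast task also due immediately: just records when it actually ran
        scheduler.schedule(() -> {
            startedAt[0] = System.nanoTime();
            done.countDown();
        }, 0, TimeUnit.MILLISECONDS);

        done.await();
        scheduler.shutdownNow();
        return (startedAt[0] - begin) / 1_000_000;
    }
}
```

With `poolSize = 1` the fast task waits roughly the full 300 ms behind the slow one; with `poolSize = 2` it starts almost immediately.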

Architecture: How It Fits in Our Ultra-Scalable System

In our larger task scheduler implementation, thread pool management sits at the core:

Code
[Client Requests] 
    ↓
[API Gateway]
    ↓
[Task Scheduler Service] ← ThreadPoolTaskScheduler manages execution
    ↓
[Message Queue] ← Tasks get queued for processing
    ↓
[Worker Nodes] ← Each node has optimized thread pools

Control Flow: From Schedule to Execution

  1. Task Registration: Your @Scheduled method registers with the TaskScheduler

  2. Thread Assignment: TaskScheduler assigns execution to available threads

  3. Execution: Worker thread executes your task

  4. Completion: Thread returns to pool for next task

  5. Monitoring: Spring tracks execution metrics

System Design Concepts Used

Concurrency Management

  • Thread Pool Pattern: Reusing threads instead of creating new ones

  • Work Queue: Tasks wait in queue when all threads are busy

  • Resource Optimization: Limiting memory usage through bounded pools

Performance Optimization

  • Pool Sizing: Balancing between responsiveness and resource usage

  • Thread Lifecycle Management: Efficient creation, reuse, and cleanup

  • Load Distribution: Spreading work across available threads

Real-Time Production Application

Consider an e-commerce platform handling:

  • Order processing: Every 10 seconds

  • Inventory updates: Every 30 seconds

  • Email campaigns: Every hour

  • Database cleanup: Daily at 2 AM

Without proper thread management, one slow database cleanup could delay urgent order processing. With optimized thread pools, each type of task gets appropriate resources.
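
One way to guarantee that isolation is the bulkhead approach: a dedicated pool per task category. The sketch below uses plain JDK schedulers with illustrative pool sizes and task names:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Bulkhead sketch: the long "database cleanup" runs on its own scheduler,
// so the order-processing scheduler always has a free thread.
class IsolatedSchedulers {
    final ScheduledExecutorService orderPool = Executors.newScheduledThreadPool(2);
    final ScheduledExecutorService maintenancePool = Executors.newScheduledThreadPool(1);

    // Returns how long an order task waited while cleanup hogged its own pool
    long runOrderTaskWhileCleanupBusy() throws Exception {
        maintenancePool.submit(() -> {
            try { Thread.sleep(500); } catch (InterruptedException ignored) { }
        });
        long begin = System.nanoTime();
        Future<?> order = orderPool.schedule(() -> { }, 0, TimeUnit.MILLISECONDS);
        order.get(); // completes without waiting on the cleanup task
        long elapsedMillis = (System.nanoTime() - begin) / 1_000_000;
        orderPool.shutdownNow();
        maintenancePool.shutdownNow();
        return elapsedMillis;
    }
}
```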

State Changes in Thread Pool Management

State Machine

[State machine diagram] Task states run SCHEDULED → QUEUED → RUNNING → SUCCESS or FAILED (failed tasks may be rescheduled); the thread and pool lifecycles are summarized below:
Code
Thread States:
IDLE → ASSIGNED → EXECUTING → COMPLETED → IDLE

Pool States:
STARTING → ACTIVE → SCALING → SHUTTING_DOWN → TERMINATED

Component Architecture

Component Architecture

[Component diagram] Inside the Spring Boot application, the scheduler configuration creates a ThreadPoolTaskScheduler (pool size 10) whose worker threads (T1, T2, T3, ...) drain a queue of waiting @Scheduled tasks (6 active), with real-time monitoring via a dashboard backed by Actuator metrics.

Our ThreadPoolTaskScheduler configuration affects three key areas:

  1. Core Pool Size: Minimum number of active threads

  2. Max Pool Size: Maximum threads under load

  3. Queue Capacity: How many tasks can wait
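
These three knobs interact in a way that surprises many people: a pool grows past its core size only after the queue is completely full. A raw `ThreadPoolExecutor` demonstrates this (note this applies to general executors such as those behind `TaskExecutor`; `ThreadPoolTaskScheduler` itself exposes a single pool-size setting over an unbounded delay queue):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// With core=1, max=2, queue capacity=1: the first task takes the core
// thread, the second waits in the queue, and only the third (queue full)
// forces the pool to grow toward max.
class PoolGrowthDemo {
    static int poolSizeUnderLoad() throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 2,                              // core=1, max=2
            30, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(1));      // queue capacity 1
        CountDownLatch release = new CountDownLatch(1);
        Runnable blocker = () -> {
            try { release.await(); } catch (InterruptedException ignored) { }
        };
        pool.execute(blocker);  // occupies the core thread
        pool.execute(blocker);  // waits in the queue
        pool.execute(blocker);  // queue full: a second thread is created
        int size = pool.getPoolSize();
        release.countDown();
        pool.shutdown();
        return size;
    }
}
```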

Data Flow

Flowchart

[Flowchart] On app startup the thread pool (size 10) is initialized and the six @Scheduled tasks are registered; each timer trigger (cron/rate/delay) marks a task ready, which executes immediately if a thread is available or waits in the queue otherwise; metrics (duration, thread) are recorded and the thread returns to the pool, in a continuous loop.
Code
Scheduled Task Request → Task Queue → Available Thread → Task Execution → Result/Logging

Key Configuration Parameters

Pool Sizing Strategy

  • Core Pool Size: Start with CPU cores × 2 for I/O-bound tasks

  • Max Pool Size: Usually 2-4x core pool size

  • Keep Alive Time: How long idle threads stay alive

  • Queue Capacity: Buffer for task bursts
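
The first two rules can be written down directly. Treat this as a starting heuristic to validate against measured utilization, not a fixed law:

```java
// Heuristic sizing sketch for I/O-bound scheduled work: core threads at
// twice the CPU count, max at twice the core size.
class PoolSizing {
    static int corePoolSize() {
        return Runtime.getRuntime().availableProcessors() * 2;
    }

    static int maxPoolSize() {
        return corePoolSize() * 2;
    }
}
```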

Real-World Sizing Examples

  • Small API (1-10 users): Core=2, Max=4, Queue=10

  • Medium Service (100-1000 users): Core=8, Max=16, Queue=50

  • Large Platform (10,000+ users): Core=20, Max=50, Queue=200

Assignment: Thread Pool Performance Lab

Your Challenge

Create a Spring Boot application that demonstrates the performance difference between default single-threaded scheduling and optimized thread pools.

Detailed Steps

  1. Setup Project: Create Spring Boot app with Actuator for monitoring

  2. Default Implementation: Use @Scheduled with default configuration

  3. Custom Configuration: Implement ThreadPoolTaskScheduler bean

  4. Load Testing: Create multiple scheduled tasks with different patterns

  5. Metrics Collection: Use Actuator to measure execution times

  6. Performance Analysis: Compare throughput and response times

Success Criteria

  • Demonstrate measurable performance improvement

  • Show thread pool utilization metrics

  • Handle at least 10 concurrent scheduled tasks without blocking

Solution Hints

Configuration Strategy

java
@Configuration
@EnableScheduling
public class SchedulingConfig {
    
    @Bean
    public ThreadPoolTaskScheduler taskScheduler() {
        ThreadPoolTaskScheduler scheduler = new ThreadPoolTaskScheduler();
        scheduler.setPoolSize(10); // Adjust based on your needs
        scheduler.setThreadNamePrefix("scheduled-task-");
        scheduler.setWaitForTasksToCompleteOnShutdown(true);
        scheduler.setAwaitTerminationSeconds(20);
        return scheduler;
    }
}

Monitoring Setup

  • Enable Actuator endpoints for thread metrics

  • Use @Timed annotations for execution measurement

  • Log thread utilization patterns

Testing Approach

  • Create tasks with varying execution times

  • Simulate real-world load patterns

  • Measure before/after performance metrics

Production Readiness Checklist

  • Thread pool sized appropriately for expected load

  • Graceful shutdown configuration

  • Monitoring and alerting on thread pool health

  • Circuit breaker for external dependencies

  • Proper exception handling in scheduled tasks

Next Steps

Tomorrow (Day 4), we'll discover why even optimized single-instance scheduling hits a wall in distributed environments. You'll see how running multiple instances of your perfectly configured app can still lead to duplicate task executions and race conditions.

Understanding thread pools is your foundation - but scaling beyond a single server requires completely different strategies that we'll explore next.
