What We're Building Today
Today we're constructing a production-grade log collector service that forms the backbone of any distributed logging system. Our implementation will deliver:
• File System Watcher Service - Real-time detection of log file changes using Java NIO.2 WatchService with configurable polling strategies
• Event-Driven Stream Processing - Kafka-backed message streaming that handles log entries as they're discovered, with guaranteed delivery semantics
• Resilient Collection Pipeline - Circuit breaker patterns and retry mechanisms that gracefully handle file system failures and temporary outages
• Distributed State Management - Redis-backed offset tracking and deduplication to ensure exactly-once processing across service restarts
Why This Matters: The Foundation of Observable Systems
Every major technology company processes billions of log entries daily. Netflix has publicly described logging pipelines that ingest trillions of events per day, and Uber has reported processing over a hundred terabytes of logs daily for real-time decision making. The log collector is the critical first component that determines whether your observability stack can scale.
The patterns we're implementing today directly address the fundamental challenges of distributed log processing: how do you reliably capture, deduplicate, and stream log data from thousands of services without losing events or overwhelming downstream systems? The architectural decisions we make in the collector service cascade through the entire observability pipeline, affecting everything from alert accuracy to debugging capabilities.
System Design Deep Dive: Core Distributed Patterns
1. File System Watching Strategy: Push vs Pull Trade-offs
Traditional log collection relies on periodic file scanning, but this approach introduces latency and resource waste. Modern systems need sub-second detection capabilities. We'll implement the WatchService pattern with intelligent fallbacks:
Trade-off Analysis: Event-driven watching provides immediate notification but consumes file descriptors and can overwhelm systems during log bursts. Our hybrid approach combines immediate watching with periodic validation scans to handle edge cases like file rotation and missed events.
Failure Mode: File system events can be lost during high I/O scenarios. We implement a reconciliation pattern that periodically validates our in-memory state against actual file system state, ensuring no log entries are permanently lost.
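A minimal JDK-only sketch of this hybrid approach follows. The class and method names (LogDirectoryWatcher, pollChanges, reconcile) are illustrative, and the glob pattern and poll timeout are assumptions, not fixed requirements:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.TimeUnit;

// Event-driven watching plus a periodic reconciliation scan for missed events.
class LogDirectoryWatcher {
    private final Path dir;
    private final WatchService watchService;
    private final Set<Path> knownFiles = new HashSet<>();

    LogDirectoryWatcher(Path dir) throws IOException {
        this.dir = dir;
        this.watchService = dir.getFileSystem().newWatchService();
        dir.register(watchService,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY);
    }

    // Drain pending events with a short bounded wait; returns changed paths.
    List<Path> pollChanges() throws InterruptedException {
        List<Path> changed = new ArrayList<>();
        WatchKey key = watchService.poll(100, TimeUnit.MILLISECONDS);
        if (key != null) {
            for (WatchEvent<?> event : key.pollEvents()) {
                changed.add(dir.resolve((Path) event.context()));
            }
            key.reset();
        }
        return changed;
    }

    // Reconciliation scan: catches files whose events were dropped under load.
    List<Path> reconcile() throws IOException {
        List<Path> missed = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.log")) {
            for (Path p : stream) {
                if (knownFiles.add(p)) missed.add(p);
            }
        }
        return missed;
    }
}
```

In a real deployment, pollChanges would run in a tight loop on a watcher thread while reconcile fires on a slower schedule (for example every 30 seconds).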
2. Exactly-Once Semantics: The Offset Management Challenge
Distributed log collection faces the classic exactly-once delivery problem. Services restart, networks partition, and files rotate unexpectedly. Our solution implements distributed offset management using Redis as a coordination layer:
Architectural Insight: The offset management pattern must balance consistency with availability. We accept eventual consistency for performance but implement write-ahead logging to Redis before processing events, ensuring we can recover from any point of failure.
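The write-ahead idea can be sketched as follows. Redis is replaced here by an in-memory stand-in so the flow is runnable anywhere; in production the OffsetStore interface would wrap a Redis client (the interface and class names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Durable checkpoint store; a Redis hash would back this in production.
interface OffsetStore {
    void put(String fileKey, long offset);
    long get(String fileKey); // -1 if unknown
}

class InMemoryOffsetStore implements OffsetStore {
    private final Map<String, Long> offsets = new ConcurrentHashMap<>();
    public void put(String fileKey, long offset) { offsets.put(fileKey, offset); }
    public long get(String fileKey) { return offsets.getOrDefault(fileKey, -1L); }
}

class OffsetManager {
    private final OffsetStore store;
    OffsetManager(OffsetStore store) { this.store = store; }

    // Write-ahead: persist the new offset BEFORE emitting the batch downstream,
    // so a restart resumes from a known position instead of rescanning the file.
    long advance(String fileKey, long bytesRead) {
        long current = Math.max(store.get(fileKey), 0);
        long next = current + bytesRead;
        store.put(fileKey, next);   // durable checkpoint first
        return next;                // ...then the caller processes the batch
    }
}
```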
3. Backpressure and Flow Control: Handling Log Bursts
Production systems experience unpredictable log volume spikes. A single service deployment can generate gigabytes of logs in minutes. Our collector implements adaptive backpressure using Kafka's built-in flow control combined with local buffering:
Scale Consideration: The buffer size directly impacts memory usage and latency. We implement dynamic buffer sizing based on downstream processing rates, preventing out-of-memory conditions while maintaining low latency during normal operations.
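A runnable sketch of bounded buffering with a simple sizing heuristic is below. The offer timeout and the "two seconds of downstream throughput" constant are assumptions chosen for illustration:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Bounded buffer whose rejected offers become the backpressure signal.
class AdaptiveBuffer {
    private final BlockingQueue<String> queue;

    AdaptiveBuffer(int capacity) { this.queue = new ArrayBlockingQueue<>(capacity); }

    // Bounded offer: returns false instead of growing without limit, telling
    // the file readers to slow down rather than exhausting heap.
    boolean offer(String entry) throws InterruptedException {
        return queue.offer(entry, 50, TimeUnit.MILLISECONDS);
    }

    String take() throws InterruptedException { return queue.take(); }

    // Illustrative heuristic: buffer roughly two seconds of downstream
    // throughput, clamped to sane bounds.
    static int suggestedCapacity(double downstreamEventsPerSec) {
        int target = (int) (downstreamEventsPerSec * 2);
        return Math.max(1_000, Math.min(target, 100_000));
    }
}
```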
4. Circuit Breaker Pattern: Preventing Cascade Failures
When downstream services (Kafka, Redis) become unavailable, naive retry logic can amplify problems. We implement the circuit breaker pattern using Resilience4j to fail fast and prevent resource exhaustion:
Production Reality: Circuit breakers must be tuned based on actual failure patterns. Our implementation includes adaptive thresholds that adjust based on historical success rates, preventing false positives during normal load variations.
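In the stack described here, Resilience4j supplies the breaker itself (via CircuitBreakerConfig settings such as failureRateThreshold and waitDurationInOpenState). The dependency-free sketch below only illustrates the underlying CLOSED/OPEN/HALF_OPEN state machine those settings drive; the threshold values are arbitrary:

```java
// Minimal circuit breaker state machine: fail fast while OPEN, then probe.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean allowRequest(long now) {
        if (state == State.OPEN && now - openedAt >= openMillis) {
            state = State.HALF_OPEN;   // allow one trial request through
        }
        return state != State.OPEN;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void recordFailure(long now) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;        // stop hammering Kafka/Redis
            openedAt = now;
        }
    }

    synchronized State state() { return state; }
}
```

Time is passed in explicitly so the transitions are deterministic and testable; Resilience4j handles the clock, sliding windows, and adaptive thresholds for you.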
5. Distributed Tracing: Observing the Observer
A log collector that can't be debugged is useless in production. We implement end-to-end tracing using Micrometer and Zipkin, creating a trace for each log entry from file detection through Kafka delivery:
Observability Insight: The collector's own metrics become critical for debugging collection issues. We emit custom metrics for file discovery rate, processing latency, and error frequencies that directly correlate with business impact.
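In this stack those custom metrics would be Micrometer counters and timers bound to a MeterRegistry; the dependency-free sketch below just shows the three signals the text calls out (discovery rate, processing latency, error frequency), with illustrative method names:

```java
import java.util.concurrent.atomic.LongAdder;

// Collector self-metrics: the observability data about the observer itself.
class CollectorMetrics {
    private final LongAdder filesDiscovered = new LongAdder();
    private final LongAdder errors = new LongAdder();
    private final LongAdder totalLatencyNanos = new LongAdder();
    private final LongAdder processedEvents = new LongAdder();

    void onFileDiscovered() { filesDiscovered.increment(); }
    void onError() { errors.increment(); }
    void onEventProcessed(long latencyNanos) {
        processedEvents.increment();
        totalLatencyNanos.add(latencyNanos);
    }

    long filesDiscovered() { return filesDiscovered.sum(); }
    long errors() { return errors.sum(); }
    double avgLatencyMillis() {
        long n = processedEvents.sum();
        return n == 0 ? 0 : totalLatencyNanos.sum() / (n * 1_000_000.0);
    }
}
```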
Implementation Walkthrough: Building Production-Ready Components
Our implementation centers around the LogCollectorService that orchestrates file watching, event processing, and downstream delivery. The architecture follows Spring Boot's reactive patterns while maintaining explicit error boundaries.
Core Service Architecture
The LogCollectorService implements a producer-consumer pattern with multiple worker threads handling different aspects of collection:
Architectural Decision: Separate thread pools prevent file watching from blocking event processing. The 1:2 ratio (4 watchers, 8 processors) reflects the I/O-bound nature of file operations versus CPU-bound log parsing.
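The pool split can be sketched in a few lines (the class name and thread-name prefixes are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Separate pools: a blocked watcher thread never starves log processing.
class CollectorExecutors {
    final ExecutorService watcherPool =
            Executors.newFixedThreadPool(4, named("log-watcher"));
    final ExecutorService processorPool =
            Executors.newFixedThreadPool(8, named("log-processor"));

    private static ThreadFactory named(String prefix) {
        return runnable -> {
            Thread t = new Thread(runnable, prefix + "-" + System.nanoTime());
            t.setDaemon(true);   // don't block JVM shutdown
            return t;
        };
    }

    void shutdown() {
        watcherPool.shutdown();
        processorPool.shutdown();
    }
}
```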
Event Processing Pipeline
Each log entry flows through a structured pipeline that handles deduplication, formatting, and reliable delivery:
1. File Event Detection - NIO.2 WatchService detects file modifications
2. Offset-based Reading - Read only new content using stored file positions
3. Event Enrichment - Add metadata (hostname, service name, collection timestamp)
4. Deduplication Check - Redis-based duplicate detection using content hashing
5. Kafka Delivery - Asynchronous delivery with retry and dead letter handling
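The enrichment, deduplication, and delivery steps above can be sketched end to end. Redis and Kafka are replaced by an in-memory hash set and a Consumer so the flow is runnable; the hostname and service name are hard-coded placeholders:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.HexFormat;
import java.util.Set;
import java.util.function.Consumer;

// Enriched log entry carried through the pipeline.
record LogEvent(String line, String hostname, String service, long collectedAt) {}

class LogPipeline {
    private final Set<String> seenHashes = new HashSet<>(); // Redis stand-in
    private final Consumer<LogEvent> sink;                  // Kafka stand-in
    LogPipeline(Consumer<LogEvent> sink) { this.sink = sink; }

    // Returns true if the line was delivered, false if dropped as a duplicate.
    boolean process(String rawLine) throws Exception {
        LogEvent event = new LogEvent(rawLine, "host-1", "demo-service",
                System.currentTimeMillis());           // enrichment
        String hash = sha256(rawLine);
        if (!seenHashes.add(hash)) return false;       // deduplication check
        sink.accept(event);                            // delivery
        return true;
    }

    private static String sha256(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256")
                .digest(s.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(d);
    }
}
```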
Configuration-Driven Scalability
Production deployments require tunable behavior without code changes. Our implementation externalizes all critical parameters:
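One plausible application.yml shape for the parameters discussed in this article is shown below; the property names and values are illustrative, not a fixed schema:

```yaml
collector:
  watch:
    directories: [/var/log/app]
    reconcile-interval: 30s        # periodic validation scan
  buffer:
    initial-capacity: 10000
    max-capacity: 100000
  kafka:
    topic: raw-logs
    delivery-timeout: 5s
  circuit-breaker:
    failure-rate-threshold: 50     # percent
    wait-duration-in-open-state: 10s
```

In Spring Boot these keys would typically bind to a @ConfigurationProperties class, so operators can retune the collector per environment without a rebuild.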
Production Considerations: Performance and Failure Scenarios
Memory Management and GC Pressure
Log collection is inherently memory-intensive. Large log files and high-frequency events can cause significant GC pressure. Our implementation includes explicit memory management:
• Streaming file reading to avoid loading entire files into memory
• Object pooling for frequently allocated LogEvent instances
• Configurable batch sizes to balance latency versus memory usage
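The streaming-read point above can be made concrete: seek to the last checkpointed offset and read only the bytes appended since, so a multi-gigabyte file never lands on the heap. This sketch uses RandomAccessFile; the class name and byte cap are illustrative:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Offset-based incremental reading of an append-only log file.
class IncrementalReader {
    // Reads up to maxBytes of new content starting at `offset`;
    // returns an empty string if nothing new has been written.
    static String readNew(Path file, long offset, int maxBytes) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            if (offset >= length) return "";
            raf.seek(offset);
            byte[] buf = new byte[(int) Math.min(maxBytes, length - offset)];
            int n = raf.read(buf);
            return new String(buf, 0, n, StandardCharsets.UTF_8);
        }
    }
}
```

The caller would feed the returned chunk to the parser in bounded batches and then checkpoint the new offset.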
File Rotation Handling
Production systems rotate logs unpredictably. Our service handles rotation scenarios through inode tracking and graceful transition logic that maintains offset accuracy across file system changes.
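The inode-tracking idea maps onto the JDK's BasicFileAttributes.fileKey(), which on POSIX filesystems changes when a path is replaced by a new underlying file even though the name stays the same (it may be null on some platforms). A small sketch, with illustrative names:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Rotation detection via the file's inode-like identity key.
class RotationDetector {
    static Object identityOf(Path file) throws IOException {
        return Files.readAttributes(file, BasicFileAttributes.class).fileKey();
    }

    // True if the path now points at a different underlying file than before,
    // meaning old offsets must be retired and reading restarted from zero.
    static boolean rotated(Object previousKey, Path file) throws IOException {
        Object current = identityOf(file);
        return previousKey != null && !previousKey.equals(current);
    }
}
```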
Network Partition Recovery
When Kafka becomes unavailable, the collector implements local persistence with automatic replay once connectivity is restored. This prevents log loss during infrastructure maintenance or outages.
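A minimal version of that spill-and-replay behavior, assuming newline-delimited events and a single writer (the class name and file layout are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.function.Consumer;

// While the broker is unreachable, append events to a local spill file;
// once connectivity returns, replay them in order and clear the file.
class SpillBuffer {
    private final Path spillFile;
    SpillBuffer(Path spillFile) { this.spillFile = spillFile; }

    void spill(String event) throws IOException {
        Files.writeString(spillFile, event + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Replays spilled events in order, then deletes the file; returns count.
    int replay(Consumer<String> deliver) throws IOException {
        if (!Files.exists(spillFile)) return 0;
        List<String> lines = Files.readAllLines(spillFile);
        lines.forEach(deliver);
        Files.delete(spillFile);
        return lines.size();
    }
}
```

A production version would also cap the spill file's size and fsync on a schedule; those details are omitted here.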
Performance Benchmarks
In our load tests under normal conditions, the collector processes roughly 10,000 log entries per second per instance with sub-100ms end-to-end latency. Memory usage remains stable under 512MB with proper configuration, making it suitable for deployment alongside application services.
Scale Connection: FAANG-Level Systems
These same patterns underpin logging platforms at Netflix scale, where publicly described pipelines handle trillions of events daily. The key insight is that collection complexity grows far more slowly than data volume: by implementing proper offset management and circuit breaker patterns today, you're building a foundation that can scale toward petabyte-level processing primarily through configuration changes rather than architectural rewrites.
Amazon's CloudWatch Logs and Google's Cloud Logging rely on the same fundamental patterns of offset tracking, backpressure, and fail-fast delivery, which is why today's implementation translates directly to hyperscale production environments.
Next Steps: Tomorrow's Advanced Concepts
Day 4 will build on today's collector by implementing intelligent log parsing. We'll add pattern recognition, structured data extraction, and schema validation - transforming raw log streams into queryable, analytics-ready data structures that feed business intelligence systems.
The parsing layer will introduce new distributed patterns around schema evolution, data validation, and transformation pipelines that maintain backward compatibility while enabling rich analytics capabilities.