Day 1: Event-Driven Architecture Fundamentals

Lesson 1 (2-3 hours)

What We're Building Today

Today we transform StreamSocial from a single-point-of-failure system into a resilient, distributed event platform. We'll deploy a 3-broker Kafka cluster that can handle millions of social media events without breaking a sweat.

Today's Agenda:

  • Deploy a multi-broker Kafka cluster with Docker Compose
  • Implement broker discovery and failover mechanisms
  • Create a monitoring dashboard for cluster health
  • Test fault tolerance with controlled broker failures

Core Concepts: Distributed Consensus in Action

Component Architecture

StreamSocial component architecture (diagram): a Client Layer (Web, Mobile, API) talks to an API Gateway (FastAPI, WebSocket), which publishes to an Event Bus (Kafka/Redis publish/subscribe). Events land in an Event Store (immutable ledger) and fan out to the Feed Handler (timeline ranking), Notifications (push/alert engine), and Analytics (real-time metrics), backed by a User Feeds DB (NoSQL/cache).

Broker Architecture Deep Dive

Think of Kafka brokers like Netflix's content delivery network - instead of one massive server, you have multiple smaller servers working together. Each broker is an independent Kafka server that stores and serves data.

Why 3 Brokers Matter:

  • Fault Tolerance: Lose one broker? The system keeps running
  • Load Distribution: 50M requests spread across multiple machines
  • Replication: Your precious user posts aren't lost forever
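As a sketch of why three brokers protect your data: round-robin replica placement (a simplification of Kafka's actual assignment logic) puts a copy of every partition on every broker when the replication factor equals the broker count.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Round-robin replica placement: partition p's replicas start at
    broker p % len(brokers) and wrap around the broker list."""
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }

# 6 partitions, 3 brokers, replication factor 3: every broker holds a
# copy of every partition, so losing one broker loses no data.
layout = assign_replicas(6, ["broker-0", "broker-1", "broker-2"], 3)
```

With replication factor 3 on a 3-broker cluster, every partition survives any single broker failure, which is exactly the fault-tolerance property the list above promises.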

Zookeeper vs KRaft: The Leadership Challenge

Imagine coordinating a group project without a clear leader - chaos, right? That's why Kafka needs distributed consensus.

Zookeeper (Traditional):

  • External coordination service
  • Handles leader election and metadata
  • Like having a dedicated project manager

KRaft (Modern):

  • Self-managing Kafka cluster
  • No external dependencies
  • Like the team electing their own leader

StreamSocial uses KRaft because it's simpler and reduces operational overhead.
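The "team electing their own leader" idea boils down to majority voting. A toy sketch (real KRaft runs the Raft protocol, with terms and log-matching rules this omits):

```python
from collections import Counter

def elect_leader(votes, quorum_size):
    """Return the candidate with a strict majority of votes, or None.
    Raft-style elections require more than half the quorum to agree."""
    candidate, count = Counter(votes).most_common(1)[0]
    return candidate if count > quorum_size // 2 else None

# 3 controllers vote; broker-2 wins with 2 of 3 votes.
leader = elect_leader(["broker-2", "broker-2", "broker-0"], quorum_size=3)
```

If no candidate reaches a majority (a split vote), no leader is chosen and, in real Raft, the nodes time out and vote again in a new term.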

Context in Ultra-Scalable System Design

Flowchart

Event-driven flow (diagram): a user action creates an event (timestamp + event ID), which is published to the event bus and stored in the immutable log. Async processing routes it to subscribers - the Feed Handler (timeline update), Notifications (alerts), and Analytics (metrics) - while a WebSocket broadcast updates the UI in real time. Targets: <5ms latency, 1000+ events/sec throughput. Key benefits: non-blocking, independent scaling, highly resilient.

Real-World Production Patterns

Netflix Example: Their recommendation engine processes billions of viewing events across hundreds of Kafka brokers. When a broker fails during peak hours, traffic seamlessly routes to healthy brokers.

StreamSocial Architecture Position:

```
User Actions → Load Balancer → Kafka Cluster → Event Processors → Database
```
  • User posts, likes, comments (high write volume)
  • Real-time notifications (low latency requirement)
  • Analytics events (high throughput tolerance)

Scaling Considerations

Current Target: 1M daily active users
Growth Path: 100M users (requiring 50+ brokers)

Starting with 3 brokers teaches core concepts without complexity overload.
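Capacity planning like this is back-of-envelope math. A sketch with assumed numbers (20,000 events/sec per broker and a 2x peak-headroom multiplier are made up for illustration, not Kafka benchmarks):

```python
import math

def brokers_needed(daily_events, per_broker_events_per_sec=20_000,
                   replication_factor=3, headroom=2.0):
    """Back-of-envelope broker count: average event rate, multiplied by
    replication (every write lands on RF brokers) and a peak-headroom
    factor, divided by assumed per-broker capacity. Never below 3, the
    minimum for fault-tolerant replication."""
    avg_rate = daily_events / 86_400  # seconds per day
    total_write_rate = avg_rate * replication_factor * headroom
    return max(3, math.ceil(total_write_rate / per_broker_events_per_sec))
```

Under these assumptions, 10M events/day fits comfortably in the minimum 3-broker cluster, and the broker count grows roughly linearly with traffic from there.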

Control Plane:

  • KRaft Controller: Manages cluster metadata and leader elections
  • Broker Registration: Dynamic discovery and health monitoring
  • Topic Management: Automated partition assignment

Data Plane:

  • Producer APIs: Accept events from the StreamSocial frontend
  • Consumer APIs: Serve events to the recommendation engine
  • Replication: Ensure data durability across brokers

Request Flow:

  • Write Path: User posts → Producer → Leader Broker → Follower Replicas
  • Read Path: Consumer → Partition Leader (followers can serve reads only when follower fetching is enabled)
  • Failure Recovery: Failed broker → Leader election → Traffic redirection
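The write path above can be sketched as a leader that acknowledges only once enough replicas hold the record - a toy model of what `acks=all` with `min.insync.replicas=2` buys you, not real broker code:

```python
def write(record, leader_log, follower_logs, min_insync_replicas=2):
    """Append to the leader, replicate to followers, and acknowledge
    only if the record reached at least min_insync_replicas copies
    (leader included). None models a dead follower."""
    leader_log.append(record)
    copies = 1  # the leader's own copy
    for log in follower_logs:
        if log is not None:
            log.append(record)
            copies += 1
    return copies >= min_insync_replicas

leader, f1, f2 = [], [], []
ok = write("post:42", leader, [f1, f2])          # all 3 brokers up
degraded = write("post:43", leader, [f1, None])  # one follower down
```

Both writes succeed: with 3 replicas and a minimum of 2 in sync, losing one follower degrades redundancy but never drops an acknowledged record - the fault-tolerance guarantee this lesson keeps returning to.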

Broker States:

  • STARTING: Loading local data and registering with the cluster
  • RUNNING: Actively serving produce/consume requests
  • RECOVERING: Catching up after a network partition
  • SHUTDOWN: Graceful termination with data consistency
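These lifecycle states form a small transition table. A sketch using the state names from this lesson (not Kafka's internal implementation):

```python
# Allowed broker state transitions; RECOVERING loops back to RUNNING
# once the replica has caught up, and SHUTDOWN is terminal.
TRANSITIONS = {
    "STARTING":   {"RUNNING", "SHUTDOWN"},
    "RUNNING":    {"RECOVERING", "SHUTDOWN"},
    "RECOVERING": {"RUNNING", "SHUTDOWN"},
    "SHUTDOWN":   set(),
}

def advance(state, target):
    """Return the new state, or raise if the transition is illegal."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```

Encoding transitions explicitly makes illegal moves (like a SHUTDOWN broker jumping straight to RUNNING without restarting) fail loudly instead of silently corrupting cluster state.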

Production System Integration

StreamSocial Event Flow

```python
class EventRouter:
    def route_event(self, event_type, user_id):
        if event_type == "user_action":
            return f"broker-{user_id % 3}"  # Round-robin distribution
        elif event_type == "content_interaction":
            return "broker-1"  # Dedicated broker for hot data
```

Real production clusters need visibility. Our setup includes:

  • JMX metrics exposure
  • Docker health checks
  • Custom Python monitoring dashboard
  • Automated failover testing
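The dashboard's most important check can be sketched against a hypothetical metadata snapshot (real code would pull this from JMX or the Kafka AdminClient; the dict shape here is made up for illustration):

```python
def under_replicated(partitions):
    """Return IDs of partitions whose in-sync replica set (ISR) is
    smaller than the assigned replica set - the key health signal
    that a broker has fallen behind or died."""
    return [pid for pid, p in partitions.items()
            if len(p["isr"]) < len(p["replicas"])]

snapshot = {
    0: {"replicas": [0, 1, 2], "isr": [0, 1, 2]},  # fully healthy
    1: {"replicas": [1, 2, 0], "isr": [1, 0]},     # broker 2 lagging
}
unhealthy = under_replicated(snapshot)  # -> [1]
```

A non-empty result is the signal to alert on: the partition still serves traffic, but it is one more failure away from unavailability or data loss.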

Key Insights for System Design

Distributed Systems Reality

CAP Theorem in Practice: Kafka (with unclean leader election disabled, the default) chooses Consistency + Partition tolerance over Availability. When a network split occurs, affected partitions become unavailable rather than serving stale data.

Replication Factor = 3: Not arbitrary - it allows tolerance of 1 broker failure while maintaining data safety. For majority-quorum systems such as the KRaft controller, the rule is max_failures = floor((replication_factor - 1) / 2).
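Worked out for the common quorum sizes, the majority formula gives:

```python
def max_quorum_failures(n):
    """Majority quorum survives failures as long as more than half
    the members remain: floor((n - 1) / 2)."""
    return (n - 1) // 2

# 3 nodes tolerate 1 failure; 5 tolerate 2; 4 still tolerate only 1,
# which is why quorums are almost always sized with odd numbers.
tolerances = {n: max_quorum_failures(n) for n in (3, 4, 5)}
```

Note the jump from 3 to 4 nodes buys nothing: an even-sized quorum pays the cost of an extra machine without tolerating an extra failure.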

Operational Complexity Trade-offs

Single Broker: Simple to deploy, impossible to scale
3-Broker Cluster: Sweet spot for learning and small production
100+ Broker Cluster: Requires sophisticated tooling and dedicated ops team

Real-World Application

State Machine

Event state machine (diagram): an event moves from Created → Published → Stored → Processing → Effects Applied → UI Updated, ending in COMPLETE or dropping into an ERROR state with automated back-off retries. Event categories: user_action, content_interaction, system_event. Performance targets: event creation <1ms, publish latency <5ms, handler processing <100ms, UI update <200ms.

When You'll Use This

Startup Phase: 3-broker cluster handles 10M events/day
Growth Phase: Add brokers horizontally as traffic increases
Enterprise Phase: Multi-region clusters with hundreds of brokers

Common Production Patterns

  • Hot-standby regions: Disaster recovery setup
  • Cross-datacenter replication: Geographic distribution
  • Blue-green deployments: Zero-downtime upgrades

Success Criteria

By lesson end, you'll have:

  • ✅ Running 3-broker Kafka cluster
  • ✅ Fault tolerance demonstration (kill a broker, the system survives)
  • ✅ Monitoring dashboard showing cluster health
  • ✅ Event production/consumption verification

Assignment: Stress Test Your Cluster

Challenge: Simulate Black Friday traffic spike

  • Configure producers to send 10,000 events/second
  • Kill one broker during peak load
  • Measure recovery time and data loss
  • Document performance characteristics

Success Metrics:

  • Zero data loss during broker failure
  • Recovery time < 30 seconds
  • Throughput degradation < 50%
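These success metrics can be checked mechanically. A sketch with hypothetical measurements (the numbers are illustrative, not expected results):

```python
def passes(baseline_tput, failover_tput, recovery_secs, events_lost):
    """Evaluate the three success metrics from this assignment:
    zero data loss, recovery under 30s, degradation under 50%."""
    degradation = 1 - failover_tput / baseline_tput
    return events_lost == 0 and recovery_secs < 30 and degradation < 0.5

# Example run: 10k ev/s drops to 6.5k during failover (35% degradation),
# the cluster recovers in 12 seconds, and no events are lost.
result = passes(10_000, 6_500, recovery_secs=12, events_lost=0)
```

Capturing the check as code keeps the assignment honest: rerun it after every tuning change and you immediately see which metric regressed.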

Solution Hints

Monitoring Strategy: Track the UnderReplicatedPartitions metric
Recovery Optimization: Keep unclean.leader.election.enable=false so only in-sync replicas can become leader (preventing data loss at the cost of availability)
Load Testing: Use the built-in kafka-producer-perf-test.sh tool

Next Steps Preview

Tomorrow we'll optimize this cluster for StreamSocial's specific traffic patterns by designing topic partitioning strategies. You'll learn why user-actions needs 1000 partitions while notifications only need 10.


Remember: Every Netflix stream, every Instagram like, every Slack message flows through systems exactly like what you're building today. Master these fundamentals, and you're ready for any scale.
