Day 1: Event-Driven Architecture Fundamentals

Lesson 1 (2-3 hours)

What We're Building Today

Today we transform StreamSocial from a single-point-of-failure system into a resilient, distributed event platform. We'll deploy a 3-broker Kafka cluster that can handle millions of social media events without breaking a sweat.

Today's Agenda:

  • Deploy a multi-broker Kafka cluster with Docker Compose
  • Implement broker discovery and failover mechanisms
  • Create a monitoring dashboard for cluster health
  • Test fault tolerance with controlled broker failures

Core Concepts: Distributed Consensus in Action

Component Architecture

StreamSocial component architecture (diagram): a Client Layer (Web, Mobile, API) talks to an API Gateway (FastAPI, WebSocket), which publishes to an Event Bus (Kafka/Redis publish/subscribe). Events land in an Event Store (immutable ledger) and fan out to the Feed Handler (timeline ranking), Notifications (push/alert engine), and Analytics (real-time metrics), backed by a User Feeds DB (NoSQL/cache).

Broker Architecture Deep Dive

Think of Kafka brokers like Netflix's content delivery network - instead of one massive server, you have multiple smaller servers working together. Each broker is an independent Kafka server that stores and serves data.

Why 3 Brokers Matter:

  • Fault Tolerance: Lose one broker? The system keeps running
  • Load Distribution: 50M requests spread across multiple machines
  • Replication: Your precious user posts aren't lost forever
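As a sketch of why three brokers protect your data: round-robin replica placement (a simplification of Kafka's actual assignment logic) puts a copy of every partition on every broker when the replication factor equals the broker count.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Round-robin replica placement: partition p's replicas start at
    broker p % len(brokers) and wrap around the broker list."""
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }

# 6 partitions, 3 brokers, replication factor 3: every broker holds a
# copy of every partition, so losing one broker loses no data.
layout = assign_replicas(6, ["broker-0", "broker-1", "broker-2"], 3)
```

With replication factor 3 on a 3-broker cluster, every partition survives any single broker failure, which is exactly the fault-tolerance property the list above promises.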

Zookeeper vs KRaft: The Leadership Challenge

Imagine coordinating a group project without a clear leader - chaos, right? That's why Kafka needs distributed consensus.

Zookeeper (Traditional):

  • External coordination service
  • Handles leader election and metadata
  • Like having a dedicated project manager

KRaft (Modern):

  • Self-managing Kafka cluster
  • No external dependencies
  • Like the team electing their own leader

StreamSocial uses KRaft because it's simpler and reduces operational overhead.
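The "team electing their own leader" idea boils down to majority voting. A toy sketch (real KRaft runs the Raft protocol, with terms and log-matching rules this omits):

```python
from collections import Counter

def elect_leader(votes, quorum_size):
    """Return the candidate with a strict majority of votes, or None.
    Raft-style elections require more than half the quorum to agree."""
    candidate, count = Counter(votes).most_common(1)[0]
    return candidate if count > quorum_size // 2 else None

# 3 controllers vote; broker-2 wins with 2 of 3 votes.
leader = elect_leader(["broker-2", "broker-2", "broker-0"], quorum_size=3)
```

If no candidate reaches a majority (a split vote), no leader is chosen and, in real Raft, the nodes time out and vote again in a new term.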

Context in Ultra-Scalable System Design

Flowchart

Event-driven flow (diagram): a user action creates an event (timestamp + event ID), which is published to the event bus and stored in the immutable log. Async processing routes it to subscribers - the Feed Handler (timeline update), Notifications (alerts), and Analytics (metrics) - while a WebSocket broadcast updates the UI in real time. Targets: <5ms latency, 1000+ events/sec throughput. Key benefits: non-blocking, independent scaling, highly resilient.

Real-World Production Patterns

Netflix Example: Their recommendation engine processes billions of viewing events across hundreds of Kafka brokers. When a broker fails during peak hours, traffic seamlessly routes to healthy brokers.

StreamSocial Architecture Position:

```
User Actions → Load Balancer → Kafka Cluster → Event Processors → Database
```
  • User posts, likes, comments (high write volume)
  • Real-time notifications (low latency requirement)
  • Analytics events (high throughput tolerance)

Scaling Considerations

Current Target: 1M daily active users
Growth Path: 100M users (requiring 50+ brokers)

Starting with 3 brokers teaches core concepts without complexity overload.
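Capacity planning like this is back-of-envelope math. A sketch with assumed numbers (20,000 events/sec per broker and a 2x peak-headroom multiplier are made up for illustration, not Kafka benchmarks):

```python
import math

def brokers_needed(daily_events, per_broker_events_per_sec=20_000,
                   replication_factor=3, headroom=2.0):
    """Back-of-envelope broker count: average event rate, multiplied by
    replication (every write lands on RF brokers) and a peak-headroom
    factor, divided by assumed per-broker capacity. Never below 3, the
    minimum for fault-tolerant replication."""
    avg_rate = daily_events / 86_400  # seconds per day
    total_write_rate = avg_rate * replication_factor * headroom
    return max(3, math.ceil(total_write_rate / per_broker_events_per_sec))
```

Under these assumptions, 10M events/day fits comfortably in the minimum 3-broker cluster, and the broker count grows roughly linearly with traffic from there.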

Control Plane:

  • KRaft Controller: Manages cluster metadata and leader elections
  • Broker Registration: Dynamic discovery and health monitoring
  • Topic Management: Automated partition assignment

Data Plane:

  • Producer APIs: Accept events from the StreamSocial frontend
  • Consumer APIs: Serve events to the recommendation engine
  • Replication: Ensure data durability across brokers

Request Flow:

  • Write Path: User posts → Producer → Leader Broker → Follower Replicas
  • Read Path: Consumer → Partition Leader (followers can serve reads only when follower fetching is enabled)
  • Failure Recovery: Failed broker → Leader election → Traffic redirection
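The write path above can be sketched as a leader that acknowledges only once enough replicas hold the record - a toy model of what `acks=all` with `min.insync.replicas=2` buys you, not real broker code:

```python
def write(record, leader_log, follower_logs, min_insync_replicas=2):
    """Append to the leader, replicate to followers, and acknowledge
    only if the record reached at least min_insync_replicas copies
    (leader included). None models a dead follower."""
    leader_log.append(record)
    copies = 1  # the leader's own copy
    for log in follower_logs:
        if log is not None:
            log.append(record)
            copies += 1
    return copies >= min_insync_replicas

leader, f1, f2 = [], [], []
ok = write("post:42", leader, [f1, f2])          # all 3 brokers up
degraded = write("post:43", leader, [f1, None])  # one follower down
```

Both writes succeed: with 3 replicas and a minimum of 2 in sync, losing one follower degrades redundancy but never drops an acknowledged record - the fault-tolerance guarantee this lesson keeps returning to.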

Broker States:

  • STARTING: Loading local data and registering with the cluster
  • RUNNING: Actively serving produce/consume requests
  • RECOVERING: Catching up after a network partition
  • SHUTDOWN: Graceful termination with data consistency
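These lifecycle states form a small transition table. A sketch using the state names from this lesson (not Kafka's internal implementation):

```python
# Allowed broker state transitions; RECOVERING loops back to RUNNING
# once the replica has caught up, and SHUTDOWN is terminal.
TRANSITIONS = {
    "STARTING":   {"RUNNING", "SHUTDOWN"},
    "RUNNING":    {"RECOVERING", "SHUTDOWN"},
    "RECOVERING": {"RUNNING", "SHUTDOWN"},
    "SHUTDOWN":   set(),
}

def advance(state, target):
    """Return the new state, or raise if the transition is illegal."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```

Encoding transitions explicitly makes illegal moves (like a SHUTDOWN broker jumping straight to RUNNING without restarting) fail loudly instead of silently corrupting cluster state.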

Production System Integration

StreamSocial Event Flow

```python
class EventRouter:
    def route_event(self, event_type, user_id):
        if event_type == "user_action":
            return f"broker-{user_id % 3}"  # Round-robin distribution
        elif event_type == "content_interaction":
            return "broker-1"  # Dedicated broker for hot data
```

Real production clusters need visibility. Our setup includes:

  • JMX metrics exposure
  • Docker health checks
  • Custom Python monitoring dashboard
  • Automated failover testing
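The dashboard's most important check can be sketched against a hypothetical metadata snapshot (real code would pull this from JMX or the Kafka AdminClient; the dict shape here is made up for illustration):

```python
def under_replicated(partitions):
    """Return IDs of partitions whose in-sync replica set (ISR) is
    smaller than the assigned replica set - the key health signal
    that a broker has fallen behind or died."""
    return [pid for pid, p in partitions.items()
            if len(p["isr"]) < len(p["replicas"])]

snapshot = {
    0: {"replicas": [0, 1, 2], "isr": [0, 1, 2]},  # fully healthy
    1: {"replicas": [1, 2, 0], "isr": [1, 0]},     # broker 2 lagging
}
unhealthy = under_replicated(snapshot)  # -> [1]
```

A non-empty result is the signal to alert on: the partition still serves traffic, but it is one more failure away from unavailability or data loss.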

Key Insights for System Design

Distributed Systems Reality

CAP Theorem in Practice: Kafka (with unclean leader election disabled, the default) chooses Consistency + Partition tolerance over Availability. When a network split occurs, affected partitions become unavailable rather than serving stale data.

Replication Factor = 3: Not arbitrary - it allows tolerance of 1 broker failure while maintaining data safety. For majority-quorum systems such as the KRaft controller, the rule is max_failures = floor((replication_factor - 1) / 2).
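Worked out for the common quorum sizes, the majority formula gives:

```python
def max_quorum_failures(n):
    """Majority quorum survives failures as long as more than half
    the members remain: floor((n - 1) / 2)."""
    return (n - 1) // 2

# 3 nodes tolerate 1 failure; 5 tolerate 2; 4 still tolerate only 1,
# which is why quorums are almost always sized with odd numbers.
tolerances = {n: max_quorum_failures(n) for n in (3, 4, 5)}
```

Note the jump from 3 to 4 nodes buys nothing: an even-sized quorum pays the cost of an extra machine without tolerating an extra failure.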

Operational Complexity Trade-offs

Single Broker: Simple to deploy, impossible to scale
3-Broker Cluster: Sweet spot for learning and small production
100+ Broker Cluster: Requires sophisticated tooling and dedicated ops team

Real-World Application

State Machine

Event state machine (diagram): an event moves from Created → Published → Stored → Processing → Effects Applied → UI Updated, ending in COMPLETE or dropping into an ERROR state with automated back-off retries. Event categories: user_action, content_interaction, system_event. Performance targets: event creation <1ms, publish latency <5ms, handler processing <100ms, UI update <200ms.

When You'll Use This

Startup Phase: 3-broker cluster handles 10M events/day
Growth Phase: Add brokers horizontally as traffic increases
Enterprise Phase: Multi-region clusters with hundreds of brokers

Common Production Patterns

  • Hot-standby regions: Disaster recovery setup
  • Cross-datacenter replication: Geographic distribution
  • Blue-green deployments: Zero-downtime upgrades

Success Criteria

By lesson end, you'll have:

  • ✅ Running 3-broker Kafka cluster
  • ✅ Fault tolerance demonstration (kill a broker, the system survives)
  • ✅ Monitoring dashboard showing cluster health
  • ✅ Event production/consumption verification

Assignment: Stress Test Your Cluster

Challenge: Simulate Black Friday traffic spike

  • Configure producers to send 10,000 events/second
  • Kill one broker during peak load
  • Measure recovery time and data loss
  • Document performance characteristics

Success Metrics:

  • Zero data loss during broker failure
  • Recovery time < 30 seconds
  • Throughput degradation < 50%
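These success metrics can be checked mechanically. A sketch with hypothetical measurements (the numbers are illustrative, not expected results):

```python
def passes(baseline_tput, failover_tput, recovery_secs, events_lost):
    """Evaluate the three success metrics from this assignment:
    zero data loss, recovery under 30s, degradation under 50%."""
    degradation = 1 - failover_tput / baseline_tput
    return events_lost == 0 and recovery_secs < 30 and degradation < 0.5

# Example run: 10k ev/s drops to 6.5k during failover (35% degradation),
# the cluster recovers in 12 seconds, and no events are lost.
result = passes(10_000, 6_500, recovery_secs=12, events_lost=0)
```

Capturing the check as code keeps the assignment honest: rerun it after every tuning change and you immediately see which metric regressed.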

Solution Hints

Monitoring Strategy: Track the UnderReplicatedPartitions metric
Recovery Optimization: Keep unclean.leader.election.enable=false so only in-sync replicas can become leader (preventing data loss at the cost of availability)
Load Testing: Use the built-in kafka-producer-perf-test.sh tool

Next Steps Preview

Tomorrow we'll optimize this cluster for StreamSocial's specific traffic patterns by designing topic partitioning strategies. You'll learn why user-actions needs 1000 partitions while notifications only need 10.


Remember: Every Netflix stream, every Instagram like, every Slack message flows through systems exactly like what you're building today. Master these fundamentals, and you're ready for any scale.
