Week 1: Event-Driven Architecture & Kafka Foundations (Days 1–5)

Lesson 4 60 min

Week 1: Event-Driven Architecture & Kafka Foundations (Days 1–5)

1. Introduction

Most Kafka tutorials stop at kafka-console-producer. You learn topics, offsets, and consumer groups in isolation—then hit production and discover that reliability, partitioning, and failure handling were never wired into a coherent system.

Distributed event-driven platforms introduce a different class of problems: partial failures across brokers, skewed partition load, duplicate delivery, and observability gaps that only appear under real throughput. StreamSocial addresses this gap directly. Kafka Mastery: Building StreamSocial is a system-first curriculum: each day adds a production concern to a running platform, not a slide deck.

StreamSocial models a social engagement pipeline—likes, shares, comments, and system signals—flowing through a multi-broker Kafka cluster, into consumers, and out to metrics and storage. Week 1 establishes the foundation: event taxonomy, cluster topology, partition strategy, high-volume producers, and resilient consumers with retry and dead-letter handling.

Real-time architectures matter because user-facing products no longer tolerate batch-only analytics. Feeds, notifications, fraud checks, and ML feature pipelines all depend on events arriving in order, at scale, with clear failure boundaries. Week 1 teaches you to reason about those boundaries before you add stream processing, APIs, or dashboards in later weeks.

2. System Overview

StreamSocial Week 1 is an integrated capstone: Docker-hosted Kafka (KRaft, three brokers), Python producers and consumers, Redis-backed caching, and a FastAPI monitoring dashboard. Events originate from simulated user activity, publish to category-specific topics, and are consumed by an engagement pipeline that updates metrics and handles errors.

High-level components

ComponentRole
Kafka clusterDurable log for user-actions, content-interactions, and system-events
ProducersEventPublisher and HighVolumeProducer with connection pooling and acks=all
ConsumersEngagementConsumer in a consumer group, subscribed to content-interactions
Stream processing (Week 1 scope)In-process transformation and engagement handling; Kafka Streams in later weeks
Backend API layerFastAPI dashboard (/api/status, /api/run-demo) on port 8765
Storage layerRedis cache for hot engagement state; DLQ list for poison messages
Frontend dashboardHTML UI served by FastAPI for live metrics and demo control

How events move through the system

  1. Taxonomy validation — Ten core event types map to three categories and three topics (src/events/).

  2. Topic provisioningTopicManager creates topics with configured partitions and replication factor 3.

  3. Publish — Producers serialize JSON events, route by category, and key partitions for ordering where required.

  4. ConsumeEngagementConsumer polls content-interactions, deserializes, and runs engagement logic.

  5. Observe — Metrics, cluster health, and runtime state surface on the dashboard API.

[IMAGE: StreamSocial Week 1 dashboard showing broker health, topic partitions, and live publish/consume metrics]

3. Architecture Diagram

Component Architecture

StreamSocial: Week 1 System Component Architecture CONTROL PLANE / OBSERVABILITY TIER FastAPI Dashboard Port 8765 | Web UI & Metrics Runtime Controller Orchestrates Demo Loop Topic Manager Provisions Kafka Topics DATA PLANE / INGESTION & CONSUMPTION TIER Event Ingestion • EventPublisher (Taxonomy) • HighVolumeProducer • Config: acks=all, retries Kafka Cluster (3-Broker KRaft) user-actions (12 Partitions, RF=3) content-interactions (6 Partitions, RF=3) system-events (3 Partitions, RF=3) Consumer Group • EngagementConsumer • In-process transformation • Error & Retry Handler STORAGE & STATE PERSISTENCE TIER Redis Cache Manager Hot State Engagement Counters Dead-Letter Queue (DLQ) Poison Pill Isolation Storage Provision Topics Trigger Loop Update Counters Route Poison Pills Metrics / Health

The dashboard drives the runtime controller, which orchestrates producers and consumers against the broker tier. Storage sits adjacent to the processing path for cache and DLQ persistence.

4. Core Concepts and Their Role

4.1 Event-Driven Architecture

What it does. Components communicate by publishing facts (events) instead of synchronous RPC chains. Producers remain unaware of downstream consumers.

Where it is used. src/events/models.py defines EventType, categories, and topic routing. EventPublisher emits domain events into Kafka.

Why it matters. Decoupling lets you add consumers—analytics, moderation, notifications—without redeploying producers.

4.2 Apache Kafka & Distributed Logs

What it does. Kafka stores ordered, replicated partitions across brokers. Producers append; consumers track offsets.

Where it is used. docker/docker-compose.yml runs a three-broker KRaft cluster on ports 9092–9094. config/kafka_config.py centralizes broker and client settings.

Why it matters. The log is the system of record. Everything else—APIs, caches, dashboards—is a projection of that log.

4.3 Partitioning

What it does. Topics split into partitions for parallelism. Keys route related events to the same partition for ordering.

Where it is used. src/partitioning/partition_calculator.py models 50M req/s sizing; topic_manager.py provisions user-actions (12 local / 1000 curriculum), content-interactions (6 / 500), and system-events (3).

Why it matters. Under-partitioned topics become bottlenecks; over-partitioned topics waste broker resources.

4.4 Producer Reliability

What it does. acks=all, retries, compression, and batching trade latency for durability.

Where it is used. PRODUCER_CONFIG in kafka_config.py; HighVolumeProducer with a connection pool in src/producer/.

Why it matters. Lost events in a social feed pipeline translate directly into broken user trust and bad analytics.

4.5 Consumer Groups

What it does. Consumers sharing a group.id coordinate partition assignment and offset commits.

Where it is used. EngagementConsumer with CONSUMER_GROUP_ID; runtime controller spawns a fresh group per demo session when needed.

Why it matters. Horizontal scale-out of consumers is bounded by partition count—plan both together.

4.6 Schema Evolution & Event Taxonomy

What it does. A stable taxonomy (10 events, 3 categories) prevents topic sprawl and documents contracts.

Where it is used. src/events/taxonomy.py validates exactly ten core types; validate_ten_core_events() runs at setup.

Why it matters. Production systems fail when event names and payloads drift without versioning strategy.

4.7 Fault Tolerance, Retry & Dead-Letter Queues

What it does. Transient errors retry with exponential backoff; poison messages land in a DLQ instead of blocking the consumer.

Where it is used. src/handlers/error_handler.py tracks per-offset retries and accumulates dead_letter_messages.

Why it matters. Without DLQ discipline, one bad message can stall an entire partition.

4.8 Monitoring & Observability

What it does. Exposes cluster health, publish/consume rates, and runtime state for operators.

Where it is used. src/cluster/health.py, src/utils/metrics.py, and src/monitoring/dashboard.py (FastAPI + live UI).

Why it matters. You cannot operate what you cannot see; Week 1 builds observability in from day one.

5. Engine / Core System Design

Week 1 implements the ingestion and consumption spine that later weeks extend into full stream processing (Kafka Streams, windowed aggregations, state stores).

StageResponsibility
Event ingestionHighVolumeProducer + EventPublisher accept typed events and publish to category topics
Transformation pipelineEngagement handler maps raw records to EngagementEvent models
Stateful processingRedis CacheManager holds hot engagement counters
Windowed aggregationPlanned in later curriculum; partition calculator models throughput targets
Retry workflowsErrorHandler with configurable MAX_RETRIES and backoff
Dead-letter queuesIn-memory DLQ list; production would publish to *.dlq topics
Event orchestrationRuntimeController coordinates setup, start, demo loop, and shutdown

Target layout for the full StreamSocial engine (Week 1 modules map into this structure):

plaintext
streamsocial-engine/
  producers/          # high_volume_producer.py, event_publisher.py
  consumers/          # engagement_consumer.py
  streams/            # Kafka Streams (later weeks)
  schemas/            # events/models.py, taxonomy.py
  pipelines/          # transformation & routing
  monitoring/         # dashboard.py, metrics.py, cluster/health.py
  retry/              # error_handler retry logic
  dlq/                # dead-letter capture & replay
  api/                # FastAPI control plane
  config/             # kafka_config.py, settings.py

Week 1 actual layout:

plaintext
streamsocial-week1/
  docker/             # 3-broker KRaft + Redis
  config/             # Kafka & app settings
  src/
    events/           # Day 1 taxonomy
    cluster/          # Day 2 health
    partitioning/     # Day 3 topics & calculator
    producer/         # Day 4 publish path
    consumer/         # Day 5 engagement + DLQ
    handlers/         # retry / DLQ
    monitoring/       # FastAPI dashboard
    runtime/          # demo orchestration
  tests/

6. Data Flow / Control Flow Diagram

Flowchart

StreamSocial: Run-Demo Execution & Message Lifecycle Flow 1. POST /api/run-demo User kicks off simulation loop 2. Topic Provision Check TopicManager ensures configs exist 3. Generate & Type Validate Runs 10 core event taxonomy engine 4. Kafka Appending Log HighVolumeProducer (acks=all) 5. Polling Engagement Group EngagementConsumer reads batch Is Payload Healthy? 6A. Incremental Counter State Redis commits user interaction records Retries Exhausted? Execute Backoff ErrorHandler sleep 7B. Append DLQ Isolate poison pills 8. UI Stream Metrics Render FastAPI frontend renders offsets & lag YES NO NO YES

A Run Demo request hits the FastAPI control plane, which starts the producer pool and background consumer, publishes engagement events to content-interactions, and returns aggregated status JSON to the browser.

[IMAGE: Sequence of POST /api/run-demo triggering topic setup, producer start, and consumer metrics incrementing on the dashboard]

7. API Design

Week 1 ships a control-plane API on the monitoring service. Representative endpoints:

Operational (Week 1)

http
GET /api/status

Returns taxonomy catalog, cluster broker summary, topic partition config, partition calculator output, and live producer/consumer metrics.

http
POST /api/run-demo

Provisions topics if needed, starts HighVolumeProducer and EngagementConsumer, and runs the demo publish loop.

http
GET /

Serves the interactive HTML dashboard (metrics polling, Run Demo button).

Platform endpoints (roadmap)

These illustrate how the API layer will sit in front of Kafka in later weeks:

http
POST /api/v1/events
Content-Type: application/json

{
  "event_type": "post_liked",
  "user_id": "u_42",
  "content_id": "post_991",
  "timestamp": "2026-05-27T14:00:00Z"
}

Accepts a validated domain event and publishes to the correct topic via EventPublisher.

http
GET /api/v1/engagement/{content_id}

Reads aggregated engagement from Redis (projection of consumed events).

http
GET /api/v1/health/kafka

Returns broker count, ISR status, and topic metadata—backed by ClusterHealth.

http
GET /api/v1/metrics/consumer

Exposes consumer lag, retry counts, and DLQ depth for alerting.


8. State Machine Diagram

State Machine

StreamSocial: Controller & Message Processing State Machines CORE RUNTIME CONTROLLER MODES SYSTEM_STOPPED Idle / Cluster Booted DEMO_RUNNING Publish/Consume Active SYSTEM_ERROR Broker Down / Outage POST /run-demo Execution Ends Health Check Fail MESSAGE CONSUMPTION PIPELINE LIFECYCLE STATE 1. UNVALIDATED Raw Polled Record 2. PROCESSING Running Logic Maps 3A. COMMITTED Offsets Moved Ahead 3B. RETRY_DELAY Exponential Backoff 4. POISON_DLQ Isolated For Analysis Taxonomy Pass Success Catch Err Interval Expiry Retries Exhausted (Max=5) Invalid Structural Schema Drop

Kafka-driven pipelines move through explicit runtime states. The Week 1 RuntimeController tracks stopped, running, and demo modes; message processing adds finer-grained transitions.

Validation failures short-circuit to Error. Processing failures enter Retry with backoff; exhausted retries route to Dead Letter Queue while the consumer continues on healthy offsets.

9. Why This Approach Matters

Theory connected to running code. Partition math from Day 3 uses the same TOPIC_CONFIG that TopicManager applies to Docker brokers. Producer acks=all is not an exam answer—it is what protects demo events under broker restarts.

Less fragmented learning. Instead of five disconnected exercises, you get one repository: taxonomy → cluster → topics → producer → consumer. Each layer composes with the previous one.

Better engineering judgment. You learn to ask production questions early: How many partitions for 50M req/s? What happens when a consumer throws? Where do poison messages go? How do you prove the cluster is healthy before accepting traffic?

Week 1 is deliberately smaller than a full microservices mesh—but architecturally honest. That honesty scales when you add Kafka Streams, Kubernetes, and cross-service contracts in later weeks.

10. Conclusion

Week 1 of Kafka Mastery: Building StreamSocial transforms isolated Kafka concepts into a working event-driven platform: typed events, a three-broker cluster, provisioned topics, reliable producers, consumer groups with retry and DLQ, and a FastAPI observability layer.

The lesson is not memorizing broker ports. It is designing systems where the log is central, failures have defined paths, and every concept has a file path in the repository. Carry that system-first habit into stream processing, API gateways, and production deployments—the foundation you build in Days 1–5 is what makes those layers survivable.

Next step: Clone streamsocial-week1, run ./start.sh, open http://localhost:8765, and click Run Demo. Watch events flow while you read the code that moves them.

Need help?