Week 1: Event-Driven Architecture & Kafka Foundations (Days 1–5)
1. Introduction
Most Kafka tutorials stop at kafka-console-producer. You learn topics, offsets, and consumer groups in isolation—then hit production and discover that reliability, partitioning, and failure handling were never wired into a coherent system.
Distributed event-driven platforms introduce a different class of problems: partial failures across brokers, skewed partition load, duplicate delivery, and observability gaps that only appear under real throughput. StreamSocial addresses this gap directly. Kafka Mastery: Building StreamSocial is a system-first curriculum: each day adds a production concern to a running platform, not a slide deck.
StreamSocial models a social engagement pipeline—likes, shares, comments, and system signals—flowing through a multi-broker Kafka cluster, into consumers, and out to metrics and storage. Week 1 establishes the foundation: event taxonomy, cluster topology, partition strategy, high-volume producers, and resilient consumers with retry and dead-letter handling.
Real-time architectures matter because user-facing products no longer tolerate batch-only analytics. Feeds, notifications, fraud checks, and ML feature pipelines all depend on events arriving in order, at scale, with clear failure boundaries. Week 1 teaches you to reason about those boundaries before you add stream processing, APIs, or dashboards in later weeks.
2. System Overview
StreamSocial Week 1 is an integrated capstone: Docker-hosted Kafka (KRaft, three brokers), Python producers and consumers, Redis-backed caching, and a FastAPI monitoring dashboard. Events originate from simulated user activity, publish to category-specific topics, and are consumed by an engagement pipeline that updates metrics and handles errors.
High-level components
How events move through the system
Taxonomy validation — Ten core event types map to three categories and three topics (
src/events/).Topic provisioning —
TopicManagercreates topics with configured partitions and replication factor 3.Publish — Producers serialize JSON events, route by category, and key partitions for ordering where required.
Consume —
EngagementConsumerpollscontent-interactions, deserializes, and runs engagement logic.Observe — Metrics, cluster health, and runtime state surface on the dashboard API.
[IMAGE: StreamSocial Week 1 dashboard showing broker health, topic partitions, and live publish/consume metrics]
3. Architecture Diagram
The dashboard drives the runtime controller, which orchestrates producers and consumers against the broker tier. Storage sits adjacent to the processing path for cache and DLQ persistence.
4. Core Concepts and Their Role
4.1 Event-Driven Architecture
What it does. Components communicate by publishing facts (events) instead of synchronous RPC chains. Producers remain unaware of downstream consumers.
Where it is used. src/events/models.py defines EventType, categories, and topic routing. EventPublisher emits domain events into Kafka.
Why it matters. Decoupling lets you add consumers—analytics, moderation, notifications—without redeploying producers.
4.2 Apache Kafka & Distributed Logs
What it does. Kafka stores ordered, replicated partitions across brokers. Producers append; consumers track offsets.
Where it is used. docker/docker-compose.yml runs a three-broker KRaft cluster on ports 9092–9094. config/kafka_config.py centralizes broker and client settings.
Why it matters. The log is the system of record. Everything else—APIs, caches, dashboards—is a projection of that log.
4.3 Partitioning
What it does. Topics split into partitions for parallelism. Keys route related events to the same partition for ordering.
Where it is used. src/partitioning/partition_calculator.py models 50M req/s sizing; topic_manager.py provisions user-actions (12 local / 1000 curriculum), content-interactions (6 / 500), and system-events (3).
Why it matters. Under-partitioned topics become bottlenecks; over-partitioned topics waste broker resources.
4.4 Producer Reliability
What it does. acks=all, retries, compression, and batching trade latency for durability.
Where it is used. PRODUCER_CONFIG in kafka_config.py; HighVolumeProducer with a connection pool in src/producer/.
Why it matters. Lost events in a social feed pipeline translate directly into broken user trust and bad analytics.
4.5 Consumer Groups
What it does. Consumers sharing a group.id coordinate partition assignment and offset commits.
Where it is used. EngagementConsumer with CONSUMER_GROUP_ID; runtime controller spawns a fresh group per demo session when needed.
Why it matters. Horizontal scale-out of consumers is bounded by partition count—plan both together.
4.6 Schema Evolution & Event Taxonomy
What it does. A stable taxonomy (10 events, 3 categories) prevents topic sprawl and documents contracts.
Where it is used. src/events/taxonomy.py validates exactly ten core types; validate_ten_core_events() runs at setup.
Why it matters. Production systems fail when event names and payloads drift without versioning strategy.
4.7 Fault Tolerance, Retry & Dead-Letter Queues
What it does. Transient errors retry with exponential backoff; poison messages land in a DLQ instead of blocking the consumer.
Where it is used. src/handlers/error_handler.py tracks per-offset retries and accumulates dead_letter_messages.
Why it matters. Without DLQ discipline, one bad message can stall an entire partition.
4.8 Monitoring & Observability
What it does. Exposes cluster health, publish/consume rates, and runtime state for operators.
Where it is used. src/cluster/health.py, src/utils/metrics.py, and src/monitoring/dashboard.py (FastAPI + live UI).
Why it matters. You cannot operate what you cannot see; Week 1 builds observability in from day one.
5. Engine / Core System Design
Week 1 implements the ingestion and consumption spine that later weeks extend into full stream processing (Kafka Streams, windowed aggregations, state stores).
Target layout for the full StreamSocial engine (Week 1 modules map into this structure):
Week 1 actual layout:
6. Data Flow / Control Flow Diagram
A Run Demo request hits the FastAPI control plane, which starts the producer pool and background consumer, publishes engagement events to content-interactions, and returns aggregated status JSON to the browser.
[IMAGE: Sequence of POST /api/run-demo triggering topic setup, producer start, and consumer metrics incrementing on the dashboard]
7. API Design
Week 1 ships a control-plane API on the monitoring service. Representative endpoints:
Operational (Week 1)
Returns taxonomy catalog, cluster broker summary, topic partition config, partition calculator output, and live producer/consumer metrics.
Provisions topics if needed, starts HighVolumeProducer and EngagementConsumer, and runs the demo publish loop.
Serves the interactive HTML dashboard (metrics polling, Run Demo button).
Platform endpoints (roadmap)
These illustrate how the API layer will sit in front of Kafka in later weeks:
Accepts a validated domain event and publishes to the correct topic via EventPublisher.
Reads aggregated engagement from Redis (projection of consumed events).
Returns broker count, ISR status, and topic metadata—backed by ClusterHealth.
Exposes consumer lag, retry counts, and DLQ depth for alerting.
8. State Machine Diagram
Kafka-driven pipelines move through explicit runtime states. The Week 1 RuntimeController tracks stopped, running, and demo modes; message processing adds finer-grained transitions.
Validation failures short-circuit to Error. Processing failures enter Retry with backoff; exhausted retries route to Dead Letter Queue while the consumer continues on healthy offsets.
9. Why This Approach Matters
Theory connected to running code. Partition math from Day 3 uses the same TOPIC_CONFIG that TopicManager applies to Docker brokers. Producer acks=all is not an exam answer—it is what protects demo events under broker restarts.
Less fragmented learning. Instead of five disconnected exercises, you get one repository: taxonomy → cluster → topics → producer → consumer. Each layer composes with the previous one.
Better engineering judgment. You learn to ask production questions early: How many partitions for 50M req/s? What happens when a consumer throws? Where do poison messages go? How do you prove the cluster is healthy before accepting traffic?
Week 1 is deliberately smaller than a full microservices mesh—but architecturally honest. That honesty scales when you add Kafka Streams, Kubernetes, and cross-service contracts in later weeks.
10. Conclusion
Week 1 of Kafka Mastery: Building StreamSocial transforms isolated Kafka concepts into a working event-driven platform: typed events, a three-broker cluster, provisioned topics, reliable producers, consumer groups with retry and DLQ, and a FastAPI observability layer.
The lesson is not memorizing broker ports. It is designing systems where the log is central, failures have defined paths, and every concept has a file path in the repository. Carry that system-first habit into stream processing, API gateways, and production deployments—the foundation you build in Days 1–5 is what makes those layers survivable.
Next step: Clone streamsocial-week1, run ./start.sh, open http://localhost:8765, and click Run Demo. Watch events flow while you read the code that moves them.