<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
        xmlns:content="http://purl.org/rss/1.0/modules/content/"
        xmlns:wfw="http://wellformedweb.org/CommentAPI/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:atom="http://www.w3.org/2005/Atom"
        xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
        xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
        >
    
    <channel>
        <title>System Design Roadmap - Hands-On System Design Lessons</title>
        <atom:link href="https://systemdrd.com/feed/lessons" rel="self" type="application/rss+xml" />
        <link></link>
        <description>Hands-On System Design lessons, AI Agents tutorials, and practical programming tutorials. Learn by doing with real-world projects and examples.</description>
        <lastBuildDate>Mon, 29 Jun 2026 03:47:32 +0000</lastBuildDate>
        <language>en-US</language>
        <sy:updatePeriod>hourly</sy:updatePeriod>
        <sy:updateFrequency>1</sy:updateFrequency>
        <generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://systemdrd.com/wp-content/uploads/2025/10/cropped-ChatGPT-Image-Sep-4-2025-at-09_23_55-PM-32x32.png</url>
	<title>System Design Roadmap</title>
	<link>https://systemdrd.com</link>
	<width>32</width>
	<height>32</height>
</image> 
        
                <item>
            <title>Lesson 10: Advanced Dashboard Platform - Hands-On Tutorial</title>
            <link></link>
            <comments>#respond</comments>
            <pubDate></pubDate>
            <dc:creator><![CDATA[system11]]></dc:creator>
                        <guid isPermaLink="false"></guid>
            <description><![CDATA[Lesson 10: Advanced Dashboard Platform. What We Are Building: InfraWatch Dashboard, a Grafana-class observability console for infrastructure operators. 1. Composable Grid: drag-and-drop widgets with persistent... Hands-On System Design tutorial with practical examples and real-world applications.]]></description>
            <content:encoded><![CDATA[<div class="lesson-rss-content"><h3>Hands-On System Design Tutorial</h3><p data-ai-summary="true">## Lesson 10: Advanced Dashboard Platform</p>
<p data-ai-summary="true">## What We Are Building</p>
<p data-ai-summary="true">InfraWatch **Dashboard** — a Grafana-class observability console for infrastructure operators:</p>
<p>1. **Composable Grid** — Drag-and-drop widgets with persistent layouts and a widget library<br />
2. **Advanced Charts** — Multi-series, stacked, scatter, heatmap, latency distribution, status timelines<br />
3. **Customization** — Per-dashboard themes, save/load layouts, tokenized sharing<br />
4. **Interactive Analytics** — Cross-filtering, drill-down hierarchy, time-range presets<br />
5. **<span data-ai-definition="performance">Performance</span> Layer** — LTTB downsampling, Redis query cache, WebSocket live streams<br />
6. **Template Gallery** — Semver-versioned marketplace with ratings and one-click deployment</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Core Concept</p>
<p data-ai-summary="true">Datadog and Grafana do not render raw million-point time series to the browser. They **downsample at query time** (LTTB algorithm), **cache aggregated buckets in Redis**, and **push deltas over WebSockets** so the UI stays at 60fps. The insight most tutorials miss: the dashboard is a *composition engine* — widgets are independent data contracts, not hard-coded charts. Layout state (x, y, w, h) persists separately from widget config, enabling template reuse without duplicating query logic.</p>
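<p>To make the downsampling step concrete, here is a minimal sketch of Largest-Triangle-Three-Buckets in plain Python. It is not the project's `downsample.py`; the function name `lttb` and the `(x, y)` tuple shape are assumptions chosen for illustration.</p>

```python
from typing import List, Tuple

Point = Tuple[float, float]


def lttb(points: List[Point], threshold: int) -> List[Point]:
    """Downsample `points` (sorted by x) to `threshold` points with LTTB."""
    n = len(points)
    if threshold >= n or threshold < 3:
        return list(points)  # nothing to reduce

    # The first and last points are always kept; the rest are split into buckets.
    bucket_size = (n - 2) / (threshold - 2)
    sampled = [points[0]]
    last = 0  # index of the most recently selected point

    for i in range(threshold - 2):
        # Average of the NEXT bucket acts as the third triangle vertex.
        next_start = int((i + 1) * bucket_size) + 1
        next_end = min(int((i + 2) * bucket_size) + 1, n)
        next_bucket = points[next_start:next_end]
        avg_x = sum(p[0] for p in next_bucket) / len(next_bucket)
        avg_y = sum(p[1] for p in next_bucket) / len(next_bucket)

        # In the CURRENT bucket, pick the point forming the largest triangle.
        start = int(i * bucket_size) + 1
        end = int((i + 1) * bucket_size) + 1
        ax, ay = points[last]
        best_area, best_idx = -1.0, start
        for j in range(start, end):
            bx, by = points[j]
            area = abs((ax - avg_x) * (by - ay) - (ax - bx) * (avg_y - ay)) * 0.5
            if area > best_area:
                best_area, best_idx = area, j

        sampled.append(points[best_idx])
        last = best_idx

    sampled.append(points[-1])
    return sampled
```

<p>Calling `lttb(series, 1920)` on a longer series returns exactly 1920 points, which matches the cap the chart pipeline applies. Peaks and troughs tend to survive because they form the largest triangles.</p>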
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Where This Fits</p>
<p>```<br />
InfraWatch Reports (exports + analytics API)<br />
        │<br />
InfraWatch Dashboard          ◀ this module<br />
        │<br />
Operator fleet monitoring<br />
```</p>
<p data-ai-summary="true">Dashboard consumes aggregated metric samples from upstream and serves as the operator&#8217;s primary control surface for fleet health.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Component Architecture</p>
<p>```<br />
React Console (:5230)<br />
        │  REST + WebSocket<br />
FastAPI Gateway (:5060)<br />
        │<br />
   ┌────┴────┬──────────┐<br />
PostgreSQL  Redis   LTTB Engine<br />
  :5490     :6440<br />
```</p>
<p>| Service | Port | Image | Role |<br />
|---------|------|-------|------|<br />
| backend | 5060 | python:3.12-slim | FastAPI + WebSocket metrics |<br />
| frontend | 5230 | node:22 / nginx | Grafana-style console |<br />
| postgres | 5490 | postgres:16-alpine | Dashboards, widgets, metrics |<br />
| redis | 6440 | redis:7-alpine | Chart query cache |</p>
<p data-ai-summary="true">### Project Layout</p>
<p data-ai-summary="true">Once you open the project folder, here is how the code is organized. Think of it as three layers: the <span data-ai-definition="API">API</span> (backend), the browser UI (frontend), and the scripts that wire everything together.</p>
<p>```<br />
infrawatch-dashboard/<br />
├── backend/<br />
│   ├── app/<br />
│   │   ├── main.py              # FastAPI app, lifespan, CORS<br />
│   │   ├── api/<br />
│   │   │   ├── dashboards.py    # Dashboard CRUD, layout, widgets<br />
│   │   │   ├── charts.py        # Multi-series, stacked, scatter, heatmap<br />
│   │   │   ├── filters.py       # Cross-filter options<br />
│   │   │   ├── drilldown.py     # Hierarchy navigation<br />
│   │   │   ├── templates.py     # Marketplace CRUD, instantiate<br />
│   │   │   ├── sharing.py       # Share tokens<br />
│   │   │   ├── performance.py   # Cache stats, WebSocket, Prometheus<br />
│   │   │   └── operations.py    # Overview, demo, E2E<br />
│   │   ├── core/                # config, database<br />
│   │   ├── models/              # Dashboard, Widget, Template, MetricSample<br />
│   │   ├── schemas/             # Pydantic request/response<br />
│   │   └── services/<br />
│   │       ├── metrics_service.py  # Query, aggregate, LTTB pipeline<br />
│   │       ├── cache_service.py    # Redis query cache<br />
│   │       ├── downsample.py       # LTTB algorithm<br />
│   │       ├── template_service.py # Semver, instantiate<br />
│   │       └── seed.py             # 1900+ metric samples<br />
│   ├── tests/<br />
│   │   ├── conftest.py<br />
│   │   └── test_api.py<br />
│   ├── Dockerfile<br />
│   └── requirements.txt<br />
├── frontend/<br />
│   ├── src/<br />
│   │   ├── pages/               # Console, Marketplace, Performance<br />
│   │   ├── components/<br />
│   │   │   ├── dashboard/       # Grid, WidgetRenderer, FilterPanel<br />
│   │   │   └── Layout.tsx<br />
│   │   ├── api/client.ts<br />
│   │   ├── store/filterStore.ts # Zustand state<br />
│   │   └── hooks/useMetricsStream.ts<br />
│   ├── package.json<br />
│   └── Dockerfile<br />
├── docker-compose.yml<br />
├── requirements.txt<br />
├── requirements-dev.txt<br />
├── build.sh<br />
├── stop.sh<br />
└── cleanup.sh<br />
```</p>
<p data-ai-summary="true">### How the Pieces Work Together</p>
<p data-ai-summary="true">**Widget Grid.** Dashboards store layout as JSON (`x`, `y`, `w`, `h` per widget). `react-grid-layout` handles drag-and-resize on the frontend. Layout saves go through `PUT /api/v1/dashboards/{id}/layout` with optimistic versioning: the server returns HTTP 409 on a version conflict.</p>
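<p>The versioned save can be sketched without any web framework. The names below (`DashboardLayout`, `save_layout`, `VersionConflict`) are hypothetical, and the real handler may compare versions differently; the point is only that a stale `client_version` is rejected instead of silently overwriting someone else's layout.</p>

```python
from dataclasses import dataclass, field
from typing import Dict, List


class VersionConflict(Exception):
    """Stale client version; an HTTP layer would translate this to a 409."""


@dataclass
class DashboardLayout:
    version: int = 1
    items: List[Dict[str, object]] = field(default_factory=list)


def save_layout(stored: DashboardLayout,
                items: List[Dict[str, object]],
                client_version: int) -> DashboardLayout:
    """Apply the new layout only if the client saw the latest version."""
    if client_version != stored.version:
        raise VersionConflict(
            f"expected version {stored.version}, got {client_version}"
        )
    stored.items = [dict(item) for item in items]  # copy to avoid aliasing
    stored.version += 1  # every successful save bumps the version
    return stored
```

<p>A client that receives the conflict reloads the dashboard, picks up the new version number, and retries the save.</p>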
<p data-ai-summary="true">**Chart Pipeline.** `MetricsService.timeseries()` buckets raw `MetricSample` rows into 5-minute windows, then applies LTTB downsampling when points exceed 1920. Results cache in Redis (60s TTL) before JSON serialization.</p>
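<p>As a rough illustration of the bucketing step, the sketch below groups `(unix_timestamp, value)` pairs into 5-minute windows. Averaging each bucket is an assumption (the lesson does not say which aggregate `MetricsService.timeseries()` uses), and `bucket_average` is a hypothetical name.</p>

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

BUCKET_SECONDS = 300  # 5-minute windows


def bucket_average(samples: Iterable[Tuple[float, float]]) -> List[Tuple[int, float]]:
    """Average (unix_ts, value) samples per 5-minute bucket, sorted by time."""
    acc: Dict[int, List[float]] = defaultdict(lambda: [0.0, 0.0])
    for ts, value in samples:
        start = int(ts // BUCKET_SECONDS) * BUCKET_SECONDS  # bucket start time
        acc[start][0] += value  # running sum
        acc[start][1] += 1      # running count
    return [(start, total / count) for start, (total, count) in sorted(acc.items())]
```

<p>The bucketed series is what gets handed to LTTB when it is still longer than 1920 points.</p>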
<p data-ai-summary="true">**Cross-Filter and Drill-Down.** Zustand store holds `timeRange`, `services[]`, `regions[]`, and `drillPath[]`. Progressive filter options come from `/api/v1/filters/available`. Drill-down navigates service → endpoint → region → environment via `/api/v1/drilldown/navigate`.</p>
<p data-ai-summary="true">**Template Marketplace.** Templates store `config: { layout, widgets, theme }`. Instantiating a template creates a new Dashboard row plus its Widget rows. Config changes bump semver (major if a widget is removed, minor if one is added, patch otherwise).</p>
<p data-ai-summary="true">**Real-Time Stream.** WebSocket at `/api/v1/performance/stream` pushes `metrics_batch` every 3 seconds. The frontend `useMetricsStream` hook updates the live point counter in the <span data-ai-definition="performance">performance</span> panel.</p>
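<p>The semver rule is small enough to sketch directly. This assumes widgets are compared by ID and that a removal outranks an addition when both happen in one change; `next_version` is a hypothetical helper, not the function in `template_service.py`.</p>

```python
from typing import Iterable


def next_version(current: str,
                 old_widgets: Iterable[str],
                 new_widgets: Iterable[str]) -> str:
    """Bump a MAJOR.MINOR.PATCH string based on widget set changes."""
    major, minor, patch = (int(part) for part in current.split("."))
    old_ids, new_ids = set(old_widgets), set(new_widgets)
    if old_ids - new_ids:   # a widget was removed: breaking change
        return f"{major + 1}.0.0"
    if new_ids - old_ids:   # a widget was added: compatible feature
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # config-only change
```

<p>For example, going from widgets `cpu, mem` to `cpu` on version `1.2.3` yields `2.0.0`, because dashboards built from the old template would lose a widget.</p>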
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Control &#038; Data Flow</p>
<p>1. Operator selects time range and cross-filters (service, region)<br />
2. <span data-ai-definition="API">API</span> checks Redis; on a cache miss it queries the indexed `metric_samples` rows<br />
3. LTTB reduces points to ≤1920 before JSON serialization<br />
4. WebSocket pushes metric batches every 3s for live widgets<br />
5. Drag-resize triggers a versioned layout save (optimistic locking; HTTP 409 on conflict)</p>
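<p>Step 2 is the classic cache-aside pattern. The sketch below uses an in-memory dictionary as a stand-in for Redis so it runs anywhere; `TTLCache` and `cache_key` are hypothetical names, and the key format is an assumption rather than what `cache_service.py` does.</p>

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:  # still fresh
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = compute()  # e.g. the database query
        self._store[key] = (now + self._ttl, value)
        return value


def cache_key(endpoint: str, params: Dict[str, Any]) -> str:
    """Build a stable key: the same filters in any order map to one entry."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"chart:{endpoint}:{digest}"
```

<p>Sorting the filter keys before hashing matters: two operators who pick the same time range and services in a different order should share one cached result.</p>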
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Widget Lifecycle</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Operator Console</p>
<p data-ai-summary="true">The UI follows a Grafana-inspired dark theme (`#111217` background, `#73BF69` accent). Three alternate themes ship with the project: light, ocean, and sunset.</p>
<p>| Page | Route | What You See |<br />
|------|-------|--------------|<br />
| Command Center | `/` | KPI cards, widget grid, filters, drill-down, <span data-ai-definition="performance">performance</span> panel |<br />
| Template Gallery | `/marketplace` | Search, ratings, one-click deploy |<br />
| <span data-ai-definition="performance">Performance</span> | `/performance` | Cache stats, WebSocket status, live points |</p>
<p data-ai-summary="true">### Widget Types</p>
<p>| Type | What It Shows |<br />
|------|---------------|<br />
| `metric_card` | Single KPI value |<br />
| `multi_series` | Line chart, multiple metrics |<br />
| `stacked` | Stacked bar by service |<br />
| `scatter` | CPU vs memory |<br />
| `heatmap` | Traffic by hour/day |<br />
| `status_timeline` | Service status chips |<br />
| `latency` | Latency distribution histogram |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## <span data-ai-definition="API">API</span> Reference</p>
<p data-ai-summary="true">These are the endpoints you will call from the frontend or test with curl.</p>
<p data-ai-summary="true">### Dashboards</p>
<p>| Endpoint | Method | Purpose |<br />
|----------|--------|---------|<br />
| `/api/v1/dashboards` | GET | List dashboards |<br />
| `/api/v1/dashboards` | POST | Create dashboard |<br />
| `/api/v1/dashboards/{id}` | GET | Dashboard detail + widgets |<br />
| `/api/v1/dashboards/{id}/layout` | PUT | Persist grid layout |<br />
| `/api/v1/dashboards/{id}/theme` | PUT | Change theme |<br />
| `/api/v1/dashboards/{id}/widgets` | POST | Add widget |</p>
<p data-ai-summary="true">### Charts</p>
<p>| Endpoint | Purpose |<br />
|----------|---------|<br />
| `GET /api/v1/charts/timeseries` | LTTB-downsampled time series |<br />
| `GET /api/v1/charts/multi-series` | Multi-metric lines |<br />
| `GET /api/v1/charts/stacked` | Stacked bar by service |<br />
| `GET /api/v1/charts/scatter` | CPU vs memory correlation |<br />
| `GET /api/v1/charts/heatmap` | Hour/day traffic heatmap |<br />
| `GET /api/v1/charts/latency-distribution` | Latency histogram |<br />
| `GET /api/v1/charts/status-timeline` | Service status events |</p>
<p data-ai-summary="true">### Filters, Drill-Down, Templates, and Sharing</p>
<p>| Endpoint | Method | Purpose |<br />
|----------|--------|---------|<br />
| `/api/v1/filters/available` | GET | Cross-filter options with counts |<br />
| `/api/v1/drilldown/navigate` | POST | Drill into hierarchy level |<br />
| `/api/v1/drilldown/details` | POST | Detail records for path |<br />
| `/api/v1/templates` | GET | Marketplace search |<br />
| `/api/v1/templates/{id}/instantiate` | POST | Deploy template as dashboard |<br />
| `/api/v1/templates/{id}/rate` | POST | Rate template |<br />
| `/api/v1/sharing/{id}/share` | POST | Create share link |<br />
| `/api/v1/sharing/shared/{token}` | GET | View shared dashboard |</p>
<p data-ai-summary="true">### <span data-ai-definition="performance">Performance</span> and Operations</p>
<p>| Endpoint | Purpose |<br />
|----------|---------|<br />
| `GET /api/v1/performance/cache` | Cache hit rate stats |<br />
| `GET /api/v1/performance/stats` | Uptime, clients, cache |<br />
| `GET /api/v1/performance/prometheus` | Prometheus metrics |<br />
| `WS /api/v1/performance/stream` | Live metric batches |<br />
| `GET /api/v1/operations/overview` | Dashboard KPIs |<br />
| `POST /api/v1/operations/demo` | Inject live metric samples |<br />
| `POST /api/v1/operations/e2e` | Pipeline validation |</p>
<p data-ai-summary="true">Quick health check after the stack is running:</p>
<p>```bash<br />
curl -s http://localhost:5060/api/health<br />
curl -s http://localhost:5060/api/v1/operations/overview | python3 -m json.tool<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Build, Test, and Demo</p>
<p data-ai-summary="true">Everything runs from inside the `infrawatch-dashboard` folder. Make the scripts executable once, then pick the mode that fits your machine.</p>
<p>```bash<br />
cd infrawatch-dashboard<br />
chmod +x build.sh stop.sh cleanup.sh<br />
```</p>
<p data-ai-summary="true">### Step 1 — Run Tests First</p>
<p data-ai-summary="true">Always start here. If tests pass, your environment is wired correctly before you spin up services.</p>
<p>```bash<br />
./build.sh test     # 14 pytest + 1 vitest<br />
```</p>
<p>| Suite | Count | What It Covers |<br />
|-------|-------|----------------|<br />
| pytest | 14 | health, overview, charts, filters, drilldown, templates, layout, sharing, cache, e2e |<br />
| vitest | 1 | App renders InfraWatch Dashboard branding |</p>
<p data-ai-summary="true">Backend tests use in-memory SQLite with `StaticPool` and dependency overrides in `tests/conftest.py`.</p>
<p data-ai-summary="true">### Step 2 — Start the Stack (Hybrid Mode)</p>
<p data-ai-summary="true">This is the recommended path for daily development. Postgres and Redis run in Docker; the <span data-ai-definition="API">API</span> and UI run on your host machine.</p>
<p>```bash<br />
cd infrawatch-dashboard<br />
./build.sh local    # Postgres/Redis in Docker, API+UI on host<br />
./build.sh demo     # inject metrics, verify non-zero chart points<br />
```</p>
<p data-ai-summary="true">Open **http://localhost:5230** — Command Center shows live KPI cards and charts.</p>
<p data-ai-summary="true">Expected output from `./build.sh demo`:</p>
<p>```<br />
{"status":"healthy","service":"infrawatch-dashboard"}<br />
Metric samples: 1900+ | Chart points: 9+<br />
{"passed": true}<br />
Open http://localhost:5230<br />
```</p>
<p data-ai-summary="true">### Step 3 — Full Docker Stack (Optional)</p>
<p data-ai-summary="true">Use this when you want every service inside containers, matching a production-like layout.</p>
<p>```bash<br />
./build.sh docker<br />
./build.sh demo<br />
```</p>
<p data-ai-summary="true">Requires Docker Hub access for `python:3.12-slim`, `node:22-alpine`, `nginx:alpine`, `postgres:16-alpine`, and `redis:7-alpine`. If your network blocks Docker Hub, pre-pull the images:</p>
<p>```bash<br />
docker pull python:3.12-slim node:22-alpine nginx:alpine postgres:16-alpine redis:7-alpine<br />
./build.sh docker<br />
```</p>
<p data-ai-summary="true">Or run compose directly:</p>
<p>```bash<br />
docker compose up --build -d<br />
docker compose logs -f backend<br />
docker compose down -v<br />
```</p>
<p data-ai-summary="true">### Step 4 — Verify the Dashboard</p>
<p data-ai-summary="true">Walk through these checks in the browser after `./build.sh demo`:</p>
<p>1. Command Center KPI cards show non-zero numbers (not dashes or zeros)<br />
2. Charts render with visible data points<br />
3. Drag a widget, refresh the page — layout stays where you left it<br />
4. <span data-ai-definition="performance">Performance</span> page shows WebSocket as Connected with live points updating<br />
5. Template Gallery lists templates and Deploy Template creates a new board</p>
<p data-ai-summary="true">**Success criteria:** `metric_samples > 0`, chart `points > 0`, E2E checks all pass, dashboard KPIs update after demo injection.</p>
<p data-ai-summary="true">### Stop and Clean Up</p>
<p data-ai-summary="true">When you are done for the day:</p>
<p>```bash<br />
./stop.sh       # stop host processes and Docker containers<br />
./cleanup.sh    # remove node_modules, venv, caches, prune Docker resources<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Key Source Files</p>
<p data-ai-summary="true">If you get lost in the codebase, start with these files.</p>
<p>| Path | Role |<br />
|------|------|<br />
| `backend/app/main.py` | FastAPI app, lifespan, router registration |<br />
| `backend/app/services/metrics_service.py` | Query, aggregate, chart data |<br />
| `backend/app/services/downsample.py` | LTTB downsampling |<br />
| `backend/app/services/cache_service.py` | Redis query cache |<br />
| `backend/app/services/seed.py` | Metric samples + default dashboard |<br />
| `backend/app/api/performance.py` | WebSocket stream, Prometheus |<br />
| `frontend/src/pages/Console.tsx` | Main dashboard with grid + filters |<br />
| `frontend/src/components/dashboard/WidgetRenderer.tsx` | Chart widgets |<br />
| `frontend/src/components/dashboard/DashboardGrid.tsx` | react-grid-layout |<br />
| `frontend/src/api/client.ts` | Typed API client |<br />
| `frontend/src/hooks/useMetricsStream.ts` | WebSocket live updates |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Troubleshooting</p>
<p>| Issue | Fix |<br />
|-------|-----|<br />
| Port 5060/5230 in use | `./stop.sh` then retry |<br />
| Dashboard zeros | Run `./build.sh demo` or click refresh icon in UI |<br />
| Layout 409 conflict | Refresh page to get latest version |<br />
| Docker build fails | Use `./build.sh local` (hybrid mode) |<br />
| Stale node_modules | `./cleanup.sh` then `./build.sh local` |<br />
| Redis cache always miss | Ensure Redis container on :6440 |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Assignment</p>
<p data-ai-summary="true">Build a **custom widget type** `alert_summary` that displays the top 5 services by error-rate from the aggregated <span data-ai-definition="API">API</span>. Add it to the widget library, persist it on a dashboard, and verify layout survives page refresh.</p>
<p data-ai-summary="true">### Solution Hints</p>
<p>1. Extend `WidgetRenderer` with a new case calling `/api/v1/charts/aggregated?group_by=service`<br />
2. Register the type in `WidgetLibrary` with an appropriate icon<br />
3. POST to `/api/v1/dashboards/{id}/widgets` with `widget_type: alert_summary`<br />
4. Confirm layout JSON in GET `/api/v1/dashboards/{id}` includes your widget&#8217;s grid position</p>
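<p>One more hint for the ranking itself. The sketch below assumes each aggregated row carries `service`, `requests`, and `errors` fields; that response shape is hypothetical, so adapt the keys to whatever the aggregated endpoint actually returns. The same logic ports directly to the TypeScript widget.</p>

```python
from typing import Dict, Iterable, List


def top_services_by_error_rate(rows: Iterable[Dict[str, object]],
                               limit: int = 5) -> List[Dict[str, object]]:
    """Rank services by errors/requests, highest first, keeping `limit` rows."""
    ranked = []
    for row in rows:
        requests = float(row["requests"])
        if requests > 0:  # skip idle services to avoid dividing by zero
            ranked.append({
                "service": row["service"],
                "error_rate": float(row["errors"]) / requests,
            })
    ranked.sort(key=lambda item: item["error_rate"], reverse=True)
    return ranked[:limit]
```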
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">*Next: User &#038; Team Management — multi-tenant dashboards with RBAC inheritance.*</p>
</div>]]></content:encoded>
                                </item>
                <item>
            <title>Lesson 9: Data Export &amp; Reporting Platform - Hands-On Tutorial</title>
            <link></link>
            <comments>#respond</comments>
            <pubDate></pubDate>
            <dc:creator><![CDATA[system11]]></dc:creator>
                        <guid isPermaLink="false"></guid>
            <description><![CDATA[Lesson 9: Data Export &#038; Reporting Platform. What We Are Building: InfraWatch Reports, a production-grade export and reporting control plane. 1. Export Engine: multi-format streaming exports... Hands-On System Design tutorial with practical examples and real-world applications.]]></description>
            <content:encoded><![CDATA[<div class="lesson-rss-content"><h3>Hands-On System Design Tutorial</h3><p>## Lesson 9: Data Export &#038; Reporting Platform</p>
<p>## What We Are Building</p>
<p data-ai-summary="true">InfraWatch **Reports** — a production-grade export and reporting control plane:</p>
<p>1. **Export Engine** — Multi-format streaming exports (CSV, JSON, Excel, PDF) with checksum validation<br />
2. **Report Templates** — Jinja2-driven dynamic reports with preview and scheduled delivery<br />
3. **Visualization <span data-ai-definition="API">API</span>** — Chart endpoints with pre-aggregated hourly metrics and Redis <span data-ai-definition="caching">caching</span><br />
4. **Advanced Analytics** — Statistical baselines, anomaly detection, forecasting, and correlation matrices<br />
5. **<span data-ai-definition="performance">Performance</span> Layer** — Query <span data-ai-definition="caching">caching</span>, index recommendations, Prometheus metrics, slow-query tracking<br />
6. **Operator Console** — Grafana-inspired dashboard with export wizard, real-time progress, and history</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Core Concept</p>
<p data-ai-summary="true">Export systems in production (Datadog, Grafana, Splunk) never dump raw tables to users. They **stream, compress, and cache** query results because a single unbounded export can saturate <span data-ai-definition="database">database</span> I/O and starve live dashboards. The non-obvious insight: treat exports as **asynchronous jobs with TTL-bound artifacts**, not synchronous <span data-ai-definition="API">API</span> responses. Grafana&#8217;s report renderer and Datadog&#8217;s scheduled snapshots both follow this pattern — queue the work, stream batches, validate integrity, expire the file.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Where This Fits</p>
<p>```<br />
InfraWatch Realtime (live streams)<br />
        │<br />
InfraWatch Reports          ◀ this module<br />
        │<br />
InfraWatch Dashboard (widgets)<br />
```</p>
<p data-ai-summary="true">Reports consumes aggregated metrics and notification data upstream, producing downloadable artifacts and scheduled executive summaries downstream.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Component Architecture</p>
<p>| Service | Port | Image | Role |<br />
|---------|------|-------|------|<br />
| backend | 5050 | Dockerfile | FastAPI <span data-ai-definition="API">API</span> + WebSocket progress |<br />
| frontend | 5220 | Dockerfile / Vite | Operator console |<br />
| postgres | 5480 | postgres:16-alpine | Metrics, jobs, templates |<br />
| redis | 6430 | redis:7-alpine | Chart query cache |</p>
<p data-ai-summary="true">The <span data-ai-definition="API">API</span> sits in the middle. The React console talks to it over REST and WebSocket. PostgreSQL stores your metrics and export jobs. Redis caches chart queries so repeated dashboard loads do not hammer the <span data-ai-definition="database">database</span>.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Inside the Project</p>
<p data-ai-summary="true">Once you open the repository, the layout maps directly to those services:</p>
<p>```<br />
infrawatch-reports/<br />
├── backend/<br />
│   ├── app/<br />
│   │   ├── main.py              # FastAPI app, lifespan, WebSocket<br />
│   │   ├── api/                 # exports, templates, schedules, analytics<br />
│   │   ├── core/                # config, database<br />
│   │   ├── models/              # ExportJob, MetricSample, etc.<br />
│   │   ├── schemas/             # Pydantic request/response<br />
│   │   └── services/            # export_engine, analytics, cache, seed<br />
│   ├── tests/test_api.py<br />
│   └── requirements.txt<br />
├── frontend/<br />
│   ├── src/pages/               # 7 console pages<br />
│   ├── src/api/client.ts<br />
│   └── package.json<br />
├── docker-compose.yml<br />
├── build.sh<br />
├── stop.sh<br />
└── cleanup.sh<br />
```</p>
<p data-ai-summary="true">You do not need to memorize every file. Focus on three folders: `api/` (routes), `services/` (business logic), and `frontend/src/pages/` (what the operator sees).</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## How the Pieces Work Together</p>
<p data-ai-summary="true">**Export engine.** A POST to `/api/v1/exports` creates a job. The job moves through `pending`, `processing`, and `completed` (or `failed` / `expired`). Rows stream in batches, the file gets an MD5 checksum, and downloads expire after 24 hours.</p>
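<p>Here is a minimal sketch of batch streaming with an incremental checksum, using only the standard library and CSV as the format. `stream_csv` and `write_export` are hypothetical names, and the batch size of 500 is an assumption; the real `export_engine` may differ.</p>

```python
import csv
import hashlib
import io
from typing import BinaryIO, Dict, Iterable, Iterator, List


def stream_csv(rows: Iterable[Dict[str, object]], columns: List[str],
               batch_size: int = 500) -> Iterator[bytes]:
    """Yield CSV chunks holding at most `batch_size` rows each."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(columns)  # header travels with the first chunk
    pending = 0
    for row in rows:
        writer.writerow([row.get(col, "") for col in columns])
        pending += 1
        if pending >= batch_size:
            yield buffer.getvalue().encode("utf-8")
            buffer.seek(0)
            buffer.truncate(0)  # reuse the buffer; memory stays bounded
            pending = 0
    if buffer.getvalue():
        yield buffer.getvalue().encode("utf-8")


def write_export(chunks: Iterable[bytes], sink: BinaryIO) -> str:
    """Write chunks to `sink` and return the MD5 hex digest of all bytes."""
    md5 = hashlib.md5(usedforsecurity=False)  # integrity check, not security
    for chunk in chunks:
        sink.write(chunk)
        md5.update(chunk)  # hash incrementally; never re-read the file
    return md5.hexdigest()
```

<p>Because the digest is updated per chunk, the engine never holds the full export in memory, and the client can verify the download against the stored checksum.</p>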
<p data-ai-summary="true">**Report templates.** Jinja2 templates support variable extraction, live preview, and scheduled generation. A weekly summary template is seeded on first startup.</p>
<p data-ai-summary="true">**Analytics pipeline.** Raw metric samples roll up into hourly buckets. Chart endpoints check Redis first (60-second TTL), then PostgreSQL. Anomaly detection uses z-scores; forecasting uses linear regression; correlations use Pearson coefficients.</p>
<p data-ai-summary="true">**<span data-ai-definition="performance">Performance</span> layer.** In-memory cache plus Redis, latency histograms at P50/P95/P99, connection pool stats, and a Prometheus endpoint at `/api/v1/performance/prometheus`.</p>
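<p>For intuition on the percentile figures, the sketch below computes P50/P95/P99 from raw latency samples with `statistics.quantiles`. It assumes you have the individual samples at hand; a Prometheus histogram estimates percentiles from bucket counts instead, so the numbers will not match exactly.</p>

```python
from statistics import quantiles
from typing import Dict, Sequence


def latency_percentiles(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return P50/P95/P99 of the given latency samples (needs 2+ samples)."""
    # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```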
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Control &#038; Data Flow</p>
<p>1. Operator selects data source and format in the export wizard<br />
2. <span data-ai-definition="API">API</span> checks Redis cache for chart queries; misses hit PostgreSQL with indexed scans<br />
3. Export engine streams rows in batches, writes file, computes MD5 checksum<br />
4. WebSocket pushes progress stages: fetching → converting → complete<br />
5. Completed files expire after 24 hours; Prometheus records latency percentiles</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Export Job Lifecycle</p>
<p data-ai-summary="true">Every export job moves from `pending` to `processing` to `completed`. If something breaks during processing, the job lands in `failed` with an error message. If nobody downloads within 24 hours, it moves to `expired`.</p>
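<p>The lifecycle reads naturally as a small state machine. The sketch below encodes only the transitions named in this lesson; `JobStatus`, `advance`, and `is_expired` are hypothetical names, and measuring the 24 hours from completion time is an assumption.</p>

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    EXPIRED = "expired"


# Allowed transitions; failed and expired are terminal states.
TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: {JobStatus.EXPIRED},
    JobStatus.FAILED: set(),
    JobStatus.EXPIRED: set(),
}

ARTIFACT_TTL = timedelta(hours=24)


def advance(current: JobStatus, target: JobStatus) -> JobStatus:
    """Return `target` if the move is legal, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def is_expired(completed_at: datetime, now: datetime) -> bool:
    """True once the artifact has outlived its 24-hour window."""
    return now - completed_at >= ARTIFACT_TTL
```

<p>Rejecting illegal moves in one place keeps a late worker from flipping an `expired` job back to `completed`.</p>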
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## <span data-ai-definition="API">API</span> Reference</p>
<p data-ai-summary="true">### Exports</p>
<p>| Endpoint | Method | Purpose |<br />
|----------|--------|---------|<br />
| `/api/v1/exports` | POST | Create export job |<br />
| `/api/v1/exports` | GET | List jobs |<br />
| `/api/v1/exports/{id}` | GET | Job status |<br />
| `/api/v1/exports/{id}/download` | GET | Download file |<br />
| `/api/v1/exports/{id}` | DELETE | Cancel job |</p>
<p data-ai-summary="true">### Templates and Schedules</p>
<p>| Endpoint | Method | Purpose |<br />
|----------|--------|---------|<br />
| `/api/v1/templates` | GET/POST | Template CRUD |<br />
| `/api/v1/templates/{id}/preview` | POST | Render preview |<br />
| `/api/v1/schedules` | GET/POST | Cron schedules |<br />
| `/api/v1/schedules/run-due` | POST | Trigger due schedules |</p>
<p data-ai-summary="true">### Analytics</p>
<p>| Endpoint | Purpose |<br />
|----------|---------|<br />
| `GET /api/v1/analytics/chart` | Multi-series chart data |<br />
| `GET /api/v1/analytics/trends` | Direction, percent change, moving average |<br />
| `GET /api/v1/analytics/compare` | Dimension comparison |<br />
| `GET /api/v1/analytics/drilldown` | Channel to template hierarchy |<br />
| `GET /api/v1/analytics/statistics/{metric}` | mean, p95, p99 |<br />
| `GET /api/v1/analytics/anomalies` | z-score detection |<br />
| `GET /api/v1/analytics/predictions/{metric}` | Forecast with confidence band |<br />
| `GET /api/v1/analytics/correlations/{metric}` | Cross-metric Pearson r |</p>
<p data-ai-summary="true">### Operations</p>
<p>| Endpoint | Purpose |<br />
|----------|---------|<br />
| `GET /api/v1/operations/overview` | Dashboard KPIs |<br />
| `POST /api/v1/operations/demo` | Sample export |<br />
| `POST /api/v1/operations/e2e` | Pipeline validation |</p>
<p data-ai-summary="true">Quick <span data-ai-definition="API">API</span> check after the stack is running:</p>
<p>```bash<br />
curl -s http://localhost:5050/api/health<br />
curl -s http://localhost:5050/api/v1/operations/overview | python3 -m json.tool<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Operator Console</p>
<p data-ai-summary="true">The UI uses a Grafana-inspired dark theme. Each sidebar page maps to one part of the platform:</p>
<p>| Page | Route | What you will see |<br />
|------|-------|-------------------|<br />
| Overview | `/` | 8 KPI cards and a delivery volume chart |<br />
| Exports | `/exports` | 3-step wizard and job table |<br />
| Templates | `/templates` | Jinja2 template cards |<br />
| Schedules | `/schedules` | Cron schedule table |<br />
| Analytics | `/analytics` | Comparison, trends, stats, anomalies, predictions, correlations |<br />
| <span data-ai-definition="performance">Performance</span> | `/performance` | Latency percentiles, CPU, cache stats |<br />
| History | `/history` | Completed and failed export log |</p>
<p data-ai-summary="true">Walk every page after startup. If KPIs show zero, run the demo step below before assuming something is broken.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Build, Test and Demo</p>
<p data-ai-summary="true">### Prerequisites</p>
<p data-ai-summary="true">Python 3.12+, Node.js 22+, and Docker with Compose v2. Ports 5050, 5220, 5480, and 6430 must be free.</p>
<p data-ai-summary="true">### Step 1 — Enter the project and make scripts executable</p>
<p>```bash<br />
cd week9/infrawatch-reports<br />
chmod +x build.sh stop.sh cleanup.sh<br />
```</p>
<p data-ai-summary="true">### Step 2 — Run tests</p>
<p>```bash<br />
./build.sh test<br />
```</p>
<p data-ai-summary="true">This runs 14 pytest cases (health, charts, exports, download, end-to-end) and 1 vitest case (UI renders). All should pass before you start the stack.</p>
<p data-ai-summary="true">### Step 3 — Start the stack (hybrid mode, recommended)</p>
<p data-ai-summary="true">Postgres and Redis run in Docker; <span data-ai-definition="API">API</span> and UI run on the host.</p>
<p>```bash<br />
./build.sh local<br />
```</p>
<p data-ai-summary="true">Open **http://localhost:5220**. The <span data-ai-definition="API">API</span> listens on **http://localhost:5050**.</p>
<p data-ai-summary="true">### Step 4 — Run the demo validation</p>
<p data-ai-summary="true">With the stack running:</p>
<p>```bash<br />
./build.sh demo<br />
```</p>
<p data-ai-summary="true">**Expected demo output:**</p>
<p>```<br />
{"status":"healthy","service":"infrawatch-reports"}<br />
Metric samples: 2500+ | Delivery volume: >0<br />
{"passed": true, "chart_points": >0, "export_rows": >0}<br />
Open http://localhost:5220<br />
```</p>
<p data-ai-summary="true">### Step 5 — Walk the UI manually</p>
<p>1. **Overview** — confirm KPI cards are not zero; click **Generate Sample Export**<br />
2. **Exports** — click Next twice, then **Start Export**; wait for the spinner; download the file<br />
3. **Analytics** — open each tab; charts should render with real data<br />
4. **<span data-ai-definition="performance">Performance</span>** — CPU and latency values should update</p>
<p data-ai-summary="true">### Full pipeline in one command</p>
<p>```bash<br />
./build.sh all<br />
```</p>
<p data-ai-summary="true">Runs test, local start, and demo in sequence.</p>
<p data-ai-summary="true">### With Docker (full stack)</p>
<p>```bash<br />
./build.sh docker<br />
```</p>
<p data-ai-summary="true">Requires Docker Hub access for `python:3.12-slim`, `node:22-alpine`, and `nginx:alpine`. If Hub is unreachable, the script falls back to hybrid mode automatically.</p>
<p data-ai-summary="true">Pre-pull images when your network is stable:</p>
<p>```bash<br />
docker pull python:3.12-slim node:22-alpine nginx:alpine postgres:16-alpine redis:7-alpine<br />
./build.sh docker<br />
```</p>
<p data-ai-summary="true">On WSL, if you see `lookup registry-1.docker.io: no such host`:</p>
<p>```bash<br />
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf<br />
```</p>
<p data-ai-summary="true">Restart Docker Desktop, then retry.</p>
<p data-ai-summary="true">### Stop and clean up</p>
<p>```bash<br />
./stop.sh        # stop host processes and compose stack<br />
./cleanup.sh     # remove node_modules, venv, caches, project Docker images<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Success Criteria</p>
<p>- [ ] `./build.sh test` — all backend and frontend tests pass<br />
- [ ] Overview KPIs show non-zero metric samples and delivery volume<br />
- [ ] Export wizard creates downloadable CSV/JSON/Excel/PDF files<br />
- [ ] Analytics tabs render comparison charts, trends, and correlations<br />
- [ ] <span data-ai-definition="performance">Performance</span> page shows live CPU and query latency percentiles<br />
- [ ] Templates and schedules pages display seeded data</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Troubleshooting</p>
<p>| Issue | What to do |<br />
|-------|------------|<br />
| Port 5050 in use | Run `./stop.sh`, then retry |<br />
| Export button seems dead | Complete all 3 wizard steps; wait for the exporting spinner |<br />
| Download shows server error | Restart backend; confirm export status is `completed` |<br />
| Docker build fails on Hub | Use `./build.sh local` instead |<br />
| Dashboard shows all zeros | Run `./build.sh demo` or **Generate Sample Export** |<br />
| Stale dependencies | Run `./cleanup.sh`, then `./build.sh local` |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Key Files to Know</p>
<p>| Path | Role |<br />
|------|------|<br />
| `backend/app/main.py` | FastAPI app, lifespan, WebSocket |<br />
| `backend/app/services/export_engine.py` | Format writers, job processing |<br />
| `backend/app/services/analytics_service.py` | Charts, trends, statistical analysis |<br />
| `backend/app/services/cache_service.py` | Redis and memory cache |<br />
| `backend/app/services/seed.py` | Sample metric data on startup |<br />
| `frontend/src/pages/Exports.tsx` | Export wizard and download |<br />
| `frontend/src/pages/Analytics.tsx` | Six-tab analytics console |<br />
| `frontend/src/api/client.ts` | <span data-ai-definition="API">API</span> client and export polling |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Assignment</p>
<p data-ai-summary="true">Add a custom report template that includes a bar chart of failure rates per channel. Schedule it to run every Monday at 08:00 UTC and deliver it to a list of email recipients.</p>
<p data-ai-summary="true">**Hints:** Create a Jinja2 template with a `{% for %}` loop over channel data. Pass the data returned by the `/api/v1/analytics/compare` endpoint into the template context. Register a cron schedule via the Schedules <span data-ai-definition="API">API</span> with the expression `0 8 * * 1`.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Solution Hints</p>
<p>1. Fetch comparison data: `GET /api/v1/analytics/compare?metric_name=failure_rate`<br />
2. POST to `/api/v1/templates` with HTML containing `{% for item in channels %}`<br />
3. POST to `/api/v1/schedules` with `cron_expression: "0 8 * * 1"` and `email_recipients: ["you@domain.com"]`<br />
4. Trigger manually: `POST /api/v1/schedules/run-due` after setting `next_run_at` to a time in the past</p>
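<p>To sanity-check the cron expression from step 3, the sketch below computes when `0 8 * * 1` next fires and builds a request body. The field names `cron_expression` and `email_recipients` come from the hints above; the helper name and the `template_id` field are assumptions, not confirmed parts of the Schedules API.</p>

```python
import json
from datetime import datetime, timedelta, timezone


def next_monday_0800_utc(now):
    """Next firing time of cron `0 8 * * 1` strictly after `now` (UTC)."""
    candidate = now.replace(hour=8, minute=0, second=0, microsecond=0)
    # datetime.weekday(): Monday is 0, so this is the day offset to Monday.
    days_ahead = (0 - candidate.weekday()) % 7
    candidate += timedelta(days=days_ahead)
    if now >= candidate:
        candidate += timedelta(days=7)  # this week's slot has already passed
    return candidate


payload = {
    "cron_expression": "0 8 * * 1",
    "email_recipients": ["you@domain.com"],
    "template_id": 1,  # assumption: how a schedule references its template
}

now = datetime.now(timezone.utc)
print("next run:", next_monday_0800_utc(now).isoformat())
print(json.dumps(payload))
```

<p>Comparing the computed time with the `next_run_at` value the service stores is a quick way to confirm the expression was parsed as intended.</p>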
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">*Next: Advanced Dashboard Features — drag-and-drop widgets consuming these chart APIs.*</p>
</div>]]></content:encoded>
                                </item>
                <item>
            <title> - Hands-On Tutorial</title>
            <link></link>
            <comments>#respond</comments>
            <pubDate></pubDate>
            <dc:creator><![CDATA[system11]]></dc:creator>
                        <guid isPermaLink="false"></guid>
            <description><![CDATA[## Lesson 8: Real-time Streaming Control Plane If you have ever watched a Datadog dashboard update live, or seen PagerDuty fire an alert the second a server crosses a threshold,... Hands-On System Design tutorial with practical examples and real-world applications.]]></description>
            <content:encoded><![CDATA[<div class="lesson-rss-content"><h3>Hands-On System Design Tutorial</h3><p data-ai-summary="true">## Lesson 8: Real-time Streaming Control Plane</p>
<p data-ai-summary="true">If you have ever watched a Datadog dashboard update live, or seen PagerDuty fire an alert the second a server crosses a threshold, you have already seen real-time infrastructure monitoring in action. Today you will build that transport layer yourself — not as a toy chat app, but as a production-grade streaming control plane called **InfraWatch Realtime**.</p>
<p data-ai-summary="true">By the end of this lesson you will have a working operator console with live charts, authenticated WebSocket connections, throttled metric delivery, and a reliability lab that measures p99 latency under load. Let us get into it.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## What We Are Building</p>
<p data-ai-summary="true">InfraWatch **Realtime** — a production streaming control plane for infrastructure operators:</p>
<p>1. **Connection Hub** — WebSocket transport with JWT auth, rooms, and reconnection tracking<br />
2. **Data Streams** — Live metric streaming, alert broadcasting, and throttled delivery<br />
3. **Event Protocol** — Versioned envelopes, client sync, conflict resolution, offline queue<br />
4. **<span data-ai-definition="performance">Performance</span> Layer** — Redis queuing, bandwidth optimization, circuit breakers<br />
5. **Reliability Lab** — Load tests, concurrency bursts, and p99 benchmarking<br />
6. **Live Console** — Real-time charts, connection status, auto-refresh, offline fallback<br />
7. **Operations** — End-to-end orchestration with graceful degradation and recovery</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Core Concept</p>
<p data-ai-summary="true">Real-time infrastructure monitoring is not about pushing raw data faster — it is about **controlled delivery under load**. Datadog Live Processes, PagerDuty incident streams, and Grafana live annotations all share the same pattern: authenticate connections, subscribe to topics, throttle noisy metrics, and degrade gracefully when backends saturate.</p>
<p data-ai-summary="true">The non-obvious insight: **the WebSocket is a control channel, not a firehose**. Production systems batch, compress, and prioritize — critical alerts bypass throttles. When a client disconnects, events queue locally and flush on reconnect with version-based conflict resolution, the same last-write-wins pattern Stripe uses for dashboard state sync.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Where This Fits</p>
<p>```<br />
InfraWatch Alerts + Notify<br />
        │<br />
InfraWatch Realtime          ◀ this module<br />
        │<br />
InfraWatch Export / Dashboard<br />
```</p>
<p data-ai-summary="true">Realtime sits downstream of alerts and notifications, providing the live transport layer that powers operator consoles across the platform.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Component Architecture</p>
<p>| Service | Port | Image / Build | Role |<br />
|---------|------|---------------|------|<br />
| backend | 5050 | Dockerfile | FastAPI <span data-ai-definition="API">API</span> + WebSocket hub |<br />
| frontend | 5220 | Dockerfile / Vite | Operator console |<br />
| postgres | 5480 | postgres:16-alpine | Sessions, alerts, events, test runs |<br />
| redis | 6430 | redis:7-alpine | Priority message queue |</p>
<p data-ai-summary="true">Think of the backend as air traffic control. It does not fly the planes (collect metrics on every server in your fleet) — it routes, prioritizes, and delivers updates to the right screens at the right speed.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Control &#038; Data Flow</p>
<p>1. Operator UI opens WebSocket to `/ws/{client_id}`<br />
2. Client subscribes to `metrics_update`, `alerts_critical`, `alerts_warning`<br />
3. Background simulator collects psutil metrics every 500ms<br />
4. StreamManager throttles and batches; critical alerts bypass the queue<br />
5. Events broadcast to subscribed clients; offline clients queue for reconnect flush<br />
6. Dashboard polls REST fallback when WebSocket is down</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Connection Lifecycle</p>
<p data-ai-summary="true">A connection is never just &#8220;on&#8221; or &#8220;off&#8221; in production. It moves through distinct states — connecting, streaming, offline, syncing queued events, and degraded when backends are under stress. Your UI should reflect each state so operators know whether they are looking at live data or a cached fallback.</p>
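<p>One way to make these states explicit is a small state machine. The sketch below is an illustration only: the lesson names the states, while the transition table, `ConnState`, `transition`, and `is_live` are assumptions.</p>

```python
from enum import Enum


class ConnState(Enum):
    CONNECTING = "connecting"
    STREAMING = "streaming"
    OFFLINE = "offline"
    SYNCING = "syncing"
    DEGRADED = "degraded"


# Assumed edges: the lesson lists the states but not the allowed transitions.
ALLOWED = {
    ConnState.CONNECTING: {ConnState.STREAMING, ConnState.OFFLINE},
    ConnState.STREAMING: {ConnState.OFFLINE, ConnState.DEGRADED},
    ConnState.OFFLINE: {ConnState.CONNECTING, ConnState.SYNCING},
    ConnState.SYNCING: {ConnState.STREAMING, ConnState.OFFLINE},
    ConnState.DEGRADED: {ConnState.STREAMING, ConnState.OFFLINE},
}


def transition(current, target):
    """Return the new state, or raise if the edge is not in ALLOWED."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def is_live(state):
    """True only when the operator is looking at live data."""
    return state is ConnState.STREAMING


state = ConnState.OFFLINE
state = transition(state, ConnState.SYNCING)    # reconnected, flushing queue
state = transition(state, ConnState.STREAMING)  # queue drained
print(state.value, is_live(state))
```

<p>Rejecting illegal edges early makes it harder for the UI to show a "live" badge while the socket is actually down.</p>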
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Building Each Layer</p>
<p data-ai-summary="true">The sections below map directly to the seven capabilities listed above. Work through them in order — each layer depends on the one before it.</p>
<p data-ai-summary="true">### 1. Connection Hub</p>
<p data-ai-summary="true">The connection manager (`backend/app/services/connection_manager.py`) is the front door. Every client that wants live data must pass through it.</p>
<p>What it does:<br />
- Issues and verifies JWT tokens on WebSocket connect<br />
- Keeps a registry of every active client and which room they belong to<br />
- Handles `join_room` / `leave_room` with broadcast notifications to other members<br />
- Tracks reconnection counts and messages-per-second throughput</p>
<p data-ai-summary="true">**REST endpoints**</p>
<p>| Endpoint | Purpose |<br />
|----------|---------|<br />
| `POST /api/v1/connections/login` | Issue JWT for WebSocket auth |<br />
| `GET /api/v1/connections/stats` | Active connections, rooms, msg/s |<br />
| `GET /api/v1/connections/list` | Session table for the UI |</p>
<p data-ai-summary="true">**WebSocket protocol** — connect to `WS /ws/{client_id}?token=...`</p>
<p>| Client sends | Server replies |<br />
|--------------|----------------|<br />
| `ping` | `pong` |<br />
| `subscribe` with `{topics:[]}` | `subscribed` |<br />
| `join_room` | `room_joined` |</p>
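<p>The request and reply pairs in the table can be sketched as a small dispatcher. This is an illustration under assumptions: frames are taken to be JSON objects with a `type` key, and the `session` dict shape, the `room` field, and the error reply are hypothetical rather than the project's wire format.</p>

```python
import json


def handle_message(session, raw):
    """Map one client frame to a reply dict, updating `session` in place."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "ping":
        return {"type": "pong"}
    if kind == "subscribe":
        # Remember the topics so broadcasts can be filtered per client.
        session["topics"].update(msg.get("topics", []))
        return {"type": "subscribed", "topics": sorted(session["topics"])}
    if kind == "join_room":
        session["room"] = msg.get("room")
        return {"type": "room_joined", "room": session["room"]}
    return {"type": "error", "detail": f"unknown message type: {kind}"}


session = {"topics": set(), "room": None}
print(handle_message(session, '{"type": "ping"}'))
print(handle_message(session, '{"type": "subscribe", "topics": ["metrics_update"]}'))
print(handle_message(session, '{"type": "join_room", "room": "ops"}'))
```

<p>Keeping the handler a pure function of session state and input makes it easy to unit test without opening a socket.</p>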
<p data-ai-summary="true">Check it works:</p>
<p>```bash<br />
curl -s http://localhost:5050/api/v1/connections/stats | python3 -m json.tool<br />
```</p>
<p data-ai-summary="true">You should see `active_connections` and `messages_per_second` fields in the response.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">### 2. Data Streams</p>
<p data-ai-summary="true">Once clients are connected, they need something to stream. The StreamManager handles topic subscriptions and delivery pacing.</p>
<p>What it does:<br />
- Lets each client subscribe to specific topics (`metrics_update`, `alerts_critical`, etc.)<br />
- Batches outgoing messages on a 100ms flush interval so you do not flood the browser<br />
- Lets critical alerts jump the queue — warnings can wait, outages cannot<br />
- Tracks gzip compression savings for bandwidth reporting</p>
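<p>The batching and bypass behaviour listed above can be sketched as follows. This is a minimal illustration, not the real `StreamManager`: the class name `ThrottledBatcher`, its methods, and the explicit `now` argument (used to keep the example deterministic) are assumptions.</p>

```python
class ThrottledBatcher:
    """Buffer normal messages and flush at most once per interval.

    Critical messages skip the buffer and are sent immediately.
    """

    def __init__(self, send, flush_interval=0.1):
        self._send = send              # callable that receives a list of messages
        self._interval = flush_interval
        self._buffer = []
        self._last_flush = 0.0

    def publish(self, message, now, critical=False):
        if critical:
            self._send([message])      # bypass the throttle entirely
            return
        self._buffer.append(message)
        self.maybe_flush(now)

    def maybe_flush(self, now):
        # Flush only when something is buffered and the interval has elapsed.
        if self._buffer and now - self._last_flush >= self._interval:
            self._send(self._buffer)
            self._buffer = []
            self._last_flush = now


sent = []
batcher = ThrottledBatcher(sent.append, flush_interval=0.1)
batcher.publish({"cpu": 20}, now=0.00)
batcher.publish({"cpu": 22}, now=0.05)
batcher.publish({"alert": "db down"}, now=0.06, critical=True)
batcher.publish({"cpu": 25}, now=0.12)
print(sent)
```

<p>In a running service the caller would pass `time.monotonic()` as `now` and also call `maybe_flush` from a periodic task, so a quiet stream still drains its buffer.</p>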
<p data-ai-summary="true">The metric collector uses `psutil` to read real CPU, memory, and disk numbers from the host machine and pushes them into a rolling history buffer. No mock data — the numbers on your dashboard are live readings from your own machine.</p>
<p data-ai-summary="true">**REST endpoints**</p>
<p>| Endpoint | Purpose |<br />
|----------|---------|<br />
| `GET /api/v1/streaming/metrics/current` | Live snapshot |<br />
| `GET /api/v1/streaming/alerts` | Alert broadcast history |<br />
| `POST /api/v1/streaming/alerts` | Create and broadcast alert |</p>
<p data-ai-summary="true">Check it works:</p>
<p>```bash<br />
curl -s http://localhost:5050/api/v1/streaming/metrics/current<br />
```</p>
<p data-ai-summary="true">Expected output looks like: `{"cpu_percent": 24.5, "memory_percent": 58.2, ...}` — values will differ on your machine, but they should not all be zero once the stack is running.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">### 3. Event Protocol</p>
<p data-ai-summary="true">Raw WebSocket messages become hard to manage at scale. The event protocol wraps every message in a standard envelope: `version`, `type`, `id`, `timestamp`, `payload`, `metadata`.</p>
<p>What it does:<br />
- **SyncEngine** — detects version conflicts and resolves them with last-write-wins<br />
- **OfflineQueue** — stores pending events per client; drains them automatically on reconnect</p>
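<p>A compact sketch of the envelope, last-write-wins resolution, and a per-client offline queue. The envelope field names come from the text above; the function and class signatures, the use of `version` as the revision counter, the timestamp tie-break, and the queue size limit are assumptions for illustration.</p>

```python
import time
import uuid
from collections import defaultdict, deque


def make_envelope(event_type, payload, version=1, metadata=None):
    """Wrap a payload in the standard envelope fields."""
    return {
        "version": version,
        "type": event_type,
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "payload": payload,
        "metadata": metadata or {},
    }


def resolve_lww(current, incoming):
    """Last-write-wins: higher version wins; ties fall back to timestamp."""
    a = (current["version"], current["timestamp"])
    b = (incoming["version"], incoming["timestamp"])
    return incoming if b >= a else current


class OfflineQueue:
    """Hold pending envelopes per client until they reconnect."""

    def __init__(self, max_per_client=100):
        # Bounded deque: the oldest events drop first if a client stays away.
        self._queues = defaultdict(lambda: deque(maxlen=max_per_client))

    def enqueue(self, client_id, envelope):
        self._queues[client_id].append(envelope)

    def drain(self, client_id):
        """Return queued envelopes in order and clear the client's queue."""
        return list(self._queues.pop(client_id, []))


old = make_envelope("STATUS", {"service": "api", "state": "up"}, version=1)
new = make_envelope("STATUS", {"service": "api", "state": "down"}, version=2)
print(resolve_lww(old, new)["payload"]["state"])

queue = OfflineQueue()
queue.enqueue("test", old)
queue.enqueue("test", new)
print(len(queue.drain("test")), len(queue.drain("test")))
```

<p>Last-write-wins is simple but silently discards the losing update, which is why a conflict counter in the UI is useful for spotting how often that happens.</p>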
<p data-ai-summary="true">Check it works:</p>
<p>```bash<br />
curl -s -X POST http://localhost:5050/api/v1/events/publish \<br />
  -H 'Content-Type: application/json' \<br />
  -d '{"type":"STATUS","payload":{"service":"api"},"client_id":"test"}'<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">### 4. <span data-ai-definition="performance">Performance</span> Layer</p>
<p data-ai-summary="true">When hundreds of operators connect at once, you need backpressure. The <span data-ai-definition="performance">performance</span> layer prevents the hub from collapsing under its own output.</p>
<p>What it does:<br />
- **RedisQueue** — priority queues (`critical`, `normal`, `low`) with in-memory fallback during tests<br />
- **CircuitBreaker** — per-subsystem breakers (`streaming`, `alerts`, `events`) that trip open when error rates spike<br />
- **PerformanceMonitor** — aggregates queue depth, memory usage, pool size, and delivery counters</p>
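<p>The breaker behaviour can be sketched as a three-state machine. Note the simplification: this version trips on a count of consecutive failures rather than an error rate, and the thresholds, method names, and explicit `now` argument are assumptions rather than the project's `CircuitBreaker` interface.</p>

```python
class CircuitBreaker:
    """Closed -> open after repeated failures -> half-open after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def allow(self, now):
        """Return True if a call may proceed at time `now` (seconds)."""
        if self.state == "open":
            if now - self.opened_at >= self.reset_timeout:
                self.state = "half_open"  # let a trial call through
                return True
            return False
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self, now):
        self.failures += 1
        # A failed trial call, or too many failures in a row, opens the breaker.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now


# One breaker per subsystem, matching the names in the list above.
breakers = {name: CircuitBreaker() for name in ("streaming", "alerts", "events")}
print({name: b.state for name, b in breakers.items()})
```

<p>Keeping one breaker per subsystem means a failing alert path can be shed without pausing metric streaming.</p>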
<p data-ai-summary="true">Check it works:</p>
<p>```bash<br />
curl -s http://localhost:5050/api/v1/performance/snapshot | python3 -m json.tool<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">### 5. Reliability Lab</p>
<p data-ai-summary="true">You cannot trust a real-time system until you have measured it under load. The load test harness simulates N simultaneous WebSocket connections, records p99 latency, and stores results in the `load_test_runs` table.</p>
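<p>For reference, p99 is the latency that 99% of requests stay at or under. Below is a minimal nearest-rank sketch; the function name is an assumption, and the harness may well use a different percentile method (for example, linear interpolation).</p>

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # 1 ms .. 100 ms, one sample each
print("p50:", percentile(latencies_ms, 50))
print("p99:", percentile(latencies_ms, 99))
```

<p>With only 25 connections, p99 is effectively the slowest sample, so larger runs give a more meaningful tail figure.</p>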
<p data-ai-summary="true">Check it works:</p>
<p>```bash<br />
curl -s -X POST http://localhost:5050/api/v1/testing/start \<br />
  -H 'Content-Type: application/json' \<br />
  -d '{"name":"guide_test","connections":25}'<br />
```</p>
<p data-ai-summary="true">Expected: `"status": "completed"` and `"success_rate": 100.0`.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">### 6. Live Console UI</p>
<p data-ai-summary="true">The React frontend uses a `RealtimeContext` provider that owns the WebSocket lifecycle for the entire app:</p>
<p>- Exponential backoff on reconnect (waits longer after each failed attempt)<br />
- Offline message queue that flushes automatically when the socket comes back<br />
- Auto-refresh toggle so operators can pause live updates without disconnecting</p>
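<p>The reconnect schedule can be sketched as a capped doubling delay. The real logic lives in the React provider; this is shown in Python, the backend's language, and the base delay, cap, and optional jitter are assumptions rather than the values `RealtimeContext` uses.</p>

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, jitter=False, rng=random.random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based)."""
    delay = min(cap, base * (2 ** attempt))
    if jitter:
        # Full jitter: pick a random point in [0, delay) to spread out clients.
        delay = delay * rng()
    return delay


print([backoff_delay(n) for n in range(6)])
```

<p>Jitter matters when many consoles lose the same backend at once: without it they all retry on the same schedule and arrive in synchronized waves.</p>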
<p data-ai-summary="true">**Console pages**</p>
<p>| Page | Route | What you will see |<br />
|------|-------|-------------------|<br />
| Command Center | `/` | Overview KPIs and live CPU chart |<br />
| Connections | `/connections` | Auth status, rooms, session table |<br />
| Data Streams | `/streaming` | Metric feed, alert panel, throttle stats |<br />
| Event Protocol | `/events` | Sync state and conflict counter |<br />
| <span data-ai-definition="performance">Performance</span> | `/performance` | Queue depth, memory, circuit breakers |<br />
| Reliability Lab | `/testing` | Load test controls and p99 results |<br />
| Live Console | `/console` | Connection status and offline queue |<br />
| Operations | `/operations` | E2E controls and degradation triggers |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">### 7. Operations Orchestration</p>
<p data-ai-summary="true">The RealtimeHub runs a background simulator that pushes metrics through the full throttle pipeline. Operations endpoints let you test the whole system end-to-end:</p>
<p>| Endpoint | What it does |<br />
|----------|--------------|<br />
| `POST /api/v1/operations/simulate` | Create demo alerts |<br />
| `POST /api/v1/operations/degrade` | Trip a circuit breaker |<br />
| `POST /api/v1/operations/recover` | Reset all breakers |<br />
| `POST /api/v1/operations/e2e` | Validate the full pipeline |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Key Files</p>
<p>| Path | Role |<br />
|------|------|<br />
| `backend/app/main.py` | FastAPI app, lifespan, WebSocket handler |<br />
| `backend/app/services/hub.py` | Orchestrator and background simulator |<br />
| `backend/app/services/connection_manager.py` | WS sessions, rooms, auth |<br />
| `backend/app/services/stream_manager.py` | Throttle, subscriptions, metrics |<br />
| `backend/app/services/event_manager.py` | Events, sync, offline queue |<br />
| `backend/app/services/performance_monitor.py` | Redis queue, circuit breakers |<br />
| `frontend/src/context/RealtimeContext.tsx` | WebSocket and offline state |<br />
| `frontend/src/components/Layout.tsx` | Sidebar navigation (8 sections) |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Getting the Project</p>
<p data-ai-summary="true">The runnable project lives in `infrawatch-realtime/`. If you need to regenerate it from scratch (for example after cloning only the lesson files), run the backup generator from the parent directory:</p>
<p>```bash<br />
./setup.sh generate<br />
```</p>
<p data-ai-summary="true">That recreates the full `infrawatch-realtime/` directory from embedded source files in the script. Verify the output with:</p>
<p>```bash<br />
./setup.sh verify<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Prerequisites</p>
<p>- Python 3.12 or newer<br />
- Node.js 22 or newer<br />
- Docker with Compose v2</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Build, Test and Demo</p>
<p data-ai-summary="true">All commands below run from inside `infrawatch-realtime/`:</p>
<p>```bash<br />
cd infrawatch-realtime<br />
chmod +x build.sh stop.sh<br />
```</p>
<p data-ai-summary="true">### Run tests only</p>
<p>```bash<br />
./build.sh test<br />
```</p>
<p data-ai-summary="true">This runs 9 backend pytest cases and 1 frontend vitest case, then builds the production frontend bundle. All tests must pass before you start the stack.</p>
<p data-ai-summary="true">### Start local development (without full Docker)</p>
<p data-ai-summary="true">Postgres and Redis run in Docker containers. The <span data-ai-definition="API">API</span> and UI run directly on your machine — faster iteration, easier debugging.</p>
<p>```bash<br />
./build.sh local<br />
```</p>
<p data-ai-summary="true">When healthy, you will see:</p>
<p>```<br />
Backend healthy on :5050<br />
Frontend on http://localhost:5220<br />
```</p>
<p data-ai-summary="true">### Start the full Docker stack</p>
<p data-ai-summary="true">Everything — backend, frontend, Postgres, Redis — runs inside containers. Use this when you want a clean, reproducible environment.</p>
<p>```bash<br />
./build.sh docker<br />
```</p>
<p data-ai-summary="true">### Run the functional demo</p>
<p data-ai-summary="true">The stack must already be running (`local` or `docker`):</p>
<p>```bash<br />
./build.sh demo<br />
```</p>
<p data-ai-summary="true">**Expected demo output:**</p>
<p>```<br />
{"status":"healthy","service":"infrawatch-realtime"}<br />
{"simulated":true,"alerts_created":2}<br />
Alerts: 5 | Events: 2<br />
{"passed": true, ...}<br />
Open http://localhost:5220<br />
```</p>
<p data-ai-summary="true">### Run everything in one shot</p>
<p>```bash<br />
./build.sh all<br />
```</p>
<p data-ai-summary="true">Runs tests, starts the local stack, and executes the demo automatically.</p>
<p data-ai-summary="true">### Stop all services</p>
<p>```bash<br />
./stop.sh<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Walking the Dashboard</p>
<p data-ai-summary="true">Open http://localhost:5220 and click through every sidebar section. KPIs should show non-zero values after you run the demo or click **Run Live Burst** on the Command Center.</p>
<p data-ai-summary="true">What to look for on each page:</p>
<p>- **Command Center** — live CPU chart updating, 8 KPI cards with real numbers<br />
- **Connections** — session table showing active WebSocket clients<br />
- **Data Streams** — metric history and alert feed with throttle statistics<br />
- **Event Protocol** — sync state panel and conflict counter<br />
- **<span data-ai-definition="performance">Performance</span>** — Redis queue depth, memory usage, circuit breaker status<br />
- **Reliability Lab** — run a load test and confirm p99 latency appears<br />
- **Live Console** — connection indicator (green = connected) and offline queue size<br />
- **Operations** — trigger degradation, then recovery, and watch the status change</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Docker &#038; Runtime</p>
<p data-ai-summary="true">### Without full Docker (local dev)</p>
<p data-ai-summary="true">Postgres and Redis run in Docker; <span data-ai-definition="API">API</span> and UI run on the host.</p>
<p>```bash<br />
cd infrawatch-realtime<br />
chmod +x build.sh stop.sh<br />
./build.sh local<br />
```</p>
<p data-ai-summary="true">### With Docker (reproducible)</p>
<p>```bash<br />
./build.sh docker<br />
```</p>
<p data-ai-summary="true">### Stop</p>
<p>```bash<br />
./stop.sh<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Success Criteria</p>
<p>- [ ] `./build.sh test` — 9 pytest + 1 vitest pass<br />
- [ ] `./build.sh local` — <span data-ai-definition="API">API</span> on :5050, UI on :5220<br />
- [ ] `./build.sh demo` — alerts > 0, events > 0, E2E passed<br />
- [ ] Command Center shows live CPU chart and 8 KPI cards<br />
- [ ] Connections page shows auth, rooms, session table<br />
- [ ] Data Streams shows metrics, alert feed, throttle stats<br />
- [ ] Event Protocol shows sync state and conflict counter<br />
- [ ] <span data-ai-definition="performance">Performance</span> shows queue depth, memory, circuit breakers<br />
- [ ] Reliability Lab runs load test with p99 results<br />
- [ ] Live Console shows connection status and offline queue<br />
- [ ] Operations triggers degradation and recovery</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Troubleshooting</p>
<p>| Problem | Fix |<br />
|---------|-----|<br />
| Port 5050 already in use | Run `./stop.sh`, then retry |<br />
| Postgres connection refused | `docker compose up -d postgres redis` from inside `infrawatch-realtime/` |<br />
| Dashboard shows all zeros | Run `./build.sh demo` or click **Run Live Burst** on the Command Center |<br />
| WebSocket shows Offline | Check `.logs/backend.log` inside `infrawatch-realtime/` |<br />
| Tests fail on first run | Make sure no other service is bound to ports 5050, 5220, 5480, or 6430 |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Assignment</p>
<p>1. Add a new subscription topic `deployments` that broadcasts deployment status changes<br />
2. Implement per-room message history (last 50 messages) with a UI panel on the Connections page<br />
3. Add a chaos test that simulates 200 rapid reconnects and reports success rate in Reliability Lab</p>
<p data-ai-summary="true">## Assignment Hints</p>
<p>- **Task 1:** Extend `StreamManager.subscribe()` topics in `backend/app/services/stream_manager.py`; add a broadcast route in `streaming.py`; subscribe from `RealtimeContext.tsx`<br />
- **Task 2:** Store room messages in a new `RoomMessage` model; expose `GET /api/v1/connections/rooms/{room}/history`<br />
- **Task 3:** Add `POST /api/v1/testing/chaos/reconnect` in `testing.py`; mirror the load harness pattern from `LoadTestHarness`</p>
</div>]]></content:encoded>
                                </item>
                <item>
            <title> - Hands-On Tutorial</title>
            <link></link>
            <comments>#respond</comments>
            <pubDate></pubDate>
            <dc:creator><![CDATA[system11]]></dc:creator>
                        <guid isPermaLink="false"></guid>
            <description><![CDATA[## Lesson 7 : Notification Dispatch Control Plane ## What We Are Building InfraWatch **Notify** — a production notification dispatch console modeled after PagerDuty and Opsgenie: 1. **Multi-channel delivery** —... Hands-On System Design tutorial with practical examples and real-world applications.]]></description>
            <content:encoded><![CDATA[<div class="lesson-rss-content"><h3>Hands-On System Design Tutorial</h3><p data-ai-summary="true">## Lesson 7 : Notification Dispatch Control Plane</p>
<p data-ai-summary="true">## What We Are Building</p>
<p data-ai-summary="true">InfraWatch **Notify** — a production notification dispatch console modeled after PagerDuty and Opsgenie:</p>
<p>1. **Multi-channel delivery** — Email, SMS, Slack, webhook, and push providers behind a unified interface<br />
2. **Template engine** — Jinja2 rendering with per-channel formats and preview<br />
3. **Delivery pipeline** — Redis priority queue, background worker, retry with backoff, rate limiting<br />
4. **Preference routing** — Severity-based channel selection, quiet hours, suppression rules<br />
5. **Reliability layer** — Circuit breakers, delivery tracking timeline, integration test harness<br />
6. **Operator console** — Dark-theme dashboard with live WebSocket feed and channel health  </p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Core Concept: Alerts Are Inputs, Notifications Are Outputs</p>
<p data-ai-summary="true">PagerDuty does not blast every engineer for every metric spike. It routes **alert events** through **user preferences**, **escalation policies**, and **channel capacity** before a human ever sees a ping. The non-obvious insight: **notification delivery is its own distributed system** — separate from alert detection, separate from metric storage.</p>
<p data-ai-summary="true">A firing alert is a signal. A notification is a **delivery attempt** with its own lifecycle: queued → processing → delivered or failed → retry. Datadog and Grafana solve the &#8220;what broke&#8221; problem; Notify solves &#8220;who gets told, how, and did it actually arrive.&#8221;</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Where This Fits</p>
<p>```<br />
Alert Management          decides something needs attention<br />
        │<br />
Notification Dispatch     ◀ this lesson<br />
        │<br />
Human Response            on-call engineer acts<br />
```</p>
<p data-ai-summary="true">Notify sits downstream of alert systems (like InfraWatch Alerts). It assumes alerts arrive with severity, service context, and a message — then handles the hard part of reliable multi-channel fan-out.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Component Architecture</p>
<p>**Orchestrator** — Receives alerts, resolves recipients, renders templates, enqueues delivery jobs.<br />
**Preference Router** — Applies quiet hours, severity-channel matrix, and global opt-out.<br />
**Queue Manager** — Redis-backed priority queue (sorted set); CRITICAL jobs dequeue before LOW.<br />
**Delivery Worker** — Background thread processes batches, records events, handles retries.<br />
**Channel Providers** — Pluggable senders with validation and latency tracking.<br />
**WebSocket Hub** — Pushes delivery state changes to the operator console.</p>
<p>| Service | Port | Role |<br />
|---------|------|------|<br />
| FastAPI backend | 5030 | REST <span data-ai-definition="API">API</span>, WebSocket, workers |<br />
| React frontend | 5200 | Operator console |<br />
| PostgreSQL | 5460 | Alerts, notifications, preferences, templates |<br />
| Redis | 6410 | Queue, circuit breaker, rate limits |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Control &#038; Data Flow</p>
<p data-ai-summary="true">Alert POST → preference check per user → template render per channel → idempotency key → enqueue → worker dequeue → circuit breaker check → channel send → delivery event logged → WebSocket broadcast → dashboard card refresh.</p>
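<p>The idempotency-key step in this flow can be sketched as follows. The lesson states that the key is a SHA-256 of `alert_id + user_id + channel`; the `:` separator and the `DedupStore` class are assumptions for illustration (the real project may enforce uniqueness with a database constraint or a Redis key instead).</p>

```python
import hashlib


def idempotency_key(alert_id: str, user_id: str, channel: str) -> str:
    """SHA-256 over alert, user, and channel.

    The ":" separator is an assumption; it keeps ("ab", "c") and
    ("a", "bc") from producing the same input string.
    """
    raw = f"{alert_id}:{user_id}:{channel}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DedupStore:
    """In-memory stand-in for a unique column or a Redis SET NX check."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def claim(self, key: str) -> bool:
        """Return True the first time a key is seen, False afterwards."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


store = DedupStore()
key = idempotency_key("alert-42", "user-7", "SMS")
first = store.claim(key)   # first attempt: enqueue the notification
retry = store.claim(key)   # retry of the same delivery: skip
other = store.claim(idempotency_key("alert-42", "user-7", "EMAIL"))
print(first, retry, other)  # True False True
```

<p>A retry of the same alert, user, and channel is rejected, while the same alert on a different channel still goes through.</p>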
<p data-ai-summary="true">The escalation worker polls unacknowledged alerts. After the configured timeout, it bumps the escalation level and re-triggers fan-out to the next tier, the same pattern as Opsgenie's tiered paging.</p>
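<p>One polling pass of such an escalation worker might look like the sketch below. The dictionary fields (`level`, `last_notified_at`), the `max_level` cap, and the function name `escalate_due` are hypothetical; the lesson's worker reads alerts from PostgreSQL and re-triggers fan-out, which is omitted here.</p>

```python
def escalate_due(alerts: list[dict], now: float, timeout_s: float, max_level: int) -> list[str]:
    """Bump every unacknowledged alert whose timeout has elapsed.

    Returns the ids that were escalated so the caller can re-run fan-out.
    """
    escalated = []
    for alert in alerts:
        if alert["state"] not in ("NOTIFIED", "ESCALATED"):
            continue  # NEW, ACKNOWLEDGED, and RESOLVED alerts are skipped
        waited = now - alert["last_notified_at"]
        if waited >= timeout_s and max_level > alert["level"]:
            alert["level"] += 1
            alert["state"] = "ESCALATED"
            alert["last_notified_at"] = now  # restart the timer for the next tier
            escalated.append(alert["id"])
    return escalated


alerts = [
    {"id": "a1", "state": "NOTIFIED", "level": 0, "last_notified_at": 0.0},
    {"id": "a2", "state": "ACKNOWLEDGED", "level": 0, "last_notified_at": 0.0},
    {"id": "a3", "state": "NOTIFIED", "level": 0, "last_notified_at": 250.0},
]
first_pass = escalate_due(alerts, now=300.0, timeout_s=300.0, max_level=2)
second_pass = escalate_due(alerts, now=600.0, timeout_s=300.0, max_level=2)
print(first_pass, second_pass)  # ['a1'] ['a1', 'a3']
```

<p>The acknowledged alert is never escalated, and each bump restarts the timer so the next tier gets its own full timeout.</p>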
<p data-ai-summary="true">When you trace a single alert through the code, this is the path it follows:</p>
<p>1. Operator or <span data-ai-definition="API">API</span> client POSTs an alert to `/api/v1/alerts/`.<br />
2. Orchestrator loads all user preferences.<br />
3. `PreferenceService.should_send()` checks global enable, quiet hours, and severity exceptions.<br />
4. Channels resolved per severity (user override or defaults).<br />
5. Jinja2 template rendered per channel (`alert_email`, `alert_sms`, etc.).<br />
6. Idempotency key prevents duplicate delivery for same alert + user + channel.<br />
7. Notification enqueued in Redis priority queue.<br />
8. Delivery worker dequeues, checks circuit breaker and rate limit, sends via channel provider.<br />
9. Delivery events logged; WebSocket broadcasts status to dashboard.</p>
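<p>Step 3 above can be sketched as a pure function. This is not the lesson's `PreferenceService.should_send()` itself; the parameter names are hypothetical, and the ordering (global opt-out wins, then CRITICAL bypasses quiet hours) is an assumption based on the description. The demo uses a fixed UTC-5 offset so it runs anywhere; `zoneinfo.ZoneInfo` would additionally handle daylight saving.</p>

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, tz: tzinfo, start: time, end: time) -> bool:
    """True when the user's local time falls inside the quiet window."""
    local = now_utc.astimezone(tz).time()
    if end > start:
        # Same-day window, e.g. 13:00 to 15:00.
        return local >= start and end > local
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return local >= start or end > local


def should_send(enabled: bool, severity: str, now_utc: datetime,
                tz: tzinfo, start: time, end: time) -> bool:
    """Global opt-out first, then the CRITICAL exception, then quiet hours."""
    if not enabled:
        return False
    if severity == "CRITICAL":
        return True
    return not in_quiet_hours(now_utc, tz, start, end)


eastern = timezone(timedelta(hours=-5))                      # fixed offset for the demo
night = datetime(2025, 1, 15, 4, 30, tzinfo=timezone.utc)   # 23:30 local time
noon = datetime(2025, 1, 15, 17, 0, tzinfo=timezone.utc)    # 12:00 local time
quiet = (time(22, 0), time(7, 0))

low_at_night = should_send(True, "LOW", night, eastern, *quiet)
critical_at_night = should_send(True, "CRITICAL", night, eastern, *quiet)
low_at_noon = should_send(True, "LOW", noon, eastern, *quiet)
opted_out = should_send(False, "CRITICAL", night, eastern, *quiet)
print(low_at_night, critical_at_night, low_at_noon, opted_out)
```

<p>Converting to the user's local time before comparing is what makes the check timezone-aware: the same UTC instant can be inside quiet hours for one user and outside for another.</p>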
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Notification State Machine</p>
<p data-ai-summary="true">Each notification record tracks its own state independently of the parent alert.</p>
<p>| State | Meaning |<br />
|-------|---------|<br />
| QUEUED | Accepted, waiting in priority queue |<br />
| PROCESSING | Worker picked up the job |<br />
| DELIVERED | Channel provider confirmed send |<br />
| FAILED | Exhausted retries or circuit open |<br />
| SUPPRESSED | Quiet hours or user opt-out |</p>
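<p>The table above can be expressed as an explicit transition map. The map below is an assumed reading of the lesson: a retry is modelled as PROCESSING going back to QUEUED, and SUPPRESSED is treated as a terminal state assigned at creation time. `ALLOWED` and `transition` are hypothetical names.</p>

```python
from enum import Enum


class NotificationState(str, Enum):
    QUEUED = "QUEUED"
    PROCESSING = "PROCESSING"
    DELIVERED = "DELIVERED"
    FAILED = "FAILED"
    SUPPRESSED = "SUPPRESSED"


# Assumed transition map; terminal states have no outgoing edges.
ALLOWED = {
    NotificationState.QUEUED: {NotificationState.PROCESSING},
    NotificationState.PROCESSING: {
        NotificationState.DELIVERED,
        NotificationState.FAILED,
        NotificationState.QUEUED,  # retry with backoff re-enqueues the job
    },
    NotificationState.DELIVERED: set(),
    NotificationState.FAILED: set(),
    NotificationState.SUPPRESSED: set(),
}


def transition(current: NotificationState, target: NotificationState) -> NotificationState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


state = NotificationState.QUEUED
history = [state.value]
for target in (NotificationState.PROCESSING, NotificationState.QUEUED,
               NotificationState.PROCESSING, NotificationState.DELIVERED):
    state = transition(state, target)
    history.append(state.value)

try:
    transition(state, NotificationState.QUEUED)
    rejected = False
except ValueError:
    rejected = True
print(history, rejected)
```

<p>Rejecting illegal moves in one place makes a broken state transition show up as an exception rather than as a silently wrong row.</p>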
<p data-ai-summary="true">The parent alert moves through its own states separately:</p>
<p>| State | Meaning |<br />
|-------|---------|<br />
| NEW | Alert created, not yet fanned out |<br />
| NOTIFIED | Notifications dispatched |<br />
| ACKNOWLEDGED | Operator acknowledged |<br />
| ESCALATED | Escalation worker bumped tier |<br />
| RESOLVED | Alert closed |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Design Notes Worth Remembering</p>
<p data-ai-summary="true">These patterns show up in real on-call systems. You implemented smaller versions of each one:</p>
<p>&#8211; **Alerts are inputs, notifications are outputs** — Each delivery attempt has its own state machine, independent of the parent alert. PagerDuty and Opsgenie work the same way.<br />
&#8211; **Idempotency keys** — SHA-256 of `alert_id + user_id + channel` stops duplicate pages during retries.<br />
&#8211; **Priority queue** — CRITICAL jobs dequeue before LOW via Redis sorted-set scores.<br />
&#8211; **Circuit breaker** — After three consecutive channel failures, Redis key `notify:circuit:{CHANNEL}` blocks attempts for 30 seconds.<br />
&#8211; **Quiet hours** — Timezone-aware suppression with severity exceptions (CRITICAL always delivers).<br />
&#8211; **Webhook receiver** — Built-in `POST /api/v1/channels/webhook-receiver` lets you test webhooks locally without external services.<br />
&#8211; **No <span data-ai-definition="API">API</span> keys required** — Channel providers validate and dispatch in development. Set `SECRET_KEY` via `.env` in production. <span data-ai-definition="database">Database</span> credentials are local Docker defaults only.</p>
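<p>The circuit-breaker note above can be sketched in memory. The thresholds (three consecutive failures, 30 seconds) come from the lesson; the class, its method names, and the injectable clock are hypothetical. In the real project the open state lives in the Redis key `notify:circuit:{CHANNEL}`, presumably with an expiry, rather than in a Python dictionary.</p>

```python
import time


class CircuitBreaker:
    """Per-channel breaker that opens after N consecutive failures."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures: dict[str, int] = {}
        self._open_until: dict[str, float] = {}

    def allow(self, channel: str) -> bool:
        """False while the channel's block window has not expired."""
        return self._clock() >= self._open_until.get(channel, 0.0)

    def record_success(self, channel: str) -> None:
        # Any success breaks the run of consecutive failures.
        self._failures[channel] = 0

    def record_failure(self, channel: str) -> None:
        count = self._failures.get(channel, 0) + 1
        self._failures[channel] = count
        if count >= self.threshold:
            # Comparable to setting notify:circuit:{CHANNEL} with a 30 s expiry.
            self._open_until[channel] = self._clock() + self.cooldown_s
            self._failures[channel] = 0


now = [0.0]  # fake clock so the demo does not sleep
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure("SMS")

blocked = breaker.allow("SMS")      # circuit open
email_ok = breaker.allow("EMAIL")   # other channels are unaffected
now[0] = 31.0
recovered = breaker.allow("SMS")    # cooldown elapsed
print(blocked, email_ok, recovered)  # False True True
```

<p>Keeping one breaker state per channel means a failing SMS provider does not stop email or Slack deliveries.</p>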
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Prerequisites</p>
<p data-ai-summary="true">Before you start, confirm you have:</p>
<p>&#8211; Python 3.12 or newer<br />
&#8211; Node.js 22 or newer<br />
&#8211; Docker with Compose v2</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Build, Test, and Run</p>
<p data-ai-summary="true">All commands run from inside `infrawatch-notify/`.</p>
<p data-ai-summary="true">### Step 1 — Install dependencies</p>
<p>```bash<br />
cd infrawatch-notify<br />
chmod +x build.sh stop.sh cleanup.sh</p>
<p>python3 -m venv backend/.venv<br />
source backend/.venv/bin/activate<br />
pip install -r requirements-dev.txt</p>
<p>cd frontend &#038;&#038; npm install &#038;&#038; cd ..<br />
```</p>
<p data-ai-summary="true">### Step 2 — Run tests</p>
<p>```bash<br />
./build.sh test<br />
```</p>
<p data-ai-summary="true">You should see 8 backend tests pass, one frontend smoke test pass, and a successful Vite production build.</p>
<p data-ai-summary="true">To run tests individually:</p>
<p>```bash<br />
# Backend<br />
cd backend<br />
export PYTHONPATH=.<br />
export TESTING=1<br />
export DATABASE_URL="sqlite://"<br />
pytest -q --tb=short</p>
<p># Frontend<br />
cd ../frontend<br />
npm run test -- --run<br />
npm run build<br />
```</p>
<p data-ai-summary="true">### Step 3 — Start the stack (local mode)</p>
<p>```bash<br />
./build.sh local<br />
```</p>
<p data-ai-summary="true">Docker starts Postgres and Redis. The <span data-ai-definition="API">API</span> and UI run on your machine.</p>
<p data-ai-summary="true">Expected output:</p>
<p>```<br />
[build] Postgres ready<br />
<span data-ai-definition="database">database</span> seeded<br />
[build] Backend healthy on :5030<br />
[build] Frontend on http://localhost:5200<br />
```</p>
<p data-ai-summary="true">Open http://localhost:5200 in your browser.</p>
<p data-ai-summary="true">### Step 4 — Run the demo</p>
<p data-ai-summary="true">With the stack running:</p>
<p>```bash<br />
./build.sh demo<br />
```</p>
<p data-ai-summary="true">Or do everything in one pass:</p>
<p>```bash<br />
./build.sh all<br />
```</p>
<p data-ai-summary="true">The demo fires sample alerts, waits for the delivery worker, and prints notification stats. Delivered count should be greater than zero.</p>
<p data-ai-summary="true">### Step 5 — Stop and clean up</p>
<p>```bash<br />
./stop.sh        # stop processes and containers<br />
./cleanup.sh     # remove artifacts and prune Docker<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Docker</p>
<p data-ai-summary="true">### Hybrid mode (what `./build.sh local` uses)</p>
<p data-ai-summary="true">Only <span data-ai-definition="database">database</span> services run in Docker:</p>
<p>```bash<br />
docker compose up -d postgres redis<br />
```</p>
<p data-ai-summary="true">The app connects via `localhost:5460` (Postgres) and `localhost:6410` (Redis).</p>
<p data-ai-summary="true">### Full stack in containers</p>
<p>```bash<br />
./build.sh docker<br />
```</p>
<p>| Container | Image | Host port |<br />
|-----------|-------|-----------|<br />
| postgres | postgres:16-alpine | 5460 |<br />
| redis | redis:7-alpine | 6410 |<br />
| backend | backend/Dockerfile | 5030 |<br />
| frontend | frontend/Dockerfile (nginx) | 5200 |</p>
<p data-ai-summary="true">Frontend nginx proxies `/api` and `/ws` to the backend container.</p>
<p data-ai-summary="true">Tear down:</p>
<p>```bash<br />
docker compose down -v --remove-orphans<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Working with the <span data-ai-definition="API">API</span></p>
<p data-ai-summary="true">Base URL: `http://localhost:5030`</p>
<p data-ai-summary="true">### Health check</p>
<p>```bash<br />
curl -sS http://localhost:5030/api/health | python3 -m json.tool<br />
```</p>
<p data-ai-summary="true">Expected:</p>
<p>```json<br />
{"status": "healthy", "service": "infrawatch-notify"}<br />
```</p>
<p data-ai-summary="true">### Create an alert (triggers fan-out)</p>
<p>```bash<br />
curl -sS -X POST http://localhost:5030/api/v1/alerts/ \<br />
  -H "Content-Type: application/json" \<br />
  -d '{<br />
    "service_name": "api-gateway",<br />
    "severity": "CRITICAL",<br />
    "title": "CPU saturation",<br />
    "message": "CPU at 96% on prod-us-east-1"<br />
  }'<br />
```</p>
<p data-ai-summary="true">### Simulate a dispatch burst</p>
<p>```bash<br />
curl -sS -X POST http://localhost:5030/api/v1/alerts/simulate | python3 -m json.tool<br />
```</p>
<p data-ai-summary="true">### Check notification stats</p>
<p>```bash<br />
curl -sS http://localhost:5030/api/v1/notifications/stats | python3 -m json.tool<br />
```</p>
<p data-ai-summary="true">### Test all five channels</p>
<p>```bash<br />
curl -sS -X POST http://localhost:5030/api/v1/testing/integration | python3 -m json.tool<br />
```</p>
<p data-ai-summary="true">### Endpoint reference</p>
<p>| Method | Path | Purpose |<br />
|--------|------|---------|<br />
| GET | `/api/health` | Health check |<br />
| POST/GET | `/api/v1/alerts/` | Create and list alerts |<br />
| POST | `/api/v1/alerts/simulate` | Demo alert burst |<br />
| POST | `/api/v1/alerts/{id}/acknowledge` | Acknowledge alert |<br />
| POST | `/api/v1/alerts/{id}/resolve` | Resolve alert |<br />
| GET | `/api/v1/notifications/` | List notifications |<br />
| GET | `/api/v1/notifications/stats` | Dashboard aggregates |<br />
| GET | `/api/v1/notifications/tracking/{id}` | Delivery event timeline |<br />
| GET/PUT | `/api/v1/preferences/{user_id}` | User delivery preferences |<br />
| GET | `/api/v1/templates/` | List message templates |<br />
| POST | `/api/v1/templates/render` | Preview template |<br />
| GET | `/api/v1/channels/status` | Per-channel health |<br />
| POST | `/api/v1/channels/test` | Send test notification |<br />
| POST | `/api/v1/testing/integration` | All-channel integration test |<br />
| WS | `/ws/notifications` | Live delivery updates |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Dashboard Walkthrough</p>
<p data-ai-summary="true">Open http://localhost:5200 after `./build.sh local`.</p>
<p>| Section | What to look for |<br />
|---------|------------------|<br />
| Command Center | Stats cards, hourly trend, channel health |<br />
| Dispatch Simulator | Click **Run Dispatch Burst** to fire sample alerts |<br />
| Notification Feed | Live table — refreshes via WebSocket |<br />
| Alert Inbox | Acknowledge and resolve alerts |<br />
| Preferences | Toggle quiet hours per on-call user |<br />
| Templates | View Jinja2 templates per channel |<br />
| Channel Tester | Send a test to any channel |</p>
<p data-ai-summary="true">Metrics stay at zero until alerts are dispatched. Click **Run Dispatch Burst** or run the simulate curl command, wait 3–5 seconds for the delivery worker, then refresh.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Success Criteria</p>
<p data-ai-summary="true">After running `./build.sh all`:</p>
<p>&#8211; Dashboard shows **non-zero** total and delivered notification counts<br />
&#8211; **Run Dispatch Burst** creates alerts and fans out across all five channels<br />
&#8211; Channel health panel shows success/failure counters updating<br />
&#8211; Preferences page toggles quiet hours live<br />
&#8211; Integration test reports **5/5 channels passed**  </p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Troubleshooting</p>
<p>| Problem | What to try |<br />
|---------|-------------|<br />
| Dashboard shows zero notifications | Run simulate curl or click Dispatch Burst, wait 5 seconds |<br />
| Backend not healthy | Check `.logs/backend.log` or `docker compose logs backend` |<br />
| Webhook channel fails | <span data-ai-definition="API">API</span> must be running; webhook targets `localhost:5030/api/v1/channels/webhook-receiver` |<br />
| Port already in use | Change ports in `docker-compose.yml` and `.env` |<br />
| Stale containers or old builds | Run `./cleanup.sh` and start again |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Assignment</p>
<p data-ai-summary="true">Add a **digest mode** preference: when enabled, non-CRITICAL notifications batch into a single email every 15 minutes instead of immediate delivery.</p>
<p>**Steps:**<br />
1. Add `digest_enabled` and `digest_interval_minutes` fields to `UserPreference`<br />
2. Modify the orchestrator to queue digest-eligible notifications separately<br />
3. Create a periodic task that renders a combined Jinja2 template and sends one email<br />
4. Add a toggle on the Preferences UI page<br />
5. Write a test proving three MEDIUM alerts produce one digest email  </p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Assignment Solution Hints</p>
<p>&#8211; Store digest candidates in a Redis sorted set keyed by `user_id`, scored by timestamp<br />
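&#8211; The digest worker should select due entries by score (for example `ZRANGEBYSCORE` up to the cutoff timestamp, then `ZREM`); `ZPOPMIN` alone pops the oldest entry whether or not its interval has elapsed, so check the score before sending<br />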
&#8211; Reuse `alert_email` template with a `{% for alert in alerts %}` loop<br />
&#8211; CRITICAL severity must bypass digest regardless of preference — check severity before enqueue<br />
&#8211; Test: create 3 alerts with `severity=MEDIUM`, wait for digest interval, assert `Notification` count for EMAIL equals 1  </p>
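<p>The hints above can be prototyped in memory before wiring in Redis and Jinja2. This is a sketch under stated assumptions: `route`, `DigestBuffer`, and `render_digest` are hypothetical names, the buffer stands in for one Redis sorted set per user scored by timestamp, and the plain string join stands in for the `alert_email` template loop.</p>

```python
from collections import defaultdict


def route(alert: dict, digest_enabled: bool) -> str:
    """CRITICAL always bypasses the digest, whatever the preference says."""
    if alert["severity"] == "CRITICAL" or not digest_enabled:
        return "immediate"
    return "digest"


class DigestBuffer:
    """In-memory stand-in for one timestamp-scored sorted set per user."""

    def __init__(self) -> None:
        self._pending: dict[str, list[tuple[float, dict]]] = defaultdict(list)

    def add(self, user_id: str, alert: dict, ts: float) -> None:
        self._pending[user_id].append((ts, alert))

    def flush_due(self, user_id: str, cutoff: float) -> list[dict]:
        """Remove and return alerts queued at or before the cutoff, oldest first."""
        entries = sorted(self._pending[user_id], key=lambda e: e[0])
        due = [alert for ts, alert in entries if cutoff >= ts]
        self._pending[user_id] = [(ts, alert) for ts, alert in entries if ts > cutoff]
        return due


def render_digest(alerts: list[dict]) -> str:
    lines = [f"[{a['severity']}] {a['title']}" for a in alerts]
    return f"{len(alerts)} alerts in this digest\n" + "\n".join(lines)


buffer = DigestBuffer()
emails = []
incoming = [
    {"severity": "MEDIUM", "title": "Disk 81%"},
    {"severity": "MEDIUM", "title": "Disk 84%"},
    {"severity": "MEDIUM", "title": "Disk 88%"},
    {"severity": "CRITICAL", "title": "Disk full"},
]
for i, alert in enumerate(incoming):
    if route(alert, digest_enabled=True) == "digest":
        buffer.add("user-7", alert, ts=float(i))
    else:
        emails.append(alert["title"])  # sent immediately

batch = buffer.flush_due("user-7", cutoff=900.0)  # 15 minutes later
if batch:
    emails.append(render_digest(batch))
print(len(emails))  # 2: one immediate CRITICAL email, one digest
```

<p>Three MEDIUM alerts collapse into a single digest email, while the CRITICAL alert is sent on its own without waiting for the interval.</p>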
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Practical Takeaway</p>
<p data-ai-summary="true">You built the layer that turns infrastructure signals into human action. The queue, retry logic, and circuit breaker are the same patterns Twilio, SendGrid, and PagerDuty use at scale — just smaller. Next time an on-call page fails silently, you will know exactly which state transition broke.</p>
</div>]]></content:encoded>
                                </item>
                <item>
            <title> - Hands-On Tutorial</title>
            <link></link>
            <comments>#respond</comments>
            <pubDate></pubDate>
            <dc:creator><![CDATA[system11]]></dc:creator>
                        <guid isPermaLink="false"></guid>
            <description><![CDATA[## Lesson 6 : Alert Management Control Plane ## What We Are Building InfraWatch **Alerts** — a production alert management console modeled after PagerDuty and Datadog: 1. **Rule engine** —... Hands-On System Design tutorial with practical examples and real-world applications.]]></description>
            <content:encoded><![CDATA[<div class="lesson-rss-content"><h3>Hands-On System Design Tutorial</h3><p data-ai-summary="true">## Lesson 6 : Alert Management Control Plane</p>
<p data-ai-summary="true">## What We Are Building</p>
<p data-ai-summary="true">InfraWatch **Alerts** — a production alert management console modeled after PagerDuty and Datadog:</p>
<p>1. **Rule engine** — Threshold, anomaly, and expression-based rules with templates<br />
2. **Evaluation pipeline** — Sustained breach detection, deduplication, suppression windows<br />
3. **Lifecycle processing** — Severity scoring, escalation policies, auto-resolution<br />
4. **Query plane** — Search, statistics, bulk operations, CSV/JSON export<br />
5. **Operator console** — Dark-theme dashboard with live WebSocket updates<br />
6. **Notification routing** — Channel delivery with circuit-breaker protection  </p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Core Concept: Signal vs Noise</p>
<p data-ai-summary="true">Every monitoring platform ingests thousands of metric points per minute. The hard problem is not detection — it is **deciding which signals deserve human attention**. Datadog and PagerDuty solve this with pending timers (avoid flapping), fingerprint deduplication (one alert per root cause), and suppression windows (maintenance silence).</p>
<p data-ai-summary="true">The non-obvious insight: **alert state is a first-class entity**, not a side effect of rule evaluation. A rule fires; an alert instance is born with its own lifecycle. Operators acknowledge, escalate, and resolve *alerts* — not rules.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Where This Fits</p>
<p>```<br />
Metrics Collection          raw telemetry<br />
        │<br />
Background Processing       scheduled evaluation jobs<br />
        │<br />
Alert Management            ◀ this lesson<br />
        │<br />
Notifications               delivers outcomes to humans<br />
```</p>
<p data-ai-summary="true">Alerts sit between completed metric pipelines and human response. They assume metrics arrive reliably and workers can evaluate rules on schedule.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Component Architecture</p>
<p>**Ingest** — Agents POST metric samples; each sample triggers rule evaluation.<br />
**Evaluator** — Compares values against thresholds or Z-score anomaly models.<br />
**Processor** — Manages pending→firing transitions, severity scoring, history audit.<br />
**Query <span data-ai-definition="API">API</span>** — Powers search, statistics, and export for the operator console.<br />
**WebSocket** — Pushes state changes to the dashboard without polling.</p>
<p>| Service | Port | Role |<br />
|---------|------|------|<br />
| FastAPI backend | 5020 | REST <span data-ai-definition="API">API</span>, WebSocket, evaluation loop |<br />
| React frontend | 5190 | Operator console (Vite locally, nginx in Docker) |<br />
| PostgreSQL | 5450 | Rules, alerts, metrics, state history |<br />
| Redis | 6400 | Cache and future queue integration |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Control &#038; Data Flow</p>
<p data-ai-summary="true">Metric arrives → stored in PostgreSQL → evaluator checks all enabled rules → suppression filter → dedup fingerprint → create or update alert instance → pending timer → firing → notification router → WebSocket broadcast → dashboard card refresh.</p>
<p data-ai-summary="true">Step-by-step path your code follows:</p>
<p>1. Agent or simulator POSTs a sample to `/api/v1/metrics/ingest`.<br />
2. Sample lands in the `metric_samples` table.<br />
3. The evaluation engine loads every enabled rule.<br />
4. The suppression service checks maintenance windows and regex patterns.<br />
5. The evaluator compares the value against a threshold or Z-score anomaly model.<br />
6. An alert instance is created or updated — pending becomes firing after `for_duration_seconds`.<br />
7. The notification router logs delivery; WebSocket pushes the update.<br />
8. Dashboard stats cards and the alert table refresh.</p>
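<p>Step 5 above, the comparison against a threshold or a Z-score model, can be sketched as follows. The rule dictionary shape (`type`, `threshold`, `z_threshold`) and the function names are assumptions for illustration; the lesson's `evaluator.py` may store history and rule parameters differently.</p>

```python
import statistics


def z_score(history: list[float], value: float) -> float:
    """Distance of value from the history mean, in standard deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # Flat history: any different value is infinitely unusual.
        return 0.0 if value == mean else float("inf")
    return (value - mean) / stdev


def breaches(rule: dict, history: list[float], value: float) -> bool:
    """Evaluate one sample against a threshold or anomaly rule."""
    if rule["type"] == "threshold":
        return value > rule["threshold"]
    if rule["type"] == "anomaly":
        return abs(z_score(history, value)) > rule["z_threshold"]
    raise ValueError(f"unknown rule type: {rule['type']}")


history = [40.0, 42.0, 41.0, 39.0, 43.0, 40.0, 41.0, 42.0]
cpu_rule = {"type": "threshold", "threshold": 90.0}
anomaly_rule = {"type": "anomaly", "z_threshold": 3.0}

spike_threshold = breaches(cpu_rule, history, 95.0)
spike_anomaly = breaches(anomaly_rule, history, 95.0)
normal_anomaly = breaches(anomaly_rule, history, 42.0)
print(spike_threshold, spike_anomaly, normal_anomaly)  # True True False
```

<p>A threshold rule needs only the current value, whereas the anomaly rule needs recent history, which is one reason samples are stored before evaluation.</p>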
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Alert Lifecycle</p>
<p data-ai-summary="true">States: **OK → Pending → Firing → Acknowledged → Resolved**, with a **Suppressed** branch during maintenance windows. Pending requires sustained breach (`for_duration_seconds`) before firing — the same pattern Prometheus `for:` clauses use.</p>
<p>| State | What it means |<br />
|-------|----------------|<br />
| pending | Threshold breached; waiting for sustained duration |<br />
| firing | Sustained breach; notifications sent |<br />
| acknowledged | Operator acknowledged; auto-resolve timer may apply |<br />
| resolved | Condition cleared or manually resolved |<br />
| suppressed | Matched by suppression rule during maintenance window |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Real-World Context</p>
<p>&#8211; **PagerDuty** routes firing alerts through escalation policies with time-delayed notifications.<br />
&#8211; **Datadog** monitors use evaluation windows to prevent alert storms from brief spikes.<br />
&#8211; **Grafana** alert rules separate rule definition from alert instance management.  </p>
<p data-ai-summary="true">Your console implements all three patterns in one self-contained stack.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Design Notes</p>
<p data-ai-summary="true">These choices show up directly in the code you will run:</p>
<p>&#8211; **Pending before firing** — Rules use `for_duration_seconds` so a one-second CPU spike does not wake anyone at 3 a.m.<br />
&#8211; **Fingerprint dedup** — A SHA-256 fingerprint per rule and metric stops duplicate alert rows from flooding the table.<br />
&#8211; **DB-first state** — Every transition is written to `alert_state_history` so you can audit what happened and when.<br />
&#8211; **Same-origin WebSocket** — The frontend connects to `ws://&lt;host&gt;/ws/alerts`; Vite and nginx proxy it correctly in both local and Docker modes.<br />
&#8211; **Self-contained project** — Clone `infrawatch-alerts/` alone. Everything you need is inside that folder.<br />
&#8211; **No <span data-ai-definition="API">API</span> keys in source** — <span data-ai-definition="database">Database</span> credentials are local Docker defaults for development. Set `SECRET_KEY` in `.env` before production.</p>
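<p>The fingerprint dedup note above can be sketched like this. The lesson says the fingerprint is a SHA-256 per rule and metric; including the `service` field, the JSON encoding, and the `active` dictionary (standing in for the alerts table) are assumptions for illustration.</p>

```python
import hashlib
import json


def fingerprint(rule_id: int, metric_name: str, service: str) -> str:
    """Stable SHA-256 over the fields that identify one alert instance."""
    payload = json.dumps(
        {"rule": rule_id, "metric": metric_name, "service": service},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Stand-in for the alert table, keyed by fingerprint.
active: dict[str, dict] = {}


def upsert_alert(rule_id: int, metric_name: str, service: str, value: float) -> dict:
    """Create the alert on first breach, update the same row afterwards."""
    fp = fingerprint(rule_id, metric_name, service)
    alert = active.get(fp)
    if alert is None:
        alert = {"fingerprint": fp, "count": 0}
        active[fp] = alert
    alert["count"] += 1
    alert["last_value"] = value
    return alert


first = upsert_alert(1, "cpu.utilization", "compute", 95.0)
again = upsert_alert(1, "cpu.utilization", "compute", 97.5)
other = upsert_alert(1, "cpu.utilization", "storage", 96.0)
print(len(active), again["count"])  # 2 2
```

<p>Repeated breaches of the same rule and metric update one row rather than adding new ones, which keeps the alert table proportional to distinct problems and not to sample volume.</p>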
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Project Layout</p>
<p>```<br />
infrawatch-alerts/<br />
├── backend/<br />
│   ├── app/<br />
│   │   ├── main.py                    # FastAPI routes + WebSocket + evaluation loop<br />
│   │   ├── core/config.py             # pydantic-settings<br />
│   │   ├── core/database.py           # SQLAlchemy engine + session<br />
│   │   ├── models/alert.py            # AlertRule, AlertInstance, EscalationPolicy, etc.<br />
│   │   ├── schemas/alert.py           # Pydantic request/response models<br />
│   │   ├── services/<br />
│   │   │   ├── evaluator.py           # Threshold + anomaly evaluation, dedup<br />
│   │   │   ├── processor.py           # Lifecycle, severity, notifications<br />
│   │   │   ├── query.py               # Search, statistics, export<br />
│   │   │   ├── suppression.py         # Maintenance window logic<br />
│   │   │   ├── validator.py           # Rule validation + test harness<br />
│   │   │   └── seed.py                # Default rules and templates<br />
│   │   ├── api/<br />
│   │   │   ├── rules.py               # Rule CRUD, templates, testing<br />
│   │   │   ├── alerts.py              # Search, ack, resolve, bulk ops<br />
│   │   │   └── metrics.py             # Ingest + simulate<br />
│   │   └── websocket/manager.py       # Live dashboard push<br />
│   ├── tests/<br />
│   ├── requirements.txt<br />
│   ├── requirements-dev.txt<br />
│   └── Dockerfile<br />
├── frontend/<br />
│   ├── src/<br />
│   │   ├── pages/Dashboard.tsx        # Command center + metric simulator<br />
│   │   ├── pages/Alerts.tsx           # Searchable alert explorer<br />
│   │   ├── pages/Rules.tsx            # Rule list + test dialog<br />
│   │   ├── pages/Templates.tsx        # Template gallery<br />
│   │   ├── components/AlertTable.tsx  # Bulk ack/resolve<br />
│   │   ├── components/MetricSimulator.tsx<br />
│   │   ├── api/client.ts<br />
│   │   └── hooks/useWebSocket.ts<br />
│   ├── package.json<br />
│   ├── vite.config.ts<br />
│   └── Dockerfile<br />
├── docker-compose.yml<br />
├── build.sh                           # test | local | docker | demo | all<br />
├── start.sh<br />
├── stop.sh<br />
├── cleanup.sh<br />
├── requirements.txt<br />
├── requirements-dev.txt<br />
├── .env.example<br />
└── README.md<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Success Criteria</p>
<p data-ai-summary="true">After completing this lesson you should be able to:</p>
<p>&#8211; [ ] Create threshold and anomaly rules via <span data-ai-definition="API">API</span><br />
&#8211; [ ] Ingest metrics and observe alerts transition through pending → firing<br />
&#8211; [ ] Acknowledge and bulk-resolve alerts from the dashboard<br />
&#8211; [ ] Search, export, and view severity statistics<br />
&#8211; [ ] See live updates via WebSocket without page refresh  </p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Build, Test, and Run</p>
<p data-ai-summary="true">Everything below runs from inside the `infrawatch-alerts/` directory. You need Python 3.12+, Node.js 22+, and Docker with Compose v2.</p>
<p data-ai-summary="true">### First-time setup</p>
<p>```bash<br />
cd infrawatch-alerts<br />
chmod +x build.sh stop.sh start.sh cleanup.sh</p>
<p>python3 -m venv backend/.venv<br />
source backend/.venv/bin/activate<br />
pip install -r requirements-dev.txt</p>
<p>cd frontend &#038;&#038; npm install &#038;&#038; cd ..<br />
```</p>
<p data-ai-summary="true">Copy environment defaults when you need them:</p>
<p>```bash<br />
cp .env.example .env<br />
```</p>
<p data-ai-summary="true">### Run tests</p>
<p>```bash<br />
./build.sh test<br />
```</p>
<p data-ai-summary="true">You should see 11 pytest tests pass, a Vitest smoke test pass, and a successful Vite production build.</p>
<p data-ai-summary="true">To run test suites individually:</p>
<p>```bash<br />
cd backend<br />
export PYTHONPATH=.<br />
export TESTING=1<br />
export DATABASE_URL="sqlite://"<br />
pytest -q --tb=short</p>
<p>cd ../frontend<br />
npm run test -- --run<br />
npm run build<br />
```</p>
<p data-ai-summary="true">### Start the stack (local mode)</p>
<p>```bash<br />
./build.sh local<br />
```</p>
<p data-ai-summary="true">Docker starts Postgres and Redis only. The <span data-ai-definition="API">API</span> and UI run on your machine. Wait for:</p>
<p>```<br />
[build] Postgres ready<br />
<span data-ai-definition="database">database</span> seeded<br />
[build] Backend healthy on :5020<br />
[build] Frontend on http://localhost:5190<br />
```</p>
<p data-ai-summary="true">Open http://localhost:5190</p>
<p data-ai-summary="true">### Generate metrics and see alerts</p>
<p data-ai-summary="true">Metrics do not appear until data is ingested. Pick either path:</p>
<p data-ai-summary="true">**From the dashboard** — Click **Run Production Burst**, or set CPU above 95 and click **Push Custom Metrics**.</p>
<p data-ai-summary="true">**From the terminal:**</p>
<p>```bash<br />
curl -X POST http://localhost:5020/api/v1/metrics/simulate<br />
curl http://localhost:5020/api/v1/alerts/statistics<br />
curl http://localhost:5020/api/v1/alerts/active<br />
```</p>
<p data-ai-summary="true">Wait 30–60 seconds for pending timers, then refresh the dashboard. Statistics should show `total` greater than zero.</p>
<p data-ai-summary="true">### Run the automated demo</p>
<p data-ai-summary="true">With the stack already running:</p>
<p>```bash<br />
./build.sh demo<br />
```</p>
<p data-ai-summary="true">### Stop and clean up</p>
<p>```bash<br />
./stop.sh        # stop app processes and Docker Compose services<br />
./cleanup.sh     # remove node_modules, venv, caches; prune Docker resources<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Docker</p>
<p data-ai-summary="true">**Hybrid mode** (what `./build.sh local` uses) — infrastructure only:</p>
<p>```bash<br />
docker compose up -d postgres redis<br />
```</p>
<p data-ai-summary="true">The app connects via `localhost:5450` and `localhost:6400`.</p>
<p data-ai-summary="true">**Full stack** — all four services in containers:</p>
<p>```bash<br />
./build.sh docker<br />
```</p>
<p>| Container | Image / build | Host port |<br />
|-----------|---------------|-----------|<br />
| postgres | `postgres:16-alpine` | 5450 |<br />
| redis | `redis:7-alpine` | 6400 |<br />
| backend | `backend/Dockerfile` | 5020 |<br />
| frontend | `frontend/Dockerfile` (nginx) | 5190 |</p>
<p data-ai-summary="true">Tear down containers and volumes:</p>
<p>```bash<br />
docker compose down -v --remove-orphans<br />
```</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Dashboard Walkthrough</p>
<p>| Section | What you will see |<br />
|---------|-------------------|<br />
| Overview | Stats cards (total, firing, critical+, MTTR), severity chart, hourly trend |<br />
| Metric Simulator | CPU and <span data-ai-definition="API">API</span> latency sliders; buttons to push real metric samples |<br />
| Active Alerts | Severity chips, bulk acknowledge and resolve, live WebSocket refresh |<br />
| Rules | Rule list with a test dialog against sample values |<br />
| Templates | Pre-built infrastructure templates you can turn into rules |</p>
<p data-ai-summary="true">The **Live** badge in the header means the WebSocket connection is active. When an alert changes state, the table updates without reloading the page.</p>
<p data-ai-summary="true">---</p>
<p data-ai-summary="true">## <span data-ai-definition="API">API</span> Reference</p>
<p data-ai-summary="true">Base URL: `http://localhost:5020`</p>
<p data-ai-summary="true">### Health check</p>
<p>```bash<br />
curl -sS http://localhost:5020/api/health | python3 -m json.tool<br />
```</p>
<p data-ai-summary="true">Expected:</p>
<p>```json<br />
{"status": "healthy", "service": "infrawatch-alerts", "rules": 4}<br />
```</p>
<p data-ai-summary="true">### Ingest a single metric</p>
<p>```bash<br />
curl -sS -X POST http://localhost:5020/api/v1/metrics/ingest \<br />
  -H "Content-Type: application/json" \<br />
  -d '{"metric_name":"cpu.utilization","value":95.0,"service":"compute"}'<br />
```</p>
<p data-ai-summary="true">### Simulate a production burst</p>
<p>```bash<br />
curl -sS -X POST http://localhost:5020/api/v1/metrics/simulate | python3 -m json.tool<br />
```</p>
<p data-ai-summary="true">### Query alerts</p>
<p>```bash<br />
curl -sS http://localhost:5020/api/v1/alerts/statistics | python3 -m json.tool<br />
curl -sS http://localhost:5020/api/v1/alerts/active | python3 -m json.tool<br />
```</p>
<p data-ai-summary="true">### Acknowledge and test a rule</p>
<p>```bash<br />
curl -sS -X POST "http://localhost:5020/api/v1/alerts/{alert_id}/acknowledge"<br />
<br />
curl -sS -X POST http://localhost:5020/api/v1/test/rule \<br />
  -H "Content-Type: application/json" \<br />
  -d '{"rule_id":"&lt;id&gt;","test_values":[95.0, 50.0, 10.0]}'<br />
```</p>
<p data-ai-summary="true">### Endpoint summary</p>
<p>| Method | Path | Purpose |<br />
|--------|------|---------|<br />
| GET | `/api/health` | Health check |<br />
| GET/POST | `/api/v1/rules` | Rule CRUD |<br />
| POST | `/api/v1/rules/bulk` | Bulk enable/disable/delete |<br />
| GET/POST | `/api/v1/templates` | Rule templates |<br />
| POST | `/api/v1/test/rule` | Test rule against sample values |<br />
| POST | `/api/v1/metrics/ingest` | Ingest metric + trigger evaluation |<br />
| POST | `/api/v1/metrics/simulate` | Demo metric burst |<br />
| POST | `/api/v1/alerts/search` | Filtered search with pagination |<br />
| GET | `/api/v1/alerts/statistics` | Dashboard aggregates |<br />
| GET | `/api/v1/alerts/active` | Non-resolved alerts |<br />
| POST | `/api/v1/alerts/{id}/acknowledge` | Acknowledge |<br />
| POST | `/api/v1/alerts/{id}/resolve` | Resolve |<br />
| POST | `/api/v1/alerts/bulk-update` | Bulk status change |<br />
| POST | `/api/v1/alerts/export` | CSV/JSON export |<br />
| WS | `/ws/alerts` | Live alert updates |</p>
<p data-ai-summary="true">---</p>
<p data-ai-summary="true">## Troubleshooting</p>
<p>| Issue | What to do |<br />
|-------|------------|<br />
| `Internal Server Error` on `/api/health` | Run `./stop.sh` then `./build.sh local` |<br />
| Dashboard shows zero alerts | Run the simulate curl command, wait 60 seconds, refresh |<br />
| Backend not healthy | Check `tail -50 .logs/backend.log` |<br />
| Port already in use | Change ports in `docker-compose.yml` and `.env` |<br />
| Stale containers or old builds | Run `./cleanup.sh` and start again |</p>
<p data-ai-summary="true">---</p>
<p data-ai-summary="true">## Assignment</p>
<p>1. Add a composite rule that fires only when **both** `cpu.utilization > 80` AND `memory.utilization > 75`.<br />
2. Create a suppression rule for `payments.*` metrics during a 2-hour maintenance window.<br />
3. Build an escalation policy that notifies Slack at level 1 (5 min) and PagerDuty at level 2 (15 min).<br />
4. Add a dashboard widget showing MTTR trend over the last 24 hours.  </p>
<p data-ai-summary="true">### Solution Hints</p>
<p>- Composite rules: extend `RuleType.COMPOSITE` in `evaluator.py` with AND/OR sub-condition evaluation.<br />
- Suppression: POST to `/api/v1/suppression` with regex `metric_pattern` and ISO window timestamps.<br />
- Escalation: attach `EscalationPolicy` rows to rules; background loop checks `delay_minutes` on unacknowledged firing alerts.<br />
- MTTR widget: query `/api/v1/alerts/statistics` hourly and plot `mttr_minutes` in a `LineChart`.  </p>
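<p>For the composite-rule hint, one way to picture AND/OR sub-condition evaluation is the sketch below. The `Condition` and `CompositeRule` classes and the `OPS` table are hypothetical names chosen for illustration; the lesson only names `RuleType.COMPOSITE` in `evaluator.py`, so the real evaluator may be shaped differently.</p>

```python
import operator
from dataclasses import dataclass
from typing import Dict, List

# Comparison operators keyed by name (hypothetical naming scheme).
OPS = {"gt": operator.gt, "ge": operator.ge, "lt": operator.lt, "le": operator.le}


@dataclass
class Condition:
    metric_name: str
    op: str            # key into OPS
    threshold: float

    def matches(self, latest: Dict[str, float]) -> bool:
        # A metric with no sample yet cannot satisfy the condition.
        if self.metric_name not in latest:
            return False
        return OPS[self.op](latest[self.metric_name], self.threshold)


@dataclass
class CompositeRule:
    name: str
    combine: str                 # "AND" or "OR"
    conditions: List[Condition]

    def should_fire(self, latest: Dict[str, float]) -> bool:
        results = [c.matches(latest) for c in self.conditions]
        if self.combine == "AND":
            return all(results)
        if self.combine == "OR":
            return any(results)
        raise ValueError(f"unknown combinator: {self.combine}")


# The rule from assignment item 1: CPU above 80 AND memory above 75.
rule = CompositeRule(
    name="cpu-and-memory-pressure",
    combine="AND",
    conditions=[
        Condition("cpu.utilization", "gt", 80.0),
        Condition("memory.utilization", "gt", 75.0),
    ],
)

print(rule.should_fire({"cpu.utilization": 95.0, "memory.utilization": 80.0}))
print(rule.should_fire({"cpu.utilization": 95.0, "memory.utilization": 50.0}))
```

<p>Treating a missing metric as "not matched" keeps an AND rule from firing on partial data; whether that is the right default for your rules is a design decision to make explicitly.</p>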
<p data-ai-summary="true">---</p>
<p data-ai-summary="true">*Next in series: Notification channels — email, Slack, webhooks, and preference-based delivery.*</p>
</div>]]></content:encoded>
                                </item>
                <item>
            <title> - Hands-On Tutorial</title>
            <link></link>
            <comments>#respond</comments>
            <pubDate></pubDate>
            <dc:creator><![CDATA[Anjali]]></dc:creator>
                        <guid isPermaLink="false"></guid>
            <description><![CDATA[## Atlas Hyperscale Platform — Engineering Lesson **Domain:** DevOps · SecOps · SRE · Cloud Engineering · Platform Engineering · Release Engineering · Automation Engineering · Infrastructure Engineering ## What... Hands-On System Design tutorial with practical examples and real-world applications.]]></description>
            <content:encoded><![CDATA[<div class="lesson-rss-content"><h3>Hands-On System Design Tutorial</h3><p data-ai-summary="true">## Atlas Hyperscale Platform — Engineering Lesson</p>
<p data-ai-summary="true">**Domain:** DevOps · SecOps · SRE · Cloud Engineering · Platform Engineering · Release Engineering · Automation Engineering · Infrastructure Engineering</p>
<p data-ai-summary="true">## What We Build</p>
<p data-ai-summary="true">You will operate **Atlas Hyperscale Platform** — a single control plane that unifies global traffic engineering, predictive scaling, FinOps accountability, and production-readiness gates. When Netflix serves a global catalog or Shopify absorbs a flash-sale spike, no single team owns “the server.” They own **loops**: route traffic cheaply, scale before saturation, spend consciously, and gate launches with evidence. Atlas teaches that integration pattern in one repository.</p>
<p data-ai-summary="true">The platform is built **from scratch as one product**: four domain routers share one FastAPI process, one React console, and one WebSocket broadcast. Traffic, <span data-ai-definition="performance">performance</span>, economics, and readiness are first-class modules — not separate repos glued at the UI.</p>
<p data-ai-summary="true">**Agenda**</p>
<p>- **Traffic engineering**: geographic routing, consistent-hash sharding, multi-tier cache, regional failover<br />
- **<span data-ai-definition="performance">Performance</span> operations**: profiling, predictive auto-scaling, query optimization, capacity runway<br />
- **FinOps governance**: namespace cost attribution, optimization recommendations, anomaly alerting<br />
- **Production readiness**: six-pillar validation, integration tests, runbooks, knowledge base<br />
- **Unified console**: live dashboard with cross-domain **Run Platform Demo**</p>
<p data-ai-summary="true">**Success criteria:** Dashboard shows live metrics; validation returns six pillar scores; FinOps total cost updates after collection; failover reroutes traffic away from a failed region.</p>
<p data-ai-summary="true">## Why a Unified Control Plane Matters</p>
<p data-ai-summary="true">In most organizations, traffic routing lives in CDN configs, scaling in a separate autoscaler service, cost in a spreadsheet, and launch approval in a wiki. Each layer works until something breaks at a boundary — a region fails but FinOps still bills it, autoscaler adds replicas after latency already spiked, or a service ships without a runbook.</p>
<p data-ai-summary="true">Atlas models the **opposite architecture**:</p>
<p>| Problem in siloed stacks | Atlas integration answer |<br />
|--------------------------|--------------------------|<br />
| On-call sees latency before autoscaler reacts | Predictive scaler forecasts 30 min ahead |<br />
| Cost spike discovered at month-end | FinOps collector + 3σ anomaly alerts |<br />
| Region failure needs manual DNS flip | Traffic engine reroutes to nearest healthy cell |<br />
| Launch approved without evidence | Readiness gate scores six pillars before go-live |<br />
| Console polls four different tools | One `/ws/live` stream feeds all panels |</p>
<p data-ai-summary="true">You are not learning four unrelated APIs — you are learning **how platform engineers wire hyperscale loops together**.</p>
<p data-ai-summary="true">## Platform Placement</p>
<p data-ai-summary="true">Atlas sits at the **coordination layer** between product traffic and raw cloud infrastructure — the same role internal platforms play at hyperscale retailers, social feeds, and fintech payment rails.</p>
<p>| Plane | Responsibility | Atlas module |<br />
|-------|----------------|--------------|<br />
| Traffic | Geo-routing, sharding, cache | Traffic Engine |<br />
| <span data-ai-definition="performance">Performance</span> | Scale, profile, capacity | <span data-ai-definition="performance">Performance</span> services |<br />
| Economics | Cost, waste, commitments | FinOps module |<br />
| Governance | Readiness gates, runbooks | Readiness module |</p>
<p data-ai-summary="true">Atlas does not replace Kubernetes, cloud consoles, or observability backends. It exposes a stable internal <span data-ai-definition="API">API</span> so a product team can request a user profile without knowing which region, shard, or cache tier served it — while SRE and platform teams retain policy control behind that abstraction.</p>
<p data-ai-summary="true">## Core Concepts by Engineering Discipline</p>
<p data-ai-summary="true">### Infrastructure &#038; Cloud Engineering</p>
<p data-ai-summary="true">Hyperscale systems partition work across **regions** (latency), **shards** (data scale), and **cache tiers** (cost). Atlas simulates four regions (`us-east`, `us-west`, `eu-west`, `ap-south`) with independent health scores and capacity. A request carries optional `X-User-Lat` / `X-User-Lon` headers; the engine picks the nearest healthy region by Euclidean distance over latitude and longitude, a flat-plane approximation that is adequate for four widely spaced regions. It is the same intuition behind AWS Route 53 latency routing and Cloudflare anycast edge selection, although those systems rely on measured latency and BGP path selection rather than raw coordinates.</p>
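<p>A minimal sketch of nearest-healthy-region selection follows. The `Region` dataclass, the coordinates, the `default` fallback, and the exact signature of `route_request` are assumptions for illustration; only the region names and the Euclidean-distance rule come from the lesson.</p>

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Region:
    lat: float
    lon: float
    healthy: bool = True


# Coordinates are rough placeholders, not values taken from the Atlas repo.
REGIONS: Dict[str, Region] = {
    "us-east": Region(38.9, -77.4),
    "us-west": Region(45.8, -119.7),
    "eu-west": Region(53.3, -6.3),
    "ap-south": Region(19.1, 72.9),
}


def route_request(lat: Optional[float], lon: Optional[float],
                  regions: Dict[str, Region], default: str = "us-east") -> str:
    """Return the name of the closest healthy region."""
    healthy = {name: r for name, r in regions.items() if r.healthy}
    if not healthy:
        raise RuntimeError("no healthy region available")
    if lat is None or lon is None:
        # Headers are optional, so fall back to a default region.
        return default if default in healthy else next(iter(healthy))
    # Plain Euclidean distance on (lat, lon), as the lesson describes.
    return min(
        healthy,
        key=lambda name: math.hypot(healthy[name].lat - lat, healthy[name].lon - lon),
    )


print(route_request(40.7, -74.0, REGIONS))   # a user near New York
REGIONS["us-east"].healthy = False           # simulate a regional failure
print(route_request(40.7, -74.0, REGIONS))   # same user after failover
```

<p>Filtering unhealthy regions before taking the minimum is what makes failover automatic: no routing table has to be edited when a region is marked failed.</p>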
<p data-ai-summary="true">**Sharding** uses consistent hashing: each physical shard owns 256 virtual nodes on a hash ring. When you add shard five, only ~20% of keys move (idealized), versus naive modulo partitioning, which reassigns roughly 80% of keys when the shard count goes from four to five. Cassandra and DynamoDB rely on this property; Redis Cluster achieves a similar effect with a fixed set of hash slots rather than a hash ring.</p>
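<p>The key-movement claim can be checked empirically. The sketch below builds a ring with 256 virtual nodes per shard, adds a fifth shard, and measures how many keys change owner, then does the same for modulo partitioning. The `HashRing` class, the SHA-256 hash, and the `user-N` keys are assumptions for illustration, not the Atlas implementation.</p>

```python
import bisect
import hashlib
from typing import Dict, List

VNODES_PER_SHARD = 256  # figure quoted in the lesson


def _hash(key: str) -> int:
    # Any stable, well-mixed hash works; SHA-256 is used only for distribution.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, shards: List[str]) -> None:
        self._points: List[int] = []      # sorted virtual-node positions
        self._owner: Dict[int, str] = {}  # position -> physical shard
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard: str) -> None:
        for i in range(VNODES_PER_SHARD):
            point = _hash(f"{shard}#{i}")
            self._owner[point] = shard
            bisect.insort(self._points, point)

    def get_shard(self, key: str) -> str:
        # First virtual node clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user-{i}" for i in range(20_000)]
ring = HashRing([f"shard-{n}" for n in range(1, 5)])
before = {k: ring.get_shard(k) for k in keys}
ring.add_shard("shard-5")
after = {k: ring.get_shard(k) for k in keys}

# Fraction of keys whose owner changed on the ring vs. under hash % N.
moved = sum(before[k] != after[k] for k in keys) / len(keys)
modulo_moved = sum(_hash(k) % 4 != _hash(k) % 5 for k in keys) / len(keys)
print(f"ring moved: {moved:.2%}, modulo moved: {modulo_moved:.2%}")
```

<p>On the ring, every key that moves lands on the new shard, so `moved` should sit near the new shard's share of the ring (about one fifth). Under modulo, a key keeps its shard only when `hash % 4` and `hash % 5` happen to agree, so a much larger fraction is reassigned.</p>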
<p>| Cache tier | TTL | Role |<br />
|------------|-----|------|<br />
| L1 | 30 s | Hot user sessions, flash traffic |<br />
| L2 | 5 min | Warm catalog / profile data |<br />
| L3 | 30 min | Stable reference data |<br />
| Miss | — | Origin fetch; counted for hit-rate SLO |</p>
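<p>The tier table can be read as a lookup order plus a TTL per tier. The sketch below is one hedged interpretation: `TieredCache` is a hypothetical class, the injectable `clock` exists only to make expiry testable, and writing a value into all three tiers on a miss is an assumption (the lesson names `check_cache()` and `set_cache()` but does not spell out the fill policy).</p>

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# TTLs in seconds, taken from the table above.
TIER_TTLS = {"L1": 30, "L2": 300, "L3": 1800}


class TieredCache:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # tier -> key -> (expires_at, value)
        self._tiers: Dict[str, Dict[str, Tuple[float, Any]]] = {t: {} for t in TIER_TTLS}
        self.hits = 0
        self.misses = 0

    def check_cache(self, key: str) -> Tuple[Optional[str], Any]:
        """Return (tier, value) for the first live entry, or (None, None) on a miss."""
        now = self._clock()
        for tier in TIER_TTLS:  # insertion order: L1, then L2, then L3
            entry = self._tiers[tier].get(key)
            if entry is not None and entry[0] > now:
                self.hits += 1
                return tier, entry[1]
        self.misses += 1
        return None, None

    def set_cache(self, key: str, value: Any) -> None:
        # Assumption: populate every tier so the value ages out of L1 first.
        now = self._clock()
        for tier, ttl in TIER_TTLS.items():
            self._tiers[tier][key] = (now + ttl, value)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = TieredCache()
tier, profile = cache.check_cache("user-42")      # first lookup misses
if tier is None:
    cache.set_cache("user-42", {"id": 42})        # origin fetch would go here
print(cache.check_cache("user-42")[0], cache.hit_rate)
```

<p>Tracking hits and misses in the cache object itself is what lets a hit-rate SLO be computed without a separate metrics pipeline.</p>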
<p data-ai-summary="true">Cloud engineers care because **cache hit-rate is a cost lever**. Every miss is a <span data-ai-definition="database">database</span> round-trip billed in CPU, connection pool slots, and replication lag risk.</p>
<p data-ai-summary="true">### Platform &#038; Release Engineering</p>
<p data-ai-summary="true">Four domain routers, plus a platform router for the summary and cross-domain demo, share one FastAPI application:</p>
<p>| Router prefix | Domain |<br />
|---------------|--------|<br />
| `/api/traffic` | Geo-route, shard, cache, failover |<br />
| `/api/performance` | Profiler, autoscaler, optimizer, runway |<br />
| `/api/finops` | Cost, recommendations, alerts |<br />
| `/api/readiness` | Validation, integration tests, runbooks |<br />
| `/api/platform` | Summary + cross-domain demo |</p>
<p data-ai-summary="true">On startup (`lifespan` in `main.py`), Atlas wires router dependencies, collects initial FinOps metrics, and launches background loops: profiler sampling, autoscaler forecasting, query optimizer refresh, capacity runway calculation, and metrics broadcast every second. That bootstrap mirrors real controllers that reconcile desired state after the <span data-ai-definition="API">API</span> process comes up.</p>
<p data-ai-summary="true">Release engineers attach promotion logic to **one surface**: readiness score ≥ threshold, integration pass rate 100%, no active FinOps alerts — before tagging a production release.</p>
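<p>The promotion rule reduces to three checks over one evidence record. In the sketch below, `ReleaseEvidence` and `promotion_decision` are hypothetical names, and the default threshold of 85 is borrowed from the assignment later in this lesson rather than from any Atlas configuration.</p>

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReleaseEvidence:
    readiness_score: float        # overall readiness, 0-100
    integration_pass_rate: float  # fraction of E2E scenarios passing, 0.0-1.0
    active_finops_alerts: int     # currently open cost alerts


def promotion_decision(evidence: ReleaseEvidence,
                       min_score: float = 85.0) -> Tuple[bool, List[str]]:
    """Return (allowed, blockers); an empty blocker list means promote."""
    blockers: List[str] = []
    if evidence.readiness_score < min_score:
        blockers.append(
            f"readiness {evidence.readiness_score:.1f} below threshold {min_score:.1f}"
        )
    if evidence.integration_pass_rate < 1.0:
        blockers.append(
            f"integration pass rate {evidence.integration_pass_rate:.0%} is not 100%"
        )
    if evidence.active_finops_alerts > 0:
        blockers.append(f"{evidence.active_finops_alerts} active FinOps alert(s)")
    return (not blockers, blockers)


ok, reasons = promotion_decision(ReleaseEvidence(91.0, 1.0, 0))
print(ok, reasons)
blocked, why = promotion_decision(ReleaseEvidence(82.5, 0.9, 2))
print(blocked, why)
```

<p>Returning the list of blockers, not just a boolean, gives a release pipeline something concrete to print when it refuses to tag a build.</p>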
<p data-ai-summary="true">### DevOps &#038; Automation Engineering</p>
<p data-ai-summary="true">The **platform demo** (`POST /api/platform/demo`) is an orchestrated simulation: it ramps the traffic target RPS, bumps load metrics, triggers cost collection, and returns a unified summary. Automation teams use the same pattern in production: a “game day” script that proves subsystems cooperate before a peak event such as Black Friday.</p>
<p data-ai-summary="true">Background tasks run as asyncio loops inside the <span data-ai-definition="API">API</span> process rather than as external cron jobs. That is intentional: shared in-memory state (replica count, cache stats, cost totals) stays consistent for WebSocket subscribers without a separate message bus in this learning stack.</p>
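<p>A stripped-down sketch of that pattern, using only `asyncio`, is shown below. The `run_every` helper, the `state` dict, and the two step functions are hypothetical, and the intervals are shortened to milliseconds so the example finishes quickly (the lesson's loops run at 1 s and 120 s).</p>

```python
import asyncio
from typing import Callable, Dict

# Shared in-memory state that every loop and every subscriber reads.
state: Dict[str, float] = {"replicas": 2, "cost_usd": 0.0, "ticks": 0}


async def run_every(interval_s: float, step: Callable[[], None]) -> None:
    """Call step() forever, sleeping between calls, until cancelled."""
    while True:
        step()
        await asyncio.sleep(interval_s)


def broadcast_metrics() -> None:
    state["ticks"] += 1        # stand-in for pushing a frame to WebSocket clients


def collect_cost() -> None:
    state["cost_usd"] += 1.25  # placeholder increment, not a real cost model


async def main(run_for_s: float) -> Dict[str, float]:
    tasks = [
        asyncio.create_task(run_every(0.01, broadcast_metrics)),
        asyncio.create_task(run_every(0.05, collect_cost)),
    ]
    try:
        await asyncio.sleep(run_for_s)   # the API would serve requests here
    finally:
        for task in tasks:
            task.cancel()
        # Wait for the cancellations to complete before returning.
        await asyncio.gather(*tasks, return_exceptions=True)
    return dict(state)


snapshot = asyncio.run(main(0.2))
print(snapshot)
```

<p>Because all loops run on one event loop and none of the step functions awaits mid-update, the state dict never needs a lock. That simplicity disappears once the API is scaled to several processes, which is when a message bus or shared store becomes necessary.</p>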
<p data-ai-summary="true">### SecOps</p>
<p data-ai-summary="true">Readiness validation includes a **security pillar** scoring policy compliance and automation coverage. FinOps anomaly detection uses statistical thresholds (3σ above baseline) — the same class of signal SIEM rules use for billing fraud or crypto-mining hijacks on compromised accounts.</p>
<p data-ai-summary="true">Regional failover is a security-adjacent control: when `us-east` is marked failed, traffic never routes there until recovery — limiting blast radius during regional compromise or BGP incidents.</p>
<p data-ai-summary="true">### SRE</p>
<p data-ai-summary="true">Atlas exposes Prometheus gauges at `/metrics` (`atlas_readiness_score`, `atlas_cluster_cost_usd`, `atlas_validation_runs_total`). The console subscribes to `/ws/live` so operators see golden signals without refreshing.</p>
<p>| Signal | Where observed | Action triggered |<br />
|--------|----------------|------------------|<br />
| Latency proxy | Cache miss rate rising | Traffic panel hit-rate drops |<br />
| Traffic | `request_count`, region RPS | Demo ramp, load generate |<br />
| Saturation | CPU %, replica utilization | Autoscaler scale-up |<br />
| Errors | Integration test failures | Readiness gate blocks launch |</p>
<p data-ai-summary="true">**Error-budget thinking:** If p99 latency degrades while average CPU looks healthy, suspect cache stampede or shard hotspot — not “add more servers blindly.” The Traffic tab shows per-shard load % exactly for this diagnosis.</p>
<p data-ai-summary="true">### FinOps &#038; Cloud Financial Management</p>
<p data-ai-summary="true">Namespace-level costs (`production`, `staging`, `ml-training`, etc.) enable **chargeback** — each team sees its bill before finance closes the month. Recommendations cover rightsizing, reserved instances, idle volume cleanup, and spot migration. Waste reports aggregate orphaned resources.</p>
<p data-ai-summary="true">Production FinOps teams run this loop weekly: collect → attribute → recommend → alert → remediate. Atlas compresses it into a five-minute demo.</p>
<p data-ai-summary="true">## Architecture, Control Flow &#038; Data Flow</p>
<p data-ai-summary="true">**Control flow:** React console calls REST endpoints and subscribes to `/ws/live`. User actions (fail region, start load, run validation) mutate service state; background loops propagate changes to all connected clients.</p>
<p data-ai-summary="true">**Data flow (user request path):**</p>
<p>```text<br />
GET /api/traffic/user/{id}<br />
  → parse X-User-Lat, X-User-Lon<br />
  → route_request() → nearest healthy region<br />
  → get_shard(user_id) → consistent-hash ring lookup<br />
  → check_cache() → L1 → L2 → L3 → miss<br />
  → set_cache() on miss<br />
  → return { region, shard, cache_hit, profile }<br />
```</p>
<p data-ai-summary="true">**Data flow (observability path):**</p>
<p>```text<br />
background_metrics (1s)  ─┐<br />
background_finops (120s) ─┼→ broadcast_ws() → /ws/live → dashboard panels<br />
autoscaler (30s)         ─┘<br />
```</p>
<p data-ai-summary="true">**State transitions:**</p>
<p>| State | Trigger | Next state |<br />
|-------|---------|------------|<br />
| Idle | Platform start | Steady (background loops running) |<br />
| Traffic ramp | `start-load` or demo | Elevated RPS, shard load shifts |<br />
| Scale event | Predicted utilization > 80% | Replica count increases |<br />
| Cost collect | Timer or `POST /cost/collect` | Namespace totals refresh |<br />
| Alert | 3σ cost spike | FinOps alert panel |<br />
| Readiness gate | `POST /validate` | Scorecard published |</p>
<p data-ai-summary="true">## Capability Deep Dive</p>
<p data-ai-summary="true">### Traffic module</p>
<p data-ai-summary="true">The `HyperscaleEngine` maintains region health, shard statistics, and cache counters. Admin endpoints simulate production controls:</p>
<p>```text<br />
POST /api/traffic/admin/start-load     {"target_rps": 5000}<br />
POST /api/traffic/admin/fail-region/us-east<br />
POST /api/traffic/admin/recover-region/us-east<br />
GET  /api/traffic/metrics<br />
```</p>
<p data-ai-summary="true">**Failover workflow:** fail region → `route_request` skips it → users near NYC route to `us-west` or `eu-west` → shard assignment unchanged (shard is independent of region in this model). Recovery restores optimal latency routing.</p>
<p data-ai-summary="true">**Design note:** Production systems separate **routing** (which edge serves the request) from **data placement** (which shard owns the key). Atlas keeps both visible in one response so you practice reading `region` and `shard` together — a common on-call skill during incidents.</p>
<p data-ai-summary="true">### <span data-ai-definition="performance">Performance</span> module</p>
<p>| Component | Endpoint | Behavior |<br />
|-----------|----------|----------|<br />
| Profiler | `GET /profiler/hotspots` | Top CPU consumers by function |<br />
| Autoscaler | `GET /autoscaler/status` | Replicas, utilization, 6-step forecast |<br />
| Optimizer | `GET /optimizer/recommendations` | Index suggestions with impact score |<br />
| Capacity | `GET /capacity/runway` | Days until CPU/memory exhaustion |<br />
| Load gen | `POST /load/generate` | Patterns: `steady`, `spike`, `ramp` |</p>
<p data-ai-summary="true">The autoscaler uses exponential smoothing (α = 0.2) plus a sinusoidal seasonality factor — a simplified version of what HPA+VPA controllers and KEDA scalers approximate with real metrics pipelines.</p>
<p>```text<br />
if predicted_utilization > 0.80:  scale up (proactive)<br />
if sustained_utilization < 0.40:  scale down (conservative)
```

**Insight:** Scale-down waits for sustained low load because premature teardown causes **flapping** — replicas added and removed every few minutes, wasting provisioning <span data-ai-definition="API">API</span> calls and cold-start latency.
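
The smoothing-plus-threshold rule can be sketched as follows. The values α = 0.2, 0.80, and 0.40 come from the lesson; the five-sample window that defines "sustained", the one-replica step size, the multiplicative form of the seasonal factor, and the method names on this `PredictiveAutoscaler` are assumptions for illustration and may not match `autoscaler.py`.

```python
import math
from collections import deque
from typing import Deque, Optional

ALPHA = 0.2            # smoothing factor quoted in the lesson
SCALE_UP_AT = 0.80
SCALE_DOWN_AT = 0.40
SUSTAIN_SAMPLES = 5    # assumption: consecutive low samples that count as "sustained"


class PredictiveAutoscaler:
    def __init__(self, replicas: int = 2, min_replicas: int = 1,
                 max_replicas: int = 20) -> None:
        self.replicas = replicas
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.smoothed: Optional[float] = None
        self.recent: Deque[float] = deque(maxlen=SUSTAIN_SAMPLES)

    def observe(self, utilization: float) -> None:
        # Exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_(t-1)
        if self.smoothed is None:
            self.smoothed = utilization
        else:
            self.smoothed = ALPHA * utilization + (1 - ALPHA) * self.smoothed
        self.recent.append(utilization)

    def predict(self, step: int = 0, amplitude: float = 0.0, period: int = 24) -> float:
        if self.smoothed is None:
            raise ValueError("call observe() with at least one sample first")
        # Assumed seasonality: a sinusoidal multiplier around the smoothed level.
        seasonal = 1.0 + amplitude * math.sin(2 * math.pi * step / period)
        return self.smoothed * seasonal

    def decide(self, step: int = 0, amplitude: float = 0.0) -> int:
        predicted = self.predict(step, amplitude)
        sustained_low = (len(self.recent) == SUSTAIN_SAMPLES
                         and max(self.recent) < SCALE_DOWN_AT)
        if predicted > SCALE_UP_AT:
            self.replicas = min(self.max_replicas, self.replicas + 1)
        elif sustained_low:
            self.replicas = max(self.min_replicas, self.replicas - 1)
            self.recent.clear()  # require a fresh low window before the next step down
        return self.replicas


scaler = PredictiveAutoscaler(replicas=2)
for sample in [0.9, 0.1, 0.1, 0.1, 0.1, 0.1]:
    scaler.observe(sample)
    print(sample, scaler.decide())
```

Clearing the window after each scale-down is one simple anti-flapping guard: a second removal cannot happen until another full run of low samples has been seen.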

### FinOps module

```text
POST /api/finops/cost/collect
GET  /api/finops/cost/total
GET  /api/finops/cost/breakdown      # compute 58% / storage 27% / network 15%
GET  /api/finops/optimize/recommendations
GET  /api/finops/optimize/waste
GET  /api/finops/alerts
```

Cost collection runs on a 120-second background loop and on demand. The anomaly detector may emit alerts when spend spikes above baseline; these surface in the FinOps tab and in the WebSocket payload.
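
A 3σ check of the kind the lesson mentions can be sketched in a few lines. The `cost_anomaly` function, its return shape, and the sample figures are hypothetical; the only detail taken from the lesson is the rule "alert when spend exceeds baseline by three standard deviations".

```python
import statistics
from typing import Dict, List, Optional


def cost_anomaly(history: List[float], latest: float,
                 sigmas: float = 3.0) -> Optional[Dict[str, float]]:
    """Return alert details if latest exceeds mean + sigmas * stdev, else None."""
    if len(history) < 2:
        return None  # not enough samples to estimate a spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    threshold = mean + sigmas * stdev
    if latest > threshold:
        return {"latest": latest, "baseline": mean, "threshold": threshold}
    return None


# Illustrative hourly cost samples in USD (made-up numbers).
baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
print(cost_anomaly(baseline, 103.0))   # within normal variation
print(cost_anomaly(baseline, 120.0))   # well above the 3-sigma threshold
```

Note the edge case: a perfectly flat history has a standard deviation of zero, so any increase at all would fire. Real detectors usually add a minimum absolute or percentage delta to avoid that.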

**Real-world parallel:** Kubernetes cost tools (Kubecost, OpenCost) attribute spend by namespace label; Atlas uses the same mental model with simulated namespace totals.

### Readiness module

Six validators run concurrently via `asyncio.gather`:

| Pillar | What it measures |
|--------|------------------|
| Reliability | Failover, redundancy checks |
| Security | Policy compliance |
| Observability | Metrics and tracing coverage |
| Cost | Budget and optimization posture |
| <span data-ai-definition="scalability">Scalability</span> | Shard balance, autoscaler health |
| Operability | Runbook and knowledge completeness |
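
Running the six validators concurrently might look like the sketch below. `validate_pillar` and `run_validation` are hypothetical names, the sleep stands in for real checks, and the 78 to 96 score range mirrors the stochastic range mentioned in the solution hints. Each score is drawn before the first `await` so that a seeded run is reproducible.

```python
import asyncio
import random
from typing import Dict, Optional

PILLARS = ["reliability", "security", "observability",
           "cost", "scalability", "operability"]


async def validate_pillar(name: str, rng: random.Random) -> float:
    # Draw the simulated score first so ordering does not depend on wake-up order.
    score = round(rng.uniform(78, 96), 1)
    await asyncio.sleep(0.01)  # stand-in for I/O-bound checks
    return score


async def run_validation(seed: Optional[int] = None) -> Dict[str, object]:
    rng = random.Random(seed)
    # gather() returns results in the order the awaitables were passed in.
    scores = await asyncio.gather(*(validate_pillar(p, rng) for p in PILLARS))
    return {
        "pillars": dict(zip(PILLARS, scores)),
        "overall": round(sum(scores) / len(scores), 1),
    }


report = asyncio.run(run_validation(seed=7))
print(report["overall"], report["pillars"])
```

With six validators that each wait on I/O, `asyncio.gather` brings total wall time close to the slowest single validator rather than the sum of all six.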

```text
POST /api/readiness/validate           → six scores + overall + recommendations
POST /api/readiness/integration-tests  → 10 E2E scenarios
GET  /api/readiness/runbooks
GET  /api/readiness/knowledge
```

Integration tests cover `auth_flow`, `traffic_routing`, `cache_consistency`, `autoscaler_response`, `cost_collection`, `readiness_gate`, `failover_recovery`, `shard_rebalance`, `metrics_pipeline`, and `runbook_generation` — proving cross-module wiring, not isolated unit behavior.

Runbooks include `latency_spike`, `region_failover`, `cost_spike`, `shard_hotspot`, `cache_stampede`, and `scale_event` — each with severity and step count. Knowledge materials support onboarding without hunting tribal wiki pages.

## Console Behavior

The React dashboard uses five tabs:

| Tab | Primary data source | Key actions |
|-----|---------------------|-------------|
| Overview | `/api/platform/summary` + WebSocket | Run Platform Demo |
| Traffic | `/api/traffic/metrics` | Start load, fail/recover region |
| <span data-ai-definition="performance">Performance</span> | autoscaler, runway, optimizer | View forecast confidence |
| FinOps | cost total, breakdown, recommendations | View active alerts |
| Readiness | validation, runbooks, knowledge | Run validation, integration tests |

The WebSocket payload bundles traffic, <span data-ai-definition="performance">performance</span>, finops, and readiness in one JSON frame. Production consoles use the same pattern to avoid four polling intervals fighting each other.

## Context in Distributed Systems

Atlas is a **teaching compression** of patterns seen at:

- **Global edge networks** (Cloudflare, Fastly) — geographic routing and cache tiers
- **Hyperscale data planes** (DynamoDB, Cassandra) — consistent hashing and shard rebalancing
- **E-commerce flash events** (Shopify, Amazon Prime Day) — predictive scaling ahead of demand
- **Cloud FinOps maturity** (AWS Cost Explorer, Finout) — namespace attribution and anomaly detection
- **Production readiness reviews** (Google PRR, Meta readiness templates) — multi-pillar scorecards before launch

None of these companies run exactly this stack. They all run **the same loops** with different implementations. Atlas makes the loops visible in one afternoon.

## Assignment

1. Add a fifth region (`ap-northeast`) with Tokyo coordinates; verify routing from headers `X-User-Lat: 35.6`, `X-User-Lon: 139.7`.
2. Implement a cache-stampede runbook scenario and surface it in the Readiness tab.
3. Tune autoscaler `scale_up_threshold` to 0.75; document replica changes under a 2× load spike via `/api/performance/load/generate`.
4. Add a FinOps chargeback tag `team=checkout` on the `production` namespace cost entry.
5. Achieve overall readiness score ≥ 85% and integration test pass rate 100% in one session.

## Solution Hints

1. Extend `HyperscaleEngine.regions` dict; `route_request` picks minimum distance automatically — no router change needed.
2. Add `"cache_stampede"` to `RunbookGenerator.SCENARIOS`; call `POST /api/readiness/runbooks/generate?scenario=cache_stampede`.
3. Edit `PredictiveAutoscaler.scale_up_threshold` in `autoscaler.py`; POST `{"pattern":"spike"}` to the load endpoint; watch `/api/performance/autoscaler/status`.
4. Enrich `CostCollector.namespace_costs` values with a `tags` dict in the <span data-ai-definition="API">API</span> response layer of `finops.py`.
5. Validation scores are stochastic (78–96), so run validation again if the first pass falls short. For integration tests, either retry until they pass or seed `random` in the tests for determinism during development.

## Takeaway

Production hyperscale is not one big server. It is **four cooperating loops**: route traffic cheaply, scale before saturation, spend consciously, and gate launches with evidence. Atlas gives you a working miniature of those loops. The skill you carry into your first platform team is not memorizing endpoints; it is recognizing which loop is broken when the dashboard turns red, and knowing which module owns the fix.

</p>
</div>]]></content:encoded>
                                </item>
                <item>
            <title> - Hands-On Tutorial</title>
            <link></link>
            <comments>#respond</comments>
            <pubDate></pubDate>
            <dc:creator><![CDATA[Anjali]]></dc:creator>
                        <guid isPermaLink="false"></guid>
            <description><![CDATA[## Nexus AI Operations Platform — Engineering Lesson **Domain:** DevOps · SecOps · SRE · Cloud Engineering · Platform Engineering · Release Engineering · Automation Engineering · Infrastructure Engineering ##... Hands-On System Design tutorial with practical examples and real-world applications.]]></description>
            <content:encoded><![CDATA[<div class="lesson-rss-content"><h3>Hands-On System Design Tutorial</h3><p data-ai-summary="true">## Nexus AI Operations Platform — Engineering Lesson</p>
<p data-ai-summary="true">**Domain:** DevOps · SecOps · SRE · Cloud Engineering · Platform Engineering · Release Engineering · Automation Engineering · Infrastructure Engineering</p>
<p data-ai-summary="true">## What We Build</p>
<p data-ai-summary="true">You will design and operate **Nexus AI Operations Platform** — a single control plane that treats specialized accelerators, edge inference nodes, ML-driven infrastructure decisions, secure delivery tooling, responsible-AI governance, and autonomous remediation as one coherent system rather than a collection of disconnected tools.</p>
<p data-ai-summary="true">The platform is built **from scratch as one product**: every module shares the same <span data-ai-definition="API">API</span> process, the same console, and the same operational vocabulary. There are no separate mini-projects stitched together at the UI — compute, edge, intelligence, governance, and automation are first-class domains inside one repository.</p>
<p data-ai-summary="true">**Agenda**</p>
<p>- **Compute orchestration**: GPU multi-instance sharing, placement policies, and spend visibility<br />
- **Accelerator scheduling**: TPU pod queues with preemptible cost optimization<br />
- **Edge fleet management**: device registration, model rollout, heartbeat health, cloud sync<br />
- **Infrastructure intelligence**: forecasting, anomaly detection, predictive scaling, incident response<br />
- **Delivery intelligence**: code security analysis, log anomaly detection, alert correlation, documentation generation<br />
- **Model governance**: bias analysis, fairness monitoring, explainability, approval workflows<br />
- **Autonomous operations**: DAG workflows, self-healing, controlled chaos, AI-assisted scheduling<br />
- **Unified console**: real-time dashboard with cross-domain **Run Demo** simulation</p>
<p data-ai-summary="true">**Platform objective:** Deliver the integration pattern production ML platform teams use before opening a shared GPU pool to multiple product lines — one <span data-ai-definition="API">API</span> boundary, explicit promotion gates, observable state, and automation that closes the loop.</p>
<p data-ai-summary="true">## Why a Unified Control Plane Matters</p>
<p data-ai-summary="true">In most organizations, GPU scheduling lives in one team’s scripts, edge rollout in another’s Ansible playbooks, bias review in a spreadsheet, and incident correlation in a third-party SaaS. Each layer works in isolation until something breaks at a boundary — a model promoted without review, inference scaled without reading logs, or a TPU job submitted onto full-GPU capacity because nobody exposed MIG slices.</p>
<p data-ai-summary="true">Nexus models the **opposite architecture**: one FastAPI application registers seven domain routers; one React console polls them; one demo orchestrator proves they cooperate. That shape mirrors how mature internal platforms (not public cloud consoles) are actually operated:</p>
<p>| Problem in siloed stacks | Nexus integration answer |<br />
|--------------------------|--------------------------|<br />
| Training team does not know MIG exists | Scheduler returns `strategy: mig` with slice metadata |<br />
| FinOps sees spend only at month-end | Cost module tracks per resource type at schedule time |<br />
| Edge devices flap without central visibility | Heartbeat TTL + fleet summary in platform summary |<br />
| Alerts flood on-call | DBSCAN clusters alerts into one incident |<br />
| Model ships without ethics review | Governance gate before approval records exist |<br />
| Recovery untested until 3 a.m. | Chaos inject + self-healing log in automation panel |</p>
<p data-ai-summary="true">You are not learning seven unrelated APIs — you are learning **how platform engineers wire them together**.</p>
<p data-ai-summary="true">## Platform Placement in the Overall System</p>
<p data-ai-summary="true">Modern AI infrastructure at scale (recommendation training at Meta, TPU fleets at Google, factory-floor inference at industrial operators) converges on three planes that must cooperate:</p>
<p>| Plane | Responsibility | Nexus role |<br />
|-------|----------------|------------|<br />
| **Compute** | GPUs, TPUs, edge accelerators | Schedule, partition, place, and cost-track workloads |<br />
| **Operations** | Metrics, incidents, scaling, CI signals | Predict, detect, correlate, and remediate |<br />
| **Governance** | Model risk, bias, approvals | Gate promotion before compute consumes the job |</p>
<p data-ai-summary="true">Nexus is the **coordination layer** between product engineers and raw infrastructure. It does not replace Kubernetes, cloud consoles, or observability backends. It exposes a stable internal <span data-ai-definition="API">API</span> so an inference team can submit a job without knowing whether it lands on a MIG slice, a preemptible TPU pod, or an edge CPU node — while SRE and platform teams retain policy control behind that abstraction.</p>
<p data-ai-summary="true">## Core Concepts by Engineering Discipline</p>
<p data-ai-summary="true">### Infrastructure &#038; Cloud Engineering</p>
<p data-ai-summary="true">Accelerator capacity is a **scarce, billed state machine**. NVIDIA MIG (Multi-Instance GPU) partitions one physical GPU into isolated instances with fixed memory profiles. Common profiles on A100-class hardware include:</p>
<p>| Profile | Approx. memory | Typical use |<br />
|---------|----------------|-------------|<br />
| `1g.5gb` | 5 GB | Small inference, embedding models |<br />
| `2g.10gb` | 10 GB | Medium inference batches |<br />
| `3g.20gb` | 20 GB | Larger single-model serving |<br />
| `7g.40gb` | 40 GB | Near full-GPU workloads |</p>
<p data-ai-summary="true">MIG is not “fractional sharing” in the sense of best-effort CPU quotas — each instance has **hardware-enforced isolation**. That is why platform teams prefer MIG for multi-tenant inference: one bad neighbor cannot exhaust another tenant’s memory budget.</p>
<p data-ai-summary="true">TPU pods are similarly finite; preemptible instances trade restart risk for sharply lower hourly rates. A training job that checkpoints every N steps is a natural preemptible candidate; a latency-sensitive inference path is not.</p>
<p data-ai-summary="true">The Nexus scheduler encodes placement economics:</p>
<p>```text<br />
if memory_gb < 10:        strategy = mig
elif checkpoint_enabled:  strategy = spot
else:                     strategy = full_gpu
```
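
The three-branch policy above can be written as a small pure function. This is a minimal sketch, not the lesson's actual scheduler: the `JobRequest` class and the `choose_strategy` name are hypothetical, and only the field names and the 10 GB threshold come from the text.

```python
from dataclasses import dataclass

MIG_MEMORY_LIMIT_GB = 10  # threshold taken from the policy above


@dataclass
class JobRequest:
    """Hypothetical request shape using the schedule payload fields."""
    name: str
    memory_gb: float
    checkpoint_enabled: bool = False


def choose_strategy(job: JobRequest) -> str:
    """Return 'mig', 'spot', or 'full_gpu' for a job."""
    if job.memory_gb < MIG_MEMORY_LIMIT_GB:
        return "mig"       # small job: fits in an isolated MIG slice
    if job.checkpoint_enabled:
        return "spot"      # restartable job: accept preemption for a lower rate
    return "full_gpu"      # large, non-restartable job: dedicated GPU


if __name__ == "__main__":
    for job in (JobRequest("embedder", 4),
                JobRequest("trainer", 32, checkpoint_enabled=True),
                JobRequest("llm-serving", 32)):
        print(job.name, choose_strategy(job))
```

Keeping the policy free of I/O makes it easy to unit-test each branch before wiring it to real inventory.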

Cloud engineers care because misplacement appears on the invoice weeks before it appears in latency dashboards. Cost tracking per resource type (`on_demand`, `spot`, `mig_slice`) makes FinOps visible at scheduling time, not only at month-end reconciliation.

**Production parallel:** Cloud providers expose MIG via drivers and device plugins; Nexus simulates inventory and allocation so you practice the **policy layer** without bare-metal GPUs.

### Platform &#038; Release Engineering

Seven domain routers, plus one platform router for the summary and demo endpoints, share one FastAPI application:

| Router prefix | Domain |
|---------------|--------|
| `/api/compute/gpu` | MIG, scheduling, cost |
| `/api/compute/tpu` | Job queue, pod matching |
| `/api/edge` | Fleet, devices, deploy, sync |
| `/api/infra` | Forecast, anomaly, scale |
| `/api/devops` | Code, logs, incidents, docs |
| `/api/governance` | Bias, fairness, explain, reviews |
| `/api/automation` | Workflows, heal, chaos, schedule |
| `/api/platform` | Summary + cross-domain demo |

Each exposes a versioned REST contract. Release engineers benefit because promotion logic — bias check passed, governance approved, chaos recovery verified — can attach to **one platform surface** regardless of deployment target.

On application startup (`lifespan` in `main.py`), Nexus:

1. Initializes SQLite tables for governance and bias records
2. Starts background subsystems (self-healing, chaos framework, AI scheduler)
3. Seeds a demo edge fleet
4. Submits a bootstrap workflow DAG (“Validate GPU Inventory” → “Sync Edge Fleet”)

That bootstrap sequence mirrors real controllers that reconcile desired state after the <span data-ai-definition="API">API</span> process comes up.
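
The startup order can be sketched as an async context manager, which is the shape FastAPI accepts through `FastAPI(lifespan=lifespan)`. The sketch below is dependency-free and only records step names. The step labels and the `startup_log` list are illustrative assumptions, not the contents of the lesson's `main.py`.

```python
import asyncio
import contextlib

startup_log: list[str] = []  # records step order for illustration only


@contextlib.asynccontextmanager
async def lifespan(app):
    """Run the four startup steps, serve, then clean up on shutdown."""
    startup_log.append("init_sqlite_tables")             # step 1
    background = asyncio.create_task(asyncio.sleep(3600))  # stand-in for a healing loop
    startup_log.append("start_background_subsystems")    # step 2
    startup_log.append("seed_edge_fleet")                # step 3
    startup_log.append("submit_bootstrap_dag")           # step 4
    try:
        yield  # the application handles requests while suspended here
    finally:
        background.cancel()  # stop background work on shutdown
        with contextlib.suppress(asyncio.CancelledError):
            await background
        startup_log.append("shutdown")


async def main() -> list[str]:
    async with lifespan(app=None):  # None stands in for the FastAPI instance
        pass
    return startup_log


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Putting cleanup in the `finally` branch means background tasks are cancelled even when startup or request handling raises.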

### DevOps &#038; Automation Engineering

Security analysis runs against abstract syntax trees before merge, catching patterns like `eval()`, hardcoded credentials, and unsafe deserialization. Log streams feed an isolation-forest detector; correlated alerts collapse into a single incident via density clustering (DBSCAN) rather than five separate pages. Workflow engines execute DAGs asynchronously; chaos experiments validate that recovery paths actually work.

Automation design principle: **idempotent remediation** — restart, scale, reschedule — not runbooks that assume a human is awake.

The automation router also exposes a WebSocket for live workflow status — useful when operators want push updates instead of polling.
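
Asynchronous DAG execution, as mentioned for the workflow engine, can be illustrated with the standard library alone. This is a sketch under assumptions: `run_dag` and the two task functions are hypothetical names, and the two-step graph mirrors the bootstrap workflow described earlier rather than the lesson's actual engine.

```python
import asyncio
from graphlib import TopologicalSorter


async def run_dag(tasks: dict, deps: dict) -> list:
    """Run tasks in dependency order; tasks that are ready together run concurrently.

    tasks: name -> zero-argument coroutine function
    deps:  name -> set of task names that must finish first
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    sorter.prepare()                  # raises graphlib.CycleError on a cycle
    finished = []
    while sorter.is_active():
        ready = sorter.get_ready()    # every task whose dependencies are done
        await asyncio.gather(*(tasks[name]() for name in ready))
        for name in ready:
            finished.append(name)
            sorter.done(name)         # unblocks dependent tasks
    return finished


async def validate_gpu_inventory() -> None:
    await asyncio.sleep(0)            # placeholder for real work


async def sync_edge_fleet() -> None:
    await asyncio.sleep(0)            # placeholder for real work


if __name__ == "__main__":
    order = asyncio.run(run_dag(
        {"sync_edge_fleet": sync_edge_fleet,
         "validate_gpu_inventory": validate_gpu_inventory},
        {"sync_edge_fleet": {"validate_gpu_inventory"}},
    ))
    print(order)
```

This version executes the graph wave by wave. A production engine would usually start each task as soon as its own dependencies finish.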

### SecOps

Correlation before paging is the operational difference between a healthy on-call rotation and burnout. Nexus groups alerts that share temporal and source proximity into one incident object with an inferred root cause string. Secret-pattern scanning in the delivery path is cheaper than credential rotation after exfiltration.

When you `POST` code containing `password="..."` to the analyzer, the score drops and severity-tagged issues return — that is the same class of signal CI secret scanners emit, implemented locally for learning.
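
A scanner of this kind can be approximated with Python's `ast` module. The sketch below is an assumption-laden illustration, not the lesson's analyzer: the `analyze` function, the name list, and the scoring weights are all made up for the example.

```python
import ast

# Illustrative name list; a real scanner would use a broader rule set.
SECRET_NAMES = {"password", "passwd", "secret", "api_key", "token"}
PENALTY = {"high": 25, "critical": 40}  # arbitrary weights for this sketch


def _is_str_literal(node: ast.AST) -> bool:
    return isinstance(node, ast.Constant) and isinstance(node.value, str)


def analyze(source: str) -> dict:
    """Return severity-tagged issues and a 0-100 score for a Python snippet."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        # Direct calls such as eval("...") or exec("...")
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            issues.append({"line": node.lineno, "severity": "high",
                           "message": f"use of {node.func.id}()"})
        # Assignments such as password = "literal"
        elif isinstance(node, ast.Assign) and _is_str_literal(node.value):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id.lower() in SECRET_NAMES:
                    issues.append({"line": node.lineno, "severity": "critical",
                                   "message": f"hardcoded credential in '{target.id}'"})
        # Keyword arguments such as connect(password="literal")
        elif (isinstance(node, ast.keyword) and node.arg
              and node.arg.lower() in SECRET_NAMES and _is_str_literal(node.value)):
            issues.append({"line": node.value.lineno, "severity": "critical",
                           "message": f"hardcoded credential in '{node.arg}'"})
    score = max(0, 100 - sum(PENALTY[i["severity"]] for i in issues))
    return {"score": score, "issues": issues}
```

Working on the syntax tree rather than on raw text avoids false positives from comments and string contents that merely mention `eval`.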

### SRE

Time-series forecasting on CPU, memory, and request-rate history enables scale-up **before** saturation. Anomaly detection flags metric spikes that precede SLO breach. Prometheus-compatible GPU metrics at `/metrics` allow the same observability toolchain to scrape accelerator posture alongside application SLIs.

Error-budget thinking applies to GPU pools: if inference latency degrades while average utilization looks healthy, the scheduler policy — not traffic — is the likely culprit.

**Golden signals in Nexus context:**

| Signal | Where observed | Action triggered |
|--------|----------------|------------------|
| Latency | Log ingest + incident | Correlated alert |
| Traffic | Metric history | Forecaster train/predict |
| Errors | Log level + anomaly score | Incident create |
| Saturation | CPU/memory % | Scaling agent evaluate |

### Release Engineering

Governance is a **deployment gate**, not a spreadsheet. Bias analysis computes demographic parity and equalized-odds spreads; failed models do not receive approval records. Fairness monitoring logs per-group outcome rates over time. Explainability returns feature-attribution weights for audit trails.

The fair demo model `loan-model-v1` is designed to pass parity checks; other model IDs in the simulator may fail — practice reading metric spreads, not just boolean pass/fail.
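
To make "reading metric spreads" concrete, here is a minimal sketch of the two metrics. It assumes binary labels, defines demographic parity spread as the gap in positive-prediction rate between groups, and defines equalized-odds spread as the larger of the gaps in true-positive and false-positive rate. The function names and the 0.1 threshold are assumptions, not values from the lesson.

```python
from collections import defaultdict


def _rate(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0


def fairness_spreads(records):
    """records: iterable of (group, y_true, y_pred) with 0/1 labels.

    Returns (demographic_parity_spread, equalized_odds_spread).
    """
    counts = defaultdict(lambda: {"n": 0, "pred_pos": 0,
                                  "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += y_pred
        if y_true == 1:
            c["pos"] += 1
            c["tp"] += y_pred
        else:
            c["neg"] += 1
            c["fp"] += y_pred
    selection = [_rate(c["pred_pos"], c["n"]) for c in counts.values()]
    tpr = [_rate(c["tp"], c["pos"]) for c in counts.values()]
    fpr = [_rate(c["fp"], c["neg"]) for c in counts.values()]
    parity = max(selection) - min(selection)
    odds = max(max(tpr) - min(tpr), max(fpr) - min(fpr))
    return parity, odds


def passes(records, threshold: float = 0.1) -> bool:
    """Gate decision; the threshold is an assumed value."""
    parity, odds = fairness_spreads(records)
    return parity <= threshold and odds <= threshold
```

Returning the spreads alongside the boolean lets a reviewer see how close a model sits to the threshold instead of trusting a bare pass or fail.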


## Capability Deep Dive

### GPU compute module

The MIG controller maintains simulated A100 inventory (four GPUs in demo), enables partitioning per GPU, configures profile/instance counts, and allocates slices to workloads.

**Typical <span data-ai-definition="API">API</span> sequence:**

```text
GET  /api/compute/gpu/inventory
POST /api/compute/gpu/{id}/mig/enable
POST /api/compute/gpu/{id}/mig/configure   # body: profiles + instance counts
POST /api/compute/gpu/schedule             # body: name, memory_gb, checkpoint_enabled
GET  /api/compute/gpu/cost/analytics
GET  /metrics                              # Prometheus text with gpu_utilization
```

The scheduler creates jobs with strategy metadata; the cost module tracks hourly spend and emits migration recommendations when on-demand hours accumulate (e.g., suggest spot or MIG partitioning).

**Workflow:** enable MIG → configure profiles → schedule job → allocate slice → record cost → expose Prometheus gauges.

**Design note:** Scheduling is synchronous in the demo — production systems enqueue to a controller that reconciles against real node capacity. The **policy function** (MIG vs spot vs full) is what you carry forward.
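
For the last workflow step, the Prometheus text exposition format is simple enough to render by hand. The sketch below does that without dependencies, whereas the lesson's stack lists `prometheus_client` for this job. The `gpu` label name and the help text are assumptions; only the metric name `gpu_utilization` comes from the text.

```python
def render_gpu_metrics(utilization_by_gpu: dict) -> str:
    """Render one gauge family in the Prometheus text exposition format."""
    lines = [
        "# HELP gpu_utilization GPU utilization percentage (simulated)",
        "# TYPE gpu_utilization gauge",
    ]
    for gpu_id, value in sorted(utilization_by_gpu.items()):
        # One sample per GPU, distinguished by a label
        lines.append(f'gpu_utilization{{gpu="{gpu_id}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_gpu_metrics({"gpu-0": 42.5, "gpu-1": 7.0}), end="")
```

Using a label per GPU rather than a metric per GPU keeps the metric name stable as inventory grows.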

### TPU compute module

A resource manager holds `v4-8` and `v4-32` pods with preemptible flags. Jobs enter a queue; the orchestrator matches type and preemptible preference to available pods. Cost estimation aggregates running jobs into hourly and daily projections with preemptible savings.

**Submit payload concepts:**

| Field | Role |
|-------|------|
| `name` | Human-readable job identifier |
| `tpu_type` | Pod shape (`v4-8`, `v4-32`) |
| `prefer_preemptible` | Cost policy — prefer cheaper interruptible capacity |

**Insight:** Preemptible preference is a **cost policy knob**, not an afterthought — it belongs in the submit payload, not in a post-hoc FinOps report.
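
One way the queue-to-pod matching could work is sketched below. The matching rules are assumptions made for illustration: a job that prefers preemptible capacity takes a preemptible pod when one is free and falls back to on-demand otherwise, while other jobs only take on-demand pods. The `Pod`, `Job`, and `dispatch` names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pod:
    name: str
    tpu_type: str
    preemptible: bool
    busy: bool = False


@dataclass
class Job:
    name: str
    tpu_type: str
    prefer_preemptible: bool = False
    pod: Optional[str] = None


def dispatch(queue: deque, pods: list) -> list:
    """Assign queued jobs to free pods of the requested type, in FIFO order."""
    placed = []
    for _ in range(len(queue)):
        job = queue.popleft()
        free = [p for p in pods if not p.busy and p.tpu_type == job.tpu_type]
        if job.prefer_preemptible:
            free.sort(key=lambda p: not p.preemptible)  # preemptible pods first
        else:
            free = [p for p in free if not p.preemptible]
        if free:
            free[0].busy = True
            job.pod = free[0].name
            placed.append(job)
        else:
            queue.append(job)  # nothing suitable: keep the job queued
    return placed
```

Because the preference travels with the job, the cost decision is made at dispatch time rather than reconstructed later from billing data.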

Cluster metrics (`/api/compute/tpu/metrics/cluster`) expose utilization % and pod counts, the same numbers the overview card displays after **Run Demo**.

### Edge fleet module

Devices register with name, location, and capability tags (`gpu`, `inference`, `cpu`). Heartbeats refresh online state and CPU/memory telemetry. Model deployment uses capability selectors so inference models reach appropriate nodes. Sync metrics report bytes transferred during fleet reconciliation.

**Device lifecycle:**

```text
register → online ←── periodic heartbeat ──→ offline (TTL expired)
                └── deploy model (selector match)
                └── sync_metrics (bytes reconciled)
```

**Distributed systems note:** Heartbeat TTL marks devices offline without blocking the control plane, which gives partition tolerance by design. The fleet summary (`online` / `total`) appears in `/api/platform/summary` for the overview panel.
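
A TTL-based registry needs no background thread if online status is derived from heartbeat age at read time. The sketch below assumes a 30-second TTL, which the lesson does not specify, and the `FleetRegistry` class is a hypothetical stand-in for the edge module.

```python
import time


class FleetRegistry:
    """In-memory registry where 'online' is computed from heartbeat age."""

    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self.ttl = ttl_seconds   # assumed default
        self.clock = clock       # injectable so tests can control time
        self.last_seen = {}      # device_id -> timestamp of last heartbeat

    def register(self, device_id: str) -> None:
        self.last_seen[device_id] = self.clock()

    def heartbeat(self, device_id: str) -> None:
        self.last_seen[device_id] = self.clock()  # refresh the timestamp

    def is_online(self, device_id: str) -> bool:
        return self.clock() - self.last_seen[device_id] <= self.ttl

    def summary(self) -> dict:
        online = sum(self.is_online(d) for d in self.last_seen)
        return {"online": online, "total": len(self.last_seen)}
```

Since nothing waits on a device, an unreachable edge node only ages out of the `online` count and never stalls a request.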

### Infrastructure intelligence module

Metrics history feeds a RandomForest forecaster for short-horizon prediction. IsolationForest marks statistical outliers. A scaling agent compares predictions to thresholds and emits scale-up/down decisions per logical deployment (`api-service`, `worker-pool`, `inference-gateway`). An incident response agent maps recent anomalies to remediation action lists.

**Data flow:**

```text
record(metric, value) → train(forecaster) → predict(next N steps)
                       → detect(anomaly) → create incident → suggest remediation
```

After **Run Demo**, random metric spikes populate history, trigger training, and feed `scaling_agent.evaluate()` — you should see CPU utilization move on the overview card.
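
The predict-then-decide step can be sketched without scikit-learn. Here a naive moving average stands in for the RandomForest forecaster, and the 80% and 30% thresholds are assumed values. The `forecast_next` and `evaluate` functions are illustrative, not the lesson's `scaling_agent`.

```python
from statistics import mean


def forecast_next(history: list, window: int = 3) -> float:
    """Naive moving-average forecast; a stand-in for the trained forecaster."""
    return mean(history[-window:])


def evaluate(deployment: str, cpu_history: list,
             scale_up_at: float = 80.0, scale_down_at: float = 30.0) -> dict:
    """Compare the predicted CPU % with thresholds and emit a decision."""
    predicted = forecast_next(cpu_history)
    if predicted >= scale_up_at:
        action = "scale_up"
    elif predicted <= scale_down_at:
        action = "scale_down"
    else:
        action = "hold"
    return {"deployment": deployment,
            "predicted_cpu": round(predicted, 1),
            "action": action}


if __name__ == "__main__":
    print(evaluate("worker-pool", [40, 85, 90, 95]))
```

Deciding on the predicted value rather than the latest sample is what lets the agent act before saturation.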

### DevOps intelligence module

| Capability | Endpoint pattern | Output |
|------------|------------------|--------|
| Code security | `POST /api/devops/code/analyze` | Issues + score |
| Log anomaly | `POST /api/devops/logs/ingest` | Ingested + anomaly flag |
| Alert correlation | `POST /api/devops/incidents/alerts` | Clustered incident |
| Doc generation | `POST /api/devops/docs/generate` | Markdown <span data-ai-definition="API">API</span> summary |

The demo runner ingests synthetic error logs and creates alerts so the DevOps panel shows non-empty state after one click.
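
Alert correlation can be illustrated with a simplified, dependency-free grouping. This sketch is not DBSCAN: it chains alerts whose timestamps fall within a fixed gap, which is a one-dimensional simplification of density clustering. The alert fields, the 60-second gap, and the root-cause heuristic are all assumptions.

```python
from collections import Counter


def _to_incident(group: list) -> dict:
    # Guess the root cause as the most frequent source in the group
    source, _ = Counter(a["source"] for a in group).most_common(1)[0]
    return {"alert_count": len(group),
            "root_cause": f"likely origin: {source}",
            "started_at": group[0]["ts"]}


def correlate(alerts: list, max_gap_s: float = 60.0) -> list:
    """Group alerts whose timestamps chain together within max_gap_s seconds.

    alerts: dicts with 'ts' (epoch seconds) and 'source'.
    Returns one incident dict per group.
    """
    incidents, current = [], []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if current and alert["ts"] - current[-1]["ts"] > max_gap_s:
            incidents.append(_to_incident(current))  # gap too large: close the group
            current = []
        current.append(alert)
    if current:
        incidents.append(_to_incident(current))
    return incidents
```

The on-call engineer then receives one page per incident instead of one per alert.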

### Governance module

The bias detector computes parity and odds metrics against synthetic cohorts; results persist to SQLite (`data/nexus.db`). The fairness monitor aggregates per-group outcome rates. The explainability service returns normalized feature-importance maps. The governance service manages the submit → pending → approved/rejected lifecycle.

**Review workflow:**

```text
POST /api/governance/reviews/submit   → status: pending_review
POST /api/governance/reviews/{id}/approve
POST /api/governance/reviews/{id}/reject
GET  /api/governance/reviews/stats
```

Persistence here is intentional — unlike ephemeral GPU job state, audit records must survive process restarts.
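
The review lifecycle and its persistence can be sketched with the standard `sqlite3` module. The table name, column names, and function names below are assumptions and may differ from the schema in `data/nexus.db`.

```python
import sqlite3

# Only pending reviews may change state; approved and rejected are terminal.
ALLOWED = {"pending_review": {"approved", "rejected"}}


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS reviews ("
                 "id INTEGER PRIMARY KEY, model_id TEXT NOT NULL, status TEXT NOT NULL)")
    return conn


def submit(conn: sqlite3.Connection, model_id: str) -> int:
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO reviews (model_id, status) VALUES (?, ?)",
                           (model_id, "pending_review"))
    return cur.lastrowid


def decide(conn: sqlite3.Connection, review_id: int, new_status: str) -> None:
    row = conn.execute("SELECT status FROM reviews WHERE id = ?",
                       (review_id,)).fetchone()
    if row is None:
        raise KeyError(review_id)
    if new_status not in ALLOWED.get(row[0], set()):
        raise ValueError(f"illegal transition {row[0]} -> {new_status}")
    with conn:
        conn.execute("UPDATE reviews SET status = ? WHERE id = ?",
                     (new_status, review_id))


def stats(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT status, COUNT(*) FROM reviews GROUP BY status"))
```

Passing a file path instead of `:memory:` gives the restart-surviving behavior the audit trail needs.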

### Automation module

| Subsystem | Responsibility |
|-----------|----------------|
| Orchestration engine | Async DAG task execution |
| Self-healing controller | Resource registration + remediation log |
| Chaos framework | Typed failure injection + recovery flag |
| AI scheduler | Predicted optimal workflow windows |

Chaos types include `pod_failure`, `network_latency`, and `cpu_stress`. Injecting `pod_failure` returns `"recovered": true` when the healing path succeeds — that is your proof that automation is wired, not just listed in a diagram.
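
The inject-then-heal loop can be sketched in a few lines. This is a simplified illustration: `HealingController` and `inject` are hypothetical names, and only `pod_failure` changes resource state here, while the other two chaos types are accepted but treated as no-ops.

```python
class HealingController:
    """Tracks resource health and logs every remediation it performs."""

    def __init__(self):
        self.resources = {}  # name -> "healthy" or "failed"
        self.log = []

    def register(self, name: str) -> None:
        self.resources[name] = "healthy"

    def heal(self, name: str) -> bool:
        if self.resources.get(name) == "failed":
            self.resources[name] = "healthy"  # stand-in for a real restart
            self.log.append({"resource": name, "action": "restart"})
        # Idempotent: calling heal on a healthy resource changes nothing
        return self.resources.get(name) == "healthy"


def inject(controller: HealingController, chaos_type: str, target: str) -> dict:
    if chaos_type not in {"pod_failure", "network_latency", "cpu_stress"}:
        raise ValueError(f"unknown chaos type: {chaos_type}")
    if chaos_type == "pod_failure":
        controller.resources[target] = "failed"
    recovered = controller.heal(target)
    return {"type": chaos_type, "target": target, "recovered": recovered}
```

The remediation log is what distinguishes a recovery that actually ran from a resource that was never broken.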

### Platform demo orchestrator

`POST /api/platform/demo` runs a coordinated simulation across all modules. Internally, `demo_runner.py` sequences roughly:

1. **GPU** — enable/configure MIG on a random GPU; schedule 1–3 jobs with random memory and checkpoint flags; record cost usage
2. **TPU** — submit 1–2 training jobs with preemptible preference
3. **Edge** — heartbeat every registered device; run fleet sync
4. **Infra** — spike all metric series; train forecaster; evaluate scaling; analyze anomalies
5. **DevOps** — ingest error log; create correlated alert
6. **Governance** — run bias analysis; submit governance review
7. **Automation** — inject chaos; submit workflow with dependent tasks

The response is a **unified snapshot** (GPU active jobs, TPU utilization %, edge online count, infra CPU %) with the same shape as `GET /api/platform/summary`, which the console polls every few seconds.

**Why one button:** Production validation often requires a “synthetic canary” that touches every dependency. The demo endpoint is that canary for the integrated platform.

---

## Operations Console

The React dashboard (`frontend/src/App.jsx`) uses a light theme with color-coded stat cards — indigo for GPU, violet for TPU, cyan for edge, emerald for CPU/health.

**Navigation panels:**

| Panel | Primary data sources |
|-------|---------------------|
| Overview | `/api/platform/summary` |
| Compute | GPU inventory, TPU jobs, cost endpoints |
| Edge Fleet | devices, fleet status, sync |
| Infra AI | dashboard stats, model train, scaling |
| DevOps AI | code analyze, logs, incidents |
| Governance | bias, fairness, reviews |
| Automation | workflows, healing log, chaos |

**Run Demo** calls `POST /api/platform/demo`, shows a toast on success, and briefly flashes updated card values. Until the demo has run, the overview text reads *“Click Run Demo to simulate live workloads”*: the platform is healthy but idle.

Local development runs Vite on port 3000 and proxies <span data-ai-definition="API">API</span> requests to the backend on port 8000. In Docker, nginx serves the built `dist/` on port 3000 as well.

---

## Context in Distributed Systems

### Control plane vs data plane

| Layer | Nexus implementation | Production analogue |
|-------|---------------------|---------------------|
| Control | Routers, schedulers, governance gates | K8s controllers, internal platform APIs |
| Data | Simulated jobs, heartbeats, metric series | Real training jobs, edge agents, Prometheus |
| Observation | Dashboard, `/metrics`, demo runner | Grafana, PagerDuty, FinOps tooling |

### Why integration matters

ML platforms fail at **handoffs**: training finishes without bias review; inference autoscales without correlating log anomalies; edge models deploy without governance records. Nexus wires these handoffs into one codebase so you practice production integration, not isolated tutorials.

### Consistency and availability trade-offs

| State type | Storage | Rationale |
|------------|---------|-----------|
| GPU/TPU jobs, metrics, workflows | In-memory | Fast demo + test isolation |
| Bias results, governance reviews | SQLite | Audit persistence |
| Edge fleet | In-memory + seed on startup | Partition-tolerant heartbeats |

Edge fleet state tolerates stale heartbeats, so the scheduler remains available even when edge links flap. This is an **AP**-leaning design: availability over strict fleet consistency.

---

## Architecture

### Request path (example: schedule GPU job)

```text
Browser or curl
    → POST /api/compute/gpu/schedule
    → gpu router validates body
    → gpu_scheduler_service.schedule(...)
    → returns { id, strategy, status }
    → console polls inventory / summary
```

### Control flow

1. Operator interacts via console or <span data-ai-definition="API">API</span> client.
2. Router validates payload and delegates to a domain service.
3. Service mutates state and returns JSON.
4. Console polls summary and domain endpoints every 3–5 seconds.
5. **Run Demo** triggers `demo_runner`, which sequences cross-domain side effects in one request.

### State machines

**Compute job:**

```text
submitted → scheduled → running → completed
                         └──────→ failed → healing → rescheduled
```
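
The compute-job diagram translates into a transition table that rejects illegal moves. The sketch assumes that a rescheduled job re-enters `running`, which the diagram does not state, and the `advance` and `replay` helpers are hypothetical.

```python
TRANSITIONS = {
    "submitted": {"scheduled"},
    "scheduled": {"running"},
    "running": {"completed", "failed"},
    "failed": {"healing"},
    "healing": {"rescheduled"},
    "rescheduled": {"running"},  # assumption: the job runs again after rescheduling
    "completed": set(),          # terminal state
}


def advance(state: str, new_state: str) -> str:
    """Return new_state if the move is legal, otherwise raise ValueError."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state


def replay(path: list) -> str:
    """Walk a sequence of states and return the final one."""
    state = path[0]
    for nxt in path[1:]:
        state = advance(state, nxt)
    return state
```

Encoding the table as data keeps the legal moves in one place, so API handlers and background controllers cannot drift apart.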

**Governance:**

```text
not_submitted → pending_review → approved | rejected
```

**Edge device:**

```text
online ←── heartbeat ──→ offline (TTL expired)
```

---

## Technology Choices

| Layer | Choice | Why |
|-------|--------|-----|
| <span data-ai-definition="API">API</span> | FastAPI on Python 3.11 | Async lifespan, OpenAPI docs at `/docs`, Pydantic validation |
| ML utilities | scikit-learn | Forecaster + isolation forest without heavy DL deps |
| Persistence | SQLite | Zero-config audit store for governance |
| Metrics | prometheus_client | Industry-standard scrape format |
| UI | React + Vite | Fast dev loop, static production build |
| Containers | Compose | One command for full stack |

## Success Criteria

- Integration test suite passes from project root
- Console renders the light theme with color-coded metric cards
- **Run Demo** updates live values and shows confirmation toast
- Fair model bias check returns `passed: true`
- Chaos experiment reports `recovered: true`
- `/metrics` exposes GPU Prometheus text

## Assignment

Extend the integrated platform (not separate repos):

1. Add a **unified cost panel** combining GPU spend and TPU hourly estimates with a 7-day projection.
2. Auto-trigger bias analysis when a TPU job payload includes `model_id`; block queue if `passed` is false.
3. Fire a DevOps incident when two or more edge devices go offline within the heartbeat window.
4. Chaos-fail a workflow task and verify self-healing restarts the dependent resource.

## Solution Hints

1. New route `/api/platform/cost-summary` aggregating GPU analytics and TPU cost estimate; poll every 30s in React.
2. In TPU submit handler, call bias analyzer before allocation; return 403 if not passed.
3. In edge offline marking logic, count stale devices; if ≥ 2, create correlated DevOps incident.
4. Submit workflow with failing task; invoke healing controller on failure event from incident detector.
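
Hint 2 can be prototyped as a pure function before it is wired into the TPU submit handler. This is a sketch under assumptions: `submit_tpu_job` is a hypothetical name, `bias_check` is any callable returning a dict with a `passed` key, and the 202 status for accepted jobs is an assumed choice. Only the 403 comes from the hint.

```python
def submit_tpu_job(payload: dict, bias_check, queue: list) -> tuple:
    """Return (http_status, body) for a TPU submit request.

    bias_check: callable taking a model_id and returning {"passed": bool, ...}
    queue:      list that accepted payloads are appended to
    """
    model_id = payload.get("model_id")
    if model_id is not None:
        result = bias_check(model_id)
        if not result.get("passed", False):  # fail closed if the field is missing
            return 403, {"detail": "bias check failed", "model_id": model_id}
    queue.append(payload)
    return 202, {"status": "queued", "name": payload["name"]}


if __name__ == "__main__":
    jobs = []
    fair_only = lambda model_id: {"passed": model_id == "loan-model-v1"}
    print(submit_tpu_job({"name": "train-a", "model_id": "loan-model-v1"}, fair_only, jobs))
    print(submit_tpu_job({"name": "train-b", "model_id": "other"}, fair_only, jobs))
```

Running the check before the job reaches the queue means a failing model never consumes accelerator time.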

## Practical Takeaway

Specialized hardware without intelligent operations burns budget silently. Nexus demonstrates how platform teams unify accelerators, edge nodes, ML-driven ops, security signals, ethics gates, and autonomous remediation behind **one control plane** — the shape of system you ship before ML workloads share expensive silicon.

The skills transfer directly:

- **Scheduler policy** → real cluster autoscalers and quota systems
- **Governance gates** → model registry promotion workflows
- **Demo orchestrator** → synthetic monitoring and chaos game days
- **Unified summary <span data-ai-definition="API">API</span>** → single pane of glass for leadership and on-call

Build the integration muscle here; swap simulated backends for real cloud APIs when you deploy to production.

</p>
</div>]]></content:encoded>
                                </item>
                <item>
            <title> - Hands-On Tutorial</title>
            <link></link>
            <comments>#respond</comments>
            <pubDate></pubDate>
            <dc:creator><![CDATA[Anjali]]></dc:creator>
                        <guid isPermaLink="false"></guid>
            <description><![CDATA[## Lesson 7 — Unified MLOps Control Plane ## What We&#8217;re Building Today Today you build a **unified MLOps control plane** — a single platform that orchestrates the entire machine... Hands-On System Design tutorial with practical examples and real-world applications.]]></description>
            <content:encoded><![CDATA[<div class="lesson-rss-content"><h3>Hands-On System Design Tutorial</h3><p data-ai-summary="true">## Lesson 7 — Unified MLOps Control Plane</p>
<p data-ai-summary="true">## What We&#8217;re Building Today</p>
<p data-ai-summary="true">Today you build a **unified MLOps control plane** — a single platform that orchestrates the entire machine learning lifecycle: experiment tracking, data pipelines, distributed training, model serving, production monitoring, security governance, and cost optimization.</p>
<p data-ai-summary="true">This is not a collection of disconnected notebooks. It is an **integrated system** with a shared <span data-ai-definition="API">API</span>, centralized state, real-time observability, and a command-center dashboard — the kind of architecture platform teams deploy when ML stops being a research exercise and becomes a revenue-critical production capability.</p>
<p data-ai-summary="true">### Why It Matters</p>
<p data-ai-summary="true">Most organizations fail at MLOps not because they lack tools, but because their tools are **siloed**:</p>
<p>- Data engineers own pipelines in Airflow.<br />
- ML scientists own experiments in scattered notebooks.<br />
- Platform engineers own Kubernetes clusters nobody else understands.<br />
- FinOps owns a spreadsheet disconnected from GPU utilization.<br />
- Security owns a governance process that blocks releases at 4 PM on Fridays.</p>
<p data-ai-summary="true">A unified control plane collapses these silos behind **one <span data-ai-definition="API">API</span> surface**, **one state model**, and **one operational vocabulary**. Netflix&#8217;s Metaflow, Uber&#8217;s Michelangelo, and Spotify&#8217;s ML infrastructure all converge on this pattern: abstract complexity, expose capabilities, reconcile state continuously.</p>
<p data-ai-summary="true">### Practical Outcome</p>
<p data-ai-summary="true">By the end of this lesson you will have:</p>
<p>- A **FastAPI control plane** exposing seven domain modules through REST and WebSocket APIs.<br />
- **Real ML training** with MLflow experiment tracking and scikit-learn model registration.<br />
- A **live dashboard** with module navigation, metrics, and activity streaming.<br />
- **Docker-packaged deployment** with health checks, Prometheus metrics, and persistent runtime volumes.<br />
- **Integration tests** validating the full pipeline from ingest to inference to drift detection.</p>
<p data-ai-summary="true">### High-Level Agenda</p>
<p>- Understand the MLOps lifecycle as a distributed control plane problem<br />
- Design module boundaries: platform, pipeline, training, serving, monitoring, security, cost<br />
- Implement shared state reconciliation and event broadcasting<br />
- Wire real services: MLflow, sklearn, Prometheus, WebSocket feeds<br />
- Containerize <span data-ai-definition="API">API</span> and UI with Docker Compose<br />
- Validate end-to-end with integration tests and production checklists<br />
- Extend the platform with advanced patterns: operators, GitOps, multi-cluster serving</p>
<p data-ai-summary="true">## Core Concepts: The MLOps Lifecycle</p>
<p data-ai-summary="true">### What is MLOps?</p>
<p data-ai-summary="true">MLOps is the **operational discipline** of taking machine learning from experiment to production and keeping it there reliably. It applies the same engineering rigor that transformed software deployment — version control, CI/CD, observability, incident response — to the unique challenges of ML systems.</p>
<p data-ai-summary="true">ML systems are not ordinary applications. They have **three axes of change** that traditional DevOps was never designed for:</p>
<p>1. **Code** — training scripts, feature logic, serving handlers<br />
2. **Data** — distributions shift silently; yesterday&#8217;s training set is today&#8217;s liability<br />
3. **Models** — artifacts with versioned weights, hyperparameters, and <span data-ai-definition="performance">performance</span> envelopes</p>
<p data-ai-summary="true">When any axis changes independently, production behavior drifts. MLOps exists to make that drift **visible, measurable, and recoverable**.</p>
<p data-ai-summary="true">### Why This Matters</p>
<p data-ai-summary="true">A model that scores 94% accuracy in a notebook is worthless in production if:</p>
<p>- Nobody can reproduce the training run<br />
- The serving endpoint has no traffic splitting for safe rollout<br />
- Drift goes undetected for three weeks<br />
- GPU costs triple because nobody tracks spot utilization<br />
- A governance workflow was never approved before deployment</p>
<p data-ai-summary="true">Production MLOps is not about tools. It is about **contracts**: between teams, between systems, and between desired state and actual state.</p>
<p data-ai-summary="true">### Real Production Context</p>
<p data-ai-summary="true">At Uber, Michelangelo handles feature stores, training orchestration, and batch/online serving behind unified APIs. At Airbnb, ML platforms enforce that every model has lineage, monitoring, and rollback paths before receiving production traffic. At Spotify, squad-level autonomy is balanced against platform-wide standards for experiment tracking and model registry access.</p>
<p data-ai-summary="true">The pattern is consistent: **platform team builds the control plane; product teams consume capabilities through APIs and dashboards.**</p>
<p data-ai-summary="true">---</p>
<p data-ai-summary="true">## Core Concepts: Control Plane vs Data Plane</p>
<p data-ai-summary="true">### What is the Control Plane?</p>
<p data-ai-summary="true">In distributed systems terminology, the **control plane** makes decisions: what should run, where, and under what policy. The **data plane** executes work: moving bits, running inference, processing batches.</p>
<p data-ai-summary="true">In Kubernetes, the <span data-ai-definition="API">API</span> server and controllers are the control plane; kubelet and container runtime are the data plane. In our MLOps platform:</p>
<p>| Layer | Responsibility | Examples |<br />
|-------|----------------|----------|<br />
| Control plane | <span data-ai-definition="API">API</span> routing, state management, orchestration, governance | FastAPI, `mlops_state`, module builders |<br />
| Data plane | Training execution, inference, ETL, metric emission | MLflow runs, sklearn training, prediction routing |</p>
<p data-ai-summary="true">### Why This Matters</p>
<p data-ai-summary="true">Mixing control and data plane concerns creates systems that are impossible to scale or reason about. If your <span data-ai-definition="API">API</span> server also runs GPU training synchronously, a single long training job blocks health checks, metrics scraping, and governance approvals.</p>
<p data-ai-summary="true">Separating them allows:</p>
<p>- **Independent scaling**: scale inference replicas without scaling the control <span data-ai-definition="API">API</span><br />
- **Failure isolation**: a crashed training worker does not take down the registry<br />
- **Policy enforcement**: governance checks happen in the control plane before data plane work is scheduled</p>
<p data-ai-summary="true">### Real Production Context</p>
<p data-ai-summary="true">Netflix&#8217;s ML platform separates experiment orchestration (control) from actual Spark/Metaflow job execution (data). The control plane tracks desired state; workers report actual state. This is the same reconciliation pattern Kubernetes uses — and we implement a simplified version in `mlops_state.py`.</p>
<p data-ai-summary="true">---</p>
<p data-ai-summary="true">## Core Concepts: Experiment Tracking and Model Registry</p>
<p data-ai-summary="true">### What is Experiment Tracking?</p>
<p data-ai-summary="true">Experiment tracking records **every training attempt** as a structured run: hyperparameters, metrics, artifacts, lineage, and environment. MLflow is the de facto open-source standard, used by thousands of teams and integrated into Databricks, AWS SageMaker, and Azure ML.</p>
<p data-ai-summary="true">A training run without tracking is technical debt. Six months later, nobody knows which data version, which hyperparameters, or which random seed produced the model currently serving 40% of production traffic.</p>
<p data-ai-summary="true">### Why This Matters</p>
<p data-ai-summary="true">The model registry closes the loop between experimentation and deployment. A registered model has:</p>
<p>- Version history<br />
- Stage transitions (Staging → Production → Archived)<br />
- Lineage to the exact MLflow run that produced it</p>
<p data-ai-summary="true">This is the ML equivalent of **immutable container image tags** in CI/CD — you never deploy &#8220;latest&#8221;; you deploy a specific, auditable version.</p>
<p data-ai-summary="true">### Real Production Context</p>
<p data-ai-summary="true">Spotify teams use experiment tracking to enforce that production models must originate from approved runs with minimum metric thresholds. The registry becomes the **source of truth** for what is allowed to receive traffic — mirroring how container registries gate what images can deploy to production clusters.</p>
<p data-ai-summary="true">---</p>
<p data-ai-summary="true">## Core Concepts: Data Pipeline Integrity</p>
<p data-ai-summary="true">### What is Pipeline Integrity?</p>
<p data-ai-summary="true">ML data pipelines must guarantee **reproducibility** and **quality** before data touches a training job. This means:</p>
<p>- **Ingestion** with schema validation and record counts<br />
- **Validation** with statistical quality scores<br />
- **Versioning** so every training run references an immutable dataset snapshot</p>
<p data-ai-summary="true">Tools like DVC, Delta Lake, and feature stores exist because &#8220;we used the CSV from the shared drive&#8221; is not a lineage story that passes an audit.</p>
<p data-ai-summary="true">### Why This Matters</p>
<p data-ai-summary="true">Garbage in, garbage out scales linearly with cluster size. A distributed training job on poisoned data wastes GPU hours, produces a confidently wrong model, and creates a monitoring alert three weeks later when business metrics collapse.</p>
<p data-ai-summary="true">Pipeline health metrics — records ingested, validation score, version count — are **leading indicators**. Model accuracy is a **lagging indicator**. Production teams monitor both.</p>
<p data-ai-summary="true">### Real Production Context</p>
<p data-ai-summary="true">Airbnb&#8217;s data quality frameworks block downstream training when validation scores drop below thresholds — the same &#8220;fail closed&#8221; philosophy used in security gateways. Pipeline health status (`healthy`, `degraded`, `failed`) drives automated decisions about whether training jobs may proceed.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Core Concepts: Model Serving and Traffic Management</p>
<p data-ai-summary="true">### What is Canary Deployment for Models?</p>
<p data-ai-summary="true">Deploying a new model version to 100% of traffic on day one is reckless. **Canary releases** route a small percentage of traffic to the new version while comparing its latency, error rate, and business metrics against the incumbent.</p>
<p data-ai-summary="true">Our serving module implements **A/B traffic splitting** between model versions (v1/v2), tracking P99 latency and prediction counts per version.</p>
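<p data-ai-summary="true">A weighted router with per-version latency tracking might look roughly like the sketch below. The class name `TrafficSplitter`, its methods, and the nearest-rank percentile are assumptions for illustration; the actual `inference_router` may be structured differently.</p>

```python
import random
from collections import defaultdict


class TrafficSplitter:
    """Routes each request to a model version according to stored weights."""

    def __init__(self, weights, seed=None):
        if abs(sum(weights.values()) - 100) > 1e-9:
            raise ValueError("weights must sum to 100")
        self.weights = dict(weights)
        self.latencies_ms = defaultdict(list)
        self._rng = random.Random(seed)

    def choose_version(self):
        versions = list(self.weights)
        picks = self._rng.choices(versions, weights=[self.weights[v] for v in versions])
        return picks[0]

    def record(self, version, latency_ms):
        self.latencies_ms[version].append(latency_ms)

    def p99(self, version):
        """Nearest-rank 99th percentile of recorded latencies, or None."""
        samples = sorted(self.latencies_ms[version])
        if not samples:
            return None
        rank = (99 * len(samples) + 99) // 100  # integer ceil(0.99 * n)
        return samples[rank - 1]


splitter = TrafficSplitter({"v1": 90, "v2": 10}, seed=42)
latency_rng = random.Random(0)  # stand-in for real inference latency
for _ in range(1000):
    chosen = splitter.choose_version()
    splitter.record(chosen, latency_rng.uniform(5, 50))
print({v: len(xs) for v, xs in splitter.latencies_ms.items()})
```

<p data-ai-summary="true">Keeping latencies keyed by version is what makes the canary comparison possible: the same percentile is computed for both versions from traffic served over the same window.</p>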
<p data-ai-summary="true">### Why This Matters</p>
<p data-ai-summary="true">A model can have excellent offline metrics and terrible online <span data-ai-definition="performance">performance</span> due to training-serving skew, feature pipeline differences, or adversarial input patterns not present in training data. Gradual rollout limits blast radius.</p>
<p data-ai-summary="true">### Real Production Context</p>
<p data-ai-summary="true">Uber&#8217;s serving infrastructure supports shadow mode (new model receives traffic but responses are discarded), canary mode (partial traffic), and full promotion, a progression similar to the one that progressive-delivery tooling on Kubernetes applies to application rollouts. The control plane records traffic percentages as **desired state**; the inference router enforces **actual routing**.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Core Concepts: Drift, Fairness, and Observability</p>
<p data-ai-summary="true">### What is Model Drift?</p>
<p data-ai-summary="true">**Data drift** occurs when production input distributions diverge from training data. **Concept drift** occurs when the relationship between features and labels changes. Both degrade model <span data-ai-definition="performance">performance</span> silently — accuracy dashboards look fine until they don&#8217;t.</p>
<p data-ai-summary="true">Drift detection applies statistical tests such as the Kolmogorov-Smirnov (KS) test and the Population Stability Index (PSI) to logged predictions and features. Our monitoring module tracks drift events, 24-hour accuracy, fairness disparity, and explainability request volume.</p>
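<p data-ai-summary="true">To make the Population Stability Index concrete, here is a small pure-Python sketch. The equal-width binning, the `eps` floor for empty bins, and the function name `psi` are assumptions for the example; the platform's `drift_service` may use a different test entirely.</p>

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant baselines

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor each proportion so the logarithm below is always defined.
        return [max(c / len(sample), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [i / 100 for i in range(1000)]   # baseline feature values
shifted = [x + 3 for x in training]         # production values after a shift
print(round(psi(training, training), 4), round(psi(training, shifted), 4))
```

<p data-ai-summary="true">A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but thresholds should be tuned per feature rather than taken as fixed.</p>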
<p data-ai-summary="true">### Why This Matters</p>
<p data-ai-summary="true">A fraud model trained on pre-pandemic transaction patterns will fail when consumer behavior shifts. Without drift monitoring, the model keeps serving predictions with confidence scores that no longer mean anything.</p>
<p data-ai-summary="true">Fairness monitoring ensures model behavior does not systematically disadvantage protected groups — increasingly a **regulatory requirement**, not a nice-to-have.</p>
<p data-ai-summary="true">### Real Production Context</p>
<p data-ai-summary="true">Netflix monitors model <span data-ai-definition="performance">performance</span> across content categories and regions. Spotify tracks recommendation fairness across listener demographics. These are not academic exercises — they are production SLOs with paging policies attached.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Core Concepts: ML Security and Governance</p>
<p data-ai-summary="true">### What is ML Governance?</p>
<p data-ai-summary="true">ML governance is the **approval workflow** that must complete before a model reaches production: security validation, bias analysis, compliance scoring, and immutable audit trails.</p>
<p data-ai-summary="true">Our security module implements:</p>
<p>&#8211; **Input validation** — adversarial pattern detection on feature vectors<br />
&#8211; **Governance workflows** — submit → review → approve state machine<br />
&#8211; **Audit chain** — append-only event log with hash-linked integrity</p>
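<p data-ai-summary="true">The hash-linked audit chain is the easiest of the three to illustrate. The sketch below assumes SHA-256 over a JSON encoding of each event plus the previous hash; the class name `AuditChain` and the event fields are hypothetical, not taken from `security_gateway`.</p>

```python
import hashlib
import json


class AuditChain:
    """Append-only log where each entry commits to the previous entry's hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(prev_hash, event):
        body = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def append(self, event):
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        entry = {"prev": prev_hash, "event": event,
                 "hash": self._digest(prev_hash, event)}
        self._entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            expected = self._digest(prev_hash, entry["event"])
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


chain = AuditChain()
chain.append({"action": "submit", "model": "fraud-v2", "actor": "alice"})
chain.append({"action": "approve", "model": "fraud-v2", "actor": "bob"})
print(chain.verify())
```

<p data-ai-summary="true">Rewriting who approved a model would require recomputing every later hash, which is what makes tampering detectable by anyone holding the latest hash.</p>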
<p data-ai-summary="true">### Why This Matters</p>
<p data-ai-summary="true">A compromised model endpoint is an attack surface. Adversarial inputs can extract training data, cause misclassification, or trigger resource exhaustion. Governance ensures human accountability — someone approved this model, and we can prove who and when.</p>
<p data-ai-summary="true">### Real Production Context</p>
<p data-ai-summary="true">Financial services and healthcare organizations require audit trails for every model deployment. The EU AI Act and similar regulations are making governance workflows mandatory infrastructure, not optional process.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Core Concepts: FinOps for ML</p>
<p data-ai-summary="true">### What is ML Cost Optimization?</p>
<p data-ai-summary="true">GPU instances are expensive. Spot/preemptible instances are cheaper but interruptible. ML FinOps tracks:</p>
<p>&#8211; Daily spend by team and instance type<br />
&#8211; Spot utilization percentage<br />
&#8211; Savings from optimization recommendations<br />
&#8211; ROI on platform investments</p>
<p data-ai-summary="true">Training a model that costs $50,000 in GPU time to improve accuracy by 0.1% when the business impact is $500/month is a failure of **cost-aware engineering**.</p>
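<p data-ai-summary="true">The arithmetic behind that judgment is a simple payback calculation. The helper below is a hypothetical illustration using the figures from the paragraph above, not part of the platform's `cost_engine`.</p>

```python
def payback_months(training_cost_usd, monthly_benefit_usd):
    """Months of business benefit needed to recover the training spend."""
    if not monthly_benefit_usd > 0:
        return float("inf")  # the spend is never recovered
    return training_cost_usd / monthly_benefit_usd


months = payback_months(50_000, 500)
print(months, round(months / 12, 1))  # 100.0 months, about 8.3 years
```

<p data-ai-summary="true">A payback period measured in years usually exceeds the useful life of the model itself, which is why the example counts as a failure.</p>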
<p data-ai-summary="true">### Why This Matters</p>
<p data-ai-summary="true">Cloud bills for ML workloads routinely surprise organizations. Without cost visibility at the control plane level, teams optimize accuracy indefinitely while FinOps discovers the damage in quarterly reviews.</p>
<p data-ai-summary="true">### Real Production Context</p>
<p data-ai-summary="true">Companies running large-scale training on AWS, GCP, or Azure use spot instances, autoscaling, and right-sizing recommendations — the same patterns our cost module simulates with spot utilization tracking and ROI analytics.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## Architecture Deep Dive</p>
<p data-ai-summary="true">The platform follows a **modular monolith** architecture: one deployable unit with clear internal boundaries. This is the right starting point for platform teams — simpler than <span data-ai-definition="microservices">microservices</span>, more structured than a script folder.</p>
<p data-ai-summary="true">### Control Plane Layer</p>
<p data-ai-summary="true">The FastAPI application (`backend/app/main.py`) is the single entry point for all operations:</p>
<p>&#8211; **Route registration** — domain-based paths (`/api/platform/*`, `/api/pipeline/*`, etc.)<br />
&#8211; **Request validation** — Pydantic models enforce schema contracts<br />
&#8211; **Middleware** — Prometheus counters and histograms on every HTTP request<br />
&#8211; **WebSocket hub** — broadcasts training completions, drift alerts, cost ticks to connected dashboards<br />
&#8211; **Lifespan management** — seeds initial state and starts background tick loop on startup</p>
<p data-ai-summary="true">Module builders (`backend/app/modules/*.py`) translate domain operations into <span data-ai-definition="API">API</span> responses with consistent `{ title, metrics, events }` shapes for the dashboard.</p>
<p data-ai-summary="true">### Data Plane Layer</p>
<p data-ai-summary="true">Services in `backend/app/services/` execute actual work:</p>
<p>| Service | Responsibility |<br />
|---|---|<br />
| `ml_trainer` | sklearn training, MLflow run logging, metric computation |<br />
| `pipeline_engine` | ETL ingest, validation scoring, dataset versioning |<br />
| `training_orchestrator` | GPU job queue, priority scheduling, checkpoint tracking |<br />
| `inference_router` | Model version routing, A/B traffic split, prediction |<br />
| `drift_service` | Prediction logging, statistical drift detection, fairness |<br />
| `security_gateway` | Adversarial detection, governance workflows, audit chain |<br />
| `cost_engine` | Cost collection, spot optimization, ROI calculation |</p>
<p data-ai-summary="true">### Orchestration Layer</p>
<p data-ai-summary="true">`mlops_state.py` is the platform&#8217;s **etcd-equivalent** — centralized state with per-module metric syncers:</p>
<p>&#8211; `MLOpsState` dataclass holds all cross-module counters and event lists<br />
&#8211; `SYNCERS` maps module IDs to metric projection functions<br />
&#8211; `record_event()` maintains bounded event buffers for dashboard feeds<br />
&#8211; `apply_demo_module()` / `apply_full_demo()` simulate lifecycle progression for demonstrations</p>
<p data-ai-summary="true">The background `_mlops_tick()` coroutine broadcasts state snapshots every 8 seconds — analogous to a Kubernetes controller resync interval.</p>
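<p data-ai-summary="true">A stripped-down version of this state pattern could look as follows. The field names, the `MAX_EVENTS` limit, and the two example syncers are assumptions for illustration; only the names `MLOpsState`, `SYNCERS`, and `record_event` come from the description above.</p>

```python
import uuid
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone

MAX_EVENTS = 50  # assumed buffer size


@dataclass
class MLOpsState:
    models_trained: int = 0
    drift_alerts: int = 0
    events: dict = field(default_factory=dict)  # stream name -> bounded deque


def record_event(state, stream, payload):
    """Append to a bounded buffer; the oldest event is dropped when full."""
    buffer = state.events.setdefault(stream, deque(maxlen=MAX_EVENTS))
    buffer.append({
        "id": str(uuid.uuid4()),
        "ts": datetime.now(timezone.utc).isoformat(),
        **payload,
    })


# Each syncer projects shared state into the metrics one module displays.
SYNCERS = {
    "platform": lambda s: {"models_trained": s.models_trained},
    "monitoring": lambda s: {"drift_alerts": s.drift_alerts},
}

state = MLOpsState()
state.models_trained += 1
record_event(state, "training_jobs", {"type": "training_complete"})
print(SYNCERS["platform"](state), len(state.events["training_jobs"]))
```

<p data-ai-summary="true">Using `deque(maxlen=...)` keeps dashboard feeds bounded without any explicit trimming logic.</p>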
<p data-ai-summary="true">### Interaction Flow</p>
<p>1. Dashboard or CLI sends `POST /api/platform/train`<br />
2. Control plane validates `TrainRequest` via Pydantic<br />
3. `platform.submit_training()` delegates to `ml_trainer.train_model()`<br />
4. MLflow creates experiment, logs params/metrics, registers artifact<br />
5. `mlops_state` counters increment; event recorded in `training_jobs`<br />
6. WebSocket broadcasts `training_complete` to all connected clients<br />
7. Dashboard metrics grid re-renders with updated counts</p>
<p data-ai-summary="true">### Failure Handling</p>
<p>&#8211; Invalid module IDs return `{"error": "invalid module", "valid": [...]}` — fail with guidance, not a silent 404<br />
&#8211; WebSocket clients that disconnect are pruned from `_ws_clients` on send failure<br />
&#8211; Docker health checks gate UI container startup until the <span data-ai-definition="API">API</span> reports healthy<br />
&#8211; `stop.sh` frees ports 8080/3000 before mode switches to prevent address-in-use failures</p>
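<p data-ai-summary="true">Pruning on send failure can be sketched with plain `asyncio`. `FakeClient` is a hypothetical stand-in for a real WebSocket connection, and `broadcast` is an assumed helper name; the platform's actual hub code is not shown in this lesson.</p>

```python
import asyncio


class FakeClient:
    """Stand-in for a WebSocket connection; raises once disconnected."""

    def __init__(self, name, alive=True):
        self.name, self.alive, self.received = name, alive, []

    async def send_json(self, message):
        if not self.alive:
            raise ConnectionError(f"{self.name} disconnected")
        self.received.append(message)


async def broadcast(clients, message):
    """Send to every client and drop the ones whose send raises."""
    dead = []
    for client in list(clients):  # copy so the set is not mutated mid-loop
        try:
            await client.send_json(message)
        except Exception:
            dead.append(client)
    for client in dead:
        clients.discard(client)
    return len(dead)


clients = {FakeClient("a"), FakeClient("b", alive=False)}
pruned = asyncio.run(broadcast(clients, {"type": "mlops_tick"}))
print(pruned, sorted(c.name for c in clients))
```

<p data-ai-summary="true">Collecting dead clients first and removing them afterwards avoids mutating the set while iterating over it.</p>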
<p data-ai-summary="true">## Control Flow</p>
<p data-ai-summary="true">The end-to-end ML lifecycle executes in this order:</p>
<p>1. **Bootstrap** — `bootstrap.sh` creates venv, installs dependencies, runs integration tests, optionally builds Docker images<br />
2. **Start** — `start.sh` (local) or `start.sh docker` brings up <span data-ai-definition="API">API</span> + dashboard<br />
3. **Data ingest** — `POST /api/pipeline/ingest` generates validated records, updates pipeline health<br />
4. **Data validation** — `POST /api/pipeline/validate` computes quality score; blocks downstream if degraded<br />
5. **Dataset versioning** — `POST /api/pipeline/version` creates immutable snapshot reference<br />
6. **Model training** — `POST /api/platform/train` runs real sklearn training with MLflow tracking<br />
7. **GPU job submission** — `POST /api/training/jobs` queues distributed training with priority and GPU allocation<br />
8. **Governance approval** — `POST /api/security/governance/submit` → `approve` before production promotion<br />
9. **Traffic split** — `POST /api/serving/traffic-split` sets canary percentage<br />
10. **Inference** — `POST /api/serving/predict` routes through A/B split, records latency<br />
11. **Monitoring** — `POST /api/monitoring/log` + `POST /api/monitoring/drift` detect distribution shift<br />
12. **Cost collection** — `POST /api/cost/collect` aggregates spend; optimization endpoints recommend savings<br />
13. **Full demo** — `POST /api/demo/run` exercises all modules in one coordinated pass</p>
<p data-ai-summary="true">## Data Flow</p>
<p data-ai-summary="true">### Input</p>
<p>&#8211; **HTTP JSON** — structured requests via REST (`TrainRequest`, `PredictRequest`, etc.)<br />
&#8211; **WebSocket** — persistent connection for real-time feed (`/ws`)<br />
&#8211; **Synthetic data** — `ml_trainer` generates reproducible datasets for training demonstrations<br />
&#8211; **Pipeline data** — ingested records stored in `runtime/pipeline_data`</p>
<p data-ai-summary="true">### Processing</p>
<p>&#8211; **Validation layer** — Pydantic models reject malformed requests before service invocation<br />
&#8211; **Training pipeline** — features → train/test split → model fit → metric computation → MLflow log<br />
&#8211; **Inference pipeline** — instances → version router → model selection by traffic % → predictions + latency<br />
&#8211; **Drift pipeline** — logged features → statistical test → severity classification → alert counter<br />
&#8211; **Governance pipeline** — workflow submission → state transitions → audit chain append</p>
<p data-ai-summary="true">### Transformations</p>
<p>&#8211; Raw feature vectors → sklearn predictions<br />
&#8211; Training metrics → MLflow run parameters and artifacts<br />
&#8211; Module operations → normalized metric dictionaries via `SYNCERS`<br />
&#8211; Events → bounded lists with timestamps and UUIDs for dashboard tables</p>
<p data-ai-summary="true">### Outputs</p>
<p>&#8211; **REST responses** — JSON with status, metrics, and domain-specific payloads<br />
&#8211; **WebSocket events** — `training_complete`, `drift_check`, `demo_complete`, `mlops_tick`<br />
&#8211; **Prometheus metrics** — `mlops_http_requests_total`, `mlops_http_request_duration_seconds`<br />
&#8211; **MLflow artifacts** — model binaries and run metadata in `runtime/mlruns`</p>
<p data-ai-summary="true">### State Storage</p>
<p>| Store | Contents | Persistence |<br />
|---|---|---|<br />
| `MLOpsState` (in-memory) | Live counters, queues, event buffers | Resets on restart; seeded on startup |<br />
| `runtime/mlruns` | MLflow experiments, runs, model artifacts | Docker volume or local directory |<br />
| `runtime/pipeline_data` | Ingested dataset snapshots | Docker volume or local directory |<br />
| `runtime/logs` | <span data-ai-definition="API">API</span> and UI process logs | Ephemeral |</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## State Changes &#038; Reconciliation</p>
<p data-ai-summary="true">### Resource Lifecycle</p>
<p data-ai-summary="true">Every platform resource follows a defined lifecycle:</p>
<p>| Resource | States | Transitions |<br />
|---|---|---|<br />
| Training job | queued → training → completed/failed | Orchestrator priority queue |<br />
| MLflow run | RUNNING → FINISHED/FAILED | Training service |<br />
| Governance workflow | draft → submitted → performance_approved → deployed | Security gateway |<br />
| Pipeline | healthy → degraded → failed | Validation score thresholds |<br />
| Model version | staging → canary → production | Traffic split <span data-ai-definition="API">API</span> |</p>
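<p data-ai-summary="true">The lifecycle table maps naturally onto a transition table. The sketch below encodes the states listed above; the `TRANSITIONS` dictionary and the `transition` helper are hypothetical names, and the platform may enforce these rules differently.</p>

```python
# Allowed next states per resource, taken from the lifecycle table.
TRANSITIONS = {
    "training_job": {"queued": {"training"}, "training": {"completed", "failed"}},
    "mlflow_run": {"RUNNING": {"FINISHED", "FAILED"}},
    "governance_workflow": {
        "draft": {"submitted"},
        "submitted": {"performance_approved"},
        "performance_approved": {"deployed"},
    },
    "pipeline": {"healthy": {"degraded"}, "degraded": {"failed"}},
    "model_version": {"staging": {"canary"}, "canary": {"production"}},
}


def transition(resource, current, target):
    """Return the new state, or raise if the move is not in the table."""
    allowed = TRANSITIONS[resource].get(current, set())
    if target not in allowed:
        raise ValueError(f"{resource}: {current} -> {target} is not allowed")
    return target


state = transition("model_version", "staging", "canary")
state = transition("model_version", state, "production")
print(state)
```

<p data-ai-summary="true">Rejecting unknown moves by default means a model cannot jump from staging straight to production without passing through canary.</p>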
<p data-ai-summary="true">### Desired vs Actual State</p>
<p>&#8211; **Desired**: traffic split set to 80/20 via `POST /api/serving/traffic-split`<br />
&#8211; **Actual**: `inference_router` routes predictions according to stored percentages<br />
&#8211; **Desired**: GPU job with priority 9 submitted<br />
&#8211; **Actual**: orchestrator queue position and allocation reflected in `training_queue`</p>
<p data-ai-summary="true">In a full Kubernetes deployment, a ModelServe Custom Resource would declare desired replicas and traffic weights; a controller would reconcile actual pod state. Our in-process state model demonstrates the same **declarative intent → observed state** pattern at application scale.</p>
<p data-ai-summary="true">### Reconciliation Loops</p>
<p data-ai-summary="true">The `_mlops_tick()` background task runs every 8 seconds:</p>
<p>1. Read current cost and drift state from `MLOpsState`<br />
2. Broadcast snapshot to all WebSocket clients<br />
3. Dashboard updates activity stream without polling</p>
<p data-ai-summary="true">This is a simplified **watch/resync** loop — the same primitive that powers Kubernetes informers and Prometheus scrape intervals.</p>
<p data-ai-summary="true">### Drift Detection</p>
<p data-ai-summary="true">When `check_drift()` computes a p-value below threshold:</p>
<p>1. `drift_alerts` counter increments<br />
2. Event appended to `drift_events` list<br />
3. WebSocket broadcasts `drift_check` event<br />
4. Dashboard metric card highlights changed value</p>
<p data-ai-summary="true">Production systems attach **automated responses**: retrain triggers, traffic rollback, or paging on-call ML engineers.</p>
<p data-ai-summary="true">### Self-Healing</p>
<p>&#8211; Docker Compose `depends_on: condition: service_healthy` ensures UI starts only after <span data-ai-definition="API">API</span> health check passes<br />
&#8211; `stop.sh` kills orphaned processes on ports 8080/3000 before restart<br />
&#8211; WebSocket dead client pruning prevents memory leaks from stale connections</p>
<p data-ai-summary="true">### Failure Recovery</p>
<p>&#8211; `cleanup.sh` stops all services, prunes Docker resources, removes build artifacts — full environment reset<br />
&#8211; `bootstrap.sh` recreates venv and validates system integrity through integration tests<br />
&#8211; MLflow file-based tracking survives <span data-ai-definition="API">API</span> restarts when `runtime/mlruns` is on a persistent volume</p>
<p data-ai-summary="true">## Production Considerations</p>
<p data-ai-summary="true">### Multi-Tenancy</p>
<p data-ai-summary="true">Production platforms isolate teams via namespaces, <span data-ai-definition="API">API</span> keys, and resource quotas. Each team sees only their experiments, models, and cost records. Our platform uses a single `MLOpsState` — production would shard state by `team_id`.</p>
<p data-ai-summary="true">### Security</p>
<p>&#8211; Enable authentication middleware on all `/api/*` routes<br />
&#8211; Restrict CORS to known dashboard origins<br />
&#8211; Encrypt MLflow artifact storage (S3 SSE, GCS CMEK)<br />
&#8211; Run containers as non-root users<br />
&#8211; Scan serialized model artifacts for unsafe pickle payloads before loading them</p>
<p data-ai-summary="true">### RBAC</p>
<p data-ai-summary="true">Map roles to <span data-ai-definition="API">API</span> capabilities:</p>
<p>| Role | Permissions |<br />
|---|---|<br />
| `ml-engineer` | train, submit jobs, view experiments |<br />
| `ml-lead` | approve governance workflows, promote models |<br />
| `platform-admin` | traffic split, cost policies, system config |<br />
| `auditor` | read-only access to audit chain and compliance scores |</p>
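<p data-ai-summary="true">A role check built from this table can be very small. The permission strings below are hypothetical labels derived from the table, and `is_allowed` is an assumed helper; a real deployment would wire this into authentication middleware.</p>

```python
# Role names come from the table; permission labels are illustrative.
ROLE_PERMISSIONS = {
    "ml-engineer": {"train", "submit_job", "view_experiments"},
    "ml-lead": {"approve_governance", "promote_model"},
    "platform-admin": {"traffic_split", "cost_policy", "system_config"},
    "auditor": {"read_audit_chain", "read_compliance"},
}


def is_allowed(role, permission):
    """Fail closed: unknown roles and unknown permissions are denied."""
    return permission in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("ml-lead", "approve_governance"))
print(is_allowed("ml-engineer", "approve_governance"))
print(is_allowed("intern", "train"))
```

<p data-ai-summary="true">Separating `ml-engineer` from `ml-lead` enforces that the person who trains a model is not automatically the person who approves it.</p>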
<p data-ai-summary="true">### <span data-ai-definition="scalability">Scalability</span></p>
<p>&#8211; Horizontal <span data-ai-definition="API">API</span> scaling behind a load balancer (stateless control plane)<br />
&#8211; External state store (Redis/PostgreSQL) replacing in-memory `MLOpsState`<br />
&#8211; Async job queue (Celery, SQS, Kafka) for training and pipeline work<br />
&#8211; CDN for static dashboard assets</p>
<p data-ai-summary="true">### Observability</p>
<p>&#8211; Prometheus metrics (implemented)<br />
&#8211; Structured JSON logging with trace IDs<br />
&#8211; Grafana dashboards for module-level SLOs<br />
&#8211; Alertmanager rules on drift alerts, error rates, cost anomalies</p>
<p data-ai-summary="true">### Failure Handling</p>
<p>&#8211; Circuit breakers on downstream service calls<br />
&#8211; Retry with exponential backoff on training failures<br />
&#8211; Dead letter queues for failed pipeline stages<br />
&#8211; Graceful degradation: dashboard shows stale metrics with timestamp when <span data-ai-definition="API">API</span> is degraded</p>
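<p data-ai-summary="true">Retry with exponential backoff is simple enough to sketch directly. The defaults below (four attempts, 0.5 s base delay, 8 s cap) are arbitrary assumptions, and the injectable `sleep` parameter exists only so the behavior can be tested without waiting.</p>

```python
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call fn until it succeeds; the delay doubles after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))


attempts = {"count": 0}


def flaky_training_job():
    """Hypothetical job that fails twice, then succeeds."""
    attempts["count"] += 1
    if 3 > attempts["count"]:
        raise RuntimeError("transient failure")
    return "trained"


delays = []
result = retry_with_backoff(flaky_training_job, sleep=delays.append)
print(result, delays)
```

<p data-ai-summary="true">In production the delay usually also includes random jitter so that many failed jobs do not retry in lockstep.</p>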
<p data-ai-summary="true">### Disaster Recovery</p>
<p>&#8211; MLflow tracking server backed by S3 with cross-region replication<br />
&#8211; <span data-ai-definition="database">Database</span> backups for governance and audit data<br />
&#8211; Documented runbook for full platform restore<br />
&#8211; Regular restore drills (not just backup verification)</p>
<p data-ai-summary="true">### Cost Optimization</p>
<p>&#8211; Spot/preemptible instances for fault-tolerant training<br />
&#8211; Autoscaling inference based on request rate<br />
&#8211; Model quantization to reduce serving infrastructure<br />
&#8211; Cost allocation tags per team and experiment</p>
<p data-ai-summary="true">### <span data-ai-definition="performance">Performance</span> Bottlenecks</p>
<p>| Bottleneck | Mitigation |<br />
|---|---|<br />
| Synchronous training in <span data-ai-definition="API">API</span> process | Offload to job queue + worker pool |<br />
| In-memory state | External store with connection pooling |<br />
| WebSocket fan-out | Redis pub/sub for multi-instance broadcast |<br />
| MLflow file store at scale | Remote tracking server (PostgreSQL + S3) |</p>
<p data-ai-summary="true">## Advanced Patterns</p>
<p data-ai-summary="true">### Controller Pattern</p>
<p data-ai-summary="true">Implement a Kubernetes Operator that watches `ModelDeployment` CRDs and reconciles:</p>
<p>&#8211; Desired model version and traffic percentage<br />
&#8211; Actual Deployment/Service/Ingress state<br />
&#8211; Status conditions reported back to the CRD</p>
<p data-ai-summary="true">Our `mlops_state` SYNCERS are a microcosm of this pattern.</p>
<p data-ai-summary="true">### Async Workflows</p>
<p data-ai-summary="true">Replace synchronous training calls with:</p>
<p>1. <span data-ai-definition="API">API</span> accepts job, returns `job_id` immediately<br />
2. Worker picks up job from queue<br />
3. Status updates via WebSocket or polling<br />
4. Completion triggers governance workflow automatically</p>
<p data-ai-summary="true">Argo Workflows and Kubeflow Pipelines implement this at cluster scale.</p>
<p data-ai-summary="true">### Event-Driven Architecture</p>
<p data-ai-summary="true">Publish domain events to Kafka:</p>
<p>&#8211; `model.training.completed`<br />
&#8211; `pipeline.validation.failed`<br />
&#8211; `monitoring.drift.detected`</p>
<p data-ai-summary="true">Downstream consumers trigger retraining, alerting, or cost analysis without tight coupling.</p>
<p data-ai-summary="true">### Multi-Cluster Strategies</p>
<p>&#8211; Control plane in a management cluster<br />
&#8211; Training workloads on GPU clusters<br />
&#8211; Inference on regional edge clusters<br />
&#8211; GitOps (ArgoCD) syncs model deployment manifests across clusters</p>
</div>]]></content:encoded>
                                </item>
                <item>
            <title> - Hands-On Tutorial</title>
            <link></link>
            <comments>#respond</comments>
            <pubDate></pubDate>
            <dc:creator><![CDATA[systemdesign02]]></dc:creator>
                        <guid isPermaLink="false"></guid>
            <description><![CDATA[## 1. Introduction A scheduler can fire on time, but the work still has to land somewhere. In production systems, that landing zone is almost always a message broker —... Hands-On System Design tutorial with practical examples and real-world applications.]]></description>
            <content:encoded><![CDATA[<div class="lesson-rss-content"><h3>Hands-On System Design Tutorial</h3><p data-ai-summary="true">## 1. Introduction</p>
<p data-ai-summary="true">A scheduler can fire on time, but the work still has to land somewhere. In production systems, that landing zone is almost always a message broker — RabbitMQ, Kafka, or something similar — and a fleet of consumer processes that pull execution requests off a queue and run them. When consumers are too slow, queues back up and SLA windows slip. When they acknowledge messages too early, work is lost on crash. When they never acknowledge, the same message replays forever. When you add instances without understanding prefetch and fair dispatch, one consumer hogs the backlog while others sit idle.</p>
<p data-ai-summary="true">The `week7_integrated_project` (under `week_7_part1_taskscheduler_integrated_project/`) is a single Spring Boot application that walks through three successive layers of consumer engineering across Days 33–35. Day 33 establishes the baseline: consume `Day33TaskExecutionRequest` messages from RabbitMQ, process them asynchronously, persist lifecycle state in H2, and publish status updates. Day 34 shifts the reliability conversation to Kafka — manual offset acknowledgment, typed failure outcomes, exponential-backoff retries, and a dead-letter topic. Day 35 closes the arc with horizontal scaling: multiple consumer threads (and optionally multiple JVM instances) competing on a shared `task.queue`, with Redis tracking per-instance throughput.</p>
<p data-ai-summary="true">This article is written for backend engineers who already understand REST APIs and basic messaging and want to see how consumer correctness and scale are implemented in real Spring Boot code — not as abstract patterns, but as named classes you can step through in a debugger.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## 2. From Fundamentals to a Unified System</p>
<p data-ai-summary="true">### Day 33 — Consuming Task Execution Requests from a Message Queue</p>
<p data-ai-summary="true">Day 33 answers the first question every consumer team faces: *how do I pull work off a broker and actually execute it without blocking the listener thread?* The entry point is `Day33TaskConsumer`, annotated with `@RabbitListener` on the `task-execution-queue` (configured in `Day33RabbitConfig` with 3–10 concurrent consumers via `day33RabbitListenerContainerFactory`). When a `Day33TaskExecutionRequest` arrives, the listener delegates to `Day33TaskProcessor.processTask()`, which is marked `@Async` and runs on virtual threads (`spring.threads.virtual.enabled: true` in `application.yml`).</p>
<p data-ai-summary="true">State is not ephemeral. `Day33TaskProcessor` loads or creates a `Day33Task` entity and persists transitions through `Day33TaskRepository` — `QUEUED` → `PROCESSING` → `COMPLETED` or `FAILED`. After each transition, `Day33StatusPublisher` emits a JSON status update to the `task-status-exchange` topic exchange. Publishers enter through `Day33TaskPublisherService`, which is also exposed via `Day33ApiController` at `POST /api/day33/tasks`.</p>
<p data-ai-summary="true">In production, the failure mode this solves is the &#8220;fire-and-forget consumer&#8221; — a service that logs a message and returns without tracking whether execution actually succeeded. When the process dies mid-task, operations has no record of what was in flight. JPA-backed `Day33Task` rows give you an auditable execution ledger tied to a `workerId` from `InstanceIdProvider`.</p>
<p data-ai-summary="true">### Day 34 — Acknowledging Messages and Handling Failures in Consumers</p>
<p data-ai-summary="true">Day 34 tackles the harder problem: *when is it safe to tell the broker I&#8217;m done?* Kafka auto-commit is convenient and dangerous — a crash after processing but before commit replays the message; a commit before processing loses it. `Day34KafkaConfig` disables auto-commit (`ENABLE_AUTO_COMMIT_CONFIG: false`) and sets `AckMode.MANUAL_IMMEDIATE` on `day34KafkaListenerContainerFactory`. `Day34KafkaTaskConsumerService` listens on `task-execution` and `task-retry` topics, deserializes JSON into `Day34Task`, and routes to pluggable processors (`Day34EmailTaskProcessor`, `Day34ReportTaskProcessor`) via the `Day34TaskProcessor` interface.</p>
<p data-ai-summary="true">Each processor returns a `Day34ProcessingResult` with status `SUCCESS`, `RETRYABLE_FAILURE`, or `PERMANENT_FAILURE`. `Day34MessageAcknowledgmentHandler` is the decision engine: success calls `acknowledgment.acknowledge()` immediately; retryable failures with retries remaining invoke `Day34RetryHandler.scheduleRetry()` (exponential backoff capped at 60 seconds) and then ack the original offset; exhausted retries or permanent failures route through `Day34DeadLetterHandler` to the `task-dead-letter` topic before acking. `Day34ConsumerMetrics` records counters and processing timers for the dashboard at `/day34/dashboard`.</p>
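<p data-ai-summary="true">The decision table described above can be summarized in a language-neutral sketch. This is Python, not the project's Java, and it is not the code of `Day34MessageAcknowledgmentHandler`: the function names, the 1-second base delay, and the returned labels are assumptions. Only the three status values and the 60-second cap come from the text.</p>

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "SUCCESS"
    RETRYABLE_FAILURE = "RETRYABLE_FAILURE"
    PERMANENT_FAILURE = "PERMANENT_FAILURE"


def backoff_seconds(retry_count, base=1.0, cap=60.0):
    """Exponential backoff; the base is assumed, the 60 s cap is from the text."""
    return min(base * 2 ** retry_count, cap)


def decide(status, retry_count, max_retries):
    """Return (route, delay_seconds); the offset is acknowledged in every case."""
    if status is Status.SUCCESS:
        return ("ack_only", None)
    if status is Status.RETRYABLE_FAILURE and max_retries > retry_count:
        return ("retry_then_ack", backoff_seconds(retry_count))
    # Permanent failures and exhausted retries go to the dead-letter topic.
    return ("dead_letter_then_ack", None)


print(decide(Status.SUCCESS, 0, 3))
print(decide(Status.RETRYABLE_FAILURE, 1, 3))
print(decide(Status.RETRYABLE_FAILURE, 3, 3))
print(decide(Status.PERMANENT_FAILURE, 0, 3))
```

<p data-ai-summary="true">Every branch ends in an acknowledgment, which is what prevents a poison message from being redelivered indefinitely; the branches differ only in where the message goes first.</p>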
<p data-ai-summary="true">The production failure mode here is the **infinite redelivery loop** — a poison message that crashes the consumer on every attempt, wedging the partition. Explicit acknowledgment only after a defined outcome, plus DLQ routing, breaks that loop while preserving the message for inspection.</p>
<p data-ai-summary="true">### Day 35 — Scaling Task Consumers Horizontally</p>
<p data-ai-summary="true">Day 35 asks: *given a growing backlog, how do I add capacity without breaking fairness or losing observability?* `Day35TaskProducerService` publishes `Day35ScalingTask` messages to the durable `task.queue` (declared in `Day35RabbitConfig`). `Day35ScalingConsumerService` competes for messages with a listener factory tuned for scaling: `prefetch-count: 1` and `max-concurrent-consumers: 3` prevent one thread from hoarding messages. Each consumer instance identifies itself via `week7.day35.consumer-instance-name` (overridable with `CONSUMER_ID`), and writes per-instance counters to Redis keys like `stats:{consumerId}:processed`.</p>
<p data-ai-summary="true">`Day35MetricsService` aggregates Redis stats and queries RabbitMQ queue depth through `day35RabbitAdmin.getQueueInfo()` for the dashboard at `/day35/dashboard`. You can simulate true horizontal scaling by launching a second JVM on port 8081 with `CONSUMER_ID=consumer-2`.</p>
<p data-ai-summary="true">The production failure mode is **false scaling** — running three processes on one machine but configuring prefetch so high that only one ever receives work, or scaling out without per-instance metrics so you cannot tell which consumer is lagging.</p>
<p data-ai-summary="true">Together, the three days form a progression inside one JVM. Day 33 proves you can consume and track work. Day 34 proves you can consume *correctly* under failure. Day 35 proves you can consume *at scale* with observable distribution. They share `Week7ModuleProperties` for feature toggles, `JacksonConfig` for JSON serialization, and `Week7IntegratedApplication` as the single entry point — but intentionally use separate broker queues and topics so each lesson&#8217;s boundaries remain visible in package names (`day33`, `day34`, `day35`).</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## 3. Architecture Overview</p>
<p data-ai-summary="true">The system has five conceptual layers, all hosted inside one Spring Boot process unless you explicitly launch additional instances for Day 35 scaling experiments.</p>
<p data-ai-summary="true">**Entry and routing layer.** `Week7IntegratedApplication` bootstraps the JVM with `@EnableAsync`, `@EnableKafka`, and `@EnableRetry`. `RootController` serves the landing page at `/`, and each day&#8217;s `Day33DashboardController`, `Day34DashboardController`, and `Day35DashboardController` expose module-specific dashboards and REST APIs.</p>
<p data-ai-summary="true">**Configuration layer.** `Week7ModuleProperties` centralizes per-day settings (`week7.day33.*`, `week7.day34.*`, `week7.day35.*`). Broker-specific beans live in `Day33RabbitConfig`, `Day34KafkaConfig`, `Day35RabbitConfig`, and `Day35RedisConfig`. `SharedRabbitConfig` provides the shared `Jackson2JsonMessageConverter`.</p>
<p data-ai-summary="true">**Consumer processing layer.** Day 33: `Day33TaskConsumer` → `Day33TaskProcessor` → type-specific execution methods. Day 34: `Day34KafkaTaskConsumerService` → `Day34TaskProcessor` implementations → `Day34MessageAcknowledgmentHandler`. Day 35: `Day35ScalingConsumerService` with simulated work via `Thread.sleep(task.getComplexityMs())`.</p>
<p data-ai-summary="true">**Persistence and observability layer.** Day 33 uses H2 via JPA (`Day33Task` / `Day33TaskRepository`). Day 34 uses in-memory `Day34TaskHistoryService` plus Micrometer via `Day34ConsumerMetrics`. Day 35 uses Redis (`Day35MetricsService`, `Day35RedisConfig`).</p>
<p data-ai-summary="true">**External broker boundary.** RabbitMQ (Day 33 execution queue + status exchange; Day 35 work queue), Kafka (Day 34 execution/retry/DLQ topics), Redis (Day 35 metrics), and H2 (Day 33 task ledger) sit outside the JVM and are started via `docker/docker-compose.yml`.</p>
<p data-ai-summary="true">Dependency flow is inward: controllers and listeners call services; services call repositories, Redis templates, or broker templates; all broker connections flow through Spring Boot auto-configuration pointed at `localhost` by default.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## 4. Data / Control Flow</p>
<p data-ai-summary="true">The primary unit of work across this project is a **task execution message** — a broker-delivered payload describing work to perform. The richest control-flow path is Day 34&#8217;s Kafka pipeline, because it encodes the acknowledgment decision that Days 33 and 35 simplify or defer.</p>
<p data-ai-summary="true">**Entry.** A task is published to the `task-execution` topic by `Day34TaskProducerService` (triggered via `POST /day34/API/tasks/email` or `/day34/API/tasks/batch`). The message is a JSON-serialized `Day34Task` with fields `id`, `type`, `payload`, `retryCount`, and `maxRetries`.</p>
<p data-ai-summary="true">**Consumption.** `Day34KafkaTaskConsumerService.consumeTask()` receives the raw string, deserializes it, and selects a `Day34TaskProcessor` where `canProcess(task.getType())` returns true. The processor executes simulated I/O (email send, report generation) and returns a `Day34ProcessingResult`.</p>
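<p>The selection step above can be sketched in plain Java. This is a minimal illustration, not the project code: `TaskProcessor`, `EmailProcessor`, `ReportProcessor`, and `ProcessorRegistry` are hypothetical stand-ins for `Day34TaskProcessor` and its implementations, and the type names `EMAIL` and `REPORT` are assumptions.</p>

```java
// Hypothetical stand-in for Day34TaskProcessor; the real interface may differ.
interface TaskProcessor {
    boolean canProcess(String type);
    String process(String payload);
}

final class EmailProcessor implements TaskProcessor {
    public boolean canProcess(String type) { return "EMAIL".equals(type); }
    public String process(String payload) { return "emailed:" + payload; }
}

final class ReportProcessor implements TaskProcessor {
    public boolean canProcess(String type) { return "REPORT".equals(type); }
    public String process(String payload) { return "reported:" + payload; }
}

final class ProcessorRegistry {
    private final TaskProcessor[] processors;

    ProcessorRegistry(TaskProcessor... processors) {
        this.processors = processors.clone();
    }

    // Returns the first processor that accepts the type, or fails loudly
    // so an unknown type is never silently dropped (assumed policy).
    TaskProcessor select(String type) {
        for (TaskProcessor p : processors) {
            if (p.canProcess(type)) {
                return p;
            }
        }
        throw new IllegalArgumentException("No processor for task type: " + type);
    }

    // Convenience entry point used to exercise the sketch.
    static String dispatch(String type, String payload) {
        ProcessorRegistry registry =
                new ProcessorRegistry(new EmailProcessor(), new ReportProcessor());
        return registry.select(type).process(payload);
    }
}
```

<p>Because the listener only asks each processor whether it can handle a type, supporting a new task type means adding one more implementation to the registry; the selection loop itself stays unchanged.</p>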
<p data-ai-summary="true">**Decision.** `Day34MessageAcknowledgmentHandler.handleProcessingResult()` branches on `result.getStatus()`:</p>
<p>&#8211; **SUCCESS** → `acknowledgment.acknowledge()`, increment success metric, record processing time.<br />
&#8211; **RETRYABLE_FAILURE** with `task.hasRetriesLeft()` → `Day34RetryHandler.scheduleRetry()` publishes to `task-retry` after `2^retryCount` seconds (plus jitter), then acks the current offset.<br />
&#8211; **RETRYABLE_FAILURE** with exhausted retries, or **PERMANENT_FAILURE** → `Day34DeadLetterHandler.sendToDeadLetterQueue()` publishes to `task-dead-letter`, then acks.</p>
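<p>The three branches and the `2^retryCount` delay can be sketched as follows. `RetryDecision` and its method names are hypothetical. The sketch assumes that `hasRetriesLeft()` means `retryCount` is below `maxRetries`, that the exponent is capped at 20, and that jitter is uniform up to a caller-supplied bound; none of these details are confirmed by the lesson.</p>

```java
import java.util.concurrent.ThreadLocalRandom;

final class RetryDecision {
    enum Status { SUCCESS, RETRYABLE_FAILURE, PERMANENT_FAILURE }
    enum Action { ACK, RETRY_THEN_ACK, DEAD_LETTER_THEN_ACK }

    // Mirrors the three branches: success, retry while attempts remain,
    // otherwise dead-letter. Every branch ends with an ack.
    static Action decide(Status status, int retryCount, int maxRetries) {
        if (status == Status.SUCCESS) {
            return Action.ACK;
        }
        boolean retriesLeft = retryCount < maxRetries; // assumed meaning of hasRetriesLeft()
        if (status == Status.RETRYABLE_FAILURE && retriesLeft) {
            return Action.RETRY_THEN_ACK;
        }
        return Action.DEAD_LETTER_THEN_ACK;
    }

    // Base delay of 2^retryCount seconds, expressed in milliseconds.
    static long baseDelayMillis(int retryCount) {
        int capped = Math.min(Math.max(retryCount, 0), 20); // assumed cap on the exponent
        return (1L << capped) * 1000L;
    }

    // Adds uniform random jitter in the range 0..maxJitterMillis (assumed bound).
    static long delayWithJitterMillis(int retryCount, long maxJitterMillis) {
        long jitter = ThreadLocalRandom.current().nextLong(maxJitterMillis + 1);
        return baseDelayMillis(retryCount) + jitter;
    }
}
```

<p>Under these assumptions a task on its third retry (`retryCount = 3`) waits at least 8 seconds before it is republished, and jitter keeps a batch of failures from retrying in lockstep.</p>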
<p data-ai-summary="true">**Cyclical outcome.** Retried tasks re-enter through the same listener (both `task-execution` and `task-retry` topics are subscribed). **Terminal outcomes** are a successfully acked offset (work done) or a record on the dead-letter topic (work abandoned with audit trail).</p>
<p data-ai-summary="true">Day 33 follows a simpler path: `Day33TaskPublisherService` → `task-execution-queue` → `Day33TaskConsumer` → async `Day33TaskProcessor` → H2 status update + `Day33StatusPublisher`. Day 35 follows: `Day35TaskProducerService` → `task.queue` → `Day35ScalingConsumerService` → Redis status keys.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## 5. State Management</p>
<p data-ai-summary="true">The most completely modeled execution state in the integrated project is the `Day33Task` JPA entity (`@Table(name = "day33_tasks")`). It tracks the lifecycle of a single execution request from the broker through to completion.</p>
<p data-ai-summary="true">**Fields that constitute state:**</p>
<p>| Field | Role |<br />
|-------|------|<br />
| `status` (`TaskStatus` enum) | Current lifecycle phase |<br />
| `workerId` | Which consumer instance claimed the task |<br />
| `startTime` / `completedTime` | Processing window |<br />
| `errorMessage` | Populated on `FAILED` |<br />
| `retryCount` | Incremented on failure |<br />
| `taskType` / `payload` | Immutable work descriptor |</p>
<p data-ai-summary="true">**Lifecycle transitions:**</p>
<p>1. `Day33TaskPublisherService.publish()` creates a row with `status = QUEUED` before the message hits RabbitMQ.<br />
2. `Day33TaskProcessor.processTask()` sets `PROCESSING`, assigns `workerId`, records `startTime`.<br />
3. On success → `COMPLETED` + `completedTime`.<br />
4. On exception → `FAILED`, `errorMessage` set, `retryCount` incremented.</p>
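<p>The lifecycle above can be expressed as a small transition guard. `TaskLifecycle` is a hypothetical helper, and treating `COMPLETED` and `FAILED` as terminal is an assumption; the lesson does not say whether a failed row is ever moved back to `QUEUED` for a retry.</p>

```java
final class TaskLifecycle {
    enum TaskStatus { QUEUED, PROCESSING, COMPLETED, FAILED }

    // Allowed moves: QUEUED to PROCESSING, then PROCESSING to COMPLETED or FAILED.
    static boolean canTransition(TaskStatus from, TaskStatus to) {
        switch (from) {
            case QUEUED:
                return to == TaskStatus.PROCESSING;
            case PROCESSING:
                return to == TaskStatus.COMPLETED || to == TaskStatus.FAILED;
            default:
                return false; // COMPLETED and FAILED are treated as terminal here
        }
    }

    // Returns the new status, or throws if the move is not allowed.
    static TaskStatus transition(TaskStatus from, TaskStatus to) {
        if (!canTransition(from, to)) {
            throw new IllegalStateException("Illegal transition " + from + " to " + to);
        }
        return to;
    }
}
```

<p>A guard like this makes a redelivered message that tries to move a `COMPLETED` row back to `PROCESSING` fail visibly instead of silently overwriting the ledger.</p>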
<p data-ai-summary="true">Day 35 mirrors a lighter version in Redis: `task:{id}:status` moves through `QUEUED` (set by the producer), `PROCESSING`, and then `COMPLETED` or `FAILED`. Day 34 does not persist task rows; instead `Day34ProcessingResult.Status` represents the *message handling outcome* for a single delivery attempt, while `Day34Task.retryCount` travels with the message across retry publications.</p>
<p data-ai-summary="true">The state machine diagram below focuses on `Day33Task.TaskStatus` because it is the persisted, queryable execution ledger exposed on `/day33/dashboard` and `/API/day33/stats`.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## 6. Key Engineering Insights</p>
<p data-ai-summary="true">**Acknowledge after outcome, not after receipt.** Day 34&#8217;s `MANUAL_IMMEDIATE` ack mode means the offset is not committed until `Day34MessageAcknowledgmentHandler` explicitly calls `acknowledge()` — and that call happens only after the handler knows whether to retry, dead-letter, or complete. This is the difference between at-least-once processing with controlled redelivery and silent message loss.</p>
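<p>A toy simulation shows why the ack position matters. `AckTimingDemo` is a hypothetical, broker-free model: an in-memory log of numbered messages, one simulated crash in the middle of processing, and a single restart that resumes from the last committed offset. It is a sketch of the principle, not of the Kafka client.</p>

```java
final class AckTimingDemo {
    // Replays messages from the committed offset, crashing once while
    // handling crashIndex. Returns how many distinct messages were fully
    // processed across the first run plus one restart.
    static int processedCount(int messages, int crashIndex, boolean ackBeforeProcessing) {
        boolean[] done = new boolean[messages];
        int committed = 0;        // next offset to read after a restart
        boolean crashed = false;

        for (int run = 0; run < 2; run++) {
            for (int offset = committed; offset < messages; offset++) {
                if (ackBeforeProcessing) {
                    committed = offset + 1;   // ack on receipt
                }
                if (!crashed && offset == crashIndex) {
                    crashed = true;           // simulated crash mid-processing
                    break;
                }
                done[offset] = true;          // work finished
                if (!ackBeforeProcessing) {
                    committed = offset + 1;   // ack after outcome
                }
            }
        }

        int count = 0;
        for (boolean d : done) {
            if (d) {
                count++;
            }
        }
        return count;
    }
}
```

<p>With five messages and a crash on the third, acking on receipt completes only four of them because the restart skips the message that was in flight. Acking after the outcome redelivers it and completes all five.</p>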
<p data-ai-summary="true">**Separate concurrency from correctness.** Day 33 scales listener threads (`min-concurrent-consumers: 3`, `max-concurrent-consumers: 10`) while Day 34 scales Kafka container concurrency (`consumer-concurrency: 3`). Both increase throughput, but neither substitutes for acknowledgment discipline. You can run ten threads and still lose messages if you ack too early.</p>
<p data-ai-summary="true">**Prefetch governs fair horizontal distribution.** Day 35 sets `prefetch-count: 1` in `Day35RabbitConfig`&#8217;s listener factory so no single consumer thread buffers a large batch of unacked messages. Without this, adding instances does not reliably increase throughput — one greedy consumer starves the rest.</p>
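<p>The effect of prefetch can be illustrated with a deliberately simplified tick-based model. `PrefetchSimulation` is hypothetical and assumes two consumers (one slow, one fast), a broker that hands out messages round-robin to any consumer with fewer than `prefetch` unacked messages, and an ack at the end of each message. Real RabbitMQ dispatch has more moving parts.</p>

```java
final class PrefetchSimulation {
    // Returns the ticks needed to drain `messages` with two consumers:
    // a slow one (slowTicks per message) and a fast one (1 tick per message).
    static int ticksToDrain(int messages, int prefetch, int slowTicks) {
        int[] cost = { slowTicks, 1 };   // ticks per message for each consumer
        int[] buffered = { 0, 0 };       // unacked messages held by each consumer
        int[] remaining = { 0, 0 };      // ticks left on the message in progress
        int queued = messages;
        int next = 0;                    // round-robin pointer
        int ticks = 0;

        while (queued > 0 || buffered[0] > 0 || buffered[1] > 0) {
            // Broker: dispatch round-robin while some consumer has prefetch room.
            boolean dispatched = true;
            while (queued > 0 && dispatched) {
                dispatched = false;
                for (int i = 0; i < 2; i++) {
                    int c = (next + i) % 2;
                    if (buffered[c] < prefetch) {
                        buffered[c]++;
                        queued--;
                        next = (c + 1) % 2;
                        dispatched = true;
                        break;
                    }
                }
            }
            // Consumers: each works one tick on its current message and acks when done.
            for (int c = 0; c < 2; c++) {
                if (buffered[c] == 0) {
                    continue;
                }
                if (remaining[c] == 0) {
                    remaining[c] = cost[c];
                }
                remaining[c]--;
                if (remaining[c] == 0) {
                    buffered[c]--;       // ack frees one prefetch slot
                }
            }
            ticks++;
        }
        return ticks;
    }
}
```

<p>In this model, ten messages with a slow consumer that needs 5 ticks per message drain in 25 ticks when prefetch is 10, because half the queue is parked behind the slow consumer. With prefetch 1 the same queue drains in 10 ticks, since the fast consumer keeps pulling work as soon as it acks.</p>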
<p data-ai-summary="true">**Persist execution state outside the message.** Day 33&#8217;s `Day33Task` entity survives broker redelivery and gives the dashboard queryable history. Day 35&#8217;s Redis keys serve a similar observability role at higher volume. Relying solely on in-flight message state makes crash recovery a guessing game.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## 7. Success Metrics</p>
<p>### Modularity<br />
&#8211; [ ] Each day&#8217;s code lives in its own package (`day33`, `day34`, `day35`) with no cross-day service imports.<br />
&#8211; [ ] `Week7ModuleProperties` toggles days independently via `week7.day33.enabled`, `week7.day34.enabled`, `week7.day35.enabled`.<br />
&#8211; [ ] Broker configs are isolated in `Day33RabbitConfig`, `Day34KafkaConfig`, `Day35RabbitConfig` with `@Qualifier`-disambiguated templates.</p>
<p>### Readability<br />
&#8211; [ ] Listener entry points are one class per day: `Day33TaskConsumer`, `Day34KafkaTaskConsumerService`, `Day35ScalingConsumerService`.<br />
&#8211; [ ] Day 34 failure routing is centralized in `Day34MessageAcknowledgmentHandler`, not scattered across processors.<br />
&#8211; [ ] Dashboard routes follow a consistent pattern: `/day33/dashboard`, `/day34/dashboard`, `/day35/dashboard`.</p>
<p>### Correctness<br />
&#8211; [ ] Day 34 disables Kafka auto-commit and uses manual ack in `Day34KafkaConfig`.<br />
&#8211; [ ] Retryable failures increment `retryCount` on the `Day34Task` message before republication to `task-retry`.<br />
&#8211; [ ] Day 33 `Day33Task` transitions are persisted before status is published via `Day33StatusPublisher`.</p>
<p>### Extensibility<br />
&#8211; [ ] New Day 34 task types require only a new `Day34TaskProcessor` implementation with `canProcess()` — no listener changes.<br />
&#8211; [ ] Day 35 consumer identity is externalized to `CONSUMER_ID` / `week7.day35.consumer-instance-name` for multi-JVM scaling.<br />
&#8211; [ ] Day 33 task types extend via the `switch` in `Day33TaskProcessor.executeTaskByType()` without touching `Day33TaskConsumer`.</p>
<p data-ai-summary="true">&#8212;</p>
<p data-ai-summary="true">## 8. Conclusion</p>
<p data-ai-summary="true">The underlying principle Week 7 teaches is **consumer responsibility**: a task consumer is not merely a message handler — it is a stateful worker that must decide when work is truly complete, how to behave under failure, and how to scale without hiding bottlenecks. The integrated project makes that responsibility concrete across RabbitMQ and Kafka, with JPA and Redis providing the observability layer operations teams actually need.</p>
<p data-ai-summary="true">From here, the natural next steps are consumer group rebalancing under partition migration (Kafka), dead-letter replay tooling, and idempotent processing guards, so that the at-least-once delivery guarantees you have carefully built do not corrupt downstream state when messages arrive more than once.</p>
</div>]]></content:encoded>
                                </item>
                <item>
            <title> - Hands-On Tutorial</title>
            <link></link>
            <comments>#respond</comments>
            <pubDate></pubDate>
            <dc:creator><![CDATA[systemdr5]]></dc:creator>
                        <guid isPermaLink="false"></guid>
            <description><![CDATA[ Hands-On System Design tutorial with practical examples and real-world applications.]]></description>
            <content:encoded><![CDATA[<div class="lesson-rss-content"><h3>Hands-On System Design Tutorial</h3></div>]]></content:encoded>
                                </item>
                
    </channel>
    </rss>
    