The Quantitative Reality of 100M RPS. (Day 1)
Welcome, engineers. Today, we're not just talking about big numbers; we're peeling back the layers to understand what "100 million requests per second" actually means for a system and the engineers building it. This isn't theoretical fluff; it's the bedrock you need to stand on before writing a single line of production code for ultra-high-scale systems.
Many engineers, even seasoned ones, often think of 100M RPS as a distant, abstract goal. But when you break it down, it transforms from a daunting number into a series of concrete, often brutal, constraints. My goal today is to give you that mental model, grounded in Go, that will inform every design decision you make in this series.
Agenda for Day 1:
Deconstructing 100M RPS: What does this colossal number imply for latency, CPU, and memory?
The Hidden Costs: Why seemingly trivial operations become bottlenecks.
Go's Role: How Go helps, and where its abstractions can still hurt.
Hands-On: Quantifying a Micro-Operation: Building a simple Go tool to measure and extrapolate.
Assignment: Deepening your understanding of the numbers.
Core Concepts: The Unforgiving Math of Scale
At 100M RPS, every nanosecond counts. Let's start with the basics:
Latency Budget: Suppose your system must respond within, say, 100 milliseconds (ms) while handling 100 million requests per second. It's the throughput, not the latency target, that sets the per-request work budget: at 100M RPS, a single sequential processing unit has an average of just 10 nanoseconds (ns) per request (1 second / 10^8 requests). Think about that: 10 ns. A typical CPU clock cycle is on the order of 0.3-0.5 ns. This means your entire request processing path, from network card to application logic and back, has to be incredibly efficient.
CPU Cycles: Modern CPUs operate in gigahertz (GHz), meaning billions of cycles per second. If a CPU runs at 3 GHz, it performs 3 × 10^9 cycles/second. At 100M RPS, each request gets an average of
(3 × 10^9 cycles/sec) / (100 × 10^6 req/sec) = 30 CPU cycles
per request on a single core. This is a gross oversimplification, as requests are distributed across many cores and machines, but it highlights how few cycles a *single* request can consume on average if it were processed sequentially on one core. In reality, it's about aggregate throughput, but the point stands: you don't have many cycles to spare for any single operation.
Memory Access: A cache miss (fetching data from main memory instead of CPU cache) can cost hundreds of CPU cycles. A single page fault (accessing data not in RAM, hitting disk) can cost millions of cycles. At 100M RPS, even a 0.01% cache miss rate means roughly 10,000 requests per second experiencing significant delays, which can lead to cascading failures.
Insight: The "cost" of an operation isn't just its average latency; it's its worst-case latency and its resource consumption pattern. A function that takes 100ns on average but occasionally triggers a 1ms garbage collection pause or a cache miss is a non-starter at 100M RPS. The system will drown in tail latencies.
Architecture & Control Flow: From Micro to Macro
Today, our "system" is a single Go process. The goal is to understand how the performance of this micro-component dictates the architecture of the macro-system.
Our rps-calculator program will simulate a minimal HTTP handler. We'll measure its performance locally and then extrapolate what it would take to reach 100M RPS. This gives us a tangible reference point.
Control Flow (within our simple tool):
Initialization: Set up a lightweight in-memory HTTP server using Go's httptest package.
Handler Definition: Create a basic HTTP handler function that does minimal work (e.g., w.Write([]byte("OK"))). We'll also add a configurable artificial delay to simulate real-world processing.
Benchmarking Loop: A client repeatedly calls this handler for a set number of iterations.
Measurement: Record the total time taken for all iterations.
Calculation: Determine the requests per second (RPS) achieved by this single, isolated handler.
Extrapolation: Calculate how many such instances would be needed to hit 100M RPS.
Reporting: Display these crucial numbers.
Data Flow: Client (internal to the Go app) -> HTTP Request -> HTTP Handler -> HTTP Response -> Client -> Measurement Aggregator.
State Changes:
The "system" (our rps-calculator) transitions from Idle to Measuring, then to Calculating, and finally to Reporting. The core insight here is that the state of a single request must be incredibly lean and transient to not overwhelm the system at scale.
Real-Time Production System Application
Imagine our simple "OK" handler is the core of a critical internal service – say, a health check endpoint, or a metadata lookup. If that minimal operation consumes too many CPU cycles or causes even minor memory allocations, replicating it across thousands of machines to hit 100M RPS becomes prohibitively expensive or simply impossible due to resource contention.
This exercise reveals the minimum resource footprint for a single request. Any real-world handler will do more than just return "OK" – it will parse JSON, query a database, call other services, perform business logic. Each of these adds latency and resource consumption. Understanding the baseline helps us budget for these additions.
Insight: Optimizing for 100M RPS often means pushing as much work as possible out of the critical path of the request handler itself. Batching, asynchronous processing, pre-computation, and aggressive caching become not just optimizations, but fundamental architectural requirements.
Assignment: The Latency Multiplier
Your task is to modify the provided rps-calculator to explore the impact of a small, seemingly insignificant operation.
Steps:
Introduce a time.Sleep: In the simpleHandler function, add a time.Sleep call. Start with time.Sleep(10 * time.Microsecond) (10 microseconds).
Re-run the Calculator: Execute start.sh and observe the new "Observed RPS" and "Instances for 100M RPS".
Experiment: Try increasing the sleep duration to 50 * time.Microsecond, then 100 * time.Microsecond.
Analyze: Write down your observations. How drastically does a small, seemingly innocent delay multiply the number of instances required? What does this tell you about the "cost" of abstraction, or even a simple database query taking a few microseconds?
This assignment is crucial. It makes the abstract numbers concrete and demonstrates the brutal reality of tail latencies and resource budgets.