Day 2: The Frame
Parsing WebSocket Binary Frames from Scratch
The Wrong Way: Taking the Easy Road
When you need WebSocket support, the tempting choice is grabbing a framework. Add @ServerEndpoint or Spring's WebSocketHandler, write five lines of code, and you're done. Ship it.
But here's what's hiding under the hood: the framework creates a new TextMessage object for every frame, wraps payloads in immutable containers, and validates UTF-8 even when you're sending binary data. At 10,000 connections sending pings every 30 seconds, that's 333 frames per second. Easy. At 1 million connections? 33,333 frames per second. Now the abstraction becomes your enemy.
The real killer isn't throughput—it's jitter. When the Young Gen heap fills with short-lived frame objects, the garbage collector pauses your entire program. For 200 milliseconds, your gateway freezes. Mobile clients on sketchy WiFi think they've disconnected. They reconnect. Now you have thousands trying to connect at once.
Core principle: If you don't understand the bytes, you can't optimize the crashes.
Death by a Thousand Small Objects
Let's trace a simple text message through a typical framework. Each incoming frame typically produces:
A heap copy of the raw network bytes
A byte[] holding the unmasked payload
A String decoded from UTF-8
A TextMessage wrapper handed to your handler
That's 4 objects per frame before your code runs. At 50,000 frames/sec, you're creating 200,000 objects/sec. Even if each is just 64 bytes, you're allocating 12.8 MB/sec for protocol overhead. The JVM's allocation rate spikes. When the heap fills, everything stops.
WebSocket itself is elegant: at most 14 bytes of header, then raw payload. But frameworks hide this. They parse, validate, and box everything "for safety." At scale, safety kills performance.
Our Approach: Zero-Copy Frame Parsing
[INSERT: frame_architecture.svg]
We're building four components:
1. Direct Buffer Recycling
Keep a pool of off-heap ByteBuffer instances. Each connection borrows a buffer, parses frames, then returns it. No allocations.
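A minimal sketch of such a pool (the BufferPool class and its sizing are illustrative, not the article's actual implementation):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;

// A fixed-size pool of direct (off-heap) buffers, allocated once up front.
// Connections borrow a buffer for parsing and must return it when done.
class BufferPool {
    private final ArrayBlockingQueue<ByteBuffer> free;

    BufferPool(int buffers, int bufferSize) {
        free = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) {
            free.add(ByteBuffer.allocateDirect(bufferSize)); // the only allocations, ever
        }
    }

    // Borrow a cleared buffer; null means the pool is exhausted (apply backpressure).
    ByteBuffer acquire() {
        ByteBuffer buf = free.poll();
        return buf == null ? null : buf.clear();
    }

    // Return the buffer so another connection can reuse it.
    void release(ByteBuffer buf) {
        buf.clear();
        free.offer(buf);
    }
}
```

Returning null on exhaustion (rather than allocating a fresh buffer) is deliberate: it turns memory pressure into visible backpressure instead of hidden allocation.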
2. State Machine Parser
WebSocket frames aren't guaranteed to arrive complete in one TCP read. You might get 6 bytes of a 14-byte header. The parser remembers where it left off:
READING_HEADER: Accumulate bytes until we have the full header (2-14 bytes)
READING_MASK: If the MASK bit is set, read 4 masking bytes
READING_PAYLOAD: Read the exact number of payload bytes
COMPLETE: Unmask if needed, build frame, reset for the next one
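Those states can be sketched as a minimal incremental parser (the class name and the simplifications, no extended lengths and the payload merely skipped, are mine):

```java
import java.nio.ByteBuffer;

// Sketch of the incremental parser: it consumes whatever bytes have
// arrived and remembers its state between TCP reads. A real parser
// must also compact the buffer so partial headers survive to the next read.
class FrameParser {
    enum State { READING_HEADER, READING_MASK, READING_PAYLOAD, COMPLETE }

    State state = State.READING_HEADER;
    int payloadLength;
    final byte[] mask = new byte[4];
    boolean masked;

    // Feed newly arrived bytes; returns true once a full frame is ready.
    boolean feed(ByteBuffer in) {
        while (in.hasRemaining()) {
            switch (state) {
                case READING_HEADER -> {
                    if (in.remaining() < 2) return false;    // wait for more bytes
                    in.get();                                // b0: FIN/RSV/opcode, ignored here
                    byte b1 = in.get();
                    masked = (b1 & 0x80) != 0;
                    payloadLength = b1 & 0x7F;               // sketch: no extended lengths
                    state = masked ? State.READING_MASK : State.READING_PAYLOAD;
                }
                case READING_MASK -> {
                    if (in.remaining() < 4) return false;
                    in.get(mask);
                    state = State.READING_PAYLOAD;
                }
                case READING_PAYLOAD -> {
                    if (in.remaining() < payloadLength) return false;
                    in.position(in.position() + payloadLength); // sketch: skip the payload
                    state = State.COMPLETE;
                }
                case COMPLETE -> { return true; }
            }
        }
        return state == State.COMPLETE;
    }
}
```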
3. Virtual Threads for Blocking I/O
Java 21's Virtual Threads let us write simple blocking code without sacrificing scalability. Each connection gets its own thread that blocks on SocketChannel.read(). The OS handles scheduling. This eliminates complex event loops—the parser handles frame state, not threading.
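The shape of that per-connection loop, as a runnable sketch (the Gateway class is illustrative; it echoes bytes back where the real gateway would feed the frame parser):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// One virtual thread per connection: plain blocking reads, no event loop.
class Gateway {
    // Bind, then accept connections forever, one virtual thread each.
    static ServerSocketChannel start(int port) throws IOException {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(port));
        Thread.startVirtualThread(() -> {
            try {
                while (true) {
                    SocketChannel conn = server.accept();
                    Thread.startVirtualThread(() -> serve(conn)); // cheap: millions are fine
                }
            } catch (IOException closed) { /* server shut down */ }
        });
        return server;
    }

    static void serve(SocketChannel conn) {
        ByteBuffer buf = ByteBuffer.allocateDirect(16 * 1024);
        try (conn) {
            while (conn.read(buf) != -1) {  // blocks only this virtual thread
                buf.flip();
                conn.write(buf);            // placeholder: real code feeds the parser here
                buf.compact();              // keep any partial frame for the next read
            }
        } catch (IOException ignored) { }
    }
}
```

Note the compact() after parsing: it preserves a half-received frame at the front of the buffer so the next blocking read appends to it.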
4. Bitwise Protocol Parsing
The first byte packs four flag bits (FIN, RSV1, RSV2, RSV3) alongside a 4-bit opcode:
We extract with bit shifts, not objects:
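A sketch of those extractions (the helper class and method names are mine; the bit positions follow RFC 6455):

```java
// Plain bit operations on the first two header bytes; no objects allocated.
class HeaderBits {
    static boolean fin(int b0)    { return (b0 & 0x80) != 0; } // bit 7: final fragment
    static int     rsv(int b0)    { return (b0 >> 4) & 0x07; } // bits 6-4: extension flags
    static int     opcode(int b0) { return b0 & 0x0F; }        // bits 3-0: frame type
    static boolean masked(int b1) { return (b1 & 0x80) != 0; } // bit 7: client mask flag
    static int     len7(int b1)   { return b1 & 0x7F; }        // bits 6-0: length or 126/127
}
```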
Zero allocations. The JVM compiles this to native CPU instructions.
Understanding the WebSocket Frame
[INSERT: parserstateflow.svg]
Every WebSocket message is wrapped in a frame. The byte structure (per RFC 6455): the first byte carries FIN, three RSV bits, and the 4-bit opcode; the second carries the MASK bit and a 7-bit length; then come an optional extended length (2 or 8 bytes), an optional 4-byte masking key, and finally the payload.
Payload Length Encoding
The protocol uses variable-length encoding to save space:
If the 7-bit length is 0-125: that's the actual size
If 126: next 2 bytes contain the real length (16-bit)
If 127: next 8 bytes contain the real length (64-bit)
This means headers range from 2 to 14 bytes.
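The three cases above, as code (the PayloadLength class is illustrative; it assumes the extended-length bytes, if any, are already in the buffer):

```java
import java.nio.ByteBuffer;

// Decode the variable-length payload size. ByteBuffer reads are
// big-endian by default, which matches the protocol's network byte order.
class PayloadLength {
    static long decode(ByteBuffer buf) {
        int len7 = buf.get() & 0x7F;                     // second header byte, mask bit stripped
        if (len7 <= 125) return len7;                    // short form: this is the size
        if (len7 == 126) return buf.getShort() & 0xFFFF; // 16-bit extended, read as unsigned
        return buf.getLong();                            // 64-bit extended (top bit must be 0)
    }
}
```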
The Masking Dance
Clients MUST mask their payloads (servers MUST NOT). The mask is 4 random bytes. Each payload byte gets XORed:
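In code, the whole operation is one loop (the Masking class is mine); XOR is its own inverse, so the same routine masks and unmasks:

```java
// Unmask in place: each payload byte is XORed with mask[i % 4].
class Masking {
    static void unmask(byte[] payload, byte[] mask) {
        for (int i = 0; i < payload.length; i++) {
            payload[i] ^= mask[i & 3];   // i & 3 == i % 4, avoids a division
        }
    }
}
```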
This prevents cache poisoning attacks through transparent proxies. We unmask on the server.
[INSERT: parserstatemachine.svg]
The ByteBuffer Trap
Here's a mistake that crashes production systems:
After read(), the buffer's position is at the last byte read. You must flip:
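A minimal illustration (the put() calls stand in for a channel read; the helper class is mine):

```java
import java.nio.ByteBuffer;

// Simulate a network read into a buffer, then flip before parsing.
class FlipDemo {
    static String readThenParse() {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.put((byte) 0x81).put((byte) 0x02);  // stand-in for channel.read(buf)
        // WRONG: calling buf.get() here would read from position 2, past the data.
        buf.flip();                              // limit = 2, position = 0: ready to parse
        int b0 = buf.get() & 0xFF;
        int b1 = buf.get() & 0xFF;
        return Integer.toHexString(b0) + "," + Integer.toHexString(b1);
    }
}
```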
Forgetting flip() throws BufferUnderflowException or reads garbage. This single mistake causes more incidents than memory leaks.
What to Watch in Production
1. Off-Heap Memory Leaks
Direct ByteBuffers live outside the heap. If you allocate but never release, you exhaust process memory: once the limit set by -XX:MaxDirectMemorySize is exceeded, the JVM throws OutOfMemoryError: Direct buffer memory.
Monitor: BufferPoolMXBean via JMX. Track getMemoryUsed() for "direct" pool. If it grows unbounded, you have a leak.
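Reading that counter takes a few lines (the probe class is mine; BufferPoolMXBean and the "direct" pool name are standard JDK APIs):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Query the JVM's own accounting of direct (off-heap) buffer memory.
class DirectMemoryProbe {
    static long directBytesUsed() {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();   // bytes currently held by direct buffers
            }
        }
        return -1;   // pool not found (should not happen on a standard JVM)
    }
}
```

Export this value to your metrics system on a timer; a monotonically growing curve with a stable connection count is the leak signature.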
2. Parser State Distribution
In steady state, 99% of connections should be in READING_HEADER (idle). If many are stuck in READING_PAYLOAD, either clients are sending huge frames (possible attack) or network backpressure is slowing reads.
Metric: Histogram of connection states. Alert if READING_PAYLOAD > 5%.
3. Frame Throughput vs GC Pressure
Goal: stable allocation rate under 10MB/sec even at 100k connections.
Watch in VisualVM:
Heap usage should be nearly flat (a steep sawtooth indicates allocation churn)
GC pauses should be under 10ms
Thread count should be constant
4. Partial Frame Accumulation
A slowloris attack: client sends 1 byte every 5 seconds. Your buffer accumulates but never completes.
Defense: Per-connection timeout. If a frame isn't complete within 30 seconds, disconnect.
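One way to track that deadline per connection (the class, field names, and sweeper arrangement are illustrative):

```java
// Track when the current frame started; a sweeper thread (not shown)
// calls isStalled() periodically and closes stalled connections.
class FrameDeadline {
    static final long TIMEOUT_NANOS = 30_000_000_000L; // 30 seconds

    private long frameStartNanos = -1;                 // -1: no frame in progress

    void onFirstByteOfFrame(long nowNanos) { frameStartNanos = nowNanos; }
    void onFrameComplete()                 { frameStartNanos = -1; }

    boolean isStalled(long nowNanos) {
        return frameStartNanos != -1 && nowNanos - frameStartNanos > TIMEOUT_NANOS;
    }
}
```

The timeout applies per frame, not per connection lifetime: an idle connection (no frame in progress) is never stalled, so long-lived quiet clients are unaffected.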
Why This Matters at Scale
At 1 million concurrent WebSocket connections:
1 byte per connection = 1MB of parser state
1 String allocation per frame at 100 frames/sec/connection = 100 million allocations/sec, roughly 10GB/sec of garbage at ~100 bytes per String
1ms GC pause = every connection stalls, triggering cascading timeouts
Discord learned this the hard way. Their 2020 outage was caused by GC pauses in the Gateway fleet. After rewriting the core loop to eliminate allocations in the frame parser, latency dropped from p99=800ms to p99=20ms.
The lesson: You can't scale what you don't measure in bytes.