Day 2: The Frame: Parsing WebSocket binary frames (Fin bit, Opcodes, Masking) manually.

Lesson 2 · 60 min

Day 2: The Frame

Parsing WebSocket Binary Frames from Scratch


The Wrong Way: Taking the Easy Road

Component Architecture

[Diagram: Flux Gateway component architecture. Clients (1..N) connect over WebSocket ("1M+ concurrent connections") to the GatewayServer (Java 21), whose ServerSocketChannel accept() loop on port 9001 hands each connection to a per-connection Virtual Thread via newVirtualThreadPerTaskExecutor(). Each thread runs a ConnectionHandler with a Frame Parser over an off-heap ByteBuffer. A MetricsCollector tracks activeConnections, totalFrames, and bytesReceived with lock-free AtomicLong counters, which the DashboardServer (HttpServer, port 8080) exposes over HTTP at /api/metrics.]

Key design principles:

1. One Virtual Thread per connection (blocking I/O model)
2. Off-heap ByteBuffer for zero-copy parsing
3. Lock-free atomic metrics (no synchronized blocks)
4. Stateful parser handles partial TCP reads

When you need WebSocket support, the tempting choice is grabbing a framework. Add @ServerEndpoint or Spring's WebSocketHandler, write five lines of code, and you're done. Ship it.

But here's what's hiding under the hood: the framework creates a new TextMessage object for every frame, wraps payloads in immutable containers, and validates UTF-8 even when you're sending binary data. At 10,000 connections sending pings every 30 seconds, that's 333 frames per second. Easy. At 1 million connections? 33,333 frames per second. Now the abstraction becomes your enemy.

The real killer isn't throughput—it's jitter. When the Young Gen heap fills with short-lived frame objects, the garbage collector pauses your entire program. For 200 milliseconds, your gateway freezes. Mobile clients on sketchy WiFi think they've disconnected. They reconnect. Now you have thousands trying to connect at once.

Core principle: If you don't understand the bytes, you can't optimize away the crashes.


Death by a Thousand Small Objects

Let's trace a simple text message through a typical framework:

Code
1. NIO Selector detects readable bytes → allocates ReadEvent
2. Framework reads into new byte[4096] → heap allocation
3. Copies bytes into WebSocketFrame object → heap allocation  
4. Decodes UTF-8 into String → heap allocation
5. Wraps in TextMessage → heap allocation
6. Calls your handler

That's 4 objects per frame before your code runs. At 50,000 frames/sec, you're creating 200,000 objects/sec. Even if each is just 64 bytes, you're allocating 12.8 MB/sec for protocol overhead. The JVM's allocation rate spikes. When the heap fills, everything stops.

The WebSocket frame format is elegant: a header of 2 to 14 bytes, then raw payload. But frameworks hide this. They parse, validate, and box everything "for safety." At scale, that safety kills performance.


Our Approach: Zero-Copy Frame Parsing

[INSERT: frame_architecture.svg]

We're building four components:

1. Direct Buffer Recycling
Keep a pool of off-heap ByteBuffer instances. Each connection borrows a buffer, parses frames, then returns it. No allocations.
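A minimal sketch of such a pool (the class name FrameBufferPool and its sizing are illustrative assumptions, not the project's actual implementation):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical pool: pre-allocates direct (off-heap) buffers once,
// then lends them out per connection. Steady state: zero allocations.
final class FrameBufferPool {
    private final ConcurrentLinkedQueue<ByteBuffer> pool = new ConcurrentLinkedQueue<>();
    private final int bufferSize;

    FrameBufferPool(int poolSize, int bufferSize) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < poolSize; i++) {
            pool.add(ByteBuffer.allocateDirect(bufferSize)); // allocated once, up front
        }
    }

    ByteBuffer acquire() {
        ByteBuffer buf = pool.poll();
        // Fall back to a fresh buffer only if the pool is exhausted.
        return (buf != null) ? buf : ByteBuffer.allocateDirect(bufferSize);
    }

    void release(ByteBuffer buf) {
        buf.clear();   // reset position/limit for the next borrower
        pool.add(buf);
    }
}
```

The ConcurrentLinkedQueue keeps acquire/release lock-free, matching the gateway's no-synchronized-blocks principle.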

2. State Machine Parser
WebSocket frames aren't guaranteed to arrive complete in one TCP read. You might get 6 bytes of a 14-byte header. The parser remembers where it left off:

  • READING_HEADER: Accumulate bytes until we have the full header (2-14 bytes)

  • READING_MASK: If the MASK bit is set, read 4 masking bytes

  • READING_PAYLOAD: Read the exact number of payload bytes

  • COMPLETE: Unmask if needed, build frame, reset for next one
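The state progression above can be sketched as an enum-driven loop. This is a deliberately stripped-down illustration (it handles only the 0-125 length case, skips the mask bytes rather than applying them, and assumes the two header bytes arrive in the same read); the real parser must buffer across reads in every state:

```java
import java.nio.ByteBuffer;

// Simplified parser skeleton: the state field survives across parse() calls,
// so a frame split over several TCP reads is handled naturally.
final class ParserSketch {
    enum State { READING_HEADER, READING_MASK, READING_PAYLOAD, COMPLETE }

    private State state = State.READING_HEADER;
    private int needed = 2;      // bytes still required in the current state
    private boolean masked;

    // Feed whatever bytes arrived; returns true once a full frame is consumed.
    boolean parse(ByteBuffer in) {
        while (in.remaining() > 0 && state != State.COMPLETE) {
            switch (state) {
                case READING_HEADER -> {
                    if (in.remaining() < 2) return false;   // sketch: wait for both bytes
                    in.get();                               // b0: FIN/RSV/opcode (ignored here)
                    byte b1 = in.get();
                    masked = (b1 & 0x80) != 0;
                    needed = b1 & 0x7F;                     // 7-bit length (short form only)
                    state = masked ? State.READING_MASK : State.READING_PAYLOAD;
                }
                case READING_MASK -> {
                    if (in.remaining() < 4) return false;   // sketch: wait for all 4 bytes
                    in.position(in.position() + 4);         // skip mask in this sketch
                    state = State.READING_PAYLOAD;
                }
                case READING_PAYLOAD -> {
                    int take = Math.min(needed, in.remaining());
                    in.position(in.position() + take);      // consume what we have
                    needed -= take;
                    if (needed == 0) state = State.COMPLETE;
                }
                default -> { }
            }
        }
        return state == State.COMPLETE;
    }
}
```

Note how a partial read simply returns false and the next call picks up exactly where the last one stopped.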

3. Virtual Threads for Blocking I/O
Java 21's Virtual Threads let us write simple blocking code without sacrificing scalability. Each connection gets its own thread that blocks on SocketChannel.read(). The OS handles scheduling. This eliminates complex event loops—the parser handles frame state, not threading.
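A minimal sketch of that thread-per-connection accept loop (the class name, port, and handler hook are placeholders, not the project's actual GatewayServer):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Sketch: one virtual thread per accepted connection. Each thread just
// blocks on its own channel; the scheduler does the rest. No selector,
// no event loop.
final class GatewayAcceptLoop {
    private final ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();

    void serve(int port, Consumer<SocketChannel> handler) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(port));
            while (true) {
                SocketChannel client = server.accept();        // blocking accept
                executor.submit(() -> handler.accept(client)); // fresh virtual thread
            }
        }
    }
}
```

The handler passed to serve() would run the blocking read loop and frame parser from the previous section.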

4. Bitwise Protocol Parsing
The first byte packs 4 flags:

Code
FIN  RSV1 RSV2 RSV3 Opcode(4 bits)
1    0    0    0    0001 (text frame, final fragment)

We extract with bit shifts, not objects:

java
byte b0 = buffer.get();
boolean fin = (b0 & 0x80) != 0;
int opcode = b0 & 0x0F;

Zero allocations. The JIT compiles this down to a handful of native CPU instructions.


Understanding the WebSocket Frame

Flowchart

[Diagram: WebSocket frame processing flow, from raw bytes to a parsed WebSocketFrame record. A TCP packet (0x81 0x85 [mask] [payload]: FIN=1, OPCODE=TEXT(0x1), MASKED=1, length 5) arrives on the client SocketChannel. The ConnectionHandler calls clear(), read(buffer), then flip() (critical: flip() before reading!) and hands the off-heap ByteBuffer to the FrameParser. The state-machine loop (1. READING_HEADER extracts FIN and opcode; 2. READING_MASK reads 4 mask bytes; 3. READING_PAYLOAD accumulates bytes; 4. unmask via payload[i] ^= mask[i % 4]) produces a WebSocketFrame record (fin=true, opcode=0x1, masked=true, payload=ByteBuffer slice) with zero allocations. The Virtual Thread simply blocks on channel.read(); the OS scheduler handles context switching, so no explicit event loop is needed.]

[INSERT: parserstateflow.svg]

Every WebSocket message is wrapped in a frame. Here's the exact byte structure:

Code
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| opcode|M| Payload len |    Extended payload length    |
|I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
|N|V|V|V|       |S|             |   (if payload len==126/127)   |
| |1|2|3|       |K|             |                               |
+-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
|     Extended payload length continued, if payload len == 127  |
+ - - - - - - - - - - - - - - - +-------------------------------+
|                               | Masking-key, if MASK set to 1 |
+-------------------------------+-------------------------------+
| Masking-key (continued)       |          Payload Data         |
+-------------------------------- - - - - - - - - - - - - - - - +
:                     Payload Data continued ...                :
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
|                     Payload Data continued ...                |
+---------------------------------------------------------------+

Payload Length Encoding

The protocol uses variable-length encoding to save space:

  • If the 7-bit length is 0-125: that's the actual size

  • If 126: next 2 bytes contain the real length (16-bit)

  • If 127: next 8 bytes contain the real length (64-bit)

This means headers range from 2 to 14 bytes.
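The three cases above translate directly into code. A decoding sketch (it assumes the caller has already verified that enough bytes are available in the buffer):

```java
import java.nio.ByteBuffer;

// Sketch: decode the payload length per the 7-bit / 16-bit / 64-bit scheme.
final class PayloadLength {
    static long decode(ByteBuffer buf) {
        int len7 = buf.get() & 0x7F;                      // strip the MASK bit
        if (len7 <= 125) return len7;                     // short form: that's the size
        if (len7 == 126) return buf.getShort() & 0xFFFF;  // next 2 bytes, unsigned 16-bit
        return buf.getLong();                             // next 8 bytes (top bit must be 0)
    }
}
```

A production parser must also reject 64-bit lengths with the sign bit set, which the spec forbids.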

The Masking Dance

Clients MUST mask their payloads (servers MUST NOT). The mask is 4 random bytes. Each payload byte gets XORed:

java
for (int i = 0; i < payloadLength; i++) {
    payload[i] ^= maskKey[i % 4];
}

This prevents cache poisoning attacks through transparent proxies. We unmask on the server.

[INSERT: parserstatemachine.svg]


The ByteBuffer Trap

Here's a mistake that crashes production systems:

java
// WRONG
ByteBuffer buffer = ByteBuffer.allocate(1024);
channel.read(buffer);
byte b0 = buffer.get(); // ERROR: position is at END

After read(), the buffer's position is at the last byte read. You must flip:

java
channel.read(buffer);
buffer.flip(); // position=0, limit=bytes_read
byte b0 = buffer.get(); // Correct

Forgetting flip() means you read garbage from the unwritten region of the buffer, or hit a BufferUnderflowException once the position runs past the limit. This single mistake causes more incidents than memory leaks.


What to Watch in Production

State Machine

[Diagram: FrameParser state machine, handling partial TCP reads and variable-length headers. START → READING_HEADER (2 bytes); if len == 126 → READING_EXTENDED_LENGTH_16 (2 more bytes); if len == 127 → READING_EXTENDED_LENGTH_64 (8 more bytes); then READING_MASK (4 bytes, only if the MASK bit is set) → READING_PAYLOAD → COMPLETE, followed by reset() for the next frame. Payloads over 10MB or invalid opcodes transition to ERROR. "More bytes needed" loops back into the current state.]

Key invariants:

  • Parser must handle partial reads (TCP doesn't guarantee frame boundaries)

  • State persists across parse() calls until COMPLETE or ERROR

  • Zero allocations during state transitions (all buffers pre-allocated)

Performance critical:

  • State checks are simple enum comparisons (the JIT compiles them to a jump table)

  • AtomicReference gives lock-free, thread-safe state transitions

  • The payload buffer is allocated once, when the length is known (not incrementally)

1. Off-Heap Memory Leaks
Direct ByteBuffers live outside the heap, invisible to ordinary heap monitoring. If you allocate but never release them, you exhaust process memory: once the -XX:MaxDirectMemorySize limit is reached, the JVM throws OutOfMemoryError ("Direct buffer memory").

Monitor: BufferPoolMXBean via JMX. Track getMemoryUsed() for "direct" pool. If it grows unbounded, you have a leak.
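Reading that pool looks roughly like this (a monitoring sketch using the standard java.lang.management API; the class name is illustrative):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Sketch: poll the "direct" buffer pool and report bytes held off-heap.
// Call this periodically, or expose it through your metrics endpoint.
final class DirectMemoryProbe {
    static long directBytesUsed() {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();   // bytes currently held off-heap
            }
        }
        return -1;  // "direct" pool not found (unexpected on HotSpot)
    }
}
```

Plotting this value over time makes an unbounded leak obvious long before OutOfMemoryError.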

2. Parser State Distribution
In steady state, 99% of connections should be in READING_HEADER (idle). If many are stuck in READING_PAYLOAD, either clients are sending huge frames (possible attack) or network backpressure is slowing reads.

Metric: Histogram of connection states. Alert if READING_PAYLOAD > 5%.

3. Frame Throughput vs GC Pressure
Goal: stable allocation rate under 10MB/sec even at 100k connections.

Watch in VisualVM:

  • Heap usage should be flat (sawtooth indicates churn)

  • GC pauses should be under 10ms

  • Thread count should be constant

4. Partial Frame Accumulation
A slowloris attack: client sends 1 byte every 5 seconds. Your buffer accumulates but never completes.

Defense: Per-connection timeout. If a frame isn't complete within 30 seconds, disconnect.
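One way to enforce that deadline (a sketch; the class and field names are illustrative, and the read loop would call these hooks around each partial read):

```java
// Sketch: track when the current frame started; the read loop checks
// the deadline after every partial read and disconnects slow senders.
final class FrameDeadline {
    static final long FRAME_TIMEOUT_MILLIS = 30_000;

    private long frameStartMillis = -1;  // -1 = idle, between frames

    // Called when bytes arrive while no frame is in progress.
    void onBytesForNewFrame(long nowMillis) {
        if (frameStartMillis < 0) frameStartMillis = nowMillis;
    }

    // True if the in-progress frame has exceeded its 30-second budget.
    boolean expired(long nowMillis) {
        return frameStartMillis >= 0
            && nowMillis - frameStartMillis > FRAME_TIMEOUT_MILLIS;
    }

    // Called when the parser reaches COMPLETE.
    void onFrameComplete() {
        frameStartMillis = -1;
    }
}
```

Because the clock is passed in, the logic is trivially testable without sleeping.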


Why This Matters at Scale

At 1 million concurrent WebSocket connections:

  • 1 byte per connection = 1MB of parser state

  • 1 String allocation per frame = roughly 10GB/sec of garbage at 100 frames/sec per connection (100 million allocations per second at ~100 bytes each)

  • 1ms GC pause = every connection stalls, triggering cascading timeouts

Discord learned this the hard way: garbage-collector pauses in a high-throughput service caused periodic latency spikes across their fleet, and rewriting the hot path to eliminate allocations cut tail latency by more than an order of magnitude.

The lesson: You can't scale what you don't measure in bytes.

