Day 2: The Frame: Parsing WebSocket binary frames (Fin bit, Opcodes, Masking) manually.

Lesson 2 · 60 min

Day 2: The Frame

Parsing WebSocket Binary Frames from Scratch


The Wrong Way: Taking the Easy Road

Component Architecture

[Diagram: Flux Gateway component architecture. Clients (1..N) connect over WebSocket ("1M+ concurrent connections") to the GatewayServer (Java 21), whose ServerSocketChannel accept() loop on port 9001 hands each connection to a per-connection Virtual Thread via newVirtualThreadPerTaskExecutor(). Each thread runs a ConnectionHandler with a Frame Parser over an off-heap ByteBuffer. A MetricsCollector tracks activeConnections, totalFrames, and bytesReceived with lock-free AtomicLong counters, which the DashboardServer (HttpServer, port 8080) exposes over HTTP at /api/metrics.]

Key design principles:

1. One Virtual Thread per connection (blocking I/O model)
2. Off-heap ByteBuffer for zero-copy parsing
3. Lock-free atomic metrics (no synchronized blocks)
4. Stateful parser handles partial TCP reads

When you need WebSocket support, the tempting choice is grabbing a framework. Add @ServerEndpoint or Spring's WebSocketHandler, write five lines of code, and you're done. Ship it.

But here's what's hiding under the hood: the framework creates a new TextMessage object for every frame, wraps payloads in immutable containers, and validates UTF-8 even when you're sending binary data. At 10,000 connections sending pings every 30 seconds, that's 333 frames per second. Easy. At 1 million connections? 33,333 frames per second. Now the abstraction becomes your enemy.

The real killer isn't throughput—it's jitter. When the Young Gen heap fills with short-lived frame objects, the garbage collector pauses your entire program. For 200 milliseconds, your gateway freezes. Mobile clients on sketchy WiFi think they've disconnected. They reconnect. Now you have thousands trying to connect at once.

Core principle: If you don't understand the bytes, you can't optimize away the crashes.


Death by a Thousand Small Objects

Let's trace a simple text message through a typical framework:

Code
1. NIO Selector detects readable bytes → allocates ReadEvent
2. Framework reads into new byte[4096] → heap allocation
3. Copies bytes into WebSocketFrame object → heap allocation  
4. Decodes UTF-8 into String → heap allocation
5. Wraps in TextMessage → heap allocation
6. Calls your handler

That's 4 objects per frame before your code runs. At 50,000 frames/sec, you're creating 200,000 objects/sec. Even if each is just 64 bytes, you're allocating 12.8 MB/sec for protocol overhead. The JVM's allocation rate spikes. When the heap fills, everything stops.

The WebSocket frame format is elegant: a header of 2 to 14 bytes, then raw payload. But frameworks hide this. They parse, validate, and box everything "for safety." At scale, that safety kills performance.


Our Approach: Zero-Copy Frame Parsing

[INSERT: frame_architecture.svg]

We're building four components:

1. Direct Buffer Recycling
Keep a pool of off-heap ByteBuffer instances. Each connection borrows a buffer, parses frames, then returns it. No allocations.
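A minimal sketch of such a pool (the class name FrameBufferPool and its sizing are illustrative assumptions, not the project's actual implementation):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical pool: pre-allocates direct (off-heap) buffers once,
// then lends them out per connection. Steady state: zero allocations.
final class FrameBufferPool {
    private final ConcurrentLinkedQueue<ByteBuffer> pool = new ConcurrentLinkedQueue<>();
    private final int bufferSize;

    FrameBufferPool(int poolSize, int bufferSize) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < poolSize; i++) {
            pool.add(ByteBuffer.allocateDirect(bufferSize)); // allocated once, up front
        }
    }

    ByteBuffer acquire() {
        ByteBuffer buf = pool.poll();
        // Fall back to a fresh buffer only if the pool is exhausted.
        return (buf != null) ? buf : ByteBuffer.allocateDirect(bufferSize);
    }

    void release(ByteBuffer buf) {
        buf.clear();   // reset position/limit for the next borrower
        pool.add(buf);
    }
}
```

The ConcurrentLinkedQueue keeps acquire/release lock-free, matching the gateway's no-synchronized-blocks principle.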

2. State Machine Parser
WebSocket frames aren't guaranteed to arrive complete in one TCP read. You might get 6 bytes of a 14-byte header. The parser remembers where it left off:

  • READING_HEADER: Accumulate bytes until we have the full header (2-14 bytes)

  • READING_MASK: If the MASK bit is set, read 4 masking bytes

  • READING_PAYLOAD: Read the exact number of payload bytes

  • COMPLETE: Unmask if needed, build frame, reset for next one
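The state progression above can be sketched as an enum-driven loop. This is a deliberately stripped-down illustration (it handles only the 0-125 length case, skips the mask bytes rather than applying them, and assumes the two header bytes arrive in the same read); the real parser must buffer across reads in every state:

```java
import java.nio.ByteBuffer;

// Simplified parser skeleton: the state field survives across parse() calls,
// so a frame split over several TCP reads is handled naturally.
final class ParserSketch {
    enum State { READING_HEADER, READING_MASK, READING_PAYLOAD, COMPLETE }

    private State state = State.READING_HEADER;
    private int needed = 2;      // bytes still required in the current state
    private boolean masked;

    // Feed whatever bytes arrived; returns true once a full frame is consumed.
    boolean parse(ByteBuffer in) {
        while (in.remaining() > 0 && state != State.COMPLETE) {
            switch (state) {
                case READING_HEADER -> {
                    if (in.remaining() < 2) return false;   // sketch: wait for both bytes
                    in.get();                               // b0: FIN/RSV/opcode (ignored here)
                    byte b1 = in.get();
                    masked = (b1 & 0x80) != 0;
                    needed = b1 & 0x7F;                     // 7-bit length (short form only)
                    state = masked ? State.READING_MASK : State.READING_PAYLOAD;
                }
                case READING_MASK -> {
                    if (in.remaining() < 4) return false;   // sketch: wait for all 4 bytes
                    in.position(in.position() + 4);         // skip mask in this sketch
                    state = State.READING_PAYLOAD;
                }
                case READING_PAYLOAD -> {
                    int take = Math.min(needed, in.remaining());
                    in.position(in.position() + take);      // consume what we have
                    needed -= take;
                    if (needed == 0) state = State.COMPLETE;
                }
                default -> { }
            }
        }
        return state == State.COMPLETE;
    }
}
```

Note how a partial read simply returns false and the next call picks up exactly where the last one stopped.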

3. Virtual Threads for Blocking I/O
Java 21's Virtual Threads let us write simple blocking code without sacrificing scalability. Each connection gets its own thread that blocks on SocketChannel.read(). The OS handles scheduling. This eliminates complex event loops—the parser handles frame state, not threading.
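A minimal sketch of that thread-per-connection accept loop (the class name, port, and handler hook are placeholders, not the project's actual GatewayServer):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Sketch: one virtual thread per accepted connection. Each thread just
// blocks on its own channel; the scheduler does the rest. No selector,
// no event loop.
final class GatewayAcceptLoop {
    private final ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();

    void serve(int port, Consumer<SocketChannel> handler) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(port));
            while (true) {
                SocketChannel client = server.accept();        // blocking accept
                executor.submit(() -> handler.accept(client)); // fresh virtual thread
            }
        }
    }
}
```

The handler passed to serve() would run the blocking read loop and frame parser from the previous section.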

4. Bitwise Protocol Parsing
The first byte packs 4 flags:

Code
FIN  RSV1 RSV2 RSV3 Opcode(4 bits)
1    0    0    0    0001 (text frame, final fragment)

We extract with bit shifts, not objects:

java
byte b0 = buffer.get();
boolean fin = (b0 & 0x80) != 0;
int opcode = b0 & 0x0F;

Zero allocations. The JIT compiles this down to a handful of native CPU instructions.


Understanding the WebSocket Frame

Flowchart

[Diagram: WebSocket frame processing flow, from raw bytes to a parsed WebSocketFrame record. A TCP packet (0x81 0x85 [mask] [payload]: FIN=1, OPCODE=TEXT(0x1), MASKED=1, length 5) arrives on the client SocketChannel. The ConnectionHandler calls clear(), read(buffer), then flip() (critical: flip() before reading!) and hands the off-heap ByteBuffer to the FrameParser. The state-machine loop (1. READING_HEADER extracts FIN and opcode; 2. READING_MASK reads 4 mask bytes; 3. READING_PAYLOAD accumulates bytes; 4. unmask via payload[i] ^= mask[i % 4]) produces a WebSocketFrame record (fin=true, opcode=0x1, masked=true, payload=ByteBuffer slice) with zero allocations. The Virtual Thread simply blocks on channel.read(); the OS scheduler handles context switching, so no explicit event loop is needed.]

[INSERT: parserstateflow.svg]

Every WebSocket message is wrapped in a frame. Here's the exact byte structure:

Code
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| opcode|M| Payload len |    Extended payload length    |
|I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
|N|V|V|V|       |S|             |   (if payload len==126/127)   |
| |1|2|3|       |K|             |                               |
+-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
|     Extended payload length continued, if payload len == 127  |
+ - - - - - - - - - - - - - - - +-------------------------------+
|                               | Masking-key, if MASK set to 1 |
+-------------------------------+-------------------------------+
| Masking-key (continued)       |          Payload Data         |
+-------------------------------- - - - - - - - - - - - - - - - +
:                     Payload Data continued ...                :
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
|                     Payload Data continued ...                |
+---------------------------------------------------------------+

Payload Length Encoding

The protocol uses variable-length encoding to save space:

  • If the 7-bit length is 0-125: that's the actual size

  • If 126: next 2 bytes contain the real length (16-bit)

  • If 127: next 8 bytes contain the real length (64-bit)

This means headers range from 2 to 14 bytes.
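The three cases above translate directly into code. A decoding sketch (it assumes the caller has already verified that enough bytes are available in the buffer):

```java
import java.nio.ByteBuffer;

// Sketch: decode the payload length per the 7-bit / 16-bit / 64-bit scheme.
final class PayloadLength {
    static long decode(ByteBuffer buf) {
        int len7 = buf.get() & 0x7F;                      // strip the MASK bit
        if (len7 <= 125) return len7;                     // short form: that's the size
        if (len7 == 126) return buf.getShort() & 0xFFFF;  // next 2 bytes, unsigned 16-bit
        return buf.getLong();                             // next 8 bytes (top bit must be 0)
    }
}
```

A production parser must also reject 64-bit lengths with the sign bit set, which the spec forbids.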

The Masking Dance

Clients MUST mask their payloads (servers MUST NOT). The mask is 4 random bytes. Each payload byte gets XORed:

java
for (int i = 0; i < payloadLength; i++) {
    payload[i] ^= maskKey[i % 4];
}

This prevents cache poisoning attacks through transparent proxies. We unmask on the server.

[INSERT: parserstatemachine.svg]


The ByteBuffer Trap

Here's a mistake that crashes production systems:

java
// WRONG
ByteBuffer buffer = ByteBuffer.allocate(1024);
channel.read(buffer);
byte b0 = buffer.get(); // ERROR: position is at END

After read(), the buffer's position is at the last byte read. You must flip:

java
channel.read(buffer);
buffer.flip(); // position=0, limit=bytes_read
byte b0 = buffer.get(); // Correct

Forgetting flip() means you read garbage from the unwritten region of the buffer, or hit a BufferUnderflowException once the position runs past the limit. This single mistake causes more incidents than memory leaks.


What to Watch in Production

State Machine

[Diagram: FrameParser state machine, handling partial TCP reads and variable-length headers. START → READING_HEADER (2 bytes); if len == 126 → READING_EXTENDED_LENGTH_16 (2 more bytes); if len == 127 → READING_EXTENDED_LENGTH_64 (8 more bytes); then READING_MASK (4 bytes, only if the MASK bit is set) → READING_PAYLOAD → COMPLETE, followed by reset() for the next frame. Payloads over 10MB or invalid opcodes transition to ERROR. "More bytes needed" loops back into the current state.]

Key invariants:

  • Parser must handle partial reads (TCP doesn't guarantee frame boundaries)

  • State persists across parse() calls until COMPLETE or ERROR

  • Zero allocations during state transitions (all buffers pre-allocated)

Performance critical:

  • State checks are simple enum comparisons (the JIT compiles them to a jump table)

  • AtomicReference gives lock-free, thread-safe state transitions

  • The payload buffer is allocated once, when the length is known (not incrementally)

1. Off-Heap Memory Leaks
Direct ByteBuffers live outside the heap, invisible to ordinary heap monitoring. If you allocate but never release them, you exhaust process memory: once the -XX:MaxDirectMemorySize limit is reached, the JVM throws OutOfMemoryError ("Direct buffer memory").

Monitor: BufferPoolMXBean via JMX. Track getMemoryUsed() for "direct" pool. If it grows unbounded, you have a leak.
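Reading that pool looks roughly like this (a monitoring sketch using the standard java.lang.management API; the class name is illustrative):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Sketch: poll the "direct" buffer pool and report bytes held off-heap.
// Call this periodically, or expose it through your metrics endpoint.
final class DirectMemoryProbe {
    static long directBytesUsed() {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();   // bytes currently held off-heap
            }
        }
        return -1;  // "direct" pool not found (unexpected on HotSpot)
    }
}
```

Plotting this value over time makes an unbounded leak obvious long before OutOfMemoryError.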

2. Parser State Distribution
In steady state, 99% of connections should be in READING_HEADER (idle). If many are stuck in READING_PAYLOAD, either clients are sending huge frames (possible attack) or network backpressure is slowing reads.

Metric: Histogram of connection states. Alert if READING_PAYLOAD > 5%.

3. Frame Throughput vs GC Pressure
Goal: stable allocation rate under 10MB/sec even at 100k connections.

Watch in VisualVM:

  • Heap usage should be flat (sawtooth indicates churn)

  • GC pauses should be under 10ms

  • Thread count should be constant

4. Partial Frame Accumulation
A slowloris attack: client sends 1 byte every 5 seconds. Your buffer accumulates but never completes.

Defense: Per-connection timeout. If a frame isn't complete within 30 seconds, disconnect.
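One way to enforce that deadline (a sketch; the class and field names are illustrative, and the read loop would call these hooks around each partial read):

```java
// Sketch: track when the current frame started; the read loop checks
// the deadline after every partial read and disconnects slow senders.
final class FrameDeadline {
    static final long FRAME_TIMEOUT_MILLIS = 30_000;

    private long frameStartMillis = -1;  // -1 = idle, between frames

    // Called when bytes arrive while no frame is in progress.
    void onBytesForNewFrame(long nowMillis) {
        if (frameStartMillis < 0) frameStartMillis = nowMillis;
    }

    // True if the in-progress frame has exceeded its 30-second budget.
    boolean expired(long nowMillis) {
        return frameStartMillis >= 0
            && nowMillis - frameStartMillis > FRAME_TIMEOUT_MILLIS;
    }

    // Called when the parser reaches COMPLETE.
    void onFrameComplete() {
        frameStartMillis = -1;
    }
}
```

Because the clock is passed in, the logic is trivially testable without sleeping.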


Why This Matters at Scale

At 1 million concurrent WebSocket connections:

  • 1 byte per connection = 1MB of parser state

  • 1 String allocation per frame = roughly 10GB/sec of garbage at 100 frames/sec per connection (100 million allocations per second at ~100 bytes each)

  • 1ms GC pause = every connection stalls, triggering cascading timeouts

Discord learned this the hard way: garbage-collector pauses in a high-throughput service caused periodic latency spikes across their fleet, and rewriting the hot path to eliminate allocations cut tail latency by more than an order of magnitude.

The lesson: You can't scale what you don't measure in bytes.

