Day 1: The Handshake: Coding a raw HTTP-to-WebSocket upgrade handler from scratch.

Lesson 1 (60 min)

The Spring Boot Trap

Component Architecture

[Diagram: WebSocket connection lifecycle state machine]

  • TCP accept → AWAITING_HEADERS: selector registers OP_READ, reads into a ByteBuffer, 10-second timeout.
  • Headers complete, WS-Key extracted → COMPUTING_KEY: a virtual thread runs the SHA-1 + Base64 computation while the selector continues polling.
  • Accept key computed, writeQueue.offer → READY_FOR_UPGRADE: selector registers OP_WRITE, waits for the socket to become writable, 101 response bytes ready.
  • 101 response written, upgrade complete → WEBSOCKET_ACTIVE: ready for frame parsing and bidirectional messaging (Day 2 content).
  • Timeout (>10s), read error, invalid headers, crypto failure, or write error → CLOSED.

State transition rules. Timing constraints: AWAITING_HEADERS lasts at most 10 seconds; COMPUTING_KEY takes ~1-2 microseconds; READY_FOR_UPGRADE is instant (buffered). Thread ownership: states 1, 3, and 4 run on the selector thread; state 2 runs on a virtual thread. Memory profile: one 8KB direct buffer is allocated per connection at state 1.

⚡ Performance impact: 100,000 concurrent connections = 1 selector thread + ~10 carrier threads. 🧹 Reaper thread (virtual): runs every 5 seconds and closes connections stuck in AWAITING_HEADERS for more than 10s. Non-blocking crypto: the virtual thread never blocks the selector; thousands of concurrent SHA-1 computations share ~10 carrier threads.


A junior engineer approaches the WebSocket problem like this:

import jakarta.websocket.OnOpen;
import jakarta.websocket.Session;
import jakarta.websocket.server.ServerEndpoint;

@ServerEndpoint("/gateway")
public class GatewayEndpoint {
    @OnOpen
    public void onConnect(Session session) {
        // Magic happens here!
    }
}

They deploy it. It works for 100 users. Maybe 1,000. Then they hit 50,000 concurrent connections and the application starts exhibiting 5-second GC pauses. Thread dumps show 50,000 blocked threads waiting on InputStream.read(). The heap grows to 12GB despite each connection only holding a few kilobytes of state. The abstraction has hidden three critical failures:

  1. Thread-per-connection model: Each @ServerEndpoint typically spawns a platform thread that blocks on socket reads. At 100k connections, you’re asking the OS to context-switch between 100k threads. The scheduler collapses.

  2. Hidden allocations: The framework parses HTTP headers into String objects, allocates HashMap instances for header storage, and boxes primitive values. At 100k handshakes per minute, this churns hundreds of megabytes of short-lived garbage through the young generation.

  3. No visibility into the protocol: When a client sends a malformed handshake or exploits Slowloris-style attacks (sending headers byte-by-byte), you can’t see it because you’re operating above the socket layer.

Discord’s Gateway doesn’t use @ServerEndpoint. WhatsApp doesn’t use JSR 356. They operate at the NIO layer, where a single thread can multiplex tens of thousands of connections through one Selector, and where every byte allocation is explicit.
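What that NIO-layer loop looks like, as a minimal sketch (illustrative names, not any production gateway's actual code; `poll` runs one pass of what production code would wrap in `while (running)`):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SelectorLoop {
    // One pass of the event loop; production code runs this in while (running)
    static void poll(Selector selector, ServerSocketChannel server) throws IOException {
        selector.select(100);                        // wait up to 100 ms for events
        Iterator<SelectionKey> it = selector.selectedKeys().iterator();
        while (it.hasNext()) {
            SelectionKey key = it.next();
            it.remove();
            if (key.isAcceptable()) {                // OP_ACCEPT: new client connected
                SocketChannel client = server.accept();
                if (client != null) {
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                }
            } else if (key.isReadable()) {
                // OP_READ: handshake bytes arrived -- parse here, never block this thread
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(0));       // ephemeral port for the demo
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
        poll(selector, server);                      // one iteration, then exit
        selector.close();
        server.close();
    }
}
```

One thread, many sockets: the OS tells us which channels are ready, and we only touch those.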

The Failure Mode: Death by a Thousand Handshakes

[Diagram: Flux Gateway, Day 1 handshake architecture]

  • Client connections (10,000+ concurrent) feed a single selector thread via TCP SYN.
  • Selector thread (one OS thread): while (running) { selector.select(); /* block */ }, then dispatches events: OP_ACCEPT → new ConnectionState(), OP_READ → parse handshake headers, OP_WRITE → send 101 response.
  • Virtual thread pool (newVirtualThreadPerTaskExecutor): SHA-1 hashing is offloaded here; thousands of virtual threads map onto ~10 carrier threads, signalling completion via wakeup + write queue.
  • Reaper thread (virtual): every 5 seconds, closes connections stuck in AWAITING_HEADERS for more than 10s.
  • Connection state registry: ConcurrentHashMap<SelectionKey, ConnectionState>, with phase = AWAITING_HEADERS | COMPUTING_KEY | READY_FOR_UPGRADE.
  • Direct ByteBuffer pool: ByteBuffer.allocateDirect(8KB) per-connection read buffer, off-heap (not GC'd); 10k connections = 80MB.


The WebSocket handshake is deceptively simple. The client sends:

GET /gateway HTTP/1.1
Host: flux.chat
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

The server must respond with:

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

That Sec-WebSocket-Accept value is Base64(SHA-1(Sec-WebSocket-Key + "258EAFA5-E914-47DA-95CA-C5AB0DC85B11")). Simple, right?
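The derivation really is a few lines of Java. A sketch (the class name is mine; the GUID and the sample key/accept pair come straight from RFC 6455):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class AcceptKey {
    // Fixed GUID defined by RFC 6455, section 4.2.2
    private static final String WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

    static String compute(String secWebSocketKey) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(
                (secWebSocketKey + WS_GUID).getBytes(StandardCharsets.US_ASCII));
        return Base64.getEncoder().encodeToString(digest);
    }

    public static void main(String[] args) throws Exception {
        // The sample key from the handshake above yields the accept value shown
        System.out.println(compute("dGhlIHNhbXBsZSBub25jZQ=="));
        // → s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
    }
}
```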

Here’s what kills naive implementations at scale:

The Heap Explosion: Parsing each header line with BufferedReader.readLine() allocates a new String. For a typical handshake with 8 headers, that’s 8 String allocations + 1 HashMap + 8 Map.Entry objects per connection. At 10,000 handshakes/sec, you’re allocating 170,000 objects per second just to read headers. The young generation collector runs every 200ms.
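For contrast, detecting the end of the header block with pure index arithmetic over the read buffer allocates nothing per connection. A sketch (class and method names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class HeaderScan {
    // Returns the index just past "\r\n\r\n", or -1 if headers are incomplete.
    // Absolute gets over the buffer: no Strings, no maps, no garbage.
    static int endOfHeaders(ByteBuffer buf) {
        for (int i = 3; i < buf.position(); i++) {
            if (buf.get(i - 3) == '\r' && buf.get(i - 2) == '\n'
                    && buf.get(i - 1) == '\r' && buf.get(i) == '\n') {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(8192);
        buf.put("GET /gateway HTTP/1.1\r\nHost: flux.chat\r\n\r\n"
                .getBytes(StandardCharsets.US_ASCII));
        System.out.println(endOfHeaders(buf)); // → 42 (one byte past the blank line)
    }
}
```

If the scan returns -1, the connection simply stays in AWAITING_HEADERS until more bytes arrive or the reaper times it out.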

The Thread Wall: If you block a thread per connection during the handshake phase (waiting for the client to send all headers), you need 10,000 threads to handle 10,000 concurrent handshakes. Linux defaults to 8MB stack per thread. That’s 80GB of virtual memory just for stacks before you’ve stored a single byte of application data.

The Crypto Bottleneck: SHA-1 computation isn’t free. On a modern CPU, it takes ~1-2 microseconds. If you’re doing this on the selector thread (the single thread handling all I/O), you’ve just introduced 10-20ms of latency for every 10,000 concurrent handshakes because the selector can’t poll for new events while it’s computing hashes.
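One way to keep that hash off the selector thread is a virtual thread executor (Java 21+). A sketch, with names of my own choosing; the comment marks where real code would re-enter the selector's write path:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CryptoOffload {
    static final ExecutorService workers = Executors.newVirtualThreadPerTaskExecutor();
    static final String WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

    // Called from the selector thread: submit and return immediately.
    static Future<String> offload(String wsKey) {
        return workers.submit(() -> {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] d = sha1.digest((wsKey + WS_GUID).getBytes(StandardCharsets.US_ASCII));
            return Base64.getEncoder().encodeToString(d);
            // real code: stash the accept key, writeQueue.offer(...), selector.wakeup()
        });
    }

    public static void main(String[] args) throws Exception {
        Future<String> f = offload("dGhlIHNhbXBsZSBub25jZQ==");
        System.out.println(f.get()); // → s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
        workers.shutdown();
    }
}
```

The selector thread pays only the cost of `submit`, which is nanoseconds, and the microseconds of SHA-1 work land on the carrier-thread pool instead.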

The Flux Architecture: Reactor Pattern + Virtual Threads

Sequence Diagram

[Diagram: WebSocket handshake sequence flow. Participants: Client, Selector Thread, HandshakeProcessor, Virtual Thread, ConnectionState]

  • Client TCP SYN (connect) → selector OP_ACCEPT → new ConnectionState(channel), phase AWAITING_HEADERS, register OP_READ.
  • Client sends the HTTP handshake request (GET /gateway HTTP/1.1 with Upgrade: websocket, Connection: Upgrade, Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==).
  • Selector thread: channel.read(buffer) into a direct ByteBuffer; HandshakeProcessor.parse(buffer) scans for \r\n\r\n, extracts the WS-Key → HandshakeResult(true, key); state.phase(COMPUTING_KEY).
  • virtualExecutor.submit(() -> ...) returns immediately (non-blocking); the virtual thread computes the SHA-1 digest + Base64 (~1-2 μs), delivers acceptKey(result), sets phase(READY), then writeQueue.offer() + selector.wakeup().

Critical pattern: the selector thread never blocks on crypto computation; it immediately submits to a virtual thread and continues handling other events.

Our architecture separates concerns:

  1. The Selector Thread: A single OS thread running a tight loop with Selector.select(). It handles three events:

  • OP_ACCEPT: New client connected

  • OP_READ: Client sent handshake data

  • OP_WRITE: Ready to send 101 response

  2. The Handshake Processor: A zero-allocation state machine that parses HTTP headers directly from a ByteBuffer using index arithmetic (no String splits, no regex). It extracts the Sec-WebSocket-Key as a byte range, not a String.

  3. The Crypto Workers: Virtual threads (Project Loom) handle the SHA-1 computation. When handshake headers are complete, we submit the key bytes to a virtual thread executor. This offloads blocking work without spawning OS threads.

  4. The Connection Registry: A ConcurrentHashMap<SelectionKey, ConnectionState> tracking each socket’s phase (AWAITING_HEADERS, COMPUTING_KEY, READY_FOR_UPGRADE).
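The registry and its per-connection state can be sketched in a few lines, assuming the phases and the 8KB direct buffer described above (class and field names are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.util.concurrent.ConcurrentHashMap;

public class Registry {
    public enum Phase { AWAITING_HEADERS, COMPUTING_KEY, READY_FOR_UPGRADE, WEBSOCKET_ACTIVE, CLOSED }

    // Per-connection state: current phase plus the 8KB off-heap read buffer
    public static final class ConnectionState {
        public volatile Phase phase = Phase.AWAITING_HEADERS;
        public final ByteBuffer readBuffer = ByteBuffer.allocateDirect(8 * 1024);
        public final long acceptedAt = System.nanoTime(); // reaper checks this against the 10s timeout
    }

    // Selector thread inserts on OP_ACCEPT; virtual threads flip the phase field
    public static final ConcurrentHashMap<SelectionKey, ConnectionState> registry =
            new ConcurrentHashMap<>();

    public static void main(String[] args) {
        ConnectionState st = new ConnectionState();
        System.out.println(st.phase); // → AWAITING_HEADERS
    }
}
```

The `phase` field is volatile because the selector thread and a virtual thread hand the connection back and forth; the map itself handles concurrent insert and remove.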
