A million sockets and Raspberry Pi 4 (updated)
WebSockets are everywhere on the modern web. They deliver live financial updates, betting, social feeds, chat, collaborative editing, gaming, notifications and pretty much everything that is not a client-initiated request. If something updates on its own, chances are you stumbled upon a WebSocket. Even distributed tech like WebRTC requires WebSockets for peer discovery and signaling.
This luxury of having web apps react to live events and updates does not come for free, however. Because WebSockets are persistent, they last for the entire user session and potentially longer. This means a popular web app could see a very large number of long-held connections, requiring significant amounts of RAM.
To cope with this pressure, some companies see no other solution than to go on a spending spree, paying for more powerful hardware at increased cost, tackling the problem with brute force as the customer base grows. I want to present alternative strategies which may or may not apply to your business needs, but nonetheless could be informative.
Contrary to popular belief, the kind of software your back-end runs can significantly alter your overall hardware requirements. Take the extremely popular Socket.IO for Node.js as an example. Its current v2.3.0 will require at least 6x the reasonable amount of total RAM per individual socket. Not only is this very wasteful, but the fact that it is a non-standard protocol makes it impossible to replace without breaking existing APIs.
Both HTTP and WebSocket connections are TCP sockets. The exact memory footprint of a TCP socket will depend greatly on its throughput (data/sec). Because of the reliable nature of TCP, every TCP socket needs to hold sent data in a retransmission buffer until acknowledged by the receiver. Acknowledgement happens every half round trip time (delay back and forth), so for a TCP socket to maintain a constant throughput of 25 MB/sec at 100ms RTT, it will need 1.25 MB of retransmission buffer.
Typical use of HTTP is to open a connection, serve a few requests and close down. High throughput is something we want here; we aim to transfer data, potentially lots of it, from server to client as fast as possible. IP packets are sent, not guided by any real-time event but rather by network congestion. As packets reach the receiver, new ones flow.
Luckily, WebSockets are typically used to transfer sporadic messages of only a few bytes each. If anything, the measurement we care about with WebSockets is latency, not throughput. Since messages are sent on the basis of real-time events rather than network congestion, the connection is idle for the most part. I mean “sporadic” and “idle” from the point of view of the computer — here a second is an eternity.
Measuring a few well-known websites utilizing WebSockets, I get a range of anything from 2 KB/sec up to 17 KB/sec of throughput, where 17 KB/sec was measured on a very active trading site.
With an RTT of 100ms, which is quite high, we would need a retransmission buffer of only 870 bytes.
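The buffer sizing above can be sketched as a small calculation. This is only an illustration: the function name and the `rtt_fraction` parameter are hypothetical, and it follows the article's assumption that sent data is held for half a round trip before being acknowledged. The conventional bandwidth-delay product uses the full RTT instead (`rtt_fraction=1.0`), which would double both results. "KB" is taken to mean 1024 bytes, which is what reproduces the 870-byte figure.

```python
def retransmission_buffer_bytes(throughput_bytes_per_sec: float,
                                rtt_ms: float,
                                rtt_fraction: float = 0.5) -> float:
    """Bytes sent but not yet acknowledged at a steady throughput.

    rtt_fraction=0.5 is the article's half-RTT assumption;
    rtt_fraction=1.0 gives the conventional bandwidth-delay product.
    """
    hold_time_sec = (rtt_ms / 1000.0) * rtt_fraction
    return throughput_bytes_per_sec * hold_time_sec


# Bulk HTTP-style transfer: 25 MB/sec at 100 ms RTT.
bulk = retransmission_buffer_bytes(25_000_000, 100)

# Busy WebSocket: 17 KB/sec at 100 ms RTT (KB assumed to be 1024 bytes).
trading = retransmission_buffer_bytes(17 * 1024, 100)

print(f"bulk transfer: {bulk / 1e6:.2f} MB")
print(f"trading site:  {trading:.0f} bytes")
```

The gap between the two results, three orders of magnitude, is the whole argument: a socket that only carries sporadic small messages has no use for a buffer sized for bulk transfer.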
As a proof-of-concept, we aim for at least a million sockets on this Raspberry Pi 4. With good margins we can select a 2 KB retransmission buffer and still fit in RAM.
There are many options for TCP in the Linux kernel, including various buffer sizes. However, I haven’t been successful in setting very small sizes. It seems there’s been no real reason to support very small buffer sizes, and I don’t blame anyone for it. In my testing, the Linux kernel requires at least ~4.5 KB per TCP socket, which makes sense.
With this memory footprint, utilizing µWebSockets as server implementation I was only able to establish and keep 789k WebSockets on this Raspberry Pi 4. More than this and I would run out of RAM.
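To put the two per-socket figures side by side, here is a rough back-of-the-envelope sketch. It assumes "KB" means 1024 bytes and counts only the per-socket footprint quoted in the text, ignoring whatever the application and the rest of the system need. The memory budget is not a stated spec of the board; it is simply what 789k sockets at ~4.5 KB add up to. The helper names are made up for illustration.

```python
KB = 1024  # assumption: the article's "KB" is 1024 bytes


def memory_needed_bytes(sockets: int, per_socket_bytes: int) -> int:
    """Total memory consumed by per-socket state alone."""
    return sockets * per_socket_bytes


def sockets_that_fit(budget_bytes: int, per_socket_bytes: int) -> int:
    """How many sockets fit in a given budget of per-socket state."""
    return budget_bytes // per_socket_bytes


kernel_per_socket = int(4.5 * KB)   # ~4.5 KB observed with kernel TCP
userspace_per_socket = 2 * KB       # 2 KB retransmission buffer target

# Memory implied by the 789k sockets reached with the kernel TCP stack.
kernel_budget = memory_needed_bytes(789_000, kernel_per_socket)

# Memory a million sockets would need at 2 KB each.
million_at_2kb = memory_needed_bytes(1_000_000, userspace_per_socket)

print(f"789k kernel sockets: {kernel_budget / 2**30:.2f} GiB")
print(f"1M sockets at 2 KB:  {million_at_2kb / 2**30:.2f} GiB")
print(f"2 KB sockets fitting in the same budget: "
      f"{sockets_that_fit(kernel_budget, userspace_per_socket):,}")
```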
But because I wanted to prove a point, and to experiment with my idea, I wrote a quick and dirty TCP implementation in user space and tested its viability like so:
- The Pi was connected via Ethernet port and cable to my router, which in turn was connected via Ethernet port and cable to my Linux laptop serving as the client.
- The laptop running the Linux kernel TCP implementation established TCP connections to the listening Raspberry Pi 4 in rapid succession.
- Every 16 seconds the Linux laptop would send a “ping” message for every socket, expecting a response within a reasonable time or else failing the test.
- The Raspberry Pi 4 would accept connections and make sure to respond to “pings” in time, measuring maximum latency.
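The shape of this test can be sketched at a much smaller scale. The following is not the harness used here, only a minimal illustration under stated assumptions: it uses Python's `asyncio` over ordinary kernel sockets on loopback, a newline-terminated `ping` as the message, and an echo handler standing in for the server. The names `echo_handler`, `ping_once` and `run_test` are hypothetical.

```python
import asyncio
import time


async def echo_handler(reader, writer):
    """Stand-in for the server: echo every line straight back."""
    try:
        while line := await reader.readline():
            writer.write(line)
            await writer.drain()
    finally:
        writer.close()


async def ping_once(reader, writer, timeout):
    """Send one ping and return the round-trip latency in seconds."""
    start = time.monotonic()
    writer.write(b"ping\n")
    await writer.drain()
    # Fail the test if no reply arrives within the timeout.
    reply = await asyncio.wait_for(reader.readline(), timeout)
    if reply != b"ping\n":
        raise RuntimeError(f"unexpected reply: {reply!r}")
    return time.monotonic() - start


async def run_test(connections=100, rounds=2, interval=0.1, timeout=1.0):
    server = await asyncio.start_server(echo_handler, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # Establish connections one at a time, in rapid succession.
    socks = []
    for _ in range(connections):
        socks.append(await asyncio.open_connection("127.0.0.1", port))

    # Ping every socket once per interval, tracking the worst latency.
    max_latency = 0.0
    for _ in range(rounds):
        latencies = await asyncio.gather(
            *(ping_once(r, w, timeout) for r, w in socks))
        max_latency = max([max_latency, *latencies])
        await asyncio.sleep(interval)

    for _, writer in socks:
        writer.close()
        await writer.wait_closed()
    server.close()
    return len(socks), max_latency


if __name__ == "__main__":
    held, worst = asyncio.run(run_test())
    print(f"{held} sockets held, max latency {worst * 1000:.1f} ms")
```

Scaling a client like this toward a million connections would run into file descriptor limits and ephemeral port exhaustion long before anything else, so treat it as a picture of the procedure rather than a load generator.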
This ran until reaching a million sockets, continuing for about an hour without failure. Here are some statistics:
- Receiving traffic was constant at roughly 3.5–3.8 Mbit/sec, as was the sending traffic. This comes, of course, from the constant pinging and responses. One million sockets sending a ping every 16 seconds means 62,500 messages every second, in both directions.
- CPU usage of the Raspberry was reported from 60% to 80%. The test ran for about an hour, stable.
- Maximum latency, through the entire test, for every socket, was 2 seconds. Anything slower than 16 seconds would break down, as the next ping would be appended.
- Increasing the ping interval from 16 to 32 seconds brings down the overall maximum latency to less than one second, and everything below 800k connections with a 16-second ping interval was also below 1 second.
- Reaching one million sockets took roughly 15 minutes when making one connection at a time. I’m sure this time could be decreased by connecting in batches of 10 instead, but never mind for now.
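The rates quoted in these statistics follow from simple division. A quick sketch, with hypothetical helper names, using only the socket count, ping intervals and ramp-up time given above:

```python
def messages_per_second(sockets: int, ping_interval_sec: float) -> float:
    """Pings arriving per second when every socket pings once per interval."""
    return sockets / ping_interval_sec


def connections_per_second(sockets: int, ramp_up_minutes: float) -> float:
    """Average connection rate needed to reach `sockets` in the given time."""
    return sockets / (ramp_up_minutes * 60)


rate_16s = messages_per_second(1_000_000, 16)
rate_32s = messages_per_second(1_000_000, 32)
connect_rate = connections_per_second(1_000_000, 15)

print(f"16 s interval: {rate_16s:,.0f} messages/sec each way")
print(f"32 s interval: {rate_32s:,.0f} messages/sec each way")
print(f"ramp-up:       about {connect_rate:,.0f} connections/sec")
```

Doubling the ping interval halves the message rate the Pi has to keep up with, which fits the lower maximum latency seen at 32 seconds.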
A million TCP sockets communicating with the Linux kernel over Ethernet cable without issues for an hour, all served from this one Raspberry Pi 4!
While this was an over-the-top proof-of-concept experiment, I think the overall idea is reasonable. Even using the Linux kernel TCP sockets you may hold an incredible number of WebSockets using no more than a Raspberry Pi 4, 789k to be exact.
Adding TLS to the picture will of course alter the expectations by measurable amounts, but not by a landslide. OpenSSL will require about 30 KB per session by default, going down to 12 KB with memory savings. WolfSSL as an alternative implementation will bring requirements down to 8 KB.
That’s about 12 KB per TLS WebSocket, leaving room for at least a quarter of a million on one Raspberry Pi 4!
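The TLS estimate can be checked with the figures already given. This sketch assumes the TLS session cost is simply added on top of the ~4.5 KB kernel TCP socket, and it reuses the memory implied by the 789k plain-socket result as the budget. Both are my reading of the numbers, not measurements, and the names are illustrative only.

```python
TCP_KB = 4.5  # kernel TCP footprint per socket, from the earlier test

# Per-session TLS cost in KB, as quoted in the text.
TLS_KB = {
    "OpenSSL (default)": 30.0,
    "OpenSSL (memory savings)": 12.0,
    "WolfSSL": 8.0,
}

# Assumed budget: what 789k plain TCP sockets consumed.
BUDGET_KB = 789_000 * TCP_KB


def tls_sockets_that_fit(tls_kb: float,
                         tcp_kb: float = TCP_KB,
                         budget_kb: float = BUDGET_KB) -> int:
    """TLS sockets fitting in the budget at tcp_kb + tls_kb each."""
    return int(budget_kb // (tcp_kb + tls_kb))


for name, tls_kb in TLS_KB.items():
    per_socket = TCP_KB + tls_kb
    print(f"{name}: {per_socket} KB/socket, "
          f"{tls_sockets_that_fit(tls_kb):,} sockets")
```

Under these assumptions only the 8 KB option clears a quarter of a million, at roughly 12.5 KB per socket.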
That’s some serious bang for the buck either way if you ask me.