Problem - Increase in latencies (p90 > 30s) for a simple WebSocket server hosted on a VM.
Repro
Run a simple WebSocket server on a single VM. The server simply receives a request and upgrades it to a WebSocket, with no other logic. The client continuously sends 50 parallel requests for a period of 5 minutes (so approximately 3000 requests in total).
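As a rough illustration only (the actual repro code, with clustering enabled, is in the gist linked under Edit - 1 below), a bare upgrade-only server in Node/TypeScript using the `ws` package might look like this; the port is a placeholder:

```typescript
// Minimal upgrade-only WebSocket server: no business logic, so any latency
// the client measures comes from the network, the OS, or the scheduler.
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 }); // placeholder port

wss.on("connection", (socket) => {
  // The handshake is complete at this point; close right away, since the
  // test only measures how long the upgrade itself takes.
  socket.close();
});
```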
Issue
Most requests have a latency in the range of 100 ms-2 s. However, for 300-500 of the requests we observe very high latencies (10-40 s, with p90 greater than 30 s), while some hit TCP timeouts (the Linux default of 127 s).
When analyzing the VM processes, we observe that while requests are taking a long time, the node process loses its CPU share in favor of processes started by the VM.
Further Debugging
Increasing process priority (renice) and I/O priority (ionice) did not solve the problem.
Scaling the VM up to 8 cores and 32 GiB of memory did not solve the problem either.
Edit - 1
Repro code (clustering enabled) - https://gist.github.com/Sid200026/3b506a9f77cfce3fa4efdd1ec9dd29bc
When monitoring active processes via htop, we find that the processes started by the following two commands are the ones causing the issue:
python3 -u bin/WALinuxAgent-2.9.0.4-py2.7.egg -run-exthandlers
/usr/lib/linux-tools/5.15.0-1031-azure/hv_kvp_daemon -n
I developed a server which accepts many incoming TCP client connections and transfers data between them. Client A may send some data via connection A targeting client B; the server forwards the data to connection B so that client B receives it.
I am doing my testing on a PC with an Intel Core i7-8700 CPU @ 3.20 GHz and 32 GB RAM.
I launched two of these servers and have 50 client threads constantly running.
Each thread repeats the following steps at an interval of 150~300 ms (see the sketch after this list):
Randomly disconnects from and reconnects to one of the servers
When a client connects, the server creates a new record in a SQL Server database.
When a client disconnects, the server updates that record in the database.
Sends 16 KB of data to another randomly picked connection, and receives the 16 KB back.
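The original implementation isn't shown here, so the following is only a hedged TypeScript sketch of one such client loop; the port, hostnames, and relay framing are all assumptions:

```typescript
// Hypothetical sketch of one client thread's loop, illustrating the test
// protocol above (the real test harness is not this code).
import { Socket } from "net";

const PAYLOAD = Buffer.alloc(16 * 1024); // the 16 KB test payload
const SERVERS = ["server-a", "server-b"]; // placeholder hostnames

const randomDelay = () => 150 + Math.random() * 150; // 150~300 ms
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function clientLoop(): Promise<void> {
  for (;;) {
    // Randomly pick a server and connect; on connect, the server
    // creates a DB record for this connection.
    const host = SERVERS[Math.floor(Math.random() * SERVERS.length)];
    const sock = new Socket();
    await new Promise<void>((resolve, reject) => {
      sock.once("error", reject);
      sock.connect(9000, host, resolve); // placeholder port
    });
    // Send 16 KB addressed to a random peer; the server relays it and
    // the peer echoes the 16 KB back (framing details omitted).
    sock.write(PAYLOAD);
    await sleep(randomDelay());
    sock.destroy(); // disconnect; the server updates the DB record
    await sleep(randomDelay());
  }
}
```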
This is what I observed about the two servers in the Windows Task Manager:
There is no memory leak; memory stabilizes around 28 MB.
CPU usage varies between 8% and 18%.
Power usage varies between "moderate" to "very high".
My question is:
I have not worked with apps using TCP connections and threads before, so I have no sense of whether these figures are unreasonably high - if they are, then I must have written the code very inefficiently.
Can you tell?
I am using Cassandra with the following setup:
21 nodes, AWS EC2 i3.2xlarge, version 3.11.4.
The application opens about 5000 connections per node (so about 100k connections across the cluster) using the DataStax Java driver.
The application autoscales and frequently opens/closes connections.
The number of connections opened at once by the app servers can reach up to 500 per node (opened simultaneously on all nodes, so about 10k connections opening at the same time across the cluster).
This causes load spikes on Cassandra and increases read and write latency.
I have noticed that each time connections open/close there is a high number of reads from system_auth.roles and system_auth.role_permissions.
How can I prevent this load and resolve the issue?
You need to modify your application to use as few connections as possible. Keep the following in mind (see the sketch after this list):
Create the Cluster/Session object once at startup and keep it. Session initialization is a very expensive operation; it adds load to Cassandra as well as to your application.
You can increase the number of simultaneous requests per connection instead of opening new connections; the protocol allows up to 32k requests per connection. That said, if you have too many requests in flight, it's a sign that Cassandra can't keep up with the workload and can't answer fast enough. See the documentation on connection pooling.
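The question uses the Java driver, but the same idea can be illustrated in TypeScript with the DataStax Node.js driver (cassandra-driver); a long-lived client with tuned pooling might look roughly like this, where contact points and numbers are placeholders:

```typescript
// One shared, long-lived client with a small connection pool and many
// in-flight requests per connection, instead of opening/closing sessions.
// (Illustrative sketch only; the question uses the Java driver.)
import cassandra from "cassandra-driver";

const client = new cassandra.Client({
  contactPoints: ["10.0.0.1", "10.0.0.2"], // placeholders
  localDataCenter: "dc1", // placeholder
  pooling: {
    // Few connections per host...
    coreConnectionsPerHost: { [cassandra.types.distance.local]: 2 },
    // ...but allow many simultaneous requests on each of them.
    maxRequestsPerConnection: 2048,
  },
});

async function main(): Promise<void> {
  // Connect once at startup and reuse this client for the process lifetime;
  // never create a new session per request or per autoscaling event.
  await client.connect();
}

main().catch(console.error);
```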
I'm implementing a WebSocket Secure (wss://) service for an online game where all users will be connected to the service as long as they are playing. This means a high number of simultaneous connections, although the traffic itself won't be a big problem, as the service is used for chat, storage and notifications, not for real-time data synchronization.
I wanted to use Alchemy-Websockets, but it doesn't support TLS (wss://), so I have to look for another library such as Fleck (or something similar).
Alchemy has been tested with a high number of simultaneous connections, but I didn't find similar tests for Fleck, so I need to get some real info from users of Fleck.
I know that Fleck is non-blocking and uses async calls, but I need some real-world data, because it might be abusing threads, the garbage collector, or some other aspect that isn't visible at lower connection counts.
I will use C# for the client as well, so I need neither hybi-XX compatibility nor a fallback; I just need scalability and TLS support.
I finally added Mono support to WebSocketListener.
Check here for how to run WebSocketListener on Mono.
10K connections is no small thing. WebSocketListener is asynchronous and it scales well. I have done tests with 10K connections and it should be fine.
My tests show that WebSocketListener is almost as fast and scalable as the Microsoft one, and that it performs better than Fleck, Alchemy and others.
I made a test on a Windows machine with a Core 2 Duo E8400 processor and 4 GB of RAM.
The results were not encouraging: Fleck started delaying handshakes after it reached ~1000 connections, i.e. it would take about one minute to accept a new connection.
The results improved when I used XSockets, which reached 8000 simultaneous connections before the same thing happened.
I tried to test on a Linux VPS with Mono, but I don't have enough experience with Linux administration, and a few system settings related to TCP etc. needed to be changed to allow a high number of concurrent connections, so on the default settings I could only reach ~1000; after that the app crashed (in both the Fleck and XSockets tests).
On the other hand, I tested node.js, and it seemed simpler to manage a very high number of connections; node didn't crash when it reached the TCP limits.
All the tests were echo tests: the server sends the same message back to the client that sent it and to one random other connected client, and each connected client sends a random ~30-character text message to the server at a random interval between 0 and 30 seconds (a sketch of this protocol follows below).
I know my tests are not generic enough and I encourage anyone to run their own tests instead, but I just wanted to share my experience.
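For concreteness, here is a hedged Node/TypeScript sketch of that echo protocol using the `ws` package (the port is a placeholder; the actual tests used Fleck, XSockets and node.js servers):

```typescript
// Echo-test server sketch: every message goes back to its sender plus one
// random other connected client, as in the tests described above.
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 }); // placeholder port

wss.on("connection", (socket) => {
  socket.on("message", (data) => {
    // Echo to the sender.
    socket.send(data);
    // Echo to one random other connected client, if there is one.
    const others = [...wss.clients].filter(
      (c) => c !== socket && c.readyState === WebSocket.OPEN
    );
    if (others.length > 0) {
      others[Math.floor(Math.random() * others.length)].send(data);
    }
  });
});
```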
When we decided to try Fleck, we implemented a wrapper for the Fleck server and a JavaScript client API so that we could send acknowledgment messages back to the server. We wanted to test the performance of the server: message delivery time, percentage of lost messages, etc. The results were pretty impressive for us, and we are currently using Fleck in our production environment.
We have 4000-5000 concurrent connections during peak hours. On average 40 messages are sent per second. The acknowledged-message ratio (acknowledged messages / total sent messages) never drops below 0.994. The average round trip for messages is around 150 milliseconds (the duration between the server sending a message and receiving its ack). Finally, we did not have any memory-related problems due to the Fleck server even after heavy usage.
I have a chat app using node.js, but the NetworkOut metric on AWS is increasing dramatically. Here is a graph:
http://www.pictureshoster.com/files/jillq3urhlq7cgqwoyt.png
When NetworkOut increased, the number of accesses was low.
My AWS server:
Type: m1.small
ECUs: 1
vCPUs: 1
Memory: 1.7 GiB
Network Performance: Low
More graphs, from Nodetime.com:
Garbage Collection / Full GCs per minute
Free Memory Server
Process / Node RSS (MB)
URL: http://www.pictureshoster.com/files/70lhz327kvm9xvuicx9.jpg
Note that the graphs spike dramatically at 12:00, the moment the server became slow.
The chat has 250 sockets.
What is a basic configuration to host a chat that receives about 250 requests per minute, where each request is broadcast to approximately 200 sockets, generating approximately 250 x 200 = 50000 messages per minute between server and clients? (See the sketch below.)
Users refresh the page at will, constantly creating and destroying sockets.
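As a rough illustration of that fan-out (a hedged sketch with the `ws` package; the app's actual code isn't shown, and the payload size is an assumption):

```typescript
// Fan-out sketch: one incoming chat message is re-sent to every connected
// socket, so NetworkOut grows as (message size) x (number of sockets).
// 250 req/min x 200 sockets = ~50000 outbound messages/min; at an assumed
// 2 KB per message that is already ~100 MB/min of NetworkOut.
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 }); // placeholder port

wss.on("connection", (socket) => {
  socket.on("message", (data) => {
    for (const client of wss.clients) {
      if (client.readyState === WebSocket.OPEN) {
        client.send(data);
      }
    }
  });
});
```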
Help me.