Efficiency of multithreading and TCP socket coding - multithreading

I developed a server, which accepts many incoming TCP client connections, and transfers data between them. Client A may send some data via connection A targeting client B. The server forwards the data to connection B so that client B gets the data.
I am doing my testing on a PC with Intel Core i7-8700 CPU # 3.2GHz and 32 GB RAM.
I launched two of these servers, and have 50 threads constantly running.
Each thread repeats the following steps at an interval of 150~300 ms:
Randomly disconnects and connects to one of the servers
When connected to a server, the server creates a new record in a SQL Server database.
When disconnected from a server, the server updates that record in the database.
Sends 16 KB of data to another randomly picked connection, and receives the 16 KB back.
This is what I observed about the two servers on the Windows Task Manager:
There is no memory leak, stabilizes around 28 MB.
CPU usage varies between 8% to 18%.
Power usage varies between "moderate" to "very high".
My questions is:
I have not worked with apps with TCP connections and threads before. So I have no feeling on whether these figures look unreasonably high - if they are, then I must have done the coding in a very inefficient way.
Can you tell?

Related

Node process often loses CPU time on Linux VM which increases latencies for client requests

Problem - Increase in latencies ( p90 > 30s ) for a simple WebSocket server hosted on VM.
Repro
Run a simple websocket server on a single VM. The server simply receives a request and then upgrades it to websocket without any logic. The client will continuously send 50 parallel requests for a period of 5 mins ( so approximately 3000 requests ).
Issue
Most requests have a latency of the range 100ms-2s. However, for 300-500 requests, we observe that latencies are high ( 10-40s with p90 greater than 30s ) while some have TCP timeouts (Linux default timeout of 127s).
When analyzing the VM processes, we observe that when requests are taking a lot of time, the node process loses it CPU share in favor of some processes started by the VM.
Further Debugging
Increasing process priority (renice) and i/o priority (ionice) did not solve the problem
Increasing cores and memories to 8 core, 32 GiB memory did not solve the problem.
Edit - 1
Repro Code ( clustering enabled ) - https://gist.github.com/Sid200026/3b506a9f77cfce3fa4efdd1ec9dd29bc
When monitoring active processes via htop, we find that the processes started by the following commands are causing an issue
python3 -u bin/WALinuxAgent-2.9.0.4-py2.7.egg -run-exthandlers
/usr/lib/linux-tools/5.15.0-1031-azure/hv_kvp_daemon -n

Socket.io hangs after 2k concurrent connections

I have purchased 2 vCPU core and 4 GB Ram memory VPS server and deployed nodejs Socket.io server. its working fine without any issue upto 2k concurrent connection. But this limit is very small according to me. when connection is reached at 3k socketio server hang and stopped working.
Normally memory usage is 300mb but after 3k connection memory usage is reaching upto 2.5 GB and not emitting packets for several seconds and after that works for very few second and server hang again.
My server is not very small for this amount of connection.
Is there any suggestion for optimisations how to increase concurrent connection without hang after few thousand clients connected simultaneously. for few clients its working fine.

In JMeter Connect time apperas for certain records results in high elapsed time

I am running a JMeter test with 20 concurrent users (use kee alive is enabled). All 20 users with different login ids trying to login (1st test) and create a record(2nd test). While creating a record i observed most of the create records has connect time of '0' but certain records (say 5/20) has connect time of 21000 ms, so due to that elapsed time of 5 requests alone is so high compared to other 15 requests. Why its happening for 5 users alone ?
We don't know, according to JMeter Glossary
Connect Time. JMeter measures the time it took to establish the connection, including SSL handshake. Note that connect time is not automatically subtracted from latency. In case of connection error, the metric will be equal to the time it took to face the error, for example in case of Timeout, it should be equal to connection timeout.
So it's more a network metric which indicates how long did it take for JMeter to establish connection with the server.
You need to check:
Your network adapter statistics, it might be the case it doesn't have enough bandwidth to send 20 concurrent requests
Your application connection pool settings, for example if it has 15 connections in the pool the remaining 5 will be put into queue and wait until a connection becomes available.
Check your server baseline health metrics like CPU, RAM, etc. (it can be done using JMeter PerfMon Plugin) as it might be the case your server lacks resources to serve all connections at the same time
Make sure to follow JMeter Best Practices as it might be the case JMeter is overloaded and cannot send requests fast enough

Node.js Server Timeout Problems (EC2 + Express + PM2)

I'm relatively new to running production node.js apps and I've recently been having problems with my server timing out.
Basically after a certain amount of usage & time my node.js app stops responding to requests. I don't even see routes being fired on my console anymore - it's like the whole thing just comes to a halt and the HTTP calls from my client (iPhone running AFNetworking) don't reach the server anymore. But if I restart my node.js app server everything starts working again, until things inevitable stop again. The app never crashes, it just stops responding to requests.
I'm not getting any errors, and I've made sure to handle and log all DB connection errors so I'm not sure where to start. I thought it might have something to do with memory leaks so I installed node-memwatch and set up a listener for memory leaks but that doesn't get called before my server stops responding to requests.
Any clue as to what might be happening and how I can solve this problem?
Here's my stack:
Node.js on AWS EC2 Micro Instance (using Express 4.0 + PM2)
Database on AWS RDS volume running MySQL (using node-mysql)
Sessions stored w/ Redis on same EC2 instance as the node.js app
Clients are iPhones accessing the server via AFNetworking
Once again no errors are firing with any of the modules mentioned above.
First of all you need to be a bit more specific about timeouts.
TCP timeouts: TCP divides a message into packets which are sent one by one. The receiver needs to acknowledge having received the packet. If the receiver does not acknowledge having received the package within certain period of time, a TCP retransmission occurs, which is sending the same packet again. If this happens a couple of more times, the sender gives up and kills the connection.
HTTP timeout: An HTTP client like a browser, or your server while acting as a client (e.g: sending requests to other HTTP servers), can set an arbitrary timeout. If a response is not received within that period of time, it will disconnect and call it a timeout.
Now, there are many, many possible causes for this... from more trivial to less trivial:
Wrong Content-Length calculation: If you send a request with a Content-Length: 20 header, that means "I am going to send you 20 bytes". If you send 19, the other end will wait for the remaining 1. If that takes too long... timeout.
Not enough infrastructure: Maybe you should assign more machines to your application. If (total load / # of CPU cores) is over 1, or your memory usage is high, your system may be over capacity. However keep reading...
Silent exception: An error was thrown but not logged anywhere. The request never finished processing, leading to the next item.
Resource leaks: Every request needs to be handled to completion. If you don't do this, the connection will remain open. In addition, the IncomingMesage object (aka: usually called req in express code) will remain referenced by other objects (e.g: express itself). Each one of those objects can use a lot of memory.
Node event loop starvation: I will get to that at the end.
For memory leaks, the symptoms would be:
the node process would be using an increasing amount of memory.
To make things worse, if available memory is low and your server is misconfigured to use swapping, Linux will start moving memory to disk (swapping), which is very I/O and CPU intensive. Servers should not have swapping enabled.
cat /proc/sys/vm/swappiness
will return you the level of swappiness configured in your system (goes from 0 to 100). You can modify it in a persistent way via /etc/sysctl.conf (requires restart) or in a volatile way using: sysctl vm.swappiness=10
Once you've established you have a memory leak, you need to get a core dump and download it for analysis. A way to do that can be found in this other Stackoverflow response: Tools to analyze core dump from Node.js
For connection leaks (you leaked a connection by not handling a request to completion), you would be having an increasing number of established connections to your server. You can check your established connections with netstat -a -p tcp | grep ESTABLISHED | wc -l can be used to count established connections.
Now, the event loop starvation is the worst problem. If you have short lived code node works very well. But if you do CPU intensive stuff and have a function that keeps the CPU busy for an excessive amount of time... like 50 ms (50 ms of solid, blocking, synchronous CPU time, not asynchronous code taking 50 ms), operations being handled by the event loop such as processing HTTP requests start falling behind and eventually timing out.
The way to find a CPU bottleneck is using a performance profiler. nodegrind/qcachegrind are my preferred profiling tools but others prefer flamegraphs and such. However it can be hard to run a profiler in production. Just take a development server and slam it with requests. aka: a load test. There are many tools for this.
Finally, another way to debug the problem is:
env NODE_DEBUG=tls,net node <...arguments for your app>
node has optional debug statements that are enabled through the NODE_DEBUG environment variable. Setting NODE_DEBUG to tls,net will make node emit debugging information for the tls and net modules... so basically everything being sent or received. If there's a timeout you will see where it's coming from.
Source: Experience of maintaining large deployments of node services for years.

Using Fleck Websocket for 10k simultaneous connections

I'm implementing a websocket-secure (wss://) service for an online game where all users will be connected to the service as long they are playing the game, this will use a high number of simultaneous connections, although the traffic won't be a big problem, as the service is used for chat, storage and notifications... not for real-time data synchronization.
I wanted to use Alchemy-Websockets, but it doesn't support TLS (wss://), so I have to look for another service like Fleck (or other).
Alchemy has been tested with high number of simultaneous connections, but I didn't find similar tests for Fleck, so I need to get some real info from users of fleck.
I know that Fleck is non-blocking and uses Async calls, but I need some real info, cuz it might be abusing threads, garbage collector, or any other aspect that won't be visible to lower number of connections.
I will use c# for the client as well, so I don't need neither hybiXX compatibility, nor fallback, I just need scalability and TLS support.
I finally added Mono support to WebSocketListener.
Check here how to run WebSocketListener in Mono.
10K connections is not little thing. WebSocketListener is asynchronous and it scales well. I have done tests with 10K connections and it should be fine.
My tests shows that WebSocketListener is almost as fast and scalable as the Microsoft one, and performs better than Fleck, Alchemy and others.
I made a test on a Windows machine with Core2Duo e8400 processor and 4 GB of ram.
The results were not encouraging as it started delaying handshakes after it reached ~1000 connections, i.e. it would take about one minute to accept a new connection.
These results were improved when i used XSockets as it reached 8000 simultaneous connections before the same thing happened.
I tried to test on a Linux VPS with Mono, but i don't have enough experience with Linux administration, and a few system settings related to TCP, etc. needed to change in order to allow high number of concurrent connections, so i could only reach ~1000 on the default settings, after that he app crashed (both Fleck test and XSocket test).
On the other hand, I tested node.js, and it seemed simpler to manage very high number of connections, as node didn't crash when reached the limits of tcp.
All the tests where echo test, the servers send the same message back to the client who sent the message and one random other connected client, and each connected client sends a random ~30 chars text message to the server on a random interval between 0 and 30 seconds.
I know my tests are not generic enough and i encourage anyone to have their own tests instead, but i just wanted to share my experience.
When we decided to try Fleck, we have implemented a wrapper for Fleck server and implemented a JavaScript client API so that we can send back acknowledgment messages back to the server. We wanted to test the performance of the server - message delivery time, percentage of lost messages etc. The results were pretty impressive for us and currently we are using Fleck in our production environment.
We have 4000 - 5000 concurrent connections during peak hours. On average 40 messages are sent per second. Acknowledged message ratio (acknowledged messages / total sent messages) never drops below 0.994. Average round-trip for messages is around 150 miliseconds (duration between server sending the message and receiving its ack). Finally, we did not have any memory related problems due to Fleck server after its heavy usage.

Resources