Thousands of concurrent HTTP requests in Node on Linux

I have a list of thousands of URLs. I want to run a health check (healt.php) on each of them with an HTTP request.
This is my problem:
I've written an application in Node. It makes the requests in a pooled way, using a variable to control how many concurrent connections are open, e.g. 300.
Taken one by one, each request is fast: no more than 500 ms.
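Simplified, the agent does something like this sketch (the URL list, concurrency value, and logging are illustrative, not the exact code):

// agent.js - minimal sketch of the pooled health checker
var http = require('http');

var CONCURRENCY = 300;                  // the variable mentioned above
var urls = [ /* thousands of hosts */ ];
var inFlight = 0, next = 0;

function fire() {
  while (inFlight < CONCURRENCY && next < urls.length) {
    (function (url) {
      var started = Date.now();
      inFlight++;
      http.get('http://' + url + '/healt.php', function (res) {
        res.resume();                   // drain the body
        console.log((Date.now() - started) + 'ms ' + url);
        inFlight--;
        fire();                         // refill the pool
      }).on('error', function () { inFlight--; fire(); });
    })(urls[next++]);
  }
}

fire();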
But when I run the application, the result is:
$ node agent.js
200ms url1.tld
250ms url4.tld
400ms url2.tld
530ms url8.tld
800ms url3.tld
...
2300ms urlN.tld
...
30120ms urlM.tld
It seems there is a concurrency limit. When I execute
$ ps axo nlwp,cmd | grep node
The result is:
6 node agent.js
There are only 6 threads managing all the concurrent connections. I found an env variable that controls the size of Node's thread pool: UV_THREADPOOL_SIZE:
$ UV_THREADPOOL_SIZE=300 node agent.js
200ms url1.tld
210ms url4.tld
220ms url2.tld
240ms url8.tld
400ms url3.tld
...
800ms urlN.tld
...
1010ms urlM.tld
The problem is still there, but the results are much better. With the ps command:
$ ps axo nlwp,cmd | grep node
132 node agent.js
Next step: looking into the Node source code, I found a constant in deps/uv/src/unix/threadpool.c:
#define MAX_THREADPOOL_SIZE 128
OK. I changed that value to 2048, compiled and installed Node, and ran the command again:
$ UV_THREADPOOL_SIZE=300 node agent.js
All seems OK: response times no longer increase gradually. But when I try a bigger concurrency number, the problem reappears. This time it's not related to the number of threads, because with the ps command I can see there are enough of them.
I tried writing the same application in Go, but the results are the same: the times increase gradually.
So, my question is: where is the concurrency limit? Memory, CPU load, and bandwidth are all within bounds, and I have tuned sysctl.conf and limits.conf to avoid the usual limits (files, ports, memory, ...).

You may be throttled by http.globalAgent's maxSockets. Depending on whether you're using http or https, see if this fixes your problem:
require('http').globalAgent.maxSockets = Infinity;
require('https').globalAgent.maxSockets = Infinity;
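Alternatively, you can give the requests a dedicated agent instead of mutating the global one (a sketch; the host, path, and socket count are examples):

var http = require('http');

// A dedicated agent keeps the higher limit from leaking into unrelated code
var agent = new http.Agent({ maxSockets: 300 });

http.get({ host: 'url1.tld', path: '/healt.php', agent: agent }, function (res) {
  res.resume();
  console.log(res.statusCode);
});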

If you're using request or request-promise, you can set the pool size:
request({
  url: url,
  json: true,
  pool: { maxSockets: Infinity },
  timeout: 2000
})
More info here: https://github.com/request/request

Related

MariaDB connection very slow in NodeJS

I'm currently using MariaDB version 10.5.12 and the mariadb-connector-nodejs package to interact with it from Node.js.
The pool is created like this:
const pool = MariaDB.createPool({
  host: process.env.DB_HOST,
  user: process.env.DB_USER,
  password: process.env.DB_PASS,
  database: process.env.DB_DATABASE,
  connectionLimit: 25
});
A simple query such as connection.query("SELECT * FROM Users"); where the Users table contains only 5 users, can take anywhere between 90 ms and 650 ms, even though profiling the query in MariaDB says the query itself only takes a few nanoseconds.
The time a request takes seems fairly random: the first request might take 100 ms, and the second one right after, on the same connection, could take 600 ms, which feels way too high for a query that takes less than a millisecond to process.
I'm unable to find where the issue is coming from. I have applied all the changes mysqltuner recommended, and I've tried disabling name resolution with no visible speedup (I assume the lookups are cached, as all requests come from the same IP).
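For context, a timing harness along these lines is what produces numbers like the above (an illustrative sketch, not the exact app code):

async function timeQuery() {
  // Time a single query, including pool checkout
  const start = process.hrtime.bigint();
  const rows = await pool.query('SELECT * FROM Users');
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(ms.toFixed(0) + 'ms for ' + rows.length + ' rows');
}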
top indicates there is NO swap space available.
Consider enabling 6G of swap space to survive busy periods with minimal delay and keep the system alive.
Please share the code generally used to Connect, Process, Close connections.
Your threads_connected count indicates the Close step is being missed and has left 83 threads connected over 10 days.
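For reference, a Connect / Process / Close cycle with this connector typically looks like the sketch below (a pattern sketch assuming the promise API of the mariadb package, not your actual code):

async function getUsers() {
  let conn;
  try {
    conn = await pool.getConnection();               // Connect: check a connection out of the pool
    return await conn.query('SELECT * FROM Users');  // Process
  } finally {
    if (conn) conn.release();                        // Close: return to the pool, even on error
  }
}

If the release step is skipped on error paths, connections leak and threads_connected climbs exactly as described.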
Suggestions to consider for your my.cnf [mysqld] section:
log_error=/var/log/mysql/mariadb-error.log # from 0 to allow viewing ERRORS detected.
innodb_max_dirty_pages_pct_lwm=1 # from 0 percent to enable pre-flushing
innodb_max_dirty_pages_pct=1 # from 90 percent to minimize innodb_buffer_pool_pages_dirty of 367.
innodb_adaptive_hash_index=ON # from OFF to minimize deadlocks
max_connect_errors=10 # from 100 to frustrate hackers/crackers after 10 attempts.
connect_timeout=30 # from 10 seconds to be more tolerant of connection attempts.
Observations:
Connections were 120,604 in 10 days, and you had 28,984 aborted_connects, about 1/4 of the events. Either you have a lot of unhappy people trying to use your system, or hackers are trying to break in.

Give reads priority over writes in Elasticsearch

I have an EC2 server running Elasticsearch 0.9 with an nginx server in front for read/write access. My index has about 750k small-to-medium documents. I have a fairly continuous stream of minimal writes (mainly updates) to the content. The speed and consistency I get from search is fine with me, but I have some sporadic timeout issues with multi-get (/_mget).
On some pages in my app, our server will request a multi-get of a dozen to a few thousand documents (this usually takes less than 1-2 seconds). The requests that fail do so with a 30,000-millisecond timeout from the nginx server. I am assuming this happens because the index is temporarily locked for writing/optimizing purposes. Does anyone have any ideas on what I can do here?
A temporary solution would be to lower the timeout and return a user-friendly message saying the documents couldn't be retrieved (however, they would still have to wait ~10 seconds to see the error message).
One of my other thoughts was to give reads priority over writes: any time someone is trying to read a part of the index, don't allow any writes/locks to that section. I don't think this would be scalable, and it may not even be possible.
Finally, I was thinking I could have a read-only alias and a write-only alias. I can figure out how to set this up from the documentation, but I am not sure it will actually work like I expect it to (and I'm not sure how to reliably test it in a local environment). If I set up aliases like this, would the read-only alias still have moments where the index was locked because of information being written through the write-only alias?
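From the documentation, the alias setup would look something like this (index and alias names here are placeholders):
POST /_aliases HTTP/1.1
Host: 127.0.0.1:9200
{
  "actions": [
    { "add": { "index": "myindex", "alias": "myindex-read" } },
    { "add": { "index": "myindex", "alias": "myindex-write" } }
  ]
}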
I'm sure someone else has come across this before; what is the typical solution to make sure a user can always read data from the index, with higher priority than writes? I would consider increasing our server power if required. Currently we have two m2.xlarge EC2 instances; one holds the primary and the other the replica, each with 4 shards.
An example dump of cURL info from a failed request (with an error of Operation timed out after 30000 milliseconds with 0 bytes received):
{
  "url": "127.0.0.1:9200\/_mget",
  "content_type": null,
  "http_code": 100,
  "header_size": 25,
  "request_size": 221,
  "filetime": -1,
  "ssl_verify_result": 0,
  "redirect_count": 0,
  "total_time": 30.391506,
  "namelookup_time": 7.5e-5,
  "connect_time": 0.0593,
  "pretransfer_time": 0.059303,
  "size_upload": 167002,
  "size_download": 0,
  "speed_download": 0,
  "speed_upload": 5495,
  "download_content_length": -1,
  "upload_content_length": 167002,
  "starttransfer_time": 0.119166,
  "redirect_time": 0,
  "certinfo": [],
  "primary_ip": "127.0.0.1",
  "redirect_url": ""
}
After more monitoring with the Paramedic plugin, I noticed that I would get timeouts when my CPU hit ~80-98% (with no obvious spikes in indexing/searching traffic). I finally stumbled across a helpful thread on the Elasticsearch forum. It seems this happens when the index is refreshing and large merges are occurring.
Merges can be throttled at the cluster or index level, and I've updated indices.store.throttle.max_bytes_per_sec from the default 20mb down to 5mb. This can be done at runtime with the cluster update settings API.
PUT /_cluster/settings HTTP/1.1
Host: 127.0.0.1:9200
{
  "persistent": {
    "indices.store.throttle.max_bytes_per_sec": "5mb"
  }
}
So far Paramedic is showing a decrease in CPU usage, from an average of ~5-25% down to ~1-5%. Hopefully this helps me avoid the 90%+ spikes that were locking up my queries before; I'll report back by accepting this answer if I don't have any more problems.
As a side note, I guess I could have opted for more balanced EC2 instances (rather than memory-optimized). I think I'm happy with my current choice, but my next purchase will also take more CPU into account.

Website Benchmarking using ab

I am trying my hand at various benchmarking tools for the website I am working on, and have found Apache Bench (ab) to be an excellent tool for load testing. It is a command-line tool and, apparently, very easy to use. However, I have a doubt about two of its basic flags. The site I was reading says:
Suppose we want to see how fast Yahoo can handle 100 requests, with a maximum of 10 requests running concurrently:
ab -n 100 -c 10 http://www.yahoo.com/
and the explanation for the flags states:
Usage: ab [options] [http[s]://]hostname[:port]/path
Options are:
-n requests Number of requests to perform
-c concurrency Number of multiple requests to make
I guess I am just not able to wrap my head around "number of requests to perform" and "number of multiple requests to make". What happens when I give both of them together, as in the example above?
Can anyone give me a simpler explanation of what these two flags do together?
In your example, ab will create 10 connections to yahoo.com and request a page over each of them simultaneously.
If you omit -c 10, ab will create only one connection and start the next request only when the first completes (i.e. when the whole main page has been downloaded).
If we pretend that the server's response time does not depend on the number of requests it is handling simultaneously, your example will complete 10 times faster than it would without -c 10.
Also: What is concurrent request (-c) in Apache Benchmark?
-n 100 -c 10 means "issue 100 requests, 10 at a time."

Node.js struggling with lots of concurrent connections

I'm working on a somewhat unusual application where 10k clients are precisely timed to all try to submit data at once, every 3 minutes or so. This ab command fairly accurately simulates one such barrage in the real world:
ab -c 10000 -n 10000 -r "http://example.com/submit?data=foo"
I'm using Node.js on Ubuntu 12.04 on a Rackspace Cloud VPS instance to collect these submissions; however, I'm seeing some very odd behavior from Node, even when I remove all my business logic and turn the HTTP request handler into a no-op.
When the test gets to about 90% done, it hangs for a long period of time. Strangely, this happens consistently at 90%: for c=n=10k, at 9000; for c=n=5k, at 4500; for c=n=2k, at 1800. The test does eventually complete, often with no errors. But both the ab and node logs show continuous processing up until around 80-90% of the test run, then a long pause before completing.
When node is processing requests normally, CPU usage is typically around 50-70%. During the hang period, CPU goes up to 100% (sometimes it stays near 0). Given the erratic CPU behavior, and the fact that the hang seems tied to the percentage complete rather than the actual number of connections, I don't suspect the garbage collector.
I've tried running ab both on localhost and on a remote server - same effect.
I suspect something related to the TCP stack, possibly involving closing connections, but none of my configuration changes have helped. My changes:
ulimit -n 999999
When I listen(), I set the backlog to 10000 (see the snippet after the sysctl list below)
Sysctl changes are:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_max_orphans = 20000
net.ipv4.tcp_max_syn_backlog = 10000
net.core.somaxconn = 10000
net.core.netdev_max_backlog = 10000
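For reference, the backlog is the third argument to listen(), so the no-op server looks roughly like this (a sketch using the port and backlog values from above):

var http = require('http');

// No-op handler - business logic removed, as described above
var server = http.createServer(function (req, res) {
  res.end();
});

// listen(port, host, backlog): backlog caps the pending-connection queue,
// subject to net.core.somaxconn
server.listen(80, '0.0.0.0', 10000);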
I have also noticed that I tend to get this message in the kernel logs:
TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
I'm puzzled by this message, since the TCP backlog queue should be deep enough never to overflow. If I disable SYN cookies, "Sending cookies" changes to "Dropping connections".
I speculate that this is some sort of Linux TCP stack tuning problem, and I've read just about everything I could find on the net. Nothing I have tried seems to matter. Any advice?
Update: Tried with tcp_max_syn_backlog, somaxconn, netdev_max_backlog, and the listen() backlog param set to 50k with no change in behavior. Still produces the SYN flood warning, too.
Are you running ab on the same machine that's running node? If not, do you have a 1G or 10G NIC? If you are, then aren't you really trying to process 20,000 open connections?
Also, if you are changing net.core.somaxconn to 10,000, are there absolutely no other sockets open on that machine? If there are, then 10,000 is not high enough.
Have you tried using Node's cluster module to spread the open connections across multiple processes?
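A minimal sketch of the cluster approach (the port and one-worker-per-CPU choice are illustrative):

var cluster = require('cluster');
var http = require('http');
var os = require('os');

if (cluster.isMaster) {
  // Fork one worker per CPU; the workers share the listening socket
  for (var i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
} else {
  http.createServer(function (req, res) {
    res.end('ok');
  }).listen(80);
}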
I think you might find this blog post (and the previous ones in the series) useful:
http://blog.caustik.com/2012/08/19/node-js-w1m-concurrent-connections/

Does WGET timeout?

I'm running a PHP script via cron using Wget, with the following command:
wget -O - -q -t 1 http://www.example.com/cron/run
The script takes a maximum of 5-6 minutes to do its processing. Will wget wait for it and give it all the time it needs, or will it time out?
According to the man page of wget, there are a couple of options related to timeouts -- and there is a default read timeout of 900s -- so I'd say that, yes, it could time out.
Here are the options in question:
-T seconds
--timeout=seconds
    Set the network timeout to seconds seconds. This is equivalent to specifying --dns-timeout, --connect-timeout, and --read-timeout, all at the same time.
And for those three options:
--dns-timeout=seconds
    Set the DNS lookup timeout to seconds seconds. DNS lookups that don't complete within the specified time will fail. By default, there is no timeout on DNS lookups, other than that implemented by system libraries.
--connect-timeout=seconds
    Set the connect timeout to seconds seconds. TCP connections that take longer to establish will be aborted. By default, there is no connect timeout, other than that implemented by system libraries.
--read-timeout=seconds
    Set the read (and write) timeout to seconds seconds. The "time" of this timeout refers to idle time: if, at any point in the download, no data is received for more than the specified number of seconds, reading fails and the download is restarted. This option does not directly affect the duration of the entire download.
I suppose using something like
wget -O - -q -t 1 --timeout=600 http://www.example.com/cron/run
should make sure the timeout is longer than the duration of your script.
(Yeah, that's probably the most brutal solution possible ^^ )
The default timeout is 900 seconds. You can specify a different timeout:
-T seconds
--timeout=seconds
The default is to retry 20 times. You can specify a different number of tries:
-t number
--tries=number
Link: wget man page
Prior to version 1.14, wget's timeout arguments were not honored when downloading over HTTPS, due to a bug.
Since you said in your question that it's a PHP script, the best solution might be to simply add this to the script:
ignore_user_abort(TRUE);
In this way, even if wget terminates, the PHP script keeps being processed, at least until it exceeds the max_execution_time limit (ini directive: 30 seconds by default).
As for wget, you shouldn't need to change its timeout anyway; according to the UNIX manual, the default wget read timeout is 900 seconds (15 minutes), which is much longer than the 5-6 minutes you need.
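Putting the two together, the top of the script might look like this (set_time_limit(0) is an addition beyond the suggestion above, used here to lift that 30-second limit):

<?php
// cron/run
ignore_user_abort(TRUE); // keep processing even if wget disconnects
set_time_limit(0);       // lift max_execution_time for this long job

// ... 5-6 minutes of processing ...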
None of the wget timeout values have anything to do with how long it takes to download a file.
If the PHP script that you're triggering sits there idle for 5 minutes and returns no data, wget's --read-timeout will trigger if it's set to less than the time it takes to execute the script.
If you are actually downloading a file, or if the PHP script sends some data back, like a ... progress indicator, then the read timeout won't be triggered as long as the script is doing something.
wget --help tells you:
-T, --timeout=SECONDS set all timeout values to SECONDS
--dns-timeout=SECS set the DNS lookup timeout to SECS
--connect-timeout=SECS set the connect timeout to SECS
--read-timeout=SECS set the read timeout to SECS
So if you use --timeout=10 it sets the timeouts for DNS lookup, connecting, and reading bytes to 10s.
When downloading files, you can set the timeout value pretty low; as long as you have good connectivity to the site you're connecting to, you can still download a large file over 5 minutes with a 10s timeout. If you have a temporary connection failure to the site, or a DNS failure, the transfer will time out after 10s and then retry (if --tries, aka -t, is > 1).
For example, here I am downloading a file from NVIDIA that takes 4 minutes to download, and I have wget's timeout values set to 10s:
$ time wget --timeout=10 --tries=1 https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
--2021-07-02 16:39:21-- https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3057439068 (2.8G) [application/octet-stream]
Saving to: ‘cuda_11.2.2_460.32.03_linux.run.1’
cuda_11.2.2_460.32.03_linux.run.1 100%[==================================================================================>] 2.85G 12.5MB/s in 4m 0s
2021-07-02 16:43:21 (12.1 MB/s) - ‘cuda_11.2.2_460.32.03_linux.run.1’ saved [3057439068/3057439068]
real 4m0.202s
user 0m5.180s
sys 0m16.253s
4m to download, timeout is 10s, everything works just fine.
In general, timing out DNS, connections, and reads using a low value is a good idea. If you leave it at the default value of 900s you'll be waiting 15m every time there's a hiccup in DNS or your Internet connectivity.
