I'm writing a backend app using Node.js that executes a lot of HTTP requests to external services and S3.
I have reached roughly 800 requests per second on a single Kubernetes pod.
The pod is limited to a single vCPU, and it has reached 100% usage.
I can scale it out to tens of pods to handle thousands of requests,
but it seems that this limit is being hit too soon.
I have tested this in my real backend app and then on a demo pod that does nothing but send HTTP requests using axios.
Does it make sense that a single-vCPU Kubernetes pod can only handle 800 req/sec (as a client, not as a server)?
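For reference, a minimal sketch of what the demo client does (the target URL, the concurrency of 100, and the keep-alive agent are illustrative assumptions, not the exact code):

const axios = require('axios');
const http = require('http');

// Reuse sockets so each request doesn't pay connection-setup CPU.
const client = axios.create({
  httpAgent: new http.Agent({ keepAlive: true }),
});

let sent = 0;

async function worker() {
  for (;;) {
    try {
      await client.get('http://example-service.internal/ping'); // hypothetical target
      sent++;
    } catch (err) {
      // Failures would be counted separately in a real test; ignored here for brevity.
    }
  }
}

// Keep 100 requests in flight and log the achieved rate once per second.
for (let i = 0; i < 100; i++) worker();
setInterval(() => { console.log(`${sent} req/sec`); sent = 0; }, 1000);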
It's hard to give specific advice on choosing the right amount of compute for your particular needs. That said, when you set a 1 vCPU limit on a Pod, that equals 1 CPU unit, which is equivalent to one vCPU on most widely used cloud providers' VMs.
Given that, I would bet on adding more CPU units to your Pod rather than spinning up more Pods with the same vCPU count via the Horizontal Pod Autoscaler (HPA). If the node doesn't have enough capacity, it's very easy to end up with lots of overloaded Pods, and that certainly won't do the node any good.
In your example, there are two key metrics to analyze: latency (the time to send a request and receive its answer) and throughput (requests per second) of the HTTP requests. The overriding rule is that increasing latency decreases the overall throughput of your requests. As a rough check, 800 req/sec at 100% of one vCPU works out to about 1.25 ms of CPU time per request, so per-request CPU cost is what caps your current throughput.
You can also read about the Vertical Pod Autoscaler (VPA) as an option for managing compute resources in a Kubernetes cluster.
We are in the process of transitioning from a self-managed kubernetes cluster to Google's GKE Autopilot. All of our incoming API requests are handled by an "API Gateway" server that routes requests to various internal services. When testing this API Gateway on the new GKE Autopilot cluster, we noticed sporadic EAI_AGAIN DNS resolution errors for the internal services.
These errors occur even at low load (50-100 requests per second), but appear to increase when the number of concurrent requests increases. They also occur when rolling out a new image to downstream pods, despite having multiple replicas and a rolling update strategy.
Our API Gateway is written in NodeJS. Researching online (1, 2), I found that the issue might be related to one of (i) overloading the kubernetes-internal DNS server, (ii) overloading the nodejs event loop since getaddrinfo is blocking, (iii) an issue with MUSL in alpine-based NodeJS images, or (iv) a race condition in earlier linux kernel versions.
All but (iv) can probably be ruled out in our case:
(i) kubedns is at very low CPU usage, and GKE Autopilot implements node-local DNS caching by default.
(ii) Our API Gateway is at low CPU usage. The event loop does appear to lag sporadically for up to 100ms, but there is not an overwhelming correlation with the EAI_AGAIN error rate (see the measurement sketch below).
(iii) We are running debian-based NodeJS images.
(iv) I'm not sure what linux kernel version GKE Autopilot pods are running on, but at our low load I don't think we should be hitting this error.
It is strange to me that we are seeing these errors given that our load is not high compared to what other companies run on Kubernetes. Does anyone have pointers for where to look further?
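For reference, a minimal sketch of how the sporadic event-loop lag mentioned in (ii) can be measured with Node's built-in perf_hooks (the sampling resolution and logging interval are arbitrary choices):

const { monitorEventLoopDelay } = require('perf_hooks');

// Samples event-loop delay every 20 ms; values are reported in nanoseconds.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  console.log({
    p50_ms: histogram.percentile(50) / 1e6,
    p99_ms: histogram.percentile(99) / 1e6,
    max_ms: histogram.max / 1e6,
  });
  histogram.reset();
}, 10000);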
From the Google Cloud Run docs page:
Concurrency is configurable. By default each Cloud Run container instance can receive up to 80 requests at the same time; you can increase this to a maximum of 1000.
If my app is written in Node.js with Express or Fastify, it could easily support well beyond 1000. See the benchmark:
Fastify: 56457 req/sec (50x Cloud Run's max)
Express: 11883 req/sec (11x Cloud Run's max)
I understand that practical results may be lower than the numbers above. But still, it should be able to support well beyond 1000, I hope.
While the server frameworks support higher concurrency, why does Google Cloud Run throttle it to a maximum of 1000?
(The same is the case with Firebase Functions v2, which runs on Google Cloud Run, hence tagging firebase here as well.)
You made a mistake.
Take the Cloud Run limitation: it can handle up to 1000 requests concurrently.
Take your test results: 56457 requests per second.
Now, the mistake: imagine each request is processed in 20 ms, i.e. 1/50 of a second. If you handle 1000 concurrent requests every 1/50 of a second, you can handle 50,000 requests per second.
There is no limit on Cloud Run for the number of requests per second, only for the number of requests handled concurrently at the same time on the same instance (1000 is presumably a limit due to Google's load-balancing and traffic-routing infrastructure).
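A tiny sketch of that arithmetic (the 20 ms latency is the assumed figure from above):

// Throughput is bounded by concurrency / latency, not by concurrency alone.
const maxConcurrency = 1000;  // Cloud Run's per-instance concurrency limit
const latencySeconds = 0.020; // assumed 20 ms per request

const maxRequestsPerSecond = maxConcurrency / latencySeconds;
console.log(`${maxRequestsPerSecond} requests per second per instance`); // 50000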
We've recently created a new Standard 1 GB Azure Redis cache specifically for distributed locking, separate from our main Redis cache. This was done to improve stability on our main Redis cache, a very long-standing issue that this change seems to have significantly helped with.
On our new cache, we observe bursts of ~100 errors within the same few seconds every 1 - 3 days. The errors are either:
No connection is available to service this operation (StackExchange.Redis error)
Or:
Could not acquire distributed lock: Conflicted (RedLock.net error)
As they are errors from different packages, I suspect the Redis cache itself is the problem here. None of the stats during this time look out of the ordinary, and the workload should fit comfortably within the Standard 1 GB size.
I'm guessing this could be caused by the tier's advertised "Low" network performance. Is this the likely cause?
Your theory sounds plausible.
Checking for insufficient network bandwidth
Here is a handy table showing the maximum observed bandwidth for various pricing tiers. Take a look at the observed maximum bandwidth for your SKU, then head over to your Redis blade in the Azure Portal and choose Metrics. Set the aggregation to Max, and look at the sum of cache read and cache write. This is your total bandwidth consumed. Overlay the sum of these two against the time period when you're experiencing the errors, and see if the problem is network throughput. If that's the case, scale up.
Checking server load
Also on the Metrics tab, take a look at server load. This is the percentage that Redis is busy and is unable to process requests. If you hit 100%, Redis cannot respond to new requests and you will experience timeout issues. If that's the case, scale up.
Reusing ConnectionMultiplexer
You can also run out of connections to a Redis server if you're spinning up a new instance of StackExchange.Redis.ConnectionMultiplexer per request; reuse a single multiplexer instead. The service limits for the number of connections available for your SKU are here on the pricing page. To see whether you're exceeding the maximum allowed connections for your SKU, go to the Metrics tab, select the Max aggregation, and choose Connected Clients as your metric.
Thread Exhaustion
This doesn't sound like your error, but I'll include it for completeness in this rogue's gallery of Redis issues; it comes into play with Azure Web Apps. By default, the thread pool starts with 4 threads that can be immediately allocated to work. When you need more than four, additional threads are doled out at a rate of one per 500 ms. So if you dump a ton of requests on a Web App in a short period of time, you can end up queuing work and eventually having requests dropped before they even get to Redis. To test whether this is a problem, go to Metrics for your Web App, choose Threads, and set the aggregation to Max. If you see a huge spike in a short period that corresponds with your trouble, you've found a culprit. Resolutions include making proper use of async/await; and when that gets you no further, use ThreadPool.SetMinThreads to set a higher minimum, preferably one that is close to or above the maximum thread usage you see during the bursts.
Rob has some great suggestions, but I wanted to add information on troubleshooting traffic bursts and poor ThreadPool settings. Please see: Troubleshoot Azure Cache for Redis client-side issues
Bursts of traffic combined with poor ThreadPool settings can result in delays in processing data already sent by the Redis Server but not yet consumed on the client side.
Monitor how your ThreadPool statistics change over time using an example ThreadPoolLogger. You can use TimeoutException messages from StackExchange.Redis, like the one below, to investigate further:
System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0,
IOCP: (Busy=6,Free=999,Min=2,Max=1000), WORKER: (Busy=7,Free=8184,Min=2,Max=8191)
Notice that in the IOCP section and the WORKER section you have a Busy value that is greater than the Min value. This difference means your ThreadPool settings need adjusting.
You can also see in: 64221. This value indicates that 64,221 bytes have been received at the client's kernel socket layer but haven't been read by the application. This difference typically means that your application (for example, StackExchange.Redis) isn't reading data from the network as quickly as the server is sending it.
You can configure your ThreadPool Settings to make sure that your thread pool scales up quickly under burst scenarios.
I hope you find this additional information helpful.
I have a situation where I create a Node.js cluster using PM2. A single request fired at a worker takes considerable time (2+ minutes), as it does intensive computation (in a pipeline of steps) with a couple of I/O operations at different stages (step 1 is 'download over HTTP'; an intermediate step and the last step are 'write to disk'). The client that sends requests to the cluster throttles them by two factors:
Frequency (how many requests per second): we use a slow pace (1 per second)
How many open requests it can have: we keep this less than or equal to the number of nodes in the cluster
For example, if the cluster has 10 nodes, the client will only send 10 requests to the cluster at a pace of 1 per second, and won't send any more until one or more of them returns with either success or failure, which means that one or more workers should now be free to do more work; then the client sends more work to the cluster.
While watching the load on the server, it seems that the load balancer does not distribute work evenly, as one would expect from a classic round-robin distribution scheme. What happens is that a single worker (usually the first one) receives a lot of requests while there are free workers in the cluster. This eventually causes that worker to malfunction.
We implemented a mechanism to prevent a worker from accepting new requests while it's still working on a previous one. This prevented the malfunctioning, but a lot of requests are still denied service even though the cluster has vacant workers!
Can you think of a reason why this behavior is happening, or how we can improve the way PM2 distributes work?
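For illustration, a minimal sketch of the kind of guard described above (this is an assumption, not the original code; the Express framework and route are hypothetical):

const express = require('express');
const app = express();

let busy = false;

app.post('/process', async (req, res) => {
  if (busy) {
    // Reject rather than queue behind a long-running computation.
    return res.status(503).json({ error: 'worker busy' });
  }
  busy = true;
  try {
    await runPipeline(); // placeholder for the download -> compute -> write-to-disk pipeline
    res.json({ ok: true });
  } finally {
    busy = false;
  }
});

async function runPipeline() {
  // Hypothetical stand-in for the real multi-step pipeline.
}

app.listen(3000);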
In brief, I am having trouble supporting more than 5000 read requests per minute from a data API built on PostgreSQL, Node.js, and node-postgres. The bottleneck appears to be between the API and the DB. Here are the implementation details.
I'm using an AWS PostgreSQL RDS database instance (m4.4xlarge: 64 GB memory, 16 vCPUs, 350 GB SSD, no provisioned IOPS) for a Node.js-powered data API. By default the RDS's max_connections=5000. The Node API is load-balanced across two clusters of 4 processes each (2 EC2s with 4 vCPUs, running the API with PM2 in cluster mode). I use node-postgres to bind the API to the PostgreSQL RDS and am attempting to use its connection pooling feature. Below is a sample of my connection pool code:
const { Pool } = require('pg'); // node-postgres

var pool = new Pool({
    user: settings.database.username,
    password: settings.database.password,
    host: settings.database.readServer,
    database: settings.database.database,
    max: 25,                 // maximum number of clients in the pool
    idleTimeoutMillis: 1000  // close clients idle for more than 1 second
});

/* Example of pool usage */
pool.query('SELECT my_column FROM my_table', function(err, result){
    /* Callback code here */
});
Using this implementation and testing with a load tester, I can support about 5000 requests over the course of one minute, with an average response time of about 190 ms (which is what I expect). As soon as I fire off more than 5000 requests per minute, the response time increases to over 1200 ms in the best cases, and in the worst cases the API begins to time out frequently. Monitoring indicates that CPU utilization on the EC2s running the Node.js API remains below 10%, so my focus is on the DB and the API's binding to the DB.
I have attempted to increase (and decrease, for that matter) the node-postgres "max" connections setting, but there was no change in the API's response/timeout behavior. I've also tried provisioned IOPS on the RDS, but saw no improvement. Interestingly, I also scaled the RDS up to m4.10xlarge (160 GB memory, 40 vCPUs), and while the RDS CPU utilization dropped greatly, the overall performance of the API worsened considerably (it couldn't even support the 5000 requests per minute that I could with the smaller RDS).
I'm in unfamiliar territory in many respects and am unsure how best to determine which of these moving parts is bottlenecking API performance above 5000 requests per minute. As noted, I have attempted a variety of adjustments based on the PostgreSQL configuration documentation and the node-postgres documentation, but to no avail.
If anyone has advice on how to diagnose or optimize I would greatly appreciate it.
UPDATE
After scaling up to m4.10xlarge, I performed a series of load tests, varying the number of requests/min and the max number of connections in each pool. Here are some screen captures of the monitoring metrics:
In order to support more than 5k requests while maintaining the same response rate, you'll need better hardware...
The simple math says:
5000 requests × 190 ms average = 950,000 ms of processing time per minute; divided across 16 cores, that's roughly 60,000 ms per core.
Since a minute is only 60,000 ms, each core was busy essentially the whole time, which means your system was highly loaded.
(I'm guessing you had some spare CPU, as some time was lost on networking.)
Now, the really interesting part of your question comes from the scale-up attempt: m4.10xlarge (160 GB mem, 40 vCPUs).
The drop in CPU utilization indicates that the scale-up freed up DB time resources, so you need to push more requests!
Two suggestions (a sketch of both follows):
Try increasing the connection pool to max: 70 and watch the network traffic (depending on the amount of data, you might be hogging the network).
Also, are your requests to the DB async on the application side? Make sure your app can actually push more requests.
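Here is a minimal sketch of both suggestions combined (the max: 70 pool size comes from suggestion 1; the settings object and query mirror your question, and the rest is an assumption about your code):

const { Pool } = require('pg');

const pool = new Pool({
  user: settings.database.username,
  password: settings.database.password,
  host: settings.database.readServer,
  database: settings.database.database,
  max: 70,                // larger pool, per suggestion 1
  idleTimeoutMillis: 1000
});

// pool.query() returns a promise, so the app can keep many queries in flight
// concurrently instead of serializing them.
async function getMyColumn() {
  const result = await pool.query('SELECT my_column FROM my_table');
  return result.rows;
}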
The best way is to make use of a separate Pool for each API call, based on the call's priority:
const highPriority = new Pool({max: 20}); // for high-priority API calls
const lowPriority = new Pool({max: 5}); // for low-priority API calls
Then you just use the right pool for each of the API calls, for optimum service/connection availability.
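For illustration, a hypothetical usage sketch (the Express routes, paths, and query are assumptions, not your actual API) showing how each endpoint picks the pool that matches its priority:

const express = require('express');
const { Pool } = require('pg');

const highPriority = new Pool({ max: 20 }); // for high-priority API calls
const lowPriority = new Pool({ max: 5 });   // for low-priority API calls

const app = express();

// Latency-sensitive endpoint: use the larger pool.
app.get('/orders', async (req, res) => {
  const { rows } = await highPriority.query('SELECT my_column FROM my_table');
  res.json(rows);
});

// Background-style endpoint: the small pool is enough.
app.get('/reports', async (req, res) => {
  const { rows } = await lowPriority.query('SELECT my_column FROM my_table');
  res.json(rows);
});

app.listen(3000);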
Since you are interested in read performance, you can set up replication between two (or more) PostgreSQL instances and then use pgpool-II to load balance between them.
Scaling horizontally means you won't start hitting the maximum instance sizes at AWS if you decide next week that you need to handle 10,000 concurrent reads.
You also start to get some HA in your architecture.
--
Many times people will use pgbouncer as a connection pooler even if they already have one built into their application code. pgbouncer works really well and is typically easier to configure and manage than pgpool, but it doesn't do load balancing. I'm not sure it would help you very much in this scenario, though.