How to optimize Postgresql max_connections and node-postgres connection pool? - node.js

In brief, I am having trouble supporting more than 5000 read requests per minute from a data API leveraging Postgresql, Node.js, and node-postgres. The bottleneck appears to be in between the API and the DB. Here are the implmentation details.
I'm using an AWS Postgresql RDS database instance (m4.4xlarge - 64 GB mem, 16 vCPUs, 350 GB SSD, no provisioned IOPS) for a Node.js powered data API. By default the RDS's max_connections=5000. The node API is load-balanced across two clusters with 4 processes each (2 Ec2s with 4 vCPUs running the API with PM2 in cluster-mode). I use node-postgres to bind the API to the Postgresql RDS, and am attempting to use it's connection pooling feature. Below is a sample of my connection pool code:
var pool = new Pool({
user: settings.database.username,
password: settings.database.password,
host: settings.database.readServer,
database: settings.database.database,
max: 25,
idleTimeoutMillis: 1000
});
/* Example of pool usage */
pool.query('SELECT my_column FROM my_table', function(err, result){
/* Callback code here */
});
Using this implementation and testing with a load tester, I can support about 5000 requests over the course of one minute, with an average response time of about 190ms (which is what I expect). As soon as I fire off more than 5000 requests per minute, my response time increases to over 1200ms in the best of cases and in the worst of cases the API begins to frequently timeout. Monitoring indicates that for the EC2s running the Node.js API, CPU utilization remains below 10%. Thus my focus is on the DB and the API's binding to the DB.
I have attempted to increase (and decrease for that matter) the node-postgres "max" connections setting, but there was no change in the API response/timeout behavior. I've also tried provisioned IOPS on the RDS, but no improvement. Also, interestingly, I scaled the RDS up to m4.10xlarge (160 GB mem, 40 vCPUs), and while the RDS CPU utilization dropped greatly, the overall performance of the API worsed considerably (couldn't even support the 5000 requests per minute that I was able to with the smaller RDS).
I'm in unfamilar territory in many respects and am unsure of how to best determine which of these moving parts is bottlenecking API performance when over 5000 requests per minute. As noted I have attempted a variety of adjustments based on the review of Postgresql configuration documentation and node-postgres documentation, but to no avail.
If anyone has advice on how to diagnose or optimize I would greatly appreciate it.
UPDATE
After scaling up to m4.10xlarge, i performed a series of load-tests, varying the number of request/min and the max number of connections in each pool. Here are some screen captures of monitoring metrics:

In order to support more then 5k requests, while maintaining the same response rate, you'll need better hardware...
The simple math states that:
5000 requests*190ms avg = 950k ms divided into 16 cores ~ 60k ms per core
which basically means your system was highly loaded.
(I'm guessing you had some spare CPU as some time was lost on networking)
Now, the really interesting part in your question comes from the scale up attempt: m4.10xlarge (160 GB mem, 40 vCPUs).
The drop in CPU utilization indicates that the scale up freed DB time resources - So you need to push more requests!
2 suggestions:
Try increasing the connection pool to max: 70 and look at the network traffic (depending on the amount of data you might be hogging the network)
also, are your requests to the DB a-sync from the application side? make sure your app can actually push more requests.

The best way is to make use of a separate Pool for each API call, based on the call's priority:
const highPriority = new Pool({max: 20}); // for high-priority API calls
const lowPriority = new Pool({max: 5}); // for low-priority API calls
Then you just use the right pool for each of the API calls, for optimum service/connection availability.

Since you are interested in read performance can set up replication between two (or more) PostgreSQL instances, and then use pgpool II to load balance between the instances.
Scaling horizontally means you won't start hitting the max instance sizes at AWS if you decide next week you need to go to 10,000 concurrent reads.
You also start to get some HA in your architecture.
--
Many times people will use pgbouncer as a connection pooler even if they have one built into their application code already. pgbouncer works really well and is typically easier to configure and manage that pgpool, but it doesn't do load balancing. I'm not sure if it would help you very much in this scenario though.

Related

OpenSearch (ElasticSearch) latency issue under multi-threading using RestHighLevelClient

We use RestHighLevelClient to query AWS OpenSearch in our service. Recently we have seen some latency issues related to OpenSearch calls so I'm doing stress test to troubleshoot but observed some unexpected behaviors.
In our service when a request is received, we start 5 threads and make one OpenSearch call within each thread in parallel in order to achieve the latency performance similar to one call. During load tests even when I send traffic with 1TPS, for the same request I'm seeing very different latency numbers for different threads, specifically there's usually one or two threads seeing huge latency compared to others, which seems like that thread is being blocked by something, for example 390 ms, 300ms, 1.1 sec, 520ms, 30ms for each thread while in the mean time I don't see any search latency spike reported on OpenSearch service, with the max SearchLatency being under 350ms all the time.
I read that the low level rest client used in the RestHighLevelClient is managing a conn pool with very small default maxConn values so I've override both the DEFAULT_MAX_CONN_PER_ROUTE to be 100 and DEFAULT_MAX_CONN_TOTAL to be 200 when creating the client but it doesn't seem working based on the test results I saw before and after updating these two values.
I'm wondering if anyone has seen similar issues or has any ideas on what could be the reason for this behavior. Thanks!

Configuring the ideal maxPoolSize on MongoDB Native driver

The default value for maxPoolSize on the MongoClient is 100.
How would one know if the default value, 100, is the optimal value? Is there a formula, a way of calculating to determine the ideal maxPoolSize?
Thanks a lot.
The pool size is solely dependent on the driver you are using, for Node.js minimum is 5 which is default and the maximum is 100(default).
Pool size helps to make the concurrent requests to the DB and it depends upon how much concurrent connection(queries you can run on) you need to create from the application. You can increase it based on your usage- that you have to need more than 100 connections at a time with DB or you need some connections for long-running tasks etc.
And these connections are of course based on the server that you are running the DB on. Mongo atlas does provide certain clusters that can have up to 1500 connections at a time.
You should check this post to see how does it impact application performance.

cassandra connections spikes load issue

I am using cassandra according to the following struct:
21 nodes , AWS EC2 i3.2xlarge , version 3.11.4 .
The application is opening about 5000 connection per node (so its 100k connections per cluster) using the datastax java connection driver.
Application is using autoscale and frequently opens/close connections.
Number of connections to open at once by app servers can reach up to 500 per node (opens simultaneously on all nodes at once - so its 10k connections opens at the same time across the cluster)
This cause spikes of load on cassandra and cause reads and writes latency.
I have noticed each time connections opens/close there are high number of reads from system_auth.roles and system_auth.role_permissions.
How can I prevent the load and resolve this issue ?
You need to modify your application to work with as small number of connections as possible. You need to have following in mind:
Create Cluster/Session object, once at start and keep it. Initialization of session is very expensive operation, it adds a load to Cassandra, and to your application as well
you may increase the number of the simultaneous requests per connection, instead of opening new connections. Protocol allows to have up to 32k requests per connection. Although, if you have too many requests in-flight, then it's a sign that your Cassandra doesn't keep with workload and can't answer fast enough. See documentation on connection pooling

Bursts of Redis errors

We've recently created a new Standard 1 GB Azure Redis cache specifically for distributed locking - separated from our main Redis cache. This was done to improve stability on our main Redis cache which is a very long term issue which this action seems to of significantly helped with.
On our new cache, we observe bursts of ~100 errors within the same few seconds every 1 - 3 days. The errors are either:
No connection is available to service this operation (StackExchange.Redis error)
Or:
Could not acquire distributed lock: Conflicted (RedLock.net error)
As they are errors from different packages, I suspect the Redis cache itself is the problem here. None of the stats during this time look out of the ordinary and the workload should fit comfortably in the Standard 1GB size.
I'm guessing this could be caused by the advertised Low network performance advertised, is this likely the cause?
Your theory sounds plausible.
Checking for insufficient network bandwidth
Here is a handy table showing the maximum observed bandwidth for various pricing tiers. Take a look at the observed maximum bandwidth for your SKU, then head over to your Redis blade in the Azure Portal and choose Metrics. Set the aggregation to Max, and look at the sum of cache read and cache write. This is your total bandwidth consumed. Overlay the sum of these two against the time period when you're experiencing the errors, and see if the problem is network throughput. If that's the case, scale up.
Checking server load
Also on the Metrics tab, take a look at server load. This is the percentage that Redis is busy and is unable to process requests. If you hit 100%, Redis cannot respond to new requests and you will experience timeout issues. If that's the case, scale up.
Reusing ConnectionMultiplexer
You can also run out of connections to a Redis server if you're spinning up a new instance of StackExchange.Redis.ConnectionMultiplexer per request. The service limits for the number of connections available based on your SKU are here on the pricing page. You can see if you're exceeding the maximum allowed connections for your SKU on the Metrics tab, select max aggregation, and choose Connected Clients as your metric.
Thread Exhaustion
This doesn't sound like your error, but I'll include it for completeness in this Rogue's Gallery of Redis issues, and it comes into play with Azure Web Apps. By default, the thread pool will start with 4 threads that can be immediately allocated to work. When you need more than four threads, they're doled out at a rate of one thread per 500ms. So if you dump a ton of requests on a Web App in a short period of time, you can end up queuing work and eventually having requests dropped before they even get to Redis. To test to see if this is a problem, go to Metrics for your Web App and choose Threads and set the aggregation to max. If you see a huge spike in a short period of time that corresponds with your trouble, you've found a culprit. Resolutions include making proper use of async/await. And when that gets you no further, use ThreadPool.SetMinThreads to a higher value, preferably one that is close to or above the max thread usage that you see in your bursts.
Rob has some great suggestions but did want to add information on troubleshooting traffic burst and poor ThreadPool settings. Please see: Troubleshoot Azure Cache for Redis client-side issues
Bursts of traffic combined with poor ThreadPool settings can result in delays in processing data already sent by the Redis Server but not yet consumed on the client side.
Monitor how your ThreadPool statistics change over time using an example ThreadPoolLogger. You can use TimeoutException messages from StackExchange.Redis like below to further investigate:
System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0,
IOCP: (Busy=6,Free=999,Min=2,Max=1000), WORKER: (Busy=7,Free=8184,Min=2,Max=8191)
Notice that in the IOCP section and the WORKER section you have a Busy value that is greater than the Min value. This difference means your ThreadPool settings need adjusting.
You can also see in: 64221. This value indicates that 64,211 bytes have been received at the client's kernel socket layer but haven't been read by the application. This difference typically means that your application (for example, StackExchange.Redis) isn't reading data from the network as quickly as the server is sending it to you.
You can configure your ThreadPool Settings to make sure that your thread pool scales up quickly under burst scenarios.
I hope you find this additional information is helpful.

Load testing bottleneck on nodejs with Google Compute Engine

I cannot figure out what is the cause of the bottleneck on this site, very bad response times once about 400 users reached. The site is on Google compute engine, using an instance group, with network load balancing. We created the project with sailjs.
I have been doing load testing with Google container engine using kubernetes, running the locust.py script.
The main results for one of the tests are:
RPS : 30
Spawn rate: 5 p/s
TOTALS USERS: 1000
AVG(res time): 27500!! (27,5 seconds)
The response time initially is great, below one second, but when it starts reaching about 400 users the response time starts to jump massively.
I have tested obvious factors that can influence that response time, results below:
Compute engine Instances
(2 x standard-n2, 200gb disk, ram:7.5gb per instance):
Only about 20% cpu utilization used
Outgoing network bytes: 340k bytes/sec
Incoming network bytes: 190k bytes/sec
Disk operations: 1 op/sec
Memory: below 10%
MySQL:
Max_used_connections : 41 (below total possible)
Connection errors: 0
All other results for MySQL also seem fine, no reason to cause bottleneck.
I tried the same test for a new sailjs created project, and it did better, but still had terrible results, 5 seconds res time for about 2000 users.
What else should I test? What could be the bottleneck?
Are you doing any file reading/writing? This is a major obstacle in node.js, and will always cause some issues. Caching read files or removing the need for such code should be done as much as possible. In my own experience, serving files like images, css, js and such trough my node server would start causing trouble when the amount of concurrent requests increased. The solution was to serve all of this trough a CDN.
Another proble could be the mysql driver. We had some problems with connection not being closed correctly (Not using sails.js, but I think they used the same driver at the time I encountered this), so they would cause problems on the mysql server, resulting in long delays when fetching data from the database. You should time/track the amount of mysql queries and make sure they arent delayed.
Lastly, it could be some special issue with sails.js and Google compute engine. You should make sure there arent any open issues on either of these about the same problem you are experiencing.

Resources