OpenSearch (Elasticsearch) latency issue under multi-threading using RestHighLevelClient

We use RestHighLevelClient to query AWS OpenSearch in our service. Recently we have seen some latency issues related to OpenSearch calls, so I'm running a stress test to troubleshoot, but I've observed some unexpected behavior.
In our service, when a request is received we start 5 threads and make one OpenSearch call in each thread in parallel, so that the overall latency is close to that of a single call. During load tests, even at 1 TPS, I see very different latencies across the threads of the same request. Usually one or two threads are much slower than the others, as if the thread were being blocked by something: for example 390 ms, 300 ms, 1.1 s, 520 ms, and 30 ms for the five threads. In the meantime I don't see any search latency spike reported by the OpenSearch service; its max SearchLatency stays under 350 ms the whole time.
I read that the low-level REST client used by RestHighLevelClient manages a connection pool with very small default max-connection values, so I've overridden both DEFAULT_MAX_CONN_PER_ROUTE (to 100) and DEFAULT_MAX_CONN_TOTAL (to 200) when creating the client. It doesn't seem to help, though: the test results look the same before and after the change.
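One way to check whether a call is being held up before it even starts is to record, for each of the parallel calls, how long it waited to begin and how long it actually ran. The following is only a minimal Python sketch of that idea, not the service's code: `fake_search` is a hypothetical stand-in for the OpenSearch call, and the delays are made-up values scaled down from the numbers above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_search(delay_s):
    """Hypothetical stand-in for one OpenSearch call; it only sleeps."""
    time.sleep(delay_s)
    return delay_s

def timed_call(submitted_at, delay_s):
    """Run one call and report how long it waited to start and how long it ran."""
    started_at = time.perf_counter()
    fake_search(delay_s)
    finished_at = time.perf_counter()
    return {
        "wait_s": started_at - submitted_at,  # time before a worker picked it up
        "run_s": finished_at - started_at,    # time inside the call itself
    }

def fan_out(delays, workers):
    """Submit all calls at once; return per-call timings and the total time."""
    request_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(timed_call, time.perf_counter(), d) for d in delays]
        timings = [f.result() for f in futures]
    total_s = time.perf_counter() - request_start
    return timings, total_s

# Made-up delays (seconds), one per parallel call.
delays = [0.039, 0.030, 0.110, 0.052, 0.003]
timings, total_s = fan_out(delays, workers=5)
for t in timings:
    print(f"wait={t['wait_s']:.3f}s run={t['run_s']:.3f}s")
print(f"total={total_s:.3f}s")
```

If a slow call shows a large `wait_s` and a normal `run_s`, the delay is on the client side (a thread or connection was not available yet). If `run_s` itself is large while the server reports low latency, the time is more likely going to connection setup or the network.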
I'm wondering if anyone has seen similar issues or has any ideas on what could be the reason for this behavior. Thanks!


Very high max response time and errors when looping a form submission

My requirement is to run 90 concurrent virtual users executing multiple scenarios (15 scenarios) simultaneously for 30 minutes. For some of the scenarios I use the concurrent thread group, and for the others the normal thread group.
My issues are:
1) After I execute all 15 scenarios, the max response time for each scenario is very high (>40 sec). Is there any suggestion for reducing it?
2) One of the scenarios submits a web form. There is no issue when I submit only one, but during the 90-concurrent-user run some of the form submissions get a 500 error code. Is the error caused by the loop I use to reach the 30-minute duration?
To reduce the response time you first need to find the reason it is so high. Possible reasons include:
Lack of resources such as CPU or RAM. Make sure to monitor resource consumption, for example with the JMeter PerfMon Plugin.
Incorrect configuration of the middleware (application server, database, etc.). All these components need to be tuned for high loads. For example, if you set the maximum number of connections on the application server to 10 and you have 90 threads, 80 threads will queue up waiting for the next available executor. The same applies to the database connection pool.
Inefficient application code. Use a profiler to inspect what is going on under the hood and why the slowest functions are that slow; your application's algorithms might not be efficient enough.
If your test succeeds with a single thread and fails under load, that points to a bottleneck. Try increasing the load gradually and see how many users the application can support without performance degradation or errors. HTTP 5xx status codes indicate server-side errors, so it is also worth inspecting your application logs for more insight.
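The queuing effect from the connection limit example above can be estimated with a few lines of arithmetic. This is a rough Python sketch under simplifying assumptions that are not from the question: all 90 requests arrive at the same moment, and every request takes the same hypothetical 0.5 seconds once it holds a connection.

```python
def completion_times(num_requests, pool_size, service_time_s):
    """Finish time of each request when only pool_size can run at once.

    Assumes all requests arrive together and each takes service_time_s.
    """
    times = []
    for i in range(num_requests):
        wave = i // pool_size  # number of full batches that run before this one
        times.append((wave + 1) * service_time_s)
    return times

times = completion_times(num_requests=90, pool_size=10, service_time_s=0.5)
queued = sum(1 for t in times if t > 0.5)  # requests that had to wait for a slot
print(queued)      # 80
print(max(times))  # 4.5
```

Even with a fast 0.5 second operation, the slowest request in this model takes 4.5 seconds, which is how a small pool can inflate the max response time while the average server-side processing time looks fine.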

Limiting the number of requests in Cassandra without starting the timeout clock

Version 4 of the DataStax Cassandra driver has a throttling feature.
The documentation states:
Similarly, the request timeout encompasses throttling: the timeout starts ticking before the
throttler has started processing the request; a request may time out while it is still in the
throttler's queue, before the driver has even tried to send it to a node.
Great. However, let's say I have a dynamic list of ids and I want to execute select requests against Cassandra in parallel (using executeAsync()) for all ids in the list. If the list is too large, I will eventually hit timeouts because requests sit in the throttler's queue for too long.
How can I overcome this issue? Is there any built-in rate-limiting technique, so that I don't have to care how many requests I can execute in parallel and can just send all of them to Cassandra and wait until they have all completed?
UPD: I am not interested in custom code solutions; of course we are capable of implementing our own rate limiter. I am asking specifically about the driver's built-in mechanisms for achieving this.

Is there a way to read a database link in Cosmos DB Java V4 API?

For example, reading "dbs/colls/document" instead of getting a container, then calling read on the container.
I've been having an issue where the first readItem on a container (after calling database.getContainer(x)) is extremely slow (like 1 second or longer) and was thinking using a database link could be faster.
I'm guessing a read after getting the container is slow because it doesn't make a service call until I call read.
Is there a way I can have this preloaded when reading in a database?
I have an application with a read(collectionName, key) method, and my approach was to use getContainer(collectionName) and then call read on that, but this method needs to be fast.
As discussed, the best practice is to keep an instance of your container alive between requests and call readItem on each request. This should resolve the primary issue.
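Keeping the container instance alive can be as simple as caching one handle per collection name. The sketch below shows the pattern in Python purely as an illustration; `open_container` is a hypothetical factory standing in for something like database.getContainer(name), and `fake_open` exists only so the example runs on its own.

```python
import threading

class ContainerCache:
    """Keeps one container handle per name so each is created only once."""

    def __init__(self, open_container):
        # open_container is a hypothetical factory supplied by the caller.
        self._open_container = open_container
        self._containers = {}
        self._lock = threading.Lock()

    def get(self, name):
        # The lock keeps two threads from creating the same handle twice.
        with self._lock:
            if name not in self._containers:
                self._containers[name] = self._open_container(name)
            return self._containers[name]

opened = []

def fake_open(name):
    """Stand-in factory that records how often it is called."""
    opened.append(name)
    return {"name": name}

cache = ContainerCache(fake_open)
first = cache.get("orders")
second = cache.get("orders")
print(first is second, opened)  # True ['orders']
```

A read(collectionName, key) method would then call `cache.get(collectionName)` and read from the returned handle, so only the first request for each collection can pay any setup cost.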
As for the secondary concern, the "high latency every 50 requests or so": this is a known issue, but it should only occur in the first minute or so of operation. If you can tolerate the initial slow requests, the solution is to wait for performance to stabilize. How long do you have to run your app before you no longer see these high-latency requests?
FYI, if latency is a concern, run your client application in a geographically colocated Azure VM. Also a good rule of thumb is to allocate client CPU cores such that CPU utilization is not more than 40% or 50%.

Bursts of Redis errors

We've recently created a new Standard 1 GB Azure Redis cache specifically for distributed locking, separate from our main Redis cache. This was done to improve stability on our main Redis cache, which has been a very long-term issue and which this change seems to have helped with significantly.
On our new cache, we observe bursts of ~100 errors within the same few seconds every 1 - 3 days. The errors are either:
No connection is available to service this operation (StackExchange.Redis error)
Could not acquire distributed lock: Conflicted ( error)
As they are errors from different packages, I suspect the Redis cache itself is the problem here. None of the stats during this time look out of the ordinary and the workload should fit comfortably in the Standard 1GB size.
I'm guessing this could be caused by the "Low" network performance advertised for this tier. Is this likely the cause?
Your theory sounds plausible.
Checking for insufficient network bandwidth
Here is a handy table showing the maximum observed bandwidth for various pricing tiers. Take a look at the observed maximum bandwidth for your SKU, then head over to your Redis blade in the Azure Portal and choose Metrics. Set the aggregation to Max, and look at the sum of cache read and cache write. This is your total bandwidth consumed. Overlay the sum of these two against the time period when you're experiencing the errors, and see if the problem is network throughput. If that's the case, scale up.
Checking server load
Also on the Metrics tab, take a look at server load. This is the percentage that Redis is busy and is unable to process requests. If you hit 100%, Redis cannot respond to new requests and you will experience timeout issues. If that's the case, scale up.
Reusing ConnectionMultiplexer
You can also run out of connections to a Redis server if you're spinning up a new instance of StackExchange.Redis.ConnectionMultiplexer per request. The service limits for the number of connections available based on your SKU are here on the pricing page. To see whether you're exceeding the maximum allowed connections for your SKU, go to the Metrics tab, select the max aggregation, and choose Connected Clients as your metric.
Thread Exhaustion
This doesn't sound like your error, but I'll include it for completeness in this Rogues' Gallery of Redis issues, and it comes into play with Azure Web Apps. By default, the thread pool will start with 4 threads that can be immediately allocated to work. When you need more than four threads, they're doled out at a rate of one thread per 500ms. So if you dump a ton of requests on a Web App in a short period of time, you can end up queuing work and eventually having requests dropped before they even get to Redis. To test whether this is a problem, go to Metrics for your Web App, choose Threads, and set the aggregation to max. If you see a huge spike in a short period of time that corresponds with your trouble, you've found a culprit. Resolutions include making proper use of async/await. And when that gets you no further, use ThreadPool.SetMinThreads to set a higher minimum, preferably one that is close to or above the max thread usage that you see in your bursts.
Rob has some great suggestions, but I wanted to add information on troubleshooting traffic bursts and poor ThreadPool settings. Please see: Troubleshoot Azure Cache for Redis client-side issues
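To get a feel for how slowly the pool grows under the rule described above (4 threads available immediately, then one more every 500ms), here is a back-of-the-envelope calculation in Python. It is a simplified model that takes those two figures at face value; the real thread pool's behavior is more nuanced, and the function name is made up for this example.

```python
def seconds_until_threads(needed, min_threads=4, injection_interval_s=0.5):
    """Rough time until `needed` threads exist under the rule described above."""
    extra = max(0, needed - min_threads)  # threads beyond the immediate minimum
    return extra * injection_interval_s

print(seconds_until_threads(4))                   # 0.0
print(seconds_until_threads(50))                  # 23.0
print(seconds_until_threads(50, min_threads=50))  # 0.0
```

In this model a burst that needs 50 threads waits about 23 seconds for the pool to catch up, while raising the minimum to 50 removes the wait, which is the motivation for ThreadPool.SetMinThreads.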
Bursts of traffic combined with poor ThreadPool settings can result in delays in processing data already sent by the Redis Server but not yet consumed on the client side.
Monitor how your ThreadPool statistics change over time using an example ThreadPoolLogger. You can use TimeoutException messages from StackExchange.Redis like below to further investigate:
System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0,
IOCP: (Busy=6,Free=999,Min=2,Max=1000), WORKER: (Busy=7,Free=8184,Min=2,Max=8191)
Notice that in the IOCP section and the WORKER section you have a Busy value that is greater than the Min value. This difference means your ThreadPool settings need adjusting.
You can also see in: 64221. This value indicates that 64,221 bytes have been received at the client's kernel socket layer but haven't been read by the application. This difference typically means that your application (for example, StackExchange.Redis) isn't reading data from the network as quickly as the server is sending it to you.
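If you collect many of these messages, the same checks can be automated. The Python sketch below parses the exact message format shown above with regular expressions; the helper names are made up for this example, and the patterns are only an assumption about that one format, so they may need adjusting for other StackExchange.Redis versions.

```python
import re

# Matches e.g. "IOCP: (Busy=6,Free=999,Min=2,Max=1000)".
POOL_PATTERN = re.compile(
    r"(IOCP|WORKER): \(Busy=(\d+),Free=(\d+),Min=(\d+),Max=(\d+)\)"
)

def parse_pools(message):
    """Extract the IOCP and WORKER thread pool counters from a timeout message."""
    pools = {}
    for name, busy, free, minimum, maximum in POOL_PATTERN.findall(message):
        pools[name] = {
            "busy": int(busy), "free": int(free),
            "min": int(minimum), "max": int(maximum),
        }
    return pools

def unread_bytes(message):
    """Return the 'in:' value (bytes received but not yet read), or None."""
    match = re.search(r"\bin: (\d+)", message)
    return int(match.group(1)) if match else None

message = (
    "System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, "
    "queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0,\n"
    "IOCP: (Busy=6,Free=999,Min=2,Max=1000), WORKER: (Busy=7,Free=8184,Min=2,Max=8191)"
)

pools = parse_pools(message)
# Pools where Busy exceeds Min are the ones whose settings need adjusting.
starved = sorted(name for name, p in pools.items() if p["busy"] > p["min"])
print(starved)                # ['IOCP', 'WORKER']
print(unread_bytes(message))  # 64221
```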
You can configure your ThreadPool Settings to make sure that your thread pool scales up quickly under burst scenarios.
I hope you find this additional information helpful.

Improving Amazon SQS Performance

Everything I can find about performance of Amazon Simple Queue Service (SQS), including their own documentation, suggests that getting high throughput requires multiple threads. And I've verified this myself using the JS API with Node 12. If I create multiple threads, I get about the same throughput on each thread, so the total throughput increase is pretty much linear. But I'm running this on a nice machine with lots of cores. When I run in Lambda on a single core, multiple threads don't improve the performance, and generally this is what I would expect of multi-threaded apps.
But here's what I don't understand - there should be very little going on here in the way of CPU, most of the time is spent waiting on web requests. The AWS SQS API appears to be asynchronous in that all of the methods use callbacks for the responses, and I'm using Promises to "asyncify" all of the API calls, with multiple tasks running concurrently. Normally doing this with any kind of async IO is handled great by Node, and improves throughput hugely, I do it all the time with database APIs, multiple streams, etc. But SQS definitely isn't behaving that way, it's behaving as though its IO is actually synchronous and blocking threads on the network calls, which would be outrageous for any modern API.
Has anyone had success getting high SQS message throughput in a single Node thread? The max I'm seeing is about 50 to 100 messages/sec for FIFO queues (send, receive, and delete, all of which are calling the batch methods with the max batch size of 10). And this is running in Lambda, i.e. on their own network, which is only slightly faster than running it on my laptop over the Internet, another surprising find. Amazon's documentation says FIFO queues should support up to 3000 messages per second when batching, which would be just fine for me. Does it really take multiple threads on multiple cores or virtual CPUs to achieve this? That would be ridiculous. I just can't believe that much CPU would be used; it should be mostly IO time, which should be asynchronous.
As I continued to test, I found that the linear improvement with the number of threads only happened when each thread was processing a different queue. If the threads are all processing the same queue, there is no improvement by adding threads. So it behaves as though each queue is throttled by Amazon. But the throughput to which it seems to be throttling is way below what I found documented as the max throughput. Really confused and disappointed right now!
Michael's comments to the original question were right on. I was sending all messages to the same message group. I had previously been working with AMQP message queues, in which messages will be ordered in the queue in the order they're sent, and they'll be distributed to subscribers in that order. But when multiple listeners are consuming the AMQP queue, because of varying network latencies, there is no guarantee that they'll be received in that order chronologically.
So that's actually a really cool feature of SQS, the guarantee that messages will be chronologically received in the order they were sent within the same message group. In my case, I don't care about the receipt order. So now I'm setting a unique message group ID on each message, and scaling up performance by increasing the number of async message receive loops, still just in one thread, and the throughput is amazing!
So the bottom line: If exact receipt order of messages isn't important for your FIFO queue, set the message group ID to a unique value on each message, and scale out with more receiver tasks to get the best throughput performance. If you do need guaranteed message ordering, it looks like around 50 messages per second is about the best you'll do.
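As an illustration of the bottom line, here is a minimal Python sketch that builds batch entries with a unique message group ID per message. It only constructs the entry dictionaries in the shape the SendMessageBatch API expects and does not call AWS; the function name is made up for this example. The random MessageDeduplicationId is an assumption for queues without content-based deduplication; drop it if your queue has that enabled.

```python
import uuid

def build_fifo_batches(bodies, batch_size=10):
    """Split message bodies into batches of entries, each with its own group ID."""
    batches = []
    for start in range(0, len(bodies), batch_size):
        entries = []
        for offset, body in enumerate(bodies[start:start + batch_size]):
            entries.append({
                "Id": str(offset),  # only needs to be unique within one batch
                "MessageBody": body,
                # Unique group per message: no ordering constraint between messages.
                "MessageGroupId": str(uuid.uuid4()),
                # Assumption: the queue does not use content-based deduplication.
                "MessageDeduplicationId": str(uuid.uuid4()),
            })
        batches.append(entries)
    return batches

batches = build_fifo_batches([f"msg-{i}" for i in range(25)])
print([len(b) for b in batches])  # [10, 10, 5]
```

Each inner list can be passed as the Entries argument of a batch send. Because every message is in its own group, SQS does not have to deliver them in order, which is what lets multiple receive loops make progress at the same time.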
