Azure Redis timeouts - azure-web-app-service

I have a Asp.Net Azure web app calling a Azure Redis instance. I keep getting timeouts on Redis. The message I get is below.
inst: 1, mgr: Inactive, err: never, queue: 120, qu: 0, qs: 120, qc: 0, wr: 0, wq: 0, in: 65536, ar: 0, clientName: RD0004FFA37AA4, serverEndpoint: Unspecified/server.redis.cache.windows.net:6380, keyHashSlot: 11524, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=21,Free=8170,Min=200,Max=8191)
Both the app and redis are in the same region (East US 2).

Having "in: 65536" in your error message indicates that there are 65536 bytes sitting in the client kernel socket receive buffer ready to be processed but have not yet been parsed by the client app.
I have seen this in two cases:
ThreadPool settings need to be adjusted (as described here). However, the error message provided above does not show this as being a problem. Check a sampling of more errors to ensure that this is not happening in other cases.
Client side CPU is running hot. When the CPU is high, the code that handles the socket receive events is not triggered in a timely manner, so the processing code does not parse the waiting data in the kernel buffer. Check your client CPU history. Be careful though - a short-lived spike in CPU might not show up in azure metrics because the CPU is captured on a periodic cycle (I think every 20 seconds). If the spike happens between those samples and is short, it might not be noticed by the metrics gathering system.
Other common client-side causes are documented here and common server-side errors are documented here.

Related

AppEngine nodejs app sporadically sends 502s and restarts

We have a nodejs app that gets successfully deployed to a standard environment. Something happens after about two hours (or sooner depending on traffic): our downstream clients start receiving a bunch of 502 responses and then the service stabilizes. We think this has been happening for at least a few months.
When investigating the cause of the 502s, I see that:
There are no unhandled exception/promise rejection logs to indicate that the node app has crashed
I console.log when receiving SIGTERM and that, too, does not appear in the logs
The logs of the nginx sidecar include the following:
2020/06/16 23:11:11 [error] 35#35: *1149 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 169.254.1.1, server: _, request: "POST /api/redacted HTTP/1.1", upstream: "http://127.0.0.1:8081/api/redacted", host: "redacted.appspot.com""
I'm assuming that the 502s are coming from nginx because the upstream has disappeared. Are there other explanations I should explore?
If GAE is replacing my app containers intentionally, shouldn't that process prevent these types of 502s?
Should I expect something other than SIGTERM to be sent by the environment when the application/container is getting replaced?
Update #1 (2020-06-22)
I investigated and found evidence that we might be exceeding memory quota so I changed our instance_class from F1 to F2. As I write this our instances are sitting at ~200M of memory usage (F2s have 512M available). Additionally, I use the --max-old-space-size switch to set nodes memory usage to 496M.
The 502s are still happening.
I suspect that the 502s are happening as a result of the autoscaler terminating instances. Our app never receives SIGTERM (even during deployments). That means I can't close http keepalive connections gracefully and might explain why nginx raises Connection reset by peer.
Update #2 (2020-06-24)
Our service is just standard REST type stuff, no heavy loops.
I'll post another update with some memory graphs but I don't see any spikes. Perhaps a small memory leak.
Here's our app.yaml:
service: redacted
runtime: nodejs12
instance_class: F2
handlers:
- url: /.*
secure: always
redirect_http_response_code: 301
script: auto
We had a very similar problem with our Node.js app deployed on App Engine Flexible.
In our case, we ultimately determined that we had memory pressure that was causing the Node.js garbage collector to sometimes delay the processing of a request for hundreds of milliseconds (sometimes more). This caused our health check URLs to sporadically timeout, prompting GAE to remove the instance from the active pool.
Because we typically had just two instances handling the steady traffic, removing one instance quickly overloaded the remaining instance, and it would soon suffer the same fate.
We were surprised to find that it could take two minutes or longer before App Engine assigned traffic to a newly-created instance. Between the time our original instances were declared unhealthy, and when new instance(s) were online, 502s would be returned (presumably by GAE's nginx) to the client.
We were able to stabilize the environment simply by adding:
automatic_scaling:
min_num_instances: 4
To our app.yaml. Because two instances were generally sufficient for the traffic, ensuring we always had four running apparently kept our memory usage low enough to prevent the GC from stalling request handling, and even if it did, we had enough excess capacity to handle one instance being removed.
The scaling settings for GAE standard are slightly different.
In retrospect, we could see that our latency/response times would get a little "jittery" before the real problems started. Most responses had typical response times ~30ms, but increasingly we would see outlier requests in the x00ms range. You may want to check your request logs to see if you see something similar.
New Relic's Node.js VM data was helpful in detecting that garbage collection was taking an increasing amount of time.
Usually, 502 messages are errors on nginx side, as you have mentioned. The detailed logs related to this errors are not surfaced to Cloud Logging, yet.
According to your behavior, it seems a workload, so we can relate this case to an issue with running out of resources.
There are somethings that are well worth to take a look:
Check your metrics. The memory and CPU usage should be under healthy limits.
Check whether your scaling metrics are being enough to your workload.
Is there a chance to share these metrics near to the restart event?
Also, i t would be goo if you share your resources and scaling in the app.yaml.

Bursts of Redis errors

We've recently created a new Standard 1 GB Azure Redis cache specifically for distributed locking - separated from our main Redis cache. This was done to improve stability on our main Redis cache which is a very long term issue which this action seems to of significantly helped with.
On our new cache, we observe bursts of ~100 errors within the same few seconds every 1 - 3 days. The errors are either:
No connection is available to service this operation (StackExchange.Redis error)
Or:
Could not acquire distributed lock: Conflicted (RedLock.net error)
As they are errors from different packages, I suspect the Redis cache itself is the problem here. None of the stats during this time look out of the ordinary and the workload should fit comfortably in the Standard 1GB size.
I'm guessing this could be caused by the advertised Low network performance advertised, is this likely the cause?
Your theory sounds plausible.
Checking for insufficient network bandwidth
Here is a handy table showing the maximum observed bandwidth for various pricing tiers. Take a look at the observed maximum bandwidth for your SKU, then head over to your Redis blade in the Azure Portal and choose Metrics. Set the aggregation to Max, and look at the sum of cache read and cache write. This is your total bandwidth consumed. Overlay the sum of these two against the time period when you're experiencing the errors, and see if the problem is network throughput. If that's the case, scale up.
Checking server load
Also on the Metrics tab, take a look at server load. This is the percentage that Redis is busy and is unable to process requests. If you hit 100%, Redis cannot respond to new requests and you will experience timeout issues. If that's the case, scale up.
Reusing ConnectionMultiplexer
You can also run out of connections to a Redis server if you're spinning up a new instance of StackExchange.Redis.ConnectionMultiplexer per request. The service limits for the number of connections available based on your SKU are here on the pricing page. You can see if you're exceeding the maximum allowed connections for your SKU on the Metrics tab, select max aggregation, and choose Connected Clients as your metric.
Thread Exhaustion
This doesn't sound like your error, but I'll include it for completeness in this Rogue's Gallery of Redis issues, and it comes into play with Azure Web Apps. By default, the thread pool will start with 4 threads that can be immediately allocated to work. When you need more than four threads, they're doled out at a rate of one thread per 500ms. So if you dump a ton of requests on a Web App in a short period of time, you can end up queuing work and eventually having requests dropped before they even get to Redis. To test to see if this is a problem, go to Metrics for your Web App and choose Threads and set the aggregation to max. If you see a huge spike in a short period of time that corresponds with your trouble, you've found a culprit. Resolutions include making proper use of async/await. And when that gets you no further, use ThreadPool.SetMinThreads to a higher value, preferably one that is close to or above the max thread usage that you see in your bursts.
Rob has some great suggestions but did want to add information on troubleshooting traffic burst and poor ThreadPool settings. Please see: Troubleshoot Azure Cache for Redis client-side issues
Bursts of traffic combined with poor ThreadPool settings can result in delays in processing data already sent by the Redis Server but not yet consumed on the client side.
Monitor how your ThreadPool statistics change over time using an example ThreadPoolLogger. You can use TimeoutException messages from StackExchange.Redis like below to further investigate:
System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0,
IOCP: (Busy=6,Free=999,Min=2,Max=1000), WORKER: (Busy=7,Free=8184,Min=2,Max=8191)
Notice that in the IOCP section and the WORKER section you have a Busy value that is greater than the Min value. This difference means your ThreadPool settings need adjusting.
You can also see in: 64221. This value indicates that 64,211 bytes have been received at the client's kernel socket layer but haven't been read by the application. This difference typically means that your application (for example, StackExchange.Redis) isn't reading data from the network as quickly as the server is sending it to you.
You can configure your ThreadPool Settings to make sure that your thread pool scales up quickly under burst scenarios.
I hope you find this additional information is helpful.

What should be done when the provisioned throughput is exceeded?

I'm using AWS SDK for Javascript (Node.js) to read data from a DynamoDB table. The auto scaling feature does a great job during most of the time and the consumed Read Capacity Units (RCU) are really low most part of the day. However, there's a programmed job that is executed around midnight which consumes about 10x the provisioned RCU and since the auto scaling takes some time to adjust the capacity, there are a lot of throttled read requests. Furthermore, I suspect my requests are not being completed (though I can't find any exceptions in my error log).
In order to handle this situation, I've considered increasing the provisioned RCU using the AWS API (updateTable) but calculating the number of RCU my application needs may not be straightforward.
So my second guess was to retry failed requests and simply wait for auto scale increase the provisioned RCU. As pointed out by AWS docs and some Stack Overflow answers (particularlly about ProvisionedThroughputExceededException):
The AWS SDKs for Amazon DynamoDB automatically retry requests that receive this exception. So, your request is eventually successful, unless the request is too large or your retry queue is too large to finish.
I've read similar questions (this one, this one and this one) but I'm still confused: is this exception raised if the request is too large or the retry queue is too large to finish (therefore after the automatic retries) or actually before the retries?
Most important: is that the exception I should be expecting in my context? (so I can catch it and retry until auto scale increases the RCU?)
Yes.
Every time your application sends a request that exceeds your capacity you get ProvisionedThroughputExceededException message from Dynamo. However your SDK handles this for you and retries. The default Dynamo retry time starts at 50ms, the default number of retries is 10, and backoff is exponential by default.
This means you get retries at:
50ms
100ms
200ms
400ms
800ms
1.6s
3.2s
6.4s
12.8s
25.6s
If after the 10th retry your request has still not succeeded, the SDK passes the ProvisionedThroughputExceededException back to your application and you can handle it how you like.
You could handle it by increasing throughput provision but another option would be to change the default retry times when you create the Dynamo connection. For example
new AWS.DynamoDB({maxRetries: 13, retryDelayOptions: {base: 200}});
This would mean you retry 13 times, with an initial delay of 200ms. This would give your request a total of 819.2s to complete rather than 25.6s.

Azure Function Queue Trigger falling way behind

I'm working on a demo for azure functions using queue triggers. I created a recursive Sudoku solver to show how to take depth first search and convert to using queued recursion. The code is on github.
I was expecting it to scale out and process an insane number of messages per second, but it is barely processing 30/s. The queue is filling up and the utilization seems minimal.
How can I get better performance from this? I tried increasing the batch size in the host.json, but didn't seem to help. I have over 200k messages in the queue and it's growing.
Update 1
I tried setting the host.json file as
{
"queues": {
"visibilityTimeout": "00:00:10",
"batchSize": 32,
"maxDequeueCount": 5,
"newBatchThreshold": 100
}
}
but request per second remained the same.
I deployed the same function to another instance, but tied it to S4 service plan. This is able to process about 64 requests per second, but still seems slow.
I can serial process the messages locally way faster than this.
Update 2
I scaled the S4 to 10 instances and each instance is handling about 60-70 requests per second. But that's insanely expensive to still not be able to process as fast as I can with a single core locally. The queue used with the service plan functions has 500k messages piled up.
Azure functions do not listen for an item to be added to a queue, they actually pole the queue using a polling algorithm which you can over ride with the maxPollingInterval property.
Adding "maxPollingInterval": "00:00:01" to the options you have already mentioned above should solve your problem.
maxPollingInterval azure documentaiton

Azure service bus queue only receiving 450 messages at a time

Does Azure have some kind of limitation in terms of batch items received? The following code is only retrieving 450 messages despite being asked for more:
QueueConnector.MyQueueClient.ReceiveBatch(1000, new TimeSpan(0, 0, 10));
I've tried increased times, but it doesn't have any impact--450, every time. This appears to be the recommended way in the Azure SDK docs of batch receiving.
Note: there are tens of thousands of items in the queue.
The count passed to the ReceiveBatch is an upper bound and that is also mentioned right in the docs, so this is expected behavior. Service Bus will release a batch based on message availability or batch size. Batches are capped at 256 kByte for send and receive. For SendBatch, that's also said in the docs.

Resources