Azure DataCache MaxConnectionToServer

I am using the AppFabricCacheSessionStoreProvider and occasionally get the error
ErrorCode:SubStatus:There is a temporary failure. Please retry later.
(The request failed, because you exceeded quota limits for this hour.
If you experience this often, upgrade your subscription to a higher
one). Additional Information : Throttling due to resource :
Connections.
I am using a basic 128 MB cache with a web role that has two instances. What is the default MaxConnectionToServer value if it is not set? I think that firing up a staging deployment as well can cause this error (4 simultaneous instances). Will setting MaxConnectionToServer to a higher value make things better or worse? I believe the 128 MB cache has a limit of 5 connections, so should I set it to 1, which would mean only 4 connections could ever be used? The cache is not used elsewhere in the app.

The default for MaxConnectionToServer is 1, so you shouldn't have to change this setting, but setting it explicitly to 1 will avoid confusing anyone else who looks at your config. If you set it to a higher value, you will see this problem more often.
The cache session provider seems to be a little slow at disposing of its connections to the cache when it no longer needs them. This means that if you're running a number of instances close to the limit for your cache size, you do seem to see this error. You're correct that a 128 MB cache only allows 5 concurrent connections. At the moment, the only way I know of to avoid this problem is to buy the next cache size up.
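If you do want to pin the value explicitly, it can be set on the cache client configuration. I believe the equivalent web.config setting is the maxConnectionsToServer attribute on the dataCacheClient element; below is a minimal, hedged sketch of doing the same thing in code via DataCacheFactoryConfiguration. The endpoint is a placeholder and the authentication/security settings are omitted.

// Sketch only: programmatic cache client configuration with one connection per
// role instance. The endpoint below is a placeholder, and the security/token
// settings required for the Azure cache are omitted.
using System.Collections.Generic;
using Microsoft.ApplicationServer.Caching;

class CacheClientSetup
{
    static DataCacheFactory CreateFactory()
    {
        var config = new DataCacheFactoryConfiguration
        {
            // 2 web role instances + 2 staging instances = 4 connections,
            // staying under the 5-connection limit of a 128 MB cache.
            MaxConnectionsToServer = 1,
            Servers = new List<DataCacheServerEndpoint>
            {
                new DataCacheServerEndpoint("yourcache.cache.windows.net", 22233)
            }
        };
        return new DataCacheFactory(config);
    }
}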

Related

How to recover from redis running out-of-memory and crashing due to retained keys by Bull for completed jobs?

It seems that using Bull 3.21.1 as a work queue with its default configuration will cause Redis to retain keys indefinitely even under successful operation, eventually exhausting memory in Redis and causing a crash. Here is one example of that experience, as described; I had the same experience. Here's another, in which it is explained that although Bull's default behavior cannot be made memory-leak-free without breaking changes for existing users, it could be better documented that the default behavior retains Redis keys for completed jobs indefinitely, along with the configuration needed to operate without leaking memory in Redis. As of this writing, Bull's documentation still contains no mention of this behavior, the configuration point, or the solution.
Having had a production (or, if you're luckier, pre-production) crash due to Bull's undocumented default behavior of retaining the Redis keys of completed jobs forever, how do I recover?
If possible, increase the memory available to Redis to relieve the immediate memory pressure. This is what we did.
A potentially quick manual fix to remove old jobs' keys is to call Bull.Queue#clean(1000 * 60 * 60 * 24) from an NPM script specified in package.json, run against your prod Node instance. (The argument is a grace period in milliseconds during which completed jobs will not be reaped, so that value cleans up jobs older than 24 hours.) We only did this after the step below, to purge all old jobs, but it could have been employed earlier to deflate the balloon and buy more time.
Fix the memory leak by providing the non-default configuration to Bull: defaultJobOptions: { removeOnComplete: true, removeOnFail: true }. This stops the ramp-up of Bull's Redis key count and memory consumption, and provides the least-astonishing behavior of not leaking memory under successful operation.

Is there a way to read a database link in Cosmos DB Java V4 API?

For example, reading "dbs/colls/document" instead of getting a container, then calling read on the container.
I've been having an issue where the first readItem on a container (after calling database.getContainer(x)) is extremely slow (like 1 second or longer) and was thinking using a database link could be faster.
I'm guessing a read after getting the container is slow because it doesn't make a service call until I call read.
Is there a way I can have this preloaded when reading in a database?
I have an application with a read(collectionName, key) method, and my approach was to use getContainer(collectionName) and then call read on that, but this method needs to be fast.
As discussed, the best practice is to keep an instance of your container alive between requests and call readItem on each request. This should resolve the primary issue.
As for the secondary concern, the "high latency every 50 requests or so": this is a known issue; however, it should only occur in the first minute or so of operation. If you can tolerate the initial slow requests, the solution is to wait for performance to stabilize. How long do you have to run your app before you stop seeing these high-latency requests?
FYI, if latency is a concern, run your client application in a geographically colocated Azure VM. Also a good rule of thumb is to allocate client CPU cores such that CPU utilization is not more than 40% or 50%.

Bursts of Redis errors

We've recently created a new Standard 1 GB Azure Redis cache specifically for distributed locking, separated from our main Redis cache. This was done to improve stability on our main Redis cache, a very long-term issue that this change seems to have significantly helped with.
On our new cache, we observe bursts of ~100 errors within the same few seconds every 1 - 3 days. The errors are either:
No connection is available to service this operation (StackExchange.Redis error)
Or:
Could not acquire distributed lock: Conflicted (RedLock.net error)
As they are errors from different packages, I suspect the Redis cache itself is the problem here. None of the stats during this time look out of the ordinary and the workload should fit comfortably in the Standard 1GB size.
I'm guessing this could be caused by the advertised low network performance. Is this likely the cause?
Your theory sounds plausible.
Checking for insufficient network bandwidth
Here is a handy table showing the maximum observed bandwidth for various pricing tiers. Take a look at the observed maximum bandwidth for your SKU, then head over to your Redis blade in the Azure Portal and choose Metrics. Set the aggregation to Max, and look at the sum of cache read and cache write. This is your total bandwidth consumed. Overlay the sum of these two against the time period when you're experiencing the errors, and see if the problem is network throughput. If that's the case, scale up.
Checking server load
Also on the Metrics tab, take a look at server load. This is the percentage that Redis is busy and is unable to process requests. If you hit 100%, Redis cannot respond to new requests and you will experience timeout issues. If that's the case, scale up.
Reusing ConnectionMultiplexer
You can also run out of connections to a Redis server if you're spinning up a new instance of StackExchange.Redis.ConnectionMultiplexer per request. The service limits for the number of connections available for your SKU are on the pricing page. You can check whether you're exceeding the maximum allowed connections for your SKU on the Metrics tab: select Max aggregation and choose Connected Clients as your metric.
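The usual pattern is to share a single ConnectionMultiplexer for the lifetime of the process rather than creating one per request. A minimal sketch, with a placeholder connection string:

using System;
using StackExchange.Redis;

public static class RedisConnection
{
    // The connection string below is a placeholder; pull yours from configuration.
    private static readonly Lazy<ConnectionMultiplexer> LazyConnection =
        new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect(
            "yourcache.redis.cache.windows.net:6380,password=<key>,ssl=True,abortConnect=False"));

    // All callers share this multiplexer, so the Connected Clients metric stays
    // flat regardless of request volume.
    public static ConnectionMultiplexer Connection => LazyConnection.Value;
}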
Thread Exhaustion
This doesn't sound like your error, but I'll include it for completeness in this Rogue's Gallery of Redis issues; it comes into play with Azure Web Apps. By default, the thread pool will start with 4 threads that can be immediately allocated to work. When you need more than four threads, additional threads are doled out at a rate of one per 500ms. So if you dump a ton of requests on a Web App in a short period of time, you can end up queuing work and eventually having requests dropped before they even get to Redis. To test whether this is a problem, go to Metrics for your Web App, choose Threads, and set the aggregation to Max. If you see a huge spike in a short period of time that corresponds with your trouble, you've found a culprit. Resolutions include making proper use of async/await. And when that gets you no further, use ThreadPool.SetMinThreads to raise the minimum, preferably to a value close to or above the max thread usage that you see in your bursts.
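As a rough illustration of that last point, here is a hedged sketch of raising the ThreadPool floor at application startup (for example in Application_Start); 200 is purely an illustrative value, tune it to the peak thread usage you observe during your bursts:

using System;
using System.Threading;

public static class ThreadPoolConfig
{
    public static void Configure()
    {
        // Read the current minimums so we only ever raise them.
        ThreadPool.GetMinThreads(out int workerMin, out int iocpMin);

        // 200 is an illustrative value, not a recommendation; pick something
        // close to or above the peak Busy counts seen during your bursts.
        ThreadPool.SetMinThreads(
            workerThreads: Math.Max(workerMin, 200),
            completionPortThreads: Math.Max(iocpMin, 200));
    }
}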
Rob has some great suggestions, but I did want to add information on troubleshooting traffic bursts and poor ThreadPool settings. Please see: Troubleshoot Azure Cache for Redis client-side issues.
Bursts of traffic combined with poor ThreadPool settings can result in delays in processing data already sent by the Redis Server but not yet consumed on the client side.
Monitor how your ThreadPool statistics change over time using an example ThreadPoolLogger. You can use TimeoutException messages from StackExchange.Redis like below to further investigate:
System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0,
IOCP: (Busy=6,Free=999,Min=2,Max=1000), WORKER: (Busy=7,Free=8184,Min=2,Max=8191)
Notice that in the IOCP section and the WORKER section you have a Busy value that is greater than the Min value. This difference means your ThreadPool settings need adjusting.
You can also see in: 64221. This value indicates that 64,221 bytes have been received at the client's kernel socket layer but haven't been read by the application. This difference typically means that your application (for example, StackExchange.Redis) isn't reading data from the network as quickly as the server is sending it.
You can configure your ThreadPool Settings to make sure that your thread pool scales up quickly under burst scenarios.
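If you want to watch those IOCP and WORKER numbers over time yourself, a small logger along the following lines can be started on a timer (this is a sketch using only the standard ThreadPool APIs, not the actual ThreadPoolLogger sample code):

using System;
using System.Threading;

public static class ThreadPoolStats
{
    private static Timer _timer;

    public static void Start(TimeSpan interval)
    {
        _timer = new Timer(_ =>
        {
            ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
            ThreadPool.GetMaxThreads(out int maxWorker, out int maxIocp);
            ThreadPool.GetAvailableThreads(out int freeWorker, out int freeIocp);

            // Busy = Max - Available; compare Busy against Min, as in the
            // WORKER and IOCP sections of the timeout message above.
            Console.WriteLine(
                $"WORKER: Busy={maxWorker - freeWorker}, Free={freeWorker}, Min={minWorker}, Max={maxWorker} | " +
                $"IOCP: Busy={maxIocp - freeIocp}, Free={freeIocp}, Min={minIocp}, Max={maxIocp}");
        }, null, TimeSpan.Zero, interval);
    }
}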
I hope you find this additional information helpful.

HTTP Error 503.2 - Service Unavailable. The serverRuntime#appConcurrentRequestLimit setting is being exceeded

I have an intranet Sitecore website set up on IIS 7 which randomly throws the following error message:
HTTP Error 503.2 - Service Unavailable
The serverRuntime#appConcurrentRequestLimit setting is being exceeded.
To fix this issue, I have made the following changes:
Increased the Queue Length of application pool myrjetAppPool from 1000 to 65535.
Modified Machine.config to increase the requestQueueLimit property of the processModel element to 100000.
Increased appConcurrentRequestLimit to 100000 by running
C:\Windows\System32\inetsrv\appcmd.exe set config /section:serverRuntime /appConcurrentRequestLimit:100000
But I'm still getting the same error. Any help is greatly appreciated.
You might check to see where all your threads are going. We had occurrences where threads for Media Library assets were hanging and blocking up the queue.
In IIS Manager, select the server node from the tree, then the "Worker Processes" feature icon, then right-click the application pool of interest and select "View current requests". You might find something is getting stuck. I sometimes hit F5 on this screen a few dozen times in very quick succession to see the rate the requests are going through (of course Performance Monitor is better for viewing metrics but it won't tell you what URLs are being processed).
Investigate references in the linked URL to 'MaxConcurrentRequestsPerCPU', which you may need to set by creating a new registry key, depending on your OS and framework.
https://learn.microsoft.com/en-us/archive/blogs/tmarq/asp-net-thread-usage-on-iis-7-5-iis-7-0-and-iis-6-0
As already commented, check the actual concurrent request count using performance counters to determine which limit you're hitting, i.e. whether it's a limit of 5000 or maybe 12 (per CPU).
Edit: I realise this may look like I'm talking about a different setting entirely, but I believe there is overlap here.
We got this problem after the installation of an IIS plugin. After a long investigation we saw that the config file C:\Windows\System32\inetsrv\config\applicationHost.config had an extra location tag for the site with the problem. After removing the extra entry and an iisreset, the site/server worked normally again. So something must have gone wrong during the installation.

Azure Table Storage RetryPolicy questions

A couple of questions on using RetryPolicy with Table Storage:
Is it best practice to use RetryPolicy whenever you can, i.e. use ctx.SaveChangesWithRetries() instead of ctx.SaveChanges() wherever possible?
When you do use RetryPolicy, for example,
ctx.RetryPolicy = RetryPolicies.Retry(5, TimeSpan.FromSeconds(1));
What values do people normally use for the retryCount and the TimeSpan? I see 5 retries with a 1-second TimeSpan as a popular choice, but would 5 retries at 1 second each be too long?
Thank you,
Ray.
I think this is highly dependent on your application and requirements. Timeout errors against ATS happen rarely enough that a retry policy will not hurt to have in place and would rarely be exercised anyway. But if something fishy is happening, it may save you from having to debug weird errors.
Now, I would suggest that in the beginning you do not enable the RetryPolicy at all and have tracing instead, so that you can see any issues with persistence to ATS. Once you've stabilized, putting a RetryPolicy in place may be a good idea to work around runtime glitches on the ATS side. Just make sure you're not masking your own problems with a RetryPolicy.
If your client is user-facing, like a web page, you would probably want to use a linear retry with short waits (milliseconds) between each attempt; if your client is a non-user-facing backend service, you would most likely want to use exponential retries so as not to overload the table storage service when it is already returning 5xx errors due to high load.
With the latest Azure Storage client SDK, if you do not define any retry policy in your table requests via TableRequestOptions, the default retry policy is used, which is the exponential retry. The SDK makes 3 retries in total for errors it deems retriable, and this takes roughly 20 seconds in total if all retries fail.
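To make those two choices concrete, here is a minimal sketch using the Microsoft.WindowsAzure.Storage client's built-in policies; the back-off values are illustrative, not recommendations:

using System;
using Microsoft.WindowsAzure.Storage.RetryPolicies;
using Microsoft.WindowsAzure.Storage.Table;

static class RetryOptions
{
    // Non-user-facing/backend work: exponential back-off so a service already
    // returning 5xx errors under load isn't hammered further.
    public static readonly TableRequestOptions Backend = new TableRequestOptions
    {
        RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(2), 3)
    };

    // User-facing work: short linear retries so a page never blocks for long.
    public static readonly TableRequestOptions UserFacing = new TableRequestOptions
    {
        RetryPolicy = new LinearRetry(TimeSpan.FromMilliseconds(500), 3)
    };
}

Pass the appropriate options object as the second argument to table.Execute(operation, options) or the equivalent async call.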
