Multiple Redis connection exception (No Connection available to service) during App service swap slots - azure

I have a web app in production (.Net Core), I deployed it in Azure as App service which is in premium tier p2v2 4 instances. I am also using Azure Redis cache (Premium Tier) which my app is using it as cache. I have two app services (primary and secondary) configured Traffic Manager for load balancing.
Whenever I am trying to deploy my app into production using swap slot feature, Both the app service response time goes up to 20 secs and it is down for around 1 minute and my CPU utilization goes close to 90%. And I am seeing multiple exceptions from Redis client (For ex: No connection is available to service this operation: EVAL; It was not possible to connect to the Redis server(s). To create a disconnected multiplexer, disable AbortOnConnectFail. ConnectTimeout; IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=452,Free=32315,Min=8,Max=32767), Local-CPU: n/a) and my HttpQueue length goes above 10
I can infer from the above image is that worker thread has been overloaded, Donno why it is happening
I am using .Net StackExchange Redis client version 2.0.601, recently did an update from version 1.2.4
Note:
I didn't use slot specific app setting.
It keeps happening for every swap slots during deployment
I didn't find any app service restart in the logs.
I want to know any of you guys are facing this issue, if yes please suggest me where is the problem or how to debug and it would also better if you can share any of things you tried.
I tried to find any error logs in AZure Redis cache server but couldn't find any.
I am trying to figure out what is causing this issue, how to debug this kind of issues with azure, and whether anybody encountered the same and have implemented any resolution for the same?
Please let me know if you need any additional details.

Here is something which might be worth trying :
Cache metrics are reported using several reporting intervals, including Past hour, Today, Past week, and Custom. The Metric blade for each metrics chart displays the average, minimum, and maximum values for each metric in the chart, and some metrics display a total for the reporting interval.
Each metric includes two versions. One metric measures performance for the entire cache, and for caches that use clustering, a second version of the metric that includes (Shard 0-9) in the name measures performance for a single shard in a cache. For example if a cache has 4 shards, Cache Hits is the total amount of hits for the entire cache, and Cache Hits (Shard 3) is just the hits for that shard of the cache.
Try looking for the Error metric while monitoring.
https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-how-to-monitor#available-metrics-and-reporting-intervals
Additionally , we need to retry for TimeoutException, RedisConnectionException or SocketException even which ensure it will try to connect in case of any exception, you can read about all the best practises arouns Redis Cache usage in below doc:
https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices
https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices#when-is-it-safe-to-retry
Hope it helps.

Related

Azure AppService Performance Issue

We have ASP.Net Core 2.1 Web API hosted in AppService (S1) that talks to Azure SQL DB (S1-20DTUs). Both are in same region. During load testing we found that some API instances are taking too much time to return the result.
We tried to troubleshoot the performance issue and below are our observations.
API responds within 0.5 secs most of the time.
API methods are all async methods.
Sometimes it takes around 50 secs to over a minute.
CPU & Memory utilization are below 60%
Database has 20 DTU capacity, out of which 6 DTUs are used during load testing.
In the below example snapshot from Application Insights, we see total duration of the request was 27.4 secs. But the database dependency duration was just 97ms. There is no activity till the database was called. Please refer below example.
Can someone please help me to understand what was happening in this 27 secs of wait time. What could be the reason for this?
I would recommend checking the Application Map on Application Insights resource as shown below to double check the dependencies.
Verify the CPU and Memory metrics by going to the "Diagnose and solve problems" link on App service as shown below and run the Availability and Performance report to find out if there were any issues during your load testing.
Use Async methods on your API to maximize the CPU usage. It may be that the worker process threads are hitting the limits and your app is the bottleneck. You should get some insights when you run the report mentioned in point 2 above.
The S1 tier will support no more than 900 concurrent session. If you request per second (RPS rate) during the load test is very high you may face issues.
Also S3 and above are recommended for intensive workloads. Checking if all the connections are closed properly also helps
You can find details about different pricing tiers and their capabilities in the below link
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-dtu-resource-limits-single-databases

Azure App Service "Local Written Bytes"

I have an app service running that has 8 instances running in the service plan.
The app is written in asp dotnet core, it's an older version than is currently available.
Occasionally I have an issue where the servers start returning a high number of 5xx errors after a period of sustained load.
It appears that only one instance is having an issue - which is causing the failed request rate to climb.
I've noticed that there is a corresponding increase in the "locally written bytes" on the instance that is having problems - I am not writing any data locally so I am confused as to what this metric is actually measuring. In addition the number of open connections goes high and then stays high - rebooting the problematic instance doesn't seem to achieve anything.
The only thing I suspect is that we are copying data from a user's request straight into Azure Blob Store using the UploadFromStreamAsync from the HttpRequest.Body - with the data coming from a mobile phone app.
Microsoft support suggested we swapped to using local cache as an option to reduce issues with storage, however this has not resolved the issue.
Can anyone tell me what is the "locally written bytes" actually measuring? There is little documentation on this metric that I can find in google.

AspNet Core on Azure App Service peaks to 100% CPU and app gateway load balancer not working

We have few of our internal business services hosted on an isolated ASE in Azure.
These services run on a medium app service plan with 2 instances.
This environment has been in production and use for little more than a month now and has been performing fairly well apart from the occasional sudden CPU spike to 100% in one of the instance which bring down the services.
We don't have auto scaling setup but have 2 instances running all the time.
The services are `aspnetcore` webapi and the runtime is dotnet core 2.0.
Every time I have come across this issue in the last couple of weeks I have not been lucky enough to login to kudu and get a process dump to investigate further. The business are literally behind my back to get the service up and running as quick as possible and the easiest route is to restart one of the faulting service or swap slots with a pre-prod environment.
Access to the ASE are also restricted from our network and makes it all the more difficult for me to switch to a WiFi and then go through jump boxes to login to kudu, I had asked our Ops engineer to get me the dump when this issue is reported but he has not been listening to me either, mostly for the same reasons as me not able to do it myself.
All exceptions I can see in Application Insights are due to the service themselves going down and there are no exceptions there which can cause the issue in the first place(at least I've not found it yet)
This lead me to take few guess and look for metrics, the only thing raising my
suspicions is garbage collection. I don't see any sudden spike in GC graphs as well, each time the service is re started the graph is fairly a straight line(24 hours) but increases day by day and ends up like below.
But the working memory is a sinusoid graph letting me think there are no memory leaks. But is the above graph over 3 days normal?
The drop is when I restart the service. But all services have a similar trajectory even the one that has not gone down.
I am not sure if this is a problem with an individual service or an environment configuration I have overlooked.
The API endpoints are simple CRUD operations and publish events to a service bus topic after each operation. There is a static `HttpClient` instance used to fetch data from another service. Apart from that there are no unmanaged resources and the DB connections are always wrapped in `using` statements.
I understand I would need a process dump to investigate further but my biggest concern is why is the application gateway(load balancer) not sending traffic to the healthy instance. Because of the gateway going unhealthy cloudflare returns a `502` response to clients using the api.
MS support haven't been able to help and have not answered if we have our load balancers working correctly.
The average number of requests is about 50-60 per minute.
CPU runs at less than 10% apart this sudden surge.
Thanks
It could be that the backend is pegged at 100% CPU and is unable to respond to Application Gateway health probes. When such an issue occurs, were you able to verify, using Backend health logs, the health state of your backends? If both backend instances were unhealthy, it would explain the 502s. If one of them was healthy and responding to probes, then new requests sent to Application Gateway would indeed flow to the healthy instance. If you suspect that is not the case then please reply back with subscription id, gateway name and approximate time window of incident for us to take a look.

Sudden dropoff in Azure queue performance

Short version: What reasons could there be for a sudden, dramatic, and seemingly permanent increase in the rate of timing-out Azure queue requests?
It's going to be difficult to provide all of the details that could possibly be relevant here, but here's a start:
This is an Azure application (SDK v2.0) with a WCF service placing work requests on a queue (roughly 100k calls a day) and a couple of worker roles which process the queue. We've got New Relic monitoring with the latest .NET agent (3.3.38).
We've run into an issue in our latest release, deployed a few days ago -- after it ran normally for about 24 hours, all of a sudden we started seeing a greatly increased rate of timeouts when our worker roles fetch messages from the queue, along with a catastrophic drop in throughput (our application can now barely keep up with its own queue using 40 workers, whereas it usually gets by with just 2!) Ever since the timeouts started, they show no signs of letting up, keeping up at the same rate since it started happening.
A couple images from New Relic to illustrate:
While this isn't nearly enough information to provide a good answer, I'm just trying to figure out where I might start looking. I've got support tickets open with New Relic and Microsoft, but we're trying to investigate on our own as well. Could this be throttling? Some kind of resource exhaustion in my queue processor worker role? We don't see increased load on the WCF service, and we haven't changed Azure client libraries or changed much of anything in the code that processes the queue.
I suggest you enable analytics on your storage account to determine if the bottleneck is server side or client side/network related. Specifically, you can look at Storage Analytics Metrics table - AverageE2ELatency and AverageServerLatency properties to check if the issue is server side or client side.
You can learn more about Azure storage analytics from links below
Overview:
http://msdn.microsoft.com/en-us/library/hh343270.aspx
How to enable in portal:
http://azure.microsoft.com/en-us/documentation/articles/storage-monitor-storage-account/
Metrics table Schema:
http://msdn.microsoft.com/en-us/library/hh343264.aspx
Blog post:
http://blogs.msdn.com/b/windowsazurestorage/archive/2011/08/03/windows-azure-storage-analytics.aspx

Connecting to Azure Cache Service takes about 3.3 seconds

In our still-in-development project we have noticed sudden delays when accessing our ASP.NET Web API services. Using the awesome Mini Profiler we nailed it that these delays are caused when connections to the Azure Data Cache (Preview) services are dropped and they have to be reestablished. This process takes about 3.3 seconds. After reconnecting, getting an object from the cache takes 1.4 ms.
When I increased maxConnectionsToServer from 1 to 20, I noticed another thing. If I don't make requests to the Web API for 1 or 2 minutes (that's usually when the connections are dropped) and then start making calls, next 20 requests are delayed for 3.3 seconds, which is how connection pooling works I guess (round-tripping the connections from the pool).
Both the Web API and Caching service are hosted in the East US region, we have disabled local cache, SSL is disabled, auto discover is enabled.
So, I'm wondering if something is wrong with our configuration or is this a thing because Azure Cache is still in preview?
Any information will be valued.
Thanks!
It sounds like your shared cache is being offloaded due to inactivity. One way to test this would be to add an In-Role Cache to an existing service (if available) and swap your cache usage to this new cache. In-Role cache is described here.
Once the cache is moved off of a shared offering, wait the requisite 1-2 minutes for idle time out and retry the connection, the delay should not be present.
Assuming you want to stick with the shared cache option after isolating the problem, the only current workaround that I am aware of is running a background task that will periodically ping the cache to keep it alive.
If you are running a full Web role you can launch a background task on application start up.
If you deploying via Mobile Services, then you can run the "ping" via Scheduled Jobs. The only issue you may run into here is that the minimum time for a scheduled job is 1 minute, which may not be aggressive enough to keep your cache alive 100% of the time.
Nothing that I see points to you doing anything wrong per se. It may be the Azure is genuinely having problems getting the cache connections up and running quickly. According to several best practices documents and MSDN posts, you want to increase your number of connections to caches to allow for a failover to an active connection, which you've effectively done with your configuration change.
Try making sure that your cache accessor is a static object (another MSDN recommendation) and this may be a long shot but consider using the Sliding Window option for object expiration and see if that not only tells the countdown for the object store to reset, but also prompts the cache service to reset the connection.

Resources