ASP.NET Core on Azure App Service peaks at 100% CPU and the Application Gateway load balancer is not routing around it

We have a few of our internal business services hosted in an isolated ASE (App Service Environment) in Azure.
These services run on a medium app service plan with 2 instances.
This environment has been in production for a little more than a month now and has been performing fairly well, apart from an occasional sudden CPU spike to 100% on one of the instances, which brings down the services.
We don't have auto scaling set up, but we do have 2 instances running all the time.
The services are `aspnetcore` Web APIs and the runtime is .NET Core 2.0.
Every time I have come across this issue in the last couple of weeks, I have not been able to log in to Kudu and get a process dump to investigate further. The business is breathing down my neck to get the service up and running as quickly as possible, and the easiest route is to restart the faulting service or swap slots with a pre-prod environment.
Access to the ASE is also restricted from our network, which makes it all the more difficult: I have to switch to Wi-Fi and then go through jump boxes to log in to Kudu. I have asked our ops engineer to get me the dump when this issue is reported, but he has not managed to either, mostly for the same reasons I cannot do it myself.
All the exceptions I can see in Application Insights are caused by the services themselves going down; there are no exceptions there that could cause the issue in the first place (at least I have not found any yet).
This led me to take a few guesses and look at the metrics; the only thing raising my suspicion is garbage collection. I don't see any sudden spike in the GC graphs either: each time the service is restarted the graph is a fairly straight line over 24 hours, but it climbs day by day and ends up like the graph below.
The working set, on the other hand, is a sinusoidal graph, which makes me think there is no memory leak. But is the GC graph above, taken over 3 days, normal?
The drop is when I restart the service, but all services have a similar trajectory, even the one that has never gone down.
I am not sure if this is a problem with an individual service or an environment configuration I have overlooked.
The API endpoints are simple CRUD operations and publish events to a Service Bus topic after each operation. There is a static `HttpClient` instance used to fetch data from another service. Apart from that, there are no unmanaged resources, and the DB connections are always wrapped in `using` statements.
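For illustration only, here is a minimal sketch of that pattern; the controller, the `OrderDto` type, the connection string name, and the use of `Microsoft.Azure.ServiceBus` and `SqlConnection` are assumptions, not code from the actual services:

```csharp
using System.Data.SqlClient;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.ServiceBus;          // TopicClient, Message (assumed client library)
using Microsoft.Extensions.Configuration;
using Newtonsoft.Json;

public class OrderDto { public int Id { get; set; } public int CustomerId { get; set; } }

[Route("api/[controller]")]
public class OrdersController : Controller
{
    // One shared HttpClient for the lifetime of the process.
    private static readonly HttpClient Http = new HttpClient();

    private readonly TopicClient _topic;
    private readonly string _connectionString;

    public OrdersController(TopicClient topic, IConfiguration config)
    {
        _topic = topic;
        _connectionString = config.GetConnectionString("OrdersDb");
    }

    [HttpPost]
    public async Task<IActionResult> Create([FromBody] OrderDto order)
    {
        // Fetch data from another service via the shared HttpClient.
        var customer = await Http.GetStringAsync($"https://other-service/api/customers/{order.CustomerId}");

        // DB work scoped to the request and disposed deterministically.
        using (var connection = new SqlConnection(_connectionString))
        {
            await connection.OpenAsync();
            // ... insert the order ...
        }

        // Publish an event to the Service Bus topic after the operation.
        var body = Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(order));
        await _topic.SendAsync(new Message(body));

        return Ok();
    }
}
```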
I understand I would need a process dump to investigate further, but my biggest concern is why the Application Gateway (load balancer) is not sending traffic to the healthy instance. Because the gateway goes unhealthy, Cloudflare returns a `502` response to clients using the API.
MS support hasn't been able to help and hasn't answered whether our load balancers are working correctly.
The average number of requests is about 50-60 per minute.
CPU runs at less than 10% apart from this sudden surge.
Thanks

It could be that the backend is pegged at 100% CPU and is unable to respond to Application Gateway health probes. When such an issue occurs, were you able to verify the health state of your backends using the Backend Health logs? If both backend instances were unhealthy, that would explain the 502s. If one of them was healthy and responding to probes, then new requests sent to Application Gateway would indeed flow to the healthy instance. If you suspect that is not the case, please reply with the subscription ID, gateway name, and approximate time window of the incident for us to take a look.
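As a side note (not part of the original answer), it can help to point the Application Gateway probe at a dedicated, lightweight health endpoint so the probe does not depend on the heavier MVC pipeline. A minimal sketch, assuming a `/health` path registered in the app's `Startup`:

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddMvc();
    }

    public void Configure(IApplicationBuilder app)
    {
        // Answer the probe before the rest of the pipeline runs.
        // Note: if the whole process is pegged at 100% CPU, even this can time out,
        // which is exactly the situation described in the answer above.
        app.Map("/health", healthApp =>
        {
            healthApp.Run(async context =>
            {
                context.Response.StatusCode = StatusCodes.Status200OK;
                await context.Response.WriteAsync("OK");
            });
        });

        app.UseMvc();
    }
}
```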

Related

Understand why Azure App Service has delay to start processing requests (AppInsight Tracing)

We have a question about Azure: in some cases we see dead time when one of our App Services receives a request, or when Service Bus triggers, for example, an Azure Function.
Here is an example (Application Insights trace screenshot):
[AppInsights example image]
We execute a request at about 5 seconds, but Azure takes more than 30 seconds to start executing it. We have made a lot of optimizations in our apps, but we have no visibility into this delay.
Has anyone faced the same issue and found a solution? We believe it is a performance issue in the workers, but it also happens when the workers are under low memory and CPU load, so we don't know how to scale the resource out automatically if it shows no load.
This also happens in our Azure Functions, but there we believe it is an issue between Service Bus and the Functions container. In those cases we found the function app has higher CPU consumption, but we don't know why, because in the local environment we process a lot of messages with multithreading without any problem.

App Services on Azure seem to be very slow

I am trying to track down when our frontend started to run this slowly. Recently I created new app services within the same service plan.
So now I have six apps (2 frontend, 4 backend) running under the same App Service plan on the Basic pricing tier. Also, we use Kudu for deployments.
Could that be the reason? Or how can I find the reason?
This is the overview of that service plan:
I would appreciate any ideas and suggestions.
#user122222 This is a high CPU issue and not a slow request issue as others have pointed out.
An immediate action you can take is to scale up. If you are using a B1 instance in the Basic tier, try scaling up to a B3, which will provide you with more CPU cores and RAM, and see if that gives you relief. If so, then you likely need to remain at this instance level. At this point it would also be worthwhile to analyze your number of requests: you should scale up when you are running many sites or resource-intensive sites, and scale out when you are receiving a high number of requests.
My money is on an issue in your code that is causing a deadlock or something similar. Your CPU usage graph is stuck at 100% over many hours; even an overloaded App Service plan will see a few dips over the course of a few hours.
To troubleshoot high CPU usage, start with the Diagnose and solve problems blade in your App Service plan. This is the same troubleshooting tool a support engineer would use in a paid technical support case. Use it to troubleshoot high CPU, not slow requests, since based on your screenshot the CPU appears to be the culprit behind the slow requests.
This can tell you which app in the plan is causing the issue, and sometimes even which process in that app. Beyond this, I'd suggest creating and analyzing a memory dump of the problematic web app. More steps on how to do that here.
Please try to restart the worker instance.
https://learn.microsoft.com/en-us/rest/api/appservice/app-service-plans/reboot-worker#code-try-0
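If you prefer to automate that restart, here is a rough sketch of calling the Reboot Worker REST operation linked above from C#; the placeholder IDs, the worker instance name, the `api-version` value, and the use of `DefaultAzureCredential` from Azure.Identity are assumptions to adapt to your environment:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Identity;

class RebootWorker
{
    static async Task Main()
    {
        // Acquire an ARM token for the management endpoint.
        var credential = new DefaultAzureCredential();
        var token = await credential.GetTokenAsync(
            new TokenRequestContext(new[] { "https://management.azure.com/.default" }));

        // Resource path per the Reboot Worker REST reference; all segments are placeholders.
        var url = "https://management.azure.com" +
                  "/subscriptions/<subscription-id>" +
                  "/resourceGroups/<resource-group>" +
                  "/providers/Microsoft.Web/serverfarms/<app-service-plan>" +
                  "/workers/<worker-instance-name>/reboot" +
                  "?api-version=2022-03-01";

        using (var http = new HttpClient())
        {
            http.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Bearer", token.Token);

            // An empty POST triggers the reboot of that worker instance.
            var response = await http.PostAsync(url, content: null);
            Console.WriteLine($"Reboot request returned {(int)response.StatusCode}");
        }
    }
}
```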

Diagnosing ASP.NET Azure WebApp issue

For about a month now, one of our web applications hosted as a WebApp on Azure has been having some kind of problem, and I cannot find the root cause.
This WebApp is hosted on Azure on a 2 x B2 App Service Plan. On the same App Service Plan there is another WebApp that is currently working without any issue.
This WebApp is an ASP.NET Web API application and exposes a set of REST APIs.
Effect: for no apparent reason (at least as far as I can tell so far), the ThreadCount metric starts to climb, sometimes very slowly, sometimes within a few minutes. When that happens, no requests seem to be served and the service is dead.
Solution: a simple restart of the application (which means a restart of the AppPool) causes an immediate, obvious drop in ThreadCount and everything starts working as usual.
Other observations: there is no "periodicity" to this event. It has happened in the evening, in the morning, and in the afternoon. Evening seems to be a preferred timeframe, but I wouldn't say there is any correlation.
What I measured through Azure Monitor metrics:
- Request Count oscillates normally; there is no peak that coincides with the increase in ThreadCount.
- CPU and Memory seem to be normal, nothing strange.
- Response time behaves like the other metrics.
- Connections (which should be related to sockets) oscillates normally, so I'd rule out anything related to DB connections.
What can I do to understand what's going on?
After a lot of research, this turned out to be related to incorrect usage of dependency injection (with Ninject) in an application that wasn't designed to use it.
To diagnose it, I discovered a very helpful feature in Azure. You can reach it by opening the app that is having the problem, clicking "Diagnose and solve problems", then "Diagnostic tools", and then selecting "Collect .NET Profiler Report". In that panel, after configuring the storage for the diagnostic files, you can select "Add thread report".
In those reports you can easily see what's going wrong.
Hope this helps.

Multiple Redis connection exceptions (No connection available to service) during App Service slot swap

I have a web app in production (.NET Core), deployed to Azure as an App Service on the Premium tier (P2v2, 4 instances). I am also using Azure Redis Cache (Premium tier) as the app's cache. I have two app services (primary and secondary) with Traffic Manager configured for load balancing.
Whenever I deploy my app to production using the swap slot feature, both app services' response times go up to 20 seconds, the app is down for around 1 minute, and CPU utilization goes close to 90%. I am also seeing multiple exceptions from the Redis client (for example: `No connection is available to service this operation: EVAL; It was not possible to connect to the Redis server(s). To create a disconnected multiplexer, disable AbortOnConnectFail. ConnectTimeout; IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=452,Free=32315,Min=8,Max=32767), Local-CPU: n/a`), and my HTTP queue length goes above 10.
What I can infer from the above is that the worker thread pool is overloaded; I don't know why it is happening.
I am using the .NET StackExchange.Redis client version 2.0.601; I recently updated from version 1.2.4.
Note:
I didn't use any slot-specific app settings.
It keeps happening on every slot swap during deployment.
I didn't find any app service restart in the logs.
I want to know whether any of you are facing this issue; if so, please suggest where the problem might be or how to debug it, and it would also be great if you could share anything you have tried.
I tried to find error logs on the Azure Redis Cache server but couldn't find any.
I am trying to figure out what is causing this issue, how to debug this kind of issue in Azure, and whether anybody has encountered the same and implemented a resolution.
Please let me know if you need any additional details.
Here is something that might be worth trying:
Cache metrics are reported using several reporting intervals, including Past hour, Today, Past week, and Custom. The Metric blade for each metrics chart displays the average, minimum, and maximum values for each metric in the chart, and some metrics display a total for the reporting interval.
Each metric includes two versions. One metric measures performance for the entire cache, and for caches that use clustering, a second version of the metric that includes (Shard 0-9) in the name measures performance for a single shard in a cache. For example if a cache has 4 shards, Cache Hits is the total amount of hits for the entire cache, and Cache Hits (Shard 3) is just the hits for that shard of the cache.
Try looking for the Error metric while monitoring.
https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-how-to-monitor#available-metrics-and-reporting-intervals
Additionally, you should retry on TimeoutException, RedisConnectionException, or SocketException, which ensures the client tries to connect again when such an exception occurs (a minimal sketch follows the links below). You can read about all the best practices around Redis Cache usage in the docs below:
https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices
https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices#when-is-it-safe-to-retry
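A minimal sketch of that retry advice with StackExchange.Redis; the cache host name, retry count, and delays are illustrative assumptions rather than recommended values:

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class RedisAccess
{
    // Single lazily created multiplexer shared by the whole app.
    private static readonly Lazy<ConnectionMultiplexer> Connection =
        new Lazy<ConnectionMultiplexer>(() =>
        {
            var options = ConfigurationOptions.Parse(
                "<your-cache>.redis.cache.windows.net:6380,password=<access-key>,ssl=True");
            options.AbortOnConnectFail = false; // keep retrying in the background instead of failing fast
            options.ConnectRetry = 3;
            options.ConnectTimeout = 5000;      // milliseconds
            return ConnectionMultiplexer.Connect(options);
        });

    public static IDatabase Db => Connection.Value.GetDatabase();

    // Retry a Redis operation on the transient exception types called out above.
    public static async Task<T> WithRetryAsync<T>(Func<IDatabase, Task<T>> operation, int maxAttempts = 3)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation(Db);
            }
            catch (Exception ex) when (
                (ex is TimeoutException || ex is RedisConnectionException || ex is SocketException)
                && attempt < maxAttempts)
            {
                // Simple linear back-off before the next attempt.
                await Task.Delay(TimeSpan.FromMilliseconds(200 * attempt));
            }
        }
    }
}

// Example usage:
// var value = await RedisAccess.WithRetryAsync(db => db.StringGetAsync("some-key"));
```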
Hope it helps.

Sudden dropoff in Azure queue performance

Short version: What reasons could there be for a sudden, dramatic, and seemingly permanent increase in the rate of timing-out Azure queue requests?
It's going to be difficult to provide all of the details that could possibly be relevant here, but here's a start:
This is an Azure application (SDK v2.0) with a WCF service placing work requests on a queue (roughly 100k calls a day) and a couple of worker roles which process the queue. We've got New Relic monitoring with the latest .NET agent (3.3.38).
We've run into an issue with our latest release, deployed a few days ago. After it ran normally for about 24 hours, we suddenly started seeing a greatly increased rate of timeouts when our worker roles fetch messages from the queue, along with a catastrophic drop in throughput (our application can now barely keep up with its own queue using 40 workers, whereas it usually gets by with just 2!). Ever since the timeouts started, they have shown no signs of letting up, continuing at the same rate.
A couple of images from New Relic to illustrate:
While this isn't nearly enough information to provide a good answer, I'm just trying to figure out where I might start looking. I've got support tickets open with New Relic and Microsoft, but we're trying to investigate on our own as well. Could this be throttling? Some kind of resource exhaustion in my queue processor worker role? We don't see increased load on the WCF service, and we haven't changed Azure client libraries or changed much of anything in the code that processes the queue.
I suggest you enable Storage Analytics on your storage account to determine whether the bottleneck is server side or client side/network related. Specifically, look at the AverageE2ELatency and AverageServerLatency properties in the Storage Analytics metrics table: high end-to-end latency with low server latency points at the client or the network, while high server latency points at the storage service itself. (A minimal query sketch follows the links below.)
You can learn more about Azure Storage Analytics from the links below.
Overview:
http://msdn.microsoft.com/en-us/library/hh343270.aspx
How to enable in portal:
http://azure.microsoft.com/en-us/documentation/articles/storage-monitor-storage-account/
Metrics table Schema:
http://msdn.microsoft.com/en-us/library/hh343264.aspx
Blog post:
http://blogs.msdn.com/b/windowsazurestorage/archive/2011/08/03/windows-azure-storage-analytics.aspx
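A rough sketch of reading those two latency properties with the classic storage client library of that era; the `$MetricsHourPrimaryTransactionsQueue` table name assumes the standard analytics layout for the queue service, and the connection string is a placeholder:

```csharp
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

class QueueLatencyCheck
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<storage-connection-string>");
        var tableClient = account.CreateCloudTableClient();

        // Hourly transaction metrics written by Storage Analytics for the queue service.
        var metricsTable = tableClient.GetTableReference("$MetricsHourPrimaryTransactionsQueue");

        var query = new TableQuery<DynamicTableEntity>().Take(100);
        foreach (var row in metricsTable.ExecuteQuery(query))
        {
            EntityProperty e2e, server;
            if (row.Properties.TryGetValue("AverageE2ELatency", out e2e) &&
                row.Properties.TryGetValue("AverageServerLatency", out server))
            {
                // A large gap between the two usually means time is spent on the
                // client or the network rather than inside the storage service.
                Console.WriteLine("{0} {1}: E2E={2} Server={3}",
                    row.PartitionKey, row.RowKey, e2e.DoubleValue, server.DoubleValue);
            }
        }
    }
}
```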
