What would cause high KUDU usage (and eventual 502 errors) on an Azure App Service Plan? - azure

We have a number of API apps and WebApps on an Azure App Service P2v2 instance. We've been experiencing an amount of platform instability: the App Service becomes unhealthy and we get a rash of 502 errors across various of the Apps (different ones each time), attributable to very high CPU and Memory usage on the app service. We've tried scaling all the way up to P3v2, but whatever the issue is seems eventually to consume all resources available.
Whenever we've been able to trace a culprit among the apps, it has turned dout not to be the app itself but the Kudu service related to it.
A sample error message is High physical memory usage detected on multiple occasions. The kudu process for the app [sitename]'pe-services-color' is the most common cause of high memory usage. The most common cause of high memory usage for the kudu process is web jobs. where the actual app whose Kudu service is named changes quite frequently.
What could be causing the Kudu services to consume so much CPU/Memory, and what can we do to stabilise this app service?
Is it simply that we have too many apps running on one plan? This seems unlikely since all these apps ran previously on a single classic cloud service instance, but if so, what are the limits for apps and slots on a single plan?
(I have seen this question but the answer doesn't help)
Update
From Azure support, these are apparently the limits on Small - Medium - Large non-shared app services:
Worker Size Max sites
Small 5 Medium 10 Large 20
with 'sites' comprising app services/api apps and their slots.
They seem ridiculously low, and make the larger App Service units highly uneconomic. Can anyone confirm these numbers?
(Incidentally, we found that turning off Always On across the board fixed the issue - it was only causing a problem on empty sites though - we haven't had a chance yet to see if performance is good with all the sites filled.)

High CPU and memory utilization would be mostly caused by your program/code itself. If there are lot of CPU intensive tasks and you applied lot of parallel programming that spawn lot of new threads can contribute to high cpu and memory utilization. So review your code and see such instances. When number of parallel threads increased cpu utilization goes high and it starts scaling up frequently that adds up your cost also sometime thread loss and unexpected results. As Azure resources costs are high you need to plan your performance accordingly.
You can monitor this using the Metrics option of the app service plan in the blade .

Related

What to expect if Azure App Service CPU maxes out momentarily?

I currently have an Azure App Service API that usually runs extremely low average and max CPU (<10% utilization). Every now and again, the CPU will spike due to a temporary spike in client requests. These spikes seemingly only last for a split second, but I’m wondering if this is cause for concern. What is the result of an Azure App Service CPU maxing out temporarily (either for a split second or for several seconds). Will this cause the app to crash, or will it just buffer requests until intensive tasks complete? It is worth noting that despite the spike in CPU, memory utilization remains low. Thanks in advance for input.
It looks like the CPU load is caused entirely from a large number of intensive requests all coming in at the same time.
When the CPU utilization of your Azure App Service API spikes temporarily, it could cause the app to slow down or become unresponsive, depending on the level and duration of the spike.
The system will try to buffer requests during this time, but if the spike is too high, some requests may be dropped or time out. This can result in a poor user experience or errors, especially if the spike is sustained for an extended period of time.
To mitigate this issue, you can take a few steps.
You can optimize the code of your API to reduce the CPU utilization of each request. This could involve reducing the number of operations performed for each request, using caching or other performance-enhancing techniques, or optimizing the algorithm used for processing requests.
You can also consider scaling up the resources of your App Service, such as increasing the number of CPU cores or adding more memory, to handle the increased load during spikes. Another option is to use horizontal scaling by adding more instances of your API to distribute the load across multiple servers, which can help reduce the impact of spikes.
And you can monitor your App Service to detect and respond to spikes in CPU utilization in real-time.
For example, you can use Azure Monitor to set up alerts that trigger when CPU utilization exceeds a certain threshold, and automatically scale your App Service in response.
By checking the Azure monitor logs and Web App Diagnostics, can find the reasons behind CPU Utilization.
Diagnose and solve problems:
CPU Usage:
CPU Drill Down:
References taken from
Application monitoring for Azure App Service
CPU Diagnostics, Identify and Diagnose High CPU issues

Understand why Azure App Service has delay to start processing requests (AppInsight Tracing)

We have a doubt about Azure because in some cases we have some dead times when we received requests in one of our AppServices or when a Service Bus triggers, for example, an Azure Functions.
If you see this image, you will see an example:
AppInsight Example Image
We execute a Request and at 5 seconds, but Azure delays more than 30 seconds to start the execution. We made a lot of optimizations in our apps, but we have no visibility about this delay.
Did someone face the same issue and found some solution? We believe it is a performance issue in the Workers, but, this happens also when the Workers are with a low load of memory and CPU. So we don't know how to scale horizontally automatically the resource if it is without load.
This happens also in our AZF, but we believe it's an issue between the Service Bus and the container of the AZF. In these cases we found the AZF has a higher consumption of CPU, but we don't why, because in the local environment we process a lot of messages with multithreading without any problem.

What is the optimal architecture design on Azure for an infrequently used backend that needs a robust configuration?

I'm trying to find the optimal cloud architecture to host a software on Microsoft Azure.
The scenario is the following:
A (containerised) REST API is exposed to the users through which they can submit POST and GET requests. POST requests trigger a backend that needs a robust configuration to operate properly and GET requests are sent to fetch the result of the backend, if any. This component of the solution is currently hosted on an Azure Web App Service which does the job perfectly.
The (containerised) backend (triggered by POST requests) perform heavy calculations during a short amount of time (typically 5-10 minutes are allotted for the calculation). This backend needs (at least) 4 cores and 16 Gb RAM, but the more the better.
The current configuration consists in the backend hosted together with the REST API on the App Service with a plan that accommodates the backend's requirements. This is clearly not very cost-efficient, as the backend is idle ~90% of the time. On top of that it's not really scalable despite an automatic scaling rule to spawn new instances based on the CPU use: it's indeed possible that if several POST requests come at the same time, they are handled by the same instance and make it crash due to a lack of memory.
Azure Functions doesn't seem to be an option: the serverless (consumption plan) solution they propose is restricted to 1.5 Gb RAM and doesn't have Docker support.
Azure Container Instances neither, because first the max number of CPUs is 4 (which is really few for the needs here, although acceptable) and second there are cold starts of approximately 2 minutes (I imagine due to the creation of the container group, pull of the image, and so on). Despite the process is async from a user perspective, a high latency is not allowed as the result is expected within 5-10 minutes, so cold starts are a problem.
Azure Batch, which at first glance appears to be a perfect fit (beefy configurations available, made for hpc, cost effective, made for time limited tasks, ...) seems to be slow too (it takes a couple of minutes to create a pool and jobs don't run immediately when submitted).
Do you have any idea what I could use?
Thanks in advance!
Azure Functions
You could look at Azure Functions Elastic Premium plan. EP3 has 4 cores, 14GB of RAM and 250GB of storage.
Premium plan hosting provides the following benefits to your functions:
Avoid cold starts with perpetually warm instances
Virtual network connectivity.
Unlimited execution duration, with 60 minutes guaranteed.
Premium instance sizes: one core, two core, and four core instances.
More predictable pricing, compared with the Consumption plan.
High-density app allocation for plans with multiple function apps.
https://learn.microsoft.com/en-us/azure/azure-functions/functions-premium-plan?tabs=portal
Batch Considerations
When designing an application that uses Batch, you must consider the possibility of Batch not being available in a region. It's possible to encounter a rare situation where there is a problem with the region as a whole, the entire Batch service in the region, or your specific Batch account.
If the application or solution using Batch always needs to be available, then it should be designed to either failover to another region or always have the workload split between two or more regions. Both approaches require at least two Batch accounts, with each account located in a different region.
https://learn.microsoft.com/en-us/azure/batch/high-availability-disaster-recovery

App services on azure seems to be very slow

I am trying to track down when our frontend started to work that slow. Recently I created new app services within the same service plan.
so now I have six apps (2 frontend, 4 backend) running under same App Service plan using Basic pricing tier. Also, we use Kudu for deployments.
Could that be the reason? or how to look for the reason?
this is overview of that service plan
appreciating any ideas and suggestions
#user122222 This is a high CPU issue and not a slow request issue as others have pointed out.
An immediate action you can take is to scale up. If you are using a B1 instance in the basic tier, try to scale up to a B3, which will provide you with more CPU cores and RAM. See if that provides you relief. If so, then you likely need to remain at this instance level. At this point it would also be worth while to analyze your number of requests. You should scale up when you are running many sites or resource intensive sites and you should scale out when you are receiving a high number of requests.
My money is on the fact that you likely have an issue with your code that is causing a deadlock or similar. Your CPU usage graph is stuck at 100% usage over many hours. Even an overloaded ASP will see a few dips over the course of a few hours.
To troubleshoot high CPU usage, start by using the diagnose and solve problems blade in your app service plan. This is the same troubleshooting tool that a support engineer would use in a paid technical support case. Use it to troubleshoot high CPU (not slow requests as based on your screenshot, it would appear the CPU is the culprit of the slow requests).
This can tell you what app in the ASP is causing the issue and sometimes even tell you the process in that app that is causing the issue. Beyond this, I'd suggest creating and analyzing a memory dump of the problematic web app. More steps on how to do that here.
Please try to restart the worker instance.
https://learn.microsoft.com/en-us/rest/api/appservice/app-service-plans/reboot-worker#code-try-0

100% Memory usage on Azure App Service Plan with two Apps - working set used 10gb+

I've got an app service plan with 14gb of memory - it should be plenty for my application's needs. There are two application services running on it, each identical - the private memory consumption of these hovers around 1gb but can spike to 4gb during periods of high usage. One app has a heavier usage pattern than the other.
Lately, during periods of high usage, I've noticed that the heavily used service can become unresponsive, and memory usage stays at 100% in the App Service Plan.
The high traffic service is using 4gb of private memory and starting to massively slow down. When I head over to the /scm.../ProcessExplorer/ page, I can see that the low traffic service has 1gb private memory used and 10gb of 'Working Set'.
As I understand it, on a single machine at least, the working set should be freed up when that memory is needed on another process. Does this happen naturally when two App Services share a single Plan?
It looks to me like the working set on the low-traffic instance is not being freed up to supply the needs of the high-traffic App Service.
If this is indeed the case, the simple fix is to move them to separate App Service Plans, each with 7gb of memory. However this seems like it might potentially be just shifting the problem around - has anyone else noticed similar issues with multiple Apps on a single App Service Plan? As far as I understand it, these shouldn't interfere with one another to the extent that they all need to be separated. Or have I got the wrong diagnosis?
In some high memory-consumption scenarios, your app might truly require more computing resources. In that case, consider scaling to a higher service tier so the application gets all the resources it needs. Other times, a bug in the code might cause a memory leak. A coding practice also might increase memory consumption. Getting insight into what's triggering high memory consumption is a two-part process. First, create a process dump, and then analyze the process dump. Crash Diagnoser from the Azure Site Extension Gallery can efficiently perform both these steps. For more information.
refer Capture and analyze a dump file for intermittent high memory for Web Apps.
In the end we solved this one via mitigation, rather than getting to the root cause.
We found a mitigation strategy to our previous memory issues several months ago, which was just to restart the server each night using a powershell script. This seems to prevent the memory just building up over time, and only costs us a few seconds of downtime. Our system doesn't have much overnight traffic as our users are all based in the same geographic location.
However we recently found that the overnight restart was reporting 'success' but actually failing each night due to expired credentials. Which meant that the memory issues we were having in the question I posted were actually exacerbated by server uptimes of several weeks. Restoring the overnight restart resolved the memory issues we were seeing, and we certainly don't see our system ever using 10gb+ again.
We'll investigate the memory issues if they rear their heads again. KetanChawda-MSFT's suggestion of using memory dumps to analyse the memory usage will be employed for this investigation when it's needed.

Resources