Understand why Azure App Service has a delay before it starts processing requests (Application Insights tracing)

We have a question about Azure: in some cases we see dead time when one of our App Services receives a request, or when Service Bus triggers, for example, one of our Azure Functions.
The following trace shows an example:
[Application Insights example image]
We send a request whose execution takes about 5 seconds, but Azure takes more than 30 seconds to even start executing it. We have made a lot of optimizations in our apps, but we have no visibility into this delay.
Has anyone faced the same issue and found a solution? We believe it is a performance issue in the workers, but this also happens when the workers are under low memory and CPU load, so we don't know how to automatically scale the resource horizontally if it is not under load.
This also happens in our Azure Functions, where we believe it is an issue between Service Bus and the Function's container. In those cases we found the Function has higher CPU consumption, but we don't know why, because in the local environment we process a lot of messages with multithreading without any problem.

Related

What is the optimal architecture design on Azure for an infrequently used backend that needs a robust configuration?

I'm trying to find the optimal cloud architecture to host a software on Microsoft Azure.
The scenario is the following:
A (containerised) REST API is exposed to the users, through which they can submit POST and GET requests. POST requests trigger a backend that needs a robust configuration to operate properly, and GET requests fetch the result of the backend, if any. This component of the solution is currently hosted on an Azure App Service web app, which does the job perfectly.
The (containerised) backend (triggered by POST requests) performs heavy calculations over a short period (typically 5-10 minutes are allotted for the calculation). This backend needs (at least) 4 cores and 16 GB of RAM, but the more the better.
The current configuration consists of the backend hosted together with the REST API on the App Service, with a plan that accommodates the backend's requirements. This is clearly not very cost-efficient, as the backend is idle ~90% of the time. On top of that, it's not really scalable despite an automatic scaling rule that spawns new instances based on CPU use: if several POST requests arrive at the same time, they may be handled by the same instance and crash it due to a lack of memory.
Azure Functions doesn't seem to be an option: the serverless (Consumption plan) solution they propose is restricted to 1.5 GB of RAM and doesn't have Docker support.
Azure Container Instances isn't an option either: first, the maximum number of CPUs is 4 (which is really few for the needs here, although acceptable), and second, there are cold starts of approximately 2 minutes (I imagine due to the creation of the container group, pulling the image, and so on). Although the process is async from the user's perspective, high latency is not acceptable because the result is expected within 5-10 minutes, so cold starts are a problem.
Azure Batch, which at first glance appears to be a perfect fit (beefy configurations available, made for HPC, cost-effective, made for time-limited tasks, ...), seems to be slow too (it takes a couple of minutes to create a pool, and jobs don't run immediately when submitted).
Do you have any idea what I could use?
Thanks in advance!
Azure Functions
You could look at the Azure Functions Elastic Premium plan. EP3 has 4 cores, 14 GB of RAM and 250 GB of storage.
Premium plan hosting provides the following benefits to your functions:
Avoid cold starts with perpetually warm instances
Virtual network connectivity.
Unlimited execution duration, with 60 minutes guaranteed.
Premium instance sizes: one core, two core, and four core instances.
More predictable pricing, compared with the Consumption plan.
High-density app allocation for plans with multiple function apps.
https://learn.microsoft.com/en-us/azure/azure-functions/functions-premium-plan?tabs=portal
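As a rough illustration of how the heavy backend could be decoupled from the REST API on a Premium plan, here is a minimal queue-triggered function sketch. Everything named here (the calc-requests queue, the ResultsConnection setting, the helper methods) is a hypothetical stand-in, not part of your solution: the idea is that the API enqueues a message on POST and the GET endpoint later reads the stored result.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class HeavyCalculationFunction
{
    [FunctionName("HeavyCalculation")]
    public static async Task Run(
        [QueueTrigger("calc-requests", Connection = "ResultsConnection")] string requestJson,
        ILogger log)
    {
        log.LogInformation("Starting calculation for request: {Request}", requestJson);

        // Stand-in for the containerised backend's 5-10 minute computation.
        string result = await RunHeavyCalculationAsync(requestJson);

        // Persist the result somewhere the GET endpoint can read it (blob, table, database...).
        await StoreResultAsync(requestJson, result);
    }

    // Hypothetical placeholders for the real computation and result storage.
    private static Task<string> RunHeavyCalculationAsync(string request) => Task.FromResult("result");
    private static Task StoreResultAsync(string request, string result) => Task.CompletedTask;
}
```

The Premium plan can also run Functions from a custom Linux container image, which may cover the Docker requirement that rules out the Consumption plan.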
Batch Considerations
When designing an application that uses Batch, you must consider the possibility of Batch not being available in a region. It's possible to encounter a rare situation where there is a problem with the region as a whole, the entire Batch service in the region, or your specific Batch account.
If the application or solution using Batch always needs to be available, then it should be designed to either failover to another region or always have the workload split between two or more regions. Both approaches require at least two Batch accounts, with each account located in a different region.
https://learn.microsoft.com/en-us/azure/batch/high-availability-disaster-recovery
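As a sketch of the "two accounts, two regions" advice above: the snippet below is only an illustration under assumed names (IWorkSubmitter and the account URL list are hypothetical, not Batch SDK types). It simply tries the primary Batch account and falls back to the secondary region when submission fails.

```csharp
using System;
using System.Collections.Generic;

public interface IWorkSubmitter
{
    // Wraps whatever SDK call actually submits the job to a given Batch account.
    void Submit(string batchAccountUrl, string jobPayload);
}

public static class RegionalFailover
{
    public static void SubmitWithFailover(
        IWorkSubmitter submitter,
        IReadOnlyList<string> batchAccountUrls, // e.g. primary region first, then secondary
        string jobPayload)
    {
        var failures = new List<Exception>();

        foreach (var accountUrl in batchAccountUrls)
        {
            try
            {
                submitter.Submit(accountUrl, jobPayload);
                return; // submitted successfully, stop here
            }
            catch (Exception ex)
            {
                // Account or region unavailable: remember the error and try the next region.
                failures.Add(ex);
            }
        }

        throw new AggregateException("All Batch regions failed.", failures);
    }
}
```

In practice each region would keep its own pre-created pool, since creating pools on demand is exactly what introduces the multi-minute delay mentioned in the question.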

Limit Azure Function restart rate

I have already faced a similar problem a few times:
An Azure Function with a ServiceBusTrigger, for some reason (misconfiguration, infrastructure issues, it doesn't really matter), fails to connect to Service Bus (so it happens at the trigger level), which leads to two issues:
It tries to restart all the time, increasing CPU consumption
It generates literally millions of exceptions in Application Insights, which leads to quota exceedance
Practically every configuration error means significantly increased bills and requires thorough monitoring after every deployment, which is annoying and error prone.
So, my question: is there a way to set a delay between restart attempts, to (for example) one second? And, in addition, is there a way to limit the number of restart attempts and then shut down the Function?
Establishing a connection to the broker to fetch messages is the Functions runtime's responsibility (specifically, the Scale Controller). That aspect is entirely abstracted from customers and is not configurable. I suggest raising an issue with the Azure Functions team, likely under the Runtime repo.
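Since the trigger's retry behaviour itself isn't configurable, one partial mitigation for the quota side of the problem is to cap how much exception telemetry leaves the app. The sketch below is an assumption-laden example (the class name, the 100-per-minute cap, and the registration path are all illustrative, and it assumes the in-process model with the Microsoft.ApplicationInsights package available): it drops exception telemetry beyond a per-minute budget so a flapping trigger can't exhaust the Application Insights quota.

```csharp
using System;
using System.Threading;
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

public class ExceptionRateCapProcessor : ITelemetryProcessor
{
    private readonly ITelemetryProcessor _next;
    private readonly int _maxPerMinute;
    private int _count;
    private DateTime _windowStart = DateTime.UtcNow;

    public ExceptionRateCapProcessor(ITelemetryProcessor next, int maxPerMinute = 100)
    {
        _next = next;
        _maxPerMinute = maxPerMinute;
    }

    public void Process(ITelemetry item)
    {
        if (item is ExceptionTelemetry)
        {
            var now = DateTime.UtcNow;
            if (now - _windowStart > TimeSpan.FromMinutes(1))
            {
                // Start a new one-minute window.
                _windowStart = now;
                Interlocked.Exchange(ref _count, 0);
            }

            if (Interlocked.Increment(ref _count) > _maxPerMinute)
            {
                // Over budget for this window: drop the exception telemetry.
                return;
            }
        }

        // Everything else (and in-budget exceptions) flows through unchanged.
        _next.Process(item);
    }
}
```

This only protects the ingestion quota and the bill; the restart loop itself still has to be fixed at the source, as noted above.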

What would cause high KUDU usage (and eventual 502 errors) on an Azure App Service Plan?

We have a number of API apps and web apps on an Azure App Service P2v2 instance. We've been experiencing a fair amount of platform instability: the App Service becomes unhealthy and we get a rash of 502 errors across various apps (different ones each time), attributable to very high CPU and memory usage on the App Service plan. We've tried scaling all the way up to P3v2, but whatever the issue is seems eventually to consume all resources available.
Whenever we've been able to trace a culprit among the apps, it has turned out not to be the app itself but the Kudu service related to it.
A sample error message is "High physical memory usage detected on multiple occasions. The kudu process for the app [sitename]'pe-services-color' is the most common cause of high memory usage. The most common cause of high memory usage for the kudu process is web jobs.", where the actual app whose Kudu service is named changes quite frequently.
What could be causing the Kudu services to consume so much CPU/Memory, and what can we do to stabilise this app service?
Is it simply that we have too many apps running on one plan? This seems unlikely since all these apps ran previously on a single classic cloud service instance, but if so, what are the limits for apps and slots on a single plan?
(I have seen this question but the answer doesn't help)
Update
From Azure support, these are apparently the limits on Small - Medium - Large non-shared app services:
Worker size   Max sites
Small         5
Medium        10
Large         20
with 'sites' comprising app services/api apps and their slots.
They seem ridiculously low, and make the larger App Service units highly uneconomic. Can anyone confirm these numbers?
(Incidentally, we found that turning off Always On across the board fixed the issue - it was only causing a problem on empty sites though - we haven't had a chance yet to see if performance is good with all the sites filled.)
High CPU and memory utilization is mostly caused by your own program/code. Lots of CPU-intensive tasks, or heavy use of parallel programming that spawns many new threads, can contribute to high CPU and memory utilization, so review your code for such patterns. As the number of parallel threads increases, CPU utilization goes up and the plan starts scaling up frequently, which adds to your cost and can sometimes lead to lost threads and unexpected results. Because Azure resource costs are high, you need to plan your performance accordingly.
You can monitor this using the Metrics option in the App Service plan blade.
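To make the "too many parallel threads" point concrete, here is a minimal sketch, assuming the spikes come from unbounded parallel work as described above; ProcessItem and the item collection are hypothetical placeholders. Capping MaxDegreeOfParallelism keeps a single instance from saturating every core.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BoundedWork
{
    public static void ProcessAll(IEnumerable<string> items)
    {
        var options = new ParallelOptions
        {
            // Leave headroom for the web server instead of using every core.
            MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount - 1)
        };

        Parallel.ForEach(items, options, item =>
        {
            ProcessItem(item); // hypothetical CPU-bound unit of work
        });
    }

    private static void ProcessItem(string item)
    {
        // ... CPU-intensive processing ...
    }
}
```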

AspNet Core on Azure App Service peaks to 100% CPU and app gateway load balancer not working

We have few of our internal business services hosted on an isolated ASE in Azure.
These services run on a medium app service plan with 2 instances.
This environment has been in production and in use for a little more than a month now and has been performing fairly well, apart from an occasional sudden CPU spike to 100% on one of the instances, which brings down the services.
We don't have auto scaling setup but have 2 instances running all the time.
The services are `aspnetcore` Web APIs and the runtime is .NET Core 2.0.
Every time I have come across this issue in the last couple of weeks, I have not been lucky enough to log in to Kudu and get a process dump to investigate further. The business is literally on my back to get the service up and running as quickly as possible, and the easiest route is to restart the faulting service or swap slots with a pre-prod environment.
Access to the ASE is also restricted from our network, which makes it all the more difficult for me to switch to Wi-Fi and then go through jump boxes to log in to Kudu. I had asked our Ops engineer to get me the dump when this issue is reported, but he has not been listening to me either, mostly for the same reasons I'm not able to do it myself.
All exceptions I can see in Application Insights are due to the services themselves going down, and there are no exceptions there that could cause the issue in the first place (at least I've not found any yet).
This led me to take a few guesses and look at metrics, and the only thing raising my suspicion is garbage collection. I don't see any sudden spike in the GC graphs either; each time the service is restarted the graph is fairly a straight line (over 24 hours), but it increases day by day and ends up like the graph below.
The working set memory, however, is a sinusoidal graph, which makes me think there are no memory leaks. But is the above graph, taken over 3 days, normal?
The drop is when I restart the service, but all services have a similar trajectory, even the one that has not gone down.
I am not sure if this is a problem with an individual service or an environment configuration I have overlooked.
The API endpoints are simple CRUD operations and publish events to a service bus topic after each operation. There is a static `HttpClient` instance used to fetch data from another service. Apart from that there are no unmanaged resources and the DB connections are always wrapped in `using` statements.
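For context, a minimal sketch of the pattern described (not the actual service code; all names here are hypothetical): a single static `HttpClient` reused across requests, with database connections disposed via `using`.

```csharp
using System.Data.SqlClient;
using System.Net.Http;
using System.Threading.Tasks;

public static class DownstreamClients
{
    // One HttpClient for the lifetime of the process, as described above.
    public static readonly HttpClient Http = new HttpClient();
}

public class WidgetRepository
{
    private readonly string _connectionString;

    public WidgetRepository(string connectionString) => _connectionString = connectionString;

    public async Task<int> CountAsync()
    {
        // The connection is always wrapped in a using block, so it is disposed promptly.
        using (var connection = new SqlConnection(_connectionString))
        using (var command = new SqlCommand("SELECT COUNT(*) FROM Widgets", connection))
        {
            await connection.OpenAsync();
            return (int)await command.ExecuteScalarAsync();
        }
    }
}
```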
I understand I would need a process dump to investigate further, but my biggest concern is why the application gateway (load balancer) is not sending traffic to the healthy instance. Because the gateway goes unhealthy, Cloudflare returns a `502` response to clients using the API.
MS support hasn't been able to help and hasn't answered whether our load balancers are working correctly.
The average number of requests is about 50-60 per minute.
CPU runs at less than 10% apart from this sudden surge.
Thanks
It could be that the backend is pegged at 100% CPU and is unable to respond to Application Gateway health probes. When such an issue occurs, were you able to verify, using Backend health logs, the health state of your backends? If both backend instances were unhealthy, it would explain the 502s. If one of them was healthy and responding to probes, then new requests sent to Application Gateway would indeed flow to the healthy instance. If you suspect that is not the case then please reply back with subscription id, gateway name and approximate time window of incident for us to take a look.
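To illustrate the probe side of this: the sketch below assumes Application Gateway's custom health probe points at a path like /healthz (the path and wiring are assumptions, not your configuration). Keeping the probe endpoint cheap and outside the heavy code path means a healthy instance always answers quickly, while a CPU-pegged instance simply times out and gets pulled from rotation.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        // Answer health probes before the rest of the pipeline (MVC, auth, etc.),
        // so the probe never touches the heavy application code path.
        app.Map("/healthz", probe =>
        {
            probe.Run(async context =>
            {
                context.Response.StatusCode = StatusCodes.Status200OK;
                await context.Response.WriteAsync("OK");
            });
        });

        // ... the usual pipeline continues here (app.UseMvc() and so on) ...
    }
}
```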

Sudden dropoff in Azure queue performance

Short version: What reasons could there be for a sudden, dramatic, and seemingly permanent increase in the rate of timing-out Azure queue requests?
It's going to be difficult to provide all of the details that could possibly be relevant here, but here's a start:
This is an Azure application (SDK v2.0) with a WCF service placing work requests on a queue (roughly 100k calls a day) and a couple of worker roles which process the queue. We've got New Relic monitoring with the latest .NET agent (3.3.38).
We've run into an issue in our latest release, deployed a few days ago -- after it ran normally for about 24 hours, all of a sudden we started seeing a greatly increased rate of timeouts when our worker roles fetch messages from the queue, along with a catastrophic drop in throughput (our application can now barely keep up with its own queue using 40 workers, whereas it usually gets by with just 2!). Ever since the timeouts started, they have shown no signs of letting up, continuing at the same rate.
A couple of images from New Relic to illustrate:
While this isn't nearly enough information to provide a good answer, I'm just trying to figure out where I might start looking. I've got support tickets open with New Relic and Microsoft, but we're trying to investigate on our own as well. Could this be throttling? Some kind of resource exhaustion in my queue processor worker role? We don't see increased load on the WCF service, and we haven't changed Azure client libraries or changed much of anything in the code that processes the queue.
I suggest you enable analytics on your storage account to determine whether the bottleneck is server side or client side/network related. Specifically, you can look at the Storage Analytics metrics table, in particular the AverageE2ELatency and AverageServerLatency properties, to check whether the issue is server side or client side.
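As a small illustration of how to read those two metrics (the helper below is hypothetical, not an SDK API): AverageE2ELatency includes network transfer and client processing time, while AverageServerLatency is only the time spent inside the storage service, so comparing them points at where the delay lives.

```csharp
// Hypothetical helper: interprets the two latency metrics from the storage
// analytics tables. The decision rule is a rough heuristic, not an official API.
public static class QueueLatencyDiagnosis
{
    public static string Interpret(double averageE2ELatencyMs, double averageServerLatencyMs)
    {
        // Time spent outside the storage service: network plus client-side processing.
        double clientAndNetworkMs = averageE2ELatencyMs - averageServerLatencyMs;

        if (averageServerLatencyMs >= clientAndNetworkMs)
        {
            return "Server side: the storage service itself accounts for most of the latency.";
        }

        return "Client side / network: the gap between E2E and server latency dominates, "
             + "so look at the worker roles (thread pool, CPU) and the network path.";
    }
}
```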
You can learn more about Azure Storage analytics from the links below:
Overview:
http://msdn.microsoft.com/en-us/library/hh343270.aspx
How to enable in portal:
http://azure.microsoft.com/en-us/documentation/articles/storage-monitor-storage-account/
Metrics table Schema:
http://msdn.microsoft.com/en-us/library/hh343264.aspx
Blog post:
http://blogs.msdn.com/b/windowsazurestorage/archive/2011/08/03/windows-azure-storage-analytics.aspx
