Azure Web Application (Windows) - stalling during CILJit::compileMethod in calls to Entity Framework Core

I have been looking into the performance of calls to an ASP.NET Core 3.1 Web API project running on Azure.
Note: yes, we should be moving to a later version of .NET Core, and that's in the pipeline, but it's not something I can just switch to without a bit of effort.
We are targeting netcoreapp3.1 for our libraries, and are referencing Entity Framework Core v3.1.5.
Looking at a typical end-to-end trace in Application Insights, we see this:
If I'm reading this correctly, we're spending a grand total of 135ms in the database executing queries, but between the last 2 queries we appear to stall for ~12 seconds!
When I dig into the profiler trace for this request, I see this:
Again, if I read this right, that means during the second DB call (from our end-to-end transaction above), we spend ~12.4 seconds inside the call to EntityFrameworkQueryableExtensions.ToListAsync() doing JIT compilation.
That looks excessive to me.
This appears to be a pattern that I see throughout the day, even though the application is set to Always On and there are no restarts of the application between occurrences.
The questions I have around this are:
is this typically to be expected?
if so, should it really be taking this long?
is there a way to reduce the need to JIT as often as we appear to be doing?
will a move to .NET 6 (and future framework versions) help us here?
On average, the API performs well, with a typical response time under 1 second. However, when these stalls do occur, they are noticeable and are causing me a headache.
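For what it's worth, one lever I'm aware of for reducing runtime JIT work on .NET Core 3.1 is ReadyToRun precompilation at publish time. This is only a minimal csproj sketch of that idea, not something we've verified against this particular stall, and the win-x64 runtime identifier is an assumption based on a Windows App Service:

```xml
<!-- Sketch only: ReadyToRun precompiles IL to native code at publish time,
     reducing (but not eliminating) JIT work at runtime. -->
<PropertyGroup>
  <PublishReadyToRun>true</PublishReadyToRun>
  <!-- ReadyToRun requires publishing for a specific runtime; win-x64 is an
       assumption for a Windows-hosted Azure App Service. -->
  <RuntimeIdentifier>win-x64</RuntimeIdentifier>
</PropertyGroup>
```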

Had exactly the same problem. After going back and forth with Azure support over email, it turned out that the solution (the one that worked for us) was to turn off the profiler: https://github.com/dotnet/runtime/issues/66649
The option to turn it off is the same in .NET and .NET Core.
After turning the option off, we went from having over 1,500 requests a day taking more than 10 seconds to none taking more than 5.
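For reference, turning the profiler off ultimately comes down to the Application Insights settings on the App Service. Here's a sketch of the change via the Azure CLI, assuming the app-setting names as I remember them (verify them against your own portal before relying on this):

```sh
# Assumed app settings that toggle the Application Insights Profiler off;
# <app> and <rg> are placeholders for your App Service and resource group.
az webapp config appsettings set --name <app> --resource-group <rg> \
  --settings APPINSIGHTS_PROFILERFEATURE_VERSION=disabled \
             DiagnosticServices_EXTENSION_VERSION=disabled
```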
Thanks,
Dave

Related

Azure Function App is very slow, but only sometimes

I have read through most of the questions that seem similar to what I'll ask, so hopefully I'm not wasting anyone's time.
We have a Function App in Azure Cloud that contains several Durable Functions.
One of these durable functions is a HTTP trigger API REST call.
It will normally take between 0.5 and 3 seconds to execute fully (from call to delivered result), but sometimes it takes 20-35 seconds. I don't know why, or how I can search for errors.
The durable function fetches information from a Cosmos DB and delivers the result back to the caller.
Function App, Durable Function and Cosmos DB are all located in the same Region. (Checked that).
The Durable Function is set to B2:2 and has Always On toggled on.
Is there something I miss or something I should check to make sure it runs smoother?
Log of executions of the app:
I greatly appreciate everyone's time and energy they put into helping me. Thanks a lot.
---- Additions to the post after posting ----
I have checked the interactive tool, and if I read it correctly, it tells me a maximum execution time of 0.8 seconds and a maximum network lag of 6 seconds. That would point to something I suspected before I set up this post: that Azure needs to cold-start the function. But I have Always On toggled on, so why?
The function itself doesn't seem to take 30 seconds: it completes in less than 1 second, with up to a maximum of 6 seconds of network lag. So where are the other 23 seconds going in a 30-second call?
B2:2 is the plan I have with Azure: B2 is the second paid tier for test environments, with scaling across 2 instances (I have changed that to 3 after posting this).
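For completeness, scaling the instance count can also be scripted rather than done in the portal; this is just a sketch with placeholder names, assuming the az CLI and an App Service plan:

```sh
# Sketch: scale the App Service plan to 3 instances.
# <plan> and <rg> are placeholders for the plan and resource group names.
az appservice plan update --name <plan> --resource-group <rg> --number-of-workers 3
```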
Application Insights are on and no other dependencies are present except the Cosmos DB.
AFAIK in Azure Functions:
After about 5 minutes of inactivity, a Function App goes into a cold state; coming out of it adds a delay of around 10 seconds.
Even when the Function App is in a hot state, it can take an excessive amount of time to load the external libraries it references.
The performance of the function's own code logic also matters as a cause of slowness in Azure Functions.
There are a few steps for reducing cold-start times, particularly for Functions with external libraries:
Running from a package file by setting WEBSITE_RUN_FROM_PACKAGE to 1 may reduce cold-start times, particularly for JavaScript functions with large npm package trees (see the sketch after this list).
Use Azure Portal > Diagnose and solve problems > Troubleshoot Performance to identify the causes of slowness.
Try the Always On feature, available in the App Service and Premium plans of Azure Functions, to prevent such issues.
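As a sketch of the first item above, the setting can be applied with the Azure CLI; <app> and <rg> are placeholders for your Function App and resource group:

```sh
# Sketch: enable run-from-package on a Function App.
az functionapp config appsettings set --name <app> --resource-group <rg> \
  --settings WEBSITE_RUN_FROM_PACKAGE=1
```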
Regarding improving the performance and reliability of Azure Functions, please refer here.
If this issue still persists, please raise an incident with Microsoft Support to get the root cause and a resolution.
Try fiddling around with maxPollingInterval for the queue trigger. It helped out with our cold starts quite a bit.
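For reference, in a v2+ Functions app that knob lives in host.json under the queues extension; here's a minimal sketch (the two-second value is only an example, not a recommendation):

```json
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "maxPollingInterval": "00:00:02"
    }
  }
}
```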

Sometimes my function takes a looong time to execute

I have tried Azure Functions for the first time; aside from a couple of problems where I found a workaround, it was quite easy to develop and publish my function to Azure. I even tried preview features like durable entities, and they work great. I am enthusiastic.
However, I have some concerns about the timings. My function is HTTP triggered and is called by another application. Most of the time the execution time is ~1 sec, which is great. Sometimes, though, it takes up to 30 secs to execute the same function, and I don't know why. Is this normal? Maybe some cold start? Or is it me doing something wrong? I am a newbie, so I'd like the experts' opinion. I am using the Consumption plan in West Europe.
Unfortunately, for this application anything > 4 sec is not acceptable, because it causes an error in the caller that is in turn reflected to the end user.
Here you can see a screen capture of the logs with timings; look at the crazy slow times at the bottom.
Is there any way to ensure the timing always stays within 4 secs?
This much variation would not be expected with cold start. Generally, cold start takes about 2-5 seconds and should only happen after a long period with no invocations. Also, the measurement here is just execution time and doesn't include startup time. I'd recommend looking into the logs and adding traces to see if there's a line of code it's hanging on.
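A minimal sketch of what such tracing could look like in a C# HTTP-triggered function; FetchFromCosmosAsync is a hypothetical stand-in for the real data-access call:

```csharp
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;

public static class TimedFunction
{
    [FunctionName("TimedFunction")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequest req,
        ILogger log)
    {
        // Trace elapsed time at each step so the invocation logs reveal
        // which step the occasional 30-second stalls belong to.
        var sw = Stopwatch.StartNew();
        log.LogInformation("Entered function at {Elapsed} ms", sw.ElapsedMilliseconds);

        var result = await FetchFromCosmosAsync(); // hypothetical data-access call
        log.LogInformation("Data access finished at {Elapsed} ms", sw.ElapsedMilliseconds);

        return new OkObjectResult(result);
    }

    // Hypothetical placeholder for the real query; replace with your own code.
    private static Task<string> FetchFromCosmosAsync() => Task.FromResult("data");
}
```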
The first step is to understand, step by step, what happens once you hit an Azure Function endpoint:
Azure must allocate your application to a server with capacity,
The Functions runtime must then start up on that server,
Your code then needs to execute.
I don't know why it takes up to 30 secs to execute the same function. Is this normal? Maybe some cold start?
I think the answer is related to cold start. The following image represents what happens when you trigger a Function App's endpoint (source: Understanding serverless cold start):
I had similar issues when using the Consumption plan. A dedicated plan might be a solution for your case; half a minute to warm up an endpoint is pretty bad. To keep the function warm, you can use the Premium plan, which promises the following:
When you're using the Premium plan, instances of the Azure Functions host are added and removed based on the number of incoming events just like the Consumption plan. Premium plan supports the following features: Perpetually warm instances to avoid any cold start
You can read about this further: Premium plan (preview)
Additional information:
Be careful with the mentioned option because the pricing might be different based on the following:
Instead of billing per execution and memory consumed, billing for the Premium plan is based on the number of core seconds, execution time, and memory used across needed and reserved instances. At least one instance must be warm at all times. This means that there is a fixed monthly cost per active plan, regardless of the number of executions.
I would consider the above-mentioned option, at least for testing purposes. I hope the answer helps and gives you an idea of why you have slow startups.

Is there a possibility that the timer trigger will be executed with a lot of delay?

First of all, the development environment is ASP.NET Core 2.0 (Azure Functions preview).
A timer trigger fired significantly later than its defined schedule (about 3 hours and 20 minutes late).
It is not a time zone problem.
Under normal circumstances, how much delay should be expected?
Timers in Azure Functions are accurate. If they fail to run, or run with significant delay, it indicates some other underlying issue: either an application problem or a Functions bug (v2 still has many bugs). Assuming you already have App Insights enabled, look at the exceptions table for evidence of something going wrong before the timer stopped running.
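One cheap diagnostic is logging TimerInfo.IsPastDue, which the runtime sets when it fires a trigger later than scheduled; here's a minimal sketch (the five-minute CRON schedule is only an example):

```csharp
using System;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class HeartbeatTimer
{
    // Logging IsPastDue separates "the host fired the trigger late"
    // from "the schedule itself is wrong".
    [FunctionName("HeartbeatTimer")]
    public static void Run([TimerTrigger("0 */5 * * * *")] TimerInfo timer, ILogger log)
    {
        if (timer.IsPastDue)
        {
            log.LogWarning("Timer fired late at {Now:O}", DateTime.UtcNow);
        }
        log.LogInformation("Timer executed at {Now:O}", DateTime.UtcNow);
    }
}
```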

Azure functions portal log / monitor isn't very accurate

I've been using Functions for a while, and it seems the longer a Function is around, the less accurate the portal logs are. For the first 3 months or so of using my functions, everything monitor/logging-wise was fine. Over time, things started getting less accurate.
Now I see the real logs by going to the Microsoft Azure Storage Explorer and checking AzureWebJobsStorage.
First, when I bring up the code/logs, the last log it shows isn't accurate. It will usually be from a few days ago, or be the last error. When the function triggers, though, it does get the live feed. This isn't that big a deal; it's the monitor being inactive, and not being able to see its logs, that is bad. I suppose I'll just use the Azure Storage Explorer.
The Monitor invocation logs always seem a few days behind. They used to be accurate, but for the last month or so they have consistently lagged by a few days.
Dan,
The local, file-based logs exist primarily to support the portal experience, so the behavior you're observing in the log window is expected: the logs are not written by the runtime as part of the normal invocation process, but only when you're actively developing/testing in the portal.
The issue you're experiencing with the monitor is due to a regression that has been patched and should be fully rolled out today (you can see more details here).
We've been listening to feedback on our logging capabilities, and there has been a lot of investment in that area, resulting in the recently announced built-in integration with Application Insights. That integration addresses some of the pain points you've brought up, as well as other issues, so I'd strongly recommend trying it out. You can find more information about it here.

Azure App Service: How can I determine which process is consuming high CPU?

UPDATE: I've figured it out. See the end of this question.
I have an Azure App Service running four sites. One of the sites has two deployment slots in addition to the primary one. Recently I've been seeing really high CPU utilization for the App Service plan as a whole.
The dark orange line shows the CPU percentage. This is just after restarting all my sites, which brought it down to this level.
However, when I look at the CPU use reported by each site, it's really low.
The darker blue line shows the CPU time, which is basically nothing. I did this for all of my sites, and all the graphs look the same. Basically, it seems that none of my sites are causing the issue.
A couple of the sites have web jobs, so I took a look at the logs but everything is running fine there. The jobs run for a few seconds every few hours.
So my question is: how can I determine the source of this CPU utilization? Any pointers would be greatly appreciated.
UPDATE: Thanks to the replies below, I was able to get more detail into what was happening. I ended up getting what I needed from SCM / Kudu tools. You can get here by going to your web app in Azure and choosing Advanced Tools from the side nav. From the Kudu dashboard, choose Process Explorer. The value in the Total CPU Time column is not directly useful, because it's the time in seconds that the process has run since it started, which might have been minutes or days ago.
However, if you make a record of the value at intervals, you can look at the change over time, and one process might jump out at you. In my case, it was my WebJobs process. Every 60 seconds, this one process was consuming about 10 seconds of processor time, just within one environment.
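If you'd rather not eyeball the column, the same delta-over-time idea can be scripted; here's a minimal C# sketch (the w3wp process name and 60-second window are assumptions) that could be run from the Kudu console:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class CpuDeltaSampler
{
    static void Main()
    {
        // Assumed target: the IIS worker process; swap in your WebJob's exe name.
        var process = Process.GetProcessesByName("w3wp")[0];

        // Record total CPU time, wait, then record it again: the delta is the
        // processor time actually consumed during the window.
        TimeSpan before = process.TotalProcessorTime;
        Thread.Sleep(TimeSpan.FromSeconds(60));
        process.Refresh();
        TimeSpan after = process.TotalProcessorTime;

        Console.WriteLine($"CPU consumed in the last 60s: {(after - before).TotalSeconds:F1}s");
    }
}
```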
The great thing about this Kudu dashboard is, if you can catch the problem while it is actually happening, you can hit the Start Profiling button and capture a diagnostic session. You can then open this up in Visual Studio and get some nice details about where the CPU time is being spent.
Just in case anyone else is seeing similar issues, I'll provide more details about my particular case. As I mentioned, my WebJobs exe was the culprit, and I found that all the CPU time was being spent in StackExchange.Redis.SocketManager, which manages connections to Azure Redis Cache. In my main web app, I create only one connection, as recommended. But since my WebJobs only run every once in a while, I was creating a new connection to Azure Redis Cache each time one ran, which apparently can lead to issues. I changed my code to create the Redis Cache connection once when the WebJob process starts up and to reuse the existing connection when any individual WebJob runs.
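For anyone wanting the shape of that change, the standard pattern is a lazily initialized singleton ConnectionMultiplexer; the REDIS_CONNECTION setting name below is an assumption, so adapt it to however your WebJob reads configuration:

```csharp
using System;
using StackExchange.Redis;

public static class RedisConnection
{
    // Create the multiplexer once per process; every WebJob invocation
    // reuses it instead of opening a fresh connection to Azure Redis Cache.
    private static readonly Lazy<ConnectionMultiplexer> LazyConnection =
        new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect(
                Environment.GetEnvironmentVariable("REDIS_CONNECTION"))); // assumed setting name

    public static ConnectionMultiplexer Connection => LazyConnection.Value;
}
```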
Time will tell if this really fixes the issue, but I think it will. When the problem occurred, it always fit the same pattern: After a few days of running fine, my CPU would slowly ramp up over the course of about 12 hours. My thinking is that each time a WebJob ran, it created a connection object, which at first didn't produce trouble, but gradually as WebJobs ran every hour or two, cruft was building up until finally some critical threshold was met and the CPU usage would take off.
Hope this helps someone out there. Best wishes!
Maybe you should go to the web app's SCM (Kudu) site?
%yourAppName%.scm.azurewebsites.net
There is a page there that shows you all the processes currently running on your web app (something like Console > Process).
You can also go to the support page (from the right corner of the SCM site).
You can find some more info about your performance there, and take a memory dump (not for this problem, but it is useful for performance issues).
Based on your description, I assume you could leverage the Crash Diagnoser extension to capture dump files from your Web Apps and WebJobs when the CPU usage percentage is higher than a specific threshold, to isolate this issue. For more details, you could refer to this official blog.
