EF Core 3.1.14 Recurring Cold Start - azure

We have deployed a very simple .NET CORE 3 Web API application to Azure Cloud. The application is a web api and talks to a very simple SQL server database hosted in Azure as well. There are two main performance problems we are noticing
All API calls go to DB for either read or write operations. The tables only contain 4 and 5 rows and the queries are just basic select and insert queries with no joins.
The first call to the API is very slow (30 seconds to query 1 record in a table of size 10) and we added the timer and noticed it is the DB call that is taking 99.99% of the time. So I used the Azure Data Studio Profiler and realized that the query reached SQL Server after like 29.90 seconds. So the issue is not the query itself. Also, the second, third query etc. are super fast and return within < 30 milli-seconds. So the issue is not the internet connectivity between the Web App and the Azure SQL Database.
The bigger problem is that, if you stop calling the API for say 2-3 minutes and then do another call, then again the first query takes 30 seconds. But the subsequent queries are faster.
If this was only happening when w3wp.exe starts then I wouldn't be concerned but if the requests to the API stop coming for 2-3 minutes then again it is down. This is of concern.
We have Always ON set to Yes.
I tried collecting .NET Trace in Azure for the web app but this is giving me this weird error.
Here are the Nuget package versions installed in the VS solution related to EF.
Here is the SQL Server pricing tier.
Is there any other way to collect trace for Azure Web APP. I really need to see the call stack of the code for those 30 seconds to move forward. I have access to KUDU etc.
Thanks.
UPDATE 3 - 8th May 2021
I have posted the answer to my own question. This may not be root cause for other people who face similar problem but at least 1 area to look into.
UPDATE 2 - 7th May 2021
After adding EF Core logging as suggested by Ivan, he is right that the opening of connection is taking too long? Why is that? And how to stop that from happening?
UPDATE 1 - 7th May 2021
Jason Pan - We are using App Service Plan and this is the only application hosted there. The plan is P1V2 (https://azure.microsoft.com/en-us/pricing/details/app-service/windows/).
Ivan Stoev - Yes since the .NET Trace is not working for some reason as explained in my question, we captured the App Insights Profiler Trace to capture the call stack and as per call stack it appears that the connection to the SQL server was opened after 30 seconds. So I made two changes in my code:
a. Removed IDisposable from our Repository class that was having our context inject through DI. Before inside the Dispose method, I was calling Dispose on our context class.
b. I replaced services.AddDbContext with services.AddDbContextPool
I then wrote a test program to call the API method randomly once every 2 to 4 mins for 1 hour and only 1 call took 30 seconds and the remaining 21 took few milli-seconds.
But my next step is to run a 24 hours test (1 call every 2-7 mins for example) to see if this was just fluke or actually the solution.

Ok so posting an answer to my question. It turned out that there was no issue with web application, app service plan, sql server or entity framework. I took a network trace of my application and 1 other application which doesn't have any issue and opened it with network monitor. We noted that they are taking different paths. After looking into the IP address we realized that the other application had a virtual network setup. You can see that by going to your app service plan and then click on Networking option in the left menu bar. And then choose the first one for vNet. Once we configured vNet, then all responses were within < 1 second.
There was one another oversight by me. The Auth0 calls were also taking 14 seconds sometimes. And when I tried running tcpping google.com from KUDU that timed out sometimes as well. But was working fine for other web applications.

Related

Azure Function App is very slow, but only sometimes

I have read through most of the questions that seems to be similar to what I'll ask so hopefully I'm not wasting anyone's time.
We have a Function App in Azure Cloud that contains several Durable Functions.
One of these durable functions is a HTTP trigger API REST call.
It will normally take between 0.5 - 3 seconds to execute fully (from call to done, delivered result). But sometimes it takes 20-35 seconds. I don't know why or how I can search for errors.
The durable function fetches information from a Cosmos DB and delivers the result back to the caller.
Function App, Durable Function and Cosmos DB are all located in the same Region. (Checked that).
The Durable Function is set to B2:2 and has toggled Always On to ON.
Is there something I miss or something I should check to make sure it runs smoother?
Log of executions of the app:
I greatly appreciate everyone's time and energy they put into helping me. Thanks a lot.
---- Additions to the post after posting ----
I have checked the interactive tool and if I read that correctly it tells me a maximum execution time of 0.8 seconds and a maximum network lag of 6 seconds. That would indicate something that I suspected before I set up this post and that is that Azure needs to cold start the function. But I have always on toggled on so why?
It doesn't seem to take 30 seconds to complete the function. It seems to take less than 1 second to complete the function and up to a maximum of 6 seconds lag, but where are the other 23 seconds going in a 30 second call?
B2:2 is the service agreement I have with Azure. B2 is the test environments second paid state with 2 instances scaling (I have changed that to 3 after posting this).
Application Insights are on and no other dependencies are present except the Cosmos DB.
AFAIK in Azure Functions,
After 5 minutes of inactivity, Function App goes to the cold state. To come out of it, 10 seconds delay occurs.
Even the Function App is in Hot State, it will take some excessive amount of time to load the external libraries defined in it.
In the Function App, Code Logic Performance also matters the cause of slowness in the Azure Functions.
There are few steps for reducing the cold-start times particularly for the Functions having external libraries:
Running from a package file WEBSITE_RUN_FROM_PACKAGE to 1 may reduce cold-start times, particularly for JavaScript functions with large npm package trees.
From the Azure Portal > Diagnose and solve problems > Troubleshoot Performance category to identify the causes of slowness:
Try Always On Feature available in App Service Plan and Premium Plan of the Azure Functions to prevent such issues.
Regarding the Performance and reliability improving of Azure Functions, please refer here.
If this issue persists still, then please raise an incident with Microsoft Support to get the root cause and resolution.
Try fiddling around with maxqueuepollingintervall. It helped out with our cold starts quite a bit.

Diagnosing ASP.NET Azure WebApp issue

since a month one of our web application hosted as WebApp on Azure is having some kind of problem and I cannot find the root cause of that.
This WebApp is hosted on Azure on a 2 x B2 App Service Plan. On the same App Service Plan there is another WebApp that is currently working without any issue.
This WebApp is an ASP.NET WebApi application and exposes a REST set of API.
Effect: without any apparent sense (at least for what I know by now), the ThreadCount metric starts to spin up, sometimes very slowly, sometimes in few minutes. What happens is that no requests seems to be served and the service is dead.
Solution: a simple restart of the application (an this means a restart of the AppPool) causes an immediate obvious drop of the ThreadCount and everything starts as usual.
Other observations: there is no "periodicity" in this event. It happened in the evening, in the morning and in the afternoon. It seems that evening is a preferred timeframe, but I won't say there is any correlation.
What I measured through Azure Monitoring Metric:
- Request Count seems to oscillate normally. There is no peak that causes that increase in ThreadCount
- CPU and Memory seems to be normal, nothing strange.
- Response time, like the others metrics
- Connections (that should be related to sockets) oscillates normally. So I'd exclude something related to DB connections.
What may I do in order to understand what's going on?
After a lot of research, this happened to be related to a wrong usage of Dependency Injection (using Ninject) and an application that wasn't designed to use it.
In order to diagnose, I discovered a very helpful feature in Azure. You can reach it by entering into the app that is having the problem, click on "Diagnose and solve problems" then click on "Diagnostic tools" and then select "Collect .NET profiler report". In that panel, after configuring the storage for the diagnostic files, you can select "Add thread report".
In those report you can easily understand what's going wrong.
Hope this helps.

SQL Azure Premium tier is unavailable for more than a minute at a time and we're around 10-20% utilization, if that

We run a web service that gets 6k+ requests per minute during peak hours and about 3k requests per minute during off hours. Lots of data feeds compiled from 3rd party web services and custom generated images. Our service and code is mature, we've been running this for years. A lot of work by good developers has gone into our service's code base.
We're migrating to Azure, and we're seeing some serious problems. For one, we are seeing our Premium P1 SQL Azure database routinely become unavailable for 1-2 full entire minutes. I'm sorry, but this seems absurd. How are we supposed to run a web service with requests waiting 2 minutes for access to our database? This is occurring several times a day. It occurs less after switching from Standard level to Premium level, but we're nowhere near our DB's DTU capacity and we're getting throttled hard far too often.
Our SQL Azure DB is Premium P1 and our load according to the new Azure portal is usually under 20% with a couple spikes each hour reaching 50-75%. Of course, we can't even trust Azure's portal metrics. The old portal gives us no data for our SQL, and the new portal is very obviously wrong at times (our DB was not down for 1/2 an hour, like the graph suggests, but it was down for more than 2 full minutes):
Azure reports the size of our DB at a little over 12GB (in our own SQL Server installation, the DB is under 1GB - that's another of many questions, why is it reported as 12GB on Azure?). We've done plenty of tuning over the years and have good indices.
Our service runs on two D4 cloud service instances. Our DB libraries are all implementing retry logic, waiting 2, 4, 8, 16, 32, and then 48 seconds before failing completely. Controllers are all async, most of our various external service calls are async. DB access is still largely synchronous but our heaviest queries are async. We heavily utilize in-memory and Redis caching. The most frequent use of our DB is 1-3 records inserted for each request (those tables are queried only once every 10 minutes to check error levels).
Aside from batching up those request logging inserts, there's really not much more give in our application's db access code. We're nowhere near our DTU allocation on this database, and the server our DB is on has like 2000 DTU's available to be allocated still. If we have to live with 1+ minute periods of unavailability every day, we're going to abandon Azure.
Is this the best we get?
Querying stats in the database seems to show we are nowhere near our resource limits. Also, on premium tier we should be guaranteed our DTU level second-by-second. But, again, we go more than an entire solid minute without being able to get a database connection. What is going on?
I can also say that after we experience one of these longer delays, our stats seem to reset. The above image was a couple minutes before a 1 min+ delay and this is a couple minutes after:
We have been in contact with Azure's technical staff and they confirm this is a bug in their platform that is causing our database to go through failover multiple times a day. They stated they will be deploying fixes starting this week and continuing over the next month.
Frankly, we're having trouble understanding how anyone can reliably run a web service on Azure. Our pool of Websites randomly goes down for a few minutes a few times a month, taking our public sites down. If our cloud service returns too many 500 responses something in front of it is cutting off all traffic and returning 502's (totally undocumented behavior as far as we can tell). SQL Azure has very limited performance and obviously isn't ready for prime time.

Random 503 errors in Azure Mobile Services

At certain times during the week while I'm testing my Mobile Services app I get a 503 error (Service Unavailable). It happens whether I try to call the app from localhost or live on my Azure Website. It hangs around for 10-15 minutes and then goes away on its own. It doesn't seem to be caused by anything in particular that I am doing (i.e. I have not updated any code). The 503 error occurs when I'm trying to call one of my custom APIs in my Mobile Services account. A few of the requests make it through (strangely enough) but the majority return a 503 error.
I've seen that someone had a very similar problem here (Why does Azure give me an intermittent Error 503. The service is unavailable?) without an acceptable resolution.
I am using the free version of Mobile Services but I should be no where near pushing the limits of what the free version can handle; I am the sole user of the app right now.
It will soon be time to make the service live and I'm shuddering at the thought of support calls that will come in during one of these funky states the service gets into. Any help in debugging the problem would be greatly appreciated.
EDIT:
I've narrowed this down to a database problem. I have one main query (sproc) that I use to feed data to the UI. I noticed that when I get the 503 errors the query takes about 13 seconds (when run in SSMS). When things are running "normally", the query takes less than a second.
This doesn't solve my problem though, in fact it makes it more perplexing because I am using the Business Edition of Windows Azure SQL Database and there shouldn't be a 13 second fluctuation in execution time!
This problem seems to happen randomly. Is there some kind of caching in SQL Server that could explain this? Maybe my query really does take 13 seconds to execute and the caching superficially speeds it up.
Could you try transitioning your database/server to one of the "editions"? They have resource governance to promote predictable performance. Web/Business suffer from a noisy neighbor problem. It sounds like that may be your issue, considering it is intermittent.
Here's a link to a page describing the editions. https://msdn.microsoft.com/en-us/library/azure/dn741340.aspx

Web Site Availability in Windows Azure [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
App pool timeout for azure web sites
I am working on an asp.net mvc 4 app that is hosted in Windows Azure. This app will not have a lot of traffic as people will intermittently (once an hour) use it. I wanted to try using Windows Azure.
My app is currently set to use the FREE web site mode. I noticed that after 30 minutes, the site takes a long-time (> 5 seconds) to load. After that initial load, its fast. Then, if someone doesn't use it for another 30 minutes, it takes >5 seconds to load again.
I then tried upping the web site mode to a SHARED instance. I experienced the same problem there.
I then tried upping the web site mode to a RESERVED instance. The problem then goes away.
While I'd like to use Windows Azure, paying $50+ a month for a RESERVED instance is pretty expensive for a site that few have used up to this point. However, I can't have the initial lag. That will just defer the few users I have. You could say you get what you pay for. At the same time, I have a hard time believing others are experiencing this problem and not complaining. There has to be something I'm missing.
I figure the problem has to deal with the application pool resetting. However, I can't seem to figure a way around this. Is anyone familiar with this issue? Is there a way to fix it on a FREE or SHARED instance?
Thank you!
This is expected behavior based on how Windows Azure Web Sites work. The app pool they live in is spun up "on demand" and then hangs around for a time period.
For a detailed (and shameless plug) you can check out my article on this: http://www.simple-talk.com/dotnet/.net-framework/windows-azure-websites-%e2%80%93-a-new-hosting-model-for-windows-azure/
In summary:
Web Sites are hosted in a process on a farm of machines running IIS. If a site is idle for some time then the process is torn down automatically. Also, if the box is seeing a lot of pressure due to the other sites on the box the idle timeout may come down quite a bit (even as low as five minutes). When the next call comes in you'll see the process spun up again (likely on a completely different server). This is because you are in a shared environment (and is similar to how Heroku works). Once you move to reserved then you are the ONLY person on that virtual machine and if you suffer from noisy neighbor issues in processing its' because of your own stuff.
There are ways to keep your site "up", such as having a job that pings the url frequently; however, given that the idle timeout is somewhat fluid it may not solve every case. You can check out a recent post by Sandrino on how to use Azure Mobile Services as a job scheduler: http://fabriccontroller.net/blog/posts/job-scheduling-in-windows-azure/ . There are also 3rd party services available that can do the ping for you automatically.
To be honest, the web sites are a great feature for quick development and test, or even relatively low traffic sites as you are talking about. If you need a high level of uptime and better performance then you'll want to look at Reserved, or another option if the cost isn't in line with expectations.
This isn't an Azure problem. It is a "feature" of any web site hosted in IIS. The default time-out for app pools is 20 minutes. Read about App Pool timeouts here - http://technet.microsoft.com/en-us/library/cc771956(v=ws.10).aspx - one method is to create a keep alive page and ping the page every 10 minutes or so.

Resources