Whenever I start my Azure Functions app, Live Metrics Streaming in Application Insights shows that there are a high number of servers online -- currently sitting at 24 and climbing -- and seems to bring another server online every few seconds or so. This is before I even make a call to any HttpTriggers. And for every call I make to an HttpTrigger, Functions seems to bring another server online (at least that's what Application Insights is saying).
I'm at a complete loss of how to account for this behavior, especially since my app isn't doing anything yet (no requests). There aren't any logged errors that I can see except for the "Unable to acquire Singleton lock" one for new servers, which I'd expect. The CPU% for a server spikes when it comes online, but then goes back down to 0% within a few seconds and stays there.
Even more bizarre is that when I left the app running over night, I saw 180 servers online in the morning and calls to my API were very slow.
Is the "servers online" metric in Live Metrics Stream misleading with Azure Functions, or is there any way to find out why there are so many servers being brought online? Does anyone have any other suggestions for how to figure out what's happening?
Related
We have few of our internal business services hosted on an isolated ASE in Azure.
These services run on a medium app service plan with 2 instances.
This environment has been in production and use for little more than a month now and has been performing fairly well apart from the occasional sudden CPU spike to 100% in one of the instance which bring down the services.
We don't have auto scaling setup but have 2 instances running all the time.
The services are `aspnetcore` webapi and the runtime is dotnet core 2.0.
Every time I have come across this issue in the last couple of weeks I have not been lucky enough to login to kudu and get a process dump to investigate further. The business are literally behind my back to get the service up and running as quick as possible and the easiest route is to restart one of the faulting service or swap slots with a pre-prod environment.
Access to the ASE are also restricted from our network and makes it all the more difficult for me to switch to a WiFi and then go through jump boxes to login to kudu, I had asked our Ops engineer to get me the dump when this issue is reported but he has not been listening to me either, mostly for the same reasons as me not able to do it myself.
All exceptions I can see in Application Insights are due to the service themselves going down and there are no exceptions there which can cause the issue in the first place(at least I've not found it yet)
This lead me to take few guess and look for metrics, the only thing raising my
suspicions is garbage collection. I don't see any sudden spike in GC graphs as well, each time the service is re started the graph is fairly a straight line(24 hours) but increases day by day and ends up like below.
But the working memory is a sinusoid graph letting me think there are no memory leaks. But is the above graph over 3 days normal?
The drop is when I restart the service. But all services have a similar trajectory even the one that has not gone down.
I am not sure if this is a problem with an individual service or an environment configuration I have overlooked.
The API endpoints are simple CRUD operations and publish events to a service bus topic after each operation. There is a static `HttpClient` instance used to fetch data from another service. Apart from that there are no unmanaged resources and the DB connections are always wrapped in `using` statements.
I understand I would need a process dump to investigate further but my biggest concern is why is the application gateway(load balancer) not sending traffic to the healthy instance. Because of the gateway going unhealthy cloudflare returns a `502` response to clients using the api.
MS support haven't been able to help and have not answered if we have our load balancers working correctly.
The average number of requests is about 50-60 per minute.
CPU runs at less than 10% apart this sudden surge.
Thanks
It could be that the backend is pegged at 100% CPU and is unable to respond to Application Gateway health probes. When such an issue occurs, were you able to verify, using Backend health logs, the health state of your backends? If both backend instances were unhealthy, it would explain the 502s. If one of them was healthy and responding to probes, then new requests sent to Application Gateway would indeed flow to the healthy instance. If you suspect that is not the case then please reply back with subscription id, gateway name and approximate time window of incident for us to take a look.
We're experiencing CPU spikes on our Azure App Service Plan for no obvious reason. Its not something that stops the service, but we'd like to have an understanding of when&how that kind of things happen.
For example, CPU percentage sits at 0-1% range for days but then all of the sudden it spikes to 98%, 45%, 60% and comes back to 0-1% range very quickly. Memory stays unchanged at comfortable 40-45% level, no incoming requests to it, no web jobs, nothing unusual in logs, no failures, service health ok, nothing we could point our finger to as a reason.
We tried to find out through kudu > support > analyze (metrics)...but we couldn't get request submited. It just keeps giving error to try later.
There is only one web app running in that app service plan, its a asp.net core 2.0. web api.
Could someone shed some light on this kind of behavior? Is this normal, expected? If so, why it happens? Is there a danger that it spikes to 90% and don't immediately come back?
Just, what's going on?
After speaking with MS support i've got an answer it is a normal behavior coming from their monitoring tool:
We reviewed our internal tools taking as starting point 12/26 and
today 12/29 and we could notice that this was majority System
processes doing background tasks, which is normal for each sandbox
environment. In your case, it was mostly MonAgentCore.exe fluctuating
in CPU which is our diagnostic log capturing process and this looks
like a very temporary spike and appears normal.
UPDATE: I've figured it out. See the end of this question.
I have an Azure App Service running four sites. One of the sites has two deployment slots in addition to the primary one. Recently I've been seeing really high CPU utilization for the App Service plan as a whole.
The dark orange line shows the CPU percentage. This is just after restarting all my sites, which brought it down to this level.
However, when I look at the CPU use reported by each site, it's really low.
The darker blue line shows the CPU time, which is basically nothing. I did this for all of my sites, and all the graphs look the same. Basically, it seems that none of my sites are causing the issue.
A couple of the sites have web jobs, so I took a look at the logs but everything is running fine there. The jobs run for a few seconds every few hours.
So my question is: how can I determine the source of this CPU utilization? Any pointers would be greatly appreciated.
UPDATE: Thanks to the replies below, I was able to get more detail into what was happening. I ended up getting what I needed from SCM / Kudu tools. You can get here by going to your web app in Azure and choosing Advanced Tools from the side nav. From the Kudu dashboard, choose Process Explorer. The value in the Total CPU Time column is not directly useful, because it's the time in seconds that the process has run since it started, which might have been minutes or days ago.
However, if you make a record of the value at intervals, you can look at the change over time, and one process might jump out at you. In my case, it was my WebJobs process. Every 60 seconds, this one process was consuming about 10 seconds of processor time, just within one environment.
The great thing about this Kudu dashboard is, if you can catch the problem while it is actually happening, you can hit the Start Profiling button and capture a diagnostic session. You can then open this up in Visual Studio and get some nice details about where the CPU time is being spent.
Just in case anyone else is seeing similar issues, I'll provide more details about my particular case. As I mentioned, my WebJobs exe was the culprit, and I found that all the CPU time was being spent in StackExchange.Redis.SocketManager, which manages connections to Azure Redis Cache. In my main web app, I create only one connection, as recommended. But Since my web jobs only run every once in a while, I was creating a new connection to Azure Redis Cache each time one ran, which apparently can lead to issues. I changed my code to create the Redis Cache connection once when the WebJob process starts up and use the existing connection when any individual WebJob runs.
Time will tell if this really fixes the issue, but I think it will. When the problem occurred, it always fit the same pattern: After a few days of running fine, my CPU would slowly ramp up over the course of about 12 hours. My thinking is that each time a WebJob ran, it created a connection object, which at first didn't produce trouble, but gradually as WebJobs ran every hour or two, cruft was building up until finally some critical threshold was met and the CPU usage would take off.
Hope this helps someone out there. Best wishes!
May be you should go to webApp scm?
%yourAppName%.scm.azurewebsites.com;
There is a page, that can show you all process, that runned now on your web app. (something like Console > Process).
Also you can go to support page (from scm right corner).
You can find some more info about your performance there, and make memory dump (not for this problem, but it useful for performance issues).
According to your description, I assumed that you could leverage the Crash Diagnoser extension to capture dump files from your Web Apps and WebJobs when the CPUs usage percentage is higher than the specific threshold to isolate this issue. For more details, you could refer to this official blog.
At certain times during the week while I'm testing my Mobile Services app I get a 503 error (Service Unavailable). It happens whether I try to call the app from localhost or live on my Azure Website. It hangs around for 10-15 minutes and then goes away on its own. It doesn't seem to be caused by anything in particular that I am doing (i.e. I have not updated any code). The 503 error occurs when I'm trying to call one of my custom APIs in my Mobile Services account. A few of the requests make it through (strangely enough) but the majority return a 503 error.
I've seen that someone had a very similar problem here (Why does Azure give me an intermittent Error 503. The service is unavailable?) without an acceptable resolution.
I am using the free version of Mobile Services but I should be no where near pushing the limits of what the free version can handle; I am the sole user of the app right now.
It will soon be time to make the service live and I'm shuddering at the thought of support calls that will come in during one of these funky states the service gets into. Any help in debugging the problem would be greatly appreciated.
EDIT:
I've narrowed this down to a database problem. I have one main query (sproc) that I use to feed data to the UI. I noticed that when I get the 503 errors the query takes about 13 seconds (when run in SSMS). When things are running "normally", the query takes less than a second.
This doesn't solve my problem though, in fact it makes it more perplexing because I am using the Business Edition of Windows Azure SQL Database and there shouldn't be a 13 second fluctuation in execution time!
This problem seems to happen randomly. Is there some kind of caching in SQL Server that could explain this? Maybe my query really does take 13 seconds to execute and the caching superficially speeds it up.
Could you try transitioning your database/server to one of the "editions"? They have resource governance to promote predictable performance. Web/Business suffer from a noisy neighbor problem. It sounds like that may be your issue, considering it is intermittent.
Here's a link to a page describing the editions. https://msdn.microsoft.com/en-us/library/azure/dn741340.aspx
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
App pool timeout for azure web sites
I am working on an asp.net mvc 4 app that is hosted in Windows Azure. This app will not have a lot of traffic as people will intermittently (once an hour) use it. I wanted to try using Windows Azure.
My app is currently set to use the FREE web site mode. I noticed that after 30 minutes, the site takes a long-time (> 5 seconds) to load. After that initial load, its fast. Then, if someone doesn't use it for another 30 minutes, it takes >5 seconds to load again.
I then tried upping the web site mode to a SHARED instance. I experienced the same problem there.
I then tried upping the web site mode to a RESERVED instance. The problem then goes away.
While I'd like to use Windows Azure, paying $50+ a month for a RESERVED instance is pretty expensive for a site that few have used up to this point. However, I can't have the initial lag. That will just defer the few users I have. You could say you get what you pay for. At the same time, I have a hard time believing others are experiencing this problem and not complaining. There has to be something I'm missing.
I figure the problem has to deal with the application pool resetting. However, I can't seem to figure a way around this. Is anyone familiar with this issue? Is there a way to fix it on a FREE or SHARED instance?
Thank you!
This is expected behavior based on how Windows Azure Web Sites work. The app pool they live in is spun up "on demand" and then hangs around for a time period.
For a detailed (and shameless plug) you can check out my article on this: http://www.simple-talk.com/dotnet/.net-framework/windows-azure-websites-%e2%80%93-a-new-hosting-model-for-windows-azure/
In summary:
Web Sites are hosted in a process on a farm of machines running IIS. If a site is idle for some time then the process is torn down automatically. Also, if the box is seeing a lot of pressure due to the other sites on the box the idle timeout may come down quite a bit (even as low as five minutes). When the next call comes in you'll see the process spun up again (likely on a completely different server). This is because you are in a shared environment (and is similar to how Heroku works). Once you move to reserved then you are the ONLY person on that virtual machine and if you suffer from noisy neighbor issues in processing its' because of your own stuff.
There are ways to keep your site "up", such as having a job that pings the url frequently; however, given that the idle timeout is somewhat fluid it may not solve every case. You can check out a recent post by Sandrino on how to use Azure Mobile Services as a job scheduler: http://fabriccontroller.net/blog/posts/job-scheduling-in-windows-azure/ . There are also 3rd party services available that can do the ping for you automatically.
To be honest, the web sites are a great feature for quick development and test, or even relatively low traffic sites as you are talking about. If you need a high level of uptime and better performance then you'll want to look at Reserved, or another option if the cost isn't in line with expectations.
This isn't an Azure problem. It is a "feature" of any web site hosted in IIS. The default time-out for app pools is 20 minutes. Read about App Pool timeouts here - http://technet.microsoft.com/en-us/library/cc771956(v=ws.10).aspx - one method is to create a keep alive page and ping the page every 10 minutes or so.