Stopping / Killing Azure Functions Running Instances on Consumption Plan

How do you kill Azure Function running instances (executions) on a Consumption Plan (previously known as the Dynamic Plan)?
I am running the Azure Function on runtime version 1.0.
A few executions (some not shown in the log in the screenshot below) were running past the FIVE MINUTES functionTimeout threshold (check the one with DOTTED status).
A few instances, however, DID get killed as expected when they reached the FIVE MINUTES threshold (check the one with CROSSED status).
What I tried:
As suggested in the SO question Stop/Kill a running Azure Function, I restarted the website hosting the Azure Function.
I even stopped / started the website just to be sure
I killed the processes from the Kudu interface, but the logs still kept showing there was a rogue instance.
Process Explorer showed 32 threads, but all of them were in WAITING status. Nothing was running from what I could observe.
Finally
I deleted the website and moved over to an App Service Plan based function, since that seems to be the only option for Azure Functions that need flexible timeouts.

This is a monitoring bug; although it looks confusing, it has no impact on the runtime behavior.
I have opened an issue to track this here and it will be updated as we make progress.
Thank you for your patience with this and for reporting the problem!

Related

azure function app stalls out during long running orchestrator

I have a long-running (Node.js) orchestrator in an Azure Function App that calls a couple hundred activity functions, sometimes with a group of 5 or so running in parallel via context.df.Task.all. I find that it will run steadily for about two hours, then the function app itself seems to abruptly stop: the logs stop displaying in the log stream, and the records that the activity functions are supposed to be writing to my database stop appearing. There are no exceptions in the logs. It will remain paused or stalled like this indefinitely... until I restart the function app. Then it will come back to life and resume where it stopped for a time, and then stop again.
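For reference, a minimal sketch of the fan-out pattern described above, using the classic durable-functions programming model (the activity names are hypothetical):

```javascript
const df = require("durable-functions");

module.exports = df.orchestrator(function* (context) {
    // Hypothetical activity producing the work items.
    const items = yield context.df.callActivity("GetWorkItems");

    const results = [];
    for (let i = 0; i < items.length; i += 5) {
        // Fan out a group of up to five activities, then wait for all of them.
        const batch = items
            .slice(i, i + 5)
            .map((item) => context.df.callActivity("ProcessItem", item));
        results.push(...(yield context.df.Task.all(batch)));
    }
    return results;
});
```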
Does this behavior sound familiar to anyone?
Should I update the extension bundle to [4.0.0, 5.0.0)?
Could my storage account be the problem? Should I create a new one?
We are using the "Premium Plan". Could I be running up against a limit of some kind? If so, what, and what should I tell the IT team to increase?
As far as I know,
Should I update the extension bundle to [4.0.0, 5.0.0)?
I believe this issue is not related to extension bundles, because bundles only govern which binding extensions, libraries, and packages are available to the Function App. The extension bundle is versioned, and each version comprises a set of supported binding extensions chosen to match the version of the Function App.
If any timeout value is defined in host.json, set it to unbounded (-1), since the function project is deployed on the Premium plan and needs a longer timeout for function executions.
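For illustration, a minimal host.json sketch combining the bundle version range quoted above with an unbounded timeout (the -1 value is only honored on Premium and Dedicated plans):

```json
{
  "version": "2.0",
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[4.0.0, 5.0.0)"
  },
  "functionTimeout": "-1"
}
```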
Could my storage account be the problem? Should I create a new one?
Instead of creating a new account, you can increase the quota of the storage account (up to the 5 PiB maximum).
If the storage account is a suspect, make sure that the function app and the storage account are in the same region to reduce latency issues.
Also, in a production environment it is better to allocate a separate storage account to each function app.
We are using the "Premium Plan". Could I be running up against a limit of some kind? If so, what, and what should I tell the IT team to increase?
Also, you mentioned in the question that the function app stalls with no executions, and that after a restart it resumes from where it paused. Microsoft documents several situations in which even long-running functions hosted on the Premium plan will stop executing, as in your scenario:
Refer to the MS Doc for more information.

Azure function timeout/fails to complete

G'day folks,
I'm having some issues with an Azure function that I'm hoping someone might be able to help with.
We have a relatively long-running process (3-4 mins) that is triggered from a Service Bus message, and we were having issues with the function execution ending without error and then attempting to re-process. The time taken for this to happen is less than all of the timeout/lock duration settings we have configured. Watching the logs (log stream, for both file system and App Insights), we see the last line of the previous execution, then it kicks straight into the next.
To determine whether it's Service Bus related, I've also tried executing the process via a blob trigger (the process uses the file as a data source anyway), but I'm seeing the same thing, except I don't see the subsequent retries.
In both scenarios I don't see anything in App Insights apart from the Trace records. I don't get an exception, or even a 'request' entry. (The function logic is all enclosed in try/catch blocks, btw.)
So my question is: is it possible to trap these scenarios so we can determine the root cause? Currently I've got nothing to go on to diagnose it. These errors don't happen when running locally.
FWIW, we've seen this issue happen during the execution of third-party libraries (MS Graph and an OpenXMLPowerTools library), as we're generating documents for upload into SharePoint. Not sure if this is relevant.
Thanking you in advance,
Tim
Maybe this is because of the plan you are using. If you're using the Consumption plan, the default timeout is 5 minutes, and you can increase it to a maximum of 10 minutes. On a Premium plan, at least 60 minutes of execution is guaranteed, and the timeout can be raised further. You can set the timeout as long as you want if you have a dedicated App Service plan.
Also try configuring the timeout of your function app, i.e. by changing the value of functionTimeout in the host.json of your function app.
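For example, a host.json sketch setting the timeout to the Consumption-plan maximum; the serviceBus block shows where the lock-renewal setting lives in the v5 Service Bus extension (the property name differs in older extension versions, so verify it against the version you use):

```json
{
  "version": "2.0",
  "functionTimeout": "00:10:00",
  "extensions": {
    "serviceBus": {
      "maxAutoLockRenewalDuration": "00:10:00"
    }
  }
}
```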
You should have a look at durable functions.
They allow long-running processes, e.g. import/export tasks.
I was able to wrap a long-running import process, which takes about 20 minutes, and run it successfully.

How are cloud services provisioned (and billed) once a new deployment is requested through the Azure REST API?

I'm using the Azure REST API to create, deploy, and start a Cloud Service (classic) (a cspkg hosted in Azure Storage) with hundreds of instances. I'm noticing that the time Azure takes to provision and start the requested instances is really heterogeneous. The first instances might start in 6-7 minutes, but the last ones might take up to 15-20 minutes, about 10 minutes longer than the first ones. So my questions are:
Is this the expected behaviour? If so, what's the logic behind this? Could I do anything to speed things up?
How is Azure billing this? Is it counting the total number of instances from the very moment the Cloud Service is deployed, or is it taking into account the specific start time of each individual instance?
UPDATE: I've been testing more scenarios, and I've found a puzzling surprise. If I replace all the processes that my Cloud Service instances should run with a simple wait of a few minutes (a .bat file running the timeout command), then all the instances start almost at the same time (about 15 seconds between the fastest and slowest instance). It was not just luck or random behaviour; I've confirmed this behaviour is repeatable, and I can't even begin to explain the root cause.
I also checked this a few weeks ago. The startup time depends on the size of the machine: if it is large, it has more resources, so the boot time is faster. Also, if there is any error or exception on startup, the VM will recycle until it can start successfully. I googled it but did not find any way to speed this up, so I don't think anything can be done about the startup time. In the background, every time you deploy something, Azure creates a Windows Server, boots it up, deploys your package onto it, and puts your web roles behind a load balancer; it takes so long because a lot of things are happening.
The billing part is also not the best for classic cloud services: you have to pay even during startup and recycling, and even when the deployment is turned off. So when you are done with your update, you should delete the VMs from your staging slot or scale it down, because you will pay for them even when they are turned off.

Azure App Service: How can I determine which process is consuming high CPU?

UPDATE: I've figured it out. See the end of this question.
I have an Azure App Service running four sites. One of the sites has two deployment slots in addition to the primary one. Recently I've been seeing really high CPU utilization for the App Service plan as a whole.
The dark orange line shows the CPU percentage. This is just after restarting all my sites, which brought it down to this level.
However, when I look at the CPU use reported by each site, it's really low.
The darker blue line shows the CPU time, which is basically nothing. I did this for all of my sites, and all the graphs look the same. Basically, it seems that none of my sites are causing the issue.
A couple of the sites have web jobs, so I took a look at the logs but everything is running fine there. The jobs run for a few seconds every few hours.
So my question is: how can I determine the source of this CPU utilization? Any pointers would be greatly appreciated.
UPDATE: Thanks to the replies below, I was able to get more detail into what was happening. I ended up getting what I needed from SCM / Kudu tools. You can get here by going to your web app in Azure and choosing Advanced Tools from the side nav. From the Kudu dashboard, choose Process Explorer. The value in the Total CPU Time column is not directly useful, because it's the time in seconds that the process has run since it started, which might have been minutes or days ago.
However, if you make a record of the value at intervals, you can look at the change over time, and one process might jump out at you. In my case, it was my WebJobs process. Every 60 seconds, this one process was consuming about 10 seconds of processor time, just within one environment.
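If you would rather script the sampling than record the values by hand, Kudu also exposes the process list over REST. A rough Node.js sketch follows; the /api/processes endpoint is Kudu's documented process API, but the href and total_cpu_time property names are assumptions to verify against your own instance's response, and the credentials are your deployment credentials:

```javascript
// Sample Kudu's process API twice and report which processes accumulated CPU time.
const KUDU = "https://yourapp.scm.azurewebsites.net"; // hypothetical site name
const AUTH = "Basic " + Buffer.from("$user:password").toString("base64");

async function snapshot() {
    const res = await fetch(`${KUDU}/api/processes`, { headers: { Authorization: AUTH } });
    const out = {};
    for (const p of await res.json()) {
        // Each list entry links to a detail record; field names assumed, verify them.
        const detail = await (await fetch(p.href, { headers: { Authorization: AUTH } })).json();
        out[`${p.id}:${p.name}`] = detail.total_cpu_time;
    }
    return out;
}

(async () => {
    const before = await snapshot();
    await new Promise((r) => setTimeout(r, 60_000)); // one-minute interval
    const after = await snapshot();
    for (const key of Object.keys(after)) {
        if (before[key] !== after[key]) {
            console.log(key, before[key], "->", after[key]); // CPU time moved
        }
    }
})();
```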
The great thing about this Kudu dashboard is, if you can catch the problem while it is actually happening, you can hit the Start Profiling button and capture a diagnostic session. You can then open this up in Visual Studio and get some nice details about where the CPU time is being spent.
Just in case anyone else is seeing similar issues, I'll provide more details about my particular case. As I mentioned, my WebJobs exe was the culprit, and I found that all the CPU time was being spent in StackExchange.Redis.SocketManager, which manages connections to Azure Redis Cache. In my main web app, I create only one connection, as recommended. But since my WebJobs only run every once in a while, I was creating a new connection to Azure Redis Cache each time one ran, which apparently can lead to issues. I changed my code to create the Redis Cache connection once when the WebJob process starts up and to use the existing connection when any individual WebJob runs.
Time will tell if this really fixes the issue, but I think it will. When the problem occurred, it always fit the same pattern: After a few days of running fine, my CPU would slowly ramp up over the course of about 12 hours. My thinking is that each time a WebJob ran, it created a connection object, which at first didn't produce trouble, but gradually as WebJobs ran every hour or two, cruft was building up until finally some critical threshold was met and the CPU usage would take off.
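Since the original WebJobs code isn't shown, here is the shape of that connection-reuse fix sketched in Node.js (the actual fix was C# with StackExchange.Redis; ioredis is just an illustrative client choice):

```javascript
const Redis = require("ioredis"); // illustrative client, not the original StackExchange.Redis

// Create the connection once per process and reuse it for every job run,
// instead of opening a new connection each time a job fires.
let connection = null;
function getConnection() {
    if (!connection) {
        connection = new Redis(process.env.REDIS_CONNECTION_STRING);
    }
    return connection;
}

async function runJob() {
    const redis = getConnection(); // reused across runs; no per-run connect
    await redis.set("lastRun", new Date().toISOString());
}
```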
Hope this helps someone out there. Best wishes!
Maybe you should go to the web app's SCM (Kudu) site:
%yourAppName%.scm.azurewebsites.net
There is a page there that shows all processes currently running on your web app (something like Console > Process Explorer).
You can also open the support page (from the top-right corner of the SCM site).
You can find some more info about your performance there, and take a memory dump (not for this problem, but it is useful for performance issues).
According to your description, I assume you could leverage the Crash Diagnoser extension to capture dump files from your Web Apps and WebJobs when the CPU usage percentage is higher than a specific threshold, to isolate this issue. For more details, you could refer to this official blog.

Does Azure force-kill processes by itself? My Node.js/Java/JMeter processes are force-killed

I am using Windows Azure for a performance test across about 8 nodes, each running a different application. Since it's a performance test, we generate quite a bit of traffic.
The test was running just fine for a few hours. Then we suddenly realised a few of the applications, including Node.js, JMeter, and even Java processes, had been force-killed, each at a different time.
We find nothing in the logs that indicates an out-of-memory condition or any other application error. And this happens pretty often, once every few hours. For example, we have seen JMeter shut down once every 3-4 hours, and once it happened after 10 hours of continuous running.
So we suspect Azure is using root permissions to force-kill the above processes.
Did any of you notice this with your applications on Azure, and do you know why?
Short answer: no, Azure does not kill your processes. There is no such thing as 'root permissions' to kill specific processes.
Are you running an IaaS VM or a PaaS Web/Worker Role? For PaaS, check out http://blogs.msdn.com/b/kwill/archive/2013/08/09/windows-azure-paas-compute-diagnostics-data.aspx for where to start getting diagnostic data. For IaaS, troubleshoot it like you would on-prem (DebugDiag, WinDBG, procmon, Application/System event logs, etc.), since there is really nothing specific about Azure that would cause this behavior.
