Azure Durable Function is crashing out

I have an Azure Durable Functions app with Event Grid as the trigger, pointed at blob storage.
There are 8 activity functions and 1 orchestrator.
Based on the file type received, one of the activity functions is executed.
However, I keep getting the crash message shown in the image.
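
For reference, the dispatch pattern described above looks roughly like the following minimal sketch (assuming the in-process .NET Durable Functions extension; the function and activity names are placeholders, not the app's actual functions):

    // Minimal sketch of the setup described above (names are placeholders).
    // Assumes Microsoft.Azure.WebJobs.Extensions.DurableTask (in-process model).
    using System.IO;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;

    public static class FileRoutingOrchestration
    {
        [FunctionName("RouteByFileType")]
        public static async Task RunOrchestrator(
            [OrchestrationTrigger] IDurableOrchestrationContext context)
        {
            // The Event Grid-triggered starter function passes the blob URL as input.
            string blobUrl = context.GetInput<string>();
            string extension = Path.GetExtension(blobUrl)?.ToLowerInvariant();

            // One of the activity functions is chosen based on the file type.
            string activityName = extension switch
            {
                ".csv"  => "ProcessCsv",
                ".json" => "ProcessJson",
                ".xml"  => "ProcessXml",
                _       => "ProcessOther"
            };

            await context.CallActivityAsync(activityName, blobUrl);
        }
    }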

The error message you have shared indicates that the function failed with "System.ExecutionEngineException".
Generally, System.ExecutionEngineException is thrown when the CLR detects that something has gone horribly wrong.
This can happen a considerable time after the problem actually occurred, because the exception is usually the result of corruption of internal data structures: the CLR discovers that something has got into a state that makes no sense, and it throws an uncatchable exception because it is not safe to proceed.
Moreover, the stack trace in your screenshot points to DurableTask.AzureStorage.TimeoutHandler+<ExecuteWithTimeout.
You can use the memory dump generated by the Proactive Crash Monitoring tool to identify the function crash and the associated crashing thread's call stack.
Please create a technical support ticket, where the technical support team can help you troubleshoot the issue, or open a discussion on the Microsoft Q&A community.

Microsoft is aware of this, and the behavior is currently by design. It shouldn't affect your apps. https://github.com/Azure/azure-functions-durable-extension/issues/1965#issuecomment-931637193

Related

ASP.NET WebApp in Azure using lots of CPU

We have a long running ASP.NET WebApp in Azure which has no real endpoints exposed – it serves a single functional purpose primarily reading and manipulating database data, effectively a batched, scheduled task, triggered by a timer every 30 seconds.
The app runs fine most of the time, but we are seeing occasional issues where the CPU load for the app jumps close to the maximum for the App Service Plan, instantaneously rather than gradually, and the app stops executing any more timer triggers. We cannot find anything in the executing code to account for it (no signs of deadlocks etc., and all code paths have try/catch, so there should be no unhandled exceptions). More often than not we see errors getting a connection to a database, but it's not clear whether those are a cause or a symptom.
Note, this is the only resource within the AppService Plan. The Azure SQL database is in the same region and whilst utilised by other apps is very lightly used by them and they also exhibit none of the issues seen by the problem app.
It feels like this is infrastructure related but we have been unable to find anything to explain what is happening so if anyone has any suggestions for where we should be looking they would be gratefully received. We have enabled basic Application Insights (not SDK) but other than seeing CPU load spike prior to loss of app response there is little information of interest given our limited knowledge of how to best utilise Insights.
Based on your description, there are two things to check. First, track the program's running status through code: write a log entry at the beginning and end of each batch scheduled task run to record its status. If possible, record request and response information as well as start and end times, so the duration and outcome of each run are fully captured.
Second, log before the program starts database operations and whether the database connection succeeded. Ideally, record which operation is running when the CPU load spikes and track the specific conditions, so you can analyze what causes the database connection failures.
Because you cannot reproduce the problem, you can only guess at the cause. If the two points above still don't reveal where the problem is, adjust your timer so the task runs once every 5 minutes instead of every 30 seconds.
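
As a rough illustration of the instrumentation suggested above (a sketch only; the logger, connection string, and DoBatchWork are placeholders for whatever the app actually uses):

    // Sketch of the suggested logging around the 30-second batch task.
    // ILogger, the connection string, and DoBatchWork are placeholders.
    using System;
    using System.Data.SqlClient;
    using System.Diagnostics;
    using Microsoft.Extensions.Logging;

    public class BatchJob
    {
        private readonly ILogger<BatchJob> _logger;
        private readonly string _connectionString;

        public BatchJob(ILogger<BatchJob> logger, string connectionString)
        {
            _logger = logger;
            _connectionString = connectionString;
        }

        public void Run()
        {
            var runId = Guid.NewGuid();
            var stopwatch = Stopwatch.StartNew();
            _logger.LogInformation("Batch run {RunId} started", runId);

            try
            {
                using (var connection = new SqlConnection(_connectionString))
                {
                    connection.Open();
                    _logger.LogInformation("Batch run {RunId}: database connection opened", runId);

                    DoBatchWork(connection); // the existing read/manipulate logic
                }
            }
            catch (Exception ex)
            {
                // Log and rethrow so failures are visible rather than silently swallowed.
                _logger.LogError(ex, "Batch run {RunId} failed after {Elapsed}", runId, stopwatch.Elapsed);
                throw;
            }

            _logger.LogInformation("Batch run {RunId} completed in {Elapsed}", runId, stopwatch.Elapsed);
        }

        private void DoBatchWork(SqlConnection connection)
        {
            // Placeholder for the app's real database work.
        }
    }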

Azure AppService crashed with no details

My .NET app hosted in Azure App Service unexpectedly crashed 4 times yesterday and I'm struggling to get details on why it went down. The report "Diagnose and solve problems\Application Crashes" indicates that StackOverflow exceptions were the cause of the crashes, but I'm looking for more details (URI or stack dump). Here are the things I've tried, and I've come up empty:
Event log: I used the Kudu app to get the event log (/api/vfs/LogFiles/eventlog.xml) and there are no details on the StackOverflow exceptions. In fact there are no matches on "stack", "overflow", or "recursion".
NLog files: The NLog files just abruptly terminate when these crashes occurred, so no details are captured.
Azure App Insights: This too has no exceptions logged during the outage windows. There are some exceptions before and after, but nothing about the details of the stack overflow.
App Service utilization: The memory and CPU utilization were running within normal limits (40-70%) before the crashes.
Lastly, the app hasn't been updated for weeks, so the likelihood of new functionality causing this is low. In any case, I would need to know where to look as it's a fairly complex app.
Any tips to diagnose this issue would be really appreciated.
Thanks
You could isolate the issue by running the app locally with the latest changes.
You may capture a memory dump to identify whether a particular line in the code is causing the crash (typically an oversized array or a recursive loop). Kindly take a look at the blog for the steps.
Kindly let us know the status with more details on the issue, we would be very glad to assist you further.
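
Purely as a hypothetical illustration of the "recursive loop" case a dump often reveals (not code from the app in question): unbounded recursion terminates the process with an uncatchable StackOverflowException, while a depth guard turns it into an ordinary exception you can log:

    // Hypothetical example only; Node, CountNodes, and the depth limit are made up.
    using System;

    public class Node
    {
        public Node Child { get; set; }
    }

    public static class TreeWalker
    {
        // Unbounded recursion like this terminates the process with a
        // StackOverflowException that cannot be caught in .NET.
        public static int CountNodes(Node node)
        {
            if (node == null) return 0;
            return 1 + CountNodes(node.Child);   // crashes if the data contains a cycle
        }

        // A depth guard turns the crash into an ordinary, catchable exception.
        public static int CountNodesSafe(Node node, int depth = 0, int maxDepth = 10_000)
        {
            if (node == null) return 0;
            if (depth > maxDepth)
                throw new InvalidOperationException("Recursion too deep - possible cycle in the data.");
            return 1 + CountNodesSafe(node.Child, depth + 1, maxDepth);
        }
    }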

Azure Functions - Logging Consolidation - Controlling the host log?

I run a system on top of a bunch of Azure Functions and I'm just tidying up some last threads. I mostly abandoned the logging provided out of the box by Azure Functions because I found the flush timings to be super irregular, and I also wanted to consolidate the logs from all of my functions into one spot and be able to query them. This all works for the most part, but I have one annoying use case remaining: if a function binding is faulty (e.g. the Azure Function method signature is wrong because someone checked garbage into Git), the function won't be invoked, the function's own log won't be written either, and the error will instead be placed into a different file (the host log).
Now I guess I can just access the storage account that backs the Azure Function and pull the host log from there, but I was wondering if there is a better means of directly controlling/intercepting the logging in Azure Functions. Does anyone know if there is at least a way of getting it to flush more quickly?
You can see host logs as well as function logs in associated Application Insights:
https://learn.microsoft.com/en-us/azure/azure-functions/functions-monitoring#other-categories

Azure ServiceBus Queue Timing Out

Encountering a strange issue with one of our queues (for production, no less). When I try to put a message onto the queue, it's throwing an exception that simply states:
A timeout has occurred during the operation
The messages do seem to be making it onto the queue, as evidenced by the fact that I can see the queue length increasing in the management portal. However, the client application is not receiving any messages.
The management portal shows that there have been several failed requests, and also several internal server exceptions; though unfortunately I don't see any way to get more details about those failed requests and errors.
I'm somewhat at a loss as to what may have caused this, how to get more information about what's wrong, and how to move ahead in troubleshooting this. Any help would be greatly appreciated.
Edit: I should mention, just for completeness' sake, that I did not make any changes to the clients that I'm aware of; this issue just sort of started happening all of a sudden.
Edit #2: I woke up this morning and things have magically returned to normal. Still not sure what happened, so I'd like to change the tone of the question to solicit suggestions as to how this kind of thing may be mitigated and/or troubleshooted (troubleshot? troubleshat? :) ) better.
I have experienced this scenario too. When I created a new Service Bus namespace and pointed my app to this new namespace, it worked for me. This suggests that it might be some hardware failure going on (on the node where your sb-namespace resides).
Be sure to use transient failure handling, for example http://www.nuget.org/packages/EnterpriseLibrary.WindowsAzure.TransientFaultHandling/
But a "second level retry" might also be required for errors that are not transient. This you have to code yourself; see the sketch below.
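
A hand-rolled sketch of what such a second-level retry could look like (the delegate, exception type, delay, and attempt count are placeholders to tune for your scenario; the transient-fault-handling policy above still handles the short, first-level retries):

    // Sketch of a "second level retry" wrapper, layered on top of the
    // transient-fault-handling retry policy mentioned above.
    using System;
    using System.Threading;

    public static class SecondLevelRetry
    {
        public static void Execute(Action sendMessage, int maxAttempts = 3, TimeSpan? delay = null)
        {
            var wait = delay ?? TimeSpan.FromSeconds(30);

            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    // sendMessage would wrap the transient policy, e.g.
                    // retryPolicy.ExecuteAction(() => queueClient.Send(message)).
                    sendMessage();
                    return;
                }
                catch (TimeoutException)
                {
                    // The transient retries already failed; give up after maxAttempts,
                    // otherwise back off for much longer before trying again (or hand
                    // the message to a fallback path such as a paired namespace).
                    if (attempt >= maxAttempts) throw;
                    Thread.Sleep(wait);
                }
            }
        }
    }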
To be more fault tolerant you can also use the new feature of paired namespaces. Here is a good resource: http://msdn.microsoft.com/en-us/library/dn292562.aspx
Hth
//Peter

Azure Diagnostic Configuration from Separate Assembly

We are developing several Azure-based applications in C# and are attempting to centralize some common code in a utility library. One of the common functions is Diagnostic monitoring setup.
We created a class that simplifies the configuration of diag collection, log transfer, etc.
The main issue we are facing is that when we run our code with the class living in a different assembly from the WebRole or WorkerRole, the diagnostic information is never collected and transferred to Azure table storage. If we move the class into the same project as the Web/Worker role, then everything works as expected.
Is there something that either the DiagnosticMonitor.GetDefaultInitialConfiguration(); or the DiagnosticMonitor.Start(StorageConnectionStringKey, _diagConfig); doesn't like about being in another assembly? I'm stumped!
Any insight would be appreciated.
Thanks,
Matt
Which part is not working here? Trace Logs not getting transferred? That seems to be the one that most people have issues with.
We do something similar and have no issues. Typically when you don't see stuff getting transferred it's because the current process where the listener is getting configured is not always the same one where tracing occurs (especially when dynamically adding to trace listener collection). Notably, a lot of users find this issue with web apps in Windows Azure.
What are you expecting to see transferred? Perf counters? Traces? Event Logs? etc.
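
For reference, a shared-assembly setup along the lines described in the question might look like this minimal sketch (class and setting names are placeholders; the DiagnosticMonitor calls are the ones named in the question). The key point from the answer above is that this must run in the same process where the tracing actually happens:

    // Minimal sketch of a shared diagnostics helper (names and values are placeholders).
    using System;
    using Microsoft.WindowsAzure.Diagnostics;
    using Microsoft.WindowsAzure.ServiceRuntime;

    public static class DiagnosticsBootstrapper
    {
        public static void Configure(string storageConnectionStringKey)
        {
            var config = DiagnosticMonitor.GetDefaultInitialConfiguration();

            // Transfer trace logs to table storage every minute.
            config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);
            config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Verbose;

            DiagnosticMonitor.Start(storageConnectionStringKey, config);
        }
    }

    // Called from the role entry point. Note that for a WebRole under full IIS,
    // OnStart runs in the role host process, not in w3wp where the web app's
    // Trace.WriteLine calls happen - which is the mismatch described above.
    public class WorkerRole : RoleEntryPoint
    {
        public override bool OnStart()
        {
            DiagnosticsBootstrapper.Configure(
                "Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString");
            return base.OnStart();
        }
    }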
