Where are the logs and memory dumps of Azure Function crashes? - azure

Sometimes my Azure Function fails and I have no record of what happened. The function just stops executing.
I suspect a fatal error such as a stack overflow, but since there is no record of it I can't be sure.
I created a simple Azure Function to reproduce a stack overflow:
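public static async Task<IActionResult> Run(
    [HttpTrigger(AuthorizationLevel.Function, "get", "post", Route = null)] HttpRequest req,
    ILogger log, ExecutionContext executionContext)
{
    // Never returns normally: the unbounded recursion below overflows the stack.
    RunStackOverflow();
    return new OkResult(); // unreachable in practice, but required for the method to compile
}
private static void RunStackOverflow()
{
    // Unbounded recursion; the resulting StackOverflowException cannot be caught and kills the process.
    RunStackOverflow();
}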
When I call this via the HTTP trigger, I get a 502 error in the browser, but there is nothing in the logs about the failure. Screenshot: https://www.screencast.com/t/ymWoBey4KX
A stack overflow is just one of the exceptions that can't be caught and can crash the function. When I run the function locally in the emulator, I see the stack overflow error in the cmd window where the function host starts. Screenshot: https://www.screencast.com/t/f85U2KmdEBBt
In the Azure portal I checked:
function invocations (screenshot: https://www.screencast.com/t/ufB1Zfthz)
function logs (screenshot: https://www.screencast.com/t/A2ix6yuSuJkE)
app insights (screenshot: https://www.screencast.com/t/NyRFLDK23p)
But there is no log entry of this crash anywhere.
I contacted Azure support, but they are not very helpful so far.
Update on Apr 12
Using Kudu I can create a memory dump with a command like this:
c:\devtools\sysinternals\procdump -e -ma -w 12268
This gives me the stack traces for all threads, which is what I need, but I need the dump to be taken at the moment a first-chance exception occurs.
The command to trigger a memory dump when such an exception occurs is:
c:\devtools\sysinternals\procdump -accepteula -e -g -ma 8844
But when I run it and then trigger the stack overflow, this is what is written to the command line:
[11:37:36] Exception: E0434352.CLR
[11:37:36] Exception: C00000FD.STACK_OVERFLOW <--- Stack overflow
[11:37:37] The process has exited.
[11:37:37] Dump count not reached.
Unfortunately, no memory dump is created, so I can't see the stack trace that caused the stack overflow.
I also tried:
c:\devtools\sysinternals\procdump -accepteula -e -g -ma -t 13244
The -t option triggers a memory dump when the process exits.
This one actually records a memory dump when the function crashes. Unfortunately, the dump doesn't include the stack trace for the stack overflow. It seems to be taken after the thread has already crashed.
Update on Apr 21
There are multiple ways to host Azure Functions, described here:
https://learn.microsoft.com/en-us/azure/azure-functions/functions-scale
The most common, and the default, is the Consumption plan. After some trial and error I found that the Diagnostic Tools (https://www.screencast.com/t/DyT6Jpuqm2uo), which can be used to detect and analyze crashes, are not available on the Consumption plan. They are available on App Service plans (Basic and above) and on other plans. Azure support told me that there are currently no plans to add them to the Consumption plan.
So for now I created a new Azure Function hosted on an App Service plan, and I was able to use the Diagnostic Tools to record crash dumps. After fixing the issues I plan to go back to the Consumption plan, so it is a bit of a hack, but it works for now.

Currently, this level of logging is not well supported.
You can use the Diagnose and solve problems option in the Azure portal by following this link, but note that some features in this option (such as Application Crashes) are still in development.
Steps:
1. In the Azure portal -> your function app -> click Diagnose and solve problems -> then click the Function App Down or Reporting Errors link. Here is the screenshot:
2. Wait for the report to finish generating -> then check the items marked with a red exclamation mark. With your code, the error details are in the Web App Restarted item, but it only shows a generic message such as "app crashes", not the stack overflow:

Related

Stopping the listener 'Microsoft.Azure.WebJobs.Host.Blobs.Listeners.BlobListener' for function <XXX>

I have a function app with one HttpTrigger function and three BlobTrigger functions. After I deployed it, the HTTP trigger works fine, but the blob-triggered functions give the following errors:
"Stopping the listener 'Microsoft.Azure.WebJobs.Host.Blobs.Listeners.BlobListener' for function" for one function
"Stopping the listener 'Microsoft.Azure.WebJobs.Host.Listeners.CompositeListener' for function" for the other two
I compared with other environments and the config values are the same or similar, so I'm not sure why we get this issue in one environment only. I am using the Consumption plan.
Update: When a file is placed in the blob container, the function is not triggered.
Stopping the listener 'Microsoft.Azure.WebJobs.Host.Blobs.Listeners.BlobListener' for function
I observed the same message when working with an Azure Functions queue trigger:
This message doesn't indicate an error in the function. It appears in App Insights > Traces when the function has been inactive long enough to time out.
I stopped sending messages to the queue for some time and then saw traces such as Web Job Host Stopped. If you run the function again, or there is continuous activity in the function, this message does not appear in the traces.
If you are using Elastic Premium with VNET integration, non-HTTP triggers need Runtime scale monitoring enabled.
Go to Function App --> Configuration --> Function runtime settings and turn on Runtime scale monitoring.
If the function app and the storage account that holds the function's metadata are connected via Private Link, you will need to add the app setting WEBSITE_CONTENTOVERVNET = 1.
Also, make sure you have private links for blob, file, table and queue on the storage account.
I opened a ticket with Microsoft to fix this issue. After analysis I made the following code changes:
The function was async but returned void, so I changed it to return Task.
For the trigger I was using a connection string from app settings. I changed it to AzureWebJobsStorage (even though both were the same) in the function trigger attribute parameter.
It started working, so I'm posting here in case it is helpful for others.

Azure Functions is not logging all the traces in AppInsights

I have an Azure Function App with multiple functions connected to Application Insights.
For some reason that I don't understand, some requests and traces are occasionally lost, as if they never happened, even though I can see the data in our DB and in other systems.
Here is a new function with just one call; in the Azure Function dashboard I can see the log:
But in Application Insights, when I search for the logs of the trace or the request, no information is retrieved.
This doesn't happen every time, but it isn't the first time I've seen this issue. I can see the logs for other requests, but I don't know why logs are sometimes lost.
Azure function info:
Runtime Version: 3
Stack: NodeJS
Have you configured sampling? This can appear as data loss.
You can control it as follows, as per the documentation:
const appInsights = require("applicationinsights");
appInsights.setup("<instrumentation_key>");
appInsights.defaultClient.config.samplingPercentage = 33; // 33% of all telemetry will be sent to Application Insights
appInsights.start();
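To illustrate why sampling can look like data loss, here is a minimal, self-contained sketch. It is not the Application Insights SDK's actual algorithm; hashToPercent, shouldSend and countSent are hypothetical names made up for this example. The idea it shows is that a sampler makes a keep-or-drop decision per operation, so at 33% most operations leave no trace at all, while at 100% nothing is dropped.

```javascript
// Illustrative only: a deterministic, hash-based sampler.
// NOT the Application Insights SDK implementation; all names are hypothetical.

// Map an operation id to a stable number in the range [0, 100).
function hashToPercent(operationId) {
  let hash = 5381;
  for (let i = 0; i < operationId.length; i++) {
    // djb2-style string hash, kept as an unsigned 32-bit integer.
    hash = ((hash * 33) ^ operationId.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

// Keep the item only if its hash falls below the configured percentage.
function shouldSend(operationId, samplingPercentage) {
  return hashToPercent(operationId) < samplingPercentage;
}

// Count how many of `total` synthetic operations would be sent.
function countSent(total, samplingPercentage) {
  let sent = 0;
  for (let i = 0; i < total; i++) {
    if (shouldSend("operation-" + i, samplingPercentage)) {
      sent++;
    }
  }
  return sent;
}

console.log("sent at 100%:", countSent(1000, 100));
console.log("sent at 33%:", countSent(1000, 33));
```

If the missing requests matter more than ingestion volume, setting samplingPercentage to 100 in the snippet from the documentation above is the simplest way to check whether sampling is the cause.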

No exceptions or stack traces in Azure Application Insights

I have an ASP.NET Core 3.1 solution deployed into an Azure Web App hooked up to Application Insights. I can't for the life of me get exceptions and stack traces to log into Application Insights, instead I get a basic request trace with no exception information attached:
I've tried most combinations of setting up logging/application insights telemetry, here are some of the things I've tried:
services.AddApplicationInsightsTelemetry(); in the ConfigureServices() method of Startup.cs
Adding logging.AddApplicationInsights(); to my logging builder in Program.cs
Removing the custom error page exception handler in case that was affecting things
I have the APPINSIGHTS_INSTRUMENTATIONKEY environment variable set on my Web App in Azure.
I'm using the following code to generate exceptions in Application Insights:
[AllowAnonymous]
[Route("autoupdate")]
public async Task<IActionResult> ProfileWebhook()
{
var formData = await this.Request.ReadFormAsync();
var config = TelemetryConfiguration.CreateDefault();
var client = new TelemetryClient(config);
client.TrackException(new Exception(string.Join("~", formData.Keys)));
logger.LogError(new Exception(string.Join("~", formData.Keys)), "Fail");
throw new Exception(string.Join("~", formData.Keys));
}
Nothing is working and I'm going crazy! Any help greatly appreciated.
Usually, Application Insights telemetry of all kinds (exceptions, traces, events, etc.) arrives within about 5 minutes; please refer to this doc: How long does it take for telemetry to be collected?. There is still a small chance that it will take longer due to a backend issue.
If you're using Visual Studio, you can check whether the telemetry was sent via Application Insights Search.
You can also check whether you're using the correct instrumentation key, and whether you have enabled sampling.
If the behavior persists on your side, you should consider contacting Microsoft support to find the root cause.
Hope it helps.

Stackdriver-trace on Google Cloud Run failing, while working fine on localhost

I have a Node server running on Google Cloud Run. Now I want to enable Stackdriver tracing. When I run the service locally, I am able to get the traces in GCP. However, when I run the service on Google Cloud Run, I get an error:
"@google-cloud/trace-agent ERROR TraceWriter#publish: Received error with status code 403 while publishing traces to cloudtrace.googleapis.com: Error: The request is missing a valid API key."
I made sure that the service account has tracing agent role.
First line in my app.js
require('@google-cloud/trace-agent').start();
When running locally I use a .env file containing
GOOGLE_APPLICATION_CREDENTIALS=<path to credentials.json>
According to https://github.com/googleapis/cloud-trace-nodejs, "These values are auto-detected if the application is running on Google Cloud Platform", so I don't have these credentials on the GCP image.
There are two challenges to using this library with Cloud Run:
Despite the note about auto-detection, Cloud Run is an exception. It is not yet autodetected. This can be addressed for now with some explicit configuration.
Because Cloud Run services only have resources until they respond to a request, queued-up trace data may not be sent before CPU resources are withdrawn. This can be addressed for now by configuring the trace agent to flush ASAP:
const tracer = require('@google-cloud/trace-agent').start({
  serviceContext: {
    // K_SERVICE and K_REVISION are environment variables set by Cloud Run
    service: process.env.K_SERVICE || "unknown-service",
    version: process.env.K_REVISION || "unknown-revision"
  },
  // flush buffered traces after 1 second instead of the longer default
  flushDelaySeconds: 1,
});
On a quick review I couldn't see how to trigger the trace flush, but the shorter timeout should help avoid some delays in seeing the trace data appear in Stackdriver.
EDIT: While nice in theory, in practice there are still significant race conditions with CPU withdrawal. Filed https://github.com/googleapis/cloud-trace-nodejs/issues/1161 to see if we can find a more consistent solution.

Azure webjob - QueueTrigger stops triggering

I am running an Azure WebJobs SDK console application (continuous) with the recommended setup:
public static void ProcessQueueMessage([QueueTrigger("logqueue")] string logMessage, TextWriter logger)
The azure queue I am running against has ~6000 messages in it and I am running the web-job locally, as a console application.
The problem I'm having is that the processing randomly stops after processing between zero and ~30 messages. The console stays open, but no more console messages are displayed.
For example, it might just process 2 messages:
Executing: 'Functions.ProcessQueueMessage' - Reason: 'New queue message detected on 'QueueName'.'
Executed: 'Functions.ProcessQueueMessage' (Succeeded)
Executing: 'Functions.ProcessQueueMessage' - Reason: 'New queue message detected on 'QueueName'.'
Executed: 'Functions.ProcessQueueMessage' (Succeeded)
And then, nothing. There doesn't seem to be anything wrong with my internet connection and I can't trace the issues down to any particular messages.
Has anyone else had issues with this SDK?
Update:
I made sure that I was using the right versions of all of the dependencies by removing the NuGet packages and then re-running install-package Microsoft.Azure.WebJobs. I am now using WebJobs version 1.1.0, which has pulled in version 4.3 of Azure Storage.
As recommended by Matthew, I pulled down the source code for Azure WebJobs to determine where the process freezes. Once the freeze occurred, I paused execution and checked the running threads, and found what I believe is the culprit within Microsoft.Azure.WebJobs.Host.CompositeTraceWriter:
protected virtual void InvokeTextWriter(TraceEvent traceEvent)
{
    if (_innerTextWriter != null)
    {
        string message = traceEvent.Message;
        if (!string.IsNullOrEmpty(message) &&
            message.EndsWith("\r\n", StringComparison.OrdinalIgnoreCase))
        {
            // remove any terminating return+line feed, since we're
            // calling WriteLine below
            message = message.Substring(0, message.Length - 2);
        }
        _innerTextWriter.WriteLine(message);
        if (traceEvent.Exception != null)
        {
            _innerTextWriter.WriteLine(traceEvent.Exception.ToDetails());
        }
    }
}
The line it freezes on is line 66 : _innerTextWriter.WriteLine(message);
_innerTextWriter is an instance of System.IO.TextWriter.SyncTextWriter
Is it possible there is some deadlock issue with this class or the way it is being used?
Some notes:
I am running in the debugger, so in this case I believe the textwriter is forwarding to the console internally
I have my batchsize set to 1 via config.Queues.BatchSize = 1;, not sure if that could matter
I'm currently working on setting up an environment on another computer so that I can see if it is reproducible somewhere other than this machine (surface book).
Update
The issue was me not understanding how the new Windows 10 command prompt works. Any time you click on the command window, it goes into "select" mode, which completely pauses execution of the process.
Basically: https://superuser.com/questions/419717/windows-command-prompt-freezing-randomly?newreg=ece53f5584254346be68f85d1fd2f18d
You can tell it is in this state because it will prefix the window title with the word "Select":
You have to press enter or click again to get it going once again.
So, two final comments:
1) What an incredibly confusing and un-intuitive behavior for a command window!
2) I hope some admin will come take pity on the shame I have brought upon myself and my family by deleting this question.
To get rid of this strange behavior, you can disable QuickEdit mode:
Strange. When it is in this stuck state, can you try adding a new queue message to the queue and see if that triggers? Are you sure your function isn't hanging internally? What version of the SDK are you using? You might also try upgrading to v1.1.0, which we just released last week. If there really are a bunch of messages in the queue waiting to be processed, I can't think of anything that would cause this. The queue listener in the SDK should chug along, reading batches of messages in parallel and dispatching them to your function. Have you changed any of the JobHostConfiguration.Queues configuration knobs? You haven't force-updated the Azure SDK to a version higher than the WebJobs SDK supports, have you?
Another option if you can't figure this out might be to clone the SDK, build it and debug it locally. The repo is here. The main queue processing loop is here.
