Azure alert on Azure Functions "Failed" metric is triggering with no apparent failures

I want an Azure Alert to trigger when a certain function app fails. I set it up as a GTE 1 threshold on the [function name] Failed metric, thinking that would yield the expected result. However, when the function runs daily I get notifications that the alert fired, yet I cannot find anything in Application Insights to indicate a failure; the function appears to run successfully and complete.
Here is the triggered alert summary:
Here is the invocation monitoring from the portal showing that same function over the past few days with no failures:
And here is an application insights search over that time period showing no exceptions and all successful dependency actions:
The question is: what could be causing an Azure Functions Failed metric to register non-zero values without any corresponding telemetry in Application Insights?
Update - here is the alert configuration
And the specific condition settings:
Failures blade for wider time range:
There are some dependency failures from a blob 404, but I think those come from a different function that explicitly checks whether blobs exist at certain paths to decide which files to download from an external source. Also, the timestamps don't fall within the sample period.
No exceptions:

Per @ivan-yang's comment on the question, I have switched the alerting to use a custom log search instead of the built-in Azure Functions metric. At this point that metric seems pretty opaque about what triggers it, and it fired every day when I ran the Azure Function with no apparent underlying failure. I plan to avoid this metric from now on.
My log-based alert uses the following query for now to get what I was looking for (an exception happened or a function failed):
requests
| where success == false
| union (exceptions)
| order by timestamp desc
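If the alert should only consider the one function in question, the same query can be scoped by role and operation name. This is just a sketch; <FUNCTION_APP_NAME> and <FUNCTION_NAME> are placeholders for your own app and function names:
requests
| where success == false
// placeholders below: substitute your own function app and function names
| where cloud_RoleName =~ '<FUNCTION_APP_NAME>' and name == '<FUNCTION_NAME>'
| union (exceptions
| where cloud_RoleName =~ '<FUNCTION_APP_NAME>' and operation_Name == '<FUNCTION_NAME>')
| order by timestamp desc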
Thanks to @ivan-yang and @krishnendughosh-msft for the help.

Related

Azure Monitor alert is sending false failure email notifications with a failure count of 1 for a Functions app

I have a Functions app where I've configured signal logic to send me an alert whenever the failure count is greater than or equal to one in my application. I have been getting emails every day saying my Azure Monitor alert was triggered, followed by a later email saying that the failure was resolved. I know that my app didn't fail, because I checked in Application Insights. For instance, I did not have a failure today, but I did have failures the prior two days:
However, I did receive a failure email today. If I go to configure the signal logic, where I set a static threshold of failure count greater than or equal to 1, it shows this:
Why is it showing a failure for today, when I know that isn't true from the Application Insights logs? Also, if I change the signal logic to look at total failures instead of count of failures, it looks correct:
I've decided to use the total failures metric instead, but it seems that the count functionality is broken.
Edit:
Additional screenshot:
I suggest using Custom log search as the signal if you have already connected your function app with Application Insights (I use this kind of signal and don't see behavior like yours).
The steps are as follows:
Step 1: For signal, please select Custom log search. The screenshot is as below:
Step 2: When the Azure function times out, it throws an error of type Microsoft.Azure.WebJobs.Host.FunctionTimeoutException, so you can use the query below to check whether it timed out:
exceptions
| where type == "Microsoft.Azure.WebJobs.Host.FunctionTimeoutException"
Put the above query in the "Search query" field, and configure other settings as per your need. The screenshot is as below:
Then configure other settings, like the action group, etc. Please let me know if you still have this issue.
One thing should be noted: some kinds of triggers support retry logic, like the blob trigger. So if it retries, you may also receive the alert email. You can disable the retry logic as per this doc.
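If the same alert should also catch non-timeout failures for the app, the timeout query can be unioned with failed requests, in the same spirit as the query shown earlier in this thread. A rough sketch, with <FUNCTION_APP_NAME> as a placeholder:
exceptions
| where type == "Microsoft.Azure.WebJobs.Host.FunctionTimeoutException"
| union (requests | where success == false)
| where cloud_RoleName =~ '<FUNCTION_APP_NAME>'   // placeholder: restrict to the function app you care about
| order by timestamp desc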

Application Insights Alert Trigger History

We are using Application Insights to send our metrics. Based on these metrics we have alerts set up via a custom query.
The alerts are working fine. I'm expecting to pull data out of the alert triggers and use it for analytical purposes.
Explanation:
I have alerts A,B,C,D,E.....
Over a given period, A triggered 5 times, B 3 times, D 10 times, and so on.
For that period, to start with, I'm looking for insight into which failure happened most frequently so appropriate action can be taken.
Where can I find this information? I'm not looking at the Monitor tab, as it gives only a very basic view.

How to trigger an alert notification of a long-running process in Azure Data Factory V2 using either Azure Monitor or ADF itself?

I've been trying to find the best way to trigger an alert when an ADF task (i.e. a Copy activity or Stored Procedure activity) has been running for more than N hours. I wanted to use Azure Monitor, as it is one of the recommended notification services in Azure, but I have not been able to find a "Running" criterion, so I had to work with the available criteria (Succeeded and Failed) and check them every N hours. This is still not ideal, because I don't know when the process started and we may run the process manually multiple times a day. Is there any way you would recommend doing this? For example, an event-based notification that listens to some time variable and, as soon as it exceeds the threshold, triggers an email notification?
Based on your requirements, I suggest using the Azure Data Factory SDKs to monitor your pipelines and activities.
You could create a timer-triggered Azure Function that runs every N hours. In that trigger function:
List all running activities in the data factory account.
Then loop over them and check the DurationInMs property of the ActivityRun class to see whether any activity has been running for more than N hours and is still in the In-Progress status.
Finally, send the email or kill the activity or do whatever you want.
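If the factory's diagnostic logs are already routed to a Log Analytics workspace, roughly the same check can also be expressed in Kusto instead of calling the SDK. This is only a sketch, assuming the standard ADFActivityRun diagnostic schema and using a 2-hour threshold as an example:
ADFActivityRun
| where Status == "InProgress"   // activity runs that have started
| join kind=leftanti ( ADFActivityRun
| where Status in ("Succeeded", "Failed", "Cancelled") )   // ...minus those that already reached a terminal status
on ActivityRunId
| where Start < ago(2h)   // assumed threshold: flag anything still running after 2 hours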
I would suggest a simple solution:
A Kusto query lists all pipeline runs whose status is "Queued" and joins them on CorrelationId against the runs we are not interested in, typically "Succeeded" and "Failed". The leftanti join flavor does the job by "returning all the records from the left side that don't have matches from the right" (as specified in the Microsoft documentation).
The next step is to set your desired timeout value; it is 30m in the example below.
Finally, you can configure an alert rule based on this query and get your email notification, or whatever you need.
ADFPipelineRun
| where Status == "Queued"
| join kind=leftanti ( ADFPipelineRun
| where Status in ("Failed", "Succeeded") )
on CorrelationId
| where Start < ago(30m)
I tested this only briefly, so something may be missing. One idea is to add other statuses to be removed from the result, like "Cancelled", as sketched below.
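Adding "Cancelled" (or any other terminal status you want to ignore) only changes the inner filter; a sketch of the adjusted query:
ADFPipelineRun
| where Status == "Queued"
| join kind=leftanti ( ADFPipelineRun
| where Status in ("Failed", "Succeeded", "Cancelled") )   // also drop runs that were cancelled rather than completed
on CorrelationId
| where Start < ago(30m)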

Monitoring Timer Triggered Azure Functions For Inactivity

We have a couple of functions in a function app. Two of them are triggered by a timer, do some processing and write to queues to trigger other functions.
They normally worked very well until recently, when the timer trigger just stopped firing. We fixed this by restarting the application, which resolved the issue. The problem is that we were completely unaware the trigger had stopped, since there were no failures and the function app is not constantly watched by our people.
I'd like to configure automatic monitoring and alerting for this special case. I configured Application Insights for the function app and tried to write an alert that watches the count metric of the functions triggered by a timer. If the metric is below the set threshold (below 1 in the last 5 minutes), the alert should fire.
I tested this by simply stopping the function app. My reasoning was that a function app that does not run should fulfill this condition and trigger an alert within a reasonable time frame. Unfortunately this was not the case. Apparently a non-existent count is not measured, and the alert will never be triggered.
Did someone else experience a similar problem and has a way to work around this?
I've added Application Insights alert:
Type: Custom log search
Search query:
requests | where cloud_RoleName =~ '<FUNCTION_APP_NAME_HERE>' and name == '<FUNCTION_NAME_HERE>'
Alert logic: Number of results less than 1
Evaluated based on: Over last N hours, Run every M hours
The alert fires if there are no invocations over the last N hours.
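If the app has several timer-triggered functions, one rule can cover all of them by returning a row per function that did run. A sketch with placeholder names; set the alert logic to fire when the number of results is less than the number of functions being watched:
requests
| where cloud_RoleName =~ '<FUNCTION_APP_NAME_HERE>'
| where name in ('<FUNCTION_NAME_1>', '<FUNCTION_NAME_2>')   // placeholders: one entry per timer-triggered function
| summarize invocations = count() by name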

How to send Alert Notification for Failed Job in Google Dataproc?

I am wondering if there is a way to hook in notifications for jobs submitted in Dataproc. We are planning to use Dataproc to run a streaming application 24/7, but Dataproc doesn't seem to have a way to notify about failed jobs.
Just wondering if Google StackDriver can be used for this in any way.
Thanks
Suren
Sure, StackDriver can be used to set an alert policy on a defined log-metric.
For example, you can set a Metric Absence policy which will monitor for successful job completion and alert if it's missing for a defined period of time.
Go to Logging in your console and set a filter:
resource.type="cloud_dataproc_cluster"
jsonPayload.message:"completed with exit code 0"
Click on Create Metric. After filling in the details you'll be redirected to the log-based metrics page, where you'll be able to create an alert from the metric.
As noted in the answer above, log-based metrics can be coerced into providing the functionality the OP requires. However, a metric-absence policy for long-running jobs means waiting longer than your best guess at the longest job runtime (and you may still get an alert when a job simply takes a bit longer without failing). What we really want is a way of monitoring and alerting on a failed job status, or on a service completion message indicating failure (like your example), so that we are alerted immediately. Yes, you can define a Stackdriver log-based metric that looks for specific strings or values indicating failure, and this 'works', but metrics are measures that are counted (for example, 'how many jobs failed') and require inconvenient workarounds to turn an alert-from-metric into a simple 'this job failed' alert. To make this work, for example, the alert filters on a metric and also needs to specify a mean aggregator over an interval before it fires. Nasty :(
