How to send Alert Notification for Failed Job in Google Dataproc? - apache-spark

I am wondering if there is a way to hook in some notifications for jobs submitted in Dataproc. We are planning to use Dataproc to run a streaming application 24/7. But Dataproc doesnt seem to have a way to notify for failed jobs.
Just wondering if Google StackDriver can be used by any means.
Thanks
Suren

Sure, StackDriver can be used to set an alert policy on a defined log-metric.
For example, you can set a Metric Absence policy which will monitor for successful job completion and alert if it's missing for a defined period of time.
Go to Logging in your console and set a filter:
resource.type="cloud_dataproc_cluster"
jsonPayload.message:"completed with exit code 0"
Click on Create Metric, after filling the details you'll be redirected to the log-metrics page where you'll be able to create an alert from the metric

As noted in the answer above, log-based metrics can be coerced to provide the OP required functionality. But, metric absence for long-running jobs would imply you would have to wait for longer than a guess at the longest job running time (and you still might get an alert if the job takes a bit longer but is not failing). What 'we' really want is a way of monitoring and alerting on job status failed, or, service completion message indicating failure (like your example), such that we are alerted immediately. Yes, you can define a Stackdriver log-based metric, looking for specific strings or values indicating failure, and this 'works', but metrics are measures that are counted, for example 'how many jobs failed', and require inconvenient workarounds to turn alert-from-metric into a simple 'this job failed' alert. To make this work, for example, the alert filters on a metric and also needs to specify a mean aggregator over an interval to fire an alert. Nasty :(

Related

Azure alert on Azure Functions "Failed" metric is triggering with no apparent failures

I want an Azure Alert to trigger when a certain function app fails. I set it up as a GTE 1 threshold on the [function name] Failed metric thinking that would yield the expected result. However, when it runs daily I am getting notifications that the alert fired but I cannot find anything in the Application Insights to indicate the failure and it appears to be running successfully and completing.
Here is the triggered alert summary:
Here is the invocation monitoring from the portal showing that same function over the past few days with no failures:
And here is an application insights search over that time period showing no exceptions and all successful dependency actions:
The question is - what could be causing a Azure Function Failed metric to be registering non-zero values without any telemetry in Application Insights?
Update - here is the alert configuration
And the specific condition settings-
Failures blade for wider time range:
There are some dependency failures on a blob 404 but I think that is from a different function that explicitly checks for the existence of blobs at paths to know which files to download from an external source. Also the timestamps don't fall in the sample period.
No exceptions:
Per comment on the question by #ivan-yang I have switched the alerting to use a custom log search instead of the built-in Azure Function metric. At this point that metric seems to be pretty opaque as to what is triggering it and it was triggering every day when I ran the Azure Function with no apparent underlying failure. I plan to avoid this metric now.
My log based alert is using the following query for now to get what I was looking for (an exception happened or a function failed):
requests
| where success == false
| union (exceptions)
| order by timestamp desc
Thanks to #ivan-yang and #krishnendughosh-msft for the help

Azure Monitor alert is sending false failure email notifications of failure count of 1 for Functions app

I have a Functions app where I've configured signal logic to send me an alert whenever a failure greater than or equal to one has occurred in my application. I have been getting emails everyday saying my Azure Monitor alert was triggered followed by an email later saying that the failure was resolved. I know that my app didn't fail because I checked in Application Insights. For instance, I did not have a failure today, but did have a failures the prior 2 days:
However, I did receive a failure email today. If I go to configure the signal logic where I set a static threshold of failure count greater than or equal to 1 it shows this:
Why is it showing a failure for today, when I know that isn't true from the Application Insights logs? Also, if I change the signal logic to look at total failures instead of count of failures, it looks correct:
I've decided to use the total failures metric instead, but it seems that the count functionality is broken.
Edit:
Additional screenshot:
I suggest you can use Custom log search as the signal if you have already connected your function app with Application insights(I'd like to use this kind of signal, and don't see such behavior like yours).
The steps as below:
Step 1: For signal, please select Custom log search. The screenshot is as below:
Step 2: When the azure function times out, it will throw an error and the error type is Microsoft.Azure.WebJobs.Host.FunctionTimeoutException, so you can use the query below to check if it times out or not:
exceptions
| where type == "Microsoft.Azure.WebJobs.Host.FunctionTimeoutException"
Put the above query in the "Search query" field, and configure other settings as per your need. The screenshot is as below:
Then configure other settings like action group etc. Please let me know if you still have such issue.
One thing should be noted: Some kinds of triggers support retry logic, like blogtrigger. So if it reties, you can also receive the alert email. But you can disable the retry logic as per this doc.

AppInsights - Monitor for Hung Processes

We are looking at implementing AppInsights for our non-web application. One of the things that we want to monitor for is processes that may be "hung" for more than N number of seconds or minutes. I have been unable to find something built in that does this. The closest thing I have seen or thought of would be to log 2 custom events for the start and end of a process, and then have an alert for a custom log that queries events with no matching "end" event after N minutes.
Is there another way to monitor for hung processes using AppInsights that I am not seeing? Thanks for any help.
If you choose to use application insights, here is the suggestion just for your reference(but if you have another better solution, you can ignore this):
As per this post, you can leverage heartbeat feature, details as below:
if this application runs more than several seconds, you can leverage heartbeat
feature - it sends metric every N minutes/seconds (configurable) and the absence of such
metric will indicate that application is no longer actively running. However, if
Application Insights thread survives, then heartbeat will still be reported.
You can rely on presense/absense of the telemetry from this app in general as well as
couple custom events as you outlined above - Azure Monitor allows to set an alert on
analytics query, so you'll be able to craft a query that returns nothing in case of
application issues and set an alert on 0 count returned by such a query.

Azure Functions notification on failure

I have timer-triggered Azure functions running in production, but now I want to be notified if the function fails.
In my case, access to various connected services can cause crashes, and there are many to troubleshoot. The crash is the type of error I need notification for.
When the function does fail, the log entry indicates failure, so I wonder if there is a hook in the system that would allow me to cause the system to generate a notification.
I know that blob and queue bindings, for instance, support the creation of poison queue entries, but timer trigger binding doesn't say anything about any trigger outputs of that nature.
I see that functions can pass their $return status as input to other functions, but that operation is not explained in depth in the docs. Also, in that case, I need to write another function to process the error status, and I was looking for something built-in.
I Have inquired with #AzureSupport on this, but their answer had nothing to do with Azure Functions, instead referring me to DLL notification hooks, then recommending I file on uservoice.
I'm sure there must be people here who have implemented some sort of error status notification. I prefer a solution that doesn't require code.
The recommended way to monitor and alert on failures is to use AppInsights which integrates fully with Azure Functions now
https://blogs.msdn.microsoft.com/appserviceteam/2017/04/06/azure-functions-application-insights/
Since all the logs are available in AppInsights it's easy to monitor for failures and setup alerts based on your own criteria.
However, if you only care about alerting and not things like monitoring etc, you could use Azure Monitor instead: https://learn.microsoft.com/en-us/azure/monitoring-and-diagnostics/monitoring-get-started
When the function does fail, the log entry indicates failure, so I wonder if there is a hook in the system that would allow me to cause the system to generate a notification.
...
I prefer a solution that doesn't require code.
This is a zero-code solution:
I poked #AzureFunctions once before on this topic, and a suggested response was to use Application Insights. It can handle the alerts upon failure and also can use webhooks.
See the Azure Functions App-Insights documentation on how to link your function app to App Insights. Then set up any alerts you want.
Unfortunately this hook doesn't exist.
Can you switch from a timer trigger to a queue trigger?
You can get retries (if you want them), and after the specified number of attempts the message is sent to a poison queue.
To schedule executions you can add queue messages with a visibility timeout to match your schedule.
In order to get alerts on failure you have two options:
A timer trigger than scans the execution logs (via SFTP) for failures.
Wrap the whole function in a try/catch block and in the catch block write a few lines to send you an email with the error details.
Hope this helps.
No code:
Go to your azure cloud account
From the menu select Monitor
Then select Add New Rule
Then Select your condition, action and add the alert details.

Can i execute an on-demand web job from a scheduled webjob?

I need to execute a long running webjob on certain schedules or on-demand with some parameters that need to be passed. I had it in a way where the scheduled webjob would put a message on the queue with the parameters and a queue message triggered job would take over - OR - some user interaction would put the same message on the queue with the parameters and the triggered job would take over. However for some reason the triggered function never-finishes - and right now i cannot see any exceptions being displayed in the dashboard outputs (see Time limit on Azure Webjobs triggered by Queue)
I m looking into whether I can execute my triggered webjob as an On-demand webjob and pass the parameters to it? Is there anyway to call an on-demand web job from a scheduled web job and pass it some command line parameters?
Thanks for your help!
QueueTriggered WebJob functions work very well when configured properly. Please see my answer on the other question which points to documentation resources on how to set your WebJobs SDK Continuous host up properly.
Queue messaging is the correct pattern for you to be using for this scenario. It allows you to pass arbitrary data along to your job, and will also allow you to scale out to multiple instances as needed when your load increases.
You can use the WebJobs Dashboard to invoke your job function directly (see "Run Function" button below) - you can specify the queue message input directly in the Dashboard as a string. This allows you to invoke the function directly as needed with whatever inputs you want, in addition to allowing the function to continue to respond to qeueue messages actually added to the queue.

Resources