How to trigger an alert notification of a long-running process in Azure Data Factory V2 using either Azure Monitor or ADF itself? - azure

I've been trying to find the best way to trigger an alert when an ADF task (e.g. a Copy activity or a Stored Procedure activity) has been running for more than N hours. I wanted to use Azure Monitor, as it is one of the recommended notification services in Azure, but I have not been able to find a "Running" criterion, so I had to work with the available criteria (Succeeded and Failed) and check them every N hours. This is still not ideal, because I don't know when the process started, and we may run the process manually multiple times a day. Is there any way you would recommend doing this? For example, an event-based notification that watches some time variable and triggers an email notification as soon as it exceeds the threshold?

Is there any way you would recommend doing this? Like an event-based
notification that listens to some time variable and as soon as it is
greater than the threshold triggers an email notification?
Based on your requirements, I suggest using the Azure Data Factory SDKs to monitor your pipelines and activities.
You could create a timer-triggered Azure Function that runs every N hours. In that function:
List all running activity runs in your Data Factory account.
Loop over them and check the DurationInMs property of the ActivityRun class to see whether any activity has been running for more than N hours while still in "InProgress" status.
Finally, send the email, kill the activity, or do whatever you want.
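The steps above could be sketched roughly like this with the azure-mgmt-datafactory SDK. Treat this as an illustration, not a drop-in solution: the resource group, factory name, threshold, and the send_alert helper are all placeholders, and you should verify the filter parameters against the SDK version you use.

```python
# Sketch of the timer-triggered check. Only is_long_running is pure logic;
# check_factory assumes an authenticated DataFactoryManagementClient.
from datetime import datetime, timedelta, timezone

THRESHOLD_HOURS = 4  # alert when an activity has run longer than this

def is_long_running(duration_ms, threshold_hours=THRESHOLD_HOURS):
    """True when an activity's DurationInMs exceeds the threshold."""
    return duration_ms is not None and duration_ms > threshold_hours * 3600 * 1000

def check_factory(client, resource_group, factory_name):
    from azure.mgmt.datafactory.models import RunFilterParameters, RunQueryFilter
    now = datetime.now(timezone.utc)
    # Look at pipeline runs updated in the last day that are still in progress.
    params = RunFilterParameters(
        last_updated_after=now - timedelta(days=1),
        last_updated_before=now,
        filters=[RunQueryFilter(operand="Status", operator="Equals",
                                values=["InProgress"])],
    )
    runs = client.pipeline_runs.query_by_factory(resource_group, factory_name, params)
    for run in runs.value:
        acts = client.activity_runs.query_by_pipeline_run(
            resource_group, factory_name, run.run_id, params)
        for act in acts.value:
            if act.status == "InProgress" and is_long_running(act.duration_in_ms):
                send_alert(run.run_id, act.activity_name)  # hypothetical helper
```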

I would suggest a simple solution:
Write a Kusto query listing all pipeline runs whose status is "Queued", and join it on CorrelationId against the runs we are not interested in - typically "Succeeded" and "Failed". The leftanti join flavor does the job by "returning all the records from the left side that don't have matches from the right" (as the MS documentation puts it).
Next step would be to set your desired timeout value - it is 30m in the example code below.
Finally, you can configure Alert rule based on this query and get your email notification, or whatever you need.
ADFPipelineRun
| where Status == "Queued"
| join kind=leftanti ( ADFPipelineRun
| where Status in ("Failed", "Succeeded") )
on CorrelationId
| where Start < ago(30m)
I only tested this briefly, so something may be missing. You could also add other statuses to be removed from the result, such as "Cancelled".
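For example, extending the right side of the join with "Cancelled" would look like this (untested, same caveat as the query above):

```kusto
ADFPipelineRun
| where Status == "Queued"
| join kind=leftanti (
    ADFPipelineRun
    | where Status in ("Failed", "Succeeded", "Cancelled")
  ) on CorrelationId
| where Start < ago(30m)
```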

Azure Monitor alert is sending false failure email notifications of failure count of 1 for Functions app

I have a Functions app where I've configured signal logic to send me an alert whenever one or more failures have occurred in my application. I have been getting emails every day saying my Azure Monitor alert was triggered, followed by a later email saying the failure was resolved. I know that my app didn't fail, because I checked in Application Insights. For instance, I did not have a failure today, but did have failures on the prior two days:
However, I did receive a failure email today. If I go to configure the signal logic where I set a static threshold of failure count greater than or equal to 1 it shows this:
Why is it showing a failure for today, when I know that isn't true from the Application Insights logs? Also, if I change the signal logic to look at total failures instead of count of failures, it looks correct:
I've decided to use the total failures metric instead, but it seems that the count functionality is broken.
Edit:
Additional screenshot:
I suggest using Custom log search as the signal, if you have already connected your function app with Application Insights (I like to use this kind of signal and don't see behavior like yours).
The steps are as follows:
Step 1: For the signal, select Custom log search. The screenshot is below:
Step 2: When the Azure function times out, it throws an error whose type is Microsoft.Azure.WebJobs.Host.FunctionTimeoutException, so you can use the query below to check whether it timed out:
exceptions
| where type == "Microsoft.Azure.WebJobs.Host.FunctionTimeoutException"
Put the above query in the "Search query" field, and configure other settings as per your need. The screenshot is as below:
Then configure the other settings, like the action group, etc. Please let me know if you still have this issue.
One thing should be noted: some kinds of triggers support retry logic, like BlobTrigger. So if the function retries, you may also receive the alert email. You can disable the retry logic as per this doc.

How to check running status and stop Durable function

I want to process millions of records on demand, which takes approximately 2-3 hours. I want to go serverless, which is why I tried Durable Functions (for the first time). I want to check how long I can run a durable function, so I created 3 functions:
HTTP function to kick-start the orchestrator function
Orchestrator function
Activity function
My durable function has been running and emitting logs to Application Insights for the last 5 days, and based on my code it would take 15 more days to complete.
How do I stop the orchestrator function manually?
I can see thousands of entries in the Application Insights requests table for a single execution. Is there any way to check how many durable functions are running in the backend, and how much time a single execution has taken?
I can see some information regarding the orchestrator function in the "DurableFunctionHubInstance" table, but Microsoft recommends not relying on that table.
Since Durable Functions does a lot of checkpointing and replays the orchestration, normal logging might not always be very insightful.
Getting the status
There are several ways to query for the status of orchestrations. One of them is through the Azure Functions Core tools as George Chen mentioned.
Another way to query the status is by using the HTTP API of Durable Functions directly:
GET <rooturl>/runtime/webhooks/durableTask/instances?
taskHub={taskHub}
&connection={connectionName}
&code={systemKey}
&createdTimeFrom={timestamp}
&createdTimeTo={timestamp}
&runtimeStatus={runtimeStatus1,runtimeStatus2,...}
&showInput=[true|false]
&top={integer}
More info in the docs.
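As a sketch, calling that endpoint from Python might look like the snippet below. The root URL, task hub, and system key are placeholders you'd fill in from your function app; only the URL-building helper is exercised here, the actual HTTP call needs a live app.

```python
# Sketch: querying the Durable Functions HTTP API for instances.
def build_status_url(root_url, task_hub, system_key,
                     runtime_status="Running", top=100):
    """Build the instances query URL described above."""
    return (
        f"{root_url}/runtime/webhooks/durableTask/instances"
        f"?taskHub={task_hub}&code={system_key}"
        f"&runtimeStatus={runtime_status}&top={top}"
    )

def list_running_instances(root_url, task_hub, system_key):
    import requests  # only needed when actually calling the API
    resp = requests.get(build_status_url(root_url, task_hub, system_key))
    resp.raise_for_status()
    return resp.json()  # list of instance status objects
```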
The HTTP API also has methods to purge orchestrations. Either a single one by ID or multiple by datetime/status.
DELETE <rooturl>/runtime/webhooks/durabletask/instances/{instanceId}
?taskHub={taskHub}
&connection={connection}
&code={systemKey}
Finally you can also manage your instances using the DurableOrchestrationClient API in C#. Here's a sample on GitHub: HttpGetStatusForMany.cs
I have written & vlogged about using the DurableOrchestrationClient API in case you want to know more about how to use this in C#.
Custom status
Small addition: it's possible to add a custom status object to the orchestration so you can add enriched information about the progress of the orchestration.
Getting the duration
When you query the status of an orchestration instance you get back a DurableOrchestrationStatus object. This contains two properties:
CreatedTime
LastUpdatedTime
I'm guessing you can subtract those and get a reasonable indication of the time it has taken.
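That subtraction is trivial but worth spelling out; for example, with the ISO timestamps the status API returns:

```python
# Rough running time of an orchestration: LastUpdatedTime - CreatedTime.
from datetime import datetime

def elapsed(created_time: datetime, last_updated_time: datetime):
    return last_updated_time - created_time

created = datetime.fromisoformat("2020-01-01T10:00:00")
updated = datetime.fromisoformat("2020-01-01T12:30:00")
print(elapsed(created, updated))  # 2:30:00
```

Note this is only an indication: for a completed orchestration LastUpdatedTime is effectively the end time, but for a running one it is just the last checkpoint.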
You could manage the Durable Functions orchestration instances with Azure Functions Core Tools.
Terminate instances:
func durable terminate --id 0ab8c55a66644d68a3a8b220b12d209c --reason "It was time to be done."
Query instances with filters: you could add the runtime-status parameter to filter for the running instances.
func durable get-instances --created-after 2018-03-10T13:57:31Z --created-before 2018-03-10T23:59Z --top 15
As for how much time the functions took, it doesn't look like that is supported directly; the closest option is the get-history parameter.

Application Insights Alert Trigger History

We are using Application Insights for sending our metrics. Based on these metrics we have alerts set via a custom query.
The alerts are working fine. I'm expecting to pull data out of the alert triggers and use it for analytical purposes.
Explanation:
I have alerts A,B,C,D,E.....
During a course of period A triggered 5 times, B 3 times, D 10 times.....
For this period, to start with, I'm looking for insight into which failure happened most frequently, so appropriate action can be taken.
Where can I find this information? I'm not looking at the Monitor tab, as it gives only a very basic view.

Can I execute an on-demand WebJob from a scheduled WebJob?

I need to execute a long-running WebJob on certain schedules or on demand, with some parameters that need to be passed. I had it set up so that either the scheduled WebJob would put a message on the queue with the parameters and a queue-message-triggered job would take over, or some user interaction would put the same message on the queue and the triggered job would take over. However, for some reason the triggered function never finishes, and right now I cannot see any exceptions being displayed in the dashboard outputs (see Time limit on Azure Webjobs triggered by Queue).
I'm looking into whether I can execute my triggered WebJob as an on-demand WebJob and pass the parameters to it. Is there any way to call an on-demand WebJob from a scheduled WebJob and pass it some command-line parameters?
Thanks for your help!
QueueTriggered WebJob functions work very well when configured properly. Please see my answer on the other question which points to documentation resources on how to set your WebJobs SDK Continuous host up properly.
Queue messaging is the correct pattern for you to be using for this scenario. It allows you to pass arbitrary data along to your job, and will also allow you to scale out to multiple instances as needed when your load increases.
You can use the WebJobs Dashboard to invoke your job function directly (see "Run Function" button below) - you can specify the queue message input directly in the Dashboard as a string. This allows you to invoke the function directly as needed with whatever inputs you want, in addition to allowing the function to continue to respond to queue messages actually added to the queue.

Requeue or delete messages in Azure Storage Queues via WebJobs

I was hoping if someone can clarify a few things regarding Azure Storage Queues and their interaction with WebJobs:
To perform recurring background tasks (i.e. add to queue once, then repeat at set intervals), is there a way to update the same message delivered in the QueueTrigger function so that its lease (visibility) can be extended as a way to requeue and avoid expiry?
With the above-mentioned pattern for recurring background jobs, I'm also trying to figure out a way to delete/expire a job 'on demand'. Since this doesn't seem possible outside the context of WebJobs, I was thinking of maybe storing the messageId and popReceipt for the message(s) to be deleted in Table storage as persistent cache, and then upon delivery of message in the QueueTrigger function do a Table lookup to perform a DeleteMessage, so that the message is not repeated any more.
Any suggestions or tips are appreciated. Cheers :)
Azure Storage Queues are used to store messages that may be consumed by your Azure Webjob, WorkerRole, etc. The Azure Webjobs SDK provides an easy way to interact with Azure Storage (that includes Queues, Table Storage, Blobs, and Service Bus). That being said, you can also have an Azure Webjob that does not use the Webjobs SDK and does not interact with Azure Storage. In fact, I do run a Webjob that interacts with a SQL Azure database.
I'll briefly explain how the Webjobs SDK interacts with Azure Queues. Once a message arrives in a queue (or is made 'visible'; more on this later), the function in the Webjob is triggered (assuming you're running in continuous mode). If that function returns with no error, the message is deleted. If something goes wrong, the message goes back to the queue to be processed again. You can handle the failed message accordingly. Here is an example of how to do this.
The SDK will call a function up to 5 times to process a queue message. If the fifth try fails, the message is moved to a poison queue. The maximum number of retries is configurable.
Regarding visibility: when you add a message to the queue, there is a visibility timeout property, which is zero by default. Therefore, if you want to process a message in the future (up to 7 days ahead), you can do so by setting this property to the desired value.
Optional. If specified, the request must be made using an x-ms-version of 2011-08-18 or newer. If not specified, the default value is 0. Specifies the new visibility timeout value, in seconds, relative to server time. The new value must be larger than or equal to 0, and cannot be larger than 7 days. The visibility timeout of a message cannot be set to a value later than the expiry time. visibilitytimeout should be set to a value smaller than the time-to-live value.
Now the suggestions for your app.
I would just add a message to the queue for every task that you want to accomplish. The message will obviously have the pertinent information for processing. If you need to schedule several tasks, you can run a Scheduled Webjob (on a schedule of your choice) that adds messages to the queue. Then your continuous Webjob will pick up that message and process it.
Add a GUID to each message that goes to the queue. Store that GUID in some other domain of your application (a database). So when you dequeue the message for processing, the first thing you do is check against your database if the message needs to be processed. If you need to cancel the execution of a message, instead of deleting it from the queue, just update the GUID in your database.
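The GUID lookup pattern above could be sketched like this. The "database" here is a plain set standing in for your real table, and the names are illustrative:

```python
# Cancellation via a GUID lookup: messages carry an id, and the dequeue
# handler checks a cancellation store before doing any work.
import uuid

def make_message(payload):
    """Attach a GUID so the message can be cancelled later."""
    return {"id": str(uuid.uuid4()), "payload": payload}

def should_process(message, cancelled_ids):
    """First thing the dequeue handler does: check the cancellation store."""
    return message["id"] not in cancelled_ids

cancelled = set()                 # stand-in for a database table
msg = make_message({"task": "reindex"})
assert should_process(msg, cancelled)
cancelled.add(msg["id"])          # the user cancels via your application
assert not should_process(msg, cancelled)
```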
There's more info here.
Hope this helps,
As for the first part of the question, you can use the Update Message operation to extend the visibility timeout of a message.
The Update Message operation can be used to continually extend the
invisibility of a queue message. This functionality can be useful if
you want a worker role to “lease” a queue message. For example, if a
worker role calls Get Messages and recognizes that it needs more time
to process a message, it can continually extend the message’s
invisibility until it is processed. If the worker role were to fail
during processing, eventually the message would become visible again
and another worker role could process it.
You can check the REST API documentation here: https://msdn.microsoft.com/en-us/library/azure/hh452234.aspx
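As a rough illustration of that "lease" loop with the azure-storage-queue package (v12): the connection string, queue name, and process_chunk helper below are placeholders, and you should verify the client calls against the SDK version you use.

```python
# Sketch: extending a message's invisibility with Update Message while a
# long-running task is processed in chunks.
def needs_renewal(seconds_left, safety_margin=5):
    """Renew the message's invisibility when the window is nearly up."""
    return seconds_left <= safety_margin

def process_with_lease(connection_string, queue_name, lease_seconds=30):
    from azure.storage.queue import QueueClient
    queue = QueueClient.from_connection_string(connection_string, queue_name)
    for msg in queue.receive_messages(visibility_timeout=lease_seconds):
        while not process_chunk(msg):  # hypothetical unit of work
            # Update Message with a fresh visibility_timeout extends the
            # invisibility, so no other worker picks the message up.
            msg = queue.update_message(msg, visibility_timeout=lease_seconds)
        queue.delete_message(msg)  # finished: remove the message for good
```

If the worker crashes mid-loop, the visibility timeout simply expires and another worker picks the message up, which is exactly the failover behavior the quote above describes.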
For the second part of your question, there are really multiple ways, and your method of storing the messageId/popReceipt as a lookup is a possible option. You could also have a WebJob dedicated to receiving messages on a different queue (e.g. plz-delete-msg): you send a message containing the messageId, and that WebJob can use the Get Messages operation and then delete it. (You can make the job generic by passing the queue name!)
https://msdn.microsoft.com/en-us/library/azure/dd179474.aspx
https://msdn.microsoft.com/en-us/library/azure/dd179347.aspx
