Creating an alert for long running pipelines - azure

I currently have an alert setup for Data Factory that sends an email alert if the pipeline runs longer than 120 minutes, following this tutorial: https://www.techtalkcorner.com/long-running-azure-data-factory-pipelines/. So when a pipeline does in fact run longer than the expected time, I do receive an alert however, I am also getting additional & unexpected alerts.
My query looks like:
ADFPipelineRun
| where Status =="InProgress" // Pipeline is in progress
| where RunId !in (( ADFPipelineRun | where Status in ("Succeeded","Failed","Cancelled") | project RunId ) ) // Subquery, pipeline hasn't finished
| where datetime_diff('minute', now(), Start) > 120 // It has been running for more than 120 minutes
I received an alert email on September 28th of course saying a pipeline was running longer than the 120 minutes but when trying to find the pipeline in the Azure Data Factory pipeline runs nothing shows up. In the alert email there is a button that says, "View the alert in Azure monitor" and when I go to that I can then press "View Query Results" above the shown query. Here I can re-enter the query above and filter the date to show all pipelines running longer than 120 minutes since September 27th and it returns 3 pipelines.
Something I noticed about these pipelines is the end time column:
I'm thinking that at some point the UTC time is not properly configured and for that reason, maybe the alert is triggered? Is there something I am doing wrong, or a better way to do this to avoid a bunch of false alarms?

To create Preemptive warnings for long-running jobs.
Create activity.
Click on blank space.
Follow path: Settings > Elapsed time metric
Refer Operationalize Data Pipelines - Azure Data Factory

I'm not sure if you're seeing false alerts. What you've shown here looks like the correct behavior.
You need to keep in mind:
Duration threshold should be offset by the time it takes for the logs to appear in Azure Monitor.
The email alert takes you to the query that triggered the event. Your query is only showing "InProgress" statues and so the End property is not set/updated. You'll need to extend your query to look at one of the other statues to see the actual duration.
Run another query with the RunId of the suspect runs to inspect the durations.
ADFPipelineRun
| where RunId == 'bf461c8b-0b1e-43c4-9cdf-7d9f7ccc6f06'
| distinct TimeGenerated, OperationName, RunId, Start, End, Status
For example:

Related

How can I run a search job periodically in Azure Log Analytics?

I'm trying to visualize the browser statistics of our app hosted in Azure.
For that I'm using the nginx logs and run an Azure Log Analytics query like this:
ContainerLog
| where LogEntrySource == "stdout" and LogEntry has "nginx"
| extend logEntry=parse_json(LogEntry)
| extend userAgent=parse_user_agent(logEntry.nginx.http_user_agent, "browser")
| extend browser=parse_json(userAgent)
| summarize count=count() by tostring(browser.Browser.Family)
| sort by ['count']
| render piechart with (legend=hidden)
Then I'm getting this diagram, which is exactly what I want:
But the query is very very slow. If I set the time range to more than just the last few hours it takes several minutes or doesn't work at all.
My solution is to use a search job like this:
ContainerLog
| where LogEntrySource == "stdout" and LogEntry has "nginx"
| extend d=parse_json(LogEntry)
| extend user_agent=parse_user_agent(d.nginx.http_user_agent, "browser")
| extend browser=parse_json(user_agent)
It creates a new table BrowserStats_SRCH on which I can do this search query:
BrowserStats_SRCH
| summarize count=count() by tostring(browser.Browser.Family)
| sort by ['count']
| render piechart with (legend=hidden)
This is much faster now and only takes some seconds.
But my problem is, how can I keep this up-to-date? Preferably this search job would run once a day automatically and refreshed the BrowserStats_SRCH table so that new queries on that table run always on the most recent logs. Is this possible? Right now I can't even trigger the search job manually again, because then I get the error "A destination table with this name already exists".
In the end I would like to have a deeplink to the pie chart with the browser stats without the need to do any further click. Any help would be appreciated.
But my problem is, how can I keep this up-to-date? Preferably this search job would run once a day automatically and refreshed the BrowserStats_SRCH table so that new queries on that table run always on the most recent logs. Is this possible?
You can leverage the api to create a search job. Then use a timer triggered azure function or logic app to call that api on a schedule.
PUT https://management.azure.com/subscriptions/00000000-0000-0000-0000-00000000000/resourcegroups/testRG/providers/Microsoft.OperationalInsights/workspaces/testWS/tables/Syslog_suspected_SRCH?api-version=2021-12-01-preview
with a request body containing the query
{
"properties": {
"searchResults": {
"query": "Syslog | where * has 'suspected.exe'",
"limit": 1000,
"startSearchTime": "2020-01-01T00:00:00Z",
"endSearchTime": "2020-01-31T00:00:00Z"
}
}
}
Or you can use the Azure CLI:
az monitor log-analytics workspace table search-job create --subscription ContosoSID --resource-group ContosoRG --workspace-name ContosoWorkspace --name HeartbeatByIp_SRCH --search-query 'Heartbeat | where ComputerIP has "00.000.00.000"' --limit 1500 --start-search-time "2022-01-01T00:00:00.000Z" --end-search-time "2022-01-08T00:00:00.000Z" --no-wait
Right now I can't even trigger the search job manually again, because then I get the error "A destination table with this name already exists".
Before you start the job as described above, remove the old result table using an api call:
DELETE https://management.azure.com/subscriptions/{subscriptionId}/resourcegroups/{resourceGroupName}/providers/Microsoft.OperationalInsights/workspaces/{workspaceName}/tables/{tableName}?api-version=2021-12-01-preview
Optionally, you could check the status of the job using this api before you delete it to make sure it is not InProgress or Deleting

How to properly create once-a-day Azure Log Alert for pod errors?

I created an Azure Alert using a Query (KQL - Kusto Query Language) reading from the Log. That is, it's an Log Alert.
After a few minutes, the alert was triggered (as I expected based on my condition).
My condition checks if there are Pods of a specific name in Failed state:
KubePodInventory
| where TimeGenerated between (now(-24h) .. now())
| where ClusterName == 'mycluster'
| where Namespace == 'mynamespace'
| where Name startswith "myjobname"
| where PodStatus == 'Failed'
| where ContainerStatusReason == 'Completed' //or 'Error' or doesn't matter? (there are duplicated entries, one with Completed and one with Error)
| order by TimeGenerated desc
These errors stay in the log, and I only want to catch (alert about them) once per day (that is, I check if there is at least one entry in the log (threshold), then fire the alert).
Is the log query evaluated every time there is a new entry in the log, or is it evaluated in a set frequency?I could not find in Azure Portal a frequency specified to check Alerts, so maybe it evaluates the Alert(s) condition(s) every time there is something new in the Log?

'Delay until' finish time of 'Queue a new build' not working in Azure Logic App

I'm triggering an Azure Logic App from an https webhook for a docker image in Azure Container Registry.
The workflow is roughly:
When a HTTP request is received
Queue a new build
Delay until
FinishTime of Queue a new build
See: Workflow image
The Delay until action doesn't work in that the queueried FinishTime is 0001-01-01T00:00:00.
It complains about the wrong format, so I manually added a Z after the FinishTime keyword.
Now the time stamp is in the right format, however, the timestamp 0001-01-01T00:00:00Z obviously doesn't make sense and subsequent steps are executed without delay.
Anything that I am missing?
edit: Queue a new build queues an Azure pipeline build. I.e. the FinishTime property comes from the pipeline.
You need to set a timestamp in future, the timestamp 0001-01-01T00:00:00Z you set to the "Delay until" action is not a future time. If you set a timestamp as 2020-04-02T07:30:00Z, the "Delay until" action will take effect.
Update:
I don't think the "Delay until" can do what you expect, but maybe you can refer to the operations below. Just add a "Condition" action to judge if the FinishTime is greater than current time.
The expression in the "Condition" is:
sub(ticks(variables('FinishTime')), ticks(utcNow()))
In a word, if the FinishTime is greater than current time --> do the "Delay until" aciton. If the FinishTime is less than current time --> do anything else which you want.(By the way you need to pay attention to the time zone of your timestamp, maybe you need to convert all of the time zone to UTC)
I've been in touch with an Azure support engineer, who has confirmed that the Delay until action should work as I intended to use it, however, that the FinishTime property will not hold a value that I can use.
In the meantime, I have found a workaround, where I'm using some logic and quite a few additional steps. Inconvenient but at least it does what I want.
Here are the most important steps that are executed after the workflow gets triggered from a webhook (docker base image update in Azure Container Registry).
Essentially, I'm initializing the following variables and queing a new build:
buildStatusCompleted: String value containing the target value completed
jarsBuildStatus: String value containing the initial value notStarted
jarsBuildResult: String value containing the default value failed
Then, I'm using an Until action to monitor when the jarsBuildStatus's value is switching to completed.
In the Until action, I'm repeating the following steps until jarsBuildStatus changes its value to buildStatusCompleted:
Delay for 15 seconds
HTTP request to Azure DevOps build, authenticating with personal access token
Parse JSON body of previous raw HTTP output for status and result keywords
Set jarsBuildStatus = status
After breaking out of the Until action (loop), the jarsBuildResult is set to the parsed result.
All these steps are part of a larger build orchestration workflow, where I'm repeating the given steps multiple times for several different Azure DevOps build pipelines.
The final action in the workflow is sending all the status, result and other relevant data as a build summary to Azure DevOps.
To me, this is only a workaround and I'll leave this question open to see if others have suggestions as well or in case the Azure support engineers can give more insight into the Delay until action.
Here's an image of the final workflow (at least, the part where I implemented the Delay until action):
edit: Turns out, I can simplify the workflow because there's a dedicated Azure DevOps action in the Logic App called Send an HTTP request to Azure DevOps, which omits the need for manual authentication (Azure support engineer pointed this out).
The workflow now looks like this:
That is, I can query the build status directly and set the jarsBuildStatus as
#{body('Send_an_HTTP_request_to_Azure_DevOps:_jar''s')['status']}
The code snippet above is automagically converted to a value for the Set variable action. Thus, no need to use an additional Parse JSON action.

Azure Log Analytics - Alerts Advice

I have a question about azure log analytics alerts, in that I don't quite understand how the time frame works within the context of setting up an alert based on an aggregated value.
I have the code below:
Event | where Source == "EventLog" and EventID == 6008 | project TimeGenerated, Computer | summarize AggregatedValue = count(TimeGenerated) by Computer, bin_at(TimeGenerated,24h, datetime(now()))
For time window : 24/03/2019, 09:46:29 - 25/03/2019, 09:46:29
In the above the alert configuration interface insights on adding the bin_at(TimeGenerated,24h, datetime(now())) so I add the function, passing the arguments for a 24h time period. If you are already adding this then what is the point of the time frame.
Basically the result I am looking for is capturing this event over a 24 hour period and alerting when the event count is over 2. I don't understand why a time window is also necessary on top of this because I just want to run the code every five minutes and alert if it detects more than two instances of this event.
Can anyone help with this?
AFAIK you may use the query something like shown below to accomplish your requirement of capturing the required event over a time period of 24 hour.
Event
| where Source == "EventLog" and EventID == 6008
| where TimeGenerated > ago(24h)
| summarize AggregatedValue= any(EventID) by Computer, bin(TimeGenerated, 1s)
The '1s' in this sample query is the time frame with which we are aggregating and getting the output from Log Analytics workspace repository. For more information, refer https://learn.microsoft.com/en-us/azure/kusto/query/summarizeoperator
And to create an alert, you may have to go to Azure portal -> YOURLOGANALYTICSWORKSPACE -> Monitoring tile -> Alerts -> Manager alert rules -> New alert rule -> Add condition -> Custom log search -> Paste any of the above queries under 'Search query' section -> Type '2' under 'Threshold value' parameter of 'Alert logic' section -> Click 'Done' -> Under 'Action Groups' section, select existing action group or create a new one as explained in the below mentioned article -> Update 'Alert Details' -> Click on 'Create alert rule'.
https://learn.microsoft.com/en-us/azure/azure-monitor/platform/action-groups
Hope this helps!! Cheers!! :)
To answer your question in the comments part, yes the alert insists on adding the bin function and that's the reason I have provided relevant query along with bin function by having '1s' and tried to explain about it in my previous answer.
If you put '1s' in bin function then you would fetch output from Log Analytics by aggregating value of any EventID in the timespan of 1second. So output would look something like shown below where aaaaaaa is considered as a VM name, x is considered as a particular time.
If you put '24h' instead of '1s' in bin function then you would fetch output from Log Analytics by aggregating value of any EventID in the timespan of 24hours. So output would look something like shown below where aaaaaaa is considered as a VM name, x is considered as a particular time.
So in this case, we should not be using '24h' in bin function along with 'any' aggregation because if we use it then we would see only one occurrence of output in 24hours of timespan and that doesn't help you to find out event occurrence count using the above provided query having 'any' for aggregation. Instead you may use 'count' aggregation instead of 'any' if you want to have '24h' in bin function. Then this query would look something like shown below.
Event
| where Source == "EventLog" and EventID == 6008
| where TimeGenerated > ago(24h)
| summarize AggregatedValue= count(EventID) by Computer, bin(TimeGenerated, 24h)
The output of this query would look something like shown below where aaaaaaa is considered as a VM name, x is considered as a particular time, y and z are considered as some numbers.
One other note is, all the above mentioned queries and outputs are in the context of setting up an alert based on an aggregated value i.e., setting up an alert when opting 'metric measurement' under alert logic based on section. In other words, aggregatedvalue column is expected in alert query when you opt 'metric measurement' under alert logic based on section. But when you say 'you get a count of the events' that means If i am not wrong, may be you are opting 'number of results' under alert logic based on section, which would not required any aggregation column in the query.
Hope this clarifies!! Cheers!!

Azure Stream Analytics job triggers False Positives missing assets on job start

On starting my on Azure Stream Analytics (ASA) job I get several False Positives (FP) and I want to know what causes this.
I am trying to implement asset tracking in ASA as disccussed in another question. My specific use case is that I want to trigger events when an asset has not send a signal in the last 70 minutes. This works fine when the ASA job is running but triggers false positives on starting the job.
For example when starting the ASA-job at 2017-11-07T09:30:00Z. The ASA-job gives an entry with MostRecentSignalInWindow: 1510042968 (=2017-11-07T08:22:48Z) for name 'A'. while I am sure that there is another event for name 'A' with time: '2017-11-07T08:52:49Z' and one at '2017-11-07T09:22:49Z in the eventhub.
Some events arrive late due to the event ordering policy:
Late: 5 seconds
Out-of-order: 5 seconds
Action: adjust
I use the below query:
WITH
Missing AS (
SELECT
PreviousSignal.name,
PreviousSignal.time,
FROM
[signal-eventhub] PreviousSignal
TIMESTAMP BY
time
LEFT OUTER JOIN
[signal-eventhub] CurrentSignal
TIMESTAMP BY
time
ON
PreviousSignal.name= CurrentSignal.certname
AND
DATEDIFF(second, PreviousSignal, CurrentSignal) BETWEEN 1 AND 4200
WHERE CurrentSignal.name IS NULL
),
EventsInWindow AS (
SELECT
name,
max(DATEDIFF(second, '1970-01-01 00:00:00Z', time)) MostRecentSignalInWindow
FROM
Missing
GROUP BY
name,
TumblingWindow(minute, 1)
)
For anyone reading this, this was a confirmed bug in Azure Stream Analytics and has now been resolved.

Resources