Azure OMS - Heartbeat Test - If the server is down for more than a day, no alerts will be generated

There is an issue with the current Heartbeat query we are using. The query itself works, but I have observed an issue while setting up the alert.
Breakdown of the query:
Type=Heartbeat Computer in $ComputerGroups[NON-PROD_Group] |
measure max(TimeGenerated) as LastCall by Computer | where LastCall < NOW-5MINUTES
The query checks for the Heartbeat of each computer in the group ‘NON-PROD_Group’.
measure max(TimeGenerated) as LastCall by Computer: checks the time of the last Heartbeat occurrence from each server and assigns that value to the variable ‘LastCall’. ‘LastCall’ now holds the time of the last heartbeat.
where LastCall < NOW-5MINUTES: checks whether the last heartbeat was more than 5 minutes before ‘NOW’. The alert is triggered based on that.
I have set the TIME WINDOW to 24 hours. The issue here is that an alert is generated only for occurrences that fall between ‘NOW-5MINUTES’ and 24 hours ago; no alerts are generated if LastCall falls outside that time window.
If the server is down for more than a day, no alerts will be generated.
For instance, if one of the servers goes down on Friday evening, alert notifications will come in until Saturday evening (24 hours is the maximum time window allowed); after that the alert clears and no more notifications are generated.
By Monday morning, the alert would be cleared and would report that everything is working fine.

Heartbeat
| summarize LastHeartbeat=max(TimeGenerated) by Computer, OSType, ResourceGroup, Resource, ResourceProvider, ResourceType
| where LastHeartbeat < ago(5m)
// To target a specific resource group, add a where condition such as: | where ResourceGroup == 'ecom'
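For illustration, here is the same query with that filter applied; 'ecom' is just the example value from the comment, so substitute your own resource group name:

Heartbeat
| where ResourceGroup == 'ecom' // example resource group from the comment above
| summarize LastHeartbeat=max(TimeGenerated) by Computer, OSType, ResourceGroup, Resource, ResourceProvider, ResourceType
| where LastHeartbeat < ago(5m)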

Related

How to properly create once-a-day Azure Log Alert for pod errors?

I created an Azure Alert using a query (KQL - Kusto Query Language) reading from the log. That is, it's a Log Alert.
After a few minutes, the alert was triggered (as I expected based on my condition).
My condition checks if there are Pods of a specific name in Failed state:
KubePodInventory
| where TimeGenerated between (now(-24h) .. now())
| where ClusterName == 'mycluster'
| where Namespace == 'mynamespace'
| where Name startswith "myjobname"
| where PodStatus == 'Failed'
| where ContainerStatusReason == 'Completed' //or 'Error' or doesn't matter? (there are duplicated entries, one with Completed and one with Error)
| order by TimeGenerated desc
These errors stay in the log, and I only want to catch them (alert on them) once per day; that is, I check whether there is at least one entry in the log (the threshold) and then fire the alert.
Is the log query evaluated every time there is a new entry in the log, or is it evaluated at a set frequency? I could not find a frequency for checking alerts specified in the Azure portal, so maybe the alert condition(s) are evaluated every time there is something new in the log?
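As a side note on the duplicated entries mentioned in the query comment (one row with Completed and one with Error): a purely illustrative way to keep only the latest record per pod is sketched below. It reuses the same KubePodInventory columns and filter values from the query above and is not a recommendation on which ContainerStatusReason to filter on:

KubePodInventory
| where TimeGenerated between (now(-24h) .. now())
| where ClusterName == 'mycluster'
| where Namespace == 'mynamespace'
| where Name startswith "myjobname"
| where PodStatus == 'Failed'
| summarize arg_max(TimeGenerated, *) by Name // keep only the most recent row per pod
| order by TimeGenerated desc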

Creating an alert for long running pipelines

I currently have an alert set up for Data Factory that sends an email alert if a pipeline runs longer than 120 minutes, following this tutorial: https://www.techtalkcorner.com/long-running-azure-data-factory-pipelines/. So when a pipeline does in fact run longer than the expected time, I do receive an alert; however, I am also getting additional, unexpected alerts.
My query looks like:
ADFPipelineRun
| where Status =="InProgress" // Pipeline is in progress
| where RunId !in (( ADFPipelineRun | where Status in ("Succeeded","Failed","Cancelled") | project RunId ) ) // Subquery, pipeline hasn't finished
| where datetime_diff('minute', now(), Start) > 120 // It has been running for more than 120 minutes
I received an alert email on September 28th, of course saying a pipeline was running longer than 120 minutes, but when I try to find the pipeline in the Azure Data Factory pipeline runs, nothing shows up. In the alert email there is a button that says "View the alert in Azure monitor", and when I go to that I can then press "View Query Results" above the shown query. There I can re-enter the query above and filter the date to show all pipelines running longer than 120 minutes since September 27th, and it returns 3 pipelines.
Something I noticed about these pipelines is the end time column:
I'm thinking that at some point the UTC time is not properly configured and for that reason, maybe the alert is triggered? Is there something I am doing wrong, or a better way to do this to avoid a bunch of false alarms?
To create preemptive warnings for long-running jobs:
Create the activity.
Click on the blank space.
Follow the path: Settings > Elapsed time metric.
Refer to Operationalize Data Pipelines - Azure Data Factory.
I'm not sure if you're seeing false alerts. What you've shown here looks like the correct behavior.
You need to keep in mind:
The duration threshold should be offset by the time it takes for the logs to appear in Azure Monitor.
The email alert takes you to the query that triggered the event. Your query is only showing "InProgress" statuses, so the End property is not set/updated. You'll need to extend your query to look at one of the other statuses to see the actual duration.
Run another query with the RunId of the suspect runs to inspect the durations.
ADFPipelineRun
| where RunId == 'bf461c8b-0b1e-43c4-9cdf-7d9f7ccc6f06'
| distinct TimeGenerated, OperationName, RunId, Start, End, Status
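To make the point about looking at the other statuses concrete, here is a hedged sketch (not from the original answer) that computes the actual duration of completed runs; PipelineName is assumed to be available on ADFPipelineRun, and the 120-minute threshold mirrors the original query:

ADFPipelineRun
| where Status in ("Succeeded","Failed","Cancelled") // completed runs have End populated
| extend DurationMinutes = datetime_diff('minute', End, Start)
| where DurationMinutes > 120
| project TimeGenerated, RunId, PipelineName, Start, End, Status, DurationMinutes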

Unable to set Azure Dashboard card to use "Set in Query"

I have a query to check average response times over:
Last 24 hours
24-192 hours
The difference between them as a percentage
let requests0to24HoursAgo = requests
| where timestamp > ago(24h)
| summarize last0to24HoursAverageRequestDuration=avg(duration), id=1;
let requests24to192HoursAgo = requests
| where timestamp > ago(192h)
| where timestamp < ago(24h)
| summarize last24to192HoursAverageRequestDuration=avg(duration), id=1;
let diff = requests0to24HoursAgo
| join
requests24to192HoursAgo
on id
| extend Diff = (last0to24HoursAverageRequestDuration - last24to192HoursAverageRequestDuration) / last24to192HoursAverageRequestDuration * 100
| project
["Average response (last 0-24 hours)"]=last0to24HoursAverageRequestDuration,
["Average response (last 24-192 hours)"]=last24to192HoursAverageRequestDuration,
Diff;
diff
This works perfectly in the Logs section in Azure, but as soon as I pin the query to a dashboard, it's unable to run it with the date range "Set in query" and returns NaN for 2 of the values.
When I click "Open Editing Pane", set it to "Set in Query" and run it, it works. When I then click "Apply", it is still broken on the dashboard.
As per the documentation, in Log Analytics a default time range of 24 hours is applied to all queries.
We have tested this in our local environment; we tried overriding the time range parameter using the dashboard tile settings, which didn't help. The request you made looks like a feature request.
I would suggest you submit it on the feedback forum and also raise the same issue over on Microsoft Q&A.
I stumbled across the same issue when the timespan I set within the query (| where timestamp > ago(7d)) was ignored in the dashboard.
I've tested the approach with the tile settings, as VenkateshDodda-MT mentioned:
Open the tile settings in the dashboard (-> Configure tile settings)
Tick Override the dashboard time settings at the tile level.
Select a timespan greater than 24h.
Although there is no Set in query option like in the query editor, it would be enough to set a timespan of 30 days in your case.
I've also successfully tested it with your query.

How to alert on increased "counter" value with 10 minutes alert interval

So, I have monitoring on an error log file (mtail). It just counts the number of error lines, and mtail sums the number of new lines in the file.
I want to send an alert when new error(s) have occurred, but only once every 10 minutes, not for every single error.
Please, can you provide exact values for these lines:
expr: increase(php_fpm_errors_total[10m]) > 0
for: 10m
I would appreciate it if you could provide some doc links or an explanation.
The way you have it, it will alert only if there are new errors every time it evaluates (default = 1m) for 10 minutes straight, and then trigger an alert. There is also a property in Alertmanager called group_wait (default = 30s): after the first triggered alert it waits and groups all alerts triggered in that period into one notification. You can remove the for: 10m and set group_wait: 10m if you want to send a notification even when there is just 1 error, but don't want 1000 notifications for every single error.
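For illustration only, a minimal sketch of that suggestion follows; the group name, alert name, and receiver name are made up, and only the expr, the dropped for: clause, and group_wait come from the discussion above (the rule and the Alertmanager route live in separate config files):

# Prometheus alerting rule: fires as soon as new errors appeared in the last 10 minutes
groups:
- name: mtail-errors              # illustrative group name
  rules:
  - alert: NewLogErrors           # illustrative alert name
    expr: increase(php_fpm_errors_total[10m]) > 0
    # no "for:" clause, per the suggestion above

# Alertmanager route: group all alerts triggered within the wait window into one notification
route:
  receiver: default-email         # illustrative receiver name
  group_wait: 10m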

Getting duplicate records when joining in KQL

We have a requirement to get the status of a Windows service when it is started and stopped. To do that I have written a query, but I am facing an issue when joining 2 tables to get the output.
I have tried using inner and left outer joins but am still getting duplicates.
Event
| where EventLog == "System" and EventID == 7036 and Source == "Service Control Manager"
| parse kind=relaxed EventData with * '<Data Name="param1">' Windows_Service_Name '</Data><Data Name="param2">' Windows_Service_State '</Data>' *
| where Windows_Service_State == "running" and Windows_Service_Name == "Microsoft Monitoring Agent Azure VM Extension Heartbeat Service"
| extend startedtime = TimeGenerated
| join (
Event
| where EventLog == "System" and EventID == 7036 and Source == "Service Control Manager"
| parse kind=relaxed EventData with * '<Data Name="param1">' Windows_Service_Name '</Data><Data Name="param2">' Windows_Service_State '</Data>' *
| where Windows_Service_State == "stopped" and Windows_Service_Name == "Microsoft Monitoring Agent Azure VM Extension Heartbeat Service"
| extend stoppedtime = TimeGenerated
) on Computer
| extend downtime = startedtime - stoppedtime
| project Computer, Windows_Service_Name, stoppedtime, startedtime, downtime
| top 10 by Windows_Service_Name desc
We want to get the number of times that the service started and stopped. If the service restarted multiple times in a day, we are getting duplicate timings in startedtime when joining; please have a look at this link (https://ibb.co/JzqxjC0)
I am not sure I fully understand what is going on, since I don't have access to the data. But I can see you are using the default join flavor.
The default is inner unique:
The inner-join function is like the standard inner-join from the SQL world. An output record is produced whenever a record on the left side has the same join key as the record on the right side.
This means a new row in the result is created for every match between the left and the right side. Therefore, let's assume you have a computer that was restarted twice, so it has 2 lines of stopped and 2 lines of running. That will produce 4 rows in the Kusto answer.
Looking at your picture, that makes sense to me, because you have lines with negative downtime, which I guess should not be possible.
What I would do is look for an identifier that is unique for every computer run. Then you can join on that, and be safe from generating data that you don't want.
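The post doesn't name such an identifier, so as a purely illustrative alternative (not the answerer's method), here is a sketch that pairs each "running" event with the preceding "stopped" event per computer using prev() over a sorted row set; it reuses the same parse expression and service name from the question:

Event
| where EventLog == "System" and EventID == 7036 and Source == "Service Control Manager"
| parse kind=relaxed EventData with * '<Data Name="param1">' Windows_Service_Name '</Data><Data Name="param2">' Windows_Service_State '</Data>' *
| where Windows_Service_Name == "Microsoft Monitoring Agent Azure VM Extension Heartbeat Service"
| where Windows_Service_State in ("running", "stopped")
| sort by Computer asc, TimeGenerated asc // sort serializes the rows so prev() can be used
| extend prevState = prev(Windows_Service_State), prevTime = prev(TimeGenerated), prevComputer = prev(Computer)
| where Windows_Service_State == "running" and prevState == "stopped" and prevComputer == Computer
| project Computer, Windows_Service_Name, stoppedtime = prevTime, startedtime = TimeGenerated, downtime = TimeGenerated - prevTime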
