How to alert on increased "counter" value with 10 minutes alert interval - prometheus-alertmanager

So, I have monitoring on an error log file (via mtail). It simply counts error lines: mtail sums the number of new matching lines in the file.
I want to send an alert at most once every 10 minutes when new errors have occurred, not one alert for every single error.
Can you please provide exact values for these lines:
expr: increase(php_fpm_errors_total[10m]) > 0
for: 10m
I would appreciate doc links or an explanation.

The way you have it, the expression is evaluated every evaluation interval (default=1m), and the alert only fires after the expression has been true continuously for 10 minutes (for: 10m). There is also a property in Alertmanager called group_wait (default=30s): after the first alert triggers, it waits and groups all alerts triggered in that window into one notification. If you want a notification even for a single error, but don't want 1000 notifications for every single error, remove for: 10m and set group_wait=10m.
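A minimal sketch of the two pieces described above, assuming the metric name from the question; the group name and receiver name are hypothetical, and in practice these live in two separate files (the rule file loaded by Prometheus and the Alertmanager config):

```yaml
# Prometheus rule file: fire as soon as any new error appears in a 10m window
groups:
  - name: php-fpm            # hypothetical group name
    rules:
      - alert: PhpFpmErrors
        expr: increase(php_fpm_errors_total[10m]) > 0
        labels:
          severity: warning

# Alertmanager config: batch alerts into one notification per 10 minutes
route:
  receiver: ops-email        # hypothetical receiver name
  group_wait: 10m            # wait before sending the first notification
  group_interval: 10m        # minimum gap between updates for the same group
```

Note there is no for: clause here, so a single error is enough to fire the alert; the batching into one notification per 10 minutes happens on the Alertmanager side.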

Related

Creating an alert for long running pipelines

I currently have an alert set up for Data Factory that sends an email alert if a pipeline runs longer than 120 minutes, following this tutorial: https://www.techtalkcorner.com/long-running-azure-data-factory-pipelines/. So when a pipeline does in fact run longer than the expected time, I do receive an alert; however, I am also getting additional, unexpected alerts.
My query looks like:
ADFPipelineRun
| where Status =="InProgress" // Pipeline is in progress
| where RunId !in (( ADFPipelineRun | where Status in ("Succeeded","Failed","Cancelled") | project RunId ) ) // Subquery, pipeline hasn't finished
| where datetime_diff('minute', now(), Start) > 120 // It has been running for more than 120 minutes
I received an alert email on September 28th of course saying a pipeline was running longer than the 120 minutes but when trying to find the pipeline in the Azure Data Factory pipeline runs nothing shows up. In the alert email there is a button that says, "View the alert in Azure monitor" and when I go to that I can then press "View Query Results" above the shown query. Here I can re-enter the query above and filter the date to show all pipelines running longer than 120 minutes since September 27th and it returns 3 pipelines.
Something I noticed about these pipelines is the End time column.
I'm thinking that at some point the UTC time is not properly configured and for that reason, maybe the alert is triggered? Is there something I am doing wrong, or a better way to do this to avoid a bunch of false alarms?
To create preemptive warnings for long-running jobs:
Create the activity.
Click on a blank space in the pipeline canvas.
Follow the path: Settings > Elapsed time metric.
Refer to Operationalize Data Pipelines - Azure Data Factory.
I'm not sure if you're seeing false alerts. What you've shown here looks like the correct behavior.
You need to keep in mind:
Duration threshold should be offset by the time it takes for the logs to appear in Azure Monitor.
The email alert takes you to the query that triggered the event. Your query only shows "InProgress" statuses, so the End property is not set/updated. You'll need to extend your query to look at one of the other statuses to see the actual duration.
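A hedged sketch of such an extension; the DurationMinutes column is something added here for illustration, not part of the original query:

```kusto
// sketch: include terminal statuses so End is populated and the
// real duration can be computed
ADFPipelineRun
| where Status in ("Succeeded", "Failed", "Cancelled")
| extend DurationMinutes = datetime_diff('minute', End, Start)
| where DurationMinutes > 120
| project RunId, Start, End, Status, DurationMinutes
```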
Run another query with the RunId of the suspect runs to inspect the durations.
ADFPipelineRun
| where RunId == 'bf461c8b-0b1e-43c4-9cdf-7d9f7ccc6f06'
| distinct TimeGenerated, OperationName, RunId, Start, End, Status

Azure Stream Analytics job triggers False Positives missing assets on job start

On starting my Azure Stream Analytics (ASA) job I get several false positives (FP) and I want to know what causes this.
I am trying to implement asset tracking in ASA as discussed in another question. My specific use case is that I want to trigger events when an asset has not sent a signal in the last 70 minutes. This works fine when the ASA job is running, but triggers false positives on starting the job.
For example, when starting the ASA job at 2017-11-07T09:30:00Z, the job gives an entry with MostRecentSignalInWindow: 1510042968 (= 2017-11-07T08:22:48Z) for name 'A', while I am sure there are later events for name 'A' in the event hub: one at '2017-11-07T08:52:49Z' and one at '2017-11-07T09:22:49Z'.
Some events arrive late due to the event ordering policy:
Late: 5 seconds
Out-of-order: 5 seconds
Action: adjust
I use the below query:
WITH
Missing AS (
SELECT
PreviousSignal.name,
PreviousSignal.time
FROM
[signal-eventhub] PreviousSignal
TIMESTAMP BY
time
LEFT OUTER JOIN
[signal-eventhub] CurrentSignal
TIMESTAMP BY
time
ON
PreviousSignal.name = CurrentSignal.certname
AND
DATEDIFF(second, PreviousSignal, CurrentSignal) BETWEEN 1 AND 4200
WHERE CurrentSignal.name IS NULL
),
EventsInWindow AS (
SELECT
name,
max(DATEDIFF(second, '1970-01-01 00:00:00Z', time)) MostRecentSignalInWindow
FROM
Missing
GROUP BY
name,
TumblingWindow(minute, 1)
)
For anyone reading this, this was a confirmed bug in Azure Stream Analytics and has now been resolved.

IMAP (gmail?) returning incorrect UIDs to FETCH request

I'm fiddling with IMAP in Python currently and have noticed the following (tested using single and multiple messages in a folder):
1. Select folder "A" and fetch the UID of the message within
2. Move the message to another folder, "B", select that and fetch the new UID
3. Using the "B" UID, move the message back to folder "A"
4. Re-select folder "A" and fetch the new UID - as expected this gives a new UID for the message
5. Finally, issue a FETCH using the new UID from step 4
The issue is that in step 5 the command executes OK, but the server returns a different UID from the one specified in the request (normally, but not always, off by 1)! For example:
LEFC12 UID FETCH 65 (FLAGS...)
DEBUG:imapclient.imaplib:< * 3 FETCH (UID 64 ... {47}
The same happens with multiple messages - all of them are offset by the same amount.
If I have the process sleep for 20s (as in totally idle for 20s; if it retries every second it never comes back OK), the fetch then returns the correct UID. I'm not sure if this is a Gmail or IMAP thing; any pointers/help would be greatly appreciated!
EDIT: here's all of the imapclient logs for the sequence above: https://gist.github.com/Fizzadar/37cb1fa808ffb6594326bba293f6daab. I have noticed that this isn't consistent - if you repeat the above steps twice over it always fails, but just the once it fails randomly (~50%, leading me to believe this is a gmail-specific issue).
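One way to catch this in client code is to compare the UID in the untagged FETCH response against the UID that was requested, rather than trusting the response blindly. A small sketch (the function name and response parsing are my own, not part of imapclient's API):

```python
import re

def fetch_uid_offset(requested_uid, untagged_response):
    """Extract the UID from an untagged FETCH response line and return
    how far it is from the UID that was requested (0 = server agreed,
    None = no UID found in the response)."""
    match = re.search(r'\bUID (\d+)', untagged_response)
    if not match:
        return None
    returned_uid = int(match.group(1))
    return returned_uid - requested_uid

# The situation from the question: requested 65, server answered with 64
print(fetch_uid_offset(65, '* 3 FETCH (UID 64 FLAGS (\\Seen))'))  # -1
```

A non-zero offset would be the signal to retry (or sleep, as described above) instead of processing the wrong message.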

Azure OMS - Heartbeat Test - If the server is down for more than a day, no alerts will be generated

There is an issue with the current Heartbeat query we are using. The query works perfectly; however, I have observed an issue while setting up the alert.
Breakdown of the query:
Type=Heartbeat Computer in $ComputerGroups[NON-PROD_Group] |
measure max(TimeGenerated) as LastCall by Computer | where LastCall < NOW-5MINUTE
The query checks for heartbeats of computers in the group 'NON-PROD_Group'.
measure max(TimeGenerated) as LastCall by Computer: checks the time of the last heartbeat from each server and assigns that value to 'LastCall'. 'LastCall' now holds the time of the last heartbeat.
where LastCall < NOW-5MINUTES: checks whether the last heartbeat was more than 5 minutes before 'NOW'. The alert is triggered based on that.
I have set the TIME WINDOW to 24 hours. The issue is that alerts are only generated for occurrences between 'NOW-5MINUTES' and 24 hours ago; no alerts are generated if LastCall falls outside the time window.
So if a server is down for more than a day, no alerts are generated.
For instance, if one of the servers goes down on Friday evening, alert notifications will come in until Saturday evening (24 hours is the maximum time allowed); after that the alert clears and no more notifications are generated.
By Monday morning the alert would be cleared and everything would be reported as working fine.
Heartbeat
| summarize LastHeartbeat=max(TimeGenerated) by Computer, OSType, ResourceGroup, Resource, ResourceProvider, ResourceType
| where LastHeartbeat < ago(5m)
// If you are looking for any specific resource group, you may add in where condition like ResourceGroup == 'ecom'
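To also catch machines that have been silent for longer than the whole query window, one option is to join against a fixed list of expected computers, so that machines with no Heartbeat rows at all still surface. A hedged sketch; the machine names are assumptions:

```kusto
// hypothetical sketch: join heartbeats against a known list of expected
// computers so that machines silent for the whole window still alert
let ExpectedComputers = datatable(Computer: string)
    ["web01", "web02", "db01"];        // assumed machine names
ExpectedComputers
| join kind=leftouter (
    Heartbeat
    | summarize LastHeartbeat = max(TimeGenerated) by Computer
  ) on Computer
| where isnull(LastHeartbeat) or LastHeartbeat < ago(5m)
| project Computer, LastHeartbeat
```

With the left outer join, a computer missing from Heartbeat entirely shows up with a null LastHeartbeat and therefore still matches the alert condition.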

Recycling IIS 7 app pool at specific time

We are getting an error when we try to set the app pool to recycle every 7 days at a specific time; we want to recycle every 7 days at 3 A.M. The doc says this is possible by using the optional [d] argument.
http://technet.microsoft.com/en-us/library/cc754494(v=ws.10).aspx
Command :
C:\Windows\System32\inetsrv>appcmd set apppool /apppool.name:TempPool /+recycling.periodicRestart.schedule.[value='7.03:00:00']
Error Message:
Application Pools
There was an error while performing this operation.
Details:
Timespan value must be between 00:00:00 and 23:59:59 seconds inclusive, with a granularity of 60 seconds
Although this question is a bit old, I faced the same issue yesterday while writing some C# code to manipulate an application pool programmatically.
I found the sample for schedules in the doc at the following link, which reads "adding an application pool... then set the application pool to daily recycle at 3:00 A.M." This means we cannot specify a fixed time span for recycling by adding a schedule:
http://www.iis.net/configreference/system.applicationhost/applicationpools/add/recycling/periodicrestart/schedule/add#006
That's why it throws the exception asking for a time of day under 23:59:59.
When you want to specify a fixed recycling interval, you should use the time property at the periodicRestart level instead.
See this doc for samples of the various ways to achieve your requirement:
http://www.iis.net/configreference/system.applicationhost/applicationpools/add/recycling/periodicrestart#005
// add schedule to recycle at 3 am every day
appPool.Recycling.PeriodicRestart.Schedule.Clear();
appPool.Recycling.PeriodicRestart.Schedule.Add(new TimeSpan(3, 0, 0));
// set to recycle every 3 hours
appPool.Recycling.PeriodicRestart.Time = new TimeSpan(3, 0, 0);
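The same distinction applies on the command line. A hedged sketch of the two forms (the 7-day value follows the d.hh:mm:ss timespan format, and as far as I can tell the two settings cannot be combined into "every 7 days at 3 A.M." with built-in options):

```
:: schedule only accepts a time of day, so this is the valid daily form:
appcmd set apppool "TempPool" /+recycling.periodicRestart.schedule.[value='03:00:00']

:: for a fixed interval (e.g. every 7 days), set periodicRestart.time instead:
appcmd set apppool "TempPool" /recycling.periodicRestart.time:7.00:00:00
```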
