I have two questions about Azure Alerts.
You can create a rule that sends an alert when CPU or memory hits a condition like "is less than 1000 megabytes". But is there a way to set it up so it only sends an alert if the condition has been met for 5-10 minutes? Right now it sends an alert immediately when there is less than 1000 MB.
For NetworkWatcher, there is the condition "Whenever the average checks failed percent is greater than X%". How does Azure calculate that percentage? Say we take 50%: what is it looking at to trigger that 50%, or for that matter 100% or 1%?
I want an email notification for every logic app run with a Failed status, like in the screenshot below.
I tried to configure Runs Failed alerts on the logic app, but things are not very clear to me.
What should be the exact entries for Threshold value, Operator, Aggregation type, Period and Frequency to get an alert notification on every failed run?
For this requirement, I think you can choose Static for "Threshold" and set the condition as Greater than, Count, 0. Under "Evaluated based on", you can set 5 minutes as the "Aggregation granularity (Period)" and 5 minutes as the "Frequency of evaluation", as shown in the screenshot below:
The "Evaluated based on" settings you chose, a 24-hour period evaluated every 5 minutes, are not a good fit. Once the alert is triggered, its "Monitor condition" becomes "Fired", and until it is resolved the alert will not be triggered again. (For example, if your logic app fails at 1:00, the alert is triggered within 5 minutes, but it will not fire again for further failures while the 24-hour window, evaluated every 5 minutes, still contains that first failure.)
By the way, you can also test this yourself: create a logic app like the one below; it can be saved and will fail when it runs.
I have an autoscale rule that will not fire.
The scale-out rule says that if CPU Percentage is above 70%, add an instance. The time duration is 2 minutes and the cool-down period is 2 minutes.
When I built a metrics chart to compare the actual CPU percentage against the observed value, I can clearly see spikes in my CPU, but the observed value seems to average them out over a longer time period, and I don't know why. What setting in my scale rules controls the time period over which the rule averages?
Thanks for asking the question! You may want to look at Best practices for Autoscale.
Also, it’s important to understand the flapping process:
It is recommended to carefully choose different thresholds for scale-out and scale-in based on your actual situation; autoscale settings like the example below, with the same or very similar threshold values for the out and in conditions, are not recommended:
Take this as an example:
Increase instances by 1 count when Thread Count <= 600
Decrease instances by 1 count when Thread Count >= 600
Now please consider the following process:
Assume there are two instances to begin with and then the average number of threads per instance grows to 625.
Autoscale scales out adding a third instance.
Next, assume that the average thread count across instances falls to 575.
Before scaling in, autoscale tries to estimate what the final state would be if it scaled in: 575 x 3 (current instance count) = 1,725 total threads, and 1,725 / 2 (the final number of instances after scaling in) = 862.5 threads per instance. This means autoscale would have to scale out again immediately after it scaled in, if the average thread count remained the same or fell only slightly. And if it scaled out again, the whole process would repeat, leading to an infinite loop.
To avoid this situation (termed "flapping"), autoscale does not scale in at all. Instead, it skips the action and reevaluates the condition the next time the service's job executes. This confuses many people, because autoscale appears not to work when the average thread count is 575.
Estimation during a scale-in is intended to avoid "flapping" situations, where scale-in and scale-out actions continually go back and forth. Keep this behavior in mind when you choose the same thresholds for scale-out and in.
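To make the arithmetic above concrete, here is a small illustrative Python sketch of the kind of estimate autoscale performs before scaling in (the function and exact check are mine, not the service's actual implementation):

```python
def scale_in_would_flap(avg_per_instance, current_instances, scale_out_threshold):
    """Project the per-instance load after removing one instance and check
    whether it would immediately breach the scale-out threshold."""
    projected = (avg_per_instance * current_instances) / (current_instances - 1)
    return projected >= scale_out_threshold

# 575 threads/instance across 3 instances -> 1,725 total threads,
# or 862.5 per instance if scaled in to 2 instances.
print(scale_in_would_flap(575, 3, scale_out_threshold=600))  # True: scale-in is skipped
```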
We recommend choosing an adequate margin between the scale-out and in thresholds. As an example, consider the following better rule combination.
Increase instances by 1 count when CPU% >= 80
Decrease instances by 1 count when CPU% <= 60
To add to this, there is the cool-down period: after a scale-in/out operation has happened, the autoscale rule will not trigger again for the duration of the cool-down, even if the rule condition is still true (for example, CPU remains high). With a 2-minute cool-down, no further scale action is taken for 2 minutes after an operation, even if the rule condition is met.
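As a rough illustration of where these knobs live, here is a sketch using the Azure Monitor Python SDK (azure-mgmt-monitor): time_window is the period the metric is averaged over (which answers the question about the averaging period), time_grain is the sampling interval, and cooldown is the cool-down after a scale action. The resource IDs, region, and rule values are placeholders, and exact model/parameter names may vary slightly between SDK versions:

```python
# Sketch: scale out at CPU >= 80, scale in at CPU <= 60, with an explicit
# averaging window and cool-down. IDs are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    AutoscaleSettingResource, AutoscaleProfile, ScaleCapacity,
    ScaleRule, MetricTrigger, ScaleAction,
)

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
target_id = "<app-service-plan-or-vmss-resource-id>"

def cpu_rule(operator, threshold, direction):
    return ScaleRule(
        metric_trigger=MetricTrigger(
            metric_name="Percentage CPU",   # VMSS metric; App Service plans use "CpuPercentage"
            metric_resource_uri=target_id,
            time_grain="PT1M",              # granularity of the samples
            statistic="Average",
            time_window="PT5M",             # the period the rule averages over
            time_aggregation="Average",
            operator=operator,
            threshold=threshold,
        ),
        scale_action=ScaleAction(
            direction=direction,            # "Increase" or "Decrease"
            type="ChangeCount",
            value="1",
            cooldown="PT5M",                # no further action for 5 minutes
        ),
    )

setting = AutoscaleSettingResource(
    location="<region>",
    target_resource_uri=target_id,
    enabled=True,
    profiles=[
        AutoscaleProfile(
            name="default",
            capacity=ScaleCapacity(minimum="2", maximum="5", default="2"),
            rules=[
                cpu_rule("GreaterThanOrEqual", 80, "Increase"),
                cpu_rule("LessThanOrEqual", 60, "Decrease"),
            ],
        )
    ],
)

client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)
client.autoscale_settings.create_or_update(resource_group, "cpu-autoscale", setting)
```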
I use Application Insights "Availability" feature to check a web site availability and send an alert if it is down.
Now Application Insights sends an alert every 5 minutes, even though the "alert failure time window" is 15 minutes. The test frequency is 5 minutes.
So I get an alert after 5 minutes, then after 10 minutes, then after 15 minutes! I get 3 alerts when I need only one alert, after 15 minutes. It looks like a bug to me.
How can I prevent the Application Insights Availability feature from sending alerts every 5 minutes?
The email (notification) is sent the moment the alert condition is satisfied. It doesn't wait for the alert failure time window.
Example: if the alerting rule is set to send a notification when 3 locations out of 5 turn red, and 3 locations turn red within the first second, the notification is sent during that same second. It will not wait 5 (or 15) minutes.
This is by design with the goal to reduce TTD (time to detect).
There are two ways to handle noise:
Configure retries (the test will retry 2 times during a red => green state switch)
Increase the number of locations required to trigger the alert (for instance, 14 out of 16)
Either way, only one notification is supposed to be sent, not one every 5/15 minutes. Multiple notifications suggest either a bug in tracking the current state of an alert (a bug in the product) or an application that fails intermittently (so the alerting rule constantly changes state green => red => green => ..., and an email is sent on every transition). Do you get an alert every 5 minutes when the tests are red all the time?
The alert failure time window defines what a failed location means. A 5-minute test interval with a 5-minute alert failure window means that only the last result determines whether a location has failed. A 5-minute test interval with a 15-minute alert failure window means that the last 3 results determine whether a location has failed; if any one of those 3 test runs failed, the location is considered failed (even though the 2 results after it might have been successes).
Increasing the alert failure time window therefore makes the alerting rule more aggressive (and noisier for intermittently failing apps).
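A tiny illustrative sketch of how the two settings interact, using the numbers from this question (arithmetic only, not how the service is implemented internally):

```python
test_interval_min = 5          # how often each location runs the availability test
alert_failure_window_min = 15  # the "alert failure time window"

# Number of recent results per location that the rule considers.
results_considered = alert_failure_window_min // test_interval_min
print(results_considered)  # 3 -> a location counts as failed if any of its last
                           # 3 results failed, even if the most recent one passed
```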
I'm relatively new to Azure and am trying to see if there's a way to create notifications in (or close to) real time whenever only certain exceptions occur, using Application Insights.
Right now I'm able to track exceptions and to trigger metric alerts when a threshold of exceptions occurs over a certain amount of time, but I can't seem to figure out how to make these alerts sensitive to only certain kinds of exceptions. My first thought was to add properties to an exception when tracking it with the telemetry client's 'TrackException' method, then create an alert specific to that property, but I'm still unable to figure out how to do it.
Any help is appreciated.
A couple of years later, there's now a way to mostly do this with built-in functionality.
There isn't an easy way to do this on every exception as it occurs, though. Some apps have literally billions of exceptions per day, so evaluating your function every time an exception occurs would be pretty expensive.
Things like this are generally done with custom alerts that do a query and see if anything that meets the criteria exists in the new time period.
You'd do this with "log alerts", documented here: https://learn.microsoft.com/en-us/azure/azure-monitor/platform/alerts-unified-log
Instead of getting an email every time a specific exception occurred, your query would run every N minutes, and if any rows meet the criteria you'd get a single mail (or whatever action you have the alert configured to take), and you keep getting mails for every N-minute window in which rows that meet the criteria are found.
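As an illustration, the condition behind such a log alert is just a Kusto query over the Application Insights exceptions table. Something along these lines (the exception type is a placeholder, not from the original post) returns rows only when that specific exception occurred in the evaluation window, and the alert is set to fire when the number of results is greater than 0:

```python
# The Kusto query you would paste into the log alert rule's condition.
# The exception type name is a placeholder for whatever you care about.
LOG_ALERT_QUERY = """
exceptions
| where timestamp > ago(5m)
| where type == "System.Net.Http.HttpRequestException"
| project timestamp, type, outerMessage, operation_Name
"""
# Configure the rule to fire when the number of results is > 0,
# evaluated every 5 minutes.
```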
There are two options:
Call TrackMetric (with some metric name) when an exception of the particular type happens, in addition to TrackException. Then configure an alert based on this metric (there is a rough sketch of this below).
Write a tool/service/Azure Function which every few minutes runs a query in Application Insights Analytics and posts the result as a metric (using TrackMetric). Then configure an alert from the portal.
Right now the AI team is working on providing #2 out of the box.
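A minimal sketch of option 1 in Python, using the applicationinsights package; the exception type, metric name, and instrumentation key are placeholders, and the same TrackException/TrackMetric pattern applies in the .NET TelemetryClient:

```python
from applicationinsights import TelemetryClient

tc = TelemetryClient("<instrumentation-key>")  # placeholder key

def do_work():
    raise ValueError("example failure")        # stand-in for your real code

try:
    do_work()
except ValueError:                             # the exception type you care about
    # Keep the normal exception telemetry, tagged with a custom property...
    tc.track_exception(properties={"alert": "critical"})
    # ...and also emit a named metric so a metric alert can target just this case.
    tc.track_metric("CriticalExceptions", 1)
    tc.flush()
```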
I am testing Azure Application Insights alert functionality. It seems to be either buggy or I don't know how to use it.
If I create a new alert based on the metric 'Server Exceptions', it seems to work once and then never again. Once it fires, it seems to go into an 'Active' state, shown with an orange triangle and an exclamation mark. See the image below. I created a new one that I haven't triggered, and as can be seen in the image it has a green circle with a tick.
This sort of implies to me that an alert won't fire again until one 'acknowledges' the alert, which is not a bad idea, but I can't see how to do that.
Edit :
I have just tried using the 'Exception rate' metric as suggested, but I think the minimum threshold to fire the alert would be an average of 1 exception per second over a 5-minute period.
I must say it seems strange that my use case isn't handled. I have a lightweight Web API service that is so simple it should never fail, but it could, and if an exception occurs I want to receive an alert straight away.
The alert is supposed to resolve, and the state is supposed to return to green, when the condition of the alert is no longer fulfilled.
This is exceptionally hard to achieve with "Count" metrics because they go up and up and almost never down. It means that, once fired, the alert won't resolve because the value of the metric stays over the threshold all the time.
You can try to set an alert on the "Rate" metric instead and you should see that the state is returning to green when the "Rate" is within the limits you set.
This is now fixed. Please let us know if you see any issues. Some things to keep in mind:
Alert rules are evaluated on a sliding window: an alert would trigger/resolve based on how the condition evaluates on a sliding window from the instant a sample arrives.
A caveat to the above for exception count based alert rules: we will resolve an alert if there are no exceptions reported for the time window configured in the rule.
Note: this is different from metrics based rules – lack of data does not result in the alert being resolved for those.
"Server exception" metric works as OP expects now in 2018. My use case below:
For the goal of getting an email whenever an Exception happened.
Use "Server exception" metric.
That metric is smart enough to auto-resolve if the error has not occurred again after waiting one period's length of time from the initial alert.
So you'll get the initial "Alert", and then, after 5 minutes with no exceptions, it returns to a "Healthy" state.
And since it auto-resolved, if the error happens again tomorrow it will do the "Alert" again.
Note this was using App Insights with a Function App. The Function App Failure metric had problems and wasn't reliable for this (Azure kept logging 0.2 exceptions/s and treating that as over the 1-in-5-minutes threshold...)