I am testing Azure Application Insights alert functionality. It seems to be either buggy or I don't know how to use it.
If I create a new alert based on the 'Server Exceptions' metric, it seems to work once and then never again. Once it fires, it goes into an 'Active' state, shown with an orange triangle and an exclamation mark. See the image below. I created a new alert that I haven't triggered yet, and as can be seen in the image it has a green circle with a tick.
This sort of implies to me that an alert won't fire again until one 'acknowledges' the alert, which is not a bad idea, but I can't see how to do that.
Edit:
I have just tried using the 'Exception Rate' metric as suggested, but the minimum threshold that would fire the alert appears to be an average of 1 exception per second over a 5-minute period.
I must say it seems strange that my use case isn't handled. I have a lightweight Web API service that is so simple it should never fail, but it could, and if an exception does occur I want to receive an alert straight away.
The alert is supposed to resolve, and its state return to green, when the alert condition is no longer met.
This is exceptionally hard to achieve with "Count" metrics, because they go up and up and almost never down. It means that, once fired, the alert won't resolve, because the metric value stays above the threshold the whole time.
You can try setting an alert on the "Rate" metric instead; you should see the state return to green when the rate is back within the limits you set.
This is now fixed. Please let us know if you see any issues. Some things to keep in mind:
Alert rules are evaluated on a sliding window: an alert would trigger/resolve based on how the condition evaluates on a sliding window from the instant a sample arrives.
A caveat to the above for exception-count-based alert rules: we will resolve an alert if no exceptions are reported for the time window configured in the rule.
Note: this is different from metric-based rules, where a lack of data does not result in the alert being resolved.
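To make the sliding-window behavior concrete, here is a toy C# illustration (this is not Azure code, just the evaluation idea described above): the rule counts as "fired" while the number of exceptions inside the trailing window exceeds the threshold, and it resolves once the window slides past them.

using System;
using System.Collections.Generic;
using System.Linq;

class SlidingWindowAlert
{
    private readonly List<DateTime> _samples = new List<DateTime>();
    private readonly TimeSpan _window;
    private readonly int _threshold;

    public SlidingWindowAlert(TimeSpan window, int threshold)
    {
        _window = window;
        _threshold = threshold;
    }

    // Record that an exception was reported at the given time.
    public void Record(DateTime when) => _samples.Add(when);

    // "Fired" while the count inside the trailing window exceeds the threshold;
    // it resolves on its own once no samples remain inside the window.
    public bool IsFired(DateTime now) => _samples.Count(t => t > now - _window) > _threshold;
}

With a 5-minute window and a threshold of 0, one recorded exception makes IsFired return true immediately, and it returns false again when evaluated more than 5 minutes later with no new exceptions, which mirrors the resolve behavior described above.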
"Server exception" metric works as OP expects now in 2018. My use case below:
For the goal of getting an email whenever an Exception happened.
Use "Server exception" metric.
That metric is smart enough to auto-resolve after waiting the period's length of time after the initial alert, if the error has not occurred again.
So you'll have the initial "Alert", then 5 minutes later of no Exceptions, it returns a "Healthy" state.
And since it auto-resolved, if the error happens again tomorrow it will do the "Alert" again.
Note this was using App Insights with a Function App. The Function App Failure metric had problems and wasn't reliable for this (Azure kept reporting 0.2 exceptions/s and treating that as over the "1 in 5 minutes" threshold...).
Related
I have two questions about Azure Alerts.
You can make a rule that sends an alert when CPU or memory hits something like "Is less than 1000 megabytes". But is there a way to set it up so that it only sends an alert if the condition is met for 5-10 minutes? Right now it sends an alert right away when there is less than 1000 MB.
For Network Watcher, there is the condition "Whenever the average checks failed percent is greater than X%". How does Azure calculate that percentage? Say you set it to 50%: what is it looking at to decide that 50% has been reached, or 100%, or 1%?
I want an email notification for every logic app run with a Failed status, like the screenshot below.
I tried to configure a "Runs Failed" alert on the logic app, but things are not very clear to me.
What should the exact entries be for Threshold value, Operator, Aggregation type, Period, and Frequency to get an alert notification for every failed run?
For this requirement, I think you can choose Static in "Threshold" and set the condition to Greater than, Count, 0. In "Evaluated based on", you can set 5 minutes as the "Aggregation granularity (Period)" and 5 minutes as the "Frequency of evaluation", as shown in the screenshot below:
The "Evaluated based on" you choose as 24hours and every 5 minutes is not particularly good. Because once the alert triggered, its "Monitor condition" will become "fired", and if it hasn't been solved, the alert will not be triggered again.(For example, your logic failed on 1:00, the alert will be triggered in 5 minutes. But it will not be triggered again if there is a failure during the last 24 hour when evaluate every 5 minutes).
By the way, you can also test it by yourself. You can create a logic app as below, it is allowed to be saved and will fail when run it.
I am trying to troubleshoot a problem where I run an Azure Function locally on my instance and have it disabled in the Portal. After sending some data through, I can see that it successfully hits my local Azure Function, but it never hits it again afterwards. Strangely enough, the data still appears to go through my chain of Queue - Function - Queue - Function, but it never hits the breakpoints on my local machine after the first successful run. Triple-checking the Portal, I can see that it is definitely disabled, which leads me to believe there might be another instance of the Azure Function running somewhere. I've confirmed that no other devs are working on it, so I've ruled that out too...
Looking at https://[MY_FUNCTION_NAME].scm.azurewebsites.net/azurejobs/#/functions, I see that there seem to be duplicates of some of my functions, with varying statistics between the repeats. My guess is that Azure might be tracking my local instances when I start them, but I see the "Successful" green numbers go up on both versions of the function when I pass data through. I've blocked out the function names but marked the matching ones with matching colors (the blacked-out bars are just single functions I didn't bother to color-code). The red circles indicate the functions of interest that have different success statistics.
Has anyone else run into this issue?
It turns out there were duplicate functions in a deployment Slot... Someone put them there to set up deployment options, but they left the project and never documented it.
Hope this saves someone some frustrations at some point!
I'm relatively new to Azure and am trying to see if there's a way to get notifications in real time (or close to it) when only certain exceptions occur, using Application Insights.
Right now I'm able to track exceptions and to trigger metric alerts when a threshold of exceptions occurs over a certain amount of time, but I can't figure out how to make these alerts sensitive to only certain kinds of exceptions. My first thought was to add properties to an exception when I track it with the telemetry client's TrackException method, and then create an alert specific to that property, but I'm still unable to figure out how to do it.
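Roughly what I had in mind, as a sketch (assuming the standard TelemetryClient API; the class name and the "ErrorType" property are just placeholders I made up):

using System;
using System.Collections.Generic;
using Microsoft.ApplicationInsights;

public class OrderProcessor // placeholder class, just to show the idea
{
    private readonly TelemetryClient _telemetry = new TelemetryClient();

    public void Process()
    {
        try
        {
            throw new InvalidOperationException("stand-in for real work that can fail");
        }
        catch (Exception ex)
        {
            // "ErrorType" is a made-up property name; it shows up as a custom property
            // on the exception telemetry, which I was hoping an alert could filter on.
            _telemetry.TrackException(ex, new Dictionary<string, string>
            {
                { "ErrorType", "CriticalFailure" }
            });
            throw;
        }
    }
}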
Any help is appreciated.
A couple of years later, there's now a way to mostly do this with built-in functionality.
There isn't an easy way to do this on every exception as it occurs, though. Some apps have literally billions of exceptions per day, so evaluating your function every time an exception occurs would be pretty expensive.
Things like this are generally done with custom alerts that run a query and check whether anything meeting the criteria exists in the new time period.
You'd do this with "log alerts", documented here: https://learn.microsoft.com/en-us/azure/azure-monitor/platform/alerts-unified-log
Instead of getting an email every time a specific exception occurs, your query runs every N minutes; if any rows meet the criteria, you get a single mail (or whatever you have the alert configured to do), and you keep getting mails for every N-minute window in which matching rows are found.
There are two options:
1. Call TrackMetric (with some metric name) when an exception of the particular type happens, in addition to TrackException, then configure an alert based on that metric (see the sketch below).
2. Write a tool/service/Azure Function that runs a query in Application Insights Analytics every few minutes and posts the result as a metric (using TrackMetric), then configure the alert from the portal.
Right now the Application Insights team is working on providing #2 out of the box.
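A minimal sketch of option #1, assuming the standard TelemetryClient API (the metric name, the exception type, and the class are just examples):

using System;
using Microsoft.ApplicationInsights;

public static class PaymentGateway // example class, for illustration only
{
    private static readonly TelemetryClient Telemetry = new TelemetryClient();

    public static void Charge()
    {
        try
        {
            // stand-in for the real call that can time out
            throw new TimeoutException("gateway did not respond");
        }
        catch (TimeoutException ex)
        {
            Telemetry.TrackException(ex);
            // Emit a named metric only for this exception type, then configure the
            // portal alert on "PaymentGatewayTimeouts" instead of on all exceptions.
            Telemetry.TrackMetric("PaymentGatewayTimeouts", 1);
            throw;
        }
    }
}

Because the metric only moves when that particular exception type is hit, the alert threshold can stay as simple as "greater than 0 over 5 minutes".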
I've just started playing around with NServiceBus on Azure, and for some reason it takes a long time to get through the first-level retries when a message handler throws an exception. With retries set to 5, it takes 20+ minutes before the second-level retries kick in.
What is causing the delay?
Here's how I'm configuring the bus:
Configure.Transactions.Advanced(s =>
{
    // Azure storage queues don't support distributed transactions,
    // so DTC and the handler TransactionScope are turned off
    s.DisableDistributedTransactions();
    s.DoNotWrapHandlersExecutionInATransactionScope();
});

Configure.With()
    .AutofacBuilder(container)
    .DefiningCommandsAs(t => t.IsCommand())
    .DefiningEventsAs(t => t.IsEvent())
    .XmlSerializer()
    .MessageForwardingInCaseOfFault()
    .AzureConfigurationSource()
    .UseTransport<AzureStorageQueue>()
    .AzureDiagnosticsLogger()
    .AzureMessageQueue()
    .AzureSubcriptionStorage()
    .UseAzureTimeoutPersister()
    .UnicastBus()
    .RunHandlersUnderIncomingPrincipal(false);
FYI: I'm using NServiceBus built from the develop branch as of today and running in the emulator.
Oh, I misread the question; I thought it was taking 20 minutes after the last retry for the second level to kick in. In that case I know what this is, and it's configurable!
To support batching (to lower the cost), the message invisible time is calculated by multiplying the individual MessageInvisibleTime by the BatchSize. The default MessageInvisibleTime is 30000 milliseconds and the default BatchSize is 10, which works out to 5 minutes per attempt. Multiply that by 5 first-level retries and you end up with 25 minutes before the first-level retries are exhausted and the second-level retries kick in.
You can reconfigure this if you like: MessageInvisibleTime and BatchSize are properties on AzureQueueConfig, and MaxRetries sits on TransportConfig (in 4.0) or MsmqTransportConfig (in 3.x).
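Just to make that arithmetic explicit (a throwaway snippet using the defaults mentioned above, not NServiceBus API):

using System;

class RetryDelayMath
{
    static void Main()
    {
        var messageInvisibleTime = TimeSpan.FromMilliseconds(30000); // default MessageInvisibleTime
        const int batchSize = 10;                                    // default BatchSize
        const int firstLevelRetries = 5;                             // the 5 retries from the question

        var perAttempt = TimeSpan.FromTicks(messageInvisibleTime.Ticks * batchSize);
        var untilSecondLevel = TimeSpan.FromTicks(perAttempt.Ticks * firstLevelRetries);

        Console.WriteLine(perAttempt);       // 00:05:00 per attempt
        Console.WriteLine(untilSecondLevel); // 00:25:00 before second-level retries kick in
    }
}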
Can you open an issue on GitHub for this, with a repro if possible? http://www.github.com/nservicebus/nservicebus
I suspect the delay comes from the Azure timeout persister, as that is the one responsible for managing the time between retries, yet 20 minutes seems like a really odd number, so I have no immediate explanation for the observed behavior.
In the meantime, can you try using the in-memory timeout persister and see if the issue disappears? That would confirm my hypothesis.
I was under the impression that first-level retries did not need a timeout persister (I wasn't even aware of its existence, to be honest) and that first-level retries were driven only by the peek lock/invisible time of messages in the Azure queue.
For second-level retries I would expect the timeout persister to play a role (now that I know it exists...).
Yves, correct me if I am wrong.