I've defined an availability test on a Function App (called watchdog) accessible with a function key. The watchdog performs a URL ping on other health endpoints (protected by AAD). The JSON response is then evaluated: if one of the watched functions reports an unhealthy or degraded status, the overall status is unhealthy and the HTTP status is 500.
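The aggregation rule above can be sketched as follows (the status names, the dict shape, and the 500 mapping are illustrative assumptions, not the actual watchdog code):

```python
# Hedged sketch of the watchdog aggregation: if any watched function
# reports "Degraded" or "Unhealthy", the overall status is "Unhealthy"
# and the watchdog answers HTTP 500; otherwise "Healthy" and HTTP 200.
def aggregate(function_statuses):
    unhealthy = any(s in ("Degraded", "Unhealthy")
                    for s in function_statuses.values())
    overall = "Unhealthy" if unhealthy else "Healthy"
    http_status = 500 if unhealthy else 200
    return overall, http_status

print(aggregate({"funcA": "Healthy", "funcB": "Degraded"}))  # -> ('Unhealthy', 500)
print(aggregate({"funcA": "Healthy", "funcB": "Healthy"}))   # -> ('Healthy', 200)
```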
Watchdog results are correct: I see the proper outcomes when adding/removing fault injection in one of the watched functions.
The test is defined as follows:
Test type: URL ping test
URL: watchdog endpoint with function key
Parse dependent requests: unchecked
Enable retries for availability test failures: unchecked
Test frequency: 5 minutes
Test locations: West Europe
Test Timeout: 120 seconds
Status code must equal: 200
Content match: {successful JSON response}
Alerts: Enabled
I've selected the test and clicked on the Open Rules (Alerts) page to edit the generated alert rule.
The only thing I've added is an action group configured to send a message to my work email.
I've double-checked the email address for correctness, and I've checked my spam folder in Outlook.
I've read docs, watched tutorials and investigated, but I still have no clue why this alert is not being fired.
If someone could put me on the right path, I'll be very grateful!
Regards,
Giacomo S. S.
This was happening because, in the first place, I had done a lot of different configuration experiments and reached the maximum number of alerts.
Now the alert is properly fired and the email sent.
Related
I have a web application that builds web pages using an agent (it's written in LotusScript and we use [Print html] to output HTML), and from time to time I see an error as below.
02-11-2020 10:00:18 HTTP Web Server: Agent did not complete within configured time limit [/path-to-database.nsf/web?openagent] Anonymous
02-11-2020 10:00:18 HTTP Server: Execution time limit exceeded by Agent '(Web)|Web' in database '/path-to-database.nsf'. Agent signer 'signer name'.
As a result the HTTP task gets stuck, so I have to restart it, which means I have to monitor it all the time.
It does not seem to be related to agent execution time; otherwise I would have this issue constantly.
Activity does not seem to be the issue either; according to Google Analytics it's around ~50 active users.
I doubt [Server Tasks\Agent manager] will help, because the agent runs under the HTTP task.
Does anybody know how to figure out the reason for this issue and where I should dig to fix it?
Update
Domino version 11.0
The agent is triggered by anonymous visitors and does some relatively heavy computation to construct the HTML response (loops and lookups are present, but I'm sure all loops end properly, without infinite runs).
I guess the settings for HTTP agents are under this section (so 2 minutes):
Web Agents and Web Services
Run web agents and web services concurrently? Enabled
Web agent and web services timeout: 120 seconds
In general a request takes between 300 ms and 1 second; however, there are some heavy pages taking 1-5 seconds (but nothing like 10 seconds or more).
I notice the error only when we get more than 50 active users (who actively open new pages and thus trigger the agent).
I guess Richard is right and there must be some condition under which the agent gets stuck (maybe related to view updates or some background process).
For now I simply restart HTTP to fix the issue (for some time).
So my question could be re-phrased to:
What can cause a delay in the agent that builds the web page (taking into account that it's related to 50-100 active users)?
Thanks a lot :-)
I have an app service that provides a health check endpoint. I have enabled "Health check" on this service and provided the health check endpoint path. I have validated the endpoint in a browser and it is reachable. When everything is running, it reports a value of 100 for the metric. I have set up an alert rule on this metric in Application Insights and tried both Average and Min in this rule (< 100). When I kill or stop the service, the rule never fires.
It is stated here that this should be possible but I have not found a way to do this:
https://azure.github.io/AppService/2020/08/24/healthcheck-on-app-service.html#alerts
Also, I'm not sure what the 100 even represents: a percentage?
In the chart when I hover over the last few minutes it doesn't show a numeric value but rather "--". Which is probably why the rule doesn't fire. Anyone got this working? Is it a bug?
Just to report back: this seems to be working as expected now. It also depends on the type of "unhealthiness" of the service. If you stop the service manually, the alert doesn't fire, as that is a planned stop. If you kill the main process, the service is automatically restarted (it is seemingly implemented with Docker, and the main process is the entrypoint). If, however, you make the service itself report being down (a 5XX status code on the health endpoint, for example), then it works as expected. The metrics are recorded correctly, so the metric graphs are correct, and alerting on such metrics also works.
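A minimal sketch of a health endpoint that reports downness via a 5XX, which is the case where alerting worked (the /health path and the module-level flag are illustrative assumptions; a real app would run actual checks and serve from its web framework):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

IS_HEALTHY = True  # illustrative flag; flip to False to simulate downness

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # 200 counts as healthy; any 5xx marks the instance unhealthy
            code, body = (200, b"Healthy") if IS_HEALTHY else (503, b"Unhealthy")
            self.send_response(code)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the sketch quiet

# To serve: HTTPServer(("", 8080), HealthHandler).serve_forever()
```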
This is not a direct answer, but as a workaround, we suggest you take a look at the Application Insights availability test.
It's easy to configure and use for health-check purposes.
I want an Azure Alert to trigger when a certain function app fails. I set it up as a GTE 1 threshold on the [function name] Failed metric, thinking that would yield the expected result. However, when it runs daily I am getting notifications that the alert fired, but I cannot find anything in Application Insights to indicate a failure, and the function appears to be running successfully and completing.
Here is the triggered alert summary:
Here is the invocation monitoring from the portal showing that same function over the past few days with no failures:
And here is an application insights search over that time period showing no exceptions and all successful dependency actions:
The question is: what could be causing an Azure Function Failed metric to register non-zero values without any telemetry in Application Insights?
Update - here is the alert configuration
And the specific condition settings-
Failures blade for wider time range:
There are some dependency failures on a blob 404, but I think that comes from a different function that explicitly checks for the existence of blobs at certain paths to know which files to download from an external source. Also, the timestamps don't fall within the sample period.
No exceptions:
Per the comment on the question by #ivan-yang, I have switched the alerting to use a custom log search instead of the built-in Azure Functions metric. At this point that metric seems to be pretty opaque as to what triggers it, and it was triggering every day when I ran the Azure Function with no apparent underlying failure. I plan to avoid this metric now.
My log based alert is using the following query for now to get what I was looking for (an exception happened or a function failed):
requests
| where success == false   // failed function invocations
| union (exceptions)       // plus all logged exceptions
| order by timestamp desc
Thanks to #ivan-yang and #krishnendughosh-msft for the help
I have a Functions app where I've configured signal logic to send me an alert whenever one or more failures have occurred in my application. I have been getting emails every day saying my Azure Monitor alert was triggered, followed by a later email saying that the failure was resolved. I know that my app didn't fail because I checked in Application Insights. For instance, I did not have a failure today, but did have failures the prior 2 days:
However, I did receive a failure email today. If I go to configure the signal logic where I set a static threshold of failure count greater than or equal to 1 it shows this:
Why is it showing a failure for today, when I know that isn't true from the Application Insights logs? Also, if I change the signal logic to look at total failures instead of the count of failures, it looks correct:
I've decided to use the total failures metric instead, but it seems that the count functionality is broken.
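One possible explanation (an assumption on my part, not confirmed by the screenshots here) is that the Count aggregation counts the number of recorded samples in the window rather than summing the failure values, so a window of zero-valued samples can still yield a non-zero count. A sketch of the difference, with hypothetical per-minute samples:

```python
# Hedged sketch: per-minute "failures" samples in one alert window.
samples = [0, 2, 0, 1, 0]

total = sum(samples)  # Total/Sum aggregation: adds the values
count = len(samples)  # Count aggregation: number of samples, zeros included

print(total, count)  # -> 3 5
```

Under that reading, a "count >= 1" condition fires whenever any samples exist, which would match the behavior described above.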
Edit:
Additional screenshot:
I suggest you use Custom log search as the signal if you have already connected your function app with Application Insights (I like to use this kind of signal, and I don't see behavior like yours).
The steps are as below:
Step 1: For signal, please select Custom log search. The screenshot is as below:
Step 2: When the Azure function times out, it throws an error whose type is Microsoft.Azure.WebJobs.Host.FunctionTimeoutException, so you can use the query below to check whether it timed out:
exceptions
| where type == "Microsoft.Azure.WebJobs.Host.FunctionTimeoutException"
Put the above query in the "Search query" field, and configure the other settings as per your need. The screenshot is as below:
Then configure the other settings, like the action group etc. Please let me know if you still have this issue.
One thing should be noted: some kinds of triggers support retry logic, like the blob trigger. So if it retries, you can also receive the alert email. But you can disable the retry logic as per this doc.
I have created an App Service on Azure that runs on Tomcat. I'm using metrics for monitoring and set an alert rule that should send me an email when 4xx errors occur. Nothing happens, even though I've created more errors than the rule needs to fire the alert.
Thanks, Dominik
According to your screenshot and the reference for Metric Definitions, it seems normal that no mail is sent, because your alert rule means the count of HTTP 4xx events must be greater than or equal to 5 over the last 5 minutes, not the average count per minute. So you can try to lower the threshold value or lengthen the period to check whether the mail sender is triggered once the condition is clearly satisfied. Meanwhile, if you doubt whether the alert trigger works fine, you can retrieve the logs via Azure CLI or PowerShell.
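The evaluation above can be sketched as follows (the window length, per-minute counts, and threshold are illustrative assumptions):

```python
# Hedged sketch: a metric alert summing 4xx counts over a 5-minute
# window and comparing against the threshold; thinly spread errors
# may never reach it even though errors occur every few minutes.
def alert_fires(per_minute_4xx, threshold=5):
    return sum(per_minute_4xx) >= threshold

print(alert_fires([1, 1, 0, 1, 0]))  # 3 total in 5 minutes -> False
print(alert_fires([2, 1, 1, 1, 1]))  # 6 total in 5 minutes -> True
```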