How to fetch IIS Start log for a corresponding IIS Stop log in Azure Log Analytics outside of Alert's monitoring time period - azure

I'm configuring an Azure Log Analytics alert (using KQL) to capture the IIS Stop and Start events (from the Events table) in my OMS workspace. If the alert query finds that a PaaS role logged an IIS Stop event with no corresponding IIS Start event, the user should be notified by an alert so that they can bring IIS back up.
Problem: Let's say I set up my alert to run with a time period and frequency of 15 minutes. If the alert triggers at 10:30 AM, it will scan the IIS logs from 10:15:01 AM to 10:29:59 AM. Now suppose an IIS Stop event is logged at around 10:28 AM. The corresponding IIS Start event (if any) will be logged a couple of minutes later, around 10:31 AM or 10:32 AM, and so falls outside the alert's monitoring time period. This creates a false positive: IIS started back up, but my alert didn't capture the Start event. That might lead to unnecessary IIS Start/Reset operations on my PaaS roles.
Attaching a representative quick sketch to explain it figuratively.
Please let me know if there's any possible approach to achieve this. Any suggestions are welcome. Thanks in advance!

The current implementation is as follows.
Here we can see a false alert generated at 10:30.
Instead, consider the approach below, where every 5 minutes we select the last 10 minutes of data, so that consecutive windows overlap.
In the case below, you can generate the alert.
See if this helps.
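The overlapping-window idea can be sketched outside of KQL to make the timing explicit. This is only one possible reading of the approach, not the answerer's exact query: the event tuples, the `unmatched_stops` helper, and the rule "only judge Stop events that are at least one run interval old" are all assumptions made for illustration. With a 10-minute lookback evaluated every 5 minutes, a Stop logged in the newest 5 minutes is deferred to the next run, which gives the Start event time to arrive.

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(minutes=10)   # assumed alert time period
FREQUENCY = timedelta(minutes=5)   # assumed alert frequency


def unmatched_stops(events, run_time):
    """Return Stop timestamps that should raise an alert at run_time.

    events is a list of (timestamp, kind) tuples, kind being "Stop" or "Start".
    Only Stops in the older half of the window are judged; newer Stops are
    left for the next run, where they fall into the older half.
    """
    window_start = run_time - LOOKBACK
    grace_cutoff = run_time - FREQUENCY
    in_window = [(ts, kind) for ts, kind in events if window_start <= ts < run_time]
    starts = [ts for ts, kind in in_window if kind == "Start"]

    alerts = []
    for ts, kind in in_window:
        if kind != "Stop" or ts >= grace_cutoff:
            continue  # not a Stop, or too recent to judge yet
        if not any(start > ts for start in starts):
            alerts.append(ts)  # no Start seen after this Stop
    return alerts


if __name__ == "__main__":
    day = datetime(2020, 1, 1)
    events = [
        (day.replace(hour=10, minute=28), "Stop"),
        (day.replace(hour=10, minute=31), "Start"),
    ]
    for minute in (30, 35):
        run = day.replace(hour=10, minute=minute)
        print(run.time(), unmatched_stops(events, run))
```

In the scenario from the question (Stop at 10:28, Start at 10:31), the 10:30 run skips the Stop because it is too recent, and the 10:35 run sees both events and stays quiet. The trade-off is that a genuine outage is reported up to one run interval later.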

Related

HTTP Web Server: Agent did not complete within configured time limit

I have a web application that builds web pages using an agent (it's written in LS and we use [print html] to output HTML), and from time to time I see the error below.
02-11-2020 10:00:18 HTTP Web Server: Agent did not complete within configured time limit [/path-to-database.nsf/web?openagent] Anonymous
02-11-2020 10:00:18 HTTP Server: Execution time limit exceeded by Agent '(Web)|Web' in database '/path-to-database.nsf'. Agent signer 'signer name'.
As a result the HTTP task gets stuck, so I have to restart it, which means I have to monitor it all the time.
It does not seem to be related to agent execution time; otherwise I would have this issue constantly.
Activity does not seem to be the issue either: according to Google Analytics it's around 50 active users.
I doubt [Server Tasks\Agent manager] will help, because the agent runs under the HTTP task.
Does anybody know how to figure out the cause of this issue and where I should dig to fix it?
Update
Domino version 11.0
The agent is triggered by an anonymous visitor and does some relatively heavy computation to construct the HTML response (loops and lookups are present, but I'm sure all loops end properly, with no infinite loops).
I guess the settings for HTTP agents are under this section (so 2 minutes).
Web Agents and Web Services
Run web agents and web services concurrently? Enabled
Web agent and web services timeout: 120 seconds
In general a request takes between 300 ms and 1 second; however, there are some heavy pages that take 1-5 seconds (but nothing like 10 seconds or more).
I notice the error only when we get more than 50 active users (who actively open new pages and thus trigger the agent).
I guess Richard is right and there must be some condition under which the agent gets stuck (maybe related to view updates or some background process).
For now I simply restart HTTP to get this issue fixed (for some time).
So my question could be rephrased as:
What can cause a delay in the agent that builds the web page (taking into account that it's related to 50-100 active users)?
Thanks a lot :-)

Azure App Service health check alert not firing

I have an app service which provides a health check endpoint. I have enabled "Health check" on this service and provided the health check endpoint path. I have validated the endpoint in a browser and it is reachable. When everything is running, the metric reports a value of 100. I have set up an alert rule on this metric in Application Insights and tried both Average and Min in this rule (< 100). When I kill or stop the service, the rule never fires.
It is stated here that this should be possible but I have not found a way to do this:
https://azure.github.io/AppService/2020/08/24/healthcheck-on-app-service.html#alerts
Also I'm not sure what the 100 even is: %?
In the chart when I hover over the last few minutes it doesn't show a numeric value but rather "--". Which is probably why the rule doesn't fire. Anyone got this working? Is it a bug?
Just to report back, this seems to be working as expected now. It also depends on the type of "unhealthiness" of the service. If you stop the service manually it doesn't fire as it is then a planned stop. If you kill the main process then the service is automatically restarted (as it is seemingly implemented with docker and the main process is the entrypoint). If however you do something for the server to report downness (5XX status code for example on the health endpoint) then it works as expected. The metrics are recorded correctly and thus metric graphs are correct. Alerting on such metrics also works then.
This is not a direct answer, but as a workaround you could take a look at the Application Insights availability test.
It's easy to configure and use for health check purposes.

AppInsights - Monitor for Hung Processes

We are looking at implementing AppInsights for our non-web application. One of the things that we want to monitor for is processes that may be "hung" for more than N number of seconds or minutes. I have been unable to find something built in that does this. The closest thing I have seen or thought of would be to log 2 custom events for the start and end of a process, and then have an alert for a custom log that queries events with no matching "end" event after N minutes.
Is there another way to monitor for hung processes using AppInsights that I am not seeing? Thanks for any help.
If you choose to use Application Insights, here is a suggestion for your reference (if you have a better solution, you can ignore this):
As per this post, you can leverage the heartbeat feature. Details below:
if this application runs more than several seconds, you can leverage heartbeat feature - it sends metric every N minutes/seconds (configurable) and the absence of such metric will indicate that application is no longer actively running. However, if Application Insights thread survives, then heartbeat will still be reported.
You can rely on presence/absence of the telemetry from this app in general as well as couple custom events as you outlined above - Azure Monitor allows to set an alert on analytics query, so you'll be able to craft a query that returns nothing in case of application issues and set an alert on 0 count returned by such a query.
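The start/end custom event idea from the question can be sketched as plain matching logic, independent of how the events are queried. This is a hypothetical illustration only: the event names `ProcessStart` and `ProcessEnd`, the operation IDs, and the `hung_processes` helper are made up for the example and are not Application Insights APIs. An alert query would need to express the same rule, namely a start older than N with no matching end.

```python
from datetime import datetime, timedelta


def hung_processes(events, now, max_age):
    """Return IDs of operations that started more than max_age ago and never ended.

    events is a list of (operation_id, event_name, timestamp) tuples, where
    event_name is "ProcessStart" or "ProcessEnd" (hypothetical names).
    """
    open_starts = {}
    # Replay events in time order so an end always follows its own start.
    for op_id, name, ts in sorted(events, key=lambda e: e[2]):
        if name == "ProcessStart":
            open_starts[op_id] = ts
        elif name == "ProcessEnd":
            open_starts.pop(op_id, None)
    return sorted(op_id for op_id, ts in open_starts.items() if now - ts > max_age)


if __name__ == "__main__":
    t0 = datetime(2021, 1, 1, 12, 0)
    sample = [
        ("job-1", "ProcessStart", t0),
        ("job-1", "ProcessEnd", t0 + timedelta(minutes=2)),
        ("job-2", "ProcessStart", t0 + timedelta(minutes=1)),
    ]
    print(hung_processes(sample, now=t0 + timedelta(minutes=20), max_age=timedelta(minutes=10)))
```

Unlike the heartbeat approach, this catches a process that is stuck while its telemetry thread keeps running, at the cost of having to emit two events per operation.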

Application Insights Alert Trigger History

We are using application insights for sending our metrics. Based on these metrics we have alerts set via a customQuery.
The alerts are working fine. I'd like to pull data out of the alert triggers and use it for analytical purposes.
Explanation:
I have alerts A,B,C,D,E.....
Over a given period, A triggered 5 times, B 3 times, D 10 times.....
To start with, for this period I'd like insight into which failure happened most frequently so that appropriate action can be taken.
Where can I find this information? I'm not looking for the Monitor tab, as it gives only a very basic view.

Azure Bot Service using over 1GB of data transfer out per day. Why? How can I stop that?

I created a QnA bot using the Azure Bot service, and now I'm seeing data transfers out of my subscription of over 1 GB a day! I cannot figure out why, but since it's billable, I'd like to know why and how I can stop it.
The bot isn't being used yet, so no one is sending queries to it. I'm confused how this is happening.
Here's a screen shot of the graph for use in the last hour as well as a screen shot of the billing for the last few days showing the sudden jump in use.
Is this normal?
If you add AzureWebJobsDisableHomepage with a value of true, to the App settings, the data out will stop.
The setting itself is documented here: https://github.com/Azure/azure-webjobs-sdk-script/wiki/Configuration-Settings (although it doesn't provide an explanation for how this setting affects a bot specifically)
The reasoning behind what is happening is a little complex. Azure Functions are not normally "in memory" and available all the time. There is a small spin-up time that is not ideal for a bot. So, apparently, a job is set up with consumption plan bots to ping the bot every 10 seconds (and by "ping", I mean retrieve the root of the site). If you open the Log Stream, you'll see an HTTP GET request every 10 seconds. Adding AzureWebJobsDisableHomepage doesn't disable the request, but changes the status of what is returned from "OK" to "NoContent".
This will be added to the Bot Service ARM template soon (so future consumption plan bots do not automatically accrue this data usage).
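As a rough sanity check on the numbers above, the 10-second ping interval can be turned into a per-response size. This is back-of-the-envelope arithmetic, not a measurement: it assumes that all of the roughly 1 GB per day of outbound data comes from the homepage responses, and it treats "1 GB" as 1 GiB. The function names are made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PING_INTERVAL_S = 10               # from the answer: one GET every 10 seconds
DAILY_EGRESS_BYTES = 1 * 1024**3   # assumption: "1 GB a day" taken as 1 GiB


def requests_per_day(interval_s: int) -> int:
    """Number of pings in a day at the given interval."""
    return SECONDS_PER_DAY // interval_s


def implied_bytes_per_response(daily_bytes: int, interval_s: int) -> float:
    """Average response size if the pings account for all outbound data."""
    return daily_bytes / requests_per_day(interval_s)


if __name__ == "__main__":
    pings = requests_per_day(PING_INTERVAL_S)
    size = implied_bytes_per_response(DAILY_EGRESS_BYTES, PING_INTERVAL_S)
    print(f"{pings} requests/day, about {size / 1024:.0f} KiB per response")
```

Under these assumptions the pings add up to 8,640 requests a day, so each homepage response would have to average a little over 120 KiB to reach 1 GB. Returning "NoContent" instead removes the response body, which is consistent with the data transfer stopping.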
