Azure Monitor: avoid false positives on VPN disconnect

We are using Azure Monitor to detect when our Virtual Network Gateway S2S VPN connections disconnect (we have a few connections in each environment), but we would like to reconfigure it so that we only get alerts if a connection has been down for more than one minute, to avoid alerts when the tunnel is reset.
Today we are using this Log Analytics query, which creates false alerts. Do you have any suggestions for how to improve it?
AzureDiagnostics
| where Category == "TunnelDiagnosticLog"
| order by TimeGenerated
Here is an example of what we don't want to trigger an alert. Note that just excluding the GlobalStandby change events won't do it, since it's not guaranteed that the tunnel connects again.
Configuration in Azure Monitor:

Using Log Analytics, I came up with this query, which checks the next line in the log to see whether it's Connected and compares the timespan between the two events.
AzureDiagnostics
| where Category == "TunnelDiagnosticLog"
| where TimeGenerated < ago(120s) and TimeGenerated > ago(600m)
| order by TimeGenerated asc
| extend NextStatus = next(OperationName), Downtime = next(TimeGenerated) - TimeGenerated
| extend Result = iif(
    (OperationName == "TunnelDisconnected"
        and NextStatus == "TunnelConnected"
        and Downtime < 1m)
    or OperationName == "TunnelConnected", 0, 1)
| project TimeGenerated,
    OperationName,
    NextStatus,
    Result,
    Downtime,
    Resource,
    ResourceGroup,
    _ResourceId
| where OperationName == "TunnelDisconnected" and Result == 1

You can try creating a metric measurement log alert: in your log query, compute AggregatedValue as the count of disconnections, aggregated by the column carrying the GatewayTenantWorker... values (and any other columns as needed) and binned per minute. Then configure the alert with a threshold of 0 (for any disconnections) and trigger based on consecutive breaches greater than 1 (for more than 1 minute; use 2 for more than 2 minutes, to reduce false alerts even further).
This should fire an alert whenever any of the VPN connections is disconnected for more than 1 (or 2) minute(s).
Assumptions about the data:
Tunnel resets are resolved within a minute.
In the case of an actual long disconnection, there is a log entry with the current status (Disconnected) every minute. The solution above works only in this case.
If these assumptions do not hold, more information about the log data pattern during a long disconnection is needed.
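A minimal sketch of such a metric-measurement query, assuming you count TunnelDisconnected events per minute per resource (swap the grouping column for the one carrying the GatewayTenantWorker... values in your data):

```kusto
AzureDiagnostics
| where Category == "TunnelDiagnosticLog"
| where OperationName == "TunnelDisconnected"
// one row per resource per minute; the alert rule then checks consecutive breaches
| summarize AggregatedValue = count() by Resource, bin(TimeGenerated, 1m)
```

In the alert rule, set the threshold to 0 and require more than 1 consecutive breach, as described above.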

Related

How do I create an alert which fires when one of many machines fails to report a heartbeat?

Overall I'm trying to set up an azure alert to email me when a computer goes down by using the Heartbeat table.
Let's say I have 5 machines in my Azure subscription, and they each report once per minute to the table called Heartbeat, so it looks something like this:
Currently, I can query "Heartbeat | where Computer == 'computer-name' | where TimeGenerated > ago(5m)" and figure out when one computer has not reported in the last 5 minutes and is down (thank you to this great article for that query).
I am not very experienced with any query language, so I am wondering if it is possible to have one query that checks whether ANY computer stopped sending its logs over the last 5-10 minutes, and thus would be down. Azure uses KQL, or Kusto Query Language, for its queries, and there is documentation in the link above.
Thanks for the help
One option is to calculate the latest report time for each computer, then filter to the ones whose latest report is older than 5 minutes ago:
let all_computers_lookback = 12h;
let monitor_lookback = 5m;
Heartbeat
| where TimeGenerated > ago(all_computers_lookback)
| summarize last_reported = max(TimeGenerated) by Computer
| where last_reported < ago(monitor_lookback)
Another alternative:
The first part creates an "inventory" of all computers that reported at least once during the lookback window (e.g. the last 12 hours).
The second part finds all computers that reported at least once during the monitoring window (e.g. the last 5 minutes).
The third and final part takes the difference between the two (i.e. all computers that didn't report in the last 5 minutes).
Note: if you have more than 1M computers, you can use the join operator instead of the in() operator
let all_computers_lookback = 12h;
let monitor_lookback = 5m;
let all_computers =
    Heartbeat
    | where TimeGenerated > ago(all_computers_lookback)
    | distinct Computer;
let reporting_computers =
    Heartbeat
    | where TimeGenerated > ago(monitor_lookback)
    | distinct Computer;
let non_reporting_computers =
    all_computers
    | where Computer !in (reporting_computers);
non_reporting_computers
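For reference, the join-based variant mentioned in the note could be sketched like this, using a leftanti join (which keeps the rows from the left side that have no match on the right):

```kusto
let all_computers_lookback = 12h;
let monitor_lookback = 5m;
Heartbeat
| where TimeGenerated > ago(all_computers_lookback)
| distinct Computer
// keep only computers with no heartbeat in the monitoring window
| join kind=leftanti (
    Heartbeat
    | where TimeGenerated > ago(monitor_lookback)
    | distinct Computer
) on Computer
```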

Azure Monitor metrics based on sum of existing metric values

I have a network resource that only has bytes in and bytes out as metrics. I want to derive another metric that is the sum of the two (BytesIn + BytesOut). Please suggest how I can add the in and out values and create an Azure Monitor alert rule based on this new metric.
You cannot create a new metric by combining two existing ones. However, you can create an alert rule on a custom query. That means we can create a query like this:
AzureMetrics
| where MetricName == "BitsInPerSecondTraffic" or MetricName == "BitsOutPerSecondTraffic"
| where ResourceType == "EXPRESSROUTECIRCUITS"
| summarize AggregatedValue = sum(Total) by bin(TimeGenerated, 15m)
Use that query to create an alert in the Azure portal:
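If you have more than one circuit, you may want to aggregate per resource as well, so the alert can evaluate each circuit separately. A sketch, keeping the metric names from the query above:

```kusto
AzureMetrics
| where MetricName in ("BitsInPerSecondTraffic", "BitsOutPerSecondTraffic")
| where ResourceType == "EXPRESSROUTECIRCUITS"
// sum in + out per circuit per 15-minute bin
| summarize AggregatedValue = sum(Total) by Resource, bin(TimeGenerated, 15m)
```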

AKS Container Insights: How to list not ready pods?

I'm using Azure Container Insights on an AKS cluster and want to filter some logs using Log Analytics and Kusto Query Language, in order to provide a convenient dashboard and alerts.
What I'm trying to achieve is to list only pods that are not ready. Listing the ones that are not Running is not enough. This is easy to filter with kubectl, e.g. following this post: How to get list of pods which are "ready"?
However, this data is not available when querying Log Analytics with Kusto, as containerStatuses seems to be just a string.
It should somehow be possible, because Container Insights allows this filtering in the Metrics section. However, that is not fully satisfying, because my filtering capabilities with metrics are much more limited.
You can do it for pods as below, for the last 1 hour:
let endDateTime = now();
let startDateTime = ago(1h);
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where PodStatus != "Running"
| distinct Computer, PodUid, TimeGenerated, PodStatus
efdestegul's answer only lists pods that are not "Running", and I was looking for ones that are not ready. However, that answer led me to the query I actually needed, and I thank him for that. Maybe this will help others.
let timeGrain=1m;
KubePodInventory
// | where Namespace in ('my-namespace-1', 'my-namespace-2')
| summarize countif(ContainerStatus == 'waiting') by bin(TimeGenerated,timeGrain)
| order by countif_ desc
| render timechart
With this query I'm able to render a chart that shows all not-ready pods over time, and in a very useful way: only the pods that were not ready for longer than expected and needed to be restarted. You can always filter the results to the namespaces you need.
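If you want to list the affected pods rather than chart their count, something along these lines might work (assuming the KubePodInventory columns Name, Namespace, and ContainerStatusReason are populated in your workspace):

```kusto
KubePodInventory
| where TimeGenerated > ago(1h)
| where ContainerStatus == "waiting"
// latest waiting record (and its reason) per pod
| summarize arg_max(TimeGenerated, ContainerStatusReason) by Name, Namespace
| order by TimeGenerated desc
```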

Can we find out if there is an increase to number of requests to a given page?

If i have a web app in Azure, with ApplicationInsights configured, is there a way we can tell if there was an increase in the number of requests to a given page?
I know we can get the "Delta" of performance in a given time slice, compared to the previous period, but doesn't seem like we can do this for requests?
For example, i'd like to answer questions like: "what pages in the last hour had the highest % increase in requests, compared to the previous period"?
Does anyone know how to do this, or can it be done via the AppInsights query language?
Thanks!
I'm not sure whether it can be done in the portal; I don't think so. But I came up with the following Kusto query:
requests
| where timestamp > ago(2h) and timestamp < ago(1h)
| summarize previousPeriod = todouble(count()) by url
| join (
requests
| where timestamp > ago(1h)
| summarize lastHour = todouble(count()) by url
) on url
| project url, previousPeriod, lastHour, change = ((lastHour - previousPeriod) / previousPeriod) * 100
| order by change desc
This is about increase/decrease of amount of traffic per url, you can change count() to for example avg(duration) to get the increase/decrease of the average duration.
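As noted, swapping count() for avg(duration) gives the change in average response time per url; a sketch of that variant:

```kusto
requests
| where timestamp > ago(2h) and timestamp < ago(1h)
| summarize previousPeriod = avg(duration) by url
| join (
    requests
    | where timestamp > ago(1h)
    | summarize lastHour = avg(duration) by url
) on url
// percentage change in average duration, last hour vs. the hour before
| project url, previousPeriod, lastHour, change = ((lastHour - previousPeriod) / previousPeriod) * 100
| order by change desc
```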

How can I consume more than the reserved number of request units with Azure Cosmos DB?

We have reserved various number of RUs per second for our various collections. I'm trying to optimize this to save money. For each response from Cosmos, we're logging the request charge property to Application Insights. I have one analytics query that returns the average number of request units per second and one that returns the maximum.
let start = datetime(2019-01-24 11:00:00);
let end = datetime(2019-01-24 21:00:00);
customMetrics
| where name == 'RequestCharge' and start < timestamp and timestamp < end
| project timestamp, value, Database=tostring(customDimensions['Database']), Collection=tostring(customDimensions['Collection'])
| make-series sum(value) default=0 on timestamp in range(start, end, 1s) by Database, Collection
| mv-expand sum_value to typeof(double), timestamp limit 36000
| summarize avg(sum_value) by Database, Collection
| order by Database asc, Collection asc
let start = datetime(2019-01-24 11:00:00);
let end = datetime(2019-01-24 21:00:00);
customMetrics
| where name == 'RequestCharge' and start <= timestamp and timestamp <= end
| project timestamp, value, Database=tostring(customDimensions['Database']), Collection=tostring(customDimensions['Collection'])
| summarize sum(value) by Database, Collection, bin(timestamp, 1s)
| summarize arg_max(sum_value, *) by Database, Collection
| order by Database asc, Collection asc
The averages are fairly low, but the maxima can be unbelievably high in some cases. An extreme example is a collection with a reservation of 1,000, an average usage of 15.59, and a maximum usage of 63,341 RU/s.
My question is: How can this be? Are my queries wrong? Is throttling not working? Or does throttling only work on a longer period of time than a single second? I have checked for request throttling on the Azure Cosmos DB overview dashboard (response code 429), and there was none.
I have to answer myself. I found two problems:
Application Insights logs an inaccurate timestamp. I added a timestamp as a custom dimension; within a given minute I get different seconds in my custom timestamp, but for many of these records the built-in timestamp is one second past the minute. That is why I got (false) peaks in request charge.
We did have throttling. When viewing request throttling in the portal, I have to select a specific database. If I try to view request throttling for all databases, it looks like there is none.
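To cross-check throttling from the telemetry side rather than the portal, you could count throttled Cosmos DB calls logged as dependencies. This is a sketch that assumes your SDK reports the HTTP 429 status in resultCode and a dependency type containing "DocumentDB"; both are assumptions about your telemetry, so verify the actual values first:

```kusto
dependencies
| where type contains "DocumentDB" and resultCode == "429"
// throttled calls per minute, per Cosmos DB target
| summarize throttled = count() by bin(timestamp, 1m), target
```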