AKS Container Insights: How to list not-ready pods?

I'm using Azure Container Insights for an AKS cluster and want to filter some logs using Log Analytics and the Kusto Query Language, in order to build a convenient dashboard and alerts.
What I'm trying to achieve is to list only the pods that are not ready. Listing the ones that are not Running is not enough. This is easy to filter with kubectl, e.g. following this post: How to get list of pods which are "ready"?
However, this data does not seem to be available when querying Log Analytics with Kusto, as containerStatuses appears to be just a string.
It should be possible somehow, because Container Insights allows this filtering in the Metrics section. That isn't fully satisfying, though, because my filtering capabilities with metrics are much more limited.

You can do it for pods as below, for the last hour:
let endDateTime = now();
let startDateTime = ago(1h);
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where PodStatus != "Running"
| distinct Computer, PodUid, TimeGenerated, PodStatus
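If you only want one row per pod (its most recent status in the window) rather than every inventory record, a variant along these lines should work as well; this is a sketch using arg_max on the same table:
let endDateTime = now();
let startDateTime = ago(1h);
KubePodInventory
| where TimeGenerated between (startDateTime .. endDateTime)
// keep the latest record per pod, then filter on its status
| summarize arg_max(TimeGenerated, PodStatus, Computer) by PodUid
| where PodStatus != "Running"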

efdestegul's answer only lists pods that are not "Running", and I was looking for the ones that are not ready. However, that answer led me to the query I actually needed, so thank you for that. Maybe this will help others.
let timeGrain = 1m;
KubePodInventory
// | where Namespace in ('my-namespace-1', 'my-namespace-2')
| summarize NotReadyCount = countif(ContainerStatus == 'waiting') by bin(TimeGenerated, timeGrain)
| order by NotReadyCount desc
| render timechart
With this query I'm able to render a chart that displays the number of not-ready pods over time, and in a very useful way: it surfaces the pods that stayed not ready for longer than expected and needed to be restarted. You can always filter the results to whichever namespaces you need.
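If you also want to drive an alert from the same logic, a variant like the sketch below could be used as the alert query; the namespace names and the zero threshold are placeholders to adapt:
let timeGrain = 1m;
KubePodInventory
| where Namespace in ('my-namespace-1', 'my-namespace-2')   // placeholder namespaces
| summarize NotReadyCount = countif(ContainerStatus == 'waiting') by bin(TimeGenerated, timeGrain)
| where NotReadyCount > 0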

Related

Unable to reproduce data from Azure Metrics Chart using Logs

I am trying to create a dashboard for my services in Azure. I added an Azure Metrics Chart for each service and later wanted to add, underneath it, specific details for the operations included in the service.
But when I try to get the same data from logs, I get a much higher number of requests. KQL:
requests
| where cloud_RoleName startswith "notificationengine"
| summarize Count = count() by operation_Name
| order by Count
And the result:
The problem is that for some metrics charts I get values with minimal difference or exactly the same, while for others, like the one I showed, I get completely different values. I tried modifying the KQL and searching for what might be wrong, but never got anywhere.
My guess is that these are two different values, but in that case, why are both labeled as "requests", and what are the actual differences?
I created an Azure Function App with 2 HTTP-trigger functions whose names both start with "HttpTrigger" and ran both functions a couple of times.
Test Case 1:
In the Logs workspace, the request count returned covers both functions that start with "HttpTrigger":
But I pinned the chart for only one function's request count to the Azure dashboard:
Most likely, you have written a query that counts requests from all services/applications whose names start with "notificationengine", but pinned only some of those apps'/services' charts to the dashboard.
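If you want the log query to line up with a single pinned chart, narrowing the filter to that one role should produce matching counts; a sketch (the exact role name is a placeholder):
requests
| where cloud_RoleName == "notificationengine-app1"   // placeholder: the one app the pinned chart shows
| summarize Count = count() by operation_Name
| order by Count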

Azure Monitor metrics based on sum of existing metric values

I have a network resource that only has bytes in and bytes out as metrics, and I want to derive another metric as the sum of the two (bytes in + bytes out). Please suggest how I can add the in and out values and create an Azure Monitor alert rule based on this new metric.
You cannot create a new metric by combining two existing ones. However, you can create an alert rule on a custom log query. That means we can write a query like this:
AzureMetrics
| where MetricName == "BitsInPerSecondTraffic" or MetricName == "BitsoutPerSecondTraffic"
| where ResourceType == "EXPRESSROUTECIRCUITS"
| summarize AggregatedValue = sum(Total) by bin(TimeGenerated, 15m)
Use that query to create an alert in the Azure portal:
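As a sanity check before wiring up the alert, a variant like this sketch breaks the value out per metric, so you can confirm both series are actually present:
AzureMetrics
| where MetricName in ("BitsInPerSecondTraffic", "BitsoutPerSecondTraffic")
| where ResourceType == "EXPRESSROUTECIRCUITS"
| summarize AggregatedValue = sum(Total) by MetricName, bin(TimeGenerated, 15m)
| render timechart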

Access dashboard's time range and granularity from KQL

I've added a chart built from KQL and Azure Log Analytics logs to a dashboard. I'm using make-series, which works great, but the catch is the following:
The logs I get might not span the whole time range dictated by the dashboard. So basically I need access to the dashboard's start time/end time (and time granularity) so that make-series covers the whole range.
e.g.
logs
| make-series
P90 = percentile(Elapsed, 90) default = 0,
Average = avg(Elapsed) default = 0
// ??? need start/end time to use in from/to
on TimeGenerated step 1m
Currently, it's not supported. There are some feedback items requesting this feature: Support for time granularity selected in Azure Portal Dashboard, and Retrieve the portal time span and use it inside the kusto query.
Some people have provided workarounds in the first feedback item, so you can give those a try.
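Until that exists, one stopgap (a sketch, with the time range hard-coded rather than taken from the dashboard) is to supply explicit bounds to make-series yourself:
let startTime = ago(1d);   // assumed window; ideally this would come from the dashboard
let endTime = now();
logs
| make-series
    P90 = percentile(Elapsed, 90) default = 0,
    Average = avg(Elapsed) default = 0
    on TimeGenerated from startTime to endTime step 1m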
I posted an answer on another question on this subject; you can do a bit of a hack in your KQL to get this working: https://stackoverflow.com/a/73064218/5785878

Azure Monitor avoid false positives on VPN disconnect

We are using Azure Monitor to detect when our Virtual Network Gateway S2S VPN connections disconnect (we have a few connections in each environment), but we would like to reconfigure it so that we only get alerts if a connection has been down for more than one minute, to avoid alerts when the tunnel is simply reset.
Today we are using this Log Analytics query, which creates false alerts. Do you have any suggestions on how we can achieve this?
AzureDiagnostics
| where Category == "TunnelDiagnosticLog"
| order by TimeGenerated
Here is an example of what we don't want to trigger an alert. Note that just excluding the GlobalStandby change events won't do it, since it's not guaranteed that the tunnel connects again.
Configuration in Azure Monitor:
Using Log Analytics, I came up with this query, which checks the next line in the log to see whether it is a Connected event and compares the timespan between the two.
AzureDiagnostics
| serialize
| where Category == "TunnelDiagnosticLog"
| where TimeGenerated < ago(120s) and TimeGenerated > ago(600m)
// Result = 0 when the tunnel reconnected within a minute (or the event is itself a connect), 1 otherwise
| extend Result = iif(
    (OperationName == "TunnelDisconnected"
        and next(OperationName) == "TunnelConnected"
        and next(TimeGenerated) - TimeGenerated < 1m)
    or OperationName == "TunnelConnected", 0, 1)
| project TimeGenerated,
    OperationName,
    next(OperationName),
    Result,
    next(TimeGenerated) - TimeGenerated,
    Resource,
    ResourceGroup,
    _ResourceId
| project-rename Downtime = Column2, NextStatus = Column1
| sort by TimeGenerated asc
// keep only disconnects that were not followed by a reconnect within a minute
| where OperationName == "TunnelDisconnected" and Result == 1
You can try creating a metric measurement log alert. In your log query, make AggregatedValue the count of disconnections, aggregated by the column holding the GatewayTenantWorker... values (and any other column as needed) and binned per minute. Then configure the alert with a threshold of 0 (any disconnection) and trigger it based on consecutive breaches greater than 1 (for more than 1 minute, or 2 for more than 2 minutes, to reduce false alerts even further).
This should fire an alert when any of the VPN connections is disconnected for more than 1 (or 2) minute(s).
Assumptions about the data:
Tunnel resets are resolved within a minute.
In the case of an actual long disconnection, there is a log record of the current status (Disconnected) every minute; the above solution only works in that case.
If these assumptions do not hold, more information about the log data pattern during a long disconnection is needed.
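A log query along these lines could serve as the metric measurement; this is a sketch, and the per-connection grouping column (shown here as Resource) should be replaced with whatever column actually identifies your tunnels:
AzureDiagnostics
| where Category == "TunnelDiagnosticLog"
| where OperationName == "TunnelDisconnected"
// one data point per connection per minute; alert on consecutive breaches
| summarize AggregatedValue = count() by Resource, bin(TimeGenerated, 1m)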

How to monitor consecutive exceptions in Azure? (Kusto)

I want to monitor consecutive exceptions.
For example, if I get 'X' number of '500' exceptions in a row, I want it to trigger an action group.
How do I write this in Kusto?
I know how to monitor the number of exceptions over a 1-minute period, but I'm a bit stuck on how to monitor consecutive exceptions.
You are looking to set up a custom log alert on Application Insights.
Here is the step-by-step guide on how to set it up.
You can use the following query with the summarize operator:
exceptions
| where timestamp >= datetime('2019-01-01')
| summarize min(timestamp) by operation_Id
Please use a query like the one below:
exceptions
| summarize count() by xxx
For more details about the summarize operator, refer to this article.
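Neither query above detects a run of consecutive failures by itself. If you specifically need "X in a row" rather than "X per minute", one possible sketch uses row_cumsum to measure the longest run of consecutive 500s, assuming the 500s show up as failed requests with resultCode "500" (adjust the table and filter to wherever your telemetry records them):
requests
| where timestamp > ago(1h)
| order by timestamp asc
| extend IsError = iif(resultCode == "500", 1, 0)
// running count of consecutive errors; resets to 0 whenever a non-500 request appears
| extend ConsecutiveErrors = row_cumsum(IsError, IsError == 0)
| summarize MaxConsecutive = max(ConsecutiveErrors)
// alert when MaxConsecutive exceeds your threshold X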
