How does cAdvisor get container_cpu_load_average_10s, and what is the unit of this metric? - cadvisor

I'm trying to check which container drives the CPU load up in my cluster.
So I deployed the cAdvisor DaemonSet on my k8s cluster with the args below.
args:
- -enable_load_reader=true
- -disable_metrics=sched,network,tcp,advtcp,udp
I also set .spec.hostNetwork: true.
I want to know how cAdvisor calculates and reports the value of container_cpu_load_average_10s (how cAdvisor gets that value).
And if the value of container_cpu_load_average_10s is 100, does that container load 100 cores? 100 millicores? Or something else? (What is the unit of this metric?)
When I checked container_cpu_load_average_10s, the value for the cAdvisor pod itself is over 3000.
The other pods are below 300, but the cAdvisor pod's values range between 4000 and 7000.
Please let me know why the cAdvisor pods' container_cpu_load_average_10s values are so high.
Thanks a lot

cAdvisor's load reader (the --enable_load_reader=true flag you already pass) periodically reads the task-state counters of each container's cgroup and keeps a decaying, roughly 10-second average of the number of tasks in the runnable state (loadAvg). The exported value is that average multiplied by 1000:
container_cpu_load_average_10s = loadAvg * 1000
The cAdvisor source comments this as converting to 'milliLoad' to avoid floats and preserve precision.
So the unit is neither cores nor millicores: a value of 1000 corresponds to an average of one runnable task over the last ~10 seconds. A value of 100 means 0.1 runnable tasks on average, and the 4000~7000 you see for the cAdvisor pod means that, on average, 4 to 7 of its tasks (threads) were runnable.
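If it is easier to reason in whole tasks, you can divide the metric back down in PromQL; for example, a quick way to surface the heaviest containers (a sketch; adjust the label selectors to your setup):
topk(5, container_cpu_load_average_10s / 1000)
This lists the five series with the highest 10-second load average expressed in runnable tasks rather than milliLoad.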

Related

"An Azure Application Gateway instance can support around 10 Capacity Units" - Explanation in simple word

I've read the documentation and searched the internet for a simple explanation of Azure Application Gateway auto-scaling and the quoted line above, but failed.
It would be really helpful if you could explain it or provide a link to an explanation for better understanding.
Thank you!
When you enable auto scaling you need to set a minimum and maximum instance count. How do you know how many instances you need for the amount of traffic you want to be able to handle? That is where Capacity Units play a role:
Capacity Unit is the measure of capacity utilization for an Application Gateway across multiple parameters.
A single Capacity Unit consists of the following parameters:
2500 Persistent connections
2.22-Mbps throughput
1 Compute Unit
If any one of these parameters is exceeded, then additional capacity units are necessary, even if the other two parameters don't exceed a single capacity unit's limits. The parameter with the highest utilization among the three above is used internally for calculating capacity units, which is in turn what gets billed.
When configuring the minimum and maximum number of instances, you can now calculate how many instances you need, because a single instance can handle around 10 Capacity Units, i.e. a maximum of 10 * 2500 = 25,000 persistent connections.
For example: if you expect to have to deal with 6000 persistent connections, you will need at least 3 capacity units (3 * 2500 = up to 7500 persistent connections), which still fits within a single instance.
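Put differently, a rough sizing sketch based on the numbers above (my own shorthand, not official Azure terminology):
required capacity units = max(persistent connections / 2500, throughput in Mbps / 2.22, compute units), rounded up
required instances = ceil(required capacity units / 10)
Whichever of the three parameters is the bottleneck determines the capacity units you end up needing.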

How To Get the Aggregated Usage of CPU in GRAFANA when hyperthreading is enabled

We are running Grafana/Prometheus to monitor our CPU metrics and get the aggregated CPU usage of all CPUs. The problem is that we have hyperthreading enabled, and when we stress the CPU the percentage exceeds 100%. My question is how to cap the displayed CPU usage at 100%, even if the CPU is highly utilized.
P.S. I have tried setting the max and min limits in Grafana, but the graph spikes still go above that limit.
Kindly give me the right query for this problem.
The queries I have tried are given below.
sum(irate(node_cpu_seconds_total{instance="localhost",job="node", mode!="idle"}[5m]))*100
100 - avg(irate(node_cpu_seconds_total{instance="localhost",job="node", mode!="idle"}[5m]))*100
and other similar queries we have tried.
If all you want is to "cap" a variable or expression result at a maximum value (that is, 100), you can simply use the Prometheus function clamp_max.
Thus, you could do:
clamp_max(<expr>, 100)
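For example, applied to a whole-machine usage expression (a sketch; adjust the instance and job labels to your setup):
clamp_max((1 - avg(irate(node_cpu_seconds_total{instance="localhost",job="node",mode="idle"}[5m]))) * 100, 100)
This computes the busy percentage from the idle counters and then caps the result at 100.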
This is probably the most helpful query.
(1 - avg(irate(node_cpu_seconds_total{instance="$instance",job="$job",mode="idle"}[5m])))*100
Replace $instance with your instance IP and $job with your node exporter job name.

Prometheus: CPU process time total to percent

We started using Prometheus and Grafana as the main tools for monitoring our Service Fabric cluster. As the Prometheus target we use wmi_exporter, with its predefined collectors: CPU, system, process, service, memory, etc. Our main goal was to start monitoring our product services on each instance of the node group in Azure Service Fabric.
For instance, we are using this PromQL query to calculate total CPU usage in %:
100 - (avg by (hostname) (irate(wmi_cpu_time_total{scaleset="name",mode="idle" }[5m])) * 100)
and the metrics look more or less realistic.
Until we started to write queries for services.
For services we use sum by (process,hostname)(irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100, and the metrics seem unrealistic from time to time; it is especially obvious when you compare them with the total CPU time %. I found an article about multiplying by 100 to get % from CPU time, but in this case I get metrics around 170% or more. Perhaps I need to divide by the number of CPU cores?
Regarding the query: I'm summing per process because I get two different series for one process, in two modes: user and privileged.
Can anyone please help me with the correct calculation for the process CPU time metric and how to transform it into a percentage?
Thank you, I would be grateful for any help!
I hope this will help!
The result is pretty much the same as the Windows performance manager.
So, for CPU % for running services (tasks, processes):
sum by (process,hostname)(irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100 / 2
where 2 is the number of CPU cores on the instance.
First, you sum all the series for the running process: the exporter provides results for the same process in user and kernel mode, so they need to be summed. The same grouping is applied to hostname (instance, etc.); in my case I have Azure scale sets with 2 to 5 instances. The result is multiplied by 100 to get a percentage and divided by the number of CPU cores.
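If you would rather not hard-code the core count, you can derive it from the per-core CPU series the exporter already exposes. A sketch, assuming wmi_cpu_time_total has one series per logical core and per mode, and that the scaleset and hostname labels come from your relabeling:
sum by (process, hostname) (irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100
  / on (hostname) group_left
count by (hostname) (wmi_cpu_time_total{scaleset="name", mode="idle"})
The count on the right gives the number of logical cores per host, so the division adapts automatically when instances have different core counts.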
Cheers!

Nodetool load and own stats

We are running 2 nodes in a cluster - replication factor 1.
After writing a burst of data, we see the following via nodetool status.
Node 1 - load 22G (owns 48.2)
Node 2 - load 17G (owns 51.8)
As the payload size per record is exactly equal - what could lead to a node showing higher load despite lower ownership?
Nodetool status uses the Owns column to indicate the effective percentage of the token range owned by each node, while Load is the size (in GB) of the data stored on that node.
I don't see anything wrong here. Your data is almost evenly distributed across your two nodes, which is exactly what you want for good performance.

Grouping servers in grafana/prometheus

I would like to group database servers in Grafana dashboards, e.g. servers belonging to the same cluster, db-pxc, end up looking like this:
DB-PXC
-Disk_Performance
-db-pxc-1
-db-pxc-2
-db-pxc-3
...
-Disk_Space
-db-pxc-1
-db-pxc-2
-db-pxc-3
...
-MySQL_Overview
-db-pxc-1
-db-pxc-2
-db-pxc-3
...
-MySQL_Table_statistics
-db-pxc-1
-db-pxc-2
-db-pxc-3
...
...
So if I click on the parent dashboard Disk_Space, it displays a disk-space sub-dashboard for each host in the db-pxc cluster (db-pxc-1, db-pxc-2, db-pxc-3, ...). That way I can compare the disk space usage of all the servers in one cluster on a single page. We already have this setup in Cacti, but I am not sure how we can achieve the same with Grafana.
We are using the Prometheus monitoring system, node_exporter and mysqld_exporter for collecting statistics on each individual server, and Grafana for viewing the dashboards. To view the data of the mysqld and node exporters in Grafana, we are using the Percona Grafana plugin.
Below is an example of what I am asking for. In the picture below, the db cluster name is kdb, with db-kdb-1, db-kdb-2, db-kdb-3 and db-kdb-4 being the nodes that form the cluster. As seen below, when I click on CPU, it shows the CPU usage of all my kdb cluster nodes.
For, say, percentage root filesystem usage, you'd have one graph with an expression like:
100 - node_filesystem_free{job='node',mountpoint='/'} / node_filesystem_size{job='node',mountpoint='/'} * 100
which would show the result for all matching machines.
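Note that on newer node_exporter releases the filesystem metrics carry a _bytes suffix, so the equivalent expression would look something like:
100 - node_filesystem_free_bytes{job='node',mountpoint='/'} / node_filesystem_size_bytes{job='node',mountpoint='/'} * 100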
You need to create a Prometheus target with the IPs of all the instances of your cluster and use its job name in Grafana.
Create the following target file with the IPs of your cluster's instances:
- targets:
  - 10.149.121.21:9100
  - 10.149.121.22:9100
  - 10.149.121.23:9100
  - 10.149.121.24:9100
  labels:
    job: kdbcluster
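To have Prometheus pick that file up, point a scrape job at it with file-based service discovery. A minimal sketch, assuming the target file is saved as kdbcluster.yml next to prometheus.yml:
scrape_configs:
  - job_name: 'kdbcluster'
    file_sd_configs:
      - files:
          - 'kdbcluster.yml'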
Then, on Grafana, you create 4 new graphs with the following respective queries:
100 - (avg by (instance) (irate(node_cpu{instance="10.149.121.21:9100",mode="idle", job="kdbcluster"}[5m])) * 100)
100 - (avg by (instance) (irate(node_cpu{instance="10.149.121.22:9100",mode="idle", job="kdbcluster"}[5m])) * 100)
100 - (avg by (instance) (irate(node_cpu{instance="10.149.121.23:9100",mode="idle", job="kdbcluster"}[5m])) * 100)
100 - (avg by (instance) (irate(node_cpu{instance="10.149.121.24:9100",mode="idle", job="kdbcluster"}[5m])) * 100)
If you want to have all the instances on the same graph, you can use this query:
100 - (avg by (instance) (irate(node_cpu{mode="idle", job="kdbcluster"}[5m])) * 100)
If you want to add a line to the previous graph showing the average CPU load across all the instances, you can use this query:
100 - (avg (irate(node_cpu{mode="idle", job="kdbcluster"}[5m])) * 100)
