How to get the aggregated CPU usage in Grafana when hyperthreading is enabled - performance-testing

We are running Grafana/Prometheus to monitor our CPU metrics and want the aggregated CPU usage across all CPUs. The problem is that we have hyperthreading enabled, and when we stress the CPU the percentage exceeds 100%. My question is: how do I limit the displayed CPU usage to at most 100%, even when the CPU is highly utilized?
P.S. I have tried setting the max and min limits in Grafana, but the graph still spikes above that limit.
Kindly give me the right query for this problem.
The queries I have tried are given below.
sum(irate(node_cpu_seconds_total{instance="localhost",job="node", mode!="idle"}[5m]))*100
100 - avg(irate(node_cpu_seconds_total{instance="localhost",job="node", mode!="idle"}[5m]))*100
We have tried other similar queries as well.

If all you want is to cap a variable or expression result at a maximum value (here, 100), you can simply use the Prometheus function clamp_max.
Thus, you could do:
clamp_max(<expr>, 100)
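For example, applied to the first query above (a sketch keeping the same instance and job labels), the clamped version would be:
clamp_max(sum(irate(node_cpu_seconds_total{instance="localhost",job="node",mode!="idle"}[5m]))*100, 100)
Keep in mind that clamping only hides values above 100; it does not normalize the measurement, so the averaged query below is usually the better fix.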

This is probably the most helpful query.
(1 - avg(irate(node_cpu_seconds_total{instance="$instance",job="$job",mode="idle"}[5m])))*100
Replace $instance with your instance IP and $job with your node exporter job name.
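Because avg here runs over exactly one idle series per logical CPU (hyperthreads included), the result is already normalized and cannot exceed 100%. If you prefer to make the normalization explicit, a sketch that should be equivalent, assuming the same labels, is:
sum(irate(node_cpu_seconds_total{instance="$instance",job="$job",mode!="idle"}[5m]))
  / count(node_cpu_seconds_total{instance="$instance",job="$job",mode="idle"}) * 100
Here count(...) over the idle series yields the number of logical CPUs, so the summed busy time is divided by the core count instead of being clamped.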

Related

Monitor memory usage of each node in a Slurm job

My Slurm job uses several nodes, and I want to know the maximum memory usage of each node for a running job. What can I do?
Right now, I can ssh into each node and do free -h -s 30 > memory_usage, but I think there must be a better way to do this.
Slurm accounting will give you the maximum memory usage over time across all tasks directly. If that information is not sufficient, you can set up profiling following the Slurm profiling documentation, and Slurm will give you the full memory usage of each process as a time series for the duration of the job. You can then aggregate per node, find the maximum, etc.
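For the accounting route, a minimal sketch (job ID 12345 is a placeholder; the fields assume a standard Slurm accounting setup):
# While the job is still running, sstat reports live per-step usage;
# MaxRSS is the peak resident memory and MaxRSSNode the node it occurred on:
sstat -j 12345 --format=JobID,MaxRSS,MaxRSSNode,MaxRSSTask
# After the job completes, sacct exposes the same accounting fields:
sacct -j 12345 --format=JobID,MaxRSS,MaxRSSNode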

Prometheus: CPU process time total to percent

We started using Prometheus and Grafana as the main tools for monitoring our Service Fabric cluster. As the Prometheus target we use wmi_exporter with its predefined collectors: CPU, system, process, service, memory, etc. Our main goal was to monitor our product's services on each instance of the node group in Azure Service Fabric.
For instance, we are using this PromQL query to calculate total CPU usage in %:
100 - (avg by (hostname) (irate(wmi_cpu_time_total{scaleset="name",mode="idle"}[5m])) * 100)
and the metrics look more or less realistic.
That was until we started writing queries for services.
For services we use
sum by (process,hostname)(irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100
and the metrics seem unrealistic from time to time; it is especially obvious once you compare them with the total CPU time %. I found an article about multiplying by 100 to get % from CPU time, but in this case I get metrics around 170% or more. Perhaps I need to divide by the number of CPU cores?
Regarding the query: I sum by process because the exporter returns two different series for one process, one per mode (user and privileged).
Can anyone please help me with the correct calculation for the CPU process time total metric and how to transform it into a percentage?
Thank you, I would be grateful for any help!
I hope this will help!
The result is pretty much the same as what the Windows performance monitor shows.
So, for CPU % of running services (tasks, processes):
sum by (process,hostname)(irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100 / 2
where 2 is the number of CPU cores on the instance.
First, you sum all metrics for the running process: the exporter provides two series for the same process ID, one for user mode and one for kernel mode, so they need to be summed. The same grouping applies to hostname (instance, etc.); in my case I have Azure scale sets with 2 to 5 instances. The result must be multiplied by 100 to get % and divided by the number of CPU cores.
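If you'd rather not hardcode the core count, a sketch that derives it per host from the per-core CPU series (this assumes wmi_cpu_time_total exposes one series per logical core and carries the same hostname label, as the total-CPU query above suggests):
sum by (process, hostname)(irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100
  / on (hostname) group_left
  count by (hostname)(wmi_cpu_time_total{scaleset="name", mode="idle"})
The count by (hostname) of the idle series gives the number of logical cores on each host, so the per-process percentage stays comparable across differently sized instances.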
Cheers!

CPU / DTUs getting maxed out on Azure SQL Database, but top queries less than 1% and database only a few MB

I just launched an Azure SQL Database, and the DTU and CPU usage are behaving strangely. The database is only receiving about 30 requests per minute; the CPU/DTU will be extremely low for hours and then jump up to 100% and stay there, with no increase in the number of requests to trigger it. When I click to view the top queries, none of them are above 1% CPU usage. I started out on a 5 DTU plan and yesterday upgraded to 20 DTUs, and the same behavior is occurring. Any idea what else might cause the DTU/CPU to get maxed out? See images below:
https://i.imgur.com/LdbYTPw.png
https://i.imgur.com/jlus3FM.png
Thanks in advance for any advice!
Joe
EDIT: I'm getting closer, I found these repeated entries in the error log. (about 8 - 10 per SECOND)
"The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request."
The thing is, the App Service that queries the database is only doing simple selects, updates, and inserts, none of which use any complex WHERE IN statement. Furthermore, every query is wrapped in a try/catch block, and I never see an exception like this.
Where could these large queries be originating from?
You are only seeing the CPU component of the DTU graph; what about the "Data IO" and "Log IO" components? Look at the top 5 queries in each of the 3 sections and let me know if you find a query that starts with "SELECT StatMan ...". If you see that, then the Auto Update Statistics process is creating those DTU spikes.
I would suggest installing the sp_whoisactive script so that you can see what's going on more easily:
http://whoisactive.com/
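Once the procedure is installed (on Azure SQL Database it typically has to be created in the user database itself, since you cannot create objects in master there), a minimal invocation is:
-- Lists the currently running requests; @get_plans = 1 is optional and attaches the query plans.
EXEC sp_WhoIsActive @get_plans = 1;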

Azure Analytics: Difference between log-based vs standard metrics

We are using an App Service on Azure with Application Insights enabled. While looking at CPU usage we found that log-based metrics show an average CPU of 40-80%, while standard metrics show CPU usage of 150-300% for the same period and resource.
Can someone explain why there is so much difference, and how CPU usage can go up to 300%?
CPU can be counted per core (max value = #NumCores * 100) or normalized (averaged across all cores). For instance, if your app runs on a 4-core virtual machine, then 75% overall CPU utilization maps to 300% CPU-core utilization (75 * 4).
I guess in your case one metric is normalized and the other isn't.

Throughput calculation in performance testing

If there are 10k buses and the peak period is 9:30 AM to 10:30 AM, and there is a 2% increase in Vusers every year, what is the throughput after 10 years?
Please help me with how to solve this type of question without using a tool.
Thanks in advance.
The formula would be:
10000 * (1.02)^10 = 12190
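Spelled out, this is just compound growth applied to the peak-hour rate (the 12190 above is the rounded result):
throughput after n years = current throughput * (1 + yearly growth)^n
                         = 10000 * (1.02)^10
                         ≈ 10000 * 1.219
                         ≈ 12190 buses per hour in the peak hour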
With regards to implementation: 10000 buses (whatever the unit is) per hour is about 166 per minute, or 2.7 per second, which doesn't seem like a very high load to me. Depending on your load testing tool there are different options for simulating it: for Apache JMeter it would be the Constant Throughput Timer, for LoadRunner there are Pages per minute / Hits per second goals, etc.
