Prometheus: CPU process time total to percent - Azure

We started using Prometheus and Grafana as the main tools for monitoring our Service Fabric cluster. As a Prometheus target we use wmi_exporter with its predefined collectors: CPU, system, process, service, memory, etc. Our main goal was to monitor our product services on each instance of the node group in Azure Service Fabric.
For instance, we are using this PromQL query to calculate total CPU usage in %:
100 - (avg by (hostname) (irate(wmi_cpu_time_total{scaleset="name",mode="idle"}[5m])) * 100)
and the metrics look more or less realistic.
That held until we started writing queries for services.
For services we use:
sum by (process,hostname) (irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100
and the metrics seem unrealistic from time to time; this is especially obvious once you compare them with the total CPU time %. I found an article about multiplying by 100 to turn CPU time into a percentage, but with that I get values around 170% or more. Perhaps I need to divide by the number of CPU cores?
Regarding the query: I use sum by process because I get two different series for one process, one per mode: user and privileged.
Can anyone please help me with the correct calculation for the CPU process time total metric and with transforming it into a percentage?
Thank you, I would be grateful for any help!

I hope this will help!
The result matches Windows Performance Monitor pretty closely.
So, for CPU % of running services (tasks, processes):
sum by (process,hostname) (irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100 / 2
where 2 is the number of CPU cores.
First, sum all the metrics for the running process: the exporter returns two series for the same process, one for user mode and one for kernel (privileged) mode, so they need to be summed. The same grouping applies to hostname (instance, etc.); in my case I have Azure scale sets with 2 to 5 instances. The result must be multiplied by 100 to get a percentage and divided by the number of CPU cores.
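If you prefer not to hard-code the core count, here is a sketch that derives it from the exporter itself, assuming wmi_cpu_time_total exposes one series per core and per mode (counting only the mode="idle" series yields exactly one series per core, so the denominator equals each host's core count):
sum by (process,hostname) (irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100
  / on (hostname) group_left
  count by (hostname) (wmi_cpu_time_total{scaleset="name", mode="idle"})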
Cheers!

Related

"An Azure Application Gateway instance can support around 10 Capacity Units" - Explanation in simple word

I've read the documentation and searched the internet for a simple explanation of Azure Application Gateway auto-scaling and the quoted line above, but failed.
It would be really helpful if you could explain it, or provide a link to an explanation, for better understanding.
Thank you!
When you enable auto scaling you need to set a minimum and maximum instance count. How do you know how many instances you need for the amount of traffic you want to be able to handle? That is where Capacity Units play a role:
Capacity Unit is the measure of capacity utilization for an Application Gateway across multiple parameters.
A single Capacity Unit consists of the following parameters:
2500 Persistent connections
2.22-Mbps throughput
1 Compute Unit
If any one of these parameters is exceeded, then additional capacity units are necessary, even if the other two parameters don't exceed this single capacity unit's limits. The parameter with the highest utilization among the three above is used internally for calculating capacity units, which is in turn what gets billed.
When configuring the minimum and maximum number of instances you can now calculate how many instances you need, because a single instance can handle up to 10 Capacity Units - so, for example, a maximum of 10 * 2500 = 25,000 persistent connections per instance.
For example: if you expect to have to deal with 6000 persistent connections, you will need at least 3 capacity units (3 * 2500 = up to 7500 persistent connections).
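Putting it together as a worked calculation (a sketch; 2500, 2.22 and 1 are the per-capacity-unit limits quoted above, and the traffic figures are made-up examples):
required capacity units = max(ceil(connections / 2500), ceil(throughput in Mbps / 2.22), compute units)
required instances = ceil(required capacity units / 10)
So 6000 connections at 10 Mbps would need max(ceil(2.4), ceil(4.5), 1) = 5 capacity units, which still fits within a single instance.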

How To Get the Aggregated Usage of CPU in Grafana When Hyperthreading Is Enabled

We are running Grafana/Prometheus to monitor our CPU metrics and to find the aggregated CPU usage of all CPUs. The problem is that we have hyperthreading enabled, and when we stress the CPU, the percentage exceeds 100%. My question is how to limit the displayed CPU usage to 100%, even if the CPU is highly utilized.
P.S. I have tried setting the max and min limits in Grafana, but the graph spikes still go above that limit.
Kindly give me the right query for this problem.
The queries I have tried are given below.
sum(irate(node_cpu_seconds_total{instance="localhost",job="node", mode!="idle"}[5m]))*100
100 - avg(irate(node_cpu_seconds_total{instance="localhost",job="node", mode!="idle"}[5m]))*100
and other similar queries we have tried.
If all you want is to "cap" a variable or expression result at a maximum value (that is, 100), you could simply use the Prometheus function clamp_max.
Thus, you could do:
clamp_max(<expr>, 100)
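For example, applied to the first query from the question (a sketch: this only caps the displayed value, it does not normalize it per core):
clamp_max(sum(irate(node_cpu_seconds_total{instance="localhost",job="node", mode!="idle"}[5m])) * 100, 100)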
This is probably the most helpful query:
(1 - avg(irate(node_cpu_seconds_total{instance="$instance",job="$job",mode="idle"}[5m]))) * 100
Note mode="idle": averaging the per-core idle fraction and subtracting it from 1 keeps the result between 0 and 100% by construction, even with hyperthreading enabled. Replace $instance with your instance IP and $job with your node exporter job name.
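If you want one line per host instead of a single aggregate, the same idea can be grouped per instance (a sketch, assuming the standard node_exporter instance label):
(1 - avg by (instance) (irate(node_cpu_seconds_total{job="$job",mode="idle"}[5m]))) * 100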

Express (NodeJS) more cores vs. more nodes? (With Analysis and Examples)

When it comes to running Express (NodeJS) in something like Kubernetes, would it be more cost effective to run with more cores and fewer nodes, or more nodes with fewer cores each? (Assuming the cost per CPU is linear, e.g. 1 node with 4 cores = 2 nodes with 2 cores.)
In terms of redundancy, more nodes seem the obvious answer.
However, in terms of cost effectiveness, fewer nodes seem better, because with more nodes you are paying more for overhead and less for running your app. Here is an example:
1 node with 4 cores costs $40/month and is running:
10% Kubernetes overhead on one core
90% your app on one core and near 100% on the others
Therefore you are paying $40 for 90% + 3 x 100% = 390% of your app.
2 nodes with 2 cores each cost a total of $40/month and are running:
10% Kubernetes overhead on one core (PER NODE)
90% your app on one core and near 100% on the other (PER NODE)
Now you are paying $40 for 2 x (90% + 100%) = 2 x 190% = 380% of your app.
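The same arithmetic in general form (a sketch, assuming a fixed overhead of roughly one tenth of a core per node): with n nodes of c cores each, the compute left for your app is about n * (100c - 10)%. For the examples above: 1 * (400 - 10) = 390% versus 2 * (200 - 10) = 380%.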
I am assuming that balancing the two at around 4-8 cores per node is ideal, so you aren't paying so much for each node, you scale nodes less often, and you get a higher percentage of compute running your app per node. Is my logic right?
Edit: Math typo
Your logic is right, because the node does not come empty; it has to run some core components, such as:
kubelet
kube-proxy
container runtime (Docker, gVisor, or other)
other DaemonSets
Sometimes 3 large VMs are better than 4 medium VMs in terms of making the best use of capacity.
However, the main decider is the type of your workload (your apps):
If your apps eat memory more than CPUs (like Java apps), a node of [2 CPUs, 8 GB] is better than [4 CPUs, 8 GB].
If your apps eat CPUs more than memory (like ML workloads), choose the opposite: compute-optimized instances.
The golden rule 🏆 is to look at the capacity of the cluster as a whole rather than at the individual capacity of each node.
In the end, you need to consider not only cost effectiveness but also:
Resilience
HA
Redundancy

Azure Analytics: Difference Between Log-Based vs Standard Metrics

We are using an App Service on Azure with Application Insights enabled. Looking at CPU usage, we found that log-based metrics show an average CPU of 40-80%, while standard metrics show CPU usage of 150-300% for the same period and resource.
Can someone explain why there is such a big difference? And how can CPU usage go up to 300%?
CPU can be counted per core (max value = NumCores * 100) or normalized (averaged across all cores). For instance, if your app runs on a 4-core virtual machine, then 75% overall CPU utilization maps to 300% CPU-core utilization.
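As a simple conversion: per-core percent = normalized percent * NumCores, and conversely normalized percent = per-core percent / NumCores, so 300% / 4 = 75% in the example above.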
I guess in your case one metric is normalized and another isn't.

Understanding Azure SQL Performance

The facts:
1 Azure SQL S0 instance
a few tables, one of them containing ~8.6 million rows and 1 PK
Running a COUNT query on this table takes nearly 30 minutes (!) to complete.
Upscaling the instance from S0 to S1 reduces the query time to 13 minutes.
Looking into the Azure Portal (new version), the resource usage monitor shows the following (screenshots not included; they show a period at 100% utilization):
Questions:
Does anyone else consider even 13 minutes ridiculous for a simple COUNT()?
Does the second screenshot mean that during the 100% period my instance isn't responding to other requests?
Why are my metrics limited to 100% on both S0 and S1? (See the section "Which Service Tier is Right for My Database?", which states: "These values can be above 100% (a big improvement over the values in the preview that were limited to a maximum of 100).") I'd expect the S0 to be at 150% or so if the quoted statement is true.
I'm interested in other people's experiences with databases of more than 1,000 records or so. I don't see how an S*-scaled Azure SQL instance for 22-55 € per month could help me with upscaling strategies at the moment.
Azure SQL Database editions provide increasing levels of DTUs from Basic -> Standard -> Premium (CPU, IO, memory and other resources - see https://msdn.microsoft.com/en-us/library/azure/dn741336.aspx). Once your query reaches its DTU limit (100%) in any of these resource dimensions, it continues to receive resources at that level (but no more), and that may increase the latency of completing the request. It looks like in your scenario the query is hitting its DTU limit (10 DTUs for S0 and 20 for S1). You can see the individual resource usage percentages (CPU, Data IO or Log IO) by adding these metrics to the same graph, or by querying the DMV sys.dm_db_resource_stats.
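For example, a quick way to see which dimension is saturating (a sketch; sys.dm_db_resource_stats keeps roughly an hour of history at 15-second intervals):
SELECT TOP 20 end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
A value near 100 in avg_data_io_percent while the COUNT query runs would support the IO-limit explanation under 1) below.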
Here is a blog post with more information on appropriately sizing your database performance level: http://azure.microsoft.com/blog/2014/09/11/azure-sql-database-introduces-new-near-real-time-performance-metrics/
To your specific questions:
1) As you have 8.6 million rows, the database needs to scan the index entries to get the count back, so it may be hitting the IO limit of the edition here.
2) If you have multiple concurrent queries running against your DB, they will be scheduled appropriately so that no request is starved. But latencies may increase further for all queries, since you will be hitting the available resource limits.
3) For the older Web/Business editions, you may see metric values going beyond 100% (they are normalized to the limits of an S2 level), as those editions have no specific limits and run in a resource-shared environment with other customer loads. For the new editions, metrics will never exceed 100%, because the system guarantees you resources up to 100% of that edition's limits, but no more. This provides a predictable, guaranteed amount of resources for your DB, unlike Web/Business, where you might get very little or a lot more at different times depending on other customers' DB workloads running on the same machine.
Hope this helps.
-- Srini
