There are quite a few questions here on StackOverflow explaining how to calculate process CPU utilization (e.g. this). What I don't understand is how frequency scaling affects CPU utilization calculations. It seems to me that, if I follow the recommended formula (and I also checked top's source code, which does the same), a process running on a CPU at the lowest frequency and a process running at the highest frequency for the same duration will yield identical utilization rates. But this doesn't feel right to me, especially when CPU utilization is used as a stand-in for comparing power consumption.
What am I missing?
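To make the formula concrete, here is a minimal sketch of roughly what top measures (the 1-second sampling interval and the helper names are my own, and this is only an approximation of top's logic). Both counters are tick counts driven by wall-clock time, so the frequency a core happens to run at never appears in the ratio:

```c
// Rough sketch of the usual per-process %CPU calculation (approximately
// what top does). Build: gcc -O2 cpu_pct.c -o cpu_pct ; run: ./cpu_pct <pid>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sum of the time fields on the aggregate "cpu" line of /proc/stat (ticks). */
static unsigned long long total_jiffies(void)
{
    unsigned long long v, sum = 0;
    char label[16];
    FILE *f = fopen("/proc/stat", "r");
    if (f && fscanf(f, "%15s", label) == 1)
        while (fscanf(f, "%llu", &v) == 1)   /* stops at the "cpu0" line */
            sum += v;
    if (f) fclose(f);
    return sum;
}

/* utime + stime (fields 14 and 15) of /proc/<pid>/stat, in clock ticks. */
static unsigned long long proc_ticks(const char *pid)
{
    char path[64], buf[4096];
    unsigned long long utime = 0, stime = 0;
    snprintf(path, sizeof path, "/proc/%s/stat", pid);
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    char *p = strrchr(buf, ')');             /* comm may contain spaces */
    if (!p) return 0;
    sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %llu %llu",
           &utime, &stime);
    return utime + stime;
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    unsigned long long p0 = proc_ticks(argv[1]), t0 = total_jiffies();
    sleep(1);
    unsigned long long p1 = proc_ticks(argv[1]), t1 = total_jiffies();
    /* Everything is in ticks of wall-clock time; CPU frequency never appears. */
    double pct = 100.0 * (double)(p1 - p0) / (double)(t1 - t0)
                 * sysconf(_SC_NPROCESSORS_ONLN);
    printf("%%CPU (one core = 100%%): %.1f\n", pct);
    return 0;
}
```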
Related
I am trying to measure the impact of the CPU scheduler on a large AI program (https://github.com/mozilla/DeepSpeech).
Using strace, I can see that it uses a large number (~200) of CPU threads.
I have tried using Linux perf to measure this, but I have only been able to find the number of context-switch events, not the overhead they add.
What I am trying to achieve is the total CPU core-seconds spent on context switching. Since it is a pretty large program, I would prefer non-invasive tools so I don't have to edit its source code.
How can I do this?
Are you sure most of those 200 threads are actually waiting to run at the same time, not waiting for data from a system call? I guess you can tell from perf stat that context-switches are actually pretty high, but part of the question is whether they're high for the threads doing the critical work.
The cost of a context switch is reflected in cache misses once a thread is running again (and in stopping out-of-order exec from finding as much ILP right at the interrupt boundary). This cost is more significant than the cost of the kernel code that saves/restores registers. So even if there were a way to measure how much time the CPUs spent in kernel context-switch code (possible with the perf record sampling profiler, as long as your perf_event_paranoid setting allows recording kernel addresses), that wouldn't be an accurate reflection of the true cost.
Even making a system call has a similar (but lower and more frequent) performance cost from serializing OoO exec, as well as disturbing caches (and TLB). There's a useful characterization of this on real modern CPUs (from 2010) in a paper by Soares & Stumm, especially the graph on the first page of IPC (instructions per cycle) dropping after a system call returns, and taking time to recover: FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. (Conference presentation: https://www.usenix.org/conference/osdi10/flexsc-flexible-system-call-scheduling-exception-less-system-calls)
You might estimate context-switch cost by running the program on a system with enough cores not to need to context-switch much at all (e.g. a big many-core Xeon or Epyc), vs. on fewer cores but with the same CPUs / caches / inter-core latency and so on. That is, on the same system, use taskset --cpu-list 0-8 ./program to limit how many cores it can use.
Look at the total user-space CPU-seconds used: the increase is the extra CPU time needed because of slowdowns from context switches. The wall-clock time will of course be higher when the same work has to compete for fewer cores, but perf stat includes a "task-clock" output which tells you the total time in CPU-milliseconds that threads of your process spent on CPUs. That would be constant for the same amount of work under perfect scaling, whether across more threads or with the same threads competing for more or fewer cores.
But that would tell you about context-switch overhead on that big system with big caches and higher latency between cores than on a small desktop.
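If you want that comparison wrapped up in one place, here is a minimal sketch (my own, assuming Linux/glibc): it pins a command to the first N logical CPUs, the way taskset --cpu-list would, and reports the user/system CPU time and context-switch counts for the whole process, similar to what perf stat's task-clock and context-switches events give you. Run it once with a large N and once with a small N on the same machine; the growth in user CPU-seconds is the overhead you're trying to estimate.

```c
// Sketch: run a command restricted to the first N logical CPUs and report
// its total CPU time and context switches (the N and output format here
// are illustrative choices, not anything official).
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <ncpus> <command> [args...]\n", argv[0]);
        return 1;
    }
    int ncpus = atoi(argv[1]);

    pid_t pid = fork();
    if (pid == 0) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int i = 0; i < ncpus; i++)
            CPU_SET(i, &set);
        sched_setaffinity(0, sizeof set, &set);   /* like taskset --cpu-list */
        execvp(argv[2], &argv[2]);
        perror("execvp");
        _exit(127);
    }

    int status;
    struct rusage ru;
    wait4(pid, &status, 0, &ru);   /* rusage is summed over all of its threads */
    printf("user   %ld.%06ld s\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
    printf("system %ld.%06ld s\n",
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    printf("context switches: %ld voluntary, %ld involuntary\n",
           ru.ru_nvcsw, ru.ru_nivcsw);
    return 0;
}
```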
For my research I need a CPU benchmark to do some experiments on my Ubuntu laptop (Ubuntu 15.10, Memory 7.7 GiB, Intel Core i7-4500U CPU @ 1.80GHz x 4, 64-bit). In an ideal world, I would like to have a benchmark satisfying the following:
The benchmark should be an official benchmark rather than something I created myself, for transparency purposes.
The time needed to execute the benchmark on my laptop should be at least 5 minutes (the more the better).
The benchmark should produce different levels of CPU utilization throughout its execution. For example, I don't want a benchmark which permanently keeps the CPU utilization level at around 100% - I want a benchmark which makes the CPU utilization vary over time.
Points 2 and 3 in particular are key for my research. However, I haven't been able to find any suitable benchmarks so far. The benchmarks I have found include sysbench, CPU Fibonacci, CPU Blowfish, CPU CryptoHash, and CPU N-Queens. However, all of them take just a couple of seconds to complete, and the utilization level on my laptop is constantly at 100%.
Question: Does anyone know about a suitable benchmark for me? I am also happy to hear any other comments/questions you have. Thank you!
To choose a benchmark, you need to know exactly what you're trying to measure. Your question doesn't include that, so there's not much anyone can tell you without taking a wild guess.
If you're trying to measure how well Turbo clock speed works to make a power-limited CPU like your laptop run faster for bursty workloads (e.g. to compare Haswell against Skylake's new and improved power management), you could just run something trivial that's 1 second on, 2 seconds off, and count how many loop iterations it manages.
The duty cycle and cycle length should be benchmark parameters, so you can make plots. e.g. with very fast on/off cycles, Skylake's faster-reacting Turbo will ramp up faster and drop down to min power faster (leaving more headroom in the bank for the next burst).
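As a sketch of what that might look like (the durations, burst count, and chunk size below are arbitrary choices of mine): run a serial dependency chain for on_ms, sleep for off_ms, and record how many iterations each burst managed. Per-burst iteration counts should then show how quickly the clock ramps up.

```c
// Sketch of a bursty micro-benchmark: busy for on_ms, idle for off_ms,
// reporting iterations per burst so you can see the Turbo ramp-up.
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
    const double on_ms = 1000.0, off_ms = 2000.0;   /* 1 s on, 2 s off */
    const int bursts = 10;

    for (int b = 0; b < bursts; b++) {
        volatile uint64_t acc = 0;        /* volatile keeps the work alive */
        uint64_t iters = 0;
        double start = now_ms();
        do {
            for (int i = 0; i < 1000000; i++)  /* check the clock only rarely */
                acc += i;                      /* serial add chain via acc */
            iters += 1000000;
        } while (now_ms() - start < on_ms);
        printf("burst %2d: %llu iterations in %.0f ms\n",
               b, (unsigned long long)iters, now_ms() - start);

        struct timespec off = { .tv_sec = (time_t)(off_ms / 1000),
                                .tv_nsec = ((long)off_ms % 1000) * 1000000L };
        nanosleep(&off, NULL);
    }
    return 0;
}
```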
The speaker in that talk (the lead architect for power management on Intel CPUs) says that JavaScript benchmarks are actually bursty enough for Skylake's power management to give a measurable speedup, unlike most other benchmarks, which just peg the CPU at 100% the whole time. So maybe have a look at JavaScript benchmarks if you want to use well-known off-the-shelf benchmarks.
If rolling your own, put a loop-carried dependency chain in the loop, preferably with something that's not too variable in latency across microarchitectures. A long chain of integer adds would work, and Fibonacci is a good way to stop the compiler from optimizing it away. Either pick a fixed iteration count that works well for current CPU speeds, or check the clock every 10M iterations.
Or set a timer that will fire after some time, and have it set a flag that you check inside the loop (e.g. from a signal handler); alarm(2) may be a good choice. Record how many iterations you did in this burst of work.
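A minimal sketch of that alarm(2) approach (the 1-second burst length is an arbitrary choice): a Fibonacci-style add chain keeps the loop latency-bound and hard to optimize away, and the signal handler just sets a flag.

```c
// Sketch: run a loop-carried dependency chain until SIGALRM fires,
// then report how many iterations fit in the burst.
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t done = 0;

static void on_alarm(int sig) { (void)sig; done = 1; }

int main(void)
{
    signal(SIGALRM, on_alarm);
    alarm(1);                        /* end this burst after 1 second */

    uint64_t a = 0, b = 1, iters = 0;
    while (!done) {
        uint64_t t = a + b;          /* serial integer adds: latency-bound */
        a = b;
        b = t;
        iters++;
    }
    /* Use the results so the compiler can't discard the loop body. */
    printf("%llu iterations, a = %llu\n",
           (unsigned long long)iters, (unsigned long long)a);
    return 0;
}
```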
I'm running a DNA aligner which can use multiple threads; currently I'm running it with 16 on a computer cluster. Anecdotally, I feel like increasing the number of threads or even the amount of RAM gives diminishing returns. Does my benchmark below prove this point, since CPU time < System Time by a wide margin? In other words, does this mean that the limiting factor is something else, such as storage, so that using more threads will not help at this point? My goal is to reduce the total running time.
Thank you.
User Time = 1:06:03:35
System Time = 04:35:24
Wallclock Time = 08:24:50
CPU = 1:10:38:59
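To put those numbers on a common scale (assuming they are in the [days:]hours:minutes:seconds format that SGE-style batch schedulers print), the arithmetic works out like this:

```c
// Sketch: convert the reported times to seconds and estimate how many
// cores were busy on average (format assumption: [d:]hh:mm:ss).
#include <stdio.h>

int main(void)
{
    long user = ((1 * 24 + 6) * 60 + 3) * 60 + 35;    /* 1:06:03:35 -> 108215 s */
    long sys  = (4 * 60 + 35) * 60 + 24;              /* 04:35:24   ->  16524 s */
    long wall = (8 * 60 + 24) * 60 + 50;              /* 08:24:50   ->  30290 s */
    long cpu  = ((1 * 24 + 10) * 60 + 38) * 60 + 59;  /* 1:10:38:59 -> 124739 s */

    printf("user + system = %ld s, reported CPU = %ld s\n", user + sys, cpu);
    printf("average busy cores = %.1f (out of 16 threads)\n", (double)cpu / wall);
    return 0;
}
```

User plus system time matches the reported CPU time exactly, and dividing by the wallclock time gives roughly 4.1 cores busy on average out of the 16 threads, which is consistent with the job spending much of its time waiting on something other than the CPU.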
I set up collectd on my Debian 6 virtual machine for monitoring and performance analysis. Collectd's processes plugin provides statistics about a process's CPU usage, though the units of these statistics are not documented anywhere. It's certainly not jiffies or milliseconds, since the total CPU usage of several processes could go as high as 400,000 (of some unknown unit) per second on a 4-core virtual machine.
By looking at collectd's source code (https://github.com/collectd/collectd/blob/master/src/processes.c, in the ps_read_process function), I figured out that this data is read from the process's /proc/$pid/stat file. The proc man page (http://man7.org/linux/man-pages/man5/proc.5.html) says the CPU usage there is measured in clock ticks.
This is nice, but clock ticks are a little arbitrary for monitoring and performance analysis. I'd like to convert the clock-tick values to something more meaningful, ideally a percentage of total CPU time. How can I do that in a portable way, without just assuming my processor provides 3 GHz of clock ticks?
Further inspection of collectd's code revealed that the cpu usage is converted to microseconds.
Also, it turns out a similar question was already asked and answered here.
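For reference, the usual portable conversion is to divide by the tick rate the kernel reports via sysconf(_SC_CLK_TCK) rather than assuming any particular frequency. A small sketch (the tick count and the 60-second window below are made-up numbers, just for illustration):

```c
// Sketch: turn a delta of utime+stime clock ticks into CPU-seconds and
// into a percentage of one core over a sampling window.
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long hz = sysconf(_SC_CLK_TCK);          /* ticks per second, usually 100 */

    unsigned long long delta_ticks = 3500;   /* made-up sample */
    double window_s = 60.0;                  /* made-up sampling interval */

    double cpu_seconds = (double)delta_ticks / hz;
    double pct_one_core = 100.0 * cpu_seconds / window_s;

    printf("CLK_TCK = %ld\n", hz);
    printf("%.2f CPU-seconds = %.1f%% of one core over %.0f s\n",
           cpu_seconds, pct_one_core, window_s);
    return 0;
}
```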
(Sorry for the non-English characters in the picture. Each column is thread / CPU / average CPU.)
When I open the CPU tab in Resource Monitor on Windows 8.1, I see the values above.
What's the difference between CPU and average CPU?
At first, I thought Average CPU meant the average usage per core, but I have 4 cores, so in that case the value should be CPU = 4 * Average CPU, which it is not.
Please let me know the meaning of CPU and average CPU values.
CPU. Current percent of CPU consumption by the process, or how much of the system's processing power is being devoted to this specific process.
Average CPU. This is average CPU consumption by the process over the past 60 seconds. This gives you a real-time look at what's happening on the system right now and for the past minute.
http://www.techrepublic.com/blog/the-enterprise-cloud/use-resource-monitor-to-monitor-cpu-performance/