Perf tool stat output: multiplex and scaling of "cycles" - linux

I am trying to understand the multiplexing and scaling of the "cycles" event in the "perf" output.
The following is the output of perf tool:
144094.487583 task-clock (msec) # 1.017 CPUs utilized
539912613776 instructions # 1.09 insn per cycle (83.42%)
496622866196 cycles # 3.447 GHz (83.48%)
340952514 cache-misses # 10.354 % of all cache refs (83.32%)
3292972064 cache-references # 22.854 M/sec (83.26%)
144081.898558 cpu-clock (msec) # 1.017 CPUs utilized
4189372 page-faults # 0.029 M/sec
0 major-faults # 0.000 K/sec
4189372 minor-faults # 0.029 M/sec
8614431755 L1-dcache-load-misses # 5.52% of all L1-dcache hits (83.28%)
156079653667 L1-dcache-loads # 1083.223 M/sec (66.77%)
141.622640316 seconds time elapsed
I understand that the kernel uses multiplexing to give each event a chance to access the hardware, and hence the final output is an estimate.
The "cycles" event shows (83.48%). I am trying to understand how this number was derived.
I am running "perf" on an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz.

Peter Cordes' answer is on the right track.
PMU events are quite complicated: the number of counters is limited, some events are special, some logical events may be composed of multiple hardware events, and there may even be conflicts between events.
I believe Linux isn't aware of these limitations; it just tries to activate events (more precisely, event groups) from the list. It stops when it cannot activate all events, and it enables multiplexing. Whenever the multiplexing timer expires, it rotates the list of events, so activation effectively starts with the second one, then the third, and so on. Linux doesn't know that it could still activate the cycles event because it's special.
There is a barely documented option to pin certain events to give them priority, by adding :D after the name. Example on my system:
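The rotation described above can be sketched as a toy model. This is only an illustration: it assumes a fixed number of interchangeable counters and ignores event groups, fixed counters and scheduling constraints, so it will not reproduce the exact percentages perf prints. The names simulate_rotation and num_counters are made up for this sketch and are not part of perf or the kernel.

```python
from collections import deque


def simulate_rotation(events, num_counters, ticks):
    """Return the fraction of ticks each event was scheduled (toy model)."""
    queue = deque(events)
    active_ticks = {name: 0 for name in events}
    for _ in range(ticks):
        # Activate events from the front of the list until counters run out.
        for name in list(queue)[:num_counters]:
            active_ticks[name] += 1
        # Multiplexing timer fires: rotate so the next event goes first.
        queue.rotate(-1)
    return {name: active_ticks[name] / ticks for name in events}


events = ["cycles", "instructions", "cache-misses", "cache-references",
          "L1-dcache-load-misses", "L1-dcache-loads"]
# num_counters=4 is an assumption for illustration, not a measured value.
shares = simulate_rotation(events, num_counters=4, ticks=600)
for name, share in shares.items():
    print(f"{name:24s} {share:.2%}")
```

In this idealized model every event gets the same share (counters divided by events). The uneven percentages in real output suggest the actual scheduling is more complicated than a plain round-robin.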
$ perf stat -e cycles -e instructions -e cache-misses -e cache-references -e L1-dcache-load-misses -e L1-dcache-loads ...
119.444.297.774 cycles:u (55,88%)
130.133.371.858 instructions:u # 1,09 insn per cycle (67,81%)
38.277.984 cache-misses:u # 7,780 % of all cache refs (72,92%)
491.979.655 cache-references:u (77,00%)
3.892.617.942 L1-dcache-load-misses:u # 15,57% of all L1-dcache hits (82,19%)
25.004.563.072 L1-dcache-loads:u (43,85%)
Pinning instructions and cycles:
$ perf stat -e cycles:D -e instructions:D -e cache-misses -e cache-references -e L1-dcache-load-misses -e L1-dcache-loads ...
120.683.697.083 cycles:Du
132.185.743.504 instructions:Du # 1,10 insn per cycle
27.917.126 cache-misses:u # 4,874 % of all cache refs (61,14%)
572.718.930 cache-references:u (71,05%)
3.942.313.927 L1-dcache-load-misses:u # 15,39% of all L1-dcache hits (80,38%)
25.613.635.647 L1-dcache-loads:u (51,37%)
This results in the same multiplexing as omitting cycles and instructions entirely:
$ perf stat -e cache-misses -e cache-references -e L1-dcache-load-misses -e L1-dcache-loads ...
35.333.318 cache-misses:u # 7,212 % of all cache refs (62,44%)
489.922.212 cache-references:u (73,87%)
3.990.504.529 L1-dcache-load-misses:u # 15,40% of all L1-dcache hits (84,99%)
25.918.321.845 L1-dcache-loads:u
Note you can also group events (-e \{event1,event2\}) - which means events are always read together - or not at all if the combination cannot be activated together.
Note: there is an exception for software events, which can always be added. The relevant parts of the kernel code are in kernel/events/core.c.

IDK why there's any multiplexing at all for cycles or instructions, because there are dedicated counters for those 2 events on your CPU, which can't be programmed to count anything else.
But for the others, I'm pretty sure the percentages are in terms of the fraction of CPU time there was a hardware counter counting that event.
e.g. cache-references was counted for 83.26% of the 144094.487583 CPU-milliseconds your program was running for, or ~119973.07 ms. The total count is extrapolated from the time it was counting.
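A small sketch of that arithmetic, assuming the usual scaling rule of multiplying the raw count by enabled time over running time. The function names and the raw count in the last example are hypothetical; only the task-clock value and the 83.26% come from the output above.

```python
def running_time_ms(total_ms, percent_running):
    """Time the event was actually on a hardware counter."""
    return total_ms * percent_running / 100.0


def extrapolate(raw_count, time_enabled, time_running):
    """Scale a count observed during time_running up to time_enabled."""
    if time_running == 0:
        return 0.0  # the event never got a counter, nothing to scale
    return raw_count * time_enabled / time_running


task_clock_ms = 144094.487583
counted_ms = running_time_ms(task_clock_ms, 83.26)
print(f"cache-references was counted for about {counted_ms:.2f} ms")

# Hypothetical raw count, only to show the direction of the scaling.
estimate = extrapolate(1_000_000, time_enabled=task_clock_ms,
                       time_running=counted_ms)
print(f"a raw count of 1,000,000 would be reported as about {estimate:,.0f}")
```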

Related

How to add specific event counters to Perf whilst keeping the default output/events?

I'm profiling using Perf, currently generating this output:
perf stat -C 3 -B ./my_app
Performance counter stats for 'CPU(s) 3':
23,191.79 msec cpu-clock # 1.000 CPUs utilized
800 context-switches # 34.495 /sec
2 cpu-migrations # 0.086 /sec
1,098 page-faults # 47.344 /sec
55,871,690 cycles # 0.002 GHz
30,950,148 stalled-cycles-frontend # 55.40% frontend cycles idle
64,157,302 instructions # 1.15 insn per cycle
# 0.48 stalled cycles per insn
12,845,079 branches # 553.863 K/sec
227,892 branch-misses # 1.77% of all branches
I'd like to add some specific event counters not listed above.
However, when I list them explicitly, I lose the metadata in the right hand column and the default counters all disappear:
perf stat -e cache-misses -B ./my_app
Performance counter stats for 'CPU(s) 3':
207,463 cache-misses
4.437709174 seconds time elapsed
As you can see, the right-most column has disappeared. I'd like to keep this column, but add specific events.
Is it possible to take the default set of events using -B and add additional events?
If not, if I manually create my list of events, how do I keep the right-most column with the /sec etc?
I don't know of a convenient / short-command-line way to add one extra event. The man page doesn't seem to mention one.
I usually include the default events manually in the --event= list.
perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread
You can use -e or --event= more than once, e.g. -etask-clock,instructions,... -e uops_issued.any,uops_executed.thread, if that makes editing the command line easier: in bash line editing you can then remove the custom events with a single control-w instead of having to alt+backspace a word at a time.
See examples in some of my answers, such as the following where I included a perf stat command and actual output.
Why does this code execute more slowly after strength-reducing multiplications to loop-carried additions?
Can x86's MOV really be "free"? Why can't I reproduce this at all?
You can add the events for a "metric group" to the default events with -M L3_Cache_Access_BW, for example, as shown in How to calculate the L3 cache bandwidth by using the performance counters linux?. But not arbitrary single events.
The -d or -dd options can add events to whatever you specified with -e, (e.g. perf stat -e uops_executed.thread,task-clock -dd awk 'BEGIN{for(i=0;i<10000000;i++){}}') but there's no option to add the default events.
On Intel hardware, each core has fixed counters for cycles (clk_unhalted_...) and instructions (inst_retired.any), so always counting those doesn't take away from the number of events you can count with the programmable counters without multiplexing, e.g. 4 on a Skylake with hyperthreading. (perf may not know about that, treating cycles and instructions just like other events. So if it does have to multiplex it may sometimes be counting fewer events than it could be, and thus having a worse duty cycle for some events than it could.) The context-switches and other default events are software events, counted by the kernel not by the PMU, so any number of them can be enabled at once, and don't interact with multiplexing.
Secondary info annotations are just ratios of two events, printed if both are counted.
The /sec secondary info is computed if task-clock or duration-time is one of the events. (Related: Run time and reported cycle counts in linux perf, about how system-wide counting and/or --all-user or instructions:u can lead to a low CPU GHz (cycles/second) figure if not many unhalted clock cycles happened (in user-space) across the CPUs you were counting.)
For instructions, the default secondary info is IPC, so it's computed if you also measure cycles.
For cache-misses, the secondary info is percent of cache-references. (And no, you don't know which level of cache perf will choose to count with cache-misses, or what event cache-references maps to. These names are super generic.) Similar for other events that count cache misses in specific levels.
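As a rough check that these annotations really are plain ratios, here is a sketch that recomputes two of them from the default output in the question. The helper name ratio is made up for illustration; perf's own code handles scaling and missing events in more detail.

```python
def ratio(numerator, denominator):
    """Plain ratio, or NaN when the denominator event wasn't counted."""
    return numerator / denominator if denominator else float("nan")


# Counts copied from the perf stat output in the question.
cycles = 55_871_690
instructions = 64_157_302
branches = 12_845_079
branch_misses = 227_892

ipc = ratio(instructions, cycles)
branch_miss_pct = 100 * ratio(branch_misses, branches)

print(f"{ipc:.2f} insn per cycle")
print(f"{branch_miss_pct:.2f}% of all branches")
```

If cycles or branches is missing from the event list, there is nothing to divide by, which is why the right-hand column disappears when only cache-misses is requested.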
The -B option is on by default, and totally orthogonal to all of the event-selection and secondary annotation stuff. It's what uses thousands separators when printing large numbers. Use --no-big-num for the opposite, to get numbers you can copy/paste into a calculator.

Why does my benchmark randomly appear to run 3x as fast, and why does running it inside perf make this happen much more often?

I wrote a benchmark to test some particular functionality. Running the benchmark typically gave consistent results, but roughly one out of ten times it appeared to run something like 3x faster in every benchmarking test case.
I wondered if there was some kind of branch prediction or cache locality issue affecting this, so I ran it in perf, like so:
sudo perf stat -B -e cache-references,cache-misses,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses ./my_benchmark
Now the results are reversed: roughly nine out of ten times it runs faster, in which case the perf stat output looks like so:
Performance counter stats for './my_benchmark':
336,011 cache-references # 75.756 M/sec (41.40%)
74,722 cache-misses # 22.238 % of all cache refs
4.435442 task-clock (msec) # 0.964 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
572 page-faults # 0.129 M/sec
13,745,945 cycles # 3.099 GHz
16,521,518 instructions # 1.20 insn per cycle
4,453,340 branches # 1004.035 M/sec
91,336 branch-misses # 2.05% of all branches (58.60%)
0.004603313 seconds time elapsed
And in roughly one out of ten trials it runs 3x slower, showing results like so:
Performance counter stats for './my_benchmark':
348,441 cache-references # 22.569 M/sec (74.14%)
112,153 cache-misses # 32.187 % of all cache refs (74.14%)
15.439061 task-clock (msec) # 0.965 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
572 page-faults # 0.037 M/sec
13,717,144 cycles # 0.888 GHz (62.52%)
16,951,632 instructions # 1.24 insn per cycle (88.40%)
4,463,213 branches # 289.086 M/sec
70,185 branch-misses # 1.57% of all branches (89.20%)
0.015999175 seconds time elapsed
I notice that the task always seems to complete in roughly the same number of cycles, but that the frequencies are different -- in the "fast" case it shows something like 3GHz, whereas in the slow case it shows something like 900 MHz. I don't know explicitly what this stat means, though, so I don't know if this is just a tautological consequence of the similar number of cycles and longer runtime or whether it means the processor's clock is actually running at a different speed.
I do notice that in both cases it says "context switches: 0" and "cpu migrations: 0," so it doesn't look like the slowdown is coming from the benchmark being preempted.
What is going on here, and can I write (or launch?) my program in such a way that I always get the faster performance?
often CPU frequency is variable based on load ... I would force a freq lock prior to running this
what OS are you on ?

Why does it take so many instructions to run an empty program?

So recently I learned about the perf command in Linux. I decided to run some experiments, so I created an empty C program and measured how many instructions it took to run:
echo 'int main(){}'>emptyprogram.c && gcc -O3 emptyprogram.c -o empty
perf stat ./empty
This was the output:
Performance counter stats for './empty':
0.341833 task-clock (msec) # 0.678 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
112 page-faults # 0.328 M/sec
1,187,561 cycles # 3.474 GHz
1,550,924 instructions # 1.31 insn per cycle
293,281 branches # 857.966 M/sec
4,942 branch-misses # 1.69% of all branches
0.000504121 seconds time elapsed
Why is it using so many instructions to run a program that does literally nothing? I thought that maybe this was some baseline number of instructions that are necessary to load a program into the OS, so I looked for a minimal executable written in assembly, and I found a 142 byte executable that outputs "Hi World" here (http://timelessname.com/elfbin/)
Running perf stat on the 142 byte hello executable, I get:
Hi World
Performance counter stats for './hello':
0.069185 task-clock (msec) # 0.203 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
3 page-faults # 0.043 M/sec
126,942 cycles # 1.835 GHz
116,492 instructions # 0.92 insn per cycle
15,585 branches # 225.266 M/sec
1,008 branch-misses # 6.47% of all branches
0.000340627 seconds time elapsed
This still seems a lot higher than I'd expect, but we can accept it as a baseline. In that case, why did running empty take 10x more instructions? What did those instructions do? And if they're some sort of overhead, why is there so much variation in overhead between a C program and the helloworld assembly program?
It's hardly fair to claim that it "does literally nothing". Yes, at the app level you chose to make the whole thing a giant no-op for your microbenchmark, that's fine. But no, down beneath the covers at the system level, it's hardly "nothing". You asked linux to fork off a brand new execution environment, initialize it, and connect it to the environment. You called very few glibc functions, but dynamic linking is non-trivial and after a million instructions your process was ready to demand fault printf() and friends, and to efficiently bring in libs you might have linked against or dlopen()'ed.
This is not the sort of microbench that implementors are likely to optimize against. What would be of interest is if you can identify "expensive" aspects of fork/exec that in some use cases are never used, and so might be #ifdef'd out (or have their execution short circuited) in very specific situations. Lazy evaluation of resolv.conf is one example of that, where the overhead is never paid by a process if it never interacts with IP servers.

what's the meaning of cycles annotation in perf stat

8.014196 task-clock # 0.004 CPUs utilized
204 context-switches # 0.025 M/sec
32 cpu-migrations # 0.004 M/sec
0 page-faults # 0.000 K/sec
11,841,196 cycles # 1.478 GHz [46.17%]
9,982,788 stalled-cycles-frontend # 84.31% frontend cycles idle [80.26%]
8,122,708 stalled-cycles-backend # 68.60% backend cycles idle
5,462,302 instructions # 0.46 insns per cycle
# 1.83 stalled cycles per insn
1,098,309 branches # 137.045 M/sec
94,430 branch-misses # 8.60% of all branches [77.23%]
What's the meaning of 1.478 GHz and [46.17%] in the cycles annotation?
One thing I really dislike about perf is that the documentation and manual pages are outdated, so finding the meaning of some values is pretty complicated. I searched for these once, so here are my findings:
what's the meaning of 1.478 GHz
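To my knowledge, the value after # is the native counter value (the value in the first column) recalculated into a more readable form. For cycles it is the cycle count divided by the task-clock time, i.e. the average clock rate while the task was running, so it should roughly correspond to the clock speed of your processor: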
grep MHz /proc/cpuinfo
should give a similar value. It is printed from tools/perf/util/stat-shadow.c.
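A quick way to check this against the output in the question, assuming the GHz figure is simply the cycle count divided by the task-clock time (which matches the numbers here). The function name average_ghz is made up for this sketch.

```python
def average_ghz(cycles, task_clock_ms):
    """Cycles per nanosecond of task-clock, numerically equal to GHz."""
    task_clock_ns = task_clock_ms * 1e6
    return cycles / task_clock_ns


# Values copied from the question: 11,841,196 cycles, 8.014196 ms task-clock.
ghz = average_ghz(11_841_196, 8.014196)
print(f"{ghz:.3f} GHz")
```

Because the divisor is the time the task was on a CPU, the figure is an average over the run and can differ from the instantaneous frequency shown in /proc/cpuinfo.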
and [46.17%] in cycles's annotation?
This value should correspond to the portion of time the hardware counter was active. Perf lets you start more events than there are hardware counters and multiplexes them at run time, which makes things easier for the programmer.
I didn't find the actual place in the code, but it is described in one recently proposed patch as (part of the CSV format):
+ - percentage of measurement time the counter was running

Why are number of instructions non-deterministic in Linux performance counters

I wanted to use the Linux performance counters to profile the runtime of applications whose binaries will actually be run under a simulator (NS-3/DCE). I expected the instruction count for an application which has no source of non-determinism to be deterministic.
I couldn't have been more wrong, according to the Linux performance counters. Let's take a simple example:
$ (perf stat -c -- sleep 1 2>&1 && perf stat -c -- sleep 1 2>&1) |grep instructions
669218 instructions # 0,61 insns per cycle
682286 instructions # 0,58 insns per cycle
1) What is the source of this non-determinism? Does it stem from the low-level branch prediction and other engines in the CPU?
2) Is there a way to know the number of instructions fed to the CPU (in contrast to the number of instructions in the example output), in order to get the amount of executed code in a deterministic way?
Summary:
1) The non-determinism is caused by variation in the sleep 1 command, not by branch prediction or other microarchitectural features.
2) You can find the number of instructions fetched by using a hardware event counter, if your CPU supports it. However, this will vary more than the number of instructions retired (which is what perf typically reports for instructions).
Details:
The sleep command is not a good test case if you want a deterministic number of instructions to execute. It will execute a non-deterministic number of instructions because there will be some slight variation in what the kernel is doing.
You can specify whether to collect user-mode or kernel-mode instruction counts with the instructions:u for user-mode or instructions:k for kernel mode. For two runs of:
perf stat -e instructions:k,instructions:u,instructions sleep 1
I get the following results:
Performance counter stats for 'sleep 1':
373,044 instructions:k # 0.00 insns per cycle
199,795 instructions:u # 0.00 insns per cycle
572,839 instructions # 0.00 insns per cycle
1.001018153 seconds time elapsed
and
Performance counter stats for 'sleep 1':
379,722 instructions:k # 0.00 insns per cycle
199,970 instructions:u # 0.00 insns per cycle
579,519 instructions # 0.00 insns per cycle
1.000986201 seconds time elapsed
As you can see, the actual elapsed time of sleep 1 varies slightly, which is the source of the non-determinism. However, the number of user-mode instructions varies less than the number of kernel-mode instructions.
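To put numbers on that difference, here is a small sketch that compares the run-to-run spread of the two counts above. The helper relative_spread is a made-up name, and two runs are far too few for real statistics, so treat this as an illustration only.

```python
# Counts copied from the two perf stat runs above.
runs = [
    {"kernel": 373_044, "user": 199_795},
    {"kernel": 379_722, "user": 199_970},
]


def relative_spread(values):
    """(max - min) / min, as a percentage."""
    low, high = min(values), max(values)
    return 100.0 * (high - low) / low


kernel_spread = relative_spread([run["kernel"] for run in runs])
user_spread = relative_spread([run["user"] for run in runs])

print(f"kernel-mode spread: {kernel_spread:.2f}%")
print(f"user-mode spread:   {user_spread:.2f}%")
```

The kernel-mode count moves by a couple of percent between runs while the user-mode count moves by well under a tenth of a percent, so counting instructions:u alone gets much closer to a repeatable number.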

Resources