What is the meaning of the cycles annotation in perf stat?

8.014196 task-clock # 0.004 CPUs utilized
204 context-switches # 0.025 M/sec
32 cpu-migrations # 0.004 M/sec
0 page-faults # 0.000 K/sec
11,841,196 cycles # 1.478 GHz [46.17%]
9,982,788 stalled-cycles-frontend # 84.31% frontend cycles idle [80.26%]
8,122,708 stalled-cycles-backend # 68.60% backend cycles idle
5,462,302 instructions # 0.46 insns per cycle
# 1.83 stalled cycles per insn
1,098,309 branches # 137.045 M/sec
94,430 branch-misses # 8.60% of all branches [77.23%]
What is the meaning of 1.478 GHz and [46.17%] in the annotation on the cycles line?

This is something I really dislike about perf: the documentation and manual pages are outdated, and finding the meaning of some values is pretty complicated. I searched for them once, so here are my findings:
what's the meaning of 1.478 GHz
To my knowledge, the value after # is the native counter value (the first column) recalculated into a human-readable form. For cycles, it is the cycle count divided by the task-clock time, so it should roughly correspond to the clock speed of your processor:
grep MHz /proc/cpuinfo
should give a similar value. It is printed from tools/perf/util/stat-shadow.c.
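As a rough check of that reading, the GHz figure can be reproduced by dividing the cycle count by the task-clock time. The sketch below is in Python; the helper name ghz_from_perf is made up for illustration and is not part of perf, and the inputs are the cycles and task-clock values from the output in the question.

```python
def ghz_from_perf(cycles: int, task_clock_msec: float) -> float:
    """Average clock rate in GHz: cycles divided by on-CPU time.

    Hypothetical helper for illustration, not part of perf.
    """
    seconds = task_clock_msec / 1000.0
    return cycles / seconds / 1e9


# Values copied from the perf stat output in the question.
print(f"{ghz_from_perf(11_841_196, 8.014196):.3f} GHz")  # prints "1.478 GHz"
```

Because the divisor is task-clock (time spent on the CPU) rather than wall-clock time, a mostly idle program still reports a realistic clock rate.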
and [46.17%] in cycles's annotation?
This value should correspond to the portion of time the hardware counter was active. Perf allows you to request more events than there are hardware counters and multiplexes them at runtime, which makes things easier for the programmer.
I didn't find the actual place in the code, but it is described in a recently proposed patch (as part of the CSV format) as:
+ - percentage of measurement time the counter was running
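To make the multiplexing estimate concrete: for each event the kernel reports how long it was enabled and how long it actually ran on a hardware counter, and the raw count is scaled by the ratio of the two. The Python sketch below illustrates that extrapolation with made-up numbers and hypothetical function names; it is not perf's own code.

```python
def scale_count(raw_count: int, time_enabled_ns: int, time_running_ns: int) -> float:
    """Extrapolate a multiplexed counter value to the full enabled time."""
    if time_running_ns == 0:
        raise ValueError("event never ran on a hardware counter")
    return raw_count * time_enabled_ns / time_running_ns


def percent_running(time_enabled_ns: int, time_running_ns: int) -> float:
    """The bracketed percentage: share of enabled time the counter was live."""
    return 100.0 * time_running_ns / time_enabled_ns


# Illustrative numbers only: the event was on a counter for 4 ms out of 8 ms.
enabled_ns, running_ns = 8_000_000, 4_000_000
print(scale_count(5_000_000, enabled_ns, running_ns))       # estimated full count
print(f"[{percent_running(enabled_ns, running_ns):.2f}%]")  # annotation style
```

A low percentage therefore means a larger share of the reported count is an estimate rather than a measurement.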

Related

What to focus on in this `perf stat` report to improve performance?

I have a CPU-bound workload that I want to optimize. It consists entirely of numeric computations and look-up tables, i.e. arrays in memory. I am using Linux.
perf stat -d main gives the following:
Performance counter stats for 'main':
1,312.53 msec task-clock # 1.000 CPUs utilized
2 context-switches # 1.524 /sec
1 cpu-migrations # 0.762 /sec
1,828 page-faults # 1.393 K/sec
5,923,147,477 cycles # 4.513 GHz
23,334,861,436 instructions # 3.94 insn per cycle
2,108,821,736 branches # 1.607 G/sec
8,312,184 branch-misses # 0.39% of all branches
29,538,565,980 slots # 22.505 G/sec
23,515,015,270 topdown-retiring # 79.6% retiring
1,505,887,677 topdown-bad-spec # 5.1% bad speculation
3,359,287,895 topdown-fe-bound # 11.4% frontend bound
1,158,375,136 topdown-be-bound # 3.9% backend bound
7,328,606,538 L1-dcache-loads # 5.584 G/sec
1,065,578 L1-dcache-load-misses # 0.01% of all L1-dcache accesses
29,627 LLC-loads # 22.572 K/sec
7,467 LLC-load-misses # 25.20% of all LL-cache accesses
1.313027809 seconds time elapsed
1.304983000 seconds user
0.007981000 seconds sys
I need help interpreting this output, in particular identifying the most significant items and what to focus on.
I can see that there are 0.39% branch misses, 5.1% bad speculations (branch speculations?), 11.4% front-end bound and 25% LLC load misses (but 22K/s only).
I don't know what can be done about fe-bound and be-bound.
Since LLC loads are only 22K/s, is this insignificant?
I can see there are lots of branches, 1.6G/sec. But since bad speculations is 5.1% and branch misses is 0.39%, is this significant? Should I focus on reducing branches?
What to focus on in this perf stat report to improve performance?
Nothing. It provides no information that can determine if code is good or bad or could be better. If one program takes 5 seconds and gives extremely good looking stats in the summary (low cache misses, high instructions per cycle, etc), is it better or worse than an equivalent program that takes 2 seconds but gives extremely bad looking stats in the summary (high cache misses, etc)?
Start by finding out where the CPU is spending the most time (e.g. maybe using perf report); then (for the part of your code that costs the most time) look at the actual source code and determine if it could be using a faster algorithm, or caching results that are used again later (memoization), or using SIMD, or using multiple threads/CPUs.
When you're convinced the slowest part can't be improved, take a look at the 2nd slowest part (then the 3rd slowest part, then..) until you can't justify spending more time on "diminishing returns" (a 50% speed increase in something that consumes 80% of the time would be huge, but a 50% speed increase in something that consumes 1% of the time is almost worthless).
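The "diminishing returns" point can be put in numbers with Amdahl's law, which the answer does not name: if a fraction p of the run time is made s times faster, the whole program speeds up by 1 / ((1 - p) + p / s). The Python sketch below reads "a 50% speed increase" as s = 1.5, which is an assumption about what the answer means.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when `fraction` of the run time
    becomes `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# The two cases from the answer: a hot spot at 80% and one at 1% of run time.
for fraction in (0.80, 0.01):
    gain = overall_speedup(fraction, 1.5)
    print(f"{fraction:.0%} of run time made 1.5x faster -> {gain:.3f}x overall")
```

Under that reading, the 80% case yields roughly a 1.36x overall speedup, while the 1% case is barely measurable.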

How to determine ARMv8 CPU's frequency?

I am running some basic testing in Ubuntu for ARMv8 (Linux-aarch64) with QEMU emulator.
I want to get current CPU's frequency (nominal frequency is preferred), but from the output of lscpu or cat /proc/cpuinfo, there is NO CPU frequency info.
The answers to a similar question on Stack Exchange did not help me much.
The output of perf stat sleep 1 is as follows,
Performance counter stats for 'sleep 1':
36.845824 task-clock (msec) # 0.034 CPUs utilized
1 context-switches # 0.027 K/sec
0 cpu-migrations # 0.000 K/sec
49 page-faults # 0.001 M/sec
36,759,401 cycles # 0.998 GHz
<not supported> instructions
<not supported> branches
<not supported> branch-misses
1.068524527 seconds time elapsed
May I say the CPU is 1GHz?
The output of cpupower shows nothing about CPU frequency,
t@ubuntu:~/test/kermod$ sudo cpupower monitor
No HW Cstate monitors found
t@ubuntu:~/test/kermod$ sudo cpupower frequency-info
analyzing CPU 0:
no or unknown cpufreq driver is active on this CPU
CPUs which run at the same hardware frequency: Not Available
CPUs which need to have their frequency coordinated by software: Not Available
maximum transition latency: Cannot determine or is not supported.
Not Available
available cpufreq governors: Not Available
Unable to determine current policy
current CPU frequency: Unable to call hardware
current CPU frequency: Unable to call to kernel
t@ubuntu:~/test/kermod$ sudo cpupower info
System does not support Intel's performance bias setting
analyzing CPU 0:
The dmidecode -t processor shows,
t@ubuntu:~/test/kermod$ sudo dmidecode -t processor
# dmidecode 3.1
Getting SMBIOS data from sysfs.
SMBIOS 3.0.0 present.
Handle 0x0400, DMI type 4, 42 bytes
Processor Information
Socket Designation: CPU 0
Type: Central Processor
Family: Other
Manufacturer: QEMU
ID: 00 00 00 00 00 00 00 00
Version: virt-4.2
Voltage: Unknown
External Clock: Unknown
Max Speed: 2000 MHz
Current Speed: 2000 MHz
Status: Populated, Enabled
Upgrade: Other
L1 Cache Handle: Not Provided
L2 Cache Handle: Not Provided
L3 Cache Handle: Not Provided
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Core Count: 1
Core Enabled: 1
Thread Count: 1
Characteristics: None
It says the CPU is 2GHz, but I am not sure if that is correct.
Another approach is to sleep for a few seconds and read the difference in the CPU's cycle counter to calculate the frequency. With that method I got a CPU frequency of about 1 GHz.
Or is there any software interface or hardware register that can tell me the ARM CPU's frequency?
** Edit **
I asked my colleague to run perf stat sleep 1 on his real ARMv8 hardware, and we got,
Performance counter stats for 'sleep 1':
1.89 msec task-clock # 0.002 CPUs utilized
1 context-switches # 0.530 K/sec
0 cpu-migrations # 0.000 K/sec
43 page-faults # 0.023 M/sec
1859822 cycles # 0.985 GHz
758842 instructions # 0.41 insn per cycle
91818 branches # 48.632 M/sec
12077 branch-misses # 13.15% of all branches
1.003838600 seconds time elapsed
0.004158000 seconds user
0.000000000 seconds sys
His ARMv8 is running at 1GHz which matches the output of perf stat.
Comparing that with the QEMU emulation, the emulated CPU should also be running at 1 GHz, am I correct?
Have you looked up any tools to assist in this?
Cpupower has an ARM release cpupower 5.19-1. This should give you the information that you want.
cpupower monitor should display current frequency. Depending on what cpu you have, you will need to verify the core cluster type.
As a note, the frequency of an ARM CPU is not always comparable to an x86 CPU frequency. The way that computations are handled is very different.
** Edit **
So CPU frequency is measured in hertz, which is cycles per second. According to your perf stat sleep 1, the emulated ARM CPU had 36,759,401 cycles in that one second. That would equate to 36.75 MHz; the task-clock result reflects this.
Firstly, the question is to get the ARMv8 CPU core frequency (nominal or maximum). With all the discussion here and my testing in QEMU emulation and real ARMv8 hardware, there are 2 typical ways to get the CPU core frequency.
The first way: perf stat sleep 1 can provide reliable data about the CPU frequency (nominal at least).
In QEMU, it shows,
t@ubuntu:~$ sudo /usr/lib/linux-tools/4.15.0-189-generic/perf stat sleep 1
Performance counter stats for 'sleep 1':
37.075376 task-clock (msec) # 0.035 CPUs utilized
2 context-switches # 0.054 K/sec
0 cpu-migrations # 0.000 K/sec
52 page-faults # 0.001 M/sec
37,039,955 cycles # 0.999 GHz
<not supported> instructions
<not supported> branches
<not supported> branch-misses
1.055087406 seconds time elapsed
The cycles line shows that the emulated CPU is running at 1 GHz.
In real ARMv8 hardware (nominal frequency is 1GHz), it shows,
:~# perf stat sleep 1
Performance counter stats for 'sleep 1':
1.93 msec task-clock # 0.002 CPUs utilized
2 context-switches # 0.001 M/sec
0 cpu-migrations # 0.000 K/sec
44 page-faults # 0.023 M/sec
1897778 cycles # 0.982 GHz
779587 instructions # 0.41 insn per cycle
94295 branches # 48.782 M/sec
12509 branch-misses # 13.27% of all branches
1.003847600 seconds time elapsed
0.000000000 seconds user
0.004177000 seconds sys
The line of cycles shows it is running at 1GHz, which matches CPU's nominal frequency.
The second way: software can read pmccntr_el0 to calculate the CPU frequency. The software needs to keep the CPU at 100% usage while measuring, so that CPU power management does not interfere.
The sleep test result in the QEMU emulator is:
$ taskset -c 3 ./readpmc
Got pmccntr: 0x35dcdf412, 0x410b77476, diff: 0xb2e98064 (3001647204)
calculated CPU frequency: 1000549068
It reports a CPU frequency of 1 GHz, which matches QEMU's emulation.
The sleep test result on real hardware (1 GHz nominal frequency) is:
:~/temp# taskset -c 3 ./a.out
Got pmccntr: 0xffffffff8311b937, 0xffffffff8357acf7, diff: 0x45f3c0 (4584384)
calculated CPU frequency: 1528128
It shows 1.5 MHz, which does NOT match the nominal frequency of the real hardware.
After changing the code from sleeping for 3 seconds to busy looping for 3 seconds, the result is:
:~/temp# taskset -c 3 ./b.out
Inside handler function
Got pmccntr: 0xffffffff8c6857c3, 0x3f39438b, diff: 0xb2d0ebc8 (3000036296)
calculated CPU frequency: 1000012098
It shows CPU is running at 1GHz, matching CPU's real nominal hardware frequency.
The conclusion is that software can calculate the CPU's nominal frequency on ARMv8, just as can be done on x86 (through the TSC).
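The arithmetic behind these runs can be sketched in Python as follows. It assumes the two pmccntr_el0 samples were taken 3 seconds apart, as the text describes, and that the counter is 64 bits wide and may wrap between samples. The function names are illustrative and are not taken from the readpmc program.

```python
MASK64 = (1 << 64) - 1  # assumption: pmccntr_el0 is read as a 64-bit value


def counter_delta(start: int, end: int) -> int:
    """Cycles elapsed between two samples, tolerating one 64-bit wraparound."""
    return (end - start) & MASK64


def frequency_hz(start: int, end: int, seconds: int) -> int:
    """Average frequency over the interval, in whole hertz."""
    return counter_delta(start, end) // seconds


# Busy loop on real hardware: the counter wrapped between the two samples.
print(frequency_hz(0xFFFFFFFF8C6857C3, 0x3F39438B, 3))          # 1000012098
# Sleep on real hardware: the core was mostly idle, so few cycles elapsed.
print(frequency_hz(0xFFFFFFFF8311B937, 0xFFFFFFFF8357ACF7, 3))  # 1528128
```

Masking the difference to 64 bits is what keeps the busy-loop sample correct even though the second reading is numerically smaller than the first.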

Why does my benchmark randomly appear to run 3x as fast, and why does running it inside perf make this happen much more often?

I wrote a benchmark to test some particular functionality. Running the benchmark typically gave consistent results, but roughly one out of ten times it appeared to run something like 3x faster in every benchmarking test case.
I wondered if there was some kind of branch prediction or cache locality issue affecting this, so I ran it in perf, like so:
sudo perf stat -B -e cache-references,cache-misses,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses ./my_benchmark
Now the results are reversed: roughly nine out of ten times it runs faster, in which case the perf stat output looks like so:
Performance counter stats for './my_benchmark':
336,011 cache-references # 75.756 M/sec (41.40%)
74,722 cache-misses # 22.238 % of all cache refs
4.435442 task-clock (msec) # 0.964 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
572 page-faults # 0.129 M/sec
13,745,945 cycles # 3.099 GHz
16,521,518 instructions # 1.20 insn per cycle
4,453,340 branches # 1004.035 M/sec
91,336 branch-misses # 2.05% of all branches (58.60%)
0.004603313 seconds time elapsed
And in roughly one out of ten trials it runs 3x slower, showing results like so:
Performance counter stats for './my_benchmark':
348,441 cache-references # 22.569 M/sec (74.14%)
112,153 cache-misses # 32.187 % of all cache refs (74.14%)
15.439061 task-clock (msec) # 0.965 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
572 page-faults # 0.037 M/sec
13,717,144 cycles # 0.888 GHz (62.52%)
16,951,632 instructions # 1.24 insn per cycle (88.40%)
4,463,213 branches # 289.086 M/sec
70,185 branch-misses # 1.57% of all branches (89.20%)
0.015999175 seconds time elapsed
I notice that the task always seems to complete in roughly the same number of cycles, but that the frequencies are different -- in the "fast" case it shows something like 3GHz, whereas in the slow case it shows something like 900 MHz. I don't know explicitly what this stat means, though, so I don't know if this is just a tautological consequence of the similar number of cycles and longer runtime or whether it means the processor's clock is actually running at a different speed.
I do notice that in both cases it says "context switches: 0" and "cpu migrations: 0," so it doesn't look like the slowdown is coming from the benchmark being preempted.
What is going on here, and can I write (or launch?) my program in such a way that I always get the faster performance?
often CPU frequency is variable based on load ... I would force a freq lock prior to running this
what OS are you on ?

Perf tool stat output: multiplex and scaling of "cycles"

I am trying to understand the multiplex and scaling of "cycles" event in the "perf" output.
The following is the output of perf tool:
144094.487583 task-clock (msec) # 1.017 CPUs utilized
539912613776 instructions # 1.09 insn per cycle (83.42%)
496622866196 cycles # 3.447 GHz (83.48%)
340952514 cache-misses # 10.354 % of all cache refs (83.32%)
3292972064 cache-references # 22.854 M/sec (83.26%)
144081.898558 cpu-clock (msec) # 1.017 CPUs utilized
4189372 page-faults # 0.029 M/sec
0 major-faults # 0.000 K/sec
4189372 minor-faults # 0.029 M/sec
8614431755 L1-dcache-load-misses # 5.52% of all L1-dcache hits (83.28%)
156079653667 L1-dcache-loads # 1083.223 M/sec (66.77%)
141.622640316 seconds time elapsed
I understand that the kernel uses multiplexing to give each event a chance to access the hardware; and hence the final output is the estimate.
The "cycles" event shows (83.48%). I am trying to understand how this number was derived.
I am running "perf" on an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz.
Peter Cordes' answer is on the right track.
PMU events are quite complicated: the number of counters is limited, some events are special, some logical events may be composed of multiple hardware events, and there may even be conflicts between events.
I believe Linux isn't aware of these limitations; it just tries to activate events (more precisely, event groups) from the list. It stops when it cannot activate all events and turns on multiplexing. Whenever the multiplexing timer expires, it rotates the list of events, so that activation now starts with the second one, then the third, and so on. Linux doesn't know that it could still activate the cycles event because it's special.
There is a barely documented option to pin certain events to give them priority, by adding :D after the name. Example on my system:
$ perf stat -e cycles -e instructions -e cache-misses -e cache-references -e L1-dcache-load-misses -e L1-dcache-loads ...
119.444.297.774 cycles:u (55,88%)
130.133.371.858 instructions:u # 1,09 insn per cycle (67,81%)
38.277.984 cache-misses:u # 7,780 % of all cache refs (72,92%)
491.979.655 cache-references:u (77,00%)
3.892.617.942 L1-dcache-load-misses:u # 15,57% of all L1-dcache hits (82,19%)
25.004.563.072 L1-dcache-loads:u (43,85%)
Pinning instructions and cycles:
$ perf stat -e cycles:D -e instructions:D -e cache-misses -e cache-references -e L1-dcache-load-misses -e L1-dcache-loads ...
120.683.697.083 cycles:Du
132.185.743.504 instructions:Du # 1,10 insn per cycle
27.917.126 cache-misses:u # 4,874 % of all cache refs (61,14%)
572.718.930 cache-references:u (71,05%)
3.942.313.927 L1-dcache-load-misses:u # 15,39% of all L1-dcache hits (80,38%)
25.613.635.647 L1-dcache-loads:u (51,37%)
This results in the same multiplexing as omitting cycles and instructions:
$ perf stat -e cache-misses -e cache-references -e L1-dcache-load-misses -e L1-dcache-loads ...
35.333.318 cache-misses:u # 7,212 % of all cache refs (62,44%)
489.922.212 cache-references:u (73,87%)
3.990.504.529 L1-dcache-load-misses:u # 15,40% of all L1-dcache hits (84,99%)
25.918.321.845 L1-dcache-loads:u
Note you can also group events (-e \{event1,event2\}) - which means events are always read together - or not at all if the combination cannot be activated together.
1: There is an exception for software events that can always be added. The relevant parts of kernel code are in kernel/events/core.c.
IDK why there's any multiplexing at all for cycles or instructions, because there are dedicated counters for those 2 events on your CPU, which can't be programmed to count anything else.
But for the others, I'm pretty sure the percentages are in terms of the fraction of CPU time there was a hardware counter counting that event.
e.g. cache-references was counted for 83.26% of the 144094.487583 CPU-milliseconds your program was running for, or ~119973.07 ms. The total count is extrapolated from the time it was counting.
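The arithmetic in that example, spelled out as a small Python sketch. The function names are illustrative; the inputs are the task-clock and cache-references lines from the question. The raw-count estimate assumes simple linear scaling and uses the rounded percentage that perf prints, so it is only approximate.

```python
def running_ms(task_clock_ms: float, running_pct: float) -> float:
    """CPU time during which the event actually occupied a hardware counter."""
    return task_clock_ms * running_pct / 100.0


def estimated_raw_count(scaled_count: int, running_pct: float) -> float:
    """Approximate count observed before extrapolation (linear scaling assumed)."""
    return scaled_count * running_pct / 100.0


# cache-references: 3292972064 events reported, counter live (83.26%) of the time.
print(f"{running_ms(144094.487583, 83.26):.2f} ms")            # 119973.07 ms
print(f"{estimated_raw_count(3_292_972_064, 83.26):.0f} events observed (approx.)")
```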

Why does it take so many instructions to run an empty program?

So recently I learned about the perf command in Linux. I decided to run some experiments, so I created an empty C program and measured how many instructions it took to run:
echo 'int main(){}'>emptyprogram.c && gcc -O3 emptyprogram.c -o empty
perf stat ./empty
This was the output:
Performance counter stats for './empty':
0.341833 task-clock (msec) # 0.678 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
112 page-faults # 0.328 M/sec
1,187,561 cycles # 3.474 GHz
1,550,924 instructions # 1.31 insn per cycle
293,281 branches # 857.966 M/sec
4,942 branch-misses # 1.69% of all branches
0.000504121 seconds time elapsed
Why is it using so many instructions to run a program that does literally nothing? I thought that maybe this was some baseline number of instructions that are necessary to load a program into the OS, so I looked for a minimal executable written in assembly, and I found a 142 byte executable that outputs "Hi World" here (http://timelessname.com/elfbin/)
Running perf stat on the 142 byte hello executable, I get:
Hi World
Performance counter stats for './hello':
0.069185 task-clock (msec) # 0.203 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
3 page-faults # 0.043 M/sec
126,942 cycles # 1.835 GHz
116,492 instructions # 0.92 insn per cycle
15,585 branches # 225.266 M/sec
1,008 branch-misses # 6.47% of all branches
0.000340627 seconds time elapsed
This still seems a lot higher than I'd expect, but we can accept it as a baseline. In that case, why did running empty take 10x more instructions? What did those instructions do? And if they're some sort of overhead, why is there so much variation in overhead between a C program and the helloworld assembly program?
It's hardly fair to claim that it "does literally nothing". Yes, at the app level you chose to make the whole thing a giant no-op for your microbenchmark, that's fine. But no, down beneath the covers at the system level, it's hardly "nothing". You asked linux to fork off a brand new execution environment, initialize it, and connect it to the environment. You called very few glibc functions, but dynamic linking is non-trivial and after a million instructions your process was ready to demand fault printf() and friends, and to efficiently bring in libs you might have linked against or dlopen()'ed.
This is not the sort of microbench that implementors are likely to optimize against. What would be of interest is if you can identify "expensive" aspects of fork/exec that in some use cases are never used, and so might be #ifdef'd out (or have their execution short circuited) in very specific situations. Lazy evaluation of resolv.conf is one example of that, where the overhead is never paid by a process if it never interacts with IP servers.

Resources