So recently I learned about the perf command in Linux. I decided to run some experiments, so I created an empty C program and measured how many instructions it took to run:
echo 'int main(){}'>emptyprogram.c && gcc -O3 emptyprogram.c -o empty
perf stat ./empty
This was the output:
Performance counter stats for './empty':
0.341833 task-clock (msec) # 0.678 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
112 page-faults # 0.328 M/sec
1,187,561 cycles # 3.474 GHz
1,550,924 instructions # 1.31 insn per cycle
293,281 branches # 857.966 M/sec
4,942 branch-misses # 1.69% of all branches
0.000504121 seconds time elapsed
Why is it using so many instructions to run a program that does literally nothing? I thought that maybe this was some baseline number of instructions that are necessary to load a program into the OS, so I looked for a minimal executable written in assembly, and I found a 142 byte executable that outputs "Hi World" here (http://timelessname.com/elfbin/)
Running perf stat on the 142 byte hello executable, I get:
Hi World
Performance counter stats for './hello':
0.069185 task-clock (msec) # 0.203 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
3 page-faults # 0.043 M/sec
126,942 cycles # 1.835 GHz
116,492 instructions # 0.92 insn per cycle
15,585 branches # 225.266 M/sec
1,008 branch-misses # 6.47% of all branches
0.000340627 seconds time elapsed
This still seems a lot higher than I'd expect, but we can accept it as a baseline. In that case, why did running empty take 10x more instructions? What did those instructions do? And if they're some sort of overhead, why is there so much variation in overhead between a C program and the helloworld assembly program?
It's hardly fair to claim that it "does literally nothing". Yes, at the app level you chose to make the whole thing a giant no-op for your microbenchmark, that's fine. But no, down beneath the covers at the system level, it's hardly "nothing". You asked Linux to fork off a brand new execution environment, initialize it, and connect it to the environment. You called very few glibc functions, but dynamic linking is non-trivial, and after a million instructions your process was ready to demand-fault printf() and friends, and to efficiently bring in libs you might have linked against or dlopen()'ed.
This is not the sort of microbench that implementors are likely to optimize against. What would be of interest is if you can identify "expensive" aspects of fork/exec that in some use cases are never used, and so might be #ifdef'd out (or have their execution short circuited) in very specific situations. Lazy evaluation of resolv.conf is one example of that, where the overhead is never paid by a process if it never interacts with IP servers.
I have a CPU-bound workload that I want to optimize. It consists entirely of numeric computations and look-up tables, i.e. arrays in memory. I am using Linux.
perf stat -d main gives the following:
Performance counter stats for 'main':
1,312.53 msec task-clock # 1.000 CPUs utilized
2 context-switches # 1.524 /sec
1 cpu-migrations # 0.762 /sec
1,828 page-faults # 1.393 K/sec
5,923,147,477 cycles # 4.513 GHz
23,334,861,436 instructions # 3.94 insn per cycle
2,108,821,736 branches # 1.607 G/sec
8,312,184 branch-misses # 0.39% of all branches
29,538,565,980 slots # 22.505 G/sec
23,515,015,270 topdown-retiring # 79.6% retiring
1,505,887,677 topdown-bad-spec # 5.1% bad speculation
3,359,287,895 topdown-fe-bound # 11.4% frontend bound
1,158,375,136 topdown-be-bound # 3.9% backend bound
7,328,606,538 L1-dcache-loads # 5.584 G/sec
1,065,578 L1-dcache-load-misses # 0.01% of all L1-dcache accesses
29,627 LLC-loads # 22.572 K/sec
7,467 LLC-load-misses # 25.20% of all LL-cache accesses
1.313027809 seconds time elapsed
1.304983000 seconds user
0.007981000 seconds sys
I need help interpreting this output, in particular identifying the most significant items and what to focus on.
I can see that there are 0.39% branch misses, 5.1% bad speculations (branch speculations?), 11.4% front-end bound and 25% LLC load misses (but 22K/s only).
I don't know what can be done about fe-bound and be-bound.
Since LLC loads are only 22K/s, is this insignificant?
I can see there are lots of branches, 1.6G/sec. But since bad speculations is 5.1% and branch misses is 0.39%, is this significant? Should I focus on reducing branches?
What to focus on in this perf stat report to improve performance?
Nothing. It provides no information that can determine if code is good or bad or could be better. If one program takes 5 seconds and gives extremely good looking stats in the summary (low cache misses, high instructions per cycle, etc), is it better or worse than an equivalent program that takes 2 seconds but gives extremely bad looking stats in the summary (high cache misses, etc)?
Start by finding out where the CPU is spending the most time (e.g. maybe using perf report); then (for the part of your code that costs the most time) look at the actual source code and determine if it could be using a faster algorithm, or caching results that are used again later (memoization), or using SIMD, or using multiple threads/CPUs.
When you're convinced the slowest part can't be improved, take a look at the 2nd slowest part (then the 3rd slowest part, then..) until you can't justify spending more time on "diminishing returns" (a 50% speed increase in something that consumes 80% of the time would be huge, but a 50% speed increase in something that consumes 1% of the time is almost worthless).
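The "diminishing returns" arithmetic above can be made concrete with Amdahl's law. This is only a rough sketch: it assumes "a 50% speed increase" means the affected part runs 1.5x faster, and the function name overall_speedup is made up for the illustration.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the
    runtime becomes `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Assumption: "50% speed increase" is read as a 1.5x local speedup.
hot = overall_speedup(0.80, 1.5)   # part that consumes 80% of the time
cold = overall_speedup(0.01, 1.5)  # part that consumes 1% of the time

print(f"80% hotspot: {hot:.3f}x overall")
print(f" 1% hotspot: {cold:.3f}x overall")
```

Under that reading, the same local improvement gives roughly a 1.36x overall gain on the 80% hotspot and about 1.003x on the 1% one, which is why profiling to find the hotspot comes first.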
I wrote a benchmark to test some particular functionality. Running the benchmark typically gave consistent results, but roughly one out of ten times it appeared to run something like 3x faster in every benchmarking test case.
I wondered if there was some kind of branch prediction or cache locality issue affecting this, so I ran it in perf, like so:
sudo perf stat -B -e cache-references,cache-misses,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses ./my_benchmark
Now the results are reversed: roughly nine out of ten times it runs faster, in which case the perf stat output looks like so:
Performance counter stats for './my_benchmark':
336,011 cache-references # 75.756 M/sec (41.40%)
74,722 cache-misses # 22.238 % of all cache refs
4.435442 task-clock (msec) # 0.964 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
572 page-faults # 0.129 M/sec
13,745,945 cycles # 3.099 GHz
16,521,518 instructions # 1.20 insn per cycle
4,453,340 branches # 1004.035 M/sec
91,336 branch-misses # 2.05% of all branches (58.60%)
0.004603313 seconds time elapsed
And in roughly one out of ten trials it runs 3x slower, showing results like so:
Performance counter stats for './my_benchmark':
348,441 cache-references # 22.569 M/sec (74.14%)
112,153 cache-misses # 32.187 % of all cache refs (74.14%)
15.439061 task-clock (msec) # 0.965 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
572 page-faults # 0.037 M/sec
13,717,144 cycles # 0.888 GHz (62.52%)
16,951,632 instructions # 1.24 insn per cycle (88.40%)
4,463,213 branches # 289.086 M/sec
70,185 branch-misses # 1.57% of all branches (89.20%)
0.015999175 seconds time elapsed
I notice that the task always seems to complete in roughly the same number of cycles, but that the frequencies are different -- in the "fast" case it shows something like 3GHz, whereas in the slow case it shows something like 900 MHz. I don't know explicitly what this stat means, though, so I don't know if this is just a tautological consequence of the similar number of cycles and longer runtime or whether it means the processor's clock is actually running at a different speed.
I do notice that in both cases it says "context switches: 0" and "cpu migrations: 0," so it doesn't look like the slowdown is coming from the benchmark being preempted.
What is going on here, and can I write (or launch?) my program in such a way that I always get the faster performance?
often CPU frequency is variable based on load ... I would force a freq lock prior to running this
what OS are you on ?
I am trying to understand the multiplexing and scaling of the "cycles" event in the "perf" output.
The following is the output of perf tool:
144094.487583 task-clock (msec) # 1.017 CPUs utilized
539912613776 instructions # 1.09 insn per cycle (83.42%)
496622866196 cycles # 3.447 GHz (83.48%)
340952514 cache-misses # 10.354 % of all cache refs (83.32%)
3292972064 cache-references # 22.854 M/sec (83.26%)
144081.898558 cpu-clock (msec) # 1.017 CPUs utilized
4189372 page-faults # 0.029 M/sec
0 major-faults # 0.000 K/sec
4189372 minor-faults # 0.029 M/sec
8614431755 L1-dcache-load-misses # 5.52% of all L1-dcache hits (83.28%)
156079653667 L1-dcache-loads # 1083.223 M/sec (66.77%)
141.622640316 seconds time elapsed
I understand that the kernel uses multiplexing to give each event a chance to access the hardware, and hence the final output is an estimate.
The "cycles" event shows (83.48%). I am trying to understand how this number was derived.
I am running "perf" on Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz.
Peter Cordes' answer is on the right track.
PMU events are quite complicated: the number of counters is limited, some events are special, some logical events may be composed of multiple hardware events, and there may even be conflicts between events.
I believe Linux isn't aware of these limitations; it just tries to activate events (to be more precise, event groups) from the list. If it cannot activate all events, it stops and enables multiplexing. Whenever the multiplexing timer expires, it rotates the list of events, effectively starting the activation with the second one, then the third, and so on. Linux doesn't know that it could still activate the cycles event because it's special.
There is a barely documented option to pin certain events to give them priority, by adding :D after the name. Example on my system:
$ perf stat -e cycles -e instructions -e cache-misses -e cache-references -e L1-dcache-load-misses -e L1-dcache-loads ...
119.444.297.774 cycles:u (55,88%)
130.133.371.858 instructions:u # 1,09 insn per cycle (67,81%)
38.277.984 cache-misses:u # 7,780 % of all cache refs (72,92%)
491.979.655 cache-references:u (77,00%)
3.892.617.942 L1-dcache-load-misses:u # 15,57% of all L1-dcache hits (82,19%)
25.004.563.072 L1-dcache-loads:u (43,85%)
Pinning instructions and cycles:
$ perf stat -e cycles:D -e instructions:D -e cache-misses -e cache-references -e L1-dcache-load-misses -e L1-dcache-loads ...
120.683.697.083 cycles:Du
132.185.743.504 instructions:Du # 1,10 insn per cycle
27.917.126 cache-misses:u # 4,874 % of all cache refs (61,14%)
572.718.930 cache-references:u (71,05%)
3.942.313.927 L1-dcache-load-misses:u # 15,39% of all L1-dcache hits (80,38%)
25.613.635.647 L1-dcache-loads:u (51,37%)
This results in the same multiplexing as omitting cycles and instructions altogether:
$ perf stat -e cache-misses -e cache-references -e L1-dcache-load-misses -e L1-dcache-loads ...
35.333.318 cache-misses:u # 7,212 % of all cache refs (62,44%)
489.922.212 cache-references:u (73,87%)
3.990.504.529 L1-dcache-load-misses:u # 15,40% of all L1-dcache hits (84,99%)
25.918.321.845 L1-dcache-loads:u
Note you can also group events (-e \{event1,event2\}) - which means events are always read together - or not at all if the combination cannot be activated together.
1: There is an exception for software events that can always be added. The relevant parts of kernel code are in kernel/events/core.c.
IDK why there's any multiplexing at all for cycles or instructions, because there are dedicated counters for those 2 events on your CPU, which can't be programmed to count anything else.
But for the others, I'm pretty sure the percentages are in terms of the fraction of CPU time there was a hardware counter counting that event.
e.g. cache-references was counted for 83.26% of the 144094.487583 CPU-milliseconds your program was running for, or ~119973.07 ms. The total count is extrapolated from the time it was counting.
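The extrapolation described above can be sketched as follows. This only illustrates the arithmetic, assuming the usual scaling rule for a multiplexed counter (raw count multiplied by time enabled over time running) and, as the answer does, using task-clock as the enabled time. The function name scaled_count and the sample raw count are hypothetical.

```python
def scaled_count(raw_count, time_enabled, time_running):
    """Extrapolate a multiplexed counter to the full enabled time.
    Returns None if the counter never got onto the hardware."""
    if time_running == 0:
        return None
    return raw_count * time_enabled / time_running


# Figures from the perf output above (cache-references line).
task_clock_ms = 144094.487583
running_pct = 83.26

# Time the counter was actually counting, per the percentage.
time_running_ms = task_clock_ms * running_pct / 100
print(f"counting for about {time_running_ms:.2f} ms")

# Hypothetical raw count, purely to show the scaling step.
print(scaled_count(1000, time_enabled=100, time_running=50))
```

The first print reproduces the roughly 119973.07 ms mentioned above; the second shows a counter that ran half the time having its raw count doubled.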
8.014196 task-clock # 0.004 CPUs utilized
204 context-switches # 0.025 M/sec
32 cpu-migrations # 0.004 M/sec
0 page-faults # 0.000 K/sec
11,841,196 cycles # 1.478 GHz [46.17%]
9,982,788 stalled-cycles-frontend # 84.31% frontend cycles idle [80.26%]
8,122,708 stalled-cycles-backend # 68.60% backend cycles idle
5,462,302 instructions # 0.46 insns per cycle
# 1.83 stalled cycles per insn
1,098,309 branches # 137.045 M/sec
94,430 branch-misses # 8.60% of all branches [77.23%]
What's the meaning of 1.478 GHz and [46.17%] in the annotation for cycles?
One thing I really dislike about perf is that the documentation and manual pages are outdated, so finding the meaning of some values is pretty complicated. I searched for these once, so I'll add my findings:
what's the meaning of 1.478 GHz
To my knowledge, the value after # is a recalculation of the native counter value (the value in the first column) into a human-readable form. This value should roughly correspond to the clock speed of your processor:
grep MHz /proc/cpuinfo
should give a similar value. It is printed from tools/perf/util/stat-shadow.c.
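As a quick sanity check, dividing the cycles count by the task-clock time from the output above reproduces the printed figure. This is a sketch of that arithmetic only, not a copy of what stat-shadow.c does; the helper name cycles_to_ghz is made up for the example.

```python
def cycles_to_ghz(cycles, task_clock_ms):
    """Average clock rate while the task was on a CPU.
    1 ms = 1e6 ns, and cycles per nanosecond equals GHz."""
    return cycles / (task_clock_ms * 1e6)


# Values taken from the cycles and task-clock lines above.
ghz = cycles_to_ghz(11_841_196, 8.014196)
print(f"{ghz:.3f} GHz")
```

So the number is an average over the time the task was actually running, which is why it can differ from the instantaneous values in /proc/cpuinfo when the CPU changes frequency.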
and [46.17%] in cycles's annotation?
This value should correspond to the portion of time the hardware counter was active. Perf allows you to request more events than there are hardware counters and multiplexes them at runtime, which makes things easier for the programmer.
I didn't find the actual place in the code, but it is described in one recently proposed patch as (part of the CSV format):
+ - percentage of measurement time the counter was running
I want to profile application runtimes whose binaries will actually be run under a simulator (NS-3/DCE). I wanted to use the Linux performance counters, and I expected the instruction count for an application with no source of non-determinism to be deterministic.
According to the Linux performance counters, I couldn't have been more wrong. Let's take a simple example:
$ (perf stat -c -- sleep 1 2>&1 && perf stat -c -- sleep 1 2>&1) |grep instructions
669218 instructions # 0,61 insns per cycle
682286 instructions # 0,58 insns per cycle
1) What is the source of this non-determinism? Does it stem from low-level branch prediction and other engines in the CPU?
2) Is there a way to know the number of instructions fed to the CPU (in contrast to the number of instructions in the example output), in order to get the amount of executed code in a deterministic way?
Summary:
1) The non-determinism is caused by variation in the sleep 1 command, not by branch prediction or other microarchitectural features.
2) You can find the number of instructions fetched by using a hardware event counter, if your CPU supports it. However, this will vary more than the number of instructions retired (which is what perf typically reports for instructions).
Details:
The sleep command is not a good test case if you want a deterministic number of instructions to execute. It will execute a non-deterministic number of instructions because there will be some slight variation in what the kernel is doing.
You can specify whether to collect user-mode or kernel-mode instruction counts with instructions:u for user mode or instructions:k for kernel mode. For two runs of:
perf stat -e instructions:k,instructions:u,instructions sleep 1
I get the following results:
Performance counter stats for 'sleep 1':
373,044 instructions:k # 0.00 insns per cycle
199,795 instructions:u # 0.00 insns per cycle
572,839 instructions # 0.00 insns per cycle
1.001018153 seconds time elapsed
and
Performance counter stats for 'sleep 1':
379,722 instructions:k # 0.00 insns per cycle
199,970 instructions:u # 0.00 insns per cycle
579,519 instructions # 0.00 insns per cycle
1.000986201 seconds time elapsed
As you can see, the actual elapsed time of sleep 1 varies slightly, which is the source of the non-determinism. However, the number of user-mode instructions varies less than the number of kernel-mode instructions.
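To put numbers on that difference, here is a small sketch that computes the relative spread between the two runs shown above. The helper name relative_spread is made up, and two runs are far too few to draw statistical conclusions; it only quantifies these particular figures.

```python
def relative_spread(a, b):
    """Difference between two counts as a percentage of the smaller one."""
    return abs(a - b) / min(a, b) * 100


# Instruction counts from the two `sleep 1` runs above.
runs = {
    "instructions:k": (373_044, 379_722),
    "instructions:u": (199_795, 199_970),
    "instructions": (572_839, 579_519),
}

for event, (first, second) in runs.items():
    print(f"{event:15s} {relative_spread(first, second):.3f}% apart")
```

For these two runs the kernel-mode counts differ by a little under 2%, while the user-mode counts differ by less than 0.1%.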