I want to profile application runtimes whose binaries will actually be run under a simulator (NS-3/DCE). I wanted to use the Linux performance counters, and I expected the instruction count for an application with no source of non-determinism to be deterministic.
According to the Linux performance counters, I couldn't have been more wrong. Let's take a simple example:
$ (perf stat -c -- sleep 1 2>&1 && perf stat -c -- sleep 1 2>&1) |grep instructions
669218 instructions # 0,61 insns per cycle
682286 instructions # 0,58 insns per cycle
1) What is the source of this non-determinism? Does it stem from the low-level branch prediction and other engines in the CPU?
2) Is there a way to know the number of instructions fed to the CPU (in contrast to the retired-instruction counts in the example output), in order to get the amount of executed code in a deterministic way?
Summary:
1) The non-determinism is caused by variation in the sleep 1 command, not by branch prediction or other microarchitectural features.
2) You can find the number of instructions fetched by using a hardware event counter, if your CPU supports it. However, this will vary more than the number of instructions retired (which is what perf typically reports for instructions).
Details:
The sleep command is not a good test case if you want a deterministic number of instructions to execute. It will execute a non-deterministic number of instructions because there will be some slight variation in what the kernel is doing.
You can collect user-mode and kernel-mode instruction counts separately: instructions:u for user mode, instructions:k for kernel mode. For two runs of:
perf stat -e instructions:k,instructions:u,instructions sleep 1
I get the following results:
Performance counter stats for 'sleep 1':
373,044 instructions:k # 0.00 insns per cycle
199,795 instructions:u # 0.00 insns per cycle
572,839 instructions # 0.00 insns per cycle
1.001018153 seconds time elapsed
and
Performance counter stats for 'sleep 1':
379,722 instructions:k # 0.00 insns per cycle
199,970 instructions:u # 0.00 insns per cycle
579,519 instructions # 0.00 insns per cycle
1.000986201 seconds time elapsed
As you can see, the actual elapsed time of sleep 1 varies slightly, which is the source of the non-determinism. However, the number of user-mode instructions shows less variation than the kernel-mode count.
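To see how much run-to-run variation remains, a small shell loop helps (a sketch: -x, switches perf stat to CSV output so the counts are easy to grep, and perf writes its stats to stderr, hence the 2>&1):
for i in 1 2 3 4 5; do perf stat -x, -e instructions:u,instructions:k sleep 1 2>&1 | grep instructions; done
Across the five runs the instructions:u count should stay nearly constant while instructions:k drifts.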
Related
I'm profiling using Perf, currently generating this output:
perf stat -C 3 -B ./my_app
Performance counter stats for 'CPU(s) 3':
23,191.79 msec cpu-clock # 1.000 CPUs utilized
800 context-switches # 34.495 /sec
2 cpu-migrations # 0.086 /sec
1,098 page-faults # 47.344 /sec
55,871,690 cycles # 0.002 GHz
30,950,148 stalled-cycles-frontend # 55.40% frontend cycles idle
64,157,302 instructions # 1.15 insn per cycle
# 0.48 stalled cycles per insn
12,845,079 branches # 553.863 K/sec
227,892 branch-misses # 1.77% of all branches
I'd like to add some specific event counters not listed above.
However, when I list them explicitly, I lose the metadata in the right hand column and the default counters all disappear:
perf stat -e cache-misses -B ./my_app
Performance counter stats for 'CPU(s) 3':
207,463 cache-misses
4.437709174 seconds time elapsed
As you can see, the right-most column has disappeared. I'd like to keep this column, but add specific events.
Is it possible to take the default set of events using -B and add additional events?
If not, if I manually create my list of events, how do I keep the right-most column with the /sec etc?
I don't know of a convenient / short-command-line way to add one extra event. The man page doesn't seem to mention one.
I usually include the default events manually in the --event= list.
perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread
You can use -e or --event= more than once, e.g. -etask-clock,instructions,... -e uops_issued.any,uops_executed.thread. That makes editing the command line easier: you can kill the custom events with a single control-w instead of alt+backspacing a word at a time in bash line editing.
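For example (a sketch; ./my_app stands in for your real program, and the first -e reproduces most of the default list):
perf stat -e task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions -e cache-misses,cache-references ./my_app
Keeping the custom events in the trailing -e means one control-w wipes exactly that part of the command line.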
See examples in some of my answers, such as the following where I included a perf stat command and actual output.
Why does this code execute more slowly after strength-reducing multiplications to loop-carried additions?
Can x86's MOV really be "free"? Why can't I reproduce this at all?
You can add the events for a "metric group" to the default events with
-M L3_Cache_Access_BW for example, as shown in How to calculate the L3 cache bandwidth by using the performance counters linux?. But not arbitrary single events.
The -d or -dd options can add events to whatever you specified with -e, (e.g. perf stat -e uops_executed.thread,task-clock -dd awk 'BEGIN{for(i=0;i<10000000;i++){}}') but there's no option to add the default events.
On Intel hardware, each core has fixed counters for cycles (clk_unhalted_...) and instructions (inst_retired.any), so always counting those doesn't take away from the number of events you can count with the programmable counters without multiplexing, e.g. 4 on a Skylake with hyperthreading. (perf may not know about that, treating cycles and instructions just like other events. So if it does have to multiplex it may sometimes be counting fewer events than it could be, and thus having a worse duty cycle for some events than it could.) The context-switches and other default events are software events, counted by the kernel not by the PMU, so any number of them can be enabled at once, and don't interact with multiplexing.
Secondary info annotations are just ratios of two events, printed if both are counted.
The /sec secondary info is computed if task-clock or duration-time is one of the events. (Related: Run time and reported cycle counts in linux perf re: system-wide counting and/or --all-user or instructions:u leading to low CPU GHz (cycles/second) if not many unhalted clock cycles happened (in user-space) across the CPUs you were counting.)
For instructions, the default secondary info is IPC, so it's computed if you also measure cycles.
For cache-misses, the secondary info is percent of cache-references. (And no, you don't know which level of cache perf will choose to count with cache-misses, or what event cache-references maps to. These names are super generic.) Similar for other events that count cache misses in specific levels.
The -B option is on by default, and totally orthogonal to all of the event-selection and secondary annotation stuff. It's what uses thousands separators when printing large numbers. Use --no-big-num for the opposite, to get numbers you can copy/paste into a calculator.
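For example (a sketch; ./my_app is a placeholder):
perf stat --no-big-num -e instructions,cycles ./my_app
prints counts like 64157302 instead of 64,157,302.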
I am trying to use Linux perf to profile the L3 cache bandwidth for a Python script. I see that there are no available commands to measure that directly, but I know how to get the LLC performance counters using the command below. Can anyone let me know how to calculate the L3 cache bandwidth using the perf counters, or refer me to any tools that are available to measure the L3 cache bandwidth? Thanks in advance for the help.
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches python hello.py
perf stat has some named "metrics" that it knows how to calculate from other things. According to perf list on my system, those include L3_Cache_Access_BW and L3_Cache_Fill_BW.
L3_Cache_Access_BW
[Average per-core data access bandwidth to the L3 cache [GB / sec]]
L3_Cache_Fill_BW
[Average per-core data fill bandwidth to the L3 cache [GB / sec]]
This is from my system with a Skylake (i7-6700k). Other CPUs (especially from other vendors and architectures) might have different support for it, or IDK might not support these metrics at all.
I tried it out for a simplistic sieve of Eratosthenes (using a bool array, not a bitmap), from a recent codereview question since I had a benchmarkable version of that (with a repeat loop) lying around. It measured 52 GB/s total bandwidth (read+write I think).
The n=4000000 problem size I used thus consumes 4 MB total, which is larger than the 256 KiB L2 size but smaller than the 8 MiB L3 size.
$ echo 4000000 |
taskset -c 3 perf stat --all-user -M L3_Cache_Access_BW -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions ./sieve
Performance counter stats for './sieve-buggy':
7,711,201,973 offcore_requests.all_requests # 816.916 M/sec
# 52.27 L3_Cache_Access_BW
9,441,504,472 ns duration_time # 1.000 G/sec
9,439.41 msec task-clock # 1.000 CPUs utilized
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
1,020 page-faults # 108.058 /sec
38,736,147,765 cycles # 4.104 GHz
53,699,139,784 instructions # 1.39 insn per cycle
9.441504472 seconds time elapsed
9.432262000 seconds user
0.000000000 seconds sys
Or with just -M L3_Cache_Access_BW and no -e events, it just shows offcore_requests.all_requests # 54.52 L3_Cache_Access_BW and duration_time. So it overrides the default and doesn't count cycles,instructions and so on.
I think it's just counting all off-core requests by this core, assuming (correctly) that each one involves a 64-byte transfer. It's counted whether it hits or misses in L3 cache. Getting mostly L3 hits will obviously enable a higher bandwidth than if the uncore bottlenecks on the DRAM controllers instead.
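You can sanity-check that against the output above (64 bytes per request over the elapsed time):
$ echo '7711201973 * 64 / 9.441504472 / 10^9' | bc -l
52.2710...
which matches the 52.27 L3_Cache_Access_BW figure perf printed.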
I am trying to understand the multiplexing and scaling of the "cycles" event in the perf output.
The following is the output of perf tool:
144094.487583 task-clock (msec) # 1.017 CPUs utilized
539912613776 instructions # 1.09 insn per cycle (83.42%)
496622866196 cycles # 3.447 GHz (83.48%)
340952514 cache-misses # 10.354 % of all cache refs (83.32%)
3292972064 cache-references # 22.854 M/sec (83.26%)
144081.898558 cpu-clock (msec) # 1.017 CPUs utilized
4189372 page-faults # 0.029 M/sec
0 major-faults # 0.000 K/sec
4189372 minor-faults # 0.029 M/sec
8614431755 L1-dcache-load-misses # 5.52% of all L1-dcache hits (83.28%)
156079653667 L1-dcache-loads # 1083.223 M/sec (66.77%)
141.622640316 seconds time elapsed
I understand that the kernel uses multiplexing to give each event a chance to access the hardware, and hence the final output is an estimate.
The "cycles" event shows (83.48%). I am trying to understand how this number was derived.
I am running perf on an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz.
Peter Cordes' answer is on the right track.
PMU events are quite complicated: the number of counters is limited, some events are special, some logical events may be composed of multiple hardware events, and there may even be conflicts between events.
I believe Linux isn't aware of these limitations; it just tries to activate events - to be more precise, event groups - from the list. It stops if it cannot activate all of them, and then it activates multiplexing. Whenever the multiplexing timer expires, it rotates the list of events, effectively starting the activation with the second one, then the third, and so on. Linux doesn't know that it could still activate the cycles event because it's special.
There is a hardly documented option to pin certain events to give them priority, by adding :D after the name. Example on my system:
$ perf stat -e cycles -e instructions -e cache-misses -e cache-references -e L1-dcache-load-misses -e L1-dcache-loads ...
119.444.297.774 cycles:u (55,88%)
130.133.371.858 instructions:u # 1,09 insn per cycle (67,81%)
38.277.984 cache-misses:u # 7,780 % of all cache refs (72,92%)
491.979.655 cache-references:u (77,00%)
3.892.617.942 L1-dcache-load-misses:u # 15,57% of all L1-dcache hits (82,19%)
25.004.563.072 L1-dcache-loads:u (43,85%)
Pinning instructions and cycles:
$ perf stat -e cycles:D -e instructions:D -e cache-misses -e cache-references -e L1-dcache-load-misses -e L1-dcache-loads ...
120.683.697.083 cycles:Du
132.185.743.504 instructions:Du # 1,10 insn per cycle
27.917.126 cache-misses:u # 4,874 % of all cache refs (61,14%)
572.718.930 cache-references:u (71,05%)
3.942.313.927 L1-dcache-load-misses:u # 15,39% of all L1-dcache hits (80,38%)
25.613.635.647 L1-dcache-loads:u (51,37%)
This results in the same multiplexing as omitting cycles and instructions entirely:
$ perf stat -e cache-misses -e cache-references -e L1-dcache-load-misses -e L1-dcache-loads ...
35.333.318 cache-misses:u # 7,212 % of all cache refs (62,44%)
489.922.212 cache-references:u (73,87%)
3.990.504.529 L1-dcache-load-misses:u # 15,40% of all L1-dcache hits (84,99%)
25.918.321.845 L1-dcache-loads:u
Note that you can also group events (-e \{event1,event2\}), which means the events are always read together - or not at all, if the combination cannot be activated together.
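For example (a sketch reusing the events from above; quoting the braces instead of backslash-escaping them also works):
$ perf stat -e '{cache-misses,cache-references}' -e '{L1-dcache-load-misses,L1-dcache-loads}' ...
Each group is scheduled onto the counters as a unit, so the ratios within a group are computed from the same time slices.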
1: There is an exception for software events that can always be added. The relevant parts of kernel code are in kernel/events/core.c.
IDK why there's any multiplexing at all for cycles or instructions, because there are dedicated counters for those 2 events on your CPU, which can't be programmed to count anything else.
But for the others, I'm pretty sure the percentages are in terms of the fraction of CPU time there was a hardware counter counting that event.
e.g. cache-references was counted for 83.26% of the 144094.487583 CPU-milliseconds your program was running for, or ~119973.07 ms. The total count is extrapolated from the time it was counting.
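Spelled out with bc, using the numbers from the output above:
$ echo '144094.487583 * 0.8326' | bc -l
119973.0703616058
perf then scales the raw count it observed during those ~119973 ms up by 1/0.8326 to produce the total it prints.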
So recently I learned about the perf command in Linux. I decided to run some experiments, so I created an empty C program and measured how many instructions it took to run:
echo 'int main(){}'>emptyprogram.c && gcc -O3 emptyprogram.c -o empty
perf stat ./empty
This was the output:
Performance counter stats for './empty':
0.341833 task-clock (msec) # 0.678 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
112 page-faults # 0.328 M/sec
1,187,561 cycles # 3.474 GHz
1,550,924 instructions # 1.31 insn per cycle
293,281 branches # 857.966 M/sec
4,942 branch-misses # 1.69% of all branches
0.000504121 seconds time elapsed
Why is it using so many instructions to run a program that does literally nothing? I thought that maybe this was some baseline number of instructions necessary to load a program into the OS, so I looked for a minimal executable written in assembly, and I found a 142-byte executable that outputs "Hi World" here (http://timelessname.com/elfbin/).
Running perf stat on the 142 byte hello executable, I get:
Hi World
Performance counter stats for './hello':
0.069185 task-clock (msec) # 0.203 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
3 page-faults # 0.043 M/sec
126,942 cycles # 1.835 GHz
116,492 instructions # 0.92 insn per cycle
15,585 branches # 225.266 M/sec
1,008 branch-misses # 6.47% of all branches
0.000340627 seconds time elapsed
This still seems a lot higher than I'd expect, but we can accept it as a baseline. In that case, why did running empty take 10x more instructions? What did those instructions do? And if they're some sort of overhead, why is there so much variation in overhead between a C program and the helloworld assembly program?
It's hardly fair to claim that it "does literally nothing". Yes, at the app level you chose to make the whole thing a giant no-op for your microbenchmark; that's fine. But no, down beneath the covers at the system level, it's hardly "nothing". You asked Linux to fork off a brand-new execution environment, initialize it, and connect it up. You called very few glibc functions, but dynamic linking is non-trivial, and after a million instructions your process was ready to demand-fault in printf() and friends, and to efficiently bring in libs you might have linked against or dlopen()'ed.
This is not the sort of microbench that implementors are likely to optimize against. What would be of interest is if you can identify "expensive" aspects of fork/exec that in some use cases are never used, and so might be #ifdef'd out (or have their execution short circuited) in very specific situations. Lazy evaluation of resolv.conf is one example of that, where the overhead is never paid by a process if it never interacts with IP servers.
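One way to see the dynamic-linking share directly (a sketch; exact counts depend on your glibc and toolchain) is to rebuild the same empty program statically and compare:
gcc -O3 -static emptyprogram.c -o empty_static
perf stat -e instructions:u ./empty_static
Since ld.so never runs for a static binary, expect the user-space instruction count to drop to a small fraction of the dynamically linked version's.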
Does anybody know the meaning of stalled-cycles-frontend and stalled-cycles-backend in perf stat results? I searched on the internet but did not find the answer. Thanks.
$ sudo perf stat ls
Performance counter stats for 'ls':
0.602144 task-clock # 0.762 CPUs utilized
0 context-switches # 0.000 K/sec
0 CPU-migrations # 0.000 K/sec
236 page-faults # 0.392 M/sec
768956 cycles # 1.277 GHz
962999 stalled-cycles-frontend # 125.23% frontend cycles idle
634360 stalled-cycles-backend # 82.50% backend cycles idle
890060 instructions # 1.16 insns per cycle
# 1.08 stalled cycles per insn
179378 branches # 297.899 M/sec
9362 branch-misses # 5.22% of all branches [48.33%]
0.000790562 seconds time elapsed
The theory:
Let's start from this: today's CPUs are superscalar, which means they can execute more than one instruction per cycle (IPC). The latest Intel architectures can go up to 4 IPC (4 x86 instruction decoders). Let's not bring macro/micro fusion into the discussion, to avoid complicating things further :).
Typically, workloads do not reach IPC=4 due to various resource contentions. This means that the CPU is wasting cycles (the number of instructions is given by the software, and the CPU has to execute them in as few cycles as possible).
We can divide the total cycles being spent by the CPU in 3 categories:
Cycles where instructions get retired (useful work)
Cycles being spent in the Back-End (wasted)
Cycles spent in the Front-End (wasted).
To get an IPC of 4, the number of cycles retiring has to be close to the total number of cycles. Keep in mind that at this stage, all the micro-operations (uOps) retire from the pipeline and commit their results into registers / caches. At this stage you can have even more than 4 uOps retiring, because this number is given by the number of execution ports. If you have only 25% of the cycles retiring 4 uOps, then you will have an overall IPC of 1.
The cycles stalled in the back-end are a waste because the CPU has to wait for resources (usually memory) or to finish long-latency instructions (e.g. transcendentals - sqrt, reciprocals, divisions, etc.).
The cycles stalled in the front-end are a waste because that means the front-end does not feed the back-end with micro-operations. This can mean that you have misses in the instruction cache, or complex instructions that are not already decoded in the micro-op cache. Just-in-time compiled code usually exhibits this behavior.
Another stall reason is a branch-prediction miss. That is called bad speculation. In that case uOps are issued, but they are discarded because the branch predictor predicted wrong.
The implementation in profilers:
How do you interpret the BE and FE stalled cycles?
Different profilers take different approaches to these metrics. In VTune, categories 1 to 3 add up to 100% of the cycles. That seems reasonable, because either your CPU is stalled (no uOps are retiring) or it is performing useful work (uOps retiring). See more here: https://software.intel.com/sites/products/documentation/doclib/stdxe/2013SP1/amplifierxe/snb/index.htm
In perf this usually does not happen. That's a problem, because when you see 125% of cycles stalled in the front-end, you don't know how to really interpret it. You could link the >1 ratio to the fact that there are 4 decoders, but if you continue that reasoning the IPC won't match.
Worse still, you don't know how big the problem is. 125% out of what? What does the cycle count mean then?
I am personally a bit suspicious of perf's BE and FE stalled-cycle counts and hope this will get fixed.
We will probably get the final answer by debugging the code here: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/perf/builtin-stat.c
To convert the generic events exported by perf into the raw events documented for your CPU, you can run:
more /sys/bus/event_source/devices/cpu/events/stalled-cycles-frontend
It will show you something like
event=0x0e,umask=0x01,inv,cmask=0x01
According to the Intel documentation SDM volume 3B (I have a core i5-2520):
UOPS_ISSUED.ANY:
Increments each cycle the # of Uops issued by the RAT to RS.
Set Cmask = 1, Inv = 1, Any= 1 to count stalled cycles of this core.
For the stalled-cycles-backend event, which translates to event=0xb1,umask=0x01 on my system, the same documentation says:
UOPS_DISPATCHED.THREAD:
Counts total number of uops to be dispatched per- thread each cycle
Set Cmask = 1, INV =1 to count stall cycles.
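Either encoding can be fed straight back to perf as a raw event (a sketch; the cpu/.../ syntax needs a reasonably recent perf, and ./my_app is a placeholder):
perf stat -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ ./my_app
This should count the same thing as the generic stalled-cycles-frontend alias.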
Usually, stalled cycles are cycles where the processor is waiting for something (for example, for memory to be fed in after a load operation) and doesn't have anything else to do. Moreover, the front-end part of the CPU is the piece of hardware responsible for fetching and decoding instructions (converting them to uOps), whereas the back-end part is responsible for effectively executing the uOps.
A CPU cycle is “stalled” when the pipeline doesn't advance during it.
The processor pipeline is composed of many stages: the front-end is the group of stages responsible for the fetch and decode phases, while the back-end executes the instructions. There is a buffer between the front-end and the back-end, so when the former is stalled the latter can still have some work to do.
Taken from http://paolobernardi.wordpress.com/2012/08/07/playing-around-with-perf/
According to the author of these events, they are defined loosely and are approximated by the available CPU performance counters. As far as I know, perf doesn't support formulas to calculate a synthetic event based on several hardware events, so it can't use the front-end/back-end stall-bound method from Intel's Optimization Manual (implemented in VTune): http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf "B.3.2 Hierarchical Top-Down Performance Characterization Methodology"
%FE_Bound = 100 * (IDQ_UOPS_NOT_DELIVERED.CORE / N );
%Bad_Speculation = 100 * ( (UOPS_ISSUED.ANY – UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES ) / N) ;
%Retiring = 100 * ( UOPS_RETIRED.RETIRE_SLOTS/ N) ;
%BE_Bound = 100 * (1 – (FE_Bound + Retiring + Bad_Speculation) ) ;
N = 4 * CPU_CLK_UNHALTED.THREAD (for Sandy Bridge)
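To try these formulas by hand (a sketch; the lower-case names are how perf spells these events on Sandy Bridge-era CPUs and may differ or be absent on other machines, and ./my_app is a placeholder):
perf stat -e uops_issued.any,uops_retired.retire_slots,idq_uops_not_delivered.core,int_misc.recovery_cycles,cpu_clk_unhalted.thread ./my_app
Then compute N = 4 * cpu_clk_unhalted.thread and plug the five counts into the four formulas above.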
Right formulas can be used with some external scripting, like it was done in Andi Kleen's pmu-tools (toplev.py): https://github.com/andikleen/pmu-tools (source), http://halobates.de/blog/p/262 (description):
% toplev.py -d -l2 numademo 100M stream
...
perf stat --log-fd 4 -x, -e
{r3079,r19c,r10401c3,r100030d,rc5,r10e,cycles,r400019c,r2c2,instructions}
{r15e,r60006a3,r30001b1,r40004a3,r8a2,r10001b1,cycles}
numademo 100M stream
...
BE Backend Bound: 72.03%
This category reflects slots where no uops are being delivered due to a lack
of required resources for accepting more uops in the Backend of the pipeline.
.....
FE Frontend Bound: 54.07%
This category reflects slots where the Frontend of the processor undersupplies
its Backend.
The commit that introduced the stalled-cycles-frontend and stalled-cycles-backend events in place of the original universal stalled-cycles event:
http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=8f62242246351b5a4bc0c1f00c0c7003edea128a
author Ingo Molnar <mingo@el...> 2011-04-29 11:19:47 (GMT)
committer Ingo Molnar <mingo@el...> 2011-04-29 12:23:58 (GMT)
commit 8f62242246351b5a4bc0c1f00c0c7003edea128a (patch)
tree 9021c99956e0f9dc64655aaa4309c0f0fdb055c9
parent ede70290046043b2638204cab55e26ea1d0c6cd9 (diff)
perf events: Add generic front-end and back-end stalled cycle event definitions
Add two generic hardware events: front-end and back-end stalled cycles.
These events measure conditions when the CPU is executing code but its
capabilities are not fully utilized. Understanding such situations and
analyzing them is an important sub-task of code optimization workflows.
Both events limit performance: most front end stalls tend to be caused
by branch misprediction or instruction fetch cachemisses, backend
stalls can be caused by various resource shortages or inefficient
instruction scheduling.
Front-end stalls are the more important ones: code cannot run fast
if the instruction stream is not being kept up.
An over-utilized back-end can cause front-end stalls and thus
has to be kept an eye on as well.
The exact composition is very program logic and instruction mix
dependent.
We use the terms 'stall', 'front-end' and 'back-end' loosely and
try to use the best available events from specific CPUs that
approximate these concepts.
Cc: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Frederic Weisbecker
Link: http://lkml.kernel.org/n/tip-7y40wib8n000io7hjpn1dsrm@git.kernel.org
Signed-off-by: Ingo Molnar
/* Install the stalled-cycles event: UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 */
- intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES] = 0x1803fb1;
+ intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x1803fb1;
- PERF_COUNT_HW_STALLED_CYCLES = 7,
+ PERF_COUNT_HW_STALLED_CYCLES_FRONTEND = 7,
+ PERF_COUNT_HW_STALLED_CYCLES_BACKEND = 8,