Major Perf and PIN profiling discrepancies - linux

To analyze certain attributes of execution times, I was going to use both Perf and PIN in separate executions of a program to get all of my information. PIN would give me instruction mixes, and Perf would give me hardware performance counts on those mixes. As a sanity check, I profiled the following command:
g++ hello_world.cpp -o hello
So my complete command line inputs were the following:
perf stat -e cycles -e instructions g++ hello_world.cpp -o hello
pin -t icount.so -- g++ hello_world.cpp -o hello
In the PIN command, I've left out the full file paths for the sake of this post. Additionally, I altered the basic icount.so to record instruction mixes in addition to the default dynamic instruction count.
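For reference, this is a minimal sketch of the kind of modification I made (simplified for this post; it tallies executed instructions per XED category via Pin's INS_Category()):

    #include "pin.H"
    #include <iostream>

    // Total dynamic instruction count plus one counter per XED category.
    static UINT64 icount = 0;
    static UINT64 catCount[XED_CATEGORY_LAST] = {0};

    // Analysis routine: runs every time an instrumented instruction executes.
    VOID DoCount(UINT32 category) {
        icount++;
        catCount[category]++;
    }

    // Instrumentation routine: called once per static instruction.
    VOID Instruction(INS ins, VOID *v) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)DoCount,
                       IARG_UINT32, (UINT32)INS_Category(ins), IARG_END);
    }

    // Print "Count N" and "<num>-><CATEGORY>: <count>" lines at exit.
    VOID Fini(INT32 code, VOID *v) {
        std::cerr << "Count " << icount << std::endl;
        for (UINT32 c = 0; c < XED_CATEGORY_LAST; c++)
            if (catCount[c])
                std::cerr << c << "->" << CATEGORY_StringShort(c)
                          << ": " << catCount[c] << std::endl;
    }

    int main(int argc, char *argv[]) {
        if (PIN_Init(argc, argv)) return 1;
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();  // never returns
        return 0;
    }

The results were astonishingly different.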
PIN Results:
Count 1180608
14->COND_BR: 295371
49->UNCOND_BR: 21869
//skipping all of the other instruction types for now
Perf Results:
20,538,346 branches
105,662,160 instructions # 0.00 insns per cycle
0.072352035 seconds time elapsed
This was supposed to serve as a sanity check: I expected roughly the same instruction counts and roughly the same branch distributions. Why would the dynamic instruction counts be off by a factor of ~100?! I was expecting some noise, but that's a bit much.
Also, branches are about 19% of instructions for Perf (20,538,346 / 105,662,160 ≈ 0.19), while PIN reports around 27% ((295,371 + 21,869) / 1,180,608 ≈ 0.27). That also seems like a tad wide of a discrepancy, but it's probably just a side effect of the massive instruction count distortion.

There are significant differences between what's counted by the icount pintool and the instructions performance event, which is mapped to the architectural Instructions Retired hardware performance event on modern Intel processors. I assume you're on an Intel processor.
1. Pin is only injected into child processes when the -follow_execv command-line option is specified and, if the pintool registered a callback function to intercept process creation, the callback returned TRUE. perf, on the other hand, profiles all child processes by default. You can tell perf to profile only the specified process using the -i (--no-inherit) option.
2. perf, by default, counts events that occur in both user mode and kernel mode (the latter only if /proc/sys/kernel/perf_event_paranoid is smaller than 2). Pin only supports instrumenting user-mode code.
3. The icount pintool counts at basic-block granularity; a basic block is essentially a short, single-entry, single-exit sequence of instructions. If an instruction in the block raises an exception, the rest of the instructions in the block are not executed, but they have already been counted (and an exception may be handled without terminating the program). The instructions event only counts instructions at retirement. (A sketch of block-granularity counting follows this list of reasons.)
4. The icount pintool, by default, counts each iteration of a rep-prefixed instruction as one instruction, while the instructions event counts a rep-prefixed instruction as a single instruction irrespective of the number of iterations (see the small demonstration below).
5. On some processors, the instructions event may overcount or undercount.
The instructions event count may be larger due to the first two reasons; the icount pintool's count may be larger due to the next two; the last reason can cause unpredictable discrepancies in either direction. Since the perf count is about 100x larger than the icount count, the first two factors are clearly dominant in this case.
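To illustrate point 3, block-granularity counting in a pintool looks roughly like this (this mirrors the stock icount; the names are from the Pin API):

    // Credit all of a basic block's instructions with a single analysis
    // call, before the block executes; if an exception cuts the block
    // short, the skipped instructions have still been counted.
    VOID AddCount(UINT32 numIns) { icount += numIns; }

    VOID Trace(TRACE trace, VOID *v) {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
            BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)AddCount,
                           IARG_UINT32, BBL_NumIns(bbl), IARG_END);
    }

Point 4 can be demonstrated with a toy program (my own example, assuming x86-64 and GCC/Clang inline assembly): the single rep movsb below is one instruction to the instructions event, but about 4096 to icount's default mode:

    static char src[4096], dst[4096];

    int main() {
        char *s = src, *d = dst;
        unsigned long n = sizeof src;
        // One rep-prefixed instruction that iterates 4096 times.
        asm volatile("rep movsb" : "+S"(s), "+D"(d), "+c"(n) : : "memory");
        return 0;
    }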
You can get the two tools to produce much closer counts by passing -i to perf to not profile children, adding the :u modifier to the instructions event name to count only in user mode, and passing -reps 1 to the pintool to count rep-prefixed instructions per instruction rather than per iteration:
perf stat -i -e cycles,instructions:u g++ hello_world.cpp -o hello
pin -t icount.so -reps 1 -- g++ hello_world.cpp -o hello
Instead of passing -i to perf, you can pass -follow_execv to pin as follows:
pin -follow_execv -t icount.so -reps 1 -- g++ hello_world.cpp -o hello
In this way, both tools will profile the entire process hierarchy rooted at the specified process. (g++ is a driver that spawns the real work in child processes such as cc1plus, as, and ld, which is why the child-process difference dominates here.)
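If your pintool needs to decide per child whether to follow it, the hook looks roughly like this (a minimal sketch; PIN_AddFollowChildProcessFunction is the relevant Pin API call):

    // Called before each exec'd child when -follow_execv is in effect;
    // returning TRUE injects Pin (and the tool) into that child.
    BOOL FollowChild(CHILD_PROCESS child, VOID *v) {
        return TRUE;
    }

    // In main(), after PIN_Init():
    //     PIN_AddFollowChildProcessFunction(FollowChild, 0);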
I expect the counts to be very close with these measures, but they still won't be identical.


Weird Backtrace in Perf

I used the following command to extract backtraces leading to user level L3-misses in a simple evince benchmark:
sudo perf record -d --call-graph dwarf -c 10000 -e mem_load_uops_retired.l3_miss:uppp /opt/evince-3.28.4/bin/evince
As is clear, the sampling period is quite large (10,000 events between consecutive samples). For this experiment, the output of perf script contained some samples similar to this one:
EvJobScheduler 27529 26441.375932: 10000 mem_load_uops_retired.l3_miss:uppp: 7fffcd5d8ec0 5080022 N/A|SNP N/A|TLB N/A|LCK N/A
7ffff17bec7f bits_image_fetch_separable_convolution_affine+0x2df (inlined)
7ffff17bec7f bits_image_fetch_separable_convolution_affine_pad_x8r8g8b8+0x2df (/usr/lib/x86_64-linux-gnu/libpixman-1.so.0.34.0)
7ffff17d1fd1 general_composite_rect+0x301 (/usr/lib/x86_64-linux-gnu/libpixman-1.so.0.34.0)
ffffffffffffffff [unknown] ([unknown])
At the bottom of the backtrace, there is a symbol called [unknown], which seems OK. But the frame right above it is already inside general_composite_rect(), with nothing showing how it was reached. Is this backtrace OK?
AFAIK, the outermost caller in the backtrace should be something like _start() or __GI___clone(), but the backtrace is not in this form. What is wrong?
Is there any way to resolve the issue? Are the truncated (parts of) backtraces reliable?
TL;DR: perf's backtracing may stop at some function if there is no frame pointer saved on the stack (for the fp method) or no CFI tables (for the dwarf method). Recompile libraries with -fno-omit-frame-pointer or with -g, or install their debuginfo. With release binaries and libraries, perf will often stop the backtrace early, with no chance to reach main(), _start, or clone()/start_thread() at the top.
perf is Linux's statistical sampling profiler (it uses no binary instrumentation): it programs a software timer, a software event source, or the hardware performance monitoring unit (PMU) to generate periodic interrupts. In your example, -c 10000 -e mem_load_uops_retired.l3_miss:uppp selects the x86-64 hardware PMU in a kind of PEBS mode (https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR) to generate an interrupt after every 10,000 mem_load_uops_retired events (with the l3_miss mask). Each interrupt is handled by the Linux kernel (the perf_events subsystem, kernel/events and arch/x86/events). In this handler the PMU is reset (reprogrammed) to generate the next interrupt after 10,000 more events, and a sample is generated. The sample data is saved into the perf.data file by the perf record command (every wakeup of the tool can save thousands of samples); the samples can then be read with perf script or perf script -D.
The perf_events interrupt handler, somewhere near __perf_event_overflow in kernel/events/core.c, has full access to the registers of the interrupted code, and has some time to do additional data retrieval to record the current time, pid, etc. Part of that process is collecting the call stack (https://en.wikipedia.org/wiki/Call_stack). But on x86_64 with -fomit-frame-pointer (often enabled for many system libraries of Debian/Ubuntu/others), there is no default place in the registers or in the function stack where frame pointers are stored:
https://gcc.gnu.org/onlinedocs/gcc-4.6.4/gcc/Optimize-Options.html#index-fomit_002dframe_002dpointer-692
-fomit-frame-pointer
Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions. It also makes debugging impossible on some machines.
Starting with GCC version 4.6, the default setting (when not optimizing for size) for 32-bit Linux x86 and 32-bit Darwin x86 targets has been changed to -fomit-frame-pointer. The default can be reverted to -fno-omit-frame-pointer by configuring GCC with the --enable-frame-pointer configure option.
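To make the "fp" method concrete: when frames are not omitted, each x86-64 function prologue executes push %rbp; mov %rsp,%rbp, so the saved %rbp values form a linked list through the stack. A minimal sketch of walking that list (my own illustration, assuming GCC/Clang on x86-64 and compilation with -fno-omit-frame-pointer; real unwinders add many validity checks):

    #include <cstdio>

    // Frame layout with frame pointers kept: at [rbp] lies the caller's
    // saved rbp, at [rbp+8] the return address pushed by "call".
    struct Frame { Frame *next; void *ret; };

    __attribute__((noinline)) void backtrace_fp() {
        Frame *f;
        asm("mov %%rbp, %0" : "=r"(f));   // current frame
        while (f && f->ret) {             // glibc's _start zeroes rbp
            std::printf("%p\n", f->ret);
            f = f->next;                  // hop to the caller's frame
        }
    }

    int main() { backtrace_fp(); return 0; }

If any function in the chain was built without a frame pointer, its rbp holds arbitrary data, and a walk like this stops or derails at exactly that frame.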
With frame pointers saved on the function stack, backtracing/unwinding is easy. But modern gcc (and other compilers) may not generate a frame pointer for some functions, so backtracing code like that in the perf_events handler will either stop the backtrace at such a function or need another method of frame pointer recovery. The -g (--call-graph) option of perf record selects the method to be used. It is documented in man perf-record (http://man7.org/linux/man-pages/man1/perf-record.1.html):
--call-graph Setup and enable call-graph (stack chain/backtrace) recording, implies -g. Default is "fp".
Allows specifying "fp" (frame pointer) or "dwarf" (DWARF's CFI - Call Frame Information) or "lbr" (Hardware Last Branch Record facility) as the method to collect the information used to show the call graphs.
In some systems, where binaries are built with gcc -fomit-frame-pointer, using the "fp" method will produce bogus call graphs; "dwarf", if available (perf tools linked to the libunwind or libdw library), should be used instead. Using the "lbr" method doesn't require any compiler options. It will produce call graphs from the hardware LBR registers. The main limitation is that it is only available on new Intel platforms, such as Haswell. It can only get the user call chain. It doesn't work with branch stack sampling at the same time.
When "dwarf" recording is used, perf also records a (user) stack dump when sampled. The default size of the stack dump is 8192 bytes. The user can change the size by passing the size after a comma, like "--call-graph dwarf,4096".
So, the dwarf method reuses CFI tables to find stack frame sizes and locate the caller's stack frame. I'm not sure whether CFI tables are stripped from release libraries by default, but debuginfo packages will probably have them. LBR will not help because it is a rather short hardware buffer. Dwarf's split processing (the kernel handler saves part of the stack, and the perf user-space tool parses it with libdw/libunwind) may lose some parts of the call stack, so also try increasing the dwarf stack dump size with --call-graph dwarf,10240 or --call-graph dwarf,81920, etc.
Backtracing is implemented in the arch-dependent part of perf_events: arch/x86/events/core.c:perf_callchain_user(), called from kernel/events/callchain.c:get_perf_callchain() <- perf_callchain <- perf_prepare_sample <- __perf_event_output <- *(event->overflow_handler), i.e. the READ_ONCE(event->overflow_handler)(event, data, regs) call in __perf_event_overflow.
Brendan Gregg warned about incomplete call stacks in perf: http://www.brendangregg.com/blog/2014-06-22/perf-cpu-sample.html
Incomplete stacks usually mean -fomit-frame-pointer was used – a compiler optimization that makes little positive difference in the real world, but breaks stack profilers. Always compile with -fno-omit-frame-pointer. More recent perf has a -g dwarf option, to use the alternate libunwind/dwarf method for retrieving stacks.
I also wrote about backtraces in perf, with some additional links: How does linux's perf utility understand stack traces?
I had the same problem, and in my case it was this: when collecting traces with --call-graph dwarf, if the stack is too large, you get unknown frames in the backtrace.
The default maximum stack dump size is 8 kB, but it can be increased like this: --call-graph dwarf,16578. Unfortunately, perf has some other problems when you increase the dump size. In my case, the solution was to get rid of a large stack-allocated array by allocating it on the heap.
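As an illustration of that last fix (the names and sizes here are made up, not from my original program): a large stack buffer pushes the interesting frames beyond perf's fixed-size stack dump, while a heap allocation keeps the sampled stack small:

    #include <vector>

    // Before: 64 KiB on the stack. The dwarf stack dump (8 kB by default)
    // no longer reaches the callers' frames, so perf prints [unknown].
    void process_before() {
        char buf[64 * 1024];
        buf[0] = 0;  // touch the buffer so it isn't optimized away
    }

    // After: the buffer lives on the heap; only pointer-sized locals stay
    // on the stack, so the dump covers the whole call chain.
    void process_after() {
        std::vector<char> buf(64 * 1024);
        buf[0] = 0;
    }

    int main() {
        process_before();
        process_after();
        return 0;
    }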

linux perf report inconsistent behavior

I have an application I'm profiling using perf and I find the results when using perf report are not consistent, and I can't discern the pattern.
I start the application and profile it by pid for 60 seconds:
perf record -p <pid> -o <file> sleep 60
And when I pull the results in with perf report -i <file>, sometimes I see a "+" in the far left column that allows me to drill down into the function call trees when I press ENTER, and sometimes that "+" is not there. It seems to be dependent on some property of the recorded file, in that I have a collection of recorded files, some which allow this drill down and some which do not.
Any suggestions on how to get consistent behavior here would be appreciated.
The default event being measured by perf record is cpu-cycles.
(Or depending on the machine, sometimes cpu-cycles:p or cpu-cycles:pp)
Are you sure your application is not sleeping a lot? Does it consume a lot of cpu cycles?
Try a perf measurement on something that stresses the CPU by doing a lot of computations:
$ apt-get install stress
$ perf record -e cpu-cycles --call-graph fp stress --cpu 1 --timeout 5
$ perf report
Subsequent runs should then show more or less similar results.
If your program is CPU intensive but the call stacks still differ a lot between runs, you may want to look at the --call-graph option, as perf can record call graphs with different methods:
fp (frame pointer)
lbr (last branch record)
dwarf
Maybe different methods give better results.
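For example (substitute your own pid and output files; the lbr method needs a recent Intel CPU), you can record the same workload with each method and compare the reports:

perf record -p <pid> --call-graph fp -o perf.fp.data sleep 60
perf record -p <pid> --call-graph dwarf -o perf.dwarf.data sleep 60
perf record -p <pid> --call-graph lbr -o perf.lbr.data sleep 60
perf report -i perf.dwarf.data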

Using /proc/*/stat for profiling

On Linux, a process' (main thread's) last program-counter value is presented in /proc/$PID/stat. This seems to be a really simple and easy way to do some sampled profiling without having to instrument a program in any way whatsoever.
I'm wondering if this has any caveats when it comes to sampling quality, however. I'm assuming this value is updated whenever the process is descheduled at the end of its timeslice, which should happen at effectively random points in the program code, so that samples taken at intervals longer than a timeslice are distributed according to where the program actually spends its time. But that's just an assumption, and I realize it could be wrong in any number of ways.
Does anyone know?
Why not try the modern built-in Linux tool perf (https://perf.wiki.kernel.org/index.php/Main_Page)?
It has a record mode with adjustable frequency (-F 100 for 100 Hz) and many events; for example, the software event task-clock works without using hardware performance counters. (Stop perf with Ctrl-C, or append sleep 10 to the command to sample for 10 seconds.)
perf record -p $PID -e task-clock -o perf.output.file
Perf works for all threads without any instrumentation (no recompilation or code editing) and will not interfere with program execution (only the timer interrupt is slightly modified). (There is also some support for stack-trace sampling with the -g option.)
The output can be parsed offline with perf report (only this command will try to parse the binary and shared libraries):
perf report -i perf.output.file
or converted to raw PC (EIP) samples with perf script -i perf.output.file.
PS: The EIP pointer in the /proc/$pid/stat file is mentioned in the official Linux man page proc(5) (http://man7.org/linux/man-pages/man5/proc.5.html) as kstkeip - "The current EIP (instruction pointer)." It is read at fs/proc/array.c:do_task_stat as eip = KSTK_EIP(task);, but I'm not sure where and when it is filled. It may be written on a task switch (both involuntary, when the timeslice ends, and voluntary, when the task does something like sched_yield) or on blocking syscalls, so it is probably not the best choice of sampling source.
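If you still want to experiment with /proc/$pid/stat sampling, here is a minimal sketch of such a sampler (my own illustration, not a recommended tool; note that modern kernels report kstkeip as 0 unless the reader has ptrace access to the target):

    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <thread>

    int main(int argc, char **argv) {
        if (argc < 2) { std::cerr << "usage: sampler <pid>\n"; return 1; }
        const std::string path = std::string("/proc/") + argv[1] + "/stat";
        std::map<unsigned long long, unsigned> hist;  // PC -> sample count
        for (int i = 0; i < 1000; i++) {              // ~10 s at 10 ms/sample
            std::ifstream f(path);
            std::string line;
            if (!std::getline(f, line)) break;        // process exited
            // comm (field 2) may contain spaces: skip past the last ')'.
            std::istringstream ss(line.substr(line.rfind(')') + 2));
            std::string field;
            for (int n = 3; ss >> field; n++)
                if (n == 30) {                        // kstkeip, see proc(5)
                    hist[std::stoull(field)]++;
                    break;
                }
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
        for (const auto &kv : hist)
            std::cout << std::hex << kv.first << std::dec << "  " << kv.second << "\n";
        return 0;
    }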
If it works, which it could, it will have the shortcomings of prof, which gprof was supposed to remedy. gprof in turn has its own shortcomings, which have led to numerous more modern profilers. Some of us consider stack sampling to be the most effective, and it can be accomplished with a tool as simple as pstack or lsstack.

Using perf to record a profile that includes sleep/blocked times

I want to get a sampling profile of my program that includes blocked time (waiting for a network service) as well as CPU time.
perf's default profiling mode (perf record -F 99 -g -- ./binary) samples on-CPU time, so it doesn't give a clear indication of how much time my program spends in which of its parts: it's skewed toward CPU-intensive parts and doesn't show IO-intensive parts at all. The sleep-time profiling mode (related question on SO) shows sleep times but no general profile.
What I'd like is something really simple: record a call stack of my program every 10ms, no matter whether it's running or currently blocked. Then make a flamegraph out of that.

Benchmark a linux Bash script

Is there a way to benchmark a bash script's performance? The script downloads a remote file and then makes calls to multiple command-line programs to manipulate it. I would like to know (as much of this as possible):
Total time
Time spent downloading
Time spent on each command called
(I think these could be wrapped in "time" calls, right?)
Average download speed
uses wget
Total Memory used
Total CPU usage
CPU usage per command called
I'm able to edit the bash script to insert any benchmark commands needed at specific points (i.e., between app calls). I'm not sure if some "top" ninja-ry could solve this or not; I wasn't able to find anything useful (at least to my limited understanding) in the man page.
I will be running the benchmarks in the OSX Terminal as well as on Ubuntu (if either matters).
strace -o trace -c -Ttt ./script
-c counts the time, calls, and errors for each system call and reports a summary.
-T shows the time spent in each system call, and -tt prints the time of each call with microsecond precision.
-o saves the output in the file "trace".
You should be able to achieve this in a number of ways. One way is to use the time builtin for each command of interest and capture the results; you may have to be careful about any pipes and redirects.
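For instance (the URL and the manipulation command are placeholders for whatever the script actually runs):

time wget http://example.com/file.dat
time convert file.dat out.png

Each time call prints real (wall clock), user, and sys times for its command; the real time of the wget step is the time spent downloading, and the downloaded size divided by that time gives the average download speed.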
You may also consider trapping the SIGCHLD signal and the DEBUG, RETURN, ERR and EXIT bash traps and putting timing information in the handlers, though you may miss some results.
The concept of CPU usage per command won't give you anything useful; most commands use 100% of a CPU while they run. Memory usage is something you can pull out, but you should look at a dedicated tool for it.
If you want to get deep process statistics then you would want to use strace; see the strace(1) man page for details. I doubt that -Ttt, as suggested elsewhere, is that useful: all it tells you is system call times, and you want other process trace info.
You may also want to look at the ltrace and dstat tools.
A similar question is answered here: Linux benchmarking tools
