linux perf report inconsistent behavior

I have an application I'm profiling using perf and I find the results when using perf report are not consistent, and I can't discern the pattern.
I start the application and profile it by pid for 60 seconds:
perf record -p <pid> -o <file> sleep 60
And when I pull the results in with perf report -i <file>, sometimes I see a "+" in the far left column that allows me to drill down into the function call trees when I press ENTER, and sometimes that "+" is not there. It seems to be dependent on some property of the recorded file, in that I have a collection of recorded files, some which allow this drill down and some which do not.
Any suggestions on how to get consistent behavior here would be appreciated.

The default event being measured by perf record is cpu-cycles.
(Or depending on the machine, sometimes cpu-cycles:p or cpu-cycles:pp)
Are you sure your application is not sleeping a lot? Does it consume a lot of cpu cycles?
Try a perf measurement on something that stresses the CPU by doing a lot of computations:
$ apt-get install stress
$ perf record -e cpu-cycles --call-graph fp stress --cpu 1 --timeout 5
$ perf report
Subsequent runs should then show more or less similar results.
If your program is CPU-intensive and the call stacks still differ a lot between runs, look at the --call-graph option, as perf can record call graphs with different methods:
fp (frame pointer)
lbr (last branch record)
dwarf
Different methods can give better results depending on how your program was built: fp needs frame pointers preserved (e.g. compiled with -fno-omit-frame-pointer), dwarf needs debug info, and lbr needs hardware support.

Related

Using /proc/*/stat for profiling

On Linux, a process' (main thread's) last program-counter value is presented in /proc/$PID/stat. This seems to be a really simple and easy way to do some sampled profiling without having to instrument a program in any way whatsoever.
I'm wondering if this has any caveats when it comes to the sampling quality, however. I'm assuming this value is updated whenever the process runs out of its timeslice, which should happen at completely random intervals in the program code, and that samples taken at more than time-slice length should be uniformly randomly distributed according to where the program actually spends its time. But that's just an assumption, and I realize it could be wrong in any number of ways.
Does anyone know?
Why not try a modern built-in Linux tool like perf (https://perf.wiki.kernel.org/index.php/Main_Page)?
It has a record mode with adjustable sampling frequency (-F 100 for 100 Hz) and many events; for example, the software event task-clock works without hardware performance counters. Stop perf with Ctrl-C, or append sleep 10 to the command to sample for 10 seconds:
perf record -p $PID -e task-clock -o perf.output.file
Perf works for all threads without any instrumentation (no recompilation or code editing) and will barely interfere with program execution (only the timer interrupt handling is slightly modified). (There is also some support for stack-trace sampling with the -g option.)
The output can be parsed offline with perf report (only this command needs to parse the binary and shared libraries)
perf report -i perf.output.file
or converted to raw PC (EIP) samples with perf script -i perf.output.file.
PS: The EIP pointer in the /proc/$pid/stat file is mentioned in the official Linux man page 5 proc (http://man7.org/linux/man-pages/man5/proc.5.html) as kstkeip - "The current EIP (instruction pointer)." It is read at fs/proc/array.c:do_task_stat as eip = KSTK_EIP(task);, but I'm not sure where and when it is filled. It may be written on a task switch (both involuntary, when the timeslice ends, and voluntary, when the task does something like sched_yield) or on blocking syscalls, so it is probably not the best choice as a sampling source.
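For illustration only (a minimal sketch, not part of the original answer): kstkeip is field 30 of /proc/&lt;pid&gt;/stat, and since comm (field 2) can itself contain spaces and parentheses, the line has to be split after the last ')' before counting fields:

```python
def kstkeip(stat_line: str) -> int:
    """Extract kstkeip (field 30) from a /proc/<pid>/stat line.

    comm (field 2) may contain spaces or parentheses, so split
    the line after the last ')' before counting fields.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 30 is rest[27]
    return int(rest[27])

def sample_eip(pid: int) -> int:
    # One sample; call this in a loop (slower than the timeslice)
    # to accumulate a crude PC histogram.
    with open(f"/proc/{pid}/stat") as f:
        return kstkeip(f.read())
```

Note the caveat above still applies: the value may only be refreshed at context switches, so the "samples" are biased toward blocking points.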
If it works, which it could, it will have the shortcomings of prof, which gprof was supposed to remedy. Then gprof has its own shortcomings, which have led to numerous more modern profilers. Some of us consider random whole-stack sampling to be the most effective approach, and it can be done with a tool as simple as pstack or lsstack.

Using perf to record a profile that includes sleep/blocked times

I want to get a sampling profile of my program that includes blocked time (waiting for a network service) as well as CPU time.
perf's default profiling mode (perf record -F 99 -g -- ./binary) samples on-CPU time, so it doesn't give a clear picture of how much wall-clock time my program spends in what parts: it's skewed toward CPU-intensive parts and doesn't show IO-intensive parts at all. The sleep-time profiling mode (covered in a related SO answer) shows sleep times but no general profile.
What I'd like is something really simple: record a call stack of my program every 10ms, no matter whether it's running or currently blocked. Then make a flamegraph out of that.
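perf's sampling events don't fire while a task is blocked, but the wall-clock sampling idea being asked for is easy to demonstrate outside perf. As a hypothetical illustration (Python-only, sampling its own process, not a perf replacement): an ITIMER_REAL timer fires on wall-clock time, so a stack is captured every 10 ms whether the program is running or sleeping:

```python
import collections
import signal
import time
import traceback

samples = collections.Counter()

def on_tick(signum, frame):
    # ITIMER_REAL fires on wall-clock time, so we also sample
    # while the process is blocked (e.g. in time.sleep or I/O).
    stack = ";".join(f.name for f in traceback.extract_stack(frame))
    samples[stack] += 1

def profile(func, interval=0.01):
    signal.signal(signal.SIGALRM, on_tick)
    signal.setitimer(signal.ITIMER_REAL, interval, interval)
    try:
        func()
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)

def workload():
    time.sleep(0.2)   # blocked time still shows up in the samples

profile(workload)
for stack, count in samples.most_common(3):
    print(count, stack)
```

The folded "stack count" lines this prints are exactly the input format flamegraph.pl expects, so the same idea scales up to a flame graph.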

How to measure mispredictions for a single branch on Linux?

I know that I can get the total percentage of branch mispredictions during the execution of a program with perf stat. But how can I get the statistics for a specific branch (if or switch statement in C code)?
You can sample on the branch-misses event:
sudo perf record -e branch-misses <yourapp>
and then report it (and even selecting the function you're interested in):
sudo perf report -n --symbols=<yourfunction>
There you can access the annotated code and get some statistics for a given branch. Or annotate it directly with perf annotate and its --symbol option.

How to use linux perf tool for code comprehension

I'm fascinated by the ability of 'perf' to record call graphs and am trying to understand how to use it to understand a new code base.
I compiled the code in debug mode, and ran unit tests using the following command:
perf record --call-graph dwarf make test
This creates a 230 meg perf.data. I then write out the call graph
perf report --call-graph --stdio > callgraph.txt
This creates a 50 meg file.
Ideally, I would only like to see code belonging to the project, not kernel code, system calls, c++ standard libraries, even boost and whatever other third party software. Currently I see items like __GI___dl_iterate_phdr, _Unwind_Find_FDE, etc.
I love the flamegraph project. However, that visualization isn't good for code comprehension. Are there any other projects, write-ups, ideas, which might be helpful?
For a huge application, perf report -g output should not be dumped to an external file; it's too verbose. The collected perf.data (recorded with -g) works without any file redirection in the interactive perf report TUI. To find the functions that took the most time, you can disable call-graph reporting: record without -g, or use perf report --no-children.
There is the gprof2dot script (https://github.com/jrfonseca/gprof2dot) to visualize large perf report call graphs as a compact picture (a graph).
There are also Brendan D. Gregg's interactive FlameGraphs in svg/js; he often notes in presentations that perf report -g produces many megabytes of raw dump, amounting to a lot of A4 pages. Usage instructions for perf are at http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#perf:
# git clone https://github.com/brendangregg/FlameGraph # or download it from github
# cd FlameGraph
# perf record -F 99 -g -- ../command
# perf script | ./stackcollapse-perf.pl > out.perf-folded
# ./flamegraph.pl out.perf-folded > perf-kernel.svg
PS: Why are you profiling the make process? Try selecting a few tests and profiling only those. Use a lower profiling frequency to get a smaller perf.data file. Also, disable kernel-mode samples with the :u suffix on the default "cycles" event: perf record -F 99 -g -e cycles:u -- ../command
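To see what the stackcollapse step in the pipeline above actually does, here is a rough Python sketch of the folding (an illustration only, assuming the default perf script layout: an event header line, indented "addr symbol (dso)" frame lines, and a blank line between samples):

```python
import collections

def collapse(perf_script_text):
    """Fold perf-script-style output into 'frame1;frame2;... count' lines."""
    counts = collections.Counter()
    stack = []
    for line in perf_script_text.splitlines():
        if not line.strip():                 # blank line ends one sample
            if stack:
                # perf prints the leaf frame first; flamegraph wants root first
                counts[";".join(reversed(stack))] += 1
                stack = []
        elif line.startswith(("\t", " ")):   # indented frame line
            parts = line.split()             # ["addr", "symbol", "(dso)"]
            if len(parts) >= 2:
                stack.append(parts[1])
    if stack:                                # flush the final sample
        counts[";".join(reversed(stack))] += 1
    return ["{} {}".format(s, n) for s, n in sorted(counts.items())]
```

The real stackcollapse-perf.pl handles more corner cases (inlined frames, annotations, per-comm grouping), but the output format is the same folded one flamegraph.pl consumes.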

Major Perf and PIN profiling discrepancies

To analyze certain attributes of execution times, I was going to use both Perf and PIN in separate executions of a program to get all of my information. PIN would give me instruction mixes, and Perf would give me hardware performance on those mixes. As a sanity check, I profiled the following command line argument:
g++ hello_world.cpp -o hello
So my complete command line inputs were the following:
perf stat -e cycles -e instructions g++ hello_world.cpp -o hello
pin -t icount.so -- g++ hello_world.cpp -o hello
In the PIN commands, I ignored all the path stuff for the files for the sake of this post. Additionally, I altered the basic icount.so to also record instruction mixes in addition to the default dynamic instruction count. The results were astonishingly different:
PIN Results:
Count 1180608
14->COND_BR: 295371
49->UNCOND_BR: 21869
//skipping all of the other instruction types for now
Perf Results:
20,538,346 branches
105,662,160 instructions # 0.00 insns per cycle
0.072352035 seconds time elapsed
This was supposed to serve as a sanity check by having roughly the same instruction counts and roughly the same branch distributions. Why would the dynamic instruction counts be off by a factor of ~100x?! I was expecting some noise, but that's a bit much.
Also, branches are about 20% of instructions for Perf, but PIN reports around 25% (that also seems like a tad wide of a discrepancy, but it's probably just a side effect of the massive instruction-count distortion).
There are significant differences between what's counted by the icount pintool and the instructions performance event, which is mapped to the architectural Instructions Retired hardware performance event on modern Intel processors. I assume you're on an Intel processor.
pin is only injected into child processes when the -follow_execv command-line option is specified and, if the pintool registered a callback function to intercept process creation, the callback returned true. On the other hand, perf profiles all child processes by default. You can tell perf to only profile the specified process using the -i (--no-inherit) option.
perf, by default, counts events that occur in both user mode and kernel mode (if /proc/sys/kernel/perf_event_paranoid is smaller than 2). pin only supports profiling in user mode.
The icount pintool counts at basic-block granularity; a basic block is essentially a short, single-entry, single-exit sequence of instructions. If an instruction in the block causes an exception, the rest of the instructions in the block are not executed, but they have already been counted. An exception may be handled without terminating the program. The instructions event, in contrast, only counts instructions at retirement.
The icount pintool, by default, counts each iteration of a rep-prefixed instruction as one instruction. The instructions event counts a rep-prefixed instruction as a single instruction irrespective of the number of iterations.
On some processors, the instructions event may over count or under count.
The instructions event count may be larger due to the first two reasons. The icount pintool instruction count may be larger due to the next two reasons. The last reason may result in unpredictable discrepancies. Since the perf count is about 100x larger than the icount count, it's clear that the first two factors are dominant in this case.
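Plugging in the numbers quoted in the question bears this out (a quick check using only figures from this post): the instruction-count ratio is closer to ~90x than 100x, and the branch fractions work out to about 19% for perf and about 27% for PIN:

```python
# Figures quoted in the question above.
pin_total = 1_180_608
pin_branches = 295_371 + 21_869     # COND_BR + UNCOND_BR from the PIN output
perf_total = 105_662_160
perf_branches = 20_538_346

print(round(perf_total / pin_total, 1))          # instruction-count ratio, ~89.5x
print(round(100 * perf_branches / perf_total))   # perf branch share, ~19%
print(round(100 * pin_branches / pin_total))     # PIN branch share, ~27%
```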
You can get the two tools to get a lot closer counts by passing -i to perf to not profile children, adding the :u modifier to the instructions event name to count only in user mode, and passing -reps 1 to pin to count rep-prefixed instructions per instruction rather than per iteration.
perf stat -i -e cycles,instructions:u g++ hello_world.cpp -o hello
pin -t icount.so -reps 1 -- g++ hello_world.cpp -o hello
Instead of passing -i to perf, you can pass -follow_execv to pin as follows:
pin -follow_execv -t icount.so -reps 1 -- g++ hello_world.cpp -o hello
In this way, both tools will profile the entire process hierarchy rooted at the specified process (i.e., a running g++).
I expect the counts to be very close with these measures, but they still won't be identical.