How to use linux perf tool for code comprehension - linux

I'm fascinated by the ability of 'perf' to record call graphs and am trying to understand how to use it to understand a new code base.
I compiled the code in debug mode, and ran unit tests using the following command:
perf record --call-graph dwarf make test
This creates a 230 meg perf.data. I then write out the call graph
perf report --call-graph --stdio > callgraph.txt
This creates a 50 meg file.
Ideally, I would only like to see code belonging to the project, not kernel code, system calls, c++ standard libraries, even boost and whatever other third party software. Currently I see items like __GI___dl_iterate_phdr, _Unwind_Find_FDE, etc.
I love the flamegraph project. However, that visualization isn't good for code comprehension. Are there any other projects, write-ups, ideas, which might be helpful?

For a large application, the output of perf report -g is too verbose to be usefully dumped to a file. A perf.data collected with -g can instead be browsed, without redirection, in the interactive perf report TUI. To find the functions that took the most time, either record without call graphs (perf record without -g) or report self time only with perf report --no-children.
The gprof2dot script (https://github.com/jrfonseca/gprof2dot) can render large perf report call graphs as a compact picture (graph).
There are also Brendan Gregg's interactive FlameGraphs (SVG/JS); he often notes in presentations that perf report -g output amounts to many megabytes of raw text, the equivalent of a large stack of A4 pages. Usage instructions for perf are at http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#perf:
# git clone https://github.com/brendangregg/FlameGraph # or download it from github
# cd FlameGraph
# perf record -F 99 -g -- ../command
# perf script | ./stackcollapse-perf.pl > out.perf-folded
# ./flamegraph.pl out.perf-folded > perf-kernel.svg
PS: Why are you profiling the make process? Pick a few tests and profile only those. Use a lower sampling frequency to get a smaller perf.data file. Also disable kernel-mode samples with the :u suffix on the default "cycles" event: perf record -F 99 -g -e cycles:u -- ../command
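To see only frames that belong to the project, one option is to filter the folded stacks produced by stackcollapse-perf.pl before looking at them. The following is a minimal sketch, not part of perf or FlameGraph: the function name filter_folded and the myproj:: prefix are made up for illustration, and it assumes the folded format "frame1;frame2;...;frameN count" with the count after the last space.

```python
import re
from collections import Counter


def filter_folded(lines, keep_pattern):
    """Drop frames that do not match keep_pattern and merge identical stacks."""
    keep = re.compile(keep_pattern)
    merged = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The sample count follows the last space; frame names may contain spaces.
        stack, _, count = line.rpartition(" ")
        frames = [f for f in stack.split(";") if keep.search(f)]
        if frames:
            merged[";".join(frames)] += int(count)
    return merged


# Hypothetical folded stacks; "tests" is the process name that comes first.
sample = [
    "tests;main;myproj::Parser::parse;std::vector::push_back;__GI___dl_iterate_phdr 3",
    "tests;main;myproj::Parser::parse;_Unwind_Find_FDE 2",
    "tests;main;myproj::Lexer::next 5",
]

for stack, count in sorted(filter_folded(sample, r"^myproj::").items()):
    print(stack, count)
```

On real data the input lines would come from perf script | ./stackcollapse-perf.pl, and the filtered output should still be in folded form, so it can be passed on to flamegraph.pl or read directly as a much shorter text file.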

Related

Using the perf events from perf list programmatically

When I run perf list on my Linux system I get a long list of available perf events.
Is it possible to list and use these events programmatically from another process, using perf_event_open(2)? That is, how can I get this list from another process and determine the corresponding values to populate in perf_event_attr?
I'm not looking for solutions that use another third-party listing of the events, e.g., libpfm4 or jevents. I know some events can be reconstructed from the files in /sys/devices/cpu/events/ (and similar files for other event types), but these are a small subset of the events that perf list shows.
There is no way to get the full list of raw events from the kernel (through any syscall such as perf_event_open(2)) without using third-party (or first-party) lists. The perf tool takes some basic events from /sys/bus/event_source/devices/cpu/events and similar sysfs directories, but it also carries its own list of CPU-model-specific events: https://elixir.bootlin.com/linux/v5.5.19/source/tools/perf/pmu-events. The README there notes that perf uses jevents (perf has 8 MB of x86 JSON event lists, at tools/perf/pmu-events/arch/x86):
The contents of this directory allow users to specify PMU events in their
CPUs by their symbolic names rather than raw event codes (see example below).
The main program in this directory, is the 'jevents', which is built and
executed BEFORE the perf binary itself is built.
The 'jevents' program tries to locate and process JSON files in the directory
tree tools/perf/pmu-events/arch/foo.
You can download the perf sources from https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/ and use a source code navigation tool to inspect the cmd_list function in builtin-list.c (which has some undocumented options). You can also build perf from these sources; jevents is compiled early in the perf build (HOSTCC pmu-events/jevents.o, LINK pmu-events/jevents).
The current CPU model is looked up in the pmu_events_map table (pmu-events/pmu-events.c) by perf_pmu__find_map (util/pmu.c), which is called from pmu_add_cpu_aliases, called from pmu_lookup, from perf_pmu__find, from perf_pmu__scan, from print_pmu_events, from cmd_list (the handler of the perf list builtin command).
As of perf 5.5 (from Linux kernel 5.5, since perf is part of the kernel tree), there is no raw dump of the event list with descriptions. There is an undocumented option, perf list --raw-dump, which prints the list of all events for every available monitoring unit, for example for pmu: perf list --raw-dump pmu | tr ' ' '\n'. The output of this raw dump is not stable across perf versions.
The kernel part of the perf_events subsystem has no full event lists in arch/x86/events or kernel/events, only the mapping of standard perf events (those listed in sysfs), such as cycles or cpu/branch-misses/, to the raw events of a specific CPU model.
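For the subset of events that sysfs does expose, the values for perf_event_attr can be derived from two kinds of files: events/NAME holds terms such as event=0x3c,umask=0x00, and format/TERM says which bits of which attr field a term occupies (for example config:0-7). A minimal sketch of that decoding follows; the function names are made up for illustration, and it assumes all term values are numeric.

```python
from pathlib import Path


def parse_event_terms(text):
    """Parse 'event=0x3c,umask=0x00' into {'event': 0x3c, 'umask': 0}."""
    terms = {}
    for term in text.strip().split(","):
        if not term:
            continue
        name, sep, value = term.partition("=")
        # A bare term with no "=value" is treated as a flag set to 1.
        terms[name] = int(value, 0) if sep else 1
    return terms


def parse_format(text):
    """Parse 'config:0-7,32-35' into ('config', [(0, 7), (32, 35)])."""
    field, _, ranges = text.strip().partition(":")
    bits = []
    for part in ranges.split(","):
        lo, sep, hi = part.partition("-")
        bits.append((int(lo), int(hi) if sep else int(lo)))
    return field, bits


def encode(terms, formats):
    """Pack term values into attr fields (config, config1, ...) per the formats."""
    attr = {}
    for name, value in terms.items():
        field, bits = parse_format(formats[name])
        packed = 0
        for lo, hi in bits:
            width = hi - lo + 1
            # Place the next `width` low bits of the value at bit position `lo`.
            packed |= (value & ((1 << width) - 1)) << lo
            value >>= width
        attr[field] = attr.get(field, 0) | packed
    return attr


def load_pmu(pmu_dir="/sys/bus/event_source/devices/cpu"):
    """Read every event and format file of one PMU from sysfs."""
    root = Path(pmu_dir)
    formats = {p.name: p.read_text() for p in (root / "format").iterdir()}
    # Skip auxiliary files such as NAME.scale and NAME.unit.
    events = {p.name: parse_event_terms(p.read_text())
              for p in (root / "events").iterdir() if "." not in p.name}
    return events, formats


# Example with literal strings, so it runs without access to sysfs.
formats = {"event": "config:0-7", "umask": "config:8-15"}
print(hex(encode(parse_event_terms("event=0x3c,umask=0x01"), formats)["config"]))
```

The PMU's type file in the same directory gives the value for perf_event_attr.type. This still covers only the events listed in sysfs, not the model-specific JSON lists described above.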

Linux `perf record --append` option missing

Online manpages like https://linux.die.net/man/1/perf-record suggest that there is an option for Linux perf command that supports incremental profiling, i.e. merging the profiling data from multiple different runs, via perf record --append. However, on my system with perf version 4.15.18, the option is missing. Is my perf version too new, or too old, to use the --append option? Alternatively, if the --append option is missing, is there another way for me to merge/append perf results from multiple runs and do incremental profiling?
This question arose when doing sampling-based profiling using LLVM. In LLVM, instrumentation-based profiling supports merging profile data across multiple runs, and I was wondering if we can do the same thing with perf.
It was removed quite a while ago; see https://lore.kernel.org/patchwork/patch/391730/ and the related discussion at https://marc.info/?l=linux-kernel&m=137031146932578&w=2. It looks like --append was implemented rather simply, by switching the write mode of the profiling data to "append", and it did not work well with perf report, so it was removed.
There seems to be an option, --timestamp-filename, that adds a timestamp to the output filename, which is potentially useful for batch-sampling programs with perf. When doing sampling-based optimization in LLVM, we can then use AutoFDO to convert the profiles into LLVM-readable profiles and use llvm-profdata merge to merge everything.
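As a rough sketch of that workflow, the helper below only builds the command lines for converting several perf.data files and merging the results; it does not run anything. The function name merge_plan and the file names are made up, and the create_llvm_prof (--binary, --profile, --out) and llvm-profdata merge (-sample, -o) flags are given as I understand them from the AutoFDO and LLVM documentation, so check them against --help on your versions.

```python
import shlex
from pathlib import Path


def merge_plan(binary, perf_files, merged="merged.prof"):
    """Return (convert_commands, merge_command) as argv lists; nothing is executed."""
    convert = []
    profiles = []
    for perf_file in sorted(perf_files):
        # One LLVM sample profile per recorded perf.data file.
        prof = Path(perf_file).name + ".prof"
        convert.append([
            "create_llvm_prof",
            f"--binary={binary}",
            f"--profile={perf_file}",
            f"--out={prof}",
        ])
        profiles.append(prof)
    merge = ["llvm-profdata", "merge", "-sample", "-o", merged, *profiles]
    return convert, merge


# Hypothetical file names standing in for the timestamped perf output files.
convert, merge = merge_plan("./app", ["perf.data.run2", "perf.data.run1"])
for cmd in convert + [merge]:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each argv list can be passed to subprocess.run once the printed commands look right.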

linux perf report inconsistent behavior

I have an application I'm profiling using perf and I find the results when using perf report are not consistent, and I can't discern the pattern.
I start the application and profile it by pid for 60 seconds:
perf record -p <pid> -o <file> sleep 60
And when I pull the results in with perf report -i <file>, sometimes I see a "+" in the far left column that allows me to drill down into the function call trees when I press ENTER, and sometimes that "+" is not there. It seems to be dependent on some property of the recorded file, in that I have a collection of recorded files, some which allow this drill down and some which do not.
Any suggestions on how to get consistent behavior here would be appreciated.
The default event being measured by perf record is cpu-cycles.
(Or depending on the machine, sometimes cpu-cycles:p or cpu-cycles:pp)
Are you sure your application is not sleeping a lot? Does it consume a lot of cpu cycles?
Try a perf measurement on something that stresses the CPU by doing a lot of computations:
$ apt-get install stress
$ perf record -e cpu-cycles --call-graph fp stress --cpu 1 --timeout 5
$ perf report
Subsequent runs should then show more or less similar results.
In case your program is CPU intensive, and call stacks do differ a lot between runs, then you may want to look at the --call-graph option, as perf can record call-graphs with different methods:
fp (frame pointer)
lbr (last branch record)
dwarf
Maybe different methods give better results.
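To compare the methods on the same process, it can help to generate one perf record command per method, each writing to its own file. This is a small sketch with made-up names (record_command, the perf-METHOD.data file naming); it only prints the commands and uses the same options as the commands above.

```python
import shlex

CALL_GRAPH_METHODS = ("fp", "lbr", "dwarf")


def record_command(pid, method, seconds=60, event="cpu-cycles"):
    """Build the argv for profiling a running process with one call-graph method."""
    if method not in CALL_GRAPH_METHODS:
        raise ValueError(f"unknown call-graph method: {method}")
    output = f"perf-{method}.data"  # hypothetical naming scheme
    return [
        "perf", "record", "-e", event, "--call-graph", method,
        "-p", str(pid), "-o", output, "--", "sleep", str(seconds),
    ]


for method in CALL_GRAPH_METHODS:
    print(" ".join(shlex.quote(arg) for arg in record_command(1234, method)))
```

Each resulting file can then be opened with perf report -i to see which method gives usable call trees for the application.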

Is there a way to find performance of individual functions in a process using perf tool?

I am trying to get performance of individual functions within a process. How can I do it using perf tool? Is there any other tool for this?
For example, let's say the main function calls functions A, B, and C. I want to get the performance of the main function as well as of functions A, B, and C individually.
Is there a good document for understanding the perf source code?
Thank you.
What you want to do is user-land probing. Perf can only do part of it.
Try sudo perf top -p [pid] and then watch the scoreboard. It will show the list of functions sorted by CPU usage. Here is a snapshot of redis during a benchmark:
If you want more information about your user-land functions, such as I/O usage, latency, or memory usage, I strongly suggest SystemTap. It is both a scripting language and a tool for profiling programs on Linux kernel-based operating systems. Here is a tutorial about it:
http://qqibrow.github.io/performance-profiling-with-systemtap/
You don't need to be an expert in SystemTap scripting; there are many good scripts online.
For example, here is one for finding the latency of a specific function:
https://github.com/openresty/stapxx#func-latency-distr
See the Perforator tool, which is built for this: https://github.com/zyedidia/perforator.
Perforator uses the same perf_event_open API that perf uses, but also uses ptrace so that profiling can be selectively enabled only for certain regions of a program (such as functions). See the examples at the Github repository for details.
perf is documented at https://perf.wiki.kernel.org/index.php/Main_Page with a tutorial at https://perf.wiki.kernel.org/index.php/Tutorial
perf report gives the breakdown by "command", see https://perf.wiki.kernel.org/index.php/Tutorial#Sample_analysis_with_perf_report. perf annotate provides a way to select what commands to report, see "Source level analysis with perf annotate" in https://perf.wiki.kernel.org/index.php/Tutorial#Options_controlling_output_2.
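For the main/A/B/C example in the question, sampled call stacks can be reduced to two numbers per function: self samples (the function was on top of the stack) and total samples (the function was anywhere on the stack). The sketch below assumes the stacks were recorded with call graphs and collapsed into "frame1;...;frameN count" lines, for example with FlameGraph's stackcollapse-perf.pl. The function name and the sample data are made up.

```python
from collections import Counter


def per_function_samples(folded_lines):
    """Return (self_samples, total_samples) counters keyed by function name."""
    self_samples = Counter()
    total_samples = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # The sample count follows the last space; frames are separated by ";".
        stack, _, count = line.rpartition(" ")
        count = int(count)
        frames = stack.split(";")
        self_samples[frames[-1]] += count  # the innermost frame was executing
        for name in set(frames):  # set(): count a recursive function once per stack
            total_samples[name] += count
    return self_samples, total_samples


# Hypothetical folded stacks; "app" is the process name that comes first.
folded = [
    "app;main 2",
    "app;main;A 6",
    "app;main;B 3",
    "app;main;C 1",
]

self_s, total_s = per_function_samples(folded)
for name in ("main", "A", "B", "C"):
    print(f"{name}: self={self_s[name]} total={total_s[name]}")
```

These are sample counts, not times. Multiplying by the sampling period gives an estimate of CPU time, which roughly corresponds to the self and children columns that perf report shows.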

Open perf.data in Kcachegrind

I read somewhere that it is possible to convert perf.data (the output of the Linux perf record profiling tool) to a format that kcachegrind can parse and plot. However, I didn't find an application capable of doing this conversion, and kcachegrind does not open perf.data either.
Is this possible: use kcachegrind to see perf output? Which tool can I use?
There are two approaches for converting perf data to the callgrind format, but it's unclear which of them is more mature.
The one with more recent commits, perfgrind, can be found at https://github.com/ostash/perfgrind
However, it is stated to lack callgraph support, and commits came to a halt after announcement of a patch for the 2nd tool on the kernel mailing list, see lkml.org/lkml/2013/3/27/535.
The 2nd tool https://github.com/vitillo/perf approaches direct integration into the perf command, but has not yet seen an official release.
At least the perf 3.10.0 I tried does not support the proposed 'perf convert' syntax.

Resources