When I run perf list on my Linux system I get a long list of available perf events.
Is it possible to list and use these events programmatically from another process, using perf_event_open(2)? That is, how can I get this list from another process and determine the corresponding values to populate in perf_event_attr?
I'm not looking for solutions that use another third-party listing of the events, e.g., libpfm4 or jevents. I know some events can be reconstructed from the files in /sys/devices/cpu/events/ (and similar files for other event types), but these are a small subset of the events that perf list shows.
There is no way to get the full list of raw events from the kernel (via perf_event_open(2) or any other syscall) without using a third-party (or first-party) list. The perf tool scans some basic events from /sys/bus/event_source/devices/cpu/events and similar sysfs directories, but it also carries its own list of CPU-model-specific events: https://elixir.bootlin.com/linux/v5.5.19/source/tools/perf/pmu-events. The README there notes that perf uses jevents (perf ships roughly 8 MB of x86 JSON event lists in tools/perf/pmu-events/arch/x86):
The contents of this directory allow users to specify PMU events in their
CPUs by their symbolic names rather than raw event codes (see example below).
The main program in this directory, is the 'jevents', which is built and
executed BEFORE the perf binary itself is built.
The 'jevents' program tries to locate and process JSON files in the directory
tree tools/perf/pmu-events/arch/foo.
You can download the perf sources from https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/ and use a source-code navigation tool to check the cmd_list function in the builtin-list.c file (it has some undocumented options). You can also build the perf tools from these sources; jevents is compiled early in the build (HOSTCC pmu-events/jevents.o, LINK pmu-events/jevents).
The current CPU model is looked up in the pmu_events_map table (pmu-events/pmu-events.c) by perf_pmu__find_map (util/pmu.c), called from pmu_add_cpu_aliases, called from pmu_lookup, from perf_pmu__find, from perf_pmu__scan, from print_pmu_events, from cmd_list (the handler of the perf list builtin command).
As of perf 5.5 (perf is part of the Linux kernel, so this corresponds to kernel 5.5), there is no raw dump of the event list with descriptions. There is an undocumented option, perf list --raw-dump, which prints the list of all events for every available monitoring unit; for example, for the pmu unit: perf list --raw-dump pmu | tr ' ' '\n'. The output of this raw dump is not stable across perf versions.
The kernel part of the perf_events subsystem (the arch/x86/events and kernel/events directories) has no full event lists, only the mapping of the standard perf events listed in sysfs, such as cycles or cpu/branch-misses/, to the raw events of a specific CPU model.
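For the sysfs-described subset, the mapping to perf_event_attr can be done by hand. Below is a minimal sketch of mine (not perf's code) that reads the PMU type and one named event from /sys/bus/event_source/devices/cpu and opens it with perf_event_open(2); it assumes the common x86 layout where "event" occupies config bits 0-7 and "umask" bits 8-15 (the authoritative layout is described by the files under .../cpu/format/):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long read_sysfs_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long v = -1;
    if (f) { fscanf(f, "%ld", &v); fclose(f); }
    return v;
}

int main(void)
{
    /* PMU type number of the "cpu" event source (PERF_TYPE_RAW == 4 on most systems) */
    long type = read_sysfs_long("/sys/bus/event_source/devices/cpu/type");

    /* Event description, e.g. "event=0xc5" or "event=0xd1,umask=0x20" */
    char desc[256] = "";
    FILE *f = fopen("/sys/bus/event_source/devices/cpu/events/branch-misses", "r");
    if (f) { fgets(desc, sizeof(desc), f); fclose(f); }

    unsigned long long event = 0, umask = 0;
    for (char *tok = strtok(desc, ",\n"); tok; tok = strtok(NULL, ",\n")) {
        sscanf(tok, "event=%llx", &event);
        sscanf(tok, "umask=%llx", &umask);
    }
    unsigned long long config = event | (umask << 8);  /* assumed x86 format layout */

    struct perf_event_attr attr = {0};
    attr.size = sizeof(attr);
    attr.type = (unsigned int)type;
    attr.config = config;
    attr.exclude_kernel = 1;

    /* count the event for the calling process on any CPU */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }
    printf("opened type=%ld config=0x%llx fd=%d\n", type, config, fd);
    close(fd);
    return 0;
}

This only covers what sysfs exposes; the thousands of model-specific events shown by perf list exist solely in the JSON lists mentioned above.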
Related
perf is able to record multiple fields such as addr, ip, and timestamp. It can also record general-purpose registers, as seen at https://github.com/torvalds/linux/blob/master/tools/perf/arch/x86/util/perf_regs.c. But I can't find any documentation about recording control registers with perf. How can I achieve that using perf? Are there any other tools available?
You cannot record control register values using the perf tools. The list of registers that you can sample using the --intr-regs option is limited to the registers listed here. You can confirm this by looking here.
The registers that can be accessed by the perf events module are architecture-dependent, as can be seen here and here. Support for including selected register state in the perf record/perf script output was introduced by this commit. This means all of perf is limited to the registers specified there and nothing more.
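To make that limitation concrete, here is a minimal sketch of mine (not perf's actual code) of roughly what perf record --intr-regs asks for at the perf_event_open(2) level: the PERF_SAMPLE_REGS_INTR sample type plus a bitmask built from the PERF_REG_X86_* enum (arch/x86/include/uapi/asm/perf_regs.h). That enum has no entry for control registers such as CR3, so there is simply no bit to request them with:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
#include <asm/perf_regs.h>   /* PERF_REG_X86_* (x86 only) */

int main(void)
{
    struct perf_event_attr attr = {0};
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 100000;
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_REGS_INTR;
    /* Only bits defined in the PERF_REG_X86_* enum are valid here. */
    attr.sample_regs_intr = (1ULL << PERF_REG_X86_IP) |
                            (1ULL << PERF_REG_X86_SP) |
                            (1ULL << PERF_REG_X86_AX);
    attr.exclude_kernel = 1;
    attr.disabled = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }
    printf("sampling IP/SP/AX at each overflow, fd=%d\n", fd);
    close(fd);
    return 0;
}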
There are other questions/answers here that tell you some ways of writing a program/kernel module to access the control registers. On top of this, you can use QEMU (in TCG mode) and run your program inside the VM; you can then print the register state periodically (at the end of each TB, where you'll see all register values). Dedicated tools such as GDB might also help you.
Edit -
There is one way by which CR3 register values can be recorded. You can use Intel PT to record control-flow information for a program during its execution. Intel PT tracks changes to the CR3 register with the help of the PIP packet, so you can use the traces generated by Intel PT to track and determine the CR3 values.
I used the following command to extract backtraces leading to user level L3-misses in a simple evince benchmark:
sudo perf record -d --call-graph dwarf -c 10000 -e mem_load_uops_retired.l3_miss:uppp /opt/evince-3.28.4/bin/evince
As you can see, the sampling period is quite large (10000 events between consecutive samples). For this experiment, the output of perf script had some samples similar to this one:
EvJobScheduler 27529 26441.375932: 10000 mem_load_uops_retired.l3_miss:uppp: 7fffcd5d8ec0 5080022 N/A|SNP N/A|TLB N/A|LCK N/A
7ffff17bec7f bits_image_fetch_separable_convolution_affine+0x2df (inlined)
7ffff17bec7f bits_image_fetch_separable_convolution_affine_pad_x8r8g8b8+0x2df (/usr/lib/x86_64-linux-gnu/libpixman-1.so.0.34.0)
7ffff17d1fd1 general_composite_rect+0x301 (/usr/lib/x86_64-linux-gnu/libpixman-1.so.0.34.0)
ffffffffffffffff [unknown] ([unknown])
At the bottom of the backtrace there is a symbol called [unknown], which seems OK. But right above it there is a line in general_composite_rect(). Is this backtrace OK?
AFAIK, the first caller in the backtrace should be something like _start() or __GI___clone(). But the backtrace is not in this form. What is wrong?
Is there any way to resolve the issue? Are the truncated (parts of) backtraces reliable?
TL;DR: perf's backtracing may stop at a function if there is no frame pointer saved on the stack (or no CFI tables, for the dwarf method). Recompile the libraries with -fno-omit-frame-pointer or with -g, or install their debuginfo. With release binaries and libraries, perf will often stop the backtrace early, with no chance of reaching the top functions main(), _start, or clone()/start_thread().
The perf profiling tool in Linux is a statistical sampling profiler (without binary instrumentation): it programs a software timer, an event source, or the hardware performance monitoring unit (PMU) to generate a periodic interrupt. In your example,
-c 10000 -e mem_load_uops_retired.l3_miss:uppp selects the hardware PMU on x86_64 in a kind of PEBS mode (https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR) to generate an interrupt after 10000 mem_load_uops_retired events (with the l3_miss mask). The generated interrupt is handled by the Linux kernel (the perf_events subsystem, kernel/events and arch/x86/events). In this handler the PMU is reset (reprogrammed) to generate the next interrupt after 10000 more events, and a sample is generated. The sample data is saved into the perf.data file by the perf record command (every wakeup of the tool can save thousands of samples); the samples can later be read with perf script or perf script -D.
The perf_events interrupt handler, somewhere near __perf_event_overflow in kernel/events/core.c, has full access to the registers of the current function and has some time to do additional data retrieval, recording the current time, pid, etc. Part of that process is collecting https://en.wikipedia.org/wiki/Call_stack data. But on x86_64 with -fomit-frame-pointer (often enabled for many system libraries on Debian/Ubuntu/others) there is no default place in the registers or on the function stack where frame pointers are stored:
https://gcc.gnu.org/onlinedocs/gcc-4.6.4/gcc/Optimize-Options.html#index-fomit_002dframe_002dpointer-692
-fomit-frame-pointer
Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and
restore frame pointers; it also makes an extra register available in
many functions. It also makes debugging impossible on some machines.
Starting with GCC version 4.6, the default setting (when not optimizing for size) for 32-bit Linux x86 and 32-bit Darwin x86
targets has been changed to -fomit-frame-pointer. The default can be
reverted to -fno-omit-frame-pointer by configuring GCC with the
--enable-frame-pointer configure option.
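To see why the saved frame pointer matters, here is a minimal user-space sketch of mine (not kernel code) of frame-pointer unwinding on x86_64: every non-leaf frame starts with the caller's saved rbp immediately followed by the return address, so walking that chain recovers the callers. Build it with gcc -O0 -fno-omit-frame-pointer; the kernel's real walker (perf_callchain_user(), mentioned below) does the same dereferencing, only with copy_from_user checks:

#include <stdio.h>

/* Layout of an x86_64 stack frame when the frame pointer is kept. */
struct frame {
    struct frame *next;      /* caller's saved rbp */
    void         *ret_addr;  /* return address pushed by the call */
};

__attribute__((noinline)) void backtrace_fp(void)
{
    struct frame *fp = __builtin_frame_address(0);
    /* Follow the saved-rbp chain; stop at NULL or after a sane depth. */
    for (int depth = 0; fp && depth < 16; depth++) {
        printf("#%d  %p\n", depth, fp->ret_addr);
        fp = fp->next;
    }
}

__attribute__((noinline)) void leaf(void)  { backtrace_fp(); }
__attribute__((noinline)) void inner(void) { leaf(); }

int main(void)
{
    inner();
    return 0;
}

With -fomit-frame-pointer the "next" slot is never written, so a walk like this (and perf's fp method) stops or produces garbage at the first such frame.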
With frame pointers saved on the function stack, backtracing/unwinding is easy, as the sketch above shows. But modern gcc (and other compilers) may not generate a frame pointer for some functions, so backtracing code like the perf_events handler either stops the backtrace at such a function or needs another method of recovering caller frames. The --call-graph option of perf record (it implies -g) selects the method to be used. It is documented in man perf-record, http://man7.org/linux/man-pages/man1/perf-record.1.html:
--call-graph Setup and enable call-graph (stack chain/backtrace) recording, implies -g. Default is "fp".
Allows specifying "fp" (frame pointer) or "dwarf" (DWARF's CFI -
Call Frame Information) or "lbr" (Hardware Last Branch Record
facility) as the method to collect the information used to show the
call graphs.
In some systems, where binaries are build with gcc
--fomit-frame-pointer, using the "fp" method will produce bogus call graphs, using "dwarf", if available (perf tools linked to the
libunwind or libdw library) should be used instead. Using the "lbr"
method doesn't require any compiler options. It will produce call
graphs from the hardware LBR registers. The main limitation is that
it is only available on new Intel platforms, such as Haswell. It
can only get user call chain. It doesn't work with branch stack
sampling at the same time.
When "dwarf" recording is used, perf also records (user) stack dump
when sampled. Default size of the stack dump is 8192 (bytes). User
can change the size by passing the size after comma like
"--call-graph dwarf,4096".
So, the dwarf method reuses CFI tables to determine stack frame sizes and find the caller's stack frame. I'm not sure whether CFI tables are stripped from release libraries by default, but debuginfo packages will probably have them. LBR will not help because it is a rather short hardware buffer. The split dwarf processing (the kernel handler saves part of the stack, and the perf user-space tool parses it with libdw+libunwind) may lose some parts of the call stack, so also try increasing the dwarf stack dump size with --call-graph dwarf,10240 or --call-graph dwarf,81920, etc.
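For reference, here is a sketch of my reading of what --call-graph dwarf,8192 corresponds to at the perf_event_attr level (the field names are real; the register mask value is my assumption covering AX..IP on x86_64): each sample carries the user register state plus a copy of up to 8192 bytes of the user stack, which perf later unwinds offline with libunwind/libdw:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
    struct perf_event_attr attr = {0};
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 100000;
    attr.sample_type = PERF_SAMPLE_IP
                     | PERF_SAMPLE_REGS_USER    /* registers the unwinder starts from */
                     | PERF_SAMPLE_STACK_USER;  /* raw copy of the user stack */
    attr.sample_regs_user  = 0x1ffULL;  /* assumed mask: AX..IP on x86_64 */
    attr.sample_stack_user = 8192;      /* the ",8192" part of --call-graph dwarf,8192 */
    attr.exclude_kernel = 1;
    attr.disabled = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }
    printf("dwarf-style stack dumps requested, fd=%d\n", fd);
    close(fd);
    return 0;
}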
Backtracing is implemented in the arch-dependent part of perf_events: arch/x86/events/core.c:perf_callchain_user(), called from kernel/events/callchain.c:get_perf_callchain() <- perf_callchain <- perf_prepare_sample <-
__perf_event_output <- event->overflow_handler, invoked as READ_ONCE(event->overflow_handler)(event, data, regs); in __perf_event_overflow.
Brendan Gregg warned about perf's incomplete call stacks: http://www.brendangregg.com/blog/2014-06-22/perf-cpu-sample.html
Incomplete stacks usually mean -fomit-frame-pointer was used – a compiler optimization that makes little positive difference in the real world, but breaks stack profilers. Always compile with -fno-omit-frame-pointer. More recent perf has a -g dwarf option, to use the alternate libunwind/dwarf method for retrieving stacks.
I also wrote about backtraces in perf, with some additional links: How does linux's perf utility understand stack traces?
I had the same problem, and it was like this: when you are collecting traces with --call-graph dwarf, if the stack is too big you will get unknown in the stack backtrace.
The default maximum stack dump size is 8 kB, but it can be increased like this: --call-graph dwarf,16578. Unfortunately, perf has some other problems when you increase the stack size. In my case, the solution was to get rid of a large stack-allocated array by allocating it on the heap, as in the sketch below.
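A sketch of that workaround (hypothetical code, just to illustrate the idea): a large on-stack array makes the sampled frame so big that a fixed-size dwarf stack dump may no longer reach the callers' frames, so moving the buffer to the heap keeps the frame small:

#include <stdlib.h>

void process_old(void)
{
    char scratch[64 * 1024];   /* 64 kB on the stack: callers may fall outside an 8 kB dwarf dump */
    scratch[0] = 0;            /* ... use scratch ... */
}

void process_new(void)
{
    char *scratch = malloc(64 * 1024);   /* heap allocation keeps this frame small */
    if (!scratch)
        return;
    scratch[0] = 0;            /* ... use scratch ... */
    free(scratch);
}

int main(void)
{
    process_old();
    process_new();
    return 0;
}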
I want to profile C++ program on Linux using random sampling that is described in this answer:
However, if you're in a hurry and you can manually interrupt your
program under the debugger while it's being subjectively slow, there's
a simple way to find performance problems.
The problem is that I can't use the gdb debugger because I want to profile in production under heavy load, and a debugger is too intrusive and considerably slows down the program. However, I can use perf record and perf report to find bottlenecks without affecting program performance. Is there a way to collect a number of readable (gdb-like) stack traces with perf instead of gdb?
perf does offer call-stack recording with three different techniques:
By default it uses the frame pointer (fp). This is generally supported and performs well, but it doesn't work with certain optimizations. Compile your applications with -fno-omit-frame-pointer etc. to make sure it works well.
dwarf uses a dump of the stack for each sample for post-processing. That has a significant performance penalty.
Modern systems can use hardware-supported last branch record, lbr.
The stack is accessible in perf analysis tools such as perf report or perf script.
For more details check out man perf-record.
On Linux, a process' (main thread's) last program-counter value is presented in /proc/$PID/stat. This seems to be a really simple and easy way to do some sampled profiling without having to instrument a program in any way whatsoever.
I'm wondering if this has any caveats when it comes to the sampling quality, however. I'm assuming this value is updated whenever the process runs out of its timeslice, which should happen at completely random intervals in the program code, and that samples taken at more than time-slice length should be uniformly randomly distributed according to where the program actually spends its time. But that's just an assumption, and I realize it could be wrong in any number of ways.
Does anyone know?
Why not try the modern built-in Linux tool perf (https://perf.wiki.kernel.org/index.php/Main_Page)?
It has a record mode with adjustable frequency (-F 100 for 100 Hz) and many events; for example, the software event task-clock works without using hardware performance counters (stop perf with Ctrl-C, or append sleep 10 to the command to sample for 10 seconds):
perf record -p $PID -e task-clock -o perf.output.file
Perf works for all threads without any instrumentation (no recompilation or code editing) and will not interfere with program execution (only the timer interrupt handling is slightly modified). (There is also some support for stack-trace sampling with the -g option.)
The output can be parsed offline with perf report (only this command will try to parse the binary and shared libraries):
perf report -i perf.output.file
or converted to raw PC (EIP) samples with perf script -i perf.output.file.
PS: The EIP pointer in the /proc/$pid/stat file is mentioned in the official Linux man page proc(5), http://man7.org/linux/man-pages/man5/proc.5.html, as kstkeip - "The current EIP (instruction pointer)." It is read at fs/proc/array.c:do_task_stat as eip = KSTK_EIP(task);, but I'm not sure where and when it is filled. It may be written on a task switch (both involuntary, when the time slice ends, and voluntary, when the task does something like sched_yield) or on blocking syscalls, so it is probably not the best choice of sampling source.
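A sketch of that sampling approach (my own illustration; proc(5) marks kstkeip as protected, so the kernel shows 0 unless the reader has ptrace-level access to the target): re-read the 30th field of /proc/PID/stat at a fixed interval, skipping past the last ')' because the comm field may itself contain spaces:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static unsigned long read_kstkeip(int pid)
{
    char path[64], buf[4096];
    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    char *p = strrchr(buf, ')');              /* skip "pid (comm)" safely */
    if (!p) return 0;
    unsigned long field = 2, kstkeip = 0;     /* the token after ')' is field 3 (state) */
    for (char *tok = strtok(p + 2, " "); tok; tok = strtok(NULL, " "))
        if (++field == 30) { kstkeip = strtoul(tok, NULL, 10); break; }
    return kstkeip;
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s pid\n", argv[0]); return 1; }
    int pid = atoi(argv[1]);
    for (int i = 0; i < 10; i++) {            /* 10 samples, 100 ms apart */
        printf("sample %d: eip=0x%lx\n", i, read_kstkeip(pid));
        usleep(100 * 1000);
    }
    return 0;
}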
If it works, which it could, it will have the shortcomings of prof, which gprof was supposed to remedy. Then gprof has its own shortcomings, which have led to numerous more modern profilers. Some of us consider this to be the most effective, and it can be accomplished with a tool as simple as pstack or lsstack.
I'm fascinated by the ability of 'perf' to record call graphs and am trying to understand how to use it to understand a new code base.
I compiled the code in debug mode, and ran unit tests using the following command:
perf record --call-graph dwarf make test
This creates a 230 MB perf.data. I then write out the call graph:
perf report --call-graph --stdio > callgraph.txt
This creates a 50 MB file.
Ideally, I would only like to see code belonging to the project, not kernel code, system calls, c++ standard libraries, even boost and whatever other third party software. Currently I see items like __GI___dl_iterate_phdr, _Unwind_Find_FDE, etc.
I love the flamegraph project. However, that visualization isn't good for code comprehension. Are there any other projects, write-ups, ideas, which might be helpful?
perf report -g output for a huge application should not be dumped to an external file; it is too verbose. The collected perf.data (recorded with -g) works without file redirection in the interactive perf report TUI interface. You can also disable call-graph reporting to find the functions that took the most time, either by running perf record without -g or by using perf report --no-children.
There is the gprof2dot script (https://github.com/jrfonseca/gprof2dot) to visualize large perf report call graphs as a compact picture (graph).
There are also Brendan D. Gregg's interactive FlameGraphs in SVG/JS; he often notes in presentations that perf report -g output is a multi-megabyte raw dump spanning many A4 pages. Usage instructions for perf are at http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#perf:
# git clone https://github.com/brendangregg/FlameGraph # or download it from github
# cd FlameGraph
# perf record -F 99 -g -- ../command
# perf script | ./stackcollapse-perf.pl > out.perf-folded
# ./flamegraph.pl out.perf-folded > perf-kernel.svg
PS: Why are you profiling the make process? Try to select some tests and profile only those. Use a lower profiling frequency to get a smaller perf.data file. Also disable kernel-mode samples with the :u suffix on the default "cycles" event: perf record -F 99 -g -e cycles:u -- ../command