Linux `perf record --append` option missing - linux

Online manpages like https://linux.die.net/man/1/perf-record suggest that the Linux perf command supports incremental profiling, i.e. merging the profiling data from multiple runs, via perf record --append. However, on my system with perf version 4.15.18, the option is missing. Is my perf version too new, or too old, to use the --append option? Alternatively, if the --append option is missing, is there another way for me to merge/append perf results from multiple runs and do incremental profiling?
This question arose when doing sampling-based profiling using LLVM. In LLVM, instrumentation-based profiling supports merging profile data across multiple runs, and I was wondering if we can do the same thing with perf.

It was removed quite a while ago; see https://lore.kernel.org/patchwork/patch/391730/ and the related discussion at https://marc.info/?l=linux-kernel&m=137031146932578&w=2. It looks like --append was implemented rather simply, by opening the profiling data file in append mode. That did not work well with perf report, so they decided to remove it.
There is also the --timestamp-filename option, which adds a timestamp to the output filename and is potentially useful for batch-sampling programs with perf. When doing sampling-based optimization in LLVM, we can then use AutoFDO to convert each profile into an LLVM-readable profile and use llvm-profdata merge to merge everything.
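The workflow above can be sketched as three command lines: record, convert, merge. The snippet below only builds and prints them (a dry run) and does not run perf. The binary name ./app, the output names, and the perf.data.* file names are hypothetical. The -b flag is an assumption that is not in the answer above: AutoFDO generally needs branch records. create_llvm_prof is the converter shipped with AutoFDO.

```python
import shlex


def record_cmd(target_argv):
    """One profiling run; --timestamp-filename gives each run its own output file."""
    # -b (branch sampling) is an assumption, added because AutoFDO wants branch data.
    return ["perf", "record", "-b", "--timestamp-filename", "--"] + list(target_argv)


def convert_cmd(binary, perf_data, out_prof):
    """Convert one perf data file into an LLVM sample profile with AutoFDO."""
    return [
        "create_llvm_prof",
        f"--binary={binary}",
        f"--profile={perf_data}",
        f"--out={out_prof}",
    ]


def merge_cmd(profiles, merged="merged.prof"):
    """Merge several sample profiles into one with llvm-profdata."""
    return ["llvm-profdata", "merge", "-sample", f"-output={merged}"] + list(profiles)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    print(shlex.join(record_cmd(["./app", "input1"])))
    # Hypothetical names for the timestamped files produced by two runs.
    runs = ["perf.data.run1", "perf.data.run2"]
    profs = []
    for i, data in enumerate(runs):
        prof = f"run{i}.prof"
        profs.append(prof)
        print(shlex.join(convert_cmd("./app", data, prof)))
    print(shlex.join(merge_cmd(profs)))
```

To actually execute the steps, each list could be passed to subprocess.run(cmd, check=True) after checking that the tools are installed.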

Related

Is there a way to see what happens during program execution in detail on Linux?

I am trying to debug the performance of my program. What would be ideal is to have a way to see in detail when the thread was doing useful work, when it was blocked by page faults, when it was executing memory writes and reads, etc.
I would simply like to have a detailed understanding of what's going on. Is it possible?
The Linux kernel sources come with the perf tool, which can measure a large number of performance counters, including all of those you listed. It can print statistics about them, annotate symbols, instructions and source lines with them (if debug symbols are available), and can track any process or logical CPU core.
Your Linux distribution will probably have the tool in a standalone package. Some hardening options of the kernel may limit what information root or non-root users can collect with it.
You can use perf and visualize the perf output file graphically with hotspot.

How to collect some readable stack traces with perf?

I want to profile C++ program on Linux using random sampling that is described in this answer:
However, if you're in a hurry and you can manually interrupt your
program under the debugger while it's being subjectively slow, there's
a simple way to find performance problems.
The problem is that I can't use the gdb debugger because I want to profile in production under heavy load, and a debugger is too intrusive and considerably slows down the program. However, I can use perf record and perf report to find bottlenecks without affecting program performance. Is there a way to collect a number of readable (gdb-like) stack traces with perf instead of gdb?
perf does offer callstack recording with three different techniques:
By default it uses the frame pointer (fp). This is generally supported and performs well, but it doesn't work with certain optimizations. Compile your applications with -fno-omit-frame-pointer etc. to make sure it works well.
dwarf uses a dump of the stack for each sample for post-processing. That has a significant performance penalty.
Modern systems can use the hardware-supported last branch record, lbr.
The stack is accessible in perf analysis tools such as perf report or perf script.
For more details check out man perf-record.
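As a rough sketch of how the three techniques map onto command lines, the snippet below builds a perf record invocation with --call-graph set to fp, dwarf or lbr, and a perf script invocation to print the recorded stacks. It only prints the commands. The PID 1234, the 10-second window and the helper names are hypothetical.

```python
import shlex

# The three call-stack recording techniques named in the answer.
CALL_GRAPH_MODES = ("fp", "dwarf", "lbr")


def record_stacks_cmd(pid, mode="fp", seconds=10):
    """Sample a running process with call stacks for a fixed number of seconds."""
    if mode not in CALL_GRAPH_MODES:
        raise ValueError(f"unknown call-graph mode: {mode!r}")
    # 'sleep N' only bounds the recording time; the samples come from the given PID.
    return ["perf", "record", "--call-graph", mode, "-p", str(pid),
            "--", "sleep", str(seconds)]


def dump_stacks_cmd(perf_data="perf.data"):
    """Print every recorded sample, including its stack, as text."""
    return ["perf", "script", "-i", perf_data]


if __name__ == "__main__":
    # Dry run with a hypothetical PID.
    print(shlex.join(record_stacks_cmd(1234, "dwarf")))
    print(shlex.join(dump_stacks_cmd()))
```

The text output of perf script is the closest equivalent to a series of gdb backtraces: one stack per sample.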

Is there a way to find performance of individual functions in a process using perf tool?

I am trying to get performance of individual functions within a process. How can I do it using perf tool? Is there any other tool for this?
For example, let's say, main function calls functions A , B , C . I want to get performance of main function as well as functions A,B,C individually .
Is there a good document for understanding perf source code?
Thank you.
What you want to do is user-land probing. Perf can only do part of it.
Try sudo perf top -p [pid] and then watch the scoreboard. It will show the list of functions sorted by CPU usage. Here is a snapshot of redis during a benchmark:
If you want more information about your user-land functions, such as IO usage, latency, or memory usage, I strongly suggest you use SystemTap. It is both a scripting language and a tool for profiling programs on Linux kernel-based operating systems. Here is a tutorial about it:
http://qqibrow.github.io/performance-profiling-with-systemtap/
And you don't need to be an expert in SystemTap scripting; there are many good scripts available online.
For example, here is one that measures the latency distribution of a specific function:
https://github.com/openresty/stapxx#func-latency-distr
See the Perforator tool, which is built for this: https://github.com/zyedidia/perforator.
Perforator uses the same perf_event_open API that perf uses, but also uses ptrace so that profiling can be selectively enabled only for certain regions of a program (such as functions). See the examples in the GitHub repository for details.
perf is documented at https://perf.wiki.kernel.org/index.php/Main_Page with a tutorial at https://perf.wiki.kernel.org/index.php/Tutorial
perf report gives the breakdown by "command", see https://perf.wiki.kernel.org/index.php/Tutorial#Sample_analysis_with_perf_report. perf annotate provides a way to select what commands to report, see "Source level analysis with perf annotate" in https://perf.wiki.kernel.org/index.php/Tutorial#Options_controlling_output_2.

Open perf.data in Kcachegrind

I read somewhere that it is possible to convert perf.data (output from the Linux perf record profiling tool) to a format that kcachegrind can parse/plot. However, I didn't find an application capable of doing this conversion, nor does kcachegrind open perf.data.
Is this possible: use kcachegrind to see perf output? Which tool can I use?
There are two approaches for converting perf data to callgrind format, but it's unclear which of them is more mature.
The one with more recent commits is called perfgrind and can be found at https://github.com/ostash/perfgrind
However, it is stated to lack callgraph support, and commits came to a halt after announcement of a patch for the 2nd tool on the kernel mailing list, see lkml.org/lkml/2013/3/27/535.
The 2nd tool https://github.com/vitillo/perf approaches direct integration into the perf command, but has not yet seen an official release.
At least the perf 3.10.0 I tried does not support the proposed 'perf convert' syntax.

Profiling partial programs in Linux

I have a program in which a significant amount of time is spent loading and saving data. Now I want to know how much time each function is taking as a percentage of the total running time. However, I want to exclude the time taken by the loading and saving functions from the total time considered by the profiler. Is there any way to do so using gprof or any other popular profiler?
Similarly you can use
valgrind --tool=callgrind --collect-atstart=no --toggle-collect=<function>
Other options to look at:
--instr-atstart=no # to avoid runtime overhead while not profiling
To get instruction-level stats:
--collect-jumps=yes
--dump-instr=yes
Alternatively you can 'remote control' it on the fly: callgrind_control or annotate your source code (IIRC also with branch predictions stats): callgrind_annotate.
The excellent tool kcachegrind is a marvellous visualization/navigation tool. I can hardly recommend it enough.
I would consider using something more modern than gprof, such as OProfile. When generating a report using opreport you can use the --exclude-symbols option to exclude functions you are not interested in.
See the OProfile webpage for more details; however for a quick start guide see the OProfile docs page.
Zoom from RotateRight offers a system-wide time profile for Linux. If your code spends a lot of time in i/o, then that time won't show up in a time profile of the CPUs. Alternatively, if you want to account for time spent in i/o, try the "thread time profile".
For a simple, basic solution, you might want to log data to a CSV file,
e.g. in the format [functionKey,timeStamp\n],
then load that up in Excel. Get the deltas, and then include or exclude functions with IF formulas. Nothing fancy. On the upside, you could get some visualisations fairly cheaply.
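The same delta calculation can be sketched without a spreadsheet. The snippet below assumes each row is written when a function starts, that the time until the next row belongs to that function, and that the last row is only an end marker. The function keys (load, compute, save, end) and the helper name are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def percent_per_function(csv_text, exclude=()):
    """Return each function's share (in percent) of the non-excluded time."""
    # Each row is functionKey,timeStamp; blank rows are skipped.
    rows = [(row[0], float(row[1]))
            for row in csv.reader(io.StringIO(csv_text)) if row]
    totals = defaultdict(float)
    # The delta between consecutive rows is charged to the earlier row's key.
    for (key, start), (_next_key, end) in zip(rows, rows[1:]):
        if key not in exclude:
            totals[key] += end - start
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {key: 100.0 * t / grand_total for key, t in totals.items()}


if __name__ == "__main__":
    log = "load,0.0\ncompute,2.0\nsave,5.0\nend,6.0\n"
    print(percent_per_function(log))
    print(percent_per_function(log, exclude={"load", "save"}))
```

Excluding the load and save keys removes their time from the denominator as well, which is what the question asks for.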

Resources