How to collect readable stack traces with perf?

I want to profile a C++ program on Linux using the random sampling described in this answer:
However, if you're in a hurry and you can manually interrupt your
program under the debugger while it's being subjectively slow, there's
a simple way to find performance problems.
The problem is that I can't use the gdb debugger: I want to profile in production under heavy load, and a debugger is too intrusive and slows the program down considerably. However, I can use perf record and perf report to find bottlenecks without affecting program performance. Is there a way to collect a number of readable (gdb-like) stack traces with perf instead of gdb?

perf offers call stack recording with three different techniques:
By default it uses the frame pointer (fp). This is generally supported and performs well, but it doesn't work with certain optimizations. Compile your application with -fno-omit-frame-pointer etc. to make sure it works well.
dwarf uses a dump of the stack for each sample for post-processing. That has a significant performance penalty.
Modern systems can use the hardware-supported last branch record, lbr.
The stack is accessible in perf analysis tools such as perf report or perf script.
For more details check out man perf-record.
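A minimal sketch of how the three modes map onto a command line, assuming a POSIX shell. The function name perf_record_cmd, the 99 Hz rate and the 10 second window are illustrative choices, not anything perf prescribes. The function only prints the command so it can be checked before it is run against a production process:

```shell
# Print (not run) a perf record command for one of the three call-graph
# modes described above. perf_record_cmd is a hypothetical helper name.
perf_record_cmd() {
    mode=$1   # fp | dwarf | lbr
    pid=$2    # PID of the already running target process
    case "$mode" in
        fp|dwarf|lbr) ;;
        *) echo "unknown call-graph mode: $mode" >&2; return 1 ;;
    esac
    # -F 99: sample at 99 Hz; -p: attach to an existing process;
    # "-- sleep 10" ends the recording after ten seconds.
    echo "perf record -F 99 --call-graph $mode -p $pid -- sleep 10"
}
```

For example, perf_record_cmd dwarf 1234 prints the dwarf variant for PID 1234. After a recording finishes, perf script prints each sample followed by its call chain, one frame per line, which is probably the closest perf gets to a series of gdb-style backtraces.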

Related

Is there a way to see what happens during program execution in detail on Linux?

I am trying to debug the performance of my program. Ideally, I would like to see in detail when a thread was doing useful work, when it was blocked by page faults, when it was executing memory writes and reads, etc.
I would simply like to have a detailed understanding of what's going on. Is it possible?
The Linux kernel sources come with the perf tool, which can measure a large number of performance counters, including all of those you listed. It can print statistics about them, annotate symbols, instructions and source lines with them (if debug symbols are available), and track any process or logical CPU core.
Your Linux distribution probably ships the tool as a standalone package. Some kernel hardening options may limit what information root or non-root users can collect with it.
You can use perf and visualize the perf output file graphically with hotspot.

Using /proc/*/stat for profiling

On Linux, a process' (main thread's) last program-counter value is presented in /proc/$PID/stat. This seems to be a really simple and easy way to do some sampled profiling without having to instrument a program in any way whatsoever.
I'm wondering if this has any caveats when it comes to the sampling quality, however. I'm assuming this value is updated whenever the process runs out of its timeslice, which should happen at completely random intervals in the program code, and that samples taken at more than time-slice length should be uniformly randomly distributed according to where the program actually spends its time. But that's just an assumption, and I realize it could be wrong in any number of ways.
Does anyone know?
Why not try a modern built-in Linux tool like perf (https://perf.wiki.kernel.org/index.php/Main_Page)?
It has a record mode with adjustable sampling frequency (-F100 for 100 Hz) and many events. For example, it can sample on the software event task-clock without using hardware performance counters (stop perf with Ctrl-C, or append sleep 10 to the command to sample for 10 seconds):
perf record -p $PID -e task-clock -o perf.output.file
perf works for all threads without any instrumentation (no recompilation or code editing) and will not interfere with program execution (only the timer interrupt is slightly modified). (There is also some support for stack trace sampling with the -g option.)
Output can be parsed offline with perf report (only this command will try to parse the binary and shared libraries):
perf report -i perf.output.file
or converted to raw PC (EIP) samples with perf script -i perf.output.file.
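As a rough sketch of what can be done with those raw samples: the helper below turns "address symbol" lines into a per-symbol hit count. It assumes the input comes from something like perf script -i perf.output.file -F ip,sym, which I would expect to print one hex address and one symbol per line. The name symbol_histogram is made up for illustration.

```shell
# Count how often each symbol occurs in "ip sym" lines read from stdin,
# most frequent first. symbol_histogram is a hypothetical helper name.
symbol_histogram() {
    # Strip the leading hex address (keeps symbols that contain spaces,
    # such as demangled C++ names), drop empty lines, then count.
    sed -e 's/^[[:space:]]*[0-9a-fA-F]*[[:space:]]*//' -e '/^$/d' |
        sort | uniq -c | sort -rn
}
```

Usage would look like perf script -i perf.output.file -F ip,sym | symbol_histogram | head. The symbols at the top are where the samples landed most often.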
PS: The EIP pointer in the /proc/$pid/stat file is mentioned in the official Linux man page proc(5), http://man7.org/linux/man-pages/man5/proc.5.html, as kstkeip - "The current EIP (instruction pointer)." It is read in fs/proc/array.c:do_task_stat as eip = KSTK_EIP(task);, but I'm not sure where and when it is filled in. It may be written on a task switch (both involuntary, when the timeslice ends, and voluntary, when the task does something like sched_yield) or on blocking syscalls, so it is probably not the best choice as a sampling source.
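For anyone who still wants to experiment with the /proc approach, here is a minimal sketch of pulling kstkeip out of a stat line, assuming sh or bash (it relies on word splitting). proc(5) lists kstkeip as field 30. The function name kstkeip_from_stat is invented for this example.

```shell
# Extract kstkeip (field 30 in proc(5)) from one /proc/PID/stat line.
# Field 2 (comm) is parenthesised and may itself contain spaces or ')',
# so everything up to the LAST ") " is dropped before splitting.
kstkeip_from_stat() {
    rest=${1##*') '}    # remainder now starts at field 3 (state)
    set -- $rest        # split the remaining fields into $1, $2, ...
    echo "${28}"        # field 30 overall is the 28th of the remainder
}
```

It would be called as kstkeip_from_stat "$(cat /proc/$PID/stat)" in a loop with a short sleep. As far as I know, recent kernels report this field as 0 for running processes for security reasons, so a constant 0 most likely means the approach is not available on that system.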
If it works, which it could, it will have the shortcomings of prof, which gprof was supposed to remedy. Then gprof has its own shortcomings, which have led to numerous more modern profilers. Some of us consider this to be the most effective, and it can be accomplished with a tool as simple as pstack or lsstack.

Is there a way to find performance of individual functions in a process using perf tool?

I am trying to get the performance of individual functions within a process. How can I do it using the perf tool? Is there any other tool for this?
For example, let's say the main function calls functions A, B and C. I want to get the performance of the main function as well as of functions A, B and C individually.
Is there a good document for understanding the perf source code?
Thank you.
What you want to do is user-land probing. Perf can only do part of it.
Try sudo perf top -p [pid] and then watch the scoreboard. It will show the list of functions sorted by CPU usage. Here is a snapshot of redis during a benchmark:
If you want more information about your user-land functions, such as IO usage, latency or memory usage, I strongly suggest using Systemtap. It is both a scripting language and a tool for profiling programs on Linux kernel-based operating systems. Here is a tutorial about it:
http://qqibrow.github.io/performance-profiling-with-systemtap/
You don't need to be an expert in Systemtap scripting; there are many good scripts available online.
For example, here is one that measures the latency distribution of a specific function:
https://github.com/openresty/stapxx#func-latency-distr
See the Perforator tool, which is built for this: https://github.com/zyedidia/perforator.
Perforator uses the same perf_event_open API that perf uses, but also uses ptrace so that profiling can be selectively enabled only for certain regions of a program (such as functions). See the examples in the GitHub repository for details.
perf is documented at https://perf.wiki.kernel.org/index.php/Main_Page with a tutorial at https://perf.wiki.kernel.org/index.php/Tutorial
perf report gives the breakdown by "command", see https://perf.wiki.kernel.org/index.php/Tutorial#Sample_analysis_with_perf_report. perf annotate provides a way to select what commands to report, see "Source level analysis with perf annotate" in https://perf.wiki.kernel.org/index.php/Tutorial#Options_controlling_output_2.
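To get the figure for a single function such as A out of the text report, something like the sketch below might do. It assumes the report was produced with perf report --stdio --sort sym, that header lines start with '#', and that each data line begins with a percentage and ends with the symbol name. The helper name overhead_of is hypothetical, and symbols containing spaces (some demangled C++ names) are not handled.

```shell
# Print the first (overhead) column for one symbol from the text output
# of `perf report --stdio --sort sym` read on stdin.
# overhead_of is a hypothetical helper name.
overhead_of() {
    # Skip '#' header lines; match on the last whitespace-separated field.
    awk -v sym="$1" '!/^#/ && $NF == sym { print $1 }'
}
```

Usage would be perf report -i perf.data --stdio --sort sym | overhead_of A. If the profile was recorded with call graphs (-g), I would expect the report to show Children and Self columns, in which case the first column is the inclusive figure, which is what you want for main.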

Is there any profiler that works with -fomit-frame-pointer on x86_64?

SysProf doesn't properly generate call stack without it, GProf isn't accurate at all. And also, are profilers that work without -fno-omit-frame-pointer as accurate as those that rely on it?
Recent versions of Linux perf can be used (with --call-graph dwarf):
perf record -F99 --call-graph dwarf myapp
It uses .eh_frame (or .debug_frame) with libunwind to unwind the stack.
In my experience, it sometimes gets lost.
With a recent version of perf and the kernel on Haswell, you might be able to use the Last Branch Record with --call-graph lbr.
There are none that I'm aware of. With frame pointers, walking a stack is a fairly simple exercise. You simply dereference the frame pointer to find the old frame pointer, stack pointer, and instruction pointer, and repeat until you're done. Without frame pointers you cannot reliably walk a stack without additional information, which on ELF platforms generally means DWARF CFI. DWARF is fairly complex to parse, and requires you to read in a fair amount of additional information which is tricky to do in the time constraints that profilers need to work in.
One plausible method for implementing this would be to simply save the stack memory at every sample and then walk it offline using the CFI to unwind properly. Depending on the depth of the stack this could require quite a bit of storage, and the copying could be prohibitive. I've never heard of a profiler using this technique, but Julian Seward floated it as a potential implementation strategy for Firefox's built-in profiler.
It would be hard for most profilers to work when -fomit-frame-pointer is asserted. You probably need to not use that and to link against debugging versions of the libraries (which are almost certainly compiled without -fomit-frame-pointer) if you want to do reasonable profiling.

How to profile program on Linux platform without rebuilding?

I've used two profiling tools (VTune on Windows and dbx (within Sun Studio) on Solaris) which can profile programs without rebuilding them, and during profiling the program runs at the same speed as normal. These two features saved me a lot of time.
Now I want to know whether there are free tools available on Linux that can do the same thing. I think I need a sampling-based profiler. VTune is good but expensive... I've heard of gprof and valgrind. But it seems gprof needs to instrument the program (so we have to rebuild it), and valgrind slows down program execution quite a lot. (According to valgrind's introduction, Cachegrind runs programs about 20--100x slower than normal, and Callgrind, which is what I would need for profiling, is based on Cachegrind.)
For profiling, I just need to figure out the execution time of function calls so I can find out where the performance degradation happens. I don't need as much low-level profiling information as Cachegrind provides...
oprofile is pretty good, but it can be difficult to set up. It also doesn't require you to rebuild your program.
Agreeing with Paul, I think Zoom is probably the best Linux profiler you can pay for.
However, for real results, I rely on this simple method, that I've been using since before profilers were invented.
Performance Counters for Linux is a new tool usable on kernels 2.6.31 and later; it's less intrusive (to both the program and the system as a whole) than valgrind or OProfile.
A nicer option than oprofile is Zoom. It's similar to Shark on Mac OS X, if you have ever used that. It's commercial ($199) but you can get a free trial from www.rotateright.com.
