How do you get debugging symbols working in the Linux perf tool inside Docker containers?

I am using Docker containers based on the "ubuntu" image and cannot get the Linux perf tool to display debugging symbols.
Here is what I'm doing to demonstrate the problem.
First I start a container, here with an interactive shell.
$ docker run -t -i ubuntu:14.04 /bin/bash
Then, from the container prompt, I install the Linux perf tool.
$ apt-get update
$ apt-get install -y linux-tools-common linux-tools-generic linux-tools-`uname -r`
I can now use the perf tool. My kernel is 3.16.0-77-generic.
Now I'll install gcc, compile a test program, and try to run it under perf record.
$ apt-get install -y gcc
I paste the test program into test.c:
#include <stdio.h>
int function(int i) {
    int j;
    for (j = 2; j <= i / 2; j++) {
        if (i % j == 0) {
            return 0;
        }
    }
    return 1;
}
int main() {
    int i;
    for (i = 2; i < 100000; i++) {
        if (function(i)) {
            printf("%d\n", i);
        }
    }
}
Then compile, run, and report:
$ gcc -g -O0 test.c && perf record ./a.out && perf report
The output looks something like this:
72.38% a.out a.out [.] 0x0000000000000544
8.37% a.out a.out [.] 0x000000000000055a
8.30% a.out a.out [.] 0x000000000000053d
7.81% a.out a.out [.] 0x0000000000000551
0.40% a.out a.out [.] 0x0000000000000540
This does not have symbols, even though the executable does have symbol information.
Doing the same general steps outside the container works fine, and shows something like this:
96.96% a.out a.out [.] function
0.35% a.out libc-2.19.so [.] _IO_file_xsputn@@GLIBC_2.2.5
0.14% a.out [kernel.kallsyms] [k] update_curr
0.12% a.out [kernel.kallsyms] [k] update_cfs_shares
0.11% a.out [kernel.kallsyms] [k] _raw_spin_lock_irqsave
On the host system I have already made kernel symbol addresses visible by becoming root and running:
$ echo 0 > /proc/sys/kernel/kptr_restrict
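Equivalently (assuming the sysctl utility is installed), the same setting can be applied with:
$ sysctl -w kernel.kptr_restrict=0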
How do I get the containerized version to work properly and show debugging symbols?

Running the container with the -v /:/host flag and running perf report inside the container with the --symfs /host flag fixes it:
96.59% a.out a.out [.] function
2.93% a.out [kernel.kallsyms] [k] 0xffffffff8105144a
0.13% a.out [nvidia] [k] 0x00000000002eda57
0.11% a.out libc-2.19.so [.] vfprintf
0.11% a.out libc-2.19.so [.] 0x0000000000049980
0.09% a.out a.out [.] main
0.02% a.out libc-2.19.so [.] _IO_file_write
0.02% a.out libc-2.19.so [.] write
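For concreteness, the full sequence might look like this (a sketch that just combines the question's own commands with the extra flags; adjust paths to taste):
$ docker run -t -i -v /:/host ubuntu:14.04 /bin/bash
Then, inside the container (after installing perf and gcc as above):
$ gcc -g -O0 test.c && perf record ./a.out && perf report --symfs /host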
Why doesn't it work as is? The output from perf script sheds some light on this:
...
a.out 24 3374818.880960: cycles: ffffffff81141140 __perf_event__output_id_sample ([kernel.kallsyms])
a.out 24 3374818.881012: cycles: ffffffff817319fd _raw_spin_lock_irqsave ([kernel.kallsyms])
a.out 24 3374818.882217: cycles: ffffffff8109aba3 ttwu_do_activate.constprop.75 ([kernel.kallsyms])
a.out 24 3374818.884071: cycles: 40053d [unknown] (/var/lib/docker/aufs/diff/9bd2d4389cf7ad185405245b1f5c7d24d461bd565757880bfb4f970d3f4f7915/a.out)
a.out 24 3374818.885329: cycles: 400544 [unknown] (/var/lib/docker/aufs/diff/9bd2d4389cf7ad185405245b1f5c7d24d461bd565757880bfb4f970d3f4f7915/a.out)
...
Note the /var/lib/docker/aufs path. That path belongs to the host, so it won't exist inside the container, and you need to help perf report locate the binary. This likely happens because the mmap events are tracked by perf outside of any cgroup and perf does not attempt to remap the paths.
Another option is to run perf host-side, like sudo perf record -a docker run -ti <container name>. The collection has to be system-wide here (the -a flag) because containers are spawned by the Docker daemon, which is not in the process hierarchy of the docker client tool we run here.
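A hedged sketch of that host-side variant (adding -g for call graphs and the final report step are my own additions):
$ sudo perf record -a -g -- docker run --rm -ti <image> <command>
$ sudo perf report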

Another way that doesn't require changing how you run the container (so you can profile an already running process) is to mount the container's root filesystem on the host using bindfs:
bindfs /proc/$(docker inspect --format {{.State.Pid}} $CONTAINER_ID)/root /foo
Then run perf report as perf report --symfs /foo
You'll have to run perf record system-wide, but you can restrict it to collect events only for the specific container:
perf record -g -a -F 100 -e cpu-clock -G docker/$(docker inspect --format {{.Id}} $CONTAINER_ID) sleep 90
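Putting it together, an end-to-end session might look roughly like this (a sketch; /foo and the 90-second window are just the placeholders used above, and the final fusermount call assumes bindfs's usual FUSE mount):
$ sudo bindfs /proc/$(docker inspect --format {{.State.Pid}} $CONTAINER_ID)/root /foo
$ sudo perf record -g -a -F 100 -e cpu-clock -G docker/$(docker inspect --format {{.Id}} $CONTAINER_ID) sleep 90
$ sudo perf report --symfs /foo
$ sudo fusermount -u /foo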

Related

How to Trace Rescheduling Interrupts

I would like to find out how to trace rescheduling interrupts.
I noticed that my application had some involuntary context switches, and cat /proc/interrupts showed that rescheduling interrupts had occurred.
I suspected that this had nothing to do with my application, so I created a dummy application:
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cstdint>
int main(int argc, char** argv) {
    srand(time(nullptr));
    int32_t i = 0;
    while (i != rand()) i = rand() * -1;
    printf("%d", i);
}
Which basically never exits.
I compiled it:
/opt/rh/devtoolset-8/root/usr/bin/g++ -Wall dummy.cpp -o a -march=native -mtune=native -O0 -fno-omit-frame-pointer -std=c++17
Then I ran taskset -c 5 ./a and checked cat /proc/interrupts every second or so. I notice that the rescheduling interrupt count increases by 1-2 every second, which I don't expect. The local timer interrupt count also increases by 1 every second, which is expected. I have already isolated core 5 via a boot parameter.
On another machine, the rescheduling interrupt count only increases by 1 every 30 minutes or so.
Hence, I am looking for a generic way to track down this kind of interrupt issue so that I can reapply the same methodology in the future for different kinds of unexpected interrupts.
My kernel version:
# uname -a
Linux localhost 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
I have also tried taskset -c 4 perf record -e cycles:pp -C 5 -g --freq=12000 to record the call graph, but for some reason the call graph of the kernel functions wasn't created, only the userspace one.
Update 1:
Following @osgx's suggestion, I ran
taskset -c 4 perf record -e irq_vectors:reschedule_entry -c 1 --call-graph dwarf,4096 taskset -c 5 ./a
And then perf report --call-graph
- 100.00% 100.00% vector=253
- 76.47% _start
__libc_start_main
- main
- 64.71% rand
- 58.82% __random
- 29.41% 0xffffffffb298e06a
0xffffffffb2991864
0xffffffffb22a4565
0xffffffffb222f675
- 0xffffffffb299042c
- 23.53% 0xffffffffb22a41e5
0xffffffffb298f8da
0xffffffffb2258ec2
- 5.88% 0xffffffffb298f8da
kernel-debuginfo, kernel-debuginfo-common, glibc-debuginfo and glibc-debuginfo-common have been installed, and -fno-omit-frame-pointer was specified when compiling. I'm not sure why addresses show up in the report instead of symbols. Any ideas?
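One more data point, though this is just a guess on my part: kernel addresses often stay unresolved when kernel pointer restriction is enabled, so it may be worth checking these sysctls too:
# cat /proc/sys/kernel/kptr_restrict
# cat /proc/sys/kernel/perf_event_paranoid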

Cannot set kernel dynamic debug on Linux?

I have already seen Cannot enable kernel dynamic debugging on linux and https://www.kernel.org/doc/html/v4.11/admin-guide/dynamic-debug-howto.html.
I have rebuilt the Raspbian 9 kernel with CONFIG_DYNAMIC_DEBUG and booted into it; the file /sys/kernel/debug/dynamic_debug/control exists and is populated with 2k+ dynamic debug rule statements:
pi@raspberrypi:~ $ sudo ls -la /sys/kernel/debug/dynamic_debug/control
-rw-r--r-- 1 root root 0 Jan 1 1970 /sys/kernel/debug/dynamic_debug/control
pi@raspberrypi:~ $ sudo cat /sys/kernel/debug/dynamic_debug/control | wc -l
2358
pi@raspberrypi:~ $ sudo grep 'snd_device' /sys/kernel/debug/dynamic_debug/control
sound/core/device.c:132 [snd]snd_device_disconnect =_ "device disconnect %p (from %pS), not found\012"
sound/core/device.c:156 [snd]snd_device_free =_ "device free %p (from %pS), not found\012"
Ok, so I want to trace the is_connected_output_ep function, which is in sound/soc/soc-dapm.c. So I do this:
pi@raspberrypi:~ $ sudo bash -c "echo -n 'func is_connected_output_ep +p' > /sys/kernel/debug/dynamic_debug/control"
pi@raspberrypi:~ $ sudo cat /sys/kernel/debug/dynamic_debug/control | grep is_conn
pi@raspberrypi:~ $
pi@raspberrypi:~ $ sudo bash -c "echo 'file sound/soc/soc-dapm.c line 1175 +p' > /sys/kernel/debug/dynamic_debug/control"
pi@raspberrypi:~ $ sudo cat /sys/kernel/debug/dynamic_debug/control | grep dapm
pi@raspberrypi:~ $
... and I get no errors, but seemingly nothing "sticks" (and indeed, I don't see this function being traced either).
The documentation says that +p does:
p enables the pr_debug() callsite.
I'm not sure what they mean by this - does it mean that if there are already existing pr_debug statements in the function, then they will be enabled (i.e. will print to syslog)? If so, what happens when there are no such statements in the function, as is the case with is_connected_output_ep? Can I still set up dynamic debug to somehow trace this function, without having to manually insert printk or other statements and recompile the kernel module?
Well, I did some more reading, and it seems the answer to:
does it mean that if there are already existing pr_debug statements in the function, then they will be enabled (i.e. will print to syslog) with this?
... is likely "yes" - so you cannot do dynamic debug of a function that does not have pr_debug statements in it already.
Also, it seems that the /sys/kernel/debug/dynamic_debug/control (upon read) is actually a list of all possible dynamic debug "probe points" if you will, along with their status (enabled or not), though I'm not sure about this.
Anyways, here is some more reading where this stuff is mentioned:
The dynamic debugging interface [LWN.net] 2011
Dynamic Debug, conference paper, 2009
So I cannot trace is_connected_output_ep with dynamic debug; maybe I should look into the ftrace or kprobes (dynamic probes) facilities of the Linux kernel instead...
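As a sketch of the ftrace route (assuming debugfs is mounted at /sys/kernel/debug and the kernel was built with the function tracer), something like the following would at least confirm that is_connected_output_ep gets called:
$ sudo bash -c "echo is_connected_output_ep > /sys/kernel/debug/tracing/set_ftrace_filter"
$ sudo bash -c "echo function > /sys/kernel/debug/tracing/current_tracer"
$ sudo cat /sys/kernel/debug/tracing/trace_pipe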
EDIT: It turns out that dynamic_debug/control lists debuggable statements ONLY from modules currently loaded in the kernel! For example, there is a dev_dbg in the dpcm_path_get function in the soc-pcm.c source file, which ends up in the snd_soc_core kernel module (snd-soc-core.ko). This module is not loaded by default on Raspbian 9, so we get this:
pi@raspberrypi:~ $ lsmod | grep snd
snd_bcm2835 32768 1
snd_pcm 98304 1 snd_bcm2835
snd_timer 32768 1 snd_pcm
snd 69632 5 snd_timer,snd_bcm2835,snd_pcm
pi@raspberrypi:~ $ sudo grep 'soc-pcm' /sys/kernel/debug/dynamic_debug/control
pi@raspberrypi:~ $
OK, now if the kernel module is loaded with modprobe, the debuggable callsites suddenly appear in dynamic_debug/control:
pi@raspberrypi:~ $ sudo modprobe snd_soc_core
pi@raspberrypi:~ $ lsmod | grep snd
snd_soc_core 200704 0
snd_compress 20480 1 snd_soc_core
snd_pcm_dmaengine 16384 1 snd_soc_core
snd_bcm2835 32768 1
snd_pcm 98304 3 snd_pcm_dmaengine,snd_bcm2835,snd_soc_core
snd_timer 32768 1 snd_pcm
snd 69632 7 snd_compress,snd_timer,snd_bcm2835,snd_soc_core,snd_pcm
pi@raspberrypi:~ $ sudo grep 'soc-pcm' /sys/kernel/debug/dynamic_debug/control
sound/soc/soc-pcm.c:1367 [snd_soc_core]dpcm_prune_paths =_ "ASoC: pruning %s BE %s for %s\012"
sound/soc/soc-pcm.c:1373 [snd_soc_core]dpcm_prune_paths =_ "ASoC: found %d old BE paths for pruning\012"
...
pi@raspberrypi:~ $ sudo grep 'dpcm_path_get' /sys/kernel/debug/dynamic_debug/control
sound/soc/soc-pcm.c:1331 [snd_soc_core]dpcm_path_get =_ "ASoC: found %d audio %s paths\012"
And finally, we can now enable this print statement:
pi@raspberrypi:~ $ sudo bash -c "echo 'func dpcm_path_get +p' > /sys/kernel/debug/dynamic_debug/control"
pi@raspberrypi:~ $ sudo grep 'dpcm_path_get' /sys/kernel/debug/dynamic_debug/control
sound/soc/soc-pcm.c:1331 [snd_soc_core]dpcm_path_get =p "ASoC: found %d audio %s paths\012"
Apparently, the disabled lines are marked with =_ and the enabled lines with =p ...
Now all I'd want is to enable some statements before the driver is loaded, so that I could monitor printouts in the _probe functions of kernel module drivers...
You can add the following argument to insmod:
dyndbg="+p"
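A hedged example based on the dynamic-debug howto (the module and function names are just the ones used above, and the exact syntax may vary by kernel version). Pass it on the modprobe command line:
$ sudo modprobe snd_soc_core dyndbg=+p
or persistently via a modprobe configuration file:
options snd_soc_core dyndbg=+p
or, to catch the module's early/probe-time messages, on the kernel command line:
snd_soc_core.dyndbg="func dpcm_path_get +p"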

FreeBSD: lldb crashes even on hello.c

On FreeBSD I started to play around with LLDB, but it crashes right at the start.
user@host ~/sandbox % rake hello
cc -I/usr/local/include -g -O0 -o hello.o -c hello.c
cc -Wl,-L/usr/local/lib -o hello hello.o
user@host ~/sandbox % lldb
(lldb) target create hello
Current executable set to 'hello' (i386).
(lldb) source list
8 {
9 printf( "Hello, world!\n");
10 return 0;
11 }
12
(lldb) breakpoint set -f hello.c -l 9
Breakpoint 1: where = hello`main + 31 at hello.c:9, address = 0x080485af
(lldb) process launch
Process 2409 launching
Process 2409 stopped
(lldb) Process 2409 launched: '/usr/home/user/sandbox/hello' (i386)
Process 2409 stopped
* thread #1: tid = 100224, 0x0818188f, stop reason = hardware error
frame #0: 0x0818188f
-> 0x818188f: addb %al, (%eax)
0x8181891: addb %al, (%eax)
0x8181893: addb %al, (%eax)
0x8181895: addb %al, (%eax)
(lldb)
It is the same on three machines.
I have also tried gdb on Linux, and there everything worked fine.
What did I do wrong?
Thanks in advance,
Bertram
LLDB doesn't support the FreeBSD/i386 host for now. Use a recent gdb from ports or switch to amd64.

What does perf's option to measure events at user and kernel levels mean?

The Linux perf tool provides access to CPU event counters. It lets you specify the events to be counted and when to count those events.
https://perf.wiki.kernel.org/index.php/Tutorial
By default, events are measured at both user and kernel levels:
perf stat -e cycles dd if=/dev/zero of=/dev/null count=100000
To measure only at the user level, it is necessary to pass a modifier:
perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000
To measure both user and kernel (explicitly):
perf stat -e cycles:uk dd if=/dev/zero of=/dev/null count=100000
From this, I expected that cycles:u meant "only count events while running non-kernel code" and that recorded counts would not map to kernel symbols, but that doesn't seem to be the case.
Here's an example:
perf record -e cycles:u du -sh ~
[...]
perf report --stdio -i perf.data
[...]
9.24% du [kernel.kallsyms] [k] system_call
[...]
0.70% du [kernel.kallsyms] [k] page_fault
[...]
If I do the same but use cycles:uk, then I do get more kernel symbols reported, so the event modifiers do have an effect. Using cycles:k produces reports with almost exclusively kernel symbols, but it does include a few libc symbols.
What's going on here? Is this the expected behavior? Am I misunderstanding the language used in the linked document?
The linked document also includes this table which uses slightly different descriptions if that helps:
Modifiers | Description | Example
----------+--------------------------------------+----------
u | monitor at priv level 3, 2, 1 (user) | event:u
k | monitor at priv level 0 (kernel) | event:k
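For reference, one way to see the split directly is to count user-only, kernel-only, and combined cycles as separate events in a single run (my own sketch; the exact counts will of course vary):
perf stat -e cycles:u,cycles:k,cycles dd if=/dev/zero of=/dev/null count=100000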
Edit: more info:
CPU is an Intel Haswell. The specific model is an i7-5820K.
Distro is up to date Arch Linux (rolling release schedule) with kernel 4.1.6.
The version of perf itself is 4.2.0.
Edit2:
More output from example runs. As you can see, cycles:u mostly reports non-kernel symbols. I know that perf sometimes mis-attributes counts to a neighboring instruction when you look at the annotated assembly output. Maybe this is related?
cycles:u
# perf record -e cycles:u du -sh ~
179G /home/khouli
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.116 MB perf.data (2755 samples) ]
# sudo perf report --stdio -i perf.data
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 2K of event 'cycles:u'
# Event count (approx.): 661835375
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ..............................
#
11.02% du libc-2.22.so [.] _int_malloc
9.73% du libc-2.22.so [.] _int_free
9.24% du du [.] fts_read
9.23% du [kernel.kallsyms] [k] system_call
4.17% du libc-2.22.so [.] strlen
4.17% du libc-2.22.so [.] __memmove_sse2
3.47% du libc-2.22.so [.] __readdir64
3.33% du libc-2.22.so [.] malloc_consolidate
2.87% du libc-2.22.so [.] malloc
1.83% du libc-2.22.so [.] msort_with_tmp.part.0
1.63% du libc-2.22.so [.] __memcpy_avx_unaligned
1.63% du libc-2.22.so [.] __getdents64
1.52% du libc-2.22.so [.] free
1.47% du libc-2.22.so [.] __memmove_avx_unaligned
1.44% du du [.] 0x000000000000e609
1.41% du libc-2.22.so [.] _wordcopy_bwd_dest_aligned
1.19% du du [.] 0x000000000000e644
0.93% du libc-2.22.so [.] __fxstatat64
0.85% du libc-2.22.so [.] do_fcntl
0.73% du [kernel.kallsyms] [k] page_fault
[lots more symbols, almost all in du...]
cycles:uk
# perf record -e cycles:uk du -sh ~
179G /home/khouli
[ perf record: Woken up 1 times to write data ]
[ext4] with build id 0f47443e26a238299e8a5963737da23dd3530376 not found,
continuing without symbols
[ perf record: Captured and wrote 0.120 MB perf.data (2856 samples) ]
# perf report --stdio -i perf.data
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 2K of event 'cycles:uk'
# Event count (approx.): 3118065867
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ..............................................
#
13.80% du [kernel.kallsyms] [k] __d_lookup_rcu
6.16% du [kernel.kallsyms] [k] security_inode_getattr
2.52% du [kernel.kallsyms] [k] str2hashbuf_signed
2.43% du [kernel.kallsyms] [k] system_call
2.35% du [kernel.kallsyms] [k] half_md4_transform
2.31% du [kernel.kallsyms] [k] ext4_htree_store_dirent
1.97% du [kernel.kallsyms] [k] copy_user_enhanced_fast_string
1.96% du libc-2.22.so [.] _int_malloc
1.93% du du [.] fts_read
1.90% du [kernel.kallsyms] [k] system_call_after_swapgs
1.83% du libc-2.22.so [.] _int_free
1.44% du [kernel.kallsyms] [k] link_path_walk
1.33% du libc-2.22.so [.] __memmove_sse2
1.19% du [kernel.kallsyms] [k] _raw_spin_lock
1.19% du [kernel.kallsyms] [k] __fget_light
1.12% du [kernel.kallsyms] [k] kmem_cache_alloc
1.12% du [kernel.kallsyms] [k] __ext4_check_dir_entry
1.05% du [kernel.kallsyms] [k] lockref_get_not_dead
1.02% du [kernel.kallsyms] [k] generic_fillattr
0.95% du [kernel.kallsyms] [k] do_dentry_open
0.95% du [kernel.kallsyms] [k] path_init
0.95% du [kernel.kallsyms] [k] lockref_put_return
0.91% du libc-2.22.so [.] do_fcntl
0.91% du [kernel.kallsyms] [k] ext4_getattr
0.91% du [kernel.kallsyms] [k] rb_insert_color
0.88% du [kernel.kallsyms] [k] __kmalloc
0.88% du libc-2.22.so [.] __readdir64
0.88% du libc-2.22.so [.] malloc
0.84% du [kernel.kallsyms] [k] ext4fs_dirhash
0.84% du [kernel.kallsyms] [k] __slab_free
0.84% du [kernel.kallsyms] [k] in_group_p
0.81% du [kernel.kallsyms] [k] get_empty_filp
0.77% du libc-2.22.so [.] malloc_consolidate
[more...]
cycles:k
# perf record -e cycles:k du -sh ~
179G /home/khouli
[ perf record: Woken up 1 times to write data ]
[ext4] with build id 0f47443e26a238299e8a5963737da23dd3530376 not found, continuing
without symbols
[ perf record: Captured and wrote 0.118 MB perf.data (2816 samples) ]
# perf report --stdio -i perf.data
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 2K of event 'cycles:k'
# Event count (approx.): 2438426748
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ..............................................
#
17.11% du [kernel.kallsyms] [k] __d_lookup_rcu
6.97% du [kernel.kallsyms] [k] security_inode_getattr
4.22% du [kernel.kallsyms] [k] half_md4_transform
3.10% du [kernel.kallsyms] [k] str2hashbuf_signed
3.01% du [kernel.kallsyms] [k] system_call_after_swapgs
2.59% du [kernel.kallsyms] [k] ext4_htree_store_dirent
2.24% du [kernel.kallsyms] [k] copy_user_enhanced_fast_string
2.14% du [kernel.kallsyms] [k] lockref_get_not_dead
1.86% du [kernel.kallsyms] [k] ext4_getattr
1.85% du [kernel.kallsyms] [k] kfree
1.68% du [kernel.kallsyms] [k] __ext4_check_dir_entry
1.53% du [kernel.kallsyms] [k] __fget_light
1.34% du [kernel.kallsyms] [k] link_path_walk
1.34% du [kernel.kallsyms] [k] path_init
1.22% du [kernel.kallsyms] [k] __kmalloc
1.22% du [kernel.kallsyms] [k] kmem_cache_alloc
1.14% du [kernel.kallsyms] [k] do_dentry_open
1.11% du [kernel.kallsyms] [k] ext4_readdir
1.07% du [kernel.kallsyms] [k] __find_get_block_slow
1.07% du libc-2.22.so [.] do_fcntl
1.04% du [kernel.kallsyms] [k] _raw_spin_lock
0.99% du [kernel.kallsyms] [k] _raw_read_lock
0.95% du libc-2.22.so [.] __fxstatat64
0.94% du [kernel.kallsyms] [k] rb_insert_color
0.94% du [kernel.kallsyms] [k] generic_fillattr
0.93% du [kernel.kallsyms] [k] ext4fs_dirhash
0.93% du [kernel.kallsyms] [k] find_get_entry
0.89% du [kernel.kallsyms] [k] rb_next
0.89% du [kernel.kallsyms] [k] is_dx_dir
0.89% du [kernel.kallsyms] [k] in_group_p
0.89% du [kernel.kallsyms] [k] cp_new_stat
[more...]
perf_event_paranoid
$ cat /proc/sys/kernel/perf_event_paranoid
1
kernel config for perf
$ cat /proc/config.gz | gunzip | grep -A70 'Kernel Perf'
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_SYSTEM_TRUSTED_KEYRING=y
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_OPROFILE=m
# CONFIG_OPROFILE_EVENT_MULTIPLEX is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
# CONFIG_HAVE_64BIT_ALIGNED_ACCESS is not set
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_CLK=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_NONE is not set
# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
CONFIG_CC_STACKPROTECTOR_STRONG=y
CONFIG_HAVE_CONTEXT_TRACKING=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
I understand your question to be: Why does perf for user mode recording show values from inside the kernel? Well, it's doing exactly what it's supposed to do, from a "system accounting" standpoint.
You did: perf record -e cycles:u du -sh ~ and you got stats on system_call and page_fault and you're wondering why that happened?
When you did the du, it had to traverse the file system. In doing so, it issued system calls for things it needed (e.g. open, readdir, etc.). du initiated these things for you, so it got "charged back" for them. Likewise, du page faulted a number of times.
perf is keeping track of any activity caused by a given process/program, even if it happens inside kernel address space. In other words, the program requested the activity, and the kernel performed it at the program's behest, so it gets charged appropriately. The kernel had to do "real work" to do FS I/O and/or resolve page faults, so you must "pay for the work you commissioned". Anything that a given program does that consumes system resources gets accounted for.
This is the standard accounting model for computer systems, dating back to the 1960's when people actually rented out time on mainframe computers. You got charged for everything you did [just like a lawyer :-)], directly or indirectly.
* charge per minute of connect time
* charge per cpu cycle consumed in user program
* charge per cpu cycle executed for program in kernel space
* charge for each network packet sent/received
* charge for any page fault caused by your program
* charge for each disk block read/written, either to a file or to the paging/swap disk
At the end of the month, they mailed you an itemized bill [just like a utility bill], and you had to pay:
Real money.
Note that there are some things that will not be charged for. For example, let's assume your program is compute bound but does not do [much] I/O and uses a relatively small amount of memory (i.e. it does not cause a page fault on its own). The program will get charged for user space CPU usage.
The OS may have to swap out (i.e. steal) one or more of your pages to make room for some other memory hog program. After the hog runs, your program will run again. Your program will need to fault back in the page or pages that were stolen from it.
Your program will not be charged for these because your program did not cause the page fault. In other words, for every page "stolen" from you, you're given a "credit" for that page when your program has to fault it back in.
Also, when trying to run a different process, the kernel does not charge the CPU time consumed by its process scheduler to any process. This is considered "overhead" and/or standard operating costs. For example, if you have a checking account with a bank, they don't charge you for the upkeep costs on the local branch office that you visit.
So perf, while useful for measuring performance, uses an accounting model to get the data.
It's like a car. You can drive your car to the store and pick up something, and you will consume some gasoline. Or, you can ask a friend to drive your car to the store. In either case, you have to pay for the gasoline because you drove the car, or because [when the friend drove the car] the gasoline was consumed when doing something for you. In this case, the kernel is your friend :-)
UPDATE:
My source for this is the source [kernel source]. And, I've been doing kernel programming for 40 years.
There are two basic types of perf counters. The "macro" ones, such as page faults and syscall counts, are events the kernel itself can generate.
The other type is the "micro" or "nano" kind. These come from the x86 PMC architecture, and have counters for things like "cache miss", "branch mispredict", "data fetch mispredict", etc. that the kernel can't compute on its own.
The PMC counters just free run. That's why you get your global stats, regardless of what recording mode you're doing. The kernel can interrogate them periodically, but it can't get control every time a PMC is incremented. Want the global/system-wide and/or per-CPU values for these? Just execute the appropriate RDPMC instruction.
To keep track of a PMC for a process: when the process starts running, do RDPMC and save the value in the task struct [for as many counters as are marked "of interest"] as the "PMC value at start". When the given CPU core is rescheduled, the scheduler computes the "next" task, reads the current PMC value, takes the difference between it and the value it stored in the "old" task block when it started that task, and bumps that task's "total count" for that PMC. The current value then becomes the new task's "PMC value at start".
In Linux, when a task/context switch occurs, it generates two perf events, one for "entering new task on cpu X" and one for "stopping old task on cpu X".
Your question was why monitoring for "user mode" produced kernel addresses. That's because the recording machinery (which lives in the kernel, not in the perf program) stores the temporary data [as mentioned above] in the current task/context block until a task switch actually occurs.
The key thing to note is that this context does not change simply because a syscall was executed; it changes only when a context switch occurs. For example, the gettimeofday syscall just gets the wall clock time and returns it to user space. It does not do a context switch, so any perf event it kicks off will be charged to the active/current context. It doesn't matter whether it comes from kernel space or user space.
As a further example, suppose the process does a file read syscall. In traversing the file handle data, inode, etc., it may generate several perf events. It will also probably generate a few more cache misses and other PMC counter bumps. If the desired block is already in the FS block cache, the syscall will just do a copy_to_user and then reenter user space. There is no expensive context switch and none of the above PMC difference calculations, because the pmc_value_at_start is still valid.
One of the reasons that it's done this way is performance [of the perf mechanism]. If you did the PMC save/restore immediately upon crossing to kernel space after a syscall starts [to separate kernel stats from user stats for a given process, as you'd like], the overhead would be enormous. You wouldn't be performance measuring the base kernel. You'd be performance measuring the kernel + a lot of perf overhead.
When I had to do performance analysis of a commercial hard realtime system based on Linux, I developed my own performance logging system. The system had 8 CPU cores, interacting with multiple custom hardware boards on the PCIe bus with multiple FPGAs. The FPGAs also had custom firmware running inside a Microblaze. Event logs from user space, kernel space, and microblaze could all be time coordinated to nanosecond resolution and the time to store an event record was 70ns.
To me, Linux's perf mechanism is a bit crude and bloated. If one were to use it to try to troubleshoot a performance/timing bug that involved race conditions, possible lock/unlock problems, etc., it might be problematic. That is, run the system without perf and you get the bug. Turn on perf, and you don't, because you've changed the fundamental characteristic timing of the system. Turn perf off, and the timing bug reappears.
What's going on here? Is this the expected behavior? Am I misunderstanding the language used in the linked document?
There is a wide difference between the kernel and processor stated in the link and the ones being used for this evaluation.
The introduction section of https://perf.wiki.kernel.org/index.php/Tutorial states that "Output was obtained on a Ubuntu 11.04 system with kernel 2.6.38-8-generic results running on an HP 6710b with dual-core Intel Core2 T7100 CPU", whereas the current evaluation is on an Intel Haswell (i7-5820K, 6 cores) running the Arch Linux distro with kernel 4.1.6.
One option to rule out a difference between behavior and documentation is to test on a system with a configuration equivalent to the one mentioned in the introduction section of that tutorial.

How to make profilers (valgrind, perf, pprof) pick up / use local version of library with debugging symbols when using mpirun?

Edit: added the important note that this is about debugging an MPI application.
The system-installed shared library doesn't have debugging symbols:
$ readelf -S /usr/lib64/libfftw3.so | grep debug
$
I have therefore compiled and installed my own version in my home directory, with debugging enabled (--with-debug CFLAGS=-g):
$ readelf -S ~/lib64/libfftw3.so | grep debug
[26] .debug_aranges PROGBITS 0000000000000000 001d3902
[27] .debug_pubnames PROGBITS 0000000000000000 001d8552
[28] .debug_info PROGBITS 0000000000000000 001ddebd
[29] .debug_abbrev PROGBITS 0000000000000000 003e221c
[30] .debug_line PROGBITS 0000000000000000 00414306
[31] .debug_str PROGBITS 0000000000000000 0044aa23
[32] .debug_loc PROGBITS 0000000000000000 004514de
[33] .debug_ranges PROGBITS 0000000000000000 0046bc82
I have set both LD_LIBRARY_PATH and LD_RUN_PATH to include ~/lib64 first, and running ldd on the program confirms that the local version of the library should be used:
$ ldd a.out | grep fftw
libfftw3.so.3 => /home/narebski/lib64/libfftw3.so.3 (0x00007f2ed9a98000)
The program in question is a parallel numerical application using MPI (Message Passing Interface). Therefore, to run this application one must use the mpirun wrapper (e.g. mpirun -np 1 valgrind --tool=callgrind ./a.out). I use the OpenMPI implementation.
Nevertheless, various profilers (the callgrind tool in Valgrind, the CPU profiler in google-perftools, and perf) don't find those debugging symbols, resulting in more or less useless output:
callgrind:
$ callgrind_annotate --include=~/prog/src --inclusive=no --tree=none
[...]
--------------------------------------------------------------------------------
Ir file:function
--------------------------------------------------------------------------------
32,765,904,336 ???:0x000000000014e500 [/usr/lib64/libfftw3.so.3.2.4]
31,342,886,912 /home/narebski/prog/src/nonlinearity.F90:__nonlinearity_MOD_calc_nonlinearity_kxky [/home/narebski/prog/bin/a.out]
30,288,261,120 /home/narebski/gene11/src/axpy.F90:__axpy_MOD_axpy_ij [/home/narebski/prog/bin/a.out]
23,429,390,736 ???:0x00000000000fc5e0 [/usr/lib64/libfftw3.so.3.2.4]
17,851,018,186 ???:0x00000000000fdb80 [/usr/lib64/libmpi.so.1.0.1]
google-perftools:
$ pprof --text a.out prog.prof
Total: 8401 samples
842 10.0% 10.0% 842 10.0% 00007f200522d5f0
619 7.4% 17.4% 5025 59.8% calc_nonlinearity_kxky
517 6.2% 23.5% 517 6.2% axpy_ij
427 5.1% 28.6% 3156 37.6% nl_to_direct_xy
307 3.7% 32.3% 1234 14.7% nl_to_fourier_xy_1d
perf events:
$ perf report --sort comm,dso,symbol
# Events: 80K cycles
#
# Overhead Command Shared Object Symbol
# ........ ....... .................... ............................................
#
32.42% a.out libfftw3.so.3.2.4 [.] fdc4c
16.25% a.out 7fddcd97bb22 [.] 7fddcd97bb22
7.51% a.out libatlas.so.0.0.0 [.] ATL_dcopy_xp1yp1aXbX
6.98% a.out a.out [.] __nonlinearity_MOD_calc_nonlinearity_kxky
5.82% a.out a.out [.] __axpy_MOD_axpy_ij
Edit Added 11-07-2011:
I don't know if it is important, but:
$ file /usr/lib64/libfftw3.so.3.2.4
/usr/lib64/libfftw3.so.3.2.4: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped
and
$ file ~/lib64/libfftw3.so.3.2.4
/home/narebski/lib64/libfftw3.so.3.2.4: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, not stripped
If /usr/lib64/libfftw3.so.3.2.4 is listed in callgrind output, then your LD_LIBRARY_PATH=~/lib64 had no effect.
Try again with export LD_LIBRARY_PATH=$HOME/lib64. Also watch out for any shell scripts you invoke, which might reset your environment.
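A quick sanity check (a sketch of mine; it assumes env is available on the launch node) is to see what environment the MPI-launched processes actually get:
$ mpirun -np 1 env | grep LD_LIBRARY_PATH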
You and Employed Russian are almost certainly right; the mpirun script is messing things up here. Two options:
Most x86 MPI implementations, as a practical matter, treat just running the executable
./a.out
the same as
mpirun -np 1 ./a.out.
They don't have to do this, but OpenMPI certainly does, as do MPICH2 and IntelMPI. So if you can do the debugging serially, you should just be able to run
valgrind --tool=callgrind ./a.out.
However, if you do want to run with mpirun, the issue is probably that your ~/.bashrc
(or whatever) is being sourced, undoing your changes to LD_LIBRARY_PATH etc. The easiest fix is to temporarily put your changed environment variables in your ~/.bashrc for the duration of the run.
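Alternatively (a sketch, assuming OpenMPI's mpirun), you can forward specific environment variables to the launched processes with the -x option:
mpirun -x LD_LIBRARY_PATH -np 1 valgrind --tool=callgrind ./a.out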
The way recent profiling tools typically handle this situation is to consult an external, matching non-stripped version of the library.
On Debian-based Linux distros this is typically done by installing the -dbg-suffixed version of a package; on Red Hat-based distros they are named -debuginfo.
In the case of the tools you mentioned above, they will typically Just Work (tm) and find the debug symbols for a library if the debug info package has been installed in the standard location.
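A hedged example (the exact package names are assumptions and vary by distro and release):
$ sudo apt-get install libfftw3-dbg          # Debian/Ubuntu
$ sudo debuginfo-install fftw                # RHEL/CentOS/Fedora, from yum-utils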
