How to measure how long file-related system calls are taking in perf? - linux

I have an application which is running unusually slow. I have reason to suspect that this may be because some specific files are slower to read from and stat.
I know I can use strace to get timing information, but I'd prefer to use perf because the overhead is so much lower.
Brendan Gregg has some helpful examples of using perf:
http://www.brendangregg.com/perf.html
But his first system call example only shows aggregate timings across all calls and I need per-file timings. His second system call example doesn't include timings, and only shows fd numbers, not real paths.
How do I piece this together? I'm looking to get something similar to what I would get from “strace -Tttt -e trace=file”.
Added bonus would be if I can measure cycles at the same time.

The output of strace -Tttt -e trace=file is a trace (a timestamped log of events), while perf is usually used as a profiling tool (aggregating time spent in functions by name: perf record, sampling by default on time or on hardware events, plus perf report or perf top) or as a statistics tool (perf stat mode, counting events over the whole run of the program).
You should try a dedicated tracing tool (trace-cmd, sysdig, lttng, ...), or you can use perf in its tracing mode (perf record with tracepoints, then perf script to output a non-aggregated log over time).
There are tracepoints for syscalls (several hundred of them; perf list shows the full set):
http://www.brendangregg.com/perf.html#Tracepoints
http://www.brendangregg.com/perf.html#StaticKernelTracing
perf record -e 'syscalls:sys_*' ./program
perf script
Example of the output; the format is: process_name, pid, [cpu_core], time_since_boot_seconds.microseconds:, tracepoint_name, tracepoint_arguments
ls 19178 [001] 16529.466566: syscalls:sys_enter_open: filename: 0x7f6ae4c38f82, flags: 0x00080000, mode: 0x
ls 19178 [001] 16529.466570: syscalls:sys_exit_open: 0x4
ls 19178 [001] 16529.466570: syscalls:sys_enter_newfstat: fd: 0x00000004, statbuf: 0x7ffe22df92f0
ls 19178 [001] 16529.466572: syscalls:sys_exit_newfstat: 0x0
ls 19178 [001] 16529.466573: syscalls:sys_enter_mmap: addr: 0x00000000, len: 0x00042e6f, prot: 0x00000001,
ls 19178 [001] 16529.466767: syscalls:sys_exit_mmap: 0x7f6ae4df8000
ls 19178 [001] 16529.466768: syscalls:sys_enter_close: fd: 0x00000004
ls 19178 [001] 16529.466769: syscalls:sys_exit_close: 0x0
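Because each call appears as a sys_enter/sys_exit pair with timestamps, per-call durations can be computed by pairing those lines in post-processing. A minimal sketch (assuming the default perf script field layout shown above, where field 2 is the pid and field 4 the timestamp; field positions may vary between perf versions):
perf script | awk '
  # remember the enter timestamp for this pid (strip the trailing colon)
  $5 == "syscalls:sys_enter_open:" { sub(/:$/, "", $4); t[$2] = $4 }
  # on the matching exit, print the elapsed time
  $5 == "syscalls:sys_exit_open:"  { sub(/:$/, "", $4)
                                     if ($2 in t) printf "pid %s: open() took %.6f s\n", $2, $4 - t[$2] }'
The same pairing works for stat and read; to attribute times to real paths you still need to resolve the filename pointer or fd, which is where the tools mentioned below help.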
You may also try perf record -g -e 'syscalls:sys_*' to record call stacks (to see which function of the application made the syscall, and who called that function; this works better with debug information).
perf as a tracing tool may not fully decode syscall (tracepoint) arguments; tracing tools like trace-cmd or lttng may decode them better (at least the file name in the open syscall).
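For example, a rough trace-cmd equivalent of the perf commands above (a sketch, assuming trace-cmd is installed; -F restricts tracing to the launched task):
trace-cmd record -F -e 'syscalls:sys_enter_open' -e 'syscalls:sys_exit_open' ./program
trace-cmd report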

Related

Measure LLC/L3 Cache Miss Rate on AMD Zen2 CPU

I have a question related to this one.
I want to (programmatically) measure L3 hits (accesses) and misses on an AMD EPYC 7742 CPU (Zen2). I am running Linux kernel 5.4.0-66-generic on Ubuntu Server 20.04.2 LTS. According to the question linked above, the events rFF04 (L3LookupState) and r0106 (L3CombClstrState) should represent L3 accesses and misses, respectively. Furthermore, kernel 5.4 should support these events.
However, when measuring with perf, I run into issues. Similar to the question linked above, if I run numactl -C 0 -m 0 perf stat -e instructions,cycles,r0106,rFF04 ./benchmark, all counts are 0. If I try numactl -C 0 -m 0 perf stat -e instructions,cycles,amd_l3/r8001/,amd_l3/r0106/, perf complains about "unknown terms". If I use the perf event names, i.e. numactl -C 0 -m 0 perf stat -e instructions,cycles,l3_request_g1.caching_l3_cache_accesses,l3_comb_clstr_state.request_miss, perf outputs <not supported> for these events.
Furthermore, I actually want to measure this using perf's C API. Currently, I set up a perf_event_attr with type PERF_TYPE_RAW and config set to, e.g., 0x8001. How do I get the amd_l3 PMU into my perf_event_attr object? Otherwise it would be equivalent to numactl -C 0 -m 0 perf stat -e instructions,cycles,r0106,rFF04 ./benchmark, which measures undefined values.
Thank you so much for your help.
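A hedged note on the C API part: dynamic PMUs such as amd_l3 export their type number through sysfs, and perf_event_attr.type should be set to that number instead of PERF_TYPE_RAW, with the raw event code going into config. A minimal sketch:
# read the PMU type number the kernel assigned to amd_l3 (the value varies per system)
cat /sys/bus/event_source/devices/amd_l3/type
Use the integer this prints as the .type field, together with, e.g., .config = 0xFF04.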

Why does perf fail to collect any samples?

sudo perf top shows "Events: 0 cycles".
sudo perf record -ag sleep 10 shows
[ perf record: Woken up 1 time to write data ]
[ perf record: Captured and wrote 0.154 MB perf.data (~6725 samples) ]
However, sudo perf report shows "The perf.data file has no samples!". I also checked the recorded perf.data and confirmed there are no samples in it.
The system is "3.2.0-86-virtual #123-Ubuntu SMP Sun Jun 14 18:25:12 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux".
perf version 3.2.69
Inputs are appreciated.
There may be no real samples on an idle virtualized system (your Linux kernel version has a "-virtual" suffix), or there may be no access to the hardware counters (-e cycles) that perf uses by default.
Try profiling some real application, for example:
echo '2^2345678%2'| sudo perf record /usr/bin/bc
Also check software counters like -e cpu-clock:
echo '2^2345678%2'| sudo perf record -e cpu-clock /usr/bin/bc
You may try perf stat (or perf stat -d) with the same example to find which basic counters actually increment on your system:
echo '2^2345678%2'| sudo perf stat /usr/bin/bc
About "(~6725 samples)" output - perf record doesn't not count samples in its output, it just estimates their count but this estimation is always wrong. There is some fixed part of any perf.data file without any sample, it may use tens of kb in system-wide mode; and estimation incorrectly counts this part as containing some events of mean length.

Identify other end of a unix domain socket connection

I'm trying to figure out what process is holding the other end of a unix domain socket. In some strace output I've identified a given file descriptor which is involved in the problem I'm currently debugging, and I'd like to know which process is on the other end of that. As there are multiple connections to that socket, simply going by path name won't work.
lsof provides me with the following information:
dbus-daem 4175 mvg 10u unix 0xffff8803e256d9c0 0t0 12828 #/tmp/dbus-LyGToFzlcG
So I know some address (“kernel address”?), I know some socket number, and I know the path. I can find that same information in other places:
$ netstat -n | grep 12828
unix 3 [ ] STREAM CONNECTED 12828 #/tmp/dbus-LyGToFzlcG
$ grep -E '12828|ffff8803e256d9c0' /proc/net/unix
ffff8803e256d9c0: 00000003 00000000 00000000 0001 03 12828 #/tmp/dbus-LyGToFzlcG
$ ls -l /proc/*/fd/* 2>/dev/null | grep 12828
lrwx------ 1 mvg users 64 10. Aug 09:08 /proc/4175/fd/10 -> socket:[12828]
However, none of this tells me what the other end of my socket connection is. How can I tell which process is holding the other end?
Similar questions have been asked on Server Fault and Unix & Linux. The accepted answer there is that this information is not reliably available to user space on Linux.
A common suggestion is to look at adjacent socket numbers, but ls -l /proc/*/fd/* 2>/dev/null | grep 1282[79] gave no results here. Perhaps adjacent lines in the netstat output can be used; there seemed to be a pattern of connections with and without an associated socket name. But I'd like some kind of certainty, not just guesswork.
One answer suggests a tool which appears to be able to address this by digging through kernel structures. Using that option requires debug information for the kernel, as generated by the CONFIG_DEBUG_INFO option and provided as a separate package by some distributions. Based on that answer, using the address provided by lsof, the following solution worked for me:
# gdb /usr/src/linux/vmlinux /proc/kcore
(gdb) p ((struct unix_sock*)0xffff8803e256d9c0)->peer
This will print the address of the other end of the connection. Grepping lsof -U for that number will provide details like the process id and the file descriptor number.
If debug information is not available, it might still be possible to reach the required information by knowing the offset of the peer member within the unix_sock structure. In my case, on Linux 3.5.0 for x86_64, the following can be used to compute the same address without relying on debugging symbols:
(gdb) p ((void**)0xffff8803e256d9c0)[0x52]
I won't make any guarantees about how portable that solution is.
Update: It's been possible to do this using actual interfaces for a while now. Starting with Linux 3.3, the UNIX_DIAG feature provides a netlink-based API for this information, and lsof 4.89 and later support it. See https://unix.stackexchange.com/a/190606/1820 for more information.
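For example, ss from iproute2 queries UNIX_DIAG directly, so something like the following shows the peer of a given unix socket without any kernel debug info (a sketch; exact output columns vary by version, and lsof 4.89+ exposes the same data via its +E endpoint option):
ss -xp | grep 12828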

How to monitor number of syscalls executed by kernel?

I need to monitor the number of system calls executed by Linux.
I'm aware that vmstat can show this for BSD and AIX systems, but not for Linux (according to the man page).
Is there any counter in /proc? Or is there any other way to monitor it?
I wrote a simple SystemTap script (based on syscalls_by_pid.stp).
It produces output like this:
ProcessName #SysCalls
munin-graph 38609
munin-cron 8160
fping 4502
check_http_demo 2584
check_nrpe 2045
sh 1836
nagios 886
sendmail 747
smokeping 649
check_http 571
check_nt 376
pcscd 216
ping 108
check_ping 100
crond 87
stapio 69
init 56
syslog-ng 27
sshd 17
ntpd 9
hp-asrd 8
hald-addon-stor 7
automount 6
httpd 4
stap 3
flow-capture 2
gam_server 2
Total 61686
The script itself:
#! /usr/bin/env stap
#
# Print the system call count by process name in descending order.
#
global syscalls
probe begin {
    print ("Collecting data... Type Ctrl-C to exit and display results\n")
}
probe syscall.* {
    syscalls[execname()]++
}
probe end {
    printf ("%-20s %-s\n\n", "ProcessName", "#SysCalls")
    summary = 0
    foreach (procname in syscalls-) {
        printf("%-20s %-10d\n", procname, syscalls[procname])
        summary = summary + syscalls[procname]
    }
    printf ("\n%-20s %-d\n", "Total", summary)
}
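To run it (a usage note: this assumes SystemTap and matching kernel debug info are installed; the file name is arbitrary):
sudo stap syscall_count.stp
Press Ctrl-C to stop collecting and print the table.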
You can use ptrace, as Jeff Foster said, to trace the system calls.
Also, you can use strace and ltrace
strace - trace system calls and signals
ltrace - A library call tracer
You can use ptrace to monitor all syscalls (see here)
I believe OProfile can do this.
I am not aware of a centralized way to monitor syscalls throughout the entire OS. Maybe do a ptrace on the init process and follow all children? But I don't know if that will work.
Your best bet is to write a patch to the kernel itself to do this. The closest thing to this that I've seen is a cgroup implementation for enforcing permissions on what syscalls can be executed at runtime. You can find the patch here:
https://github.com/luksow/syscalls-cgroup
It shouldn't be too much more work to throw a counter in there, from a kernel programming perspective.
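As a lighter-weight alternative, on kernels that provide the raw_syscalls tracepoints, perf can count syscalls system-wide without any patching; a minimal sketch:
sudo perf stat -e raw_syscalls:sys_enter -a sleep 10
This counts every syscall entry on all CPUs during the 10-second sleep.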

What is the significance of the numbers in the name of the flush processes for newer linux kernels?

I am running kernel 2.6.33.7.
Previously, I was running v2.6.18.x. On 2.6.18, the flush processes were named pdflush.
After upgrading to 2.6.33.7, the flush processes have names of the form "flush-<major>:<minor>".
For example, currently I see flush process "flush-8:32" popping up in top.
In doing a google search to try to determine an answer to this question, I saw examples of "flush-8:38", "flush-8:64" and "flush-253:0" just to name a few.
I understand what the flush process itself does, my question is what is the significance of the numbers on the end of the process name? What do they represent?
Thanks
Those are the device numbers (major:minor) used to identify block devices. A kernel thread may be spawned to handle a particular device.
(On one of my systems, block devices are currently numbered as shown below. They may change from boot to boot or from hotplug to hotplug.)
$ grep ^ /sys/class/block/*/dev
/sys/class/block/dm-0/dev:254:0
/sys/class/block/dm-1/dev:254:1
/sys/class/block/dm-2/dev:254:2
/sys/class/block/dm-3/dev:254:3
/sys/class/block/dm-4/dev:254:4
/sys/class/block/dm-5/dev:254:5
/sys/class/block/dm-6/dev:254:6
/sys/class/block/dm-7/dev:254:7
/sys/class/block/dm-8/dev:254:8
/sys/class/block/dm-9/dev:254:9
/sys/class/block/loop0/dev:7:0
/sys/class/block/loop1/dev:7:1
/sys/class/block/loop2/dev:7:2
/sys/class/block/loop3/dev:7:3
/sys/class/block/loop4/dev:7:4
/sys/class/block/loop5/dev:7:5
/sys/class/block/loop6/dev:7:6
/sys/class/block/loop7/dev:7:7
/sys/class/block/md0/dev:9:0
/sys/class/block/md1/dev:9:1
/sys/class/block/sda/dev:8:0
/sys/class/block/sda1/dev:8:1
/sys/class/block/sda2/dev:8:2
/sys/class/block/sdb/dev:8:16
/sys/class/block/sdb1/dev:8:17
/sys/class/block/sdb2/dev:8:18
/sys/class/block/sdc/dev:8:32
/sys/class/block/sdc1/dev:8:33
/sys/class/block/sdc2/dev:8:34
/sys/class/block/sdd/dev:8:48
/sys/class/block/sdd1/dev:8:49
/sys/class/block/sdd2/dev:8:50
/sys/class/block/sde/dev:8:64
/sys/class/block/sdf/dev:8:80
/sys/class/block/sdg/dev:8:96
/sys/class/block/sdh/dev:8:112
/sys/class/block/sdi/dev:8:128
/sys/class/block/sr0/dev:11:0
/sys/class/block/sr1/dev:11:1
/sys/class/block/sr2/dev:11:2
You should also be able to figure this out by searching for those numbers in /proc/self/mountinfo, e.g.:
$ grep 8:32 /proc/self/mountinfo
25 22 8:32 / /var rw,relatime - ext4 /dev/mapper/sysvg-var rw,barrier=1,data=ordered
This has the side benefit of working with NFS as well:
$ grep 0:73 /proc/self/mountinfo
108 42 0:73 /foo /mnt/foo rw,relatime - nfs host.domain.com:/volume/path rw, ...
Note, the data I included here is fabricated, but the mechanism works just fine.
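On systems with util-linux, lsblk shows the same major:minor numbers next to the device names, which makes the lookup quicker (for example, 8:32 in the listing above maps to sdc):
lsblk -o NAME,MAJ:MIN,MOUNTPOINT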
