I am trying to debug an issue on my Linux server:
Every 2 minutes an IGMP query arrives that I want to find the source of.
From the Wireshark capture I can see the time of the query:
**1488124556.773784** IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
This timestamp I can convert to wall-clock time without any problem:
**[root@server ~]# date -d @1488124556.773784
Sun Feb 26 10:55:56 EST 2017**
But when I try to trace the Linux processes with the perf command:
perf trace -a -T -o trace.out
In the trace.out:
**15471296674.961** ( 0.023 ms): qemu-kvm/6621 ioctl(fd: 12<anon_inode:kvm-vcpu>, cmd: 0xae80 ) = 0
**15471296674.979** ( 0.011 ms): qemu-kvm/6621 ioctl(fd: 12<anon_inode:kvm-vcpu>, cmd: 0xae80 ) = 0
**15471296674.986** ( 0.004 ms): qemu-kvm/6621 ioctl(fd: 6<anon_inode:kvm-vm>, cmd: 0xc008ae67, arg: 0x7fbba66bf9d0
15471296674.990 ( 0.002 ms): qemu-kvm/6621 ioctl(fd: 6<anon_inode:kvm-vm>, cmd: 0xc008ae67, arg: 0x7fbba66bf9f0 ) = 0
**15471296675.002** ( 0.010 ms): qemu-kvm/6621 ioctl(fd: 12<anon_inode:kvm-vcpu>, cmd: 0xae80
**15471296675.009** ( 0.003 ms): qemu-kvm/6621 ioctl(fd: 6<anon_inode:kvm-vm>, cmd: 0xc008ae67, arg: 0x7fbba66bf9f0
**15471296675.021** ( 0.010 ms): qemu-kvm/6621 ioctl(fd: 12<anon_inode:kvm-vcpu>, cmd: 0xae80 ) = 0
I found that this time is not in epoch format.
From the perf man page:
-T --time Print full timestamp rather than time relative to first sample.
My question:
Is it possible to get the same timestamp format or type in both Wireshark and the perf trace tool?
Thanks.
Short answer
Unfortunately, as of kernel 4.4, it is not possible to precisely convert perf-trace timestamps into Unix time. You can roughly match them against the time seen in /proc/uptime, but they drift apart after boot, with the difference measured in seconds or even minutes.
Long answer
perf relies on sched_clock(), which reports the number of nanoseconds since boot. For some reason perf puts the decimal point in the wrong place, so you see the number of seconds multiplied by 1000.
The problem is that the numbers reported by sched_clock() are not actually accessible from userspace. They differ from both CLOCK_MONOTONIC (visible in /proc/uptime) and CLOCK_MONOTONIC_RAW (readable e.g. via the clock_gettime syscall).
At least two patches were proposed to add kernel interfaces exposing sched_clock() to userspace; unfortunately both were rejected. I am not aware of any successful attempts since then.
Rejected kernel patches:
https://patchwork.kernel.org/patch/2273441/
https://patchwork.kernel.org/patch/3320271/
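Given the caveats above, a rough conversion is still possible: perf -T prints milliseconds since boot with a microsecond fraction, so dividing by 1000 gives sched_clock() seconds, and adding an approximate boot epoch (now minus /proc/uptime) gives a wall-clock estimate, accurate only to within the drift described above. A sketch, using a timestamp from the question:

```shell
ts=15471296674.961                                # a perf -T timestamp, in ms
# uptime in seconds (0 if /proc/uptime is unavailable, e.g. off-Linux)
up=$(awk '{ print $1 }' /proc/uptime 2>/dev/null || echo 0)
# approximate boot epoch = now - uptime; second-level accuracy is enough,
# since the sched_clock drift itself is measured in seconds
boot=$(awk -v now="$(date +%s)" -v up="$up" 'BEGIN { printf "%.3f", now - up }')
awk -v boot="$boot" -v ts="$ts" 'BEGIN { printf "wall clock ~ %.3f\n", boot + ts/1000 }'
```

This maps the perf timestamp to roughly the epoch-seconds format that Wireshark prints, but expect it to disagree by seconds to minutes.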
Related
I have a question related to this one.
I want to (programmatically) measure L3 hits (accesses) and misses on an AMD EPYC 7742 CPU (Zen 2). I am running Linux kernel 5.4.0-66-generic on Ubuntu Server 20.04.2 LTS. According to the question linked above, the events rFF04 (L3LookupState) and r0106 (L3CombClstrState) should represent L3 accesses and misses, respectively. Furthermore, kernel 5.4 should support these events.
However, when measuring with perf, I run into issues. Similar to the question linked above, if I run numactl -C 0 -m 0 perf stat -e instructions,cycles,r0106,rFF04 ./benchmark, I only measure 0 values. If I try to use numactl -C 0 -m 0 perf stat -e instructions,cycles,amd_l3/r8001/,amd_l3/r0106/, perf complains about "unknown terms". If I use the perf event names, i.e. numactl -C 0 -m 0 perf stat -e instructions,cycles,l3_request_g1.caching_l3_cache_accesses,l3_comb_clstr_state.request_miss, perf outputs <not supported> for these events.
Furthermore, I actually want to measure this using perf's C API. Currently, I set up a perf_event_attr with type PERF_TYPE_RAW and config set to, e.g., 0x8001. How do I get the amd_l3 PMU into my perf_event_attr object? Otherwise it is equivalent to numactl -C 0 -m 0 perf stat -e instructions,cycles,r0106,rFF04 ./benchmark, which measures undefined values.
Thank you so much for your help.
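One mechanism that may be relevant here (a sketch, not a verified answer; the amd_l3 path below is an assumption to check on the target machine): each dynamic PMU registered with the kernel exports an integer under sysfs, and that integer is what goes into perf_event_attr.type, while attr.config carries the raw event code itself. PERF_TYPE_RAW, by contrast, always targets the core PMU, which would explain the zero readings:

```shell
# integer to use as perf_event_attr.type for the amd_l3 PMU (AMD hosts only;
# the guard makes this a no-op elsewhere)
pmu=/sys/bus/event_source/devices/amd_l3/type
[ -r "$pmu" ] && cat "$pmu"
# the raw event code goes into perf_event_attr.config as-is
printf 'attr.config for rFF04 = %#x (%d)\n' 0xFF04 0xFF04
```

The config bit layout for a given PMU is described by the files under /sys/bus/event_source/devices/&lt;pmu&gt;/format/.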
Experiment:
I ran sleep 1 under strace -tt (which reports the timestamps of all syscalls) on the host and in a QEMU guest, and noticed that the time required to reach a certain syscall (clock_nanosleep) is almost twice as long in the guest:
1.813 ms on the host vs
3.396 ms in the guest.
Here is the full host strace -tt sleep 1 and here is the full QEMU strace -tt sleep 1.
Below are excerpts where you can already see the difference:
Host:
Time diff timestamp (as reported by strace)
0.000 / 0.653 ms: 13:13:56.452820 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffded01ecb0 /* 53 vars */) = 0
0.653 / 0.023 ms: 13:13:56.453473 brk(NULL) = 0x5617efdea000
0.676 / 0.063 ms: 13:13:56.453496 arch_prctl(0x3001 /* ARCH_??? */, 0x7fffeb7041b0) = -1 EINVAL (Invalid argument)
QEMU:
Time diff timestamp (as reported by strace)
0.000 / 1.008 ms: 12:12:03.164063 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffd0bd93e50 /* 13 vars */) = 0
1.008 / 0.119 ms: 12:12:03.165071 brk(NULL) = 0x55b78c484000
1.127 / 0.102 ms: 12:12:03.165190 arch_prctl(0x3001 /* ARCH_??? */, 0x7ffcb5dfd850) = -1 EINVAL (Invalid argument)
The questions:
What causes the slowdown and overhead? It is not using any hardware (like GPU, disks, etc.), so there are no translation layers. I also tried running the command several times to ensure that everything that can be cached is cached in the guest.
Is there a way to speed it up?
Update:
With cpupower frequency-set --governor performance the timings are:
Host: 0.922ms
Guest: 1.412ms
With image in /dev/shm (-drive file=/dev/shm/root):
Host: 0.922ms
Guest: 1.280ms
PS
I modified the "bare" output of strace so that it includes (1) time starting from 0 at the first syscall, followed by (2) the duration of the syscall, for easier understanding. For completeness, the script is here.
I started qemu in this way:
qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -m 4G -nodefaults -no-user-config -nographic -no-reboot \
-kernel $HOME/devel/vmlinuz-5.13.0-20-generic \
-append 'earlyprintk=hvc0 console=hvc0 root=/dev/sda rw' \
-drive file=$HOME/devel/images/root,if=ide,index=0,media=disk,format=raw \
-device virtio-serial,id=virtio-serial0 -chardev stdio,mux=on,id=host-io,signal=off -device virtconsole,chardev=host-io,id=console0
It turned out that my custom-built kernel was missing the CONFIG_HYPERVISOR_GUEST=y option (and a couple of nested options).
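A quick way to check for this on a running guest (a sketch; the /boot/config path is the usual Debian/Ubuntu convention and may differ for custom builds):

```shell
# look for the paravirt/KVM-guest options in the running kernel's config;
# no output means the options are absent (or the config file is elsewhere)
grep -E 'CONFIG_HYPERVISOR_GUEST|CONFIG_KVM_GUEST|CONFIG_PARAVIRT' \
     "/boot/config-$(uname -r)" 2>/dev/null || true
```

On kernels built with CONFIG_IKCONFIG_PROC, /proc/config.gz is an alternative source for the same information.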
That's expected, considering the way strace is implemented, i.e. via the ptrace(2) system call: every time the traced process performs a system call or receives a signal, it is forcefully stopped and control is passed to the tracing process, which in the case of strace does all the unpacking and printing synchronously, i.e. while keeping the traced process stopped. That's the kind of path that greatly magnifies any emulation overhead.
It would be instructive to strace strace itself -- you will see that it does not let the traced process continue (with ptrace(PTRACE_SYSCALL, ...)) until it has processed and written out everything related to the current system call.
Notice that in order to run a "trivial" sleep 1 command, the dynamic linker will perform a couple dozen system calls before even getting to the entry point of the sleep binary.
I don't think that optimizing strace is worth spending time on; if you were planning to run strace as an auditing instead of a debugging tool (by running production tasks under strace or similar), you should reconsider your designs ;-)
Running qemu on my Mac, I found that 'sleep 1' at the bash command line usually took 10 seconds while 'sleep 2' usually took 5 seconds, at least as measured by time on Arch Linux with kernel 6.0.8. Oddly, time seemed to measure the passage of time correctly while sleep was not working.
But I had been running
qemu-system-x86_64 \
-m 1G \
-nic user,hostfwd=tcp::10022-:22 \
img1.cow
Then, reading about the -icount parameter, I found that the following makes sleep pretty accurate.
qemu-system-x86_64 \
-icount shift=auto,sleep=on \
-m 1G \
-nic user,hostfwd=tcp::10022-:22 \
img1.cow
I mention it here because my search for qemu and slow sleep 1 led me here first.
I am trying to profile my userspace program on an Arria 10 FPGA board (with 2 ARM Cortex-A9 CPUs) which has PMU support. I am running Wind River Linux version 9.x. I built my kernel with almost all of the CONFIG_ options people suggested on the internet. Also, my program is compiled with the -fno-omit-frame-pointer and -g options.
What I see is that 'perf record' doesn't generate any samples at all. The 'perf stat true' output looks valid though (not sure what to make of it). Does anyone have suggestions/ideas why I am not seeing any samples being generated?
~: perf record --call-graph dwarf -- my_app
^C
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB perf.data ]
~: perf report -g graph --no-children
Error:
The perf.data file has no samples!
To display the perf.data header info, please use --header/--header-only options.
~: perf stat true
Performance counter stats for 'true':
1.095300 task-clock (msec) # 0.526 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
22 page-faults # 0.020 M/sec
1088056 cycles # 0.993 GHz
312708 instructions # 0.29 insn per cycle
29159 branches # 26.622 M/sec
16386 branch-misses # 56.20% of all branches
0.002082030 seconds time elapsed
I don't use a VM in this setup. Arria 10 is an Intel FPGA with 2 ARM CPUs that support a PMU.
Edit:
1. I realize now that the ARM CPU does have HW PMU support (contrary to what I mentioned earlier). Even with HW PMU support, I am not able to run 'perf record' successfully.
This is an old question, but for people who find this via search:
perf record -e cpu-clock <command>
works for me. The problem seems to be that the default event (cycles) is not available.
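To see which fallback events are available at all, listing the software events is a cheap first check (a sketch, guarded so it is a no-op where perf is not installed):

```shell
# software events (cpu-clock, task-clock, page-faults, ...) are implemented by
# the kernel itself, so they work even where the hardware cycles event is
# unavailable (VMs, some embedded PMUs)
command -v perf >/dev/null 2>&1 && perf list sw || true
```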
sudo perf top shows "Events: 0 cycles".
sudo perf record -ag sleep 10 shows
[ perf record: Woken up 1 time to write data ]
[ perf record: Captured and wrote 0.154 MB perf.data (~6725 samples) ]
However, sudo perf report shows "The perf.data file has no samples!". I also checked the recorded perf.data and confirmed that there are no samples in it.
The system is "3.2.0-86-virtual #123-Ubuntu SMP Sun Jun 14 18:25:12 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux".
perf version 3.2.69
Inputs are appreciated.
There may be no real samples on an idle virtualized system (your Linux kernel version has a "-virtual" suffix); or there may be no access to hardware counters (-e cycles), which are used by default.
Try to profile some real application like
echo '2^2345678%2'| sudo perf record /usr/bin/bc
Also check software counters like -e cpu-clock:
echo '2^2345678%2'| sudo perf record -e cpu-clock /usr/bin/bc
You may try perf stat (perf stat -d) with the same example to find which basic counters are actually incremented on your system:
echo '2^2345678%2'| sudo perf stat /usr/bin/bc
About the "(~6725 samples)" output: perf record doesn't count samples for its output; it just estimates their count, and this estimate is usually wrong. Every perf.data file has some fixed part that contains no samples (it may occupy tens of kB in system-wide mode), and the estimate incorrectly counts this part as if it contained events of mean length.
I'm profiling a program on Linux using the "time" command. The problem is that its output is not statistically meaningful, because it runs the program only once. Is there a tool, or a way, to get an average of several "time" runs, ideally together with statistical information such as the deviation?
Here is a script I wrote to do something similar to what you are looking for. It runs the provided command 10 times, logging the real, user CPU and system CPU times to a file, and echoing them after each command's output. It then uses awk to average each of the 3 columns in the file, but does not (yet) compute the standard deviation.
#!/bin/bash
rm -f /tmp/mtime.$$
for x in {1..10}
do
/usr/bin/time -f "real %e user %U sys %S" -a -o /tmp/mtime.$$ "$@"
tail -1 /tmp/mtime.$$
done
awk '{ et += $2; ut += $4; st += $6; count++ } END { printf "Average:\nreal %.3f user %.3f sys %.3f\n", et/count, ut/count, st/count }' /tmp/mtime.$$
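The missing standard deviation can be added with a single awk pass over the same log format; demonstrated here on synthetic timings (point it at /tmp/mtime.$$ instead to extend the script above):

```shell
# two sample lines in the script's "real X user Y sys Z" log format
log=$(mktemp)
printf 'real 1.0 user 0.5 sys 0.1\nreal 3.0 user 0.7 sys 0.3\n' > "$log"
# mean and (population) standard deviation of the "real" column ($2)
awk '{ x[NR] = $2; s += $2 }
     END { m = s / NR
           for (i in x) v += (x[i] - m) ^ 2
           printf "real mean %.3f sd %.3f\n", m, sqrt(v / NR) }' "$log"
rm -f "$log"
```

For the sample data above this prints `real mean 2.000 sd 1.000`. The same pattern extends to the user ($4) and sys ($6) columns.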
Use hyperfine.
For example:
hyperfine 'sleep 0.3'
Will run the command sleep 0.3 multiple times, then output something like this:
hyperfine 'sleep 0.3'
Benchmark #1: sleep 0.3
Time (mean ± σ): 306.7 ms ± 3.0 ms [User: 2.8 ms, System: 3.5 ms]
Range (min … max): 301.0 ms … 310.9 ms 10 runs
perf stat does this for you with the -r (--repeat=<n>) option, including average and variance.
e.g. using a short loop in awk to simulate some work, short enough that CPU frequency ramp-up and other startup overhead might be a factor (see Idiomatic way of performance evaluation?), although it seems my CPU ramped up to 3.9 GHz pretty quickly, averaging 3.82 GHz.
$ perf stat -r5 awk 'BEGIN{for(i=0;i<1000000;i++){}}'
Performance counter stats for 'awk BEGIN{for(i=0;i<1000000;i++){}}' (5 runs):
37.90 msec task-clock # 0.968 CPUs utilized ( +- 2.18% )
1 context-switches # 31.662 /sec ( +-100.00% )
0 cpu-migrations # 0.000 /sec
181 page-faults # 4.776 K/sec ( +- 0.39% )
144,802,875 cycles # 3.821 GHz ( +- 0.23% )
343,697,186 instructions # 2.37 insn per cycle ( +- 0.05% )
93,854,279 branches # 2.476 G/sec ( +- 0.04% )
29,245 branch-misses # 0.03% of all branches ( +- 12.79% )
0.03917 +- 0.00182 seconds time elapsed ( +- 4.63% )
(Scroll to the right for variance.)
You can use taskset -c3 perf stat ... to pin the task to a specific core (#3 in that case) if you have a single-threaded task and want to minimize context-switches.
By default, perf stat uses hardware perf counters to count things like instructions, core clock cycles (not the same thing as time on modern CPUs), and branch misses. This has pretty low overhead, especially with the counters in "counting" mode, unlike perf record, which triggers interrupts to statistically sample hot spots for events.
You could use -e task-clock to use just that software event, without HW perf counters. (Or, if your system is in a VM or you didn't change the default /proc/sys/kernel/perf_event_paranoid, perf might not be able to ask the kernel to program any hardware counters anyway.)
For more about perf, see
https://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Main_Page
For programs that print output, it looks like this:
$ perf stat -r5 echo hello
hello
hello
hello
hello
hello
Performance counter stats for 'echo hello' (5 runs):
0.27 msec task-clock # 0.302 CPUs utilized ( +- 4.51% )
...
0.000890 +- 0.000411 seconds time elapsed ( +- 46.21% )
For a single run (the default with no -r), perf stat shows time elapsed and user/sys time. But -r doesn't average those, for some reason.
Like the commenter above mentioned, it sounds like you may want to use a loop to run your program multiple times, to get more data points. You can use the time command with the -o option to write its results to a text file, like so:
/usr/bin/time -o output.txt myprog
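Note that -o belongs to GNU time (/usr/bin/time), not the shell's time keyword. A loop sketch building on that (using `true` as a stand-in for myprog, and guarded in case GNU time is absent or lacks -f):

```shell
out=$(mktemp)
for i in 1 2 3 4 5; do
    # -a appends, -f '%e' limits the output to elapsed wall-clock seconds
    /usr/bin/time -f '%e' -a -o "$out" true 2>/dev/null || true
done
# average the collected wall-clock times (skipped if nothing was logged)
[ -s "$out" ] && awk '{ s += $1 } END { printf "mean %.3f s\n", s / NR }' "$out"
rm -f "$out"
```

For deviation as well as the mean, the hyperfine answer above needs no scripting at all.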