Experiment:
I ran sleep 1 under strace -tt (which reports timestamps of all syscalls) on the host and in a QEMU guest, and noticed that the time required to reach a certain syscall (clock_nanosleep) is almost twice as long in the guest:
1.813 ms on the host vs
3.396 ms in the guest.
Here is the full host strace -tt sleep 1 and here is the full QEMU strace -tt sleep 1.
Below are excerpts where you can already see the difference:
Host:
Time diff timestamp (as reported by strace)
0.000 / 0.653 ms: 13:13:56.452820 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffded01ecb0 /* 53 vars */) = 0
0.653 / 0.023 ms: 13:13:56.453473 brk(NULL) = 0x5617efdea000
0.676 / 0.063 ms: 13:13:56.453496 arch_prctl(0x3001 /* ARCH_??? */, 0x7fffeb7041b0) = -1 EINVAL (Invalid argument)
QEMU:
Time diff timestamp (as reported by strace)
0.000 / 1.008 ms: 12:12:03.164063 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffd0bd93e50 /* 13 vars */) = 0
1.008 / 0.119 ms: 12:12:03.165071 brk(NULL) = 0x55b78c484000
1.127 / 0.102 ms: 12:12:03.165190 arch_prctl(0x3001 /* ARCH_??? */, 0x7ffcb5dfd850) = -1 EINVAL (Invalid argument)
The questions:
What causes the slowdown & overhead? The command is not using any hardware (like GPU, disks, etc.), so there are no translation layers involved. I also ran the command several times to ensure that everything that can be cached in the guest is cached.
Is there a way to speed it up?
Update:
With cpupower frequency-set --governor performance the timings are:
Host: 0.922ms
Guest: 1.412ms
With image in /dev/shm (-drive file=/dev/shm/root):
Host: 0.922ms
Guest: 1.280ms
PS
I modified the "bare" output of strace so that it includes (1) the time elapsed since the first syscall, followed by (2) the duration of the syscall, for easier reading. For completeness, the script is here.
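A hypothetical reconstruction of that post-processing (the author's actual script is only linked above; the parsing logic and all names here are my own):

```python
from datetime import datetime

def annotate(lines):
    """Prefix each `strace -tt` line with `offset / duration` in ms.

    offset   = time since the first syscall's timestamp
    duration = gap to the next syscall's timestamp (0 for the last line)
    """
    stamps = [datetime.strptime(line.split()[0], "%H:%M:%S.%f") for line in lines]
    out = []
    for i, line in enumerate(lines):
        offset_ms = (stamps[i] - stamps[0]).total_seconds() * 1000
        dur_ms = ((stamps[i + 1] - stamps[i]).total_seconds() * 1000
                  if i + 1 < len(lines) else 0.0)
        out.append(f"{offset_ms:.3f} / {dur_ms:.3f} ms: {line}")
    return out

host_lines = [
    '13:13:56.452820 execve("/usr/bin/sleep", ["sleep", "1"], ...) = 0',
    '13:13:56.453473 brk(NULL) = 0x5617efdea000',
    '13:13:56.453496 arch_prctl(0x3001, ...) = -1 EINVAL',
]
for annotated in annotate(host_lines):
    print(annotated)
```

Note that "duration" here is simply the gap to the next timestamp, which matches the excerpts above (the last line has no successor, so it is reported as 0).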
I started qemu in this way:
qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -m 4G -nodefaults -no-user-config -nographic -no-reboot \
-kernel $HOME/devel/vmlinuz-5.13.0-20-generic \
-append 'earlyprintk=hvc0 console=hvc0 root=/dev/sda rw' \
-drive file=$HOME/devel/images/root,if=ide,index=0,media=disk,format=raw \
-device virtio-serial,id=virtio-serial0 -chardev stdio,mux=on,id=host-io,signal=off -device virtconsole,chardev=host-io,id=console0
It turned out that my custom-built kernel was missing the CONFIG_HYPERVISOR_GUEST=y option (and a couple of nested options).
That's expected, considering the way strace is implemented, i.e. via the ptrace(2) system call: every time the traced process performs a system call or gets a signal, it is forcefully stopped and control is passed to the tracing process, which in the case of strace does all the unpacking & printing synchronously, i.e. while keeping the traced process stopped. That's exactly the kind of path that amplifies any emulation overhead.
It would be instructive to strace strace itself -- you will see that it does not let the traced process continue (with ptrace(PTRACE_SYSCALL, ...)) until it has processed & written out everything related to the current system call.
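The control flow described above can be sketched as simplified pseudocode (this is not actual strace source, just the shape of any ptrace-based tracer's main loop):

```
fork a child; the child calls ptrace(PTRACE_TRACEME) and execs the target
loop until the child exits:
    ptrace(PTRACE_SYSCALL, pid, ...)   # resume child until next syscall entry/exit
    waitpid(pid, ...)                  # child is now stopped again
    ptrace(PTRACE_GETREGS, pid, ...)   # read syscall number, args, return value
    decode arguments, format the line, write it out
                                       # all of this runs while the child stays stopped
```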
Notice that in order to run a "trivial" sleep 1 command, the dynamic linker will perform a couple dozen system calls before even getting to the entry point of the sleep binary.
I don't think that optimizing strace is worth spending time on; if you were planning to use strace as an auditing tool rather than a debugging tool (by running production tasks under strace or similar), you should reconsider your design ;-)
Running qemu on my Mac, I found that 'sleep 1' at the bash command line usually took 10 seconds while 'sleep 2' usually took 5 seconds, at least as measured by time on a 6.0.8 Arch Linux guest. Oddly, time seemed to measure the passage of time correctly while sleep was not working.
But I had been running
qemu-system-x86_64 \
-m 1G \
-nic user,hostfwd=tcp::10022-:22 \
img1.cow
Then, reading about the -icount parameter, I found that the following makes sleep pretty accurate:
qemu-system-x86_64 \
-icount shift=auto,sleep=on \
-m 1G \
-nic user,hostfwd=tcp::10022-:22 \
img1.cow
I mention it here because my search for qemu and slow sleep 1 led me here first.
I am trying to profile my userspace program on an Arria 10 FPGA board (with 2 ARM Cortex-A9 CPUs) which has PMU support. I am running Wind River Linux version 9.x. I built my kernel with almost all of the CONFIG_ options people suggested on the internet. Also, my program is compiled with the -fno-omit-frame-pointer and -g options.
What I see is that 'perf record' doesn't generate any samples at all. The 'perf stat true' output looks valid, though (not sure what to make of it). Does anyone have suggestions/ideas why I am not seeing any samples being generated?
~: perf record --call-graph dwarf -- my_app
^C
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB perf.data ]
~: perf report -g graph --no-children
Error:
The perf.data file has no samples!
To display the perf.data header info, please use --header/--header-only options.
~: perf stat true
Performance counter stats for 'true':
1.095300 task-clock (msec) # 0.526 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
22 page-faults # 0.020 M/sec
1088056 cycles # 0.993 GHz
312708 instructions # 0.29 insn per cycle
29159 branches # 26.622 M/sec
16386 branch-misses # 56.20% of all branches
0.002082030 seconds time elapsed
I don't use a VM in this setup. Arria 10 is an Intel FPGA with 2 ARM CPUs that support a PMU.
Edit:
1. I realize now that the ARM CPU does have HW PMU support (contrary to what I mentioned earlier). Even with HW PMU support, I am not able to run 'perf record' successfully.
This is an old question, but for people who find this via search:
perf record -e cpu-clock <command>
works for me. The problem seems to be that the default event (cycles) is not available.
I want to continuously replenish /proc/sys/random/entropy_avail when it drops.
I first checked rngd (https://wiki.archlinux.org/index.php/Rng-tools).
It says /dev/random is very slow, since it only collects entropy from device drivers and other (slow) sources, and I think that is why we use rngd.
It also says rngd mainly uses hardware random number generators (TRNGs), present in modern hardware like recent AMD/Intel processors, VIA Nano, or even the Raspberry Pi.
However, when I start rngd it says
[root@localhost init.d]# rngd
can't open entropy source(tpm or intel/amd rng)
Maybe RNG device modules are not loaded
But I don't have Intel RDRAND, as confirmed by cat /proc/cpuinfo | grep rdrand:
[root@localhost init.d]# cat /proc/cpuinfo | grep rdrand | wc -l
0
Are there any other entropy sources that I can use?
Alternatively, is it possible to write a script that increases /proc/sys/random/entropy_avail?
Try this:
sudo apt-get install haveged
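To confirm the daemon actually helps, you can watch the kernel's entropy estimate before and after starting it. A minimal sketch (the /proc path is standard on Linux; the helper name is mine, and it returns None where the file does not exist):

```python
from pathlib import Path

def entropy_avail(path="/proc/sys/random/entropy_avail"):
    """Return the kernel's current entropy estimate, or None if unavailable."""
    try:
        return int(Path(path).read_text())
    except (FileNotFoundError, ValueError):
        return None

print(entropy_avail())
```

On a machine starved of entropy you would expect this number to stay low and to rise once haveged (or rngd with a working TRNG) is running.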
sudo perf top shows "Events: 0 cycles".
sudo perf record -ag sleep 10 shows
[ perf record: Woken up 1 time to write data ]
[ perf record: Captured and wrote 0.154 MB perf.data (~6725 samples) ]
However, sudo perf report shows "The perf.data file has no samples!". I also checked the recorded perf.data and confirmed there are no samples in it.
The system is "3.2.0-86-virtual #123-Ubuntu SMP Sun Jun 14 18:25:12 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux".
perf version 3.2.69
Inputs are appreciated.
There may be no real samples on an idle virtualized system (your Linux kernel version has a "-virtual" suffix); or there may be no access to hardware counters (-e cycles), which are used by default.
Try to profile some real application, like:
echo '2^2345678%2'| sudo perf record /usr/bin/bc
Also check software counters like -e cpu-clock:
echo '2^2345678%2'| sudo perf record -e cpu-clock /usr/bin/bc
You may try perf stat (or perf stat -d) with the same example to find which basic counters are actually incremented on your system:
echo '2^2345678%2'| sudo perf stat /usr/bin/bc
About the "(~6725 samples)" output: perf record doesn't count samples in its output; it only estimates their count, and this estimate is always wrong. Every perf.data file has a fixed part that contains no samples (it may take tens of kB in system-wide mode), and the estimate incorrectly counts this part as if it contained events of mean length.
I am trying to get a list of available frequencies for my cpu using the cpupower tool.
I am executing the following command:
cpupower -c 0,1,4,5 frequency-info
This gives me a lot of information, but I need to see a list of the available frequencies to which I can set these CPUs.
On older versions of Fedora, I used to do this
$ cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_available_frequencies
2201000 2200000 2100000 2000000 1800000 1700000 1600000 1500000 1300000 1200000 1100000
but on Fedora 20, cpufreq is obsolete. I googled and found that cpupower provides the same functionality as cpufreq.
How do you use it to get a list of available frequencies?
cpupower frequency-info
should give you your info
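If the driver exposes it, the raw list also still lives in sysfs on modern kernels. A small sketch that converts the kHz values sysfs reports into GHz (only the sysfs path is standard; the helper name and fallback sample are mine):

```python
from pathlib import Path

def available_ghz(text):
    """Convert a space-separated kHz list (sysfs format) to GHz values."""
    return [int(khz) / 1e6 for khz in text.split()]

# Standard cpufreq sysfs location; not every driver exposes this file,
# so fall back to a sample string for illustration.
path = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies")
sample = path.read_text() if path.exists() else "2201000 2200000 1200000"
print(available_ghz(sample))
```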