What causes overhead in QEMU for a trivial `sleep 1`? (Linux)

Experiment:
I ran `sleep 1` under `strace -tt` (which reports timestamps of all syscalls) on the host and in a QEMU guest, and noticed that the time required to reach a certain syscall (clock_nanosleep) is almost twice as long in the guest:
1.813 ms on the host vs
3.396 ms in the guest.
Here is the full host `strace -tt sleep 1` output and here is the full QEMU `strace -tt sleep 1` output.
Below are excerpts where you can already see the difference:
Host:
Time from start / syscall duration (ms): timestamp (as reported by strace)
0.000 / 0.653 ms: 13:13:56.452820 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffded01ecb0 /* 53 vars */) = 0
0.653 / 0.023 ms: 13:13:56.453473 brk(NULL) = 0x5617efdea000
0.676 / 0.063 ms: 13:13:56.453496 arch_prctl(0x3001 /* ARCH_??? */, 0x7fffeb7041b0) = -1 EINVAL (Invalid argument)
QEMU:
Time from start / syscall duration (ms): timestamp (as reported by strace)
0.000 / 1.008 ms: 12:12:03.164063 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffd0bd93e50 /* 13 vars */) = 0
1.008 / 0.119 ms: 12:12:03.165071 brk(NULL) = 0x55b78c484000
1.127 / 0.102 ms: 12:12:03.165190 arch_prctl(0x3001 /* ARCH_??? */, 0x7ffcb5dfd850) = -1 EINVAL (Invalid argument)
The questions:
What causes the slowdown and overhead? The command does not use any hardware (GPU, disks, etc.), so there are no translation layers involved. I also ran the command several times to ensure that everything that can be cached is cached in the guest.
Is there a way to speed it up?
Update:
With cpupower frequency-set --governor performance the timings are:
Host: 0.922ms
Guest: 1.412ms
With image in /dev/shm (-drive file=/dev/shm/root):
Host: 0.922ms
Guest: 1.280ms
PS
I modified the "bare" output of strace so that it includes (1) time starting from 0 at the first syscall, followed by (2) the duration of the syscall, for easier reading. For completeness, the script is here.
I started qemu in this way:
qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -m 4G -nodefaults -no-user-config -nographic -no-reboot \
-kernel $HOME/devel/vmlinuz-5.13.0-20-generic \
-append 'earlyprintk=hvc0 console=hvc0 root=/dev/sda rw' \
-drive file=$HOME/devel/images/root,if=ide,index=0,media=disk,format=raw \
-device virtio-serial,id=virtio-serial0 -chardev stdio,mux=on,id=host-io,signal=off -device virtconsole,chardev=host-io,id=console0

It turned out that my (custom-built) kernel was missing the CONFIG_HYPERVISOR_GUEST=y option (and a couple of nested options).
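For anyone hitting the same issue: the relevant options sit under CONFIG_HYPERVISOR_GUEST in the x86 Kconfig. A config fragment for a KVM guest might look like this (illustrative; the exact set of nested options depends on the kernel version):

CONFIG_HYPERVISOR_GUEST=y
CONFIG_PARAVIRT=y
CONFIG_KVM_GUEST=y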

That's expected, considering the way strace is implemented, i.e. via the ptrace(2) system call: every time the traced process performs a system call or gets a signal, it is forcefully stopped and control is passed to the tracing process, which in the case of strace does all the unpacking and printing synchronously, i.e. while keeping the traced process stopped. That's exactly the kind of path that amplifies any emulation overhead.
It would be instructive to strace strace itself: you will see that it does not let the traced process continue (with ptrace(PTRACE_SYSCALL, ...)) until it has processed and written out everything related to the current system call.
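To see that stop-and-go pattern concretely, here is a minimal tracer sketch (my own illustration, not strace's actual code; Linux x86-64, error handling omitted) that, like strace, keeps the child stopped at every syscall entry and exit:

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t child = fork();
    if (child == 0) {
        /* Child: ask to be traced, then exec the target. */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execlp("sleep", "sleep", "1", (char *)NULL);
        return 1;
    }
    int status;
    waitpid(child, &status, 0);   /* initial stop after execve */
    long stops = 0;
    for (;;) {
        /* Resume until the next syscall entry or exit... */
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;
        /* ...the child stays stopped while we inspect it; strace
         * would decode registers and print the syscall here. */
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        stops++;
    }
    /* Each syscall produces two stops (entry + exit). */
    printf("%ld syscall stops (~%ld syscalls)\n", stops, stops / 2);
    return 0;
}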
Notice that in order to run a "trivial" sleep 1 command, the dynamic linker will perform a couple dozen system calls before even getting to the entry point of the sleep binary.
I don't think that optimizing strace is worth spending time on; if you were planning to use strace as an auditing tool rather than a debugging tool (by running production tasks under strace or similar), you should reconsider your design ;-)

Running qemu on my Mac, I found that 'sleep 1' at the bash command line usually took 10 seconds while 'sleep 2' usually took 5 seconds, at least as measured by time on a 6.0.8 Arch Linux guest. Oddly, time seemed to measure the passage of time correctly while sleep did not.
But I had been running
qemu-system-x86_64 \
-m 1G \
-nic user,hostfwd=tcp::10022-:22 \
img1.cow
Then, after reading about the -icount parameter (which drives the guest's virtual clock from the number of executed instructions rather than from the host clock), I found that the following makes sleep pretty accurate:
qemu-system-x86_64 \
-icount shift=auto,sleep=on \
-m 1G \
-nic user,hostfwd=tcp::10022-:22 \
img1.cow
I mention it here because my search for qemu and slow sleep 1 led me here first.

Related

Docker desktop on linux is using a lot of ram

I'm running Docker desktop on Ubuntu 22.04. Every time I start it, it eats a lot of RAM.
PID USER %MEM COMMAND
135264 user 26.0 qemu-system-x86_64 -accel kvm -cpu host -machine q35 -m 3849 -smp 8 -kernel /opt/docker-desktop/linuxkit/kernel -append page_poison=1 vsyscall=emulate panic=1 nospec_store_bypass_disable noibrs noibpb no_stf_barrier mitigations=off linuxkit.unified_cgroup_hierarchy=1 vpnkit.connect=tcp+bootstrap+client://gateway.docker.internal:35817/95d4e7d4090b2d25b84ed2f2bd2e54523bafd0dfc2e2388838f04b9d045e0fe2 vpnkit.disable=osxfs-data console=ttyS0 -initrd /opt/docker-desktop/linuxkit/initrd.img -serial pipe:/tmp/qemu-console1696356651/fifo -drive if=none,file=/home/lev/.docker/desktop/vms/0/data/Docker.raw,format=raw,id=hd0 -device virtio-blk-pci,drive=hd0,serial=dummyserial -netdev user,id=net0,ipv6=off,net=192.168.65.0/24,dhcpstart=192.168.65.9 -device virtio-net-pci,netdev=net0 -vga none -nographic -monitor none -object memory-backend-memfd,id=mem,size=3849M,share=on -numa node,memdev=mem -chardev socket,id=char0,path=virtiofs.sock0 -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=virtiofs0
10422 user 2.3 /snap/firefox/1883/usr/lib/firefox/firefox
...
While docker ps shows that there are no containers running.
I've noticed the mention of 3849M of memory in the command, but I can't be entirely sure it's related; besides, it eats way more than 4 GB.
Well, Docker uses all allocated memory at start; please see
https://github.com/docker/for-mac/issues/4229
You can set the memory limit under:
Docker Dashboard >> Settings >> Resources >> Apply and Restart
Otherwise, if you want to check how resources are split between running containers,
run docker stats to see the memory usage of the currently running containers.
See https://docs.docker.com/engine/reference/commandline/stats/
For example:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
db6115785a9e 001_jan_twit 0.00% 35.71MiB / 7.774GiB 0.45% 38.6MB / 659kB 16.4kB / 222MB 2

Measure LLC/L3 Cache Miss Rate on AMD Zen2 CPU

I have a question related to this one.
I want to (programmatically) measure L3 hits (accesses) and misses on an AMD EPYC 7742 CPU (Zen2). I run Linux kernel 5.4.0-66-generic on Ubuntu Server 20.04.2 LTS. According to the question linked above, the events rFF04 (L3LookupState) and r0106 (L3CombClstrState) should represent L3 accesses and misses, respectively. Furthermore, kernel 5.4 should support these events.
However, when measuring with perf, I run into issues. Similar to the question linked above, if I run numactl -C 0 -m 0 perf stat -e instructions,cycles,r0106,rFF04 ./benchmark, I only measure 0 values. If I try to use numactl -C 0 -m 0 perf stat -e instructions,cycles,amd_l3/r8001/,amd_l3/r0106/, perf complains about "unknown terms". If I use the perf event names, i.e. numactl -C 0 -m 0 perf stat -e instructions,cycles,l3_request_g1.caching_l3_cache_accesses,l3_comb_clstr_state.request_miss, perf outputs <not supported> for these events.
Furthermore, I actually want to measure this using perf's C API. Currently, I dispatch a perf_event_attr with type PERF_TYPE_RAW and config set to, e.g., 0x8001. How do I get the amd_l3 PMU into my perf_event_attr object? Otherwise it is equivalent to numactl -C 0 -m 0 perf stat -e instructions,cycles,r0106,rFF04 ./benchmark, which measures undefined values.
Thank you so much for your help.
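A note on the C-API part: dynamic PMUs such as amd_l3 are not addressed via PERF_TYPE_RAW. Each one exports a type id in /sys/bus/event_source/devices/<pmu>/type, and that id goes into perf_event_attr.type while the raw encoding (e.g. 0xFF04 from the question) goes into config. L3 events are uncore events, so they must be opened system-wide (pid = -1) on a CPU of the target CCX. A hedged sketch, assuming the amd_l3 PMU is present and the tool runs as root:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* perf_event_open has no glibc wrapper */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    /* Dynamic PMUs publish their type id in sysfs. */
    FILE *f = fopen("/sys/bus/event_source/devices/amd_l3/type", "r");
    if (!f) { perror("amd_l3 PMU not present"); return 1; }
    int pmu_type;
    if (fscanf(f, "%d", &pmu_type) != 1) return 1;
    fclose(f);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = pmu_type;   /* NOT PERF_TYPE_RAW */
    attr.config = 0xFF04;   /* raw L3 lookup encoding from the question */

    /* amd_l3 is an uncore PMU: open system-wide on one CPU, not per-task. */
    int fd = perf_event_open(&attr, -1, 0 /* cpu 0 */, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    sleep(1);               /* run the workload pinned to CPU 0 here */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long value;
    if (read(fd, &value, sizeof(value)) == sizeof(value))
        printf("L3 lookups: %lld\n", value);
    close(fd);
    return 0;
}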

What does `--oom-kill-disable` do for a Docker container?

I have understood that docker run -m 256m --memory-swap 256m will limit a container so that it can use at most 256 MB of memory and no swap. If it allocates more, then a process in the container (not "the container") will be killed. For example:
$ sudo docker run -it --rm -m 256m --memory-swap 256m \
stress --vm 1 --vm-bytes 2000M --vm-hang 0
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [1] (415) <-- worker 7 got signal 9
stress: WARN: [1] (417) now reaping child worker processes
stress: FAIL: [1] (421) kill error: No such process
stress: FAIL: [1] (451) failed run completed in 1s
Apparently one of the workers allocates more memory than is allowed and receives a SIGKILL. Note that the parent process stays alive.
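What stress's VM worker does here boils down to allocate-and-touch in a loop; a hypothetical stand-in (not stress's actual source) that you can run under the same limits to reproduce the SIGKILL:

#include <stdlib.h>
#include <string.h>

/* Keep allocating and touching memory until the cgroup limit set by
 * -m is exceeded and the OOM killer delivers SIGKILL to this process. */
int main(void) {
    for (;;) {
        char *p = malloc(64 << 20);   /* 64 MiB per step */
        if (!p) return 1;             /* depending on overcommit, malloc may fail instead */
        memset(p, 1, 64 << 20);       /* touch pages so they are actually charged */
    }
}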
Now, if the effect of -m is to invoke the OOM killer when a process allocates too much memory, then what happens when specifying -m and --oom-kill-disable together? Trying it as above gives the following result:
$ sudo docker run -it --rm -m 256m --memory-swap 256m --oom-kill-disable \
stress --vm 1 --vm-bytes 2000M --vm-hang 0
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
(waits here)
In a different shell:
$ docker stats
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
f5e4c30d75c9 0.00% 256 MiB / 256 MiB 100.00% 0 B / 508 B 0 B / 0 B 2
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19391 root 20 0 2055904 262352 340 D 0.0 0.1 0:00.05 stress
I see that docker stats shows a memory consumption of 256 MB, and top shows a RES of 256 MB and a VIRT of 2000 MB. But what does that actually mean? What will happen to a process inside the container that tries to use more memory than allowed? In what sense is it constrained by -m?
As I understand the docs, --oom-kill-disable is not constrained by -m but actually requires it:
By default, kernel kills processes in a container if an out-of-memory
(OOM) error occurs. To change this behaviour, use the
--oom-kill-disable option. Only disable the OOM killer on containers where you have also set the -m/--memory option. If the -m flag is not
set, this can result in the host running out of memory and require
killing the host’s system processes to free memory.
A developer stated back in 2015 that
The host can run out of memory with or without the -m flag set. But
it's also irrelevant as --oom-kill-disable does nothing unless -m is
passed.
In regard to your update, what happens when the OOM killer is disabled and yet the memory limit is hit (interesting OOM article): I'd say that new calls to malloc and such will just fail as described here, but it also depends on the swap configuration and the host's available memory. If your -m limit is above the actually available memory, the host will start killing processes, one of which might be the Docker daemon (which they try to avoid by changing its OOM priority).
The kernel docs (cgroup/memory.txt) say
If OOM-killer is disabled, tasks under cgroup will hang/sleep in
memory cgroup's OOM-waitqueue when they request accountable memory
For the actual implementation of cgroups (which Docker uses as well), you'd have to check the source code.
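If you want to observe that hang from the host, the cgroup v1 memory controller exposes the state in memory.oom_control. A minimal sketch that dumps it (the docker/<container-id> path layout is an assumption that holds for typical cgroup v1 setups, so pass your container's cgroup directory explicitly):

#include <stdio.h>

/* Prints oom_kill_disable / under_oom / oom_kill for a cgroup v1 memory cgroup.
 * Usage: ./oomstate /sys/fs/cgroup/memory/docker/<container-id> */
int main(int argc, char **argv) {
    char path[512], line[256];
    if (argc != 2) { fprintf(stderr, "usage: %s <cgroup-dir>\n", argv[0]); return 1; }
    snprintf(path, sizeof(path), "%s/memory.oom_control", argv[1]);
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);   /* expect "under_oom 1" while tasks hang in the OOM-waitqueue */
    fclose(f);
    return 0;
}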
The job of the 'OOM killer' in Linux is to sacrifice one or more processes in order to free up memory for the system when all else fails. The OOM killer is only enabled if the host has memory overcommit enabled.
Setting --oom-kill-disable sets the cgroup parameter that disables the OOM killer for this specific container when the condition specified by -m is met. Without the -m flag, the OOM killer is irrelevant.
The -m flag doesn't mean "stop the process when it uses more than X MB of RAM"; it only ensures that the Docker container doesn't consume all host memory, which could force the kernel to kill the host's processes. With the -m flag, the container is not allowed to use more than the given amount of user or system memory.
When the container hits OOM, it won't be killed, but it can hang and stay in a defunct state, so processes inside the container can't respond until you manually intervene and restart or kill the container. Hope this helps clear up your questions.
For more details on how the kernel acts on OOM, check the Linux OOM management and Docker memory limitations pages.

Why is the system CPU time (% sy) high?

I am running a script that loads big files. I ran the same script on a single-core OpenSuSE server and on a quad-core PC. As expected, on my PC it is much faster than on the server. But the script slows down the server and makes it impossible to do anything else.
My script is
for 100 iterations
Load saved data (about 10 mb)
time myscript (in PC)
real 0m52.564s
user 0m51.768s
sys 0m0.524s
time myscript (in server)
real 32m32.810s
user 4m37.677s
sys 12m51.524s
I wonder why "sys" is so high when I run the code on the server. I used the top command to check the memory and CPU usage.
It seems there is still free memory, so swapping is not the reason. %sy is very high, so it is probably the reason for the server's slowness, but I don't know what is causing it. The process using the highest percentage of CPU (99%) is "myscript". %wa is zero in the screenshot, but sometimes it gets very high (50%).
When the script is running, the load average is above 1 but has never been seen as high as 2.
I also checked my disk:
strt:~ # hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 16480 MB in 2.00 seconds = 8247.94 MB/sec
Timing buffered disk reads: 20 MB in 3.44 seconds = 5.81 MB/sec
john#strt:~> df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 245G 102G 131G 44% /
udev 4.0G 152K 4.0G 1% /dev
tmpfs 4.0G 76K 4.0G 1% /dev/shm
I have checked these things, but I am still not sure what the real problem on my server is or how to fix it. Can anyone identify a probable reason for the slowness? What could be the solution? Or is there anything else I should check?
Thanks!
You're getting high sys activity because loading the data involves system calls that execute in the kernel. It might be possible to resolve your slowness problems without upgrading the server: you can modify the scheduling priority. See the man pages for nice and renice. See here, and especially:
Niceness values range from -20 (the highest priority, lowest niceness) to 19 (the lowest priority, highest niceness).
$ ps -lp 941
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
4 S 0 941 1 0 70 -10 - 1713 poll_s ? 00:00:00 sshd
$ nice -n 19 ./test.sh
My niceness value is 19
$ renice -n 10 -p 941
941 (process ID) old priority -10, new priority 10

Who is refreshing hardware watchdog in Linux?

I have an AT91SAM9G20 processor running a 2.6 kernel. The watchdog is enabled at the bootstrap level and configured for 16 seconds. The watchdog mode register can be configured only once.
When the code hangs either in the bootstrap, the bootloader or the kernel, the board reboots. But once the kernel comes up, even though the watchdog is not refreshed in any of the applications, the board is not reset after 16 seconds, but only after about 15 minutes.
Who is refreshing the watchdog?
In our case, the watchdog should be controlled by the applications, so that the board resets if our application hangs.
These are the running processes:
1 root init
2 root [kthreadd]
3 root [ksoftirqd/0]
4 root [watchdog/0]
5 root [events/0]
6 root [khelper]
63 root [kblockd/0]
72 root [ksuspend_usbd]
78 root [khubd]
85 root [kmmcd]
107 root [pdflush]
108 root [pdflush]
109 root [kswapd0]
110 root [aio/0]
740 root [mtdblockd]
828 root [rpciod/0]
982 root [jffs2_gcd_mtd10]
1003 root /sbin/udevd -d
1145 daemon portmap
1158 dbus dbus-daemon --system
1178 root /usr/sbin/ifplugd -i eth0 -fwI -u0 -d5 -l -q
1190 root /usr/sbin/ifplugd -i eth1 -fwI -u0 -d5 -l -q
1221 default avahi-daemon: running [SP14.local]
1226 root /usr/sbin/dropbear
1246 root /root/bin/host_app
1254 root /root/bin/mini_httpd -c *.cgi -d /root/bin -u root -E /root/bin/
1256 root -sh
1257 root /sbin/syslogd -n -m 0
1258 root /sbin/klogd -n
1259 root /usr/bin/tail -f /var/log/messages
1265 root ps -e
We are using the watchdog for soft lockups available in kernel-2.6.25-ts.at91sam9g20/kernel/softlockup.c
If you enabled the watchdog driver in your kernel, the watchdog driver sets up a kernel timer in charge of resetting the watchdog. The corresponding code is linux/drivers/watchdog/at91sam9_wdt.c. So it works like this:
If no application opens the /dev/watchdog file, the kernel takes care of resetting the watchdog. Since it is a timer, it won't appear as a dedicated kernel thread, but is handled by the soft IRQ thread. Now, if an application opens this file, it becomes responsible for the watchdog, and can reset it by writing to the file, as described in the documentation linked in Richard's post.
Is the watchdog driver configured in your kernel?
If not, you should configure it, and see if the reset still happens. If it still happens, it is likely that your reset comes from somewhere else.
If your kernel is too old to have a proper watchdog driver (it is not present in 2.6.25), you should backport it from 2.6.28. Or you can try disabling the watchdog in your bootloader and see if the reset still occurs.
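The userspace side described above reduces to a small feeder. A minimal sketch per the standard watchdog API (WDIOC_GETTIMEOUT, WDIOC_KEEPALIVE and the magic-close 'V' come from the watchdog-api.txt document linked further down):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main(void) {
    /* Opening the device makes userspace responsible for feeding it. */
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd < 0) { perror("/dev/watchdog"); return 1; }

    int timeout = 0;
    ioctl(fd, WDIOC_GETTIMEOUT, &timeout);
    printf("hardware timeout: %d s\n", timeout);

    for (int i = 0; i < 10; i++) {
        ioctl(fd, WDIOC_KEEPALIVE, 0);        /* feed the dog */
        sleep(timeout > 1 ? timeout / 2 : 1); /* well within the period */
    }

    /* Magic close: with CONFIG_WATCHDOG_NOWAYOUT unset, this tells the
     * driver the close is intentional so the board is not reset. */
    write(fd, "V", 1);
    close(fd);
    return 0;
}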
In July 2016, commit 3fbfe92647 ("watchdog: change watchdog_need_worker logic") to watchdog_dev.c in the 4.7 kernel enabled the same behavior as shodanex's answer for all watchdog timer drivers. This doesn't seem to be documented anywhere other than this thread and the source code:
/*
 * A worker to generate heartbeat requests is needed if all of the
 * following conditions are true.
 * - Userspace activated the watchdog.
 * - The driver provided a value for the maximum hardware timeout, and
 *   thus is aware that the framework supports generating heartbeat
 *   requests.
 * - Userspace requests a longer timeout than the hardware can handle.
 *
 * Alternatively, if userspace has not opened the watchdog
 * device, we take care of feeding the watchdog if it is
 * running.
 */
return (hm && watchdog_active(wdd) && t > hm) ||
       (t && !watchdog_active(wdd) && watchdog_hw_running(wdd));
This may give you a hint: http://www.mjmwired.net/kernel/Documentation/watchdog/watchdog-api.txt
It makes perfect sense to have a userspace daemon handling the watchdog. It probably defaults to a 15-minute timeout.
We had a similar problem with the WDT on an AT91SAM9263. The problem was bit 29, WDIDLEHLT, of the WDT_MR register (address 0xFFFFFD44). This bit was set to 1, but it should be 0 for our application's needs.
Bit explanation from the datasheet:
• WDIDLEHLT: Watchdog Idle Halt
0: The Watchdog runs when the system is in idle mode.
1: The Watchdog stops when the system is in idle state.
This means that the WDT counter does not increment while the kernel is in the idle state, hence the 15-minute (or longer) delay until the reset happens.
You can try dd if=/dev/zero of=/dev/null, which prevents the kernel from entering the idle state; you should then get a reset in 16 seconds (or whatever period you have set in the WDT_MR register).
So, the solution is to update the u-boot code, or whatever other piece of code sets the WDT_MR register. Remember, this register is write-once...
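For illustration, the write-once setup in the boot code would look roughly like this (a sketch built from the register address and bit 29 quoted above; the WDV/WDD/WDRSTEN field positions are taken from the AT91SAM9 datasheet and should be verified against yours):

#include <stdint.h>

#define WDT_MR        (*(volatile uint32_t *)0xFFFFFD44)
#define WDT_WDV(x)    ((uint32_t)(x) & 0xFFFu)          /* counter value, bits 0-11   */
#define WDT_WDRSTEN   (1u << 13)                        /* reset the CPU on timeout   */
#define WDT_WDD(x)    (((uint32_t)(x) & 0xFFFu) << 16)  /* delta window, bits 16-27   */
#define WDT_WDIDLEHLT (1u << 29)                        /* halt WDT in idle - keep 0  */

void wdt_setup(void) {
    /* ~16 s period (slow clock 32.768 kHz / 128 prescaler, 0xFFF counts).
     * WDIDLEHLT is deliberately NOT set, so the counter keeps running
     * while the kernel idles. WDT_MR is write-once, so this must run
     * before anything else touches the register. */
    WDT_MR = WDT_WDV(0xFFF) | WDT_WDD(0xFFF) | WDT_WDRSTEN;
}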
Wouldn't the kernel be refreshing the watchdog timer? The watchdog is designed to reset the board if the whole system hangs, not just a single application.
