Docker Desktop on Linux is using a lot of RAM

I'm running Docker Desktop on Ubuntu 22.04, and every time I start it, it eats a lot of RAM:
PID USER %MEM COMMAND
135264 user 26.0 qemu-system-x86_64 -accel kvm -cpu host -machine q35 -m 3849 -smp 8 -kernel /opt/docker-desktop/linuxkit/kernel -append page_poison=1 vsyscall=emulate panic=1 nospec_store_bypass_disable noibrs noibpb no_stf_barrier mitigations=off linuxkit.unified_cgroup_hierarchy=1 vpnkit.connect=tcp+bootstrap+client://gateway.docker.internal:35817/95d4e7d4090b2d25b84ed2f2bd2e54523bafd0dfc2e2388838f04b9d045e0fe2 vpnkit.disable=osxfs-data console=ttyS0 -initrd /opt/docker-desktop/linuxkit/initrd.img -serial pipe:/tmp/qemu-console1696356651/fifo -drive if=none,file=/home/lev/.docker/desktop/vms/0/data/Docker.raw,format=raw,id=hd0 -device virtio-blk-pci,drive=hd0,serial=dummyserial -netdev user,id=net0,ipv6=off,net=192.168.65.0/24,dhcpstart=192.168.65.9 -device virtio-net-pci,netdev=net0 -vga none -nographic -monitor none -object memory-backend-memfd,id=mem,size=3849M,share=on -numa node,memdev=mem -chardev socket,id=char0,path=virtiofs.sock0 -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=virtiofs0
10422 user 2.3 /snap/firefox/1883/usr/lib/firefox/firefox
...
Meanwhile, docker ps shows that no containers are running.
I've noticed the mention of 3849M of memory in the command line, but I can't be entirely sure it's related, and the process eats well over 4 GB anyway.

Well, Docker Desktop uses all of its allocated memory at start; see
https://github.com/docker/for-mac/issues/4229
You can set the memory limit under:
Docker Dashboard >> Settings >> Resources >> Apply & Restart
Otherwise, if you want to check how resources are split between running containers,
run docker stats to see the memory usage of the currently running containers.
See https://docs.docker.com/engine/reference/commandline/stats/
For example:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
db6115785a9e 001_jan_twit 0.00% 35.71MiB / 7.774GiB 0.45% 38.6MB / 659kB 16.4kB / 222MB 2
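To confirm that it is the Docker Desktop VM itself holding the memory rather than any container, you can also check the resident set size of the qemu process on the host; this is plain ps, nothing Docker-specific:
ps -o pid,rss,comm -C qemu-system-x86_64
RSS is reported in kilobytes, so a figure close to 4,000,000 roughly matches the 3849M handed to the VM on its command line, plus QEMU's own overhead.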

Related

What causes overhead in QEMU in case of trivial `sleep 1`?

Experiment:
I ran sleep 1 under strace -tt (which reports timestamps of all syscalls) on the host and in a QEMU guest, and noticed that the time required to reach a particular syscall (clock_nanosleep) is almost twice as long in the guest:
1.813 ms on the host vs
3.396 ms in the guest.
Here is the full host strace -tt sleep 1 and here is the full QEMU strace -tt sleep 1.
Below are excerpts where you can already see the difference:
Host:
Time offset / duration: timestamp (as reported by strace)
0.000 / 0.653 ms: 13:13:56.452820 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffded01ecb0 /* 53 vars */) = 0
0.653 / 0.023 ms: 13:13:56.453473 brk(NULL) = 0x5617efdea000
0.676 / 0.063 ms: 13:13:56.453496 arch_prctl(0x3001 /* ARCH_??? */, 0x7fffeb7041b0) = -1 EINVAL (Invalid argument)
QEMU:
Time offset / duration: timestamp (as reported by strace)
0.000 / 1.008 ms: 12:12:03.164063 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffd0bd93e50 /* 13 vars */) = 0
1.008 / 0.119 ms: 12:12:03.165071 brk(NULL) = 0x55b78c484000
1.127 / 0.102 ms: 12:12:03.165190 arch_prctl(0x3001 /* ARCH_??? */, 0x7ffcb5dfd850) = -1 EINVAL (Invalid argument)
The questions:
What causes the slowdown and overhead? The command doesn't use any hardware (GPU, disks, etc.), so there are no translation layers involved. I also ran the command several times to make sure everything that can be cached in the guest is cached.
Is there a way to speed it up?
Update:
With cpupower frequency-set --governor performance the timings are:
Host: 0.922ms
Guest: 1.412ms
With the image in /dev/shm (-drive file=/dev/shm/root):
Host: 0.922ms
Guest: 1.280ms
PS
I modified the "bare" output of strace so that it includes (1) the time offset from the first syscall, followed by (2) the duration of the syscall, for easier reading. For completeness, the script is here.
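The post-processing is nothing fancy; a rough awk sketch of the idea (offset from the first timestamp plus duration taken from the next line's timestamp) is shown below. It is an illustration, not the exact script linked above.
awk '
{
  # turn "HH:MM:SS.microseconds" into milliseconds
  split($1, t, "[:.]")
  ts = t[1]*3600000 + t[2]*60000 + t[3]*1000 + t[4]/1000
  if (NR == 1) start = ts
  # the duration of the previous syscall is only known now
  if (NR > 1) printf "%.3f / %.3f ms: %s\n", prev_off, ts - prev_ts, prev_line
  prev_off = ts - start; prev_ts = ts; prev_line = $0
}
END { printf "%.3f / ? ms: %s\n", prev_off, prev_line }
' strace.log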
I started qemu in this way:
qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -m 4G -nodefaults -no-user-config -nographic -no-reboot \
-kernel $HOME/devel/vmlinuz-5.13.0-20-generic \
-append 'earlyprintk=hvc0 console=hvc0 root=/dev/sda rw' \
-drive file=$HOME/devel/images/root,if=ide,index=0,media=disk,format=raw \
-device virtio-serial,id=virtio-serial0 -chardev stdio,mux=on,id=host-io,signal=off -device virtconsole,chardev=host-io,id=console0
It turned out that my custom-built kernel was missing the CONFIG_HYPERVISOR_GUEST=y option (and a couple of nested options).
That's expected, given the way strace is implemented, i.e. via the ptrace(2) system call: every time the traced process performs a system call or receives a signal, it is forcibly stopped and control is passed to the tracing process, which in the case of strace does all the unpacking and printing synchronously, i.e. while keeping the traced process stopped. That's exactly the kind of path that greatly amplifies any emulation overhead.
It would be instructive to strace strace itself -- you will see that it does not let the traced process continue (with ptrace(PTRACE_SYSCALL, ...)) until it has processed and written out everything related to the current system call.
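For example, something along these lines (an illustration, not a benchmark) makes the per-syscall stop/resume cycle visible; the outer strace only logs the inner one's ptrace and wait4 calls:
strace -e trace=ptrace,wait4 -o outer.log strace -o inner.log sleep 1
grep -c PTRACE_SYSCALL outer.log
Every PTRACE_SYSCALL in outer.log is one resume of sleep, issued only after the inner strace has finished formatting the previous stop.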
Notice that in order to run a "trivial" sleep 1 command, the dynamic linker will perform a couple dozen system calls before even getting to the entry point of the sleep binary.
I don't think that optimizing strace is worth spending time on; if you were planning to run strace as an auditing tool rather than a debugging tool (by running production tasks under strace or similar), you should reconsider your designs ;-)
Running QEMU on my Mac, I found that 'sleep 1' at the bash command line usually took 10 seconds while 'sleep 2' usually took 5 seconds, at least as measured by time on a 6.0.8 Arch Linux guest. Oddly, time seemed to be measuring the passage of time correctly while sleep was not working.
But I had been running
qemu-system-x86_64 \
-m 1G \
-nic user,hostfwd=tcp::10022-:22 \
img1.cow
Then, after reading about the -icount parameter, I found that the following makes sleep pretty accurate.
qemu-system-x86_64 \
-icount shift=auto,sleep=on \
-m 1G \
-nic user,hostfwd=tcp::10022-:22 \
img1.cow
I mention it here because my search for qemu and slow sleep 1 led me here first.

How to boot the Ubuntu cloud image with a customized kernel

I'm trying to boot an Ubuntu cloud image under QEMU with my own Linux kernel built from source. The customized kernel is outside of the Ubuntu image:
$ ls
kernel ubuntu-20.04-amd64.img ...
Here is the command line I used:
sudo qemu-system-x86_64 \
-enable-kvm -cpu host -smp 2 -m 4096 -nographic \
-drive id=root,media=disk,file=ubuntu-20.04-amd64.img \
-kernel ./kernel/arch/x86/boot/bzImage \
-append "root=/dev/sda console=ttyS0" \
-device e1000,netdev=net0 -netdev user,id=net0,hostfwd=tcp:127.0.0.1:5555-:22
When I boot it, I can see the following log:
[ 0.875446] List of all partitions:
[ 0.875736] 0800 4194304 sda
[ 0.875736] driver: sd
[ 0.876259] 0801 4194303 sda1 00000000-01
[ 0.876259]
[ 0.876893] No filesystem could mount root, tried:
[ 0.876893] ext3
[ 0.877435] ext2
[ 0.877610] ext4
[ 0.877834]
[ 0.878149] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,0)
Any suggestions?
The log you quote says that the disk image is partitioned: that is, sda is the entire (virtual) disk, and it has a partition table with one partition named sda1. Your "append" command line asks to use '/dev/sda' as if the disk image had only a single filesystem on it. Try '/dev/sda1' instead.
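In other words, something like this (the same command as above, with only the root= argument changed) should get past the panic:
sudo qemu-system-x86_64 \
-enable-kvm -cpu host -smp 2 -m 4096 -nographic \
-drive id=root,media=disk,file=ubuntu-20.04-amd64.img \
-kernel ./kernel/arch/x86/boot/bzImage \
-append "root=/dev/sda1 console=ttyS0" \
-device e1000,netdev=net0 -netdev user,id=net0,hostfwd=tcp:127.0.0.1:5555-:22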

What does `--oom-kill-disable` do for a Docker container?

I have understood that docker run -m 256m --memory-swap 256m will limit a container so that it can use at most 256 MB of memory and no swap. If it allocates more, then a process in the container (not "the container") will be killed. For example:
$ sudo docker run -it --rm -m 256m --memory-swap 256m \
stress --vm 1 --vm-bytes 2000M --vm-hang 0
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [1] (415) <-- worker 7 got signal 9
stress: WARN: [1] (417) now reaping child worker processes
stress: FAIL: [1] (421) kill error: No such process
stress: FAIL: [1] (451) failed run completed in 1s
Apparently one of the workers allocates more memory than is allowed and receives a SIGKILL. Note that the parent process stays alive.
Now if the effect of -m is to invoke the OOM killer if a process allocates too much memory, then what happens when specifying -m and --oom-kill-disable? Trying it like above has the following result:
$ sudo docker run -it --rm -m 256m --memory-swap 256m --oom-kill-disable \
stress --vm 1 --vm-bytes 2000M --vm-hang 0
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
(waits here)
In a different shell:
$ docker stats
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
f5e4c30d75c9 0.00% 256 MiB / 256 MiB 100.00% 0 B / 508 B 0 B / 0 B 2
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19391 root 20 0 2055904 262352 340 D 0.0 0.1 0:00.05 stress
I see that docker stats shows a memory consumption of 256 MB, and top shows a RES of 256 MB and a VIRT of 2000 MB. But what does that actually mean? What will happen to a process inside the container that tries to use more memory than allowed? In what sense is it constrained by -m?
As I understand the docs, --oom-kill-disable is not constrained by -m but actually requires it:
By default, kernel kills processes in a container if an out-of-memory
(OOM) error occurs. To change this behaviour, use the
--oom-kill-disable option. Only disable the OOM killer on containers where you have also set the -m/--memory option. If the -m flag is not
set, this can result in the host running out of memory and require
killing the host’s system processes to free memory.
A developer stated back in 2015 that
The host can run out of memory with or without the -m flag set. But
it's also irrelevant as --oom-kill-disable does nothing unless -m is
passed.
In regard to your update, what happens when the OOM killer is disabled and yet the memory limit is hit (interesting OOM article): I'd say that new calls to malloc and such will just fail as described here, but it also depends on the swap configuration and the host's available memory. If your -m limit is above the actually available memory, the host will start killing processes, one of which might be the Docker daemon (which Docker tries to avoid by changing its OOM priority).
The kernel docs (cgroup/memory.txt) say
If OOM-killer is disabled, tasks under cgroup will hang/sleep in
memory cgroup's OOM-waitqueue when they request accountable memory
For the actual cgroups implementation (which Docker uses as well), you'd have to check the source code.
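If you want to see the flag that --oom-kill-disable actually sets, one way (assuming a cgroup v1 host with the default cgroupfs layout; cgroup v2 lays things out differently) is to look at the container's memory cgroup from the host:
docker inspect -f '{{.HostConfig.OomKillDisable}}' <container>
cat /sys/fs/cgroup/memory/docker/<full-container-id>/memory.oom_control
memory.oom_control reports oom_kill_disable and under_oom, so a hung container like the one in your test should show under_oom 1 while its tasks wait on the OOM waitqueue.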
The job of the OOM killer in Linux is to sacrifice one or more processes in order to free up memory for the system when all else fails. The OOM killer is only enabled if the host has memory overcommit enabled.
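You can check the host's overcommit policy via the vm.overcommit_memory sysctl (0 is the heuristic default, 1 always overcommits, 2 disables overcommit):
sysctl vm.overcommit_memory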
Setting --oom-kill-disable sets the cgroup parameter that disables the OOM killer for that specific container when the condition specified by -m is met. Without the -m flag, the OOM killer is irrelevant.
The -m flag doesn't mean "stop the process when it uses more than X MB of RAM"; it only ensures that the Docker container doesn't consume all host memory, which could force the kernel to start killing processes. With the -m flag, the container is not allowed to use more than a given amount of user or system memory.
When the container hits OOM, it won't be killed, but it can hang and stay in a defunct state, so processes inside the container can't respond until you manually intervene and restart or kill the container. Hope this helps clear up your questions.
For more details on how the kernel acts on OOM, check the Linux OOM management and Docker memory limitations pages.

QEMU fails to initialize NVMe device

I want to learn about the NVMe driver in Linux, but I don't have a physical NVMe drive, so I think QEMU is currently my only choice. I set up the system with these steps, logged in as "root":
Built QEMU 2.2.1 from source code cloned from the stable branch:
git clone -b stable-2.2 git://git.qemu-project.org/qemu
./configure --enable-linux-aio --target-list=x86_64-softmmu
make clean
make -j8
make install
Created an image and installed CentOS 6.6 into it:
qemu-img create -f raw ./vdisk/16GB.img 16G
qemu-system-x86_64 -m 1024 -cdrom ./vdisk/CentOS-6.6-x86_64-minimal.iso -hda ./vdisk/16GB.img
Ran CentOS 6.6 in QEMU with an NVMe device:
qemu-system-x86_64 -m 1024 -hda ./vdisk/16GB.img -device nvme
But it shows the error message below:
qemu-system-x86_64: -device nvme: Device initialization failed.
qemu-system-x86_64: -device nvme: Device 'nvme' could not be initialized
I also ran CentOS 6.6 in QEMU without the NVMe device, and it runs just fine:
qemu-system-x86_64 -m 1024 -hda ./vdisk/16GB.img
So, how can I get more debug information on this error? Or, if you have had a similar experience, how did you solve it?
Thanks,
-Crane
Found the solution: create an image for the NVMe device, and start QEMU like this:
qemu-system-x86_64 -m 2048 -hda ./vdisk/16GB.img -drive file=./vdisk/nvme_dut.img,if=none,id=drv0 -device nvme,drive=drv0,serial=foo --enable-kvm -smp 2
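The backing file for the NVMe drive has to exist before QEMU is started; it can be created the same way as the main disk image (the 8G size here is just an example):
qemu-img create -f raw ./vdisk/nvme_dut.img 8G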

Using qemu to boot OpenSUSE (or any other OS) with custom kernel?

This was flagged as a possible duplicate, but I could not find an answer there, so posting here.
I want to run OpenSUSE as a guest with a custom kernel image that is on my host machine. I'm trying:
$ qemu-system-x86_64 -hda opensuse.img -m 512 \
    -kernel ~/kernel/linux-git/arch/x86_64/boot/bzImage \
    -initrd ~/kernel/linux-git/arch/x86_64/boot/initrd.img -boot c
But it boots into BusyBox instead. Using uname -a shows Linux (none). Also, using -append "root=/dev/sda" (as suggested on the link above) does not seem to work. How do I tell the kernel image to boot with OpenSUSE?
I have OpenSUSE installed into opensuse.img, and:
$ qemu-system-x86_64 -hda opensuse.img -m 512 -boot c
boots it with the stock kernel.
Most virtual machines are booted from a disk image or an ISO file, but KVM can directly load a Linux kernel into memory, skipping the bootloader. This means you don't need an image file containing the kernel and boot files. Instead, you can run a kernel directly like this:
qemu-kvm -kernel arch/x86/boot/bzImage -initrd initramfs.gz -append "console=ttyS0" -nographic
These flags directly load a kernel and initramfs from the host filesystem without the need to generate a disk image or configure a bootloader.
The optional -initrd flag loads an initramfs for the kernel to use as the root filesystem.
The -append flag adds kernel parameters and can be used to enable the serial console.
The -nographic option restricts the virtual machine to just a serial console and therefore keeps all test kernel output in your terminal rather than in a graphical window.
Take a look at the link below; it has a lot more info (thanks to the guy who wrote all that):
http://blog.vmsplice.net/2011/02/near-instant-kernel-development-cycle.html
On ARM boards such as the Raspberry Pi, booting with a custom kernel usually looks like this:
qemu-system-arm -kernel kernel-qemu -cpu arm1176 -m 256 -M versatilepb -no-reboot -serial stdio -append "root=/dev/sda2 panic=1" -hda 2013-05-25-wheezy-raspbian.img
Here -hda is your SUSE image. You have to find out which partition your rootfs is on; you can check with:
fdisk -l <your image>
If there is only one partition, pass root=/dev/sda; if it is on the second one, pass root=/dev/sda2.
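For a raw image, fdisk can read the partition table straight from the file:
fdisk -l opensuse.img
For qcow2 and other formats, one option is to attach the image through qemu-nbd first (the /dev/nbd0 device name is just an example):
sudo modprobe nbd
sudo qemu-nbd -c /dev/nbd0 opensuse.img
sudo fdisk -l /dev/nbd0
sudo qemu-nbd -d /dev/nbd0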
I don't think an initrd image is needed here; the kernel will usually mount the main rootfs directly when you boot it.
So try this:
qemu-system-x86_64 -hda opensuse.img -m 512 -kernel ~/kernel/linux-git/arch/x86_64/boot/bzImage -append "root=/dev/sda" -boot c
Note: check exactly which partition your rootfs is on, then pass the matching /dev/sda*.
I'm not sure; just try the one above. Also, you mention that uname -a reports "Linux (none)": that's because you have to set a default hostname while configuring your kernel; otherwise it is taken as "(none)" by default.
