I dockerized a component that follows a process model. The master process forks itself many times. I want to establish a cgroup hierarchy inside the docker container to vary the CPU and memory limit on a per process basis.
Is there a way I can do this without using '--privileged' or 'CAP_SYS_ADMIN'?
Is there a way I can make the cgroup that the container belongs to the root of the cgroup subsystem that I am implementing for the processes? (Divide the resources allocated to the container among the processes.)
The conclusion that I came to was that there is no current solution for this, since neither Docker nor the Linux kernel supports cgroup virtualization, and some form of cgroup virtualization is needed in order to implement cgroups inside a container.
LXC does this using a FUSE-based solution called lxcfs: https://linuxcontainers.org/lxcfs/introduction/
Also, there is a kernel patch that adds cgroup namespaces which, as far as I can see, has not been approved: https://lwn.net/Articles/605903/.
Related
Background: I am running a docker container which needs to load/remove a kernel module which makes USB devices attached to a remote server available on the host which I then want to make available in the container.
It works when running the container with --privileged and bind mounts for /lib/modules and /dev.
Now I want to remove privileged mode and just allow the minimum necessary access. I tried --cap-add=all as a start, but that doesn't seem enough. What else does --privileged allow?
Setting privileged should modify:
capabilities: removing any capability restrictions
devices: the host devices will be visible
seccomp: removing restrictions on allowed syscalls
apparmor/selinux: policies aren't applied
cgroups: I don't believe the container is limited within a cgroup
That's from memory; I might be able to find more by digging in the code if this doesn't point you to your issue.
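For the module-loading/USB case described above, a less-privileged starting point might look something like the sketch below. This is only an illustration under assumptions not in the question: the image name and device path are placeholders, and CAP_SYS_MODULE is the capability module loading requires; depending on the driver you may still need additional capabilities or device rules.
# grant only module loading plus the specific device, instead of --privileged
docker run --rm -it \
  --cap-add SYS_MODULE \
  -v /lib/modules:/lib/modules:ro \
  --device /dev/ttyUSB0:/dev/ttyUSB0 \
  your-image          # placeholder image name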
If everything is "just" a file in linux, how do files/nodes in /dev differ from other files such that docker must handle them differently?
What does docker do differently for device files? I expect it to be a shorthand for a more verbose bind command?
In fact, after just doing a regular bind mount for a device file such as --volume /dev/spidev0.0:/dev/spidev0.0, the user gets a "permission denied" error within the Docker container when trying to access the device. When binding via --device /dev/spidev0.0:/dev/spidev0.0, it works as expected.
The Docker run reference page has a link to Linux kernel documentation on the cgroup device whitelist controller. In several ways, a process running as root in a container is a little bit more limited than the same process running as root on the host: without special additional permissions (capabilities), you can't reboot the host, mount filesystems, create virtual NICs, or any of a variety of other system-administration tasks. The device system is separate from the capability system, but it's in the same spirit.
The other way to think about this is as a security feature. A container shouldn't usually be able to access the host's filesystem or other processes, even if it's running as root. But if the container process can mknod kmem c 1 2 and access kernel memory, or mknod sda b 8 0 guessing that the host's hard drive looks like a SCSI disk, it could in theory escape these limitations by directly accessing low-level resources. The cgroup device limit protects against this.
Since Docker is intended as an isolation system where containers are restricted environments that can't access host resources, it can be inconvenient at best to run tasks that need physical devices or host files. If Docker's isolation features don't make sense, then the process might run better directly on the host, without involving Docker.
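To make the device-cgroup point concrete, here is a small sketch using the spidev device from the question (alpine is just an arbitrary example image, and the whitelist file path applies to cgroup v1 setups):
# bind mount only: the node exists in the container, but the device cgroup still denies access
docker run --rm -it -v /dev/spidev0.0:/dev/spidev0.0 alpine sh
# --device creates the node AND adds an allow rule ("c <major>:<minor> rwm") to the container's
# device cgroup, visible in /sys/fs/cgroup/devices/devices.list inside the container
docker run --rm -it --device /dev/spidev0.0:/dev/spidev0.0 alpine sh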
This is in reference to https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory. I have already created working containers, running Docker version 18.05.0-ce on a Raspberry Pi (64-bit) using Raspbian Jessie Lite (essentially GUI-less Debian Jessie).
The documentation claims that you can just pass memory/cpu flags on the docker run command. But when I try something like docker run -it --name test --memory=512m container_os, it says:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap
I get a similar message about not having cpuset mounted if I pass a cpu-based flag, such as --cpuset-cpus. This obviously means that I don't have these different cgroups mounted for Docker to manage resources correctly, right?
Now referring to https://docs.docker.com/config/containers/runmetrics/#control-groups, I read the section about cgroups, but it wasn't super helpful to my understanding of the situation. So rather than just trying random kernel commands, does anyone with experience have a step-by-step explanation of how to do this the right way?
After quite a bit of research, I figured this out, in-case anyone else out there has this same problem.
In reference to https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt, which is extremely helpful for understanding cgroups, a kernel with all of the proper support should have most of the cgroups Docker needs mounted by default. If not, there's a command to do so:
From section 2.1 - Basic Usage
"To mount a cgroup hierarchy with all available subsystems, type:
mount -t cgroup xxx /sys/fs/cgroup
The "xxx" is not interpreted by the cgroup code, but will appear in
/proc/mounts so may be any useful identifying string that you like.
Note: Some subsystems do not work without some user input first. For instance,
if cpusets are enabled the user will have to populate the cpus and mems files
for each new cgroup created before that group can be used."
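As an illustration of that note, assuming the cpuset controller is mounted at /sys/fs/cgroup/cpuset and that the files carry the cpuset. prefix (they may be plain cpus/mems depending on how the hierarchy is mounted), creating and using a new group looks roughly like this; "mygroup" is an arbitrary example name:
mkdir /sys/fs/cgroup/cpuset/mygroup
echo 0-1 > /sys/fs/cgroup/cpuset/mygroup/cpuset.cpus   # CPUs the group may run on (adjust to your machine)
echo 0   > /sys/fs/cgroup/cpuset/mygroup/cpuset.mems   # memory nodes it may allocate from
echo $$  > /sys/fs/cgroup/cpuset/mygroup/tasks         # move the current shell into the group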
For this particular case, however, trying to mount an individual cgroup, such as cpuset, results in an error saying that the "cpuset special device does not exist". This is because the devs of Raspbian Jessie 8 didn't configure the kernel to support the cgroups that Docker uses for resource management by default. This can easily be determined by typing the docker info command, and seeing this at the bottom of the output:
WARNING: No swap limit support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
WARNING: No cpuset support
These are all of the cgroup features that Docker needs in order to manage memory and CPU resources for containers. Testing whether your kernel supports something like cpuset is easy: if the file /proc/filesystems has an entry that says nodev cpuset, your kernel has cpuset support. If you're reading this, though, it probably isn't configured in your kernel, and that calls for a kernel reconfiguration and rebuild, which is not so easy.
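A few quick ways to check what your kernel actually supports (note that /proc/config.gz only exists if the kernel exposes its own config; otherwise look at the config file under /boot):
grep cpuset /proc/filesystems            # "nodev cpuset" means cpuset support is compiled in
cat /proc/cgroups                        # lists every cgroup controller the running kernel knows about
zgrep -E 'CONFIG_CPUSETS|CONFIG_MEMCG' /proc/config.gz 2>/dev/null || \
  grep -E 'CONFIG_CPUSETS|CONFIG_MEMCG' /boot/config-$(uname -r)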
With the right kernel configurations, it just works automatically like it seems from the Docker Docs.
As far as I understand, at the moment Docker for Mac requires that I decide upfront how much memory and how many CPU cores to statically allocate to the virtualized Linux it runs on.
So that means that even when Docker is idle, my other programs will run on (N-3) CPU cores and (M-3)GB of memory. Right?
This is very suboptimal!
In Linux, it's ideal because a container is just another process, so it uses and releases system memory as containers start and stop.
Is my mental model correct?
Will one day Docker for Mac or Windows dynamically allocate CPU and Memory resources?
The primary issue here is that, for the moment, Docker can only run Linux containers on Linux. That means that on OS X or Windows, Docker is running in a Linux VM, and its ability to allocate resources is limited by the facilities provided by the virtualization software in use.
Of course, Docker can run natively on Windows, as long as you want to run Windows containers, and in that situation it may more closely match the Linux "a container is just a process" model.
It is possible that this will change in the future, but that's how things stand right now.
So that means that even when Docker is idle, my other programs will run on (N-3) CPU cores and (M-3)GB of memory. Right?
I suspect that's true for memory. I believe that if the docker vm is idle it isn't actually using much in the way of CPU resources (that is, you are not dedicating CPUs to the VM; rather, you are setting maximum limits on how many resources the vm can consume).
I'm trying to install LXC (0.7.4.1) on my Debian 6, but when I run lxc-checkconfig I get "Cgroup memory controller: missing"
root@lxcsrv01:~# lxc-checkconfig
Kernel config /proc/config.gz not found, looking in other places...
Found kernel config file /boot/config-2.6.32-5-686
--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
Network namespace: enabled
Multiple /dev/pts instances: enabled
--- Control groups ---
Cgroup: enabled
Cgroup namespace: enabled
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: missing
Cgroup cpuset: enabled
--- Misc ---
Veth pair device: enabled
Macvlan: enabled
Vlan: enabled
File capabilities: enabled
enabled
Note : Before booting a new kernel, you can check its configuration
usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig
According to a Google search I need to recompile my kernel, but I don't know how.
Can someone explain how to do this?
Best regards
The kernel of Debian 6 has no memory cgroup feature.
However you can run lxc without it.
If you NEED the memory cgroup, it's easy to install a newer kernel from backports (a command-level sketch follows the steps below):
Add the backports apt line.
Run "apt-get install linux-image-3.2.0-0.bpo.4-amd64" (or -686 for i386).
Add the kernel boot option "cgroup_enable=memory" to your bootloader configuration (e.g. /etc/default/grub) to enable it.
Reboot.
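A command-level sketch of those steps, assuming Debian 6 "squeeze" and the historical squeeze-backports repository location (verify the repository line and the exact kernel package name for your system):
# 1. add the backports apt line and refresh the package lists
echo 'deb http://backports.debian.org/debian-backports squeeze-backports main' >> /etc/apt/sources.list
apt-get update
# 2. install the backports kernel
apt-get -t squeeze-backports install linux-image-3.2.0-0.bpo.4-amd64   # or the -686 image on i386
# 3. enable the memory controller at boot: append cgroup_enable=memory to
#    GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate the grub config
update-grub
# 4. reboot into the new kernel
reboot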
Or, if you'd like to re-compile the kernel, you can use Debian's kernel-package system:
http://newbiedoc.sourceforge.net/system/kernel-pkg.html
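If you do go the rebuild route, the kernel-package workflow generally has this shape; the source package version shown is the Debian 6 one, so treat it as an example and see the linked document for details:
apt-get install linux-source kernel-package fakeroot build-essential libncurses5-dev
cd /usr/src
tar xjf linux-source-2.6.32.tar.bz2 && cd linux-source-2.6.32
make menuconfig                                   # enable the memory resource controller (CONFIG_CGROUP_MEM_RES_CTLR)
fakeroot make-kpkg --initrd --revision=custom.1.0 kernel_image
dpkg -i ../linux-image-*.deb
reboot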
I am having similar memory cgroup issues, and have looked into it quite a bit. I wrote a blog entry about it here:
http://blog.raymond.burkholder.net/index.php?/archives/639-Debian-Stretch-LXC-Memory-Controller.html
In summary, the kernel is compiled with the necessary memory cgroup support. The fly in the ointment: lxc-checkconfig has a bug in it and will not properly show the status of the memory cgroup. CONFIG_CGROUP_MEM_RES_CTLR=y applies only to older kernels (sometime before 3.6, I believe).
I end up making two adjustments: one to /boot/config-$version, and one to /etc/default/grub. Both are explained in the article.
But bottom line, the general recommendation appears to be: don't enable it if you really don't need to perform memory limitation management on containers. There is some performance and memory overhead.
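If you decide you do need it, the usual pattern on kernels where the controller is compiled in but disabled by default is to enable it on the kernel command line; this sketch assumes GRUB, and swapaccount=1 is optional swap accounting:
cat /proc/cgroups                     # the "memory" row shows whether the controller is present and enabled
# append cgroup_enable=memory (and optionally swapaccount=1) to GRUB_CMDLINE_LINUX
# in /etc/default/grub, then:
update-grub && reboot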
Update the kernel from here.
Then reboot your system. The problem should be solved automatically; if not, go to /boot/config-<versionnumber>-generic, for instance /boot/config-3.11.0-13-generic.
There, check whether CONFIG_CGROUP_MEM_RES_CTLR=y is present. If it is, you're fine; otherwise, add it in.
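A quick way to check is to grep the config for the running kernel; as noted in the previous answer, newer kernels call the option CONFIG_MEMCG rather than CONFIG_CGROUP_MEM_RES_CTLR:
grep -E 'CONFIG_CGROUP_MEM_RES_CTLR|CONFIG_MEMCG' /boot/config-$(uname -r)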