If everything is "just" a file in Linux, how do the device nodes in /dev differ from other files such that Docker must handle them differently?
What does Docker do differently for device files? I would expect --device to be shorthand for a more verbose bind-mount command; is it?
In fact, after just doing a regular bind mount for a device file such as --volume /dev/spidev0.0:/dev/spidev0.0, the user gets a "permission denied" within the Docker container when trying to access the device. When binding via --device /dev/spidev0.0:/dev/spidev0.0, it works as expected.
The Docker run reference page has a link to Linux kernel documentation on the cgroup device whitelist controller. In several ways, a process running as root in a container is a little more limited than the same process running as root on the host: without special additional permissions (capabilities), it can't reboot the host, mount filesystems, create virtual NICs, or perform a variety of other system-administration tasks. The device system is separate from the capability system, but it's in the same spirit.
The other way to think about this is as a security feature. A container shouldn't usually be able to access the host's filesystem or other processes, even if it's running as root. But if the container process can mknod kmem c 1 2 and access kernel memory, or mknod sda b 8 0 guessing that the host's hard drive looks like a SCSI disk, it could in theory escape these limitations by directly accessing low-level resources. The cgroup device limit protects against this.
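You can watch the device cgroup do its job with a quick experiment (a sketch, assuming the stock busybox image; Docker's default capability set includes CAP_MKNOD, so creating the node itself is allowed):

docker run --rm busybox sh -c 'mknod /tmp/sda b 8 0 && head -c 512 /tmp/sda'
# mknod succeeds, but the device cgroup denies the subsequent open(),
# so expect an "Operation not permitted" error from the read.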
Since Docker is intended as an isolation system where containers are restricted environments that can't access host resources, it can be inconvenient at best to run tasks that need physical devices or host files. If Docker's isolation features don't make sense for your workload, the process might run better directly on the host, without involving Docker.
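In practice, --device behaves roughly like a bind mount of the device node combined with a matching entry in the device cgroup whitelist, which you can also express by hand. Something like the following should behave similarly (a sketch: 153:0 is spidev's conventional major/minor pair, so verify it with ls -l /dev/spidev0.0, and my-image is a placeholder):

docker run \
  --volume /dev/spidev0.0:/dev/spidev0.0 \
  --device-cgroup-rule='c 153:0 rmw' \
  my-image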
Background: I am running a Docker container which needs to load/remove a kernel module that makes USB devices attached to a remote server available on the host, which I then want to make available in the container.
It works when running the container with --privileged and bind mounts for /lib/modules and /dev.
Now I want to remove privileged mode and just allow the minimum necessary access. I tried --cap-add=all as a start, but that doesn't seem enough. What else does --privileged allow?
Setting privileged should modify:
capabilities: removing any capability restrictions
devices: the host devices will be visible
seccomp: removing restrictions on allowed syscalls
apparmor/selinux: policies aren't applied
cgroups: I don't believe the container is limited within a cgroup
That's from memory; I might be able to find more by digging in the code if this doesn't point you to your issue.
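Putting that list into flags, a rough piecewise approximation of --privileged might look like this (a sketch; my-image is a placeholder, and module loading also relies on CAP_SYS_MODULE, which --cap-add=ALL includes):

# lift capability restrictions, the seccomp syscall filter, and AppArmor,
# and bind-mount host devices and modules into the container:
docker run \
  --cap-add=ALL \
  --security-opt seccomp=unconfined \
  --security-opt apparmor=unconfined \
  -v /dev:/dev \
  -v /lib/modules:/lib/modules \
  my-image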
I've searched through AppArmor's wiki as well as tried Internet searches for "apparmor mount namespace" (and similar). However, I always draw a blank as to how AppArmor deals with them, which is especially odd considering that OCI containers could not exist without mount namespaces. Does AppArmor take mount namespaces into account at all, or does it simply check the filename passed to some syscall?
If a process inside a container switches mount namespaces, does AppArmor take notice at all, or is it simply mount namespace-agnostic in that it doesn't care? For instance, if a container process switches into the initial mount namespace, can I write AppArmor MAC rules to prevent such a process from accessing sensitive host files, while the same files inside its own container are allowed for access?
can I write AppArmor MAC rules to prevent such a process from accessing sensitive host files?
Just don't give the container access to sensitive parts of the host filesystem; that is, don't mount them into the container. If you do, it's out of AppArmor's scope to take care of.
I would say that AppArmor is partially aware of Linux kernel mount namespaces.
I think the attach_disconnected flag in AppArmor is an indication that AppArmor knows whether you are in the main OS mount namespace or in a separate mount namespace.
The attach_disconnected flag is briefly described at this link (despite the warning at the top of the page that it is a draft):
https://gitlab.com/apparmor/apparmor/-/wikis/AppArmor_Core_Policy_Reference
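For orientation, attach_disconnected sits in a profile's header. The sketch below is hypothetical (the profile name and rules are made up) and only shows where the flag goes; it is not a statement about how AppArmor resolves paths across mount namespaces:

# hypothetical profile; not a working policy
profile container-sketch flags=(attach_disconnected) {
  file,                    # broadly allow file access...
  deny /etc/shadow rwklx,  # ...but deny whatever resolves to this path
}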
The following reference, from an Ubuntu AppArmor mailing-list discussion, provides useful related information, although it does not directly answer your question.
https://lists.ubuntu.com/archives/apparmor/2018-July/011722.html
The following references, from a USENIX presentation, describe a proposal to add security namespaces to the Linux kernel for use by frameworks such as AppArmor. They do not directly show how, or whether, AppArmor currently uses kernel mount namespaces for decision making, but they are related enough to be of interest.
https://www.usenix.org/sites/default/files/conference/protected-files/security18_slides_sun.pdf
https://www.usenix.org/conference/usenixsecurity18/presentation/sun
I don't know if my response here is complete enough to be considered a full answer to your questions; however, I don't have enough reputation points to put this into a comment. I also found it difficult to know whether the AppArmor documentation meant "AppArmor policy namespace" or "Linux kernel mount namespace" when the word "namespace" appeared alone.
This is in reference to https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory. I have already created working containers, running Docker version 18.05.0-ce on a Raspberry Pi (64-bit) using Raspbian Jessie Lite (essentially GUI-less Debian Jessie).
The documentation claims that you can just pass memory/cpu flags on the docker run command. But when I try something like docker run -it --name test --memory=512m container_os, it says:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap
I get a similar message about not having cpuset mounted if I pass a cpu-based flag, such as --cpuset-cpus. This obviously means that I don't have these different cgroups mounted for Docker to manage resources correctly, right?
Now referring to https://docs.docker.com/config/containers/runmetrics/#control-groups, I read the section about cgroups, but it wasn't super helpful to my understanding of the situation. So rather than just trying random kernel commands, does anyone with experience have a step-by-step explanation of how to do this the right way?
After quite a bit of research, I figured this out, in case anyone else out there has this same problem.
In reference to https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt, which is extremely helpful for understanding cgroups, a kernel with all of the proper support should have most of the cgroups Docker needs mounted by default. If not, there's a command to mount them:
From section 2.1 - Basic Usage
"To mount a cgroup hierarchy with all available subsystems, type:
mount -t cgroup xxx /sys/fs/cgroup
The "xxx" is not interpreted by the cgroup code, but will appear in
/proc/mounts so may be any useful identifying string that you like.
Note: Some subsystems do not work without some user input first. For instance,
if cpusets are enabled the user will have to populate the cpus and mems files
for each new cgroup created before that group can be used."
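In other words, with all subsystems co-mounted, preparing a new cpuset group would look something like this (a sketch; the group name and the CPU/memory-node values are made up, and it assumes nothing is already mounted at /sys/fs/cgroup):

mount -t cgroup xxx /sys/fs/cgroup              # mount all available subsystems
mkdir /sys/fs/cgroup/mygroup                    # create a new cgroup
echo 0-3 > /sys/fs/cgroup/mygroup/cpuset.cpus   # CPUs the group may use
echo 0 > /sys/fs/cgroup/mygroup/cpuset.mems     # memory nodes it may use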
For this particular case, however, trying to mount an individual cgroup, such as cpuset, results in an error saying that the "cpuset special device does not exist". This is because the devs of Raspbian Jessie 8 didn't configure the kernel to support the cgroups that Docker uses for resource management by default. This can easily be determined by typing the docker info command, and seeing this at the bottom of the output:
WARNING: No swap limit support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
WARNING: No cpuset support
These are all of the cgroups that Docker needs in order to manage memory and CPU resources for containers. Testing whether your kernel supports something like cpuset is easy: if the file /proc/filesystems has an entry that says nodev cpuset, your kernel has cpuset support. If you're reading this, though, it probably means cpuset simply isn't configured in your kernel, which would call for a kernel reconfiguration and rebuild, and that is not so easy.
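For example (assuming a shell on the Pi):

grep cpuset /proc/filesystems   # "nodev cpuset" means the kernel supports cpusets
cat /proc/cgroups               # lists the cgroup controllers this kernel exposes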
With the right kernel configuration, it just works automatically, as the Docker docs suggest.
I've read that on Linux, Docker uses the underlying Linux kernel to create containers. So this is an advantage, because resources aren't wasted on creating virtual machines that each contain an OS.
I'm confused, though, as to why most Dockerfiles specify the OS in the FROM line of the Dockerfile. I thought that as it was using the underlying OS, then the OS wouldn't have to be defined.
I would like to know what actually happens if the OS specified doesn't match the OS flavour of the machine it's running on. So if the machine is CentOS but the Dockerfile has FROM Debian:latest in the first line, is a virtual machine containing a Debian OS actually created?
In other words, does this result in a performance reduction because it needs to create a virtual machine containing the specified OS?
I'm confused, though, as to why most Dockerfiles specify the OS in the FROM line of the Dockerfile. I thought that as it was using the underlying OS, then the OS wouldn't have to be defined.
I think your terminology may be a little confused.
Docker indeed uses the host kernel, because Docker is nothing but a way of isolating processes running on the host (that is, it's not any sort of virtualization, and it can't run a different operating system).
However, the filesystem visible inside the container has nothing to do with the host. A Docker container can run programs from any Linux distribution. So if I am on a Fedora 24 host, I can build a container that uses an Ubuntu 14.04 userspace by starting my Dockerfile with:
FROM ubuntu:14.04
Processes running in this container are still running on the host kernel, but their entire userspace comes from the Ubuntu distribution. This isn't another "operating system" -- it's still the same Linux kernel -- but it is a completely separate filesystem.
The fact that my host is running a different kernel version than you would find on an actual Ubuntu 14.04 host is almost irrelevant. There are going to be a few utilities that expect a particular kernel version, but most applications just don't care as long as the kernel is "recent enough".
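You can see this for yourself (the version strings will of course differ on your machine):

uname -r                                  # the host's kernel version
docker run --rm ubuntu:14.04 uname -r     # the same kernel, seen from the container
docker run --rm ubuntu:14.04 cat /etc/lsb-release   # but an Ubuntu 14.04 userspace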
So no, there is no virtualization in Docker. Just various sorts of isolation: processes, filesystem, networking, etc.
Due to the peculiar nature of the application, I'm thinking of running servers such as Apache, Tomcat from within a chroot environment.
Using schroot and debootstrap, I'm able to create a clone of my Ubuntu 10.04 (minimal Ubuntu) inside the chroot directory. I've installed Tomcat and Apache inside the chroot. But how do I access these two servers?
Can I access them like a normal Apache/Tomcat installation on the parent server?
Can the parent OS access the Apache/Tomcat of the chroot OS?
First, which of these options are possible? Second, what caveats should I handle with each of these options?
I want something like
Internet ---> [Main host Ubuntu 10.04 Apache ----> (chroot ubuntu Tomcat) ]
chroot is one of the simplest forms of virtualization. If your application is security-sensitive, you might consider running a more full-featured solution, such as OpenVZ, Xen, KVM, VirtualBox, or a commercial product such as VMware.
That being said, you should really view your chrooted OS as just another host on your network. When using plain chroot, you can access it as localhost (127.0.0.1) on whatever port number you assign to it (the chrooted system effectively shares port assignments with the parent system), while other virtualization solutions allow you to assign a separate IP to each virtual machine and run it much as you would a separate physical box.
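So, from the parent OS, the chrooted Tomcat is reachable as an ordinary local service (a sketch; 8080 is Tomcat's usual default port, assumed here):

# from the parent OS, outside the chroot:
curl http://127.0.0.1:8080/   # reaches the Tomcat running inside the chroot
# the host Apache can then reverse-proxy to it, e.g. via mod_proxy's ProxyPass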
chroot is a fairly "weak" security solution, as parent and child share a lot of resources almost without limitation (memory, CPU, process pool, disk space, privileges, sockets, etc.). The only real limitation is filesystem access (chrooted applications can access only a portion of the whole filesystem), although that does provide some degree of isolation.