What's inside a Docker image/container?

Given that Docker images/containers come in various flavours (Ubuntu, CentOS, CoreOS, etc.), I'm curious what actually makes up an image/container, and what is shared with the host OS. Where is the dividing line?
For example, I can download the base Ubuntu image and launch it on a CentOS host. Then, when I poke around inside the Ubuntu container, it looks and feels like an Ubuntu server (filesystem layout, etc.). But if I run a uname command I see the kernel of the CentOS host.
Obviously I understand that the underlying kernel is shared by all containers on the same host. But what else is shared with the host OS, and what is part of the image/container?
E.g. the kernel is part of the host, the filesystem layout is part of the image/container... Is there a spec that defines this?

It can be helpful to distinguish between images and containers (docs). An image is static and lives only on disk. A container is a running instance of an image and it includes its own process tree as well as RAM and other runtime resources.
An image is a logical grouping of layers plus metadata about what to do when creating a container and how to assemble the layers. Part of that metadata is that each layer knows its parent's ID.
So, what goes into a layer? The files (and directories) you've added to the parent. There are also special files ("whiteout") that indicate that something was deleted from the parent.
When you docker run an image, docker creates a container: it unpacks all the layers in the correct order, creating a new "root" file system separate from the host. docker also reads the image metadata and starts either the "entrypoint" or "command" specified when the image was created -- that starts a new process sub-tree. From inside the container, that first process seems like the root of the tree, but from the host you can see it is a subtree of processes.
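For example (a rough sketch; the image name nginx and container name web are placeholders, not anything from the question), you can see that same first process from both sides:
$ docker run -d --name web nginx
$ docker exec web cat /proc/1/comm    # inside the container, PID 1 is the process started from the entrypoint
$ ps -ef | grep nginx                 # on the host, that same process is just part of the normal process tree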
The root file system is what makes one Linux distro different from another (there can be some kernel module differences as well, and bootloader/boot file system differences, but these are usually invisible to the running processes). The kernel is shared with the host and is, in fact, still managing its usual responsibilities inside the container. But the root file system is different, and so when you're inside the container, it looks and feels like whatever distro was in the Docker image.
The container not only has its own file system and process tree, but also has its own logical network interface and, optionally, its own allocation of RAM and CPU time. You're in control over the container though, as the operator, so you can decide to share the host's network interface with the container, give it unlimited access to RAM and CPU, and even mount devices, files and directories from the host into the container. The default is to keep things separate, but you have the power to break the isolation model as much as you need to.
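As a hedged illustration (image name and host path are made up), the same image can be run fully isolated or with parts of the host deliberately shared:
$ docker run -d --name isolated nginx                                            # default: own network namespace, no host mounts
$ docker run -d --name shared --network host -v /var/log:/host-logs:ro nginx     # share the host network and a host directory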

Docker was originally a wrapper over LXC (Linux Containers) and now ships its own runtime built on the same kernel features; the LXC and kernel documentation on namespaces and cgroups will tell you in detail what is shared and what is not.
In general, the host machine sees/contains everything inside the containers, from the file system to processes. You can issue a ps command on the host and see the processes running inside the container.
Remember that Docker containers are not VMs: everything is actually running natively on the host and is using the host kernel directly. Each container gets its own set of namespaces (similar to chroot jails of old, but more complete). These kernel features make sure containers only see their own processes, have their own file system layered onto the host file system, and get a networking stack that pipes into the host networking stack.
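For instance (the container name is illustrative), you can find a container's main process from the host and list the namespaces it lives in:
$ PID=$(docker inspect --format '{{.State.Pid}}' mycontainer)
$ sudo ls -l /proc/$PID/ns      # each symlink here is one of the container's namespaces (pid, net, mnt, uts, ...)
$ docker top mycontainer        # the container's processes, as seen from the host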

Related

Understanding the technical (Docker) container architecture

I am new to containers and would like to gain a good understanding of how container technology (Docker) is built up from scratch. I have to write a paper and hope that I have understood every important thing correctly so far.
The following diagram is made by me and shows my current understanding of containers.
Obviously we need an OS with a Kernel that allows us to use the hardware. For Docker this is Linux. Docker for Windows uses a VM with Linux for that.
On top of our Linux OS we then run our Docker Engine. The Docker Engine is in charge of starting, building, and configuring our images and containers. Most importantly, the Docker Engine handles everything that has to do with isolating containers; for example, it manages how namespaces and cgroups are used so that every container has its own full filesystem.
Then we have our actual containers. Containers themselves almost always need a kind of OS of their own. This is mostly just a very compact one like Alpine or BusyBox. These collect a small number of standard tools such as file, tar, and grep that most software needs. This compact OS uses the kernel from our full Linux OS; containers don't have their own kernel.
On top of the compact OS we then place our actual piece of software, such as Node.js or an NGINX server. This software only uses the compact OS, which in turn uses the kernel from our full Linux OS. And all data or modifications generated at runtime end up in the writeable layer of our container.
And if I understood correctly, our container, or everything that runs in our container, is not using or interacting with our full Linux OS but just with its kernel?
I also don't quite understand how the writeable layer in a container works. For example, how does my software know that a modified file from a read-only layer is now present in the writeable layer and that it should use that copy?
I would really appreciate some corrections or suggestions on what I have missed so far. Thank you.
And if I understood correctly, our container, or everything that runs in our container, is not using or interacting with our full Linux OS but just with its kernel?
The containers are just processes. To the kernel, the Docker daemon, a Node.js application, and Nginx are all just processes. That's why containers don't have their own kernels. The difference between the Docker daemon process (and other processes on the host) and processes running within containers is in their scope (controlled by namespaces). Processes in containers run in isolation and don't see anything outside their namespace. There are many different namespaces; for example, the pid namespace is one of them, and it limits the visibility of other processes. That's why the ps command in a container doesn't show processes from the host or from other containers. Namespaces are a kernel feature, and they are mostly about what a process can see and access, while cgroups apply limits on CPU and memory usage.
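You can reproduce the pid-namespace part without Docker at all; here is a minimal sketch using the util-linux unshare tool:
$ sudo unshare --pid --fork --mount-proc bash    # start a shell in a new pid namespace with its own /proc
$ ps aux                                         # inside, only this shell and its children are visible
$ exit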
I hope this helps you somehow; at least I tried to put more attention on the kernel, because Docker is just a daemon that spins up new processes with configured namespaces, cgroups, and their own filesystem.
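Regarding the writeable-layer question above: a minimal overlayfs sketch (directory and file names are made up) shows how writes land in an upper layer while the read-only lower layer stays untouched, which is essentially what the container's writeable layer does on top of the image layers:
$ mkdir lower upper work merged
$ echo original > lower/file.txt
$ sudo mount -t overlay overlay -o lowerdir=$PWD/lower,upperdir=$PWD/upper,workdir=$PWD/work $PWD/merged
$ echo modified > merged/file.txt    # the new version is written to upper/file.txt; lower/file.txt is unchanged
$ cat merged/file.txt                # reads now see the copy from the upper (writeable) layer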
Here are some links that might be useful:
What even is a container: namespaces and cgroups
How containers work: overlayfs
See also other posts about containers on https://jvns.ca (I recommend it because Julia explains it in simple words and even provides illustrations.)
If you want to go deeper, I'd suggest looking at the Namespaces: from chroot() to containers slides and reading the article about creating your own containers.

Docker - Access host /proc

This is a duplicate of a post I created in the Docker forum. I am going to close this or the other one once the problem is solved. But since no one has answered in the Docker forum and my problem persists, I'll post it again, looking forward to getting an answer.
I would like to expose a server monitoring app as a Docker container. The app I have written relies on /proc to read system information like CPU utilization or disk stats. Thus I have to forward the information provided in the host's /proc virtual file system to my Docker container.
So I made a simple image (using the first or second intro on the Docker website: Link) and started it:
docker run -v=/proc:/host/proc:ro -d hostfiletest
Assuming the running container could read from /host/proc to obtain information about the host system.
I fired up a console inside the container to check:
docker exec -it {one of the funny names the container gets} bash
And checked the content of /host/proc.
The easiest way to check was to read /host/proc/sys/kernel/hostname, which should yield the hostname of the VM I am working on.
But I get the hostname of the container, while /host/proc/uptime gives me the correct uptime of the VM.
Do I miss something here? Maybe something conceptual?
Docker version 17.05.0-ce, build 89658be running on Linux 4.4.0-97-generic (VM)
Update:
I found several articles describing how to run a specific monitoring app inside a container using the same approach I mentioned above.
Update:
Just tried using an existing Ubuntu image - same behavior. Running the image with --privileged and --pid=host doesn't help.
Greetings
Peepe
The reason for this problem is that /proc is not a normal filesystem. According to procfs, it is an interface to access kernel data and system information. This interface provides a file-like structure, which can mislead people into thinking it is a normal directory.
Files in /proc are also not normal files. They are empty (size = 0). You can check for yourself.
$ stat /proc/sys/kernel/hostname
File: /proc/sys/kernel/hostname
Size: 0 Blocks: 0 IO Block: 1024 regular empty file
So the file doesn't hold any data, but when you read it, the kernel dynamically returns the corresponding system information.
To answer your question: /proc/sys/kernel/hostname is just an interface to access the hostname. Depending on where you access that interface, on the host or in the container, you will get the corresponding hostname. This also applies when you use a bind mount (-v /proc:/host/proc:ro), since a bind mount only provides an alternative view of /proc. If you read /host/proc/sys/kernel/hostname, the kernel returns the hostname of the box you are in (the container).
In short, think of /proc/sys/kernel/hostname as a mirror: if your host stands in front of it, it reflects the host; if the container does, it reflects the container.
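So, with the bind mount from the question, you would see something like this from inside the container (paths as in the question):
$ cat /host/proc/sys/kernel/hostname    # hostname is namespace-aware, so this returns the container's hostname
$ cat /host/proc/uptime                 # uptime is not namespaced, so this returns the host's uptime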
I know it's a few months later now, but I came across the same problem today.
In my case I was using psutil in Python to read disk stats of the host from inside a Docker container.
The solution was to mount the whole host file system read-only into the Docker container with -v /:/rootfs:ro and to point psutil at the proc path with psutil.PROCFS_PATH = '/rootfs/proc'.
Now psutil.disk_partitions() lists all partitions from the host file system. As the hostname is also contained within the proc hierarchy, I guess this also works for other host system information, as long as the retrieving command points to /rootfs/proc.
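A sketch of that setup (the image and container names are placeholders) looks like this:
$ docker run -d --name monitor -v /:/rootfs:ro my-monitoring-image
# inside the container, the monitoring code then sets:
#   psutil.PROCFS_PATH = '/rootfs/proc'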

How do Docker images and layers work?

Actually, I am new to the Docker ecosystem and I am trying to understand how exactly a container works on top of a base image. Does the base image get loaded into the container?
I have been through the Docker docs, where it says that a read-write container layer is formed on top of the image layers. But what I am confused about is this: the image is immutable, right? Then where is the image running? Is it inside the Docker Engine in the VM, and how does the container actually come into play?
How exactly does a container work on top of a base image?
Does the base image get loaded into the container?
Docker containers wrap a piece of software in a complete filesystem that contains everything needed to run: code, runtime, system tools, system libraries – anything that can be installed on a server.
Like FreeBSD Jails and Solaris Zones, Linux containers are self-contained execution environments -- with their own, isolated CPU, memory, block I/O, and network resources (using the cgroups kernel feature) -- that share the kernel of the host operating system. The result is something that feels like a virtual machine, but sheds all the weight and startup overhead of a guest operating system.
That being said, each distribution has its own official Docker image (in the Docker library) that ships with minimal binaries, follows Docker's best practices, and is ready to build on.
I am confused about this: the image is immutable, right? Where is the image running? Is it inside the Docker Engine in the VM, and how does the container actually come into play?
Docker originally used AUFS (and still can on Debian) and uses AUFS-like union file systems such as overlay on other distributions. These file systems provide layering. Each image consists of layers, and these layers are read-only. Each container gets a read/write layer on top of its image layers. Read-only layers are shared between containers, so you save storage space. The container sees the union mount of all image layers plus its read/write layer.
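You can see those layers for yourself; a rough sketch, using the ubuntu image as an example and a placeholder container name:
$ docker history ubuntu                                              # one line per layer, with the instruction that created it
$ docker image inspect --format '{{json .RootFS.Layers}}' ubuntu     # the read-only layer digests that make up the image
$ docker inspect --format '{{json .GraphDriver}}' <container>        # for a running container: the image (lower) dirs plus its read/write (upper) dir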

Docker Host Security - Can container run dangerous code or change host from inside of a container?

Let's say I pull a new image from a hub repository and run it without looking at the contents of the Dockerfile. Can the container or image affect my host in any way?
Please let me know, because I will be running a list of images from user-inputted image names on my server. I am worried it could affect the server/host.
With a default execution of an image, the answer is a conditional no. The kernel capabilities are limited, the filesystem is restricted, the process space is isolated, and it's on a separate bridged network from the host. Anything that allows access back to the host would be a security vulnerability.
The conditional part is that it can use up all your CPU cycles, it can exhaust your memory, it can fill your drive, and it can send network traffic out from your machine NAT'ed to your IP address. In other words, by default, there's nothing preventing the container from a DoS attack on your host.
Docker does have the ability to limit many of these things, including capping memory, restricting CPUs or prioritizing processes, and there are quota solutions for the filesystem.
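For example (the values and image name are illustrative):
$ docker run -d --memory 256m --cpus 0.5 --pids-limit 100 some-image    # cap RAM, CPU time, and the number of processes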
You can also go the other direction and expose the host to the container, effectively creating security vulnerabilities. This would include mounting host volumes, especially the docker.sock inside the container, removing kernel capability restrictions with --privileged, and removing network isolation with --net=host. Doing any of these with a container turns off the protections that Docker provides by default.
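Concretely, these are the sorts of flags to avoid when running untrusted images (the image name is a placeholder):
$ docker run --privileged some-image                                     # removes kernel capability restrictions
$ docker run -v /var/run/docker.sock:/var/run/docker.sock some-image     # gives the container control of the Docker daemon
$ docker run --net=host some-image                                       # removes network isolation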
Docker does have a lower level of isolation than a virtual machine due to the way it shares the kernel with the host. So if the code you are running contains a kernel or physical hardware exploit, that could access the host. For this reason, if you are running untrusted code, you may want to look into linuxkit, which provides a lightweight container based operating system to run inside a vm. This is used to provide the moby os that runs under hyperv/xhyve on docker for windows/mac.

LXC without chroot

Is there any way to use LXC for resource management using process groups without creating containers? I am working on a service that runs arbitrary code inside a sandbox, for which I am only interested in hardware resource management. I don't want any chrooting; I just want these process groups to have access to the main file system.
I was told that LXC is lightweight, but all the examples that I see create a new container (i.e. a dir with a full OS) for every LXC process. I don't really see how this is much lighter than any other VM solution.
So is there any way that LXC can be used to control and manage multiple process groups, without creating separate containers for each and every one of them?
LXC isn't a monolithic system. It's a collection of kernel features that can be used to isolate processes in various different ways, and a userspace tool to use all of these features together to create full-fledged containers. But the individual features are still usable on their own, without LXC. Furthermore, LXC does not require a chroot, and even when you give it a chroot, you can bind-mount directories from the host system into the container, sharing those particular directory trees between the host and the container.
For instance, cgroups are used by LXC to set resource limits on containers. But they can be used to set resource limits on groups of processes without using the LXC tools at all. You can manipulate /sys/fs/cgroup/memory or /sys/fs/cgroup/cpuacct directly to put processes into cgroups that limit the amount of memory or CPU they are allowed to use. Or, if you're on a system using systemd, you can control the memory limit for a group of processes with MemoryLimit=200M or the like in the .service file for a given service.
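As a minimal sketch (assuming cgroup v1 paths; the group name is made up), you can cap a shell's memory by hand:
$ sudo mkdir /sys/fs/cgroup/memory/mygroup
$ echo $((200*1024*1024)) | sudo tee /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes
$ echo $$ | sudo tee /sys/fs/cgroup/memory/mygroup/cgroup.procs    # move the current shell (and its children) into the group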
If you want to use LXC to do lightweight resource management, you can do that with or without a chroot. When starting an LXC container, you can choose which resources you want to isolate; so you could create a container with only a virtualized network, and nothing else; or a container with only memory limits, but sharing everything else with the host. The only things that will be isolated are those specified in the configuration file for your container. For example, lxc ships with several example container definitions that only isolate the network; they share a root partition and almost everything else with the host. Here's how to run a container identical to the host system except it has no network interface:
sudo lxc-execute -n foo -f /usr/share/doc/lxc/examples/lxc-no-netns.conf /bin/bash
If you want some files to be shared with the host, but not others, you have two choices: you could use a shared root directory and mount over the files that you want to be different in the container, or you could use a chroot but mount the files that you do want to share into the container.
For example, here's the configuration for a container that shares everything with the host except for /home; it instead bind-mounts /home/me/fake-home over /home within the container:
lxc.mount.entry = /home/me/fake-home /home none rw,bind 0 0
Or if you want to have a completely different root, but still share some directories like /usr, you can bind mount a few directories into a directory, and use that as the root of the filesystem.
So you have lots of options, and can choose to isolate just one component, more than one, or as many as LXC supports, depending on your needs.

Resources