Ubuntu docker image doesn't contain any /dev/sdX block devices? - linux

On a normal Linux machine or VM, "ls /dev" shows a lot of sdX or hdX hard disk entries.
But today I tried the latest Docker with the Ubuntu image, and "ls /dev" shows only this:
console core fd full mqueue null ptmx pts random shm stderr stdin stdout tty urandom zero
I don't see anything called sdX or hdX.
Why is this? How does Docker's Ubuntu image store anything without having a /dev/sdX?

Docker doesn't create virtual machines; it creates containers to run an application in an isolated space. Exposing physical hardware devices would allow those applications to escape the container isolation, so they are not provided by default. You can include specific devices inside the container with the docker run --device ... CLI flag, and you will most likely also need --privileged to give the root user back various capabilities that are removed by default. None of this is recommended for running untrusted containers.
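For illustration, a sketch of passing a single host device through (assuming the host actually has /dev/sda; substitute a device that exists on your machine):
docker run --rm --device /dev/sda:/dev/sda:r ubuntu ls -l /dev/sda
The trailing :r restricts the container to read access; the full permission string is rwm (read, write, mknod).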

Related

Docker images and containers change when docker desktop is running on linux

When docker desktop is running on linux, I see a different set of containers and images compared to when it is not running. That is, when I run docker images in the terminal, the output depends on whether docker desktop is running or not. After I 'quit docker desktop', the original behavior is restored.
I note the following changes:
docker desktop is off      | docker desktop is running
---------------------------|---------------------------
images 'a, b, c'           | images 'd, e, f'
containers 'aa, bb, cc'    | containers 'dd, ee, ff'
non-colored CLI output     | colored CLI output
My suspicion is that docker desktop kills a running docker service and starts a fresh one whose images and containers are located elsewhere on my filesystem. Then after quitting, the original service is restored. I'd like this behavior to change, such that the images and containers I'm working on are always the same, regardless of whether docker desktop is running or not.
I'm looking for some feedback on how to start debugging this.
Docker only runs natively on Linux. Docker Desktop is the "hack" that allows running docker on other platforms (MacOS, Windows, etc). Docker Desktop actually starts a Linux VM and runs docker inside that VM. It then takes care of mapping ports and volumes so that it appears to the end user that docker is "running directly on host".
The beauty of running Docker on Linux is that it runs natively and you don't need extra hacks and tricks. So why you would use Docker Desktop on Ubuntu... beats me :) However, the explanation of why you see different results is that you are talking to two different Docker daemons running on two different machines: one on the host and one in a VM.
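You can verify which engine your CLI is talking to with docker context. On Linux, Docker Desktop registers its own context (typically named desktop-linux) alongside the native engine's default context:
docker context ls
docker context use default          # native engine: images 'a, b, c'
docker context use desktop-linux    # Docker Desktop's VM: images 'd, e, f'
Switching contexts should flip the output of docker images between the two sets you observed, without stopping either engine.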

Is it safe to mount /dev into a Docker container

I'm affected by an issue described in moby/moby/27886, meaning loop devices I create in a Docker container do not appear in the container, but they do on the host.
One of the potential workarounds is to mount /dev into the Docker container, so something like:
docker run -v /dev:/dev image:tag
I know that it works, at least on Docker Engine and the Linux kernel that I currently use (20.10.5 and 5.4.0-70-generic respectively), but I'm not sure how portable and safe it is.
In runc's libcontainer/SPEC for filesystems I found some detailed information on /dev mounts for containers, so I'm a bit worried that mounting /dev from the host might cause unwanted side-effects. I even found one problematic use-case that was fixed in 2018...
So, is it safe to mount /dev into a Docker container? Is it guaranteed to work?
I'm not asking if it works, because it seems it works, I'm interested if this is a valid usage, natively supported by Docker Engine.
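For reference, a narrower variant of the same workaround would be passing through only the specific devices needed, e.g.:
docker run --rm --device /dev/loop-control --device /dev/loop0 image:tag
though devices created after the container starts still won't appear inside it, which is exactly the moby/moby/27886 behavior, so this doesn't fully replace the -v /dev:/dev mount.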

How to customize golang-docker image to use golang for scripting?

I came across this blog: using go as a scripting language, and tried to create a custom image that I can use to run Go scripts, i.e.:
FROM golang:1.15
RUN go get github.com/erning/gorun
RUN mount binfmt_misc -t binfmt_misc /proc/sys/fs/binfmt_misc
RUN echo ':golang:E::go::/go/bin/gorun:OC' | tee /proc/sys/fs/binfmt_misc/register
It fails with error:
mount: /proc/sys/fs/binfmt_misc: permission denied.
ERROR: Service 'go_saga' failed to build : The command '/bin/sh -c mount binfmt_misc -t binfmt_misc /proc/sys/fs/binfmt_misc' returned a non-zero code: 32
It's a read-only file system, so I can't change the permissions either. The task I'm trying to achieve is well documented here. Please help me with the following questions:
Is this even possible, i.e., can I mount /proc/sys/fs/binfmt_misc and write to the file /proc/sys/fs/binfmt_misc/register?
If yes, how do I do that?
It would be great if we could run Go scripts in the container.
First a quick disclaimer that I haven't done this binfmt trick to run go scripts. I suppose it might work, but I just use go run when I want to run something on the fly.
There's a lot to unpack here. A container runs an application on a shared kernel in an isolated environment; the namespaces, cgroups, and security settings are designed to prevent one container from impacting other containers or the host.
Why is that important? Because /proc/sys/fs/binfmt_misc is interacting with the kernel, and pushing a change to that would be considered a container escape since you're modifying the underlying host.
The next thing to cover is building an image vs running a container. When you build an image with the Dockerfile, you are defining the image filesystem and some metadata (labels, entrypoint, exposed ports, etc). Each RUN command executes that command inside a temporary container, based on the previous step's result, and when the command finishes it captures the changes to the container filesystem. When you mount another filesystem, that doesn't change the underlying container filesystem, so even if you could, the mount command would be a noop during the image build.
So if this is possible at all, you'll need to do it inside a running container rather than at build time. That container will need to be privileged, since mounting filesystems and modifying /proc requires access not normally given to containers, and you'll be modifying the host kernel in the process. You'd need to make the container entrypoint run the mount and register the binfmt_misc entry, and figure out what to do if the entry is already registered, possibly pointing to a different directory in another container.
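If you do go down that road, a rough sketch of such an entrypoint might look like the following (untested, assuming gorun is at /go/bin/gorun as in your Dockerfile, and the container is started with --privileged):
#!/bin/sh
# entrypoint.sh - hypothetical sketch, not hardened
# mount binfmt_misc if the register file isn't visible yet
[ -f /proc/sys/fs/binfmt_misc/register ] || \
    mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc
# register gorun only if no 'golang' entry exists yet
[ -e /proc/sys/fs/binfmt_misc/golang ] || \
    echo ':golang:E::go::/go/bin/gorun:OC' > /proc/sys/fs/binfmt_misc/register
exec "$@"
Keep in mind this modifies the host kernel's binfmt table, not just the container.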
As an aside, when dealing with binfmt_misc and containers, the F flag is very important, though in your use case it's important that you don't have it. Typically you need the F flag so the binary is found on the host filesystem rather than searched for within the container filesystem namespace. The typical use case of binfmt_misc and containers is configuring the host to be able to run containers for different architectures, e.g. Docker Desktop can run amd64, arm64, and a bunch of other platforms today using this.
In the end, if you want a one-off container to run a Go command as a script, I'd skip the binfmt_misc trick and make an entrypoint that does a go run instead. But if you're using the container for longer-running processes where you want to periodically run a Go file as a script, you'll need to do the registration inside the container, as a privileged container that has the ability to escape to the host.
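For the one-off case, that can be as simple as (assuming main.go is the hypothetical script you want to run, mounted from the current directory):
docker run --rm -v "$PWD":/scripts -w /scripts golang:1.15 go run main.go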

Docker on the mac separates internal and external file ownerships; not so on linux

Running docker on the Mac, with a centos image, I see mounted volumes taking on the ownership of the centos (internal) user, while on the filesystem the ownership is mine (mdf:mdf).
Using the same centos image on RHEL 7, I see the volumes mounted, but inside, in centos, the home dir and the files all show my uid (1055).
I can do a recursive chown to, e.g., insideguy:insideguy, and all looks right. But back in the host filesystem, the ownership has changed to some other user in the registry that has the same uid as was selected for insideguy (1001) when useradd was executed.
Is there some fundamental limitation in docker for Linux that makes this happen?
As another side effect, in our cluster one cannot chown on a mounted filesystem, even with sudo privileges; only on a local filesystem. So the desire to keep the docker home directories in, e.g., ~/dockerhome, fails because docker seems to be trying (and failing) to perform some chowns (not described in the Dockerfile or the start script, so assumed to be part of the --volume treatment). Placed in /var or /opt with appropriate ownerships, all goes well.
Any idea what's different between the two docker hosts?
Specifics: OSX 10.11.6; docker v1.12.1 on mac, v1.12.2 on RHEL 7; centos 7
There is a fundamental limitation to Docker on OS X that makes this happen: Docker only runs on Linux.
When running Docker on other platforms, this requires first setting up a Linux VM (historically through VirtualBox, although more recently other options are available) and then running Docker inside that VM.
Because Docker is running natively on Linux, it is sharing filesystems directly with the host when you use something like docker run -v /host/path:/container/path. So if inside the container you run chown userA somefile and user A has userid 1001, and on your host that user id belongs to userB, then of course when you look at the files on the host they will appear to be owned by userB. There's no magic here; this is just how Unix file permissions work. You get the same behavior if, say, you were to move a disk or NFS filesystem from one host to another that had conflicting entries in their local /etc/passwd files.
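You can demonstrate this to yourself with a quick experiment (an illustrative sketch; the ubuntu image and uid 1001 are arbitrary here):
mkdir -p data && touch data/somefile
docker run --rm -v "$PWD/data":/data ubuntu chown 1001 /data/somefile
ls -ln data/somefile          # the host now shows numeric owner 1001
grep ':1001:' /etc/passwd     # whichever host account has uid 1001 "owns" it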
Most Docker containers run as root (or at least, not as your local user). This means that any files created by a process in Docker will typically not be owned by you, which can of course cause problems if you are trying to access a filesystem that does not permit this sort of access. Your choices when using Docker are pretty much the same choices you have when not using Docker: either ensure that you are running containers as your own user id -- which may not be possible, since many images are built assuming they will be running as root -- or arrange to store files somewhere else.
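When the image tolerates it, the first option is usually a matter of passing your own uid/gid at run time, e.g.:
docker run --rm --user "$(id -u):$(id -g)" -v "$PWD":/work -w /work ubuntu touch created-by-me
ls -l created-by-me    # owned by your host user
Note that a container started this way has no matching /etc/passwd entry inside, which breaks some images.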
This is one of the reasons why many people discourage the use of host volume mounts, because it can lead to this sort of confusion (and also because when interacting with a remote Docker API, the remote Docker daemon doesn't have any access to your local host filesystem).
With Docker for Mac, there is some magic file sharing that goes on to expose your local filesystem to the Linux VM (for example, with VirtualBox, Docker may use the shared folders feature). This translation layer is probably the cause of the behavior you've noted on OS X with respect to file ownership.

Why does docker container prompt "Permission denied"?

I use the following command to run a docker container and map a directory from the host (/root/database) to the container (/tmp/install/database):
# docker run -it --name oracle_install -v /root/database:/tmp/install/database bofm/oracle12c:preinstall bash
But in container, I find I can't use ls to list contents in /tmp/install/database/ though I am root and have all privileges:
[root@77eb235aceac /]# cd /tmp/install/database/
[root@77eb235aceac database]# ls
ls: cannot open directory .: Permission denied
[root@77eb235aceac database]# id
uid=0(root) gid=0(root) groups=0(root)
[root@77eb235aceac database]# cd ..
[root@77eb235aceac install]# ls -alt
......
drwxr-xr-x. 7 root root 4096 Jul 7 2014 database
I check /root/database on the host, and everything seems OK:
[root@localhost ~]# ls -lt
......
drwxr-xr-x. 7 root root 4096 Jul 7 2014 database
Why does docker container prompt "Permission denied"?
Update:
The root cause is related to SELinux. I actually ran into a similar issue last year.
A permission denied within a container for a shared directory could be due to the fact that the shared directory is stored on a device. By default containers cannot access any devices. Adding the --privileged option to docker run allows the container to access all devices and perform kernel calls. This is not considered secure.
A cleaner way to share a device is to use the option docker run --device=/dev/sdb (if /dev/sdb is the device you want to share).
From the man page:
--device=[]
Add a host device to the container (e.g. --device=/dev/sdc:/dev/xvdc:rwm)
--privileged=true|false
Give extended privileges to this container. The default is false.
By default, Docker containers are “unprivileged” (=false) and cannot, for example, run a Docker daemon inside the Docker container. This is because by default a container is not allowed to access any devices. A “privileged” container is given access to all devices.
When the operator executes docker run --privileged, Docker will enable access to all devices on the host as well as set some configuration in AppArmor to allow the container nearly all the same access to the host as processes running outside of a container on the host.
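You can see the difference in device visibility directly (the exact counts will vary by host):
docker run --rm ubuntu ls /dev | wc -l                # a handful of virtual devices
docker run --rm --privileged ubuntu ls /dev | wc -l   # the host's full device list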
I had a similar issue when sharing an nfs mount point as a volume using docker-compose. I was able to resolve the issue with:
docker-compose up --force-recreate
Even though you found the issue, this may help someone else.
Another reason is a mismatch with the UID/GID. This often shows up as being able to modify a mount as root but not as the container's user.
You can set the UID on the mount, so for an Ubuntu container running as the ubuntu user you may need to append :uid=1000 (check with id -u), or set the UID locally, depending on your use case.
uid=value and gid=value
Set the owner and group of the files in the filesystem (default: uid=gid=0)
There is a good blog about it here, with this tmpfs example:
docker run \
--rm \
--read-only \
--tmpfs=/var/run/prosody:uid=100 \
-it learning/tmpfs
http://www.dendeer.com/post/docker-tmpfs/
I got the answer from a comment under: Why does docker container prompt Permission denied?
man docker-run gives the proper answer:
Labeling systems like SELinux require that proper labels are placed on volume content mounted into a container. Without a label, the security system might prevent the processes running inside the container from using the content. By default, Docker does not change the labels set by the OS.
To change a label in the container context, you can add either of two suffixes :z or :Z to the volume mount. These suffixes tell Docker to relabel file objects on the shared volumes. The z option tells Docker that two containers share the volume content. As a result, Docker labels the content with a shared content label. Shared volume labels allow all containers to read/write content. The Z option tells Docker to label the content with a private unshared label. Only the current container can use a private volume.
For example:
docker run -it --name oracle_install -v /root/database:/tmp/install/database:z ...
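After starting the container with the :z suffix, you can verify the relabeling from the host (the exact type name varies by distribution and policy version):
ls -dZ /root/database    # should now show a container-accessible type such as container_file_t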
I was trying to run a C file using Python's os.system in the container, but I was getting the same error. My fix was to add this line while creating the image: RUN chmod -R 777 app. It worked for me.
