Docker container on Alpine Linux 3.7: Strange pid 1 not visible within the container's pid namespace - linux

I am currently tracking a weird issue we are experiencing with dockerd 17.10.0-ce on an Alpine Linux 3.7 host. It seems that, for all containers on this host, the process tree started as the entrypoint/command of the Docker image is NOT visible within the container itself. In comparison, on an Ubuntu host, the same image will have the process tree visible as PID 1.
Here is an example.
Run a container with an explicit known entrypoint/command:
% docker run -d --name testcontainer --rm busybox /bin/sh -c 'sleep 1000000'
Verify the processes are seen by dockerd properly:
% docker top testcontainer
PID USER TIME COMMAND
6729 root 0:00 /bin/sh -c sleep 1000000
6750 root 0:00 sleep 1000000
Now, start a shell inside that container and check the process list:
% docker exec -t -i testcontainer /bin/sh
/ # ps -ef
PID USER TIME COMMAND
6 root 0:00 /bin/sh
12 root 0:00 ps -ef
As can be observed, our entrypoint command (/bin/sh -c 'sleep 1000000') is not visible inside the container itself. Even running top will yield the same results.
Is there something I am missing here? On an Ubuntu host with the same docker engine version, the results are as I would expect. Could this be related to Alpine's hardened kernel causing an issue with how the container PID space is separated?
Any help appreciated for areas to investigate.
-b

It seems this problem is related to the grsecurity patch set that the Alpine kernel ships with. In this specific case, the GRKERNSEC_CHROOT_FINDTASK kernel setting limits what processes can see and do outside of their chroot environment. It is controlled by the kernel.grsecurity.chroot_findtask sysctl variable.
From the grsecurity docs:
kernel.grsecurity.chroot_findtask
If you say Y here, processes inside a chroot will not be able to kill,
send signals with fcntl, ptrace, capget, getpgid, setpgid, getsid, or
view any process outside of the chroot. If the sysctl option is
enabled, a sysctl option with name "chroot_findtask" is created.
The only workaround I have found for now is to disable this flag as well as the chroot_deny_mknod and chroot_deny_chmod flags in order to get the same behaviour as with a non-grsecurity kernel.
kernel.grsecurity.chroot_deny_mknod=0
kernel.grsecurity.chroot_deny_chmod=0
kernel.grsecurity.chroot_findtask=0
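For reference, a minimal sketch of applying these on the Alpine host (as root); the /etc/sysctl.d file name is just an example:
sysctl -w kernel.grsecurity.chroot_deny_mknod=0
sysctl -w kernel.grsecurity.chroot_deny_chmod=0
sysctl -w kernel.grsecurity.chroot_findtask=0
# to persist across reboots, put the three key=value lines above into e.g.
# /etc/sysctl.d/99-docker-grsec.conf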
Of course this is less than ideal since it bypasses and disables security features of the system but might be a valid workaround for a development environment.

Related

Docker - is it safe to switch to non-root user in ENTRYPOINT?

Is it considered a secure practice to run a root-privileged ENTRYPOINT ["/bin/sh", "entrypoint.sh"] that later switches to a non-root user before running the application?
More context:
There are a number of articles (1, 2, 3) suggesting that running the container as a non-root user is a security best practice. This can be achieved with the USER appuser instruction; however, there are cases (4, 5) where starting the container as root and only switching to a non-root user in an entrypoint.sh script is the only practical option, e.g.:
#!/bin/sh
chown -R appuser:appgroup /path/to/volume
exec runuser -u appuser "$@"
and in Dockerfile:
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/bin/sh", "entrypoint.sh"]
CMD ["/usr/bin/myapp"]
When calling docker top container I see two processes, one running as root and one as non-root:
PID USER TIME COMMAND
5004 root 0:00 runuser -u appuser /usr/bin/myapp
5043 1000 0:02 /usr/bin/myapp
Does that root process mean my container is running with a vulnerability, or is this considered secure?
I found little discussion on the subject (6, 7) and none of it seems definitive. I've looked for similar questions on Stack Overflow but couldn't find anything (8, 9, 10) that addresses the security aspect.
I just looked through what relevant literature (Adrian Mouat's Docker, Liz Rice's Container Security) has to say on the topic and added my own thoughts to it:
The main intention behind the much-cited best practice of running containers as non-root is to avoid container breakouts via vulnerabilities in the application code. Naturally, if your application runs as root and your container has access to the host, e.g. via a bind-mounted volume, a container breakout is possible. Likewise, if your application has the rights to execute vulnerable system libraries on your container file system, a denial-of-service attack looms.
Your approach of using runuser protects you against these risks, since your application would not have rights on the host's root file system. Similarly, your application could not be abused to call system libraries on the container file system, or even to execute system calls on the host kernel.
However, if somebody attaches to your container with docker exec, they will be root, since the container's main process belongs to root. This might become an issue on systems with elaborate access-control concepts like Kubernetes. There, certain user groups might be granted a read-only view of the cluster that includes the right to exec into containers. Then, as root, they will have more rights than necessary, possibly including rights on the host.
In conclusion, I don't have strong security concerns regarding your approach, since it mitigates the risk of attacks via application vulnerabilities by running the application as non-root. The fact that you run the container's main process as root I see as a minor disadvantage that only creates problems in niche access-control setups, where not fully trusted subjects get read-only access to your system.
In your case the runuser process (PID 1) stays alive. If you want to replace PID 1 as well, use:
Dockerfile
ENTRYPOINT ["/bin/bash", "/var/usr/entrypoint.sh" ]
entrypoint.sh
# add the user & group here if they do not exist yet
exec su -l USERNAME -c "/bin/bash /var/usr/starting.sh"
In starting.sh you do everything you need to do as non-root, or put the command directly after -c if it is just a one-liner.
Result
docker top
UID PID PPID C STIME TTY TIME CMD
1000 11577 11556 0 14:58 ? 00:00:00 /bin/bash /var/usr/starting.sh
1000 11649 11577 0 14:58 ? 00:00:00 sleep 24h
No process runs as root anymore (all use UID 1000 in this case), and the shell running starting.sh is PID 1, which also solves the docker logs problem (only PID 1's output is logged). Child processes are started as the non-root user as well. Only a docker exec from the host still runs as root, but that is usually what you want for debugging.
top inside container
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
1 0 usr1000 S 2316 0% 7 0% /bin/bash /var/usr/starting.sh
48 0 root S 1660 0% 7 0% /bin/sh
54 48 root R 1592 0% 1 0% top
47 1 usr1000 S 1580 0% 5 0% sleep 24h
So you can see: PID 1 runs your workload (use exec inside starting.sh if you need something to run in the same process). The only things running as root are the docker exec shell and, consequently, top.

Finding Docker container processes? (from host point of view)

I am doing some tests on docker and containers and I was wondering:
Is there a method I can use to find all processes associated with a Docker container, by its name or ID, from the host's point of view?
After all, at the end of the day a container is a set of virtualized processes.
You can use docker top command.
This command lists all processes running within your container.
For instance, this command on a single-process container on my box displays:
UID PID PPID C STIME TTY TIME CMD
root 14097 13930 0 23:17 pts/6 00:00:00 /bin/bash
All the methods mentioned by others also work, but this one should be the easiest.
Update:
To simply get the main process ID of the container, use this command:
docker inspect -f '{{.State.Pid}}' <container id>
Another way to get an overview of all Docker processes running on a host is to use the generic cgroup-based systemd tools.
systemd-cgls shows all cgroups and the processes running in them in a tree view, like this:
├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
├─docker
│ ├─070a034d27ed7a0ac0d336d72cc14671584cc05a4b6802b4c06d4051ce3213bd
│ │ └─14043 bash
│ ├─dd952fc28077af16a2a0a6a3231560f76f363359f061c797b5299ad8e2614245
│ │ └─3050 go-cron -s 0 0 * * * * -- automysqlbackup
As every Docker container has its own cgroup, you can also see Docker Containers and their corresponding host processes this way.
Two interesting properties of this method:
It works even if the Docker Daemon(s) are defunct.
It's a pretty quick overview.
You can also use systemd-cgtop to get an overview of the resource usage of Docker Containers, similar to top.
By the way: Since systemd services also correspond to cgroups these methods are also applicable to non-Dockerized systemd services.
I found a similar solution using a one-line bash script:
for i in $(docker container ls --format "{{.ID}}"); do docker inspect -f '{{.State.Pid}} {{.Name}}' $i; done
The process run in a Docker container is a child of a process named containerd-shim (as of Docker v18.09.4).
First figure out the process IDs of the containerd-shim processes, then for each of them find its child process.
pgrep containerd-shim
7105
7141
7248
To find the child process of parent process 7105:
pgrep -P 7105
7127
In the end you could get the list with:
for i in $(pgrep containerd-shim); do pgrep -P $i; done
7127
7166
7275
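If you also want to see what each of those child PIDs is running, a small extension of the same loop (assuming a procps-style ps is available) could look like this:
for i in $(pgrep containerd-shim); do
  pgrep -P "$i" | xargs -r -I{} ps -o pid=,cmd= -p {}   # print PID and command without headers
done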
Running this on the host gives you a list of the processes running in the container <Container ID>, showing host PIDs instead of container PIDs:
DID=$(docker inspect -f '{{.State.Pid}}' <Container ID>); ps --ppid $DID -o pid,ppid,cmd
docker ps will list docker containers that are running.
docker exec <id|name> ps will tell you the processes it's running.
Since the following command shows only the container's own main process ID (not all of its child processes):
docker inspect -f '{{.State.Pid}}' <container-name_or_ID>
you can go the other way round: to find which container a given process belongs to, look the process ID up under /proc (for a thread it appears as /proc/<parent_process>/task/<processID>), read the container hash from its cgroup file, take the first 12 characters of that hash as the container ID, and then look the container up with docker ps:
#!/bin/bash
# p2c: find the Docker container a given host PID (or thread ID) belongs to
processPath=$(find /proc/ -name "$1" 2>/dev/null | head -n 1)
containerID=$(fgrep 'pids:/docker/' "${processPath}/cgroup" | sed -e 's#.*/docker/##g' | cut -c 1-12)
docker ps | fgrep "$containerID"
Save the script above in a file such as p2c and run it like this:
p2c <PID>
For example:
p2c 85888
Another solution, combining docker ps and docker top:
docker ps --format "{{.ID}}" | xargs -I'{}' docker top {} -o pid | awk '!/PID/'
Note: awk '!/PID/' just removes the PID header from the output of docker top.
If you want the whole process tree of each Docker container, you can try:
docker ps --format "{{.ID}}" | xargs -I'{}' docker top {} -o pid | awk '!/PID/' | xargs -I'{}' pstree -psa {}
docker stats <container id>
shows the container's resource consumption (including a PIDS count of the number of processes); or simply use docker ps.
This cheat sheet may also be of use:
http://theearlybirdtechnology.com/2017/08/12/docker-cheatsheet/

docker: different PID for `top` and `ps`

I don't understand the difference between
$> docker top lamp-test
PID USER COMMAND
31263 root {supervisord} /usr/bin/python /usr/bin/supervisord -n
31696 root {mysqld_safe} /bin/sh /usr/bin/mysqld_safe
31697 root apache2 -D FOREGROUND
...
and
$> docker exec lamp-test ps
PID TTY TIME CMD
1 ? 00:00:00 supervisord
433 ? 00:00:00 mysqld_safe
434 ? 00:00:00 apache2
831 ? 00:00:00 ps
So, the question is: why are the PIDs different? I would say that the output from ps is namespaced, but if that is true, what is top showing?
docker exec lamp-test ps shows PIDs inside the Docker container's PID namespace.
docker top lamp-test shows host system PIDs.
You can see a container's processes from the host, but you cannot kill them using the container PIDs shown inside. This "flawed" isolation actually has some great benefits, like the ability to monitor the processes running inside all your containers from a single monitoring process running on the host machine.
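You can also see the mapping for the container's main process directly, since docker inspect reports the host PID of the process that the container sees as PID 1:
docker inspect -f '{{.State.Pid}}' lamp-test
# prints the host PID of supervisord (31263 in the docker top output above),
# which ps inside the container reports as PID 1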
I don't think you should worry about this. You can't kill a process by its container PID from the host environment, but you can do it from inside the container:
docker exec <CONTAINER NAME> ps remember the PID
docker exec <CONTAINER NAME> kill <PID>

docker run a shell script in the background without exiting the container

I am trying to run a shell script in my Docker container. The problem is that the shell script spawns another process, which should keep running until a separate shutdown script is used to terminate the processes spawned by the startup script.
When I run the below command,
docker run image:tag /bin/sh /root/my_script.sh
and then,
docker ps -a
I see that the container has exited. But this is not what I want. My question is: how do I let the command keep running in the background without the container exiting?
You haven't explained why you want to see your container running after your script has exited, or whether or not you expect your script to exit.
A docker container exits as soon as the container's CMD exits. If you want your container to continue running, you will need a process that will keep running. One option is simply to put a while loop at the end of your script:
while :; do
sleep 300
done
Your script will never exit, so your container will keep running. If your container hosts a network service (a web server, a database server, etc.), then this is typically the process that runs for the life of the container.
If instead your script is exiting unexpectedly, you will probably need to take a look at your container logs (docker logs <container>) and possibly add some debugging to your script.
If you are simply asking, "how do I run a container in the background?", then Emil's answer (pass the -d flag to docker run) will help you out.
The process that docker runs takes the place of init in the UNIX process tree. init is the topmost parent process, and once it exits the docker container stops. Any child process (now an orphan process) will be stopped as well.
$ docker pull busybox >/dev/null
$ time docker run --rm busybox sleep 3
real 0m3.852s
user 0m0.179s
sys 0m0.012s
So you can't allow the parent pid to exit, but you have two options. You can leave the parent process in place and allow it to manage its children (for example, by telling it to wait until all child processes have exited)
$ time docker run --rm busybox sh -c 'sleep 3 & wait'
real 0m3.916s
user 0m0.178s
sys 0m0.013s
…or you can replace the parent process with the child process using exec. This means that the new command is being executed in the parent process's space…
$ time docker run --rm busybox sh -c 'exec sleep 3'
real 0m3.886s
user 0m0.173s
sys 0m0.010s
This latter approach may be complex depending on the nature of the child process, but having fewer unnecessary processes running is more idiomatically Docker. (Which is not saying you should only ever have one process.)
Run your container with your script in the background with the command below:
docker run -i -t -d image:tag /bin/sh /root/my_script.sh
Check the container ID with the docker ps command.
Then verify whether your script is running in the container:
docker exec <id> /bin/sh -l -c "ps aux"
Wrap the program in a docker-entrypoint.sh shell script that blocks the container's main process and is able to catch Ctrl-C. This bash example should help:
https://rimuhosting.com/knowledgebase/linux/misc/trapping-ctrl-c-in-bash
The script should shut down the process cleanly when the exit signal is sent by Docker.
You can also add a loop inside the script that repeatedly checks the running process.
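A minimal sketch of such an entrypoint, assuming the long-running work is started by /root/my_script.sh as in the question (file names taken from the question; adjust as needed):
#!/bin/sh
# docker-entrypoint.sh: start the workload in the background, forward
# termination signals to it, and block until it exits so the container stays up
/bin/sh /root/my_script.sh &
child=$!
trap 'kill -TERM "$child" 2>/dev/null' TERM INT
wait "$child"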

docker attach vs lxc-attach

UPDATE: Docker 0.9.0 uses libcontainer now, diverging from LXC; see: Attaching process to Docker libcontainer container
I'm running an instance of elasticsearch:
docker run -d -p 9200:9200 -p 9300:9300 dockerfile/elasticsearch
Checking the process shows the following:
$ docker ps --no-trunc
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
49fdccefe4c8c72750d8155bbddad3acd8f573bf13926dcaab53c38672a62f22 dockerfile/elasticsearch:latest /usr/share/elasticsearch/bin/elasticsearch java About an hour ago Up 8 minutes 0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp pensive_morse
Now, when I try to attach to the running container, I get stuck:
$ sudo docker attach 49fdccefe4c8c72750d8155bbddad3acd8f573bf13926dcaab53c38672a62f22
[sudo] password for lsoave:
The TTY doesn't connect and the prompt doesn't come back. Doing the same with lxc-attach works fine:
$ sudo lxc-attach -n 49fdccefe4c8c72750d8155bbddad3acd8f573bf13926dcaab53c38672a62f22
root@49fdccefe4c8:/# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 49 20:37 ? 00:00:20 /usr/bin/java -Xms256m -Xmx1g -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMa
root 88 0 0 20:38 ? 00:00:00 /bin/bash
root 92 88 0 20:38 ? 00:00:00 ps -ef
root@49fdccefe4c8:/#
Does anybody know what's wrong with docker attach ?
NB. dockerfile/elasticsearch ends with:
ENTRYPOINT ["/usr/share/elasticsearch/bin/elasticsearch"]
You're attaching to a container that is running elasticsearch, which isn't an interactive command. You don't get a shell to type in because the container is not running a shell. The reason lxc-attach works is because it gives you a default shell. Per man lxc-attach:
If no command is specified, the current default shell of the user
running lxc-attach will be looked up inside the container and
executed. This will fail if no such user exists inside the container
or the container does not have a working nsswitch mechanism.
docker attach is behaving as expected.
As Ben Whaley notes this is expected behavior.
It's worth mentioning though that if you want to monitor the process you can do a number of things:
Start bash as the foreground process: e.g. $ES_DIR/bin/elasticsearch && /bin/bash will give you your shell when you attach. Mainly useful during development. Not so clean :)
Install an ssh server. Although I've never done this myself it's a good option. Drawback is of course overhead, and maybe a security angle. Do you really want ssh on all of your containers? Personally, I like to keep them as small as possible with single-process as the ultimate win.
Use the log files! You can use docker cp to get the logs locally, or better, the docker logs $CONTAINER_ID command. The latter gives you the accumulated stdout/stderr output for the entire lifetime of the container each time, though.
Mount the log directory. Just mount a directory from your host and have elasticsearch write to a log file in that directory (see the example below). You can have syslog on your host, Logstash, or whatever turns you on ;). Of course, the drawback here is that you are now using your host more than you might like. I also found a nice experiment using logstash in this blog.
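For example, the run command from the question could bind-mount a host directory over the image's log directory (the container-side path below is an assumption and depends on the image):
docker run -d -p 9200:9200 -p 9300:9300 \
  -v /var/log/elasticsearch:/usr/share/elasticsearch/logs \
  dockerfile/elasticsearch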
FWIW, now that Docker 1.3 is released, you can use "docker exec" to open up a shell or other process on a running container. This should allow you to effectively replace lxc-attach when using the native driver.
http://blog.docker.com/2014/10/docker-1-3-signed-images-process-injection-security-options-mac-shared-directories/
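For example, to get an interactive shell in the running elasticsearch container (named pensive_morse in the docker ps output above), assuming the image ships bash:
docker exec -it pensive_morse /bin/bash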
