Is it considered a secure practice to run a root-privileged ENTRYPOINT ["/bin/sh", "entrypoint.sh"] that later switches to a non-root user before running the application?
More context:
There are a number of articles (1, 2, 3) suggesting that running the container as a non-root user is a security best practice. This can be achieved with the USER appuser instruction, but there are cases (4, 5) where running the container as root and only switching to non-root in an entrypoint.sh script is the only workable option, e.g.:
#!/bin/sh
chown -R appuser:appgroup /path/to/volume
exec runuser -u appuser -- "$@"
and in Dockerfile:
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/bin/sh", "entrypoint.sh"]
CMD ["/usr/bin/myapp"]
When calling docker top <container> I can see two processes, one root and one non-root:
PID USER TIME COMMAND
5004 root 0:00 runuser -u appuser /usr/bin/myapp
5043 1000 0:02 /usr/bin/myapp
Does that root process mean my container is running with a vulnerability, or is this considered secure?
I found little discussion of the subject (6, 7), and none of it seems definitive. I've looked for similar questions on Stack Overflow but couldn't find anything (8, 9, 10) that addresses the security aspect.
I just looked through what the relevant literature (Adrian Mouat's Using Docker, Liz Rice's Container Security) has to say on the topic and added my own thoughts:
The main intention behind the much-cited best practice of running containers as non-root is to avoid container breakouts via vulnerabilities in the application code. Naturally, if your application runs as root and your container has access to the host, e.g. via a bind-mounted volume, a container breakout is possible. Likewise, if your application has the rights to execute vulnerable system libraries on your container file system, a denial-of-service attack looms.
Your runuser approach protects against these risks, since your application would not have root rights on the host's file system. Similarly, your application could not be abused to call system libraries on the container file system with root rights or to execute privileged system calls on the host kernel.
However, if somebody attaches to your container with exec, they will be root, since the container's main process belongs to root. This can become an issue on systems with elaborate access-control concepts such as Kubernetes, where certain user groups might be granted a read-only view of the cluster, including the right to exec into containers. There, as root, they would have more rights than necessary, possibly including rights on the host.
In conclusion, I have no strong security concerns regarding your approach, since it mitigates the risk of attacks via application vulnerabilities by running the application as non-root. The fact that the container's main process runs as root is a minor disadvantage that only creates problems in niche access-control setups where not fully trusted subjects get read-only access to your system.
In your case the runuser process (PID 1) stays alive. If you want to substitute PID 1, use:
Dockerfile
ENTRYPOINT ["/bin/bash", "/var/usr/entrypoint.sh" ]
entrypoint.sh
# add user & group here if they do not exist yet
exec su -l USERNAME -c "/bin/bash /var/usr/starting.sh"
In starting.sh you do everything you have to do as the non-root user, or pass the command directly after -c if it is just a one-liner.
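For illustration, a minimal starting.sh that matches the result below might look like this (the sleep stands in for your actual workload):
#!/bin/bash
# everything in here already runs as the non-root user;
# replace the sleep with your real application start command
sleep 24h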
Result
docker top
UID PID PPID C STIME TTY TIME CMD
1000 11577 11556 0 14:58 ? 00:00:00 /bin/bash /var/usr/starting.sh
1000 11649 11577 0 14:58 ? 00:00:00 sleep 24h
No process is using root anymore (all use UID 1000 in this case), and the bash running starting.sh is PID 1, which also solves the docker logs problem (docker logs only captures PID 1's output). Sub-processes are started as the non-root user as well. Only a docker exec from the host still uses root, but that is mostly what you want for debugging.
top inside container
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
1 0 usr1000 S 2316 0% 7 0% /bin/bash /var/usr/starting.sh
48 0 root S 1660 0% 7 0% /bin/sh
54 48 root R 1592 0% 1 0% top
47 1 usr1000 S 1580 0% 5 0% sleep 24h
So you see: PID 1 runs the workload (use exec inside starting.sh if you need something to run in the same process). The only things running as root are the shell from docker exec and, consequently, top.
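As an aside, a common alternative to the su construction is gosu (or su-exec on Alpine), which execs straight into the target user's command without leaving an intermediate process behind. A minimal sketch, assuming gosu is installed in the image and reusing appuser and myapp from the question:
#!/bin/sh
# do the root-only setup first ...
chown -R appuser:appgroup /path/to/volume
# ... then replace this shell with myapp running as appuser;
# myapp becomes PID 1 and no root process remains
exec gosu appuser /usr/bin/myapp "$@"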
Related
Docker best practices state:
If a service can run without privileges, use USER to change to a non-root user.
In the case of cron, that doesn't seem practical, as cron needs root privileges to function properly. The executable that cron runs, however, does NOT need root privileges. Therefore, I run cron itself as the root user, but install my crontab so that the executable (in this case, a simple Python FTP download script I wrote) runs as a non-root user, via the crontab -u <user> command.
The cron/Docker interoperability story and community experience still seem to be in their infancy, but there are some pretty good solutions out there. Utilizing lessons gleaned from these two great posts, I arrived at a Dockerfile that looks something like this:
FROM python:3.7.4-alpine
RUN adduser -S riptusk331
WORKDIR /home/riptusk331
... boilerplate not necessary to post here ...
COPY mycron /etc/cron.d/mycron
RUN chmod 644 /etc/cron.d/mycron
RUN crontab -u riptusk331 /etc/cron.d/mycron
CMD ["crond", "-f", "-l", "0"]
and the mycron file is just a simple Python job running every minute:
* * * * * /home/riptusk331/venv/bin/python3 /home/riptusk331/ftp.py
This works perfectly fine, but I am unsure of how exactly logging is being handled here. I do not see anything saved in /var/log/cron. I can see the output of cron and ftp.py on my terminal, as well as in the container logs if I pull it up in Kitematic. But I have no idea what is actually going on here.
So my first question(s) are: how is logging & output being handled here (without any redirects after the cron job), and is this implementation method ok & secure?
VonC's answer to this post suggests appending > /proc/1/fd/1 2>/proc/1/fd/2 to your cron job to redirect output to Docker's stdout and stderr. This is where I both get a little confused, and run into trouble.
My crontab file now looks like this
* * * * * /home/riptusk331/venv/bin/python3 /home/riptusk331/ftp.py > /proc/1/fd/1 2>/proc/1/fd/2
The output without any redirection appeared to be going to stdout/stderr already, but I am not entirely sure. I just know it was showing up on my terminal. So why would this redirect be needed?
When I add this redirect, I run into permissioning issues. Recall that this crontab is being called as the non-root user riptusk331. Because of this, I don't have root access and get the following error:
/bin/ash: can't create /proc/1/fd/1: Permission denied
The Alpine base images are built on a compact tool set called BusyBox, and when you run crond here you're getting the BusyBox cron, not any other implementation. Its documentation is a little sparse, but if you look at the crond source (in C) you'll find that there is no redirection at all when it goes to run a job (see the non-sendmail version of start_one_job): the job's stdout and stderr are crond's stdout and stderr. In Docker, since crond is the container's primary process, that in turn becomes the container's output stream.
Anything that shows up in docker logs definitionally went to the stdout or stderr of the container's main process. If this cron implementation writes your job's output directly there, there's nothing wrong or insecure about taking advantage of that.
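If you want to convince yourself where the job's output ends up, you can inspect PID 1's stdout from inside the container; a quick check, assuming your container is named mycron (a placeholder):
$ docker exec mycron readlink /proc/1/fd/1
pipe:[4808162]
That pipe (the number will differ) is the stream docker logs reads from.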
In heavier-weight container orchestration systems, there is some way to run a container on a schedule (Kubernetes CronJobs, Nomad periodic jobs). You might find it easier and more consistent with these systems to set up a container that runs your job once and then exits, and then to set up the host's cron to run your container (necessarily, as root).
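A minimal sketch of that last pattern, with a hypothetical image name ftp-job that runs the script once and exits, would be a host-side cron entry such as:
# /etc/cron.d/ftp-job on the host; the "root" field is the user
# the job runs as, which docker run requires here
* * * * * root docker run --rm ftp-job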
You need to allow the CAP_SETGID capability to run crond as a non-root user. Granting it to the whole busybox binary would be a security risk, but you can install the dcron package instead of using busybox's builtin crond and set CAP_SETGID just on that program. Here is what you need to add for Alpine, using riptusk331 as the running user:
USER root
# crond needs root, so install dcron and cap package and set the capabilities
# on dcron binary https://github.com/inter169/systs/blob/master/alpine/crond/README.md
RUN apk add --no-cache dcron libcap && \
chown riptusk331:riptusk331 /usr/sbin/crond && \
setcap cap_setgid=ep /usr/sbin/crond
USER riptusk331
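After building, you can check that the capability actually landed on the binary with getcap, which the libcap package installed above provides (myimage is a placeholder for your image tag; the exact output format varies between libcap versions):
$ docker run --rm myimage getcap /usr/sbin/crond
/usr/sbin/crond cap_setgid=ep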
I am currently tracking a weird issue we are experiencing using dockerd 17.10.0-ce on an Alpine Linux 3.7 host. It seems that, for all the containers on this host, the process tree initiated as the entrypoint/command of the Docker image is NOT visible within the container itself. In comparison, on an Ubuntu host the same image has its process tree visible as PID 1.
Here is an example.
Run a container with an explicit known entrypoint/command:
% docker run -d --name testcontainer --rm busybox /bin/sh -c 'sleep 1000000'
Verify the processes are seen by dockerd properly:
% docker top testcontainer
PID USER TIME COMMAND
6729 root 0:00 /bin/sh -c sleep 1000000
6750 root 0:00 sleep 1000000
Now, start a shell inside that container and check the process list:
% docker exec -t -i testcontainer /bin/sh
/ # ps -ef
PID USER TIME COMMAND
6 root 0:00 /bin/sh
12 root 0:00 ps -ef
As can be observed, our entrypoint command (/bin/sh -c 'sleep 1000000') is not visible inside the container itself. Even running top will yield the same results.
Is there something I am missing here? On an Ubuntu host with the same docker engine version, the results are as I would expect. Could this be related to Alpine's hardened kernel causing an issue with how the container PID space is separated?
Any help appreciated for areas to investigate.
-b
It seems this problem is related to the grsecurity patch set that the Alpine kernel carries. In this specific case, the GRKERNSEC_CHROOT_FINDTASK kernel setting is used to limit what processes inside a chroot can do outside of that environment. This is controlled by the kernel.grsecurity.chroot_findtask sysctl variable.
From the grsecurity docs:
kernel.grsecurity.chroot_findtask
If you say Y here, processes inside a chroot will not be able to kill,
send signals with fcntl, ptrace, capget, getpgid, setpgid, getsid, or
view any process outside of the chroot. If the sysctl option is
enabled, a sysctl option with name "chroot_findtask" is created.
The only workaround I have found so far is to disable this flag, along with the chroot_deny_mknod and chroot_deny_chmod flags, to get the same behaviour as with a non-grsecurity kernel:
kernel.grsecurity.chroot_deny_mknod=0
kernel.grsecurity.chroot_deny_chmod=0
kernel.grsecurity.chroot_findtask=0
Of course this is less than ideal since it bypasses and disables security features of the system but might be a valid workaround for a development environment.
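For completeness, a sketch of applying those settings immediately on the host:
sysctl -w kernel.grsecurity.chroot_deny_mknod=0
sysctl -w kernel.grsecurity.chroot_deny_chmod=0
sysctl -w kernel.grsecurity.chroot_findtask=0
To survive reboots, put the same three lines into a file under /etc/sysctl.d/ (the file name is an arbitrary choice) and reload with sysctl -p <file>.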
I am trying to configure a daemon process to run as a particular user, with that user's configuration information. I've tried
daemon /path/to/script --user=myUser
However, when I run a sample script that simply echoes $HOME to a file, it still shows /root for the home, not /home/myUser. Is there a switch for daemon that changes to the user as if you were doing "su"?
If not, is there a better way to accomplish this?
Thanks
Is there a switch for daemon that changes to the user as if you were
doing "su"?
No, there is no such switch. But you can use su itself:
daemon -- su -s /path/to/script myUser
or
daemon -- su -s /path/to/script - myUser
However note that this won't read myUser's configuration file (~/.daemonrc).
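If the immediate goal is just that $HOME resolves to /home/myUser instead of /root, a sketch using su's login option together with -c may be enough (assuming a util-linux style su):
# -l gives myUser's login environment (HOME, SHELL, PATH, ...);
# -c runs the script through the user's shell
daemon -- su -l -c /path/to/script myUser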
command:
bigxu#bigxu-ThinkPad-T410 ~/work/lean $ sudo ls
content_shell.pak leanote libgcrypt.so.11 libnotify.so.4 __MACOSX resources
icudtl.dat leanote.png libnode.so locales natives_blob.bin snapshot_blob.bin
Most of the time it works fine, but sometimes it is very slow.
So I ran it under strace.
command:
bigxu#bigxu-ThinkPad-T410 ~/work/lean $ strace sudo ls
execve("/usr/bin/sudo", ["sudo", "ls"], [/* 66 vars */]) = 0
brk(0) = 0x7f2b3c423000
fcntl(0, F_GETFD) = 0
fcntl(1, F_GETFD) = 0
fcntl(2, F_GETFD) = 0
......
......
......
write(2, "sudo: effective uid is not 0, is"..., 140sudo: effective uid is not 0, is /usr/bin/sudo on a file system with the 'nosuid' option set or an NFS file system without root privileges?
) = 140
exit_group(1) = ?
+++ exited with 1 +++
other information:
bigxu-ThinkPad-T410 lean # ls /etc/sudoers -alht
-r--r----- 1 root root 745 2月 11 2014 /etc/sudoers
bigxu-ThinkPad-T410 lean # ls /usr/bin/sudo -alht
-rwsr-xr-x 1 root root 152K 12月 14 21:13 /usr/bin/sudo
bigxu-ThinkPad-T410 lean # df `which sudo`
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb1 67153528 7502092 56217148 12%
For security reasons, the setuid bit and ptrace (used to run binaries under a debugger) cannot both be honored at the same time. Failure to enforce this restriction in the past led to CVE-2001-1384.
Consequently, any operating system designed with an eye to security will either stop honoring ptrace on exec of a setuid binary, or fail to honor the setuid bit when ptrace is in use.
On Linux, consider using Sysdig instead -- being able only to observe but not modify behavior, it does not carry the same risks.
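For instance, a hedged sketch of observing sudo with Sysdig's filter syntax:
# watch what sudo does without attaching a debugger to it,
# so its setuid bit stays honored
sudo sysdig proc.name=sudo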
How to trace sudo
$ sudo strace -u <username> sudo -k <command>
sudo runs strace as root.
strace runs sudo as <username> passed via the -u option.
sudo drops cached credentials from the previous sudo with -k option (for asking the password again) and runs <command>.
The second sudo is the tracee (the process being traced).
For automatically putting the current user in the place of <username>, use $(id -u -n).
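Putting it together, a full invocation for the current user could look like this (ls is just a stand-in for the command you want to trace):
$ sudo strace -f -o /tmp/sudo.trace -u "$(id -u -n)" sudo -k ls
$ less /tmp/sudo.trace    # -f followed the forked children as well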
Why sudo does not work with strace
In addition to this answer by Charles, here is what the execve() manual page says:
If the set-user-ID bit is set on the program file referred to by pathname, then the effective user ID of the calling process is changed to that of the owner of the program file. Similarly, when the set-group-ID bit of the program file is set the effective group ID of the calling process is set to the group of the program file.
The aforementioned transformations of the effective IDs are not performed (i.e., the set-user-ID and set-group-ID bits are ignored) if any of the following is true:
the no_new_privs attribute is set for the calling thread (see prctl(2));
the underlying filesystem is mounted nosuid (the MS_NOSUID flag for mount(2)); or
the calling process is being ptraced.
The capabilities of the program file (see capabilities(7)) are also ignored if any of the above are true.
The permissions for tracing a process (inspecting or modifying its memory) are described in the subsection "Ptrace access mode checking" in the NOTES section of the ptrace(2) manual page. I've commented on this in this answer.
UPDATE: Docker 0.9.0 uses libcontainer now, diverging from LXC; see: Attaching process to Docker libcontainer container
I'm running an instance of elasticsearch:
docker run -d -p 9200:9200 -p 9300:9300 dockerfile/elasticsearch
Checking the process it show like the following:
$ docker ps --no-trunc
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
49fdccefe4c8c72750d8155bbddad3acd8f573bf13926dcaab53c38672a62f22 dockerfile/elasticsearch:latest /usr/share/elasticsearch/bin/elasticsearch java About an hour ago Up 8 minutes 0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp pensive_morse
Now, when I try to attach to the running container, I get stuck:
$ sudo docker attach 49fdccefe4c8c72750d8155bbddad3acd8f573bf13926dcaab53c38672a62f22
[sudo] password for lsoave:
the tty doesn't connect and the prompt doesn't come back. Doing the same with lxc-attach works fine:
$ sudo lxc-attach -n 49fdccefe4c8c72750d8155bbddad3acd8f573bf13926dcaab53c38672a62f22
root@49fdccefe4c8:/# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 49 20:37 ? 00:00:20 /usr/bin/java -Xms256m -Xmx1g -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMa
root 88 0 0 20:38 ? 00:00:00 /bin/bash
root 92 88 0 20:38 ? 00:00:00 ps -ef
root@49fdccefe4c8:/#
Does anybody know what's wrong with docker attach ?
NB. dockerfile/elasticsearch ends with:
ENTRYPOINT ["/usr/share/elasticsearch/bin/elasticsearch"]
You're attaching to a container that is running elasticsearch which isn't an interactive command. You don't get a shell to type in because the container is not running a shell. The reason lxc-attach works is because it's giving you a default shell. Per man lxc-attach:
If no command is specified, the current default shell of the user
running lxc-attach will be looked up inside the container and
executed. This will fail if no such user exists inside the container
or the container does not have a working nsswitch mechanism.
docker attach is behaving as expected.
As Ben Whaley notes this is expected behavior.
It's worth mentioning though that if you want to monitor the process you can do a number of things:
Start bash as the foreground process: e.g. $ES_DIR/bin/elasticsearch && /bin/bash will give you your shell when you attach. Mainly useful during development. Not so clean :)
Install an ssh server. Although I've never done this myself it's a good option. Drawback is of course overhead, and maybe a security angle. Do you really want ssh on all of your containers? Personally, I like to keep them as small as possible with single-process as the ultimate win.
Use the log files! You can use docker cp to get the logs locally, or better, the docker logs $CONTAINER_ID command. The latter gives you the accumulated stdout/stderr output for the entire lifetime of the container each time, though.
Mount the log directory. Just mount a directory on your host and have elasticsearch write to a logfile in that directory. You can have syslog on your host, Logstash, or whatever turns you on ;). Of course, the drawback here is that you are now using your host more than you might like. I also found a nice experiment using logstash in this blog.
FWIW, now that Docker 1.3 is released, you can use "docker exec" to open up a shell or other process on a running container. This should allow you to effectively replace lxc-attach when using the native driver.
http://blog.docker.com/2014/10/docker-1-3-signed-images-process-injection-security-options-mac-shared-directories/
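For example, with the container from the question:
$ sudo docker exec -it 49fdccefe4c8 /bin/bash
root@49fdccefe4c8:/# ps -ef    # the elasticsearch process tree is now visible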