Using runAsNonRoot in Kubernetes - security

We’ve been planning for a while now to introduce securityContext: runAsNonRoot: true as a requirement in our pod configurations.
Testing this today, I’ve learnt that since v1.8.4 (I think) you also have to specify a particular UID for the user running the container, e.g. runAsUser: 333.
This means we not only have to tell developers to ensure their containers don’t run as root, but also that they must run as a specific UID, which makes this significantly more problematic for us to introduce.
Have I understood this correctly? What are others doing in this area? To leverage runAsNonRoot, is it now required that Docker containers run with a specific and known UID?

The Kubernetes Pod SecurityContext provides two options, runAsNonRoot and runAsUser, to enforce non-root users. You can use the two options independently of each other because they test for different things.
When you set runAsNonRoot: true you require that the container runs with a user whose UID is anything other than 0; it does not matter which UID that is.
When you set runAsUser: 333 you require that the container runs with the user whose UID is 333.
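For illustration, a minimal pod manifest using both fields might look like this (the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: nonroot-example              # placeholder name
spec:
  securityContext:
    runAsNonRoot: true               # kubelet refuses to start a container whose user is UID 0
    runAsUser: 333                   # optional: pin the UID, overriding the image's USER
  containers:
  - name: app
    image: example.com/app:latest    # placeholder image
    command: ["sleep", "infinity"]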

What are others doing in this area?
We are using runAsUser in situations where we don't want root to be used. Granted, those situations are not as frequent as you might think: the philosophy of deploying processes as separate containers of pods inside a Kubernetes cluster differs from a traditional compound monolithic deployment on a single host, where the security implications of a breach are quite different.
Most of our local development is done either on minikube or Docker Edge with k8s manifests, so the setup is as close as possible to our deployment cluster (apart from the obvious limits). With that said, we don't have issues with user ID allocation, since initialization of persistent volumes is not done externally, so all file user/group ownership is handled within pods with the proper file permissions. On the rare occasions that plain Docker is used for development, the developer is instructed to set appropriate permissions manually across mounted volumes, but that rarely happens.
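For what it's worth, the Kubernetes-native way to handle that in-pod volume ownership is a pod-level fsGroup, which makes supported volumes group-owned (and group-writable) by the given GID. A minimal sketch, with placeholder names and IDs:

apiVersion: v1
kind: Pod
metadata:
  name: volume-ownership-example     # placeholder name
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000                  # placeholder non-root UID
    fsGroup: 2000                    # kubelet changes group ownership of the volume contents to GID 2000
  containers:
  - name: app
    image: example.com/app:latest    # placeholder image
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-pvc            # placeholder claim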


k8s check pod securityContext definition

I want to check whether any pod in the cluster is running as a privileged pod, which could indicate that we have a security issue, so I check whether
privileged: true
However, under the securityContext: spec there are additional fields like
allowPrivilegeEscalation
runAsUser
procMount
capabilities
etc.,
which may be risky (I'm not sure about that).
My question is: if a pod is marked privileged: false but the other fields are true, as in the following example, does this indicate a security issue? Can such pods perform operations on other pods, access external data, etc.?
For example, the following configuration indicates that the pod is not privileged but has allowPrivilegeEscalation: true:
securityContext:
  allowPrivilegeEscalation: true
  privileged: false
I want to know which securityContext combinations in a pod config can allow it to control other pods/processes in the cluster.
The securityContext settings are more related to the container itself and to some access to the host machine.
allowPrivilegeEscalation allows a process to gain more permissions than its parent process. This is mostly related to setuid/setgid flags on binaries, but inside a container there is not much to get worried about.
You can only control other containers on the host machine from inside a container if you have a hostPath volume, or something like that, allowing you to reach the runtime's .sock file, such as /run/crio/crio.sock or docker.sock. It is pretty obvious that, if you are concerned about this, allowing requests to the Docker API over the network should be disabled as well.
Of course, all of this access is still governed by DAC and MAC restrictions. This is why podman's uidmap is better: root inside the container does not have the same root ID outside the container.
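To make that concrete, this is the kind of manifest fragment worth flagging in a review: a hostPath volume exposing the container runtime socket inside the pod (names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: runtime-socket-example       # placeholder name
spec:
  containers:
  - name: app
    image: example.com/app:latest    # placeholder image
    volumeMounts:
    - name: docker-sock
      mountPath: /var/run/docker.sock
  volumes:
  - name: docker-sock
    hostPath:
      path: /var/run/docker.sock     # full control of the node's Docker daemon from inside the pod
      type: Socket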
From the Kubernetes point of view, you don't need this kind of privilege: all you need is a ServiceAccount and the correct RBAC permissions to control other things inside Kubernetes. A ServiceAccount bound to the cluster-admin ClusterRole can do anything in the API and much more, like adding ssh keys to the hosts.
If you are concerned about pods executing things in Kubernetes or on the host, just force the use of non-root containers, avoid indiscriminate use of hostPath volumes, and control your RBAC.
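On the RBAC side, this is the kind of binding to look for: a ServiceAccount bound like the one below is far more powerful than any combination of securityContext fields (the binding and subject names are placeholders):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: give-away-the-cluster        # placeholder name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin                # full access to every API resource
subjects:
- kind: ServiceAccount
  name: some-app                     # placeholder ServiceAccount
  namespace: default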
OpenShift applies a very nice set of restrictions by default:
Ensures that pods cannot run as privileged
Ensures that pods cannot mount host directory volumes
Requires that a pod is run as a user in a pre-allocated range of UIDs (an OpenShift feature: random UIDs)
Requires that a pod is run with a pre-allocated MCS label (SELinux related)
This doesn't answer exactly what you asked, because I shifted the attention to RBAC, but I hope it gives you a useful picture.
Strictly in the scope of securityContext (as of the Kubernetes 1.26 API), here are a few things that may be risky:
Certainly risky
capabilities.add will add Linux capabilities (like CAP_SYS_TIME to set the system time) to a container. The default set depends on the container runtime (see for example Docker's default set of capabilities) and should be reasonably secure, but adding capabilities like CAP_SYS_ADMIN may represent a risk. Excessive capabilities outlines a few possible escalations.
privileged: true grants all capabilities, so you'll definitely want to check for that (as you already do).
allowPrivilegeEscalation: true is risky as it allows a process to gain more privileges than its parent.
procMount will allow a container to mount the node's /proc and expose sensitive host information.
windowsOptions may be risky. According to the Kubernetes docs it enables privileged access to the Windows node. I don't know much about Windows security, but I'd say: risky :-)
Maybe risky (though usually intended to restrict permissions)
runAsGroup and runAsUser may be risky when set to root / 0. Given that by default the container runtime will probably run the container as root already, these fields are mostly used to restrict a container's permissions to a non-root user. But if your container runtime is configured to run containers as non-root by default, they might be used to bypass that and run a container as root.
seLinuxOptions may be used to provide an insecure SELinux context, but is usually intended to define a more secure context.
seccompProfile defines system calls a container is allowed to make. It may be used to get access to sensitive system calls, though it's usually intended to restrict them.
(probably) Not risky
readOnlyRootFilesystem (default false) will make the container's root filesystem read-only.
runAsNonRoot (default false) will prevent a container from running as root.
capabilities.drop will drop Linux capabilities, restricting further what a container can do.
You can read more in the official Configure a Security Context documentation.
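Putting the low-risk settings together, a container securityContext that stays on the restrictive side of the lists above could look like this; treat it as a common hardening baseline rather than an official recommendation:

securityContext:
  runAsNonRoot: true                 # refuse to start as UID 0
  allowPrivilegeEscalation: false    # no privilege gain via setuid/setgid binaries
  readOnlyRootFilesystem: true       # root filesystem mounted read-only
  capabilities:
    drop:
    - ALL                            # drop every Linux capability
  seccompProfile:
    type: RuntimeDefault             # use the container runtime's default seccomp profile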
What about non-Security Context related risks?
Security Context is not the only thing you should be wary of: you should also consider volume mounts to insecure locations, RBAC, network, Secrets, etc. A good overview is provided by the Security Checklist.

Using Ansible to implement certs rotation functionality in Kubernetes Cluster

How do you use Ansible for cert rotation at different layers of a Kubernetes cluster?
We previously used fleet and are now migrating to Kubernetes.
If I hear your situation correctly, then I think you will be happiest with a DaemonSet that installs (and optionally monitors) ansible-pull.service and ansible-pull.timer on the Nodes.
The DaemonSet ensures the container is scheduled on every Node (unlike a CronJob or similar), and with /etc/systemd/system volume-mounted into the container, plus go-systemd's ability to daemon-reload (along with the dbus socket, of course), the container can write out suitably descriptive .service and .timer files for that Node.
Then ansible-pull will run as before, taking whatever steps your existing ansible playbooks did.
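A rough sketch of such a DaemonSet, assuming a custom installer image that writes the unit files and talks to systemd over D-Bus (the image name and the D-Bus socket mount are assumptions):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ansible-pull-installer                              # placeholder name
spec:
  selector:
    matchLabels:
      app: ansible-pull-installer
  template:
    metadata:
      labels:
        app: ansible-pull-installer
    spec:
      containers:
      - name: installer
        image: example.com/ansible-pull-installer:latest    # placeholder image
        volumeMounts:
        - name: systemd-units
          mountPath: /etc/systemd/system                    # write ansible-pull.service / .timer here
        - name: dbus
          mountPath: /var/run/dbus/system_bus_socket        # lets go-systemd trigger daemon-reload
      volumes:
      - name: systemd-units
        hostPath:
          path: /etc/systemd/system
      - name: dbus
        hostPath:
          path: /var/run/dbus/system_bus_socket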
There are many approaches to how to achieve this similar action on non-Node machines, so I'll leave that as an exercise to the reader.
I don't know what you define as the "Infrastructure" layer, but rotating the Kubernetes certs is relatively straightforward from ansible-pull's perspective: write out the new worker.pem and worker.key in /etc/kubernetes/ssl, bounce kubelet.service (or its hyperkube equivalent), voilà. Upper platform services I would expect to be managed by the (ReplicaSet|Deployment|ReplicationController|etc.) which owns them, meaning one can be a lot more declarative for in-cluster resources, having access to the full power of ConfigMap, Secret, Service, etc.

Disable certain Docker run options

I'm currently working on a setup to make Docker available on a high-performance cluster (HPC). The idea is that every user in our group should be able to reserve a machine for a certain amount of time and use Docker in a "normal" way, meaning accessing the Docker daemon via the Docker CLI.
To do that, the user would be added to the docker group. But this poses a big security problem for us, since it basically means that the user has root privileges on that machine.
The new idea is to make use of the user namespace mapping option (as described in https://docs.docker.com/engine/reference/commandline/dockerd/#/daemon-user-namespace-options). As I see it, this would tackle our biggest security concern, namely that root in a container is the same as root on the host machine.
But as long as users are able to bypass this via --userns=host, this doesn't increase security in any way.
Is there a way to disable this and other Docker run options?
As mentioned in issue 22223
There are a whole lot of ways in which users can elevate privileges through docker run, eg by using --privileged.
You can stop this by:
either not directly providing access to the daemon in production, and using scripts,
(which is not what you want here)
or by using an auth plugin to disallow some options.
That is:
dockerd --authorization-plugin=plugin1
Which can lead to:

Is there a way to restrict untrusted container scheduler?

I have an application which I'd like to give the privilege to launch short-lived tasks and schedule these as docker containers. I was thinking of doing this simply via docker run.
As I want to make the attack surface as small as possible, I treat the application as untrusted. As such, it could potentially run arbitrary docker run commands (if the codebase contained a bug, the container was compromised, input was improperly escaped somewhere, etc.) against a predefined Docker API endpoint.
This is why I'd like to restrict that application (effectively a scheduler) in some ways:
prevent --privileged use
enforce --read-only flag
enforce memory & CPU limits
I looked at couple of options:
selinux
the SELinux policies would need to be set at the host level and then applied to containers via the --selinux-enabled flag at the daemon level. The scheduler can, however, override this anyway via run --privileged.
seccomp profiles
these are only applied at the time the container is launched (seccomp flags are available for docker run)
AppArmor
this can (again) be overriden on the scheduler level via --privileged
docker daemon --exec-opts flag
only a single option is actually available for this flag (native.cgroupdriver)
It seems that Docker is designed to trust container schedulers by default.
Does anyone know if this is a design decision?
Is there any other possible solution available with the current, latest Docker version that I missed?
I also looked at Kubernetes and its Limit Ranges & Resource Quotas, which can be applied to K8s namespaces; that looked interesting, assuming there's a way to force certain schedulers to use only certain namespaces. This would, however, increase the scope of this problem to operating a K8s cluster.
Running Docker on a Unix platform should be compatible with nice, or so I would think at first. Looking a little more closely, it looks like you need something like --cpuset-cpus="0,1".
From the second link: "The --cpu-quota looks to be similar to the --cpuset-cpus ... allocate one or a few cores to a process, it's just time managed instead of processor number managed."
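As for the Kubernetes route mentioned in the question, the namespace-level guardrails would be expressed with a ResourceQuota and a LimitRange along these lines (names, namespace, and values are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: scheduler-quota              # placeholder name
  namespace: untrusted-tasks         # placeholder namespace
spec:
  hard:
    pods: "20"                       # cap on concurrently scheduled task pods
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: scheduler-limits             # placeholder name
  namespace: untrusted-tasks
spec:
  limits:
  - type: Container
    defaultRequest:                  # applied when a container sets no requests
      cpu: 100m
      memory: 128Mi
    default:                         # applied when a container sets no limits
      cpu: 500m
      memory: 512Mi
    max:                             # hard ceiling per container
      cpu: "1"
      memory: 1Gi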

containers and host user space shared when created using virsh

I'm trying to set up a container on Red Hat. The container should also run the same Red Hat version as the host. While exploring this, I came across virsh and Docker. virsh supports host-based containers and shares user space with the host machine. Here I got confused by "user space": does it mean filesystem space or something else? Can anyone clarify this for me? Also, in which scenarios/cases can virsh (host-based containers) be used, so that I can decide whether it's better to use virsh or Docker? In my case I need to set up a Red Hat container on a Red Hat host and run multiple instances of the same process, one in each container. The containers should exchange data with each other without using a network interface.
This should help clarify: http://rhelblog.redhat.com/2015/07/29/architecting-containers-part-1-user-space-vs-kernel-space/
It sounds like you really want to use Docker with -v bind mounts to share data. That is an article for a future day :-)
https://docs.docker.com/userguide/dockervolumes/
Current kernels do not yet support the user namespace everywhere.
This is a known limitation of current containerization solutions. Unfortunately, user namespaces were only implemented in recent kernel releases (starting from kernel 3.8, http://kernelnewbies.org/Linux_3.8) and are not yet enabled in many mainstream distributions.
This is one of the strongest limitations of containers right now: if you are root (UID 0) in a container, you are root across the machine operating the container.
This is a problem affecting any product based on LXC, though there is a strong push to fix it. It is actually a needed thing!
Alternatives are to go for hard SELinux jailing, or to work with unprivileged user accounts and assign a different user per container.
From Libvirt documentation https://libvirt.org/drvlxc.html:
User and group isolation
If the guest configuration does not list any ID mapping, then the user and group IDs used inside the container will match those used outside the container. In addition, the capabilities associated with a process in the container will infer the same privileges they would for a process in the host. This has obvious implications for security, since a root user inside the container will be able to access any file owned by root that is visible to the container, and perform more or less any privileged kernel operation. In the absence of additional protection from sVirt, this means that the root user inside a container is effectively as powerful as the root user in the host. There is no security isolation of the root user.
The ID mapping facility was introduced to allow for stricter control over the privileges of users inside the container. It allows apps to define rules such as "user ID 0 in the container maps to user ID 1000 in the host". In addition the privileges associated with capabilities are somewhat reduced so that they cannot be used to escape from the container environment. A full description of user namespaces is outside the scope of this document, however LWN has a good write-up on the topic. From the libvirt point of view, the key thing to remember is that defining an ID mapping for users and groups in the container XML configuration causes libvirt to activate the user namespace feature.
