containers and host user space shared when created using virsh - linux

I'm trying to set up a container on Red Hat. The container should run the same Red Hat version as the host. While exploring this, I came across virsh and Docker. Virsh supports host-based containers and shares user space with the host machine. Here I got confused by "user space": does it mean filesystem space or something else? Can anyone clarify this for me? Also, in which scenarios/cases can virsh (host-based containers) be used, so that I can decide whether it's better to use virsh or Docker? In my case I need to set up a Red Hat container on a Red Hat host and run multiple instances of the same process in each container. The containers should exchange data with each other without using a network interface.

This should help clarify: http://rhelblog.redhat.com/2015/07/29/architecting-containers-part-1-user-space-vs-kernel-space/
It sounds like you really want to use Docker with -v bind mounts to share data. That is an article for a future day :-)
https://docs.docker.com/userguide/dockervolumes/
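For the "exchange data without a network interface" requirement, here is a minimal sketch of what that could look like; the image name (rhel) and the host path /srv/shared are placeholders, not something from the question:
# Two containers share a host directory through -v bind mounts, so they can
# exchange files without any network interface between them.
docker run -d --name writer -v /srv/shared:/data rhel \
  sh -c 'while true; do date > /data/heartbeat; sleep 5; done'
docker run -d --name reader -v /srv/shared:/data:ro rhel \
  sh -c 'while true; do cat /data/heartbeat; sleep 5; done'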

Many current kernels do not yet enable the user namespace.
This is a known limitation of current containerization solutions. User namespaces were only implemented in recent kernel releases (starting from kernel 3.8, see http://kernelnewbies.org/Linux_3.8) and are not yet enabled in many mainstream distributions.
This is one of the strongest limitations of containers right now: if you are root (UID 0) in a container, you are root across the machine operating the container.
This is a problem affecting any product based on LXC, though there is a strong push to fix it. It is actually a needed thing!
Alternatives are to go for hard SELinux jailing, or to work with unprivileged user accounts, assigning a different user per container.
From Libvirt documentation https://libvirt.org/drvlxc.html:
User and group isolation
If the guest configuration does not list any ID mapping, then the user and group IDs used inside the container will match those used outside the container. In addition, the capabilities associated with a process in the container will infer the same privileges they would for a process in the host. This has obvious implications for security, since a root user inside the container will be able to access any file owned by root that is visible to the container, and perform more or less any privileged kernel operation. In the absence of additional protection from sVirt, this means that the root user inside a container is effectively as powerful as the root user in the host. There is no security isolation of the root user.
The ID mapping facility was introduced to allow for stricter control over the privileges of users inside the container. It allows apps to define rules such as "user ID 0 in the container maps to user ID 1000 in the host". In addition the privileges associated with capabilities are somewhat reduced so that they cannot be used to escape from the container environment. A full description of user namespaces is outside the scope of this document, however LWN has a good write-up on the topic. From the libvirt point of view, the key thing to remember is that defining an ID mapping for users and groups in the container XML configuration causes libvirt to activate the user namespace feature.
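For reference, this is roughly what such an ID mapping looks like in the libvirt LXC domain XML; the target ID 1000 and the count of 65536 are arbitrary illustration values:
<idmap>
  <uid start='0' target='1000' count='65536'/>
  <gid start='0' target='1000' count='65536'/>
</idmap>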

Related

k8s check pod securityContext definition

I want to check whether pods in the cluster are running as privileged pods, which could indicate a security issue, so I check for
privileged: true
However, under the
securityContext: spec there are additional fields like
allowPrivilegeEscalation
RunAsUser
ProcMount
Capabilities
etc.
which may also be risky (I'm not sure about that).
My question is: if the pod is marked as privileged: false but the other fields are true, as in the following example, does this indicate a security issue? Can such pods perform operations on other pods, access external data, etc.?
For example, the following configuration indicates that the pod is not privileged but has allowPrivilegeEscalation: true:
securityContext:
  allowPrivilegeEscalation: true
  privileged: false
I want to know which securityContext combinations in a pod config can control other pods/processes in the cluster.
The securityContext settings relate mostly to the container itself and to some access to the host machine.
allowPrivilegeEscalation allows a process to gain more permissions than its parent process. This is mostly about setuid/setgid flags on binaries, and inside a container there is not much to worry about there.
You can only control other containers on the host machine from inside a container if you have a hostPath volume, or something similar, that lets you reach the runtime socket, such as /run/crio/crio.sock or docker.sock. It is pretty obvious that, if you are concerned about this, allowing requests to the Docker API over the network should also be disabled.
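As an illustration of that risk, a pod spec along these lines (the pod and image names are made up) hands the node's container runtime socket to the container:
apiVersion: v1
kind: Pod
metadata:
  name: runtime-socket-demo        # illustrative name
spec:
  containers:
  - name: app
    image: alpine                  # any image with a shell
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: docker-sock
      mountPath: /var/run/docker.sock
  volumes:
  - name: docker-sock
    hostPath:
      path: /var/run/docker.sock   # mounting the host runtime socket hands over control of the node's containers
      type: Socket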
Of course, all of this access is still governed by DAC and MAC restrictions. This is why podman's uidmap approach is better: root inside the container does not have the same root ID outside the container.
From the Kubernetes point of view, you don't need this kind of privilege; all you need is a ServiceAccount and the correct RBAC permissions to control other things inside Kubernetes. A ServiceAccount bound to the cluster-admin ClusterRole can do anything in the API and much more, like adding SSH keys to the hosts.
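For illustration, this is the kind of binding that grants that level of access; the binding name, ServiceAccount name and namespace are hypothetical:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: overly-broad-binding        # illustrative name
subjects:
- kind: ServiceAccount
  name: my-app                      # hypothetical ServiceAccount
  namespace: default
roleRef:
  kind: ClusterRole
  name: cluster-admin               # full control over the Kubernetes API
  apiGroup: rbac.authorization.k8s.io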
If you are concerned about pods executing things in Kubernetes or on the host, just force the use of non-root containers, avoid indiscriminate use of hostPath volumes, and control your RBAC.
OpenShift applies a very nice set of restrictions by default:
Ensures that pods cannot run as privileged
Ensures that pods cannot mount host directory volumes
Requires that a pod is run as a user in a pre-allocated range of UIDs (OpenShift feature, random UID)
Requires that a pod is run with a pre-allocated MCS label (SELinux related)
This doesn't answer exactly what you asked, because I shifted the focus to RBAC, but I hope it gives you a good idea.
Strictly within the scope of securityContext (as of the Kubernetes 1.26 API), here are a few things that may be risky (a combined example sketch follows the lists below):
Certainly risky
capabilities.add will add Linux capabilities (like CAP_SYS_TIME to set the system time) to a container. The default depends on the container runtime (see for example Docker's default set of capabilities) and should be reasonably secure, but adding capabilities like CAP_SYS_ADMIN may represent a risk. Excessive capabilities outlines a few possible escalations.
privileged: true grants all capabilities, so you'll definitely want to check for that (as you already do).
allowPrivilegeEscalation: true is risky as it allows a process to gain more privileges than its parent.
procMount will allow a container to mount the node's /proc and expose sensitive host information.
windowsOptions may be risky. According to the Kubernetes docs it enables privileged access to the Windows node. I don't know much about Windows security, but I'd say risky :-)
Maybe risky (though usually intended to restrict permissions)
runAsGroup and runAsUser may be risky when set to root / 0. Given that by default the container runtime will probably run the container as root already, they are mostly used to restrict a container's permissions to a non-root user. But if your container runtime is configured to run containers as non-root by default, they could be used to bypass that and run a container as root.
seLinuxOptions may be used to provide an insecure SELinux context, but is usually intended to define a more secure context.
seccompProfile defines system calls a container is allowed to make. It may be used to get access to sensitive system calls, though it's usually intended to restrict them.
(probably) Not risky
readOnlyRootFilesystem (default false) will make the container's root filesystem read-only.
runAsNonRoot (default false) will prevent a container from running as root.
capabilities.drop will drop Linux capabilities, restricting further what a container can do.
You can read more in the official Configure a Security Context documentation.
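To make the lists above concrete, here is a purely illustrative securityContext combining several of the risky settings; it is a sketch of what to look for, not something to deploy:
securityContext:
  privileged: false
  allowPrivilegeEscalation: true     # process may gain more privileges than its parent
  runAsUser: 0                       # runs as root inside the container
  capabilities:
    add: ["SYS_ADMIN", "NET_ADMIN"]  # broad kernel capabilities
  procMount: Unmasked                # exposes more of the node's /proc to the container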
What about non-Security Context related risks?
Security Context is not the only thing you should be wary of: you should also consider volume mounts to insecure locations, RBAC, network, Secrets, etc. A good overview is provided by the Security Checklist.

How to create a docker image of current file and OS system?

I wonder if one can take all the current environment variable settings and OS applications and create a simple Docker layer on top of it all, so that a Docker container user will not be able to damage the host system even if he removes all files, yet will still have the ability to access all installed applications and system settings inside his Docker layer?
Technically you might be able to hack together a solution that does this by copying in all data/apps, installing dependencies, re-configuring the applications and providing a bash shell to attach to for a user to play around with, but this is not what Docker is designed for at all, not to mention that I would not recommend anyone attempt this.
I always try to explain Docker's use case as processes which run in isolated containers with defined interfaces that may be exposed. Meaning you would ideally run one application within it which has an interface exposed for communication.
What you are looking for is essentially a VM with snapshots which you can provide to different users.

Using multiple docker containers on the same host securely like isolated instances

I know multiple Docker containers can be used on the same host, but can they be used securely, like isolated instances? I want to run multiple secure and sandboxed containers such that no container can affect or access the others.
For instance, can I serve nginx and apache containers which listen to different ports, with full trust that each container can only access their own files, resources etc?
In some sense you are asking the million dollar question with containers, and to be clear, IMHO there is no black and white answer to the question "is the platform/technology secure enough?" It is a big (and important) enough question that the number of startups--not to mention the amount of funding they've received--around container security is appreciable!
As noted in another answer, isolation for containers is realized through an assortment of Linux kernel capabilities (namespaces and cgroups), and adding more security to these capabilities is yet another set of technologies like seccomp, apparmor (or SELinux), user namespaces, or general hardening of the container runtime & node it is installed on (e.g. via the CIS benchmark guidelines). Out of the box default installation and default runtime parameters are probably not good enough for generically trusting in the kernel isolation primitives of Linux. However, this depends greatly on the trust level of what you are running across your container workloads. For example, is this all in-house within one organization? Can workloads be submitted from external sources? Obviously the spectrum of possibilities may greatly impact your level of trust.
If your use case is potentially narrow (for example, you mention serving web content from nginx or apache), and you are willing to do some work to handle base image creation, minimization and hardening; add to that a --read-only root filesystem, a capability-limiting AppArmor and seccomp profile, and a bind-mounted content area (served content plus a writeable area) with no executables and owned by an unprivileged user--all those things together might be enough for a specific use case.
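As a rough sketch of that kind of locked-down invocation: the AppArmor profile name and seccomp profile path below are placeholders, and a few capabilities are added back so the stock nginx image can still bind port 80 and switch to its worker user.
# Read-only root filesystem, all capabilities dropped except the minimum nginx needs,
# hypothetical hardened AppArmor/seccomp profiles, and read-only bind-mounted content.
docker run -d --name web \
  --read-only \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE --cap-add CHOWN --cap-add SETUID --cap-add SETGID \
  --security-opt no-new-privileges \
  --security-opt apparmor=hardened-nginx \
  --security-opt seccomp=/etc/docker/nginx-seccomp.json \
  --tmpfs /var/run --tmpfs /var/cache/nginx \
  -v /srv/site:/usr/share/nginx/html:ro \
  nginx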
However, there is no guarantee that a currently unknown security escape becomes a "0day" for Linux containers in the future, and that has led to promotion of lightweight virtualization that marries container isolation with actual hardware-level virtualization through shims from hyper.sh or Intel Clear Containers, as two examples. This is a happy medium between running a full virtualized OS with another container runtime and trusting kernel isolation with a single daemon on a single node. There is still a performance cost and memory overhead to adding this layer of isolation, but it is much less than a fully virtualized OS and work continues to make this less of a performance impact.
For a deeper set of information on all the "knobs" available for tuning container security, a presentation I gave last year several times is available on slideshare as well as via video from Skillsmatter.
The incredibly thorough "Understanding and Hardening Linux Containers" by Aaron Grattafiori is also a great resource with exhaustive detail on many of the same topics.
Filesystem isolation (as well as memory and process isolation) is a core feature of Docker containers, based on Linux kernel capabilities.
But if you wanted to be completely sure, you would deploy your containers on different nodes (each managed by their own docker daemons), each node being a VM (Virtual Machine) on your host, ensuring a complete sandbox.
Then a docker swarm or Kubernetes would be able to orchestrate those node and their containers, and make them communicate.
This is normally not needed when you have just a few linked containers: they should be manageable in isolation by a single Docker daemon. You could use user namespaces for additional isolation.
Plus, using nodes to separate containers implies different machines or different VM within the same machine.
And one big difference between a VM and a container is that a VM will preempt resources (allocate a fixed minimal amount of disk/memory/CPU), which means you cannot launch a hundred VMs, one per container. By contrast, with a single Docker instance, a container that does nothing won't consume much disk space/memory/CPU at all.

Disable certain Docker run options

I'm currently working on a setup to make Docker available on a high performance cluster (HPC). The idea is that every user in our group should be able to reserve a machine for a certain amount of time and be able to use Docker in a "normal way". Meaning accessing the Docker Daemon via the Docker CLI.
To do that, the user would be added to the Docker group. But this imposes a big security problem for us, since this basically means that the user has root privileges on that machine.
The new idea is to make use of the user namespace mapping option (as described in https://docs.docker.com/engine/reference/commandline/dockerd/#/daemon-user-namespace-options). As I see it, this would tackle our biggest security concern that the root in a container is the same as the root on the host machine.
But as long as users are able to bypass this via --userns=host, this doesn't increase security in any way.
Is there a way to disable this and other Docker run options?
As mentioned in issue 22223
There are a whole lot of ways in which users can elevate privileges through docker run, eg by using --privileged.
You can stop this by:
either not directly providing access to the daemon in production, and using scripts,
(which is not what you want here)
or by using an auth plugin to disallow some options.
That is:
dockerd --authorization-plugin=plugin1
With such a plugin in place, the daemon can reject requests that use options you have disallowed.
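Putting the two daemon-side controls together, a /etc/docker/daemon.json along these lines (the plugin name is just the placeholder from above) keeps the restrictions out of users' hands:
{
  "userns-remap": "default",
  "authorization-plugins": ["plugin1"]
}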

Is there a way to restrict untrusted container scheduler?

I have an application which I'd like to give the privilege to launch short-lived tasks and schedule these as docker containers. I was thinking of doing this simply via docker run.
As I want to make the attack surface as small as possible, I treat the application as untrusted. As such, it can potentially run arbitrary docker run commands (if the codebase contained a bug, the container was compromised, input was improperly escaped somewhere, etc.) against a predefined Docker API endpoint.
This is why I'd like to restrict that application (effectively a scheduler) in some ways:
prevent --privileged use
enforce --read-only flag
enforce memory & CPU limits
I looked at couple of options:
selinux
the SELinux policies would need to be set at the host level and then propagated inside the containers via the --selinux-enabled flag at the daemon level. The scheduler can however override this anyway via run --privileged.
seccomp profiles
these are only applied at a time of launching the container (seccomp flags are available for docker run)
AppArmor
this can (again) be overridden at the scheduler level via --privileged
docker daemon --exec-opts flag
only a single option is actually available for this flag (native.cgroupdriver)
It seems that Docker is designed to trust container schedulers by default.
Does anyone know if this is a design decision?
Is there any other possible solution available w/ current latest Docker version that I missed?
I also looked at Kubernetes and its Limit Ranges & Resource Quotas which can be applied to K8S namespaces, which looked interesting, assuming there's a way to enforce certain schedulers to only use certain namespaces. This would however increase the scope of this problem to operating K8S cluster.
Running Docker on a Unix platform should be compatible with nice. Or so I would think at first; looking a little more closely, it looks like you need something like --cpuset-cpus="0,1".
From the second link, "The --cpu-quota looks to be similar to the --cpuset-cpus ... allocate one or a few cores to a process, it's just time managed instead of processor number managed."
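A quick sketch of those resource flags on docker run, with arbitrary values and a throwaway busybox container:
# Pin the container to two cores, cap its CPU time per period, and limit memory.
docker run -d --name worker \
  --cpuset-cpus="0,1" \
  --cpu-quota=50000 \
  --memory=512m \
  busybox sleep 3600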

Resources