LXC without chroot - linux

Is there any way to use LXC for resource management using process groups without creating containers? I am working on a service that runs arbitrary code inside a sandbox, for which I am only interested in hardware resource management. I don't want any chrooting; I just want these process groups to have access to the main file system.
I was told that LXC is lightweight, but all the examples that I see create a new container (i.e. a directory with a full OS) for every LXC process. I don't really see how this is much lighter than any other VM solution.
So is there any way that LXC can be used to control and manage multiple process groups, without creating separate containers for each and every one of them?

LXC isn't a monolithic system. It's a collection of kernel features that can be used to isolate processes in various different ways, and a userspace tool to use all of these features together to create full-fledged containers. But the individual features are still usable on their own, without LXC. Furthermore, LXC does not require a chroot, and even when you give it a chroot, you can bind-mount directories from the host system into the container, sharing those particular directory trees between the host and the container.
For instance, cgroups are used by LXC to set resource limits on containers. But they can be used to set resource limits on groups of processes without using the LXC tools at all. You can manipulate /sys/fs/cgroup/memory or /sys/fs/cgroup/cpuacct directly, to put processes into cgroups that limit the amount of memory or CPU they are allowed to use. Or if you're on a system using systemd, you can control the memory limit for a group of processes using MemoryLimit=200M or the like in the .service file for a given service.
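For example, here is a rough sketch on a cgroup v1 system (the group name "sandbox" is just an illustration): create a cgroup, give it a memory limit, and move the current shell into it. Every process started from that shell afterwards inherits the limit.
sudo mkdir /sys/fs/cgroup/memory/sandbox
echo $((200*1024*1024)) | sudo tee /sys/fs/cgroup/memory/sandbox/memory.limit_in_bytes
echo $$ | sudo tee /sys/fs/cgroup/memory/sandbox/cgroup.procs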
If you want to use LXC to do lightweight resource management, you can do that with or without a chroot. When starting an LXC container, you can choose which resources you want to isolate; so you could create a container with only a virtualized network, and nothing else; or a container with only memory limits, but sharing everything else with the host. The only things that will be isolated are those specified in the configuration file for your container. For example, lxc ships with several example container definitions that only isolate the network; they share a root partition and almost everything else with the host. Here's how to run a container identical to the host system except it has no network interface:
sudo lxc-execute -n foo -f /usr/share/doc/lxc/examples/lxc-no-netns.conf /bin/bash
If you want some files to be shared with the host, but not others, you have two choices: you could use a shared root directory and mount over the files that you want to be different in the container, or you could use a chroot but bind-mount the files that you do want to share into the container.
For example, here's the configuration for a container that shares everything with the host except for /home; it instead bind-mounts /home/me/fake-home over /home within the container:
lxc.mount.entry = /home/me/fake-home /home none rw,bind 0 0
Or if you want to have a completely different root but still share some directories like /usr, you can bind-mount a few host directories into a new directory and use that as the root filesystem for the container, as sketched below.
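A rough sketch of that approach (the paths are illustrative, and the exact configuration key depends on your LXC version):
sudo mkdir -p /srv/newroot/usr
sudo mount --bind /usr /srv/newroot/usr
# then point the container at the new root in its configuration file,
# e.g. lxc.rootfs = /srv/newroot (older LXC) or lxc.rootfs.path = dir:/srv/newroot (LXC 2.1+)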
So you have lots of options, and can choose to isolate just one component, more than one, or as many as LXC supports, depending on your needs.

Related

Best practice for Docker inter-container communication

I have two Docker containers, A and B. On container A a Django application is running. On container B a WebDAV source is mounted.
Now I want to check from container A if a folder exists in container B (in the WebDAV mount destination).
What is the best solution to do something like that? Currently I have solved it by mounting the Docker socket into container A to execute commands from A inside B. I am aware that mounting the Docker socket into a container is a security risk for the host and the whole application stack.
Other possible solutions would be to use SSH or share and mount the directory which should be checked. Of course there are further possible solutions like doing it with HTTP requests.
Because there are so many ways to solve a problem like that, I want to know if there is a best practice (considering security, effort to implement, and performance) to execute commands from container A in container B.
Thanks in advance
WebDAV provides a file-system-like interface on top of HTTP, so I'd just use it directly. This requires almost no setup other than providing the other container's name in configuration (and, if you're using plain docker run, putting both containers on the same network), and it's the same setup in basically all container environments (including Docker Swarm, Kubernetes, Nomad, AWS ECS, ...) as well as in a non-Docker development environment.
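A rough sketch of that setup (the container names, image names, and path are illustrative): put both containers on a user-defined network, then check for the folder with a WebDAV PROPFIND request; a 207 response means the folder exists, a 404 means it does not.
docker network create appnet
docker run -d --name webdav --network appnet my-webdav-image
docker run -d --name django --network appnet my-django-image
# run from inside the django container (or from its application code):
curl -s -o /dev/null -w '%{http_code}\n' -X PROPFIND -H 'Depth: 0' http://webdav/some/folder/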
Of the other options you suggest:
Sharing a filesystem is possible. It leads to potential permission problems which can be tricky to iron out. There are potential security issues if the client container isn't supposed to be able to write the files. It may not work well in clustered environments like Kubernetes.
ssh is very hard to set up securely in a Docker environment. You don't want to hard-code a plain-text password that can be easily recovered from docker history; a best-practice setup would require generating host and user keys outside of Docker and bind-mounting them into both containers (I've never seen a setup like this in an SO question). This also brings the complexity of running multiple processes inside a container.
Mounting the Docker socket is complicated, non-portable across environments, and a massive security risk (you can very easily use the Docker socket to root the entire host). You'd need to rewrite that code for each different container environment you might run in. This should be a last resort; I'd consider it only if creating and destroying containers would need to be a key part of this one container's operation.
Is there a best practice to execute commands from container A in container B?
"Don't." Rearchitect your application to have some other way to communicate between the two containers, often over HTTP or using a message queue like RabbitMQ.
One solution would be to mount a filesystem read-only in one container and read-write in the other container.
See this answer: Docker, mount volumes as readonly
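As a rough sketch (the volume and image names are illustrative), the same named volume can be mounted read-write in one container and read-only in the other:
docker volume create shared-data
docker run -d --name webdav -v shared-data:/data my-webdav-image
docker run -d --name django -v shared-data:/data:ro my-django-image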

Why does running top inside a Docker container only show processes inside the container?

I'm running top inside a docker container and I'm seeing that the only processes that show up are the initial process used to run the container and top. Why does it show this instead of displaying other processes on the docker host as well?
In order to understand why this is happening, you need to understand the basic concepts of Linux that Docker is taking advantage of.
There is a feature in the Linux kernel called namespaces that partitions/isolates the host's resources in such a way that one set of processes sees one set of resources, whereas another set of processes sees another set of resources.
Linux has 7 types of namespaces:
Mount - isolate mount points
UTS - isolate hostname
IPC - isolate interprocess communication resources
PID - isolate the PID number space
Network - isolate network interfaces
User - isolate UID/GID number spaces
Cgroup - isolate cgroup root directory
When you are working on your Linux machine, everything that you do runs in the same set of namespaces, but when you create a container with docker run, Docker by default creates new, separate namespaces to isolate the container from your host.
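You can see the namespaces a process belongs to under /proc, and list them with standard util-linux tools; for example:
ls -l /proc/self/ns    # the namespaces the current shell belongs to
sudo lsns              # list all namespaces visible on the host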
In the specific case of your question, you see just one process running (plus top itself) because the container is in a different PID namespace from your host machine.
You can tell Docker to share the host's PID namespace by passing --pid=host when you create the container; there are some cases where doing that is useful.
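For example (alpine is just a convenient small image), running top this way shows all of the host's processes:
docker run --rm -it --pid=host alpine top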

Docker Host Security - Can a container run dangerous code or change the host from inside the container?

Let's say I pull a new image from a hub repository and run it without looking at the contents of the Dockerfile. Can the container or image affect my host in any way?
Please let me know, because I will be running images from a list of user-inputted image names on my server. I am worried that they could affect the server/host.
With a default execution of an image, the answer is a conditional no. The kernel capabilities are limited, the filesystem is restricted, the process space is isolated, and it's on a separate bridged network from the host. Anything that allows access back to the host would be a security vulnerability.
The conditional part is that it can use up all your CPU cycles, it can exhaust your memory, it can fill your drive, and it can send network traffic out from your machine NAT'ed to your IP address. In other words, by default, there's nothing preventing the container from a DoS attack on your host.
Docker does have the ability to limit many of these things, including capping memory, restricting CPUs or prioritizing processes, and there are quota solutions for the filesystem.
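For example (a rough sketch; the image name and limits are illustrative), docker run accepts flags that cap memory, CPU, and the number of processes:
docker run --rm --memory=256m --cpus=0.5 --pids-limit=100 some/untrusted-image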
You can also go the other direction and expose the host to the container, effectively creating security vulnerabilities. This would include mounting host volumes, especially the docker.sock inside the container, removing kernel capability restrictions with --privileged, and removing network isolation with --net=host. Doing any of these with a container turns off the protections that Docker provides by default.
Docker does have a lower level of isolation than a virtual machine because of the way it shares the kernel with the host. So if the code you are running contains a kernel or physical hardware exploit, that could access the host. For this reason, if you are running untrusted code, you may want to look into LinuxKit, which provides a lightweight container-based operating system to run inside a VM. This is used to provide the Moby OS that runs under Hyper-V/xhyve in Docker for Windows/Mac.

What's inside a Docker image/container?

Considering that Docker images/containers come in various flavours (Ubuntu, CentOS, CoreOS, etc.), I'm curious what actually makes up an image/container, and what is shared with the host OS. Where is the dividing line?
For example, I can download the base Ubuntu image and launch it on a CentOS host. Then, when I poke around inside the Ubuntu container I can see that it looks and feels like an Ubuntu server (filesystem layout etc). But if I run a uname command I see the kernel and the likes of the CentOS host....
Obviously I understand that the underlying kernel is shared by all containers on the same host. But what else is shared with the host OS, and what is part of the image/container?
E.g. the kernel is part of the host, the filesystem layout is part of the image/container... Is there a spec that defines this?
It can be helpful to distinguish between images and containers (docs). An image is static and lives only on disk. A container is a running instance of an image and it includes its own process tree as well as RAM and other runtime resources.
An image is a logical grouping of layers plus metadata about what to do when creating a container and how to assemble the layers. Part of that metadata is that each layer knows its parent's ID.
So, what goes into a layer? The files (and directories) you've added to the parent. There are also special files ("whiteout") that indicate that something was deleted from the parent.
When you docker run an image, docker creates a container: it unpacks all the layers in the correct order, creating a new "root" file system separate from the host. docker also reads the image metadata and starts either the "entrypoint" or "command" specified when the image was created -- that starts a new process sub-tree. From inside the container, that first process seems like the root of the tree, but from the host you can see it is a subtree of processes.
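As an illustration (the image and container names are just examples), you can compare the container's view of its processes with the host's view:
docker run -d --name demo nginx
docker top demo       # the container's processes, as Docker reports them
ps -ef | grep nginx   # the same processes appear in the host's process tree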
The root file system is what makes one Linux distro different from another (there can be some kernel module differences as well, and bootloader/boot file system differences, but these are usually invisible to the running processes). The kernel is shared with the host and is, in fact, still managing its usual responsibilities inside the container. But the root file system is different, and so when you're inside the container, it looks and feels like whatever distro was in the Docker image.
The container not only has its own file system and process tree, but also has its own logical network interface and, optionally, its own allocation of RAM and CPU time. You're in control over the container though, as the operator, so you can decide to share the host's network interface with the container, give it unlimited access to RAM and CPU, and even mount devices, files and directories from the host into the container. The default is to keep things separate, but you have the power to break the isolation model as much as you need to.
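For instance (a rough sketch; the paths and limits are illustrative), you can deliberately share the host's network and a host directory while still constraining RAM and CPU:
docker run --rm -it --network=host -v /srv/data:/data --memory=512m --cpus=1 ubuntu bash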
Docker was originally built as a wrapper over LXC Linux Containers (it now ships its own runtime), and the LXC documentation will let you know in detail what is shared and what is not.
In general the host machine sees/contains everything inside the containers, from the file system to processes and so on. You can issue a ps command on the host and see the processes running inside the container.
Remember docker containers are not VMs - hence everything is actually running natively on the host and is using the host kernel directly. Each container has its own set of namespaces (similar in spirit to the root jails of old). There are kernel features which make sure a container only sees its own processes, has its own filesystem layered onto the host filesystem, and has a networking stack which pipes to the host networking stack.

containers and host user space shared when created using virsh

I'm trying to set up a container on Red Hat. The container should run the same Red Hat version as the host. While exploring this, I came across virsh and Docker. virsh supports host-based containers and shares user space with the host machine. I got confused by "user space" here: does it mean filesystem space or something else? Can anyone clarify this for me? Also, in which scenarios/cases can virsh (host-based containers) be used, so that I can decide whether it's better to use virsh or Docker? In my case I need to set up a Red Hat container on a Red Hat host and run multiple instances of the same process, one per container. The containers should exchange data with each other without using a network interface.
This should help clarify: http://rhelblog.redhat.com/2015/07/29/architecting-containers-part-1-user-space-vs-kernel-space/
It sounds like you really want to use Docker with -v bind mounts to share data. That is an article for a future day :-)
https://docs.docker.com/userguide/dockervolumes/
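A rough sketch (the image name and paths are illustrative): two containers can exchange data through the same bind-mounted host directory, without any network interface between them.
docker run -d --name producer -v /srv/shared:/shared my-rhel-image sleep infinity
docker run -d --name consumer -v /srv/shared:/shared my-rhel-image sleep infinity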
Most current distributions do not yet enable the user namespace.
This is a known limitation of current containerization solutions. User namespaces were implemented in recent kernel releases (starting from kernel 3.8, http://kernelnewbies.org/Linux_3.8), though they are not yet enabled in many mainstream distributions.
This is one of the strongest limitations of containers right now: if you are root (UID 0) in a container, you are root on the machine running the container.
This is a problem affecting any product based on LXC, though there is a strong push to fix it. It is genuinely needed!
Alternatives are strict SELinux confinement, or working with unprivileged user accounts and assigning a different user to each container.
From Libvirt documentation https://libvirt.org/drvlxc.html:
User and group isolation
If the guest configuration does not list any ID mapping, then the user and group IDs used inside the container will match those used outside the container. In addition, the capabilities associated with a process in the container will infer the same privileges they would for a process in the host. This has obvious implications for security, since a root user inside the container will be able to access any file owned by root that is visible to the container, and perform more or less any privileged kernel operation. In the absence of additional protection from sVirt, this means that the root user inside a container is effectively as powerful as the root user in the host. There is no security isolation of the root user.
The ID mapping facility was introduced to allow for stricter control over the privileges of users inside the container. It allows apps to define rules such as "user ID 0 in the container maps to user ID 1000 in the host". In addition the privileges associated with capabilities are somewhat reduced so that they cannot be used to escape from the container environment. A full description of user namespaces is outside the scope of this document, however LWN has a good write-up on the topic. From the libvirt point of view, the key thing to remember is that defining an ID mapping for users and groups in the container XML configuration causes libvirt to activate the user namespace feature.
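For illustration (the numeric values are just examples), such a mapping is expressed with an <idmap> element in the libvirt domain XML:
<idmap>
  <uid start='0' target='1000' count='10'/>
  <gid start='0' target='1000' count='10'/>
</idmap>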
