Kernel configurations for lxc - linux

I am configuring Linux kernel 3.10.31ltsi and want to add the needed support for LXC, as far as I understood, cgroups and namespaces shall be available for LXC, but what are the configurations in menuconfig that need to be included?

You should use a script called "lxc-checkconfig" (which is part of LXC) to check whether your kernel supports or not all required settings; see
https://linuxcontainers.org/lxc/manpages/man1/lxc-checkconfig.1.html
As a side note, I think that LXC uses by default all namespaces; this means that you should set
CONFIG_UTS_NS, CONFIG_IPC_NS, CONFIG_USER_NS, CONFIG_PID_NS,
CONFIG_NET_NS, and the mount namesapces (forgot it's config entry).
Regarding cgroups - not sure, probably the memory, cpu and I/O cgroups controllers are mandatory, and maybe some more; use the lxc-checkconfig script.

Related

hidepid=2 stopped working after an update. Kernel don't suppport "per-mount point"?

I am running arch linux hardened (5.11.13-hardened1-1-hardened) and have been setting hidepid=2 thru the fstab:
proc /proc proc nosuid,nodev,noexec,hidepid=2,gid=proc 0 0
and in the and override file for systemd-logind.service as hidepid.conf:
[Service]
SupplementaryGroups=proc
All accordning to the arch wiki security and everything has been working fine up until a while ago after an update, and I think it is because of systemd-248 update, but I am obviously not sure.
When reading up on systemd changes a came across this section about "ProtectProc=invisible" which is being set in the systemd-logind.service now by default and should obsolete the fstab setting of hidepid=2 but in the description of "ProtectProc=" here freedesktop.org systemd protectproc=
If the kernel doesn't support per-mount point hidepid= mount options this setting remains without effect, and the unit's processes will be able to access and see other process as if the option was not used. This option is only available for system services and is not supported for services running in per-user instances of the service manager.
So what is the meaning of this? Is this something I can fix thru kernel parameters in the hardened kernel?
Best regards

How does AppArmor handle Linux-kernel mount namespaces?

I've searched through wiki of AppArmor's wiki as well as tried Internet searches for "apparmor mount namespace" (or similar). However, I always draw a blank as how AppArmor deals with them, which is especially odd considering that OCI containers could not exist without mount namespaces. Does AppArmor take mount namespaces into any account at all, or does it simply check for the filename passed to some syscall?
If a process inside a container switches mount namespaces does AppArmor take notice at all, or is it simply mount namespace-agnostic in that it doesn't care? For instance, if a container process switches into the initial mount namespace, can I write AppArmor MAC rules to prevent such a process from accessing senstive host files, while the same files inside its own container are allowed for access?
can I write AppArmor MAC rules to prevent such a process from
accessing senstive host files.
Just don't give container access to sensitive host filesystem part. That means don't mount them into container. This is out of scope of AppArmor to take care of if you do.
I would say that AppArmor is partially linux kernel mount namespace aware.
I think the attach_disconnected flag in apparmor is an indication that apparmor knows if you are in the main OS mount namespace or a separate mount namespace.
The attach_disconnected flag is briefly described at this link (despite the warning at the top of the page claiming to be a draft):
https://gitlab.com/apparmor/apparmor/-/wikis/AppArmor_Core_Policy_Reference
The following reference, from a ubuntu apparmor discussion, provides useful and related information although not directly answering your question.
https://lists.ubuntu.com/archives/apparmor/2018-July/011722.html
The following references, from a usenix presentation, provides a proposal to add security namespaces to the Linux kernel for use by frameworks such as apparmor. This does not directly show how / if apparmor currently uses kernel mount namespaces for decision making, but it's related enough to be of interest.
https://www.usenix.org/sites/default/files/conference/protected-files/security18_slides_sun.pdf
https://www.usenix.org/conference/usenixsecurity18/presentation/sun
I don't know if my response here is complete enough to be considered a full answer to your questions, however I don't have enough reputation points to put this into a comment. I also found it difficult to know when the AppArmor documentation meant "apparmor policy namespace" vs "linux kernel mount namespace", when the word "namespace" was specified alone.

LXC : Is it from linuxcontainers.org or part of Linux kernel?

I want to know about LXC and came across this site: https://linuxcontainers.org/lxc/introduction/; in this site, it talks about LXC, LXD, among others.
I am a bit confused, I am under the impression that LXC is a Linux kernel feature, so it should be present in Kernel itself. However, looking at the above site viz: https://linuxcontainers.org/lxc/introduction/, is this same when we say LXC (the kernel feature)? Or is LXC provided to the Linux kernel by https://linuxcontainers.org/lxc/introduction/?
How can I understand this subtle difference?
Most of the core features needed to operate Linux in containers are built into the kernel -- namespaces, control groups, virtual roots, etc. However, to assemble a usable container platform from these features requires a considerable amount of infrastructure. We need to manage container storage, create network links between containers, control per-container resource usage, etc. User-space programs can, and are, used to provide this infrastructure, and the tooling that goes with it.
I have written a series of articles on building a container from scratch that explains some of these issues:
http://kevinboone.me/containerfromscratch.html
It's possible in principle to build and connect containers using nothing but the features built into the kernel, and a bunch of shell scripts. Tools like LXC, Docker, and Podman all use the same kernel features (so far as I know), but they manipulate these features in different ways.

Mounting cgroups for Resource Management in Docker

This is in reference to https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory. I have already created working containers, running Docker version 18.05.0-ce on a Raspberry Pi (64-bit) using Raspbian Jessie Lite (essentially GUI-less Debian Jessie).
The documentation claims that you can just pass memory/cpu flags on the docker run command. But when I try something like docker run -it --name test --memory=512m container_os, it says:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap
I get a similar message about not having cpuset mounted if I pass a cpu-based flag, such as --cpuset-cpus. This obviously means that I don't have these different cgroups mounted for Docker to manage resources correctly, right?
Now referring to https://docs.docker.com/config/containers/runmetrics/#control-groups, I read the section about cgroups, but it wasn't super helpful to my understanding of the situation. So rather than just trying random kernel commands, does anyone with experience have a step-by-step explanation of how to do this the right way?
After quite a bit of research, I figured this out, in-case anyone else out there has this same problem.
In reference to https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt, which is extremely helpful on understanding cgroups, a kernel with all of the proper support should have most of the cgroups for docker mounted by default. If not, there's a command to do so:
From section 2.1 - Basic Usage
"To mount a cgroup hierarchy with all available subsystems, type:
mount -t cgroup xxx /sys/fs/cgroup
The "xxx" is not interpreted by the cgroup code, but will appear in
/proc/mounts so may be any useful identifying string that you like.
Note: Some subsystems do not work without some user input first. For instance,
if cpusets are enabled the user will have to populate the cpus and mems files
for each new cgroup created before that group can be used."
For this particular case, however, trying to mount an individual cgroup, such as cpuset, results in an error saying that the "cpuset special device does not exist". This is because the devs of Raspbian Jessie 8 didn't configure the kernel to support the cgroups that Docker uses for resource management by default. This can easily be determined by typing the docker info command, and seeing this at the bottom of the output:
WARNING: No swap limit support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
WARNING: No cpuset support
These are all of the cgroups that are needed for Docker to manage memory and CPU resources for containers. Testing to see if your kernel supports something like cpuset is easy. If the file /proc/filesystems has an entry that says nodev cpuset, then that means your kernel has cpuset support, but if you're reading this then it probably means it's just not configured in your kernel. That would call for a kernel reconfiguration and rebuild however, which is not so easy.
With the right kernel configurations, it just works automatically like it seems from the Docker Docs.

Do Linux capabilities work with binfmt_misc?

I'm potentially interested in using Linux capabilities for a program (specifically, cap_net_bind_service to allow a program to bind to a TCP port less than 1024).
However, I'd like to do it for a program that is C# running under Mono. Normally, I think that would mean the Mono interpreter itself would need to have the capabilities set on it, rather than the whatever.exe program that it runs.
However, Linux also can have Mono binary kernel support, via the kernel binfmt_misc mechanism.
So, does the kernel binfmt_misc mechanism work with capabilities? That is, so that a particular binfmt_misc-enabled executable file can run with particular capabilities set.
Normally, I think that would mean the Mono interpreter itself would need to have the capabilities set on it[...]
It would take binfmt_misc out of the question if you set capabilities on the process tree in question, rather than on the files.
See cap_set_proc(), and tooling for manipulating it. For instance, if you were using systemd:
[Service]
ExecStart=/usr/bin/mono /path/to/your/executable.exe
User=your_service_account
Capabilities=CAP_NET_BIND_SERVICE

Resources