Why does unshare(CLONE_NEWNET) require CAP_SYS_ADMIN?


I'm playing with Linux namespaces and I've noticed that a user who wants to run a process in a new network namespace (without using user namespaces) needs to be root or have the CAP_SYS_ADMIN capability.
The unshare(2) manpage says:
CLONE_NEWNET (since Linux 2.6.24)
This flag has the same effect as the clone(2) CLONE_NEWNET flag.
Unshare the network namespace, so that the calling process is moved into a new network namespace which is not shared with any previously existing process. Use of CLONE_NEWNET requires the CAP_SYS_ADMIN capability.
So, if I want to execute a pdf reader in a network sandbox I must use user-net-namespaces or some privileged wrapper.
Why? The new process will be placed in a new network namespace with no interfaces, so it will be isolated from the real network, right? Which kind of problems/security threats do unprivileged non user network namespaces raise?

Creating a network namespace allows manipulating the execution environment of binaries that have the setuid flag or are otherwise privileged. User namespaces take away this possibility, because a process cannot gain privileges that are not included in the user namespace.
In general, it cannot be known that cutting a privileged process off from the network introduces no security vulnerability. The kernel therefore treats the operation as privileged, and it is up to the system policy to decide whether a privileged utility is provided for ordinary users.
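The difference between the two cases can be probed from user space. The following is a minimal sketch, assuming a Linux system and Python with ctypes; the helper name try_unshare is hypothetical, and the flag values are the ones defined in the Linux <sched.h> header. Each attempt runs in a forked child so that the calling process keeps its own namespaces.

```python
import ctypes
import errno
import os

# Flag values as defined in the Linux <sched.h> header.
CLONE_NEWUSER = 0x10000000
CLONE_NEWNET = 0x40000000

# Handle to the C library already loaded into this process (Linux assumed).
libc = ctypes.CDLL(None, use_errno=True)


def try_unshare(flags):
    """Call unshare(flags) in a forked child.

    Returns 0 if the call succeeded, otherwise the errno it failed with.
    (try_unshare is a hypothetical helper name used for this sketch.)
    """
    pid = os.fork()
    if pid == 0:
        # Child: attempt the unshare and report the outcome as the exit status.
        rc = libc.unshare(flags)
        os._exit(0 if rc == 0 else ctypes.get_errno() & 0xFF)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


def describe(code):
    """Turn a result from try_unshare into a readable label."""
    return "ok" if code == 0 else errno.errorcode.get(code, str(code))


if __name__ == "__main__":
    # Network namespace alone: needs CAP_SYS_ADMIN in the current user namespace.
    print("CLONE_NEWNET:", describe(try_unshare(CLONE_NEWNET)))
    # User + network namespace together: may work unprivileged if the
    # distribution allows unprivileged user namespaces.
    print("CLONE_NEWUSER|CLONE_NEWNET:",
          describe(try_unshare(CLONE_NEWUSER | CLONE_NEWNET)))
```

As an ordinary user, the first call would be expected to fail with EPERM. The second can succeed where unprivileged user namespaces are enabled, but the actual results depend on the kernel configuration and on any sandboxing in effect.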

Related

Kubernetes: privileged containers and security concerns

Running a container in privileged mode is discouraged for security reasons.
For example: https://www.cncf.io/blog/2020/10/16/hack-my-mis-configured-kubernetes-privileged-pods/
It seems obvious to me that it is preferable to avoid privileged containers when a non-privileged container would be sufficient.
However, let's say I need to run a service that requires root access on the host to perform some tasks. Is there an added security risk in running this service in a privileged container (or with some linux capabilities) rather than, for example, a daemon that runs as root (or with those same linux capabilities)? What is the added attack surface?
If a hacker manages to run a command in the context of the container, all right, it is game over. But what kind of vulnerability would allow him to do so that couldn't also be exploited in the case of the aforementioned daemon (apart from sharing the kubeconfig file thoughtlessly)?

First, as you said, it is important to underline that running a container in privileged mode is highly discouraged for security reasons. Here is why:
The risk of running a privileged container lies in the fact that it has access to the host's resources, including the ability to modify the host's system files, access sensitive information, and gain elevated privileges. Because it gives the container more permissions than it would have in non-privileged mode, it significantly increases the attack surface.
If a hacker gains access to the privileged container, he can potentially access and manipulate the host system, move laterally to other systems, and compromise the security of your entire infrastructure. A similar vulnerability in a daemon running as root or with additional Linux capabilities would carry the same risk, as the hacker would have access to the same resources and elevated privileges.
In both cases, it is very important to follow best practices for securing the system, such as reducing the attack surface, implementing least privilege, and maintaining proper network segmentation to reduce the risk of compromise.
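When comparing a privileged container with a root daemon, it can help to look at which capabilities a process actually holds. The sketch below, assuming a Linux /proc filesystem, decodes the CapEff line of /proc/<pid>/status; the function names are hypothetical, and the capability numbers are taken from the Linux <linux/capability.h> header.

```python
# Capability bit numbers from <linux/capability.h>.
CAP_NET_RAW = 13
CAP_SYS_CHROOT = 18
CAP_SYS_ADMIN = 21


def parse_cap_eff(status_text):
    """Return the effective capability mask found in the text of a
    /proc/<pid>/status file, or None if there is no CapEff line."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The value is a hexadecimal bitmask, one bit per capability.
            return int(line.split(":", 1)[1].strip(), 16)
    return None


def has_cap(mask, cap):
    """True if capability number `cap` is set in `mask`."""
    return bool(mask >> cap & 1)


if __name__ == "__main__":
    with open("/proc/self/status") as f:
        mask = parse_cap_eff(f.read())
    if mask is not None:
        print("CAP_SYS_ADMIN:", has_cap(mask, CAP_SYS_ADMIN))
        print("CAP_NET_RAW:", has_cap(mask, CAP_NET_RAW))
```

Running the same check inside the container and for the daemon on the host gives a rough like-for-like comparison of the privileges each one carries.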

In this security article written by the Astra Security team, they mention a PHP remote code execution vulnerability (2020) that an attacker can use to get hold of your server. If the affected process is run by a non-root user, the attack surface is reduced, but if the same service has root access, the attacker can get access to the remaining containers. This is why it is always preferable to configure least-privileged access for all services. Also go through this document for an overview of the attacks that can be performed using privileged containers.

Do DB engines in Docker containers use the host DB engines?

If I run MongoDB inside a container, does it communicate with and use the host's MongoDB?
Or does the container run its own MongoDB instance, separate from the host's?

The short answer is no. The containerized database engine does not use the host one.
The memory of a modern operating system consists of user space and kernel space. Kernel space is where the code of the OS kernel is stored and executed. User space refers to all of the code in an operating system that lives outside of the kernel. This includes all kinds of utilities, programming languages, graphical tools and, among other things, database engines. User space communicates with kernel space using system calls.
The Docker environment is built on two features of the Linux kernel: namespaces and cgroups. A namespace creates a virtually isolated user space and gives an application its own dedicated system resources such as a file system, network stack, etc. The dedicated namespace allows each application to run independently without interfering with other applications on the same host. Cgroups enforce limits on, account for, and control an application's use of hardware resources. By putting namespaces and cgroups together, we can securely run multiple applications in isolated environments on the same host.
That is, if you run your application, say a database engine, inside a container, you will get the following picture (source; figure not reproduced):
This means that the containerized application only has access to the resources provided by the namespace and cgroup. Thus, it does not have access to the applications in the user space of the host system.
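The namespaces a process belongs to are visible under /proc. As a minimal sketch, assuming a Linux /proc filesystem (the function name namespace_ids is hypothetical), the following lists the namespace identifiers of a process. Comparing the output on the host with the output inside a container would show different identifiers for the isolated namespace types.

```python
import os


def namespace_ids(pid="self"):
    """Map namespace type (e.g. 'net', 'mnt', 'pid') to its identifier
    string for the given process, read from /proc/<pid>/ns.

    Returns an empty dict if the directory cannot be read.
    """
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return ids
    for name in names:
        try:
            # Each entry is a symlink whose target names the namespace,
            # in the form "type:[inode number]".
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return ids


if __name__ == "__main__":
    for name, ident in namespace_ids().items():
        print(f"{name:20} {ident}")
```

Two processes that print the same identifier for a given type share that namespace. A database engine in a container and one on the host would normally differ in at least the mount, PID and network namespaces.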

How to limit privileged user access at Linux Kernel level?

I found this answer on learning Linux Kernel Programming, and my question is more specific to the security features of the Linux kernel. I want to know how to limit a privileged user's or process's access rights to other processes and files, in contrast to the full access of root.
Until now I found:
user and group for Discretionary Access Control (DAC), with differentiation in read, write and execute for user, group and other
user root for higher privileged tasks
setuid and setgid to extend a user's DAC and set the group/user ID of the calling process, e.g. a user runs ping with root rights to open Linux sockets
Capabilities for fine-grained rights, e.g. remove suid bit of ping and set cap_net_raw
Control Groups (Cgroups) to limit access on resources i.e. cpu, network, io devices
Namespace to separate process's view on IPC, network, filesystem, pid
Secure Computing (Seccomp) to limit system calls
Linux Security Modules (LSM) to add additional security features like Mandatory Access Control, e.g. SELinux with Type Enforcement
Is the list complete? While writing the question I found fanotify to monitor filesystem events e.g. for anti virus scans. Probably there are more security features available.
Are there any more Linux security features which could be used in a programmable way from inside or outside of a file or process to limit privileged access? Perhaps there is a complete list.

The traditional unix way to limit a process that somehow needs more privileges and yet contain it so that it cannot use more than what it needs is to "chroot" it.
chroot changes the apparent root of a process. If done right, it can only access those resources inside that newly created chroot environment (aka. chroot jail)
e.g. it can only access those files, but also, only those devices etc.
To create a process that does this willingly is relatively easy, and not that uncommon.
To create an environment where an existing piece of software (e.g. a webserver, mailserver, ...) feels at home in and still functions properly is something that requires experience. The main thing is to find the minimal set of resources needed (shared libraries, configuration files, devices, dependent services (e.g. syslog), ... ).
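For the first case, a process that jails itself willingly, the usual order is chroot, change directory, then drop privileges. The following is a minimal sketch, assuming the process starts as root (chroot needs CAP_SYS_CHROOT); the function names and the default uid/gid 65534 (conventionally "nobody") are assumptions made for illustration. The attempt runs in a forked child so the caller is left untouched.

```python
import os
import tempfile


def enter_jail(root_dir, uid, gid):
    """Confine the calling process to root_dir and drop privileges.

    Raises OSError (e.g. PermissionError) if the process lacks the
    privileges needed for any step.
    """
    os.chroot(root_dir)   # change the apparent root directory
    os.chdir("/")         # make sure the working directory is inside the jail
    os.setgroups([])      # drop supplementary groups
    os.setgid(gid)        # drop group first, while still privileged
    os.setuid(uid)        # drop user last; this cannot be undone


def jail_attempt(root_dir, uid=65534, gid=65534):
    """Run enter_jail in a forked child; return True if every step succeeded."""
    pid = os.fork()
    if pid == 0:
        try:
            enter_jail(root_dir, uid, gid)
            os._exit(0)
        except OSError:
            os._exit(1)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    jail = tempfile.mkdtemp(prefix="jail-")
    print("jailed successfully:", jail_attempt(jail))
```

Dropping the user ID after the chroot matters: a process that stays root inside the jail can usually break out of it again. An empty directory is enough for this sketch, whereas a real service would need its libraries, configuration files and devices placed inside the jail, as described above.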

You may add
EFS, AppArmor, Yama
auditctl, ausearch, aureport
Tools similar to fanotify:
Snort, ClamAV, OpenSSL, AIDE, nmap, GnuPG

containers and host user space shared when created using virsh

I'm trying to set up a container on Red Hat. The container should run the same Red Hat version as the host. While exploring this, I came across virsh and Docker. Virsh supports host-based containers and shares user space with the host machine. Here I got confused by the term user space: does it mean filesystem space or something else? Can anyone clarify this for me? Also, in which scenarios can virsh (host-based containers) be used, so that I can decide whether it is better to use virsh or Docker? In my case I need to set up a Red Hat container on a Red Hat host and run multiple instances of the same process in each container. The containers should exchange data with each other without using a network interface.

This should help clarify: http://rhelblog.redhat.com/2015/07/29/architecting-containers-part-1-user-space-vs-kernel-space/
It sounds like you really want to use Docker with -v bind mounts to share data. That is an article for a future day :-)
https://docs.docker.com/userguide/dockervolumes/

Current kernels do not yet support the user namespace.
This is a known limitation of current containerization solutions. User namespaces were implemented in recent kernel releases (starting from kernel 3.8) http://kernelnewbies.org/Linux_3.8 but are not yet enabled in many mainstream distributions.
This is one of the strongest limitations of containers right now: if you are root (UID 0) in a container, you are root across the machine operating the container.
This is a problem affecting any product based on LXC, though there is a strong push to fix this. It is actually a needed thing!
The alternatives are to go for hard SELinux jailing or to work with unprivileged user accounts, assigning different users per container.
From Libvirt documentation https://libvirt.org/drvlxc.html:
User and group isolation
If the guest configuration does not list any ID mapping, then the user and group IDs used inside the container will match those used outside the container. In addition, the capabilities associated with a process in the container will infer the same privileges they would for a process in the host. This has obvious implications for security, since a root user inside the container will be able to access any file owned by root that is visible to the container, and perform more or less any privileged kernel operation. In the absence of additional protection from sVirt, this means that the root user inside a container is effectively as powerful as the root user in the host. There is no security isolation of the root user.
The ID mapping facility was introduced to allow for stricter control over the privileges of users inside the container. It allows apps to define rules such as "user ID 0 in the container maps to user ID 1000 in the host". In addition the privileges associated with capabilities are somewhat reduced so that they cannot be used to escape from the container environment. A full description of user namespaces is outside the scope of this document, however LWN has a good write-up on the topic. From the libvirt point of view, the key thing to remember is that defining an ID mapping for users and groups in the container XML configuration causes libvirt to activate the user namespace feature.
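The mapping rule quoted above can be illustrated with the format the kernel uses in /proc/<pid>/uid_map, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below is an illustration of that translation only, not libvirt's implementation; the function names and the sample map are hypothetical.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map style text into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        entries.append((inside, outside, length))
    return entries


def to_host_id(entries, container_id):
    """Translate an ID seen inside the container to the host ID,
    or return None if the ID is not covered by any mapping."""
    for inside, outside, length in entries:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None


if __name__ == "__main__":
    # Hypothetical map: container root is host user 1000, and the next
    # 65535 container IDs map to a block starting at host ID 100000.
    sample = "0 1000 1\n1 100000 65535\n"
    entries = parse_id_map(sample)
    for cid in (0, 1, 1000, 70000):
        print(cid, "->", to_host_id(entries, cid))
```

With such a map, a process that is UID 0 inside the container acts as host UID 1000 for file ownership checks, which is what keeps container root from being host root.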

How do I disable SELinux for a subprocess launched from Apache?

My Apache module launches a helper subprocess which does, for example, but not limited by, the following things:
It sets up a socket so that it can communicate with Apache.
Reads and writes files in a temporary location that is deleted when Apache exits. These files are used e.g. for storing large amounts of data received over the network, in case that data does not comfortably fit in RAM.
It spawns user-specified executables. Similar to CGI. Each of these spawned processes are run as their own dedicated user.
The helper subprocess is launched as root so that it can manage file ownerships and permissions and can spawn more processes as specific users.
Some users of my module run on systems with SELinux installed, e.g. RedHat-based distros. SELinux usually interferes with my module. Until now I've been telling people to disable SELinux system-wide because I can't figure out how to write a proper policy for my software. Documentation is very scattered, complex and usually only targets system administrators, not software developers.
As a step into the right direction, I want to implement minimal support for SELinux. I'm looking for a way to launch my helper subprocess without any SELinux constraints without disabling SELinux system-wide. Is there a way to do that, and if so, how?

Well... you could write a rule that transitions your domain to unconfined_t, but then you'd piss off quite a few sysadmins. Best to write yourself a new domain that inherits from httpd_t and also adds the appropriate contexts for access.
