Linux - understanding the mount namespace & clone CLONE_NEWNS flag

I am reading the mount & clone man pages. I want to clarify how CLONE_NEWNS affects the view of the file system for the child process.
(File hierarchy)
Let's consider this tree to be the directory hierarchy. Let's say 5 & 6 are mount points in the parent process. I clarified mount points in another question.
So my understanding is that 5 & 6 being mount points means that the mount command was used previously to 'mount' file systems (directory hierarchies) at 5 & 6 (which means there must be directory trees under 5 & 6 as well).
From mount man page :
A mount namespace is the set of filesystem mounts that are visible to a process.
From clone man page :
Every process lives in a mount namespace. The namespace of a process is the data
(the set of mounts) describing the file hierarchy as seen by that process. After
a fork(2) or clone() where the CLONE_NEWNS flag is not set, the child lives in the
same mount namespace as the parent.
Also :
After a clone() where the CLONE_NEWNS flag is set, the cloned child is started in a
new mount namespace, initialized with a copy of the namespace of the parent.
Now if I use clone() with CLONE_NEWNS to create a child process, does this mean that the child will get an exact copy of the mount points in the tree (5 & 6) and still be able to access the rest of the original tree? Does it also mean that the child could mount 5 & 6 at will, without affecting what's mounted at 5 or 6 in its parent process's mount namespace?
If yes, does it also mean that the child could mount / unmount a different directory than 5 or 6 and affect what's visible to the parent process?
Thanks.

The “mount namespace” of a process is just the set of mounted filesystems that it sees. Once you go from the traditional situation of having one global mount namespace to having per-process mount namespaces, you must decide what to do when creating a child process with clone().
Traditionally, mounting or unmounting a filesystem changed the filesystem as seen by all processes: there was one global mount namespace, seen by all processes, and if any change was made (e.g. using the mount command) all processes would immediately see that change irrespective of their relationship to the mount command.
With per-process mount namespaces, a child process can now have a different mount namespace to its parent. The question now arises:
Should changes to the mount namespace made by the child propagate back to the parent?
Clearly, this functionality must at least be supported and, indeed, must probably be the default. Otherwise, launching the mount command itself would effect no change (since the filesystem as seen by the parent shell would be unaffected).
Equally clearly, it must also be possible for this necessary propagation to be suppressed, otherwise we can never create a child process whose mount namespace differs from its parent, and we have one global mount namespace again (the filesystem as seen by init).
Thus, we must decide when creating a child process with clone() whether the child process gets its own copy of the data about mounted filesystems from the parent, which it can change without affecting the parent, or gets a pointer to the same data structures as the parent, which it can change (necessary for changes to propagate back, as when you launch mount from the shell).
If the CLONE_NEWNS flag is passed to clone(), the child gets a copy of its parent's mounted filesystem data, which it can change without affecting the parent's mount namespace. Otherwise, it gets a pointer to the parent's mount data structures, where changes made by the child will be seen by the parent (so the mount command itself can work).
Now if I use clone with CLONE_NEWNS to create a child process, does this mean that child will get an exact copy of the mount points in the tree (5 & 6) and still be able to access the rest of the original tree ?
Yes. It sees the exact same tree as its parent after the call to clone().
Does it also mean that the child could mount 5 & 6 at will, without affecting what's mounted at 5 or 6 in its parent process's mount namespace?
Yes. Since you've used CLONE_NEWNS, the child can unmount one device from 5 and mount another device there, and only it (and its children) could see the changes. No other process can see the changes made by the child in this case.
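If yes, does it also mean that the child could mount / unmount a different directory than 5 or 6 and affect what's visible to the parent process?
No. If you've used CLONE_NEWNS, the changes made in the child do not propagate back to the parent, as long as the mount points involved are not marked as shared (see the shared-subtree propagation types described in mount_namespaces(7)).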
If you haven't used CLONE_NEWNS, the child would have received a pointer to the same mount namespace data as its parent, and any changes made by the child would be seen by any process that shares those data structures, including the parent. (This is also the case when the new child is created using fork().)
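The fork() case can be checked from user space. The following is a minimal sketch, assuming a Linux kernel that exposes /proc/<pid>/ns/mnt; the helper names mount_ns_id and forked_child_ns_id are made up for this illustration. It compares the mount-namespace identifier seen by a parent with the one seen by a child created without CLONE_NEWNS.

```python
import os


def mount_ns_id(pid="self"):
    """Return the mount-namespace identifier of a process, e.g. 'mnt:[4026531840]'."""
    return os.readlink(f"/proc/{pid}/ns/mnt")


def forked_child_ns_id():
    """fork() a child (so no CLONE_NEWNS) and return the identifier it reports."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report our own mount namespace through the pipe, then exit.
        os.close(read_fd)
        os.write(write_fd, mount_ns_id().encode())
        os._exit(0)
    # Parent: read what the child reported and reap it.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        child_id = reader.read()
    os.waitpid(pid, 0)
    return child_id
```

Comparing forked_child_ns_id() with mount_ns_id() in the parent should show the same identifier, since both processes share one namespace. If the child instead called unshare(2) with CLONE_NEWNS (which needs sufficient privilege), the two identifiers would differ.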

I don't have enough reputation points to add a comment so instead adding this comment as an answer.
It's just an add on to Emmet's answer.
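AFAICU, if a process is created with the CLONE_NEWNS flag set and runs in a user namespace other than the initial one (the check below tests the user namespace, not the mount namespace itself), it can only mount those file systems which have the FS_USERNS_MOUNT flag set. Almost all disk-based file systems do not set this flag (for security reasons).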
In do_new_mount, there is this check:
    if (user_ns != &init_user_ns) {
        if (!(type->fs_flags & FS_USERNS_MOUNT)) {
            put_filesystem(type);
            return -EPERM;
        }
        /* ... remainder of the block omitted ... */
    }
Please correct me if I am wrong.

Related

Getting the root device in a kernel module

I did some web searches for this, but could only find results about getting the kernel module associated with a device node. Is there any way I can get the major and minor numbers of the current system's root device and, if applicable, the root device's parent device (e.g., /dev/sda is the "parent" of /dev/sda2)? Does the kernel export some functions for getting this or would I need to get it indirectly?
There is no module associated with a device node. As you may know, the root directory is local to each process (the process structure stores a reference to the inode of the root directory, which can be changed with the privileged chroot(2) system call), as is the current working directory (used to resolve paths that do not begin with /).
If you want to know the device that holds the root directory, there are two cases:
Your process has not made a chroot(2) call: you can opendir("/") and then call fstat(2) on it (or simply call stat(2) on the "/" directory). This gives you the device on which the root directory resides in the st_dev field of the returned struct stat. It is a dev_t number, in which some of the bits represent the major number and some the minor number. You can use the MKDEV(ma,mi), MAJOR(dev) and MINOR(dev) macros defined in <linux/kdev_t.h> to build and split it (user-space programs can use major() and minor() from <sys/sysmacros.h>). To get the physical disk, mask the minor number with 0xf0 and you will get the minor number of the whole disk; this assumes a driver such as sd, which reserves 16 minor numbers per disk.
Your process has made a chroot(2) call, so you are not allowed to access the real root directory of the system. If you have access to the /proc filesystem, you can probably run the mount(1) command to get the mount table, search that table for the / entry, and read the /dev/sd<disk> entry from it. Once you have the device, getting the parent device is easy: mask the number as in the previous case to get the minor number of the physical disk.
You can also read the /proc/diskstats file, which shows the statistics of each block device. The first three fields of each line are the major number, the minor number and the device name.
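The stat(2) approach from the first case can be sketched from user space as follows. This is only an illustration: the function names are hypothetical, and the default of 16 minor numbers per disk is an assumption that matches the 0xf0 mask for sd-style disks but not every driver.

```python
import os


def root_device_numbers(path="/"):
    """Return (major, minor) of the device that holds the given path."""
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)


def whole_disk_minor(minor, minors_per_disk=16):
    """Round a partition's minor number down to the minor of its whole disk.

    Assumes the driver reserves a fixed block of minors per disk
    (16 for sd, which is what masking with 0xf0 relies on).
    """
    return minor - (minor % minors_per_disk)
```

For example, with the sd numbering, /dev/sdb2 has minor 18 and whole_disk_minor(18) gives 16, the minor of /dev/sdb. On systems where / lives on an overlay or another virtual file system, the reported major number may be 0, in which case there is no underlying disk to look up this way.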
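A small parser for that file might look like the sketch below. The function names are made up for this example, and only the first three fields (major, minor, name) described above are used; the statistics columns are ignored.

```python
def parse_diskstats(text):
    """Parse /proc/diskstats content into a list of (major, minor, name) tuples."""
    devices = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        devices.append((int(fields[0]), int(fields[1]), fields[2]))
    return devices


def find_device(devices, major, minor):
    """Return the device name for a (major, minor) pair, or None if absent."""
    for dev_major, dev_minor, name in devices:
        if (dev_major, dev_minor) == (major, minor):
            return name
    return None
```

On a live system the input would come from open("/proc/diskstats").read(), and the (major, minor) pair from a stat(2) call on the root directory.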
NOTE
There are some disk arrangements that don't allow partitioning, such as RAID devices or volume manager disks. In those cases, getting to the physical disk (or disks, as there can be more than one) is more difficult.

Get notification on cgroup process change?

Basically, inotify which normally serves to notify on filesystem changes doesn't work within the cgroup virtual filesystem.
Essentially I want a way to get a notification similar to inotify when a process in a cgroup either dies or forks. I tried attaching inotify to the tasks virtual file inside the cgroup filesystem but that does nothing when a process forks on its own, only when a userspace tool actually manually writes to it to influence the cgroup.
inotify does not work on such virtual file systems, be it cgroup, proc or sys.
Note: I tried this too, it would have been very handy in some situations, but nope. :-)
This is because the files and directories do not actually exist per se (for example, they take 0 disk space); they are produced for you on the fly by the kernel as you visit them.
So the alternative would be to actively visit the files and directories periodically in a polling loop, which is so ugly that it is not a real alternative in most cases.
And this is why programs such as top and htop consume so much CPU: they actively scan the proc virtual file system rather than waiting for events through inotify, select or similar mechanisms.
EDIT:
But there are some things that could help you though:
1/ For recent kernels (cgroups have been re-designed):
Look at:
https://www.kernel.org/doc/Documentation/cgroup-v2.txt
I quote:
2-3. [Un]populated Notification
Each non-root cgroup has a "cgroup.events" file which contains
"populated" field indicating whether the cgroup's sub-hierarchy has
live processes in it. Its value is 0 if there is no live process in
the cgroup and its descendants; otherwise, 1. poll and [id]notify
events are triggered when the value changes. [...]
2/ For older kernels:
You may want to have a look at notify_on_release and release_agent. Have a look at:
https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
notify_on_release flag: run the release agent on exit?
release_agent: the path to use for release notifications (this file exists in the top cgroup only)
And the sections "1.4 What does notify_on_release do ?" and "1.5 What does clone_children do ?"
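For the cgroup v2 "cgroup.events" file quoted above, a watcher could be sketched as below. This is a sketch under assumptions: the function names are hypothetical, the path to the cgroup's cgroup.events file is supplied by the caller, and the change notification is assumed to reach poll(2) as POLLPRI or POLLERR. It reports when the cgroup becomes populated or empty, not every individual fork or exit.

```python
import os
import select


def parse_cgroup_events(text):
    """Turn 'populated 1\\nfrozen 0\\n' into {'populated': 1, 'frozen': 0}."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key:
            fields[key] = int(value)
    return fields


def watch_populated(events_path):
    """Yield the 'populated' value now, and again after each change notification.

    A notification fires when any field in the file changes, so consecutive
    values may be equal.
    """
    fd = os.open(events_path, os.O_RDONLY)
    try:
        poller = select.poll()
        poller.register(fd, select.POLLPRI | select.POLLERR)
        while True:
            os.lseek(fd, 0, os.SEEK_SET)  # re-read the file from the start
            data = os.read(fd, 4096).decode()
            yield parse_cgroup_events(data)["populated"]
            poller.poll()  # block until the kernel signals a change
    finally:
        os.close(fd)
```

A caller would iterate over watch_populated(path) and react when the value drops to 0, which plays a similar role to the v1 release_agent without spawning a helper program.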

How is chroot-escape protection implemented in LXC?

How is chroot-escape protection implemented in LXC? Is there a guarantee that there is no way to escape from an LXC container to the host?
I know that linux-vserver uses a chroot barrier for that, but it isn't part of the stock kernel, AFAIR.
Did you see the info contained in the "Applying mount namespaces" article: http://www.ibm.com/developerworks/library/l-mount-namespaces/
Under the "Per-user root" section:
"One shortcoming of this approach is that an ordinary chroot() can be escaped, although some privilege is needed. For instance, when executed with certain capabilities including CAP_SYS_CHROOT, the source for a program to break out of chroot() (see Resources) will cause the program to escape into the real filesystem root. Depending on the actual motivation for and use of the per-user filesystem trees, this may be a problem.
We can address this problem by using pivot_root(2) in a private namespace instead of chroot(2) to change the login's root to /share/USER/root. Whereas chroot() simply points the process's filesystem root to a specified new directory, pivot_root() detaches the specified new_root directory (which must be a mount) from its mount point and attaches it to the process root directory. Since the mount tree has no parent for the new root, the system cannot be tricked into entering it like it can with chroot(). We will use the pivot_root() approach."
In short I've seen pivot_root used in combination with the mnt namespace to mitigate such concerns.

Linux - use of term mount in clone man page

I asked a question to clarify what mount means in Linux.
I have a doubt in the use of this term in the clone man page:
The namespace of a process is the data (the set of mounts) describing the file
hierarchy as seen by that process.
The set of mounts - describing the file hierarchy seems misleading to me.
From what I understood based on the accepted answer, the file hierarchy could be much more than just the set of mounts, as the set of mounts will just be the mount points where file systems were added to the existing file system.
Can anyone clarify ?
If you think of “the set of mounts” as being (at least) a set of (device, mount point) pairs, rather than merely a set of mount points, then it starts to look a lot like the fstab or the output of the mount command (with no arguments), albeit without the additional information about flags and options (e.g. rw, nosuid, etc.).
Such a “set of mounts” provides complete information about what filesystems are mounted where. This is, by definition, the “mount namespace” of a process. Once you go from the traditional situation of having one global mount namespace to having per-process mount namespaces, additional questions arise when a process fork()s.
Traditionally, mounting or unmounting a filesystem changed the filesystem as seen by all processes.
With per-process mount namespaces, it is possible for a child process to have a different mount namespace from its parent. A question now arises:
Should changes to the mount namespace made by the child propagate back to the parent?
Clearly, this functionality must at least be supported and, indeed, must probably be the default. Otherwise, launching the mount command itself would effect no change (since the filesystem as seen by the parent shell would be unaffected).
Equally clearly, it must also be possible for this necessary propagation to be suppressed, otherwise we can never create a child process whose mount namespace differs from its parent, and we have one global mount namespace again (the filesystem as seen by init).
Thus, we must decide when creating the child whether it gets its own copy of the data about mounted filesystems from the parent, which it can change without affecting the parent, or gets a pointer to the same data structures as the parent, which it can change (necessary for changes to propagate back, as when you launch mount from the shell).
If the CLONE_NEWNS flag is passed to clone(), the child gets a copy of its parent's mounted filesystem data, which it can change without affecting the parent's mount namespace. Otherwise (including after a plain fork(), which takes no flags), it gets a pointer to the parent's data structure, where changes made by the child will be seen by the parent (so the mount command itself can work).
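To look at a process's "set of mounts" as (device, mount point) pairs, one can read /proc/<pid>/mounts, which has the same layout as fstab. The sketch below uses made-up function names and leaves escaped characters in mount points (such as \040 for a space) untouched.

```python
def parse_mounts(text):
    """Parse fstab-style mount table text into (device, mount_point) pairs."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def current_mounts(pid="self"):
    """Return the (device, mount_point) pairs visible to the given process."""
    with open(f"/proc/{pid}/mounts") as f:
        return parse_mounts(f.read())
```

Two processes in the same mount namespace get the same list from current_mounts(); after one of them moves to a new namespace and mounts something there, the lists can diverge.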

How to check the state of Linux threads?

How could I check the state of a Linux thread from code, not with tools? I want to know if a thread is running, blocked on a lock, or asleep for some other reason. I know the Linux tool "top" can do this, but how do I implement it in my own code? Thanks.
I think you should study in details the /proc file system, also documented here, inside kernel source tree.
It is the way the Linux kernel tells things to outside!
There is a libproc also (used by ps and top, which reads /proc/ pseudo-files).
See this question, related to yours.
Reading files under /proc/ doesn't do any disk I/O (because /proc/ is a pseudo file system), so it is fast.
Let's say your process id is 100.
Go to the /proc/100/task directory; it contains one subdirectory per thread.
Inside each subdirectory, e.g. /proc/100/task/10100, there is a file named status.
The line beginning with "State:" in this file (the second or third line, depending on the kernel version) holds the state of the thread.
You could also find the threads by looking at the cgroup hierarchy of the service that your process belongs to. Cgroups have a file called "tasks", which lists all the tasks of a service.
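The steps above can be sketched in a few lines. The function names are hypothetical; the code looks for the line starting with "State:" rather than relying on a fixed line number, and skips threads that exit while the directory is being read.

```python
import os


def parse_state(status_text):
    """Extract the value of the 'State:' line, e.g. 'S (sleeping)'."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split(":", 1)[1].strip()
    return None


def thread_states(pid="self"):
    """Map each thread id of a process to its state string."""
    states = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/status") as f:
                states[int(tid)] = parse_state(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited between listdir() and open()
    return states
```

The first letter of each state string is the usual one-character code shown by ps and top (R for running, S for sleeping, D for uninterruptible sleep, and so on).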
For example:
cat /sys/fs/cgroup/systemd/system.slice/hello.service/tasks
Note: cgroup should be enabled in your linux kernel.

Resources