Linux - use of term mount in clone man page

I asked a question to clarify what mount means in Linux.
I have a question about the use of this term in the clone man page:
The namespace of a process is the data (the set of mounts) describing the file
hierarchy as seen by that process.
The phrase "the set of mounts" describing the file hierarchy seems misleading to me.
From what I understood based on the accepted answer, the file hierarchy could be much more than just the set of mounts, since the set of mounts would just be the mount points where filesystems were attached to the existing filesystem.
Can anyone clarify?

If you think of “the set of mounts” as being (at least) a set of (device, mount point) pairs, rather than merely a set of mount points, then it starts to look a lot like the fstab or the output of the mount command (with no arguments), albeit without the additional information about flags and options (e.g. rw, nosuid, etc.).
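To make the "(device, mount point) pairs" view concrete, here is a minimal Python sketch that extracts such pairs from text in the format of /proc/mounts. The function name parse_mount_pairs and the SAMPLE data are made up for illustration; on a real Linux system you could feed it the contents of /proc/self/mounts instead.

```python
from typing import List, Tuple


def parse_mount_pairs(mounts_text: str) -> List[Tuple[str, str]]:
    """Return (device, mount point) pairs from /proc/mounts-style text."""
    pairs = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # The first field is the device, the second is the mount point.
        # Note: the kernel escapes spaces in paths as \040; this sketch
        # does not decode those escapes.
        pairs.append((fields[0], fields[1]))
    return pairs


# Hypothetical sample data, not taken from a real machine.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec 0 0
/dev/sdb1 /home ext4 ro,nosuid 0 0
"""

if __name__ == "__main__":
    for device, mount_point in parse_mount_pairs(SAMPLE):
        print(f"{device} on {mount_point}")
```

The remaining fields on each line (filesystem type, options such as rw or nosuid) are the "additional information" mentioned above and are simply ignored here.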
Such a “set of mounts” provides complete information about what filesystems are mounted where. This is, by definition, the “mount namespace” of a process. Once you go from the traditional situation of having one global mount namespace to having per-process mount namespaces, additional questions arise when a process fork()s.
Traditionally, mounting or unmounting a filesystem changed the filesystem as seen by all processes.
With per-process mount namespaces, it is possible for a child process to have a different mount namespace from its parent. A question now arises:
Should changes to the mount namespace made by the child propagate back to the parent?
Clearly, this functionality must at least be supported and, indeed, must probably be the default. Otherwise, launching the mount command itself would effect no change (since the filesystem as seen by the parent shell would be unaffected).
Equally clearly, it must also be possible for this necessary propagation to be suppressed, otherwise we can never create a child process whose mount namespace differs from its parent, and we have one global mount namespace again (the filesystem as seen by init).
Thus, when a child process is created, we must decide whether it gets its own copy of the data about mounted filesystems from the parent, which it can change without affecting the parent, or gets a pointer to the same data structures as the parent, which it can change (necessary for changes to propagate back, as when you launch mount from the shell).
If the CLONE_NEWNS flag is passed to clone(), the child gets a copy of its parent's mounted filesystem data, which it can change without affecting the parent's mount namespace. Otherwise (including with fork(), which takes no flags), it gets a pointer to the parent's data structures, where changes made by the child will be seen by the parent (so the mount command itself can work).

Related

Get notification on cgroup process change?

Basically, inotify which normally serves to notify on filesystem changes doesn't work within the cgroup virtual filesystem.
Essentially I want a way to get a notification similar to inotify when a process in a cgroup either dies or forks. I tried attaching inotify to the tasks virtual file inside the cgroup filesystem, but that does nothing when a process forks on its own, only when a userspace tool actually writes to it manually to influence the cgroup.
inotify does not work on such virtual filesystems, be it cgroup, proc or sys.
Note: I tried this too, it would have been very handy in some situations, but nope. :-)
This is because the files and directories do not actually exist per se (for example, they take 0 disk space); they are produced for you on the fly by the kernel as you visit them.
So the alternative would be to actively visit the files and directories periodically in a polling loop, which is so ugly that it is not a real alternative in most cases.
And this is why programs such as top and htop consume so much CPU. They actively walk the proc virtual filesystem rather than waiting on inotify, select or similar in an event-driven manner.
EDIT:
But there are some things that could help you though:
1/ For recent kernels (cgroups have been re-designed):
Look at:
https://www.kernel.org/doc/Documentation/cgroup-v2.txt
I quote:
2-3. [Un]populated Notification
Each non-root cgroup has a "cgroup.events" file which contains
"populated" field indicating whether the cgroup's sub-hierarchy has
live processes in it. Its value is 0 if there is no live process in
the cgroup and its descendants; otherwise, 1. poll and [id]notify
events are triggered when the value changes. [...]
2/ For older kernels:
You may want to have a look at notify_on_release and release_agent. Have a look at:
https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
notify_on_release flag: run the release agent on exit?
release_agent: the path to use for release notifications (this file exists in the top cgroup only)
And the sections "1.4 What does notify_on_release do ?" and "1.5 What does clone_children do ?"
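For the cgroup v2 case quoted above, a rough Python sketch of waiting on cgroup.events might look like the following. The function names and the example path are assumptions for illustration. Note that, per the quoted documentation, this only reports transitions between "no live processes" and "some live processes"; it does not report every individual fork or exit.

```python
import select


def parse_cgroup_events(text):
    """Parse 'key value' lines from a cgroup.events file into a dict of ints."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def wait_for_events_change(events_path, timeout_ms=None):
    """Block until cgroup.events changes (or the timeout expires).

    Returns the parsed contents of the file after waking up.
    """
    with open(events_path) as events_file:
        events_file.read()  # consume the current state before waiting
        poller = select.poll()
        # Kernel-generated files of this kind signal changes as POLLPRI/POLLERR.
        poller.register(events_file.fileno(), select.POLLPRI | select.POLLERR)
        poller.poll(timeout_ms)
        events_file.seek(0)
        return parse_cgroup_events(events_file.read())


if __name__ == "__main__":
    # Hypothetical cgroup path; adjust to a cgroup that exists on your system.
    state = wait_for_events_change("/sys/fs/cgroup/mygroup/cgroup.events")
    print("populated" if state.get("populated") else "empty")
```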

How is chroot-escape protection implemented in LXC?

How is chroot-escape protection implemented in LXC? Is there a guarantee that there is no way to escape from an LXC container to the host?
I know that linux-vserver uses a chroot barrier for that, but it isn't part of the stock kernel, AFAIR.
Have you seen the information in the "Applying mount namespaces" article: http://www.ibm.com/developerworks/library/l-mount-namespaces/
Under the "Per-user root" section:
"One shortcoming of this approach is that an ordinary chroot() can be escaped, although some privilege is needed. For instance, when executed with certain capabilities including CAP_SYS_CHROOT, the source for a program to break out of chroot() (see Resources) will cause the program to escape into the real filesystem root. Depending on the actual motivation for and use of the per-user filesystem trees, this may be a problem.
We can address this problem by using pivot_root(2) in a private namespace instead of chroot(2) to change the login's root to /share/USER/root. Whereas chroot() simply points the process's filesystem root to a specified new directory, pivot_root() detaches the specified new_root directory (which must be a mount) from its mount point and attaches it to the process root directory. Since the mount tree has no parent for the new root, the system cannot be tricked into entering it like it can with chroot(). We will use the pivot_root() approach."
In short, I've seen pivot_root used in combination with the mount namespace to mitigate such concerns.

Linux - understanding the mount namespace & clone CLONE_NEWNS flag

I am reading the mount and clone man pages. I want to clarify how CLONE_NEWNS affects the view of the filesystem for the child process.
(File hierarchy)
Let's consider this tree to be the directory hierarchy. Let's say 5 & 6 are mount points in the parent process. I clarified mount points in another question.
So my understanding is: saying that 5 & 6 are mount points means that the mount command was used previously to 'mount' filesystems (directory hierarchies) at 5 & 6 (which means there must be directory trees under 5 & 6 as well).
From mount man page :
A mount namespace is the set of filesystem mounts that are visible to a process.
From clone man page :
Every process lives in a mount namespace. The namespace of a process is the data
(the set of mounts) describing the file hierarchy as seen by that process. After
a fork(2) or clone() where the CLONE_NEWNS flag is not set, the child lives in the
same mount namespace as the parent.
Also :
After a clone() where the CLONE_NEWNS flag is set, the cloned child is started in a
new mount namespace, initialized with a copy of the namespace of the parent.
Now if I use clone() with CLONE_NEWNS to create a child process, does this mean that the child will get an exact copy of the mount points in the tree (5 & 6) and still be able to access the rest of the original tree? Does it also mean that the child could mount 5 & 6 at will, without affecting what's mounted at 5 or 6 in its parent process's mount namespace?
If yes, does it also mean that the child could mount / unmount a different directory than 5 or 6 and affect what's visible to the parent process?
Thanks.
The “mount namespace” of a process is just the set of mounted filesystems that it sees. Once you go from the traditional situation of having one global mount namespace to having per-process mount namespaces, you must decide what to do when creating a child process with clone().
Traditionally, mounting or unmounting a filesystem changed the filesystem as seen by all processes: there was one global mount namespace, seen by all processes, and if any change was made (e.g. using the mount command) all processes would immediately see that change irrespective of their relationship to the mount command.
With per-process mount namespaces, a child process can now have a different mount namespace from its parent. The question now arises:
Should changes to the mount namespace made by the child propagate back to the parent?
Clearly, this functionality must at least be supported and, indeed, must probably be the default. Otherwise, launching the mount command itself would effect no change (since the filesystem as seen by the parent shell would be unaffected).
Equally clearly, it must also be possible for this necessary propagation to be suppressed, otherwise we can never create a child process whose mount namespace differs from its parent, and we have one global mount namespace again (the filesystem as seen by init).
Thus, we must decide when creating a child process with clone() whether the child process gets its own copy of the data about mounted filesystems from the parent, which it can change without affecting the parent, or gets a pointer to the same data structures as the parent, which it can change (necessary for changes to propagate back, as when you launch mount from the shell).
If the CLONE_NEWNS flag is passed to clone(), the child gets a copy of its parent's mounted filesystem data, which it can change without affecting the parent's mount namespace. Otherwise, it gets a pointer to the parent's mount data structures, where changes made by the child will be seen by the parent (so the mount command itself can work).
Now if I use clone with CLONE_NEWNS to create a child process, does this mean that child will get an exact copy of the mount points in the tree (5 & 6) and still be able to access the rest of the original tree ?
Yes. It sees the exact same tree as its parent after the call to clone().
Does it also mean that the child could mount 5 & 6 at will, without affecting what's mounted at 5 or 6 in its parent process's mount namespace?
Yes. Since you've used CLONE_NEWNS, the child can unmount one device from 5 and mount another device there, and only it (and its children) could see the changes. No other process can see the changes made by the child in this case.
If yes, does it also mean that the child could mount / unmount a different directory than 5 or 6 and affect what's visible to the parent process?
No. If you've used CLONE_NEWNS, the changes made in the child cannot propagate back to the parent.
If you haven't used CLONE_NEWNS, the child would have received a pointer to the same mount namespace data as its parent, and any changes made by the child would be seen by any process that shares those data structures, including the parent. (This is also the case when the new child is created using fork().)
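One way to observe the fork() case described above is to compare the mount namespace identifiers that Linux exposes under /proc/<pid>/ns/mnt. The Python sketch below is only an illustration: the function names are made up, and it assumes a Linux system with /proc mounted. It forks a child and checks that the child reports the same namespace as the parent.

```python
import os


def mount_namespace_id(pid="self"):
    """Return an identifier such as 'mnt:[4026531840]' for a process's mount namespace."""
    return os.readlink(f"/proc/{pid}/ns/mnt")


def child_mount_namespace_id():
    """Fork a child and return the mount namespace identifier it reports."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report our namespace through the pipe, then exit immediately.
        status = 1
        try:
            os.close(read_fd)
            os.write(write_fd, mount_namespace_id().encode())
            status = 0
        finally:
            os._exit(status)
    # Parent: read what the child wrote and reap it.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        child_id = reader.read()
    os.waitpid(pid, 0)
    return child_id


if __name__ == "__main__":
    print("parent:", mount_namespace_id())
    print("child: ", child_mount_namespace_id())
```

With a plain fork() the two identifiers should match. A child created with CLONE_NEWNS (for example via the unshare tool, or os.unshare(os.CLONE_NEWNS) in Python 3.12 and later, both of which normally require privileges) would be expected to report a different identifier.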
I don't have enough reputation points to add a comment, so I am adding this as an answer instead.
It's just an add-on to Emmet's answer.
AFAICU, if a process is created with the CLONE_NEWNS flag set and it is running in a user namespace other than the initial one, it can only mount filesystems that have the FS_USERNS_MOUNT flag set. Almost all disk-based filesystems do not set this flag (for security reasons).
In do_new_mount, there is this check:
if (user_ns != &init_user_ns) {
        if (!(type->fs_flags & FS_USERNS_MOUNT)) {
                put_filesystem(type);
                return -EPERM;
        }
        /* rest of the block omitted in this excerpt */
}
Please correct me if I am wrong

How to check the state of Linux threads?

How can I check the state of a Linux thread from code, not with tools? I want to know if a thread is running, blocked on a lock, or asleep for some other reason. I know the Linux tool "top" can do this, but how do I implement it in my own code? Thanks.
I think you should study the /proc filesystem in detail; it is also documented here, inside the kernel source tree.
It is the way the Linux kernel exposes information to user space!
There is also a libproc (used by ps and top), which reads the /proc/ pseudo-files.
See this question, related to yours.
Reading files under /proc/ doesn't do any disk I/O (because /proc/ is a pseudo-filesystem), so it is fast.
Let's say your process ID is 100.
Go to the /proc/100/task directory; there you will see one directory per thread.
Inside each subdirectory, e.g. /proc/100/task/10100, there is a file named status.
The line beginning with "State:" near the top of this file gives the state of the thread (its exact line number varies between kernel versions).
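The steps above can be sketched in Python roughly as follows. The function names are made up for illustration, and the sketch assumes a Linux system with /proc mounted. It looks for the "State:" line by name rather than by position.

```python
import os


def parse_state(status_text):
    """Return the value of the 'State:' line, e.g. 'S (sleeping)', or None."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split(":", 1)[1].strip()
    return None


def thread_states(pid):
    """Map each thread ID of process `pid` to its state string."""
    states = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "status")) as status_file:
                states[int(tid)] = parse_state(status_file.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited between listdir() and open()/read()
    return states


if __name__ == "__main__":
    for tid, state in sorted(thread_states(os.getpid()).items()):
        print(tid, state)
```

The state letter (R, S, D, and so on) only distinguishes running from the various kinds of sleep; it does not by itself say which lock a thread is blocked on.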
You could also find them by looking at the cgroup hierarchy of the service that your process belongs to. Cgroups have a file called "tasks", which lists all the tasks of a service.
For example:
cat /sys/fs/cgroup/systemd/system.slice/hello.service/tasks
Note: cgroups must be enabled in your Linux kernel.

Best POSIX way to determine if a filesystem is mounted read only

If I have a POSIX system like Linux or Mac OS X, what's the best and most portable way to determine if a path is on a read-only filesystem? I can think of 4 ways off the top of my head:
open(2) a file with O_WRONLY - You would need to come up with a unique filename and also pass in O_CREAT and O_EXCL. If it fails and you have an errno of EROFS then you know it's a read-only filesystem. This would have the annoying side effect of actually creating a file you didn't care about, but you could unlink(2) it immediately after creating it.
statvfs(3) - One of the fields of the returned struct statvfs is f_flag, and one of the flags is ST_RDONLY for a read-only filesystem. However, the spec for statvfs(3) makes it clear that applications cannot depend on any of the fields containing valid information. It would seem there's a decent possibility ST_RDONLY might not be set for a read-only filesystem.
access(2) - If you know the mount point, you can use access(2) with the W_OK flag as long as you are running as a user who would have write access to the mount point, i.e., either you are root or it was mounted with your UID as a mount parameter. You would get a return value of -1 and an errno of EROFS.
Parsing /etc/mtab or /proc/mounts - Doesn't seem portable. Mac OS X seems to have neither of these, for example. Even if the system did have /etc/mtab I'm not sure the fields are consistent between OSes or if the mount options for read-only (ro on Linux) are portable.
Are there other ways I'm missing? If you needed to know if a filesystem was mounted read-only, how would you do it?
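As an illustration of the statvfs(3) option listed above, here is a small Python sketch using os.statvfs, which wraps the same call. The function names are made up for this example, and the caveat from the question still applies: the flag is only as reliable as the platform's statvfs implementation.

```python
import os


def flag_means_read_only(f_flag):
    """Return True if the ST_RDONLY bit is set in a statvfs f_flag value."""
    return bool(f_flag & os.ST_RDONLY)


def is_read_only(path):
    """Report whether the filesystem containing `path` claims to be read-only."""
    return flag_means_read_only(os.statvfs(path).f_flag)


if __name__ == "__main__":
    for candidate in ("/", "."):
        print(candidate, "read-only" if is_read_only(candidate) else "writable")
```

Unlike the open(2) approach, this has no side effects on the filesystem being examined.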
You could also popen the command mount and examine the output looking for your file system and seeing if it held the text " (ro,".
But again, that's not necessarily portable.
My option would be to not worry about whether the file system was mounted read only at all. Just try and create your file and, if it fails, tell the user what the error was. And, of course, give them the option of saving it somewhere else.
You really have to do that sort of thing anyway since, in any scenario where there's even a small gap between testing and doing, you may find the situation changes (probably not to the extent of making an entire file system read only but, who knows, maybe there is (or will be in the future) a file system that allows this).
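A minimal Python sketch of this "just try it and report the error" approach might look like the following. The function name save_text and the wording of the messages are assumptions for illustration only.

```python
import errno
import os


def save_text(path, text):
    """Try to write `text` to `path`.

    Returns (True, None) on success or (False, reason) on failure, so the
    caller can show the reason and offer to save somewhere else.
    """
    try:
        with open(path, "w") as out:
            out.write(text)
    except OSError as exc:
        if exc.errno == errno.EROFS:
            return False, "filesystem is mounted read-only"
        # Any other failure: fall back to the system's description of the error.
        reason = os.strerror(exc.errno) if exc.errno else str(exc)
        return False, reason
    return True, None


if __name__ == "__main__":
    ok, reason = save_text("example-output.txt", "hello\n")
    print("saved" if ok else f"could not save: {reason}")
```

Because the write itself is the test, there is no gap between checking and doing in which the situation could change.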
utime(path, NULL);
If you have write permission, then that will either fail with EROFS or, if permitted, simply update the mtime on the directory, which is basically harmless.
