Umount("/proc") syscall for mount namespaces "Invalid Argument" error - linux

i'm currently trying to use different namespaces for test purposes. For this i tried to implement a MNT namespace (combined with a PID namespace) so that a program within this namespace cannot see other processes on the system.
When trying to use the umount system call like this (same goes with umount("/proc"), or with umount2 and the Force-option ):
if (umount2("/proc", 0)!= 0)
{
fprintf(stderr, "Error when unmounting /proc: %s\n",strerror(errno));
printf("\tKernel version might be incorrect\n");
exit(-1);
}
the system call execution ends with error number 22 "Invalid Argument".
This code snipped is called within a function that gets called when a child process with the namespaces is created:
pid_t child_pid = clone(child_exec, child_stack+1024*1024, Child_Flags,&args);
(the child_exec function). Flags are set as following:
int Child_Flags = CLONE_NEWIPC | CLONE_NEWUSER | CLONE_NEWUTS | CLONE_NEWNET |CLONE_NEWPID | CLONE_NEWNS |SIGCHLD ;
With the CLONE_NEWNS for a new mount namespace (http://man7.org/linux/man-pages/man7/namespaces.7.html)
Output of the program is as follows:
Testing with Isolation
Starting Container engine
In-Child-PID: 1
Error number 22
Error when unmounting /proc: Invalid argument
Can somebody point me to my error, so i can unmount the folder? Thank you in advance

You can't unmount things that were mounted in a different user namespace except by using pivot_root followed by umount to unmount /. You can overmount /proc without unmounting the old /proc.

Related

Can not get correct pid in WSL2

I am learning Linux programing.
When I trying to write a simple module to get family of a process, I find I can not get current pid of a process and its parent process. How to fix it?
Here is a part of my code.
static pid_t pid = 1;
module_param(pid, int, 0644);
static int hello_init(void) {
struct task_struct *p;
struct list_head *pp;
struct task_struct *psibling;
struct pid *kpid;
kpid = find_get_pid(pid);
p = pid_task(kpid, PIDTYPE_PID);
printk("me: %d %s\n", pid, p->comm);
if (p->parent == NULL) {
printk("No Parent\n");
}
else {
printk("Parent: %d %s\n", p->parent->pid, p->parent->comm);
}
list_for_each(pp, &p->parent->children) {
psibling = list_entry(pp, struct task_struct, sibling);
printk("sibling %d %s \n", psibling->pid, psibling->comm);
}
list_for_each(pp, &p->children) {
psibling = list_entry(pp, struct task_struct, sibling);
printk("children %d %s \n", psibling->pid, psibling->comm);
}
return 0;
}
result:
sudo insmod module.ko pid=1
dmesg
[ 6396.170631] me: 237 systemd
[ 6396.170633] Parent: 235 unshare
[ 6396.170633] sibling 237 systemd
[ 6396.170633] children 286 systemd-journal
[ 6396.170634] children 306 systemd-udevd
[ 6396.170635] children 314 systemd-network
[ 6396.170635] children 501 snapfuse
[ 6396.170636] children 508 dbus-daemon
[ 6396.170636] children 509 NetworkManager
[ 6396.170637] children 632 systemd-logind
[ 6396.170637] children 639 systemd
[ 6396.170638] children 665 rtkit-daemon
[ 6396.170638] children 671 polkitd
[ 6396.170638] children 711 udisksd
[ 6396.170639] children 761 upowerd
I'm not a Linux systems development expert, but I'll take a stab at helping based on what I see you trying.
First, you don't mention it in your question, but you are clearly running some sort of Systemd enablement. As you know, Systemd isn't normally supported on WSL. At a high level, the scripts to enable Systemd on WSL all have two essential functions:
Create a new PID namespace where Systemd is running as PID1. At the most basic level, this can be done via:
sudo -b unshare --pid --fork --mount-proc /lib/systemd/systemd --system-unit=basic.target
We can see the unshare in the list of processes returned, so that's getting called, at least.
Wait for Systemd to fully start, then enter the namespace that was created above. This is typically something like:
sudo -E nsenter --all -t $(pgrep -xo systemd) $SHELL
The actual scripts are typically a bit more complicated in order to handle multiple shells, distributions, etc. They also attempt to preserve more of the WSL environment inside the namespace in order to enable the Interop features such as running Windows .exes. But the core concept is always the same.
So, taking a guess here (again, as a non-systems-dev guy), it seems that:
kpid=find_get_pid(1) is returning the systemd process inside the namespace
pid_task(kpid, PIDTYPE_PID) is returning the "true" process information from the root namespace.
It seems to me that code must be running outside the namespace, since you see the unshare as part of it. From within the namespace, the unshare doesn't exist. You can verify this (inside the namespace) with ps -ef | grep unshare.
There are at least two possible solutions:
If it's not an issue (and from the comments, it wasn't), then just run your code from the root pid namespace. I'm assuming that your Systemd script is running via your shell startup files, so you should be able to get back to the root namespace by starting up with something like wsl ~ -e bash --noprofile --norc. This will start the shell without any of the startup scripts.
Of course, other techniques for disabling the Systemd script are probably documented by whatever script you are using.
If you do want your code to work properly from within a PID namespace, then you'll probably need to find the namespace (I'd start with the source of lsns as an example).
Then find the task struct within that namespace (probably find_task_by_pid_ns?).

Cannot open uid_map for writing from an app with cap_setuid capability set

While toying around with an example from user_namespaces(7), I've come across a strange behaviour.
What the application does
The application user-ns-ex calls clone(2) with CLONE_NEWUSER, thus creating a new process in a new user namespace. The parent process writes a map (0 1000 1) to /proc//uid_map file and tells (via a pipe) the child that it can proceed. The child process then execs bash.
I've copied the source code here.
The problem
The application opens /proc//uid_map for writing if I either set it no capabilites or all of them.
When I set only set_capuid,set_capgid and optionally cap_sys_admin the call to open(2) fails:
Set caps:
arksnote linux-namespaces # setcap 'cap_setuid,cap_setgid,cap_sys_admin=epi' ./user-ns-ex
arksnote linux-namespaces # getcap ./user-ns-ex
./user-ns-ex = cap_setgid,cap_setuid,cap_sys_admin+eip
Try to run:
kamyshev#arksnote ~/workspace/personal/linux-kernel/linux-namespaces $ ./user-ns-ex -v -U -M '0 1000 1' bash
./user-ns-ex: PID of child created by clone() is 19666
ERROR: open /proc/19666/uid_map: Permission denied
About to exec bash
And now a successfull case:
No capabilities:
arksnote linux-namespaces # setcap '=' ./user-ns-ex
arksnote linux-namespaces # getcap ./user-ns-ex
./user-ns-ex =
Runs Ok:
kamyshev#arksnote ~/workspace/personal/linux-kernel/linux-namespaces $ ./user-ns-ex -v -U -M '0 1000 1' bash
./user-ns-ex: PID of child created by clone() is 19557
About to exec bash
arksnote linux-namespaces # exit
I've been trying to find the reason in man-pages and playing with different capabilities but with no luck as of this moment. What puzzles me the most, is that the application runs with less capabilities and does not with more.
Can someone help me and clarify the issue?
The research
I have found the reason. During my reasearch I have found that uid_map file is not open because its ownership is changed to root.
Unprivileged process, no capabilities:
parent(m): capabilities: '='
parent(m): file /proc/4644/uid_map owner uid: 1000
parent(m): file /proc/4644/uid_map owner gid: 1000
Unprivileged process, capabilities are set (cap_setuid=pe):
parent(m): capabilities: '= cap_setuid+ep'
parent(m): file /proc/4644/uid_map owner uid: 0
parent(m): file /proc/4644/uid_map owner gid: 0
ERROR: open /proc/4668/uid_map: Permission denied
The following research has led me to this topic: what causes proc pid resources to become owned by root?
The rules on "dumpable" flag
This is what happens:
1) When a process is not dumpable, its /proc/<pid> inodes are given a root ownership:
// linux/base.c
struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task)
...
if (task_dumpable(task)) {
rcu_read_lock();
cred = __task_cred(task);
inode->i_uid = cred->euid;
inode->i_gid = cred->egid;
rcu_read_unlock();
}
2) The process is dumpable only when its "dumpable" attribute has a value 1 (SUID_DUMP_USER). See ptrace(2).
3) prctl(2) clears the situation further:
Normally, this flag is set to 1. However, it is reset to the
current value contained in the file /proc/sys/fs/suid_dumpable
(which by default has the value 0), in the following
circumstances:
* The process's effective user or group ID is changed.
* The process's filesystem user or group ID is changed (see
credentials(7)).
* The process executes (execve(2)) a set-user-ID or set-
group-ID program, resulting in a change of either the
effective user ID or the effective group ID.
* The process executes (execve(2)) a program that has file
capabilities (see capabilities(7)), but only if the
permitted capabilities gained exceed those already
permitted for the process.
Thus my problem arose from the last of the above rules:
int commit_creds(struct cred *new)
<...>
/* dumpability changes */
if (!uid_eq(old->euid, new->euid) ||
!gid_eq(old->egid, new->egid) ||
!uid_eq(old->fsuid, new->fsuid) ||
!gid_eq(old->fsgid, new->fsgid) ||
!cred_cap_issubset(old, new)) {
if (task->mm)
set_dumpable(task->mm, suid_dumpable);
Fixes
There are a number of ways to overcome the issue:
Globally change /proc/sys/fs/suid_dumpable:
echo 1 > /proc/sys/fs/suid_dumpable
Set "dumpable" flag just for the process:
prctl(PR_SET_DUMPABLE, 1, 0, 0, 0)

mount fails from spawned process

I want to start a process that uses a USB hard drive once it gets inserted.
Since UDEV rules specifically mentions not to run long-time processes from RUN command, I send a FIFO message to my service which then opens the relevant process.
So the flow goes like this:
UDEV > runs action process > sends FIFO message to service > service gets message > runs the process who works with the HDD (aka HDD-PROCESS).
If I run my service from shell-1 and run 'action process' (the one that UDEV runs) from shell-2 everything works (including when trying it with udev).
But in deployment, the service is spawned from init, and when it does, the mount command fails saying "No such device".
I then detached "HDD-PROCESS" with fork and setsid, but that didn't help either.
from inittab:
::respawn:/opt/spwn_frm_init
ps relevant output:
PID PPID PGID SID COMM ARGS
31112 1 31112 31112 spwn_frm_init /bin/sh /opt/spwn_frm_init
31113 31112 31112 31112 runSvc /bin/sh /app/sys/runSvc
31114 31113 31112 31112 python python /app/sys/mainSvc.py
24064 1 24064 24064 python /usr/bin/python /app/sys/hdd_proc.py sdb1
everything runs under root (ps shows that too, I omitted that to save screen space).
So in short: when I run /opt/spwn_frm_init from shell, everything works. when I kill it and let it re-spawn from inittab, it doesn't and mount fails with error above.
UPDATE:
There is no problem when trying to mount an ext3 drive, but only on the NTFS one (using ntfs-3g).
Found it!
One of the differences between spawned process and another one who runs from shell is the environment variables which usually should be a problem when all I want is to call mount.
But when I noticed the problem happens only with the NTFS drive, it suddenly occurred to me that mount might need to call ntfs-3g so it worth checking if the second is accessible in the PATH variable.
which ntfs-3g led to /usr/local/bin/ntfs-3g which was mentioned in the default shell PATH but not in the one spawned from init.
To solve it, I added this /usr/local/bin to PATH in the "HDD-PROCESS" and mount began to work :)
A better error message in mount could have saved a lot of time here...

reboot within an initrd image

I am looking for a method to restart/reset my linux system from within an init-bottom script*. At the time my script is executed the system is found under /root and I have access to a busybox.
But the "reboot" command which is part of my busybox does not work. Is there any other possibility?
My system is booted normally with an initramfs image and my script is eventually causing an update process. The new systemd which comes with debian irritates this. But with a power reset everything is fine.
I have found this:
echo b >/proc/sysrq-trigger
(it's like pressing CTRL+ALT+DEL)
If you -are- init (the PID of your process/script is 0), then starting the busybox reboot program won't work since it tries to signal init (which is not started) to reboot.
Instead, as PID 0, you should do what init would do. This is call the correct kernel API for the reboot. See Man reboot(2) for details.
Assuming you are running a c program or something, one would do:
#include <unistd.h>
#include <sys/reboot.h>
void main() { reboot(0x1234567); }
This is much better than executing the sysrq trigger which will act more like a panic restart than a clean restart.
As a final note, busybox's init actually forks a process to do the reboot for it. This is because the reboot systemcall actually also exists the program, and the system should never run without an init process (which will also panic the kernel). Hence in this case, you would do something like:
pid_t pid;
pid = vfork();
if (pid == 0) { /* child */
reboot(0x1234567);
_exit(EXIT_SUCCESS);
}
while (1); /* Parent (init) waits */

Is there a way to figure out what is using a Linux kernel module?

If I load a kernel module and list the loaded modules with lsmod, I can get the "use count" of the module (number of other modules with a reference to the module). Is there a way to figure out what is using a module, though?
The issue is that a module I am developing insists its use count is 1 and thus I cannot use rmmod to unload it, but its "by" column is empty. This means that every time I want to re-compile and re-load the module, I have to reboot the machine (or, at least, I can't figure out any other way to unload it).
Actually, there seems to be a way to list processes that claim a module/driver - however, I haven't seen it advertised (outside of Linux kernel documentation), so I'll jot down my notes here:
First of all, many thanks for #haggai_e's answer; the pointer to the functions try_module_get and try_module_put as those responsible for managing the use count (refcount) was the key that allowed me to track down the procedure.
Looking further for this online, I somehow stumbled upon the post Linux-Kernel Archive: [PATCH 1/2] tracing: Reduce overhead of module tracepoints; which finally pointed to a facility present in the kernel, known as (I guess) "tracing"; the documentation for this is in the directory Documentation/trace - Linux kernel source tree. In particular, two files explain the tracing facility, events.txt and ftrace.txt.
But, there is also a short "tracing mini-HOWTO" on a running Linux system in /sys/kernel/debug/tracing/README (see also I'm really really tired of people saying that there's no documentation…); note that in the kernel source tree, this file is actually generated by the file kernel/trace/trace.c. I've tested this on Ubuntu natty, and note that since /sys is owned by root, you have to use sudo to read this file, as in sudo cat or
sudo less /sys/kernel/debug/tracing/README
... and that goes for pretty much all other operations under /sys which will be described here.
First of all, here is a simple minimal module/driver code (which I put together from the referred resources), which simply creates a /proc/testmod-sample file node, which returns the string "This is testmod." when it is being read; this is testmod.c:
/*
https://github.com/spotify/linux/blob/master/samples/tracepoints/tracepoint-sample.c
https://www.linux.com/learn/linux-training/37985-the-kernel-newbie-corner-kernel-debugging-using-proc-qsequenceq-files-part-1
*/
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h> // for sequence files
struct proc_dir_entry *pentry_sample;
char *defaultOutput = "This is testmod.";
static int my_show(struct seq_file *m, void *v)
{
seq_printf(m, "%s\n", defaultOutput);
return 0;
}
static int my_open(struct inode *inode, struct file *file)
{
return single_open(file, my_show, NULL);
}
static const struct file_operations mark_ops = {
.owner = THIS_MODULE,
.open = my_open,
.read = seq_read,
.llseek = seq_lseek,
.release = single_release,
};
static int __init sample_init(void)
{
printk(KERN_ALERT "sample init\n");
pentry_sample = proc_create(
"testmod-sample", 0444, NULL, &mark_ops);
if (!pentry_sample)
return -EPERM;
return 0;
}
static void __exit sample_exit(void)
{
printk(KERN_ALERT "sample exit\n");
remove_proc_entry("testmod-sample", NULL);
}
module_init(sample_init);
module_exit(sample_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Mathieu Desnoyers et al.");
MODULE_DESCRIPTION("based on Tracepoint sample");
This module can be built with the following Makefile (just have it placed in the same directory as testmod.c, and then run make in that same directory):
CONFIG_MODULE_FORCE_UNLOAD=y
# for oprofile
DEBUG_INFO=y
EXTRA_CFLAGS=-g -O0
obj-m += testmod.o
# mind the tab characters needed at start here:
all:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
When this module/driver is built, the output is a kernel object file, testmod.ko.
At this point, we can prepare the event tracing related to try_module_get and try_module_put; those are in /sys/kernel/debug/tracing/events/module:
$ sudo ls /sys/kernel/debug/tracing/events/module
enable filter module_free module_get module_load module_put module_request
Note that on my system, tracing is by default enabled:
$ sudo cat /sys/kernel/debug/tracing/tracing_enabled
1
... however, the module tracing (specifically) is not:
$ sudo cat /sys/kernel/debug/tracing/events/module/enable
0
Now, we should first make a filter, that will react on the module_get, module_put etc events, but only for the testmod module. To do that, we should first check the format of the event:
$ sudo cat /sys/kernel/debug/tracing/events/module/module_put/format
name: module_put
ID: 312
format:
...
field:__data_loc char[] name; offset:20; size:4; signed:1;
print fmt: "%s call_site=%pf refcnt=%d", __get_str(name), (void *)REC->ip, REC->refcnt
Here we can see that there is a field called name, which holds the driver name, which we can filter against. To create a filter, we simply echo the filter string into the corresponding file:
sudo bash -c "echo name == testmod > /sys/kernel/debug/tracing/events/module/filter"
Here, first note that since we have to call sudo, we have to wrap the whole echo redirection as an argument command of a sudo-ed bash. Second, note that since we wrote to the "parent" module/filter, not the specific events (which would be module/module_put/filter etc), this filter will be applied to all events listed as "children" of module directory.
Finally, we enable tracing for module:
sudo bash -c "echo 1 > /sys/kernel/debug/tracing/events/module/enable"
From this point on, we can read the trace log file; for me, reading the blocking,
"piped" version of the trace file worked - like this:
sudo cat /sys/kernel/debug/tracing/trace_pipe | tee tracelog.txt
At this point, we will not see anything in the log - so it is time to load (and utilize, and remove) the driver (in a different terminal from where trace_pipe is being read):
$ sudo insmod ./testmod.ko
$ cat /proc/testmod-sample
This is testmod.
$ sudo rmmod testmod
If we go back to the terminal where trace_pipe is being read, we should see something like:
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
insmod-21137 [001] 28038.101509: module_load: testmod
insmod-21137 [001] 28038.103904: module_put: testmod call_site=sys_init_module refcnt=2
rmmod-21354 [000] 28080.244448: module_free: testmod
That is pretty much all we will obtain for our testmod driver - the refcount changes only when the driver is loaded (insmod) or unloaded (rmmod), not when we do a read through cat. So we can simply interrupt the read from trace_pipe with CTRL+C in that terminal; and to stop the tracing altogether:
sudo bash -c "echo 0 > /sys/kernel/debug/tracing/tracing_enabled"
Here, note that most examples refer to reading the file /sys/kernel/debug/tracing/trace instead of trace_pipe as here. However, one problem is that this file is not meant to be "piped" (so you shouldn't run a tail -f on this trace file); but instead you should re-read the trace after each operation. After the first insmod, we would obtain the same output from cat-ing both trace and trace_pipe; however, after the rmmod, reading the trace file would give:
<...>-21137 [001] 28038.101509: module_load: testmod
<...>-21137 [001] 28038.103904: module_put: testmod call_site=sys_init_module refcnt=2
rmmod-21354 [000] 28080.244448: module_free: testmod
... that is: at this point, the insmod had already been exited for long, and so it doesn't exist anymore in the process list - and therefore cannot be found via the recorded process ID (PID) at the time - thus we get a blank <...> as process name. Therefore, it is better to log (via tee) a running output from trace_pipe in this case. Also, note that in order to clear/reset/erase the trace file, one simply writes a 0 to it:
sudo bash -c "echo 0 > /sys/kernel/debug/tracing/trace"
If this seems counterintuitive, note that trace is a special file, and will always report a file size of zero anyways:
$ sudo ls -la /sys/kernel/debug/tracing/trace
-rw-r--r-- 1 root root 0 2013-03-19 06:39 /sys/kernel/debug/tracing/trace
... even if it is "full".
Finally, note that if we didn't implement a filter, we would have obtained a log of all module calls on the running system - which would log any call (also background) to grep and such, as those use the binfmt_misc module:
...
tr-6232 [001] 25149.815373: module_put: binfmt_misc call_site=search_binary_handler refcnt=133194
..
grep-6231 [001] 25149.816923: module_put: binfmt_misc call_site=search_binary_handler refcnt=133196
..
cut-6233 [000] 25149.817842: module_put: binfmt_misc call_site=search_binary_handler refcnt=129669
..
sudo-6234 [001] 25150.289519: module_put: binfmt_misc call_site=search_binary_handler refcnt=133198
..
tail-6235 [000] 25150.316002: module_put: binfmt_misc call_site=search_binary_handler refcnt=129671
... which adds quite a bit of overhead (in both log data ammount, and processing time required to generate it).
While looking this up, I stumbled upon Debugging Linux Kernel by Ftrace PDF, which refers to a tool trace-cmd, which pretty much does the similar as above - but through an easier command line interface. There is also a "front-end reader" GUI for trace-cmd called KernelShark; both of these are also in Debian/Ubuntu repositories via sudo apt-get install trace-cmd kernelshark. These tools could be an alternative to the procedure described above.
Finally, I'd just note that, while the above testmod example doesn't really show use in context of multiple claims, I have used the same tracing procedure to discover that an USB module I'm coding, was repeatedly claimed by pulseaudio as soon as the USB device was plugged in - so the procedure seems to work for such use cases.
It says on the Linux Kernel Module Programming Guide that the use count of a module is controlled by the functions try_module_get and module_put. Perhaps you can find where these functions are called for your module.
More info: https://www.kernel.org/doc/htmldocs/kernel-hacking/routines-module-use-counters.html
All you get are a list of which modules depend on which other modules (the Used by column in lsmod). You can't write a program to tell why the module was loaded, if it is still needed for anything, or what might break if you unload it and everything that depends on it.
You might try lsof or fuser.
If you use rmmod WITHOUT the --force option, it will tell you what is using a module. Example:
$ lsmod | grep firewire
firewire_ohci 24695 0
firewire_core 50151 1 firewire_ohci
crc_itu_t 1717 1 firewire_core
$ sudo modprobe -r firewire-core
FATAL: Module firewire_core is in use.
$ sudo rmmod firewire_core
ERROR: Module firewire_core is in use by firewire_ohci
$ sudo modprobe -r firewire-ohci
$ sudo modprobe -r firewire-core
$ lsmod | grep firewire
$
try kgdb and set breakpoint to your module
For anyone desperate to figure out why they can't reload modules, I was able to work around this problem by
Getting the path of the currently used module using "modinfo"
rm -rfing it
Copying the new module I wanted to load to the path it was in
Typing "modprobe DRIVER_NAME.ko".

Resources