Is there a way to figure out what is using a Linux kernel module? - linux

If I load a kernel module and list the loaded modules with lsmod, I can get the "use count" of the module (number of other modules with a reference to the module). Is there a way to figure out what is using a module, though?
The issue is that a module I am developing insists its use count is 1 and thus I cannot use rmmod to unload it, but its "by" column is empty. This means that every time I want to re-compile and re-load the module, I have to reboot the machine (or, at least, I can't figure out any other way to unload it).

Actually, there seems to be a way to list processes that claim a module/driver - however, I haven't seen it advertised (outside of Linux kernel documentation), so I'll jot down my notes here:
First of all, many thanks for #haggai_e's answer; the pointer to the functions try_module_get and try_module_put as those responsible for managing the use count (refcount) was the key that allowed me to track down the procedure.
Looking further for this online, I somehow stumbled upon the post Linux-Kernel Archive: [PATCH 1/2] tracing: Reduce overhead of module tracepoints; which finally pointed to a facility present in the kernel, known as (I guess) "tracing"; the documentation for this is in the directory Documentation/trace - Linux kernel source tree. In particular, two files explain the tracing facility, events.txt and ftrace.txt.
But, there is also a short "tracing mini-HOWTO" on a running Linux system in /sys/kernel/debug/tracing/README (see also I'm really really tired of people saying that there's no documentation…); note that in the kernel source tree, this file is actually generated by the file kernel/trace/trace.c. I've tested this on Ubuntu natty, and note that since /sys is owned by root, you have to use sudo to read this file, as in sudo cat or
sudo less /sys/kernel/debug/tracing/README
... and that goes for pretty much all other operations under /sys which will be described here.
First of all, here is a simple minimal module/driver code (which I put together from the referred resources), which simply creates a /proc/testmod-sample file node, which returns the string "This is testmod." when it is being read; this is testmod.c:
/*
https://github.com/spotify/linux/blob/master/samples/tracepoints/tracepoint-sample.c
https://www.linux.com/learn/linux-training/37985-the-kernel-newbie-corner-kernel-debugging-using-proc-qsequenceq-files-part-1
*/
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h> // for sequence files
struct proc_dir_entry *pentry_sample;
char *defaultOutput = "This is testmod.";
static int my_show(struct seq_file *m, void *v)
{
seq_printf(m, "%s\n", defaultOutput);
return 0;
}
static int my_open(struct inode *inode, struct file *file)
{
return single_open(file, my_show, NULL);
}
static const struct file_operations mark_ops = {
.owner = THIS_MODULE,
.open = my_open,
.read = seq_read,
.llseek = seq_lseek,
.release = single_release,
};
static int __init sample_init(void)
{
printk(KERN_ALERT "sample init\n");
pentry_sample = proc_create(
"testmod-sample", 0444, NULL, &mark_ops);
if (!pentry_sample)
return -EPERM;
return 0;
}
static void __exit sample_exit(void)
{
printk(KERN_ALERT "sample exit\n");
remove_proc_entry("testmod-sample", NULL);
}
module_init(sample_init);
module_exit(sample_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Mathieu Desnoyers et al.");
MODULE_DESCRIPTION("based on Tracepoint sample");
This module can be built with the following Makefile (just have it placed in the same directory as testmod.c, and then run make in that same directory):
CONFIG_MODULE_FORCE_UNLOAD=y
# for oprofile
DEBUG_INFO=y
EXTRA_CFLAGS=-g -O0
obj-m += testmod.o
# mind the tab characters needed at start here:
all:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
When this module/driver is built, the output is a kernel object file, testmod.ko.
At this point, we can prepare the event tracing related to try_module_get and try_module_put; those are in /sys/kernel/debug/tracing/events/module:
$ sudo ls /sys/kernel/debug/tracing/events/module
enable filter module_free module_get module_load module_put module_request
Note that on my system, tracing is by default enabled:
$ sudo cat /sys/kernel/debug/tracing/tracing_enabled
1
... however, the module tracing (specifically) is not:
$ sudo cat /sys/kernel/debug/tracing/events/module/enable
0
Now, we should first make a filter, that will react on the module_get, module_put etc events, but only for the testmod module. To do that, we should first check the format of the event:
$ sudo cat /sys/kernel/debug/tracing/events/module/module_put/format
name: module_put
ID: 312
format:
...
field:__data_loc char[] name; offset:20; size:4; signed:1;
print fmt: "%s call_site=%pf refcnt=%d", __get_str(name), (void *)REC->ip, REC->refcnt
Here we can see that there is a field called name, which holds the driver name, which we can filter against. To create a filter, we simply echo the filter string into the corresponding file:
sudo bash -c "echo name == testmod > /sys/kernel/debug/tracing/events/module/filter"
Here, first note that since we have to call sudo, we have to wrap the whole echo redirection as an argument command of a sudo-ed bash. Second, note that since we wrote to the "parent" module/filter, not the specific events (which would be module/module_put/filter etc), this filter will be applied to all events listed as "children" of module directory.
Finally, we enable tracing for module:
sudo bash -c "echo 1 > /sys/kernel/debug/tracing/events/module/enable"
From this point on, we can read the trace log file; for me, reading the blocking,
"piped" version of the trace file worked - like this:
sudo cat /sys/kernel/debug/tracing/trace_pipe | tee tracelog.txt
At this point, we will not see anything in the log - so it is time to load (and utilize, and remove) the driver (in a different terminal from where trace_pipe is being read):
$ sudo insmod ./testmod.ko
$ cat /proc/testmod-sample
This is testmod.
$ sudo rmmod testmod
If we go back to the terminal where trace_pipe is being read, we should see something like:
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
insmod-21137 [001] 28038.101509: module_load: testmod
insmod-21137 [001] 28038.103904: module_put: testmod call_site=sys_init_module refcnt=2
rmmod-21354 [000] 28080.244448: module_free: testmod
That is pretty much all we will obtain for our testmod driver - the refcount changes only when the driver is loaded (insmod) or unloaded (rmmod), not when we do a read through cat. So we can simply interrupt the read from trace_pipe with CTRL+C in that terminal; and to stop the tracing altogether:
sudo bash -c "echo 0 > /sys/kernel/debug/tracing/tracing_enabled"
Here, note that most examples refer to reading the file /sys/kernel/debug/tracing/trace instead of trace_pipe as here. However, one problem is that this file is not meant to be "piped" (so you shouldn't run a tail -f on this trace file); but instead you should re-read the trace after each operation. After the first insmod, we would obtain the same output from cat-ing both trace and trace_pipe; however, after the rmmod, reading the trace file would give:
<...>-21137 [001] 28038.101509: module_load: testmod
<...>-21137 [001] 28038.103904: module_put: testmod call_site=sys_init_module refcnt=2
rmmod-21354 [000] 28080.244448: module_free: testmod
... that is: at this point, the insmod had already been exited for long, and so it doesn't exist anymore in the process list - and therefore cannot be found via the recorded process ID (PID) at the time - thus we get a blank <...> as process name. Therefore, it is better to log (via tee) a running output from trace_pipe in this case. Also, note that in order to clear/reset/erase the trace file, one simply writes a 0 to it:
sudo bash -c "echo 0 > /sys/kernel/debug/tracing/trace"
If this seems counterintuitive, note that trace is a special file, and will always report a file size of zero anyways:
$ sudo ls -la /sys/kernel/debug/tracing/trace
-rw-r--r-- 1 root root 0 2013-03-19 06:39 /sys/kernel/debug/tracing/trace
... even if it is "full".
Finally, note that if we didn't implement a filter, we would have obtained a log of all module calls on the running system - which would log any call (also background) to grep and such, as those use the binfmt_misc module:
...
tr-6232 [001] 25149.815373: module_put: binfmt_misc call_site=search_binary_handler refcnt=133194
..
grep-6231 [001] 25149.816923: module_put: binfmt_misc call_site=search_binary_handler refcnt=133196
..
cut-6233 [000] 25149.817842: module_put: binfmt_misc call_site=search_binary_handler refcnt=129669
..
sudo-6234 [001] 25150.289519: module_put: binfmt_misc call_site=search_binary_handler refcnt=133198
..
tail-6235 [000] 25150.316002: module_put: binfmt_misc call_site=search_binary_handler refcnt=129671
... which adds quite a bit of overhead (in both log data ammount, and processing time required to generate it).
While looking this up, I stumbled upon Debugging Linux Kernel by Ftrace PDF, which refers to a tool trace-cmd, which pretty much does the similar as above - but through an easier command line interface. There is also a "front-end reader" GUI for trace-cmd called KernelShark; both of these are also in Debian/Ubuntu repositories via sudo apt-get install trace-cmd kernelshark. These tools could be an alternative to the procedure described above.
Finally, I'd just note that, while the above testmod example doesn't really show use in context of multiple claims, I have used the same tracing procedure to discover that an USB module I'm coding, was repeatedly claimed by pulseaudio as soon as the USB device was plugged in - so the procedure seems to work for such use cases.

It says on the Linux Kernel Module Programming Guide that the use count of a module is controlled by the functions try_module_get and module_put. Perhaps you can find where these functions are called for your module.
More info: https://www.kernel.org/doc/htmldocs/kernel-hacking/routines-module-use-counters.html

All you get are a list of which modules depend on which other modules (the Used by column in lsmod). You can't write a program to tell why the module was loaded, if it is still needed for anything, or what might break if you unload it and everything that depends on it.

You might try lsof or fuser.

If you use rmmod WITHOUT the --force option, it will tell you what is using a module. Example:
$ lsmod | grep firewire
firewire_ohci 24695 0
firewire_core 50151 1 firewire_ohci
crc_itu_t 1717 1 firewire_core
$ sudo modprobe -r firewire-core
FATAL: Module firewire_core is in use.
$ sudo rmmod firewire_core
ERROR: Module firewire_core is in use by firewire_ohci
$ sudo modprobe -r firewire-ohci
$ sudo modprobe -r firewire-core
$ lsmod | grep firewire
$

try kgdb and set breakpoint to your module

For anyone desperate to figure out why they can't reload modules, I was able to work around this problem by
Getting the path of the currently used module using "modinfo"
rm -rfing it
Copying the new module I wanted to load to the path it was in
Typing "modprobe DRIVER_NAME.ko".

Related

how to know what instances are using a specific loadable kernel module

lsmod can show modules currently loaded and how many instances are using them. Just like the infomation shown below.
user#centos7:~$ lsmod | grep fuse
Module Size Used by
fuse 85681 1
I want to know exactly which instance is using the module because I want to unload the module by killing the process.
I have tried enabling CONFIG_MODULE_FORCE_UNLOAD option when compiling the kernel and typing rmmod -f fuse, but the system just freezes.
Thanks for #tsyvarev 's comment. There's no filesystem using it but when I type mount -l it shows
fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
so I typed sudo umount /sys/fs/fuse/connections. After that I can unloaded it successfully. The problem I encountered seems much easier than the link you posted.

How to use -V option in perf probe?

I'm trying to print a value of variable in specific function using perf. So I tried perf probe with -V option, but I got error message like this.
# perf probe -V tcp_sendmsg
Failed to find the path for the kernel: Invalid ELF file
Error: Failed to show vars.
So I downloaded kernel symbol&source package and I checked that the value CONFIG_DEBUG_INFO in /boot/config-5.3.0-46-generic'set to 1. But still the same error occurs.
How should I fix this problem?
Ubuntu 18.04 LTS
release: 5.3.0-46-generic
version: 38~18.04.01
To make perf probe -V work, you will have to pass a vmlinux file that has debuginfo, which you seem to have done as per your comment -
sudo perf probe --vmlinux /usr/lib/debug/boot/vmlinux-4.15.0-1077-azure
-V tcp_sendmsg
Available variables at tcp_sendmsg
#<tcp_sendmsg+0>
size_t size
struct msghdr* msg
struct sock* sk
Now as per the man-page, perf probe -L means
-L, --line=
Show source code lines which can be probed.
This needs an argument which specifies a range of the source code.
(see LINE SYNTAX for detail)
which means you need to specify the location of the kernel source code, when you run perf probe -L, otherwise the elfutils tool, that analyzes the vmlinux file with debuginfo would not be able to determine the right path for the kernel sources.
You can download the kernel sources as mentioned here
And then run perf probe where you specify the --source switch,
sudo perf probe --source /usr/src/linux-source-4.4.0 --vmlinux /usr/lib/debug/boot/vmlinux-4.15.0-1077-azure -L tcp_sendmsg
<tcp_sendmsg#/usr/src/linux-source-4.4.0//net/ipv4/tcp.c:0>
0 /* Clear memory counter. */
1 tp->ucopy.memory = 0;
}
4 static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
5 {
6 struct sk_buff *skb;
u32 offset;
9 while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
offset = seq - TCP_SKB_CB(skb)->seq;
if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
offset--;

Does GNU time memory output account for child processes too?

When running GNU time (/usr/bin/time) and checking for memory consumption, does its output account for the memory usage of the child processes of my target program?
Could not find anything in GNU's time manpage.
Yes.
You can easily check with:
$ /usr/bin/time -f '%M' sh -c 'perl -e "\$y=q{x}x(2*1024*1024)" & wait'
8132
$ /usr/bin/time -f '%M' sh -c 'perl -e "\$y=q{x}x(8*1024*1024)" & wait'
20648
GNU time is using the wait4 system call on Linux (via the wait3 glibc wrapper), and while undocumented, the resource usage it returns in the struct rusage also includes the descendands of the process waited for. You can look at the kernel implementation of wait4 in kernel/exit.c for all the details:
$ grep -C2 RUSAGE_BOTH include/uapi/linux/resource.h
#define RUSAGE_SELF 0
#define RUSAGE_CHILDREN (-1)
#define RUSAGE_BOTH (-2) /* sys_wait4() uses this */
#define RUSAGE_THREAD 1 /* only the calling thread */
FreeBSD and NetBSD also have a wait6 system call which returns separate info for the process waited for and for its descendants. They also clearly document that the rusage returned by wait3 and wait4 also includes grandchildren.

Raw capture capabilities (CAP_NET_RAW, CAP_NET_ADMIN) not working outside /usr/bin and friends for packet capture program using libpcap

TL;DR: Why are cap_net_raw, cap_net_admin capabilities only working in /usr/bin (or /usr/sbin), but not other places? Can this be configured someplace?
I'm having problems assigning capabilities to my C program utilizing libpcap in Ubuntu 14.04. Even after assigning capabilities using setcap(8) and checking it using getcap(8), I still get a permission error. It seems capabilities only work for executables in \usr\bin and friends.
My program test.c looks as follows:
#include <stdio.h>
#include <pcap.h>
int main(int argc, char **argv) {
if (argc != 2) {
printf("Specify interface \n");
return -1;
}
char errbuf[PCAP_ERRBUF_SIZE];
struct pcap* pcap = pcap_open_live(argv[1], BUFSIZ, 1, 0, errbuf);
if (pcap == NULL) {
printf("%s\n", errbuf);
return -1;
}
return 0;
}
and is compiled with
gcc test.c -lpcap
generating a.out executable. I set capabilities:
sudo setcap cap_net_raw,cap_net_admin=eip ./a.out
And check to see that it looks right:
getcap a.out
which gives me
a.out = cap_net_admin,cap_net_raw+eip
Running a.out gives me:
./a.out eth0
eth0: You don't have permission to capture on that device (socket: Operation not permitted)
Running with sudo works as expected (program prints nothing and exits).
Here's the interesting part: If I move a.out to /usr/bin (and reapply the capabilities), it works. Vice versa: taking the capability-enabled /usr/bin/dumpcap from wireshark (which works fine for users in the wireshark group) and moving it out of /usr/bin, say to my home dir, reapplying the same capabilities, it doesn't work. Moving it back, it works.
SO: Why are these capabilities only working in /usr/bin (and /usr/sbin), but not other places? Can this be configured someplace?
This might be because your home directory is mounted with nosuid, which seems to prevent capabilities working. Ubuntu encrypts the home directory, and mounts that with ecryptfs as nosuid.
Binaries with capabilities work for me in /usr/, and /home/, but not my home directory.
The only reference I could find to nosuid defeating capabilities is this link: http://www.gossamer-threads.com/lists/linux/kernel/1860853#1860853. I would love to find an authoritative source.

Busybox init does not start /etc/init.d/rcS

I'm trying to build embedded system using buildroot. Everything seems to work. All modules are starting, the system is stable. The problem is that /etc/init.d/rcS does not start during initialization of the system. If I run it manually everything is OK. I have it in my inittab file.
# /etc/inittab
#
# Copyright (C) 2001 Erik Andersen <andersen#codepoet.org>
#
# Note: BusyBox init doesn't support runlevels. The runlevels field is
# completely ignored by BusyBox init. If you want runlevels, use
# sysvinit.
#
# Format for each entry: <id>:<runlevels>:<action>:<process>
#
# id == tty to run on, or empty for /dev/console
# runlevels == ignored
# action == one of sysinit, respawn, askfirst, wait, and once
# process == program to run
# Startup the system
null::sysinit:/bin/mount -t proc proc /proc
null::sysinit:/bin/mount -o remount,rw /
null::sysinit:/bin/mkdir -p /dev/pts
null::sysinit:/bin/mkdir -p /dev/shm
null::sysinit:/bin/mount -a
null::sysinit:/bin/hostname -F /etc/hostname
# now run any rc scripts
::sysinit:/etc/init.d/rcS
# Put a getty on the serial port
ttyFIQ0::respawn:/sbin/getty -L -n ttyFIQ0 115200 vt100 # GENERIC_SERIAL
# Stuff to do for the 3-finger salute
::ctrlaltdel:/sbin/reboot
# Stuff to do before rebooting
null::shutdown:/etc/init.d/rcK
null::shutdown:/bin/umount -a -r
null::shutdown:/sbin/swapoff -a
Any idea what could be wrong?
/bin/init needs to be on your filesystem.
/bin/sh needs to be on your filesystem.
/etc/init.d/rcS needs to be executable and have #!/bin/sh as its first line.
Init
Are you sure you where invoking Busybox init? What was the kernel command line? If no init= option was supplied to the the kernel, the kernel will look for an executable at /init.
For instance, if your busybox binary resides in /bin/busybox, you need to create the following symlink :
ln -s /bin/busybox /init
If you want your init to reside in /sbin, to comply with the inittab, also create a symlink there. Note that the kernel will not respect init= setting if you don't mount root and your busybox only runs in an initramfs.
ln -s /bin/busybox /sbin/init
Inittab
Also, you could try not using an inittab. The things you try to run from inittab, might very well fit in rcS and any descendant scripts. From the same source you found your example inittab:
# Note: BusyBox init works just fine without an inittab. If no inittab is
# found, it has the following default behavior:
# ::sysinit:/etc/init.d/rcS
# ::askfirst:/bin/sh
# ::ctrlaltdel:/sbin/reboot
# ::shutdown:/sbin/swapoff -a
# ::shutdown:/bin/umount -a -r
# ::restart:/sbin/init
# tty2::askfirst:/bin/sh
# tty3::askfirst:/bin/sh
# tty4::askfirst:/bin/sh
rcS
Make sure /etc/init.d/rcS is executable:
chmod +x chroot chroot /bin/busybox
And try with:
#!/bin/busybox sh
echo "Hello world!"
Please note that this sentence can get buried between kernel log messages, so you might want to pass the quiet kernel command line option to see if it appears.
Busybox symlinks
Are the symlinks installed into the file system or not? If not it is not a disaster. Make sure that /etc/init.d/rcS starts with:
#!/bin/busybox sh
mkdir -pv /sbin
/bin/busybox --install -s
In addition to the scripts themselves being executable and having a correct shebang line, the kernel also needs to be compiled with the CONFIG_BINFMT_SCRIPT option enabled.
CONFIG_BINFMT_SCRIPT:
Say Y here if you want to execute interpreted scripts starting with
#! followed by the path to an interpreter.
You can build this support as a module; however, until that module
gets loaded, you cannot run scripts. Thus, if you want to load this
module from an initramfs, the portion of the initramfs before loading
this module must consist of compiled binaries only.
Most systems will not boot if you say M or N here. If unsure, say Y.
Without this option, you may receive the error message can't run '/etc/init.d/rcS': Exec format error.
From the information given, everything looks correct.
Some things to try:
Check ownership of your rcS script.
Comment out everything from rcS, and add something very simple:
echo "This worked" > /tmp/test
There might be something in your script related to a startup race condition that is causing it to exit. Also curious if your script is starting syslogd.

Resources