How to simulate INotify failure in functional test? - linux

I have a Linux application that uses inotify to track filesystem changes, and I want to write a functional test suite for it that exercises the application from the end-user perspective. As part of it I'd like to test situations where the filesystem fails, and in particular I want to test inotify failure.
Specifically, I'd like to make the inotify_init(), inotify_add_watch() and inotify_rm_watch() calls, as well as read() on the inotify file descriptors, return an error when a test requires it.
The problem is that I can't find a way to simulate inotify failure. I wonder if somebody has already encountered such a problem and knows a solution.

If you want to avoid any mocking whatsoever, your best bet is to provoke errors by directly hitting OS limits. For example, inotify_init() can fail with EMFILE if the calling process has reached its limit on the number of open file descriptors. To reach such conditions with 100% precision you can use two tricks:
Dynamically manipulate the limits of the running process by changing values in procfs (or with setrlimit/prlimit).
Assign your app's process to a dedicated cgroup and "suspend" it by giving it ~0% CPU time via the cgroups API (this is how Android throttles background apps and implements its energy-saving "Doze" mode).
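From inside the test process, the first trick can be sketched with Python's resource module rather than procfs (an illustrative sketch, not from the answer above; it assumes a Unix system with /dev/null, and uses os.open() as a stand-in for any descriptor-allocating call: once the soft RLIMIT_NOFILE is capped at the next free descriptor number, inotify_init() would fail with EMFILE in exactly the same way):

```python
import errno
import os
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
probe = os.open("/dev/null", os.O_RDONLY)   # find the next free fd number
os.close(probe)
resource.setrlimit(resource.RLIMIT_NOFILE, (probe, hard))  # cap at that fd

try:
    os.open("/dev/null", os.O_RDONLY)       # inotify_init() would fail identically
    got_emfile = False
except OSError as e:
    got_emfile = (e.errno == errno.EMFILE)
finally:
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))  # restore limits

print(got_emfile)  # True
```

The same capping can be done on a child process before exec, so the application under test needs no changes at all.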
All possible error conditions of inotify are documented in the man pages for inotify(7), inotify_init(2) and inotify_add_watch(2) (I don't think inotify_rm_watch() can fail except through outright programming errors in your code).
Aside from ordinary errors (such as going over /proc/sys/fs/inotify/max_user_watches), inotify has several fault modes (queue space exhaustion, watch ID reuse), but those aren't "failures" in the strict sense of the word.
Queue exhaustion happens when someone performs filesystem changes faster than you can react. It is easy to reproduce: use cgroups to pause your program while it has an inotify descriptor open (so the event queue isn't drained) and rapidly generate lots of notifications by modifying the observed files/directories. Once /proc/sys/fs/inotify/max_queued_events unhandled events have accumulated, unpause your program: it will receive IN_Q_OVERFLOW (and will potentially miss some events that didn't fit into the queue).
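A single-process reproduction of this overflow, as a hedged Python sketch (not from the answer: instead of pausing a separate program via cgroups, it simply never reads from the inotify descriptor while flooding a watched temporary directory with distinct IN_CREATE events; Linux-only, raw syscalls via ctypes):

```python
import ctypes
import ctypes.util
import os
import struct
import tempfile

IN_CREATE = 0x00000100
IN_Q_OVERFLOW = 0x00004000

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

watched = tempfile.mkdtemp()                 # typically on tmpfs-backed /tmp
fd = libc.inotify_init()
libc.inotify_add_watch(fd, watched.encode(), IN_CREATE)

# Not reading from fd plays the role of the cgroup-paused program:
# the event queue fills up and the kernel queues one IN_Q_OVERFLOW.
limit = int(open("/proc/sys/fs/inotify/max_queued_events").read())
for i in range(limit + 10):
    open(os.path.join(watched, str(i)), "w").close()

overflowed = False
while not overflowed:                        # drain the queue
    buf = os.read(fd, 65536)
    pos = 0
    while pos < len(buf):
        wd, mask, cookie, name_len = struct.unpack_from("iIII", buf, pos)
        if mask & IN_Q_OVERFLOW:
            overflowed = True
        pos += 16 + name_len                 # fixed header + padded name
os.close(fd)
print(overflowed)  # True
```

Distinct filenames matter here: identical consecutive events are coalesced by the kernel, so repeatedly modifying one file would not fill the queue.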
Watch ID reuse is tedious to reproduce, because modern kernels switched from file-descriptor-like behavior to PID-like behavior for watch IDs. You should use the same approach as when testing for PID reuse: create and destroy lots of inotify watches until the integer watch ID wraps around.
Inotify also has a couple of tricky corner cases that rarely occur during normal operation (for example, all Java bindings I know of, including Android's and OpenJDK's, do not handle all of them correctly): the same-inode problem and the handling of IN_UNMOUNT.
The same-inode problem is well explained in the inotify documentation:
A successful call to inotify_add_watch() returns a unique watch descriptor for this inotify instance, for the filesystem object (inode) that corresponds to pathname. If the filesystem object was not previously being watched by this inotify instance, then the watch descriptor is newly allocated. If the filesystem object was already being watched (perhaps via a different link to the same object), then the descriptor for the existing watch is returned.
In plain words: if you watch two hard links to the same file, their numeric watch IDs will be the same. This behavior can easily make you lose track of the second inotify watch if you store watches in something like a hashmap keyed by integer watch IDs.
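The collision is easy to demonstrate with two hard links (an illustrative Python sketch using ctypes for the raw inotify calls, Linux-only; the wd-keyed dict at the end stands in for the kind of hashmap described above):

```python
import ctypes
import ctypes.util
import os
import tempfile

IN_MODIFY = 0x00000002

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

d = tempfile.mkdtemp()
first = os.path.join(d, "first")
second = os.path.join(d, "second")
open(first, "w").close()
os.link(first, second)        # hard link: same inode, different pathname

fd = libc.inotify_init()
wd_first = libc.inotify_add_watch(fd, first.encode(), IN_MODIFY)
wd_second = libc.inotify_add_watch(fd, second.encode(), IN_MODIFY)

# Same inode, so the kernel hands back the SAME watch descriptor, and a
# wd-keyed hashmap silently collapses two "watches" into one entry.
watches = {wd_first: first, wd_second: second}
print(wd_first == wd_second, len(watches))  # True 1
```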
The second issue is even harder to observe, and thus rarely properly supported, despite not even being an error mode: unmounting a partition that is currently observed via inotify. The tricky part is that a Linux filesystem cannot be unmounted while you have file descriptors open on it, but observing a file via inotify does not prevent the unmount. If your app observes files on a separate filesystem and the user unmounts that filesystem, you have to be prepared to handle the resulting IN_UNMOUNT event.
All of the tests above should be possible to perform on a tmpfs filesystem.

After a bit of thinking, I came up with another solution. You can use the Linux "seccomp" facility to "mock" the results of individual inotify-related system calls. The upsides of this approach are that it is simple, robust, and completely non-intrusive: you can conditionally adjust the behavior of specific syscalls while still getting the original OS behavior in all other cases. Technically this still counts as mocking, but the mocking layer is placed very deep, between the kernel code and the userspace syscall interface.
You don't need to modify the program's code; just write a wrapper that installs a fitting seccomp filter before exec-ing your app (the code below uses libseccomp):
// pass control to the kernel syscall code by default
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
if (!ctx) exit(1);
// make the chosen system call return an `EMFILE` error instead
if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EMFILE), __NR_inotify_init, 0) < 0) exit(1);
// load the filter into the kernel; it survives the subsequent execve()
if (seccomp_load(ctx) < 0) exit(1);
execve(...
Seccomp is essentially a limited interpreter running an extended version of BPF bytecode, so its capabilities are quite extensive. libseccomp lets you install simple conditional filters (for example, comparing the integer arguments of a system call against constant values). Note, however, that seccomp filters cannot dereference pointer arguments, so more ambitious conditions (such as comparing the file path passed to inotify_add_watch against a predefined value) cannot be expressed in the filter itself; for those you would need a supervisor based on something like ptrace or seccomp's user-space notification mechanism (SECCOMP_RET_USER_NOTIF).
Writing syscall filters can be tedious, and the behavior of a program under seccomp does not actually exercise the kernel implementation (seccomp filters are invoked by the kernel before control is passed to the kernel's syscall handler). So you may want to combine sparing use of seccomp with the more organic approach outlined in my other answer.

Probably not as non-intrusive as you would like, but the INotify class from inotify_simple is small. You could completely wrap it, delegate all of the methods, and inject the errors.
The code would look something like this:
from inotify_simple.inotify_simple import INotify

class WrapINotify(object):
    # class-level queues of exceptions to raise on the next matching call
    init_error_list = []
    add_watch_error_list = []
    rm_watch_error_list = []
    read_error_list = []

    def raise_if_error(self, error_list):
        if not error_list:
            return
        # Simulate INotify raising an exception
        exception = error_list.pop(0)
        raise exception

    def __init__(self):
        self.raise_if_error(WrapINotify.init_error_list)
        self.inotify = INotify()

    def add_watch(self, path, mask):
        self.raise_if_error(WrapINotify.add_watch_error_list)
        return self.inotify.add_watch(path, mask)

    def rm_watch(self, wd):
        self.raise_if_error(WrapINotify.rm_watch_error_list)
        return self.inotify.rm_watch(wd)

    def read(self, timeout=None, read_delay=None):
        self.raise_if_error(WrapINotify.read_error_list)
        return self.inotify.read(timeout, read_delay)

    def close(self):
        self.inotify.close()

    def __enter__(self):
        return self.inotify.__enter__()

    def __exit__(self, exc_type, exc_value, traceback):
        self.inotify.__exit__(exc_type, exc_value, traceback)
With this code you would, somewhere else, do:
WrapINotify.add_watch_error_list.append(OSError(28, 'No space left on device'))
to inject the error. Of course, you could add more code to the wrapper class to implement different error-injection schemes.

Related

What is the difference between these methods to obtain resource usage?

In Linux, we can find out the resources used (time, page faults, page swaps, context switches) in two ways. One is the getrusage() function; the other is the command /usr/bin/time -v [command to check usage]. What is the difference between these ways of finding resource usage?
When you use a command like time(1), it must use a system call such as getrusage(2), by way of its system-library wrapper, to build a request with the right system call number and a structure indicating that it wants rusage information for the process's children.
For compatibility across UNIX/POSIX operating systems, which specific functions are chosen to build a command comes from a hierarchy of options, so as to adequately cover the OSes the command runs on. (Some OSes may not implement everything, or have various quirks.)
In time's case, it prefers to group waiting for the child and getting its usage into a single call to wait3(), which in turn is implemented as a wrapper around the even more complex wait4(), which has its own system call number.
Both wait3/wait4 and getrusage fill the same rusage structure with information, and since time only directly spawns one child process, calling wait3() as it does, or breaking this into less featured calls like wait(); getrusage(RUSAGE_CHILDREN), is in essence the same. Therefore, time effectively displays the same data as getrusage provides (together with some more general data it assembles from the system, like the real elapsed time obtained via gettimeofday).
The real differences among the system call wrapper functions are:
getrusage takes another argument (RUSAGE_SELF) allowing a process to look at its own usage so far.
wait4 can target one specific direct child and that child's descendants.
wait3 is a simplification of either wait4 or the wait(); getrusage() pair: not as versatile as either, but just good enough for the time(1) command as it is implemented. (That makes wait3 the simplest and safest option for time to use on OSes where it is available.)
To verify they are the same, one could change time to an alternate version, recompile and compare:
while ((caught = wait3 (&status, 0, NULL)) != pid)
{
if (caught == -1) {
getrusage(RUSAGE_CHILDREN, &resp->ru);
return 0;
}
}
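The equivalence is also easy to see from a higher-level language; here is an illustrative Python sketch (assuming a POSIX system; the 0.05 s busy-loop is an arbitrary amount of work) that pairs wait() with getrusage(RUSAGE_CHILDREN) just like the alternate C version above:

```python
import os
import resource
import time

pid = os.fork()
if pid == 0:
    # Child: burn ~0.05 s of CPU so the parent's RUSAGE_CHILDREN is non-zero.
    deadline = time.process_time() + 0.05
    while time.process_time() < deadline:
        pass
    os._exit(0)

os.waitpid(pid, 0)   # children's usage is folded in only once they are reaped
usage = resource.getrusage(resource.RUSAGE_CHILDREN)
cpu = usage.ru_utime + usage.ru_stime
print(cpu > 0.01)    # True: roughly the CPU time the child burned
```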

Can I override a system function before calling fork?

I'd like to be able to intercept filenames with a certain prefix from any of my child processes that I launch. This would be names like "pipe://pipe_name". I think wrapping the open() system call would be a good way to do this for my application, but I'd like to do it without having to compile a separate shared library and hooking it with the LD_PRELOAD trick (or using FUSE and having to have a mounted directory)
I'll be forking the processes myself, is there a way to redirect open() to my own function before forking and have it persist in the child after an exec()?
Edit: The thought behind this is that I want to implement multi-reader pipes by having an intermediate process tee() the data from one pipe into all the others. I'd like this to be transparent to my child processes, so that they can take a filename and open() it; if it's a pipe, I'll return the file descriptor for it, while if it's a normal file, I'll just pass it to the regular open() function. Any alternative way to do this that makes it transparent to the child processes would be interesting to hear. I'd prefer not to have to compile a separate library that has to be pre-linked, though.
I believe the answer here is no, it's not possible. To my knowledge, there are only three ways to achieve this:
1. The LD_PRELOAD trick: compile a .so that is pre-loaded to override the library call.
2. Implement a FUSE filesystem, pass a path inside it to the client program, and intercept the calls.
3. Use ptrace to intercept system calls and fix them up as needed.
Options 2 and 3 will be very slow, since they have to intercept every I/O call and switch back to user space to handle it; probably around 500% slower than normal. Option 1 requires building and maintaining an external shared library to link in.
For my application, where I just want to be able to pass in a path to a pipe, I realized I can open both ends in my process (via pipe()), then grab the path to the read end in /proc/<pid>/fd and pass that path to the client program, which gives me everything I need, I think.
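That pipe trick can be sketched in a few lines of Python (illustrative, Linux-only because of /proc; a real child process would be handed a /proc/<parent-pid>/fd/... path instead of /proc/self):

```python
import os

r, w = os.pipe()
# Any process allowed to see our /proc entry can open() this path and get
# a fresh descriptor for the read end of the pipe.
path = "/proc/self/fd/%d" % r

os.write(w, b"hello")
fd = os.open(path, os.O_RDONLY)   # re-open the pipe end by pathname
data = os.read(fd, 5)
print(data)  # b'hello'
```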

Is there a supported way to obtain LDT entries of debuggee?

A userspace process can call modify_ldt(2) to alter entries of its LDT. A debugger, to make correct analysis of what the process reads and where, as well as what code it executes currently, needs to know what descriptor a value of e.g. CS=0x7 selects.
Currently the only way I can think of that could possibly work is injecting some code into the debuggee to retrieve the LDT, executing it, and then restoring the original state. But this is quite error-prone and is likely to break the debugger user's workflow, e.g. when signals arrive.
So is there a better way? I've googled things like PTRACE_LDT, but the pages are from 2005, and grepping modern Linux sources doesn't find anything relevant to x86 other than comments.

How do you safely read memory in Unix (or at least Linux)?

I want to read a byte of memory but don't know if the memory is truly readable or not. You can do this under OS X with the vm_read function, and under Windows with ReadProcessMemory or with __try/__except. Under Linux I believe I can use ptrace, but only if the process is not already being debugged.
FYI the reason I want to do this is I'm writing exception handler program state-dumping code, and it helps the user a lot if the user can see what various memory values are, or for that matter know if they were invalid.
If you don't have to be fast, which is typical in code handling state dumps/exception handlers, you can put your own signal handler in place before the access attempt and restore it afterwards. Incredibly painful and slow, but it is done.
Another approach is to parse the contents of /proc/<pid>/maps to build a map of the memory once, then on each access decide whether the address is inside the process's mapped ranges or not.
If it were me, I'd try to find something that already does this and re-implement it, or copy the code directly if the licence met my needs. It's painful to write from scratch, and it's nice to have something that can produce support traces with symbol resolution.
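The maps-parsing approach, sketched in Python (illustrative, Linux-only; note that the address space can change between parsing and the actual access, so for anything other than a one-shot crash dump the map would need refreshing):

```python
import ctypes

def readable_ranges(pid="self"):
    """Parse /proc/<pid>/maps into a list of readable (start, end) ranges."""
    ranges = []
    with open("/proc/%s/maps" % pid) as f:
        for line in f:
            addrs, perms = line.split()[:2]
            if "r" in perms:
                lo, hi = (int(x, 16) for x in addrs.split("-"))
                ranges.append((lo, hi))
    return ranges

def is_readable(addr, ranges):
    return any(lo <= addr < hi for lo, hi in ranges)

ranges = readable_ranges()
buf = ctypes.create_string_buffer(b"x")  # known-readable memory
print(is_readable(ctypes.addressof(buf), ranges))  # True
print(is_readable(0, ranges))  # False: the NULL page is never mapped
```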

Automatically adjusting process priorities under Linux

I'm trying to write a program that automatically sets process priorities based on a configuration file (basically path - priority pairs).
I thought the best solution would be a kernel module that replaces the execve() system call. Too bad, the system call table isn't exported in kernel versions > 2.6.0, so it's not possible to replace system calls without really ugly hacks.
I do not want to do the following:
-Replace binaries with shell scripts that start and renice the binaries.
-Patch/recompile my stock Ubuntu kernel
-Do ugly hacks like reading kernel executable memory and guessing the syscall table location
-Polling of running processes
I really want to be:
-Able to control the priority of any process, based on its executable path and a configuration file. Rules apply to any user.
Does anyone of you have any ideas on how to complete this task?
If you've settled for a polling solution, most of the features you want to implement already exist in the Automatic Nice Daemon. You can configure nice levels for processes based on process name, user and group. It's even possible to adjust process priorities dynamically based on how much CPU time it has used so far.
Sometimes polling is a necessity, and even more optimal in the end -- believe it or not. It depends on a lot of variables.
If the polling overhead is low enough, its simplicity far outweighs the added complexity, cost, and RISK of developing your own kernel hooks to get notified of the changes you need. That said, when hooks or notification events are available, or can easily be injected, they should certainly be used if the situation calls for it.
This is classic programmer 'perfection' thinking. As engineers, we strive for perfection. This is the real world though and sometimes compromises must be made. Ironically, the more perfect solution may be the less efficient one in some cases.
I develop a similar 'process and process priority optimization automation' tool for Windows called Process Lasso (not an advertisement, it's free). I had a similar choice to make and have a hybrid solution in place. Kernel-mode hooks are available for certain process-related events in Windows (creation and destruction), but they aren't exposed at user mode, and they aren't helpful for monitoring other process metrics. I don't think any OS will natively inform you of every change to every process metric, and the overhead of that many different hooks might be much greater than simple polling.
Lastly, considering the HIGH frequency of process changes, it may be better to handle all changes at once (polling at interval) vs. notification events/hooks, which may have to be processed many more times per second.
You are RIGHT to stay away from scripts. Why? Because they are slow(er). Of course, the Linux scheduler does a fairly good job of handling CPU-bound threads by downgrading their priority and rewarding (upgrading) the priority of I/O-bound threads, so even under high load a script should stay responsive, I guess.
There's another point of attack you might consider: replace the system's dynamic linker with a modified one which applies your logic. (See this paper for some nice examples of what's possible from the largely neglected art of linker hacking).
Where this approach will have problems is with purely statically linked binaries. I doubt there's much on a modern system which actually doesn't link something dynamically (things like busybox-static being the obvious exceptions, although you might regard the ability to get a minimal shell outside of your controls as a feature when it all goes horribly wrong), so this may not be a big deal. On the other hand, if the priority policies are intended to bring some order to an overloaded shared multi-user system then you might see smart users preparing static-linked versions of apps to avoid linker-imposed priorities.
Sure, just iterate through /proc/nnn/exe to get the pathname of each running image. Only use the ones that are symlinks to paths with slashes; the others are kernel procs.
Check whether you have already processed each one; otherwise look up the new priority in your configuration file and use renice(1) to tweak its priority.
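That scan can be sketched as follows (an illustrative Python sketch; os.setpriority() plays the role of renice):

```python
import os

def running_images():
    """Map pid -> executable path for every process we may inspect."""
    images = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            exe = os.readlink("/proc/%s/exe" % entry)
        except OSError:
            continue               # kernel proc or permission denied: skip
        if exe.startswith("/"):    # "only use the ones with slashes"
            images[int(entry)] = exe
    return images

images = running_images()
print(os.getpid() in images)  # True: we can always inspect ourselves
# A matching entry would then be reniced, e.g.
#   os.setpriority(os.PRIO_PROCESS, pid, new_nice)
```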
If you want to do it as a kernel module then you could look into making your own binary loader. See the following kernel source files for examples:
$KERNEL_SOURCE/fs/binfmt_elf.c
$KERNEL_SOURCE/fs/binfmt_misc.c
$KERNEL_SOURCE/fs/binfmt_script.c
They can give you a first idea where to start.
You could just modify the ELF loader to check for an additional section in ELF files and, when found, use its content to change scheduling priorities. You then would not even need to manage separate configuration files; simply add a new section to every ELF executable you want to manage this way and you are done. See objcopy/objdump in the binutils tools for how to add new sections to ELF files.
Does anyone of you have any ideas on how to complete this task?
As an idea, consider using apparmor in complain-mode. That would log certain messages to syslog, which you could listen to.
If the processes in question are started by executing an executable file with a known path, you can use the inotify mechanism to watch for events on that file. Executing it will trigger IN_OPEN and IN_ACCESS events.
Unfortunately, this won't tell you which process caused the event to trigger, but you can then check which /proc/*/exe symlinks point to the executable file in question and renice the process IDs found.
E.g. here is a crude implementation in Perl using Linux::Inotify2 (which, on Ubuntu, is provided by the liblinux-inotify2-perl package):
perl -MLinux::Inotify2 -e '
use warnings;
use strict;
my $x = shift(@ARGV);
my $w = new Linux::Inotify2;
$w->watch($x, IN_ACCESS, sub
{
for (glob("/proc/*/exe"))
{
if (-r $_ && readlink($_) eq $x && m#^/proc/(\d+)/#)
{
system(@ARGV, $1)
}
}
});
1 while $w->poll
' /bin/ls renice
You can of course save the Perl code to a file, say onexecuting, prepend a first line #!/usr/bin/env perl, make the file executable, put it in your $PATH, and from then on use onexecuting /bin/ls renice.
Then you can use this utility as a basis for implementing various policies for renicing executables. (or doing other things).
