Replacing system calls (syscalls) in Linux 2.6+ - multithreading

I'm looking into writing a userland threading library, since there seems to be no active work in this area, and I believe the C++0x promises and futures may give this model some power. Unfortunately, in order to make this model work, it is essential to ensure a context switch on blocking calls. As such, I would like to intercept every syscall in order to replace it with an asynchronous version. There are some caveats:
I know there are asynchronous syscalls for just about every regular syscall, but for backwards compatibility reasons this is not a viable solution.
I know that in Linux 2.4 and earlier it was possible to modify the sys_call_table directly, but this ability has since been removed.
As I would like my library to be statically linked if desired, the LD_PRELOAD trick isn't viable.
Similarly, kernel modules are not an option because this is supposed to be a userland library.
Finally, ptrace() is also not an option for similar reasons. I can't have my library forking a new process just in order to be used.
Is this possible?
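For concreteness, here is roughly the kind of replacement wrapper such a library would need, as a minimal sketch: make the descriptor non-blocking and hand control to the user-level scheduler whenever the call would block. scheduler_yield() is a hypothetical placeholder for the library's green-thread switch (stubbed with sched_yield() here), not a real API.

    #include <errno.h>
    #include <fcntl.h>
    #include <sched.h>
    #include <unistd.h>

    /* Placeholder: a real library would switch to another green thread here. */
    static void scheduler_yield(void) { sched_yield(); }

    ssize_t my_read(int fd, void *buf, size_t count) {
        /* Make the descriptor non-blocking so read() returns EAGAIN instead of sleeping. */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
        for (;;) {
            ssize_t n = read(fd, buf, count);
            if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
                return n;
            scheduler_yield();   /* context switch instead of blocking in the kernel */
        }
    }

The open question is how to get every blocking call routed through such a wrapper without LD_PRELOAD, ptrace() or kernel help.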

I'm looking into writing a userland threading library, since there seems to be no active work in this area
You might want to take a look at the thread libraries Marcel (and its publications) and MPC, which implement hybrid (kernel and user-level) threads, mainly for High-Performance Computing, so they have had to find solutions for blocking system calls.
So as to avoid the blocking of kernel threads when the application makes blocking system calls, Marcel uses Scheduler Activations when they are available, or just intercepts such blocking calls at the dynamic-symbol level.
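For illustration, interception at the dynamic-symbol level looks roughly like this (a sketch, not Marcel's actual code): the library defines its own read() that resolves the libc version with dlsym(RTLD_NEXT, ...) and can hand control to its scheduler before forwarding. Note that this only works for dynamically linked callers, which is exactly why it does not satisfy the static-linking requirement above.

    /* gcc -shared -fPIC -o libintercept.so intercept.c -ldl */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <unistd.h>

    /* Wrapper with the same name as the libc symbol; when this object is linked
     * (or preloaded) ahead of libc, dynamically linked callers end up here. */
    ssize_t read(int fd, void *buf, size_t count) {
        static ssize_t (*real_read)(int, void *, size_t);
        if (!real_read)
            real_read = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");

        /* A threading library could switch to another user-level thread here
         * instead of letting the kernel thread block. */
        return real_read(fd, buf, count);
    }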

Related

How to check if a timer handler thread is running in POSIX

We are developing a kernel driver and corresponding test cases (in user land), and we use timers in our code. malloc is mostly unavailable. Timers are set up with SIGEV_THREAD, so new threads are created for the handlers.
According to the instructions here and here, it is hard to implement a general clean up system. So I am trying to define a framework with coding rules to deal with this.
In this scheme, I need to count the number of running handlers of a specific timer. In the kernel, I can use try_to_del_timer_sync() and re-add the timer to achieve this, but I cannot find an equivalent method in userland, especially in POSIX.
Linux-specific methods are also welcome.
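A minimal sketch of the kind of bookkeeping this would need, assuming a per-timer atomic counter that each SIGEV_THREAD handler updates on entry and exit (no malloc required; compile with -std=c11 -pthread, and -lrt on older glibc):

    #include <signal.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static atomic_int running;          /* number of handler threads currently active */

    static void handler(union sigval sv) {
        atomic_fetch_add(&running, 1);  /* entering the handler */
        /* ... actual timer work ... */
        atomic_fetch_sub(&running, 1);  /* leaving the handler */
        (void)sv;
    }

    int main(void) {
        struct sigevent sev = { 0 };
        sev.sigev_notify = SIGEV_THREAD;
        sev.sigev_notify_function = handler;

        timer_t t;
        timer_create(CLOCK_MONOTONIC, &sev, &t);

        struct itimerspec its = { .it_value = { .tv_sec = 1 },
                                  .it_interval = { .tv_sec = 1 } };
        timer_settime(t, 0, &its, NULL);

        sleep(3);
        printf("handlers running right now: %d\n", atomic_load(&running));
        timer_delete(t);
        return 0;
    }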

How to check if pthread_mutex is based on robust futex

I am trying to use robust-futex-based pthread mutexes on Linux because I need them to be both fast and robust (able to recover a "dead" lock). How can I check whether the pthread mutex implementation on a given Linux system is based on robust futexes?
Thanks!
If you have the futex(2) system call and it is actually used (just strace(1) a 10-line application that uses mutexes), then you have the robust feature, as the futex(2) system call only entered the kernel after robustness had been built into it. This does not mean that you are using robust futexes, just that the feature is available in the kernel.
Next you want to know that your libc supports it. Any version above 2.9 supports it. Just check your version.
If you are writing a multi-threaded application then you don't really need the robustness of the futexes, since you control the threads and can make sure that threads release the mutexes they use before they die, or register a cleanup function to do the lock releasing (there is a pthread API for that). If you are still worried, see my notes below about using robust mutexes anyway.
I just want to make it plain & clear that you are going to pay in performance if you want to use robust futexes in a multi-threaded application. The main use of robust futexes is to use them as synchronization primitives in multi-process applications where the chance of one component dying without killing the rest of the components is high compared to the same chance in a multi-threaded application where the abnormal death of a thread means the death of the entire application.
To use robust futexes in either a multi-threaded or a multi-process application you need to mark the futexes as robust by using the undocumented function pthread_mutexattr_setrobust(3). I've submitted a bug report to the manual pages maintainers to add documentation about that function. You need to pass PTHREAD_MUTEX_ROBUST to that function as opposed to PTHREAD_MUTEX_STALLED which is the default.
In a multi-threaded application marking the mutex as robust is all you have to do.
To use robust futexes in a multi-process application you need to also mark the futex as being shared across processes by calling the (fortunately documented) function pthread_mutexattr_setpshared(3) and pass PTHREAD_PROCESS_SHARED to it. This is opposed to the default PTHREAD_PROCESS_PRIVATE.
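A minimal sketch of that attribute setup (pthread_mutexattr_setrobust and pthread_mutex_consistent are in glibc 2.12+; older glibc spells them with an _np suffix). The process-shared call is only needed for the multi-process case, where the mutex must also live in shared memory:

    #include <errno.h>
    #include <pthread.h>

    int main(void) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);

        /* Mark the mutex robust: a locker gets EOWNERDEAD if the previous owner died. */
        pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);

        /* Only needed for the multi-process case. */
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);

        pthread_mutex_t m;
        pthread_mutex_init(&m, &attr);

        int rc = pthread_mutex_lock(&m);
        if (rc == EOWNERDEAD) {
            /* The previous owner died while holding the lock: repair the protected
             * state, then mark the mutex consistent before unlocking. */
            pthread_mutex_consistent(&m);
        }
        pthread_mutex_unlock(&m);

        pthread_mutex_destroy(&m);
        pthread_mutexattr_destroy(&attr);
        return 0;
    }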
Actually, in strace(1) you will not see the acquisition and release of the locks, but you will see calls to set_robust_list(2) if your futex is robust.
I hope this helps.

Where can I find documentation on the kflushd?

I cannot find any documentation on the kflushd such as what it does exactly, how it is involved in network IO and how I could use it/call it from my own code.
kflushd, AFAIK, handles writing out pending I/O buffered in memory to the corresponding devices. If you want to flush pending I/O you can always call fflush(3), fsync(2), or sync(2) to force a write to the I/O device.
To call it from your code simply use one of the calls I mentioned (although I think there might be one more I'm forgetting).
Kernel processes like kflushd are started by the kernel on its own (they are not descendants of the init process created by forking) and exist only for the kernel's needs. User applications may invisibly depend on them (because they need some feature offered by the kernel which the kernel implements with the help of its own kernel processes), but they don't actively use them.
You definitely should use the fflush(3) library function appropriately (it just happens to make the relevant write(2) syscalls).
You may want to use the fsync(2) and related syscalls.
Regarding networking, you may be interested by Nagle's algorithm. See this answer.
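To make the flushing point concrete, a small sketch (the file name is just an example): fflush(3) pushes stdio's user-space buffer into the kernel via write(2), and fsync(2) then asks the kernel to push its own cached pages, the part kflushd would otherwise write out lazily, to the device.

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        FILE *f = fopen("out.txt", "w");   /* example path */
        if (!f) { perror("fopen"); return 1; }

        fputs("hello\n", f);

        fflush(f);              /* stdio buffer -> kernel page cache (via write(2)) */
        fsync(fileno(f));       /* kernel page cache -> the underlying device */

        fclose(f);
        return 0;
    }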

What interprocess locking calls should I monitor?

I'm monitoring a process with strace/ltrace in the hope to find and intercept a call that checks, and potentially activates some kind of globally shared lock.
While I've dealt with and read about several forms of interprocess locking on Linux before, I'm drawing a blank on what calls to look for.
Currently my only suspect is futex() which comes up very early on in the process' execution.
Update0
There is some confusion about what I'm after. I'm monitoring an existing process for calls to persistent interprocess memory or equivalent. I'd like to know what system and library calls to look for. I have no intention of calling these myself, so naturally futex() will come up; I'm sure many libraries implement their locking in terms of it, etc.
Update1
I'd like a list of function names, or a link to documentation, that I should monitor at the ltrace and strace levels (specifying which tool applies to which). Any other good advice about how to track down and locate the global lock I have in mind would be great.
If you can start the monitored process under Valgrind, then there are two projects:
http://code.google.com/p/data-race-test/wiki/ThreadSanitizer
and Helgrind
http://valgrind.org/docs/manual/hg-manual.html
Helgrind is aware of all the pthread abstractions and tracks their effects as accurately as it can. On x86 and amd64 platforms, it understands and partially handles implicit locking arising from the use of the LOCK instruction prefix.
So, these tools can detect even atomic memory accesses, and they will check pthread usage.
flock is another good one
There are many system calls that can be used for locking: flock, fcntl, and even creat.
When you are using pthread/sem_* locks, they may be handled entirely in user space, so you'll never see them in strace, as futex is called only for pending operations, i.e. when you actually need to wait.
Some operations can be done in user space only, like spinlocks; you'll never see them unless they do some timed back-off, so you may only see things like nanosleep when one lock waits for another.
So there is no "generic" way to trace them.
On systems with glibc roughly >= 2.5 (glibc + NPTL) you can use:
process-shared POSIX unnamed semaphores (last parameter to sem_init)
POSIX mutexes (with PTHREAD_PROCESS_SHARED passed to pthread_mutexattr_setpshared)
POSIX named semaphores (obtained from sem_open/sem_unlink)
System V (SysV) semaphores: semget, semop
On older systems with glibc 2.2 or 2.3 with LinuxThreads, or on embedded systems with uClibc, you can ONLY use System V (SysV) semaphores for interprocess communication.
Update 1: any IPC mechanism and sockets must be checked as well.
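As an illustration of the first item in that list, a minimal sketch of a process-shared unnamed semaphore placed in anonymous shared memory (compile with -pthread; under strace the blocked waiter shows up inside a futex call):

    #include <semaphore.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* The semaphore must live in memory visible to both processes. */
        sem_t *sem = mmap(NULL, sizeof *sem, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        sem_init(sem, 1 /* pshared */, 0);

        if (fork() == 0) {          /* child: wait for the parent's post */
            sem_wait(sem);
            printf("child: got the semaphore\n");
            _exit(0);
        }

        sleep(1);                   /* make the child block first */
        sem_post(sem);
        wait(NULL);
        sem_destroy(sem);
        return 0;
    }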

What is the status of POSIX asynchronous I/O (AIO)?

There are pages scattered around the web that describe POSIX AIO facilities in varying amounts of detail. None of them are terribly recent. It's not clear what, exactly, they're describing. For example, the "official" (?) web site for Linux kernel asynchronous I/O support here says that sockets don't work, but the "aio.h" manual pages on my Ubuntu 8.04.1 workstation all seem to imply that it works for arbitrary file descriptors. Then there's another project that seems to work at the library layer with even less documentation.
I'd like to know:
What is the purpose of POSIX AIO? Given that the most obvious example of an implementation I can find says it doesn't support sockets, the whole thing seems weird to me. Is it just for async disk I/O? If so, why the hyper-general API? If not, why is disk I/O the first thing that got attacked?
Where are there example complete POSIX AIO programs that I can look at?
Does anyone actually use it, for real?
What platforms support POSIX AIO? What parts of it do they support? Does anyone really support the implied "Any I/O to any FD" that <aio.h> seems to promise?
The other multiplexing mechanisms available to me are perfectly good, but the random fragments of information floating around out there have made me curious.
Doing socket I/O efficiently has been solved with kqueue, epoll, IO completion ports and the like. Doing asynchronous file I/O is something of a latecomer (apart from Windows' overlapped I/O and Solaris' early support for POSIX AIO).
If you're looking for doing socket I/O, you're probably better off using one of the above mechanisms.
The main purpose of AIO is hence to solve the problem of asynchronous disk I/O. This is most likely why Mac OS X only supports AIO for regular files, and not sockets (since kqueue does that so much better anyway).
Write operations are typically cached by the kernel and flushed out at a later time, for instance when the read head of the drive happens to pass by the location where the block is to be written.
However, for read operations, if you want the kernel to prioritize and order your reads, AIO is really the only option. Here's why the kernel can (theoretically) do that better than any user-level application:
The kernel sees all disk I/O, not just your application's disk jobs, and can order them at a global level.
The kernel may know where the disk read head is, and can pick the read jobs you pass to it in the optimal order, to move the head the shortest distance.
The kernel can take advantage of native command queuing to optimize your read operations further.
You may be able to issue more read operations per system call using lio_listio() than with readv(), especially if your reads are not (logically) contiguous, saving a tiny bit of system call overhead.
Your program might be slightly simpler with AIO since you don't need an extra thread to block in a read or write call.
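To make the lio_listio() point above concrete, here is a minimal sketch that batches several reads into a single call (the file name, chunk size and count are made up; link with -lrt on glibc):

    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);          /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        enum { N = 4, CHUNK = 4096 };
        static char buf[N][CHUNK];
        struct aiocb cbs[N];
        struct aiocb *list[N];

        for (int i = 0; i < N; i++) {
            memset(&cbs[i], 0, sizeof cbs[i]);
            cbs[i].aio_fildes = fd;
            cbs[i].aio_buf = buf[i];
            cbs[i].aio_nbytes = CHUNK;
            cbs[i].aio_offset = (off_t)i * CHUNK;     /* non-contiguous offsets work too */
            cbs[i].aio_lio_opcode = LIO_READ;
            list[i] = &cbs[i];
        }

        /* LIO_WAIT blocks until all requests complete; LIO_NOWAIT returns immediately. */
        if (lio_listio(LIO_WAIT, list, N, NULL) < 0) { perror("lio_listio"); return 1; }

        for (int i = 0; i < N; i++)
            printf("request %d: %zd bytes\n", i, aio_return(&cbs[i]));

        close(fd);
        return 0;
    }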
That said, posix AIO has a quite awkward interface, for instance:
The only efficient and well-supported means of event callback is via signals, which makes it hard to use in a library, since it means using signal numbers from the process-global signal namespace. If your OS doesn't support realtime signals, it also means you have to loop through all your outstanding requests to figure out which one actually finished (this is the case for Mac OS X, for instance, but not Linux). Catching signals in a multi-threaded environment also makes for some tricky restrictions. You typically cannot react to the event inside the signal handler; instead you have to raise a signal, write to a pipe, or use signalfd() (on Linux).
aio_suspend() has the same issues as select() does: it doesn't scale very well with the number of jobs.
lio_listio(), as implemented, has a fairly limited number of jobs you can pass in, and it's not trivial to find this limit in a portable way. You have to call sysconf(_SC_AIO_LISTIO_MAX), which may fail; in that case you can use the AIO_LISTIO_MAX define, which is not necessarily defined either, but then you can fall back to 2, which is guaranteed to be supported (a small sketch of this fallback chain follows).
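The fallback chain described above looks roughly like this:

    #include <aio.h>
    #include <limits.h>
    #include <unistd.h>

    static long listio_max(void) {
        long n = sysconf(_SC_AIO_LISTIO_MAX);
        if (n > 0)
            return n;
    #ifdef AIO_LISTIO_MAX
        return AIO_LISTIO_MAX;      /* compile-time limit, if the platform defines it */
    #else
        return 2;                   /* _POSIX_AIO_LISTIO_MAX: the minimum POSIX guarantees */
    #endif
    }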
As for real-world applications using POSIX AIO, you could take a look at lighttpd (lighty), which also posted a performance measurement when introducing support.
Most POSIX platforms support POSIX AIO by now (Linux, BSD, Solaris, AIX, Tru64). Windows supports it via its overlapped file I/O. My understanding is that only Solaris, Windows and Linux truly support asynchronous file I/O all the way down to the driver, whereas the other OSes emulate the async I/O with kernel threads. Linux is the exception: its POSIX AIO implementation in glibc emulates async operations with user-level threads, whereas its native async I/O interface (io_submit() etc.) is truly asynchronous all the way down to the driver, assuming the driver supports it.
I believe it's fairly common among OSes not to support POSIX AIO for any fd, but to restrict it to regular files.
Network I/O is not a priority for AIO because everyone writing POSIX network servers uses an event based, non-blocking approach. The old-style Java "billions of blocking threads" approach sucks horribly.
Disk write I/O is already buffered and disk read I/O can be prefetched into buffer using functions like posix_fadvise. That leaves direct, unbuffered disk I/O as the only useful purpose for AIO.
Direct, unbuffered I/O is only really useful for transactional databases, and those tend to write their own threads or processes to manage their disk I/O.
So, at the end that leaves POSIX AIO in the position of not serving any useful purpose. Don't use it.
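For the read prefetching mentioned above, the buffered path looks roughly like this (a sketch; the file name and length are placeholders):

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);        /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        /* Ask the kernel to start pulling the first 1 MiB into the page cache now;
         * a later read() over that range should then mostly hit the cache. */
        posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_WILLNEED);

        /* ... do other work, then read() as usual ... */
        close(fd);
        return 0;
    }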
A libtorrent developer provides a report on this: http://blog.libtorrent.org/2012/10/asynchronous-disk-io/
There is aio_write, implemented in glibc: the first call to the aio_read or aio_write function spawns a number of user-mode threads, aio_write or aio_read posts requests to those threads, a thread does pread/pwrite, and when it is finished the answer is posted back to the blocked calling thread.
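A minimal sketch of that glibc POSIX AIO path (link with -lrt; the file name is a placeholder): submit one read, wait for it, and collect the result. Under the hood a glibc helper thread services it with pread.

    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);   /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }

        /* Wait for completion; with glibc this is serviced by a helper thread doing pread(). */
        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);

        if (aio_error(&cb) == 0)
            printf("read %zd bytes\n", aio_return(&cb));

        close(fd);
        return 0;
    }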
There is also 'real' AIO, supported at the kernel level (you need libaio for that; see the io_submit call, http://linux.die.net/man/2/io_submit). It also requires O_DIRECT (which may not be supported by all file systems, but the major ones do support it).
see here:
http://lse.sourceforge.net/io/aio.html
http://linux.die.net/man/2/io_submit
Difference between POSIX AIO and libaio on Linux?
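And a minimal sketch of the kernel-level path via libaio (install the libaio development package and link with -laio; the file name, size and 4096-byte alignment are placeholder assumptions, and O_DIRECT requires aligned buffers):

    /* gcc -D_GNU_SOURCE demo.c -laio */
    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY | O_DIRECT);   /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096))             /* O_DIRECT needs aligned buffers */
            return 1;

        io_context_t ctx;
        memset(&ctx, 0, sizeof ctx);
        if (io_setup(8, &ctx) < 0) {                      /* returns -errno on failure */
            fprintf(stderr, "io_setup failed\n");
            return 1;
        }

        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);             /* 4 KiB read from offset 0 */

        if (io_submit(ctx, 1, cbs) != 1) {
            fprintf(stderr, "io_submit failed\n");
            return 1;
        }

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);               /* block until the one completion */
        printf("completed: %ld bytes\n", (long)ev.res);

        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
    }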
