Where can I find documentation on the kflushd? - linux

I cannot find any documentation on the kflushd such as what it does exactly, how it is involved in network IO and how I could use it/call it from my own code.

kflushd AFAIK handles writing out pending I/O buffered in memory to the corresponding devices. If you want to flush pending I/O you can always call fflush, fsync, or sync to force a write to the I/O device.
To call it from your code simply use one of the calls I mentioned (although I think there might be one more I'm forgetting).

Kernel processes like kflushd are started by the kernel on its own (they are not descendants of the init process created by fork-ing) and exist only for the kernel's needs. User applications may invisibly depend on them (because they need some feature offered by the kernel, which the kernel implements with the help of its own kernel processes) but don't actively use them.
You should definitely use the fflush(3) library function appropriately (it just happens to make the relevant write(2) syscalls).
You may want to use the fsync(2) and related syscalls.
Regarding networking, you may be interested by Nagle's algorithm. See this answer.

Related

How do I check if a given operation (or system call) is atomic on Linux?

I want to find a reliable way (other than reading the kernel source code) to check if a given operation (or system call) is atomic (in the sense that other processes can only see the state before or after that operation, but not something in between) on Linux. The goal of this is to avoid using unnecessary locks for some operations if the kernel already does that for me.
So far I can only find resources like this about the topic, which are by no means authoritative or exhaustive. Also, the Linux man pages contain little information about this. For example, for most functions mentioned in the above link, I can't find anything about their atomicity in the man pages.
Could anyone tell me if there is a standard or official documentation which provides this information? Any help would be much appreciated.
I think POSIX thread-safe functions are a good starting point. Thread-safe functions are functions that behave correctly when called simultaneously from different threads. This is not at all the same as being atomic, but at least it gives a hint about which functions certainly are not atomic.
POSIX.1-2001 and POSIX.1-2008 require that all functions specified in the standard shall be thread-safe, except for a specific set of functions (most of which are implemented in the standard library and not in the kernel).
As an example of a function that is thread-safe but not atomic, consider fwrite(). fwrite() writes to a per-stream buffer while holding the stream's lock, so it is thread-safe. However, the buffer may be flushed in separate write() chunks, so other processes don't see it as an atomic write.

Suppress a process in linux kernel scheduler (not kill)

In the Linux scheduler, I want to suppress some processes by modifying the scheduler code. Is it possible to suppress a process without killing it - just suppression?
In the linux scheduler, I want to suppress some processes by modifying the scheduler code
Probably not possible, and certainly ill-defined. The right way to think about modifying the kernel is: first don't, later don't yet, and at last do so minimally and carefully!
What exactly does "suppressing" a process mean to you? You might want to terminate it. You certainly cannot simply "suppress" a process, since the kernel carefully cleans up after it has terminated.
And why do you want to modify the kernel? In general, user-space and user-mode is a better place to do such things (or even systemd). You might also want to have some kernel thread (very tricky).
You might consider kernel to user-space communication with netlink(7), then try to minimize your kernel footprint. Be aware, however, that the scheduler is a critical, and very well tuned, piece of code inside the kernel.
In practice, I would suggest a high-priority user-land daemon. See setpriority(2), nice(2) and sched(7). We don't know what you want to achieve, but it is likely to be practically doable in user-land. And if it is not, perhaps Linux is not the right kernel for you (taking into account that you, Silvara, are a drone developer). Then look into genuine real-time operating systems, IoT OSes like Contiki, or library operating systems (unikernels) such as MirageOS.

How does the Linux kernel realize reentrancy?

All Unix kernels are reentrant: several processes may be executing in kernel mode at the same time. How can I realize this effect in code? How should I handle the situation where many processes invoke system calls and remain pending in kernel mode?
[Edit - the term "reentrant" gets used in a couple of different senses. This answer uses the basic "multiple contexts can be executing the same code at the same time." This usually applies to a single routine, but can be extended to apply to a set of cooperating routines, generally routines which share data. An extreme case of this is when applied to a complete program - a web server, or an operating system. A web-server might be considered non-reentrant if it could only deal with one client at a time. (Ugh!) An operating system kernel might be called non-reentrant if only one process/thread/processor could be executing kernel code at a time.
Operating systems like that existed during the transition to multi-processor systems. Many went through a slow transition from written-for-uniprocessors, to one-single-lock-protects-everything (i.e. non-reentrant), through various stages of finer and finer grained locking. IIRC, Linux finally got rid of the "big kernel lock" at approx. version 2.6.37 - but it was mostly gone long before that, just protecting remnants not yet converted to a multiprocessing implementation.
The rest of this answer is written in terms of individual routines, rather than complete programs.]
If you are in user space, you don't need to do anything. You call whatever system calls you want, and the right thing happens.
So I'm going to presume you are asking about code in the kernel.
Conceptually, it's fairly simple. It's also pretty much identical to what happens in a multi-threaded program in user space, when multiple threads call the same subroutine. (Let's assume it's a C program - other languages may have differently named mechanisms.)
When the system call implementation is using automatic (stack) variables, it has its own copy - no problem with re-entrancy. When it needs to use global data, it generally needs to use some kind of locking - the specific locking required depends on the specific data it's using, and what it's doing with that data.
This is all pretty generic, so perhaps an example might help.
Let's say the system call wants to modify some attribute of a process. The process is represented by a struct task_struct which is a member of various linked lists. Those linked lists are protected by the tasklist_lock. Your system call takes the tasklist_lock, finds the right process, possibly takes a per-process lock controlling the field it cares about, modifies the field, and drops both locks.
One more detail, which is the case of processes executing different system calls, which don't share data with each other. With a reasonable implementation, there are no conflicts at all. One process can get itself into the kernel to handle its system call without affecting the other processes. I don't remember looking specifically at the linux implementation, but I imagine it's "reasonable". Something like a trap into an exception handler, which looks in a table to find the subroutine to handle the specific system call requested. The table is effectively const, so no locks required.

Replacing system calls (syscalls) in Linux 2.6+

I'm looking into writing a userland threading library, since there seems to be no active work in this area, and I believe the C++0x promises and futures may give this model some power. Unfortunately, in order to make this model work, it is essential to ensure a context switch on blocking calls. As such, I would like to intercept every syscall in order to replace it with an asynchronous version. There are some caveats:
I know there are asynchronous syscalls for just about every regular syscall, but for backwards compatibility reasons this is not a viable solution.
I know that in Linux 2.4 or earlier it was possible to directly change the sys_call_table, but this has vanished.
As I would like my library to be statically linked if desired, the LD_PRELOAD trick isn't viable.
Similarly, kernel modules are not an option because this is supposed to be a userland library.
Finally, ptrace() is also not an option for similar reasons. I can't have my library forking a new process just in order to be used.
Is this possible?
I'm looking into writing a userland threading library, since there seems to be no active work in this area
You might want to take a look at the thread libraries Marcel (and its publications) and MPC, which implement hybrid (kernel and user-level) threads, mainly for the purpose of High-Performance Computing, so they had to find a solution for blocking system calls.
So as to avoid the blocking of kernel threads when the application makes blocking system calls, Marcel uses Scheduler Activations when they are available, or just intercepts such blocking calls at the dynamic symbol level.

What is the status of POSIX asynchronous I/O (AIO)?

There are pages scattered around the web that describe POSIX AIO facilities in varying amounts of detail. None of them are terribly recent. It's not clear what, exactly, they're describing. For example, the "official" (?) web site for Linux kernel asynchronous I/O support here says that sockets don't work, but the "aio.h" manual pages on my Ubuntu 8.04.1 workstation all seem to imply that it works for arbitrary file descriptors. Then there's another project that seems to work at the library layer with even less documentation.
I'd like to know:
What is the purpose of POSIX AIO? Given that the most obvious example of an implementation I can find says it doesn't support sockets, the whole thing seems weird to me. Is it just for async disk I/O? If so, why the hyper-general API? If not, why is disk I/O the first thing that got attacked?
Where are there example complete POSIX AIO programs that I can look at?
Does anyone actually use it, for real?
What platforms support POSIX AIO? What parts of it do they support? Does anyone really support the implied "Any I/O to any FD" that <aio.h> seems to promise?
The other multiplexing mechanisms available to me are perfectly good, but the random fragments of information floating around out there have made me curious.
Doing socket I/O efficiently has been solved with kqueue, epoll, I/O completion ports and the like. Doing asynchronous file I/O is sort of a latecomer (apart from Windows' overlapped I/O and Solaris' early support for POSIX AIO).
If you're looking for doing socket I/O, you're probably better off using one of the above mechanisms.
The main purpose of AIO is hence to solve the problem of asynchronous disk I/O. This is most likely why Mac OS X only supports AIO for regular files, and not sockets (since kqueue does that so much better anyway).
Write operations are typically cached by the kernel and flushed out at a later time - for instance, when the read head of the drive happens to pass by the location where the block is to be written.
However, for read operations, if you want the kernel to prioritize and order your reads, AIO is really the only option. Here's why the kernel can (theoretically) do that better than any user-level application:
The kernel sees all disk I/O, not just your application's disk jobs, and can order them at a global level
The kernel (may) know where the disk read head is, and can pick the read jobs you pass on to it in optimal order, to move the head the shortest distance
The kernel can take advantage of native command queuing to optimize your read operations further
You may be able to issue more read operations per system call using lio_listio() than with readv(), especially if your reads are not (logically) contiguous, saving a tiny bit of system call overhead.
Your program might be slightly simpler with AIO since you don't need an extra thread to block in a read or write call.
That said, posix AIO has a quite awkward interface, for instance:
The only efficient and well supported means of event callbacks is via signals, which makes it hard to use in a library, since it means using signal numbers from the process-global signal namespace. If your OS doesn't support realtime signals, it also means you have to loop through all your outstanding requests to figure out which one actually finished (this is the case for Mac OS X, for instance, but not Linux). Catching signals in a multi-threaded environment also makes for some tricky restrictions. You typically cannot react to the event inside the signal handler; you have to raise a signal, write to a pipe, or use signalfd() (on Linux).
aio_suspend() has the same issue as select(): it doesn't scale well with the number of jobs.
lio_listio(), as implemented, has a fairly limited number of jobs you can pass in, and it's not trivial to find this limit in a portable way. You have to call sysconf(_SC_AIO_LISTIO_MAX), which may fail; in that case you can use the AIO_LISTIO_MAX define, which is not necessarily defined either, and then you can fall back on 2, which POSIX guarantees to be supported.
As for real-world application using posix AIO, you could take a look at lighttpd (lighty), which also posted a performance measurement when introducing support.
Most POSIX platforms support POSIX AIO by now (Linux, BSD, Solaris, AIX, Tru64). Windows supports it via its overlapped file I/O. My understanding is that only Solaris, Windows and Linux truly support asynchronous file I/O all the way down to the driver, whereas the other OSes emulate the async I/O with kernel threads. Linux is the exception: its POSIX AIO implementation in glibc emulates async operations with user-level threads, whereas its native async I/O interface (io_submit() etc.) is truly asynchronous all the way down to the driver, assuming the driver supports it.
I believe it's fairly common among OSes to not support posix AIO for any fd, but restrict it to regular files.
Network I/O is not a priority for AIO because everyone writing POSIX network servers uses an event based, non-blocking approach. The old-style Java "billions of blocking threads" approach sucks horribly.
Disk write I/O is already buffered and disk read I/O can be prefetched into buffer using functions like posix_fadvise. That leaves direct, unbuffered disk I/O as the only useful purpose for AIO.
Direct, unbuffered I/O is only really useful for transactional databases, and those tend to write their own threads or processes to manage their disk I/O.
So, in the end, that leaves POSIX AIO in the position of not serving any useful purpose. Don't use it.
A libtorrent developer provides a report on this: http://blog.libtorrent.org/2012/10/asynchronous-disk-io/
There is aio_write - implemented in glibc; the first call of the aio_read or aio_write function spawns a number of user-mode threads; aio_write or aio_read then posts requests to those threads, a thread does the pread/pwrite, and when it is finished the answer is posted back to the blocked calling thread.
There is also "real" AIO - supported at the kernel level (you need libaio for that; see the io_submit call http://linux.die.net/man/2/io_submit). It also needs O_DIRECT (which may not be supported by all file systems, but the major ones do support it).
see here:
http://lse.sourceforge.net/io/aio.html
http://linux.die.net/man/2/io_submit
Difference between POSIX AIO and libaio on Linux?
