Difference between POSIX AIO and libaio on Linux?

What I seem to understand:
POSIX AIO APIs are prototyped in <aio.h> and you link your program with librt (-lrt), while the libaio APIs are prototyped in <libaio.h> and your program is linked with libaio (-laio).
What I can't figure out:
1. Does the kernel handle either of these methods differently?
2. Is the O_DIRECT flag mandatory for using either of them?
As mentioned in this post, libaio works fine without O_DIRECT. Okay, understood, but:
According to Robert Love's Linux System Programming, Linux supports AIO (which I assume is POSIX AIO) on regular files only if they are opened with O_DIRECT. But a small program that I wrote (using aio.h, linked with -lrt) that calls aio_write on a file opened without the O_DIRECT flag works without issues.

On linux, the two AIO implementations are fundamentally different.
POSIX AIO is a user-level implementation that performs normal blocking I/O in multiple threads, hence giving the illusion that the I/Os are asynchronous. The main reasons to do this are:
it works with any filesystem
it works (essentially) on any operating system (keep in mind that gnu's libc is portable)
it works on files with buffering enabled (i.e. no O_DIRECT flag set)
The main drawback is that your queue depth (i.e. the number of outstanding operations you can have in practice) is limited by the number of threads you choose to have, which also means that a slow operation on one disk may block an operation going to a different disk. It also affects which I/Os (or how many) are seen by the kernel and the disk scheduler.
The kernel AIO (i.e. io_submit() et al.) is kernel support for asynchronous I/O operations, where the I/O requests are actually queued up in the kernel, sorted by whatever disk scheduler you have, and presumably some of them are forwarded (in somewhat optimal order, one would hope) to the actual disk as asynchronous operations (using TCQ or NCQ). The main restriction with this approach is that not all filesystems work that well or at all with async I/O (and may fall back to blocking semantics), and files have to be opened with O_DIRECT, which comes with a whole lot of other restrictions on the I/O requests. If you fail to open your files with O_DIRECT, it may still "work", in the sense that you get the right data back, but it probably isn't done asynchronously; instead it falls back to blocking semantics.
Also keep in mind that io_submit() can actually block on the disk under certain circumstances.
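For concreteness, here is a minimal POSIX AIO sketch matching the question's scenario: aio_write() on a file opened without O_DIRECT, linked with -lrt. The file name is a placeholder and error handling is trimmed; with glibc the request is serviced by a helper thread doing an ordinary blocking write, which is why no O_DIRECT is needed.

/* Minimal POSIX AIO sketch (glibc's thread-based implementation).
 * Link with -lrt. "out.txt" is a placeholder file name. */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);  /* no O_DIRECT */
    if (fd < 0) { perror("open"); return 1; }

    static char msg[] = "hello from POSIX AIO\n";
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = msg;
    cb.aio_nbytes = sizeof msg - 1;
    cb.aio_offset = 0;

    if (aio_write(&cb) != 0) { perror("aio_write"); return 1; }

    /* Wait for completion; real code would do other work in the meantime. */
    const struct aiocb *wait_list[1] = { &cb };
    aio_suspend(wait_list, 1, NULL);

    printf("wrote %zd bytes\n", aio_return(&cb));
    close(fd);
    return 0;
}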

Related

File Access (read/write) synchronization between 'n' processes in Linux

I am studying Operating Systems this semester and was just wondering how Linux handles file access (read/write) synchronization. What does the default implementation use: semaphores, mutexes, or monitors? Can you please tell me where I would find this in the source code of my own copy of Ubuntu, and how to disable it?
I need to disable it so I can check whether my own implementation works; also, how do I add my own implementation to the system?
Here's my current plan; please tell me if it's okay:
Disable the default implementation, add my own. (recompile kernel if need be)
My own version would keep track of every incoming process and maintain a list of the files each one is using, and whenever a file repeats I would check whether it is a reader process or a writer process.
I will be going with a reader-preferred solution to the readers-writers problem.
The kernel doesn't impose process synchronization (that should be performed by the processes themselves; the kernel only provides tools for it), but it can guarantee atomicity for some operations: an atomic operation cannot be interrupted, and its result cannot be altered by another operation running in parallel.
Speaking of writing to a file, it has some atomicity guarantees. From man -s3 write:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of IEEE Std 1003.1-2001 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.
Some discussion on SO: Atomicity of write(2) to a local filesystem.
To maintain atomicity, various kernel routines hold the i_mutex mutex of an inode, e.g. in generic_file_write_iter():
mutex_lock(&inode->i_mutex);
ret = __generic_file_write_iter(iocb, from);
mutex_unlock(&inode->i_mutex);
So other write() calls won't mess with your call. Readers, however, don't take i_mutex, so they may see inconsistent data. The actual locking for readers is performed in the page cache, so a page (4096 bytes on x86) is the minimum amount of data for which the kernel guarantees atomicity.
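As a small illustration of what that serialization buys you in practice, here is a hedged sketch (file name and record contents are placeholders, error handling trimmed): several processes append to the same file with O_APPEND, and because each record is written with a single write() call that the kernel serializes on the inode, the lines do not interleave.

/* Hedged sketch: small single-write() appends with O_APPEND land as
 * contiguous records; "log.txt" is a placeholder file name. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("log.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {                 /* child: one independent writer */
            char line[64];
            int len = snprintf(line, sizeof line,
                               "writer %d says hello\n", (int)getpid());
            /* O_APPEND + one write(): the record is appended atomically at
             * the current end of file, so lines from different writers
             * never interleave within a record. */
            write(fd, line, len);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    return 0;
}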
Speaking of recompiling the kernel to test your own implementation, there are two ways of doing that: download a vanilla kernel from http://kernel.org/ (or from Git), patch it and build it -- that is easy. Recompiling Ubuntu kernels is harder -- it requires working with the Debian build tools: https://help.ubuntu.com/community/Kernel/Compile
I'm not clear on what you're trying to achieve with your own implementation. If you want to apply stricter synchronization rules, maybe it is time to look at TxOS?

Linux system call for creating process and thread

I read in a paper that the underlying system call to create processes and threads is actually the same, and thus the cost of creating processes over threads is not that great.
First, I want to know what the system call is that creates processes/threads (possibly a sample code or a link?).
Second, is the author correct to assume that creating processes instead of threads is inexpensive?
EDIT:
Quoting article:
Replacing pthreads with processes is surprisingly inexpensive, especially on Linux where both pthreads and processes are invoked using the same underlying system call.
Processes are usually created with fork; threads (lightweight processes) are usually created with clone nowadays. However, anecdotally, 1:N thread models exist too, which do neither.
Both fork and clone map to the same kernel function, do_fork, internally. This function can create a lightweight process that shares the address space with the old one, or a separate process (and many other options), depending on what flags you feed to it. The clone syscall is more or less a direct forwarding of that kernel function (and is used by the higher-level threading libraries), whereas fork wraps do_fork in the functionality of the 50-year-old traditional Unix call.
The important difference is that fork guarantees that a complete, separate copy of the address space is made. This, as Basil points out correctly, is done with copy-on-write nowadays and therefore is not nearly as expensive as one would think.
When you create a thread, it just reuses the original address space and the same memory.
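To make the flag distinction concrete, here is a hedged sketch using the glibc clone() wrapper (the stack size and flag choice are illustrative assumptions, not a recommendation): with CLONE_VM the child behaves like a thread and shares memory with the parent; drop CLONE_VM and you get fork()-like copy-on-write semantics instead.

/* Hedged sketch: the same clone() call can create a thread-like or a
 * fork()-like child, depending on flags. Compile with -D_GNU_SOURCE. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_counter = 0;

static int child_fn(void *arg)
{
    shared_counter++;          /* visible to the parent only with CLONE_VM */
    return 0;
}

int main(void)
{
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack) { perror("malloc"); return 1; }

    /* Thread-like child: shares the address space (CLONE_VM). Dropping
     * CLONE_VM would give fork()-like copy-on-write semantics. */
    pid_t pid = clone(child_fn, stack + stack_size,   /* stack grows down */
                      CLONE_VM | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); return 1; }

    waitpid(pid, NULL, 0);
    printf("shared_counter = %d\n", shared_counter);  /* prints 1 with CLONE_VM */
    free(stack);
    return 0;
}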
However, one should not assume that creating processes is generally "lightweight" on unix-like systems because of copy-on-write. It is somewhat less heavy than for example under Windows, but it's nowhere near free.
One reason is that although the actual pages are not copied, the new process still needs a copy of the page table. This can be several kilobytes to megabytes of memory for processes that use larger amounts of memory.
Another reason is that although copy-on-write is invisible and a clever optimization, it is not free, and it cannot do magic. When data is modified by either process, which inevitably happens, the affected pages fault.
Redis is a good example where you can see that fork is anything but lightweight (it uses fork to do background saves).
The underlying system call to create threads is clone(2) (it is Linux-specific). BTW, the list of Linux system calls is in syscalls(2), and you could use the strace(1) command to understand the syscalls done by some process or command. Processes are usually created with fork(2) (or vfork(2), which is not very useful these days). However, you could (and some C standard libraries might do that) create them with some particular form of clone. I guess that the kernel shares some code to implement clone, fork, etc. (since some functionality, e.g. management of the virtual address space, is common).
Indeed, process creation (and also thread creation) is generally quite fast on most Unix systems (because they use copy-on-write machinery for the virtual memory), typically a small fraction of a millisecond. But you could have pathological cases (e.g. thrashing) which makes that much longer.
Since most C standard library implementations are free software on Linux, you could study the source code of the one on your system (often GNU glibc, but sometimes musl-libc or something else).

buffered asynchronous file I/O on linux

I am looking for the most efficient way to do asynchronous file I/O on linux.
The POSIX glibc implementation uses threads in userland.
The native aio kernel api only works with unbuffered operations, patches for the kernel to add support for buffered operations exist, but those are >3 years old and no one seems to care about integrating them into the mainline.
I found plenty of other ideas, concepts, and patches that would allow asynchronous I/O, though most of them in articles that are also >3 years old. What of all this is really available in today's kernel? I've read about syslets, acalls, stuff with kernel threads and more things I don't even remember right now.
What is the most efficient way to do buffered asynchronous file input/output in today's kernel?
Unless you want to write your own IO thread pool, the glibc implementation is an acceptable solution. It actually works surprisingly well for something that runs entirely in userland.
The kernel implementation does not work with buffered IO at all in my experience (though I've seen other people say the opposite!). Which is fine if you want to read huge amounts of data via DMA, but of course it sucks big time if you plan to take advantage of the buffer cache.
Also note that the kernel AIO calls may actually block. There is a limited-size command buffer, and large reads are broken up into several smaller ones. Once the queue is full, asynchronous commands run synchronously. Surprise. I ran into this problem a year or two ago and could not find an explanation. Asking around gave me the "yeah of course, that's how it works" answer.
From what I've understood, the "official" interest in supporting buffered aio is not terribly great either, even though several working solutions seem to have been available for years. Some of the arguments that I've read were along the lines of "you don't want to use the buffers anyway" and "nobody needs that" and "most people don't even use epoll yet". So, well... meh.
Being able to get an epoll signalled by a completed async operation was another issue until recently, but in the meantime this works really fine via eventfd.
Note that the glibc implementation will actually spawn threads on demand inside __aio_enqueue_request. It is probably no big deal, since spawning threads is not that terribly expensive any more, but one should be aware of it. If your understanding of starting an asynchronous operation is "returns immediately", then that assumption may not be true, because it may be spawning some threads first.
EDIT:
As a sidenote, under Windows there exists a very similar situation to the one in the glibc AIO implementation where the "returns immediately" assumption of queuing an asynchronous operation is not true.
If all the data that you wanted to read is in the buffer cache, Windows will decide to run the request synchronously instead, because it will finish immediately anyway. This is well-documented, and admittedly sounds great, too. Except that when there are a few megabytes to copy, or when another thread has page faults or does IO concurrently (thus competing for the lock), "immediately" can be a surprisingly long time -- I've seen "immediate" times of 2-5 milliseconds. Which is no problem in most situations, but for example under the constraint of a 16.66ms frame time, you probably don't want to risk blocking for 5ms at random times. Thus, the naive assumption of "can do async IO from my render thread no problem, because async doesn't block" is flawed.
The material seems old -- well, it is old -- because it's been around for long and, while by no means trivial, is well understood. A solution you can lift is published in W. Richard Stevens's superb and unparalleled book (read "bible"). The book is the rare treasure that is clear, concise, and complete: every page gives real and immediate value:
Advanced Programming in the UNIX Environment
Two other such, also by Stevens, are the first two volumes of his Unix Network Programming collection:
Volume 1: The Sockets Networking API (with Fenner and Rudoff) and
Volume 2: Interprocess Communications
I can't imagine being without these three fundamental books; I'm dumbstruck when I find someone who hasn't heard of them.
Still more of Stevens's books, just as precious:
TCP/IP Illustrated, Vol. 1: The Protocols
(2021) If your Linux kernel is new enough (at least 5.1 but newer kernels bring improvements) then io_uring will be "the most efficient way to do asynchronous file input/output" *. That applies to both buffered and direct I/O!
In the Kernel Recipes 2019 video "Faster IO through io_uring", io_uring author Jens Axboe demonstrates buffered I/O via io_uring finishing in almost half the time of synchronous buffered I/O. As Marenz noted, unless you want userspace threads, io_uring is the only way to do buffered asynchronous I/O, because Linux AIO (aka libaio/io_submit()) doesn't have the ability to always do buffered asynchronous I/O...
Additionally, in the article "Modern storage is plenty fast." Glauber Costa demonstrates how careful use of io_uring with asynchronous direct I/O can improve throughput compared to using io_uring for asynchronous buffered I/O on an Optane device. It required Glauber to have a userspace readahead implementation (without which buffered I/O was a clear winner) but the improvement was impressive.
* The context of this answer is clearly in relation to storage (after all, the word buffered was mentioned). For network I/O, io_uring has steadily improved in later kernels to the extent that it can trade blows with things like epoll(), and if this continues it will one day be equal or better in all cases.
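For reference, a minimal sketch of a buffered asynchronous read via io_uring, assuming the liburing helper library is installed (link with -luring; kernel >= 5.1). The file name, queue depth and buffer size are placeholders and error handling is trimmed.

/* Hedged sketch: buffered (no O_DIRECT) asynchronous read via io_uring,
 * using liburing. "data.bin" is a placeholder file name. */
#include <liburing.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) { perror("queue_init"); return 1; }

    int fd = open("data.bin", O_RDONLY);          /* buffered: no O_DIRECT */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);     /* read 4 KiB at offset 0 */
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);               /* real code would overlap other work here */
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}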
I don't think the Linux kernel implementation of asynchronous file I/O is really usable unless you also use O_DIRECT, sorry.
There's more information about the current state of the world here: https://github.com/littledan/linux-aio . It was updated in 2012 by someone who used to work at Google.

The state of Linux async IO?

I ask here since googling leads you on a merry trip around archives with no hint as to what the current state is. If you go by Google, it seems that async IO was all the rage in 2001 to 2003, and by 2006 some stuff like epoll and libaio was turning up; kevent appeared but seems to have disappeared, and as far as I can tell, there is still no good way to mix completion-based and ready-based signaling, async sendfile - is that even possible? - and everything else in a single-threaded event loop.
So please tell me I'm wrong and it's all rosy! - and, importantly, what APIs to use.
How does Linux compare to FreeBSD and other operating systems in this regard?
AIO as such is still somewhat limited and a real pain to get started with, but it kind of works for the most part, once you've dug through it.
It has some, in my opinion, serious bugs, but those are really features. For example, when submitting a certain amount of commands or data, your submitting thread will block. I don't remember the exact justification for this feature, but the reply I got back then was something like "yes of course, the kernel has a limit on its queue size, that is as intended". Which is acceptable if you submit a few thousand requests... obviously there has to be a limit somewhere. It might make sense from a DoS point of view, too (otherwise a malicious program could force the kernel to run out of memory by posting a billion requests). But still, it's something that you can realistically encounter with "normal" numbers (a hundred or so), and it will strike you unexpectedly, which is no good. Plus, if you only submit half a dozen or so requests and they're a bit larger (some megabytes of data), the same may happen, apparently because the kernel breaks them up into sub-requests. Which, again, kind of makes sense, but since the docs don't tell you, one would expect it to make no difference (apart from taking longer) whether you read 500 bytes or 50 megabytes of data.
Also, there seems to be no way of doing buffered AIO, at least on any of my Debian and Ubuntu systems (although I've seen other people complain about the exact opposite, i.e. unbuffered writes in fact going via the buffers). From what I can see on my systems, AIO is only really asynchronous with buffering turned off, which is a shame (it is why I am presently using an ugly construct around memory mapping and a worker thread instead).
An important issue with anything asynchronous is being able to epoll_wait() on it, which matters if you are doing anything else apart from disk IO (such as receiving network traffic). Of course there is io_getevents, but it is less desirable/useful, since it can only wait for AIO completions and nothing else.
In recent kernels, there is support for eventfd. At first sight, it appears useless, since it is not obvious how it may be helpful in any way.
However, to your rescue, there is the undocumented function io_set_eventfd which lets you associate AIO with an eventfd, which is epoll_wait()-able. You have to dig through the headers to find out about it, but it's certainly there, and it works just fine.
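To sketch how that fits together (a hedged example, not a recipe: the file name, alignment and sizes are placeholders, error handling is trimmed, and the file is opened with O_DIRECT since kernel AIO effectively requires it; link with -laio):

/* Hedged sketch: a kernel AIO read whose completion is signalled through an
 * eventfd, so it can be multiplexed with epoll_wait(). */
#define _GNU_SOURCE
#include <libaio.h>
#include <sys/eventfd.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd  = open("data.bin", O_RDONLY | O_DIRECT);   /* placeholder file */
    int efd = eventfd(0, EFD_NONBLOCK);
    int ep  = epoll_create1(0);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
    epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);

    io_context_t ctx = 0;
    io_setup(8, &ctx);

    void *buf;
    posix_memalign(&buf, 512, 4096);     /* O_DIRECT needs aligned buffers */

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);
    io_set_eventfd(&cb, efd);            /* completion bumps the eventfd */
    io_submit(ctx, 1, cbs);

    /* Wait for readiness alongside any other fds you may be watching. */
    epoll_wait(ep, &ev, 1, -1);

    uint64_t completions;
    read(efd, &completions, sizeof completions);

    struct io_event events[1];
    io_getevents(ctx, 1, 1, events, NULL);
    printf("read %ld bytes\n", (long)events[0].res);

    io_destroy(ctx);
    free(buf);
    close(fd); close(efd); close(ep);
    return 0;
}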
Asynchronous disc IO is alive and kicking... it is actually supported and works reasonably well now, but it has significant limitations (though with enough functionality that some of the major users can usefully use it -- for example, MySQL's InnoDB does in the latest version).
Asynchronous disc IO is the ability to invoke disc IO operations in a non-blocking manner (in a single thread) and wait for them to complete. This works fine, http://lse.sourceforge.net/io/aio.html has more info.
AIO does enough for a typical application (database server) to be able to use it. AIO is a good alternative to either creating lots of threads doing synchronous IO, or using scatter/gather in the preadv family of system calls which now exist.
It's possible to do a "shopping list" synchronous IO job using the newish preadv call, where the kernel reads a contiguous range of a file starting at a given offset and scatters it into several buffers in a single call. This is OK as long as you have only one file to read. (NB: the equivalent write function, pwritev, also exists.)
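A hedged preadv(2) sketch (file name and buffer sizes are placeholders, error handling trimmed): one call fills two separate buffers from a single starting offset.

/* Hedged sketch of preadv(2): one system call fills several buffers from a
 * single starting offset in the file. */
#define _GNU_SOURCE
#include <sys/uio.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);   /* placeholder file */
    if (fd < 0) { perror("open"); return 1; }

    char header[64], body[4096];
    struct iovec iov[2] = {
        { .iov_base = header, .iov_len = sizeof header },
        { .iov_base = body,   .iov_len = sizeof body   },
    };

    /* Reads sizeof header + sizeof body bytes starting at offset 0,
     * splitting them across the two buffers. */
    ssize_t n = preadv(fd, iov, 2, 0);
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}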
poll, epoll etc, are just fancy ways of doing select() that suffer from fewer limitations and scalability problems - they may not be able to be mixed with disc aio easily, but in a real-world application, you can probably get around this fairly trivially by using threads (some database servers tend to do these kinds of operations in separate threads anyway). Poll() is good, epoll is better, for large numbers of file descriptors. select() is ok too for small numbers of file descriptors (or specifically, low file descriptor numbers).
(At the tail end of 2019 there's a glimmer of hope almost a decade after the original question was asked)
If you have a 5.1 or later Linux kernel you can use the io_uring interface which will hopefully usher in a better asynchronous I/O future for Linux (see one of the answers to the Stack Overflow question "Is there really no asynchronous block I/O on Linux?" for benefits io_uring provides over KAIO). Hopefully this will allow Linux to provide stiff competition to FreeBSD's asynchronous AIO without huge contortions!
Most of what I've learned about asynchronous I/O in Linux was by working on the Lighttpd source. It is a single-threaded web server that handles many simultaneous connections, using what it believes is the best of whatever asynchronous I/O mechanisms are available on the running system. Take a look at the source; it supports Linux, BSD, and (I think) a few other operating systems.

What is the status of POSIX asynchronous I/O (AIO)?

There are pages scattered around the web that describe POSIX AIO facilities in varying amounts of detail. None of them are terribly recent. It's not clear what, exactly, they're describing. For example, the "official" (?) web site for Linux kernel asynchronous I/O support here says that sockets don't work, but the "aio.h" manual pages on my Ubuntu 8.04.1 workstation all seem to imply that it works for arbitrary file descriptors. Then there's another project that seems to work at the library layer with even less documentation.
I'd like to know:
What is the purpose of POSIX AIO? Given that the most obvious example of an implementation I can find says it doesn't support sockets, the whole thing seems weird to me. Is it just for async disk I/O? If so, why the hyper-general API? If not, why is disk I/O the first thing that got attacked?
Where are there example complete POSIX AIO programs that I can look at?
Does anyone actually use it, for real?
What platforms support POSIX AIO? What parts of it do they support? Does anyone really support the implied "Any I/O to any FD" that <aio.h> seems to promise?
The other multiplexing mechanisms available to me are perfectly good, but the random fragments of information floating around out there have made me curious.
Doing socket I/O efficiently has been solved with kqueue, epoll, IO completion ports and the like. Doing asynchronous file I/O is sort of a latecomer (apart from Windows' overlapped I/O and Solaris's early support for POSIX AIO).
If you're looking for doing socket I/O, you're probably better off using one of the above mechanisms.
The main purpose of AIO is hence to solve the problem of asynchronous disk I/O. This is most likely why Mac OS X only supports AIO for regular files, and not sockets (since kqueue does that so much better anyway).
Write operations are typically cached by the kernel and flushed out at a later time. For instance when the read head of the drive happens to pass by the location where the block is to be written.
However, for read operations, if you want the kernel to prioritize and order your reads, AIO is really the only option. Here's why the kernel can (theoretically) do that better than any user-level application:
The kernel sees all disk I/O, not just your application's disk jobs, and can order them at a global level
The kernel (may) know where the disk read head is, and can pick the read jobs you pass on to it in optimal order, to move the head the shortest distance
The kernel can take advantage of native command queuing to optimize your read operations further
You may be able to issue more read operations per system call using lio_listio() than with readv(), especially if your reads are not (logically) contiguous, saving a tiny bit of system call overhead.
Your program might be slightly simpler with AIO since you don't need an extra thread to block in a read or write call.
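As an illustration of the lio_listio() batching mentioned above, a hedged sketch (link with -lrt; the file name, offsets and sizes are placeholders, error handling trimmed) that submits two non-contiguous reads in one call:

/* Hedged sketch: batching two non-contiguous reads with lio_listio(). */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);   /* placeholder file */
    if (fd < 0) { perror("open"); return 1; }

    char buf0[4096], buf1[4096];
    struct aiocb cbs[2];
    memset(cbs, 0, sizeof cbs);

    cbs[0].aio_fildes = fd;
    cbs[0].aio_buf    = buf0;
    cbs[0].aio_nbytes = sizeof buf0;
    cbs[0].aio_offset = 0;                 /* first block of the file */
    cbs[0].aio_lio_opcode = LIO_READ;

    cbs[1].aio_fildes = fd;
    cbs[1].aio_buf    = buf1;
    cbs[1].aio_nbytes = sizeof buf1;
    cbs[1].aio_offset = 1 << 20;           /* non-contiguous offset, 1 MiB in */
    cbs[1].aio_lio_opcode = LIO_READ;

    struct aiocb *list[2] = { &cbs[0], &cbs[1] };

    /* LIO_WAIT blocks until both finish; LIO_NOWAIT would return immediately
     * and completion could be picked up via a sigevent or aio_error() polling. */
    if (lio_listio(LIO_WAIT, list, 2, NULL) != 0) { perror("lio_listio"); return 1; }

    printf("read %zd and %zd bytes\n", aio_return(&cbs[0]), aio_return(&cbs[1]));
    close(fd);
    return 0;
}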
That said, POSIX AIO has quite an awkward interface. For instance:
The only efficient and well-supported means of event callbacks is via signals, which makes it hard to use in a library, since it means using signal numbers from the process-global signal namespace. If your OS doesn't support realtime signals, it also means you have to loop through all your outstanding requests to figure out which one actually finished (this is the case for Mac OS X, for instance, but not Linux). Catching signals in a multi-threaded environment also makes for some tricky restrictions. You typically cannot react to the event inside the signal handler; instead you have to raise a signal, write to a pipe, or use signalfd() (on Linux).
lio_suspend() has the same issues as select() does, it doesn't scale very well with the number of jobs.
lio_listio(), as implemented, has a fairly limited number of jobs you can pass in, and it's not trivial to find this limit in a portable way. You have to call sysconf(_SC_AIO_LISTIO_MAX), which may fail; in that case you can use the AIO_LISTIO_MAX define, which is not necessarily defined, but then you can use 2, which is guaranteed to be supported.
As for real-world application using posix AIO, you could take a look at lighttpd (lighty), which also posted a performance measurement when introducing support.
Most POSIX platforms support POSIX AIO by now (Linux, BSD, Solaris, AIX, Tru64). Windows supports it via its overlapped file I/O. My understanding is that only Solaris, Windows and Linux truly support asynchronous file I/O all the way down to the driver, whereas the other OSes emulate the async I/O with kernel threads. Linux is the exception: its POSIX AIO implementation in glibc emulates async operations with user-level threads, whereas its native async I/O interface (io_submit() etc.) is truly asynchronous all the way down to the driver, assuming the driver supports it.
I believe it's fairly common among OSes to not support posix AIO for any fd, but restrict it to regular files.
Network I/O is not a priority for AIO because everyone writing POSIX network servers uses an event based, non-blocking approach. The old-style Java "billions of blocking threads" approach sucks horribly.
Disk write I/O is already buffered and disk read I/O can be prefetched into the buffer cache using functions like posix_fadvise. That leaves direct, unbuffered disk I/O as the only useful purpose for AIO.
Direct, unbuffered I/O is only really useful for transactional databases, and those tend to write their own threads or processes to manage their disk I/O.
So, at the end that leaves POSIX AIO in the position of not serving any useful purpose. Don't use it.
A libtorrent developer provides a report on this: http://blog.libtorrent.org/2012/10/asynchronous-disk-io/
There is aio_write, implemented in glibc: the first call of the aio_read or aio_write function spawns a number of user-mode threads; aio_write or aio_read posts requests to such a thread; the thread does pread/pwrite, and when it is finished the answer is posted back to the caller.
There is also 'real' AIO, supported at the kernel level (you need libaio for that; see the io_submit call, http://linux.die.net/man/2/io_submit). It also needs O_DIRECT (which may not be supported by all file systems, but the major ones do support it).
see here:
http://lse.sourceforge.net/io/aio.html
http://linux.die.net/man/2/io_submit
Difference between POSIX AIO and libaio on Linux?
