The state of Linux async IO?

I ask here since googling leads you on a merry trip around archives with no hint as to the current state. If you go by Google, async IO was all the rage from 2001 to 2003; by 2006 things like epoll and libaio were turning up; kevent appeared but seems to have disappeared since. And as far as I can tell, there is still no good way to mix completion-based and readiness-based signaling, asynchronous sendfile - is that even possible? - and everything else in a single-threaded event loop.
So please tell me I'm wrong and it's all rosy! - and, importantly, what APIs to use.
How does Linux compare to FreeBSD and other operating systems in this regard?

AIO as such is still somewhat limited and a real pain to get started with, but it kind of works for the most part, once you've dug through it.
It has some, in my opinion, serious bugs, but those are officially features. For example, when submitting a certain number of commands or a certain amount of data, your submitting thread will block. I don't remember the exact justification for this feature, but the reply I got back then was something like "yes, of course, the kernel has a limit on its queue size, that is as intended". Which is acceptable if you submit a few thousand requests... obviously there has to be a limit somewhere. It might make sense from a DoS point of view, too (otherwise a malicious program could force the kernel to run out of memory by posting a billion requests). But it is still something you can realistically hit with "normal" numbers (a hundred or so), and it will strike you unexpectedly, which is no good. Plus, if you submit only half a dozen requests and they are a bit larger (some megabytes of data), the same may happen, apparently because the kernel breaks them up into sub-requests. Which, again, kind of makes sense, but seeing how the docs don't tell you, one would expect that it makes no difference (apart from taking longer) whether you read 500 bytes or 50 megabytes of data.
Also, there seems to be no way of doing buffered AIO, at least on any of my Debian and Ubuntu systems (although I've seen other people complain about the exact opposite, i.e. unbuffered writes in fact going via the buffers). From what I can see on my systems, AIO is only really asynchronous with buffering turned off, which is a shame (and which is why I am presently using an ugly construct around memory mapping and a worker thread instead).
An important issue with anything asynchronous is being able to epoll_wait() on it, which matters if you are doing anything else apart from disk IO (such as receiving network traffic). Of course there is io_getevents, but it is less useful here, because you can only wait on that one thing with it.
In recent kernels there is support for eventfd. At first sight it appears useless, since it is not obvious how it helps.
However, to your rescue, there is the undocumented function io_set_eventfd, which lets you associate an AIO request with an eventfd, and an eventfd is epoll_wait()-able. You have to dig through the headers to find out about it, but it is certainly there, and it works just fine.
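To make that concrete, here is a hedged C sketch of the combination described above (kernel AIO via libaio, io_set_eventfd, and an epoll loop). The file name, sizes and lack of error handling are illustrative only; it assumes libaio headers are installed (link with -laio), and O_DIRECT requires an aligned buffer.

    #define _GNU_SOURCE                                    /* for O_DIRECT */
    #include <libaio.h>
    #include <sys/eventfd.h>
    #include <sys/epoll.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        io_context_t ctx = 0;
        if (io_setup(32, &ctx) < 0) { perror("io_setup"); return 1; }

        int efd = eventfd(0, 0);
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

        int fd = open("data.bin", O_RDONLY | O_DIRECT);    /* placeholder file */
        void *buf;
        posix_memalign(&buf, 4096, 4096);                  /* O_DIRECT wants alignment */

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);
        io_set_eventfd(&cb, efd);                          /* completion pings the eventfd */
        io_submit(ctx, 1, cbs);

        /* The same epoll loop could also be watching sockets, timers, etc. */
        struct epoll_event out;
        epoll_wait(epfd, &out, 1, -1);

        uint64_t ncompleted;
        read(efd, &ncompleted, sizeof ncompleted);         /* number of finished requests */
        struct io_event events[1];
        io_getevents(ctx, 1, 1, events, NULL);
        printf("AIO read returned %ld bytes\n", (long)events[0].res);

        io_destroy(ctx);
        return 0;
    }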

Asynchronous disc IO is alive and kicking... it is actually supported and works reasonably well now, but it has significant limitations (though with enough functionality that some of the major users can make good use of it - for example, MySQL's InnoDB does in the latest version).
Asynchronous disc IO is the ability to invoke disc IO operations in a non-blocking manner (in a single thread) and wait for them to complete. This works fine, http://lse.sourceforge.net/io/aio.html has more info.
AIO does enough for a typical application (database server) to be able to use it. AIO is a good alternative to either creating lots of threads doing synchronous IO, or using scatter/gather in the preadv family of system calls which now exist.
It's possible to do a "shopping list" synchronous IO job using the newish preadv call, where the kernel fills a whole list of buffers from a file in a single system call. This is ok as long as you have only one file to read. (NB: the equivalent write function, pwritev, also exists.)
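For illustration, a minimal preadv() sketch (the file name and buffer sizes are arbitrary, and error handling is pared down):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);        /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        char header[64], body[4096];
        struct iovec iov[2] = {
            { .iov_base = header, .iov_len = sizeof header },
            { .iov_base = body,   .iov_len = sizeof body   },
        };

        /* One syscall fills both buffers, reading from offset 0 without
           moving the file position (unlike readv()). */
        ssize_t n = preadv(fd, iov, 2, 0);
        if (n < 0) { perror("preadv"); return 1; }

        printf("read %zd bytes across 2 buffers\n", n);
        close(fd);
        return 0;
    }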
poll, epoll, etc. are just fancy ways of doing select() that suffer from fewer limitations and scalability problems. They may not be able to be mixed with disc AIO easily, but in a real-world application you can probably get around this fairly trivially by using threads (some database servers tend to do these kinds of operations in separate threads anyway). poll() is good; epoll is better for large numbers of file descriptors. select() is ok too for small numbers of file descriptors (or, specifically, low file descriptor numbers).
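A sketch of that thread-based workaround, with hypothetical names and minimal error handling: a worker does ordinary blocking disk IO and signals an eventfd that the main epoll loop watches alongside everything else (compile with -pthread).

    #include <pthread.h>
    #include <sys/eventfd.h>
    #include <sys/epoll.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static int efd;                                    /* completion notification fd */

    static void *disk_worker(void *arg)
    {
        char buf[4096];
        int fd = open((const char *)arg, O_RDONLY);
        ssize_t n = pread(fd, buf, sizeof buf, 0);     /* ordinary blocking read */
        close(fd);

        uint64_t one = 1;
        write(efd, &one, sizeof one);                  /* wake the event loop */
        return (void *)(long)n;
    }

    int main(void)
    {
        efd = eventfd(0, 0);
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

        pthread_t t;
        pthread_create(&t, NULL, disk_worker, "data.bin");   /* placeholder file */

        struct epoll_event out;
        epoll_wait(epfd, &out, 1, -1);                 /* sockets could be polled here too */

        uint64_t n;
        read(efd, &n, sizeof n);
        void *res;
        pthread_join(t, &res);
        printf("worker read %ld bytes\n", (long)res);
        return 0;
    }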

(At the tail end of 2019 there's a glimmer of hope almost a decade after the original question was asked)
If you have a 5.1 or later Linux kernel you can use the io_uring interface, which will hopefully usher in a better asynchronous I/O future for Linux (see one of the answers to the Stack Overflow question "Is there really no asynchronous block I/O on Linux?" for the benefits io_uring provides over KAIO). Hopefully this will allow Linux to provide stiff competition to FreeBSD's AIO without huge contortions!

Most of what I've learned about asynchronous I/O in Linux was by working on the Lighttpd source. It is a single-threaded web server that handles many simultaneous connections, using what it believes is the best of whatever asynchronous I/O mechanisms are available on the running system. Take a look at the source; it supports Linux, BSD, and (I think) a few other operating systems.

Related

How is the io_uring implementation different from AIO?

Apparently, Linux already had an Async-IO (AIO) API. I believe it is not fully asynchronous. So what was the issue with AIO, and how does io_uring overcome it?
PS: I tried to read https://kernel.dk/io_uring.pdf but couldn't fully understand as I am out of touch with the C language.
Apparently, Linux already had an Async-IO (AIO) API. I believe it is not fully asynchronous. So what was the issue with AIO?
If you're extra careful and match all its constraints, the "old" Linux AIO interface will behave asynchronously. However, if you "break" ANY of the (hidden) rules submission can suddenly (and silently) behave in a synchronous fashion (i.e. submission blocks non-deterministically and for inconveniently long periods of time). Some of the many "rules" are given in answers to asynchronous IO io_submit latency in Ubuntu Linux (the overall issues are also listed in section 1.0 of the io_uring document you linked).
How does io_uring overcome it?
It is a radically different interface (see this answer to the Stack Overflow question "Is there really no asynchronous block I/O on Linux?") which is harder to get wrong.
It has a workqueue/thread pool mechanism which it will punt requests to when it is aware that blocking will take place before the result of submission can be returned (and thus it is able to return control back to the caller). This allows it to retain asynchrony in more (hopefully all!) submission cases.
It has an optional privileged mode (IORING_SETUP_SQPOLL) where you don't even have to make syscalls to submit/retrieve results. If you're "just" manipulating memory contents it's going to be hard to be blocked on a call you never made!
How does io_uring work internally?
There are two ring buffers: the first ring is used to queue submissions, and when the work has been completed the "results" are announced via the second ring buffer (which contains completions). It is hard to give much more than a very high-level view here, but if you are uncomfortable with things like C structs and C function interfaces you may nonetheless enjoy the video of Jens Axboe presenting io_uring, and find the explanations in https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/ and https://mattermost.com/blog/iouring-and-go/ more accessible.
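As a rough illustration of that submission/completion flow, here is a minimal liburing sketch (queue one read, then reap its completion). It assumes liburing is installed (link with -luring), a 5.1+ kernel, and a placeholder file name; error handling is omitted for brevity.

    #include <liburing.h>
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);              /* one 8-entry SQ/CQ pair */

        int fd = open("data.bin", O_RDONLY);           /* placeholder file */
        char buf[4096];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };

        /* Submission ring: grab an SQE, describe the read, submit. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_readv(sqe, fd, &iov, 1, 0);
        io_uring_sqe_set_data(sqe, buf);               /* user tag to match the CQE */
        io_uring_submit(&ring);

        /* Completion ring: wait for the CQE and read the result. */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);                 /* hand the CQE slot back */

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }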
io_uring's advantages over Linux AIO don't stop at better asynchrony though! See the aforementioned "Is there really no asynchronous block I/O on Linux?" answer for a list of other benefits...

Performance implications of using inter-process communication (IPC)

What type of usage is IPC intended for, and is it OK to send larger chunks of JSON (hundreds of characters) between processes using IPC? Should I be trying to send as tiny a message as possible using IPC, or would the performance gains from reducing message size not be worth the effort?
What type of usage is IPC intended for, and is it OK to send larger chunks of JSON (hundreds of characters) between processes using IPC?
At its core, IPC is what it says on the tin: it's a tool to use when you need to communicate information between processes, whatever that may be. The topic is very broad, and technically includes allocating shared memory and doing the communication manually, but given the tone of the question, and the tags, I'm assuming you're talking about the OS-provided facilities.
Wikipedia does a pretty good job discussing how IPC is used, and I don't think I can do much better, so I'll concentrate on the second question.
Should I be trying to send as tiny a message as possible using IPC, or would the performance gains from reducing message size not be worth the effort?
This smells a bit like a micro-optimization. I can't say definitively, because I'm not privy to the source code at Microsoft and Apple, and I really don't want to dig through the Linux kernel's implementation of IPC, but here are a couple of points:
IPC is a common operation, so OS designers are likely to optimize it for efficiency. There are teams of engineers that have considered the problem and figured out how to make this fast.
The bottleneck in communication across processes/threads is almost always synchronization. Delays are bad, but race conditions and deadlocks are worse. There are, however, lots of creative ways that OS designers can speed up the procedure, since the system controls the process scheduler and memory manager.
There are lots of ways to make the data transfer itself fast. For the OS, if the data needs to cross process boundaries, then there is some copying that may need to take place, but the OS copies memory all over the place all the time. Think about a command line utility, like netstat. When that executable is run, memory needs to be allocated, the process needs to be loaded from disk, and any address fixing that the OS needs to do is done, before the process can even start. This is done so quickly that you hardly even notice. On Windows netstat is about 40k, and it loads into memory almost instantly. (Notepad, another fast loader, is 10 times that size, but it still launches in a tiny amount of time.)
The big exception to #2 above is if you're talking about IPC between processes that aren't on the same computer. (Think Windows RPC) Then you're really bound by the speed of the networking/communication stack, but at that point a few kb here or there isn't going to make a whole lot of difference. (You could consider AJAX to be a form of IPC where the 'processes' are the server and your browser. Now consider how fast Google Docs operates.)
If the IPC is between processes on the same system, I don't think that it's worth a ton of effort shaving bytes from your message. Make your message easy to debug.
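For a sense of scale, here is an illustrative, entirely made-up example: a few hundred bytes of JSON sent over a Unix socketpair between a parent and a child process. A payload of this size is trivial for local IPC, so readability wins.

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int sv[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

        if (fork() == 0) {                            /* child: the "other" process */
            close(sv[0]);
            char buf[1024];
            ssize_t n = read(sv[1], buf, sizeof buf - 1);
            buf[n > 0 ? n : 0] = '\0';
            printf("child got %zd bytes: %s\n", n, buf);
            _exit(0);
        }

        close(sv[1]);
        const char *json =
            "{\"type\":\"status\",\"items\":[1,2,3],"
            "\"note\":\"a few hundred readable bytes cost almost nothing locally\"}";
        write(sv[0], json, strlen(json));             /* one small, debuggable message */
        close(sv[0]);
        wait(NULL);
        return 0;
    }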
If the communication is happening between processes on different machines, then you may have something to think about. But having spent a lot of time debugging issues that would have been simple with a better data format, I'd say a few dozen extra milliseconds of transit time isn't worth making the data harder to parse/debug. Remember the three rules of optimization1:
Don't.
Don't... yet. (For experts)
Profile before you do.
1 The first two rules are usually attributed to Michael A. Jackson (the computer scientist, not the singer).

Why not to use massively multi-threaded code?

Asynchronous and other event-based programming paradigms seem to be spreading like wildfire these days, with the popularity of node.js, Python 3.5's recent async improvements, and whatnot.
Not that I particularly mind this or that I haven't already been doing it for a long time myself, but I've been trying to wrap my head around the real reasons why. Searching around for the evils of synchronous programming consistently seems to net the preconceived notion that "you can't have a thread for each request", without really qualifying that statement.
Why not, though? A thread might not be the cheapest resource one could think of, but it hardly seems "expensive". On 64-bit machines, we have more than enough virtual address space to handle all the threads we could ever want, and, unless your call chains are fairly deep, each thread shouldn't necessarily have to require more physical RAM than a single page* for stack plus whatever little overhead the kernel and libc need. As for performance, my own casual testing shows that Linux can handle well over 100,000 thread creations and tear-downs per second on a single CPU, which can hardly be a bottleneck.
That being said, it's not like I think event-based programming is all just a ruse, seeing as how it seems to have been the primary driver allowing such HTTP servers as lighttpd/nginx/whatever to overtake Apache in highly concurrent performance**. However, I've been trying to find some kind of actual inquiry into the reason why massively-multithreaded programs are slower without being able to find any.
So then, why is this?
*My testing seems to show that each thread actually requires two pages. Perhaps there's some dirtying of the TLS going on or something, but nevertheless it doesn't seem to change a lot.
**Though it should also be said that Apache, at that time, was using process-based concurrency rather than thread-based, which obviously makes a lot of difference.
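For reference, a rough sketch of the kind of casual thread-creation measurement mentioned above: create and join a no-op thread in a loop and report the rate. Results vary widely by machine and kernel, so treat it as indicative only (compile with -pthread).

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    static void *noop(void *arg) { return arg; }

    int main(void)
    {
        const int N = 100000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        for (int i = 0; i < N; i++) {
            pthread_t t;
            pthread_create(&t, NULL, noop, NULL);
            pthread_join(t, NULL);               /* tear-down included in the cost */
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d create+join pairs in %.2fs (%.0f per second)\n", N, secs, N / secs);
        return 0;
    }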
If you have a thread for each request, then you can't do a little bit of work for each of 100 requests without switching contexts 100 times. While many things computers have to do have gotten faster over time, context switching is still expensive because it blows out the caches and modern systems are more dependent on these caches than ever.
That is a loaded question. I've heard different responses over time because I've had that conversation so many times before with different developers. Mainly, my gut feeling is that most developers hate it because it is harder to write multi-threaded code and sometimes it is easy to shoot yourself in the foot unnecessarily. That said, each situation is different. Some programs lend themselves to multi-threading rather nicely, like a web server. Each thread can take a request and essentially process it without needing many outside resources. It has a set of procedures to apply to a request to decide how to process it. It decides what to do with it and passes it off. So it is fairly independent and can operate in its own world fairly safely. So it is a nice fit for a thread.
Other situations might not lend themselves so nicely. Especially when you need shared resources. Things can get hairy fast. Even if you do what seems like perfect context switching, you might still get race conditions. Then the nightmares begin. This is seen quite often in huge monolithic applications where they opted to use threads and open the gates of hell upon their dev team.
In the end, I think we will probably not see more threading in day-to-day development, but we will move to a more event-driven world. We are going down that route in web development with the emergence of micro-services. So there will probably be more threading used, but not in a way that is visible to the developer using the framework. It will just be a part of the framework. At least that is my opinion.
Once the number of ready or running threads (versus threads pending on events) and/or processes goes beyond the number of cores, then those threads and/or processes are competing for the same cores, same cache, and the same memory bus.
Unless there are a massive number of simultaneous events to pend on, I don't see the purpose of massively multi-threaded code, except on supercomputers with a large number of processors and cores, and that code is usually massively multi-processing, with multiple memory buses.

How is Wait() functionality implemented?

How are wait functions (e.g. WaitForSingleObject) implemented internally in Windows or any OS?
How is it any different from a spin lock?
Does the CPU/hardware provide special functionality to do this?
Hazy view of What's Going On follows... Focusing on IO mainly.
Unix / Linux / Posix
The Unix equivalents - select(), epoll(), and similar - have been implemented in various ways. In the early days the implementations were rubbish, and amounted to little more than busy polling loops that used up all your CPU time. Nowadays it's much better, and takes no CPU time whilst blocked.
They can do this I think because the device driver model for devices like Ethernet, serial and so forth has been designed to support the select() family of functions. Specifically the model must allow the kernel to tell the devices to raise an interrupt when something has happened. The kernel can then decide whether or not that will result in a select() unblocking, etc etc. The result is efficient process blocking.
Windows
In Windows the WaitFor..., when applied to asynchronous IO, is completely different. You actually have to start a thread reading from an IO device, and when that read completes (note, not starts) you have that thread return something that wakes up the WaitFor. That gets dressed up in object.beginread(), etc, but they all boil down to that underneath.
This means you can't replicate the select() functionality in Windows for serial, pipes, etc. But there is a select function call for sockets. Weird.
To me this suggests that the whole IO architecture and device driver model of the Windows kernel can drive devices only by asking them to perform an operation and blocking until the device has completed it. There would seem to be no truly asynchronous way for the device to notify the kernel of events, and the best that can be achieved is to have a separate thread doing the synchronous operation for you. I've no idea how they've done select for sockets, but I have my suspicions.
CYGWIN, Unix on Windows
When the cygwin guys came to implement their select() routine on Windows they were horrified to discover that it was impossible to implement for anything other than sockets. What they did was, for each file descriptor passed to select(), spawn a thread. This would poll the device, pipe, or whatever, waiting for the available data count to be non-zero, etc. That thread would then notify the thread that is actually calling select() that something had happened. This is very reminiscent of select() implementations from the dark days of Unix, and is massively inefficient. But it does work.
I would bet a whole 5 new pence that that's how MS did select for sockets too.
My Experience Thus Far
Windows' WaitFors... are fine for operations that are guaranteed to complete or proceed in nice fixed stages, but are very unpleasant for operations that aren't (like IO). Cancelling an asynchronous IO operation is deeply unpleasant. The only way I've found of doing it is to close the device, socket, pipe, etc., which is not always what you want to do.
Attempt to Answer the Question
The hardware's interrupt system supports the implementation of select() because it's a way for the devices to notify the CPU that something has happened without the CPU having to poll / spin on a register in the device.
Unix / Linux uses that interrupt system to provide select() / epoll() functionality, and also incorporates purely internal 'devices' (pipes, files, etc) into that functionality.
Windows' equivalent facility, WaitForMultipleObjects() fundamentally does not incorporate IO devices of any sort, which is why you have to have a separate thread doing the IO for you whilst you're waiting for that thread to complete. The interrupt system on the hardware is (I'm guessing) used solely to tell the device drivers when a read or write operation is complete. The exception is the select() function call in Windows which operates only on sockets, not anything else.
A big clue to the architectural differences between Unix/Linux and Windows is that a PC can run either, but you get a proper IO-centric select() only on Unix/Linux.
Guesswork
I'm guessing that the reason Windows has never done a select() properly is that the early device drivers for Windows could in no way support it, a bit like the early days of Linux.
However, Windows became very popular quite early on, and loads of device drivers got written against that (flawed?) device driver standard.
If at any point MS had thought "perhaps we'd better improve on that" they would have faced the problem of getting everyone to rewrite their device drivers, a massive undertaking. So they decided not to, and instead implemented the separate IO thread / WaitFor... model instead. This got promoted by MS as being somehow superior to the Unix way of doing things. And now that Windows has been that way for so long I'm guessing that there's no one in MS who perceives that Things Could Be Improved.
==EDIT==
I have since stumbled across Named Pipes - Asynchronous Peeking. This is fascinating because it would seem (I'm very glad to say) to debunk pretty much everything I'd thought about Windows and IO. The article applies to pipes, though presumably it would also apply to any IO stream.
It seems to hinge on starting an asynchronous read operation to read zero bytes. The read will not return until there are some bytes available, but none of them are read from the stream. You can therefore use something like WaitForMultipleObjects() to wait for more than one such asynchronous operation to complete.
As the comment below the accepted answer recognises, this is very non-obvious in all of the Microsoft documentation that I've ever read. I wonder whether it is an unintended but useful behaviour in the OS. I've been ploughing through Windows Internals by Mark Russinovich, but I've not found anything yet.
I've not yet had a chance to experiment with this in any way, but if it does work then it means that one can implement something equivalent to Unix's select() on Windows, so it must be supported all the way down to the device driver level and interrupts. Hence the extensive strikeouts above...
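A sketch of the zero-byte overlapped read idea, with a hypothetical pipe name and error handling omitted; it assumes the handle was opened with FILE_FLAG_OVERLAPPED. The zero-length ReadFile only signals its event once data is available, so WaitForMultipleObjects can watch several such streams (plus other events) at once.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE pipe = CreateFileA("\\\\.\\pipe\\example",    /* placeholder pipe name */
                                  GENERIC_READ, 0, NULL, OPEN_EXISTING,
                                  FILE_FLAG_OVERLAPPED, NULL);

        OVERLAPPED ov = {0};
        ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);    /* manual-reset event */

        /* Zero-byte read: completes only when at least one byte could be read. */
        ReadFile(pipe, NULL, 0, NULL, &ov);                  /* usually ERROR_IO_PENDING */

        HANDLE handles[1] = { ov.hEvent };                   /* add more handles here */
        DWORD which = WaitForMultipleObjects(1, handles, FALSE, INFINITE);
        if (which == WAIT_OBJECT_0)
            printf("data is available; now issue a real ReadFile to fetch it\n");

        CloseHandle(ov.hEvent);
        CloseHandle(pipe);
        return 0;
    }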

buffered asynchronous file I/O on linux

I am looking for the most efficient way to do asynchronous file I/O on linux.
The POSIX glibc implementation uses threads in userland.
The native aio kernel api only works with unbuffered operations, patches for the kernel to add support for buffered operations exist, but those are >3 years old and no one seems to care about integrating them into the mainline.
I found plenty of other ideas, concepts and patches that would allow asynchronous I/O, though most of them in articles that are also >3 years old. What of all this is really available in today's kernel? I've read about syslets, acalls, stuff with kernel threads and more things I don't even remember right now.
What is the most efficient way to do buffered asynchronous file input/output in today's kernel?
Unless you want to write your own IO thread pool, the glibc implementation is an acceptable solution. It actually works surprisingly well for something that runs entirely in userland.
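For reference, a minimal glibc POSIX AIO sketch (the userland thread-pool implementation discussed here); the file name is a placeholder, and on older glibc you may need to link with -lrt.

    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);     /* placeholder file */
        char buf[4096];

        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        aio_read(&cb);                           /* glibc queues this to a helper thread */

        /* Do other work here, then wait for completion. */
        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);

        if (aio_error(&cb) == 0)
            printf("aio_read completed with %zd bytes\n", aio_return(&cb));
        close(fd);
        return 0;
    }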
The kernel implementation does not work with buffered IO at all in my experience (though I've seen other people say the opposite!). Which is fine if you want to read huge amounts of data via DMA, but of course it sucks big time if you plan to take advantage of the buffer cache.
Also note that the kernel AIO calls may actually block. There is a command buffer of limited size, and large reads are broken up into several smaller ones. Once the queue is full, asynchronous commands run synchronously. Surprise. I ran into this problem a year or two ago and could not find an explanation. Asking around gave me the "yeah, of course, that's how it works" answer.
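One defensive pattern, sketched under the assumption that ctx and the iocb array were set up with the usual libaio calls, is to treat partial submission and EAGAIN as normal rather than assuming io_submit() queued everything:

    #include <errno.h>
    #include <libaio.h>

    /* Submit iocbs[0..nr), retrying the tail that the kernel did not accept. */
    static int submit_all(io_context_t ctx, struct iocb **iocbs, long nr)
    {
        long done = 0;
        while (done < nr) {
            int ret = io_submit(ctx, nr - done, iocbs + done);
            if (ret > 0) {
                done += ret;                     /* partial submission is normal */
            } else if (ret == -EAGAIN || ret == 0) {
                /* Queue full: reap some completions (e.g. io_getevents()) or back
                   off before retrying; a busy retry is shown only for brevity. */
                continue;
            } else {
                return ret;                      /* real error (negative errno) */
            }
        }
        return 0;
    }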
From what I've understood, the "official" interest in supporting buffered AIO is not terribly great either, even though several working solutions seem to have been available for years. Some of the arguments that I've read were along the lines of "you don't want to use the buffers anyway", "nobody needs that", and "most people don't even use epoll yet". So, well... meh.
Being able to get an epoll signalled by a completed async operation was another issue until recently, but in the meantime this works really fine via eventfd.
Note that the glibc implementation will actually spawn threads on demand inside __aio_enqueue_request. It is probably no big deal, since spawning threads is not that terribly expensive any more, but one should be aware of it: if your mental model of starting an asynchronous operation is "returns immediately", that assumption may not hold, because the call may be spawning some threads first.
EDIT:
As a sidenote, under Windows there exists a very similar situation to the one in the glibc AIO implementation where the "returns immediately" assumption of queuing an asynchronous operation is not true.
If all the data that you wanted to read is in the buffer cache, Windows will decide to run the request synchronously instead, because it will finish immediately anyway. This is well documented, and admittedly sounds great, too. Except that when there are a few megabytes to copy, or when another thread takes page faults or does IO concurrently (thus competing for the lock), "immediately" can be a surprisingly long time -- I've seen "immediate" times of 2-5 milliseconds. Which is no problem in most situations, but, for example, under the constraint of a 16.66 ms frame time you probably don't want to risk blocking for 5 ms at random times. Thus, the naive assumption of "I can do async IO from my render thread, no problem, because async doesn't block" is flawed.
The material seems old -- well, it is old -- because it has been around for a long time and, while by no means trivial, is well understood. A solution you can lift is published in W. Richard Stevens's superb and unparalleled book (read "bible"). The book is the rare treasure that is clear, concise, and complete: every page gives real and immediate value:
Advanced Programming in the UNIX Environment
Two other such, also by Stevens, are the first two volumes of his Unix Network Programming collection:
Volume 1: The Sockets Networking API (with Fenner and Rudoff) and
Volume 2: Interprocess Communications
I can't imagine being without these three fundamental books; I'm dumbstruck when I find someone who hasn't heard of them.
Still more of Stevens's books, just as precious:
TCP/IP Illustrated, Vol. 1: The Protocols
(2021) If your Linux kernel is new enough (at least 5.1 but newer kernels bring improvements) then io_uring will be "the most efficient way to do asynchronous file input/output" *. That applies to both buffered and direct I/O!
In the Kernel Recipes 2019 video "Faster IO through io_uring", io_uring author Jens Axboe demonstrates buffered I/O via io_uring finishing in almost half the time of synchronous buffered I/O. As @Marenz noted, unless you want userspace threads, io_uring is the only way to do buffered asynchronous I/O, because Linux AIO (aka libaio/io_submit()) doesn't have the ability to always do buffered asynchronous I/O...
Additionally, in the article "Modern storage is plenty fast." Glauber Costa demonstrates how careful use of io_uring with asynchronous direct I/O can improve throughput compared to using io_uring for asynchronous buffered I/O on an Optane device. It required Glauber to have a userspace readahead implementation (without which buffered I/O was a clear winner) but the improvement was impressive.
* The context of this answer is clearly in relation to storage (after all the word buffered was mentioned). For network I/O io_uring has steadily improved in later kernels to the extent that it can trade blows with things like epoll() and if it continues it will one day be either equal or better in all cases.
I don't think the Linux kernel implementation of asynchronous file I/O is really usable unless you also use O_DIRECT, sorry.
There's more information about the current state of the world here: https://github.com/littledan/linux-aio. It was updated in 2012 by someone who used to work at Google.
