How is the io_uring implementation different from AIO? - linux

Apparently, Linux already had an Async-IO (AIO) API. I believe it is not fully asynchronous. So what was the issue with AIO, and how does io_uring overcome it?
PS: I tried to read https://kernel.dk/io_uring.pdf but couldn't fully understand it as I am out of touch with the C language.

Apparently, Linux already had an Async-IO (AIO) API. I believe it is not fully asynchronous. So what was the issue with AIO?
If you're extra careful and match all its constraints, the "old" Linux AIO interface will behave asynchronously. However, if you "break" ANY of the (hidden) rules, submission can suddenly (and silently) behave in a synchronous fashion (i.e. submission blocks non-deterministically and for inconveniently long periods of time). Some of the many "rules" are given in answers to "asynchronous IO io_submit latency in Ubuntu Linux" (the overall issues are also listed in section 1.0 of the io_uring document you linked).
How does io_uring overcome it?
It is a radically different interface (see this answer to "Is there really no asynchronous block I/O on Linux?") which is harder to get wrong.
It has a workqueue/thread pool mechanism which it will punt requests to when it is aware that blocking will take place before the result of submission can be returned (and thus it is able to return control back to the caller). This allows it to retain asynchrony in more (hopefully all!) submission cases.
It has an optional privileged mode (IORING_SETUP_SQPOLL) where you don't even have to make syscalls to submit/retrieve results. If you're "just" manipulating memory contents it's going to be hard to be blocked on a call you never made!
How does io_uring work internally?
There are two ring buffers: the first ring is used to queue submissions, and when the work has been completed the "results" are announced via the second ring buffer (which contains completions). It's hard to give you more than a very high-level view if you're uncomfortable with things like C structs and C function interfaces, but you may nonetheless enjoy this video of Jens presenting io_uring, and you may find the explanations in https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/ and https://mattermost.com/blog/iouring-and-go/ more accessible.
io_uring's advantages over Linux AIO don't stop at better asynchrony though! See the aforementioned link for "Is there really no asynchronous block I/O on Linux?" for a list of other benefits...

Related

Multi threading analysis techniques

Does anyone know of any analysis techniques that can be used to design/debug thread locking and unlocking sequences? Essentially a technique (like a truth table) I can use to prove that my sequence of locks won't deadlock.
This is not the sort of problem where programming by trial and error works well.
My particular problem is a read write lock - but I ask this in the general sense. I believe it would be a useful technique to learn if one exists.
I have tried a causal graph in which I have boxes and arrows that I can use to follow the flow of control, and that has solved 80% of my problem. But I am still getting occasional deadlocks under stress testing when one thread sneaks through the "gap between instructions", if that makes any sense.
To summarize; what I need is some way of representing the problem so that I can formally analyze the overlap of mutex locks.
Bad news I'm afraid. There are no techniques that I know of that can "prove" correct a system that uses locks to control access to shared memory. By "prove" I mean that you cannot demonstrate analytically that a program won't deadlock, livelock, etc.
The problem is that threads run asynchronously. As soon as you start having a sensible number of threads and shared resources, the number of possible sequences of events (e.g. locking/unlocking shared resources) is astronomically high and you cannot model / analyse each and every one of them.
For this reason Communicating Sequential Processes (CSP) was developed by Tony Hoare, way back in 1978. It is a development of the Actor model, which itself goes a long way towards resolving the problem.
Actor and CSP
Briefly, in the Actor model data is not communicated via shared memory with a lock. Instead a copy is sent down a communications channel of some sort (e.g. a socket, or pipe) between two threads. This means that you're never locking memory. In effect all memory is private to threads, with copies of it being sent as and when required to other threads. It's a very 'object orientated' thing; private data (thread-owned memory), public interface (messages emitted and received on communications channels). It's also very scalable - pipes can become sockets, threads can become processes on other computers.
The CSP model is just like that, except that the communications channel won't accept a message unless the receiving end is ready to read it.
This addition is crucial - it means that a system design can be analysed algebraically. Indeed Tony Hoare formulated a process calculus for CSP. The Wikipedia page on CSP cites the use of this to prove an eCommerce system's design.
So if one is developing a strict CSP system, it is possible to prove analytically that it cannot deadlock, etc.
Real World Experience
I've done many a CSP (or CSP-ish) system, and it's always been good. Instead of doing the maths I've used intuition to help me avoid problems. In effect CSP ensures that if I've gone and built a system that can deadlock, it will deadlock every time. So at least I find it in development, not 2 years later when some network link gets a bit busier than normal.
Real World Options
For Actor model programming there are a lot of options: ZeroMQ, nanomsg, Microsoft's .NET Dataflow library.
They're all pretty good, and with care you can make a system that'll be pretty good. I like ZeroMQ and nanomsg a lot - they make it trivial to split a bunch of threads up into separate processes on separate computers and you've not changed the architecture at all. If absolute performance isn't essential coupling these two up with, for example, Google Protocol Buffers makes for a really tidy system with huge options for incorporating different OSes, languages and systems into your design.
I suspect that MS's Dataflow library for .NET moves ownership of references to the data around instead of copying it. That ought to make it pretty performant (though I've not actually tried it to see).
CSP is a bit harder to come by. You can nearly make ZeroMQ and DataFlow into CSP by setting message buffer lengths. Unfortunately you cannot set the buffer length to zero (which is what would make it CSP). MS's documentation even talks about the benefits to system robustness achieved by setting the queue length to 1.
You can synthesize CSP on top of Actor by having flows of synchronisation messages across the links. This is annoying to have to implement.
I've quite often spun up my own comms framework to get a CSP environment.
There are libraries for Java too, I think, though I don't know how actively developed they are.
However as you have existing code written around locked shared memory it'll be a tough job to adapt your code. So....
Kernel Shark
If you're on Linux and your kernel has FTRACE compiled in you can use Kernel Shark to see what has happened in your system. Similarly with DTRACE on Solaris, WindView on VxWorks, TATL on MCOS.
What you do is run your system until it stops, and then very quickly preserve the FTRACE log (it gets overwritten in a circular buffer by the OS). You can then see graphically what has happened (turn on Kernel Shark's process view), which may give clues as to what did what and when.
This helps you diagnose your application's deadlock, which may lead you towards getting things right, but ultimately you can never prove that it is correct this way. That doesn't stop you having a Eureka moment where you now know in your bones that you've got it right.
I know of no equivalent of FTRACE / Kernel shark for Windows.
For a broad range of multithreading tasks, we can draw a graph which reflects the order of locking of resources. If that graph has cycles, deadlock is quite possible. If there are no cycles, deadlock can never occur.
For example, consider the Dining Philosophers task. If each philosopher takes the left fork first, and then the right fork, then the graph of the order of locking is a ring connecting all the forks. Deadlock is very possible in this situation. However, if one of the philosophers changes his order, the ring becomes a line and deadlock can never occur. If all philosophers change their order and all take the right fork first, the graph again forms a ring and deadlock is again possible.

Why do we need both threading and asynchronous calls

My understanding is that threads exist as a way of doing several things in parallel that share the same address space, but each has its individual stack. Asynchronous programming is basically a way of using fewer threads. I don't understand why it's undesirable to just have blocking calls and a separate thread for each blocked command.
For example, suppose I want to scrape a large part of the web. A presumably uncontroversial implementation might be to have a large number of asynchronous loops. Each loop would ask for a webpage, wait to avoid overloading the remote server, then ask for another webpage from the same website until done. The loops would then be executed on a much smaller number of threads (which is fine because they mostly wait). So to restate the question, I don't see why it's any cheaper to e.g. maintain a threadpool within the language runtime than it would be to just have one (mostly blocked) OS thread per loop and let the operating system deal with the complexity of scheduling the work? After all, if piling two different schedulers on top of each other is so good, it can still be implemented that way in the OS :-)
It seems obvious the answer is something like "threads are expensive". However, a thread just needs to keep track of where it has got to before it was interrupted the last time. This is pretty much exactly what an asynchronous command needs to know before it blocks (perhaps representing what happens next by storing a callback). I suppose one difference is that a blocking asynchronous command does so at a well defined point whereas a thread can be interrupted anywhere. If there really is a substantial difference in terms of the cost of keeping the state, where does it come from? I doubt it's the cost of the stack since that wastes at most a 4KB page, so it's a non-issue even for 1000s of blocked threads.
Many thanks, and sorry if the question is very basic. It might be I just cannot figure out the right incantation to type into Google.
Threads consume memory, as they need to have their state preserved even if they aren't doing anything. If you make an asynchronous call on a single thread, it's literally (aside from registering that the call was made somewhere) not consuming any resources until it needs to be resolved, because if it isn't actively being processed you don't care about it.
If the architecture of your application is written in a way that the resources it needs scale linearly (or worse) with the number of users / traffic you receive, then that is going to be a problem. You can watch this talk about node.js if you want to watch someone talk at length about this.
https://www.youtube.com/watch?v=ztspvPYybIY

How is Wait() functionality Implemented

How are wait functions (e.g. WaitForSingleObject) implemented internally in Windows or any OS?
How is it any different from a spin lock?
Does the CPU/hardware provide special functionality to do this?
Hazy view of What's Going On follows... Focusing on IO mainly.
Unix / Linux / Posix
The Unix equivalents, select(), epoll() and similar, have been implemented in various ways. In the early days the implementations were rubbish, and amounted to little more than busy polling loops that used up all your CPU time. Nowadays they're much better, and take no CPU time whilst blocked.
They can do this I think because the device driver model for devices like Ethernet, serial and so forth has been designed to support the select() family of functions. Specifically the model must allow the kernel to tell the devices to raise an interrupt when something has happened. The kernel can then decide whether or not that will result in a select() unblocking, etc etc. The result is efficient process blocking.
Windows
In Windows the WaitFor... functions, when applied to asynchronous IO, are completely different. You actually have to start a thread reading from an IO device, and when that read completes (note, not starts) you have that thread return something that wakes up the WaitFor. That gets dressed up in object.beginread(), etc, but it all boils down to that underneath.
This means you can't replicate the select() functionality in Windows for serial, pipes, etc. But there is a select function call for sockets. Weird.
To me this suggests that the whole IO architecture and device driver model of the Windows kernel can drive devices only by asking them to perform an operation and blocking until the device has completed it. There would seem to be no truly asynchronous way for the device to notify the kernel of events, and the best that can be achieved is to have a separate thread doing the synchronous operation for you. I've no idea how they've done select for sockets, but I have my suspicions.
CYGWIN, Unix on Windows
When the cygwin guys came to implement their select() routine on Windows they were horrified to discover that it was impossible to implement for anything other than sockets. What they did was, for each file descriptor passed to select(), spawn a thread. This would poll the device, pipe, whatever, waiting for the available data count to become non-zero, etc. That thread would then notify the thread that is actually calling select() that something had happened. This is very reminiscent of select() implementations from the dark days of Unix, and is massively inefficient. But it does work.
I would bet a whole 5 new pence that that's how MS did select for sockets too.
My Experience Thus Far
Windows' WaitFor... functions are fine for operations that are guaranteed to complete or proceed in nice fixed stages, but are very unpleasant for operations that aren't (like IO). Cancelling an asynchronous IO operation is deeply unpleasant. The only way I've found of doing it is to close the device, socket, pipe, etc, which is not always what you want to do.
Attempt to Answer the Question
The hardware's interrupt system supports the implementation of select() because it's a way for the devices to notify the CPU that something has happened without the CPU having to poll / spin on a register in the device.
Unix / Linux uses that interrupt system to provide select() / epoll() functionality, and also incorporates purely internal 'devices' (pipes, files, etc) into that functionality.
Windows' equivalent facility, WaitForMultipleObjects() fundamentally does not incorporate IO devices of any sort, which is why you have to have a separate thread doing the IO for you whilst you're waiting for that thread to complete. The interrupt system on the hardware is (I'm guessing) used solely to tell the device drivers when a read or write operation is complete. The exception is the select() function call in Windows which operates only on sockets, not anything else.
A big clue to the architectural differences between Unix/Linux and Windows is that a PC can run either, but you get a proper IO-centric select() only on Unix/Linux.
Guesswork
I'm guessing that the reason Windows has never done a select() properly is that the early device drivers for Windows could in no way support it, a bit like the early days of Linux.
However, Windows became very popular quite early on, and loads of device drivers got written against that (flawed?) device driver standard.
If at any point MS had thought "perhaps we'd better improve on that" they would have faced the problem of getting everyone to rewrite their device drivers, a massive undertaking. So they decided not to, and instead implemented the separate IO thread / WaitFor... model instead. This got promoted by MS as being somehow superior to the Unix way of doing things. And now that Windows has been that way for so long I'm guessing that there's no one in MS who perceives that Things Could Be Improved.
==EDIT==
I have since stumbled across Named Pipes - Asynchronous Peeking. This is fascinating because it would seem (I'm very glad to say) to debunk pretty much everything I'd thought about Windows and IO. The article applies to pipes, though presumably it would also apply to any IO stream.
It seems to hinge on starting an asynchronous read operation to read zero bytes. The read will not return until there are some bytes available, but none of them are read from the stream. You can therefore use something like WaitForMultipleObjects() to wait for more than one such asynchronous operation to complete.
As the comment below the accepted answer recognises, this is very non-obvious in all of the Microsoft documentation that I've ever read. I wonder whether it is an unintended but useful behaviour in the OS. I've been ploughing through Windows Internals by Mark Russinovich, but I've not found anything yet.
I've not yet had a chance to experiment with this in any way, but if it does work then that means that one can implement something equivalent to Unix's select() on Windows, so it must therefore be supported all the way down to the device driver level and interrupts. Hence the extensive strikeouts above...

buffered asynchronous file I/O on linux

I am looking for the most efficient way to do asynchronous file I/O on linux.
The POSIX glibc implementation uses threads in userland.
The native aio kernel api only works with unbuffered operations, patches for the kernel to add support for buffered operations exist, but those are >3 years old and no one seems to care about integrating them into the mainline.
I found plenty of other ideas, concepts and patches that would allow asynchronous I/O, though most of them are in articles that are also >3 years old. What of all this is really available in today's kernel? I've read about servlets, acalls, stuff with kernel threads and more things I don't even remember right now.
What is the most efficient way to do buffered asynchronous file input/output in today's kernel?
Unless you want to write your own IO thread pool, the glibc implementation is an acceptable solution. It actually works surprisingly well for something that runs entirely in userland.
The kernel implementation does not work with buffered IO at all in my experience (though I've seen other people say the opposite!). Which is fine if you want to read huge amounts of data via DMA, but of course it sucks big time if you plan to take advantage of the buffer cache.
Also note that the kernel AIO calls may actually block. There is a command buffer of limited size, and large reads are broken up into several smaller ones. Once the queue is full, asynchronous commands run synchronously. Surprise. I ran into this problem a year or two ago and could not find an explanation. Asking around gave me the "yeah of course, that's how it works" answer.
From what I've understood, the "official" interest in supporting buffered aio is not terribly great either, even though several working solutions have seemingly been available for years. Some of the arguments that I've read were along the lines of "you don't want to use the buffers anyway", "nobody needs that" and "most people don't even use epoll yet". So, well... meh.
Being able to get an epoll signalled by a completed async operation was another issue until recently, but in the meantime this works really fine via eventfd.
Note that the glibc implementation will actually spawn threads on demand inside __aio_enqueue_request. It is probably no big deal, since spawning threads is not that terribly expensive any more, but one should be aware of it. If your understanding of starting an asynchronous operation is "returns immediately", then that assumption may not be true, because it may be spawning some threads first.
EDIT:
As a sidenote, under Windows there exists a very similar situation to the one in the glibc AIO implementation where the "returns immediately" assumption of queuing an asynchronous operation is not true.
If all data that you wanted to read is in the buffer cache, Windows will decide that it will instead run the request synchronously, because it will finish immediately anyway. This is well-documented, and admittedly sounds great, too. Except in case there are a few megabytes to copy or in case another thread has page faults or does IO concurrently (thus competing for the lock) "immediately" can be a surprisingly long time -- I've seen "immediate" times of 2-5 milliseconds. Which is no problem in most situations, but for example under the constraint of a 16.66ms frame time, you probably don't want to risk blocking for 5ms at random times. Thus, the naive assumption of "can do async IO from my render thread no problem, because async doesn't block" is flawed.
The material seems old -- well, it is old -- because it's been around for long and, while by no means trivial, is well understood. A solution you can lift is published in W. Richard Stevens's superb and unparalleled book (read "bible"). The book is the rare treasure that is clear, concise, and complete: every page gives real and immediate value:
Advanced Programming in the UNIX Environment
Two other such, also by Stevens, are the first two volumes of his Unix Network Programming collection:
Volume 1: The Sockets Networking API (with Fenner and Rudoff) and
Volume 2: Interprocess Communications
I can't imagine being without these three fundamental books; I'm dumbstruck when I find someone who hasn't heard of them.
Still more of Stevens's books, just as precious:
TCP/IP Illustrated, Vol. 1: The Protocols
(2021) If your Linux kernel is new enough (at least 5.1 but newer kernels bring improvements) then io_uring will be "the most efficient way to do asynchronous file input/output" *. That applies to both buffered and direct I/O!
In the Kernel Recipes 2019 video "Faster IO through io_uring", io_uring author Jens Axboe demonstrates buffered I/O via io_uring finishing in almost half the time of synchronous buffered I/O. As Marenz noted, unless you want userspace threads, io_uring is the only way to do buffered asynchronous I/O, because Linux AIO (aka libaio/io_submit()) doesn't have the ability to always do buffered asynchronous I/O...
Additionally, in the article "Modern storage is plenty fast." Glauber Costa demonstrates how careful use of io_uring with asynchronous direct I/O can improve throughput compared to using io_uring for asynchronous buffered I/O on an Optane device. It required Glauber to have a userspace readahead implementation (without which buffered I/O was a clear winner) but the improvement was impressive.
* The context of this answer is clearly in relation to storage (after all the word buffered was mentioned). For network I/O io_uring has steadily improved in later kernels to the extent that it can trade blows with things like epoll() and if it continues it will one day be either equal or better in all cases.
I don't think the Linux kernel implementation of asynchronous file I/O is really usable unless you also use O_DIRECT, sorry.
There's more information about the current state of the world here: https://github.com/littledan/linux-aio . It was updated in 2012 by someone who used to work at Google.

The state of Linux async IO?

I ask here since googling leads you on a merry trip around archives with no hint as to what the current state is. If you go by Google, it seems that async IO was all the rage in 2001 to 2003, and by 2006 some stuff like epoll and libaio was turning up; kevent appeared but seems to have disappeared, and as far as I can tell, there is still no good way to mix completion-based and ready-based signaling, async sendfile - is that even possible? - and everything else in a single-threaded event loop.
So please tell me I'm wrong and it's all rosy! - and, importantly, what APIs to use.
How does Linux compare to FreeBSD and other operating systems in this regard?
AIO as such is still somewhat limited and a real pain to get started with, but it kind of works for the most part, once you've dug through it.
It has some, in my opinion, serious bugs, but those are really "features". For example, when submitting a certain amount of commands or data, your submitting thread will block. I don't remember the exact justification for this feature, but the reply I got back then was something like "yes of course, the kernel has a limit on its queue size, that is as intended". Which is acceptable if you submit a few thousand requests... obviously there has to be a limit somewhere. It might make sense from a DoS point of view, too (otherwise a malicious program could force the kernel to run out of memory by posting a billion requests). But still, it's something that you can realistically encounter with "normal" numbers (a hundred or so) and it will strike you unexpectedly, which is no good. Plus, if you only submit half a dozen or so requests and they're a bit larger (some megabytes of data) the same may happen, apparently because the kernel breaks them up into sub-requests. Which, again, kind of makes sense, but seeing how the docs don't tell you, one should expect that it makes no difference (apart from taking longer) whether you read 500 bytes or 50 megabytes of data.
Also, there seems to be no way of doing buffered AIO, at least on any of my Debian and Ubuntu systems (although I've seen other people complain about the exact opposite, i.e. unbuffered writes in fact going via the buffers). From what I can see on my systems, AIO is only really asynchronous with buffering turned off, which is a shame (it is why I am presently using an ugly construct around memory mapping and a worker thread instead).
An important issue with anything asynchronous is being able to epoll_wait() on it, which matters if you are doing anything else apart from disk IO (such as receiving network traffic). Of course there is io_getevents, but it is not so desirable/useful, as it only works for that one thing.
In recent kernels, there is support for eventfd. At first sight, it appears useless, since it is not obvious how it may be helpful in any way.
However, to your rescue, there is the undocumented function io_set_eventfd which lets you associate AIO with an eventfd, which is epoll_wait()-able. You have to dig through the headers to find out about it, but it's certainly there, and it works just fine.
Asynchronous disc IO is alive and kicking... it is actually supported and works reasonably well now, but has significant limitations (though with enough functionality that some of the major users can usefully use it - for example, MySQL's InnoDB does in the latest version).
Asynchronous disc IO is the ability to invoke disc IO operations in a non-blocking manner (in a single thread) and wait for them to complete. This works fine, http://lse.sourceforge.net/io/aio.html has more info.
AIO does enough for a typical application (database server) to be able to use it. AIO is a good alternative to either creating lots of threads doing synchronous IO, or using scatter/gather in the preadv family of system calls which now exist.
It's possible to do a "shopping list" synchronous IO job using the newish preadv call, where the kernel reads contiguously from a given offset in a file and scatters the data across a bunch of buffers in a single system call. This is OK as long as you have only one file to read. (NB: an equivalent write function, pwritev, exists.)
poll, epoll etc are just fancier ways of doing select() that suffer from fewer limitations and scalability problems - they may not be able to be mixed with disc aio easily, but in a real-world application you can probably get around this fairly trivially by using threads (some database servers tend to do these kinds of operations in separate threads anyway). poll() is good, and epoll is better, for large numbers of file descriptors. select() is OK too for small numbers of file descriptors (or, specifically, low file descriptor numbers).
(At the tail end of 2019 there's a glimmer of hope almost a decade after the original question was asked)
If you have a 5.1 or later Linux kernel you can use the io_uring interface which will hopefully usher in a better asynchronous I/O future for Linux (see one of the answers to the Stack Overflow question "Is there really no asynchronous block I/O on Linux?" for benefits io_uring provides over KAIO). Hopefully this will allow Linux to provide stiff competition to FreeBSD's asynchronous AIO without huge contortions!
Most of what I've learned about asynchronous I/O in Linux was by working on the Lighttpd source. It is a single-threaded web server that handles many simultaneous connections, using what it believes is the best of whatever asynchronous I/O mechanisms are available on the running system. Take a look at the source; it supports Linux, BSD, and (I think) a few other operating systems.
