Does the Thundering Herd Problem exist on Linux anymore? - linux

Many linux/unix programming books and tutorials speak about the "Thundering Herd Problem" which happens when multiple threads or forks are blocked on a select() call waiting for readability of a listening socket. When the connection comes in, all threads and forks are woken up but only one "wins" with a successful call to "accept()". In the meantime, a lot of cpu time is wasted waking up all the threads/forks for no reason.
I noticed a project which provides a "fix" for this problem in the linux kernel, but this is a very old patch.
I think there are two variants; One where each fork does select() and then accept(), and one that just does accept().
Do modern unix/linux kernels still have the Thundering Herd Problem in both these cases or only the "select() then accept()" version?

For years, most unix/linux kernels serialize response to accept(2)s, in other words, only one thread is waken up if more than one are blocking on accept(2) against a single open file descriptor.
OTOH, many (if not all) kernels still have the thundering herd problem in the select-accept pattern as you describe.
I have written a simple script ( https://gist.github.com/kazuho/10436253 ) to verify the existence of the problem, and found out that the problem exists on linux 2.6.32 and Darwin 12.5.0 (OS X 10.8.5).

This is a very old problem, and for the most part does not exist any more. The Linux kernel (for the past few years) has had a number of changes with the way it handles and routes packets up the network stack, and includes many optimizations to ensure both low latency, and fairness (i.e., minimize starvation).
That said, the select system has a number of scalability issues simply by way of its API. When you have a large number of file descriptors, the cost of a select call is very high. This is primarily due to having to build, check, and maintain the FD sets that are passed to and from the system call.
Now days, the preferred way to do asynchronous IO is with epoll. The API is far simpler and scales very nicely across various types of load (many connections, lots of throughput, etc.)

I recently saw tested a scenario where multiple threads polled on a listening unix-domain socket and then accepted the connection. All threads woke up using the poll() system call.
This was a custom build of the linux kernel rather than a distro build so perhaps there is a kernel configure option that changes it but I don't know what that would be.
We did not try epoll.

Refer the link below which talks about separate flags to epoll to avoid this problem.
http://lwn.net/Articles/632590/

Related

How is Wait() functionality Implemented

How are wait (Eg: WaitForSingleObject) functions implemented internally in Windows or any OS?
How is it any different from a spin lock?
Does the CPU/hardware provide special functionality to do this?
Hazy view of What's Going On follows... Focusing on IO mainly.
Unix / Linux / Posix
The Unix equivalent, select(), epoll(), and similar have been implemented in various ways. In the early days the implementations were rubbish, and amounted to little more than busy polling loops that used up all your CPU time. Nowadays it's much better, and takes no CPU time whilst blocked.
They can do this I think because the device driver model for devices like Ethernet, serial and so forth has been designed to support the select() family of functions. Specifically the model must allow the kernel to tell the devices to raise an interrupt when something has happened. The kernel can then decide whether or not that will result in a select() unblocking, etc etc. The result is efficient process blocking.
Windows
In Windows the WaitfFor when applied to asynchronous IO is completely different. You actually have to start a thread reading from an IO device, and when that read completes (note, not starts) you have that thread return something that wakes up the WaitFor. That gets dressed up in object.beginread(), etc, but they all boil down to that underneath.
This means you can't replicate the select() functionality in Windows for serial, pipes, etc. But there is a select function call for sockets. Weird.
To me this suggests that the whole IO architecture and device driver model of the Windows kernel can drive devices only by asking them to perform an operation and blocking until the device has completed it. There would seem to be no truly asynchronous way for the device to notify the kernel of events, and the best that can be achieved is to have a separate thread doing the synchronous operation for you. I've no idea how they've done select for sockets, but I have my suspicions.
CYGWIN, Unix on Windows
When the cygwin guys came to implement their select() routine on Windows they were horrified to discover that it was impossible to implement for anything other than sockets. What they did was for each file descriptor passed to select they would spawn a thread. This would poll the device, pipe, whatever waiting for the available data count to be non zero, etc. That thread would then notify the thread that is actually calling select() that something had happened. This is very reminiscent of select() implementations from the dark days of Unix, and is massively inefficient. But it does work.
I would bet a whole 5 new pence that that's how MS did select for sockets too.
My Experience Thus Far
Windows' WaitFors... are fine for operations that are guaranteed to complete or proceed and nice fixed stages, but are very unpleasant for operations that aren't (like IO). Cancelling an asynchronous IO operation is deeply unpleasant. The only way I've found for doing it is to close the device, socket, pipe, etc which is not always what you want to do.
Attempt to Answer the Question
The hardware's interrupt system supports the implementation of select() because it's a way for the devices to notify the CPU that something has happened without the CPU having to poll / spin on a register in the device.
Unix / Linux uses that interrupt system to provide select() / epoll() functionality, and also incorporates purely internal 'devices' (pipes, files, etc) into that functionality.
Windows' equivalent facility, WaitForMultipleObjects() fundamentally does not incorporate IO devices of any sort, which is why you have to have a separate thread doing the IO for you whilst you're waiting for that thread to complete. The interrupt system on the hardware is (I'm guessing) used solely to tell the device drivers when a read or write operation is complete. The exception is the select() function call in Windows which operates only on sockets, not anything else.
A big clue to the architectural differences between Unix/Linux and Windows is that a PC can run either, but you get a proper IO-centric select() only on Unix/Linux.
Guesswork
I'm guessing that the reason Windows has never done a select() properly is that the early device drivers for Windows could in no way support it, a bit like the early days of Linux.
However, Windows became very popular quite early on, and loads of device drivers got written against that (flawed?) device driver standard.
If at any point MS had thought "perhaps we'd better improve on that" they would have faced the problem of getting everyone to rewrite their device drivers, a massive undertaking. So they decided not to, and instead implemented the separate IO thread / WaitFor... model instead. This got promoted by MS as being somehow superior to the Unix way of doing things. And now that Windows has been that way for so long I'm guessing that there's no one in MS who perceives that Things Could Be Improved.
==EDIT==
I have since stumbled across Named Pipes - Asynchronous Peeking. This is fascinating because it would seem (I'm very glad to say) to debunk pretty much everything I'd thought about Windows and IO. The article applies to pipes, though presumably it would also apply to any IO stream.
It seems to hinge on starting an asynchronous read operation to read zero bytes. The read will not return until there are some bytes available, but none of them are read from the stream. You can therefore use something like WaitForMultipleObjects() to wait for more than one such asynchronous operation to complete.
As the comment below the accepted answer recognises this is very non-obvious in all the of the Microsoft documentation that I've ever read. I wonder about it being an unintended but useful behaviour in the OS. I've been ploughing through Windows Internals by Mark Russinovich, but I've not found anything yet.
I've not yet had a chance to experiment with this in anyway, but if it does work then that means that one can implement something equivalent to Unix's select() on Windows, so it must therefore be supported all the way down to the device driver level and interrupts. Hence the extensive strikeouts above...

Linux IPC with single writer, multible readers

In my application, I have one process (the "writer") reading data from external hardware. This process should supply consistent packets to several "readers".
Question:
How can one process send consistent data-packets (not blocking) to several clients while they see something like the end of a personal fifo? I'm on Debian Linux.
1) In my first approach I tried "datagram - unix domain sockets", which worked well.
But with the "writer" as server, all the clients have to poll the server permanently. :-(
They get one packet several times; or miss a packet if not polling fast enough.
2) My second approach was FIFOs (Named Pipes) which works too, but with several readers "strange things happen", what I got confirmed here: http://beej.us/guide/bgipc/output/html/multipage/fifos.html
I tried this, searched the net and stackoverflow the whole day, but I could not find a reasonable answer.
Edit: Sorry, I did not mention: I can't use socketpair() and fork. My programs are independently developed. I'd like to have the writer ready, while developing new readers.
If the writer process is a server, it might fork the client processes and just plain pipe(2) for communication. If there is no parent/child relationship, consider named pipes made with mkfifo(3), or AF_UNIX sockets (see unix(7) and scoket(2) ....) which are bidirectional (AF_UNIX sockets are much faster than TCP/IP or UDP/IP on the same machine).
Notice that your writer process is reading data from your hardware device and is writing or sending data to several reader clients. So your writer process deals with many file descriptors simultaneously (the hardware device to be read, and the sockets or pipes to be written to the clients, at least one file descriptor per client).
However, the important thing is to have some event loop (notably on the server side, and probably also inside clients). This means that you call some multiplexing syscall like poll(2) in the loop, and you "decide" if you are reading or writing or connecting (and which file descriptor should be read, or should be written, or should connect) in each iteration. See also read(2), write(2), connect(2), send(2), recv(2) etc... Notice that you should buffer data with an event loop (since read and write can be on "partial" or "incomplete" messages).
Notice that poll is not eating CPU resources when waiting for I/O. You could, but you should not anymore, use some older multiplexing syscall (like the obsolete select(2) ...). Use poll(2).
You may want to use a library to provide the event loop, e.g. libevent or libev... See also this answer. That event loop should also (on the server side) poll then read the hardware device.
If some of the programs are using a GUI toolkit (e.g. on the client side) like Qt or Gtk they should profit of the existing event loop provided by that toolkit...
You should read Advanced Linux Programming and know about the C10K problem.
If signals or timers are important (read carefully signal(7) and time(7)) the Linux specific signalfd(2) and timerfd_create(2) could be very helpful since they play nicely with event loops. These linux specific syscalls (signalfd & timerfd_create ...) are too recent to be mentioned in Advanced Linux Programming.
BTW, you could study the source code of existing free software similar to yours, and/or use strace(1) to understand the exact syscalls that they are doing.
If you have no loop around a multiplexing syscall (à la poll(2)) then you have no event loop and your design is buggy and cannot possibly and reliably work (since you need to react to several file descriptors at once).
You could also use a multi-threaded approach, but it is much more complex and not worth the effort in your particular case.
ZeroMQ has pattern for this problem. Is fast and supports a lot of programming languages. Pub-Sub. See: https://zeromq.org/ (free and open source).

buffered asynchronous file I/O on linux

I am looking for the most efficient way to do asynchronous file I/O on linux.
The POSIX glibc implementation uses threads in userland.
The native aio kernel api only works with unbuffered operations, patches for the kernel to add support for buffered operations exist, but those are >3 years old and no one seems to care about integrating them into the mainline.
I found plenty of other ideas, concepts, patches that would allow asynchronous I/O, though most of them in articles that are also >3 years old. What of all this is really available in todays kernel? I've read about servlets, acalls, stuff with kernel threads and more things I don't even remember right now.
What is the most efficient way to do buffered asynchronous file input/output in todays kernel?
Unless you want to write your own IO thread pool, the glibc implementation is an acceptable solution. It actually works surprisingly well for something that runs entirely in userland.
The kernel implementation does not work with buffered IO at all in my experience (though I've seen other people say the opposite!). Which is fine if you want to read huge amounts of data via DMA, but of course it sucks big time if you plan to take advantage of the buffer cache.
Also note that the kernel AIO calls may actually block. There is a limited size command buffer, and large reads are broken up into several smaller ones. Once the queue is full, asynchronous commands run synchronously. Surprise. I've run into this problem a year or two ago and could not find an explanation. Asking around gave me the "yeah of course, that's how it works" answer.
From what I've understood, the "official" interest in supporting buffered aio is not terribly great either, despite several working solutions seem to be available for years. Some of the arguments that I've read were on the lines of "you don't want to use the buffers anyway" and "nobody needs that" and "most people don't even use epoll yet". So, well... meh.
Being able to get an epoll signalled by a completed async operation was another issue until recently, but in the meantime this works really fine via eventfd.
Note that the glibc implementation will actually spawn threads on demand inside __aio_enqueue_request. It is probably no big deal, since spawning threads is not that terribly expensive any more, but one should be aware of it. If your understanding of starting an asynchronous operation is "returns immediately", then that assumption may not be true, because it may be spawning some threads first.
EDIT:
As a sidenote, under Windows there exists a very similar situation to the one in the glibc AIO implementation where the "returns immediately" assumption of queuing an asynchronous operation is not true.
If all data that you wanted to read is in the buffer cache, Windows will decide that it will instead run the request synchronously, because it will finish immediately anyway. This is well-documented, and admittedly sounds great, too. Except in case there are a few megabytes to copy or in case another thread has page faults or does IO concurrently (thus competing for the lock) "immediately" can be a surprisingly long time -- I've seen "immediate" times of 2-5 milliseconds. Which is no problem in most situations, but for example under the constraint of a 16.66ms frame time, you probably don't want to risk blocking for 5ms at random times. Thus, the naive assumption of "can do async IO from my render thread no problem, because async doesn't block" is flawed.
The material seems old -- well, it is old -- because it's been around for long and, while by no means trivial, is well understood. A solution you can lift is published in W. Richard Stevens's superb and unparalleled book (read "bible"). The book is the rare treasure that is clear, concise, and complete: every page gives real and immediate value:
Advanced Programming in the UNIX Environment
Two other such, also by Stevens, are the first two volumes of his Unix Network Programming collection:
Volume 1: The Sockets Networking API (with Fenner and Rudoff) and
Volume 2: Interprocess Communications
I can't imagine being without these three fundamental books; I'm dumbstruck when I find someone who hasn't heard of them.
Still more of Steven's books, just as precious:
TCP/IP Illustrated, Vol. 1: The Protocols
(2021) If your Linux kernel is new enough (at least 5.1 but newer kernels bring improvements) then io_uring will be "the most efficient way to do asynchronous file input/output" *. That applies to both buffered and direct I/O!
In the Kernel Recipes 2019 video "Faster IO through io_uring", io_uring author Jens Axboe demonstrates buffered I/O via io_uring finishing in almost half the time of synchronous buffered I/O. As #Marenz noted, unless you want to userspace threads io_uring is the only way to do buffered asynchronous I/O because Linux AIO (aka libaio/io_submit()) doesn't have the ability to always do buffered asynchronous I/O...
Additionally, in the article "Modern storage is plenty fast." Glauber Costa demonstrates how careful use of io_uring with asynchronous direct I/O can improve throughput compared to using io_uring for asynchronous buffered I/O on an Optane device. It required Glauber to have a userspace readahead implementation (without which buffered I/O was a clear winner) but the improvement was impressive.
* The context of this answer is clearly in relation to storage (after all the word buffered was mentioned). For network I/O io_uring has steadily improved in later kernels to the extent that it can trade blows with things like epoll() and if it continues it will one day be either equal or better in all cases.
I don't think the Linux kernel implementation of asynchronous file I/O is really usable unless you also use O_DIRECT, sorry.
There's more information about the current state of the world here: https://github.com/littledan/linux-aio . It was updated in 2012 by someone who used to work at Google.

The state of Linux async IO?

I ask here since googling leads you on a merry trip around archives with no hint as to what the current state is. If you go by Google, it seems that async IO was all the rage in 2001 to 2003, and by 2006 some stuff like epoll and libaio was turning up; kevent appeared but seems to have disappeared, and as far as I can tell, there is still no good way to mix completion-based and ready-based signaling, async sendfile - is that even possible? - and everything else in a single-threaded event loop.
So please tell me I'm wrong and it's all rosy! - and, importantly, what APIs to use.
How does Linux compare to FreeBSD and other operating systems in this regard?
AIO as such is still somewhat limited and a real pain to get started with, but it kind of works for the most part, once you've dug through it.
It has some in my opinion serious bugs, but those are really features. For example, when submitting a certain amount of commands or data, your submitting thread will block. I don't remember the exact justification for this feature, but the reply I got back then was something like "yes of course, the kernel has a limit on its queue size, that is as intended". Which is acceptable if you submit a few thousand requests... obviously there has to be a limit somewhere. It might make sense from a DoS point of view, too (otherwise a malicious program could force the kernel to run out of memory by posting a billion requests). But still, it's something that you can realistically encounter with "normal" numbers (a hundred or so) and it will strike you unexpectedly, which is no good. Plus, if you only submit half a dozen or so requests and they're a bit larger (some megabytes of data) the same may happen, apparently because the kernel breaks them up in sub-requests. Which, again, kind of makes sense, but seeing how the docs don't tell you, one should expect that it makes no difference (apart from taking longer) whether you read 500 bytes or 50 megabytes of data.
Also, there seems to be no way of doing buffered AIO, at least on any of my Debian and Ubuntu systems (although I've seen other people complain about the exact opposite, i.e. unbuffered writes in fact going via the buffers). From what I can see on my systems, AIO is only really asynchronous with buffering turned off, which is a shame (it is why I am presently using an ugly construct around memory mapping and a worker thread instead).
An important issue with anything asynchronous is being able to epoll_wait() on it, which is important if you are doing anything else apart from disk IO (such as receiving network traffic). Of course there is io_getevents, but it is not so desirable/useful, as it only works for one singular thing.
In recent kernels, there is support for eventfd. At first sight, it appears useless, since it is not obvious how it may be helpful in any way.
However, to your rescue, there is the undocumented function io_set_eventfd which lets you associate AIO with an eventfd, which is epoll_wait()-able. You have to dig through the headers to find out about it, but it's certainly there, and it works just fine.
Asynchronous disc IO is alive and kicking ... it is actually supported and works reasonably well now, but has significant limitations (but with enough functionality that some of the major users can usefully use it - for example MySQL's Innodb does in the latest version).
Asynchronous disc IO is the ability to invoke disc IO operations in a non-blocking manner (in a single thread) and wait for them to complete. This works fine, http://lse.sourceforge.net/io/aio.html has more info.
AIO does enough for a typical application (database server) to be able to use it. AIO is a good alternative to either creating lots of threads doing synchronous IO, or using scatter/gather in the preadv family of system calls which now exist.
It's possible to do a "shopping list" synchronous IO job using the newish preadv call where the kernel will go and get a bunch of pages from different offsets in a file. This is ok as long as you have only one file to read. (NB: Equivalent write function exists).
poll, epoll etc, are just fancy ways of doing select() that suffer from fewer limitations and scalability problems - they may not be able to be mixed with disc aio easily, but in a real-world application, you can probably get around this fairly trivially by using threads (some database servers tend to do these kinds of operations in separate threads anyway). Poll() is good, epoll is better, for large numbers of file descriptors. select() is ok too for small numbers of file descriptors (or specifically, low file descriptor numbers).
(At the tail end of 2019 there's a glimmer of hope almost a decade after the original question was asked)
If you have a 5.1 or later Linux kernel you can use the io_uring interface which will hopefully usher in a better asynchronous I/O future for Linux (see one of the answers to the Stack Overflow question "Is there really no asynchronous block I/O on Linux?" for benefits io_uring provides over KAIO). Hopefully this will allow Linux to provide stiff competition to FreeBSD's asynchronous AIO without huge contortions!
Most of what I've learned about asynchronous I/O in Linux was by working on the Lighttpd source. It is a single-threaded web server that handles many simultaneous connections, using the what it believes is the best of whatever asynchronous I/O mechanisms are available on the running system. Take a look at the source, it supports Linux, BSD, and (I think) a few other operating systems.

What is the status of POSIX asynchronous I/O (AIO)?

There are pages scattered around the web that describe POSIX AIO facilities in varying amounts of detail. None of them are terribly recent. It's not clear what, exactly, they're describing. For example, the "official" (?) web site for Linux kernel asynchronous I/O support here says that sockets don't work, but the "aio.h" manual pages on my Ubuntu 8.04.1 workstation all seem to imply that it works for arbitrary file descriptors. Then there's another project that seems to work at the library layer with even less documentation.
I'd like to know:
What is the purpose of POSIX AIO? Given that the most obvious example of an implementation I can find says it doesn't support sockets, the whole thing seems weird to me. Is it just for async disk I/O? If so, why the hyper-general API? If not, why is disk I/O the first thing that got attacked?
Where are there example complete POSIX AIO programs that I can look at?
Does anyone actually use it, for real?
What platforms support POSIX AIO? What parts of it do they support? Does anyone really support the implied "Any I/O to any FD" that <aio.h> seems to promise?
The other multiplexing mechanisms available to me are perfectly good, but the random fragments of information floating around out there have made me curious.
Doing socket I/O efficiently has been solved with kqueue, epoll, IO completion ports and the likes. Doing asynchronous file I/O is sort of a late comer (apart from windows' overlapped I/O and solaris early support for posix AIO).
If you're looking for doing socket I/O, you're probably better off using one of the above mechanisms.
The main purpose of AIO is hence to solve the problem of asynchronous disk I/O. This is most likely why Mac OS X only supports AIO for regular files, and not sockets (since kqueue does that so much better anyway).
Write operations are typically cached by the kernel and flushed out at a later time. For instance when the read head of the drive happens to pass by the location where the block is to be written.
However, for read operations, if you want the kernel to prioritize and order your reads, AIO is really the only option. Here's why the kernal can (theoretically) do that better than any user level application:
The kernel sees all disk I/O, not just your applications disk jobs, and can order them at a global level
The kernel (may) know where the disk read head is, and can pick the read jobs you pass on to it in optimal order, to move the head the shortest distance
The kernel can take advantage of native command queuing to optimize your read operations further
You may be able to issue more read operations per system call using lio_listio() than with readv(), especially if your reads are not (logically) contiguous, saving a tiny bit of system call overhead.
Your program might be slightly simpler with AIO since you don't need an extra thread to block in a read or write call.
That said, posix AIO has a quite awkward interface, for instance:
The only efficient and well supported mean of event callbacks are via signals, which makes it hard to use in a library, since it means using signal numbers from the process-global signal namespace. If your OS doesn't support realtime signals, it also means you have to loop through all your outstanding requests to figure out which one actually finished (this is the case for Mac OS X for instance, not Linux). Catching signals in a multi-threaded environment also makes for some tricky restrictions. You can typically not react to the event inside the signal handler, but you have to raise a signal, write to a pipe or use signalfd() (on linux).
lio_suspend() has the same issues as select() does, it doesn't scale very well with the number of jobs.
lio_listio(), as implemented has fairly limited number of jobs you can pass in, and it's not trivial to find this limit in a portable way. You have to call sysconf(_SC_AIO_LISTIO_MAX), which may fail, in which case you can use the AIO_LISTIO_MAX define, which are not necessarily defined, but then you can use 2, which is defined as guaranteed to be supported.
As for real-world application using posix AIO, you could take a look at lighttpd (lighty), which also posted a performance measurement when introducing support.
Most posix platforms supports posix AIO by now (Linux, BSD, Solaris, AIX, tru64). Windows supports it via its overlapped file I/O. My understanding is that only Solaris, Windows and Linux truly supports async. file I/O all the way down to the driver, whereas the other OSes emulate the async. I/O with kernel threads. Linux being the exception, its posix AIO implementation in glibc emulates async operations with user level threads, whereas its native async I/O interface (io_submit() etc.) are truly asynchronous all the way down to the driver, assuming the driver supports it.
I believe it's fairly common among OSes to not support posix AIO for any fd, but restrict it to regular files.
Network I/O is not a priority for AIO because everyone writing POSIX network servers uses an event based, non-blocking approach. The old-style Java "billions of blocking threads" approach sucks horribly.
Disk write I/O is already buffered and disk read I/O can be prefetched into buffer using functions like posix_fadvise. That leaves direct, unbuffered disk I/O as the only useful purpose for AIO.
Direct, unbuffered I/O is only really useful for transactional databases, and those tend to write their own threads or processes to manage their disk I/O.
So, at the end that leaves POSIX AIO in the position of not serving any useful purpose. Don't use it.
A libtorrent developer provides a report on this: http://blog.libtorrent.org/2012/10/asynchronous-disk-io/
There is aio_write - implemented in glibc; first call of the aio_read or aio_write function spawns a number of user mode threads, aio_write or aio_read post requests to that thread, the thread does pread/pwrite and when it is finished the answer is posted back to the blocked calling thread.
Ther is also 'real' aio - supported by the kernel level (need libaio for that, see the io_submit call http://linux.die.net/man/2/io_submit ); also need O_DIRECT for that (also may not be supported by all file systems, but the major ones do support it)
see here:
http://lse.sourceforge.net/io/aio.html
http://linux.die.net/man/2/io_submit
Difference between POSIX AIO and libaio on Linux?

Resources