libuv vs. raw epoll or IOCP - multithreading

I'm writing the I/O core for a messaging library and considering libuv vs. raw epoll on Linux and IOCP on Windows (and eventually others, such as Solaris event ports). I like the portability of libuv, but I'm also looking at performance.
epoll and IOCP allow multiple threads to wait directly for I/O events; the kernel does the dispatching. That is potentially more efficient than user-space dispatching, though I don't have any numbers.
libuv (based on my reading) has a thread-unsafe event loop, but I could implement a leader-follower thread pool on top of it. By that I mean one thread (at a time) is the "leader" waiting for events. When the leader gets an event it signals a follower to take over as leader. The ex-leader processes the event and then becomes a follower.
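To make the pattern concrete, here is a minimal sketch (in C, over a raw epoll fd rather than libuv, since libuv's callbacks fire inside uv_run(); fds are assumed registered and error handling is omitted):

    /* Leader-follower sketch over a shared epoll fd (illustrative only:
       one event per wakeup, no error handling). */
    #include <pthread.h>
    #include <sys/epoll.h>

    static int epfd;                              /* shared epoll instance */
    static pthread_mutex_t leader = PTHREAD_MUTEX_INITIALIZER;

    static void handle_event(struct epoll_event *ev) {
        /* read/write ev->data.fd, run protocol logic, ... */
    }

    static void *worker(void *arg) {
        (void)arg;
        struct epoll_event ev;
        for (;;) {
            pthread_mutex_lock(&leader);          /* become the leader */
            int n = epoll_wait(epfd, &ev, 1, -1); /* wait for one event */
            pthread_mutex_unlock(&leader);        /* promote a follower */
            if (n == 1)
                handle_event(&ev);                /* process as ex-leader */
        }
        return NULL;
    }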
My hope is that this would be close in performance to raw multi-threaded epoll/IOCP, assuming libuv is efficiently implemented. I will do my own measurements, but I'd like to hear from anyone with experience.

Disclaimer: I'm one of the maintainers of libuv.
I'd advise you to start with libuv, if only because it already takes care of hundreds of corner cases and has tons of knowledge built in. If you want to support other platforms you are going to end up reinventing libuv one way or another :-)
Once you have built your server prototype with libuv, run some benchmarks and see where the bottleneck is. Depending on the type of server you are writing you may need or want a multi-threaded event loop, which libuv doesn't provide, but chances are libuv will work well enough for you.
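For reference, the shape of a minimal libuv TCP server looks roughly like this (an untested sketch with error handling omitted; on_connection is just a placeholder name):

    #include <stdlib.h>
    #include <uv.h>

    static void on_connection(uv_stream_t *server, int status) {
        if (status < 0) return;                  /* accept error */
        uv_tcp_t *client = malloc(sizeof(*client));
        uv_tcp_init(server->loop, client);
        if (uv_accept(server, (uv_stream_t *)client) == 0) {
            /* start reading with uv_read_start(...) here */
        }
    }

    int main(void) {
        uv_loop_t *loop = uv_default_loop();
        uv_tcp_t server;
        struct sockaddr_in addr;

        uv_tcp_init(loop, &server);
        uv_ip4_addr("0.0.0.0", 7000, &addr);
        uv_tcp_bind(&server, (const struct sockaddr *)&addr, 0);
        uv_listen((uv_stream_t *)&server, 128, on_connection);

        return uv_run(loop, UV_RUN_DEFAULT);     /* the single-threaded loop */
    }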

Related

What really is asynchronous computing?

I've been reading (and working) quite a bit with massively multi-threaded applications, and with IO, and I've found that the term asynchronous has become some sort of catch-all for multiple vague ideas. I'm wondering if I understand it correctly. The way I see it is that there are two main branches of "asynchronicity".
Asynchronous I/O, such as network read/write. What this really boils down to is efficient parallel processing between multiple CPUs, such as your main CPU and your NIC CPU. The idea is to have multiple processors running in parallel, exchanging data, without blocking while waiting for the other to finish and return the results of its job.
Minimizing context-switching penalties by minimizing the use of threads. This seems to be what the .NET framework is focusing on with its async/await features. Instead of spawning/closing/blocking threads, break parallel jobs into tasks, and use a software task scheduler to keep a pool of threads as busy as possible without resorting to spawning new threads.
These seem like two entirely separate concepts with no similarities that could tie them together, yet both are referred to by the same "asynchronous computing" vocabulary.
Am I understanding all of this correctly?
Asynchronous basically means not blocking, i.e. not having to wait for an operation to complete.
Threads are just one way of accomplishing that. There are many ways of doing this, from the hardware level to the OS level to the software level.
Someone with more experience than me can give examples of asynchronicity not related to threads.
What this really boils down to is efficient parallel processing between multiple CPUs, such as your main CPU and your NIC CPU. The idea is to have multiple processors running in parallel...
Asynchronous programming is not all about multi-core CPUs and parallelism: consider a single-core CPU with just one thread that creates email messages and sends them. In a synchronous fashion, it would spend a few microseconds creating a message and a lot more time sending it through the network, and only then create the next message. But in an asynchronous program, the thread could create a new message while the previous one is being sent through the network. One implementation of that kind of program is .NET async/await, where you can have just one thread. But even a blocking I/O program can be considered asynchronous: if the main thread creates the messages and queues them in a buffer, from which another thread pulls them and sends them with blocking I/O, then from the main thread's point of view it's completely async.
.NET async/await just uses the OS APIs, which are already async - reading/writing a file and sending/receiving data through the network are all async anyway; the OS doesn't block on them (the drivers themselves are async).
Asynchronous is a general term which does not have a widely accepted meaning. Different domains give it different meanings.
For instance, async I/O means that instead of blocking on an I/O call, something else happens. That something else can be really different things, but it usually involves some sort of notification of call completion. Details might differ. For instance, the notification might be built into the call itself - as in MS completion ports (if memory serves). Or it can be something you do before you make a call to verify that the call cannot block - this is what poll() and friends do.
Async might also well mean simply parallel execution. For instance, one might say that 'database is updated asynchronously' meaning that there is a dedicated thread which handles database connectivity, and that thread does not slow down the main processing thread.
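For instance, the poll() style of "verify before you call" looks like this in C (a sketch with no error handling; the kernel reports readiness and your code then makes the call that can no longer block):

    #include <poll.h>
    #include <unistd.h>

    /* Wait until sock_fd is readable before calling read(), so the
       read() itself will not block (sketch, no error handling). */
    void read_when_ready(int sock_fd, char *buf, size_t len) {
        struct pollfd pfd = { .fd = sock_fd, .events = POLLIN };
        if (poll(&pfd, 1, -1) == 1 && (pfd.revents & POLLIN))
            read(sock_fd, buf, len);  /* data is waiting; won't block */
    }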

How does process blocking apply to a multi-threaded process?

I've learned that a process has running, ready, blocked, and suspended states. Threads also have these states, except for suspended, because a thread lives in the process's address space.
A process typically blocks when it is doing blocking I/O or waiting for an event.
I can easily picture a process getting blocked if it's single-threaded or if it follows a one-to-many model, but how does it work if the process is multi-threaded?
For example:
I have a process with two threads in a system that follows a one-to-one model. One handles the GUI and the other handles the blocking I/O. I know the process remains responsive because the other thread handles the I/O.
So is there any chance the process gets blocked, or should I just rule that out in this case?
I'm just getting into this stuff, so forgive me if I don't understand some of the important details yet.
Let's say you have a work queue where the UI thread schedules work to be done and the I/O thread looks there for work to do. The work queue itself is data that is read and modified from both threads, therefore you must synchronize access somehow or race conditions result.
The naive approach is to synchronize access to the queue using a lock (aka critical section). If the I/O thread acquires the lock and then blocks, the UI thread will only remain responsive until it decides it needs to schedule work and tries to acquire the lock. A better approach is to use a lock-free queue, about which much has been written and you can easily search for more info.
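To make the hazard concrete, here is a sketch of the naive lock-based queue (hypothetical job_t type, shutdown handling omitted). The crucial rule it illustrates: hold the lock only around the queue manipulation, never around the blocking I/O itself.

    #include <pthread.h>
    #include <stddef.h>

    typedef struct job { struct job *next; /* ... */ } job_t;

    static job_t *head;
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

    /* Called by the UI thread; only blocks while q_lock is held elsewhere. */
    void enqueue(job_t *j) {
        pthread_mutex_lock(&q_lock);
        j->next = head;
        head = j;
        pthread_cond_signal(&q_cond);
        pthread_mutex_unlock(&q_lock);
    }

    /* Called by the I/O thread; the lock is dropped BEFORE any blocking work. */
    job_t *dequeue(void) {
        pthread_mutex_lock(&q_lock);
        while (head == NULL)
            pthread_cond_wait(&q_cond, &q_lock);
        job_t *j = head;
        head = j->next;
        pthread_mutex_unlock(&q_lock);
        return j;   /* caller performs the blocking I/O without the lock */
    }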
But to answer your question, yes, it is still much easier than you might think to cause the UI to stutter or hang even when using multiple threads. There are various libraries that make it easier or harder to solve this problem, so depending on your OS and language of choice, there may be something better than just OS primitives. Win32 (from what I remember) doesn't make it very easy at all, despite having all sorts of synchronization primitives. Pthreads and Boost never seemed very straightforward to me either. Apple's GCD makes it semantically much easier to express what you want (in my opinion), though there are still pitfalls one must be aware of (such as scheduling too many blocking operations on a single work queue to be done in parallel and causing the processor to thrash when they all wake up at the same time).
My advice is to just dive in and write lots of multithreaded code. It can be tough to debug but you will learn a lot and eventually it becomes second nature.

Is non-blocking I/O really faster than multi-threaded blocking I/O? How?

I searched the web for some technical details about blocking I/O and non-blocking I/O and I found several people stating that non-blocking I/O would be faster than blocking I/O. For example, in this document.
If I use blocking I/O, then of course the thread that is currently blocked can't do anything else... because it's blocked. But as soon as a thread starts being blocked, the OS can switch to another thread and not switch back until there is something to do for the blocked thread. So as long as there is another thread on the system that needs CPU and is not blocked, there should not be any more CPU idle time compared to an event-based non-blocking approach, should there?
Besides reducing the time the CPU is idle, I see one more option to increase the number of tasks a computer can perform in a given time frame: reduce the overhead introduced by switching threads. But how can this be done? And is the overhead large enough to show measurable effects? Here is an idea of how I picture it working:
1. To load the contents of a file, an application delegates this task to an event-based I/O framework, passing a callback function along with a filename.
2. The event framework delegates to the operating system, which programs a DMA controller of the hard disk to write the file directly to memory.
3. The event framework allows further code to run.
4. Upon completion of the disk-to-memory copy, the DMA controller causes an interrupt.
5. The operating system's interrupt handler notifies the event-based I/O framework about the file being completely loaded into memory. How does it do that? Using a signal??
6. The code that is currently running within the event I/O framework finishes.
7. The event-based I/O framework checks its queue, sees the operating system's message from step 5, and executes the callback it got in step 1.
Is that how it works? If it does not, how does it work? That means that the event system can work without ever having the need to explicitly touch the stack (such as a real scheduler that would need to backup the stack and copy the stack of another thread into memory while switching threads)? How much time does this actually save? Is there more to it?
The biggest advantage of non-blocking or asynchronous I/O is that your thread can continue its work in parallel. Of course you can achieve this with an additional thread as well. As you stated, for best overall (system) performance I guess it would be better to use asynchronous I/O and not multiple threads (thus reducing thread switching).
Let's look at possible implementations of a network server program that shall handle 1000 clients connected in parallel:
1. One thread per connection (can be blocking I/O, but can also be non-blocking I/O).
Each thread requires memory resources (also kernel memory!), which is a disadvantage. And every additional thread means more work for the scheduler.
2. One thread for all connections.
This takes load off the system because we have fewer threads. But it also prevents you from using the full performance of your machine, because you might end up driving one processor to 100% and letting all other processors idle around.
3. A few threads where each thread handles some of the connections.
This takes load off the system because there are fewer threads. And it can use all available processors. On Windows this approach is supported by the Thread Pool API (a sketch follows below).
Of course having more threads is not per se a problem. As you might have recognized I chose quite a high number of connections/threads. I doubt that you'll see any difference between the three possible implementations if we are talking about only a dozen threads (this is also what Raymond Chen suggests on the MSDN blog post Does Windows have a limit of 2000 threads per process?).
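On Linux, the third option might be sketched like this (epoll in place of the Windows Thread Pool API; listener setup and error handling omitted): a few threads, each running its own event loop over a subset of the connections.

    #include <pthread.h>
    #include <sys/epoll.h>

    #define NTHREADS 4
    static int epfds[NTHREADS];   /* one epoll instance per thread,
                                     created with epoll_create1(0) at startup */

    static void *loop_thread(void *arg) {
        int epfd = *(int *)arg;
        struct epoll_event evs[64];
        for (;;) {
            int n = epoll_wait(epfd, evs, 64, -1);
            for (int i = 0; i < n; i++) {
                /* service evs[i].data.fd: read request, write response, ... */
            }
        }
        return NULL;
    }

    /* Acceptor: spread new connections across the per-thread loops. */
    static void assign(int conn_fd) {
        static int next;
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = conn_fd };
        epoll_ctl(epfds[next], EPOLL_CTL_ADD, conn_fd, &ev);
        next = (next + 1) % NTHREADS;
    }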
On Windows, using unbuffered file I/O means that writes must be of a size that is a multiple of the page size. I have not tested it, but it sounds like this could also positively affect write performance for buffered synchronous and asynchronous writes.
The steps 1 to 7 you describe give a good idea of how it works. On Windows the operating system will inform you about completion of an asynchronous I/O (WriteFile with an OVERLAPPED structure) using an event or a callback. Callback functions are only called, for example, when your code calls WaitForMultipleObjectsEx with bAlertable set to true.
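For example, the completion-routine variant might look like this (a sketch with error handling omitted; note that the OVERLAPPED must stay valid until the operation completes, and the callback only runs while the issuing thread is in an alertable wait):

    #include <windows.h>

    static char buf[4096];

    /* Runs only while the issuing thread is in an alertable wait. */
    static VOID CALLBACK on_done(DWORD err, DWORD bytes, LPOVERLAPPED ov) {
        /* 'bytes' bytes of buf are now valid; start the next read, etc. */
    }

    void read_async(HANDLE file) {
        OVERLAPPED ov = {0};              /* read from offset 0 */
        ReadFileEx(file, buf, sizeof buf, &ov, on_done);
        SleepEx(INFINITE, TRUE);          /* alertable wait -> callback runs */
    }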
Some more reading on the web:
Multiple Threads in the User Interface on MSDN, which also briefly covers the cost of creating threads
Section Threads and Thread Pools says "Although threads are relatively easy to create and use, the operating system allocates a significant amount of time and other resources to manage them."
CreateThread documentation on MSDN says "However, your application will have better performance if you create one thread per processor and build queues of requests for which the application maintains the context information.".
Old article Why Too Many Threads Hurts Performance, and What to do About It
I/O includes multiple kind of operations like reading and writing data from hard drives, accessing network resources, calling web services or retrieving data from databases. Depending on the platform and on the kind of operation, asynchronous I/O will usually take advantage of any hardware or low level system support for performing the operation. This means that it will be performed with as little impact as possible on the CPU.
At application level, asynchronous I/O prevents threads from having to wait for I/O operations to complete. As soon as an asynchronous I/O operation is started, it releases the thread on which it was launched and a callback is registered. When the operation completes, the callback is queued for execution on the first available thread.
If the I/O operation is executed synchronously, it keeps its running thread doing nothing until the operation completes. The runtime doesn't know when the I/O operation will complete, so it will periodically provide some CPU time to the waiting thread - CPU time that could otherwise have been used by other threads that have actual CPU-bound operations to perform.
So, as @user1629468 mentioned, asynchronous I/O does not provide better performance but rather better scalability. This is obvious when running in contexts that have a limited number of threads available, as is the case with web applications. Web applications usually use a thread pool from which they assign threads to each request. If requests are blocked on long-running I/O operations there is the risk of depleting the pool and making the web application freeze or slow to respond.
One thing I have noticed is that asynchronous I/O isn't the best option when dealing with very fast I/O operations. In that case the benefit of not keeping a thread busy while waiting for the I/O operation to complete is not very important, and the fact that the operation is started on one thread and completed on another adds overhead to the overall execution.
You can read more detailed research I recently did on the topic of asynchronous I/O vs. multithreading here.
To presume a speed improvement due to any form of multi-computing you must presume either that multiple CPU-based tasks are being executed concurrently upon multiple computing resources (generally processor cores) or else that not all of the tasks rely upon the concurrent usage of the same resource -- that is, some tasks may depend on one system subcomponent (disk storage, say) while some tasks depend on another (receiving communication from a peripheral device) and still others may require usage of processor cores.
The first scenario is often referred to as "parallel" programming. The second scenario is often referred to as "concurrent" or "asynchronous" programming, although "concurrent" is sometimes also used to refer to the case of merely allowing an operating system to interleave execution of multiple tasks, regardless of whether such execution must take place serially or if multiple resources can be used to achieve parallel execution. In this latter case, "concurrent" generally refers to the way that execution is written in the program, rather than from the perspective of the actual simultaneity of task execution.
It's very easy to speak about all of this with tacit assumptions. For example, some are quick to make a claim such as "Asynchronous I/O will be faster than multi-threaded I/O." This claim is dubious for several reasons. First, it could be the case that some given asynchronous I/O framework is implemented precisely with multi-threading, in which case they are one and the same and it doesn't make sense to say one concept "is faster than" the other.
Second, even in the case when there is a single-threaded implementation of an asynchronous framework (such as a single-threaded event loop) you must still make an assumption about what that loop is doing. For example, one silly thing you can do with a single-threaded event loop is request for it to asynchronously complete two different purely CPU-bound tasks. If you did this on a machine with only an idealized single processor core (ignoring modern hardware optimizations) then performing this task "asynchronously" wouldn't really perform any differently than performing it with two independently managed threads, or with just one lone process -- the difference might come down to thread context switching or operating system schedule optimizations, but if both tasks are going to the CPU it would be similar in either case.
It is useful to imagine a lot of the unusual or stupid corner cases you might run into.
"Asynchronous" does not have to be concurrent, for example just as above: you "asynchronously" execute two CPU-bound tasks on a machine with exactly one processor core.
Multi-threaded execution doesn't have to be concurrent: you spawn two threads on a machine with a single processor core, or ask two threads to acquire any other kind of scarce resource (imagine, say, a network database that can only establish one connection at a time). The threads' execution might be interleaved however the operating system scheduler sees fit, but their total runtime cannot be reduced (and will be increased from the thread context switching) on a single core (or more generally, if you spawn more threads than there are cores to run them, or have more threads asking for a resource than what the resource can sustain). This same thing goes for multi-processing as well.
So neither asynchronous I/O nor multi-threading have to offer any performance gain in terms of run time. They can even slow things down.
If you define a specific use case, however, like a specific program that both makes a network call to retrieve data from a network-connected resource like a remote database and also does some local CPU-bound computation, then you can start to reason about the performance differences between the two methods given a particular assumption about hardware.
The questions to ask: How many computational steps do I need to perform and how many independent systems of resources are there to perform them? Are there subsets of the computational steps that require usage of independent system subcomponents and can benefit from doing so concurrently? How many processor cores do I have and what is the overhead for using multiple processors or threads to complete tasks on separate cores?
If your tasks largely rely on independent subsystems, then an asynchronous solution might be good. If the number of threads needed to handle it would be large, such that context switching became non-trivial for the operating system, then a single-threaded asynchronous solution might be better.
Whenever the tasks are bound by the same resource (e.g. multiple tasks need to concurrently access the same network or local resource), then multi-threading will probably introduce unsatisfactory overhead, and while single-threaded asynchrony may introduce less overhead, in such a resource-limited situation it too cannot produce a speed-up. In such a case, the only option (if you want a speed-up) is to make multiple copies of that resource available (e.g. multiple processor cores if the scarce resource is CPU; a better database that supports more concurrent connections if the scarce resource is a connection-limited database, etc.).
Another way to put it is: allowing the operating system to interleave the usage of a single resource for two tasks cannot be faster than merely letting one task use the resource while the other waits, then letting the second task finish serially. Further, the scheduler cost of interleaving means that in any real situation it actually creates a slowdown. It doesn't matter whether the interleaved usage occurs on the CPU, a network resource, a memory resource, a peripheral device, or any other system resource.
The main reason to use AIO is for scalability. When viewed in the context of a few threads, the benefits are not obvious. But when the system scales to 1000s of threads, AIO will offer much better performance. The caveat is that AIO library should not introduce further bottlenecks.
One possible implementation of non-blocking I/O is exactly what you said: a pool of background threads that do blocking I/O and notify the originating thread via some callback mechanism. In fact, this is how the AIO module in glibc works. Here are some vague details about the implementation.
While this is a good solution that is quite portable (as long as you have threads), the OS is typically able to service non-blocking I/O more efficiently. This Wikipedia article lists possible implementations besides the thread pool.
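For reference, the POSIX AIO interface that glibc implements this way looks roughly like this (a sketch; link with -lrt, and with SIGEV_THREAD the completion routine runs on one of glibc's helper threads):

    #include <aio.h>
    #include <string.h>

    static char buf[4096];

    static void on_read_done(union sigval sv) {
        struct aiocb *cb = sv.sival_ptr;
        /* aio_return(cb) now yields the byte count, like read()'s result */
    }

    void start_read(int fd) {
        static struct aiocb cb;           /* must outlive the operation */
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;
        cb.aio_sigevent.sigev_notify = SIGEV_THREAD;
        cb.aio_sigevent.sigev_notify_function = on_read_done;
        cb.aio_sigevent.sigev_value.sival_ptr = &cb;
        aio_read(&cb);                    /* returns immediately */
    }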
I am currently in the process of implementing async I/O on an embedded platform using protothreads. Non-blocking I/O makes the difference between running at 16000 fps and 160 fps. The biggest benefit of non-blocking I/O is that you can structure your code to do other things while the hardware does its thing. Even initialization of devices can be done in parallel.
Martin
In Node, multiple threads are being launched, but it's a layer down in the C++ run-time.
"So Yes NodeJS is single threaded, but this is a half truth, actually it is event-driven and single-threaded with background workers. The main event loop is single-threaded but most of the I/O works run on separate threads, because the I/O APIs in Node.js are asynchronous/non-blocking by design, in order to accommodate the event loop. "
https://codeburst.io/how-node-js-single-thread-mechanism-work-understanding-event-loop-in-nodejs-230f7440b0ea
"Node.js is non-blocking which means that all functions ( callbacks ) are delegated to the event loop and they are ( or can be ) executed by different threads. That is handled by Node.js run-time."
https://itnext.io/multi-threading-and-multi-process-in-node-js-ffa5bb5cde98 
The "Node is faster because it's non-blocking..." explanation is a bit of marketing and this is a great question. It's efficient and scaleable, but not exactly single threaded.
The improvement, as far as I know, is that asynchronous I/O uses (I'm talking about the MS system, just to clarify) so-called I/O completion ports. By using asynchronous calls the framework leverages that architecture automatically, and this is supposed to be much more efficient than the standard threading mechanism. As a personal experience I can say that your application will feel noticeably more responsive if you prefer async calls instead of blocking threads.
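A minimal sketch of that architecture (handle association and error handling omitted): worker threads block in GetQueuedCompletionStatus() and the kernel hands each completion to one of them.

    #include <windows.h>

    static HANDLE iocp;   /* created once; I/O handles get associated with it */

    static DWORD WINAPI worker(LPVOID arg) {
        DWORD bytes;
        ULONG_PTR key;    /* per-handle context chosen at association time */
        LPOVERLAPPED ov;  /* identifies the individual operation */
        for (;;) {
            if (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
                /* process the completed I/O described by (key, ov, bytes) */
            }
        }
        return 0;
    }

    void start(void) {
        iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
        for (int i = 0; i < 4; i++)       /* a small worker pool */
            CreateThread(NULL, 0, worker, NULL, 0, NULL);
        /* associate each socket/file: CreateIoCompletionPort(h, iocp, key, 0); */
    }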
Let me give you a counterexample where asynchronous I/O does not work.
I am writing a proxy similar to the one below, using boost::asio.
https://github.com/ArashPartow/proxy/blob/master/tcpproxy_server.cpp
However, the scenario in my case is that incoming (client-side) messages are fast while outgoing (server-side) messages are slow for one session; to keep up with the incoming speed, or to maximize the total proxy throughput, we have to use multiple sessions under one connection.
Thus this async I/O framework no longer works. We need a thread pool to send to the server, assigning each thread a session.

What's the difference between event-driven and asynchronous? Between epoll and AIO?

Event-driven and asynchronous are often used as synonyms. Are there any differences between the two?
Also, what is the difference between epoll and aio? How do they fit together?
Lastly, I've read many times that AIO in Linux is horribly broken. How exactly is it broken?
Thanks.
Events are one of the paradigms for achieving asynchronous execution.
But not all asynchronous systems use events. That is the semantic relation between the two: one is a generalization of the other.
epoll and aio use different metaphors:
epoll is a blocking operation (epoll_wait()) - you block the thread until some event happens and then you dispatch the event to different procedures/functions/branches in your code.
In AIO, you pass the address of your callback function (completion routine) to the system and the system calls your function when something happens.
The problem with AIO is that your callback function code runs on a system thread, and so on top of the system stack. A few problems with that, as you can imagine.
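The epoll metaphor in code (a single-threaded sketch, error handling omitted): the thread blocks in epoll_wait(), and your code - not the system - dispatches each ready descriptor.

    #include <sys/epoll.h>

    void event_loop(int listen_fd) {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        struct epoll_event ready[16];
        for (;;) {
            int n = epoll_wait(epfd, ready, 16, -1);  /* the blocking point */
            for (int i = 0; i < n; i++) {
                if (ready[i].data.fd == listen_fd) {
                    /* accept() the new connection and register it too */
                } else {
                    /* read()/write() the client fd: we dispatch, not the OS */
                }
            }
        }
    }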
They are completely different things.
The event-driven paradigm means that an object called an "event" is sent to the program whenever something happens, without that "something" having to be polled at regular intervals to discover whether it has happened. That "event" may be trapped by the program to perform some action (i.e. a "handler") - either synchronously or asynchronously.
Therefore, handling of events can either be synchronous or asynchronous. JavaScript, for example, uses a synchronous eventing system.
Asynchronous means that actions can happen independent of the current "main" execution stream. Mind you, it does NOT mean "parallel", or "different thread". An "asynchronous" action may actually run on the main thread, blocking the "main" execution stream in the meantime. So don't confuse "asynchronous" with "multi-threading".
You may say that, technically speaking, an asynchronous operation automatically assumes eventing -- at least "completed", "faulted" or "aborted/cancelled" events (one or more of these) are sent to the instigator of the operation (or the underlying O/S itself) to signal that the operation has ceased. Thus, async is always event-driven, but not the other way round.
Event-driven is a single thread where events are registered for certain scenarios. When such a scenario is encountered, the events are fired. However, even then each of the events is fired in a sequential manner. There is nothing asynchronous about it. Node.js (a web server) uses events to deal with multiple requests.
Asynchronous is basically multitasking. It can spawn multiple threads or processes to execute a certain function. It's totally different from event-driven in the sense that each thread is independent and hardly interacts with the main thread in an easy, responsive manner. Apache (a web server) uses multiple threads to deal with incoming requests.
Lastly, I've read many times that AIO in Linux is horribly broken. How exactly is it broken?
AIO as done via KAIO/libaio/io_submit comes with a lot of caveats and is tricky to use well if you want it to behave rather than silently block (e.g. it only works on certain types of fd; when using files/block devices it only actually works for direct I/O; but those are the tip of the iceberg). It did eventually gain the ability to indicate file descriptor readiness (with the 4.19 kernel), which is useful for programs using sockets.
POSIX AIO on Linux is actually a userspace threads implementation by glibc and comes with its own limitations (e.g. it's considered slow and doesn't scale well).
These days (2020) hope for doing arbitrary asynchronous I/O on Linux with less pain and tradeoffs is coming from io_uring...
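For a taste of it, a single read through the liburing helper library looks roughly like this (a sketch with error handling omitted; io_uring is completion-based, so you harvest results rather than poll for readiness):

    #include <liburing.h>

    /* One asynchronous read via io_uring (liburing helpers; sketch only). */
    int read_with_uring(int fd, char *buf, unsigned len) {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);          /* small SQ/CQ rings */

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, len, 0);  /* like pread(fd,buf,len,0) */
        io_uring_submit(&ring);                    /* hand it to the kernel */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);            /* wait for the completion */
        int res = cqe->res;                        /* bytes read or -errno */
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return res;
    }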

High Performance Socket Server using Perl

I need to write a socket server in Perl that will run on 64-bit Linux (2.6.x kernel). Is there a library that supports I/O completion ports or some equivalent on Linux?
I need to listen on multiple ports, 8000-8100. Is there a smart way of doing this?
The protocol has to use a length byte.
What threading library do you recommend? I have written something similar on Windows using a cooperative-multitasking-based thread scheduler. I mean, I want to avoid creating a thread for each socket, so that I can handle more than 10,000 simultaneous connections.
Thanks in advance.
Threading in Perl is generally not advised.
Instead, for high performance, you should consider looking into non-blocking or event-driven programming.
With regular sockets, your process blocks on every I/O operation, i.e. reading from a socket that isn't ready will put your process to sleep until data is available. With non-blocking/event-driven I/O you poll the sockets and get callbacks when the sockets are ready to be read from or written to, so a single process can multiplex many sockets, thus providing good scalable performance since you don't need to fork new processes to handle more clients.
There are many good event-based frameworks in Perl, e.g. POE and AnyEvent. POE is a specific event loop with lots of modules and features, and AnyEvent is an abstraction layer that lets you use multiple event loops in the same code.
You should also look into libev, which is similar to POE but with a lot less overhead.
Writing event-driven code is somewhat tricky at first, since you need to be careful with the blocking code you do have, e.g. CPU-intensive operations, or libraries which aren't non-blocking. Since you have only one process, if it's busy doing something it can't do anything else - like poll on the sockets and issue callbacks.
So, if you need both non-blocking I/O and intensive computations, one way to do it is to create worker forks and use non-blocking pipes to communicate between them and your event loop, which is really straightforward with the above libraries.
