Is non-blocking I/O really faster than multi-threaded blocking I/O? How?

I searched the web for technical details about blocking I/O and non-blocking I/O, and I found several people stating that non-blocking I/O is faster than blocking I/O. For example, in this document.
If I use blocking I/O, then of course the thread that is currently blocked can't do anything else... because it's blocked. But as soon as a thread blocks, the OS can switch to another thread and not switch back until there is something for the blocked thread to do. So as long as there is another thread on the system that needs CPU and is not blocked, there should not be any more CPU idle time than with an event-based non-blocking approach, should there?
Besides reducing CPU idle time, I see one more option to increase the number of tasks a computer can perform in a given time frame: reduce the overhead introduced by switching threads. But how can this be done? And is the overhead large enough to show measurable effects? Here is an idea of how I picture it working:
1. To load the contents of a file, an application delegates this task to an event-based I/O framework, passing a callback function along with a filename.
2. The event framework delegates to the operating system, which programs a DMA controller of the hard disk to write the file directly to memory.
3. The event framework allows further code to run.
4. Upon completion of the disk-to-memory copy, the DMA controller causes an interrupt.
5. The operating system's interrupt handler notifies the event-based I/O framework about the file being completely loaded into memory. How does it do that? Using a signal?
6. The code that is currently running within the event I/O framework finishes.
7. The event-based I/O framework checks its queue, sees the operating system's message from step 5, and executes the callback it got in step 1.
Is that how it works? If not, how does it work? Does that mean the event system can work without ever explicitly touching the stack (unlike a real scheduler, which would need to back up the stack and copy another thread's stack into memory while switching threads)? How much time does this actually save? Is there more to it?
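For a concrete (simplified) picture of steps 1 to 7 on a POSIX system, here is a minimal sketch using the POSIX AIO API. The file name and buffer size are arbitrary placeholders, error handling is omitted, and a real event framework would do useful work instead of spinning:

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void on_file_loaded(char *buf, ssize_t n) {  /* the callback from step 1 */
        printf("loaded %zd bytes\n", n);
    }

    int main(void) {                       /* link with -lrt on older glibc */
        static char buf[4096];
        int fd = open("data.txt", O_RDONLY);   /* "data.txt" is a placeholder */
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = sizeof buf;

        aio_read(&cb);             /* steps 1-2: hand the read to the OS, return at once */

        while (aio_error(&cb) == EINPROGRESS) {
            /* step 3: the "event loop" would do other work here */
        }

        on_file_loaded(buf, aio_return(&cb));  /* steps 4-7: transfer done, run callback */
        close(fd);
        return 0;
    }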

The biggest advantage of non-blocking or asynchronous I/O is that your thread can continue its work in parallel. Of course you can achieve this with an additional thread as well. As you stated, for the best overall (system) performance I guess it would be better to use asynchronous I/O and not multiple threads (thus reducing thread switching).
Let's look at possible implementations of a network server program that has to handle 1000 clients connected in parallel:
1. One thread per connection (can be blocking I/O, but can also be non-blocking I/O).
Each thread requires memory resources (kernel memory, too!), which is a disadvantage. And every additional thread means more work for the scheduler.
2. One thread for all connections.
This takes load off the system because we have fewer threads. But it also prevents you from using the full performance of your machine, because you might end up driving one processor to 100% while letting all the other processors idle.
3. A few threads, where each thread handles some of the connections (a minimal sketch follows below).
This takes load off the system because there are fewer threads, and it can use all available processors. On Windows this approach is supported by the Thread Pool API.
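As a rough sketch of the third design, assuming POSIX sockets: a fixed set of worker threads, each running its own poll() loop over its share of already-accepted connections. The accept loop, error handling, and connection teardown are omitted:

    #include <poll.h>
    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define WORKERS 4
    #define MAXCONN 256

    struct worker {
        struct pollfd fds[MAXCONN];  /* this worker's share; events = POLLIN when assigned */
        int nfds;
    };

    static void *worker_loop(void *arg) {
        struct worker *w = arg;
        char buf[4096];
        for (;;) {
            poll(w->fds, w->nfds, -1);  /* sleep until any of our sockets is readable */
            for (int i = 0; i < w->nfds; i++) {
                if (w->fds[i].revents & POLLIN) {
                    ssize_t n = read(w->fds[i].fd, buf, sizeof buf);
                    if (n > 0)
                        write(w->fds[i].fd, buf, n);  /* echo, a stand-in for real work */
                }
            }
        }
        return NULL;
    }

    int main(void) {
        static struct worker workers[WORKERS];
        pthread_t tid[WORKERS];
        for (int i = 0; i < WORKERS; i++)
            pthread_create(&tid[i], NULL, worker_loop, &workers[i]);
        /* elsewhere: accept() connections, set fds[n].fd and .events = POLLIN,
           and assign each new fd to workers[fd % WORKERS] */
        for (int i = 0; i < WORKERS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }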
Of course, having more threads is not per se a problem. As you might have noticed, I chose quite a high number of connections/threads. I doubt that you'll see any difference between the three possible implementations if we are talking about only a dozen threads (this is also what Raymond Chen suggests in the MSDN blog post Does Windows have a limit of 2000 threads per process?).
On Windows, using unbuffered file I/O means that writes must be of a size that is a multiple of the sector size (and use correspondingly aligned buffers). I have not tested it, but it sounds like this could also positively affect write performance for buffered synchronous and asynchronous writes.
The steps 1 to 7 you describe give a good idea of how it works. On Windows, the operating system will inform you about completion of an asynchronous I/O operation (WriteFile with an OVERLAPPED structure) using an event or a callback. Callback functions are only called, for example, when your code calls WaitForMultipleObjectsEx with bAlertable set to true.
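A minimal sketch of the callback variant in C, assuming Win32 (the file name is a placeholder and error handling is omitted). The completion routine only runs while the thread is in an alertable wait:

    #include <windows.h>
    #include <stdio.h>

    static char buf[4096];

    static VOID CALLBACK on_read_done(DWORD err, DWORD bytes, LPOVERLAPPED ov) {
        printf("read %lu bytes (error %lu)\n", bytes, err);
    }

    int main(void) {
        HANDLE h = CreateFileA("data.txt", GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        OVERLAPPED ov = {0};                    /* read from offset 0 */
        ReadFileEx(h, buf, sizeof buf, &ov, on_read_done);  /* returns immediately */

        /* do other work here; the callback fires only during an alertable wait */
        SleepEx(INFINITE, TRUE);                /* TRUE = alertable, like bAlertable */

        CloseHandle(h);
        return 0;
    }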
Some more reading on the web:
- Multiple Threads in the User Interface on MSDN, which also briefly covers the cost of creating threads
- The section Threads and Thread Pools says: "Although threads are relatively easy to create and use, the operating system allocates a significant amount of time and other resources to manage them."
- The CreateThread documentation on MSDN says: "However, your application will have better performance if you create one thread per processor and build queues of requests for which the application maintains the context information."
- The old article Why Too Many Threads Hurts Performance, and What to do About It

I/O includes many kinds of operations, like reading and writing data on hard drives, accessing network resources, calling web services, or retrieving data from databases. Depending on the platform and on the kind of operation, asynchronous I/O will usually take advantage of any hardware or low-level system support for performing the operation. This means that it will be performed with as little impact as possible on the CPU.
At the application level, asynchronous I/O prevents threads from having to wait for I/O operations to complete. As soon as an asynchronous I/O operation is started, it releases the thread on which it was launched, and a callback is registered. When the operation completes, the callback is queued for execution on the first available thread.
If the I/O operation is executed synchronously, its thread does nothing until the operation completes. The blocked thread still ties up memory and scheduler bookkeeping, and depending on how the wait is implemented, the runtime may even poll or spin, consuming CPU time that could otherwise be used by threads that have actual CPU-bound work to perform.
So, as @user1629468 mentioned, asynchronous I/O does not provide better performance but rather better scalability. This is obvious when running in contexts that have a limited number of threads available, as is the case with web applications. Web applications usually use a thread pool from which they assign threads to each request. If requests are blocked on long-running I/O operations, there is the risk of depleting the pool and making the web application freeze or become slow to respond.
One thing I have noticed is that asynchronous I/O isn't the best option when dealing with very fast I/O operations. In that case, the benefit of not keeping a thread busy while waiting for the I/O operation to complete is not very important, and the fact that the operation is started on one thread and completed on another adds overhead to the overall execution.
You can read more detailed research I have recently done on the topic of asynchronous I/O vs. multithreading here.

To presume a speed improvement from any form of multi-computing, you must presume either that multiple CPU-based tasks are being executed concurrently on multiple computing resources (generally processor cores), or that not all of the tasks rely on the concurrent usage of the same resource; that is, some tasks may depend on one system subcomponent (disk storage, say), while some depend on another (receiving communication from a peripheral device), and still others may require processor cores.
The first scenario is often referred to as "parallel" programming. The second scenario is often referred to as "concurrent" or "asynchronous" programming, although "concurrent" is sometimes also used to refer to the case of merely allowing an operating system to interleave execution of multiple tasks, regardless of whether such execution must take place serially or if multiple resources can be used to achieve parallel execution. In this latter case, "concurrent" generally refers to the way that execution is written in the program, rather than from the perspective of the actual simultaneity of task execution.
It's very easy to speak about all of this with tacit assumptions. For example, some are quick to make a claim such as "asynchronous I/O will be faster than multi-threaded I/O." This claim is dubious for several reasons. First, it could be the case that some given asynchronous I/O framework is implemented precisely with multi-threading, in which case they are one and the same, and it doesn't make sense to say one concept "is faster than" the other.
Second, even in the case where there is a single-threaded implementation of an asynchronous framework (such as a single-threaded event loop), you must still make an assumption about what that loop is doing. For example, one silly thing you can do with a single-threaded event loop is ask it to asynchronously complete two different purely CPU-bound tasks. If you did this on a machine with only an idealized single processor core (ignoring modern hardware optimizations), then performing this task "asynchronously" wouldn't really perform any differently than performing it with two independently managed threads, or with just one lone process; the difference might come down to thread context switching or scheduler optimizations, but if both tasks are going to the CPU, it would be similar in either case.
It is useful to imagine a lot of the unusual or stupid corner cases you might run into.
"Asynchronous" does not have to be concurrent, for example just as above: you "asynchronously" execute two CPU-bound tasks on a machine with exactly one processor core.
Multi-threaded execution doesn't have to be concurrent: you spawn two threads on a machine with a single processor core, or ask two threads to acquire some other kind of scarce resource (imagine, say, a network database that can only establish one connection at a time). The threads' execution might be interleaved however the operating system scheduler sees fit, but their total runtime cannot be reduced (and will be increased by the thread context switching) on a single core; more generally, this holds whenever you spawn more threads than there are cores to run them, or have more threads asking for a resource than the resource can sustain. The same goes for multi-processing.
So neither asynchronous I/O nor multi-threading has to offer any performance gain in terms of run time. They can even slow things down.
If you define a specific use case, however, like a specific program that both makes a network call to retrieve data from a network-connected resource like a remote database and also does some local CPU-bound computation, then you can start to reason about the performance differences between the two methods given a particular assumption about hardware.
The questions to ask: How many computational steps do I need to perform and how many independent systems of resources are there to perform them? Are there subsets of the computational steps that require usage of independent system subcomponents and can benefit from doing so concurrently? How many processor cores do I have and what is the overhead for using multiple processors or threads to complete tasks on separate cores?
If your tasks largely rely on independent subsystems, then an asynchronous solution might be good. If the number of threads needed to handle it would be large, such that context switching became non-trivial for the operating system, then a single-threaded asynchronous solution might be better.
Whenever the tasks are bound by the same resource (e.g. multiple tasks need to concurrently access the same network or local resource), then multi-threading will probably introduce unsatisfactory overhead, and while single-threaded asynchrony may introduce less overhead, in such a resource-limited situation it too cannot produce a speed-up. In such a case, the only option (if you want a speed-up) is to make multiple copies of that resource available (e.g. multiple processor cores if the scarce resource is CPU; a better database that supports more concurrent connections if the scarce resource is a connection-limited database; etc.).
Another way to put it is: allowing the operating system to interleave the usage of a single resource for two tasks cannot be faster than merely letting one task use the resource while the other waits, then letting the second task finish serially. Further, the scheduler cost of interleaving means that in any real situation it actually creates a slowdown. It doesn't matter whether the interleaved usage is of the CPU, a network resource, a memory resource, a peripheral device, or any other system resource.

The main reason to use AIO is scalability. When viewed in the context of a few threads, the benefits are not obvious. But when the system scales to thousands of threads, AIO will offer much better performance. The caveat is that the AIO library should not introduce further bottlenecks.

One possible implementation of non-blocking I/O is exactly what you said: a pool of background threads that do blocking I/O and notify the originating thread of the I/O via some callback mechanism. In fact, this is how the AIO module in glibc works. Here are some vague details about the implementation.
While this is a good solution that is quite portable (as long as you have threads), the OS is typically able to service non-blocking I/O more efficiently. This Wikipedia article lists possible implementations besides the thread pool.
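As a rough sketch of that thread-pool idea in C (not glibc's actual code, which keeps a small pool of reusable workers rather than spawning a thread per request): a worker thread performs the blocking read and then invokes the caller's callback:

    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct request {
        int fd;
        char buf[4096];
        void (*done)(struct request *, ssize_t);  /* caller's callback */
    };

    static void *io_worker(void *arg) {
        struct request *r = arg;
        ssize_t n = read(r->fd, r->buf, sizeof r->buf);  /* the blocking call... */
        r->done(r, n);                      /* ...then notify the originator */
        return NULL;
    }

    /* returns immediately; the callback later fires from the worker thread */
    static void async_read(struct request *r) {
        pthread_t t;
        pthread_create(&t, NULL, io_worker, r);
        pthread_detach(t);
    }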

I am currently in the process of implementing async I/O on an embedded platform using protothreads. Non-blocking I/O makes the difference between running at 16000 fps and 160 fps. The biggest benefit of non-blocking I/O is that you can structure your code to do other things while the hardware does its thing. Even initialization of devices can be done in parallel.
Martin

In Node, multiple threads are being launched, but it's a layer down in the C++ runtime.
"So Yes NodeJS is single threaded, but this is a half truth, actually it is event-driven and single-threaded with background workers. The main event loop is single-threaded but most of the I/O works run on separate threads, because the I/O APIs in Node.js are asynchronous/non-blocking by design, in order to accommodate the event loop. "
https://codeburst.io/how-node-js-single-thread-mechanism-work-understanding-event-loop-in-nodejs-230f7440b0ea
"Node.js is non-blocking which means that all functions ( callbacks ) are delegated to the event loop and they are ( or can be ) executed by different threads. That is handled by Node.js run-time."
https://itnext.io/multi-threading-and-multi-process-in-node-js-ffa5bb5cde98 
The "Node is faster because it's non-blocking..." explanation is a bit of marketing and this is a great question. It's efficient and scaleable, but not exactly single threaded.

The improvement, as far as I know, is that asynchronous I/O uses (I'm talking about the MS system, just to clarify) so-called I/O completion ports. By using asynchronous calls, the framework leverages this architecture automatically, and this is supposed to be much more efficient than the standard threading mechanism. As a personal experience, I can say that you would noticeably feel your application more reactive if you prefer async calls instead of blocking threads.
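A minimal sketch of the completion-port pattern, assuming Win32 (handle setup and error checks omitted): instead of one thread blocking per operation, one or a few threads pull completions from the port:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* create a port; overlapped file/socket handles get associated with it */
        HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
        /* ... call CreateIoCompletionPort(handle, port, key, 0) per handle,
           then issue ReadFile/WriteFile with OVERLAPPED structures ... */

        DWORD bytes;
        ULONG_PTR key;
        LPOVERLAPPED ov;
        for (;;) {
            /* blocks until any associated operation completes */
            if (GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE))
                printf("operation on key %lu finished: %lu bytes\n",
                       (unsigned long)key, (unsigned long)bytes);
        }
    }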

Let me give you a counterexample where asynchronous I/O does not work.
I am writing a proxy similar to the one below, using boost::asio.
https://github.com/ArashPartow/proxy/blob/master/tcpproxy_server.cpp
However, in my case, incoming (client-side) messages are fast while outgoing (server-side) messages are slow for one session. To keep up with the incoming speed, or to maximize the total proxy throughput, we have to use multiple sessions under one connection.
Thus this async I/O framework does not work anymore. We do need a thread pool to send to the server, assigning each thread a session.

Related

Does multithreading actually work in uniprocessor environment

Suppose we have a process with multiple threads in a uniprocessor.
Now I know that if we have several processes, only one of them will be processed at a time on a uniprocessor, and hence the processes are not concurrent.
If my understanding is correct, similarly each thread will be processed one at a time, not concurrently, on a uniprocessor. Is this statement true? If so, does multithreading just mean having more than one thread in a process, rather than running multiple threads at a time? And does that mean there's no benefit to creating user threads in a uniprocessor environment?
TL;DR: threads are switched more often than processes, and in real time we get the effect of concurrency because it happens really fast.
when you wrote:
each thread will be processed one at a time, not concurrently, on a uniprocessor
Notice the word "concurrent", there is no real concurrency in uni processor, there is only effect of that thanks to the multiple number of context switches between processes.
Let's clarify something here: a single CPU core can handle one thread at a given time. Each process has a main thread and (if needed) more threads running together. If a process A is now running and it has 3 threads, A1 (the main thread), A2, and A3, all three share the core as long as process A is being scheduled. When a context switch occurs, process A is no longer running, and now process B will run with its threads.
About this statement:
there's no benefit of creating user threads in a uni processor environment
That is not true. There is a benefit to creating threads: they are easier to create ("spawn", as the books say) and they share the process heap memory. Creating a subprocess ("child", as the books say) is an overhead compared to a thread, because a process needs to have its own memory. For example, each Google Chrome tab is a process, not a thread, but each tab has multiple threads running concurrently, each with little responsibility.
If you are still somehow running a computer with just one single-core CPU, then you would be correct to observe that only one thread can be physically executing at one time. But that does not negate the value of breaking up the application into multiple threads and/or processes.
The essential benefit is concurrency. When one thread is waiting (e.g. for an input/output operation to complete), there is something else for the CPU to be doing in the meantime: it can be running a different thread that isn't waiting. With a carefully designed application, you can get much better utilization of every part of the hardware, more parallelism, and thus, more throughput.
My favorite go-to example is a fast food restaurant. About a dozen workers, each one doing different things, cooperate to bring your order to you. Even if one of them (say, "the fry guy") is standing around, someone else always has something to do. Several orders are in-process at once. This overlap, this "concurrency," is what you are shooting for – regardless of how many CPUs you have.
Multithreading is also commonly used in GUI applications that also need to do some kind of "heavy lifting." One thread handles the GUI interaction (and has no other real responsibilities) while other threads, with a slightly inferior priority (or "niceness"), do the lifting. When a GUI event comes in, the GUI thread pre-empts the others and responds to it immediately, then of course goes right back to sleep again. In this way the GUI always remains very responsive; even though the other threads are doing "heavy lifting" things, GUI messages are still handled very promptly. (I picked up about a 25% performance improvement by retooling an older application to use this approach, because the application was no longer "polling" for GUI events.)
The first question I ask about any thread is, "what does it wait for?" To me, a thread is defined by what event it waits for and what it does when that event happens.
Threads were in widespread use for at least a decade before multi-processor computers became commercially available. They are useful when you want to write a program that has to respond to unsynchronized events that come from multiple different sources. There are a few different ways to model a program like that. One way is to have a different thread wait on each different event source. The next most popular is an event-driven architecture, in which there is a main loop that waits for all events and calls different event handler functions for each of the different kinds of event.
The multi-threaded style of program often is easier to read* because there's usually different activities going on inside the program, and the state of each activity can be implicit in the context (i.e., registers and call stack) of the thread that's driving it, while in the event-driven model, each activity's state must be explicitly encoded in some object.
The implicit-in-the-context way of keeping the state is much closer to the procedural style of coding a single activity that we learn as beginners.
*Easier to read does not mean that the code is easy to write without making bad and non-obvious mistakes!!
The main impetus for developing threads was Ada compliance. Prior to that, different operating systems had their own ways of handling multiple things at once. In Unix, the way to do more than one thing was to spin off a new process. In VMS, it was software interrupts (aka Asynchronous System Traps, or Asynchronous Procedure Calls in Windows). In those days (the 1970s), multiprocessor systems were rare.
One of the goals of Ada was to have a system independent way of doing things. It adopted the "task" which is effectively a thread. In order to support Ada, compiler developers had to include task (thread) libraries.
With the rise of multiprocessors, operating systems started to make threads (rather than processes) the basic schedulable unit in a system.
Threads then give a way for programs to handle multiple things simultaneously, even if there is only one processor. Sadly, support for threads in programming languages has been woefully lacking. Ada is the only major language I can think of that has real support for threads (tasks). Thread support in Java, for example, is a complete, sick joke. The result is threads are not as effective in practice as they could be.

What really is asynchronous computing?

I've been reading (and working) quite a bit with massively multi-threaded applications and with I/O, and I've found that the term asynchronous has become a sort of catch-all for multiple vague ideas. I'm wondering if I understand it correctly. The way I see it is that there are two main branches of "asynchronicity".
Asynchronous I/O. Such as network read/write. What this really boils down to is efficient parallel processing between multiple CPUs, such as your main CPU and your NIC CPU. The idea is to have multiple processors running in parallel, exchanging data, without blocking while waiting for the other to finish and return the results of its job.
Minimizing context-switching penalties by minimizing the use of threads. This seems to be what the .NET framework is focusing on with its async/await features. Instead of spawning/closing/blocking threads, break parallel jobs into tasks, and use a software task scheduler to keep a pool of threads as busy as possible without resorting to spawning new threads.
These seem like two entirely separate concepts with no similarities that could tie them together, but are both referred to by the same "asynchronous computing" vocabulary.
Am I understanding all of this correctly?
Asynchronous basically means not blocking, i.e. not having to wait for an operation to complete.
Threads are just one way of accomplishing that. There are many ways of doing this, from the hardware level to the OS level to the software level.
Someone with more experience than me can give examples of asynchronicity not related to threads.
What this really boils down to is efficient parallel processing between multiple CPUs, such as your main CPU and your NIC CPU. The idea is to have multiple processors running in parallel...
Asynchronous programming is not all about multi-core CPUs and parallelism: consider a single-core CPU with just one thread creating email messages and sending them. In a synchronous fashion, it would spend a few microseconds creating a message and a lot more time sending it through the network, and only then create the next message. But in an asynchronous program, the thread can create a new message while the previous one is being sent through the network. One implementation of that kind of program can use the .NET async/await feature, where you can have just one thread. But even a blocking I/O program can be considered asynchronous: if the main thread creates the messages and queues them in a buffer, and another thread pulls them from the buffer and sends them with blocking I/O, then from the main thread's point of view it's completely async.
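A sketch of that queue-based arrangement in C with POSIX threads (a fixed-size ring buffer stands in for the message queue, and the slow network send is simulated with sleep):

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define QSIZE 8
    static int queue[QSIZE];
    static int head, tail, count;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t nonfull = PTHREAD_COND_INITIALIZER;

    static void *sender(void *arg) {          /* blocking "send" thread */
        for (;;) {
            pthread_mutex_lock(&lock);
            while (count == 0)
                pthread_cond_wait(&nonempty, &lock);
            int msg = queue[head];
            head = (head + 1) % QSIZE;
            count--;
            pthread_cond_signal(&nonfull);
            pthread_mutex_unlock(&lock);
            sleep(1);                         /* stand-in for the slow network send */
            printf("sent message %d\n", msg);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, sender, NULL);
        for (int msg = 0; ; msg++) {          /* main thread: creating is fast... */
            pthread_mutex_lock(&lock);
            while (count == QSIZE)
                pthread_cond_wait(&nonfull, &lock);
            queue[tail] = msg;                /* ...and enqueueing never waits on the network */
            tail = (tail + 1) % QSIZE;
            count++;
            pthread_cond_signal(&nonempty);
            pthread_mutex_unlock(&lock);
        }
    }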
.NET async/await just uses the OS APIs, which are already async: reading/writing a file and sending/receiving data over the network are all async anyway; the OS doesn't block on them (the drivers themselves are async).
Asynchronous is a general term which does not have a single, widely accepted meaning. Different domains give it different meanings.
For instance, async I/O means that instead of blocking on an I/O call, something else happens. That something else can be many different things, but it usually involves some sort of notification of call completion. The details differ: for instance, the notification might be built into the call itself, as in MS Completion Ports (if memory serves). Or it can be a check you perform before making the call, so that the call cannot block; this is what poll() and friends do.
Async might also simply mean parallel execution. For instance, one might say that "the database is updated asynchronously", meaning that there is a dedicated thread which handles database connectivity, and that thread does not slow down the main processing thread.

Java Threads and number of Cores

Is it recommended that the number of threads in a Java application should be less than the number of CPU cores?
If so, why is this the case, and what are the implications of using more threads than the number of CPU cores?
You will probably not get a definitive answer to the question of, generally speaking, how many threads an app should have relative to the number of cores the underlying computer has.
One may also argue that, at the time of PaaS software design and/or elastic clusters, the notion of a fixed number of cores for any given process might be overrated.
Still, the first part of your question:
Is it recommended that the number of threads in a Java application should be less than the number of CPU cores?
This has a definitive answer, which is "no" (once more: as a general rule). And the reason why, in short, is that not all created threads are typically running (or, maybe more importantly, runnable) at once, meaning there is an opportunity to optimize here.
To support this discussion, I'll contrast two ways of designing apps; you could call them "classical" versus "reactive", although this is not a generally accepted division. Still, let's use it as a frame.
Classical application design
I label as classical those applications that rely mostly on "blocking" calls and/or the "thread per request" pattern. Consider the traditional way I/O is done (socket communication like HTTP or database connections, hard-drive-based file reading, ...): your app thread calls some kind of read or write method, which usually triggers an OS-level call that blocks your app thread while some device buffer is filled at the OS level (say, read from a disk). Once the buffer has received enough data, the OS signals your Java app and thread to resume activity, and the read method returns with the data from the buffer.
The whole time the OS is working (usually just a tiny fraction of a second, but still a large amount of time compared to your typical GHz CPU speed), your Java thread is blocked, waiting for the OS to signal that it can resume. This happens all the time. A code profiler tool, like JProfiler or YourKit, can help you measure this time. If you do so, you'll notice that in many apps doing I/O, this is a significant part of the so-called "wall time" or "clock time" that is spent... waiting.
So we have one thread waiting, meaning it is not using any CPU time. It can be scheduled out, and the OS is free to give CPU time to anybody else.
Suppose this is a one-core CPU: then NOW would be a good time to have another thread to feed the CPU. So having two or more threads could be a good design to maximize CPU usage even on a single-core CPU, and get the most out of your hardware.
Most "classical" web applications are typically subject to this type of CPU underuse if you follow the rule of "one thread per CPU core", because Socket communications (or more typically : the time spent waiting for a response to your SQL queries) will incur so much blocking.
If you raise the number of threads your app has, then even if one or two long-running requests remain waiting, other, faster requests will have runnable threads to run them, and you'll get better CPU usage and better performance (more concurrent requests). That is... until something else reaches saturation (too many heavy requests on your DB, too many simultaneous hard-drive reads/writes...).
Reactive app design
Recognizing this typical behavior of apps, and using different sets of OS features, some application frameworks now use non-blocking patterns (even for I/O) to mitigate the above issues. Examples in the Java ecosystem are NIO-based networking stacks like Netty, or actor pattern implementations like Akka.
In a typical "reactive" app, one usually abandons the "thread per request" pattern that we have in classical apps (meaning one thread is responsible for handling everything from start to finish of a given user request, and waiting when need be for external resources to become available), in favor of a vastly more modular, and non-blocking approach.
Threads are given finer-grained bits of work to do, and each thread will hand off work to one another, using callbacks to hear back when the work they depend upon is done. This "handing off" of units of work means each thread can quickly grab new units of work it is able to handle. This means one of two things: you achieve higher CPU usage with far fewer threads in your app (because each can grab work more efficiently, instead of just sitting "waiting"), or you can instantiate many, many more threads because they'll mostly be waiting (not saturating the CPUs), and the dynamic hand-off will still allow for good CPU usage.
Conclusion
Anyway, you don't choose the number of threads solely based on the number of available cores. The nature of your implementation and workload dictates the optimal number of threads to create.
On a classical app-design philosophy, the two numbers are more closely related than on a reactive one, but we still have different profiles:
- A very simple server app can accommodate many more threads than CPU cores, because it will allow for better throughput (the limit being, say, the outgoing network bandwidth).
- A SQL-heavy app should be scaled to the point where your app server saturates the SQL backend. As your app server will be mostly waiting for your SQL server, that is the limit.
- A mixed application, consisting of some SQL-heavy work and some lightweight work, will need precision tuning, because you don't want the stuck threads (those blocked waiting for the DB) starving the light requests that could be served more rapidly.
- A compute-intensive program (say, a cryptography service) will probably benefit from a number of threads close to the number of CPU cores (if your algorithm is implemented in a classical way), because creating more threads than you are able to run is pointless. In an actor-based implementation, creating more threads could actually be a win.
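As a rough, back-of-the-envelope heuristic for the classical case (popularized by Brian Goetz in Java Concurrency in Practice), the thread count that keeps the cores busy is approximately:

    threads ≈ cores × (1 + wait time / compute time)

A purely CPU-bound workload (wait ≈ 0) wants about one thread per core, while a workload that waits 9 ms for every 1 ms of computation can keep one core busy with roughly 10 threads.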

How do user level threads (ULTs) and kernel level threads (KLTs) differ with regards to concurrent execution?

Here's what I understand; please correct/add to it:
In pure ULTs, the multithreaded process itself does the thread scheduling. So the kernel essentially does not notice the difference and considers it a single-threaded process. If one thread makes a blocking system call, the entire process is blocked. Even on a multicore processor, only one thread of the process would be running at a time, unless the process is blocked. I'm not sure how ULTs are much help, though.
In pure KLTs, even if a thread is blocked, the kernel schedules another (ready) thread of the same process. (In case of pure KLTs, I'm assuming the kernel creates all the threads of the process.)
Also, when using a combination of ULTs and KLTs, how are ULTs mapped onto KLTs?
Your analysis is correct. The OS kernel has no knowledge of user-level threads. From its perspective, a process is an opaque black box that occasionally makes system calls. Consequently, if that program has 100,000 user-level threads but only one kernel thread, then the process can only run one user-level thread at a time, because there is only one kernel-level thread associated with it. On the other hand, if a process has multiple kernel-level threads, then it can execute multiple commands in parallel on a multicore machine.
A common compromise between these is to have a program request some fixed number of kernel-level threads, then have its own thread scheduler divvy up the user-level threads onto these kernel-level threads as appropriate. That way, multiple ULTs can execute in parallel, and the program can have fine-grained control over how threads execute.
As for how this mapping works - there are a bunch of different schemes. You could imagine that the user program uses any one of multiple different scheduling systems. In fact, if you do this substitution:
Kernel thread <---> Processor core
User thread <---> Kernel thread
Then any scheme the OS could use to map kernel threads onto cores could also be used to map user-level threads onto kernel-level threads.
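To make the user-level side concrete, here is a minimal sketch of cooperative user-level threads in C using the (obsolescent but illustrative) ucontext API. Both "threads" run on one kernel thread, the "scheduler" is just swapcontext, and the kernel never sees any of this switching:

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, t1_ctx, t2_ctx;

    static void task1(void) {
        printf("task1: part 1\n");
        swapcontext(&t1_ctx, &t2_ctx);  /* yield to task2: a user-level context switch */
        printf("task1: part 2\n");
        swapcontext(&t1_ctx, &main_ctx);
    }

    static void task2(void) {
        printf("task2: part 1\n");
        swapcontext(&t2_ctx, &t1_ctx);  /* yield back to task1 */
    }

    int main(void) {
        static char stack1[64 * 1024], stack2[64 * 1024];

        getcontext(&t1_ctx);
        t1_ctx.uc_stack.ss_sp = stack1;
        t1_ctx.uc_stack.ss_size = sizeof stack1;
        t1_ctx.uc_link = &main_ctx;
        makecontext(&t1_ctx, task1, 0);

        getcontext(&t2_ctx);
        t2_ctx.uc_stack.ss_sp = stack2;
        t2_ctx.uc_stack.ss_size = sizeof stack2;
        t2_ctx.uc_link = &main_ctx;
        makecontext(&t2_ctx, task2, 0);

        swapcontext(&main_ctx, &t1_ctx);  /* start task1; prints task1/task2/task1 */
        return 0;
    }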
Hope this helps!
Before anything else, templatetypedef's answer is beautiful; I simply wanted to extend it a little.
There is one area where I felt the need to expand a little: combinations of ULTs and KLTs. To understand the importance of this (what Wikipedia labels hybrid threading), consider the following examples:
Consider a multi-threaded program (multiple KLTs) where there are more KLTs than available logical cores. In order to use every core efficiently, as you mentioned, you want the scheduler to switch out KLTs that are blocking for ones that are in a ready state and not blocking. This ensures the core reduces its idle time. Unfortunately, switching KLTs is expensive for the scheduler, and it consumes a relatively large amount of CPU time.
This is one area where hybrid threading can be helpful. Consider a multi-threaded program with multiple KLTs and ULTs. Just as templatetypedef noted, only one ULT can be running at one time per KLT. If a ULT is blocking, we still want to switch it out for one which is not blocking. Fortunately, ULTs are much more lightweight than KLTs, in the sense that fewer resources are assigned to a ULT and they require no interaction with the kernel scheduler. Essentially, it is almost always quicker to switch out ULTs than it is to switch out KLTs. As a result, we are able to significantly reduce a core's idle time relative to the first example.
Now, of course, all of this depends on the threading library being used to implement the ULTs. There are two ways (that I can come up with) of "mapping" ULTs to KLTs.
A collection of ULTs for all KLTs
This situation is ideal on a shared-memory system. There is essentially a "pool" of ULTs to which each KLT has access. Ideally, the threading library scheduler would assign ULTs to each KLT upon request, as opposed to the KLTs accessing the pool individually. The latter could cause race conditions or deadlocks if not implemented with locks or something similar.
A collection of ULTs for each KLT (Qthreads)
This situation is ideal on a distributed-memory system. Each KLT would have a collection of ULTs to run. The drawback is that the user (or the threading library) would have to divide the ULTs between the KLTs. This could result in load imbalance, since it is not guaranteed that all ULTs will have the same amount of work to complete, or that they will complete in roughly the same amount of time. The solution to this is allowing ULT migration; that is, migrating ULTs between KLTs.

Thread vs async execution. What's different?

I believed any kind of asynchronous execution makes a thread in an invisible area. But if so,
async code does not offer any performance gain over threaded code.
But I can't understand why so many developers are making so many features async.
Could you explain the difference between them and their costs?
The purpose of an asynchronous execution is to prevent the code calling the asynchronous method (the foreground code) from being blocked. This allows your foreground code to go on doing useful work while the asynchronous thread is performing your requested work in the background. Without asynchronous execution, the foreground code must wait until the background task is completed before it can continue executing.
The cost of an asynchronous execution is the same as that of any other task running on a thread.
Typically, an async result object is registered with the foreground code. The async result object can either raise an event when the background task is completed, or the foreground code can periodically check it to see whether its completion flag has been set.
Concurrency does not necessarily require threads.
In Linux, for example, you can perform non-blocking syscalls. Using this type of call, you can, for instance, start a number of network reads. Your code can keep track of the reads manually (using handles in a list or similar) and periodically ask the OS if new data is available on any of the connections. Internally, the OS also keeps a list of ongoing reads. Using this technique, you can thus achieve concurrency without any (extra) threads, neither in your program nor in the OS.
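A sketch of that manual tracking, assuming Linux/POSIX sockets: the descriptors are put into non-blocking mode, and the code periodically retries reads, treating EAGAIN as "no data yet":

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* put a descriptor into non-blocking mode */
    static void make_nonblocking(int fd) {
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    }

    /* one pass over all tracked connections; returns how many had data */
    static int poll_connections(const int *fds, int nfds, char *buf, size_t len) {
        int ready = 0;
        for (int i = 0; i < nfds; i++) {
            ssize_t n = read(fds[i], buf, len);
            if (n > 0) {
                ready++;                  /* got data: hand it to the application */
            } else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK) {
                perror("read");           /* a real error, not just "nothing yet" */
            }
        }
        return ready;
    }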
If you use threads and blocking I/O, you would typically start one thread per read. In this scenario, the OS will instead have a list of ongoing threads, which it parks when a thread tries to read data and none is available. Threads are resumed as data becomes available.
Having the OS switch between threads might involve slightly more overhead in the form of context switching: switching the program counter and register contents. But the real deal-breaker is usually the stack allocation per thread, which is several megabytes by default on Linux. If you have a lot of concurrency in your program, this might push you in the direction of using non-blocking calls to handle more concurrency per thread.
So it is possible to do async programming without threads. If you want to do async programming using only blocking OS calls, you need to dedicate a thread to do the blocking while you continue. But if you use non-blocking calls, you can do a lot of concurrent things with just a single thread. Have a look at Node.js, which has great support for many concurrent connections while being single-threaded for most operations.
Also check out Golang, which achieves a similar effect using a sort of green threads called goroutines. Multiple goroutines run concurrently on the same OS thread, and they are frugal with stack memory, pushing the limit much further.
Async code does not offer any performance gain over threaded code.
Asynchronous execution is one of the traits of multi-threaded execution, which is becoming more relevant as processors pack in more cores.
For servers, multi-core is only mildly relevant, as they are already written with concurrency in mind and will scale naturally. But multi-core is particularly relevant for desktop apps, which traditionally do only a few things concurrently, often just one foreground task with a background thread. Now, they have to be coded to do many things concurrently if they are to take advantage of the power of a multi-core CPU.
As to the performance: on a single core, asynchronous tasks slow down the system as much as they would if run sequentially (this is a simplification, but true for the most part). So, running task A, which takes 10 s, and task B, which takes 5 s, on a single core, the total time needed will be 15 s whether or not B is run asynchronously. The reason is that as B runs, it takes away CPU resources from A; A and B compete for the same CPU.
With a multi-core machine, additional tasks run on otherwise unused cores, so the situation is different: the additional tasks don't really consume any time, or more correctly, they don't take away time from the core running task A. So, running tasks A and B asynchronously on multi-core will consume just 10 s, not 15 s as with a single core. B's execution runs at the same time as A, and on a separate core, so A's execution time is unaffected.
As the number of tasks and cores increases, the potential improvements in performance also increase. In parallel computing, exploiting parallelism to produce an improvement in performance is known as speedup.
We are already seeing 64-core CPUs, and it's estimated that we will have 1024 cores commonplace in a few years. That's a potential speedup of 1024 times compared to the single-threaded synchronous case. So, to answer your question, there clearly is a performance gain to be had by using asynchronous execution.
I believed any kind of asynchronous execution makes a thread in an invisible area.
This is your problem - this actually isn't true.
The thing is, your whole computer is actually massively asynchronous: requests to RAM, communication via a network card, accessing an HDD... those are all inherently asynchronous operations.
Modern OSes are actually built around asynchronous I/O. Even when you do a synchronous file request, for example (e.g. File.ReadAllText), the OS sends an asynchronous request. However, instead of giving control back to your code, it blocks while it waits for the response to the asynchronous request. And this is where proper asynchronous code comes in - instead of waiting for the response, you give the request a callback - a function to execute when the response comes back.
For the duration of the asynchronous request, there is no thread. The whole thing happens on a completely different level: say, the request is sent to the firmware on your NIC, which is given a DMA address to fill with the response. When the NIC finishes your request, it fills the memory and signals an interrupt to the processor. The OS kernel handles the interrupt by signalling to the owner application (usually via an IOCP "channel") that the request is done. This is all still done with no thread whatsoever; only for a short time right at the end is a thread borrowed (in .NET, from the IOCP thread pool) to execute the callback.
So, imagine a simple scenario. You need to send 100 simultaneous requests to a database engine. With multi-threading, you would spin up a new thread for each of those requests. That means a hundred threads, a hundred thread stacks, the cost of starting the new threads themselves (starting a new thread is cheap; starting a hundred at the same time, not so much), quite a bit of resources. And those threads would just... block. Do nothing. When the responses come, the threads are awakened, one after another, and eventually disposed of.
On the other hand, with asynchronous I/O, you can simply post all the requests from a single thread - and register a callback when each of those is finished. A hundred simultaneous requests will cost you just your original thread (which is free for other work as soon as the requests are posted), and a short time with threads from the thread pool when the requests are finished - in "worst" case scenario, about as many threads as you have CPU cores. Provided you don't use blocking code in the callback, of course :)
This doesn't necessarily mean that asynchronous code is automatically more efficient. If you only need a single request, and you can't do anything until you get a response, there's little point in making the request asynchronous. But most of the time, that's not your actual scenario - for example, you need to maintain a GUI in the meantime, or you need to make simultaneous requests, or your whole code is callback-based, rather than being written synchronously (a typical .NET Windows Forms application is mostly event-based).
The real benefit from asynchronous code comes from exactly that - simplified non-blocking UI code (no more "(Not Responding)" warnings from the window manager), and massively improved parallelism. If you have a web server that handles a thousand requests simultaneously, you don't want to waste 1 GiB of address space just for the completely unnecessary thread stacks (especially on a 32-bit system) - you only use threads when you have something to do.
So, in the end, asynchronous code makes UI and server code much simpler. In some cases, mostly with servers, it can also make it much more efficient. The efficiency improvements come precisely from the fact that there is no thread during the execution of the asynchronous request.
Your comment only applies to one specific kind of asynchronous code - multi-threaded parallelism. In that case, you really are wasting a thread while executing a request. However, that's not what people mean when saying "my library offers an asynchronous API" - after all, that's a 100% worthless API; you could have just called await Task.Run(TheirAPIMethod) and gotten the exact same thing.
