Understanding coroutine - multithreading

From wikipedia the
paragraph Comparison with threads states:
... This means that coroutines provide concurrency but not parallelism ...
I understand that coroutine is lighter than thread, context switching is not involved, no critical sections so mutex is also not needed. What confuses me is that the way it works seems not to scale. According to wikipedia, coroutines provide concurrency, they work cooperatively. A program with coroutines still executes instructions sequentially, this is exactly the same as threads on a single core machine, but what about multicore machines? on which threads run in parallel, while coroutines work the same as on single core machines.
My question is how coroutines will perform better than threads on multicore machines?

...what about multicore machines?...
Coroutines are a model of concurrency (in which two or more stateful activities can be in-progress at the same time), but not a model of parallelism (in which the program would able to use more hardware resources than what a single, conventional CPU core can provide).
Threads can run independently of one another, and if your hardware supports it (i.e., if your machine has more than one core) then two or more threads can be performing their independent activities at the same instant in time.
But coroutines, by definition, are interdependent. A coroutine only runs when it is called by another coroutine, and the caller is suspended until the current coroutine calls it back. Only one coroutine from a set of coroutines can ever be actually running at any given instant in time.

Related

Are coroutines single-threaded by nature?

According to Wikipedia, coroutines are based on cooperative multitasking, which makes them less resource-hungry than threads. No context switch, no blocking, no expensive system calls, no critical sections and so on.
In other words, all those coroutine benefits seem to come from disallowing multithreading in the first place. This makes coroutines single-threaded by nature: concurrency is achieved, but no true parallelism.
Is it true? Is it possible to implement coroutines by using multiple threads instead?
Coroutines allow multitasking without multithreading, but they don't disallow multithreading.
In languages that support both, a coroutine that is put to sleep can be re-awakened in a different thread.
The usual arrangement for CPU-bound tasks is to have a thread pool with about twice as many threads as you have CPU cores. This thread pool is then used to execute maybe thousands of coroutines simultaneously. The threads share a queue of coroutines ready to execute, and whenever a thread's current coroutine blocks, it just gets another one to work on from the queue.
In this situation you have enough busy threads to keep your CPU busy, and you still have thread context switches, but not enough of them to waste significant resources. The number of coroutine context switches is thousands of times higher.
Multiple coroutines can be mapped to a single OS thread. But a single OS thread can only utilize 1 CPU. So you need multiple OS threads to utilize multiple CPUs.
So if a coroutine scheduler needs to utilize multiple CPUs (very likely), it needs to make use of multiple OS threads.
Have a look at the Go scheduler and look for MN scheduler.

What profit of using multi-threaded coroutines?

I work with coroutines pretty long time, but I still don't understand completely, why do I need to prefer multi-threaded coroutines instead of single-threaded coroutines.
I can clearly see the profit of using multi-threaded coroutines when their count is less or equal to the physical thread count. But if we have more tasks than physical threads, why wouldn't we rather use only one coroutine thread?
I'll clarify the final question: Why is 10 threads of coroutines better than only one thread with many coroutines?
Coroutines are units of computation (like tasks). The way they are dispatched onto actual threads is orthogonal to how many coroutines you have. You can use a single-threaded dispatcher or a multi-threaded dispatcher, and depending on this your coroutines will be scheduled differently.
Multi-threaded coroutines doesn't mean 1 thread per coroutine. You can dispatch 100 coroutines on 8 threads.
But if we have more tasks than physical threads, why wouldn't we rather use only one coroutine thread?
There are multiple parts in this question.
First, if you have more tasks than logical cores, you could still dispatch all those tasks onto just the right number of threads. You don't have to completely give up on multithreading. This is actually exactly what Dispatchers.Default is about: dispatching as many coroutines as you want onto a limited number of threads equal to the number of hardware threads (logical cores) that you have. The point is to make use of all the hardware as much as possible without wasting theads (and thus memory).
Second, not every task is CPU-bound. Some I/O operations block threads (network calls, disk reads/writes etc.). When a thread is blocked on I/O, it doesn't use the CPU. If you have 8 logical cores, using only 8 threads for I/O would be suboptimal, because while some threads are blocked, the CPU cannot run other tasks. With more threads, it can (at the cost of some memory). This is the point of Dispatchers.IO, which can create more threads as needed and can exceed the number of logical cores (within a reasonable limit).
Why is 10 threads of coroutines better than only one thread with many coroutines?
Let's assume you have 100 coroutines to dispatch.
Using only one thread to run those coroutines implies that only 1 core at most is doing the work at a given time, so nothing happens in parallel. This means all the other cores are idle, which is suboptimal. Worse, any I/O operation done by a coroutine blocks this only thread and prevents the CPU from doing anything while we're waiting on I/O.
Using 10 threads, you can literally execute 10 coroutines at the same time if your hardware is sufficient, which can be 10x faster (if your coroutines don't have inter-dependencies).
Using 100 threads would not be that beneficial if your coroutines are CPU-bound, but might be useful if you have a bunch of I/O tasks (as we've seen). That said, the more threads you use, the more memory is consumed. So even with a ton of I/O operations, you have to find a balance between throughput and memory, you don't want to spawn millions of threads.
In short, using multi-threading still has the same advantages with or without coroutines: it allows to make use of your hardware resources as much as possible. Using coroutines is just an easier way to define tasks, dispatch them onto threads, express dependencies, avoid blocking threads unnecessarily, etc.

Are Coroutines preemptive or just blocking the thread that picked the Runnable?

after digging a bit inside implementations of the Coroutine dispatchers such as "Default" and "IO",
I see they are just containing a Java executor (which is a simple thread pool) and a queue of Runnables which are the coroutine logic blocks.
let's take an example scenario where I am launching 10,000 coroutines on the same coroutine context, "Default" dispatcher for example, which contains an Executor with 512 real threads in its pool.
those coroutines will be added to the dispatcher queue (in case the number of in-flight coroutines exceeded the max threshold).
let's assume for example that the first 512 coroutines I launched out of the 10,000 are really slow and heavy.
are the rest of my coroutines will be blocked until at least 1 of my real threads will finish,
or is there some time-slicing mechanism in those "user-space threads"?
Coroutines are scheduled cooperatively, not pre-emptively, so context switch is possible only at suspension points. This is actually by design, it makes execution much faster, because coroutines don't fight each other and the number of context switches is lower than in pre-emptive scheduling.
But as you noticed, it has drawbacks. If performing long CPU-intensive calculations it is advised to invoke yield() from time to time. It allows to free the thread for other coroutines. Another solution is to create a distinct thread pool for our calculations to separate them from other parts of the application. This has similar drawback as pre-emptive scheduling - it will make coroutines/threads fight for the access to CPU cores.
Once a coroutine starts executing it will continue to do so until it hits a suspension point, which is introduced by a call to suspendCoroutine or suspendCancellableCoroutine.
Suspension is the fundamental idea
This however is by design, because suspension is fundamental to the performance gains that are introduced by coroutines, whole point behind coroutines is that why keep blocking a thread when its doing nothing but wait (ex sync IO). why not use this thread to do something else
Without suspension you lose much of the performance gain
So in order to identify the switch in your particular case, you will have to define the term slow and heavy. A cpu intensive task such as generating a prime number can be slow and heavy and a API call which performs complex computation on server and then returns a result can also be slow and heavy. if 512 coroutines have no suspension point, then others will have to wait for them to complete. which actually defeats the whole point of using the coroutines, since you are effectively using coroutiens as a replacement for threads but with added overhead.
If you have to execute bunch of non-suspending operations in parallel, you should instead use a service like Executor, since in this case coroutines does nothing but add a useless layer of abstraction.

What is the difference between user level threads and coroutines?

User-level threading involves N user level threads that run on a single kernel thread. What are the details of the user-level threading and how does this differ from coroutines?
Wikipedia has a quite in-depth summary on the subject: Thread (computing).
With Green threads there's a VM executing instructions that typically decides between to switch thread in-between two instructions.
With coroutines the two functions yield to each other at specified points, possibly passing values along, and typically requiring special language support. E.g. a producer yielding to a consumer, passing along an item.
The idea behind user-level threads is to have multiple different logical threads running in the same program but to have the user program handle the mapping from logical threads to kernel threads (Which actually get scheduled) rather than having the OS handle the entire mapping. This can improve performance by letting the user program handle scheduling. Conceptually, user threads are one implementation of preemptive multitasking, where multiple jobs are run to completion in parallel by having the threads periodically stopped while other threads run.
Coroutines, on the other hand, are a generalization of standard function call and return ("subroutines") where functions pass control back and forth to one another, communicating values as they switch between routines. The switching back and forth between coroutines is under the control of the coroutines themselves; control only passes from one coroutine to another if one of the coroutines explicitly yields a value to another. This is an example of cooperative multitasking, where multiple jobs are completed in parallel by having the individual steps in the task manually coordinate who gets to run and when.
Hope this helps!

Is a Task lightweight compared to a Thread?

I overheard a coworker saying that a Task is basically a lightweight thread. Coming from a C++ background (where threads where the lightest weight processing unit), this seems counter-intuitive to me.
Aren't Tasks just as heavy as Threads?
You need to distinguish between a unit of work (Tasks) from the underlying process used to host/execute them. It isn't even necessary for Tasks to run on other threads. For example, Tasks can be executed in a single threaded application that periodically yields control to the task pool.
Even when Tasks are executed on separate threads, there is usually not a 1 to 1 relationship between Task and Thread. The threads are preallocated as part of a pool, and then tasks are scheduled to run on these threads as available. Creating a new task does not require the overhead of creating a thread, it only requires the cost of an enque in a task queue.
This makes tasks inherently more scalable. I can have millions of tasks throughout the lifetime of my application, but only ever actually use some constant number of threads.
Typically a "thread" implies mandatory concurrency. Starting up a thread requires allocating a stack and internal OS data structures for it. In contrast, a "task" often refers to a piece of work for which concurrency is optional, hence a parallel framework (such as OpenMP, Cilk Plus, TBB, PPL) can use the same thread to execute many tasks, by serializing the tasks, and converting optional parallelism to real parallelism only as necessary to keep the machine busy.
You are right - everything runs on a thread under the covers.
The reason people say that a Task is more lightweight than a Thread is that Microsoft put a lot of thought into having Tasks make efficient use of Threads, and the implementation is probably much lighter weight than what the average developer would come up with on their own using the Thread class.
EDIT
A more clear explanation is that a Task object is lighter weight than a Thread object, and while each Task is eventually run on a Thread, creating N Task objects concurrently leads to less than N concurrent Thread objects being used, for large N.

Resources