What is the benefit of using multi-threaded coroutines? - multithreading

I have worked with coroutines for quite a long time, but I still don't completely understand why I should prefer multi-threaded coroutines over single-threaded coroutines.
I can clearly see the benefit of using multi-threaded coroutines when their count is less than or equal to the physical thread count. But if we have more tasks than physical threads, why wouldn't we rather use only one coroutine thread?
I'll clarify the final question: why are 10 threads of coroutines better than only one thread with many coroutines?

Coroutines are units of computation (like tasks). The way they are dispatched onto actual threads is orthogonal to how many coroutines you have. You can use a single-threaded dispatcher or a multi-threaded dispatcher, and depending on this your coroutines will be scheduled differently.
Multi-threaded coroutines don't mean 1 thread per coroutine. You can dispatch 100 coroutines onto 8 threads.
But if we have more tasks than physical threads, why wouldn't we rather use only one coroutine thread?
There are multiple parts to this question.
First, if you have more tasks than logical cores, you can still dispatch all those tasks onto just the right number of threads; you don't have to give up on multithreading entirely. This is exactly what Dispatchers.Default is about: dispatching as many coroutines as you want onto a limited number of threads equal to the number of hardware threads (logical cores) that you have. The point is to make use of all the hardware as much as possible without wasting threads (and thus memory).
Second, not every task is CPU-bound. Some I/O operations block threads (network calls, disk reads/writes etc.). When a thread is blocked on I/O, it doesn't use the CPU. If you have 8 logical cores, using only 8 threads for I/O would be suboptimal, because while some threads are blocked, the CPU cannot run other tasks. With more threads, it can (at the cost of some memory). This is the point of Dispatchers.IO, which can create more threads as needed and can exceed the number of logical cores (within a reasonable limit).
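As a minimal sketch (assuming kotlinx.coroutines is on the classpath), you can observe this bounded dispatching directly: far more coroutines than threads, with the worker count capped by the size of the Default pool.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger
import kotlinx.coroutines.*

// Launches `n` coroutines on Dispatchers.Default and reports how many
// distinct worker threads actually ran them.
fun countWorkerThreads(n: Int): Pair<Int, Int> = runBlocking {
    val completed = AtomicInteger(0)
    val threads = ConcurrentHashMap.newKeySet<String>()
    List(n) {
        launch(Dispatchers.Default) {
            threads += Thread.currentThread().name
            completed.incrementAndGet()
        }
    }.joinAll()
    completed.get() to threads.size
}

fun main() {
    val (done, workers) = countWorkerThreads(100)
    // All 100 coroutines complete, but the worker count stays bounded by
    // the Default pool size (roughly the number of logical cores).
    println("completed=$done on $workers threads")
}
```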
Why are 10 threads of coroutines better than only one thread with many coroutines?
Let's assume you have 100 coroutines to dispatch.
Using only one thread to run those coroutines implies that at most 1 core is doing the work at any given time, so nothing happens in parallel. This means all the other cores are idle, which is suboptimal. Worse, any I/O operation done by a coroutine blocks this single thread and prevents the CPU from doing anything while we're waiting on I/O.
Using 10 threads, you can literally execute 10 coroutines at the same time if your hardware is sufficient, which can be 10x faster (if your coroutines don't have inter-dependencies).
Using 100 threads would not be that beneficial if your coroutines are CPU-bound, but it might be useful if you have a bunch of I/O tasks (as we've seen). That said, the more threads you use, the more memory is consumed. So even with a ton of I/O operations, you have to find a balance between throughput and memory; you don't want to spawn millions of threads.
In short, multithreading has the same advantages with or without coroutines: it lets you make use of your hardware resources as much as possible. Coroutines are just an easier way to define tasks, dispatch them onto threads, express dependencies, avoid blocking threads unnecessarily, etc.

Related

Are coroutines single-threaded by nature?

According to Wikipedia, coroutines are based on cooperative multitasking, which makes them less resource-hungry than threads. No context switch, no blocking, no expensive system calls, no critical sections and so on.
In other words, all those coroutine benefits seem to come from disallowing multithreading in the first place. This makes coroutines single-threaded by nature: concurrency is achieved, but no true parallelism.
Is it true? Is it possible to implement coroutines by using multiple threads instead?
Coroutines allow multitasking without multithreading, but they don't disallow multithreading.
In languages that support both, a coroutine that is put to sleep can be re-awakened in a different thread.
The usual arrangement for CPU-bound tasks is to have a thread pool with about twice as many threads as you have CPU cores. This thread pool is then used to execute maybe thousands of coroutines simultaneously. The threads share a queue of coroutines ready to execute, and whenever a thread's current coroutine blocks, it just gets another one to work on from the queue.
In this situation you have enough busy threads to keep your CPU busy, and you still have thread context switches, but not enough of them to waste significant resources. The number of coroutine context switches is thousands of times higher.
Multiple coroutines can be mapped to a single OS thread. But a single OS thread can only utilize 1 CPU. So you need multiple OS threads to utilize multiple CPUs.
So if a coroutine scheduler needs to utilize multiple CPUs (very likely), it needs to make use of multiple OS threads.
Have a look at the Go scheduler and look up its M:N scheduling.
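In Kotlin (assuming kotlinx.coroutines is available), this re-awakening on a different thread is easy to observe: a coroutine that suspends on a multi-threaded dispatcher may resume on another pool thread.

```kotlin
import kotlinx.coroutines.*

// Returns the worker thread's name before and after a suspension point.
// On a multi-threaded dispatcher the coroutine may resume on a different
// thread than the one it started on.
fun threadsAroundSuspension(): Pair<String, String> = runBlocking {
    withContext(Dispatchers.Default) {
        val before = Thread.currentThread().name
        delay(10) // suspension point: the thread is released meanwhile
        val after = Thread.currentThread().name
        before to after
    }
}

fun main() {
    val (before, after) = threadsAroundSuspension()
    // Both names belong to the shared pool; equality is not guaranteed.
    println("before=$before, after=$after")
}
```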

Are coroutines preemptive, or do they just block the thread that picked the Runnable?

After digging a bit into the implementations of the coroutine dispatchers such as "Default" and "IO", I see that they just contain a Java executor (which is a simple thread pool) and a queue of Runnables, which are the coroutine logic blocks.
Let's take an example scenario where I launch 10,000 coroutines on the same coroutine context, the "Default" dispatcher for example, which contains an executor with 512 real threads in its pool.
Those coroutines will be added to the dispatcher queue (in case the number of in-flight coroutines exceeds the maximum threshold).
Let's assume, for example, that the first 512 coroutines I launched out of the 10,000 are really slow and heavy.
Will the rest of my coroutines be blocked until at least one of my real threads finishes, or is there some time-slicing mechanism in those "user-space threads"?
Coroutines are scheduled cooperatively, not preemptively, so a context switch is possible only at suspension points. This is by design, and it makes execution much faster, because coroutines don't fight each other over threads and the number of context switches is lower than in preemptive scheduling.
But as you noticed, it has drawbacks. When performing long CPU-intensive calculations, it is advised to invoke yield() from time to time, which frees the thread for other coroutines. Another solution is to create a distinct thread pool for our calculations, to separate them from other parts of the application. This has a drawback similar to preemptive scheduling: it makes coroutines/threads fight for access to the CPU cores.
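A small sketch (assuming kotlinx.coroutines) of the yield() advice: on a single-threaded dispatcher, two CPU-style loops interleave only because each step yields.

```kotlin
import java.util.concurrent.Executors
import kotlinx.coroutines.*

// Two cooperative loops on a single-threaded dispatcher. Each yield() is a
// suspension point that sends the coroutine to the back of the queue, so
// the two coroutines take turns; without yield(), A would run to completion
// before B even starts.
fun interleavedLog(): List<String> {
    val log = mutableListOf<String>()
    Executors.newSingleThreadExecutor().asCoroutineDispatcher().use { ctx ->
        runBlocking {
            val a = launch(ctx) { repeat(3) { i -> log += "A$i"; yield() } }
            val b = launch(ctx) { repeat(3) { i -> log += "B$i"; yield() } }
            joinAll(a, b)
        }
    }
    return log
}

fun main() {
    println(interleavedLog()) // prints [A0, B0, A1, B1, A2, B2]
}
```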
Once a coroutine starts executing it will continue to do so until it hits a suspension point, which is introduced by a call to suspendCoroutine or suspendCancellableCoroutine.
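For illustration, here is a sketch of how such a suspension point is introduced. `Callback` and `fetchAsync` are made-up stand-ins for any callback-based API; the wrapper uses suspendCancellableCoroutine (assuming kotlinx.coroutines).

```kotlin
import kotlin.coroutines.resume
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.suspendCancellableCoroutine

// Hypothetical callback-based API, a stand-in for any async library call.
fun interface Callback { fun onDone(result: String) }
fun fetchAsync(cb: Callback) {
    Thread { Thread.sleep(10); cb.onDone("ok") }.start()
}

// Wrapping the callback in suspendCancellableCoroutine introduces a
// suspension point: the calling coroutine releases its thread here and
// is resumed later by the callback.
suspend fun fetch(): String = suspendCancellableCoroutine { cont ->
    fetchAsync { result -> cont.resume(result) }
}

fun main() = runBlocking {
    println(fetch()) // prints "ok"
}
```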
Suspension is the fundamental idea
This, however, is by design, because suspension is fundamental to the performance gains introduced by coroutines. The whole point of coroutines is: why keep blocking a thread while it does nothing but wait (e.g. on synchronous I/O)? Why not use this thread to do something else?
Without suspension you lose much of the performance gain
So in order to identify the switch in your particular case, you will have to define the terms slow and heavy. A CPU-intensive task such as generating a prime number can be slow and heavy, and an API call which performs a complex computation on the server and then returns a result can also be slow and heavy. If those 512 coroutines have no suspension point, the others will have to wait for them to complete, which actually defeats the whole point of using coroutines, since you are effectively using coroutines as a replacement for threads, but with added overhead.
If you have to execute a bunch of non-suspending operations in parallel, you should instead use a plain Executor, since in this case coroutines do nothing but add a useless layer of abstraction.
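For such non-suspending work, a plain JDK thread pool is enough; a minimal sketch:

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger

// A fixed thread pool running blocking, non-suspending tasks: no coroutine
// machinery, just Runnables queued onto a bounded number of threads.
fun runOnPool(tasks: Int, threads: Int): Int {
    val pool = Executors.newFixedThreadPool(threads)
    val done = AtomicInteger(0)
    repeat(tasks) {
        pool.execute {
            Thread.sleep(1) // stand-in for blocking work
            done.incrementAndGet()
        }
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    return done.get()
}

fun main() {
    println(runOnPool(tasks = 100, threads = 8)) // prints 100
}
```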

Why are Kotlin coroutines considered lightweight?

If coroutines still use threads to run code in parallel, why are they considered lightweight?
In my understanding, Kotlin's suspend functions are transformed by the compiler into a state machine where each branch can run on the same or a different thread, as defined by the developer. A coroutine builder, e.g. launch {}, is responsible for that, and the CoroutineContext is what defines the thread to run on.
Parallelism is achieved by sending blocks of code to a thread pool which reuses the same threads.
There was a benchmark of 100k coroutines versus 100k threads where the coroutines passed without issue while the threads threw an exception (likely OutOfMemoryError). This brings me to the idea that I am missing something here.
Could you help me understand what I am missing? What makes coroutines run 100k code blocks concurrently without exceeding memory limits the way threads do?
Quoting from the article:
Every thread has its own stack, typically 1 MB. 64 KB is the least amount of stack space allowed per thread in the JVM, while a simple coroutine in Kotlin occupies only a few dozen bytes of heap memory.
The coroutine dispatcher has a limit on how many threads it can create. For example, Dispatchers.IO has a limit of 64 threads, and Dispatchers.Default has a limit equal to the number of cores on your processor (2, 4, 6, 8, etc.). Dispatchers.Unconfined cannot create a new thread; it runs on threads previously created by other dispatchers. Here's evidence: 500 operations with 10 ms of sleep each take approximately 5 s (single-threaded, because it can't spawn a new thread); try it yourself.
Coroutines stick to a thread, but as soon as a suspension point is reached, the coroutine leaves the thread and frees it up, letting it pick up another coroutine that is waiting. This way, the same amount of concurrent work can be done with fewer threads and less memory.
Coroutines are suspended and resumed via a callback-like object, a Continuation, which the compiler adds as the last parameter to every function marked with the suspend keyword. It lives on the heap like any other object and is responsible for resuming the coroutine, so thousands of megabytes of RAM are not needed to keep that many threads alive. At most, a typical 60-70 threads are created when using CommonPool, and they are reused (if a new coroutine is created, it waits until another finishes).
The main saving comes from the fact that a single thread can run any number of coroutines, by way of cooperative multitasking. When you launch 100,000 coroutines, they run on as many threads as there are CPU cores, but when you start 100,000 threads, the JVM creates that many native threads. Note that the level of parallelism is equal in both cases, and limited to the number of CPU cores.
The only thing that changes is scheduling: in the classical case, the OS suspends and resumes threads, assigning them to CPU cores. With coroutines, the coroutines suspend themselves (this is their cooperative aspect) and the Dispatcher resumes them later on, running other coroutines in the meantime.
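The oft-cited 100k benchmark can be sketched like this (assuming kotlinx.coroutines); while suspended, each coroutine costs a small heap object rather than a thread stack.

```kotlin
import java.util.concurrent.atomic.AtomicInteger
import kotlinx.coroutines.*

// 100,000 coroutines that each suspend for a second. While suspended they
// occupy a small heap object each instead of a ~1 MB thread stack, so this
// completes comfortably; 100,000 platform threads would typically fail
// with an OutOfMemoryError.
fun launchMany(n: Int): Int = runBlocking {
    val counter = AtomicInteger(0)
    List(n) {
        launch(Dispatchers.Default) {
            delay(1000) // suspends; the worker thread serves other coroutines
            counter.incrementAndGet()
        }
    }.joinAll()
    counter.get()
}

fun main() {
    println(launchMany(100_000)) // prints 100000
}
```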
Lightweight: You can run many coroutines on a single thread due to support for suspension, which doesn't block the thread where the coroutine is running. Suspending saves memory over blocking while supporting many concurrent operations.
Fewer memory leaks: Use structured concurrency to run operations within a scope.

Understanding coroutines

From Wikipedia, the paragraph Comparison with threads states:
... This means that coroutines provide concurrency but not parallelism ...
I understand that coroutines are lighter than threads: context switching is not involved, and there are no critical sections, so mutexes are also not needed. What confuses me is that the way they work seems not to scale. According to Wikipedia, coroutines provide concurrency and work cooperatively. A program with coroutines still executes instructions sequentially. This is exactly the same as threads on a single-core machine, but what about multicore machines, on which threads run in parallel while coroutines work the same as on a single-core machine?
My question is: how can coroutines perform better than threads on multicore machines?
...what about multicore machines?...
Coroutines are a model of concurrency (in which two or more stateful activities can be in progress at the same time), but not a model of parallelism (in which the program would be able to use more hardware resources than a single, conventional CPU core can provide).
Threads can run independently of one another, and if your hardware supports it (i.e., if your machine has more than one core) then two or more threads can be performing their independent activities at the same instant in time.
But coroutines, by definition, are interdependent. A coroutine only runs when it is called by another coroutine, and the caller is suspended until the current coroutine calls it back. Only one coroutine from a set of coroutines can ever be actually running at any given instant in time.
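Kotlin's stdlib sequence builder is exactly this kind of classic, single-threaded coroutine, and makes the caller/callee handoff visible:

```kotlin
// The stdlib `sequence` builder is a restricted coroutine: the generator
// body runs only until each yield(), then suspends until the consumer
// asks for the next element. Caller and callee take turns on one thread.
fun fib(): Sequence<Long> = sequence {
    var a = 0L
    var b = 1L
    while (true) {
        yield(a) // suspend here; resume when the caller iterates further
        val next = a + b
        a = b
        b = next
    }
}

fun main() {
    println(fib().take(8).toList()) // prints [0, 1, 1, 2, 3, 5, 8, 13]
}
```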

Will my program get more CPU time if it has more threads

If the kernel is currently scheduling 60 threads, belonging to 3 processes:
A: 10 threads
B: 20 threads
C: 30 threads
And they are all doing computation (no disk I/O).
Will C get more done than B, and B more than A?
That does not seem very fair to me. If I am an irresponsible programmer, I can just spawn more threads to grab more CPU resources.
How does this relate to Go: in Go, the scheduler typically has a thread pool with as many threads as there are CPU cores. Why does this make sense if a process with more threads gets more done?
The situation you describe is an overloaded machine. The process with more threads will only get more work done if the CPU has no spare time.
Go was not designed to fight other processes for a greater share of an overloaded CPU. You are free to set GOMAXPROCS to any number you like if you desire to participate in such a fight.
In a more typical system, where the total work is less than the total CPU time, a Go process with 8 threads and 30 goroutines will perform about the same as a process with 30 threads running at the same time.
It does not seem very fair to me. If I am an irresponsible programmer, I can just spawn more threads to eat more CPU resources.
You can also allocate all of your free memory, cause OS failures, and hijack the network card. You can do all sorts of things, but then who will want to use your software?
How does this relate to Go: in Go, the scheduler typically has a thread pool with as many threads as CPU cores. Why does this make sense if a process with more threads gets more done?
Go goroutines are basically a thread pool: each goroutine is a thread-pool work item. Many things can make a thread-pool thread block, like using synchronous I/O, waiting on a (non-spin-lock) lock, and manually sleeping or yielding. In these cases, which are very common, having more threads than CPUs usually increases the performance of your application.
Do note that not all I/O is disk I/O. Writing to your console is an I/O operation, but it isn't really "disk I/O".
Another thing is that context switching may not consume a large portion of your CPU, so having more threads may not degrade your task throughput. In that case, having more threads means your parallelism is higher, and yet you don't lose performance. This is a fairly common situation: context switching between threads these days is very cheap, and having a few more threads than cores will not necessarily kill or degrade your performance.