If coroutines still use threads to run code in parallel, why are they considered lightweight?
In my understanding, Kotlin's suspend functions are transformed by the compiler into a state machine, where each branch can run on the same or a different thread, as defined by the developer. A coroutine builder, e.g. launch {}, is responsible for that, and the CoroutineContext is what defines the thread to run on.
Parallelism is achieved by sending blocks of code to a thread pool, which reuses the same threads.
There was a benchmark of 100k coroutines vs. 100k threads, where the coroutines passed without issue while the threads threw an exception (likely OutOfMemoryError). This makes me think I'm missing something here.
Could you help me understand what I'm missing? What lets coroutines run 100k blocks of code in parallel without exceeding memory limits the way threads do?
Points from the article:
Every thread has its own stack, typically 1 MB. 64 KB is the least amount of stack space allowed per thread in the JVM, while a simple coroutine in Kotlin occupies only a few dozen bytes of heap memory.
Each coroutine dispatcher has a limit on how many threads it can create.
For example, Dispatchers.IO has a limit of 64 threads, Dispatchers.Default is limited to the number of cores on your processor (2, 4, 6, 8, etc.), and Dispatchers.Unconfined cannot create a new thread; it runs on threads previously created by other dispatchers. Here's evidence: 500 operations, each sleeping for 10 ms, take approximately 5 s (single-threaded, because it can't spawn a new thread); try it yourself.
Coroutines stick to a thread, but as soon as a suspension point is reached, the coroutine leaves the thread and frees it up, letting it pick up another coroutine that is waiting. This way, with fewer threads and less memory usage, that much concurrent work can be done.
Coroutines are suspended and resumed by a callback-like object, a Continuation, which the compiler adds as the last parameter to every function marked with the suspend keyword. It lives on the heap like other objects do and is responsible for resuming the coroutine, so thousands of MB of RAM are not required to keep all the threads alive. Typically 60-70 threads at most are created using CommonPool, and they are reused (if a new coroutine is created, it waits until another finishes).
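The Continuation machinery described above can be observed with nothing but the Kotlin standard library. The sketch below (names like `resumeLaterDemo` are mine, not from the answer) captures a suspended coroutine's continuation explicitly, leaves it parked as a small heap object, and resumes it later from plain code:

```kotlin
import kotlin.coroutines.*

// A suspend block captures its own Continuation via suspendCoroutine, parks as a
// small heap object (no thread is blocked), and is resumed later by ordinary code.
fun resumeLaterDemo(): Int {
    var saved: Continuation<Int>? = null
    var result = -1
    val block: suspend () -> Int = {
        // suspendCoroutine hands us the current continuation and suspends the coroutine
        suspendCoroutine { cont -> saved = cont }
    }
    // startCoroutine (stdlib) runs the block with a completion callback
    block.startCoroutine(Continuation(EmptyCoroutineContext) { res ->
        result = res.getOrThrow()
    })
    // Here the coroutine is suspended; resuming it runs the rest of the block.
    saved!!.resume(42)
    return result
}

fun main() {
    println(resumeLaterDemo())  // prints 42
}
```

Everything between `startCoroutine` and `resume` is just an object reference sitting on the heap, which is exactly why a suspended coroutine costs bytes rather than a thread stack.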
The main saving comes from the fact that a single thread can run any number of coroutines, by way of cooperative multitasking. When you launch 100,000 coroutines, they run on as many threads as there are CPU cores, but when you start 100,000 threads, the JVM creates that many native threads. Note that the level of parallelism is equal in both cases, and limited to the number of CPU cores.
The only thing that changes is scheduling: in the classical case, the OS suspends and resumes threads, assigning them to CPU cores. With coroutines, the coroutines suspend themselves (this is their cooperative aspect) and the Dispatcher resumes them later on, running other coroutines in the meantime.
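The 100k benchmark from the question is easy to reproduce. This is a minimal sketch (function name `countCompletions` is mine; it assumes kotlinx-coroutines-core is on the classpath): each coroutine suspends at `delay`, freeing its thread, so all 100,000 fit on the heap, whereas 100,000 platform threads would each reserve their own stack.

```kotlin
import kotlinx.coroutines.*

// Launch n coroutines that each suspend briefly, then count completions.
// 100,000 of these fit comfortably in heap memory; 100,000 platform threads
// usually do not, because every thread reserves its own stack (often ~1 MB).
fun countCompletions(n: Int): Int = runBlocking {
    var completed = 0
    val jobs = List(n) {
        launch {
            delay(10)      // suspension point: frees the thread instead of blocking it
            completed++    // safe: all these coroutines share runBlocking's single thread
        }
    }
    jobs.joinAll()
    completed
}

fun main() {
    println(countCompletions(100_000))  // prints 100000
}
```

Replacing `launch { delay(10) ... }` with `Thread { Thread.sleep(10) ... }.start()` is the variant that typically dies with OutOfMemoryError.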
Lightweight: You can run many coroutines on a single thread due to support for suspension, which doesn't block the thread where the coroutine is running. Suspending saves memory over blocking while supporting many concurrent operations.
Fewer memory leaks: Use structured concurrency to run operations within a scope.
Related
According to Wikipedia, coroutines are based on cooperative multitasking, which makes them less resource-hungry than threads. No context switch, no blocking, no expensive system calls, no critical sections and so on.
In other words, all those coroutine benefits seem to come from disallowing multithreading in the first place. This makes coroutines single-threaded by nature: concurrency is achieved, but no true parallelism.
Is it true? Is it possible to implement coroutines by using multiple threads instead?
Coroutines allow multitasking without multithreading, but they don't disallow multithreading.
In languages that support both, a coroutine that is put to sleep can be re-awakened in a different thread.
The usual arrangement for CPU-bound tasks is to have a thread pool with about twice as many threads as you have CPU cores. This thread pool is then used to execute maybe thousands of coroutines simultaneously. The threads share a queue of coroutines ready to execute, and whenever a thread's current coroutine blocks, it just gets another one to work on from the queue.
In this situation you have enough busy threads to keep your CPU busy, and you still have thread context switches, but not enough of them to waste significant resources. The number of coroutine context switches is thousands of times higher.
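The thread-pool arrangement above can be made visible by recording which threads actually execute the coroutines. A sketch (function name `distinctWorkerThreads` is mine; assumes kotlinx-coroutines-core on the classpath):

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.ConcurrentHashMap

// Run many coroutines on a 2-thread pool and record which threads ran them:
// the set of thread names collapses to at most 2, however many coroutines we launch.
fun distinctWorkerThreads(coroutines: Int): Int {
    val seen = ConcurrentHashMap.newKeySet<String>()
    newFixedThreadPoolContext(2, "pool").use { ctx ->
        runBlocking {
            (1..coroutines).map {
                launch(ctx) {
                    seen += Thread.currentThread().name
                    delay(1)   // suspension point: the 2 threads are shared by all coroutines
                }
            }.joinAll()
        }
    }
    return seen.size
}

fun main() {
    println(distinctWorkerThreads(1_000))  // at most 2
}
```

A thousand coroutine "context switches" happen here, but the OS only ever schedules two worker threads.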
Multiple coroutines can be mapped to a single OS thread. But a single OS thread can only utilize 1 CPU. So you need multiple OS threads to utilize multiple CPUs.
So if a coroutine scheduler needs to utilize multiple CPUs (very likely), it needs to make use of multiple OS threads.
Have a look at the Go scheduler and search for "M:N scheduler".
I have worked with coroutines for quite a long time, but I still don't completely understand why I should prefer multi-threaded coroutines over single-threaded coroutines.
I can clearly see the benefit of multi-threaded coroutines when their count is less than or equal to the number of physical threads. But if we have more tasks than physical threads, why wouldn't we rather use only one coroutine thread?
To clarify the final question: why are 10 threads of coroutines better than a single thread with many coroutines?
Coroutines are units of computation (like tasks). The way they are dispatched onto actual threads is orthogonal to how many coroutines you have. You can use a single-threaded dispatcher or a multi-threaded dispatcher, and depending on this your coroutines will be scheduled differently.
Multi-threaded coroutines don't mean one thread per coroutine. You can dispatch 100 coroutines onto 8 threads.
But if we have more tasks than physical threads, why wouldn't we rather use only one coroutine thread?
There are multiple parts in this question.
First, if you have more tasks than logical cores, you can still dispatch all those tasks onto just the right number of threads. You don't have to completely give up on multithreading. This is actually exactly what Dispatchers.Default is about: dispatching as many coroutines as you want onto a limited number of threads equal to the number of hardware threads (logical cores) that you have. The point is to make use of all the hardware as much as possible without wasting threads (and thus memory).
Second, not every task is CPU-bound. Some I/O operations block threads (network calls, disk reads/writes etc.). When a thread is blocked on I/O, it doesn't use the CPU. If you have 8 logical cores, using only 8 threads for I/O would be suboptimal, because while some threads are blocked, the CPU cannot run other tasks. With more threads, it can (at the cost of some memory). This is the point of Dispatchers.IO, which can create more threads as needed and can exceed the number of logical cores (within a reasonable limit).
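The Default-vs-IO difference is measurable. This sketch (function name `timeBlockingWork` is mine; assumes kotlinx-coroutines-core on the classpath) runs 64 deliberately blocking 100 ms "I/O" calls on each dispatcher; on a typical machine with fewer than 64 cores, IO finishes much sooner because it can grow past the core count:

```kotlin
import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

// 64 blocking "I/O" calls of 100 ms each. Dispatchers.IO may grow to 64 threads,
// so they can all block in parallel; Dispatchers.Default has only ~one thread per
// logical core, so the same blocking work queues up behind the small pool.
fun timeBlockingWork(dispatcher: CoroutineDispatcher): Long = measureTimeMillis {
    runBlocking {
        (1..64).map {
            launch(dispatcher) { Thread.sleep(100) }  // deliberately blocking, not delay()
        }.joinAll()
    }
}

fun main() {
    println("IO:      ~${timeBlockingWork(Dispatchers.IO)} ms")      // close to 100 ms
    println("Default: ~${timeBlockingWork(Dispatchers.Default)} ms") // roughly (64 / cores) * 100 ms
}
```

Note this only holds for blocking calls; if the tasks used `delay` (a suspending call), both dispatchers would finish in about 100 ms, which is the whole argument for suspending instead of blocking.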
Why is 10 threads of coroutines better than only one thread with many coroutines?
Let's assume you have 100 coroutines to dispatch.
Using only one thread to run those coroutines implies that at most 1 core is doing the work at a given time, so nothing happens in parallel. This means all the other cores are idle, which is suboptimal. Worse, any I/O operation done by a coroutine blocks this single thread and prevents the CPU from doing anything while we're waiting on I/O.
Using 10 threads, you can literally execute 10 coroutines at the same time if your hardware is sufficient, which can be 10x faster (if your coroutines don't have inter-dependencies).
Using 100 threads would not be that beneficial if your coroutines are CPU-bound, but might be useful if you have a bunch of I/O tasks (as we've seen). That said, the more threads you use, the more memory is consumed. So even with a ton of I/O operations, you have to find a balance between throughput and memory, you don't want to spawn millions of threads.
In short, multi-threading has the same advantages with or without coroutines: it lets you make use of your hardware resources as much as possible. Coroutines are just an easier way to define tasks, dispatch them onto threads, express dependencies, avoid blocking threads unnecessarily, etc.
I have a computer with 1 CPU, 4 cores, and 2 threads per core. So I run efficiently with a maximum of 8 threads.
When I write a program in C and create threads using the pthread_create function, how many threads is it recommended to create: 7 or 8? Do I have to subtract the main thread and thus create 7, or should the main thread not be counted, so that I can efficiently create 8? I know that in theory you can create many more, like thousands, but I want this planned efficiently, in line with my computer's architecture.
Which thread started which is not much relevant. A program's initial thread is a thread: while it is scheduled on an execution unit, no other thread can use that execution unit. You cannot have more threads executing concurrently than you have execution units, and if you have more than that eligible to run at any given time then you will pay the cost of extra context switches without receiving any offsetting gain from additional concurrency.
To a first approximation, then, yes, you must count the initial thread. But read the above carefully. The relevant metric is not how many threads exist at any given time, but rather how many are contending for execution resources. Threads that are currently blocked (on I/O, on acquiring a mutex, on pthread_join(), etc.) do not contend for execution resources.
More precisely, then, it depends on your threads' behavior. For example, if the initial thread follows the pattern of launching a bunch of other threads and then joining them all without itself performing any other work, then no, you do not count that thread, because it does not contend for CPU to any significant degree while the other threads are doing so.
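The sizing rule from this answer can be sketched in a few lines on the JVM (function name `cpuWorkerCount` is mine, purely illustrative):

```kotlin
// Pick a worker-thread count for CPU-bound work. Count the initial thread only
// if it competes for CPU alongside the workers (i.e. it does a share of the work
// itself rather than just launching the workers and joining them).
fun cpuWorkerCount(initialThreadAlsoWorks: Boolean): Int {
    val units = Runtime.getRuntime().availableProcessors()  // logical cores, e.g. 8
    return (if (initialThreadAlsoWorks) units - 1 else units).coerceAtLeast(1)
}

fun main() {
    println(cpuWorkerCount(initialThreadAlsoWorks = true))   // e.g. 7 on an 8-thread machine
    println(cpuWorkerCount(initialThreadAlsoWorks = false))  // e.g. 8 on an 8-thread machine
}
```

The same arithmetic applies to `pthread_create` in C: what matters is how many threads contend for execution units, not how many exist.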
After digging a bit into the implementations of the coroutine dispatchers, such as Default and IO, I see that they just contain a Java executor (which is a simple thread pool) and a queue of Runnables, which are the coroutine logic blocks.
Let's take an example scenario where I launch 10,000 coroutines on the same coroutine context, the Default dispatcher for example, which (say) contains an executor with 512 real threads in its pool.
Those coroutines will be added to the dispatcher's queue (in case the number of in-flight coroutines exceeds the maximum threshold).
Let's assume, for example, that the first 512 coroutines I launch out of the 10,000 are really slow and heavy.
Will the rest of my coroutines be blocked until at least one of my real threads finishes, or is there some time-slicing mechanism in those "user-space threads"?
Coroutines are scheduled cooperatively, not pre-emptively, so a context switch is possible only at suspension points. This is by design; it makes execution much faster, because coroutines don't fight each other and the number of context switches is lower than with pre-emptive scheduling.
But as you noticed, this has drawbacks. When performing long CPU-intensive calculations, it is advised to invoke yield() from time to time, which frees the thread for other coroutines. Another solution is to create a distinct thread pool for our calculations, separating them from other parts of the application. That has a drawback similar to pre-emptive scheduling: it makes coroutines/threads fight for access to CPU cores.
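The yield() advice looks like this in practice (function name `cooperativeSum` is mine; assumes kotlinx-coroutines-core on the classpath):

```kotlin
import kotlinx.coroutines.*

// A CPU-bound loop that cooperates: calling yield() every so often creates a
// suspension point, letting other coroutines on the same thread run in between
// chunks of work (and letting cancellation take effect).
suspend fun cooperativeSum(n: Long): Long {
    var acc = 0L
    for (i in 1..n) {
        acc += i
        if (i % 100_000 == 0L) yield()  // hand the thread over periodically
    }
    return acc
}

fun main() = runBlocking {
    println(cooperativeSum(1_000_000))  // prints 500000500000
}
```

Without the yield() call, this loop would monopolize its thread from start to finish, exactly the "slow and heavy with no suspension point" case from the question.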
Once a coroutine starts executing it will continue to do so until it hits a suspension point, which is introduced by a call to suspendCoroutine or suspendCancellableCoroutine.
Suspension is the fundamental idea
This, however, is by design, because suspension is fundamental to the performance gains introduced by coroutines. The whole point of coroutines is: why keep blocking a thread while it's doing nothing but waiting (e.g. on synchronous I/O)? Why not use that thread to do something else?
Without suspension you lose much of the performance gain
So in order to identify the switch in your particular case, you will have to define the terms slow and heavy. A CPU-intensive task such as generating a prime number can be slow and heavy, and an API call that performs a complex computation on the server and then returns a result can also be slow and heavy. If the 512 coroutines have no suspension point, then the others will have to wait for them to complete, which actually defeats the whole point of using coroutines, since you are effectively using coroutines as a replacement for threads, but with added overhead.
If you have to execute a bunch of non-suspending operations in parallel, you should instead use a service like an Executor, since in this case coroutines do nothing but add a useless layer of abstraction.
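For that non-suspending case, a plain executor is the simpler tool. A sketch using only the JDK (function name `squaresInParallel` is mine):

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Purely blocking, non-suspending tasks on a plain thread pool: each task holds
// a thread from start to finish, with no coroutine machinery in between.
fun squaresInParallel(values: List<Int>): List<Int> {
    val pool = Executors.newFixedThreadPool(4)
    try {
        return values.map { v -> pool.submit<Int> { v * v } }.map { it.get() }
    } finally {
        pool.shutdown()
        pool.awaitTermination(1, TimeUnit.SECONDS)
    }
}

fun main() {
    println(squaresInParallel(listOf(1, 2, 3, 4)))  // [1, 4, 9, 16]
}
```

With no suspension points there is nothing for a coroutine dispatcher to do better than this; the pool size is the real limit either way.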
I see this in the book "CLR via C#" and I don't catch it. If there are still threads available in the thread pool, why does it create additional threads?
It might just be poor wording.
On a given machine the threadpool has a good guess of the optimum number of threads the machine can run without overextending resources. If, for some reason, a thread becomes IO blocked (for instance it is waiting for a long time to save or retrieve data from disk or for a response from a network device) the threadpool can start up another thread to take advantage of unused CPU time. When the other thread is no longer blocking, the threadpool will take the next freed thread out of the pool to reduce the size back to "optimum" levels.
This is part of the threadpool management to keep the system from being over-tasked (and reducing efficiency by all the context switches between too many threads) while reducing wasted cycles (while a thread is blocked there might not be enough other work to task the processor(s) fully even though there are tasks waiting to be run) and wasted memory (having threads spun up and ready but never allocated because they'd over task the CPU).
More info on the Managed Thread Pool from MSDN.
The book lied.
The thread pool only creates additional threads when all available threads have been blocked for more than 1 second. If there are free threads, it will use them to process your additional tasks. Note that after a thread has been idle for 30 seconds, the CLR retires it (terminates it, gracefully of course).