Which use cases are suitable for Dispatchers.Default in Kotlin? - multithreading

According to the documentation, the thread pool sizes of the IO and Default dispatchers are as follows:
Dispatchers.Default: By default, the maximal level of parallelism used by this dispatcher is equal to the number of CPU cores, but is at least two.
Dispatchers.IO: It defaults to the limit of 64 threads or the number of cores (whichever is larger).
Unless I am missing some piece of information, performing lots of CPU-intensive work on Default should be more efficient (faster), because context switching will happen less often.
But the following code actually runs much faster on Dispatchers.IO:
import kotlinx.coroutines.*
import java.time.Duration
import java.time.temporal.ChronoUnit
import kotlin.random.Random

// Spins for roughly one second of wall-clock time, generating random numbers.
fun blockingWork() {
    val startTime = System.currentTimeMillis()
    while (true) {
        Random(System.currentTimeMillis()).nextDouble()
        if (System.currentTimeMillis() - startTime > 1000) {
            return
        }
    }
}

fun main() = runBlocking {
    val startTime = System.nanoTime()
    val jobs = (1..24).map { i ->
        launch(Dispatchers.IO) { // <-- Select dispatcher here
            println("Start #$i in ${Thread.currentThread().name}")
            blockingWork()
            println("Finish #$i in ${Thread.currentThread().name}")
        }
    }
    jobs.forEach { it.join() }
    println("Finished in ${Duration.of(System.nanoTime() - startTime, ChronoUnit.NANOS)}")
}
I am running 24 jobs on an 8-core CPU (so I can keep all the threads of the Default dispatcher busy). Here are the results on my machine:
Dispatchers.IO --> Finished in PT1.310262657S
Dispatchers.Default --> Finished in PT3.052800858S
Can you tell me what I am missing here? If IO works better, why should I use any dispatcher other than IO (or any thread pool with lots of threads)?

Answering your question: the Default dispatcher works best for tasks that do not feature blocking, because there is no gain in exceeding maximum parallelism when executing such workloads concurrently (see the difference between concurrent and parallel execution: https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/5_CPU_Scheduling.html).
Your experiment is flawed. As already mentioned in the comments, your blockingWork does not measure CPU-bound throughput: each call simply spins until 1000 ms of wall-clock time have passed, so in effect it behaves like a wait. Your blockingWork is in essence just "wait for 1000 milliseconds", and waiting 1000 ms X times in parallel is going to be faster than doing it in sequence. You do perform some computation (generating random numbers), but as already noted, each worker generates more or fewer of those numbers depending on how much CPU time its underlying thread actually gets, while the elapsed time per job stays the same.
I performed some simple experiments with generating Fibonacci numbers (often used to simulate CPU workloads). However, after taking the JIT in the JVM into account, I couldn't easily produce any results proving that the Default dispatcher performs better. It might be that context switching isn't as significant as one may believe. It might be that the IO dispatcher wasn't creating more threads for my workload. It might be that my experiment was also flawed. I can't be certain: benchmarking on the JVM is not simple by itself, and adding coroutines (and their thread pools) to the mix certainly doesn't make it any simpler.
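One way to make such a benchmark measure work rather than elapsed time is to give every job a fixed number of iterations. The following is only a sketch under assumptions, not the experiment described above: fixedWork, runJobs and the iteration count are hypothetical names and values, and JIT warm-up can still distort the numbers.

```kotlin
import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

// Hypothetical CPU-bound task: a fixed amount of work, independent of wall-clock time.
fun fixedWork(iterations: Long): Long {
    var acc = 0L
    for (i in 1..iterations) acc += i % 7
    return acc
}

// Runs `jobs` copies of fixedWork on the given dispatcher and returns a checksum.
suspend fun runJobs(dispatcher: CoroutineDispatcher, jobs: Int, iterations: Long): Long =
    coroutineScope {
        (1..jobs).map { async(dispatcher) { fixedWork(iterations) } }.awaitAll().sum()
    }

fun main() = runBlocking {
    val candidates = listOf("Default" to Dispatchers.Default, "IO" to Dispatchers.IO)
    for ((name, dispatcher) in candidates) {
        var checksum = 0L
        val ms = measureTimeMillis {
            checksum = runJobs(dispatcher, jobs = 24, iterations = 200_000_000L)
        }
        // Printing the checksum keeps the result observable so the work is not optimized away.
        println("$name: $ms ms (checksum $checksum)")
    }
}
```

With a fixed amount of work per job, adding threads beyond the core count cannot shorten the total, so any difference between the two dispatchers reflects scheduling overhead rather than overlapping waits.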
However, I think there is something more important to consider here and that is blocking. Default dispatcher is more sensitive to blocking calls. With fewer threads in the pool, it is more likely that all of them become blocked and no other coroutine can execute at that time.
Your program is working in threads. If all threads are blocked, then your program isn't doing anything. Creating new threads is expensive (mostly memory-wise), so for high-load systems that feature blocking this becomes a limiting factor. Kotlin did an amazing job of introducing "suspending" functions. The concurrency of your program is no longer limited to the number of threads you have. If one flow needs to wait, it just suspends instead of blocking the thread. However, "the world is not perfect" and not everything "suspends": there are still "blocking" calls. How certain are you that no library you use performs such calls under the hood? With great power comes great responsibility. With coroutines, one needs to be even more careful about deadlocks, especially when using the Default dispatcher. In fact, in my opinion, the IO dispatcher should be the default one.
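A common way to keep a known blocking call away from the Default dispatcher is to wrap it in withContext(Dispatchers.IO). The snippet below is a minimal sketch; blockingLookup is a hypothetical stand-in for something like a JDBC query or a legacy library call.

```kotlin
import kotlinx.coroutines.*

// Hypothetical blocking call: it occupies its thread for the whole wait.
fun blockingLookup(id: Int): String {
    Thread.sleep(100)
    return "row-$id"
}

// Suspending wrapper: the caller suspends while an IO thread does the blocking.
suspend fun lookup(id: Int): String =
    withContext(Dispatchers.IO) { blockingLookup(id) }

fun main() = runBlocking {
    // The four lookups overlap because each one blocks its own IO thread.
    val rows = (1..4).map { async { lookup(it) } }.awaitAll()
    println(rows) // [row-1, row-2, row-3, row-4]
}
```

The calling coroutine is free to run on any dispatcher; only the blocking section is moved to a pool that is sized for waiting.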
EDIT
TL;DR: You might actually want to create your own dispatchers.
Looking back it came to my attention that my answer is somewhat superficial. It's technically incorrect to decide which dispatcher to use by only looking at the type of workload you want to run. Confining CPU-bound workload to a dispatcher that matches the number of CPU cores does indeed optimize for throughput, but that is not the only performance metric.
Indeed, by using only the Default dispatcher for all CPU-bound workloads you might find that your application becomes unresponsive! For example, let's say we have a "CPU-bound" long-running background process that uses the Default dispatcher. If that process saturates the thread pool of the Default dispatcher, then you might find that the coroutines started to handle immediate user actions (a user click or a client request) need to wait for the background process to finish first! You have achieved great CPU throughput, but at the cost of latency, and the overall performance of your application is actually degraded.
Kotlin does not force you to use predefined dispatchers. You can always create your own dispatchers custom-cut for the specific task you have for your coroutines.
Ultimately it's about:
Balancing resources. How many threads do you actually need? How many threads can you afford to create? Is it CPU-bound or IO-bound? Even if it is CPU-bound, are you sure you want to assign all of the CPU resources to your workload?
Assigning priorities. Understand what kind of workloads run on your dispatchers. Maybe some workloads need to run immediately while others can wait?
Preventing starvation deadlocks. Make sure your currently running coroutines don't block waiting for a result of a coroutine that is waiting for a free thread in the same dispatcher.
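As an illustration of a custom dispatcher, a dedicated pool for background work can be built from a plain Java executor. This is a sketch, assuming a pool size of two is appropriate; newBackgroundDispatcher and threadNameOn are hypothetical helper names.

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper: a small dedicated pool for long-running background jobs.
fun newBackgroundDispatcher(threads: Int): ExecutorCoroutineDispatcher {
    val counter = AtomicInteger()
    return Executors.newFixedThreadPool(threads) { task ->
        Thread(task, "background-${counter.incrementAndGet()}").apply { isDaemon = true }
    }.asCoroutineDispatcher()
}

// Reports the name of the thread that a block runs on for the given dispatcher.
suspend fun threadNameOn(dispatcher: CoroutineDispatcher): String =
    withContext(dispatcher) { Thread.currentThread().name }

fun main() = runBlocking {
    // use {} closes the dispatcher (and its threads) when the block ends.
    newBackgroundDispatcher(threads = 2).use { background ->
        // The long-running job is confined to the background threads...
        val heavy = launch(background) { Thread.sleep(200) }
        // ...so Dispatchers.Default stays free for latency-sensitive coroutines.
        println("interactive work on ${threadNameOn(Dispatchers.Default)}")
        println("background work on ${threadNameOn(background)}")
        heavy.join()
    }
}
```

Separating the pools trades some throughput for latency: the background threads compete with Default for cores, but they can no longer occupy every Default worker.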

Related

Are Coroutines preemptive or just blocking the thread that picked the Runnable?

After digging a bit inside the implementations of coroutine dispatchers such as "Default" and "IO",
I see that they just contain a Java executor (which is a simple thread pool) and a queue of Runnables, which are the coroutine logic blocks.
Let's take an example scenario where I launch 10,000 coroutines on the same coroutine context, the "Default" dispatcher for example, and assume it contains an Executor with 512 real threads in its pool.
Those coroutines will be added to the dispatcher queue (in case the number of in-flight coroutines exceeds the max threshold).
Let's assume, for example, that the first 512 coroutines I launched out of the 10,000 are really slow and heavy.
Will the rest of my coroutines be blocked until at least one of my real threads finishes,
or is there some time-slicing mechanism in those "user-space threads"?
Coroutines are scheduled cooperatively, not pre-emptively, so context switch is possible only at suspension points. This is actually by design, it makes execution much faster, because coroutines don't fight each other and the number of context switches is lower than in pre-emptive scheduling.
But as you noticed, it has drawbacks. When performing long CPU-intensive calculations, it is advised to invoke yield() from time to time. This frees the thread for other coroutines. Another solution is to create a distinct thread pool for our calculations to separate them from other parts of the application. This has a similar drawback to pre-emptive scheduling: it makes coroutines/threads fight for access to CPU cores.
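The effect of yield() is easiest to see on a single thread. The sketch below runs two coroutines on runBlocking's own event loop and records the order of their steps; trace is a hypothetical helper, and each log entry stands in for a slice of CPU-heavy work.

```kotlin
import kotlinx.coroutines.*

// Runs coroutines "A" and "B" on one thread and returns the order of their steps.
fun trace(cooperative: Boolean): List<String> = runBlocking {
    val log = mutableListOf<String>()
    coroutineScope {
        for (name in listOf("A", "B")) {
            launch {
                repeat(3) { step ->
                    log += "$name$step"           // stand-in for a chunk of computation
                    if (cooperative) yield()      // suspension point: lets the other coroutine run
                }
            }
        }
    }
    log
}

fun main() {
    println(trace(cooperative = false)) // [A0, A1, A2, B0, B1, B2]
    println(trace(cooperative = true))  // [A0, B0, A1, B1, A2, B2]
}
```

Without a suspension point, B cannot start until A has finished; with yield(), the two interleave even though only one thread is available.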
Once a coroutine starts executing, it will continue to do so until it hits a suspension point, which is ultimately introduced by a call to suspendCoroutine or suspendCancellableCoroutine.
Suspension is the fundamental idea
This, however, is by design, because suspension is fundamental to the performance gains that coroutines introduce. The whole point of coroutines is this: why keep blocking a thread when it is doing nothing but waiting (e.g. on synchronous IO)? Why not use this thread to do something else?
Without suspension you lose much of the performance gain
So in order to identify the switch in your particular case, you will have to define the term "slow and heavy". A CPU-intensive task such as generating a prime number can be slow and heavy, and an API call that performs a complex computation on a server and then returns a result can also be slow and heavy. If 512 coroutines have no suspension point, then the others will have to wait for them to complete, which actually defeats the whole point of using coroutines, since you are effectively using coroutines as a replacement for threads but with added overhead.
If you have to execute a bunch of non-suspending operations in parallel, you should instead use a service like Executor, since in this case coroutines do nothing but add a useless layer of abstraction.
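For purely non-suspending work, a plain executor can be used directly from Kotlin. The following sketch counts primes over several ranges in parallel; isPrime and countPrimesParallel are hypothetical names, and the pool size is an assumption.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

// Simple trial-division primality test: CPU-only, never suspends or blocks.
fun isPrime(n: Int): Boolean {
    if (n < 2) return false
    var d = 2
    while (d * d <= n) {
        if (n % d == 0) return false
        d++
    }
    return true
}

// Submits one task per range to a fixed pool and sums the partial counts.
fun countPrimesParallel(ranges: List<IntRange>, threads: Int): Int {
    val pool = Executors.newFixedThreadPool(threads)
    try {
        val tasks = ranges.map { range -> Callable { range.count(::isPrime) } }
        return pool.invokeAll(tasks).map { it.get() }.sum()
    } finally {
        pool.shutdown()
    }
}

fun main() {
    val threads = Runtime.getRuntime().availableProcessors()
    println(countPrimesParallel(listOf(1..50, 51..100), threads)) // 25
}
```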

Profiling Ratpack: ExecControllerBindingThreadFactory high CPU usage and lots of threads

We have a mobile app API server written with Ratpack 1.5.1 about to go live soon, and we're currently profiling the application to catch any performance bottlenecks. The app is backed by an SQL database and we're careful to always run queries using the Blocking class. The code is written in Kotlin and we wrote some coroutine glue code to force blocking operations to be executed on Ratpack's blocking threads.
Since Ratpack's thread model is unique we'd like to make sure this situation is normal: we simulated 2500 concurrent users of the application and our thread count went up to 400 (and even 600 at one point), most of these being ratpack-blocking-x-yyy threads.
Sampling the CPU we get 92% time spent in the ratpack.exec.internal.DefaultExecController$ExecControllerBindingThreadFactory.lambda$newThread$0 method, but this could be an artifact of sampling.
So, to ask concrete questions: given Ratpack's thread model, is the high blocking thread count normal and should we be worrying about the high CPU time spent in the above mentioned method?
Ratpack creates an unbounded(*) thread pool for blocking operations. It gets created in DefaultExecController:
public DefaultExecController(int numThreads) {
    this.numThreads = numThreads;
    this.eventLoopGroup = ChannelImplDetector.eventLoopGroup(numThreads, new ExecControllerBindingThreadFactory(true, "ratpack-compute", Thread.MAX_PRIORITY));
    this.blockingExecutor = Executors.newCachedThreadPool(new ExecControllerBindingThreadFactory(false, "ratpack-blocking", Thread.NORM_PRIORITY));
}
Threads that are created in this pool don't get killed right after a blocking operation is done: they idle in the pool, waiting for the next job. The main reason behind this is that keeping a thread idle is cheaper than spawning new threads when they are needed. That's why, when you simulate 2500 concurrent users calling an endpoint that executes a blocking operation, you can see up to 2500 threads in this pool (one for each blocking operation in flight at the same time). The cached thread pool that gets created uses the following ThreadPoolExecutor object:
public static ExecutorService newCachedThreadPool(ThreadFactory threadFactory) {
    return new ThreadPoolExecutor(0, 2147483647, 60L, TimeUnit.SECONDS, new SynchronousQueue(), threadFactory);
}
where 2147483647 is maximum pool size, 60L is TTL expressed in seconds. It means that executor service will keep those threads for 60 seconds and when they don't get re-used after 60 seconds, it will clean them up.
High CPU usage in this case is actually expected: up to 2500 threads are sharing a few CPU cores. It also matters where your SQL database is running. If you run it on the same machine, then your CPU has an even harder job to do. If the operations you run on the blocking thread pool consume significant CPU time, then you have to optimize those blocking operations. Ratpack's power comes from its async and non-blocking architecture: handlers use the ratpack-compute thread pool and delegate all blocking operations to ratpack-blocking, so your application is not blocked and can handle tons of requests.
(*) unlimited in this case means limited by available memory or if you have enough memory it is limited by 2147483647 threads (this value is used in ExecutorService.newCachedThreadPool(factory)).
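The growth of a cached thread pool under concurrent blocking work can be reproduced with the JDK alone. This is a small sketch, independent of Ratpack; peakThreads is a hypothetical helper, and the latch stands in for a blocking database call.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executors
import java.util.concurrent.ThreadPoolExecutor
import java.util.concurrent.TimeUnit

// Submits tasks that all block at once and reports how many threads the pool created.
fun peakThreads(concurrentBlockingTasks: Int): Int {
    val pool = Executors.newCachedThreadPool() as ThreadPoolExecutor
    val release = CountDownLatch(1)
    repeat(concurrentBlockingTasks) {
        // No idle worker is available, so every submission spawns a new thread.
        pool.execute { release.await() }
    }
    val peak = pool.largestPoolSize
    release.countDown()                      // unblock all tasks
    pool.shutdown()
    pool.awaitTermination(5, TimeUnit.SECONDS)
    return peak
}

fun main() {
    println(peakThreads(50)) // 50
}
```

The pool has no queue capacity (it uses a SynchronousQueue), so the thread count follows the number of blocking operations in flight rather than the number of cores.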
Just to build on Szymon's answer…
Ratpack doesn't inherently throttle any operations. That's effectively up to you. One option you have is to use https://ratpack.io/manual/current/api/ratpack/exec/Throttle.html to constrain and queue access to a resource.
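The idea of constraining and queueing access to a resource can be sketched with a coroutine semaphore. This is not Ratpack's Throttle API, only a generic illustration of the same technique in the Kotlin glue-code setting the question describes; peakConcurrency is a hypothetical helper and Thread.sleep stands in for a blocking query.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit
import java.util.concurrent.atomic.AtomicInteger

// Runs `tasks` blocking operations but lets at most `permits` of them run at once.
// Returns the highest number of operations observed running simultaneously.
suspend fun peakConcurrency(tasks: Int, permits: Int): Int = coroutineScope {
    val gate = Semaphore(permits)
    val active = AtomicInteger()
    val peak = AtomicInteger()
    (1..tasks).map {
        launch(Dispatchers.IO) {
            gate.withPermit {                // waiting callers suspend instead of blocking
                val now = active.incrementAndGet()
                peak.updateAndGet { maxOf(it, now) }
                Thread.sleep(20)             // stand-in for a blocking database call
                active.decrementAndGet()
            }
        }
    }.joinAll()
    peak.get()
}

fun main() = runBlocking {
    println("peak concurrency: ${peakConcurrency(tasks = 40, permits = 4)}")
}
```

Because coroutines waiting for a permit are suspended rather than blocked, the number of threads tied up in blocking work stays bounded by the permit count.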

Why is optimal thread count of a program related to number of cores when there are thousands of background threads

I've been reading about multi-threaded programming and number of optimal threads. I understand that it is very subjective, varies case by case basis, and the real optimal can be found only through trial-and-error.
However, I've found so many posts saying that if the task is non-I/O-bound, then
Optimal: numberOf(threads) ~= numberOf(cores)
Please take a look at Optimal number of threads per core
Q) How can the above equation be valid if hundreds/thousands of background (OS/other stuff) threads are already fighting to get their turn?
Q) Doesn't having a bit more number of threads increase the probability of being allotted with a core?
The "optimal" only applies to threads that are executing full throttle. The 1000+ threads you can see in use in, say, the Windows Task Manager are threads that are not executing. They are waiting for a notification, blocking on a synchronization object's wait() call.
Which includes I/O but can also be a timer, a driver event, a process interop synch object, an UI thread waiting for a message, etcetera. The latter are much less visible since they are usually wrapped by a friendly api.
Writing a program that has as many threads as the machine has cores, all burning 100% core, is not actually that common. You'd have to solve the kind of problem that requires pure calculation. Real programs are typically bogged down by the need to read/write the data to perform an operation or are throttled by the rate at which data arrives.
Overscheduling the processor is not a good strategy if you have threads burning 100% core. They'll start to fight with each other, the context switching overhead causes less work to be done. It is fine when they block. Blocking automatically makes a core available to do something else.
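A widely used rule of thumb turns this reasoning into a sizing estimate: threads ≈ cores × (1 + wait time / compute time). The sketch below applies it; the formula is a heuristic rather than something stated in the answer above, and suggestedThreads and the sample timings are hypothetical.

```kotlin
import kotlin.math.roundToInt

// Heuristic: threads that never wait need one thread per core; threads that
// mostly wait leave their core free, so more of them can share it.
fun suggestedThreads(cores: Int, waitMs: Double, computeMs: Double): Int {
    require(cores > 0 && computeMs > 0.0 && waitMs >= 0.0)
    return maxOf(1, (cores * (1 + waitMs / computeMs)).roundToInt())
}

fun main() {
    val cores = Runtime.getRuntime().availableProcessors()
    println("pure computation: ${suggestedThreads(cores, waitMs = 0.0, computeMs = 10.0)} threads")
    println("90% waiting:      ${suggestedThreads(cores, waitMs = 90.0, computeMs = 10.0)} threads")
}
```

For threads running full throttle the estimate collapses to the core count, which matches the "threads ~= cores" rule in the question; it grows only as the share of blocked time grows.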

Thread vs async execution. What's different?

I believed any kind of asynchronous execution makes a thread in invisible area. But if so,
Async codes does not offer any performance gain than threaded codes.
But I can't understand why so many developers are making many features async form.
Could you explain about difference and cost of them?
The purpose of an asynchronous execution is to prevent the code calling the asynchronous method (the foreground code) from being blocked. This allows your foreground code to go on doing useful work while the asynchronous thread is performing your requested work in the background. Without asynchronous execution, the foreground code must wait until the background task is completed before it can continue executing.
The cost of an asynchronous execution is the same as that of any other task running on a thread.
Typically, an async result object is registered with the foreground code. The async result object can either raise an event when the background task is completed, or the foreground code can periodically check the async result object to see if its completion flag has been set.
Concurrency does not necessarily require threads.
In Linux, for example, you can perform non-blocking syscalls. Using this type of calls, you can for instance start a number of network reads. Your code can keep track of the reads manually (using handles in a list or similar) and periodically ask the OS if new data is available on any of the connections. Internally, the OS also keeps a list of ongoing reads. Using this technique, you can thus achieve concurrency without any (extra) threads, neither in your program nor in the OS.
If you use threads and blocking IO, you would typically start one thread per read. In this scenario, the OS will instead have a list of ongoing threads, which it parks when a thread tries to read data and none is available. Threads are resumed as data becomes available.
Having the OS switch between threads might involve slightly more overhead in the form of context switching - switching program counter and register content. But the real deal breaker is usually stack allocation per thread. This size is a couple of megabytes by default on Linux. If you have a lot of concurrency in your program, this might push you in the direction of using non-blocking calls to handle more concurrency per thread.
So it is possible to do async programming without threads. If you want to do async programming using only blocking OS calls, you need to dedicate a thread to do the blocking while you continue. But if you use non-blocking calls, you can do a lot of concurrent things with just a single thread. Have a look at Node.js, which has great support for many concurrent connections while being single-threaded for most operations.
Also check out Golang, which achieves a similar effect using a sort of green threads called goroutines. Multiple goroutines run concurrently on the same OS thread, and they are frugal with stack memory, pushing the limit much further.
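The same effect can be sketched on the JVM with Kotlin coroutines, where a suspended wait does not occupy a thread. In this illustration, delay stands in for a non-blocking wait on I/O and threadsUsed is a hypothetical helper.

```kotlin
import kotlinx.coroutines.*

// Starts `waits` concurrent waits and reports which threads ever ran them.
fun threadsUsed(waits: Int): Set<Thread> = runBlocking {
    val threads = mutableSetOf<Thread>()
    coroutineScope {
        repeat(waits) {
            launch {
                delay(50)                          // suspends; the thread is free meanwhile
                threads += Thread.currentThread()  // resumed on runBlocking's own thread
            }
        }
    }
    threads
}

fun main() {
    println("10,000 concurrent waits used ${threadsUsed(10_000).size} thread(s)")
}
```

All of the waits overlap on the single calling thread; doing the same with Thread.sleep would need one thread (and one stack) per wait.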
"Async codes does not offer any performance gain than threaded codes."
Asynchronous execution is one of the traits of multi-threaded execution, which is becoming more relevant as processors pack in more cores.
For servers, multi-core is only mildly relevant, as they are already written with concurrency in mind and will scale naturally, but multi-core is particularly relevant for desktop apps, which traditionally do only a few things concurrently, often just one foreground task with a background thread. Now they have to be coded to do many things concurrently if they are to take advantage of the power of a multi-core CPU.
As to performance: on a single core, asynchronous tasks slow down the system as much as they would if run sequentially (this is a simplification, but true for the most part). So, when running task A, which takes 10s, and task B, which takes 5s, on a single core, the total time needed will be 15s, whether B is run asynchronously or not. The reason is that as B runs, it takes CPU resources away from A: A and B compete for the same CPU.
With a multi-core machine, additional tasks run on otherwise unused cores, and so the situation is different: the additional tasks don't really consume any time, or more correctly, they don't take time away from the core running task A. So, running tasks A and B asynchronously on multi-core will consume just 10s, not 15s as with a single core. B's execution runs at the same time as A, and on a separate core, so A's execution time is unaffected.
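The arithmetic for tasks A and B can be written down as a tiny model. This is a simplified sketch that assumes purely CPU-bound tasks and ignores scheduling overhead; wallClock is a hypothetical helper that places each task on the least-loaded core.

```kotlin
// Estimates total wall-clock time for CPU-bound tasks spread over `cores` cores.
fun wallClock(durations: List<Int>, cores: Int): Int {
    require(cores > 0)
    val load = IntArray(cores)                        // accumulated seconds per core
    for (d in durations.sortedDescending()) {
        val idlest = load.indices.minByOrNull { load[it] }!!
        load[idlest] += d                             // give the task to the least busy core
    }
    return load.maxOrNull() ?: 0                      // the busiest core finishes last
}

fun main() {
    val tasks = listOf(10, 5)                         // task A: 10s, task B: 5s
    println(wallClock(tasks, cores = 1))              // 15
    println(wallClock(tasks, cores = 2))              // 10
}
```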
As the number of tasks and cores increase, then the potential improvements in performance also increase. In parallel computing, exploiting parallelism to produce an improvement in performance is known as speedup.
We are already seeing 64-core CPUs, and it's estimated that 1024 cores will be commonplace in a few years. That's a potential speedup of 1024 times compared to the single-threaded synchronous case. So, to answer your question, there clearly is a performance gain to be had by using asynchronous execution.
"I believed any kind of asynchronous execution makes a thread in invisible area."
This is your problem - this actually isn't true.
The thing is, your whole computer is actually massively asynchronous - requests to RAM, communication via a network card, accessing a HDD... those are all inherently asynchronous operations.
Modern OSes are actually built around asynchronous I/O. Even when you do a synchronous file request, for example (e.g. File.ReadAllText), the OS sends an asynchronous request. However, instead of giving control back to your code, it blocks while it waits for the response to the asynchronous request. And this is where proper asynchronous code comes in - instead of waiting for the response, you give the request a callback - a function to execute when the response comes back.
For the duration of the asynchronous request, there is no thread. The whole thing happens on a completely different level - say, the request is sent to the firmware on your NIC, and given a DMA address to fill the response. When the NIC finishes your request, it fills the memory, and signals an interrupt to the processor. The OS kernel handles the interrupt by signalling the owner application (usually an IOCP "channel") the request is done. This is still all done with no thread whatsoever - only for a short time right at the end, a thread is borrowed (in .NET this is from the IOCP thread pool) to execute the callback.
So, imagine a simple scenario. You need to send 100 simultaneous requests to a database engine. With multi-threading, you would spin up a new thread for each of those requests. That means a hundred threads, a hundred thread stacks, the cost of starting a new thread itself (starting a new thread is cheap; starting a hundred at the same time, not so much), quite a bit of resources. And those threads would just... block. Do nothing. When the response comes, the threads are awakened, one after another, and eventually disposed.
On the other hand, with asynchronous I/O, you can simply post all the requests from a single thread - and register a callback when each of those is finished. A hundred simultaneous requests will cost you just your original thread (which is free for other work as soon as the requests are posted), and a short time with threads from the thread pool when the requests are finished - in "worst" case scenario, about as many threads as you have CPU cores. Provided you don't use blocking code in the callback, of course :)
This doesn't necessarily mean that asynchronous code is automatically more efficient. If you only need a single request, and you can't do anything until you get a response, there's little point in making the request asynchronous. But most of the time, that's not your actual scenario - for example, you need to maintain a GUI in the meantime, or you need to make simultaneous requests, or your whole code is callback-based, rather than being written synchronously (a typical .NET Windows Forms application is mostly event-based).
The real benefit from asynchronous code comes from exactly that - simplified non-blocking UI code (no more "(Not Responding)" warnings from the window manager), and massively improved parallelism. If you have a web server that handles a thousand requests simultaneously, you don't want to waste 1 GiB of address space just for the completely unnecessary thread stacks (especially on a 32-bit system) - you only use threads when you have something to do.
So, in the end, asynchronous code makes UI and server code much simpler. In some cases, mostly with servers, it can also make it much more efficient. The efficiency improvements come precisely from the fact that there is no thread during the execution of the asynchronous request.
Your comment only applies to one specific kind of asynchronous code - multi-threaded parallelism. In that case, you really are wasting a thread while executing a request. However, that's not what people mean when saying "my library offers an asynchronous API" - after all, that's a 100% worthless API; you could have just called await Task.Run(TheirAPIMethod) and gotten the exact same thing.

Thread Pool vs Thread Spawning

Can someone list some comparison points between Thread Spawning vs Thread Pooling, which one is better? Please consider the .NET framework as a reference implementation that supports both.
Thread pool threads are much cheaper than a regular Thread, they pool the system resources required for threads. But they have a number of limitations that may make them unfit:
You cannot abort a threadpool thread
There is no easy way to detect that a threadpool completed, no Thread.Join()
There is no easy way to marshal exceptions from a threadpool thread
You cannot display any kind of UI on a threadpool thread beyond a message box
A threadpool thread should not run longer than a few seconds
A threadpool thread should not block for a long time
The latter two constraints are a side-effect of the threadpool scheduler, it tries to limit the number of active threads to the number of cores your CPU has available. This can cause long delays if you schedule many long running threads that block often.
Many other threadpool implementations have similar constraints, give or take.
A "pool" contains a list of available "threads" ready to be used whereas "spawning" refers to actually creating a new thread.
The usefulness of "Thread Pooling" lies in "lower time-to-use": creation time overhead is avoided.
In terms of "which one is better": it depends. If the creation-time overhead is a problem use Thread-pooling. This is a common problem in environments where lots of "short-lived tasks" need to be performed.
As pointed out by other folks, there is a "management overhead" for Thread-Pooling: this is minimal if properly implemented. E.g. limiting the number of threads in the pool is trivial.
For some definition of "better", you generally want to go with a thread pool. Without knowing what your use case is, consider that with a thread pool, you have a fixed number of threads which can all be created at startup or can be created on demand (but the number of threads cannot exceed the size of the pool). If a task is submitted and no thread is available, it is put into a queue until there is a thread free to handle it.
If you are spawning threads in response to requests or some other kind of trigger, you run the risk of depleting all your resources as there is nothing to cap the amount of threads created.
Another benefit to thread pooling is reuse - the same threads are used over and over to handle different tasks, rather than having to create a new thread each time.
As pointed out by others, if you have a small number of tasks that will run for a long time, this would negate the benefits gained by avoiding frequent thread creation (since you would not need to create a ton of threads anyway).
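Thread reuse in a pool is easy to observe on the JVM, where the same trade-offs apply. The sketch below uses Java's ExecutorService from Kotlin rather than the .NET thread pool the question asks about; distinctWorkerThreads is a hypothetical helper.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.Executors

// Runs many short tasks on a fixed pool and counts how many threads actually ran them.
fun distinctWorkerThreads(tasks: Int, poolSize: Int): Int {
    val pool = Executors.newFixedThreadPool(poolSize)
    val seen = ConcurrentHashMap.newKeySet<Thread>()
    try {
        (1..tasks)
            .map { pool.submit(Runnable { seen.add(Thread.currentThread()) }) }
            .forEach { it.get() }                 // wait for every task to finish
    } finally {
        pool.shutdown()
    }
    return seen.size
}

fun main() {
    // 1,000 tasks never use more than the 4 pooled threads; spawning would create 1,000.
    println("threads used: ${distinctWorkerThreads(tasks = 1_000, poolSize = 4)}")
}
```

The pool size is both the cap on resource usage and the limit on parallelism, which is why long-running or blocking tasks can starve short ones queued behind them.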
My feeling is that you should start just by creating a thread as needed... If the performance of this is OK, then you're done. If at some point, you detect that you need lower latency around thread creation you can generally drop in a thread pool without breaking anything...
All depends on your scenario. Creating new threads is resource intensive and an expensive operation. Most very short asynchronous operations (less than a few seconds max) could make use of the thread pool.
For longer running operations that you want to run in the background, you'd typically create (spawn) your own thread. (Ab)using a platform/runtime built-in threadpool for long running operations could lead to nasty forms of deadlocks etc.
Thread pooling is usually considered better, because the threads are created up front, and used as required. Therefore, if you are using a lot of threads for relatively short tasks, it can be a lot faster. This is because they are saved for future use and are not destroyed and later re-created.
In contrast, if you only need 2-3 threads and they will only be created once, then this will be better. This is because you do not gain from caching existing threads for future use, and you are not creating extra threads which might not be used.
It depends on what you want to execute on the other thread.
For short tasks it is better to use a thread pool; for a long task it may be better to spawn a new thread, since the task could otherwise starve the thread pool for other tasks.
The main difference is that a ThreadPool maintains a set of threads that are already spun-up and available for use, because starting a new thread can be expensive processor-wise.
Note however that even a ThreadPool needs to "spawn" threads... it usually depends on workload - if there is a lot of work to be done, a good threadpool will spin up new threads to handle the load based on configuration and system resources.
A little extra time is required for creating/spawning a thread, whereas a thread pool already contains created threads that are ready to be used.
This answer is a good summary but just in case, here is the link to Wikipedia:
http://en.wikipedia.org/wiki/Thread_pool_pattern
For Multi threaded execution combined with getting return values from the execution, or an easy way to detect that a threadpool has completed, java Callables could be used.
See https://blogs.oracle.com/CoreJavaTechTips/entry/get_netbeans_6 for more info.
Assuming C# and Windows 7 and up...
When you create a thread using new Thread(), you create a managed thread that becomes backed by a native OS thread when you call Start – a one to one relationship. It is important to know only one thread runs on a CPU core at any given time.
An easier way is to call ThreadPool.QueueUserWorkItem, which queues a work item to run on one of the pool's background threads instead of creating a thread just for it. Pool threads are ordinary OS-backed threads, but they are created once and reused: the pool tries to keep the number of active workers close to the number of cores, so with, say, 4 cores you typically have about 4 workers, each running many queued work items one after another. This offers lighter-weight multitasking, since picking up the next work item happens in user mode inside the runtime rather than requiring the kernel to create and tear down a thread. There is some overhead associated with crossing from user mode to kernel mode, and the thread pool minimizes such crossings.
It may be important to note that heavy multitasking might benefit from pure native OS threads in a well-designed multithreading framework. However, the performance benefits aren’t that much.
With using the ThreadPool, just make sure the minimum worker thread count is high enough or ThreadPool.QueueUserWorkItem will be slower than new Thread(). In a benchmark test looping 512 times calling new Thread() left ThreadPool.QueueUserWorkItem in the dust with default minimums. However, first setting the minimum worker thread count to 512, in this test, made new Thread() and ThreadPool.QueueUserWorkItem perform similarly.
A side effect of setting a high minimum worker thread count is that new Task() (or Task.Factory.StartNew) also performed similarly to new Thread() and ThreadPool.QueueUserWorkItem.

Resources