Profiling Ratpack: ExecControllerBindingThreadFactory high CPU usage and lots of threads - multithreading

We have a mobile app API server written with Ratpack 1.5.1 about to go live soon, and we're currently profiling the application to catch any performance bottlenecks. The app is backed by an SQL database and we're careful to always run queries using the Blocking class. The code is written in Kotlin and we wrote some coroutine glue code to force blocking operations to be executed on Ratpack's blocking threads.
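For reference, glue along these lines (a minimal sketch rather than our exact code; blockOnRatpack is a made-up name, and it must be called from a coroutine running inside a Ratpack Execution) bridges Blocking.get into a suspend function:

import kotlin.coroutines.resume
import kotlin.coroutines.resumeWithException
import kotlinx.coroutines.suspendCancellableCoroutine
import ratpack.exec.Blocking

suspend fun <T> blockOnRatpack(block: () -> T): T =
    suspendCancellableCoroutine { cont ->
        Blocking.get { block() }                      // runs on a ratpack-blocking thread
            .onError { cont.resumeWithException(it) } // propagate failures to the coroutine
            .then { cont.resume(it) }                 // resume back on ratpack-compute
    }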
Since Ratpack's thread model is unique, we'd like to make sure this situation is normal: we simulated 2500 concurrent users of the application and our thread count went up to 400 (and even 600 at one point), most of these being ratpack-blocking-x-yyy threads.
Sampling the CPU, we see 92% of the time spent in the ratpack.exec.internal.DefaultExecController$ExecControllerBindingThreadFactory.lambda$newThread$0 method, but this could be an artifact of sampling.
So, to ask concrete questions: given Ratpack's thread model, is the high blocking-thread count normal, and should we be worried about the high CPU time spent in the above-mentioned method?

Ratpack creates an unlimited(*) thread pool for blocking operations. It is created in DefaultExecController:
public DefaultExecController(int numThreads) {
    this.numThreads = numThreads;
    this.eventLoopGroup = ChannelImplDetector.eventLoopGroup(numThreads,
        new ExecControllerBindingThreadFactory(true, "ratpack-compute", Thread.MAX_PRIORITY));
    this.blockingExecutor = Executors.newCachedThreadPool(
        new ExecControllerBindingThreadFactory(false, "ratpack-blocking", Thread.NORM_PRIORITY));
}
Threads created in this pool don't get killed right after a blocking operation finishes - they idle in the pool, waiting for the next job. The main reason is that keeping an idle thread around is cheaper than spawning a new one whenever it is needed. That's why, when you simulate 2500 concurrent users calling an endpoint that executes a blocking operation, you see up to 2500 threads in this pool. The cached thread pool is built on the following ThreadPoolExecutor:
public static ExecutorService newCachedThreadPool(ThreadFactory threadFactory) {
    return new ThreadPoolExecutor(0, 2147483647, 60L, TimeUnit.SECONDS,
        new SynchronousQueue(), threadFactory);
}
where 2147483647 is the maximum pool size and 60L is the idle timeout in seconds: the executor keeps an idle thread alive for 60 seconds, and if it isn't reused within that window, it gets cleaned up.
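You can see this behaviour in isolation with a plain JDK cached pool - a minimal Kotlin sketch (the 100 tasks are an arbitrary example, and the printed thread count is only approximate):

import java.util.concurrent.Executors

fun main() {
    val pool = Executors.newCachedThreadPool()
    // 100 tasks submitted at once -> roughly 100 threads get created, because
    // the SynchronousQueue has no capacity to buffer the hand-off.
    repeat(100) { pool.execute { Thread.sleep(5_000) } }
    Thread.sleep(1_000)
    println("threads: ${Thread.activeCount()}") // ~100 while busy; trimmed after 60s idle
    pool.shutdown()
}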
High CPU in this case is actually expected - 2500 threads will keep several CPU cores busy. It also matters where your SQL database is running: if it runs on the same machine, your CPU has an even harder job to do. If the operations you run on the blocking thread pool consume significant CPU time, then you have to optimize those blocking operations. Ratpack's power comes from its async, non-blocking architecture - handlers run on the ratpack-compute thread pool and delegate all blocking operations to ratpack-blocking, so your application is never blocked and can handle tons of requests.
(*) unlimited in this case means limited by available memory - or, given enough memory, by 2147483647 threads (the maximum pool size used by Executors.newCachedThreadPool(factory)).

Just to build on Szymon's answer…
Ratpack doesn't inherently throttle any operations; that's effectively up to you. One option is to use https://ratpack.io/manual/current/api/ratpack/exec/Throttle.html to constrain and queue access to a resource.
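For instance, a throttled database call might look roughly like this (a hedged sketch in Kotlin against Ratpack's Java API; loadUserName and the size of 50 are made-up examples):

import ratpack.exec.Blocking
import ratpack.exec.Promise
import ratpack.exec.Throttle

// Share one Throttle per guarded resource; 50 is an arbitrary example size.
val dbThrottle: Throttle = Throttle.ofSize(50)

// At most 50 of these run concurrently; further promises queue until a slot frees.
fun loadUserName(id: Long): Promise<String> =
    Blocking.get { "user-$id" /* stand-in for the real SQL query */ }
        .throttled(dbThrottle)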

Related

Which usecases are suitable for Dispatchers.Default in Kotlin?

Based on the documentation, the thread pool sizes of the IO and Default dispatchers behave as follows:
Dispatchers.Default: By default, the maximal level of parallelism used by this dispatcher is equal to the number of CPU cores, but is at least two.
Dispatchers.IO: It defaults to the limit of 64 threads or the number of cores (whichever is larger).
Unless there is some piece of information I am missing, performing lots of CPU-intensive work on Default should be more efficient (faster), because context switching will happen less often.
But the following code actually runs much faster on Dispatchers.IO:
import java.time.Duration
import java.time.temporal.ChronoUnit
import kotlin.random.Random
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// Spins for ~1 second of wall-clock time, generating random numbers.
fun blockingWork() {
    val startTime = System.currentTimeMillis()
    while (true) {
        Random(System.currentTimeMillis()).nextDouble()
        if (System.currentTimeMillis() - startTime > 1000) {
            return
        }
    }
}

fun main() = runBlocking {
    val startTime = System.nanoTime()
    val jobs = (1..24).map { i ->
        launch(Dispatchers.IO) { // <-- Select dispatcher here
            println("Start #$i in ${Thread.currentThread().name}")
            blockingWork()
            println("Finish #$i in ${Thread.currentThread().name}")
        }
    }
    jobs.forEach { it.join() }
    println("Finished in ${Duration.of(System.nanoTime() - startTime, ChronoUnit.NANOS)}")
}
I am running 24 jobs on an 8-core CPU (so I can keep all the threads of the Default dispatcher busy). Here are the results on my machine:
Dispatchers.IO --> Finished in PT1.310262657S
Dispatchers.Default --> Finished in PT3.052800858S
Can you tell me what I am missing here? If IO works better, why should I use any dispatcher other than IO (or any thread pool with lots of threads)?
Answering your question: the Default dispatcher works best for tasks that do not block, because there is no gain in exceeding the maximum level of parallelism when executing such workloads concurrently (this is the difference between concurrent and parallel execution).
https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/5_CPU_Scheduling.html
Your experiment is flawed. As already mentioned in the comments, your blockingWork is not CPU-bound but time-bound. It's all about waiting - periods when your task's completion depends on wall-clock time rather than on the CPU executing its instructions. Your blockingWork is in essence "spin until 1000 milliseconds have passed", and waiting 1000 ms X times in parallel is going to be faster than doing it in sequence. You do perform some computation (generating random numbers), but as already noted, each worker simply generates more or fewer of them depending on how much CPU time its thread was given.
I performed some simple experiments generating Fibonacci numbers (often used to simulate CPU workloads). However, after taking the JVM's JIT into account, I couldn't easily produce results proving that the Default dispatcher performs better. It might be that the context switching isn't as significant as one may believe. It might be that the dispatcher wasn't creating more threads for my workload in the IO case. It might be that my experiment was also flawed. I can't be certain - benchmarking on the JVM is not simple by itself, and adding coroutines (and their thread pools) to the mix certainly doesn't make it any simpler.
However, I think there is something more important to consider here, and that is blocking. The Default dispatcher is more sensitive to blocking calls: with fewer threads in the pool, it is more likely that all of them become blocked and no other coroutine can execute during that time.
Your program works in threads. If all threads are blocked, your program isn't doing anything. Creating new threads is expensive (mostly memory-wise), so for high-load systems that feature blocking this becomes a limiting factor. Kotlin did an amazing job of introducing suspending functions: the concurrency of your program is no longer limited by the number of threads you have. If one flow needs to wait, it just suspends instead of blocking the thread. However, "the world is not perfect" and not everything suspends - there are still blocking calls. How certain are you that no library you use performs such calls under the hood? With great power comes great responsibility: with coroutines, one needs to be even more careful about deadlocks, especially on the Default dispatcher. In fact, in my opinion, the IO dispatcher should be the default one.
EDIT
TL;DR: You might actually want to create your own dispatchers.
Looking back, it strikes me that my answer is somewhat superficial. It's technically incorrect to decide which dispatcher to use only by looking at the type of workload you want to run. Confining a CPU-bound workload to a dispatcher that matches the number of CPU cores does optimize for throughput, but throughput is not the only performance metric.
Indeed, by using only Default for all CPU-bound workloads, you might find that your application becomes unresponsive! For example, say we have a CPU-bound, long-running background process that uses the Default dispatcher. If that process saturates the Default dispatcher's thread pool, coroutines started to handle immediate user actions (a click, or a client request) have to wait for the background work to finish first! You've achieved great CPU throughput, but at the cost of latency, and the overall performance of your application has actually degraded.
Kotlin does not force you to use predefined dispatchers. You can always create your own dispatchers custom-cut for the specific task you have for your coroutines.
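A minimal sketch (the pool size of 2 and the name backgroundDispatcher are arbitrary examples):

import java.util.concurrent.Executors
import kotlinx.coroutines.asCoroutineDispatcher

// Cap background crunching at 2 threads so the remaining cores stay
// free for latency-sensitive coroutines.
val backgroundDispatcher = Executors.newFixedThreadPool(2).asCoroutineDispatcher()

// usage: launch(backgroundDispatcher) { crunchNumbers() }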
Ultimately it's about:
Balancing resources. How many threads do you actually need? How many threads can you afford to create? Is the workload CPU-bound or IO-bound? Even if it is CPU-bound, are you sure you want to assign all of the CPU resources to it?
Assigning priorities. Understand what kinds of workloads run on your dispatchers. Maybe some workloads need to run immediately while others can wait?
Preventing starvation deadlocks. Make sure your currently running coroutines don't block while waiting for the result of a coroutine that is itself waiting for a free thread in the same dispatcher - see the sketch below.
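To illustrate the last point, here is a hedged sketch of how a single-threaded dispatcher deadlocks when a coroutine blocks instead of suspending (the names are made up):

import java.util.concurrent.Executors
import kotlinx.coroutines.*

fun main() {
    val single = Executors.newSingleThreadExecutor().asCoroutineDispatcher()
    runBlocking {
        launch(single) {
            val inner = async(single) { 42 } // needs the same, currently occupied, thread
            // Blocking here - e.g. runBlocking { inner.await() } - would pin the
            // dispatcher's only thread forever: a starvation deadlock.
            println(inner.await()) // suspending frees the thread, so inner can run
        }
    }
    single.close()
}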

.net 4.0 c# : Pausing/Resuming parallel running threads from threadpool temporarily?

I set up a multi-threaded environment using the .NET ThreadPool and I get a significant performance benefit. This runs in the background of my application.
Now, when a new task is requested by the user, I want it to get maximum CPU resources to maximize performance. Hence, I would like to temporarily pause all the threads I began (via the ThreadPool.QueueUserWorkItem method) and then resume them once the new foreground task, requested by the user, is completed.
There could be several solutions to my problem:
a. Starting fewer background threads so that any new user request gets some share of the CPU resources (but I lose the performance gain I had :( ).
b. Setting a higher priority for the thread of a newly requested user task (not sure if this works?).
c. Suspending/resuming the ThreadPool threads I began. But suspending/resuming/interrupting threads is highly discouraged; moreover, this could get tricky and error-prone.
Any other ideas?
Note: when the user makes a request, performing the task would normally take no more than 300 ms. However, when I have ThreadPool threads running in the background, it takes about 3 seconds to complete (10 times worse)! I am OK with 500-800 ms, though. All background threads complete in about 8 seconds (and I am OK if they take 1-2 seconds more). Hence, I am trying out option (a) for now.
Thanks in advance!
Note that thread scheduling is done by the OS and hence cannot be directed from within a program. The only thing that can be done is setting ThreadPriority (and only on threads you create yourself, not on ThreadPool threads). See the section "Limitations of Using the Thread Pool".
Since your requirement is to suspend all background threads while a new task executes, you can create a class-level flag.
Then put checkpoints in the methods executed as background tasks. At each checkpoint, check the class-level flag; if it is set, call Thread.Sleep, which should (though is not guaranteed to) trigger a thread context switch by the OS scheduler.
Putting checkpoints in the methods executed by the ThreadPool is analogous to putting checkpoints for cancellation support in a BackgroundWorker.
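The checkpoint pattern itself is runtime-agnostic; here is a hedged sketch of it in Kotlin (all names are made up, and 50 ms is an arbitrary back-off):

// The "class level flag": set true while the foreground task runs.
@Volatile
var pauseBackground = false

fun runBackgroundTasks(tasks: List<() -> Unit>) {
    for (task in tasks) {
        // Checkpoint: while paused, sleep so the scheduler can hand
        // the CPU to the foreground work.
        while (pauseBackground) Thread.sleep(50)
        task()
    }
}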

[CLR Threading]When a thread pool thread blocks, the thread pool creates additional threads

I saw this in the book "CLR via C#" and I don't quite get it. If there are still threads available in the thread pool, why does it create additional threads?
It might just be poor wording.
On a given machine, the thread pool has a good guess of the optimum number of threads the machine can run without overextending resources. If, for some reason, a thread becomes I/O-blocked (for instance, it is waiting a long time to save or retrieve data from disk, or for a response from a network device), the thread pool can start up another thread to take advantage of the unused CPU time. When the blocked thread unblocks, the thread pool takes the next freed thread out of the pool to bring the size back down to the "optimum" level.
This is part of the thread pool's management: it keeps the system from being over-tasked (all the context switches between too many threads reduce efficiency) while also reducing wasted cycles (while a thread is blocked, there might not be enough other work to keep the processor(s) fully busy, even though tasks are waiting to run) and wasted memory (threads spun up and ready but never used would only over-task the CPU).
More info on the Managed Thread Pool from MSDN.
The book lied.
The thread pool only creates additional threads when all available threads have been blocked for more than 1 second. If there are free threads, it will use them to process your additional tasks. Note that after 30 seconds of idling, the CLR retires a thread (terminates it - gracefully, of course).

Thread Pool vs Thread Spawning

Can someone list some comparison points between thread spawning and thread pooling - which one is better? Please consider the .NET Framework as a reference implementation that supports both.
Thread pool threads are much cheaper than a regular Thread because the pool shares the system resources required for threads. But they have a number of limitations that may make them unfit:
You cannot abort a threadpool thread
There is no easy way to detect that a threadpool thread has completed; there is no Thread.Join()
There is no easy way to marshal exceptions from a threadpool thread
You cannot display any kind of UI on a threadpool thread beyond a message box
A threadpool thread should not run longer than a few seconds
A threadpool thread should not block for a long time
The latter two constraints are a side effect of the threadpool scheduler: it tries to limit the number of active threads to the number of cores your CPU has available. This can cause long delays if you schedule many long-running threads that block often.
Many other threadpool implementations have similar constraints, give or take.
A "pool" contains a list of available "threads" ready to be used whereas "spawning" refers to actually creating a new thread.
The usefulness of "Thread Pooling" lies in "lower time-to-use": creation time overhead is avoided.
In terms of "which one is better": it depends. If the creation-time overhead is a problem use Thread-pooling. This is a common problem in environments where lots of "short-lived tasks" need to be performed.
As pointed out by other folks, there is a "management overhead" for Thread-Pooling: this is minimal if properly implemented. E.g. limiting the number of threads in the pool is trivial.
For some definition of "better", you generally want to go with a thread pool. Without knowing your use case, consider that with a thread pool you have a fixed number of threads, which can all be created at startup or on demand (but the number of threads cannot exceed the size of the pool). If a task is submitted and no thread is available, it is put into a queue until a thread is free to handle it.
If you are spawning threads in response to requests or some other kind of trigger, you run the risk of depleting all your resources, as there is nothing to cap the number of threads created.
Another benefit to thread pooling is reuse - the same threads are used over and over to handle different tasks, rather than having to create a new thread each time.
As pointed out by others, if you have a small number of tasks that will run for a long time, this would negate the benefits gained by avoiding frequent thread creation (since you would not need to create a ton of threads anyway).
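A rough Kotlin/JVM illustration of the reuse benefit (not a rigorous benchmark - the task count and pool size are arbitrary, and constrained machines may want smaller numbers):

import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import kotlin.system.measureTimeMillis

fun main() {
    val pool = Executors.newFixedThreadPool(8)
    val pooled = measureTimeMillis {
        repeat(10_000) { pool.execute { } } // the same 8 threads are reused throughout
        pool.shutdown()
        pool.awaitTermination(1, TimeUnit.MINUTES)
    }
    val spawned = measureTimeMillis {
        // One freshly created thread per task.
        val threads = List(10_000) { Thread {}.apply { start() } }
        threads.forEach { it.join() }
    }
    println("pool: ${pooled}ms, spawn: ${spawned}ms")
}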
My feeling is that you should start by just creating a thread as needed... If the performance of this is OK, then you're done. If at some point you detect that you need lower latency around thread creation, you can generally drop in a thread pool without breaking anything...
It all depends on your scenario. Creating new threads is resource-intensive and an expensive operation. Most very short asynchronous operations (less than a few seconds) can make use of the thread pool.
For longer-running operations that you want to run in the background, you'd typically create (spawn) your own thread. (Ab)using a platform/runtime built-in threadpool for long-running operations could lead to nasty forms of deadlocks etc.
Thread pooling is usually considered better because the threads are created up front and used as required. Therefore, if you are using a lot of threads for relatively short tasks, it can be a lot faster: the threads are saved for future use rather than destroyed and re-created each time.
In contrast, if you only need 2-3 threads and they will only be created once, spawning them yourself will be better: you do not gain from caching existing threads for future use, and you are not creating extra threads which might never be used.
It depends on what you want to execute on the other thread.
For short tasks it is better to use a thread pool; for long tasks it may be better to spawn a new thread, as long tasks could starve the pool for other tasks.
The main difference is that a ThreadPool maintains a set of threads that are already spun-up and available for use, because starting a new thread can be expensive processor-wise.
Note, however, that even a ThreadPool needs to "spawn" threads... it usually depends on workload - if there is a lot of work to be done, a good threadpool will spin up new threads to handle the load, based on its configuration and system resources.
There is a little extra time required for creating/spawning a thread, whereas a thread pool already contains created threads that are ready to be used.
This answer is a good summary but just in case, here is the link to Wikipedia:
http://en.wikipedia.org/wiki/Thread_pool_pattern
For multi-threaded execution combined with getting return values from the execution (or an easy way to detect that pooled work has completed), Java Callables can be used.
See https://blogs.oracle.com/CoreJavaTechTips/entry/get_netbeans_6 for more info.
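A minimal JVM sketch of the Callable/Future approach (in Kotlin; the pool size and task are arbitrary):

import java.util.concurrent.Callable
import java.util.concurrent.Executors

fun main() {
    val pool = Executors.newFixedThreadPool(4)
    val future = pool.submit(Callable { 21 * 2 }) // runs on a pool thread
    println(future.get()) // get() blocks until the task completes - an easy "join": 42
    pool.shutdown()
}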
Assuming C# and Windows 7 and up...
When you create a thread using new Thread(), you create a managed thread that becomes backed by a native OS thread when you call Start - a one-to-one relationship. It is important to know that only one thread runs on a CPU core at any given time.
An easier way is to call ThreadPool.QueueUserWorkItem (i.e. use a background thread), which in essence does the same thing, except those background threads aren't forever tied to a single native thread. The .NET scheduler will simulate multitasking between managed threads on a single native thread. With, say, 4 cores, you'll have 4 native threads, each running multiple managed threads, as determined by .NET. This offers lighter-weight multitasking, since switching between managed threads happens within the .NET VM, not in the kernel. There is some overhead associated with crossing from user mode to kernel mode, and the .NET scheduler minimizes such crossings.
It may be important to note that heavy multitasking might benefit from pure native OS threads in a well-designed multithreading framework. However, the performance benefits aren't that large.
When using the ThreadPool, just make sure the minimum worker thread count is high enough, or ThreadPool.QueueUserWorkItem will be slower than new Thread(). In a benchmark looping 512 times, calling new Thread() left ThreadPool.QueueUserWorkItem in the dust with the default minimums. However, first setting the minimum worker thread count to 512 made new Thread() and ThreadPool.QueueUserWorkItem perform similarly in this test.
A side effect of setting a high minimum worker thread count is that new Task() (or Task.Factory.StartNew) also performed similarly to new Thread() and ThreadPool.QueueUserWorkItem.

Many threads or as few threads as possible?

As a side project I'm currently writing a server for an age-old game I used to play. I'm trying to make the server as loosely coupled as possible, but I am wondering what would be a good design decision for multithreading. Currently I have the following sequence of actions:
Startup (creates) ->
Server (listens for clients, creates) ->
Client (listens for commands and sends period data)
I'm assuming an average of 100 clients, as that was the maximum at any given time for the game. What would be the right threading decision for the whole thing? My current setup is as follows:
1 thread on the server which listens for new connections; on a new connection, it creates a client object and starts listening again.
Each client object has one thread, listening for incoming commands and sending periodic data. This is done using a non-blocking socket, so it simply checks if there's data available, deals with it, and then sends the messages it has queued. Login is done before the send-receive cycle starts.
One thread (for now) for the game itself, as I consider that to be separate from the whole client-server part, architecturally speaking.
This would result in a total of 102 threads. I am even considering giving each client 2 threads, one for sending and one for receiving. If I do that, I can use blocking I/O on the receiver thread, which means it will be mostly idle in the average case.
My main concern is that by using this many threads I'll be hogging resources. I'm not worried about race conditions or deadlocks, as that's something I'll have to deal with anyway.
My design is setup in such a way that I could use a single thread for all client communications, no matter if it's 1 or 100. I've separated the communications logic from the client object itself, so I could implement it without having to rewrite a lot of code.
The main question is: is it wrong to use over 200 threads in an application? Does it have advantages? I'm thinking about running this on a multi-core machine - would it take much advantage of multiple cores like this?
Thanks!
Out of all these threads, most will usually be blocked. I don't expect more than 5 new connections per minute. Commands from a client will come in infrequently, I'd say 20 per minute on average.
Going by the answers I'm getting here (the context switching was the performance hit I was thinking of, but I didn't know that until you pointed it out - thanks!), I think I'll go for the approach with one listener, one receiver, one sender, and some miscellaneous stuff ;-)
Use an event stream/queue and a thread pool to maintain the balance; this will adapt better to other machines which may have more or fewer cores.
In general, many more active threads than you have cores will waste time context-switching.
If your game consists of a lot of short actions, a circular/recycling event queue will give better performance than a fixed number of threads.
To answer the question simply, it is entirely wrong to use 200 threads on today's hardware.
Each thread takes up 1 MB of memory, so you're taking up 200MB of page file before you even start doing anything useful.
By all means break your operations up into little pieces that can be safely run on any thread, but put those operations on queues and have a fixed, limited number of worker threads servicing those queues.
Update: Does wasting 200MB matter? On a 32-bit machine, it's 10% of the entire theoretical address space for a process - no further questions. On a 64-bit machine, it sounds like a drop in the ocean of what could be theoretically available, but in practice it's still a very big chunk (or rather, a large number of pretty big chunks) of storage being pointlessly reserved by the application, and which then has to be managed by the OS. It has the effect of surrounding each client's valuable information with lots of worthless padding, which destroys locality, defeating the OS and CPU's attempts to keep frequently accessed stuff in the fastest layers of cache.
In any case, the memory wastage is just one part of the insanity. Unless you have 200 cores (and an OS capable of utilizing them), you don't really have 200 parallel threads. You have (say) 8 cores, each frantically switching between 25 threads. Naively, you might think that as a result each thread experiences the equivalent of running on a core that is 25 times slower. But it's actually much worse than that - the OS spends more time taking one thread off a core and putting another one on it ("context switching") than it does actually allowing your code to run.
Just look at how any well-known successful design tackles this kind of problem. The CLR's thread pool (even if you're not using it) serves as a fine example. It starts off assuming that just one thread per core will be sufficient. It allows more to be created, but only to ensure that badly designed parallel algorithms will eventually complete. It refuses to create more than 2 threads per second, so it effectively punishes thread-greedy algorithms by slowing them down.
I write in .NET and I'm not sure whether the way I code is due to .NET limitations and its API design or whether this is a standard way of doing things, but this is how I've done this kind of thing in the past:
A queue object used for processing incoming data, sync-locked between the queuing thread and the worker thread to avoid race conditions.
A worker thread for processing data in the queue. The thread that fills the queue uses a semaphore to notify this thread that there are items to process. This thread starts before any of the others and contains a continuous loop that runs until it receives a shutdown request. The first instruction in the loop checks a pause/continue/terminate flag. The flag is initially set to pause, so the thread sits idle (instead of looping continuously) while there is no processing to be done. The queuing thread changes the flag when there are items in the queue. The worker then processes a single item per loop iteration; when the queue is empty, it sets the flag back to pause so that on the next iteration it waits until the queuing thread notifies it that there is more work to do. (See the sketch at the end of this answer.)
One connection-listener thread which listens for incoming connection requests and passes them off to...
A connection-processing thread that creates the connection/session. Keeping this separate from the connection-listener thread reduces the potential for missed connection requests while a request is being processed.
An incoming-data listener thread that listens for incoming data on the current connection. All data is passed off to a queuing thread; listener threads should do as little as possible beyond basic listening and handing the data off for processing.
A queuing thread that queues the data in the right order so everything can be processed correctly, and raises the semaphore to the processing queue to signal that there is data to process. Keeping this separate from the incoming-data listener means you're less likely to miss incoming data.
Some session object which is passed between methods, so that each user's session is self-contained throughout the threading model.
This keeps the threads down to as simple but robust a model as I've figured out. I would love to find a simpler model than this, but I've found that if I try to reduce the threading model any further, I start missing data on the network stream or missing connection requests.
It also assists with TDD (test-driven development), in that each thread processes a single task and is much easier to write tests for. Having hundreds of threads can quickly become a resource-allocation nightmare, while having a single thread becomes a maintenance nightmare.
It's far simpler to keep one thread per logical task, the same way you would have one method per task in a TDD environment, and you can logically separate what each should be doing. It's easier to spot potential problems and far easier to fix them.
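The queue-plus-worker core of this design, sketched in Kotlin for illustration (the answer above is .NET; here a BlockingQueue stands in for the flag-plus-semaphore signalling, and all names are made up):

import java.util.concurrent.LinkedBlockingQueue

const val SHUTDOWN = "\u0000shutdown" // sentinel that replaces the terminate flag

val queue = LinkedBlockingQueue<String>()

val worker = Thread {
    while (true) {
        val item = queue.take() // idles here until something is queued
        if (item == SHUTDOWN) break
        println("processing $item") // stand-in for the real processing step
    }
}.apply { start() }

// Listener threads do as little as possible: hand the data off and return.
fun onDataReceived(data: String) = queue.put(data)
fun shutdownWorker() = queue.put(SHUTDOWN)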
What's your platform? If Windows then I'd suggest looking at async operations and thread pools (or I/O Completion Ports directly if you're working at the Win32 API level in C/C++).
The idea is that you have a small number of threads dealing with your I/O, which makes your system capable of scaling to large numbers of concurrent connections, because there's no relationship between the number of connections and the number of threads used by the process serving them. As expected, .NET insulates you from the details and Win32 doesn't.
The challenge of using async I/O and this style of server is that the processing of client requests becomes a state machine on the server and the data arriving triggers changes of state. Sometimes this takes some getting used to but once you do it's really rather marvellous;)
I've got some free code that demonstrates various server designs in C++ using IOCP here.
If you're on Unix, or need to be cross-platform, and you're in C++, then you might want to look at Boost.Asio, which provides async I/O functionality.
I think the question you should be asking is not whether 200 threads in general is good or bad, but rather how many of them are going to be active.
If only several of them are active at any given moment, while all the others are sleeping or waiting or whatnot, then you're fine. Sleeping threads, in this context, cost you nothing.
However, if all 200 of those threads are active, your CPU will waste a great deal of time on context switches between them.
