Performance problem of java.lang.Thread and threads in ThreadPool - multithreading

A Kafka consumer application has severe latency: it does not consume Kafka events fast enough during the peak hour. The Kafka topic has 120 partitions, and the consumer group has 30 hosts with two consumers each, so each consumer reads from 2 Kafka partitions. The hosts are AWS C5.9xlarge instances with 32 cores. Each consumer runs in its own java.lang.Thread, and within each thread a ThreadPool with 250 threads is created.
We have verified that none of CPU, memory, or I/O is the bottleneck. We then increased the pool from 250 to 500 workers, but the latency stayed the same. Next we went back to 250 workers and increased the number of consumers per host from 2 to 4, so that each consumer reads from a single Kafka partition. That solved the problem: the latency dropped to a very low level.
My question is: why did increasing the ThreadPool from 250 to 500 threads not help, while increasing from 2 to 4 consumers per host did?
private class ConsumerThread extends Thread {
    private final StremProcessor processor;
    private final KafkaConsumer consumer;

    public ConsumerThread(StremProcessor processor) {
        this.processor = processor;
        this.consumer = new KafkaConsumer();
    }

    @Override
    public void run() {
        // One pool of 250 worker threads per consumer thread.
        ExecutorService executor = Executors.newFixedThreadPool(250);
        while (true) {
            Data data = consumer.poll();
            executor.invokeAll(getTasks(data, processor)); //processor is
        }
    }
}

First of all, you should add a short delay between iterations of your while loop to prevent your application from flooding its memory.
ExecutorService.invokeAll() returns a list of Futures, one per submitted task. You can use them to "control" your tasks: check whether they are done, fetch their results, or cancel them.
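One property of invokeAll() is worth spelling out for the code in the question: it blocks the calling thread until every submitted task has finished. The following is a minimal sketch of that behavior, not the asker's code; the class name, the method name, and the sleeping tasks are all hypothetical stand-ins. If each poll() only yields a handful of tasks, which is an assumption about the asker's workload, then most of the 250 (or 500) threads have nothing to do, while every additional consumer adds another independent poll loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class InvokeAllDemo {
    // Runs `taskCount` sleeping tasks and returns how many futures are
    // already done at the moment invokeAll() returns.
    static int doneWhenInvokeAllReturns(int poolSize, int taskCount, long sleepMillis)
            throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                final int id = i;
                tasks.add(() -> {
                    Thread.sleep(sleepMillis); // hypothetical stand-in for per-record work
                    return id;
                });
            }
            // Blocks the calling thread until every task has completed.
            List<Future<Integer>> futures = executor.invokeAll(tasks);
            int done = 0;
            for (Future<Integer> f : futures) {
                if (f.isDone()) {
                    done++;
                }
            }
            return done;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Only 4 tasks per batch: at most 4 of the 250 pool threads ever run.
        System.out.println(doneWhenInvokeAllReturns(250, 4, 50));
    }
}
```

Because the loop cannot call poll() again until the slowest task of the previous batch has finished, the per-consumer throughput in such a design is bounded by batch latency rather than by the pool size.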
How are the threads in ThreadPool different from the java.lang.Thread?
They do not differ, but you get a wrapper (a Future) back which lets you control the task while it executes. The underlying thread works like any other Java thread.
Is it because all the threads in the ThreadPool use a single processor core?
No

A thread pool is nothing but a reusable pool of java.lang.Thread instances. Generally, a thread pool has a queue of tasks. Whenever a thread in the pool is free, it takes a task from the queue and executes it; when the task is done, the thread goes back to the pool and checks whether another task is waiting in the queue.
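As a minimal sketch of that reuse (the class and method names here are made up for illustration), the following submits many small tasks to a small fixed pool and counts how many distinct threads actually ran them:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PoolReuseDemo {
    // Submits `taskCount` tasks to a pool of `poolSize` threads and returns
    // the number of distinct threads that executed them.
    static int distinctWorkerThreads(int poolSize, int taskCount) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Set<String> names = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < taskCount; i++) {
            // Each task records the name of the thread it happens to run on.
            pool.execute(() -> names.add(Thread.currentThread().getName()));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return names.size();
    }

    public static void main(String[] args) throws InterruptedException {
        // 20 tasks, but never more than 2 distinct worker threads.
        System.out.println(distinctWorkerThreads(2, 20));
    }
}
```

The tasks wait in the pool's queue and are picked up by whichever of the two workers becomes free, which is the only practical difference from starting twenty threads by hand.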
How are the threads in ThreadPool different from the java.lang.Thread?
There is no difference in the threads themselves, only in how they are used.
Is it because all the threads in the ThreadPool use a single processor core?
No, the pool's threads can run on any number of the available processors.
I remember that the default threads in the ExecutorPool is 250 per processor, does that mean the ExecutorPool is not smart enough to distribute the 250 threads to 16 cores?
Where did you get the information that the "ExecutorPool is 250 per processor"? I don't completely understand your question. A thread from a thread pool can execute on any core, just like a normal thread; there are no extra restrictions on pool threads.

Related

Can we achieve parallel processing using multiple cpu cores in NodeJs with worker threads?

I know "cluster" and "child_process" can use multiple cores of a CPU so that we can achieve true parallel processing.
I also know that the async event loop is single-threaded so we can only achieve concurrency.
My question is about worker_threads:
Assume that my computer has a 4-core CPU and I'm executing a Node.js script that creates three worker threads.
Would the three worker threads make use of the remaining 3 cores in the CPU to achieve parallelism?
Or will the three worker threads only use the main core, leaving the remaining 3 cores unused, just like the event loop?
Would the three worker thread make use of the remaining 3 cores in the CPU to achieve parallelism?
Yes, you can achieve parallelism. The actual CPU allocation is, of course, up to the operating system, but these will be true OS threads and will be able to take advantage of multiple CPUs.
or the three worker threads will only use the main core and the remaining 3 core are not used just like the event loop?
No. Each worker thread can use a separate CPU. Each thread has its own separate event loop.
The main case in which the four threads are not independent is when they communicate with each other via messaging, because those messages go through the recipient's event loop. If thread A sends a message to the main thread, that message goes into the main thread's event queue and won't be received until the main thread gets back to its event loop to retrieve the next message from the queue. The same is true in reverse: if you send a message from the main thread to thread A while thread A is busy executing a CPU-intensive task, that message won't be received until thread A gets back to its event loop (e.g. finishes its CPU-intensive task).
Also, be careful if your threads are doing I/O (particularly disk I/O) as they may be competing for access to those resources and may get stuck waiting for other threads to finish using a resource before they can proceed.

Number of threads in a thread pool

I have two questions.
1. What is the difference between a thread and a thread pool? Can I have multiple thread pools (not threads) in my system?
2. I have been reading that the usual size of a thread pool is the number of processors, or one more than that. I am using a quad-core processor, which means I can have 4 or 5 threads in a thread pool. However, Task Manager shows more than 1000 threads active on my system at any time. Why?
What is the difference between a thread and a thread pool?
A thread is a single flow of execution. A thread pool is a group of these threads; usually the threads in a thread pool are kept alive indefinitely (i.e. until program shutdown) so that as each new work-request comes in, it can be handed to the next available thread in the thread-pool for processing. (This is beneficial because it's more efficient to just wake up an existing thread and hand it some work than it is to always create a new thread every time a new work-request comes in, and then destroy the thread afterwards)
Can I have multiple thread pools (not threads) in my system.
Yes.
I have been reading that general size of a threads in a thread pool is to be same as the number of processors or one more than the processor.
That's a good heuristic, but it's not a requirement; your thread pool can have as many or as few threads in it as you like. The reason people suggest that number is that if you have fewer threads in your thread pool than you have physical CPU cores, then under heavy load not all of your CPU cores will get used. For example, with a 3-thread pool and 4 CPU cores, under heavy load you'll have 3 CPU cores busy and 1 CPU core idle/wasted, and (assuming the work parallelizes evenly) your program will take about a third longer to finish than it would with 4 threads in the pool. On the other hand, if you have more threads than CPU cores, then under heavy load the "extra" threads merely end up time-sharing a CPU core, slowing each other down without improving the work-completion rate.
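A minimal Java sketch of that heuristic for CPU-bound work follows; the class name, the method name, and the summing workload are illustrative assumptions, not part of the question. The pool is sized to the number of cores the runtime reports, and the work is split into one chunk per thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class CpuBoundPoolDemo {
    // Sums 1..n using one chunk of the range per pool thread.
    static long parallelSum(long n) throws InterruptedException, ExecutionException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            long chunk = (n + cores - 1) / cores; // ceiling division
            for (int i = 0; i < cores; i++) {
                final long from = i * chunk + 1;
                final long to = Math.min(n, (i + 1) * chunk);
                parts.add(pool.submit(() -> {
                    long s = 0;
                    for (long v = from; v <= to; v++) {
                        s += v;
                    }
                    return s;
                }));
            }
            long total = 0;
            for (Future<Long> f : parts) {
                total += f.get(); // wait for each chunk's partial sum
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelSum(100)); // 5050
    }
}
```

Note that availableProcessors() reports logical processors as seen by the JVM, which may differ from the physical core count.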
However under task manager my system shows more than 1000 threads active anytime..??
The thing to notice about those 1000 threads is that probably 99% of them are actually asleep at any given moment. Most threads aren't doing work all the time; rather they spend most of their lives waiting for some particular event to occur, quickly handling it, and then going back to sleep until the next event comes along for them to handle. That's the reason why you can have 1000 threads present on just a handful of CPU cores without everything bogging down.
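This can be observed directly. The sketch below (in Java, with hypothetical names) starts a number of threads that do nothing but wait for an event, then counts how many of them the JVM reports as WAITING, that is, parked and consuming no CPU:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

class IdleThreadsDemo {
    // Starts `n` threads that block on a latch and returns how many of them
    // were observed in the WAITING state before being released.
    static int countWaiting(int n) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(() -> {
                try {
                    release.await(); // sleep until the "event" arrives
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.setDaemon(true);
            t.start();
            threads.add(t);
        }
        int waiting = 0;
        // Poll for up to about 5 seconds until every thread has parked.
        for (int attempt = 0; attempt < 500 && waiting < n; attempt++) {
            Thread.sleep(10);
            waiting = 0;
            for (Thread t : threads) {
                if (t.getState() == Thread.State.WAITING) {
                    waiting++;
                }
            }
        }
        release.countDown(); // deliver the event and let all threads finish
        for (Thread t : threads) {
            t.join();
        }
        return waiting;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countWaiting(200) + " threads were idle at once");
    }
}
```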

Profiling Ratpack: ExecControllerBindingThreadFactory high CPU usage and lots of threads

We have a mobile app API server written with Ratpack 1.5.1 about to go live soon, and we're currently profiling the application to catch any performance bottlenecks. The app is backed by an SQL database and we're careful to always run queries using the Blocking class. The code is written in Kotlin and we wrote some coroutine glue code to force blocking operations to be executed on Ratpack's blocking threads.
Since Ratpack's thread model is unique we'd like to make sure this situation is normal: we simulated 2500 concurrent users of the application and our thread count went up to 400 (and even 600 at one point), most of these being ratpack-blocking-x-yyy threads.
Sampling the CPU we get 92% time spent in the ratpack.exec.internal.DefaultExecController$ExecControllerBindingThreadFactory.lambda$newThread$0 method, but this could be an artifact of sampling.
So, to ask concrete questions: given Ratpack's thread model, is the high blocking thread count normal and should we be worrying about the high CPU time spent in the above mentioned method?
Ratpack creates an unlimited(*) thread pool for blocking operations. It is created in DefaultExecController:
public DefaultExecController(int numThreads) {
    this.numThreads = numThreads;
    this.eventLoopGroup = ChannelImplDetector.eventLoopGroup(numThreads, new ExecControllerBindingThreadFactory(true, "ratpack-compute", Thread.MAX_PRIORITY));
    this.blockingExecutor = Executors.newCachedThreadPool(new ExecControllerBindingThreadFactory(false, "ratpack-blocking", Thread.NORM_PRIORITY));
}
Threads created in this pool are not killed right after a blocking operation is done; they idle in the pool and wait for the next job. The main reason is that keeping a thread idle is cheaper than spawning new threads whenever they are needed. That's why, when you simulate 2500 concurrent users calling an endpoint which executes a blocking operation, you can see up to 2500 threads in this pool. The cached thread pool is backed by the following ThreadPoolExecutor object:
public static ExecutorService newCachedThreadPool(ThreadFactory threadFactory) {
    return new ThreadPoolExecutor(0, 2147483647, 60L, TimeUnit.SECONDS, new SynchronousQueue(), threadFactory);
}
where 2147483647 is the maximum pool size and 60L is the keep-alive time, expressed in seconds. This means the executor service keeps idle threads for 60 seconds and, if they are not reused within that time, cleans them up.
High CPU usage in this case is actually expected: 2500 threads are sharing a few CPU cores. It also matters where your SQL database is running. If it runs on the same machine, your CPU has an even harder job to do. If the operations you run on the blocking thread pool consume significant CPU time, then you have to optimize those blocking operations. Ratpack's power comes from its async, non-blocking architecture: handlers use the ratpack-compute thread pool and delegate all blocking operations to ratpack-blocking, so your application is not blocked and can handle tons of requests.
(*) unlimited in this case means limited by available memory or if you have enough memory it is limited by 2147483647 threads (this value is used in ExecutorService.newCachedThreadPool(factory)).
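The growth behavior can be reproduced with plain java.util.concurrent, independent of Ratpack. This is a minimal sketch with hypothetical names; it builds a ThreadPoolExecutor with the same parameters as newCachedThreadPool and submits tasks that all block at the same time, standing in for concurrent blocking database calls.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class CachedPoolGrowthDemo {
    // Submits `concurrentTasks` tasks that all block simultaneously and
    // returns the largest number of threads the pool ever held.
    static int largestPoolSize(int concurrentTasks) throws InterruptedException {
        // Same parameters as Executors.newCachedThreadPool().
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS, new SynchronousQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < concurrentTasks; i++) {
            pool.execute(() -> {
                try {
                    release.await(); // stand-in for a blocking call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int largest = pool.getLargestPoolSize();
        release.countDown(); // unblock every task
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return largest;
    }

    public static void main(String[] args) throws InterruptedException {
        // No idle thread is ever available, so each task gets a new thread.
        System.out.println(largestPoolSize(20)); // 20
    }
}
```

Because the SynchronousQueue has no capacity, a task is either handed to an idle thread or causes a new thread to be created, so the thread count tracks the number of simultaneously blocked operations.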
Just to build on Szymon's answer…
Ratpack doesn't inherently throttle any operations. That's effectively up to you. One option you have is to use https://ratpack.io/manual/current/api/ratpack/exec/Throttle.html to constrain and queue access to a resource.

Thread optimal pool size configuration?

What is the reason for keeping the thread pool size equal to the number of processors/cores for CPU-intensive tasks? And why should I/O-bound tasks have a larger pool size?
The optimal number of threads correlates with the number of central processing units because a thread can be thought of as a program. Programs require run time, and run time is provided by a central processing unit.
In a producer-consumer analogy, the program is the consumer and the central processing unit is the producer. So, theoretically, if one producer (CPU) can handle T consumers (threads) and there are C producers, the optimal number of consumers is T * C.
Too many threads cause excessive context-switch overhead, which is CPU time wasted on managing the threads themselves. Too few leave CPUs idle while tasks are still in the queue.
I/O bound tasks communicate with slow devices (that's the reason they're called I/O bound). While requests are made to a slow device (such as the hard drive), the scheduler can have the cpu run other threads instead of waiting for the device's output.
An analogy for that would be you (the scheduler) ordering food in a restaurant (thread 1) and then sending an SMS to your friend (thread 2). The fact that you're waiting for your food shouldn't prevent you from completing other tasks, such as sending the SMS to your friend.
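One commonly quoted way to turn this into a number, which is an assumption brought in here and not something the answer itself states, is to scale the core count by the ratio of waiting time to computing time: threads = cores * (1 + wait / compute). A CPU-bound task has a wait time near zero and ends up with one thread per core; a task that spends most of its time waiting on a device ends up with many more. The helper below is a hypothetical sketch of that formula, and the inputs would have to be measured for a real workload.

```java
class PoolSizing {
    // Estimates a pool size from the core count and the average time a task
    // spends waiting (e.g. on I/O) versus computing. Units only need to match.
    static int poolSize(int cores, double waitTime, double computeTime) {
        double estimate = cores * (1 + waitTime / computeTime);
        return (int) Math.max(1, Math.round(estimate));
    }

    public static void main(String[] args) {
        // Pure computation: one thread per core.
        System.out.println(poolSize(4, 0, 1));   // 4
        // 90 ms waiting per 10 ms of computing: ten threads per core.
        System.out.println(poolSize(4, 90, 10)); // 40
    }
}
```

Treat the result as a starting point for load testing rather than a final setting, since context-switch and memory costs are not part of the formula.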
To have deeper knowledge about possible optimizations you may want to read about affinity and scheduling.

[CLR Threading] When a thread pool thread blocks, the thread pool creates additional threads

I saw this statement in the book "CLR via C#" and I don't understand it. If there are still threads available in the thread pool, why does it create additional threads?
It might just be poor wording.
On a given machine the threadpool has a good guess of the optimum number of threads the machine can run without overextending resources. If, for some reason, a thread becomes IO blocked (for instance it is waiting for a long time to save or retrieve data from disk or for a response from a network device) the threadpool can start up another thread to take advantage of unused CPU time. When the other thread is no longer blocking, the threadpool will take the next freed thread out of the pool to reduce the size back to "optimum" levels.
This is part of the threadpool management to keep the system from being over-tasked (and reducing efficiency by all the context switches between too many threads) while reducing wasted cycles (while a thread is blocked there might not be enough other work to task the processor(s) fully even though there are tasks waiting to be run) and wasted memory (having threads spun up and ready but never allocated because they'd over task the CPU).
More info on the Managed Thread Pool from MSDN.
The book lied.
Threadpool only creates additional threads when all available threads have been blocked for more than 1 second. If there are free threads, it will use them to process your additional tasks. Note that after 30 seconds of thread idle, the CLR retires the thread (terminates it, gracefully of course).

Resources