Threading taking up large amounts of CPU - multithreading

I'm using Thread to do threading in Perl; I'd say I'm fairly new to threading.
I have a variable in my program called "max threads". If the number of running threads falls below this number, the program spawns a new one. I'm using a while loop to compare the current number of existing threads to the maximum-threads variable.
I'm assuming that the while loop is the thing consuming my CPU.
Is there any way that I can have the 'boss' or 'manager' thread (the core thread) not take up as much CPU while arranging and managing threads? If my CPU usage is rising just because of the manager thread, then there's ultimately no point to threading at all!

If you want to keep the current model, you should have some kind of signal (probably a semaphore) on which the thread launcher can block when there are too many workers.
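For illustration, here is a rough sketch of that semaphore idea in Python rather than Perl (handle() and the job list are hypothetical placeholders):

import threading

MAX_THREADS = 8
slots = threading.Semaphore(MAX_THREADS)   # one slot per allowed worker

def handle(job):         # hypothetical: your real per-job work
    print("done:", job)

def worker(job):
    try:
        handle(job)
    finally:
        slots.release()  # free the slot when this worker exits

jobs = range(100)        # hypothetical task list
for job in jobs:
    slots.acquire()      # blocks (no CPU spin) while MAX_THREADS workers are running
    threading.Thread(target=worker, args=(job,)).start()

The launcher sleeps inside acquire() instead of polling a thread count, so it costs nothing while the pool is full.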
A much simpler model is to have a pool of workers and to give them work via a Thread::Queue.
use threads;
use Thread::Queue;

my $q = Thread::Queue->new();

my @workers;
for (1..$MAX_WORKERS) {
    push @workers, async {
        while (my $job = $q->dequeue()) {
            ...
        }
    };
}

for (...) {
    $q->enqueue(...);
}

# Time to exit
$q->enqueue(undef) for 0..$#workers;

# Wait for workers to finish.
$_->join() for @workers;

I don't use Perl, but speaking from a general asynchronous programming perspective, you want a thread pool manager that isn't clogging up the main thread, and this can be accomplished in multiple ways. For one thing, you can dedicate a thread (yay!) to doing something like this (pseudocode):
while program not terminating:
    wait a quarter-second or so, then
    do your "are-there-enough-threads" check
The OS, or your abstracted run-time library, will generally supply some kind of wait function that halts the thread until a specific amount of time has passed (thus taking up no scheduler resource during that time).
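A minimal sketch of such a manager in Python, assuming a hypothetical top_up_worker_pool() check; threading.Event.wait is one such blocking wait function:

import threading

shutting_down = threading.Event()

def top_up_worker_pool():       # hypothetical: your "are-there-enough-threads" check
    pass

def manager():
    while not shutting_down.is_set():
        # wait() blocks without consuming CPU until the timeout elapses,
        # unlike a busy while-loop that pins a core.
        shutting_down.wait(0.25)
        top_up_worker_pool()

threading.Thread(target=manager, daemon=True).start()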
Alternatively, if your program is event-driven (as in a GUI environment), you could do similar pool management off the main thread by posting yourself timer messages, which is another service generally supplied by the OS.

Perl threads are heavy-weight compared to threads in other languages. They take a lot of resources to start, so try to start all the threads you need up front and just keep them running. Starting new threads every time you have an asynchronous task to do will be very inefficient.

Related

Which usecases are suitable for Dispatchers.Default in Kotlin?

Based on the documentation, the thread pool sizes of the IO and Default dispatchers behave as follows:
Dispatchers.Default: By default, the maximal level of parallelism used by this dispatcher is equal to the number of CPU cores, but is at least two.
Dispatchers.IO: It defaults to the limit of 64 threads or the number of cores (whichever is larger).
Unless there is a piece of information that I am missing, performing lots of CPU-intensive work on Default should be more efficient (faster), because context switching will happen less often.
But the following code actually runs much faster on Dispatchers.IO:
import java.time.Duration
import java.time.temporal.ChronoUnit
import kotlin.random.Random
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

fun blockingWork() {
    val startTime = System.currentTimeMillis()
    while (true) {
        Random(System.currentTimeMillis()).nextDouble()
        if (System.currentTimeMillis() - startTime > 1000) {
            return
        }
    }
}

fun main() = runBlocking {
    val startTime = System.nanoTime()
    val jobs = (1..24).map { i ->
        launch(Dispatchers.IO) { // <-- Select dispatcher here
            println("Start #$i in ${Thread.currentThread().name}")
            blockingWork()
            println("Finish #$i in ${Thread.currentThread().name}")
        }
    }
    jobs.forEach { it.join() }
    println("Finished in ${Duration.of(System.nanoTime() - startTime, ChronoUnit.NANOS)}")
}
I am running 24 jobs on an 8-core CPU (so I can keep all the threads of the Default dispatcher busy). Here are the results on my machine:
Dispatchers.IO --> Finished in PT1.310262657S
Dispatchers.Default --> Finished in PT3.052800858S
Can you tell me what I am missing here? If IO works better, why should I use any dispatcher other than IO (or any thread pool with lots of threads)?
Answering your question: the Default dispatcher works best for tasks that do not feature blocking, because there is no gain in exceeding the maximum level of parallelism when executing such workloads concurrently (on the difference between concurrent and parallel execution, see https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/5_CPU_Scheduling.html).
Your experiment is flawed. As already mentioned in the comments, your blockingWork is not CPU-bound but IO-bound. It's all about waiting: periods when your task is blocked and the CPU cannot execute its subsequent instructions. Your blockingWork is in essence just "wait for 1000 milliseconds", and waiting 1000 ms X times in parallel is going to be faster than doing it in sequence. You perform some computation (generating a random number, which in essence might also be IO-bound), but as already noted, your workers generate more or fewer of those numbers depending on how much time the underlying threads have been put to sleep.
I performed some simple experiments with generating Fibonacci numbers (often used to simulate CPU workloads). However, after taking into account the JIT in the JVM, I couldn't easily produce any results proving that the Default dispatcher performs better. It might be that the context switching isn't as significant as one may believe. It might be that the dispatcher wasn't creating more threads with the IO dispatcher for my workload. It might be that my experiment was also flawed. I can't be certain: benchmarking on the JVM is not simple by itself, and adding coroutines (and their thread pools) to the mix certainly isn't making it any simpler.
However, I think there is something more important to consider here, and that is blocking. The Default dispatcher is more sensitive to blocking calls: with fewer threads in the pool, it is more likely that all of them become blocked and no other coroutine can execute at that time.
Your program does its work on threads. If all threads are blocked, then your program isn't doing anything. Creating new threads is expensive (mostly memory-wise), so for high-load systems that feature blocking this becomes a limiting factor. Kotlin did an amazing job of introducing "suspending" functions: the concurrency of your program is no longer limited by the number of threads you have. If one flow needs to wait, it just suspends instead of blocking the thread. However, the world is not perfect and not everything "suspends"; there are still "blocking" calls. How certain are you that no library you use performs such calls under the hood? With great power comes great responsibility. With coroutines, one needs to be even more careful about deadlocks, especially when using the Default dispatcher. In fact, in my opinion, the IO dispatcher should be the default one.
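The starvation effect is easy to reproduce outside Kotlin. A minimal sketch in Python, with a deliberately tiny fixed-size pool standing in for the Default dispatcher:

import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=2)   # stand-in for a small Default-style pool

def hidden_blocking_call():
    time.sleep(2)   # e.g. a library call that blocks under the hood

for _ in range(2):
    pool.submit(hidden_blocking_call)      # both pool threads are now blocked

t0 = time.time()
pool.submit(lambda: None).result()         # a quick task must wait for a free thread
print(f"quick task waited {time.time() - t0:.1f}s")   # roughly 2.0s

The quick task sits in the queue even though the CPU is idle, which is exactly the unresponsiveness described in the edit below.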
EDIT
TL;DR: You might actually want to create your own dispatchers.
Looking back, it has come to my attention that my answer is somewhat superficial. It's technically incorrect to decide which dispatcher to use by looking only at the type of workload you want to run. Confining a CPU-bound workload to a dispatcher that matches the number of CPU cores does indeed optimize for throughput, but that is not the only performance metric.
Indeed, by using only Default for all CPU-bound workloads, you might find that your application becomes unresponsive! For example, let's say we have a "CPU-bound" long-running background process that uses the Default dispatcher. If that process saturates the thread pool of the Default dispatcher, then coroutines that are started to handle immediate user actions (a user click or a client request) need to wait for the background process to finish first! You have achieved great CPU throughput, but at the cost of latency, and the overall performance of your application is actually degraded.
Kotlin does not force you to use predefined dispatchers. You can always create your own dispatchers custom-cut for the specific task you have for your coroutines.
Ultimately it's about:
Balancing resources. How many threads do you actually need? How many threads can you afford to create? Is the workload CPU-bound or IO-bound? Even if it is CPU-bound, are you sure you want to assign all of the CPU resources to it?
Assigning priorities. Understand what kind of workloads run on your dispatchers. Maybe some workloads need to run immediately while others can wait?
Preventing starvation deadlocks. Make sure your currently running coroutines don't block waiting for the result of a coroutine that is waiting for a free thread in the same dispatcher.
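As a language-neutral sketch of that advice (Python executors standing in for custom Kotlin dispatchers; the task names are hypothetical):

import time
from concurrent.futures import ThreadPoolExecutor

def long_batch_job():      # hypothetical: heavy, long-running background work
    time.sleep(5)

def handle_user_click():   # hypothetical: a small latency-sensitive task
    print("click handled")

# A small pool reserved for latency-sensitive work (user actions)...
interactive_pool = ThreadPoolExecutor(max_workers=2)
# ...and a separate pool for background crunching, so background work
# can never starve the interactive one.
background_pool = ThreadPoolExecutor(max_workers=6)

background_pool.submit(long_batch_job)
interactive_pool.submit(handle_user_click)   # stays responsive regardless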

What happens to threads in python when there is no .join()?

Suppose we have multi-threaded Python code that looks like this:
import threading
import time
def short_task():
print 'Hey!'
for x in range(10000):
t = threading.Thread(target=short_task)
t.daemon = True # All non-daemon threads will be ".join()"'ed when main thread dies, so we mark this one as daemon
t.start()
time.sleep(100)
Are there any side effects of using a similar approach in long-running applications (e.g. Django+uwsgi)? Like no garbage collection, extra memory consumption, etc.?
What I am trying to do is some costly logging (urlopen() to an external API URL) without blocking the main thread. Spawning unlimited new threads with no .join() looks like the best possible approach here, but maybe I am wrong?
Not a 100% confident answer, but since nobody else has weighed in...
I can't find any place in the Python documentation that says you must join threads. Python's threading model looks Java-like to me: In Java t.join() means "wait for t to die," but it does not mean anything else. In particular, t.join() does not do anything to thread t.
I'm not an expert, but it looks like the same is true in Python.
Are there any side-effects...Like...extra memory consumption
Every Python thread must have its own, fixed-size call stack, and the threading module documentation says that the minimum size of a stack is 32K bytes. If you create ten thousand of those, like in your code snippet, and if they all manage to exist at the same time, then just the stacks alone are going to occupy 320 megabytes of real memory.
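If you genuinely need many threads, the per-thread stack reservation can be tuned down with Python's threading.stack_size; a minimal sketch (the exact minimum and allowed granularity are platform-dependent):

import threading

# Applies to threads created after this call; the size must be at least
# 32 KiB and, on some platforms, a multiple of 4 KiB.
threading.stack_size(64 * 1024)

t = threading.Thread(target=lambda: None)   # this thread gets the smaller stack
t.start()
t.join()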
It's unusual to find a good reason for a program to have that many simultaneous threads.
If you're expecting those threads to die so quickly that there are never more than a few of them living at the same time, then you could probably improve the performance of your program by using a thread pool. A thread pool is an object that manages a small number of worker threads and a blocking queue of tasks (i.e., function objects). Each worker sits in a loop, picking tasks from the queue and performing them.
A program that uses a thread pool effectively re-uses its worker threads instead of continually letting threads die and creating new ones to replace them.
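Applied to the asker's logging scenario, a minimal sketch using the standard-library concurrent.futures pool (the URL is a hypothetical placeholder):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

log_pool = ThreadPoolExecutor(max_workers=4)   # a few reusable workers, not 10,000 threads

def send_log(url):
    try:
        urlopen(url, timeout=5).read()
    except OSError:
        pass   # best-effort logging must never take down the main flow

# Fire-and-forget from the main thread; submit() returns immediately:
log_pool.submit(send_log, "https://logging.example.invalid/event")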

Is it bad practice to just kick off new threads for blocking operations (Perl)

If doing CPU-intensive tasks, I believe it is optimal to have one thread per core. If you have a 4-core CPU, you can run 4 instances of a CPU-intensive subroutine without any penalty. For example, I once experimentally ran four instances of a CPU-intensive algorithm on a four-core CPU: up to four instances, the time taken per instance did not increase. At the fifth instance, all instances took longer.
What is the case for blocking operations? Let's say I have a list of 1,000 URLs. I have been doing the following:
(Please don't mind any syntax errors, I just mocked this up)
my @threads;
foreach my $url (@urlList) {
    push @threads, async {
        my $response = $ua->get($url);
        return $response->content;
    };
}

foreach my $thread (@threads) {
    my $response = $thread->join;
    do_stuff($response);
}
I am essentially kicking off as many threads as there are URLs in the URL list. If there are a million URLs, then a million threads will be kicked off. Is this optimal? If not, what is an optimal number of threads? Is using threads a good practice for ANY blocking I/O operation that can wait (reading a file, database queries, etc.)?
Related Bonus Question
Out of curiosity, do Perl threads work the same as Python threads with their GIL? With Python, to get the benefit of multithreading and utilize all cores for CPU-intensive tasks, you have to use multiprocessing.
Out of curiosity, do Perl threads work the same as Python threads with their GIL? With Python, to get the benefit of multithreading and utilize all cores for CPU-intensive tasks, you have to use multiprocessing.
No, but the conclusion is the same. Perl doesn't have a big lock protecting the interpreter across threads; instead it has a duplicate interpreter for each different thread. Since a variable belongs to an interpreter (and only one interpreter), no data is shared by default between threads. When variables are explicitly shared they're placed in a shared interpreter which serializes all accesses to shared variables on behalf of the other threads. In addition to the memory issues mentioned by others here, there are also some serious performance issues with threads in Perl, as well as limitations on the kind of data that can be shared and what you can do with it (see perlthrtut for more info).
The upshot is, if you need to parallelize a lot of IO and you can make it non-blocking, you'll get a lot more performance out of an event-loop model than out of threads. If you need to parallelize stuff that can't be made non-blocking, you'll probably have a lot more luck with multi-process than with Perl threads (and once you're familiar with that kind of code, it's also easier to debug).
It's also possible to combine the two models (for example, a mostly-single-process evented app that passes off certain expensive work to child processes using POE::Wheel::Run or AnyEvent::Run, or a multi-process app that has an evented parent managing non-evented children, or a Node Cluster type setup where you have a number of preforked evented webservers, with a parent that just accepts and passes FDs to its children).
There are no silver bullets, though, at least not yet.
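To make the event-loop model mentioned above concrete, here is a rough sketch in Python's asyncio (the Perl-world equivalents would be POE or AnyEvent); a bare-bones HTTP GET over a plain socket, just to show many waits sharing one thread:

import asyncio

async def fetch(host, path="/"):
    reader, writer = await asyncio.open_connection(host, 80)
    writer.write(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    await writer.drain()
    body = await reader.read()   # while this waits, the other fetches make progress
    writer.close()
    await writer.wait_closed()
    return body

async def main(hosts):
    # One OS thread, many concurrent requests: no per-URL thread needed.
    return await asyncio.gather(*(fetch(h) for h in hosts))

asyncio.run(main(["example.com", "example.org"]))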
From here:
http://perldoc.perl.org/threads.html
Memory consumption
On most systems, frequent and continual creation and destruction of threads can lead to ever-increasing growth in the memory footprint of the Perl interpreter. While it is simple to just launch threads and then ->join() or ->detach() them, for long-lived applications, it is better to maintain a pool of threads, and to reuse them for the work needed, using queues to notify threads of pending work. The CPAN distribution of this module contains a simple example (examples/pool_reuse.pl) illustrating the creation, use and monitoring of a pool of reusable threads.
Let's look at your code. I see three problems with it:
Easy one first: You use ->content instead of ->decoded_content(charset => 'none').
->content returns the raw HTML response body which is useless without information in the headers to decode it (e.g. it might be gzipped). It works sometimes.
->decoded_content(charset => 'none') gives you the actual response. It works always.
You process responses in the order requests were made. That means you could be blocked while responses are waiting to be serviced.
The simplest solution is to place the responses in a Thread::Queue::Any object.
use threads;
use Thread::Queue::Any qw( );

my $q = Thread::Queue::Any->new();

my $requests = 0;
for my $url (@urls) {
    ++$requests;
    async {
        ...
        $q->enqueue($response);
    };
}

while ($requests) {
    my $response = $q->dequeue();
    --$requests;
    $_->join for threads->list(threads::joinable);
    ...
}

$_->join for threads->list();
You create a lot of threads that are only used once.
There is a significant amount of overhead to that approach. A common multithreading practice is to create a pool of persistent worker threads. These workers perform whatever job needs to be done, then move on to the next job rather than exiting. Jobs are submitted to the pool rather than to a specific thread, so that a job can be started as soon as any worker is free. In addition to removing thread-creation overhead, this allows the number of threads running at a time to be controlled. It is great for CPU-bound tasks.
However, your needs are different since you're using threads to do asynchronous IO. The CPU overhead of thread creation doesn't impact you as much (though it may impose a startup lag). Memory is fairly cheap, but you're still using far more than you need. Threads are really not ideal for this task.
There are much better systems for doing asynchronous IO, but they are not necessarily easily available from Perl. In your specific case, though, you're much better off avoiding threads and going with Net::Curl::Multi. Follow the example in its synopsis, and you'll get a very fast engine capable of making parallel web requests with very little overhead.
At my former employer, we switched to Net::Curl::Multi without problems for a high-load, mission-critical web site, and we love it.
It's easy to create a wrapper that creates HTTP::Response objects if you want to limit changes to surrounding code. (This was the case for us.) Note that it helps to have the documentation of the underlying library (libcurl) handy, since the Perl code is a thin layer over it, the documentation is very good, and it documents all the options you can provide.
You might simply want to consider a non-blocking user agent. I like Mojo::UserAgent which is part of the Mojolicious suite. You might want to look at an example that I mocked up for a non-blocking crawler for another question.

Multithreading thread control

How do I control the number of threads that my program is working on?
I have a program that is now ready for multithreading, but one problem is that it is extremely memory-intensive, and I have to limit the number of threads running so that I don't run out of RAM. The main program goes through and creates a whole bunch of handles and associated threads in a suspended state.
I want the program to activate a set number of threads and, when one thread finishes, automatically unsuspend the next thread in line until all the work has been completed. How do I do this?
Someone once mentioned something about using a thread handler, but I can't seem to find any information about how to write one or exactly how it would work.
If anyone can help, it would be greatly appreciated.
Using Windows and Visual C++.
Note: I don't need to worry about the traditional problems of shared access between the threads; each one is completely independent of the others. It's more like batch processing than true multithreading of a program.
Thanks,
-Faken
Don't create threads explicitly. Create a thread pool (see Thread Pools) and queue up your work using QueueUserWorkItem. The thread pool size should be determined by the number of hardware threads available (the number of cores and the hyperthreading ratio) and by the ratio of CPU to IO that your work items do. By controlling the size of the thread pool, you control the maximum number of concurrent threads.
A suspended thread doesn't use CPU resources, but it still consumes memory, so you really shouldn't be creating more threads than you want to run simultaneously.
It is better to have only as many threads as your maximum number of simultaneous tasks, and to use a queue to pass units of work to the pool of worker threads.
You can give work to the standard pool of threads created by Windows using the Windows Thread Pool API.
Be aware that you will share these threads and the queue used to submit work to them with all of the code in your process. If, for some reason, you don't want to share your worker threads with other code in your process, then you can create a FIFO queue, create as many threads as you want to run simultaneously and have each of them pull work items out of the queue. If the queue is empty they will block until work items are added to the queue.
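The pattern itself is language-agnostic; a minimal sketch in Python with a hypothetical process_item() task:

import queue
import threading

NUM_WORKERS = 4           # as many threads as tasks you want running at once
work_q = queue.Queue()    # thread-safe FIFO queue

def process_item(item):   # hypothetical: one unit of work
    print("processed", item)

def worker():
    while True:
        item = work_q.get()   # blocks, consuming no CPU, while the queue is empty
        if item is None:      # sentinel value: time to shut down
            break
        process_item(item)

workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in workers:
    t.start()

for n in range(20):           # submit work items
    work_q.put(n)
for _ in workers:             # one sentinel per worker, then wait for them
    work_q.put(None)
for t in workers:
    t.join()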
There is so much to say here.
There are a few ways to do this.
You should only create as many thread handles as you plan on running at the same time, then reuse them when they complete. (Look up thread pool).
This guarantees that you can never have too many running at the same time. It raises the question of finding out when a thread completes. You can have a callback invoked just before a thread terminates, where a parameter in that callback is the thread handle that just finished; use Boost.Bind and Boost.Signals for that. When the callback is called, look for another task for that thread handle and restart the thread. That way, all you have to do is add to the "tasks to do" list, and the callback will remove the tasks for you. No polling needed, and no worries about too many threads.
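The Boost-based callback scheme described above can be sketched in Python, where a future's done-callback hands the freed worker its next job (the task list here is a hypothetical stand-in):

import queue
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)
todo = queue.Queue()
for n in range(20):                     # hypothetical "tasks to do" list
    todo.put(lambda n=n: print("task", n))

def on_done(_future):
    # Fires as soon as a task finishes; the worker is back in the pool,
    # so we immediately hand over the next task instead of polling.
    try:
        pool.submit(todo.get_nowait()).add_done_callback(on_done)
    except queue.Empty:
        pass                            # nothing left to do

for _ in range(4):                      # prime one task per worker
    pool.submit(todo.get_nowait()).add_done_callback(on_done)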

Thread Pool vs Thread Spawning

Can someone list some comparison points between Thread Spawning vs Thread Pooling, which one is better? Please consider the .NET framework as a reference implementation that supports both.
Thread pool threads are much cheaper than a regular Thread; they pool the system resources required for threads. But they have a number of limitations that may make them unfit:
You cannot abort a threadpool thread
There is no easy way to detect that a threadpool thread has completed; there is no Thread.Join()
There is no easy way to marshal exceptions from a threadpool thread
You cannot display any kind of UI on a threadpool thread beyond a message box
A threadpool thread should not run longer than a few seconds
A threadpool thread should not block for a long time
The latter two constraints are a side effect of the threadpool scheduler: it tries to limit the number of active threads to the number of cores your CPU has available. This can cause long delays if you schedule many long-running threads that block often.
Many other threadpool implementations have similar constraints, give or take.
A "pool" contains a list of available "threads" ready to be used whereas "spawning" refers to actually creating a new thread.
The usefulness of "Thread Pooling" lies in "lower time-to-use": creation time overhead is avoided.
In terms of "which one is better": it depends. If the creation-time overhead is a problem, use thread pooling. This is a common problem in environments where lots of "short-lived tasks" need to be performed.
As pointed out by other folks, there is a "management overhead" for Thread-Pooling: this is minimal if properly implemented. E.g. limiting the number of threads in the pool is trivial.
For some definition of "better", you generally want to go with a thread pool. Without knowing what your use case is, consider that with a thread pool, you have a fixed number of threads which can all be created at startup or can be created on demand (but the number of threads cannot exceed the size of the pool). If a task is submitted and no thread is available, it is put into a queue until there is a thread free to handle it.
If you are spawning threads in response to requests or some other kind of trigger, you run the risk of depleting all your resources, as there is nothing to cap the number of threads created.
Another benefit to thread pooling is reuse - the same threads are used over and over to handle different tasks, rather than having to create a new thread each time.
As pointed out by others, if you have a small number of tasks that will run for a long time, this would negate the benefits gained by avoiding frequent thread creation (since you would not need to create a ton of threads anyway).
My feeling is that you should start just by creating a thread as needed... If the performance of this is OK, then you're done. If at some point, you detect that you need lower latency around thread creation you can generally drop in a thread pool without breaking anything...
All depends on your scenario. Creating new threads is resource intensive and an expensive operation. Most very short asynchronous operations (less than a few seconds max) could make use of the thread pool.
For longer running operations that you want to run in the background, you'd typically create (spawn) your own thread. (Ab)using a platform/runtime built-in threadpool for long running operations could lead to nasty forms of deadlocks etc.
Thread pooling is usually considered better, because the threads are created up front, and used as required. Therefore, if you are using a lot of threads for relatively short tasks, it can be a lot faster. This is because they are saved for future use and are not destroyed and later re-created.
In contrast, if you only need 2-3 threads and they will only be created once, then this will be better. This is because you do not gain from caching existing threads for future use, and you are not creating extra threads which might not be used.
It depends on what you want to execute on the other thread.
For short tasks, it is better to use a thread pool; for long tasks, it may be better to spawn a new thread, as a long task could starve the thread pool and hold up other tasks.
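A minimal sketch of that rule of thumb in Python (long_running_service is a hypothetical long-lived task):

import threading
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=8)   # short, frequent tasks share these threads

def long_running_service():                # hypothetical: lives for the program's lifetime
    while True:
        ...                                # e.g. consume a queue, poll a device

# Long task: a dedicated thread, so it cannot occupy a pool slot forever.
threading.Thread(target=long_running_service, daemon=True).start()

# Short tasks: reuse pooled threads instead of paying creation cost each time.
for chunk in range(100):
    pool.submit(print, "processed chunk", chunk)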
The main difference is that a ThreadPool maintains a set of threads that are already spun-up and available for use, because starting a new thread can be expensive processor-wise.
Note however that even a ThreadPool needs to "spawn" threads... it usually depends on workload - if there is a lot of work to be done, a good threadpool will spin up new threads to handle the load based on configuration and system resources.
There is a little extra time required for creating/spawning a thread, whereas a thread pool already contains created threads that are ready to be used.
This answer is a good summary but just in case, here is the link to Wikipedia:
http://en.wikipedia.org/wiki/Thread_pool_pattern
For multithreaded execution combined with getting return values from the execution, or for an easy way to detect that a threadpool task has completed, Java Callables could be used.
See https://blogs.oracle.com/CoreJavaTechTips/entry/get_netbeans_6 for more info.
Assuming C# and Windows 7 and up...
When you create a thread using new Thread(), you create a managed thread that becomes backed by a native OS thread when you call Start: a one-to-one relationship. It is important to know that only one thread runs on a CPU core at any given time.
An easier way is to call ThreadPool.QueueUserWorkItem (i.e. background thread), which in essence does the same thing, except those background threads aren’t forever tied to a single native thread. The .NET scheduler will simulate multitasking between managed threads on a single native thread. With say 4 cores, you’ll have 4 native threads each running multiple managed threads, determined by .NET. This offers lighter-weight multitasking since switching between managed threads happens within the .NET VM not in the kernel. There is some overhead associated with crossing from user mode to kernel mode, and the .NET scheduler minimizes such crossing.
It may be important to note that heavy multitasking might benefit from pure native OS threads in a well-designed multithreading framework. However, the performance benefits aren’t that much.
When using the ThreadPool, just make sure the minimum worker thread count is high enough, or ThreadPool.QueueUserWorkItem will be slower than new Thread(). In a benchmark that looped 512 times, calling new Thread() left ThreadPool.QueueUserWorkItem in the dust with the default minimums. However, first setting the minimum worker thread count to 512 made new Thread() and ThreadPool.QueueUserWorkItem perform similarly in this test.
A side effect of setting a high worker thread count is that new Task() (or Task.Factory.StartNew) also performed similarly to new Thread() and ThreadPool.QueueUserWorkItem.
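The same thread-creation-cost effect can be measured outside .NET; here is a rough Python timing sketch (absolute numbers vary widely by machine and runtime):

import threading
import time
from concurrent.futures import ThreadPoolExecutor

def tiny_job():
    pass

# Thread-per-task: pay the creation cost 512 times.
t0 = time.perf_counter()
threads = [threading.Thread(target=tiny_job) for _ in range(512)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("thread-per-task:", time.perf_counter() - t0)

# Pool: a handful of threads created once, then reused 512 times.
with ThreadPoolExecutor(max_workers=8) as pool:
    t0 = time.perf_counter()
    futures = [pool.submit(tiny_job) for _ in range(512)]
    for f in futures:
        f.result()
    print("pool:", time.perf_counter() - t0)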
