Suppose we have multi-threaded Python code which looks like this:
import threading
import time

def short_task():
    print('Hey!')

for x in range(10000):
    t = threading.Thread(target=short_task)
    t.daemon = True  # non-daemon threads are implicitly joined when the main thread exits, so we mark this one as daemon
    t.start()

time.sleep(100)
Are there any side-effects from using a similar approach in long-running applications (e.g. Django+uwsgi)? Like no garbage collection, extra memory consumption, etc.?
What I am trying to do is perform some costly logging (urlopen() to an external API URL) without blocking the main thread. Spawning endless new threads with no .join() looks like the best possible approach here, but maybe I am wrong?
Not a 100% confident answer, but since nobody else has weighed in...
I can't find any place in the Python documentation that says you must join threads. Python's threading model looks Java-like to me: In Java t.join() means "wait for t to die," but it does not mean anything else. In particular, t.join() does not do anything to thread t.
I'm not an expert, but it looks like the same is true in Python.
Are there any side-effects...Like...extra memory consumption
Every Python thread must have its own fixed-size call stack, and the threading module documentation says the minimum stack size is 32 KiB. If you create ten thousand of those, as in your code snippet, and they all manage to exist at the same time, then the stacks alone are going to occupy roughly 320 megabytes of real memory.
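As a side note (my own sketch, not part of the original question or answer), Python lets you lower the per-thread stack size before spawning threads, which can soften that memory cost:

import threading

# Request a smaller stack for threads created after this call.
# 32 KiB is the documented minimum; some platforms also require a
# multiple of 4 KiB, so 64 KiB is a reasonably safe choice.
threading.stack_size(64 * 1024)

t = threading.Thread(target=lambda: None)
t.start()
t.join()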
It's unusual to find a good reason for a program to have that many simultaneous threads.
If you're expecting those threads to die so quickly that there are never more than a few of them living at the same time, then you probably could improve the performance of your program by using a thread pool. A thread pool is an object that manages a small number of worker threads and a blocking queue of tasks (i.e., function objects). Each worker sits in a loop, picking tasks from the queue and performing them.
A program that uses a thread pool effectively re-uses its worker threads instead of continually letting threads die and creating new ones to replace them.
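For the asker's specific goal (non-blocking logging via urlopen()), a small pool could look roughly like the sketch below. This is my own illustration, not part of the answer; log_to_api() and API_URL are hypothetical stand-ins.

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

API_URL = 'https://example.com/log'  # hypothetical logging endpoint

def log_to_api(message):
    # The blocking network call happens on a worker thread.
    urlopen(API_URL, data=message.encode('utf-8'))

executor = ThreadPoolExecutor(max_workers=4)  # a handful of reusable workers

def log_async(message):
    executor.submit(log_to_api, message)  # returns immediately; the task is queued

The pool caps memory use at a few stacks and reuses its threads instead of creating ten thousand of them.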
Related
https://code.tutsplus.com/articles/introduction-to-parallel-and-concurrent-programming-in-python--cms-28612
I have studied this link, and I have a few questions:
Q1: How are the thread pool (concurrent) and threading versions different here? Why do we see the performance improvement? The threading-with-queue version has 4 threads, each of which runs cooperatively during idle time and picks an item from the queue once it gets a website response. As I see it, the thread pool is doing much the same thing: completing its work and waiting for the manager to assign a task, which is very similar to picking a new item from the queue. I'm not sure how this is different and why I see the performance improvement. It seems I'm wrong in interpreting the pooling here. Could you explain?
Q2: Using multiprocessing, the time taken is longer. If I have a multiprocessor that can handle multiple processes at a time, then all my 4 processes should be handled by it at once; that is where the real parallelization happens. I also have a question here: in such a case, since the 4 processes run the same function, doesn't the GIL try to stop them from executing the same piece of code? Let's suppose all of them share a common variable that gets updated, like the number of websites checked. So how does the GIL work in these cases of multiprocessing?
Also, are the same processes used again and again, or do they get killed and created every time after their job? I think the same processes are used. I also think the performance problem comes from process creation, which is costly compared to the lightweight threads in the concurrent threading phase. So could you explain in more detail how the GIL works here and how the processes run: are they running cooperatively (each process waiting for its turn, like threads within a process do), or are these processes using the multiple processors to run truly in parallel? My other question is: if I have an 8-core machine, I think I can run 8 threads of the same process simultaneously or in parallel. On that 8-core machine, can I run 2 processes with 4 threads each? Can I run 8 processes on 8 cores? I think cores are only for threads of a process, which would mean I can't run 8 processes on 8 cores, but I can run as many processes as there are CPUs or multiprocessors in my system; am I right? So can I run 2 processes with 4 threads each on my 8-core machine with 2 multiprocessors, each processor having 4 cores?
Python has a rich set of libraries for multitasking with processes and threads. However, there is overlap between the libraries, and the choice depends on how abstractly you view your computational tasks. For example, the concurrent.futures library views threads as asynchronous tasks, while the threading library deals with them as high-level threads. Further down, the _thread module implements a low-level interface for threading, exposing all the synchronization mechanisms.
The GIL (Global Interpreter Lock) is just a synchronization primitive, specifically a mutex, which prevents multiple threads of the same process from executing Python bytecode at the same time (for certain objects which need to remain consistent under concurrent operations). This is exactly why Python threads excel at I/O-bound work but not at compute-intensive tasks: the GIL is released around certain blocking calls and inside some computationally intensive libraries such as numpy. Note that only the CPython and PyPy implementations of Python are constrained by the GIL mechanism.
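To make that concrete, here is a tiny sketch of my own (not from the answer above) showing that threads blocked on an I/O-style wait overlap even under the GIL:

import threading
import time

def io_task():
    time.sleep(1)  # stands in for a blocking call during which the GIL is released

start = time.time()
threads = [threading.Thread(target=io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(time.time() - start)  # roughly 1 second, not 4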
Now, let's see those questions...
How thread pool (Concurrent) and threading are different here? Why do we see the performance improvement?
Coming to the comparison between Threading and concurrent.futures.ThreadPoolExecutor (aka threading_squirrel vs future_squirrel), I've executed both programs with the same test case. There are two factors that contribute to this "performance improvement":
Network HEAD requests: Remember that network operations need not complete in the same time period every time you execute them... due to the very nature of packet transfer delays...
Order of thread execution: In the website you've linked, the author creates all threads initially, sets up the queue full of website links and then starts all of them in a list comprehension loop. In concurrent.futures.ThreadPoolExecutor, each time a task is submitted, a thread is assigned to it if the predefined maximum number of threads/workers has not been reached. I've changed the code to mirror this technique. It seems to give a speedup, as the first thread begins work early on and doesn't need to wait for the queue to be filled up...
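A rough sketch of that submit-as-you-go pattern (my approximation, not the tutorial's exact code; urls is assumed to be the list of website links):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

def check_website(url):
    # HEAD request: we only care whether the site responds, not about the body.
    return urlopen(Request(url, method='HEAD')).status

with ThreadPoolExecutor(max_workers=4) as executor:
    # Each submit() hands the task to an idle worker (starting one if fewer
    # than max_workers exist); no queue has to be filled up front.
    futures = [executor.submit(check_website, url) for url in urls]
    results = [f.result() for f in futures]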
How does GIL work in these cases of multiprocessing?
Remember that the GIL comes into effect only for threads within a process, not between processes. The GIL locks up the whole interpreter during a thread's execution of bytecode, so the other threads have to wait for their turn. This is the reason multiprocessing uses processes instead of threads: each process has its own interpreter and, consequently, its own GIL.
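As a minimal illustration of per-process GILs (my own sketch, not from the answer), CPU-bound work can scale across processes where threads could not:

from multiprocessing import Pool

def crunch(n):
    # CPU-bound work; each worker process has its own interpreter and GIL.
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        print(pool.map(crunch, [10**6] * 4))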
Are the same processes used again and again or they get killed and created every time after their job?
The concept of pooling is to reduce the overhead of creating and destroying workers (be they threads or processes) during computation. However, the processes are kind of "brand new" in the sense that the library effectively asks the OS to perform a fork on a UNIX-based OS or a spawn on an NT-based OS...
Also, are the processes running co-operatively?
Maybe. They have to run in co-operation if they use shared memory (they need not be running at the same time). There is definitely going to be a context switch if there are more processes than the OS can allocate to its processors' cores. They can run in parallel if there are no shared-memory updates to make.
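For the shared "number of websites checked" counter from the question, a hedged sketch of explicit shared memory between processes might look like this (multiprocessing.Value is my choice of mechanism, not something prescribed by the answer):

from multiprocessing import Process, Value

def check_site(counter):
    # ...check a website...
    with counter.get_lock():   # an explicit lock; the GIL does not span processes
        counter.value += 1

if __name__ == '__main__':
    checked = Value('i', 0)
    procs = [Process(target=check_site, args=(checked,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(checked.value)  # 4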
If I have the 8 core machine can I run 2 processes with 4 threads each? Can I run 8 processes on 8 cores?
Sure (subject to the GIL, in Python). Each process can be allocated to a processing unit for execution. A processing unit can be a physical or a virtual core of a CPU. As long as the OS scheduler supports it, it's possible. Any reasonable split of processes and threads is possible. If all of them are allocatable, that's the best situation; otherwise you will encounter context switches (which are more expensive when it comes to processes).
Hope I've answered all those questions!
Here are a few resources:
MultiCore CPUs, Multithreading and context switching?
Why does multiprocessing use only a single core after I import numpy?
Bonus celery-squirrel resource
Short version: How do I release multiple locks from a single thread, without being preempted halfway through?
I have a program which is designed to run on an N-core machine. It consists of one main thread and N worker threads. Each thread (including the main thread) has a semaphore it can block on. Normally, each worker thread is blocked on decrementing its semaphore, and the main thread is running. Every now and then, though, the main thread should wake up the worker threads to do their thing for a certain amount of time, then block on its own semaphore waiting for them all to go back to sleep. Like so:
def main_thread(n):
    for i = 1 to n:
        worker_semaphore[i] = semaphore(0)
        spawn_thread(worker_thread, i)
    main_semaphore = semaphore(0)
    while True:
        ...do some work...
        workers_to_wake = foo()
        for i in workers_to_wake:
            worker_semaphore[i].increment() # wake up worker i
        for i in workers_to_wake:
            main_semaphore.decrement() # wait for all workers

def worker_thread(i):
    while True:
        worker_semaphore[i].decrement() # wait to be woken
        ...do some work...
        main_semaphore.increment() # report done with step
All well and good. The problem is, one of the woken workers may end up preempting the main thread halfway through waking the workers: This can happen, for instance, when the Windows scheduler decides to boost that worker's priority. This doesn't lead to deadlock, but it is inefficient, because the remainder of the threads stay asleep until the preempting worker finishes its work. It's basically priority inversion, with the main thread waiting on one of the workers, and some of the worker threads waiting on the main thread.
I can probably figure out OS- and scheduler-specific hacks for this, such as disabling priority boosting under Windows, and fiddling about with thread priorities and processor affinities, but I'd like something cross-platform-ish and robust and clean. So: How can I wake up a bunch of threads atomically?
TL;DR
If you really have to get as much as you can out of your workers, just use an event semaphore, a control block and a barrier instead of your semaphores. Note however, that this is a more fragile solution and so you need to balance any potential gains against this downside.
Context
First I need to summarize the broader context in our discussion...
You have a Windows graphical application. It has a desired frame rate and so you need the main thread to run at that rate, scheduling all your workers at precisely timed intervals so that they have completed their work within the refresh interval. This means you have very tight constraints on the start and execution times for each thread. In addition, your worker threads are not all identical, so you can't just use a single work queue.
The problem
Like any modern operating system, Windows has a variety of synchronization primitives. However, none of these directly provides a mechanism for notifying multiple primitives at once. Looking through other operating systems, I see a similar pattern; they all provide ways of waiting on multiple primitives, but none provide an atomic way of triggering them.
So what can we do instead? The problems you need to solve are:
Precisely timing the start of all required workers.
Prodding the workers that actually need to run in the next frame.
Options
The most obvious solution for issue 1 is just to use a single event semaphore, but you could also use a read/write lock (by acquiring the write lock after the workers have finished and getting the workers to use a read lock). All other options are no longer atomic and so will need further synchronization to force the threads to do what you want - like lossleader's suggestion for locks inside your semaphores.
But we want an optimal solution that reduces context switches as much as possible due to the tight time constraints on your application, so let's see if either of these can be used to solve problem 2... How can you pick which worker threads should run from the main if we just have an event semaphore or read/write lock?
Well... A read/write lock is a great way for one thread to write some critical data to a control block and for many others to read from it. Why not just have a simple array of boolean flags (one for each worker thread) that your main thread updates each frame? Sadly you still need to stop execution of the workers until the timer pops. In short we're back at the semaphore and lock solution again.
However, owing to the nature of your application, you can make one more step. You can rely on the fact that you know your workers are not running outside of your time slicing and use an event semaphore as a crude form of lock instead.
A final optimization (if your environment supports them), is to use a barrier instead of the main semaphore. You know that all n threads need to be idle before you can continue, so just insist on it.
A solution
Applying the above, your pseudo-code would then look something like this:
def main_thread(n):
    main_event = event()
    for i = 1 to n:
        worker_scheduled[i] = False
        spawn_thread(worker_thread, i)
    main_barrier = barrier(n+1)
    while True:
        ...do some work...
        workers_to_wake = foo()
        for i in workers_to_wake:
            worker_scheduled[i] = True
        main_event.set()
        main_barrier.enter() # wait for all workers
        main_event.reset()

def worker_thread(i):
    while True:
        main_event.wait()
        if worker_scheduled[i]:
            worker_scheduled[i] = False
            ...do some work...
        main_barrier.enter() # report finished for this frame
        main_event.reset() # to catch the case that a worker is scheduled before the main thread
Since there is no explicit policing of the worker_scheduled array, this is a much more fragile solution.
I would therefore personally only use it if I had to squeeze every last ounce of processing out of my CPU, but it sounds like that is exactly what you are looking for.
It is not possible to do this atomically when you use multiple synchronization objects (semaphores), because the wake-up algorithm then has O(n) complexity. There are a few ways to solve it, though.
release all at once
I'm not sure whether Python has the necessary method (is your question Python-specific?), but in general, semaphores have operations whose argument specifies the number to increment/decrement by. Thus, you just put all your threads on the same semaphore and wake them all together. A similar approach is to use a condition variable and notify all.
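In Python terms, one way to sketch the "wake them all together" idea is a condition variable with notify_all(); the wake_generation counter below is my own addition to avoid missed wakeups, not something from the answer above:

import threading

cond = threading.Condition()
wake_generation = 0

def worker():
    seen = 0
    while True:
        with cond:
            while wake_generation == seen:
                cond.wait()        # every worker blocks on the same condition
            seen = wake_generation
        pass  # ...do some work...

def wake_all_workers():
    global wake_generation
    with cond:
        wake_generation += 1
        cond.notify_all()          # a single call releases all waiting workers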
event loops
If you still want to be able to control each thread individually but like the approach of one-to-many notification, try libraries for asynchronous I/O like libuv (and its Python counterpart). Here, you can create one single event that wakes all the threads at once and also create an individual event for each thread, then just wait on both (or more) event objects in event loops in each thread.
Another library is pevents, which implements WaitForMultipleObjects on top of pthreads' condition variables.
delegate waking up
Another approach is to replace your O(n) algorithm with a tree-like algorithm (O(log n)) where each thread wakes up only a fixed number of other threads but delegates to them the job of waking up the rest. In the edge case, the main thread wakes up only one other thread, which then wakes everyone else or starts the chain reaction. This can be useful if you want to reduce latency for the main thread at the expense of the wake-up latencies of the other threads.
Reader/Writer Lock
The solution I would normally use on POSIX systems for a one-to-many relationship is a reader/writer lock. It is a surprise to me that they aren't completely universal, but most languages either implement a version or at least have a package available to implement them on whatever primitives exist; for example, Python's prwlock:
from prwlock import RWLock

def main_thread(n):
    for i = 1 to n:
        worker_semaphore[i] = semaphore(0)
        spawn_thread(worker_thread, i)
    main_lock = RWLock()
    while True:
        main_lock.acquire_write()
        ...do some work...
        workers_to_wake = foo()
        # The above acquire could be moved as low as here,
        # depending on how independent the above processing is.
        for i in workers_to_wake:
            worker_semaphore[i].increment() # wake up worker i
        main_lock.release()

def worker_thread(i):
    while True:
        worker_semaphore[i].decrement() # wait to be woken
        main_lock.acquire_read()
        ...do some work...
        main_lock.release() # report done with step
Barriers
Barriers seem like Python's closest intended built-in mechanism to hold up all the threads until they are all alerted, but:
They are a pretty unusual solution, so they would make your code/experience harder to translate to other languages.
I wouldn't like to use them for this case where the number of threads to wake keeps changing. Given that your n sounds small, I would be tempted to use constant Barrier(n) and notify all threads to check if they are running this cycle. But:
I would be concerned that using a barrier would backfire since any of the threads being held up by something external will hold them all up and even a scheduler with resource dependency boosting might miss this relationship. Needing all n to reach the barrier could only make this worse.
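For reference, a bare-bones sketch of Python's built-in Barrier (my own illustration of the mechanism, not a drop-in for the asker's design):

import threading

N = 4
barrier = threading.Barrier(N + 1)  # n workers plus the main thread

def worker(i):
    pass  # ...do some work...
    barrier.wait()                  # block until all parties have arrived

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
barrier.wait()                      # the main thread joins the rendezvous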
Peter Brittain's solution, plus Anton's suggestion of a "tree-like wakeup", led me to another solution: chained wakeups. Basically, rather than the main thread doing all the wakeups, it only wakes up one thread; each thread is then responsible for waking up the next one. The elegant bit here is that there's only ever one suspended thread ready to run, so threads rarely end up switching cores. In fact, this works fine with strict processor affinities, even if one of the worker threads shares affinity with the main thread.
The other thing I did was to use an atomic counter that worker threads decrement before sleeping; that way, only the last one wakes the main thread, so there's also no chance of the main thread being woken several times just to do more semaphore waiting.
workers_to_wake = []
main_semaphore = semaphore(0)
num_woken_workers = atomic_integer()

def main_thread(n):
    for i = 1 to n:
        worker_semaphore[i] = semaphore(0)
        spawn_thread(worker_thread, i)
    main_semaphore = semaphore(0)
    while True:
        ...do some work...
        workers_to_wake = foo()
        num_woken_workers.atomic_set(len(workers_to_wake)) # set completion countdown
        one_to_wake = workers_to_wake.pop()
        worker_semaphore[one_to_wake].increment() # wake the first worker
        main_semaphore.decrement() # wait for all workers

def worker_thread(i):
    while True:
        worker_semaphore[i].decrement() # wait to be woken
        if workers_to_wake.len() > 0: # more pending wakeups
            one_to_wake = workers_to_wake.pop()
            worker_semaphore[one_to_wake].increment() # wake the next worker
        ...do some work...
        if num_woken_workers.atomic_decrement() == 0: # see whether we're the last one
            main_semaphore.increment() # report all done with step
I'm using the Thread module to do threading in Perl; I'd say I'm fairly new to threading.
I have a variable in my program called "max threads". If the number of threads falls below this number, the program spawns a new one. I'm using a while loop to compare the current number of existing threads to the maximum-threads variable.
I'm assuming that the while loop is the thing consuming my CPU.
Is there any way that I can have the 'boss' or 'manager' thread (the core thread) not take up as much CPU while arranging and managing threads? If my CPU usage is rising just because of the manager thread, then there's ultimately no point to threading at all!
If you want to keep the current model, you should have some kind of signal (probably a semaphore) on which the thread launcher can block when there are too many workers.
A much simpler model is to have a pool of workers and give them work via a Thread::Queue.
my $q = Thread::Queue->new();

my @workers;
for (1..$MAX_WORKERS) {
    push @workers, async {
        while (my $job = $q->dequeue()) {
            ...
        }
    };
}

for (...) {
    $q->enqueue(...);
}

# Time to exit
$q->enqueue(undef) for 0..$#workers;

# Wait for workers to finish.
$_->join() for @workers;
I don't use Perl, but speaking from a general asynchronous programming perspective, you want a thread pool manager that isn't clogging up the main thread, and this can be accomplished multiple ways. For one thing, you can dedicate a thread (yay!) to doing something like this (pseudocode):
while program not terminating:
    wait a quarter-second or so, then
    do your "are-there-enough-threads" check
The OS, or your abstracted run-time library, will generally supply some kind of wait function that halts the thread until a specific amount of time has passed (thus taking up no scheduler resource during that time).
Alternatively, if your program is event-driven (as in a GUI environment), you could do similar pool management off the main thread by posting yourself timer messages, which is another service generally supplied by the OS.
Perl threads are heavy-weight compared to other languages. They take a lot of resources to start; try to start all the threads you need up front and just keep them running. Starting new threads every time you have an asynchronous task to do will be very inefficient.
How do I control the number of threads that my program is working on?
I have a program that is now ready for multithreading, but one problem is that the program is extremely memory intensive, and I have to limit the number of threads running so that I don't run out of RAM. The main program goes through and creates a whole bunch of handles and associated threads in a suspended state.
I want the program to activate a set number of threads and, when one thread finishes, to automatically unsuspend the next thread in line until all the work has been completed. How do I do this?
Someone has once mentioned something about using a thread handler, but I can't seem to find any information about how to write one or exactly how it would work.
If anyone can help, it would be greatly appreciated.
Using Windows and Visual C++.
Note: I don't need to worry about the traditional problems of shared access between the threads; each one is completely independent of the others. It's more like batch processing rather than true multithreading of a program.
Thanks,
-Faken
Don't create threads explicitly. Create a thread pool, see Thread Pools and queue up your work using QueueUserWorkItem. The thread pool size should be determined by the number of hardware threads available (number of cores and ratio of hyperthreading) and the ratio of CPU vs. IO your work items do. By controlling the size of the thread pool you control the number of maximum concurrent threads.
A suspended thread doesn't use CPU resources, but it still consumes memory, so you really shouldn't be creating more threads than you want to run simultaneously.
It is better to have only as many threads as your maximum number of simultaneous tasks, and to use a queue to pass units of work to the pool of worker threads.
You can give work to the standard pool of threads created by Windows using the Windows Thread Pool API.
Be aware that you will share these threads and the queue used to submit work to them with all of the code in your process. If, for some reason, you don't want to share your worker threads with other code in your process, then you can create a FIFO queue, create as many threads as you want to run simultaneously and have each of them pull work items out of the queue. If the queue is empty they will block until work items are added to the queue.
There is so much to say here.
There are a few ways
You should only create as many thread handles as you plan on running at the same time, then reuse them when they complete. (Look up thread pool).
This guarantees that you can never have too many running at the same time. That raises the question of finding out when a thread completes. You can have a callback be called just before a thread terminates, where a parameter in that callback is the thread handle that just finished. Use Boost.Bind and Boost.Signals for that. When the callback is called, look for another task for that thread handle and restart the thread. That way all you have to do is add to the "tasks to do" list and the callback will remove the tasks for you. No polling needed, and no worries about too many threads.
Can someone list some comparison points between Thread Spawning vs Thread Pooling, which one is better? Please consider the .NET framework as a reference implementation that supports both.
Thread pool threads are much cheaper than a regular Thread; they pool the system resources required for threads. But they have a number of limitations that may make them unfit:
You cannot abort a threadpool thread
There is no easy way to detect that a threadpool thread has completed; there is no Thread.Join()
There is no easy way to marshal exceptions from a threadpool thread
You cannot display any kind of UI on a threadpool thread beyond a message box
A threadpool thread should not run longer than a few seconds
A threadpool thread should not block for a long time
The latter two constraints are a side-effect of the threadpool scheduler: it tries to limit the number of active threads to the number of cores your CPU has available. This can cause long delays if you schedule many long-running threads that block often.
Many other threadpool implementations have similar constraints, give or take.
A "pool" contains a list of available "threads" ready to be used whereas "spawning" refers to actually creating a new thread.
The usefulness of "Thread Pooling" lies in "lower time-to-use": creation time overhead is avoided.
In terms of "which one is better": it depends. If the creation-time overhead is a problem use Thread-pooling. This is a common problem in environments where lots of "short-lived tasks" need to be performed.
As pointed out by other folks, there is a "management overhead" for Thread-Pooling: this is minimal if properly implemented. E.g. limiting the number of threads in the pool is trivial.
For some definition of "better", you generally want to go with a thread pool. Without knowing what your use case is, consider that with a thread pool, you have a fixed number of threads which can all be created at startup or can be created on demand (but the number of threads cannot exceed the size of the pool). If a task is submitted and no thread is available, it is put into a queue until there is a thread free to handle it.
If you are spawning threads in response to requests or some other kind of trigger, you run the risk of depleting all your resources as there is nothing to cap the amount of threads created.
Another benefit to thread pooling is reuse - the same threads are used over and over to handle different tasks, rather than having to create a new thread each time.
As pointed out by others, if you have a small number of tasks that will run for a long time, this would negate the benefits gained by avoiding frequent thread creation (since you would not need to create a ton of threads anyway).
My feeling is that you should start just by creating a thread as needed... If the performance of this is OK, then you're done. If at some point, you detect that you need lower latency around thread creation you can generally drop in a thread pool without breaking anything...
All depends on your scenario. Creating new threads is resource intensive and an expensive operation. Most very short asynchronous operations (less than a few seconds max) could make use of the thread pool.
For longer running operations that you want to run in the background, you'd typically create (spawn) your own thread. (Ab)using a platform/runtime built-in threadpool for long running operations could lead to nasty forms of deadlocks etc.
Thread pooling is usually considered better, because the threads are created up front, and used as required. Therefore, if you are using a lot of threads for relatively short tasks, it can be a lot faster. This is because they are saved for future use and are not destroyed and later re-created.
In contrast, if you only need 2-3 threads and they will only be created once, then this will be better. This is because you do not gain from caching existing threads for future use, and you are not creating extra threads which might not be used.
It depends on what you want to execute on the other thread.
For short task it is better to use a thread pool, for long task it may be better to spawn a new thread as it could starve the thread pool for other tasks.
The main difference is that a ThreadPool maintains a set of threads that are already spun-up and available for use, because starting a new thread can be expensive processor-wise.
Note however that even a ThreadPool needs to "spawn" threads... it usually depends on workload - if there is a lot of work to be done, a good threadpool will spin up new threads to handle the load based on configuration and system resources.
There is little extra time required for creating/spawning a thread, whereas a thread pool already contains created threads that are ready to be used.
This answer is a good summary but just in case, here is the link to Wikipedia:
http://en.wikipedia.org/wiki/Thread_pool_pattern
For multithreaded execution combined with getting return values from the execution, or for an easy way to detect that a threadpool task has completed, Java Callables could be used.
See https://blogs.oracle.com/CoreJavaTechTips/entry/get_netbeans_6 for more info.
Assuming C# and Windows 7 and up...
When you create a thread using new Thread(), you create a managed thread that becomes backed by a native OS thread when you call Start: a one-to-one relationship. It is important to know that only one thread runs on a CPU core at any given time.
An easier way is to call ThreadPool.QueueUserWorkItem (i.e. background thread), which in essence does the same thing, except those background threads aren’t forever tied to a single native thread. The .NET scheduler will simulate multitasking between managed threads on a single native thread. With say 4 cores, you’ll have 4 native threads each running multiple managed threads, determined by .NET. This offers lighter-weight multitasking since switching between managed threads happens within the .NET VM not in the kernel. There is some overhead associated with crossing from user mode to kernel mode, and the .NET scheduler minimizes such crossing.
It may be important to note that heavy multitasking might benefit from pure native OS threads in a well-designed multithreading framework. However, the performance benefits aren’t that much.
When using the ThreadPool, just make sure the minimum worker thread count is high enough, or ThreadPool.QueueUserWorkItem will be slower than new Thread(). In a benchmark test looping 512 times, calling new Thread() left ThreadPool.QueueUserWorkItem in the dust with the default minimums. However, first setting the minimum worker thread count to 512 made new Thread() and ThreadPool.QueueUserWorkItem perform similarly in this test.
A side effect of setting a high worker thread count is that new Task() (or Task.Factory.StartNew) also performed similarly to new Thread() and ThreadPool.QueueUserWorkItem.