Asynchronous and load dependent thread/task management with TBB - multithreading

I'm trying Intel TBB for the first time and I'm stuck right at the beginning.
I've attached a simple image to show how I want to build my concurrent program.
I've taken a look at "Simplest TBB example" and "using TBB for non-parallel tasks".
TBB is nice, but I don't know how to handle the following problem: how can I define thread pools/task pools which are started or stopped depending on memory consumption? To be precise, if the memory consumption of some data class is too high, the thread pool filling it shall be stalled (e.g. no new spawns until the other threads have consumed the data inside the corresponding data class).
The result should be a CPU running all cores without memory overflow.
Are there any examples?
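For what it's worth, the stalling behaviour asked for here is usually called backpressure, and the simplest form of it is a bounded buffer between the filling threads and the consuming threads. Below is a minimal sketch in Python rather than TBB, assuming the data class can be modelled as a queue with a fixed capacity. The names `run_pipeline`, `capacity` and `n_consumers` are made up for illustration. In TBB itself, the token limit of `parallel_pipeline` or a flow graph `limiter_node` would probably be the place to look, but that mapping is an assumption and is not shown here.

```python
import queue
import threading

def run_pipeline(n_items, capacity, n_consumers=2):
    """Producer stalls whenever `capacity` items are waiting to be consumed."""
    buf = queue.Queue(maxsize=capacity)   # bounded buffer acts as the memory cap
    results = []
    lock = threading.Lock()
    peak = 0

    def producer():
        nonlocal peak
        for i in range(n_items):
            buf.put(i)                    # blocks while the buffer is full
            with lock:
                peak = max(peak, buf.qsize())
        for _ in range(n_consumers):
            buf.put(None)                 # one stop marker per consumer

    def consumer():
        while True:
            item = buf.get()
            if item is None:
                break
            with lock:
                results.append(item * item)   # stand-in for real processing

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results), peak

squares, peak_fill = run_pipeline(n_items=20, capacity=4)
```

The producer never gets more than `capacity` items ahead of the consumers, so memory use stays bounded while all worker threads remain busy as long as there is data.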

Related

What is the difference between CPU core's thread and any software's application thread?

I have a web application which supports multithreading, so we can run async tasks simultaneously on different threads. I understand what a thread means in that sense.
Now suppose the server on which the application is running has a multi-core CPU with hyper-threading enabled.
How is my application supposed to take advantage of these hardware threads? Is there any relation between the two that I am missing?
What I understand about CPU threads is this:
A thread is a single line of commands that are getting processed; each application has at least one thread, and most have several. A core is the physical hardware that works on the thread. In general a processor can only work on one thread per core; CPUs with hyper-threading can work on up to two threads per core.
For processors with hyper-threading, there are extra registers and execution units in the core so it can store the state of two threads and work on them both. Normally, to change threads you have to empty the registers into the cache, write that back to main memory, then load up the cache with the new values and load up the registers; context switches hurt performance significantly.
But when you have many background tasks running, how do they make use of just a limited number of core threads (i.e. 2 to 8)?
PS: I have already checked "What is the difference between a process and a thread?" and I am not looking for the definition of a process, so this is not a duplicate.
If you are making use of multiple cores in your program, then the OS will schedule which cores run which threads and will take many factors into account, including other processes running, what exactly your code is trying to do, and much more. As for async tasks, these are not necessarily running on a different thread or core; they may simply be tasks that do not complete instantly, so some scheduler may decide to start doing other things until there is a signal that the async task is complete. It will vary widely depending on the language the web application is written in, and on the implementation.

how are the multiprocessing and threading and thread pooling working

https://code.tutsplus.com/articles/introduction-to-parallel-and-concurrent-programming-in-python--cms-28612
I have studied the article at this link and have a few questions.
Q1: How are the thread pool (concurrent.futures) and plain threading different here? Why do we see the performance improvement? The threading version with a queue has 4 threads; each runs cooperatively during idle time and picks the next item from the queue once it gets the website response. As I see it, the thread pool does the same thing in a way: a worker completes its work and waits for the manager to assign a task, which is very similar to picking a new item from the queue. I'm not sure how this is different and why I see the performance improvement. It seems I'm wrong in interpreting the pooling here. Could you explain?
Q2: Using multiprocessing, the time taken is higher. If I have a multiprocessor which can handle multiple processes at a time, then all 4 of my processes should be handled by it at the same time; that is real parallelization. I also have a question here: since the 4 processes are running the same function, doesn't the GIL try to stop them from executing the same piece of code? Let's suppose all of them share a common variable that gets updated, like the number of websites checked. How does the GIL work in these cases of multiprocessing?
Also, are the same processes used again and again, or are they killed and created every time after their job? I think the same processes are reused. I also think the performance problem comes from process creation, which is costly compared to the lightweight threads of the concurrent threading phase. So could you explain in more detail how the GIL works here and how the processes run? Are they running cooperatively (each process waiting for its turn, like threads in a process do), or are they using the multiprocessors to run truly in parallel?
My other question: if I have an 8-core machine, I think I can run 8 threads of the same process simultaneously, in parallel. Can I run 2 processes with 4 threads each on that machine? Can I run 8 processes on 8 cores? I think cores are only for threads of a process, which means I can't run 8 processes on 8 cores, but I can run as many processes as there are CPUs in my multiprocessor system. Am I right? So can I run 2 processes with 4 threads each on an 8-core machine made of 2 processors with 4 cores each?
Python has a rich set of libraries for multitasking with processes and threads. However, there is overlap between the libraries, and the choice depends on how abstractly you view the computational tasks. For example, the concurrent.futures library views threads as asynchronous tasks, while the threading module deals with them as high-level threads. Further, the _thread module implements a low-level interface for threading, exposing the synchronization mechanisms.
The GIL (Global Interpreter Lock) is just a synchronization primitive, specifically a mutex, which prevents multiple threads of the same process from executing Python bytecode at the same time (so that objects remain consistent under concurrent operations). This is exactly why Python threads do better with I/O operations than with compute-intensive tasks: the GIL is released during certain blocking calls and inside some computationally intensive libraries such as numpy. Note that only the CPython and PyPy implementations of Python are constrained by the GIL mechanism.
Now, let's see those questions...
How thread pool (Concurrent) and threading are different here? Why do we see the performance improvement?
Coming to the comparison between Threading and concurrent.futures.ThreadPoolExecutor (aka threading_squirrel vs future_squirrel), I've executed both programs with the same test case. There are two factors that contribute to this "performance improvement":
Network HEAD requests: Remember that network operations need not complete in the same time period every time you execute them... due to the very nature of packet transfer delays...
Order of thread execution: In the website you've linked, the author creates all threads initially, sets up the queue full of website links and then starts all of them in a list comprehension loop. In ThreadPoolExecutor of concurrent.futures, each time a task is submitted, a thread is assigned to it if the predefined maximum number of threads/workers has not been reached. I've changed the code to mirror this technique. It seems to give a speedup, as the first thread begins work early on and doesn't need to wait for the queue to be filled up...
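To make the comparison concrete, here is a minimal sketch of the two styles side by side. It is not the linked article's code: `check` is a hypothetical stand-in that sleeps instead of sending a HEAD request, and the function names are invented for illustration.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def check(site):
    time.sleep(0.01)              # stand-in for the network round trip
    return site, len(site)

def with_queue_threads(sites, n_workers=4):
    """Fill the queue first, then start a fixed set of threads that drain it."""
    q = queue.Queue()
    for s in sites:
        q.put(s)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                site = q.get_nowait()
            except queue.Empty:
                return            # queue drained: this thread is done
            r = check(site)
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def with_pool(sites, n_workers=4):
    """Let the executor create workers on demand as tasks are submitted."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(check, sites))

sites = ["example.com", "example.org", "example.net", "example.edu", "example.io"]
manual = with_queue_threads(sites)
pooled = with_pool(sites)
```

Both variants do the same work with the same number of threads; what differs is mainly when each thread starts and who hands out the items, which is why any timing gap between them is small and noisy.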
How does GIL work in these cases of multiprocessing?
Remember that the GIL comes into effect for threads of a process only, not among processes. The GIL locks the whole interpreter while one thread executes bytecode, so the other threads have to wait for their turn. This is the reason multiprocessing uses processes instead of threads, as each process has its own interpreter and consequently its own GIL.
Are the same processes used again and again or they get killed and created every time after their job?
The concept of pooling is to reduce the overhead of creating and destroying workers (be it threads or processes) during computation. However, the processes are kind of "brand new" in the sense that the library effectively asks the OS to perform a fork on a UNIX-based OS or a spawn on an NT-based OS...
Also, are the processes running co-operatively?
Maybe. They have to co-operate if they use shared memory (though they need not be running at the same time). There is definitely going to be a context switch if there are more processes than the OS can allocate to its processors' cores. They can run in parallel if there are no shared memory updates to make.
If I have the 8 core machine can I run 2 processes with 4 threads each? Can I run 8 processes on 8 cores?
Sure (subject to the GIL, in Python). Each process can be allocated to its own processing unit for execution. A processing unit can be a physical or a virtual core of a CPU. As long as the OS scheduler supports it, it's possible. Any reasonable split of processes and threads is possible. If all of them can be allocated at once, that's the best situation; otherwise you will encounter context switches (which are more expensive when it comes to processes).
Hope I've answered all those questions!
Here are a few resources:
MultiCore CPUs, Multithreading and context switching?
Why does multiprocessing use only a single core after I import numpy?
Bonus celery-squirrel resource

Node.js single thread VS Traditional webserver thread pool

I am a newbie to Node.js. I am currently reading the book 'Beginning Node.js' by Basarat Ali Syed.
Here is an excerpt from it which states the disadvantage of thread pool of traditional web servers:
Most web servers used thread pool this
method a few years back and many continue to use today. However,
this method is not without drawbacks. Again there is wasting of RAM
between threads. Also the OS needs to context switch between threads
(even when they are idle), and this results in wasted CPU resources.
I don't quite understand why there are context switches between threads inside a thread pool. As far as I understand, one thread lasts for the duration of a task, and once the task is completed, the thread is free to receive the next task.
So my Q1: Why does it need a context switch? When will the context switch between threads happen?
My Q2: Why doesn't Node.js use multiple threads to handle the events in the event queue? Wouldn't that be more efficient and reduce the queuing time of events?
A context switch happens when the OS needs to run more threads than there are CPU cores. Say for example you have 10 threads, and they are all busy (meaning none of them has finished its task). But your CPU is only a dual core CPU (assume no hyperthreading for simplicity). So, how can all 10 threads run? It's not possible!!
The answer is context switch. The OS, when presented with lots of processes and threads to execute, will allocate a certain amount of time for each thread to run. After this time the OS will switch to another thread so that all threads will get some time to use the CPU.
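The 10-threads-on-2-cores example can be played through with a toy round-robin model. This is only a sketch for counting time slices; real OS schedulers are far more involved, and `round_robin`, `quantum` and the way a switch is counted are assumptions made for illustration.

```python
from collections import deque

def round_robin(work, cores=2, quantum=1):
    """work: time units each thread still needs.

    Returns (rounds, switches): how many time slices elapse, and how often a
    thread is taken off a core unfinished while other threads are waiting.
    """
    ready = deque(enumerate(work))
    rounds = 0
    switches = 0
    while ready:
        # Pick up to `cores` threads to run during this time slice.
        running = [ready.popleft() for _ in range(min(cores, len(ready)))]
        others_waiting = bool(ready)
        rounds += 1
        for tid, remaining in running:
            remaining -= quantum
            if remaining > 0:
                ready.append((tid, remaining))   # unfinished: back in line
                if others_waiting:
                    switches += 1                # its core goes to someone else
    return rounds, switches

busy = round_robin([3] * 10, cores=2)   # 10 busy threads on a dual core
exact = round_robin([3, 3], cores=2)    # as many threads as cores
```

In this model the 10-thread case needs 15 time slices and 20 switches, while the 2-thread case finishes in 3 slices with no switches at all, since nobody is waiting for a core.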
The term "context switch" refers to the fact that when the OS needs to give the CPU to another thread/process, it must temporarily copy all the register values into the outgoing thread's memory; otherwise the incoming process/thread would corrupt the outgoing thread's calculation by the time it resumes. When switching between processes, the OS also needs to re-point the virtual memory tables so that two processes will not mess up each other's memory. How expensive this operation is depends on the CPU architecture. Some architectures like SPARC are optimized for context switching. Hyperthreading is a feature that implements context switching in hardware so it's faster (but then again, you only get one extra context per core with Hyperthreading as implemented on the Intel/AMD64 architecture).
Not using multiple threads avoids context switching within your program entirely, and avoids it almost completely if yours is the only program running. So on a single core CPU, a nonblocking, single-threaded program can often beat a multithreaded program.
However, it's rare to find a single core CPU these days. The ideal number of threads you'd want to run is equal to the number of cores you have. Doing so would also avoid context switching. But even so, getting a complex multithreaded program to run fast is not easy. It's easier to get a nonblocking singlethreaded program to run fast. And in most web applications a multithreaded program wouldn't have any advantage over a nonblocking singlethreaded program because they're both I/O bound.
A nonblocking singlethreaded program is basically implementing thread-like behavior in userspace using events. This is sometimes called "green threads" in languages that support syntax that make event-oriented programming look like multithreaded programming.
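Since the one language used for examples here is Python, the idea can be sketched with asyncio, whose event loop plays roughly the role of Node.js's. This is an analogy, not Node.js code: `handle_request` and `serve_all` are hypothetical names, and `asyncio.sleep` stands in for a nonblocking I/O wait.

```python
import asyncio
import threading

async def handle_request(name, delay, seen_threads):
    seen_threads.add(threading.get_ident())   # record which OS thread runs us
    await asyncio.sleep(delay)                # yield to the event loop while "waiting on I/O"
    seen_threads.add(threading.get_ident())
    return f"{name} done"

async def serve_all():
    seen_threads = set()
    # All three requests are in flight at once, interleaved by the event loop.
    results = await asyncio.gather(
        handle_request("a", 0.03, seen_threads),
        handle_request("b", 0.01, seen_threads),
        handle_request("c", 0.02, seen_threads),
    )
    return results, seen_threads

results, seen_threads = asyncio.run(serve_all())
```

All three handlers overlap their waiting time, yet `seen_threads` ends up with a single entry: the interleaving happens in userspace, with no OS-level switch between worker threads.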

Openmp thread divergence?

The term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big performance hit.
I was wondering, is there a similar penalty for doing this in openmp? For example, say I have a 6 core processor and a program with 6 threads. If I have a conditional that makes 3 threads perform a certain task, and then have the other three threads perform a completely different task, will there be a big performance hit? I guess in essence it's sort of using openmp to do MIMD.
Basically, I'm writing a program with openmp and CUDA. I want two threads to run a CUDA kernel while the other left over threads run C code. Thanks.
No, there is no performance hit for diverging threads using OpenMP. It is a problem in CUDA because of the way instructions are broadcast simultaneously to a set of cores. When an OpenMP thread targets a CPU core, each CPU core has its own independent set of instructions to follow, and it runs just like any other single-threaded program would.
You may see some of your cores being underutilized if you have synchronization barriers following thread divergence, because that would force faster threads to wait for the slower threads to catch up.
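The barrier effect is easy to reproduce. The sketch below uses Python threads rather than OpenMP, only to show the waiting pattern: `run_divergent` is a made-up helper, and `time.sleep` stands in for tasks of different length.

```python
import threading
import time

def run_divergent(durations):
    """Each thread does a different amount of 'work', then all meet at a barrier."""
    barrier = threading.Barrier(len(durations))
    waited = [0.0] * len(durations)

    def worker(idx, duration):
        time.sleep(duration)                 # divergent work of varying length
        arrived = time.monotonic()
        barrier.wait()                       # faster threads sit idle here
        waited[idx] = time.monotonic() - arrived

    threads = [threading.Thread(target=worker, args=(i, d))
               for i, d in enumerate(durations)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waited

waited = run_divergent([0.01, 0.01, 0.2])
```

The two short tasks spend most of the run waiting at the barrier for the long one, which is the underutilization described above; the divergence itself costs nothing.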
When speaking about CPU parallelism, there's no intrinsic performance hit from using a certain threading design pattern. Not at the theoretical level at least.
The only problem I see is that since the threads are doing different things which may have varying completion times, some of the threads may sit idle after finishing their work, waiting for the others to finish a longer task.
The term thread divergence in CUDA refers to the situation when not all threads of a block evaluate a conditional with the same outcome. Such threads are said to diverge. If diverging threads are in the same warp, they may perform their work serially, which leads to performance loss.
I am not sure that OpenMP has the same issue, though. When different threads perform different work then load balancing may be used by the runtime perhaps, but it doesn't lead to work serialization necessarily.
There is no such problem in OpenMP, because every OpenMP thread has its own program counter (PC).

Considerate, dynamic CPU load management

I am writing a CPU-intensive image processing library. To make best use of the available CPU, I can detect the total number of cores on my machine and have my library run with that number of threads. When my library allocates one thread per core, it performs optimally, using 100% of the available processor time.
The above approach works fine when mine is the only CPU-heavy process running. If another CPU-intensive process is running, or even another instance of my own code, then the OS allocates us only a fraction of the available cores and my library then has too many threads running which is both inefficient and inconsiderate to other processes.
So I would like to find a way to determine the "fair share" number of threads to run given a specific load. For example, if two instances of my process are running on an 8-core machine, each would run with 4 threads. Each would need a way to adapt thread count dynamically according to fluctuations in machine load.
So, my question:
Is there any OS feature or third-party library which allows my process to adapt thread count dynamically to use its fair share of the CPU?
My focus is Windows but interested in non-Windows solutions too.
Edit: to be clear, this is about optimization. I am trying to achieve peak efficiency by running the optimal number of threads appropriate to my fair share of the CPU.
In my eyes, the application shouldn't decide how many threads to spawn. That is a piece of information the caller should know. On Linux, the "-j" or "--jobs" parameter is widely used (default: 1).
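A minimal sketch of such a caller-controlled option, assuming a hypothetical command-line front end to the library (the description string and `build_parser` are invented for illustration):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="hypothetical image-processing CLI")
    # The caller, not the library, decides how many worker threads to use.
    parser.add_argument("-j", "--jobs", type=int, default=1,
                        help="number of worker threads (default: 1)")
    return parser

# Example: the caller asks for 4 worker threads.
args = build_parser().parse_args(["--jobs", "4"])
```

The library then simply runs with `args.jobs` threads and leaves the question of what is "fair" to whoever launched it.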
What about also setting the priority of the processing tasks? If the caller knows the processing is mission-critical, they can increase the priority (knowing that this may block the whole system). Your processing library would never know how important the processing of a given image is.
If the caller doesn't care, then the default low priority is used, which shouldn't affect the rest of the system. If it does, you should look at what exactly is blocking the system (maybe writing image files to the HDD, or too much RAM being used, which causes swapping, ...). Once you have figured that out, you can optimize exactly that point.
If you start the processing with (CPU cores)*2 threads at low to normal priority, your system should remain usable. No one would expect this to kill the system.
Just my 2 cents.
Actually it's not a problem of multithreading but a problem of executing many programs simultaneously. This is hard on most PC operating systems because it conflicts with the idea of time-sharing.
Let's assume some workflow.
Suppose we have 8 cores and we create 8 threads to feed them; ok, that's easy. Next we choose to monitor the load on each core to estimate how many tasks are running on it; that needs some statistical assumptions (e.g. on Linux you can get 1/5/15-minute load averages), but it can be done. With those statistics we get a picture of how many CPU-bound processes are running; say we see 3 other CPU-intensive processes.
Then we come to the point: we have to make 3 redundant threads to sleep, but which 3?
Usually we choose 3 threads arbitrarily, because the scheduler arranges the remaining 8 CPU-bound threads (our 5 plus the other 3) automatically. In some cases, we explicitly put threads on high-load cores to sleep, assign the other threads to certain low-load cores, and let the scheduler do the rest. Most scheduling policies also try to "keep the CPU cache hot", which means they tend to avoid moving threads between cores. We can reasonably expect our CPU-intensive threads to make good use of the core caches, since the other processes are scheduled onto the 3 crowded cores. Everything looks good.
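The arithmetic of this workflow (8 cores, 3 other CPU-bound processes, so 5 threads stay awake) can be sketched as follows. The function names are invented, and using the 1-minute load average as the estimate of competing load is an assumption; `os.getloadavg` is only available on Unix-like systems, so the sketch falls back to "no other load" elsewhere.

```python
import os

def fair_share_threads(total_cores, other_busy):
    """Cores left after subtracting other CPU-bound work, never below 1."""
    return max(1, total_cores - int(round(other_busy)))

def estimate_other_load(own_threads):
    """Rough count of runnable tasks that are not ours, from the 1-minute load average."""
    try:
        load_1min, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        return 0.0                       # not available here: assume an idle machine
    return max(0.0, load_1min - own_threads)

# The example from the text: 8 cores, 3 other CPU-intensive processes.
keep_awake = fair_share_threads(8, 3)
```

A library could call `estimate_other_load` periodically and put threads to sleep or wake them so that the active count tracks `fair_share_threads`; how often to sample is a tuning question this sketch does not answer.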
However, this could fail in tightly synchronized computation. In this scenario we need to run our 5 threads simultaneously. Simultaneity here means the 5 threads have to get the CPU and run at almost the same time. I don't know of any scheduler on a PC that could do this for us. In most low-load cases, things still work fine because the cost of waiting for simultaneity is trivial. But when the load on a core is high and even 1 of our 5 threads is disturbed, we'll occasionally find that we spend many cycles waiting.
It may help to schedule your program as a real-time program but it's not a perfect solution. Statistically it leads to a wider time window for simultaneity when it gains more CPU control priority. I have to say, it's not guaranteed.

Resources