Even with
inter_op_parallelism_threads = 1
intra_op_parallelism_threads = 1
values set, a TensorFlow 1.5 process is still not single-threaded. Why? Is there a way to completely disable this unexpected thread spawning?
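For reference, this is roughly how the two options above are set through the TF 1.x Python API (a minimal sketch; the rest of my graph/session code is omitted):

import tensorflow as tf  # TensorFlow 1.x API

config = tf.ConfigProto(
    inter_op_parallelism_threads=1,  # pool used to run independent ops
    intra_op_parallelism_threads=1,  # pool used inside a single op (e.g. a matmul)
)
sess = tf.Session(config=config)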
First of all, TensorFlow is a multi-level software stack, and each layer tries to be smart and introduces some worker threads of its own:
One thread is created by Python runtime
Two more threads are created by NVIDIA CUDA runtime
Next, there are threads originating from the way TensorFlow administers internal compute jobs:
Threads are created/joined all the time to poll on job completion (GRPC engine)
Thus, TensorFlow cannot be single-threaded, even with all options set to 1. Perhaps this design is intended to reduce latencies for async jobs. Yet there is a certain drawback: multicore compute libraries, such as linear algebra libraries, perform cache-intensive operations best with a static, symmetric core-to-thread mapping, and the dangling callback threads produced by TensorFlow will disturb this symmetry all the time.
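One way to observe these extra threads is to count OS-level threads rather than Python-level ones (a sketch, assuming the third-party psutil package is installed; threading.active_count() only sees Python threads, not the native CUDA/gRPC workers):

import psutil  # third-party; pip install psutil

print("OS threads in this process:", psutil.Process().num_threads())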
So what I am aware of is that simultaneous multithreading (Intel's Hyperthreading, for example) enables a single CPU core to efficiently manage several threads at once. Most explanations I find say it's as if you have more than one core at your disposal. But what I'm wondering is whether this is what is actually going on at a low level (machine level)? Or is it more that, to the OS, it just looks like the work is being done on 2 cores, while in the end simultaneous multithreading just makes switching back and forth between two (or more) different threads much more efficient, giving the illusion of having more than one core?
Simultaneous multithreading is defined in "Simultaneous Multithreading: Maximizing On-Chip Parallelism" (Dean M. Tullsen et al., 1995, PDF) as "a technique permitting several independent threads to issue instructions to a superscalar’s multiple functional units in a single cycle" ("issue" means initiation of execution; an alternative use of the term means entering into an instruction scheduler). "Simultaneous" refers to the issue of instructions from different threads at the same time, distinguishing SMT from fine-grained multithreading that rapidly switches between threads in execution (e.g., choosing each cycle which thread's instructions to execute) and switch-on-event multithreading (which is more similar to OS-level context switches).
SMT implementations often interleave instruction fetch, decode, and commit, making these pipeline stages look more like those of a fine-grained multithreaded or non-multithreaded core. SMT exploits the fact that an out-of-order superscalar already chooses dynamically among arbitrary (within a window) instructions, recognizing that execution resources are typically not fully used. (In-order SMT provides relatively greater benefits, since in-order execution lacks the latency hiding of out-of-order execution, but the pipeline control complexity is increased.)
A barrel processor (pure round-robin, fine-grained thread scheduling, with nops issued for non-ready threads) with shared caches would look more like separate cores running at 1/thread_count of the clock frequency (and with shared caches), since such a design lacks dynamic contention for execution resources. It is also arguable that having instructions from multiple threads in the processor pipeline at the same time represents parallel instruction processing; distinct threads can have instructions being processed (in different pipeline stages) at the same time. Even with switch-on-event multithreading, a cache miss can be processed in parallel with the execution of another thread, i.e., multiple instructions from another thread can be processed during the "processing" of a load instruction.
The distinction from OS-level context switching can be even more fuzzy if the ISA provides instructions that are not interrupt-atomic. For example, on x86 a timer interrupt can lead an OS to perform a context switch while a string instruction is in progress. In some sense, during the entire time slice of the other thread, the string instruction might be considered still to be "executing" since its operation was not completed. With hardware prefetching, some degree of forward progress of the earlier thread might, in theory, continue past the time when another thread starts running, so even a requirement of simultaneous activity in multiple threads might be satisfied. (If processing of long x86 string instructions was handed off to an accelerator, the instruction might run fully in parallel with another thread running on the core that initiated the instruction.)
INTRO
multiprocessing = using multiple CPU cores to complete a task (each process has its own memory, thus requiring pipes and data structures for the processes to "talk" to each other)
multithreading = using multiple threads (on a single CPU core) with a task scheduler to complete a task (all threads share the same memory)
static (temporal) multithreading - take advantage of idle I/O time by scheduling tasks to occur sequentially without pause during cache misses (i.e. waiting to read/write to an I/O device); used for I/O-bound tasks
dynamic (simultaneous) multithreading - take advantage of instructions that can happen at the same time (on Intel chips, this is called "Hyperthreading"); used for CPU-bound tasks
e.g.
a = b*c //Task 1
d = e*f //Task 2
g = a*d //Task 3
// Task 1 and 2 don't depend on each other, and hence can be run in parallel
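As a quick illustration of the "multiprocessing" definition above (in Python rather than LabVIEW, and only a sketch), Task 1 and Task 2 can be dispatched to two worker processes, and Task 3 runs once both results are back:

from multiprocessing import Pool

def multiply(x, y):
    return x * y

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        a = pool.apply_async(multiply, (2, 3))   # Task 1 on one worker
        d = pool.apply_async(multiply, (4, 5))   # Task 2 on another worker
        g = a.get() * d.get()                    # Task 3 depends on both
        print(g)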
QUESTION
Given the above, how can I control in LabVIEW which cores I use to multiprocess a task (not multithread)?
LabVIEW inherently spreads the dataflow across multiple processors and multiple threads, with as much parallelism as its analysis determines the system can stand. THERE ARE ALMOST ZERO CASES WHERE YOU SHOULD SPECIFY THE THREADING MODEL OF THE CODE. The Timed Loop and Timed Structure capabilities should be considered strictly for real-time systems, not for execution on desktop systems (Windows, Mac, or Linux). If you attempt to specify the threading model, you will almost certainly get less performance than the sophisticated model already computed by the compiler and run-time engine.
As of NI LabVIEW version 8.5 the Timed Loop and Timed Sequence structures include a Processor input that allows you to manually assign available processors to handle the execution of the structures. You can configure the processor assignment by wiring an input to the Processor input of the Input Node for the structure or for frames of the structure.
http://www.ni.com/product-documentation/6400/en/
https://code.tutsplus.com/articles/introduction-to-parallel-and-concurrent-programming-in-python--cms-28612
From this link, which I have studied, I have a few questions:
Q1: How are the thread pool (concurrent) and threading different here? Why do we see the performance improvement? The threading-with-queue version has 4 threads, each running cooperatively during idle time and picking an item from the queue once it gets a website response. As I see it, the thread pool is doing much the same thing: completing its work and waiting for the manager to assign a task, which is very similar to picking a new item from the queue. I'm not sure how this is different and why I see the performance improvement. It seems I'm misinterpreting the pooling here. Could you explain?
Q2: Using multiprocessing, the time taken is longer. If I have a multiprocessor which can handle multiple processes at a time, then all my 4 processes should be handled by it at once; that is when real parallelization happens. Also, I have a question here: in such a case, since 4 processes are running the same function, doesn't the GIL try to stop them from executing the same piece of code? Let's suppose all of them share a common variable that gets updated, like the number of websites checked. So how does the GIL work in these cases of multiprocessing?
Also, are the same processes used again and again, or do they get killed and created every time after their job? I think the same processes are used. Also, I think the performance problem is because of process creation, which is costly compared to the lightweight threads in the concurrent threading phase. So could you explain in more detail how the GIL works here and how the processes run: do they run cooperatively (each process waits for its turn, like threads within a process do), or do these processes use the multiple processors to run truly in parallel? My other question is: if I have an 8-core machine, I think I can run 8 threads of the same process simultaneously or in parallel. If I have an 8-core machine, can I run 2 processes with 4 threads each? Can I run 8 processes on 8 cores? I think cores are only for the threads of a process, which would mean I can't run 8 processes on 8 cores, but I can run as many processes as I have CPUs in a multiprocessor system; am I right? So can I run 2 processes with 4 threads each on my 8-core machine with 2 multiprocessors, each processor having 4 cores?
Python has a rich set of libraries for multitasking with processes and threads. However, there is overlap between the libraries, and the choice depends on how abstractly you view the computational tasks. For example, the concurrent.futures library views threads as asynchronous tasks, while the threading library deals with them as high-level threads. Further, _thread implements a low-level interface for threading, exposing all the synchronization mechanisms.
The GIL (Global Interpreter Lock) is just a synchronization primitive, specifically a mutex, which prevents multiple threads of the same process from executing Python bytecode at the same time (so that objects which need to remain consistent under concurrent operations stay consistent). This is exactly why Python threads excel at I/O operations, speed-wise, compared to compute-intensive tasks (owing to the fact that the GIL is released during certain blocking calls and inside computationally intensive libraries such as numpy). Note that only the CPython and PyPy implementations of Python are constrained by the GIL mechanism.
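A rough sketch of that difference (the exact timings will vary by machine; the numbers are not the point): two CPU-bound threads gain nothing under CPython's GIL, while two I/O-bound threads overlap their blocking waits:

import threading
import time

def cpu_bound():
    total = 0
    for _ in range(10_000_000):  # pure bytecode, GIL held
        total += 1

def io_bound():
    time.sleep(1)  # GIL released while blocked

def run_two(target):
    threads = [threading.Thread(target=target) for _ in range(2)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

print("CPU-bound, 2 threads:", run_two(cpu_bound))  # roughly 2x one thread
print("I/O-bound, 2 threads:", run_two(io_bound))   # about 1 second, not 2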
Now, let's see those questions...
How are the thread pool (concurrent) and threading different here? Why do we see the performance improvement?
Coming to the comparison between Threading and concurrent.futures.ThreadPoolExecutor (aka threading_squirrel vs future_squirrel), I've executed both programs with the same test case. There are two factors that contribute to this "performance improvement":
Network HEAD requests: Remember that network operations need not complete in the same time period every time you execute them... due to the very nature of packet transfer delays...
Order of thread execution: In the website you've linked, the author creates all threads initially, sets up the queue full of website links, and then starts all of them in a list comprehension loop. In concurrent.futures' ThreadPoolExecutor, each time a task is submitted, a thread is assigned to it if the predefined maximum number of threads/workers has not been reached. I've changed the code to mirror this technique. It seems to give a speedup, as the first thread begins work early on and doesn't need to wait for the queue to be filled up...
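For concreteness, here is a sketch of the ThreadPoolExecutor approach described above (assuming the third-party requests package and a placeholder URL list; the tutorial's own code differs in detail):

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests  # third-party; pip install requests

URLS = ["https://example.com", "https://example.org"]  # placeholder list

def check(url):
    # HEAD request: we only care about the status code, not the body
    return url, requests.head(url, timeout=5).status_code

with ThreadPoolExecutor(max_workers=4) as pool:
    # each submitted task starts as soon as a worker is free
    futures = [pool.submit(check, url) for url in URLS]
    for future in as_completed(futures):
        print(future.result())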
How does the GIL work in these cases of multiprocessing?
Remember that the GIL comes into effect only for the threads of a process, not between processes. The GIL locks the whole interpreter while one thread executes bytecode, so the other threads have to wait for their turn. This is the reason multiprocessing uses processes instead of threads: each process has its own interpreter and, consequently, its own GIL.
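A sketch of the consequence (assuming a machine with at least 4 free cores): the same kind of CPU-bound work that threads cannot parallelize does scale when farmed out to worker processes, each with its own GIL:

from concurrent.futures import ProcessPoolExecutor
import time

def count_up(n):
    total = 0
    for _ in range(n):
        total += 1
    return total

if __name__ == "__main__":
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(count_up, [10_000_000] * 4))
    # roughly the time of a single call, not 4x, if 4 cores are free
    print("4 processes:", time.perf_counter() - start)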
Are the same processes used again and again, or do they get killed and created every time after their job?
The concept of pooling is to reduce the overhead of creating and destroying workers (be they threads or processes) during computation. However, the processes are kind of "brand new" in the sense that the library effectively asks the OS to perform a fork on a UNIX-based OS or a spawn on an NT-based OS...
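A small sketch of worker reuse (a toy task of my own, not from the tutorial): with a pool of 2 workers and 8 submitted tasks, the same two process IDs come back rather than 8 fresh ones:

from multiprocessing import Pool
import os

def whoami(_):
    return os.getpid()  # report which worker process ran this task

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        pids = pool.map(whoami, range(8))
    print(set(pids))  # typically just 2 distinct PIDs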
Also, are the processes running co-operatively?
Maybe. They have to run in co-operation if they use shared memory... (they need not be running together). There is definitely going to be a context switch if there are more processes than the OS can allocate to its processors' cores. They can run in parallel if there are no shared-memory updates to make.
If I have the 8 core machine can I run 2 processes with 4 threads each? Can I run 8 processes on 8 cores?
Sure (subject to the GIL, in Python). Each process can be allocated to a processing unit for execution, where a processing unit can be a physical or a virtual core of a CPU. As long as the OS scheduler supports it, it's possible. Any reasonable split-up of processes and threads is possible. If all of them are allocatable, that's the best situation; otherwise you will encounter context switches... (which are more expensive when it comes to processes)
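For illustration, a sketch of the "2 processes with 4 threads each" layout using only the standard library (8 schedulable threads in total; whether they all run truly in parallel is up to the OS scheduler and, for Python bytecode, the per-process GIL):

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def thread_work(i):
    return i * i

def process_work(chunk):
    # each worker process runs its own 4-thread pool
    with ThreadPoolExecutor(max_workers=4) as tpool:
        return list(tpool.map(thread_work, chunk))

if __name__ == "__main__":
    chunks = [range(0, 4), range(4, 8)]
    with ProcessPoolExecutor(max_workers=2) as ppool:
        print(list(ppool.map(process_work, chunks)))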
Hope I've answered all those questions!
Here are a few resources:
MultiCore CPUs, Multithreading and context switching?
Why does multiprocessing use only a single core after I import numpy?
Bonus celery-squirrel resource
I'm trying Intel TBB for the first time and I'm stuck right at the beginning.
I've attached a simple image to show how I want to build my concurrency program.
I've taken a look here at the Simplest TBB example and here at using TBB for non-parallel tasks.
TBB is nice, but I don't know how to handle the following problem: how can I define thread pools/task pools that are started or stopped depending on memory consumption? To be precise, if the memory consumption of some data class is too high, the thread pool filling it should be stalled (e.g. no new tasks spawned until the other threads have consumed the data inside the corresponding data class).
The result should be a CPU running on all cores without memory overflow.
Are there any examples?
The term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big performance hit.
I was wondering, is there a similar penalty for doing this in openmp? For example, say I have a 6 core processor and a program with 6 threads. If I have a conditional that makes 3 threads perform a certain task, and then have the other three threads perform a completely different task, will there be a big performance hit? I guess in essence it's sort of using openmp to do MIMD.
Basically, I'm writing a program with openmp and CUDA. I want two threads to run a CUDA kernel while the other left over threads run C code. Thanks.
No, there is no performance hit for diverging threads using OpenMP. It is a problem in CUDA because of the way instructions are broadcast simultaneously to a set of cores. When an OpenMP thread targets a CPU core, each CPU core has its own independent set of instructions to follow, and it runs just like any other single-threaded program would.
You may see some of your cores being underutilized if you have synchronization barriers following thread divergence, because that would force faster threads to wait for the slower threads to catch up.
When speaking about CPU parallelism, there's no intrinsic performance hit from using a certain threading design pattern. Not at the theoretical level at least.
The only problem I see is that since the threads are doing different things which may have varying completion times, some of the threads may sit idle after finishing their work, waiting for the others to finish a longer task.
The term thread divergence in CUDA refers to the situation when not all threads of a block evaluate a conditional with the same outcome. Such threads are said to diverge. If diverging threads are in the same warp, then those threads may perform their work serially, which leads to performance loss.
I am not sure that OpenMP has the same issue, though. When different threads perform different work, the runtime may perhaps apply load balancing, but that does not necessarily lead to serialization of the work.
There is no such problem in OpenMP, because every OpenMP thread has its own program counter (PC).