I need to accelerate some image processing by means of multithreading on a multicore architecture.
My images are OpenCV objects.
I tried approaching the problem with the threading module, but it turns out that it does not provide true parallelism (!) because of the GIL. I then tried the multiprocessing module, but it turns out that my objects are not shared between processes (!)
My processing needs to work on different sections of the same image simultaneously (so the Queue paradigm does not apply).
What can I do?
I eventually found a solution here to share NumPy arrays when using the multiprocessing model: http://briansimulator.org/sharing-numpy-arrays-between-processes/
The main idea is to flatten the image bitmap into a linear buffer, copy it into a shared array, and map it back to a bitmap (without copying) in the receiving processes.
This is not fully satisfying, as multithreading would be more resource-friendly than multiprocessing (which takes a huge amount of memory), but at least it works, with minimal effort.
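A minimal sketch of that idea (the two-band split and the invert operation are just for illustration; any in-place NumPy/OpenCV operation on a slice would do):

import multiprocessing as mp
import numpy as np

def worker(shared_buf, shape, y0, y1):
    # Map the flat shared buffer back to a 2-D view -- no copy involved
    img = np.frombuffer(shared_buf.get_obj(), dtype=np.uint8).reshape(shape)
    img[y0:y1] = 255 - img[y0:y1]   # process one horizontal band in place

if __name__ == '__main__':
    image = np.zeros((480, 640), dtype=np.uint8)   # stand-in for a cv2 image
    shared_buf = mp.Array('B', image.size)         # flat shared byte buffer
    # Copy the bitmap into shared memory once
    np.frombuffer(shared_buf.get_obj(), dtype=np.uint8)[:] = image.ravel()
    procs = [mp.Process(target=worker, args=(shared_buf, image.shape, y0, y0 + 240))
             for y0 in (0, 240)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    result = np.frombuffer(shared_buf.get_obj(), dtype=np.uint8).reshape(image.shape)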
Related
Consider the following example from the Python documentation. Does multiprocessing.Process use serialization (pickle) to put items in the shared queue?
from multiprocessing import Process, Queue

def f(q):
    q.put([42, None, 'hello'])

if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print(q.get())    # prints "[42, None, 'hello']"
    p.join()
I understand that multiprocessing.Process uses pickle to serialize/deserialize data in order to communicate with the main process, and that threading.Thread does not need serialization, so it does not use pickle. But I'm not sure how the communication through a Queue happens with multiprocessing.Process.
Additional Context
I want multiple workers to fetch data from a database (or local storage) to fill a shared queue whose items are consumed by the main process sequentially. Each fetched record is large (1–1.5 MB). The problem with using multiprocessing.Process is that serialization/deserialization of the data takes a long time. PyTorch's DataLoader relies on multiprocessing in this way and is therefore unsuitable for my use case.
Is multi-threading the best alternative for such a use case?
Yes, multiprocessing's queues do use pickle internally. This can be seen in multiprocessing/queues.py of the CPython implementation. In fact, AFAIK CPython uses pickle for transferring any object between interpreter processes. The only way to avoid this is to use shared memory, but it introduces strong limitations and basically cannot be used with arbitrary objects.
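As a rough way to see what that costs per item (the 1.5 MB record below is a stand-in matching the size mentioned in the question; the actual queue also pays for the OS pipe transfer on top of this):

import pickle
import time

# Every Queue.put()/get() pays roughly this serialization round-trip.
record = b'x' * 1_500_000   # stand-in for one ~1.5 MB fetched record
t0 = time.perf_counter()
for _ in range(100):
    pickle.loads(pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL))
elapsed_ms = (time.perf_counter() - t0) / 100 * 1e3
print(f"pickle round-trip: ~{elapsed_ms:.2f} ms per record")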
Multithreading is limited by the Global Interpreter Lock (GIL), which basically prevents any parallel speed-up except for operations that release the GIL (e.g. some NumPy functions) and I/O-bound ones.
Python (and especially CPython) is not the best language for parallel computing (nor for high performance). It was not designed with that in mind, and this is nowadays a pretty strong limitation given the recent sharp increase in the number of cores per processor.
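That said, for the use case above: if the per-record fetch is I/O-bound, plain threads with a queue.Queue sidestep pickling entirely, since everything stays in one process. A sketch, where fetch_record is a placeholder for the database/storage read:

import queue
import threading

def fetch_record(record_id):
    # Placeholder for an I/O-bound database or local-storage read;
    # the GIL is released while the thread waits on I/O.
    return b'x' * 1_000_000

def worker(record_ids, out_q):
    for record_id in record_ids:
        out_q.put(fetch_record(record_id))   # no pickling: same process

if __name__ == '__main__':
    out_q = queue.Queue(maxsize=32)          # bounded, to cap memory use
    ids = list(range(1000))
    chunks = [ids[i::4] for i in range(4)]   # one chunk per worker thread
    for chunk in chunks:
        threading.Thread(target=worker, args=(chunk, out_q), daemon=True).start()
    for _ in ids:
        record = out_q.get()                 # consumed sequentially by the main thread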
I am doing heavy image processing in Python 3 on a large batch of images using NumPy and OpenCV. I know Python has the GIL, which prevents two threads from running concurrently. A quick search on Google told me not to use threads in Python for CPU-intensive tasks, and to use them only for I/O, saving files to disk, database communication, etc. I also read that the GIL is released when working with C extensions. Since both NumPy and OpenCV are C and C++ extensions, I get the feeling that the GIL might be released. I am not sure about it, because image processing is a CPU-intensive task. Is my intuition correct, or am I better off using multiprocessing?
To answer it upfront, it depends on the functions you use.
The most effective way to prove whether a function releases the GIL is to check the corresponding source. Checking the documentation also helps, but the behavior is often simply not documented. And yes, it is cumbersome.
http://scipy-cookbook.readthedocs.io/items/Multithreading.html
[...] numpy code often releases the GIL while it is calculating,
so that simple parallelism can speed up the code.
Each project might use its own macros, so if you are familiar with the default macros like Py_BEGIN_ALLOW_THREADS from the CPython C API, you might find them redefined; in NumPy, for instance, the equivalent is NPY_BEGIN_THREADS_DEF, etc.
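Besides reading the source, a rough empirical check is to time the same operation with a growing number of Python threads: if the wall time stays roughly flat, the GIL is being released inside the call. A sketch (the matrix size is arbitrary, and a multithreaded BLAS behind np.dot can blur the result):

import threading
import time
import numpy as np

a = np.random.rand(2000, 2000)

def work():
    np.dot(a, a)   # spends its time in C code that releases the GIL

def timed(n_threads):
    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

for n in (1, 2, 4):
    print(f"{n} thread(s): {timed(n):.2f} s")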
Even with
inter_op_parallelism_threads = 1
intra_op_parallelism_threads = 1
values set, a TensorFlow 1.5 process is not single-threaded. Why? Is there a way to completely disable the unexpected thread spawning?
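For reference, here is roughly how those options are applied through the session config in the TensorFlow 1.x API:

import tensorflow as tf

# TensorFlow 1.x style: cap both of TensorFlow's own thread pools at 1.
# This does not affect the Python, CUDA, or GRPC threads discussed below.
config = tf.ConfigProto(
    inter_op_parallelism_threads=1,
    intra_op_parallelism_threads=1,
)
sess = tf.Session(config=config)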
First of all, TensorFlow is a multi-level software stack, and each layer tries to be smart and introduces some worker threads of its own:
One thread is created by the Python runtime
Two more threads are created by the NVIDIA CUDA runtime
Next, there are threads originating from the way TensorFlow administers internal compute jobs:
Threads are created/joined all the time to poll on job completion (GRPC engine)
Thus, TensorFlow cannot be single-threaded, even with all options set to 1. Perhaps this design is intended to reduce latencies for async jobs. Yet there is a certain drawback: multicore compute libraries, such as linear algebra ones, perform cache-intensive operations best with a static, symmetric core-to-thread mapping, and the dangling callback threads produced by TensorFlow will disturb this symmetry all the time.
I'm writing a Python program that will load a wordlist from a text file and then try unzipping an archive with each word. It wouldn't be a serious attempt if it didn't make use of all CPU cores. Because of the GIL, threading in Python isn't a great option, if I'm not mistaken.
So I want to get the number of CPU cores, split the wordlist, and use multiprocessing.Process to process different parts of the wordlist in different processes.
But would every process get pinned to a CPU core automatically? If not, is there a way to pin them manually?
You can use Python's multiprocessing module by importing it with import multiprocessing as mp, and find out the number of processors with mp.cpu_count(); this should work on most platforms.
To launch programs/processes on specific CPU cores (on Linux) you can use taskset, and use this guide as a reference.
An alternative cross-platform solution would be to use the psutil package for Python.
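With psutil, pinning the current process looks like this (cpu_affinity is supported on Linux and Windows, but not on macOS):

import psutil

p = psutil.Process()       # the current process
p.cpu_affinity([0])        # pin it to core 0
print(p.cpu_affinity())    # -> [0]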
However, I would suggest you go with a thread/process pooling approach, as in my opinion you should let the operating system assign tasks to each CPU/core. You can look at How to utilize all cores with python multiprocessing for how to approach this problem.
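A pooling sketch of that idea (the archive and wordlist names are placeholders, and any extraction error is simply treated as a wrong password):

import multiprocessing as mp
import zipfile

ARCHIVE = 'archive.zip'    # placeholder name

def try_password(word):
    # Return the word if it opens the archive, else None.
    try:
        with zipfile.ZipFile(ARCHIVE) as zf:
            zf.extractall(pwd=word.encode())
        return word
    except Exception:
        return None

if __name__ == '__main__':
    with open('wordlist.txt') as f:
        words = [line.strip() for line in f]
    with mp.Pool(mp.cpu_count()) as pool:    # the OS schedules one worker per core
        for result in pool.imap_unordered(try_password, words, chunksize=64):
            if result is not None:
                print('password:', result)
                pool.terminate()
                break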
I've been reading a lot about multi-threaded rendering. People have been proposing all kinds of weird and wonderful schemes for submitting work to the GPU with threads in order to speed up their frame rates and get more stuff rendered, but I'm having a bit of a conceptual problem with the whole thing and I thought I'd run it by the gurus here to see what you think.
As far as I know, the basic unit of concurrency on a GPU is the warp. That is to say, it's down at the pixel level rather than higher up at the geometry-submission level. So given that the unit of concurrency on the GPU is the warp, the driver must be locked down pretty tightly with mutexes to prevent multiple threads screwing up each other's submissions. If this is the case, I don't see the benefit of coding to the D3D or OpenGL multi-threading primitives.
Surely the most efficient way of using your GPU in a multi-threading scenario is at the higher, abstract level, where you collect batches of work before submitting them? Rather than randomly interleaving commands from multiple threads, I would have thought that a single block accepting work from multiple threads, with a little intelligence inside it to order things for better performance before submission to the renderer, would be a much bigger gain if you wanted to work with multiple threads.
So, whither D3D/OpenGL multi-threaded rendering support in the actual API?
Help me with my confusion!
Your question comes from a misunderstanding of the difference between "making a renderer multi-threaded" and "multithreaded rendering".
A "renderer", or more precisely a "rendering system," does more than just issue rendering commands to the API. It has to shuffle memory around. It may have to load textures dynamically into and out-of graphics memory. It may have to read data back after some rendering process. And so forth.
To make a renderer multithreaded means exactly that: to make the rendering system make use of multiple threads. This could be threading scene graph management tasks like building the list of objects to render (frustum culling, BSPs, portals, etc). This could be having a thread dedicated to texture storage management, which swaps textures in and out as needed, loading from disk and such. This could be as in the D3D11 case of command lists, where you build a series of rendering commands in parallel with other tasks.
The process of rendering, the submission of actual rendering commands to the API, is not threaded. You generally have one thread that is responsible for the basic glDraw* or ::DrawIndexedPrimitive work. D3D11 command lists allow you to build sequences of these commands, but they are not executed in parallel with other rendering commands. It is the rendering thread with the main context that is responsible for actually issuing the command list; the command list is just there to make putting that list together more thread-friendly.
In Direct3D 11 you generally create deferred contexts to which you make draw calls from your worker threads. Once work is complete and you are ready to render, you generate a command list from each deferred context and execute it on the immediate context (on the main thread). This allows the draw calls to be composed on multiple threads whilst preserving the correct ordering of the draw calls.