Does using queue in a multiprocessing Process use pickling? - multithreading

Consider the following example from Python documentation. Does multiprocessing.Process use serialization (pickle) to put items in the shared queue?
from multiprocessing import Process, Queue

def f(q):
    q.put([42, None, 'hello'])

if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print(q.get())    # prints "[42, None, 'hello']"
    p.join()
I understand that multiprocessing.Process uses pickle to serialize / deserialize data to communicate with the main process, and that threading.Thread does not need serialization, so it does not use pickle. But I'm not sure how communication through a Queue happens with multiprocessing.Process.
Additional Context
I want multiple workers to fetch data from a database (or local storage) to fill a shared queue whose items are consumed by the main process sequentially. Each record that is fetched is large (1-1.5 MB). The problem with using multiprocessing.Process is that serialization / deserialization of the data takes a long time. PyTorch's DataLoader uses multiprocessing workers in this way and is therefore unsuitable for my use case.
Is multi-threading the best alternative for such a use case?

Yes, multiprocessing's queues do use pickle internally. This can be seen in multiprocessing/queues.py of the CPython implementation. In fact, AFAIK CPython uses pickle for transferring any object between interpreter processes. The only way to avoid this is to use shared memory, but it introduces strong limitations and basically cannot be used for arbitrary objects.
Multithreading is limited by the Global Interpreter Lock (GIL), which basically prevents any parallel speed-up except for operations that release the GIL (e.g. some NumPy functions) and I/O-bound ones.
Python (and especially CPython) is not the best language for parallel computing (nor for high performance). It was not designed with that in mind, and this is nowadays a pretty strong limitation given the sharp increase in the number of cores per processor.
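For the use case in the question (large records produced by workers, consumed sequentially by the main process), the shared-memory route can look roughly like the sketch below. It is only a minimal illustration using the multiprocessing.shared_memory module (Python 3.8+), not a drop-in solution: the record size, dtype and single worker are placeholders, and only the tiny block name and shape get pickled, not the payload itself.
from multiprocessing import Process, shared_memory
import numpy as np

def worker(shm_name, shape):
    # Attach to the block created by the parent and write into it directly.
    shm = shared_memory.SharedMemory(name=shm_name)
    record = np.ndarray(shape, dtype=np.uint8, buffer=shm.buf)
    record[:] = 42                 # stand-in for "fetch the record from the database"
    shm.close()

if __name__ == '__main__':
    shape = (1_500_000,)           # roughly one 1.5 MB record
    shm = shared_memory.SharedMemory(create=True, size=shape[0])
    p = Process(target=worker, args=(shm.name, shape))
    p.start()
    p.join()
    record = np.ndarray(shape, dtype=np.uint8, buffer=shm.buf)
    print(record[:3])              # the data is visible here without being pickled
    shm.close()
    shm.unlink()                   # free the block once nobody needs it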

Related

Run a parallel process saving results from a main process in Python

I have a function that creates some results for a list of tasks. I would like to save the results on the fly to 1) free memory, compared to appending them to a results_list, and 2) keep the results of the first part in case of errors.
Here is a very short sample code:
for task in task_list:
    result = do_awesome_stuff_to_task(task)
    save_nice_results_to_db(result)  # Send this job to another process and let the main process continue
Is there a way for the main process to create results for each task in task_list and, each time a result is created, send it to another processor/thread to save it, so the main loop can continue without waiting for the slow saving process?
I have looked at multiprocessing, but that seems mostly to speed up the loop over task_list rather than allow a secondary sub process to do other parts of the work. I have also looked into asyncio, but that seems mostly used for I/O.
All in all, I am looking for a way to have a main process looping over the task_list. For each finished task I would like to send the result to another subprocess to save it. Notice that do_awesome_stuff_to_task is much faster than the saving process; hence, the main loop will have worked through multiple tasks before the first one is saved. I have thought of two ways of tackling this:
Use multiple subprocesses to save
Save every xx iterations - the saving scales okay, so perhaps the save process can save xx iterations at a time while the main loop continues?
Is this possible to do with Python? Where to look and what key considerations to take?
All help is appreciated.
It's hard to know what will be faster in your case without testing, but here are some thoughts on how to choose what to do.
If save_nice_results_to_db is slow because it's writing data to disk or network, make sure you aren't already at the maximum write speed of your hardware. Depending on the server at the other end, network traffic can sometimes benefit greatly from opening multiple connections at once to read/write, so long as you stay within your total network transfer speed (of your network interface as well as your ISP). SSDs can see some limited benefit from initiating multiple reads/writes at once, but too many will hurt performance. HDDs are almost universally slower when trying to do more than one thing at once. Everything is more efficient reading/writing larger chunks at a time.
multiprocessing must typically transfer data between the parent and child processes using pickle because they don't share memory. This has a high overhead, so if result is a large object, you may waste more time on the added overhead of sending the data to a child process than you could save by any sort of concurrency (emphasis on may; always test for yourself). As of 3.8 the shared_memory module was added, which may be somewhat more efficient, but it is much less flexible and less easy to use.
threading benefits from all threads sharing memory so there is zero transfer overhead to "send" data between threads. Python threads however cannot execute bytecode concurrently due to the GIL (global interpreter lock), so multiple CPU cores cannot be leveraged to increase computation speed. This is due to python itself having many parts which are not thread-safe. Specific functions written in c may release this lock to get around this issue and leverage multiple cpu cores using threads, but once execution returns to the python interpreter, that lock is held again. Typically functions involving network access or file IO can release the GIL, as the interpreter is waiting on an operating system call which is usually thread safe. Other popular libraries like Numpy also make an effort to release the GIL while doing complex math operations on large arrays. You can only release the GIL from c/c++ code however, and not from python itself.
asyncio should get a special mention here, as it's designed specifically with concurrent network/file operations in mind. It uses coroutines instead of threads (even lower overhead than threads, which themselves are much lower overhead than processes) to queue up a bunch of operations, then uses an operating system call to wait on any of them to finish (event loop). Using this would also require your do_awesome_stuff_to_task to happen in a coroutine for it to happen at the same time as save_nice_results_to_db.
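To make that concrete, here is a minimal asyncio sketch of the question's loop. It assumes both functions can be rewritten as coroutines (for example, the save uses an async database client), which is exactly the requirement described above; the function names are the question's own.
import asyncio

async def do_awesome_stuff_to_task(task):
    ...   # must await somewhere, or it will still block the event loop

async def save_nice_results_to_db(result):
    ...   # async I/O, e.g. an async database driver

async def main(task_list):
    saves = []
    for task in task_list:
        result = await do_awesome_stuff_to_task(task)
        # schedule the save and keep looping without waiting for it
        saves.append(asyncio.create_task(save_nice_results_to_db(result)))
    await asyncio.gather(*saves)   # wait for the outstanding saves at the end

# asyncio.run(main(task_list))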
A trivial example of firing each result off to a thread to be processed:
import threading

for task in task_list:
    result = do_awesome_stuff_to_task(task)
    threading.Thread(target=save_nice_results_to_db, args=(result,)).start()  # hand this job to another thread and let the main loop continue
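If spawning one thread per result is a concern, a thread pool caps the number of workers while keeping the same structure. A rough variant of the snippet above (same function names as the question; the worker count is arbitrary):
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as pool:
    for task in task_list:
        result = do_awesome_stuff_to_task(task)
        pool.submit(save_nice_results_to_db, result)   # returns immediately
    # leaving the with-block waits for any saves still in flight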

Is the GIL released while using multithreading with python-opencv?

I am doing heavy image processing in python3 on a large batch of images using numpy and opencv. I know Python has this GIL, which prevents two threads from running concurrently. A quick search on Google told me: do not use threads in Python for CPU-intensive tasks; use them only for I/O, saving files to disk, database communication, etc. I also read that the GIL is released when working with C extensions. Since both numpy and opencv are C and C++ extensions, I get a feeling that the GIL might be released. I am not sure about it because image processing is a CPU-intensive task. Is my intuition correct, or am I better off using multiprocessing?
To answer it upfront, it depends on the functions you use.
The most effective way to prove if a function releases the GIL is by checking the corresponding source. Also checking the documentation helps, but often it is simply not documented. And yes, it is cumbersome.
http://scipy-cookbook.readthedocs.io/items/Multithreading.html
[...] numpy code often releases the GIL while it is calculating,
so that simple parallelism can speed up the code.
Each project might use their own macro, so if you are familiar with the default macros like Py_BEGIN_ALLOW_THREADS from the C Python API, you might find them being redefined. In Numpy for instance it would be NPY_BEGIN_THREADS_DEF, etc.
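When reading the source is impractical, a rough (and imperfect) empirical check is to time the same call sequentially and in two threads: if the call releases the GIL, the threaded run should take roughly the single-call time rather than twice as long. This is only a heuristic sketch; the matrix size is arbitrary, and results can be muddied by BLAS builds that are already multithreaded internally.
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

a = np.random.rand(2000, 2000)

def work():
    return a @ a                      # matrix multiply; typically releases the GIL

start = time.perf_counter()
work()
work()
print("sequential :", time.perf_counter() - start)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(work) for _ in range(2)]
    for f in futures:
        f.result()
print("two threads:", time.perf_counter() - start)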

What happens to threads in python when there is no .join()?

Suppose, we have a multi-thread Python code which looks like this:
import threading
import time

def short_task():
    print('Hey!')

for x in range(10000):
    t = threading.Thread(target=short_task)
    t.daemon = True  # All non-daemon threads will be ".join()"'ed when main thread dies, so we mark this one as daemon
    t.start()

time.sleep(100)
Are there any side-effects from using similar approach in long-running applications (e.g. Django+uwsgi)? Like no garbage collection, extra memory consumption, etc?
What I am trying to do is to do some costly logging (urlopen() to external API url) without blocking the main thread. Spawning infinite new threads with no .join() looks like best possible approach here, but maybe I am wrong?
Not a 100% confident answer, but since nobody else has weighed in...
I can't find any place in the Python documentation that says you must join threads. Python's threading model looks Java-like to me: In Java t.join() means "wait for t to die," but it does not mean anything else. In particular, t.join() does not do anything to thread t.
I'm not an expert, but it looks like the same is true in Python.
Are there any side-effects...Like...extra memory consumption
Every Python thread must have its own, fixed-size call stack, and the threading module documentation says that the minimum size of a stack is 32K bytes. If you create ten thousand of those, like in your code snippet, and if they all manage to exist at the same time, then just the stacks alone are going to occupy 320 megabytes of real memory.
It's unusual to find a good reason for a program to have that many simultaneous threads.
If you're expecting those threads to die so quickly that there's never more than a few of them living at the same time, then you probably could improve the performance of your program by using a thread pool. A thread pool is an object that manages a small number of worker threads and a blocking queue of tasks (i.e., callables). Each worker sits in a loop, picking tasks from the queue and performing them.
A program that uses a thread pool effectively re-uses its worker threads instead of continually letting threads die and creating new ones to replace them.
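Applied to the logging use case in the question, that pattern might look like the sketch below: a handful of daemon workers pull payloads from a bounded queue and make the urlopen call, and the main thread only enqueues. The URL, worker count and queue size are illustrative.
import queue
import threading
from urllib.request import urlopen

log_queue = queue.Queue(maxsize=1000)      # bounded, so the producer cannot outrun the workers forever

def log_worker():
    while True:
        payload = log_queue.get()
        if payload is None:                # sentinel: shut this worker down
            break
        try:
            urlopen("https://logging.example.com/api", data=payload, timeout=5)
        except Exception:
            pass                           # a failed log call must not kill the worker

workers = [threading.Thread(target=log_worker, daemon=True) for _ in range(4)]
for w in workers:
    w.start()

# elsewhere in the application: enqueue instead of spawning a new thread per call
log_queue.put(b"something happened")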

Python multiprocessing on OpenCV images

I need to accelerate some image processing by means of multithreading on a multicore architecture.
My images are OpenCV objects.
I tried approaching the problem with the threading module, but it turns out not to perform true parallelism (!) because of the GIL issue. I have tried with the multiprocessing module, but it turns out that my objects are not shared between processes (!)
My processing needs to work on different sections of the same image simultaneously (so that the Queue paradigm is irrelevant).
What can I do ?
I eventually found a solution here for sharing NumPy arrays when using the multiprocessing module: http://briansimulator.org/sharing-numpy-arrays-between-processes/
The main idea is to flatten the image bitmap as a linear buffer, copy it to a shared array, and map it back to a bitmap (without copy) in the receiving processes.
This is not fully satisfying as multithreading would be more resource-friendly than multiprocessing (it takes a huge amount of memory), but at least it works, with minimal effort.
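A minimal sketch of that flatten / share / map-back idea, using multiprocessing.Array and numpy.frombuffer; the image size and the split into two horizontal bands are illustrative, and the child processes copy no pixel data.
import numpy as np
from multiprocessing import Process, Array

def worker(shared_buf, shape, rows):
    # Map the flat shared buffer back to a 2D view; no copy is made.
    img = np.frombuffer(shared_buf, dtype=np.uint8).reshape(shape)
    r0, r1 = rows
    img[r0:r1] = 255 - img[r0:r1]           # process this horizontal band in place

if __name__ == '__main__':
    h, w = 480, 640
    shared = Array('B', h * w, lock=False)  # flat, unsynchronized byte buffer
    image = np.frombuffer(shared, dtype=np.uint8).reshape(h, w)
    image[:] = 100                          # pretend this came from cv2.imread
    bands = [(0, h // 2), (h // 2, h)]
    procs = [Process(target=worker, args=(shared, (h, w), band)) for band in bands]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(image.mean())                     # both halves were inverted by the children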

Goroutines vs asyncio tasks + thread pool for CPU-bound calls

Are goroutines roughly equivalent to python's asyncio tasks, with an additional feature that any CPU-bound task is routed to a ThreadPoolExecutor instead of being added to the event loop (of course, with the assumption that we use a python interpreter without GIL)?
Is there any substantial difference between the two approaches that I'm missing? Of course, apart from the efficiencies and code clarity that result from the concurrency being an integral part of Go.
I think I know part of the answer. I tried to summarize my understanding of the differences, in order of importance, between asyncio tasks and goroutines:
1) Unlike under asyncio, one rarely needs to worry that their goroutine will block for too long. OTOH, memory sharing across goroutines is akin to memory sharing across threads rather than asyncio tasks since goroutine execution order guarantees are much weaker (even if the hardware has only a single core).
asyncio will only switch context on explicit await, yield and certain event loop methods, while Go runtime may switch on far more subtle triggers (such as certain function calls). So asyncio is perfectly cooperative, while goroutines are only mostly cooperative (and the roadmap suggests they will become even less cooperative over time).
A really tight loop (such as with numeric computation) could still block the Go runtime (well, the thread it's running on). If that happens, it's going to have less of an impact than in Python - unless it occurs in multiple threads.
2) Goroutines have off-the-shelf support for parallel computation, which would require a more sophisticated approach under asyncio.
Go runtime can run threads in parallel (if multiple cores are available), and so it's somewhat similar to running multiple asyncio event loops in a thread pool under a GIL-less python runtime, with a language-aware load balancer in front.
3) Go runtime will automatically handle blocking syscalls in a separate thread; this needs to be done explicitly under asyncio (e.g., using run_in_executor; see the sketch below).
That said, in terms of memory cost, goroutines are very much like asyncio tasks rather than threads.
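For point 3, the explicit asyncio counterpart is roughly the following sketch; blocking_read is a hypothetical stand-in for any blocking, syscall-style function.
import asyncio
import time

def blocking_read():
    time.sleep(1)            # stand-in for a blocking syscall (file, socket, DB driver, ...)
    return "data"

async def main():
    loop = asyncio.get_running_loop()
    # hand the blocking call to the default thread pool so the event loop keeps running
    data = await loop.run_in_executor(None, blocking_read)
    print(data)

asyncio.run(main())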
I suppose you could think of it working that way underneath, sure. It's not really accurate, but close enough.
But there is a big difference: in Go you can write straight line code, and all the I/O blocking is handled for you automatically. You can call Read, then Write, then Read, in simple straight line code. With Python asyncio, as I understand it, you need to queue up a function to handle the reads, rather than just calling Read.
