I am doing heavy image processing in Python 3 on a large batch of images using NumPy and OpenCV. I know Python has the GIL, which prevents two threads from running Python code concurrently. A quick Google search told me not to use threads in Python for CPU-intensive tasks, and to use them only for I/O, saving files to disk, database communication, etc. I also read that the GIL is released when working with C extensions. Since NumPy and OpenCV are C and C++ extensions, I have a feeling that the GIL might be released. I am not sure about it, because image processing is a CPU-intensive task. Is my intuition correct, or am I better off using multiprocessing?
To answer up front: it depends on the functions you use.
The most reliable way to verify whether a function releases the GIL is to check the corresponding source. Checking the documentation also helps, but this behavior is often simply not documented. And yes, it is cumbersome.
http://scipy-cookbook.readthedocs.io/items/Multithreading.html
[...] numpy code often releases the GIL while it is calculating,
so that simple parallelism can speed up the code.
Each project might use its own macros, so if you are familiar with the default macros like Py_BEGIN_ALLOW_THREADS from the CPython C API, you might find them being redefined. In NumPy, for instance, it would be NPY_BEGIN_THREADS_DEF, etc.
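A rough way to check this empirically is to time a heavy NumPy call once on its own and then the same call in several threads: if the threaded run takes barely longer than the single call, the GIL is being released during the computation. A minimal sketch (array size and thread count are arbitrary, and a multithreaded BLAS can blur the comparison):

    import threading
    import time
    import numpy as np

    def crunch():
        # Large float matrix products release the GIL while BLAS works,
        # so several of these calls can run truly in parallel.
        a = np.random.rand(2000, 2000)
        np.dot(a, a)

    # Time one call running alone.
    start = time.perf_counter()
    crunch()
    print("one call:     %.2fs" % (time.perf_counter() - start))

    # Time four calls running in four threads.
    threads = [threading.Thread(target=crunch) for _ in range(4)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("four threads: %.2fs" % (time.perf_counter() - start))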
Related
I need to accelerate some image processing by means of multithreading on a multicore architecture.
My images are OpenCV objects.
I tried approaching the problem with the threading module, but it turns out not to provide true parallelism (!) because of the GIL. I then tried the multiprocessing module, but it turns out that my objects are not shared between processes (!)
My processing needs to work on different sections of the same image simultaneously (so that the Queue paradigm is irrelevant).
What can I do ?
I eventually found a solution for sharing NumPy arrays when using the multiprocessing module: http://briansimulator.org/sharing-numpy-arrays-between-processes/
The main idea is to flatten the image bitmap as a linear buffer, copy it to a shared array, and map it back to a bitmap (without copy) in the receiving processes.
This is not fully satisfying, as multithreading would be more resource-friendly than multiprocessing (which takes a huge amount of memory), but at least it works, with minimal effort.
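For illustration, a minimal sketch of that idea using only the standard library: the image is copied once into a flat shared buffer, and each worker process maps it back to a 2-D array without copying (the image size and the halving step stand in for real data and real processing):

    import ctypes
    import multiprocessing as mp
    import numpy as np

    H, W = 480, 640  # placeholder image size

    def worker(shared, y0, y1):
        # Map the shared buffer back to a 2-D view -- no copy is made.
        img = np.frombuffer(shared, dtype=np.uint8).reshape(H, W)
        # Each process works on its own horizontal band of the image.
        img[y0:y1] //= 2  # placeholder "processing"

    if __name__ == "__main__":
        image = np.random.randint(0, 256, (H, W), dtype=np.uint8)

        # Flatten the bitmap and copy it once into a shared array.
        shared = mp.RawArray(ctypes.c_uint8, image.size)
        np.frombuffer(shared, dtype=np.uint8)[:] = image.ravel()

        bands = [(0, H // 2), (H // 2, H)]
        procs = [mp.Process(target=worker, args=(shared, y0, y1))
                 for y0, y1 in bands]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

        result = np.frombuffer(shared, dtype=np.uint8).reshape(H, W)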
I need to prevent some programs from running in parallel (last time I checked there was a RUST_THREADS=1 option, but I can't find any docs for it) because they screw up the display.
I'm running the program through cargo, but I'd prefer a way that also works for the compiled executable itself (i.e. so that instead of cargo run I could use ./myprogram).
By default Rust uses native threads, and you can't just disable them. Doing so wouldn't make sense and would almost certainly break a multithreaded program completely: native thread scheduling is done by the OS, so a hypothetical "disabling" of native threading would leave every program with only its main thread.
It is possible to disable green threading, because green threads rely on a user-land scheduler, usually implemented in the language runtime, which can usually be configured to use a single OS thread; this is the default mode of operation of Go, for example. But Rust has not used green threads by default for a long time. Rust switched to native threading because it is closer to the underlying OS, more efficient, and much easier to implement and support, so RUST_THREADS does not work anymore (if it ever worked).
I have a Django application which relies heavily on threading, and I'm noticing no performance increase no matter how many processes or threads I add to the WSGIDaemonProcess directive.
I can't find a yes/no answer out there, and I'm wondering: could it be that mod_wsgi is using the same interpreter for each request, so that I'm running into a bottleneck due to the GIL?
If so, would you recommend something else that would help me work around this limitation?
For a typical configuration, yes, all requests would be handled in the same sub-interpreter.
Even if they were handled in different sub-interpreters of the same process, you would still be affected by the GIL.
Post your actual mod_wsgi configuration to confirm you have set things up right.
Consider trying New Relic to find out where the real bottlenecks are.
Watch my PyCon US 2012 talk on finding bottlenecks.
Short answer:
No.
Long answer:
This ability to make good use of more than one processor, even when using multithreading, is further enhanced by the fact that Apache uses multiple processes for handling requests and not just a single process. Thus, even when there is some contention for the GIL within a specific process, it doesn't stop other processes from being able to run, as the GIL is only local to a process and does not extend across processes.
Citation: https://code.google.com/p/modwsgi/wiki/ProcessesAndThreading
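For example, a daemon-mode configuration along these lines spreads requests across several processes, so the GIL in any single process matters less; the name myapp and the paths are placeholders:

    WSGIDaemonProcess myapp processes=4 threads=5
    WSGIProcessGroup myapp
    WSGIApplicationGroup %{GLOBAL}
    WSGIScriptAlias / /path/to/myapp/wsgi.py

The WSGIApplicationGroup %{GLOBAL} line also forces the application to run in the main interpreter of each process rather than a sub-interpreter.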
You haven't given enough information for anybody to recommend how to improve performance, but if you've actually written a thread-heavy program in Python, that's your first mistake.
Rather than running your program on CPython, maybe you should try Jython or IronPython. But then it wouldn't work with mod_wsgi, so we really need more details to understand what you're trying to do...
I haven't been able to write a program in Lua that will load more than one CPU. Since Lua supports the concept via coroutines, I believe it's achievable.
The reason I'm failing must be one of:
It's not possible in Lua
I'm not able to write it ☺ (and I hope that's the case)
Can someone more experienced (I discovered Lua two weeks ago) point me in the right direction?
The point is to write a number-crunching script that puts a high load on ALL cores, to demonstrate the power of Lua.
Thanks...
Lua coroutines are not the same thing as threads in the operating system sense.
OS threads are preemptive. That means that they will run at arbitrary times, stealing timeslices as dictated by the OS. They will run on different processors if they are available. And they can run at the same time where possible.
Lua coroutines do not do this. Coroutines may have the type "thread", but there can only ever be a single coroutine active at once. A coroutine will run until the coroutine itself decides to stop running by issuing a coroutine.yield command. And once it yields, it will not run again until another routine issues a coroutine.resume command to that particular coroutine.
Lua coroutines provide cooperative multithreading, which is why they are called coroutines. They cooperate with each other. Only one thing runs at a time, and you only switch tasks when the tasks explicitly say to do so.
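For what it's worth, the same cooperative model can be sketched in Python with generators, which behave much like Lua coroutines in this respect: nothing runs until it is explicitly resumed, and only one task is ever running at a time:

    def task(name, steps):
        for i in range(steps):
            print(name, "step", i)
            yield  # like coroutine.yield: give up control voluntarily

    # A trivial round-robin scheduler: "resume" each live task in turn.
    tasks = [task("A", 2), task("B", 3)]
    while tasks:
        t = tasks.pop(0)
        try:
            next(t)          # like coroutine.resume
            tasks.append(t)  # still alive: schedule it again
        except StopIteration:
            pass             # this task has finished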
You might think that you could just create OS threads, create some coroutines in Lua, and then just resume each one in a different OS thread. This would work so long as each OS thread was executing code in a different Lua instance. The Lua API is reentrant; you are allowed to call into it from different OS threads, but only if you are calling from different Lua instances. If you try to multithread through the same Lua instance, Lua will likely do unpleasant things.
All of the Lua threading modules that exist create alternate Lua instances for each thread. lua-llthreads just makes an entirely new Lua instance for each thread; there is no API for thread-to-thread communication outside of copying parameters passed to the new thread. LuaLanes does provide some cross-connecting code.
It is not possible with the core Lua libraries (if you don't count creating multiple processes and communicating via input/output), but I think there are Lua bindings for different threading libraries out there.
The answer from jpjacobs to one of the related questions links to LuaLanes, which seems to be a multi-threading library. (I have no experience, though.)
If you embed Lua in an application, you will usually want to have the multithreading somehow linked to your application's multithreading.
In addition to LuaLanes, take a look at llthreads
In addition to already suggested LuaLanes, llthreads and other stuff mentioned here, there is a simpler way.
If you're on a POSIX system, try doing it the old-fashioned way with posix.fork() (from luaposix). You know: split the task into batches, fork the same number of processes as the number of cores, crunch the numbers, collate the results.
Also, make sure that you're using LuaJIT 2 to get the max speed.
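For illustration, here is that split/fork/collate pattern sketched in Python with the standard multiprocessing module (the crunch function is a placeholder for real number-crunching); the Lua version with posix.fork() follows the same shape:

    import multiprocessing as mp

    def crunch(batch):
        # Placeholder work: sum of squares over one batch.
        return sum(x * x for x in batch)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        n = mp.cpu_count()  # one worker per core

        # Split the task into one batch per core.
        batches = [data[i::n] for i in range(n)]

        # Fork the workers, crunch the numbers, collate the results.
        with mp.Pool(n) as pool:
            total = sum(pool.map(crunch, batches))
        print(total)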
It's very easy: just create multiple Lua interpreters and run Lua programs inside each of them.
Lua multithreading is a shared-nothing model. If you need to exchange data, you must serialize it into strings and pass it from one interpreter to the other with a C extension, sockets, or any other kind of IPC.
Serializing data via IPC-like transport mechanisms is not the only way to share data across threads.
If you're programming in an object-oriented language like C++, it's quite possible for multiple threads to access shared objects via object pointers; it's just not safe to do so unless you provide some kind of guarantee that no two threads will attempt to read and write the same data simultaneously.
There are many options for how you might do that, lock-free and wait-free mechanisms are becoming increasingly popular.
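As a minimal illustration of the simplest such guarantee, a mutual-exclusion lock around every access to the shared data (sketched in Python; the counter stands in for any shared object):

    import threading

    counter = 0              # shared data
    lock = threading.Lock()  # guards every access to counter

    def bump(times):
        global counter
        for _ in range(times):
            with lock:       # only one thread may hold the lock at a time
                counter += 1

    threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)  # always 400000; without the lock it could be less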
Is there any threading library which can parse through code, find blocks that can be threaded, and add the required threading instructions accordingly?
Also, I want to compare the performance of a multithreaded program with its single-threaded version. For this I would need to monitor the CPU usage (how much each processor is being used). Is there any tool available to do this?
I'd say the decision whether or not a given block of code can be rewritten to be multi-threaded is way too hard for an automated process to make. To make matters worse, multi-threaded code typically accesses resources outside its own scope, such as pulling data over the network, loading large files, waiting for events, executing database queries, etc.; without detailed information about all these external factors, it is impossible to decide where to go multithreaded, simply because not all the required information is in the code.
Also, a lot of code that is multi-threadable in theory will not run faster when multi-threaded, and may in fact slow down.
Some compilers (such as recent versions of the Intel compiler and gcc) can automatically parallelize simple loops, but anything beyond that is too complex. On the other hand, there are task libraries that use thread pools, and will automatically scale the number of threads to the available processors, and divide the work between them. Of course, using such a library will require rewriting your code to do so.
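As one example of such a task library, Python's standard concurrent.futures sizes a process pool to the available processors by default and divides the submitted tasks between them (a sketch; the work function is a placeholder):

    import os
    from concurrent.futures import ProcessPoolExecutor, as_completed

    def work(n):
        # Placeholder CPU-bound task: count primes below n by trial division.
        return sum(all(p % d for d in range(2, int(p ** 0.5) + 1))
                   for p in range(2, n))

    if __name__ == "__main__":
        print("processors:", os.cpu_count())
        # With no argument, the pool creates one worker per processor.
        with ProcessPoolExecutor() as pool:
            futures = [pool.submit(work, 50_000) for _ in range(8)]
            for f in as_completed(futures):
                print(f.result())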
Structuring your application to make best use of multithreading is not a simple matter, and requires careful thought about which parts of your application can best make use of it. This is not something that can be automated.
Consider multi-threading as an approach to make full use of the available resources; that is when it works best. In an application with multiple modules or areas that are multi-threadable, making all of them multi-threaded can substantially drain the available resources, which can at times be detrimental to the application itself. Thus, multi-threading has to be used very carefully.
As Chris mentioned, there are a lot of profilers that do profiling for a given combination of OS and language.
The first thing you need to do is profile your code in a single thread and see if the areas you think are good candidates for multithreading are actually a problem. It's easy to waste a lot of time multithreading working code only to end up with a buggy mess that's slower than the original implementation if you don't carefully consider the problem first.
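As for watching how much each processor is being used while your program runs, the usual OS tools (top, htop, the Windows Task Manager's per-core view) are often enough; if you want it from within Python, a small sketch with the third-party psutil package (assuming it is installed):

    import psutil

    # Sample per-core utilisation once a second: one percentage per core.
    for _ in range(10):
        print(psutil.cpu_percent(interval=1.0, percpu=True))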