Optimising python3 multiprocessing when each process requires multiple threads - python-3.x

I have a list with 12 items (each of which is a sample), and I need to analyse each of these samples using a function. The function is a wrapper for an external pipeline which needs four threads to run. Some pseudocode:
import shlex
import subprocess

def my_wrapper(sample, other_arg):
    cmd = 'external_pipeline --threads 4 --sample {0} --other {1}'.format(sample, other_arg)
    # Run the external pipeline and block until it finishes
    subprocess.Popen(shlex.split(cmd), stderr=subprocess.PIPE, stdout=subprocess.PIPE).communicate()
Previously I ran the function for each sample serially in a loop, which worked but is relatively inefficient given that my CPU has 20 cores. Example code:
sample_list = ['sample_' + str(x) for x in range(1, 13)]
for sample in sample_list:
    my_wrapper(sample, other_arg)
I've tried to use multiprocessing to improve efficiency. I've done this successfully in the past using the starmap function from multiprocessing. Example:
import multiprocessing as mp

with mp.Pool(mp.cpu_count()) as pool:
    results = pool.starmap(my_wrapper, [(sample, other_arg) for sample in sample_list])
This approach has worked well previously when the function I'm calling requires only one thread/core per process. However, it doesn't seem to work as I naively expected/hoped in my current circumstance. There are 12 samples, each needing to be analysed with 4 threads, but I only have 20 threads in total. Accordingly, I'd expect/hope for 5 samples to run at a time (5 samples * 4 threads each = 20 threads in total). Instead, all 12 samples appear to be analysed simultaneously, with all 20 threads being used, even though 48 threads would be required for that to be efficient.
How might I efficiently run these samples so that only 5 are run in parallel (with each of these processes/jobs using 4 threads)? Do I need to specify a chunk size, or am I barking up the wrong tree with this thought?
Apologies for the vague title and post content, I wasn't sure how to word any of it better!

Limiting the number of workers in your multiprocessing pool keeps cores free for the threads each wrapper invocation spawns. The pool will do the chunking for you:
with mp.Pool(mp.cpu_count() // num_cores_used_in_wrapper) as pool:  # // : Pool() needs an int
    results = pool.starmap(my_wrapper, [(sample, other_arg) for sample in sample_list])
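Putting that together, here's a minimal runnable sketch of the idea, assuming 4 threads per pipeline run (the external_pipeline command and the other_arg value are placeholders from the question):

import multiprocessing as mp
import shlex
import subprocess

THREADS_PER_JOB = 4  # threads requested by each external_pipeline run

def my_wrapper(sample, other_arg):
    cmd = 'external_pipeline --threads {0} --sample {1} --other {2}'.format(
        THREADS_PER_JOB, sample, other_arg)
    # Run the pipeline and block this worker until it finishes
    return subprocess.Popen(shlex.split(cmd),
                            stderr=subprocess.PIPE,
                            stdout=subprocess.PIPE).communicate()

if __name__ == '__main__':
    sample_list = ['sample_' + str(x) for x in range(1, 13)]
    other_arg = 'placeholder'
    # 20 cores // 4 threads per job = 5 jobs in flight at any one time
    n_workers = mp.cpu_count() // THREADS_PER_JOB
    with mp.Pool(n_workers) as pool:
        results = pool.starmap(my_wrapper,
                               [(sample, other_arg) for sample in sample_list])

Each pool worker blocks in communicate() while its pipeline runs, so at most five pipelines (and therefore at most 20 pipeline threads) are alive at once.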

Related

What is the best concurrency way of doing 10 000 continuous opencv operations simultaneously in Python3?

I used to have a relatively simple Python3 app that read a video streaming source and performed continuous OpenCV and I/O-heavy operations (with files and databases):
cap_vid = cv2.VideoCapture(stream_url)
while True:
    # ...
    # OpenCV operations
    # database I/O operations
    # file I/O operations
    # ...
The app ran smoothly. However, there arose a need to do this not just with 1 channel, but with many, potentially 10 000 channels. Now, let's say I have a list of these channels (stream_urls). If I wrap my usual code inside for stream_url in stream_urls:, it will of course not work, because the iteration will never proceed further than the 0th index. So, the first thing that comes to mind is concurrency.
Now, as much as I know, there are 3 ways of doing concurrent programming in Python3: threading, asyncio, and multiprocessing:
I understand that in the case of multiprocessing the OS creates new instances of the Python interpreter, so there can be at most as many instances running in parallel as the machine has cores, which seldom exceeds 16; however, the number of processes could potentially be up to 10 000. Also, the overhead of multiprocessing outweighs the performance gains once the number of processes grows past a certain point, so this one appears to be useless.
The case of threading seems the easiest in terms of the machinery it uses. I'd just wrap my code in a function and create threads like the following:
import cv2
from threading import Thread

def work_channel(ch_src):
    cap_vid = cv2.VideoCapture(ch_src)
    while True:
        # ...
        # OpenCV operations
        # database I/O operations
        # file I/O operations
        # ...
        pass  # placeholder for the elided work

for stream_url in stream_urls:
    Thread(target=work_channel, args=(stream_url,), daemon=True).start()
But there are a few problems with threading: first, using more than 11-17 threads nullifies any of its favourable effects because of the overhead costs. Also, it's not safe to work with file I/O in threading, which is a very important concern for me.
I don't know how to use asyncio and couldn't find how to do what I want with it.
Given the above scenario, which of the 3 concurrency methods (or others, if there are methods I am unaware of) should I use for the fastest, most accurate (or at least expected) performance? And how do I use that method correctly?
Any help is appreciated.
Spawning more threads than your computer has cores is very much possible.
Could be as simple as:
import threading

import cv2

all_urls = [...]  # your url list

def threadfunction(url):
    cap_vid = cv2.VideoCapture(url)
    while True:
        # ...
        # OpenCV operations
        # file I/O operations
        # ...
        pass  # placeholder for the elided work

for stream_url in all_urls:
    threading.Thread(target=threadfunction, args=(stream_url,)).start()
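If spawning 10 000 raw threads is a worry, a bounded pool from the standard library caps how many channels are serviced at once; a minimal sketch (threadfunction as defined above; the max_workers value is an arbitrary illustration, not a recommendation):

from concurrent.futures import ThreadPoolExecutor

# Because each worker loops forever, max_workers effectively limits how
# many channels are being processed concurrently.
with ThreadPoolExecutor(max_workers=32) as pool:
    for stream_url in all_urls:
        pool.submit(threadfunction, stream_url)

Note that because the workers never return, the with block never exits; that mirrors the original example, where the threads also run forever.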

Python multiprocessing taking the brakes off OSX

I have a program that randomly selects 13 cards from a full pack and analyses the hands for shape, point count and some other features important to the game of bridge. The program will select and analyse 10**7 hands in about 5 minutes. Checking the Activity Monitor shows that during execution the CPU (a 6-core processor) devotes about 9% of its time to the program and is ~90% idle. So it looks like a prime candidate for multiprocessing, and I created a multiprocessing version using a Queue to pass information from each process back to the main program. Having navigated the problems of IDLE not working with multiprocessing (I now run it using PyCharm) and of a join on an unfinished process freezing the program, I got it to work.
However, it doesn't matter whether I use 5, 10, 25 or 50 processes; the result is always the same: the CPU devotes about 18% of its time to the program, is ~75% idle, and the execution time slightly more than doubles to a bit over 10 minutes.
Can anyone explain how I can get the processes to take up more of the CPU time, and how I can get the execution time to reflect this? Below are the relevant sections of the program:
import random
import collections
import datetime
import time
from math import log10
from multiprocessing import Process, Queue

NUM_OF_HANDS = 10**6
NUM_OF_PROCESSES = 25

def analyse_hands(numofhands, q):
    # code removed as not relevant to the problem
    q.put((distribution, points, notrumps))

if __name__ == '__main__':
    processlist = []
    q = Queue()
    handsperprocess = NUM_OF_HANDS // NUM_OF_PROCESSES
    print(handsperprocess)
    # Set up the processes and get them to do their stuff
    start_time = time.time()
    for _ in range(NUM_OF_PROCESSES):
        p = Process(target=analyse_hands, args=(handsperprocess, q))
        processlist.append(p)
        p.start()
    # Allow q to get a few items
    time.sleep(.05)
    while not q.empty():
        while not q.empty():
            pass  # code removed as not relevant to the problem
        # Allow q to be refreshed so allowing all processes to finish before
        # doing a join. It seems that doing a join before a process is
        # finished will cause the program to lock
        time.sleep(.05)
        counter['empty'] += 1
    for p in processlist:
        p.join()
    while not q.empty():
        # This is never executed as all the processes have finished and q
        # was emptied before the join command above.
        pass  # code removed as not relevant to the problem
    finish_time = time.time()
I have no answer for why IDLE will not run a multiprocessing start instruction correctly, but I believe the answer to the doubling of the execution time lies in the type of problem I am dealing with. Perhaps others can comment, but it seems to me that the overhead involved in adding and removing items to and from the Queue is quite high, so performance improvements will be best achieved when the amount of data being passed via the Queue is small compared with the amount of processing required to obtain that data.
In my program I am creating and passing 10**7 items of data, and I suppose it is the overhead of passing this number of items via the Queue that kills any performance improvement from getting the data via separate Processes. Using a map, it seems all 10**7 items of data would need to be stored in the map before any further processing can be done. This might improve performance depending on the overhead of using the map and of dealing with that amount of data, but for the time being I will stick with my original vanilla, single-process code.
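For what it's worth, one way to test that hypothesis is to aggregate inside each process and put a single summary object on the Queue per process, rather than one item per hand. A rough sketch of the shape (the per-hand analysis is a placeholder for the elided code):

import collections
from multiprocessing import Process, Queue

def analyse_hands(numofhands, q):
    distribution = collections.Counter()
    for _ in range(numofhands):
        # ... deal and analyse one hand, updating the local counters ...
        pass
    q.put(distribution)  # a single put per process, not one per hand

if __name__ == '__main__':
    num_processes = 25
    q = Queue()
    processes = [Process(target=analyse_hands, args=(10**6 // num_processes, q))
                 for _ in range(num_processes)]
    for p in processes:
        p.start()
    totals = collections.Counter()
    for _ in processes:
        totals += q.get()  # blocks until a result arrives; no sleep/polling
    for p in processes:
        p.join()  # safe: the queue has already been drained

This also sidesteps the join/empty-polling dance: q.get() blocks until each process's single result arrives, and joining afterwards cannot deadlock on a queue that still holds data.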

Is there a Python pool for individual iteration in loops?

Is there any way to Python pool an iteration itself?
For example, I tried the following:
def x():
    for i in range(10):
        ...
The number of iterations is 10 (0 to 9), can we create a pool which creates 10 separate processes for iteration i=0, i=1, ... i=9?
The language does not have it in that simple form. But I remember once seeing a small package that would provide an iterator for just that. I will try to find it, but I doubt it is still maintained.
Here it is: https://github.com/npryce/python-parallelize
And present in Pypi: https://pypi.python.org/pypi/python-parallelize/1.0.0.0
If it works, it may be a matter of 'unmaintained due to being complete' - just test your stuff - a lot.
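If the package turns out to be stale, the standard library gets the same effect with only a little more ceremony; a minimal sketch with a placeholder loop body:

from multiprocessing import Pool

def body(i):
    # placeholder for the original loop body; must be a top-level
    # (picklable) function so it can be sent to worker processes
    return i * i

if __name__ == '__main__':
    with Pool(processes=10) as pool:         # one worker per iteration value
        results = pool.map(body, range(10))  # runs body(0) .. body(9) in parallel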

Scala - best API for doing work inside multiple threads

In Python, I am using a library called futures, which allows me to do my processing work with a pool of N worker processes, in a succinct and crystal-clear way:
from concurrent import futures  # stdlib in Python 3; the 'futures' backport on Python 2

schedulerQ = []
for ... in ...:
    workParam = ...  # arguments for call to processingFunction(workParam)
    schedulerQ.append(workParam)

with futures.ProcessPoolExecutor(max_workers=5) as executor:  # 5 CPUs
    for retValue in executor.map(processingFunction, schedulerQ):
        print("Received result", retValue)
(The processingFunction is CPU-bound, so there is no point in async machinery here - this is about plain old arithmetic calculations.)
I am now looking for the closest possible way to do the same thing in Scala. Notice that in Python, to avoid the GIL issues, I was using processes (hence the use of ProcessPoolExecutor instead of ThreadPoolExecutor) - and the library automagically marshals the workParam argument to each process instance executing processingFunction(workParam) - and it marshals the result back to the main process, for the executor's map loop to consume.
Does this apply to Scala and the JVM? My processingFunction can, in principle, be executed from threads too (there's no global state at all) - but I'd be interested to see solutions for both multiprocessing and multithreading.
The key part of the question is whether there is anything in the world of the JVM with as clear an API as the Python futures you see above... I think this is one of the best SMP APIs I've ever seen - prepare a list with the function arguments of all invocations, and then just two lines: create the poolExecutor, and map the processing function, getting back your results as soon as they are produced by the workers. Results start coming in as soon as the first invocation of processingFunction returns and keep coming until they are all done - at which point the for loop ends.
You have way less boilerplate than that using parallel collections in Scala.
myParameters.par.map(x => f(x))
will do the trick if you want the default number of threads (same as number of cores).
If you insist on setting the number of workers, you can do so like this:
import scala.collection.parallel._
import scala.concurrent.forkjoin._
val temp = myParameters.par
temp.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(5))
temp.map(x => f(x))
The exact details of return timing are different, but you can put as much machinery as you want into f(x) (i.e. both compute and do something with the result), so this may satisfy your needs.
In general, simply having the results appear as they complete is not enough; you then need to process them, maybe fork them, collect them, etc. If you want to do this in general, Akka Streams (follow links from here) are nearing 1.0 and will facilitate the production of complex graphs of parallel processing.
There is both a Futures API that allows you to run work units on a thread pool (docs: http://docs.scala-lang.org/overviews/core/futures.html) and a parallel collections API that you can use to perform parallel operations on collections: http://docs.scala-lang.org/overviews/parallel-collections/overview.html

overriding default Parallel Collections behavior in scala

I have a large batched parallel computation that I use a parallel map for in Scala. I have noticed that there appears to be a gradual stepping-down of CPU usage as the workers finish. It all comes down to a call inside the Map object:
scala.collection.parallel.thresholdFromSize(length, tasksupport.parallelismLevel)
Looking at the code, I see this:
def thresholdFromSize(sz: Int, parallelismLevel: Int) = {
  val p = parallelismLevel
  if (p > 1) 1 + sz / (8 * p)
  else sz
}
My calculation works great on a large number of cores, and now I understand why:
thresholdFromSize(1000000, 24) = 5209
thresholdFromSize(1000000, 4) = 31251
If I have an array of length 1000000 on 24 CPUs, it will partition all the way down to 5209 elements. If I pass that same array into the parallel collections on my 4-CPU machine, it will stop partitioning at 31251 elements.
It should be noted that the runtime of my calculations is not uniform. Runtime per unit can be as much as 0.1 seconds. At 31251 items, that's roughly 3125 seconds, or 52 minutes, during which the other workers could be stepping in and grabbing work, but are not. I have observed exactly this behavior while monitoring CPU utilization during the parallel computation. Obviously I'd love to run on a large machine, but that's not always possible.
My question is this: Is there any way to influence the parallel collections to give it a smaller threshold number that is more suited to my problem? The only thing I can think of is to make my own implementation of the class 'Map', but that seems like a very non-elegant solution.
You want to read up on Configuring Scala parallel collections. In particular, you probably need to provide a TaskSupport implementation.
I think all you need to do is something like this:
yourCollection.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(24))
The parallelism parameter defaults to the number of CPU cores that you have, but you can override it like above. This is shown in the source for ParIterableLike as well.
0.1 seconds is long enough to handle each unit separately. Wrap the processing of each unit (or each batch of 10 units) in a separate Runnable and submit all of them to a FixedThreadPool. Another approach is to use a ForkJoinPool - then it is easier to control the end of all computations.
