I'm having performance issues with multi-threading.
I have a code snippet that reads 8 MB buffers in parallel:
import copy
import itertools
import threading
import time

# Basic implementation of a thread pool.
# Based on multiprocessing.Pool
class ThreadPool:
    def __init__(self, nb_threads):
        self.nb_threads = nb_threads

    def map(self, fun, iter):
        if self.nb_threads <= 1:
            return map(fun, iter)
        nb_threads = min(self.nb_threads, len(iter))
        # ensure 'iter' does not evaluate lazily
        # (generator or xrange...)
        iter = list(iter)
        # map to results list
        results = [None] * nb_threads
        def wrapper(i):
            def f(args):
                results[i] = map(fun, args)
            return f
        # slice iter in chunks
        chunks = [iter[i::nb_threads] for i in range(nb_threads)]
        # create threads
        threads = [threading.Thread(target=wrapper(i), args=[chunk])
                   for i, chunk in enumerate(chunks)]
        # start and join threads
        [thread.start() for thread in threads]
        [thread.join() for thread in threads]
        # reorder results
        r = list(itertools.chain.from_iterable(map(None, *results)))
        return r

payload = [0] * (1000 * 1000)  # 8 MB
payloads = [copy.deepcopy(payload) for _ in range(40)]

def process(i):
    for i in payloads[i]:
        j = i + 1

if __name__ == '__main__':
    for nb_threads in [1, 2, 4, 8, 20]:
        t = time.time()
        c = time.clock()
        pool = ThreadPool(nb_threads)
        pool.map(process, xrange(40))
        t = time.time() - t
        c = time.clock() - c
        print nb_threads, t, c
Output:
1 1.04805707932 1.05
2 1.45473504066 2.23
4 2.01357698441 3.98
8 1.56527090073 3.66
20 1.9085559845 4.15
Why does the threading module miserably fail at parallelizing mere buffer reads?
Is it because of the GIL? Or is it because of some weird configuration on my machine where one process is allowed only one access to the RAM at a time? (I get a decent speed-up if I swap ThreadPool for multiprocessing.Pool in the code above.)
I'm using CPython 2.7.8 on a Linux distro.
Yes, Python's GIL prevents Python code from running in parallel across multiple threads. You describe your code as doing "buffer reads", but it's really running arbitrary Python code (in this case, iterating over a list and adding 1 to each integer). If your threads were making blocking system calls (like reading from a file, or from a network socket), the GIL would usually be released while the thread blocked waiting on the external data. But since most operations on Python objects can have side effects, you can't do several of them in parallel.
One important reason for this is that CPython's garbage collector uses reference counting as its main way to know when an object can be cleaned up. If several threads tried to update the reference count of the same object at the same time, they could race and leave the object with the wrong count. The GIL prevents that from happening, as only one thread can be making such internal changes at a time. Every time your process code does j = i + 1, it updates the reference counts of the integer objects 0 and 1 a couple of times each. That's exactly the kind of thing the GIL exists to guard against.
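To illustrate the contrast, here is a minimal sketch (not your exact benchmark; it assumes the same CPython 2.7 on Linux, and the helper name burn is illustrative) of comparable CPU-bound work pushed through multiprocessing.Pool. Because each worker is a separate process with its own GIL, wall-clock time should drop as processes are added:

import multiprocessing
import time

def burn(i):
    # CPU-bound Python work, similar in spirit to iterating over one payload
    for x in xrange(1000 * 1000):
        x + 1

if __name__ == '__main__':
    for nb_procs in [1, 2, 4, 8]:
        t = time.time()
        pool = multiprocessing.Pool(nb_procs)
        pool.map(burn, xrange(40))
        pool.close()
        pool.join()
        print nb_procs, time.time() - t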
Related
I want to do a job as fast as possible, so I should parallelize it using processes (not threads, because of the GIL). My problem is that I can't start the processes at the same time; it always starts p1, then when p1 ends, p2, and so on. How can I start all my processes at the same time? My simplified code:
import multiprocessing
import time

if __name__ == '__main__':
    def work(data, num):
        if num == 0:
            time.sleep(5)
        print("starts:", num)
        # ****** heavy work that lasts a random number of seconds ******
        print("ends", num)

    for k in range(0, 2):
        p = multiprocessing.Process(target=work(data, k))
        p.daemon = True
        p.start()
result:
starts 0
ends 0
starts 1
ends 1
starts 2
ends 2
What I expected:
starts 0
starts 1
starts 2
ends 1 or 2
ends 1 or 2
ends 0 (because of time.sleep)
Why does my script always wait until the first process is finished before starting the next one?
First of all, making your program parallel/concurrent does not always make it faster, as Amdahl's law suggests.
Secondly, you need to pass the function itself as target and its arguments through the args parameter. What you are doing now is calling the whole function in the parent process on each loop iteration (blocking on time.sleep(5) before the next one starts), so the Process objects never actually run it. Start all the processes first, then use join() to wait for them, as such:
process_pool = []
for k in range(0, 5):
    p = multiprocessing.Process(target=work, args=('you_data', k))
    p.daemon = True
    process_pool.append(p)

for process in process_pool:
    process.start()

for process in process_pool:
    process.join()
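As an aside (not part of the original answer), the same pattern can be written with multiprocessing.Pool, which handles the start/join bookkeeping for you. A minimal sketch, assuming a work(data, num) function shaped like the one in the question and a placeholder 'your_data' argument:

import multiprocessing
import time

def work(data, num):
    if num == 0:
        time.sleep(5)
    print("starts:", num)
    # placeholder for the heavy work
    print("ends", num)

if __name__ == '__main__':
    with multiprocessing.Pool(processes=3) as pool:
        # starmap unpacks each (data, num) tuple into work(data, num)
        pool.starmap(work, [('your_data', k) for k in range(3)])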
I previously asked Repeatedly run a function in parallel about how to run a function in parallel. The function that I want to run has a stochastic element, where random integers are drawn.
When I use the code in that answer it returns repeated numbers within one process (and also between runs if I add an outer loop to repeat the process). For example,
import numpy as np
from multiprocessing.pool import Pool

def f(_):
    x = np.random.uniform()
    return x*x

if __name__ == "__main__":
    processes = 3
    p = Pool(processes)
    print(p.map(f, range(6)))
returns
[0.8484870744666029, 0.8484870744666029, 0.04019012715175054, 0.04019012715175054, 0.7741414835156634, 0.7741414835156634]
Another run may give
[0.17390735240615365, 0.17390735240615365, 0.5188673758527017, 1.308159884267618e-08, 0.09140498447418667, 0.021537291489524404]
It seems as if there is some internal seed that is being used -- how can I generate random numbers similar to what would be returned from np.random.uniform(size=6) please?
Getting the same output from different workers in multiprocessing indicates that the seed needs to be set inside the function. Python multiprocessing pool.map for multiple arguments provides a way to pass multiple arguments to Pool -- one for the repeats and one for a list of seeds. This allows a new seed for each call, and is reproducible.
import numpy as np
from multiprocessing.pool import Pool

def f(reps, seed):
    np.random.seed(seed)
    x = np.random.uniform()
    return x*x

#np.random.seed(1)

if __name__ == "__main__":
    processes = 3
    p = Pool(processes)
    print(p.starmap(f, zip(range(6), range(6))))
Here the second iterable passed to zip is the vector of seeds. To see its effect, change the line to print(p.starmap(f, zip(range(0,6), np.repeat(1,6)))): with the same seed everywhere, every call returns the same value.
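If your NumPy version has the newer Generator API, an alternative (not from the original answer; the root seed 12345 is arbitrary) is to spawn independent child seeds from one SeedSequence, which avoids hand-picking a seed per task:

import numpy as np
from multiprocessing.pool import Pool

def f(seed):
    # each task builds its own independent Generator from its child seed
    rng = np.random.default_rng(seed)
    x = rng.uniform()
    return x*x

if __name__ == "__main__":
    # one root seed, six statistically independent children
    child_seeds = np.random.SeedSequence(12345).spawn(6)
    with Pool(3) as p:
        print(p.map(f, child_seeds))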
I have the following for loop:
for j in range(len(list_list_int)):
    arr_1_, arr_2_, arr_3_ = foo(bar, list_of_ints[j])
    arr_1[j,:] = arr_1_.data.numpy()
    arr_2[j,:] = arr_2_.data.numpy()
    arr_3[j,:] = arr_3_.data.numpy()
I would like to apply foo with multiprocessing, mainly because it is taking a lot of time to finish. I tried to do it in batches with funcy's chunks method:
for j in chunks(1000, list_list_int):
    arr_1_, arr_2_, arr_3_ = foo(bar, list_of_ints[j])
    arr_1[j,:] = arr_1_.data.numpy()
    arr_2[j,:] = arr_2_.data.numpy()
    arr_3[j,:] = arr_3_.data.numpy()
However, I am getting list object cannot be interpreted as an integer. What is the correct way of applying foo using multiprocessing?
list_list_int = [1,2,3,4,5,6]
for j in chunks(2, list_list_int):
    for i in j:
        avg_, max_, last_ = foo(bar, i)
I don't have chunks installed, but from the docs I suspect that, for size-2 chunks of
alist = [[1,2],[3,4],[5,6],[7,8]]
it produces
j = [[1,2],[3,4]]
j = [[5,6],[7,8]]
which would produce an error:
In [116]: alist[j]
TypeError: list indices must be integers or slices, not list
And if your foo can't work with the full list of lists, I don't see how it will work with that list split into chunks. Apparently it can only work with one sublist at a time.
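For the multiprocessing part, one possible shape is to map a small wrapper over the individual elements and stack the results afterwards. This is only a sketch: it assumes foo, bar and list_of_ints from your question are defined at module level and picklable, and that each returned piece has the .data.numpy() interface you show.

import multiprocessing as mp
from functools import partial

import numpy as np

def run_one(x, bar):
    # apply foo (from the question's code) to a single element,
    # converting the outputs as in the original loop
    a1, a2, a3 = foo(bar, x)
    return a1.data.numpy(), a2.data.numpy(), a3.data.numpy()

if __name__ == '__main__':
    with mp.Pool() as pool:
        rows = pool.map(partial(run_one, bar=bar), list_of_ints)
    # reassemble the per-element rows into the three output arrays
    arr_1 = np.stack([r[0] for r in rows])
    arr_2 = np.stack([r[1] for r in rows])
    arr_3 = np.stack([r[2] for r in rows])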
If you are looking to perform parallel operations on a numpy array, then I would use Dask.
With just a few lines of code, your operation can easily be run on multiple processes, and the highly developed Dask scheduler will balance the load for you. A huge benefit of Dask compared to other parallel libraries like joblib is that it maintains the native NumPy API.
import dask.array as da
# Setting up a random array with dimensions 10K rows and 10 columns
# This data is stored distributed across 10 chunks, and the columns are kept together (1_000, 10)
x = da.random.random((10_000, 10), chunks=(1_000, 10))
x = x.persist() # Allow the entire array to persist in memory to speed up calculation
def foo(x):
    return x / 10
# Using the native numpy function, apply_along_axis, applying foo to each row in the matrix in parallel
result_foo = da.apply_along_axis(foo, 0, x)
# View original contents
x[0:10].compute()
# View sample of results
result_foo = result_foo.compute()
result_foo[0:10]
I'm trying to figure out how to use mpi4py for multiprocessing in Python 3. In the docs I read about ranks and their usage, and how to transfer data from one rank to another, but I could not understand how to implement it. Suppose I have a function that I want to run on one processor: should I write that function under if rank == 0, and another function under if rank == 1, and so on? The syntax is confusing me. Is it like spawning process1 = multiprocessing.Process()?
mpi4py provides bindings of the Message Passing Interface (MPI) standard for Python and allows any Python program to exploit multiple processors.
It supports communication of any picklable Python object:
point-to-point (sends, receives)
collective (broadcasts, scatters, gathers)
An example of a collective scatter:
consider a list holding the values 0 to 79.
import mpi4py
from mpi4py import MPI

mpi4py.rc.initialize = True

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

def processParallel(unitList):
    subList = [i**2 for i in unitList]
    return subList

if comm.rank == 0:
    # Read the data into a list
    elementList = list(range(80))
    # Divide the data into different chunks [small groups]
    mainList = [elementList[x:x+10] for x in range(0, len(elementList), 10)]
else:
    mainList = None

unitList = comm.scatter(mainList, root=0)
comm.Barrier()

processedResult = processParallel(unitList)

# Gather the results back to rank 0
final_detected_data = comm.gather(processedResult, root=0)
if comm.rank == 0:
    print(final_detected_data)
Run the code as: mpiexec -n <number of processes> python <script>, e.g. mpiexec -n 8 python sample.py
Compare the time taken to process the entire list
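To address the rank-based branching in the question: every rank runs the same script, and you branch on the rank to decide what each process does. A minimal point-to-point sketch (the payload and tag here are arbitrary; launch it the same way, e.g. mpiexec -n 2 python script.py):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # rank 0 builds some data and sends it to rank 1
    data = {'numbers': list(range(10))}
    comm.send(data, dest=1, tag=11)
    print("rank 0 sent:", data)
elif rank == 1:
    # rank 1 receives the data and does its own work on it
    data = comm.recv(source=0, tag=11)
    print("rank 1 received:", data)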
I have a multiprocessing.manager.Array object that will be shared by multiple workers to tally observed events: each element in the array holds the tally of a different event type. Incrementing a tally requires both read and write operations, so I believe that to avoid race conditions, each worker needs to request a lock that covers both stages, e.g.
with lock:
    my_array[event_type_index] += 1
My intuition is that it should be possible to place a lock on a specific array element. With that type of lock, worker #1 could increment element 1 at the same time that worker #2 is incrementing element 2. This would be especially helpful for my application (n-gram counting), where the array length is quite large and collisions would be rare.
However, I can't figure out how to request an element-wise lock for an array. Does such a thing exist in multiprocessing, or is there a workaround?
For more context, I've included my current implementation below:
import multiprocessing as mp
import string
from queue import Empty

def count_ngrams_in_sentence(n, ngram_counts, char_to_idx_dict, sentence_queue, lock):
    while True:
        try:
            my_sentence_str = sentence_queue.get_nowait()
            my_sentence_indices = [char_to_idx_dict[i] for i in my_sentence_str]
            my_n = n.value
            for i in range(len(my_sentence_indices) - my_n + 1):
                my_index = int(sum([my_sentence_indices[i+j]*(27**(my_n - j - 1))
                                    for j in range(my_n)]))
                with lock:  # lock the whole array?
                    ngram_counts[my_index] += 1
            sentence_queue.task_done()
        except Empty:
            break
    return

if __name__ == '__main__':
    n = 4
    num_ngrams = 27**n
    num_workers = 2
    sentences = [ ... list of sentences in lowercase ASCII + spaces ... ]

    manager = mp.Manager()
    sentence_queue = manager.JoinableQueue()
    for sentence in sentences:
        sentence_queue.put(sentence)

    n = manager.Value('i', value=n, lock=False)
    char_to_idx_dict = manager.dict([(i, ord(i)-97) for i in string.ascii_lowercase] + [(' ', 26)],
                                    lock=False)
    lock = manager.Lock()
    ngram_counts = manager.Array('l', [0]*num_ngrams, lock=lock)

    workers = [mp.Process(target=count_ngrams_in_sentence,
                          args=[n,
                                ngram_counts,
                                char_to_idx_dict,
                                sentence_queue,
                                lock]) for i in range(num_workers)]

    for worker in workers:
        worker.start()

    sentence_queue.join()
multiprocessing.Manager().Array comes with a built-in lock, so you'd have to switch to a RawArray.
Have a list of locks. Before modifying an index, acquire the lock that covers it, then release it:
locks[i].acquire()
array[i,:]=0
locks[i].release()
As I said, if the array is a multiprocessing.RawArray or similar, multiple processes can read or write to it simultaneously. For some array types, reading or writing an element is inherently atomic - the lock is essentially built in. Carefully research this before proceeding.
As for performance, indexing into a list takes on the order of nanoseconds in Python, and acquiring and releasing a lock on the order of microseconds. It's not a huge issue.
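A minimal sketch of the list-of-locks idea applied to a shared counter array (the shard count, worker split, and names here are illustrative, not from the question's code): each index is guarded by the lock of its shard, so two workers only contend when they hit the same shard.

import multiprocessing as mp

NUM_SHARDS = 64  # illustrative; more shards means fewer collisions

def increment(counts, locks, index):
    # pick the lock guarding this index's shard, then do the read-modify-write under it
    with locks[index % NUM_SHARDS]:
        counts[index] += 1

def worker(counts, locks, indices):
    for idx in indices:
        increment(counts, locks, idx)

if __name__ == '__main__':
    num_counts = 27 ** 4                    # as in the question
    counts = mp.RawArray('l', num_counts)   # zero-initialised, no built-in lock
    locks = [mp.Lock() for _ in range(NUM_SHARDS)]
    # two workers covering disjoint index sets, purely for demonstration
    jobs = [mp.Process(target=worker, args=(counts, locks, range(i, num_counts, 2)))
            for i in range(2)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()
    print(sum(counts))  # each index was incremented exactly once -> 27**4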