What is the usage of MPI in python3? - python-3.x

I'm trying to work out how to use mpi4py for multiprocessing in Python 3. In the docs I read about ranks, how they are used, and how to transfer data from one rank to another, but I could not understand how to implement it. Suppose I have a function that I want to run on one processor: should I write that function under if rank == 0, another function under if rank == 1, and so on? The syntax is confusing me. Is it like spawning process1 = multiprocessing.Process()?

mpi4py provides bindings of the Message Passing Interface (MPI) standard for Python and allows any Python program to exploit multiple processors.
It supports communication of any picklable Python object, both
point-to-point (sends, receives) and
collective (broadcasts, scatters, gathers).
As an example of a collective scatter,
consider a list holding the values 0 to 79:
import mpi4py
mpi4py.rc.initialize = True  # must be set before importing MPI (True is the default)
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

def processParallel(unitList):
    subList = [i**2 for i in unitList]
    return subList

if comm.rank == 0:
    # Read the data into a list
    elementList = list(range(80))
    # Divide the data into chunks (small groups); the number of chunks
    # must equal the number of MPI processes
    mainList = [elementList[x:x+10] for x in range(0, len(elementList), 10)]
else:
    mainList = None

# Each rank receives one chunk
unitList = comm.scatter(mainList, root=0)
comm.Barrier()

# Every rank processes its own chunk
processedResult = processParallel(unitList)

# Gather the results back on rank 0
final_detected_data = comm.gather(processedResult, root=0)
if comm.rank == 0:
    print(final_detected_data)
Run the code as: mpiexec -n <number of processes> python sample.py, e.g. mpiexec -n 8 python sample.py (eight processes, one per chunk).
Compare the time taken to process the entire list against a serial run.
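The same communicator also supports the point-to-point sends and receives mentioned above. A minimal sketch (the tag value, the message contents and the file name p2p.py are arbitrary choices for illustration, not taken from the example above):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {'numbers': list(range(5))}  # any picklable object
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    print("rank 1 received", data)

Run it with at least two processes, e.g. mpiexec -n 2 python p2p.py.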

Related

How do I run two Python functions simultaneously and stop the other when any one of them returns a value?

I want to write two python functions that request data from two different APIs (with similar data). Both the functions should start simultaneously and as soon as one of the functions (say fun1()) returns the result, the other function (say fun2()) is terminated and the rest of the program is carried out with the result obtained from fun1().
For example,
from datetime import datetime

def fun1(n):  # Obtaining data from source 1
    sums = -1
    if n > -1:
        sums = sum([x*x for x in range(n)])
    return sums

def fun2(n):  # Obtaining data from source 2
    sums = -1
    if n > -1:
        sums = (n*(2*n+1)*(n+1))/6
    return int(sums)

if __name__ == '__main__':
    no = 999999
    print(fun1(no))  # Function call 1
    print(fun2(no))  # Function call 2
So I want to run both fun1() and fun2() simultaneously, so that as soon as one of them returns a value, the program stops the other function and prints the value from the function that finished first.
I have explored multiprocessing in Python, which offers parallel processing, but I am also looking for a way to stop the other function as soon as one of them returns a value.
Note: I'm using Python 3.10.6 on Ubuntu. The program above is just a dummy program to represent the problem.
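A minimal sketch of one way to do this with multiprocessing (assuming fun1 and fun2 are defined at module level as in the example above): run each function in its own process, have both report to a shared queue, take whichever result arrives first, and terminate the remaining process.

import multiprocessing as mp

def report(func, n, queue):
    # Hypothetical wrapper: each worker puts its result on the shared queue when done.
    queue.put(func(n))

if __name__ == '__main__':
    no = 999999
    queue = mp.Queue()
    procs = [mp.Process(target=report, args=(f, no, queue)) for f in (fun1, fun2)]
    for p in procs:
        p.start()
    first_result = queue.get()  # blocks until the fastest function finishes
    for p in procs:
        p.terminate()           # stop the slower worker
        p.join()
    print(first_result)

Terminating a process this abruptly is only safe if the losing function holds no resources that need cleanup; otherwise a cooperative cancellation flag is the more careful design.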

Print statement does not get executed when MPI Barrier introduced

I am using python and mpi4py, and have encountered a scenario I do not understand. The below code is a minimal working example mwe.py.
import itertools
import sys
import time

import numpy as np
from mpi4py import MPI

N = 8
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
N_sub = comm.Get_size() - 1
get_sub = itertools.cycle(range(1, N_sub + 1))

if rank == 0:
    print("I am rank {}".format(rank))
    data = []
    for i in range(N):
        nums = np.random.normal(size=6)
        data.append(nums)
    sleep_short = 0.0001
    sleep_long = 0.1
    sleep_dict = {}
    for r in range(1, N_sub + 1):
        sleep_dict[str(r)] = 1
    data_idx = 0
    while len(data) > data_idx:
        r = next(get_sub)
        if comm.iprobe(r):
            useless_info = comm.recv(source=r)
            print(useless_info)
            comm.send(data[data_idx], dest=r)
            data_idx += 1
            print("data_idx {}".format(data_idx))
            sleep_dict[str(r)] = 1
        else:
            sleep_dict[str(r)] = 0
        if all(value == 0 for value in sleep_dict.values()):
            time.sleep(sleep_long)
        else:
            time.sleep(sleep_short)
    for r in range(1, N_sub + 1):
        comm.send('Done', dest=r)
else:
    print("rank {}".format(rank))
    ######################
    # vvv This is the statement in question
    ######################
    comm.Barrier()
    while True:
        comm.send("I am done", dest=0)
        model = comm.recv(source=0)
        if type(model).__name__ == 'str':
            break

MPI.Finalize()
sys.exit(0)
When run with mpirun -np 4 python mwe.py, this code builds a list of arrays of random numbers and then distributes these arrays to the "sub" ranks until all of them have been sent. Understandably, if I insert a comm.Barrier() call near the bottom (where I have indicated in the code), the code no longer completes execution, as the sub ranks (not equal to 0) never get to the statement where they are to receive what is being sent from rank 0. Rank 0 keeps trying to find a rank to pass the array to, but never does since the other ranks are held up, and the code hangs.
This makes sense to me. What doesn't make sense to me is that with the comm.Barrier() statement included, the preceding print statement also does not execute. Based on my understanding, the sub ranks should proceed normally until they hit the barrier statement, and then wait there until all of the ranks 'catch up', which in this case never happens because rank 0 is in its own loop. If this is the case, the preceding print statement should be executed, as it comes before those ranks have gotten to the barrier line. So why does the statement not get printed? Can anyone explain where my understanding fails?
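One possibility worth ruling out (an assumption about the runtime environment, not something stated in the question) is stdout buffering under mpirun: the print may well execute, but its output can sit in a stdio buffer that is never flushed because the process hangs and never exits normally. Forcing a flush makes that distinction visible:

else:
    # Flush immediately so the output cannot be held in a buffer
    # while this rank waits in the barrier.
    print("rank {}".format(rank), flush=True)
    comm.Barrier()

Running the interpreter unbuffered, e.g. mpirun -np 4 python -u mwe.py, is another way to check the same thing.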

(itertools.combinations) Python shell hangs above a certain input size and throws a memory error

I am trying to run a specific piece of code that finds the sum of all possible combinations of a list read from a .in file. The same code runs perfectly with relatively small files, but with bigger files it hangs and after a while throws a MemoryError.
import itertools

file = open("c_medium.in", "r")
if file.mode == 'r':
    content = file.readlines()
maxSlices, numberOfPizza = map(int, content[0].split())
numberOfSlices = tuple(map(int, content[1].split()))
print(maxSlices)
print(numberOfSlices)

sol = []
sumOfSlices = []
for x in range(1, len(numberOfSlices) + 1):
    print(x)
    for y in itertools.combinations(numberOfSlices, x):
        if sum(y) <= maxSlices:
            sumOfSlices.append(sum(y))

sumOfSlices.sort()
print(sumOfSlices)
checkSum = sumOfSlices[len(sumOfSlices) - 1]
print(checkSum)

found = False
if found == False:
    for x in range(1, len(numberOfSlices) + 1):
        print(x)
        for y in itertools.combinations(numberOfSlices, x):
            if found == False:
                if sum(y) == checkSum:
                    for z in y:
                        sol.append(numberOfSlices.index(z))
                    found = True

solution = tuple(map(str, sol))
print(solution)
The number of combinations of N elements grows very, very fast with N.
Regarding your code in particular, if sum(y) <= maxSlices is always true, then you'll generate a list with 2^len(numberOfSlices) elements, i.e. more than four billion entries once len(numberOfSlices) reaches 32.
I'd recommend trying to solve your task without explicitly building a list. If you describe what your code is doing, maybe someone can help.
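As a minimal sketch of that idea (reusing the maxSlices and numberOfSlices variables from the question): keep only the best sum seen so far and the combination that achieves it, instead of storing every sum.

import itertools

def best_combination(numberOfSlices, maxSlices):
    # Track only the running best; never materialize the list of all sums.
    best_sum = 0
    best_combo = ()
    for x in range(1, len(numberOfSlices) + 1):
        for y in itertools.combinations(numberOfSlices, x):
            s = sum(y)
            if best_sum < s <= maxSlices:
                best_sum = s
                best_combo = y
                if best_sum == maxSlices:  # cannot do better than the target
                    return best_sum, best_combo
    return best_sum, best_combo

This still enumerates every combination, so the running time can still explode, but it removes the memory blow-up that comes from appending every sum to a list.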

How to accelerate the application of the following for loop and function?

I have the following for loop:
for j in range(len(list_list_int)):
    arr_1_, arr_2_, arr_3_ = foo(bar, list_of_ints[j])
    arr_1[j, :] = arr_1_.data.numpy()
    arr_2[j, :] = arr_2_.data.numpy()
    arr_3[j, :] = arr_3_.data.numpy()
I would like to apply foo with multiprocessing, mainly because it is taking a lot of time to finish. I tried to do it in batches with funcy's chunks method:
for j in chunks(1000, list_list_int):
    arr_1_, arr_2_, arr_3_ = foo(bar, list_of_ints[j])
    arr_1[j, :] = arr_1_.data.numpy()
    arr_2[j, :] = arr_2_.data.numpy()
    arr_3[j, :] = arr_3_.data.numpy()
However, I am getting "'list' object cannot be interpreted as an integer". What is the correct way of applying foo using multiprocessing?
list_list_int = [1, 2, 3, 4, 5, 6]
for j in chunks(2, list_list_int):
    for i in j:
        avg_, max_, last_ = foo(bar, i)
I don't have chunks installed, but from the docs I suspect that, for size-2 chunks of
alist = [[1,2],[3,4],[5,6],[7,8]]
it produces
j = [[1,2],[3,4]]
j = [[5,6],[7,8]]
which would produce an error:
In [116]: alist[j]
TypeError: list indices must be integers or slices, not list
And if your foo can't work with the full list of lists, I don't see how it will work with that list split into chunks. Apparently it can only work with one sublist at a time.
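If foo really only handles one element at a time, a minimal multiprocessing sketch would look like the following (the work helper is hypothetical; it assumes foo, bar and list_of_ints are defined at module level and picklable):

import multiprocessing as mp

import numpy as np

def work(args):
    # Hypothetical per-item worker: unpack one element, run foo on it,
    # and convert the outputs to plain NumPy arrays before returning.
    bar, item = args
    arr_1_, arr_2_, arr_3_ = foo(bar, item)
    return arr_1_.data.numpy(), arr_2_.data.numpy(), arr_3_.data.numpy()

if __name__ == '__main__':
    with mp.Pool() as pool:
        rows = pool.map(work, [(bar, item) for item in list_of_ints])
    arr_1 = np.stack([r[0] for r in rows])
    arr_2 = np.stack([r[1] for r in rows])
    arr_3 = np.stack([r[2] for r in rows])

Doing the .numpy() conversion inside the worker means only plain NumPy arrays cross the process boundary, which avoids pickling whatever framework objects foo returns.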
If you are looking to perform parallel operations on a NumPy array, then I would use Dask.
With just a few lines of code, your operation can be run across multiple processes, and the highly developed Dask scheduler will balance the load for you. A huge benefit of Dask compared to other parallel libraries like joblib is that it keeps the native NumPy API.
import dask.array as da

# Set up a random array with 10,000 rows and 10 columns.
# The data is stored across 10 chunks of shape (1_000, 10), keeping the columns together.
x = da.random.random((10_000, 10), chunks=(1_000, 10))
x = x.persist()  # Keep the entire array in memory to speed up the calculation

def foo(x):
    return x / 10

# da.apply_along_axis mirrors NumPy's apply_along_axis and applies foo
# along axis 0 of the chunked array in parallel
result_foo = da.apply_along_axis(foo, 0, x)

# View the original contents
x[0:10].compute()

# View a sample of the results
result_foo = result_foo.compute()
result_foo[0:10]

Python fails to parallelize buffer reads

I'm having performance issues with multi-threading.
I have a code snippet that reads 8MB buffers in parallel:
import copy
import itertools
import threading
import time

# Basic implementation of a thread pool.
# Based on multiprocessing.Pool
class ThreadPool:
    def __init__(self, nb_threads):
        self.nb_threads = nb_threads

    def map(self, fun, iter):
        if self.nb_threads <= 1:
            return map(fun, iter)
        nb_threads = min(self.nb_threads, len(iter))
        # ensure 'iter' does not evaluate lazily
        # (generator or xrange...)
        iter = list(iter)
        # map to results list
        results = [None] * nb_threads
        def wrapper(i):
            def f(args):
                results[i] = map(fun, args)
            return f
        # slice iter in chunks
        chunks = [iter[i::nb_threads] for i in range(nb_threads)]
        # create threads
        threads = [threading.Thread(target=wrapper(i), args=[chunk])
                   for i, chunk in enumerate(chunks)]
        # start and join threads
        [thread.start() for thread in threads]
        [thread.join() for thread in threads]
        # reorder results
        r = list(itertools.chain.from_iterable(map(None, *results)))
        return r

payload = [0] * (1000 * 1000)  # 8 MB
payloads = [copy.deepcopy(payload) for _ in range(40)]

def process(i):
    for i in payloads[i]:
        j = i + 1

if __name__ == '__main__':
    for nb_threads in [1, 2, 4, 8, 20]:
        t = time.time()
        c = time.clock()
        pool = ThreadPool(nb_threads)
        pool.map(process, xrange(40))
        t = time.time() - t
        c = time.clock() - c
        print nb_threads, t, c
Output:
1 1.04805707932 1.05
2 1.45473504066 2.23
4 2.01357698441 3.98
8 1.56527090073 3.66
20 1.9085559845 4.15
Why does the threading module miserably fail at parallelizing mere buffer reads?
Is it because of the GIL? Or is it because of some weird configuration on my machine where one process
is allowed only one access to the RAM at a time? (I get a decent speed-up if I swap ThreadPool for multiprocessing.Pool in the code above.)
I'm using CPython 2.7.8 on a linux distro.
Yes, Python's GIL prevents Python code from running in parallel across multiple threads. You describe your code as doing "buffer reads", but it's really running arbitrary Python code (in this case, iterating over a list adding 1 to other integers). If your threads were making blocking system calls (like reading from a file, or from a network socket), then the GIL would usually be released while the thread blocked waiting on the external data. But since most operations on Python objects can have side effects, you can't do several of them in parallel.
One important reason for this is that CPython's garbage collector uses reference counting as its main way to know when an object can be cleaned up. If several threads try to update the reference count of the same object at the same time, they might end up in a race condition and leave the object with the wrong count. The GIL prevents that from happening, as only one thread can be making such internal changes at a time. Every time your process code does j = i + 1, it's going to be updating the reference counts of the integer objects 0 and 1 a couple of times each. That's exactly the kind of thing the GIL exists to guard.
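As an illustration of the multiprocessing route mentioned in the question (a sketch written for Python 3, not part of the original answer): because each worker process has its own interpreter and its own GIL, the same kind of CPU-bound loop does scale across processes.

import multiprocessing
import time

def burn(n):
    # CPU-bound work confined to one process; each process has its own GIL.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    work = [10_000_000] * 8
    for nb_procs in [1, 2, 4, 8]:
        t = time.time()
        with multiprocessing.Pool(nb_procs) as pool:
            pool.map(burn, work)
        print(nb_procs, time.time() - t)

The trade-off is the cost of spawning processes and pickling arguments and results, which is why process pools pay off for CPU-bound work but not for fine-grained operations on shared Python objects.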
