Fail fast with MPI4PY - openmpi

I'd like the following behavior when running an MPI script with mpi4py: when any process raises an exception, mpirun (and its spawned processes) should immediately exit with non-zero error codes. But instead, I find that execution continues even if one or more processes raise an exception.
I am using mpi4py 3.0.0 with OpenMPI 2.1.2. I'm running this script with
mpirun --verbose -mca orte_abort_on_non_zero_status 1 -n 4 python my_script.py. I expected this to immediately end before the sleep is hit, but instead, processes with ranks != 0 sleep:
import time
import mpi4py

def main():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('Failure')

    print('{} continuing to execute'.format(mpi_comm.rank))
    time.sleep(10)
    print('{} exiting'.format(mpi_comm.rank))

if __name__ == '__main__':
    main()
How can I get the behavior I'd like (fail quickly if any process fails)?
Thank you!

It seems to be a known issue with mpi4py. From https://groups.google.com/forum/#!topic/mpi4py/RovYzJ8qkbc, I read:
mpi4py initializes/finalizes MPI for you. The initialization occurs at
import time, and the finalization when the Python process is about to
finalize (I'm using Py_AtExit() C-API call to do this). As
MPI_Finalize() is collective and likely blocking in most MPI impls,
you get the deadlock.
A solution is to override sys.excepthook and explicitly call MPI.COMM_WORLD.Abort in it.
Here is your code modified:
import sys
import time
import mpi4py.MPI

mpi_comm = mpi4py.MPI.COMM_WORLD

def mpiabort_excepthook(type, value, traceback):
    mpi_comm.Abort()
    sys.__excepthook__(type, value, traceback)

def main():
    if mpi_comm.rank == 0:
        raise ValueError('Failure')

    print('{} continuing to execute'.format(mpi_comm.rank))
    time.sleep(10)
    print('{} exiting'.format(mpi_comm.rank))

if __name__ == "__main__":
    sys.excepthook = mpiabort_excepthook
    main()
    sys.excepthook = sys.__excepthook__

It turns out mpi4py can be run as a module, which fixes this issue (internally it calls Abort(), as jcgiret says):
mpirun --verbose -mca orte_abort_on_non_zero_status 1 -n 4 python -m mpi4py my_script.py
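If you'd rather not rely on the excepthook or the module runner, you can get the same fail-fast behavior by wrapping your entry point in a try/except that aborts the communicator yourself. A minimal sketch of that idea (not from the thread; the body of main is a placeholder):

import traceback
import mpi4py.MPI

def main():
    ...  # your MPI code goes here

if __name__ == '__main__':
    try:
        main()
    except Exception:
        traceback.print_exc()
        # tear every rank down with a non-zero exit code
        mpi4py.MPI.COMM_WORLD.Abort(1)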

Related

torch.distributed.barrier() added on all processes not working

import torch
import os

torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
if local_rank > 0:
    torch.distributed.barrier()
print(f"Entered process {local_rank}")
if local_rank == 0:
    torch.distributed.barrier()
The above code hangs forever, but if I remove both torch.distributed.barrier() calls then both print statements execute.
On the command line I launch the processes with torchrun --nnodes=1 --nproc_per_node 2 test.py, where test.py is the name of the script.
I tried the above code with and without the torch.distributed.barrier() calls:
With the barrier() statements, I expected the statement to print for one GPU and then exit -- not as expected.
Without the barrier() statements, I expected both to print -- as expected.
Am I missing something here?
It is better to put your multiprocessing initialization code inside the if __name__ == "__main__": guard to avoid endless process generation, and to re-design the control flow to fit your purpose:
if __name__ == "__main__":
    import torch
    import os

    torch.distributed.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    if local_rank > 0:
        torch.distributed.barrier()
    else:
        print(f"Entered process {local_rank}")
        torch.distributed.barrier()
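This control flow is the usual "rank 0 works first" pattern: the non-zero ranks park at the barrier while rank 0 does its one-time work, and rank 0's own barrier call then releases everyone. A minimal sketch of that pattern (the download_data name is hypothetical, and the gloo backend is used here only so the sketch runs without GPUs):

import os
import torch.distributed as dist

def download_data():
    # hypothetical one-time setup that only rank 0 should perform
    pass

if __name__ == "__main__":
    dist.init_process_group(backend="gloo")
    local_rank = int(os.environ["LOCAL_RANK"])
    if local_rank == 0:
        download_data()  # only rank 0 does the work
    dist.barrier()       # every rank meets here; the others waited for rank 0
    print(f"rank {local_rank} proceeding")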

Runtime error using concurrent.futures.ProcessPoolExecutor

I have watched many basic YouTube tutorials on concurrent.futures.ProcessPoolExecutor. I have also seen posts on SO here and here, GitHub and GitHubMemory, yet no luck.
Problem:
I'm getting the following runtime error:
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
I admit I do not fully understand this error, since this is my very first attempt at multiprocessing in my Python code.
Here's my pseudocode:
module.py
import xyz
from multiprocessing import freeze_support

def abc():
    return x

def main():
    xyz
    qwerty

if __name__ == "__main__":
    freeze_support()
    obj = Object()
    main()
classObject.py
import abcd

class Object(object):
    def __init__(self):
        asdf
        cvbn
        with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
            executor.map(self.function_for_multiprocess, var1, var2)
            # The error points at the code above.

    def function_for_multiprocess(var1, var2):
        doSomething1
        doSomething2
        self.variable = something
My class file (classObject.py) does not have the "main" guard.
Things I have tried:
Tried adding if __name__ == "__main__": and freeze_support() in classObject.py, along with renaming __init__() to main().
While doing the above, removed freeze_support() from module.py.
I haven't found a different solution in the links provided above. Any insights would be greatly appreciated!
I'm using a MacBook Pro (16-inch, 2019), Processor 2.3 GHz 8-Core Intel Core i9, OS: Big Sur. I don't think that matters, but I'm declaring it in case it does.
You need to pass the arguments as a picklable object, such as a list or a tuple,
and you don't need freeze_support().
Just change executor.map(self.function_for_multiprocess, var1, var2)
to executor.map(self.function_for_multiprocess, (var1, var2)):
from multiprocessing import freeze_support
import concurrent.futures

class Object(object):
    def __init__(self, var1=1, var2=2):
        with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
            executor.map(self.function_for_multiprocess, (var1, var2))

    def function_for_multiprocess(var1, var2):
        print('var1:', var1)
        print('var2:', var2)

def abc(x):
    return x

def main():
    print('abc:', abc(200))

if __name__ == "__main__":
    #freeze_support()
    obj = Object()
    main()
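As an aside, executor.map follows the convention of the built-in map: if you pass several iterables, it zips them and hands one element of each to the function per call, and a module-level function avoids any questions about pickling bound methods. A minimal sketch of that usage (the names here are illustrative, not from the question):

import concurrent.futures

def combine(a, b):
    # receives one element from each iterable per call
    return a + b

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        # zips [1, 2] with [10, 20]: calls combine(1, 10) and combine(2, 20)
        results = list(executor.map(combine, [1, 2], [10, 20]))
    print(results)  # [11, 22]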

Python multiprocessing script partial output

I am following the principles laid down in this post to safely output results that will eventually be written to a file. Unfortunately, the code only prints 1 and 2, and not 3 to 6.
import os
import argparse
import pandas as pd
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep

def feed(queue, parlist):
    for par in parlist:
        queue.put(par)
    print("Queue size", queue.qsize())

def calc(queueIn, queueOut):
    while True:
        try:
            par = queueIn.get(block=False)
            res = doCalculation(par)
            queueOut.put((res))
            queueIn.task_done()
        except:
            break

def doCalculation(par):
    return par

def write(queue):
    while True:
        try:
            par = queue.get(block=False)
            print("response:", par)
        except:
            break

if __name__ == "__main__":
    nthreads = 2
    workerQueue = Queue()
    writerQueue = Queue()
    considerperiod = [1, 2, 3, 4, 5, 6]
    feedProc = Process(target=feed, args=(workerQueue, considerperiod))
    calcProc = [Process(target=calc, args=(workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target=write, args=(writerQueue,))
    feedProc.start()
    feedProc.join()
    for p in calcProc:
        p.start()
    for p in calcProc:
        p.join()
    writProc.start()
    writProc.join()
On running the code it prints,
$ python3 tst.py
Queue size 6
response: 1
response: 2
Also, is it possible to ensure that the write function always outputs 1,2,3,4,5,6 i.e. in the same order in which the data is fed into the feed queue?
The problem is the task_done() call: task_done() only exists on multiprocessing.JoinableQueue, not on the plain multiprocessing.Queue used here, so calling it raises AttributeError. The bare except catches that, so each of the two workers exits after handling a single item, which is why only 1 and 2 are printed (see the JoinableQueue sketch after the notes below). If you remove that call, it works, but then it works because the queueIn.get(block=False) call throws an exception once the queue is empty. This might be just enough for your use case; a better way, though, is to use sentinels (as suggested in the multiprocessing docs, see the last example). Here's a little rewrite so your program uses sentinels:
import os
import argparse
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep

def feed(queue, parlist, nthreads):
    for par in parlist:
        queue.put(par)
    for i in range(nthreads):
        queue.put(None)
    print("Queue size", queue.qsize())

def calc(queueIn, queueOut):
    while True:
        par = queueIn.get()
        if par is None:
            break
        res = doCalculation(par)
        queueOut.put((res))

def doCalculation(par):
    return par

def write(queue):
    while not queue.empty():
        par = queue.get()
        print("response:", par)

if __name__ == "__main__":
    nthreads = 2
    workerQueue = Queue()
    writerQueue = Queue()
    considerperiod = [1, 2, 3, 4, 5, 6]
    feedProc = Process(target=feed, args=(workerQueue, considerperiod, nthreads))
    calcProc = [Process(target=calc, args=(workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target=write, args=(writerQueue,))
    feedProc.start()
    feedProc.join()
    for p in calcProc:
        p.start()
    for p in calcProc:
        p.join()
    writProc.start()
    writProc.join()
A few things to note:
the sentinel is a None put into the queue; note that you need one sentinel for every worker process.
for the write function you don't need the sentinel handling, as there's only one consumer process and no concurrency to manage (if you did the empty()-then-get() combination in your calc function, you would run into a problem: with only one item left in the queue, both workers could see empty() return False at the same time, then both would call get(), and one of them would block forever).
you don't need to put feed and write into separate processes; just call them from your main function, as you don't want to run them in parallel anyway.
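For completeness, task_done() does exist, but on multiprocessing.JoinableQueue rather than on the plain Queue. A minimal sketch of that variant, assuming you want join() semantics on the input queue:

from multiprocessing import JoinableQueue, Process

def worker(queue):
    while True:
        item = queue.get()       # blocks until an item arrives
        if item is None:         # sentinel: no more work
            queue.task_done()
            break
        print("processed:", item)
        queue.task_done()        # mark this item as handled

if __name__ == "__main__":
    q = JoinableQueue()
    p = Process(target=worker, args=(q,))
    p.start()
    for i in [1, 2, 3]:
        q.put(i)
    q.put(None)                  # one sentinel for the single worker
    q.join()                     # returns once every item is task_done'd
    p.join()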
how can I have the same order in output as in input? [...] I guess multiprocessing.map can do this
Yes, map keeps the order. Rewriting your program into something simpler (you don't need the workerQueue and writerQueue) and adding random sleeps to show that the output is still in order:
from multiprocessing import Pool
import time
import random

def calc(val):
    time.sleep(random.random())
    return val

if __name__ == "__main__":
    considerperiod = [1, 2, 3, 4, 5, 6]
    with Pool(processes=2) as pool:
        print(pool.map(calc, considerperiod))
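If the results should be streamed to a file as they become available while still preserving input order, pool.imap yields results lazily in input order. A minimal sketch, reusing calc from above (the results.txt name is just for illustration):

from multiprocessing import Pool
import time
import random

def calc(val):
    time.sleep(random.random())
    return val

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        with open("results.txt", "w") as f:
            # imap yields one result at a time, in input order
            for res in pool.imap(calc, [1, 2, 3, 4, 5, 6]):
                f.write("{}\n".format(res))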

Why no call_at_threadsafe and call_later_threadsafe?

I'm using Python 3.5.2 on Windows 32-bit and am aware that asyncio's call_at is not thread-safe; hence the following code won't print 'bomb' unless I uncomment the line loop._write_to_self().
import asyncio
import threading

def bomb(loop):
    loop.call_later(1, print, 'bomb')
    print('submitted')
    # loop._write_to_self()

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    threading.Timer(2, bomb, args=(loop,)).start()
    loop.run_forever()
However, I couldn't find any information about why call_at_threadsafe and call_later_threadsafe are not implemented. Is there a reason?
Simply use loop.call_soon_threadsafe to schedule loop.call_later: call_soon_threadsafe wakes the event loop, and the wrapped call_later then executes inside the loop's own thread, where it is safe:
loop.call_soon_threadsafe(loop.call_later, 1, print, 'bomb')
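Applied to the script from the question, a minimal sketch of the fix:

import asyncio
import threading

def bomb(loop):
    # schedule call_later from the loop's own thread
    loop.call_soon_threadsafe(loop.call_later, 1, print, 'bomb')
    print('submitted')

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    threading.Timer(2, bomb, args=(loop,)).start()
    loop.run_forever()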

Terminate subprocess

I'm curious why the code below freezes. When I kill the python3 interpreter, the "cat" process remains as a zombie. I expected the subprocess to be terminated before the main process finishes.
When I manually send SIGTERM to cat /dev/zero, the process finishes correctly (almost immediately).
#!/usr/bin/env python3
import subprocess
import logging
import re
import os
import sys
import time
from PyQt4 import QtCore

class Command(QtCore.QThread):
    # stateChanged = QtCore.pyqtSignal([bool])

    def __init__(self):
        QtCore.QThread.__init__(self)
        self.__runned = False
        self.__cmd = None
        print("initialize")

    def run(self):
        self.__runned = True
        self.__cmd = subprocess.Popen(["cat /dev/zero"], shell=True, stdout=subprocess.PIPE)
        try:
            while self.__runned:
                print("reading via pipe")
                buf = self.__cmd.stdout.readline()
                print("Buffer:{}".format(buf))
        except:
            logging.warning("Can't read from subprocess (cat /dev/zero) via pipe")
        finally:
            print("terminating")
            self.__cmd.terminate()
            self.__cmd.kill()

    def stop(self):
        print("Command::stop stopping")
        self.__runned = False
        if self.__cmd:
            self.__cmd.terminate()
            self.__cmd.kill()
        print("Command::stop stopped")

def exitApp():
    command.stop()
    time.sleep(1)
    sys.exit(0)

if __name__ == "__main__":
    app = QtCore.QCoreApplication(sys.argv)
    command = Command()
    # command.daemon = True
    command.start()
    timer = QtCore.QTimer()
    QtCore.QObject.connect(timer, QtCore.SIGNAL("timeout()"), exitApp)
    timer.start(2 * 1000)
    sys.exit(app.exec_())
As you noted yourself, the reason for the zombie is that the signal is caught by the shell and doesn't affect the process created by it. However, there is a way to kill the shell and all processes created by it: you have to use the process-group feature. See How to terminate a python subprocess launched with shell=True. Having said that, if you can manage without shell=True, that's always preferable - see my answer here.
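A minimal sketch of the process-group approach on a POSIX system: start the shell in its own session so it leads a new process group, then signal the whole group:

import os
import signal
import subprocess

# start_new_session=True puts the shell (and its children) into a new process group
proc = subprocess.Popen("cat /dev/zero", shell=True,
                        stdout=subprocess.PIPE,
                        start_new_session=True)

# signal the entire group so both the shell and cat receive SIGTERM
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)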
I solved this problem in a different way, so here's the result:
I have to call subprocess.Popen with shell=False, because otherwise it creates two processes (the shell and the process), and __cmd.kill() sends the signal to the shell while the process remains a zombie.
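A minimal sketch of that change: pass the command as an argument list so no shell is involved, and terminate()/kill() then signal cat directly:

import subprocess

# no shell: cat is the direct child, so terminate()/kill() reach it directly
cmd = subprocess.Popen(["cat", "/dev/zero"], stdout=subprocess.PIPE)
cmd.terminate()
cmd.kill()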
