Multiprocessing with subprocess.call for an entire directory of files

I have a Python 3 script that uses subprocess.call to run a program on about 2,300 input files in a directory, producing two output files for each input file. The two outputs go into two different directories. I would like to learn how to parallelize my script so several files can be processed at the same time. I have been reading about the multiprocessing library in Python, but it might be too advanced for me to understand. Below is the script if the experts have any input. Thanks so much!
Script:
import os
import subprocess
import argparse

parser = argparse.ArgumentParser(description="This script aligns DNA sequences in files in a given directory.")
parser.add_argument('--root', default="/shared/testing_macse/", help="PATH to the input directory containing CDS orthogroup files.")
parser.add_argument('--align_NT_dir', default="/shared/testing_macse/NT_aligned/", help="PATH to the output directory for NT aligned CDS orthogroup files.")
parser.add_argument('--align_AA_dir', default="/shared/testing_macse/AA_aligned/", help="PATH to the output directory for AA aligned CDS orthogroup files.")
args = parser.parse_args()

def runMACSE(input_file, NT_output_file, AA_output_file):
    MACSE_command = "java -jar ~/bin/MACSE/macse_v1.01b.jar "
    MACSE_command += "-prog alignSequences "
    MACSE_command += "-seq {0} -out_NT {1} -out_AA {2}".format(input_file, NT_output_file, AA_output_file)
    # print(MACSE_command)
    subprocess.call(MACSE_command, shell=True)

Orig_file_dir = args.root
NT_align_file_dir = args.align_NT_dir
AA_align_file_dir = args.align_AA_dir
try:
    os.makedirs(NT_align_file_dir)
    os.makedirs(AA_align_file_dir)
except FileExistsError as e:
    print(e)

for currentFile in os.listdir(args.root):
    if currentFile.endswith(".fa"):
        runMACSE(args.root + currentFile, args.align_NT_dir + currentFile[:-3]+"_NT_aligned.fa", args.align_AA_dir + currentFile[:-3]+"_AA_aligned.fa")

Subprocess functions run any command-line executable in a separate process; here you are running java. Multiprocessing runs Python code in separate processes, just as threading runs Python code in separate threads, and the APIs of the two are intentionally similar. So multiprocessing is not a substitute for subprocess calls to non-Python programs.
It would be a waste of processes to use multiple Python processes just to launch multiple java processes. You could just as well use multiple threads to make multiple subprocess calls, or use the asyncio module.
Or make your own scheduler. Wrap your for-if in a generator function.
def fa_file(path):
    # Yield the name of every .fa file in the directory, one at a time.
    for currentFile in os.listdir(path):
        if currentFile.endswith(".fa"):
            yield currentFile

fafiles = fa_file(args.root)
Make an array of, say, 10 Popen objects. Sleep for some appropriate interval. Upon waking, loop through the array and replace finished subprocesses (.poll() returns something other than None) for as long as next(fafiles) returns something.
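The scheduler described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name run_pool, its parameters, and the stand-in commands in the demo are made up here, and a real run would pass the java/MACSE argument lists instead.

```python
import subprocess
import sys
import time


def run_pool(commands, max_procs=10, poll_interval=0.1):
    """Run each command, keeping at most max_procs children alive at once."""
    pending = iter(commands)
    running = []        # Popen objects that have not finished yet
    return_codes = []   # exit codes, in completion order
    exhausted = False
    while running or not exhausted:
        # Top up the pool while there are free slots and commands left.
        while not exhausted and len(running) < max_procs:
            cmd = next(pending, None)
            if cmd is None:
                exhausted = True
            else:
                running.append(subprocess.Popen(cmd))
        time.sleep(poll_interval)
        # poll() returns None while the child is still running.
        still_running = []
        for proc in running:
            code = proc.poll()
            if code is None:
                still_running.append(proc)
            else:
                return_codes.append(code)
        running = still_running
    return return_codes


if __name__ == "__main__":
    # Stand-in commands (hypothetical); each one just starts Python and exits.
    demo = [[sys.executable, "-c", "pass"] for _ in range(5)]
    print(run_pool(demo, max_procs=2))
```

Passing each command as a list of arguments avoids shell=True; note that "~" is then no longer expanded by a shell, so paths would need os.path.expanduser first.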
EDIT: If you did the image processing in Python code that calls compiled C code (pillow, for instance), then you could use multiprocessing and a Queue loaded with the files to process.
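The thread-based option mentioned earlier in this answer could look something like the sketch below, which uses concurrent.futures.ThreadPoolExecutor from the standard library. The helper names run_command and run_all, and the stand-in commands, are assumptions for illustration and not part of the original script.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(cmd):
    """Run one external command and return its exit code."""
    # The worker thread simply blocks here while the child process works.
    return subprocess.call(cmd)


def run_all(commands, max_workers=4):
    """Run commands with at most max_workers child processes at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input commands.
        return list(pool.map(run_command, commands))


if __name__ == "__main__":
    # Stand-in commands (hypothetical); replace with the real argument lists.
    demo = [[sys.executable, "-c", "pass"] for _ in range(6)]
    print(run_all(demo, max_workers=3))
```

Threads are enough here because each one spends its time waiting on an external process rather than running Python code.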

Related

Python3 Unable to store stdout to variable

My case is a little bit specific. I'm trying to run a Python program using Python for testing purposes. The case is as follows:
# file1.py
print("Hello world")

# file1.test.py
import io
import sys
import os
import unittest

EXPECTED_OUTPUT = "Hello world"

class TestHello(unittest.TestCase):
    def test_hello(self):
        sio = io.StringIO()
        sys.stdout = sio
        os.system("python3 path/to/file1.py")
        sys.stdout = sys.__stdout__
        print("captured value:", sio.getvalue())
        self.assertEqual(sio.getvalue(), EXPECTED_OUTPUT)

if __name__ == "__main__":
    unittest.main()
But nothing ends up in the sio variable. This way and similar ways are introduced online but they don't seem to work for me. My Python version is 3.8.10 but it doesn't really matter if this works better in some other version, I can switch to that.
Note: I know that if I was using an importable object this might be easier, but right now I need to know how to catch the output of another file.
Thanks!

stdout redirection does not work like this: assigning to sys.stdout only changes the stdout variable inside your own Python process. By using os.system you are running another process, which reuses the same terminal pseudo-files your parent process is using.
If you want to capture the output of a subprocess, the way to do it is to use the subprocess module's calls, which allow you to redirect the subprocess output. https://docs.python.org/3/library/subprocess.html
Also, the subprocess won't be able to use a StringIO object from the parent process (it is not an OS-level object, just an in-process Python object with a write method). The docs above include instructions for using the special object subprocess.PIPE, which allows for in-memory communication, or you can just pass an ordinary filesystem file, which you can read afterwards.
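As a minimal sketch of the PIPE approach, subprocess.run with capture_output=True (Python 3.7+) collects the child's stdout in memory. The helper name capture_stdout is made up for this example, and an inline script stands in for path/to/file1.py.

```python
import subprocess
import sys


def capture_stdout(cmd):
    """Run cmd and return what it wrote to standard output, as text."""
    # capture_output=True connects stdout and stderr to pipes;
    # check=True raises CalledProcessError on a non-zero exit code.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Stand-in for "python3 path/to/file1.py": an inline script with the same output.
captured = capture_stdout([sys.executable, "-c", "print('Hello world')"])
print("captured value:", captured.strip())
```

In a test, the captured text usually needs a strip() before comparison, because print adds a trailing newline.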

How can I pass and receive information dynamically within a subprocess?

I'm developing a Python code that can run two applications and exchange information between them during their run time.
The basic scheme is something like:
start a subprocess with the 1st application
start a subprocess with the 2nd application
1st application performs some calculation, writes a file A, and waits for input
2nd application reads file A, performs some calculation, writes a file B, and waits for input
1st application reads file B, performs some calculation, writes a file C, and waits for input
...and so on until some condition is met
I know how to start one Python subprocess, and now I'm learning how to pass/receive information during run time.
I'm testing my Python code using a super-simple application that just reads a file, makes a plot, closes the plot, and returns 0.
I was able to pass an input to a subprocess using subprocess.communicate() and I could tell that the subprocess used that information (plot opens and closes), but here the problems started.
I can only send an input string once. After the first subprocess.communicate() in my code below, the subprocess hangs there. I suspect I might have to use subprocess.stdin.write() instead, since I read that subprocess.communicate() waits for end-of-file, whereas I want to send different inputs several times while the application is running. But I also read that the use of stdin.write() and stdout.read() is discouraged. I tried this second alternative (see #alternative in the code below), but in this case the application doesn't seem to receive the inputs, i.e. it doesn't do anything and the code ends.
Debugging is complicated because I haven't found a neat way to output what the subprocess is receiving as input and giving as output. (I tried to implement the solutions described here, but I must have done something wrong: Python: How to read stdout of subprocess in a nonblocking way, A non-blocking read on a subprocess.PIPE in Python)
Here is my working example. Any help is appreciated!
import os
import subprocess
from subprocess import PIPE
# Set application name
app_folder = 'my_folder_path'
full_name_app = os.path.join(app_folder, 'test_subprocess.exe')
# Start process
out_app = subprocess.Popen([full_name_app], stdin=PIPE, stdout=PIPE)
# Pass argument to process
N = 5
for n in range(N):
    str_to_communicate = f'{{\'test_{n+1}.mat\', {{\'t\', \'y\'}}}}' # funny looking string - but this how it needs to be passed
    bytes_to_communicate = str_to_communicate.encode()
    output_communication = out_app.communicate(bytes_to_communicate)
    # output_communication = out_app.stdin.write(bytes_to_communicate) # alternative
    print(f'Communication command #{n+1} sent')
# Terminate process
out_app.terminate()
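For context, Popen.communicate() sends its input, closes the child's stdin, and waits for the process to exit, so it can only be used once per process. A repeated exchange is possible with stdin.write() plus flush(), provided both sides agree on a message boundary. The sketch below assumes a line-based protocol and uses a small Python child as a hypothetical stand-in for test_subprocess.exe, whose real input format is not known here.

```python
import subprocess
import sys

# Hypothetical child: reads one line at a time, answers with one line, and
# flushes after every reply so the parent is never left waiting.
CHILD_CODE = (
    "import sys\n"
    "while True:\n"
    "    line = sys.stdin.readline()\n"
    "    if not line:\n"
    "        break\n"
    "    print('got ' + line.strip(), flush=True)\n"
)


def exchange(messages):
    """Send each message on its own line and collect one reply per message."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    replies = []
    for message in messages:
        proc.stdin.write(message + "\n")
        proc.stdin.flush()   # without this the data may sit in a buffer
        replies.append(proc.stdout.readline().strip())
    proc.stdin.close()       # end of input lets the child exit normally
    proc.wait()
    return replies


print(exchange(["test_1.mat", "test_2.mat"]))
```

If the real application does not terminate its replies with a newline, or buffers its output, readline() in the parent would block, which may be what the commented-out alternative ran into.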

Python multiprocessing manager showing error when used in flask API

I am pretty confused about the best way to do what I am trying to do.
What do I want?
API call to the flask application
The Flask route starts 4-5 processes with multiprocessing.Process and combines their results (computed on slices of a pandas DataFrame) in a shared Manager().list()
Return computed results back to the client.
My implementation:
pos_iter_list = get_chunking_iter_list(len(position_records), 10000)
manager = Manager()
data_dict = manager.list()
processes = []
for i in range(len(pos_iter_list) - 1):
    temp_list = data_dict[pos_iter_list[i]:pos_iter_list[i + 1]]
    p = Process(
        target=transpose_dataset,
        args=(temp_list, name_space, align_namespace, measure_master_id, df_searchable, products,
              channels, all_cols, potential_col, adoption_col, final_segment, col_map, product_segments,
              data_dict)
    )
    p.start()
    processes.append(p)
for p in processes:
    p.join()
My directory structure:
- main.py(flask entry point)
- helper.py(contains function where above code is executed & calls transpose_dataset function)
The error I get when running this:
RuntimeError: No root path can be found for the provided module "__mp_main__". This can happen because the module came from an import hook that does not provide file name information or because it's a namespace package. In this case the root path needs to be explicitly provided.
I'm not sure what went wrong here; the manager list works fine when called from a sample.py file using if __name__ == '__main__':
Update: The same piece of code works fine on my MacBook but not on Windows.
A sample flask API call:
@app.route(PREFIX + "ping", methods=['GET'])
def ping():
    man = mp.Manager()
    data = man.list()
    processes = []
    for i in range(0,5):
        pr = mp.Process(target=test_func, args=(data, i))
        pr.start()
        processes.append(pr)
    for pr in processes:
        pr.join()
    return json.dumps(list(data))

Stack has an ongoing bug preventing me from commenting, so I'll just write up an answer.
Python has two main ways to start a new process: "spawn" and "fork". Fork is a system call only available on *nix (read: Linux or macOS), so spawn is the only option on Windows. Since Python 3.8, spawn is the default on macOS, but fork is still available there. The big difference is that fork basically makes a copy of the existing process, while spawn starts a whole new process (like opening a new cmd window). There's a lot of nuance to why and how, but in order to run the function you want the child process to run under spawn, the child has to import the main file. Importing a file is tantamount to executing that file and then typically binding its namespace to a variable: import flask will run the flask/__init__.py file and bind its global namespace to the variable flask. Often, however, there is code that is only used by the main process and doesn't need to be imported or executed in the child process. In some cases running that code again actually breaks things, so you need to prevent it from running outside of the main process. This is taken into account in that the "magic" variable __name__ is only equal to "__main__" in the main file (and not in child processes or when importing modules).
In your specific case, you're creating a new app = Flask(__name__), which does some amount of validation and checking before you ever run the server. It's one of these setup/validation steps that it trips over when run from the child process. Fixing it by not letting it run at all is IMO the cleaner solution, but you can also fix it by giving it a value that it won't trip over, and then never starting that secondary server (again by protecting it with if __name__ == "__main__":)
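To see the effect of the guard without starting any processes, the sketch below executes the same small file twice with runpy.run_path: once under the name "__main__" and once under "__mp_main__", which is the name a spawned child gives the main module when it re-imports it. The file contents and the helper run_as are made up for this illustration.

```python
import os
import runpy
import tempfile
import textwrap

# A tiny stand-in for a main file: one unguarded line and one guarded line.
SOURCE = textwrap.dedent('''
    events = ["module-level code ran"]
    if __name__ == "__main__":
        events.append("guarded code ran")
''')


def run_as(run_name):
    """Execute the stand-in file with __name__ set to run_name."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_main.py")
        with open(path, "w") as fh:
            fh.write(SOURCE)
        # run_path returns the module globals after execution.
        return runpy.run_path(path, run_name=run_name)["events"]


as_main = run_as("__main__")        # how the parent process runs the file
as_child = run_as("__mp_main__")    # how a spawned child re-imports it
print(as_main)
print(as_child)
```

Anything that must not run again in the children, such as creating the Flask app or starting the server, belongs on the guarded side.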

Python Process Multiple PDF file from Multiple location at same time

I process about 1,500 PDF files daily from three different locations. My problem is that when the code runs, it processes folder 1 first, then folder 2, then folder 3.
I want it to process all folders at the same time, for example 5 files from folder 1, 3 files from folder 2 and 4 files from folder 3.
We can't wait for folder 1 to complete. Sometimes folder 1 has 500 files, which means I have to wait until the program has processed all of them before it starts on folder 2.
I tried threading but it is not working: it still processes the folders serially, finishing the folder 1 files before it starts on folder 2. But I want to process folder 1 and folder 2 files at the same time.
To explain the code below a little: I have 3 folder locations that contain PDF files, and a main function that is responsible for converting PDF to PS. I call that function in three different threads, one per location.
inputpath1 = "/121rawfile/FTP HotFolder"
inputpath2 = "/121rawfile/FTP Download File"
inputpath3 = "/121rawfile/Olive"

## define function with variable filename and the format of the timestamp
def timeStamped(filename, fmt='%m-%d-%y-{filename}'):
    os.chdir("/PDF_Flattening")
    return datetime.now().strftime(fmt).format(filename=filename)

filename=timeStamped("log_file_9.log")
logging.basicConfig(filename=filename,format = "%(asctime)s - %(levelname)s - %(message)s",level = logging.DEBUG)

while True:
    try:
        t1=threading.Thread(target=main(inputpath1),args=(inputpath1,))
        t1.start()
        t2=threading.Thread(target=main(inputpath2),args=(inputpath2,))
        t2.start()
        t3=threading.Thread(target=main(inputpath3),args=(inputpath3,))
        t3.start()
        raise Exception("Error simulated!")
    except Exception as e :
        logging.error("failed")

CPython (the standard Python implementation) has something called the GIL (Global Interpreter Lock); the GIL prevents two threads from executing Python bytecode at the same time within one process. For CPU-bound tasks and truly parallel execution, we can use the multiprocessing module.
A basic example of how to use the multiprocessing module:
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
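Separately from the GIL, note that target=main(inputpath1) in the question calls main immediately in the main thread and hands its return value to Thread, which on its own would explain the folders being handled one after another. A minimal sketch of passing the callable and its arguments separately follows; process_folder and the folder names are hypothetical stand-ins for the question's main function and paths.

```python
import threading
import time


def process_folder(path, results):
    """Hypothetical stand-in for main(inputpath): pretend to convert files."""
    time.sleep(0.2)          # simulated work; sleeping releases the GIL
    results.append(path)


folders = ["folder1", "folder2", "folder3"]  # placeholder paths
results = []

# Pass the function itself plus its arguments. Writing
# target=process_folder(path, results) would call it right here,
# in the main thread, before the Thread object even exists.
threads = [
    threading.Thread(target=process_folder, args=(path, results))
    for path in folders
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

The same target/args split applies to multiprocessing.Process if the conversion work turns out to be CPU-bound Python code.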

Printing from other thread when waiting for input()

I am trying to write a shell that needs to run socket connections on a separate thread. In my tests, when print() is used while cmd.Cmd.cmdloop() is waiting for input, the output is displayed in the wrong place.
from core.shell import Shell
import time
import threading

def test(shell):
    time.sleep(2)
    shell.write('Doing test')

if __name__ == '__main__':
    shell = Shell(None, None)
    testThrd = threading.Thread(target=test, args=(shell,))
    testThrd.start()
    shell.cmdloop()
When the above command runs, here is what happens:
python test.py
Welcome to Test shell. Type help or ? to list commands.
>>asd
*** Unknown syntax: asd
>>[17:59:25] Doing test
As you can see, printing from another thread adds the output after the prompt >> rather than on a new line. How can I make the output appear on a new line, followed by a fresh prompt?

What you can do is redirect stdout from your core.shell.Shell to a file-like object such as StringIO. You would also redirect the output from your thread into a different file-like object.
Now you can have some third thread read both of these objects and print them out in whatever fashion you want.
You said core.shell.Shell inherits from cmd.Cmd, which accepts the redirection as a parameter to its constructor:
import io
import time
import threading
from core.shell import Shell

def test(output_obj):
    time.sleep(2)
    print('Doing test', file=output_obj)

cmd_output = io.StringIO()
thr_output = io.StringIO()
shell = Shell(stdout=cmd_output)
testThrd = threading.Thread(target=test, args=(thr_output,))
testThrd.start()

# in some other process/thread
cmd_line = cmd_output.readline()
thr_line = thr_output.readline()

That's quite difficult. Both of your threads share the same stdout, so the output from each thread is sent concurrently to the stdout buffer, where it is printed in some arbitrary order.
What you need to do is coordinate the output from both threads, and that's a tough nut to crack. Even bash doesn't do that!
That said, maybe you can try using a lock to make sure your threads access stdout in a controlled manner. Check out: http://effbot.org/zone/thread-synchronization.htm
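A minimal sketch of the lock idea, assuming every writer goes through one shared helper (safe_print and worker are names made up here). A StringIO stands in for sys.stdout so the result can be inspected. The lock keeps each line whole, but it does not by itself redraw the cmd.Cmd prompt after the message.

```python
import io
import threading

print_lock = threading.Lock()


def safe_print(text, stream):
    """Write one complete line to stream while holding the shared lock."""
    with print_lock:
        stream.write(text + "\n")


def worker(name, stream, count=50):
    """Emit count tagged lines through safe_print."""
    for i in range(count):
        safe_print(f"[{name}] message {i}", stream)


output = io.StringIO()  # stands in for sys.stdout in this sketch
threads = [threading.Thread(target=worker, args=(n, output)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
lines = output.getvalue().splitlines()
print(len(lines))
```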
