How can I fork many child processes in Python? - python-3.x

I would like to spawn child processes and download concurrently (I know it is not truly simultaneous, but it should look like it happens in parallel), running wget in each process.
for download_cmd in cmd_list:
    pid = os.fork()
    if pid == 0:
        fd = subprocess.Popen(download_cmd)
    else:
        cur_num_of_process += 1
    if pid != 0:
        while cur_num_of_process > 0:
            os.wait()
            cur_num_of_process -= 1
But it doesn't work. Any help, please?
Python version is 3.x

You are waiting for each child, and so you never create more than one.
Classic fork bomb, in Python. Try this only as an experiment, only on your own equipment (not on a computer lab's systems, for example), and be ready to handle the consequences. This demonstrates forking many (many, many) child processes.
while True:
    os.fork()
What you want is to create a collection of processes, then wait on them after each has been started.
workers = []
for cmd in [...]:
    proc = subprocess.Popen(cmd, ...)
    workers.append(proc)

for proc in workers:
    proc.wait()
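For instance, a minimal self-contained sketch of the same pattern (the URLs and file names below are made up purely for illustration) could look like this:

import subprocess

# hypothetical download commands -- substitute your own cmd_list
cmd_list = [
    ["wget", "-q", "-O", "one.bin", "https://example.com/one.bin"],
    ["wget", "-q", "-O", "two.bin", "https://example.com/two.bin"],
]

# start every download first ...
workers = [subprocess.Popen(cmd) for cmd in cmd_list]

# ... then wait for all of them
exit_codes = [proc.wait() for proc in workers]
print("exit codes:", exit_codes)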

Related

Getting `BrokenProcessPool` error in a `concurrent.futures` example

The example I am running is mentioned in this PyMOTW3 link. I am reproducing the code here:
from concurrent import futures
import os

def task(n):
    return (n, os.getpid())

ex = futures.ProcessPoolExecutor(max_workers=2)
results = ex.map(task, range(5, 0, -1))

for n, pid in results:
    print('ran task {} in process {}'.format(n, pid))
As per the source, I am supposed to get the following output:
ran task 5 in process 40854
ran task 4 in process 40854
ran task 3 in process 40854
ran task 2 in process 40854
ran task 1 in process 40854
Instead, I'm getting a long message with the following concluding line:
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I am using a Windows machine and running Python 3.9. All other examples are otherwise running fine. What is going wrong here?
I've finally been able to resolve the issue. The issue seems to be Windows-specific. Following a related Stack Overflow post, I used the if __name__ == "__main__" idiom. The modified code is:
from concurrent import futures
import os

def task(n):
    return (n, os.getpid())

def main():
    ex = futures.ProcessPoolExecutor(max_workers=2)
    results = ex.map(task, range(5, 0, -1))
    for n, pid in results:
        print('ran task {} in process {}'.format(n, pid))

if __name__ == '__main__':
    main()
It worked, although I'm still not sure why this worked.

Multiprocess: Persistent Pool?

I have code like the one below:
def expensive(self, c, v):
    .....

def inner_loop(self, c, collector):
    self.db.query('SELECT ...', (c,))
    for v in self.db.cursor.fetchall():
        collector.append(self.expensive(c, v))

def method(self):
    # create a Pool
    # join the Pool ??
    self.db.query('SELECT ...')
    for c in self.db.cursor.fetchall():
        collector = []
        # RUN the whole cycle in parallel in separate processes
        self.inner_loop(c, collector)
        # do stuff with the collector
    #! close the pool ?
Both the outer and the inner loop run for thousands of steps ...
I think I understand how to run a Pool of a couple of processes; all the examples I found show more or less that.
But in my case I need to launch a persistent Pool and then feed it the data (the c values). Once an inner-loop process has finished, I have to supply the next available c value.
And keep the processes running and collect the results.
How do I do that?
A clunky idea I have is:
def method(self):
    ws = 4
    with Pool(processes=ws) as pool:
        cs = []
        for i, c in enumerate(..):
            cs.append(c)
            if i % ws == 0:
                res = [pool.apply(self.inner_loop, (c)) for i in range(ws)]
                cs = []
                collector.append(res)
Will this keep the same pool running, i.e. not launch new processes every time?
Do I need the 'if i % ws == 0' part, or can I just use imap() or map_async() and have the Pool object block the loop when the available workers are exhausted and continue when some are freed?
Yes, the way that multiprocessing.Pool works is:
Worker processes within a Pool typically live for the complete duration of the Pool’s work queue.
So simply submitting all your work to the pool via imap should be sufficient:
with Pool(processes=4) as pool:
    initial_results = db.fetchall("SELECT c FROM outer")
    results = [pool.imap(self.inner_loop, (c,)) for c in initial_results]
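If the goal is to have the pool hand each worker the next available c as soon as it frees up, imap_unordered over all the c values does that without the manual batching from the question. This is only a sketch: it assumes inner_loop is reworked to return its results (a separate collector list cannot be shared across processes), and, like the snippet above, it assumes self and its DB handle can be shipped to the workers; in practice you may need a per-worker connection.

from multiprocessing import Pool

def inner_loop(self, c):
    # assumed variant: return the per-c results instead of appending to a shared list
    self.db.query('SELECT ...', (c,))
    return [self.expensive(c, v) for v in self.db.cursor.fetchall()]

def method(self):
    self.db.query('SELECT ...')
    cs = [row[0] for row in self.db.cursor.fetchall()]
    collector = []
    with Pool(processes=4) as pool:
        # workers stay alive for the whole loop; results arrive as each c finishes
        for res in pool.imap_unordered(self.inner_loop, cs):
            collector.append(res)
    # do stuff with collector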
That said, if you really are doing this to fetch things from the DB, it may make more sense to move more processing down into that layer (bring the computation to the data rather than bringing the data to the computation).

gsutil without -m multithreading / parallel default behavior

I am trying to find out what the defaults are when gsutil mv is called without the -m option. From the config.py source code, it looks like even without -m the default is to use the number of CPU cores as the process count, along with 5 threads each. So by default, on a 4-core machine you would get 4 processes and 5 threads, i.e. multi-threaded out of the box. How would we find out what -m does? I think I saw in some documentation that -m defaults to 10 threads, but how many processes are spawned? I know you can override these settings, but what is the default with -m?
should_prohibit_multiprocessing, unused_os = ShouldProhibitMultiprocessing()
if should_prohibit_multiprocessing:
    DEFAULT_PARALLEL_PROCESS_COUNT = 1
    DEFAULT_PARALLEL_THREAD_COUNT = 24
else:
    DEFAULT_PARALLEL_PROCESS_COUNT = min(multiprocessing.cpu_count(), 32)
    DEFAULT_PARALLEL_THREAD_COUNT = 5
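(For reference, these appear to be the two values that can be overridden in the boto config file's [GSUtil] section; the numbers below are just an example, not recommendations.)

[GSUtil]
parallel_process_count = 4
parallel_thread_count = 5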
Also, would a mv command in a for loop take advantage of -m, or would it just feed gsutil one command at a time, making the parallelism moot? The reason I ask is that the loop below took 24 hours to complete for 50000 files, and I wanted to know whether the -m option would have helped. I'm not sure if calling the gsutil command on each iteration would allow full threading, or whether it would just run with 10 processes and 10 threads, making it twice as fast.
#!/bin/bash
for files in $(cat listing2.txt) ; do
    echo "Renaming: $files --> ${files#removeprefix-}"
    gsutil mv gs://testbucket/$files gs://testbucket/${files#removeprefix-}
done
Thanks to the commenter @guillaume blaquiere, I engineered a Python program that multiprocesses the API calls to move the files in the cloud, using 25 concurrent processes. I will share the code here in the hope it helps others.
import time
import subprocess
import multiprocessing


class GsRenamer:
    def __init__(self):
        self.gs_cmd = '~/google-cloud-sdk/bin/gsutil'

    def execute_jobs(self, cmd):
        try:
            print('RUNNING PARALLEL RENAME: [{0}]'.format(cmd))
            print(cmd)
            subprocess.run(cmd, check=True, shell=True)
        except subprocess.CalledProcessError as e:
            print('[{0}] FATAL: Command failed with error [{1}]'.format(cmd, e))

    def get_filenames_from_gs(self):
        self.file_list = []
        cmd = [self.gs_cmd, 'ls', 'gs://gs-bucket/jason_testing']
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        output = p.stdout.readlines()
        for files in output:
            files = files.decode('utf-8').strip()
            tokens = files.split('/')[-1]
            self.file_list.append(tokens)
        self.file_list = list(filter(None, self.file_list))

    def rename_files(self, string_original, string_replace):
        final_rename_list = []
        for files in self.file_list:
            renamed_files = files.replace(string_original, string_replace)
            rename_command = "{0} mv gs://gs-bucket/jason_testing/{1} " \
                             "gs://gs-bucket/jason_testing/{2}".format(
                                 self.gs_cmd, files, renamed_files)
            final_rename_list.append(rename_command)
        final_rename_list.sort()
        pool = multiprocessing.Pool(processes=25)
        pool.map(self.execute_jobs, final_rename_list)


def main():
    gsr = GsRenamer()
    gsr.get_filenames_from_gs()
    #gsr.rename_files('sample', 'jason')
    gsr.rename_files('jason', 'sample')


if __name__ == "__main__":
    main()
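If you adapt this, one small tweak worth considering (not in the original post) is to create the Pool with a context manager so the worker processes are reliably cleaned up once all the renames finish:

with multiprocessing.Pool(processes=25) as pool:
    pool.map(self.execute_jobs, final_rename_list)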

Streaming read from subprocess

I need to read output from a child process as it's produced -- perhaps not on every write, but well before the process completes. I've tried solutions from the Python3 docs and SO questions here and here, but I still get nothing until the child terminates.
The application is for monitoring training of a deep learning model. I need to grab the test output (about 250 bytes for each iteration, at roughly 1-minute intervals) and watch for statistical failures.
I cannot change the training engine; for instance, I cannot insert stdout.flush() in the child process code.
I can reasonably wait for a dozen lines of output to accumulate; I was hopeful of a buffer-fill solving my problem.
Code: variations are commented out.
Parent
cmd = ["/usr/bin/python3", "zzz.py"]
# test_proc = subprocess.Popen(
test_proc = subprocess.run(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT
)
out_data = ""
print(time.time(), "START")
while not "QUIT" in str(out_data):
out_data = test_proc.stdout
# out_data, err_data = test_proc.communicate()
print(time.time(), "MAIN received", out_data)
Child (zzz.py)
from time import sleep
import sys

for _ in range(5):
    print(_, "sleeping", "."*1000)
    # sys.stdout.flush()
    sleep(1)
print("QUIT this exercise")
Despite sending lines of 1000+ bytes, filling the buffer (tested elsewhere as 2 KB; here, I've gone as high as 50 KB) doesn't cause the parent to "see" the new text.
What am I missing to get this to work?
Update with regard to links, comments, and iBug's posted answer:
Popen instead of run fixed the blocking issue. Somehow I missed this in the documentation and my experiments with both.
universal_newlines=True neatly changed the bytes return to string: easier to handle on the receiving end, although with interleaved empty lines (easy to detect and discard).
Setting bufsize to something tiny (e.g. 1) didn't affect anything; the parent still has to wait for the child to fill the stdout buffer, 8k in my case.
export PYTHONUNBUFFERED=1 before execution did fix the buffering problem. Thanks to wim for the link.
Unless someone comes up with a canonical, nifty solution that makes these obsolete, I'll accept iBug's answer tomorrow.
subprocess.run always spawns the child process, and blocks the thread until it exits.
The only option for you is to use p = subprocess.Popen(...) and read lines with s = p.stdout.readline() or p.stdout.__iter__() (see below).
This code works for me, if the child process flushes stdout after printing a line (see below for extended note).
cmd = ["/usr/bin/python3", "zzz.py"]
test_proc = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT
)
out_data = ""
print(time.time(), "START")
while not "QUIT" in str(out_data):
out_data = test_proc.stdout.readline()
print(time.time(), "MAIN received", out_data)
test_proc.communicate() # shut it down
See my terminal log (dots removed from zzz.py):
ibug@ubuntu:~/t $ python3 p.py
1546450821.9174328 START
1546450821.9793346 MAIN received b'0 sleeping \n'
1546450822.987753 MAIN received b'1 sleeping \n'
1546450823.993136 MAIN received b'2 sleeping \n'
1546450824.997726 MAIN received b'3 sleeping \n'
1546450825.9975247 MAIN received b'4 sleeping \n'
1546450827.0094354 MAIN received b'QUIT this exercise\n'
You can also do it with a for loop:
for out_data in test_proc.stdout:
    if "QUIT" in str(out_data):
        break
    print(time.time(), "MAIN received", out_data)
If you cannot modify the child process, unbuffer (from package expect - install with APT or YUM) may help. This is my working parent code without changing the child code.
test_proc = subprocess.Popen(
    ["unbuffer"] + cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT
)
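Since the update above notes that PYTHONUNBUFFERED=1 fixed the buffering, a related variant (a sketch that only applies because the child here is itself a Python script) is to set that variable just for the child through Popen's env argument:

import os

test_proc = subprocess.Popen(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    env={**os.environ, "PYTHONUNBUFFERED": "1"},  # force the child's stdio to be unbuffered
)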

Indicate no more input without closing pty

When controlling a process using a PTY master/slave pair, I would like to indicate to the process in question that stdin has closed and I have no more content to send, but I would still like to receive output from the process.
The catch is that I only have one file descriptor (the PTY "master") which handles both input from the child process and output to the child process. So closing the descriptor would close both.
Example in Python:
import subprocess, pty, os

master, slave = pty.openpty()
proc = subprocess.Popen(["/bin/cat"], stdin=slave, stdout=slave)
os.close(slave)  # now belongs to child process
os.write(master, "foo")
magic_close_fn(master)  # <--- THIS is what I want
while True:
    out = os.read(master, 4096)
    if out:
        print out
    else:
        break
proc.wait()
You need to get separate read and write file descriptors. The simple way to do that is with a pipe and a PTY. So now your code would look like this:
import subprocess, pty, os

master, slave = pty.openpty()
child_stdin, parent_stdin = os.pipe()
proc = subprocess.Popen(["/bin/cat"], stdin=child_stdin, stdout=slave)
os.close(child_stdin)  # now belongs to child process
os.close(slave)
os.write(parent_stdin, "foo")  # write to the write end (our end) of the child's stdin
# Here's the "magic" close function
os.close(parent_stdin)
while True:
    out = os.read(master, 4096)
    if out:
        print out
    else:
        break
proc.wait()
I had to do this today, ended up here and was sad to see no answer. I achieved this using a pair of ptys rather than a single pty.
stdin_master, stdin_slave = os.openpty()
stdout_master, stdout_slave = os.openpty()

def child_setup():
    os.close(stdin_master)   # only the parent needs this
    os.close(stdout_master)  # only the parent needs this

with subprocess.Popen(cmd,
                      start_new_session=True,
                      stderr=subprocess.PIPE,
                      stdin=stdin_slave,
                      stdout=stdout_slave,
                      preexec_fn=child_setup) as proc:
    os.close(stdin_slave)   # only the child needs this
    os.close(stdout_slave)  # only the child needs this
    stdin_pty = io.FileIO(stdin_master, "w")
    stdout_pty = io.FileIO(stdout_master, "r")
    stdin_pty.write(b"here is your input\r")
    stdin_pty.close()  # no more input (EOF)
    output = b""
    while True:
        try:
            output += stdout_pty.read(1)
        except OSError:
            # EOF
            break
    stdout_pty.close()
I think what you want is to send the CTRL-D (EOT, End Of Transmission) character, isn't it? This will close the input in some applications, but others will quit.
perl -e 'print qq,\cD,'
or purely shell:
echo -e '\x04' | nc localhost 8080
Both are just examples. BTW, the CTRL-D character is \x04 in hex.
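Applied back to the original Python example, that would mean writing the EOT byte to the master in place of the hypothetical magic_close_fn. This is only a sketch: it relies on the pty staying in its default canonical mode, where an EOT at the start of a line makes the slave's next read() return end-of-file.

os.write(master, "foo\n")  # finish the pending line first
os.write(master, "\x04")   # EOT (CTRL-D): the child now sees EOF on its stdin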
