How to change global variables when using parallel programming - python-3.x

I am using multiprocessing in my code to do some things in parallel. In a simplified version of my goal, I want to change some global variables from two different processes running in parallel.
But at the end of the run, the result retrieved from mp.Queue is correct, yet the global variables themselves are unchanged.
Here is a simplified version of the code:
import multiprocessing as mp

a = 3
b = 5

# define an example function
def f(length, output):
    global a
    global b
    if length == 5:
        a = length + a
        output.put(a)
    if length == 3:
        b = length + b
        output.put(b)

if __name__ == '__main__':
    # Define an output queue
    output = mp.Queue()

    # Set up a list of processes that we want to run
    processes = []
    processes.append(mp.Process(target=f, args=(5, output)))
    processes.append(mp.Process(target=f, args=(3, output)))

    # Run processes
    for p in processes:
        p.start()

    # Exit the completed processes
    for p in processes:
        p.join()

    # Get process results from the output queue
    results = [output.get() for p in processes]
    print(results)
    print("a:", a)
    print("b:", b)
And below is the output:
[8, 8]
a: 3
b: 5
How can I apply the results of the processes to the global variables? Or how can I run this code with multiprocessing and get the same answer as a simple threaded version?

When you use threading, the two (or more) threads are created within the same process and share its memory (globals).
When you use multiprocessing, a whole new process is created and each one gets its own copy of the memory (globals).
You could look at the multiprocessing Value/Array types or a Manager to get pseudo-globals, i.e. shared objects.
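For example, a minimal sketch of the question's code using mp.Value (the shared integers are passed to the worker explicitly instead of relying on module globals) might look like this:
import multiprocessing as mp

def f(length, a, b):
    # a and b are mp.Value objects; .value is the shared integer
    if length == 5:
        with a.get_lock():
            a.value += length
    if length == 3:
        with b.get_lock():
            b.value += length

if __name__ == '__main__':
    a = mp.Value('i', 3)   # shared int, initial value 3
    b = mp.Value('i', 5)   # shared int, initial value 5
    processes = [mp.Process(target=f, args=(5, a, b)),
                 mp.Process(target=f, args=(3, a, b))]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print("a:", a.value)   # 8
    print("b:", b.value)   # 8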

Related

How to run Multiprocessing Pool code in Python without mentioning __main__ environment

When I run Python's multiprocessing Pool with the __main__ environment, I get the expected output, i.e. the time is reduced due to parallel processing.
But when I run the same code without the __main__ environment, it just throws errors:
from multiprocessing import Pool
import os
import time

def get_acord_detected_json(page_no):
    time.sleep(5)
    return page_no * page_no

def main():
    n_processes = 2
    page_num_list = [1, 2, 3, 4, 5]
    print("n_processes : ", n_processes)
    print("page_num_list : ", page_num_list)
    print("CPU count : ", os.cpu_count())
    t1 = time.time()
    with Pool(processes=n_processes) as pool:
        acord_js_list = pool.map(get_acord_detected_json, page_num_list)
    print("acord_js_list : ", acord_js_list)
    t2 = time.time()
    print("t2-t1 : ", t2 - t1)

if __name__ == "__main__":
    main()
Output :
n_processes : 2
page_num_list : [1, 2, 3, 4, 5]
CPU count : 8
acord_js_list : [1, 4, 9, 16, 25]
t2-t1 : 15.423236846923828
But when I do
main()
instead of
if __name__ == "__main__":
    main()
I get non-stopping error logs (crash logs).
When Python launches a secondary multiprocessing.Process, it imports the script again in each Process's space. (I use Windows and this is always true, but according to the documentation it is not necessarily the case on other OS's.) This even applies to process Pools, which is what you are using.
During these extra imports, the __name__ variable is set to something other than "__main__". So you can separate code that you want to run in every child Process from code that you want to run only once. That is the purpose of the if __name__ == "__main__" statement.
When you run your script with this statement omitted, the module is loaded again and again, in every one of the child processes. Every process tries to run the function main(), which then tries to launch more child processes, which try to launch more, and so on. The whole thing crashes, as you observe. With that line of code present, the main() function runs only once, in the main Process, where it works properly, launching all the other Processes.
I'm afraid you are stuck writing that line of code. Life is full of unpleasant necessities. But it's probably less disruptive than switching everything to another operating system.
See also python multiprocessing on windows, if __name__ == "__main__"
Unfortunately, the standard python library docs present this issue as a commandment that must be obeyed rather than an explanation that can be understood (in my opinion).
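A quick way to see the re-importing in action is a sketch like the following (the exact output depends on the OS and start method; with the spawn start method used on Windows the line prints once per process):
from multiprocessing import Pool

# Module-level code runs on every import, including the re-imports
# performed when child processes are spawned.
print("importing, __name__ =", __name__)

def square(x):
    return x * x

if __name__ == "__main__":
    # Only the original run of the script enters this block; the children
    # skip it, which is what prevents runaway process creation.
    with Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))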

How to implement Multiprocessing in Azure Databricks - Python

I need to get the details of each file from a directory. It is taking a long time, so I need to implement multiprocessing so that the execution can be completed earlier.
My code is like this:
from pathlib import Path
from os.path import getmtime, getsize
from datetime import datetime
from multiprocessing import Pool, Process

def iterate_directories(root_dir):
    for child in Path(root_dir).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            file_size = getsize(child)
            # further steps...
        else:
            iterate_directories(child)  ## I need this to run on a separate Process (in parallel)
I tried to do a recursive call using the code below, but it is not working. It comes out of the loop immediately.
else:
    p = Process(target=iterate_directories, args=(child,))
    Pros.append(p)  # Pros is declared earlier as an empty list
    p.start()

for p in Pros:
    if not p.is_alive():
        p.join()
What am I missing here? How can I process the sub-directories in parallel?
You have to get the list of directories first and then use a multiprocessing pool to call the function.
Something like below:
from pathlib import Path
from os.path import getmtime, getsize
from datetime import datetime
from multiprocessing import Pool

def iterate_directories(root_dir):
    Filedetails = ''
    for child in Path(root_dir).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            file_size = getsize(child)
            Filedetails = Filedetails + '\n' + '{add file name details}' + str(modified_time) + ' ' + str(file_size)
        else:
            Filedetails = Filedetails + iterate_directories(child)  # recurse into sub-directories
    return Filedetails  # file details from that particular directory

pool = Pool(processes={define how many processes you like to run in parallel})
results = pool.map(iterate_directories, {explicit directory list})
print(results)  # the entire collection is printed here; it is basically a list you can iterate at the individual directory level
Please let me know how it goes.
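A more concrete, runnable variant of that idea (just a sketch; the root path and worker count are placeholders, and it only parallelises over the first level of sub-directories) could look like:
from datetime import datetime
from os.path import getmtime, getsize
from pathlib import Path
from multiprocessing import Pool

def collect_file_details(directory):
    # Gather (path, modified date, size) for every file directly under one directory.
    details = []
    for child in Path(directory).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            details.append((str(child), str(modified_time), getsize(child)))
    return details

if __name__ == "__main__":
    root = "/dbfs/mnt/data"              # placeholder root directory
    subdirs = [d for d in Path(root).iterdir() if d.is_dir()]
    with Pool(processes=4) as pool:      # pick a count that suits the cluster
        results = pool.map(collect_file_details, subdirs)
    for directory, details in zip(subdirs, results):
        print(directory, details)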
The problem is this line:
if not p.is_alive():
What this translates to is: only wait for the process if it has already finished, which obviously does not make much sense (you would need to remove the not from the condition). It is also completely unnecessary: calling .join() does internally much the same check that p.is_alive() does, except that it blocks until the process finishes. So you can safely just do this:
for p in Pros:
    p.join()
The code will then wait for all child processes to finish.

Using multiprocessing on Windows with multithreading - python

I have a script which works like this: a list of elements and a function:
def fct(elm):
    # do work
After that I start the threads (3), and at the end of every thread I print the name of the element, like this:
import threading
from queue import Queue

jobs = Queue()

def do_stuff(q):
    while not q.empty():
        value = q.get()
        fct(value)
        q.task_done()

for i in lines:   # lines is the list of elements to process
    jobs.put(i)

for i in range(3):
    worker = threading.Thread(target=do_stuff, args=(jobs,))
    worker.start()

jobs.join()
What I want to do is: whenever a thread is done (a file is saved), start another process which reads that file and applies another function fct2.
Note: I'm using Windows.
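One possible arrangement (a sketch only; fct, fct2 and the returned file path are placeholders, not the original code) is to let each worker thread hand the finished file to a process pool as soon as it is saved:
import threading
import queue
from multiprocessing import Pool

def fct(elm):
    # placeholder: do the work and return the path of the saved file
    return f"{elm}.out"

def fct2(path):
    # placeholder: CPU-heavy post-processing of the saved file
    return path

def do_stuff(jobs, pool, results):
    while True:
        try:
            value = jobs.get_nowait()
        except queue.Empty:
            return
        path = fct(value)
        # hand the finished file to a separate process right away
        results.append(pool.apply_async(fct2, (path,)))

if __name__ == "__main__":   # required on Windows
    lines = ["a", "b", "c", "d"]
    jobs = queue.Queue()
    for i in lines:
        jobs.put(i)
    results = []
    with Pool(2) as pool:
        threads = [threading.Thread(target=do_stuff, args=(jobs, pool, results))
                   for _ in range(3)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print([r.get() for r in results])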

Multiprocessing and lists in Python

I have a list of jobs, but due to certain conditions not all of the jobs should run in parallel at the same time, because sometimes it is important that a finishes before I start b, or vice versa (actually it is not important which one runs first, just that they do not both run at the same time). So I thought I would keep a list of the currently running jobs, and whenever a new one starts it checks this list to see whether it can proceed or not. I wrote some sample code for that:
from time import sleep
from multiprocessing import Pool

def square_and_test(x):
    print(running_list)
    if not x in running_list:
        running_list = running_list.append(x)
        sleep(1)
        result_list = result_list.append(x**2)
        running_list = running_list.remove(x)
    else:
        print(f'{x} is currently worked on')

task_list = [1, 2, 3, 4, 1, 1, 4, 4, 2, 2]
running_list = []
result_list = []

pool = Pool(2)
pool.map(square_and_test, task_list)
print(result_list)
This code fails with UnboundLocalError: local variable 'running_list' referenced before assignment, so I guess my workers don't have access to the global variables. Is there a way around this? If not, is there another way to solve this problem?
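One way around this (a sketch rather than a fix of the exact code above) is to put the shared lists behind a multiprocessing.Manager and pass the proxies to the workers explicitly:
from time import sleep
from multiprocessing import Pool, Manager

def square_and_test(x, running_list, lock, result_list):
    with lock:
        if x in running_list:
            print(f'{x} is currently worked on')
            return
        running_list.append(x)
    sleep(1)
    result_list.append(x ** 2)
    with lock:
        running_list.remove(x)

if __name__ == '__main__':
    task_list = [1, 2, 3, 4, 1, 1, 4, 4, 2, 2]
    with Manager() as manager:
        running_list = manager.list()   # shared, proxy-backed lists
        result_list = manager.list()
        lock = manager.Lock()
        with Pool(2) as pool:
            pool.starmap(square_and_test,
                         [(x, running_list, lock, result_list) for x in task_list])
        print(list(result_list))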

Need to do CPU bound processing using 2+ processes in Python by reading from a gzipped file

I have a gzipped file (compressed 10 GB, uncompressed 100 GB) which has some reports separated by demarcation lines, and I have to parse it.
Parsing and processing the data is taking a long time and is therefore a CPU-bound problem (not an IO-bound problem). So I am planning to split the work across multiple processes using the multiprocessing module. The problem is that I am unable to send/share data with the child processes efficiently. I am using subprocess.Popen to stream in the uncompressed data in the parent process:
process = subprocess.Popen('gunzip --keep --stdout big-file.gz',
                           shell=True,
                           stdout=subprocess.PIPE)
I am thinking of using a Lock() to read/parse one report in child-process-1 and then release the lock, switch to child-process-2 to read/parse the next report, and then switch back to child-process-1 for the one after that. But when I pass process.stdout as an argument to the child processes, I get a pickling error.
I have tried creating a multiprocessing.Queue() and a multiprocessing.Pipe() to send data to the child processes, but this is way too slow (in fact it is way slower than doing it in a single thread, i.e. serially).
Any thoughts/examples on how to send data to child processes efficiently would help.
Could you try something simple instead? Have each worker process run its own instance of gunzip, with no interprocess communication at all. Worker 1 can process the first report and just skip over the second. The opposite for worker 2. Each worker skips every other report. Then an obvious generalization to N workers.
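A rough sketch of that idea (assuming the reports are separated by a known delimiter line; DELIM, crunch and the file name are placeholders):
import subprocess
from multiprocessing import Pool

NWORKERS = 4
DELIM = "xxxxx\n"   # assumed report separator

def crunch(report):
    # placeholder for the real parsing/processing of one report
    return len(report)

def worker(worker_id):
    # Each worker runs its own gunzip and only processes every
    # NWORKERS-th report, skipping the others.
    p = subprocess.Popen("gunzip --keep --stdout big-file.gz",
                         shell=True, text=True, stdout=subprocess.PIPE)
    total, report, index = 0, [], 0
    for line in p.stdout:
        if line == DELIM:
            if index % NWORKERS == worker_id:
                total += crunch(report)
            report, index = [], index + 1
        else:
            report.append(line)
    if report and index % NWORKERS == worker_id:
        total += crunch(report)
    p.stdout.close()
    p.wait()
    return total

if __name__ == "__main__":
    with Pool(NWORKERS) as pool:
        print(sum(pool.map(worker, range(NWORKERS))))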
Or not ...
I think you'll need to be more specific about what you tried, and perhaps give more info about your problem (like: how many records are there? how big are they?).
Here's a program ("genints.py") that prints a bunch of random ints, one per line, broken into groups via "xxxxx\n" separator lines:
from random import randrange, seed

seed(42)
for i in range(1000):
    for j in range(randrange(1, 1000)):
        print(randrange(100))
    print("xxxxx")
Because it forces the seed, it generates the same stuff every time. Now a program to process those groups, both in parallel and serially, via the most obvious way I first thought of. crunch() takes time quadratic in the number of ints in a group, so it's quite CPU-bound. The output from one run, using (as shown) 3 worker processes for the parallel part:
parallel result: 10,901,000,334 0:00:35.559782
serial result: 10,901,000,334 0:01:38.719993
So the parallelized run took about one-third the time. In what relevant way(s) does that differ from your problem? Certainly, a full run of "genints.py" produces less than 2 million bytes of output, so that's a major difference - but it's impossible to guess from here whether it's a relevant difference. Perhaps, e.g., your problem is only very mildly CPU-bound? It's obvious from the output here that the overheads of passing chunks of stdout to worker processes are all but insignificant in this program.
In short, you probably need to give people - as I just did for you - a complete program they can run that reproduces your problem.
import multiprocessing as mp

NWORKERS = 3
DELIM = "xxxxx\n"

def runjob():
    import subprocess
    # 'py' is just a shell script on my box that
    # invokes the desired version of Python -
    # which happened to be 3.8.5 for this run.
    p = subprocess.Popen("py genints.py",
                         shell=True,
                         text=True,
                         stdout=subprocess.PIPE)
    return p.stdout

# Return list of lines up to (but not including) next DELIM,
# or EOF. If the file is already exhausted, return None.
def getrecord(f):
    result = []
    foundone = False
    for line in f:
        foundone = True
        if line == DELIM:
            break
        result.append(line)
    return result if foundone else None

def crunch(rec):
    total = 0
    for a in rec:
        for b in rec:
            total += abs(int(a) - int(b))
    return total

if __name__ == "__main__":
    import datetime
    now = datetime.datetime.now

    s = now()
    total = 0
    f = runjob()
    with mp.Pool(NWORKERS) as pool:
        for i in pool.imap_unordered(crunch,
                                     iter((lambda: getrecord(f)), None)):
            total += i
    f.close()
    print(f"parallel result: {total:,}", now() - s)

    s = now()
    # try the same thing serially
    total = 0
    f = runjob()
    while True:
        rec = getrecord(f)
        if rec is None:
            break
        total += crunch(rec)
    f.close()
    print(f"serial result: {total:,}", now() - s)
