Python 3 multiprocessing or multithreading for a loop

I have a big ML-based package with several modules (4) that read and write their own I/O sequentially. I also have several files (a variable number of them). I understand the difference between a thread and a process, but I am still puzzled about which one would make more sense to implement. A dummy structure looks like this:
import os
import module1
import module2
import module3
import module4

for fl in list_of_files:
    tmp_path = os.path.join('tmp', fl)  # here we create the folder which holds all tmp files
    module1.do_stuff(fl)
    module2.do_stuff(tmp_path)  # input here is the output of module1
    module3.do_stuff(tmp_path)  # input is the output of module2
    module4.do_stuff(tmp_path)  # input here is the output of module3

aggregate_results('tmp/')  # takes all outputs from module4 and combines them into a single file
Now, my question is: does it make sense to split the work per file, like this?
import multiprocessing.dummy as mp  # note: multiprocessing.dummy is backed by threads, not processes

def small_proc(fl):
    tmp_path = os.path.join('tmp', fl)
    module1.do_stuff(fl)
    module2.do_stuff(tmp_path)
    module3.do_stuff(tmp_path)
    module4.do_stuff(tmp_path)

p = mp.Pool(len(list_of_files))
p.map(small_proc, list_of_files)
p.close()
p.join()
Or would it be better to split it by stage, since we can safely run the loop offset by one module, pipelining the stages (if that makes sense to anybody)?
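For illustration, a minimal sketch of the per-file split with a real process pool (using the standard-library concurrent.futures; module1 through module4, list_of_files, and aggregate_results are the names from the question and are assumed importable):

import os
from concurrent.futures import ProcessPoolExecutor

import module1
import module2
import module3
import module4

def small_proc(fl):
    # the whole four-stage pipeline for one file, run inside a worker process
    tmp_path = os.path.join('tmp', fl)
    module1.do_stuff(fl)
    module2.do_stuff(tmp_path)
    module3.do_stuff(tmp_path)
    module4.do_stuff(tmp_path)

if __name__ == '__main__':
    # one worker per CPU core; a pool of len(list_of_files) workers can oversubscribe the machine
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        list(pool.map(small_proc, list_of_files))
    aggregate_results('tmp/')

If the four modules spend most of their time in I/O or in libraries that release the GIL, ThreadPoolExecutor can be swapped in with no other changes.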

Related

How to implement Multiprocessing in Azure Databricks - Python

I need to get details of each file in a directory, and it is taking a long time. I want to implement multiprocessing so that the run completes sooner.
My code is like this:
from pathlib import Path
from os.path import getmtime, getsize
from datetime import datetime
from multiprocessing import Pool, Process

def iterate_directories(root_dir):
    for child in Path(root_dir).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            file_size = getsize(child)
            # further steps...
        else:
            iterate_directories(child)  ## I need this to run on a separate Process (in parallel)
I tried to make the recursive call spawn a Process, as below, but it is not working; it comes out of the loop immediately.
        else:
            p = Process(target=iterate_directories, args=(child,))
            Pros.append(p)  # Pros is declared earlier as an empty list
            p.start()

    for p in Pros:
        if not p.is_alive():
            p.join()
What am I missing here? How can I process the sub-directories in parallel?
You have to get the list of directories first, and then use a multiprocessing pool to call the function.
Something like below:
from pathlib import Path
from os.path import getmtime, getsize
from datetime import datetime
from multiprocessing import Pool

def iterate_directories(root_dir):
    file_details = ''
    for child in Path(root_dir).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            file_size = getsize(child)
            file_details += '\n{} {} {}'.format(child, modified_time, file_size)
        else:
            file_details += iterate_directories(child)  # recurse into sub-directories
    return file_details  # the details collected from that particular directory

pool = Pool(processes=4)  # define how many processes you would like to run in parallel
results = pool.map(iterate_directories, directory_list)  # your explicit list of top-level directories
print(results)  # the entire collection; a list you can iterate at the directory level
Please let me know how it goes.
The problem is this line:

    if not p.is_alive():

What this translates to is: wait for the process to complete only if it has already completed, which obviously does not make much sense (you need to remove the not from the statement). It is also completely unnecessary: calling .join does the same check internally that p.is_alive does (except that it blocks). So you can safely just do this:

    for p in Pros:
        p.join()
The code will then wait for all child processes to finish.
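Put together, a minimal sketch of the spawn-then-join pattern (the directory layout here is hypothetical):

from multiprocessing import Process
from pathlib import Path

def iterate_directories(root_dir):
    ...  # collect file details for root_dir, as in the question

if __name__ == '__main__':
    procs = [Process(target=iterate_directories, args=(child,))
             for child in Path('.').iterdir() if child.is_dir()]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # blocks until that child process has finished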

The result is different when placing torch.distributed.rpc.rpc_async in a different .py file

I want to execute a function on the worker side and return the result to the master. However, I find that the result is different depending on which .py file rpc_async is placed in.
Method 1
master.py:
import os
import torch
import torch.distributed.rpc as rpc
from torch.distributed.rpc import RRef
from test import sub_fun
os.environ['MASTER_ADDR'] = '10.5.26.19'
os.environ['MASTER_PORT'] = '5677'
rpc.init_rpc("master", rank=0, world_size=2)
rref = torch.Tensor([0])
sub_fun(rref)
rpc.shutdown()
test.py:
import torch.distributed.rpc as rpc

def f(rref):
    print("function is executed on master")

def sub_fun(rref):
    x = rpc.rpc_async("worker", f, args=(rref,))
worker.py:
import os
import torch
import torch.distributed.rpc as rpc
from torch.distributed.rpc import RRef
os.environ['MASTER_ADDR'] = '10.5.26.19'
os.environ['MASTER_PORT'] = '5677'
def f(rref):
    print("function is executed on worker")
rpc.init_rpc("worker", rank=1, world_size=2)
rpc.shutdown()
I found that the output at the worker side is "function is executed on master".
Method 2
When I put the two functions, sub_fun and f, in master.py rather than test.py, the result is "function is executed on worker".
Why do the two ways output different results, and how can I get result 2 with method 1?
RPC assumes that the modules and functions are consistent across all processes. In this case, you have the same function f with different implementations in the two processes. Note that RPC does not ship the entire function code to the remote side for execution; it only looks up the function f on the remote side and executes whatever exists there.
I'd suggest using two differently named functions to achieve what you are trying to do here, since depending on how you import files, one function implementation might override the other.
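For example, one way to get result 2 while keeping the layout of method 1 is to define f once, in a module that both sides can import, so the lookup by qualified name finds the same implementation on the worker (remote_fns.py is a hypothetical name; this is a sketch, not the poster's actual layout):

# remote_fns.py -- importable by both the master and the worker
def f(rref):
    print("function is executed on worker")

# test.py (master side)
import torch.distributed.rpc as rpc
from remote_fns import f

def sub_fun(rref):
    rpc.rpc_async("worker", f, args=(rref,))

Because rpc_async ships only the qualified name remote_fns.f, the worker imports remote_fns and runs this single implementation.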

Problem with python looping when I have no loop in my code

I have a main program that calls some modules. For some reason, when I run the code, it loops over parts of the main code even though there is no loop in the code.
import os
import datetime
import multiprocessing as mp
import shutil

# make a temp folder for data files
date = str(datetime.datetime.now())
date = date[0:19]
date = date.replace(':', '-')
temp_folder_name = 'temp_data_files_' + date
os.mkdir(temp_folder_name)

# make a folder for sim files
date = str(datetime.datetime.now())
date = date[0:19]
date = date.replace(':', '-')
save_folder_name = 'Sim_files_' + date
os.mkdir(save_folder_name)

# make data files and save them in the temp folder
import make
make.data('model_1', temp_folder_name)  # model name and folder for results

# run the file on multiple cores
import distributed
corecount = mp.cpu_count()  # edit this value to the number of cores you want to use on your computer
if __name__ == '__main__':
    distributed.simulate(temp_folder_name, corecount, save_folder_name)
The program should make two folders. It then uses 'make' to create some files and put them in the temp folder. It should then use 'distributed' to do some things with the files and save the results in the 'Sim_files' folder. But for some reason it makes several folders on each run (with slightly different time stamps).
The distributed function includes some links, but I don't think these should have an effect on the main program.
The if __name__ == '__main__' line is a multiprocessing guard against infinite looping.
I have found a solution to this. It is to do with the way multiprocessing works: the child processes import the main program as if it were a module, which led to the main program being run once for each multiprocessing instance.
The solution is to move the if __name__ == '__main__': line to the start of the main program. This ensures that the code only runs when the file is executed as itself, rather than when it is imported like a module :)
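A minimal sketch of the fixed structure, using the names from the question (the make and distributed modules are assumed to exist):

import datetime
import multiprocessing as mp
import os

if __name__ == '__main__':
    # all side effects now live under the guard, so child processes that
    # import this file do not re-run the folder creation
    date = str(datetime.datetime.now())[0:19].replace(':', '-')
    temp_folder_name = 'temp_data_files_' + date
    os.mkdir(temp_folder_name)
    save_folder_name = 'Sim_files_' + date
    os.mkdir(save_folder_name)

    import make
    make.data('model_1', temp_folder_name)

    import distributed
    distributed.simulate(temp_folder_name, mp.cpu_count(), save_folder_name)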

How to import variable values from a file that is running?

I am running a Python file, say file1, and in it I import another Python file, say file2, and call one of its functions. Now, file2 needs the value of a variable which is defined in file1. Also, before file2 is imported in file1, the value of the variable is changed at run time. How do I make file2 access the current value of the variable from file1?
The content of file1 is:
variable = None

if __name__ == '__main__':
    variable = 123
    from file2 import func1
    func1()
The content of file2 is:
from file1 import variable as var

def func1():
    print(var)
When I run file1, I want the function func1 in file2 to print 123, but it prints None. One way I could tackle this is by saving the content of the variable to an ordinary file when it is modified, and then retrieving it when needed. But in the application where I am using this code, the variable is massive, around 300 MB, so I believe it won't be efficient to write its content to a text file every time it is modified. How do I do this? (Any suggestions are welcome.)
The main script is run under the name __main__, not under its module name; this is also how the if __name__ == '__main__' check works. Importing the file by its regular name creates a separate module with the regular content.
If you want to access its attributes, import it as __main__:
from __main__ import variable as var

def func1():
    print(var)
Note that importing __main__ is fragile. On top of duplicating the module, you may end up importing a different module if your program structure changes. If you want to exchange global data, use well-defined module names:
# constants.py
variable = None

# file1.py
if __name__ == '__main__':
    import constants
    constants.variable = 123
    from file2 import func1
    func1()

# file2.py
from constants import variable as var

def func1():
    print(var)
Mandatory disclaimer: Ideally, functions do not rely on global variables. Use parameters for passing variables into functions:
# constants.py
variable = None

# file1.py
if __name__ == '__main__':
    from file2 import func1
    func1(123)

# file2.py
from constants import variable

def func1(var=variable):
    print(var)
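As a quick way to see the module duplication described above, a script can import itself and compare the two module objects (demo.py is a hypothetical, self-contained example):

# demo.py
import sys

variable = None

if __name__ == '__main__':
    variable = 123
    import demo  # executes this same file again, under the name 'demo'
    print(demo.variable)  # None: the imported copy never ran the guarded block
    print(sys.modules['__main__'] is sys.modules['demo'])  # False: two objects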

import and rename functions from a folder - Python 3 [duplicate]

I would like to import all methods from a module with altered names.
For instance, instead of
from module import repetitive_methodA as methodA, \
    repetitive_Class1 as Class1, \
    repetitive_instance4 as instance4
I'd prefer something along the lines of
from module import * as *-without-"repetitive_"
This is a rephrasing of this clumsy unanswered question; I have not been able to find a solution or similar questions yet.
You can do it this way:
import module
import inspect

for (k, v) in inspect.getmembers(module):
    if k.startswith('repetitive_'):
        globals()[k.partition("_")[2]] = v
Edit in response to the comment "how is this answer intended to be used?"
Suppose module looks like this:
# module
def repetitive_A():
    print("This is repetitive_A")

def repetitive_B():
    print("This is repetitive_B")
Then after running the rename loop, this code:
A()
B()
produces this output:
This is repetitive_A
This is repetitive_B
What I would do is create a work-around.
Suppose you have a file named some_file.py in the current directory, composed of...
# some_file.py
def rep_a():
    return 1

def rep_b():
    return 2

def rep_c():
    return 3
When you import something, you create an object whose attributes are the classes, variables, and functions of your file. To get what you want, the idea is to add new attributes to that object, each pointing at one of the original functions you wanted to rename. The function redirect_function() takes the imported module object as its first parameter and iterates through its attributes: for each name that begins with the given prefix, it creates another attribute, under the shortened name, that points at the same function.
tl;dr: this creates a second name for each original function, while the original name also remains.
See example below. :)
def redirect_function(file_import, prefix='rep_'):
    # List the functions and other attributes of the imported file.
    names = dir(file_import)
    for name in names:
        # If the name begins with the prefix, add another attribute that
        # points at the original function, under the shortened name.
        if name.startswith(prefix):
            func = getattr(file_import, name)
            setattr(file_import, name[len(prefix):], func)

if __name__ == '__main__':
    import some_file
    redirect_function(some_file)
    print(some_file.rep_a(), some_file.rep_b(), some_file.rep_c())
    print(some_file.a(), some_file.b(), some_file.c())
This outputs...
1 2 3
1 2 3
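If the renaming is wanted in more than one place, the loop from the first answer can live in a small wrapper module, so callers simply import the short names (renamed.py and module are hypothetical names following the question):

# renamed.py
import inspect
import module  # the module with the repetitive_ names

_PREFIX = 'repetitive_'

for _name, _obj in inspect.getmembers(module):
    if _name.startswith(_PREFIX):
        globals()[_name[len(_PREFIX):]] = _obj

# elsewhere: from renamed import methodA, Class1, instance4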
