Problem with python looping when I have no loop in my code - python-3.x

I have a main program that calls some modules. For some reason, when I run the code it repeats parts of the main code, even though there is no loop in the code.
import os
import datetime
import multiprocessing as mp
import shutil
#make temp folder for data files
date = str(datetime.datetime.now())
date = date[0:19]
date = date.replace(':', '-')
temp_folder_name = 'temp_data_files_' + date
os.mkdir(temp_folder_name)
#make folder for sim files
date = str(datetime.datetime.now())
date = date[0:19]
date = date.replace(':', '-')
save_folder_name = 'Sim_files_' + date
os.mkdir(save_folder_name)
#make data files and save in temp folder
import make
make.data('model_1',temp_folder_name) #model name and folder for results
#run file on multiple cores
import distributed
corecount = mp.cpu_count() # edit this value to the number of cores you want to use on your computer
if __name__ == '__main__':
    distributed.simulate(temp_folder_name, corecount, save_folder_name)
The program should make two folders. It then uses 'make' to create some files and put them in the temp folder. It should then use 'distributed' to do some things with the files and save the results in the 'Sim_files' folder. But for some reason it creates several of each folder (with slightly different time stamps).
The distributed function includes some links, but I don't think these should affect the main program.
The if __name__ == '__main__': line is a multiprocessing guard against infinite looping.

I have found a solution to this. It has to do with the way multiprocessing works: the child processes import the main program like a module, which leads to the main program being run once for each instance of multiprocessing.
The solution is to move the if __name__ == '__main__': guard to the start of the main program. This ensures the code only runs when the file is executed directly, rather than when it is imported like a module :)
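For reference, a minimal sketch of the corrected layout (same module names as above; the folder creation is simply folded under the guard, and the timestamp is computed once for both folders):

import os
import datetime
import multiprocessing as mp

if __name__ == '__main__':
    # Everything lives under the guard, so child processes that
    # re-import this file as a module do not re-run it.
    date = str(datetime.datetime.now())[0:19].replace(':', '-')
    temp_folder_name = 'temp_data_files_' + date
    os.mkdir(temp_folder_name)
    save_folder_name = 'Sim_files_' + date
    os.mkdir(save_folder_name)

    import make
    make.data('model_1', temp_folder_name)  # model name and folder for results

    import distributed
    corecount = mp.cpu_count()
    distributed.simulate(temp_folder_name, corecount, save_folder_name)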

Related

How to implement Multiprocessing in Azure Databricks - Python

I need to get the details of each file in a directory. It is taking a long time, so I want to implement multiprocessing so that execution finishes sooner.
My code is like this:
from datetime import datetime
from os.path import getmtime, getsize
from pathlib import Path
from multiprocessing import Pool, Process

def iterate_directories(root_dir):
    for child in Path(root_dir).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            file_size = getsize(child)
            # further steps...
        else:
            iterate_directories(child)  ## I need this to run on a separate Process (in parallel)
I tried to make the recursive call a separate process using the code below, but it is not working; it exits the loop immediately.
    else:
        p = Process(target=iterate_directories, args=(child))
        Pros.append(p)  # Pros is declared as an empty list
        p.start()

for p in Pros:
    if not p.is_alive():
        p.join()
What am I missing here? How can I process the sub-directories in parallel?
You have to get the list of directories first and then use a multiprocessing pool to call the function. Something like below:
from datetime import datetime
from os.path import getmtime, getsize
from pathlib import Path
from multiprocessing import Pool

def iterate_directories(root_dir):
    Filedetails = ''
    for child in Path(root_dir).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            file_size = getsize(child)
            Filedetails = Filedetails + '\n' + '{add file name details}' + str(modified_time) + str(file_size)
        else:
            Filedetails = Filedetails + iterate_directories(child)  # recurse into sub-directories
    return Filedetails  # file details returned from that particular directory

pool = Pool(processes={define how many processes you like to run in parallel})
results = pool.map(iterate_directories, {explicit directory list})
print(results)  # the entire collection is printed here; it is a list you can iterate per directory
Please let me know how it goes.
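As a concrete, runnable version of that idea (a minimal sketch; the root path and pool size are placeholders, and it collects details as tuples rather than one big string):

from datetime import datetime
from multiprocessing import Pool
from os.path import getmtime, getsize
from pathlib import Path

def scan_directory(root_dir):
    # Collect details for the files directly inside one directory.
    details = []
    for child in Path(root_dir).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            details.append((str(child), modified_time, getsize(child)))
    return details

if __name__ == '__main__':
    root = '/dbfs/some/root'  # hypothetical root directory
    # Build the full directory list up front, then fan out one task per directory.
    dirs = [root] + [str(d) for d in Path(root).rglob('*') if d.is_dir()]
    with Pool(processes=4) as pool:  # pool size is an arbitrary choice here
        results = pool.map(scan_directory, dirs)
    for per_dir in results:
        print(per_dir)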
The problem is this line:
if not p.is_alive():
What this translates to is: only wait for the process if it has already completed, which obviously does not make much sense (you need to remove the not from the statement). It is also completely unnecessary: calling .join already does internally what p.is_alive does (except that .join blocks). So you can safely just do this:
for p in Pros:
    p.join()
The code will then wait for all child processes to finish.
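For completeness, a minimal sketch of the question's Process-based approach with the fixes applied; note that args must be a tuple, so args=(child,) needs a trailing comma (a separate bug from the is_alive check), and that spawning one process per directory can create a very large number of processes on deep trees:

from multiprocessing import Process
from pathlib import Path

def iterate_directories(root_dir):
    processes = []
    for child in Path(root_dir).iterdir():
        if child.is_file():
            pass  # gather file details here, as in the question
        else:
            # args must be a tuple: note the trailing comma in (child,).
            p = Process(target=iterate_directories, args=(child,))
            processes.append(p)
            p.start()
    # Wait unconditionally for every child process.
    for p in processes:
        p.join()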

The best way to share a class between processes

First of all, I'm pretty new to multiprocessing and I'm here to learn from all of you. I have several files doing something similar to this:
SharedClass.py:
class simpleClass():
    a = 0
    b = ""
    .....
MyProcess.py:
import multiprocessing
import SharedClass

class FirstProcess(multiprocessing.Process):
    def __init__(self):
        multiprocessing.Process.__init__(self)

    def modifySharedClass(self):
        # Here I want to modify the object shared with main.py defined in SharedClass.py
Main.py:
from MyProcess import FirstProcess
import SharedClass

if __name__ == '__main__':
    pr = FirstProcess()
    pr.start()
    # Here I want to print the initial value of the shared class
    pr.modifySharedClass()
    # Here I want to print the modified value of the shared class
I want to define a shared class (in SharedClass.py) in a kind of shared memory that can be read and written from both files, Main.py and MyProcess.py.
I have tried to use the multiprocessing Manager and multiprocessing.Array, but I'm not having good results: the changes made in one file are not being reflected in the other file (maybe I'm doing this the wrong way).
Any ideas? Thank you.
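A minimal sketch of one way to do this with multiprocessing.Manager: a manager Namespace proxy is passed to the process, and changes made through the proxy are visible in both processes. Note that the modification happens in run(), because a method called from Main.py (like pr.modifySharedClass()) executes in the parent process, not in the child:

import multiprocessing

class FirstProcess(multiprocessing.Process):
    def __init__(self, shared):
        multiprocessing.Process.__init__(self)
        self.shared = shared

    def run(self):
        # Runs in the child process; writes go through the manager proxy.
        self.shared.a = 42
        self.shared.b = "modified by child"

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared = manager.Namespace()  # shared, proxy-backed namespace
    shared.a = 0
    shared.b = ""
    print(shared.a, shared.b)   # initial values
    pr = FirstProcess(shared)
    pr.start()
    pr.join()                   # wait so the child's writes are visible
    print(shared.a, shared.b)   # modified values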

Is it possible to use Google's ortools in my script without downloading the ortools library?

Basically, I have to test my script on a server where I can't add new libraries. How do I write/change my
from ortools.algorithms import pywrapknapsack_solver
in my .py file so that I can still use Google's ortools when I submit to a server without ortools installed? Is there something like an HTML tag that I can just link to in order to use the ortools library?
I have to send my whole code.py file to test, and I can add other .py files along with my code.py.
I tried to download the source code from Google but I don't know how to get it to work.
Currently my code.py:
from __future__ import print_function
from ortools.algorithms import pywrapknapsack_solver

def getBestSet(W, packages):
    final_arr = []
    pID = ['1', '2', '3', '4', '5']  # sample data
    values = [20, 44, 12, 5, 16]
    weights = [10, 11, 21, 3, 9]
    solver = pywrapknapsack_solver.KnapsackSolver(
        pywrapknapsack_solver.KnapsackSolver.
        KNAPSACK_MULTIDIMENSION_BRANCH_AND_BOUND_SOLVER, 'KnapsackExample')
    solver.Init(values, [weights], [W])
    computed_value = solver.Solve()
    packed_items = []
    packed_weights = []
    total_weight = 0
    # print('Total value =', computed_value)
    for i in range(len(values)):
        if solver.BestSolutionContains(i):
            packed_items.append(i)
            packed_weights.append(weights[i])
            total_weight += weights[i]
    # print('Total weight:', total_weight)
    # print('Packed items:', packed_items)
    # print('Packed_weights:', packed_weights)
    for i in packed_items:
        final_arr.append(pID[i])
    return final_arr
You can try Google Colab.
To install or-tools, run !pip install ortools in the first cell,
then put your code in a new cell below the first one.
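If Colab is not an option, one possible workaround (a sketch, not guaranteed: ortools ships compiled C++ extensions, so this only works if the server matches your local Python version and platform) is to vendor the library into a folder that you send along with code.py, using pip's --target option:

pip install --target=vendor ortools

Then, in code.py, put the vendored folder on sys.path before the ortools import ('vendor' is a hypothetical folder name):

import os
import sys

# Make the vendored copy importable before anything tries to import ortools.
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), 'vendor'))

from ortools.algorithms import pywrapknapsack_solver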

Python3 multiprocessing or multi thread for loop

I have a big ML-based package with several modules (4) that sequentially read and write their own I/O. I also have several files (a variable number). I understand the difference between a thread and a process, but I'm still puzzled about which one would make more sense to implement. A dummy structure looks like this:
import module1
import module2
import module3
import module4

for fl in list_of_files:
    tmp_path = os.path.join('tmp', fl)  # here we create the folder which holds all tmp files
    module1.do_stuff(fl)
    module2.do_stuff(tmp_path)  # input here is the output of module1
    module3.do_stuff(tmp_path)  # input is the output of module2
    module4.do_stuff(tmp_path)  # input here is the output of module3

aggregate_results('tmp/')  # this takes all outputs from module4 and combines them into a single file
Now my question is: does it make sense to split it per file, like this?
import multiprocessing.dummy as mp

def small_proc(fl):
    tmp_path = os.path.join('tmp', fl)
    module1.do_stuff(fl)
    module2.do_stuff(tmp_path)
    module3.do_stuff(tmp_path)
    module4.do_stuff(tmp_path)

p = mp.Pool(len(list_of_files))
p.map(small_proc, list_of_files)
p.close()
p.join()
Or does it make more sense to split it according to the run order, since the loop can safely be run offset by one module (if that makes any sense to anybody)?
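As a sketch of the per-file split: assuming the modules are CPU-bound, real processes via multiprocessing.Pool are likely a better fit than multiprocessing.dummy, whose thread pool is limited by the GIL for CPU-heavy work (list_of_files is assumed to be defined as in the question):

import os
from multiprocessing import Pool

import module1
import module2
import module3
import module4

def process_file(fl):
    # Run the four modules sequentially for one file; each file is independent.
    tmp_path = os.path.join('tmp', fl)
    module1.do_stuff(fl)
    module2.do_stuff(tmp_path)
    module3.do_stuff(tmp_path)
    module4.do_stuff(tmp_path)

if __name__ == '__main__':
    # Cap the pool at the CPU count instead of one worker per file.
    with Pool(min(len(list_of_files), os.cpu_count())) as p:
        p.map(process_file, list_of_files)
    aggregate_results('tmp/')  # combine the per-file outputs as before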

How to display Folders and recent items

I have 2 questions about trying to retrieve a set of data from a directory and display it in a ListWidget.
As I am a Linux user, I set my ListWidget to read my directory from the Desktop, which contains, say, 5 folders and 5 miscellaneous items (.txt, .py, etc.).
Currently I am trying to make my ListWidget display just the folders, but it also displays all the other items, making it a total of 10 items instead of 5.
I tried looking it up on the net but I am unable to find any info. Can someone help me?
Pertaining to question 1, I am wondering if it is possible to display the top 3 most recent folders in the ListWidget if a checkbox is checked?
import glob
import os

def test(object):
    testList = QListWidget()
    localDir = os.listdir("/u/ykt/Desktop/test")
    testList.addItems(localDir)
Maybe you should try QFileDialog, like the following:
class MyWidget(QDialog):
    def __init__(self):
        QDialog.__init__(self)
        fileNames = QFileDialog.getExistingDirectory(self, "list dir", "C:\\", QFileDialog.ShowDirsOnly)
        print(fileNames)

if __name__ == "__main__":
    app = QApplication(sys.argv)
    widget = MyWidget()
    widget.show()
    app.exec_()
For the 2nd question, you could refer to this: enter link description here
I guess you are expecting os.listdir() to return only the directory names from the given path. Actually, it returns the file names too. If you want to add only directories to the listWidget, do the following:
import os
osp = os.path

def test(object):
    testList = QListWidget()
    dirPath = "/u/ykt/Desktop/test"
    localDir = os.listdir(dirPath)
    for dir in localDir:
        path = osp.join(dirPath, dir)
        if osp.isdir(path):
            testList.addItem(dir)
This will add only directories to the listWidget ignoring the files.
If you want to get the access time for files and/or folders, use the following method:
import os.path as osp
accessTime = osp.getatime("path/to/dir") # returns the timestamp
Get the access time for all the directories; the one with the greatest value is the most recently accessed directory. This way you can get the 3 most recently accessed directories.
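Putting both parts together, a minimal sketch that returns the 3 most recently accessed sub-directories (latest_dirs is a hypothetical helper name):

import os
import os.path as osp

def latest_dirs(root, count=3):
    # Keep only the entries that are directories.
    dirs = [osp.join(root, name) for name in os.listdir(root)
            if osp.isdir(osp.join(root, name))]
    # Sort newest-first by access time and keep the top `count`.
    return sorted(dirs, key=osp.getatime, reverse=True)[:count]

print(latest_dirs("/u/ykt/Desktop/test"))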
