Python: process multiple PDF files from multiple locations at the same time - python-3.x

I process about 1,500 PDF files a day from three different locations. My problem is that when I run the code, it processes folder 1 first, then folder 2, then folder 3.
I want it to process all folders at the same time: for example, 5 files from folder 1, 3 files from folder 2, and 4 files from folder 3.
We can't wait for folder 1 to finish. Sometimes folder 1 has 500 files, which means I have to wait until the program has processed all of them before it starts on folder 2.
I tried threading, but it does not work: the folders are still processed serially, so folder 1 is finished before folder 2 starts. I want the files in folder 1 and folder 2 to be processed at the same time.
To explain the code below: I have three folder locations that contain PDF files, and a main function that is responsible for converting PDF to PS. I call that function in three different threads, one per location.
inputpath1 = "/121rawfile/FTP HotFolder"
inputpath2 = "/121rawfile/FTP Download File"
inputpath3 = "/121rawfile/Olive"
## define function with variable filename and the format of the timestamp
def timeStamped(filename, fmt='%m-%d-%y-{filename}'):
    os.chdir("/PDF_Flattening")
    return datetime.now().strftime(fmt).format(filename=filename)
filename=timeStamped("log_file_9.log")
logging.basicConfig(filename=filename,format = "%(asctime)s - %(levelname)s - %(message)s",level = logging.DEBUG)
while True:
    try:
        t1=threading.Thread(target=main(inputpath1),args=(inputpath1,))
        t1.start()
        t2=threading.Thread(target=main(inputpath2),args=(inputpath2,))
        t2.start()
        t3=threading.Thread(target=main(inputpath3),args=(inputpath3,))
        t3.start()
        raise Exception("Error simulated!")
    except Exception as e :
        logging.error("failed")

CPython (the standard Python implementation) has a Global Interpreter Lock (GIL), which prevents two threads from executing Python bytecode at the same time in the same process. For CPU-bound tasks and truly parallel execution, we can use the multiprocessing module.
A basic example of how to use the multiprocessing module (from the Python docs):
from multiprocessing import Process
def f(name):
    print('hello', name)
if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
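Separately from the GIL, note that in the question's code `threading.Thread(target=main(inputpath1), ...)` calls `main(inputpath1)` immediately in the calling thread and hands its return value to `Thread`, so each folder finishes before the next thread is even created. The callable itself has to be passed (`target=main`). The sketch below goes one step further and mixes files from all folders in one pool. It is only an illustration: `convert` is a hypothetical stand-in for the per-file PDF-to-PS step, and it assumes that step spends most of its time waiting on I/O or an external tool. If the conversion is CPU-bound Python code, `ProcessPoolExecutor` offers the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import zip_longest
from pathlib import Path


def interleave_pdfs(folders):
    """Return PDF paths in round-robin order: one from each folder in turn."""
    per_folder = [sorted(Path(folder).glob("*.pdf")) for folder in folders]
    return [
        pdf
        for group in zip_longest(*per_folder)
        for pdf in group
        if pdf is not None  # shorter folders are padded with None
    ]


def process_all(folders, convert, max_workers=4):
    """Run convert(pdf_path) on every PDF, mixing files from all folders.

    `convert` is a hypothetical per-file function, not part of the question.
    """
    pdfs = interleave_pdfs(folders)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Pass the callable itself; the pool calls it in the worker threads.
        return list(pool.map(convert, pdfs))
```

With this layout a folder holding 500 files no longer blocks the others, because the work queue alternates between folders instead of draining one folder at a time.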

Related

Why is this code not running in parallel in Python using ThreadPoolExecutor? I'm trying to write to parquet files in parallel

for og_raw_file in de_core.file.rglob(raw_path_object.url):
    with de_core.file.open(og_raw_file, mode="rb") as raw_file, de_core.file.open(
        staging_destination_path + de_core.aws.s3.S3FilePath(raw_file.name).file_name, "wb"
    ) as stager_file, concurrent.futures.ThreadPoolExecutor() as executor:
        logger.info("Submitting file to thread to add metadata", raw_file=raw_file)
        executor.submit(
            <long_length_metadata_function_that_I_want_to_parallize>,
            raw_path_object,
            <...rest of arguments to function>
        )
I want every file to be processed in a separate thread all at once and for the submit not to be blocking. What am I doing wrong? What happens is that each file is submitted one at a time but the next file isn't submitted until the previous one finishes... how do I parallelize this properly?
I would expect the "Submitting file to thread to add metadata" message to appear quickly for every file at the beginning, since the tasks should be submitted and then forgotten, but that's not what's happening.
Do I need to do something like this? Why?
future_mapping = {executor.submit(predicate, uri): uri for uri in uris}
for future in concurrent.futures.as_completed(future_mapping):
The metadata function basically adds columns to a parquet file. Is this something I can't use threads for, given the Python GIL?
Start by reading the docs for Executor.shutdown(), which is called by magic with wait=True when the with block ends.
For the same reason, if you run this trivial program you'll see that you get no useful parallelism either:
def worker(i):
    from time import sleep
    print(f"working on {i}")
    sleep(2)
if __name__ == "__main__":
    from concurrent.futures import ThreadPoolExecutor
    for i in range(10):
        with ThreadPoolExecutor() as ex:
            ex.submit(worker, i)
An executor is intended to be used "many" times after it's created, not just once. You can use it just once, but you don't want to do that ;-)
To "repair" my toy program, swap the lines:
    with ThreadPoolExecutor() as ex:
        for i in range(10):
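Put together, the repaired toy program might look like the sketch below. The `run_all` wrapper, the shorter sleep, and the returned values are additions for illustration only; the point is that the executor is created once, outside the loop, so the blocking `shutdown(wait=True)` happens a single time after every task has been submitted.

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep


def worker(i):
    print(f"working on {i}")
    sleep(0.2)  # stand-in for real work
    return i


def run_all(n=10):
    # One executor for the whole batch: submit() returns immediately,
    # and the with block waits only once, when it exits.
    with ThreadPoolExecutor(max_workers=n) as ex:
        futures = [ex.submit(worker, i) for i in range(n)]
    # Every future is finished here because the with block has exited.
    return [f.result() for f in futures]
```

The `as_completed` pattern from the question is only needed when results should be handled as soon as each one finishes rather than after the whole batch.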

Python multiprocessing manager showing error when used in flask API

I am pretty confused about the best way to do what I am trying to do.
What do I want?
API call to the flask application
The Flask route starts 4-5 processes using multiprocessing.Process and combines their results (computed on a sliced pandas DataFrame) using a shared Manager().list()
Return computed results back to the client.
My implementation:
pos_iter_list = get_chunking_iter_list(len(position_records), 10000)
manager = Manager()
data_dict = manager.list()
processes = []
for i in range(len(pos_iter_list) - 1):
    temp_list = data_dict[pos_iter_list[i]:pos_iter_list[i + 1]]
    p = Process(
        target=transpose_dataset,
        args=(temp_list, name_space, align_namespace, measure_master_id, df_searchable, products,
              channels, all_cols, potential_col, adoption_col, final_segment, col_map, product_segments,
              data_dict)
    )
    p.start()
    processes.append(p)
for p in processes:
    p.join()
My directory structure:
- main.py(flask entry point)
- helper.py(contains function where above code is executed & calls transpose_dataset function)
The error I get when running this:
RuntimeError: No root path can be found for the provided module "__mp_main__". This can happen because the module came from an import hook that does not provide file name information or because it's a namespace package. In this case the root path needs to be explicitly provided.
I'm not sure what went wrong here; the manager list works fine when called from a sample.py file using if __name__ == '__main__':
Update: The same piece of code works fine on my MacBook but not on Windows.
A sample flask API call:
@app.route(PREFIX + "ping", methods=['GET'])
def ping():
    man = mp.Manager()
    data = man.list()
    processes = []
    for i in range(0,5):
        pr = mp.Process(target=test_func, args=(data, i))
        pr.start()
        processes.append(pr)
    for pr in processes:
        pr.join()
    return json.dumps(list(data))
Stack has an ongoing bug preventing me from commenting, so I'll just write up an answer.
Python has two main ways to start a new process: "spawn" and "fork". Fork is a system call only available on *nix (read: Linux or macOS), so spawn is the only option on Windows. As of Python 3.8, spawn is the default on macOS, but fork is still available. The big difference is that fork basically makes a copy of the existing process, while spawn starts a whole new process (like opening a new cmd window).
There's a lot of nuance to why and how, but with spawn the child has to import the main file in order to run the function you want it to run. Importing a file is tantamount to executing that file and then typically binding its namespace to a variable: import flask will run the flask/__init__.py file and bind its global namespace to the variable flask. However, there's often code that is only used by the main process and doesn't need to be imported or executed in the child process. In some cases running that code again actually breaks things, so you need to prevent it from running outside of the main process. That is what the "magic" variable __name__ is for: it is only equal to "__main__" in the main file (and not in child processes or when importing modules).
In your specific case, you're creating a new app = Flask(__name__), which does some amount of validation and checks before you ever run the server. It's one of these setup/validation steps that trips when run from the child process. Fixing it by not letting it run at all is IMO the cleaner solution, but you can also fix it by giving it a value that it won't trip over, then just never starting that secondary server (again by protecting it with if __name__ == "__main__":).

Automate the Script whenever a new folder/file is added in directory in Python

I have multiple folders in a directory and each folder has multiple files. I have a code which checks for a specific file in each folder and does some data preprocessing and analysis if the specific file is present.
A snippet of it is given below.
import pandas as pd
import json
import os
rootdir = os.path.abspath(os.getcwd())
df_list = []
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file.startswith("StudyParticipants") and file.endswith(".csv"):
            temp = pd.read_csv(os.path.join(subdir, file))
            .....
            .....
            'some analysis'
# raw string so the backslashes in the Windows path are kept literally
Merged_df.to_excel(path + r'\Processed Data Files\Study_Participants_Merged.xlsx')
Now I want to automate this process: the script should be executed whenever a new folder is added. This is my first time exploring automation, and I have been stuck on this for quite a while without major progress.
I am using windows system and Jupyter notebook to create these dataframes and perform analysis.
Any help is greatly appreciated.
Thanks.
I've written a script that you only need to run once, and it will keep working.
Please note:
1.) This solution does not take into account which folder was created. If this information is required I can rewrite the answer.
2.) This solution assumes folders won't be deleted from the main folder. If this isn't the case, I can rewrite the answer as well.
import time
import os
def DoSomething():
    pass
if __name__ == '__main__':
    # go to folder of interest
    os.chdir('/home/somefolders/.../A1')
    # get current number of folders inside it
    N = len(os.listdir())
    while True:
        time.sleep(5) # sleep for 5 secs
        if N != len(os.listdir()):
            print('New folder added! Doing something useful...')
            DoSomething()
            N = len(os.listdir()) # update N
Take a look at watchdog.
http://thepythoncorner.com/dev/how-to-create-a-watchdog-in-python-to-look-for-filesystem-changes/
You could also code a very simple watchdog service on your own:
1.) list all files in the directory you want to observe
2.) wait for a time span you define, say a few seconds
3.) list the files again
4.) compare the two lists and take their difference
5.) the resulting difference is your set of filesystem changes
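The steps above can be sketched roughly as follows. This is a minimal polling illustration, not a replacement for watchdog: the function names, the `on_added` callback, and the `max_polls` parameter are all hypothetical, and it only looks at the top level of one directory.

```python
import os
import time


def snapshot(directory):
    """Return the set of entry names currently in the directory."""
    return set(os.listdir(directory))


def diff_snapshots(before, after):
    """Return (added, removed) entry names between two snapshots."""
    return after - before, before - after


def watch_directory(directory, on_added, interval=5.0, max_polls=None):
    """Poll the directory and call on_added(name) for each new entry.

    max_polls is a hypothetical knob so the loop can stop; None runs forever.
    """
    before = snapshot(directory)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        after = snapshot(directory)
        added, _removed = diff_snapshots(before, after)
        for name in sorted(added):
            on_added(name)
        before = after
        polls += 1
```

Comparing sets of names rather than counts also catches the case where one folder is deleted and another is added between two polls.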
Best greetings

file watcher in python 3.5 using library watchgod

Hi everyone, I am trying to build a file watcher in Python 3.5 using watchgod. I want to continuously watch a directory and, if any file is added, send the list of added files to another program that will perform a series of tasks. The following is my code in Python:
print("execution of main file begins !!!!")
import os
from watchgod import watch
#changes gives a set object when watch finds any kind of changes in directory
for changes in watch(r'C:\Users\Rajat.Malik\Desktop\Requests'):
    fileStatus = [obj[0] for obj in list(changes) ] #converting set to list which gives file status as added, changed or modified
    fileLocation = [obj[1] for obj in list(changes) ] #similarly getting list of location of files added
    var2 = 0
    for var1 in fileLocation:
        if fileStatus[var2] == 1: #if file is added then passing all files to another code which will work on the list of files added
            os.system('python split_thread_module.py '+var1) #now this code will start executing
        var2 = var2 + 1
The problem I am having is that while split_thread_module.py is executing, the watcher is not watching the directory. Any file that arrives while split_thread_module.py is executing is not reflected in changes. How can I watch the directory for changes and pass them to the other program on the fly, even while the other program is executing? I am not a Python programmer. Can anyone help me in this regard?
Thanks in advance !!!!
Sorry for the delayed reply; I'm the developer of watchgod. I've added a python-watchgod tag to your question, which I'll watch (no pun intended) in future so I can answer such questions more quickly.
To answer your question, watchgod will not miss changes which occur in the filesystem while other code is running. They'll just be reported as changes next time watch iterates.
More generally the best approach would be to run the other code asynchronously so the main process can get back to watching the directory.
A few other hints for neater Python:
- there's no need to call list(changes) in the comprehension
- os.system is discouraged in favor of the subprocess module; better to use subprocess.run
- since split_thread_module.py is also Python, do you really need to run it in a separate process? Even if you do, you might have more luck with Python multiprocessing than with starting a new process through the system's process initiation.
Overall you might try something like:
from concurrent.futures import ProcessPoolExecutor
from time import sleep
from watchgod import watch
def slow_job(status, location):
    print(f'status: {status}, location: {location}, starting...')
    sleep(10)
    print(f'status: {status}, location: {location}, done')
with ProcessPoolExecutor() as executor:
    for changes in watch('./tests'):
        for status, location in changes:
            executor.submit(slow_job, status, location)

Multiprocessing with subprocess.call for an entire directory of files

I have a Python3 script that uses subprocess.call to run a program on about 2,300 input files in a directory and there are two output files for each input file. I have these two outputs going into two different directories. I would like to learn how to multiprocess my script so several files can be processed at the same time. I have been reading on the multiprocess library in Python but it might be too advanced for me to understand. Below is the script if the experts have any input. Thanks so much!
Script:
import os
import subprocess
import argparse
parser = argparse.ArgumentParser(description="This script aligns DNA sequences in files in a given directory.")
parser.add_argument('--root', default="/shared/testing_macse/", help="PATH to the input directory containing CDS orthogroup files.")
parser.add_argument('--align_NT_dir', default="/shared/testing_macse/NT_aligned/", help="PATH to the output directory for NT aligned CDS orthogroup files.")
parser.add_argument('--align_AA_dir', default="/shared/testing_macse/AA_aligned/", help="PATH to the output directory for AA aligned CDS orthogroup files.")
args = parser.parse_args()
def runMACSE(input_file, NT_output_file, AA_output_file):
    MACSE_command = "java -jar ~/bin/MACSE/macse_v1.01b.jar "
    MACSE_command += "-prog alignSequences "
    MACSE_command += "-seq {0} -out_NT {1} -out_AA {2}".format(input_file, NT_output_file, AA_output_file)
    # print(MACSE_command)
    subprocess.call(MACSE_command, shell=True)
Orig_file_dir = args.root
NT_align_file_dir = args.align_NT_dir
AA_align_file_dir = args.align_AA_dir
try:
    os.makedirs(NT_align_file_dir)
    os.makedirs(AA_align_file_dir)
except FileExistsError as e:
    print(e)
for currentFile in os.listdir(args.root):
    if currentFile.endswith(".fa"):
        runMACSE(args.root + currentFile, args.align_NT_dir + currentFile[:-3]+"_NT_aligned.fa", args.align_AA_dir + currentFile[:-3]+"_AA_aligned.fa")
Subprocess functions run any command-line executable in a separate process. You are running java. Multiprocessing runs python code in separate processes, just as threading runs python code in separate threads. The API for the two is intentionally similar. So multiprocessing cannot substitute for non-python subprocess calls.
It would be a waste of processes to use multiple Python processes to initiate multiple java processes. You could just as well use multiple threads to make multiple subprocess calls. Or use the asyncio module.
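As a rough sketch of the thread-based option: each worker thread simply blocks on one external command, so the GIL is not a bottleneck here. The helper names and the `max_workers` default are assumptions for illustration, and commands are passed as argument lists rather than shell strings.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command(command):
    """Run one external command and return its exit code."""
    # A list of arguments avoids shell quoting problems with file names.
    return subprocess.run(command, check=False).returncode


def run_commands(commands, max_workers=4):
    """Run the commands, at most max_workers at a time.

    Exit codes are returned in the same order as the commands.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_command, commands))
```

Note that without `shell=True` a leading `~` in a path such as the jar location is not expanded, so it would need `os.path.expanduser` first.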
Or make your own scheduler. Wrap your for-if in a generator function.
def fa_file(path):
    for currentFile in os.listdir(path):
        if currentFile.endswith(".fa"):
            yield currentFile
fafiles = fa_file(args.root)
Make an array of, say, 10 Popen objects. Sleep for some appropriate interval. Upon waking, loop through the array and replace finished subprocesses (.poll() returns something other than None) for as long as next(fafiles) returns something.
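One possible shape for such a scheduler is sketched below. It is an illustration under assumptions, not the answerer's code: `run_with_limit`, `max_running`, and `poll_interval` are made-up names, and each command is an argument list handed to `subprocess.Popen`.

```python
import subprocess
import time


def run_with_limit(commands, max_running=10, poll_interval=0.1):
    """Run commands as subprocesses, keeping at most max_running alive.

    Returns the exit codes in completion order.
    """
    pending = iter(commands)
    running = []
    exit_codes = []
    exhausted = False
    while running or not exhausted:
        # Top up the list of running subprocesses while slots are free.
        while not exhausted and len(running) < max_running:
            command = next(pending, None)
            if command is None:
                exhausted = True
            else:
                running.append(subprocess.Popen(command))
        # Keep the ones still running; collect the ones that finished.
        still_running = []
        for proc in running:
            code = proc.poll()  # None means "not finished yet"
            if code is None:
                still_running.append(proc)
            else:
                exit_codes.append(code)
        running = still_running
        if running:
            time.sleep(poll_interval)
    return exit_codes
```

Because `commands` is consumed lazily, a generator such as the one above can feed it without building the full list of 2,300 commands up front.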
EDIT: If you did the image processing in Python code that calls compiled C code (pillow, for instance), then you could use multiprocessing and a Queue loaded with the files to process.
