How to implement Multiprocessing in Azure Databricks - Python - python-3.x

I need to get details of each file from a directory. It is taking a long time, so I want to implement multiprocessing to make the execution finish sooner.
My code is like this:
from pathlib import Path
from os.path import getmtime, getsize
from datetime import datetime
from multiprocessing import Pool, Process

def iterate_directories(root_dir):
    for child in Path(root_dir).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            file_size = getsize(child)
            # further steps...
        else:
            iterate_directories(child)  ## I need this to run on a separate Process (in parallel)
I tried to make the recursive call run in a separate process as shown below, but it is not working; it exits the loop immediately.
        else:
            p = Process(target=iterate_directories, args=(child))
            Pros.append(p)  # declared Pros as empty list.
            p.start()

for p in Pros:
    if not p.is_alive():
        p.join()
What am I missing here? How can I process the sub-directories in parallel?

You have to get the list of directories first, and then use a multiprocessing pool to call the function on each of them.
Something like below:
from pathlib import Path
from os.path import getmtime, getsize
from datetime import datetime
from multiprocessing import Pool

def iterate_directories(root_dir):
    Filedetails = ''
    for child in Path(root_dir).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            file_size = getsize(child)
            Filedetails = Filedetails + '\n' + '{add file name details}' + str(modified_time) + str(file_size)
        else:
            Filedetails += iterate_directories(child)  # recurse into the sub-directory
    return Filedetails  # file details returned from that particular directory

pool = Pool(processes={define how many processes you like to run in parallel})
results = pool.map(iterate_directories, {explicit directory list})
print(results)  # the entire collection is printed here; it is a list you can iterate per top-level directory
Please let me know how it goes.
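For what it's worth, here is a more concrete, self-contained sketch of the same idea; the root path, the worker count, and the rglob-based walk are assumptions for illustration, not part of the answer above:

from pathlib import Path
from os.path import getmtime, getsize
from datetime import datetime
from multiprocessing import Pool

def collect_file_details(directory):
    """Walk one directory tree and return a list of (path, modified date, size) tuples."""
    details = []
    for child in Path(directory).rglob('*'):  # rglob('*') recurses, so no manual recursion is needed
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            details.append((str(child), modified_time, getsize(child)))
    return details

if __name__ == '__main__':
    root_dir = '/dbfs/mnt/my-data'  # hypothetical root path
    top_level_dirs = [str(d) for d in Path(root_dir).iterdir() if d.is_dir()]
    with Pool(processes=4) as pool:  # 4 workers is an arbitrary choice
        results = pool.map(collect_file_details, top_level_dirs)
    for directory_result in results:  # one list per top-level directory
        for path, modified, size in directory_result:
            print(path, modified, size)

Each worker receives one top-level directory as a plain string, which keeps the arguments picklable and avoids sharing open handles between processes.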

The problem is this line:
if not p.is_alive():
What this translates to is: only wait for the process if it has already completed, which does not make much sense (you would need to remove the not from the condition). The check is also unnecessary: calling .join() does the same thing internally that p.is_alive() does, except that it blocks. So you can safely just do this:
for p in Pros:
    p.join()
The code will then wait for all child processes to finish.
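Putting that together with the original loop, a minimal sketch might look like this. It simplifies the question's recursive spawning to one process per top-level sub-directory, and note that args must be a tuple, so args=(child,) with a trailing comma; that comma is easy to miss and is not discussed in the answer above:

from pathlib import Path
from multiprocessing import Process

def iterate_directories(root_dir):
    ...  # same body as in the question: collect file details, recurse into sub-directories

if __name__ == '__main__':
    Pros = []
    for child in Path('/some/root').iterdir():  # '/some/root' is a hypothetical path
        if child.is_dir():
            p = Process(target=iterate_directories, args=(child,))  # args must be a tuple
            Pros.append(p)
            p.start()

    for p in Pros:
        p.join()  # blocks until that child process has finished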

Related

Best way to keep creating threads on variable list argument

I am listening to an event every minute that returns a list; it could be empty, have one element, or more. For each element in that list, I'd like to run a function that monitors an event on that element every minute for 10 minutes.
For that I wrote this script:
from concurrent.futures import ThreadPoolExecutor
from time import sleep
import asyncio
import Client

client = Client()

def handle_event(event):
    for i in range(10):
        client.get_info(event)
        sleep(60)

async def main():
    while True:
        entires = client.get_new_entry()
        if len(entires) > 0:
            with ThreadPoolExecutor(max_workers=len(entires)) as executor:
                executor.map(handle_event, entires)
        await asyncio.sleep(60)

if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    loop.run_until_complete(main())
However, instead of continuing to monitor for new entries, it blocks while the previous entries are still being monitored.
Any idea how I could do that please?
First let me explain why your program doesn't work the way you want it to: It's because you use the ThreadPoolExecutor as a context manager, which will not close until all the threads started by the call to map are finished. So main() waits there, and the next iteration of the loop can't happen until all the work is finished.
There are ways around this. Since you are using asyncio already, one approach is to move the creation of the Executor to a separate task. Each iteration of the main loop starts one copy of this task, which runs as long as it takes to finish. It's an async def function, so many copies of this task can run concurrently.
I changed a few things in your code. Instead of Client I just used some simple print statements. I pass a list of integers, of random length, to handle_event. I increment a counter each time through the while True: loop, and add 10 times the counter to every integer in the list. This makes it easy to see how old calls continue for a time, mixing with new calls. I also shortened your time delays. All of these changes were for convenience and are not important.
The important change is to move ThreadPoolExecutor creation into a task. To make it cooperate with other tasks, it must contain an await expression, and for that reason I use executor.submit rather than executor.map. submit returns a concurrent.futures.Future, which provides a convenient way to await the completion of all the calls. executor.map, on the other hand, returns an iterator; I couldn't think of any good way to convert it to an awaitable object.
To convert a concurrent.futures.Future to an asyncio.Future, an awaitable, there is a function asyncio.wrap_future. When all the futures are complete, I exit from the ThreadPoolExecutor context manager. That will be very fast since all of the Executor's work is finished, so it does not block other tasks.
import random
from concurrent.futures import ThreadPoolExecutor
from time import sleep
import asyncio

def handle_event(event):
    for i in range(10):
        print("Still here", event)
        sleep(2)

async def process_entires(counter, entires):
    print("Counter", counter, "Entires", entires)
    x = [counter * 10 + a for a in entires]
    with ThreadPoolExecutor(max_workers=len(entires)) as executor:
        futs = []
        for z in x:
            futs.append(executor.submit(handle_event, z))
        await asyncio.gather(*(asyncio.wrap_future(f) for f in futs))

async def main():
    counter = 0
    while True:
        entires = [0, 1, 2, 3, 4][:random.randrange(5)]
        if len(entires) > 0:
            counter += 1
            asyncio.create_task(process_entires(counter, entires))
        await asyncio.sleep(3)

if __name__ == "__main__":
    asyncio.run(main())
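As a side note that is not part of the answer above: on Python 3.9+ the same pattern can be written with asyncio.to_thread, which hands the blocking call to the default thread pool and returns an awaitable directly, so no explicit ThreadPoolExecutor or wrap_future is needed. A minimal sketch under that assumption:

import asyncio
import random
from time import sleep

def handle_event(event):
    for i in range(10):
        print("Still here", event)
        sleep(2)

async def process_entires(counter, entires):
    # asyncio.to_thread runs the blocking function in the default thread pool
    await asyncio.gather(*(asyncio.to_thread(handle_event, counter * 10 + a) for a in entires))

async def main():
    counter = 0
    while True:
        entires = [0, 1, 2, 3, 4][:random.randrange(5)]
        if entires:
            counter += 1
            asyncio.create_task(process_entires(counter, entires))
        await asyncio.sleep(3)

if __name__ == "__main__":
    asyncio.run(main())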

Using contextlib.redirect_stdout in an async function redirects output of other tasks

I want to redirect the output of a few lines in my code that I don't have control over; their output is not relevant to me. I've been able to use contextlib.redirect_stdout(io.StringIO()) in a synchronous function to successfully redirect the lines I want, but I can't get it to work in an async function.
This is what I have so far:
import asyncio
import contextlib
import sys

async def long_function(val: int, semaphore: asyncio.Semaphore, file_out, old_stdout=sys.stdout):
    # Only let two tasks start at a time
    await semaphore.acquire()
    print(f"{val}: Starting")

    # Redirect stdout of ONLY the lines within this context manager
    with contextlib.redirect_stdout(file_out):
        await asyncio.sleep(3)  # long-running task that prints output I can't control, but is not useful to me
        print(f"{val}: Finished redirect")

    contextlib.redirect_stdout(old_stdout)
    print(f"{val}: Done")
    semaphore.release()

async def main():
    # I want to limit the number of concurrent tasks to 2
    semaphore: asyncio.Semaphore = asyncio.Semaphore(2)

    # Create a list of tasks to perform
    file_out = open("file.txt", "w")
    tasks = []
    for i in range(0, 9):
        tasks.append(long_function(i, semaphore, file_out))

    # Gather/run the tasks
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(main())
When running this, however, the output of the other tasks is also placed into "file.txt". I only want the "Finished redirect" lines to end up in the file.
I see the following in the Python docs:
Note that the global side effect on sys.stdout means that this context manager is not suitable for use in library code and most threaded applications. It also has no effect on the output of subprocesses. However, it is still a useful approach for many utility scripts.
Is there any other way to go about this, or do I just have to live with the output as-is?
Thanks for any help!

Launching parallel tasks: Subprocess output triggers function asynchronously

The example I will describe here is purely conceptual so I'm not interested in solving this actual problem.
What I need to accomplish is to asynchronously run a function based on the continuous output of a subprocess command, in this case the Windows ping yahoo.com -t command; based on the time value in the replies I want to trigger the startme function. Inside this function there will be more processing done, including some database and/or network-related calls, so basically I/O work.
My best bet would be to use threading, but for some reason I can't get this to work as intended. Here is what I have tried so far:
First of all I tried the old way of using Threads like this:
import subprocess
import re
import asyncio
import time
import threading

def startme(mytime: int):
    print(f"Mytime {mytime} was started!")
    time.sleep(mytime)  ## including more long operation functions here such as database calls and even some time.sleep() - if possible
    print(f"Mytime {mytime} finished!")

myproc = subprocess.Popen(['ping', 'yahoo.com', '-t'], shell=True, stdout=subprocess.PIPE)

def main():
    while True:
        output = myproc.stdout.readline()
        if myproc.poll() is not None:
            break
        myoutput = output.strip().decode(encoding="UTF-8")
        print(myoutput)
        mytime = re.findall("(?<=time\=)(.*)(?=ms\s)", myoutput)
        try:
            mytime = int(mytime[0])
            if mytime < 197:
                # startme(int(mytime[0]))
                p1 = threading.Thread(target=startme(mytime), daemon=True)
                # p1 = threading.Thread(target=startme(mytime))  # tried with and without the daemon
                p1.start()
                # p1.join()
        except:
            pass

main()
But right after startme() fires for the first time, the pings stop showing and everything waits for the time.sleep() inside startme() to finish.
I did manage to get this working using concurrent.futures' ThreadPoolExecutor, but when I tried to replace the time.sleep() with the actual database query I found that my startme() function never completes, so no Mytime xxx finished! message is ever shown, nor is any database entry made.
import sqlite3
import subprocess
import re
import asyncio
import time
# import threading
# import multiprocessing
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor

conn = sqlite3.connect('test.db')
c = conn.cursor()
c.execute(
    '''CREATE TABLE IF NOT EXISTS mytable (id INTEGER PRIMARY KEY, u1, u2, u3, u4)''')

def startme(mytime: int):
    print(f"Mytime {mytime} was started!")
    # time.sleep(mytime)  ## including more long operation functions here such as database calls and even some time.sleep() - if possible
    c.execute("INSERT INTO mytable VALUES (null, ?, ?, ?, ?)", (1, 2, 3, mytime))
    conn.commit()
    print(f"Mytime {mytime} finished!")

myproc = subprocess.Popen(['ping', 'yahoo.com', '-t'], shell=True, stdout=subprocess.PIPE)

def main():
    while True:
        output = myproc.stdout.readline()
        myoutput = output.strip().decode(encoding="UTF-8")
        print(myoutput)
        mytime = re.findall("(?<=time\=)(.*)(?=ms\s)", myoutput)
        try:
            mytime = int(mytime[0])
            if mytime < 197:
                print(f"The time {mytime} is low enought to call startme()")
                executor = ThreadPoolExecutor()
                # executor = ProcessPoolExecutor()  # I did try using processes even if it's not a CPU-related issue
                executor.submit(startme, mytime)
        except:
            pass

main()
I did try using asyncio, but I soon realized it was not the right fit here; I'm wondering if I should try aiosqlite.
I also thought about using asyncio.create_subprocess_shell and running both as parallel subprocesses, but I can't think of a way to wait for a certain string from the ping command that would trigger the second script.
Please note that I don't really need a return from the startme() function and the ping command example is conceptually derived from the mitmproxy's mitmdump output command.
The first code wasn't working because I made a silly mistake when creating the thread: p1 = threading.Thread(target=startme(mytime)) calls the function immediately instead of passing it. The target and its arguments must be passed separately, like this: p1 = threading.Thread(target=startme, args=(mytime,)).
The reason why I could not get the SQL insert statement to work in my second code was this error:
SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 10688 and this is thread id 17964
which I didn't see until I wrapped my SQL statement in a try/except and printed the error. So I needed to create the SQLite connection inside my startme() function.
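A minimal sketch combining both fixes (the table layout comes from the question; the standalone wrapper around it is only for illustration):

import sqlite3
import threading

def startme(mytime: int):
    print(f"Mytime {mytime} was started!")
    # SQLite objects can only be used in the thread that created them,
    # so open the connection inside the worker function.
    conn = sqlite3.connect('test.db')
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS mytable (id INTEGER PRIMARY KEY, u1, u2, u3, u4)''')
    c.execute("INSERT INTO mytable VALUES (null, ?, ?, ?, ?)", (1, 2, 3, mytime))
    conn.commit()
    conn.close()
    print(f"Mytime {mytime} finished!")

if __name__ == "__main__":
    mytime = 150  # stand-in for a parsed ping time below the threshold
    # Pass the callable and its arguments separately instead of calling it
    p1 = threading.Thread(target=startme, args=(mytime,), daemon=True)
    p1.start()
    p1.join()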
The other asyncio stuff was just nonsense and cannot be applied to the current issue here.

Need to do CPU bound processing using 2+ processes in Python by reading from a gzipped file

I have a gzipped file (compressed 10 GB, uncompressed 100 GB) that contains reports separated by demarcation lines, and I have to parse it.
Parsing and processing the data takes a long time, and the work is CPU-bound (not IO-bound). So I am planning to split the work across multiple processes using the multiprocessing module. The problem is that I am unable to send/share data with the child processes efficiently. I am using subprocess.Popen to stream the uncompressed data in the parent process.
process = subprocess.Popen('gunzip --keep --stdout big-file.gz',
                           shell=True,
                           stdout=subprocess.PIPE)
I am thinking of using a Lock() so that child-process-1 reads/parses one report and releases the lock, then child-process-2 reads/parses the next report, then control switches back to child-process-1 for the report after that. When I pass process.stdout as an argument to the child processes, I get a pickling error.
I have tried to create a multiprocessing.Queue() and a multiprocessing.Pipe() to send data to the child processes, but this is far too slow (in fact it is slower than doing it in a single thread, i.e. serially).
Any thoughts/examples about sending data to child processes efficiently will help.
Could you try something simple instead? Have each worker process run its own instance of gunzip, with no interprocess communication at all. Worker 1 can process the first report and just skip over the second. The opposite for worker 2. Each worker skips every other report. Then an obvious generalization to N workers.
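A rough sketch of that round-robin idea (purely illustrative: it uses Python's gzip module rather than an external gunzip, and DELIM and process_report are hypothetical placeholders):

import gzip
from multiprocessing import Pool

NWORKERS = 4
DELIM = "xxxxx\n"  # hypothetical report separator

def process_report(report_lines):
    ...  # the actual CPU-bound parsing would go here

def worker(worker_id):
    # Each worker decompresses the file on its own - no interprocess communication at all.
    with gzip.open("big-file.gz", "rt") as f:
        report, index = [], 0
        for line in f:
            if line == DELIM:
                if index % NWORKERS == worker_id:
                    process_report(report)  # this worker owns this report
                report, index = [], index + 1
            else:
                report.append(line)
        if report and index % NWORKERS == worker_id:
            process_report(report)  # handle a trailing report with no final delimiter

if __name__ == "__main__":
    with Pool(NWORKERS) as pool:
        pool.map(worker, range(NWORKERS))

Every worker decompresses the whole file itself, so decompression work is duplicated, but no data has to be shipped between processes.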
Or not ...
I think you'll need to be more specific about what you tried, and perhaps give more info about your problem (like: how many records are there? how big are they?).
Here's a program ("genints.py") that prints a bunch of random ints, one per line, broken into groups via "xxxxx\n" separator lines:
from random import randrange, seed

seed(42)
for i in range(1000):
    for j in range(randrange(1, 1000)):
        print(randrange(100))
    print("xxxxx")
Because it forces the seed, it generates the same stuff every time. Now a program to process those groups, both in parallel and serially, via the most obvious way I first thought of. crunch() takes time quadratic in the number of ints in a group, so it's quite CPU-bound. The output from one run, using (as shown) 3 worker processes for the parallel part:
parallel result: 10,901,000,334 0:00:35.559782
serial result: 10,901,000,334 0:01:38.719993
So the parallelized run took about one-third the time. In what relevant way(s) does that differ from your problem? Certainly, a full run of "genints.py" produces less than 2 million bytes of output, so that's a major difference - but it's impossible to guess from here whether it's a relevant difference. Perhaps, e.g., your problem is only very mildly CPU-bound? It's obvious from the output here that the overheads of passing chunks of stdout to worker processes are all but insignificant in this program.
In short, you probably need to give people - as I just did for you - a complete program they can run that reproduces your problem.
import multiprocessing as mp

NWORKERS = 3
DELIM = "xxxxx\n"

def runjob():
    import subprocess
    # 'py' is just a shell script on my box that
    # invokes the desired version of Python -
    # which happened to be 3.8.5 for this run.
    p = subprocess.Popen("py genints.py",
                         shell=True,
                         text=True,
                         stdout=subprocess.PIPE)
    return p.stdout

# Return list of lines up to (but not including) next DELIM,
# or EOF. If the file is already exhausted, return None.
def getrecord(f):
    result = []
    foundone = False
    for line in f:
        foundone = True
        if line == DELIM:
            break
        result.append(line)
    return result if foundone else None

def crunch(rec):
    total = 0
    for a in rec:
        for b in rec:
            total += abs(int(a) - int(b))
    return total

if __name__ == "__main__":
    import datetime
    now = datetime.datetime.now

    s = now()
    total = 0
    f = runjob()
    with mp.Pool(NWORKERS) as pool:
        for i in pool.imap_unordered(crunch,
                                     iter((lambda: getrecord(f)), None)):
            total += i
    f.close()
    print(f"parallel result: {total:,}", now() - s)

    s = now()
    # try the same thing serially
    total = 0
    f = runjob()
    while True:
        rec = getrecord(f)
        if rec is None:
            break
        total += crunch(rec)
    f.close()
    print(f"serial result: {total:,}", now() - s)

Problem with python looping when I have no loop in my code

I have a main program that calls some modules. For some reason, when I run the code it loops over parts of the main code even though there is no loop in the code.
import os
import datetime
import multiprocessing as mp
import shutil

# make temp folder for data files
date = str(datetime.datetime.now())
date = date[0:19]
date = date.replace(':', '-')
temp_folder_name = 'temp_data_files_' + date
os.mkdir(temp_folder_name)

# make folder for sim files
date = str(datetime.datetime.now())
date = date[0:19]
date = date.replace(':', '-')
save_folder_name = 'Sim_files_' + date
os.mkdir(save_folder_name)

# make data files and save in temp folder
import make
make.data('model_1', temp_folder_name)  # model name and folder for results

# run file on multiple cores
import distributed
corecount = mp.cpu_count()  # edit this value to the number of cores you want to use on your computer
if __name__ == '__main__':
    distributed.simulate(temp_folder_name, corecount, save_folder_name)
The program should make two folders. It then uses 'make' to create some files and put them in the temp folder. It should then use 'distributed' to do some work with the files and save the results in the 'Sim_files' folder. But for some reason it creates several folders of each kind (with slightly different timestamps).
The distributed function includes some links but I don't think these should have an effect on the main program.
The if __name__ == ... line is related to multiprocessing; it is a guard against infinite looping.
I have found a solution to this. It has to do with the way multiprocessing works: the child processes import the main program like a module, which leads to the main program being run once for each child process.
The solution is to move the if __name__ == '__main__': guard to the start of the main script. This ensures the setup code only runs when the file is executed directly, rather than when it is imported like a module by the child processes :)
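For reference, a minimal sketch of that restructuring (make and distributed are the question's own modules; the make_folder helper is just a convenience added here):

import os
import datetime
import multiprocessing as mp

def make_folder(prefix):
    """Create a timestamped folder and return its name."""
    date = str(datetime.datetime.now())[0:19].replace(':', '-')
    name = prefix + date
    os.mkdir(name)
    return name

if __name__ == '__main__':
    # Everything below only runs in the parent process, never on import by a child process.
    temp_folder_name = make_folder('temp_data_files_')
    save_folder_name = make_folder('Sim_files_')

    import make
    make.data('model_1', temp_folder_name)  # model name and folder for results

    import distributed
    corecount = mp.cpu_count()
    distributed.simulate(temp_folder_name, corecount, save_folder_name)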
