Python: Notify when all files have been transferred (python-3.x)

I am using the "watchdog" API to keep checking for changes in a folder on my filesystem. Whenever files change in that folder, I pass them to a particular function which starts a thread for each file.
But watchdog, or any other filesystem-watcher API I know of, notifies the user file by file, i.e. it fires an event as each file comes in. I would like it to notify me about a whole bunch of files at a time, so that I can pass that list to my function and make use of multi-threading. Currently, watchdog notifies me one file at a time, so I can only pass that one file to my function. I want to pass it many files at a time to be able to use multithreading.
One solution that comes to mind: when you copy a bunch of files into a folder, the OS shows you a progress bar. If I could be notified when that progress bar finishes, that would be a perfect solution to my question. But I don't know whether that is possible.
Also, I know that watchdog is a polling API, and that an ideal API for watching the filesystem would be interrupt driven, like pyinotify. But I haven't found any API that is both interrupt driven and cross-platform. iWatch is good, but Linux only, and I want something for all OSes. So if you have suggestions for any other API, please let me know.
Thanks.

Instead of accumulating filesystem events, you could spawn a pool of worker
threads which get tasks from a common queue. The watchdog thread could then put
tasks in the queue as filesystem events occur. Done this way, a worker thread
can start working as soon as an event occurs.
For example,
import logging
import queue
import threading
import time

import watchdog.observers as observers
import watchdog.events as events

logger = logging.getLogger(__name__)

SENTINEL = None  # could be put on the queue to tell the workers to shut down

class MyEventHandler(events.FileSystemEventHandler):
    def __init__(self, event_queue):
        self.event_queue = event_queue

    def on_any_event(self, event):
        super(MyEventHandler, self).on_any_event(event)
        self.event_queue.put(event)

def process(event_queue):
    while True:
        event = event_queue.get()
        logger.info(event)

if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG,
                        format='[%(asctime)s %(threadName)s] %(message)s',
                        datefmt='%H:%M:%S')
    event_queue = queue.Queue()
    num_workers = 4
    pool = [threading.Thread(target=process, args=(event_queue,))
            for _ in range(num_workers)]
    for t in pool:
        t.daemon = True
        t.start()

    event_handler = MyEventHandler(event_queue)
    observer = observers.Observer()
    observer.schedule(event_handler, path='/tmp/testdir', recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
Running
% mkdir /tmp/testdir
% script.py
yields output like
[14:48:31 Thread-1] <FileCreatedEvent: src_path=/tmp/testdir/.#foo>
[14:48:32 Thread-2] <FileModifiedEvent: src_path=/tmp/testdir/foo>
[14:48:32 Thread-3] <FileModifiedEvent: src_path=/tmp/testdir/foo>
[14:48:32 Thread-4] <FileDeletedEvent: src_path=/tmp/testdir/.#foo>
[14:48:42 Thread-1] <FileDeletedEvent: src_path=/tmp/testdir/foo>
[14:48:47 Thread-2] <FileCreatedEvent: src_path=/tmp/testdir/.#bar>
[14:48:49 Thread-4] <FileCreatedEvent: src_path=/tmp/testdir/bar>
[14:48:49 Thread-4] <FileModifiedEvent: src_path=/tmp/testdir/bar>
[14:48:49 Thread-1] <FileDeletedEvent: src_path=/tmp/testdir/.#bar>
[14:48:54 Thread-2] <FileDeletedEvent: src_path=/tmp/testdir/bar>
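If you do want to hand your function a batch of files rather than one event at a time, the same worker loop can be adapted to drain the queue until it has been quiet for a while. This is only a sketch under the assumption that a fixed quiet period is an acceptable proxy for "the copy has finished"; process_batches and QUIET_PERIOD are names introduced here, and the worker would be started with the same event_queue as above.

import logging
import queue

logger = logging.getLogger(__name__)

QUIET_PERIOD = 1.0  # seconds without new events before a batch is considered complete

def process_batches(event_queue):
    while True:
        batch = [event_queue.get()]  # block until at least one event arrives
        while True:
            try:
                # keep collecting until the queue stays empty for QUIET_PERIOD
                batch.append(event_queue.get(timeout=QUIET_PERIOD))
            except queue.Empty:
                break
        logger.info('batch of %d events: %r', len(batch), batch)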
Doug Hellmann has written an excellent set of tutorials (which have now been edited into a book) which should help you get started:
- on using Queue
- the threading module
- how to set up and use a pool of worker processes
- how to set up a pool of worker threads
I didn't actually end up using a multiprocessing Pool or ThreadPool as discussed
in the last two links, but you may find them useful anyway.
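As an aside, if you would rather not manage the worker threads yourself, the same pattern can be expressed with concurrent.futures.ThreadPoolExecutor. This is only a sketch, not part of the original answer; handle_event is a placeholder name for whatever per-file work you want to do.

import concurrent.futures
import watchdog.events as events

def handle_event(event):
    print(event)  # placeholder for real per-file processing

class PoolEventHandler(events.FileSystemEventHandler):
    def __init__(self, pool):
        self.pool = pool  # a concurrent.futures.ThreadPoolExecutor

    def on_any_event(self, event):
        # submit() returns immediately, so the observer thread is never blocked
        self.pool.submit(handle_event, event)

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
event_handler = PoolEventHandler(pool)
# schedule event_handler on a watchdog Observer exactly as in the example above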

Related

asyncio.Queue as producer-consumer flow in a webserver like Quart

Is it possible to use asyncio.Queue with a webserver like Quart to communicate between the producer and consumer?
Here is what I am trying to do....
from quart import Quart, request
import asyncio
import logging
import os

app = Quart(__name__)
logger = logging.getLogger(__name__)

queue = asyncio.Queue()
producers = []
consumers = []

async def producer(mesg):
    print(f'produced {mesg}')
    await queue.put(mesg)
    await asyncio.sleep(1)  # do some work

async def consumer():
    while True:
        token = await queue.get()
        await asyncio.sleep(1)  # do some work
        queue.task_done()
        print(f'consumed {token}')

@app.route('/route', methods=['POST'])
async def index():
    mesg = await request.get_data()
    try:
        p = asyncio.create_task(producer(mesg))
        producers.append(p)
        c = asyncio.create_task(consumer())
        consumers.append(c)
        return f"published message {mesg}", 200
    except Exception as e:
        logger.exception("Failed to publish message %s!", mesg)
        return f"Failed to publish message: {mesg}", 400

if __name__ == '__main__':
    PORT = int(os.getenv('PORT')) if os.getenv('PORT') else 8050
    app.run(host='0.0.0.0', port=PORT, debug=True)
This works fine.
But I am not sure whether this is good practice, because I am confused about where in my code to do the steps below.
# Making sure all the producers have completed
await asyncio.gather(*producers)
# wait for the remaining tasks to be processed
await queue.join()
# cancel the consumers, which are now idle
for c in consumers:
    c.cancel()
EDIT-1:
I have tried using @app.after_serving, with some logger.debug statements.
@app.after_serving
async def shutdown():
    logger.debug("Shutting down...")
    logger.debug("waiting for producers to finish...")
    await asyncio.gather(*producers)
    logger.debug("waiting for tasks to complete...")
    await queue.join()
    logger.debug("cancelling consumers...")
    for c in consumers:
        c.cancel()
But the debug statements are not printed when hypercorn is gracefully shutting down. So I am not sure whether the function (shutdown) decorated with @app.after_serving is actually called during a shutdown.
Here is the message from hypercorn during shutdown
appserver_1 | 2020-05-29 15:55:14,200 - base_events.py:1490 - create_server - INFO - <Server sockets=(<asyncio.TransportSocket fd=14, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('0.0.0.0', 8080)>,)> is serving
appserver_1 | Running on 0.0.0.0:8080 over http (CTRL + C to quit)
Gracefully stopping... (press Ctrl+C again to force)
I am using kill -SIGTERM <PID> to signal a graceful shutdown to the process.
I would place the cleanup code in an after_serving shutdown function:
@app.after_serving
async def shutdown():
    # make sure all the producers have completed
    await asyncio.gather(*producers)
    # wait for the remaining tasks to be processed
    await queue.join()
    # cancel the consumers, which are now idle
    for c in consumers:
        c.cancel()
As for the globals, I tend to store them on the app directly so that they can be accessed via the current_app proxy. Please note, though, that this (and your solution) only works for a single process (worker); if you want to use multiple workers (or, equivalently, multiple hosts) you will need a third-party store for this information, e.g. Redis.
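For illustration, here is a sketch of that approach. The attribute names app.queue and app.consumers and the worker count are arbitrary choices here, not Quart API, and this version starts a fixed pool of consumers at startup rather than one per request:

from quart import Quart, current_app, request
import asyncio

app = Quart(__name__)

async def consumer(queue):
    while True:
        token = await queue.get()
        print(f'consumed {token}')  # do some work
        queue.task_done()

@app.before_serving
async def startup():
    # store shared state on the app so handlers can reach it via current_app
    app.queue = asyncio.Queue()
    app.consumers = [asyncio.ensure_future(consumer(app.queue)) for _ in range(2)]

@app.route('/route', methods=['POST'])
async def index():
    mesg = await request.get_data()
    await current_app.queue.put(mesg)
    return f"published message {mesg}", 200

@app.after_serving
async def shutdown():
    await app.queue.join()   # wait for queued messages to be processed
    for c in app.consumers:  # then cancel the now-idle consumers
        c.cancel()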
But I am not sure if this is a good practice
Global variables like the ones you've created in your example are not typically good practice in enterprise solutions. Especially in Python, there is some finicky behaviour when it comes to globals.
Passing variables into a function or class has been the cleaner approach in my experience.
However, I don't know how to do that in Quart since I don't use that library.
# Making sure all the producers have completed
#wait for the remaining tasks to be processed
# cancel the consumers, which are now idle
Typically, clean-up tasks are done upon exiting the event loop and before exiting the application.
I don't have knowledge of how Quart works, but you might be able to put that logic after app.run() so that the clean-up tasks run after the event loop stops.
This might vary depending on how your application exits.
Check the documentation for some sort of "on shutdown" event that you can hook into.

ValueError when asyncio.run() is called in separate thread

I have a network application which is listening on multiple sockets.
To handle each socket individually, I use Python's threading.Thread module.
These sockets must be able to run tasks on packet reception without delaying any further packet reception from the socket handling thread.
To do so, I've declared the method(s) that are running the previously mentioned tasks with the keyword async so I can run them asynchronously with asyncio.run(my_async_task(my_parameters)).
I have tested this approach on a single socket (running on the main thread) with great success.
But when I use multiple sockets (each one with its own independent handler thread), the following exception is raised:
ValueError: set_wakeup_fd only works in main thread
My question is the following: is asyncio the appropriate tool for what I need? If it is, how do I run an async method from a thread that is not the main thread?
Most of my search results involve "event loops" and "awaiting" async results, which (if I understand them correctly) is not what I am looking for.
I am talking about sockets in this question to provide context but my problem is mostly about the behaviour of asyncio in child threads.
I can, if needed, write a short code sample to reproduce the error.
Thank you for the help!
Edit 1: here is a minimal reproducible code example:
import asyncio
import threading
import time

# Handle a specific packet from any socket without interrupting the listening thread
async def handle_it(val):
    print("handled: {}".format(val))

# A class to simulate a threaded socket listener
class MyFakeSocket(threading.Thread):
    def __init__(self, val):
        threading.Thread.__init__(self)
        self.val = val  # value for a fake received packet

    def run(self):
        for i in range(10):
            # the (fake) socket will sequentially receive [val, val+1, ..., val+9]
            asyncio.run(handle_it(self.val + i))
            time.sleep(0.5)

# Entry point
sockets = MyFakeSocket(0), MyFakeSocket(10)
for socket in sockets:
    socket.start()
This is possibly related to the bug discussed here: https://bugs.python.org/issue34679
If so, this would be a problem with Python 3.8 on Windows. To work around it, you could try downgrading to Python 3.7, or getting and running the event loop manually instead of calling asyncio.run(), like:
loop = asyncio.get_event_loop()
loop.run_until_complete(<your tasks>)
loop.close()
Otherwise, would you be able to run the code in a docker container? This might work for you and would then be detached from the OS behaviour, but is a lot more work!
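For concreteness, here is a sketch of that manual-loop variant applied to the reproducer above: each thread creates and owns its own event loop instead of calling asyncio.run(). Whether this sidesteps the set_wakeup_fd error still depends on the Python version and event loop policy in use.

import asyncio
import threading
import time

async def handle_it(val):
    print("handled: {}".format(val))

class MyFakeSocket(threading.Thread):
    def __init__(self, val):
        threading.Thread.__init__(self)
        self.val = val

    def run(self):
        # each thread creates, uses and closes its own event loop
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        try:
            for i in range(10):
                loop.run_until_complete(handle_it(self.val + i))
                time.sleep(0.5)
        finally:
            loop.close()

for socket in (MyFakeSocket(0), MyFakeSocket(10)):
    socket.start()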

Process finishes but cannot be joined?

To accelerate a certain task, I'm subclassing Process to create a worker that will process data coming in as samples. Some managing class feeds it data and reads the outputs (using two Queue instances). For asynchronous operation I'm using put_nowait and get_nowait. At the end I send a special exit code to my process, upon which it breaks its internal loop. However... the join never happens. Here's a minimal reproducible example:
import multiprocessing as mp

class Worker(mp.Process):
    def __init__(self, in_queue, out_queue):
        super(Worker, self).__init__()
        self.input_queue = in_queue
        self.output_queue = out_queue

    def run(self):
        while True:
            received = self.input_queue.get(block=True)
            if received is None:
                break
            self.output_queue.put_nowait(received)
        print("\tWORKER DEAD")

class Processor():
    def __init__(self):
        # prepare
        in_queue = mp.Queue()
        out_queue = mp.Queue()
        worker = Worker(in_queue, out_queue)
        # get to work
        worker.start()
        in_queue.put_nowait(list(range(10**5)))  # XXX
        # clean up
        print("NOTIFYING")
        in_queue.put_nowait(None)
        #out_queue.get()  # XXX
        print("JOINING")
        worker.join()

Processor()
This code never completes, hanging permanently like this:
NOTIFYING
JOINING
WORKER DEAD
Why?
I've marked two lines with XXX. In the first one, if I send less data (say, 10**4), everything will finish normally (processes join as expected). Similarly in the second, if I get() after notifying the workers to finish. I know I'm missing something but nothing in the documentation seems relevant.
Documentation mentions that
When an object is put on a queue, the object is pickled and a background thread later flushes the pickled data to an underlying pipe. This has some consequences [...] After putting an object on an empty queue there may be an infinitesimal delay before the queue’s empty() method returns False and get_nowait() can return without raising queue.Empty.
https://docs.python.org/3.7/library/multiprocessing.html#pipes-and-queues
and additionally that
whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate.
https://docs.python.org/3.7/library/multiprocessing.html#multiprocessing-programming
This means that the behaviour you describe is probably caused by a race condition between self.output_queue.put_nowait(received) in the worker and joining the worker with worker.join() in the Processor's __init__. If joining is faster than feeding the item into the queue's underlying pipe, everything finishes fine. If it is too slow, there is still an item in the queue, and the worker will not join.
Uncommenting the out_queue.get() in the main process empties the queue, which allows joining. But since it is important for get() to return even if the queue happens to be empty already, using a timeout might be an option to wait out the race condition, e.g. out_queue.get(timeout=10).
It might also be important to protect the main routine, especially on Windows (python multiprocessing on windows, if __name__ == "__main__").
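Putting that together, here is a sketch of a Processor that drains out_queue before joining, so the worker's queue feeder thread can flush its pipe and the process can exit. The 1-second timeout is an arbitrary choice, and Worker is simply the class from the question:

import multiprocessing as mp
import queue  # only needed for the queue.Empty exception

class Worker(mp.Process):
    # same worker as in the question
    def __init__(self, in_queue, out_queue):
        super(Worker, self).__init__()
        self.input_queue = in_queue
        self.output_queue = out_queue

    def run(self):
        while True:
            received = self.input_queue.get(block=True)
            if received is None:
                break
            self.output_queue.put_nowait(received)
        print("\tWORKER DEAD")

class Processor():
    def __init__(self):
        in_queue = mp.Queue()
        out_queue = mp.Queue()
        worker = Worker(in_queue, out_queue)
        worker.start()
        in_queue.put_nowait(list(range(10**5)))
        # tell the worker to stop, then drain its output so the worker's
        # feeder thread can flush the pipe and the process can exit
        print("NOTIFYING")
        in_queue.put_nowait(None)
        results = []
        while True:
            try:
                results.append(out_queue.get(timeout=1))
            except queue.Empty:
                break
        print("JOINING")
        worker.join()

if __name__ == "__main__":
    Processor()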

watchdog with Pool in python 3

I have a simple watchdog in python 3 that reboots my server if something goes wrong:
import time, os
from multiprocessing import Pool

def watchdog(x):
    time.sleep(x)
    os.system('reboot')
    return

def main():
    while True:
        p = Pool(processes=1)
        p.apply_async(watchdog, (60, ))  # start watchdog with 60s interval
        # here some code that has a little chance to block permanently...
        # reboot is ok because of many other files running independently
        # that will get problems too if this one blocks too long and
        # this will reset all together and autostart everything back
        # block is happening 1-2 times a month, mostly within a http-request
        p.terminate()
        p.join()
    return

if __name__ == '__main__':
    main()
p = Pool(processes=1) is created every time the while loop starts.
Now here is the question: is there any smarter way?
If I p.terminate() to prevent the reboot, the Pool becomes closed for any other work. Or is there nothing wrong with creating a new Pool every time, because of garbage collection?
Use a process. Processes support all of the features you are using, so you don't need to make a pool with size one. While processes do have a warning about using the terminate() method (since it can corrupt pipes, sockets, and locking primitives), you are not using any of those items and don't need to care. (In any event, Pool.terminate() probably has the same issues with pipes etc. even though it lacks a similar warning.)
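A sketch of what that could look like with multiprocessing.Process, keeping the structure of the loop from the question:

import time, os
from multiprocessing import Process

def watchdog(x):
    time.sleep(x)
    os.system('reboot')

def main():
    while True:
        p = Process(target=watchdog, args=(60,))  # start watchdog with 60s interval
        p.start()
        # ... the code that might block permanently goes here ...
        p.terminate()  # made it through in time, so cancel the reboot
        p.join()

if __name__ == '__main__':
    main()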

IronPython: StartThread creates 'DummyThreads' that don't get cleaned up

The problem I have is that in my IronPython application threads are being created but never cleaned up, even when the method they run has exited. In my application I start threads in two ways: a) by using Python-style threads (sub-classes of threading.Thread that do something in their run() method), and b) by using the .NET 'ThreadStart' approach. The Python-style threads behave as expected, and after their 'run()' exits they get cleaned up. The .NET style threads never get cleaned up, even after they have exited. You can call del, Abort, whatever you want, and it has no effect on them.
The following IronPython script demonstrates the issue:
import System
import threading
import time
import logging

def do_beeps():
    logging.debug("starting do_beeps")
    t_start = time.clock()
    while time.clock() - t_start < 10:
        System.Console.Beep()
        System.Threading.Thread.CurrentThread.Join(1000)
    logging.debug("exiting do_beeps")

class PythonStyleThread(threading.Thread):
    def __init__(self, thread_name="PythonStyleThread"):
        super(PythonStyleThread, self).__init__(name=thread_name)

    def run(self):
        do_beeps()

class ThreadStarter():
    def start(self):
        t = System.Threading.Thread(System.Threading.ThreadStart(do_beeps))
        t.IsBackground = True
        t.Name = "ThreadStartStyleThread"
        t.Start()

if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s %(levelname)s: %(message)s', level=logging.DEBUG, datefmt='%H:%M:%S')

    # Start some ThreadStarter threads:
    for _ in range(5):
        ts = ThreadStarter()
        ts.start()
        System.Threading.Thread.CurrentThread.Join(200)

    # Start some Python-style threads:
    for _ in range(5):
        pt = PythonStyleThread()
        pt.start()
        System.Threading.Thread.CurrentThread.Join(200)

    # Do something on the main thread:
    for _ in range(30):
        print(".")
        System.Threading.Thread.CurrentThread.Join(1000)
When this is debugged in PyDev, what I see is that all the threads appear as expected in the 'debug' view as they are created, but whereas the Python-style ones disappear after they've finished, the .NET / ThreadStart ones stay until the main thread exits.
In the debugger, the problematic threads appear with names 'Dummy-4', 'Dummy-5', etc., whereas the Pythonic ones appear with the name I've given them ('PythonStyleThread'). Looking in the threading.py file in my IronPython installation, I see there is a class called "_DummyThread", a subclass of Thread, that sets its 'name' as 'name=_newname("Dummy-%d")', so it looks like by using ThreadStart I'm ending up with _DummyThreads. The comment for the class also says:
# Dummy thread class to represent threads not started here.
# These aren't garbage collected when they die, nor can they be waited for.
which would explain why I can't get rid of them.
But I don't want 'DummyThread's. I just want normal ones, that behave nicely and get garbage-collected when they've finished doing their thing.
Now, a slightly odd thing about all of this is that unless I set up the logger I don't see the DummyThread entries in the debugger at all (although they still run). This may be a quirk of the PyDev debugger, or it may be relevant. Is there any sane reason why logging should have any bearing on this? Can I solve my problem just by not logging in my thread?
Here, it says:
"There is the possibility that "dummy thread objects" are created. These are thread objects corresponding to "alien threads", which are threads of control started outside the threading module, such as directly from C code. Dummy thread objects have limited functionality; they are always considered alive and daemonic, and cannot be joined. They are never deleted, since it is impossible to detect the termination of alien threads."
Which makes me wonder why I've had the misfortune of ending up with them?
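(For what it's worth, the same mechanism can be demonstrated in plain CPython: logging builds a LogRecord, and LogRecord looks up threading.current_thread(), which is what registers a _DummyThread for a thread the threading module did not start. In the sketch below, _thread.start_new_thread is just a stand-in for an "alien" thread; the behaviour under IronPython may differ.)

import _thread
import logging
import threading
import time

logging.basicConfig(level=logging.DEBUG)

def alien_body():
    # Creating a log record calls threading.current_thread(), which registers
    # a _DummyThread because the threading module did not start this thread.
    logging.debug("hello from an alien thread")

_thread.start_new_thread(alien_body, ())
time.sleep(0.5)
print(threading.enumerate())  # typically still lists a 'Dummy-N' thread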
While I have a workaround in that I can use Python-style threading.Thread subclasses everywhere I currently use .NET 'ThreadStart' threads, I am not keen to do this as the reason I was using .NET style threads in certain places was because they give me an Abort method (whereas the Python ones don't). I know Aborting threads is a Bad Thing, but the application is a unit-testing framework, and I a) need to run unit-tests in a thread, and b) have no control over their contents (they are written by third-parties), so I have no means of periodically checking for a 'please shut me down nicely' flag on these threads, and in extremis may need to kill them rudely.
So a) why am I getting DummyThreads, b) has this got anything to do with logging and c) what can I do about it?
Thanks.
