Multithreading with Watchdog - python-3.x

I need to run a function on all new files in a folder.
I've chosen watchdog to detect file-creation events, as it is rather straightforward to use. However, the operation on each file takes roughly 30-40 seconds, so the whole process takes a long time whenever a large number of files (e.g. 1000) is added to the folder.
I have heard of multithreading, and I believe it is the answer to my issue: instead of running the function (do_smth) on each added file one at a time, run it on as many files concurrently as my RAM allows. How should I go about it?
Please find a minimal reproducible example of my code below:
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
folder_to_watch = "/trigger_folder"
class EventHandler(FileSystemEventHandler):
    def do_smth(self):
        print("do_something")
        time.sleep(2)
    def on_created(self, event):  # when file is created
        print("Got event for file %s" % event.src_path)
        time.sleep(1)
        self.do_smth()
observer = Observer()
event_handler = EventHandler()  # create event handler
# set observer to use created handler in directory
observer.schedule(event_handler, path=folder_to_watch)
observer.start()
# sleep until keyboard interrupt, then stop + rejoin the observer
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()
Edit 1:
The do_smth function in reality checks whether the new file is an image file and, if so, opens it via cv2, reads its height and width, and saves them to a .csv file, among some other operations (which unfortunately take longer).
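A minimal sketch of one way to approach this, assuming the per-file work can safely run in parallel: hand each event to a concurrent.futures.ThreadPoolExecutor so that on_created returns immediately, with max_workers capping how many files are processed at once. PooledHandler, the worker count of 4, and the file names are hypothetical. In real use the class would subclass watchdog's FileSystemEventHandler and be scheduled on the Observer as in the code above; the stand-in event objects exist only so the sketch runs without watchdog installed.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace


def do_smth(path):
    """Placeholder for the slow per-file work (30-40 s in the real case)."""
    time.sleep(0.1)
    return path


class PooledHandler:
    """Hypothetical handler; in real use, subclass FileSystemEventHandler."""

    def __init__(self, max_workers=4):
        # max_workers bounds how many files are processed concurrently.
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.futures = []

    def on_created(self, event):
        # Submit and return at once so the observer thread is never blocked.
        self.futures.append(self.pool.submit(do_smth, event.src_path))

    def shutdown(self):
        # Wait for all queued work to finish before exiting.
        self.pool.shutdown(wait=True)


handler = PooledHandler(max_workers=4)
for i in range(8):
    # Stand-in for the event object watchdog would pass in.
    handler.on_created(SimpleNamespace(src_path=f"/trigger_folder/img_{i}.png"))
handler.shutdown()
results = [f.result() for f in handler.futures]
print(len(results), "files processed")
```

Threads help most when the work waits on I/O or runs in native code that releases the GIL. If the slow part turns out to be pure-Python computation, concurrent.futures.ProcessPoolExecutor offers the same submit interface and may be the better fit.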

Related

Checking a folder if no new files were added into it and generate an alert if no new files were added

I'm trying to create a program with Python that checks if a folder has a new file(s) or not.
I had an instance where there was a technical error and no new files were being added to a folder. I didn't know until a couple days later.
I want to generate an alert (a text file or email alert or something) if no new files were added to the folder.
I have a test folder with two files:
Name      Date Modified
doc1.txt  6/23/22
doc2.txt  6/23/22
If I run a program now it should say "No files added since 6/23/22" as an alert then end program.
But, if I add a new file today:
Name      Date Modified
doc1.txt  6/23/22
doc2.txt  6/23/22
doc3.txt  7/25/22
The program should output nothing and end or say "there is an event with doc3.txt".
So far I have something like the code below, which I found in the question "Running a python script when a new file is created", since Watchdog seems well suited to this. It is similar to what I want; I just need to remove the while loop, as I have my own scheduling software to run the program every day. But is there a better way than this?
import os, time, datetime
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
path_to_watch = 'Testing Folder'
class ExampleHandler(FileSystemEventHandler):
    def on_created(self, event):  # when file is created
        # do something, eg. call your function to process the image
        print("Got event for file %s" % event.src_path)
observer = Observer()
event_handler = ExampleHandler()  # create event handler
# set observer to use created handler in directory
observer.schedule(event_handler, path=path_to_watch)
observer.start()
# sleep until keyboard interrupt, then stop + rejoin the observer
try:
    while True:
        time.sleep(1)
        exit()
except KeyboardInterrupt:
    observer.stop()
observer.join()
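Since an external scheduler already runs the script once a day, one alternative that needs no observer at all is to compare the newest modification time in the folder against a threshold. The following is a minimal sketch, assuming modification time is an acceptable proxy for when a file was added; newest_mtime, check_folder, and the one-day threshold are hypothetical names and values.

```python
import os
import tempfile
import time
from datetime import datetime, timedelta


def newest_mtime(folder):
    """Return the latest mtime (epoch seconds) among regular files, or None."""
    with os.scandir(folder) as entries:
        mtimes = [e.stat().st_mtime for e in entries if e.is_file()]
    return max(mtimes, default=None)


def check_folder(folder, max_age=timedelta(days=1), now=None):
    """Return an alert string if no file is newer than max_age, else None."""
    now = time.time() if now is None else now
    latest = newest_mtime(folder)
    if latest is None:
        return "No files found in folder"
    if now - latest > max_age.total_seconds():
        stamp = datetime.fromtimestamp(latest).strftime("%m/%d/%y")
        return "No files added since " + stamp
    return None


# Demo on a throwaway folder: one file back-dated by 30 days, then a fresh one.
with tempfile.TemporaryDirectory() as tmp:
    old_file = os.path.join(tmp, "doc1.txt")
    open(old_file, "w").close()
    month_ago = time.time() - 30 * 24 * 3600
    os.utime(old_file, (month_ago, month_ago))
    stale_alert = check_folder(tmp)  # only the old file exists

    open(os.path.join(tmp, "doc3.txt"), "w").close()
    fresh_alert = check_folder(tmp)  # a new file has just been added

print(stale_alert)
print(fresh_alert)
```

The returned string could be written to a text file or passed to whatever alerting channel is in use; a return value of None means a sufficiently recent file was found.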

How to force os.stat re-read file stats by same path

My code is architecturally close to what is posted below (unfortunately I can't post the full version because it's proprietary). I have a self-updating executable and I'm trying to test this feature. Assume the full path to this file is in A.some_path after the input is read. My problem is that the assertion fails, because the second call to os.stat still returns the previous file stats (I suppose it assumes nothing could have changed, so a re-read is unnecessary). When I launch the update manually, self-updating works completely fine: the file really is removed and recreated, and its stats change. Is there a guaranteed way to force os.stat to re-read file stats for the same path, or an alternative that makes this work (other than recreating the A object)?
from pathlib import Path
import unittest
import os
class A:
    some_path = Path()
    def __init__(self, _some_path):
        self.some_path = Path(_some_path)
    def get_path(self):
        return self.some_path
class TestKit(unittest.TestCase):
    def setUp(self):
        pass
    def check_body(self, a):
        some_path = a.get_path()
        modification_time = os.stat(some_path).st_mtime
        # Launching self-updating executable
        self.assertTrue(modification_time < os.stat(some_path).st_mtime)
    def check(self):
        a = A(input('Enter the file path\n'))
        self.check_body(a)
def Tests():
    suite = unittest.TestSuite()
    suite.addTest(TestKit('check'))
    return suite
def main():
    tests_suite = Tests()
    unittest.TextTestRunner().run(tests_suite)
if __name__ == "__main__":
    main()
I have found the origin of the problem: I launched the self-update via os.system, which waits until the process is done. But first, during the self-update we launch several detached processes, and we should actually wait until all of them have ended. Second, even the signal that a process has ended doesn't mean the OS has completely released the file, and it looks like by the time assertTrue runs we are not yet done with all our routines. For my task I simply used sleep, but a proper solution should inspect the running processes and wait for them to finish, or at least make several attempts with a wait between them.
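The "several attempts with a wait" idea can be sketched as a polling loop with a deadline. This is only an illustration: wait_for_mtime_change, its timeout and interval defaults, and the simulated updater are hypothetical, and the sketch does not track the updater's child processes.

```python
import os
import tempfile
import threading
import time


def wait_for_mtime_change(path, old_mtime, timeout=10.0, interval=0.1):
    """Poll os.stat until st_mtime exceeds old_mtime; False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if os.stat(path).st_mtime > old_mtime:
                return True
        except FileNotFoundError:
            # The file may be briefly absent while it is being replaced.
            pass
        time.sleep(interval)
    return False


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "app.bin")
    open(target, "w").close()
    # Pin the starting mtime so the comparison does not depend on clock granularity.
    os.utime(target, (1_000_000_000, 1_000_000_000))
    before = os.stat(target).st_mtime

    def fake_update():
        # Stand-in for the detached updater finishing some time later.
        os.utime(target, (1_000_000_100, 1_000_000_100))

    timer = threading.Timer(0.3, fake_update)
    timer.start()
    changed = wait_for_mtime_change(target, before, timeout=5.0)
    timer.join()

print(changed)
```

In the test above, the assertion could then become self.assertTrue(wait_for_mtime_change(some_path, modification_time)) instead of a fixed sleep.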

Get Duration of Media Before Playing

I have a QMediaPlayer object. If I try to get its duration before letting the file buffer enough, it returns -1. To my understanding, this is because the file is loaded asynchronously, so the duration (and subsequently the position) cannot be known while it is unknown whether the file has fully loaded.
My initial idea to solve this was to run media.play(), immediately followed by media.stop(). This does absolutely nothing. Then, I considered running media.play() and media.pause(). This does not work either. I imagine this is because the media needs to buffer for a significant period of time before the duration can be obtained. Also, this "solution" would not have been ideal regardless.
How can I get the duration of a QMediaPlayer object before the file has been played?
One possible solution is to use the durationChanged signal:
from PyQt5 import QtCore, QtMultimedia
if __name__ == '__main__':
    import sys
    app = QtCore.QCoreApplication(sys.argv)
    player = QtMultimedia.QMediaPlayer()
    @QtCore.pyqtSlot('qint64')
    def on_durationChanged(duration):
        print(duration)
        player.stop()
        QtCore.QCoreApplication.quit()
    player.durationChanged.connect(on_durationChanged)
    file = "/path/of/small.mp4"
    player.setMedia(QtMultimedia.QMediaContent(QtCore.QUrl.fromLocalFile(file)))
    player.play()
    sys.exit(app.exec())

Python - Multiprocessing vs Multithreading for file-based work

I have created a GUI (using wxPython) that allows the user to search for files and their content. The program consists of three parts:
1. The main GUI window
2. The searching mechanism (separate class)
3. The output window that displays the results of the file search (separate class)
Currently I'm using Python's threading module to run the searching mechanism (2) in a separate thread, so that the main GUI stays responsive. I'm passing the results to the output window (3) at runtime using a Queue. This works fine for light file-reading work, but as soon as the search becomes more demanding, the main GUI window (1) starts lagging.
This is roughly the schematic:
import threading
import os
import wx
class MainWindow(wx.Frame):  # this is point (1)
    def __init__(self, parent, ...):
        # Initialize frame and panels etc.
        self.button_search = wx.Button(self, label="Search")
        self.button_search.Bind(wx.EVT_BUTTON, self.onSearch)
    def onSearch(self, event):
        """Run filesearch in separate Thread"""  # Here is the kernel of my question
        # Pass the bound method itself; calling Run() here would execute the search in the GUI thread
        filesearch = FileSearch().Run
        searcher = threading.Thread(target=filesearch, args=(args,))
        searcher.start()
class FileSearch():  # this is point (2)
    def __init__(self):
        ...
    def Run(self, args):
        """Performs file search"""
        for root, dirs, files in os.walk(...):
            for file in files:
                ...
    def DetectEncoding(self):
        """Detects encoding of file for reading"""
        ...
class OutputWindow(wx.Frame):  # this is point (3)
    def __init__(self, parent, ...):
        # Initialize output window
        ...
    def AppendItem(self, data):
        """Appends a fileitem to the list"""
        ...
My questions:
Is Python's multiprocessing module better suited to this performance-intensive job?
If yes, which method of interprocess communication (IPC) should I choose to send the results from the searching mechanism class (2) to the output window (3), and how should I implement it schematically?
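A schematic sketch of one IPC option, multiprocessing.Queue, in which a child process walks the directory tree and streams matching paths back to the parent. The names search_worker, run_search, SENTINEL, and needle are hypothetical, and plain substring matching stands in for the real search logic. Whether a separate process actually removes the lag depends on whether the search is CPU-bound, which the question does not establish.

```python
import multiprocessing as mp
import os
import tempfile

SENTINEL = None  # marks the end of the result stream


def search_worker(root, needle, out_queue):
    """Walk root and put the path of every file whose text contains needle."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="ignore") as fh:
                    if needle in fh.read():
                        out_queue.put(path)
            except OSError:
                continue  # unreadable file: skip it
    out_queue.put(SENTINEL)


def run_search(root, needle):
    """Run the search in a child process and collect results in the parent."""
    out_queue = mp.Queue()
    proc = mp.Process(target=search_worker, args=(root, needle, out_queue))
    proc.start()
    results = []
    while True:
        item = out_queue.get()  # blocks; a GUI would poll instead
        if item is SENTINEL:
            break
        results.append(item)
    proc.join()
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "a.txt"), "w") as fh:
            fh.write("hello world")
        with open(os.path.join(tmp, "b.txt"), "w") as fh:
            fh.write("nothing here")
        print(run_search(tmp, "hello"))
```

In a wx application the blocking loop in run_search would not run on the GUI thread. One option is to drain the queue with get_nowait from a wx.Timer handler and append each item to the output window there.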

Why do I get NSAutoreleasePool double release when using Python/Pyglet on OS X

I'm using Python 3.5 and Pyglet 1.2.4 on OS X 10.11.5. I am very new to this setup.
I am trying to see if I can use event handling to capture keystrokes (without echoing them to the screen) and return them to the main program one at a time by separate invocations of the pyglet.app.run method. In other words, I am trying to use Pyglet event handling as if it were a callable function for this purpose.
Below is my test program. It sets up the Pyglet event mechanism and then calls it four times. It works as desired but causes the system messages shown below.
import pyglet
from pyglet.window import key
event_loop = pyglet.app.EventLoop()
window = pyglet.window.Window(width=400, height=300, caption="TestWindow")
@window.event
def on_draw():
    window.clear()
@window.event
def on_key_press(symbol, modifiers):
    global key_pressed
    if symbol == key.A:
        key_pressed = "a"
    else:
        key_pressed = 'unknown'
    pyglet.app.exit()
# Main Program
pyglet.app.run()
print(key_pressed)
pyglet.app.run()
print(key_pressed)
pyglet.app.run()
print(key_pressed)
pyglet.app.run()
print(key_pressed)
print("Quitting NOW!")
Here is the output, with blank lines inserted for readability. The first message is different and appears even if I comment out the four calls to pyglet.app.run. The double release messages do not occur after every call to event handling and do not appear consistently from one test run to the next.
/Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5 "/Users/home/PycharmProjects/Test Event Handling/.idea/Test Event Handling 03B.py"
2016-07-28 16:49:59.401 Python[11419:4185158]ApplePersistenceIgnoreState: Existing state will not be touched. New state will be written to /var/folders/8q/bhzsqtz900s742c17gkj_y740000gr/T/org.python.python.savedState
a
2016-07-28 16:50:02.841 Python[11419:4185158] *** -[NSAutoreleasePool drain]: This pool has already been drained, do not release it (double release).
2016-07-28 16:50:03.848 Python[11419:4185158] *** -[NSAutoreleasePool drain]: This pool has already been drained, do not release it (double release).
a
a
2016-07-28 16:50:04.632 Python[11419:4185158] *** -[NSAutoreleasePool drain]: This pool has already been drained, do not release it (double release).
a
Quitting NOW!
Process finished with exit code 0
Basic question: Why is this happening and what can I do about it?
Alternate question: Is there a better way to detect and get a user's keystrokes without echoing them to the screen? I will be using Python and Pyglet for graphics, so I was trying this using Pyglet's event handling.
Try to play with this simple example. It uses the built-in pyglet event handler to send the key pressed to a function that can then handle it. It shows that pyglet.app itself is the loop. You don't need to create any other.
#!/usr/bin/env python
import pyglet
class Win(pyglet.window.Window):
    def __init__(self):
        super(Win, self).__init__()
    def on_draw(self):
        self.clear()
        # display your output here....
    def on_key_press(self, symbol, modifiers):
        if symbol == pyglet.window.key.ESCAPE:
            exit(0)
        else:
            self.do_something(symbol)
        # etc....
    def do_something(self, symbol):
        print(symbol)
        # here you can test the input and then redraw
window = Win()
pyglet.app.run()
