How to initialize python watchdog pattern matching event handler - python-3.x

I'm using the Python watchdog package to monitor a directory for new files being created. Several different types of files are created in that directory, but I only need to monitor a single file type, hence I use watchdog's PatternMatchingEventHandler, where I specify the pattern to monitor using the patterns keyword.
To correctly execute the code under the hood (not displayed here), I need to initialize an empty dataframe in my event handler, and I am having trouble getting this to work. If I remove the __init__ method in the code below, everything works just fine, by the way.
I used the code in this answer as inspiration for my own.
The code I have set up looks as follows:
from watchdog.observers import Observer
from watchdog.events import PatternMatchingEventHandler
import time
import pandas as pd
import numpy as np
from multiprocessing import Pool

class HandlerEQ54(PatternMatchingEventHandler):
    def __init__(self):
        # Initializing an empty dataframe for storage purposes.
        data_54 = pd.DataFrame(columns=['Barcode', 'DUT', 'Step12', 'Step11', 'Np1', 'Np2', 'TimestampEQ54'])
        # Converting to int for later purposes.
        data_54[['Barcode', 'DUT']] = data_54[['Barcode', 'DUT']].astype(np.int64)
        self.data = data_54

    def on_created(self, event):
        if event.is_directory:
            return True
        elif event.event_type == 'created':
            # Take action here when a file is created.
            print('Found new files:')
            print(event.src_path)
            time.sleep(0.1)
            # Creating a process pool to return data.
            pool1 = Pool(processes=4)
            # Pass the file to the parsing function and return the parsed result.
            result_54 = pool1.starmap(parse_eq54, [(event.src_path, self.data)])
            # Returns the dataframe rather than the list of dataframes returned by starmap.
            self.data = result_54[0]
            print('Data read: ')
            print(self.data)

def monitorEquipment(equipment):
    '''Uses the Watchdog package to monitor the data directory for new files.
    See the HandlerEQ54 and HandlerEQ51 classes in multiprocessing_handlers for the actual monitoring code. Monitors each piece of equipment.'''
    print('equipment')
    if equipment.upper() == 'EQ54':
        event_handler = HandlerEQ54(patterns=["*.log"])
        filepath = '/path/to/first/file/source/'
    # Set up the observer.
    observer = Observer()
    observer.schedule(event_handler, path=filepath, recursive=True)
    observer.daemon = True
    observer.start()
    print('Observer started')
    # Monitor.
    try:
        while True:
            time.sleep(5)
    except KeyboardInterrupt:
        observer.unschedule_all()
        observer.stop()
        observer.join()
However, when I execute monitorEquipment I receive the following error message:
TypeError: __init__() got an unexpected keyword argument 'patterns'
Evidently I'm doing something wrong when initializing my handler class, but I'm drawing a blank as to what that is (which probably reflects my less-than-optimal understanding of classes). Can someone advise me on how to correctly initialize the empty dataframe in my HandlerEQ54 class, so that I don't get this error?

Looks like you are missing the patterns argument from your __init__ method. You'll also need a super() call to the __init__ method of the parent class (PatternMatchingEventHandler), so you can pass the patterns argument upwards.
It should look something like this:
class HandlerEQ54(PatternMatchingEventHandler):
    def __init__(self, patterns=None):
        super(HandlerEQ54, self).__init__(patterns=patterns)
        ...

event_handler = HandlerEQ54(patterns=["*.log"])
or, for a more generic case and to support all of PatternMatchingEventHandler's arguments:
class HandlerEQ54(PatternMatchingEventHandler):
    def __init__(self, *args, **kwargs):
        super(HandlerEQ54, self).__init__(*args, **kwargs)
        ...

event_handler = HandlerEQ54(patterns=["*.log"])
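Putting the two together, a minimal sketch of the fixed handler, keeping the dataframe initialization from the question (untested, but it shows where each piece goes):

class HandlerEQ54(PatternMatchingEventHandler):
    def __init__(self, *args, **kwargs):
        # Forward patterns (and any other arguments) to the parent class.
        super(HandlerEQ54, self).__init__(*args, **kwargs)
        # Initialize the empty dataframe for storage purposes.
        data_54 = pd.DataFrame(columns=['Barcode', 'DUT', 'Step12', 'Step11',
                                        'Np1', 'Np2', 'TimestampEQ54'])
        data_54[['Barcode', 'DUT']] = data_54[['Barcode', 'DUT']].astype(np.int64)
        self.data = data_54

event_handler = HandlerEQ54(patterns=["*.log"])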

Related

Unable to access class attribute in another function

import rospy
from sensor_msgs.msg import Imu

class ImuData:
    def __init__(self):
        #self.data = None
        pass

    def get_observation(self):
        rospy.Subscriber('/imu', Imu, self.imu_callback)
        imuData = self.data
        print(imuData)

    def imu_callback(self, msg):
        self.data = msg.orientation
        print(self.data)

if __name__ == '__main__':
    rospy.init_node('gett_imu', anonymous=True)
    idd = ImuData()
    idd.get_observation()
In the above code, I would like to access self.data, which is set in imu_callback, from the get_observation function. The problem is that I get an error saying that ImuData has no attribute data.
How do I solve this issue?
Note: I feel that the question has to do with Python classes rather than with ROS and rospy.
A couple of things are going on here. One, as was mentioned in the comments, is that you should be initializing your attributes inside __init__. The error you're seeing occurs partly because self.data has not actually been initialized by the time get_observation reads it.
The second issue is where you set up the subscriber. This should also be done in __init__, and only once. Sensors publish at a fairly constant rate, so it takes time to actually receive any data on the topic. Also, if you call get_observation more than once, you create a new subscription each time, which you do not want.
Take the following code as a fixed example:
def __init__(self):
rospy.Subscriber('/imu', Imu, self.imu_callback)
self.data = None
def get_observation(self):
imuData = self.data
print(imuData)
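Even with this fix, get_observation can still run before the first message arrives, in which case self.data is None. One way to guard against that (a sketch, assuming the same /imu topic and message type as above) is to block until the first message is received:

def get_observation(self):
    # Fall back to a blocking wait if no message has arrived yet.
    if self.data is None:
        msg = rospy.wait_for_message('/imu', Imu, timeout=5.0)
        self.data = msg.orientation
    print(self.data)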

How to use Multiprocessing pool in Databricks with pandas [duplicate]

I am sorry that I can't reproduce the error with a simpler example, and my code is too complicated to post. If I run the program in an IPython shell instead of the regular Python, things work out well.
I looked up some previous notes on this problem. They were all caused by using pool to call a function defined within a class function. But this is not the case for me.
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 313, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
I would appreciate any help.
Update: The function I pickle is defined at the top level of the module, though it calls a function that contains a nested function. That is, f() calls g(), which calls h(), which has a nested function i(), and I am calling pool.apply_async(f). f(), g(), and h() are all defined at the top level. I tried a simpler example with this pattern, though, and it works.
Here is a list of what can be pickled. In particular, functions are only picklable if they are defined at the top-level of a module.
This piece of code:
import multiprocessing as mp

class Foo():
    @staticmethod
    def work(self):
        pass

if __name__ == '__main__':
    pool = mp.Pool()
    foo = Foo()
    pool.apply_async(foo.work)
    pool.close()
    pool.join()
yields an error almost identical to the one you posted:
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 315, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
The problem is that the pool methods all use a mp.SimpleQueue to pass tasks to the worker processes. Everything that goes through the mp.SimpleQueue must be picklable, and foo.work is not picklable since it is not defined at the top level of the module.
It can be fixed by defining a function at the top level, which calls foo.work():

def work(foo):
    foo.work()

pool.apply_async(work, args=(foo,))

Notice that foo is picklable, since Foo is defined at the top level and foo.__dict__ is picklable.
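For reference, a self-contained sketch of that workaround, runnable as a script (the return value is hypothetical, added so the example prints something):

import multiprocessing as mp

class Foo():
    def work(self):
        return 'did some work'

# Top-level wrapper: this function is picklable, and foo travels as data.
def work(foo):
    return foo.work()

if __name__ == '__main__':
    pool = mp.Pool()
    foo = Foo()
    result = pool.apply_async(work, args=(foo,))
    print(result.get())
    pool.close()
    pool.join()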
I'd use pathos.multiprocessing instead of multiprocessing. pathos.multiprocessing is a fork of multiprocessing that uses dill. dill can serialize almost anything in Python, so you are able to send a lot more around in parallel. The pathos fork also has the ability to work directly with multiple-argument functions, as you need for class methods.
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> p = Pool(4)
>>> class Test(object):
...     def plus(self, x, y):
...         return x + y
...
>>> t = Test()
>>> x, y = [0, 1, 2, 3], [4, 5, 6, 7]
>>> p.map(t.plus, x, y)
[4, 6, 8, 10]
>>>
>>> class Foo(object):
...     @staticmethod
...     def work(self, x):
...         return x + 1
...
>>> f = Foo()
>>> p.apipe(f.work, f, 100)
<processing.pool.ApplyResult object at 0x10504f8d0>
>>> res = _
>>> res.get()
101
Get pathos (and if you like, dill) here:
https://github.com/uqfoundation
When this problem comes up with multiprocessing, a simple solution is to switch from Pool to ThreadPool. This can be done with no change of code other than the import:

from multiprocessing.pool import ThreadPool as Pool

This works because ThreadPool shares memory with the main thread rather than creating a new process, which means that pickling is not required.
The downside to this method is that Python isn't the greatest language at handling threads: it uses something called the Global Interpreter Lock to stay thread-safe, which can slow down some use cases here. However, if you're primarily interacting with other systems (running HTTP commands, talking with a database, writing to filesystems), then your code is likely not bound by CPU and won't take much of a hit. In fact, when writing HTTP/HTTPS benchmarks, I've found that the threaded model used here has less overhead and fewer delays, since the overhead of creating new processes is much higher than that of creating new threads, and the program was otherwise just waiting for HTTP responses.
So if you're processing a ton of stuff in Python userspace, this might not be the best method.
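To illustrate, a minimal sketch (the Fetcher class is hypothetical, standing in for I/O-bound work): a bound method that would fail to pickle with a process Pool works unchanged with ThreadPool, because nothing is pickled:

from multiprocessing.pool import ThreadPool as Pool

class Fetcher:
    def __init__(self, prefix):
        self.prefix = prefix

    def fetch(self, name):
        # Stand-in for I/O-bound work (an HTTP call, a database query, ...).
        return self.prefix + name

if __name__ == '__main__':
    fetcher = Fetcher('https://example.com/')
    with Pool(4) as pool:
        # Passing a bound method is fine here: tasks stay in-process.
        print(pool.map(fetcher.fetch, ['a', 'b', 'c']))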
As others have said, multiprocessing can only transfer Python objects to worker processes which can be pickled. If you cannot reorganize your code as described by unutbu, you can use dill's extended pickling/unpickling capabilities for transferring data (especially code) as I show below.
This solution requires only the installation of dill and no other libraries such as pathos:
# Python 2 example (note the print statements).
import os
from multiprocessing import Pool
import dill

def run_dill_encoded(payload):
    fun, args = dill.loads(payload)
    return fun(*args)

def apply_async(pool, fun, args):
    payload = dill.dumps((fun, args))
    return pool.apply_async(run_dill_encoded, (payload,))

if __name__ == "__main__":
    pool = Pool(processes=5)

    # async execution of a lambda
    jobs = []
    for i in range(10):
        job = apply_async(pool, lambda a, b: (a, b, a * b), (i, i + 1))
        jobs.append(job)

    for job in jobs:
        print job.get()
    print

    # async execution of a static method
    class O(object):
        @staticmethod
        def calc():
            return os.getpid()

    jobs = []
    for i in range(10):
        job = apply_async(pool, O.calc, ())
        jobs.append(job)

    for job in jobs:
        print job.get()
I have found that I can also generate exactly this error output on a perfectly working piece of code by attempting to use the profiler on it.
Note that this was on Windows (where the forking is a bit less elegant).
I was running:

python -m profile -o output.pstats <script>

And found that removing the profiling removed the error, and adding the profiling back restored it. It was driving me batty too, because I knew the code used to work. I was checking to see whether something had updated pool.py... then I had a sinking feeling, eliminated the profiling, and that was it.
Posting here for the archives in case anybody else runs into it.
Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
This error will also occur if you have any built-in function inside the model object that is passed to the async job.
So make sure to check that the model objects you pass don't have built-in functions. (In our case, we were using the FieldTracker() function of django-model-utils inside the model to track a certain field.) Here is the link to the relevant GitHub issue.
This solution requires only the installation of dill and no other libraries such as pathos:
import dill

# Note: the tuple-parameter signature below is Python 2 syntax.
def apply_packed_function_for_map((dumped_function, item, args, kwargs),):
    """
    Unpack dumped function as target function and call it with arguments.

    :param (dumped_function, item, args, kwargs):
        a tuple of dumped function and its arguments
    :return:
        result of target function
    """
    target_function = dill.loads(dumped_function)
    res = target_function(item, *args, **kwargs)
    return res

def pack_function_for_map(target_function, items, *args, **kwargs):
    """
    Pack function and arguments to an object that can be sent from one
    multiprocessing.Process to another. The main problem is that
    «multiprocessing.Pool.map*» or «apply*»
    cannot use class methods or closures.
    It solves this problem with «dill».
    It works with the target function as an argument, dumps it («with dill»)
    and returns the dumped function with the arguments of the target function.
    For more performance we dump only the target function itself
    and don't dump its arguments.

    How to use (pseudo-code):

    ~>>> import multiprocessing
    ~>>> images = [...]
    ~>>> pool = multiprocessing.Pool(100500)
    ~>>> features = pool.map(
    ~...     *pack_function_for_map(
    ~...         super(Extractor, self).extract_features,
    ~...         images,
    ~...         type='png',
    ~...         **options,
    ~...     )
    ~... )
    ~>>>

    :param target_function:
        function that you want to execute like target_function(item, *args, **kwargs).
    :param items:
        list of items for map
    :param args:
        positional arguments for target_function(item, *args, **kwargs)
    :param kwargs:
        named arguments for target_function(item, *args, **kwargs)
    :return: tuple(function_wrapper, dumped_items)
        It returns a tuple with
        * the function wrapper, which unpacks and calls the target function;
        * the list of packed target functions and their arguments.
    """
    dumped_function = dill.dumps(target_function)
    dumped_items = [(dumped_function, item, args, kwargs) for item in items]
    return apply_packed_function_for_map, dumped_items
It also works for numpy arrays.
A quick fix is to make the function global:

from multiprocessing import Pool

class Test:
    def __init__(self, x):
        self.x = x

    @staticmethod
    def test(x):
        return x**2

    def test_apply(self, list_):
        global r
        def r(x):
            return Test.test(x + self.x)
        with Pool() as p:
            l = p.map(r, list_)
        return l

if __name__ == '__main__':
    o = Test(2)
    print(o.test_apply(range(10)))
Building on @rocksportrocker's solution, it would make sense to dill-encode not only when sending but also when receiving the results:
import dill
import itertools

def run_dill_encoded(payload):
    fun, args = dill.loads(payload)
    res = fun(*args)
    res = dill.dumps(res)
    return res

def dill_map_async(pool, fun, args_list,
                   as_tuple=True,
                   **kw):
    if as_tuple:
        args_list = ((x,) for x in args_list)

    it = itertools.izip(
        itertools.cycle([fun]),
        args_list)
    it = itertools.imap(dill.dumps, it)
    return pool.map_async(run_dill_encoded, it, **kw)

if __name__ == '__main__':
    import multiprocessing as mp
    import sys, os

    p = mp.Pool(4)
    res = dill_map_async(p, lambda x: [sys.stdout.write('%s\n' % os.getpid()), x][-1],
                         [lambda x: x + 1] * 10,)
    res = res.get(timeout=100)
    res = map(dill.loads, res)
    print(res)
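Note that itertools.izip and itertools.imap are Python 2-only. On Python 3, the built-in zip and map are already lazy, so the same helper could be sketched as:

def dill_map_async(pool, fun, args_list, as_tuple=True, **kw):
    if as_tuple:
        args_list = ((x,) for x in args_list)
    # zip and map are lazy in Python 3, so itertools is only needed for cycle.
    it = zip(itertools.cycle([fun]), args_list)
    it = map(dill.dumps, it)
    return pool.map_async(run_dill_encoded, it, **kw)

with the final decoding step becoming res = list(map(dill.loads, res)).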
As @penky Suresh suggested in this answer, don't use built-in keywords.
Apparently args is a built-in keyword when dealing with multiprocessing:
from concurrent.futures import ProcessPoolExecutor, as_completed

class TTS:
    def __init__(self):
        pass

    def process_and_render_items(self):
        multiprocessing_args = [{"a": "b", "c": "d"}, {"e": "f", "g": "h"}]
        with ProcessPoolExecutor(max_workers=10) as executor:
            # Using 'args' here is fine.
            future_processes = {
                executor.submit(TTS.process_and_render_item, args)
                for args in multiprocessing_args
            }
            for future in as_completed(future_processes):
                try:
                    data = future.result()
                except Exception as exc:
                    print(f"Generated an exception: {exc}")
                else:
                    print(f"Generated data for comment process: {future}")

    # Don't use 'args' here. It seems to be a built-in keyword.
    # Changing 'args' to 'arg' worked for me.
    def process_and_render_item(arg):
        print(arg)
        # This will print {"a": "b", "c": "d"} for the first process
        # and {"e": "f", "g": "h"} for the second process.
PS: The tabs/spaces may be a bit off.

How can I use multiprocessing in a class

I'm making a program to study Python. It is a GUI web crawler.
I've managed to get the GUI and main classes working together using QThread, but I have a problem.
In the main class, I first get the picture addresses using webdriver and build a list called data.
After that, I use Pool() and map() to start downloading the pictures using the download_image method of the Main class.
I've searched and tried many things: imap, lambda, etc.
Here is my code (I import multiprocess as mul, and my Python version is 3.7):
# crawler and downloader class
class Main(QThread, QObject):
    def __init__(self, path, brand, model, grade):
        QThread.__init__(self)
        self.path = path

    # this is the download method
    def download_image(self, var):
        a = var.split("/")
        filename = a[-1]
        download_path = self.path + filename
        urllib.request.urlretrieve(var, download_path)

    # this is the start method, run when the button is clicked in the GUI
    def core(self):
        # sample url data list
        data = ['url.com', 'url2.com', 'url3.com', ...]
        download_p = Pool(mul.cpu_count())
        download_p.map(self.download_image, data)
        download_p.close()
        download_p.join()
        print("end")

    def run(self):
        self.core()

class Gui(QMainWindow, design.Ui_MainWindow):
    def __init__(self):
        # (and gui code here)
If I use download_p.map(self.download_image, data), I get this error: TypeError: can't pickle Main objects.
If I use download_p.map(self.download_image, self.data) (and also set self.data = [urls...]), I get the same TypeError.
If I use download_p.map(self.download_image, self, data), I get this error: TypeError: 'Main' object is not iterable.
I'm not good at English or Python, but I want to resolve this problem, so I decided to ask here.
Thanks a lot for looking at this newbie's question...

How to initialize the class only once and retain the overridden properties in Python

I am carrying out XML parsing for a list of XML files. I am using a module which overrides the XMLParser class of ElementTree. This is the code:
import sys

sys.modules['_elementtree'] = None
try:
    sys.modules.pop('xml.etree.ElementTree')
except KeyError:
    pass
import xml.etree.ElementTree as ET

class Parse():
    def __init__(self):
        self.xmlFiles = [list_of_xmlFile_paths]

    def parse_xml_files(self):
        for filepath in self.xmlFiles:
            root = ET.parse(filepath, LineNumberingParser()).getroot()
            for elem in root:
                print(elem.start_line_number, elem.end_line_number)

class LineNumberingParser(ET.XMLParser):
    def _start(self, *args, **kwargs):
        # Here we assume the default XML parser, which is expat,
        # and copy its element position attributes into output Elements.
        self.element = super(self.__class__, self)._start(*args, **kwargs)
        self.element.start_line_number = self.parser.CurrentLineNumber
        self.element.start_column_number = self.parser.CurrentColumnNumber
        return self.element

    def _end(self, *args, **kwargs):
        self.element = super(self.__class__, self)._end(*args, **kwargs)
        self.element.end_line_number = self.parser.CurrentLineNumber
        self.element.end_column_number = self.parser.CurrentColumnNumber
        return self.element
The LineNumberingParser class gives me the start line and end line of an XML node. My issue is that the class is initialized for every XML file, and this repeated initialization is not efficient. How can I do this while initializing the class only once? Can anyone please advise?
I am still unsure how you want to do that; it seems that the ET.XMLParser class needs to be initialized on a per-file basis.
However, should you find a way around that (e.g. by "re-initializing" the ET.XMLParser object's variables manually), you could keep an instance of the parser in LineNumberingParser as a class variable and initialize it only once.
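A rough sketch of that idea (hypothetical, and only viable if the parser's internal state can in fact be reset between files, which, as noted above, plain ET.XMLParser does not support):

class Parse():
    # Hypothetical: one shared parser instance, created on first use.
    _parser = None

    @classmethod
    def get_parser(cls):
        if cls._parser is None:
            cls._parser = LineNumberingParser()
        return cls._parser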

Pickle can't pickle _thread.lock objects

I'm trying to use pickle to save one of my objects but I face this error when trying to dump it:
TypeError: can't pickle _thread.lock objects
It is not clear to me, because I'm not using any locks inside my code. I tried to reproduce this error:
import threading
from time import sleep
import pickle

class some_class:
    def __init__(self):
        self.a = 1
        thr = threading.Thread(target=self.incr)
        self.lock = threading.Lock()
        thr.start()

    def incr(self):
        while True:
            # with self.lock:
            self.a += 1
            print(self.a)
            sleep(0.5)

if __name__ == "__main__":
    a = some_class()
    val = pickle.dumps(a, pickle.HIGHEST_PROTOCOL)
    print("pickle done!")
  File "pickle_thread.py", line 22, in <module>
    val = pickle.dumps(a, pickle.HIGHEST_PROTOCOL)
TypeError: can't pickle _thread.lock objects
If I define a thread lock inside my object, I can't pickle it, right?
I think the problem here is using threading.Lock, but is there any workaround for this?
Actually, in my main project I can't find any locks, but I use lots of modules that I can't trace through. What should I look for?
Thanks.
You can try to customize the pickling method for this class by excluding unpicklable objects from the dictionary:

def __getstate__(self):
    state = self.__dict__.copy()
    del state['lock']
    return state
When unpickling, you can recreate missing objects manually, e.g.:

def __setstate__(self, state):
    self.__dict__.update(state)
    self.lock = threading.Lock()  # ???
I don't know enough about the threading module to predict if this is gonna be sufficient.
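Applied to the some_class example above, a minimal sketch (omitting the background thread so the script terminates; note that unpickling restores the counter but creates a fresh, unlocked lock):

import threading
import pickle

class some_class:
    def __init__(self):
        self.a = 1
        self.lock = threading.Lock()

    def __getstate__(self):
        # Drop the unpicklable lock from the instance dictionary.
        state = self.__dict__.copy()
        del state['lock']
        return state

    def __setstate__(self, state):
        # Restore the saved state and recreate a fresh lock.
        self.__dict__.update(state)
        self.lock = threading.Lock()

if __name__ == "__main__":
    obj = some_class()
    restored = pickle.loads(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))
    print(restored.a)  # -> 1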
