How to use Multiprocessing pool in Databricks with pandas [duplicate] - python-3.x

I am sorry that I can't reproduce the error with a simpler example, and my code is too complicated to post. If I run the program in IPython shell instead of the regular Python, things work out well.
I looked up some previous notes on this problem. They were all caused by using pool to call a function defined within a class function. But this is not the case for me.
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 313, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
I would appreciate any help.
Update: The function I pickle is defined at the top level of the module, though it calls a function that contains a nested function. That is, f() calls g(), which calls h(), which has a nested function i(), and I am calling pool.apply_async(f). f(), g(), and h() are all defined at the top level. I tried a simpler example with this pattern, though, and it works.

Here is a list of what can be pickled. In particular, functions are only picklable if they are defined at the top-level of a module.
This piece of code:
import multiprocessing as mp

class Foo():
    @staticmethod
    def work(self):
        pass

if __name__ == '__main__':
    pool = mp.Pool()
    foo = Foo()
    pool.apply_async(foo.work)
    pool.close()
    pool.join()
yields an error almost identical to the one you posted:
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 315, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
The problem is that the pool methods all use an mp.SimpleQueue to pass tasks to the worker processes. Everything that goes through the mp.SimpleQueue must be picklable, and foo.work is not picklable since it is not defined at the top level of the module.
It can be fixed by defining a function at the top level, which calls foo.work():
def work(foo):
    foo.work()

pool.apply_async(work, args=(foo,))
Notice that foo is picklable, since Foo is defined at the top level and foo.__dict__ is picklable.
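For completeness, a minimal self-contained sketch of that fix (the class and method names here are illustrative, not from the original post):

import multiprocessing as mp


class Foo():
    def work(self):
        return "did some work"


def work(foo):
    # Defined at module level, so the worker can look it up by name.
    return foo.work()


if __name__ == '__main__':
    pool = mp.Pool()
    foo = Foo()
    # foo itself is pickled (via its __dict__); work() is pickled by reference.
    result = pool.apply_async(work, args=(foo,))
    print(result.get())   # did some work
    pool.close()
    pool.join()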

I'd use pathos.multiprocessing instead of multiprocessing. pathos.multiprocessing is a fork of multiprocessing that uses dill. dill can serialize almost anything in Python, so you are able to send a lot more around in parallel. The pathos fork also has the ability to work directly with multiple-argument functions, as you need for class methods.
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> p = Pool(4)
>>> class Test(object):
...     def plus(self, x, y):
...         return x+y
...
>>> t = Test()
>>> x, y = [0, 1, 2, 3], [4, 5, 6, 7]  # example inputs consistent with the output below
>>> p.map(t.plus, x, y)
[4, 6, 8, 10]
>>>
>>> class Foo(object):
...     @staticmethod
...     def work(self, x):
...         return x+1
...
>>> f = Foo()
>>> p.apipe(f.work, f, 100)
<processing.pool.ApplyResult object at 0x10504f8d0>
>>> res = _
>>> res.get()
101
Get pathos (and if you like, dill) here:
https://github.com/uqfoundation

When this problem comes up with multiprocessing, a simple solution is to switch from Pool to ThreadPool. This can be done with no change of code other than the import:
from multiprocessing.pool import ThreadPool as Pool
This works because ThreadPool shares memory with the main thread, rather than creating a new process- this means that pickling is not required.
The downside to this method is that Python isn't the greatest language at handling threads: it uses something called the Global Interpreter Lock to stay thread-safe, which can slow down some use cases. However, if you're primarily interacting with other systems (making HTTP requests, talking to a database, writing to filesystems), then your code is likely not CPU-bound and won't take much of a hit. In fact, when writing HTTP/HTTPS benchmarks I've found that the threaded model used here has less overhead and lower delays, because the cost of creating new processes is much higher than the cost of creating new threads, and the program was otherwise just waiting for HTTP responses.
So if you're processing a ton of stuff in Python userspace, this might not be the best method.
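A rough sketch of the swap (the class here is illustrative, not from the original answer):

from multiprocessing.pool import ThreadPool as Pool


class Downloader:
    def fetch(self, url):
        # Stand-in for I/O-bound work (an HTTP request, a DB query, a file write).
        # Such work releases the GIL while waiting, so threads still overlap nicely.
        return len(url)


if __name__ == '__main__':
    d = Downloader()
    with Pool(8) as pool:
        # Bound methods are fine here: nothing gets pickled, the threads share memory.
        results = pool.map(d.fetch, ["https://example.com/a", "https://example.com/b"])
    print(results)   # [21, 21]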

As others have said, multiprocessing can only transfer Python objects to worker processes which can be pickled. If you cannot reorganize your code as described by unutbu, you can use dill's extended pickling/unpickling capabilities for transferring data (especially code) as I show below.
This solution requires only the installation of dill and no other libraries such as pathos:
import os
from multiprocessing import Pool

import dill


def run_dill_encoded(payload):
    fun, args = dill.loads(payload)
    return fun(*args)


def apply_async(pool, fun, args):
    payload = dill.dumps((fun, args))
    return pool.apply_async(run_dill_encoded, (payload,))


if __name__ == "__main__":

    pool = Pool(processes=5)

    # async execution of lambda
    jobs = []
    for i in range(10):
        job = apply_async(pool, lambda a, b: (a, b, a * b), (i, i + 1))
        jobs.append(job)

    for job in jobs:
        print job.get()
    print

    # async execution of static method
    class O(object):

        @staticmethod
        def calc():
            return os.getpid()

    jobs = []
    for i in range(10):
        job = apply_async(pool, O.calc, ())
        jobs.append(job)

    for job in jobs:
        print job.get()

I have found that I can also generate exactly that error output on a perfectly working piece of code by attempting to use the profiler on it.
Note that this was on Windows (where the forking is a bit less elegant).
I was running:
python -m profile -o output.pstats <script>
And I found that removing the profiling removed the error, and adding the profiling back restored it. It was driving me batty because I knew the code used to work. I checked to see if something had updated pool.py... then had a sinking feeling, eliminated the profiling, and that was it.
Posting here for the archives in case anybody else runs into it.

Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
This error can also occur if the model object passed to the async job contains a built-in function.
So make sure the model objects being passed don't contain built-in functions. (In our case, we were using the FieldTracker() function of django-model-utils inside the model to track a certain field.) Here is the link to the relevant GitHub issue.
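If it isn't obvious which attribute is the offender, one way to narrow it down is to try pickling each attribute of the object separately before handing it to the async job. This is only a debugging sketch (the helper name is made up, and it is not part of django-model-utils):

import pickle


def find_unpicklable_attributes(obj):
    """Return (attribute name, error) pairs for attributes that fail to pickle."""
    failures = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:
            failures.append((name, repr(exc)))
    return failures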

This solution requires only the installation of dill and no other libraries such as pathos:
import dill


def apply_packed_function_for_map((dumped_function, item, args, kwargs),):
    """
    Unpack dumped function as target function and call it with arguments.

    :param (dumped_function, item, args, kwargs):
        a tuple of dumped function and its arguments
    :return:
        result of target function
    """
    target_function = dill.loads(dumped_function)
    res = target_function(item, *args, **kwargs)
    return res


def pack_function_for_map(target_function, items, *args, **kwargs):
    """
    Pack function and arguments to an object that can be sent from one
    multiprocessing.Process to another. The main problem is:
        «multiprocessing.Pool.map*» or «apply*»
        cannot use class methods or closures.
    It solves this problem with «dill».
    It takes the target function as an argument, dumps it («with dill»)
    and returns the dumped function with the arguments of the target function.
    For more performance we dump only the target function itself
    and don't dump its arguments.

    How to use (pseudo-code):

        ~>>> import multiprocessing
        ~>>> images = [...]
        ~>>> pool = multiprocessing.Pool(100500)
        ~>>> features = pool.map(
        ~...     *pack_function_for_map(
        ~...         super(Extractor, self).extract_features,
        ~...         images,
        ~...         type='png',
        ~...         **options,
        ~...     )
        ~... )
        ~>>>

    :param target_function:
        function that you want to execute like target_function(item, *args, **kwargs).
    :param items:
        list of items for map
    :param args:
        positional arguments for target_function(item, *args, **kwargs)
    :param kwargs:
        named arguments for target_function(item, *args, **kwargs)
    :return: tuple(function_wrapper, dumped_items)
        It returns a tuple with
            * a function wrapper that unpacks and calls the target function;
            * a list of the packed target function and its arguments.
    """
    dumped_function = dill.dumps(target_function)
    dumped_items = [(dumped_function, item, args, kwargs) for item in items]
    return apply_packed_function_for_map, dumped_items
It also works for numpy arrays.
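The apply_packed_function_for_map definition above uses Python 2 tuple-parameter syntax. A rough Python 3 adaptation of the same idea might look like this (names and the toy scale() function are illustrative, not from the original answer):

import multiprocessing

import dill


def apply_packed_py3(packed):
    # Python 3 cannot use tuple parameters in a def, so unpack inside the function.
    dumped_function, item, args, kwargs = packed
    return dill.loads(dumped_function)(item, *args, **kwargs)


def scale(item, factor=1):
    # scale() is top-level only to keep the sketch short; the point of dill
    # is that closures and bound methods would survive dill.dumps() as well.
    return item * factor


if __name__ == '__main__':
    dumped = dill.dumps(scale)
    payload = [(dumped, item, (), {'factor': 10}) for item in [1, 2, 3, 4]]
    with multiprocessing.Pool(4) as pool:
        print(pool.map(apply_packed_py3, payload))   # [10, 20, 30, 40]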

A quick fix is to make the function global
from multiprocessing import Pool


class Test:
    def __init__(self, x):
        self.x = x

    @staticmethod
    def test(x):
        return x**2

    def test_apply(self, list_):
        global r

        def r(x):
            return Test.test(x + self.x)

        with Pool() as p:
            l = p.map(r, list_)
        return l


if __name__ == '__main__':
    o = Test(2)
    print(o.test_apply(range(10)))

Building on @rocksportrocker's solution, it would make sense to dill both when sending and when receiving the results:
import dill
import itertools


def run_dill_encoded(payload):
    fun, args = dill.loads(payload)
    res = fun(*args)
    res = dill.dumps(res)
    return res


def dill_map_async(pool, fun, args_list,
                   as_tuple=True,
                   **kw):
    if as_tuple:
        args_list = ((x,) for x in args_list)

    it = itertools.izip(
        itertools.cycle([fun]),
        args_list)
    it = itertools.imap(dill.dumps, it)
    return pool.map_async(run_dill_encoded, it, **kw)


if __name__ == '__main__':
    import multiprocessing as mp
    import sys, os
    p = mp.Pool(4)
    res = dill_map_async(p, lambda x: [sys.stdout.write('%s\n' % os.getpid()), x][-1],
                         [lambda x: x + 1] * 10,)
    res = res.get(timeout=100)
    res = map(dill.loads, res)
    print(res)

As @penky Suresh suggested in this answer, don't use built-in keywords.
Apparently args is a built-in keyword when dealing with multiprocessing:
from concurrent.futures import ProcessPoolExecutor, as_completed


class TTS:
    def __init__(self):
        pass

    def process_and_render_items(self):
        multiprocessing_args = [{"a": "b", "c": "d"}, {"e": "f", "g": "h"}]

        with ProcessPoolExecutor(max_workers=10) as executor:
            # Using args here is fine.
            future_processes = {
                executor.submit(TTS.process_and_render_item, args)
                for args in multiprocessing_args
            }

            for future in as_completed(future_processes):
                try:
                    data = future.result()
                except Exception as exc:
                    print(f"Generated an exception: {exc}")
                else:
                    print(f"Generated data for comment process: {future}")

    # Don't use 'args' here. It seems to be a built-in keyword.
    # Changing 'args' to 'arg' worked for me.
    def process_and_render_item(arg):
        print(arg)
        # This will print {"a": "b", "c": "d"} for the first process
        # and {"e": "f", "g": "h"} for the second process.
PS: The tabs/spaces may be a bit off.

Related

How to resolve a pickle error caused by passing an instance method to concurrent.futures.ProcessPoolExecutor.submit() (multiprocessing in Python 3)?

I am really new to multiprocessing!
What I was trying to do:
Run a particular instance method, i.e. wait_n_secs() (which was slow!), as a separate process so that other processes can run on the side.
Once the instance method is done processing, we retrieve its output and use it via the shared array provided by the multiprocessing module.
Here is the code I was trying to run.
import cv2
import time
from multiprocessing import Array
import concurrent.futures
import copyreg as copy_reg
import types


def _pickle_method(m):
    if m.im_self is None:
        return getattr, (m.im_class, m.im_func.func_name)
    else:
        return getattr, (m.im_self, m.im_func.func_name)

copy_reg.pickle(types.MethodType, _pickle_method)


class Testing():
    def __init__(self):
        self.executor = concurrent.futures.ProcessPoolExecutor()
        self.futures = None
        self.shared_array = Array('i', 4)

    def wait_n_secs(self, n):
        print(f"I wait for {n} sec")
        cv2.waitKey(n*1000)
        wait_array = (n, n, n, n)
        return wait_array


def function(waittime):
    bbox = Testing().wait_n_secs(waittime)
    return bbox


if __name__ == "__main__":
    testing = Testing()
    waittime = 5
    # Not working!
    testing.futures = testing.executor.submit(testing.wait_n_secs, waittime)
    # Working!
    # testing.futures = testing.executor.submit(function, waittime)

    stime = time.time()
    while 1:
        if not testing.futures.running():
            print("Checking for results")
            testing.shared_array = testing.futures.result()
            print("Shared_array received = ", testing.shared_array)
            break
        time_elapsed = time.time() - stime
        if ((time_elapsed % 1) < 0.001):
            print(f"Time elapsed since some time = {time_elapsed:.2f} sec")
Problems I faced:
1) Error on Python 3.6:
Traceback (most recent call last):
File "C:\Users\haide\AppData\Local\Programs\Python\Python36\lib\multiprocessing\queues.py", line 234, in _feed
obj = _ForkingPickler.dumps(obj)
File "C:\Users\haide\AppData\Local\Programs\Python\Python36\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "C:\Users\haide\AppData\Local\Programs\Python\Python36\lib\multiprocessing\queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "C:\Users\haide\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py", line 356, in assert_spawning
' through inheritance' % type(obj).__name__
RuntimeError: Queue objects should only be shared between processes through inheritance
2) Error on Python 3.8:
testing.shared_array = testing.futures.result()
File "C:\Users\haide\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 437, in result
return self.__get_result()
File "C:\Users\haide\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 389, in __get_result
raise self._exception
File "C:\Users\haide\AppData\Local\Programs\Python\Python38\lib\multiprocessing\queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "C:\Users\haide\AppData\Local\Programs\Python\Python38\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot pickle 'weakref' object
As others like Amby and falviussn have previously asked.
Problem:
We get a pickling error specifically for instance methods in multiprocessing, as they are unpicklable.
Solution I tried (Partially):
The solution most mentioned is to use copy_reg to pickle the instance method.
I don't fully understand copy_reg. I have tried adding the lines of code provided by Nabeel to the top of mp.py, but I haven't gotten it to work.
(Important consideration): I am on Python 3 using copyreg, and the solutions seem to be using Python 2, as they imported copy_reg (Python 2).
(I haven't tried):
Using dill, because those solutions either were not about multiprocessing or, even if they were, they were not using the concurrent.futures module.
Workaround:
Passing a function that calls the instance method (instead of the instance method directly) to the submit() method:
testing.futures = testing.executor.submit(function,waittime)
This does work, but it does not seem like an elegant solution.
What I want:
Please guide me on how to correctly use copyreg as I clearly don't understand its workings.
Or
If it's a Python 3 issue, suggest another solution where I can pass instance methods to concurrent.futures.ProcessPoolExecutor.submit() for multiprocessing. :)
Update #1:
@Aaron Can you share example code of your solution, "passing a module level function that takes instance as an argument"?
or
Correct my mistake here:
This was my attempt. :(
Passing the instance to the module-level function along with the arguments:
inp_args = [waittime]
testing.futures = testing.executor.submit(wrapper_func,testing,inp_args)
And this was the module wrapper function I created,
def wrapper_func(ins, *args):
    ins.wait_n_secs(args)
This got me back to...
TypeError: cannot pickle 'weakref' object
We get a pickling error specifically for instance methods in multiprocessing, as they are unpicklable.
This is not true; instance methods are very much picklable in Python 3 (unless they contain local attributes, like factory functions). You get the error because some other instance attributes (specific to your code) are not picklable.
Please guide me on how to correctly use copyreg as I clearly don't understand its workings.
It's not required here.
If it's a Python 3 issue, suggest another solution where I can pass instance methods to concurrent.futures.ProcessPoolExecutor.submit() for multiprocessing. :)
It's not really a Python issue; it has to do with what data you're sending to be pickled. Specifically, all three attributes (after they are populated), self.executor, self.futures and self.shared_array, cannot be put on a multiprocessing.Queue (which ProcessPoolExecutor internally uses) and pickled.
So the problem happens because you are passing an instance method as the target function, which means that the instance, and with it all instance attributes, is implicitly pickled and sent to the other process. Since some of these attributes are not picklable, the error is raised. This is also why your workaround works: the instance attributes are not pickled there, because the target function is not an instance method. There are a couple of things you can do; the best way depends on whether there are other attributes that you need to send as well.
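To see that diagnosis in isolation, here is a small sketch (a hypothetical minimal class, not your code); a threading.Lock stands in for the unpicklable attributes:

import pickle
import threading


class Plain:
    def greet(self):
        return "hi"


p = Plain()
# Bound methods pickle fine in Python 3: the instance is pickled along with the method name.
print(pickle.loads(pickle.dumps(p.greet))())   # hi

# But one unpicklable instance attribute is enough to break it,
# because the whole instance has to be pickled too.
p.lock = threading.Lock()
try:
    pickle.dumps(p.greet)
except Exception as exc:   # the exact exception varies with the attribute and Python version
    print("pickling failed:", exc)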
Method #1
Judging from the sample code, your wait_n_secs function is not really using any instance attributes. Therefore, you can convert it into a staticmethod and pass that as the target function directly instead:
import time
from multiprocessing import Array
import concurrent.futures


class Testing():
    def __init__(self):
        self.executor = concurrent.futures.ProcessPoolExecutor()
        self.futures = None
        self.shared_array = Array('i', 4)

    @staticmethod
    def wait_n_secs(n):
        print(f"I wait for {n} sec")
        # Have your own implementation here
        time.sleep(n)
        wait_array = (n, n, n, n)
        return wait_array


if __name__ == "__main__":
    testing = Testing()
    waittime = 5
    testing.futures = testing.executor.submit(type(testing).wait_n_secs, waittime)  # Notice the type(testing)

    stime = time.time()
    while 1:
        if not testing.futures.running():
            print("Checking for results")
            testing.shared_array = testing.futures.result()
            print("Shared_array received = ", testing.shared_array)
            break
        time_elapsed = time.time() - stime
        if ((time_elapsed % 1) < 0.001):
            print(f"Time elapsed since some time = {time_elapsed:.2f} sec")
Method #2
If your instance contains attributes which would be used by the target functions (so they can't be converted to staticmethods), then you can also explicitly not pass the unpicklable attributes of the instance when pickling using the __getstate__ method. This would mean that the instance recreated inside other processes would not have all these attributes either (since we did not pass them), so do keep that in mind:
import time
from multiprocessing import Array
import concurrent.futures


class Testing():
    def __init__(self):
        self.executor = concurrent.futures.ProcessPoolExecutor()
        self.futures = None
        self.shared_array = Array('i', 4)

    def wait_n_secs(self, n):
        print(f"I wait for {n} sec")
        # Have your own implementation here
        time.sleep(n)
        wait_array = (n, n, n, n)
        return wait_array

    def __getstate__(self):
        d = self.__dict__.copy()
        # Delete all unpicklable attributes.
        del d['executor']
        del d['futures']
        del d['shared_array']
        return d


if __name__ == "__main__":
    testing = Testing()
    waittime = 5
    testing.futures = testing.executor.submit(testing.wait_n_secs, waittime)

    stime = time.time()
    while 1:
        if not testing.futures.running():
            print("Checking for results")
            testing.shared_array = testing.futures.result()
            print("Shared_array received = ", testing.shared_array)
            break
        time_elapsed = time.time() - stime
        if ((time_elapsed % 1) < 0.001):
            print(f"Time elapsed since some time = {time_elapsed:.2f} sec")

Overridden __setitem__ call works in serial but breaks in apply_async call

I've been fighting with this problem for some time now and I've finally managed to narrow down the issue and create a minimum working example.
The summary of the problem is that I have a class that inherits from dict to facilitate parsing of misc. input files. I've overridden the __setitem__ call to support recursive indexing of sections in our input file (e.g. parser['some.section.variable'] is equivalent to parser['some']['section']['variable']). This has been working great for us for over a year now, but we just ran into an issue when passing these Parser classes through a multiprocessing.apply_async call.
Shown below is the minimal working example. Obviously the __setitem__ call isn't doing anything special, but it's important that it accesses some class attribute like self.section_delimiter; this is where it breaks. It doesn't break in the initial call or in the serial function call. But when you call some_function (which doesn't do anything either) using apply_async, it crashes.
import multiprocessing as mp
import numpy as np


class Parser(dict):

    def __init__(self, file_name : str = None):
        print('\t__init__')
        super().__init__()
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter
        dict.__setitem__(self, key, value)


def some_function(parser):
    pass


if __name__ == "__main__":

    print("Initialize creation/setting")
    parser = Parser()
    parser['x'] = 1

    print("Single serial call works fine")
    some_function(parser)

    print("Parallel async call breaks on line 16?")
    pool = mp.Pool(1)
    for i in range(1):
        pool.apply_async(some_function, (parser,))
    pool.close()
    pool.join()
If you run the code above, you'll get the following output:
Initialize creation/setting
__init__
__setitem__
Single serial call works fine
Parallel async call breaks on line 16?
__setitem__
Process ForkPoolWorker-1:
Traceback (most recent call last):
File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
task = get()
File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/queues.py", line 354, in get
return _ForkingPickler.loads(res)
File "test_apply_async.py", line 13, in __setitem__
self.section_delimiter
AttributeError: 'Parser' object has no attribute 'section_delimiter'
Any help is greatly appreciated. I spent considerable time tracking down this bug and reproducing a minimal example. I would love to not only fix it, but clearly fill some gap in my understanding on how these apply_async and inheritance/overridden methods interact.
Let me know if you need any more information.
Thank you very much!
Isaac
Cause
The cause of the problem is that multiprocessing serializes and deserializes your Parser object to move its data across process boundaries. This is done using pickle. By default pickle does not call __init__() when deserializing classes. Because of this self.section_delimiter is not set when the deserializer calls __setitem__() to restore the items in your dictionary and you get the error:
AttributeError: 'Parser' object has no attribute 'section_delimiter'
Using just pickle and no multiprocessing gives the same error:
import pickle
parser = Parser()
parser['x'] = 1
data = pickle.dumps(parser)
copy = pickle.loads(data) # Same AttributeError here
Deserialization will work for an object with no items and the value of section_delimiter will be restored:
import pickle
parser = Parser()
parser.section_delimiter = "|"
data = pickle.dumps(parser)
copy = pickle.loads(data)
print(copy.section_delimiter) # Prints "|"
So in a sense you are just unlucky that pickle calls __setitem__() before it restores the rest of the state of your Parser.
Workaround
You can work around this by setting section_delimiter in __new__() and telling pickle what arguments to pass to __new__() by implementing __getnewargs__():
def __new__(cls, *args):
    self = super(Parser, cls).__new__(cls)
    self.section_delimiter = args[0] if args else "."
    return self

def __getnewargs__(self):
    return (self.section_delimiter,)
__getnewargs__() returns a tuple of arguments. Because section_delimiter is set in __new__(), it is no longer necessary to set it in __init__().
This is the code of your Parser class after the change:
class Parser(dict):

    def __init__(self, file_name : str = None):
        print('\t__init__')
        super().__init__()

    def __new__(cls, *args):
        self = super(Parser, cls).__new__(cls)
        self.section_delimiter = args[0] if args else "."
        return self

    def __getnewargs__(self):
        return (self.section_delimiter,)

    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter
        dict.__setitem__(self, key, value)
Simpler solution
The reason pickle calls __setitem__() on your Parser object is because it is a dictionary. If your Parser is just a class that happens to implement __setitem__() and __getitem__() and has a dictionary to implement those calls then pickle will not call __setitem__() and serialization will work with no extra code:
class Parser:

    def __init__(self, file_name : str = None):
        print('\t__init__')
        self.dict = { }
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter
        self.dict[key] = value

    def __getitem__(self, key):
        return self.dict[key]
So if there is no other reason for your Parser to be a dictionary, I would just not use inheritance here.
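A quick round-trip check of the composition-based Parser above (a sketch; it assumes the class definition just shown):

import pickle

parser = Parser()
parser['x'] = 1

# Unpickling restores __dict__ directly, so __setitem__() is never called during loading.
copy = pickle.loads(pickle.dumps(parser))
print(copy['x'], copy.section_delimiter)   # 1 .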

RuntimeError: Queue objects should only be shared between processes through inheritance

I'm having some trouble with ProcessPoolExecutor.
The following code is trying to find the shortest path in a WikiRace game; it gets 2 titles and navigates from one to the other.
Here is my code:
class AsyncSearch:
    def __init__(self, start, end):
        self.start = start
        self.end = end
        # self.manager = multiprocessing.Manager()
        self.q = multiprocessing.Queue()
        # self.q = self.manager.Queue()

    def _add_starting_node_page_to_queue(self):
        start_page = WikiGateway().page(self.start)
        return self._check_page(start_page)

    def _is_direct_path_to_end(self, page):
        return (page.title == self.end) or (page.links.get(self.end) is not None)

    def _add_tasks_to_queue(self, pages):
        for page in pages:
            self.q.put(page)

    def _check_page(self, page):
        global PATH_WAS_FOUND_FLAG
        logger.info('Checking page "{}"'.format(page.title))
        if self._is_direct_path_to_end(page):
            logger.info('##########\n\tFound a path!!!\n##########')
            PATH_WAS_FOUND_FLAG = True
            return True
        else:
            links = page.links
            logger.info("Couldn't find a direct path form \"{}\", "
                        "adding {} pages to the queue.".format(page.title, len(links)))
            self._add_tasks_to_queue(links.values())
            return "Couldn't find a direct path form " + page.title

    def start_search(self):
        global PATH_WAS_FOUND_FLAG
        threads = []
        logger.debug(f'Running with concurrent processes!')
        if self._add_starting_node_page_to_queue() is True:
            return True
        with concurrent.futures.ProcessPoolExecutor(max_workers=AsyncConsts.PROCESSES) as executor:
            threads.append(executor.submit(self._check_page, self.q.get()))
I'm getting the following exception:
Traceback (most recent call last):
File "c:\users\tomer smadja\appdata\local\programs\python\python36-32\lib\multiprocessing\queues.py", line 241, in _feed
obj = _ForkingPickler.dumps(obj)
File "c:\users\tomer smadja\appdata\local\programs\python\python36-32\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "c:\users\tomer smadja\appdata\local\programs\python\python36-32\lib\multiprocessing\queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "c:\users\tomer smadja\appdata\local\programs\python\python36-32\lib\multiprocessing\context.py", line 356, in assert_spawning
' through inheritance' % type(obj).__name__
RuntimeError: Queue objects should only be shared between processes through inheritance
It's weird since I'm using multiprocessing.Queue() that should be shared between the processes as mentioned by the exception.
I found this similar question but couldn't find the answer there.
I tried to use self.q = multiprocessing.Manager().Queue() instead of self.q = multiprocessing.Queue(). I'm not sure if this takes me anywhere, but the exception I'm getting is different:
Traceback (most recent call last):
File "c:\users\tomer smadja\appdata\local\programs\python\python36-32\lib\multiprocessing\queues.py", line 241, in _feed
obj = _ForkingPickler.dumps(obj)
File "c:\users\tomer smadja\appdata\local\programs\python\python36-32\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "c:\users\tomer smadja\appdata\local\programs\python\python36-32\lib\multiprocessing\process.py", line 282, in __reduce__
'Pickling an AuthenticationString object is '
TypeError: Pickling an AuthenticationString object is disallowed for security reasons
Also, when I try to use multiprocessing.Process() instead of ProcessPoolExecutor, I'm unable to finish the process once I do find a path. I set up a global variable PATH_WAS_FOUND_FLAG to stop the process initiation, but still with no success. What am I missing here?
ProcessPoolExecutor.submit(...) will not pickle multiprocessing.Queue instances, nor other shared multiprocessing.* class instances. You can do two things: one is to use a SyncManager, or you can initialize the worker with the multiprocessing.Queue instance at ProcessPoolExecutor construction time. Both are shown below.
Following is your original variation with a couple of fixes applied (see the note at the end)... with this variation, multiprocessing.Queue operations are slightly faster than in the SyncManager variation below...
global_page_queue = multiprocessing.Queue()

def set_global_queue(q):
    global global_page_queue
    global_page_queue = q


class AsyncSearch:
    def __init__(self, start, end):
        self.start = start
        self.end = end
        # self.q = multiprocessing.Queue()
        ...

    def _add_tasks_to_queue(self, pages):
        for page in pages:
            # self.q.put(page)
            global_page_queue.put(page)

    @staticmethod
    def _check_page(self, page):
        ...

    def start_search(self):
        ...
        print(f'Running with concurrent processes!')
        with concurrent.futures.ProcessPoolExecutor(
                max_workers=5,
                initializer=set_global_queue,
                initargs=(global_page_queue,)) as executor:
            f = executor.submit(AsyncSearch._check_page, self, global_page_queue.get())
            r = f.result()
            print(f"result={r}")
Following is the SyncManager variation, where queue operations are slightly slower than in the multiprocessing.Queue variation above...
import multiprocessing
import concurrent.futures


class AsyncSearch:
    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.q = multiprocessing.Manager().Queue()
        ...

    @staticmethod
    def _check_page(self, page):
        ...

    def start_search(self):
        global PATH_WAS_FOUND_FLAG
        worker_process_futures = []
        print(f'Running with concurrent processes!')
        with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
            worker_process_futures.append(executor.submit(AsyncSearch._check_page, self, self.q.get()))
            r = worker_process_futures[0].result()
            print(f"result={r}")
Note, for some shared objects, SyncManager can be anywhere from slightly to noticeably slower compared to multiprocessing.* variations. For example, a multiprocessing.Value is in shared memory whereas a SyncManager.Value is in the sync manager processes, requiring overhead to interact with it.
As an aside, unrelated to your question: your original code was calling _check_page with incorrect parameters, passing the dequeued item to self and leaving the page parameter None. I resolved this by changing _check_page to a static method and passing self explicitly.

How to initialize python watchdog pattern matching event handler

I'm using the Python Watchdog to monitor a directory for new files being created. Several different types of files are created in said directory but I only need to monitor a single file type, hence I use the Watchdog PatternMatchingEventHandler, where I specify the pattern to monitor using the patterns keyword.
To correctly execute the code under the hood (not displayed here) I need to initialize an empty dataframe in my event-handler, and I am having trouble getting this to work. If I remove the __init__ in the code below, everything works just fine btw.
I used the code in this answer as inspiration for my own.
The code I have set up looks as follows:
from watchdog.observers import Observer
from watchdog.events import PatternMatchingEventHandler
import time
import pandas as pd
import numpy as np
from multiprocessing import Pool


class HandlerEQ54(PatternMatchingEventHandler):
    def __init__(self):
        # Initializing an empty dataframe for storage purposes.
        data_54 = pd.DataFrame(columns=['Barcode', 'DUT', 'Step12', 'Step11', 'Np1', 'Np2', 'TimestampEQ54'])
        # Converting to INT for later purposes
        data_54[['Barcode', 'DUT']] = data_54[['Barcode', 'DUT']].astype(np.int64)
        self.data = data_54

    def on_created(self, event):
        if event.is_directory:
            return True
        elif event.event_type == 'created':
            # Take action here when a file is created.
            print('Found new files:')
            print(event.src_path)
            time.sleep(0.1)
            # Creating process pool to return data
            pool1 = Pool(processes=4)
            # Pass file to parsing function and return parsed result.
            result_54 = pool1.starmap(parse_eq54, [(event.src_path, self.data)])
            # returns the dataframe rather than the list of dataframes returned by starmap
            self.data = result_54[0]
            print('Data read: ')
            print(self.data)


def monitorEquipment(equipment):
    '''Uses the Watchdog package to monitor the data directory for new files.
    See the HandlerEQ54 and HandlerEQ51 classes in multiprocessing_handlers for actual monitoring code. Monitors each equipment.'''
    print('equipment')
    if equipment.upper() == 'EQ54':
        event_handler = HandlerEQ54(patterns=["*.log"])
        filepath = '/path/to/first/file/source/'

    # set up observer
    observer = Observer()
    observer.schedule(event_handler, path=filepath, recursive=True)
    observer.daemon = True
    observer.start()
    print('Observer started')

    # monitor
    try:
        while True:
            time.sleep(5)
    except KeyboardInterrupt:
        observer.unschedule_all()
        observer.stop()
        observer.join()
However, when I execute monitorEquipment I receive the following error message:
TypeError: __init__() got an unexpected keyword argument 'patterns'
Evidently I'm doing something wrong when initializing my handler class, but I'm drawing a blank as to what that is (which probably reflects my less-than-optimal understanding of classes). Can someone advise me on how to correctly initialize the empty dataframe in my HandlerEQ54 class, so I don't get this error?
Looks like you are missing the patterns argument from your __init__ method; you'll also need a super() call to the __init__ method of the parent class (PatternMatchingEventHandler) so you can pass the patterns argument upwards.
It should look something like this:
class HandlerEQ54(PatternMatchingEventHandler):
    def __init__(self, patterns=None):
        super(HandlerEQ54, self).__init__(patterns=patterns)
        ...

event_handler = HandlerEQ54(patterns=["*.log"])
or, for a more generic case and to support all of PatternMatchingEventHandler's arguments:
class HandlerEQ54(PatternMatchingEventHandler):
    def __init__(self, *args, **kwargs):
        super(HandlerEQ54, self).__init__(*args, **kwargs)
        ...

event_handler = HandlerEQ54(patterns=["*.log"])
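Putting the two pieces together, the handler might end up looking like this (a sketch that combines the question's __init__ body with the super() call above):

import numpy as np
import pandas as pd
from watchdog.events import PatternMatchingEventHandler


class HandlerEQ54(PatternMatchingEventHandler):
    def __init__(self, *args, **kwargs):
        # Pass patterns (and any other PatternMatchingEventHandler options) upwards.
        super(HandlerEQ54, self).__init__(*args, **kwargs)
        # Then do the handler-specific setup from the question:
        # an empty dataframe used for storage.
        data_54 = pd.DataFrame(columns=['Barcode', 'DUT', 'Step12', 'Step11',
                                        'Np1', 'Np2', 'TimestampEQ54'])
        data_54[['Barcode', 'DUT']] = data_54[['Barcode', 'DUT']].astype(np.int64)
        self.data = data_54


event_handler = HandlerEQ54(patterns=["*.log"])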

Python multiprocessing - How to create a function that parallelizes a for loop

If you open a Jupyter Notebook and run this:
import multiprocessing

def f(x):
    a = 3 * x
    pool = multiprocessing.Pool(processes=1)

    global g
    def g(j):
        return a * j

    return pool.map(g, range(5))

f(1)
You will get the following error:
Process ForkPoolWorker-1:
Traceback (most recent call last):
File "/Users/me/anaconda3/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/Users/me/anaconda3/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/Users/me/anaconda3/lib/python3.5/multiprocessing/pool.py", line 108, in worker
task = get()
File "/Users/me/anaconda3/lib/python3.5/multiprocessing/queues.py", line 345, in get
return ForkingPickler.loads(res)
AttributeError: Can't get attribute 'g' on <module '__main__'>
and I'm trying to understand if this is a bug or a feature.
I'm trying to get this working because in my real case f is basically an easily parallelizable for loop (only one parameter changes on each iteration), but each iteration takes a lot of time! Am I approaching the problem correctly, or is there an easier way? (Note: throughout the notebook, f itself will be called several times with different parameters.)
It works just fine if you define g outside of f.
import multiprocessing

def g(j):
    return 4 * j

def f():
    pool = multiprocessing.Pool(processes=1)
    return pool.map(g, range(5))

f()
Edit:
In the example you put in your question, the callable object would look somewhat like this:
class Calculator():
    def __init__(self, j):
        self.j = j

    def __call__(self, x):
        return self.j * x
and your function f becomes something like this:
def f(j):
    calculator = Calculator(j)
    pool = multiprocessing.Pool(processes=1)
    return pool.map(calculator, range(5))
In this case it works just fine. Hope it helped.
If you want to apply g to more arguments than only the iterator element passed by pool.map, you can use functools.partial like this:
import multiprocessing
import functools

def g(a, j):
    return a * j

def f(x):
    a = 3 * x
    pool = multiprocessing.Pool(processes=1)
    g_with_a = functools.partial(g, a)
    return pool.map(g_with_a, range(5))

f(1)
What functools.partial does is take a function and an arbitrary number of arguments (both positional and keyword) and return a new function that behaves like the one you passed in, but only takes the arguments you didn't pass to partial.
The function returned by partial can be pickled without problems, i.e. passed to pool.map, as long as you're using Python 3.
This is essentially the same as Darth Kotik described in his answer, but you don't have to implement the Calculator class yourself, as partial already does what you want.
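partial also accepts keyword arguments, so the same pattern works when the extra parameters are bound by name (a small sketch, not from the original answer):

import functools
import multiprocessing

def g(j, a=1, offset=0):
    return a * j + offset

def f(x):
    # Bind the extra parameters by keyword instead of by position.
    g_with_kwargs = functools.partial(g, a=3 * x, offset=1)
    with multiprocessing.Pool(processes=1) as pool:
        return pool.map(g_with_kwargs, range(5))

if __name__ == '__main__':
    print(f(1))   # [1, 4, 7, 10, 13]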
