Runtime error using concurrent.futures.ProcessPoolExecutor - python-3.x

I have watched many YouTube tutorials on concurrent.futures.ProcessPoolExecutor. I have also read posts on SO here and here, on GitHub and GitHubMemory, yet no luck.
Problem:
I'm getting the following runtime error:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
I admit I do not fully understand this error, since this is my very first attempt at multiprocessing in my Python code.
Here's my pseudocode:
module.py

import xyz
from multiprocessing import freeze_support
from classObject import Object  # assuming the class lives in classObject.py

def abc():
    return x

def main():
    xyz
    qwerty

if __name__ == "__main__":
    freeze_support()
    obj = Object()
    main()
classObject.py

import abcd
import concurrent.futures  # needed for ProcessPoolExecutor below

class Object(object):

    def __init__(self):
        asdf
        cvbn
        with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
            executor.map(self.function_for_multiprocess, var1, var2)
            # **** The error points at the line above. ****

    def function_for_multiprocess(self, var1, var2):
        doSomething1
        doSomething2
        self.variable = something
My class file (classObject.py) does not have the "main" guard.
Things I have tried:
Tried adding if __name__ == "__main__": and freeze_support() in classObject.py, along with renaming __init__() to main()
While doing the above, removed freeze_support() from module.py
I haven't found a solution different from the links provided above. Any insights would be greatly appreciated!
I'm using a MacBook Pro (16-inch, 2019), Processor 2.3 GHz 8-Core Intel Core i9, OS:Big Sur. I don't think that matters but just declaring it if it does.

You need to pass the arguments as a picklable object, such as a list or a tuple, and you don't need freeze_support().
Just change executor.map(self.function_for_multiprocess, var1, var2)
to executor.map(self.function_for_multiprocess, (var1, var2)):
from multiprocessing import freeze_support
import concurrent.futures

class Object(object):
    def __init__(self, var1=1, var2=2):
        with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
            executor.map(self.function_for_multiprocess, (var1, var2))

    def function_for_multiprocess(self, var):
        # executor.map calls this once per element of the (var1, var2) tuple
        print('var:', var)

def abc(x):
    return x

def main():
    print('abc:', abc(200))

if __name__ == "__main__":
    # freeze_support()
    obj = Object()
    main()

How to use Multiprocessing pool in Databricks with pandas [duplicate]

I am sorry that I can't reproduce the error with a simpler example, and my code is too complicated to post. If I run the program in an IPython shell instead of the regular Python, things work out well.
I looked up some previous notes on this problem. They were all caused by using a pool to call a function defined within a class function. But this is not the case for me.
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 313, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
I would appreciate any help.
Update: The function I pickle is defined at the top level of the module, though it calls a function that contains a nested function. That is, f() calls g(), which calls h(), which has a nested function i(), and I am calling pool.apply_async(f). f(), g(), and h() are all defined at the top level. I tried a simpler example with this pattern, though, and it works.
Here is a list of what can be pickled. In particular, functions are only picklable if they are defined at the top-level of a module.
This piece of code:
import multiprocessing as mp

class Foo():
    @staticmethod
    def work(self):
        pass

if __name__ == '__main__':
    pool = mp.Pool()
    foo = Foo()
    pool.apply_async(foo.work)
    pool.close()
    pool.join()
yields an error almost identical to the one you posted:
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 315, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
The problem is that the pool methods all use an mp.SimpleQueue to pass tasks to the worker processes. Everything that goes through the mp.SimpleQueue must be picklable, and foo.work is not picklable since it is not defined at the top level of the module.
It can be fixed by defining a function at the top level, which calls foo.work():
def work(foo):
    foo.work()

pool.apply_async(work, args=(foo,))
Notice that foo is picklable, since Foo is defined at the top level and foo.__dict__ is picklable.
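For completeness, a minimal runnable sketch of this workaround (using the same Foo and work names as above; the print is only there to show that the work ran in a worker process):

import multiprocessing as mp

class Foo():
    def work(self):
        print('work ran in', mp.current_process().name)

def work(foo):
    # Top-level wrapper: this function is picklable, and foo is passed as data.
    foo.work()

if __name__ == '__main__':
    pool = mp.Pool()
    foo = Foo()
    result = pool.apply_async(work, args=(foo,))
    result.get()   # re-raises any exception from the worker
    pool.close()
    pool.join()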
I'd use pathos.multiprocessing instead of multiprocessing. pathos.multiprocessing is a fork of multiprocessing that uses dill. dill can serialize almost anything in Python, so you are able to send a lot more around in parallel. The pathos fork also has the ability to work directly with multiple-argument functions, as you need for class methods.
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> p = Pool(4)
>>> class Test(object):
...     def plus(self, x, y):
...         return x + y
...
>>> t = Test()
>>> # x and y are assumed to be equal-length sequences of numbers defined earlier
>>> p.map(t.plus, x, y)
[4, 6, 8, 10]
>>>
>>> class Foo(object):
...     @staticmethod
...     def work(self, x):
...         return x + 1
...
>>> f = Foo()
>>> p.apipe(f.work, f, 100)
<processing.pool.ApplyResult object at 0x10504f8d0>
>>> res = _
>>> res.get()
101
Get pathos (and if you like, dill) here:
https://github.com/uqfoundation
When this problem comes up with multiprocessing, a simple solution is to switch from Pool to ThreadPool. This can be done with no change of code other than the import:
from multiprocessing.pool import ThreadPool as Pool
This works because ThreadPool shares memory with the main thread, rather than creating a new process; this means that pickling is not required.
The downside to this method is that Python isn't the greatest language at handling threads: it uses something called the Global Interpreter Lock to stay thread-safe, which can slow down some use cases here. However, if you're primarily interacting with other systems (running HTTP commands, talking with a database, writing to filesystems), then your code is likely not bound by CPU and won't take much of a hit. In fact, I've found when writing HTTP/HTTPS benchmarks that the threaded model used here has less overhead and fewer delays, as the overhead of creating new processes is much higher than the overhead of creating new threads, and the program was otherwise just waiting for HTTP responses.
So if you're processing a ton of stuff in Python userspace, this might not be the best method.
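As a rough illustration (the Fetcher class and URLs below are made up for this sketch, not taken from the question), a bound method that would fail to pickle with a process pool works unchanged with a thread pool:

from multiprocessing.pool import ThreadPool as Pool

class Fetcher:
    def fetch(self, url):
        # Stand-in for I/O-bound work (an HTTP request, a DB query, ...):
        # the GIL is released while waiting, so threads still help here.
        return len(url)

if __name__ == '__main__':
    fetcher = Fetcher()
    urls = ['https://example.com/a', 'https://example.com/b']
    with Pool(4) as pool:
        # No pickling happens: the threads share memory with the main thread.
        print(pool.map(fetcher.fetch, urls))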
As others have said, multiprocessing can only transfer Python objects to worker processes which can be pickled. If you cannot reorganize your code as described by unutbu, you can use dill's extended pickling/unpickling capabilities for transferring data (especially code data), as I show below.
This solution requires only the installation of dill and no other libraries such as pathos:
import os
from multiprocessing import Pool

import dill

def run_dill_encoded(payload):
    fun, args = dill.loads(payload)
    return fun(*args)

def apply_async(pool, fun, args):
    payload = dill.dumps((fun, args))
    return pool.apply_async(run_dill_encoded, (payload,))

if __name__ == "__main__":

    pool = Pool(processes=5)

    # async execution of lambda
    jobs = []
    for i in range(10):
        job = apply_async(pool, lambda a, b: (a, b, a * b), (i, i + 1))
        jobs.append(job)

    for job in jobs:
        print(job.get())
    print()

    # async execution of static method
    class O(object):

        @staticmethod
        def calc():
            return os.getpid()

    jobs = []
    for i in range(10):
        job = apply_async(pool, O.calc, ())
        jobs.append(job)

    for job in jobs:
        print(job.get())
I have found that I can also generate exactly that error output on a perfectly working piece of code by attempting to use the profiler on it.
Note that this was on Windows (where the forking is a bit less elegant).
I was running:
python -m profile -o output.pstats <script>
And found that removing the profiling removed the error and placing the profiling restored it. Was driving me batty too because I knew the code used to work. I was checking to see if something had updated pool.py... then had a sinking feeling and eliminated the profiling and that was it.
Posting here for the archives in case anybody else runs into it.
Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
This error will also occur if the model object passed to the async job has any built-in function inside it.
So make sure to check that the model objects being passed don't hold built-in functions. (In our case we were using the FieldTracker() function of django-model-utils inside the model to track a certain field.) Here is the link to the relevant GitHub issue.
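A quick way to track down which attribute is the unpicklable one (not part of the original answer, just a debugging sketch) is to try pickling each attribute of the instance individually:

import pickle

def find_unpicklable_attrs(obj):
    """Return (name, error) pairs for attributes of obj that cannot be pickled."""
    bad = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickling can raise several error types
            bad.append((name, repr(exc)))
    return bad

# Hypothetical usage: print(find_unpicklable_attrs(my_model_instance))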
This solution requires only the installation of dill and no other libraries such as pathos:
import dill

def apply_packed_function_for_map(packed):
    """
    Unpack dumped function as target function and call it with arguments.

    :param packed:
        a tuple of (dumped_function, item, args, kwargs)
    :return:
        result of target function
    """
    dumped_function, item, args, kwargs = packed
    target_function = dill.loads(dumped_function)
    res = target_function(item, *args, **kwargs)
    return res

def pack_function_for_map(target_function, items, *args, **kwargs):
    """
    Pack function and arguments to an object that can be sent from one
    multiprocessing.Process to another. The main problem is:
        «multiprocessing.Pool.map*» or «apply*»
        cannot use class methods or closures.
    It solves this problem with «dill».
    It works with the target function as an argument, dumps it («with dill»)
    and returns the dumped function with the arguments of the target function.
    For more performance we dump only the target function itself
    and don't dump its arguments.
    How to use (pseudo-code):

        ~>>> import multiprocessing
        ~>>> images = [...]
        ~>>> pool = multiprocessing.Pool(100500)
        ~>>> features = pool.map(
        ~...     *pack_function_for_map(
        ~...         super(Extractor, self).extract_features,
        ~...         images,
        ~...         type='png',
        ~...         **options,
        ~...     )
        ~... )
        ~>>>

    :param target_function:
        function, that you want to execute like target_function(item, *args, **kwargs).
    :param items:
        list of items for map
    :param args:
        positional arguments for target_function(item, *args, **kwargs)
    :param kwargs:
        named arguments for target_function(item, *args, **kwargs)
    :return: tuple(function_wrapper, dumped_items)
        It returns a tuple with
        * a function wrapper, that unpacks and calls the target function;
        * a list of the packed target function and its arguments.
    """
    dumped_function = dill.dumps(target_function)
    dumped_items = [(dumped_function, item, args, kwargs) for item in items]
    return apply_packed_function_for_map, dumped_items
It also works for numpy arrays.
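A concrete, runnable usage of the two helpers above might look like the following (assuming they are defined in the same module; the Squarer class is made up for illustration):

import multiprocessing

class Squarer:
    def __init__(self, offset):
        self.offset = offset

    def square(self, item):
        return (item + self.offset) ** 2

if __name__ == '__main__':
    squarer = Squarer(offset=1)
    with multiprocessing.Pool(2) as pool:
        # pool.map normally cannot take the bound method squarer.square;
        # packing it with dill works around that.
        results = pool.map(*pack_function_for_map(squarer.square, [1, 2, 3]))
    print(results)  # [4, 9, 16]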
A quick fix is to make the function global:
from multiprocessing import Pool

class Test:
    def __init__(self, x):
        self.x = x

    @staticmethod
    def test(x):
        return x**2

    def test_apply(self, list_):
        global r

        def r(x):
            return Test.test(x + self.x)

        with Pool() as p:
            l = p.map(r, list_)

        return l

if __name__ == '__main__':
    o = Test(2)
    print(o.test_apply(range(10)))
Building on @rocksportrocker's solution, it would make sense to use dill when sending and receiving the results as well.
import dill
import itertools

def run_dill_encoded(payload):
    fun, args = dill.loads(payload)
    res = fun(*args)
    res = dill.dumps(res)
    return res

def dill_map_async(pool, fun, args_list,
                   as_tuple=True,
                   **kw):
    if as_tuple:
        args_list = ((x,) for x in args_list)

    it = zip(
        itertools.cycle([fun]),
        args_list)
    it = map(dill.dumps, it)

    return pool.map_async(run_dill_encoded, it, **kw)

if __name__ == '__main__':
    import multiprocessing as mp
    import sys, os
    p = mp.Pool(4)
    res = dill_map_async(p, lambda x: [sys.stdout.write('%s\n' % os.getpid()), x][-1],
                         [lambda x: x + 1] * 10,)
    res = res.get(timeout=100)
    res = list(map(dill.loads, res))
    print(res)
As @penky Suresh has suggested in this answer, don't use built-in keywords.
Apparently args is a built-in keyword when dealing with multiprocessing:
from concurrent.futures import ProcessPoolExecutor, as_completed

class TTS:
    def __init__(self):
        pass

    def process_and_render_items(self):
        multiprocessing_args = [{"a": "b", "c": "d"}, {"e": "f", "g": "h"}]

        with ProcessPoolExecutor(max_workers=10) as executor:
            # Using args here is fine.
            future_processes = {
                executor.submit(TTS.process_and_render_item, args)
                for args in multiprocessing_args
            }

            for future in as_completed(future_processes):
                try:
                    data = future.result()
                except Exception as exc:
                    print(f"Generated an exception: {exc}")
                else:
                    print(f"Generated data for comment process: {future}")

    # Don't use 'args' here. It seems to be a built-in keyword.
    # Changing 'args' to 'arg' worked for me.
    def process_and_render_item(arg):
        print(arg)
        # This will print {"a": "b", "c": "d"} for the first process
        # and {"e": "f", "g": "h"} for the second process.
PS: The tabs/spaces may be a bit off.

call method on running process from parent process

I'm trying to write a program that interfaces with hardware via pyserial according to this diagram: https://github.com/kiyoshi7/Intrument/blob/master/Idea.gif. My problem is that I don't know how to tell the child process to run a method.
I tried reducing my problem down to the essence of what I am trying to do: calling the method request() from the main script. I just don't know how to handle two-way communication like this; in examples using a queue I just see data shared, or I can't understand the examples.
import multiprocessing
from time import sleep

class spawn:
    def __init__(self, _number, _max):
        self._number = _number
        self._max = _max
        self.Update()

    def request(self, x):
        print("{} was requested.".format(x))

    def Update(self):
        while True:
            print("Spawned {} of {}".format(self._number, self._max))
            sleep(2)

if __name__ == '__main__':
    p = multiprocessing.Process(target=spawn, args=(1, 1))
    p.start()
    sleep(5)
    p.request(2)  # here I'm trying to run the method I want
Update, thanks to Carcigenicate:
import multiprocessing
from time import sleep
from operator import methodcaller

class Spawn:
    def __init__(self, _number, _max):
        self._number = _number
        self._max = _max
        # Don't call update here

    def request(self, x):
        print("{} was requested.".format(x))

    def update(self):
        while True:
            print("Spawned {} of {}".format(self._number, self._max))
            sleep(2)

if __name__ == '__main__':
    spawn = Spawn(1, 1)  # Create the object as normal
    p = multiprocessing.Process(target=methodcaller("update"), args=(spawn,))  # Run the loop in the process
    p.start()

    while True:
        sleep(1.5)
        spawn.request(2)  # Now you can reference the "spawn"
You're going to need to rearrange things a bit. I would not do the long-running (infinite) work from the constructor. That's generally poor practice, and it is complicating things here. I would instead initialize the object, then run the loop in the separate process:
import multiprocessing
from operator import methodcaller
from time import sleep

class Spawn:
    def __init__(self, _number, _max):
        self._number = _number
        self._max = _max
        # Don't call update here

    def request(self, x):
        print("{} was requested.".format(x))

    def update(self):
        while True:
            print("Spawned {} of {}".format(self._number, self._max))
            sleep(2)

if __name__ == '__main__':
    spawn = Spawn(1, 1)  # Create the object as normal
    p = multiprocessing.Process(target=methodcaller("update"), args=(spawn,))  # Run the loop in the process
    p.start()

    spawn.request(2)  # Now you can reference the "spawn" object to do whatever you like
Unfortunately, since Process requires that its target argument be picklable, you can't just use a lambda wrapper like I originally had (whoops). I'm using operator.methodcaller to create a picklable wrapper. methodcaller("update") returns a function that calls update on whatever is given to it, then we give it spawn to call it on.
You could also create a wrapper function using def:
def wrapper():
    spawn.update()

. . .

p = multiprocessing.Process(target=wrapper)  # Run the loop in the process
But that only works if it's feasible to have wrapper as a global function. You may need to play around to find out what works best, or use a multiprocessing library that doesn't require pickleable tasks.
Note, please use proper Python naming conventions. Class names start with capitals, and method names are lowercase. I fixed that up in the code I posted.
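Since the question also asks how to handle two-way communication, here is a rough queue-based sketch (this is not from the answer above; the request/result queue names are invented for illustration). The parent puts requests on one queue, and the child services them from inside its loop and replies on another:

import multiprocessing
from time import sleep

class Spawn:
    def __init__(self, _number, _max):
        self._number = _number
        self._max = _max

    def request(self, x):
        return "{} was requested.".format(x)

    def update(self, requests, results):
        # Child-process loop: do the periodic work, but also service requests.
        while True:
            print("Spawned {} of {}".format(self._number, self._max))
            while not requests.empty():
                results.put(self.request(requests.get()))
            sleep(2)

if __name__ == '__main__':
    requests = multiprocessing.Queue()   # parent -> child
    results = multiprocessing.Queue()    # child -> parent
    spawn = Spawn(1, 1)
    p = multiprocessing.Process(target=spawn.update, args=(requests, results), daemon=True)
    p.start()
    requests.put(2)          # ask the child to run request(2)
    print(results.get())     # blocks until the child replies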

Multi-Processing to share memory between processes

I am trying to update a variable of a class by calling a method of the class from a different function, which is being run in multiple processes.
To achieve the desired result, process (p1) needs to update the variable "transactions", which should then get modified by process (p2).
I tried the code below, and I know I should use multiprocessing.Value or a Manager to achieve the desired result, but I am not sure how to do it since the variable to be updated is in another class.
Below is the code:
from multiprocessing import Process
from helper import Helper

camsource = ['a', 'b']
Pros = []

def sub(i):
    HC.trail_func(i)

def main():
    for i in camsource:
        print("Camera Thread {} Started!".format(i))
        p = Process(target=sub, args=(i,))
        Pros.append(p)
        p.start()

    # block until all the threads finish (i.e. block until all function_x calls finish)
    for t in Pros:
        t.join()

if __name__ == "__main__":
    HC = Helper()
    main()
Here is the helper code:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

class Helper():
    def __init__(self):
        self.transactions = []

    def trail_func(self, preview):
        if preview == 'a':
            self.transactions.append({"Apple": 1})
        else:
            if self.transactions[0]['Apple'] == 1:
                self.transactions[0]['Apple'] = self.transactions[0]['Apple'] + 1
        print(self.transactions)
Desired Output:
p1:
transactions = {"Apple":1}
p2:
transactions = {"Apple":2}
I've recently released this module, which can help you with your code: all data frames (data models that can hold any type of data) have locks on them, in order to solve concurrency issues. Anyway, take a look at the README file and the examples.
I've made an example here too, if you'd like to check it.
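For reference, since the question already mentions multiprocessing.Value and Manager: a minimal sketch of the Manager approach (not part of the answer above) would give Helper a manager-backed list, so that an update made in one process is visible in the next. Note that nested dicts inside a list proxy are not tracked, so the element is replaced rather than mutated in place:

from multiprocessing import Process, Manager

class Helper:
    def __init__(self, transactions):
        self.transactions = transactions  # a Manager().list() proxy

    def trail_func(self, preview):
        if preview == 'a':
            self.transactions.append({"Apple": 1})
        elif self.transactions and self.transactions[0]['Apple'] == 1:
            self.transactions[0] = {"Apple": 2}  # replace, don't mutate in place
        print(list(self.transactions))

def sub(helper, preview):
    helper.trail_func(preview)

if __name__ == "__main__":
    with Manager() as manager:
        helper = Helper(manager.list())
        for preview in ['a', 'b']:
            p = Process(target=sub, args=(helper, preview))
            p.start()
            p.join()  # run sequentially so 'a' is appended before 'b' checks it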

Pickle can't pickle _thread.lock objects

I'm trying to use pickle to save one of my objects but I face this error when trying to dump it:
TypeError: can't pickle _thread.lock objects
It is not clear to me, because I'm not using any locks inside my code. I tried to reproduce this error:
import threading
from time import sleep
import pickle

class some_class:
    def __init__(self):
        self.a = 1
        thr = threading.Thread(target=self.incr)
        self.lock = threading.Lock()
        thr.start()

    def incr(self):
        while True:
            # with self.lock:
            self.a += 1
            print(self.a)
            sleep(0.5)

if __name__ == "__main__":
    a = some_class()
    val = pickle.dumps(a, pickle.HIGHEST_PROTOCOL)
    print("pickle done!")
pickle_thread.py", line 22, in
val = pickle.dumps(a, pickle.HIGHEST_PROTOCOL) TypeError: can't pickle _thread.lock objects
If I define a thread lock inside my object I can't pickle it, right?
I think the problem here is using threading.Lock, but is there any workaround for this?
Actually, in my main project, I can't find any locks, but I've used lots of modules that I can't trace. What should I look for?
Thanks.
You can try to customize the pickling method for this class by excluding unpicklable objects from the dictionary:
def __getstate__(self):
    state = self.__dict__.copy()
    del state['lock']
    return state
When unpickling, you can recreate missing objects manually, e.g.:
def __setstate__(self, state):
    self.__dict__.update(state)
    self.lock = threading.Lock()  # ???
I don't know enough about the threading module to predict if this is gonna be sufficient.
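Putting the two hooks into the some_class example from the question gives something like the sketch below (the background thread is left out to keep it short; whether a freshly created lock is the right replacement depends on what the lock protects in your real object):

import pickle
import threading

class some_class:
    def __init__(self):
        self.a = 1
        self.lock = threading.Lock()

    def __getstate__(self):
        # Drop the unpicklable lock before pickling.
        state = self.__dict__.copy()
        del state['lock']
        return state

    def __setstate__(self, state):
        # Restore the attributes and recreate a fresh lock.
        self.__dict__.update(state)
        self.lock = threading.Lock()

if __name__ == "__main__":
    a = some_class()
    data = pickle.dumps(a, pickle.HIGHEST_PROTOCOL)
    b = pickle.loads(data)
    print(b.a, type(b.lock))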

Python - How can I implement a 'stoppable' thread?

There is a solution posted here to create a stoppable thread. However, I am having some problems understanding how to implement this solution.
Using the code...
import threading

class StoppableThread(threading.Thread):
    """Thread class with a stop() method. The thread itself has to check
    regularly for the stopped() condition."""

    def __init__(self):
        super(StoppableThread, self).__init__()
        self._stop_event = threading.Event()

    def stop(self):
        self._stop_event.set()

    def stopped(self):
        return self._stop_event.is_set()
How can I create a thread that runs a function that prints "Hello" to the terminal every second, and then after 5 seconds use .stop() to stop the looping function/thread?
Again, I am having trouble understanding how to implement this stopping solution; here is what I have so far.
import threading
import time

class StoppableThread(threading.Thread):
    """Thread class with a stop() method. The thread itself has to check
    regularly for the stopped() condition."""

    def __init__(self):
        super(StoppableThread, self).__init__()
        self._stop_event = threading.Event()

    def stop(self):
        self._stop_event.set()

    def stopped(self):
        return self._stop_event.is_set()

def funct():
    while not testthread.stopped():
        time.sleep(1)
        print("Hello")

testthread = StoppableThread()
testthread.start()
time.sleep(5)
testthread.stop()
The code above creates the thread testthread, which can be stopped by the testthread.stop() command. From what I understand this is just creating an empty thread... Is there a way I can create a thread that runs funct() so that the thread ends when I use .stop()? Basically, I do not know how to implement the StoppableThread class to run the funct() function as a thread.
Example of a regular threaded function...
import threading
import time

def example():
    x = 0
    while x < 5:
        time.sleep(1)
        print("Hello")
        x = x + 1

t = threading.Thread(target=example)
t.start()
t.join()
# example of a regular threaded function.
There are a couple of problems with how you are using the code in your original example. First of all, you are not passing any constructor arguments to the base constructor. This is a problem because, as you can see in the plain-Thread example, constructor arguments are often necessary. You should rewrite StoppableThread.__init__ as follows:
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self._stop_event = threading.Event()
Since you are using Python 3, you do not need to provide arguments to super. Now you can do
testthread = StoppableThread(target=funct)
This is still not an optimal solution, because funct uses an external variable, testthread to stop itself. While this is OK-ish for a tiny example like yours, using global variables like that normally causes a huge maintenance burden and you don't want to do it. A much better solution would be to extend the generic StoppableThread class for your particular task, so you can access self properly:
class MyTask(StoppableThread):
    def run(self):
        while not self.stopped():
            time.sleep(1)
            print("Hello")

testthread = MyTask()
testthread.start()
time.sleep(5)
testthread.stop()
If you absolutely do not want to extend StoppableThread, you can use the current_thread function in your task in preference to reading a global variable:
from threading import current_thread

def funct():
    while not current_thread().stopped():
        time.sleep(1)
        print("Hello")

testthread = StoppableThread(target=funct)
testthread.start()
time.sleep(5)
testthread.stop()
I found an implementation of a stoppable thread that does not rely on you checking, inside the thread, whether it should continue to run; it "injects" an exception into the wrapped function. That will work as long as you don't do something like:
while True:
    try:
        do something
    except:
        pass
Definitely worth looking at!
See: https://github.com/kata198/func_timeout
Maybe I will extend my wrapt_timeout_decorator with such a mechanism, which you can find here: https://github.com/bitranox/wrapt_timeout_decorator
Inspired by the above solution, I created a small library, ants, for this problem.
Example
from ants import worker

@worker
def do_stuff():
    ...
    # thread code
    ...

do_stuff.start()
...
do_stuff.stop()
In the above example, do_stuff will run in a separate thread, being called in a while 1: loop.
You can also have triggering events; e.g., in the above, replace do_stuff.start() with do_stuff.start(lambda: time.sleep(5)) and you will have it trigger every 5th second.
The library is very new and work is ongoing on GitHub: https://github.com/fa1k3n/ants.git
