How to run a function in a pool with different kwargs - python-3.x

For example, given this function:
def bloop(age=10, year=2030):
    pass
how can I call it concurrently with ThreadPool (or something similar), passing different kwargs to each call? In pseudocode:
jobs = [{"age": 55, "year": 2055}, {"age": 60, "year": 2060}]
return pool.map(bloop, jobs)
Thanks for your help.
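One possible approach, sketched under the assumption that each job is a dict of keyword arguments: pool.map passes exactly one positional argument to the callable, so a small wrapper can unpack the dict. The helper name call_with_kwargs is hypothetical, and bloop is given a placeholder body that returns its arguments so the result is visible.

```python
from multiprocessing.pool import ThreadPool


def bloop(age=10, year=2030):
    # Placeholder body (the function in the question just does `pass`).
    return (age, year)


def call_with_kwargs(kwargs):
    # Hypothetical helper: pool.map hands over one item per call,
    # so the dict is unpacked into keyword arguments here.
    return bloop(**kwargs)


jobs = [{"age": 55, "year": 2055}, {"age": 60, "year": 2060}]

with ThreadPool(processes=2) as pool:
    # map() returns results in the same order as the input jobs.
    results = pool.map(call_with_kwargs, jobs)

print(results)
```

The same wrapper should also work with concurrent.futures.ThreadPoolExecutor and its map() method, if that pool is preferred.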

Related

Multiprocessing.Pool: can not iterate over IMapIterator object in AWS Batch because of PicklingError

I need to request a large volume of data from an API endpoint, and I want to use multiprocessing (rather than multithreading, because of company framework limitations).
I have a multiprocessing.Pool with predefined concurrency CONCURRENCY in a class called Batcher. The class looks like this:
class Batcher:
    def __init__(self, concurrency: int = 8):
        self.concurrency = concurrency

    def _interprete_response_to_succ_or_err(self, resp: requests.Response) -> str:
        if isinstance(resp, str):
            if "Error:" in resp:
                return "dlq"
            else:
                return "err"
        if isinstance(resp, requests.Response):
            if resp.status_code == 200:
                return "succ"
            else:
                return "err"

    def _fetch_dat_data(self, id: str) -> requests.Response:
        try:
            resp = requests.get(API_ENDPOINT)
            return resp
        except Exception as e:
            return f"ID {id} -> Error: {str(e)}"

    def _dispatch_batch(self, batch: list) -> dict:
        pool = MPool(self.concurrency)
        results = pool.imap(self._fetch_dat_data, batch)
        pool.close()
        pool.join()
        return results

    def _run_batch(self, id):
        return self._dispatch_batch(id)

    def start(self, id_list: list):
        """ In real class, this function will create smaller
        batches from bigger chunks of data """
        results = self._run_batch(id_list)
        print(
            [
                res.text
                for res in results
                if self._interprete_response_to_succ_or_err(res) == "succ"
            ]
        )
The class is used from a script like this:
if __name__ == "__main__":
    """
    the source of ids is a csv file with single column in s3 that contains list
    of columns with single id per line
    """
    id_list = boto3_get_object_body(my_file_name).decode().split("\n")  # custom function, works
    batcher = Batcher()
    batcher.start(id_list)
This script is part of an AWS Batch job that is triggered via the CLI. The same function runs perfectly on my local machine, with the same environment as in AWS Batch. In AWS Batch it throws
_pickle.PicklingError: Can't pickle <class 'boto3.resources.factory.s3.ServiceResource'>: attribute lookup s3.ServiceResource on boto3.resources.factory failed
on the line where I try to iterate over the IMapIterator object results returned by pool.imap().
Relevant Traceback:
    for res in results
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'boto3.resources.factory.s3.ServiceResource'>: attribute lookup s3.ServiceResource on boto3.resources.factory failed
I am wondering whether I am missing something blatantly obvious, or whether this issue is related to the EC2 instance spun up by the Batch job. At this point I would appreciate any kind of lead for root cause analysis.
This error happens because multiprocessing could not import the relevant datatype needed to duplicate data or call the target function in the new process it started. This usually happens when an object the target function needs is created somewhere the child process does not know about (for example, a class created inside the if __name__ == ... block of the main module), or when the object's __qualname__ attribute has been tampered with (you might see this with something like functools.wraps, or with monkey-patching in general).
Therefore, to actually "fix" this, you need to dig into your code and check whether either of the above applies. A good place to start is the class that raises the issue (in this case boto3.resources.factory.s3.ServiceResource): can you import it in the main module before the if __name__ ... block runs?
However, most of the time you can get away with simply reducing the data required to start the target function (less data = fewer chances for faults to occur). In this case, the target function you are calling in the pool is an instance method. To start this function in a new process, multiprocessing needs to pickle all the instance attributes, which might have their own instance attributes, and so on. Not only does this add overhead, the problem may also lie in one particular instance attribute. Therefore, just as good practice, if your target function can run independently but is currently an instance method, change it to a staticmethod instead.
In this case, that would mean changing _fetch_dat_data to a staticmethod and submitting it to the pool as type(self)._fetch_dat_data instead.
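To illustrate the difference, here is a minimal sketch that uses pickle directly instead of starting worker processes. It assumes a threading.Lock as a stand-in for an unpicklable attribute such as a boto3 resource; the method names fetch_bound and fetch_static are hypothetical and not part of the original class.

```python
import pickle
import threading


class Batcher:
    def __init__(self):
        # Assumption: a lock stands in for an unpicklable attribute
        # (for example a boto3 resource held on the instance).
        self.resource = threading.Lock()

    def fetch_bound(self, item_id):
        # Instance method: pickling it also pickles `self`.
        return f"fetched {item_id}"

    @staticmethod
    def fetch_static(item_id):
        # Static method: pickled by reference, no instance state involved.
        return f"fetched {item_id}"


batcher = Batcher()

try:
    pickle.dumps(batcher.fetch_bound)
    bound_ok = True
except (TypeError, pickle.PicklingError):
    # The lock inside the instance cannot be pickled.
    bound_ok = False

# The static function round-trips without touching the instance.
static_fn = pickle.loads(pickle.dumps(type(batcher).fetch_static))

print(bound_ok, static_fn("42"))
```

The bound method drags the whole instance (and its unpicklable attribute) into the pickle, while the static function is sent by name only, which is why submitting type(self)._fetch_dat_data to the pool avoids the problem.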

asyncio wait on multiple tasks with timeout and cancellation

I have some code that runs multiple tasks in a loop like this:
done, running = await asyncio.wait(running, timeout=timeout_seconds,
                                   return_when=asyncio.FIRST_COMPLETED)
I need to be able to determine which of these timed out. According to the documentation:
Note that this function does not raise asyncio.TimeoutError. Futures or Tasks that aren’t done when the timeout occurs are simply returned in the second set.
I could use wait_for() instead, but that function only accepts a single awaitable, whereas I need to specify multiple. Is there any way to determine which one from the set of awaitables I passed to wait() was responsible for the timeout?
Alternatively, is there a way to use wait_for() with multiple awaitables?
You can try the following trick, although it is probably not a great solution:
import asyncio

async def foo():
    return 42

async def need_some_sleep():
    await asyncio.sleep(1000)
    return 42

async def coro_wrapper(coro):
    # wait_for raises asyncio.TimeoutError inside the task that timed out
    result = await asyncio.wait_for(coro(), timeout=10)
    return result

async def main():
    # asyncio.wait() needs Tasks; bare coroutines are rejected since Python 3.11
    tasks = [asyncio.create_task(coro_wrapper(foo)),
             asyncio.create_task(coro_wrapper(need_some_sleep))]
    done, running = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for item in done:
        print(item.result())
    print(done, running)

asyncio.run(main())
Here is how I do it:
done, pending = await asyncio.wait(
    {
        asyncio.create_task(task, name=index)
        for index, task in enumerate([
            my_coroutine(),
            my_coroutine(),
            my_coroutine(),
        ])
    },
    return_when=asyncio.FIRST_COMPLETED,
)
# Task names are stored as strings, so compare against "2", not 2
num = next(t.get_name() for t in done)
if num == "2":
    pass
Use enumerate to name the tasks as they are created.
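Combining the two ideas above, a minimal sketch of how the timed-out tasks could be identified: asyncio.wait() returns unfinished tasks in its second set, so naming each task makes it possible to report which ones exceeded the timeout. The coroutine work, the helper run_with_timeout and the "job-N" names are assumptions made for illustration.

```python
import asyncio


async def work(delay):
    # Hypothetical workload: sleep for `delay` seconds, then return it.
    await asyncio.sleep(delay)
    return delay


async def run_with_timeout(delays, timeout):
    tasks = {
        asyncio.create_task(work(d), name=f"job-{i}")
        for i, d in enumerate(delays)
    }
    # Tasks still running when the timeout expires end up in `pending`.
    done, pending = await asyncio.wait(tasks, timeout=timeout)

    timed_out = sorted(t.get_name() for t in pending)

    # Cancel the stragglers and wait for the cancellation to finish.
    for t in pending:
        t.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    finished = {t.get_name(): t.result() for t in done}
    return finished, timed_out


finished, timed_out = asyncio.run(run_with_timeout([0.01, 5, 0.02], timeout=0.5))
print(finished, timed_out)
```

Here the default return_when=ALL_COMPLETED is used, so the call returns either when everything has finished or when the timeout expires, whichever comes first.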

Can a task re-schedule itself with APScheduler?

I am trying to set up a delayed task whose timing will depend on several parameters (either passed to it or obtained from a redis database).
The pseudocode would look like this:
def main():
    scheduler = BackgroundScheduler()
    scheduler.add_job(delayed_task,
                      id=task_id,
                      next_run_time=somedate,
                      args=(task_id, some_data))
    scheduler.start()
    do_something_else()

def delayed_task(id, passed_data):
    rd = connect_to_redis()
    redis_data = rd.fetch_data(id)
    publish_data(passed_data, redis_data)
    updated_run_time = parse(redis_data)
    # obtain a scheduler object here
    scheduler.modify_job(id, next_run_time=updated_run_time)
The question is the following: is there a way to access the scheduler from a task?
The scheduler cannot be passed as parameter to the task, as this will raise
TypeError: can't pickle _thread.lock objects
For the same reason, I can't put all this in a class and call the task as a method: the method's arguments include self, which is the instance holding the scheduler, so this results in the same issue.
Is it possible to obtain the scheduler instance from within the task, the same way I can open a new connection to redis?

Retrieve all chained task results by id separately in Celery

I'm trying to retrieve the results of all chained tasks in celery that's stored in the mysql result backend.
For example, I have the following two celery tasks,
@celery.task(name='celery_fl.add')
def add(x, y, value=None):
    if value is None:
        try:
            return x + y
        except TypeError:
            return None
    return value

@celery.task(name='celery_fl.mul')
def mul(x, y, value=None):
    if value is None:
        try:
            return x * y
        except TypeError:
            return None
    return value
and here is how I chain them,
parent = (add.s(2, 2) | mul.s(8)).apply_async()
Here the output of parent.get() will be the result of the final chained task. parent.parent.get() will give me the output of the first chained task.
What I'm trying to achieve is to get the same output using the task id at a later stage.
task_id = 'bc5fc4b1-613e-4ef0-b5c8-900999d9a6f1'
parent = AsyncResult(task_id, app=celery)
Say that the task_id I have belongs to the second task in the chain (the parent). Then I should get the result of the first chained task if I type parent.parent.get(), but somehow I get None as the value. Is there another way I should be getting the task by task_id instead of AsyncResult()?
When using a mysql backend to store results, the results of each chained task are stored separately. But the task instance is no longer available, and without it it is not possible to retrieve the results of the sub-tasks through the main task (Ref - Celery tasks).
So in order to retrieve the results of all tasks, the task ID of each task should be stored somewhere in the database.
An example using Flask (Python):
chain = (s3_init.s(order.name, order.id) | create_order_sheet.s(order.id, order.name) | create_order_info.s(order.id, order.name))
res = chain()
process = {
    's3_init': res.parent.parent.parent.parent.parent.parent.id,
    'order_sheet': res.parent.parent.parent.parent.id,
    'order_info': res.parent.parent.parent.id
}
order.update(process_id=json.dumps(process))
Then you can simply get the task IDs from the database and use celery.result.AsyncResult(task_id) to retrieve each task by ID (ref - Async results).
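As a rough sketch of the bookkeeping step: at dispatch time the result object still carries its .parent links, so the IDs can be collected by walking them before they are persisted. The example below deliberately uses a hypothetical stand-in class, FakeResult, with only the id and parent attributes, so it runs without a broker; with Celery you would pass the result returned by apply_async() instead. The helper name collect_chain_ids is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResult:
    # Hypothetical stand-in for celery.result.AsyncResult:
    # only the `id` and `parent` attributes are modelled.
    id: str
    parent: Optional["FakeResult"] = None


def collect_chain_ids(result):
    """Return the task IDs of a chain, first task first."""
    ids = []
    node = result
    while node is not None:
        ids.append(node.id)
        node = getattr(node, "parent", None)
    ids.reverse()  # walking .parent goes from the last task back to the first
    return ids


first = FakeResult("id-add")
last = FakeResult("id-mul", parent=first)
chain_ids = collect_chain_ids(last)
print(chain_ids)
```

The resulting list can then be serialized (for example with json.dumps) and stored next to the record it belongs to.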
Here is a solution that gets the topmost parent:
from celery import chain, Celery

app = Celery("my-tasks")

@app.task
def run_task(item_id):
    res = chain(
        long_task_1.s(item_id),
        long_task_2.s(),
        long_task_3.s(),
    ).delay()
    # Walk up the chain until the first (topmost) task is reached
    while getattr(res, "parent", None):
        res = res.parent
    item = MyItem.objects.get(id=item_id)
    item.celery_root_task_id = res.id
    item.save()
    return res
Then, you can retrieve all the children later with:
from celery.result import AsyncResult

# Use the root task id that was saved above
root_result = AsyncResult(obj.celery_root_task_id, app=app)
task_results = ", ".join(
    [
        f"{t._cache.get('task_name')}: {t.status}"
        for t, _ in root_result.collect()
    ]
)

GIFs shown inside a label don't update regularly in PyQt5

I have a loading widget that consists of two labels: one is the status label and the other is the label in which the animated gif is shown. If I call the show() method before the heavy stuff gets processed, the gif in the loading widget doesn't update at all. There's nothing wrong with the gif itself, by the way (no looping problems etc.). The main (calling) code looks like this:
self.loadingwidget = LoadingWidgetForm()
self.setCentralWidget(self.loadingwidget)
self.loadingwidget.show()
...
...
heavy stuff
...
...
self.loadingwidget.hide()
The widget class:
class LoadingWidgetForm(QWidget, LoadingWidget):
    def __init__(self, parent=None):
        super().__init__(parent=parent)
        self.setupUi(self)
        self.setWindowFlags(self.windowFlags() | Qt.FramelessWindowHint)
        self.setAttribute(Qt.WA_TranslucentBackground)
        pince_directory = SysUtils.get_current_script_directory()  # returns current working directory
        self.movie = QMovie(pince_directory + "/media/loading_widget_gondola.gif", QByteArray())
        self.label_Animated.setMovie(self.movie)
        self.movie.setScaledSize(QSize(50, 50))
        self.movie.setCacheMode(QMovie.CacheAll)
        self.movie.setSpeed(100)
        self.movie.start()
        self.not_finished = True
        self.update_thread = Thread(target=self.update_widget)
        self.update_thread.daemon = True

    def showEvent(self, QShowEvent):
        QApplication.processEvents()
        self.update_thread.start()

    def hideEvent(self, QHideEvent):
        self.not_finished = False

    def update_widget(self):
        while self.not_finished:
            QApplication.processEvents()
As you can see, I tried creating a separate thread to avoid the workload, but it didn't make any difference. Then I tried my luck with the QThread class by overriding the run() method, but that didn't work either. Executing QApplication.processEvents() inside the heavy stuff does work well, though. I also think I shouldn't be using separate threads; I feel there should be a more elegant way to do this. The widget looks like this, by the way:
[image: the loading widget, with the status label reading "Processing..."]
[image: full version of the gif]
Thanks in advance! Have a good day.
Edit: I can't move the heavy stuff to a different thread due to bugs in pexpect. Pexpect's spawn() method requires the spawned object, and any operations on it, to stay in the same thread. I don't want to change the workflow of the whole program.
In order to update GUI animations, the main Qt loop (located in the main GUI thread) has to be running and processing events. The Qt event loop can only process a single event at a time; however, because handling an event typically takes very little time, control returns to the loop quickly. This allows GUI updates (repaints, including animation etc.) to appear smooth.
A common example is having a button that initiates loading of a file. The button press creates an event which is handled and passed off to your code (either via events directly, or via signals). Now the main thread is in your long-running code and the event loop is stalled, and it will stay stalled until the long-running job (e.g. the file load) is complete.
You're correct that you can solve this with threads, but you've gone about it backwards. You want to put your long-running code in a thread (not your call to processEvents). In fact, calling (or interacting with) the GUI from another thread is a recipe for a crash.
The simplest way to work with threads is to use QRunnable and QThreadPool. This allows for multiple execution threads. The following wall of code gives you a custom Worker class that makes it simple to handle this. I normally put this in a file threads.py to keep it out of the way:
import sys
import traceback

from PyQt5.QtCore import QObject, QRunnable, pyqtSignal, pyqtSlot


class WorkerSignals(QObject):
    '''
    Defines the signals available from a running worker thread.

    finished
        No data
    error
        `tuple` (exctype, value, traceback.format_exc())
    result
        `dict` data returned from processing
    '''
    finished = pyqtSignal()
    error = pyqtSignal(tuple)
    result = pyqtSignal(dict)


class Worker(QRunnable):
    '''
    Worker thread

    Inherits from QRunnable to handle worker thread setup, signals and wrap-up.

    :param fn: The function callback to run on this worker thread. Supplied args and
               kwargs will be passed through to the runner.
    :type fn: function
    :param args: Arguments to pass to the callback function
    :param kwargs: Keywords to pass to the callback function
    '''

    def __init__(self, fn, *args, **kwargs):
        super(Worker, self).__init__()
        # Store constructor arguments (re-used for processing)
        self.fn = fn
        self.args = args
        self.kwargs = kwargs
        self.signals = WorkerSignals()

    @pyqtSlot()
    def run(self):
        '''
        Initialise the runner function with passed args, kwargs.
        '''
        # Retrieve args/kwargs here; and fire processing using them
        try:
            result = self.fn(*self.args, **self.kwargs)
        except Exception:
            traceback.print_exc()
            exctype, value = sys.exc_info()[:2]
            self.signals.error.emit((exctype, value, traceback.format_exc()))
        else:
            self.signals.result.emit(result)  # Return the result of the processing
        finally:
            self.signals.finished.emit()  # Done
To use the above, you need a QThreadPool to handle the threads. You only need to create this once, for example during application initialisation.
threadpool = QThreadPool()
Now, create a worker by passing in the Python function to execute:
from .threads import Worker # our custom worker Class
worker = Worker(fn=<Python function>) # create a Worker object
Now attach signals to get back the result, or be notified of an error:
worker.signals.error.connect(<Python function error handler>)
worker.signals.result.connect(<Python function result handler>)
Then, to execute this Worker, you can just pass it to the QThreadPool.
threadpool.start(worker)
Everything will take care of itself, with the result of the work returned to the connected signal... and the main GUI loop will be free to do its thing!
