Get current task of a process - django-viewflow

I am trying to flow through tasks pragmattically using custom viewswhich will be controlled by an api. I have figured out how to flow through a process if i pass in the a task object to a flow_function
task = Task.objects.get(id=task_id)
response = hello_world_approve(task, json=json, user=user)
#flow_func
def hello_world_approve(activation, **kwargs):
activation.process.approved = True
activation.process.save()
activation.prepare()
activation.done()
return activation
However i would like to be able to get the current task from the process object instead, like so
process = HelloWorldFlow.process_class.objects.get(id=task_id)
task = process.get_current_task()
Is this the way i should be going about this and is it possible or is there another approach i am missing?

It's available as activation.task

Related

How to inspect mapped tasks' inputs from reduce tasks in Prefect

I'm exploring Prefect's map-reduce capability as a powerful idiom for writing massively-parallel, robust importers of external data.
As an example - very similar to the X-Files tutorial - consider this snippet:
#task
def retrieve_episode_ids():
api_connection = APIConnection(prefect.context.my_config)
return api_connection.get_episode_ids()
#task(max_retries=2, retry_delay=datetime.timedelta(seconds=3))
def download_episode(episode_id):
api_connection = APIConnection(prefect.context.my_config)
return api_connection.get_episode(episode_id)
#task(trigger=all_finished)
def persist_episodes(episodes):
db_connection = DBConnection(prefect.context.my_config)
...store all episodes by their ID with a success/failure flag...
with Flow("import_episodes") as flow:
episode_ids = retrieve_episode_ids()
episodes = download_episode.map(episode_ids)
persist_episodes(episodes)
The peculiarity of my flow, compared with the simple X-Files tutorial, is that I would like to persist results for all the episodes that I have requested, even for the failed ones. Imagine that I'll be writing episodes to a database table as the episode ID decorated with an is_success flag. Moreover, I'd like to write all episodes with a single task instance, in order to be able to perform a bulk insert - as opposed to inserting each episode one by one - hence my persist_episodes task being a reduce task.
The trouble I'm having is in being able to gather the episode ID for the failed downloads from that reduce task, so that I can store the failed information in the table under the appropriate episode ID. I could of course rewrite the download_episode task with a try/catch and always return an episode ID even in the case of failure, but then I'd lose the automatic retry/failure functionality which is a good deal of the appeal of Prefect.
Is there a way for a reduce task to infer the argument(s) of a failed mapped task? Or, could I write this differently to achieve what I need, while still keeping the same level of clarity as in my example?
Mapping over a list preserves the order. This is a property you can use to link inputs with the errors. Check the code I have below, will add more explanation after.
from prefect import Flow, task
import prefect
#task
def retrieve_episode_ids():
return [1,2,3,4,5]
#task
def download_episode(episode_id):
if episode_id == 5:
return ValueError()
return episode_id
#task()
def persist_episodes(episode_ids, episodes):
# Note the last element here will be the ValueError
prefect.context.logger.info(episodes)
# We change that ValueError into a "fail" message
episodes = ["fail" if isinstance(x, BaseException) else x for x in episodes]
# Note the last element here will be the "fail"
prefect.context.logger.info(episodes)
result = {}
for i, episode_id in enumerate(episode_ids):
result[episode_id] = episodes[i]
# Check final results
prefect.context.logger.info(result)
return
with Flow("import_episodes") as flow:
episode_ids = retrieve_episode_ids()
episodes = download_episode.map(episode_ids)
persist_episodes(episode_ids, episodes)
flow.run()
The handling will largely happen in the persist_episodes. Just pass the list of inputs again and then we can match the inputs with the failed tasks. I added some handling around identifying errors and replacing them with what you want. Does that answer the question?
Always happy to chat more. You can reach out in the Prefect Slack or Discourse as well.

How to check if a similar scheduled job exists in python-rq?

Below is the function called for scheduling a job on server start.
But somehow the scheduled job is getting called again and again, and this is causing too many calls to that respective function.
Either this is happening because of multiple function calls or something else? Suggestions please.
def redis_schedule():
with current_app.app_context():
redis_url = current_app.config["REDIS_URL"]
with Connection(redis.from_url(redis_url)):
q = Queue("notification")
from ..tasks.notification import send_notifs
task = q.enqueue_in(timedelta(minutes=5), send_notifs)
Refer - https://python-rq.org/docs/job_registries/
Needed to read scheduled_job_registry and retrieve jobids.
Currently below logic works for me as I only have a single scheduled_job.
But in case of multiple jobs, I will need to loop these jobids to find the right job exists or not.
def redis_schedule():
with current_app.app_context():
redis_url = current_app.config["REDIS_URL"]
with Connection(redis.from_url(redis_url)):
q = Queue("notification")
if len(q.scheduled_job_registry.get_job_ids()) == 0:
from ..tasks.notification import send_notifs
task = q.enqueue_in(timedelta(seconds=30), send_notifs)

RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited

I have always resisted using asyncio within my code, but using it might help with some performance issues that I'm having.
Here is my scenario:
An end user provides a list of news sites to scrape
Each element is passed to an Article Class
A valid article is passed to an Extraction Class
The Extraction Class passes data to a NewsExtraction Class
90% this of the time this flow is flawless, but on an occasion one of the 12 functions in the NewsExtraction Class fails to extract data, which exist in the HTML provide. It seems that my code is "stepping on itself," which cause the data element not to be parsed. When I rerun the code all the elements are parsed correctly.
The NewsExtraction Class has this function get_article_data_elements, which is called from the Extraction Class.
The function get_article_data_elements call these items:
published_date = self._extract_article_published_date()
modified_date = self._extract_article_modified_date()
title = self._extract_article_title()
description = self._extract_article_description()
keywords = self._extract_article_key_words()
tags = self._extract_article_tags()
authors = self._extract_article_author()
top_image = self._extract_top_image()
language = self._extract_article_language()
categories = self._extract_article_category()
text = self._extract_textual_content()
url = self._extract_article_url()
Each of these data elements are used to populate a Python Dictionary, which is eventually passed back to the End User.
I have been trying to add asyncio code to the NewsExtraction Class, but I kept getting this error message:
RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited
I have spent the last 3 days trying to figure this issue out. I have looked at dozens of questions on Stack Overflow on this error RuntimeWarning: coroutine never awaited. I have also looked at numerous articles on using asyncio, but I cannot figure out how to use asyncio with my NewsExtraction Class, which is called from the Extraction Class.
Can someone provide me some pointers to solve my issue?
class NewsExtraction(object):
"""
This class is used to extract common data elements from a news article
on xyz
"""
def __init__(self, url, soup):
self._url = url
self._raw_soup = soup
truncated...
async def _extract_article_published_date(self):
"""
This function is designed to extract the publish date for the article being parsed.
:return: date article was originally published
:rtype: string
"""
json_date_published = JSONExtraction(self._url, self._raw_soup).extract_article_published_date()
if json_date_published is not None:
if len(json_date_published) != 0:
return json_date_published
else:
return None
elif json_date_published is None:
if self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")}):
date_published = self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")})
if len(date_published) != 0:
return date_published.text
else:
logger.info('The HTML tag to extract the publish date for the following article was not found.')
logger.info(f'Article URL -- {self._url}')
return None
truncated...
async def get_article_data_elements(self):
"""
This function is designed to extract all the common data elements from a
news article on xyz.
:return: dictionary of data elements related to the article
:rtype: dict
"""
article_data_elements = {}
# I have tried this:
published_date = self._extract_article_published_date().__await__()
# and this
published_date = self.task(self._extract_article_published_date())
await published_date
truncated...
I have also tried to use:
if __name__ == "__main__":
asyncio.run(NewsExtraction.get_article_data_elements())
# asyncio.run(self.get_article_data_elements())
I'm really banging my head on the wall with using asyncio in my news extraction code.
If this question is off base, I will be happy to delete it and keep reading about how to use asyncio correctly.
Can someone provide me some pointers to solve my issue?
Thanks in advance for any guidance on using asyncio
Your are defining _extract_article_published_date and get_article_data_elements as coroutines, and this coroutines must be await-ed in your code to get the result of their execution in an asynchronous way.
You can do this creating an instance of type NewsExtraction and calling this methods with the keyword await in front, this await pass the execution to other task in the loop until his awaited task completes its execution. Note that there are no threads or process involved in this task execution, the execution is passed only if it is no using cpu-time (await-ing I/O operations or sleeping).
if __name__ == '__main__':
extractor = NewsExtraction(...)
# this creates the event loop and runs the coroutine
asyncio.run(extractor.get_article_data_elements())
Inside your _extract_article_published_date you must also await your coroutines that perform requests over the network, if you are using some library for the scraping make sure that uses async/await behind the scenes to get a real performance while using asyncio.
async def get_article_data_elements(self):
article_data_elements = {}
# note here that the instance is self
published_date = await self._extract_article_published_date()
truncated...
You must dive into the asyncio documentation to get a better understanding of these features of Python 3.7+.

Multiple MongoDB database triggers in same Python module

I am trying to implement multiple database triggers for my MongoDB (connected through Pymongo to my Python code).
I am able to successfully implement a database trigger but not able to extend this to multiple ones.
The code for the single database trigger can be found below:
try:
resume_token = None
pipeline = [{"$match": {"operationType": "insert"}}]
with db.collection.watch(pipeline) as stream:
for insert_change in stream:
print("postprocessing logic goes here")
except pymongo.errors.PyMongoError:
logging.error("Oops")
The problem is that once a single database is implemented, the post processing code waits to receive incoming requests for that collection and am able to include other collection watches in the same module.
Any help appreciated
Create separate functions for each watch and launch them in separate threads with something like:
import queue
import threading
main_q = queue.Queue()
thread1 = threading.Thread(target=function1, args=(main_q, None))
thread1.daemon = True
thread1.start()
thread2 = threading.Thread(target=function2, args=(main_q,))
thread2.daemon = True
thread2.start()

Can a task re-schedule itself with APScheduler?

I am trying to set up a delayed task task whose timing will depend on several parameters (either passed to it or obtained from a redis database).
the pseudocode would look like this:
def main():
scheduler = BackgroundScheduler()
scheduler.add_job(delayed_task,
id=task_id,
next_run_time=somedate,
args=(task_id, some_data))
scheduler.start()
do_something_else()
def delayed_task(id, passed_data):
rd = connect_to_redis()
redis_data = rd.fetch_data(id)
publish_data(passed_data, redis_data)
updated_run_time = parse(redis_data)
#obtain a scheduler object here
scheduler.modify_job(id, next_run_time=updated_run_time)
The question is the following: is there a way to access the scheduler from a task?
The scheduler cannot be passed as parameter to the task, as this will raise
TypeError: can't pickle _thread.lock objects
For the same reason, I can't put all this in a class and have it called a method since the arguments of the method include self, which is the class containing the scheduler, and will thus result in the same issue.
Is it possible to regain an instance of the scheduler from outside, like I can generate a new connexion to redis?

Resources