I am currently using Task.delay() in celery (with RabbitMQ) to perform asynchronous tasks. The tasks are distributed to 8 workers, and in these tasks, I am performing a database insertion operation.
Sometimes a task will fail, and the database insertion does not happen. This can be due to a timeout or a JSON decode error. I want to capture that an execution has failed.
Here is the relevant code:
views.py
def celery_url_filtering(request):
    for each_data in dataset:
        # each_data is a JSON object
        res = result.delay(each_data)
        while res.status == 'PENDING':
            pass
    return JsonResponse({'ok': 'Success'})
tasks.py
@app.task
def result(each_data):
    # Parse each_data and do the data insertion here
    return "Something"
How can I capture the failed executions in a list?
From what I understand, you want to look at the "PENDING" or "FAILURE" tasks and retry them or apply some application logic.
If so, you can have a cron job running on a fixed schedule (daily, hourly, etc., based on your requirement). This cron job can capture the tasks that failed in the last day/hour, depending on the schedule you choose.
You can use django-celery-beat for setting the cron job and django-celery-results to store celery task results using Django ORM.
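The usual setup for those two packages is a couple of lines in your Django settings; here is a minimal sketch, assuming the standard CELERY_ settings namespace (run python manage.py migrate afterwards to create the result tables):

# settings.py (sketch)
INSTALLED_APPS = [
    # ...
    'django_celery_results',
    'django_celery_beat',
]

# store task results via the Django ORM
CELERY_RESULT_BACKEND = 'django-db'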
For example, you can have a Celery task like this:
tasks.py
from celery import shared_task
from django_celery_results.models import TaskResult
@shared_task(name="failed_task_cron", bind=True)
def failed_task_cron(self, **kwargs):
    """
    Celery task to run on a daily schedule to do something with the failed tasks
    """
    tasks = TaskResult.objects.filter(status='FAILURE')
    # tasks is a queryset of all the failed tasks
    # Perform application logic
You can set the schedule for the above task like this in your Celery settings:
from celery.schedules import crontab
# ...
CELERY_BEAT_SCHEDULE = {
    "failed_task_cron": {
        "task": "failed_task_cron",      # matches the name given in the task decorator
        "schedule": crontab(minute=0),   # runs at the start of every hour
    },
}
It seems like the task_run_id becomes the run_id for Databricks jobs (Azure) that only have 1 task, whereas the job_run_id becomes the run_id for Databricks jobs that have 2 or more tasks. My question is: Where can I find the job_run_id for Databricks jobs (Azure) that only contain 1 task?
The task_run_id might simply be an indication that a particular job contains only a single task. Whether it is the job_run_id or the task_run_id, both indicate the run_id for a job run.
You can verify this using the job runs list that can be obtained via the Jobs API. Use the following code:
import requests
import json

# call the Jobs API runs/list endpoint with a personal access token
auth = {"Authorization": "Bearer <databricks_access_token>"}
response = requests.get('https://<databricks_workspace_url>/api/2.0/jobs/runs/list', headers=auth).json()
print(json.dumps(response, indent=2))
You can navigate to your job run and check that the job's run_id and the task_run_id are the same.
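If you want to check this programmatically instead of through the UI, here is a small sketch that prints the job and run IDs from the response above, assuming the standard Jobs API 2.0 response shape with a top-level runs array:

# list the job_id / run_id pairs returned by runs/list
for run in response.get('runs', []):
    print(run['job_id'], run['run_id'])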
I am new to Celery, and I would like advice on how best to use Celery to accomplish the following.
Suppose I have ten large datasets. I realize that I can use Celery to do work on each dataset by submitting ten tasks. But suppose that each dataset consists of 1,000,000+ text documents stored in a NoSQL database (Elasticsearch in my case). The work is performed at the document level. The work could be anything - maybe counting words.
For a given dataset, I need to start the dataset-level task. The task should read documents from the data store. Then workers should process the documents - a document-level task.
How can I do this, given that the task is defined at the dataset level, not the document level? I am trying to move away from using a JoinableQueue to store documents and submit them for work with multiprocessing.
I have read that it is possible to use multiple queues in Celery, but it is not clear to me whether that is the best approach.
Let's see if this helps. You can define a workflow, add tasks to it, and then run the whole thing after building up your tasks. You can have normal Python methods return tasks that can be added into Celery primitives (chain, group, chord, etc.); see here for more info. For example, let's say you have two tasks that process documents for a given dataset:
def some_task(documents):
    return dummy_task.si(documents)

def some_other_task(documents):
    return dummy_task.si(documents)

@celery.task(bind=True)
def dummy_task(self, documents, *args, **kwargs):
    return True
You can then provide a task that generates the subtasks like so:
from celery import chain

@celery.task()
def dataset_workflow(*args, **kwargs):
    # get_datasets and get_documents are your own data-access helpers
    datasets = get_datasets(*args, **kwargs)
    workflows = []
    for dataset in datasets:
        documents = get_documents(dataset)
        workflow = chain(some_task(documents), some_other_task(documents))
        workflows.append(workflow)
    run_workflows = chain(*workflows).apply_async()
Keep in mind that generating a lot of tasks can consume a lot of memory on the Celery workers, so throttling or breaking the task generation up might be needed as you start to scale your workloads.
Additionally, you can put the document-level tasks on a different queue than your workflow task if needed, based on resource constraints etc.
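For that queue separation, a possible sketch using Celery task routing (the queue name and the tasks.dummy_task path are just examples for this setup):

# route the document-level task to its own queue; start a dedicated worker with:
#   celery -A yourapp worker -Q documents
celery.conf.task_routes = {
    'tasks.dummy_task': {'queue': 'documents'},
}

# or set the queue per signature when building the workflow:
# chain(some_task(documents).set(queue='documents'),
#       some_other_task(documents).set(queue='documents'))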
I have a DAG with one task that fetches data from an API. I want that task to fetch the data only for a certain time interval and then mark itself as SUCCESS so that the tasks after it start running.
Please note that the tasks below are dependent on the task that I want to mark as SUCCESS. I know I can mark the task as SUCCESS manually from the CLI or UI, but I want to do it automatically.
Is it possible to do that programmatically using python in Airflow?
You can set the status of a task using Python code, like this:
from airflow.models import TaskInstance
from airflow.utils.state import State

def set_task_status(**kwargs):
    execution_date = kwargs['execution_date']
    # HiveOperatorTest is the task whose state is being set
    ti = TaskInstance(HiveOperatorTest, execution_date)
    ti.set_state(State.SUCCESS)
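A possible way to call this from the DAG is via a PythonOperator (a sketch for older Airflow 1.x, where provide_context=True puts execution_date into kwargs; the task_id is just an example):

from airflow.operators.python_operator import PythonOperator

mark_success = PythonOperator(
    task_id='set_task_status',        # example task_id
    python_callable=set_task_status,
    provide_context=True,             # passes execution_date into kwargs
    dag=dag,
)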
I'm using APScheduler (3.5.3) to run three different jobs. I need to trigger the second job immediately after the completion of the first job, and I don't know the completion time of the first job. I have set the trigger type to cron and scheduled it to run every 2 hours.
One way I overcame this is by scheduling the next job at the end of each job. Is there any other way to achieve this with APScheduler?
This can be achieved using scheduler events. Check out this simplified example adapted from the documentation (not tested, but should work):
import datetime

from apscheduler.events import EVENT_JOB_EXECUTED, EVENT_JOB_ERROR

def execution_listener(event):
    if event.exception:
        print('The job crashed')
    else:
        print('The job executed successfully')
        # check that the executed job is the first job
        job = scheduler.get_job(event.job_id)
        if job.name == 'first_job':
            print('Running the second job')
            # look up the second job (assuming it's a scheduled job)
            jobs = scheduler.get_jobs()
            second_job = next((j for j in jobs if j.name == 'second_job'), None)
            if second_job:
                # run the second job immediately
                second_job.modify(next_run_time=datetime.datetime.utcnow())
            else:
                # job not scheduled, add it and run now
                scheduler.add_job(second_job_func, args=(...), kwargs={...},
                                  name='second_job')

scheduler.add_listener(execution_listener, EVENT_JOB_EXECUTED | EVENT_JOB_ERROR)
This assumes you don't know the jobs' IDs but identify them by name. If you know the IDs, the logic would be simpler.
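For completeness, a sketch of the by-ID variant (assuming the jobs were added with explicit id values; 'first_job' and 'second_job' here are example IDs):

import datetime

def execution_listener_by_id(event):
    # run the second job as soon as the first one finishes successfully
    if not event.exception and event.job_id == 'first_job':
        scheduler.get_job('second_job').modify(
            next_run_time=datetime.datetime.utcnow())

scheduler.add_listener(execution_listener_by_id,
                       EVENT_JOB_EXECUTED | EVENT_JOB_ERROR)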
I have a MySQL table tasks. In tasks, we can create a normal task or a recurring task that will automatically create a new task in the MySQL tasks table and send an email notification to the user that a task has been created. After a lot of research, I found out that you can do it in the following ways:
MySQL events
Kue, bull, agenda(node.js scheduling libraries)
Using a cron job to monitor every day for tasks
The recurring tasks would repeat weekly, daily, monthly, or yearly.
We must also provide an option to remove the recurring event at any time. What would be a nice and clean solution?
As you've identified, there are a number of ways of going about this. Here's how I would do it, but I'm making a number of assumptions, such as how many tasks you're likely to have and how flexible the system needs to be going forward.
Assuming you're unlikely to change the task time options (daily, weekly, monthly, yearly), each task would have two fields: last_run_date and next_run_date. Every time a task is run, I would update these fields and create an entry in a log table such as task_run_log, which would also store the date/time the task was run.
I would then have a cron job that fires an HTTP request to a Node.js service. This web service would look through the table of tasks, find which ones need to be executed for that day, and dispatch a message for each task into some sort of queue (AWS SQS, GCP Pub/Sub, Apache Kafka, etc.). Each message in the queue would represent a single task that needs to be carried out; workers can subscribe to this queue and process the tasks themselves. Once a worker has processed a job, it would make the log entry and update the last_run_date and next_run_date fields. If a task fails, the worker would move that message into an error queue and log a failed task in the task log.
This system would be robust, as any failed jobs would exist as failed jobs in your database and would appear in an error queue (which you can either drain to remove the failed jobs, or replay into the normal queue once the worker is fixed). It would also scale to many tasks per day, as you can scale up your workers. You also won't be flooding cron; your cron job will just send a single HTTP request each day to your HTTP service, which kicks off the processing.
You can also set up alerts based on whether the cron job runs or not to make sure the process gets kicked off properly.
I had to do something very similar; you can use the npm module node-schedule.
node-schedule has many features. You first create your rule, which determines when the job runs, and then schedule the job, which is where you define what the job does and activate it. I have an example below from my code which sets a job to run at midnight every day.
var schedule = require('node-schedule');

// run on every day of the week, at midnight (hour/minute pinned to 0)
var rule = new schedule.RecurrenceRule();
rule.dayOfWeek = [0, new schedule.Range(1, 6)];
rule.hour = 0;
rule.minute = 0;

var j = schedule.scheduleJob(rule, function(){
    sqlUpdate(server);   // my own database update function
});
This may not exactly fit all of your requirements on its own, but there are other features and setups you can use.
For example, you can cancel any job with the cancel function:
j.cancel()
You can also set start times and end times, as shown on the npm page:
let startTime = new Date(Date.now() + 5000);
let endTime = new Date(startTime.getTime() + 5000);
var j = schedule.scheduleJob({ start: startTime, end: endTime, rule: '*/1 * * * * *' }, function(){
    console.log('Time for tea!');
});
There are also other options for scheduling the date and time, since it also follows the cron format, meaning you can set dynamic times:
// '42 * * * *' is cron syntax: run at 42 minutes past every hour
var j = schedule.scheduleJob('42 * * * *', function(){
    console.log();
});
As such, this would allow Node.js to handle everything you need. You would likely need to set up a system to keep track of the scheduled jobs (the var j handles), but it would allow you to cancel them and schedule them as you desire.
It additionally allows you to reschedule jobs, retrieve the next scheduled event, and use multiple date formats.
If you need the jobs to persist after the process is turned off and on again, or reset, you will need to save the details of each job; a MySQL database would make sense here. Upon startup, the code could make a quick pull and restart all of the created tasks based on the data from the database, and when you cancel a job you just delete it from the database. It should be noted that the process needs to be running for this to work; a job will not run if the process is turned off.