I want to schedule a periodic task with Celery dynamically, at the end of another group of tasks.
I know how to create (static) periodic tasks with Celery:
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    'poll_actions': {
        'task': 'tasks.poll_actions',
        'schedule': timedelta(seconds=5),
    },
}
But I want to create periodic jobs dynamically from my tasks (and maybe have a way to stop those periodic jobs when some condition is achieved, e.g. all tasks done).
Something like:
@celery.task
def run(ids):
    group(prepare.s(id) for id in ids) | execute.s(ids) | poll.s(ids, schedule=timedelta(seconds=5))

@celery.task
def prepare(id):
    ...

@celery.task
def execute(id):
    ...

@celery.task
def poll(ids):
    # This task has to be schedulable on demand
    ...
The straightforward solution to this requires being able to add and remove beat scheduler entries on the fly. As of the answering of this question:
How to dynamically add / remove periodic tasks to Celery (celerybeat)
this was not possible. I doubt it has become available in the interim because ...
You are conflating two concepts here: the notion of "Event Driven Work" and the idea of "Batch Schedule Driven Work" (which is really just the first case, where the event happens on a schedule). If you really consider what you are doing here, you'll find a rather complex set of edge cases. Messages are distributed in nature: what happens when groups spawned from different messages start creating conflicting entries? What do you do when you find yourself under a mountain of previously scheduled cruft?
When working with messaging systems you are really looking to build recursive trees: spindles of work that do something and spawn more messages to do more things. Cycles (intended or otherwise) aside, these ultimately reach their base cases and terminate.
The answer to whatever you are actually trying to achieve lies with re-encoding your problem within the limitations of your messaging system and asynchronous work framework.
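One way to re-encode the example above is a self-rescheduling task. This is only a rough sketch, assuming a Redis broker and a hypothetical check_finished helper: poll re-enqueues itself with a countdown until every id is done, which gives the periodic behavior without touching the beat schedule.

from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # assumed broker URL

def check_finished(id):
    """Hypothetical helper: return True once the work for this id is done."""
    return False  # replace with a real status check

@app.task
def poll(ids):
    if all(check_finished(id) for id in ids):
        return 'done'                          # base case: stop re-scheduling
    # ... do one round of polling here ...
    poll.apply_async((ids,), countdown=5)      # re-enqueue itself in ~5 seconds

Kicked off once at the end of the group (e.g. ... | poll.s(ids)), the chain terminates on its own when the base case is reached, which matches the "recursive tree with a base case" framing above.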
Related
I use Celery for doing some tasks; all tasks are added with .apply_async, and my script does this automatically, depending on some external conditions. I want to get the results of the tasks not in the order they were added but in reverse.
For example, I add task1, after that task2, and after that task3, and I want Celery to perform the tasks in the following order: task1, task3, task2. (task1 first, because Celery starts it right after I add it and before I add task2, which is fine.)
How can I get this behavior?
P.S. I use Redis as a broker.
The described behavior is not possible, or at least not to a full extent. It also largely depends on the broker chosen. Basically, you want the queue to work in LIFO mode, but that is not how most message brokers work; RabbitMQ, at least, only works in FIFO mode. With RabbitMQ you can partly achieve your goal with priorities, but as already said, it's not bulletproof and would need additional logic.
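To illustrate the priority workaround, here is a minimal sketch assuming a RabbitMQ broker (with Redis, Celery's priority support is only emulated); the queue name, broker URL, and task are illustrative:

from celery import Celery
from kombu import Queue

app = Celery('tasks', broker='amqp://localhost')  # assumes RabbitMQ
app.conf.task_queues = [
    # Declare a queue that supports message priorities (a RabbitMQ feature).
    Queue('prio', queue_arguments={'x-max-priority': 10}),
]
app.conf.task_default_queue = 'prio'

@app.task
def work(name):
    return name

# Higher-priority messages are delivered first when the queue is backed up,
# so task3 can overtake task2 -- but ordering is still not guaranteed.
work.apply_async(('task2',), priority=1)
work.apply_async(('task3',), priority=9)

Even so, delivery order is only influenced, not guaranteed (for instance, messages a worker has already prefetched are not reordered), which is why this is described above as not bulletproof.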
I have one DAG that has three task streams (licappts, agents, agentpolicy).
For simplicity I'm calling these three distinct streams. The streams are independent in the sense that just because agentpolicy failed doesn't mean the other two (licappts and agents) should be affected by that stream's failure.
But for the sourceType_emr_task_1 tasks (i.e., licappts_emr_task_1, agents_emr_task_1, and agentpolicy_emr_task_1) I can only run one of these tasks at a time. For example, I can't run agents_emr_task_1 and agentpolicy_emr_task_1 at the same time, even though they are two independent tasks that don't necessarily care about each other.
How can I achieve this functionality in Airflow? For now the only thing I can think of is to wrap each task in a script that locks a global variable; if the variable is already locked, the script does a Thread.sleep(60 seconds) or something and then retries. But that seems very hacky, and I'm curious if Airflow offers a solution for this.
I'm open to restructuring the ordering of my DAG if needed to achieve this. One thing I thought about doing was to make a hard-coded ordering of
Dag Starts -> ... -> licappts_emr_task_1 -> agents_emr_task_1 -> agentpolicy_emr_task_1 -> DAG Finished
But I don't want to combine the streams this way, because then, for example, agentpolicy_emr_task_1 has to wait for the other two to finish before it can start, and there could be times when agentpolicy_emr_task_1 is ready to go before the other two have finished their earlier tasks.
So ideally I want whichever sourceType_emr_task_1 task is ready first to start, and then block the other tasks from running their sourceType_emr_task_1 task until it's finished.
Update:
Another solution I just thought of: if there is a way for me to check on the status of another task, I could create a script for sourceType_emr_task_1 that checks whether either of the other two sourceType_emr_task_1 tasks has a status of running. If so, it sleeps and periodically re-checks until none of the others are running, at which point it starts its own process. I'm not a big fan of this approach though, because I feel like it could cause a race condition where both tasks read (at the same time) that none are running and both start running.
You could use a pool to ensure the parallelism for those tasks is 1.
For each of the *_emr_task_1 tasks, set the pool kwarg to something like pool='emr_task'.
Then go into the webserver -> Admin -> Pools -> Create:
Set the Pool name to match the pool used in your operator, and set Slots to 1.
This will ensure the scheduler will only allow tasks to be queued for that pool up to the number of slots configured, regardless of the parallelism of the rest of Airflow.
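A rough sketch of how the pool kwarg might look in the DAG file, assuming Airflow 2.x, a pool named emr_task already created with 1 slot, and placeholder DAG name and commands:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='emr_streams',                    # illustrative DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    for stream in ('licappts', 'agents', 'agentpolicy'):
        BashOperator(
            task_id=f'{stream}_emr_task_1',
            bash_command='echo "run EMR step"',  # placeholder command
            pool='emr_task',                     # all three share the 1-slot pool
        )

Because all three operators point at the same one-slot pool, the scheduler starts whichever is ready first and queues the others until the slot frees up, which matches the "whichever is ready first goes, the rest wait" behavior asked for.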
I need to design a thread pool system, in Python in this case, but I'm more interested in the general methodology.
It has to be something along the lines of https://www.metachris.com/2016/04/python-threadpool/, where threads sit idle until some jobs are pushed into the pool. How that works, using condition variables, is clear to me.
I have one additional requirement though: the jobs I'm pushing into the pool cannot all run in parallel. Each of them has a class (I don't mean the object's class here, just a simple integer that somehow classifies the job), and only one job per class can be running at the same time. If a job is pushed that has the same class as a job currently running, it has to wait in the queue until the latter is done.
I have already modified the mentioned class to do this, but what I achieved is pretty messy and I'm not sure it's reliable, so I'm asking what modifications would be suggested, or whether I should use a totally different approach. Again: I don't need the code, but rather a description.
Thanks.
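A minimal sketch of one common approach (names and structure are illustrative, not a reference implementation): keep a set of classes currently running, guarded by the same condition variable as the queue; a worker only dequeues a job whose class is not in the set, and signals the condition when it finishes so waiting workers re-scan the queue.

import threading
from collections import deque

class ClassSerializedPool:
    """Sketch: a pool where at most one job per class runs at a time."""

    def __init__(self, num_workers):
        self._cond = threading.Condition()
        self._queue = deque()          # pending (cls, fn, args) jobs
        self._running = set()          # classes currently being executed
        for _ in range(num_workers):
            threading.Thread(target=self._work, daemon=True).start()

    def submit(self, cls, fn, *args):
        with self._cond:
            self._queue.append((cls, fn, args))
            self._cond.notify()

    def _take(self):
        # Must be called with the condition held: pick the first job whose class is free.
        for i, (cls, fn, args) in enumerate(self._queue):
            if cls not in self._running:
                del self._queue[i]
                self._running.add(cls)
                return cls, fn, args
        return None

    def _work(self):
        while True:
            with self._cond:
                job = self._take()
                while job is None:
                    self._cond.wait()
                    job = self._take()
            cls, fn, args = job
            try:
                fn(*args)
            finally:
                with self._cond:
                    self._running.discard(cls)
                    self._cond.notify_all()   # a queued job of this class may now proceed

The important part is that the same condition variable guards both the queue and the running-class set, so the "is this class busy?" check and the dequeue happen atomically.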
I'm trying to figure out how, using the SWF Flow framework, I can have my activity worker poll multiple task lists. The use case is having two different priorities for activity tasks that need to be completed.
Bonus points if someone uses glisten and can point out a way to achieve that.
Thanks!
It is not possible for a single ActivityWorker to poll multiple task lists. The reason for such a design is that each poll request can take up to a minute due to long polling. If a few such polls feed into a single-threaded activity implementation, it is not clear how to deal with the conflicts that arise when tasks are received on multiple task lists.
Until SWF natively supports priority task lists, the solution is to instantiate one ActivityWorker per task list (priority) and deal with conflicts yourself.
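This is not the Flow framework or glisten, but a rough sketch of the one-worker-per-task-list idea using plain boto3 from Python; the domain, task-list names, and activity handling are placeholders:

import threading
import boto3

swf = boto3.client('swf')  # region/credentials taken from the environment

def poll_loop(task_list):
    """One long-poll loop per task list -- the 'one worker per priority' idea."""
    while True:
        task = swf.poll_for_activity_task(
            domain='my-domain',                  # illustrative domain name
            taskList={'name': task_list},
            identity=f'worker-{task_list}',
        )
        if not task.get('taskToken'):
            continue                             # long poll timed out, poll again
        # ... run the activity here ...
        swf.respond_activity_task_completed(
            taskToken=task['taskToken'],
            result='done',
        )

# One polling thread per priority task list.
threads = [threading.Thread(target=poll_loop, args=(name,))
           for name in ('high-priority', 'low-priority')]
for t in threads:
    t.start()

Any conflict handling between the two pollers (the part the answer says you must deal with yourself) would live inside the activity execution itself.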
I have a partially ordered set of tasks, where for each task all of the tasks strictly before it in the partial order must be executed before it can be executed. I want to execute unrelated tasks (those neither before nor after one another) concurrently, to try to minimise the total execution time, but without starting a task before its dependencies have completed.
The tasks will run as (non-perl) child processes.
How should I approach solving a problem like this using Perl? What concurrency control facilities and data structures are available?
I would use a hash of arrays. For each task, all its prerequisites are listed in the corresponding array:
$prereq{task1} = [qw/task2 task3 task4/];
I would keep completed tasks in a different hash, and then just:
my @prereq = @{ $prereq{$task} };
# Run the task once every prerequisite is marked completed.
if (@prereq == grep exists $completed{$_}, @prereq) {
    run($task);
}
Looks like a full solution is NP-complete.
As for a partial solution, I would use some form of reference counting to determine which jobs are ready to run, Forks::Super::Job to run the background jobs and check their statuses, and POSIX::pause to sleep when the maximum number of jobs has been spawned.
No threads are involved since you're already dealing with separate processes.
Read the first link for possible algorithms/heuristics to determine runnable jobs' priorities.
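The question is about Perl, but the reference-counting approach is language-agnostic; here is a short sketch (shown in Python for brevity) under the assumptions of an acyclic, illustrative dependency map, placeholder child-process commands, and a POSIX system:

import os
import subprocess

# Illustrative dependency map: task -> set of prerequisite tasks.
deps = {
    'task1': set(),
    'task2': {'task1'},
    'task3': {'task1'},
    'task4': {'task2', 'task3'},
}
commands = {t: ['true'] for t in deps}             # placeholder child-process commands

remaining = {t: set(d) for t, d in deps.items()}   # unmet prerequisites (the reference counts)
pid_to_task = {}                                   # running children

while remaining or pid_to_task:
    # Spawn every task whose prerequisite set has emptied.
    for t in [t for t, d in remaining.items() if not d]:
        pid_to_task[subprocess.Popen(commands[t]).pid] = t
        del remaining[t]
    # Block until some child exits, then release its dependents.
    pid, _status = os.wait()
    finished = pid_to_task.pop(pid)
    for d in remaining.values():
        d.discard(finished)

A fuller version would also detect cycles (the remaining map never emptying while nothing is running) and cap the number of concurrent children, which is where the job limits and POSIX::pause mentioned above come in.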