How to add celery task to top of queue? - python-3.x

I use Celery to run some tasks. All tasks are added with .apply_async, and my script does this automatically, depending on some external conditions. I want to get the results of the tasks not in the order they were submitted but in reverse.
For example, I add task1, then task2, and then task3, and I want Celery to perform the tasks in the following order: task1, task3, task2. (task1 runs first because Celery starts it after I add it and before I add task2, which is fine.)
How can I get this behavior?
P.S. I use Redis as the broker.

The described behavior is not possible, or at least not to the full extent. It also depends largely on the broker chosen. Basically, you want the queue to work in LIFO mode, but that is not how most message brokers work; RabbitMQ, at least, only works in FIFO mode. With RabbitMQ you can partly achieve your goal with priorities, but as already said, it's not bulletproof and would need additional logic.
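If switching to (or adding) RabbitMQ is an option, a minimal sketch of the priority workaround could look like the following. The queue name, task, and priority values are illustrative assumptions; on RabbitMQ, higher numbers mean higher priority.

from celery import Celery
from kombu import Queue

app = Celery('tasks', broker='amqp://guest@localhost//')

# Declare a single priority-enabled queue; RabbitMQ honours x-max-priority.
app.conf.task_queues = [Queue('prio', queue_arguments={'x-max-priority': 10})]
app.conf.task_default_queue = 'prio'

@app.task
def work(n):
    return n

# task2 submitted with a low priority, task3 later with a high one, so the
# broker delivers task3 first even though task2 was enqueued earlier:
# work.apply_async(args=(2,), priority=1)
# work.apply_async(args=(3,), priority=9)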

Related

Airflow - Locking between tasks so that only one parallel task runs at a time?

I have one DAG that has three task streams (licappts, agents, agentpolicy):
For simplicity I'm calling these three distinct streams. The streams are independent in the sense that just because agentpolicy failed doesn't mean the other two (licappts and agents) should be affected by that stream's failure.
But for the sourceType_emr_task_1 tasks (i.e., licappts_emr_task_1, agents_emr_task_1, and agentpolicy_emr_task_1) I can only run one of these tasks at a time. For example I can't run agents_emr_task_1 and agentpolicy_emr_task_1 at the same time even though they are two independent tasks that don't necessarily care about each other.
How can I achieve this functionality in Airflow? For now the only thing I can think of is to wrap that task in a script that somehow locks a global variable, then if the variable is locked I'll have the script do a Thread.sleep(60 seconds) or something, and then retry. But that seems very hacky and I'm curious if Airflow offers a solution for this.
I'm open to restructuring the ordering of my DAG if needed to achieve this. One thing I thought about doing was to make a hard-coded ordering of
Dag Starts -> ... -> licappts_emr_task_1 -> agents_emr_task_1 -> agentpolicy_emr_task_1 -> DAG Finished
But I don't think combining the streams this way is a good idea, because then, for example, agentpolicy_emr_task_1 has to wait for the other two to finish before it can start, and there could be times when agentpolicy_emr_task_1 is ready to go before the other two have finished their other tasks.
So ideally I want whatever sourceType_emr_task_1 task to start that's ready first and then block the other tasks from running their sourceType_emr_task_1 task until it's finished.
Update:
Another solution I just thought of: if there were a way for me to check the status of another task, I could create a script for sourceType_emr_task_1 that checks whether either of the other two sourceType_emr_task_1 tasks has a status of running; if so, it sleeps and periodically re-checks until none of the others are running, at which point it starts its own process. I'm not a big fan of this approach, though, because I feel it could cause a race condition where both tasks read (at the same time) that none are running and both start running.
You could use a pool to ensure the parallelism for those tasks is 1.
For each of the *_emr_task_1 tasks, set the pool kwarg to something like pool='emr_task'.
Then just go into the webserver -> admin -> pools -> create:
Set the Pool name to match the pool used in your operators, and set Slots to 1.
This will ensure the scheduler will only allow tasks to be queued for that pool up to the number of slots configured, regardless of the parallelism of the rest of Airflow.
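A minimal sketch of what that looks like in the DAG file, assuming BashOperator tasks with placeholder commands (the pool named emr_task still has to be created in the UI with 1 slot):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('emr_streams', start_date=datetime(2018, 1, 1), schedule_interval='@daily')

# All three tasks share the same 1-slot pool, so only one can run at a time.
licappts_emr_task_1 = BashOperator(task_id='licappts_emr_task_1',
                                   bash_command='echo licappts',       # placeholder
                                   pool='emr_task', dag=dag)
agents_emr_task_1 = BashOperator(task_id='agents_emr_task_1',
                                 bash_command='echo agents',           # placeholder
                                 pool='emr_task', dag=dag)
agentpolicy_emr_task_1 = BashOperator(task_id='agentpolicy_emr_task_1',
                                      bash_command='echo agentpolicy', # placeholder
                                      pool='emr_task', dag=dag)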

Classful Thread Pool

I need to design a thread pool system, in Python in this case, but I'm more interested in the general methodology.
It has to be something along the lines of https://www.metachris.com/2016/04/python-threadpool/, where threads wait idling until some jobs are pushed into the pool. How that works, using condition variables, is clear to me.
I have one additional requirement though: the jobs I'm pushing into the pool cannot all run in parallel. Each of them has a class (I don't mean the object class here, just a simple integer that somehow classifies the job), and only one job per class can be running at the same time. If a job is pushed with the same class as a job that is currently running, it has to wait in the queue until the latter is done.
I have already modified the mentioned class to do this, but what I achieved is pretty messy and I'm not sure it's reliable, so I would ask what modifications would be suggested or whether I should use a totally different approach. Again: I don't need the code, but rather a description.
Thanks.
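The asker only wants a description, but for illustration here is a minimal sketch of one way to enforce the per-class constraint with a single condition variable; the names (ClassfulPool, submit) are illustrative, not a finished design:

import threading
from collections import deque

class ClassfulPool:
    def __init__(self, workers=4):
        self._lock = threading.Condition()
        self._pending = deque()        # jobs waiting to be dispatched
        self._running = set()          # classes currently being executed
        for _ in range(workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, job_class, fn, *args):
        with self._lock:
            self._pending.append((job_class, fn, args))
            self._lock.notify()

    def _take(self):
        # Pick the first pending job whose class is not already running.
        with self._lock:
            while True:
                for i, (cls, fn, args) in enumerate(self._pending):
                    if cls not in self._running:
                        del self._pending[i]
                        self._running.add(cls)
                        return cls, fn, args
                self._lock.wait()

    def _worker(self):
        while True:
            cls, fn, args = self._take()
            try:
                fn(*args)
            finally:
                with self._lock:
                    self._running.discard(cls)
                    self._lock.notify_all()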

Trigger periodic celery task after an event [duplicate]

I want to schedule a periodic task with Celery dynamically at the end of another group of tasks.
I know how to create (static) periodic tasks with Celery:
CELERYBEAT_SCHEDULE = {
    'poll_actions': {
        'task': 'tasks.poll_actions',
        'schedule': timedelta(seconds=5)
    }
}
But I want to create periodic jobs dynamically from my tasks, and ideally have a way to stop those periodic jobs when some condition is met (all tasks done).
Something like:
@celery.task
def run(ids):
    group(prepare.s(id) for id in ids) | execute.s(ids) | poll.s(ids, schedule=timedelta(seconds=5))

@celery.task
def prepare(id):
    ...

@celery.task
def execute(id):
    ...

@celery.task
def poll(ids):
    # This task has to be schedulable on demand
    ...
The straightforward solution to this requires being able to add and remove beat scheduler entries on the fly. As of the answering of this question...
How to dynamically add / remove periodic tasks to Celery (celerybeat)
This was not possible. I doubt it has become available in the interim because ...
You are conflating two concepts here: the notion of "Event Driven Work" and the idea of "Batch Schedule Driven Work" (which is really just the first case, where the event happens on a schedule). If you really consider what you are doing here, you'll find a rather complex set of edge cases. Messages are distributed in nature; what happens when groups spawned from different messages start creating conflicting entries? What do you do when you find yourself under a mountain of previously scheduled cruft?
When working with messaging systems you are really looking to build recursive trees: spindles of work that do something and spawn more messages to do more things. Cycles (intended or otherwise) aside, these ultimately reach their base cases and terminate.
The answer to whatever you are actually trying to achieve lies with re-encoding your problem within the limitations of your messaging system and asynchronous work framework.
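As one example of re-encoding the problem, the polling step can simply re-enqueue itself with a countdown instead of living in the beat schedule. This is a sketch only; all_tasks_done is a hypothetical termination check.

from celery import Celery

celery = Celery('tasks', broker='redis://localhost:6379/0')

@celery.task
def poll(ids):
    if all_tasks_done(ids):          # hypothetical base case
        return 'done'
    # ... inspect or report current state here ...
    # Re-enqueue this same task to run again in 5 seconds.
    poll.apply_async(args=(ids,), countdown=5)

def all_tasks_done(ids):
    # Placeholder: replace with a real check of your tasks' results.
    return False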

Why in kubernetes cron job two jobs might be created, or no job might be created?

The k8s Cron Job Limitations documentation mentions that there is no guarantee that a job will be executed exactly once:
A cron job creates a job object about once per execution time of its schedule. We say "about" because there are certain circumstances where two jobs might be created, or no job might be created. We attempt to make these rare, but do not completely prevent them. Therefore, jobs should be idempotent.
Could anyone explain:
why this could happen?
what are the probabilities/statistics that this could happen?
will it be fixed in some reasonable future in k8s?
are there any workarounds to prevent such a behavior (if the running job can't be implemented as idempotent)?
do other cron-related services suffer from the same issue? Maybe it is a core cron problem?
The controller:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/cronjob_controller.go
starts with a comment that lays the groundwork for an explanation:
I did not use watch or expectations. Those add a lot of corner cases, and we aren't expecting a large volume of jobs or scheduledJobs. (We are favoring correctness over scalability. If we find a single controller thread is too slow because there are a lot of Jobs or CronJobs, we can parallelize by Namespace. If we find the load on the API server is too high, we can use a watch and UndeltaStore.)
Just periodically list jobs and SJs, and then reconcile them.
Periodically means every 10 seconds:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/cronjob_controller.go#L105
The documentation following the quoted limitations also has some useful color on some of the circumstances under which 2 jobs or no jobs may be launched on a particular schedule:
If startingDeadlineSeconds is set to a large value or left unset (the default) and if concurrencyPolicy is set to Allow, the jobs will always run at least once.
Jobs may fail to run if the CronJob controller is not running or broken for a span of time from before the start time of the CronJob to start time plus startingDeadlineSeconds, or if the span covers multiple start times and concurrencyPolicy does not allow concurrency. For example, suppose a cron job is set to start at exactly 08:30:00 and its startingDeadlineSeconds is set to 10, if the CronJob controller happens to be down from 08:29:00 to 08:42:00, the job will not start. Set a longer startingDeadlineSeconds if starting later is better than not starting at all.
Higher level, solving for only-once in a distributed system is hard:
https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/
Clocks and time synchronization in a distributed system is also hard:
https://8thlight.com/blog/rylan-dirksen/2013/10/04/synchronization-in-a-distributed-system.html
To the questions:
why this could happen?
For instance- the node hosting the CronJobController fails at the time a job is supposed to run.
what are the probabilities/statistics that this could happen?
Very unlikely for any given run. For a large enough number of runs, very unlikely to escape having to face this issue.
will it be fixed in some reasonable future in k8s?
There are no idempotency-related issues under the area/batch label in the k8s repo, so one would guess not.
https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Aarea%2Fbatch
are there any workarounds to prevent such a behavior (if the running job can't be implemented as idempotent)?
Think more about the specific definition of idempotent, and the particular points in the job where there are commits. For instance, jobs can be made to support more-than-once execution if they save state to staging areas, and then there is an election process to determine whose work wins.
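For illustration, here is a sketch of that staging/election pattern using Redis as the shared store; the store, key names, and helper functions are assumptions, not anything k8s provides.

import sys
import redis

def run_job(schedule_id: str) -> None:
    r = redis.Redis(host='localhost', port=6379)

    # Write the output to a staging key for this schedule slot.
    result = do_work()                      # hypothetical workload
    r.set('staging:' + schedule_id, result)

    # Election: SET NX succeeds for exactly one run per slot, so a duplicate
    # job created by the controller commits nothing and exits cleanly.
    if r.set('commit:' + schedule_id, 'done', nx=True):
        commit(result)                      # hypothetical commit step
    else:
        sys.exit(0)                         # lost the election; no-op

def do_work():
    return b'payload'

def commit(result):
    pass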
do other cron-related services suffer from the same issue? Maybe it is a core cron problem?
Yes, it's a core distributed systems problem.
For most users, the k8s documentation gives perhaps a more precise and nuanced answer than is necessary. If your scheduled job controls some critical medical procedure, it's really important to plan for failure cases. If it's just doing some system cleanup, missing a scheduled run doesn't matter much. In practice, nearly all users of k8s CronJobs fall into the latter category.

How to choose tasks from a list based on some associated meta data?

I have n tasks in a waiting list.
Each task has associated with it an entry that contains some meta information:
Task1 A,B
Task2 A
Task3 B,C
Task4 A,B,C
And an associated hashmap that contains entries like:
A 1
B 2
C 2
This implies that if a task whose meta information contains A is already running, then no other task containing A can run at the same time. However, since B has a limit of 2 tasks, either task1 and task3 can run together, or task3 and task4. But task1, task3, and task4 cannot run together, since the limits of both A and B would be violated, even though the limit of C would not be.
If I need to select tasks to run in different threads, what logic/algorithm would you suggest? And when should this logic be invoked? I view the task list as a shared resource which might need to be locked when tasks are selected to run from it. Right now, I think this logic would have to be invoked when a task is added to the list and also when a running task has completed. But this could block the addition of new elements to the list, unless I make a copy of the list before running the logic.
How would your logic change if I were to give higher priority to tasks that contain more entries, like 'A,B,C', than to tasks like 'A,B'?
This is kind of a continuation of Choosing a data structure for a variant of producer consumer problem and How to access the underlying queue of a ThreadpoolExecutor in a thread safe way, just in case any one is wondering about the background of the problem.
Yes, this is nasty. I immediately thought of an array/list of semaphores, initialized from the hashmap, from which any thread attempting to execute a task would have to acquire units as defined by the metadata. About a second later, I realized that such a design would deadlock pretty quickly!
I think that one dedicated producer thread is going to have to iterate a 'readyJobs' list in an attempt to find a task that can execute with the resources currently available. It could do this both when new tasks become available and after a task completes, thereby releasing resources. The producer thread could wait on one input queue (a thread-safe producer-consumer queue), to which both new tasks from [wherever] and completed tasks queued back from the worker threads are pushed (a callback fired by the worker threads pushes the completed task to the input queue?). Adding new tasks might be blocked briefly, but only while the input queue is blocked by some other task being added.
In the case of assigning priorities, you could insert-sort the 'readyJobs' list as you wish, so that higher-priority tasks are checked first to see if they can run with the resources available. If they cannot, then the rest of the list is iterated and a lower-priority job might be able to run.
I hope that you do not want to 'preempt' lower-priority tasks so as to release resources early - that would get really, really messy :(
Rgds,
Martin
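A minimal sketch of that dedicated dispatcher thread, with the limits above hard-coded and the worker hand-off left as a callback; all names are illustrative:

import queue

limits = {'A': 1, 'B': 2, 'C': 2}       # tag -> max concurrent tasks
in_use = {tag: 0 for tag in limits}     # tag -> tasks currently running
ready_jobs = []                         # jobs waiting for resources
input_q = queue.Queue()                 # carries ('new', job) or ('done', job)

def can_run(tags):
    return all(in_use[t] < limits[t] for t in tags)

def dispatcher(run_in_worker):
    # run_in_worker(job) hands the job to a worker thread; the worker must
    # put ('done', job) back on input_q when it finishes.
    while True:
        kind, job = input_q.get()       # block until something changes
        tags, fn = job
        if kind == 'new':
            ready_jobs.append(job)
        else:                           # 'done': release the job's resources
            for t in tags:
                in_use[t] -= 1
        # Higher priority first: jobs with more tags are tried earlier.
        ready_jobs.sort(key=lambda j: -len(j[0]))
        for candidate in list(ready_jobs):
            ctags, cfn = candidate
            if can_run(ctags):
                ready_jobs.remove(candidate)
                for t in ctags:
                    in_use[t] += 1
                run_in_worker(candidate)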
