How do I process a partial order of tasks concurrently using Perl? - multithreading

I have a partially ordered set of tasks, where for each task all of the tasks that are strictly before it in the partial order must be executed before it can be executed. I want to execute tasks which are not related (either before or after one another) concurrently, to try to minimise the total execution time - but without starting a task before its dependencies are completed.
The tasks will run as (non-perl) child processes.
How should I approach solving a problem like this using Perl? What concurrency control facilities and data structures are available?

I would use a hash of arrays. For each task, all its prerequisites are listed in the corresponding array:
$prereq{task1} = [qw/task2 task3 task4/];
I would keep completed tasks in a different hash, and then just
my @prereq = @{ $prereq{$task} };
# Run the task only once every prerequisite appears in %completed.
if (@prereq == grep exists $completed{$_}, @prereq) {
    run($task);
}

Finding an optimal schedule for a problem like this looks NP-complete.
As for a practical solution, I would use some form of reference counting to determine which jobs are ready to run, Forks::Super::Job to run the background jobs and check their statuses, and POSIX::pause to sleep when the maximum number of jobs has been spawned.
No threads are involved since you're already dealing with separate processes.
Read the first link for possible algorithms/heuristics to determine runnable jobs' priorities.
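To make the reference-counting idea concrete, here is a rough sketch of the scheduling loop, written in Python purely for illustration (the task names, commands and job limit are placeholders): keep a prerequisite map and a set of completed tasks, start every task whose prerequisites are done, and reap finished children in a loop. A Perl version built on fork or Forks::Super would have the same shape.
import subprocess
import time

# Placeholders: the real commands would be your non-perl child processes.
commands = {
    'task1': ['sleep', '1'],
    'task2': ['sleep', '1'],
    'task3': ['sleep', '1'],
    'task4': ['sleep', '1'],
}
prereq = {
    'task1': ['task2', 'task3', 'task4'],   # task1 waits for the other three
    'task2': [],
    'task3': [],
    'task4': [],
}

MAX_JOBS = 4                                # cap on concurrent children
completed, running = set(), {}

while len(completed) < len(commands):
    # Start every task whose prerequisites are all completed.
    for task, deps in prereq.items():
        if (task not in completed and task not in running
                and len(running) < MAX_JOBS
                and all(d in completed for d in deps)):
            running[task] = subprocess.Popen(commands[task])
    # Reap finished children; their dependents become eligible on the next pass.
    for task, proc in list(running.items()):
        if proc.poll() is not None:
            completed.add(task)
            del running[task]
    time.sleep(0.1)                         # avoid a busy loop
In the Perl version, a blocking waitpid (or POSIX::pause, as suggested above) would replace the poll-and-sleep at the bottom of the loop.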

Related

single slurm array vs multiple sbatch calls

I can run N embarrassingly parallel jobs by using a slurm array like:
#SBATCH --array=1-N
Alternatively, I think I can achieve the same from a scheduling perspective (i.e. scheduled independently and as soon as resources become available) by manually launching 8 jobs, for example with a simple bash script with a loop.
Since the latter is far more flexible, I don't see the utility of using the --array option built into slurm.
Am I missing something?
Arrays offer a simple way to create parametrised jobs without writing the Bash loop. An array:
- (obviously) creates the jobs and assigns them a parameter;
- takes care of output file name parametrisation;
- makes it much easier to submit a dependent job that should run after all of those jobs have completed;
- makes the output of squeue less cluttered.
Furthermore, the jobs in an array can be managed as a whole: the squeue, scancel, etc. commands can work on the whole array, rather than requiring another loop to, for instance, cancel them one by one. This is even more interesting when you have multiple arrays running at the same time; you do not need to track each individual job yourself.
Finally, especially for large arrays, it makes the scheduler's work easier and can increase job throughput.
If you need flexibility, then job arrays are not the solution, but maybe a workflow manager could help you.

Make serial DFS search multithreaded

I want to implement a parallel DFS search in a tree. We have a (potentially huge) tree of business objects, and the user can search by a textId (or part of the textId). This is implemented as a serial DFS search at the moment, which is a huge waste of resources. Our customers have 8 logical cores. No need to wait 5 minutes for a search result...
We already have a global ThreadManager. We use it mainly for parallel calculations of grid cells. The ThreadManager keeps track of how many tasks are queued, how many threads are available and starts the next queued task when a new thread is available.
The idea is to use this with a new task class to parallelize the tree search. Of course, I cannot start a new task on every childNode - that would mean hundreds or thousands of queued tasks. But parallelizing only at a high tree level would underuse the cores when a task has only a small subtree.
I have the following idea for the task class:
Each object of the task class knows its treeNode and a common result object. The task's "execute" method does the following:
if the result object already has a result:
    return
match treeNode against the search condition. If it matches:
    put treeNode's object in the result object
    return
for each childNode of treeNode:
    if the ThreadManager has a thread available:
        create a new task for childNode and put the task on the queue
    else:
        call "execute" for the childNode directly in the current thread
The synchronization of the completed tasks seems complicated. For example, I have to know when the tree has been searched but the target textID was not found. But it should be possible to expand the ThreadManager to know whether there are still tasks that belong to this search.
Does someone have experience with that kind of algorithm? Will the synchronization overhead be too much to be worth it? Are there other pitfalls I am not seeing?
This discussion is similar: Depth first search in parallel
where "stack" or "global worklist" is my "ThreadManager". Did I get this right?
Thank you!

Airflow - Locking between tasks so that only one parallel task runs at a time?

I have one DAG that has three task streams (licappts, agents, agentpolicy):
For simplicity I'm calling these three distinct streams. The streams are independent in the sense that just because agentpolicy failed doesn't mean the other two (licappts and agents) should be affected by that stream's failure.
But for the sourceType_emr_task_1 tasks (i.e., licappts_emr_task_1, agents_emr_task_1, and agentpolicy_emr_task_1) I can only run one of these tasks at a time. For example I can't run agents_emr_task_1 and agentpolicy_emr_task_1 at the same time even though they are two independent tasks that don't necessarily care about each other.
How can I achieve this functionality in Airflow? For now the only thing I can think of is to wrap that task in a script that somehow locks a global variable, then if the variable is locked I'll have the script do a Thread.sleep(60 seconds) or something, and then retry. But that seems very hacky and I'm curious if Airflow offers a solution for this.
I'm open to restructuring the ordering of my DAG if needed to achieve this. One thing I thought about doing was to make a hard coded ordering of
Dag Starts -> ... -> licappts_emr_task_1 -> agents_emr_task_1 -> agentpolicy_emr_task_1 -> DAG Finished
But I don't think combining the streams this way is a good idea, because then, for example, agentpolicy_emr_task_1 has to wait for the other two to finish before it can start, and there could be times when agentpolicy_emr_task_1 is ready to go before the other two have finished their other tasks.
So ideally I want whichever sourceType_emr_task_1 task is ready first to start, and then block the other tasks from running their sourceType_emr_task_1 task until it's finished.
Update:
Another solution I just thought of: if there is a way for me to check the status of another task, I could create a script for sourceType_emr_task_1 that checks whether either of the other two sourceType_emr_task_1 tasks has a status of running, and if so sleeps and periodically checks until none of the others are running, in which case it'll start its process. I'm not a big fan of this way though, because I feel like it could cause a race condition where both read (at the same time) that none are running and both start running.
You could use a pool to ensure the parallelism for those tasks is 1.
For each of the *_emr_task_1 tasks, set the pool kwarg to something like pool='emr_task'.
Then just go into the webserver -> admin -> pools -> create:
Set the Pool name to match the pool used in your operators, and set Slots to 1.
This will ensure the scheduler will only allow tasks to be queued for that pool up to the number of slots configured, regardless of the parallelism of the rest of Airflow.
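Concretely, the operator definitions might look something like the sketch below (the DAG definition, task ids and bash commands are placeholders; the only requirement is that the pool name matches the pool created in the UI):
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('emr_streams', start_date=datetime(2018, 1, 1), schedule_interval=None)

# All three *_emr_task_1 tasks share the 'emr_task' pool; with the pool's
# Slots set to 1, the scheduler only ever runs one of them at a time,
# whichever becomes ready first. The rest of each stream is unaffected.
licappts_emr_task_1 = BashOperator(
    task_id='licappts_emr_task_1',
    bash_command='...',            # placeholder command
    pool='emr_task',
    dag=dag,
)
agents_emr_task_1 = BashOperator(
    task_id='agents_emr_task_1',
    bash_command='...',
    pool='emr_task',
    dag=dag,
)
agentpolicy_emr_task_1 = BashOperator(
    task_id='agentpolicy_emr_task_1',
    bash_command='...',
    pool='emr_task',
    dag=dag,
)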

Trigger periodic celery task after an event [duplicate]

I want to schedule a periodic task with Celery dynamically, at the end of another group of tasks.
I know how to create (static) periodic tasks with Celery:
CELERYBEAT_SCHEDULE = {
    'poll_actions': {
        'task': 'tasks.poll_actions',
        'schedule': timedelta(seconds=5)
    }
}
But I want to create periodic jobs dynamically from my tasks (and maybe have a way to stop those periodic jobs when some condition is achieved, e.g. all tasks done).
Something like:
@celery.task
def run(ids):
    group(prepare.s(id) for id in ids) | execute.s(ids) | poll.s(ids, schedule=timedelta(seconds=5))

@celery.task
def prepare(id):
    ...

@celery.task
def execute(id):
    ...

@celery.task
def poll(ids):
    # This task has to be schedulable on demand
    ...
The straightforward solution to this requires that you be able to add/remove beat scheduler entries on the fly. As of the answering of this question:
How to dynamically add / remove periodic tasks to Celery (celerybeat)
this was not possible, and I doubt that has changed in the interim, because ...
You are conflating two concepts here: the notion of "Event Driven Work" and the idea of "Batch Schedule Driven Work" (which is really just the first case, where the event happens on a schedule). If you really consider what you are doing here, you'll find that there is a rather complex set of edge cases. Messages are distributed in nature: what happens when groups spawned from different messages start creating conflicting entries? What do you do when you find yourself under a mountain of previously scheduled cruft?
When working with messaging systems, you are really looking to build recursive trees: spindles of work that do something and spawn more messages to do more things. Cycles (intended or otherwise) aside, these ultimately reach their base cases and terminate.
The answer to whatever you are actually trying to achieve lies with re-encoding your problem within the limitations of your messaging system and asynchronous work framework.
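For the polling part specifically, one way to re-encode it within those limits (a rough sketch; the broker URL, task name and the "all tasks done" check are placeholders, not your actual code) is a task that simply re-enqueues itself with a countdown until its termination condition holds, so no beat entry ever needs to be created or removed:
from celery import Celery

celery = Celery('tasks', broker='redis://localhost:6379/0')   # placeholder broker

def all_actions_done(ids):
    # Placeholder for your real "all tasks done" / condition check.
    return False

@celery.task
def poll(ids):
    if all_actions_done(ids):
        return                     # condition reached: stop re-scheduling
    # Not done yet: re-enqueue this same task to run again in 5 seconds.
    poll.apply_async(args=(ids,), countdown=5)
The last task in your chain can kick this off with poll.apply_async(args=(ids,), countdown=5), and the recursion terminates itself once the condition is met.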

How to choose tasks from a list based on some associated meta data?

I have n tasks in a waiting list.
Each task has associated with it an entry that contains some meta information:
Task1 A,B
Task2 A
Task3 B,C
Task4 A,B,C
And an associated hashmap that contains entries like:
A 1
B 2
C 2
This implies that if a task whose meta information contains A is already running, then no other task containing A can run at the same time (A has a limit of 1).
However, since B has a limit of 2, either task1 and task3 can run together, or task3 and task4.
But task1, task3 and task4 cannot run together, since the limits of both A and B would be violated, even though the limit of C is not.
If I need to select tasks to run in different threads, what logic/algorithm would you suggest? And when should this logic be invoked? I view the task list as a shared resource which might need to be locked when tasks are selected to run from it. Right now, I think this logic might have to be invoked when a task is added to the list and also when a running task has completed. But this could block the addition of new elements to the list, unless I make a copy of the list before running the logic.
How would your logic change if I were to give higher priority to tasks that contain more entries, e.g. 'A,B,C' over 'A,B'?
This is kind of a continuation of Choosing a data structure for a variant of producer consumer problem and How to access the underlying queue of a ThreadpoolExecutor in a thread safe way, just in case any one is wondering about the background of the problem.
Yes, this is nasty. I immediately thought of an array/list of semaphores, initialized from the hashmap, from which any thread attempting to execute a task would have to acquire units as defined by the metadata. About a second later, I realized that such a design would deadlock pretty quickly!
I think that one dedicated producer thread is going to have to iterate a 'readyJobs' list in an attempt to find a task that can execute with the resources currently available. It could do this both when new tasks become available and after a task is completed, thereby releasing resources. The producer thread could wait on one input queue (a thread-safe producer-consumer queue), to which both new tasks from [wherever] and completed tasks queued back from the work threads are pushed (a callback fired by the work threads pushes the completed task to the input queue?). Adding new tasks might be blocked briefly, but only while the input queue is held by some other task being added.
In the case of assigning 'priorities', you could insert-sort the 'readyJobs' list as you wish, so that higher-priority tasks are checked first to see if they can run with the resources available. If they cannot, then the rest of the list is iterated and a lower-priority job might be able to run.
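To make that concrete, here is a rough sketch of the feasibility check and of one scheduling pass of the producer thread (Python purely for illustration; the limits, the tag lists and start_worker are placeholders for your real resources and worker hand-off):
from collections import Counter

limits = {'A': 1, 'B': 2, 'C': 2}              # the hashmap from the question

def start_worker(name):
    # Placeholder: in the real system this hands the task to a worker thread.
    print('starting', name)

def can_run(task_tags, running):
    # Count how many running tasks currently use each tag, then check that
    # starting this task would not push any tag over its limit.
    in_use = Counter(tag for tags in running.values() for tag in tags)
    return all(in_use[tag] < limits[tag] for tag in task_tags)

def schedule(ready_jobs, running):
    # ready_jobs: list of (name, tags), insert-sorted so higher-priority jobs
    # (e.g. those with more tags) come first; running: name -> tags.
    for name, tags in list(ready_jobs):
        if can_run(tags, running):
            ready_jobs.remove((name, tags))
            running[name] = tags
            start_worker(name)

ready = [('Task4', ['A', 'B', 'C']), ('Task1', ['A', 'B']), ('Task3', ['B', 'C'])]
running = {}
schedule(ready, running)   # starts Task4 and Task3; Task1 stays queued (A is at its limit)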
I hope that you do not want to 'preempt' lower-priority tasks so as to release resources early - that would get really, really messy :(
Rgds,
Martin
