How to choose tasks from a list based on some associated metadata? - multithreading

I have n tasks in a waiting list.
Each task has associated with it an entry that contains some meta information:
Task1 A,B
Task2 A
Task3 B,C
Task4 A,B,C
And an associated hashmap that contains entries like:
A 1
B 2
C 2
This implies that if a task whose meta information contains A is already running, then no other task containing
A can run at the same time.
B, however, has a limit of 2 tasks, so either task1 and task3 can run together, or task3 and task4.
But task1, task3 and task4 cannot run together, since the limits of both A and B would be violated, though the limit of C
would not.
If I need to select tasks to run in different threads, what logic/algorithm would you suggest? And when should this logic
be invoked? I view the task list as a shared resource which might need to be locked while tasks
are selected to run from it. Right now, I think this logic would have to be invoked when a task is added to the list and
also when a running task has completed. But this could block the addition of new elements to the list, unless I make a copy of the list before running the logic.
How would your logic change if I were to give higher priority to tasks that contain more entries like 'A,B,C'
than that to 'A,B'?
This is kind of a continuation of Choosing a data structure for a variant of producer consumer problem and How to access the underlying queue of a ThreadpoolExecutor in a thread safe way, just in case any one is wondering about the background of the problem.

Yes, this is nasty. I immediately thought of an array/list of semaphores, initialized from the hashmap, from which any thread attempting to execute a task would have to get units as defined by the metadata. About a second later, I realized that such a design would deadlock pretty quickly!
I think that one dedicated producer thread is going to have to iterate a 'readyJobs' list in an attempt to find a task that can execute with the resources currently available. It could do this both when new tasks become available and after a task is completed, so releasing resources. The producer thread could wait on one input queue (a thread-safe producer-consumer queue), to which are queued both new tasks from [wherever] and completed tasks queued back from the work threads (a callback fired by the work threads pushes the completed task to the input queue?). Adding new tasks might be blocked briefly, but only while the input queue is blocked by some other task being added.
In the case of assigning priorities, you could insert-sort the 'readyJobs' list as you wish, so that higher-priority tasks are checked first to see if they can run with the resources available. If they cannot, the rest of the list is iterated and a lower-priority job might be able to run.
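As an illustration only (not from the question), the selection pass over the sorted 'readyJobs' list might look like this sketch, where the tag limits come from the hashmap in the question:

```python
# Hypothetical sketch of the producer thread's selection pass: scan the
# ready list (highest priority first) and start every task whose tags
# all still have spare capacity under the configured limits.
from collections import Counter

def select_runnable(ready, running, limits):
    """Return the names of tasks from `ready` that can start now.

    ready:   list of (name, tags) tuples, pre-sorted by priority
    running: list of tag-sets for tasks currently executing
    limits:  dict mapping tag -> max concurrent tasks carrying it
    """
    in_use = Counter(tag for tags in running for tag in tags)
    started = []
    for name, tags in ready:
        if all(in_use[t] < limits[t] for t in tags):
            started.append(name)
            in_use.update(tags)   # reserve the tags immediately
    return started

# The example from the question: A may run once, B and C twice,
# and tasks with more tags are sorted to the front (higher priority).
limits = {"A": 1, "B": 2, "C": 2}
ready = [("Task4", {"A", "B", "C"}), ("Task1", {"A", "B"}),
         ("Task3", {"B", "C"}), ("Task2", {"A"})]
print(select_runnable(ready, [], limits))   # ['Task4', 'Task3']
```

With an empty machine, Task4 claims A, B and C; Task1 and Task2 are then blocked on A's limit of 1, but Task3 still fits under B's and C's limits of 2.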
I hope that you do not want to 'preempt' lower-priority tasks so as to release resources early - that would get really, really messy :(
Rgds,
Martin

Related

Airflow - Locking between tasks so that only one parallel task runs at a time?

I have one DAG that has three task streams (licappts, agents, agentpolicy):
For simplicity I'm calling these three distinct streams. The streams are independent in the sense that just because agentpolicy failed doesn't mean the other two (licappts and agents) should be affected by that failure.
But for the sourceType_emr_task_1 tasks (i.e., licappts_emr_task_1, agents_emr_task_1, and agentpolicy_emr_task_1) I can only run one of these tasks at a time. For example I can't run agents_emr_task_1 and agentpolicy_emr_task_1 at the same time even though they are two independent tasks that don't necessarily care about each other.
How can I achieve this functionality in Airflow? For now the only thing I can think of is to wrap that task in a script that somehow locks a global variable, then if the variable is locked I'll have the script do a Thread.sleep(60 seconds) or something, and then retry. But that seems very hacky and I'm curious if Airflow offers a solution for this.
I'm open to restructuring the ordering of my DAG if needed to achieve this. One thing I thought about doing was to make a hard coded ordering of
Dag Starts -> ... -> licappts_emr_task_1 -> agents_emr_task_1 -> agentpolicy_emr_task_1 -> DAG Finished
But I don't think I should combine the streams this way, because then, for example, agentpolicy_emr_task_1 has to wait for the other two to finish before it can start, and there could be times when agentpolicy_emr_task_1 is ready to go before the other two have finished their other tasks.
So ideally I want whichever sourceType_emr_task_1 task is ready first to start, and then block the other tasks from running their sourceType_emr_task_1 task until it's finished.
Update:
Another solution I just thought of is, if there is a way for me to check on the status of another task, I could create a script for sourceType_emr_task_1 that checks to see if either of the other two sourceType_emr_task_1 tasks has a status of running; if so, it'll sleep and periodically check to see if none of the others are running, in which case it'll start its process. I'm not a big fan of this approach though, because I feel like it could cause a race condition where both read (at the same time) that none are running and both start running.
You could use a pool to ensure the parallelism for those tasks is 1.
For each of the *_emr_task_1 tasks, set the pool kwarg to something like pool=emr_task.
Then just go into the webserver -> admin -> pools -> create:
Set the pool Name to match the pool used in your operator, and Slots to 1.
This will ensure the scheduler will only allow tasks to be queued for that pool up to the number of slots configured, regardless of the parallelism of the rest of Airflow.
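A minimal sketch of the operator side (task names and commands are placeholders taken from the question; assumes Airflow 2.x import paths):

```python
# Sketch only, not a tested DAG: each *_emr_task_1 shares the same pool,
# so once the "emr_task" pool is created with Slots = 1, the scheduler
# will queue at most one of them at a time.
from airflow.operators.bash import BashOperator

licappts_emr_task_1 = BashOperator(
    task_id="licappts_emr_task_1",
    bash_command="echo run licappts emr step",  # placeholder command
    pool="emr_task",                            # pool created in Admin -> Pools
)

agents_emr_task_1 = BashOperator(
    task_id="agents_emr_task_1",
    bash_command="echo run agents emr step",    # placeholder command
    pool="emr_task",
)
```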

Do processing in separated threads with Activiti

With Activiti it is possible to design parallel tasks; however, these tasks are internally executed sequentially (by the same thread).
I need to execute tasks in an asynchronous way, and then "join" the tasks once they are finished.
The process is:
preparation -> execute task 1
-> execute task 2 at the same time
-> then once both are finished, go on
It is a matter of optimization, because tasks 1 and 2 are web-service calls and may require a lot of time.
From everything I have read, this is not possible with Activiti. Using async tasks, it is not possible to join them properly (detect that both are finished). The first task to finish is OK, but the second throws an OptimisticLockException and is restarted (which is not acceptable).
Maybe I misunderstood something and this is actually possible, or even easy? Did anyone succeed with it?
I am not sure if I understand your question clearly,
but Activiti does support async processing.
To join two async processes, you can create another task that waits until both async tasks are completed.
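For illustration only, here is an Activiti-agnostic sketch of that fork/join shape in Python (Activiti itself is Java; this only shows the pattern): run both slow calls concurrently, then block until both have finished.

```python
# Sketch: two concurrent "web-service calls" joined with result(),
# which blocks until each task has finished.
from concurrent.futures import ThreadPoolExecutor

def call_service(name):
    # stand-in for a slow web-service call
    return f"{name} done"

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(call_service, "task1")
    f2 = pool.submit(call_service, "task2")
    # the "join": execution continues only once both results are in
    results = [f1.result(), f2.result()]

print(results)  # ['task1 done', 'task2 done']
```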

How do I process a partial order of tasks concurrently using Perl?

I have a partially ordered set of tasks, where for each task all of the tasks that are strictly before it in the partial order must be executed before it can be executed. I want to execute tasks which are not related (neither before nor after one another) concurrently to try to minimise the total execution time - but without starting a task before its dependencies are completed.
The tasks will run as (non-perl) child processes.
How should I approach solving a problem like this using Perl? What concurrency control facilities and data structures are available?
I would use a hash of arrays. For each task, all its prerequisities will be mentioned in the corresponding array:
$prereq{task1} = [qw/task2 task3 task4/];
I would keep completed tasks in a different hash, and then just
my @prereq = @{ $prereq{$task} };
if (@prereq == grep exists $completed{$_}, @prereq) {
    run($task);
}
Looks like a full solution is NP-complete.
As for a partial solution, I would use some form of reference counting to determine which jobs are ready to run, Forks::Super::Job to run the background jobs and check their statuses, and POSIX::pause to sleep when the maximum number of jobs has been spawned.
No threads are involved since you're already dealing with separate processes.
Read the first link for possible algorithms/heuristics to determine runnable jobs' priorities.
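As a language-neutral illustration of the reference-counting bookkeeping (a sketch in Python rather than Perl; in the real system each `run` would fork a child process as the answer suggests):

```python
# Sketch: each task tracks how many unfinished prerequisites it has;
# a task becomes ready as soon as its count drops to zero.
from collections import defaultdict, deque

def run_partial_order(prereq, run):
    """prereq: dict task -> list of prerequisite tasks; run(task) executes it."""
    remaining = {t: len(deps) for t, deps in prereq.items()}
    dependents = defaultdict(list)
    for t, deps in prereq.items():
        for d in deps:
            dependents[d].append(t)

    order = []
    ready = deque(t for t, n in remaining.items() if n == 0)
    while ready:
        t = ready.popleft()
        run(t)               # in the real system: fork a child process here
        order.append(t)
        for child in dependents[t]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return order

prereq = {"task2": [], "task3": [], "task4": [],
          "task1": ["task2", "task3", "task4"]}
order = run_partial_order(prereq, lambda t: None)
print(order)
```

Everything with a zero count can run concurrently; here task1 only becomes ready once task2, task3 and task4 have all completed.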

Multithreading Task Library, Threading.Timer or threads?

Hi, we are building an application that will have the ability to register scheduled tasks.
Each task has a time interval at which it should be executed.
Each task should have a timeout.
The number of tasks can be unbounded, but is around 100 in normal cases.
So we have a list of tasks that need to be executed at intervals; which is the best solution?
I have looked at giving each task its own timer; when the timer elapses the work is started, and another timer keeps track of the timeout, so if the timeout is reached that second timer stops the thread.
This feels like we are overusing timers? Or could it work?
Another solution is to use a timer for each task, but when the timer elapses we put the task on a queue that is read by some threads that execute the work?
Any other good solutions I should look for?
There is not too much information, but it looks like you could consider Rx as well - check more at MSDN.com.
You can think about your tasks as generated events which should be composed (scheduled) in some way. So you can do the following:
Spawn cancellable tasks with Observable.GenerateWithDisposable and your own Scheduler - check more at Rx 101 Sample
Delay tasks with Observable.Delay
Time out tasks with Observable.Timeout
Compose tasks in any preferable way
Once again, you can check more at the links specified above.
You should check out Quartz.NET.
Quartz.NET is a full-featured, open source job scheduling system that can be used from smallest apps to large scale enterprise systems.
I believe you would need to implement your timeout requirement by yourself but all the plumbing needed to schedule tasks could be handled by Quartz.NET.
I have done something like this before, where there were a lot of socket objects that needed periodic starts and timeouts. I used a 'TimedAction' class with 'OnStart' and 'OnTimeout' events (socket classes etc. derived from this), and one thread that handled all the timed actions. The thread maintained a list of TimedAction instances ordered by the tick time of the next action required (a delta queue). The TimedAction objects were added to the list by queueing them to the thread's input queue. The thread waited on this input queue with a timeout (this was Windows, so 'WaitForSingleObject' on the handle of the semaphore that managed the queue), set to the 'next action required' tick count of the first item in the list. If the queue wait timed out, the relevant action event of the first item in the list was called and the item removed from the list - the next queue wait would then be set by the new first item in the list, which would contain the new nearest action time. If a new TimedAction arrived on the queue, the thread calculated its timeout tick time (GetTickCount + ms interval from the object) and inserted it into the sorted list at the correct place (yes, this sometimes meant moving a lot of objects up the list to make space).
The events called by the timeout handler thread could not take any lengthy actions in order to prevent delays to the handling of other timeouts. Typically, the event handlers would set some status enumeration, signal some synchro object or queue the TimedAction to some other P-C queue or IO completion port.
Does that make sense? It worked OK, processing thousands of timed actions in my server in a reasonably timely and efficient manner.
One enhancement I planned to make was to use multiple lists with a restricted set of timeout intervals. There were only three const timeout intervals used in my system, so I could get away with using three lists, one for each interval. This would mean that the lists would not need sorting explicitly - new TimedActions would always go to the end of their list. This would eliminate costly insertion of objects in the middle of the list/s. I never got around to doing this, as my first design worked well enough and I had plenty of other bugs to fix :(
Two things:
Beware 32-bit tickCount rollover.
You need a loop in the queue timeout block - there may be items on the list with exactly the same, or near-same, timeout tick count. Once the queue timeout happens, you need to remove from the list and fire the events of every due object until the newly calculated timeout time is > 0. I fell foul of this one. Two objects with an equal timeout tick count arrived at the head of the list. One got its events fired, but the system tick count had moved on, and so the calculated timeout tick for the next object was -1: INFINITE! My server stopped working properly and eventually locked up :(
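For illustration, here is a sketch of the same single-thread design in Python, using a heap instead of a sorted list (insertion becomes O(log n)) and a monotonic clock, which sidesteps the 32-bit tick-count rollover; note the inner loop that drains every due action, per the caveat above:

```python
# Sketch: one thread fires all scheduled callbacks, waiting only as long
# as the nearest due time; scheduling a new action wakes it to recompute.
import heapq
import threading
import time

class TimedActionThread(threading.Thread):
    def __init__(self):
        super().__init__(daemon=True)
        self.cv = threading.Condition()
        self.heap = []      # (due_time, seq, callback); seq breaks ties
        self.seq = 0

    def schedule(self, delay, callback):
        with self.cv:
            heapq.heappush(self.heap,
                           (time.monotonic() + delay, self.seq, callback))
            self.seq += 1
            self.cv.notify()            # wake the thread to recompute its wait

    def run(self):
        with self.cv:
            while True:
                now = time.monotonic()
                # A loop, not a single pop: several actions may share the
                # same (or near-same) due time and must all fire now.
                while self.heap and self.heap[0][0] <= now:
                    _, _, cb = heapq.heappop(self.heap)
                    cb()                # handlers must be short-lived
                timeout = self.heap[0][0] - now if self.heap else None
                self.cv.wait(timeout)   # None = wait until something arrives
```

As in the description above, the callbacks themselves should only signal or enqueue; anything lengthy would delay every other timed action.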
Rgds,
Martin

What to use to wait on a indeterminate number of tasks?

I am still fairly new to parallel computing so I am not too sure which tool to use for the job.
I have a System.Threading.Tasks.Task that needs to wait for n tasks to finish before starting. The tricky part is that some of its dependencies may start after this task starts (you are guaranteed never to hit 0 dependent tasks until they are all done).
Here is kind of what is happening:
1. Parent thread creates somewhere between 1 and (NUMBER_OF_CPU_CORES - 1) tasks.
2. Parent thread creates a task to be run when all of the worker tasks are finished.
3. Parent thread creates a monitoring thread.
4. Monitoring thread may kill a worker task or spawn a new task depending on load.
I can figure out everything up to step 4. How do I get the task from step 2 to wait until any new worker tasks created in step 4 have finished?
You can pass an array of the Tasks you're waiting on to TaskFactory.ContinueWhenAll, along with the new task to start after they're all done.
Edit: a possible workaround for your dynamically generated tasks problem: have a two-step continuation; every "dependent task" you start should have a chained ContinueWith which checks the total number of tasks still running, and if it's zero, launches the actual continuation task. That way, every task will do the check when it's done, but only the last one will launch the next phase. You'll need to synchronize access to the "remaining tasks" counter, of course.
