With Activiti it is possible to design parallel tasks; however, these tasks are internally executed sequentially (by the same thread).
I need to execute tasks in an asynchronous way, and then "join" the tasks once they are finished.
The process is:
preparation -> execute task 1
-> execute task 2 at the same time
-> then, once both are finished, go on
It is a matter of optimization, because tasks 1 and 2 are web-service calls and may require a lot of time.
From everything I have read, this is not possible with Activiti. Using async tasks, it is not possible to join them properly (detect that both are finished). The first task to finish is OK, but the second throws an OptimisticLockException and is restarted (which is not acceptable).
Maybe there is something I misunderstood and this is possible, or even easy? Has anyone succeeded in doing it?
I am not sure if I understand your question clearly, but Activiti does support async processing.
To join two async tasks, you can create another task that waits until both async tasks are completed.
Check the code below:
@app.agent()
async def process(stream):
    async for value in stream.take(5000, within=5):
        process(value)  # stands in for per-record processing of the batch
The agent takes 5000 records within 5 seconds asynchronously and processes them. I don't want the agent to pick up another 5000 records until the processing of the previous batch is finished. Basically, I want to run the agent synchronously. Is there a way we can do it?
I think you could set the concurrency to 1 on the agent and that'd effectively render it synchronous.
You might also find modifying the topic partitions to be useful if you do that but I don't have a complete understanding of the relationship between these two settings (just wanted to point out a potentially useful avenue).
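For example, a minimal sketch of what that might look like (the app name, broker URL, and topic are placeholders, and I believe concurrency=1 is the default anyway, but setting it explicitly makes the intent clear):

import faust

app = faust.App("myapp", broker="kafka://localhost:9092")
records_topic = app.topic("records")

# concurrency=1 means a single instance of the agent coroutine, so the next
# batch of 5000 records is only taken once the current batch has been handled.
@app.agent(records_topic, concurrency=1)
async def process(stream):
    async for values in stream.take(5000, within=5):
        ...  # handle the whole batch here before the loop continues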
I tried the following code to see whether the worker executes a second batch of records while the processing of the first batch has not yet finished:
import asyncio

@app.agent()
async def process(stream):
    async for value in stream.take(5000, within=5):
        print(1)
        await asyncio.sleep(30)
        print(2)
The worker printed 1 and waited 30 seconds before printing 2. The await statement gives control back to the event loop, but in this case it waited, which implies that the batches are executed one after another. Hence this is synchronous.
PS: Committing offsets, rebalancing, monitoring, etc. are asynchronous operations handled by the event loop.
I wonder how Camunda manages multiple instances of a sub-process.
For example this BPMN:
Let's say the multi-instance process iterates over a big collection, 500 instances.
I have a function in a web app that calls the endpoint to complete the common user task, and then performs another call to the Camunda engine to get all tasks (in the first API call's callback). I am supposed to get a list of 500 sub-process user tasks (the ones generated by the multi-instance process).
What if the get-tasks call is performed before the Camunda engine has successfully instantiated all sub-processes?
Do I get a partial list of tasks?
How can I detect that the main and sub-processes are ready?
I don't really know if Camunda is able to manage this by itself, so I thought of the following solution, knowing I can only add code in the Modeler environment with Groovy (JavaScript as well, but the code parts already added are all Groovy):
Use a sub-process throw event that is caught in the main process, then, for each signal emitted, count the tasks that are ready and compare that with the expected number of tasks.
Thanks
I would likely spawn the tasks as parallel processes (or 500 of them) and then go to a next step in which I signal or otherwise set a state that indicates the spawning is completed. I would further join the parallel processes all together again and have a task there that signals or otherwise sets a state indicating all the parallel processes are done. See https://docs.camunda.org/manual/7.12/reference/bpmn20/gateways/parallel-gateway/. This way you know exactly at what point (after spawning is done and before the join) you have a chance of getting your 500 spawned sub-processes.
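If it helps, here is a rough sketch of what the web-app side could look like with that approach, in Python against the Camunda REST API. The engine-rest base URL and the spawningComplete process variable are assumptions on my part (i.e., whatever state your "spawning completed" task actually sets):

import time
import requests

BASE = "http://localhost:8080/engine-rest"  # assumed Camunda REST base URL

def wait_for_spawning(process_instance_id, timeout=60):
    # Poll the (hypothetical) spawningComplete variable set after the spawning step.
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(
            f"{BASE}/process-instance/{process_instance_id}/variables/spawningComplete")
        if resp.status_code == 200 and resp.json().get("value") is True:
            return True
        time.sleep(1)  # back off before polling again
    return False

def fetch_subprocess_tasks(process_instance_id):
    # Only fetch the user tasks once the state says spawning is done,
    # so the list is not partial.
    if wait_for_spawning(process_instance_id):
        resp = requests.get(f"{BASE}/task",
                            params={"processInstanceId": process_instance_id})
        return resp.json()
    raise TimeoutError("sub-processes were not fully instantiated in time")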
I have one DAG that has three task streams (licappts, agents, agentpolicy):
For simplicity I'm calling these three distinct streams. The streams are independent in the sense that just because agentpolicy failed doesn't mean the other two (licappts and agents) should be affected by that stream's failure.
But for the sourceType_emr_task_1 tasks (i.e., licappts_emr_task_1, agents_emr_task_1, and agentpolicy_emr_task_1) I can only run one of these tasks at a time. For example I can't run agents_emr_task_1 and agentpolicy_emr_task_1 at the same time even though they are two independent tasks that don't necessarily care about each other.
How can I achieve this functionality in Airflow? For now the only thing I can think of is to wrap that task in a script that somehow locks a global variable, then if the variable is locked I'll have the script do a Thread.sleep(60 seconds) or something, and then retry. But that seems very hacky and I'm curious if Airflow offers a solution for this.
I'm open to restructuring the ordering of my DAG if needed to achieve this. One thing I thought about doing was to make a hard-coded ordering of:
Dag Starts -> ... -> licappts_emr_task_1 -> agents_emr_task_1 -> agentpolicy_emr_task_1 -> DAG Finished
But I don't like combining the streams this way, because then, for example, agentpolicy_emr_task_1 has to wait for the other two to finish before it can start, and there could be times when agentpolicy_emr_task_1 is ready to go before the other two have finished their other tasks.
So ideally I want whichever sourceType_emr_task_1 task is ready first to start, and then block the other tasks from running their sourceType_emr_task_1 task until it's finished.
Update:
Another solution I just thought of: if there is a way for me to check the status of another task, I could create a script for sourceType_emr_task_1 that checks whether either of the other two sourceType_emr_task_1 tasks has a status of running. If so, it would sleep and periodically check whether none of the others are running, in which case it would start its own process. I'm not a big fan of this approach, though, because I feel like it could cause a race condition where both read (at the same time) that none are running and both start running.
You could use a pool to ensure the parallelism for those tasks is 1.
For each of the *_emr_task_1 tasks, set the pool kwarg to something like pool='emr_task'.
Then just go into the webserver -> admin -> pools -> create:
Set the Pool name to match the pool used in your operator, and set Slots to 1.
This will ensure the scheduler will only allow tasks to be queued for that pool up to the number of slots configured, regardless of the parallelism of the rest of Airflow.
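For example, a minimal sketch of the DAG side (the DAG id, dates, and bash commands are placeholders; the relevant part is that all three *_emr_task_1 tasks share the same pool):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("emr_streams", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

# All three tasks point at the same pool, which is configured with 1 slot,
# so the scheduler will only ever run one of them at a time.
licappts_emr_task_1 = BashOperator(
    task_id="licappts_emr_task_1",
    bash_command="echo run licappts EMR step",
    pool="emr_task",
    dag=dag,
)

agents_emr_task_1 = BashOperator(
    task_id="agents_emr_task_1",
    bash_command="echo run agents EMR step",
    pool="emr_task",
    dag=dag,
)

agentpolicy_emr_task_1 = BashOperator(
    task_id="agentpolicy_emr_task_1",
    bash_command="echo run agentpolicy EMR step",
    pool="emr_task",
    dag=dag,
)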
I have a WF 4 application which contains a sequence workflow that has a ParallelFor containing a Sequence with three sequential activities.
The first of these activities is compute-bound (it generates Certificate Signing Requests), the second is IO-bound (it sends emails), and the third is also IO-bound (it updates a database).
I initially developed these all as CodeActivities and then saw that they need to be AsyncCodeActivities to truly run in multi-threaded mode. So I have modified the first, compute-bound activity to be an AsyncCodeActivity and I can see it is being executed multi-threaded. (At least I can observe much higher processor utilisation on my dev machine, which leads me to believe it is now running multi-threaded.)
However, the subsequent tasks remain as non-async CodeActivities. My questions are as follows:
Should I convert the 2nd and 3rd activities to async as well (I suspect this will be the case)?
If not, how is the processing actually executed within the ParallelFor when the first is an AsyncCodeActivity and the 2nd and 3rd are not?
With a Parallel, all child activities are scheduled at the same time. This means they are put in a queue and the scheduler will only execute a single one at a time. With async activities, the start is scheduled and can spawn other threads; the end part is scheduled when the activity is signaled as done and really executes when the scheduler gets around to it.
In practice this means that, for workflows that execute on a server with plenty of other work, async activities are best used for async IO like network or database IO. On a server, adding multiple CPU threads to an already busy system can even slow things down. If the workflow executes on the client, both async IO and CPU work make sense.
I am still fairly new to parallel computing so I am not too sure which tool to use for the job.
I have a System.Threading.Tasks.Task that needs to wait for n number of tasks to finish before starting. The tricky part is that some of its dependencies may start after this task starts (you are guaranteed to never hit 0 dependent tasks until they are all done).
Here is roughly what is happening:
1. Parent thread creates somewhere between 1 and (NUMBER_OF_CPU_CORES - 1) tasks.
2. Parent thread creates a task to be run when all of the worker tasks are finished.
3. Parent thread creates a monitoring thread.
4. Monitoring thread may kill a worker task or spawn a new task depending on load.
I can figure out everything up to step 4. How do I get the task from step 2 to wait to run until any new worker tasks created in step 4 finish?
You can pass an array of the Tasks you're waiting on to TaskFactory.ContinueWhenAll, along with the new task to start after they're all done.
edit: Possible workaround for your dynamically generated tasks problem: have a two-step continuation; every "dependent task" you start should have a chained ContinueWith which checks the total number of tasks still running, and if it's zero, launches the actual continuation task. That way, every task will do the check when it's done, but only the last one will launch the next phase. You'll need to synchronize access to the "remaining tasks" counter, of course.
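As a rough illustration of that two-step idea (sketched here in Python with concurrent.futures purely to show the counter pattern, not the actual TPL API; remaining, spawn_worker, and run_continuation are made-up names):

import threading
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor()
lock = threading.Lock()
remaining = 0  # number of worker tasks still running

def run_continuation():
    print("all workers finished, starting the continuation task")

def on_done(future):
    # Chained "ContinueWith"-style check: every worker runs this when it finishes,
    # but only the last one to finish launches the next phase.
    global remaining
    with lock:  # synchronized access to the counter
        remaining -= 1
        last_one = (remaining == 0)
    if last_one:
        run_continuation()

def spawn_worker(work, *args):
    # Can be called at any time, even while other workers are running; the counter
    # is incremented before submission so it never hits 0 until all work is done.
    global remaining
    with lock:
        remaining += 1
    future = pool.submit(work, *args)
    future.add_done_callback(on_done)
    return future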