In a context where you are deploying a web role over multiple instances and need to schedule a task that should be done by one instance only (like sending an email to the site admin with some stats), how reliable is it to use RoleEnvironment.CurrentRoleInstance.Id to make the task run on one instance only (for example, only running it if the Id ends with IN_0)?
If anyone has ever done this, I'd be interested in their feedback.
I wouldn't use instance ID. What happens if instance 0 gets rebooted (which happens at least once per month)? Now your scheduler or task-runner is offline.
An alternate solution is to use a kind of mutex that spans instances. The one I'm thinking of is a blob lease. You can actually acquire a lease on a blob for writing (and there can only be one lease-holder). You could attempt to get a blob lease before running a task: if you get it, run the task; if you don't, skip it.
A slight variation: In a thread (let's say started from your Run() method), attempt to acquire a lease and if successful, launch a scheduler task (maybe a thread or something). If you cannot acquire the lease, sleep for a minute and try again. Eventually, the instance with the lease will be rebooted (or it'll disappear for some other reason). After a few seconds, another instance will acquire the abandoned lease and start up a new scheduler task.
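A minimal sketch of this lease-as-mutex pattern, using the azure-storage-blob Python SDK purely for illustration (the original question is about .NET web roles). The connection string, the "locks" container, the "scheduler-lock" blob and run_scheduled_task() are placeholders, not part of the question:

# Sketch only: the storage account, container, blob name and task are placeholders.
import time
from azure.core.exceptions import HttpResponseError, ResourceExistsError
from azure.storage.blob import BlobClient

def run_scheduled_task():
    print("only one instance at a time gets here")   # placeholder for the real work

blob = BlobClient.from_connection_string(
    "<storage connection string>", container_name="locks", blob_name="scheduler-lock")

try:
    blob.upload_blob(b"", overwrite=False)   # create the lock blob once; ignore if it already exists
except ResourceExistsError:
    pass

while True:
    try:
        lease = blob.acquire_lease(lease_duration=60)   # only one holder at a time
    except HttpResponseError:
        time.sleep(60)                                  # another instance holds the lease; try later
        continue
    try:
        run_scheduled_task()    # long-running work would need periodic lease.renew() calls
    finally:
        lease.release()
    time.sleep(60)

Because the lease expires after 60 seconds if the holder dies, an abandoned lock is picked up by another instance, which is exactly the failover behaviour described above.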
Steve Marx wrote a blog post about concurrency using leases. Tyler Doerksen also has a good post about leases.
Yes, you can use the InstanceId if you specifically need it:
<Startup>
  <Task commandLine="StartUpTasks\WindowService\InstallWindowService.bat" executionContext="elevated" taskType="background">
    <Environment>
      <Variable name="InstanceId">
        <RoleInstanceValue xpath="/RoleEnvironment/CurrentInstance/@id"/>
      </Variable>
    </Environment>
  </Task>
</Startup>
It will be of the following form:
<deployment Id>.<Application Name>.<Role Name>_IN_<index>
For example: MyRole_IN_0, MyRole_IN_1.
Access the environment variable in the batch file like this:
%InstanceId%
You can then use a substring or the last index of _ to get the index from the InstanceId.
The instance with index 0 will keep the same index even after a reboot.
More Details
http://blogs.msdn.com/b/cclayton/archive/2012/05/17/windows-azure-start-up-tasks-part-2.aspx
http://msdn.microsoft.com/en-us/library/windowsazure/hh404006.aspx
It's possible to have a block of code run on only one instance when you have multiple instances, for example by checking the ID of the role instance you are currently executing on.
You could achieve the same result with other solutions, but those might require some more work, like decoupling the task from your instance.
Related
I have one DAG that has three task streams (licappts, agents, agentpolicy):
For simplicity I'm calling these three distinct streams. The streams are independent in the sense that if agentpolicy fails, the other two (licappts and agents) should not be affected by its failure.
But for the sourceType_emr_task_1 tasks (i.e., licappts_emr_task_1, agents_emr_task_1, and agentpolicy_emr_task_1) I can only run one of these tasks at a time. For example I can't run agents_emr_task_1 and agentpolicy_emr_task_1 at the same time even though they are two independent tasks that don't necessarily care about each other.
How can I achieve this functionality in Airflow? For now the only thing I can think of is to wrap that task in a script that somehow locks a global variable, then if the variable is locked I'll have the script do a Thread.sleep(60 seconds) or something, and then retry. But that seems very hacky and I'm curious if Airflow offers a solution for this.
I'm open to restructuring the ordering of my DAG if needed to achieve this. One thing I thought about doing was to make a hard coded ordering of
Dag Starts -> ... -> licappts_emr_task_1 -> agents_emr_task_1 -> agentpolicy_emr_task_1 -> DAG Finished
But I don't think combining the streams this way is a good idea, because then, for example, agentpolicy_emr_task_1 has to wait for the other two to finish before it can start, and there could be times when agentpolicy_emr_task_1 is ready to go before the other two have finished their other tasks.
So ideally I want whichever sourceType_emr_task_1 task is ready first to start, and then block the other tasks from running their sourceType_emr_task_1 task until it has finished.
Update:
Another solution I just thought of: if there were a way for me to check on the status of another task, I could create a script for sourceType_emr_task_1 that checks whether either of the other two sourceType_emr_task_1 tasks has a status of running; if so, it sleeps and periodically checks whether none of the others are running, in which case it starts its process. I'm not a big fan of this approach, though, because I feel it could cause a race condition where both read (at the same time) that none are running and both start running.
You could use a pool to ensure the parallelism for those tasks is 1.
For each of the *_emr_task_1 tasks, set the pool kwarg to something like pool='emr_task'.
Then just go into the webserver -> admin -> pools -> create:
Set the Pool name to match the pool used in your operator, and the Slots to 1.
This will ensure the scheduler will only allow tasks to be queued for that pool up to the number of slots configured, regardless of the parallelism of the rest of Airflow.
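A minimal sketch of what the pool kwarg might look like on those three tasks; the DAG id, schedule and bash commands are placeholders, not from the question:

# Sketch: three mutually exclusive EMR tasks sharing a one-slot pool named "emr_task".
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("emr_streams", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

for source in ["licappts", "agents", "agentpolicy"]:
    BashOperator(
        task_id="{}_emr_task_1".format(source),
        bash_command="echo run EMR step for {}".format(source),  # placeholder command
        pool="emr_task",   # all three share this pool; with Slots = 1 only one runs at a time
        dag=dag,
    )

Any operator type works the same way; only the pool name has to match the pool created in the web UI.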
I need to design a thread pool system, in Python in this case, but I'm more interested in the general methodology.
It has to be something along the lines of https://www.metachris.com/2016/04/python-threadpool/, where threads wait idling until some jobs are pushed into the pool. How that works, using condition variables, is clear to me.
I have one additional requirement though: the jobs I'm pushing into the pool cannot all run in parallel. Each of them has a class (I don't mean the object class here, just a simple integer that somehow classifies the job) and only one job per class can be running at the same time. If a job is pushed having the same class as a job that is currently running, it has to wait in the queue until the latter is done.
I have already modified the mentioned class to do this, but what I achieved is pretty messy and I'm not sure it's reliable, so I would ask what modifications would be suggested or whether I should use a totally different approach. Again: I don't need the code, but rather a description.
Thanks.
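One possible shape for such a pool, sketched in Python; the class and method names are illustrative, not from the linked post. The idea is an ordinary condition-variable worker pool plus a set of job classes currently running: workers skip over queued jobs whose class is busy and are woken again when any job finishes.

import threading
from collections import deque

class ClassAwareThreadPool:
    """Sketch: fixed pool of workers, at most one running job per job class."""

    def __init__(self, num_workers):
        self._cond = threading.Condition()
        self._pending = deque()    # queued (job_class, func, args) tuples, FIFO
        self._running = set()      # job classes currently being executed
        for _ in range(num_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, job_class, func, *args):
        with self._cond:
            self._pending.append((job_class, func, args))
            self._cond.notify()

    def _next_runnable(self):
        # First queued job whose class is not already running, or None.
        for i, (job_class, func, args) in enumerate(self._pending):
            if job_class not in self._running:
                del self._pending[i]
                return job_class, func, args
        return None

    def _worker(self):
        while True:
            with self._cond:
                job = self._next_runnable()
                while job is None:
                    self._cond.wait()
                    job = self._next_runnable()
                job_class, func, args = job
                self._running.add(job_class)
            try:
                func(*args)
            finally:
                with self._cond:
                    self._running.discard(job_class)
                    self._cond.notify_all()   # a queued job of this class may be runnable now

Keeping the "busy classes" bookkeeping inside the same condition variable as the queue is what avoids the messy special cases: the only scheduling decision is "first pending job whose class is free".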
I am using the Quartz scheduler to schedule the process of downloading a file from SFTP. The job is triggered every 2 hours, but sometimes, due to a huge file size, downloading takes longer and the process is restarted before it completes. Is there any way to hold the scheduler from triggering the same job again until the previous run finishes processing?
I am using Quartz 1.8.5.
Below is the code:
<flow name="quartzCronModel">
  <quartz:inbound-endpoint connector-ref="Quartz"
      jobName="cron-job" cronExpression="${database_download_timer}"
      encoding="UTF-8">
    <quartz:event-generator-job />
  </quartz:inbound-endpoint>
  <component doc:name="Download Database"
      class="com.org.components.sftp.FileTransfer">
    <method-entry-point-resolver acceptVoidMethods="true">
      <include-entry-point method="execute" />
    </method-entry-point-resolver>
  </component>
</flow>
I am reading cron expression from a properties file.
Your job will need to implement the StatefulJob interface. It is a marker interface that tells Quartz that it should not trigger the job if it is still running. In other words, it prevents concurrent executions of the job.
It has been a long time since this question was asked. Jan Moravec answered correctly, but in the meantime that class has been deprecated. According to the Quartz documentation, it is currently best to use the @DisallowConcurrentExecution and/or @PersistJobDataAfterExecution annotations instead.
I hope it will be useful
I have an application that has to launch jobs repeatedly. But (yes, that would have been too easy without a but...) I would like users to define their backup frequency in the application.
In the worst case, they would have to choose between:
weekly,
daily,
every 12 hours,
every 6 hours,
hourly
In the best case, they should be able to use crontab expressions (see the documentation for an example).
How can I do this? Do I launch a job every minute that checks the last execution time and frequency, and then launches another job if needed? Do I create a sort of queue that will be executed by a master job?
Any clues, ideas, opinions, best practices, and experiences are welcome!
EDIT: Solved this problem using the Akka scheduler. OK, this is a technical solution, not a design answer, but still everything works great.
Each user-defined repetition is an actor that sends a message every period to a new actor that executes the actual job.
There may be two ways to do this depending on your requirements/architecture:
If you can only use Play:
The user creates the job and the frequency at which it will run (crontab, whatever).
On saving the job, you calculate the first time it will have to be run. You then add an entry to a table JOBS with the execution time, job id, and any other information required. This is required as Play is stateless and information must be stored in the DB for later retrieval.
You have a job that queries the table for entries whose execution date is earlier than now. It retrieves the first one, runs it, removes it from the table, and adds a new entry for the next execution. You should keep some execution counter so that if a task fails (which means the entry is not removed from the DB) it won't block execution of the other tasks by being retried again and again.
The frequency of this job is set to every second. That way, while there is information in the table, the requests are executed roughly as often as they are required. As Play won't spawn a new instance of the job while the current one is working, this one job will serve all the tasks if there are enough of them. If not, it will be killed at some point and restored when required.
Of course, the users' crons will not be too precise, as you have to account for your own cron delays plus execution delays on all the tasks in the queue, which will be run sequentially. Not the best approach, unless you somehow disallow crons which run every second or more often than every minute (to be safe). Checking the execution time of the crons and killing them if they run over a certain amount of time would be a good idea.
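A rough sketch of that polling job, written in Python only for brevity (a real implementation would live inside Play). The jobs table layout and the run_task / next_run_after helpers are illustrative placeholders, not from the answer:

# Sketch: pick the oldest due entry, run it, then reschedule the next execution.
from datetime import datetime, timedelta

def run_task(job_id):
    print("running job", job_id)             # placeholder for the user's actual task

def next_run_after(job_id, now):
    return now + timedelta(hours=6)          # placeholder: derive from the job's frequency/cron

def poll_once(db):                           # db: an open sqlite3/DB-API connection
    now = datetime.utcnow()
    row = db.execute("SELECT id, job_id, attempts FROM jobs "
                     "WHERE run_at <= ? ORDER BY run_at LIMIT 1",
                     (now.isoformat(),)).fetchone()
    if row is None:
        return
    entry_id, job_id, attempts = row
    if attempts < 3:                          # execution counter so a failing task can't block the rest
        db.execute("UPDATE jobs SET attempts = attempts + 1 WHERE id = ?", (entry_id,))
        db.commit()
        run_task(job_id)                      # if this throws, the entry stays and is retried later
    db.execute("DELETE FROM jobs WHERE id = ?", (entry_id,))
    db.execute("INSERT INTO jobs (job_id, run_at, attempts) VALUES (?, ?, 0)",
               (job_id, next_run_after(job_id, now).isoformat()))
    db.commit()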
If you can use more than Play:
The better alternative, I believe, is to use Quartz (see this) to create a future execution when the user creates the job, and reprogram it once the execution is over.
There was a discussion on google-groups about it. As far as I remember, you must define a job which starts every 6 hours and checks which backups must be done. So you must remember when the last backup job finished and do the checking yourself. I'm unsure if Quartz can handle such a requirement.
I looked in the source code (always a good source ;-)) and found a method every, which I think should do what you want. However, I'm unsure if this is a clever design, because if you have 1000 users you will then have 1000 jobs. I'm unsure if Play was built to handle such a large number of jobs.
[Update] For cron expressions you should have a look at JobPlugin.scheduleForCRON().
There are several ways to solve this.
If you don't have a really huge load of jobs, I'd just persist them to a table with the required flexibility. Then check all of them every hour (or the lowest interval you support) and run those that are eligible. Simple.
Or, if you prefer to use cron syntax anyway, just write (export) jobs to a user crontab using a wrapper which calls back to your running app, or starts the job in a standalone process if that's possible.
I have a WF 4 application which contains a sequence workflow that has a ParallelFor containing a Sequence with three sequential activities.
The first of these activities is compute-bound (it generates Certificate Signing Requests), the second is IO-bound (it sends emails), and the third is also IO-bound (it updates a database).
I initially developed these all as CodeActivities and then saw that they need to be AsyncCodeActivities to truly run in multi-threaded mode. So I have modified the first, compute-bound activity into an AsyncCodeActivity and I can see it is being executed multi-threaded. (At least I can observe much higher processor utilisation on my dev machine, which leads me to believe it is now running multi-threaded.)
However the subsequent tasks remain as non-Async CodeActivities. My questions are as follows:
Should I convert the 2nd and 3rd activities to Async as well (I suspect this will be the case)?
If not, how is the processing actually executed within the ParallelFor when the first is an AsyncCodeActivity and the 2nd and 3rd are not?
With all child activities in a Parallel, they are scheduled at the same time. This means they are put in a queue and the scheduler will only execute a single one at a time. With async activities, this means that the start is scheduled and can spawn other threads, and the end part is scheduled when the activity is signaled as done and really executes when the scheduler gets around to it.
In practice this means that, for workflows that execute on a server with plenty of other work, the async activity is best used for async IO like network or database IO. On a server, adding multiple CPU threads to an already busy system can even slow things down. If the workflow executes on the client, both async IO and CPU-bound work make sense.