I am using the bull package from npm ("npm i bull") to manage a queue. I have it mostly figured out and working, but there seems to be something I don't understand in the configuration.
In the configuration for a new queue there is this option:
maxStalledCount: number = 1; // Max amount of times a stalled job will be re-processed.
This is from the reference page on their GitHub.
And then there is another option that you can define:
attempts: number; // The total number of attempts to try the job until it completes.
I should mention that this is relevant for failing jobs.
Firstly, it seems that only attempts actually determines anything; regardless of what maxStalledCount is set to, the script only follows the number of attempts.
For example: if I set attempts to 3 and maxStalledCount to 1, it will STILL do 3 attempts and then move the job to failed when it has "ran out of attempts".
Different example: if I set attempts to 1 and maxStalledCount to 3, it will only do 1 attempt before moving the job to failed.
Can someone explain the difference? I could not find anything online.
Ultimately, what I want my queue to do is attempt a job up to 5 times, then move it to failed, and let me fetch all the failed jobs at a later time to retry them. How would I configure that?
Added a link to the reference page: https://github.com/OptimalBits/bull/blob/develop/REFERENCE.md
Thanks.
There is a difference between a 'stalled' job and a 'failed' job. According to this note in the bull docs, a job is considered stalled when:
The Node process running your job processor unexpectedly terminates.
Your job processor was too CPU-intensive and stalled the Node event loop, and as a result, Bull couldn't renew the job lock.
maxStalledCount is a safeguard so problematic jobs won't get restarted indefinitely.
If you are dealing with jobs that fail (i.e. the processor throws or rejects), it is the attempts option that dictates how many times the job is tried before it is moved to failed.
As for your desired behavior:
Set your attempts option to 5.
At some later time, gather an array of the failed jobs (note that getFailed() returns a promise):
const failedJobs = await cacheQueue.getFailed();
Retry the failed jobs with:
failedJobs.forEach(job => job.retry());
Because "reasons", we know that when we use azureml-sdk's HyperDriveStep we expect a number of HyperDrive runs to fail -- normally around 20%. How can we handle this without failing the entire HyperDriveStep (and then all downstream steps)? Below is an example of the pipeline.
I thought there would be a HyperDriveRunConfig param to allow for this, but it doesn't seem to exist. Perhaps this is controlled on the Pipeline itself with the continue_on_step_failure param?
The workaround we're considering is to catch the failed run within our train.py script and manually log the primary_metric as zero.
Thanks for your question.
I'm assuming that HyperDriveStep is one of the steps in your Pipeline and that you want the remaining Pipeline steps to continue when the HyperDriveStep fails; is that correct?
Enabling continue_on_step_failure should allow the rest of the pipeline steps to continue when any single step fails.
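A minimal sketch of what that could look like when submitting the pipeline (this assumes a workspace config file is present and uses a hypothetical compute target name "cpu-cluster"; the PythonScriptStep here is only a stand-in for your HyperDriveStep and its downstream steps):

from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Stand-in step; in your pipeline this list would contain the HyperDriveStep
# and the steps that depend on it.
train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    compute_target="cpu-cluster",  # hypothetical compute target name
    source_directory=".",
)

pipeline = Pipeline(workspace=ws, steps=[train_step])

# continue_on_step_failure lets the remaining steps run even if one step fails.
pipeline_run = Experiment(ws, "hyperdrive-pipeline").submit(
    pipeline,
    continue_on_step_failure=True,
)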
Additionally, the HyperDrive run consists of multiple child runs, controlled by the HyperDriveConfig. If the first 3 child runs explored by HyperDrive fail (e.g. with user script errors), the system automatically cancels the entire HyperDrive run, in order to avoid further wasting resources.
Are you looking to continue other Pipeline steps when the HyperDriveStep fails, or are you looking to continue other child runs within the HyperDrive run when the first 3 child runs fail?
Thanks!
I am submitting jobs in an array. Occasionally one job will error because of a difficult-to-diagnose GPU memory issue. Simply rerunning the job results in success.
What I would like to do is catch this error, log it, and put the job back into slurm's queue to be rerun. If this is not possible to do with an array job, that's fine, it's not essential to use arrays (though it is preferred).
I've tried playing around with sbatch --rerun, but this doesn't seem to do what I want (I think this option is for rerunning after a hardware error detected by slurm, or a node is restarted when a job is running - this isn't the case for my jobs).
Any advice well received.
If you can detect the GPU memory issue, you can end your submission job with a construct like this:
if <gpu memory issue>; then  # replace the placeholder with a real test, e.g. your program's exit status
    scontrol requeue $SLURM_JOBID
fi
This will put the job back in the scheduling queue and it will be restarted as is. Interestingly, the SLURM_RESTART_COUNT environment variable holds the number of times the job was re-queued.
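If the job itself is a Python script, one way (a sketch, not anything Slurm-specific) to make the GPU memory error detectable from that if test is to have the script exit with a distinctive status code. Here run_job.py, do_work and GPU_OOM_EXIT_CODE are hypothetical names, and the assumption is that the error surfaces as a RuntimeError whose message mentions "out of memory":

# run_job.py -- hypothetical wrapper around the real workload
import sys

GPU_OOM_EXIT_CODE = 99  # arbitrary, but distinct from codes the job normally returns

def do_work():
    ...  # your actual GPU workload goes here

if __name__ == "__main__":
    try:
        do_work()
    except RuntimeError as exc:
        if "out of memory" in str(exc).lower():
            sys.exit(GPU_OOM_EXIT_CODE)  # signals the GPU memory issue to the batch script
        raise

The submission script can then test the script's exit status (e.g. [ $? -eq 99 ]) in place of the <gpu memory issue> placeholder, and compare $SLURM_RESTART_COUNT against a limit so a persistently failing job is not requeued forever.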
I have one DAG that has three task streams (licappts, agents, agentpolicy).
For simplicity I'm calling these three distinct streams. The streams are independent in the sense that just because agentpolicy failed doesn't mean the other two (licappts and agents) should be affected by that stream's failure.
But for the sourceType_emr_task_1 tasks (i.e., licappts_emr_task_1, agents_emr_task_1, and agentpolicy_emr_task_1) I can only run one of these tasks at a time. For example I can't run agents_emr_task_1 and agentpolicy_emr_task_1 at the same time even though they are two independent tasks that don't necessarily care about each other.
How can I achieve this functionality in Airflow? For now the only thing I can think of is to wrap that task in a script that somehow locks a global variable, then if the variable is locked I'll have the script do a Thread.sleep(60 seconds) or something, and then retry. But that seems very hacky and I'm curious if Airflow offers a solution for this.
I'm open to restructuring the ordering of my DAG if needed to achieve this. One thing I thought about doing was to make a hard coded ordering of
Dag Starts -> ... -> licappts_emr_task_1 -> agents_emr_task_1 -> agentpolicy_emr_task_1 -> DAG Finished
But I don't think combining the streams this way is a good idea, because then, for example, agentpolicy_emr_task_1 has to wait for the other two to finish before it can start, and there could be times when agentpolicy_emr_task_1 is ready to go before the other two have finished their other tasks.
So ideally I want whichever sourceType_emr_task_1 task is ready first to start, and then block the other tasks from running their sourceType_emr_task_1 task until it has finished.
Update:
Another solution I just thought of: if there is a way for me to check the status of another task, I could create a script for sourceType_emr_task_1 that checks whether either of the other two sourceType_emr_task_1 tasks has a status of running. If one does, it will sleep and periodically check until none of the others are running, at which point it will start its process. I'm not a big fan of this approach though, because I feel like it could cause a race condition where both read (at the same time) that none are running and both start running.
You could use a pool to ensure the parallelism for those tasks is 1.
For each of the *_emr_task_1 tasks, set the pool kwarg to something like pool="emr_task".
Then just go into the webserver -> admin -> pools -> create:
Set the Pool field to match the pool name used in your operator, and Slots to 1.
This will ensure the scheduler will only allow tasks to be queued for that pool up to the number of slots configured, regardless of the parallelism of the rest of Airflow.
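A rough sketch of what the task side could look like (this assumes Airflow 1.x-style imports and uses a BashOperator as a placeholder for whatever operator actually runs your EMR step; the pool name "emr_task" must match the pool created in the UI, and the other tasks and dependencies of each stream are omitted):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("emr_streams", start_date=datetime(2018, 1, 1), schedule_interval=None)

for source in ["licappts", "agents", "agentpolicy"]:
    BashOperator(
        task_id="{}_emr_task_1".format(source),
        bash_command="echo run EMR step for {}".format(source),  # placeholder command
        pool="emr_task",  # single-slot pool, so at most one of these runs at a time
        dag=dag,
    )

Whichever of the three becomes ready first grabs the slot, and the other two stay queued until it is released, which is the behavior described in the question.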
For every 1000 messages, 1 message runs for 20 minutes or more, while the other messages complete in less than 1 second. What could be the reason? I don't know whether it is ever going to complete.
Some messages go to a "Never Finished" state rather than Success or Failure. What could be the reason? I think my function has no issues; if there were any, we would be logging them.
If the message processing is periodically taking a long time (or not finishing at all), it must be that every now and then the operations in your job function take a long time or fail. It all depends on what your job is actually doing internally. If it is going async in places, the SDK will continue to wait for it to return. We did add a new feature very recently, TimeoutAttribute (see the release notes: http://github.com/Azure/azure-webjobs-sdk/wiki/Release-Notes). The Dashboard should show any function errors.
If you suspect that your job may be hanging/failing at certain places, you might try verifying locally that this is handled correctly by your logging etc. You could add Task.Delays or errors at various spots and verify that it's logged/handled correctly.
In my code I run a cron job every five seconds, and I've been getting the same WARNING ever since.
This is the API call that I used:
sched.add_cron_job(test_3, second="*/5")
And I get a warning:
WARNING:apscheduler.scheduler:Execution of job "test_3 (trigger: cron[second='*/5'], next run at: 2013-11-28 15:56:30)" skipped: maximum number of running instances reached (1)
I tried giving a time gap of 2 minutes; it doesn't solve the issue.
Help me in overcoming this issue.
I used proc.terminate() to stop the execution of my method, so that the instance of the first thread is terminated before a new one can start.
Also provide a timing mechanism so that your process completes well within the scheduled time, say within a minute, hour, or day. In my application I used sleep(in_seconds) to provide the timing mechanism.
I had a similar problem, and it turned out it was just the job ('test_3' in your case) lasting too long, more than 5 seconds (or the 2 minutes you tried).
APScheduler is trying to re-execute your job, but the previous run is still in progress.
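One way to confirm this, and to allow overlap if that is actually safe for your workload, is sketched below. It reuses the sched object from the question; do_work and the 8-second sleep are only illustrative, and max_instances assumes your APScheduler version supports that job option:

import time

def do_work():
    time.sleep(8)  # stand-in for the real task, deliberately longer than the 5-second interval

def test_3():
    start = time.time()
    do_work()
    print("test_3 took %.1fs (interval is 5s)" % (time.time() - start))

# Either make test_3 finish within its interval, or explicitly allow
# overlapping instances:
sched.add_cron_job(test_3, second="*/5", max_instances=2)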