Automatically rerun jobs submitted with sbatch --array upon error - slurm

I am submitting jobs in an array. Occasionally one job will error because of a difficult-to-diagnose GPU memory issue. Simply rerunning the job results in success.
What I would like to do is catch this error, log it, and put the job back into Slurm's queue to be rerun. If this is not possible with an array job, that's fine; it's not essential to use arrays (though it is preferred).
I've tried playing around with sbatch --requeue, but this doesn't seem to do what I want (I think this option is for rerunning after a hardware failure detected by Slurm, or when a node is restarted while a job is running; neither is the case for my jobs).
Any advice well received.

If you can detect the GPU memory issue, you can end your submission script with a construct like this:
if <gpu memory issue>; then
    scontrol requeue $SLURM_JOBID
fi
This will put the job back in the scheduling queue and it will be restarted as is. Interestingly, the SLURM_RESTART_COUNT environment variable holds the number of times the job was re-queued.
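For example, a minimal sketch of a batch script built on that construct, which logs the failure and stops requeueing after a few tries; the program name, log file and retry limit are placeholders, and detecting the GPU issue via the exit code is an assumption:

#!/bin/bash
#SBATCH --job-name=gpu_job

srun ./my_gpu_program                  # hypothetical job step; replace with your own
if [ $? -ne 0 ]; then
    echo "Job $SLURM_JOBID failed (restart count: ${SLURM_RESTART_COUNT:-0})" >> failures.log
    if [ "${SLURM_RESTART_COUNT:-0}" -lt 3 ]; then
        scontrol requeue $SLURM_JOBID
    fi
fi

In an array job, each running task has its own SLURM_JOBID, so requeueing from inside a task should only put that task back in the queue.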

Related

Do submitted jobs take a copy of the source? Queued jobs?

When submitting jobs with sbatch, is a copy of my executable taken to the compute node? Or does it just execute the file from /home/user/? Sometimes, when I am unorganised, I will submit a job, then change the source and recompile to submit another job. This does not seem like a good idea, especially if the job is still in the queue. At the same time it seems like it should be allowed, and it would be much safer if a copy of the source were made at the moment of calling sbatch.
I ran some tests which confirmed (unsurprisingly) that once a job is running, recompiling the source code has no effect. But when the job is in the queue, I am not sure. It is difficult to test.
edit: man sbatch does not seem to give much insight, other than to say that the job is submitted to the Slurm controller "immediately".
The sbatch command creates a copy of the submission script and a snapshot of the environment and saves it in the directory listed as the StateSaveLocation configuration parameter. The submission script can therefore be changed after submission without effect.
But that is not the case for the files used by the submission script. If your submission script starts an executable, it will see the version of the executable that exists at the time it starts.
Modifying the program before it starts will lead to the new version being run; modifying it during the run (i.e. after it has already been read from disk and loaded into memory) will leave the old version running.
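If you want the behaviour the question asks for (a snapshot taken at the moment of submission), one workaround is to copy the executable yourself before calling sbatch and hand the copy's path to the job. A rough sketch, with hypothetical file names:

SNAP=./myprog.run_$(date +%s)      # unique, timestamped copy of the executable
cp ./myprog "$SNAP"
sbatch job.sh "$SNAP"              # inside job.sh, run "$1" instead of ./myprog

Recompiling ./myprog while the job waits in the queue then no longer affects that job.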

What is the difference between Run all at once and Run in sequence when creating a Cognos job?

What is the difference between the Run all at once and Run in sequence modes?
If a job fails in either of those two modes, what will happen? (Will the scheduled report email be sent or not?)
Let's say your job has 10 reports in it.
In sequence mode, the first report will run. When it's done, the second will run. Then the 3rd, etc.
In "all at once" mode, the job doesn't wait for anything to finish. All reports are submitted at once and begin running.
Provided the server has the appropriate resources to do so, running the job all at once will complete everything sooner, but it does put a much larger strain on the system. I have jobs where running in sequence ends up being faster because running them all at once is such a bottleneck that it slows down everything to a crawl.
Failure depends on what you select: in either case you can have the job stop if any component fails, or continue running the other reports if one fails. However, if any component fails, the job status in your schedule list will still show the overall status of the job as "Failed", and the run history details will show you exactly where the failure was.

Bull queue package configuration confusion

I am using the bull package from npm to manage a queue ("npm i bull"). I have it mostly figured out and working, but there seems to be something that I don't understand in the configuration.
In the configuration for a new queue there is this option:
maxStalledCount: number = 1; // Max amount of times a stalled job will be re-processed.
This is from the reference page of their GitHub repository.
And then there is another option that you can define:
attempts: number; // The total number of attempts to try the job until it completes.
I should mention that this is relevant for failing jobs.
Firstly, it seems that only attempts actually determines anything; regardless of what is in maxStalledCount, the script will only follow the number of attempts set.
For example: if I set attempts to 3 and maxStalledCount to 1, it will still do 3 attempts and then move the job to failed when it has "run out of attempts".
A different example: if I set attempts to 1 and maxStalledCount to 3, it will only do 1 attempt before throwing the job into failed.
Can someone explain the difference? I could not find anything online.
Ultimately, what I want my queue to do is attempt something up to 5 times, then move it to failed, and be able to get all the failed jobs at a later time to retry them. How would I configure that?
Added a link to the reference page: https://github.com/OptimalBits/bull/blob/develop/REFERENCE.md
Thanks.
There is a difference between a 'stalled' job and a 'failed' job. According to this note in the bull docs, a job is considered stalled when:
The Node process running your job processor unexpectedly terminates.
Your job processor was too CPU-intensive and stalled the Node event loop, and as a result, Bull couldn't renew the job lock.
maxStalledCount is a safeguard so problematic jobs won't get restarted indefinitely.
If you are dealing with failed jobs, the attempts option dictates the number of attempts.
As for your desired behavior:
Set the attempts option to 5 when adding jobs to the queue.
At some later time, gather an array of failed jobs with:
const failedJobs = await cacheQueue.getFailed(); // getFailed() returns a promise
Retry the failed jobs with:
failedJobs.forEach(job => job.retry());
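Putting that together, a rough sketch; the queue name, Redis URL and job payload are placeholders, and error handling is omitted:

const Queue = require('bull');

const cacheQueue = new Queue('cache', 'redis://127.0.0.1:6379');

// Jobs added with attempts: 5 are retried up to 5 times before landing in the failed set.
cacheQueue.add({ url: 'https://example.com' }, { attempts: 5 });

cacheQueue.process(async (job) => {
    // do the work; throwing an error here counts as one failed attempt
});

// Later, for example from a maintenance script:
async function retryFailedJobs() {
    const failedJobs = await cacheQueue.getFailed();
    for (const job of failedJobs) {
        await job.retry();   // puts the job back into the queue to be processed again
    }
}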

Timeout a pyspark job

TL;DR
Is there a way to timeout a pyspark job? I want a spark job running in cluster mode to be killed automatically if it runs longer than a pre-specified time.
Longer Version:
The cryptic timeouts listed in the documentation are at most 120 s, except one which is infinity, but that one is only used if spark.dynamicAllocation.enabled is set to true, and by default (I haven't touched any config parameters on this cluster) it is false.
I want to know because I have code that, for a particular pathological input, will run extremely slowly. For expected input the job will terminate in under an hour. Detecting the pathological input is as hard as trying to solve the problem, so I don't have the option of doing clever preprocessing. The details of the code are boring and irrelevant, so I'm going to spare you having to read them =)
I'm using pyspark, so I was going to decorate the function causing the hang-up like this, but it seems that this solution doesn't work in cluster mode. I call my Spark code via spark-submit from a bash script, but as far as I know bash "goes to sleep" while the Spark job is running and only gets control back once the Spark job terminates, so I don't think this is an option.
Actually, the bash thing might be a solution if I did something clever, but I'd have to get the driver ID for the job like this, and by now I'm thinking "this is too much thought and typing for something as simple as a timeout, which ought to be built in."
You can set a classic Python alarm. Then, in the handler function, you can raise an exception or call sys.exit() to finish the driver code. As the driver finishes, YARN kills the whole application.
You can find example usage in documentation: https://docs.python.org/3/library/signal.html#example
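A minimal sketch of that idea in the driver script; the time limit and the run_job() entry point are placeholders for your own code:

import signal
import sys

def timeout_handler(signum, frame):
    # Exiting the driver makes YARN kill the whole application.
    print("Job exceeded the time limit, exiting.", file=sys.stderr)
    sys.exit(1)

signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(2 * 60 * 60)    # fire after two hours; adjust to your expected runtime

run_job()                    # hypothetical function containing the actual Spark work
signal.alarm(0)              # cancel the alarm if the job finished in time

Note that signal.alarm() is only available on Unix, which is normally where a YARN driver runs.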

qsub: What is the standard way to get occasional updates on a submitted job?

I have just begun using an HPC, and I'm having trouble adjusting my workflow.
I submit a job using qsub myjob.sh. Then I can view the status of the job by typing qstat -u myusername. This gives me some details about my job, such as how long it has been running for.
My job is a python script that occasionally prints an update to indicate how things are going in the program. I know that this output will instead be found in the output file once the job is done, but how can I go about monitoring the program as it runs? One way is to print the output to a file instead of printing it to the screen, but this seems like a bit of a hack.
Any other tips on improving this process would be great.
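For what it's worth, the workaround mentioned above (writing to a file) can be as simple as forcing unbuffered output and following the file from a login node; the file names here are hypothetical:

# inside the job script: run python unbuffered and redirect its output to a log file
python -u my_script.py > progress.log 2>&1

# from a login node, while the job is running:
tail -f progress.log

Without -u (or flush=True in the print calls), Python buffers stdout when it is not attached to a terminal, so updates may not appear until long after they are printed.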

Resources