SLURM: Changing the maximum number of simultaneously running tasks for a running array job - slurm

I have set of an array job as follows:
sbatch --array=1:100%5 ...
which will limit the number of simultaneously running tasks to 5. The job is now running, and I would like to change this number to 10 (i.e. I wish I'd run sbatch --array=1:100%10 ...).
The documentation on array jobs mentions that you can use scontrol to change options after the job has started. Unfortunately, it's not clear what this option's variable name is, and I don't think it is listed in the documentation of the sbatch command here.
Any pointers well received.

You can change the array throttling limit with the following command:
scontrol update ArrayTaskThrottle=<count> JobId=<jobID>

Related

Slurm - job name, job ids, how to know which job is which?

I often run many jobs on slurm. Some finish faster than others. However, it is always hard to keep track which job is which. Can I give custom job names on slurm? If so what is the command on the batch script? Would that show up when I do squeue --me?
The parameter is --job-name (or -J), for instance:
#SBATCH --job-name=exp1_run2
The squeue output will list exp1_run2 for the corresponding job ID under column NAME.

Scheduling more jobs than MaxArraySize

Let's say I have 6233 simulations to run. The commands are generated and stored in a file, one in each line. I would like to use Slurm to schedule and run these commands. However, the MaxArraySize limit is 2000. So I can't use one job array to schedule all of them.
One solution is given here, where we create four separate jobs and use arithmetic indexing into the file, with the last job having a smaller number of tasks to run (233).
Is it possible to do this using one sbatch script with one job ID?
I set ntasks=1 when using job arrays. Do larger ntasks help in such situations?
Update:
Following Damien's solution and examples given here, I ended up with the following line in my bash script:
curID=$(( ${SLURM_ARRAY_TASK_ID} * ${SLURM_NTASKS} + ${SLURM_PROCID} ))
The same can be done using Python (shown in the referenced page). The only difference is that the environment variables should be imported into the script.
Is it possible to do this using one sbatch script with one job ID?
No. That solution will give you multiple job IDs
I set ntasks=1 when using job arrays. Do larger ntasks help in such situations?
Yes, that is a factor that you can leverage.
Each job in the array can spawn multiple tasks (--ntasks=...). In that case, the line number in the command file must be computed from $SLURM_ARRAY_TASK_ID and $SLURM_PROCID, and the program must be started with srun. Each task in a job member of the array will run in parallel. How large the job can be will depend on the MaxJobsize limit defined on the cluster/partition/qos you have access to.
Another option is to chain the tasks inside each job of the array, with a Bash loop (for i in $seq(...) ; do ...; done). In that case, the line number in the command file must be computed from $SLURM_ARRAY_TASK_ID and $i. Each task in a job member of the array will run serially. How large the job can be will depend on the MaxWall limit defined on the cluster/partition/qos you have access to.

Is it possible to assign job names to separate workers in a SLURM array via sbatch?

By default, when submitting a SLURM job as an array, all jobs within the array share the same job name. In the docs (here: https://slurm.schedmd.com/job_array.html), it shows that each job in the array can have its name set separately via scontrol (described under the section "Scontrol Command Use").
Can this be done directly from an sbatch script?
I just created an account because I was trying to do this and I did find a solution.
You can use scontrol to change the name of a job, the syntax is the following:
scontrol update job=<job_id> JobName=<new_name>
You can do this manually, but you can also automatically set the name of the job from within the array job, thus automatically assigning a different name to each job in the array.
I find this useful because I'm mostly running calculations in different directories and if I have one job running much longer than the others I want to be able to quickly retrieve where it's running to see what's going on.
Of course you could set other things as your job name, as you prefer.
In my case, I add the scontrol command to the script I run through the array in order to obtain the following name for each directory: "job_name - directory". The job id and job name can be retrieved from environment variables.
scontrol update job=$SLURM_ARRAY_JOB_ID JobName="$SLURM_JOB_NAME - $folder"

PBS Pro: setting job array slot limit by the user

Using torque user can specify slot limit when submitting the job array by using the %, e.g.: qsub job.sh -t 1-20%5 will create a job array with 20 jobs, but with only 5 running simultaneously.
Currently I work with PBS Professional, but unfortunately, as far as I can see, option % is not supported. How can I achieve similar behavior as % in torque as simple as possible?

Adding time to a running slurm job

I have a job running a linux machine managed by slurm.
Now that the job is running for a few hours I realize that I underestimated the time required for it to finish and thus the value of the --time argument I specified is not enough. Is there a way to add time to an existing running job through slurm?
Use the scontrol command to modify a job
scontrol update jobid=<job_id> TimeLimit=<new_timelimit>
Use the SLURM time format, eg. for 8 days 15 hours: TimeLimit=8-15:00:00
Requires admin privileges, on some machines.
Will be allowed to users only if the job is not running yet, on most machines.
To build on the example provided above, you can also use "+" and "-" to increment / decrement the TimeLimit.
From the [scontrol man page][https://slurm.schedmd.com/scontrol.html]:
either specify a new time limit value or precede the time and equal sign with a "+" or "-" to increment or decrement the current time limit (e.g. "TimeLimit+=30")
We regularly get requests like "I need 3 more hours for job XXXXX to finish!!!", which would translate to:
scontrol update job=XXXXX TimeLimit=+03:00:00

Resources