How to determine job array size for lots of jobs? - slurm

What is the best way to process lots of files in parallel via Slurm?
I have a lot of files (let's say 10000) in a folder (Each files get 10 secs or so to process). I want to determine sbatch job array size as 1000, naturally. (#SBATCH --array=1-10000%100) But it seems I can't determine more than some numbers(probably 1k). How you handle job array numbers? It seems to me because of my process don't take too much time, I think i should determine one job NOT for one file but for multiple files, right?
Thank you

If the process time is 10 second you should consider packing the tasks in a single job, both because such short jobs take longer to schedule than to run and because there is a limit on the number of jobs in an array.
Your submission script could look like this:
#!/bin/bash
#SBATCH --ntasks=16 # or any other number depending on the size of the cluster and the maximum allowed wall time
#SBATCH --mem-per-cpu=...
#SBATCH --time=... # based on the number of files and number of tasks
find . -name file_pattern -print0 | xargs -I{} -0 -P $SLURM_NTASKS srun -n1 -c1 --exclusive name_of_the_program {}
Make sure to replace all the ... and file_pattern and name_of_the_program with appropriate values.
The script will look for all files matching file_pattern in the submission directory and run the name_of_the_program program on it, limiting the number of concurrent instantes to the number of CPUs (more precisely number of tasks) requested. Note the use of --exclusive here which is specific for this use case and is deprecated with --exact in recent Slurm versions.

Related

how to generate batch files in a loop in Python (not a loop in a batch file) with slightly changed parameters per iteration

I am a researcher that needs to run files for a set of years on a SLURM system (high performance computing center). The available nodes for long compute times have a long queue. I have 42 years to run, and the only way to run them that gets my files processed quickly (due to the wait times, and that this is many GB of data, it takes time), is to submit them individually, one batch file per year, as jobs. I cannot include multiple years in a single batch file, or I have to wait a week in the queue to run my data due to the time I have to reserve per batch file. This is the fastest way my university's system lets me run my data.
To do this, I have 2 lines in my batch script that I have to change every time: the name of the job, and the last line which is the python script name plus a parameter being passed to it (the year)
like so: pythonscript.py 2020.
I would like to generate batch files with a python or other script I can run, where it loops over a list of years and just changes the job name to jobNameYEAR and changes the last line to pythonscript.py YEAR, writes that to a file jobNameYEAR.sl, then continues in a loop to output the next batch file. ...Even better if it can write the batch file and submit the job (sjob jobNameYEAR) before continuing in the loop, but I realize maybe this is asking too much. But separately...
Is there a way to submit jobs in a loop once these files are created? E.g. loop through the year list and submit sjob jobName2000.sl, sjob jobName2001.sl, sjob jobName2002.sl
I do not want a loop in the batch file changing the variable, this would mean reserving too many hours on the SLURM system for a single job. I want a loop outside of the batch file that generates multiple batch files I can submit as jobs.
Thank you for your help!
This is what one of my .sl files looks like, it works fine, I just want to generate these files in a loop so I can stop editing them by hand:
#!/bin/bash -l
# The -l above is required to get the full environment with modules
# Set the allocation to be charged for this job
# not required if you have set a default allocation
#SBATCH -A MYFOLDER
# The name of the job
#SBATCH -J jobNameYEAR
# 24 hour wall-clock time will be given to this job
#SBATCH -t 3:00:00
# Job partition
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
#SBATCH --mem=30GB
#SBATCH -p main
# load the anaconda module
ml PDC/21.11
ml Anaconda3/2021.05
conda activate myEnv
python pythonfilename.py YEAR
Create a script with the following content (let's call it chainsubmit.sh):
#!/bin/bash
SCRIPT=${1?Usage: $0 script.slum arg1 arg2 ...}
shift
ARG=$1
ID=$(sbatch --job-name=jobName$ARG --parsable $SCRIPT $ARG)
shift
for ARG in "$#"; do
ID=$(sbatch --job-name=jobName$ARG --parsable --dependency=afterok:${ID%%;*} $SCRIPT $ARG)
done
Then, adapt your script so that the last line
python pythonfilename.py YEAR
is replaced with
python pythonfilename.py $1
Finally submit all the jobs with
./chainsubmit.sh jobName.sl {2000..2004}
for instance for YEAR ranging from 2000 to 2004
… script I can run, where it loops over a list of years and just changes the job name to jobNameYEAR and changes the last line to pythonscript.py YEAR, writes that to a file jobNameYEAR.sl… submit the job (sjob jobNameYEAR) before continuing in the loop…
It can easily be done with a few shell commands and sed. Assume you have a template file jobNameYEAR.sl as shown, which literally contains jobNameYEAR and YEAR as the parameters. Then we can substitute YEAR with each given year in the loop, e. g.
seq 2000 2002|while read year
do <jobNameYEAR.sl sed s/YEAR$/$year/ >jobName$year.sl
sjob jobName$year.sl
done
If your years aren't in sequence, we can use e. g. echo 1962 1965 1970 instead of seq ….
Other variants are on Linux also possible, like for year in {2000..2002} instead of seq 2000 2002|while read year, and using envsubst instead of sed.

How does SLURM array Interface with SBATCH resource allocation?

#!/bin/bash
#SBATCH -p RM-shared
#SBATCH -n 4
#SBATCH -t 24:00:00
#SBATCH --array=1-
I am trying to start an array and for each task in the array I would like it to use 4 cores on the RM-shared partition. Am I doing this correctly or does this designate that ALL of the tasks output by the array with have to share 4 cores?
I will ask a separate question for this, but for some reason when I run this, the $SLURM_ARRAY_TASK_ID variable is empty....
when I run
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
after my headers setting up the job, it returns
My SLURM_ARRAY_TASK_ID:
First you are right about using --cpus-per-task=4 rather than ntasks. Second, it could be a copy/paste error, but your --array line is incomplete
#SBATCH --array=1-
should be
#SBATCH --array=1-10
for instance for a 10-job array.
Each job in the array will have 4 distinct cores allocated to it. And the job will be scheduled independently, so they could for instance start all 10 on a 40-core nodes at the same time, or on 10 distinct nodes at the same time, or on one 4-core nodes one at a time, or any possible in-between combination depending on the cluster configuration and jobs in the queue.
I wasnt calling the script properly. was calling:
./ThisScript.sh
instead of
sbatch ./ThisScript.sh
Regarding the allocation of cores per array job, a helpdesk person said to use
#SBATCH --ntasks-per-node=4
instead of
#SBATCH --cpus-per-task=4
but I do not know why... I would expect --ntasks-per-node=4 to command that each node needed only run 4 jobs, so if you had 12 jobs in your array it would require 3 full nodes.
--cpus-per-task=4 on the other hand, would command that each CPU (each with hosts a number of cores) would only run 4 tasks, so if you had 12 jobs in your array, it would requre 3 CPUs (and, if the nodes on your system have 3 or more CPUs, it would only require 1 node).

Dealing with job submission limits

I am running slurm job arrays with --array, and I would like to run about 2000 tasks/array items. However this is beyond the cluster's job submission limit of ~500 at a time.
Are there any tips/best practices for splitting this up? I'd like to submit it all at once and still be able to pass the array id arguments 1-2000 to my programs if possible. I think something like waiting to submit pieces of the array might be helpful but I'm not sure how to do this at the moment.
If the limit is on the size of an array:
You will have to split the array into several job arrays. The --array parameter accepts values of the form <START>-<END> so you can submit four jobs:
sbatch --array=1-500 ...
sbatch --array=501-1000 ...
sbatch --array=1001-1500 ...
sbatch --array=1501-200 ...
This way you will bypass the 500-limit and still keep the SLURM_ARRAY_TASK_ID ranging from 1 to 2000.
To ease things a bit, you can write this all in one line like this:
paste -d- <(seq 1 500 2000) <(seq 500 500 2000) | xargs -I {} sbatch --array={} ...
If the limit is on the number of submitted jobs:
Then one option is to have the last job of the array submit the following chunk.
#!/bin/bash
#SBATCH ...
...
...
if [[ $((SLURM_ARRAY_TASK_ID % 500)) == 0 ]] ; then
sbatch --array=$((SLURM_ARRAY_TASK_ID+1))-$((SLURM_ARRAY_TASK_ID+500)) $0
fi
Note that ideally, the last running job of the array should submit the job, and it may or may not be the one with the highest TASK ID, but this has worked for all practical purposes in many situations.
Another options is to setup a cron job to monitor the queue and submit each chunk when possible, or to use a workflow manager that will that for you.
you can run a script to submit your jobs and try to make the program sleep a few seconds after every 500 submissions. see https://www.osc.edu/resources/getting_started/howto/howto_submit_multiple_jobs_using_parameters

Cores assigned to SLURM job

Let's say I want to submit a slurm job just assigning the total amount of tasks (--ntasks=someNumber), without specifying the number of nodes and the tasks per node. Is there a way to know within the launched slurm script how many cores are assigned by slurm for each of the reserved nodes? I need to know this info to properly create a machinefile for the program I'm launching, that must be structured like this:
node02:7
node06:14
node09:3
Once the job is launched, the only way I figured out to see what cores have been allocated on the nodes is using the command:
scontrol show jobid -dd
In its output the abovementioned info is stored (together with plenty of other details).
Is there a better way to get this info?
The way the srun documentation illustrates creating a machine file is by running srun hostname. To get the output you want you could run
srun hostname -s | sort | uniq -c | awk '{print $2":"$1}' > $MACHINEFILE
You should check the documentation of your program to see if it accepts a machine file with repetitions rather than a suffix count. If so you can simplify the command as
srun hostname -s > $MACHINEFILE
And of course the first step is actually to make sure you indeed need a machine file in the first place as many parallel programs/libraries have Slurm support and can gather the needed information from the environment variables setup by Slurm upon job start.

How do the terms "job", "task", and "step" relate to each other?

How do the terms "job", "task", and "step" as used in the SLURM docs relate to each other?
AFAICT, a job may consist of multiple tasks, and also it make consist of multiple steps, but, assuming this is true, it's still not clear to me how tasks and steps relate.
It would be helpful to see an example showing the full complexity of jobs/tasks/steps.
A job consists in one or more steps, each consisting in one or more tasks each using one or more CPU.
Jobs are typically created with the sbatch command, steps are created with the srun command, tasks are requested, at the job level with --ntasks or --ntasks-per-node, or at the step level with --ntasks. CPUs are requested per task with --cpus-per-task. Note that jobs submitted with sbatch have one implicit step; the Bash script itself.
Assume the hypothetical job:
#SBATCH --nodes 8
#SBATCH --tasks-per-node 8
# The job requests 64 CPUs, on 8 nodes.
# First step, with a sub-allocation of 8 tasks (one per node) to create a tmp dir.
# No need for more than one task per node, but it has to run on every node
srun --nodes 8 --ntasks 8 mkdir -p /tmp/$USER/$SLURM_JOBID
# Second step with the full allocation (64 tasks) to run an MPI
# program on some data to produce some output.
srun process.mpi <input.dat >output.txt
# Third step with a sub allocation of 48 tasks (because for instance
# that program does not scale as well) to post-process the output and
# extract meaningful information
srun --ntasks 48 --nodes 6 --exclusive postprocess.mpi <output.txt >result.txt &
# Fourth step with a sub-allocation on a single node
# to compress the raw output. This step runs at the same time as
# the previous one thanks to the ampersand `&`
srun --ntasks 12 --nodes 1 --exclusive compress.mpi output.txt &
wait
Four steps were created and so the accounting information for that job will have 5 lines; one per step plus one for the Bash script itself.

Resources