How does a SLURM array interface with SBATCH resource allocation?

#!/bin/bash
#SBATCH -p RM-shared
#SBATCH -n 4
#SBATCH -t 24:00:00
#SBATCH --array=1-
I am trying to start an array, and for each task in the array I would like it to use 4 cores on the RM-shared partition. Am I doing this correctly, or does this designate that ALL of the tasks launched by the array will have to share 4 cores?
I will ask a separate question for this, but for some reason when I run this, the $SLURM_ARRAY_TASK_ID variable is empty....
when I run
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
after my headers setting up the job, it returns
My SLURM_ARRAY_TASK_ID:

First, you are right that --cpus-per-task=4 rather than --ntasks (-n) is the option to use here. Second, it could be a copy/paste error, but your --array line is incomplete:
#SBATCH --array=1-
should be
#SBATCH --array=1-10
for instance for a 10-job array.
Each job in the array will have 4 distinct cores allocated to it, and the jobs will be scheduled independently: they could, for instance, all 10 start at the same time on a 40-core node, or at the same time on 10 distinct nodes, or run one at a time on a single 4-core node, or any in-between combination, depending on the cluster configuration and the jobs in the queue.
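Putting the two fixes together, a corrected header could look like the following sketch (the 1-10 range and the echo line are only illustrations):
#!/bin/bash
#SBATCH -p RM-shared
#SBATCH --cpus-per-task=4      # 4 cores for each array task, not shared between tasks
#SBATCH -t 24:00:00
#SBATCH --array=1-10           # 10 independent array tasks
# runs once per array task, each task seeing its own index
echo "My SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"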

I wasn't calling the script properly. I was calling:
./ThisScript.sh
instead of
sbatch ./ThisScript.sh
Regarding the allocation of cores per array job, a helpdesk person said to use
#SBATCH --ntasks-per-node=4
instead of
#SBATCH --cpus-per-task=4
but I do not know why... I would expect --ntasks-per-node=4 to mean that each node need only run 4 jobs, so if you had 12 jobs in your array it would require 3 full nodes.
--cpus-per-task=4, on the other hand, would mean that each CPU (each of which hosts a number of cores) would only run 4 tasks, so if you had 12 jobs in your array, it would require 3 CPUs (and, if the nodes on your system have 3 or more CPUs, it would only require 1 node).
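For what it is worth, both directives end up giving each array job 4 CPUs on a single node; a sketch contrasting the two header variants (my own illustration, not from the helpdesk):
# variant suggested by the helpdesk: 4 tasks of 1 CPU each, all on one node, per array job
#SBATCH --ntasks-per-node=4
# alternative: 1 task with 4 CPUs per array job (the usual choice when each array
# task runs a single multithreaded or multiprocessing program)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
The difference mainly matters if the program is launched with srun, which starts one copy per task.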

Related

How to determine job array size for lots of jobs?

What is the best way to process lots of files in parallel via Slurm?
I have a lot of files (let's say 10000) in a folder, and each file takes 10 seconds or so to process. I naturally want to set the sbatch job array size accordingly (#SBATCH --array=1-10000%100), but it seems I can't go above a certain number (probably 1000). How do you handle job array sizes? Since my processes don't take much time, I think I should assign one job not to one file but to multiple files, right?
Thank you
If the processing time is 10 seconds per file, you should consider packing the tasks into a single job, both because such short jobs take longer to schedule than to run and because there is a limit on the number of jobs in an array.
Your submission script could look like this:
#!/bin/bash
#SBATCH --ntasks=16 # or any other number depending on the size of the cluster and the maximum allowed wall time
#SBATCH --mem-per-cpu=...
#SBATCH --time=... # based on the number of files and number of tasks
find . -name file_pattern -print0 | xargs -I{} -0 -P $SLURM_NTASKS srun -n1 -c1 --exclusive name_of_the_program {}
Make sure to replace all the ... and file_pattern and name_of_the_program with appropriate values.
The script will look for all files matching file_pattern in the submission directory and run the name_of_the_program program on each of them, limiting the number of concurrent instances to the number of CPUs (more precisely, the number of tasks) requested. Note the use of --exclusive here, which is specific to this use case; in recent Slurm versions it is deprecated in favour of --exact.
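If you nevertheless prefer the approach hinted at in the question, i.e. one array task per batch of files, a minimal sketch could look like the following; files.txt, the batch size of 100 and name_of_the_program are assumptions to adapt to your case:
#!/bin/bash
#SBATCH --array=1-100          # 100 tasks, each handling 100 of the 10000 files
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=...
#SBATCH --time=...             # enough for ~100 files at ~10 s each
# files.txt is assumed to list one file path per line (e.g. produced with find).
# Each array task processes its own 100-line slice of that list.
start=$(( (SLURM_ARRAY_TASK_ID - 1) * 100 + 1 ))
end=$(( SLURM_ARRAY_TASK_ID * 100 ))
sed -n "${start},${end}p" files.txt | while read -r f; do
    name_of_the_program "$f"
done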

How to create a batch script which submits several jobs and allocates each of these jobs to a separate node?

I am new to HPC, and to SLURM especially, and I ran into some trouble.
I was provided with access to an HPC cluster with 32 CPUs on each node. In order to do the needed calculations I made 12 Python multiprocessing scripts, where each script uses exactly 32 CPUs.
So, instead of starting each script manually in interactive mode (which is also an option, by the way, but it takes a lot of time), I decided to write a batch script in order to start all 12 scripts automatically.
//SCRIPT//
#!/bin/bash
#SBATCH --job-name=job_name
#SBATCH --partition=partition
#SBATCH --nodes=1
#SBATCH --time=47:59:59
#SBATCH --export=NONE
#SBATCH --array=1-12
module switch env env/system-gcc
module load python/3.8.5
source /home/user/env/bin/activate
python3.8 $HOME/Script_directory/Script$SLURM_ARRAY_TASK_ID.py
exit
//UNSCRIPT//
But as far as I understand, this script would start all of the jobs from the array on the same node, and thus the underlying Python scripts might start a "fight" for the available CPUs and slow down.
How should I modify my bash file in order to start each task from the array on a separate node?
Thanks in advance!
This script will start 12 independent jobs, possibly on 12 distinct nodes at the same time, or all 12 in sequence on the same node or any other combination depending on the load of the cluster.
Each job will run the corresponding Script$SLURM_ARRAY_TASK_ID.py script. There will be no competition for resources.
Note that if nodes are shared in the cluster, you should add the --exclusive parameter to request whole nodes with their 32 CPUs.
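For instance, on a cluster with shared nodes the header from the question could be adapted as in this sketch; only the --exclusive line is new (requesting --cpus-per-task=32 instead would be an alternative if whole nodes are not required):
#!/bin/bash
#SBATCH --job-name=job_name
#SBATCH --partition=partition
#SBATCH --nodes=1
#SBATCH --exclusive            # give each array task a whole node with its 32 CPUs
#SBATCH --time=47:59:59
#SBATCH --export=NONE
#SBATCH --array=1-12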

How do the terms "job", "task", and "step" relate to each other?

How do the terms "job", "task", and "step" as used in the SLURM docs relate to each other?
AFAICT, a job may consist of multiple tasks, and it may also consist of multiple steps, but, assuming this is true, it's still not clear to me how tasks and steps relate.
It would be helpful to see an example showing the full complexity of jobs/tasks/steps.
A job consists of one or more steps, each consisting of one or more tasks, each using one or more CPUs.
Jobs are typically created with the sbatch command, steps are created with the srun command, and tasks are requested either at the job level with --ntasks or --ntasks-per-node, or at the step level with --ntasks. CPUs are requested per task with --cpus-per-task. Note that jobs submitted with sbatch have one implicit step: the Bash script itself.
Assume the hypothetical job:
#!/bin/bash
#SBATCH --nodes 8
#SBATCH --tasks-per-node 8
# The job requests 64 CPUs, on 8 nodes.
# First step, with a sub-allocation of 8 tasks (one per node) to create a tmp dir.
# No need for more than one task per node, but it has to run on every node
srun --nodes 8 --ntasks 8 mkdir -p /tmp/$USER/$SLURM_JOBID
# Second step with the full allocation (64 tasks) to run an MPI
# program on some data to produce some output.
srun process.mpi <input.dat >output.txt
# Third step with a sub allocation of 48 tasks (because for instance
# that program does not scale as well) to post-process the output and
# extract meaningful information
srun --ntasks 48 --nodes 6 --exclusive postprocess.mpi <output.txt >result.txt &
# Fourth step with a sub-allocation on a single node
# to compress the raw output. This step runs at the same time as
# the previous one thanks to the ampersand `&`
srun --ntasks 12 --nodes 1 --exclusive compress.mpi output.txt &
wait
Four steps were created explicitly with srun, so the accounting information for that job will have five lines: one per srun step plus one for the Bash script itself.
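To inspect those lines once the job has finished, a sacct call along these lines can be used (the job ID is of course hypothetical):
sacct -j 123456 --format=JobID,JobName,NTasks,AllocCPUS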

How does one specify in slurm to send e-mail when a single job finishes and not when each slurm array task finishes?

I was running lots of jobs in Slurm with sbatch as follows:
#!/usr/bin/env python
#SBATCH --job-name=Python
#SBATCH --array=1-200
#SBATCH --mem=4000
#SBATCH --time=0-18:20
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my_mail@yahoo.com
However, what seems to be happening is that Slurm sends me an e-mail for each array task. I did not want Slurm to do that; I only want it to send me an e-mail when the whole array has finished (or failed) running, i.e. when a specific job (and ALL of its array tasks) is done. Is that possible to do in Slurm?
I was reading the documentation and it says the following under --mail-type=<type>:
Unless the ARRAY_TASKS option is specified, mail notifications on job
BEGIN, END and FAIL apply to a job array as a whole rather than
generating individual email messages for each task in the job array.
So might the problem be that I am using ALL instead of BEGIN, END, or FAIL? I am honestly just interested in knowing when all the array tasks are done running; even if a single one fails, that's OK.
The documentation says:
--mail-type=
...
Unless the ARRAY_TASKS option is specified, mail notifications on job BEGIN, END and FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array.
...
https://slurm.schedmd.com/sbatch.html
Hope this answers your question.
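Concretely, as long as ARRAY_TASKS is not listed in --mail-type, the END and FAIL notifications are issued once for the array as a whole, so a header along these lines should do what you want (a sketch based on the script in the question):
#SBATCH --array=1-200
#SBATCH --mail-type=END,FAIL   # one e-mail for the whole array, not per task
#SBATCH --mail-user=my_mail@yahoo.com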

qsub job array, where each job uses a single process?

I have a job script with the following PBS specifications at the beginning:
#PBS -t 0-99
#PBS -l nodes=1:ppn=1
The first line says that this is a job array, with jobs indexed from 0 to 99.
I want each individual indexed job to use only a single node and a single core per node, hence my second PBS line. But I am worried that TORQUE qsub will interpret the second line as saying that the whole job array should run sequentially on a single core.
How does TORQUE qsub interpret the second PBS line?
It interprets it as 100 jobs that should each use 1 execution slot on one node. For more information, please see the qsub documentation, in particular the details on the -t switch.
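For completeness, a minimal Torque array script along those lines could look like the following sketch; the job name and the worker command are made up for illustration, and $PBS_ARRAYID is Torque's index variable for -t arrays:
#!/bin/bash
#PBS -t 0-99
#PBS -l nodes=1:ppn=1
#PBS -N array_example
# Torque starts each job in the home directory, so change to the submission directory
cd "$PBS_O_WORKDIR"
# hypothetical worker script that processes the input corresponding to this index
./process_one_input.sh "$PBS_ARRAYID"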
