How to create a batch script which submits several jobs and allocates each of these jobs to a separate node? - slurm

I am new to HPC and SLURM in particular, and I ran into some trouble.
I was given access to an HPC cluster with 32 CPUs on each node. In order to do the needed calculations I made 12 Python multiprocessing scripts, where each script uses exactly 32 CPUs.
Now, instead of starting each script manually in interactive mode (which is also an option, but it takes a lot of time), I decided to write a batch script to start all 12 scripts automatically.
//SCRIPT//
#!/bin/bash
#SBATCH --job-name=job_name
#SBATCH --partition=partition
#SBATCH --nodes=1
#SBATCH --time=47:59:59
#SBATCH --export=NONE
#SBATCH --array=1-12
module switch env env/system-gcc
module load python/3.8.5
source /home/user/env/bin/activate
python3.8 $HOME/Script_directory/Script$SLURM_ARRAY_TASK_ID.py
exit
//UNSCRIPT//
But as far as I understand, this script would start all of the jobs from the array on the same node, and thus the underlying Python scripts might start a "fight" for the available CPUs and slow each other down.
How should I modify my batch file in order to start each task from the array on a separate node?
Thanks in advance!

This script will start 12 independent jobs, possibly on 12 distinct nodes at the same time, or all 12 in sequence on the same node, or any other combination, depending on the load of the cluster.
Each job will run the corresponding Script$SLURM_ARRAY_TASK_ID.py script. There will be no competition for resources.
Note that if nodes are shared in the cluster, you would add the --exclusive parameter to request whole nodes with their 32 CPUs.
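For illustration, a hedged sketch of the question's script with that change applied (all other lines are copied from the question; --exclusive is only relevant if nodes can be shared):
#!/bin/bash
#SBATCH --job-name=job_name
#SBATCH --partition=partition
#SBATCH --nodes=1
# Request the whole node (all 32 CPUs) for each array job; only needed on shared nodes
#SBATCH --exclusive
#SBATCH --time=47:59:59
#SBATCH --export=NONE
#SBATCH --array=1-12
module switch env env/system-gcc
module load python/3.8.5
source /home/user/env/bin/activate
python3.8 $HOME/Script_directory/Script$SLURM_ARRAY_TASK_ID.py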

Related

How does SLURM array interface with SBATCH resource allocation?

#!/bin/bash
#SBATCH -p RM-shared
#SBATCH -n 4
#SBATCH -t 24:00:00
#SBATCH --array=1-
I am trying to start an array, and for each task in the array I would like it to use 4 cores on the RM-shared partition. Am I doing this correctly, or does this designate that ALL of the tasks output by the array will have to share 4 cores?
I will ask a separate question for this, but for some reason when I run this, the $SLURM_ARRAY_TASK_ID variable is empty....
when I run
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
after my headers setting up the job, it returns
My SLURM_ARRAY_TASK_ID:
First, you are right about using --cpus-per-task=4 rather than --ntasks. Second, it could be a copy/paste error, but your --array line is incomplete:
#SBATCH --array=1-
should be
#SBATCH --array=1-10
for instance for a 10-job array.
Each job in the array will have 4 distinct cores allocated to it. And each job will be scheduled independently, so they could for instance start all 10 on a 40-core node at the same time, or on 10 distinct nodes at the same time, or one at a time on a single 4-core node, or any possible in-between combination, depending on the cluster configuration and the jobs in the queue.
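Putting both fixes together, a hedged sketch of the corrected header (the partition and time limit come from the question; the array size of 10 is only an example):
#!/bin/bash
#SBATCH -p RM-shared
#SBATCH --cpus-per-task=4
#SBATCH -t 24:00:00
#SBATCH --array=1-10

# This is only set when the script is submitted with sbatch, not when it is run directly
echo "My SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"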
It turns out I wasn't calling the script properly. I was calling:
./ThisScript.sh
instead of
sbatch ./ThisScript.sh
Regarding the allocation of cores per array job, a helpdesk person said to use
#SBATCH --ntasks-per-node=4
instead of
#SBATCH --cpus-per-task=4
but I do not know why... I would expect --ntasks-per-node=4 to mean that each node need only run 4 jobs, so if you had 12 jobs in your array it would require 3 full nodes.
--cpus-per-task=4, on the other hand, would mean that each CPU (each of which hosts a number of cores) would only run 4 tasks, so if you had 12 jobs in your array, it would require 3 CPUs (and, if the nodes on your system have 3 or more CPUs, it would only require 1 node).

Cores assigned to SLURM job

Let's say I want to submit a Slurm job assigning only the total number of tasks (--ntasks=someNumber), without specifying the number of nodes or the tasks per node. Is there a way to know, within the launched job script, how many cores Slurm has assigned on each of the reserved nodes? I need this info to properly create a machine file for the program I'm launching, which must be structured like this:
node02:7
node06:14
node09:3
Once the job is launched, the only way I have found to see which cores have been allocated on the nodes is the command:
scontrol -dd show job <jobid>
Its output contains the above-mentioned info (together with plenty of other details).
Is there a better way to get this info?
The way the srun documentation illustrates creating a machine file is by running srun hostname. To get the output you want, you could run:
srun hostname -s | sort | uniq -c | awk '{print $2":"$1}' > $MACHINEFILE
You should check the documentation of your program to see whether it accepts a machine file with repeated host names rather than a host:count suffix. If so, you can simplify the command to:
srun hostname -s > $MACHINEFILE
And of course the first step is to make sure you actually need a machine file in the first place: many parallel programs and libraries have Slurm support and can gather the needed information from the environment variables Slurm sets up when the job starts.
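For context, a minimal sketch of how that command could sit inside a job script; the program name my_mpi_prog and its --machinefile option are hypothetical placeholders:
#!/bin/bash
#SBATCH --ntasks=24

# Build a node:count machine file from the nodes Slurm actually allocated
MACHINEFILE="machinefile.$SLURM_JOB_ID"
srun hostname -s | sort | uniq -c | awk '{print $2":"$1}' > $MACHINEFILE

# Hypothetical program call; replace with your program's real machine-file option
my_mpi_prog --machinefile $MACHINEFILE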

slurm is ignoring the --distribution=cyclic flag in my sbatch file and using the 'block' distribution method instead

I would like to distribute my job evenly across multiple nodes and have specified the --distribution=cyclic in my sbatch file, but slurm ignores that and uses the block distribution instead.
Previously, the tasks were distributed evenly across the nodes. From reading the docs, this is what I expect the default behavior to be, unless otherwise specified in slurm.conf.
Starting today, the tasks are clustering on the first node with only one task on each of the other nodes. I've obviously changed something in the config, but can't figure out where to fix it. I did make a change to the image for the compute nodes and rebooted them today.
When I try to stop the slurmctld on the head node, it is restarted immediately by my Bright Cluster Manager monitor. Not sure if this is preventing configuration updates or not.
I've investigated the slurm.conf file but it looks OK. I've tried both SelectTypeParameters=CR_Core and CR_CPU but get the same result.
To try to work around this I added --distribution=cyclic to my sbatch file, but Slurm is still allocating with the 'block' method. Adding this to the sbatch script should not be necessary anyway, at least according to my understanding of the docs.
Here are the relevant lines from slurm.conf and my sbatch script:
# RESOURCES
SelectType=select/cons_res
SelectTypeParameters=CR_Core
# Node Description
NodeName=DEFAULT Sockets=2 CoresPerSocket=20 ThreadsPerCore=1
# Scheduler
SchedulerType=sched/backfill
#SBATCH --ntasks=12
#SBATCH --nodes=3
#SBATCH --distribution=cyclic:cyclic
I would expect the tasks to be distributed evenly between the nodes, with 4 tasks on each of the 3 nodes.
Here is how the tasks are actually getting distributed:
NODELIST STATE CPUS(A/I/O/T) MEMORY TMP_DISK REASON
compute001 mixed 10/30/0/40 192006 2038 none
compute002 mixed 1/39/0/40 192006 2038 none
compute003 mixed 1/39/0/40 192006 2038 none
compute004 idle 0/40/0/40 192006 2038 none
According to https://slurm.schedmd.com/sbatch.html, the distribution flag is only useful for srun:
Specify alternate distribution methods for remote processes. In sbatch, this only sets environment variables that will be used by subsequent srun requests.
(As to why it's like this… I have no idea. But it does appear it's by design.)
Depending on your configuration, you may be able to approximate what you want by setting SelectType=select/cons_res or select/cons_tres and SelectTypeParameters=CR_LLN. If either of these parameters changed recently, that might also be the reason the behavior changed.
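A hedged sketch of what those slurm.conf lines could look like; CR_Core is kept from the question's configuration, and adding CR_LLN (least loaded nodes first) is the assumption being suggested here:
# slurm.conf (sketch)
SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_LLN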
I managed to manually distribute my processes across the nodes by modifying the sbatch file to limit the number of tasks assigned to each node:
#SBATCH --ntasks=12
#SBATCH --nodes=3
#SBATCH --tasks-per-node=4
This results in the expected distribution of the tasks across the nodes:
NODELIST STATE CPUS(A/I/O/T) MEMORY TMP_DISK REASON
compute001 mixed 4/36/0/40 192027 2038 none
compute002 mixed 4/36/0/40 192027 2038 none
compute003 mixed 4/36/0/40 192027 2038 none
compute004 idle 0/40/0/40 192027 2038 none

SLURM job taking up entire node when using just one GPU

I am submitting multiple jobs to a SLURM queue. Each job uses 1 GPU. We have 4 GPUs per node. However once a job is running, it takes up the entire node, leaving 3 GPUs idle. Is there any way to avoid this, so that I can send multiple jobs to one node, using one GPU each?
My script looks like this:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node 1
#SBATCH -p ghp-queue
myprog.exe
I was also unable to run multiple jobs on different GPUs. What helped was adding OverSubscribe=FORCE to the partition configuration in slurm.conf, like this:
PartitionName=compute Nodes=ALL ... OverSubscribe=FORCE
After that, I was able to run four jobs with --gres=gpu:1, and each one took a different GPU (a fifth job is queued, as expected).
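For reference, a hedged sketch of a per-job submission script in such a setup; the partition name and program come from the question, while the single task and the small CPU count (4 here is an assumption) are what leave room for the other GPU jobs on the same node:
#!/bin/bash
#SBATCH -p ghp-queue
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
# Assumption: request only a fraction of the node's CPUs so other GPU jobs can share the node
#SBATCH --cpus-per-task=4
myprog.exe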

How do the terms "job", "task", and "step" relate to each other?

How do the terms "job", "task", and "step" as used in the SLURM docs relate to each other?
AFAICT, a job may consist of multiple tasks, and it may also consist of multiple steps, but, assuming this is true, it's still not clear to me how tasks and steps relate.
It would be helpful to see an example showing the full complexity of jobs/tasks/steps.
A job consists of one or more steps, each consisting of one or more tasks, each using one or more CPUs.
Jobs are typically created with the sbatch command, steps are created with the srun command, and tasks are requested at the job level with --ntasks or --ntasks-per-node, or at the step level with --ntasks. CPUs are requested per task with --cpus-per-task. Note that jobs submitted with sbatch have one implicit step: the Bash script itself.
Assume the hypothetical job:
#!/bin/bash
#SBATCH --nodes 8
#SBATCH --tasks-per-node 8
# The job requests 64 CPUs, on 8 nodes.
# First step, with a sub-allocation of 8 tasks (one per node) to create a tmp dir.
# No need for more than one task per node, but it has to run on every node
srun --nodes 8 --ntasks 8 mkdir -p /tmp/$USER/$SLURM_JOBID
# Second step with the full allocation (64 tasks) to run an MPI
# program on some data to produce some output.
srun process.mpi <input.dat >output.txt
# Third step with a sub allocation of 48 tasks (because for instance
# that program does not scale as well) to post-process the output and
# extract meaningful information
srun --ntasks 48 --nodes 6 --exclusive postprocess.mpi <output.txt >result.txt &
# Fourth step with a sub-allocation on a single node
# to compress the raw output. This step runs at the same time as
# the previous one thanks to the ampersand `&`
srun --ntasks 12 --nodes 1 --exclusive compress.mpi output.txt &
wait
Four steps were created with srun, so the accounting information for that job will have five lines: one per step, plus one for the Bash script itself.
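To inspect those steps after the job has finished, one could query the accounting database; the job ID 1234 below is just a placeholder:
# List the job and each of its steps with task and CPU counts (1234 is a placeholder job ID)
sacct -j 1234 --format=JobID,JobName,NTasks,AllocCPUS,Elapsed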
