Questions on alternative ways to run 4 parallel jobs - slurm

Below are three different sbatch scripts that produce roughly similar results.
(I show only the parts where the scripts differ; the ## prefix indicates the output obtained by submitting the scripts to sbatch.)
Script 0
#SBATCH -n 4
srun -l hostname -s
## ==> slurm-7613732.out <==
## 0: node-73
## 1: node-73
## 2: node-73
## 3: node-73
Script 1
#SBATCH -n 1
#SBATCH -a 1-4
srun hostname -s
## ==> slurm-7613733_1.out <==
## node-72
##
## ==> slurm-7613733_2.out <==
## node-73
##
## ==> slurm-7613733_3.out <==
## node-72
##
## ==> slurm-7613733_4.out <==
## node-73
Script 2
#SBATCH -N 4
srun -l -n 4 hostname -s
## ==> slurm-7613738.out <==
## 0: node-74
## 2: node-76
## 1: node-75
## 3: node-77
Q: Why would one choose one such approach over the others?
(I see that the jobs spawned by Script 0 all ran on the same node, but I can't tell if this is a coincidence.)
Also, the following variant of Script 2 (the only difference being -N 2 instead of -N 4) fails:
Script 3
#SBATCH -N 2
srun -l -n 4 hostname -s
## ==> slurm-7614825.out <==
## srun: error: Unable to create job step: More processors requested than permitted
Ditto for the following variant of Script 2 (the only difference between this and Script 3 is that here srun also has the flag -c 2):
Script 4
#SBATCH -N 2
srun -l -n 4 -c 2 hostname -s
## ==> slurm-7614827.out <==
## srun: error: Unable to create job step: More processors requested than permitted
Qs: are the errors I get with Script 3 and Script 4 due to wrong syntax, wrong semantics, or site-specific configs? IOW, is there something inherently wrong with these scripts (that would cause them to fail under any instance of SLURM), or are the errors only due to violations of restrictions imposed by the particular instance of SLURM I'm submitting the jobs to? If the latter is the case, how can I pinpoint the configs responsible for the error?

Q: Why would one choose one such approach over the others?
Script 0: you request 4 tasks, to be allocated at the same time to a single job, with no other specification as to how those tasks should be allocated to nodes. Typical use: an MPI program.
Script 1: you request 4 jobs, each with 1 task. The jobs will be scheduled independently one from another. Typical use: Embarrassingly parallel jobs.
Script 2: you request 4 nodes, with one task per node. It is similar to Script 0 except that you request the tasks to be allocated to four distinct nodes. Typical use: an MPI program with a lot of I/O on local disks, for instance.
The fact that all your jobs were allocated the same first node is due to Slurm always allocating nodes in the same order; you probably ran the tests one after another, so each one started on the resources the previous one had just freed.
Script 3: You request two nodes with, implicitly, 1 task per node, so you are allocated two tasks, but then you try to run 4 tasks with srun. You should change it to
#SBATCH -N 2
#SBATCH --ntasks-per-node 2
srun -l -n 4 hostname -s
to request two tasks per node, or
#SBATCH -N 2
#SBATCH -n 4
srun -l -n 4 hostname -s
to request four tasks, with no additional constraint on the distribution of tasks across nodes.
Script 4: You request two nodes with, implicitly, 1 task per node and, also implicitly, one CPU per task, so you are allocated two CPUs, but then you try to run 4 tasks with srun, each with 2 CPUs, so 8 in total. You should change it to
#SBATCH -N 2
#SBATCH --ntasks-per-node 2
#SBATCH --cpus-per-task 2
srun -l -n 4 -c 2 hostname -s
or,
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --cpus-per-task 2
srun -l -n 4 -c 2 hostname -s
The bottom line: in the submission script, you request resources with the #SBATCH directives, and the subsequent calls to srun cannot use more resources than that.
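As a minimal illustration of that rule (the hostname steps below are placeholders, not taken from the question):
#!/bin/bash
#SBATCH -n 4                # the #SBATCH directives define what the job owns: 4 tasks
srun -l -n 4 hostname -s    # fine: the step uses exactly the 4 allocated tasks
srun -l -n 2 hostname -s    # also fine: a step may use less than the allocation
# srun -l -n 8 hostname -s  # would fail: more tasks than the job was allocated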

Related

Running a command multiple times on different nodes in SLURM

I want to run three instances of GROMACS mdrun on three different nodes.
I have three temperatures, 200, 220 and 240 K, and I want to run the 200 K simulation on node 1, the 220 K simulation on node 2, and the 240 K simulation on node 3. I need to do all this in one script as I have a limit on the number of jobs.
How can I do that in Slurm?
Currently I have:
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks=3
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00
#SBATCH --job-name=1us
#SBATCH --error=h.err
#SBATCH --output=h.out
#SBATCH --partition=standard
as my sbatch parameters and
for i in 1 2 3
do
    T=$(($Ti+($i-1)*20))    # Ti (the starting temperature, 200 K) is set earlier in the script
    cd T_$T/1000
    gmx_mpi grompp -f heating.mdp -c init_conf.gro -p topol.top -o quench.tpr -maxwarn 1
    gmx_mpi mdrun -s quench.tpr -deffnm heatingLDA -v &
    cd ../../
done
wait
This is how I am running mdrun, but it is not running as fast as I want. Firstly, the mdrun instances do not start simultaneously: the 200 K run starts first, and only after 2-3 minutes does the 220 K run start. Secondly, the speed is much slower than expected.
Could you all tell me how I can achieve that?
Thank you in advance.
Best regards,
Ved
You need to add a line in the slurm script
#SBATCH --nodelist=${NODENAME}
where ${NODENAME} is the name of any of nodes 1, 2 or 3
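For reference, a rough sketch of how this idea can be applied inside a single three-node job by pinning each mdrun instance to one allocated node with srun's -w/--nodelist flag; the scontrol call used to list the allocated nodes and the Ti=200 starting temperature are assumptions, not part of the original answer:
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks=3
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00

Ti=200                                                      # assumed starting temperature (K)
nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )  # names of the three allocated nodes

for i in 1 2 3
do
    T=$(( Ti + (i-1)*20 ))
    cd T_${T}/1000
    # run this temperature on its own node, in the background
    srun -N 1 -n 1 -w "${nodes[$((i-1))]}" gmx_mpi mdrun -s quench.tpr -deffnm heatingLDA -v &
    cd ../../
done
wait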

How to use the srun command to assign a different GPU to each task on a node with multiple GPUs?

How can I change my slurm script below so that each python job gets a unique GPU? The node has 4 GPUs, and I would like to run 1 python job per GPU.
The problem is that all jobs use the first GPU and other GPUs are idle.
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --exclusive
for i in `seq 0 3`; do
    cd ${i}
    srun python gpu_code.py &
    cd ..
done
wait
In your example your four jobs will be executed sequentially. I suggest submitting four separate jobs that only request a single GPU. Then each job uses only one GPU and they will be executed simultaneously. If the jobs have dependencies you can use:
sbatch --dependency=afterok:${jobid_of_previous_job} submit.sh. This job will start after the previous one has finished.
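A sketch of that separate-jobs approach (the script name submit.sh matches the dependency example above; the --gres=gpu:1 request and the directory layout are assumptions):
submit.sh:
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH --gres=gpu:1      # one GPU per job; with GPU GRES configured, Slurm sets CUDA_VISIBLE_DEVICES for you
python gpu_code.py
Submission, one job per directory (a batch job inherits the directory sbatch is called from):
for i in 0 1 2 3; do
    ( cd $i && sbatch ../submit.sh )
done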
If, as in your current script, you do not request GPUs in the submission script, you will have to manage the CUDA_VISIBLE_DEVICES variable yourself to direct each python script to one specific GPU.
Try with
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --exclusive
for i in `seq 0 3`; do
    cd ${i}
    export CUDA_VISIBLE_DEVICES=$i    # make only GPU $i visible to this python process
    python gpu_code.py &
    cd ..
done
wait

Parallel jobs in SLURM

How can I run a number of python scripts on different nodes in SLURM?
Suppose,
I select 5 cluster nodes using #SBATCH --nodes=5
and
I have 5 python scripts, code1.py, code2.py, ..., code5.py, and I want to run each of these scripts on a different node simultaneously. How can I achieve this?
Do these five scripts need to run in a single job? Do they really need to run simultaneously? Is there some communication happening between them? Or are they independent from one another?
If they are essentially independent, then you should most likely put them into 5 different jobs with one node each. That way you don't have to wait for five free nodes; the first job can start as soon as there is a single free node. If there are many scripts you want to start like that, it might be interesting to look into job arrays, as sketched below.
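A sketch of the job-array variant (the code${i}.py naming follows the question; the resource numbers are assumptions):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --array=1-5          # five independent array jobs, one per script
# each array element runs one script and can start as soon as any node is free
python code${SLURM_ARRAY_TASK_ID}.py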
If you need to run them in parallel within a single job, you will need to use srun in your job script to start the scripts. This example shows a job with one task per node and 10 cores per task.
#!/bin/bash
#[...]
#SBATCH -N 5
#SBATCH -n 5
#SBATCH -c 10
#[...]
srun -N 1 -n 1 python code1.py &
srun -N 1 -n 1 python code2.py &
srun -N 1 -n 1 python code3.py &
srun -N 1 -n 1 python code4.py &
srun -N 1 -n 1 python code5.py &
wait
You need to run the srun calls in the background, as bash would otherwise wait for them to finish before executing the next one.

GPU allocation within a SBATCH

I have access to a large GPU cluster (20+ nodes, 8 GPUs per node) and I want to launch a task several times on n GPUs (1 per GPU, n > 8) within one single batch without booking full nodes with the --exclusive flag.
I managed to pre-allocate the resources (see below), but I struggle very hard with launching the task several times within the job. Specifically, my log shows no value for the CUDA_VISIBLE_DEVICES variable.
I know how to do this operation on fully booked nodes with the --nodes and --gres flags. In that situation, I use --nodes=1 --gres=gpu:1 for each srun. However, this solution does not work for the present question; the job hangs indefinitely.
In the MWE below, I have a job asking for 16 GPUs (--ntasks and --gpus-per-task). The job is composed of 28 tasks which are launched with the srun command.
#!/usr/bin/env bash
#SBATCH --job-name=somename
#SBATCH --partition=gpu
#SBATCH --nodes=1-10
#SBATCH --ntasks=16
#SBATCH --gpus-per-task=1
for i in {1..28}
do
    srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
done
wait
The output of this script should look like this:
nodeA 1
nodeR 2
...
However, this is what I got:
nodeA
nodeR
...
When you write
srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
the expansion of the $CUDA_VISIBLE_DEVICES variable will be performed on the master node of the allocation (where the script is run) rather than on the node targeted by srun. You should escape the $:
srun echo $(hostname) \$CUDA_VISIBLE_DEVICES &
By the way, the --gpus-per-task= option appeared in the sbatch manpage in version 19.05. When you use it with an earlier version, I am not sure how it behaves.
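Also, since srun launches the command directly rather than through a shell, a common variant is to wrap the whole command in bash -c so that both the hostname and the variable are evaluated on the node the step runs on; a sketch, not part of the original answer:
# the single quotes delay expansion until the compute node's shell runs the command
srun bash -c 'echo $(hostname) $CUDA_VISIBLE_DEVICES' &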

Why does slurm assign more tasks than I asked when I "sbatch" multiple jobs with a .sh file?

I submit some cluster-mode Spark jobs which run just fine when I submit them one by one with the sbatch specs below.
#!/bin/bash -l
#SBATCH -J Spark
#SBATCH --time=0-05:00:00 # 5 hour
#SBATCH --partition=batch
#SBATCH --qos qos-batch
###SBATCH -N $NODES
###SBATCH --ntasks-per-node=$NTASKS
### -c, --cpus-per-task=<ncpus>
### (multithreading) Request that ncpus be allocated per process
#SBATCH -c 7
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --dependency=singleton
If I use a launcher to submit the same job with different node and task numbers, the system gets confused and tries to assign tasks according to $SLURM_NTASKS, which gives 16, even though I ask for, for example, only 1 node and 3 tasks.
#!/bin/bash -l
for n in {1..4}
do
    for t in {3..4}
    do
        echo "Running benchmark with ${n} nodes and ${t} tasks per node"
        sbatch -N ${n} --ntasks-per-node=${t} spark-teragen.sh
        sleep 5
        sbatch -N ${n} --ntasks-per-node=${t} spark-terasort.sh
        sleep 5
        sbatch -N ${n} --ntasks-per-node=${t} spark-teravalidate.sh
        sleep 5
    done
done
How can I fix the error below and prevent Slurm from assigning a number of tasks per node that exceeds the limit?
Error:
srun: Warning: can't honor --ntasks-per-node set to 3 which doesn't match the
requested tasks 16 with the number of requested nodes 1. Ignoring --ntasks-per-node.
srun: error: Unable to create step for job 233838: More processors requested than
permitted

Resources