How to allocate node by node with Slurm?

My goal:
I would like to launch multiple codes node by node and have each node allocated at 100%, i.e. showing as "alloc" in sinfo:
epic* up infinite 4 alloc lio[1-2]
And what I get instead is nodes showing as "mix":
epic* up infinite 4 mix lio[1-3,5]
My script:
#!/bin/bash
#SBATCH -A pt
#SBATCH -p epic
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH -J concentration
#SBATCH --array=1-4
. /usr/share/Modules/init/bash
module purge
module load openmpi-gcc/4.0.4-pmix_v2
MAXLEVEL=14
Ranf=8000
case $SLURM_ARRAY_TASK_ID in
1) phi='0.01'
;;
2) phi='0.008'
;;
3) phi='0.005'
;;
4) phi='0.001'
;;
esac
mkdir RBnf-P=$phi
cp RBnf `pwd`/RBnf-P=$phi/
cd RBnf-P=$phi
srun --mpi=pmix_v2 -J Ra${phi} ./RBnf $Ranf $MAXLEVEL $phi   # must be $phi (lowercase); $Phi is a different, unset variable
Each computation needs 16 processes, and each node has 32 cores.
I have 4 computations to run.
My question: how can I allocate only 2 nodes, each at 100%?
With this script Slurm will use 4 nodes, so each node is only at 50% of its capacity (4 * 16/32). I would like my codes to run on only 2 nodes at 100% of their capacity (2 * 32/32).
Slurm allocates another node instead of filling a node that is already partially used; that is why I get "mix" nodes, whereas I want only 2 "alloc" nodes.
Any ideas?

I found why I couldn't allocate node by node.
The "OverSubscribe" option was not set in the slurm.conf file.
That is why the nodes showed up as "mix" instead of being 100% allocated.
https://slurm.schedmd.com/cons_res_share.html
Now the four array tasks are automatically packed onto two nodes.
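For reference, a minimal sketch of the kind of slurm.conf partition configuration this refers to; the partition name comes from the question, while the node list and the other values are assumptions (the cons_res_share page linked above has the authoritative details):

# slurm.conf (sketch only, assumed values)
SelectType=select/cons_res                 # schedule individual cores rather than whole nodes
SelectTypeParameters=CR_Core
PartitionName=epic Nodes=lio[1-5] Default=YES State=UP OverSubscribe=YES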

Related

Running a command multiple times on different nodes in SLURM

I want to run three instances of GROMACS mdrun on three different nodes.
I have three temperatures, 200, 220 and 240 K, and I want to run the 200 K simulation on node 1, the 220 K simulation on node 2 and the 240 K simulation on node 3. I need to do all of this in one script because I have a job number limit.
How can I do that in Slurm?
Currently I have:
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks=3
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00
#SBATCH --job-name=1us
#SBATCH --error=h.err
#SBATCH --output=h.out
#SBATCH --partition=standard
as my sbatch parameters and
for i in 1 2 3
do
    T=$(($Ti + ($i-1)*20))   # $Ti is the starting temperature, 200 K here
    cd T_$T/1000
    gmx_mpi grompp -f heating.mdp -c init_conf.gro -p topol.top -o quench.tpr -maxwarn 1
    gmx_mpi mdrun -s quench.tpr -deffnm heatingLDA -v &
    cd ../../
done
wait
This is how I am running mdrun, but it is not running as fast as I want. Firstly, the mdrun instances do not start simultaneously: the 200 K run starts first, and only 2-3 minutes later does the 220 K run start. Secondly, the speed is much slower than expected.
Could you tell me how I can achieve this?
Thank you in advance.
Best regards,
Ved
You need to add a line to the Slurm script:
#SBATCH --nodelist=${NODENAME}
where ${NODENAME} is the name of one of the nodes (node 1, 2 or 3).
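As an illustration only, and not part of the original answer: within a single 3-node allocation, the same idea can be applied per step with srun's --nodelist (-w) option, taking the node names from the job's own node list. Everything else mirrors the loop from the question.

# Sketch, assuming the 3-node sbatch header shown in the question.
NODES=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))   # the three allocated node names
Ti=200                                                     # starting temperature in K
for i in 1 2 3
do
    T=$((Ti + (i-1)*20))
    cd T_$T/1000
    gmx_mpi grompp -f heating.mdp -c init_conf.gro -p topol.top -o quench.tpr -maxwarn 1
    # pin this mdrun instance to one specific node of the allocation
    srun --nodes=1 --ntasks=1 --nodelist=${NODES[$((i-1))]} gmx_mpi mdrun -s quench.tpr -deffnm heatingLDA -v &
    cd ../../
done
wait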

GPU allocation within an sbatch job

I have access to a large GPU cluster (20+ nodes, 8 GPUs per node) and I want to launch a task several times on n GPUs (1 task per GPU, n > 8) within a single batch job, without booking full nodes with the --exclusive flag.
I managed to pre-allocate the resources (see below), but I am struggling to launch the task several times within the job. Specifically, my log shows no value for the CUDA_VISIBLE_DEVICES variable.
I know how to do this on fully booked nodes with the --nodes and --gres flags; in that situation, I use --nodes=1 --gres=gpu:1 for each srun. However, this solution does not work for the present question: the job hangs indefinitely.
In the MWE below, I have a job asking for 16 GPUs (--ntasks and --gpus-per-task). The job is composed of 28 tasks, which are launched with the srun command.
#!/usr/bin/env bash
#SBATCH --job-name=somename
#SBATCH --partition=gpu
#SBATCH --nodes=1-10
#SBATCH --ntasks=16
#SBATCH --gpus-per-task=1
for i in {1..28}
do
    srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
done
wait
The output of this script should look like this:
nodeA 1
nodeR 2
...
However, this is what I got:
nodeA
nodeR
...
When you write
srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
the expansion of the $CUDA_VISIBLE_DEVICES variable will be performed on the master node of the allocation (where the script is run) rather than on the node targeted by srun. You should escape the $:
srun echo $(hostname) \$CUDA_VISIBLE_DEVICES &
By the way, the --gpus-per-task option appeared in the sbatch manpage in version 19.05. If you use it with an earlier version, I am not sure how it behaves.
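A side note that goes beyond the original answer: echo launched by srun receives the escaped string literally and does not expand it itself, so if you want the actual values printed, one option is to let a shell on the allocated node do the expansion, for example:

for i in {1..28}
do
    # single quotes delay the expansion until the shell runs on the remote node
    srun --ntasks=1 bash -c 'echo "$(hostname) $CUDA_VISIBLE_DEVICES"' &
done
wait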

How do I dynamically assign the number of cores per node?

#!/bin/bash
#SBATCH --job-name=Parallel      # Job name
#SBATCH --output=slurmdiv.out    # Output file name
#SBATCH --error=slurmdiv.err     # Error file name
#SBATCH --partition=hadoop       # Queue
#SBATCH --nodes=1
#SBATCH --time=01:00:00          # Time limit
The above script does not work without the --ntasks-per-node directive. The number of cores per node depends on the queue being used. I would like to use the maximum number of cores per node without having to specify it ahead of time in the Slurm script. I'm using this to run an R script that uses detectCores() and mclapply.
You can try adding the #SBATCH --exclusive parameter to your submission script, so that Slurm allocates a full node for your job without your having to explicitly specify a number of tasks. Then you can use detectCores() in your script.
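A sketch of what the submission script could then look like; the R script name is a placeholder, the rest is taken from the question:

#!/bin/bash
#SBATCH --job-name=Parallel      # Job name
#SBATCH --output=slurmdiv.out    # Output file name
#SBATCH --error=slurmdiv.err     # Error file name
#SBATCH --partition=hadoop       # Queue
#SBATCH --nodes=1
#SBATCH --exclusive              # claim the whole node, whatever its core count
#SBATCH --time=01:00:00          # Time limit
Rscript myscript.R               # placeholder script; it can size mclapply with parallel::detectCores()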

Why does Slurm assign more tasks than I asked for when I sbatch multiple jobs with a .sh file?

I submit some cluster-mode Spark jobs, which run just fine when I submit them one by one with the sbatch specs below.
#!/bin/bash -l
#SBATCH -J Spark
#SBATCH --time=0-05:00:00 # 5 hour
#SBATCH --partition=batch
#SBATCH --qos qos-batch
###SBATCH -N $NODES
###SBATCH --ntasks-per-node=$NTASKS
### -c, --cpus-per-task=<ncpus>
### (multithreading) Request that ncpus be allocated per process
#SBATCH -c 7
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --dependency=singleton
If I use a launcher to submit the same job with different node and task counts, the system gets confused and tries to assign tasks according to $SLURM_NTASKS, which gives 16, even though I am asking for, for example, only 1 node and 3 tasks.
#!/bin/bash -l
for n in {1..4}
do
    for t in {3..4}
    do
        echo "Running benchmark with ${n} nodes and ${t} tasks per node"
        sbatch -N ${n} --ntasks-per-node=${t} spark-teragen.sh
        sleep 5
        sbatch -N ${n} --ntasks-per-node=${t} spark-terasort.sh
        sleep 5
        sbatch -N ${n} --ntasks-per-node=${t} spark-teravalidate.sh
        sleep 5
    done
done
How can I fix the error below and prevent Slurm from assigning a number of tasks per node that exceeds the limit?
Error:
srun: Warning: can't honor --ntasks-per-node set to 3 which doesn't match the
requested tasks 16 with the number of requested nodes 1. Ignoring --ntasks-per-node.
srun: error: Unable to create step for job 233838: More processors requested than
permitted

Slurm: select nodes with a specified number of CPUs

I'm using Slurm on a cluster where individual partitions contain dissimilar nodes. Specifically, the nodes have varying numbers of CPUs. My code is a single-core application used for a parameter sweep, so I want to fully use, e.g., a 32-CPU node by sending it 32 jobs.
How can I select nodes (within a named partition) that have a specified number of CPUs?
I can check my partition configuration via
sinfo -e -p <partition_name> -o "%9P %3c %.5D %6t " -t idle,mix
PARTITION CPU NODES STATE
<partition_name> 16 63 mix
<partition_name> 32 164 mix
But if I use a submissions script like
[snip preamble]
#SBATCH --partition <partition_name> # resource to be used
#SBATCH --nodes 1 # Num nodes
#SBATCH -N 1 # Num cores per job
#SBATCH --cores-per-socket=32 # Cores per node
the slurm scheduler says
sbatch: error: Socket, core and/or thread specification can not be satisfied
PS. A minor correction: my code to get the partition info isn't the best. In case anyone looks up this question later, here is a better query (using X, Y for the socket and core counts) that helps identify the problem that damien's excellent answer solved:
sinfo -e -p <partition_name> -o "%9P %3c %.3D %6t %2X %2Y %N" -t idle,mix
To strictly answer your question: With
#SBATCH --cores-per-socket=32
you request 32 cores per socket, that is, per physical CPU. I guess those machines have two CPUs each, so you should request something like
#SBATCH --sockets-per-node=2
#SBATCH --cores-per-socket=16
Another way of requesting the same is to ask for
#SBATCH --nodes 1
#SBATCH --tasks-per-node 32
But please note that, if your cluster allows node sharing, what you do seems more suited for a job array:
#SBATCH --ntasks 1
#SBATCH --array 1-32
IDS=($(seq RUN_ID_FIRST RUN_ID_LAST))   # replace RUN_ID_FIRST/RUN_ID_LAST with the actual first/last run IDs
RUN_ID=${IDS[$SLURM_ARRAY_TASK_ID]}
matlab -nojvm -singleCompThread -r "try myscript(${RUN_ID}); catch me; disp(' *** error'); end; exit" > ./result_${RUN_ID}
This will launch 32 independent jobs, each taking care of running the Matlab script for one value of the parameter sweep.
To answer your additional question: if a 32-process job is scheduled on a 16-CPU node, the node will be overloaded, and depending on the containment solution set up by the administrators, your processes might impact other users' jobs and slow them down.
