Slurm: select nodes with specified number of CPUs

I'm using slurm on a cluster where single partitions have dissimilar nodes. Specifically, the nodes have varying numbers of CPUs. My code is a single-core application used for a parameter sweep, so I want to fully use a (e.g.) 32-CPU node by sending it 32 jobs.
How can I select nodes (within a named partition) that have a specified number of CPUs?
I know my Partition configuration via
sinfo -e -p <partition_name> -o "%9P %3c %.5D %6t " -t idle,mix
PARTITION CPUS NODES STATE
<partition_name> 16 63 mix
<partition_name> 32 164 mix
But if I use a submission script like
[snip preamble]
#SBATCH --partition <partition_name> # resource to be used
#SBATCH --nodes 1 # Num nodes
#SBATCH -n 1 # Num tasks per job
#SBATCH --cores-per-socket=32 # Cores per node
the slurm scheduler says
sbatch: error: Socket, core and/or thread specification can not be satisfied
P.S. A minor correction: my command above for getting partition info isn't the best. In case anyone looks up this question later, here is a better query (using X and Y for the socket and core counts) that helps identify the problem that damien's excellent answer solved:
sinfo -e -p <partition_name> -o "%9P %3c %.3D %6t %2X %2Y %N" -t idle,mix

To strictly answer your question: With
#SBATCH --cores-per-socket=32
you request 32 cores per socket, that is, per physical CPU. I guess those machines have two sockets (two physical CPUs), so you should request something like
#SBATCH --sockets-per-node=2
#SBATCH --cores-per-socket=16
Another way of requesting the same is to ask for
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 32
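Putting that together with the parameter sweep from the question, a minimal complete script might look like this; the partition name and my_single_core_app are placeholders, and each task picks its sweep value from its rank:
#!/bin/bash
#SBATCH --partition=<partition_name>   # partition with mixed 16- and 32-CPU nodes
#SBATCH --nodes=1                      # a single node
#SBATCH --ntasks-per-node=32           # only nodes with at least 32 CPUs qualify
# srun starts 32 copies; the remote shell expands SLURM_PROCID (0-31) per task
srun bash -c './my_single_core_app $SLURM_PROCID'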
But please note that, if your cluster allows node sharing, what you are doing seems better suited for a job array:
#SBATCH --ntasks 1
#SBATCH --array 1-32
IDS=($(seq RUN_ID_FIRST RUN_ID_LAST))   # replace RUN_ID_FIRST/RUN_ID_LAST with the sweep bounds
RUN_ID=${IDS[$SLURM_ARRAY_TASK_ID-1]}   # bash arrays are 0-indexed, task IDs start at 1
matlab -nojvm -singleCompThread -r "try myscript(${RUN_ID}); catch me; disp(' *** error'); end; exit" > ./result_${RUN_ID}
This will launch 32 independent jobs, each taking care of running the Matlab script for one value of the parameter sweep.
To answer your additional question: if a 32-process job is scheduled on a 16-CPU node, the node will be overloaded, and depending on the containment solution set up by the administrators, your processes might impact other users' jobs and slow them down.

Related

How to allocate node by node with slurm?

My goal:
I would like to launch multiple codes node by node and have each node allocated at 100%, so that sinfo shows, e.g.:
epic* up infinite 2 alloc lio[1-2]
And what I get:
epic* up infinite 4 mix lio[1-3,5]
My script:
#!/bin/bash
#SBATCH -A pt
#SBATCH -p epic
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH -J concentration
#SBATCH --array=1-4
. /usr/share/Modules/init/bash
module purge
module load openmpi-gcc/4.0.4-pmix_v2
MAXLEVEL=14
Ranf=8000
case $SLURM_ARRAY_TASK_ID in
1) phi='0.01'
;;
2) phi='0.008'
;;
3) phi='0.005'
;;
4) phi='0.001'
;;
esac
mkdir -p RBnf-P=$phi
cp RBnf RBnf-P=$phi/
cd RBnf-P=$phi
srun --mpi=pmix_v2 -J Ra${phi} ./RBnf $Ranf $MAXLEVEL $phi   # note: $phi, not $Phi; shell variables are case-sensitive
Each computation needs 16 processes and each node has 32 cores.
I have 4 computations to run.
My question: how can I allocate only 2 nodes, each at 100%?
As it stands, my script will use 4 nodes, so each node will be used at 50% of its capacity (4 * 16/32). I would like my codes to run on only 2 nodes at 100% of their capacity (2 * 32/32).
With this script, Slurm allocates another node instead of filling a node that is already in use. That is why I get "mix" nodes when I want only 2 nodes in the "alloc" state.
Do you have any ideas?
I found out why I couldn't allocate node by node: the OverSubscribe option wasn't specified in the slurm.conf file. That is why I got "mix" nodes instead of nodes allocated at 100%.
https://slurm.schedmd.com/cons_res_share.html
Now my jobs automatically fill two nodes.
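For reference, here is a sketch of where that option lives in slurm.conf; the select plugin and the node and partition names below are illustrative, not my actual configuration (see the linked page for the combination that fits your site):
# consumable-resources scheduling, allocating individual cores
SelectType=select/cons_res
SelectTypeParameters=CR_Core
# OverSubscribe on the partition line controls how jobs may share resources
PartitionName=epic Nodes=lio[1-8] OverSubscribe=YES Default=YES State=UP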

Slurm: Why do we need Srun in Sbatch script file?

I am new to Slurm and I have found the related questions about this topic. However, I am still confused about several points of how to use srun. According to the official documentation, srun will typically first allocate resources and then run the parallel job. For example, if I want to run 20 tasks and I submit my job with the following script, I am not sure how many tasks are created, because sbatch only takes care of allocating resources rather than executing the program.
#!/bin/sh
#SBATCH -n 20
#SBATCH --mpi=pmi2
#SBATCH -o myoutputfile.txt
module load mpi/mpich-x86_64
mpirun mpiprogram < inputfile.txt
If I am trying to run a sequential program like the following, I am not sure whether there will be a difference or not. For example, I can simply remove the srun commands in this script. What will happen?
#!/bin/sh
#SBATCH -n 1
#SBATCH -N 1
srun tar zxf julia-0.3.11.tar.gz
echo "prefix=/software/julia-0.3.11" > julia/Make.user
cd julia
srun make
The first example will spawn 20 tasks; sbatch will request 20 CPUs and also set up the environment so that mpirun knows how many CPUs were requested for the job. mpirun will then spawn as many processes as were allocated (provided that the MPI library was compiled with Slurm support).
The #SBATCH --mpi=pmi2 part is meant for srun, so it will have no effect if srun is not called in the submission script.
In the second example, there will be no difference in the number of processes spawned as only one is needed. But, with srun, the output of sstat will be more reliable, the management of signals will be more precise, and the buffering of the output will be more controlled (via the srun command line options).
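For instance, while the job is running you can query per-step statistics with something like this (the job and step IDs are illustrative):
sstat --format=JobID,MaxRSS,AveCPU -j 1234.0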
If you request multiple tasks, srun will instantiate that many processes. It can be an MPI program, or a sequential program that adapts its behaviour based on the SLURM_PROCID environment variable.
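As a minimal sketch of the latter (the script name is made up), each task can branch on its rank:
#!/bin/sh
# launched with: srun -n 4 ./by_rank.sh
# SLURM_PROCID ranges from 0 to ntasks-1
case $SLURM_PROCID in
    0) echo "task 0: coordinating" ;;
    *) echo "task $SLURM_PROCID: working" ;;
esac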
You can also run multiple srun commands in the same submission script. Each invocation of srun (called a "step") is then accounted for separately in the accounting (sacct).
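For example, after a job that ran two srun commands, sacct shows one line for the allocation, one for the batch script, and one per step (the job ID is illustrative):
sacct -j 1234 --format=JobID,JobName,Elapsed,State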
Finally, srun can use a subset of the allocation and organise the micro-scheduling of many small tasks in a single job (see the example in the srun manpage).
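A sketch of that last pattern, running twelve short tasks over a 4-CPU allocation at most four at a time (short_task is a placeholder program):
#!/bin/sh
#SBATCH -n 4
# each step takes one dedicated CPU; excess steps queue until a CPU frees up
for i in $(seq 12); do
    srun -n1 --exclusive ./short_task $i &
done
wait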

GPU allocation within a SBATCH

I have access to a large GPU cluster (20+ nodes, 8 GPUs per node) and I want to launch a task several times on n GPUs (1 per GPU, n > 8) within a single batch job, without booking full nodes with the --exclusive flag.
I managed to pre-allocate the resources (see below), but I am struggling with launching the task several times within the job. Specifically, my log shows no value for the CUDA_VISIBLE_DEVICES variable.
I know how to do this operation on fully booked nodes with the --nodes and --gres flags. In that situation, I use --nodes=1 --gres=gpu:1 for each srun. However, this solution does not work for the present question: the job hangs indefinitely.
In the MWE below, I have a job asking for 16 GPUs (--ntasks and --gpus-per-task). The job is composed of 28 tasks which are launched with the srun command.
#!/usr/bin/env bash
#SBATCH --job-name=somename
#SBATCH --partition=gpu
#SBATCH --nodes=1-10
#SBATCH --ntasks=16
#SBATCH --gpus-per-task=1
for i in {1..28}
do
srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
done
wait
The output of this script should look like this:
nodeA 1
nodeR 2
...
However, this is what I got:
nodeA
nodeR
...
When you write
srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
the expansion of $(hostname) and of the $CUDA_VISIBLE_DEVICES variable is performed by the shell on the master node of the allocation (where the script runs) rather than on the node targeted by srun. Escaping the $ is not enough here, because srun executes echo directly, with no remote shell to perform the expansion; instead, start a shell on the target node and let it do the expansion:
srun bash -c 'echo $(hostname) $CUDA_VISIBLE_DEVICES' &
By the way, --gpus-per-task= first appeared in the sbatch manpage in version 19.05. I am not sure how it behaves with earlier versions (sbatch --version will tell you what is installed).

How to gather processed information from nodes SLURM/PBS

I am new to parallel computing and I do not fully understand the use of PBS systems. I have successfully installed SLURM and set up processing nodes, but I cannot figure out how to distribute a task between multiple nodes.
There are a lot of simple examples, but they just run simple "Hello World" programs and that's all.
Consider the following example, which I found on the internet.
#!/bin/bash
#SBATCH -N 4
#SBATCH -c 1
#SBATCH --time=0-00:15:00 # 15 minutes
#SBATCH --job-name="just_a_test"
module load python
python --version
A simple script that gets the Python version.
When I run it using sbatch python.slurm, the result is saved only on the first node, even though I set the node count to 4. On the other hand, srun -N4 /bin/hostname works fine.
But this is not the main question.
I cannot understand how I have to write my parallel algorithm.
Any example of a parallel algorithm would do: array sorting, matrix multiplication, or whatever.
These are the steps used, for example, in Hadoop or in a plain multithreaded environment:
Get input from a source.
Divide the input into chunks; the number of chunks should be related to the node count.
Send these chunks to each processing node/thread.
Wait for all threads to complete.
Gather the processed information and show it to the user after merging.
How can I do the same using SLURM or any PBS system?
#!/bin/bash
#SBATCH -N 4
#SBATCH -c 1
#SBATCH --time=0-00:15:00 # 15 minutes
#SBATCH --job-name="just_a_test"
What do I have to write here?
Please explain this or point me to a good article to read, because I haven't found any.
Thanks
The most basic way to do this is to use pbsdsh:
pbsdsh hostname
will make the hostname command execute once for each execution slot (core or thread) in your job. Note that pbsdsh is a PBS/TORQUE command, so you will also need to translate your #SBATCH directives to their #PBS equivalents.
The more universal way to do this is through an MPI implementation.
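To make the divide/process/gather steps from the question concrete without MPI, here is a rough sketch using plain job steps; the input file, chunk handling, and process program are all illustrative:
#!/bin/bash
#SBATCH -N 4
#SBATCH --ntasks=4
# 1. divide: split the input into one chunk per task (GNU split, no broken lines)
split -n l/4 input.txt chunk_
# 2. process: one job step per chunk, all running in parallel
i=0
for chunk in chunk_*; do
    srun -N1 -n1 --exclusive ./process "$chunk" > "result_$i" &
    i=$((i+1))
done
# 3. gather: wait for every step, then merge the partial results
wait
cat result_* > final_result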

Run a "monitor" task alongside mpi task in SLURM

I've got an mpi job I run in slurm using an sbatch script which looks something like:
# request 384 processors across 16 nodes for exclusive use:
#SBATCH --exclusive
#SBATCH --ntasks-per-node=24
#SBATCH -n 384
#SBATCH -N 16
#SBATCH --time 3-00:00:00
mpirun myprog
I want to monitor the memory/cpu usage and some other behaviour of the "myprog" processes. I've written a simple script (call it "monitor") which can do this, but I'm stumped on how to use sbatch to run ONE copy of it on each allocated node, at the same time as "myprog".
I think I need to modify the above to something like:
...
srun monitor
mpirun myprog
But I'm confused about whether a) that means "monitor" will run in the background and b) how I can control where "monitor" runs.
To have monitor run 'in the background', so that the srun call is non-blocking and the subsequent mpirun command can start, you simply need to add an ampersand (&) at the end.
To make sure that program runs on the 'master node' of the allocation, just remove the srun command.
If you need that program to run on a specific node, use the -n1 --nodelist options (you probably need to get the list of allocated nodes first). You should also consider using the --overcommit option of srun to avoid dedicating a full CPU to your monitoring program, which I assume is not CPU-bound.
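Putting those pieces together, a sketch of the modified submission script might be as follows (monitor and myprog as in the question; untested, so treat it as a starting point):
# request 384 processors across 16 nodes for exclusive use:
#SBATCH --exclusive
#SBATCH --ntasks-per-node=24
#SBATCH -n 384
#SBATCH -N 16
#SBATCH --time 3-00:00:00
# one monitor instance per node, overcommitted so it does not claim a CPU
srun -N "$SLURM_NNODES" --ntasks-per-node=1 --overcommit ./monitor &
MONITOR_PID=$!
mpirun myprog
kill $MONITOR_PID   # stop the monitors once myprog has finished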
