Parallel jobs in SLURM - slurm

How can I run a number of python scripts in different nodes in SLURM?
Suppose,
I select 5 cluster nodes using #SBATCH --nodes=5
and
I have 5 python scripts code1.py, code2.py....code5.py and I want to run each of these scripts in 5 different nodes simultaneously. How can I achieve this?

Do these five scripts need to run in a single job? Do they really need to run simultaneously? Is there some communication happeneing between them? Or are they independent from one another?
If they are essentially independent, then you should most likely pu tthem into 5 different jobs with one nodes each. That way you don't have to find five free nodes, but your the first job can start as soon as there is a single free node. If there are many scripts you want to start like that, it might be interesting to look into job arrays.
If you need to run them in parallel, you will need to use srun in your jobscript to start the scripts. This example shows a job where you have 10 cores per task and each node has one task.
#!/bin/bash
#[...]
#SBATCH -N 5
#SBATCH -n 5
#SBATCH -c 10
#[...]
srun -N 1 -n1 python code1.py &
srun -N 1 -n1 python code2.py &
srun -N 1 -n1 python code3.py &
srun -N 1 -n1 python code4.py &
srun -N 1 -n1 python code5.py &
wait
You need to run the srun calls in the background, as bash would otherwise wait for them to finish before executing the next one.

Related

How to use srun command to assign different GPU for each task in a node with multiple GPUs?

How can I change my slurm script below so that each python job gets a unique GPU? The node had 4 GPUs, I would like to run 1 python job per each GPU.
The problem is that all jobs use the first GPU and other GPUs are idle.
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --exclusive
for i in `seq 0 3`; do
cd ${i}
srun python gpu_code.py &
cd ..
done
wait
In your example your four jobs will be executed sequentially. I suggest submitting four separate jobs that only request a single GPU. Then the jobs only use one GPU and will be executed simultaneously. If the jobs have depencies you can use:
sbatch --dependency=afterok:${jobid_of_previous_job} submit.sh. This job will start after the prvious has finished.
As you do not request GPUs in the submission scripts, you will have to manage the CUDA_VISIBLE_DEVICES var by yourself to direct each python script to one specific GPU.
Try with
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --exclusive
for i in `seq 0 3`; do
cd ${i}
export CUDA_VISIBLE_DEVICES=$i
python gpu_code.py &
cd ..
done
wait

Slurm: Why do we need Srun in Sbatch script file?

I am new to Slurm and I also found the related questions about this topic. However, I am still confused about several points of how to use srun. According to the official document, srun will typically first allocate resources and then run the parallel jobs. For example, I want to run 20 tasks and if I submit my job based on the following script, I am not sure how many tasks are created. Because sbatch only takes care of allocating resources instead of executing program.
#!/bin/sh
#SBATCH -n 20
#SBATCH --mpi=pmi2
#SBATCH -o myoutputfile.txt
module load mpi/mpich-x86_64
mpirun mpiprogram < inputfile.txt
If I am trying to run sequential program like the following, I am not whether there will be a difference or not. For example, I can simply remove the srun command in this script. What will happen?
#!/bin/sh
#SBATCH -n 1
#SBATCH -N 1
srun tar zxf julia-0.3.11.tar.gz
echo "prefix=/software/julia-0.3.11" > julia/Make.user
cd julia
srun make
The first example will spawn 20 tasks ; sbatch will request 20 CPUs and also set up the environment so that mpirun knows how many CPUs were requested for the job. mpirun will then spawn as many processes as were allocated (provided that OpenMPI was compiled with Slurm support).
The #SBATCH --mpi=pmi2 part is meant for srun so it will have no effect if srun is not called in the submission script.
In the second example, there will be no difference in the number of processes spawned as only one is needed. But, with srun, the output of sstat will be more reliable, the management of signals will be more precise, and the buffering of the output will be more controlled (via the srun command line options).
If you request multiple tasks, srun will instantiate that many processes. It can be an MPI program, or a sequential program that adapts its behaviour based on the SLURM_PROC_ID environment variable.
Also you can run multiple srun in the same submission script. Each instance of srun (called a "step") is then accounted separately in the accounting (sacct).
Finally, srun can use a subset of the allocation and organise the micro-scheduling of many small tasks in a single job (see the example in the srun manpage).

GPU allocation within a SBATCH

I have access to a large GPU cluster (20+ nodes, 8 GPUs per node) and I want to launch a task several times on n GPUs (1 per GPU, n > 8) within one single batch without booking full nodes with the --exclusive flag.
I managed to pre-allocate the resources (see below), but I struggle very hard with launching the task several times within the job. Specifically, my log shows no value for the CUDA_VISIBLE_DEVICES variable.
I know how to do this operation on fully booked nodes with the --nodes and --gres flags. In this situation, I use --nodes=1 --gres=gpu:1 for each srun. However, this solution does not work for the present question, the job hangs indefinitely.
In the MWE below, I have a job asking for 16 gpus (--ntasks and --gpus-per-task). The jobs is composed of 28 tasks which are launched with the srun command.
#!/usr/bin/env bash
#SBATCH --job-name=somename
#SBATCH --partition=gpu
#SBATCH --nodes=1-10
#SBATCH --ntasks=16
#SBATCH --gpus-per-task=1
for i in {1..28}
do
srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
done
wait
The output of this script should look like this:
nodeA 1
nodeR 2
...
However, this is what I got:
nodeA
nodeR
...
When you write
srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
the expansion of the $CUDA_VISIBLE_DEVICES variable will be performed on the master node of the allocation (where the script is run) rather than on the node targeted by srun. You should escape the $:
srun echo $(hostname) \$CUDA_VISIBLE_DEVICES &
By the way, the --gpus-per-task= appeared in the sbatch manpage in the 19.05 version. When you use it with an earlier option, I am not sure how it goes.

How to distribute slurm tasks evenly over the nodes?

I want to run a script on a cluster ~200 times using srun commands in one sbatch script. Since executing the script takes some time it would be great to distribute the tasks evenly over the nodes in the cluster. Sadly, I have issues with that.
Now, I created an example script ("hostname.sh") to test different parameters in the sbatch script:
echo `date +%s` `hostname`
sleep 10
This is my sbatch script:
#SBATCH --ntasks=15
#SBATCH --cpus-per-task=16
for i in `seq 200`; do
srun -n1 -N1 bash hostname.sh &
done
wait
I would expect that hostname.sh is executed 200 times (for loop) but only 15 tasks running at the same time (--ntasks=15). Since my biggest node has 56 cores only three jobs should be able to run on this node at the same time (--cpus-per-task=16).
From the ouptut of the script I can see that the first nine tasks are distributed over nine nodes from the cluster but all the other tasks (191!) are executed on one node at the same time. The whole sbatch script execution just took about 15 seconds.
I think I misunderstand some of slurm's parameters but looking at the official documentation did not help me.
You need to use the --exclusive option of srun in that context:
srun -n1 -N1 --exclusive bash hostname.sh &
From the srun manpage:
By default, a job step has access to every CPU allocated to the job.
To ensure that distinct CPUs are allocated to each job step, use the
--exclusive option.
See also the last-but-one example in said documentation.

parallel but different Slurm srun job step invocations not working

I'd like to run the same program on a large number of different input files. I could just submit each as a separate Slurm submission, but I don't want to swamp the queue by dumping 1000s of jobs on it at once. I've been trying to figure out how to process the same number of files by instead creating an allocation first, then within that allocation looping over all the files with srun, giving each invocation a single core from the allocation. The problem is that no matter what I do, only one job step runs at a time. The simplest test case I could come up with is:
#!/usr/bin/env bash
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
wait
It doesn't matter how many cores I assign the allocation:
time salloc -n 1 test
time salloc -n 2 test
time salloc -n 4 test
it always takes 4 seconds. Is it not possible to have multiple job steps execute in parallel?
It turned out to be that the default memory per cpu was not defined, so even single core jobs were running by reserving all the node's RAM.
Setting DefMemPerCPU, or specifying explicit RAM reservations did the trick.
Beware that in that scenario, you measure both the running time and the waiting time. Your submission script should look like this:
#!/usr/bin/env bash
time {
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
srun --exclusive --ntasks 1 -c 1 sleep 1 &
wait
}
and simply submit with
salloc -n 1 test
salloc -n 2 test
salloc -n 4 test
You then should observe the difference, along with messages such as srun: Job step creation temporarily disabled, retrying when using n<4.
Since the OP solved his issue but didn't provide the code, I'll share my take on this problem below.
In my case, I encountered the error/warning step creation temporarily disabled, retrying (Requested nodes are busy). This is because, the srun command that executed first, allocated all the memory. The same cause as encountered by the OP. To solve this, one first optionally(?) specify the total memory allocation for sbatch (if you are using an sbatch script):
#SBATCH --ntasks=4
#SBATCH --mem=[XXXX]MB
And then specify the memory use per srun task:
srun --exclusive --ntasks=1 --mem-per-cpu [XXXX/4]MB sleep 1 &
srun --exclusive --ntasks=1 --mem-per-cpu [XXXX/4]MB sleep 1 &
srun --exclusive --ntasks=1 --mem-per-cpu [XXXX/4]MB sleep 1 &
srun --exclusive --ntasks=1 --mem-per-cpu [XXXX/4]MB sleep 1 &
wait
I didn't specify CPU count for srun because in my sbatch script I included #SBATCH --cpus-per-task=1. For the same reason I suspect you could use --mem instead of --mem-per-cpu in the srun command, but I haven't tested this configuration.

Resources