GPU allocation within a SBATCH - slurm

I have access to a large GPU cluster (20+ nodes, 8 GPUs per node) and I want to launch a task several times on n GPUs (1 per GPU, n > 8) within one single batch without booking full nodes with the --exclusive flag.
I managed to pre-allocate the resources (see below), but I struggle very hard with launching the task several times within the job. Specifically, my log shows no value for the CUDA_VISIBLE_DEVICES variable.
I know how to do this operation on fully booked nodes with the --nodes and --gres flags. In this situation, I use --nodes=1 --gres=gpu:1 for each srun. However, this solution does not work for the present question, the job hangs indefinitely.
In the MWE below, I have a job asking for 16 gpus (--ntasks and --gpus-per-task). The jobs is composed of 28 tasks which are launched with the srun command.
#!/usr/bin/env bash
#SBATCH --job-name=somename
#SBATCH --partition=gpu
#SBATCH --nodes=1-10
#SBATCH --ntasks=16
#SBATCH --gpus-per-task=1
for i in {1..28}
do
srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
done
wait
The output of this script should look like this:
nodeA 1
nodeR 2
...
However, this is what I got:
nodeA
nodeR
...

When you write
srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
the expansion of the $CUDA_VISIBLE_DEVICES variable will be performed on the master node of the allocation (where the script is run) rather than on the node targeted by srun. You should escape the $:
srun echo $(hostname) \$CUDA_VISIBLE_DEVICES &
By the way, the --gpus-per-task= appeared in the sbatch manpage in the 19.05 version. When you use it with an earlier option, I am not sure how it goes.

Related

Running parallel jobs in slurm

I was wondering if I could ask something about running slurm jobs in parallel.(Please note that I am new to slurm and linux and have only started using it 2 days ago...)
As per the insturctions on the picture below (source : https://hpc.nmsu.edu/discovery/slurm/serial-parallel-jobs/),
I have designed the following bash script
#!/bin/bash
#SBATCH --job-name fmriGLM #job name을 다르게 하기 위해서
#SBATCH --nodes=1
#SBATCH -t 16:00:00 # Time for running job
#SBATCH -o /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/output_fmri_glm.o%j #%j : job id 가 [>
#SBATCH -e /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/error_fmri_glm.e%j
pwd; hostname; date
#SBATCH --ntasks=30
#SBATCH --mem-per-cpu=3000MB
#SBATCH --cpus-per-task=1
for num in {0..29}
do
srun --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
done
wait
The, I ran sbatch as follows: sbatch test_bash
However, when I view the outputs, it is apparent that only one of the sruns in the bash script are being executed... Could anyone tell me where I went wrong and how I can fix it?
**update : when I look at the error file I get the following : srun: Job 43969 step creation temporarily disabled, retrying. I searched the internet and it says that this could be caused by not specifying the memory and hence not having enough memory for the second job.. but I thought that I already specifeid the memory when I did --mem_per_cpu=300MB?
**update : I have tried changing the code as said as in here : Why are my slurm job steps not launching in parallel?, but.. still it didn't work
**potentially pertinent information: our node has about 96cores, which seems odd when compared to tutorials that say one node has like 4cores or something
Thank you!!
Try adding --exclusive to the srun command line:
srun --exclusive --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
This will instruct srun to use a sub-allocation and work as you intended.
Note that the --exclusive option has a different meaning in this context than if used with sbatch.
Note also that different versions of Slurm have a distinct canonical way of doing this, but using --exclusive should work across most versions.
Even though you have solved your problem which turned out to be something else, and that you have already specified --mem_per_cpu=300MB in your sbatch script, I would like to add that in my case, my Slurm setup doesn't allow --mem_per_cpu in sbatch, only --mem. So the srun command will still allocate all the memory and block the subsequent steps. The key for me, is to specify --mem_per_cpu (or --mem) in the srun command.

Slurm: Why do we need Srun in Sbatch script file?

I am new to Slurm and I also found the related questions about this topic. However, I am still confused about several points of how to use srun. According to the official document, srun will typically first allocate resources and then run the parallel jobs. For example, I want to run 20 tasks and if I submit my job based on the following script, I am not sure how many tasks are created. Because sbatch only takes care of allocating resources instead of executing program.
#!/bin/sh
#SBATCH -n 20
#SBATCH --mpi=pmi2
#SBATCH -o myoutputfile.txt
module load mpi/mpich-x86_64
mpirun mpiprogram < inputfile.txt
If I am trying to run sequential program like the following, I am not whether there will be a difference or not. For example, I can simply remove the srun command in this script. What will happen?
#!/bin/sh
#SBATCH -n 1
#SBATCH -N 1
srun tar zxf julia-0.3.11.tar.gz
echo "prefix=/software/julia-0.3.11" > julia/Make.user
cd julia
srun make
The first example will spawn 20 tasks ; sbatch will request 20 CPUs and also set up the environment so that mpirun knows how many CPUs were requested for the job. mpirun will then spawn as many processes as were allocated (provided that OpenMPI was compiled with Slurm support).
The #SBATCH --mpi=pmi2 part is meant for srun so it will have no effect if srun is not called in the submission script.
In the second example, there will be no difference in the number of processes spawned as only one is needed. But, with srun, the output of sstat will be more reliable, the management of signals will be more precise, and the buffering of the output will be more controlled (via the srun command line options).
If you request multiple tasks, srun will instantiate that many processes. It can be an MPI program, or a sequential program that adapts its behaviour based on the SLURM_PROC_ID environment variable.
Also you can run multiple srun in the same submission script. Each instance of srun (called a "step") is then accounted separately in the accounting (sacct).
Finally, srun can use a subset of the allocation and organise the micro-scheduling of many small tasks in a single job (see the example in the srun manpage).

Slurm can't run more than one sbatch task

I've installed Slurm on a 2-node cluster. Both nodes are compute nodes, one is the controller also. I am able to successfully run srun with multiple jobs at once. I am running GPU jobs and have confirmed I can get multiple jobs running on multiple GPUs with srun, up to the number of GPUs in the systems.
However, when I try running sbatch with the same test file, it will only run one batch job, and it only runs on the compute node which is also the controller. The others fail, with an ExitCode of 1:0 in the sacct summary. If I try forcing it to run on the compute node that's not the controller, it won't run and shows the 1:0 exit code. However, just using srun will run on any compute node.
I've made sure the /etc/slurm/slurm.conf files are correct with the specs of the machines. Here is the sbatch .job file I am using:
#!/bin/bash
#SBATCH --job-name=tf_test1
#SBATCH --output=/storage/test.out
#SBATCH --error=/storage/test.err
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000
##SBATCH --mem=10gb
#SBATCH --gres=gpu:1
~/anaconda3/bin/python /storage/tf_test.py
Maybe there is some limitation with sbatch I don't know about?
sbatch creates a job allocation and launches what is called the 'batch step'.
If you aren't familiar with what a job step is, I recommend this page: https://slurm.schedmd.com/quickstart.html
The batch step runs the script passed to it from sbatch. The only way to launch additional job steps is to invoke srun inside the batch step. In your case, it would be
srun ~/anaconda3/bin/python /storage/tf_test.py
This will create a job step running tf_test.py on each task in the allocation. Note that while the command is the same as when you run srun directly, it detects that is inside an allocation via environment variables from sbatch. You can split up the allocation into multiple job steps by running srun with flags like -n[num tasks] instead. ie
#!/bin/bash
#SBATCH --ntasks=2
srun --ntasks=1 something.py
srun --ntasks=1 somethingelse.py
I don't know if you're having any other problems because you didn't post any other error messages or logs.
If using srun on the second node works and using sbatch with the submission script you mention fails without any output written, the most probable reason would be that /storage does not exist, or is not writable by the user, on the second node.
The slurmd logs on the second node should be explicit about this. The default location is /var/log/slurm/slurmd.log, but check the output of scontrol show config| grep Log for definitive information.
Another probable cause that lead to the same behaviour would be that the user is not defined or has a different UID on the second node (but then srun would fail too)
#damienfrancois answer was closest and maybe even correct. After making sure the /storage location was available on all nodes, things run with sbatch. The biggest issue was the /storage location is shared via NFS, but it was read-only for the compute nodes. This had to be changed in /etc/exports to look more like:
/storage *(rw,sync,no_root_squash)
Before it was ro...
The job file I have that works is also a bit different. Here is the current .job file:
#!/bin/bash
#SBATCH -N 1 # nodes requested
#SBATCH --job-name=test
#SBATCH --output=/storage/test.out
#SBATCH --error=/storage/test.err
#SBATCH --time=2-00:00
#SBATCH --mem=36000
#SBATCH --qos=normal
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER#nothing.com
#SBATCH --gres=gpu
srun ~/anaconda3/bin/python /storage/tf_test.py

How to distribute slurm tasks evenly over the nodes?

I want to run a script on a cluster ~200 times using srun commands in one sbatch script. Since executing the script takes some time it would be great to distribute the tasks evenly over the nodes in the cluster. Sadly, I have issues with that.
Now, I created an example script ("hostname.sh") to test different parameters in the sbatch script:
echo `date +%s` `hostname`
sleep 10
This is my sbatch script:
#SBATCH --ntasks=15
#SBATCH --cpus-per-task=16
for i in `seq 200`; do
srun -n1 -N1 bash hostname.sh &
done
wait
I would expect that hostname.sh is executed 200 times (for loop) but only 15 tasks running at the same time (--ntasks=15). Since my biggest node has 56 cores only three jobs should be able to run on this node at the same time (--cpus-per-task=16).
From the ouptut of the script I can see that the first nine tasks are distributed over nine nodes from the cluster but all the other tasks (191!) are executed on one node at the same time. The whole sbatch script execution just took about 15 seconds.
I think I misunderstand some of slurm's parameters but looking at the official documentation did not help me.
You need to use the --exclusive option of srun in that context:
srun -n1 -N1 --exclusive bash hostname.sh &
From the srun manpage:
By default, a job step has access to every CPU allocated to the job.
To ensure that distinct CPUs are allocated to each job step, use the
--exclusive option.
See also the last-but-one example in said documentation.

Run a "monitor" task alongside mpi task in SLURM

I've got an mpi job I run in slurm using an sbatch script which looks something like:
# request 384 processors across 16 nodes for exclusive use:
#SBATCH --exclusive
#SBATCH --ntasks-per-node=24
#SBATCH -n 384
#SBATCH -N 16
#SBATCH --time 3-00:00:00
mpirun myprog
I want to monitor the memory/cpu usage and some other behaviour of the "myprog" processes. I've written a simple script (call it "monitor") which can do this, but I'm stumped on how to use sbatch to run ONE copy of it on each allocated node, at the same time as "myprog".
I think I need to modify the above to something like:
...
srun monitor
mpirun myprog
But I'm confused about whether a) that means "monitor" will run in the background and b) how I can control where "monitor" runs.
To have monitor run 'in the background', so actually the srun is non-blocking and the subsequent mpirun command can start, you simply need to add an ampersand (&) at the end.
To make sure that program runs on the 'master node' of the allocation, just remove the srun command.
If you need that program to run on a specific node, use the -n1 --nodelist option (you probably first need to get the list of all allocated nodes first.) You should also consider using the --overcommit option of srun to avoid dedicating a full CPU to your monitoring program which I assume is not CPU-bound.

Resources