How to run a PyTorch script on Slurm? - slurm

I am struggling with a basic Python script that uses PyTorch to print the CUDA devices on Slurm.
This is the output of sinfo.
(ml) [s.1915438#sl2 pytorch_gpu_check]$ sinfo -o "%.10P %.5a %.10l %.6D %.6t %.20N %.10G"
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST GRES
compute* up 3-00:00:00 1 drain* scs0123 (null)
compute* up 3-00:00:00 1 down* scs0050 (null)
compute* up 3-00:00:00 120 alloc scs[0001-0009,0011-0 (null)
compute* up 3-00:00:00 1 down scs0010 (null)
developmen up 30:00 1 drain* scs0123 (null)
developmen up 30:00 1 down* scs0050 (null)
developmen up 30:00 120 alloc scs[0001-0009,0011-0 (null)
developmen up 30:00 1 down scs0010 (null)
gpu up 2-00:00:00 2 mix scs[2001-2002] gpu:v100:2
gpu up 2-00:00:00 2 idle scs[2003-2004] gpu:v100:2
accel_ai up 2-00:00:00 1 mix scs2041 gpu:a100:8
accel_ai up 2-00:00:00 4 idle scs[2042-2045] gpu:a100:8
accel_ai_d up 2:00:00 1 mix scs2041 gpu:a100:8
accel_ai_d up 2:00:00 4 idle scs[2042-2045] gpu:a100:8
accel_ai_m up 12:00:00 1 idle scs2046 gpu:1g.5gb
s_highmem_ up 3-00:00:00 1 mix scs0151 (null)
s_highmem_ up 3-00:00:00 1 idle scs0152 (null)
s_compute_ up 3-00:00:00 2 idle scs[3001,3003] (null)
s_compute_ up 1:00:00 2 idle scs[3001,3003] (null)
s_gpu_eng up 2-00:00:00 1 idle scs2021 gpu:v100:4
I have access to the accel_ai partition.
This is the Python file I am trying to run.
(ml) [s.1915438#sl2 pytorch_gpu_check]$ cat gpu.py
import torch

print(torch.__version__)
print(f"Is available: {torch.cuda.is_available()}")
try:
    print(f"Current Devices: {torch.cuda.current_device()}")
except:
    print('Current Devices: Torch is not compiled for GPU or No GPU')
print(f"No. of GPUs: {torch.cuda.device_count()}")
And this is my bash file to submit the job.
(ml) [s.1915438#sl2 pytorch_gpu_check]$ cat check_gpu.sh
#!bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:00:40
#SBATCH --ntasks=1
#SBATCH --job-name=gpu
#SBATCH --output=gpu.%j.out
#SBATCH --error=gpu.%j.err
#SBATCH --mem-per-cpu=10
#SBATCH --gres=gpu:1
#SBATCH --account=scs2045
#SBATCH --partition=accel_ai
module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
python gpu.py
This is what happens when I run the bash script to submit the job.
(ml) [s.1915438#sl2 pytorch_gpu_check]$ bash check_gpu.sh
1.11.0
Is available: False
Current Devices: Torch is not compiled for GPU or No GPU
No. of GPUs: 0
One thing I would like to make clear is that this PyTorch version comes with CUDA 11.3 support, installed from PyTorch's website.
Can anyone tell me what I am doing wrong?
Also, even if I exclude the following lines, the output is the same.
module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml

As per your sinfo output, you have separate partitions with GPU access; you need to run your program on one of those. The job submission script can be modified as follows. You also need to specify the GPU type using --gres.
...
...
#SBATCH --partition=gpu
#SBATCH --gres=<Enter gpu type>
...
...
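For the accel_ai partition shown in the sinfo output above, that could look something like this (the a100 type is taken from the GRES column; requesting one GPU is just an example):
#SBATCH --partition=accel_ai
#SBATCH --gres=gpu:a100:1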

There are a couple of blunders in my approach. In the job file, the first line should be #!/bin/bash, not #!bin/bash.
Also, Slurm has a dedicated command, sbatch, to submit job files. So in order to run a job file, for example check_gpu.sh, we should use sbatch check_gpu.sh, not bash check_gpu.sh.
The reason I was getting the following output is that bash treats lines starting with # as comments, so all the #SBATCH directives were ignored and the script simply ran on the login node instead of being submitted to Slurm.
(ml) [s.1915438#sl2 pytorch_gpu_check]$ bash check_gpu.sh
1.11.0
Is available: False
Current Devices: Torch is not compiled for GPU or No GPU
No. of GPUs: 0
Thus, only the following lines are executed from the job script.
module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
python gpu.py
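For reference, the corrected check_gpu.sh; it is identical to the script above except for the shebang, and it is now submitted with sbatch rather than bash:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:00:40
#SBATCH --ntasks=1
#SBATCH --job-name=gpu
#SBATCH --output=gpu.%j.out
#SBATCH --error=gpu.%j.err
#SBATCH --mem-per-cpu=10
#SBATCH --gres=gpu:1
#SBATCH --account=scs2045
#SBATCH --partition=accel_ai
module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
python gpu.py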
After these corrections, I submitted the job script with sbatch and it works as expected.
[s.1915438#sl1 pytorch_gpu_check]$ sbatch check_gpu.sh
Submitted batch job 7133028
[s.1915438#sl1 pytorch_gpu_check]$ cat gpu.7133029.out
1.11.0
Is available: True
Current Devices: 0
No. of GPUs: 1
GPU Name:NVIDIA A100-PCIE-40GB
[s.1915438#sl1 pytorch_gpu_check]$ cat gpu.7133029.err

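As a side note, the GPU Name line in that output does not come from the gpu.py listed earlier; an extra print along these lines (hypothetical, but using PyTorch's torch.cuda.get_device_name) would produce it:
print(f"GPU Name:{torch.cuda.get_device_name(0)}")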
Related

How to use the srun command to assign a different GPU to each task on a node with multiple GPUs?

How can I change my Slurm script below so that each Python job gets a unique GPU? The node has 4 GPUs, and I would like to run 1 Python job per GPU.
The problem is that all jobs use the first GPU and the other GPUs are idle.
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --exclusive
for i in `seq 0 3`; do
cd ${i}
srun python gpu_code.py &
cd ..
done
wait
In your example the four jobs will be executed sequentially. I suggest submitting four separate jobs that each request only a single GPU; then each job uses one GPU and they will be executed simultaneously. If the jobs have dependencies you can use:
sbatch --dependency=afterok:${jobid_of_previous_job} submit.sh. This job will start only after the previous one has finished.
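A minimal sketch of that per-job approach (the script name single_gpu.sh, the --gres line, and the per-directory layout are assumptions to illustrate the idea, not part of the original answer):
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
# one GPU is requested per job, so Slurm sets CUDA_VISIBLE_DEVICES accordingly
cd ${1}   # per-job working directory, passed as the first sbatch argument
srun python gpu_code.py
Submitted once per directory:
for i in `seq 0 3`; do
sbatch single_gpu.sh ${i}
done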
If you instead keep everything in a single job, note that your submission script does not request GPUs, so you will have to manage the CUDA_VISIBLE_DEVICES variable yourself to direct each Python script to one specific GPU.
Try with:
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --exclusive
for i in `seq 0 3`; do
cd ${i}
export CUDA_VISIBLE_DEVICES=$i
python gpu_code.py &
cd ..
done
wait

GPU allocation within a SBATCH

I have access to a large GPU cluster (20+ nodes, 8 GPUs per node) and I want to launch a task several times on n GPUs (1 per GPU, n > 8) within one single batch without booking full nodes with the --exclusive flag.
I managed to pre-allocate the resources (see below), but I struggle very hard with launching the task several times within the job. Specifically, my log shows no value for the CUDA_VISIBLE_DEVICES variable.
I know how to do this operation on fully booked nodes with the --nodes and --gres flags. In that situation, I use --nodes=1 --gres=gpu:1 for each srun. However, this solution does not work for the present question; the job hangs indefinitely.
In the MWE below, I have a job asking for 16 GPUs (--ntasks and --gpus-per-task). The job is composed of 28 tasks which are launched with the srun command.
#!/usr/bin/env bash
#SBATCH --job-name=somename
#SBATCH --partition=gpu
#SBATCH --nodes=1-10
#SBATCH --ntasks=16
#SBATCH --gpus-per-task=1
for i in {1..28}
do
srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
done
wait
The output of this script should look like this:
nodeA 1
nodeR 2
...
However, this is what I got:
nodeA
nodeR
...
When you write
srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
the expansion of the $CUDA_VISIBLE_DEVICES variable will be performed on the master node of the allocation (where the script is run) rather than on the node targeted by srun. You should escape the $:
srun echo $(hostname) \$CUDA_VISIBLE_DEVICES &
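Note (an addition to the original answer): srun runs echo directly, without a shell, so the escaped variable is passed to echo as a literal string and printed verbatim. To have both $(hostname) and $CUDA_VISIBLE_DEVICES evaluated on the node that actually runs the task, a common pattern is to wrap the command in a shell:
srun bash -c 'echo $(hostname) $CUDA_VISIBLE_DEVICES' &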
By the way, the --gpus-per-task= option appeared in the sbatch manpage in version 19.05. I am not sure how it behaves when used with an earlier version.

Why does Slurm assign more tasks than I asked for when I "sbatch" multiple jobs with a .sh file?

I submit some cluster-mode Spark jobs, which run just fine when I submit them one by one with the sbatch specs below.
#!/bin/bash -l
#SBATCH -J Spark
#SBATCH --time=0-05:00:00 # 5 hour
#SBATCH --partition=batch
#SBATCH --qos qos-batch
###SBATCH -N $NODES
###SBATCH --ntasks-per-node=$NTASKS
### -c, --cpus-per-task=<ncpus>
### (multithreading) Request that ncpus be allocated per process
#SBATCH -c 7
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --dependency=singleton
If I use a launcher to submit the same job with different node and task numbers, the system gets confused and tries to assign tasks according to $SLURM_NTASKS, which gives 16, even though I ask for, for example, only 1 node and 3 tasks per node.
#!/bin/bash -l
for n in {1..4}
do
for t in {3..4}
do
echo "Running benchmark with ${n} nodes and ${t} tasks per node"
sbatch -N ${n} --ntasks-per-node=${t} spark-teragen.sh
sleep 5
sbatch -N ${n} --ntasks-per-node=${t} spark-terasort.sh
sleep 5
sbatch -N ${n} --ntasks-per-node=${t} spark-teravalidate.sh
sleep 5
done
done
How can I fix the error below by preventing Slurm from assigning a number of tasks per node that exceeds the limit?
Error:
srun: Warning: can't honor --ntasks-per-node set to 3 which doesn't match the
requested tasks 16 with the number of requested nodes 1. Ignoring --ntasks-per-node.
srun: error: Unable to create step for job 233838: More processors requested than
permitted

ArangoDB goes silent after Slurm sbatch submission

I am trying to run ArangoDB in cluster mode on a Cray supercomputer.
It runs on a login node.
I followed these instructions:
https://docs.arangodb.com/3.3/Manual/Deployment/Local.html
To make proper use of the Cray cluster, however, I need to submit it as a batch job (Slurm / sbatch).
I am having issues getting it running because "arangod" goes silent, that is, its command-line output does not end up in the Slurm log file.
I have tried to change the log-settings using this link:
https://docs.arangodb.com/3.3/Manual/Administration/Configuration/Logging.html
If I set the log level to "info" then I get nothing. If I use "trace" like this:
build/bin/arangod --server.endpoint tcp://0.0.0.0:5003 --agency.my-address tcp://148.187.32.9:5001 --server.authentication false --agency.activate true --agency.size 3 --agency.supervision true --database.directory db_dir/agency_2 --log.level startup=trace --log.level agency=trace --log.level queries=trace --log.level replication=trace --log.level threads=trace
I get something, but it does not print any of the lines I'm interested in, namely whether it created the database directory, whether it ends up in gossip mode, and so on. I don't get a single line of the output I would see in the console if I just ran it from the terminal.
As I said: on the login node it all works. I suspect the problem might be in the interaction between Slurm and arangod.
Can you help me?
* EDIT *
I ran a small experiment. First I ran this (expecting an error message):
#!/bin/bash -l
#SBATCH --job-name=slurm_test
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=debug
#SBATCH --constraint=mc
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun build/bin/arangod --server.endpoint tcp://0.0.0.0:5001
And got this (the first line is from arangod, which is what we expected):
slurm-....out:
no database path has been supplied, giving up, please use the '--database.directory' option
srun: error: nid00008: task 0: Exited with exit code 1
srun: Terminating job step 8106415.0
Batch Job Summary Report for Job "slurm_test" (8106415) on daint
-----------------------------------------------------------------------------------------------------
Submit Eligible Start End Elapsed Timelimit
------------------- ------------------- ------------------- ------------------- ---------- ----------
2018-06-20T22:41:54 2018-06-20T22:41:54 Unknown Unknown 00:00:00 00:30:00
-----------------------------------------------------------------------------------------------------
Username Account Partition NNodes Energy
---------- ---------- ---------- ------ --------------
peterem g34 debug 1 joules
This job did not utilize any GPUs
----------------------------------------------------------
Scratch File System Files Quota
-------------------- ---------- ----------
/scratch/snx3000 85020 1000000
Then I ran this:
#!/bin/bash -l
#SBATCH --job-name=slurm_test
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=debug
#SBATCH --constraint=mc
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun build/bin/arangod --server.endpoint tcp://0.0.0.0:5001 --agency.my-address tcp://127.0.0.1:5001 --server.authentication false --agency.activate true --agency.size 1 --agency.supervision true --database.directory agency1
This created the "agency1" directory but did not complete (it ran for over 3 minutes), so after a few minutes I cancelled the job with scancel. This is the only output (slurm-....out):
srun: got SIGCONT
slurmstepd: error: *** STEP 8106340.0 ON nid00008 CANCELLED AT 2018-06-20T22:38:03 ***
slurmstepd: error: *** JOB 8106340 ON nid00008 CANCELLED AT 2018-06-20T22:38:03 ***
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Batch Job Summary Report for Job "slurm_test" (8106340) on daint
-----------------------------------------------------------------------------------------------------
Submit Eligible Start End Elapsed Timelimit
------------------- ------------------- ------------------- ------------------- ---------- ----------
2018-06-20T22:32:15 2018-06-20T22:32:15 Unknown Unknown 00:00:00 00:30:00
-----------------------------------------------------------------------------------------------------
Username Account Partition NNodes Energy
---------- ---------- ---------- ------ --------------
peterem g34 debug 1 joules
This job did not utilize any GPUs
----------------------------------------------------------
Scratch File System Files Quota
-------------------- ---------- ----------
/scratch/snx3000 85020 1000000
So: I know it is running in both cases (it either gives output or creates the folder), but I have no idea why it gives no output in the second case.
I hope this clarifies my issue.
Thanks, Emanuel
Would you please post your entire Slurm job command / job file? arangod logs to stdout. When stdout is redirected to an output file, as cluster batch systems do by default, you should monitor that file. As far as I remember, Slurm by default writes to slurm-$jobid.out.
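A minimal sketch of that monitoring approach, based on the second batch script from the question (the explicit output file names are an assumption, not something the answer prescribes):
#!/bin/bash -l
#SBATCH --job-name=slurm_test
#SBATCH --output=arangod.%j.out
#SBATCH --error=arangod.%j.err
# %j expands to the job id, so arangod's stdout ends up in arangod.<jobid>.out
srun build/bin/arangod --server.endpoint tcp://0.0.0.0:5001 --database.directory agency1
Then, from the login node while the job is running:
tail -f arangod.<jobid>.out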

How to run an MPI task?

I am a newbie in Linux and recently started working with our university supercomputer. I need to install my program (GAMESS Quantum Chemistry Software) in my own allocated space. I have installed and run it successfully under 'sockets', but I actually need to run it under 'mpi' (otherwise there is little advantage to using a supercomputer).
System Setting:
OS: Linux64 , Redhat, intel
MPI: impi
compiler: ifort
modules: slurm , intel/intel-15.0.1 , intel/impi-15.0.1
This software is launched via 'rungms' and receives arguments as follows:
rungms [fileName] [Version] [CPU count] (for example: ./rungms Opt 00 4)
Here is my bash file (my feeling is that this is the main culprit for my problem!):
#!/bin/bash
#Based off of Monte's Original Script for Torque:
#https://gist.github.com/mlunacek/6306340#file-matlab_example-pbs
#These are SBATCH directives specifying name of file, queue, the
#Quality of Service, wall time, Node Count, #of CPUS, and the
#destination output file (which appends node hostname and JobID)
#SBATCH -J OptMPI
#SBATCH --qos janus-debug
#SBATCH -t 00-00:10:00
#SBATCH -N2
#SBATCH --ntasks-per-node=1
#SBATCH -o output-OptMPI-%N-JobID-%j
#NOTE: This Module Will Be Replaced With Slurm Specific:
module load intel/impi-15.0.1
mpirun /projects/augenda/gamess/rungms Opt 00 2 > OptMPI.out
As I said before, the program is compiled for mpi ( and not 'sockets' ) .
My problem is that when I run sbatch Opt.sh, I receive this error:
srun: error: PMK_KVS_Barrier duplicate request from task 1
When I change the -N number, I sometimes receive an error saying (4 != 2).
With an odd -N I receive an error saying it expects an even number of processes.
What am I missing?
Here is the code from our supercomputer's website as a bash file example.
The Slurm Workload Manager has a few ways of invoking an Intel MPI process. Likely, all you have to do is use srun rather than mpirun in your case. If errors are still present, refer here for alternative ways to invoke Intel MPI jobs; it's rather dependent on how the HPC admins configured the system.
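A minimal sketch of that change applied to the submission script from the question (only the launcher line differs; whether any additional Intel MPI environment setup is needed depends on how the site is configured):
#!/bin/bash
#SBATCH -J OptMPI
#SBATCH --qos janus-debug
#SBATCH -t 00-00:10:00
#SBATCH -N2
#SBATCH --ntasks-per-node=1
#SBATCH -o output-OptMPI-%N-JobID-%j
module load intel/impi-15.0.1
# let Slurm's srun launch the MPI ranks instead of calling mpirun directly
srun /projects/augenda/gamess/rungms Opt 00 2 > OptMPI.out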
