Run 2 Slurm jobs only when both get their allocated resources - PyTorch

One job is submitted to get hold of 4 GPUs. The second is submitted to get hold of the next 4 GPUs (on a different node). How can I ensure that both jobs run at the same time so that they eventually synchronise (PyTorch DDP)?
Having an extra script check for available resources does the trick; however, other jobs that have been waiting in the queue longer might take priority and grab the resources instead.
The particular partition I am using does not allow a request for 2 nodes directly.
I am also aware of the --dependency flag; however, this can only be used to check for the completion of the first job.

The simple answer is to be more explicit with slurm.
idx=0; export CUDA_VISIBLE_DEVICES=$idx; python -u run_pos.py --fold=1 &
idx=1; export CUDA_VISIBLE_DEVICES=$idx; python -u run_pos.py --fold=2 &
idx=2; export CUDA_VISIBLE_DEVICES=$idx; python -u run_pos.py --fold=3 &
wait
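Note that a script like this only makes sense inside a single allocation that already owns the GPUs it enumerates. A minimal sbatch header for it might look like the sketch below (the job name, CPU count and gpu:3 request are illustrative assumptions, not taken from the original post), with the three launch lines and the wait from above placed after it:
#!/bin/bash
#SBATCH --job-name=pos_folds      # illustrative job name
#SBATCH --nodes=1                 # keep all folds on one node
#SBATCH --ntasks=1                # one batch script; folds run as background processes
#SBATCH --cpus-per-task=12        # assumption: size this to your cluster
#SBATCH --gres=gpu:3              # one GPU per fold launched above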
srun examples
Jobs will be allocated specific generic resources as needed to satisfy the request. If the job is suspended, those resources do not become available for use by other jobs.
Job steps can be allocated generic resources from those allocated to the job using the --gres option with the srun command as described above. By default, a job step will be allocated all of the generic resources allocated to the job. If desired, the job step may explicitly specify a different generic resource count than the job. This design choice was based upon a scenario where each job executes many job steps. If job steps were granted access to all generic resources by default, some job steps would need to explicitly specify zero generic resource counts, which we considered more confusing. The job step can be allocated specific generic resources and those resources will not be available to other job steps. A simple example is shown below.
Flags explained
--gres - Generic resources required per node
--gpus - GPUs required per job
--gpus-per-node - GPUs required per node. Equivalent to the --gres option for GPUs.
--gpus-per-socket - GPUs required per socket. Requires the job to specify a sockets-per-node count.
--gpus-per-task - GPUs required per task. Requires the job to specify a task count.
--cpus-per-gpu - Count of CPUs allocated per GPU.
--gpu-bind - Define how tasks are bound to GPUs.
--gpu-freq - Specify GPU frequency and/or GPU memory frequency.
--mem-per-gpu - Memory allocated per GPU.
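As a quick illustration of the newer per-job GPU flags (as opposed to --gres), a request could look like the following sketch; the script name is a placeholder:
sbatch --ntasks=4 --gpus-per-task=1 -N1 my_gpu_job.sh    # 4 tasks, one GPU each, on a single node
sbatch --gpus=4 -N1 my_gpu_job.sh                        # 4 GPUs for the job as a whole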
#!/bin/bash
#
# gres_test.bash
# Submit as follows:
# sbatch --gres=gpu:4 -n4 -N1-1 gres_test.bash
#
srun --gres=gpu:2 -n2 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
wait
Another example:
srun --gres=gpu:2 -n2 bash -c 'CUDA_VISIBLE_DEVICES=$SLURM_PROCID env' | grep CUDA_VISIBLE
CUDA_VISIBLE_DEVICES=1
CUDA_VISIBLE_DEVICES=0
You can further automate this with a bash script:
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --exclusive
for i in $(seq 0 3); do
    cd ${i}                         # each run lives in its own subdirectory 0/ .. 3/
    export CUDA_VISIBLE_DEVICES=$i  # pin this run to GPU $i
    python gpu_code.py &            # launch the run in the background
    cd ..
done
wait
The complex but better answer...
The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA programming interface. The MPS execution architecture is designed to let co-operative multi-process CUDA applications, typically MPI jobs, use the Hyper-Q capabilities of recent NVIDIA GPUs. Hyper-Q allows CUDA kernels to be processed concurrently on the same GPU, which can improve performance when the GPU's compute capacity would be underused by a single application process.
CUDA MPS is included by default in the different CUDA modules available to the users.
For a multi-GPU MPI batch job, the usage of CUDA MPS can be activated with the -C mps option. However, the node must be exclusively reserved via the --exclusive option.
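For reference only: on clusters that do not expose a ready-made constraint such as -C mps, MPS is normally started by hand on the node with NVIDIA's nvidia-cuda-mps-control tool. A rough sketch, assuming the tool is on the PATH:
# start the MPS control daemon once, before the CUDA processes
nvidia-cuda-mps-control -d
# ... run the multi-process CUDA / MPI workload ...
# shut the daemon down again at the end of the job
echo quit | nvidia-cuda-mps-control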
For an execution via the default gpu partition (nodes with 40 physical cores and 4 GPUs) using only one node:
mps_multi_gpu_mpi.slurm
#!/bin/bash
#SBATCH --job-name=gpu_cuda_mps_multi_mpi     # name of job
#SBATCH --ntasks=40                           # total number of MPI tasks
#SBATCH --ntasks-per-node=40                  # number of MPI tasks per node (all physical cores)
#SBATCH --gres=gpu:4                          # number of GPUs per node (all GPUs)
#SBATCH --cpus-per-task=1                     # number of cores per task
# /!\ Caution: In Slurm vocabulary, "multithread" refers to hyperthreading.
#SBATCH --hint=nomultithread                  # hyperthreading deactivated
#SBATCH --time=00:10:00                       # maximum execution time requested (HH:MM:SS)
#SBATCH --output=gpu_cuda_mps_multi_mpi%j.out # name of output file
#SBATCH --error=gpu_cuda_mps_multi_mpi%j.out  # name of error file (here, common with the output)
#SBATCH --exclusive                           # exclusively reserves the node
#SBATCH -C mps                                # MPS is activated
# cleans out modules loaded in interactive and inherited by default
module purge
# loads modules
module load ...
# echo of launched commands
set -x
# execution of the code: 4 GPUs shared by 40 MPI tasks.
srun ./executable_multi_gpu_mpi
Submit script via the sbatch command:
sbatch mps_multi_gpu_mpi.slurm
Similarly, you can execute your job on an entire node of the gpu_p2 partition (nodes with 24 physical cores and 8 GPUs) by specifying:
#SBATCH --partition=gpu_p2    # GPU partition requested
#SBATCH --ntasks=24           # total number of MPI tasks
#SBATCH --ntasks-per-node=24  # number of MPI tasks per node (all physical cores)
#SBATCH --gres=gpu:8          # number of GPUs per node (all GPUs)
#SBATCH --cpus-per-task=1     # number of cores per task
Be careful, even if you use only part of the node, it has to be reserved in exclusive mode. In particular, this means that the entire node is invoiced.
I recommend that you compile and execute your codes in the same environment by loading the same modules. In this example, I assume that the executable_multi_gpu_mpi executable file is found in the submission directory, i.e. the directory in which the sbatch command is entered.
The calculation output file, gpu_cuda_mps_multi_mpi<job_number>.out, is also found in the submission directory. It is created at the start of the job execution; editing or modifying it while the job is running can disrupt the execution.
The module purge is made necessary by Slurm's default behaviour: any modules loaded in your environment at the moment you launch sbatch are passed to the submitted job, making the execution of your job dependent on what you have done previously.
PROTIP: To avoid errors in the automatic task distribution, I recommend using srun to execute your code instead of mpirun. This guarantees a distribution which conforms to the specifications of the resources you requested in the submission file.
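To illustrate that tip, the two launch styles differ only in the last line of the batch script. A small sketch, reusing the executable name from the example above and Slurm's own SLURM_NTASKS variable:
# preferred: Slurm places the tasks exactly as requested in the header
srun ./executable_multi_gpu_mpi
# also possible, but the MPI library must agree with Slurm about the task layout
mpirun -np $SLURM_NTASKS ./executable_multi_gpu_mpi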
Misc.
Jobs have resources defined in Slurm by default, per partition and per QoS (Quality of Service). You can modify the limits or specify another partition and / or QoS as shown in the documentation detailing the partitions and QoS.
That was exhaustive, I HOPE THAT HELPS!

Related

Slurm - Host node allocation?

When I submit my sbatch job to our HPC, I believe Slurm allocates nodes based on resources, and in my case the host is always spawned on Node 0, which is the first node in an alphabetical sort of the node/machine names. This is causing problems because (sometimes) this host node may have only 1 core running (and thus a small amount of memory), meaning it is unable to write the large results/data files I need.
Is there any way to set the host node manually, given the resources slurm allocates in my nodefile?
I could fix this with --mincpus, but I only need more than 1 CPU for this one purpose. Other solutions, such as increasing --mem-per-cpu or just --mem, also add more resources to the job and delay it from starting.
You can use the --nodelist parameter to set specific nodes that should be used:
sbatch --nodelist=<NODE-NAME> script.sh
Or even --exclude the ones you do not want to use (e.g. node 0):
sbatch --exclude=node0 script.sh
The official documentation provides more information on both options.
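Both options can also be set as directives inside the batch script itself; a minimal sketch (the node names and the srun line are placeholders):
#!/bin/bash
#SBATCH --nodelist=node1     # hypothetical node name: force the job onto this node
##SBATCH --exclude=node0     # alternative: uncomment to avoid node 0 instead
srun ./my_program            # placeholder for the actual workload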

How to set number of threads as a downstream variable in PBS job queue

Is there any way to determine how many threads are available to a program when I run it from a PBS job script?
In the header of my PBS job script I set
#PBS -l nodes=1:ppn=8
Is there a command I can use that will detect the number of threads, so that I can set a variable equal to this number (for downstream processes)?
This way I can set threads as $k for downstream processes, instead of going through the code line by line every time I change #PBS -l nodes=1:ppn=_.
Thanks all!
I found a workaround: if using a single node, the variable I am looking for is $PBS_NUM_PPN.
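A quick sketch of that workaround inside the job script (the downstream command is a placeholder):
#!/bin/bash
#PBS -l nodes=1:ppn=8
k=$PBS_NUM_PPN              # number of cores granted on this node
somecommand --threads $k    # placeholder downstream process using $k threads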
By default, PBS doesn't expose the ppn setting in the running job. And there is no way that a shell script can read its comments ... without knowing and parsing its source (and that's probably not going to work here for a couple of reasons.)
But here are a couple of ideas:
You could pass an arbitrary variable from the qsub command line using the -v option. (You might be able to do the same thing using #PBS -v ..., but that would be equivalent to setting a variable in your script in the normal way.)
You should be able to specify the resources (using -l) on the qsub command line instead of in the job script.
Put them together like this:
qsub ... -l nodes=1:ppn=8 -v NOSTHREADS=8 myscript.pbs
where myscript.pbs is:
#!/bin/bash
#PBS directives ... without the "-l" !!!
# ordinary shell commands.
somecommand --someoption $NOSTHREADS ...
Note: I recommend that you don't mix specifying resources on the command line and in the script. Put the "-l" options in one place only. If you put them in both places AND your Torque / PBS installation uses job submission filters, things can get rather confused.
Alternatively, you could write a shell (or python or whatever) launcher that generates the PBS script with matching values of the ppn (etc) resource(s) and the corresponding variable(s) embedded in the generated script.
This approach can have the advantage of being more reproducible ... if you do a few other things as well. (Ask a local eResearch analyst about reproducibility in your scientific computing.)
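For instance, a minimal generator along those lines might look like this (file names, resource values and the downstream command are all illustrative):
#!/bin/bash
# generate_and_submit.sh <ppn> - writes a PBS script with a matching thread count and submits it
PPN=${1:-8}
cat > generated_job.pbs <<EOF
#!/bin/bash
#PBS -l nodes=1:ppn=${PPN}
#PBS -N generated_job
cd \$PBS_O_WORKDIR
somecommand --someoption ${PPN}
EOF
qsub generated_job.pbs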
If neither of the above can be made to work, you might be able to check the ulimit settings within the job script. However, my understanding is that the PBS MOM (the node-level daemon) will typically not use ulimit restrictions as the means of enforcing thread / process limits. Instead, it will monitor the number of cores that are active. (The ppn resource limits the number of processors, not the number of threads or processes.)

Monitor the CPU usage of an OpenFOAM simulation running on a slurm job

I'm running an OpenFOAM simulation on a cluster. I have used the Scotch decomposition method and my decomposeParDict looks like this:
FoamFile
{
version 2.0;
format ascii;
class dictionary;
object decomposeParDict;
}
numberOfSubdomains 6;
method scotch;
checkMesh and decomposePar finish with no issues. I have assigned 6 nodes to the Slurm job by
srun -N6 -l sonicFoam
and the solver runs smoothly without any errors.
The issue is that the solution speed is not improved in comparison to the non-parallel simulation I ran before. I want to monitor the CPU usage to see if all 6 nodes I have assigned are similarly loaded. The squeue --user=foobar command returns the job number and the list of nodes assigned (NODELIST(REASON)), which looks like this:
foo,bar[061-065]
From the sinfo command, these nodes are in both the debug and main* PARTITIONs (I have absolutely no idea what that means!).
This post says that you can use the sacct or sstat commands to monitor CPU time and memory usage of a slurm job. But when I run
sacct --format="CPUTime,MaxRSS"
it gives me:
CPUTime MaxRSS
---------- ----------
00:00:00
00:00:00
00:07:36
00:00:56
00:00:26
15:26:24
which I can not understand. And when I specify the job number by
sacct --job=<jobNumber> --format="UserCPU"
The return is empty. So my questions are:
Is my simulation loading all nodes, or is it running on one or two while the rest are free?
Am I running the right commands? If yes, what do those numbers mean? How do they represent the CPU usage per node?
If not, then what are the right --format="..." fields for sacct and/or sstat (or maybe other Slurm commands) to get the CPU usage/load?
P.S.1. I compiled OpenFOAM following the official instructions. I did not do anything with OpenMPI and its mpicc compiler, for that matter.
P.S.2. For those of you who might end up here: maybe I'm running the wrong command; apparently one can first allocate some resources by:
srun -N 1 --ntasks-per-node=7 --pty bash
where 7 is the number of cores you want and bash is just a name, and then run the solver with:
mpirun -np 7 sonicFoam -parallel -fileHandler uncollated
I'm not sure yet though.
You can use
sacct --format='jobid,AveCPU,MinCPU,MinCPUTask,MinCPUNode'
to check whether all CPUs have been active. Compare AveCPU (average CPU time of all tasks in job) with MinCPU (minimum CPU time of all tasks in job). If they are equal, all 6 tasks (you requested 6 nodes, with, implicitly, 1 task per node) worked equally. If they are not equal, or even MinCPU is zero, then some tasks have been doing nothing.
But in your case, I believe you will observe that all tasks have been working hard, but they were all doing the same thing.
Besides the remark concerning the -parallel flag by @timdykes, you also must be aware that launching an MPI job with srun requires that OpenMPI was compiled with Slurm support. During your installation of OpenFOAM, it installed its own version of OpenMPI, and if the file /usr/include/slurm/slurm.h or /usr/include/slurm.h exists, then Slurm support was probably compiled in. But the safest is probably to use mpirun.
But to do that, you will have to first request an allocation from Slurm with either sbatch or salloc.
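A rough sketch of that workflow, reusing the counts from the question and the solver flags suggested elsewhere in this thread:
salloc -N 6 -n 6                                            # request the allocation first
mpirun -np 6 sonicFoam -parallel -fileHandler uncollated    # then launch the solver inside it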
Have you tried running with the '-parallel' argument? All of the OpenFOAM examples online use this argument when running a parallel job; one example is the official guide for running in parallel.
srun -N $NTASKS -l sonicFoam -parallel
As an aside: I saw you built OpenFOAM yourself; have you checked whether the cluster admins have provided a module for it? You can usually run module avail to see a list of the available modules, and then module load moduleName if there is an existing OpenFOAM module. This is useful as you can probably trust it has been built with all the right options, and it would automatically set up your $PATH etc.

Limit number of cores used by OMPython

Background
I need to run a blocks simulation. I've used OMEdit to create the system and I call omc to run the simulation using OMPython with zmq for messaging. The simulation works fine but now I need to move it to a server to simulate the system for long times.
Since the server is shared among a team of people, it uses slurm to queue the jobs. The server has 32 cores but they asked me to use only 8 while I tune my script and then 24 when I want to run my final simulation.
I've configured slurm to call my script in the following manner:
#!/bin/bash
#
#SBATCH --job-name=simulation_Test_ADC2_pipe_4096s
#SBATCH --output=simulation_Test_ADC2_pipe_4096s.txt
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=10000
source activate /home/jabozzo/conda_envs/py27
#which python
python ./Test_ADC2_pipe_4096s.py
Then I execute the slurm file using sbatch.
Problem
The omc compilation works fine. When it starts to simulate, all 32 cores of the server become loaded, even though the job was configured to use only 8.
I've tried
There are compilation and simulation flags that can be passed to omc. I've tried to use --numProcs (a compilation flag), but this only seems to apply during the compilation process and does not affect the final executable. I've scanned the page of simulation flags looking for something related, but it seems there is no option to change the CPU usage.
The only thing we do when running our OpenModelica testing in parallel is add the GC_MARKERS=1 environment variable and --numProcs=1; this makes our nightly library testing of 10000 tests all run in serial. But GC_MARKERS shouldn't affect simulations unless they are allocating extreme amounts of memory. Other than that, OpenModelica simulations are serial unless perhaps you use a parallel blas/lapack/sundials library, which might use more cores without OpenModelica knowing anything about it; in that case you would need to read the documentation for the library that's consuming all your resources.
What's also a bit surprising is that Slurm allows your process to consume more CPUs than you requested; it could use the taskset command to make the kernel force the process to use only certain CPUs.
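For reference, pinning the process by hand with taskset would look roughly like this (the core range matches the 8 CPUs requested above; child processes inherit the affinity mask):
taskset -c 0-7 python ./Test_ADC2_pipe_4096s.py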
My server administrator was unsure whether taskset would interfere with Slurm internals, so we found another option. If omc uses OpenMP for compilation, we can also limit the number of cores by replacing the last line of the Slurm file with:
OMP_NUM_THREADS=8 python ./Test_ADC2_pipe_4096s.py
I'm leaving this answer here to complement sjoelund.se's answer.

SGE/UGE/etc..standardized way to submit OpenMP jobs to multiple cores?

I'm looking for a way to submit an OpenMP job to a Grid Engine scheduler, while specifying the number of cores it should run on. Something equivalent to LSF's -n option, or PBS's -l nodes=[count] option.
When I search on this, I see a bunch of answers specifying syntax like "-pe threaded [number of cores]". In those answers, there is no mention of having to create a parallel environment called "threaded". But when I try this syntax, it fails, saying that the requested parallel environment threaded does not exist. And when I type "qconf -spl", the only result I get is "make". So - should this "threaded" parallel environment exist by default, or is it something that has to be manually created on the cluster?
If it has to be manually created, is there any other syntax for submitting jobs to multiple cores that does not rely on configurable naming on a cluster? This is for a third-party program submitting to a cluster, so I don't want to have to rely on the client not only having created this PE, but also having named it the same, etc. I was hoping the -l option might offer something, but I haven't been able to find any permutation of it that achieves this.
If you get only "make" as a possible parallel environment, then this means that there are no parallel environments set up on your cluster.
There are two solutions to your problem, depending on these 2 situations:
A) you have root/admin access to the cluster
B) you don't
In case B, well, ask your administrator to create a parallel environment. In case A, you have to create a parallel environment yourself. To create a new parallel environment, you must type (requires root/admin privilege):
qconf -ap <pe_name>
And the default editor will start with a default pe_conf file that you must edit. If you need to setup only an openMP parallel environment you can use these options:
pe_name smp
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE
and for a MPI parallel environment:
pe_name mpi
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /opt/sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args /opt/sge/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary TRUE
As you notice, in the latter case you point SGE to the right initialization and shutdown scripts for your MPI configuration, while in the first case you simply point to /bin/true.
The allocation_rule differs between the two examples: $fill_up means that SGE will fill any CPU it can find with parts of the MPI job, while for the smp configuration you simply allocate the correct number of slots on the same machine, i.e. $pe_slots.
If you use MPI, your nodes should be connected using a high-performance interconnect such as InfiniBand, otherwise your jobs will spend much more time communicating than calculating.
EDIT:
oh, btw: the correct syntax to submit a job with a parallel environment is effectively:
qsub -pe <pe_name> <nb_slots>
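For example, with the smp environment defined above, an OpenMP job could be submitted as follows (the script name and slot count are illustrative; $NSLOTS is filled in by Grid Engine):
qsub -pe smp 8 my_openmp_job.sh
where my_openmp_job.sh contains something like:
#!/bin/bash
#$ -cwd                          # run from the submission directory
export OMP_NUM_THREADS=$NSLOTS   # match the thread count to the granted slots
./my_openmp_program              # placeholder executable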
FINAL EDIT:
the final answer to the question comes in the comments below. In practice, SGE cannot handle multi-threaded jobs if a parallel environment (PE) is not set up on the cluster. If you do not have admin privileges on the cluster, you must either guess the correct PE to use by listing them with qconf -spl and inspecting each one with qconf -sp <pe_name>, or add an option to your software that allows the user to specify the PE to be used.
Otherwise, i.e. if no PEs are available on the cluster, you cannot use a parallel version of your software.
See the comments for further information.

Resources