Limit number of cores used by OMPython - slurm

Background
I need to run a blocks simulation. I've used OMEdit to create the system and I call omc to run the simulation using OMPython with zmq for messaging. The simulation works fine, but now I need to move it to a server to simulate the system over long time spans.
Since the server is shared among a team of people, it uses slurm to queue the jobs. The server has 32 cores but they asked me to use only 8 while I tune my script and then 24 when I want to run my final simulation.
I've configured slurm to call my script in the following manner:
#!/bin/bash
#
#SBATCH --job-name=simulation_Test_ADC2_pipe_4096s
#SBATCH --output=simulation_Test_ADC2_pipe_4096s.txt
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=10000
source activate /home/jabozzo/conda_envs/py27
#which python
python ./Test_ADC2_pipe_4096s.py
Then I execute the slurm file using sbatch.
Problem
The omc compilation works fine. When it starts to simulate, all 32 cores of the server become loaded, even though the job was configured to use only 8.
I've tried
There are compilation and simulation flags that can be passed to omc. I've tried to use --numProcs (a compilation flag), but this only seems to apply during the compilation process and does not affect the final executable. I've scanned the page of simulation flags looking for something related, but it seems there is no option to change the CPU usage.

The only thing we add when doing our OpenModelica testing in parallel is the GC_MARKERS=1 environment variable together with --numProcs=1; this makes our nightly library testing of 10000 tests run entirely in serial. But GC_MARKERS shouldn't affect simulations unless they are allocating extreme amounts of memory. Other than that, OpenModelica simulations are serial, unless perhaps you use a parallel blas/lapack/sundials library, which might use more cores without OpenModelica knowing anything about it; in that case you would need to read the documentation for the library that's consuming all your resources.
It's also a bit surprising that slurm allows your process to consume more CPUs than you requested; it could use the taskset command to make the kernel force the process to only use certain CPUs.
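For illustration, a hedged sketch of what CPU pinning with taskset could look like if the administrator allows it; the CPU list 0-7 matches the 8-core limit from the question, and the PID form is only a hypothetical alternative:
taskset -c 0-7 python ./Test_ADC2_pipe_4096s.py
# or restrict an already running process by its PID (placeholder):
taskset -cp 0-7 <PID>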

My server administrator was unsure whether taskset would interfere with slurm internals, so we found another option. If omc uses OpenMP for compilation, we can also limit the number of cores by replacing the last line of the slurm file with:
OMP_NUM_THREADS=8 python ./Test_ADC2_pipe_4096s.py
I'm leaving this answer here to complement sjoelund.se's answer.
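For completeness, a hedged sketch of the full batch script with this change; it assumes Slurm exports SLURM_CPUS_PER_TASK when --cpus-per-task is set, so the OpenMP thread limit stays in sync with the allocation instead of being hard-coded:
#!/bin/bash
#
#SBATCH --job-name=simulation_Test_ADC2_pipe_4096s
#SBATCH --output=simulation_Test_ADC2_pipe_4096s.txt
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=10000
source activate /home/jabozzo/conda_envs/py27
# Derive the OpenMP thread count from the Slurm allocation rather than hard-coding 8.
OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK} python ./Test_ADC2_pipe_4096s.py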

Related

Run 2 slurm jobs only when both get the allocated resources

One job is submitted to get hold of 4 GPUs. The second is submitted to get hold of the next 4 GPUs (on a different node). How can I ensure that both jobs run at the same time so that they can eventually synchronise (PyTorch DDP)?
Having an extra script check the available resources does the trick; however, other jobs might have priority because they have been in the queue longer, rather than waiting...
The particular partition I am using does not allow for a request of 2 nodes directly.
I am also aware of the --dependency flag, however this can only be used as a completion check of the first job.
The simple answer is to be more explicit with slurm.
idx=0; export CUDA_VISIBLE_DEVICES=$idx; python -u run_pos.py --fold=1 &
idx=1; export CUDA_VISIBLE_DEVICES=$idx; python -u run_pos.py --fold=2 &
idx=2; export CUDA_VISIBLE_DEVICES=$idx; python -u run_pos.py --fold=3 &
wait
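For context, a hedged sketch of the batch header these lines could sit under; the job name, GPU count and CPU count are assumptions, not taken from the question:
#!/bin/bash
#SBATCH --job-name=folds       # placeholder name
#SBATCH --ntasks=1
#SBATCH --gres=gpu:3           # one GPU per backgrounded process above (assumption)
#SBATCH --cpus-per-task=6      # placeholder CPU count
# ...the three export/run lines from above go here, followed by wait.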
srun examples
Jobs will be allocated specific generic resources as needed to satisfy the request. If the job is suspended, those resources do not become available for use by other jobs.
Job steps can be allocated generic resources from those allocated to the job using the --gres option with the srun command as described above. By default, a job step will be allocated all of the generic resources allocated to the job. If desired, the job step may explicitly specify a different generic resource count than the job. This design choice was based upon a scenario where each job executes many job steps. If job steps were granted access to all generic resources by default, some job steps would need to explicitly specify zero generic resource counts, which we considered more confusing. The job step can be allocated specific generic resources and those resources will not be available to other job steps. A simple example is shown below.
Flags explained
--gres - Generic resources required per node
--gpus - GPUs required per job
--gpus-per-node - GPUs required per node. Equivalent to the --gres option for GPUs.
--gpus-per-socket - GPUs required per socket. Requires the job to specify a sockets-per-node count.
--gpus-per-task - GPUs required per task. Requires the job to specify a task count.
--cpus-per-gpu - Count of CPUs allocated per GPU.
--gpu-bind - Define how tasks are bound to GPUs.
--gpu-freq - Specify GPU frequency and/or GPU memory frequency.
--mem-per-gpu - Memory allocated per GPU.
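Before the fuller example below, here is a hypothetical job header combining a few of these flags; all values are placeholders rather than recommendations, and a Slurm version with GPU-aware scheduling is assumed:
#!/bin/bash
#SBATCH --ntasks=4             # task count, required when using --gpus-per-task
#SBATCH --gpus-per-task=1      # one GPU for each task
#SBATCH --cpus-per-gpu=4       # CPUs allocated alongside each GPU
#SBATCH --mem-per-gpu=16G      # memory budgeted per GPU
srun ./my_gpu_program          # placeholder executable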
#!/bin/bash
#
# gres_test.bash
# Submit as follows:
# sbatch --gres=gpu:4 -n4 -N1-1 gres_test.bash
#
srun --gres=gpu:2 -n2 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
wait
Another example:
srun --gres=gpu:2 -n2 bash -c 'CUDA_VISIBLE_DEVICES=$SLURM_PROCID env' | grep CUDA_VISIBLE
CUDA_VISIBLE_DEVICES=1
CUDA_VISIBLE_DEVICES=0
You can further automate this with a bash script:
#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --exclusive
for i in `seq 0 3`; do
    cd ${i}
    export CUDA_VISIBLE_DEVICES=$i
    python gpu_code.py &
    cd ..
done
wait
The complex but better answer...
The Multi-Process Service (MPS) is an implementation variant compatible with the CUDA programming interface. The MPS execution architecture is designed to let co-operative multi-process CUDA applications, generally MPI jobs, use Hyper-Q functionality on the latest NVIDIA GPUs. Hyper-Q allows CUDA kernels to be processed concurrently on the same GPU; this can improve performance when the GPU's compute capacity is underused by a single application process.
CUDA MPS is included by default in the different CUDA modules available to the users.
For a multi-GPU MPI batch job, the usage of CUDA MPS can be activated with the -C mps option. However, the node must be exclusively reserved via the --exclusive option.
For an execution via the default gpu partition (nodes with 40 physical cores and 4 GPUs) using only one node:
mps_multi_gpu_mpi.slurm
#!/bin/bash
#SBATCH --job-name=gpu_cuda_mps_multi_mpi # name of job
#SBATCH --ntasks=40 # total number of MPI tasks
#SBATCH --ntasks-per-node=40 # number of MPI tasks per node (all physical cores)
#SBATCH --gres=gpu:4 # number of GPUs per node (all GPUs)
#SBATCH --cpus-per-task=1 # number of cores per task
# /!\ Caution: In Slurm vocabulary, "multithread" refers to hyperthreading.
#SBATCH --hint=nomultithread # hyperthreading deactivated
#SBATCH --time=00:10:00 # maximum execution time requested (HH:MM:SS)
#SBATCH --output=gpu_cuda_mps_multi_mpi%j.out # name of output file
#SBATCH --error=gpu_cuda_mps_multi_mpi%j.out # name of error file (here, common with the output)
#SBATCH --exclusive # exclusively reserves the node
#SBATCH -C mps # the MPS is activated
# cleans out modules loaded in interactive and inherited by default
module purge
# loads modules
module load ...
# echo of launched commands
set -x
# execution of the code: 4 GPUs shared by 40 MPI tasks
srun ./executable_multi_gpu_mpi
Submit script via the sbatch command:
sbatch mps_multi_gpu_mpi.slurm
Similarly, you can execute your job on an entire node of the gpu_p2 partition (nodes with 24 physical cores and 8 GPUs) by specifying:
#SBATCH --partition=gpu_p2 # GPU partition requested
#SBATCH --ntasks=24 # total number of MPI tasks
#SBATCH --ntasks-per-node=24 # number of MPI tasks per node (all physical cores)
#SBATCH --gres=gpu:8 # number of GPUs per node (all GPUs)
#SBATCH --cpus-per-task=1 # number of cores per task
Be careful: even if you use only part of the node, it has to be reserved in exclusive mode. In particular, this means that the entire node is billed.
I recommend that you compile and execute your code in the same environment by loading the same modules. In this example, I assume that the executable_multi_gpu_mpi executable file is found in the submission directory, i.e. the directory in which the sbatch command is entered.
The calculation output file, gpu_cuda_mps_multi_mpi<job_number>.out, is also found in the submission directory. It is created at the start of the job execution: editing or modifying it while the job is running can disrupt the execution.
The module purge is made necessary by the Slurm default behaviour: any modules which are loaded in your environment at the moment you launch sbatch will be passed to the submitted job, making the execution of your job dependent on what you have done previously.
PROTIP: To avoid errors in the automatic task distribution, I recommend using srun to execute your code instead of mpirun. This guarantees a distribution which conforms to the specifications of the resources you requested in the submission file.
Misc.
Jobs have resources defined in Slurm by default, per partition and per QoS (Quality of Service). You can modify the limits or specify another partition and / or QoS as shown in the documentation detailing the partitions and QoS.
That was exhaustive, I HOPE THAT HELPS!

Slurm - Host node allocation?

When I submit my SBATCH job to our HPC, I believe slurm allocates nodes based on resources, and in my case the host is always spawned on Node 0, which is set as being the first in an alphabetical sort of the node/machine names. This is causing problems because (sometimes) this host node may only have 1 core allocated (and thus a small amount of memory), meaning it is unable to write the large results/data files I need.
Is there any way to set the host node manually, given the resources slurm allocates in my nodefile?
I could fix this with --mincpus, but I only need more than 1 CPU for this one purpose. Other solutions, such as increasing --mem-per-cpu or just --mem, also just add more resources to the job and delay it from starting.
You can use the --nodelist parameter to set specific nodes that should be used:
sbatch --nodelist=<NODE-NAME> script.sh
Or even --exclude the ones you do not want to use (e.g. node 0):
sbatch --exclude=node0 script.sh
The official documentation provides more information on both options.
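Both options can equivalently be set as directives inside the batch script itself; a hedged sketch, where the node names and program are placeholders:
#!/bin/bash
#SBATCH --nodelist=node17      # pin the job to a specific node (placeholder name)
##SBATCH --exclude=node0       # or instead exclude the problematic node (commented out here)
srun ./my_program              # placeholder for the actual workload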

How can I know the real-time memory usage of a running job on slurm?

I know little about how the CPU communicates with memory, so I'm not sure whether this is a 'correct' question to ask.
In a job script I submit to a slurm cluster, the script needs to read data from a database stored in the working directory. I want to monitor the memory used by running this script.
How can I write a bash script to do this? I have tried @CoffeeNerd's script. However, while the job is running, there is only one line of output in the file:
AveCPU|AveRSS|MaxRSS
How can I modify this script to output the real-time memory usage?
I know the sstat command, but I'm not sure whether something like sstat -j $JOBID.batch --format=MaxVMSize is the solution to my problem.
Slurm has a plugin that records a 'profile' of a job (CPU usage, memory usage, etc.) into an HDF5 file. It holds a time series for each item measured.
Use
#SBATCH --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
to activate it.
See the Slurm documentation for details.
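For illustration, a hedged sketch of how this could be wired up; it assumes the HDF5 profiling plugin is configured by the admins and that the sh5util tool is installed, and the job id and program are placeholders:
#!/bin/bash
#SBATCH --job-name=profiled_job      # placeholder name
#SBATCH --profile=task               # record per-task CPU and memory time series
srun ./my_program                    # placeholder for the actual workload
# After the job has been running for a while, merge the per-node HDF5 data for inspection:
# sh5util -j <jobid> -o profile_<jobid>.h5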

Identify Resource Usage of a Process: CPU, Memory and I/O

I need to request some AWS resources, and in order to do so, I need to identify my requirements for:
Number of CPU cores (and maybe GPUs as well, for parallel processing)
Amount of Memory required
I/O & network read/write time (optional / good to know)
How can I profile my script so that I know if I am using the requested resources to the fullest?
How can I profile the whole system? Something along the following lines:
i. Request large number of compute resources (CPUs and RAM) on AWS
ii. Start the system profiler
iii. Run my program and wait for it to finish
iv. Stop the system profiler and identify the peak #CPUs and RAM used
Context: Unix / Linux
You could use time(1), but there are two variants of it:
the time builtin in most shells is usually not enough for your needs, so...
you need the /usr/bin/time program from the time package.
Then you'll run /usr/bin/time -v yourscript, and inside your script you could use the times builtin.
You should also consider using perf(1) and oprofile(1).
Finally, you might write something yourself, or use a tool that queries the kernel through /proc/ (see proc(5)). Utilities like xosview work that way.
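For illustration, a hedged sketch of wrapping a program with GNU time; myscript.sh and the report filename are placeholders:
# GNU time writes its report to stderr, so redirect it to a file.
/usr/bin/time -v ./myscript.sh 2> time_report.txt
# The -v report includes peak memory and a rough CPU-utilisation figure:
grep -E "Maximum resident set size|Percent of CPU|Elapsed" time_report.txt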

Monitor the CPU usage of an OpenFOAM simulation running on a slurm job

I'm running an OpenFOAM simulation on a cluster. I have used the Scotch decomposition method and my decomposeParDict looks like this:
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}
numberOfSubdomains 6;
method scotch;
checkMesh and decomposePar finish with no issues. I have assigned 6 nodes to the job via slurm with
srun -N6 -l sonicFoam
and the solver runs smoothly without any errors.
The issue is that the solution speed is not improved compared to the non-parallel simulation I ran before. I want to monitor the CPU usage to see if all 6 nodes I have assigned are similarly loaded. The squeue --user=foobar command returns the job number and the list of nodes assigned (NODELIST(REASON)), which looks like this:
foo,bar[061-065]
From the sinfo command, these nodes are in both the debug and main* PARTITIONs (and I have absolutely no idea what that means!).
This post says that you can use the sacct or sstat commands to monitor CPU time and memory usage of a slurm job. But when I run
sacct --format="CPUTime,MaxRSS"
it gives me:
CPUTime MaxRSS
---------- ----------
00:00:00
00:00:00
00:07:36
00:00:56
00:00:26
15:26:24
which I can not understand. And when I specify the job number by
sacct --job=<jobNumber> --format="UserCPU"
The output is empty. So my questions are:
Is my simulation loading all the nodes, or is it running on one or two while the rest sit idle?
Am I running the right commands? If yes, what do those numbers mean, and how do they represent the CPU usage per node?
If not, what are the right --format="..." options for sacct and/or sstat (or maybe other slurm commands) to get the CPU usage/load?
P.S.1. I compiled OpenFOAM following the official instructions. I did not do anything with OpenMPI or its mpicc compiler, for that matter.
P.S.2. For those of you who might end up here: maybe I'm running the wrong command; apparently one can first allocate some resources with:
srun -N 1 --ntasks-per-node=7 --pty bash
where 7 is the number of cores you want and bash is simply the command to run in the allocation, and then run the solver with:
mpirun -np 7 sonicFoam -parallel -fileHandler uncollated
I'm not sure yet though.
You can use
sacct --format='jobid,AveCPU,MinCPU,MinCPUTask,MinCPUNode'
to check whether all CPUs have been active. Compare AveCPU (average CPU time of all tasks in job) with MinCPU (minimum CPU time of all tasks in job). If they are equal, all 6 tasks (you requested 6 nodes, with, implicitly, 1 task per node) worked equally. If they are not equal, or even MinCPU is zero, then some tasks have been doing nothing.
But in your case, I believe you will observe that all tasks have been working hard, but they were all doing the same thing.
Besides the remark concerning the -parallel flag by @timdykes, you must also be aware that launching an MPI job with srun requires that OpenMPI was compiled with Slurm support. During your installation of OpenFOAM, it installed its own version of OpenMPI, and if the file /usr/include/slurm/slurm.h or /usr/include/slurm.h exists, then Slurm support was probably compiled in. But the safest is probably to use mpirun.
But to do that, you will have to first request an allocation from Slurm with either sbatch or salloc.
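A minimal sketch of such a batch script, assuming the 6 subdomains from the decomposeParDict above; the job name, time limit and environment setup are placeholders:
#!/bin/bash
#SBATCH --job-name=sonicFoam_par     # placeholder name
#SBATCH --ntasks=6                   # one MPI rank per subdomain (numberOfSubdomains 6)
#SBATCH --time=01:00:00              # placeholder time limit
# Load or source your site's OpenFOAM environment here (site-specific).
mpirun -np 6 sonicFoam -parallel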
Have you tried running with the '-parallel' argument? All of the OpenFOAM examples online use this argument when running a parallel job; one example is the official guide for running in parallel.
srun -N $NTASKS -l sonicFoam -parallel
As an aside: I saw you built OpenFOAM yourself; have you checked whether the cluster admins have provided a module for it? You can usually run module avail to see a list of the available modules, and then module load moduleName if there is an existing OpenFOAM module. This is useful as you can probably trust it's been built with all the right options, and it will automatically set up your $PATH, etc.

Resources