Slurm: assign a certain number of GPUs by default

If I do not specify any --gres=gpu:1 option, the process uses all GPUs in the compute node.
We only use Slurm for GPU sharing, so we would like every process to be assigned one GPU automatically. Is it possible to make --gres=gpu:1 the default for srun?

You can set a default for --gres by setting the SBATCH_GRES environment variable for all users, for instance in /etc/profile.d on the login node. Simply create a file there with the following content:
export SBATCH_GRES=gpu:1
Note that the documentation says
Note that environment variables will override any options set in a batch script
so users who want more than one GPU, or no GPU at all, will need to override this default with the command-line option; they will not be able to override it with a #SBATCH --gres line in their submission script.
Another option would be to set CUDA_VISIBLE_DEVICES to an empty string for all users by default. In jobs that request GPUs, the variable will then be modified by Slurm according to the request, while jobs that do not request a GPU will not 'see' the GPUs.
If users are likely to game the system (they can overwrite the CUDA_VISIBLE_DEVICES variable themselves), then you will have to enforce the limits with cgroups.
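A minimal sketch of that second approach, assuming the file lives in /etc/profile.d (the file name is arbitrary, and the exact behaviour depends on your gres configuration):
# /etc/profile.d/hide_gpus.sh  (hypothetical file name)
# By default, hide all GPUs from user processes; Slurm overwrites
# CUDA_VISIBLE_DEVICES inside jobs that request GPUs with --gres,
# so only those jobs will 'see' the devices.
export CUDA_VISIBLE_DEVICES=""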


Get available memory inside SLURM step

I'm trying to write a script that automatically adapts its requirements to whatever environment it is running in.
I already get the number of CPUs available by reading the SLURM_CPUS_PER_TASK environment variable. If it does not exist, I assume it is an interactive execution and default the value to 1.
Now I need to get the memory available, but this is not so straightforward. We have SLURM_MEM_PER_CPU and SLURM_MEM_PER_NODE. If I'm not mistaken, these numbers are not always present, and there's the special case of asking for zero memory. But I need the real number, as I'm trying to run a Java application and need to pass something specific in the -Xmx parameter.
Is there any easy way to get that info? Or do I have to test for the availability of each variable and query Slurm/the system for the total memory available in the zero case?
If you request memory (--mem) in your submit script, these environment variables should be set.
Otherwise you can try scontrol show config,
or parse /etc/slurm/slurm.conf for the MaxMemPerNode of the partition (PartitionName) you are running in.
ref: https://slurm.schedmd.com/sbatch.html
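A hedged sketch of how that logic could look inside the job script (SLURM_MEM_PER_CPU and SLURM_MEM_PER_NODE are in MB when set; the fallback to /proc/meminfo, the placeholder jar name, and the 80% safety margin are assumptions, not Slurm behaviour):
#!/bin/bash
# Work out a memory budget (in MB) for -Xmx, preferring Slurm's view.
# A request of --mem=0 ("all memory") falls through to the system query.
if [ -n "$SLURM_MEM_PER_NODE" ] && [ "$SLURM_MEM_PER_NODE" -gt 0 ]; then
    MEM_MB=$SLURM_MEM_PER_NODE
elif [ -n "$SLURM_MEM_PER_CPU" ] && [ "$SLURM_MEM_PER_CPU" -gt 0 ]; then
    MEM_MB=$(( SLURM_MEM_PER_CPU * ${SLURM_CPUS_PER_TASK:-1} ))
else
    # Interactive run or no explicit request: use the node's total RAM.
    MEM_MB=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
fi
# Leave headroom for JVM overhead (the 80% margin is arbitrary).
java -Xmx$(( MEM_MB * 80 / 100 ))m -jar myapp.jar   # myapp.jar is a placeholder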

Slurm - Host node allocation?

When I submit my sbatch job to our HPC, I believe Slurm allocates nodes based on resources, and in my case the host is always spawned on node 0, which is the first in an alphabetical sort of the node/machine names. This is causing problems because (sometimes) this host node may only have 1 core running (and thus a small amount of memory), meaning it is unable to write the large results/data files I need.
Is there any way to set the host node manually, given the resources Slurm allocates in my nodefile?
I could fix this with --mincpus, but I only need >1 CPU for this one purpose. Other solutions such as increasing --mem-per-cpu or just --mem also add more resources to the job and delay it from starting.
You can use the --nodelist parameter to set specific nodes that should be used:
sbatch --nodelist=<NODE-NAME> script.sh
Or even --exclude the ones you do not want to use (e.g. node 0):
sbatch --exclude=node0 script.sh
The official documentation provides more information on both options.
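If the difficulty is picking a node that actually has free cores, a rough sketch would be to inspect the nodes first and then pass one to --nodelist (the sinfo format string is standard; the choice of node3 is just a placeholder):
# List nodes with their CPUs as allocated/idle/other/total
sinfo -N -o "%N %C"
# Then submit to a node you judged suitable, e.g.:
sbatch --nodelist=node3 script.sh   # node3 is a placeholder name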

How to set number of threads as a downstream variable in PBS job queue

Is there any way to determine how many threads are available to a program when I run it from a PBS job script?
In the header of my PBS job script I set
#PBS -l nodes=1:ppn=8
Is there a command I can use that will detect the number of threads, so that I can set a variable equal to this number for downstream processes?
This way I can set the thread count as $k for downstream processes, instead of going through the code line by line every time I change #PBS -l nodes=1:ppn=_.
Thanks all!
I found a workaround: if using a single node, the variable I am looking for is $PBS_NUM_PPN.
By default, PBS doesn't expose the ppn setting in the running job. And there is no way for a shell script to read its own comments ... without knowing and parsing its source (and that's probably not going to work here for a couple of reasons).
But here are a couple of ideas:
You could pass an arbitrary variable from the qsub command line using the -v option. (You might be able to do the same thing using #PBS -v ... but that would be equivalent to setting a variable in your script in the normal way.)
You should be able to specify the resources (using -l) on the qsub command line instead of in the job script.
Put them together like this:
qsub ... -l nodes=1:ppn=8 -v NOSTHREADS=8 myscript.pbs
where myscript.pbs is:
#!/bin/bash
#PBS directives ... without the "-l" !!!
# ordinary shell commands.
somecommand --someoption $NOSTHREADS ...
Note: I recommend that you don't mix specifying resources on the command line and in the script. Put the "-l" options in one place only. If you put them in both places AND your Torque / PBS installation uses job submission filters, things can get rather confused.
Alternatively, you could write a shell (or python or whatever) launcher that generates the PBS script with matching values of the ppn (etc) resource(s) and the corresponding variable(s) embedded in the generated script.
This approach can have the advantage of being more reproducible ... if you do a few other things as well. (Ask a local eResearch analyst about reproducibility in your scientific computing.)
If neither of the above can be made to work, you might be able to check the ulimit settings within the job script. However, my understanding is that the PBS MOM will typically not use ulimit restrictions as the means of enforcing thread/process limits. Instead, it will monitor the number of cores that are active. (The ppn resource limits the number of processors, not the number of threads or processes.)
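Putting the workaround and the suggestions together, a hedged sketch of the inside of a job script (PBS_NUM_PPN is set by Torque and PBS_NODEFILE by most PBS variants, but this depends on the installation; the nproc fallback for interactive runs and the placeholder command are assumptions):
#!/bin/bash
#PBS -l nodes=1:ppn=8
# Prefer Torque's per-node process count, then count this host's
# entries in the nodefile, then fall back to all cores on the machine.
if [ -n "$PBS_NUM_PPN" ]; then
    NTHREADS=$PBS_NUM_PPN
elif [ -n "$PBS_NODEFILE" ]; then
    NTHREADS=$(grep -c "$(hostname)" "$PBS_NODEFILE")
else
    NTHREADS=$(nproc)
fi
somecommand --threads "$NTHREADS"   # somecommand is a placeholder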

How can I configure SLURM at the user level (e.g. with something like a ".slurmrc")?

Is there something like a .slurmrc for Slurm that would allow each user to set their own defaults for parameters that they would normally specify on the command line?
For example, I run 95% of my jobs on what I'll call our HighMem partition. Since my routine jobs can easily go over the default of 1GB, I almost always request 10GB of RAM. To make the best use of my time, I would like to put the partition and RAM requests in a configuration file so that I don't have to type them in all the time. So, instead of typing the following:
sbatch --partition=HighMem --mem=10G script.sh
I could just type this:
sbatch script.sh
I tried searching for multiple variations on "SLURM user-level configuration" and it seemed that all SLURM-related hits dealt with slurm.conf (a global-level configuration file).
I even tried creating slurm.conf and .slurmrc in my home directory, just in case that worked, but they didn't have any effect on the partition used.
update 1
Yes, I thought about scontrol, but the only configuration file it deals with is global and most parameters in it aren't even relevant for a normal user.
update 2
My supervisor pointed out the Slurm Perl API to me. The last time I looked at it, it seemed too complicated, but this time, looking at the code in https://github.com/SchedMD/slurm/blob/master/contribs/perlapi/libslurm/perl/t/06-complete.t, it would seem that it wouldn't be too hard to create a script that behaves similarly to sbatch, reading in a default configuration file and setting the desired parameters. However, I haven't had any success in setting 'std_out' to a file name that actually gets written to.
If your example is representative, defining an alias
alias sbatch='sbatch --partition=HighMem --mem=10G'
could be the easiest way. Alternatively, a Bash function could also be used
sbatch() {
    command sbatch --partition=HighMem --mem=10G "$@"
}
Put either of these in your .bash_profile to make it persistent.
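If you want something closer to an actual per-user configuration file, here is a rough sketch (the ~/.sbatch_defaults file name and its one-option-per-line format are entirely made up, not a Slurm feature):
# Hypothetical ~/.sbatch_defaults, one option per line:
#   --partition=HighMem
#   --mem=10G
sbatch() {
    local defaults=()
    [ -f "$HOME/.sbatch_defaults" ] && mapfile -t defaults < "$HOME/.sbatch_defaults"
    command sbatch "${defaults[@]}" "$@"
}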

SGE/UGE/etc.: standardized way to submit OpenMP jobs to multiple cores?

I'm looking for a way to submit an OpenMP job to a Grid Engine scheduler, while specifying the number of cores it should run on. Something equivalent to LSF's -n option, or PBS's -l nodes=[count] option.
When I search on this, I see a bunch of answers specifying syntax like "-pe threaded [number of cores]". In those answers, there is no mention of having to create a parallel environment called "threaded". But when I try this syntax, it fails, saying that the requested parallel environment threaded does not exist. And when I type "qconf -spl", the only result I get is "make". So: should this "threaded" parallel environment exist by default, or is this something that has to be manually created on the cluster?
If it has to be manually created, is there any other syntax to submit jobs to multiple cores that does not rely on configurable naming on a cluster? This is for a third-party program submitting to a cluster, so I don't want to rely on the client not only having created this PE, but also having named it the same, etc. I was hoping the -l option might offer something, but I haven't been able to find any permutation of it that achieves this.
If "make" is the only parallel environment listed, then no parallel environments have been set up on your cluster.
There are two solutions to your problem, depending on which of these two situations applies:
A) you have root/admin access to the cluster
B) you don't
In case B, ask your administrator to create a parallel environment. In case A, you can create one yourself. To create a new parallel environment, type (requires root/admin privileges):
qconf -ap <pe_name>
The default editor will open with a default PE configuration that you must edit. If you only need to set up an OpenMP parallel environment, you can use these options:
pe_name smp
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE
and for an MPI parallel environment:
pe_name mpi
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /opt/sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args /opt/sge/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary TRUE
As you can see, in the latter case you point SGE to the right initialization and shutdown scripts for your MPI configuration. In the first case, you simply point to /bin/true.
The allocation_rule differs between the two examples: $fill_up means that SGE will fill any CPU it can find with parts of the MPI job, while for the smp configuration you simply allocate the correct number of slots on the same machine, i.e. $pe_slots.
If you use MPI, your nodes should be connected using a high-performance interconnect such as InfiniBand, otherwise your jobs will spend much more time communicating than calculating.
EDIT:
Oh, by the way: the correct syntax to submit a job with a parallel environment is:
qsub -pe <pe_name> <nb_slots>
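For an OpenMP job under such a PE, a minimal sketch of the submission script (assuming the PE created above is named smp; $NSLOTS is filled in by Grid Engine with the number of slots granted, and the binary name is a placeholder):
#!/bin/bash
#$ -pe smp 8
#$ -cwd
# Tell the OpenMP runtime to use exactly the granted slots.
export OMP_NUM_THREADS=$NSLOTS
./my_openmp_program    # placeholder binary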
FINAL EDIT:
The final answer to the question comes from the comments below. In practice, SGE cannot handle multi-threaded jobs if a parallel environment (PE) is not set up on the cluster. If you do not have admin privileges on the cluster, you must either guess the correct PE to use by listing them with qconf -spl and inspecting each with qconf -sp <pe_name>, or add an option to your software that lets users specify the PE to use.
Otherwise, i.e. if no PEs are available on the cluster, you cannot use a parallel version of your software.
See the comments for further information.
