How to find the max number of tasks that I can run using Slurm?

I have access to a supercomputer that uses Slurm, but there is one piece of information I cannot find: how many parallel tasks can I run? I know I can use --ntasks to set the number, e.g. if I have a parallel problem and I want to test it with 1000 processes, I can run it with --ntasks 1000, but what sets the maximum? The number of nodes, the number of CPUs, or something else?

There is a physical limitation, which is the total number of cores available in the cluster. You can check that with sinfo -o%C; the last number in the output is the total number of CPUs.
There can also be limits defined in the "Quality of Service" (QOS) settings. You can see them with sacctmgr show qos. Look for the MaxTRES column.
But there can also be administrative limits specific to your user or your account. You can see them with sacctmgr show user $USER withassoc. Look for the MaxCPUMins column.
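For a quick overview, you can run all three checks from a shell; treat this as a sketch, since the exact columns shown vary between Slurm versions and site configurations:
sinfo -o%C                           # allocated/idle/other/total CPUs; the last number is the cluster total
sacctmgr show qos                    # check the MaxTRES column for per-QOS limits
sacctmgr show user $USER withassoc   # check the MaxCPUMins column for per-user/account limits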

Related

Slurm: can I create a sub-queue using a subset of resources in a single node?

I have a use case with slurm and I wonder if there is a way to handle it.
Constraints:
I would like to run several jobs (say 60 jobs).
Each one takes a few hours, e.g. 3h/job.
In the cluster managed by slurm, I use a queue with 2 nodes with 4 gpus each (so I can restrict my batch script to one node).
Each job takes 1 gpu.
Problem: if I put everything in the queue, I will block 4 gpus even if I specify only 1 node.
Desired solution: avoid blocking a whole machine by taking, say, 2 gpus only.
How can I put them in the queue without them taking all 4 gpus?
Could I create a kind of sub-queue that would be limited to a subset of a node's resources, for example?
You can use the Slurm consumable trackable resources plug-in (cons_tres, enabled in your slurm.conf file; more info here: https://slurm.schedmd.com/cons_res.html#using_cons_tres) to:
Specify the --gpus-per-task=X
-or-
Bind a specific number of gpus to the task with --gpus=X
-or-
Bind the task to a specific gpu by its ID with --gpu-bind=GPUID
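A minimal sketch of a submission script for this use case, assuming cons_tres is enabled (the script and program names below are placeholders):
#!/bin/bash
#SBATCH --ntasks=1              # one task per submitted job
#SBATCH --gpus-per-task=1       # request exactly one GPU instead of a whole node
srun ./my_gpu_job               # placeholder for the actual program
Submitting the 60 jobs this way lets Slurm pack them onto whichever GPUs are free, rather than reserving all 4 GPUs of a node.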

Monitor memory usage of each node in a slurm job

My slurm job uses several nodes, and I want to know the maximum memory usage of each node for a running job. What can I do?
Right now, I can ssh into each node and do free -h -s 30 > memory_usage, but I think there must be a better way to do this.
The Slurm accounting will give you the maximum memory usage over time and over all tasks directly. If that information is not sufficient, you can set up profiling following this documentation, and you will receive from Slurm the full memory usage of each process as a time series for the duration of the job. You can then aggregate per node, find the maximum, etc.
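For example, assuming accounting is enabled on the cluster (the job ID below is hypothetical), sacct can report the peak resident memory and the node and task where it occurred:
sacct -j 123456 --format=JobID,MaxRSS,MaxRSSNode,MaxRSSTask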

SLURM: Changing the maximum number of simultaneously running tasks for a running array job

I have set of an array job as follows:
sbatch --array=1-100%5 ...
which will limit the number of simultaneously running tasks to 5. The job is now running, and I would like to change this number to 10 (i.e. I wish I'd run sbatch --array=1-100%10 ...).
The documentation on array jobs mentions that you can use scontrol to change options after the job has started. Unfortunately, it's not clear what this option's variable name is, and I don't think it is listed in the documentation of the sbatch command here.
Any pointers well received.
You can change the array throttling limit with the following command:
scontrol update ArrayTaskThrottle=<count> JobId=<jobID>
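For the example above, raising the limit from 5 to 10 on the running array (the job ID is hypothetical) would be:
scontrol update ArrayTaskThrottle=10 JobId=1234567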

Apache Flink: Limit number of CPUs in a TaskManager

First of all, I am running in standalone mode!
I have been trying to find any configuration but I haven't found anything about this.
In Spark there are some configurations which let you limit the number of CPUs to use in each slave:
SPARK_WORKER_CORES (worker configurations)
spark.executor.cores (cluster configuration)
But in Flink you can only set the maximum memory to use and the number of task slots (which just divides the memory), as stated in the official documentation:
taskmanager.numberOfTaskSlots: The number of parallel operator or user function instances that a single TaskManager can run (DEFAULT: 1). If this value is larger than 1, a single TaskManager takes multiple instances of a function or operator. That way, the TaskManager can utilize multiple CPU cores, but at the same time, the available memory is divided between the different operator or function instances. This value is typically proportional to the number of physical CPU cores that the TaskManager’s machine has (e.g., equal to the number of cores, or half the number of cores).
And here, more focused on my question:
Each task slot represents a fixed subset of resources of the TaskManager. A TaskManager with three slots, for example, will dedicate 1/3 of its managed memory to each slot. Slotting the resources means that a subtask will not compete with subtasks from other jobs for managed memory, but instead has a certain amount of reserved managed memory. Note that no CPU isolation happens here; currently slots only separate the managed memory of tasks.
Thanks!!
I was looking into the same question. In my understanding, there is no configuration that sets the number of CPUs per slot. Setting the number of slots divides the memory among the slots, reducing the memory per slot. My best guess is to set the number of slots to 1 and make the CPUs available to the TaskManager process running in a container (maybe Docker). You can achieve the same parallelism by increasing the number of TaskManagers.
I think this is in the Flink config documentation:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#yarn
yarn.containers.vcores (default: -1)
The number of virtual cores (vcores) per YARN container. By default, the number of vcores is set to the number of slots per TaskManager, if set, or to 1, otherwise. In order for this parameter to be used your cluster must have CPU scheduling enabled. You can do this by setting the org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
David is correct above, but the reasoning is because of this setting, and I think this more closely answers the OP's question. So if you leave the default value, adjusting the number of task slots will adjust the number of cores.
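As an illustration (the values are hypothetical, and yarn.containers.vcores only takes effect when YARN CPU scheduling is enabled, as quoted above), the relevant flink-conf.yaml entries might look like:
taskmanager.numberOfTaskSlots: 4
yarn.containers.vcores: 4        # if left at the default -1, this falls back to the slot count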

Requesting integer multiple of "M" cores per node on SGE

I want to submit a multi-threaded MPI job to SGE, and the cluster I am running on has nodes with different numbers of cores. Let's say the number of threads per process is M (M == OMP_NUM_THREADS for OpenMP). How can I request that a job submitted to an SGE queue be run in such a way that on every node, an integer multiple of M cores is allocated to my job?
Let's say M=8, and the number of MPI tasks is 5 (so a total of 40 cores requested). And in this cluster, there are nodes with 4, 8, 12, and 16 cores. Then this combination is OK:
2*(8-core nodes) + 1*(16-core nodes) + 0.5*(16-core nodes)
but of course not any of these ones:
2*(4-core nodes) + 2*(8-core nodes) + 1*(16-core node)
2*(12-core nodes) + 1*(16-core node)
(3/8)*(8-core nodes) + (5/8)*(8-core nodes) + 2*(16-core node)
PS: There was another similar question, like this one: ( MPI & pthreads: nodes with different numbers of cores ), but mine is different since I have to run exactly M threads per MPI process (think hybrid MPI+OpenMP).
The best scenario is to run this job exclusively on the same kind of nodes. But to speed up the start time, I want to allow this job to run on different kind of nodes, provided that each node has integer*M cores allocated to the job.
The allocation policy in SGE is specified on a per-parallel-environment (PE) basis. Each PE can be configured to fill the slots available on the cluster nodes in a specific way. One requests a specific PE with the -pe pe_name num_slots parameter, and SGE then tries to find num_slots slots following the allocation policy of the pe_name PE. Unfortunately, there is no easy way to request slots in integer multiples per node.
In order to be able to request exactly M slots per host (and not a multiple of M), your SGE administrator (or you, if you are the SGE administrator) must first create a new PE, let's call it mpi8ppn, set its allocation_rule to 8, and then assign the PE to each cluster queue. You then have to submit the job to that PE with -pe mpi8ppn 40 and instruct the MPI runtime to start only one process per host, e.g. with -npernode 1 for Open MPI.
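A rough sketch of that setup, assuming Open MPI and placeholder names for the queue, job script, and binary (the exact qconf workflow depends on your SGE version):
qconf -ap mpi8ppn                           # create the PE; set allocation_rule to 8 in the editor that opens
qconf -aattr queue pe_list mpi8ppn all.q    # attach the PE to a queue (all.q is a placeholder)
qsub -pe mpi8ppn 40 jobscript               # request 40 slots, handed out 8 per host
# inside jobscript:
export OMP_NUM_THREADS=8
mpirun -npernode 1 ./hybrid_app             # one MPI process per host, 8 threads each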
If the above is unlikely to happen, your other (unreliable) solution would be to request a very high amount of memory per slot, close to what each node has, e.g. -l h_vmem=23.5G. Assuming that the nodes are configured with h_vmem of 24 GiB, this request will ensure that SGE won't be able to fit more than one slot on each host. So, if you would like to start a hybrid job on 5 nodes, you will simply ask SGE for 5 slots and 23.5G vmem for each slot with:
qsub -pe whatever 5 -l h_vmem=23.5G <other args> jobscript
or
#$ -pe whatever 5
#$ -l h_vmem=23.5G
This method is unreliable since it does not allow you to select cluster nodes that have a specific number of cores and only works if all nodes are configured with h_vmem of less than 47 GB. h_vmem serves just as an example here - any other per-slot consumable attribute should do. The following command should give you an idea of what host complexes are defined and what their values are across the cluster nodes:
qhost -F | egrep '(^[^ ])|(hc:)'
The method works best for clusters where node_mem = k * #cores with k being constant across all nodes. If a node provides twice the number of cores but also has twice the memory, e.g. 48 GiB, then the above request will give you two slots on such nodes.
I don't claim to fully understand SGE and my knowledge dates back from the SGE 6.2u5 era, so simpler solutions might exist nowadays.
