SLURM error: sbatch: error: Batch job submission failed: Requested node configuration is not available

I'm trying to use a cluster to run an MPI code. The cluster hardware consists of 30 nodes, named aria, each with the following specs:
16 cores across 2 sockets (Intel Xeon E5-2650 v2), i.e. 32 logical cores with hyper-threading enabled
64 GB of 1866 MT/s main memory
The Slurm batch script is configured as follows:
#SBATCH --ntasks=64 # Number of MPI ranks
#SBATCH --cpus-per-task=1 # Number of cores per MPI rank
#SBATCH --nodes=2 # Number of nodes
#SBATCH --ntasks-per-node=32 # How many tasks on each node
#SBATCH --ntasks-per-socket=16 # How many tasks on each CPU or socket
#SBATCH --mem-per-cpu=100mb # Memory per core
When I submit the job, it is rejected with the following message: sbatch: error: Batch job submission failed: Requested node configuration is not available
This is a little confusing: I'm submitting one task per CPU and dividing the tasks equally between nodes and sockets. Can anyone advise on what is wrong with the above configuration? And one more thing: what is the optimum configuration given the hardware specs?
Thanks in advance

Look at exactly what the nodes offer with the sinfo -Nl command.
It could be that:
hyper-threading is not enabled (which is often the case on HPC clusters),
or one core is reserved for Slurm and the operating system,
or hyper-threading is enabled but Slurm is configured to schedule physical cores only (see the sketch below).
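If the last case applies (only the 16 physical cores per node are schedulable), a header along these lines should be accepted. This is only a sketch, assuming 16 schedulable CPUs per node split across the 2 sockets:
#SBATCH --ntasks=64 # Number of MPI ranks
#SBATCH --cpus-per-task=1 # Number of cores per MPI rank
#SBATCH --nodes=4 # 64 ranks / 16 physical cores per node
#SBATCH --ntasks-per-node=16 # One task per physical core
#SBATCH --ntasks-per-socket=8 # 8 cores per socket
#SBATCH --mem-per-cpu=100mb # Memory per core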
As for the optimal job configuration, it depends on how 'optimal' is defined. For optimal time to solution, it is often better to let Slurm decide how to organise the ranks on the nodes, because it will then be able to start your job sooner:
#SBATCH --ntasks=64 # Number of MPI ranks
#SBATCH --mem-per-cpu=100mb # Memory per core
For optimal job performance (for benchmarks, cost analysis, etc.) you will need to take network switches into account as well (although with 30 nodes you probably have only one switch):
#SBATCH --ntasks=64 # Number of MPI ranks
#SBATCH --exclusive
#SBATCH --switches=1
#SBATCH --mem-per-cpu=100mb # Memory per core
Using --exclusive makes sure your job is not disturbed by other jobs sharing the nodes.

Related

Difference between dask node and compute node for slurm configuration

First off, apologies if I use confusing or incorrect terminology; I am still learning.
I am trying to set up the configuration for a Slurm-enabled adaptive cluster.
The supercomputer and its Slurm configuration are documented here. Here is some of the most relevant information extracted from the website:
Partition Name: compute
Max Nodes per Job: 512
Max Job Runtime: 8 hours
Max resources used simultaneously: no limit
Shared Node Usage: no
Default Memory per CPU: 1920 MB
Max Memory per CPU: 8000 MB
compute
This partition consists of 2659 AMD EPYC 7763 Milan compute nodes and is intended for running parallel scientific applications. The compute nodes allocated for a job are used exclusively and cannot be shared with other jobs. Some information about the compute node:
# of CPU Cores: 64
# of Threads: 128
Here is some output from scontrol show partition:
PartitionName=compute
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=512 MaxTime=08:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=l[10000-10058,10061-10062,10064-10065,10067-10068,10070-10083,10090-10095,10100-10158,10160-10183,10190-10195,10200-10258,10260-10283,10290-10295,10300-10357,10359-10383,10390-10395,10400-10483,10490-10495,10500-10583,10590-10595,10600-10683,10690-10695,10700-10783,10790-10795,20000-20059,20061-20062,20064-20065,20067-20068,20070-20083,20090-20095,20100-20183,20190-20195,20200-20223,20225-20283,20290-20295,20300-20383,20390-20395,20400-20483,20490-20495,20500-20583,20590-20595,20600-20683,20690-20695,30000-30059,30061-30062,30064-30083,30090-30095,30100-30183,30190-30195,30200-30230,30232-30283,30290-30295,30300-30383,30390-30395,30400-30483,30490-30495,30500-30583,30590-30595,30600-30683,30690-30695,30700-30760,30762-30783,30790-30795,40000-40026,40028-40029,40031-40032,40034-40035,40037-40038,40040-40083,40090-40095,40101-40102,40104-40105,40107-40108,40110-40111,40113-40183,40190-40195,40200-40283,40287-40295,40300-40359,40400-40483,40490-40495,40500-40583,40587-40595,40600-40683,40687-40695,50200-50259,50269-50271,50300-50359,50369-50371]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=711168 TotalNodes=2778 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=960 MaxMemPerCPU=3840
Here is what I have so far:
from dask_jobqueue import SLURMCluster  # import needed for the snippet to run

cluster = SLURMCluster(
    name='dask-cluster',
    processes=32,                        # Dask worker processes per Slurm job
    cores=64,                            # CPU cores per Slurm job
    memory=f"{8000 * 64 * 0.90} MB",     # 90% of the 8000 MB/CPU limit times 64 CPUs
    project="ab0995",
    queue="compute",
    interface='ib0',
    walltime='08:00:00',
    asynchronous=0,
    # job_extra=["--ntasks-per-node=50",],
)
Some things to mention:
In the first table above, "nodes" refers to compute server nodes, not Dask nodes (which I think should probably rather be called Dask workers? If someone could clear up that term for me I would be grateful). Since I have 64 CPU cores and 8000 MB of allowed memory per CPU, I thought it would be sensible to set the memory to 8000 * 64 with a "reduction" factor of 0.90, just to be on the safe side.
I have 64 CPUs, which I believe should translate to 64 "cores" in the SLURMCluster. I want each Python process to have 2 CPUs, so 32 processes in total. That might be optimised down to 4 CPUs per Python process, but I have no idea how to get a feeling for sensible settings here.
I set the walltime of each dask-cluster job to the maximum allowed, as I would rather block with one Slurm job than have to wait again. This might leave the server partly idle, but it might still be more effective than waiting in the Slurm batch queue.
If I now print the job script as configured above, I get:
print(cluster.job_script())
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p compute
#SBATCH -A ab0995
#SBATCH -n 1
#SBATCH --cpus-per-task=64
#SBATCH --mem=430G
#SBATCH -t 08:00:00
/work/ab0995/AWIsoft/miniconda/NextGEMS/.conda/bin/python -m distributed.cli.dask_worker tcp://136.172.120.121:36449 --nthreads 2 --nprocs 32 --memory-limit 13.41GiB --name dummy-name --nanny --death-timeout 60 --interface ib0 --protocol tcp://
So, questions:
By my mental math, 8000 * 64 * 0.9 = 460.8 GB, not 430G. What is happening here?
I don't really understand the nthreads, nprocs, and memory-limit settings of the dask_worker…?
Can someone give me a good distinction between Dask nodes, Dask workers, and compute nodes as seen in Slurm?
I am in a similar situation to yours, where I am trying to understand how dask-distributed works with Slurm.
The dask-distributed docs report that Slurm uses KB or GB, but it actually means KiB or GiB, so Dask converts your value to GiB.
What I've found is that nprocs = processes, nthreads = cores / processes, and memory-limit = allocated memory / processes. The job will then launch a dask_worker using nprocs processes of nthreads threads each (these will be the workers of your SLURMCluster).
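As a quick check of both points against the job script above (just the arithmetic, assuming Dask reads the requested megabytes as decimal units and reports binary ones):
# 8000 MB/CPU * 64 CPUs * 0.90 = 460800 MB requested (decimal)
# 460800 * 10^6 bytes ≈ 429.2 GiB in total, which shows up rounded as --mem=430G
# 429.2 GiB / 32 processes ≈ 13.41 GiB, the --memory-limit printed per worker
awk 'BEGIN { printf "%.2f GiB total, %.2f GiB per worker\n", 460800*10^6/2^30, 460800*10^6/2^30/32 }'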
This is not clear to me either, so I don't have a good answer. I think that since Slurm nodes have several CPUs, the scheduler of your SLURMCluster manages Dask workers based on the allocated CPUs. (I didn't find anything about Dask nodes in the docs, though.)

Can I set a variable amount of memory for SLURM jobs?

I am on a school server, where we have a login node, and several GPU nodes that can be accessed with SLURM. Specifically, I am running several jobs on a machine with 512 GB RAM, 64 CPUS, and 8 GPUs.
I want to train a large deep-learning model with a ton of text. After some trial and error, it seems I need 100 GB of RAM to load all the text data (a 12 GB batch file on disk) successfully before training starts. However, while training, it only uses about 30 GB of RAM. I want to run several versions of this model, but I can only run up to 5 before running out of RAM to allocate.
Is there a way to have Slurm use a variable amount of RAM? I don't want to hog too many resources unnecessarily. These are the SBATCH directives I am currently using:
#SBATCH --mem-per-cpu=20G
#SBATCH --cpus-per-task=5
#SBATCH --nodelist=gpu-large
#SBATCH --gpus=1
I currently have 3 jobs running, and this is what I see from free:
total used free shared buff/cache available
Mem: 503Gi 94Gi 246Gi 1.6Gi 161Gi 403Gi
Swap: 4.0Gi 96Mi 3.9Gi
So my jobs are indeed only using about 30 GB each. Help would be greatly appreciated!
Disclaimer: the following is from memory; I have no access to a Slurm cluster at the moment to check it.
You should be able to update the memory requirement of the job after it has started. As far as I remember, as a simple user (as opposed to a Slurm admin or root) you can only lower the limits, not increase them (to prevent hijacking priority or resources). Say your job has a memory-intensive step0 followed by a less intensive step1; you could then try something like the following:
# Memory intensive part
step0
# Lower the resource reservation
scontrol update JobId=$SLURM_JOBID MinMemoryCPU=6G
# Less intensive step
step1
Some general info there: https://slurm.schedmd.com/scontrol.html
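Put together with the directives from the question, a sketch of the full batch script could look like this (step0 and step1 are placeholders for the data-loading and training commands; the 6G value is illustrative):
#!/bin/bash
#SBATCH --mem-per-cpu=20G        # enough for the memory-hungry loading phase
#SBATCH --cpus-per-task=5
#SBATCH --nodelist=gpu-large
#SBATCH --gpus=1

step0                            # memory-intensive loading (placeholder)
# Lower the reservation once the peak has passed (a regular user can only decrease it)
scontrol update JobId=$SLURM_JOBID MinMemoryCPU=6G
step1                            # training phase, around 30 GB resident (placeholder)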

Is there a way to use CPUs individually on a cluster with slurm?

I've been using a cluster of 200 nodes with 32 cores each to simulate stochastic processes.
I have to do around 10 000 simulations of the same system, so I run the same simulation (with different RNG seeds) on the 32 cores of one node until all 10 000 simulations are done (each simulation is completely independent of the others).
In doing so, some of the simulations take much more time than the others, depending on the seed, and after a while I usually have the full node allocated to me but only one core running (so I am unnecessarily occupying 31 cores).
In my sbatch script I have this:
# Specify the number of nodes (--nodes=) and the number of tasks per node (--ntasks-per-node=) to be used
#SBATCH -N 1
#SBATCH --ntasks-per-node=32
...
cat list.dat | parallel --colsep '\t' -j 32 ./main{} > "Results/A.out"
which runs 32 instances of ./main at a time on the same node until all lines of list.dat are used (10 000 lines).
Is there a way to free these unused cores for other jobs?
And is there a way to send these 32 jobs to arbitrary nodes, that is, one job submission using a maximum of 32 cores on (potentially) different nodes (whatever is free at the moment)?
Thank you!
If the cluster is configured to share compute nodes between jobs, one option is to submit a 10 000-job job array. The submission script would look like this (untested):
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --array=1-10000
cat list.dat | sed -n "${SLURM_ARRAY_TASK_ID} p" | xargs -I{} ./main{} > "Results/A.out_${SLURM_ARRAY_TASK_ID}"
Every simulation would then be scheduled independently of the others and use all the free cores on the cluster, without leaving allocated but unused cores.
By contrast with submitting 10 000 independent jobs, a job array lets you manage all the jobs with a single command. Also, job arrays put much less burden on the scheduler than individual jobs.
If there is a limitation on the number of jobs allowed in a job array, you can simply pack multiple simulations into the same job, either sequentially or in parallel as you are doing at the moment, but with maybe 8 or 12 cores:
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH --array=1-10000:100
cat list.dat | sed -n "${SLURM_ARRAY_TASK_ID},$((SLURM_ARRAY_TASK_ID+99)) p" | parallel --colsep '\t' -j 12 ./main{} > "Results/A.out_${SLURM_ARRAY_TASK_ID}"
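Independently of the packing, the array specification also accepts a throttle written with %, which caps how many array tasks run at the same time. For example, the following sketch allows at most 200 simultaneous tasks (the value 200 is illustrative):
#SBATCH --array=1-10000%200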

How to set maximum allowed CPUs per job in Slurm?

How can I set the maximum number of CPUs each job can ask for in Slurm?
We're running a GPU cluster and want a sensible number of CPUs to always be available for GPU jobs. This is more or less fine as long as a job requests GPUs, because there is a GPU <-> CPU mapping in gres.conf. But that doesn't stop a job that doesn't request any GPUs from acquiring all the CPUs in the system.
To set the maximum number of CPUs a single job can use, at the cluster level, you can run the following command:
sacctmgr modify cluster <cluster_name> set maxtresperjob=cpu=<nb of CPUs>
Note that you must have SelectType=select/cons_tres in your configuration file for this to work.
Alternatively, the same restriction can be applied per partition, per QOS, per account, etc.
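For instance, the same cap expressed at the QOS level could look like the following sketch, where 'normal' and 16 are placeholders for your QOS name and CPU limit:
sacctmgr modify qos normal set maxtresperjob=cpu=16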

how to limit the number of jobs running on the same node using SLURM?

I have a job array of 100 jobs. I want at most 2 jobs from the job array to be allocated to the same node. How can I do this with Slurm? Thanks!
Assuming that jobs can share nodes, that the nodes have a homogeneous configuration, and that you are alone on the cluster:
use the sinfo -Nl command to find the number of CPUs per node
submit jobs that request half that number, with either #SBATCH --ntasks-per-node=... or #SBATCH --cpus-per-task=... depending on what your jobs do (see the example right after this list)
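For example, on nodes that sinfo reports with 32 CPUs, each job of the array would request half of them (an illustrative value, to be adapted to your nodes):
#SBATCH --cpus-per-task=16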
If you are administrating a cluster that is shared with other people, you can define a GRES of a dummy type, assign two of them to each node in slurm.conf, and then request one per job with --gres=dummy:1 (a sketch follows below).
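A sketch of that GRES approach (the name dummy and the node range are illustrative; the exact slurm.conf lines depend on your site configuration):
# slurm.conf: declare a count-only GRES and give every node two of them
GresTypes=dummy
NodeName=node[001-200] Gres=dummy:2 ...
# in each job of the array, consume one so that at most two jobs share a node
#SBATCH --gres=dummy:1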
