Slurm says drained: Low RealMemory

I want to install Slurm on localhost. I already installed Slurm on a similar machine, where it works fine, but on this machine I get the following:
transgen#transgen-4:~/galaxy/tools/melanoma_tools$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
transgen-4-partition* up infinite 1 drain transgen-4
transgen#transgen-4:~/galaxy/tools/melanoma_tools$ sinfo -Nel
Fri Jun 25 17:42:56 2021
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
transgen-4 1 transgen-4-partition* drained 48 1:24:2 541008 0 1 (null) Low RealMemory
transgen#transgen-4:~/galaxy/tools/melanoma_tools$ srun -n8 sleep 10
srun: Required node not available (down, drained or reserved)
srun: job 5 queued and waiting for resources
^Csrun: Job allocation 5 has been revoked
srun: Force Terminated job 5
I found advice to do the following:
sudo scontrol update NodeName=transgen-4 State=DOWN Reason=hung_completing
sudo systemctl restart slurmctld slurmd
sudo scontrol update NodeName=transgen-4 State=RESUME
but it had no effect.
slurm.conf:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=localhost
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm.state
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=info
#SlurmctldLogFile=
#SlurmdDebug=info
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=transgen-4 NodeAddr=localhost CPUs=48 Sockets=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=541008 State=UNKNOWN
PartitionName=transgen-4-partition Nodes=transgen-4 Default=YES MaxTime=INFINITE State=UP
cgroup.conf:
###
# Slurm cgroup support configuration file.
###
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=no
ConstrainDevices=yes
ConstrainKmemSpace=no #avoid known Kernel issues
ConstrainRAMSpace=no
ConstrainSwapSpace=no
TaskAffinity=no #use task/affinity plugin instead
How can I get Slurm working?
Thanks in advance.

It could be that RealMemory=541008 in slurm.conf is too high for your system. Try lowering the value. Let's suppose you do indeed have 541 GB of RAM installed: change it to RealMemory=500000, run scontrol reconfigure and then scontrol update nodename=transgen-4 state=resume.
If that works, you can try raising the value a bit.
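A quick way to pick a safe value is to ask slurmd itself what it detects on the node; the RealMemory figure it reports is an upper bound for what you should configure. A minimal sketch of the sequence, using the node name from the question:
# print the hardware slurmd detects, including RealMemory in MB
slurmd -C
# set RealMemory in slurm.conf at or below the reported value, then:
sudo scontrol reconfigure
sudo scontrol update NodeName=transgen-4 State=RESUME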

Related

SLURM: Cannot use more than 8 CPUs in a job

Sorry for a stupid question. We set up a small cluster using slurm-16.05.9. The sinfo command shows:
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
g01 1 batch* idle 40 2:20:2 258120 63995 1 (null) none
g02 1 batch* idle 40 2:20:2 103285 64379 1 (null) none
g03 1 batch* idle 40 2:20:2 515734 64379 1 (null) none
So each node has 2 sockets with 20 cores per socket, 40 CPUs in total. However, we cannot submit a job using more than 8 CPUs. For example, with the following job description file:
#!/bin/bash
#SBATCH -J Test # Job name
#SBATCH -p batch
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH -o log
#SBATCH -e err
then submitting this job gave the following error message:
sbatch: error: Batch job submission failed: Requested node configuration is not available
even when there are no jobs at all in the cluster, unless we set --cpus-per-task <= 8.
Our slurm.conf has the following contents:
ControlMachine=cbc1
ControlAddr=192.168.80.91
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=pmi2
ProctrackType=proctrack/pgid
ReturnToService=0
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=128000
FastSchedule=0
MaxMemPerCPU=128000
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
GresTypes=gpu
NodeName=g01 CPUs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
NodeName=g02 CPUs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN Gres=gpu:P100:1
NodeName=g03 CPUs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
PartitionName=batch Nodes=g0[1-3] Default=YES MaxTime=UNLIMITED State=UP
Could anyone give us a hint on how to fix this problem?
Thank you very much.
T.H.Hsieh
The problem is most probably not the number of CPUs but the memory. The job script does not specify memory requirements, and the configuration states
DefMemPerCPU=128000
Therefore a job with 10 CPUs requests a total of 1,280,000 MB of RAM on a single node, while the maximum available is 515,734 MB. (The per-job memory computation in that old version of Slurm is possibly based on cores rather than threads, so the actual request would be half of that, 640,000 MB, which still does not fit.) A job requesting 8 CPUs under that hypothesis requests 512,000 MB, which does fit. You can confirm this with scontrol show job <JOBID>.
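A hedged sketch of the two usual ways out, given the node definitions above: either request memory explicitly in the job script, so the DefMemPerCPU default no longer applies, or lower DefMemPerCPU in slurm.conf to something that fits 40 CPUs per node.
# Option 1: request memory explicitly in the job script
#SBATCH --cpus-per-task=10
#SBATCH --mem=100G
# Option 2: in slurm.conf, pick a default that fits, e.g.
# DefMemPerCPU=6000
# then reload with: scontrol reconfigure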

slurmctld.service: Can't open PID file No such file or directory

I get the following error message after trying to start Slurm on Ubuntu 18.04:
slurmctld.service: Can't open PID file /var/run/slurm-llnl/slurmctld.pid (yet?) after start: No such file or directory
Here is the ownership of the slurm-llnl directory:
drwxr-xr-x 2 slurm slurm 60 juin 22 11:06 slurm-llnl
In this directory I have slurmd.pid, but I don't have slurmctld.pid.
Here is my slurm.conf file:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=daoud
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/cons_res
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/filetxt
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
#
# COMPUTE NODES
NodeName=daoud CPUs=64 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN
PartitionName=standard Nodes=daoud Default=YES MaxTime=INFINITE State=UP
This is a message issued by systemd, not Slurm; it is caused by the PIDFile setting in the systemd unit and should not by itself keep slurmctld from starting.
Newer versions of Slurm switched the unit to Type=simple, so a PID file is not needed anymore.
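If the warning bothers you, one option is to bring the packaged unit in line with the newer upstream units, which run the daemon in the foreground without a PID file. A minimal sketch of a systemd drop-in, assuming the Ubuntu 18.04 slurm-llnl packages install the daemon at /usr/sbin/slurmctld (adjust the path to your install):
sudo systemctl edit slurmctld
# then add to the drop-in:
[Service]
Type=simple
PIDFile=
ExecStart=
ExecStart=/usr/sbin/slurmctld -D
If slurmctld genuinely fails to start, the real cause is usually recorded in /var/log/slurm-llnl/slurmctld.log rather than in this systemd warning.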

Job not getting requested memory

I have a Slurm job named expensive that is killed due to an out-of-memory error:
I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
slurmstepd: error: Detected 2 oom-kill event(s) in step expensive.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
The job is defined as:
#!/bin/bash
#SBATCH --job-name="expensive"
#SBATCH --mem=64G
#SBATCH --gres=gpu:rtx2080ti:6
#SBATCH --time=05:00:00
#SBATCH --partition=cpu-part
python expensive.py
The job clearly requests 64 GB of RAM. When I run sacct -j {jobid} -o JobID,JobName,MaxRSS,AveRSS,CPUTime to profile it, the output is:
JobID JobName MaxRSS AveRSS CPUTime
------------ ---------- ---------- ---------- ----------
{jobid} expensive 00:04:01
{jobid}.bat+ batch 21124408K 21124408K 00:04:01
It basically ran for 4 minutes, consumed about 20 GB of RAM and then crashed due to OOM. What am I missing here?
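When chasing this kind of OOM it can help to print the memory the job actually requested alongside what accounting recorded; a sketch using the same job ID placeholder (ReqMem, MaxRSS and MaxVMSize are standard sacct fields):
sacct -j {jobid} -o JobID,JobName,ReqMem,MaxRSS,MaxVMSize,Elapsed
ReqMem shows the limit Slurm enforced for the job (per node or per CPU), which can then be compared against the recorded MaxRSS.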

error: _slurm_rpc_node_registration node=xxxxx: Invalid argument

I am trying to set up Slurm. I have only one login node (called ctm-login-01) and one compute node (called ctm-deep-01). My compute node has several CPUs and 3 GPUs.
My compute node keeps ending up in the drain state and I cannot for the life of me figure out where to start...
Login node
sinfo
ctm-login-01:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 drain ctm-deep-01
The reason?
sinfo -R
ctm-login-01:~$ sinfo -R
REASON USER TIMESTAMP NODELIST
gres/gpu count repor slurm 2020-12-11T15:56:55 ctm-deep-01
Indeed, I keep getting these error messages in /var/log/slurm-llnl/slurmctld.log:
/var/log/slurm-llnl/slurmctld.log
[2020-12-11T16:17:39.857] gres/gpu: state for ctm-deep-01
[2020-12-11T16:17:39.857] gres_cnt found:0 configured:3 avail:3 alloc:0
[2020-12-11T16:17:39.857] gres_bit_alloc:NULL
[2020-12-11T16:17:39.857] gres_used:(null)
[2020-12-11T16:17:39.857] error: _slurm_rpc_node_registration node=ctm-deep-01: Invalid argument
(Notice that I have set the debug level in slurm.conf to verbose and also set DebugFlags=Gres for more detail on the GPUs.)
These are the configuration files I have on all nodes, and some of their contents...
/etc/slurm-llnl/* files
ctm-login-01:/etc/slurm-llnl$ ls
cgroup.conf cgroup_allowed_devices_file.conf gres.conf plugstack.conf plugstack.conf.d slurm.conf
ctm-login-01:/etc/slurm-llnl$ tail slurm.conf
#SuspendTime=
#
#
# COMPUTE NODES
GresTypes=gpu
NodeName=ctm-deep-01 Gres=gpu:3 CPUs=24 Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=ctm-deep-01 Default=YES MaxTime=INFINITE State=UP
# default
SallocDefaultCommand="srun --gres=gpu:1 $SHELL"
ctm-deep-01:/etc/slurm-llnl$ cat gres.conf
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia0 CPUs=0-23
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia1 CPUs=0-23
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia2 CPUs=0-23
ctm-login-01:/etc/slurm-llnl$ cat cgroup.conf
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
#TaskAffinity=yes
ctm-login-01:/etc/slurm-llnl$ cat cgroup_allowed_devices_file.conf
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*
Compute node
The logs on my compute node are the following:
/var/log/slurm-llnl/slurmd.log
ctm-deep-01:~$ sudo tail /var/log/slurm-llnl/slurmd.log
[2020-12-11T15:54:35.787] Munge credential signature plugin unloaded
[2020-12-11T15:54:35.788] Slurmd shutdown completing
[2020-12-11T15:55:53.433] Message aggregation disabled
[2020-12-11T15:55:53.436] topology NONE plugin loaded
[2020-12-11T15:55:53.436] route default plugin loaded
[2020-12-11T15:55:53.440] task affinity plugin loaded with CPU mask 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffff
[2020-12-11T15:55:53.440] Munge credential signature plugin loaded
[2020-12-11T15:55:53.441] slurmd version 19.05.5 started
[2020-12-11T15:55:53.442] slurmd started on Fri, 11 Dec 2020 15:55:53 +0000
[2020-12-11T15:55:53.443] CPUs=24 Boards=1 Sockets=1 Cores=12 Threads=2 Memory=128754 TmpDisk=936355 Uptime=26 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
That CPU affinity mask looks weird...
Notice that I have already run sudo nvidia-smi --persistence-mode=1. Notice also that the aforementioned gres.conf file seems correct:
nvidia-smi topo -m
ctm-deep-01:/etc/slurm-llnl$ sudo nvidia-smi topo -m
GPU0 GPU1 GPU2 CPU Affinity NUMA Affinity
GPU0 X SYS SYS 0-23 N/A
GPU1 SYS X PHB 0-23 N/A
GPU2 SYS PHB X 0-23 N/A
Is there any other log or configuration file I should take a clue from? Thanks!
It was all because of a typo!
ctm-deep-01:/etc/slurm-llnl$ cat gres.conf
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia0 CPUs=0-23
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia1 CPUs=0-23
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia2 CPUs=0-23
Obviously, that should be NodeName=ctm-deep-01, which is my compute node! Jeez...
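For completeness, the corrected gres.conf under that fix, followed by the usual command to clear the drain state (node name taken from the question):
# gres.conf on the compute node; NodeName must match the name used in slurm.conf
NodeName=ctm-deep-01 Name=gpu File=/dev/nvidia0 CPUs=0-23
NodeName=ctm-deep-01 Name=gpu File=/dev/nvidia1 CPUs=0-23
NodeName=ctm-deep-01 Name=gpu File=/dev/nvidia2 CPUs=0-23
# after restarting slurmd and slurmctld, clear the drain reason:
sudo scontrol update NodeName=ctm-deep-01 State=RESUME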

How to make job allocation by node group in a partition in SLURM

I am using the Slurm job scheduler.
The HPC cluster consists of two groups of nodes: ddcd[00-31] and ddcb[00-31].
The two groups have different hardware specs (40 cores and 16 cores per node) but are in the same partition.
I would like Slurm to allocate a job within one of the node groups instead of mixing or spreading the job across the two groups.
For instance, a job of 160 cores should be allocated on 10 nodes of ddcb or 4 nodes of ddcd.
I have set a node weight on each node group, but it does not seem to work; some mixed allocations were observed.
Any help would be appreciated.
my slurm.conf is as follows:
SlurmctldHost=mynode
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
JobRequeue=0
# JOB PRIORITY
#PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityCalcPeriod=5
PriorityFavorSmall=NO
PriorityMaxAge=14-0
PriorityUsageResetPeriod=NONE
PriorityWeightAge=10000
PriorityWeightFairshare=0
PriorityWeightJobSize=100000
PriorityWeightPartition=0
PriorityWeightQOS=1000000
#
AuthType=auth/munge
CryptoType=crypto/munge
#
PrologFlags=Alloc
#PrologFlags=x11
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SchedulerParameters=enable_user_top
SelectType=select/linear
#
PropagateResourceLimitsExcept=MEMLOCK
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=qos,limits,
ClusterName=ssmbhpc
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=ddcd[00-31] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Weight=10 State=UNKNOWN
NodeName=ddcb[00-31] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Weight=200 State=UNKNOWN
#
# Partition
PartitionName=debug Nodes=ddcd[00-31] Default=YES MaxTime=INFINITE State=UP
PartitionName=strp Nodes=ddcd[00-31],ddcb[00-31] Default=No MaxTime=INFINITE State=UP QOS=normal
I found that this is achievable with node features and sbatch --constraint.
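A minimal sketch of that approach, given the node definitions above (the feature names ddcd and ddcb are arbitrary labels chosen here): tag each node group with a feature in slurm.conf, then either pin a job to one group or let Slurm choose exactly one of the two groups with the bracket syntax of --constraint.
# slurm.conf: add a Feature to each node group
NodeName=ddcd[00-31] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Weight=10 Feature=ddcd State=UNKNOWN
NodeName=ddcb[00-31] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Weight=200 Feature=ddcb State=UNKNOWN
# reload the configuration (e.g. scontrol reconfigure or restart the daemons), then:
sbatch -n 160 --constraint=ddcd job.sh
# or let Slurm pick either group, but never mix them:
sbatch -n 160 --constraint="[ddcd|ddcb]" job.sh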
