error: _slurm_rpc_node_registration node=xxxxx: Invalid argument - slurm
I am trying to set up Slurm. I have only one login node (called ctm-login-01) and one compute node (called ctm-deep-01). The compute node has several CPUs and 3 GPUs.
The compute node keeps ending up in the drain state and I cannot for the life of me figure out where to start...
Login node
sinfo
ctm-login-01:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 drain ctm-deep-01
The reason?
sinfo -R
ctm-login-01:~$ sinfo -R
REASON USER TIMESTAMP NODELIST
gres/gpu count repor slurm 2020-12-11T15:56:55 ctm-deep-01
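As an aside, sinfo truncates the REASON column to its default field width; the full drain reason can be read with scontrol (a standard command, shown here with my node name):
ctm-login-01:~$ scontrol show node ctm-deep-01 | grep -i reason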
Indeed, I keep getting these error messages in /var/log/slurm-llnl/slurmctld.log:
/var/log/slurm-llnl/slurmctld.log
[2020-12-11T16:17:39.857] gres/gpu: state for ctm-deep-01
[2020-12-11T16:17:39.857] gres_cnt found:0 configured:3 avail:3 alloc:0
[2020-12-11T16:17:39.857] gres_bit_alloc:NULL
[2020-12-11T16:17:39.857] gres_used:(null)
[2020-12-11T16:17:39.857] error: _slurm_rpc_node_registration node=ctm-deep-01: Invalid argument
(Note that I have set the debug level in slurm.conf to verbose and also DebugFlags=Gres for more detail on the GPUs.)
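For reference, the lines I mean are of this form (SlurmctldDebug, SlurmdDebug and DebugFlags are standard slurm.conf parameters; which of the two debug levels matters depends on which log you are watching):
SlurmctldDebug=verbose
SlurmdDebug=verbose
DebugFlags=Gres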
These are the configuration files I have on all nodes, along with some of their contents...
/etc/slurm-llnl/* files
ctm-login-01:/etc/slurm-llnl$ ls
cgroup.conf cgroup_allowed_devices_file.conf gres.conf plugstack.conf plugstack.conf.d slurm.conf
ctm-login-01:/etc/slurm-llnl$ tail slurm.conf
#SuspendTime=
#
#
# COMPUTE NODES
GresTypes=gpu
NodeName=ctm-deep-01 Gres=gpu:3 CPUs=24 Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=ctm-deep-01 Default=YES MaxTime=INFINITE State=UP
# default
SallocDefaultCommand="srun --gres=gpu:1 $SHELL"
ctm-deep-01:/etc/slurm-llnl$ cat gres.conf
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia0 CPUs=0-23
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia1 CPUs=0-23
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia2 CPUs=0-23
ctm-login-01:/etc/slurm-llnl$ cat cgroup.conf
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
#TaskAffinity=yes
ctm-login-01:/etc/slurm-llnl$ cat cgroup_allowed_devices_file.conf
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*
Compute node
The logs on my compute node are as follows.
/var/log/slurm-llnl/slurmd.log
ctm-deep-01:~$ sudo tail /var/log/slurm-llnl/slurmd.log
[2020-12-11T15:54:35.787] Munge credential signature plugin unloaded
[2020-12-11T15:54:35.788] Slurmd shutdown completing
[2020-12-11T15:55:53.433] Message aggregation disabled
[2020-12-11T15:55:53.436] topology NONE plugin loaded
[2020-12-11T15:55:53.436] route default plugin loaded
[2020-12-11T15:55:53.440] task affinity plugin loaded with CPU mask 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffff
[2020-12-11T15:55:53.440] Munge credential signature plugin loaded
[2020-12-11T15:55:53.441] slurmd version 19.05.5 started
[2020-12-11T15:55:53.442] slurmd started on Fri, 11 Dec 2020 15:55:53 +0000
[2020-12-11T15:55:53.443] CPUs=24 Boards=1 Sockets=1 Cores=12 Threads=2 Memory=128754 TmpDisk=936355 Uptime=26 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
That CPU mask affinity looks weird...
Note that I have already run sudo nvidia-smi --persistence-mode=1. Note also that the CPU affinities in the gres.conf file above seem correct:
nvidia-smi topo -m
ctm-deep-01:/etc/slurm-llnl$ sudo nvidia-smi topo -m
GPU0 GPU1 GPU2 CPU Affinity NUMA Affinity
GPU0 X SYS SYS 0-23 N/A
GPU1 SYS X PHB 0-23 N/A
GPU2 SYS PHB X 0-23 N/A
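As a further sanity check (the ls output below is what I would expect if the three device files named in gres.conf exist):
ctm-deep-01:~$ nvidia-smi -L          # should list exactly three GPUs
ctm-deep-01:~$ ls /dev/nvidia[0-9]*
/dev/nvidia0  /dev/nvidia1  /dev/nvidia2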
Is there any other log or configuration file I should take a clue from? Thanks!
It was all because of a typo!
ctm-deep-01:/etc/slurm-llnl$ cat gres.conf
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia0 CPUs=0-23
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia1 CPUs=0-23
NodeName=ctm-login-01 Name=gpu File=/dev/nvidia2 CPUs=0-23
Obviously, that should be NodeName=ctm-deep-01, which is my compute node! Jeez...
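For completeness, the corrected file, plus the steps that should bring the node back out of the drain state (assuming the standard slurmd/slurmctld systemd units; scontrol ... State=RESUME is the usual way to clear a drain):
ctm-deep-01:/etc/slurm-llnl$ cat gres.conf
NodeName=ctm-deep-01 Name=gpu File=/dev/nvidia0 CPUs=0-23
NodeName=ctm-deep-01 Name=gpu File=/dev/nvidia1 CPUs=0-23
NodeName=ctm-deep-01 Name=gpu File=/dev/nvidia2 CPUs=0-23

ctm-deep-01:~$ sudo systemctl restart slurmd
ctm-login-01:~$ sudo systemctl restart slurmctld
ctm-login-01:~$ sudo scontrol update NodeName=ctm-deep-01 State=RESUME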
Related
SLURM: Cannot use more than 8 CPUs in a job
Sorry for a stupid question. We set up a small cluster using slurm-16.05.9. The sinfo command shows:

NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
g01 1 batch* idle 40 2:20:2 258120 63995 1 (null) none
g02 1 batch* idle 40 2:20:2 103285 64379 1 (null) none
g03 1 batch* idle 40 2:20:2 515734 64379 1 (null) none

So each node has 2 sockets, each socket has 20 cores, for 40 CPUs in total. However, we cannot submit a job using more than 8 CPUs. For example, with the following job description file:

#!/bin/bash
#SBATCH -J Test   # Job name
#SBATCH -p batch
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH -o log
#SBATCH -e err

submitting this job gives the following error message:

sbatch: error: Batch job submission failed: Requested node configuration is not available

even though there are no jobs at all in the cluster, unless we set --cpus-per-task <= 8. Our slurm.conf has the following contents:

ControlMachine=cbc1
ControlAddr=192.168.80.91
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=pmi2
ProctrackType=proctrack/pgid
ReturnToService=0
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=128000
FastSchedule=0
MaxMemPerCPU=128000
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
GresTypes=gpu
NodeName=g01 CPUs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
NodeName=g02 CPUs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN Gres=gpu:P100:1
NodeName=g03 CPUs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
PartitionName=batch Nodes=g0[1-3] Default=YES MaxTime=UNLIMITED State=UP

Could anyone give us a hint how to fix this problem? Thank you very much.
T.H.Hsieh
The problem is most probably not the number of CPUs but the memory. The job script does not specify memory requirements, and the configuration states

DefMemPerCPU=128000

Therefore the job with 10 CPUs is requesting a total of 1,280,000 MB of RAM on a single node, while the maximum available is 515,734 MB. (The memory computation per job in that old version of Slurm is possibly related to cores and not threads, in which case the actual request would be half of that, 640,000 MB, which still does not fit.) A job requesting 8 CPUs under that hypothesis requests 512,000 MB, which does fit. You can confirm this with scontrol show job <JOBID>.
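A hedged illustration of what that implies for the job script: request memory explicitly so the DefMemPerCPU default no longer applies (4000 MB per CPU is only an example figure):

#!/bin/bash
#SBATCH -J Test               # Job name
#SBATCH -p batch
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem-per-cpu=4000    # example value; overrides DefMemPerCPU
#SBATCH -o log
#SBATCH -e err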
How to change the CPU scaling_governor value via udev at boot time
Hi everyone. I'm trying to use udev to set my CPUs' scaling_governor value from powersave to performance. Here is my udev rule file:

[root@node1 ~]$ cat /etc/udev/rules.d/50-scaling-governor.rules
SUBSYSTEM=="cpu", KERNEL=="cpu[0-9]|cpu[0-9][0-9]", ACTION=="add", ATTR{cpufreq/scaling_governor}="performance"

Before testing my 50-scaling-governor.rules, let's see what the value of scaling_governor is first:

[root@node1 ~]$ cat /sys/devices/system/cpu/cpu16/cpufreq/scaling_governor
powersave

Then I use the udevadm command to execute my 50-scaling-governor.rules:

[root@node1 ~]$ udevadm test --action="add" /devices/system/cpu/cpu16
calling: test
version 219
This program is for debugging only, it does not run any program specified by a RUN key. It may show incorrect results, because some values may be different, or not available at a simulation run.
=== trie on-disk ===
tool version:          219
file size:         8873994 bytes
header size             80 bytes
strings            2300642 bytes
nodes              6573272 bytes
Load module index
Created link configuration context.
timestamp of '/etc/udev/rules.d' changed
# omit some irrelevant messages ...
Reading rules file: /etc/udev/rules.d/50-scaling-governor.rules
...
rules contain 49152 bytes tokens (4096 * 12 bytes), 21456 bytes strings
3908 strings (46431 bytes), 2777 de-duplicated (26107 bytes), 1132 trie nodes used
no db file to read /run/udev/data/+cpu:cpu16: No such file or directory
ATTR '/sys/devices/system/cpu/cpu16/cpufreq/scaling_governor' writing 'performance' /etc/udev/rules.d/50-scaling-governor.rules:1
IMPORT builtin 'hwdb' /usr/lib/udev/rules.d/50-udev-default.rules:11
IMPORT builtin 'hwdb' returned non-zero
RUN 'kmod load $env{MODALIAS}' /usr/lib/udev/rules.d/80-drivers.rules:5
RUN '/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; /usr/bin/systemd-run --no-block /usr/lib/udev/kdump-udev-throttler'' /usr/lib/udev/rules.d/98-kexec.rules:14
ACTION=add
DEVPATH=/devices/system/cpu/cpu16
DRIVER=processor
MODALIAS=cpu:type:x86,ven0000fam0006mod004F:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000B,000C,000D,000E,000F,0010,0011,0013,0015,0016,0017,0018,0019,001A,001B,001C,001D,001F,002B,0034,003A,003B,003D,0068,006B,006C,006D,006F,0070,0072,0074,0075,0076,0078,0079,007C,0080,0081,0082,0083,0084,0085,0086,0087,0088,0089,008B,008C,008D,008E,008F,0091,0092,0093,0094,0095,0096,0097,0098,0099,009A,009B,009C,009D,009E,00C0,00C5,00C8,00E1,00E3,00E4,00E6,00E7,00EB,00EC,00F0,00F1,00F3,00F5,00F6,00F9,00FA,00FB,00FD,0100,0101,0102,0103,0104,0111,0120,0121,0123,0124,0125,0127,0128,0129,012A,012B,012C,012D,012F,0132,0133,0134,0139,0140,0160,0161,0162,0163,0165,01C0,01C1,01C2,01C4,01C5,01C6,024A,025A,025B,025C,025F
SUBSYSTEM=cpu
USEC_INITIALIZED=184210753630
run: 'kmod load cpu:type:x86,ven0000fam0006mod004F:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000B,000C,000D,000E,000F,0010,0011,0013,0015,0016,0017,0018,0019,001A,001B,001C,001D,001F,002B,0034,003A,003B,003D,0068,006B,006C,006D,006F,0070,0072,0074,0075,0076,0078,0079,007C,0080,0081,0082,0083,0084,0085,0086,0087,0088,0089,008B,008C,008D,008E,008F,0091,0092,0093,0094,0095,0096,0097,0098,0099,009A,009B,009C,009D,009E,00C0,00C5,00C8,00E1,00E3,00E4,00E6,00E7,00EB,00EC,00F0,00F1,00F3,00F5,00F6,00F9,00FA,00FB,00FD,0100,0101,0102,0103,0104,0111,0120,0121,0123,0124,0125,0127,0128,0129,012A,012B,012C,012D,012F,0132,0133,0134,0139,0140,0160,0161,0162,0163,0165,01C0,01C1,01C2,01C4,01C5,01C6,024A,025A,025B,025C,025F'
run: '/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; /usr/bin/systemd-run --no-block /usr/lib/udev/kdump-udev-throttler''
Unload module index
Unloaded link configuration context.

And now the value of cpu16/scaling_governor has changed, so there is nothing wrong with my udev rule:

[root@node1 ~]$ cat /sys/devices/system/cpu/cpu16/cpufreq/scaling_governor
performance

But after rebooting my server, I find that the scaling_governor value of cpu16 is still powersave. I have no idea why my udev rule works properly via udevadm but fails after a reboot. Some environment information about my machine is as follows:

OS: CentOS Linux release 7.9.2009 (Core)
kernel version: 5.4.154-1.el7.elrepo.x86_64
udev version: 219

Can anyone give me some hint or advice? Thanks in advance.
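As a quick hedged check that the rule itself is fine outside of udevadm test, the add events can be replayed on the running system with udevadm trigger (this does not by itself explain the behaviour at boot):

[root@node1 ~]$ udevadm trigger --subsystem-match=cpu --action=add
[root@node1 ~]$ cat /sys/devices/system/cpu/cpu16/cpufreq/scaling_governor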
Slurm says drained Low RealMemory
I want to install Slurm on localhost. I already installed Slurm on a similar machine, and it works fine, but on the other machine I get this:

transgen@transgen-4:~/galaxy/tools/melanoma_tools$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
transgen-4-partition* up infinite 1 drain transgen-4
transgen@transgen-4:~/galaxy/tools/melanoma_tools$ sinfo -Nel
Fri Jun 25 17:42:56 2021
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
transgen-4 1 transgen-4-partition* drained 48 1:24:2 541008 0 1 (null) Low RealMemory
transgen@transgen-4:~/galaxy/tools/melanoma_tools$ srun -n8 sleep 10
srun: Required node not available (down, drained or reserved)
srun: job 5 queued and waiting for resources
^Csrun: Job allocation 5 has been revoked
srun: Force Terminated job 5

I found the advice to do this:

sudo scontrol update NodeName=transgen-4 State=DOWN Reason=hung_completing
sudo systemctl restart slurmctld slurmd
sudo scontrol update NodeName=transgen-4 State=RESUME

but it had no effect.

slurm.conf:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=localhost
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm.state
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=info
#SlurmctldLogFile=
#SlurmdDebug=info
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=transgen-4 NodeAddr=localhost CPUs=48 Sockets=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=541008 State=UNKNOWN
PartitionName=transgen-4-partition Nodes=transgen-4 Default=YES MaxTime=INFINITE State=UP

cgroup.conf:

###
# Slurm cgroup support configuration file.
###
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=no
ConstrainDevices=yes
ConstrainKmemSpace=no        #avoid known Kernel issues
ConstrainRAMSpace=no
ConstrainSwapSpace=no
TaskAffinity=no              #use task/affinity plugin instead

How can I get Slurm working? Thanks in advance.
This could be because RealMemory=541008 in slurm.conf is too high for your system. Try lowering the value. Let's suppose you do indeed have 541 GB of RAM installed: change it to RealMemory=500000, do a scontrol reconfigure, and then a scontrol update nodename=transgen-4 state=resume. If that works, you can try to raise the value a bit.
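A hedged way to see what Slurm itself detects on the node: slurmd -C prints a ready-to-paste node definition, and the RealMemory value it reports is a safe upper bound for slurm.conf (the output line below is only schematic):

transgen@transgen-4:~$ slurmd -C
NodeName=transgen-4 CPUs=48 ... RealMemory=<detected value> ...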
Is there a way to read the memory counter used by cgroups to kill processes?
I am running a process under a cgroup with an OOM killer. When it performs a kill, dmesg outputs messages such as the following:

[9515117.055227] Call Trace:
[9515117.058018] [<ffffffffbb325154>] dump_stack+0x63/0x8f
[9515117.063506] [<ffffffffbb1b2e24>] dump_header+0x65/0x1d4
[9515117.069113] [<ffffffffbb5c8727>] ? _raw_spin_unlock_irqrestore+0x17/0x20
[9515117.076193] [<ffffffffbb14af9d>] oom_kill_process+0x28d/0x430
[9515117.082366] [<ffffffffbb1ae03b>] ? mem_cgroup_iter+0x1db/0x3c0
[9515117.088578] [<ffffffffbb1b0504>] mem_cgroup_out_of_memory+0x284/0x2d0
[9515117.095395] [<ffffffffbb1b0f95>] mem_cgroup_oom_synchronize+0x305/0x320
[9515117.102383] [<ffffffffbb1abf50>] ? memory_high_write+0xc0/0xc0
[9515117.108591] [<ffffffffbb14b678>] pagefault_out_of_memory+0x38/0xa0
[9515117.115168] [<ffffffffbb0477b7>] mm_fault_error+0x77/0x150
[9515117.121027] [<ffffffffbb047ff4>] __do_page_fault+0x414/0x420
[9515117.127058] [<ffffffffbb048022>] do_page_fault+0x22/0x30
[9515117.132823] [<ffffffffbb5ca8b8>] page_fault+0x28/0x30
[9515117.330756] Memory cgroup out of memory: Kill process 13030 (java) score 1631 or sacrifice child
[9515117.340375] Killed process 13030 (java) total-vm:18259139756kB, anon-rss:2243072kB, file-rss:30004132kB

I would like to be able to tell how much memory the cgroup OOM killer believes the process is using at any given time. Is there a way to query for this quantity?
I found the following in the official documentation for cgroup-v1, which shows how to query current memory usage, as well as how to alter limits:

a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)

3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)

# mount -t tmpfs none /sys/fs/cgroup
# mkdir /sys/fs/cgroup/memory
# mount -t cgroup none /sys/fs/cgroup/memory -o memory

3.2. Make the new group and move bash into it

# mkdir /sys/fs/cgroup/memory/0
# echo $$ > /sys/fs/cgroup/memory/0/tasks

Since now we're in the 0 cgroup, we can alter the memory limit:

# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes

NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
NOTE: We can write "-1" to reset the *.limit_in_bytes (unlimited).
NOTE: We cannot set limits on the root cgroup any more.

# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
4194304

We can check the usage:

# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
1216512
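Tying this back to the question, a short sketch for a specific PID (cgroup v1 paths as above; <PID> and the group name my-group are placeholders):

# which memory cgroup does the process belong to?
grep memory /proc/<PID>/cgroup        # e.g. 9:memory:/my-group
# current and peak usage counters for that group
cat /sys/fs/cgroup/memory/my-group/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/my-group/memory.max_usage_in_bytes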
How to limit CPU and RAM resources for mongodump?
I have a mongod server running. Each day, I execute mongodump in order to have a backup. The problem is that mongodump takes a lot of resources and slows down the server (which, by the way, already runs some other heavy tasks). My goal is to somehow limit mongodump, which is called from a shell script. Thanks.
You should use cgroups. Mount points and details differ between distros and kernels. E.g. Debian 7.0 with the stock kernel doesn't mount cgroupfs by default and has the memory subsystem disabled (folks advise to reboot with cgroup_enabled=memory), while openSUSE 13.1 ships with all of that out of the box (mostly due to systemd).

So first of all, create mount points and mount cgroupfs if your distro hasn't done it yet:

mkdir /sys/fs/cgroup/cpu
mount -t cgroup -o cpuacct,cpu cgroup /sys/fs/cgroup/cpu
mkdir /sys/fs/cgroup/memory
mount -t cgroup -o memory cgroup /sys/fs/cgroup/memory

Create a cgroup:

mkdir /sys/fs/cgroup/cpu/shell
mkdir /sys/fs/cgroup/memory/shell

Set up the cgroup. I decided to alter cpu shares. The default value is 1024, so setting it to 128 will limit the cgroup to about 11% of all CPU resources if there are competitors. If there are still free CPU resources, they will be given to mongodump. You may also use cpuset to limit the number of cores available to it.

echo 128 > /sys/fs/cgroup/cpu/shell/cpu.shares
echo 50331648 > /sys/fs/cgroup/memory/shell/memory.limit_in_bytes

Now add PIDs to the cgroup; this will also affect all their children.

echo 13065 > /sys/fs/cgroup/cpu/shell/tasks
echo 13065 > /sys/fs/cgroup/memory/shell/tasks

I ran a couple of tests. A Python process that tried to allocate a bunch of memory was killed by the OOM killer:

myaut@zenbook:~$ python -c 'l = range(3000000)'
Killed

I also ran four infinite loops plus a fifth one inside the cgroup. As expected, the loop running in the cgroup got only about 45% of CPU time, while the rest of them got 355% (I have 4 cores).

All these changes do not survive a reboot! You may add this code to a script that runs mongodump, or use some permanent solution.
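A minimal sketch of the wrapper-script idea from the last paragraph, reusing the same hypothetical shell cgroup and limits (cgroup v1 assumed mounted as above; the mongodump arguments are placeholders):

#!/bin/bash
# create the cgroups if they do not exist yet
mkdir -p /sys/fs/cgroup/cpu/shell /sys/fs/cgroup/memory/shell

# ~11% CPU share under contention, 48 MiB RAM limit
echo 128      > /sys/fs/cgroup/cpu/shell/cpu.shares
echo 50331648 > /sys/fs/cgroup/memory/shell/memory.limit_in_bytes

# put this shell (and therefore its children) into the cgroup
echo $$ > /sys/fs/cgroup/cpu/shell/tasks
echo $$ > /sys/fs/cgroup/memory/shell/tasks

# mongodump now runs under the limits
mongodump --out /backup/$(date +%F)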