Currently, my cluster looks as follows:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 alloc ip-a-b-c-d
debug* up infinite 7 idle <list of ips>
Debug using squeue
The node in the allocated state seems to be running bash:
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
16 debug bash ubuntu R 13:47 1 ip-a-b-c-d
Flush out the queued bash job
scancel -n bash
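If several jobs happened to share the name bash, you could instead cancel by the job ID shown in the squeue output above (16 in this case):
scancel 16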
Verify that no job is queued:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
Now confirm the allocated node is back to the idle state:
ubuntu:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 8 idle ip-a-b-c-d, ... <list of other idle ips>
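If you only care about that single node, a narrower query (a sketch using sinfo's -n node filter and the %N/%t format codes) also works:
sinfo -n ip-a-b-c-d -o "%N %t"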
I want to install Slurm on localhost. I already installed Slurm on a similar machine, and it works fine, but on this machine I get this:
transgen#transgen-4:~/galaxy/tools/melanoma_tools$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
transgen-4-partition* up infinite 1 drain transgen-4
transgen#transgen-4:~/galaxy/tools/melanoma_tools$ sinfo -Nel
Fri Jun 25 17:42:56 2021
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
transgen-4 1 transgen-4-partition* drained 48 1:24:2 541008 0 1 (null) Low RealMemory
transgen#transgen-4:~/galaxy/tools/melanoma_tools$ srun -n8 sleep 10
srun: Required node not available (down, drained or reserved)
srun: job 5 queued and waiting for resources
^Csrun: Job allocation 5 has been revoked
srun: Force Terminated job 5
I found advice to do the following:
sudo scontrol update NodeName=transgen-4 State=DOWN Reason=hung_completing
sudo systemctl restart slurmctld slurmd
sudo scontrol update NodeName=transgen-4 State=RESUME
but it had no effect.
slurm.conf:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=localhost
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm.state
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=info
#SlurmctldLogFile=
#SlurmdDebug=info
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=transgen-4 NodeAddr=localhost CPUs=48 Sockets=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=541008 State=UNKNOWN
PartitionName=transgen-4-partition Nodes=transgen-4 Default=YES MaxTime=INFINITE State=UP
cgroup.conf:
###
# Slurm cgroup support configuration file.
###
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=no
ConstrainDevices=yes
ConstrainKmemSpace=no #avoid known Kernel issues
ConstrainRAMSpace=no
ConstrainSwapSpace=no
TaskAffinity=no #use task/affinity plugin instead
How can I get Slurm working?
Thanks in advance.
This could be because RealMemory=541008 in slurm.conf is too high for your system. Try lowering the value. Let's suppose you do indeed have 541 GB of RAM installed: change it to RealMemory=500000, do a scontrol reconfigure, and then a scontrol update nodename=transgen-4 state=resume.
If that works, you could try to raise the value a bit.
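For example, assuming the node really has roughly 541 GB of physical RAM, so that 500000 MB sits safely below it, the steps could look like this:
# in slurm.conf, lower the configured value:
# NodeName=transgen-4 ... RealMemory=500000 ...
# optionally compare against what slurmd itself detects on the node:
slurmd -C
# push the new configuration and bring the node back into service:
sudo scontrol reconfigure
sudo scontrol update NodeName=transgen-4 State=RESUME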
I am having trouble understanding how the threads and resources allocated to a snakejob translate to the number of cores allocated per snakejob on my Slurm partition. I have set the --cores flag to 46 in the .sh that runs my snakefile, yet 5 snakejobs are running concurrently, with 16 cores provided to each of them. Does a rule-specific thread number supersede the --cores flag for Snakemake? I thought --cores was the maximum number of cores that all my jobs together had to work with.
Also, are cores allocated based on memory, and does that scale with the number of threads specified? For example, my jobs were allocated 10 GB of memory apiece, but only one thread each, and each job was given two cores according to my Slurm outputs. When I specified 8 threads with 10 GB of memory, I was provided 16 cores instead. Does that have to do with the amount of memory I gave to my job, or is it just that an additional core is provided for each thread for memory purposes? Any help would be appreciated.
Here is one of snake job outputs:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 index_genome
1
[Tue Feb 2 10:53:59 2021]
rule index_genome:
input: /mypath/genome/genomex.fna
output: /mypath/genome/genomex.fna.ann
jobid: 0
wildcards: bwa_extension=.ann
threads: 8
resources: mem_mb=10000
Here is my bash command:
module load snakemake/5.6.0
snakemake -s snake_make_x --cluster-config cluster.yaml --default-resources --cores 48 --jobs 47 \
--cluster "sbatch -n {threads} -M {cluster.cluster} -A {cluster.account} -p {cluster.partition}" \
--latency-wait 10
When you use Slurm together with Snakemake, the --cores flag unfortunately does not mean cores anymore; it means jobs. So when you set --cores 48 you are actually telling Snakemake to use at most 48 parallel jobs.
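For illustration, here is a sketch of the same submission (not a drop-in fix, just showing the knobs): it caps concurrency at 5 jobs and passes each rule's thread count to Slurm via sbatch's --cpus-per-task instead of -n, so a rule with threads: 8 gets 8 CPUs for a single task:
snakemake -s snake_make_x --cluster-config cluster.yaml --jobs 5 \
    --cluster "sbatch --cpus-per-task={threads} -M {cluster.cluster} -A {cluster.account} -p {cluster.partition}" \
    --latency-wait 10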
Related question:
Behaviour of "--cores" when using Snakemake with the slurm profile
I have a Slurm job named expensive that is killed due to an out-of-memory error:
I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
slurmstepd: error: Detected 2 oom-kill event(s) in step expensive.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
The job is defined as:
#!/bin/bash
#SBATCH --job-name="expensive"
#SBATCH --mem=64G
#SBATCH --gres=gpu:rtx2080ti:6
#SBATCH --time=05:00:00
#SBATCH --partition=cpu-part
python expensive.py
It is clearly stated that the job requests 64 GB of RAM. When I run sacct -j {jobid} -o JobID,JobName,MaxRSS,AveRSS,CPUTime to profile it, the output is:
JobID JobName MaxRSS AveRSS CPUTime
------------ ---------- ---------- ---------- ----------
{jobid} expensive 00:04:01
{jobid}.bat+ batch 21124408K 21124408K 00:04:01
It basically just ran for 4 minutes, consumed about 20 GB of RAM, and then crashed due to OOM. What am I missing here?
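As a side note, the same sacct call can also report the requested memory and the job's final state (ReqMem, MaxVMSize and State are standard sacct fields, though availability can vary by Slurm version), which makes the request-versus-usage comparison explicit:
sacct -j {jobid} -o JobID,JobName,ReqMem,MaxRSS,MaxVMSize,State,CPUTime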
As we know, squeue returns the status of the jobs in the queue.
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
130 debug run.sh user PD 0:00 1 (Resources)
131 debug run.sh user PD 0:00 1 (Resources)
128 debug 52546914 user R 7:28 1 node1
129 debug run.sh user R 0:02 1 node1
For example, my total core count is 2.
[Q] Is there any way to return only the number of unused cores? In this example, the unused core count should be 0.
Should I write a parser for this, i.e. retrieve the core number next to each R, add them up, and subtract the sum from the total core count, as follows:
squeue | grep -P ' R ' | awk '{print $7}' | paste -sd+ - | bc
To know the number of cores (CPUs) available in your cluster, you can use the sinfo command:
$ sinfo -o%C
CPUS(A/I/O/T)
0/1920/0/1920
You can retrieve the numbers into Bash variables easily with
IFS=/ read A I O T <<<$(sinfo -h -o%C)
After running the above command, A will contain the number of allocated cores, I will be the number of idle cores, O will hold the number of 'other' cores, i.e. drained, down, etc. and T will be the total number of cores in the system.
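For instance, to print just the idle count after the read above:
echo "Idle cores: $I of $T total"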
Note that in your question you talk about cores but actually compute the number of nodes. If what you want is the number of nodes, you can use:
$ sinfo -o%A
NODES(A/I)
0/80
See the sinfo man page for more details.
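If you prefer to stick with parsing squeue as in your command, a sketch that sums allocated CPUs rather than nodes (squeue's %C format code prints the CPU count per job) would be:
squeue -h -t R -o %C | paste -sd+ - | bc
Subtracting that sum from the T obtained above then gives the number of unused cores.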
I have understood that docker run -m 256m --memory-swap 256m will limit a container so that it can use at most 256 MB of memory and no swap. If it allocates more, then a process in the container (not "the container") will be killed. For example:
$ sudo docker run -it --rm -m 256m --memory-swap 256m \
stress --vm 1 --vm-bytes 2000M --vm-hang 0
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [1] (415) <-- worker 7 got signal 9
stress: WARN: [1] (417) now reaping child worker processes
stress: FAIL: [1] (421) kill error: No such process
stress: FAIL: [1] (451) failed run completed in 1s
Apparently one of the workers allocates more memory than is allowed and receives a SIGKILL. Note that the parent process stays alive.
Now if the effect of -m is to invoke the OOM killer if a process allocates too much memory, then what happens when specifying -m and --oom-kill-disable? Trying it like above has the following result:
$ sudo docker run -it --rm -m 256m --memory-swap 256m --oom-kill-disable \
stress --vm 1 --vm-bytes 2000M --vm-hang 0
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
(waits here)
In a different shell:
$ docker stats
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
f5e4c30d75c9 0.00% 256 MiB / 256 MiB 100.00% 0 B / 508 B 0 B / 0 B 2
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19391 root 20 0 2055904 262352 340 D 0.0 0.1 0:00.05 stress
I see that docker stats shows a memory consumption of 256 MB, and top shows a RES of 256 MB and a VIRT of 2000 MB. But what does that actually mean? What will happen to a process inside the container that tries to use more memory than allowed? In what sense is it constrained by -m?
As I understand the docs, --oom-kill-disable is not constrained by -m but actually requires it:
By default, kernel kills processes in a container if an out-of-memory (OOM) error occurs. To change this behaviour, use the --oom-kill-disable option. Only disable the OOM killer on containers where you have also set the -m/--memory option. If the -m flag is not set, this can result in the host running out of memory and require killing the host's system processes to free memory.
A developer stated back in 2015 that
The host can run out of memory with or without the -m flag set. But it's also irrelevant as --oom-kill-disable does nothing unless -m is passed.
In regard to your update, i.e. what happens when the OOM killer is disabled and yet the memory limit is hit (interesting OOM article), I'd say that new calls to malloc and the like will just fail, as described here, but it also depends on the swap configuration and the host's available memory. If your -m limit is above the actual available memory, the host will start killing processes, one of which might be the Docker daemon (which they try to avoid by changing its OOM priority).
The kernel docs (cgroup/memory.txt) say
If OOM-killer is disabled, tasks under cgroup will hang/sleep in memory cgroup's OOM-waitqueue when they request accountable memory.
For the actual implementation of cgroups (which Docker uses as well), you'd have to check the source code.
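Short of reading the source, the cgroup v1 memory controller also exposes that state as plain files; assuming cgroup v1 and Docker's default cgroup layout, the following shows both the oom_kill_disable flag and whether the container is currently parked in the OOM waitqueue (under_oom):
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.oom_control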
The job of the 'OOM killer' in Linux is to sacrifice one or more processes in order to free up memory for the system when all else fails. The OOM killer is only enabled if the host has memory overcommit enabled.
Setting --oom-kill-disable sets the cgroup parameter that disables the OOM killer for this specific container when the condition specified by -m is met. Without the -m flag, the OOM killer is irrelevant.
The -m flag doesn't mean 'stop the process when it uses more than X MB of RAM'; it only ensures that the Docker container doesn't consume all host memory, which could force the kernel to kill host processes. With the -m flag, the container is not allowed to use more than the given amount of user or system memory.
When a container hits OOM it won't be killed, but it can hang and stay in a defunct state; processes inside the container can't respond until you manually intervene and restart or kill the container. Hope this helps clear up your questions.
For more details on how the kernel acts on OOM, check the Linux OOM management and Docker memory limitations pages.
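As a concrete illustration of the manual intervention mentioned above (the container name mycontainer is hypothetical), you can confirm that no OOM kill was recorded and then stop the hung container yourself:
docker inspect --format '{{.State.OOMKilled}}' mycontainer
docker kill mycontainer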