How do I see the memory of the GPUs I have available in a slurm partition/queue? - slurm

I want to see how much memory the GPUs have before I submit my jobs. I managed to get Slurm to tell me the model:
(automl-meta-learning) [miranda9@golubh3 ~]$ sinfo -o %G -p eng-research
GRES
gpu:P100:4
(null)
gpu:V100:2
(automl-meta-learning) [miranda9@golubh3 ~]$ sinfo -o %G -p secondary
GRES
(null)
gpu:V100:2
gpu:V100:1
gpu:K80:4
gpu:TeslaK40M:2
but I want to see the amount of memory they have. I am aware I could log in to a node with srun and check the resources with nvidia-smi, BUT the queue is so full it can take up to 16h to give me resources. How do I just get Slurm to tell me how much memory the GPUs in these queues have?

Unless the system administrators have encoded the GPU memory as a node "feature", Slurm currently has no knowledge of the GPU memory. This could change in the future with the ongoing work on integrating the NVIDIA Management Library (NVML) into Slurm, but until then, you can either ask the system administrators, look in the documentation of your cluster, or check the specification sheets of the cards: V100 cards have either 16GB or 32GB of memory, K80s have 24GB, and K40Ms have 12GB.
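If the administrators did encode it as a feature, or you just want to see which GPU types sit on which nodes, you can query node features and GRES without allocating anything. A minimal sketch, reusing the eng-research partition from the question and a placeholder node name:
sinfo -p eng-research -o "%N %G %f"
scontrol show node <nodename> | grep -Ei 'gres|features'
The %f column prints the node features; if the administrators tagged nodes with something like gpu16gb or gpu32gb (names chosen here only as an example), it will show up there.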

Related

Slurm uses more memory than allocated

As you can see in the picture below, I have written an sbatch script that runs a 10-job array (with 1GB of memory allocated per job). However, when I run it as the second picture shows, the memory used is 3.7% of total memory, which equates to about 18.9GB per job... Could anyone explain why this is happening?
(I ran sbatch --nodelist node1 ver_5_FINAL_array_bash in the Linux terminal.)
Thank you!
For reference, the picture below shows that the amount of allocated memory is indeed 10GB, as specified in the sbatch script
Possibly pertinent information: our servers run both Slurm jobs and regular jobs submitted directly, without any job scheduler.
By default, the --mem option gives the minimum memory requirement (see the documentation here: https://slurm.schedmd.com/sbatch.html#OPT_mem)
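The actual script is only shown as a picture, so purely as an illustration, a minimal job-array script in the spirit of the question (a 10-task array requesting 1GB per task; the program name is hypothetical):
#!/bin/bash
#SBATCH --array=1-10
#SBATCH --mem=1G                 # minimum memory per node for each array task
#SBATCH --output=array_%A_%a.out
srun ./my_program                # hypothetical program
With --mem being a minimum requirement, each task is guaranteed at least 1GB but may use more unless the administrator enforces limits, as described below.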
A hard limit can be set by the Slurm administrator using cgroups; it's not something the user can do, I don't think.
A cgroup is created for the job with hard resource limits (CPU, memory, disk, etc), and if the job exceeds any of these limits, the job is terminated.
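For reference, this is roughly what that administrator-side setup looks like; a sketch only, and the exact options depend on the Slurm version:
# slurm.conf
TaskPlugin=task/cgroup
# cgroup.conf
ConstrainRAMSpace=yes            # turn the requested memory into a hard limit
ConstrainSwapSpace=yes
AllowedRAMSpace=100              # percent of the allocation the job may actually use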

What is docker --kernel-memory

Good day. I know that Docker containers use the host's kernel (which is why containers are considered lightweight VMs); here is the source. However, after reading the Runtime Options part of the Docker documentation I came across an option called --kernel-memory. The doc says
The maximum amount of kernel memory the container can use.
I don't understand what it does. My guess is that every container allocates some memory in the host's kernel space. If so, what is the reason, and isn't it a vulnerability for a user process to be able to allocate memory in kernel space?
The whole CPU/memory limitation mechanism is based on cgroups.
You can find all settings applied by docker run (either via arguments or via defaults) under /sys/fs/cgroup/memory/docker/<container ID> for memory and /sys/fs/cgroup/cpu/docker/<container ID> for CPU.
So for --kernel-memory:
Reading: cat memory.kmem.limit_in_bytes
Writing: echo 2167483648 | sudo tee memory.kmem.limit_in_bytes
There are also the accounting files memory.kmem.usage_in_bytes and memory.kmem.max_usage_in_bytes, which show (rather self-explanatorily) the current usage and the highest usage seen so far.
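To read those counters for a concrete container, the path from above can be resolved via the full container ID; a sketch assuming the cgroupfs layout shown here, with mycontainer as a placeholder name:
CID=$(docker inspect --format '{{.Id}}' mycontainer)
cat /sys/fs/cgroup/memory/docker/$CID/memory.kmem.usage_in_bytes
cat /sys/fs/cgroup/memory/docker/$CID/memory.kmem.max_usage_in_bytes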
CGroup docs about Kernel Memory
For the functionality I recommend reading the kernel docs for cgroups v1 instead of the Docker docs:
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
With the Kernel memory extension, the Memory Controller is able to
limit the amount of kernel memory used by the system. Kernel memory is
fundamentally different than user memory, since it can't be swapped
out, which makes it possible to DoS the system by consuming too much
of this precious resource.
[..]
The memory used is
accumulated into memory.kmem.usage_in_bytes, or in a separate counter
when it makes sense. (currently only for tcp). The main "kmem" counter
is fed into the main counter, so kmem charges will also be visible
from the user counter.
Currently no soft limit is implemented for kernel memory. It is future
work to trigger slab reclaim when those limits are reached.
and
2.7.2 Common use cases
Because the "kmem" counter is fed to the main user counter, kernel
memory can never be limited completely independently of user memory.
Say "U" is the user limit, and "K" the kernel limit. There are three
possible ways limits can be set:
U != 0, K = unlimited:
This is the standard memcg limitation mechanism already present before kmem
accounting. Kernel memory is completely ignored.
U != 0, K < U:
Kernel memory is a subset of the user memory. This setup is useful in
deployments where the total amount of memory per-cgroup is overcommited.
Overcommiting kernel memory limits is definitely not recommended, since the
box can still run out of non-reclaimable memory.
In this case, the admin could set up K so that the sum of all groups is
never greater than the total memory, and freely set U at the cost of his
QoS.
WARNING: In the current implementation, memory reclaim will NOT be
triggered for a cgroup when it hits K while staying below U, which makes
this setup impractical.
U != 0, K >= U:
Since kmem charges will also be fed to the user counter and reclaim will be
triggered for the cgroup for both kinds of memory. This setup gives the
admin a unified view of memory, and it is also useful for people who just
want to track kernel memory usage.
Clumsy Attempt of a Conclusion
Given a running container started with --memory="2g" --memory-swap="2g" --oom-kill-disable, running
cat memory.kmem.max_usage_in_bytes
10747904
shows about 10 MB of kernel memory in normal operation. It would make sense to me to limit that, say to 20 MB of kernel memory, so that the container is killed or throttled to protect the host. But given that, according to the docs, there is no way to reclaim this memory, and that the OOM killer then starts killing processes on the host even with plenty of free memory (see https://github.com/docker/for-linux/issues/1001), this option is rather impractical for me to use.
The quoted option of setting it >= memory.limit_in_bytes is not really helpful in that scenario either.
Deprecated
--kernel-memory is deprecated since v20.10, because someone (= the Linux kernel developers) realized all of that as well.
What can we do then?
ULimit
The Docker API exposes HostConfig|Ulimit, which corresponds to the limits configured in /etc/security/limits.conf. For docker run it is --ulimit <type>=<soft>:<hard>. Use cat /etc/security/limits.conf or man setrlimit to see the categories; you can try to protect your system from having its kernel memory filled by, e.g., unbounded process creation with --ulimit nproc=500:500. But be careful: nproc works per user, not per container, so count them together.
To prevent DoS (intentional or unintentional) I would suggest limiting at least nofile and nproc. Maybe someone can elaborate further.
sysctl:
docker run --sysctl can change kernel variables for message queues and shared memory, as well as networking, e.g. docker run --sysctl net.ipv4.tcp_max_orphans= for orphaned TCP connections, which defaults on my system to 131072; at a kernel memory usage of roughly 64 kB each, that is 8 GB gone on malfunction or DoS. Maybe someone can elaborate further.
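Putting the two suggestions together, a hedged sketch of what such a hardened docker run could look like; the limit values, the sysctl value, and the image are purely illustrative:
docker run --rm \
  --memory=2g --memory-swap=2g \
  --ulimit nofile=1024:2048 \
  --ulimit nproc=500:500 \
  --sysctl net.ipv4.tcp_max_orphans=4096 \
  ubuntu:20.04 sleep infinity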

kubectl top nodes reporting more memory utilisation than Linux system commands

I've been looking all over StackOverflow for this, but can't find a satisfactory answer.
When running kubectl top nodes <node name> I get a memory utilisation of approx. 69% (Kubernetes showing roughly 21Gi of 32Gi being used). But if I go into the system itself and run the free command, as well as the top command, I see a total of 6GB of used memory (i.e. 20% - this is the information under the used column in the output of free) - way less than 69% of the total system memory of 32GB.
Even accounting for the differences in Gi and GB, there's still more than 40% difference unaccounted for. I know that Kubernetes uses the stats reported by /sys/fs/cgroup/memory/memory.usage_in_bytes to report on memory utilisation, but why would this be different than the utilisation reported by other processes on the system (especially sometimes higher)? Which one should I take as the source of truth?
Found the answer to my question here: https://serverfault.com/questions/902009/the-memory-usage-reported-in-cgroup-differs-from-the-free-command. In summary, Kubernetes uses the cgroup memory utilisation, which is reported in /sys/fs/cgroup/memory/memory.usage_in_bytes. The cgroup memory utilisation counts not only the memory currently used by applications in RAM, but also the "cached" memory (i.e. memory no longer required by apps and free to be reclaimed by the OS, but not reclaimed yet). The Linux system commands count "cached" memory as "free", but Kubernetes does not (not sure why).
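To see the discrepancy on the node itself, the two views can be compared directly; a minimal sketch, assuming the cgroup v1 layout mentioned above:
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
grep -E '^total_(cache|inactive_file)' /sys/fs/cgroup/memory/memory.stat
free -b
The first value includes the page cache reported by the second command, whereas free counts that cache under "buff/cache" rather than "used".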

SLURM: see how many cores per node, and how many cores per job

I have searched google and read the documentation.
My local cluster is using SLURM. I want to check the following things:
How many cores does each node have?
How many cores has each job in the queue reserved?
Any advice would be much appreciated!
In order to see the details of all the nodes you can use:
scontrol show node
For a specific node:
scontrol show node "nodename"
And for the cores of each job you can use the format specifier %C, for instance:
squeue -o"%.7i %.9P %.8j %.8u %.2t %.10M %.6D %C"
More info about format.
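To check a single job rather than the whole queue, the same %C specifier (or scontrol) can be used; a small sketch in which the job ID 12345 is only a placeholder:
squeue -j 12345 -o "%.10i %.8u %C"
scontrol show job 12345 | grep -E 'NumNodes|NumCPUs'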
You can get most information about the nodes in the cluster with the sinfo command, for instance with:
sinfo --Node --long
you will get condensed information about, among other things, the partition, node state, number of sockets, cores, threads, memory, disk and features. It is slightly easier to read than the output of scontrol show nodes.
As for the number of CPUs for each job, see @Sergio Iserte's answer.
See the manpage here.
To build on @damienfrancois's answer:
I found that sinfo was the most useful, but the command arguments should be different. If you just want to know the cores per node, memory per node, availability, and how much of each is currently free, just do the following.
For quick node status:
sinfo -o "%n %e %m %a %c %C"
Output looks like:
HOSTNAMES FREE_MEM MEMORY AVAIL CPUS CPUS(A/I/O/T)
m-4-06 301585 950000 up 96 88/8/0/96
m-4-07 654944 950000 up 72 71/1/0/72
m-4-09 628696 950000 up 72 49/23/0/72
c-0-02 36741 115000 up 24 24/0/0/24
c-0-03 47512 115000 up 24 24/0/0/24
m-2-01 699025 950000 up 72 72/0/0/72
HOSTNAMES tells you the nodes of the cluster; if you want to submit to a specific node, that is the name you use.
FREE_MEM tells you how much memory that node has free in MB.
MEMORY tells you how much memory that node has by default, when it is unused, in MB.
AVAIL tells you if that node is up or not (if you are having issues).
CPUS tells you the total number of cpus on that node, assuming it is unused.
CPUS(A/I/O/T) tells you the number of allocated/idle/other/total CPUs. Allocated CPUs are cores that are unavailable because they are currently being used by jobs; idle CPUs are immediately available for use; other means they could be down or in some other mid-run state; and total just reiterates the total number of CPUs.
More details on the output of this command and how to format it can be found here.

SLURM: After allocating all GPUs no more cpu job can be submitted

We have just started using Slurm for managing our GPUs (currently just 2). We use Ubuntu 14.04 and slurm-llnl. I have configured gres.conf and srun works.
The problem is that if I run two jobs with --gres=gpu:1 then the two GPUs are successfully allocated and the jobs start running; now I expect to be able to run more jobs (in addition to the 2 GPU jobs) without --gres=gpu:1 (i.e. jobs that only use CPU and RAM), but it is not possible.
The error message says that it could not allocate required resources (even though there are 24 CPU cores).
This is my gres.conf:
Name=gpu Type=titanx File=/dev/nvidia0
Name=gpu Type=titanx File=/dev/nvidia1
NodeName=ubuntu Name=gpu Type=titanx File=/dev/nvidia[0-1]
I appreciate any help. Thank you.
Make sure that SelectType in your configuration selects consumable resources, with SelectTypeParameters set to CR_CPU or CR_Core, and that the Shared option of the partition is not set to EXCLUSIVE. Otherwise Slurm allocates full nodes to jobs.
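For reference, in slurm.conf that corresponds to something like the following sketch; the partition name is made up, and the node name, core count, and GRES values are taken from the question:
SelectType=select/cons_res
SelectTypeParameters=CR_Core          # or CR_CPU; CR_Core_Memory if memory should be tracked too
NodeName=ubuntu CPUs=24 Gres=gpu:titanx:2
PartitionName=batch Nodes=ubuntu Default=YES Shared=YES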
