Slurm memory resource management

Despite a thorough read of https://slurm.schedmd.com/slurm.conf.html, there are several things I don't understand about how Slurm manages the memory resource. My slurm.conf contains
DefMemPerCPU=1000
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=orion Gres=gpu:RTX2080Ti:4 MemSpecLimit=2048 RealMemory=95232
When not specifying --mem, jobs are launched with MIN_MEMORY=0 and don't seem to fail when allocating memory. What is the maximum memory a job can use, and how can I display it?
When specifying --mem=0, jobs stay pending, waiting for resources. How can this be?
The value provided in DefMemPerCPU=1000 doesn't seem to have an effect. Is it related to SelectTypeParameters=CR_Core_Memory? If so, what is the equivalent for CPU cores?
Ultimately, what should be the configuration for having a default memory limit?
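For reference, a minimal sketch of what such a configuration might look like, guessed from the settings already quoted above (it assumes the task/cgroup plugin is available on the nodes, and the values are placeholders):
# slurm.conf -- with CR_Core_Memory, memory is a consumable resource, so per-CPU defaults apply
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=1000            # default MB per allocated CPU when neither --mem nor --mem-per-cpu is given
TaskPlugin=task/cgroup       # hand enforcement over to cgroups
# cgroup.conf -- turn the allocation into a hard limit
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes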

Related

Slurm uses more memory than allocated

As you can see in the picture below, I have made an sbatch script that runs a 10-job array (with 1 GB of memory allocated per job). However, when I run it, as the second picture shows, the memory used is 3.7% of total memory, which equates to about 18.9 GB per job... Could anyone explain why this is happening?
(I ran sbatch --nodelist node1 ver_5_FINAL_array_bash in the Linux terminal.)
Thank you!
For reference, the picture below shows that the amount of allocated memory is indeed 10 GB, as specified in the sbatch script.
Possibly pertinent information: our servers run both Slurm jobs and regular jobs submitted directly (without any workload manager such as Slurm).
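The script itself only appears in the screenshots, but a 10-task array asking for 1 GB per task presumably looks roughly like this hypothetical sketch (job name and payload are guesses):
#!/bin/bash
#SBATCH --job-name=ver_5_FINAL_array       # hypothetical name taken from the command above
#SBATCH --array=1-10                       # ten array tasks
#SBATCH --mem=1G                           # 1 GB requested per array task
srun ./my_program "$SLURM_ARRAY_TASK_ID"   # placeholder for the real workload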
By default, the --mem option specifies the minimum memory requirement (see the documentation here: https://slurm.schedmd.com/sbatch.html#OPT_mem).
A hard limit can be set by the Slurm administrator, by using cgroups. It's not something the user can do, I don't think.
A cgroup is created for the job with hard resource limits (CPU, memory, disk, etc), and if the job exceeds any of these limits, the job is terminated.
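If you want to see which limit actually applies to a running job, something like the following usually works from inside the job; the cgroup path is an assumption and depends on the cgroup version and the site's cgroup.conf:
# what Slurm thinks it allocated (look at the mem= value in the TRES line)
scontrol show job "$SLURM_JOB_ID"
# the hard limit the kernel enforces on a cgroup v1 node (path varies by site and cgroup version)
cat /sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/memory.limit_in_bytes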

Apache Yarn - Allocating more memory than the Physical memory or RAM

I was considering changing yarn.nodemanager.resource.memory-mb to a value higher than the RAM available on my machine. Doing a quick search revealed that not many people are doing this.
Many long-lived applications on YARN are bound to have a JVM heap allocation in which some of their memory is frequently used and some of it is rarely used. In this case, it would make perfect sense for such applications to have some of their infrequently used memory swapped to disk, freeing the physical memory for other applications that need it.
Given the above background, can someone either corroborate my reasoning or offer an alternate perspective? Also, can you clarify how the parameter yarn.nodemanager.vmem-pmem-ratio would work in this case?
This is not a good idea. Trying to use more memory than what is available will eventually crash your Node Manager hosts.
There is already a feature called opportunistic containers, which uses spare memory that is not actually being used on the NMs to schedule additional containers on those hosts. Refer to:
YARN-1011 [Umbrella] Schedule containers based on utilization of currently allocated containers
In addition, Pepperdata has a product that does almost the same thing if you can't wait for YARN-1011.
https://www.pepperdata.com/products/capacity-optimizer/
As for yarn.nodemanager.vmem-pmem-ratio, don't enable this as it's not recommended anymore.
YARN-782 vcores-pcores ratio functions differently from vmem-pmem ratio in misleading way
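For context, yarn.nodemanager.vmem-pmem-ratio is only consulted while the NodeManager's virtual-memory check is enabled; the usual way to take it out of the picture is to switch that check off in yarn-site.xml, roughly as in this sketch (defaults may differ per distribution, so verify against yours):
<!-- yarn-site.xml: sketch of neutralizing the vmem check -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>  <!-- skip the virtual-memory check entirely -->
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>    <!-- only consulted while the check above is on; 2.1 is the usual default -->
</property>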

How does slurm determine memory usage of jobs

Recently a user was running an interactive job on our cluster. We use Slurm as the workload manager. He got his allocation via:
salloc --cpus-per-task=48 --time=14-0 --partition=himem
This requests an entire high-memory (1.5 TB) machine on our cluster. He ran his job and, while it was running, he got this error message on his screen (or something like it):
salloc: Error memory limit exceeded
I logged into the node and, according to top, his job was only taking 310 GB in RES. However, within slurmd.log there is a slew of errors (spanning 8 hours!) like this:
[2017-08-03T23:21:55.200] [398692.4294967295] Step 398692.4294967295 exceeded memory limit (1588997632 > 1587511296), being killed
QUESTION: Why does top think he's using 310 GB while Slurm thinks he is using 1.58 TB?
To answer the question, Slurm uses /proc/<pid>/stat to get the memory values. In your case, you were probably not able to witness the offending process because it had already been killed by Slurm, as suggested by @Dmitri Chubarov.
Another possibility is that you have hit a Slurm bug which was corrected just recently in version 17.2.7. From the change log:
-- Increase buffer to handle long /proc/<pid>/stat output so that Slurm can read correct RSS value and take action on jobs using more memory than requested.
The fact that Slurm repeatedly tried to kill the process (you mentioned several occurrences of the entry in the logs) indicates that the machine was running low on RAM and slurmd was facing issues while trying to kill the process. I suggest you activate cgroups for task control; it is much more robust.
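As a side note, a quick way to compare top's numbers with what Slurm itself has recorded is sstat/sacct, assuming accounting is enabled on the cluster:
# memory figures Slurm has collected for the running steps of job 398692
sstat -j 398692 --format=JobID,MaxRSS,MaxVMSize
# once the job has finished, the same data from the accounting database
sacct -j 398692 --format=JobID,ReqMem,MaxRSS,MaxVMSize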

Behaviour of Mesos when using cgroups for Spark

I was wondering what the behaviour of Spark in fine-grained mode on Mesos would be when cgroups are enabled.
One concern is the following: when I use Mesos + Spark without cgroups, I can already see that the actual Spark executor process uses at least 10% more memory than what it promised Mesos it would use. When cgroups are enabled, would they kill the Spark executors?
Second, how is the file cache handled? Spark relies heavily on it. Is the file cache counted against the amount of memory in Mesos? Probably not, but can we influence this? For example, ideally I want Spark to use 8 GB in total, of which 5 GB should be used by the Java process -- assuming Spark plays nicely and does not grow beyond 5 GB -- and 3 GB should be used as file cache (max).
I hope someone has experience with this, because testing it myself would mean a lot of support requests to our cluster sysadmin (cgroups require root access at some point), and I'd hate for that to be in vain without having asked others first.
To answer your first question, it seems you've got something mixed up about how cgroups work. The executor simply would not (and indeed does not, as I can confirm) be able to allocate more memory than the cgroup would allow, so Mesos would not actually act as a process killer or anything*. But some types of programs do break when they are unable to allocate more memory, and whether the program then quits, or is able to keep running fine with less memory and/or lower performance, depends on the program.
For your second question, there don't seem to be any configuration settings to influence the actual cgroup memory amount. There seems to be a 1-to-1 mapping between the executor memory setting and what Spark requests from Mesos. However, I do think there is a hidden factor, because I can see Spark asking for roughly 5.8 GB while I set the executor memory to 5 GB. (I'll update this once I can find the hidden factor of probably 15% in the source code.)
Update: the setting you want is spark.mesos.executor.memoryOverhead. You give it a number of megabytes that is added to the executor memory; the sum is the total memory that will be requested as a Mesos resource, and thus used as the cgroup memory limit.
* Update 2: cgroups do in fact, by default, kill processes that grow beyond the control group's limit. I can confirm that memory.oom_control in /cgroups/memory/x/ is set to '0' (which, counter-intuitively, means the OOM killer is enabled). However, in the case of Spark it is the aforementioned 10-15% overhead that gives enough leeway to avoid hitting the OOM killer.
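To make the update concrete, this is roughly how the overhead would be set at submission time; the master URL, the application, and the 512 MB figure are only illustrative:
# sketch: request 5 GB of executor memory plus an explicit 512 MB overhead,
# so the Mesos resource (and hence the cgroup limit) becomes roughly 5.5 GB
spark-submit \
  --master mesos://<mesos-master>:5050 \
  --conf spark.executor.memory=5g \
  --conf spark.mesos.executor.memoryOverhead=512 \
  my_app.py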

Allocate memory to a process that other processes cannot use in Linux

To limit the memory resources of a particular process we can use ulimit as well as cgroups.
I want to understand what happens in the following case: using a cgroup, I have allocated say ~700 MB of memory to process A on a system that has 1 GB of RAM, and some other process, say B, requires ~400 MB of memory. What will happen in this case?
If process A is allocated ~750 MB of memory but is using only 200 MB of it, can process B use the memory that is allocated to A?
If not, then how do I achieve the scenario where a fixed amount of memory is assigned to a process that other processes cannot use?
EDIT
Is it possible to lock physical memory for a process? Or can only virtual memory be locked so that no other process can access it?
There is one multimedia application that must remain alive and may use the maximum system resources in terms of memory; I need to achieve this.
Thanks.
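For reference, the cgroup allocation described in the question would look roughly like this on a cgroup v1 system (the group name, the size, and $PID_OF_A are placeholders):
# create a memory control group and cap it at ~700 MB (cgroup v1 layout assumed)
mkdir /sys/fs/cgroup/memory/appA
echo $((700 * 1024 * 1024)) > /sys/fs/cgroup/memory/appA/memory.limit_in_bytes
# move process A into the group; note the limit is a ceiling, not a reservation
echo "$PID_OF_A" > /sys/fs/cgroup/memory/appA/cgroup.procs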
Processes use virtual memory (not RAM), so they have a virtual address space. See also setrlimit(2) (called by the ulimit shell builtin). Perhaps RLIMIT_RSS & RLIMIT_MEMLOCK are relevant. Of course, you could limit some other process, e.g. using RLIMIT_AS or RLIMIT_DATA, perhaps through pam_limits(8) & limits.conf(5).
You could lock some virtual memory into RAM using mlock(2); this ensures that the RAM is kept for the calling process.
If you want to improve performance, you might also use madvise(2) & posix_fadvise(2).
See also ionice(1) & renice(1).
BTW, you might consider using hypervisors like Xen, they are able to reserve RAM.
Finally, you might be wrong in believing that your manual tuning can do better than a carefully configured kernel scheduler.
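For the renice/ionice suggestion above, the commands would be something like the following (the process name and the priority values are placeholders):
PID=$(pgrep -o my_multimedia_app)   # placeholder: pid of the multimedia process
renice -n -10 -p "$PID"             # raise its CPU scheduling priority
ionice -c 2 -n 0 -p "$PID"          # give it the highest best-effort I/O priority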
What other processes will run on the same system, and what do you want to happen if the multimedia program needs memory that other processes are using?
You could weight the multimedia process so the OOM killer only picks it as a last resort, after every other non-essential process. You might see a dropped frame if the kernel takes some time killing something to free up memory.
According to this article, you adjust the OOM-killer weight of a process by writing to /proc/<pid>/oom_adj, e.g. with:
echo -17 > /proc/2592/oom_adj
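Note that on current kernels oom_adj is deprecated in favour of oom_score_adj, which takes values from -1000 (exempt from the OOM killer) to 1000; the equivalent for the same pid would be:
# modern interface: -1000 makes the process unkillable by the OOM killer
echo -1000 > /proc/2592/oom_score_adj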
