Slurm uses more memory than allocated

As you can see in the picture below, I made an sbatch script that runs a job array of 10 tasks, each with 1GB of memory allocated. However, when I run it, as the second picture shows, the memory used is 3.7% of total memory, which works out to about 18.9GB per job... Could anyone explain why this is happening?
(I ran sbatch --nodelist node1 ver_5_FINAL_array_bash in the Linux terminal.)
Thank you!
For reference, the picture below shows that the amount of allocated memory is indeed 10GB, as specified in the sbatch script.
Possibly pertinent information: our servers accept both Slurm jobs and jobs run directly, outside any workload manager.

By default, the --mem option gives the minimum memory requirement (see the documentation here: https://slurm.schedmd.com/sbatch.html#OPT_mem)

A hard limit can be set by the Slurm administrator using cgroups. It's not something the user can do, I don't think.
A cgroup is created for the job with hard resource limits (CPU, memory, disk, etc), and if the job exceeds any of these limits, the job is terminated.
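For reference, a minimal job-array script of the shape described in the question might look like this (the job, program, and file names are placeholders, not the asker's actual files):

```shell
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --array=0-9               # 10 array tasks
#SBATCH --mem=1G                  # minimum memory per node for each array task
#SBATCH --output=slurm-%A_%a.out  # one log per task

./my_program "${SLURM_ARRAY_TASK_ID}"
```

Afterwards, sacct -j <jobid> --format=JobID,ReqMem,MaxRSS shows the requested versus actually used memory per array task.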

Related

Is there a way to increase memory allocation for running jobs through "srun, sbatch, or salloc"?

I use srun, salloc, or sbatch with Slurm when I want to execute my job.
srun -p PALL --cpus-per-task=2 --mem=8G --job-name=my_job_1 --pty --x11 ./my_job
I don't know how much memory I should allocate for the first job.
Sometimes the allocated memory turns out to be insufficient while the job is running, and I want to avoid an out-of-memory exit.
Is there a way to increase the memory allocation of jobs already running through Slurm?
In the example above, if you are getting a memory error, try increasing your --mem allocation to more than 8G.
If you are using sbatch (sbatch your_script.sh) to run your script, add the following line to it:
#SBATCH --mem-per-cpu=<value bigger than you've requested before>
If you are using srun (srun python3 your_script.py), add the parameter like this:
srun --mem-per-cpu=<value bigger than you've requested before> python3 your_script.py
No, generally speaking, you cannot increase the amount of resources allocated to a running job (except some cases where nodes can be added from another job).
There is no easy way to know in advance how much memory a specific experiment will require; it depends mostly on the data that are consumed and produced.
Some tips:
in Python, you can use sys.getsizeof(object) to get the size of an object in memory (e.g. a pandas data frame)
you can also use a memory profiler such as https://pypi.org/project/memory-profiler/ to get an overview of the overall memory consumption of the script
you can run the script in an interactive Slurm session (or on your laptop, or any other machine where you can test it) and watch the RSS column in the top command while it runs
you can use the sacct command to get the actual memory usage of the job afterwards, and use that information to better estimate future, similar-looking jobs
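To illustrate the first tip, a small Python sketch; note that sys.getsizeof is shallow, so for containers you have to sum the contents yourself to get a rough total:

```python
import sys

# sys.getsizeof reports only the shallow size of an object: it does not
# follow references, so a container's contents must be summed manually.
nums = list(range(1000))
shallow = sys.getsizeof(nums)                         # the list object itself
deep = shallow + sum(sys.getsizeof(n) for n in nums)  # rough total, list + ints

print(f"shallow: {shallow} bytes, deep: ~{deep} bytes")
```

Even this deep sum is only a lower bound (shared objects are counted once per reference, and interpreter overhead is ignored), which is why the profiler and sacct tips above are the more reliable estimates.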

Slurm memory resource management

Despite a thorough read of https://slurm.schedmd.com/slurm.conf.html, there are several things I don't understand about how Slurm manages the memory resource. My slurm.conf contains
DefMemPerCPU=1000
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=orion Gres=gpu:RTX2080Ti:4 MemSpecLimit=2048 RealMemory=95232
When --mem is not specified, jobs are launched with MIN_MEMORY=0 and don't seem to fail when allocating memory. What is the maximum memory a job can use, and how can I display it?
When specifying --mem=0, jobs are left pending, waiting for resources. How can this be?
The value provided in DefMemPerCPU=1000 doesn't seem to have an effect. Is it related to SelectTypeParameters=CR_Core_Memory? If so, what is the equivalent for CPU cores?
Ultimately, what should be the configuration for having a default memory limit?
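Not an authoritative answer, but a sketch of a configuration commonly used to get a default plus a hard ceiling (values are illustrative; the per-CPU memory settings only take effect when memory is a consumable resource, i.e. with a CR_*_Memory selector parameter):

```shell
# slurm.conf fragment (illustrative values)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory  # memory must be consumable for per-CPU limits to apply
DefMemPerCPU=1000                    # MB per allocated CPU when the job gives no --mem
MaxMemPerCPU=4000                    # hard per-CPU ceiling on what a job may request
```

With memory consumable, a job that omits --mem gets DefMemPerCPU times its allocated CPUs instead of MIN_MEMORY=0.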

Optimisation of RAM usage for cluster jobs impacts data transfer in HPC environment. Any way to turn off caching to RAM in cp or rsync commands?

My problem is related to performing computer simulations on a large-scale HPC cluster. I have a program that does MC simulation. After a part of the simulation has passed, the results are saved to files, and the simulation continues, writing to the same part of memory as from the start. Thus, the program doesn't need much RAM to run (we are talking really low memory usage, ~25MB). However, the total data generated over time is 10 to 100 times greater than that. The jobs are handled in the normal fashion: job data is copied to the scratch partition, the program runs on the node, and results are returned from scratch to the job directory.
Now, everything would be dandy if it weren't for the fact that when submitting a job to SLURM, I have to declare the amount of RAM assigned to the job. If I declare something around the real program usage, say 50MB, I have a problem getting the results back. After a week-long simulation, the data are copied from scratch to the job directory, the copy operation is cached in RAM, and the job's RAM limit is exceeded. Ultimately, the job is killed by SLURM, and I have to manually look for the data on scratch and copy them back. Obviously, this is not feasible for a few thousand jobs.
The command used for copying is:
cp -Hpv ${CONTENTS} ${TMPDIR}
and if the copied content is larger than the specified number of megabytes, the job is killed with a message:
/var/lib/slurm-llnl/slurmd/job24667725/slurm_script: line 78: 46020 Killed cp -Hpv ${CONTENTS} ${TMPDIR}
slurmstepd-e1370: error: Detected 1 oom-kill event(s) in StepId=(jobid).batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
I've contacted the cluster admins about this, and they just replied that I should reserve more RAM for the job. However, this results in an absurd amount of RAM being locked (basically wasted) for a week during the simulation, and used only at the moment when the results are copied back. Keeping in mind that I can (and often do) submit up to 5000 jobs at a time, I'm looking for some kind of hack for the cp or rsync commands to force them not to cache to RAM, or not to cache at all, while copying.
Will be glad for any comments.
Best regards.
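There is no cp flag to bypass the page cache, but as a sketch of one possible workaround (assuming a Linux node, and assuming the site's cgroup setup charges page cache to the job, as the OOM kills above suggest): a copy loop can advise the kernel to evict the pages it has touched via posix_fadvise(POSIX_FADV_DONTNEED). The nocache wrapper utility, where installed, does something similar around cp/rsync.

```python
import os

def copy_dropping_cache(src: str, dst: str, chunk: int = 1 << 20) -> None:
    """Copy src to dst, then tell the kernel it may evict the pages we
    touched (POSIX_FADV_DONTNEED), so the copy does not fill the job's
    memory accounting with page cache."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(chunk)
            if not buf:
                break
            fout.write(buf)
        fout.flush()
        os.fsync(fout.fileno())  # dirty pages must reach disk before DONTNEED can drop them
        os.posix_fadvise(fin.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        os.posix_fadvise(fout.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
```

The advice is only a hint and the fsync adds latency, so this is a sketch to test against your site's accounting, not a guaranteed fix.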

How does slurm determine memory usage of jobs

Recently a user was running an interactive job on our cluster. We use Slurm as the workload manager. He got his allocation via:
salloc --cpus-per-task=48 --time=14-0 --partition=himem
This requests an entire high memory (1.5TB) machine on our cluster. He ran his job. While it was running, on his screen he got the error message (or something like this):
salloc: Error memory limit exceeded
I logged into the node and, using top, his job was only taking 310GB in RES. However within the slurmd.log there is a slew of errors (spanning 8 hours!) like this:
[2017-08-03T23:21:55.200] [398692.4294967295] Step 398692.4294967295 exceeded memory limit (1588997632 > 1587511296), being killed
QUESTION: Why does top think that he's using 310GB while slurm thinks he is using 1.58TB?
To answer the question, Slurm uses /proc/<pid>/stat to get the memory values. In your case, you were probably unable to witness the incriminated process because it had already been killed by Slurm, as suggested by @Dmitri Chubarov.
Another possibility is that you have met a Slurm bug which was corrected just recently in version 17.2.7. From the change log:
-- Increase buffer to handle long /proc/<pid>/stat output so that Slurm can read correct RSS value and take action on jobs using more memory than requested.
The fact that Slurm repeatedly tried to kill the process (you mentioned several occurrences of the entry in the logs) indicates that the machine was running low on RAM and slurmd was having trouble killing the process. I suggest you activate cgroups for task control; it is much more robust.
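To illustrate where those numbers come from, a small sketch that reads the rss field from /proc/<pid>/stat, the same file Slurm's accounting reads (Linux only; field layout per the proc(5) man page, so this is an approximation of what Slurm does, not its actual code):

```python
import os

def rss_bytes(pid=None):
    """Resident set size of a process in bytes, from /proc/<pid>/stat.
    Field 24 of stat is rss, counted in pages."""
    pid = pid or os.getpid()
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The comm field (2nd) may itself contain spaces and parentheses,
    # so split after the last ')'; the remaining fields start at field 3.
    fields = stat.rsplit(")", 1)[1].split()
    rss_pages = int(fields[21])  # field 24 overall = index 21 after the split
    return rss_pages * os.sysconf("SC_PAGE_SIZE")

print(rss_bytes())
```

Comparing this value with the RES column in top for the same pid shows they agree; discrepancies like the 310GB-vs-1.58TB above point at either a different process being measured or a parsing bug such as the one fixed in 17.2.7.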

Behaviour of Mesos when using cgroups for Spark

I was wondering what the behaviour of Spark in fine-grained mode on Mesos would be, when cgroups are enabled.
One concern is the following: when I use Mesos + Spark without cgroups, the actual Spark executor process already uses at least 10% more memory than what it promised to Mesos it would use. When cgroups are enabled, would they kill the Spark executors?
Second, how is the file cache handled? Spark relies heavily on the file cache. Is the file cache counted against the memory allocation in Mesos? Probably not, but could we influence this? For example, ideally I want Spark to use 8GB in total, of which 5GB for the Java process -- assuming Spark plays nicely and does not grow beyond 5GB -- and 3GB (max) as file cache.
I hope someone has experience with this, because in order to test these things myself I would have to go through a lot of support requests from our cluster sysadmin, as cgroups rely on root credentials at one point - and I'd hate it to be in vain without having asked others.
To answer your first question, it seems you've got something mixed up about how cgroups work. The executor simply would not be able to allocate more memory than the cgroup allows (and indeed it cannot, as I can confirm), so Mesos would not actually act as a process killer or anything*. However, some programs do break when they cannot allocate more memory; whether a program then quits, or runs fine with less memory and/or performance, depends on the program.
For your second question, there don't seem to be any configuration settings to influence the actual cgroup memory amount. There seems to be a 1-to-1 mapping between the executor memory setting and what Spark requests from Mesos. However, I do think there is a hidden factor, because I can see Spark asks for roughly 5.8GB while I actually set the executor memory to 5GB. (I'll update the ticket once I find this hidden factor of probably 15% in the source code.)
Update: the setting you want is spark.mesos.executor.memoryOverhead. You can give a number in megabytes, which is added to the executor memory; the sum is the total memory requested as the Mesos resource, and thus the cgroup memory limit.
*=Update 2: actually, cgroups by default do kill processes that grow beyond the control group's limit. I can confirm that memory.oom_control in /cgroups/memory/x/ is set to '0' (which, counter-intuitively, means enabled). In the case of Spark, however, it is the aforementioned 10-15% overhead that gives enough leeway not to hit the OOM killer.
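For reference, a sketch of setting that overhead explicitly at submit time (the master URL, memory values, and jar name are placeholders):

```shell
# 5 GB executor heap + 1 GB explicit overhead = 6 GB requested from Mesos,
# which then becomes the cgroup memory limit for the executor.
spark-submit \
  --master mesos://zk://host:2181/mesos \
  --conf spark.executor.memory=5g \
  --conf spark.mesos.executor.memoryOverhead=1024 \
  my_app.jar
```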
