How does Slurm determine memory usage of jobs

Recently a user was running an interactive job on our cluster. We use Slurm as the workload manager. He got his allocation via:
salloc --cpus-per-task=48 --time=14-0 --partition=himem
This requests an entire high-memory (1.5TB) machine on our cluster. He ran his job, and while it was running he got an error message on his screen (or something like it):
salloc: Error memory limit exceeded
I logged into the node and, according to top, his job was only taking 310GB in RES. However, within slurmd.log there was a slew of errors (spanning 8 hours!) like this:
[2017-08-03T23:21:55.200] [398692.4294967295] Step 398692.4294967295 exceeded memory limit (1588997632 > 1587511296), being killed
QUESTION: Why does top think that he's using 310GB while slurm thinks he is using 1.58TB?

To answer the question, Slurm uses /proc/<pid>/stat to get the memory values. In your case, you probably could not catch the offending process because it had already been killed by Slurm, as suggested by @Dmitri Chubarov.
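As a rough illustration of where that number comes from (this is not Slurm's actual code), you can read the RSS field from /proc/<pid>/stat yourself; per proc(5) it is field 24, counted in pages. The PID below is a placeholder:
# Print the resident set size of a process in kB (field 24 of /proc/<pid>/stat is RSS in pages).
# Note: this simple awk field split breaks if the command name contains spaces or parentheses.
pid=12345   # placeholder PID
rss_pages=$(awk '{print $24}' /proc/${pid}/stat)
echo $(( rss_pages * $(getconf PAGESIZE) / 1024 )) kB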
Another possibility is that you have hit a Slurm bug that was corrected recently, in version 17.02.7. From the change log:
-- Increase buffer to handle long /proc/<pid>/stat output so that Slurm can read correct RSS value and take action on jobs using more memory than requested.
The fact that Slurm repeatedly tried to kill the process (you mentioned several occurrences of the entry in the logs) indicates that the machine was running low on RAM and slurmd was having trouble killing the process. I suggest you activate cgroups for task control; it is much more robust.
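For reference, enabling cgroup-based task control typically involves configuration along these lines (a minimal sketch; the option names are real Slurm parameters, but the exact setup depends on your version and site policy):
# slurm.conf (illustrative):
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/cgroup
# cgroup.conf (illustrative):
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
The daemons need to be restarted after the change; memory limits are then enforced by the kernel's cgroup mechanism rather than by slurmd polling /proc.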

Related

Is there a way to increase memory allocation for running jobs through "srun, sbatch, or salloc"?

I use srun, salloc, or sbatch with Slurm when I want to execute my job.
srun -p PALL --cpus-per-task=2 --mem=8G --job-name=my_job_1 --pty --x11 ./my_job
I don't know how much memory I should allocate for the first job.
There are times when the memory allocation turns out to be insufficient while the job is running, and I want to prevent it from exiting with an out-of-memory error.
Is there a way to increase the memory allocation for jobs running through Slurm?
In the example above, if you are getting a memory error, try increasing your --mem allocation to more than 8G.
If you are using sbatch (sbatch your_script.sh) to run your script, add the following line to it:
#SBATCH --mem-per-cpu=<value bigger than you've requested before>
If you are using srun (srun python3 your_script.py), add the parameter like this:
srun --mem-per-cpu=<value bigger than you've requested before> python3 your_script.py
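Putting it together, a minimal sbatch script might look like this (the job name, values, and command are placeholders, not taken from the question):
#!/bin/bash
#SBATCH --job-name=my_job_1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=8G    # raise this if the job is killed for exceeding its memory

python3 your_script.py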
No, generally speaking, you cannot increase the amount of resources allocated to a running job (except in some cases where nodes can be added from another job).
There is no easy way to know in advance how much memory a specific experiment will require. It depends mostly on the data that are consumed and produced.
Some tips:
in Python, you can use sys.getsizeof(object) to get the size of an object in memory (e.g. a Pandas data frame)
you can also use a memory profiler such as https://pypi.org/project/memory-profiler/ to get an overview of the overall memory consumption of the script
you can run the script in an interactive Slurm session (or on your laptop, or another machine where you can test it) and watch the RSS column in top while it runs
you can use the sacct command to get the actual memory usage of the job afterwards, and possibly use that information to better estimate future, similar-looking jobs (see the example after this list)
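As a concrete sketch of that last tip (the job ID is a placeholder, and which fields are populated depends on your site's accounting setup):
# Compare requested memory with the peak resident memory actually used:
sacct -j 123456 --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,State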

Optimisation of RAM usage for cluster jobs impacts data transfer in HPC environment. Any way to turn off caching to RAM in cp or rsync commands?

My problem is related to performing computer simulations on a large-scale HPC cluster. I have a program that does MC simulations. After part of the simulation has passed, the results are saved to files and the simulation continues, writing to the same region of memory as from the start. Thus, the program doesn't need much RAM to run (we are talking about really low memory usage, like ~25MB). However, the total data generated over time are 10 or 100 times greater than that. The jobs are handled in the usual fashion: the job data are copied to the scratch partition, the program runs on the node, and the results are returned from scratch to the job directory.
Now, everything would be dandy if it wasn't for the fact that when submitting a job to SLURM, I have to declare the amount of RAM assigned to the job. If I declare something around the real program usage, say 50MB, I have a problem getting back the results. After a week-long simulation, the data are copied from scratch to the job directory, and the copy operation is cached in RAM, violating the job's RAM setting. Ultimately, the job is killed by SLURM, and I have to manually look for the data on scratch and copy them back. Obviously, this is not feasible for a few thousand jobs.
The command used for copying is:
cp -Hpv ${CONTENTS} ${TMPDIR}
and if the copied content is larger than the specified number of megabytes, the job is killed with a message like:
/var/lib/slurm-llnl/slurmd/job24667725/slurm_script: line 78: 46020 Killed cp -Hpv ${CONTENTS} ${TMPDIR}
slurmstepd-e1370: error: Detected 1 oom-kill event(s) in StepId=(jobid).batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
I've contacted the cluster admins in that regard, and they just replied that I should reserve more RAM for the job. However, this results in an absurd amount of RAM being locked (basically wasted) for a week during the simulation and used only at the moment when the results are copied back. Keeping in mind that I can (and often do) submit up to 5000 jobs at a time, I'm looking for some kind of hack to the cp or rsync commands to force them not to cache to RAM, or not to cache at all, while copying.
Will be glad for any comments.
Best regards.
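One possible direction, offered here only as a hedged sketch rather than a tested solution: GNU dd can copy with O_DIRECT (iflag=direct / oflag=direct), which bypasses the page cache, so the copied data should not be charged to the job's memory cgroup. It loses cp's attribute handling and requires a filesystem that supports direct I/O:
# Assumes ${CONTENTS} expands to a list of regular files, as in the cp command above.
for f in ${CONTENTS}; do
    dd if="$f" of="${TMPDIR}/$(basename "$f")" bs=1M iflag=direct oflag=direct status=none
done
Alternatively, if the nocache wrapper (https://github.com/Feh/nocache) happens to be installed on the cluster, nocache cp -Hpv ${CONTENTS} ${TMPDIR} keeps cp's behaviour while advising the kernel not to retain the copied pages in cache.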

Slurm uses more memory than allocated

As you can see in the picture below, I have made an sbatch script so that a job array of 10 (each with 1GB of memory allocated) is run. However, when I run it, as the second picture shows, the memory used is 3.7% of the total memory, which equates to about 18.9GB per job... Could anyone explain why this is happening?
(I ran sbatch --nodelist node1 ver_5_FINAL_array_bash in the Linux terminal.)
Thank you!
For reference, the picture below shows that the amount of allocated memory is indeed 10GB, as specified in the sbatch script
Possibly pertinent information: our servers run both Slurm jobs and jobs submitted directly, without any workload manager.
By default, the --mem option gives the minimum memory requirement (see the documentation here: https://slurm.schedmd.com/sbatch.html#OPT_mem)
A hard limit can be set by the Slurm administrator by using cgroups. It's not something the user can do, I don't think.
A cgroup is created for the job with hard resource limits (CPU, memory, disk, etc), and if the job exceeds any of these limits, the job is terminated.
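If cgroup enforcement is active, you can usually inspect the limit applied to your own job from inside it. A hedged sketch, assuming the cgroup v1 layout Slurm commonly uses (the path differs under cgroup v2 and can vary by site):
# Run inside the job; prints the hard memory limit in bytes for this job's cgroup.
cat /sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/memory.limit_in_bytes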

Netlogo HPC CPU Percentage Use Increase

I submit jobs running headless NetLogo to an HPC server with the following script:
#!/bin/bash
#$ -N r20p
#$ -q all.q
#$ -pe mpi 24
/home/abhishekb/netlogo/netlogo-5.1.0/netlogo-headless.sh \
--model /home/abhishekb/models/corrected-rk4-20presults.nlogo \
--experiment test \
--table /home/abhishekb/csvresults/corrected-rk4-20presults.csv
Below is a snapshot of the cluster queue, obtained using:
qstat -g c
I wish to know whether I can increase the CQLOAD for my simulations, and what it signifies. I couldn't find a clear explanation online.
CPU USAGE CHECK:
qhost -u abhishekb
When I run BehaviorSpace on my PC through the GUI, assigning high priority to the task makes it use nearly 99% of the CPU, which makes it run faster. I wish to accomplish the same here.
A typical HPC environment is designed to run only one MPI process (or OpenMP thread) per CPU core, which therefore has access to 100% of the CPU time, and this cannot be increased further. In contrast, on a classical desktop or server machine, a number of processes compete for CPU time, and it is indeed possible to increase the performance of one of them by setting the appropriate priority with nice.
It appears that CQLOAD is the mean load average for that computing queue. If you are not using all the CPU cores in it, it is not a useful indicator. Besides, the load average per core for your runs simply reflects the efficiency of the code on this HPC cluster. For instance, a value of 0.7 per core would mean that the code spends 70% of its time doing calculations, while the remaining 30% is probably spent waiting to communicate with the other computing nodes (which is also necessary).
Bottom line: the only way you can increase the CPU percentage use on an HPC cluster is by optimising your code. Normally, though, people are more concerned about parallel scaling (i.e. how the time to solution decreases with the number of CPU cores) than with CPU percentage use.
1. CPU percentage load
I agree with @rth's answer as regards trying to use Linux job priority / renice to increase the CPU percentage: it's almost certain not to work, and (as you've found) you're unlikely to be able to do it anyway, as you won't have superuser privileges on the nodes (it's pretty unlikely you can even log into the worker nodes - probably only the head node).
The CPU usage of your model as it runs is mainly a function of your code structure - if it runs at 100% CPU locally, it will probably run like that on the node while it is running.
Here are some answers to the more specific parts of your question:
2. CQLOAD
You ask
CQLOAD (what does it mean too?)
The docs for this are hard to find, but you link to the spec of your cluster, which tells us that its scheduling engine is Sun's Grid Engine. Its man pages are available (you can access them locally too, in particular by typing man qstat).
If you search through them for qstat -g c, you will see the output described. In particular, the second column (CQLOAD) is described as follows:
OUTPUT FORMATS
...
an average of the normalized load average of all queue hosts. In order to reflect each hosts different significance the number of configured slots is used as a weighting factor when determining cluster queue load. Please note that only hosts with a np_load_value are considered for this value. When queue selection is applied only data about selected queues is considered in this formula. If the load value is not available at any of the hosts '-NA-' is printed instead of the value from the complex attribute definition.
This means that CQLOAD gives an indication of how utilized the processors in the queue are. Your output screenshot above shows 0.84, so the average load on the (in-use) processors in all.q is 84%. This doesn't seem too low.
3. Number of nodes reserved
In a related question, you state that colleagues are complaining that your processes are not using enough CPU. I'm not sure what that's based on, but I wonder whether the real problem here is that you're reserving a lot of nodes (even if just for a short time) for a job that they can see could work with fewer.
You might want to experiment with requesting fewer slots (unless your runs become very slow) - that is achieved by altering the line #$ -pe mpi 24 - maybe take the number 24 down. You can work out (roughly) how many slots you need by timing how long one model run takes on your computer and then using
N = ((time to run 1 job) * number of runs in experiment) / (time you want the run to take)
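For example, with purely hypothetical numbers: if one model run takes 30 minutes on your machine, the experiment contains 96 runs, and you want the whole experiment finished in about 12 hours (720 minutes), then N = (30 × 96) / 720 = 4 slots would be enough.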
So you want to make your program run faster on Linux by giving it a higher priority than all other processes?
In that case you have to modify something called the program's niceness. This is normally done by invoking the command nice when you first start the program, or the command renice while the program is already running. A process can have a niceness from -20 to 19 (inclusive), where lower values give the process a higher priority. For security reasons, you can only decrease a process's niceness if you are the superuser (root).
So if you want to make a process run with higher priority then from within bash do
[abhishekb@hpc ~]$ start_process &
[abhishekb@hpc ~]$ jobs -x sudo renice -n -20 -p %+
Or just use the last command on its own and replace %+ with the process ID of the process whose priority you want to increase.

Linux memory usage history

I had a problem in which my server began failing some of its normal processes and checks because the server's memory was completely used up.
I looked in the logging history and found that what it killed were some Java processes.
I used the "top" command to see what processes were taking up the most memory right now(after the issue was fixed) and it was a Java process. So in essence, I can tell what processes are taking up the most memory right now.
What I want to know is if there is a way to see what processes were taking up the most memory at the time when the failures started happening? Perhaps Linux keeps track or a log of the memory usage at particular times? I really have no idea but it would be great if I could see that kind of detail.
@Andy has answered your question. However, I'd like to add that, for future reference, you should use a monitoring tool; it will show you what happened around a crash, since you obviously cannot watch all your servers all the time. Hope it helps.
Are you saying the kernel OOM killer went off? What does the log in dmesg say? Note that you can constrain a JVM to use a fixed heap size, which means it will fail affirmatively when full instead of letting the kernel kill something else. But the general answer to your question is no: there's no way to reliably run anything at the time of an OOM failure, because the system is out of memory! At best, you can use a separate process to poll the process table and log process sizes to catch memory leak conditions, etc...
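A quick sketch of that check (the commands are standard, but message wording and log retention vary by distribution and kernel version):
# Look for OOM-killer activity in the kernel log:
dmesg -T | grep -i -E 'out of memory|oom-kill'
# On systemd-based systems, the journal keeps kernel messages (across reboots if persistent logging is enabled):
journalctl -k | grep -i oom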
There is no history of memory usage in Linux by default, but you can collect one with a simple command-line tool like sar.
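For example, assuming the sysstat package is installed and its periodic data collection is enabled (it is not on every distribution by default):
# Memory utilization samples collected today:
sar -r
# Or read an older data file; the path and file name here are illustrative and vary by distribution:
sar -r -f /var/log/sa/sa05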
Regarding your problem with memory:
If it was the OOM killer that made a mess on the machine, then you have one great option to ensure it won't happen again (of course, after reducing the JVM heap size).
By default, the Linux kernel overcommits memory, i.e. it allows processes to allocate more memory than is physically available. In some cases this can lead to the OOM killer killing the most memory-consuming process when there is no memory left for kernel tasks.
This behaviour is controlled by the vm.overcommit_memory sysctl parameter.
So, you can try setting vm.overcommit_memory = 2 in sysctl.conf and then running sysctl -p.
This will forbid overcommitting and make it much less likely that the OOM killer does nasty things. You can also think about adding a little bit of swap space (if you don't have it already) and setting vm.swappiness to some really low value (5, for example; the default is 60), so that in the normal workflow your application won't go into swap, but if you are really short on memory it will start using it temporarily and you will be able to see that with free.
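A minimal sketch of those settings (the values are the ones suggested above; editing /etc/sysctl.conf requires root, and with vm.overcommit_memory = 2 the effective commit limit also depends on vm.overcommit_ratio, so test on a non-critical machine first):
# Append to /etc/sysctl.conf:
vm.overcommit_memory = 2
vm.swappiness = 5
# Reload the settings:
sysctl -p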
WARNING: this can lead to processes receiving a "Cannot allocate memory" error if your server is overloaded with memory demands. In that case:
try to restrict the memory usage of applications
move some of them to another machine
