Optimisation of RAM usage for cluster jobs impacts data transfer in HPC environment. Any way to turn off caching to RAM in cp or rsync commands? - linux

My problem is related to performing computer simulations on the large scale HPC cluster. I have a program that does MC simulation. After some part of the simulation is passed, the results are saved into files, and the simulation continues writing to the same part of memory as from the start. Thus, the program doesn't need all that much RAM to run (and we are talking about really low memory usage, like ~25MB). However, the total data generated over time are 10 or 100 times greater than that. The jobs are handled in the normal fashion: job data is copied to scratch partition, program runs on the node, results are returned from scratch to jobdir.
Now, everything would be dandy if it wasn't for the fact that when submitting a job to SLURM, I have to declare the amount o RAM assigned for the job. If I declare something around the real program usage, say 50MB, I have a problem with getting back the results. After a week-long simulation, data are copied from scratch to job directory, and the copy operation is cached to RAM, violating the job RAM setting. Ultimately, the job is killed by SLURM. I have to manually look for this data on scratch and copy them back. Obviously, this is not feasible for few thousands jobs.
The command used for copying is:
cp -Hpv ${CONTENTS} ${TMPDIR}
and if the copied content is larger than specified MB's, the job is killed with a message:
/var/lib/slurm-llnl/slurmd/job24667725/slurm_script: line 78: 46020 Killed cp -Hpv ${CONTENTS} ${TMPDIR}
slurmstepd-e1370: error: Detected 1 oom-kill event(s) in StepId=(jobid).batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
I've contacted cluster admins in that regard, and they just replied to reserve more RAM for the job. However, this results in an absurd amount of ram locked (basically wasted) for a week during the simulation, and used only for the moment when the results are copied back. Keeping in mind that I can (and often do) submit up to 5000 jobs at a time, I'm looking for some kind of hack to cp or rsync commands to force them not to cache to RAM or not to cache at all, while copying.
Will be glad for any comments.
Best regards.

Related

Parallel file writing performance on NFS under Linux

I have an issue regarding the performance of a EFS filesystem from Amazon, but I suspect the issue is with the Linux configuration.
My setup is a m4.large machine (2 cores, 8GB RAM) in AWS and the EFS drive is mounted as NFS4.1 mount type with standard setup.
I have a script that is creating unique small 1 kB files (see bellow). I'm running the script in parallel using GNU parallel utility that helps me run under a different number of parallel jobs.
The tests I've done shows that when I run 1 job only, the speed is 60kB/sec, 2 job in parallel, overall speed is almost 120kB/sec, but after that when run 3,4,10 jobs in parallel, the overall speed remains still around 120 kB/sec.
I've increased the default values of file-descriptors and open files to huge values but had no impact. The CPU is barely utilized and also memory is not very used. The network should be able to sustain up to 45MB/sec according to specs so I'm very far away from that limit too. Also the EFS limit of max throughput is around 105 MB/sec.
What else can I setup to allow more file to get written in parallel except increasing the number of cores on the machine? (guess file writes transforms to tcp connections for NFS mounts)
The script used:
#!/bin/bash
value="$(<source1k.txt)"
host="$(hostname)"
client=$1
mkdir output4/"$host"
for i in {0..5000}
do
echo "$value" > "output4/$host/File_$(printf "%s_%03d" "$client" "$i").txt"
done
and it is called like bellow to run on 4 parallel jobs
parallel -j 4 sh writefiles.sh {} ::: 1 2 3 4
EDIT: I tested iozone utility using 4 kB as file size (it doesn't accept 1) and the throughput test give a result saying that Children see 240MB, while Parent see 500kB (I couldn't find what this means actually, but those 500kB are close to what I measured).
After multiple tests and discussions with Amazon support, it seems my bottleneck was the fact that I was writing all files to the same folder (and probably there is a lock for naming purposes). If I changed the test to write file to different folders, the speed increased a lot.

How does slurm determine memory usage of jobs

Recently a user was running an interactive job on our cluster. We use slurm as the workload manager. He got his allocation via :
salloc --cpus-per-task=48 --time=14-0 --partition=himem
This requests an entire high memory (1.5TB) machine on our cluster. He ran his job. While it was running, on his screen he got the error message (or something like this):
salloc: Error memory limit exceeded
I logged into the node and, using top, his job was only taking 310GB in RES. However within the slurmd.log there is a slew of errors (spanning 8 hours!) like this:
[2017-08-03T23:21:55.200] [398692.4294967295] Step 398692.4294967295 exceeded memory limit (1588997632 > 1587511296), being killed
QUESTION: Why does top think that he's using 310GB while slurm thinks he is using 1.58TB?
To answer the question, Slurm uses /proc/<pid>/stat to get the memory values. In your case, you were not able to witness the incriminated process probably as it was killed by Slurm, as suggested by #Dmitri Chubarov.
Another possibility is that you have met a Slurm bug which was corrected just recently in version 17.2.7. From the change log:
-- Increase buffer to handle long /proc//stat output so that Slurm can read correct RSS value and take action on jobs using more
memory than requested.
The fact that Slurm repeatedly tried to kill the process (you mentioned several occurrences of the entry in the logs) indicates that the machine was running low on RAM and the slurmd was facing issues while trying to kill the process. I suggest you activate cgroups for task control ; it is much more robust.

Netlogo HPC CPU Percentage Use Increase

I submit jobs using headless NetLogo to a HPC server by the following code:
#!/bin/bash
#$ -N r20p
#$ -q all.q
#$ -pe mpi 24
/home/abhishekb/netlogo/netlogo-5.1.0/netlogo-headless.sh \
--model /home/abhishekb/models/corrected-rk4-20presults.nlogo \
--experiment test \
--table /home/abhishekb/csvresults/corrected-rk4-20presults.csv
Below is the snapshot of a cluster queue using:
qstat -g c
I wish to know can I increase the CQLOAD for my simulations and what does it signify too. I couldn't find an elucidate explanation online.
CPU USAGE CHECK:
qhost -u abhishekb
When I run the behaviour space on my PC through gui assigning high priority to the task makes it use nearly 99% of the CPU which makes it run faster. It uses a greater percentage of CPU processor. I wish to accomplish the same here.
EDIT:
EDIT 2;
A typical HPC environment, is designed to run only one MPI process (or OpenMP thread) per CPU core, which has therefore access to 100% of CPU time, and this cannot be increased further. In contrast, on a classical desktop/server machine, a number of processes compete for CPU time, and it is indeed possible to increase performance of one of them by setting the appropriate priorities with nice.
It appears that CQLOAD, is the mean load average for that computing queue. If you are not using all the CPU cores in it, it is not a useful indicator. Besides, even the load average per core for your runs just translates the efficiency of the code on this HPC cluster. For instance, a value of 0.7 per core, would mean that the code spends 70% of time doing calculations, while the remaining 30% are probably spent waiting to communicate with the other computing nodes (which is also necessary).
Bottom line, the only way you can increase the CPU percentage use on an HPC cluster is by optimising your code. Normally though, people are more concerned about the parallel scaling (i.e. how the time to solution decreases with the number of CPU cores) than with the CPU percentage use.
1. CPU percentage load
I agree with #rth answer regards trying to use linux job priority / renice to increase CPU percentage - it's
almost certain not to work
and, (as you've found)
you're unlikely to be able to do it as you won't have super user priveliges on the nodes (It's pretty unlikely you can even log into the worker nodes - probably only the head node)
The CPU usage of your model as it runs is mainly a function of your code structure - if it runs at 100% CPU locally it will probably run like that on the node during the time its running.
Here are some answers to the more specific parts of your question:
2. CQLOAD
You ask
CQLOAD (what does it mean too?)
The docs for this are hard to find, but you link to the spec of your cluster, which tells us that the scheduling engine for it is Sun's *Grid Engine". Man pages are here (you can access them locally too - in particular typing man qstat)
If you search through for qstat -g c, you will see the outputs described. In particular, the second column (CQLOAD) is described as:
OUTPUT FORMATS
...
an average of the normalized load average of all queue
hosts. In order to reflect each hosts different signifi-
cance the number of configured slots is used as a weight-
ing factor when determining cluster queue load. Please
note that only hosts with a np_load_value are considered
for this value. When queue selection is applied only data
about selected queues is considered in this formula. If
the load value is not available at any of the hosts '-
NA-' is printed instead of the value from the complex
attribute definition.
This means that CQLOAD gives an indication of how utilized the processors are in the queue. Your output screenshot above shows 0.84, so this indicator average load on (in-use) processors in all.q is 84%. This doesn't seem too low.
3. Number of nodes reserved
In a related question, you state colleagues are complaining that your processes are not using enough CPU. I'm not sure what that's based on, but I wonder the real problem here is that you're reserving a lot of nodes (even if just for a short time) for a job that they can see could work with fewer.
You might want to experiment with using fewer nodes (unless your results are very slow) - that is achieved by altering the line #$ -pe mpi 24 - maybe take the number 24 down. You can work out how many nodes you need (roughly) by timing how long 1 model run takes on your computer and then use
N = ((time to run 1 job) * number of runs in experiment) / (time you want the run to take)
So you want to make to make your program run faster on linux by giving it a higher priority than all other processes?
In that case you have to modify something called the program's niceness. This is normally done by invoking the command nice when you first start the program or the command renice while the program is already running. A process can have a niceness from -20 to 19 (inclusive) where lower values give the process a higher priority. Due to security reasons, you can only decrease a processes' niceness if you are the super user (root).
So if you want to make a process run with higher priority then from within bash do
[abhishekb#hpc ~]$ start_process &
[abhishekb#hpc ~]$ jobs -x sudo renice -n -20 -p %+
Or just use the last command and replace the %+ with the process id of the process you want to increase the priority for.

linux CPU cache slowdown

We're getting overnight lockups on our embedded (Arm) linux product but are having trouble pinning it down. It usually takes 12-16 hours from power on for the problem to manifest itself. I've installed sysstat so I can run sar logging, and I've got a bunch of data, but I'm having trouble interpreting the results.
The targets only have 512Mb RAM (we have other models which have 1Gb, but they see this issue much less often), and have no disk swap files to avoid wearing the eMMCs.
Some kind of paging / virtual memory event is initiating the problem. In the sar logs, pgpin/s, pgnscand/s and pgsteal/s, and majflt/s all increase steadily before snowballing to crazy levels. This puts the CPU up correspondingly high levels (30-60 on dual core Arm chips). At the same time, the frmpg/s values go very negative, whilst campg/s go highly positive. The upshot is that the system is trying to allocate a large amount of cache pages all at once. I don't understand why this would be.
The target then essentially locks up until it's rebooted or someone kills the main GUI process or it crashes and is restarted (We have a monolithic GUI application that runs all the time and generally does all the serious work on the product). The network shuts down, telnet blocks forever, as do /proc filesystem queries and things that rely on it like top. The memory allocation profile of the main application in this test is dominated by reading data in from file and caching it as textures in video memory (shared with main RAM) in an LRU using OpenGL ES 2.0. Most of the time it'll be accessing a single file (they are about 50Mb in size), but I guess it could be triggered by having to suddenly use a new file and trying to cache all 50Mb of it all in one go. I haven't done the test (putting more logging in) to correlate this event with these system effects yet.
The odd thing is that the actual free and cached RAM levels don't show an obvious lack of memory (I have seen oom-killer swoop in the kill the main application with >100Mb free and 40Mb cache RAM). The main application's memory usage seems reasonably well-behaved with a VmRSS value that seems pretty stable. Valgrind hasn't found any progressive leaks that would happen during operation.
The behaviour seems like that of a system frantically swapping out to disk and making everything run dog slow as a result, but I don't know if this is a known effect in a free<->cache RAM exchange system.
My problem is superficially similar to question: linux high kernel cpu usage on memory initialization but that issue seemed driven by disk swap file management. However, dirty page flushing does seem plausible for my issue.
I haven't tried playing with the various vm files under /proc/sys/vm yet. vfs_cache_pressure and possibly swappiness would seem good candidates for some tuning, but I'd like some insight into good values to try here. vfs_cache_pressure seems ill-defined as to what the difference between setting it to 200 as opposed to 10000 would be quantitatively.
The other interesting fact is that it is a progressive problem. It might take 12 hours for the effect to happen the first time. If the main app is killed and restarted, it seems to happen every 3 hours after that fact. A full cache purge might push this back out, though.
Here's a link to the log data with two files, sar1.log, which is the complete output of sar -A, and overview.log, a extract of free / cache mem, CPU load, MainGuiApp memory stats, and the -B and -R sar outputs for the interesting period between midnight and 3:40am:
https://drive.google.com/folderview?id=0B615EGF3fosPZ2kwUDlURk1XNFE&usp=sharing
So, to sum up, what's my best plan here? Tune vm to tend to recycle pages more often to make it less bursty? Are my assumptions about what's happening even valid given the log data? Is there a cleverer way of dealing with this memory usage model?
Thanks for your help.
Update 5th June 2013:
I've tried the brute force approach and put a script on which echoes 3 to drop_caches every hour. This seems to be maintaining the steady state of the system right now, and the sar -B stats stay on the flat portion, with very few major faults and 0.0 pgscand/s. However, I don't understand why keeping the cache RAM very low mitigates a problem where the kernel is trying to add the universe to cache RAM.

Memory of type "root-set" reallocation Error - Erlang

I have been running a crypto-intensive application that was generating pseudo-random strings, with special structure and mathematical requirements. It has generated around 1.7 million voucher numbers per node in over the last 8 days. The generation process was CPU intensive, with very low memory requirements.
Mnesia running on OTP-14B02 was the storage database and the generation was done within each virtual machine. I had 3 nodes in the cluster with all mnesia tables disc_only_copies type. Suddenly, as activity on the Solaris boxes increased (other users logged on remotely and were starting webservers, ftp sessions, and other tasks), my bash shell started reporting afork: not enough space error.
My erlang Vms also, went down with this error below:
Crash dump was written to: erl_crash.dump
temp_alloc: Cannot reallocate 8388608 bytes of memory (of type "root_set").
Usually, we get memory allocation errors and not memory re-location errors and normally memory of type "heap" is the problem. This time, the memory type reported is type "root-set".
Qn 1. What is this "root-set" memory?
Qn 2. Has it got to do with CPU intensive activities ? (why am asking this is that when i start the task, the Machine reponse to say mouse or Keyboard interrupts is too slow meaning either CPU is too busy or its some other problem i cannot explain for now)
Qn 3. Can such an error be avoided? and how ?
The fork: not enough space message suggests this is a problem with the operating system setup, but:
Q1 - The Root Set
The Root Set is what the garbage collector uses as a starting point when it searches for data that is live in the heap. It usually starts off from the registers of the VM and off from the stack, if the stack has references to heap data that still needs to be live. There may be other roots in Erlang I am not aware of, but these are the basic stuff you start off from.
That it is a reallocation error of exactly 8 Megabyte space could mean one of two things. Either you don't have 8 Megabyte free in the heap, or that the heap is fragmented beyond recognition, so while there are 8 megabytes in it, there are no contiguous such space.
Q2 - CPU activity impact
The problem has nothing to do with the CPU per se. You are running out of memory. A large root set could indicate that you have some very deep recursions going on where you keep around a lot of pointers to data. You may be able to rewrite the code such that it is tail-calling and uses less memory while operating.
You should be more worried about the slow response times from the keyboard and mouse. That could indicate something is not right. Does a vmstat 1, a sysstat, a htop, a dstat or similar show anything odd while the process is running? You are also on the hunt to figure out if the kernel or the C libc is doing something odd here due to memory being constrained.
Q3 - How to fix
I don't know how to fix it without knowing more about what the application is doing. Since you have a crash dump, your first instinct should be to take the crash dump viewer and look at the dump. The goal is to find a process using a lot of memory, or one that has a deep stack. From there on, you can seek to limit the amount of memory that process is using. either by rewriting the code so it can give memory up earlier, by tuning the garbage collection setup for the process (see the spawn options in the erlang man pages for this), or by adding more memory to the system.

Resources