Parallel file writing performance on NFS under Linux

I have an issue regarding the performance of an EFS filesystem from Amazon, but I suspect the issue is with the Linux configuration.
My setup is an m4.large machine (2 cores, 8 GB RAM) in AWS, and the EFS drive is mounted as an NFS 4.1 mount with the standard setup.
I have a script that creates unique small 1 kB files (see below). I run the script with the GNU parallel utility, which lets me vary the number of parallel jobs.
The tests I've done show that with 1 job the speed is 60 kB/sec and with 2 jobs in parallel the overall speed is almost 120 kB/sec, but with 3, 4, or 10 jobs in parallel the overall speed stays at around 120 kB/sec.
I've increased the default limits for file descriptors and open files to huge values, but that had no impact. The CPU is barely utilized, and memory usage is low as well. The network should be able to sustain up to 45 MB/sec according to the specs, so I'm very far from that limit too. The EFS maximum throughput limit is around 105 MB/sec.
What else can I set up to allow more files to be written in parallel, other than increasing the number of cores on the machine? (I guess file writes translate to TCP connections for NFS mounts.)
The script used:
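#!/bin/bash
# Read the 1 kB payload once and reuse it for every file.
value="$(<source1k.txt)"
host="$(hostname)"
# Job id passed in by GNU parallel; keeps file names unique across jobs.
client=$1
# -p: create missing parents and do not fail if another job already created the directory.
mkdir -p output4/"$host"
for i in {0..5000}
do
echo "$value" > "output4/$host/File_$(printf "%s_%03d" "$client" "$i").txt"
done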
and it is called as below to run 4 parallel jobs (invoked with bash, because the script uses bash-only syntax such as $(<file) and {0..5000}):
parallel -j 4 bash writefiles.sh {} ::: 1 2 3 4
EDIT: I tested with the iozone utility using a 4 kB file size (it doesn't accept 1 kB), and the throughput test reports that the children see 240MB while the parent sees 500kB (I couldn't find out what this actually means, but those 500kB are close to what I measured).

After multiple tests and discussions with Amazon support, it seems my bottleneck was that I was writing all files to the same folder (there is probably a lock on the directory for naming purposes). When I changed the test to write files to different folders, the speed increased a lot.
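A minimal sketch of what "different folders" could look like for the test script, assuming one subdirectory per parallel job. The helper name write_files and the client_<id> directory layout are illustrative choices, not something taken from the original test or from Amazon support:

```shell
#!/bin/sh
# Sketch: same workload as the question's script, but every parallel job
# writes into its own subdirectory so the jobs do not share one directory.
# write_files and the client_<id> layout are hypothetical.
write_files() {
    base=$1      # output root, e.g. output4/$(hostname)
    client=$2    # job id passed in by GNU parallel
    count=$3     # number of files to create
    payload=$4   # content written to every file

    dir="$base/client_$client"
    mkdir -p "$dir" || return 1

    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s\n' "$payload" > "$dir/File_$(printf '%s_%03d' "$client" "$i").txt"
        i=$((i + 1))
    done
}
```

To use it, the function could be placed in writefiles.sh followed by a call such as write_files "output4/$(hostname)" "$1" 5001 "$(cat source1k.txt)", with the parallel invocation left unchanged.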

Related

fwrite becomes slow after long uptime

Recently we had a production server up for 50+ days exhibit slow fwrite times. Sporadically, a single fwrite() would take 50 to 300 msec to complete (typically 300 to 2400 bytes). We spent a few days investigating, collecting stats, trying a number of things. Finally after rebooting the system the problem is gone and the server is running normal, as-expected operation. Here are some notes:
-the system is a Xeon 2660 16-core with one HDD and one SSD, Ubuntu 12.04, 3.2.0-49-generic. The HDD is about 88% full and the SSD 75%. fstat() shows optimal HDD blocksize of 4096
-the application software running on the system is two different executables that open, run, and close repeatedly, running for intervals from a minute to several hours, writing numerous wav files of various sizes on a continuous basis while they are running
-both the HDD and SSD exhibited the issue. Writes to ramdisk were Ok
My question: is there any known issue where the Linux I/O subsystem can reach a point, over time, where a single flush or other I/O operation takes 50 or even 300+ msec to complete?
We tried defragmenting both drives, setvbuf() variations, and non-blocking file descriptors (fcntl), without any change. After reboot we see wav file extents the same as before, ranging from 1 to 10 typically, depending on file size. The only hint seemed to be that we could occasionally catch a thread briefly showing long I/O wait time or in "uninterruptible sleep" state. For that we used htop (turning on Detailed CPU Usage) and this command:
for x in `seq 1 1 100`; do ps -eo state,tid,pid,cmd | grep "^D"; echo "----"; sleep 0.25; done
which would (occasionally) show something like "flush-252:0"
We looked through this thread on slow fwrites along with many other discussions but did not find anything that helped other than the usual "probably if you reboot it will go away". Which of course is good advice, but doesn't avoid the next occurrence.
After the reboot, we went on a hunt for any left-over file handles not being closed by those two (2) apps before terminating, and did find one case. My understanding is that should not have an effect.

Optimisation of RAM usage for cluster jobs impacts data transfer in HPC environment. Any way to turn off caching to RAM in cp or rsync commands?

My problem is related to performing computer simulations on the large scale HPC cluster. I have a program that does MC simulation. After some part of the simulation is passed, the results are saved into files, and the simulation continues writing to the same part of memory as from the start. Thus, the program doesn't need all that much RAM to run (and we are talking about really low memory usage, like ~25MB). However, the total data generated over time are 10 or 100 times greater than that. The jobs are handled in the normal fashion: job data is copied to scratch partition, program runs on the node, results are returned from scratch to jobdir.
Now, everything would be dandy if it wasn't for the fact that when submitting a job to SLURM, I have to declare the amount of RAM assigned to the job. If I declare something close to the real program usage, say 50MB, I have a problem with getting back the results. After a week-long simulation, data are copied from scratch to the job directory, and the copy operation is cached in RAM, violating the job's RAM setting. Ultimately, the job is killed by SLURM. I then have to manually look for the data on scratch and copy them back. Obviously, this is not feasible for a few thousand jobs.
The command used for copying is:
cp -Hpv ${CONTENTS} ${TMPDIR}
and if the copied content is larger than the memory declared for the job, the job is killed with a message:
/var/lib/slurm-llnl/slurmd/job24667725/slurm_script: line 78: 46020 Killed cp -Hpv ${CONTENTS} ${TMPDIR}
slurmstepd-e1370: error: Detected 1 oom-kill event(s) in StepId=(jobid).batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
I've contacted cluster admins in that regard, and they just replied to reserve more RAM for the job. However, this results in an absurd amount of ram locked (basically wasted) for a week during the simulation, and used only for the moment when the results are copied back. Keeping in mind that I can (and often do) submit up to 5000 jobs at a time, I'm looking for some kind of hack to cp or rsync commands to force them not to cache to RAM or not to cache at all, while copying.
Will be glad for any comments.
Best regards.

How to optimize pigz?

I am using pigz to compress a large directory of nearly 50 GB. I have an EC2 instance running RedHat; the instance type is m4.xlarge, which has 4 CPUs. I expected the compression to use all of my CPUs and perform better, but it didn't meet my expectation.
The command I am using:
tar -cf - lager-dir | pigz > dest.tar.gz
But while the compression is running, I use mpstat -P ALL to check my CPU status, and the result shows a lot of %idle for the other 3 CPUs; only about 2% of each CPU is used by user-space processes.
I also checked with top, and pigz uses less than 10% of the CPU.
I tried -p 10 to increase the number of processes; CPU usage was then high for a few minutes, but dropped once the output file reached 2.7 GB.
The machine is used only for this compression, so I want to fully utilize all of its resources to get the best performance. How can I get there?

If file compression apps aren't CPU bound, they are most likely sequential I/O bound.
You can investigate this further by looking at the percentage of time the system spends in iowait ('wa') using top or mpstat (check the manpage for options if it isn't part of the default output).
If I'm right, most of the time the system isn't executing pigz is spent waiting on I/O.
You can also investigate this further using iostat, which can show disk I/O. The ratio between reads and writes will vary over time depending on how compressible the input is at that moment, but combined I/O should be fairly consistent. This assumes that Amazon's storage provisioning provides consistent I/O now, which did not use to be the case.
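If neither mpstat nor iostat is installed, the iowait share can also be approximated from /proc/stat. The following is a minimal sketch and not part of the original answer: iowait_percent is a hypothetical helper that takes two aggregate "cpu" lines sampled some seconds apart and prints the percentage of CPU time spent in iowait over that interval. It assumes the Linux /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# iowait_percent (hypothetical helper): given two "cpu ..." lines from
# /proc/stat taken at different times, print the share of elapsed CPU time
# that was spent in iowait between the two samples, as a percentage.
iowait_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2..9 are user, nice, system, idle, iowait, irq,
            # softirq, steal. Guest time is already included in user/nice.
            for (f = 2; f <= 9 && f <= NF; f++) total += $f
            t[NR] = total
            w[NR] = $6          # 6th field of the line is iowait
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print 0; exit }
            printf "%.1f\n", 100 * (w[2] - w[1]) / dt
        }'
}
```

A possible way to call it while the tar | pigz pipeline is running: a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat); iowait_percent "$a" "$b". A consistently high value together with low user time would support the I/O-bound explanation.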

Netlogo HPC CPU Percentage Use Increase

I submit jobs using headless NetLogo to an HPC server with the following script:
#!/bin/bash
#$ -N r20p
#$ -q all.q
#$ -pe mpi 24
/home/abhishekb/netlogo/netlogo-5.1.0/netlogo-headless.sh \
--model /home/abhishekb/models/corrected-rk4-20presults.nlogo \
--experiment test \
--table /home/abhishekb/csvresults/corrected-rk4-20presults.csv
Below is the snapshot of a cluster queue using:
qstat -g c
I wish to know whether I can increase the CQLOAD for my simulations, and also what it signifies. I couldn't find a clear explanation online.
CPU USAGE CHECK:
qhost -u abhishekb
When I run BehaviorSpace on my PC through the GUI, assigning high priority to the task makes it use nearly 99% of the CPU, which makes it run faster. I wish to accomplish the same here.

A typical HPC environment is designed to run only one MPI process (or OpenMP thread) per CPU core, which therefore has access to 100% of the CPU time, and this cannot be increased further. In contrast, on a classical desktop/server machine, a number of processes compete for CPU time, and it is indeed possible to increase the performance of one of them by setting the appropriate priorities with nice.
It appears that CQLOAD is the mean load average for that computing queue. If you are not using all the CPU cores in it, it is not a useful indicator. Besides, even the load average per core for your runs just reflects the efficiency of the code on this HPC cluster. For instance, a value of 0.7 per core would mean that the code spends 70% of the time doing calculations, while the remaining 30% is probably spent waiting to communicate with the other computing nodes (which is also necessary).
Bottom line, the only way you can increase the CPU percentage use on an HPC cluster is by optimising your code. Normally though, people are more concerned about the parallel scaling (i.e. how the time to solution decreases with the number of CPU cores) than with the CPU percentage use.

1. CPU percentage load
I agree with @rth's answer regarding trying to use Linux job priority / renice to increase CPU percentage: it's almost certain not to work, and (as you've found) you're unlikely to be able to do it, as you won't have superuser privileges on the nodes (it's pretty unlikely you can even log into the worker nodes; probably only the head node).
The CPU usage of your model as it runs is mainly a function of your code structure: if it runs at 100% CPU locally, it will probably run like that on the node while it's running.
Here are some answers to the more specific parts of your question:
2. CQLOAD
You ask
CQLOAD (what does it mean too?)
The docs for this are hard to find, but you link to the spec of your cluster, which tells us that its scheduling engine is Sun Grid Engine. Man pages are here (you can access them locally too, in particular by typing man qstat).
If you search through for qstat -g c, you will see the outputs described. In particular, the second column (CQLOAD) is described as:
OUTPUT FORMATS
...
an average of the normalized load average of all queue hosts. In order to reflect each hosts different significance the number of configured slots is used as a weighting factor when determining cluster queue load. Please note that only hosts with a np_load_value are considered for this value. When queue selection is applied only data about selected queues is considered in this formula. If the load value is not available at any of the hosts '-NA-' is printed instead of the value from the complex attribute definition.
This means that CQLOAD gives an indication of how utilized the processors are in the queue. Your output screenshot above shows 0.84, so this indicates that the average load on (in-use) processors in all.q is 84%. This doesn't seem too low.
3. Number of nodes reserved
In a related question, you state that colleagues are complaining that your processes are not using enough CPU. I'm not sure what that's based on, but I wonder whether the real problem here is that you're reserving a lot of nodes (even if just for a short time) for a job that they can see could work with fewer.
You might want to experiment with using fewer nodes (unless your results are very slow) - that is achieved by altering the line #$ -pe mpi 24 - maybe take the number 24 down. You can work out how many nodes you need (roughly) by timing how long 1 model run takes on your computer and then use
N = ((time to run 1 job) * number of runs in experiment) / (time you want the run to take)
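As a worked example of the formula above, with made-up numbers: if one run takes 120 s on your computer, the experiment has 500 runs, and you want the whole experiment to finish in 3600 s, then N = (120 * 500) / 3600, which is about 16.7, so you would round up to 17. A small shell sketch of the same arithmetic (nodes_needed is a hypothetical helper; all three inputs must be whole numbers in the same time unit):

```shell
# nodes_needed (hypothetical helper): rough estimate of N from the formula
# N = (time per run * number of runs) / target wall time, rounded up.
nodes_needed() {
    per_run=$1   # time to run 1 job, e.g. in seconds
    runs=$2      # number of runs in the experiment
    target=$3    # time you want the whole experiment to take, same unit
    total=$((per_run * runs))
    # Integer ceiling division: add (target - 1) before dividing.
    echo $(( (total + target - 1) / target ))
}
```

For the numbers above, nodes_needed 120 500 3600 prints 17. This is only a rough estimate, since it ignores any difference in speed between your computer and the cluster nodes.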

So you want to make your program run faster on Linux by giving it a higher priority than all other processes?
In that case you have to modify something called the program's niceness. This is normally done by invoking the command nice when you first start the program, or the command renice while the program is already running. A process can have a niceness from -20 to 19 (inclusive), where lower values give the process a higher priority. For security reasons, you can only decrease a process's niceness if you are the superuser (root).
So if you want to make a process run with higher priority then from within bash do
[abhishekb@hpc ~]$ start_process &
[abhishekb@hpc ~]$ jobs -x sudo renice -n -20 -p %+
Or just use the last command and replace the %+ with the process id of the process you want to increase the priority for.

Performance Problems with Node.js (Mac OSX) - Processes

i hope to find some little help here.
we are using node, mongodb, supertest, mocha and spawn in our test env.
we've tried to improve our mocha test env to run tests in parallel, because our test cases run now for almost 5minutes! (600 cases)
we are spawning for example 4 processes and run tests in parallel. it is very successful, but just on linux.
on my mac the tests still run very slow. it seems like the different processes are not really running in parallel.
test times:
macosx:
- running 9 tests in parallel: 37s
- running 9 tests not in parallel: 41s
linux:
- running 9 tests in parallel: 16s
- running 9 tests not in parallel: 25s
macosx early 2011:
10.9.2
16gb ram
core i7 2,2ghz
physical processors: 1
cores: 4
threads: 8
linux dell:
ubuntu
8gb ram
core i5-2520M 2.5ghz
physical processors: 1
cores: 2
threads: 4
my questions are:
are there any tips for improving the process performance on macosx?
(other than ulimit, launchctl (maxfiles)?)
why do the tests run much faster on linux?
thanks,
kate

I copied my comment here, as it describes most of the answer.
Given the specs of your machines, I doubt the extra 8 GB of RAM and processing power affect things much, especially given Node's single-process model and that you're only launching 4 processes. I doubt that for the Linux machine 8 GB, 2.5 GHz, and 4 threads are a bottleneck at all. As such, I would actually expect the time the processor spends running your tests to be roughly equivalent for both machines. I'd be more interested in your disk I/O performance, given you're running mongo. Your disk I/O has the most potential to slow stuff down. What are the specs there?
Your specs:
macosx: Toshiba 5400RPM 8MB
linux: Seagate 7200 rpm 16mb
Your Linux drive spins 1.33x faster than your Mac drive (7200 vs 5400 RPM) and has twice the cache (16 MB vs 8 MB). For database-based applications, hard drive performance is crucial. Most of the time spent in your application will be waiting for I/O, particularly with Node's single-process way of doing work. I would suggest this as the culprit for 90% of the performance difference, and chalk the rest up to the fact that Linux probably has less crap going on in the background, further exacerbating your Mac disk drive's performance issues.
Furthermore, launching multiple Node processes isn't likely to help this. Since processor time isn't your bottleneck, launching too many processes will just slow your disk down. Further evidence that this is the problem is that the gain from multiple processes on Linux is proportionally larger than the gain from multiple processes on Mac. One process is nearly maxing out the performance of your 5400 RPM drive, so you don't see a significant performance increase from running multiple processes, whereas the multiple Linux Node processes use the disk to its full potential. You would likely see diminishing returns on Linux if you were to launch many more processes, unless of course you were to upgrade to an SSD.
