Recently we had a production server that had been up for 50+ days exhibit slow fwrite() times. Sporadically, a single fwrite() (typically 300 to 2400 bytes) would take 50 to 300 msec to complete. We spent a few days investigating, collecting stats, and trying a number of things. Finally, after rebooting the system, the problem is gone and the server is back to normal, as-expected operation. Here are some notes:
-the system is a Xeon 2660 16-core with one HDD and one SSD, Ubuntu 12.04, kernel 3.2.0-49-generic. The HDD is about 88% full and the SSD 75%. fstat() shows an optimal HDD block size of 4096
-the application software running on the system consists of two different executables that open, run, and close repeatedly, for intervals from a minute to several hours, continuously writing numerous wav files of various sizes while they run
-both the HDD and SSD exhibited the issue. Writes to a ramdisk were OK
My question: is there any known issue where the Linux I/O stack can, over time, reach a state where a single flush or other I/O operation takes 50 or even 300+ msec to complete?
We tried defragmenting both drives, setvbuf() variations, and non-blocking file descriptors (fcntl), without any change. After the reboot we see wav file extent counts the same as before, typically ranging from 1 to 10 depending on file size. The only hint was that we could occasionally catch a thread briefly showing a long I/O wait time, or sitting in the "uninterruptible sleep" (D) state. For that we used htop (with Detailed CPU Usage turned on) and this command:
for x in `seq 1 1 100`; do ps -eo state,tid,pid,cmd | grep "^D"; echo "----"; sleep 0.25; done
which would (occasionally) show something like "flush-252:0"
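One thing we did not capture at the time, and which might have helped, is whether the stalls lined up with a dirty-page backlog being flushed. For anyone hitting the same thing, the writeback counters can be sampled alongside the D-state check (this is a guess at the mechanism, not something we confirmed):
for x in `seq 1 1 100`; do grep -E 'Dirty|Writeback' /proc/meminfo; echo "----"; sleep 0.25; done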
We looked through this thread on slow fwrites, along with many other discussions, but did not find anything that helped beyond the usual "it will probably go away if you reboot". That is good advice, of course, but it doesn't prevent the next occurrence.
After the reboot, we went hunting for any left-over file handles not being closed by those two apps before terminating, and did find one case. My understanding is that this should not have an effect.
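For anyone doing the same hunt, a quick way to spot handles that outlive a run is to inspect /proc and lsof (the process names here are placeholders for our two executables):
for pid in $(pgrep -f 'appA|appB'); do echo "== PID $pid =="; ls -l /proc/$pid/fd; done
lsof +L1      # open files whose directory entry has already been deleted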
On Ubuntu 14.04:
$ cat /proc/sys/fs/inotify/max_queued_events
16384
$ cat /proc/sys/fs/inotify/max_user_instances
128
$ cat /proc/sys/fs/inotify/max_user_watches
1048576
Right after a computer restart I had 1GB of RAM consumed. After 20-30 minutes (with just 1 terminal open) I had 6GB of RAM used and growing, yet none of the processes seemed to be using that much memory (according to htop and top). When I killed the inotifywait process the memory was not freed, but it stopped growing. Then I restarted the PC, killed inotifywait right away, and memory usage stayed at 1GB.
I have 2 hard drives, one 1TB and the other 2TB. Was inotifywait somehow caching them, or is it normal in general that it caused this behavior?
This is the Linux disk cache at work; it is not a memory leak in inotifywait or anything else.
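You can see this for yourself: most of the "used" memory that worries you is reclaimable cache, and the kernel reports it separately. For example:
free -h                                        # compare the 'cached' column and the '-/+ buffers/cache' line
grep -E '^(MemFree|Buffers|Cached)' /proc/meminfo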
I've accepted the previous answer because it explains what's going on. However, I have some second thoughts on this topic. What the page is basically saying is "caching doesn't occupy memory, because that memory can be taken back at any point, so you should not worry, you should not panic, you should be grateful!". Well... I'm not. I believe there should be a decent, but at the same time hard, limit for caching.
The idea behind this mechanism is good: "let's not waste the user's time and cache everything while we have space". However, what it is actually doing in my case is wasting my time. I'm currently working on Linux running in a virtual machine. Since I have a lot of Jira pages open, a lot of terminals across many desktops, some tools open and so on, I don't want to reopen all of that every day, so I just save the virtual machine state instead of shutting it down at the end of the day. Now let's say my stuff occupies 4GB of RAM. In theory, 4GB should be written to my hard drive when I save the state, and 4GB loaded back into RAM when I start the virtual machine. Unfortunately that's only the theory. In practice, because inotifywait happily fills my 32GB of RAM, I have to wait 8 times longer to save and restore the virtual machine. Yes, my RAM is "fine" as the page says, and yes, I can open another app without hitting the OOM killer, but at the same time the caching is wasting my time.
If there were a decent limit, say 1GB for caching, it would not be so painful. I think that if I had the VM on an HDD instead of an SSD, saving the state would take forever and I would probably not use it at all.
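For what it's worth, there is no single hard "cache limit" knob, but the cache can at least be dropped manually right before saving the VM state (a blunt workaround, not a proper limit):
sync                                          # write out dirty pages first
echo 3 | sudo tee /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes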
I have an issue with the performance of an EFS filesystem from Amazon, but I suspect the issue is with the Linux configuration.
My setup is an m4.large machine (2 cores, 8GB RAM) in AWS, and the EFS drive is mounted as an NFS 4.1 mount with the standard setup.
I have a script that creates unique small 1 kB files (see below). I'm running the script in parallel using the GNU parallel utility, which lets me run it under different numbers of parallel jobs.
The tests I've done show that when I run 1 job only, the speed is 60 kB/sec; with 2 jobs in parallel, the overall speed is almost 120 kB/sec; but beyond that, running 3, 4, or 10 jobs in parallel, the overall speed stays at around 120 kB/sec.
I've increased the default limits for file descriptors and open files to huge values, with no impact. The CPU is barely utilized and memory usage is low. The network should be able to sustain up to 45 MB/sec according to the specs, so I'm very far from that limit too. The EFS maximum throughput limit is also around 105 MB/sec.
What else can I set up to allow more files to be written in parallel, other than increasing the number of cores on the machine? (I'm guessing file writes translate into TCP requests for NFS mounts.)
The script used:
#!/bin/bash
# writes 5001 unique 1 kB files, named per client/job, into a per-host folder
value="$(<source1k.txt)"
host="$(hostname)"
client=$1
mkdir -p output4/"$host"   # -p: don't fail if another parallel job already created the folder
for i in {0..5000}
do
    echo "$value" > "output4/$host/File_$(printf "%s_%03d" "$client" "$i").txt"
done
and it is called as below to run 4 parallel jobs (bash, since the script uses bash features):
parallel -j 4 bash writefiles.sh {} ::: 1 2 3 4
EDIT: I tested the iozone utility using 4 kB as the file size (it doesn't accept 1 kB), and the throughput test reports that "Children see 240MB" while "Parent sees 500kB" (I couldn't find out what this actually means, but those 500 kB are close to what I measured).
After multiple tests and discussions with Amazon support, it seems my bottleneck was the fact that I was writing all files to the same folder (probably there is a lock for naming purposes). When I changed the test to write files to different folders, the speed increased a lot.
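For reference, the modified test looked roughly like the sketch below (the per-client sub-directory layout is just what I happened to choose):
#!/bin/bash
value="$(<source1k.txt)"
host="$(hostname)"
client=$1
# one sub-directory per client/job, so parallel writers don't all contend on the same directory
dir="output4/$host/client_$client"
mkdir -p "$dir"
for i in {0..5000}
do
    echo "$value" > "$dir/File_$(printf "%s_%03d" "$client" "$i").txt"
done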
I am using pigz to compress a large directory, nearly 50GB, on an EC2 instance running RedHat. The instance type is m4.xlarge, which has 4 CPUs, and I expected the compression to eat up all my CPUs and deliver better performance, but it didn't meet my expectation.
The command I am using:
tar -cf - lager-dir | pigz > dest.tar.gz
While the compression is running, I use mpstat -P ALL to check my CPU status; the result shows a lot of %idle on the other 3 CPUs, with only about 2% used by user-space processes on each CPU.
I also used top to check, and pigz uses less than 10% of the CPU.
I tried -p 10 to increase the process count; usage was high for a few minutes, but dropped back down once the output file reached 2.7 GB.
I have all the CPUs dedicated to this compression and I want to fully utilize all of my resources to get the best performance. How can I get there?
If file compression apps aren't CPU bound, they are most likely sequential I/O bound.
You can investigate this further by looking at the percentage of time the system spends in iowait ('wa'), using top or mpstat (check the man page for options if it isn't part of the default output).
If I'm right, most of the time the system isn't executing pigz will be spent waiting on I/O.
You can also investigate this further using iostat, which can show disk I/O. The ratio between reads and writes will vary over time depending on how compressible the input is at that moment, but the combined I/O should be fairly consistent. This assumes that Amazon's storage provisioning provides consistent I/O now, something that didn't use to be the case.
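If you want concrete numbers, something like the following will show whether the time is going to iowait and how busy the volume is (the 5-second interval is arbitrary):
mpstat -P ALL 5        # watch the %iowait column per CPU
iostat -xz 5           # per-device %util, await and read/write throughput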
We're getting overnight lockups on our embedded (ARM) Linux product, but are having trouble pinning the problem down. It usually takes 12-16 hours from power-on for it to manifest. I've installed sysstat so I can run sar logging, and I've got a bunch of data, but I'm having trouble interpreting the results.
The targets only have 512MB of RAM (we have other models with 1GB, and they see this issue much less often), and have no swap, to avoid wearing out the eMMC.
Some kind of paging / virtual memory event is initiating the problem. In the sar logs, pgpgin/s, pgscand/s, pgsteal/s and majflt/s all increase steadily before snowballing to crazy levels. This pushes the CPU load up to correspondingly high levels (30-60 on dual-core ARM chips). At the same time, the frmpg/s values go sharply negative, whilst campg/s goes strongly positive. The upshot is that the system is trying to allocate a large number of cache pages all at once, and I don't understand why it would.
The target then essentially locks up until it's rebooted, someone kills the main GUI process, or it crashes and is restarted (we have a monolithic GUI application that runs all the time and generally does all the serious work on the product). The network shuts down, telnet blocks forever, as do /proc filesystem queries and things that rely on them, like top. The memory allocation profile of the main application in this test is dominated by reading data in from file and caching it as textures in video memory (shared with main RAM) in an LRU cache using OpenGL ES 2.0. Most of the time it'll be accessing a single file (they are about 50MB in size), but I guess the problem could be triggered by having to suddenly switch to a new file and cache all 50MB of it in one go. I haven't yet done the test (putting more logging in) to correlate this event with the system effects.
The odd thing is that the actual free and cached RAM levels don't show an obvious lack of memory (I have seen the oom-killer swoop in and kill the main application with >100MB free and 40MB of cache RAM). The main application's memory usage seems reasonably well behaved, with a VmRSS value that stays pretty stable. Valgrind hasn't found any progressive leaks that would happen during operation.
The behaviour seems like that of a system frantically swapping to disk and making everything run dog slow as a result, but I don't know whether this is a known effect in a system that can only exchange pages between free RAM and cache RAM (there is no swap).
My problem is superficially similar to this question: linux high kernel cpu usage on memory initialization, but that issue seemed to be driven by swap file management. However, dirty page flushing does seem plausible for my issue.
I haven't tried playing with the various vm files under /proc/sys/vm yet. vfs_cache_pressure and possibly swappiness seem like good candidates for tuning, but I'd like some insight into good values to try. vfs_cache_pressure seems ill-defined as to what, quantitatively, the difference between setting it to 200 and setting it to 10000 would be.
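For reference, these can be changed at runtime, so experimenting is cheap once I know roughly what to aim for (the values below are purely illustrative, not suggestions):
cat /proc/sys/vm/vfs_cache_pressure /proc/sys/vm/swappiness   # current values
echo 200 > /proc/sys/vm/vfs_cache_pressure                    # illustrative value only
echo 10 > /proc/sys/vm/swappiness                             # illustrative value only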
The other interesting fact is that this is a progressive problem. It might take 12 hours for the effect to appear the first time; if the main app is killed and restarted, it then seems to happen every 3 hours after that. A full cache purge might push this back out, though.
Here's a link to the log data, with two files: sar1.log, the complete output of sar -A, and overview.log, an extract of the free / cached memory, CPU load, MainGuiApp memory stats, and the sar -B and -R output for the interesting period between midnight and 3:40am:
https://drive.google.com/folderview?id=0B615EGF3fosPZ2kwUDlURk1XNFE&usp=sharing
So, to sum up: what's my best plan here? Tune vm to recycle pages more often, so the behaviour is less bursty? Are my assumptions about what's happening even valid, given the log data? Is there a cleverer way of dealing with this memory usage model?
Thanks for your help.
Update 5th June 2013:
I've tried the brute-force approach and put a script in place that echoes 3 to drop_caches every hour. This seems to be maintaining the steady state of the system for now, and the sar -B stats stay on the flat portion, with very few major faults and 0.0 pgscand/s. What I don't understand is why keeping the cache RAM very low mitigates a problem in which the kernel seems to be trying to add the universe to cache RAM.
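The script itself is nothing more sophisticated than this sketch (run as root from an hourly cron job; the sync beforehand is just belt-and-braces to flush dirty pages first):
#!/bin/sh
sync
echo 3 > /proc/sys/vm/drop_caches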
My hosting provider (pairNetworks) has certain rules for scripts run on the server. I'm trying to compress a file for backup purposes, and would ideally like to use bzip2 to take advantage of its AWESOME compression rate. However, when trying to compress this 90 MB file, the process sometimes runs upwards of 1.5 minutes. One of the resource rules is that a script may only execute for 30 CPU seconds.
If I use the nice command to 'nicefy' the process, does that break up the total CPU processing time that gets counted against it? Is there a different command I could use in place of nice? Or will I have to use a different compression utility that doesn't take as long?
Thanks!
EDIT: This is what their support page says:
Run any process that requires more than 16MB of memory space.
Run any program that requires more than 30 CPU seconds to complete.
EDIT: I run this in a bash script from the command line
nice will change the process's priority, so it will get its CPU seconds sooner (or later). If the rule is really about CPU seconds, as you state in your question, nice will not help you at all; the process will just be killed at a different point in time.
As for a solution, you could try splitting the file into three 30 MB pieces (see split(1)), each of which you can compress within the allotted time. Then you decompress the pieces and use cat to put them back together. Depending on whether it's binary or text, you can use the -b or -l arguments to split.
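A sketch of that approach, assuming the 90 MB file is called backup.dat (bzip2 can decompress concatenated .bz2 streams, so the compressed pieces can simply be cat'ed back together for the restore):
split -b 30m backup.dat piece_            # produces piece_aa, piece_ab, piece_ac
for p in piece_*; do bzip2 "$p"; done     # each piece is compressed by its own short-lived process
# later, to restore:
cat piece_*.bz2 | bunzip2 > backup.dat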
nice won't help you; the number of CPU seconds will still be the same, no matter how many wall-clock seconds it takes.
You have to find a compromise between compression ratio and CPU consumption. bzip2 has -1 ... -9 options; try "tuning" it (-1 is the fastest). Another way is to consult your provider; maybe it is possible to grant your script special permission to run longer.
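If you want to see how much the level matters before committing to one, a quick timing comparison is easy (the file name is just an example):
time bzip2 -1 -c backup.dat > backup.fast.bz2    # fastest, lowest compression ratio
time bzip2 -9 -c backup.dat > backup.best.bz2    # slowest, best compression ratio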
No, nice will only affect how your process is scheduled. Put simply, a process that takes 30 CPU seconds will always take 30 CPU seconds even if it's preempted for hours.
I always get a thrill when I load up all the cores of my machine with some hefty processing but have them all niced. I love seeing the CPU monitor maxed out while I surf the web without any noticeable lag.