How to stop page cache for disk I/O in my linux system? - linux

Here is my system based on Linux2.6.32.12:
1 It contains 20 processes which occupy a lot of usr cpu
2 It needs to write data on rate 100M/s to disk and those data would not be used recently.
What I expect:
It can run steadily and disk I/O would not affect my system.
My problem:
At the beginning, the system run as I thought. But as the time passed, Linux would cache a lot data for the disk I/O, that lead to physical memory reducing. At last, there will be not enough memory, then Linux will swap in/out my processes. It will cause I/O problem that a lot cpu time was used to I/O.
What I have try:
I try to solved the problem, by "fsync" everytime I write a large block.But the physical memory is still decreasing while cached increasing.
How to stop page cache here, it's useless for me
More infomation:
When Top show free 46963m, all is well including cpu %wa is low and vmstat shows no si or so.
When Top show free 273m, %wa is so high which affect my processes and vmstat shows a lot si and so.

I'm not sure that changing something will affect overall performance.
Maybe you might use posix_fadvise(2) and sync_file_range(2) in your program (and more rarely fsync(2) or fdatasync(2) or sync(2) or syncfs(2), ...). Also look at madvise(2), mlock(2) and munlock(2), and of course mmap(2) and munmap(2). Perhaps ionice(1) could help.
In the reader process, you might perhaps use readhahead(2) (perhaps in a separate thread).
Upgrading your kernel (to a 3.6 or better) could certainly help: Linux has improved significantly on these points since 2.6.32 which is really old.

To drop pagecache you can do the following:
"echo 1 > /proc/sys/vm/drop_caches"
drop_caches are usually 0. And, can be changed as per need. As you've identified yourself, that you need to free pagecache, so this is how to do it. You can also take a look at dirty_writeback_centisecs (and it's related tunables)(http://lxr.linux.no/linux+*/Documentation/sysctl/vm.txt#L129) to make quick writeback, but note it might have consequences, as it calls up kernel flasher thread to write out dirty pages. Also, note the uses of dirty_expire_centices, which defines how much time some data needs to be eligible for writeout.

Related

vm/min_free_kbytes - Why Keep Minimum Reserved Memory?

According to this article:
/proc/sys/vm/min_free_kbytes: This controls the amount of memory that is kept free for use by special reserves including “atomic” allocations (those which cannot wait for reclaim)
My question is that what does it mean by "those which cannot wait for reclaim"? In other words, I would like to understand why there's a need to tell the system to always keep a certain minimum amount of memory free and under what circumstances will this memory be used? [It must be used by something; don't see the need otherwise]
My second question: does setting this memory to something higher than 4MB (on my system) leads to better performance? We have a server which occasionally exhibit very poor shell performance (e.g. ls -l takes 10-15 seconds to execute) when certain processes get going and if setting this number to something higher will lead to better shell performance?
(link is dead, looks like it's now here)
That text is referring to atomic allocations, which are requests for memory that must be satisfied without giving up control (i.e. the current thread can not be suspended). This happens most often in interrupt routines, but it applies to all cases where memory is needed while holding an essential lock. These allocations must be immediate, as you can't afford to wait for the swapper to free up memory.
See Linux-MM for a more thorough explanation, but here is the memory allocation process in short:
_alloc_pages first iterates over each memory zone looking for the first one that contains eligible free pages
_alloc_pages then wakes up the kswapd task [..to..] tap into the reserve memory pools maintained for each zone.
If the memory allocation still does not succeed, _alloc pages will either give up [..] In this process _alloc_pages executes a cond_resched() which may cause a sleep, which is why this branch is forbidden to allocations with GFP_ATOMIC.
min_free_kbytes is unlikely to help much with the described "ls -l takes 10-15 seconds to execute"; that is likely caused by general memory pressure and swapping rather than zone exhaustion. The min_free_kbytes setting only needs to allow enough free pages to handle immediate requests. As soon as normal operation is resumed, the swapper process can be run to rebalance the memory zones. The only time I've had to increase min_free_kbytes is after enabling jumbo frames on a network card that didn't support dma scattering.
To expand on your second question a bit, you will have better results tuning vm.swappiness and the dirty ratios mentioned in the linked article. However, be aware that optimizing for "ls -l" performance may cause other processes to become slower. Never optimize for a non-primary usecase.
All linux systems will attempt to make use of all physical memory available to the system, often through the creation of a filesystem buffer cache, which put simply is an I/O buffer to help improve system performance. Technically this memory is not in use, even though it is allocated for caching.
"wait for reclaim", in your question, refers to the process of reclaiming that cache memory that is "not in use" so that it can be allocated to a process. This is supposed to be transparent but in the real world there are many processes that do not wait for this memory to become available. Java is a good example, especially where a large minimum heap size has been set. The process tries to allocate the memory and if it is not instantly available in one large contiguous (atomic?) chunk, the process dies.
Reserving a certain amount of memory with min_free_kbytes allows this memory to be instantly available and reduces the memory pressure when new processes need to start, run and finish while there is a high memory load and a full buffer cache.
4MB does seem rather low because if the buffer cache is full, any process that wants an immediate allocation of more than 4MB will likely fail. The setting is very tunable and system-specific, but if you have a few GB of memory available it can't hurt to bump up the reserve memory to 128MB. I'm not sure what effect it will have on shell interactivity, but likely positive.
This memory is kept free from use by normal processes. As #Arno mentioned, the special processes that can run include interrupt routines, which must be run now (as it's an interrupt), and finish before any other processes can run (atomic). This can include things like swapping out memory to disk when memory is full.
If the memory is filled an interrupt (memory management) process runs to swap some memory into disk so it can free some memory for use by normal processes. But if vm.min_free_kbytes is too small for it to run, then it locks up the system. This is because this interrupt process must run first to free memory so others can run, but then it's stuck because it doesn't have enough reserved memory vm.min_free_kbytes to do its task resulting in a deadlock.
Also see:
https://www.linbit.com/en/kernel-min_free_kbytes/ and
https://askubuntu.com/questions/41778/computer-freezing-on-almost-full-ram-possibly-disk-cache-problem (where the memory management process has so little memory to work with it takes so long to swap little by little that it feels like a freeze.)

linux CPU cache slowdown

We're getting overnight lockups on our embedded (Arm) linux product but are having trouble pinning it down. It usually takes 12-16 hours from power on for the problem to manifest itself. I've installed sysstat so I can run sar logging, and I've got a bunch of data, but I'm having trouble interpreting the results.
The targets only have 512Mb RAM (we have other models which have 1Gb, but they see this issue much less often), and have no disk swap files to avoid wearing the eMMCs.
Some kind of paging / virtual memory event is initiating the problem. In the sar logs, pgpin/s, pgnscand/s and pgsteal/s, and majflt/s all increase steadily before snowballing to crazy levels. This puts the CPU up correspondingly high levels (30-60 on dual core Arm chips). At the same time, the frmpg/s values go very negative, whilst campg/s go highly positive. The upshot is that the system is trying to allocate a large amount of cache pages all at once. I don't understand why this would be.
The target then essentially locks up until it's rebooted or someone kills the main GUI process or it crashes and is restarted (We have a monolithic GUI application that runs all the time and generally does all the serious work on the product). The network shuts down, telnet blocks forever, as do /proc filesystem queries and things that rely on it like top. The memory allocation profile of the main application in this test is dominated by reading data in from file and caching it as textures in video memory (shared with main RAM) in an LRU using OpenGL ES 2.0. Most of the time it'll be accessing a single file (they are about 50Mb in size), but I guess it could be triggered by having to suddenly use a new file and trying to cache all 50Mb of it all in one go. I haven't done the test (putting more logging in) to correlate this event with these system effects yet.
The odd thing is that the actual free and cached RAM levels don't show an obvious lack of memory (I have seen oom-killer swoop in the kill the main application with >100Mb free and 40Mb cache RAM). The main application's memory usage seems reasonably well-behaved with a VmRSS value that seems pretty stable. Valgrind hasn't found any progressive leaks that would happen during operation.
The behaviour seems like that of a system frantically swapping out to disk and making everything run dog slow as a result, but I don't know if this is a known effect in a free<->cache RAM exchange system.
My problem is superficially similar to question: linux high kernel cpu usage on memory initialization but that issue seemed driven by disk swap file management. However, dirty page flushing does seem plausible for my issue.
I haven't tried playing with the various vm files under /proc/sys/vm yet. vfs_cache_pressure and possibly swappiness would seem good candidates for some tuning, but I'd like some insight into good values to try here. vfs_cache_pressure seems ill-defined as to what the difference between setting it to 200 as opposed to 10000 would be quantitatively.
The other interesting fact is that it is a progressive problem. It might take 12 hours for the effect to happen the first time. If the main app is killed and restarted, it seems to happen every 3 hours after that fact. A full cache purge might push this back out, though.
Here's a link to the log data with two files, sar1.log, which is the complete output of sar -A, and overview.log, a extract of free / cache mem, CPU load, MainGuiApp memory stats, and the -B and -R sar outputs for the interesting period between midnight and 3:40am:
https://drive.google.com/folderview?id=0B615EGF3fosPZ2kwUDlURk1XNFE&usp=sharing
So, to sum up, what's my best plan here? Tune vm to tend to recycle pages more often to make it less bursty? Are my assumptions about what's happening even valid given the log data? Is there a cleverer way of dealing with this memory usage model?
Thanks for your help.
Update 5th June 2013:
I've tried the brute force approach and put a script on which echoes 3 to drop_caches every hour. This seems to be maintaining the steady state of the system right now, and the sar -B stats stay on the flat portion, with very few major faults and 0.0 pgscand/s. However, I don't understand why keeping the cache RAM very low mitigates a problem where the kernel is trying to add the universe to cache RAM.

Linux: how to detect if a process is thrashing too much?

Is there a way to detect programmatically?
Also, what would be the linux commands to detect which processes are thrashing?
I'm assuming that "thrashing" here refers to a situation where the active memory set of all processes is too big to fit into memory. In such a situation every context switch causes reading and writing to disk, and eventually the server may become so thrashed that hardware reboot is the only option to regain control of the box.
There are global counters swin and swout in /proc/vmstat - if both of them increases during some short time interval the box is probably experiencing thrashing problems.
At the process level it's non-trivial AFAIK. /proc/$pid/status contains some useful stuff, but not swin and swout. From 2.6.34 there is a VmSwap entry, total amount of swap used, and the variable #12 in /proc/$pid/state is the number of major page faults. /proc/$pid/oom_score is also worth looking into. If VmSwap is increasing and/or the number of major page faults is increasing and/or oom_score is spectacularly high, then the process is likely to cause thrashing.
I wrote up a script thrash-protect - it's available at https://github.com/tobixen/thrash-protect - it attempts to figure out what processes are causing the thrashing and temporary suspends processes. It has worked out great for me and rescued me from some server reboots eventually.
Update: newer versions of the kernel has useful statistics under /proc/pressure. Also, a computer set up without swap will also start "thrashing" as the memory is getting full, as lack of buffer space tends to cause excessive read operations on the disk.

Reclaim memory after program exit

Here is my problem: after running a suite of programs, free tells me that after execution there is about 1 GB less memory free. After some searches I found SO: What really happens when you dont free after malloc which (as I understand it) makes clear that missing memory deallocations should not be the problem... (is that correct?)
top does not show any processes that use significant amounts of memory.
How can I find out 'what happend' to the memory, i.e. which program allocated it and why it is not free after program execution?
Where does free collect its information?
(I am running a recent Ubuntu version)
Yes, memory used by your program is freed after your program exits.
The statistics in "free" are confusing, but the fact is that the memory IS available to other programs:
http://kevinclosson.wordpress.com/2009/11/17/linux-free-memory-is-it-free-or-reclaimable-yes-when-i-want-free-memory-i-want-free-memory/
http://sourcefrog.net/weblog/software/linux-kernel/free-mem.html
Here's an event better link:
http://www.linuxatemyram.com/
free (1) is a misnomer, it should more correctly be called unused, because that's what it shows. Or maybe it should be called physicalfree (or, more precisely, the "free" column in the output should be named "unused").
You'll note that "buffers" and "cached" tends to go up as "free" goes down. Memory does not disappear, it just gets assigned to a different "bucket".
The difference between free memory and unused memory is that while both are "free", the unused memory is truly so (no physical memory in use) whereas the simply "free" memory is often moved into the buffer cache. That is for example the case for all executable images and libraries, anything that is read-only or read-execute. If the same file is loaded again later, the "free" page is mapped into the process again and no data must be loaded.
Note that "unused" is actually a bad thing, although it is not immediately obvious (it sounds good, doesn't it?). Free (but physically used) memory serves a purpose, whereas free (unused) memory means you could as well have saved on money for RAM. Therefore, having unused memory (e.g. by purging pages) is exactly what you don't want.
Stunningly, under Windows there exists a lot of "memory optimizer" tools which cost real money and which do just that...
About reclaiming memory, the way this works is easy: The OS simply removes the references to all pages in the working set. If a page is shared with another process, nothing spectacular happens. If it belongs to a non-anonymous mapping and is not writeable (or writeable and not written), it goes into the buffer cache. Otherwise, it goes zap poof.
This removes any memory allocated with malloc as well as the memory used by executables and file mappings, and (since all memory is based on pages) everything else.
It is probably your OS using up that space for its own purposes.
For example, many modern OS's will keep programs loaded in memory after they terminate, in case you want to start them up again. If their guess is right, it saves a lot of time at the cost of some memory that wasn't being used anyway. Some OS's will even speculatively load some commonly used programs.
CPU utilization works the same way. Often your OS will speculatively do some work when the CPU would otherwise be "idle".

unbuffered I/O in Linux

I'm writing lots and lots of data that will not be read again for weeks - as my program runs the amount of free memory on the machine (displayed with 'free' or 'top') drops very quickly, the amount of memory my app uses does not increase - neither does the amount of memory used by other processes.
This leads me to believe the memory is being consumed by the filesystems cache - since I do not intend to read this data for a long time I'm hoping to bypass the systems buffers, such that my data is written directly to disk. I dont have dreams of improving perf or being a super ninja, my hope is to give a hint to the filesystem that I'm not going to be coming back for this memory any time soon, so dont spend time optimizing for those cases.
On Windows I've faced similar problems and fixed the problem using FILE_FLAG_NO_BUFFERING|FILE_FLAG_WRITE_THROUGH - the machines memory was not consumed by my app and the machine was more usable in general. I'm hoping to duplicate the improvements I've seen but on Linux. On Windows there is the restriction of writing in sector sized pieces, I'm happy with this restriction for the amount of gain I've measured.
is there a similar way to do this in Linux?
The closest equivalent to the Windows flags you mention I can think of is to open your file with the open(2) flags O_DIRECT | O_SYNC:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In
general this will degrade performance, but it is useful in special
situations, such as when applications do their own caching. File I/O
is done directly to/from user space buffers. The O_DIRECT flag on its
own makes at an effort to transfer data synchronously, but does not
give the guarantees of the O_SYNC that data and necessary metadata are
transferred. To guarantee synchronous I/O the O_SYNC must be used in
addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is
described in raw(8).
Granted, trying to do research on this flag to confirm it's what you want I found this interesting piece telling you that unbuffered I/O is a bad idea, Linus describing it as "brain damaged". According to that you should be using madvise() instead to tell the kernel how to cache pages. YMMV.
You can use O_DIRECT, but in that case you need to do the block IO yourself; you must write in multiples of the FS block size and on block boundaries (it is possible that it is not mandatory but if you do not its performance will suck x1000 because every unaligned write will need a read first).
Another much less impacting way of stopping your blocks using up the OS cache without using O_DIRECT, is to use posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED). Under Linux 2.6 kernels which support it, this immediately discards (clean) blocks from the cache. Of course you need to use fdatasync() or such like first, otherwise the blocks may still be dirty and hence won't be cleared from the cache.
It is probably a bad idea of fdatasync() and posix_fadvise( ... POSIX_FADV_DONTNEED) after every write, but instead wait until you've done a reasonable amount (50M, 100M maybe).
So in short
after every (significant chunk) of writes,
Call fdatasync followed by posix_fadvise( ... POSIX_FADV_DONTNEED)
This will flush the data to disc and immediately remove them from the OS cache, leaving space for more important things.
Some users have found that things like fast-growing log files can easily blow "more useful" stuff out of the disc cache, which reduces cache hits a lot on a box which needs to have a lot of read cache, but also writes logs quickly. This is the main motivation for this feature.
However, like any optimisation
a) You're not going to need it so
b) Do not do it (yet)
as my program runs the amount of free memory on the machine drops very quickly
Why is this a problem? Free memory is memory that isn't serving any useful purpose. When it's used to cache data, at least there is a chance it will be useful.
If one of your programs requests more memory, file caches will be the first thing to go. Linux knows that it can re-read that data from disk whenever it wants, so it will just reap the memory and give it a new use.
It's true that Linux by default waits around 30 seconds (this is what the value used to be anyhow) before flushing writes to disk. You can speed this up with a call to fsync(). But once the data has been written to disk, there's practically zero cost to keeping a cache of the data in memory.
Seeing as you write to the file and don't read from it, Linux will probably guess that this data is the best to throw out, in preference to other cached data. So don't waste effort trying to optimise unless you've confirmed that it's a performance problem.

Resources