How do I limit the amount of page cache on a system?

I am running HPC workloads and my systems have half a terabyte of RAM. I need to prevent the kernel from building up too much page cache, because when it flushes the dirty data, it does so in such large bursts that my drives lock up.
How do I cap the amount of page cache that the kernel will use?
I have tried everything I found on Google, which amounts to tuning the following sysctls:
vm.dirty_background_ratio
vm.dirty_background_bytes
vm.dirty_ratio
vm.dirty_bytes
and also
vm.vfs_cache_pressure
Nothing seems to take effect: the kernel keeps increasing page cache usage, and I have to periodically run echo 3 > /proc/sys/vm/drop_caches to work around it.
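For reference, setting the byte-based variants looks roughly like this (values are illustrative; note that the *_bytes and *_ratio forms are mutually exclusive, so writing one zeroes the other):
sysctl -w vm.dirty_background_bytes=268435456   # start background writeback at 256 MB of dirty data
sysctl -w vm.dirty_bytes=1073741824             # throttle writers once 1 GB is dirty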

Take a look at tuning the page cache. On kernels that support a direct limit, you would do something like this:
echo "40" > /proc/sys/vm/pagecache
Note, however, that no such tunable exists in mainline kernels; only some vendor kernels carry a page cache limit (SUSE's SLES, for example, ships a similar vm.pagecache_limit_mb).
I suggest not changing anything, because the LRU/MRU caching algorithms shipped with the kernel are very efficient, likely more so than anything you can tune by hand.

Related

Linux using swap instead of RAM with large image processing

I'm processing large images on a Linux server using the R programming language, so I expect much of the RAM to be used in the image processing and file writing process.
However, the server starts using swap long before it appears to need to, which slows down processing significantly. See the following image:
This shows I am using roughly 50% of the RAM for the image processing, about 50% appears to be reserved for disk cache (yellow), and yet 10 GB of swap is in use!
I watched the swap being eaten up, and it did not happen when RAM usage was any higher than what is shown in this image. The swap appears to be consumed while the processed data is being written to a GeoTIFF file.
My working theory is that the disk-writing process is using much of the disk cache area (the yellow region), and therefore the yellow is not actually available to the server, as is often assumed of disk-cache RAM.
Does that sound reasonable? Is there another reason for swap being used when RAM is apparently available?
I believe you may be affected by the swappiness kernel parameter:
When an application needs memory and all the RAM is fully occupied, the kernel has two ways to free some memory at its disposal: it can either reduce the disk cache in the RAM by eliminating the oldest data or it may swap some less used portions (pages) of programs out to the swap partition on disk. It is not easy to predict which method would be more efficient. The kernel makes a choice by roughly guessing the effectiveness of the two methods at a given instant, based on the recent history of activity.
Swappiness takes a value between 0 and 100 to change the balance between swapping applications and freeing cache. At 100, the kernel will always prefer to find inactive pages and swap them out. A value of 0 gives something close to the old behavior where applications that wanted memory could shrink the cache to a tiny fraction of RAM.
If you want to force the kernel to avoid swapping whenever possible and give the RAM from device buffers and disk cache to your application, you can set swappiness to zero:
echo 0 > /proc/sys/vm/swappiness
Note that you may actually worsen performance with this setting, because your disk cache may shrink to a tiny fraction of what it is now, making disk access slower.
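If zero does help, you can make the setting persistent across reboots (a minimal sketch; the exact config path may vary by distribution):
sysctl -w vm.swappiness=0                       # apply immediately
echo 'vm.swappiness = 0' >> /etc/sysctl.conf    # reapply at boot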

Linux CPU cache slowdown

We're getting overnight lockups on our embedded (Arm) Linux product but are having trouble pinning the cause down. It usually takes 12-16 hours from power-on for the problem to manifest. I've installed sysstat so I can run sar logging, and I've collected a bunch of data, but I'm having trouble interpreting the results.
The targets have only 512 MB of RAM (we have other models with 1 GB, which see this issue much less often) and no swap, to avoid wearing out the eMMCs.
Some kind of paging / virtual memory event is initiating the problem. In the sar logs, pgpgin/s, pgscand/s, pgsteal/s, and majflt/s all increase steadily before snowballing to crazy levels. This drives the load up to correspondingly high levels (30-60 on dual-core Arm chips). At the same time, the frmpg/s values go very negative, whilst campg/s goes highly positive. The upshot is that the system is trying to allocate a large number of cache pages all at once. I don't understand why this would be.
The target then essentially locks up until it's rebooted, or someone kills the main GUI process, or it crashes and is restarted (we have a monolithic GUI application that runs all the time and generally does all the serious work on the product). The network shuts down, telnet blocks forever, as do /proc filesystem queries and things that rely on them, like top. The memory allocation profile of the main application in this test is dominated by reading data in from file and caching it as textures in video memory (shared with main RAM) in an LRU using OpenGL ES 2.0. Most of the time it'll be accessing a single file (they are about 50 MB in size), but I guess it could be triggered by having to suddenly use a new file and trying to cache all 50 MB of it in one go. I haven't yet done the test (putting more logging in) to correlate this event with these system effects.
The odd thing is that the actual free and cached RAM levels don't show an obvious lack of memory (I have seen oom-killer swoop in and kill the main application with >100 MB free and 40 MB of cache RAM). The main application's memory usage seems reasonably well behaved, with a VmRSS value that seems pretty stable. Valgrind hasn't found any progressive leaks that would happen during operation.
The behaviour seems like that of a system frantically swapping out to disk and making everything run dog slow as a result, but I don't know if this is a known effect in a free<->cache RAM exchange system.
My problem is superficially similar to question: linux high kernel cpu usage on memory initialization but that issue seemed driven by disk swap file management. However, dirty page flushing does seem plausible for my issue.
I haven't tried playing with the various vm files under /proc/sys/vm yet. vfs_cache_pressure and possibly swappiness would seem good candidates for tuning, but I'd like some insight into good values to try. The documentation for vfs_cache_pressure is vague about what, quantitatively, the difference between setting it to 200 versus 10000 would be.
The other interesting fact is that it is a progressive problem. It might take 12 hours for the effect to happen the first time. If the main app is killed and restarted, it seems to happen every 3 hours after that. A full cache purge might push this back out, though.
Here's a link to the log data with two files: sar1.log, the complete output of sar -A, and overview.log, an extract of free/cache memory, CPU load, MainGuiApp memory stats, and the -B and -R sar outputs for the interesting period between midnight and 3:40 am:
https://drive.google.com/folderview?id=0B615EGF3fosPZ2kwUDlURk1XNFE&usp=sharing
So, to sum up, what's my best plan here? Tune vm to tend to recycle pages more often to make it less bursty? Are my assumptions about what's happening even valid given the log data? Is there a cleverer way of dealing with this memory usage model?
Thanks for your help.
Update 5th June 2013:
I've tried the brute-force approach and put a script in place that echoes 3 into /proc/sys/vm/drop_caches every hour. This seems to be maintaining the steady state of the system right now, and the sar -B stats stay on the flat portion, with very few major faults and 0.0 pgscand/s. However, I don't understand why keeping the cache RAM very low mitigates a problem where the kernel seems to be trying to add the universe to cache RAM.
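For reference, the workaround is roughly this (a sketch of my hourly script, assuming it is dropped into /etc/cron.hourly):
#!/bin/sh
# brute-force purge: flush dirty data, then drop page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches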

How to stop the page cache for disk I/O in my Linux system?

Here is my system, based on Linux 2.6.32.12:
1. It contains 20 processes that occupy a lot of user CPU time.
2. It needs to write data to disk at a rate of 100 MB/s, and that data will not be used again soon.
What I expect:
The system should run steadily, and disk I/O should not affect it.
My problem:
At the beginning, the system ran as I expected. But as time passed, Linux cached a lot of data for the disk I/O, steadily reducing free physical memory. Eventually there was not enough memory left, and Linux started swapping my processes in and out, which caused an I/O problem where a lot of CPU time was spent on I/O.
What I have try:
I tried to solve the problem by calling fsync every time I wrote a large block, but physical memory kept decreasing while the cached figure kept increasing.
How do I stop the page cache here? It is useless to me.
More information:
When top shows 46963 MB free, all is well: CPU %wa is low and vmstat shows no si or so activity.
When top shows 273 MB free, %wa is so high that it affects my processes, and vmstat shows a lot of si and so activity.
I'm not sure that changing something will affect overall performance.
You might use posix_fadvise(2) and sync_file_range(2) in your program (and, more rarely, fsync(2), fdatasync(2), sync(2), or syncfs(2)). Also look at madvise(2), mlock(2), and munlock(2), and of course mmap(2) and munmap(2). Perhaps ionice(1) could help.
In the reader process, you might use readahead(2) (perhaps in a separate thread).
Upgrading your kernel (to 3.6 or later) could certainly help: Linux has improved significantly on these points since 2.6.32, which is really old.
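If patching the program is not an option, a rough shell-level approximation of the same idea (paths are hypothetical) is to bypass the page cache with O_DIRECT and lower the writer's I/O priority:
# write through O_DIRECT at idle I/O priority, so the copy neither fills
# the page cache nor starves other I/O (O_DIRECT needs aligned block sizes)
ionice -c3 dd if=/path/to/input of=/path/to/output bs=1M oflag=direct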
To drop the page cache you can do the following:
echo 1 > /proc/sys/vm/drop_caches
drop_caches defaults to 0 and can be changed as needed. Since you've identified that you need to free the page cache, this is how to do it. You can also take a look at dirty_writeback_centisecs and its related tunables (http://lxr.linux.no/linux+*/Documentation/sysctl/vm.txt#L129) to make writeback happen sooner, but note that this has consequences: it wakes the kernel flusher threads to write out dirty pages. Also note dirty_expire_centisecs, which defines how old dirty data must be before it becomes eligible for writeout.
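For example (values are illustrative; both tunables are in hundredths of a second):
echo 100 > /proc/sys/vm/dirty_writeback_centisecs   # wake the flusher threads every 1 second
echo 200 > /proc/sys/vm/dirty_expire_centisecs      # write back anything dirty for more than 2 seconds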

How can I limit the cache used by copying so there is still memory available for other caches?

Basic situation:
I am copying some NTFS disks in openSUSE. Each one is 2 TB. When I do this, the system runs slow.
My guesses:
I believe it is likely due to caching. Linux decides to discard useful caches (for example, KDE 4 bloat, virtual machine disks, LibreOffice binaries, Thunderbird binaries, etc.) and instead fill all available memory (24 GB total) with stuff from the copying disks, which will be read only once, then written and never used again. So then any time I use these applications (or KDE 4), the disk needs to be read again, and reading the bloat off the disk again makes things freeze/hiccup.
Because the cache is gone and these bloated applications need lots of it, the system becomes horribly slow.
Since it is USB, the disk and disk controller are not the bottleneck, so using ionice does not make it faster.
I believe it is the cache rather than just the motherboard going too slow, because if I stop everything copying, it still runs choppy for a while until it recaches everything.
And if I restart the copying, it takes a minute before it becomes choppy again. But also, I can limit it to around 40 MB/s, and it runs faster again (not because it has the right things cached, but because the motherboard buses have lots of extra bandwidth for the system disks). I can fully accept a performance loss from my motherboard's I/O capability being completely consumed (which is 100% used, meaning 0% wasted power, which makes me happy), but I can't accept that this caching mechanism performs so terribly in this specific use case.
# free
             total       used       free     shared    buffers     cached
Mem:      24731556   24531876     199680          0    8834056   12998916
-/+ buffers/cache:    2698904   22032652
Swap:      4194300      24764    4169536
I also tried the same thing on Ubuntu, which causes a total system hang instead. ;)
And to clarify, I am not asking how to leave memory free for the "system", but for "cache". I know that cache memory is automatically given back to the system when needed, but my problem is that it is not reserved for caching of specific things.
Is there some way to tell these copy operations to limit memory usage so some important things remain cached, and therefore any slowdowns are a result of normal disk usage and not rereading the same commonly used files? For example, is there a setting of max memory per process/user/file system allowed to be used as cache/buffers?
The nocache command is the general answer to this problem! It is also in Debian and Ubuntu 13.10 (Saucy Salamander).
Thanks, Peter, for alerting us to the "--drop-cache" option in rsync. But that was rejected upstream (Bug 9560 – drop-cache option) in favor of a more general solution: the new "nocache" command, based on the rsync work with fadvise.
You just prepend "nocache" to any command you want. It also has nice utilities for describing and modifying the cache status of files. For example, here are the effects with and without nocache:
$ ./cachestats ~/file.mp3
pages in cache: 154/1945 (7.9%) [filesize=7776.2K, pagesize=4K]
$ ./nocache cp ~/file.mp3 /tmp
$ ./cachestats ~/file.mp3
pages in cache: 154/1945 (7.9%) [filesize=7776.2K, pagesize=4K]
$ cp ~/file.mp3 /tmp
$ ./cachestats ~/file.mp3
pages in cache: 1945/1945 (100.0%) [filesize=7776.2K, pagesize=4K]
So hopefully that will work for other backup programs (rsnapshot, duplicity, rdiff-backup, amanda, s3sync, s3ql, tar, etc.) and other commands that you don't want trashing your cache.
Kristof Provost was very close, but in my situation, I didn't want to use dd or write my own software, so the solution was to use the "--drop-cache" option in rsync.
I have used this many times since creating this question, and it seems to fix the problem completely. One exception was when using rsync to copy from a FreeBSD machine, which doesn't support "--drop-cache". So I wrote a wrapper that replaces the /usr/local/bin/rsync command and removes that option, and now it works when copying from there too.
It still uses a huge amount of memory for buffers and seems to keep almost no cache, but it works smoothly anyway.
$ free
             total       used       free     shared    buffers     cached
Mem:      24731544   24531576     199968          0   15349680     850624
-/+ buffers/cache:    8331272   16400272
Swap:      4194300     602648    3591652
You have practically two choices:
Limit the maximum disk buffer size: the problem you're seeing is probably caused by the default kernel configuration, which allows a huge share of RAM to be used for disk buffering. When you try to write lots of data to a really slow device, you end up with much of your precious RAM tied up as dirty cache for that slow device.
The kernel does this because it assumes the processes can continue doing work while they are not slowed down by the slow device, and that the RAM can be freed automatically when needed by simply writing the pages out to storage (the slow USB stick) — but the kernel does not account for how slow that device actually is. The quick fix:
# Wake up background writing process if there's more than 50 MB of dirty memory
echo 50000000 > /proc/sys/vm/dirty_background_bytes
# Limit total dirty memory to 200 MB before writers are throttled (source: http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages)
echo 200000000 > /proc/sys/vm/dirty_bytes
Adjust the numbers to match the RAM you're willing to spend on the disk write cache. A sensible value depends on your actual write performance, not on the amount of RAM you have. You should aim to have barely enough RAM for caching to allow full write performance for your devices. Note that this is a global setting, so you have to choose it according to the slowest device you're using.
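To make these survive a reboot, the equivalent sysctl configuration is (a sketch; the file name is arbitrary):
# /etc/sysctl.d/90-dirty-limits.conf
vm.dirty_background_bytes = 50000000
vm.dirty_bytes = 200000000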
Reserve a minimum memory size for each task you want to keep going fast. In practice this means creating cgroups for stuff you care about and defining the minimum memory you want to have for any such group. That way, the kernel can use the remaining memory as it sees fit. For details, see this presentation: SREcon19 Asia/Pacific - Linux Memory Management at Scale: Under the Hood
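A minimal cgroup v2 sketch of that idea (assuming a unified hierarchy at /sys/fs/cgroup with the memory controller enabled in the parent's cgroup.subtree_control; the group name, 512 MB reservation, and $PID are arbitrary examples):
mkdir /sys/fs/cgroup/important
echo $((512*1024*1024)) > /sys/fs/cgroup/important/memory.min   # reserve 512 MB for this group
echo "$PID" > /sys/fs/cgroup/important/cgroup.procs             # move the task you care about into it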
Update, 2022:
You can also try creating a new file /etc/udev/rules.d/90-set-default-bdi-max_ratio-and-min_ratio.rules with the following contents:
# For every BDI device, set max cache usage to 30% and min reserved cache to 2% of the whole cache
# https://unix.stackexchange.com/a/481356/20336
ACTION=="add|change", SUBSYSTEM=="bdi", ATTR{max_ratio}="30", ATTR{min_ratio}="2"
The idea is to put a per-device limit on maximum cache utilization. With the above limit (30%), you can have two totally stalled devices and still have 40% of the disk cache available for the rest of the system. If you have four or more stalled devices in parallel, even this workaround cannot help alone. That's why I have also set a minimum cache space of 2% for every device, though I don't know how to check whether that part is actually effective. I've been running with this config for about half a year and I think it's working nicely.
See https://unix.stackexchange.com/a/481356/20336 for details.
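To verify that the rule took effect, you can read the values back from sysfs (the bdi directories are named by device number):
grep . /sys/class/bdi/*/max_ratio /sys/class/bdi/*/min_ratio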
The kernel cannot know that you won't use the cached data from the copy again. This is your information advantage.
But you could set the swappiness to 0: sudo sysctl vm.swappiness=0. This causes Linux to drop the cache before writing libraries and the like out to swap.
It works nicely for me too, and is especially effective in combination with a large amount of RAM (16-32 GB).
It's not possible if you're using plain old cp, but if you're willing to reimplement or patch it yourself, setting posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE) on both the input and output file will probably help.
posix_fadvise() tells the kernel about your intended access pattern. In this case, you'd only use the data once, so there is no point in caching it.
Note, however, that Linux has long treated POSIX_FADV_NOREUSE as a no-op, so calling posix_fadvise with POSIX_FADV_DONTNEED after the data has been written is often the more reliable way to keep it out of the cache.
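If you would rather not patch cp, GNU dd can request similar behaviour with its nocache flags, which issue posix_fadvise(POSIX_FADV_DONTNEED) to drop the cached pages (paths are examples):
# copy while advising the kernel to discard the cached pages of both files
dd if=/source/file of=/dest/file bs=1M iflag=nocache oflag=nocache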
Try using dd instead of cp.
Or mount the filesystem with the sync flag.
I'm not completely sure whether these methods keep the data out of the page cache, but it may be worth a try.
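A sketch of both suggestions (paths are examples; note that a sync mount makes writes synchronous but does not bypass the cache):
dd if=/source/file of=/dest/file bs=4M oflag=direct   # O_DIRECT skips the page cache on the output side
mount -o remount,sync /mnt/target                     # force synchronous writes on the target filesystem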
I am copying some NTFS disks [...] the system runs slow. [...]
Since it is USB [...]
The slowdown is a known memory management issue.
Use a newer Linux kernel. Older ones have a problem with USB data and Transparent Huge Pages. See this LWN article. This issue was addressed very recently; see "Memory Management" in LinuxChanges.
OK, now that I know you're using rsync, I could dig a bit more:
It seems that rsync is ineffective when used with tons of files at the same time. There's an entry in their FAQ, and it's not a Linux/cache problem; it's rsync eating too much RAM.
Googling around, I found a recommendation to split the syncing into multiple rsync invocations.

How to clean caches used by the Linux kernel

I want to force the Linux kernel to allocate more memory to applications after the cache starts taking up too much memory (as can be seen by the output of 'free').
I've run
sudo sync; sudo sysctl -w vm.drop_caches=3; free
(to free both the dentry/inode cache and the page cache) and I see that only about half of the used cache was freed; the rest remains. How can I tell what is taking up the rest of the cache, and force it to be freed?
You may want to increase vfs_cache_pressure as well as set swappiness to 0.
Doing that will make the kernel reclaim cache faster, while giving processes equal or more favor when deciding what gets paged out.
You may only want to do this if processes you care about do very little disk I/O.
If a network I/O bound process has to swap in to serve requests, that's a problem and the real solution is to put it on a less competitive server.
With the default swappiness setting, the kernel is almost always going to favour keeping FS related cache in real memory.
As such, if you increase the cache pressure, be sure to equally adjust swappiness.
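For example (illustrative values; the defaults are vfs_cache_pressure=100 and swappiness=60):
sysctl -w vm.vfs_cache_pressure=200   # reclaim dentry/inode caches more aggressively
sysctl -w vm.swappiness=0             # prefer dropping cache over swapping out processes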
The contents of /proc/meminfo tell you what the kernel uses RAM for.
You can use /proc/sys/vm/vfs_cache_pressure to force the kernel to reclaim memory that is used for filesystem-related caches more lazily or eagerly.
Note that your application may only benefit from tuning this parameter if it does little or no disk I/O.
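For example, these fields give a quick breakdown of what is still cached (one likely culprit for undroppable cache is Shmem: tmpfs pages are counted as cache but cannot be freed by drop_caches):
grep -E '^(Cached|Buffers|Dirty|Writeback|Slab|SReclaimable|Shmem)' /proc/meminfo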
You might find John Nilsson's answer to my Question useful for purging the cache in order to test whether that is related to your problem:
sync && echo 1 > /proc/sys/vm/drop_caches
Though I'm guessing the only real difference is 1 vs 3: writing 1 drops only the page cache, 2 drops dentries and inodes, and 3 drops both.
