Prioritize write cache over read cache on Linux

My PC (with 4 GB of RAM) is running several I/O-bound applications, and I want to avoid as many writes to my SSD as possible.
In the /etc/sysctl.conf file I have set:
vm.dirty_background_ratio = 75
vm.dirty_ratio = 90
vm.dirty_expire_centisecs = 360000
vm.swappiness = 0
And in /etc/fstab I added the commit=3600 parameter.
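A quick way to confirm those values are actually in effect (the remount example below assumes the commit option was added to the root filesystem entry):
# reload /etc/sysctl.conf and print the current values
sudo sysctl -p
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs vm.swappiness
# apply the new commit interval without rebooting (example for the root filesystem)
sudo mount -o remount,commit=3600 /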
According to the free command, my PC usually runs with about 1 GB of RAM used by applications and about 2500 MB of available RAM. So with my settings I should be able to write at least 1500-2000 MB of data without actually writing to the disk.
I have done some tests with moderate writes (300 MB - 1000 MB), watching free and cat /proc/meminfo | grep Dirty, and I noticed that often, shortly after these writes (far less than the dirty_expire_centisecs time), the dirty bytes drop to a value close to 0.
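For reference, the dirty data can be monitored during such a test with any equivalent polling of /proc/meminfo, for example:
# refresh the amount of dirty and in-flight writeback data every second
watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'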
I suspect that subsequent read operations fill the cache until the machine is near an OOM condition and is forced to flush the dirty writes, ignoring my sysctl.conf settings (correct me if my hypothesis is wrong).
So the question is: is it possible to disable only read caching (AFAIK not possible), or at least to change the page cache replacement policy, giving more priority to the write cache, so that the read cache cannot force a flush of dirty writes (maybe by tweaking the kernel source code)? I know that I can solve this problem easily using tmpfs or a union filesystem like AUFS or OverlayFS, but for many reasons I would like to avoid them.
Sorry for my bad English; I hope you understand my question. Thank you.

Related

Linux using swap instead of RAM with large image processing

I'm processing large images on a Linux server using the R programming language, so I expect much of the RAM to be used in the image processing and file writing process.
However, the server is using swap long before it appears to need to, which slows down the processing significantly. A memory usage graph (not reproduced here) shows roughly 50% of the RAM used for the image processing, about 50% apparently reserved for disk cache (yellow), and yet 10 GB of swap in use!
I watched the swap being eaten up, and it did not happen when RAM usage was any higher than shown in that graph; the swap appears to be consumed while the processed data is being written to a GeoTiff file.
My working theory is that the disk-writing process is using much of the disk cache area (the yellow region), so that memory isn't actually available to the server, contrary to what is often assumed about disk cache RAM.
Does that sound reasonable? Is there another reason for swap being used when RAM is apparently available?
I believe you may be affected by the swappiness kernel parameter:
When an application needs memory and all the RAM is fully occupied, the kernel has two ways to free some memory at its disposal: it can either reduce the disk cache in the RAM by eliminating the oldest data or it may swap some less used portions (pages) of programs out to the swap partition on disk. It is not easy to predict which method would be more efficient. The kernel makes a choice by roughly guessing the effectiveness of the two methods at a given instant, based on the recent history of activity.
Swappiness takes a value between 0 and 100 to change the balance between swapping applications and freeing cache. At 100, the kernel will always prefer to find inactive pages and swap them out. A value of 0 gives something close to the old behavior where applications that wanted memory could shrink the cache to a tiny fraction of RAM.
If you want to force the kernel to avoid swapping whenever possible and give the RAM from device buffers and disk cache to your application, you can set swappiness to zero:
echo 0 > /proc/sys/vm/swappiness
Note that you may actually worsen performance with this setting, because your disk cache may shrink to a tiny fraction of what it is now, making disk access slower.
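If the setting helps, one way to make it persistent across reboots (the file name here is just a convention):
echo 'vm.swappiness = 0' | sudo tee /etc/sysctl.d/99-swappiness.conf
sudo sysctl --system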

Drop cache does not work

I am currently working on optimizing the memory management of a large program. For this purpose, I want to drop the page cache in main memory.
I used sync && echo 3 > /proc/sys/vm/drop_caches, as widely suggested on the internet, but it does not drop the cache back to the level it was at before the program started. This means some undroppable cache remains in main memory after the program starts.
But doesn't echo 3 mean freeing the page cache as well as the cached dentries and inodes? Are there other kinds of cache that cannot be freed by this command?
Yes, there are some types of caches that cannot be dropped. For instance, tmpfs filesystems are stored in the page cache, but those pages cannot be flushed while the filesystems are in use. You can get a better picture of how much memory you really have available by using the free command and checking the available column. You'll notice that available memory is smaller than free + buffers + caches - sometimes much smaller.
For more information on tmpfs using caches see this answer.
Collect the output of cat /proc/vmstat before and after you issue the drop-cache command.
It includes nr_inactive_file, nr_active_file, nr_file_pages, and nr_isolated_file. If dropping the cache works, the total of these four counters should be lower than before.
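A minimal sketch of that comparison (the file names are arbitrary):
grep -E 'nr_inactive_file|nr_active_file|nr_file_pages|nr_isolated_file' /proc/vmstat > vmstat_before.txt
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
grep -E 'nr_inactive_file|nr_active_file|nr_file_pages|nr_isolated_file' /proc/vmstat > vmstat_after.txt
diff vmstat_before.txt vmstat_after.txt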

Do I need to tune sysctl.conf under linux when running MongoDB?

We are seeing occasional huge writes to disk in the MongoDB log, effectively locking MongoDB for a long time. Many people report similar issues on the net, but I have found no good answers so far.
Tue Mar 11 09:42:49.818 [DataFileSync] flushing mmaps took 75264ms for 46 files
The average mmap flush on my server is around 100 ms according to the mongo statistics.
A large percentage of our MongoDB data is updated within a few hours. This leads me to wonder whether we need to tune the Linux sysctl virtual memory parameters as described in the performance guide for Neo4j, another memory-mapped tool: http://docs.neo4j.org/chunked/stable/linux-performance-guide.html
There are a lot of blocks going out to IO, way more than expected for the write speed we are seeing in the benchmark. Another observation that can be made is that the Linux kernel has spawned a process called "flush-x:x" (run top) that seems to be consuming a lot of resources.
The problem here is that the Linux kernel is trying to be smart and write out dirty pages from the virtual memory. As the benchmark memory-maps a 1 GB file and does random writes, it is likely that 1/4 of the memory pages available on the system end up marked as dirty. The Neo4j kernel is not issuing any system calls asking the Linux kernel to write out these pages to disk, but the Linux kernel decides to start doing so anyway, and it is a very bad decision. The result is that instead of doing sequential-like writes down to disk (the logical log file) we are now doing random writes, writing regions of the memory-mapped file to disk.
top shows that we do indeed have a flush process that has been running for a very long time, so this seems to match.
  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+  COMMAND
28352 mongod  20   0  153g 3.2g 3.1g S  3.3 42.3 299:18.36  mongod
 3678 root    20   0     0    0    0 S  0.3  0.0  26:27.88  flush-253:1
The recommended Neo4j sysctl settings are:
vm.dirty_background_ratio = 50
vm.dirty_ratio = 80
Do these settings have any relevance for a MongoDB installation at all?
The short answer is "yes". What values to choose depends very much on your write patterns. This gives background on exactly how MongoDB manages its mappings - it's not anything unexpected.
One wrinkle is that in a web-facing database application, you may care about latency more than throughput. vm.dirty_background_ratio gives the threshold at which the kernel starts writing out dirty pages in the background, and vm.dirty_ratio gives the point at which a writing process must block (i.e., stop accepting new writes) until enough dirty pages have been flushed.
If you are hammering a relatively small working set, you can be OK with setting both of those values fairly high, and relying on Mongo's (or the OS's) periodic time-based flush-to-disk to commit the writes.
If you're conducting a high volume of inserts plus some modifications, which sounds like your situation, it's a balancing act that depends on the mix of inserts vs. rewrites: starting to flush too early causes writes of data that will soon be rewritten, "wasting" I/O, while starting to flush too late results in pauses as huge amounts of dirty data are flushed at once.
If you're doing mostly inserts, then you may very well want a large dirty_ratio (to avoid blocking) and a relatively small dirty_background_ratio (small enough to always be writing as you're inserting to reduce latency, and just large enough to linearize some of the writes).
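A purely illustrative starting point for that profile (these numbers are assumptions to be validated against your own workload, not values taken from MongoDB documentation):
vm.dirty_background_ratio = 5
vm.dirty_ratio = 60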
The correct solution is to replay some dummy data with various options for those sysctl parameters, and optimize it by brute force, bearing in mind your average latency / total throughput objectives.

How can I limit the cache used by copying so there is still memory available for other caches?

Basic situation:
I am copying some NTFS disks in openSUSE. Each one is 2 TB. When I do this, the system runs slow.
My guesses:
I believe it is likely due to caching. Linux decides to discard useful caches (for example, KDE 4 bloat, virtual machine disks, LibreOffice binaries, Thunderbird binaries, etc.) and instead fills all available memory (24 GB total) with data from the disks being copied, which will be read only once, then written and never used again. So any time I use these applications (or KDE 4), the disk needs to be read again, and reading all that bloat off the disk again makes things freeze and hiccup.
Due to the cache being gone and the fact that these bloated applications need lots of cache, this makes the system horribly slow.
Since it is USB, the disk and disk controller are not the bottleneck, so using ionice does not make it faster.
I believe it is the cache rather than just the motherboard being too slow, because if I stop all copying, it still runs choppily for a while until everything is re-cached.
And if I restart the copying, it takes a minute before it is choppy again. But I can also limit the copy to around 40 MB/s, and then it runs faster again (not because it has the right things cached, but because the motherboard buses have plenty of spare bandwidth for the system disks). I can fully accept a performance loss when my motherboard's I/O capacity is completely consumed (100% used means 0% wasted, which makes me happy), but I can't accept that this caching mechanism performs so terribly in this specific use case.
# free
             total       used       free     shared    buffers     cached
Mem:      24731556   24531876     199680          0    8834056   12998916
-/+ buffers/cache:    2698904   22032652
Swap:      4194300      24764    4169536
I also tried the same thing on Ubuntu, which causes a total system hang instead. ;)
And to clarify, I am not asking how to leave memory free for the "system", but for "cache". I know that cache memory is automatically given back to the system when needed, but my problem is that it is not reserved for caching of specific things.
Is there some way to tell these copy operations to limit memory usage so some important things remain cached, and therefore any slowdowns are a result of normal disk usage and not rereading the same commonly used files? For example, is there a setting of max memory per process/user/file system allowed to be used as cache/buffers?
The nocache command is the general answer to this problem! It is also in Debian and Ubuntu 13.10 (Saucy Salamander).
Thanks, Peter, for alerting us to the "--drop-cache" option in rsync. However, that option was rejected upstream (Bug 9560 - drop-cache option) in favor of a more general solution: the new nocache command, based on the rsync fadvise work.
You just prepend "nocache" to any command you want. It also has nice utilities for describing and modifying the cache status of files. For example, here are the effects with and without nocache:
$ ./cachestats ~/file.mp3
pages in cache: 154/1945 (7.9%) [filesize=7776.2K, pagesize=4K]
$ ./nocache cp ~/file.mp3 /tmp
$ ./cachestats ~/file.mp3
pages in cache: 154/1945 (7.9%) [filesize=7776.2K, pagesize=4K]
$ cp ~/file.mp3 /tmp
$ ./cachestats ~/file.mp3
pages in cache: 1945/1945 (100.0%) [filesize=7776.2K, pagesize=4K]
So hopefully that will work for other backup programs (rsnapshot, duplicity, rdiff-backup, amanda, s3sync, s3ql, tar, etc.) and other commands that you don't want trashing your cache.
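Typical usage is simply prefixing the command (the paths here are placeholders):
nocache rsync -a /source/dir/ /backup/dir/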
Kristof Provost was very close, but in my situation, I didn't want to use dd or write my own software, so the solution was to use the "--drop-cache" option in rsync.
I have used this many times since creating this question, and it seems to fix the problem completely. One exception was when using rsync to copy from a FreeBSD machine, which doesn't support "--drop-cache"; so I wrote a wrapper to replace the /usr/local/bin/rsync command and strip that option, and now it works when copying from there too.
It still uses a huge amount of memory for buffers and seems to keep almost no cache, but it works smoothly anyway.
$ free
             total       used       free     shared    buffers     cached
Mem:      24731544   24531576     199968          0   15349680     850624
-/+ buffers/cache:    8331272   16400272
Swap:      4194300     602648    3591652
You have practically two choices:
Limit the maximum disk buffer size: the problem you're seeing is probably caused by the default kernel configuration, which allows a huge portion of RAM to be used for disk buffering; when you try to write lots of data to a really slow device, you end up with much of your precious RAM tied up as write cache for that slow device.
The kernel does this because it assumes the processes can keep doing useful work while they are not being slowed down by the slow device, and that the RAM can be freed automatically when needed simply by writing the pages out to storage (the slow USB stick) - but the kernel does not take the actual performance of that device into account. The quick fix:
# Start background writeback once there is more than 50 MB of dirty memory
echo 50000000 > /proc/sys/vm/dirty_background_bytes
# Block writing processes once dirty memory exceeds 200 MB (source: http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages)
echo 200000000 > /proc/sys/vm/dirty_bytes
Adjust the numbers to match the amount of RAM you're willing to spend on disk write cache. A sensible value depends on your actual write performance, not on the amount of RAM you have: you should aim to have just enough RAM for caching to allow full write performance for your devices. Note that this is a global setting, so you have to choose it according to the slowest device you're using.
Reserve a minimum amount of memory for each task you want to keep fast. In practice this means creating cgroups for the things you care about and defining the minimum memory you want reserved for each such group. That way, the kernel can use the remaining memory as it sees fit. For details, see this presentation: SREcon19 Asia/Pacific - Linux Memory Management at Scale: Under the Hood
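As a minimal sketch, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup with the memory controller enabled, you could reserve roughly 2 GB for an interactive session like this (the group name is arbitrary):
sudo mkdir /sys/fs/cgroup/interactive
# memory protected by memory.low is reclaimed only when no unprotected memory is left
echo $((2 * 1024 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/interactive/memory.low
# move the current shell (and its future children) into the group
echo $$ | sudo tee /sys/fs/cgroup/interactive/cgroup.procs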
Update year 2022:
You can also try creating new file /etc/udev/rules.d/90-set-default-bdi-max_ratio-and-min_ratio.rules with the following contents:
# For every BDI device, set max cache usage to 30% and min reserved cache to 2% of the whole cache
# https://unix.stackexchange.com/a/481356/20336
ACTION=="add|change", SUBSYSTEM=="bdi", ATTR{max_ratio}="30", ATTR{min_ratio}="2"
The idea is to put a per-device limit on maximum cache utilization. With the above limit (30%) you can have two totally stalled devices and still have 40% of the disk cache available for the rest of the system. If you have 4 or more stalled devices in parallel, even this workaround cannot help on its own. That's why I have also added a minimum cache space of 2% for every device, although I don't know how to check whether that part is actually effective. I've been running with this config for about half a year and I think it's working nicely.
See https://unix.stackexchange.com/a/481356/20336 for details.
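One way to reload the rule and check that the values were applied (each device shows up under /sys/class/bdi/<major:minor>/):
sudo udevadm control --reload
sudo udevadm trigger --subsystem-match=bdi
grep . /sys/class/bdi/*/max_ratio /sys/class/bdi/*/min_ratio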
The kernel cannot know that you won't use the cached data from the copy again; that is your information advantage.
But you could set swappiness to 0: sudo sysctl vm.swappiness=0. This causes Linux to drop the cache before writing libraries, etc. out to swap.
It works nicely for me too, and is especially effective in combination with a large amount of RAM (16-32 GB).
It's not possible with plain old cp, but if you're willing to reimplement or patch it yourself, calling posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE) on both the input and output file will probably help.
posix_fadvise() tells the kernel about your intended access pattern. In this case, you'd only use the data once, so there is no point in caching it.
If the kernel honours these flags, it shouldn't keep caching the data; note, though, that POSIX_FADV_NOREUSE has historically been a no-op on Linux, so calling posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) after the data has been written is the more reliable way to evict it.
Try using dd instead of cp.
Or mount the filesystem with the sync flag.
I'm not completely sure whether these methods bypass the page cache, but it may be worth a try.
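For what it's worth, GNU coreutils dd can ask the kernel to drop the cached pages itself via posix_fadvise; a hedged sketch (the paths are placeholders):
# copy while advising the kernel not to keep the data in the page cache
dd if=/path/to/source of=/path/to/dest bs=1M iflag=nocache oflag=nocache status=progress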
I am copying some NTFS disks [...] the system runs slow. [...]
Since it is USB [...]
The slowdown is a known memory management issue.
Use a newer Linux Kernel. The older ones have a problem with USB data and "Transparent Huge Pages". See this LWN article. Very recently this issue was addressed - see "Memory Management" in LinuxChanges.
OK, now that I know that you're using rsync, I could dig a bit more:
It seems that rsync is inefficient when used with tons of files at the same time. There's an entry in their FAQ, and it's not a Linux/cache problem - it's an rsync problem, eating too much RAM.
Googling around, someone recommended splitting the syncing into multiple rsync invocations.

How to clean caches used by the Linux kernel

I want to force the Linux kernel to allocate more memory to applications after the cache starts taking up too much memory (as can be seen by the output of 'free').
I've run
sudo sync; sudo sysctl -w vm.drop_caches=3; free
(to free both the dentry/inode caches and the page cache) and I see that only about half of the used cache was freed - the rest remains. How can I tell what is taking up the rest of the cache and force it to be freed?
You may want to increase vfs_cache_pressure as well as set swappiness to 0.
Doing that will make the kernel reclaim cache faster, while giving processes equal or more favor when deciding what gets paged out.
You may only want to do this if processes you care about do very little disk I/O.
If a network I/O bound process has to swap in to serve requests, that's a problem and the real solution is to put it on a less competitive server.
With the default swappiness setting, the kernel is almost always going to favour keeping FS related cache in real memory.
As such, if you increase the cache pressure, be sure to equally adjust swappiness.
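For example (200 is just an illustrative value above the default of 100):
sudo sysctl -w vm.vfs_cache_pressure=200 vm.swappiness=0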
The contents of /proc/meminfo tell you what the kernel uses RAM for.
You can use /proc/sys/vm/vfs_cache_pressure to force the kernel to reclaim memory that is used for filesystem-related caches more lazily or eagerly.
Note that your application may only benefit from tuning this parameter if it does little or no disk I/O.
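To see where the remaining cache lives, these /proc/meminfo fields are the usual suspects; tmpfs/Shmem pages and unreclaimable slab often account for the part that drop_caches cannot free:
grep -E '^(Cached|Buffers|Shmem|Slab|SReclaimable|SUnreclaim):' /proc/meminfo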
You might find John Nilsson's answer to my Question useful for purging the cache in order to test whether that is related to your problem:
sync && echo 1 > /proc/sys/vm/drop_caches
Though I'm guessing the only real difference is 1 vs 3

Resources