stop file caching for a process and its children - linux

I have a process that reads thousands of small files ONE TIME. The cached data is not needed after this. The process proceeds at full speed until most memory is consumed by the file cache and then it slows down. I don't understand the slowdown, since freeing cache memory and allocating space for the next file should be a matter of microseconds. Hard page faults also increase when this threshold is reached. The OS is vanilla Ubuntu 16.04.
I would like to limit the file caching for this process only.
This is a user process, so using a privileged shell command to purge the cache is not a solution. Using fadvise on a per-file level is not a solution, since the files are being read by multiple library programs depending on the file type.
What I need is a process-level option: do not cache, or set a low size limit like 100 MB. I have searched for this and found nothing. Is this really the case? Seems like something big that is missing.
Any insight on the apparent memory management performance issue?

Here's the strict answer to your question. If you are mmap-ing your files, the way to do this is using madvise() and MADV_DONTNEED:
MADV_DONTNEED
Do not expect access in the near future. (For the time being,
the application is finished with the given range, so the kernel
can free resources associated with it.) Subsequent
accesses of pages in this range will succeed, but will result
either in reloading of the memory contents from the underlying
mapped file (see mmap(2)) or zero-fill-on-demand pages for
mappings without an underlying file.
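For a memory-mapped read, the call sequence might look like the following minimal Python sketch (assuming Python 3.8+ on Linux, where mmap.madvise wraps madvise(2)). The helper name checksum_mapped_file is hypothetical. MADV_DONTNEED releases this process's mapping of the pages; it does not by itself guarantee that the file data leaves the page cache.

```python
import mmap


def checksum_mapped_file(path):
    """Sum a file's bytes through a read-only mapping, then hint that
    the mapped range is no longer needed."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this fails for empty files.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            total = sum(m[:])
            # Wraps madvise(2) over the entire mapping.
            m.madvise(mmap.MADV_DONTNEED)
            return total
```

The hint is issued only after the data has been consumed; later accesses to the mapping would still succeed and simply fault the pages back in.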
There is to my knowledge no way of doing it with files that are simply opened, read (using read() or similar) and closed.
However, it sounds to me like this is not in fact the issue. Are you sure it's buffer / cache that is growing here, and not something else? (e.g. perhaps you are reading them into RAM and not freeing that RAM, or not closing them, or similar)
You can tell by doing:
echo 3 > /proc/sys/vm/drop_caches
if you don't get all the memory back, then it's your program which is leaking something.
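Writing to drop_caches requires root. An unprivileged process can at least watch how large the page cache is by reading the Cached field of /proc/meminfo before and after its run. The following is a small sketch of that; the function names parse_meminfo and page_cache_kb are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values
    (in kB for the fields that carry a kB unit)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def page_cache_kb(path="/proc/meminfo"):
    """Return the current 'Cached' figure in kB."""
    with open(path) as f:
        return parse_meminfo(f.read())["Cached"]
```

If Cached grows by roughly the amount of data read while the process's own resident size stays flat, the memory is going to the page cache rather than to a leak.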

I am convinced there is no way to stop file caching on a per-process level. The program must have direct control over file I/O, with access to the file descriptors so that posix_fadvise() can be used (or to the mappings, for madvise()). You cannot do this if library functions are doing all the file reading and you are not willing to modify them. This does look like a design gap that should be filled.
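To illustrate what "direct control over file I/O" means here, this is a minimal sketch of a read-once helper that holds the file descriptor itself and can therefore issue the advice. It assumes Linux and Python's os.posix_fadvise; the name read_once and the 1 MiB chunk size are illustrative choices, not anything from the question.

```python
import os


def read_once(path, chunk_size=1 << 20):
    """Read a whole file, then advise the kernel that its cached
    pages will not be needed again."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
        # Offset 0 with length 0 covers the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

A library that opens and closes files internally never exposes the descriptor, which is exactly why this approach does not help in the situation described.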
HOWEVER: My assertion of some performance issue with memory management was wrong. The reason for the process slow-down as the file cache grows and free memory shrinks was something else: disk seek distances were growing during the process. Other tests have verified that allocating memory does not significantly slow down as the file cache grows and free memory shrinks.

Related

Hey could someone help me understand sync syscall usage?

As the title says, I don't really understand the usage of this syscall. I was writing a program that writes some data to a file, and the tutorial I've seen told me to use the sys_sync syscall. But my problem is why and when should we use this? Isn't the data already written to the file?
The manual says:
sync - Synchronize cached writes to persistent storage
So it is written to the file cache in memory, not on disk.
You rarely have to use sync unless you are writing really important data and need to make sure that data is on disk before you go on. One example of systems that use sync a lot are databases (such as MySQL or PostgreSQL).
So in other words, it is theoretically in your file, just not on disk. Therefore, if you lose power, you could lose the data. This is especially true if you have a lot of RAM and many writes in a row: the kernel may favor writing to the cache for a long while, increasing the risk of data loss.
But how can a file be not on the disk? I understand the concept of cache but if I wrote in the disk why would it be in a different place?
First, when you write to a file, you send the data to the kernel. You don't directly send it to the disk. Some kernel driver is then responsible for writing the data to disk. In my days on Apple 2 and Amiga computers, I would actually directly read/write to disk. And at least the Amiga had DMA, so you could set up a buffer, then tell the disk I/O to do a read or a write and it would send you an interrupt when done. On the Apple 2, you had to write loops in assembly language with precise timings to read/write data on floppy disks... A different era!
Although you could, of course, directly access the disk (but with a kernel like Linux, you'd have to make sure the kernel gives you a free hand to do that...).
Cache is primarily used for speed. It is very slow to write to disk (as far as a human is concerned, it looks extremely fast, but compared to how much data the CPU can push to the drive, it's still slow).
So what happens is that the kernel has a task to write data to disk. That task wakes up as soon as data appears in the cache and ends once all the caches are transferred to disk. This task works in parallel. You can have one such task per drive (which is especially useful when you have a system such as RAID 1).
If your application fills up the cache, then a further write will block until some of the cache can be replaced.
and the tutorial I've seen told me to use sys_sync syscall
Well that sounds silly, unless you're doing filesystem write benchmarking or something.
If you have one really critical file that you want to make sure is "durable" wrt. power outages before you do something else (like sending a network packet to acknowledge a complete transfer), use fsync(fd) to sync just that one file's data and metadata.
(In asm, call number SYS_fsync from sys/syscall.h, with the file descriptor as the first register arg.)
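In a higher-level language the same idea might be sketched as follows. This is a minimal Python example using os.fsync, which wraps fsync(2); the helper name write_durably is hypothetical.

```python
import os


def write_durably(path, data):
    """Write data to path and block until the kernel reports that the
    file's data and metadata have reached the storage device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # flushes only this file, unlike sync(2)
    finally:
        os.close(fd)
```

Only after write_durably returns would the program go on to, say, acknowledge the transfer over the network.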
But my problem is why and when should we use this?
Generally never use the sync system call in programs you're writing.
There are interactive use-cases where you'd normally use the wrapper command of the same name, sync(1). e.g. with removable media, to get the kernel started doing write-back now, so unmount will take less time once you finish typing it. Or for some benchmarking use-cases.
The system shutdown scripts may run sync after unmounting filesystems (and remounting / read-only), before making a reboot(2) system call.
Re: why sync(2) exists
No, your data isn't already on disk right after echo foo > bar.txt.
Most OSes, including Linux, do write-back caching, not write-through, for file writes.
You don't want write() system calls to wait for an actual magnetic disk when there's free RAM, because the traditional way to do I/O is synchronous so simple single-threaded programs wouldn't be able to do anything else (like reading more data or computing anything) while waiting for write() to return. Blocking for ~10 ms on every write system call would be disastrous; that's as long as a whole scheduler timeslice. (It would still be bad even with SSDs, but of course OSes were designed before SSDs were a thing.) Even just queueing up the DMA would be slow, especially for small file writes that aren't a whole number of aligned sectors, so even letting the disk's own write-back write caching work wouldn't be good enough.
Therefore, file writes do create "dirty" pages of kernel buffers that haven't yet been sent to the disk. Sometimes we can even avoid the IO entirely, e.g. for tmp files that get deleted before anything triggers write-back. On Linux, dirty_writeback_centisecs defaults to 500 (the flusher threads wake up every 5 seconds) and dirty_expire_centisecs defaults to 3000 (data that has been dirty for more than 30 seconds gets written out), unless the kernel is running low on free pages before then. (Heuristics for what "low" means use other tunable values.)
If you really want writes to flush to disk immediately and wait for data to be on disk, mount with -o sync. Or for one program, have it use open(O_SYNC) or O_DSYNC (for just the data, not metadata like timestamps).
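For the one-program case, a minimal sketch of opening a file with O_DSYNC might look like this in Python (assuming Linux, where os.O_DSYNC is defined). The function name append_record_sync is hypothetical.

```python
import os


def append_record_sync(path, record):
    """Append one record; with O_DSYNC each write() returns only after
    the data has been handed to stable storage."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC
    fd = os.open(path, flags, 0o644)
    try:
        return os.write(fd, record)
    finally:
        os.close(fd)
```

Every call pays the full device latency, so this suits small, infrequent, important records rather than bulk output.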
See Are file reads served from dirtied pages in the page cache?
There are other advantages to write-back, including delayed allocation even at the filesystem level. The FS can wait until it knows how big the file will be before even deciding where to put it, allowing better decisions that reduce fragmentation. e.g. a small file can go into a gap that would have been a bad place to start a potentially-large file. (It just has to reserve space to make sure it can put it somewhere.) XFS was one of the first filesystems to do "lazy" delayed allocation, and ext4 has also had the feature for a while.
https://en.wikipedia.org/wiki/XFS#Delayed_allocation
https://en.wikipedia.org/wiki/Allocate-on-flush
https://lwn.net/Articles/323169/

mmap: will the mapped file be loaded into memory immediately?

From the manual, I just know that mmap() maps a file to a virtual address space, so the file can be randomly accessed. But, it is unclear to me that whether the mapped file is loaded into memory immediately? I guess that kernel manages the mapped memory by pages, and they are loaded on demand, if I only do a few of reads and writes, only a few pages are loaded. Is it correct?
No, yes, maybe. It depends.
Calling mmap generally only means that to your application, the mapped file's contents are mapped to its address space as if the file was loaded there. Or, as if the file really existed in memory, as if they were one and the same (which includes changes being written back to disk, assuming you have write access).
No more, no less. It has no notion of loading something, nor does the application know what this means.
An application does not truly have knowledge of any such thing as memory, although the virtual memory system makes it appear like that. The memory that an application can "see" (and access) may or may not correspond to actual physical memory, and this can in principle change at any time, without prior warning, and without an obvious reason (obvious to your application).
Other than possibly experiencing a small delay due to a page fault, an application is (in principle) entirely unaware of any such thing happening and has little or no control over it [1].
Applications will, generally, load pages from mapped files (including the main executable!) on demand, as a consequence of encountering a fault. However, an operating system will usually try to speculatively prefetch data to optimize performance.
In practice, calling mmap will immediately begin to (asynchronously) prefetch pages from the beginning of the mapping, up to a certain implementation-specified size. Which means, in principle, for small files the answer would be "yes", and for larger files it would be "no".
However, mmap does not block to wait for completion of the readahead, which means that you have no guarantee that any of the file is in RAM immediately after mmap returns (not that you have that guarantee at any time anyway!). Insofar, the answer is "maybe".
Under Linux, last time I looked, the default prefetch size was 31 blocks (~127k) -- but this may have changed, plus it's a tuneable parameter. As pages near or at the end of the prefetched area are touched, more pages are being prefetched asynchronously.
If you have hinted MADV_RANDOM to madvise, prefetching is "less likely to happen". Under Linux, this completely disables prefetch.
On the other hand, giving the MADV_SEQUENTIAL hint will asynchronously prefetch "more aggressively" beginning from the beginning of the mapping (and may discard accessed pages quicker). Under Linux, "more aggressively" means twice the normal amount.
Giving the MADV_WILLNEED hint suggests (but does not guarantee) that all pages in the given range are loaded as soon as possible (since you're saying you're going to access them). The OS may ignore this, but under Linux, it is treated rather as an order than a hint, up to the process' maximum RSS limit, and an implementation-specified limit (if I remember correctly, 1/2 the amount of physical RAM).
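As an illustration of giving such a hint, here is a minimal Python sketch (assuming Python 3.8+ on Linux) that declares a front-to-back scan with MADV_SEQUENTIAL before reading. The function name count_newlines_sequential is hypothetical, and how much the hint changes readahead is up to the kernel.

```python
import mmap


def count_newlines_sequential(path):
    """Count newline bytes in a file, hinting a front-to-back scan."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Advice applies to the whole mapping by default.
            m.madvise(mmap.MADV_SEQUENTIAL)
            count = 0
            pos = m.find(b"\n")
            while pos != -1:
                count += 1
                pos = m.find(b"\n", pos + 1)
            return count
```

Swapping in mmap.MADV_RANDOM or mmap.MADV_WILLNEED at the same point is how the other hints discussed above would be passed.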
Note that MADV_DONTNEED is arguably implemented wrongly under Linux. The hint is not interpreted in the way specified by POSIX, i.e. you're OK with pages being paged out for the moment, but rather that you mean to discard them. Which makes no big difference for readonly mapped pages (other than a small delay, which you said would be OK), but it sure does matter for everything else.
In particular, using MADV_DONTNEED thinking Linux will release unneeded pages after the OS has written them lazily to disk is not how things work! You must explicitly sync, or prepare for a surprise.
Having called readahead on the file descriptor prior to calling mmap (or alternatively, having had read/written the file previously), the file's contents will in practice indeed be in RAM immediately.
This is, however, only an implementation detail (unified virtual memory system), and subject to memory pressure on the system.
Calling mlock will -- assuming it succeeds [2] -- immediately load the requested pages into RAM. It blocks until all pages are physically present, and you have the guarantee that the pages will stay in RAM until you unlock them.
[1] There exists functionality to query (mincore) whether any or all of the pages in a particular range are actually present at the very moment, functionality to hint the OS about what you would like to see happening without any hard guarantees (madvise), and finally functionality to force a limited subset of pages to be present in memory (mlock) for privileged processes.
[2] It might not, both for lack of privileges and for exceeding quotas or the amount of physical RAM present.
Yes, mmap creates a mapping. It does not normally read the entire content of whatever you have mapped into memory. If you wish to do that you can use the mlock/mlockall system call to force the kernel to read into RAM the content of the mapping, if applicable.
By default, mmap() only configures the mapping and returns (fast).
Linux (at least) has the option MAP_POPULATE (see 'man mmap') that does exactly what your question is about.
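A minimal sketch of requesting this from Python follows. It assumes Linux; the mmap module exposes MAP_POPULATE only from Python 3.10 on, so the sketch falls back to a plain mapping on older versions. The function name map_populated is hypothetical.

```python
import mmap


def map_populated(path):
    """Map a file read-only and ask the kernel to prefault its pages."""
    # Fall back to 0 (no extra flag) where the constant is missing.
    populate = getattr(mmap, "MAP_POPULATE", 0)
    with open(path, "rb") as f:
        # mmap duplicates the descriptor, so closing f afterwards is safe.
        return mmap.mmap(
            f.fileno(),
            0,
            flags=mmap.MAP_SHARED | populate,
            prot=mmap.PROT_READ,
        )
```

The caller is responsible for closing the returned mapping. Pages populated this way can still be evicted later under memory pressure; only mlock pins them.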
Yes. The whole point of mmap is that it manages memory more efficiently than just slurping everything into memory.
Of course, any given implementation may in some situations decide that it's more efficient to read in the whole file in one go, but that should be transparent to the program calling mmap.

How to prioritize write() over mmap updates (or delay mmap page cache flush)

I'm running a specialized DB daemon on a debian-64 with 64G of RAM and lots of disk space. It uses an on-disk hashtable (mmaped) and writes the actual data into a file with regular write() calls. When doing really a lot of updates, a big part of the mmap gets dirty and the page cache tries to flush it to disk, producing lots of random writes which in turn slows down the performance of the regular (sequential) writes to the data file.
If it were possible to delay the page cache flush of the mmaped area performance would improve (I assume), since several (or all) changes to the dirty page would be written at once instead of once for every update (worst case, in reality of course it aggregates a lot of changes anyway).
So my question: Is it possible to delay page cache flush for a memory-mapped area? Or is it possible to prioritize the regular write? Or does anyone have any other ideas? madvise and posix_fadvise don't seem to make any difference...
You could play with the tuneables in /proc/sys/vm. For example, increase the value in dirty_writeback_centisecs to make pdflush wake up somewhat less often, increase dirty_expire_centisecs so data is allowed to stay dirty for longer until it must be written out, and increase dirty_background_ratio to allow more dirty pages to stay in RAM before something must be done.
See here for a somewhat comprehensive description of what all the values do.
Note that this will affect every process on your machine, but seeing how you're running a huge database server, chances are that this is no problem since you don't want anything else to run on the same machine anyway.
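Before changing anything it helps to record the current values. The following is a small sketch that reads them; the function name read_vm_tunables is hypothetical, and the file names used in the usage note (dirty_writeback_centisecs, dirty_expire_centisecs, dirty_background_ratio) are the ones found under /proc/sys/vm on Linux. Reading works for any user; writing new values requires root.

```python
import os


def read_vm_tunables(names, base="/proc/sys/vm"):
    """Return {name: integer value} for the given vm sysctl files."""
    values = {}
    for name in names:
        with open(os.path.join(base, name)) as f:
            values[name] = int(f.read().strip())
    return values
```

A typical call would be read_vm_tunables(["dirty_writeback_centisecs", "dirty_expire_centisecs", "dirty_background_ratio"]).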
Now of course this delays writes, but it still doesn't fully solve the problem of dirty page writebacks competing with write (though it will likely collapse a few writes if there are many updates).
But: You can use the sync_file_range syscall to force beginning write-out of pages in a given range on your "write" file descriptor (SYNC_FILE_RANGE_WRITE). So while the dirty pages will be written back at some unknown time later (and with greater grace periods), you manually kick off writeback on the ones you're interested in.
This doesn't give any guarantees, but it should just work.
Be sure to absolutely positively read the documentation, better read it twice. sync_file_range can very easily corrupt or lose data if you use it wrong. In particular, you must be sure metadata is up-to-date and flushed if you appended to a file, or data that has been "successfully written" will just be "gone" in case of a crash.
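Python has no wrapper for sync_file_range, so the sketch below calls the glibc function through ctypes. It assumes 64-bit Linux with glibc; the constant value 2 for SYNC_FILE_RANGE_WRITE is taken from the Linux headers, and the function name start_writeback is hypothetical. It only starts writeback and gives no durability guarantee, so the caveats above apply in full.

```python
import ctypes
import os

SYNC_FILE_RANGE_WRITE = 2  # value from the Linux <fcntl.h> headers

_libc = ctypes.CDLL(None, use_errno=True)
_libc.sync_file_range.argtypes = [
    ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint,
]
_libc.sync_file_range.restype = ctypes.c_int


def start_writeback(fd, offset, nbytes):
    """Ask the kernel to begin writing back dirty pages in the range.
    Returns without waiting for the I/O to finish."""
    rc = _libc.sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE)
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

In the scenario from the question, this would be called on the data file's descriptor after each batch of sequential write() calls, with the offset and length of that batch.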
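I would try mlock. If you mlock the relevant memory range, it may prevent the flush from occurring, although mlock is only documented to keep pages resident in RAM, not to suppress writeback of dirty file-backed pages, so test this before relying on it. You could munlock when you're done.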

unbuffered I/O in Linux

I'm writing lots and lots of data that will not be read again for weeks - as my program runs the amount of free memory on the machine (displayed with 'free' or 'top') drops very quickly, the amount of memory my app uses does not increase - neither does the amount of memory used by other processes.
This leads me to believe the memory is being consumed by the filesystem's cache. Since I do not intend to read this data for a long time, I'm hoping to bypass the system's buffers, such that my data is written directly to disk. I don't have dreams of improving perf or being a super ninja; my hope is to give a hint to the filesystem that I'm not going to be coming back for this memory any time soon, so don't spend time optimizing for those cases.
On Windows I've faced similar problems and fixed the problem using FILE_FLAG_NO_BUFFERING|FILE_FLAG_WRITE_THROUGH - the machine's memory was not consumed by my app and the machine was more usable in general. I'm hoping to duplicate the improvements I've seen but on Linux. On Windows there is the restriction of writing in sector sized pieces, I'm happy with this restriction for the amount of gain I've measured.
Is there a similar way to do this in Linux?
The closest equivalent to the Windows flags you mention I can think of is to open your file with the open(2) flags O_DIRECT | O_SYNC:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In
general this will degrade performance, but it is useful in special
situations, such as when applications do their own caching. File I/O
is done directly to/from user space buffers. The O_DIRECT flag on its
own makes an effort to transfer data synchronously, but does not
give the guarantees of the O_SYNC that data and necessary metadata are
transferred. To guarantee synchronous I/O the O_SYNC must be used in
addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is
described in raw(8).
Granted, trying to do research on this flag to confirm it's what you want I found this interesting piece telling you that unbuffered I/O is a bad idea, Linus describing it as "brain damaged". According to that you should be using madvise() instead to tell the kernel how to cache pages. YMMV.
You can use O_DIRECT, but in that case you need to do the block IO yourself; you must write in multiples of the FS block size and on block boundaries (it is possible that it is not mandatory but if you do not its performance will suck x1000 because every unaligned write will need a read first).
Another, much less invasive way of stopping your blocks from using up the OS cache without using O_DIRECT is to use posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED). Under Linux 2.6 kernels which support it, this immediately discards (clean) blocks from the cache. Of course you need to use fdatasync() or suchlike first, otherwise the blocks may still be dirty and hence won't be cleared from the cache.
It is probably a bad idea to call fdatasync() and posix_fadvise( ... POSIX_FADV_DONTNEED) after every write; instead, wait until you've written a reasonable amount (50M, 100M maybe).
So in short
after every (significant chunk) of writes,
Call fdatasync followed by posix_fadvise( ... POSIX_FADV_DONTNEED)
This will flush the data to disc and immediately remove them from the OS cache, leaving space for more important things.
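The recipe above could be sketched like this in Python (assuming Linux, where os.fdatasync and os.posix_fadvise are available). The function name write_and_drop and the 64 MiB default threshold are illustrative assumptions within the 50M to 100M range suggested above.

```python
import os


def write_and_drop(fd, chunks, flush_every=64 * 1024 * 1024):
    """Write chunks to fd; every flush_every bytes, force the data to
    disk and advise the kernel to drop the now-clean cached pages."""
    pending = 0
    total = 0
    for chunk in chunks:
        view = memoryview(chunk)
        while view:
            # os.write may write fewer bytes than requested.
            n = os.write(fd, view)
            view = view[n:]
        pending += len(chunk)
        total += len(chunk)
        if pending >= flush_every:
            os.fdatasync(fd)
            # Offset 0 with length 0 covers the whole file.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            pending = 0
    if pending:
        os.fdatasync(fd)
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    return total
```

The fdatasync call comes first on purpose: pages that are still dirty cannot be dropped by the advice.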
Some users have found that things like fast-growing log files can easily blow "more useful" stuff out of the disc cache, which reduces cache hits a lot on a box which needs to have a lot of read cache, but also writes logs quickly. This is the main motivation for this feature.
However, like any optimisation
a) You're not going to need it so
b) Do not do it (yet)
as my program runs the amount of free memory on the machine drops very quickly
Why is this a problem? Free memory is memory that isn't serving any useful purpose. When it's used to cache data, at least there is a chance it will be useful.
If one of your programs requests more memory, file caches will be the first thing to go. Linux knows that it can re-read that data from disk whenever it wants, so it will just reap the memory and give it a new use.
It's true that Linux by default waits around 30 seconds (this is what the value used to be anyhow) before flushing writes to disk. You can speed this up with a call to fsync(). But once the data has been written to disk, there's practically zero cost to keeping a cache of the data in memory.
Seeing as you write to the file and don't read from it, Linux will probably guess that this data is the best to throw out, in preference to other cached data. So don't waste effort trying to optimise unless you've confirmed that it's a performance problem.

Why do discussions of "swappiness" act like information can only be in one place at a time?

I've been reading up on Linux's "swappiness" tuneable, which controls how aggressive the kernel is about swapping applications' memory to disk when they're not being used. If you Google the term, you get a lot of pages like this discussing the pros and cons. In a nutshell, the argument goes like this:
If your swappiness is too low, inactive applications will hog all the system memory that other programs might want to use.
If your swappiness is too high, when you wake up those inactive applications, there's going to be a big delay as their state is read back off the disk.
This argument doesn't make sense to me. If I have an inactive application that's using a ton of memory, why doesn't the kernel page its memory to disk AND leave another copy of that data in-memory? This seems to give the best of both worlds: if another application needs that memory, it can immediately claim the physical RAM and start writing to it, since another copy of it is on disk and can be swapped back in when the inactive application is woken up. And when the original app wakes up, any of its pages that are still in RAM can be used as-is, without having to pull them off the disk.
Or am I missing something?
If I have an inactive application that's using a ton of memory, why doesn't the kernel page its memory to disk AND leave another copy of that data in-memory?
Let's say we did it. We wrote the page to disk, but left it in memory. A while later another process needs memory, so we want to kick out the page from the first process.
We need to know with absolute certainty whether the first process has modified the page since it was written out to disk. If it has, we have to write it out again. The way we would track this is to take away the process's write permission to the page back when we first wrote it out to disk. If the process tries to write to the page again there will be a page fault. The kernel can note that the process has dirtied the page (and will therefore need to be written out again) before restoring the write permission and allowing the application to continue.
Therein lies the problem. Taking away write permission from the page is actually somewhat expensive, particularly in multiprocessor machines. It is important that all CPUs purge their cache of page translations to make sure they take away the write permission.
If the process does write to the page, taking a page fault is even more expensive. I'd presume that a non-trivial number of these pages would end up taking that fault, which eats into the gains we were looking for by leaving it in memory.
So is it worth doing? I honestly don't know. I'm just trying to explain why leaving the page in memory isn't so obvious a win as it sounds.
This whole thing is very similar to a mechanism called Copy-On-Write, which is used when a process fork()s. The child process is very likely going to execute just a few instructions and call exec(), so it would be silly to copy all of the parent's pages. Instead the write permission is taken away and the child is simply allowed to run. Copy-On-Write is a win because the page fault is almost never taken: the child almost always calls exec() immediately.
Even if you page the app's memory to disk and keep it in memory, you would still have to decide when an application should be considered "inactive", and that's what swappiness controls. Paging to disk is expensive in terms of IO and you don't want to do it too often. There is also another variable in this equation, and that is the fact that Linux uses remaining memory as disk buffers/cache.
According to this [1], that is exactly what Linux does.
I'm still trying to make sense of a lot of this, so any authoritative links would be appreciated.
The first thing the VM does is clean pages and move them to the clean list.
When cleaning anonymous memory (things which do not have an actual file backing store, you can see the segments in /proc/<pid>/maps which are anonymous and have no filesystem vnode storage behind them), the first thing the VM is going to do is take the "dirty" pages and "clean" them by writing the contents of the page out to swap. Now when the VM has a shortage of completely free memory and is worried about its ability to grant new free pages to be used, it can go through the list of 'clean' pages and, based on how recently they were used and what kind of memory they are, it will move those pages to the free list.
Once the memory pages are placed on the free list, they are no longer associated with the contents they had before. If a program comes along and references the memory location the page was serving previously, the program will take a major fault and a (most likely completely different) page will be grabbed from the free list and the data will be read into the page from disk. Once this is done, the page is actually still 'clean' since it has not been modified. If the VM chooses to use that page on swap for a different page in RAM then the page would be again 'dirtied', or if the app wrote to that page it would be 'dirtied'. And then the process begins again.
Also, swappiness is pretty horrible for server applications in a business/transactional/online/latency-sensitive environment. When I've got 16GB RAM boxes where I'm not running a lot of browsers and GUIs, I typically want all my apps nearly pinned in memory. The bulk of my RAM tends to be 8-10GB java heaps that I NEVER want paged to disk, ever, and the cruft that is available is processes like mingetty (but even there the glibc pages in those apps are shared by other apps and actually used, so even the RSS of those useless processes is mostly shared, used pages). I normally don't see more than a few 10MBs of the 16GB actually cleaned to swap. I would advise very, very low swappiness numbers or zero swappiness for servers -- the unused pages should be a small fraction of the overall RAM and trying to reclaim that relatively tiny amount of RAM for buffer cache risks swapping application pages and taking latency hits in the running app.
