How can I shrink the Linux page cache from within kernel space? - linux

I'm working on a system that involves some custom hardware and a custom Linux device driver I wrote for the hardware. The system occasionally needs to move large amounts of data very rapidly and therefore my driver dynamically (i.e. when needed) allocates large (1 GB) DMA buffers which are used and then freed when they are no longer needed. To allocate such large buffers I actually allocate a bunch of smaller buffers (256 X 4MB) using dma_alloc_coherent and then map them contiguously into user space using remap_pfn_range. This works very well most of the time.
During testing, after the system has been running test cases for a long time, I sometimes see DMA allocation failures where one of the dma_alloc_coherent calls in my driver fails which causes my application layer software to crash. I was finally able to track down this problem and I discovered that when I see DMA allocation failures the Linux kernel page cache is very full.
For example, on the last failure that I captured the page cache filled 27 GB of the 32 GB of RAM on my system. I suspected that the page cache "fullness" was causing dma_alloc_coherent calls to fail. To test this theory I manually emptied the page cache using:
# echo 1 > /proc/sys/vm/drop_caches
This dropped the size of the cache from 27 GB to 94 MB and I was able to allocate 20+ 1 GB DMA buffers with no issues.
Clearly the page cache is a beneficial thing so I would prefer not to have to completely empty it every time I run out of space when allocating DMA buffers. My questions is this: how can I dynamically shrink the page cache in kernel space such that if a call to dma_alloc_coherent fails I can recover just enough space so that I can retry the call and have it succeed?
My system is x86_64 based running a 3.16.x Linux kernel.
I have found some vague references that suggest what I'm attempting may be possible, for example "These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system." (from: https://www.kernel.org/doc/Documentation/sysctl/vm.txt). But I have not yet found any specifics that indicate how the memory is reclaimed.
Any assistance with this would be greatly appreciated!

TL;DR : Scan for active superblocks and drop references to non-dirty ones until you have reclaimed as much system memory as you need. (or you finally run out of references to active superblocks.)
How to write kernel code to dynamically shrink the fs page-cache,
to recover just enough space so that a subsequent call to dma_alloc_coherent() succeeds?
To answer this question, let us take a look at what the "drop_caches operation" did to reduce the fs page-cache from 27GB to 94MB on your system.
echo 1 > /proc/sys/vm/drop_caches
invokes
drop_caches_sysctl_handler()
which in turn invokes iterate_supers() and
passes it the pointer to the function drop_pagecache_sb().
What happens next is that iterate_supers() scans for active superblocks and everytime it finds one, it calls drop_pagecache_sb(), passing it a reference to the active superblock.
This iterative procedure continues until references to all the active superblocks are freed from the fs page-cache. This is a non-destructive operation and will only free blocks that are completely unused. Dirty-objects will continue to be in use until written out to disk and are not free-able. If you run sync first to flush them out to disk, the "drop_caches operation" tends to free more memory.
Since you are interested in running this process to reclaim a limited/known amount of memory i.e. what is soon going to be requested using dma_alloc_coherent(), you simply need to implement the above functionality with an additional check at the end of each iteration and abort the superblock scan immediately once the amount of free system memory crosses the desired level.
A couple of points to keep in mind to further optimise this procedure :
Is there a preference for certain block devices over others?
You may want to iterate over active superblocks of the block devices that you do not care about first. If enough memory is not reclaimed, then scan the block devices that you would prefer to retain in the fs page-cache unless absolutely necessary to reclaim required memory. get_active_super() might be of help here.
iterate_supers_type() seems interesting
It allows one to iterate over superblocks of specific file_system_type
Please note that this is a speculative solution based purely on the analysis of existing code within the Linux kernel that you have observed to already solve your problem. Once the above approach is implemented, it will only allow you to control the same i.e. attempt to reclaim fs page-cache memory only to the extent required for your immediate needs.

Technically when certain allocation fails then Kernel will try to free memory.Depending upon memory failures(soft failure/hard failure). Hard failures causes Kernel to enter into direct reclaim path. Direct reclaim is costly operation which might take undefined time to complete and even after that allocation might fail.
Here you have two options:
1) Play with VM settings like dirty_ratio,dirty_background_ratio etc to maintain free ram. see : https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-tunables.html
2) Write a kernel daemon, which calls kernel function which handles drop_cache (because drop_cache migh sleep).

Related

Do Linux and macOS have an `OfferVirtualMemory` counterpart?

Windows, starting with a certain unspecified update of Windows 8.1, has the excellent OfferVirtualMemory and ReclaimVirtualMemory system calls which allow memory regions to be "offered" to the OS. This removes them from the working set, reduces the amount of physical memory usage that is attributed to the calling process, and puts them onto the standby memory list of the program, but without ever swapping out the contents anywhere.
(Below is a brief and rough explanation of what those do and how standby lists work, to help people understand what kind of system call I'm looking for, so skip ahead if you already know all of this.)
Quick standby list reference
Pages in the standby list can be returned back to the working set of the process, which is when their contents are swapped out to disk and the physical memory is used for housing a fresh allocation or swapping in memory from disk (if there's no available "dead weight" zeroed memory on the system), or no swapping happens and the physical memory is returned to the same virtual memory region they were first removed from, sidestepping the swapping process while still having reduced the working set of the program to, well, the memory it's actively working on, back when they were removed from the working set and put into the standby list to begin with.
Alternatively, if another program requests physical memory and the system doesn't have zeroed pages (if no program was closed recently, for example, and the rest of RAM has been used up with various system caches), physical memory from the standby list of a program can be zeroed, removed from the standby list, and handed over to the program which requested the memory.
Back to memory offering
Since the offered memory never gets swapped out if, upon being removed from the standby list, it no longer belongs to the same virtual memory segment (removed from standby by anything other than ReclaimVirtualMemory), the reclamation process can fail, reporting that the contents of the memory region are now undefined (uninitialized memory has been fetched from the program's own standby list or from zeroed memory). This means that the program will have to re-generate the contents of the memory region from another data source, or by rerunning some computation.
The practical effect, when used to implement an intelligent computation cache system, is that, firstly, the reported working set of the program is reduced, giving a more accurate picture of how much memory it really needs. Secondly, the cached data, which can be re-generated from another region of memory, can be quickly discarded for another program to use that cache, without waiting for the disk (and putting additional strain on it, which adds up over time and results in increased wear) as it swaps out the contents of the cache, which aren't too expensive to recreate.
One good example of a use case is the render cache of a web browser, where it can just re-render parts of the page upon request, and has little to no use in having those caches taking up the working set and bugging the user which high memory usage. Pages which aren't currently being shown are the moment where this approach may give the biggest theoretical yield.
The question
Do Linux and macOS have a comparable API set that allows memory to be marked as discardable at the memory manager's discretion, with a fallible system call to lock that memory back in, declaring the memory uninitialized if it was indeed discarded?
Linux 4.5 and later has madvise with the MADV_FREE, the memory may be replaced with pages of zeros anytime until they are next written.
To lock the memory back in write to it, then read it to check if it has been zeroed. This needs to be done separately for every page.
Before Linux 4.12 the memory was freed immediately on systems without swap.
You need to take care of compiler memory reordering so use atomic_signal_fence or equivalent in C/C++.

What is memory reclaim in linux

I am very new to Linux memory management. While reading some documents on the topic, I had some basic questions.
Below is my config:
vm.swappiness=10
vm.vfs_cache_pressure=140
vm.min_free_kbytes=2013265
My understanding is, if free memory falls below vm.min_free_kbytes, then the OS will reclaim memory.
Is Memory reclaim a deletion of unwanted files or copying to Swap memory from RAM?
If it's copying to Swap memory from RAM, then if I am not using Swap memory, what will happen?
Is swappiness always greater than the vm.min_free_kbytes?
What is the significance of vm.vfs_cache_pressure?
Memory reclaim is the mechanism of creating more free RAM pages, by throwing somewhere else the data residing in them. It has nothing to do with files. When more RAM is needed, data is dropped from RAM (trashed away, if it can be refetched) or copied to the swap file (so the data will be refetchable).
If there is not a swap file, but some data should be saved to the (non existent) swap area, then an out-of-memory error happens. Typically, this is notified to the process which is trying to get the memory (via alloc() and similar) - the alloc() fails and returns NULL. The process can choose what to do, or even crash. If the memory is needed by the kernel itself (normally quite rare), a PANIC happens and the system locks completely.
swappiness is, in percentage, the tendency of the kernel to use the swap, even if not strictly needed, in order to have plenty of ram ready for memory requests. Simply put, a 100% swappiness means the kernel tries to always swap, a swappiness of 0 means the kernel tries to not do swap (there are some special values however). min_free_kbytes indicates real kilobytes, it is not a percentage, and it is the minimum amount that should always be free in order to let the kernel to work well. Even starting a memory reclaim could require some more ram to do the job: it would be catastrophic if, to get some memory, you need just a little memory but you don't have it! :-)
vfs_cache_pressure is again a percentage. It indicates how much the kernel tries to get rid of (memory) cache used for the file system (vfs=virtual file system). The cache for the filesystem is quite a good candidate to throw away, because it keeps information easily readable from the disk. Unfortunately, if the computer needs frequently to use the file system, it has to read, and read again, and read again always the same data. Caching is a big performance boost. Of course, if a system does little disk I/O, then this cache is the best candidate to throw away when memory hungry.
All this things are succintly explained here: https://www.kernel.org/doc/Documentation/sysctl/vm.txt

Does binary stay in memory after program exits?

I know when a program first starts, it has massive page faults in the beginning since the code is not in memory, and thus need to load code from disk.
What happens when a program exits? Does the binary stay in memory? Would subsequent invocations of the program find that the code is already in memory and thus not have page faults (assuming nothing runs in between and pages stuff out to disk)?
It seems like the answer is no from running some experiments on my Linux machine. I ran some program over and over again, and observed the same number of page faults every time. It's a relatively quiet machine so I doubt stuff is getting paged out in between invocations. So, why is that? Why doesn't executable get to stay in memory?
There are two things to consider here:
1) The content of the executable file is likely kept in the OS cache (disk cache). While that data is still in the OS cache, every read for that data will hit the cache and the OS will honor the request without needing to re-read the file from disk
2) When a process exits, the OS unmaps every memory page mapped to a file, frees any memory (in general, releases every resource allocated by the process, including other resources, such as sockets, and so on). Strictly speaking, the physical memory may be zeroed, but not quite required (still, the security level of the OS may require to zero a page that is not used anymore - probably Windows NT, 2K, XP, etc, do that - see this Does Windows clear memory pages?). Another invocation of the same executable will create a brand new process which will map the same file in the memory, but the first access to those pages will still trigger page faults because, in the end, it is a new process, a different memory mapping. So yes, the page faults occur, but they are a lot cheaper for the second instance of the same executable compared to the first.
Of course, this is only about the read-only parts of the executable (the segments/modules containing the code and read-only data).
One may consider another scenario: forking. In this case, every page is marked as copy-on-write. When the first write occurs on each memory page, a hardware exception is triggered and intercepted by the OS memory manager. The OS determines if the page in question is allowed to be written (eg: if it is the stack, heap or any writable page in general) and if so, it allocates memory and copies the original content before allowing the process to modify the page - in order to preserve the original data in the other process. And yes, there is still another case - shared memory, where the exact physical memory is mapped to two or more processes. In this case, the copy-on-write flag is, of course, not set on the memory pages.
Hope this clarifies what is going on with the memory pages.
What I highly suspect is that parts, information blobs are not promptly erased from RAM unless there's a new request for more RAM from actually running code. For that part what probably happens is OS reusing OS dependent bits from RAM, on a next execution e.g. I think this is true for OS initiated resources (and probably not for all resources but some).
Actually most of your questions are highly implementation-dependant. But for most used OS:
What happens when a program exits? Does the binary stay in memory?
Yes, but the memory blocks are marked as unused (and thus could be allocated to other processes).
Would subsequent invocations of the program find that the code is
already in memory and thus not have page faults (assuming nothing runs
in between and pages stuff out to disk)?
No, those blocks are considered empty. Some/all blocks might have been overwritten already.
Why doesn't executable get to stay in memory?
Why would it stay? When a process is finished, all of its allocated resources are freed.
One of the reasons is that one generally wants to clear everything out on a subsequent invocation in case their was a problem in the previous.
Plus, the writeable data must be moved out.
That said, some systems do have mechanisms for keeping executable and static data in memory (possibly not linux). For example, the VMS operating system allows the system manager to install executables and shared libraries so that they remain in memory (paging allowed). The same system can be used to create create writeable shared memory allowing interprocess communication and for modifications to the memory to remain in memory (possibly paged out).

mmap(): resetting old memory to a zero'd non-resident state

I'm writing a memory allocation routine, and it's currently running smoothly. I get my memory from the OS with mmap() in 4096-byte pages. When I start my memory allocator I allocate 1gig of virtual address space with mmap(), and then as allocations are made I divide it up into hunks according to the specifics of my allocation algorithm.
I feel safe allocating as much as a 1gig of memory on a whim because I know mmap() doesn't actually put pages into physical memory until I actually write to them.
Now, the program using my allocator might have a spurt where it needs a lot of memory, and in this case the OS would have to eventually put a whole 1gig worth of pages into physical RAM. The trouble is that the program might then go into a dormant period where it frees most of that 1gig and then uses only minimal amounts of memory. Yet, all I really do inside of my allocator's MyFree() function is to flip a few bits of bookkeeping data which mark the previously used gig as free, but I know this doesn't cause the OS remove those pages from physical memory.
I can't use something like munmap() to fix this problem, because the nature of the allocation algorithm is such that it requires a continuous region of memory without any holes in it. Basically I need a way to tell the OS "Listen, you can take these pages out of physical memory and clear them to 0, but please remap them on the fly when I need them again, as if they were freshly mmap()'d"
What would be the best way to go about this?
Actually, after writing this all up I just realized that I can probably do an munmap() followed immediately by a fresh mmap(). Would that be the correct way to go about? I get the sense that there's probably some more efficient way to do this.
You are looking for madvise(addr, length, MADV_DONTNEED). From the manpage:
MADV_DONTNEED: Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
Note especially the language about how subsequent accesses will succeed but revert to zero-fill-on-demand (for mappings without an underlying file).
Your thinking-out-loud alternative of an munmap followed immediately by another mmap will also work but risks kernel-side inefficiencies because it is no longer tracking the allocation a single contiguous region; if there are many such unmap-and-remap events the kernelside data structures might wind up being quite bloated.
By the way, with this kind of allocator it's very important that you use MAP_NORESERVE for the initial allocation, and then touch each page as you allocate it, and trap any resulting SIGSEGV and fail the allocation. (And you'll need to document that your allocator installs a handler for SIGSEGV.) If you don't do this your application will not work on systems that have disabled memory overcommit. See the mmap manpage for more detail.

vm/min_free_kbytes - Why Keep Minimum Reserved Memory?

According to this article:
/proc/sys/vm/min_free_kbytes: This controls the amount of memory that is kept free for use by special reserves including “atomic” allocations (those which cannot wait for reclaim)
My question is that what does it mean by "those which cannot wait for reclaim"? In other words, I would like to understand why there's a need to tell the system to always keep a certain minimum amount of memory free and under what circumstances will this memory be used? [It must be used by something; don't see the need otherwise]
My second question: does setting this memory to something higher than 4MB (on my system) leads to better performance? We have a server which occasionally exhibit very poor shell performance (e.g. ls -l takes 10-15 seconds to execute) when certain processes get going and if setting this number to something higher will lead to better shell performance?
(link is dead, looks like it's now here)
That text is referring to atomic allocations, which are requests for memory that must be satisfied without giving up control (i.e. the current thread can not be suspended). This happens most often in interrupt routines, but it applies to all cases where memory is needed while holding an essential lock. These allocations must be immediate, as you can't afford to wait for the swapper to free up memory.
See Linux-MM for a more thorough explanation, but here is the memory allocation process in short:
_alloc_pages first iterates over each memory zone looking for the first one that contains eligible free pages
_alloc_pages then wakes up the kswapd task [..to..] tap into the reserve memory pools maintained for each zone.
If the memory allocation still does not succeed, _alloc pages will either give up [..] In this process _alloc_pages executes a cond_resched() which may cause a sleep, which is why this branch is forbidden to allocations with GFP_ATOMIC.
min_free_kbytes is unlikely to help much with the described "ls -l takes 10-15 seconds to execute"; that is likely caused by general memory pressure and swapping rather than zone exhaustion. The min_free_kbytes setting only needs to allow enough free pages to handle immediate requests. As soon as normal operation is resumed, the swapper process can be run to rebalance the memory zones. The only time I've had to increase min_free_kbytes is after enabling jumbo frames on a network card that didn't support dma scattering.
To expand on your second question a bit, you will have better results tuning vm.swappiness and the dirty ratios mentioned in the linked article. However, be aware that optimizing for "ls -l" performance may cause other processes to become slower. Never optimize for a non-primary usecase.
All linux systems will attempt to make use of all physical memory available to the system, often through the creation of a filesystem buffer cache, which put simply is an I/O buffer to help improve system performance. Technically this memory is not in use, even though it is allocated for caching.
"wait for reclaim", in your question, refers to the process of reclaiming that cache memory that is "not in use" so that it can be allocated to a process. This is supposed to be transparent but in the real world there are many processes that do not wait for this memory to become available. Java is a good example, especially where a large minimum heap size has been set. The process tries to allocate the memory and if it is not instantly available in one large contiguous (atomic?) chunk, the process dies.
Reserving a certain amount of memory with min_free_kbytes allows this memory to be instantly available and reduces the memory pressure when new processes need to start, run and finish while there is a high memory load and a full buffer cache.
4MB does seem rather low because if the buffer cache is full, any process that wants an immediate allocation of more than 4MB will likely fail. The setting is very tunable and system-specific, but if you have a few GB of memory available it can't hurt to bump up the reserve memory to 128MB. I'm not sure what effect it will have on shell interactivity, but likely positive.
This memory is kept free from use by normal processes. As #Arno mentioned, the special processes that can run include interrupt routines, which must be run now (as it's an interrupt), and finish before any other processes can run (atomic). This can include things like swapping out memory to disk when memory is full.
If the memory is filled an interrupt (memory management) process runs to swap some memory into disk so it can free some memory for use by normal processes. But if vm.min_free_kbytes is too small for it to run, then it locks up the system. This is because this interrupt process must run first to free memory so others can run, but then it's stuck because it doesn't have enough reserved memory vm.min_free_kbytes to do its task resulting in a deadlock.
Also see:
https://www.linbit.com/en/kernel-min_free_kbytes/ and
https://askubuntu.com/questions/41778/computer-freezing-on-almost-full-ram-possibly-disk-cache-problem (where the memory management process has so little memory to work with it takes so long to swap little by little that it feels like a freeze.)

Resources