Do Linux and macOS have an `OfferVirtualMemory` counterpart? - linux

Windows, starting with a certain unspecified update of Windows 8.1, has the excellent OfferVirtualMemory and ReclaimVirtualMemory system calls which allow memory regions to be "offered" to the OS. This removes them from the working set, reduces the amount of physical memory usage that is attributed to the calling process, and puts them onto the standby memory list of the program, but without ever swapping out the contents anywhere.
(Below is a brief and rough explanation of what those do and how standby lists work, to help people understand what kind of system call I'm looking for, so skip ahead if you already know all of this.)
Quick standby list reference
Pages on the standby list can leave it in one of two ways. Either their contents are swapped out to disk and the physical memory is used for housing a fresh allocation or swapping in memory from disk (if there's no available "dead weight" zeroed memory on the system), or no swapping happens and the physical memory is returned to the same virtual memory region it was first removed from. The second path sidesteps the swapping process, while the working set of the program has still been reduced to, well, the memory it's actively working on for as long as the pages sat on the standby list.
Alternatively, if another program requests physical memory and the system doesn't have zeroed pages (if no program was closed recently, for example, and the rest of RAM has been used up with various system caches), physical memory from the standby list of a program can be zeroed, removed from the standby list, and handed over to the program which requested the memory.
Back to memory offering
Offered memory never gets swapped out. If it is removed from the standby list by anything other than ReclaimVirtualMemory, it no longer belongs to the same virtual memory segment, so the reclamation process can fail, reporting that the contents of the memory region are now undefined (uninitialized memory has been fetched from the program's own standby list or from zeroed memory). This means that the program will have to re-generate the contents of the memory region from another data source, or by rerunning some computation.
The practical effect, when used to implement an intelligent computation cache system, is that, firstly, the reported working set of the program is reduced, giving a more accurate picture of how much memory it really needs. Secondly, the cached data, which can be re-generated from another region of memory, can be quickly discarded for another program to use that cache, without waiting for the disk (and putting additional strain on it, which adds up over time and results in increased wear) as it swaps out the contents of the cache, which aren't too expensive to recreate.
One good example of a use case is the render cache of a web browser, which can just re-render parts of the page upon request and gains little to nothing from having those caches take up the working set and bug the user with high memory usage. Pages which aren't currently being shown are where this approach may give the biggest theoretical yield.
The question
Do Linux and macOS have a comparable API set that allows memory to be marked as discardable at the memory manager's discretion, with a fallible system call to lock that memory back in, declaring the memory uninitialized if it was indeed discarded?

Linux 4.5 and later has madvise with the MADV_FREE flag: the memory may be replaced with pages of zeros at any time until it is next written.
To lock the memory back in, write to it, then read it to check whether it has been zeroed. This needs to be done separately for every page.
Before Linux 4.12, the memory was freed immediately on systems without swap.
You need to take care of compiler memory reordering, so use atomic_signal_fence or equivalent in C/C++.
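A minimal sketch of that offer/reclaim pattern in C follows. The helper names and the per-page layout (a nonzero marker in byte 0, a scratch byte at the end of the page) are assumptions made for illustration; only madvise, MADV_FREE and atomic_signal_fence come from the answer.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* Linux value; older libc headers may lack the macro */
#endif

/* Offer a page-aligned region to the kernel. Its pages may be replaced with
 * zero pages at any time until they are next written. Returns 0 or -1. */
int offer_region(void *addr, size_t len)
{
    return madvise(addr, len, MADV_FREE);
}

/* Reclaim a single page. Assumed (hypothetical) layout: byte 0 is a nonzero
 * "valid" marker and the last byte is scratch space that holds no data.
 * Returns true if the contents survived, false if the page was zeroed. */
bool reclaim_page(volatile unsigned char *page, size_t page_size)
{
    assert(page_size >= 2);
    page[page_size - 1] = 1;                   /* the write cancels the pending free */
    atomic_signal_fence(memory_order_seq_cst); /* stop the compiler from reordering */
    return page[0] != 0;                       /* zero marker: contents were discarded */
}
```

A cache built on this would call reclaim_page on every page of an entry before using it and regenerate the entry if any call returns false. Writing to a dedicated scratch byte rather than to the marker itself is a design choice of this sketch, so that the write does not destroy the information being checked.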

Related

How can I shrink the Linux page cache from within kernel space?

I'm working on a system that involves some custom hardware and a custom Linux device driver I wrote for the hardware. The system occasionally needs to move large amounts of data very rapidly and therefore my driver dynamically (i.e. when needed) allocates large (1 GB) DMA buffers which are used and then freed when they are no longer needed. To allocate such large buffers I actually allocate a bunch of smaller buffers (256 X 4MB) using dma_alloc_coherent and then map them contiguously into user space using remap_pfn_range. This works very well most of the time.
During testing, after the system has been running test cases for a long time, I sometimes see DMA allocation failures where one of the dma_alloc_coherent calls in my driver fails which causes my application layer software to crash. I was finally able to track down this problem and I discovered that when I see DMA allocation failures the Linux kernel page cache is very full.
For example, on the last failure that I captured the page cache filled 27 GB of the 32 GB of RAM on my system. I suspected that the page cache "fullness" was causing dma_alloc_coherent calls to fail. To test this theory I manually emptied the page cache using:
# echo 1 > /proc/sys/vm/drop_caches
This dropped the size of the cache from 27 GB to 94 MB and I was able to allocate 20+ 1 GB DMA buffers with no issues.
Clearly the page cache is a beneficial thing so I would prefer not to have to completely empty it every time I run out of space when allocating DMA buffers. My question is this: how can I dynamically shrink the page cache in kernel space such that if a call to dma_alloc_coherent fails I can recover just enough space so that I can retry the call and have it succeed?
My system is x86_64 based running a 3.16.x Linux kernel.
I have found some vague references that suggest what I'm attempting may be possible, for example "These objects are automatically reclaimed by the kernel when memory is needed elsewhere on the system." (from: https://www.kernel.org/doc/Documentation/sysctl/vm.txt). But I have not yet found any specifics that indicate how the memory is reclaimed.
Any assistance with this would be greatly appreciated!
TL;DR : Scan for active superblocks and drop references to non-dirty ones until you have reclaimed as much system memory as you need. (or you finally run out of references to active superblocks.)
How to write kernel code to dynamically shrink the fs page-cache, to recover just enough space so that a subsequent call to dma_alloc_coherent() succeeds?
To answer this question, let us take a look at what the "drop_caches operation" did to reduce the fs page-cache from 27GB to 94MB on your system.
echo 1 > /proc/sys/vm/drop_caches invokes drop_caches_sysctl_handler(), which in turn invokes iterate_supers() and passes it the pointer to the function drop_pagecache_sb().
What happens next is that iterate_supers() scans for active superblocks and every time it finds one, it calls drop_pagecache_sb(), passing it a reference to the active superblock.
This iterative procedure continues until references to all the active superblocks are freed from the fs page-cache. This is a non-destructive operation and will only free blocks that are completely unused. Dirty objects will continue to be in use until written out to disk and are not freeable. If you run sync first to flush them out to disk, the "drop_caches operation" tends to free more memory.
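For reference, the userspace operation analysed above (sync followed by the write to drop_caches) can be sketched in C. This is not the in-kernel solution the question asks for. The function name is made up, and the control-file path is a parameter only so that the sketch can be tried against a scratch file.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Userspace equivalent of `sync; echo 1 > /proc/sys/vm/drop_caches`.
 * ctl_path would normally be "/proc/sys/vm/drop_caches" (root only).
 * Returns 0 on success, -1 on failure. */
int drop_page_cache(const char *ctl_path)
{
    assert(ctl_path != NULL);
    sync(); /* flush dirty pages first so more of the cache becomes droppable */

    int fd = open(ctl_path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, "1\n", 2); /* "1" asks for the page cache only */
    int rc = close(fd);
    return (n == 2 && rc == 0) ? 0 : -1;
}
```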
Since you are interested in running this process to reclaim a limited/known amount of memory i.e. what is soon going to be requested using dma_alloc_coherent(), you simply need to implement the above functionality with an additional check at the end of each iteration and abort the superblock scan immediately once the amount of free system memory crosses the desired level.
A couple of points to keep in mind to further optimise this procedure :
Is there a preference for certain block devices over others?
You may want to iterate over active superblocks of the block devices that you do not care about first. If enough memory is not reclaimed, then scan the block devices that you would prefer to retain in the fs page-cache unless absolutely necessary to reclaim required memory. get_active_super() might be of help here.
iterate_supers_type() seems interesting
It allows one to iterate over superblocks of specific file_system_type
Please note that this is a speculative solution based purely on the analysis of existing code within the Linux kernel that you have observed to already solve your problem. Once the above approach is implemented, it will only allow you to control the same i.e. attempt to reclaim fs page-cache memory only to the extent required for your immediate needs.
Technically, when an allocation fails the kernel will try to free memory, depending on the kind of failure (soft or hard). Hard failures cause the kernel to enter the direct reclaim path. Direct reclaim is a costly operation which might take an unbounded time to complete, and even after that the allocation might fail.
Here you have two options:
1) Play with VM settings like dirty_ratio, dirty_background_ratio, etc. to maintain free RAM. See: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-tunables.html
2) Write a kernel daemon which calls the kernel function that handles drop_caches (because dropping caches might sleep).

Does binary stay in memory after program exits?

I know that when a program first starts, it has massive page faults in the beginning since the code is not in memory and thus needs to be loaded from disk.
What happens when a program exits? Does the binary stay in memory? Would subsequent invocations of the program find that the code is already in memory and thus not have page faults (assuming nothing runs in between and pages stuff out to disk)?
It seems like the answer is no from running some experiments on my Linux machine. I ran some program over and over again, and observed the same number of page faults every time. It's a relatively quiet machine so I doubt stuff is getting paged out in between invocations. So, why is that? Why doesn't the executable get to stay in memory?
There are two things to consider here:
1) The content of the executable file is likely kept in the OS cache (disk cache). While that data is still in the OS cache, every read for that data will hit the cache and the OS will honor the request without needing to re-read the file from disk.
2) When a process exits, the OS unmaps every memory page mapped to a file, frees any memory (in general, releases every resource allocated by the process, including other resources, such as sockets, and so on). Strictly speaking, the physical memory may be zeroed, but this is not strictly required (still, the security level of the OS may require zeroing a page that is not used anymore - probably Windows NT, 2K, XP, etc, do that - see "Does Windows clear memory pages?"). Another invocation of the same executable will create a brand new process which will map the same file in the memory, but the first access to those pages will still trigger page faults because, in the end, it is a new process, a different memory mapping. So yes, the page faults occur, but they are a lot cheaper for the second instance of the same executable compared to the first.
Of course, this is only about the read-only parts of the executable (the segments/modules containing the code and read-only data).
One may consider another scenario: forking. In this case, every page is marked as copy-on-write. When the first write occurs on each memory page, a hardware exception is triggered and intercepted by the OS memory manager. The OS determines if the page in question is allowed to be written (eg: if it is the stack, heap or any writable page in general) and if so, it allocates memory and copies the original content before allowing the process to modify the page - in order to preserve the original data in the other process. And yes, there is still another case - shared memory, where the exact physical memory is mapped to two or more processes. In this case, the copy-on-write flag is, of course, not set on the memory pages.
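The copy-on-write behaviour after fork can be observed from user space. In this sketch (the function name is made up) the child overwrites a value that both processes initially share, and the parent then checks that its own copy is untouched:

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the value the parent sees after a forked child has overwritten
 * its own copy of the same variable, or -1 on error. */
int parent_value_after_child_write(void)
{
    int *value = malloc(sizeof *value);
    if (value == NULL)
        return -1;
    *value = 42;

    pid_t pid = fork();
    if (pid < 0) {
        free(value);
        return -1;
    }
    if (pid == 0) {
        *value = 99; /* write fault: the child gets a private copy of the page */
        _exit(*value == 99 ? 0 : 1);
    }

    assert(pid > 0); /* only the parent reaches this point */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        free(value);
        return -1;
    }
    int seen = *value; /* still the original: the child's write was not shared */
    free(value);
    return seen;
}
```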
Hope this clarifies what is going on with the memory pages.
What I highly suspect is that parts, information blobs are not promptly erased from RAM unless there's a new request for more RAM from actually running code. For that part what probably happens is OS reusing OS dependent bits from RAM, on a next execution e.g. I think this is true for OS initiated resources (and probably not for all resources but some).
Actually most of your questions are highly implementation-dependent. But for the most widely used OSes:
What happens when a program exits? Does the binary stay in memory?
Yes, but the memory blocks are marked as unused (and thus could be allocated to other processes).
Would subsequent invocations of the program find that the code is already in memory and thus not have page faults (assuming nothing runs in between and pages stuff out to disk)?
No, those blocks are considered empty. Some/all blocks might have been overwritten already.
Why doesn't executable get to stay in memory?
Why would it stay? When a process is finished, all of its allocated resources are freed.
One of the reasons is that one generally wants to clear everything out on a subsequent invocation in case there was a problem in the previous one.
Plus, the writeable data must be moved out.
That said, some systems do have mechanisms for keeping executable and static data in memory (possibly not Linux). For example, the VMS operating system allows the system manager to install executables and shared libraries so that they remain in memory (paging allowed). The same system can be used to create writeable shared memory allowing interprocess communication and for modifications to the memory to remain in memory (possibly paged out).

mmap(): resetting old memory to a zero'd non-resident state

I'm writing a memory allocation routine, and it's currently running smoothly. I get my memory from the OS with mmap() in 4096-byte pages. When I start my memory allocator I allocate 1gig of virtual address space with mmap(), and then as allocations are made I divide it up into hunks according to the specifics of my allocation algorithm.
I feel safe allocating as much as a 1gig of memory on a whim because I know mmap() doesn't actually put pages into physical memory until I actually write to them.
Now, the program using my allocator might have a spurt where it needs a lot of memory, and in this case the OS would have to eventually put a whole 1gig worth of pages into physical RAM. The trouble is that the program might then go into a dormant period where it frees most of that 1gig and then uses only minimal amounts of memory. Yet, all I really do inside of my allocator's MyFree() function is to flip a few bits of bookkeeping data which mark the previously used gig as free, but I know this doesn't cause the OS to remove those pages from physical memory.
I can't use something like munmap() to fix this problem, because the nature of the allocation algorithm is such that it requires a continuous region of memory without any holes in it. Basically I need a way to tell the OS "Listen, you can take these pages out of physical memory and clear them to 0, but please remap them on the fly when I need them again, as if they were freshly mmap()'d"
What would be the best way to go about this?
Actually, after writing this all up I just realized that I can probably do an munmap() followed immediately by a fresh mmap(). Would that be the correct way to go about it? I get the sense that there's probably some more efficient way to do this.
You are looking for madvise(addr, length, MADV_DONTNEED). From the manpage:
MADV_DONTNEED: Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
Note especially the language about how subsequent accesses will succeed but revert to zero-fill-on-demand (for mappings without an underlying file).
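A minimal sketch of that call for a private anonymous mapping on Linux (the wrapper name is made up). After the call, the range stays mapped but reads back as zeros, and the physical pages are only reallocated on the next touch:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Release the physical pages behind [addr, addr + len) while keeping the
 * mapping itself. addr must be page-aligned. Returns 0 or -1. */
int release_pages(void *addr, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0 && (uintptr_t)addr % (uintptr_t)ps == 0);
    return madvise(addr, len, MADV_DONTNEED);
}
```

In an allocator like the one described, MyFree() could call this on whole pages that have become completely free, rounding the start up and the end down to page boundaries so that live data on partially used pages is not lost.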
Your thinking-out-loud alternative of an munmap followed immediately by another mmap will also work but risks kernel-side inefficiencies because it is no longer tracking the allocation as a single contiguous region; if there are many such unmap-and-remap events the kernel-side data structures might wind up being quite bloated.
By the way, with this kind of allocator it's very important that you use MAP_NORESERVE for the initial allocation, and then touch each page as you allocate it, and trap any resulting SIGSEGV and fail the allocation. (And you'll need to document that your allocator installs a handler for SIGSEGV.) If you don't do this your application will not work on systems that have disabled memory overcommit. See the mmap manpage for more detail.

What happens in paginated (virtual memory) systems when a process is started up?

I'm studying through Tanenbaum's "Modern Operating Systems" book and just read the following paragraph in the book:
When a process is started up, all of its page table entries are marked as not in memory. As soon as any page is referenced, a page fault will occur. The operating system then sets the R bit (in its internal tables), changes the page tables entry to point to the correct page, with mode READ ONLY, and restarts the instruction. If the page is subsequently modified, another page fault will occur, allowing the operating system to set the M bit and change the page's mode to READ/WRITE.
It seems extremely inefficient to me. He suggests that when a process is started up a lot of page faults must occur and the real memory is being filled up as the instructions are being executed.
It appears more logical to me that at least the text of the process is put in memory at the beginning, instead of it being put at every instruction execution (with a page fault per instruction execution).
Could someone explain to me what the advantage of the method described in the book is?
Tanenbaum describes two techniques in this paragraph:
When a process is started up, all of its page table entries are marked as not in memory. As soon as any page is referenced, a page fault will occur. The operating system then sets the R bit (in its internal tables), changes the page tables entry to point to the correct page, with mode READ ONLY, and restarts the instruction.
This technique is also called demand-paging (the pages are loaded from disk to memory on-demand, if a page-fault occurs). I can think of at least two reasons why you would want to do this:
Memory consumption: Only the pages that are really needed are loaded from disk into main memory, there might be parts of your program you never execute or parts in your data section you never write to during the execution. In that case, these parts are never loaded in the first place, which means you have more RAM available for other processes. Nowadays, with huge amounts of memory you can of course debate if this is still a valid argument.
Speed: Loading from disk is slow and was much slower a decade ago. Doing the pagetable setup on-demand in a lazy fashion lets you defer fetching blocks from the disk. Loading everything at once might delay the execution of your program. Again, disks are now a lot faster and SSDs make this argument even more void. On the other hand, because of dynamic libraries, binaries are not that big and usually require only a few page-faults until they are loaded in RAM.
If the page is subsequently modified, another page fault will occur, allowing the operating system to set the M bit and change the page's mode to READ/WRITE.
Again, the reason for this is memory consumption. In the old days, when memory was scarce, swapping (moving pages back to disk again if the memory became full) was the solution to provide you with an illusion of a much larger working set of pages. If a page was already swapped out before and never modified in between, you could just get rid of the page by removing the present bit in the pagetable, thus freeing up the memory the page previously occupied to load another frame. The modified bit helps you to detect if you need to write a new version of the page back out to disk, or if you can actually leave the old version as is and swap it back in again once it is needed.
The method you mention, where you set up a process with all page table entries prepopulated (also known as pre-paging), is perfectly valid. You are trading memory consumption for speed. The page-table walk and also setting the modified bit are implemented in hardware (on x86), which means they do not perform that badly. However, pre-population saves you from executing the page-fault handler, which, although usually heavily optimized, is implemented in software.
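On Linux the two strategies can be compared from user space for anonymous memory: a plain mmap is demand-paged, while the Linux-specific MAP_POPULATE flag asks the kernel to pre-populate the page tables at mmap time. The sketch below (the function name and parameters are made up) counts the minor faults taken while touching every page:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Map len bytes of anonymous memory, touch one byte per page, and return the
 * number of minor faults taken while touching. If prefault is nonzero the
 * mapping is created with MAP_POPULATE. Returns -1 on error. */
long faults_while_touching(size_t len, size_t page_size, int prefault)
{
    assert(page_size > 0);
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | (prefault ? MAP_POPULATE : 0);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    struct rusage before, after;
    if (getrusage(RUSAGE_SELF, &before) != 0) {
        munmap(p, len);
        return -1;
    }
    for (size_t off = 0; off < len; off += page_size)
        ((volatile unsigned char *)p)[off] = 1; /* first touch of each page */
    int rc = getrusage(RUSAGE_SELF, &after);
    munmap(p, len);
    if (rc != 0)
        return -1;
    return after.ru_minflt - before.ru_minflt;
}
```

With prefault set, the cost of populating the pages is paid inside mmap itself rather than in the touching loop, which is the memory-for-speed trade described above.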

Reclaim memory after program exit

Here is my problem: after running a suite of programs, free tells me that after execution there is about 1 GB less memory free. After some searches I found SO: What really happens when you dont free after malloc which (as I understand it) makes clear that missing memory deallocations should not be the problem... (is that correct?)
top does not show any processes that use significant amounts of memory.
How can I find out 'what happened' to the memory, i.e. which program allocated it and why it is not free after program execution?
Where does free collect its information?
(I am running a recent Ubuntu version)
Yes, memory used by your program is freed after your program exits.
The statistics in "free" are confusing, but the fact is that the memory IS available to other programs:
http://kevinclosson.wordpress.com/2009/11/17/linux-free-memory-is-it-free-or-reclaimable-yes-when-i-want-free-memory-i-want-free-memory/
http://sourcefrog.net/weblog/software/linux-kernel/free-mem.html
Here's an even better link:
http://www.linuxatemyram.com/
free (1) is a misnomer, it should more correctly be called unused, because that's what it shows. Or maybe it should be called physicalfree (or, more precisely, the "free" column in the output should be named "unused").
You'll note that "buffers" and "cached" tend to go up as "free" goes down. Memory does not disappear, it just gets assigned to a different "bucket".
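As for where free collects its information: on Linux the numbers come from /proc/meminfo. A small sketch that reads individual "buckets" from that file (the function name is made up):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up one field of /proc/meminfo, such as "MemFree", "Buffers" or
 * "Cached", and return its value in kB, or -1 if it cannot be read. */
long meminfo_kb(const char *key)
{
    assert(key != NULL);
    FILE *f = fopen("/proc/meminfo", "r");
    if (f == NULL)
        return -1;

    char line[256];
    size_t klen = strlen(key);
    long value = -1;
    while (fgets(line, sizeof line, f) != NULL) {
        /* Lines look like "MemFree:        123456 kB" */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            if (sscanf(line + klen + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}
```

Comparing meminfo_kb("MemFree") with meminfo_kb("Cached") before and after running the suite of programs should show the "missing" memory moving between buckets rather than disappearing.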
The difference between free memory and unused memory is that while both are "free", the unused memory is truly so (no physical memory in use) whereas the simply "free" memory is often moved into the buffer cache. That is for example the case for all executable images and libraries, anything that is read-only or read-execute. If the same file is loaded again later, the "free" page is mapped into the process again and no data must be loaded.
Note that "unused" is actually a bad thing, although it is not immediately obvious (it sounds good, doesn't it?). Free (but physically used) memory serves a purpose, whereas free (unused) memory means you could as well have saved on money for RAM. Therefore, having unused memory (e.g. by purging pages) is exactly what you don't want.
Stunningly, under Windows there exist a lot of "memory optimizer" tools which cost real money and which do just that...
About reclaiming memory, the way this works is easy: The OS simply removes the references to all pages in the working set. If a page is shared with another process, nothing spectacular happens. If it belongs to a non-anonymous mapping and is not writeable (or writeable and not written), it goes into the buffer cache. Otherwise, it goes zap poof.
This removes any memory allocated with malloc as well as the memory used by executables and file mappings, and (since all memory is based on pages) everything else.
It is probably your OS using up that space for its own purposes.
For example, many modern OS's will keep programs loaded in memory after they terminate, in case you want to start them up again. If their guess is right, it saves a lot of time at the cost of some memory that wasn't being used anyway. Some OS's will even speculatively load some commonly used programs.
CPU utilization works the same way. Often your OS will speculatively do some work when the CPU would otherwise be "idle".

Resources