Memory capacity saturation and minor page faults - Linux

In USE Method: Linux Performance Checklist it is mentioned that
The goal is a measure of memory capacity saturation - the degree to which a process is driving the system beyond its ability (and causing paging/swapping). [...] Another metric that may serve a similar goal is minor-fault rate by process, which could be watched from /proc/PID/stat.
I'm not sure I understand what minor-faults have to do with memory saturation.
Quoting Wikipedia for reference:
If the page is loaded in memory at the time the fault is generated, but is not marked in the memory management unit as being loaded in memory, then it is called a minor or soft page fault.

I think what the book is referring to is the following OS behaviour, which could make soft page faults increase with memory pressure. But there are other reasons for soft page faults: allocating new pages with mmap(MAP_ANONYMOUS) and then freeing them again, for example; every first touch of a new page costs a soft page fault, although fault-around for a group of contiguous pages can reduce that to one fault per N pages for some small N when iterating through a new large allocation.
When approaching memory pressure limits, Linux (like many other OSes) will un-wire a page in the HW page tables to see if a soft page fault happens very soon. If not, it may actually evict that page from memory.[1]
But if it does soft-pagefault before being evicted, the kernel just has to wire it back in to the page table, having saved a hard page-fault. (And the I/O to write it out in the first place.)
Footnote 1: Writing it to disk if dirty, either in swap space or a file-backed mapping if not anonymous; otherwise just dropping it. The kernel could start this disk I/O while waiting to see if it gets faulted back in; IDK if Linux does this or not.
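
For reference, a small sketch (my own, not from the checklist) of watching the per-process minor-fault rate mentioned above, by sampling the minflt field (field 10) of /proc/PID/stat once per second:

```c
/* Sketch: print the per-second minor-fault rate of a given PID by sampling
 * /proc/PID/stat. Field 10 (minflt) is read by skipping past the ')' that
 * ends the comm field, since comm may itself contain spaces. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long read_minflt(const char *pid)
{
    char path[64], buf[4096];
    snprintf(path, sizeof path, "/proc/%s/stat", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';

    char *p = strrchr(buf, ')');      /* skip "pid (comm)" */
    if (!p)
        return -1;
    long minflt = -1;
    /* after ')': state ppid pgrp session tty_nr tpgid flags minflt ... */
    sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %ld", &minflt);
    return minflt;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s PID\n", argv[0]);
        return 1;
    }
    for (long prev = read_minflt(argv[1]); prev >= 0; ) {
        sleep(1);
        long cur = read_minflt(argv[1]);
        if (cur < 0)
            break;
        printf("minor faults/sec: %ld\n", cur - prev);
        prev = cur;
    }
    return 0;
}
```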

Related

Do Linux and macOS have an `OfferVirtualMemory` counterpart?

Windows, starting with a certain unspecified update of Windows 8.1, has the excellent OfferVirtualMemory and ReclaimVirtualMemory system calls which allow memory regions to be "offered" to the OS. This removes them from the working set, reduces the amount of physical memory usage that is attributed to the calling process, and puts them onto the standby memory list of the program, but without ever swapping out the contents anywhere.
(Below is a brief and rough explanation of what those do and how standby lists work, to help people understand what kind of system call I'm looking for, so skip ahead if you already know all of this.)
Quick standby list reference
Pages in the standby list can be reused in one of two ways. Either their contents are swapped out to disk and the physical memory is used to house a fresh allocation or to swap in memory from disk (if there's no available "dead weight" zeroed memory on the system), or no swapping happens and the physical memory is returned to the same virtual memory region it was first removed from, sidestepping the swapping process while still having reduced the working set of the program to the memory it was actively working on when the pages were moved to the standby list.
Alternatively, if another program requests physical memory and the system doesn't have zeroed pages (if no program was closed recently, for example, and the rest of RAM has been used up with various system caches), physical memory from the standby list of a program can be zeroed, removed from the standby list, and handed over to the program which requested the memory.
Back to memory offering
Since the offered memory never gets swapped out, if, upon being removed from the standby list, it no longer belongs to the same virtual memory segment (i.e. it was removed from standby by anything other than ReclaimVirtualMemory), the reclamation process can fail, reporting that the contents of the memory region are now undefined (uninitialized memory has been fetched from the program's own standby list or from zeroed memory). This means that the program will have to re-generate the contents of the memory region from another data source, or by rerunning some computation.
The practical effect, when used to implement an intelligent computation cache, is that, firstly, the reported working set of the program is reduced, giving a more accurate picture of how much memory it really needs. Secondly, the cached data, which can be re-generated from another region of memory, can be quickly discarded for another program to use, without waiting for the disk (and putting additional strain on it, which adds up over time and results in increased wear) while it swaps out the contents of a cache that isn't expensive to recreate anyway.
One good example of a use case is the render cache of a web browser, which can simply re-render parts of the page on request and has little use for caches that inflate the working set and bug the user with high memory usage. Pages which aren't currently being shown are where this approach may give the biggest theoretical yield.
The question
Do Linux and macOS have a comparable API set that allows memory to be marked as discardable at the memory manager's discretion, with a fallible system call to lock that memory back in, declaring the memory uninitialized if it was indeed discarded?
Linux 4.5 and later has madvise with MADV_FREE: the memory may be replaced with pages of zeros at any time until it is next written.
To lock the memory back in, write to it, then read it to check whether it has been zeroed. This needs to be done separately for every page.
Before Linux 4.12 the memory was freed immediately on systems without swap.
You need to take care of compiler memory reordering, so use atomic_signal_fence or equivalent in C/C++.
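
A minimal sketch of that protocol (mine, not the answerer's code), assuming Linux 4.12+ and keeping a nonzero marker byte at the start of every page; the write in reclaim_page() cancels MADV_FREE for that page before the marker is checked:

```c
/* Sketch of a discardable cache built on madvise(MADV_FREE) (Linux >= 4.5;
 * behaviour without swap is sane only on >= 4.12). Each page keeps a nonzero
 * marker byte at offset 0; if the kernel reclaims the page it is replaced
 * with zeros, so a zero marker means the contents must be regenerated. */
#include <stdatomic.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 256

/* Offer the region to the kernel: it may replace whole pages with zeros. */
static void offer(unsigned char *buf, size_t len, size_t pagesz)
{
    for (size_t off = 0; off < len; off += pagesz)
        buf[off] = 0xAA;                      /* marker: nonzero while valid */
    madvise(buf, len, MADV_FREE);
}

/* Returns 1 if the page at 'off' survived, 0 if it was discarded (zeroed). */
static int reclaim_page(volatile unsigned char *buf, size_t off)
{
    buf[off + 1] = 1;                          /* any write cancels MADV_FREE */
    atomic_signal_fence(memory_order_seq_cst); /* stop the compiler reordering */
    return buf[off] != 0;                      /* zero marker => page dropped */
}

int main(void)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = NPAGES * pagesz;
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    memset(buf, 0xAA, len);                    /* the "expensive" cached data */
    offer(buf, len, pagesz);

    /* ... later: try to get the cache back, page by page ... */
    for (size_t off = 0; off < len; off += pagesz)
        if (!reclaim_page(buf, off))
            ;                                  /* regenerate this page's data */

    munmap(buf, len);
    return 0;
}
```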

How to measure minor page fault cost?

I want to verify that transparent huge pages (THP) cause larger page fault latency, because Linux must zero pages before returning them to the user. A THP is 512x larger than a 4 KB page, and thus slower to clear. When memory is fragmented, the OS often has to compact memory to generate a THP.
So I want to measure minor page fault latency (cost), but I still have no idea how.
Check the kernel documentation at https://www.kernel.org/doc/Documentation/vm/transhuge.txt and search LWN & Red Hat docs for THP latency and THP faults.
https://www.kernel.org/doc/Documentation/vm/transhuge.txt says about zero THP:
By default kernel tries to use huge zero page on read page fault to
anonymous mapping. It's possible to disable huge zero page by writing 0
or enable it back by writing 1:
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
You can vary the setting (introduced around 2012: https://lwn.net/Articles/517465/ Adding a huge zero page) and measure page mapping and access latency. Just read some system time with rdtsc/rdtscp/CLOCK_MONOTONIC, do accesses to the page, reread the time; record stats about the time difference, like min/max/avg; draw a histogram - count how many differences fell in the 0..100, 101..300, 301..600, ... ranges and how many were bigger than some huge value. The array used to count the histogram can be quite small.
You may try mmap() with the MAP_POPULATE flag (http://d3s.mff.cuni.cz/teaching/advanced_operating_systems/slides/10_huge_pages.pdf, page 17).
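A rough sketch of such a measurement (my own; bucket boundaries are arbitrary): time the first touch of every 4 KiB page in a fresh anonymous mapping with CLOCK_MONOTONIC. With THP enabled, the occasional much slower touch corresponds to a 2 MiB huge page being zeroed in the fault handler:

```c
/* Time the first touch of each page in a 256 MiB anonymous mapping and
 * bucket the latencies into a small histogram. Compare runs with THP
 * enabled/disabled via /sys/kernel/mm/transparent_hugepage/enabled. */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define PAGE   4096UL
#define NPAGES 65536UL                       /* 256 MiB total */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

int main(void)
{
    unsigned char *p = mmap(NULL, NPAGES * PAGE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* histogram buckets: <1us, <10us, <100us, <1ms, >=1ms */
    unsigned long hist[5] = {0};
    uint64_t max = 0;

    for (size_t i = 0; i < NPAGES; i++) {
        uint64_t t0 = now_ns();
        p[i * PAGE] = 1;                     /* first write -> page fault */
        uint64_t dt = now_ns() - t0;
        if (dt > max) max = dt;
        if      (dt < 1000)    hist[0]++;
        else if (dt < 10000)   hist[1]++;
        else if (dt < 100000)  hist[2]++;
        else if (dt < 1000000) hist[3]++;
        else                   hist[4]++;
    }
    printf("<1us:%lu <10us:%lu <100us:%lu <1ms:%lu >=1ms:%lu max:%llu ns\n",
           hist[0], hist[1], hist[2], hist[3], hist[4],
           (unsigned long long)max);
    munmap(p, NPAGES * PAGE);
    return 0;
}
```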
The Red Hat blog has a post about THP & page fault latency (measured with their SystemTap stap tracing tool): https://developers.redhat.com/blog/2014/03/10/examining-huge-pages-or-transparent-huge-pages-performance/
To prevent information leakage from the previous user of the page the kernel writes zeros in the entire page. For a 4096 byte page this is a relatively short operation and will only take a couple of microseconds. The x86 hugepages are 2MB in size, 512 times larger than the normal page. Thus, the operation may take hundreds of microseconds and impact the operation of latency sensitive code. Below is a simple SystemTap command line script to show which applications have huge pages zeroed out and how long those operations take. It will run until cntl-c is pressed.
stap -e 'global huge_clear probe kernel.function("clear_huge_page").return {
huge_clear [execname(), pid()] <<< (gettimeofday_us() - @entry(gettimeofday_us()))}'
Also, I'm not sure about this, but in theory the Linux kernel may have a kernel thread that pre-zeroes huge pages before any application requires them.

Why does dereferencing pointer from mmap cause memory usage reported by top to increase?

I am calling mmap() with MAP_SHARED and PROT_READ to access a file which is about 25 GB in size. I have noticed that advancing the returned pointer has no effect on %MEM in top for the application, but once I start dereferencing the pointer at different locations, memory wildly increases and caps at 55%. That value goes back down to 0.2% once munmap is called.
I don't know if I should trust that 55% value top reports. It doesn't seem like it is actually using 8 GB of the available 16. Should I be worried?
When you first map the file, all it does is reserve address space; it doesn't necessarily read anything from the file if you don't pass MAP_POPULATE (the OS might do a little prefetch, but it's not required to, and often doesn't until you begin reading/writing).
When you read from a given page of memory for the first time, this triggers a page fault. This is the "invalid page fault" most people think of when they hear the name, and it's either:
A minor fault - The data is already loaded in the kernel, but the userspace mapping for that address to the loaded data needs to be established (fast)
A major fault - The data is not loaded at all, and the kernel needs to allocate a page for the data, populate it from the disk (slow), then perform the same mapping to userspace as in the minor fault case
The behavior you're seeing is likely due to the mapped file being too large to fit in memory alongside everything else that wants to stay resident, so:
When first mapped, the initial pages aren't already mapped to the process (some of them might be in the kernel cache, but they're not charged to the process unless they're linked to the process's address space by minor page faults)
You read from the file, causing minor and major faults until you fill main RAM
Once you fill main RAM, faulting in a new page typically leads to one of the older pages being dropped (you're not using all the pages as much as the OS and other processes are using theirs, so the low activity pages, especially ones that can be dropped for free rather than written to the page/swap file, are ideal pages to discard), so your memory usage steadies (for every page read in, you drop another)
When you munmap, the accounting against your process is dropped. Many of the pages are likely still in the kernel cache, but unless they're remapped and accessed again soon, they're likely first on the chopping block to discard if something else requests memory
And as commenters noted, accounting for shared memory-mapped files gets weird; every process is "charged" for the memory, but they'll all report it as shared even if no other process maps it. So it's not practical to distinguish "shared because it's MAP_SHARED and backed by kernel cache, but no one else has it mapped, so it's effectively uniquely owned by this process" from "shared because N processes are mapping the same data, reporting shared_amount * N usage cumulatively, but actually only consuming shared_amount memory total (plus a trivial amount to maintain the per-process page tables for each mapping)". There's no reason to be worried if the tallies don't line up.
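As a rough illustration of the above (bigfile.bin is a placeholder path), the sketch below maps a file read-only and uses mincore() to count resident pages before and after dereferencing the mapping; that residency growth is roughly what top's rising RES/%MEM reflects:

```c
/* Map a file with MAP_SHARED/PROT_READ and count how many of its pages are
 * resident (per mincore) before and after touching every page. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static size_t resident_pages(void *addr, size_t len, size_t pagesz)
{
    size_t npages = (len + pagesz - 1) / pagesz, n = 0;
    unsigned char *vec = malloc(npages);
    if (vec && mincore(addr, len, vec) == 0)
        for (size_t i = 0; i < npages; i++)
            n += vec[i] & 1;                 /* low bit set => page resident */
    free(vec);
    return n;
}

int main(void)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    int fd = open("bigfile.bin", O_RDONLY);  /* placeholder for the 25 GB file */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    printf("resident after mmap:  %zu pages\n",
           resident_pages(p, (size_t)st.st_size, pagesz));

    volatile char sink = 0;
    for (size_t off = 0; off < (size_t)st.st_size; off += pagesz)
        sink += p[off];                      /* dereference faults each page in */
    (void)sink;

    printf("resident after reads: %zu pages\n",
           resident_pages(p, (size_t)st.st_size, pagesz));
    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}
```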

mmap(): resetting old memory to a zero'd non-resident state

I'm writing a memory allocation routine, and it's currently running smoothly. I get my memory from the OS with mmap() in 4096-byte pages. When I start my memory allocator I allocate 1gig of virtual address space with mmap(), and then as allocations are made I divide it up into hunks according to the specifics of my allocation algorithm.
I feel safe allocating as much as a 1gig of memory on a whim because I know mmap() doesn't actually put pages into physical memory until I actually write to them.
Now, the program using my allocator might have a spurt where it needs a lot of memory, and in this case the OS would have to eventually put a whole 1gig worth of pages into physical RAM. The trouble is that the program might then go into a dormant period where it frees most of that 1gig and then uses only minimal amounts of memory. Yet all I really do inside my allocator's MyFree() function is flip a few bits of bookkeeping data which mark the previously used gig as free, and I know this doesn't cause the OS to remove those pages from physical memory.
I can't use something like munmap() to fix this problem, because the nature of the allocation algorithm is such that it requires a continuous region of memory without any holes in it. Basically I need a way to tell the OS "Listen, you can take these pages out of physical memory and clear them to 0, but please remap them on the fly when I need them again, as if they were freshly mmap()'d"
What would be the best way to go about this?
Actually, after writing this all up I just realized that I can probably do an munmap() followed immediately by a fresh mmap(). Would that be the correct way to go about it? I get the sense that there's probably some more efficient way to do this.
You are looking for madvise(addr, length, MADV_DONTNEED). From the manpage:
MADV_DONTNEED: Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
Note especially the language about how subsequent accesses will succeed but revert to zero-fill-on-demand (for mappings without an underlying file).
Your thinking-out-loud alternative of an munmap followed immediately by another mmap will also work, but it risks kernel-side inefficiencies because the kernel is no longer tracking the allocation as a single contiguous region; if there are many such unmap-and-remap events the kernel-side data structures might wind up quite bloated.
By the way, with this kind of allocator it's very important that you use MAP_NORESERVE for the initial allocation, and then touch each page as you allocate it, and trap any resulting SIGSEGV and fail the allocation. (And you'll need to document that your allocator installs a handler for SIGSEGV.) If you don't do this your application will not work on systems that have disabled memory overcommit. See the mmap manpage for more detail.
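A minimal sketch of the suggested approach (not the asker's allocator): reserve a large region with MAP_NORESERVE, fault pages in by writing to them, then release the physical pages on free with madvise(MADV_DONTNEED) while keeping the mapping intact:

```c
/* Reserve 1 GiB of address space, touch a few pages, then drop their
 * physical backing with MADV_DONTNEED. The next access to that range
 * faults in fresh zero-filled pages on demand. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define POOL_SIZE (1UL << 30)                /* 1 GiB of address space */
#define CHUNK     (16UL * 4096)

int main(void)
{
    unsigned char *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                               -1, 0);
    if (pool == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* "Allocate": first writes fault the pages into physical memory. */
    memset(pool, 0xAB, CHUNK);

    /* "Free": drop the physical pages but keep the address range mapped. */
    if (madvise(pool, CHUNK, MADV_DONTNEED) != 0)
        perror("madvise");

    printf("first byte after MADV_DONTNEED: %u\n", pool[0]);  /* prints 0 */
    munmap(pool, POOL_SIZE);
    return 0;
}
```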

What happens in paginated (virtual memory) systems when a process is started up?

I'm studying through Tanenbaum's "Modern Operating Systems" book and just read the following paragraph in the book:
When a process is started up, all of its page table entries are marked as not in memory. As soon as any page is referenced, a page fault will occur. The operating system then sets the R bit (in its internal tables), changes the page table entry to point to the correct page, with mode READ ONLY, and restarts the instruction. If the page is subsequently modified, another page fault will occur, allowing the operating system to set the M bit and change the page's mode to READ/WRITE.
It seems extremely inefficient to me. He suggests that when a process is started up a lot of page faults must occur, and real memory fills up only as the instructions are executed.
It appears more logical to me that at least the text of the process should be put in memory at the beginning, instead of being brought in as execution proceeds (with a page fault per instruction execution).
Could someone explain to me what the advantage of the method the book describes is?
Tanenbaum describes two techniques in this paragraph:
When a process is started up, all of its page table entries are marked as not in memory. As soon as any page is referenced, a page fault will occur. The operating system then sets the R bit (in its internal tables), changes the page table entry to point to the correct page, with mode READ ONLY, and restarts the instruction.
This technique is also called demand-paging (the pages are loaded from disk to memory on-demand, if a page-fault occurs). I can think of at least two reasons why you would want to do this:
Memory consumption: Only the pages that are really needed are loaded from disk into main memory, there might be parts of your program you never execute or parts in your data section you never write to during the execution. In that case, these parts are never loaded in the first place, which means you have more RAM available for other processes. Nowadays, with huge amounts of memory you can of course debate if this is still a valid argument.
Speed: Loading from disk is slow and was much slower a decade ago. Doing the page table setup on demand, in a lazy fashion, lets the kernel defer fetching blocks from disk. Loading everything at once might delay the execution of your program. Again, disks are now a lot faster and SSDs weaken this argument even further. On the other hand, because of dynamic libraries, binaries are not that big and usually require only a few page faults until they are loaded in RAM.
If the page is subsequently modified, another page fault will occur, allowing the operating system to set the M bit and change the page's mode to READ/WRITE.
Again, the reason for this is memory consumption. In the old days, when memory was scarce, swapping (moving pages back to disk if memory became full) was the solution to provide the illusion of a much larger working set of pages. If a page had already been swapped out before and never modified in between, you could just get rid of the page by clearing the present bit in the page table, thus freeing up the memory the page previously occupied to load another frame. The modified bit helps you detect whether you need to write a new version of the page back out to disk, or whether you can leave the old version as is and swap it back in once it is needed.
The method you mention, where a process is set up with all page table entries prepopulated (also known as pre-paging), is perfectly valid. You are trading memory consumption for speed. The page table walk and setting of the modified bit are implemented in hardware (on x86), which means they perform reasonably well. Pre-population, however, saves you from executing the page-fault handler, which, although usually heavily optimized, is implemented in software.
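A small sketch contrasting the two approaches (sizes and flags are my own choice): count minor faults via getrusage() while touching an anonymous mapping created with and without MAP_POPULATE. The demand-paged run pays roughly one fault per page touched (fewer with fault-around or THP), while the pre-populated run pays them up front inside mmap():

```c
/* Compare minor-fault counts for demand paging vs. MAP_POPULATE. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

#define LEN (64UL * 1024 * 1024)             /* 64 MiB */

static long minflt(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

static void touch_and_report(int extra_flags, const char *label)
{
    unsigned char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
    if (p == MAP_FAILED)
        return;

    long before = minflt();
    for (size_t i = 0; i < LEN; i += 4096)
        p[i] = 1;                            /* demand-paged mappings fault here */
    printf("%-13s minor faults while touching: %ld\n", label, minflt() - before);
    munmap(p, LEN);
}

int main(void)
{
    touch_and_report(0, "demand-paged:");
    touch_and_report(MAP_POPULATE, "prepopulated:");
    return 0;
}
```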

Resources