What happens in paginated (virtual memory) systems when a process is started up? - pagination

I'm studying through Tanenbaum's "Modern Operating Systems" book and just read the following paragraph in the book:
When a process is started up, all of its page table entries are marked as not in memory. As soon as any page is referenced, a page fault will occur. The operating system then sets the R bit (in its internal tables), changes the page tables entry to point to the correct page, with mode READ ONLY, and restarts the instruction. If the page is subsequently modified, another page fault will occur, allowing the operating system to set the M bit and change the page's mode to READ/WRITE.
It seems to be extremely inneficient for me. He suggests that when a process is started up a lot of page faults must occur and the real memory is being filled up as the instructions are being executed.
It appears more logical to me that at least the text of the process is put in memory at the beginning, instead of it being put at every instruction execution (with a page fault per instruction execution).
Could someone explain me what is the advantage of this method that the book explains?

Tanenbaum describese two techniques in this paragraph:
When a process is started up, all of its page table entries are marked as not in memory. As > soon as any page is referenced, a page fault will occur. The operating system then sets the > R bit (in its internal tables), changes the page tables entry to point to the correct page, > with mode READ ONLY, and restarts the instruction.
This technique is also called demand-paging (the pages are loaded from disk to memory on-demand, if a page-fault occurs). I can think of at least two reasons why you would want to do this:
Memory consumption: Only the pages that are really needed are loaded from disk into main memory, there might be parts of your program you never execute or parts in your data section you never write to during the execution. In that case, these parts are never loaded in the first place, which means you have more RAM available for other processes. Nowadays, with huge amounts of memory you can of course debate if this is still a valid argument.
Speed: Loading from disk is slow and was much slower a decade ago. Doing the pagetable setup on-demand in a lazy fashion allows to defer the block fetching from the disk. Loading everything at once might delay the execution of your program. Again, disks are now a lot faster and SSDs make this argument even more void. On the other hand, because of dynamic libraries, binaries are not that big and usually require only a few page-faults until they are loaded in RAM.
If the page is subsequently modified, another page fault will occur, allowing the
operating system to set the M bit and change the page's mode to READ/WRITE.
Again, the reason for this is memory consumption. In the old days, where memory was scarce, swapping (moving pages back to disk again if the memory became full) was the solution to provide you with an illusion of a much larger working set of pages. If a page was already swapped out before and never modified inbetween you could just get rid of the page by removing the present bit in the pagetable, thus freeing up the memory the page previosuly occupied to load another frame. The modified bit helps you to detect if you need to write a new version of the page back out to disk, or if you can actually leave the old version as is and swap it back in again once it is needed.
The method you mention where you setup a process with all page table entries prepopulated (also known as pre-paging) is perfectly valid. You are trading memory consumption for speed. The page-table walk and also setting the modified bit is implemented in hardware (on x86) which means it performs not that bad. However, pre-population saves you from executing the page-fault handler, which altough usually heavily optimized, is implemented in software.

Related

Do Linux and macOS have an `OfferVirtualMemory` counterpart?

Windows, starting with a certain unspecified update of Windows 8.1, has the excellent OfferVirtualMemory and ReclaimVirtualMemory system calls which allow memory regions to be "offered" to the OS. This removes them from the working set, reduces the amount of physical memory usage that is attributed to the calling process, and puts them onto the standby memory list of the program, but without ever swapping out the contents anywhere.
(Below is a brief and rough explanation of what those do and how standby lists work, to help people understand what kind of system call I'm looking for, so skip ahead if you already know all of this.)
Quick standby list reference
Pages in the standby list can be returned back to the working set of the process, which is when their contents are swapped out to disk and the physical memory is used for housing a fresh allocation or swapping in memory from disk (if there's no available "dead weight" zeroed memory on the system), or no swapping happens and the physical memory is returned to the same virtual memory region they were first removed from, sidestepping the swapping process while still having reduced the working set of the program to, well, the memory it's actively working on, back when they were removed from the working set and put into the standby list to begin with.
Alternatively, if another program requests physical memory and the system doesn't have zeroed pages (if no program was closed recently, for example, and the rest of RAM has been used up with various system caches), physical memory from the standby list of a program can be zeroed, removed from the standby list, and handed over to the program which requested the memory.
Back to memory offering
Since the offered memory never gets swapped out if, upon being removed from the standby list, it no longer belongs to the same virtual memory segment (removed from standby by anything other than ReclaimVirtualMemory), the reclamation process can fail, reporting that the contents of the memory region are now undefined (uninitialized memory has been fetched from the program's own standby list or from zeroed memory). This means that the program will have to re-generate the contents of the memory region from another data source, or by rerunning some computation.
The practical effect, when used to implement an intelligent computation cache system, is that, firstly, the reported working set of the program is reduced, giving a more accurate picture of how much memory it really needs. Secondly, the cached data, which can be re-generated from another region of memory, can be quickly discarded for another program to use that cache, without waiting for the disk (and putting additional strain on it, which adds up over time and results in increased wear) as it swaps out the contents of the cache, which aren't too expensive to recreate.
One good example of a use case is the render cache of a web browser, where it can just re-render parts of the page upon request, and has little to no use in having those caches taking up the working set and bugging the user which high memory usage. Pages which aren't currently being shown are the moment where this approach may give the biggest theoretical yield.
The question
Do Linux and macOS have a comparable API set that allows memory to be marked as discardable at the memory manager's discretion, with a fallible system call to lock that memory back in, declaring the memory uninitialized if it was indeed discarded?
Linux 4.5 and later has madvise with the MADV_FREE, the memory may be replaced with pages of zeros anytime until they are next written.
To lock the memory back in write to it, then read it to check if it has been zeroed. This needs to be done separately for every page.
Before Linux 4.12 the memory was freed immediately on systems without swap.
You need to take care of compiler memory reordering so use atomic_signal_fence or equivalent in C/C++.

Does binary stay in memory after program exits?

I know when a program first starts, it has massive page faults in the beginning since the code is not in memory, and thus need to load code from disk.
What happens when a program exits? Does the binary stay in memory? Would subsequent invocations of the program find that the code is already in memory and thus not have page faults (assuming nothing runs in between and pages stuff out to disk)?
It seems like the answer is no from running some experiments on my Linux machine. I ran some program over and over again, and observed the same number of page faults every time. It's a relatively quiet machine so I doubt stuff is getting paged out in between invocations. So, why is that? Why doesn't executable get to stay in memory?
There are two things to consider here:
1) The content of the executable file is likely kept in the OS cache (disk cache). While that data is still in the OS cache, every read for that data will hit the cache and the OS will honor the request without needing to re-read the file from disk
2) When a process exits, the OS unmaps every memory page mapped to a file, frees any memory (in general, releases every resource allocated by the process, including other resources, such as sockets, and so on). Strictly speaking, the physical memory may be zeroed, but not quite required (still, the security level of the OS may require to zero a page that is not used anymore - probably Windows NT, 2K, XP, etc, do that - see this Does Windows clear memory pages?). Another invocation of the same executable will create a brand new process which will map the same file in the memory, but the first access to those pages will still trigger page faults because, in the end, it is a new process, a different memory mapping. So yes, the page faults occur, but they are a lot cheaper for the second instance of the same executable compared to the first.
Of course, this is only about the read-only parts of the executable (the segments/modules containing the code and read-only data).
One may consider another scenario: forking. In this case, every page is marked as copy-on-write. When the first write occurs on each memory page, a hardware exception is triggered and intercepted by the OS memory manager. The OS determines if the page in question is allowed to be written (eg: if it is the stack, heap or any writable page in general) and if so, it allocates memory and copies the original content before allowing the process to modify the page - in order to preserve the original data in the other process. And yes, there is still another case - shared memory, where the exact physical memory is mapped to two or more processes. In this case, the copy-on-write flag is, of course, not set on the memory pages.
Hope this clarifies what is going on with the memory pages.
What I highly suspect is that parts, information blobs are not promptly erased from RAM unless there's a new request for more RAM from actually running code. For that part what probably happens is OS reusing OS dependent bits from RAM, on a next execution e.g. I think this is true for OS initiated resources (and probably not for all resources but some).
Actually most of your questions are highly implementation-dependant. But for most used OS:
What happens when a program exits? Does the binary stay in memory?
Yes, but the memory blocks are marked as unused (and thus could be allocated to other processes).
Would subsequent invocations of the program find that the code is
already in memory and thus not have page faults (assuming nothing runs
in between and pages stuff out to disk)?
No, those blocks are considered empty. Some/all blocks might have been overwritten already.
Why doesn't executable get to stay in memory?
Why would it stay? When a process is finished, all of its allocated resources are freed.
One of the reasons is that one generally wants to clear everything out on a subsequent invocation in case their was a problem in the previous.
Plus, the writeable data must be moved out.
That said, some systems do have mechanisms for keeping executable and static data in memory (possibly not linux). For example, the VMS operating system allows the system manager to install executables and shared libraries so that they remain in memory (paging allowed). The same system can be used to create create writeable shared memory allowing interprocess communication and for modifications to the memory to remain in memory (possibly paged out).

Allocating "temporary" memory (in Linux)

I'm trying to find any system functionality that would allow a process to allocate "temporary" memory - i.e. memory that is considered discardable by the process, and can be take away by the system when memory is needed, but allowing the process to benefit from available memory when possible. In other words, the process tells the system it's OK to sacrifice the block of memory when the process is not using it. Freeing the block is also preferable to swapping it out (it's more expensive, or as expensive, to swap it out rather then re-constitute its contents).
Systems (e.g. Linux), have those things in the kernel, like F/S memory cache. I am looking for something like this, but available to the user space.
I understand there are ways to do this from the program, but it's really more of a kernel job to deal with this. To some extent, I'm asking the kernel:
if you need to reduce my, or another process residency, take these temporary pages off first
if you are taking these temporary pages off, don't swap them out, just unmap them
Specifically, I'm interested on a solution that would work on Linux, but would be interested to learn if any exist for any other O/S.
UPDATE
An example on how I expect this to work:
map a page (over swap). No difference to what's available right now.
tell the kernel that the page is "temporary" (for the lack of a better name), meaning that if this page goes away, I don't want it paged in.
tell the kernel that I need the temporary page "back". If the page was unmapped since I marked it "temporary", I am told that happened. If it hasn't, then it starts behaving as a regular page.
Here are the problems to have that done over existing MM:
To make pages not being paged in, I have to allocate them over nothing. But then, they can get paged out at any time, without notice. Testing with mincore() doesn't guarantee that the page will still be there by the time mincore() finishes. Using mlock() requires elevated privileges.
So, the closest I can get to this is by using mlock(), and anonymous pages. Following the expectations I outlined earlier, it would be:
map an anonymous, locked page. (MAP_ANON|MAP_LOCKED|MAP_NORESERVE). Stamp the page with magic.
for making page "temporary", unlock the page
when needing the page, lock it again. If the magic is there, it's my data, otherwise it's been lost, and I need to reconstitute it.
However, I don't really need for pages to be locked in RAM when I'm using them. Also, MAP_NORESERVE is problematic if memory is overcommitted.
This is what the VmWare ESXi server aka the Virtual Machine Monitor (VMM) layer implements. This is used in the Virtual Machines and is a way to reclaim memory from the virtual machine guests. Virtual machines that have more memory allocated than they actually are using/require are made to release/free it to the VMM so that it can assign it back to the Virtual Machines guests that are in need of it.
This technique of Memory Reclamation is mentioned in this paper: http://www.vmware.com/files/pdf/mem_mgmt_perf_vsphere5.pdf
On similar lines, something similar you can implement in your kernel.
I'm not sure to understand exactly your needs. Remember that processes run in virtual memory (their address space is virtual), that the kernel is dealing with virtual to physical address translation (using the MMU) and with paging. So page fault can happen at any time. The kernel will choose to page-in or page-out at arbitrary moments - and will choose which page to swap (only the kernel care about RAM, and it can page-out any physical RAM page at will). Perhaps you want the kernel to tell you when a page is genuinely discarded. How would the kernel take away temporary memory from your process without your process being notified ? The kernel could take away and later give back some RAM.... (so you want to know when the given back memory is fresh)
You might use mmap(2) with MAP_NORESERVE first, then again (on the same memory range) with MAP_FIXED|MAP_PRIVATE. See also mincore(2) and mlock(2)
You can also later use madvise(2) with MADV_WONTNEED or MADV_WILLNEED etc..
Perhaps you want to mmap some device like /dev/null, /dev/full, /dev/zero or (more likely) write your own kernel module providing a similar device.
GNU Hurd has an external pager mechanism... You cannot yet get exactly that on Linux. (Perhaps consider mmap on some FUSE mounted file).
I don't understand what you want to happen when the kernel is paging out your memory, and what you want to happen when the kernel is paging in again such a page because your process is accessing it. Do you want to get a zero-ed page, or a SIGSEGV ?

mmap: will the mapped file be loaded into memory immediately?

From the manual, I just know that mmap() maps a file to a virtual address space, so the file can be randomly accessed. But, it is unclear to me that whether the mapped file is loaded into memory immediately? I guess that kernel manages the mapped memory by pages, and they are loaded on demand, if I only do a few of reads and writes, only a few pages are loaded. Is it correct?
No, yes, maybe. It depends.
Calling mmap generally only means that to your application, the mapped file's contents are mapped to its address space as if the file was loaded there. Or, as if the file really existed in memory, as if they were one and the same (which includes changes being written back to disk, assuming you have write access).
No more, no less. It has no notion of loading something, nor does the application know what this means.
An application does not truly have knowledge of any such thing as memory, although the virtual memory system makes it appear like that. The memory that an application can "see" (and access) may or may not correspond to actual physical memory, and this can in principle change at any time, without prior warning, and without an obvious reason (obvious to your application).
Other than possibly experiencing a small delay due to a page fault, an application is (in principle) entirely unaware of any such thing happening and has little or no control over it1.
Applications will, generally, load pages from mapped files (including the main executable!) on demand, as a consequence of encountering a fault. However, an operating system will usually try to speculatively prefetch data to optimize performance.
In practice, calling mmap will immediately begin to (asynchronously) prefetch pages from the beginning of the mapping, up to a certain implementation-specified size. Which means, in principle, for small files the answer would be "yes", and for larger files it would be "no".
However, mmap does not block to wait for completion of the readahead, which means that you have no guarantee that any of the file is in RAM immediately after mmap returns (not that you have that guarantee at any time anyway!). Insofar, the answer is "maybe".
Under Linux, last time I looked, the default prefetch size was 31 blocks (~127k) -- but this may have changed, plus it's a tuneable parameter. As pages near or at the end of the prefetched area are touched, more pages are being prefetched asynchronously.
If you have hinted MADV_RANDOM to madvise, prefetching is "less likely to happen", under Linux this completely disables prefetch.
On the other hand, giving the MADV_SEQUENTIAL hint will asynchronously prefetch "more aggressively" beginning from the beginning of the mapping (and may discard accessed pages quicker). Under Linux, "more aggressively" means twice the normal amount.
Giving the MADV_WILLNEED hint suggests (but does not guarantee) that all pages in the given range are loaded as soon as possible (since you're saying you're going to access them). The OS may ignore this, but under Linux, it is treated rather as an order than a hint, up to the process' maximum RSS limit, and an implementation-specified limit (if I remember correctly, 1/2 the amount of physical RAM).
Note that MADV_DONTNEED is arguably implemented wrongly under Linux. The hint is not interpreted in the way specified by POSIX, i.e. you're OK with pages being paged out for the moment, but rather that you mean to discard them. Which makes no big difference for readonly mapped pages (other than a small delay, which you said would be OK), but it sure does matter for everything else.
In particular, using MADV_DONTNEED thinking Linux will release unneeded pages after the OS has written them lazily to disk is not how things work! You must explicitly sync, or prepare for a surprise.
Having called readahead on the file descriptor prior to calling mmap (or alternatively, having had read/written the file previously), the file's contents will in practice indeed be in RAM immediately.
This is, however, only an implementation detail (unified virtual memory system), and subject to memory pressure on the system.
Calling mlock will -- assuming it succeeds2 -- immediately load the requested pages into RAM. It blocks until all pages are physically present, and you have the guarantee that the pages will stay in RAM until you unlock them.
1 There exist functionality to query (mincore) whether any or all of the pages in a particular range are actually present at the very moment, and functionality to hint the OS about what you would like to see happening without any hard guarantees (madvise), and finally functionality to force a limited subset of pages to be present in memory (mlock) for privilegued processes.
2 It might not, both for lack of privilegues and for exceeding quotas or the amount of physical RAM present.
Yes, mmap creates a mapping. It does not normally read the entire content of whatever you have mapped into memory. If you wish to do that you can use the mlock/mlockall system call to force the kernel to read into RAM the content of the mapping, if applicable.
By default, mmap() only configure the mapping and returns (fast).
Linux (at least) has the option MAP_POPULATE (see 'man mmap') that does exactly what your question is about.
Yes. The whole point of mmap is that is manages memory more efficiently than just slurping everything into memory.
Of course, any given implementation may in some situations decide that it's more efficient to read in the whole file in one go, but that should be transparent to the program calling mmap.

Why do discussions of "swappiness" act like information can only be in one place at a time?

I've been reading up on Linux's "swappiness" tuneable, which controls how aggressive the kernel is about swapping applications' memory to disk when they're not being used. If you Google the term, you get a lot of pages like this discussing the pros and cons. In a nutshell, the argument goes like this:
If your swappiness is too low, inactive applications will hog all the system memory that other programs might want to use.
If your swappiness is too high, when you wake up those inactive applications, there's going to be a big delay as their state is read back off the disk.
This argument doesn't make sense to me. If I have an inactive application that's using a ton of memory, why doesn't the kernel page its memory to disk AND leave another copy of that data in-memory? This seems to give the best of both worlds: if another application needs that memory, it can immediately claim the physical RAM and start writing to it, since another copy of it is on disk and can be swapped back in when the inactive application is woken up. And when the original app wakes up, any of its pages that are still in RAM can be used as-is, without having to pull them off the disk.
Or am I missing something?
If I have an inactive application that's using a ton of memory, why doesn't the kernel page its memory to disk AND leave another copy of that data in-memory?
Lets say we did it. We wrote the page to disk, but left it in memory. A while later another process needs memory, so we want to kick out the page from the first process.
We need to know with absolute certainty whether the first process has modified the page since it was written out to disk. If it has, we have to write it out again. The way we would track this is to take away the process's write permission to the page back when we first wrote it out to disk. If the process tries to write to the page again there will be a page fault. The kernel can note that the process has dirtied the page (and will therefore need to be written out again) before restoring the write permission and allowing the application to continue.
Therein lies the problem. Taking away write permission from the page is actually somewhat expensive, particularly in multiprocessor machines. It is important that all CPUs purge their cache of page translations to make sure they take away the write permission.
If the process does write to the page, taking a page fault is even more expensive. I'd presume that a non-trivial number of these pages would end up taking that fault, which eats into the gains we were looking for by leaving it in memory.
So is it worth doing? I honestly don't know. I'm just trying to explain why leaving the page in memory isn't so obvious a win as it sounds.
(*) This whole thing is very similar to a mechanism called Copy-On-Write, which is used when a process fork()s. The child process is very likely going to execute just a few instructions and call exec(), so it would be silly to copy all of the parents pages. Instead the write permission is taken away and the child simply allowed to run. Copy-On-Write is a win because the page fault is almost never taken: the child almost always calls exec() immediately.
Even if you page the apps memory to disk and keep it in memory, you would still have to decide when should an application be considered "inactive" and that's what swapiness controls. Paging to disk is expensive in terms of IO and you don't want to do it too often. There is also another variable on this equation, and that is the fact that Linux uses of remaining memory as disk buffers/cache.
According to this 1 that is exactly what Linux does.
I'm still trying to make sense of a lot of this, so any authoritative links would be appreciated.
The first thing the VM does is clean pages and move them to the clean list.
When cleaning anonymous memory (things which do not have an actual file backing store, you can see the segments in /proc//maps which are anonymous and have no filesystem vnode storage behind them), the first thing the VM is going to do is take the "dirty" pages and "clean" then by writing the contents of the page out to swap. Now when the VM has a shortage of completely free memory and is worried about its ability to grant new free pages to be used, it can go through the list of 'clean' pages and based on how recently they were used and what kind of memory they are it will move those pages to the free list.
Once the memory pages are placed on the free list, they no longer are associated with the contents they had before. If a program comes along a references the memory location the page was serving previously the program will take a major fault and a (most likely completely different) page will be grabbed from the free list and the data will be read into the page from disk. Once this is done, the page is actually still 'clean' since it has not been modified. If the VM chooses to use that page on swap for a different page in RAM then the page would be again 'dirtied', or if the app wrote to that page it would be 'dirtied'. And then the process begins again.
Also, swappinness is pretty horrible for server applications in a business/transactional/online/latency-sensitive environment. When I've got 16GB RAM boxes where I'm not running a lot of browsers and GUIs, I typically want all my apps nearly pinned in memory. The bulk of my RAM tends to be 8-10GB java heaps that I NEVER want paged to disk, ever, and the cruft that is available are processes like mingetty (but even there the glibc pages in those apps are shared by other apps and actually used, so even the RSS size of those useless processes are mostly shared, used pages). I normally don't see more than a few 10MBs of the 16GB actually cleaned to swap. I would advise very, very low swappiness numbers or zero swappiness for servers -- the unused pages should be a small fraction of the overall RAM and trying to reclaim that relatively tiny amount of RAM for buffer cache risks swapping application pages and taking latency hits in the running app.

Resources