Virtually contiguous vs. physically contiguous memory - linux

Is virtually contiguous memory also always physically contiguous? If not, how is virtually contiguous memory allocated and memory-mapped over physically non-contiguous RAM blocks? A detailed answer is appreciated.

Short answer: You need not care (unless you're a kernel/driver developer). It is all the same to you.
Longer answer: On the contrary, virtually contiguous memory is usually not physically contiguous beyond very small amounts, except by coincidence or shortly after the machine has booted. Physical contiguity isn't necessary, however.
The only way of allocating larger amounts of physically contiguous RAM is by using large pages (since the memory within one page needs to be contiguous). It is however a useless endeavor: your process cannot observe whether memory that looks contiguous to it is actually physically contiguous, and using large pages has disadvantages.
Memory mapping over physically non-contiguous RAM works in no particularly "special" way. It follows the same method that all memory management follows.
The OS divides virtual memory into "pages" and creates page table entries for your process. When you access memory at some location, either the corresponding page does not exist at all, or it exists and corresponds to a real page in RAM, or it exists but doesn't correspond to a real page in RAM.
If the page exists in RAM, nothing happens at all [1]. Otherwise a fault is generated and some operating system code is run. If it turns out the page doesn't exist at all (or does not have the correct access rights), your process is killed with a segmentation fault.
Otherwise, the OS chooses an arbitrary page that isn't used (or it swaps out the one it thinks is the least important one), and loads the data from disk into that page. In the case of a memory mapping, the data comes from the mapped file; otherwise it comes from swap (and for newly allocated memory, the zero page is copied). The OS then returns control to your process. You never know this happened.
If you access another location in a "contiguous" (or so you think!) memory area which lies in a different page, the exact same procedure runs.
[1] In reality, it is a little more complicated, since a page may exist in RAM but not exist "officially", being part of a list of pages that are to be recycled or such. But this gets too complicated.
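The three cases described above can be sketched as a toy model. This is only an illustration, not how a kernel is implemented: `ToyAddressSpace`, `SegmentationFault`, and the frame numbers are made-up names and values, and eviction of pages under memory pressure is left out.

```python
class SegmentationFault(Exception):
    """Raised by the toy model where a real kernel would deliver SIGSEGV."""


class ToyAddressSpace:
    def __init__(self, valid_pages):
        # Virtual page number -> physical frame number.
        # None means "valid page, but currently not backed by a frame in RAM".
        self.page_table = {page: None for page in valid_pages}
        # Free frames in arbitrary order: frames handed out need not be adjacent.
        # (The toy has no eviction, so it simply runs out after four faults.)
        self.free_frames = [5, 2, 9, 0]
        self.faults = 0

    def access(self, page):
        # Case 1: the page does not exist at all -> segmentation fault.
        if page not in self.page_table:
            raise SegmentationFault(f"page {page} is not mapped")
        # Case 2: the page exists but is not in RAM -> fault, pick a frame.
        if self.page_table[page] is None:
            self.faults += 1
            self.page_table[page] = self.free_frames.pop(0)
        # Case 3 (and the end of case 2): the page is in RAM, nothing else happens.
        return self.page_table[page]
```

Accessing the same page twice only faults the first time, and neighbouring virtual pages end up in whatever frames happened to be free.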

No, it doesn't have to. Any page of the virtual memory can be mapped to an arbitrary physical page. Therefore you can have adjacent pages of your virtual memory pointing to non-adjacent physical pages. This mapping is maintained by the OS and is used by the CPU's MMU.
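As a rough illustration of that mapping, here is the address arithmetic with a hypothetical single-level page table. The 4 KiB page size and the frame numbers in `page_table` are assumptions chosen for the example; real MMUs walk multi-level tables in hardware.

```python
PAGE_SIZE = 4096  # assumed page size for the example

# Hypothetical page table: virtual page number -> physical frame number.
# Virtual pages 0, 1, 2 are adjacent; frames 7, 2, 9 are not.
page_table = {0: 7, 1: 2, 2: 9}


def translate(virtual_address):
    """Split the address into page number and offset, then swap in the frame."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table[page]
    return frame * PAGE_SIZE + offset
```

Here `translate(4095)` and `translate(4096)` are one byte apart virtually but land in different, non-adjacent frames.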


Do Linux and macOS have an `OfferVirtualMemory` counterpart?

Windows, starting with a certain unspecified update of Windows 8.1, has the excellent OfferVirtualMemory and ReclaimVirtualMemory system calls which allow memory regions to be "offered" to the OS. This removes them from the working set, reduces the amount of physical memory usage that is attributed to the calling process, and puts them onto the standby memory list of the program, but without ever swapping out the contents anywhere.
(Below is a brief and rough explanation of what those do and how standby lists work, to help people understand what kind of system call I'm looking for, so skip ahead if you already know all of this.)
Quick standby list reference
When a process needs memory back in its working set, pages on its standby list can be handled in one of two ways. Either their contents are swapped out to disk and the physical memory is used to house a fresh allocation or to swap in memory from disk (if the system has no available "dead weight" zeroed memory), or no swapping happens and the physical memory is returned to the same virtual memory region it was first removed from. The second path sidesteps swapping entirely, yet the program's working set was still reduced to, well, the memory it is actively working on for as long as those pages sat on the standby list.
Alternatively, if another program requests physical memory and the system doesn't have zeroed pages (if no program was closed recently, for example, and the rest of RAM has been used up with various system caches), physical memory from the standby list of a program can be zeroed, removed from the standby list, and handed over to the program which requested the memory.
Back to memory offering
Offered memory is never swapped out. So if it is removed from the standby list by anything other than ReclaimVirtualMemory, and therefore no longer belongs to the same virtual memory segment, the reclamation can fail, reporting that the contents of the memory region are now undefined (uninitialized memory has been fetched from the program's own standby list or from zeroed memory). This means that the program will have to regenerate the contents of the memory region from another data source, or by rerunning some computation.
The practical effect, when used to implement an intelligent computation cache system, is that, firstly, the reported working set of the program is reduced, giving a more accurate picture of how much memory it really needs. Secondly, the cached data, which can be re-generated from another region of memory, can be quickly discarded for another program to use that cache, without waiting for the disk (and putting additional strain on it, which adds up over time and results in increased wear) as it swaps out the contents of the cache, which aren't too expensive to recreate.
One good example of a use case is the render cache of a web browser, which can just re-render parts of the page upon request and gains little to nothing from having those caches take up the working set and bug the user with high memory usage. Pages that aren't currently being shown are where this approach may give the biggest theoretical yield.
The question
Do Linux and macOS have a comparable API set that allows memory to be marked as discardable at the memory manager's discretion, with a fallible system call to lock that memory back in, declaring the memory uninitialized if it was indeed discarded?
Linux 4.5 and later has madvise with MADV_FREE: the kernel may replace the memory with pages of zeros at any time until it is next written. It applies only to private anonymous mappings.
To lock the memory back in, write to it, then read it to check whether it has been zeroed. This needs to be done separately for every page.
Before Linux 4.12, the memory was freed immediately on systems without swap.
You need to guard against compiler memory reordering, so use atomic_signal_fence or an equivalent in C/C++.
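A minimal sketch of the offer and reclaim pattern described above, written with Python's mmap module (`mmap.madvise` needs Python 3.8 or later on a platform that defines MADV_FREE). The helper names `make_cache`, `offer`, and `reclaim`, and the per-page layout (byte 0 is scratch, byte 1 is a non-zero marker) are assumptions made for this example, not part of any API. Compiler reordering is not a concern in Python, so no fence appears here.

```python
import mmap

PAGE = mmap.PAGESIZE
MARKER = 0x01  # non-zero byte kept at offset 1 of every page (hypothetical layout)


def make_cache(pages):
    """Private anonymous mapping; MADV_FREE applies only to this kind of memory."""
    region = mmap.mmap(-1, pages * PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    for start in range(0, len(region), PAGE):
        region[start + 1] = MARKER
    return region


def offer(region):
    """Mark the pages as discardable. Returns False when MADV_FREE is unavailable."""
    if not hasattr(region, "madvise") or not hasattr(mmap, "MADV_FREE"):
        return False
    try:
        region.madvise(mmap.MADV_FREE)
    except OSError:
        return False
    return True


def reclaim(region):
    """Write to each page first, then check whether its marker survived."""
    survived = []
    for start in range(0, len(region), PAGE):
        region[start] = 1  # the write ends the "may be zeroed" window for this page
        survived.append(region[start + 1] == MARKER)
    return survived
```

Any page reported as `False` by `reclaim` was discarded and its contents must be regenerated. Without memory pressure the kernel usually leaves the pages alone, so all entries will typically be `True`.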

Why does dereferencing pointer from mmap cause memory usage reported by top to increase?

I am calling mmap() with MAP_SHARED and PROT_READ to access a file which is about 25 GB in size. I have noticed that advancing the returned pointer has no effect on %MEM in top for the application, but once I start dereferencing the pointer at different locations, memory usage wildly increases and caps at 55%. That value goes back down to 0.2% once munmap is called.
I don't know if I should trust that 55% value top reports. It doesn't seem like it is actually using 8 GB of the available 16. Should I be worried?
When you first map the file, all it does is reserve address space; it doesn't necessarily read anything from the file if you don't pass MAP_POPULATE (the OS might do a little prefetch, but it's not required to, and often doesn't until you begin reading/writing).
When you read from a given page of memory for the first time, this triggers a page fault. This isn't the "invalid page fault" most people think of when they hear the name; it's either:
A minor fault - The data is already loaded in the kernel, but the userspace mapping for that address to the loaded data needs to be established (fast)
A major fault - The data is not loaded at all, and the kernel needs to allocate a page for the data, populate it from the disk (slow), then perform the same mapping to userspace as in the minor fault case
The behavior you're seeing is likely due to the mapped file being too large to fit in memory alongside everything else that wants to stay resident, so:
When first mapped, the initial pages aren't already mapped to the process (some of them might be in the kernel cache, but they're not charged to the process unless they're linked to the process's address space by minor page faults)
You read from the file, causing minor and major faults until you fill main RAM
Once you fill main RAM, faulting in a new page typically leads to one of the older pages being dropped (you're not using all the pages as much as the OS and other processes are using theirs, so the low activity pages, especially ones that can be dropped for free rather than written to the page/swap file, are ideal pages to discard), so your memory usage steadies (for every page read in, you drop another)
When you munmap, the accounting against your process is dropped. Many of the pages are likely still in the kernel cache, but unless they're remapped and accessed again soon, they're likely first on the chopping block to discard if something else requests memory
And as commenters noted, shared memory mapped file accounting gets weird; every process is "charged" for the memory, but they'll all report it as shared even if no other processes map it, so it's not practical to distinguish "shared because it's MAP_SHARED and backed by kernel cache, but no one else has it mapped so it's effectively uniquely owned by this process" from "shared because N processes are mapping the same data, reporting shared_amount * N usage cumulatively, but actually only consuming shared_amount memory total (plus a trivial amount to maintain the per-process page tables for each mapping)". There's no reason to be worried if the tallies don't line up.
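The effect can be observed on a small scale by counting page faults while touching a mapped file. This is a sketch under assumptions: it uses a 1 MiB temporary file rather than a 25 GB one, `touch_mapped_file` is a made-up helper name, and the fault counters come from `resource.getrusage`, which is Unix-only.

```python
import mmap
import resource
import tempfile


def fault_counts():
    """Return (minor, major) page fault counters for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def touch_mapped_file(size_in_pages=256):
    """Map a temporary file read-only and read one byte from every page."""
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * (size_in_pages * mmap.PAGESIZE))
        f.flush()
        with mmap.mmap(f.fileno(), 0, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ) as view:
            before = fault_counts()
            checksum = 0
            for offset in range(0, len(view), mmap.PAGESIZE):
                checksum += view[offset]  # first touch of each page may fault
            after = fault_counts()
    return checksum, after[0] - before[0], after[1] - before[1]
```

Because the file was just written, its data is most likely still in the kernel cache, so the faults should be minor ones. The count can be lower than the number of pages, since the kernel may map several neighbouring cached pages on a single fault.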

In Linux, are physical memory pages belonging to the kernel data segment swappable or not?

I'm asking because I remember that all physical pages belonging to the kernel are pinned in memory and thus are unswappable, like what is said here:
However, I'm reading a research paper and feel confused as it says,
"(physical) pages frequently move between the kernel data segment and user space."
It also mentions that, in contrast, physical pages do not move between the kernel code segment and user space.
I think if a physical page sometimes belongs to the kernel data segment and sometimes belongs to user space, it must mean that physical pages belonging to the kernel data segment are swappable, which is against my current understanding.
So, are physical pages belonging to the kernel data segment swappable or unswappable?
P.S. The research paper is available here:
Please search "move between" and you will find it.
P.S. again, a virtual memory area ranging from [3G + 896M] to 4G belongs to the kernel and is used for mapping physical pages in ZONE_HIGHMEM (x86 32-bit Linux, 3G + 1G setting). In such a case, the kernel may first map some virtual pages in the area to the physical pages that host the current process's page table, modify some page table entries, and unmap the virtual pages. This way, the physical pages may sometimes belong to the kernel and sometimes belong to user space, because they do not belong to the kernel after the unmapping and thus become swappable. Is this the reason?
tl;dr - the memory pools and swapping are different concepts. You cannot make any deductions from one about the other.
kmalloc() and other kernel data allocations come from slab/slub, etc., the same place the kernel gets memory for user space. Ergo pages frequently move between the kernel data segment and user space. This is correct. It doesn't say anything about swapping. That is a separate issue, and you cannot deduce anything about it.
The kernel code is typically populated at boot and marked read-only and never changes after that. Ergo physical pages do not move between the kernel code segment and user space.
Why do you think that because something comes from the same pool, it is the same? The network sockets also come from the same memory pool. It is a separation of concerns. The linux-mm (memory management system) handles swap. A page can be pinned (unswappable). The check for static kernel memory (this may include .bss and .data) is a simple range check. The memory is normally pinned and marked unswappable at the linux-mm layer. The user data (whose allocations come from the same pool) can be marked as swappable by the linux-mm. For instance, even without swap, user-space text is still swappable because it is backed by an inode. Caching is much simpler for read-only data. If data is swapped, it is marked as such in the MMU tables and a fault handler must distinguish between swap and a SIGBUS; this is part of the linux-mm.
There are also versions of Linux with no-mm (or no MMU) and these will never swap anything. In theory someone might be able to swap kernel data; but then why is it in the kernel? The Linux way would be to use modules and only load them as needed. Certainly, the linux-mm data is kernel data and hopefully, you can see a problem with swapping that.
The problem with conceptual questions like this:
It can differ with Linux versions.
It can differ with Linux configurations.
The advice can change as Linux evolves.
For certain, the linux-mm code can not be swappable, nor any interrupt handler. It is possible that at some point in time, kernel code and/or data could be swapped. I don't think that this is ever the current case outside of module loading/unloading (and it is rather pedantic/esoteric as to whether you call this swapping or not).
I think if a physical page sometimes belongs to the kernel data segment and sometimes belongs to user space, it must mean that physical pages belonging to the kernel data segment are swappable, which is against my current understanding.
There is no connection between swappable memory and page movement between user space and kernel space. Whether a page can be swapped depends entirely on whether it is pinned. Pinned pages are not swapped, so their mapping is considered permanent.
So, are physical pages belonging to the kernel data segment swappable or unswappable?
Usually, pages used by the kernel are pinned and so are not meant to be swappable.
However, I'm reading a research paper and feel confused as it says, "(physical) pages frequently move between the kernel data segment and user space."
Could you please give a link to this research paper?
As far as I know (just from UNIX lectures and labs at school), the pages for kernel space are allocated to the kernel with a simple, fixed mapping algorithm, and they are all pinned. After the kernel turns on paging mode (bit operations on CR0 and CR3 on x86), the first user-mode process is created, and the pages that have been allocated to the kernel are not in the set of pages available to user space.

mmap: will the mapped file be loaded into memory immediately?

From the manual, I just know that mmap() maps a file to a virtual address space, so the file can be randomly accessed. But it is unclear to me whether the mapped file is loaded into memory immediately. I guess that the kernel manages the mapped memory by pages, and they are loaded on demand: if I only do a few reads and writes, only a few pages are loaded. Is that correct?
No, yes, maybe. It depends.
Calling mmap generally only means that to your application, the mapped file's contents are mapped to its address space as if the file was loaded there. Or, as if the file really existed in memory, as if they were one and the same (which includes changes being written back to disk, assuming you have write access).
No more, no less. It has no notion of loading something, nor does the application know what this means.
An application does not truly have knowledge of any such thing as memory, although the virtual memory system makes it appear like that. The memory that an application can "see" (and access) may or may not correspond to actual physical memory, and this can in principle change at any time, without prior warning, and without an obvious reason (obvious to your application).
Other than possibly experiencing a small delay due to a page fault, an application is (in principle) entirely unaware of any such thing happening and has little or no control over it [1].
Applications will, generally, load pages from mapped files (including the main executable!) on demand, as a consequence of encountering a fault. However, an operating system will usually try to speculatively prefetch data to optimize performance.
In practice, calling mmap will immediately begin to (asynchronously) prefetch pages from the beginning of the mapping, up to a certain implementation-specified size. Which means, in principle, for small files the answer would be "yes", and for larger files it would be "no".
However, mmap does not block to wait for completion of the readahead, which means that you have no guarantee that any of the file is in RAM immediately after mmap returns (not that you have that guarantee at any time anyway!). Insofar, the answer is "maybe".
Under Linux, last time I looked, the default prefetch size was 31 blocks (~127k) -- but this may have changed, plus it's a tuneable parameter. As pages near or at the end of the prefetched area are touched, more pages are being prefetched asynchronously.
If you have hinted MADV_RANDOM to madvise, prefetching is "less likely to happen"; under Linux, this completely disables prefetch.
On the other hand, giving the MADV_SEQUENTIAL hint will asynchronously prefetch "more aggressively" beginning from the beginning of the mapping (and may discard accessed pages quicker). Under Linux, "more aggressively" means twice the normal amount.
Giving the MADV_WILLNEED hint suggests (but does not guarantee) that all pages in the given range are loaded as soon as possible (since you're saying you're going to access them). The OS may ignore this, but under Linux, it is treated rather as an order than a hint, up to the process' maximum RSS limit, and an implementation-specified limit (if I remember correctly, 1/2 the amount of physical RAM).
Note that MADV_DONTNEED is arguably implemented wrongly under Linux. The hint is not interpreted in the way specified by POSIX, i.e. you're OK with pages being paged out for the moment, but rather that you mean to discard them. Which makes no big difference for readonly mapped pages (other than a small delay, which you said would be OK), but it sure does matter for everything else.
In particular, using MADV_DONTNEED thinking Linux will release unneeded pages after the OS has written them lazily to disk is not how things work! You must explicitly sync, or prepare for a surprise.
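For illustration, the access-pattern hints discussed above can be passed from Python through `mmap.madvise` (Python 3.8 or later). This is a sketch: `sum_with_hint` is a made-up helper, and the hints only affect how the kernel prefetches, never what is read, so the returned sum is the same for every hint.

```python
import mmap


def sum_with_hint(fileobj, hint_name):
    """Map the whole file read-only, apply one madvise hint, touch every page.

    hint_name is the name of an mmap constant such as "MADV_RANDOM",
    "MADV_SEQUENTIAL" or "MADV_WILLNEED". Returns (hint_applied, total).
    """
    hint = getattr(mmap, hint_name, None)  # None if this platform lacks the constant
    with mmap.mmap(fileobj.fileno(), 0, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ) as view:
        applied = False
        if hint is not None and hasattr(view, "madvise"):
            try:
                view.madvise(hint)  # advisory only; the kernel may ignore it
                applied = True
            except OSError:
                pass
        total = sum(view[offset] for offset in range(0, len(view), mmap.PAGESIZE))
    return applied, total
```

The file must be non-empty and flushed before it is mapped, since a length of 0 asks for the current file size.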
Having called readahead on the file descriptor prior to calling mmap (or alternatively, having read or written the file previously), the file's contents will in practice indeed be in RAM immediately.
This is, however, only an implementation detail (unified virtual memory system), and subject to memory pressure on the system.
Calling mlock will -- assuming it succeeds [2] -- immediately load the requested pages into RAM. It blocks until all pages are physically present, and you have the guarantee that the pages will stay in RAM until you unlock them.
[1] There exists functionality to query (mincore) whether any or all of the pages in a particular range are actually present at the very moment, functionality to hint the OS about what you would like to see happening without any hard guarantees (madvise), and finally functionality to force a limited subset of pages to be present in memory (mlock) for privileged processes.
[2] It might not, both for lack of privileges and for exceeding quotas or the amount of physical RAM present.
Yes, mmap creates a mapping. It does not normally read the entire content of whatever you have mapped into memory. If you wish to do that you can use the mlock/mlockall system call to force the kernel to read into RAM the content of the mapping, if applicable.
By default, mmap() only configures the mapping and returns (fast).
Linux (at least) has the option MAP_POPULATE (see 'man mmap') that does exactly what your question is about.
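A small sketch of requesting a pre-populated mapping from Python. It assumes Linux and relies on the `mmap.MAP_POPULATE` constant, which Python exposes from version 3.10 on; on older versions the sketch falls back to an ordinary lazy mapping. `map_prefaulted` is a made-up helper name.

```python
import mmap


def map_prefaulted(fileobj):
    """Map a file read-only, asking the kernel to populate the pages up front.

    Returns (view, populated_requested). The caller closes the view.
    """
    # 0 means the constant is not available, so the flag is simply not set.
    populate = getattr(mmap, "MAP_POPULATE", 0)
    view = mmap.mmap(
        fileobj.fileno(),
        0,  # length 0 maps the whole file
        flags=mmap.MAP_SHARED | populate,
        prot=mmap.PROT_READ,
    )
    return view, bool(populate)
```

Either way the contents read through the view are identical; the flag only moves the cost of faulting the pages in from first access to the mmap call itself.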
Yes. The whole point of mmap is that it manages memory more efficiently than just slurping everything into memory.
Of course, any given implementation may in some situations decide that it's more efficient to read in the whole file in one go, but that should be transparent to the program calling mmap.

Where does virtual memory exist in linux?

A program is stored on flash/disk. For its execution, the program is loaded into virtual memory and is mapped to RAM by the virtual memory manager. During its execution, the process is in RAM. Then where does virtual memory exist (where it has all of .text, .data, .stack, .heap)?
The virtual memory is a view of the RAM plus maybe some swap space provided by a virtual memory manager. Modern OSs have virtual memory managers and provide virtual memory to processes so that the executing program can behave as if it had a contiguous address space whose size is not limited by the actual RAM. The pages or blocks making up the virtual memory can be mapped anywhere in the RAM, so that contiguous virtual pages need not be stored in contiguous RAM areas. Or they can be swapped out to page space or swap space, waiting there until needed, whereupon they're read by the OS and mapped to some RAM page.
When you say
During its execution, the process is in RAM.
This is not entirely correct. Some or all memory pages that belong to the process may be swapped out, as explained.
One more word concerning the answers and comments that say that "virtual" means it doesn't exist. This makes no sense. On the contrary, according to Webster:
being such in essence or effect ...
Hence virtual memory is something (therefore, it exists!) that behaves as if it were memory.
Virtual memory is like an illusion of RAM. It uses paging to make more memory appear available to the processes in the operating system than the RAM alone provides.
Virtual memory means memory you can access with "normal" memory access methods, although it isn't clear where the data is actually stored.
It may be
actually in RAM
in a swap area
in another file (memory mapped file)
and access to it will be handled appropriately.
It is a layer of, well, virtualization so that you as a programmer don't have to worry about where the data is actually put.
The original purpose was mainly to provide more memory to processes than is physically present by extending it with swap space, but there are further uses:
The OS is free to use the RAM for whatever it deems necessary, e.g. caching. Under some circumstances, it may be more effective to use RAM for cache than for holding parts of a program which haven't been used for a long time.
Provide additional memory to a program when it requests it: if you call malloc(), the program's library may request the OS to provide a part of memory which can be attached seamlessly into the address space.
Avoid stack overflow: if the stack grows larger and larger, the respective memory section may be extended as well transparently so that the program won't have to worry about it.
A system can even do "overcommitment" of memory: if a process requests a large amount of memory, the OS may say "yes, ok", i.e. provide the memory to the program. That means in the first place "allow the program to access a certain address space area", but this address space is not immediately backed by memory. Only when the program accesses this memory is the mapping done, and if this cannot be fulfilled, the program is killed by the Out Of Memory killer (at least, under Linux).
All this works by page-wise (one page is typically 4 KiB) assignment of physical memory to a program, viewed via the program's address space, in the amount and at the frequency needed.
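The "reserve first, back with memory on access" behaviour can be sketched from user space. The example assumes a Unix system; `reserve_then_touch` and its default sizes (a 64 MiB reservation with 4 KiB pages, 8 touched pages) are choices made for illustration.

```python
import mmap
import resource


def minor_faults():
    """Minor page fault counter for this process."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def reserve_then_touch(total_pages=16384, touched_pages=8):
    """Reserve a large anonymous region, then write to only a few of its pages."""
    before_map = minor_faults()
    region = mmap.mmap(
        -1, total_pages * mmap.PAGESIZE,
        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
    )
    after_map = minor_faults()  # mapping only reserves address space
    for page in range(touched_pages):
        region[page * mmap.PAGESIZE] = 1  # first write makes the kernel assign a page
    after_touch = minor_faults()
    # A page that was never written reads as zero.
    untouched_byte = region[touched_pages * mmap.PAGESIZE]
    region.close()
    return after_map - before_map, after_touch - after_map, untouched_byte
```

On a typical Linux system the second number is close to the number of touched pages, although it can be smaller when transparent huge pages back the region with larger pages.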
