When and how is mmap'ed memory swapped in and out? - linux

In my understanding, mmap'ing a file that fits into RAM will be like having the file in memory.
Say that we have 16G of RAM, and we first mmap a 10G file that we use for a while. This should be fairly efficient in terms of access. If we then mmap a second 10G file, will that cause the first one to be swapped out? Or parts of it? If so, when will this happen? At the mmap call, or on accessing the memory area of the newly loaded file?
And if we then access memory through the pointer for the first file again, will that cause the file to be swapped in again? So, say we alternate reading between memory corresponding to the first file and the second file, will that lead to disastrous performance?
Lastly, if any of this is true, would it be better to mmap several smaller files?

As has been discussed, your file will be accessed in pages; on x86_64 (and IA32) architectures, a page is typically 4096 bytes. So, very little if any of the file will be loaded at mmap time. The first time you access some page in either file, then the kernel will generate a page fault and load some of your file. The kernel may prefetch pages, so more than one page may be loaded. Whether it does this depends on your access pattern.
In general, your performance should be good if your working set fits in memory. That is, if you're only regularly accessing 3G of data across the two files, then so long as you have 3G of RAM available to your process, things should generally be fine.
On a 64-bit system there's no reason to split the files, and you'll be fine if the parts you need tend to fit in RAM.
Note that if you mmap an existing file, swap space will not be required to read that file. When an object is backed by a file on the filesystem, the kernel can read from that file rather than from swap space. However, if you specify MAP_PRIVATE in your call to mmap, swap space may be required to hold modified pages: private changes are never written back to the file (msync does not flush them), so under memory pressure those pages can only go to swap.
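A minimal Python sketch of the private-mapping behaviour, assuming Linux and Python 3; the function names and the scratch file are made up for illustration. A write through a MAP_PRIVATE mapping lands in a copy-on-write page owned by the process and does not reach the file, so the file cannot serve as backing store for that page.

```python
import mmap
import os
import tempfile


def private_write_leaves_file_untouched(path):
    """Write through a MAP_PRIVATE mapping; return (bytes seen in mapping, bytes in file)."""
    fd = os.open(path, os.O_RDONLY)  # read access is enough for a private mapping
    try:
        m = mmap.mmap(fd, 0, flags=mmap.MAP_PRIVATE,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            m[0:5] = b"HELLO"        # dirties a private copy-on-write page
            seen_in_mapping = m[0:5]
        finally:
            m.close()
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        return seen_in_mapping, f.read(5)


def demo():
    """Run the check against a throwaway file (hypothetical contents)."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello world")
        path = tmp.name
    try:
        return private_write_leaves_file_untouched(path)
    finally:
        os.unlink(path)
```

Calling demo() returns the modified bytes as seen through the mapping together with the unchanged bytes read back from the file.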

Your question does not have a definitive answer, as swapping in and out is handled by your kernel, and each kernel has a different implementation (and Linux itself offers different profiles depending on your usage: RT, desktop, server…).
Generally speaking, though, whatever you load into memory is handled in pages, so your mmap'ed file is loaded (and evicted) page by page between all the levels of memory (the caches, RAM and disk).
So if you map two 10 GB files, parts of both will be resident in RAM while the rest stays on disk (in the files themselves for file-backed pages, or in swap for anonymous or privately modified pages). The kernel will try to keep in RAM the pages you're likely to use now and guess what you'll load next.
What this means is that if you do truly random access to a few bytes of data in both files alternately, you should expect awful performance; if you access contiguous chunks sequentially from both files alternately, you should expect decent performance.
You can read more about paging and virtual memory here:
https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html
https://en.wikipedia.org/wiki/Paging

Related

Reserve physical memory for memory mapped files in Linux

I am using Lucene's MMapDirectory to keep index files in memory-mapped files. According to the output of the pmap command, only part of these files actually resides in physical memory. I would like to add more RAM to the server and make sure that all the files are completely in memory. However, the extra RAM will be distributed proportionally. In a Linux system, can I reserve some physical memory specifically for these files (which are about 5GB in total)?
Thanks in advance for your answers.
I'm afraid there aren't any decent out-of-the-box solutions available at the moment.
You could use RAMDirectory which is closest to what you are looking for, but it is not efficient (too much ram gets allocated, GC slowness, etc.). There's LUCENE-3659 to improve this, but it's not ready yet.
You could look at ramfs/tmpfs, but both of them are volatile (index goes away after switching the machine off), also you could end up in situations where RAM gets cached in another RAM.
More info:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

mmap: will the mapped file be loaded into memory immediately?

From the manual, I just know that mmap() maps a file to a virtual address space, so the file can be randomly accessed. But it is unclear to me whether the mapped file is loaded into memory immediately. I guess that the kernel manages the mapped memory by pages, and that they are loaded on demand: if I only do a few reads and writes, only a few pages are loaded. Is that correct?
No, yes, maybe. It depends.
Calling mmap generally only means that to your application, the mapped file's contents are mapped to its address space as if the file was loaded there. Or, as if the file really existed in memory, as if they were one and the same (which includes changes being written back to disk, assuming you have write access).
No more, no less. It has no notion of loading something, nor does the application know what this means.
An application does not truly have knowledge of any such thing as memory, although the virtual memory system makes it appear like that. The memory that an application can "see" (and access) may or may not correspond to actual physical memory, and this can in principle change at any time, without prior warning, and without an obvious reason (obvious to your application).
Other than possibly experiencing a small delay due to a page fault, an application is (in principle) entirely unaware of any such thing happening and has little or no control over it [1].
Applications will, generally, load pages from mapped files (including the main executable!) on demand, as a consequence of encountering a fault. However, an operating system will usually try to speculatively prefetch data to optimize performance.
In practice, calling mmap will immediately begin to (asynchronously) prefetch pages from the beginning of the mapping, up to a certain implementation-specified size. Which means, in principle, for small files the answer would be "yes", and for larger files it would be "no".
However, mmap does not block to wait for completion of the readahead, which means that you have no guarantee that any of the file is in RAM immediately after mmap returns (not that you have that guarantee at any time anyway!). Insofar, the answer is "maybe".
Under Linux, last time I looked, the default prefetch size was 31 blocks (~127k) -- but this may have changed, plus it's a tuneable parameter. As pages near or at the end of the prefetched area are touched, more pages are being prefetched asynchronously.
If you have hinted MADV_RANDOM to madvise, prefetching is "less likely to happen"; under Linux, this completely disables prefetch.
On the other hand, giving the MADV_SEQUENTIAL hint will asynchronously prefetch "more aggressively" beginning from the beginning of the mapping (and may discard accessed pages quicker). Under Linux, "more aggressively" means twice the normal amount.
Giving the MADV_WILLNEED hint suggests (but does not guarantee) that all pages in the given range are loaded as soon as possible (since you're saying you're going to access them). The OS may ignore this, but under Linux, it is treated rather as an order than a hint, up to the process' maximum RSS limit, and an implementation-specified limit (if I remember correctly, 1/2 the amount of physical RAM).
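The hints above can be passed from Python as well. This is a rough sketch assuming Linux and Python 3.8 or later (where mmap objects gained a madvise method); map_with_hint and checksum_sequential are illustrative names, not part of any library. The hints are advisory, so the code behaves the same whether or not the kernel honours them.

```python
import mmap


def map_with_hint(path, advice):
    """Map `path` read-only and apply one madvise hint to the whole mapping."""
    with open(path, "rb") as f:
        # The mapping keeps its own duplicate of the descriptor,
        # so closing the file object afterwards is fine.
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    m.madvise(advice)  # advisory only: the kernel is free to ignore it
    return m


def checksum_sequential(path):
    """Scan a file front to back after hinting MADV_SEQUENTIAL."""
    m = map_with_hint(path, mmap.MADV_SEQUENTIAL)
    try:
        total = 0
        for off in range(0, len(m), mmap.PAGESIZE):
            total += sum(m[off:off + mmap.PAGESIZE])  # touches each page in order
        return total
    finally:
        m.close()
```

For a lookup table probed at scattered offsets, map_with_hint(path, mmap.MADV_RANDOM) would be the matching call; mmap.MADV_WILLNEED asks for the range to be read in ahead of use.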
Note that MADV_DONTNEED is arguably implemented wrongly under Linux. The hint is not interpreted in the way specified by POSIX, i.e. you're OK with pages being paged out for the moment, but rather that you mean to discard them. Which makes no big difference for readonly mapped pages (other than a small delay, which you said would be OK), but it sure does matter for everything else.
In particular, using MADV_DONTNEED thinking Linux will release unneeded pages after the OS has written them lazily to disk is not how things work! You must explicitly sync, or prepare for a surprise.
Having called readahead on the file descriptor prior to calling mmap (or alternatively, having had read/written the file previously), the file's contents will in practice indeed be in RAM immediately.
This is, however, only an implementation detail (unified virtual memory system), and subject to memory pressure on the system.
Calling mlock will -- assuming it succeeds [2] -- immediately load the requested pages into RAM. It blocks until all pages are physically present, and you have the guarantee that the pages will stay in RAM until you unlock them.
[1] There exists functionality to query (mincore) whether any or all of the pages in a particular range are actually present at the very moment, functionality to hint the OS about what you would like to see happening without any hard guarantees (madvise), and finally functionality to force a limited subset of pages to be present in memory (mlock) for privileged processes.
[2] It might not, both for lack of privileges and for exceeding quotas or the amount of physical RAM present.
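Python has no built-in wrapper for mincore, so the following sketch calls it through ctypes. It assumes Linux, a libc reachable via ctypes.CDLL(None), and a writable mmap object (ctypes can only take the address of a writable buffer); resident_pages is a made-up helper name. The result is a snapshot that may already be stale when the function returns.

```python
import ctypes
import mmap
import os

# Assumption: the C library is already loaded into the Python process (true on Linux).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                          ctypes.POINTER(ctypes.c_ubyte)]
_libc.mincore.restype = ctypes.c_int


def resident_pages(m):
    """Return (resident, total) page counts for a writable mmap object."""
    length = len(m)
    total = (length + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * total)()       # one status byte per page
    buf = ctypes.c_char.from_buffer(m)     # temporary export to obtain the address
    try:
        if _libc.mincore(ctypes.addressof(buf), length, vec) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        del buf                            # release the export so m.close() works
    # Only the lowest bit of each byte is defined: 1 means resident.
    return sum(b & 1 for b in vec), total
```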
Yes, mmap creates a mapping. It does not normally read the entire content of whatever you have mapped into memory. If you wish to do that you can use the mlock/mlockall system call to force the kernel to read into RAM the content of the mapping, if applicable.
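As a rough illustration of the mlock route, here is a ctypes-based sketch, again assuming Linux and a writable mmap object; try_lock and unlock are hypothetical helper names. The call can fail for ordinary users once the RLIMIT_MEMLOCK quota is exceeded, so the sketch reports the errno instead of raising.

```python
import ctypes
import mmap

# Assumption: libc symbols are reachable through the running process on Linux.
_libc = ctypes.CDLL(None, use_errno=True)
for _name in ("mlock", "munlock"):
    getattr(_libc, _name).argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    getattr(_libc, _name).restype = ctypes.c_int


def _call_on_mapping(func, m):
    """Call a libc (addr, len) function on a writable mmap; return (ok, errno)."""
    buf = ctypes.c_char.from_buffer(m)     # temporary export to obtain the address
    try:
        rc = func(ctypes.addressof(buf), len(m))
        err = ctypes.get_errno() if rc != 0 else 0
    finally:
        del buf                            # release the export so m.close() works
    return rc == 0, err


def try_lock(m):
    """Fault in and pin every page of `m`; blocks until the pages are resident."""
    return _call_on_mapping(_libc.mlock, m)


def unlock(m):
    """Allow the kernel to evict the pages of `m` again."""
    return _call_on_mapping(_libc.munlock, m)
```

Unmapping the region (m.close()) also drops the lock, so unlock is only needed when the mapping outlives the period in which it must stay resident.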
By default, mmap() only configures the mapping and returns (fast).
Linux (at least) has the option MAP_POPULATE (see 'man mmap') that does exactly what your question is about.
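A small sketch of requesting pre-faulting from Python, assuming Linux; the constant mmap.MAP_POPULATE is only exposed from Python 3.10 on, so the code falls back to a plain mapping when it is missing. map_populated is an illustrative name.

```python
import mmap


def map_populated(path):
    """Map `path` read-only, asking the kernel to pre-fault the pages if possible."""
    populate = getattr(mmap, "MAP_POPULATE", 0)   # 0 means: no pre-faulting available
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0,
                         flags=mmap.MAP_SHARED | populate,
                         prot=mmap.PROT_READ)
```

With the flag set, the mmap call itself takes longer because the read-ahead happens before it returns; later accesses are then less likely to block on page faults.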
Yes. The whole point of mmap is that it manages memory more efficiently than just slurping everything into memory.
Of course, any given implementation may in some situations decide that it's more efficient to read in the whole file in one go, but that should be transparent to the program calling mmap.

does linux load program-pages on demand?

I wrote a daemon that starts some sort of self-healing procedure when the hard drive controller of a computer crashes. This program already works fine, but I have concerns that the program (about 18KB compiled file size) may not be fully loaded into RAM by the operating system and that - when I'm really unlucky - some program pages have to be loaded from the disk exactly when the program has to become active and disk accesses are no longer possible.
After all, most of the time the program stays in an endless loop checking if everything is okay, and 95% of the program code isn't used. So, I think, the kernel may optimize RAM usage by removing unused program pages from RAM.
So, my question: does Linux load all program code pages into memory and keep them there, making it unnecessary to access the hard disk again to run the program code itself once the program has started?
Technical details: Linux Kernel 2.6.36+, about 1 GB of RAM, Debian 5, no swap space active
I already learned that I can prevent swapping by calling mlockall(MCL_CURRENT | MCL_FUTURE);, but I'm wondering if I really need to update my machines.
No, the program code pages are memory mapped into the address space of the process, not so differently from any other mmap(), so if you don't access these pages for a long time, they can eventually be removed from RAM. To avoid that, just use the mlockall() call.
From the mlockall manual:
"mlockall() locks all pages mapped into the address space of the calling process. This includes the pages of the code, data and stack segment, as well as shared libraries, user space kernel data, shared memory, and memory-mapped files. All mapped pages are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked."
So, if locked, the pages will be there. However, modifying a mounted hard disk partition is always a great risk, regardless of any kind of locks.

Why do MongoDB's memory mapped files cause programs like top to show larger numbers than normal?

I am trying to wrap my head around the internals of mongodb, and I keep reading about this
http://www.theroadtosiliconvalley.com/technology/mongodb-mongo-nosql-db/
Why does this happen?
So the way memory mapped files work is that addresses in memory are mapped byte for byte to a file on disk. This makes access really fast, but the mapping is really large: imagine a file on disk for your data taking up that much address space.
Why it's awesome
In practice, this rocks because writing and reading from memory directly instead of issuing a system call (think context switch) is fast. Also, in practice, the fact that this huge memory mapped chunk doesn't fit in your physical ram is fine. Why? You only need the working set of data to fit in ram because the non-used pages are not loaded and just kept on disk. If they are needed a page fault happens and it gets loaded up. (I believe the portion that has been loaded is referred to as resident memory)
Why it kind of sucks
Files mapped into memory need to be page aligned, so if your data doesn't end exactly on a page boundary you waste some space (a small tradeoff).
Summary (tl;dr)
It may look like it's taking up a lot of resources because it's mapping the entirety of your data to memory addresses, but it doesn't really matter, as that data isn't actually all being held in RAM. Mongo will pull in data as it needs it and use memory effectively to maintain a performant working set.
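The gap between mapped size and resident size is easy to reproduce outside MongoDB. The sketch below assumes Linux with /proc mounted; memory_kib and map_sparse_file are made-up helper names, and the 256 MiB size in the usage note is arbitrary. It maps a large sparse file without touching it and reads the virtual size (VmSize, roughly what top shows as VIRT) and the resident size (VmRSS, shown as RES) before and after.

```python
import mmap
import tempfile


def memory_kib():
    """Return (VmSize, VmRSS) of this process in KiB, from /proc/self/status."""
    fields = {}
    with open("/proc/self/status") as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in ("VmSize", "VmRSS"):
                fields[key] = int(rest.split()[0])   # value is followed by "kB"
    return fields["VmSize"], fields["VmRSS"]


def map_sparse_file(size):
    """Create a sparse scratch file of `size` bytes and map it read-only."""
    with tempfile.TemporaryFile() as f:
        f.truncate(size)                   # allocates no disk blocks
        # The mapping keeps its own descriptor, so it outlives the file object.
        return mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
```

Comparing memory_kib() before and after map_sparse_file(256 * 1024 * 1024) should show the virtual size growing by about the mapping size while the resident size barely moves, since no page has been touched yet.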

How can I allocate memory in Linux that meets paging and cacheability requirements?

I want to allocate space for a large array that will be write-only until the very end of the program. For that reason, I don't care if it's cached.
I also want to access this very frequently, so I don't want to have to do a page walk more than once. For that reason I want it to be allocated in a large page (e.g. 4M).
So how can I...
...request the memory to be either uncacheable or write-through?
...request the memory to be placed in a large page?
I am working in Linux.
Disabling caching sounds like it would make your writes slower if it forces a write all the way through to the RAM. I'm not sure I'd attempt that at all.
To actually use large pages, I suggest following HugeTLB - Large Page Support in the Linux Kernel. It contains an example of how you can use large pages via a shared memory segment.
With transparent hugepages, simply allocating a 4M-aligned buffer will work. Use aligned_alloc or posix_memalign to get a pointer you can free. (Note that aligned_alloc is required to fail if the buffer size isn't a multiple of the alignment. /facepalm).
Depending on your setting for /sys/kernel/mm/transparent_hugepage/defrag, you may need to use madvise(MADV_HUGEPAGE) on the buffer to strongly encourage the kernel to use hugepages.
Also note that x86-64 uses 2M hugepages. x86-32 uses 4M hugepages. Aligning to 4M is fine if you want the easy solution for both.
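The madvise(MADV_HUGEPAGE) step can be sketched from Python too, as a stand-in for the C calls named above rather than a replacement for them. It assumes Linux and Python 3.8 or later; alloc_hugepage_hinted is an illustrative name. Python's anonymous mappings are page aligned but not necessarily hugepage aligned, so the kernel can only back the naturally aligned huge-page-sized ranges inside the buffer with hugepages, and the hint fails outright on kernels built without transparent hugepage support.

```python
import mmap


def alloc_hugepage_hinted(size):
    """Allocate an anonymous private buffer and hint that hugepages are wanted.

    Returns (buffer, hinted); `hinted` is False when the hint is unavailable
    or rejected, in which case the buffer is still usable with normal pages.
    """
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_HUGEPAGE", None)
    hinted = False
    if advice is not None:
        try:
            buf.madvise(advice)
            hinted = True
        except OSError:
            pass    # e.g. EINVAL when transparent hugepages are not configured
    return buf, hinted
```

Asking for a few huge pages more than strictly needed (for example 8 MiB for a 4 MiB array) leaves room for at least some fully aligned ranges inside the buffer.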
request the memory to be either uncacheable or write-through?
AFAIK, you can't easily do that through normal Linux APIs. NT stores work to normal write-back memory, so use that instead. (They over-ride the memory type and are weakly-ordered cache-bypassing).
But if you're not writing full cache-lines at a time, you definitely want cached writes. Especially if there's any spatial or temporal locality, but even if not then letting the store buffer do its job (hiding the latency of cache-miss stores) is a good thing.

Resources