I have several questions regarding the mmap implementation in Linux systems that don't seem to be well documented:
When mapping a file to memory using mmap, how would you handle prefetching the data in that file?
I.e. what happens when you read data from the mmapped region? Is that data moved to the L1/L2 caches? Is it read directly from the disk cache? Do prefetchnta and similar asm instructions work on mmapped regions?
What's the overhead of the actual mmap call? Is it proportional to the amount of mapped data, or constant?
Hope somebody has some insight into this. Thanks in advance.
mmap is basically programmatic access to the Virtual Memory subsystem.
When you have, say, a 1 GB file and you mmap it, you get a pointer to "the entire" file as if it were in memory.
However, at this stage nothing has happened beyond the actual mapping operation of reserving pages for the file in the VM. (The larger the file, the longer the mapping operation, of course.)
In order to start reading data from the file, you simply access it through the pointer you were returned in the mmap call.
If you wish to "preload" parts of the file, just visit the area you'd like to preload. Make sure you visit ALL of the pages you want to load, since the VM will only load the pages you access. For example, say within your 1G file, you have a 10MB "index" area that you'd like to map in. The simplest way would be to just "walk your index", or whatever data structure you have, letting the VM page in data as necessary. Or, if you "know" that it's the "first 10MB" of the file, and that your page size for your VM is, say, 4K, then you can just cast the mmap pointer to a char pointer, and just iterate through the pages.
void load_mmap(char *mmapPtr) {
    // Touch one byte per page in the first 10MB so the VM faults each page in.
    // `volatile` keeps the compiler from optimizing the otherwise-unused read away.
    for (int offset = 0; offset < 10 * 1024 * 1024; offset += 4 * 1024) {
        volatile char c = *(mmapPtr + offset);
        (void)c;
    }
}
As for the L1 and L2 caches, mmap has nothing to do with them; that's all about how you access the data.
Since you're using the underlying VM system, anything that addresses data within the mmapped block will work (even from assembly).
If you don't change any of the mmapped data, the VM will automatically flush out old pages as new pages are needed. If you actually do change them, the VM will write those pages back for you.
It has nothing to do with the CPU caches; mmap maps the file into your virtual address space, and when pages are subsequently accessed, or locked with mlock(), they are brought physically into memory. Which CPU caches the data ends up in is nothing you really have control over (at least, not via mmap).
Normally, touching the pages is necessary to cause them to be mapped in, but calling mlock or mlockall has the same effect (these are usually privileged).
As far as the overhead is concerned, I don't really know; you'd have to measure it. My guess is that an mmap() which doesn't load pages in is a more or less constant-time operation, but bringing the pages in takes longer the more pages there are.
Recent versions of Linux also support the flag MAP_POPULATE, which instructs mmap to load the pages in immediately (presumably only if possible).
Answering Mr. Ravi Phulsundar's question:
Multiple processes can map the same file as long as the permissions are set correctly. Looking at the mmap man page, just pass the MAP_SHARED flag (if you need to map a really large file on a 32-bit system, use mmap2 instead):
mmap
MAP_SHARED
    Share this mapping with all other processes that map this object. Storing to the region is equivalent to writing to the file. The file may not actually be updated until msync(2) or munmap(2) is called.
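A minimal sketch of the MAP_SHARED behaviour described above (the helper name is mine): a store through the mapping is equivalent to a write to the file, and msync(2) forces it out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of `fd` with MAP_SHARED, store `msg` through the
   mapping, and msync so the bytes reach the file.  Any other process
   that maps the same file with MAP_SHARED sees the same pages. */
int write_via_mapping(int fd, const char *msg, size_t len) {
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    memcpy(p, msg, strlen(msg));   /* equivalent to writing the file */
    msync(p, len, MS_SYNC);        /* force the update out now */
    return munmap(p, len);
}
```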
I was working with syscalls relating to virtual memory lately. From the mmap manual I know that it can be very powerful when the MAP_FIXED flag is set, creating new mappings anywhere in memory.
MAP_FIXED
Don't interpret addr as a hint: place the mapping at exactly
that address. addr must be suitably aligned: for most
architectures a multiple of the page size is sufficient;
however, some architectures may impose additional
restrictions. If the memory region specified by addr and len
overlaps pages of any existing mapping(s), then the overlapped
part of the existing mapping(s) will be discarded. If the
specified address cannot be used, mmap() will fail.
Software that aspires to be portable should use the MAP_FIXED
flag with care, keeping in mind that the exact layout of a
process's memory mappings is allowed to change significantly
between kernel versions, C library versions, and operating
system releases. Carefully read the discussion of this flag
in NOTES!
My question is: why is there a distinct syscall mprotect, given that mmap can do the exact same job by creating a new mapping with the same fd and offset and setting the new prot you want?
In my opinion, all VM operations can ultimately be done with mmap and munmap, since those operations are basically just manipulating the page table. Can someone tell me if this is a bad idea?
You need mprotect if you want to change the permissions on an existing region of memory, while keeping its contents intact.
mmap can't do this. If you use mmap with MAP_FIXED to create a new mapping at the same address, then the region's previous contents will be replaced by the contents of the new file you mapped, or zeros if using MAP_ANONYMOUS.
Using the same fd and offset does not solve this. If the map was originally created with MAP_ANONYMOUS (as is the case for most dynamically allocated memory) then there is no fd. Or, if the region was mapped to a file but with MAP_PRIVATE, then the contents could have been modified in your process's memory without being written back to the file. Attempting to map the file again with mmap will lose the modified data and replace it with the file's original contents.
I'm interested in using a memory-mapped file for fast I/O; I've done this a few times before without issue.
I think I'm also hitting page-boundary overhead any time I access memory past 4k (I'm streaming data into this memory map).
So I was thinking I could use huge pages to get 2MB page sizes for efficiency here and avoid the small-page overhead and penalties.
When I try to allocate my memory-mapped file with MAP_HUGETLB, though, mmap fails with an invalid argument error.
So my basic question is: are huge pages supported with memory-mapped files?
mmap call for reference; memsize is a multiple of 2M and mmapfd is a file descriptor for the file.
Note that this call works fine if I don't set MAP_HUGETLB:
m_mmap = mmap( nullptr, memsize, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_HUGETLB, mmapfd, 0 )
I believe MAP_HUGETLB has to be used with MAP_ANONYMOUS.
From the patch notes that added the MAP_HUGETLB flag (https://lwn.net/Articles/353828/):
"This patch set adds a flag to mmap that allows the user to request
a mapping to be backed with huge pages. This mapping will borrow
functionality from the huge page shm code to create a file on the
kernel internal mount and use it to approximate an anonymous mapping.
The MAP_HUGETLB flag is a modifier to MAP_ANONYMOUS and will not work
without both flags being preset."
All the documentation I have seen also only specifies use-examples with anonymous mappings.
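A hedged sketch of the anonymous huge-page usage (the helper name is mine; whether the MAP_HUGETLB call succeeds depends on the system having huge pages reserved, e.g. via /proc/sys/vm/nr_hugepages, so this falls back to normal pages):

```c
#define _GNU_SOURCE            /* for MAP_HUGETLB on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Try a 2 MiB anonymous huge-page mapping; fall back to normal pages
   if no huge pages are reserved (the MAP_HUGETLB call then fails with
   ENOMEM or EINVAL).  Sets *used_hugetlb accordingly. */
void *alloc_2mb(int *used_hugetlb) {
    size_t len = 2 * 1024 * 1024;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    *used_hugetlb = (p != MAP_FAILED);
    if (p != MAP_FAILED)
        return p;
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```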
I know that it's possible to force a variable to be stored in physical memory using the mlock() function.
void *buffer = malloc(buf_size);
mlock(buffer, buf_size);
// If these calls succeed, the pages of buffer are faulted in
// and locked into physical memory.
However, what if we want to make sure that the variable will never reside in physical memory? Is it possible to do that? If yes, how does Linux allow doing this from userspace?
When something is written to disk, the disk controller reads the contents of the file via DMA. DMA stands for direct memory access, and the word "memory" is key here: it will access memory. This is even OS-independent, because it's implemented in hardware.
system("wget http://example.com/?x=2+2");
This will store the variable x with a value of 4 on my web server, not in the RAM of your PC. Aside from extreme examples like this, I cannot think of any solution.
In my understanding, mmap'ing a file that fits into RAM will be like having the file in memory.
Say we have 16G of RAM and we first mmap a 10G file that we use for a while. This should be fairly efficient in terms of access. If we then mmap a second 10G file, will that cause the first one to be swapped out? Or parts of it? If so, when will this happen: at the mmap call, or on accessing the memory area of the newly mapped file?
And if we then access the memory of the first mapping again, will that swap the first file back in? So if we alternate reads between memory corresponding to the first file and the second file, will that lead to disastrous performance?
Lastly, if any of this is true, would it be better to mmap several smaller files?
As has been discussed, your file will be accessed in pages; on x86_64 (and IA32) architectures, a page is typically 4096 bytes. So, very little if any of the file will be loaded at mmap time. The first time you access some page in either file, then the kernel will generate a page fault and load some of your file. The kernel may prefetch pages, so more than one page may be loaded. Whether it does this depends on your access pattern.
In general, your performance should be good if your working set fits in memory. That is, if you're only regularly accessing 3G of file across the two files, then as long as you have 3G of RAM available to your process, things should generally be fine.
On a 64-bit system there's no reason to split the files, and you'll be fine if the parts you need tend to fit in RAM.
Note that if you mmap an existing file, swap space will not be required to read that file. When an object is backed by a file on the filesystem, the kernel can re-read its pages from that file rather than from swap space. However, if you specify MAP_PRIVATE in your call to mmap, swap space may be required to hold changed pages, since private modifications are never written back to the underlying file.
Your question does not have a definitive answer, as swapping in and out is handled by your kernel, and each kernel will have a different implementation (and Linux itself offers different profiles depending on your usage: RT, desktop, server…)
Generally speaking, though, whatever you load into memory is handled in pages, so your mmapped file is loaded (and evicted) page by page across all the levels of the memory hierarchy (the caches, RAM and swap).
So if you load two 10GB data sets into memory, you'll have parts of both split between RAM and swap, and the kernel will try to keep in RAM the pages you're likely to use now while guessing what you'll load next.
What this means is that if you do truly random access to a few bytes of data in both files alternately, you should expect awful performance; if you access contiguous chunks sequentially from both files alternately, you should expect decent performance.
You can read more details about how the kernel pages memory in and out:
https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html
https://en.wikipedia.org/wiki/Paging
The experiment is on Linux, x86 32-bit.
So suppose in my assembly program, I need to periodically (for instance, every time after executing 100000 basic blocks) dump an array in the .bss section from memory to disk. The starting address and size of the array are fixed. The array records the executed basic blocks' addresses; the size is 16M right now.
I tried to write some native code to memcpy from the .bss section to the stack, and then write it back to disk. But that seems very tedious to me, and I'm worried about the performance and memory consumption of, say, allocating a very large buffer on the stack every time...
So here is my question, how can I dump the memory from global data sections in an efficient way? Am I clear enough?
First of all, don't write this part of your code in asm, especially not at first. Write a C function to handle this part, and call it from asm. If you need to perf-tune the part that only runs when it's time to dump another 16MiB, you can hand-tune it then. System-level programming is all about checking error returns from system calls (or C stdio functions), and doing that in asm would be painful.
Obviously you can write anything in asm, since making system calls isn't anything special compared to C. And there's no part of any of this that's easier in asm compared to C, except for maybe throwing in an MFENCE around the locking.
Anyway, I've addressed three variations on what exactly you want to happen with your buffer:
Overwrite the same buffer in place (mmap(2) / msync(2))
Append a snapshot of the buffer to a file (with either write(2) or probably-not-working zero-copy vmsplice(2) + splice(2) idea.)
Start a new (zeroed) buffer after writing the old one. mmap(2) sequential chunks of your output file.
In-place overwrites
If you just want to overwrite the same area of disk every time, mmap(2) a file and use that as your array. (Call msync(2) periodically to force the data to disk.) The mmapped method won't guarantee a consistent state for the file, though: writes can get flushed to disk at times other than when you request it. IDK if there's a way to avoid that with any kind of guarantee (i.e. not just tuning buffer-flush timers and so on so your pages usually don't get written except by msync(2)).
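A minimal sketch of this in-place scheme (helper names are mine): the file-backed mapping is the array, and msync forces a flush on demand.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Use `len` bytes of file `fd` directly as the in-memory array.
   Stores through the returned pointer dirty the page cache;
   flush_array() forces them to disk on demand. */
char *map_array(int fd, size_t len) {
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}

int flush_array(char *p, size_t len) {
    return msync(p, len, MS_SYNC);   /* block until the data is on disk */
}
```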
Append snapshots
The simple way to append a buffer to a file would be to simply call write(2) when you want it written. write(2) does everything you need. If your program is multi-threaded, you might need to take a lock on the data before the system call, and release the lock afterwards. I'm not sure how fast the write system call would return. It may only return after the kernel has copied your data to the page cache.
If you just need a snapshot, but all writes into the buffer are atomic transactions (i.e. the buffer is always in a consistent state, rather than pairs of values that need to be consistent with each other), then you don't need to take a lock before calling write(2). There will be a tiny amount of bias in this case (data at the end of the buffer will be from a slightly later time than data from the start, assuming the kernel copies in order).
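The locked write(2) approach above can be sketched like this (the function and lock names are mine, assuming a pthreads program):

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Append one snapshot of `buf` to `fd` with a single write(2).  The
   lock keeps other writers out of the buffer while the kernel copies
   it; write(2) normally returns once the data is in the page cache,
   not when it hits disk. */
ssize_t append_snapshot(int fd, pthread_mutex_t *buf_lock,
                        const void *buf, size_t len) {
    pthread_mutex_lock(buf_lock);
    ssize_t n = write(fd, buf, len);
    pthread_mutex_unlock(buf_lock);
    return n;
}
```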
IDK if write(2) returns slower or faster with direct IO (zero-copy, bypassing the page cache): open(2) your file with O_DIRECT, and write(2) normally.
There has to be a copy somewhere in the process, if you want to write a snapshot of the buffer and then keep modifying it. Or else MMU copy-on-write trickery:
Zero-copy append snapshots
There is an API for doing zero-copy writes of user pages to disk files. Linux's vmsplice(2) and splice(2) in that order will let you tell the kernel to map your pages into the page cache. Without SPLICE_F_GIFT, I assume it sets them up as copy-on-write. (oops, actually the man page says without SPLICE_F_GIFT, the following splice(2) will have to copy. So IDK if there is a mechanism to get copy-on-write semantics.)
Assuming there was a way to get copy-on-write semantics for your pages, until the kernel was done writing them to disk and could release them:
Further writes might need the kernel to memcpy one or two pages before the data hit disk, but save copying the whole buffer. The soft page faults and page-table manipulation overhead might not be worth it anyway, unless your data access pattern is very spatially-localized over the short periods of time until the write hits disk and the to-be-written pages can be released. (I think an API that works this way doesn't exist, because there's no mechanism for getting the pages released right after they hit disk. Linux wants to take them over and keep them in the page cache.)
I haven't ever used vmsplice, so I might be getting some details wrong.
If there's a way to create a new copy-on-write mapping of the same memory, maybe by mmaping a new mapping of a scratch file (on a tmpfs filesystem, prob. /dev/shm), that would get you snapshots without holding the lock for long. Then you can just pass the snapshot to write(2), and unmap it ASAP before too many copy-on-write page faults happen.
New buffer for every chunk
If it's ok to start with a zeroed buffer after every write, you could mmap(2) successive chunks of the file, so the data you generate is always already in the right place.
(optional) fallocate(2) some space in your output file, to prevent fragmentation if your write pattern isn't sequential.
mmap(2) your buffer to the first 16MiB of your output file.
run normally
When you want to move on to the next 16MiB:
take a lock to prevent other threads from using the buffer
munmap(2) your buffer
mmap(2) the next 16MiB of the file to the same address, so you don't need to pass the new address around to writers. These pages will be pre-zeroed, as required by POSIX (can't have the kernel exposing memory).
release the lock
Possibly mmap(buf, 16MiB, ... MAP_FIXED, fd, new_offset) could replace the munmap / mmap pair. MAP_FIXED discards old mappings that it overlaps. I assume this doesn't mean that modifications to the file / shared memory are discarded, but rather that the actual mapping changes, even without an munmap.
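The MAP_FIXED variant of the steps above can be sketched as follows (names are mine; the file must already be large enough, e.g. via ftruncate or fallocate):

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHUNK (16u * 1024 * 1024)   /* 16 MiB buffer/chunk size */

/* Replace the buffer at `buf` with a mapping of the chunk of `fd`
   starting at `new_offset`.  MAP_FIXED atomically discards the old
   mapping at that address, so writers keep using the same pointer.
   The new pages read as the file contents there (zeros for a hole).
   Returns buf on success, NULL on failure. */
void *next_chunk(void *buf, int fd, off_t new_offset) {
    void *p = mmap(buf, CHUNK, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, new_offset);
    return p == MAP_FAILED ? NULL : p;
}
```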
Two clarifications for Append snapshots case from Peter's answer.
1. Appending without O_DIRECT
As Peter said, if you don't use O_DIRECT, write() will return as soon as the data has been copied to the page cache. If the page cache is full, it will block until an outdated page is flushed to disk.
If you are only appending data without reading it (soon), you can benefit from periodically calling sync_file_range(2) to schedule writeback for previously written pages, and posix_fadvise(2) with the POSIX_FADV_DONTNEED flag to remove already-flushed pages from the page cache. This could significantly reduce the possibility that write() would block.
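A sketch of that combination (the helper name is mine; sync_file_range is Linux-specific):

```c
#define _GNU_SOURCE          /* sync_file_range is Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* After appending the range [offset, offset+len), start asynchronous
   writeback for it, then tell the kernel we won't need those pages
   again so they can leave the page cache instead of piling up.
   Returns 0 on success. */
int flush_and_drop(int fd, off_t offset, off_t len) {
    if (sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE) != 0)
        return -1;
    return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}
```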
2. Appending with O_DIRECT
With O_DIRECT, write() normally blocks until the data has been sent to disk (although that's not strictly guaranteed, see here). Since this is slow, be prepared to implement your own I/O scheduling if you need non-blocking writes.
The benefits you could achieve are more predictable behaviour (you control when you will block) and probably reduced memory and CPU usage through cooperation between your application and the kernel.