Linux memory mapping a file with HUGETBL - linux

I'm interested in using a memory mapped file for fast I/O, ive done this a few times before without issue.
I think I'm also hitting a page boundary overhead anytime I access memory past 4k ( I'm streaming data into this memory map ).
So I was thinking I can use huge pages to get to 2MB page sizes for efficiency here and avoid the small page size overhead and penalties.
When I try to allocate my memory mapped file with HUGETLB though, mmap fails with an invalid argument error.
So my basic question is, are hugetable pages supported with memory mapped files?
mmap call for reference, memsize is a multiple of 2M. mmapfd is a file descriptor to the file.
Note that this call works fine if i dont set MAP_HUGETLB
m_mmap = mmap( nullptr, memsize, PROT_READ | PROT_WRITE, MAP_SHARED| MAP_HUGETLB, mmapfd, 0 )

I believe MAP_HUGETLB have to be used with MAP_ANONYMOUS.
From the patch notes that added the MAP_HUGETLB flag (https://lwn.net/Articles/353828/):
"This patch set adds a flag to mmap that allows the user to request
a mapping to be backed with huge pages. This mapping will borrow
functionality from the huge page shm code to create a file on the
kernel internal mount and use it to approximate an anonymous mapping.
The MAP_HUGETLB flag is a modifier to MAP_ANONYMOUS and will not work
without both flags being preset."
All the documentation I have seen also only specifies use-examples with anonymous mappings.

Related

If I mmap a memory region with no access bits set, does mlockall still force it to be backed by physical memory?

By default Linux doesn't actually back the pages allocated with mmap with any physical memory until the first time they are accessed. However you can force it to do so preemptively by calling mlockall(MCL_CURRENT | MCL_FUTURE).
It is a common pattern to create guard pages by mmaping memory but not setting any of the access bits. Because these pages are not actually going to be used, it would be nice if mlockall ignored them and still didn't actually back the pages. Can I assume this? Does the Linux kernel go out of its way to make this true?

Why is mprotect a distinct syscall from mmap

I was working with syscalls relating to virtual memory lately. From the manual of mmap I know that it can be very powerful when MAP_FIXED flag is set, creating new mappings everywhere in the memory.
MAP_FIXED
Don't interpret addr as a hint: place the mapping at exactly
that address. addr must be suitably aligned: for most
architectures a multiple of the page size is sufficient;
however, some architectures may impose additional
restrictions. If the memory region specified by addr and len
overlaps pages of any existing mapping(s), then the overlapped
part of the existing mapping(s) will be discarded. If the
specified address cannot be used, mmap() will fail.
Software that aspires to be portable should use the MAP_FIXED
flag with care, keeping in mind that the exact layout of a
process's memory mappings is allowed to change significantly
between kernel versions, C library versions, and operating
system releases. Carefully read the discussion of this flag
in NOTES!
My question is, why there is a distinct syscall mprotect from mmap, given that mmap can do the exact same job by creating a new mapping with the same fd and offset, and set the new prot you want?
In my opinion, all operations about VM can be done ultimately with mmap and munmap, for those operations are basically just playing with the page table. Can someone tell me if this is a bad idea?
You need mprotect if you want to change the permissions on an existing region of memory, while keeping its contents intact.
mmap can't do this. If you use mmap with MAP_FIXED to create a new mapping at the same address, then the region's previous contents will be replaced by the contents of the new file you mapped, or zeros if using MAP_ANONYMOUS.
Using the same fd and offset does not solve this. If the map was originally created with MAP_ANONYMOUS (as is the case for most dynamically allocated memory) then there is no fd. Or, if the region was mapped to a file but with MAP_PRIVATE, then the contents could have been modified in your process's memory without being written back to the file. Attempting to map the file again with mmap will lose the modified data and replace it with the file's original contents.

Why does dereferencing pointer from mmap cause memory usage reported by top to increase?

I am calling mmap() with MAP_SHARED and PROT_READ to access a file which is about 25 GB in size. I have noticed that advancing the returned pointer has no effect to %MEM in top for the application, but once I start dereferencing the pointer at different locations, memory wildly increases and caps at 55%. That value goes back down to 0.2% once munmap is called.
I don't know if I should trust that 55% value top reports. It doesn't seem like it is actually using 8 GB of the available 16. Should I be worried?
When you first map the file, all it does is reserve address space, it doesn't necessarily read anything from the file if you don't pass MAP_POPULATE (the OS might do a little prefetch, it's not required to, and often doesn't until you begin reading/writing).
When you read from a given page of memory for the first time, this triggers a page fault. This "invalid page fault" most people think of when they hear the name, it's either:
A minor fault - The data is already loaded in the kernel, but the userspace mapping for that address to the loaded data needs to be established (fast)
A major fault - The data is not loaded at all, and the kernel needs to allocate a page for the data, populate it from the disk (slow), then perform the same mapping to userspace as in the minor fault case
The behavior you're seeing is likely due to the mapped file being too large to fit in memory alongside everything else that wants to stay resident, so:
When first mapped, the initial pages aren't already mapped to the process (some of them might be in the kernel cache, but they're not charged to the process unless they're linked to the process's address space by minor page faults)
You read from the file, causing minor and major faults until you fill main RAM
Once you fill main RAM, faulting in a new page typically leads to one of the older pages being dropped (you're not using all the pages as much as the OS and other processes are using theirs, so the low activity pages, especially ones that can be dropped for free rather than written to the page/swap file, are ideal pages to discard), so your memory usage steadies (for every page read in, you drop another)
When you munmap, the accounting against your process is dropped. Many of the pages are likely still in the kernel cache, but unless they're remapped and accessed again soon, they're likely first on the chopping block to discard if something else requests memory
And as commenters noted, shared memory mapped file accounting gets weird; every process is "charged" for the memory, but they'll all report it as shared even if no other processes map it, so it's not practical to distinguish "shared because it's MAP_SHARED and backed by kernel cache, but no one else has it mapped so it's effectively uniquely owned by this process" from "shared because N processes are mapping the same data, reporting shared_amount * N usage cumulatively, but actually only consuming shared_amount memory total (plus a trivial amount to maintain the per-process page tables for each mapping). There's no reason to be worried if the tallies don't line up.

Replacing `sbrk` with `mmap`

I've read that sbrk is a deprecated call and one should prefer mmap with MAP_ANONYMOUS flag. I need one continous (logical) memory block that can grow. However, mmap treats first parameter as a hint, so it can make gaps, which is unacceptable in my case. I tried to use MAP_FIXED flag (which as documentation states is not recommended) and I can get continuos memory, but after mapping several pages I get strange behaviour of my program: system functions like printf and clock_gettime begin to fail. I guess the first mmap which I call without MAP_FIXED returns page that has some mapped pages after it, which contain system data. So what is the right way to use mmap instead of sbrk?
With Linux you can use mmap with MAP_NORESERVE (and possibly PROT_NONE) to claim a large chunk of address space without actually allocating any memory. You map the largest area you could possibly want (and can get), and then remap bits of it with MAP_FIXED to actually allocate memory as needed.
I've read that sbrk is a deprecated call
Don't believe everything you read, especially if the source is not authoritative.
I need one continous (logical) memory block that can grow.
In that case, mmap is not for you, unless you are willing to declare the maximum size to which that block can grow.
I tried to use MAP_FIXED flag (which as documentation states is not recommended) and I can get continuos memory, but after mapping several pages I get strange behaviour of my program
With MMAP_FIXED you have to be very careful: the system will happily map over whatever (if anything) was there before, including libc data and code.

Linux MMAP internals

I have several questions regarding the mmap implementation in Linux systems which don't seem to be very much documented:
When mapping a file to memory using mmap, how would you handle prefetching the data in such file?
I.e. what happens when you read data from the mmaped region? Is that data moved to the L1/L2 caches? Is it read directly from disk cache? Does the prefetchnta and similar ASM instructions work on mmaped zones?
What's the overhead of the actual mmap call? Is it relative to the amount of mapped data, or constant?
Hope somebody has some insight into this. Thanks in advance.
mmap is basically programmatic access to the Virtual Memory sub system.
When you have, say, 1G file, and you mmap it, you get a pointer to "the entire" file as if it were in memory.
However, at this stage nothing has happened save the actual mapping operation of reserving pages for the file in the VM. (The large the file, the longer the mapping operation, of course.)
In order to start reading data from the file, you simply access it through the pointer you were returned in the mmap call.
If you wish to "preload" parts of the file, just visit the area you'd like to preload. Make sure you visit ALL of the pages you want to load, since the VM will only load the pages you access. For example, say within your 1G file, you have a 10MB "index" area that you'd like to map in. The simplest way would be to just "walk your index", or whatever data structure you have, letting the VM page in data as necessary. Or, if you "know" that it's the "first 10MB" of the file, and that your page size for your VM is, say, 4K, then you can just cast the mmap pointer to a char pointer, and just iterate through the pages.
void load_mmap(char *mmapPtr) {
// We'll load 10MB of data from mmap
int offset = 0;
for(int offset = 0; offset < 10 * 1024 * 1024; offset += 4 * 1024) {
char *p = mmapPtr + offset;
// deref pointer to force mmap load
char c = *p;
}
}
As for L1 and L2 caches, mmap has nothing to do with that, that's all about how you access the data.
Since you're using the underlying VM system, anything that addresses data within the mmap'd block will work (ever from assembly).
If you don't change any of the mmap'd data, the VM will automatically flush out old pages as new pages are needed If you actually do change them, then the VM will write those pages back for you.
It's nothing to do with the CPU caches; it maps it into virtual address space, and if it's subsequently accessed, or locked with mlock(), then it brings it physically into memory. What CPU caches it's in or not in is nothing you really have control over (at least, not via mmap).
Normally touching the pages is necessary to cause it to be mapped in, but if you do a mlock or mlockall, that would have the same effect (these are usually privileged).
As far as the overhead is concerned, I don't really know, you'd have to measure it. My guess is that a mmap() which doesn't load pages in is more or less a constant time operation, but bringing the pages in will take longer with more pages.
Recent versions of Linux also support a flag MAP_POPULATE which instructs mmap to load the pages in immediately (presumably only if possible)
Answering Mr. Ravi Phulsundar's question:
Multiple processes can map the same file as long as the permissions are set correctly. Looking at the mmap man page just pass the MAP_SHARED flag ( if you need to map a really large file use mmap2 instead ):
mmap
MAP_SHARED
Share this mapping with all other processes that map this object.
Storing to the region is equivalent to
writing to the file. The file may not
actually be updated until msync(2) or
munmap(2) are called.
you use MAP_SHARED

Resources