I'm using mmap to read/write a file that I'm using in a database-like way. The file is much larger than available RAM. My use case is single-process, multi-threaded. How can I maximize the performance of accessing the mmap'd memory?
I'm assuming I should use MAP_PRIVATE rather than MAP_SHARED to take advantage of copy-on-write.
Is there any performance advantage to using MAP_POPULATE and/or MAP_NONBLOCK?
Are there any other performance-related things I should consider when using mmap?
mmap manipulates process' virtual address space and the PTEs in the CPU and RAM, and that is not a cheap operation.
Linus Torvalds replied a number of times on drawbacks of mmap:
https://yarchive.net/comp/mmap.html#6
https://yarchive.net/comp/linux/zero-copy.html#9
One way to minimize mmap cost is keep mapping files (or parts of them) into the same virtual address space range, so that no PTE manipulation is necessary.
mmap without MAP_POPULATE reserves the virtual address space in the process, but doesn't back it with hardware memory pages, so that the thread raises a page fault hardware interrupt when accessing that page for the 1st time, and the kernel handles that interrupt by mapping the actual hardware memory page. MAP_POPULATE allows you to avoid those page-faults, but it may take longer to return from mmap.
MAP_LOCKED makes sure that the page doesn't get swapped out.
You may also like to experiment with MAP_HUGETLB and one of MAP_HUGE_2MB, MAP_HUGE_1GB flags. If suitable for your application, huge pages minimize the number of TLB misses.
Try binding threads to the same NUMA node with numactl to make sure that threads only access local NUMA memory. E.g. numactl --membind=0 --cpunodebind=0 <app>.
MAP_PRIVATE vs MAP_SHARED only matters if you'd like to modify the mapped pages. MAP_PRIVATE doesn't propagate your modification to the file or other processes' mappings of that file.
Related
I am using mmap to open /dev/mem for read/write into UART registers.
It works well but my question is :
After a write, is the msync system call with MS_SYNC flag really needed ?
From my understanding, /dev/mem is a virtual device than provide access to physical memory zones (UART registers in my case ) by translating virtual memory address and so give access to some physical memory from user space.
This is not a common file and i guess that modifications of registers are not buffered/cached. I actually would like to avoid this system call for performance reasons.
Thanks
My understanding is that msync() is needed to update the data in a normal file that is modified through a mapping created with mmap().
But when you use mmap on /dev/mem you are not mapping a normal file on disk, you are just mapping the desired hardware memory range directly into your process virtual address space, so msync() is off topic, it will do nothing.
The only thing that lies between your writing into your mmapped virtual space and the hardware device is the CPU cache. To force that you can force a cache flush (__clear_cache() maybe?), but that is usually unnecessary because the kernel identifies the memory mapped device register and disables the cache for that range. In X86 CPUs that is usually done with MTRR, but with ARM I don't know the details...
I allocated some memory using memalign, and I set the last page as a guard page using mprotec(adde, size, PROT_NONE), so this page is inaccessible.
Does the inaccessible page consume physical memory? In my opinion, the kernel can offline the physical pages safely, right?
I also tried madvise(MADV_SOFT_OFFLINE) to manually offline the physical memory but the function always fails.
Can anybody tell me the internal behavior of kernel with mprotect(PROT_NONE), and how to offline the physical memory to save physical memory consumption?
Linux applications are using virtual memory. Only the kernel is managing physical RAM. Application code don't see the physical RAM.
A segment protected with mprotect & PROT_NONE won't consume any RAM.
You should allocate your segment with mmap(2) (maybe you want MAP_NORESERVE). Mixing memalign with mprotect may probably break libc invariants.
Read carefully madvise(2) man page. MADV_SOFT_OFFLINE may require a specially configured kernel.
I did come across through the LDD book that using the kmalloc we can allocate from high memory. I have one basic question here.
1)But to my knowledge we can't access the high memory directly from the kernel (unless it is mapped to the kernel space through the kmap()). And i didn't see any mapping area reserved for kmalloc(), But for vmalloc() it is present.So, to which part of the kernel address does the kmalloc() will map if allocated from high memory?
This is on x86 architecture,32bit system.
My knowledge may be out of date but the stack is something like this:
kmalloc allocates physically contiguous memory by calling get_free_pages (this is what the acronym GFP stands for). The GFP_* flags passed to kmalloc end up in get_free_pages, which is the page allocator.
Since special handling is required for highmem pages, you won't get them unless you add the GFP_HIGHMEM flag to the request.
All memory in Linux is virtual (a generalization that is not exactly true and that is architecture-dependent, but let's go with it, until the next parenthesized statement in this paragraph). There is a range of memory, however, that is not subject to the virtualization in the sense of remapping of pages: it is just a linear mapping between virtual addresses and physical addresses. The memory allocated by get_free_pages is linearly mapped, except for high memory. (On some architectures, linear mappings are supported for memory ranges without the use of a MMU: it's just a simple arithmetic translation of logical addresses to physical: add a displacement. On other architectures, linear mappings are done with the MMU.)
Anyway, if you call get_free_pages (directly, or via kmalloc) to allocate two or more pages, it has to find physically contiguous ones.
Now virtual memory is also implemented on top of get_free_pages, because we can take a page allocated that way, and install it into a virtual address space.
This is how mmap works and everything else for user space. When a piece of virtual memory is committed (becomes backed by a physical page, on a page fault or whatever), a page comes from get_free_pages. Unless that page is highmem, it has a linear mapping so that it is visible in the kernel. Additionally, it is wired into the virtual address space for which the request is being made. Some kernel data structures keep track of this, and of course it is punched into the page tables so the MMU makes it happen.
vmalloc is similar in principle to mmap, but far simpler because it doesn't deal with multiple backends (devices in the filesystem with a mmap virtual function) and doesn't deal with issues like coalescing and splitting of mappings that mmap allows. The vmalloc area consists of a reserved range of virtual addresses visible only to the kernel (whose base address and is architecture dependent and can be tweaked by you at kernel compile time). The vmalloc allocator carves out this virtual space, and populates it with pages from get_free_pages. These need not be contiguous, and so can be obtained one at a time, and wired into the allocated virtual space.
Highmem pages are physical memory that is not addressable in the kernel's linear map representing physical memory. Highmem exists because the kernel's linear "window" on physical memory isn't always large enough to cover all of memory. (E.g. suppose you have a 1GB window, but 4GB of RAM.) So, for coverage of all of memory, there is in addition to the linear map, some smaller "non linear" map where pages they are selectively made visible, on a temporary basis using kmap and kunmap. Placement of a page into this view is considered the acquisition of a precious resource that must be used sparingly and released as soon as possible.
A highmem page can be installed into a virtual memory map just like any other page, and no special "highmem" handling is needed for that view of the page. Any map: that of a process, or the vmalloc range.
If you're dealing with some virtual memory that could be a mixture of highmem and non-highmem pages, which you have to view through the kernel's linear space, you have to be prepared to use the mapping functions.
I understand from mmap() internals that a mmap read works by
- causing a page fault
- copying file data from disk to internal kernel buffer
- mapping the kernel buffer to user space
My questions are:
What happens to the kernel mapping to the buffer? if it still exists, dont we have a problem here of user application gaining access to kernel memory?
cant we run out of physical memory this way? I'd assume the kernel needs a minimum amount of physical memory to provide decent level of performance, and if we keep allocating it's buffers to mmapped user space buffer we'd eventually run out of buffers.
during a write, does the relevant memory gets mapped temporarily to a kernel buffer? if and this is a shared maping, another user process may access and again gain access to what is now kernel memory
Thanks, and sorry if these questions are pretty basic, but I did not find a clear answer.
I'm not a kernel hacker by any means, but this is what I've gathered:
I'm not entirely sure when it comes to the question of whether the kernel "relinquishes" its mapping to physical memory, since the kernel can access any physical memory it pleases. However, it would obviously be impermissible for the kernel to keep using that physical memory for its own purposes (e.g. as an internal pipe buffer) if user processes can access that memory as well, for the sake of both the user process and for the sake of the kernel. The kernel will simply designate those pages as part of the filesystem cache (if backed by a file) and not mess with them.
Yes, to the same extent that any process or number of processes can limit the amount of physical memory present for the kernel by requesting lots of resources like pipes. However, the kernel keeps track of how much physical memory is available and will start to page out userland memory to the disk when the remaining amount of physical memory runs low. Kernel memory itself typically should not be paged out to the disk for reasons including performance. Though the nice thing about mmap()ed memory backed by a file is that it's trivial to page out to the disk; no swap space needs to be allocated.
If you mean a write to available memory mapped to userland virtual address space (i.e. memcpy(), not write()), no. The whole point of mmap() is to map userland virtual address space to physical memory to allow reads and writes without resorting to system calls. Syncs to the disk will be performed directly by the kernel without additional copying to kernel buffers.
In Linux x86-64 environment, is the entire process allocated on virtual memory pages? By entire process i mean the text, data, bss, heap and stack?
Also, when libc calls Brk, does the kernel returns memory that is managed via pages by virtual memory manager ?
Lastly, can a process get memory on heap, which is not managed by virtual memory manager, in other words, can a process get access to physical memory?
In Linux x86-64 environment, is the entire process allocated on virtual memory pages?
Yes, all processes have a virtual address space, i.e. have their own page table and virtual memory to physical memory mapping pattern.
Also, when libc calls Brk, does the kernel returns memory that is managed via pages by virtual memory manager ?
Yes, in fact, if you aren't hacking the OS kernel, virtual memory is transparent to you.
can a process get memory on heap, which is not managed by virtual memory manager, in other words, can a process get access to physical memory?
No, you can't manage physical memory per my knowledge unless you run your program without support from OS. Because process has its own virtual space, all your action related to memory management is on virtual memory.
A process has one or more tasks (scheduled by the kernel) which for a multi-threaded process are the processes' threads (and for a non-threaded process the task running the process), and it has an address space (and some other resources, e.g. opened file descriptors).
Of course, the address space is in virtual memory. The kernel is allowed to swap pages (to e.g. the swap zone of your disk). It tries hard to avoid doing that (swapping pages to disk is very slow, because the disk access time is in dozens of milliseconds, while the RAM access time is in tenth of microsecond).
text & bss etc are virtual memory segments, which are memory mappings. You can think of a process space as a memory map. The mmap(2) system call is the way to modify it. When an executable is started with execve system call, the kernel establish a few mappings (e.g for text, data, bss, stack, ...). The sbrk(2) system call also change it. Most malloc implementations use mmap (at least for big enough zones) and sometimes sbrk.
You can avoid that a memory range is swapped out by locking it into RAM using the mlock(2) syscall, which usually requires root privilege. It is rarely useful in practice (unless you code real-time applications). There is also the msync syscall (to flush memory to disk), you can of course map a portion of file into virtual memory (using mmap), you can change the protection with mprotect(2), remove map with munmap(2), extend a mapping with mremap -a Linux specific syscall-, and you could even catch the SIGSEGV signal and handle it (often in a machine specific way). The madvise(2) syscall enables you to tune paging with hints.
You can understand the memory map of a process of pid 1234 by reading the /proc/1234/maps file (or also /proc/1234/smaps). (From inside an application, you can use /proc/self/ instead of /proc/1234/ ...) I suggest you to run in a terminal:
cat /proc/self/maps
which will show you the memory map of the process running that cat command. You can also use the pmap utility.
Most recent linux kernels provide Adress Space Layout Randomization (so two similar processes running the same program on the same input have different mmap-ed & malloc-ed addresses). You could disable it thru /proc/sys/kernel/randomize_va_space
Except in very rare circumstances (uClinux), processes only see virtual memory, which is mapped to physical memory by the kernel.
The kernel can be asked to make specific mappings that give a predictable physical address for a given virtual address; you need the appropriate capability to do that however, as this breaks down the process separation.
On execve, the current mappings are replaced by the loadable segments from the ELF file specified; these are mapped so that referenced pages are loaded from the ELF file (some initial readahead is also performed). The brk system call mainly extends the non-executable mapping with the highest addresses (excluding the stack mapping) by a few pages, allowing the process to access more virtual addresses without being sent a SIGSEGV.
The heap is generally managed by the process internally, but the virtual address space assigned to heap objects must be known to the virtual memory manager beforehand in order to create a mapping. malloc will generally look into its internal tables for a region that is already mapped and usable, and if none can be found, use either brk() or mmap() to create more mappings.