mmap options in malloc - linux

What is the effect of the MAP_ANONYMOUS|MAP_SHARED options in the mmap? I see that the malloc uses the MAP_ANONYMOUS|MAP_PRIVATE options for doing mmap for larger memory allocations.
I'm observing that with the MAP_ANONYMOUS|MAP_PRIVATE, the unmapped memory region is still with the process ( observed through the pmap) whereas with the MAP_ANONYMOUS|MAP_SHARED, the unmapped is released back immediately.

When using MAP_ANONYMOUS, MAP_PRIVATE versus MAP_SHARED only makes a difference if the process forks a child that also uses the mapped memory block.
If you use MAP_PRIVATE, the mapped memory is marked copy-on-write, so changes made by one of the processs will not be seen by the other process.
If you use MAP_SHARED, the mapped memory is shared by both processs, so they can see each other's changes.
malloc() uses MAP_PRIVATE so that the parent and child can continue to use the mapped memory for their heaps without needing to synchronize updates. It behaves just like the data segment that's used for the normal heap.

Related

Does kmalloc() reserve Copy-On-Write (COW) mappings?

My understanding is that kmalloc() allocates from anonymous memory. Does this actually reserve the physical memory immediately or is that going to happen only when a write page fault happens?
kmalloc() does not usually allocate memory pages(1), it's more complicated than that. kmalloc() is used to request memory for small objects (smaller than the page size), and manages those requests using already existing memory pages, similarly to what libc's malloc() does to manage the heap in userspace programs.
There are different allocators that can be used in the Linux kernel: SLAB, SLOB and SLUB. Those use different approaches for the same goal: managing allocation and deallocation of kernel memory for small objects at runtime. A call to kmalloc() may use any of the three depending on which one was configured into the kernel.
A call to kmalloc() does not usually reserve memory at all, but rather just manages already reserved memory. Therefore memory returned by kmalloc() does not require a subsequent page fault like a page requested through mmap() normally would.
Copy on Write (CoW) is a different concept. Although it's still triggered through page faults, CoW is a mechanism used by the kernel to save space sharing existing mappings until they are modified. It is not what happens when a fault is triggered on a newly allocated memory page. A great example of CoW is what happens when the fork syscall is invoked: the process memory of the child process is not immediately duplicated, instead existing pages are marked as CoW, and the duplication only happens at the first attempt to write.
I believe this should clear any doubt you have. The short answer is that kmalloc() does not usually "reserve the physical memory immediately" because it merely allocates an object from already reserved memory.
(1) Unless the request is very large, in which case it falls back to actually allocating new memory through alloc_pages().

How can I maximize mmap performance?

I'm using mmap to read/write a file that I'm using in a database-like way. The file is much larger than available RAM. My use case is single-process, multi-threaded. How can I maximize the performance of accessing the mmap'd memory?
I'm assuming I should use MAP_PRIVATE rather than MAP_SHARED to take advantage of copy-on-write.
Is there any performance advantage to using MAP_POPULATE and/or MAP_NONBLOCK?
Are there any other performance-related things I should consider when using mmap?
mmap manipulates process' virtual address space and the PTEs in the CPU and RAM, and that is not a cheap operation.
Linus Torvalds replied a number of times on drawbacks of mmap:
https://yarchive.net/comp/mmap.html#6
https://yarchive.net/comp/linux/zero-copy.html#9
One way to minimize mmap cost is keep mapping files (or parts of them) into the same virtual address space range, so that no PTE manipulation is necessary.
mmap without MAP_POPULATE reserves the virtual address space in the process, but doesn't back it with hardware memory pages, so that the thread raises a page fault hardware interrupt when accessing that page for the 1st time, and the kernel handles that interrupt by mapping the actual hardware memory page. MAP_POPULATE allows you to avoid those page-faults, but it may take longer to return from mmap.
MAP_LOCKED makes sure that the page doesn't get swapped out.
You may also like to experiment with MAP_HUGETLB and one of MAP_HUGE_2MB, MAP_HUGE_1GB flags. If suitable for your application, huge pages minimize the number of TLB misses.
Try binding threads to the same NUMA node with numactl to make sure that threads only access local NUMA memory. E.g. numactl --membind=0 --cpunodebind=0 <app>.
MAP_PRIVATE vs MAP_SHARED only matters if you'd like to modify the mapped pages. MAP_PRIVATE doesn't propagate your modification to the file or other processes' mappings of that file.

How does a MAP_PRIVATE handle changes to the underlying file after mmap() is called

Let's say I have three separate processes (PV, PS, and PH), each of them call mmap() on the entirety of a file. PV calls mmap() with the MAP_PRIVATE flag while PS and PH call mmap() with the MAP_SHARED flag. I understand that changes made by PV to the memory block will NOT be propagated to PS nor PH. I also understand that changes made by PS will be propagated to PH, but will the changes be propagated to PV? Or is PV insulated from changes made by processes using MAP_SHARED?
Secondly, assuming none of them have written to the mmap() memory, will they all be using the same physical memory? Or do MAP_SHARED and MAP_PRIVATE lead to separate physical memory allocation?
From the man page for mmap():
It is unspecified whether changes made to the file after the
mmap() call are visible in the mapped region.
tldr = Booooo!

libc memory management

How does libc communicate with the OS (e.g., a Linux kernel) to manage memory? Specifically, how does it allocate memory, and how does it release memory? Also, in what cases can it fail to allocate and deallocate, respectively?
That is very general question, but I want to speak to the failure to allocate. It's important to realize that memory is actually allocated by kernel upon first access. What you are doing when calling malloc/calloc/realloc is reserving some addresses inside the virtual address space of a process (via syscalls brk, mmap, etc. libc does that).
When I get malloc or similar to fail (or when libc get brk or mmap to fail), it's usually because I exhausted the virtual address space of a process. This happens when there is no continuous block of free address, an no room to expand an existing one. You can either exhaust all space available or hit a limit RLIMIT_AS. It's pretty common especially on 32bit systems when using multiple threads, because people sometimes forget that each thread needs it's own stack. Stacks usually consume several megabytes, which means you can create only few hundreds threads before you have no more free address space. Maybe an even more common reason for exhausted address space are memory leaks. Libc of course tries to reuse space on the heap (space obtained by a brk syscall) and tries to munmmap unneeded mappings. However, it can't reuse something that is not "deallocated".
The shortage of physical memory is not detectable from within a process (or libc which is part of the process) by failure to allocate. Yeah, you can hit "overcommitting limit", but that doesn't mean the physical memory is all taken. When free physical memory is low, kernel invokes special task called OOM killer (Out Of Memory Killer) which terminates some processes in order to free memory.
Regarding failure to deallocate, my guess is it doesn't happen unless you do something silly. I can imagine setting program break (end of heap) below it's original position (by a brk syscall). That is, of course, recipe for a disaster. Hopefully libc won't do that and it doesn't make much sense either. But it can be seen as failed deallocation. munmap can also fail if you supply some silly argument, but I can't think of regular reason for it to fail. That doesn't mean it doesn't exists. We would have to dig deep within source code of glibc/kernel to find out.
1) how does it allocate memory
libc provides malloc() to C programs.
Normally, malloc allocates memory from the heap, and adjusts the
size of the heap as required, using sbrk(2). When allocating blocks of
memory larger than MMAP_THRESHOLD bytes, the glibc malloc()
implementation allocates the memory as a private anonymous mapping
using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is adjustable
using mallopt(3). Allocations performed using mmap(2) are unaffected
by the RLIMIT_DATA resource limit (see getrlimit(2)).
And this is about sbrk.
sbrk - change data segment size
2) in what cases can it fail to allocate
Also from malloc
By default, Linux follows an optimistic memory allocation strategy.
This means that when malloc() returns non-NULL there is no guarantee
that the memory really is available.
And from proc
/proc/sys/vm/overcommit_memory
This file contains the kernel virtual memory accounting mode. Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
Mostly it uses the sbrk system call to adjust the size of the data segment, thereby reserving more memory for it to parcel out. Memory allocated in that way is generally not released back to the operating system because it is only possible to do it when the blocks available to be released are at the end of the data segment.
Larger blocks are sometime done by using mmap to allocate memory, and that memory can be released again with an munmap call.
How does libc communicate with the OS (e.g., a Linux kernel) to manage memory?
Through system calls - this is a low-level API that the kernel provides.
Specifically, how does it allocate memory, and how does it release memory?
Unix-like systems provide the "sbrk" syscall.
Also, in what cases can it fail to allocate and deallocate, respectively?
Allocation can fail, for example, when there's no enough available memory. Deallocation shall not fail.

Is heap allocated on memory pages?

In Linux x86-64 environment, is the entire process allocated on virtual memory pages? By entire process i mean the text, data, bss, heap and stack?
Also, when libc calls Brk, does the kernel returns memory that is managed via pages by virtual memory manager ?
Lastly, can a process get memory on heap, which is not managed by virtual memory manager, in other words, can a process get access to physical memory?
In Linux x86-64 environment, is the entire process allocated on virtual memory pages?
Yes, all processes have a virtual address space, i.e. have their own page table and virtual memory to physical memory mapping pattern.
Also, when libc calls Brk, does the kernel returns memory that is managed via pages by virtual memory manager ?
Yes, in fact, if you aren't hacking the OS kernel, virtual memory is transparent to you.
can a process get memory on heap, which is not managed by virtual memory manager, in other words, can a process get access to physical memory?
No, you can't manage physical memory per my knowledge unless you run your program without support from OS. Because process has its own virtual space, all your action related to memory management is on virtual memory.
A process has one or more tasks (scheduled by the kernel) which for a multi-threaded process are the processes' threads (and for a non-threaded process the task running the process), and it has an address space (and some other resources, e.g. opened file descriptors).
Of course, the address space is in virtual memory. The kernel is allowed to swap pages (to e.g. the swap zone of your disk). It tries hard to avoid doing that (swapping pages to disk is very slow, because the disk access time is in dozens of milliseconds, while the RAM access time is in tenth of microsecond).
text & bss etc are virtual memory segments, which are memory mappings. You can think of a process space as a memory map. The mmap(2) system call is the way to modify it. When an executable is started with execve system call, the kernel establish a few mappings (e.g for text, data, bss, stack, ...). The sbrk(2) system call also change it. Most malloc implementations use mmap (at least for big enough zones) and sometimes sbrk.
You can avoid that a memory range is swapped out by locking it into RAM using the mlock(2) syscall, which usually requires root privilege. It is rarely useful in practice (unless you code real-time applications). There is also the msync syscall (to flush memory to disk), you can of course map a portion of file into virtual memory (using mmap), you can change the protection with mprotect(2), remove map with munmap(2), extend a mapping with mremap -a Linux specific syscall-, and you could even catch the SIGSEGV signal and handle it (often in a machine specific way). The madvise(2) syscall enables you to tune paging with hints.
You can understand the memory map of a process of pid 1234 by reading the /proc/1234/maps file (or also /proc/1234/smaps). (From inside an application, you can use /proc/self/ instead of /proc/1234/ ...) I suggest you to run in a terminal:
cat /proc/self/maps
which will show you the memory map of the process running that cat command. You can also use the pmap utility.
Most recent linux kernels provide Adress Space Layout Randomization (so two similar processes running the same program on the same input have different mmap-ed & malloc-ed addresses). You could disable it thru /proc/sys/kernel/randomize_va_space
Except in very rare circumstances (uClinux), processes only see virtual memory, which is mapped to physical memory by the kernel.
The kernel can be asked to make specific mappings that give a predictable physical address for a given virtual address; you need the appropriate capability to do that however, as this breaks down the process separation.
On execve, the current mappings are replaced by the loadable segments from the ELF file specified; these are mapped so that referenced pages are loaded from the ELF file (some initial readahead is also performed). The brk system call mainly extends the non-executable mapping with the highest addresses (excluding the stack mapping) by a few pages, allowing the process to access more virtual addresses without being sent a SIGSEGV.
The heap is generally managed by the process internally, but the virtual address space assigned to heap objects must be known to the virtual memory manager beforehand in order to create a mapping. malloc will generally look into its internal tables for a region that is already mapped and usable, and if none can be found, use either brk() or mmap() to create more mappings.

Resources