Replacing `sbrk` with `mmap`

Replacing `sbrk` with `mmap` - linux

I've read that sbrk is a deprecated call and one should prefer mmap with MAP_ANONYMOUS flag. I need one continous (logical) memory block that can grow. However, mmap treats first parameter as a hint, so it can make gaps, which is unacceptable in my case. I tried to use MAP_FIXED flag (which as documentation states is not recommended) and I can get continuos memory, but after mapping several pages I get strange behaviour of my program: system functions like printf and clock_gettime begin to fail. I guess the first mmap which I call without MAP_FIXED returns page that has some mapped pages after it, which contain system data. So what is the right way to use mmap instead of sbrk?

With Linux you can use mmap with MAP_NORESERVE (and possibly PROT_NONE) to claim a large chunk of address space without actually allocating any memory. You map the largest area you could possibly want (and can get), and then remap bits of it with MAP_FIXED to actually allocate memory as needed.

I've read that sbrk is a deprecated call
Don't believe everything you read, especially if the source is not authoritative.
I need one continous (logical) memory block that can grow.
In that case, mmap is not for you, unless you are willing to declare the maximum size to which that block can grow.
I tried to use MAP_FIXED flag (which as documentation states is not recommended) and I can get continuos memory, but after mapping several pages I get strange behaviour of my program
With MMAP_FIXED you have to be very careful: the system will happily map over whatever (if anything) was there before, including libc data and code.

Related

Why is mprotect a distinct syscall from mmap

I was working with syscalls relating to virtual memory lately. From the manual of mmap I know that it can be very powerful when MAP_FIXED flag is set, creating new mappings everywhere in the memory.
MAP_FIXED
Don't interpret addr as a hint: place the mapping at exactly
that address. addr must be suitably aligned: for most
architectures a multiple of the page size is sufficient;
however, some architectures may impose additional
restrictions. If the memory region specified by addr and len
overlaps pages of any existing mapping(s), then the overlapped
part of the existing mapping(s) will be discarded. If the
specified address cannot be used, mmap() will fail.
Software that aspires to be portable should use the MAP_FIXED
flag with care, keeping in mind that the exact layout of a
process's memory mappings is allowed to change significantly
between kernel versions, C library versions, and operating
system releases. Carefully read the discussion of this flag
in NOTES!
My question is, why there is a distinct syscall mprotect from mmap, given that mmap can do the exact same job by creating a new mapping with the same fd and offset, and set the new prot you want?
In my opinion, all operations about VM can be done ultimately with mmap and munmap, for those operations are basically just playing with the page table. Can someone tell me if this is a bad idea?

You need mprotect if you want to change the permissions on an existing region of memory, while keeping its contents intact.
mmap can't do this. If you use mmap with MAP_FIXED to create a new mapping at the same address, then the region's previous contents will be replaced by the contents of the new file you mapped, or zeros if using MAP_ANONYMOUS.
Using the same fd and offset does not solve this. If the map was originally created with MAP_ANONYMOUS (as is the case for most dynamically allocated memory) then there is no fd. Or, if the region was mapped to a file but with MAP_PRIVATE, then the contents could have been modified in your process's memory without being written back to the file. Attempting to map the file again with mmap will lose the modified data and replace it with the file's original contents.

mmap(): resetting old memory to a zero'd non-resident state

I'm writing a memory allocation routine, and it's currently running smoothly. I get my memory from the OS with mmap() in 4096-byte pages. When I start my memory allocator I allocate 1gig of virtual address space with mmap(), and then as allocations are made I divide it up into hunks according to the specifics of my allocation algorithm.
I feel safe allocating as much as a 1gig of memory on a whim because I know mmap() doesn't actually put pages into physical memory until I actually write to them.
Now, the program using my allocator might have a spurt where it needs a lot of memory, and in this case the OS would have to eventually put a whole 1gig worth of pages into physical RAM. The trouble is that the program might then go into a dormant period where it frees most of that 1gig and then uses only minimal amounts of memory. Yet, all I really do inside of my allocator's MyFree() function is to flip a few bits of bookkeeping data which mark the previously used gig as free, but I know this doesn't cause the OS remove those pages from physical memory.
I can't use something like munmap() to fix this problem, because the nature of the allocation algorithm is such that it requires a continuous region of memory without any holes in it. Basically I need a way to tell the OS "Listen, you can take these pages out of physical memory and clear them to 0, but please remap them on the fly when I need them again, as if they were freshly mmap()'d"
What would be the best way to go about this?
Actually, after writing this all up I just realized that I can probably do an munmap() followed immediately by a fresh mmap(). Would that be the correct way to go about? I get the sense that there's probably some more efficient way to do this.

You are looking for madvise(addr, length, MADV_DONTNEED). From the manpage:
MADV_DONTNEED: Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
Note especially the language about how subsequent accesses will succeed but revert to zero-fill-on-demand (for mappings without an underlying file).
Your thinking-out-loud alternative of an munmap followed immediately by another mmap will also work but risks kernel-side inefficiencies because it is no longer tracking the allocation a single contiguous region; if there are many such unmap-and-remap events the kernelside data structures might wind up being quite bloated.
By the way, with this kind of allocator it's very important that you use MAP_NORESERVE for the initial allocation, and then touch each page as you allocate it, and trap any resulting SIGSEGV and fail the allocation. (And you'll need to document that your allocator installs a handler for SIGSEGV.) If you don't do this your application will not work on systems that have disabled memory overcommit. See the mmap manpage for more detail.

mmap: will the mapped file be loaded into memory immediately?

From the manual, I just know that mmap() maps a file to a virtual address space, so the file can be randomly accessed. But, it is unclear to me that whether the mapped file is loaded into memory immediately? I guess that kernel manages the mapped memory by pages, and they are loaded on demand, if I only do a few of reads and writes, only a few pages are loaded. Is it correct?

No, yes, maybe. It depends.
Calling mmap generally only means that to your application, the mapped file's contents are mapped to its address space as if the file was loaded there. Or, as if the file really existed in memory, as if they were one and the same (which includes changes being written back to disk, assuming you have write access).
No more, no less. It has no notion of loading something, nor does the application know what this means.
An application does not truly have knowledge of any such thing as memory, although the virtual memory system makes it appear like that. The memory that an application can "see" (and access) may or may not correspond to actual physical memory, and this can in principle change at any time, without prior warning, and without an obvious reason (obvious to your application).
Other than possibly experiencing a small delay due to a page fault, an application is (in principle) entirely unaware of any such thing happening and has little or no control over it1.
Applications will, generally, load pages from mapped files (including the main executable!) on demand, as a consequence of encountering a fault. However, an operating system will usually try to speculatively prefetch data to optimize performance.
In practice, calling mmap will immediately begin to (asynchronously) prefetch pages from the beginning of the mapping, up to a certain implementation-specified size. Which means, in principle, for small files the answer would be "yes", and for larger files it would be "no".
However, mmap does not block to wait for completion of the readahead, which means that you have no guarantee that any of the file is in RAM immediately after mmap returns (not that you have that guarantee at any time anyway!). Insofar, the answer is "maybe".
Under Linux, last time I looked, the default prefetch size was 31 blocks (~127k) -- but this may have changed, plus it's a tuneable parameter. As pages near or at the end of the prefetched area are touched, more pages are being prefetched asynchronously.
If you have hinted MADV_RANDOM to madvise, prefetching is "less likely to happen", under Linux this completely disables prefetch.
On the other hand, giving the MADV_SEQUENTIAL hint will asynchronously prefetch "more aggressively" beginning from the beginning of the mapping (and may discard accessed pages quicker). Under Linux, "more aggressively" means twice the normal amount.
Giving the MADV_WILLNEED hint suggests (but does not guarantee) that all pages in the given range are loaded as soon as possible (since you're saying you're going to access them). The OS may ignore this, but under Linux, it is treated rather as an order than a hint, up to the process' maximum RSS limit, and an implementation-specified limit (if I remember correctly, 1/2 the amount of physical RAM).
Note that MADV_DONTNEED is arguably implemented wrongly under Linux. The hint is not interpreted in the way specified by POSIX, i.e. you're OK with pages being paged out for the moment, but rather that you mean to discard them. Which makes no big difference for readonly mapped pages (other than a small delay, which you said would be OK), but it sure does matter for everything else.
In particular, using MADV_DONTNEED thinking Linux will release unneeded pages after the OS has written them lazily to disk is not how things work! You must explicitly sync, or prepare for a surprise.
Having called readahead on the file descriptor prior to calling mmap (or alternatively, having had read/written the file previously), the file's contents will in practice indeed be in RAM immediately.
This is, however, only an implementation detail (unified virtual memory system), and subject to memory pressure on the system.
Calling mlock will -- assuming it succeeds2 -- immediately load the requested pages into RAM. It blocks until all pages are physically present, and you have the guarantee that the pages will stay in RAM until you unlock them.
1 There exist functionality to query (mincore) whether any or all of the pages in a particular range are actually present at the very moment, and functionality to hint the OS about what you would like to see happening without any hard guarantees (madvise), and finally functionality to force a limited subset of pages to be present in memory (mlock) for privilegued processes.
2 It might not, both for lack of privilegues and for exceeding quotas or the amount of physical RAM present.

Yes, mmap creates a mapping. It does not normally read the entire content of whatever you have mapped into memory. If you wish to do that you can use the mlock/mlockall system call to force the kernel to read into RAM the content of the mapping, if applicable.

By default, mmap() only configure the mapping and returns (fast).
Linux (at least) has the option MAP_POPULATE (see 'man mmap') that does exactly what your question is about.

Yes. The whole point of mmap is that is manages memory more efficiently than just slurping everything into memory.
Of course, any given implementation may in some situations decide that it's more efficient to read in the whole file in one go, but that should be transparent to the program calling mmap.

Does madvise(_, _, MADV_DONTNEED) instruct the OS to lazily write to disk?

Hypothetically, suppose I want to perform sequential writing to a potentially very large file.
If I mmap() a gigantic region and madvise(MADV_SEQUENTIAL) on that entire region, then I can write to the memory in a relatively efficient manner. This I have gotten to work just fine.
Now, in order to free up various OS resources as I am writing, I occasionally perform a munmap() on small chunks of memory that have already been written to. My concern is that munmap() and msync()will block my thread, waiting for the data to be physically committed to disk. I cannot slow down my writer at all, so I need to find another way.
Would it be better to use madvise(MADV_DONTNEED) on the small, already-written chunk of memory? I want to tell the OS to write that memory to disk lazily, and not to block my calling thread.
The manpage on madvise() has this to say, which is rather ambiguous:
MADV_DONTNEED
Do not expect access in the near future. (For the time being, the
application is finished with the given range, so the kernel can free
resources associated with it.) Subsequent accesses of pages in this
range will succeed, but will result either in re-loading of the memory
contents from the underlying mapped file (see mmap(2)) or
zero-fill-on-demand pages for mappings without an underlying file.

No!
For your own good, stay away from MADV_DONTNEED. Linux will not take this as a hint to throw pages away after writing them back, but to throw them away immediately. This is not considered a bug, but a deliberate decision.
Ironically, the reasoning is that the functionality of a non-destructive MADV_DONTNEED is already given by msync(MS_INVALIDATE|MS_ASYNC), MS_ASYNC on the other hand does not start I/O (in fact, it does nothing at all, following the reasoning that dirty page writeback works fine anyway), fsync always blocks, and sync_file_range may block if you exceed some obscure limit and is considered "extremely dangerous" by the documentation, whatever that means.
Either way, you must msync(MS_SYNC), or fsync (both blocking), or sync_file_range (possibly blocking) followed by fsync, or you will lose data with MADV_DONTNEED. If you cannot afford to possibly block, you have no choice, sadly, but to do this in another thread.

For recent Linux kernels (just tested on Linux 5.4), MADV_DONTNEED works as expected when the mapping is NOT private, e.g. mmap without MAP_PRIVATE flag. I'm not sure what's the behavior on previous versions of Linux kernel.
From release 4.15 of the Linux man-pages project's madvise manpage:
After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either repopulating the memory contents from the up-to-date contents of the underlying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings.
Linux added a new flag MADV_FREE with the same behavior in BSD systems in Linux 4.5
which just mark pages as available to free if needed, but it doesn't free them immediately, making possible to reuse the memory range without incurring in the costs of faulting the pages again.
For why MADV_DONTNEED for private mapping may result zero filled pages upon future access, watch Bryan Cantrill's rant as mentioned in comments of #Damon's answer. Spoiler: it comes from Tru64 UNIX.

As already mentioned, MADV_DONTNEED is not your friend. Since Linux 5.4, you can use MADV_COLD to tell the kernel it should page out that memory when there is memory pressure. This seems to be exactly what is wanted in this situation.
Read more here:
https://lwn.net/Articles/793462/

first, madv_sequential enables aggressive readahead, so you don't need it.
second, os will lazily write dirty file-baked memory to disk anyway, even if you will do nothing. but madv_dontneed will instruct it to free memory immediately (what you call "various os resources"). third, it is not clear that mmapping files for sequential writing has any advantage. you probably will be better served by just write(2) (but use buffers - either manual or stdio).

Virtual Memory and sbrk

On a 32 bit Linux system, a process can access up to 4 GB of virtual address space; however, processes seem to be conservative in varying degrees in reserving any of that. So a program that uses malloc will occasionally grow its data segment by a syscall sbrk/brk. Even those pages aren't claimed in physical memory yet. What I don't fully understand is why we need to sbrk in the first place, why not just give me 4 GB address space avoiding any sbrk call, as until we touch/claim those blocks, it is essentially a free operation right?

What happens if you memory-map a file (a very common thing to do under Linux)? It has to go somewhere in the address space, so there must be some means of defining "used" and "not used" parts.
Shared memory (which is really just mapping a file without an actual file) is the same. It has to go somewhere, and the OS must be sure it can place it without overwriting something.
Also, it is preferrable to maintain locality of reference for obvious (and less obvious) efficiency reasons. If you were allowed to just write to and read from any location in your address space, you can bet that some people would do just that.

There's a couple of reasons that come to mind:
You'd no longer get segfaults when accessing unmapped memory
The Translation lookaside buffer (TLB) would be larger, possibly requiring more time to set it up
You'd have to unmap some of that memory anyway if you load in a new shared library or mmap() something

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string