FreeBSD ftruncate()+mmap() big hole warning

The FreeBSD 13.1 manual page for mmap() has the following warning:
WARNING! Extending a file with ftruncate(2), thus creating a big hole, and then filling the hole by modifying a shared mmap() can lead to severe file fragmentation. In order to avoid such fragmentation you should always pre-allocate the file's backing store by write()ing zero's into the newly extended area prior to modifying the area via your mmap(). The fragmentation problem is especially sensitive to MAP_NOSYNC pages, because pages may be flushed to disk in a totally random order.
Questions:
What is this big hole, why does ftruncate() have the effect of creating it, and why is write() a proposed solution to the problem?
How does one efficiently zero-fill the hole after ftruncate() and before mmap()? Repeatedly calling write() sounds like more system calls than may be necessary.
Does the same problem exist on other operating systems? I find no such warning on macOS or Linux.

You can avoid this problem by using posix_fallocate to preallocate desired areas.
The hole is created because files can be sparse, taking up only the space required for the areas actually in use. When you use just ftruncate, the backing for the new area is virtual; it isn't reserved on disk until you allocate it or write to it.
This applies to Linux as well; it's just not mentioned there. You can expect most filesystem implementations to behave similarly. They often try to be smart and will coalesce writes that are close to one another in time, but they can't do magic.
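For illustration, here is a minimal sketch of the preallocation approach, assuming a plain file named data.bin and an arbitrary 1 MiB size; posix_fallocate() reserves real blocks, so subsequent writes through the mapping don't have to fill a sparse hole:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const off_t new_size = 1 << 20;            /* 1 MiB, arbitrary example size */
    int fd = open("data.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve real blocks instead of leaving a sparse hole. */
    int err = posix_fallocate(fd, 0, new_size);
    if (err != 0) { fprintf(stderr, "posix_fallocate: %s\n", strerror(err)); return 1; }

    char *p = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memset(p, 'x', new_size);                  /* writes now land in preallocated blocks */

    munmap(p, new_size);
    close(fd);
    return 0;
}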

Related

Understanding how bootmem works

I have been studying OS concepts and decided to look into how this stuff is actually implemented in Linux. But I am having trouble understanding some things related to memory management during the boot process, before the page allocator is turned on, more precisely how bootmem works. I do not need its exact workings, just an understanding of how some of these things are (or can be) solved.
So obviously, bootmem cannot use dynamic memory, meaning the amount of memory it needs must be known before runtime so appropriate steps can be taken, i.e. the maximum size of its bitmap must be known in advance. From what I understand, this is most likely solved by simply mapping enough memory during kernel initialization; if the architecture changes, simply change the size of the mapped memory. Obviously there is probably a lot more going on, but I guess I got the general idea? However, what really makes no sense to me is the NUMA case. Everywhere I read, it says that a pg_data_t is created for each memory node. This pg_data is put into a list (how can it know the size of the list? Or is the size fixed for a specific arch?) and for each node a bitmap is allocated. So basically it sounds like it can create an undefined number of these pg_data structures, each of which has its own memory bitmap of arbitrary size. How? What am I missing?
EDIT: Sorry for not including a reference. Here is the bootmem code, which can also be found in mm/bootmem.c: http://lxr.free-electrons.com/source/mm/bootmem.c
It is architecture-dependent. On the x86 architecture, early on in the boot process the kernel issues one BIOS call - the 0xe820 function of the trap at Interrupt Vector 0x15. This returns a memory map that the kernel can use to build its memory tables, including holes for non-memory (PCI or ISA) devices, etc. Bootloaders (before the kernel) will do the same.
See: Detecting Memory
After looking into this more, I think it works this way: basically, all the necessary things are statically allocated, i.e. preprocessor defines ensure that certain sections of the bootmem code (as well as other parts of the kernel) either exist or do not exist in the compiled code for a specific architecture (even though the code itself is architecture-independent). These defines are specified in the architecture-dependent sources found under arch/ (e.g. arch/i386, arch/arm/, etc.). For NUMA architectures there is a define called MAX_NUMNODES, ensuring that the list of structs (more specifically, pg_data_t structures) representing nodes is allocated as a static array (which is then treated as a list). The bitmaps representing the memory map are relatively small, since each page is represented by only one bit, so they take up KBs, or maybe MBs. Whatever the case, the architecture-dependent head.S sets up all the structures needed for the system to function (like page tables) and ensures enough physical memory is mapped into the virtual address space that these bitmaps fit in it without causing a page fault (in the case of the x86 arch, the initial 8 MB of RAM are mapped, which is more than enough for both the kernel and additional structures such as the bitmaps).
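As an illustration of the idea only (this is not actual kernel code, and the field names are simplified assumptions), a compile-time upper bound like MAX_NUMNODES is what lets a seemingly dynamic list of per-node structures be allocated statically:

/* Illustrative sketch: how a fixed, arch-specific compile-time limit
 * allows "dynamic-looking" per-node data to be allocated statically. */
#define MAX_NUMNODES 64                 /* assumed arch-specific limit */

typedef struct pg_data {
    unsigned long node_start_pfn;       /* simplified, assumed fields */
    unsigned long node_spanned_pages;
    /* ... */
} pg_data_t;

static pg_data_t node_data[MAX_NUMNODES];  /* one entry per possible NUMA node */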

mmap(): resetting old memory to a zero'd non-resident state

I'm writing a memory allocation routine, and it's currently running smoothly. I get my memory from the OS with mmap() in 4096-byte pages. When I start my memory allocator I allocate 1gig of virtual address space with mmap(), and then as allocations are made I divide it up into hunks according to the specifics of my allocation algorithm.
I feel safe allocating as much as a 1gig of memory on a whim because I know mmap() doesn't actually put pages into physical memory until I actually write to them.
Now, the program using my allocator might have a spurt where it needs a lot of memory, and in this case the OS would have to eventually put a whole 1 gig worth of pages into physical RAM. The trouble is that the program might then go into a dormant period where it frees most of that gig and then uses only minimal amounts of memory. Yet, all I really do inside of my allocator's MyFree() function is flip a few bits of bookkeeping data which mark the previously used gig as free, and I know this doesn't cause the OS to remove those pages from physical memory.
I can't use something like munmap() to fix this problem, because the nature of the allocation algorithm is such that it requires a continuous region of memory without any holes in it. Basically I need a way to tell the OS "Listen, you can take these pages out of physical memory and clear them to 0, but please remap them on the fly when I need them again, as if they were freshly mmap()'d"
What would be the best way to go about this?
Actually, after writing this all up I just realized that I can probably do an munmap() followed immediately by a fresh mmap(). Would that be the correct way to go about it? I get the sense that there's probably a more efficient way to do this.
You are looking for madvise(addr, length, MADV_DONTNEED). From the manpage:
MADV_DONTNEED: Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
Note especially the language about how subsequent accesses will succeed but revert to zero-fill-on-demand (for mappings without an underlying file).
Your thinking-out-loud alternative of an munmap followed immediately by another mmap will also work, but it risks kernel-side inefficiencies because the kernel is no longer tracking the allocation as a single contiguous region; if there are many such unmap-and-remap events, the kernel-side data structures might wind up being quite bloated.
By the way, with this kind of allocator it's very important that you use MAP_NORESERVE for the initial allocation, and then touch each page as you allocate it, and trap any resulting SIGSEGV and fail the allocation. (And you'll need to document that your allocator installs a handler for SIGSEGV.) If you don't do this your application will not work on systems that have disabled memory overcommit. See the mmap manpage for more detail.
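A minimal sketch of that combination (Linux-flavored; the SIGSEGV handling described above is omitted, and the pool size and names are made up for the example): reserve the address space with MAP_NORESERVE, and later hand back a "freed" chunk with MADV_DONTNEED while keeping the region mapped:

#include <stdio.h>
#include <sys/mman.h>

#define POOL_SIZE (1UL << 30)   /* 1 GiB of reserved address space */

int main(void)
{
    /* Reserve address space only; MAP_NORESERVE avoids charging swap up front. */
    char *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (pool == MAP_FAILED) { perror("mmap"); return 1; }

    pool[0] = 1;                 /* touching a page makes it resident */

    /* "Free" the first 16 MiB back to the OS without unmapping it:
     * the pages are dropped and will be zero-filled on next access. */
    if (madvise(pool, 16UL << 20, MADV_DONTNEED) != 0)
        perror("madvise");

    /* The address range is still valid; reuse it as if freshly mapped. */
    pool[0] = 2;

    munmap(pool, POOL_SIZE);
    return 0;
}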

mmap: will the mapped file be loaded into memory immediately?

From the manual, I just know that mmap() maps a file to a virtual address space, so the file can be randomly accessed. But it is unclear to me whether the mapped file is loaded into memory immediately. I guess that the kernel manages the mapped memory in pages and loads them on demand; if I only do a few reads and writes, only a few pages are loaded. Is that correct?
No, yes, maybe. It depends.
Calling mmap generally only means that to your application, the mapped file's contents are mapped to its address space as if the file was loaded there. Or, as if the file really existed in memory, as if they were one and the same (which includes changes being written back to disk, assuming you have write access).
No more, no less. It has no notion of loading something, nor does the application know what this means.
An application does not truly have knowledge of any such thing as memory, although the virtual memory system makes it appear like that. The memory that an application can "see" (and access) may or may not correspond to actual physical memory, and this can in principle change at any time, without prior warning, and without an obvious reason (obvious to your application).
Other than possibly experiencing a small delay due to a page fault, an application is (in principle) entirely unaware of any such thing happening and has little or no control over it1.
Applications will, generally, load pages from mapped files (including the main executable!) on demand, as a consequence of encountering a fault. However, an operating system will usually try to speculatively prefetch data to optimize performance.
In practice, calling mmap will immediately begin to (asynchronously) prefetch pages from the beginning of the mapping, up to a certain implementation-specified size. Which means, in principle, for small files the answer would be "yes", and for larger files it would be "no".
However, mmap does not block to wait for completion of the readahead, which means that you have no guarantee that any of the file is in RAM immediately after mmap returns (not that you have that guarantee at any time anyway!). Insofar, the answer is "maybe".
Under Linux, last time I looked, the default prefetch size was 31 blocks (~127k) -- but this may have changed, plus it's a tuneable parameter. As pages near or at the end of the prefetched area are touched, more pages are being prefetched asynchronously.
If you have hinted MADV_RANDOM to madvise, prefetching is "less likely to happen", under Linux this completely disables prefetch.
On the other hand, giving the MADV_SEQUENTIAL hint will asynchronously prefetch "more aggressively" beginning from the beginning of the mapping (and may discard accessed pages quicker). Under Linux, "more aggressively" means twice the normal amount.
Giving the MADV_WILLNEED hint suggests (but does not guarantee) that all pages in the given range are loaded as soon as possible (since you're saying you're going to access them). The OS may ignore this, but under Linux it is treated more as an order than a hint, up to the process's maximum RSS limit and an implementation-specified limit (if I remember correctly, 1/2 the amount of physical RAM).
Note that MADV_DONTNEED is arguably implemented wrongly under Linux. The hint is not interpreted in the way specified by POSIX, i.e. you're OK with pages being paged out for the moment, but rather that you mean to discard them. Which makes no big difference for readonly mapped pages (other than a small delay, which you said would be OK), but it sure does matter for everything else.
In particular, using MADV_DONTNEED thinking Linux will release unneeded pages after the OS has written them lazily to disk is not how things work! You must explicitly sync, or prepare for a surprise.
Having called readahead on the file descriptor prior to calling mmap (or alternatively, having had read/written the file previously), the file's contents will in practice indeed be in RAM immediately.
This is, however, only an implementation detail (unified virtual memory system), and subject to memory pressure on the system.
Calling mlock will -- assuming it succeeds2 -- immediately load the requested pages into RAM. It blocks until all pages are physically present, and you have the guarantee that the pages will stay in RAM until you unlock them.
1 There exists functionality to query (mincore) whether any or all of the pages in a particular range are actually present at the very moment, functionality to hint the OS about what you would like to see happen without any hard guarantees (madvise), and finally functionality to force a limited subset of pages to be present in memory (mlock) for privileged processes.
2 It might not, both for lack of privileges and for exceeding quotas or the amount of physical RAM present.
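To make the footnotes concrete, here is a rough, Linux-flavored sketch (error handling trimmed) that maps a file, hints MADV_WILLNEED, and then uses mincore() to see how many pages happen to be resident at that moment; none of this is guaranteed behavior, it only hints and queries:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) { perror(argv[1]); return 1; }

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    madvise(p, len, MADV_WILLNEED);          /* hint: please start reading this in */

    long pagesize = sysconf(_SC_PAGESIZE);
    size_t npages = (len + pagesize - 1) / pagesize;
    unsigned char *vec = malloc(npages);

    if (vec && mincore(p, len, vec) == 0) {  /* which pages are resident right now? */
        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;
        printf("%zu of %zu pages resident\n", resident, npages);
    }

    free(vec);
    munmap(p, len);
    close(fd);
    return 0;
}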
Yes, mmap creates a mapping. It does not normally read the entire content of whatever you have mapped into memory. If you wish to do that you can use the mlock/mlockall system call to force the kernel to read into RAM the content of the mapping, if applicable.
By default, mmap() only configures the mapping and returns (quickly).
Linux (at least) has the option MAP_POPULATE (see 'man mmap') that does exactly what your question is about.
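A tiny sketch of that flag, assuming fd and len come from open()/fstat(); MAP_POPULATE asks the kernel to prefault the mapping up front (Linux-specific):

#define _GNU_SOURCE            /* MAP_POPULATE is Linux-specific */
#include <stddef.h>
#include <sys/mman.h>

/* Map a file read-only and ask the kernel to prefault all of its pages. */
static void *map_populated(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}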
Yes. The whole point of mmap is that it manages memory more efficiently than just slurping everything into memory.
Of course, any given implementation may in some situations decide that it's more efficient to read in the whole file in one go, but that should be transparent to the program calling mmap.

Does madvise(___, ___, MADV_DONTNEED) instruct the OS to lazily write to disk?

Hypothetically, suppose I want to perform sequential writing to a potentially very large file.
If I mmap() a gigantic region and madvise(MADV_SEQUENTIAL) on that entire region, then I can write to the memory in a relatively efficient manner. This I have gotten to work just fine.
Now, in order to free up various OS resources as I am writing, I occasionally perform a munmap() on small chunks of memory that have already been written to. My concern is that munmap() and msync() will block my thread, waiting for the data to be physically committed to disk. I cannot slow down my writer at all, so I need to find another way.
Would it be better to use madvise(MADV_DONTNEED) on the small, already-written chunk of memory? I want to tell the OS to write that memory to disk lazily, and not to block my calling thread.
The manpage on madvise() has this to say, which is rather ambiguous:
MADV_DONTNEED
Do not expect access in the near future. (For the time being, the
application is finished with the given range, so the kernel can free
resources associated with it.) Subsequent accesses of pages in this
range will succeed, but will result either in re-loading of the memory
contents from the underlying mapped file (see mmap(2)) or
zero-fill-on-demand pages for mappings without an underlying file.
No!
For your own good, stay away from MADV_DONTNEED. Linux will not take this as a hint to throw pages away after writing them back, but to throw them away immediately. This is not considered a bug, but a deliberate decision.
Ironically, the reasoning is that the functionality of a non-destructive MADV_DONTNEED is already given by msync(MS_INVALIDATE|MS_ASYNC). MS_ASYNC, on the other hand, does not start I/O (in fact, it does nothing at all, following the reasoning that dirty page writeback works fine anyway), fsync always blocks, and sync_file_range may block if you exceed some obscure limit and is considered "extremely dangerous" by the documentation, whatever that means.
Either way, you must msync(MS_SYNC), or fsync (both blocking), or sync_file_range (possibly blocking) followed by fsync, or you will lose data with MADV_DONTNEED. If you cannot afford to possibly block, you have no choice, sadly, but to do this in another thread.
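In other words, something like the following sketch, where chunk and chunk_len are assumed to be page-aligned and to lie inside the MAP_SHARED mapping; since msync(MS_SYNC) blocks, this would have to run in a helper thread if the writer must never stall:

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Flush a finished chunk to disk, then drop its pages. */
static int retire_chunk(void *chunk, size_t chunk_len)
{
    if (msync(chunk, chunk_len, MS_SYNC) != 0) {          /* make sure the data is on disk */
        perror("msync");
        return -1;
    }
    if (madvise(chunk, chunk_len, MADV_DONTNEED) != 0) {  /* now it is safe to discard */
        perror("madvise");
        return -1;
    }
    return 0;
}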
For recent Linux kernels (just tested on Linux 5.4), MADV_DONTNEED works as expected when the mapping is NOT private, e.g. mmap without the MAP_PRIVATE flag. I'm not sure what the behavior is on previous versions of the Linux kernel.
From release 4.15 of the Linux man-pages project's madvise manpage:
After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either repopulating the memory contents from the up-to-date contents of the underlying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings.
In Linux 4.5, Linux added a new flag, MADV_FREE, with the same behavior as in BSD systems:
it just marks pages as available to be freed if needed, but doesn't free them immediately, making it possible to reuse the memory range without incurring the cost of faulting the pages in again.
For why MADV_DONTNEED on a private mapping may result in zero-filled pages upon future access, watch Bryan Cantrill's rant, as mentioned in the comments on Damon's answer. Spoiler: it comes from Tru64 UNIX.
As already mentioned, MADV_DONTNEED is not your friend. Since Linux 5.4, you can use MADV_COLD to tell the kernel it should page out that memory when there is memory pressure. This seems to be exactly what is wanted in this situation.
Read more here:
https://lwn.net/Articles/793462/
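A minimal sketch of using MADV_COLD on an already-written, page-aligned chunk (the fallback #define is an assumption for building against older userspace headers; the constant is 20 in the kernel UAPI):

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 20    /* Linux >= 5.4; fallback for older headers (assumption) */
#endif

/* Tell the kernel this chunk is cold: it may be reclaimed (and dirty pages
 * written back) under memory pressure, without discarding the data. */
static int mark_cold(void *chunk, size_t len)
{
    return madvise(chunk, len, MADV_COLD);
}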
First, MADV_SEQUENTIAL enables aggressive readahead, so you don't need it.
Second, the OS will lazily write dirty file-backed memory to disk anyway, even if you do nothing; MADV_DONTNEED, however, will instruct it to free the memory immediately (what you call "various OS resources"). Third, it is not clear that mmapping files for sequential writing has any advantage. You will probably be better served by just using write(2) (but use buffers - either manual or stdio).
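For comparison, here is a minimal sketch of the plain buffered-write route suggested above, using stdio with a made-up record size and buffer size:

#include <stdio.h>

/* Sequential writing through stdio's userspace buffer instead of mmap. */
int main(void)
{
    FILE *f = fopen("out.bin", "wb");
    if (!f) { perror("fopen"); return 1; }

    setvbuf(f, NULL, _IOFBF, 1 << 20);        /* assumed 1 MiB buffer; optional */

    char record[4096] = {0};
    for (int i = 0; i < 1000; i++)
        fwrite(record, sizeof record, 1, f);  /* buffered; write(2) issued in big chunks */

    fclose(f);                                /* flushes the remaining buffer */
    return 0;
}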

Block based storage

I would like to store a couple of entries in a file (optimized for reading), and a good data structure for that seems to be a B+ tree. It offers O(log(n)/log(b)) access time, where b is the number of entries in one block.
There are many papers etc. describing B+ trees, but I still have some trouble understanding block-based storage systems in general. Maybe someone can point me in the right direction or answer a couple of questions:
Do (all common) file systems create new files at the beginning of a new block? So, can I be sure that seek(0) will set the read/write head to a multiple of the device's block size?
Is it right that I should only use calls like pread(fd, buf, n * BLOCK_SIZE, p * BLOCK_SIZE) (with n and p being integers) to ensure that I always read full blocks?
Is it better to read() BLOCK_SIZE bytes into an array or mmap() those instead? Or is there only a difference if I mmap many blocks and access only a few? What is better?
Should I try to avoid keys spanning multiple blocks by adding padding bytes at the end of each block? Should I do the same for the leaf nodes, by adding padding bytes between the data too?
Many thanks,
Christoph
In general, file systems create new files at the beginning of a new block because that is how the underlying device works. Hard disks are block devices and thus cannot handle anything less than a "block" or "sector". Additionally, operating systems treat memory and memory mappings in terms of pages, which are usually even larger (sectors are often 512 or 1024 bytes, pages usually 4096 bytes).
One exception to this rule that comes to mind would be ReiserFS, which puts small files directly into the filesystem structure (which, if I remember right, is incidentally a B+ tree!). For very small files this can actually be a viable optimization, since the data is already in RAM without another seek, but it can equally be an anti-optimization, depending on the situation.
It does not really matter, because the operating system will read data in units of full pages (normally 4kB) into the page cache anyway. Reading one byte will transfer 4kB and return a byte, reading another byte will serve you from the page cache (if it's the same page or one that was within the readahead range).
read is implemented by copying data from the page cache, whereas mmap simply remaps the pages into your address space (possibly marking them copy-on-write, depending on your protection flags). Therefore, mmap will always be at least as fast and usually faster. mmap is more comfortable too, but has the disadvantage that it may block at unexpected times when it needs to fetch more pages that are not in RAM (though that is generally true for any application or data that is not locked into memory). read, on the other hand, blocks when you tell it to, not otherwise.
The same is true under Windows with the exception that memory mapped files under pre-Vista Windows don't scale well under high concurrency, as the cache manager serializes everything.
Generally one tries to keep data compact, because less data means fewer pages, and fewer pages means higher likelihood they're in the page cache and fit within the readahead range. Therefore I would not add padding, unless it is necessary for other reasons (alignment).
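To make the read-vs-mmap comparison concrete, here is a rough sketch of both ways to fetch node p of the tree, assuming a made-up BLOCK_SIZE that is a multiple of the page size and a file_len obtained from fstat():

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 4096   /* assumed B+ tree node size (a multiple of the page size) */

/* Option 1: pread() copies node p out of the page cache into a caller buffer. */
static int read_node(int fd, size_t p, void *buf)
{
    return pread(fd, buf, BLOCK_SIZE, (off_t)(p * BLOCK_SIZE)) == BLOCK_SIZE ? 0 : -1;
}

/* Option 2: map the whole file once; nodes are then plain pointers into the
 * mapping, and the kernel faults their pages in on demand. */
static void *map_tree(int fd, size_t file_len)
{
    void *base = mmap(NULL, file_len, PROT_READ, MAP_SHARED, fd, 0);
    return base == MAP_FAILED ? NULL : base;
}

static const void *node_ptr(const void *base, size_t p)
{
    return (const char *)base + p * BLOCK_SIZE;
}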
Filesystems which support delayed allocation don't create new files anywhere on disc. Lots of newer filesystems support packing very small files into their own pages or sharing them with metadata (For example, reiser puts very tiny files into the inode?). But for larger files, mostly, yes.
You can do this, but the OS page cache will always read an entire block in, and just copy the bits you requested into your app's memory.
It depends on whether you're using direct IO or non-direct IO.
If you're using direct IO, which bypasses the OS's cache, you don't use mmap. Most databases do not use mmap and use direct IO.
Direct IO means that the pages don't go through the OS's page cache, they don't get cached at all by the OS and don't push other blocks out of the OS cache. It also means that all reads and writes need to be done on block boundaries. Block boundaries can sometimes be determined by a statfs call on the filesystem.
Most databases seem to take the view that they should manage their own page cache themselves, and use the OS only for physical reads/writes. Therefore they typically use direct and synchronous IO.
Linus Torvalds famously disagrees with this approach. I think the vendors really do it to achieve better consistency of behaviour across different OSs.
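For illustration, here is a rough sketch of block-aligned direct IO (Linux-flavored; the file name comes from the command line and the rest is an assumption, not taken from any particular database): statfs() reports the filesystem block size, posix_memalign() provides a correspondingly aligned buffer, and O_DIRECT bypasses the page cache. Note that O_DIRECT's exact alignment requirements depend on the filesystem and device.

#define _GNU_SOURCE            /* O_DIRECT is Linux-specific */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/vfs.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    struct statfs sfs;
    if (statfs(argv[1], &sfs) != 0) { perror("statfs"); return 1; }
    size_t bsize = (size_t)sfs.f_bsize;            /* filesystem block size */

    int fd = open(argv[1], O_RDONLY | O_DIRECT);   /* bypass the page cache */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, bsize, bsize) != 0) { /* O_DIRECT needs aligned buffers */
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    ssize_t n = pread(fd, buf, bsize, 0);          /* aligned offset and length */
    printf("read %zd bytes (block size %zu)\n", n, bsize);

    free(buf);
    close(fd);
    return 0;
}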
Yes. Doing otherwise would cause unnecessary complications in FS design.
And the options (as an alternative to "only") are ...?
In Windows, memory-mapped files work faster than the file API (ReadFile). I guess on Linux it's the same, but you can conduct your own measurements.
