State after mremap Failure - linux

When mremap fails, is the old mapping still valid and usable? I mmap a file similar to a database in that it has blocks of data and free lists. When I run out of blocks, I need to grow the file. At this point I do this by calling ftruncate with the new size, then mremap to remap the entire file with the additional space. I have not been able to find out what happens if mremap fails and don't have a particularly good way to test on the embedded platform where this is running.
Note I am using the flag MREMAP_MAYMOVE and am not specifying a new address.
I apologize if this has an answer somewhere, but I sure can't seem to find it in any man pages or other documentation online.

Related

If I private-`mmap` a file and read it, then another process writes to the same file, will another read at the same location return the same value?

(Context: I'm trying to establish which sequences of mmap operations are safe from the "memory safety" point of view, i.e. what assumptions I can make about mmaped memory without risking security bugs as a consequence of undefined behaviour, or miscompiles due to compilers making incorrect assumptions about how memory could behave. I'm currently working on Linux but am hoping to port the program to other operating systems in the future, so although I'm primarily interested in Linux, answers about how other operating systems behave would also be appreciated.)
Suppose I map a portion into file into memory using mmap with MAP_PRIVATE. Now, assuming that the file doesn't change while I have it mapped, if I access part of the returned memory, I'll be given information from the file at that offset; and (because I used MAP_PRIVATE) if I write to the returned memory, my writes will persist in my process's memory but will have no effect on the underlying file.
However, I'm interested in what will happen if the file does change while I have it mapped (because some other process also has the file open and is writing to it). There are several cases that I know the answers to already:
If I map the file with MAP_SHARED, then if any other process writes to the file via a shared mmap, my own process's memory will also be updated. (This is the intended behaviour of MAP_SHARED, as one of its intended purposes is for shared-memory concurrency.) It's less clear what will happen if another process writes to the file via other means, but I'm not interested in that case.
If the following sequence of events occurs:
I map the file with MAP_PRIVATE;
A portion of the file I haven't accessed yet is written by another process;
I read that portion of the file via my mapping;
then, at least on Linux, the read might return either the old value or the new value:
It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
— man 2 mmap on Linux
(This case – which is not the case I'm asking about – is covered in this existing StackOverflow question.)
I also checked the POSIX definition of mmap, but (unless I missed it) it doesn't seem to cover this case at all, leaving it unclear whether all POSIX systems would act the same way.
Linux's behaviour makes sense here: at the time of the access, the kernel might have already mapped the requested part of the file into memory, in which case it doesn't want to change the portion that's already there, but it might need to load it from disk, in which case it will see any new value that may have been written to the file since it was opened. So there are performance reasons to use the new value in some cases and the old value in other cases.
If the following sequence of events occurs:
I map the file with MAP_PRIVATE;
I write to a memory address within the file mapping;
Another process changes that part of the file;
then although I don't know this for certain, I think it's very likely that the rule is that the memory address in question continues to reflect the old value, that was written by our process. The reason is that the kernel needs to maintain two copies of that part of the file anyway: the values as seen by our process (which, because it used MAP_PRIVATE, can write to its view of the file without changing the underlying file), and the values that are actually in the file on disk. Writes by other processes obviously need to change the second copy here, so it would be bizarre to also change the first copy; doing so would make the interface less usable and also come at a performance cost, and would have no advantages.
There is one sequence of events, though, where I don't know what happens (and for which the behaviour is hard to determine experimentally, given the number of possible factors that might be relevant):
I map the file with MAP_PRIVATE;
I read some portion of the file via the mapping, without writing;
Another process changes part of the file that I just read;
I read the same portion of the file via the mapping, again.
In this situation, am I guaranteed to read the same data twice? Or is it possible to read the old data the first time and the new data the second time?

How to safely use mmap() for reading?

I have a need to do a lot of random-access reads in a large file so I use mmap(). This solution seems to be perfect as long as the mapped file is untouched. But this is not always the case. If the mapped file is tampered with, several problems arise:
If a change to a file reduces its length or the file becomes inaccessible then a process in order to live must handle SIGBUS signal (at least, in Linux implementation). This adds additional complications since I'm writing a library.
To make things even worse, mmap() manpage says it is unspecified if changes to the original
file are propagated to the memory. So they can very well be propagated.
This essentially means the contents of the file I work with can become white noise at any moment.
Does all of this mean that any program that maps a freely accessible file and does not handle these problems can be brought down by a DoS attack? Even while I do not expect evil hackers to go after my program, I can easily see a user modifying my mapped file, replacing it with another one or making the file inaccessible by, for example, removing a USB drive. And while I can write a signal handler (and this is a bit messy, so I am looking for a better solution) to solve the first problem,
I have no idea how to solve the second one.
The file can not be copied and can be freely moved around if it's not used by a program (just like any other media file). Linux file locks do not always work.
So, how to safely use mmap() for reading?

Throw away dirty copy-on-write pages

Is there a way that I can create a copy-on-write mapping via MAP_PRIVATE, write some data (ie, dirtying some pages), and then discard my changes, without using munmap and re-mmaping? The goal is to maintain the same virtual address for the given mapping (something not guaranteed to happen if I unmap & then mmap the same file again), but to discard all of my COW changes at once.
My understanding is that attempting to re-map the space via hinting the address and using the MAP_FIXED flag may have this effect; however I'm not sure if my interpretation of the MAP_FIXED docs is correct, or if this behaviour is guaranteed.
To quote from the mmap(2) docs:
If the memory region specified by addr and len overlaps pages of any existing
mapping(s), then the overlapped part of the existing mapping(s) will be
discarded.
Does "discarded" in this case mean that any COW pages will be thrown away, and new reads from the corresponding pages will fault and reflect changes on disk?
If you perform a mmap operation which overlaps with existing mappings, the Linux kernel will turf the overlapping part of the existing mappings as if an unmap had been done on them first. So for instance if you map a frame buffer where a shared library used to be, that memory now has nothing to do with the shared library; it points to the frame buffer.
The underlying page object from the removed mapping lives on independently of the mapping: pages are reference counted objects. When two maps share a view of the same page, it's simply due to the same page being "installed" in the different views. When a page is made dirty, and then unmapped, this does not create a dependency whereby the dirty page must be written out prior to the new mapping; the virtual memory can already be re-assigned to a new mapping (such as a piece of a graphics frame buffer) before the original dirty page (for example, part of a file-backed shared mapping) is flushed out.
About throwing away a mapping: I don't think you can do this. That is to say, if you have a mapping which is supposed to flush dirty pages to an underlying file, you cannot write to that memory and then unmap it quickly (or mmap something over it) in hopes that the write is never done. In Linux's madvise API, there is MAP_REMOVE operation which seems relevant, but according to the manual page, it seems to only work on tmpfs and shmfs. I think the only way to block the write from happening would be to do the time-honored ritual known as "dive for the power switch".
There is a way to map a file object such that changes are not propagated: namely, MAP_PRIVATE (opposite to MAP_SHARED). MAP_PRIVATE is needed, for instance, by debuggers like gdb which need to be able to put a breakpoint into an executable or shared library, without throwing a trap instruction into every instance of that executable or library in every running process (and the copy on disk!).
If you have a MAP_PRIVATE with modified parts, and you unmap it (or those parts) or map something over top of them, I believe they will be discarded. Those pages should have been subject to copy-on-write, and so the process which made them dirty should hold the one and only reference. When they are unmapped, their refcount drops to zero and since they are private pages, they get turfed.
The virtual address remains the same even after the copy. Only the physical address changes (and the associated memory mapping page registers).
After the process has written to the page, it is too late to undo it. The copy occurs during the first write to the memory region.

How to portably extend a file accessed using mmap()

We're experimenting with changing SQLite, an embedded database system,
to use mmap() instead of the usual read() and write() calls to access
the database file on disk. Using a single large mapping for the entire
file. Assume that the file is small enough that we have no trouble
finding space for this in virtual memory.
So far so good. In many cases using mmap() seems to be a little faster
than read() and write(). And in some cases much faster.
Resizing the mapping in order to commit a write-transaction that
extends the database file seems to be a problem. In order to extend
the database file, the code could do something like this:
ftruncate(); // extend the database file on disk
munmap(); // unmap the current mapping (it's now too small)
mmap(); // create a new, larger, mapping
then copy the new data into the end of the new memory mapping.
However, the munmap/mmap is undesirable as it means the next time each
page of the database file is accessed a minor page fault occurs and
the system has to search the OS page cache for the correct frame to
associate with the virtual memory address. In other words, it slows
down subsequent database reads.
On Linux, we can use the non-standard mremap() system call instead
of munmap()/mmap() to resize the mapping. This seems to avoid the
minor page faults.
QUESTION: How should this be dealt with on other systems, like OSX,
that do not have mremap()?
We have two ideas at present. And a question regarding each:
1) Create mappings larger than the database file. Then, when extending
the database file, simply call ftruncate() to extend the file on
disk and continue using the same mapping.
This would be ideal, and seems to work in practice. However, we're
worried about this warning in the man page:
"The effect of changing the size of the underlying file of a
mapping on the pages that correspond to added or removed regions of
the file is unspecified."
QUESTION: Is this something we should be worried about? Or an anachronism
at this point?
2) When extending the database file, use the first argument to mmap()
to request a mapping corresponding to the new pages of the database
file located immediately after the current mapping in virtual
memory. Effectively extending the initial mapping. If the system
can't honour the request to place the new mapping immediately after
the first, fall back to munmap/mmap.
In practice, we've found that OSX is pretty good about positioning
mappings in this way, so this trick works there.
QUESTION: if the system does allocate the second mapping immediately
following the first in virtual memory, is it then safe to eventually
unmap them both using a single big call to munmap()?
2 will work but you don't have to rely on the OS happening to have space available, you can reserve your address space beforehand so your fixed mmapings will always succeed.
For instance, To reserve one gigabyte of address space. Do a
mmap(NULL, 1U << 30, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
Which will reserve one gigabyte of continuous address space without actually allocating any memory or resources. You can then perform future mmapings over this space and they will succeed. So mmap the file into the beginning of the space returned, then mmap further sections of the file as needed using the fixed flag. The mmaps will succeed because your address space is already allocated and reserved by you.
Note: linux also has the MAP_NORESERVE flag which is the behavior you would want for the initial mapping if you were allocating RAM, but in my testing it is ignored as PROT_NONE is sufficient to say you don't want any resources allocated yet.
I think #2 is the best currently available solution. In addition to this, on 64bit systems you may create your mapping explicitly at an address that OS would never choose for an mapping (for example 0x6000 0000 0000 0000 in Linux) to avoid the case that OS cannot place the new mapping immediatly after the first one.
It is always safe to unmap mutiple mappinsg with a single munmap call. You can even unmap a part of the mapping if you wish to do so.
Use fallocate() instead of ftruncate() where available. If not, just open file in O_APPEND mode and increase file by writing some amount of zeroes. This greatly reduce fragmentation.
Use "Huge pages" if available - this greatly reduce overhead on big mappings.
pread()/pwrite()/pwritev()/preadv() with not-so-small block size is not slow really. Much faster than IO can actually be performed.
IO errors when using mmap() will generate just segfault instead of EIO or so.
The most of SQLite WRITE performance problems is concentrated in good transactional use (i.e. you should debug when COMMIT actually performed).

How to tell Linux that a mmap()'d page does not need to be written to swap if the backing physical page is needed?

Hopefully the title is clear. I have a chunk of memory obtained via mmap(). After some time, I have concluded that I no longer need the data within this range. I still wish to keep this range, however. That is, I do not want to call mummap(). I'm trying to be a good citizen and not make the system swap more than it needs.
Is there a way to tell the Linux kernel that if the given page is backed by a physical page and if the kernel decides it needs that physical page, do not bother writing that page to swap?
I imagine under the hood this magical function call would destroy any mapping between the given virtual page and physical page, if present, without writing to swap first.
Your question (as stated) makes no sense.
Let's assume that there was a way for you to tell the kernel to do what you want.
Let's further assume that it did need the extra RAM, so it took away your page, and didn't swap it out.
Now your program tries to read that page (since you didn't want to munmap the data, presumably you might try to access it). What is the kernel to do? The choices I see:
it can give you a new page filled with 0s.
it can give you SIGSEGV
If you wanted choice 2, you could achieve the same result with munmap.
If you wanted choice 1, you could mremap over the existing mapping with MAP_ANON (or munmap followed by new mmap).
In either case, you can't depend on the old data being there when you need it.
The only way your question would make sense is if there was some additional mechanism for the kernel to let you know that it is taking away your page (e.g. send you a special signal). But the situation you described is likely rare enough to warrant additional complexity.
EDIT:
You might be looking for madvise(..., MADV_DONTNEED)
You could munmap the region, then mmap it again with MAP_NORESERVE
If you know at initial mapping time that swapping is not needed, use MAP_NORESERVE

Resources