If I private-`mmap` a file and read it, then another process writes to the same file, will another read at the same location return the same value? - linux

(Context: I'm trying to establish which sequences of mmap operations are safe from the "memory safety" point of view, i.e. what assumptions I can make about mmaped memory without risking security bugs as a consequence of undefined behaviour, or miscompiles due to compilers making incorrect assumptions about how memory could behave. I'm currently working on Linux but am hoping to port the program to other operating systems in the future, so although I'm primarily interested in Linux, answers about how other operating systems behave would also be appreciated.)
Suppose I map a portion of a file into memory using mmap with MAP_PRIVATE. Now, assuming that the file doesn't change while I have it mapped, if I access part of the returned memory, I'll be given information from the file at that offset; and (because I used MAP_PRIVATE) if I write to the returned memory, my writes will persist in my process's memory but will have no effect on the underlying file.
However, I'm interested in what will happen if the file does change while I have it mapped (because some other process also has the file open and is writing to it). There are several cases that I know the answers to already:
If I map the file with MAP_SHARED, then if any other process writes to the file via a shared mmap, my own process's memory will also be updated. (This is the intended behaviour of MAP_SHARED, as one of its intended purposes is for shared-memory concurrency.) It's less clear what will happen if another process writes to the file via other means, but I'm not interested in that case.
If the following sequence of events occurs:
I map the file with MAP_PRIVATE;
A portion of the file I haven't accessed yet is written by another process;
I read that portion of the file via my mapping;
then, at least on Linux, the read might return either the old value or the new value:
It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
— man 2 mmap on Linux
(This case – which is not the case I'm asking about – is covered in this existing StackOverflow question.)
I also checked the POSIX definition of mmap, but (unless I missed it) it doesn't seem to cover this case at all, leaving it unclear whether all POSIX systems would act the same way.
Linux's behaviour makes sense here: at the time of the access, the kernel might have already mapped the requested part of the file into memory, in which case it doesn't want to change the portion that's already there, but it might need to load it from disk, in which case it will see any new value that may have been written to the file since it was opened. So there are performance reasons to use the new value in some cases and the old value in other cases.
If the following sequence of events occurs:
I map the file with MAP_PRIVATE;
I write to a memory address within the file mapping;
Another process changes that part of the file;
then although I don't know this for certain, I think it's very likely that the rule is that the memory address in question continues to reflect the old value, that was written by our process. The reason is that the kernel needs to maintain two copies of that part of the file anyway: the values as seen by our process (which, because it used MAP_PRIVATE, can write to its view of the file without changing the underlying file), and the values that are actually in the file on disk. Writes by other processes obviously need to change the second copy here, so it would be bizarre to also change the first copy; doing so would make the interface less usable and also come at a performance cost, and would have no advantages.
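A minimal sketch for checking this second sequence experimentally. Several things here are assumptions made for illustration: the function name is made up, a temporary file from mkstemp stands in for the mapped file, and a pwrite from the same process stands in for the other process's write. The result only describes what the running kernel does; it is not a specification.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical experiment: returns 1 if the byte written through the private
 * mapping is still visible after the file itself is changed, 0 if it is not,
 * -1 on setup error. */
int private_write_survives(void)
{
    char path[] = "/tmp/mmap_cow_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                          /* file goes away once fd is closed */

    int result = -1;
    if (pwrite(fd, "A", 1, 0) == 1) {
        volatile char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            p[0] = 'X';                    /* write to the private mapping */
            char on_disk = 0;
            /* the "other process" changes the same part of the file */
            if (pwrite(fd, "B", 1, 0) == 1 && pread(fd, &on_disk, 1, 0) == 1) {
                assert(on_disk == 'B');    /* the private write never reached the file */
                result = (p[0] == 'X');
            }
            munmap((void *)p, 1);
        }
    }
    close(fd);
    return result;
}
```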
There is one sequence of events, though, where I don't know what happens (and for which the behaviour is hard to determine experimentally, given the number of possible factors that might be relevant):
I map the file with MAP_PRIVATE;
I read some portion of the file via the mapping, without writing;
Another process changes part of the file that I just read;
I read the same portion of the file via the mapping, again.
In this situation, am I guaranteed to read the same data twice? Or is it possible to read the old data the first time and the new data the second time?
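The sequence in question can be sketched as a small experiment. The function name is hypothetical, a mkstemp temporary file stands in for the real file, and a pwrite in the same process stands in for the other process. Whatever it returns on one machine is only an observation about that kernel and filesystem, not a guarantee.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical experiment: returns 1 if the second read through the private
 * mapping saw the new byte, 0 if it saw the old byte, -1 on setup error. */
int second_read_sees_new_data(void)
{
    char path[] = "/tmp/mmap_reread_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                              /* file goes away once fd is closed */

    int result = -1;
    if (pwrite(fd, "A", 1, 0) == 1) {
        /* volatile makes the compiler perform both loads from memory */
        volatile const char *p = mmap(NULL, 1, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            char first = p[0];                 /* read via the mapping, no write */
            assert(first == 'A');              /* nothing has changed the file yet */
            if (pwrite(fd, "B", 1, 0) == 1)    /* the file changes underneath */
                result = (p[0] == 'B');        /* read the same location again */
            munmap((void *)p, 1);
        }
    }
    close(fd);
    return result;
}
```

A return value of 1 would mean the two reads disagreed, which is exactly the situation a compiler or a memory-safety argument would have to account for.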

Related

How to prevent changes to the underlying file after mmap()-ing a file from being visible to my program?

According to mmap() manpage:
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
Question: How to prevent changes to the underlying file after mmap()-ing a file from being visible to my program?
Background: I am designing a data structure for a text editor designed to allow editing huge text files efficiently. The data structure is akin to an on-disk rope but with the actual strings being pointer to mmap()-ed ranges from the original file.
Since the file could be very large, there are a few restrictions around the design:
Must not load the entire file into RAM as the file may be larger than available physical RAM
Must not copy files on opening as this will make opening new files really slow
Must work on filesystems like ext4 that do not support copy-on-write (cp --reflink/ioctl_ficlone)
Must not rely on mandatory file locking, as this is deprecated, and requires specific mount option -o mand in the filesystem
As long as the changes aren't visible in my mmap(), it's ok for the underlying file to change on the filesystem
Only needs to support recent Linux, and using Linux-specific system APIs is OK
The data structure I'm designing would keep track of a list of unedited and edited ranges in the file by storing the start and end indexes of the ranges into the mmap()-ed buffer. While the user is browsing through the file, ranges of text that have never been modified by the user would be read directly from a mmap() of the original file, while a swap file would store the ranges of text that have been edited by the user but have not yet been saved.
When the user saves a file, the data structure would use copy_file_range to splice the swap file and the original file to assemble the new file. For this splicing to work, the original file as seen by my program must remain unchanged throughout the entire editing session.
Problem: The user may concurrently have other programs modifying the same file, possibly other text editors or other programs that modify the text file in place, after making unsaved changes in my text editor.
In such a situation, the editor can detect the external change using inotify, and then I want to give the user two options on how to continue:
discard all unsaved changes and re-read the file from disk, implementing this option is fairly straightforward
allow the user to continue editing the file and later on the user should be able to save the unsaved changes in a new location or to overwrite the changes that had been made by the other program, but implementing this seems tricky
Since my editor did not make a copy of the file when it opened it, when the other program overwrites the file, the text ranges that my data structure is tracking may become invalid, because the data on disk has changed and these changes are now visible through my mmap(). This means that if my editor tried to write unsaved changes after the file had been modified by another process, it could be splicing text ranges from the old file using data from the new file, and so could produce a corrupt file when saving the unsaved changes.
I don't think advisory locks would save the situation here in all cases, as other programs may not honor advisory locks.
My ideal solution would be that when another program overwrites the file, the system transparently copies the file, allowing my program to continue seeing the old version while the other program finishes its write to disk and makes its version visible in the filesystem. I think ioctl_ficlone could have made this possible, but to my understanding, this only works with a copy-on-write filesystem like btrfs.
Is such a thing possible?
Any other suggestions to solve this problem would also be welcome.
What you want to do isn't possible with mmap, and I'm not sure if it's possible at all with your constraints.
When you map a region, the kernel may or may not actually load all of it into memory. The region of memory that lacks data will actually contain an invalid page, so when you access it, the kernel takes a page fault and maps that region into memory. That region will likely contain whatever is in that portion of the file at the time the page fault occurs. There is an option, MAP_LOCKED, which tries to prefault all of the pages in, but doesn't guarantee it, so you can't rely on it working.
In general, you cannot prevent other processes from changing a file out from under you. Some tools (including editors) will write a new file to the side, calling rename to overwrite the file, and some will rewrite the file in place. The former is what you want, but many editors choose to do the latter, since it preserves characteristics such as ACLs and permissions you can't restore.
Furthermore, you really don't want to use mmap on any file you can't totally control, because if another process truncates the file and you try to access that portion of the buffer, your process will die with SIGBUS. Returning normally from a handler for this signal is undefined behavior, and the only sane thing to do is die. (Also, it can be sent in other situations, such as unaligned access, and you'll have a hard time distinguishing between them.)
Ultimately, if you're not interested in copying the file, you can't guarantee that someone won't change it underneath you, and you'll need to be prepared for that to occur.
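For comparison, here is a rough sketch of what "copying" looks like at the granularity of one range rather than the whole file. This is not something proposed above; snapshot_range is a made-up helper, and it is only meant to show that with pread a concurrent truncation shows up as a short read instead of a SIGBUS. It does nothing about the file changing before the copy is taken.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy len bytes starting at file offset off into a
 * freshly malloc'd buffer stored in *out (the caller frees it).
 * Returns the number of bytes actually copied, which is less than len if the
 * file ends early (for example after a truncation), or -1 on error. */
ssize_t snapshot_range(int fd, off_t off, size_t len, char **out)
{
    assert(out != NULL);
    char *buf = malloc(len ? len : 1);
    if (buf == NULL)
        return -1;

    size_t done = 0;
    while (done < len) {
        ssize_t n = pread(fd, buf + done, len - done, off + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted: retry the same read */
            free(buf);
            return -1;
        }
        if (n == 0)                  /* end of file: short result, no signal */
            break;
        done += (size_t)n;
    }
    *out = buf;
    return (ssize_t)done;
}
```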

How to safely use mmap() for reading?

I have a need to do a lot of random-access reads in a large file so I use mmap(). This solution seems to be perfect as long as the mapped file is untouched. But this is not always the case. If the mapped file is tampered with, several problems arise:
If a change to the file reduces its length, or the file becomes inaccessible, then the process must handle the SIGBUS signal in order to survive (at least in the Linux implementation). This adds additional complications since I'm writing a library.
To make things even worse, the mmap() manpage says it is unspecified whether changes to the original file are propagated to the memory. So they can very well be propagated.
This essentially means the contents of the file I work with can become white noise at any moment.
Does all of this mean that any program that maps a freely accessible file and does not handle these problems can be brought down by a DoS attack? Even while I do not expect evil hackers to go after my program, I can easily see a user modifying my mapped file, replacing it with another one or making the file inaccessible by, for example, removing a USB drive. And while I can write a signal handler (and this is a bit messy, so I am looking for a better solution) to solve the first problem, I have no idea how to solve the second one.
The file can not be copied and can be freely moved around if it's not used by a program (just like any other media file). Linux file locks do not always work.
So, how to safely use mmap() for reading?

How to portably extend a file accessed using mmap()

We're experimenting with changing SQLite, an embedded database system,
to use mmap() instead of the usual read() and write() calls to access
the database file on disk. Using a single large mapping for the entire
file. Assume that the file is small enough that we have no trouble
finding space for this in virtual memory.
So far so good. In many cases using mmap() seems to be a little faster
than read() and write(). And in some cases much faster.
Resizing the mapping in order to commit a write-transaction that
extends the database file seems to be a problem. In order to extend
the database file, the code could do something like this:
ftruncate(); // extend the database file on disk
munmap(); // unmap the current mapping (it's now too small)
mmap(); // create a new, larger, mapping
then copy the new data into the end of the new memory mapping.
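Spelled out as real calls, the three steps above might look like the following sketch. The function name, the read-write MAP_SHARED protection, and the assumption that the mapping always starts at file offset 0 are illustrative choices, not taken from the SQLite code.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: extend the file behind fd to new_size and replace the
 * whole-file mapping old_map (old_size bytes) with a larger one.
 * Returns the new mapping, or MAP_FAILED on error. */
void *grow_by_remapping(int fd, void *old_map, size_t old_size, size_t new_size)
{
    assert(new_size >= old_size);
    if (ftruncate(fd, (off_t)new_size) != 0)   /* extend the file on disk */
        return MAP_FAILED;
    if (munmap(old_map, old_size) != 0)        /* unmap the mapping that is now too small */
        return MAP_FAILED;
    /* Create a new, larger mapping. If this fails, the old mapping is
     * already gone, so the caller has nothing left to fall back on. */
    return mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```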
However, the munmap/mmap is undesirable as it means the next time each
page of the database file is accessed a minor page fault occurs and
the system has to search the OS page cache for the correct frame to
associate with the virtual memory address. In other words, it slows
down subsequent database reads.
On Linux, we can use the non-standard mremap() system call instead
of munmap()/mmap() to resize the mapping. This seems to avoid the
minor page faults.
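A sketch of the Linux-only variant, under the same illustrative assumptions as before (made-up function name, whole-file mapping). MREMAP_MAYMOVE lets the kernel relocate the mapping if it cannot grow in place, so the caller must use the returned pointer from then on.

```c
#define _GNU_SOURCE            /* mremap() and MREMAP_MAYMOVE are Linux-specific */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, Linux only: extend the file behind fd to new_size and
 * resize the existing mapping instead of unmapping and remapping it.
 * Returns the (possibly moved) mapping, or MAP_FAILED on error, in which
 * case the old mapping is left untouched. */
void *grow_with_mremap(int fd, void *old_map, size_t old_size, size_t new_size)
{
    assert(new_size >= old_size);
    if (ftruncate(fd, (off_t)new_size) != 0)   /* extend the file on disk first */
        return MAP_FAILED;
    return mremap(old_map, old_size, new_size, MREMAP_MAYMOVE);
}
```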
QUESTION: How should this be dealt with on other systems, like OSX,
that do not have mremap()?
We have two ideas at present. And a question regarding each:
1) Create mappings larger than the database file. Then, when extending
the database file, simply call ftruncate() to extend the file on
disk and continue using the same mapping.
This would be ideal, and seems to work in practice. However, we're
worried about this warning in the man page:
"The effect of changing the size of the underlying file of a
mapping on the pages that correspond to added or removed regions of
the file is unspecified."
QUESTION: Is this something we should be worried about? Or an anachronism
at this point?
2) When extending the database file, use the first argument to mmap()
to request a mapping corresponding to the new pages of the database
file located immediately after the current mapping in virtual
memory. Effectively extending the initial mapping. If the system
can't honour the request to place the new mapping immediately after
the first, fall back to munmap/mmap.
In practice, we've found that OSX is pretty good about positioning
mappings in this way, so this trick works there.
QUESTION: if the system does allocate the second mapping immediately
following the first in virtual memory, is it then safe to eventually
unmap them both using a single big call to munmap()?
Option 2 will work, but you don't have to rely on the OS happening to have space available: you can reserve your address space beforehand so your fixed mappings will always succeed.
For instance, to reserve one gigabyte of address space, do:
mmap(NULL, 1U << 30, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
This reserves one gigabyte of contiguous address space without actually allocating any memory or resources. You can then perform future mappings over this space and they will succeed. So mmap the file into the beginning of the space returned, then mmap further sections of the file as needed using the MAP_FIXED flag. The mmaps will succeed because your address space is already allocated and reserved by you.
Note: Linux also has the MAP_NORESERVE flag, which is the behavior you would want for the initial mapping if you were allocating RAM, but in my testing it is ignored, as PROT_NONE is sufficient to say you don't want any resources allocated yet.
I think #2 is the best currently available solution. In addition to this, on 64-bit systems you may create your mapping explicitly at an address that the OS would never choose for a mapping (for example 0x6000 0000 0000 0000 in Linux) to avoid the case where the OS cannot place the new mapping immediately after the first one.
It is always safe to unmap multiple mappings with a single munmap call. You can even unmap a part of a mapping if you wish to do so.
Use fallocate() instead of ftruncate() where available. If not, just open the file in O_APPEND mode and grow it by writing some amount of zeroes. This greatly reduces fragmentation.
Use "huge pages" if available; this greatly reduces overhead on big mappings.
pread()/pwrite()/preadv()/pwritev() with a not-so-small block size are not really slow: the calls themselves are much faster than the underlying I/O can actually be performed.
I/O errors when using mmap() are delivered as a signal (SIGBUS on Linux) instead of an error code such as EIO.
Most SQLite write performance problems come down to transaction usage (i.e. you should check when COMMIT is actually performed).
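As a small illustration of the first tip, the sketch below uses posix_fallocate(), the POSIX counterpart of the Linux-specific fallocate(); that substitution and the helper name are choices made here for illustration, and posix_fallocate() is itself not available on every system.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper: make sure the file behind fd is at least new_size
 * bytes long with its blocks allocated up front, instead of leaving a
 * sparse tail the way ftruncate() can.
 * Returns 0 on success or an error number on failure; posix_fallocate()
 * reports errors through its return value rather than through errno. */
int extend_file(int fd, off_t new_size)
{
    assert(new_size > 0);
    return posix_fallocate(fd, 0, new_size);
}
```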

Can file size be used to detect a partial append?

I'm thinking about ways for my application to detect a partially-written record after a program or OS crash. Since records are only ever appended to a file (never overwritten), is a crash while writing guaranteed to yield a file size that is shorter than it should be? Is this guaranteed even if the file was opened in read-write mode instead of append mode, so long as writes are always at the end of the file? This would greatly simplify crash recovery, since comparing the last record's expected size and position with the actual file size would be enough to detect a partial write.
I understand that random-access writes can be reordered by the filesystem, but I'm having trouble finding information on whether this can happen when appending. I imagine an out-of-order append would require the filesystem to create a "hole" at the tail of the (sparse) file, write blocks beyond the hole, and then fill in the blocks in between, but I'm hoping that such an approach would be so inefficient that nobody would ever implement their filesystem that way.
I suppose another problem might be a filesystem updating the directory entry's file size field before appending the new blocks to the file, and the OS crashing in between. Does this ever happen in practice? (ext4, perhaps?) Is there a quick way to detect it? (And what happens when trying to read the unwritten blocks that should exist according to the file's size?)
Is there anything else, such as write reordering performed by a disk/flash drive, that would get in the way of using file size as a way to detect a partial append? I don't expect to be able to compensate for this sort of drive trickery in my application, but it would be good to know about.
If you want to be SURE that you're never going to lose records, you need a consistent journaling or transactional system for your files.
There is absolutely no guarantee that a write will have been fulfilled unless you either set O_DIRECT [which you probably do not want to do], or you use markers to indicate that "this has been fully committed", which are only written when the file is closed. You can either do that in the main file, or, for example, have a file that records, externally, the "last written record". If you open and close that file, it should be safe as long as it is the application that is crashing; if the OS crashes [or is otherwise abruptly stopped, e.g. power cut, disk unplugged, etc.], all bets are off.
Write reordering and write caching is/can be done at all levels - the C library, the OS, the filesystem module and the hard disk/controller itself are all ABLE to reorder writes.
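One way the "marker" idea could look inside the main file is sketched below: each record is framed as length, payload, checksum, and recovery accepts only the prefix of records whose checksum matches. Everything specific here is an assumption for illustration: the record layout, the Adler-32-style checksum, the 4096-byte payload cap, the function names, and the use of fsync() after each append. As noted above, none of this protects against a drive that reorders or loses writes it has acknowledged.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_PAYLOAD 4096u   /* illustrative cap on one record's payload */

/* Adler-32-style checksum; a real format would more likely use a CRC. */
static uint32_t checksum(const unsigned char *p, uint32_t n)
{
    uint32_t a = 1, b = 0;
    for (uint32_t i = 0; i < n; i++) {
        a = (a + p[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

/* Append [length][payload][checksum] at the current file position and ask
 * the OS to flush it. Short writes are treated as errors in this sketch.
 * Returns 0 on success, -1 on error. */
int append_record(int fd, const void *data, uint32_t len)
{
    assert(len <= MAX_PAYLOAD);
    uint32_t sum = checksum(data, len);
    if (write(fd, &len, sizeof len) != (ssize_t)sizeof len ||
        write(fd, data, len) != (ssize_t)len ||
        write(fd, &sum, sizeof sum) != (ssize_t)sizeof sum)
        return -1;
    return fsync(fd);
}

/* Scan from the start and return the offset just past the last record that
 * is complete and has a matching checksum. Bytes beyond that offset belong
 * to a partial or damaged append. */
off_t valid_prefix(int fd)
{
    unsigned char buf[MAX_PAYLOAD];
    off_t pos = 0;
    for (;;) {
        uint32_t len, sum;
        if (pread(fd, &len, sizeof len, pos) != (ssize_t)sizeof len ||
            len > MAX_PAYLOAD)
            return pos;
        if (pread(fd, buf, len, pos + (off_t)sizeof len) != (ssize_t)len)
            return pos;
        if (pread(fd, &sum, sizeof sum, pos + (off_t)sizeof len + (off_t)len)
                != (ssize_t)sizeof sum)
            return pos;
        if (sum != checksum(buf, len))
            return pos;
        pos += (off_t)(sizeof len + sizeof sum) + (off_t)len;
    }
}
```

Unlike a bare file-size comparison, this also rejects a tail whose size looks right but whose blocks were never written, since a run of zero bytes does not carry a valid checksum.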

Reading file in Kernel Mode

I am building a driver and I want to read some files.
Is there any way to use ZwReadFile() or a similar function to read the contents of the files line by line so that I can process them in a loop?
The documentation on MSDN states:
ZwReadFile begins reading from the given ByteOffset or the current file position into the given Buffer. It terminates the read operation under one of the following conditions:
The buffer is full because the number of bytes specified by the Length parameter has been read. Therefore, no more data can be placed into the buffer without an overflow.
The end of file is reached during the read operation, so there is no more data in the file to be transferred into the buffer.
Thanks.
No, there is not. You'll have to create a wrapper to achieve what you want.
However, given that kernel mode code has the potential to crash the system rather than the process it runs in, you have to make sure that problems such as those known from usermode with very long lines etc will not cause issues.
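The line-splitting part of such a wrapper does not depend on any kernel API, so it can be sketched in plain C. In a driver, the chunks fed to it would come from successive ZwReadFile calls; that part is left out here. The struct, the function name and the MAX_LINE cap are all made up for illustration, and the cap is how this sketch deals with very long lines: the excess bytes are simply dropped.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LINE 256   /* illustrative cap: longer lines are truncated */

struct splitter {
    char   line[MAX_LINE];  /* bytes of the current line, not NUL-terminated */
    size_t len;             /* number of valid bytes in line */
    int    complete;        /* nonzero once line holds a finished line */
};

/* Consume bytes from chunk until a newline is seen or the chunk runs out.
 * Returns 1 when s->line[0 .. s->len) is a complete line, or 0 when more
 * input is needed. *consumed receives the number of bytes taken from chunk,
 * so the caller can call again with the rest of the same chunk.
 * At end of file, a nonzero s->len with s->complete == 0 is a final line
 * that had no trailing newline. */
int splitter_feed(struct splitter *s, const char *chunk, size_t n, size_t *consumed)
{
    assert(s != NULL && consumed != NULL);
    if (s->complete) {               /* the previous line was handed out */
        s->len = 0;
        s->complete = 0;
    }
    size_t i = 0;
    while (i < n) {
        char c = chunk[i++];
        if (c == '\n') {
            s->complete = 1;
            break;
        }
        if (s->len < MAX_LINE)
            s->line[s->len++] = c;   /* bytes beyond the cap are dropped */
    }
    *consumed = i;
    return s->complete;
}
```

Because the state lives in a fixed-size struct, memory use stays bounded no matter what the file contains, which matters more in kernel mode than in a user-mode process.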
If the amount of data is (and will stay) below the threshold of what registry values can hold, you should use that instead. In particular REG_MULTI_SZ which has the properties you are looking for ("line-wise" storage of data).
In this situation, unless performance is critical (e.g. real-time), I would pass the filtering to a user-mode service or application. Send the file name to the application to process. A user-mode application is easier to test and easier to debug. It won't blue-screen or hang your box either.

Resources