Can file size be used to detect a partial append? - linux

I'm thinking about ways for my application to detect a partially-written record after a program or OS crash. Since records are only ever appended to a file (never overwritten), is a crash while writing guaranteed to yield a file size that is shorter than it should be? Is this guaranteed even if the file was opened in read-write mode instead of append mode, so long as writes are always at the end of the file? This would greatly simplify crash recovery, since comparing the last record's expected size and position with the actual file size would be enough to detect a partial write.
I understand that random-access writes can be reordered by the filesystem, but I'm having trouble finding information on whether this can happen when appending. I imagine an out-of-order append would require the filesystem to create a "hole" at the tail of the (sparse) file, write blocks beyond the hole, and then fill in the blocks in between, but I'm hoping that such an approach would be so inefficient that nobody would ever implement their filesystem that way.
I suppose another problem might be a filesystem updating the directory entry's file size field before appending the new blocks to the file, and the OS crashing in between. Does this ever happen in practice? (ext4, perhaps?) Is there a quick way to detect it? (And what happens when trying to read the unwritten blocks that should exist according to the file's size?)
Is there anything else, such as write reordering performed by a disk/flash drive, that would get in the way of using file size as a way to detect a partial append? I don't expect to be able to compensate for this sort of drive trickery in my application, but it would be good to know about.

If you want to be SURE that you're never going to lose records, you need a consistent journaling or transactional system for your files.
There is absolutely no guarantee that a write has reached the disk unless you either force it there with O_SYNC or fsync() [O_DIRECT alone, which you probably do not want anyway, bypasses the page cache but does not give that guarantee], or use markers that say "this has been fully committed" and are only written when the file is closed. You can either do that in the main file or, for example, have a separate file that records, externally, the "last written record". If you open and close that file, it should be safe as long as it is the APP that crashes - if the OS crashes [or is otherwise abruptly stopped - e.g. power cut, disk unplugged, etc.], all bets are off.
Write reordering and write caching can happen at all levels - the C library, the OS, the filesystem module and the hard disk/controller itself are all ABLE to reorder writes.
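One way to make the "commit marker" idea concrete is to frame every record with its length and a checksum, so that recovery does not have to trust the file size at all. The following is only a minimal sketch under stated assumptions: the record layout, the names `append_record` and `recover`, and the choice of CRC32 are all illustrative and not taken from the question, and the whole file is read into memory for brevity.

```python
import os
import struct
import zlib

# Hypothetical on-disk layout: 4-byte payload length, 4-byte CRC32, then the payload.
LENGTH = struct.Struct("<I")
HEADER = struct.Struct("<II")


def _checksum(payload):
    # The CRC covers the length field too, so an all-zero header never validates.
    return zlib.crc32(LENGTH.pack(len(payload)) + payload)


def append_record(path, payload):
    """Append one framed record, then ask the kernel to push it to the device."""
    with open(path, "ab") as f:
        f.write(HEADER.pack(len(payload), _checksum(payload)) + payload)
        f.flush()               # Python buffer -> kernel page cache
        os.fsync(f.fileno())    # page cache -> storage device


def recover(path):
    """Return all valid records and truncate the file after the last good one."""
    records = []
    with open(path, "r+b") as f:
        data = f.read()
        pos = 0
        while pos + HEADER.size <= len(data):
            length, crc = HEADER.unpack_from(data, pos)
            start = pos + HEADER.size
            end = start + length
            if end > len(data):
                break           # payload was cut short
            payload = data[start:end]
            if _checksum(payload) != crc:
                break           # torn, zero-filled, or otherwise corrupt record
            records.append(payload)
            pos = end
        f.truncate(pos)         # drop everything after the last valid record
        f.flush()
        os.fsync(f.fileno())
    return records
```

With this layout, a tail that is too short and a tail of the right size but with unwritten (for example zero-filled) contents are both rejected, which a size comparison alone cannot do. As the answer notes, a drive that reorders or caches writes can still defeat the `fsync()` call.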

Related

If I private-`mmap` a file and read it, then another process writes to the same file, will another read at the same location return the same value?

(Context: I'm trying to establish which sequences of mmap operations are safe from the "memory safety" point of view, i.e. what assumptions I can make about mmaped memory without risking security bugs as a consequence of undefined behaviour, or miscompiles due to compilers making incorrect assumptions about how memory could behave. I'm currently working on Linux but am hoping to port the program to other operating systems in the future, so although I'm primarily interested in Linux, answers about how other operating systems behave would also be appreciated.)
Suppose I map a portion of a file into memory using mmap with MAP_PRIVATE. Now, assuming that the file doesn't change while I have it mapped, if I access part of the returned memory, I'll be given information from the file at that offset; and (because I used MAP_PRIVATE) if I write to the returned memory, my writes will persist in my process's memory but will have no effect on the underlying file.
However, I'm interested in what will happen if the file does change while I have it mapped (because some other process also has the file open and is writing to it). There are several cases that I know the answers to already:
If I map the file with MAP_SHARED, then if any other process writes to the file via a shared mmap, my own process's memory will also be updated. (This is the intended behaviour of MAP_SHARED, as one of its intended purposes is for shared-memory concurrency.) It's less clear what will happen if another process writes to the file via other means, but I'm not interested in that case.
If the following sequence of events occurs:
I map the file with MAP_PRIVATE;
A portion of the file I haven't accessed yet is written by another process;
I read that portion of the file via my mapping;
then, at least on Linux, the read might return either the old value or the new value:
It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
— man 2 mmap on Linux
(This case – which is not the case I'm asking about – is covered in this existing StackOverflow question.)
I also checked the POSIX definition of mmap, but (unless I missed it) it doesn't seem to cover this case at all, leaving it unclear whether all POSIX systems would act the same way.
Linux's behaviour makes sense here: at the time of the access, the kernel might have already mapped the requested part of the file into memory, in which case it doesn't want to change the portion that's already there, but it might need to load it from disk, in which case it will see any new value that may have been written to the file since it was opened. So there are performance reasons to use the new value in some cases and the old value in other cases.
If the following sequence of events occurs:
I map the file with MAP_PRIVATE;
I write to a memory address within the file mapping;
Another process changes that part of the file;
then although I don't know this for certain, I think it's very likely that the rule is that the memory address in question continues to reflect the old value, that was written by our process. The reason is that the kernel needs to maintain two copies of that part of the file anyway: the values as seen by our process (which, because it used MAP_PRIVATE, can write to its view of the file without changing the underlying file), and the values that are actually in the file on disk. Writes by other processes obviously need to change the second copy here, so it would be bizarre to also change the first copy; doing so would make the interface less usable and also come at a performance cost, and would have no advantages.
There is one sequence of events, though, where I don't know what happens (and for which the behaviour is hard to determine experimentally, given the number of possible factors that might be relevant):
I map the file with MAP_PRIVATE;
I read some portion of the file via the mapping, without writing;
Another process changes part of the file that I just read;
I read the same portion of the file via the mapping, again.
In this situation, am I guaranteed to read the same data twice? Or is it possible to read the old data the first time and the new data the second time?

How to prevent changes to the underlying file after mmap()-ing a file from being visible to my program?

According to mmap() manpage:
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
Question: How to prevent changes to the underlying file after mmap()-ing a file from being visible to my program?
Background: I am designing a data structure for a text editor designed to allow editing huge text files efficiently. The data structure is akin to an on-disk rope but with the actual strings being pointers to mmap()-ed ranges from the original file.
Since the file could be very large, there are a few restrictions around the design:
Must not load the entire file into RAM as the file may be larger than available physical RAM
Must not copy files on opening as this will make opening new files really slow
Must work on filesystems like ext4 that do not support copy-on-write (cp --reflink/ioctl_ficlone)
Must not rely on mandatory file locking, as this is deprecated, and requires specific mount option -o mand in the filesystem
As long as the changes aren't visible in my mmap(), it's ok for the underlying file to change on the filesystem
Only need to support recent Linux, and using Linux-specific system APIs is OK
The data structure I'm designing would keep track of a list of unedited and edited ranges in the file by storing the start and end index of each range in the mmap()-ed buffer. While the user is browsing through the file, ranges of text that have never been modified by the user would be read directly from an mmap() of the original file, while a swap file would store the ranges of text that have been edited by the user but have not yet been saved.
When the user saves a file, the data structure would use copy_file_range to splice the swap file and the original file to assemble the new file. For this splicing to work, the original file as seen by my program must remain unchanged throughout the entire editing session.
Problem: After making unsaved changes in my text editor, the user may concurrently have other programs modify the same file, possibly other text editors or some other program that modifies the text file in place.
In such a situation, the editor can detect the external change using inotify, and then I want to give the user two options on how to continue:
discard all unsaved changes and re-read the file from disk, implementing this option is fairly straightforward
allow the user to continue editing the file and later on the user should be able to save the unsaved changes in a new location or to overwrite the changes that had been made by the other program, but implementing this seems tricky
Since my editor did not make a copy of the file when it opened it, when the other program overwrites the file, the text ranges that my data structure is tracking may become invalid, because the data on disk has changed and these changes are now visible through my mmap(). This means that if my editor tried to write unsaved changes after the file had been modified by another process, it could be splicing text ranges of the old file using data from the new file, and so could produce a corrupt file when saving the unsaved changes.
I don't think advisory locks would save the situation in all cases, as other programs may not honor advisory locks.
My ideal solution would be that, when another program overwrites the file, the system transparently copies the file so that my program continues to see the old version while the other program finishes its write to disk and makes its version visible in the filesystem. I think ioctl_ficlone could have made this possible, but to my understanding it only works with a copy-on-write filesystem like btrfs.
Is such a thing possible?
Any other suggestions to solve this problem would also be welcome.
What you want to do isn't possible with mmap, and I'm not sure if it's possible at all with your constraints.
When you map a region, the kernel may or may not actually load all of it into memory. The region of memory that lacks data will actually contain an invalid page, so when you access it, the kernel takes a page fault and maps that region into memory. That region will likely contain whatever is in that portion of the file at the time the page fault occurs. There is an option, MAP_LOCKED, which tries to prefault all of the pages in, but doesn't guarantee it, so you can't rely on it working.
In general, you cannot prevent other processes from changing a file out from under you. Some tools (including editors) will write a new file to the side, calling rename to overwrite the file, and some will rewrite the file in place. The former is what you want, but many editors choose to do the latter, since it preserves characteristics such as ACLs and permissions you can't restore.
Furthermore, you really don't want to use mmap on any file you can't totally control, because if another process truncates the file and you try to access that portion of the buffer, your process will die with SIGBUS. Returning normally from a handler for this signal is undefined behavior, and the only sane thing to do is die. (Also, it can be sent in other situations, such as unaligned access, and you'll have a hard time distinguishing between them.)
Ultimately, if you're not interested in copying the file, you can't guarantee someone won't change it underneath you, and you'll need to be prepared for that to occur.
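Being "prepared for that to occur" could, for instance, mean checking whether the file has changed before trusting ranges read through the mapping. The sketch below is one possible approach, not something the answer prescribes: the class name `MappedFile` and the choice of `fstat()` fields to compare are assumptions made for illustration. It only detects changes after the fact; it does not prevent them, and it does nothing about SIGBUS on truncation.

```python
import mmap
import os


class MappedFile:
    """Private read-only mapping plus a snapshot of the file's metadata at map time."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)
        self.snapshot = self._stat_key()
        # Length 0 maps the whole file (this raises ValueError for an empty file).
        self.view = mmap.mmap(self.fd, 0, flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)

    def _stat_key(self):
        st = os.fstat(self.fd)
        return (st.st_size, st.st_mtime_ns, st.st_ctime_ns)

    def changed(self):
        """True if size or timestamps differ from the snapshot taken when mapping."""
        return self._stat_key() != self.snapshot

    def close(self):
        self.view.close()
        os.close(self.fd)
```

An editor could call `changed()` before splicing ranges on save and fall back to the "discard or save elsewhere" choice when it returns true. Timestamps have finite granularity, so a very quick same-size in-place write might in principle go unnoticed; combining this with inotify, as the question already proposes, narrows that window but does not close it.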

How to safely use mmap() for reading?

I have a need to do a lot of random-access reads in a large file so I use mmap(). This solution seems to be perfect as long as the mapped file is untouched. But this is not always the case. If the mapped file is tampered with, several problems arise:
If a change to the file reduces its length, or the file becomes inaccessible, then the process must handle the SIGBUS signal in order to survive (at least in the Linux implementation). This adds additional complications since I'm writing a library.
To make things even worse, the mmap() manpage says it is unspecified whether changes to the original file are propagated to the memory. So they can very well be propagated.
This essentially means the contents of the file I work with can become white noise at any moment.
Does all of this mean that any program that maps a freely accessible file and does not handle these problems can be brought down by a DoS attack? Even while I do not expect evil hackers to go after my program, I can easily see a user modifying my mapped file, replacing it with another one or making the file inaccessible by, for example, removing a USB drive. And while I can write a signal handler (and this is a bit messy, so I am looking for a better solution) to solve the first problem, I have no idea how to solve the second one.
The file cannot be copied and can be freely moved around if it's not used by a program (just like any other media file). Linux file locks do not always work.
So, how to safely use mmap() for reading?

File contents lost after power outage

I am using C++ ofstream to write a log file on Linux. When I monitor the file contents with the tail -f command I can see the contents are correctly populated. But if a power outage happens and I check the file again after a power cycle, the last couple of lines of records are gone. With hexdump I can see those records turned into null characters '\0' instead. I tried flush() and the manipulator std::endl, and neither helped.
Is it true that what tail showed me was not actually written to the disk and was just in a buffer? That the inode table wasn't updated before the power outage? I can accept this, but I don't understand why the records turned into null characters if they weren't written to the file.
By the way, I tried Google's glog and got the same results (a bunch of null characters at the end). I also tried zlog, a C library, and found that it only lost the last records but didn't replace them with null chars.
Well, when you have a power outage and then start the system again, the Linux kernel replays the filesystem journal to detect and correct the inconsistencies between memory and disk that existed when the system crashed. Normally this means redoing and committing every operation that was completely logged before the crash, and undoing (erasing) whatever had not been committed at the time of the crash.
Linux (and other un*x kernels, like FreeBSD) has a facility called ordered data writes, which forces metadata (like block pointers in inodes, or directory entries) to be updated only after the data they point to has actually been written to disk, so inconsistencies are kept to a minimum. I don't know the actual Linux implementation, but in FreeBSD, for example, what you describe (a block of zeros in a file instead of the data actually written) is completely impossible (well, you can do it on purpose, but not accidentally). Most probably, Linux is journaling only the block information and not the file contents, or it updated the file size but not the data up to that point. This should not happen, as it is an already solved problem.
The other question is how much data you had written, and why what you saw on screen is not there after the crash. You have probably heard of delayed writes, which let the kernel save disk operations on busy systems by not writing data to disk immediately, but waiting some time so that updates can be coalesced in in-memory buffers before they go to disk. Disk writes are, in any case, forced after some delay, which I believe is 5 seconds in Linux (it has been a long time since I last checked that value, and I am in doubt between 5 and 30 seconds), so you can lose your last five seconds at most.
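The distinction between "visible to tail" and "on disk" can be sketched as two separate steps. This is an illustrative Python analogue, not the asker's C++ code, and the function name `append_durably` is made up for the example. In C++, `flush()` and `std::endl` correspond to the first step only; the standard stream classes offer no portable equivalent of the second.

```python
import os


def append_durably(path, line):
    """Append one log line and wait until the kernel reports it written to the device."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        # Step 1: user-space buffer -> kernel page cache.
        # After this, other processes such as `tail -f` can already see the data.
        f.flush()
        # Step 2: kernel page cache -> storage device.
        # Without this, a power cut can still lose the data.
        os.fsync(f.fileno())
```

Calling `fsync()` after every line is expensive, so a logger would more plausibly do it every N records or on a timer and accept losing the records in between.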

using files as IPC on linux

I have one writer which creates and sometimes updates a file with some status information. The readers are implemented in lua (so I got only io.open) and possibly bash (cat, grep, whatever). I am worried about what would happen if the status information is updated (which means a complete file rewrite) while a reader has an open handle to the file: what can happen? I have also read that if the write/read operation is below 4KB, it is atomic: that would be perfectly fine for me, as the status info can fit well in such dimension. Can I make this assumption?
A read or write is atomic under 4Kbytes only for pipes, not for disk files (for which the atomic granularity may be the file system block size, usually 512 bytes).
In practice you could avoid bothering about such issues (assuming your status file is e.g. less than 512 bytes), and I believe that if the writer opens and writes that file quickly (in particular, if you avoid open(2)-ing the file, keeping the file handle open for a long time such as many seconds, and only then write(2)-ing a small string into it once), you don't need to bother.
If you are paranoid, but do assume that readers (like grep) open the file and read it quickly, you could write to a temporary file and rename(2) it once it has been written in its entirety and close(2)-d.
As Duck suggested, locking the file in both readers and writers is also a solution.
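The temporary-file-and-rename approach mentioned above could look roughly like this. It is a sketch under assumptions: the function name, the `.status-` prefix and the 0644 mode are illustrative choices, and the temporary file must live in the same directory (same filesystem) as the target for the rename to be atomic.

```python
import contextlib
import os
import tempfile


def write_status_atomically(path, text):
    """Replace `path` so readers see either the old or the new content, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".status-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            os.fchmod(f.fileno(), 0o644)   # mkstemp creates 0600; assumed readers need read access
            f.write(text)
            f.flush()
            os.fsync(f.fileno())           # make sure the content is written before the rename
        os.replace(tmp_path, path)         # rename(2) under the hood
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

A reader that already has the old file open keeps reading the old contents, because the rename swaps the directory entry and leaves the old inode alone. That suits readers like `io.open` in Lua or `cat`, which cannot be made to take locks.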
I may be mistaken, in which case someone will correct me, but I don't think the external readers are going to pay any attention to whether the file is being simultaneously updated. They are going to print (or possibly EOF or error out on) whatever is there.
In any case, why not avoid the whole mess and just use file locks? Have the writer flock (or similar) and the readers check the lock. If they get the lock they know they are OK to read.
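The locking suggestion might be sketched as follows, with an exclusive lock for the writer and a shared lock for readers. The function names are hypothetical, and the sketch only helps when every participant cooperates: `flock` locks are advisory, so a reader that does not take the lock (plain `cat`, or Lua's `io.open`) is not held back by it.

```python
import fcntl


def write_status_locked(path, text):
    """Rewrite the status file while holding an exclusive advisory lock."""
    # "ab+" creates the file if needed without truncating it before the lock is held.
    with open(path, "ab+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until no reader or writer holds the lock
        f.truncate(0)                   # safe now: cooperating readers are locked out
        f.write(text.encode("utf-8"))   # append mode writes at the (new) end of file
        f.flush()
        # Closing the file releases the lock.


def read_status_locked(path):
    """Read the status file while holding a shared advisory lock."""
    with open(path, "rb") as f:
        fcntl.flock(f, fcntl.LOCK_SH)   # several readers may hold this at once
        return f.read().decode("utf-8")
```

Shell readers could cooperate through the `flock(1)` utility from util-linux; for readers that cannot lock at all, the rename approach from the other answer is the more robust option.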
