I am using C++ ofstream to write a log file on Linux. When I monitor the file with the tail -f command I can see the contents being populated correctly. But if a power outage happens and I check the file again after the power cycle, the last couple of lines of records are gone. With hexdump I can see that those records turned into null characters ('\0') instead. I tried flush() and the std::endl manipulator, and neither helps.
Is it true that what tail showed me was not actually written to disk, and the records were still sitting in a buffer? That the inode table wasn't updated before the power outage? I can accept that, but I don't understand why the records turned into null characters if they were never written to the file.
Btw, I tried Google's glog and got the same result (a bunch of null characters at the end). I also tried zlog, a C library, and found that it only lost the last records but didn't replace them with null chars.
Well, after a power outage, when the system starts again, the Linux kernel replays the filesystem journal to detect and correct the inconsistencies between memory and disk that were left behind when the system crashed. Normally this means redoing and committing every operation that completed before the crash, and undoing (erasing) any data that was not yet committed at the time of the crash.
Linux (and other un*x kernels, like FreeBSD) has a facility called ordered data writes, which forces metadata (like block pointers in inodes, or directory entries) to be updated only after the data they point to has actually been written to disk, so inconsistencies are kept to a minimum. I don't know the details of the Linux implementation, but in FreeBSD, for example, what you describe (a block of zeros in a file instead of the actual data written) is impossible by accident (you can do it on purpose, but not accidentally). Most likely Linux journaled only the block metadata and not the file contents, or it updated the file size but not the data up to that point. This shouldn't happen, as it's an already solved problem.
The other question is how much data was actually written, or why what you saw on the screen didn't survive the crash. You have probably heard of something called delayed writes, which let the kernel save write operations on busy systems by not writing data to disk immediately, but waiting a while so that updates can be coalesced in in-core buffers before they go to disk. Disk writes are still forced after some delay, around 5 seconds in Linux if I remember correctly (it has been a long time since I last checked that value, and I'm in doubt between 5 and 30 seconds), so you lose at most your last few seconds of data.
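To survive a power cut, flushing the stream is not enough: flush() and std::endl only move data from the C++ library's buffer into the kernel's page cache. A minimal sketch of the extra step that asks the kernel to push the data to the device (the file name and helper function are illustrative, not from the question):

```cpp
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Append one record and block until it reaches the disk.
void append_record(const std::string& line) {
    int fd = ::open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return;
    ::write(fd, line.data(), line.size()); // lands in the kernel page cache
    ::fsync(fd);  // forces data and metadata out to the device
                  // (the drive's own volatile write cache is a separate issue)
    ::close(fd);
}
```

Note that calling fsync() per record is expensive; it trades throughput for durability.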
(Context: I'm trying to establish which sequences of mmap operations are safe from the "memory safety" point of view, i.e. what assumptions I can make about mmaped memory without risking security bugs as a consequence of undefined behaviour, or miscompiles due to compilers making incorrect assumptions about how memory could behave. I'm currently working on Linux but am hoping to port the program to other operating systems in the future, so although I'm primarily interested in Linux, answers about how other operating systems behave would also be appreciated.)
Suppose I map a portion of a file into memory using mmap with MAP_PRIVATE. Now, assuming that the file doesn't change while I have it mapped, if I access part of the returned memory, I'll be given information from the file at that offset; and (because I used MAP_PRIVATE) if I write to the returned memory, my writes will persist in my process's memory but will have no effect on the underlying file.
However, I'm interested in what will happen if the file does change while I have it mapped (because some other process also has the file open and is writing to it). There are several cases that I know the answers to already:
If I map the file with MAP_SHARED, then if any other process writes to the file via a shared mmap, my own process's memory will also be updated. (This is the intended behaviour of MAP_SHARED, as one of its intended purposes is for shared-memory concurrency.) It's less clear what will happen if another process writes to the file via other means, but I'm not interested in that case.
If the following sequence of events occurs:
I map the file with MAP_PRIVATE;
A portion of the file I haven't accessed yet is written by another process;
I read that portion of the file via my mapping;
then, at least on Linux, the read might return either the old value or the new value:
It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
— man 2 mmap on Linux
(This case – which is not the case I'm asking about – is covered in this existing StackOverflow question.)
I also checked the POSIX definition of mmap, but (unless I missed it) it doesn't seem to cover this case at all, leaving it unclear whether all POSIX systems would act the same way.
Linux's behaviour makes sense here: at the time of the access, the kernel might have already mapped the requested part of the file into memory, in which case it doesn't want to change the portion that's already there, but it might need to load it from disk, in which case it will see any new value that may have been written to the file since it was opened. So there are performance reasons to use the new value in some cases and the old value in other cases.
If the following sequence of events occurs:
I map the file with MAP_PRIVATE;
I write to a memory address within the file mapping;
Another process changes that part of the file;
then although I don't know this for certain, I think it's very likely that the rule is that the memory address in question continues to reflect the old value that was written by our process. The reason is that the kernel needs to maintain two copies of that part of the file anyway: the values as seen by our process (which, because it used MAP_PRIVATE, can write to its view of the file without changing the underlying file), and the values that are actually in the file on disk. Writes by other processes obviously need to change the second copy, so it would be bizarre for them to also change the first copy; doing so would make the interface less usable, would come at a performance cost, and would have no advantages.
There is one sequence of events, though, where I don't know what happens (and for which the behaviour is hard to determine experimentally, given the number of possible factors that might be relevant):
I map the file with MAP_PRIVATE;
I read some portion of the file via the mapping, without writing;
Another process changes part of the file that I just read;
I read the same portion of the file via the mapping, again.
In this situation, am I guaranteed to read the same data twice? Or is it possible to read the old data the first time and the new data the second time?
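For what it's worth, this last sequence can at least be probed with a single-process experiment, using a pwrite() through the file descriptor to stand in for the other process. This is only a sketch, and the same-process shortcut is not guaranteed to behave identically to a genuine cross-process write:

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    int fd = open("testfile", O_RDWR);   // assumes an existing, non-empty file
    char* p = static_cast<char*>(
        mmap(nullptr, 4096, PROT_READ, MAP_PRIVATE, fd, 0));
    char before = p[0];                  // step 2: fault the page in
    char changed = before + 1;
    pwrite(fd, &changed, 1, 0);          // step 3: modify the file itself
    char after = p[0];                   // step 4: read again via the mapping
    printf("before=%c after=%c\n", before, after);
    munmap(p, 4096);
    return close(fd);
}
```

On Linux the already-resident page typically keeps the old value, which is consistent with (though not guaranteed by) the man page's "unspecified" wording.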
I'm using XFS on Linux and have a memory-mapped file to which I write once per second. I notice that the file mtime (shown by watch ls --full-time) changes periodically but irregularly. The gap between mtimes seems to be between 2 and 20 seconds, but it is not consistent. There is very little else running on the system; in particular, only one program of mine writes the file, and one reads it.
The same program writes much more frequently to some other mmapped files, and their mtime changes exactly once per 30 seconds.
I am not using msync() (which would update mtime when called).
My questions:
What updates mtime?
Is the update interval configurable?
Why do some mtimes get updated exactly once every 30 seconds, while some files that I write less frequently have fresher (irregular, but always less than 30 seconds old) mtimes?
When you mmap a file, you're basically sharing memory directly between your process and the kernel's page cache — the same cache that holds file data that's been read from disk, or is waiting to be written to disk. A page in the page cache that's different from what's on disk (because it's been written to) is referred to as "dirty".
There is a kernel thread that scans for dirty pages and writes them back to disk, under the control of several parameters. One important one is dirty_expire_centisecs. If any of the pages for a file have been dirty for longer than dirty_expire_centisecs then all of the dirty pages for that file will get written out. The default value is 3000 centisecs (30 seconds).
Another set of variables is dirty_writeback_centisecs, dirty_background_ratio, and dirty_ratio. dirty_writeback_centisecs controls how often the kernel thread checks for dirty pages, and defaults to 500 (5 seconds). If the percentage of dirty pages (as a fraction of the memory available for caching) is less than dirty_background_ratio then nothing happens; if it's more than dirty_background_ratio, then the kernel will start writing some pages to disk. Finally, if the percentage of dirty pages exceeds dirty_ratio, then any processes attempting to write will block until the amount of dirty data decreases. This ensures that the amount of unwritten data can't increase without bound; eventually, processes producing data faster than the disk can write it will have to slow down to match the disk's pace.
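All four of these knobs live under /proc/sys/vm, where they can be inspected or tuned (as root). A trivial way to dump the current values:

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    const char* knobs[] = {"dirty_expire_centisecs", "dirty_writeback_centisecs",
                           "dirty_background_ratio", "dirty_ratio"};
    for (const char* k : knobs) {
        std::ifstream f(std::string("/proc/sys/vm/") + k);
        std::string value;
        std::getline(f, value);
        std::cout << k << " = " << value << '\n';
    }
}
```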
The question of how the mtime gets updated is related to the question of how the kernel knows that a page is dirty in the first place. In the case of mmap, the answer is that the kernel sets the pages of the mapping to read-only. That doesn't mean that you can't write them, but it means that the first time you do, it triggers an exception in the memory-management unit, which is handled by the kernel. The exception handler does (at least) four things:
Marks the page as dirty, so that it will get written back.
Updates the file mtime.
Marks the page as read-write, so that the write can succeed.
Jumps back to the instruction in your program that writes to the mmaped page, which succeeds this time.
So when you write data to a clean page, it causes an mtime update, but it also causes the page to become read-write, so that further writes don't cause an exception (or an mtime update) [note 1]. However, when the dirty page gets flushed to disk, it becomes clean, and also becomes "read-only" again, so that any further writes to it will trigger another eventual disk write, and also another mtime update.
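The mechanism can be seen with a small experiment; this is a sketch of the model just described, not a guarantee of any particular kernel's behaviour, and the file name is illustrative:

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("data.bin", O_RDWR);   // assumes an existing, non-empty file
    char* p = static_cast<char*>(
        mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    struct stat st;

    p[0] = 'x';      // first store: fault, page marked dirty, mtime updated
    fstat(fd, &st);
    printf("after 1st write: %ld\n", (long)st.st_mtime);

    sleep(2);
    p[1] = 'y';      // page already read-write: no fault, so no mtime change
    fstat(fd, &st);  // (until writeback cleans the page again)
    printf("after 2nd write: %ld\n", (long)st.st_mtime);

    munmap(p, 4096);
    return close(fd);
}
```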
So now, with a few assumptions, we can start to piece together the puzzle.
First, dirty_background_ratio and dirty_ratio are probably not coming into play. If the pace of your writes was fast enough to trigger background flushes, then most likely you would see the "irregular" behavior on all files.
Second, the difference between the "irregular" files and the "30 second" files is the page access pattern. I surmise that the "irregular" files are being written to in some sort of append-mode or circular-buffer fashion, such that you start writing to a new page every few seconds. Every time you dirty a previously untouched page, it triggers an mtime update. But for the files displaying the 30-second pattern, you only write to one page (perhaps they are one page or less in length). In that case, the mtime is updated on first write, and then not again until the file is flushed to disk by exceeding dirty_expire_centisecs, which is 30 seconds.
Note 1: This behavior is, technically, wrong. It's unpredictable, but the standards allow for some degree of unpredictability. But they do require that the mtime be sometime at or after the last write to a file, and at or before an msync (if any). In the case where a page is written to multiple times in the interval before it's flushed to disk, this isn't what happens — the mtime gets the timestamp of the first write. This has been discussed, but a patch that would have fixed it wasn't accepted. Therefore, when using mmap, mtimes can be in error. dirty_expire_centisecs sort of limits that error, but only partially, since other disk traffic might cause the flush to have to wait, extending the window for a write to bypass mtime even further.
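If accurate mtimes matter, the workaround implied above is to call msync() after writing: writeback cleans and re-protects the page, so the next store faults again and refreshes the mtime. A minimal sketch:

```cpp
#include <cstddef>
#include <sys/mman.h>

// MS_SYNC blocks until the dirty pages in [addr, addr + len) are written
// back; the next store to them will fault and update the mtime again.
void checkpoint(void* addr, size_t len) {
    msync(addr, len, MS_SYNC);
}
```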
I'm thinking about ways for my application to detect a partially-written record after a program or OS crash. Since records are only ever appended to a file (never overwritten), is a crash while writing guaranteed to yield a file size that is shorter than it should be? Is this guaranteed even if the file was opened in read-write mode instead of append mode, so long as writes are always at the end of the file? This would greatly simplify crash recovery, since comparing the last record's expected size and position with the actual file size would be enough to detect a partial write.
I understand that random-access writes can be reordered by the filesystem, but I'm having trouble finding information on whether this can happen when appending. I imagine an out-of-order append would require the filesystem to create a "hole" at the tail of the (sparse) file, write blocks beyond the hole, and then fill in the blocks in between, but I'm hoping that such an approach would be so inefficient that nobody would ever implement their filesystem that way.
I suppose another problem might be a filesystem updating the directory entry's file size field before appending the new blocks to the file, and the OS crashing in between. Does this ever happen in practice? (ext4, perhaps?) Is there a quick way to detect it? (And what happens when trying to read the unwritten blocks that should exist according to the file's size?)
Is there anything else, such as write reordering performed by a disk/flash drive, that would get in the way of using file size as a way to detect a partial append? I don't expect to be able to compensate for this sort of drive trickery in my application, but it would be good to know about.
If you want to be SURE that you're never going to lose records, you need a consistent journaling or transactional system for your files.
There is absolutely no guarantee that a write will have been fulfilled unless you either set O_DIRECT [which you probably do not want to do], or you use markers that indicate "this has been fully committed" and are only written when the file is closed. You can do that either in the main file or, for example, in a separate file that records the "last written record" externally. If you open and close that file, it should be safe as long as it is the APP that is crashing; if the OS crashes [or is otherwise abruptly stopped, e.g. power cut, disk unplugged, etc.], all bets are off.
Write reordering and write caching can happen at every level: the C library, the OS, the filesystem module, and the hard disk/controller itself are all ABLE to reorder writes.
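A sketch of the external-marker idea, with all of the above caveats (descriptors and layout are illustrative): append the record, make the data durable, and only then publish the new committed size. On recovery, anything in the data file past the recorded size is treated as a torn write:

```cpp
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>

bool append_record(int data_fd, int marker_fd, const void* rec, size_t len) {
    off_t end = lseek(data_fd, 0, SEEK_END);
    if (write(data_fd, rec, len) != (ssize_t)len) return false;
    if (fsync(data_fd) != 0) return false;          // data durable first
    uint64_t committed = (uint64_t)end + len;
    if (pwrite(marker_fd, &committed, sizeof committed, 0)
            != (ssize_t)sizeof committed) return false;
    return fsync(marker_fd) == 0;                   // then publish the size
}
```

If the OS or disk lies about fsync(), even this ordering is not bulletproof, which is the "all bets are off" part.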
In my application I am continually writing data to file1 and flushing it to the device. In another thread, I am reading data from file1 and writing it to file2.
Every time I do the fwrite + fflush on file1, I signal to the other thread to start reading from it. The other thread reads data from file1 and dumps it into file2. Pretty simple logic. Additionally, after every few minutes, I seek back to start of file1 and start overwriting old data.
Now my problem is that once I start overwriting data in file1, the data read into file2 is sometimes the old data (i.e. data written in the previous iteration), even though the writer thread has signaled that it wrote the new data (and flushed it).
I am writing to and reading from a solid-state drive (a 128 GB SAMSUNG 470 Series, if that helps) on a [C + Linux + ARM] platform. I feel that there is an issue with the processor cache: perhaps the write goes into the cache while the read by the reader thread comes from the flash, hence the stale data.
The catch here is that this problem occurs if the SSD is formatted with NTFS. If I format it with ext3, the problem goes away. Unfortunately, NTFS is a hard requirement. Another interesting observation is that if I have two reader threads, both get stale data at different instants.
Even after disabling the SSD write cache (with hdparm -W0 /dev/sda1), I get the same problem with NTFS. I have been badly stuck on this for more than a week.
Any idea what is happening, and why it is happening that way?
Any help will be worth its weight in gold...
EDIT Turns out that the NTFS driver does not like me overwriting a file by rewinding the file pointer. Is this a known thing?
Ok, so I found the issue myself (and how rarely does that happen!).
I found that there was a problem with the C library buffering (fread/fwrite), so now I do fflush() before every fread(). This solves my problem. (I don't know exactly what went wrong with the driver, but I assume there is some issue with the read buffering of the C library I/O functions when reading from the same location in the file a second time.)
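In other words, the fix amounts to invalidating stdio's read-ahead before re-reading a region the writer has since overwritten. A sketch (fflush() on a seekable input stream is POSIX/glibc behaviour that discards buffered input and resyncs the file position; repositioning with fseek() is the more portable way to drop the buffer):

```cpp
#include <cstdio>

// Re-read a region of the file that the writer may have just overwritten.
size_t read_fresh(FILE* f, long offset, void* buf, size_t len) {
    fflush(f);                  // discard stale buffered input
    fseek(f, offset, SEEK_SET); // repositioning also invalidates the buffer
    return fread(buf, 1, len, f);
}
```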
Thanks @Asad Rasheed and @jrtipton for your inputs :)
Let's say you're building a journaling/write-ahead-logging storage system. Can you simply implement this by (for each transaction) appending the data (with write(2)), appending a commit marker, and then fsync-ing?
The scenario to consider is if you do a large set of writes to this log then fsync it, and there's a failure during the fsync. Are the inode direct/indirect block pointers flushed only after all data blocks are flushed, or are there no guarantees that blocks are being flushed in order? If the latter, then during recovery, if you see a commit marker at the end of the file, you can't trust that the data between it and the previous commit marker is meaningful. Thus you have to rely on another mechanism (involving at least another fsync) to determine what extent of the log file is consistent (e.g., writing/fsyncing the data, then writing/fsyncing the commit marker).
If it makes a difference, mainly wondering about ext3/ext4 as the context.
Note that Linux's and Mac OS's fsync and fdatasync are incorrect by default (they do not necessarily flush the drive's volatile write cache; on Mac OS you need fcntl(fd, F_FULLFSYNC) for that). Windows is correct by default, but can emulate the Linux behaviour for benchmarking purposes.
Also, fdatasync issues multiple disk writes if you append to the end of a file, since it needs to update the file inode with the new length. If you want to have one write per commit, your best bet is to pre-allocate log space, store a CRC of the log entries in the commit marker, and issue a single fdatasync() at commit. That way, no matter how much the OS / hardware reorder behind your back, you can find a prefix of the log that actually hit disk.
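A rough sketch of that scheme; the checksum here is a stand-in (FNV-1a) for a real CRC32 such as zlib's, and the file is assumed to have been pre-sized, e.g. with posix_fallocate(), so appends never grow the inode:

```cpp
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>

// Stand-in checksum; a real log would use a proper CRC32.
static uint32_t checksum(const void* data, size_t len) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; ++i) { h ^= p[i]; h *= 16777619u; }
    return h;
}

// Entry layout: [len][payload][crc]; a valid crc doubles as the commit marker.
bool commit(int fd, off_t pos, const void* entry, uint32_t len) {
    uint32_t crc = checksum(entry, len);
    if (pwrite(fd, &len, sizeof len, pos) != (ssize_t)sizeof len) return false;
    if (pwrite(fd, entry, len, pos + sizeof len) != (ssize_t)len) return false;
    if (pwrite(fd, &crc, sizeof crc, pos + sizeof len + len)
            != (ssize_t)sizeof crc) return false;
    return fdatasync(fd) == 0;  // one sync per commit; length is unchanged
}
```

On recovery, scan entries from the start and stop at the first bad checksum; everything before it is a consistent prefix of the log.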
If you want to use the log for durable commits or write ahead, things get harder, since you need to make sure that fsync actually works. Under Linux, you'll want to disable the disk write cache with hdparm, or mount the partition with barrier set to true. [Edit: I stand corrected, barrier doesn't seem to give the correct semantics. SATA and SCSI introduce a number of primitives, such as write barriers and native command queuing, that make it possible for operating systems to export primitives that enable write-ahead logging. From what I can tell from manpages and online, Linux only exposes these to filesystem developers, not to userspace.]
Paradoxically, disabling the disk write cache sometimes leads to better performance, since you get more control over write scheduling in user space; if the disk queues up a bunch of synchronous write requests, you end up exposing strange latency spikes to the application. Disabling write cache prevents this from happening.
Finally, real systems use group commit, and do < 1 sync write per commit with concurrent workloads.
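For illustration, a toy group-commit sketch (the class and its fields are hypothetical): writers append under a lock, exactly one of them runs fsync() at a time, and each sync covers every record appended before it started, so concurrent commits share the cost:

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <unistd.h>

class GroupCommitLog {
    std::mutex m_;
    std::condition_variable cv_;
    int fd_;
    uint64_t appended_ = 0, synced_ = 0;
    bool sync_in_progress_ = false;
public:
    explicit GroupCommitLog(int fd) : fd_(fd) {}
    void commit(const void* rec, size_t len) {
        std::unique_lock<std::mutex> lk(m_);
        write(fd_, rec, len);                  // serialized append
        uint64_t my_seq = ++appended_;
        while (synced_ < my_seq) {
            if (!sync_in_progress_) {
                sync_in_progress_ = true;
                uint64_t target = appended_;   // batch boundary
                lk.unlock();
                fsync(fd_);                    // one sync for the whole batch
                lk.lock();
                sync_in_progress_ = false;
                synced_ = std::max(synced_, target);
                cv_.notify_all();
            } else {
                cv_.wait(lk);                  // piggyback on the current sync
            }
        }
    }
};
```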
There's no guarantee on the order in which blocks are flushed to disk. These days even the drive itself can re-order blocks on their way to the platters.
If you want to enforce ordering, you need to at least fdatasync() between the writes that you want ordered. All a sync promises is that when it returns, everything written before the sync has hit storage.
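Concretely (a sketch; the offsets and buffers are illustrative):

```cpp
#include <unistd.h>

// B can never be on disk without A: the sync between them does not
// return until A has hit stable storage.
void ordered_writes(int fd, const void* a, size_t alen,
                    const void* b, size_t blen, off_t off) {
    pwrite(fd, a, alen, off);
    fdatasync(fd);                   // A is durable when this returns
    pwrite(fd, b, blen, off + alen);
}
```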