Prevent data corruption - Linux

I'm working on an embedded Linux system running on ARM9.
The filesystem is ext4, mounted with (rw, sync, noatime, data=writeback).
I implemented a process that writes/reads to an SQLite3 database in Write-Ahead Logging (WAL) mode, with synchronous writes disabled. When a power loss happens, I have around two seconds to save all data by syncing and checkpointing the DB. But still, I sometimes see that the DB gets corrupted, which is really not good in my case.
I would like to write a new DB engine for my purpose, similar to SQLite in that the whole DB is held in one file. In this case, though, I'm thinking of writing the header data to one sector and the rest of the data starting at least two sectors later. The DB file will be larger, but writing data will not ruin the header of the file, which holds the indexes and so on. That way, only the last data written can be corrupted and not the whole file, as happens with SQLite.
My question is: is my approach right?
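As an aside, the "sync and checkpoint the DB" step described above could be done with the SQLite C API roughly as follows; this is only a sketch, and the database handle db and the power-loss notification are assumed to come from elsewhere in the application:

#include <sqlite3.h>
#include <unistd.h>   /* sync() */

/* Hypothetical handler invoked when the power-loss signal arrives;
 * db is the already-open SQLite connection (assumed). */
static int on_power_loss(sqlite3 *db)
{
    int frames_in_wal = 0, frames_checkpointed = 0;

    /* Move everything from the WAL into the main database file. */
    int rc = sqlite3_wal_checkpoint_v2(db, NULL, SQLITE_CHECKPOINT_TRUNCATE,
                                       &frames_in_wal, &frames_checkpointed);

    /* Ask the kernel to flush dirty pages down to the storage device. */
    sync();

    return rc;   /* SQLITE_OK on success */
}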

You can use the ping-pong technique.
In the ping-pong technique you use two separate files and write to them alternately. If a power loss occurs, in the worst case you have at most one corrupted file and you can safely use the other one. In the best case neither of them is corrupted and you can continue using the latest one.
A corrupted file is easily detected if you use a hash function or some other CRC scheme.
Obviously this scheme doesn't save you from the write cache or other disk caching mechanisms which could be working under the hood.
Alternatively, you can use a journaling file system, which features data integrity protection on its own.
Be aware that ping-pong and journaling schemes ensure only data integrity. Data loss could still occur. Data integrity and data loss are two completely different things.
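A minimal sketch of the ping-pong idea in C, using zlib's crc32() as the integrity check; the slot file names and the data-followed-by-CRC layout are illustrative assumptions, not a fixed format:

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <zlib.h>          /* crc32() */

static const char *slots[2] = { "state.0", "state.1" };  /* illustrative names */

/* Write len bytes of state to the given slot, followed by its CRC32. */
static int write_slot(int slot, const void *buf, size_t len)
{
    uint32_t crc = (uint32_t)crc32(0L, buf, (uInt)len);
    int fd = open(slots[slot], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    int ok = write(fd, buf, len) == (ssize_t)len &&
             write(fd, &crc, sizeof crc) == (ssize_t)sizeof crc &&
             fsync(fd) == 0;                 /* push past the OS cache */
    close(fd);
    return ok ? 0 : -1;
}

/* Read a slot back and verify its CRC; returns 0 if the slot is intact. */
static int read_slot(int slot, void *buf, size_t len)
{
    uint32_t stored;
    int fd = open(slots[slot], O_RDONLY);
    if (fd < 0) return -1;
    int ok = read(fd, buf, len) == (ssize_t)len &&
             read(fd, &stored, sizeof stored) == (ssize_t)sizeof stored;
    close(fd);
    if (!ok) return -1;
    return stored == (uint32_t)crc32(0L, buf, (uInt)len) ? 0 : -1;
}

On startup you verify both slots and keep one that passes the CRC check; a real implementation would also store a sequence number in each slot so that, when both verify, recovery can tell which copy is newer. As the answer notes, the fsync() only gets the data out of the OS cache, and lower-level caches may still be in play.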

Related

node fs.fsync (when to use?)

I want to safely write a file and I want to understand the proper use/place for fsync.
https://linux.die.net/man/2/fsync
After reading ^ that, I am puzzled as to where to use it effectively.
Question, do I:
const fs = require('fs');

// `data` holds the file contents and `cb` is the completion callback
// (both assumed to be defined by the surrounding code).
fs.writeFile('temp/file.txt', data, 'utf-8', function (error) {
    if (error) {
        fs.unlink('temp/file.txt', function () { cb(error, undefined); });
    } else {
        fs.rename('temp/file.txt', 'real/file.txt', function () {
            // fsync works on a file descriptor, not a path.
            fs.open('real/file.txt', 'r', function (err, fd) {
                fs.fsync(fd, function () {
                    fs.close(fd, function () { cb(undefined, true); });
                });
            });
        });
    }
});
I'm writing something that performs many file changes. I have looked at modules that write atomically, but I would like to understand the process.
fsync is one of those functions where it's extremely rare that you'll need to use it.
All operating systems mask the fact that storage devices are slow by caching reads and writes. When you write to a file, it doesn't immediately write to the actual storage medium; it'll capture it in a cache, tell your program that the write has completed, and go and write the contents to the storage device in the background instead. The operating system will keep everything consistent though; if another application reads from that file, it'll see the new contents, as the OS will serve the contents from cache.
Note for a moment that this isn't universal; I believe Windows disables caching for removable storage devices to prevent data loss when people pull the drive out. There's also some set of flags you can pass to open() to disable the cache.
For almost all use cases, you don't need to care that this happens. The only upshot for you is that your program can continue faster. There are some cases where this is problematic though:
If power is lost, the contents of the cache are lost, so the disk won't have all the new contents of the file.
If the drive is removed, writes will equally be lost. This is pretty typical for removable storage devices, and I'm pretty sure 90% of people ignore the "safely remove" prompt ;).
I think doing reads directly from a device (e.g. /dev/sdX on Linux) will bypass this cache, but I'm not 100% sure.
Databases are an example of where it is needed. When you run an update query, the database will normally update its in-memory state and write the mutation to a transaction log. Reliability is a good thing for a database though, so it will write to the transaction log and do an fsync on that file before responding to the user (or will have opened the transaction log as unbuffered), so there's some level of guarantee that the transaction has been persisted.
In your example, the fsync will ensure that the rename has actually taken place and has been flushed to disk.
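For comparison, the usual C version of this pattern fsyncs the temporary file before the rename and then fsyncs the containing directory so the rename itself survives a power loss; a minimal sketch, with paths and error handling kept deliberately simple:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace final_path with new contents (illustrative sketch). */
static int atomic_replace(const char *dir, const char *tmp_path,
                          const char *final_path,
                          const void *data, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    close(fd);

    /* rename() is atomic on POSIX filesystems: readers see either the
     * old file or the new one, never a half-written mixture. */
    if (rename(tmp_path, final_path) != 0) return -1;

    /* fsync the directory so the rename (a metadata change) is persisted too. */
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0) return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}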

Is it possible to verify data of multiple sequential writes just by checking the last n bytes written?

Just to be clear, my questions are language- and OS-agnostic.
I am working on a program (supporting many OSes, currently written in Go) that receives many chunks of data (like a stream of data chunks) and sequentially writes them all to a pre-specified position (pos >= 0) in a file. Only one process with one thread accesses the file. I use a regular write function that uses the write system call internally (which one depends on the OS it runs on), not buffered I/O.
Suppose that while my program was writing, the system suddenly crashed (the most severe kind of crash: power failure).
When the system is turned back on, I need to verify how many chunks were completely written to the HDD. (*)
The HDD that my program writes to is just a regular desktop or laptop HDD of today (not some fancy battery-backed one found in high-end servers).
Suppose that bit corruption during transfer to and reading from the HDD is highly unlikely and negligible.
My questions are:
1. Do I need to checksum all of the written chunks to verify (*)?
2. Or do I just need to check and confirm that the nth chunk is correct and assume that all the chunks before it (0 -> n-1) are correct too?
2.1 If 2. is enough, does that mean the order of sequential writes is guaranteed to be preserved by the FS of any OS (random writes can still be reordered, though)?
3. Does my way of doing recovery rely on the same principle as the append-only log files seen in many crash-proof databases?
I suspect your best bet is to study up on Cyclic Redundancy Checks (CRC).
As I understand it a CRC would allow you to verify that what was intended to be written actually was.
I also suggest that worrying about the cause of any errors is not very worthwhile (transmission errors vs. errors for any other reason such as a power failure).
Hope this helps.
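One concrete way to apply this, tying it back to questions 2 and 3: frame every chunk on disk as length + payload + CRC32 and, after a crash, scan forward until the first record that fails to verify; everything before that point is kept. A rough sketch of the recovery scan in C (the framing is an assumption for illustration, and zlib's crc32() is used for brevity):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>   /* crc32() */

/* Scan a file of [u32 length][payload][u32 crc] records and return the
 * offset just past the last record that verifies.  The caller is assumed
 * to have already seeked to where the records start. */
static long last_good_offset(FILE *f)
{
    long good = ftell(f);
    uint32_t len, stored;

    for (;;) {
        if (fread(&len, sizeof len, 1, f) != 1) break;
        /* A real implementation would sanity-check len before allocating. */
        unsigned char *buf = malloc(len);
        if (!buf) break;
        if (fread(buf, 1, len, f) != len ||
            fread(&stored, sizeof stored, 1, f) != 1 ||
            stored != (uint32_t)crc32(0L, buf, len)) {
            free(buf);
            break;                       /* torn or missing tail record */
        }
        free(buf);
        good = ftell(f);                 /* this record is complete */
    }
    return good;   /* caller can truncate the file back to this offset */
}

Truncating the file back to the returned offset leaves a consistent prefix of the stream, which is essentially the append-only-log recovery mentioned in question 3.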

Can file size be used to detect a partial append?

I'm thinking about ways for my application to detect a partially-written record after a program or OS crash. Since records are only ever appended to a file (never overwritten), is a crash while writing guaranteed to yield a file size that is shorter than it should be? Is this guaranteed even if the file was opened in read-write mode instead of append mode, so long as writes are always at the end of the file? This would greatly simplify crash recovery, since comparing the last record's expected size and position with the actual file size would be enough to detect a partial write.
I understand that random-access writes can be reordered by the filesystem, but I'm having trouble finding information on whether this can happen when appending. I imagine an out-of-order append would require the filesystem to create a "hole" at the tail of the (sparse) file, write blocks beyond the hole, and then fill in the blocks in between, but I'm hoping that such an approach would be so inefficient that nobody would ever implement their filesystem that way.
I suppose another problem might be a filesystem updating the directory entry's file size field before appending the new blocks to the file, and the OS crashing in between. Does this ever happen in practice? (ext4, perhaps?) Is there a quick way to detect it? (And what happens when trying to read the unwritten blocks that should exist according to the file's size?)
Is there anything else, such as write reordering performed by a disk/flash drive, that would get in the way of using file size as a way to detect a partial append? I don't expect to be able to compensate for this sort of drive trickery in my application, but it would be good to know about.
If you want to be SURE that you're never going to lose records, you need a consistent journaling or transactional system for your files.
There is absolutely no guarantee that a write will have been fulfilled unless you either set O_DIRECT [which you probably do not want to do], or you use markers to indicate that "this has been fully committed", which are only written when the file is closed. You can either do that in the main file, or, for example, have a file that records, externally, the "last written record". If you open & close that file, it should be safe as long as the APP is what is crashing - if the OS crashes [or is otherwise abruptly stopped - e.g. power cut, disk unplugged, etc.], all bets are off.
Write reordering and write caching can happen at all levels - the C library, the OS, the filesystem module and the hard disk/controller itself are all ABLE to reorder writes.
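A rough sketch in C of the external "last written record" idea mentioned above, with an fsync() added so it offers some protection beyond application crashes; the file name, marker format, and recovery convention are assumptions for illustration:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* After appending a record to the data file and fsync()ing it, record the
 * new committed size in a small sidecar file.  On recovery, anything in the
 * data file beyond the offset stored here is treated as uncommitted. */
static int publish_commit(const char *marker_path, uint64_t committed_size)
{
    int fd = open(marker_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    int ok = write(fd, &committed_size, sizeof committed_size)
                 == (ssize_t)sizeof committed_size
             && fsync(fd) == 0;
    close(fd);
    return ok ? 0 : -1;
}

Note that truncating and rewriting the marker is not itself atomic against power loss; a more careful version would update the marker via the write-temp-file-and-rename pattern shown earlier, or append markers protected by their own checksum.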

is write() with O_DIRECT ACID compliant?

My database engine writes 64-byte records by issuing a write() syscall for the entire disk block. The device is opened in O_DIRECT mode. For example, the third record within a block starts at byte 128 and ends at byte 192; when I do an UPDATE, the entire disk block (which is by default 512 bytes) is written.
My question is, can I claim ACID compliance if I am writing the record over itself every time an UPDATE occurs? Usually database engines do this in 2 steps, by writing the modified disk block to another (free) place and then updating an index to the new block with one (atomic) write immediately after the first write returned success. But I am not doing this; I am overwriting the current data with the new one, expecting the write to be successful. Does my method have any potential problems? Is it ACID compliant? What if the hardware writes only half of the block and my record is exactly in the middle? Or does the hardware already do the 2-step write process I described, but at block level, so I don't need to repeat the same in software?
(note: no record is larger than a physical disk block (512 bytes by default) and fsync goes after each write(); this is for Linux only)
ACID anticipates failures, and suggests ways to deal with them. Two-phase commits and three-phase commits are two fairly common and well-understood approaches.
Although I'm a database guy, the dbms frees me from having to think about this kind of thing very much. But I'd say overwriting a record without taking any other precautions is liable to fail the "C" and "D" properties ("consistent" and "durable").
To build really good code, imagine that your dbms server has no battery-backed cache, only one power supply, and that during a transaction there's a catastrophic failure in that one power supply. If your dbms can cope with that kind of failure fairly cleanly, I think you can call it ACID compliant.
Later . . .
I read Tweedie's transcript. He's not talking about database direct disk access; he's talking about a journaling filesystem. A journaling filesystem also does a two-phase commit.
It sounds like you're trying to reach ACID compliance (in the database sense) with a single-phase commit. I don't think you can get away with that.
Opening with O_DIRECT means "Try to minimize cache effects of the I/O to and from this file" (emphasis added). I think you'll also need O_SYNC. (But the linked kernel docs caution that most Linux filesystems don't implement POSIX semantics of O_SYNC. And both filesystems and disks have been known to lie about whether a write has hit a platter.)
There are two more cautions in the kernel docs. First, "It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default." You're not doing that. You're trying to use it to achieve ACID compliance.
Second,
"The thing that has always disturbed me about O_DIRECT is that the
whole interface is just stupid, and was probably designed by a
deranged monkey on some serious mind-controlling substances." -- Linus
SQLite has a readable paper on how they handle atomic commits. Atomic Commit in SQLite
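To illustrate the O_DIRECT + O_SYNC point in code: O_DIRECT writes also have alignment requirements (buffer address, length, and file offset generally must be multiples of the logical block size), so a block overwrite along the lines the question describes might be set up roughly like this; the 512-byte block size is the question's assumption, and none of this changes the atomicity caveats discussed above:

#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 512       /* assumed logical block size, per the question */

/* Overwrite one block in place, as the question describes. */
int write_block_direct(const char *path, off_t block_no, const void *block)
{
    /* O_DIRECT bypasses the page cache; O_SYNC makes write() wait until the
     * device reports the data (and required metadata) as written. */
    int fd = open(path, O_WRONLY | O_DIRECT | O_SYNC);
    if (fd < 0) return -1;

    void *buf;
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0) {  /* aligned buffer */
        close(fd);
        return -1;
    }
    memcpy(buf, block, BLOCK_SIZE);

    ssize_t n = pwrite(fd, buf, BLOCK_SIZE, block_no * BLOCK_SIZE);

    free(buf);
    close(fd);
    return n == (ssize_t)BLOCK_SIZE ? 0 : -1;
}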
No.
You cannot assume the disk write will be successful. And you cannot assume that the disk will leave the existing data in place. Here is some QNX documentation also stating this.
If you get really, really unlucky, the disk power will fail while it is writing, leaving the block with corrupt checksums and half-written data.
This is why ACID systems use at least two copies of the data.
is write() with O_DIRECT ACID compliant?
No, this is not guaranteed in the general case. Here are some counterexamples for Durability:
O_DIRECT makes no guarantees that acknowledged data made it out of a volatile cache that is part of the device
O_DIRECT makes no guarantees about persistence of filesystem metadata that might be required to actually read back the (acknowledged) write data (e.g. in the case of appending writes)
My question is, can I claim ACID compliance if I am writing the record over itself every time UPDATE occurs?
In the general case no. For example a spec compliant SCSI disk doesn't have to guarantee the semantics of only getting only the old or only the new data if a crash happens mid-write (it's legal for it to return an error reading that data until the region is unconditionally overwritten). If you're doing a write to a file in a filesystem then things are even more complicated. Having a successful fsync() after the write() before you issue new I/O will help you to know the write was stable but is not enough to ensure Atomicity (only old or new data) in the general case of awkwardly timed power loss.
Does my method [assuming overwrites are perfectly atomic] have any potential problems?
Yes, see above. What you are doing may work as you wish in certain setups but there's no guarantee it should work in all (even though they are "non-faulty" per their spec).
See this answer on "What does O_DIRECT really mean?" for further discussion.

Writing to a remote file: When does write() really return?

I have a client node writing a file to a hard disk that is on another node (I am writing to a parallel fs actually).
What I want to understand is:
When I write() (or pwrite()), when exactly does the write call return?
I see three possibilities:
1. write returns immediately after queueing the I/O operation on the client side:
In this case, write can return before data has actually left the client node. (If you are writing to a local hard drive, the write call employs delayed writes, where data is simply queued up for writing. But does this also happen when you are writing to a remote hard disk?) I wrote a testcase in which I write a large matrix (1 GByte) to a file; a sketch of such a test appears after this question. Without fsync, it showed very high bandwidth values, whereas with fsync, the results looked more realistic. So it looks like it could be using delayed writes.
2. write returns after the data has been transferred to the server buffer:
Now the data is on the server, but it resides in a buffer in its main memory and is not yet permanently stored on the hard drive. In this case, I/O time should be dominated by the time to transfer the data over the network.
3. write returns after the data has actually been stored on the hard drive:
Which I am sure does not happen by default (unless you write really large files which cause your RAM to fill up and ultimately get flushed out, and so on...).
Additionally, what I would like to be sure about is:
Can a situation occur where the program terminates without any data actually having left the client node, such that network parameters like latency, bandwidth, and the hard drive bandwidth do not feature in the program's execution time at all? Consider we do not do an fsync or something similar.
EDIT: I am using the pvfs2 parallel file system
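The testcase described under possibility 1 might look roughly like this in C; the 1 GByte size is the question's figure, the file name is arbitrary, and the point is only that without the final fsync() the loop may mostly be measuring how fast the client can dirty its own page cache:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define TOTAL  (1024L * 1024 * 1024)   /* 1 GByte, as in the question */
#define CHUNK  (1 << 20)               /* write in 1 MiB pieces */

int main(void)
{
    char *buf = malloc(CHUNK);
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (!buf || fd < 0) { perror("setup"); return 1; }
    memset(buf, 'x', CHUNK);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (long written = 0; written < TOTAL; written += CHUNK)
        if (write(fd, buf, CHUNK) != CHUNK) { perror("write"); return 1; }

    /* Without this fsync(), the loop above may only have filled the client's
     * page cache, so the apparent bandwidth is unrealistically high. */
    fsync(fd);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f MiB/s\n", TOTAL / (1024.0 * 1024.0) / secs);

    free(buf);
    close(fd);
    return 0;
}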
Option 3. is of course simple, and safe. However, a production quality POSIX compatible parallel file system with performance good enough that anyone actually cares to use it, will typically use option 1 combined with some more or less involved mechanism to avoid conflicts when e.g. several clients cache the same file.
As the saying goes, "There are only two hard things in Computer Science: cache invalidation and naming things and off-by-one errors".
If the filesystem is supposed to be POSIX compatible, you need to go and learn POSIX fs semantics, and look up how the fs supports these while getting good performance (alternatively, which parts of POSIX semantics it skips, a la NFS). What makes this, err, interesting is that POSIX fs semantics hark back to the 1970s with little to no thought of how to support network filesystems.
I don't know about pvfs2 specifically, but typically in order to conform to POSIX and provide decent performance, option 1 can be used together with some kind of cache coherency protocol (which e.g. Lustre does). For fsync(), the data must then actually be transferred to the server and committed to stable storage on the server (disks or battery-backed write cache) before fsync() returns. And of course, the client has some limit on the amount of dirty pages, after which it will block further write()'s to the file until some have been transferred to the server.
You can get any of your three options. It depends on the flags you provide to the open call. It depends on how the filesystem was mounted locally. It also depends on how the remote server is configured.
The following are all taken from Linux. Solaris and others may differ.
Some important open flags are O_SYNC, O_DIRECT, O_DSYNC, O_RSYNC.
Some important mount flags for NFS are ac, noac, cto, nocto, lookupcache, sync, async.
Some important flags for exporting NFS are sync, async, no_wdelay. And of course the mount flags of the filesystem that NFS is exporting are important as well. For example, if you were exporting XFS or EXT4 from Linux and for some reason you used the nobarrier flag, a power loss on the server side would almost certainly result in lost data.

Resources