Writing consistent data to a file in Linux

I want to write a library which logs data to a file. Unfortunately, my system suffers from unexpected reboots and power loss.
Does a Linux write operation to a file guarantee that the file will always contain consistent data? Does it guarantee "all or nothing"?
If so, is there a limitation on the size of the data being written?
Thanks.

When you mount the file system you can specify one of the options below. It seems like the third one (data=journal) suits my requirements.
This is what I found at
http://lxr.free-electrons.com/source/Documentation/filesystems/ext3.txt
Data Mode
There are 3 different data modes:
writeback mode
In data=writeback mode, ext3 does not journal data at all. This mode provides
a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
mode - metadata journaling. A crash+recovery can cause incorrect data to
appear in files which were written shortly before the crash. This mode will
typically provide the best ext3 performance.
ordered mode
In data=ordered mode, ext3 only officially journals metadata, but it logically
groups metadata and data blocks into a single unit called a transaction. When
it's time to write the new metadata out to disk, the associated data blocks
are written first. In general, this mode performs slightly slower than
writeback but significantly faster than journal mode.
journal mode
data=journal mode provides full data and metadata journaling. All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state. This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
outperforms all other modes.
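For reference, the data mode is selected at mount time via the data= option. Below is a minimal sketch of doing that programmatically with mount(2); the device and mount point are made up for illustration, and the call needs root:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* hypothetical device and mount point; "data=journal" requests full data journaling */
        if (mount("/dev/mmcblk0p2", "/mnt/log", "ext3", 0, "data=journal") != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }

In practice the same option is usually set in /etc/fstab or passed on the command line as mount -o data=journal.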

You can never predict where a physical write operation stops on a power outage, even if you use the journaling features of some filesystems. Note that the journal needs to get written, too.

Related

Ext4 on magnetic disk: Is it possible to process an arbitrary list of files in a seek-optimized manner?

I have a deduplicated storage of some million files in a two-level hashed directory structure. The filesystem is an ext4 partition on a magnetic disk. The path of a file is computed by its MD5 hash like this:
e93ac67def11bbef905a7519efbe3aa7 -> e9/3a/e93ac67def11bbef905a7519efbe3aa7
When processing* a list of files sequentially (selected by metadata stored in a separate database), I can literally hear the noise produced by the seeks ("randomized", I assume, by the hashed directory layout).
My actual question is: Is there a (generic) way to process a potentially long list of potentially small files in a seek-optimized manner, given that they are stored on an ext4 partition on a magnetic disk (implying the use of Linux)?
Such optimization is of course only useful if there is a sufficient share of small files, so please don't worry too much about the size distribution of files. Without loss of generality, you may assume that there are only small files in each list.
As a potential solution, I was thinking of sorting the files by their physical disk locations or by other (heuristic) criteria that can be related to the total amount and length of the seek operations needed to process the entire list.
A note on file types and use cases for illustration (if need be)
The files are a deduplicated backup of several desktop machines. So any file you would typically find on a personal computer will be included on the partition. The processing however will affect only a subset of interest that is selected via the database.
Here are some use cases for illustration (list is not exhaustive):
extract metadata from media files (ID3, EXIF etc.) (files may be large, but only some small parts of the files are read, so they become effectively smaller)
compute smaller versions of all JPEG images to process them with a classifier
reading portions of the storage for compression and/or encryption (e.g. put all files newer than X and smaller than Y in a tar archive)
extract the headlines of all Word documents
recompute all MD5 hashes to verify data integrity
While researching for this question, I learned of the FIBMAP ioctl command (e.g. mentioned here), which may be worth a shot, because the files will not be moved around and the results may be stored alongside the metadata. But I suppose that will only work as a sort criterion if the location of a file's inode correlates somewhat with the location of its contents. Is that true for ext4?
*) i.e. opening each file and reading the head of the file (arbitrary number of bytes) or the entire file into memory.
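For reference, the FIBMAP lookup mentioned above boils down to something like the following minimal sketch (the path comes from the command line; the ioctl normally requires root, i.e. CAP_SYS_RAWIO, and maps one logical block per call):

    #include <fcntl.h>
    #include <linux/fs.h>     /* FIBMAP, FIGETBSZ */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        int blocksize = 0;
        if (ioctl(fd, FIGETBSZ, &blocksize) < 0) { perror("FIGETBSZ"); return 1; }

        int block = 0;                         /* logical block 0 of the file ... */
        if (ioctl(fd, FIBMAP, &block) < 0) { perror("FIBMAP"); return 1; }

        /* ... has been mapped to physical block 'block'; usable as a coarse sort key */
        printf("%s: first physical block %d (block size %d)\n", argv[1], block, blocksize);
        close(fd);
        return 0;
    }

This only maps the first data block of each file; whether sorting by it actually helps is what the answers below address.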
A file (especially a large enough one) is scattered across several blocks on the disk (see e.g. the figure on the ext2 Wikipedia page; it is still roughly relevant for ext4, even if the details differ). More importantly, the file could be in the page cache (so reading it won't require any disk access). So "sorting the file list by disk location" usually does not make much sense.
I recommend instead improving the code accessing these files. Look into system calls like posix_fadvise(2) and readahead(2).
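For instance, here is a small sketch of hinting the kernel before a read pass over one open file (both calls are advisory only):

    #include <fcntl.h>

    /* Advise the kernel about an upcoming sequential read of this descriptor;
     * POSIX_FADV_WILLNEED additionally asks it to start fetching pages now. */
    static void hint_sequential_read(int fd)
    {
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    }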
If the files are really small (only hundreds of bytes each), it is probable that using something else (e.g. SQLite, some real RDBMS like PostgreSQL, or gdbm ...) could be faster.
BTW, adding more RAM would enlarge the page cache and so improve the overall experience. And replacing your HDD with an SSD would also help.
(see also linuxatemyram)
Is it possible to sort a list of files to optimize read speed / minimize seek times?
That is not really possible. File system fragmentation is not (in practice) important with ext4. Of course, backing up your whole file system (e.g. into some tar or cpio archive) and restoring it sequentially (after making a fresh file system with mkfs) might slightly lower fragmentation, but not by much.
You might optimize your file system settings (block size, cluster size, etc.; e.g. various arguments to mke2fs(8)). See also ext4(5).
Is there a (generic) way to process a potentially long list of potentially small files in a seek-optimized manner?
If the list is not too long (otherwise, split it into chunks of several hundred files each), you might open(2) each file there and use readahead(2) on each such file descriptor (and then close(2) it). This would prefill your page cache somewhat (and the kernel could reorder the required I/O operations).
(I don't know how effective that is in your case; you need to benchmark.)
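A rough sketch of such a prefill pass over one chunk, assuming a NULL-terminated array of paths (the 128 KiB readahead amount is an arbitrary illustration, not a recommendation):

    #define _GNU_SOURCE            /* for readahead(2) */
    #include <fcntl.h>
    #include <stddef.h>
    #include <unistd.h>

    /* First pass: ask the kernel to start reading the head of each file so that a
     * second (real) processing pass mostly hits the page cache. */
    static void prefill_chunk(char **paths)
    {
        for (size_t i = 0; paths[i] != NULL; i++) {
            int fd = open(paths[i], O_RDONLY);
            if (fd < 0)
                continue;                      /* skip unreadable files */
            readahead(fd, 0, 128 * 1024);      /* request asynchronous readahead of the first 128 KiB */
            close(fd);
        }
    }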
I am not sure there is a software solution to your issue. Your problem is likely IO-bound, so the bottleneck is probably the hardware.
Notice that on most current hard disks, the CHS addressing (used by the kernel) is some "logical" addressing handled by the disk controller and is not much related to the physical geometry any more. Read about LBA, TCQ, NCQ (so today, the kernel has no direct influence on the actual mechanical movements of a hard disk head). I/O scheduling mostly happens in the hard disk itself (not so much in the kernel).

Does O_DIRECT bypass filesystem journaling?

The man page for open(2) only suggests that O_DIRECT bypasses the page cache, but many descriptions around the net describe it as causing the user buffer to be DMA'd straight to the drive. If this is the case, I imagine it would also bypass journaling done by the filesystem (e.g. XFS, ext4, etc.). Is this the case?
I can't find anyone claiming one way or the other. It seems to me this would be consistent with O_DIRECT being used by databases -- the common example use for O_DIRECT is when an application like a database is doing its own caching in userspace, and similarly I can imagine databases doing their own transaction logs.
Does O_DIRECT bypass filesystem journaling?
Usually it does. However, file data usually doesn't go into a filesystem's journal anyway. More details below (but note this answer doesn't try to account for CoW filesystems):
Most Linux journaling filesystems (ext4 with data set to writeback or ordered (the default), XFS, JFS, etc.) do not journal the data within files - they journal the filesystem's data structures (metadata) to keep them consistent.
Filesystem journals only metadata (the typical case): data within files doesn't go into the journal anyway, so using O_DIRECT doesn't change this and the data continues not to go into the journal. However, O_DIRECT operations can still trigger metadata updates just like normal operations, but the initiating operation can return before the metadata has been updated. See the Ext4 wiki's "Clarifying Direct IO's Semantics" page for details.
Ext4 in data=journal mode: This is trickier - there's a warning that the outcome with O_DIRECT in data=journal mode might not be what is expected. From the "data=journal" section of ext4.txt:
Enabling this mode [data=journal] will disable delayed allocation and O_DIRECT support.
In this scenario O_DIRECT is treated as only a hint, and the filesystem silently falls back to stuffing the data into the page cache (making it no longer direct!). So in this case, yes, the data will end up going into the journal and the journal won't be bypassed. See the "Re: [PATCH 1/1 linux-next] ext4: add compatibility flag check" thread for where Ted Ts'o articulates this. There are patches floating around ("ext4: refuse O_DIRECT opens for mode where DIO doesn't work") to make the filesystem return an error at open instead, but from what I can see these were rejected from the mainline kernel.
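As a side note, here is a minimal sketch of what an O_DIRECT write on a metadata-only-journaling filesystem typically involves (the file name and the 4096-byte alignment are assumptions; the required alignment depends on the filesystem and device, and the trailing fsync() is what pushes out the related metadata):

    #define _GNU_SOURCE            /* needed for O_DIRECT on glibc */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* hypothetical file; O_DIRECT requires suitably aligned buffers, offsets and sizes */
        int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return 1;

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0)     /* 4096-byte alignment is an assumption */
            return 1;
        memset(buf, 'x', 4096);

        if (pwrite(fd, buf, 4096, 0) != 4096)          /* data bypasses the page cache */
            return 1;
        if (fsync(fd) != 0)                            /* metadata and device caches still need an explicit flush */
            return 1;

        free(buf);
        return close(fd);
    }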

Prevent data corruption

I'm working on an embedded Linux system running on an ARM9.
The filesystem is ext4, mounted with (rw, sync, noatime, data=writeback).
I implemented a process that writes/reads to an SQLite3 database in Write-Ahead Logging (WAL) mode, with synchronization disabled. When a power loss happens, I have around two seconds to save all data by syncing and checkpointing the DB. But still, I see that the DB sometimes gets corrupted, which is really not acceptable in my case.
I would like to write a new DB engine for my purpose, similar to SQLite in that the DB is held in one file. But in this case, I'm thinking of writing the header data to one sector and the rest of the data at least two sectors later, so the DB file will be larger, but writing data will not ruin the header of the file, which holds the indexes etc. That way, only the last data will be corrupted and not the whole file, as happens with SQLite.
My question is: is my approach right?
You can use the ping-pong technique.
With the ping-pong technique you use two separate files and write to them alternately. If a power loss occurs, in the worst case you have at most one corrupted file and you can safely use the other one. In the best case neither of them is corrupted and you can continue using the latest one.
A corrupted file is easily detected if you use hash functions or other CRC schemes.
Obviously this scheme doesn't save you from the write cache or other disk caching mechanisms that may be working under the hood.
Alternatively, you can use a journaling file system which features data integrity protection on its own.
Be aware that ping-pong and journaling schemes ensure only data integrity. Data loss could still occur. Data integrity and data loss are two completely different things.
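A minimal sketch of the ping-pong idea, with the file names, record layout and checksum all invented for illustration (a real implementation would use a proper CRC such as CRC32):

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>

    struct record {
        uint64_t generation;     /* monotonically increasing version number */
        uint32_t checksum;       /* checksum of payload; a mismatch marks the file as corrupt */
        char     payload[500];
    };

    /* stand-in for a real CRC (FNV-1a here just to keep the sketch self-contained) */
    static uint32_t fnv1a(const void *data, size_t len)
    {
        const unsigned char *p = data;
        uint32_t h = 2166136261u;
        while (len--) { h ^= *p++; h *= 16777619u; }
        return h;
    }

    /* Write the record to "the other" file; the reader later picks the file with
     * the highest generation whose checksum validates. */
    static int save(struct record *rec, uint64_t generation)
    {
        rec->generation = generation;
        rec->checksum = fnv1a(rec->payload, sizeof rec->payload);

        const char *path = (generation & 1) ? "state.a" : "state.b";   /* alternate targets */
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, rec, sizeof *rec) != (ssize_t)sizeof *rec || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }

As noted above, this protects integrity but not against losing the most recent write, and the fsync() is only as effective as the caching layers beneath it allow.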

is write() with O_DIRECT ACID compliant?

My database engine writes 64-byte records by issuing a write() syscall for the entire disk block. The device is opened in O_DIRECT mode. For example, the third record within a block starts at byte 128 and ends at position 192; when I do an UPDATE, the entire disk block (which is 512 bytes by default) is written.
My question is, can I claim ACID compliance if I am writing the record over itself every time an UPDATE occurs? Usually database engines do this in two steps: writing the modified disk block to another (free) place, and then updating an index to the new block with one (atomic) write immediately after the first write has returned success. But I am not doing this; I am overwriting the current data with the new data, expecting the write to be successful. Does my method have any potential problems? Is it ACID compliant? What if the hardware writes only half of the block and my record is exactly in the middle? Or does the hardware already do the two-step write process I described, but at the block level, so I don't need to repeat the same in software?
(Note: no record is larger than a physical disk block (512 bytes by default), an fsync follows each write(), and this is for Linux only.)
ACID anticipates failures, and suggests ways to deal with them. Two-phase commits and three-phase commits are two fairly common and well-understood approaches.
Although I'm a database guy, the dbms frees me from having to think about this kind of thing very much. But I'd say overwriting a record without taking any other precautions is liable to fail the "C" and "D" properties ("consistent" and "durable").
To build really good code, imagine that your dbms server has no battery-backed cache, only one power supply, and that during a transaction there's a catastrophic failure in that one power supply. If your dbms can cope with that kind of failure fairly cleanly, I think you can call it ACID compliant.
Later . . .
I read Tweedie's transcript. He's not talking about database direct disk access; he's talking about a journaling filesystem. A journaling filesystem also does a two-phase commit.
It sounds like you're trying to reach ACID compliance (in the database sense) with a single-phase commit. I don't think you can get away with that.
Opening with O_DIRECT means "Try to minimize cache effects of the I/O to and from this file" (emphasis added). I think you'll also need O_SYNC. (But the linked kernel docs caution that most Linux filesystems don't implement POSIX semantics of O_SYNC. And both filesystems and disks have been known to lie about whether a write has hit a platter.)
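For what it's worth, combining the two flags looks like this (a sketch only; whether O_SYNC gives true platter-level durability still depends on the filesystem and drive, per the caveats above):

    #define _GNU_SOURCE          /* for O_DIRECT */
    #include <fcntl.h>

    /* Open a device or file for direct, synchronous writes: O_DIRECT bypasses the
     * page cache, O_SYNC makes each write also wait for data and metadata. */
    static int open_direct_sync(const char *path)
    {
        return open(path, O_WRONLY | O_DIRECT | O_SYNC);
    }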
There are two more cautions in the kernel docs. First, "It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default." You're not doing that. You're trying to use it to achieve ACID compliance.
Second,
"The thing that has always disturbed me about O_DIRECT is that the
whole interface is just stupid, and was probably designed by a
deranged monkey on some serious mind-controlling substances." -- Linus
SQLite has a readable paper on how they handle atomic commits. Atomic Commit in SQLite
No.
You cannot assume the disk write will be successful. And you cannot assume that the disk will leave the existing data in place. Here is some QNX documentation also stating this.
If you get really, really unlucky, the disk's power will fail while it is writing, leaving the block with corrupt checksums and half-written data.
This is why ACID systems use at least two copies of the data.
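To illustrate the "two copies" point for data living in a filesystem, here is a rough sketch of the common write-elsewhere-then-switch pattern (names are hypothetical; on a raw block device the equivalent is keeping two slots plus a small pointer/sequence record):

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Write the new block to a side file, make it durable, then atomically replace
     * the old file and sync the directory so the rename itself survives a crash. */
    static int update_block(const char *path, const void *block, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.new", path);      /* hypothetical naming scheme */

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, block, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        close(fd);

        if (rename(tmp, path) != 0)                     /* atomic switch to the new copy */
            return -1;

        int dfd = open(".", O_RDONLY);                  /* assumes 'path' lives in the current directory */
        if (dfd < 0)
            return -1;
        int rc = fsync(dfd);                            /* persist the new directory entry */
        close(dfd);
        return rc;
    }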
is write() with O_DIRECT ACID compliant?
No, this is not guaranteed in the general case. Here are some counterexamples for Durability:
O_DIRECT makes no guarantees that acknowledged data made it out of a volatile cache that is part of the device
O_DIRECT makes no guarantees about persistence of filesystem metadata that might be required to actually read back the (acknowledged) write data (e.g. in the case of appending writes)
My question is, can I claim ACID compliance if I am writing the record over itself every time an UPDATE occurs?
In the general case, no. For example, a spec-compliant SCSI disk doesn't have to guarantee the semantics of getting only the old or only the new data if a crash happens mid-write (it's legal for it to return an error when reading that data until the region is unconditionally overwritten). If you're writing to a file in a filesystem, things are even more complicated. Having a successful fsync() after the write() and before you issue new I/O helps you know the write was stable, but it is not enough to ensure Atomicity (only old or new data) in the general case of an awkwardly timed power loss.
Does my method [assuming overwrites are perfectly atomic] have any potential problems?
Yes, see above. What you are doing may work as you wish on certain setups, but there's no guarantee it will work on all of them (even ones that are "non-faulty" per their spec).
See this answer on "What does O_DIRECT really mean?" for further discussion.

Possible to implement journaling with a single fsync per commit?

Let's say you're building a journaling/write-ahead-logging storage system. Can you simply implement this by (for each transaction) appending the data (with write(2)), appending a commit marker, and then fsync-ing?
The scenario to consider is if you do a large set of writes to this log then fsync it, and there's a failure during the fsync. Are the inode direct/indirect block pointers flushed only after all data blocks are flushed, or are there no guarantees that blocks are being flushed in order? If the latter, then during recovery, if you see a commit marker at the end of the file, you can't trust that the data between it and the previous commit marker is meaningful. Thus you have to rely on another mechanism (involving at least another fsync) to determine what extent of the log file is consistent (e.g., writing/fsyncing the data, then writing/fsyncing the commit marker).
If it makes a difference, mainly wondering about ext3/ext4 as the context.
Note that Linux's and Mac OS's fsync and fdatasync are incorrect by default: they may return before the data has actually reached stable storage (e.g. while it still sits in the drive's volatile write cache). Windows is correct by default, but can emulate Linux's behavior for benchmarking purposes.
Also, fdatasync issues multiple disk writes if you append to the end of a file, since it needs to update the file inode with the new length. If you want to have one write per commit, your best bet is to pre-allocate log space, store a CRC of the log entries in the commit marker, and issue a single fdatasync() at commit. That way, no matter how much the OS / hardware reorder behind your back, you can find a prefix of the log that actually hit disk.
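A rough sketch of that recipe (the marker layout and checksum are invented for illustration; the log file is assumed to have been preallocated to its full size up front, e.g. with posix_fallocate(3), so a commit never changes the file length):

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>

    struct commit_marker {
        uint32_t length;     /* bytes of payload that precede this marker */
        uint32_t checksum;   /* checksum of that payload; recovery trusts the prefix only if it matches */
    };

    /* stand-in for a real CRC, just to keep the sketch self-contained */
    static uint32_t fnv1a(const void *data, size_t len)
    {
        const unsigned char *p = data;
        uint32_t h = 2166136261u;
        while (len--) { h ^= *p++; h *= 16777619u; }
        return h;
    }

    /* Append one transaction at 'offset' inside the preallocated log and publish it
     * with a single fdatasync(); since the file size never changes, no inode length
     * update is needed for the commit to be recoverable. */
    static int log_commit(int fd, off_t offset, const void *payload, uint32_t len)
    {
        struct commit_marker m = { .length = len, .checksum = fnv1a(payload, len) };

        if (pwrite(fd, payload, len, offset) != (ssize_t)len)
            return -1;
        if (pwrite(fd, &m, sizeof m, offset + len) != (ssize_t)sizeof m)
            return -1;
        return fdatasync(fd);
    }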
If you want to use the log for durable commits or write ahead, things get harder, since you need to make sure that fsync actually works. Under Linux, you'll want to disable the disk write cache with hdparm, or mount the partition with barrier set to true. [Edit: I stand corrected, barrier doesn't seem to give the correct semantics. SATA and SCSI introduce a number of primitives, such as write barriers and native command queuing, that make it possible for operating systems to export primitives that enable write-ahead logging. From what I can tell from manpages and online, Linux only exposes these to filesystem developers, not to userspace.]
Paradoxically, disabling the disk write cache sometimes leads to better performance, since you get more control over write scheduling in user space; if the disk queues up a bunch of synchronous write requests, you end up exposing strange latency spikes to the application. Disabling write cache prevents this from happening.
Finally, real systems use group commit, and do < 1 sync write per commit with concurrent workloads.
There's no guarantee on the order in which blocks are flushed to disk. These days even the drive itself can re-order blocks on their way to the platters.
If you want to enforce ordering, you need to at least fdatasync() between the writes that you want ordered. All a sync promises is that when it returns, everything written before the sync has hit storage.
