The man page for open(2) only suggests that O_DIRECT bypasses the page cache, but many descriptions around the net describe it as causing the user buffer to be DMA'd straight to the drive. If this is the case, I imagine it would also bypass journaling done by the filesystem (e.g. xfs, ext4, etc.). Is this the case?
I can't find anyone claiming one way or the other. It seems to me this would be consistent with O_DIRECT being used by databases -- the common example use case for O_DIRECT is an application like a database doing its own caching in userspace, and similarly I can imagine databases doing their own transaction logs.
Does O_DIRECT bypass filesystem journaling?
Usually it does. However, file data usually doesn't go into a filesystem's journal anyway. More details below (but note this answer doesn't try to account for CoW filesystems):
Most Linux journaling filesystems (Ext4 when data is set to writeback or ordered (the default), XFS, JFS, etc.) do not journal the data within files - they journal the consistency of the filesystem's data structures (metadata).
Filesystem journals only metadata (typical case): data within files doesn't go into the journal anyway, so using O_DIRECT doesn't change this - the data still doesn't go into the journal. However, O_DIRECT operations can still trigger metadata updates just as normal writes do, but the initiating operation can return before the metadata has been updated. See the Ext4 wiki's Clarifying Direct IO's Semantics page for details.
Ext4 in data=journal mode: This is trickier - there's a warning that the outcome with O_DIRECT in data=journal mode might not be what is expected. From the "data=journal" section of ext4.txt:
Enabling this mode [data=journal] will disable delayed allocation and O_DIRECT support.
In this scenario O_DIRECT is treated as only a hint and the filesystem silently falls back to stuffing the data into the page cache (making it no longer direct!). So in this case, yes, the data will end up going through the journal and the journal won't be bypassed. See the "Re: [PATCH 1/1 linux-next] ext4: add compatibility flag check" thread for where Ted Ts'o articulates this. There are patches floating around ("ext4: refuse O_DIRECT opens for mode where DIO doesn't work") to make the filesystem return an error at open() instead, but from what I can see these were not accepted into the mainline kernel.
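To make the metadata point concrete, here is a minimal C sketch (file name, block size and error handling are made up for illustration): the write itself bypasses the page cache, but an explicit fsync() is still needed if you want the metadata that the write may have triggered - say, a new file size - to be durable:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* "datafile" and the 4096-byte block size are made up for illustration. */
    int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    /* O_DIRECT needs the buffer, file offset and length suitably aligned. */
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
    memset(buf, 'x', 4096);

    /* The data itself bypasses the page cache (and the journal)... */
    if (pwrite(fd, buf, 4096, 0) != 4096) { perror("pwrite"); return 1; }

    /* ...but metadata the write may have changed (file size, block
       allocation) is only guaranteed stable after an explicit flush. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }

    free(buf);
    close(fd);
    return 0;
}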
Related
I've encountered a race condition with LVM and a disk driver I'm working on. It looks like things like vgcreate and lvcreate do their IO in O_DIRECT mode. I discovered this when running those commands with -vvv.
Clearing start of logical volume "test"
/dev/Finance-PG-vg/test: Added to device cache
Opened /dev/Finance-PG-vg/test RW O_DIRECT
Wiping /dev/Finance-PG-vg/test at sector 0 length 8 sectors
/dev/Finance-PG-vg/test: block size is 4096 bytes
Closed /dev/Finance-PG-vg/test
Specifically, I suspect that our reads are hitting the cache, and not getting the latest disk contents.
If something is written with O_DIRECT, my understanding is that this bypasses the cache. Therefore any reads of that sector are going to receive the old data from the cache, at least until the cache is invalidated. So if I want to read whatever O_DIRECT just wrote within a few seconds, should I be dropping the cache first?
Correct?
There are several confusions here:
The tools you mention most likely use O_DIRECT to make sure the new LVM configuration is persistent. The LVM metadata is actually stored in a specific location on all the physical disks/partitions you provide.
Writing to LVM devices does not use O_DIRECT by default (although you can pass this flag when you open a file).
Bypassing the cache with O_DIRECT does not mean that you get stale data. Let's assume that you open a file, write to it, close it, then open it again with O_DIRECT, and read the file. The read is guaranteed to return the latest changes to the file. Stale data is never returned. There is no need to drop caches when using O_DIRECT.
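As a concrete illustration of that coherency claim, here is a small C sketch (file name invented, error handling abridged) that writes a block through the page cache and then reads it back with O_DIRECT; the direct read is expected to see the buffered write without any cache dropping:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Buffered write of one 4096-byte block (file name is made up). */
    char block[4096];
    memset(block, 'A', sizeof block);
    int fd = open("coherency-test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, block, sizeof block) != sizeof block) { perror("write"); return 1; }
    close(fd);

    /* Re-open with O_DIRECT and read the same block back via an aligned buffer. */
    fd = open("coherency-test", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open O_DIRECT"); return 1; }
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
    if (pread(fd, buf, 4096, 0) != 4096) { perror("pread"); return 1; }

    /* The direct read sees the data just written through the cache. */
    printf("first byte read back: %c\n", ((char *)buf)[0]);

    free(buf);
    close(fd);
    return 0;
}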
I want to write a library which logs data to a file. Unfortunately, my system suffers from unexpected reboots and power loss.
Does a Linux write() operation to a file guarantee that my file will always contain consistent data? Does it guarantee "all or nothing"?
If so, is there a limitation on the size of the data being written?
Thanks.
When you mount the file system you can specify one of the options below. It seems like the third one suits my requirements.
This is what I found at
http://lxr.free-electrons.com/source/Documentation/filesystems/ext3.txt
Data Mode
There are 3 different data modes:
writeback mode
In data=writeback mode, ext3 does not journal data at all. This mode provides
a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
mode - metadata journaling. A crash+recovery can cause incorrect data to
appear in files which were written shortly before the crash. This mode will
typically provide the best ext3 performance.
ordered mode
In data=ordered mode, ext3 only officially journals metadata, but it logically
groups metadata and data blocks into a single unit called a transaction. When
it's time to write the new metadata out to disk, the associated data blocks
are written first. In general, this mode performs slightly slower than
writeback but significantly faster than journal mode.
journal mode
data=journal mode provides full data and metadata journaling. All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state. This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
outperforms all other modes.
You can never predict where a physical write operation stops on a power outage, even if you use the journaling features of some filesystems. Note that the journal needs to get written too.
I want to know under what circumstances a direct I/O transfer will fail?
I have the following three sub-queries for that. As per the "Understanding the Linux Kernel" book:
Linux offers a simple way to bypass the page cache: direct I/O transfers. In each I/O direct transfer, the kernel programs the disk controller to transfer the data directly from/to pages belonging to the User Mode address space of a self-caching application.
-- So to explain a failure, does one need to check whether the application has a self-caching feature or not? Not sure how that can be done.
2. Furthermore the book says "When a self-caching application wishes to directly access a file, it opens the file specifying the O_DIRECT flag. While servicing the open() system call, the dentry_open() function checks whether the direct_IO method is implemented for the address_space object of the file being opened, and returns an error code in the opposite case".
-- Apart from this, is there any other reason that can explain a direct I/O failure?
3. Will this command "dd if=/dev/zero of=myfile bs=1M count=1 oflag=direct" ever fail in Linux (assuming ample disk space is available)?
The underlying filesystem and block device must support the O_DIRECT flag. This command will fail because tmpfs doesn't support O_DIRECT:
dd if=/dev/zero of=/dev/shm/test bs=1M count=1 oflag=direct
The write size must be a multiple of the block size of the underlying device. This command will fail because 123 is not a multiple of 512:
dd if=/dev/zero of=myfile bs=123 count=1 oflag=direct
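The same requirements can be seen from C. The sketch below (file name and sizes chosen only for illustration; exact behaviour can vary by filesystem and kernel) allocates a 512-byte-aligned buffer with posix_memalign() and shows a 123-byte direct write typically being refused with EINVAL while a 512-byte one goes through:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("myfile", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 512, 512) != 0) return 1;  /* 512-byte aligned buffer */
    memset(buf, 0, 512);

    /* A length that is not a multiple of the block size is typically
       rejected with EINVAL (behaviour varies by filesystem/kernel). */
    if (write(fd, buf, 123) < 0)
        printf("123-byte direct write failed: %s\n", strerror(errno));

    /* A length that is a multiple of the block size is accepted. */
    if (write(fd, buf, 512) == 512)
        printf("512-byte direct write succeeded\n");

    free(buf);
    close(fd);
    return 0;
}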
There are many reasons why direct I/O can go on to fail.
So to explain failure one needs to check whether application has self caching feature or not?
You can't do this externally - you have to either deduce this from the source code or watch how the program uses resources as it runs (or disassemble the binary, I guess). This is more a property of how the program does its work than a "turn this feature on in a call" flag. It would be a dangerous assumption to think all programs that use O_DIRECT have self-caching (probabilistically I'd say it's more likely, but you don't know for sure).
There are strict requirements for using O_DIRECT and they are mentioned in the man page of open (see the O_DIRECT section of NOTES).
With modern kernels the area being operated on must be aligned to, and its size must be a multiple of, the disk's block size. Failure to do this correctly may even result in silent fallback to buffered I/O.
Yes, for example when trying to use it on a filesystem (such as tmpfs) that doesn't support O_DIRECT. I suppose it could also fail if the path down to the disk returns failures for some reason (e.g. the disk is dying and returns the error much sooner than would happen with writeback).
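A common defensive pattern, given the above, is to attempt O_DIRECT and fall back to buffered I/O if the open() is refused. A rough sketch, assuming the filesystem reports lack of support as EINVAL at open time (the helper name is invented for this example):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Try O_DIRECT first; if the filesystem refuses it at open() time
   (e.g. tmpfs returning EINVAL), fall back to a plain buffered open. */
int open_maybe_direct(const char *path, int *is_direct)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd >= 0) {
        *is_direct = 1;
        return fd;
    }
    if (errno != EINVAL)
        return -1;                        /* some unrelated error: give up */
    *is_direct = 0;
    return open(path, O_WRONLY | O_CREAT, 0644);
}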
My database engine writes records of 64 bytes by issuing a write() syscall for the entire disk block. The device is opened in O_DIRECT mode. For example, the third record within a block starts at byte 128 and ends at byte 192; when I do an UPDATE, the entire disk block (which is by default 512 bytes) is written.
My question is, can I claim ACID compliance if I am writing the record over itself every time an UPDATE occurs? Usually database engines do this in 2 steps, by writing the modified disk block to another (free) place and then updating an index to the new block with one (atomic) write immediately after the first write returned success. But I am not doing this; I am overwriting the current data with the new data, expecting the write to be successful. Does my method have any potential problems? Is it ACID compliant? What if the hardware writes only half of the block and my record is exactly in the middle? Or does the hardware already do the 2-step write process I described, but at the block level, so I don't need to repeat the same in software?
(note: no record is larger than the physical disk block (512 bytes by default), fsync goes after each write(), and this is for Linux only)
ACID anticipates failures, and suggests ways to deal with them. Two-phase commits and three-phase commits are two fairly common and well-understood approaches.
Although I'm a database guy, the dbms frees me from having to think about this kind of thing very much. But I'd say overwriting a record without taking any other precautions is liable to fail the "C" and "D" properties ("consistent" and "durable").
To build really good code, imagine that your dbms server has no battery-backed cache, only one power supply, and that during a transaction there's a catastrophic failure in that one power supply. If your dbms can cope with that kind of failure fairly cleanly, I think you can call it ACID compliant.
Later . . .
I read Tweedie's transcript. He's not talking about database direct disk access; he's talking about a journaling filesystem. A journaling filesystem also does a two-phase commit.
It sounds like you're trying to reach ACID compliance (in the database sense) with a single-phase commit. I don't think you can get away with that.
Opening with O_DIRECT means "Try to minimize cache effects of the I/O to and from this file" (emphasis added). I think you'll also need O_SYNC. (But the linked kernel docs caution that most Linux filesystems don't implement POSIX semantics of O_SYNC. And both filesystems and disks have been known to lie about whether a write has hit a platter.)
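A minimal sketch in C of what that flag combination might look like (the file name is hypothetical, and the caveats above about what O_SYNC actually guarantees still apply):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* "records.db" is a hypothetical file.  O_SYNC asks that each write
       also flush the data and the metadata needed to read it back,
       subject to the filesystem/disk caveats discussed above. */
    int fd = open("records.db", O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* ... aligned O_DIRECT writes would follow here, as in the question ... */
    close(fd);
    return 0;
}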
There are two more cautions in the kernel docs. First, "It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default." You're not doing that. You're trying to use it to achieve ACID compliance.
Second,
"The thing that has always disturbed me about O_DIRECT is that the
whole interface is just stupid, and was probably designed by a
deranged monkey on some serious mind-controlling substances." -- Linus
SQLite has a readable paper on how they handle atomic commits. Atomic Commit in SQLite
No.
You cannot assume the disk write will be successful. And you cannot assume that the disk will leave the existing data in place. Here is some QNX documentation also stating this.
If you get really, really unlucky, the disk's power will fail while it is writing, leaving the block with corrupt checksums and half-written data.
This is why ACID systems use at least two copies of the data.
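As a very rough illustration of the two-copy idea (not the questioner's layout - the block numbers, sizes and recovery story are invented for this sketch): write the new version of the record to a scratch block, force it to stable storage, and only then overwrite the live copy, so a crash at any point leaves at least one intact copy:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 512   /* assumed physical block size, as in the question */

/* Invented layout for the sketch: block 0 of the file/device is a scratch
   ("double write") area, block 1 holds the live record.  A crash between
   the two steps leaves either the old live record intact or a complete
   scratch copy that recovery code could apply on restart. */
int update_record(int fd, const void *record, size_t len)
{
    void *blk;
    if (posix_memalign(&blk, BLOCK, BLOCK) != 0)
        return -1;
    memset(blk, 0, BLOCK);
    memcpy(blk, record, len < BLOCK ? len : BLOCK);

    /* Step 1: write the new version to the scratch block and make it durable. */
    if (pwrite(fd, blk, BLOCK, 0) != BLOCK || fsync(fd) != 0)
        goto fail;

    /* Step 2: only now overwrite the live copy, and make that durable too. */
    if (pwrite(fd, blk, BLOCK, BLOCK) != BLOCK || fsync(fd) != 0)
        goto fail;

    free(blk);
    return 0;
fail:
    free(blk);
    return -1;
}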
is write() with O_DIRECT ACID compliant?
No, this is not guaranteed in the general case. Here are some counterexamples for Durability:
O_DIRECT makes no guarantees that acknowledged data made it out of a volatile cache that is part of the device
O_DIRECT makes no guarantees about persistence of filesystem metadata that might be required to actually read back the (acknowledged) write data (e.g. in the case of appending writes)
My question is, can I claim ACID compliance if I am writing the record over itself every time an UPDATE occurs?
In the general case, no. For example, a spec-compliant SCSI disk doesn't have to guarantee that you get only the old or only the new data if a crash happens mid-write (it's legal for it to return an error when reading that data until the region is unconditionally overwritten). If you're doing a write to a file in a filesystem then things are even more complicated. A successful fsync() after the write(), before you issue new I/O, will help you know the write was stable, but it is not enough to ensure Atomicity (only old or new data) in the general case of awkwardly timed power loss.
Does my method [assuming overwrites are perfectly atomic] have any potential problems?
Yes, see above. What you are doing may work as you wish in certain setups, but there's no guarantee it will work in all of them (even though they are "non-faulty" per their spec).
See this answer on "What does O_DIRECT really mean?" for further discussion.
Is it safe to call rename(tmppath, path) without calling fsync(tmppath_fd) first?
I want the path to always point to a complete file.
I care mainly about Ext4. Is rename() promised to be safe in all future Linux kernel versions?
A usage example in Python:
import os

def store_atomically(path, data):
    tmppath = path + ".tmp"
    output = open(tmppath, "wb")
    output.write(data)
    output.flush()
    os.fsync(output.fileno())  # The needed fsync().
    output.close()
    os.rename(tmppath, path)
No.
Look at libeatmydata, and this presentation:
Eat My Data: How Everybody Gets File IO Wrong
http://www.oscon.com/oscon2008/public/schedule/detail/3172
by Stewart Smith from MySQL.
In case it is offline/no longer available, I keep a copy of it:
The video here
The presentation slides (online version of slides)
From the ext4 documentation:
When mounting an ext4 filesystem, the following option are accepted:
(*) == default
auto_da_alloc(*), noauto_da_alloc
Many broken applications don't use fsync() when replacing existing files via patterns such as fd = open("foo.new")/write(fd,..)/close(fd)/rename("foo.new", "foo"), or worse yet, fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). If auto_da_alloc is enabled, ext4 will detect the replace-via-rename and replace-via-truncate patterns and force that any delayed allocation blocks are allocated such that at the next journal commit, in the default data=ordered mode, the data blocks of the new file are forced to disk before the rename() operation is committed. This provides roughly the same level of guarantees as ext3, and avoids the "zero-length" problem that can happen when a system crashes before the delayed allocation blocks are forced to disk.
Judging by the wording "broken applications", it is definitely considered bad practice by the ext4 developers, but in practice the approach is so widely used that ext4 itself was patched to cope with it.
So if your usage fits the pattern, you should be safe.
If not, I suggest you investigate further instead of inserting fsync calls here and there just to be safe - that might not be such a good idea, since fsync can be a major performance hit on ext3.
On the other hand, flushing before the rename is the correct way to do the replacement on non-journaling file systems. Maybe that's why ext4 at first expected this behavior from programs; the auto_da_alloc option was added later as a fix. Also, this ext3 patch for writeback (non-journaling) mode tries to help careless programs by flushing asynchronously on rename to lower the chance of data loss.
You can read more about the ext4 problem here.
If you only care about ext4 and not ext3, then I'd recommend using fsync on the new file before doing the rename. fsync performance on ext4 seems to be much better than on ext3, without the very long delays - or that might be because writeback is the default mode (at least on my Linux system).
If you only care that the file is complete, and not which file the name in the directory points to, then you only need to fsync the new file. There's no need to fsync the directory too, since the name will point to either the new file with its complete data, or to the old file.
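For completeness: if you do also want the rename itself (i.e. which file the name points to) to survive a crash, the usual extra step is to fsync the containing directory after the rename. A hedged sketch of that step in C (the helper name and path handling are just for illustration):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Flush the directory that holds the renamed entry so that the rename()
   itself is on disk.  Call as e.g. fsync_dir("/some/dir") after rename(). */
int fsync_dir(const char *dirpath)
{
    int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dfd < 0) {
        perror("open directory");
        return -1;
    }
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}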