I'm confused with fsync + direct IO.
It's easy to understand the code like this:
fd = open(filename, O_RDWR, 0644);
write(fd, data, size);
fsync(fd);
In this case, write() will write the data to the page cache, and fsync will force all modified data in the page cache referred to by the fd to the disk device.
But if we open a file with O_DIRECT flag, like this,
fd = open(filename, O_RDWR|O_DIRECT, 0644);
write(fd, data, size);
fsync(fd);
In this case, write() bypasses the page cache and writes directly to the disk device. So what will fsync() do? There are no dirty pages in the page cache referred to by the fd.
And if we open a raw device, what will fsync() do?
fd = open("/dev/sda", O_RDWR|O_DIRECT, 0644);
write(fd, data, size);
fsync(fd);
In this case, we open a raw device with O_DIRECT; there is no filesystem on this device. What will fsync() do here?
The filesystem might not implement O_DIRECT at all, in which case it will have no effect.
If it does implement O_DIRECT, then that still doesn't mean it goes to disk, it only means it's minimally cached by the page cache. It could still be cached elsewhere, even in hardware buffers.
fsync(2) is an explicit contract between the kernel and application to persist the data such that it doesn't get lost and is guaranteed to be available to the next thing that wants to access it.
With device files, the device drivers are the ones implementing flags, including O_DIRECT.
Linux does use the page cache to cache access to block devices, and does support O_DIRECT in order to minimize cache interaction when writing directly to a block device.
In both cases, you need fsync(2) or a call with an equivalent guarantee in order to be sure that the data is persistent on disk.
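A minimal sketch of the pattern discussed above, assuming a Linux filesystem that accepts O_DIRECT (the helper name, path, block size, and fallback behavior are all illustrative):

```c
#define _GNU_SOURCE             /* for O_DIRECT on glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write one 4096-byte block with O_DIRECT (falling back to buffered I/O
 * if the filesystem rejects the flag), then fsync() to ask the device to
 * flush its own volatile write cache.  Returns 0 on success, -1 on error. */
int write_direct_durable(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0 && errno == EINVAL)      /* filesystem lacks O_DIRECT support */
        fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) {  /* O_DIRECT needs an aligned buffer */
        close(fd);
        return -1;
    }
    memset(buf, 'x', 4096);

    ssize_t n = write(fd, buf, 4096);   /* length and offset must also be aligned */
    int rc = (n == 4096 && fsync(fd) == 0) ? 0 : -1;
    free(buf);
    close(fd);
    return rc;
}
```

Note that fsync() is still worthwhile after an O_DIRECT write: the write bypasses the page cache, but fsync() is what asks the device to flush its own write cache.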
Related
Suppose I use write() to write some data to a file on disk, but my application crashes before the data is flushed. Is it guaranteed that my data will eventually be flushed to disk if there is no system failure?
If you're using write (and not fwrite or std::ostream::write), then there is no in-process buffering. If there is no system failure, then the data will, sooner or later (and generally fairly soon), be written to disk.
If you're really concerned about data integrity, you can OR O_DSYNC or O_SYNC into the flags when you open the file. If you do this, you are guaranteed that the data has been physically written to the disk before write returns.
I am using ext4 on a Linux 2.6 kernel. I have records in byte arrays, which can range from a few hundred bytes to 16 MB. Is there any benefit in an application using write() for every record, as opposed to, say, buffering X MB and then issuing one write() of X MB?
If there is a benefit in buffering, what would be a good value for ext4. This question is for someone who has profiled the behavior of the multiblock allocator in ext4.
My understanding is that the filesystem will buffer in multiples of the page size and attempt to flush them to disk. What happens if the buffer provided to write() is bigger than the filesystem buffer? Is this a crude way to force the filesystem to flush to disk?
The "correct" answer depends on what you really want to do with the data.
write(2) is designed as a single trip into kernel space, and provides good control over I/O. However, unless the file is opened with O_SYNC, the data goes into the kernel's cache only, not to disk. O_SYNC changes that and ensures the file is synchronized to disk. The actual writing to disk is issued from the kernel's cache, and ext4 will try to allocate as big a buffer as possible to write, to minimize fragmentation, IIRC. In general, write(2), on either a buffered or an O_SYNC file, is a good way to control whether the data has gone to the kernel or is still in your application's cache.
However, for writing lots of records, you might be interested in writev(2), which writes data from a list of buffers. Similarly to write(2), it's an atomic call (though of course that's only in OS semantics, not actually on disk, unless, again, Direct I/O is used).
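The writev(2) suggestion might look like this (the record contents and helper name are made up for illustration):

```c
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Gather several record buffers into one system call with writev(2),
 * avoiding both a copy into a staging buffer and one syscall per record. */
int write_records(const char *path)
{
    struct iovec iov[3] = {
        { (void *)"rec1\n", 5 },
        { (void *)"rec2\n", 5 },
        { (void *)"rec3\n", 5 },
    };
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = writev(fd, iov, 3);     /* one atomic trip into the kernel */
    close(fd);
    return n == 15 ? 0 : -1;            /* total bytes across all three buffers */
}
```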
I want to know whether the buffer cache in the Linux kernel is used for file systems like UDF (for DVDs) and FUSE.
I tried to search for this but unfortunately found little information.
Thanks.
The buffer cache will be used for any access to a filehandle opened against a block device, unless the file handle is opened with O_DIRECT. This includes accesses on behalf of FUSE filesystems. Note that if FUSE does caching as well (I don't know offhand), this may result in double-caching of data; unlike normal in-kernel filesystems, with FUSE the kernel can't safely overlap the page and buffer caches. In this case it may be worthwhile to consider using O_DIRECT in the FUSE filesystem daemon to reduce cache pressure (but be sure to profile first!).
For in-kernel filesystems such as UDF, the buffer cache will be used for all IO. For blocks containing file data, the block will simultaneously be in both the buffer and page caches (using the same underlying memory). This will be accounted as page cache, not buffer cache, in memory usage statistics.
We want to do our best to avoid data loss during power failure, so I decided to use the O_DIRECT flag to open a file and write data to disk. Does O_DIRECT mean that the data bypasses the OS cache completely? If the request returns successfully to the application, does it mean that the data must have been flushed to the disk? If I open a regular file on a file system, what about the FS metadata? Is it also flushed immediately, or is it cached?
By the way, can O_DIRECT be used on Windows? Or is there a corresponding method on Windows?
O_DIRECT will probably do what you want, but it will greatly slow down your I/O.
I think just calling fsync() or fflush() depending on whether you use direct file descriptor operations or FILE * should be enough.
As for the metadata question, it depends on the underlying file system and even on the hardware if you want to be extra paranoid. A hard drive (and especially a SSD) may report the operation finished but could take a while to actually write the data.
You can use O_DIRECT but for many applications, calling fdatasync() is more convenient. O_DIRECT imposes a lot of restrictions because the IOs completely bypass the OS cache. It bypasses read cache as well as write cache.
For filesystem metadata, all you can do is fsync() your file after writing it. fsync flushes the file metadata, so you can be sure that the file won't disappear (or change its attributes etc) if the power is lost immediately afterwards.
All of these mechanisms depend on your I/O subsystem not lying to the OS about having persisted data to storage, and in many cases on other hardware-dependent things (such as the RAID controller battery not running out before the power returns).
CreateFile can do this.
HANDLE WINAPI CreateFile(
__in LPCTSTR lpFileName,
__in DWORD dwDesiredAccess,
__in DWORD dwShareMode,
__in_opt LPSECURITY_ATTRIBUTES lpSecurityAttributes,
__in DWORD dwCreationDisposition,
__in DWORD dwFlagsAndAttributes,
__in_opt HANDLE hTemplateFile
);
For dwFlagsAndAttributes you can specify FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING.
If FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING are both specified, so that system caching is not in effect, then the data is immediately flushed to disk without going through the Windows system cache. The operating system also requests a write-through of the hard disk's local hardware cache to persistent media.
Can I use O_DIRECT for write requests to avoid data loss during power failure?
No!
On Linux while O_DIRECT tries to bypass your OS's cache it never bypasses your disk's cache. If your disk has a volatile write cache you can still lose data that was only in the disk cache during an abrupt power off!
Does O_DIRECT mean that the data bypass OS cache completely?
Usually, but some Linux filesystems may fall back to buffered I/O with O_DIRECT (the Ext4 wiki's "Clarifying Direct IO's Semantics" page warns this can happen with allocating writes).
If the request returns successful to the application, does it mean that the data must have been flushed to the disk?
It usually means the disk has "seen" it but see the above caveats (e.g. data might have gone to buffer cache / data might only be in disk's volatile cache).
If I open a regular file on a file system, what about the FS metadata? Is it also flushed immediately, or is it cached?
Excellent question! Metadata may still be rolling around in cache and not yet synced to disk even though the request finished successfully.
All of the above mean you HAVE to do the appropriate fsync() command in the correct places (and check their results!) if you want to be sure whether an operation has reached non-volatile storage. See https://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/ and the LWN article "Ensuring data reaches disk" for details.
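The "do the appropriate fsync() in the correct places, and check the results" advice includes the parent directory when creating files, as the LWN article describes; a sketch (the paths and helper name are illustrative):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Create a file durably: fsync the file itself, then fsync its parent
 * directory so the new directory entry also survives a crash.  Checking
 * every return value is the point.  Returns 0 on success, -1 on error. */
int create_durable(const char *dir, const char *name, const void *data, size_t len)
{
    char path[4096];
    snprintf(path, sizeof path, "%s/%s", dir, name);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(fd);

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);   /* directories can be fsync'd too */
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```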
The use and effects of the O_SYNC and O_DIRECT flags are very confusing and appear to vary somewhat among platforms. According to the Linux man page (see an example here), O_DIRECT provides synchronous I/O, minimizes cache effects, and requires you to handle block size alignment yourself. O_SYNC just guarantees synchronous I/O. Although both guarantee that data is written into the hard disk's cache, I believe that direct I/O operations are supposed to be faster than plain synchronous I/O, since they bypass the page cache (though FreeBSD's man page for open(2) states that the cache is bypassed when O_SYNC is used; see here).
What exactly are the differences between the O_DIRECT and O_SYNC flags? Some implementations suggest using O_SYNC | O_DIRECT. Why?
O_DIRECT alone only promises that the kernel will avoid copying data from user space to kernel space, and will instead write it directly via DMA (direct memory access), if possible. Data does not go into caches. There is no strict guarantee that the function will return only after all data has been transferred.
O_SYNC guarantees that the call will not return before all data has been transferred to the disk (as far as the OS can tell). This still does not guarantee that the data isn't somewhere in the harddisk write cache, but it is as much as the OS can guarantee.
O_DIRECT|O_SYNC is the combination of these, i.e. "DMA + guarantee".
Actually, under Linux 2.6, O_DIRECT is synchronous; see the man page for open(2), which has two sections about it. Under 2.4 it is not guaranteed:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is described in raw(8).
but under 2.6 it is guaranteed; see:
O_DIRECT
The O_DIRECT flag may impose alignment restrictions on the length and address of userspace buffers and the file offset of I/Os. In Linux alignment restrictions vary by file system and kernel version and might be absent entirely. However there is currently no file system-independent interface for an application to discover these restrictions for a given file or file system. Some file systems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).
Under Linux 2.4, transfer sizes, and the alignment of the user buffer and the file offset must all be multiples of the logical block size of the file system. Under Linux 2.6, alignment to 512-byte boundaries suffices.
O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes. This restriction does not apply when the memory buffer for the O_DIRECT I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag. Nor does this restriction apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will not be available to the child after fork(2).
The O_DIRECT flag was introduced in SGI IRIX, where it has alignment restrictions similar to those of Linux 2.4. IRIX has also a fcntl(2) call to query appropriate alignments, and sizes. FreeBSD 4.x introduced a flag of the same name, but without alignment restrictions.
O_DIRECT support was added under Linux in kernel version 2.4.10. Older Linux kernels simply ignore this flag. Some file systems may not implement the flag and open() will fail with EINVAL if it is used.
Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the file system correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files.
The behaviour of O_DIRECT with NFS will differ from local file systems. Older kernels, or kernels configured in certain ways, may not support this combination. The NFS protocol does not support passing the flag to the server, so O_DIRECT I/O will only bypass the page cache on the client; the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve the synchronous semantics of O_DIRECT. Some servers will perform poorly under these circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients about the I/O having reached stable storage; this will avoid the performance penalty at some risk to data integrity in the event of server power failure. The Linux NFS client places no alignment restrictions on O_DIRECT I/O.
In summary, O_DIRECT is a potentially powerful tool that should be used with caution. It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.
"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances."---Linus
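One way to probe the alignment restrictions quoted above on a block device is the BLKSSZGET ioctl; this sketch falls back to 512 bytes for non-block descriptors, since (as the man page notes) there is no filesystem-independent interface for discovering them:

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>       /* BLKSSZGET */
#include <sys/ioctl.h>
#include <unistd.h>

/* Ask the kernel for a device's logical block size, the usual lower bound
 * for O_DIRECT alignment on Linux 2.6+.  Falls back to 512 when the fd is
 * not a block device (ENOTTY); returns -1 on other errors. */
int logical_block_size(int fd)
{
    int size;
    if (ioctl(fd, BLKSSZGET, &size) == 0)
        return size;
    if (errno == ENOTTY)
        return 512;         /* conservative assumption for non-block fds */
    return -1;
}
```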
AFAIK, O_DIRECT bypasses the page cache, while O_SYNC uses the page cache but syncs it immediately. The page cache is shared between processes, so another process working on the same file without the O_DIRECT flag will still read the correct data.
This IBM doc explains the difference rather clearly, I think.
When a file is opened in O_DIRECT mode ("direct I/O"), GPFS™ transfers data directly between the user buffer and the file on the disk. Using direct I/O may provide some performance benefits in the following cases:
The file is accessed at random locations.
There is no access locality.
Direct transfer between the user buffer and the disk can only happen if all of the following conditions are true:
The number of bytes transferred is a multiple of 512 bytes.
The file offset is a multiple of 512 bytes.
The user memory buffer address is aligned on a 512-byte boundary.
When these conditions are not all true, the operation will still proceed, but will be treated more like other normal file I/O, with the O_SYNC flag that flushes the dirty buffer to disk.