Reasons for direct I/O failure - Linux

I want to know under what circumstances a direct I/O transfer will fail. I have the following three sub-queries. As per the "Understanding the Linux Kernel" book:
1. "Linux offers a simple way to bypass the page cache: direct I/O transfers. In each I/O direct transfer, the kernel programs the disk controller to transfer the data directly from/to pages belonging to the User Mode address space of a self-caching application."
-- So to explain a failure, does one need to check whether the application has a self-caching feature or not? I am not sure how that can be done.
2. Furthermore, the book says: "When a self-caching application wishes to directly access a file, it opens the file specifying the O_DIRECT flag. While servicing the open() system call, the dentry_open() function checks whether the direct_IO method is implemented for the address_space object of the file being opened, and returns an error code in the opposite case."
-- Apart from this, is there any other reason that can explain a direct I/O failure?
3. Will the command dd if=/dev/zero of=myfile bs=1M count=1 oflag=direct ever fail on Linux (assuming ample disk space is available)?

The underlying filesystem and block device must support the O_DIRECT flag. This command will fail because tmpfs doesn't support O_DIRECT:
dd if=/dev/zero of=/dev/shm/test bs=1M count=1 oflag=direct
The write size must be a multiple of the block size of the underlying device. This command will fail because 123 is not a multiple of 512:
dd if=/dev/zero of=myfile bs=123 count=1 oflag=direct

There are many reasons why direct I/O can fail.
So to explain failure one needs to check whether application has self caching feature or not?
You can't do this externally - you either have to deduce it from the source code or watch how the program uses resources as it runs (binary disassembly, I guess). This is more a property of how the program does its work than a "turn this feature on in a call". It would be a dangerous assumption to think all programs that use O_DIRECT do self-caching (probabilistically I'd say it's more likely, but you don't know for sure).
There are strict requirements for using O_DIRECT; they are mentioned in the man page of open (see the O_DIRECT section of NOTES). With modern kernels the region being operated on must be aligned to, and its size must be a multiple of, the disk's logical block size. Failure to get this right may even result in silent fallback to buffered I/O.
As for whether your dd command can ever fail: yes. For example, trying to use it on a filesystem (such as tmpfs) that doesn't support O_DIRECT will fail. I suppose it could also fail if the path down to the disk returns failures for some reason (e.g. the disk is dying and returns the error much sooner than it would with writeback).
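To make these failure modes concrete, here is a minimal C sketch (the file name "myfile" and the 512-byte block size are assumptions; a real program should query the device's logical block size, e.g. via the BLKSSZGET ioctl). The aligned write should succeed on a filesystem that supports O_DIRECT, the 123-byte write typically fails with EINVAL, and on tmpfs the open() itself fails:
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int err = posix_memalign(&buf, 512, 512);   /* 512-byte aligned buffer */
    if (err) {
        fprintf(stderr, "posix_memalign: %s\n", strerror(err));
        return 1;
    }
    memset(buf, 0, 512);

    int fd = open("myfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {                        /* fails outright on e.g. tmpfs */
        perror("open(O_DIRECT)");
        return 1;
    }

    if (write(fd, buf, 512) < 0)         /* aligned length: should succeed */
        perror("aligned write");

    if (write(fd, buf, 123) < 0)         /* 123 is not a multiple of 512: */
        perror("misaligned write");      /* typically fails with EINVAL   */

    free(buf);
    close(fd);
    return 0;
}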

Is running `sync` necessary after writing a disk image?

A common way to write an image to disk looks like:
dd if=file.img of=/dev/device
After this command, is it necessary to run sync?
sync(2) explains that it only flushes filesystem caches. Since the dd command does not go through any filesystem, I think it is not necessary to run sync. However, the block layer is complex, and when in doubt most people prefer to run sync.
Does anyone have proof that it is useful or useless?
TL;DR: Run blockdev --flushbufs /dev/device after dd.
I tried to follow the different paths in the kernel. Here is what I understood:
ioctl(block_dev, BLKFLSBUF, 0) calls blkdev_flushbuf(). Considering its name, it should flush the caches associated with the device (or, I think, you can consider there to be a bug in the device driver). I think it should also be responsible for flushing hardware caches if they exist. Notice that e2fsprogs uses BLKFLSBUF.
fdatasync() (and fsync()) will call blkdev_fsync(). It looks like blkdev_flushbuf(), but it only affects the range of data written by the current process (it uses filemap_write_and_wait_range() while BLKFLSBUF uses filemap_write_and_wait()).
Closing a block device calls blkdev_close(), which does not flush buffers.
sync() will call sync_fs(). It will flush filesystem caches and call fsync() on the underlying block device.
The command sync /dev/device will call fsync() on /dev/device. However, I think it is useless since dd didn't touch any filesystem.
So my conclusion is that calling sync has no (direct) impact on the block device. However, passing conv=fdatasync (or conv=fsync) to dd is the only way to guarantee that the data is correctly written to the media.
If you have run dd but forgot to pass conv=fdatasync, running sync /dev/device is not sufficient. You would have to re-run dd with conv=fdatasync on the whole device. Alternatively, you can issue a BLKFLSBUF ioctl to flush the whole device. Unfortunately, there is no standard command for that.
EDIT
You can issue a BLKFLSBUF with blockdev --flushbufs /dev/device.
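For reference, blockdev --flushbufs essentially issues the BLKFLSBUF ioctl described above. A minimal C sketch doing the same (the device path is a placeholder, and the ioctl needs root privileges):
#include <fcntl.h>
#include <linux/fs.h>           /* BLKFLSBUF */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sdX", O_RDONLY);    /* placeholder device path */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(fd, BLKFLSBUF, 0) < 0) {      /* flush buffers for the whole device */
        perror("ioctl(BLKFLSBUF)");
        return 1;
    }
    close(fd);
    return 0;
}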
To ensure the data is flushed on a USB device before unplugging it, I use the following command:
echo 1 > /sys/block/${device}/device/delete
This way, the data is flushed, and if the device is a hard drive, then the head is parked.

Why is the iops observed by fio different from that observed by iostat?

Recently I have been trying to test my disk using fio. My fio configuration is as follows:
[global]
invalidate=0 # mandatory
direct=1
#sync=1
fdatasync=1
thread=1
norandommap=1
runtime=10000
time_based=1
[write4k-rand]
stonewall
group_reporting
bs=4k
size=1g
rw=randwrite
numjobs=1
iodepth=1
In this configuration, you can see that I configured fio to do random writes using direct I/O. While the test was running, I used iostat to monitor the I/O performance. I found that if I set fdatasync to 1, the IOPS observed by fio is about 64, while that observed by iostat is about 170. Why are these different? And if I don't configure fdatasync, both IOPS figures are approximately the same, but much higher, about 450. Why? As far as I know, direct I/O does not go through the page cache, which, in my opinion, means it should take about the same time no matter whether fdatasync is used.
I have also heard that iostat can come up with wrong statistics under some circumstances. Is that real? What exact circumstances could make iostat go wrong? Are there any other tools I can use to monitor I/O performance?
Looking at your jobfile, it appears you are not doing I/O against a block device but against a file within a filesystem. Thus, while you may ask the filesystem to "put this data at that location in that file", the filesystem may turn that into multiple block device requests, because it also has to update the metadata associated with that file (e.g. the journal, file timestamps, copy-on-write bookkeeping etc.). Thus, by the time the requests are sent down to the disk (which is what you're measuring with iostat), the original request has been amplified.
Something to also bear in mind is that Linux may be using an I/O scheduler for that disk. This can rearrange, split and merge requests before submitting them to the disk / returning them further up the stack. See the nomerges parameter in https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt for how to avoid some of the merging/rearranging, but note that you can't control the splitting of a request that is too large (although a filesystem won't make overly large requests).
(PS: I've not known iostat to be "wrong", so you might need to ask the people who say that directly to find out what they mean.)
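To make the fdatasync=1 effect concrete: with that setting, fio effectively follows every data write with a flush, roughly like the simplified sketch below (not fio's actual code; the file name is a placeholder). Each pwrite() is one 4k data request, but each fdatasync() forces out the associated journal/metadata updates as additional device requests - the amplification described above - which iostat counts but fio's IOPS figure does not:
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, 4096, 4096))   /* O_DIRECT needs alignment */
        return 1;
    memset(buf, 0, 4096);

    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 1000; i++) {
        off_t off = (off_t)(rand() % 262144) * 4096; /* random 4k block in 1g */
        if (pwrite(fd, buf, 4096, off) != 4096)      /* one data request...   */
            perror("pwrite");
        if (fdatasync(fd) < 0)                       /* ...plus flush/journal */
            perror("fdatasync");                     /* requests              */
    }
    close(fd);
    return 0;
}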

Does O_DIRECT bypass filesystem journaling?

The man page for open(2) only says that O_DIRECT bypasses the page cache, but many descriptions around the net describe it as causing the user buffer to be DMA'd straight to the drive. If that is the case, I imagine it would also bypass journaling done by the filesystem (e.g. XFS, ext4, etc.). Is this the case?
I can't find anyone claiming one way or the other. It seems to me this would be consistent with O_DIRECT being used by databases -- the common example use for O_DIRECT is an application like a database doing its own caching in userspace, and similarly I can imagine databases doing their own transaction logs.
Does O_DIRECT bypass filesystem journaling?
Usually it does. However, file data usually doesn't go into a filesystem's journal anyway. More details below (but note this answer doesn't try to account for CoW filesystems):
Most Linux journaling filesystems (ext4 when data is set to writeback or ordered (the default), XFS, JFS etc.) do not journal the data within files - they journal the consistency of the filesystem's data structures (metadata).
Filesystem journals only metadata (the typical case): data within files doesn't go into the journal anyway, so using O_DIRECT doesn't change this, and the data continues not to go into the journal. However, O_DIRECT operations can still trigger metadata updates just like normal writes, and the initiating operation can return before the metadata has been updated. See the Ext4 wiki Clarifying Direct IO's Semantics page for details.
Ext4 in data=journal mode: This is trickier - there's a warning that the outcome of O_DIRECT in data=journal mode might not be what is expected. From the "data=journal" section of ext4.txt:
Enabling this mode [data=journal] will disable delayed allocation and O_DIRECT support.
In this scenario, O_DIRECT is treated as only a hint, and the filesystem silently falls back to stuffing the data into the page cache (making it no longer direct!). So in this case, yes, the data will end up going into the journal and the journal won't be bypassed. See the "Re: [PATCH 1/1 linux-next] ext4: add compatibility flag check" thread for where Ted Ts'o articulates this. There are patches floating around ("ext4: refuse O_DIRECT opens for mode where DIO doesn't work") to make the filesystem return an error at open instead, but from what I can see these were rejected from the mainline kernel.
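If you need to know whether a given ext4 mount is in data=journal mode (and will therefore silently ignore O_DIRECT), one option is to inspect its mount options. A minimal sketch, assuming "/mnt" is the mount point of interest and that the option is visible in /proc/self/mounts:
#include <mntent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = setmntent("/proc/self/mounts", "r");
    if (!f) { perror("setmntent"); return 1; }

    struct mntent *m;
    while ((m = getmntent(f)) != NULL) {
        if (strcmp(m->mnt_dir, "/mnt") == 0) {      /* placeholder mount point */
            if (hasmntopt(m, "data=journal"))
                printf("%s: data=journal - O_DIRECT will be a no-op\n",
                       m->mnt_dir);
            else
                printf("%s: options: %s\n", m->mnt_dir, m->mnt_opts);
        }
    }
    endmntent(f);
    return 0;
}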

Writing to a remote file: When does write() really return?

I have a client node writing a file to a hard disk that is on another node (I am writing to a parallel fs actually).
What I want to understand is:
When I call write() (or pwrite()), when exactly does the call return?
I see three possibilities:
write returns immediately after queueing the I/O operation on the client side:
In this case, write can return before the data has actually left the client node. (If you are writing to a local hard drive, the write call employs delayed writes, where data is simply queued up for writing. But does this also happen when you are writing to a remote hard disk?) I wrote a test case in which I write a large matrix (1 GByte) to a file. Without fsync, it showed very high bandwidth values, whereas with fsync, the results looked more realistic. So it looks like it could be using delayed writes.
write returns after the data has been transferred to the server buffer:
Now the data is on the server, but it resides in a buffer in the server's main memory and is not yet permanently stored on the hard drive. In this case, I/O time should be dominated by the time to transfer the data over the network.
write returns after data has been actually stored on the hard drive:
Which I am sure does not happen by default (unless you write really large files, which cause your RAM to fill up and ultimately be flushed out, and so on...).
Additionally, what I would like to be sure about is:
Can a situation occur where the program terminates without any data actually having left the client node, such that network parameters like latency, bandwidth, and the hard drive bandwidth do not feature in the program's execution time at all? Consider we do not do an fsync or something similar.
EDIT: I am using the pvfs2 parallel file system
Option 3 is of course simple and safe. However, a production-quality, POSIX-compatible parallel filesystem with performance good enough that anyone actually cares to use it will typically use option 1, combined with some more or less involved mechanism to avoid conflicts when e.g. several clients cache the same file.
As the saying goes, "There are only two hard things in Computer Science: cache invalidation and naming things and off-by-one errors".
If the filesystem is supposed to be POSIX compatible, you need to go and learn POSIX fs semantics, and look up how the fs supports these while getting good performance (alternatively, which parts of POSIX semantics it skips, à la NFS). What makes this, err, interesting is that POSIX fs semantics hark back to the 1970s, with little to no thought of how to support network filesystems.
I don't know about pvfs2 specifically, but typically in order to conform to POSIX and provide decent performance, option 1 can be used together with some kind of cache coherency protocol (which e.g. Lustre does). For fsync(), the data must then actually be transferred to the server and committed to stable storage on the server (disks or battery-backed write cache) before fsync() returns. And of course, the client has some limit on the amount of dirty pages, after which it will block further write()'s to the file until some have been transferred to the server.
You can get any of your three options. It depends on the flags you provide to the open call. It depends on how the filesystem was mounted locally. It also depends on how the remote server is configured.
The following are all taken from Linux. Solaris and others may differ.
Some important open flags are O_SYNC, O_DIRECT, O_DSYNC, O_RSYNC.
Some important mount flags for NFS are ac, noac, cto, nocto, lookupcache, sync, async.
Some important flags for exporting NFS are sync, async, no_wdelay. And of course the mount flags of the filesystem that NFS is exporting are important as well. For example, if you were exporting XFS or EXT4 from Linux and for some reason you used the nobarrier flag, a power loss on the server side would almost certainly result in lost data.
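As an illustration of how those open flags change where write() returns, here is a hedged C sketch (the path /mnt/remote/file is a placeholder; actual durability still depends on the mount and export options above):
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello\n";

    /* Option 1-ish: with a default open, write() may return as soon as
       the data has been queued in the client-side page cache. */
    int fd = open("/mnt/remote/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, msg, sizeof msg - 1) < 0)
        perror("write");

    /* Option 3-ish: fsync() blocks until the data has been transferred
       to the server and committed to stable storage (subject to the
       export options discussed above). */
    if (fsync(fd) < 0)
        perror("fsync");
    close(fd);

    /* Alternatively, O_SYNC makes every write() behave as if it were
       followed by an implicit fsync(). */
    fd = open("/mnt/remote/file", O_WRONLY | O_SYNC);
    if (fd < 0) { perror("open(O_SYNC)"); return 1; }
    if (write(fd, msg, sizeof msg - 1) < 0)
        perror("write(O_SYNC)");
    close(fd);
    return 0;
}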

How to ensure read() reads data from the real device each time?

I'm periodically reading from a file and checking the readout to decide the subsequent action. As this file may be modified by some mechanism that bypasses the block I/O layer in the Linux kernel, I need to ensure that the read operation reads data from the real underlying device instead of from the kernel's buffers.
I know fsync() can make sure all I/O write operations have completed, with all data written to the real device, but there is no equivalent for I/O read operations.
The file has to be kept open.
So could anyone please tell me what I can do to meet such a requirement on a Linux system? Is there an API similar to fsync() that can be called?
Really appreciate your help!
I believe that you want to use the O_DIRECT flag to open().
I think memory mapping in combination with madvise() and/or posix_fadvise() should satisfy your requirements... Linus contrasts this with O_DIRECT at http://kerneltrap.org/node/7563 ;-).
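For the posix_fadvise() route, here is a minimal sketch of the periodic re-read (the path is a placeholder; note that POSIX_FADV_DONTNEED only drops clean cached pages, and this does not address the metadata-caching caveats raised in the next answer):
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/path/to/file", O_RDONLY);   /* fd stays open, as required */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    for (;;) {
        /* Ask the kernel to drop this file's cached pages so the next
           read goes back to the device; posix_fadvise() returns an
           errno value directly rather than setting errno. */
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        if (err)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        ssize_t n = pread(fd, buf, sizeof buf, 0);
        if (n < 0) { perror("pread"); break; }

        /* ... inspect buf and decide the subsequent action ... */
        sleep(1);   /* periodic check, as in the question */
    }
    close(fd);
    return 0;
}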
You are going to be in trouble if another device is writing to the block device at the same time as the kernel.
The kernel assumes that the block device won't be written by any other party than itself. This is true even if the filesystem is mounted readonly.
Even if you used direct IO, the kernel may cache filesystem metadata, so a change in the location of those blocks of the file may result in incorrect behaviour.
So in short - don't do that.
If you wanted, you could access the block device directly - which might be a more successful scheme, but it still potentially allows harmful race conditions (you cannot guarantee the order of the metadata and data updates made by the other device). These could cause you to end up reading junk from the device (if the metadata were updated before the data). You'd better have a mechanism for detecting junk reads in this case.
I am, of course, assuming some very simple, braindead filesystem such as FAT. That might reasonably be implemented in userspace (mtools, for instance, does this).
