man 2 read says:
EINVAL fd is attached to an object which is unsuitable for reading; or
the file was opened with the O_DIRECT flag, and either the address
specified in buf, the value specified in count, or the current file
offset is not suitably aligned.
Non-direct I/O has no such restrictions, so why does direct I/O require alignment?
(Kernel 2.6+) This is because direct I/O can be zero-copy from the kernel's perspective (i.e. the kernel itself makes no further copies of the data), and disks have a minimum addressable I/O size known as their "logical block size" (often 512 bytes, but it may be 4096 bytes or even more). This O_DIRECT requirement (I/O must obey logical block size alignment) is described in the man page for open() (see the O_DIRECT section under NOTES).
In the buffered I/O case, the kernel copies data between the userspace addresses and its own page cache (which obeys all the alignment rules, doing a read-modify-write if necessary to keep everything aligned) and then tells the device to do I/O to/from the page cache locations. In the direct I/O case, when everything goes correctly, the memory allocated to your userspace program is the same memory handed to the device to do I/O to/from, so your program must obey the alignment rules because there is nothing in between to fix things up.
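As a rough illustration of what "suitably aligned" means, the sketch below checks the three quantities named in the man page (buffer address, count, file offset) against a block size, and allocates a buffer that satisfies the address requirement. The 512-byte value and the helper names dio_aligned and dio_alloc are assumptions made for this example; posix_memalign(3) is the standard allocation call.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* Assumed logical block size; the real value is device-specific. */
#define DIO_BLOCK 512u

/* Returns 1 if the buffer address, byte count, and file offset are all
 * multiples of `block`, i.e. the request would not be rejected with
 * EINVAL for alignment reasons. `block` must be non-zero. */
static int dio_aligned(const void *buf, size_t count, off_t offset, size_t block)
{
    return ((uintptr_t)buf % block == 0) &&
           (count % block == 0) &&
           (offset % (off_t)block == 0);
}

/* Allocates `count` bytes whose address is a multiple of `block`.
 * `block` must be a power of two and a multiple of sizeof(void *).
 * Returns NULL on failure; release the buffer with free(). */
static void *dio_alloc(size_t block, size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, block, count) != 0)
        return NULL;
    return p;
}
```

A buffer from dio_alloc(DIO_BLOCK, 4096) used with a count of 4096 at offset 0 passes the check; the same buffer shifted by one byte, or a count of 100, does not.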
How does an IO device know that a value in memory pertaining to it has changed in memory mapped IO?
For example, let's say memory address 0 has been dedicated to hold the background color for a VGA device. How does the VGA device know when we change the value in memory[0]? Is the VGA device constantly polling the memory location? Or does the CPU somehow notify the device when it changes the value (and if so how?)?
An example architecture is MIPS. Given that the MIPS instruction set does not have in or out instructions, I don't understand how it could possibly communicate (on change) with the VGA device in the example. Another example is the ARM architecture.
In memory-mapped I/O, performing a memory read/write to the device's memory region will cause the CPU to perform a transaction with the device to fetch/store that value -- either directly through the CPU's memory bus, or through a secondary bus (such as AHB/APB on ARM systems). This memory transaction directly notifies the device that a value is being changed; no separate notification is necessary.
You're assuming that memory-mapped I/O is backed by normal RAM. This is not the case. Indeed, these devices may behave in ways which are entirely unlike real memory! For instance, a typical UART or SPI device implementation may have a single data register which can be written to in order to transmit data, or read from to retrieve received data. Similarly, it's not uncommon for interrupt registers to have "clear on read" or "write 1 to clear" semantics.
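The "write 1 to clear" and "clear on read" behaviours can be modelled in plain C to show why such a register does not act like RAM. This is a software model only: the struct and function names are made up for illustration, and a real device implements this logic in hardware.

```c
#include <stdint.h>

/* Software model of a device status register (hypothetical). */
struct status_reg {
    uint32_t bits;   /* pending event bits */
};

/* "Write 1 to clear": each 1 bit in `value` clears that pending bit;
 * 0 bits leave the register unchanged. Plain RAM would instead simply
 * store `value`. */
static void w1c_write(struct status_reg *r, uint32_t value)
{
    r->bits &= ~value;
}

/* "Clear on read": a read returns the pending bits and resets them,
 * so two consecutive reads can return different values. */
static uint32_t cor_read(struct status_reg *r)
{
    uint32_t v = r->bits;
    r->bits = 0;
    return v;
}
```

Starting from pending bits 0x5, writing 0x1 leaves 0x4 behind rather than storing 0x1, and a subsequent read returns 0x4 and leaves the register at 0.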
For what it's worth: in practice, many framebuffer graphics implementations do actually behave as normal memory. What's different is that the memory is stored in a dual-ported RAM (or a time-multiplexed bus), and the video RAMDAC continuously reads through that memory to transmit its contents to an attached display.
A region of the physical address space that is designated as memory-mapped I/O (MMIO) is not mapped to main memory (system memory); it's mapped to I/O registers which are physically part of the I/O device.
To determine how to handle a memory access (read or write), the processor checks first the type of the region to which the target memory address belongs. In any MIPS processor, there are at least two types: Uncached and Cached. MMIO regions are always Uncached. An Uncached memory access request is directly sent to the main memory controller without examining or affecting any of the caches. However, an I/O Uncached memory access request is sent to an I/O controller, and eventually the request will reach the destination I/O device.
Exactly how the CPU and the I/O device communicate with each other is completely specified by the I/O device itself. An I/O device has a specification that describes how many I/O registers there are and how each of them should be used. An I/O register could be used to hold status flags, control flags, data to be read or written by the CPU, or some combination thereof. Note that since the I/O registers are physically part of the I/O device, the device can be designed to detect when any of its registers is being read from or written to and take an action accordingly if required.
An I/O device can send an interrupt to the CPU to inform it that some data is available or maybe it wants attention for whatever reason. The CPU can also frequently poll the I/O device by checking some status flag(s) and then take some action accordingly.
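The polling variant can be sketched as a bounded spin on a status register. The register layout (a READY bit at bit 0) and the function name are assumptions for this example; on real hardware the pointer would refer to the device's mapped address range, whereas an ordinary variable can stand in for it when experimenting.

```c
#include <stdint.h>

#define STATUS_READY 0x1u   /* hypothetical "data available" bit */

/* Polls a status register until the READY bit is set or `max_spins`
 * reads have been made. `volatile` forces a real load on every
 * iteration instead of reusing a value held in a CPU register.
 * Returns 0 when ready, -1 on timeout. */
static int poll_ready(const volatile uint32_t *status, unsigned max_spins)
{
    while (max_spins-- > 0) {
        if (*status & STATUS_READY)
            return 0;
    }
    return -1;
}
```

Bounding the loop is a design choice: a device that never raises the flag then produces a timeout the caller can handle instead of hanging the CPU.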
The following text is from https://www.kernel.org/doc/Documentation/DMA-API.txt, with a few questions inlined:
Part Ia - Using large dma-coherent buffers
------------------------------------------
void *
dma_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t flag)
Consistent memory is memory for which a write by either the device or
the processor can immediately be read by the processor or device
without having to worry about caching effects. (You may however need
to make sure to flush the processor's write buffers before telling
devices to read that memory.)
Q1. Is it safe to assume that the allocated area is cacheable, given that the last line states that flushing is required?
Q1a. Does this API allocate memory from the lower 16 MB, which is considered DMA-safe?
dma_addr_t
dma_map_single(struct device *dev, void *cpu_addr, size_t size,
enum dma_data_direction direction)
Maps a piece of processor virtual memory so it can be accessed by the
device and returns the physical handle of the memory.
The direction for both api's may be converted freely by casting.
However the dma_ API uses a strongly typed enumerator for its
direction:
DMA_NONE no direction (used for debugging)
DMA_TO_DEVICE data is going from the memory to the device
DMA_FROM_DEVICE data is coming from the device to the memory
DMA_BIDIRECTIONAL direction isn't known
Q2. Do the DMA_XXX options directly change the page attributes of the VA=>PA mapping? For example, would DMA_TO_DEVICE mark the area as non-cacheable?
It says "without having to worry about caching effects". That means dma_alloc_coherent() returns uncacheable memory unless the architecture has cache coherent DMA hardware so the caching makes no difference. However being uncached does not mean that writes do not go through the CPU write buffers (i.e. not every memory access is immediately executed or executed in the same order as they appear in the code). To be sure that everything you write into memory is really there when you tell the device to read it, you will have to execute a wmb() at least. See Documentation/memory-barriers.txt for more information.
dma_alloc_coherent() does not necessarily return memory from the lower 16 MB; it returns memory that is accessible by the device, inside the addressable area specified by dma_set_coherent_mask(). You have to call that as part of the device initialization.
The dma_map_*() functions do not change the cacheability of the mapping. They make sure that the given memory region is accessible to the device at the DMA address they return. After the DMA is finished, dma_unmap_*() is called. For DMA_TO_DEVICE the sequence is "write data to memory, map(), start DMA, unmap() when finished"; for DMA_FROM_DEVICE it is "map(), start DMA, unmap() when finished, read data from memory".
The cache makes no difference because usually you are not writing or reading the buffer while it is mapped. If you really have to do that, you have to explicitly dma_sync_*() the memory before reading or after writing the buffer.
I have a question regarding the writeback of the dirty pages. If a portion of page data is modified, will the writeback write the whole page to the disk, or only the partial page with modified data?
The memory management hardware on x86 systems has a granularity of 4096 bytes. This means: It is not possible to find out which bytes of a 4096-byte page are really changed and which ones are unchanged.
Theoretically the disk driver system could check if bytes have been changed and not write the 512-byte blocks that have not been changed.
However this would mean that - if the blocks are no longer in disk cache memory - the page must be read from hard disk to check if it has changed before writing.
I do not think that Linux does it that way, because reading the page from disk would cost too much time.
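Under the page-granularity assumption described above, the arithmetic can be sketched as follows: any modification, however small, marks every page it touches as dirty. The 4096-byte page size and the function names are assumptions for this example, not kernel interfaces.

```c
#include <stddef.h>

#define PAGE_SZ 4096u   /* assumed page size (x86 base pages) */

/* Number of pages touched when `len` bytes starting at file offset
 * `off` are modified. */
static size_t dirty_pages(size_t off, size_t len)
{
    if (len == 0)
        return 0;
    size_t first = off / PAGE_SZ;
    size_t last = (off + len - 1) / PAGE_SZ;
    return last - first + 1;
}

/* Total bytes covered by those pages, i.e. the amount tracked as dirty
 * when dirtiness is recorded per page. */
static size_t dirty_bytes(size_t off, size_t len)
{
    return dirty_pages(off, len) * PAGE_SZ;
}
```

Changing a single byte at offset 10 gives one dirty page of 4096 bytes, while a two-byte change at offset 4095 straddles a page boundary and dirties two pages.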
On each hardware interrupt, the CPU wants to write as much data as the hard disk controller can handle. That size is what we call the block size (one sector, in Linux):
http://en.wikipedia.org/wiki/Disk_sector
https://superuser.com/questions/121252/how-do-i-find-the-hardware-block-read-size-for-my-hard-drive
But waiting too long on a single interrupt for a large file can make the system appear unresponsive, so it is logical to break the transfer into smaller chunks (such as 512 bytes) so that the CPU can handle other tasks while each 512-byte chunk is transferred. Therefore, whether you changed one byte or 511 bytes, as long as the change is within a single block, all of that block's data gets written at the same time. Throughout the Linux kernel, blocks are flagged as dirty (or not) by a single unique identifier, the sector number, so anything smaller than the sector size is too difficult to manage efficiently.
That said, don't forget that the hard disk controller itself also has a minimum block size for write operations.
I am using ext4 on a Linux 2.6 kernel. I have records in byte arrays, which can range from a few hundred bytes to 16 MB. Is there any benefit in an application using write() for every record, as opposed to, say, buffering X MB and then using write() on X MB?
If there is a benefit in buffering, what would be a good value for ext4? This question is for someone who has profiled the behavior of the multiblock allocator in ext4.
My understanding is that the filesystem will buffer in multiples of the page size and attempt to flush them to disk. What happens if the buffer provided to write() is bigger than the filesystem buffer? Is this a crude way to force the filesystem to flush to disk?
The "correct" answer depends on what you really want to do with the data.
write(2) is designed as a single trip into kernel space and provides good control over I/O. However, unless the file is opened with O_SYNC, the data goes only into the kernel's cache, not onto disk. O_SYNC changes that to ensure the file is synchronized to disk. The actual writing to disk is issued by the kernel cache, and ext4 will try to allocate as big a buffer as possible to write, to minimize fragmentation, IIRC. In general, write(2) with either a buffered or an O_SYNC file is a good way to control whether the data goes to the kernel or whether it's still in your application's cache.
However, for writing lots of records, you might be interested in writev(2), which writes data from a list of buffers. Similarly to write(2), it's an atomic call (though of course that's only in OS semantics, not actually on disk, unless, again, Direct I/O is used).
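A minimal sketch of the writev(2) approach: two records are handed to the kernel in one system call instead of two write(2) calls. The function name and the use of NUL-terminated strings as records are assumptions made to keep the example short; real records would carry their own lengths.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Writes two records with a single writev(2) call.
 * Returns the number of bytes written, or -1 on error. A short write
 * is possible and is reported as-is; callers that need everything
 * written must retry with the remainder. */
static ssize_t write_two_records(int fd, const char *a, const char *b)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *)a;   /* iov_base is non-const in the API */
    iov[0].iov_len = strlen(a);
    iov[1].iov_base = (void *)b;
    iov[1].iov_len = strlen(b);

    return writev(fd, iov, 2);
}
```

The records stay in their own buffers, so nothing has to be copied into one large staging buffer in the application first.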
The use and effects of the O_SYNC and O_DIRECT flags are very confusing and appear to vary somewhat among platforms. According to the Linux man page, O_DIRECT provides synchronous I/O, minimizes cache effects, and requires you to handle block size alignment yourself. O_SYNC just guarantees synchronous I/O. Although both guarantee that data is written into the hard disk's cache, I believe that direct I/O operations are supposed to be faster than plain synchronous I/O since they bypass the page cache (though FreeBSD's man page for open(2) states that the cache is bypassed when O_SYNC is used).
What exactly are the differences between the O_DIRECT and O_SYNC flags? Some implementations suggest using O_SYNC | O_DIRECT. Why?
O_DIRECT alone only promises that the kernel will avoid copying data from user space to kernel space, and will instead write it directly via DMA (Direct memory access; if possible). Data does not go into caches. There is no strict guarantee that the function will return only after all data has been transferred.
O_SYNC guarantees that the call will not return before all data has been transferred to the disk (as far as the OS can tell). This still does not guarantee that the data isn't somewhere in the harddisk write cache, but it is as much as the OS can guarantee.
O_DIRECT|O_SYNC is the combination of these, i.e. "DMA + guarantee".
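How the two flags combine at open(2) time can be sketched as below. The function name is hypothetical, and the fallback definition of O_DIRECT as 0 is an assumption for platforms whose headers do not provide the flag; on such platforms the I/O simply stays buffered.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* glibc exposes O_DIRECT only with this */
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0           /* assumption: no O_DIRECT here; stay buffered */
#endif

/* Opens `path` for writing with O_SYNC, so that write(2) returns only
 * after the OS has transferred the data to the device. With `direct`
 * non-zero, O_DIRECT is added to bypass the page cache, which makes
 * the caller responsible for buffer, length, and offset alignment.
 * Returns a file descriptor, or -1 with errno set. */
static int open_for_durable_write(const char *path, int direct)
{
    int flags = O_WRONLY | O_CREAT | O_TRUNC | O_SYNC;

    if (direct)
        flags |= O_DIRECT;
    return open(path, flags, 0600);
}
```

With direct set to 0, an ordinary unaligned write(2) works and is synchronous; with direct set to 1, the open may fail with EINVAL on file systems that do not implement O_DIRECT.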
Actually, under Linux 2.6, O_DIRECT is synchronous; see the man page for open(2), which has two sections about it.
Under 2.4 it is not guaranteed:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is described in raw(8).
But under 2.6 it is guaranteed; see:
O_DIRECT
The O_DIRECT flag may impose alignment restrictions on the length and address of userspace buffers and the file offset of I/Os. In Linux alignment restrictions vary by file system and kernel version and might be absent entirely. However there is currently no file system-independent interface for an application to discover these restrictions for a given file or file system. Some file systems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).
Under Linux 2.4, transfer sizes, and the alignment of the user buffer and the file offset must all be multiples of the logical block size of the file system. Under Linux 2.6, alignment to 512-byte boundaries suffices.
O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes. This restriction does not apply when the memory buffer for the O_DIRECT I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag. Nor does this restriction apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will not be available to the child after fork(2).
The O_DIRECT flag was introduced in SGI IRIX, where it has alignment restrictions similar to those of Linux 2.4. IRIX has also a fcntl(2) call to query appropriate alignments, and sizes. FreeBSD 4.x introduced a flag of the same name, but without alignment restrictions.
O_DIRECT support was added under Linux in kernel version 2.4.10. Older Linux kernels simply ignore this flag. Some file systems may not implement the flag and open() will fail with EINVAL if it is used.
Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the file system correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files.
The behaviour of O_DIRECT with NFS will differ from local file systems. Older kernels, or kernels configured in certain ways, may not support this combination. The NFS protocol does not support passing the flag to the server, so O_DIRECT I/O will only bypass the page cache on the client; the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve the synchronous semantics of O_DIRECT. Some servers will perform poorly under these circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients about the I/O having reached stable storage; this will avoid the performance penalty at some risk to data integrity in the event of server power failure. The Linux NFS client places no alignment restrictions on O_DIRECT I/O.
In summary, O_DIRECT is a potentially powerful tool that should be used with caution. It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.
"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances."---Linus
AFAIK, O_DIRECT bypasses the page cache. O_SYNC uses the page cache but syncs it immediately. The page cache is shared between processes, so another process working on the same file without the O_DIRECT flag can still read the correct data.
This IBM doc explains the difference rather clearly, I think.
When a file is opened in the O_DIRECT mode ("direct I/O"), GPFS™ transfers data directly between the user buffer and the file on the disk. Using direct I/O may provide some performance benefits in the following cases:
The file is accessed at random locations.
There is no access locality.
Direct transfer between the user buffer and the disk can only happen if all of the following conditions are true:
The number of bytes transferred is a multiple of 512 bytes.
The file offset is a multiple of 512 bytes.
The user memory buffer address is aligned on a 512-byte boundary.
When these conditions are not all true, the operation will still proceed but will be treated more like other normal file I/O, with the O_SYNC flag that flushes the dirty buffer to disk.