I have a question about kernel I/O cache for disk file.
As I understand it, when write() or read() is called, there is a buffer cache in kernel space for disk file I/O operations.
My question is: does this I/O buffering apply only to disk files, or does it also apply to terminals, FIFOs, pipes, and sockets?
Thanks
It is called the "page cache". It consists of pages backed by files and "anonymous pages" backed by swap. This is all part of the Linux virtual memory (VM) subsystem.
It is not used for TTYs, FIFOs, pipes, or sockets. Each of those does provide buffering of its own by its nature; for example, the data you write to a pipe has to reside somewhere before it is read back out again. But that buffering has nothing to do with the VM subsystem.
[update]
Note that this buffering is totally independent of the user-space buffering provided by (e.g.) fwrite(). (I see you asked a similar question earlier, and it is not clear whether you understand the distinction.)
Related
If I open a file with O_DIRECT flag, does it mean that whenever a write(blocking mode) to that file returns, the data is on disk?
(This answer pertains to Linux - other OSes may have different caveats/semantics)
Let's start with the sub-question:
If I open a file with O_DIRECT flag, does it mean that whenever a write(blocking mode) to that file returns, the data is on disk?
No (as #michael-foukarakis commented) - if you need a guarantee your data made it to non-volatile storage you must use/add something else.
What does O_DIRECT really mean?
It's a hint that you want your I/O to bypass the Linux kernel's caches. What will actually happen depends on things like:
Disk configuration
Whether you are opening a block device or a file in a filesystem
If using a file within a filesystem
The exact filesystem used and the options in use on the filesystem and the file
Whether you've correctly aligned your I/O
Whether a filesystem has to do a new block allocation to satisfy your I/O
If the underlying disk is local, what layers you have in your kernel storage stack before you reach the disk block device
Linux kernel version
...
The list above is not exhaustive.
In the "best" case, setting O_DIRECT will avoid making extra copies of data while transferring it and the call will return after transfer is complete. You are more likely to be in this case when directly opening block devices of "real" local disks. As previously stated, even this property doesn't guarantee that data of a successful write() call will survive sudden power loss. IF the data is DMA'd out of RAM to non-volatile storage (e.g. battery backed RAID controller) or the RAM itself is persistent storage THEN you may have a guarantee that the data reached stable storage that can survive power loss. To know if this is the case you have to qualify your hardware stack so you can't assume this in general.
In the "worst" case, O_DIRECT can mean nothing at all even though setting it wasn't rejected and subsequent calls "succeed". Sometimes things in the Linux storage stack (like certain filesystem setups) can choose to ignore it because of what they have to do or because you didn't satisfy the requirements (which is legal) and just silently do buffered I/O instead (i.e. write to a buffer/satisfy read from already buffered data). It is unclear whether extra effort will be made to ensure that the data of an acknowledged write was at least "with the device" (but in the O_DIRECT and barriers thread Christoph Hellwig posts that the O_DIRECT fallback will ensure data has at least been sent to the device). A further complication is that using O_DIRECT implies nothing about file metadata so even if write data is "with the device" by call completion, key file metadata (like the size of the file because you were doing an append) may not be. Thus you may not actually be able to get at the data you thought had been transferred after a crash (it may appear truncated, or all zeros etc).
While brief testing can make it look like O_DIRECT alone guarantees data is on disk after a write returns, changing things (e.g. using an Ext4 filesystem instead of XFS) can weaken what is actually achieved in drastic ways.
As you mention "guarantee that the data" (rather than metadata) perhaps you're looking for O_DSYNC/fdatasync()? If you want to guarantee metadata was written too, you will have to look at O_SYNC/fsync().
References
Ext4 Wiki: Clarifying Direct IO's Semantics. Also contains notes about what O_DIRECT does on a few non-Linux OSes.
The "[PATCH 1/1 linux-next] ext4: add compatibility flag check to the patch" LKML thread has a reply from Ext4 lead dev Ted Ts'o talking about how filesystems can fallback to buffered I/O for O_DIRECT rather than failing the open() call.
In the "ubifs: Allow O_DIRECT" LKML thread Btrfs lead developer Chris Mason states Btrfs resorts to buffered I/O when O_DIRECT is requested on compressed files.
ZFS on Linux commit message discussing the semantics of O_DIRECT in different scenarios. Also see the (at the time of writing mid-2020) proposed new O_DIRECT semantics for ZFS on Linux (the interactions are complex and defy a brief explanation).
Linux open(2) man page (search for O_DIRECT in the Description section and the Notes section)
Ensuring data reaches disk LWN article
Infamous Linus Torvalds O_DIRECT LKML thread summary (for even more context you can see the full LKML thread)
If one process sends data through a socket to another process on the same machine how likely is it that a disk read/write will occur during transmission? There seems to be a socket file type, are these guaranteed to be in memory provided there is free memory?
Not directly. TCP / UDP network sockets, over localhost, or a UNIX Domain Socket will operate in memory. UNIX Domain Sockets are typically the fastest option outside of dropping into kernel space with a module.
Sockets over localhost and pipes are nearly as simple as a couple of memcpy's between user space and kernel space and back. In the TCP case, you have the stack overhead on top of that.
Both files and sockets share the kernel's file-descriptor abstraction, but that doesn't imply an actual file.
Of course, the database may trigger some write to a log, as a result of your transaction.
In the POSIX model, as well as many other kernels, files do not live only on disks. Instead, every device is represented by a "special file". They live in directories or some sort of namespace, but accessing them is not disk access, even if they are placed in a directory on disk.
If you have memory pressure, then it's possible for some of your data buffers to get swapped out. But this has nothing to do with the "file" nature of devices. It's just using the disk as additional RAM.
So "Yes, socket I/O is file I/O, but not disk read/write."
Grabbing "The Design of the 4.4BSD Operating System", which describes what can be considered the reference implementation (sections 11.2 "Implementation Structure" and 11.3 "Memory Management"): in the absence of extreme memory pressure, it appears to be guaranteed that there will be no disk I/O involved in transmission.
Transmitted data is stored in special structures, mbufs and mbuf clusters; data is added or removed directly at either end of each buffer. The same buffers will probably be used over and over, being freed to a specific pool and then reallocated from there. Fresh buffers are allocated from the kernel malloc pool, which is not swappable. The number of buffers grows only when the consumer is slow, and only up to a limit.
Put simply, as to the data: in the reference implementation these buffers are not backed by files, much less by a file in the filesystem where the inode is placed. At best they would be backed by swap space, and even then they are extremely unlikely to be paged out.
This only leaves out metadata and status information, which may be on the inode. Naturally, inode creation and lookup will cause disk access. As to status, all I can think of is atime.
I can't find authoritative information regarding atime on UNIX domain sockets. But I tried on FreeBSD and on Linux and all four file times were always kept as the inode creation time. Even establishing a second connection to a UNIX domain socket does not seem to update atime.
It seems that reads/writes to regular files can't be made non-blocking. I found the following references in support:
from The Linux Programming Interface: A Linux and UNIX System Programming Handbook:
"--- Nonblocking mode can be used with devices (e.g., terminals and pseudoterminals), pipes, FIFOs, and sockets. (Because file descriptors for pipes and sockets are not obtained using open(), we must enable this flag using the fcntl() F_SETFL operation described in Section 5.3.) O_NONBLOCK is generally ignored for regular files, because the kernel buffer cache ensures that I/O on regular files does not block, as described in Section 13.1. However, O_NONBLOCK does have an effect for regular files when mandatory file locking is employed (Section 55.4). ---"
from Advanced Programming in the UNIX Environment 2nd Ed:
"--- We also said that system calls related to disk I/O are not considered slow, even though the read or write of a disk file can block the caller temporarily. ---"
from http://www.remlab.net/op/nonblock.shtml:
"--- Regular files are always readable and they are also always writeable. This is clearly stated in the relevant POSIX specifications. I cannot stress this enough. Putting a regular file in non-blocking has ABSOLUTELY no effects other than changing one bit in the file flags. Reading from a regular file might take a long time. For instance, if it is located on a busy disk, the I/O scheduler might take so much time that the user will notice the application is frozen. Nevertheless, non-blocking mode will not work. It simply will not work. Checking a file for readability or writeability always succeeds immediately. If the system needs time to perform the I/O operation, it will put the task in non-interruptible sleep from the read or write system call. ---"
When memory is adequately available, reads/writes are performed through kernel buffering.
My question is: is there a scenario where the kernel is so short of memory that buffering is not immediately usable? If yes, what will the kernel do? Simply return an error, or perform some amazing trick?
Thanks guys!
My take on O_NONBLOCK and on quoted text:
O_NONBLOCK has absolutely no effect on whether a syscall that triggers actual disk I/O blocks waiting for the I/O to complete. O_NONBLOCK only affects filesystem-level operations, such as accesses to pseudo-files (as the first quote mentions) and to locked files. O_NONBLOCK does not affect any operations at the block device level.
Lack of memory has nothing to do with O_NONBLOCK.
O_NONBLOCK dictates whether accesses to locked files block or not. For example, flock() / lockf() can be used to lock a file. If O_NONBLOCK is used, a read()/write() would return immediately with EAGAIN instead of blocking until the file lock is released. Please keep in mind that these synchronization differences are implemented at the filesystem level and have nothing to do with whether the read()/write() syscall triggers a true disk I/O.
The fragment "because the kernel buffer cache ensures that I/O on regular files does not block" from the first quote is misleading, and I would go as far as considering it wrong. It is true that buffering lowers the chance for a file read/write syscall to result in a disk I/O and thus block; however, buffering alone can never fully avoid I/O. If a file is not cached, the kernel needs to perform an actual I/O when you read() from the file. If you write() to the file and the page cache is full of dirty pages, the kernel has to make room by first flushing some data to the storage device. I feel that if you mentally skip this fragment, the text becomes clearer.
The second quote seems generic (what does it mean for something to be slow?) and provides no explanation of why I/O-related calls are not considered slow. I suspect there is more background information in the text around it that qualifies a bit more what the author intended to say.
Lack of memory in the kernel can come in two forms: (a) lack of free pages in the buffer cache and (b) not enough memory to allocate new data structures for servicing new syscalls. For (a), the kernel simply recycles pages in the buffer cache, possibly by writing dirty pages to disk first. This is a very common scenario. For (b), the kernel needs to free up memory either by paging program data to the swap partition or (if this fails) even by killing an existing process (the OOM killer is invoked, which pretty much kills the process with the biggest memory consumption). This is an uncommon mode of operation, but the system will continue running after the user process is killed.
I am using ext4 on linux 2.6 kernel. I have records in byte arrays, which can range from few hundred to 16MB. Is there any benefit in an application using write() for every record as opposed to saying buffering X MB and then using write() on X MB?
If there is a benefit in buffering, what would be a good value for ext4. This question is for someone who has profiled the behavior of the multiblock allocator in ext4.
My understanding is that the filesystem will buffer in multiples of the page size and attempt to flush them to disk. What happens if the buffer provided to write() is bigger than the filesystem buffer? Is this a crude way to force the filesystem to flush to disk?
The "correct" answer depends on what you really want to do with the data.
write(2) is designed as a single trip into kernel space, and provides good control over I/O. However, unless the file is opened with O_SYNC, the data goes into the kernel's cache only, not onto disk. O_SYNC changes that, ensuring the file is synchronized to disk. The actual writing to disk is issued from the kernel cache, and ext4 will try to allocate as big a buffer as possible to write, to minimize fragmentation, IIRC. In general, write(2) with either a buffered or O_SYNC file is a good way to control whether the data goes to the kernel or whether it's still in your application's cache.
However, for writing lots of records, you might be interested in writev(2), which writes data from a list of buffers. Similarly to write(2), it's an atomic call (though of course that's only in OS semantics, not actually on disk, unless, again, Direct I/O is used).
I want to know whether the buffer cache in Linux kernel is present for file systems like UDF for DVD and FUSE?
I tried to search for this but unfortunately found little information.
Thanks.
The buffer cache will be used for any access to a file handle opened against a block device, unless the file handle is opened with O_DIRECT. This includes accesses on behalf of FUSE filesystems. Note that if FUSE does caching as well (I don't know offhand), this may result in double-caching of data; unlike normal in-kernel filesystems, with FUSE the kernel can't safely overlap the page and buffer caches. In this case it may be worthwhile to consider using O_DIRECT in the FUSE filesystem daemon to reduce cache pressure (but be sure to profile first!).
For in-kernel filesystems such as UDF, the buffer cache will be used for all IO. For blocks containing file data, the block will simultaneously be in both the buffer and page caches (using the same underlying memory). This will be accounted as page cache, not buffer cache, in memory usage statistics.