I am using ext4 on linux 2.6 kernel. I have records in byte arrays, which can range from few hundred to 16MB. Is there any benefit in an application using write() for every record as opposed to saying buffering X MB and then using write() on X MB?
If there is a benefit in buffering, what would be a good value for ext4. This question is for someone who has profiled the behavior of the multiblock allocator in ext4.
My understanding is that filesystem will buffer in multiples of pagesize and attempt to flush them on disk. What happens if the buffer provided to write() is bigger than filesystem buffer? Is this a crude way to force filesystem to flush to disk()
The "correct" answer depends on what you really want to do with the data.
write(2) is designed as single trip into kernel space, and provides good control over I/O. However, unless the file is opened with O_SYNC, the data goes into kernel's cache only, not on disk. O_SYNC changes that to ensure file is synchroinized to disk. The actual writing to disk is issued by kernel cache, and ext4 will try to allocate as big buffer to write to minimize fragmentation, iirc. In general, write(2) with either buffered or O_SYNC file is a good way to control whether the data goes to kernel or whether it's still in your application's cache.
However, for writing lots of records, you might be interested in writev(2), which writes data from a list of buffers. Similarly to write(2), it's an atomic call (though of course that's only in OS semantics, not actually on disk, unless, again, Direct I/O is used).
Related
like said in the title, I don't really understand the usage of this syscall. I was writing some program that write some data in a file, and the tutorial I've seen told me to use sys_sync syscall. But my problem is why and when should we use this? The data isn't already written on the file?
The manual says:
sync - Synchronize cached writes to persistent storage
So it is written to the file cache in memory, not on disk.
You rarely have to use sync unless you are writing really important data and need to make sure that data is on disk before you go on. One example of systems that use sync a lot are databases (such as MySQL or PostgreSQL).
So in other words, it is theoretically in your file, just not on disk and therefore if you lose electricity, you could lose the data, especially if you have a lot of RAM and many writes in a raw, it may privilege the writes to cache for a long while, increasing the risk of data loss.
But how can a file be not on the disk? I understand the concept of cache but if I wrote in the disk why would it be in a different place?
First, when you write to a file, you send the data to the Kernel. You don't directly send it to the disk. Some kernel driver is then responsible to write the data to disk. In my days on Apple 2 and Amiga computers, I would actually directly read/write to disk. And at least the Amiga had a DMA so you could setup a buffer, then tell the disk I/O to do a read or a write and it would send you an interrupt when done. On the Apple 2, you had to write loops in assembly language with precise timings to read/write data on floppy disks... A different era!
Although you could, of course, directly access the disk (but with a Kernel like Linux, you'd have to make sure the kernel gives you hands free to do that...).
Cache is primarily used for speed. It is very slow to write to disk (as far as a human is concerned, it looks extremely fast, but compared to how much data the CPU can push to the drive, it's still slow).
So what happens is that the kernel has a task to write data to disk. That task wakes up as soon as data appears in the cache and ends once all the caches are transferred to disk. This task works in parallel. You can have one such task per drive (which is especially useful when you have a system such as RAID 1).
If your application fills up the cache, then a further write will block until some of the cache can be replaced.
and the tutorial I've seen told me to use sys_sync syscall
Well that sounds silly, unless you're doing filesystem write benchmarking or something.
If you have one really critical file that you want to make sure is "durable" wrt. power outages before you do something else (like sent a network packet to acknowledge a complete transfer), use fsync(fd) to sync just that one file's data and metadata.
(In asm, call number SYS_fsync from sys/syscall.h, with the file descriptor as the first register arg.)
But my problem is why and when should we use this?
Generally never use the sync system call in programs you're writing.
There are interactive use-cases where you'd normally use the wrapper command of the same name, sync(1). e.g. with removable media, to get the kernel started doing write-back now, so unmount will take less time once you finish typing it. Or for some benchmarking use-cases.
The system shutdown scripts may run sync after unmounting filesystems (and remounting / read-only), before making a reboot(2) system call.
Re: why sync(2) exists
No, your data isn't already on disk right after echo foo > bar.txt.
Most OSes, including Linux, do write-back caching, not write-through, for file writes.
You don't want write() system calls to wait for an actual magnetic disk when there's free RAM, because the traditional way to do I/O is synchronous so simple single-threaded programs wouldn't be able to do anything else (like reading more data or computing anything) while waiting for write() to return. Blocking for ~10 ms on every write system call would be disastrous; that's as long as a whole scheduler timeslice. (It would still be bad even with SSDs, but of course OSes were designed before SSDs were a thing.) Even just queueing up the DMA would be slow, especially for small file writes that aren't a whole number of aligned sectors, so even letting the disk's own write-back write caching work wouldn't be good enough.
Therefore, file writes do create "dirty" pages of kernel buffers that haven't yet been sent to the disk. Sometimes we can even avoid the IO entirely, e.g. for tmp files that get deleted before anything triggers write-back. On Linux, dirty_writeback_centisecs defaults to 1500 (15 seconds) before the kernel starts write-back, unless it's running low on free pages. (Heuristics for what "low" means use other tunable values).
If you really want writes to flush to disk immediately and wait for data to be on disk, mount with -o sync. Or for one program, have it use open(O_SYNC) or O_DSYNC (for just the data, not metadata like timestamps).
See Are file reads served from dirtied pages in the page cache?
There are other advantages to write-back, including delayed allocation even at the filesystem level. The FS can wait until it knows how big the file will be before even deciding where to put it, allowing better decisions that reduce fragmentation. e.g. a small file can go into a gap that would have been a bad place to start a potentially-large file. (It just have to reserve space to make sure it can put it somewhere.) XFS was one of the first filesystems to do "lazy" delayed allocation, and ext4 has also had the feature for a while.
https://en.wikipedia.org/wiki/XFS#Delayed_allocation
https://en.wikipedia.org/wiki/Allocate-on-flush
https://lwn.net/Articles/323169/
I have a C program that runs only weekly, and reads a large amount of files only once. Since Linux also caches everything that's read, they fill up the cache needlessly and this slows down the system a lot unless it has an SSD drive.
So how do I open and read from a file without filling up the disk cache?
Note:
By disk caching I mean that when you read a file twice, the second time it's read from RAM, not from disk. I.e. data once read from the disk is left in RAM, so subsequent reads of the same file will not need to reread the data from disk.
I believe passing O_DIRECT to open() should help:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers. The O_DIRECT flag on its own makes at an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC that data and necessary metadata are transferred. To guarantee synchronous I/O the O_SYNC must be used in addition to O_DIRECT.
There are further detailed notes on O_DIRECT towards the bottom of the man page, including a fun quote from Linus.
You can use posix_fadvise() with the POSIX_FADV_DONTNEED advice to request that the system free the pages you've already read.
What would happen if I were to use write() to write some data to a file on disk. But my application were to crash before flushing. Is it guaranteed that my data will get eventually flushed to disk if there is no system failure?
If you're using write (and not fwrite or std::ostream::write),
then there is no in process buffering. If there is no system failure,
then the data will, sooner or later (and generally fairly soon) be
written to disk.
If you're really concerned by data integrity, you can or in the flags
O_DSYNC and O_SYNC to the flags when you open the file. If you do
this, you are guaranteed that the data is physically written to the disk
before the return from write.
I want to know whether the buffer cache in Linux kernel is present for file systems like UDF for DVD and FUSE?
I tried to search for this but unfortunately found little information.
Thanks.
The buffer cache will be used for any access to a filehandle opened against a block device, unless the file handle is opened with O_DIRECT. This includes accesses on behalf of FUSE filesystems. Note that if FUSE does caching as well (I don't know offhand), this may result in double-caching of data; unlike normal in-kernel filesystems, with FUSE the kernel can't safely overlap the page and buffer caches. In this case it may be worthwhile to consider using O_DIRECT in the FUSE filesystem daemon to reduce cache pressure (but be sure to profile first!).
For in-kernel filesystems such as UDF, the buffer cache will be used for all IO. For blocks containing file data, the block will simultaneously be in both the buffer and page caches (using the same underlying memory). This will be accounted as page cache, not buffer cache, in memory usage statistics.
The use and effects of the O_SYNC and O_DIRECT flags is very confusing and appears to vary somewhat among platforms. From the Linux man page (see an example here), O_DIRECT provides synchronous I/O, minimizes cache effects and requires you to handle block size alignment yourself. O_SYNC just guarantees synchronous I/O. Although both guarantee that data is written into the hard disk's cache, I believe that direct I/O operations are supposed to be faster than plain synchronous I/O since they bypass the page cache (Though FreeBSD's man page for open(2) states that the cache is bypassed when O_SYNC is used. See here).
What exactly are the differences between the O_DIRECT and O_SYNC flags? Some implementations suggest using O_SYNC | O_DIRECT. Why?
O_DIRECT alone only promises that the kernel will avoid copying data from user space to kernel space, and will instead write it directly via DMA (Direct memory access; if possible). Data does not go into caches. There is no strict guarantee that the function will return only after all data has been transferred.
O_SYNC guarantees that the call will not return before all data has been transferred to the disk (as far as the OS can tell). This still does not guarantee that the data isn't somewhere in the harddisk write cache, but it is as much as the OS can guarantee.
O_DIRECT|O_SYNC is the combination of these, i.e. "DMA + guarantee".
Actuall under linux 2.6, o_direct is syncronous, see the man page:
manpage of open, there is 2 section about it..
Under 2.4 it is not guaranteed
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. Ingeneral this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File
I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is described in raw(8).
but under 2.6 it is guaranteed, see
O_DIRECT
The O_DIRECT flag may impose alignment restrictions on the length and address of userspace buffers and the file offset of I/Os. In Linux alignment restrictions vary by file system and kernel version and might be absent entirely. However there is currently no file system-independent interface for an application to discover these restrictions for a given file or file system. Some file systems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).
Under Linux 2.4, transfer sizes, and the alignment of the user buffer and the file offset must all be multiples of the logical block size of the file system. Under Linux 2.6, alignment to 512-byte boundaries suffices.
O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes. This restriction does not apply when the memory buffer for the O_DIRECT I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag. Nor does this restriction apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will not be available to the child after fork(2).
The O_DIRECT flag was introduced in SGI IRIX, where it has alignment restrictions similar to those of Linux 2.4. IRIX has also a fcntl(2) call to query appropriate alignments, and sizes. FreeBSD 4.x introduced a flag of the same name, but without alignment restrictions.
O_DIRECT support was added under Linux in kernel version 2.4.10. Older Linux kernels simply ignore this flag. Some file systems may not implement the flag and open() will fail with EINVAL if it is used.
Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the file system correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files.
The behaviour of O_DIRECT with NFS will differ from local file systems. Older kernels, or kernels configured in certain ways, may not support this combination. The NFS protocol does not support passing the flag to the server, so O_DIRECT I/O will only bypass the page cache on the client; the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve the synchronous semantics of O_DIRECT. Some servers will perform poorly under these circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients about the I/O having reached stable storage; this will avoid the performance penalty at some risk to data integrity in the event of server power failure. The Linux NFS client places no alignment restrictions on O_DIRECT I/O.
In summary, O_DIRECT is a potentially powerful tool that should be used with caution. It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.
"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances."---Linus
AFAIK, O_DIRECT bypasses the page cache. O_SYNC uses page cache but syncs it immediately. Page cache is shared between processes so if there is another process that is working on the same file without O_DIRECT flag can read the correct data.
This IBM doc explains the difference rather clearly, I think.
A file opened in the O_DIRECT mode ("direct I/O"), GPFS™ transfers data directly between the user buffer and the file on the disk.Using direct I/O may provide some performance benefits in the
following cases:
The file is accessed at random locations.
There is no access locality.
Direct transfer between the user buffer and the disk can only happen
if all of the following conditions are true: The number of bytes
transferred is a multiple of 512 bytes. The file offset is a multiple
of 512 bytes. The user memory buffer address is aligned on a 512-byte
boundary. When these conditions are not all true, the operation will
still proceed but will be treated more like other normal file I/O,
with the O_SYNC flag that flushes the dirty buffer to disk.