As we know that file writings are cache by Linux OS and only get commited to disk when the OS has to do so or fsync() is called.
But,
Are file system operations such as rename/unlink cached? And does a successful return from rename/unlink ensure that the operation is commited to disk and will survive even OS crash?
If this kind of stuff was instantly written out the performance would be beyond terrible.
There are various approaches: log-structured filesystems, soft-updates, journaling and probably more.
I suggest you read http://www.nobius.org/~dbg/practical-file-system-design.pdf
Related
I am writing a large file to disk from a user-mode application. In parallel to it, I am writing one or more smaller files. The large file won't be read back anytime soon, but the small files could be. I have enough RAM for the application + smaller files, but not enough for the large file. Can I tell the OS not to keep parts of the large file in cache after they are written to disk so that more cache is available for smaller files? I still want writes to the large file be fast enough.
Can I tell the OS not to keep parts of the large file in cache ?
Yes, you probably want to use some system call like posix_fadvise(2) or madvise(2). In weird cases, you might use readahead(2) or userfaultfd(2) or Linux-specific flags to mmap(2). Or very cleverly handle SIGSEGV (see signal(7), signal-safety(7) and eventfd(2) and signalfd(2)) You'll need to write your C program doing that.
But I am not sure that it is worth your development efforts. In many cases, the behavior of a recent Linux kernel is good enough.
See also proc(5) and linuxatemyram.com
You many want to read the GC handbook. It is relevant to your concerns
Conbsider studying for inspiration the source code of existing open-source software such as GCC, Qt, RefPerSys, PostGreSQL, GNU Bash, etc...
Most of the time, it is simply not worth the effort to explicitly code something to manage your page cache.
I guess that mount(2) options in your /etc/fstab file (see fstab(5)...) are in practice more important. Or changing or tuning your file system (e.g. ext4(5), xfs(5)..). Or read(2)-ing in large pieces (1Mbytes).
Play with dd(1) to measure. See also time(7)
Most applications are not disk-bound, and for those who are disk bound, renting more disk space is cheaper that adding and debugging extra code.
don't forget to benchmark, e.g. using strace(1) and time(1)
PS. Don't forget your developer costs. They often are a lot above the price of a RAM module (or of some faster SSD disk).
I have a client node writing a file to a hard disk that is on another node (I am writing to a parallel fs actually).
What I want to understand is:
When I write() (or pwrite()), when exactly does the write call return?
I see three possibilities:
write returns immediately after queueing the I/O operation on the client side:
In this case, write can return before data has actually left the client node (If you are writing to a local hard drive, then the write call employs delayed writes, where data is simply queued up for writing. But does this also happen when you are writing to a remote hard disk?). I wrote a testcase in which I write a large matrix (1GByte) to file. Without fsync, it showed very high bandwidth values, whereas with fsync, results looked more realistic. So looks like it could be using delayed writes.
write returns after the data has been transferred to the server buffer:
Now data is on the server, but resides in a buffer in its main memory, but not yet permanently stored away on the hard drive. In this case, I/O time should be dominated by the time to transfer the data over the network.
write returns after data has been actually stored on the hard drive:
Which I am sure does not happen by default (unless you write really large files which causes your RAM to get filled and ultimately get flushed out and so on...).
Additionally, what I would like to be sure about is:
Can a situation occur where the program terminates without any data actually having left the client node, such that network parameters like latency, bandwidth, and the hard drive bandwidth do not feature in the program's execution time at all? Consider we do not do an fsync or something similar.
EDIT: I am using the pvfs2 parallel file system
Option 3. is of course simple, and safe. However, a production quality POSIX compatible parallel file system with performance good enough that anyone actually cares to use it, will typically use option 1 combined with some more or less involved mechanism to avoid conflicts when e.g. several clients cache the same file.
As the saying goes, "There are only two hard things in Computer Science: cache invalidation and naming things and off-by-one errors".
If the filesystem is supposed to be POSIX compatible, you need to go and learn POSIX fs semantics, and look up how the fs supports these while getting good performance (alternatively, which parts of POSIX semantics it skips, a la NFS). What makes this, err, interesting is that the POSIX fs semantics harks back to the 1970's with little to no though of how to support network filesystems.
I don't know about pvfs2 specifically, but typically in order to conform to POSIX and provide decent performance, option 1 can be used together with some kind of cache coherency protocol (which e.g. Lustre does). For fsync(), the data must then actually be transferred to the server and committed to stable storage on the server (disks or battery-backed write cache) before fsync() returns. And of course, the client has some limit on the amount of dirty pages, after which it will block further write()'s to the file until some have been transferred to the server.
You can get any of your three options. It depends on the flags you provide to the open call. It depends on how the filesystem was mounted locally. It also depends on how the remote server is configured.
The following are all taken from Linux. Solaris and others may differ.
Some important open flags are O_SYNC, O_DIRECT, O_DSYNC, O_RSYNC.
Some important mount flags for NFS are ac, noac, cto, nocto, lookupcache, sync, async.
Some important flags for exporting NFS are sync, async, no_wdelay. And of course the mount flags of the filesystem that NFS is exporting are important as well. For example, if you were exporting XFS or EXT4 from Linux and for some reason you used the nobarrier flag, a power loss on the server side would almost certainly result in lost data.
Is it safe to call rename(tmppath, path) without calling fsync(tmppath_fd) first?
I want the path to always point to a complete file.
I care mainly about Ext4. Is the rename() promised to be safe in all future Linux kernel versions?
A usage example in Python:
def store_atomically(path, data):
tmppath = path + ".tmp"
output = open(tmppath, "wb")
output.write(data)
output.flush()
os.fsync(output.fileno()) # The needed fsync().
output.close()
os.rename(tmppath, path)
No.
Look at libeatmydata, and this presentation:
Eat My Data: How Everybody Gets File IO Wrong
http://www.oscon.com/oscon2008/public/schedule/detail/3172
by Stewart Smith from MySql.
In case it is offline/no longer available, I keep a copy of it:
The video here
The presentation slides (online version of slides)
From ext4 documentation:
When mounting an ext4 filesystem, the following option are accepted:
(*) == default
auto_da_alloc(*) Many broken applications don't use fsync() when
noauto_da_alloc replacing existing files via patterns such as
fd = open("foo.new")/write(fd,..)/close(fd)/
rename("foo.new", "foo"), or worse yet,
fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).
If auto_da_alloc is enabled, ext4 will detect
the replace-via-rename and replace-via-truncate
patterns and force that any delayed allocation
blocks are allocated such that at the next
journal commit, in the default data=ordered
mode, the data blocks of the new file are forced
to disk before the rename() operation is
committed. This provides roughly the same level
of guarantees as ext3, and avoids the
"zero-length" problem that can happen when a
system crashes before the delayed allocation
blocks are forced to disk.
Judging by the wording "broken applications", it is definitely considered bad practice by the ext4 developers, but in practice it is so widely used approach that it was patched in ext4 itself.
So if your usage fits the pattern, you should be safe.
If not, I suggest you to investigate further instead of inserting fsync here and there just to be safe. That might not be such a good idea since fsync can be a major performance hit on ext3 (read).
On the other hand, flushing before rename is the correct way to do the replacement on non-journaling file systems. Maybe that's why ext4 at first expected this behavior from programs, the auto_da_alloc option was added later as a fix. Also this ext3 patch for the writeback (non-journaling) mode tries to help the careless programs by flushing asynchronously on rename to lower the chance of data loss.
You can read more about the ext4 problem here.
If you only care about ext4 and not ext3 then I'd recommend using fsync on the new file before doing the rename. The fsync performance on ext4 seems to be much better than on ext3 without the very long delays. Or it might be the fact that writeback is the default mode (at least on my Linux system).
If you only care that the file is complete and not which file is named in the directory then you only need to fsync the new file. There's no need to fsync the directory too since it will point to either the new file with its complete data, or the old file.
I'm creating a web application running on a Linux server. The application is constantly accessing a 250K file - it loads it in memory, reads it and sends back some info to the user. Since this file is read all the time, my client is suggesting to use something like memcache to cache it to memory, presumably because it will make read operations faster.
However, I'm thinking that the Linux filesystem is probably already caching the file in memory since it's accessed frequently. Is that right? In your opinion, would memcache provide a real improvement? Or is it going to do the same thing that Linux is already doing?
I'm not really familiar with neither Linux nor memcache, so I would really appreciate if someone could clarify this.
Yes, if you do not modify the file each time you open it.
Linux will hold the file's information in copy-on-write pages in memory, and "loading" the file into memory should be very fast (page table swap at worst).
Edit: Though, as cdhowie points out, there is no 'linux filesystem'. However, I believe the relevant code is in linux's memory management, and is therefore independent of the filesystem in question. If you're curious, you can read in the linux source about handling vm_area_struct objects in linux/mm/mmap.c, mainly.
As people have mentioned, mmap is a good solution here.
But, one 250k file is very small. You might want to read it in and put it in some sort of memory structure that matches what you want to send back to the user on startup. Ie, if it is a text file an array of lines might be a good choice, etc.
The file should be cached, but make sure the noatime option is set on the mount, otherwise the access time will attempt to be saved to the file, invalidating the cache.
Yes, definitely. It will keep accessed files in memory indefinitely, unless something else needs the memory.
You can control this behaviour (to some extent) with the fadvise system call. See its "man" page for more details.
A read/write system call will still normally need to copy the data, so if you see a real bottleneck doing this, consider using mmap() which can avoid the copy, by mapping the cache pages directly into the process.
I guess putting that file into ramdisk (tmpfs) may make enough advantage without big modifications. Unless you are really serious about response time in microseconds unit.
Let's say you're building a journaling/write-ahead-logging storage system. Can you simply implement this by (for each transaction) appending the data (with write(2)), appending a commit marker, and then fsync-ing?
The scenario to consider is if you do a large set of writes to this log then fsync it, and there's a failure during the fsync. Are the inode direct/indirect block pointers flushed only after all data blocks are flushed, or are there no guarantees that blocks are being flushed in order? If the latter, then during recovery, if you see a commit marker at the end of the file, you can't trust that the data between it and the previous commit marker is meaningful. Thus you have to rely on another mechanism (involving at least another fsync) to determine what extent of the log file is consistent (e.g., writing/fsyncing the data, then writing/fsyncing the commit marker).
If it makes a difference, mainly wondering about ext3/ext4 as the context.
Note that linux's and mac os's fsync and fdatasync are incorrect by default. Windows is correct by default, but can emulate linux for benchmarking purposes.
Also, fdatasync issues multiple disk writes if you append to the end of a file, since it needs to update the file inode with the new length. If you want to have one write per commit, your best bet is to pre-allocate log space, store a CRC of the log entries in the commit marker, and issue a single fdatasync() at commit. That way, no matter how much the OS / hardware reorder behind your back, you can find a prefix of the log that actually hit disk.
If you want to use the log for durable commits or write ahead, things get harder, since you need to make sure that fsync actually works. Under Linux, you'll want to disable the disk write cache with hdparm, or mount the partition with barrier set to true. [Edit: I stand corrected, barrier doesn't seem to give the correct semantics. SATA and SCSI introduce a number of primitives, such as write barriers and native command queuing, that make it possible for operating systems to export primitives that enable write-ahead logging. From what I can tell from manpages and online, Linux only exposes these to filesystem developers, not to userspace.]
Paradoxically, disabling the disk write cache sometimes leads to better performance, since you get more control over write scheduling in user space; if the disk queues up a bunch of synchronous write requests, you end up exposing strange latency spikes to the application. Disabling write cache prevents this from happening.
Finally, real systems use group commit, and do < 1 sync write per commit with concurrent workloads.
There's no guarantee on the order in which blocks are flushed to disk. These days even the drive itself can re-order blocks on their way to the platters.
If you want to enforce ordering, you need to at least fdatasync() between the writes that you want ordered. All a sync promises is that when it returns, everything written before the sync has hit storage.