How to update part of a file atomically? - node.js

I have a big file (several gigabytes), and I want to update a small section in it (overwrite some bytes with a new value). This must be done atomically (either the operation succeeds, or the file is left unchanged). How can I do that?
The purpose is to store progress information in a file that takes a lot of time to generate/upload (it can be on a remote file system). There will probably be times where I need to write at different locations in the file (and commit all changes at once), but if needed I can rewrite the whole index, which is a contiguous block and relatively small compared to the rest of the file. There is only one process and thread writing to the file at any given time.

Normal disks are not transactional, and don't provide atomicity guarantees.
If the underlying file system doesn't provide atomic writes (and most of them don't), then you'll need to create atomicity in your own application/data structure. This could be done via journaling (as many file systems and databases do), copy-on-write techniques, etc.
In Windows, the Transactional NTFS (TxF) feature does what you need, but your application has to explicitly use the Win32 transactional file I/O APIs. Note that Microsoft now discourages new use of TxF and warns that it may not be available in future Windows versions.
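To make the journaling idea concrete, here is a minimal sketch of a redo journal for one in-place update. It assumes a single writer (as the question states) and a side file named `<file>.journal`; every function name and the journal layout are made up for illustration. Synchronous calls are used only to keep the ordering visible.

```javascript
const fs = require('fs');
const crypto = require('crypto');

// Hypothetical naming: the journal is a small side file next to the data file.
const journalOf = (file) => file + '.journal';

// Write `buf` fully at `position`; fs.writeSync may write fewer bytes than asked.
function writeAllAt(fd, buf, position) {
  let done = 0;
  while (done < buf.length) {
    done += fs.writeSync(fd, buf, done, buf.length - done, position + done);
  }
}

// Step 1: persist the intended change (offset, length, bytes, SHA-256) first.
function writeJournal(file, offset, data) {
  const header = Buffer.alloc(12);
  header.writeBigUInt64LE(BigInt(offset), 0);
  header.writeUInt32LE(data.length, 8);
  const body = Buffer.concat([header, data]);
  const digest = crypto.createHash('sha256').update(body).digest();
  const fd = fs.openSync(journalOf(file), 'w');
  try {
    writeAllAt(fd, Buffer.concat([body, digest]), 0);
    fs.fsyncSync(fd); // the journal must be durable before the data file is touched
  } finally {
    fs.closeSync(fd);
  }
}

// Returns { offset, data } for a complete journal, or null if absent or torn.
function readJournal(file) {
  let raw;
  try {
    raw = fs.readFileSync(journalOf(file));
  } catch (err) {
    if (err.code === 'ENOENT') return null;
    throw err;
  }
  if (raw.length < 12 + 32) return null;
  const body = raw.subarray(0, raw.length - 32);
  const digest = crypto.createHash('sha256').update(body).digest();
  if (!digest.equals(raw.subarray(raw.length - 32))) return null;
  if (body.length !== 12 + body.readUInt32LE(8)) return null;
  return { offset: Number(body.readBigUInt64LE(0)), data: body.subarray(12) };
}

// Step 2 (also run at startup): replay a complete journal, discard a torn one.
function recover(file) {
  const entry = readJournal(file);
  if (entry) {
    const fd = fs.openSync(file, 'r+');
    try {
      writeAllAt(fd, entry.data, entry.offset);
      fs.fsyncSync(fd);
    } finally {
      fs.closeSync(fd);
    }
  }
  fs.rmSync(journalOf(file), { force: true });
}

function updateRegion(file, offset, data) {
  recover(file);                    // finish or discard any interrupted update
  writeJournal(file, offset, data);
  recover(file);                    // apply the journal just written, then remove it
}
```

If the process dies before the journal is complete, the checksum fails and the data file is untouched. If it dies afterwards, the next `recover` replays the change, which is safe to repeat. Several regions could be committed together by putting more than one entry in the journal. This sketch does not fsync the containing directory and has not been checked against any particular remote file system.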

I think simple lockfile should be enough...
For example proper-lockfile:
const lockfile = require('proper-lockfile');
lockfile.lock('some/file')
  .then(() => doStuff())
  .finally(() => lockfile.unlock('some/file'));
Note that any logic working with some/file has to respect the lockfile.
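For readers wondering what a lockfile amounts to, here is a minimal sketch of the general idea using only Node's built-in `fs` module. It is not how proper-lockfile is implemented, and `acquireLock` and `withLock` are hypothetical names. It relies on the `'wx'` open flag, which fails with `EEXIST` when the file already exists.

```javascript
const fs = require('fs');

// Hypothetical helper: take the lock by creating `<target>.lock` exclusively.
function acquireLock(target) {
  const lockPath = target + '.lock';
  const fd = fs.openSync(lockPath, 'wx'); // throws EEXIST if someone holds the lock
  fs.writeSync(fd, String(process.pid));  // record the holder, for debugging only
  fs.closeSync(fd);
  let released = false;
  return function release() {
    if (released) return;
    released = true;
    fs.unlinkSync(lockPath);
  };
}

// Run `fn` while holding the lock; always release, even if `fn` throws.
function withLock(target, fn) {
  const release = acquireLock(target);
  try {
    return fn();
  } finally {
    release();
  }
}
```

This sketch has no handling for stale locks left behind by a crashed process, which is one of the things a library adds. Also note that a lock only gives mutual exclusion: it keeps other cooperating processes out, but it does not by itself make a half-finished write disappear after a crash.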

Related

How to perform conditional IO in the file system?

I'm trying to implement a multi-user key-value store over the file system, such as the local Linux or Windows file system, or a network-based one (SMB or NFS). My intent is to fully avoid the need of a server because servers require some VM, deployment, upgrades, etc. And filesystems are typically readily available.
The engine returns the timestamp of when the value was set. One operation that uses the timestamp is "put if not modified since", which is similar to compare-and-swap and supports synchronization among processes. It turns out that this is quite costly to implement without a server.
It seems that no file system supports "write if not modified" or any form of conditional write semantics. At best I can lock a file, but then I need to read the date and compare inside the process, and only then write the new content and release the lock. The minimum number of IOs to implement is four: 1) lock entire file; 2) read modification date and compare locally; 3) write the new content; 4) unlock. And this ignores the IOs to open and close the file, which are pooled so they will be less frequent.
Is there any OS or filesystem facility, or algorithm that could reduce the number of IOs? Please remember that I need the solution to work over NFS or SMB...
Thanks
Filesystems already do read-ahead and write avoidance, so I/O calls will only block for disk when read data is not in cache or write cache is full and a flush is required. The performance problem with the "write if not modified since" is the 4 syscalls, which can get expensive. One way to fix this would be to add a conditional write kernel module. You would pass it the timestamp, file name, and data. It would do the conditional write using internal calls and callbacks, and return the status and new timestamp, reducing the overhead to a single syscall. Properly written, it should be filesystem-agnostic.
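For reference, the four-step sequence from the question can be sketched in user space roughly as follows. This assumes the file already exists, that a sibling `.lock` file created with the exclusive `'wx'` flag is an acceptable lock on the file system in use (worth verifying on NFS or SMB), and that `putIfUnmodifiedSince` is a hypothetical name.

```javascript
const fs = require('fs');

// Hypothetical helper implementing "put if not modified since" with the four
// steps from the question: lock, read timestamp and compare, write, unlock.
function putIfUnmodifiedSince(file, expectedMtimeMs, data) {
  const lockPath = file + '.lock';
  const lockFd = fs.openSync(lockPath, 'wx');      // 1) lock (throws EEXIST if held)
  try {
    const before = fs.statSync(file).mtimeMs;      // 2) read timestamp and compare
    if (before !== expectedMtimeMs) {
      return { ok: false, mtimeMs: before };
    }
    fs.writeFileSync(file, data);                  // 3) write the new content
    return { ok: true, mtimeMs: fs.statSync(file).mtimeMs };
  } finally {
    fs.closeSync(lockFd);                          // 4) unlock
    fs.unlinkSync(lockPath);
  }
}
```

A caller would keep the returned `mtimeMs` and pass it to the next call. One weakness of comparing timestamps is their finite granularity: two writes within the same tick look identical, so a version counter stored alongside the value may be more robust.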

node fs.fsync (when to use?)

I want to safely write a file and I want to understand the proper use/place for fsync.
https://linux.die.net/man/2/fsync
After reading ^ that, I am puzzled as to where to effectively use it.
Question, do I:
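// `data` (not in the original snippet) stands for the contents to write.
fs.writeFile('temp/file.txt', data, 'utf-8', function (error) {
  if (error) {
    fs.unlink('temp/file.txt', function () { cb(error, undefined); });
  } else {
    fs.rename('temp/file.txt', 'real/file.txt', function () {
      // Note: fs.fsync expects a file descriptor, not a path.
      fs.fsync('real/file.txt', function () {
        cb(undefined, true);
      });
    });
  }
});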
I'm writing something that performs many file changes. I have looked at modules that write atomic, but I would like to understand the process.
fsync is one of those functions where it's extremely rare that you'll need to use it.
All operating systems mask the fact that storage devices are slow by caching reads and writes. When you write to a file, it doesn't immediately write to the actual storage medium; it'll capture it in a cache, tell your program that the write has completed, and go and write the contents to the storage device in the background instead. The operating system will keep everything consistent though; if another application reads from that file, it'll see the new contents, as the OS will serve the contents from cache.
Note for a moment that this isn't universal; I believe Windows disables caching for removable storage devices to prevent data loss when people pull the drive out. There's also some set of flags you can pass to open() to disable the cache.
For almost all use cases, you don't need to care that this happens. The only upshot for you is that your program can continue faster. There are some cases where this is problematic though:
If power is lost, the contents of the cache are lost, so the disk won't have all the new contents of the file.
If the drive is removed, writes will equally be lost. This is pretty typical for removable storage devices, and I'm pretty sure 90% of people ignore the "safely remove" prompt ;).
I think doing direct reads directly from a device (i.e. /dev/sdX in Linux) will bypass this cache, but I'm not 100% sure.
Examples of where it is needed are, say, databases. When you run an update query, the database will normally update its in-memory state, and write the mutation to a transaction log. Reliability is a good thing for a database though, so it will write to the transaction log and do an fsync on that file before responding to the user (or will have opened the transaction log as unbuffered) so there's some level of guarantee that the transaction has been persisted.
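In your example, the intent of the fsync is to make sure the renamed file has actually been flushed to disk. Two caveats apply, though. First, fs.fsync takes a file descriptor, not a path. Second, as the man page you linked points out, calling fsync on a file does not necessarily flush the directory entry that names it; for the rename itself to be durable, you also need an fsync on a descriptor for the containing directory. The file's own data is best fsynced before the rename, so that the new name never points at incomplete contents.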
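As a rough illustration of where the fsync calls would usually go in a write-then-rename sequence, here is a minimal sketch. `writeFileDurably` is a hypothetical name, synchronous calls are used only to make the order obvious, and the directory fsync assumes a platform where a directory can be opened read-only (Linux and macOS; it is treated as best effort elsewhere).

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: replace `target` with `data` so that, after a crash,
// the path holds either the old contents or the new ones.
function writeFileDurably(target, data) {
  const dir = path.dirname(target);
  // Keep the temporary file in the same directory: rename is only atomic
  // within a single file system.
  const temp = path.join(dir, `.${path.basename(target)}.${process.pid}.tmp`);
  try {
    const fd = fs.openSync(temp, 'w');
    try {
      fs.writeFileSync(fd, data); // write everything to the temporary file
      fs.fsyncSync(fd);           // flush its data before it becomes visible
    } finally {
      fs.closeSync(fd);
    }
    fs.renameSync(temp, target);  // swap the new file into place
  } catch (err) {
    fs.rmSync(temp, { force: true }); // leave no half-written temporary behind
    throw err;
  }

  // Flush the directory entry so the rename itself survives a power loss.
  let dirFd;
  try {
    dirFd = fs.openSync(dir, 'r');
    fs.fsyncSync(dirFd);
  } catch (err) {
    // Best effort: some platforms cannot open or sync a directory.
  } finally {
    if (dirFd !== undefined) fs.closeSync(dirFd);
  }
}
```

Usage would look like `writeFileDurably('real/file.txt', 'contents')`. Whether the data truly reaches the platters still depends on the drive and its write cache.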

Transactionally writing files in Node.js

I have a Node.js application that stores some configuration data in a file. If you change some settings, the configuration file is written to disk.
At the moment, I am using a simple fs.writeFile.
Now my question is: What happens when Node.js crashes while the file is being written? Is there the chance to have a corrupt file on disk? Or does Node.js guarantee that the file is written in an atomic way, so that either the old or the new version is valid?
If not, how could I implement such a guarantee? Are there any modules for this?
What happens when Node.js crashes while the file is being written? Is
there the chance to have a corrupt file on disk? Or does Node.js
guarantee that the file is written in an atomic way, so that either
the old or the new version is valid?
Node implements only a (thin) async wrapper over system calls, thus it does not provide any guarantees about atomicity of writes. In fact, fs.writeFile repeatedly calls fs.write (through an internal writeAll helper) until all data is written. You are right that when Node.js crashes, you may end up with a corrupted file.
If not, how could I implement such a guarantee? Are there any modules for this?
The simplest solution I can come up with is the one used e.g. for FTP uploads:
Save the content to a temporary file with a different name.
When the content is written on disk, rename temporary file to destination file.
The man page says that rename guarantees to leave an instance of newpath in place (on Unix systems like Linux or OSX).
fs.writeFile, just like all the other methods in the fs module, is implemented as a simple wrapper around standard POSIX functions (as stated in the docs).
Digging a bit in nodejs' code, one can see that the fs.js, where all the wrappers are defined, uses fs.c for all its file system calls. More specifically, the write method is used to write the contents of the buffer. It turns out that the POSIX specification for write explicitly says that:
Atomic/non-atomic: A write is atomic if the whole amount written in
one operation is not interleaved with data from any other process.
This is useful when there are multiple writers sending data to a
single reader. Applications need to know how large a write request can
be expected to be performed atomically. This maximum is called
{PIPE_BUF}. This volume of IEEE Std 1003.1-2001 does not say whether
write requests for more than {PIPE_BUF} bytes are atomic, but requires
that writes of {PIPE_BUF} or fewer bytes shall be atomic.
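So it might seem pretty safe to write, as long as the size of the buffer is no larger than PIPE_BUF. Be careful, though: this guarantee is about pipes and FIFOs, and about writes not being interleaved with data from other processes. It says nothing about what a regular file looks like after a crash in the middle of a write. PIPE_BUF is also a system-dependent constant, so you would need to check its value for your platform.
write-file-atomic will do what you need. It writes to a temporary file, then renames it over the target. That's safe.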

using files as IPC on linux

I have one writer which creates and sometimes updates a file with some status information. The readers are implemented in lua (so I got only io.open) and possibly bash (cat, grep, whatever). I am worried about what would happen if the status information is updated (which means a complete file rewrite) while a reader has an open handle to the file: what can happen? I have also read that if the write/read operation is below 4KB, it is atomic: that would be perfectly fine for me, as the status info can fit well in such dimension. Can I make this assumption?
A read or write is atomic under 4Kbytes only for pipes, not for disk files (for which the atomic granularity may be the file system block size, usually 512 bytes).
In practice you could avoid bothering about such issues (assuming your status file is e.g. less than 512 bytes), and I believe that if the writer is opening and writing quickly that file (in particular, if you avoid open(2)-ing a file and keeping the opened file handle for a long time -many seconds-, then write(2)-ing later -once, a small string- inside it), you don't need to bother.
If you are paranoid, but do assume that readers are (like grep) opening a file and reading it quickly, you could write to a temporary file and rename(2) it once it has been written (and close(2)-ed) in its entirety.
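The effect of the rename approach on a reader that already holds an open handle can be observed with a small experiment. The sketch below is written in Node.js rather than Lua or bash, uses made-up file names and a hypothetical `publishStatus` helper, and assumes POSIX rename semantics on a local file system.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'status-'));
const statusFile = path.join(dir, 'status.txt');

// Writer side: publish a new status by writing a sibling file and renaming it.
function publishStatus(text) {
  const temp = statusFile + '.tmp';
  fs.writeFileSync(temp, text);
  fs.renameSync(temp, statusFile);
}

publishStatus('state=starting\n');

// A slow reader opens the file and keeps the descriptor around.
const readerFd = fs.openSync(statusFile, 'r');

// Meanwhile the writer replaces the file.
publishStatus('state=running\n');

// The old descriptor still refers to the old, complete file...
const seenByOldReader = fs.readFileSync(readerFd, 'utf8');
fs.closeSync(readerFd);

// ...while anyone opening the path now gets the new, complete file.
const seenByNewReader = fs.readFileSync(statusFile, 'utf8');

console.log(JSON.stringify({ seenByOldReader, seenByNewReader }));
```

Under the stated assumptions, the reader with the old handle gets the complete old status and a fresh open gets the complete new one; neither sees a mixture.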
As Duck suggested, locking the file in both readers and writers is also a solution.
I may be mistaken, in which case someone will correct me, but I don't think the external readers are going to pay any attention to whether the file is being simultaneously updated. They are going to print (or possibly eof or error out) whatever is there.
In any case, why not avoid the whole mess and just use file locks? Have the writer flock (or similar) and the readers check the lock. If they get the lock they know they are ok to read.

Possible to implement journaling with a single fsync per commit?

Let's say you're building a journaling/write-ahead-logging storage system. Can you simply implement this by (for each transaction) appending the data (with write(2)), appending a commit marker, and then fsync-ing?
The scenario to consider is if you do a large set of writes to this log then fsync it, and there's a failure during the fsync. Are the inode direct/indirect block pointers flushed only after all data blocks are flushed, or are there no guarantees that blocks are being flushed in order? If the latter, then during recovery, if you see a commit marker at the end of the file, you can't trust that the data between it and the previous commit marker is meaningful. Thus you have to rely on another mechanism (involving at least another fsync) to determine what extent of the log file is consistent (e.g., writing/fsyncing the data, then writing/fsyncing the commit marker).
If it makes a difference, mainly wondering about ext3/ext4 as the context.
Note that Linux's and macOS's fsync and fdatasync are incorrect by default, in the sense that they may return before the data has left the drive's volatile write cache. Windows is correct by default, but can emulate Linux for benchmarking purposes.
Also, fdatasync issues multiple disk writes if you append to the end of a file, since it needs to update the file inode with the new length. If you want to have one write per commit, your best bet is to pre-allocate log space, store a CRC of the log entries in the commit marker, and issue a single fdatasync() at commit. That way, no matter how much the OS / hardware reorder behind your back, you can find a prefix of the log that actually hit disk.
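Here is a minimal sketch of the checksum-in-the-commit-marker idea, written in Node.js for illustration. The record layout, the function names, and the use of a zero-filled pre-allocated file are all assumptions of this sketch, not something a particular database does. CRC-32 is implemented inline so the example has no dependencies.

```javascript
const fs = require('fs');

// CRC-32 (the common reflected polynomial 0xEDB88320), table-driven.
const CRC_TABLE = new Uint32Array(256).map((_, n) => {
  let c = n;
  for (let k = 0; k < 8; k++) c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
  return c >>> 0;
});
function crc32(buf) {
  let c = 0xFFFFFFFF;
  for (const byte of buf) c = CRC_TABLE[(c ^ byte) & 0xFF] ^ (c >>> 8);
  return (c ^ 0xFFFFFFFF) >>> 0;
}

// Hypothetical record layout: [payload length: u32][crc32(payload): u32][payload].
// The 8-byte header plays the role of the commit marker.
function encodeRecord(payload) {
  const header = Buffer.alloc(8);
  header.writeUInt32LE(payload.length, 0);
  header.writeUInt32LE(crc32(payload), 4);
  return Buffer.concat([header, payload]);
}

// Commit = one positioned write into pre-allocated space plus one fdatasync.
// Returns the offset at which the next record should go.
function commit(fd, position, payload) {
  const record = encodeRecord(payload);
  let done = 0;
  while (done < record.length) {
    done += fs.writeSync(fd, record, done, record.length - done, position + done);
  }
  fs.fdatasyncSync(fd);
  return position + record.length;
}

// Recovery: walk the log and stop at the first record that is empty,
// truncated, or fails its checksum. Everything before that point is trusted.
function recoverPrefix(log) {
  const payloads = [];
  let pos = 0;
  while (pos + 8 <= log.length) {
    const length = log.readUInt32LE(pos);
    const expected = log.readUInt32LE(pos + 4);
    if (length === 0 || pos + 8 + length > log.length) break;
    const payload = log.subarray(pos + 8, pos + 8 + length);
    if (crc32(payload) !== expected) break;
    payloads.push(payload);
    pos += 8 + length;
  }
  return { payloads, validLength: pos };
}
```

Because the pre-allocated space is zero-filled, a zero length is treated as the end of the log. A log that is reused rather than freshly allocated would also need something like a sequence number in each record, so that stale records from an earlier pass are not mistaken for new ones.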
If you want to use the log for durable commits or write ahead, things get harder, since you need to make sure that fsync actually works. Under Linux, you'll want to disable the disk write cache with hdparm, or mount the partition with barrier set to true. [Edit: I stand corrected, barrier doesn't seem to give the correct semantics. SATA and SCSI introduce a number of primitives, such as write barriers and native command queuing, that make it possible for operating systems to export primitives that enable write-ahead logging. From what I can tell from manpages and online, Linux only exposes these to filesystem developers, not to userspace.]
Paradoxically, disabling the disk write cache sometimes leads to better performance, since you get more control over write scheduling in user space; if the disk queues up a bunch of synchronous write requests, you end up exposing strange latency spikes to the application. Disabling write cache prevents this from happening.
Finally, real systems use group commit, and do < 1 sync write per commit with concurrent workloads.
There's no guarantee on the order in which blocks are flushed to disk. These days even the drive itself can re-order blocks on their way to the platters.
If you want to enforce ordering, you need to at least fdatasync() between the writes that you want ordered. All a sync promises is that when it returns, everything written before the sync has hit storage.
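The two-sync variant described in the question (data first, then the commit marker) could be sketched as below. The marker format and the function name are hypothetical, and the sketch inherits every caveat above about whether a sync really reaches stable storage.

```javascript
const fs = require('fs');

const COMMIT_MARKER = Buffer.from('COMMIT\n'); // hypothetical marker format

// Append `data`, make it durable, and only then append the commit marker.
function appendWithOrderedCommit(logPath, data) {
  const fd = fs.openSync(logPath, 'a');
  try {
    fs.writeFileSync(fd, data);
    fs.fdatasyncSync(fd);                // the data is flushed before the marker is written
    fs.writeFileSync(fd, COMMIT_MARKER);
    fs.fdatasyncSync(fd);                // now the commit itself is flushed
  } finally {
    fs.closeSync(fd);
  }
}
```

The cost is two synchronous flushes per commit, which is what the checksum-based single-sync scheme tries to avoid.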

Resources