Why is the string version of fs.write not documented in the node.js docs?

fs.write(fd, str, position, encoding='utf8', [callback])
seems to work just fine, but it's not documented in the File System docs (only the buffer version is). Does it cause problems if used?
Also, what is the proper fs.open() flag to use with fs write methods that use the position argument? (By trial and error, I've noticed that 'r+' seems to work the best.)
Finally, does writing to the middle of a file, using the position arg, impose a performance penalty when compared to just appending to the end of a file?
Thanks in advance for your help.

but it's not documented in the File System docs (only the buffer version is). Does it cause problems if used?
It's probably just an oversight. Try submitting an issue or a patch on github.
Also, what is the proper fs.open() flag to use with fs write methods that use the position argument?
That depends on what you're trying to do exactly.
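For example, a minimal sketch for updating a few bytes in place (the file name and offset are just placeholders): 'r+' opens an existing file for reading and writing without truncating it, so a positioned write only overwrites the bytes it touches, and the string form of fs.write takes the position before the encoding.
const fs = require('fs');
// 'r+' fails if the file doesn't exist; it reads/writes in place without truncating.
fs.open('data.txt', 'r+', (err, fd) => {
  if (err) throw err;
  // String form: fs.write(fd, string, position, encoding, callback)
  fs.write(fd, 'XYZ', 10, 'utf8', (err, written) => {
    if (err) throw err;
    fs.close(fd, () => {});
  });
});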
Finally, does writing to the middle of a file, using the position arg, impose a performance penalty when compared to just appending to the end of a file?
Node doesn't know about such things; the filesystem in use does. What filesystem are you using? What sort of disks? Are the files on the order of tens of bytes or tens of gigabytes? How much are you writing to the middle of the file?

Related

What's the difference between `flush` and `sync_all`?

I'm curious whether I should use Write::flush or File::sync_all when I finish writing a file.
TL;DR: If you want to "ensure" that the data has been written to the device, then use File::sync_all if you use a File. Note that this isn't necessary though.
The Write::flush implementation for File uses the operating system dependent flush operation, for example std::sys::unix::File::flush, or std::sys::windows::File::flush. Those flush operations do... nothing. Both implementations just return Ok(()).
Why? Because the write() already uses the underlying system call for write() in both cases; the handle-based write on Windows, and the file descriptor-based write on Unix-like systems. At that point, it's out of reach of the Rust environment, save for a system call that's specific to files.
So what is Write::flush useful for? It's useful if you have any kind of buffer before the actual file, for example a BufWriter. If you have a File wrapped by a BufWriter, you need to call flush to ensure that the bytes get written to the file. Keep in mind that BufWriter's Drop implementation also tries(!) to write those bytes, but it may or may not succeed, so you're supposed to call Write::flush yourself (see BufWriter's documentation).
That being said, sync_all isn't necessary and will just block your program. The operating system will handle the file system synchronisation. While you can certainly wait for that synchronisation to happen via sync_data or sync_all, you're usually better off not doing either.
Write::flush for an on-disk file is actually a no-op [source]. It's useless for File and is only implemented for consistency. This interface is meant for streams that keep an application-level in-memory buffer before writing to the destination, as stated in the doc:
Flush this output stream, ensuring that all intermediately buffered contents reach their destination.
File::sync_data is, in a sense, the useful version of flush for File. Under the hood an intermediate buffer is used at the kernel level, and sync_data delegates to the fdatasync POSIX call, which does at the kernel level what flush does at the application level.
File::sync_all does what File::sync_data does, and on top of that it also ensures that the file's metadata is written to disk. It delegates to fsync on POSIX systems.
Sidenote: depending on the system (e.g. macOS, Android, etc.), the implementations of File::sync_data and File::sync_all may be exactly the same.
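For comparison, Node.js exposes the same POSIX calls as fs.fdatasync and fs.fsync, so the distinction carries over directly; a minimal sketch ('out.bin' is just a placeholder):
const fs = require('fs');
const fd = fs.openSync('out.bin', 'w');
fs.writeSync(fd, Buffer.from('hello'));
fs.fdatasyncSync(fd);   // fdatasync(2): flush the file's contents to the device
fs.fsyncSync(fd);       // fsync(2): flush contents and metadata
fs.closeSync(fd);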

Transactionally writing files in Node.js

I have a Node.js application that stores some configuration data in a file. If you change some settings, the configuration file is written to disk.
At the moment, I am using a simple fs.writeFile.
Now my question is: What happens when Node.js crashes while the file is being written? Is there the chance to have a corrupt file on disk? Or does Node.js guarantee that the file is written in an atomic way, so that either the old or the new version is valid?
If not, how could I implement such a guarantee? Are there any modules for this?
What happens when Node.js crashes while the file is being written? Is there the chance to have a corrupt file on disk? Or does Node.js guarantee that the file is written in an atomic way, so that either the old or the new version is valid?
Node implements only a (thin) async wrapper over the system calls, so it does not provide any guarantees about the atomicity of writes. In fact, fs.writeAll repeatedly calls fs.write until all the data is written. You are right that when Node.js crashes, you may end up with a corrupted file.
If not, how could I implement such a guarantee? Are there any modules for this?
The simplest solution I can come up with is the one used e.g. for FTP uploads:
Save the content to a temporary file with a different name.
When the content is written to disk, rename the temporary file to the destination file.
The man page says that rename guarantees that an instance of newpath remains in place (on Unix systems like Linux or OS X).
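A minimal sketch of that approach in Node.js (the helper name and temp-file naming scheme are just illustrative):
const fs = require('fs');
const path = require('path');
// Write to a temp file in the same directory, then rename it over the destination.
function writeFileTransactional(file, data, callback) {
  const tmp = path.join(path.dirname(file), '.' + path.basename(file) + '.tmp-' + process.pid);
  fs.writeFile(tmp, data, (err) => {
    if (err) return callback(err);
    fs.rename(tmp, file, callback);
  });
}
writeFileTransactional('config.json', JSON.stringify({ theme: 'dark' }), (err) => {
  if (err) throw err;
});
Putting the temporary file in the same directory matters: rename is only atomic when source and destination are on the same filesystem.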
fs.writeFile, just like all the other methods in the fs module, is implemented as a simple wrapper around standard POSIX functions (as stated in the docs).
Digging a bit into Node.js's code, one can see that fs.js, where all the wrappers are defined, uses fs.c for all its file system calls. More specifically, the write method is used to write the contents of the buffer. It turns out that the POSIX specification for write explicitly says that:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of IEEE Std 1003.1-2001 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.
So it seems it is pretty safe to write, as long as the size of the buffer is smaller than PIPE_BUF. This is a constant that is system-dependent though, so you might need to check it somewhere else.
write-file-atomic will do what you need. It writes to a temporary file, then renames it over the destination. That's safe.
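Roughly, usage looks like this (a sketch; see the module's README for the exact options, e.g. whether to fsync before renaming):
const writeFileAtomic = require('write-file-atomic');
writeFileAtomic('config.json', JSON.stringify({ theme: 'dark' }), (err) => {
  if (err) throw err;
});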

Can regular file reading benefit from non-blocking IO?

It seems to me that it can't, and I found a link that supports my opinion. What do you think?
The content of the link you posted is correct. A regular file, opened in non-blocking mode, will always be "ready" for reading; when you actually try to read it, blocking (or more accurately, as your source points out, sleeping) will occur until the operation can succeed.
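You can see this from Node.js as well (a sketch; 'bigfile.bin' is a placeholder): opening a regular file with O_NONBLOCK changes nothing: the read still sleeps in the kernel rather than failing with EAGAIN.
const fs = require('fs');
const { O_RDONLY, O_NONBLOCK } = fs.constants;
const fd = fs.openSync('bigfile.bin', O_RDONLY | O_NONBLOCK);
const buf = Buffer.alloc(4096);
// May sleep waiting on disk I/O, but never returns EAGAIN for a regular file.
const n = fs.readSync(fd, buf, 0, buf.length, 0);
fs.closeSync(fd);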
In any case, I think your source needs some sedatives. One angry person, that is.
I've been digging into this quite heavily for the past few hours and can attest that the author of the link you cited is correct. However, there appears to be "better" (using that term very loosely) support for non-blocking IO against regular files in the native Linux kernel for v2.6+. The "libaio" package contains a library that exposes the functionality offered by the kernel, but it has some caveats about the different types of file systems which are supported, and it's not portable to anything outside of Linux 2.6+.
And here's another good article on the subject.
You're correct that nonblocking mode has no benefit for regular files, and is not allowed to. It would be nice if there were a secondary flag that could be set, along with O_NONBLOCK, to change this, but due to the way cache and virtual memory work, it's actually not an easy task to define what correct "non-blocking" behavior for ordinary files would mean. Certainly there would be race conditions unless you allowed programs to lock memory associated with the file. (In fact, one way to implement a sort of non-sleeping IO for ordinary files would be to mmap the file and mlock the map. After that, on any reasonable implementation, read and write would never sleep as long as the file offset and buffer size remained within the bounds of the mapped region.)

Debugging under Linux: Is there a pseudo-tty-like circular buffer implementation?

I am developing under Linux with pretty tight constraints on disk usage. I'd like to be able to point logging to a fixed-size file. For example, if my application outputs all logs to stdout:
~/bin/myApp > /dev/debug1
and then, to see the last amount of output:
cat /dev/debug1
would write out however many bytes debug1 was set up to save (if at least that many had been written there).
This post suggests using expect or its library, but I was wondering if anyone has seen a "pseudo-tty" device driver-type implementation as I would prefer to not bind any more libraries to my executable.
I realize there are other mechanisms like logrotate, but I'd prefer to have a non-cron solution.
Pointers, suggestions, questions welcome!
Perhaps you could achieve what you want using mkfifo and something that reads the pipe with a suitable buffer. I haven't tried, but less --buffers=XXXXXX might work for this.
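A rough sketch of what that reader could look like in Node.js, assuming you've created the FIFO with mkfifo /tmp/debug1 and pointed the application's output at it (the path, buffer size, and the SIGUSR1 dump are all placeholders):
const fs = require('fs');
const FIFO_PATH = '/tmp/debug1';
const BUF_SIZE = 64 * 1024;
let tail = Buffer.alloc(0);
// Keep only the last BUF_SIZE bytes of whatever comes through the pipe.
fs.createReadStream(FIFO_PATH).on('data', (chunk) => {
  tail = Buffer.concat([tail, chunk]);
  if (tail.length > BUF_SIZE) tail = tail.subarray(tail.length - BUF_SIZE);
});
// Dump the retained tail on demand, roughly what `cat /dev/debug1` would do.
process.on('SIGUSR1', () => process.stdout.write(tail));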

Linux/perl mmap performance

I'm trying to optimize handling of large datasets using mmap. A dataset is in the gigabyte range. The idea was to mmap the whole file into memory, allowing multiple processes to work on the dataset concurrently (read-only). It isn't working as expected though.
As a simple test I simply mmap the file (using perl's Sys::Mmap module, using the "mmap" sub which I believe maps directly to the underlying C function) and have the process sleep. When doing this, the code spends more than a minute before it returns from the mmap call, despite this test doing nothing - not even a read - from the mmap'ed file.
Guessing, I thought maybe Linux required the whole file to be read when first mmap'ed, so after the file had been mapped in the first process (while it was sleeping), I invoked a simple test in another process which tried to read the first few megabytes of the file.
Surprisingly, it seems the second process also spends a lot of time before returning from the mmap call, about the same time as mmap'ing the file the first time.
I've made sure that MAP_SHARED is being used and that the process that mapped the file the first time is still active (that it has not terminated, and that the mmap hasn't been unmapped).
I expected a mmapped file would allow me to give multiple worker processes effective random access to the large file, but if every mmap call requires reading the whole file first, it's a bit harder. I haven't tested using long-running processes to see if access is fast after the first delay, but I expected using MAP_SHARED and another separate process would be sufficient.
My theory was that mmap would return more or less immediately, and that linux would load the blocks more or less on-demand, but the behaviour I am seeing is the opposite, indicating it requires reading through the whole file on each call to mmap.
Any idea what I'm doing wrong, or if I've completely misunderstood how mmap is supposed to work?
Ok, found the problem. As suspected, neither Linux nor Perl was to blame. To open and access the file I do something like this:
#!/usr/bin/perl
# Create 1 GB file if you do not have one:
# dd if=/dev/urandom of=test.bin bs=1048576 count=1000
use strict; use warnings;
use Sys::Mmap;
open (my $fh, "<", "test.bin")
|| die "open: $!";
my $t = time;
print STDERR "mmapping.. ";
mmap (my $mh, 0, PROT_READ, MAP_SHARED, $fh)
|| die "mmap: $!";
my $str = unpack ("A1024", substr ($mh, 0, 1024));
print STDERR " ", time-$t, " seconds\nsleeping..";
sleep (60*60);
If you test that code, there are no delays like those I found in my original code, and after creating the minimal sample (always do that, right!) the reason suddenly became obvious.
The error was that in my code I treated the $mh scalar as a handle, something lightweight that can be moved around easily (read: passed by value). It turns out it's actually a GB-long string, definitely not something you want to move around without creating an explicit reference (Perl lingo for a "pointer"/handle value). So if you need to store it in a hash or similar, make sure you store \$mh, and dereference it when you need to use it, like ${$hash->{mh}}, typically as the first parameter in a substr or similar.
If you have a relatively recent version of Perl, you shouldn't be using Sys::Mmap. You should be using PerlIO's mmap layer.
Can you post the code you are using?
On 32-bit systems the address space for mmap()s is rather limited (and varies from OS to OS). Be aware of that if you're using multi-gigabyte files and you are only testing on a 64-bit system. (I would have preferred to write this in a comment but I don't have enough reputation points yet.)
One thing that can help performance is the use of madvise(2), probably most easily done via Inline::C. madvise lets you tell the kernel what your access pattern will be like (e.g. sequential, random, etc.).
If I may plug my own module: I'd advise using File::Map instead of Sys::Mmap. It's much easier to use, and it's less crash-prone than Sys::Mmap.
That does sound surprising. Why not try a pure C version?
Or try your code on a different OS/perl version.
See Wide Finder for Perl performance with mmap. But there is one big pitfall: if your dataset is on a classical HDD and you read from multiple processes, you can easily fall into random access, and your IO will drop to unacceptable levels (20-40 times slower).
Ok, here's another update. Using Sys::Mmap or PerlIO's ":mmap" attribute both work fine in perl, but only up to 2 GB files (the magic 32-bit limit). Once the file is larger than 2 GB, the following problems appear:
Using Sys::Mmap and substr for accessing the file, it seems that substr only accepts a 32-bit int for the position parameter, even on systems where Perl supports 64 bits. There's at least one bug posted about it:
#62646: Maximum string length with substr
Using open(my $fh, "<:mmap", "bigfile.bin"), once the file is larger than 2 GB, it seems Perl will either hang or insist on reading the whole file on the first read (not sure which; I never ran it long enough to see if it completed), leading to dead slow performance.
I haven't found a workaround for either of these, and I'm currently stuck with slow (non-mmap'ed) file operations for working on these files. Unless I find a workaround, I may have to implement the processing in C or another language that supports mmap'ing huge files better.
Your access to that file had better be well and truly random to justify a full mmap. If your usage isn't evenly distributed, you're probably better off with a seek, a read into a freshly malloc'ed area, process that, free, rinse and repeat. And work with chunks in multiples of 4K, say 64K or so.
I once benchmarked a lot of string pattern matching algorithms. mmapping the entire file was slow and pointless. Reading into a static 32K-ish buffer was better, but still not particularly good. Reading into a freshly malloc'ed chunk, processing it, and then letting it go lets the kernel work wonders under the hood. The difference in speed was enormous, but then again pattern matching is very fast complexity-wise, and more emphasis must be put on handling efficiency than is perhaps usually needed.
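The same chunk-at-a-time pattern, sketched in Node.js for concreteness (the file name and the processing step are placeholders):
const fs = require('fs');
const CHUNK = 64 * 1024; // a multiple of 4K, as suggested above
function processChunk(buf) {
  // placeholder for the real work (pattern matching, parsing, ...)
}
const fd = fs.openSync('bigfile.bin', 'r');
let pos = 0;
for (;;) {
  const buf = Buffer.alloc(CHUNK); // fresh allocation each iteration
  const n = fs.readSync(fd, buf, 0, CHUNK, pos);
  if (n === 0) break;
  processChunk(buf.subarray(0, n));
  pos += n;
}
fs.closeSync(fd);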
