fs.createWriteStream over several processes - node.js

How can I implement a system where multiple Node.js processes write to the same file with fs.createWriteStream, such that they don't overwrite data? It looks like the default setup for fs.createWriteStream is that the file is cleared out when that method is called. My goal is to clear out the file once, and then have all other subsequent writers only append data.
Should I use fs.createWriteStream and then fs.appendFile? Or is there a way to open up a stream for each process, not just for the first process to open the file?

Should I use fs.createWriteStream and then fs.appendFile?
You can use either.
With fs.createWriteStream you have to change the flag like this:
fs.createWriteStream('your_file', {
  flags: 'a' // default is 'w', which truncates; 'a' appends ('a+' also allows reading, which a write stream does not need)
})
This creates the file if it doesn't exist, or opens it for writing if it does, and every write goes to the end of the file (append mode).
fs.appendFile does much the same thing: it opens the file in append mode, writes the data, and closes it again.
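A minimal sketch of the "clear once, then append everywhere" setup the question asks for. The helper names (resetLog, openAppendStream, appendRecord) and the demo file location are made up for illustration; only the fs calls and flags come from Node itself.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: exactly one process calls this once at startup.
// Flag 'w' creates the file or truncates it to zero length.
function resetLog(file) {
  fs.writeFileSync(file, '', { flag: 'w' });
}

// Hypothetical helper: every writer opens its own append-mode stream.
// Flag 'a' creates the file if it is missing and never truncates it.
function openAppendStream(file) {
  return fs.createWriteStream(file, { flags: 'a' });
}

// Hypothetical helper: the same effect without keeping a stream open.
function appendRecord(file, record) {
  fs.appendFileSync(file, record + '\n');
}

// Demo in a throwaway directory (the location is an assumption).
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'append-demo-'));
const demoFile = path.join(demoDir, 'shared.log');

resetLog(demoFile);               // done by the first process only
appendRecord(demoFile, 'first');  // any later writer
appendRecord(demoFile, 'second'); // any later writer
```

A long-running writer would keep the stream returned by openAppendStream and call write() on it instead of reopening the file for every record.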
Now the problem with multiple processes accessing the same file. Most operating systems will let several processes open the same file for writing at once, but nothing coordinates their writes, so output from different processes can interleave.
Therefore each process should wait until the file is released whenever another process is writing to it. You will probably need a library for that.
This one, for example: https://www.npmjs.com/package/lockup
Or this one: https://github.com/Perennials/mutex-node
You can also find a lot more here: https://www.npmjs.com/browse/keyword/lock
Or here: https://www.npmjs.com/browse/keyword/mutex
I have not tried any of these libraries, but the ones I posted and several others on those lists should do what you need.
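For comparison, here is a rough sketch of what such a lock can look like without any library. It is not how the packages above work internally; it uses a separate lock file created with the 'wx' flag, which fails if the file already exists. The helper names and the ".lock" suffix are assumptions, and a crashed process would leave a stale lock file behind, which this sketch does not handle.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: take the lock by creating the lock file exclusively.
// 'wx' makes openSync throw EEXIST when the file is already there.
function tryLock(lockPath) {
  try {
    const fd = fs.openSync(lockPath, 'wx');
    fs.writeSync(fd, String(process.pid)); // record the owner for debugging
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false; // someone else holds the lock
    throw err;
  }
}

// Hypothetical helper: release the lock by deleting the lock file.
function unlock(lockPath) {
  fs.unlinkSync(lockPath);
}

// Append only while holding the lock; returns false if the caller must retry.
function appendWithLock(file, text) {
  const lockPath = file + '.lock';
  if (!tryLock(lockPath)) return false;
  try {
    fs.appendFileSync(file, text);
  } finally {
    unlock(lockPath);
  }
  return true;
}

// Demo in a throwaway directory (the location is an assumption).
const lockDir = fs.mkdtempSync(path.join(os.tmpdir(), 'lock-demo-'));
const lockedFile = path.join(lockDir, 'shared.log');
const wroteFirst = appendWithLock(lockedFile, 'a\n');
```

A real writer would call appendWithLock in a retry loop with a short delay until it returns true.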

Writing to a single file from multiple processes while ensuring data integrity is a fairly complex operation, which you can orchestrate using file locking.
However, you have two simpler approaches:
Write to a separate temporary file for each process, then concatenate the files at the end of the operations.
Send what you need to write to a single dedicated process and delegate the writing to it. Keep in mind that sending messages between processes can be expensive.
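The first approach could look roughly like the sketch below. The "part-<id>.tmp" naming scheme and the helper names are assumptions for illustration, and the two "workers" are simulated inside one process to keep the example short.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical naming scheme: each worker owns one part file.
function partFile(dir, workerId) {
  return path.join(dir, `part-${workerId}.tmp`);
}

// Hypothetical helper: run once, after all workers have finished.
function mergeParts(dir, output) {
  const parts = fs.readdirSync(dir)
    .filter((name) => /^part-.*\.tmp$/.test(name))
    .sort(); // deterministic order, by worker id
  const out = fs.openSync(output, 'w'); // the only place the output is truncated
  try {
    for (const name of parts) {
      const full = path.join(dir, name);
      fs.writeSync(out, fs.readFileSync(full));
      fs.unlinkSync(full); // the part is no longer needed
    }
  } finally {
    fs.closeSync(out);
  }
}

// Demo: two simulated workers, then a single merge step.
const workDir = fs.mkdtempSync(path.join(os.tmpdir(), 'parts-demo-'));
fs.appendFileSync(partFile(workDir, 'a'), 'from a\n');
fs.appendFileSync(partFile(workDir, 'b'), 'from b\n');
const mergedFile = path.join(workDir, 'merged.log');
mergeParts(workDir, mergedFile);
```

Because no two processes ever touch the same file, no locking is needed until the merge, which runs in one place.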

Related

"Just in time" read only filesystem using mkfifo and inotifywait

I am writing some gross middleware - basically, I have some old code that needs to open 100,000 files for reading only, expecting them all to be in one folder. It never writes. It is multiprocess so it can try to open ~30 files at the same time. The old way, I would have to actually copy the files into that folder (or use links, NFS, etc.). Worth noting I have no ability to change this old code - it's just a binary.
I have some new, fancy code that can retrieve a file almost instantly. I want to tie these things together, so when the old code tries to open the file, it is actually, in real time, running the new code.
So I thought of mkfifo and inotifywait. Instead of a folder of 100,000 files, I can make a folder of 100,000 named pipes. So far so good. The legacy code goes to open the files, not knowing that they are in fact named pipes. The problem is, I don't know what order the legacy code is going to open the files in (nice, right?). So I would like to TRIGGER the named pipe WRITE (from my fancy new code) when the legacy code goes in for the read. I can't spawn 100,000 writes and have them all block. So I thought hey - inotifywait makes sense. Every time the legacy code goes to open the pipe, it triggers a read event, which can then be used to spawn the pipe writer in the background. The problem is that inotifywait doesn't trigger the read event until AFTER the writer has been spawned!
Any ideas of how to solve this? Basically - I want to intercept a file open, block for a couple hundred ms while I retrieve the contents of the file, then return those contents. Ideally I don't have to create a custom FUSE filesystem to do this; it's just a read-only file open. The problem is that this needs to run fast and in parallel, and I don't know which files are going to be opened in what order. Gotta be a quick and dirty way!
Thanks in advance for everyone's time.

How to reliably read data from a file which is being continuously written by another process?

So, I am in a situation where one process is continuously (every few seconds) writing data to a file (overwriting, not appending). The data is JSON. Another process has to read this file at regular intervals, and it could be that the reading process reads the file while the writing process is still writing to it.
One solution I can think of is for the writer process to also write a corresponding checksum file. The reader process would then read both the file and its checksum file. If the calculated checksum doesn't match, the reader would retry until it does. That way it would know that it has read the correct data.
Or maybe a better solution is to read the file twice, a short time apart (much less than the writing interval of the writing process), and see if the two reads match.
A third way could be to write some magic data at the end of the file, so that the reading process knows it has read the whole file once it has encountered that magic data at the end.
What do you think? Are these solutions viable, or are there better methods to achieve this?
Create an entire new file each time, and rename() the new file once it's been completely written:
If newpath already exists, it will be atomically replaced, so that
there is no point at which another process attempting to access
newpath will find it missing. ...
Some copy of the file will always be there, and it will always be complete and correct.
So, instead of
writeDataFile( "/path/to/data/file.json" );
and then trying to figure out what to do in the reader process(es), you simply do
writeDataFile( "/path/to/data/file.json.new" );
rename( "/path/to/data/file.json.new", "/path/to/data/file.json" );
No locking is necessary, nor any reading of the file and computing checksums and hoping it's correct.
The only issue is that any reader process has to open() the file each time it needs to read the latest copy. It can't keep an open file descriptor on the file and try to read new contents from it, because the rename() call unlinks the original file and replaces it with an entirely new one.
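In Node.js terms, the write-then-rename pattern above might be sketched like this. The helper names and the demo directory are assumptions; the ".new" suffix follows the example above. The temporary file has to live on the same filesystem as the target, otherwise rename() fails instead of replacing the file.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: write the complete document to "<target>.new",
// then swap it into place with a single rename.
function writeJsonAtomically(target, data) {
  const tmp = target + '.new';
  fs.writeFileSync(tmp, JSON.stringify(data));
  fs.renameSync(tmp, target); // readers see either the old file or the new one
}

// Hypothetical helper: reopen the path on every poll, as described above.
function readLatest(target) {
  return JSON.parse(fs.readFileSync(target, 'utf8'));
}

// Demo in a throwaway directory (the location is an assumption).
const dataDir = fs.mkdtempSync(path.join(os.tmpdir(), 'rename-demo-'));
const dataFile = path.join(dataDir, 'file.json');

writeJsonAtomically(dataFile, { n: 1 });
const firstRead = readLatest(dataFile);
writeJsonAtomically(dataFile, { n: 2 });
const secondRead = readLatest(dataFile);
```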
If you want to guarantee that the reader always gets all data, consider using a named pipe.
mkfifo ./jsonoutput
Then set one program to write to and the other program to read from this file ./jsonoutput.
So long as the writer closes and reopens the file after writing each JSON document, the reader will get an EOF and process the input.
However if that isn't the case, the reader will just keep reading and the writer will just keep writing. If the programs aren't designed to handle streams of data like that, then they might just never process the data and the programs will hang.
If that's the case then you could write a program that reads from one named pipe until it gets a complete JSON and then flushes it through a second named pipe to the final program.
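The core of such a relay program is deciding where one JSON document ends. Below is a small sketch of that part only, assuming every document is a top-level object or array and that chunks arrive as strings (for example from a read stream opened with utf8 encoding). The function name createJsonSplitter is made up for this example.

```javascript
// Hypothetical helper: feed it arbitrary string chunks; it calls onDocument
// with every complete top-level JSON object or array seen so far.
function createJsonSplitter(onDocument) {
  let buffer = '';      // text of the document currently being assembled
  let depth = 0;        // nesting depth of {} and []
  let inString = false; // true while inside a JSON string literal
  let escaped = false;  // true right after a backslash inside a string

  return function push(chunk) {
    for (const ch of chunk) {
      // Between documents: skip whitespace and anything else until '{' or '['.
      if (depth === 0 && ch !== '{' && ch !== '[') continue;
      buffer += ch;
      if (inString) {
        if (escaped) escaped = false;
        else if (ch === '\\') escaped = true;
        else if (ch === '"') inString = false;
        continue; // brackets inside strings do not count
      }
      if (ch === '"') inString = true;
      else if (ch === '{' || ch === '[') depth += 1;
      else if (ch === '}' || ch === ']') {
        depth -= 1;
        if (depth === 0) {
          onDocument(JSON.parse(buffer)); // one complete document
          buffer = '';
        }
      }
    }
  };
}

// Demo: one document split across chunks, then a second one.
const received = [];
const pushChunk = createJsonSplitter((doc) => received.push(doc));
pushChunk('{"a": "x}y", "b": [1, ');
pushChunk('2]}\n{"c"');
pushChunk(': "q\\"r"}');
```

In the relay itself, pushChunk would be attached to the 'data' event of a read stream on the first pipe, and onDocument would write each document to the second pipe.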

Polling for readiness file

I work on Linux. How can I know that a gzip file is ready? I have a server that polls files in the directory /dir/. There is another, independent process that gzips files into /dir/. How can my server know that a file is ready?
There is no ready-made solution for this. Looking at the last modification timestamp of the file (mtime) is not reliable, because writes could be delayed if the system is overloaded (or the input to the gzip operation is not ready), or the generating process may stop writing because it has crashed.
Usually, when applications need to do something like this, they write the file under a different, temporary name that follows a specific pattern. The reading process recognizes the temporary files and skips them, assuming that they are still a work in progress and incomplete. Once the writer is finished, it renames the file to its final name (which is an atomic operation), and only then does the reader pick it up. This approach became popular with Dan Bernstein's maildir format:
Using maildir format
In maildir, a different directory is used for staging, but the general principle is the same.
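A sketch of both sides of this convention, assuming unfinished files carry a ".part" suffix and finished ones end in ".gz". The suffix and the helper names are illustrative choices, not something gzip or maildir prescribes.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const zlib = require('zlib');

const TEMP_SUFFIX = '.part'; // assumed marker for files still being written

// Writer side: produce the file under the temporary name, then rename it.
function publish(dir, name, contents) {
  const tmp = path.join(dir, name + TEMP_SUFFIX);
  fs.writeFileSync(tmp, zlib.gzipSync(Buffer.from(contents)));
  fs.renameSync(tmp, path.join(dir, name)); // the file becomes visible as ready
}

// Reader side: only names that end in ".gz" count as ready, so anything
// still carrying the temporary suffix is skipped.
function listReadyFiles(dir) {
  return fs.readdirSync(dir).filter((name) => name.endsWith('.gz')).sort();
}

// Demo: one finished file and one that is still "being written".
const pollDir = fs.mkdtempSync(path.join(os.tmpdir(), 'poll-demo-'));
publish(pollDir, 'a.gz', 'finished');
fs.writeFileSync(path.join(pollDir, 'b.gz' + TEMP_SUFFIX), 'incomplete');
const readyBefore = listReadyFiles(pollDir);

// The writer of b.gz finishes and renames its file.
fs.renameSync(path.join(pollDir, 'b.gz' + TEMP_SUFFIX), path.join(pollDir, 'b.gz'));
const readyAfter = listReadyFiles(pollDir);
```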
It is also possible to use lock files and POSIX advisory locking, but they lead to more complexity. However, in some cases, they can be employed in such a way that busy waiting/polling/periodic probing is not necessary.

Lua 4.0.1 appendto

Could someone please explain the proper way to use the appendto function?
I am trying to use it to write debug text to a file. I want it written immediately when I call the function, but for some reason the program waits until it exits, and then writes everything at once.
Am I using the right function? Do I need to open, then write, then close the file each time I write to it instead?
Thanks.
Looks like you are having an issue with buffering (this is also a common question in other languages, btw). The data you want to write to the file is being held in a memory buffer and is only written to disk at a later time (this is done to batch writes to disk together, for better performance).
One possibility is to open and close the file as you already suggested. Closing a file handle will flush the contents of the buffer to disk.
A second possibility is to use the flush function to explicitly request that the data be written to disk. In Lua 4.0.1, you can either call flush passing a file handle:
-- If you have opened your file with openfile (the Lua 4.0 name for open):
local myfile = openfile("myfile.txt", "a")
write(myfile, "debug text\n")
flush(myfile)
-- If you used appendto, the output file handle is in the _OUTPUT global variable
appendto("myfile.txt")
write("debug text\n")
flush(_OUTPUT)
or you can call flush with no arguments, in which case it will flush all the files you have currently open.
flush()
For details, see the reference manual: http://www.lua.org/manual/4.0/manual.html#6.

Node JS is async Read/Write safe?

Probably a dumb question, but if the program is asynchronously writing to a file, and you access that file while it's still writing, are the contents messed up?
In fact, it does not matter whether you are synchronously or asynchronously accessing a file: if some other process (yours or someone else's) modifies the file while you are in the middle of reading, you will get inconsistent results.
The exact kind of inconsistency you'll see depends on how the file is written and when reading starts.
In node's default mode (w), a file's existing contents are truncated when the file is opened.
An in-flight read will stop early (without erroring), meaning you'll only get part of the original file.
A read started after the write begins will read up to the last written byte. Depending on how far along and fast the write is, and how you read the file, the read may or may not see the complete file.
If the file is written in r+ mode, the contents are not truncated when the file is opened for writing. This means a read will see part of the old data and part of the new data. Things are further muddied if the write changes the file size.
This is all true regardless of whether you use streams (i.e. createReadStream), readFile, or even readFileSync. Any part of the file on disk can be changed while node is in the process of buffering the file into memory. (The only notable exception is if you use writeFileSync and then readFileSync in the same process, since the write call prevents the read from starting until the write is complete. However, this still doesn't prevent other processes from changing the file mid-read, and you shouldn't be using the sync methods anyway.)
In other words, reading and writing a file is non-atomic. To avoid inconsistency, you should write the file with a temporary name and then rename it when the write is complete.
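The difference between the w and r+ flags described above can be checked with a short script. This is only a demonstration in a throwaway temporary directory (an assumption of the example); it uses the sync API purely to keep the steps in order.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const flagDir = fs.mkdtempSync(path.join(os.tmpdir(), 'flags-demo-'));
const flagFile = path.join(flagDir, 'data.txt');

fs.writeFileSync(flagFile, 'old contents');

// 'r+' opens for reading and writing without truncating:
// a partial overwrite leaves a mix of new and old data.
let fd = fs.openSync(flagFile, 'r+');
fs.writeSync(fd, 'NEW', 0); // write three characters at position 0
fs.closeSync(fd);
const afterRPlus = fs.readFileSync(flagFile, 'utf8');

// 'w' truncates as soon as the file is opened, before anything is written.
fd = fs.openSync(flagFile, 'w');
const sizeAfterOpenW = fs.fstatSync(fd).size;
fs.closeSync(fd);
```

After the r+ write the file holds a mix of new and old bytes, and after opening with w it is already empty, which is what a concurrent reader would run into.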

Resources