Suppose I have multiple writers to a single regular file under Linux. Multiple unsynchronized processes have opened this file in append mode (O_APPEND) and write lines of length <= PIPE_BUF bytes. These writes are synchronized by the operating system and do not interleave.
A single consumer process aims to periodically consume all lines from this file. I am curious if there is a way to atomically split the file or otherwise consume lines in a way that guarantees that incoming appends are not lost.
For instance, the following approach would be racy:
1. Slurp the contents of the file into memory.
2. Truncate the file to length 0.
A new line could have been appended between these two steps.
This approach is also infeasible:
1. rename() the file.
2. Process the renamed file.
Writers would still be appending through their already-open file descriptors, which now refer to the renamed file.
I'm wondering whether there is a way to split this file without modifying the writers. For instance, this could be accomplished with flock checks in the writers, or by using a Unix socket instead of a file, but either approach would require writer modification.
The question: Is there any OS-provided synchronization mechanism or sequence of operations available to accomplish this, analogous to how the OS synchronizes appends of up to PIPE_BUF bytes?
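For concreteness, each writer in this setup looks roughly like the following minimal C sketch (the file name events.log is a placeholder):

    #include <fcntl.h>
    #include <unistd.h>

    /* Sketch of one of the unsynchronized writers described above.
     * O_APPEND makes the kernel position each write() at the current
     * end of file atomically. */
    int main(void)
    {
        int fd = open("events.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return 1;
        const char line[] = "one short record\n"; /* <= PIPE_BUF bytes */
        write(fd, line, sizeof line - 1);
        close(fd);
        return 0;
    }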
Related
So, I am in a situation where one process writes data to a file every few seconds (overwriting it, not appending). The data is JSON. Another process has to read this file at regular intervals, so it could happen that the reading process reads the file while the writing process is writing to it.
A solution to this problem that I can think of is for the writer process to also write a corresponding checksum file. The reader process would then read both the file and its checksum file; if the calculated checksum doesn't match, the reader would repeat the read until it does. That way it would know that it has read the correct data.
Or maybe a better solution is to read the file twice within a short interval (much shorter than the writing interval of the writing process) and check whether the two reads match.
A third way could be to write some magic data at the end of the file, so that the reading process knows it has read the whole file once it encounters that magic data at the end.
What do you think? Are these solutions viable, or are there better methods to achieve this?
Create an entirely new file each time, and rename() the new file once it has been completely written:
If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. ...
Some copy of the file will always be there, and it will always be complete and correct:
So, instead of
writeDataFile( "/path/to/data/file.json" );
and then trying to figure out what to do in the reader process(es), you simply do
writeDataFile( "/path/to/data/file.json.new" );
rename( "/path/to/data/file.json.new", "/path/to/data/file.json" );
No locking is necessary, nor any reading of the file back and computing of checksums in the hope that it's correct.
The only issue is that any reader process has to open() the file each time it needs to read the latest copy; it can't keep an open file descriptor on the file and try to read new contents from it, because the rename() call unlinks the original file and replaces it with an entirely new file.
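For illustration, here is a minimal C sketch of this pattern. The name write_data_file and the .new suffix are just placeholders; the fsync() is an extra precaution so the data is on disk before it becomes visible under the final name:

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Write the JSON under a temporary name, flush it, then atomically
     * publish it under the final name with rename(). */
    static int write_data_file(const char *path, const char *json)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.new", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        size_t len = strlen(json);
        if (write(fd, json, len) != (ssize_t)len || fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        close(fd);

        /* Readers see either the old file or the new one, never a mix. */
        return rename(tmp, path);
    }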
If you want to guarantee that the reader always gets all data, consider using a named pipe.
mkfifo ./jsonoutput
Then have one program write to, and the other program read from, this file ./jsonoutput.
So long as the writer is regularly closing and reopening the file after writing each JSON, the reader will get an EOF and process the input.
However, if that isn't the case, the reader will just keep reading and the writer will just keep writing. If the programs aren't designed to handle streams of data like that, they might never process the data, and the programs will hang.
If that's the case then you could write a program that reads from one named pipe until it gets a complete JSON and then flushes it through a second named pipe to the final program.
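As a sketch of the reader side in C, assuming the writer closes the FIFO after each document (the path matches the mkfifo example above):

    #include <stdio.h>

    int main(void)
    {
        char line[8192];
        for (;;) {
            /* Opening a FIFO for reading blocks until a writer opens it. */
            FILE *f = fopen("./jsonoutput", "r");
            if (!f)
                return 1;
            /* fgets() returns NULL at EOF, i.e. when the writer closes. */
            while (fgets(line, sizeof line, f))
                fputs(line, stdout); /* stand-in for real JSON processing */
            fclose(f); /* loop around and wait for the next writer */
        }
    }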
I work on Linux. How can I know that a gzip file is ready? I have a server that polls files in the directory /dir/. There is another, independent process that gzips files into /dir/. How can my server know that a file is ready?
There is no ready-made solution for this. Looking at the last-modification timestamp of the file (mtime) is not reliable, because writes could be delayed if the system is overloaded (or if the input to the gzip operation is not ready), or the generating process may stop writing because it has crashed.
Usually, when applications need to do something like this, they write the temporary file under a different name, following a specific pattern. The reading process recognizes the temporary files and skips them, assuming that they are still a work in progress and incomplete. Once the writer is finished, it renames the file to its final name (which is an atomic operation), and only then does the reader pick it up. This approach became popular with Dan Bernstein's maildir format:
Using maildir format
In maildir, a different directory is used for staging, but the general principle is the same.
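As an illustration of the reader side of this pattern, here is a C sketch that skips in-progress files, assuming the (hypothetical) convention that staged files carry a .tmp suffix:

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    /* Scan /dir and report only files that have been renamed into place;
     * names ending in ".tmp" are assumed to still be in progress. */
    int main(void)
    {
        DIR *d = opendir("/dir");
        if (!d)
            return 1;
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            size_t n = strlen(e->d_name);
            if (n > 4 && strcmp(e->d_name + n - 4, ".tmp") == 0)
                continue; /* still being written: skip it */
            if (n > 3 && strcmp(e->d_name + n - 3, ".gz") == 0)
                printf("ready: %s\n", e->d_name);
        }
        closedir(d);
        return 0;
    }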
It is also possible to use lock files and POSIX advisory locking, but they lead to more complexity. However, in some cases, they can be employed in such a way that busy waiting/polling/periodic probing is not necessary.
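For reference, taking a POSIX advisory lock looks roughly like this. Note that it only excludes other processes that also take locks on the same file; processes that never call fcntl() are unaffected:

    #include <fcntl.h>
    #include <stdio.h>

    /* Take an exclusive advisory lock on the whole file, blocking
     * until it is granted. */
    int lock_whole_file(int fd)
    {
        struct flock fl = {
            .l_type   = F_WRLCK,  /* exclusive (write) lock */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,        /* 0 means "through end of file" */
        };
        return fcntl(fd, F_SETLKW, &fl);
    }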
My program does something like:
Open a file (append mode)
Write some stuff
Close
[repeat]
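(In C terms, one pass of that loop is roughly the following; the error handling is omitted for brevity.)

    #include <fcntl.h>
    #include <unistd.h>

    /* One iteration: open in append mode, write, close. O_APPEND makes
     * the kernel place each write() at the current end of file. */
    void append_record(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return;
        write(fd, buf, len);
        close(fd);
    }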
The file is different most of the time, but on certain occasions (not uncommon, really) the same file is repeated, either consecutively or within a few iterations.
Is there any chance the kernel can play tricks on me, with a newly opened file descriptor not pointing to the real end of the file? Say the previous write is not yet completed (buffered somewhere in the kernel), and opening the file again yields a descriptor positioned before the real end of the file. That would potentially result in overlapping writes.
As I said, my program is single-threaded; I see no reason why this would happen, but I do not fully understand the kernel's guarantees when it comes to this.
Thanks!
How can I implement a system where multiple Node.js processes write to the same file with fs.createWriteStream, such that they don't overwrite data? It looks like the default setup for fs.createWriteStream is that the file is cleared out when that method is called. My goal is to clear out the file once, and then have all other subsequent writers only append data.
Should I use fs.createWriteStream and then fs.appendFile? Or is there a way to open up a stream for each process, not just for the first process to open the file?
Should I use fs.createWriteStream and then fs.appendFile?
You can use either.
With fs.createWriteStream you have to change the flags option like this:
    fs.createWriteStream('your_file', {
        flags: 'a+', // default is 'w' (just 'a' might be enough here, I'm not sure)
    })
This should create the file if it doesn't exist, or open it with write access if it does, and position the file pointer at the end (append mode).
How to use fs.appendFile should be clear; it does pretty much the same thing.
Now to the problem of multiple processes accessing the same file. To keep their writes from clobbering each other, only one process should have write access to the file at a time. Therefore you need to wait for the file to be released if another process holds write access. You will probably need a library for that.
this one for example: https://www.npmjs.com/package/lockup
or this one: https://github.com/Perennials/mutex-node
you can also find a lot more here: https://www.npmjs.com/browse/keyword/lock
or here: https://www.npmjs.com/browse/keyword/mutex
I have not tried any of those libraries but the one I posted and several others on the list should do exactly what you need.
Writing to a single file from multiple processes while ensuring data integrity is a fairly complex operation that you can orchestrate using file locking.
However, you have two simpler approaches:
Writing to a temporary file from each process, and then concatenating the files at the end of the operations.
Transmitting what you need to write to a dedicated, single process and delegating the writing to it, as sketched below. Keep in mind that sending messages between processes can be expensive.
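Here is a sketch of that second approach, in C for concreteness: each producer appends lines (kept at or under PIPE_BUF bytes so they do not interleave) to a named pipe, and one dedicated process drains the pipe into the real file. The paths ./writequeue and output.log are placeholders, and mkfifo ./writequeue must be run beforehand:

    #include <stdio.h>

    /* The single dedicated writer: drain lines from the FIFO and
     * append them to the output file. */
    int main(void)
    {
        char line[512];
        FILE *in = fopen("./writequeue", "r"); /* blocks until a writer appears */
        FILE *out = fopen("output.log", "a");
        if (!in || !out)
            return 1;
        while (fgets(line, sizeof line, in))
            fputs(line, out);
        fclose(in);
        fclose(out);
        return 0;
    }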
Is there a way to create a non-blocking/asynchronous named pipe, or something similar, in shell? So that programs could place lines in it, those lines would stay in RAM, and a program could read some lines from the pipe while leaving what it did not read in the FIFO? It is also very probable that programs will be writing to and reading from this FIFO at the same time. At first I thought maybe this could be done using files, but after searching the web for a bit it seems nothing good can come from a file being read and written at the same time. Named pipes would almost work, but there are two problems: first, they block reads/writes if there is no one at the other end; second, even if I let the writes block and set two processes to write to the pipe while no one is reading, writing one line from each process and then trying head -n 1 <fifo>, I get just one line as I need, but both writing processes terminate and the second line is lost. Any suggestions?
Edit: maybe some intermediate program could be used to help with this, acting as a mediator between writers and readers?
You can use a special program for this purpose: buffer. buffer is designed to try to keep the writer side continuously busy so that it can stream when writing to tape drives, but you can use it for other purposes. Internally, buffer is a pair of processes communicating via a large circular queue held in shared memory, so your processes will work asynchronously. Your writer process will block when the queue is full, and your reader process when the queue is empty. Example:
bzcat archive.bz2 | buffer -m 16000000 -b 100000 | processing_script | bzip2 > archive_processed.bz2
http://linux.die.net/man/1/buffer