Node.js read and write stream to the same file at the same time

TL;DR
I've been browsing through a number of solutions on npm and GitHub looking for something that would allow me to read and write to the same file in two different places at the same time. So far I'm having trouble actually finding anything like this. Is there a module of some sort that will allow that?
Background
In essence, my requirement is that for a large file I need to, in the following order:
read
transform
write
Ideally the usage would be something like:
const fd = fs.openSync(file, "r+");
const read = createReadStreamSomehowFrom(fd);
const write = createWriteStreamSomehowFrom(fd);
read
    .pipe(new Transform({ transform(chunk, encoding, callback) { /* ... */ } }))
    .pipe(write);
I could do that with the standard fs.createReadStream / fs.createWriteStream, but there's no way to coordinate the flow of the two streams, and if my write position overtakes my read position then I'm reading something I just wrote...
The use case is the same as perl -p -i -e: read and write to the same file (meaning the same inode) asynchronously and replace the contents without loading everything into memory.
I would expect this to be a real-world use case, yet all the implementations I found actually load the whole file into memory and then save it. Am I missing a known module here, or is there a need to actually write something like this?

Hmm... a tough one it seems. :)
So, for the record - I found no such module and actually discussed this with some people responsible for a nice in-file replacing module. Seeing no other way to solve this, I decided to write it from scratch, and here it is:
signicode/rw-stream repo on github
rw-stream at npm
The module works on a simple principle: no byte can be written until it has been consumed by the readable stream. Underneath it's fairly simple - a couple of fs.read/fs.write operations while keeping an eye on the read and write positions.
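For illustration only, here's a minimal synchronous sketch of that principle - not the actual rw-stream implementation, which is stream-based and handles backpressure. The function name and chunk size are made up; the key point is that the write position is only ever advanced over bytes the reader has already consumed, so this sketch only works as-is when the transform does not grow the data:

const fs = require("fs");

// Hypothetical helper: rewrite a file in place through a chunk-by-chunk transform.
function transformInPlace(path, transform) {
    const fd = fs.openSync(path, "r+");
    const chunk = Buffer.alloc(64 * 1024);
    let readPos = 0;   // next byte to read
    let writePos = 0;  // next byte to write - never allowed past readPos
    let bytesRead;
    while ((bytesRead = fs.readSync(fd, chunk, 0, chunk.length, readPos)) > 0) {
        readPos += bytesRead;                 // everything before readPos is now safe to overwrite
        const out = transform(chunk.slice(0, bytesRead));
        fs.writeSync(fd, out, 0, out.length, writePos);
        writePos += out.length;
    }
    fs.ftruncateSync(fd, writePos);           // shrink the file if the transform shortened it
    fs.closeSync(fd);
}

If the transform can produce more bytes than it consumed, the write has to wait until the reader has moved far enough ahead - which is exactly the coordination the stream-based module takes care of.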
If you find this useful then I'm happy. :)

Related

How to get back raw binary output from python fabric remote command?

I am using Python 3.8.10 and fabric 2.7.0.
I have a Connection to a remote host. I am executing a command as follows:
resObj = connection.run("cat /usr/bin/binaryFile")
So in theory the bytes of /usr/bin/binaryFile are getting pumped into stdout, but I cannot figure out what wizardry is required to get them out of resObj.stdout and written into a local file with a matching checksum (as in, get all the bytes out of stdout). For starters, len(resObj.stdout) != binaryFile.size. Visually glancing at what is present in resObj.stdout and comparing it to what is in /usr/bin/binaryFile via hexdump or similar makes them look roughly similar, but something is going wrong.
May the record show, I am aware that this particular example would be better accomplished with...
connection.get('/usr/bin/binaryFile')
The point though is that I'd like to be able to get arbitrary binary data out of stdout.
Any help would be greatly appreciated!!!
I eventually gave up on doing this with the fabric library and reverted to straight-up paramiko. People give paramiko a hard time for being "too low level", but the truth is that it offers a higher-level API which is pretty intuitive to use. I ended up with something like this:
from paramiko import SSHClient, AutoAddPolicy

with SSHClient() as client:
    client.set_missing_host_key_policy(AutoAddPolicy())
    client.connect(hostname, **connectKwargs)
    stdin, stdout, stderr = client.exec_command("cat /usr/bin/binaryFile")
In this setup, I can get the raw bytes via stdout.read() (or similarly, stderr.read()).
To do other things that fabric exposes, like put and get, it is easy enough to do:
# client from above
with client.open_sftp() as sftpClient:
    sftpClient.put(...)
    sftpClient.get(...)
I was also able to get the exit code, per this SO answer, by doing:
# stdout from above
stdout.channel.recv_exit_status()
The docs for recv_exit_status list a few gotchas that are worth being aware of too. https://docs.paramiko.org/en/latest/api/channel.html#paramiko.channel.Channel.recv_exit_status .
The moral of the story for me is that fabric ends up feeling like an over-abstraction, while paramiko has an easy-to-use higher-level API as well as the low-level primitives when appropriate.

Windows/Python check if file is open or in use

I am using Python to monitor a folder and check whether files are being copied in and, if so, replicate them to a new location.
I am using the following to monitor the folder:
fsmonitor
The issue I am facing is that I am unable to discern whether a file is in use and still in the process of being written to disk. If so, I want to wait until copying is complete and only then start copying it to my new location.
So how do I find out if a file is in use/open?
I have seen some suggestions here where I try to write to the file in question, and if that fails it indicates that the file is in use:
example answer (I've seen similar in python)
But I am reluctant to use such a method due to the fear that it might cause corruption and such issues.
Is there an alternative/safer way to do this? Or is testing write permissions safe?
Is anyone familiar with pywin32? Does it provide such tools? The site looks arcane, so I wonder if it covers the latest APIs provided by Windows; even fsmonitor mentioned above uses the same library, and I wonder if there are newer/more efficient ways to do this.
Currently, I am using psutil's proc.open_files() to loop through all processes and all files to list out open files. If files that I am concerned about appear on this list, I wait and try again. However, this creates a humongous list of files and uses 12% of my CPU to do it, so I desperately need an alternative.
In response to Adrian McCarthy
I started out assuming that it is safe to act on whatever fsmonitor puts out, but consider the following output, which is for a single file copy:
0 86 0
create C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe 3684bf38
create C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe 3684bf38
0 86 0
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe a8cf3250
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe a8cf3250
0 160 0
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe caef5c64
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64.exe caef5c64
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe caef5c64
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64.exe caef5c64
So the conundrum is: at which 'modify' do I start copying the file? I can wait a few seconds/minutes to see whether another 'modify' appears for that file, but how do I decide how long to wait? A large file over SFTP may take 30 minutes, so I need something scalable.
Also, I would like to avoid making multiple copy actions for a file, since that would make the script inefficient.
This can help you:
check if a file is open in Python
Here is the code:
try:
    # try to open the file
    with open("file", "r") as file:
        pass  # some code here
except IOError:
    pass  # if it throws an error, that means the file is in use
I think you're unnecessarily concerned about working with the file while another process still has it open.
On Windows, fsmonitor uses the ReadDirectoryChangesW mechanism. That means you'll get a notification about a change after it happens. So if a process writes to foo.log, you'll get a notification after the write operation has completed. (In fact, I think it's after the update of the directory metadata.)
To copy the file, you need read access. So just go ahead and open it for reading.
If it opens, then it's safe to read, even if another process has it open. You cannot corrupt a file by reading it even if another process is writing to it.
If it fails to open, then another process has it open and is intentionally preventing other processes from reading it (probably because they know they'll be actively updating it). In that case, you can try again later.
Trying to first check whether another process is using the file doesn't actually help because the answer could change between the moment you check and the moment you try to act on that information.
When you open a file, the system does the permission check and the opening under a mutex*, so the answer cannot change in between. There's no way for you to simulate that yourself from user-mode code. Once you have the file open, you can safely use it.
If you try to read from a file at the same moment another process tries to write to it, the system will ensure that the read will get the data as it was before the write or as it is after the write. It won't get a result that's a mixture of old and new.
That said, if you're reading the file with a bunch of small read operations while another process is writing to the file with a bunch of small write operations, it's possible you might capture some intermediate state of the file. But that's okay. The original file is unharmed, and those writes will trigger another fsmonitor notification, so your code will start over and try to make another copy of the file.
* I'm using "mutex" in a generic sense: It uses some sort of synchronization mechanism, but it might not necessarily be a Windows Mutex object.

How to cache files with Perl while playing sound files using vlc?

I would like to manually cache files in Perl, so when playing a sound there is little to no delay.
I wrote a program in Perl, which plays an audio file by doing a system call to VLC. When executing it, I noticed a delay before the audio started playing. The delay is usually between about 1.0 and 1.5 seconds. However, when I create a loop which does the same VLC call multiple times in a row, the delay is only about 0.2 - 0.3 seconds. I assume this is because the sound file was cached by Linux. I found Cache::Cache on CPAN, but I don't understand how it works. I'm interested in a solution without using a module. If that's not possible, I'd like to know how to use Cache::Cache properly.
(I know that using a system call to VLC is a bad idea as far as execution speed goes.)
use strict;
use warnings;
use Time::HiRes;

while (1) {
    my $start = Time::HiRes::time();
    system('vlc -Irc ./media/audio/noise.wav vlc://quit');
    my $end = Time::HiRes::time();
    my $duration = $end - $start;
    print "duration = $duration\n";
    <STDIN>;
}
It's not as easy as just "caching" a file in Perl.
vlc, or whatever program you call, needs to interpret the content of the data (in your case the .wav file).
Either you stick with calling an external program and just give it a file to play, or you need to implement the whole stack in Perl (probably with Perl XS modules). By "the whole stack" I mean:
1. Keeping the data (your .wav file) in memory (inside the Perl runtime).
2. Interpreting the data inside Perl.
The second part is where it gets tricky: you would probably need to write a lot of code and/or use third-party modules to get where you want.
So if you just want to make it work fast, stick with system calls. You could also look into Nama, which might give you what you need.
From your question it looks like you are mostly interested in getting the runtime of a .wav file. If it's just about getting information about the file and not about playing the sound, then maybe Audio::Wav could be the module for you.
Caching internally in Perl does not help you here.
Prime the Linux file cache by reading the file once, for example at program initialisation time. It might happen that at the time you want to play it, it has already been made stale and removed from the cache, so if you want to guarantee low access time, then put the files on a RAM disk instead.
Find and use a different media player with a lower footprint that does not load dozens of libraries in order to reduce start-up time. Try paplay from package pulseaudio-utils, or gst123, or mpg123 with mpg123-pulse.

ZIP file format. How to read file properly?

I'm currently working on a Node.js project. I want to be able to read, modify and write a ZIP file without saving it to the filesystem (we receive it over TCP and send it back after modifications are made), and so far that looks possible because of the simple ZIP file structure. Currently I refer to this documentation.
So ZIP file has simple structure:
File header 1
File data 1
File data descriptor 1
File header 2
File data 2
File data descriptor 2
...
[other not important yet]
First we need to read the file header, which contains a compressed size field, and that seems like the perfect way to read file data 1 by its length. But it's actually not: this field may contain '0' or '0xFFFFFFFF', and those values don't describe the actual length. In that case we have to read the file data without knowing its length. But how?
The compression/decompression algorithm descriptions look pretty complex to me, and I plan to use zlib for the compression itself anyway. So if something useful is described there, then I missed it.
Can someone explain the proper way to read those files?
P.S. Please avoid suggesting npm modules. I do not want to only solve the problem, but also to understand how things work.
Note - I'm assuming you want to read and process the zip file as it comes off the socket, rather than reading the complete zip file into memory before processing. Both options are valid.
I'd initially ignore the use cases where the compressed size has a value of '0' or '0xFFFFFFFF'. The former is only present in zip files created in streaming mode, the latter for zip files larger than 4Gig.
Dealing with them adds a lot of complexity - you can add support for them later, if necessary. Whether you ever need to support the 0/0xFFFFFFFF use cases depends on the nature of the zip files you intend to process.
When the compression method is deflate (8), use zlib for compression/decompression. You also need to support compression method stored (0); it gets used for very small files where compression isn't appropriate.
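As a rough sketch of what that looks like in Node - assuming the simple case described above, where the compressed size is actually present (no streaming-mode '0', no ZIP64 '0xFFFFFFFF') and the entry is already sitting in a Buffer - the field offsets below come from the local file header layout in the spec linked in the question:

const zlib = require("zlib");

// Parse one local file header starting at `offset` and return the decompressed data.
function readLocalEntry(buf, offset) {
    if (buf.readUInt32LE(offset) !== 0x04034b50) throw new Error("not a local file header");
    const method = buf.readUInt16LE(offset + 8);          // 0 = stored, 8 = deflate
    const compressedSize = buf.readUInt32LE(offset + 18);
    const nameLength = buf.readUInt16LE(offset + 26);
    const extraLength = buf.readUInt16LE(offset + 28);
    const name = buf.toString("utf8", offset + 30, offset + 30 + nameLength);
    const dataStart = offset + 30 + nameLength + extraLength;
    const raw = buf.slice(dataStart, dataStart + compressedSize);
    const data = method === 8 ? zlib.inflateRawSync(raw) : raw;  // deflate vs. stored
    return { name, data, nextOffset: dataStart + compressedSize };
}

Note that deflate data inside a zip entry has no zlib header, which is why inflateRawSync (rather than inflateSync) is the right call.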

How to synchronously read from a ReadStream in node

I am trying to read UTF-8 text from a file in a memory and time efficient way. There are two ways to read directly from a file synchronously:
fs.readFileSync will read the entire file and return a buffer containing the file's entire contents
fs.readSync will read a set amount of bytes from a file and return a buffer containing just those contents
I initially just used fs.readFileSync because it's easiest, but I'd like to be able to efficiently handle potentially large files by only reading in chunks of text at a time. So I started using fs.readSync instead. But then I realized that fs.readSync doesn't handle UTF-8 decoding. UTF-8 is simple, so I could whip up some logic to manually decode it, but Node already has services for that, so I'd like to avoid that if possible.
I noticed fs.createReadStream, which returns a ReadStream that can be used for exactly this purpose, but unfortunately it seems to only be available in an asynchronous mode of operation.
Is there a way to read from a ReadStream in a synchronous way? I have a massive stack built on top of this already, and I'd rather not have to refactor it to be asynchronous.
I discovered the string_decoder module, which handles all that UTF-8 decoding logic I was worried I'd have to write. At this point, it seems like a no-brainer to use this on top of fs.readSync to get the synchronous behavior I was looking for.
You basically just keep feeding bytes to it, and it gives back whatever characters it has been able to decode completely so far. The Node documentation does a good job of describing how it works.
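To make that concrete, here's a minimal sketch of the combination (the function name, chunk size and callback are my own, not from the string_decoder docs): fs.readSync pulls fixed-size chunks, and the StringDecoder buffers any multi-byte character that happens to be split across a chunk boundary:

const fs = require("fs");
const { StringDecoder } = require("string_decoder");

// Read a file synchronously in chunks and hand decoded UTF-8 strings to a callback.
function readUtf8ChunksSync(path, onChunk, chunkSize = 64 * 1024) {
    const fd = fs.openSync(path, "r");
    const buf = Buffer.alloc(chunkSize);
    const decoder = new StringDecoder("utf8");
    let bytesRead;
    try {
        while ((bytesRead = fs.readSync(fd, buf, 0, buf.length, null)) > 0) {
            onChunk(decoder.write(buf.slice(0, bytesRead)));
        }
        onChunk(decoder.end());   // flush any incomplete trailing sequence
    } finally {
        fs.closeSync(fd);
    }
}

// Example: count characters without loading the whole file into memory
// let total = 0;
// readUtf8ChunksSync("./big.txt", (s) => { total += s.length; });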

Resources