How does a readable stream react to a file that is still being written?

I have found a lot of information on how to pump or pipe data from a read stream to a write stream in Node. The newest version even auto-pauses and resumes for you. However, I have a different need and would like some help.
I am writing a video file using ffmpeg (to a local file, not a writable stream), and I would like to create a read stream that reads the data as it gets written. Obviously, the read stream's speed will surpass how quickly ffmpeg encodes the file. What will happen when the read stream reaches the end of the data before ffmpeg finishes writing the file? I assume it will stop reading before the file is fully encoded.
Anyone have any suggestions for the best way to pause/resume the read stream so that it doesn't reach the end of the locally encoding file until the encoding is 100% complete?
In summary:
This is what people normally do: readStream --> writeStream (using .pipe)
This is what I want to do: local file (in slow creation process) --> readStream
As always, thanks to the Stack Overflow community.

The growing-file module is what you want.
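For reference, here is a minimal sketch of how it might be wired up. It assumes growing-file exposes a GrowingFile.open(path, options) call that returns a readable stream which keeps emitting data as the file grows; the option names below (timeout, interval) are assumptions, so check the package README.

const GrowingFile = require('growing-file');
const fs = require('fs');

// Open the file that ffmpeg is still writing; instead of ending at the
// current EOF, the stream waits for more data until the file stops growing.
const video = GrowingFile.open('./output.mp4', {
  timeout: 10000, // assumed option: give up after 10s without growth
  interval: 100   // assumed option: poll for new data every 100ms
});

video.on('error', (err) => console.error('growing-file error:', err));

// Pipe it wherever the data needs to go, e.g. another file or an HTTP response.
video.pipe(fs.createWriteStream('./copy-while-encoding.mp4'));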

Related

Node uploaded image save - stream vs buffer

I am working on image upload and don't know how to properly deal with storing the received file. It would be nice to analyze the file first to check whether it really is an image or someone just changed the extension. Luckily, the sharp package I use has exactly such a feature. I am currently weighing two approaches.
Buffering approach
I can parse the multipart form as a buffer and easily decide whether to save the file or not.
const metadata = await sharp(buffer).metadata();
if (metadata) {
  saveImage(buffer);
} else {
  throw new Error('It is not an image');
}
Streaming approach
I can parse the multipart form as a readable stream. First I need to pipe the readable stream into a writable one and store the file on disk. Afterwards, I need to create another readable stream from the saved file and verify whether it really is an image; otherwise, revert everything.
// save uploaded file to file system with stream
readableStream.pipe(createWriteStream('./uploaded-file.jpg'));
// verify whether it is an image
createReadStream('./uploaded-file.jpg').pipe(
  sharp().metadata((err, metadata) => {
    if (!metadata) {
      revertAll();
      throw new Error('It is not an image');
    }
  })
);
My intention was to avoid using a buffer because, as far as I know, it has to hold the whole file in RAM. On the other hand, the streaming approach seems really clunky.
Can someone help me understand how these two approaches differ in terms of performance and resource usage? Or is there some better approach for dealing with such a situation?
In buffer mode, all the data coming from a resource is collected into a buffer, think of it as a data pool, until the operation is completed; it is then passed back to the caller as one single blob of data. Buffers in V8 are limited in size. You cannot allocate more than a few gigabytes of data, so you may hit a wall way before running out of physical memory if you need to read a big file.
On the other hand, streams allow us to process the data as soon as it arrives from the resource, without storing it all in memory first. Streams can be more efficient in terms of both space (memory usage) and time (clock time).
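To make the difference concrete, here is a small sketch using plain fs APIs (the file names are placeholders): the first version materializes the whole file in memory before writing anything, while the second moves it chunk by chunk with flat memory usage.

const fs = require('fs');

// Buffer mode: the whole file is loaded into memory, then written out.
fs.readFile('./big-input.bin', (err, wholeFile) => {
  if (err) throw err;
  fs.writeFile('./copy-buffered.bin', wholeFile, (err) => {
    if (err) throw err;
  });
});

// Stream mode: data flows through in small chunks (64 KiB by default for
// fs read streams), so memory usage stays flat regardless of file size.
fs.createReadStream('./big-input.bin')
  .pipe(fs.createWriteStream('./copy-streamed.bin'));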

How to read a specific chunk from a file in Node.js?

How can I read a buffer from a file by selecting a start position and an end position while streaming?
The read-chunk package solves my problem.
You could create a custom writable stream with the Writable interface, but maybe that defeats the whole purpose of a stream... Do you know upfront which positions you need to read, or is it random? Do you need scan patterns?
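If the byte range is known upfront, fs.createReadStream also accepts start and end options (both inclusive byte offsets), which covers the common case without an extra package. A quick sketch with a placeholder file name:

const fs = require('fs');

// Stream only bytes 100 through 199 of the file.
const chunkStream = fs.createReadStream('./some-file.bin', { start: 100, end: 199 });

const pieces = [];
chunkStream.on('data', (piece) => pieces.push(piece));
chunkStream.on('end', () => {
  const chunk = Buffer.concat(pieces); // exactly 100 bytes
  console.log('read', chunk.length, 'bytes');
});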

What is the difference between async and stream writing of files?

I know that it's possible to use async methods (like fs.appendFile) and streams (like fs.createWriteStream) to write files.
But why do we need both of them if streams are asynchronous as well and can provide us with better functionality?
Let's say you're downloading a file, a huge file, a 1 TB file, and you want to write that file to your filesystem.
You could download the whole file into an in-memory buffer and then fs.appendFile() or fs.writeFile() the buffer to a local file. Or try to, at least; you'd run out of memory.
Or you could create a read stream for the file being downloaded and pipe it to a write stream that writes to your file system:
const fs = require('fs');
const readStream = magicReadStreamFromUrl('https://example.com/large.txt'); // [1] placeholder, see note below
const writeStream = fs.createWriteStream('large.txt');
readStream.pipe(writeStream);
This means that the file is downloaded in chunks, and those chunks get piped to the writeStream (which writes them to disk), without you having to hold the whole thing in memory yourself.
That is the reason for streaming abstractions in general, and in Node in particular.
The http module supports streaming in this way, as do most other HTTP libraries like request and axios. [1] I've left out the specifics of how to create the read stream as an exercise for the reader, for brevity.
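For completeness, a minimal sketch of what that read stream could look like using only the built-in https module (the URL and file name are placeholders); the response object passed to the callback is itself a readable stream:

const https = require('https');
const fs = require('fs');

https.get('https://example.com/large.txt', (response) => {
  // `response` is an http.IncomingMessage, which is a readable stream,
  // so its chunks can be piped straight to disk as they arrive.
  response.pipe(fs.createWriteStream('large.txt'));
});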

How to synchronously read from a ReadStream in node

I am trying to read UTF-8 text from a file in a memory and time efficient way. There are two ways to read directly from a file synchronously:
fs.readFileSync will read the entire file and return a buffer containing the file's entire contents
fs.readSync will read a set amount of bytes from a file and return a buffer containing just those contents
I initially just used fs.readFileSync because it's easiest, but I'd like to be able to efficiently handle potentially large files by only reading in chunks of text at a time. So I started using fs.readSync instead. But then I realized that fs.readSync doesn't handle UTF-8 decoding. UTF-8 is simple, so I could whip up some logic to manually decode it, but Node already has services for that, so I'd like to avoid that if possible.
I noticed fs.createReadStream, which returns a ReadStream that can be used for exactly this purpose, but unfortunately it seems to only be available in an asynchronous mode of operation.
Is there a way to read from a ReadStream in a synchronous way? I have a massive stack built on top of this already, and I'd rather not have to refactor it to be asynchronous.
I discovered the string_decoder module, which handles all that UTF-8 decoding logic I was worried I'd have to write. At this point, it seems like a no-brainer to use this on top of fs.readSync to get the synchronous behavior I was looking for.
You basically just keep feeding bytes to it, and as it is able to successfully decode characters, it returns them. The Node documentation does an adequate job of describing how it works.
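A rough sketch of the combination (the file name and chunk size are arbitrary): fs.readSync pulls raw bytes synchronously, and StringDecoder holds back any incomplete multi-byte UTF-8 sequence until the next chunk completes it.

const fs = require('fs');
const { StringDecoder } = require('string_decoder');

const fd = fs.openSync('./big-text-file.txt', 'r');
const decoder = new StringDecoder('utf8');
const chunk = Buffer.alloc(64 * 1024);

let bytesRead;
while ((bytesRead = fs.readSync(fd, chunk, 0, chunk.length, null)) > 0) {
  // write() returns only complete characters; a partial multi-byte
  // sequence at the end of the chunk is buffered until the next call.
  const text = decoder.write(chunk.slice(0, bytesRead));
  process.stdout.write(text);
}
process.stdout.write(decoder.end()); // flush anything left over
fs.closeSync(fd);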

piping node.js object streams to multiple destinations is producing bizarre results -- why?

When piping one transform stream to two other transform streams, occasionally I'm getting a few of the objects from one destination stream appearing in place of the proper objects in the other destination stream. In a stream of 90,000 objects, in about 1 out of 3 runs, about 10 objects starting around sequence number 10,000 are from the wrong stream (the start position and number of anomalous objects vary). What in the world could account for such bizarre results?
The setup:
sourceStream.pipe(processingStream1).pipe(check1);
processingStream1.pipe(check2).pipe(destinationStream1);
processingStream1.pipe(processingStream2).pipe(destinationStream2);
The sourceStream is a transform stream fed by a file read. The two destination streams are transform streams leading to file writes. Both the file read and file write are through the fs streaming API. All the streams rely on node.js automatic backpressure in piping.
Occasionally objects from processingStream2 are leaking into destinationStream1, as described above.
The checking streams (check1 a sink, check2 a passthrough) show the anomalous objects exist in the stream through check2 but not in the stream into check1.
The file reads and writes are of text (csv) files. I'm using Node.js version 8.6 on Windows 7 (though deserved, please don't throw rocks at me for the latter).
Suggestions on how to better isolate the problem are also welcome. The anomaly is structured enough that it doesn't seem like a generic memory leak, but it's not consistent enough to be an obvious code error. I'm mystified.
Ugh! processingStream2 modifies the object in the stream coming through it (actually, it modifies a property of a sub-object). Apparently you can't count on the order of the pipes to control the order of changes to the streamed objects. Very occasionally, the object modified by processingStream2 shows up, already changed, in the other branch fed by processingStream1 (and hence in destinationStream1) via Node internals, probably as part of some optimization under the hood.
Lesson learned: don't change the input streamed object when piping to multiple destinations, even if you think you're making the change downstream. May you never have to learn this lesson the hard way!
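A minimal sketch of why this happens (the stream names are illustrative, not the original code): when one readable is piped to two destinations in object mode, both destinations receive the same object reference, so a mutation made in one branch can become visible in the other depending on scheduling.

const { Readable, Transform } = require('stream');

// A source that emits a single shared object.
const source = Readable.from([{ id: 1, meta: { tag: 'original' } }]);

// Branch A only inspects what it receives.
const inspector = new Transform({
  objectMode: true,
  transform(obj, _enc, callback) {
    // Depending on scheduling, obj.meta.tag may already read 'mutated'.
    console.log('inspector sees:', obj.meta.tag);
    callback(null, obj);
  }
});

// Branch B mutates a property of a sub-object, as processingStream2 did.
const mutator = new Transform({
  objectMode: true,
  transform(obj, _enc, callback) {
    obj.meta.tag = 'mutated';
    callback(null, obj);
  }
});

source.pipe(inspector).resume();
source.pipe(mutator).resume();

If each branch needs its own view of the data, copy the object (even just a shallow clone of the sub-object you change) inside the transform before modifying it.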
