NodeJS - fs.createReadStream() - how to cut off chunks at a certain point?

I have a large XML file (2GB) and I need to add a new line if a criterion is met. Example:
<chickens>
<chicken>
<name>sam</name>
<female>false</female>
</chicken>
<chicken>
<name>julia</name>
<female>true</female>
</chicken>
// many many more chickens
</chickens>
to:
<chickens>
<chicken>
<name>sam</name>
<female>false</female>
</chicken>
<chicken>
<name>julia</name>
<female>true</female>
<canLayEggs>true</canLayEggs> // <- Add this line if female is true;
</chicken>
// many many more chickens
</chickens>
However, the issue I'm facing is that sometimes a chunk gets cut off mid-tag, e.g. one chunk ends with <female>true
and the next chunk starts with </female>
Here is my code:
const fs = require("fs");
const input = "input.xml";
const MAX_CHUNK_SIZE = 50 * 1024 * 1024; // 50 MB
const buffer = Buffer.alloc(MAX_CHUNK_SIZE);

let readStream = fs.createReadStream(input, {
  encoding: "utf8",
  highWaterMark: MAX_CHUNK_SIZE,
});
let writeStream = fs.createWriteStream("output.xml");

readStream.on("data", (chunk) => {
  let data = chunk;
  if (data.includes("<female>true</female>")) {
    data = data.replace(
      /<female>true<\/female>/g,
      "<female>true</female><canLayEggs>true</canLayEggs>"
    );
  }
  writeStream.write(data, "utf-8");
});

readStream.on("end", () => {
  writeStream.end();
});
I have tried Google, but I can't seem to find the right term, and many tutorials out there don't really cover this. Any help is appreciated.

You are reading 50 MB per chunk, so inside the on("data") callback you can call:
readStream.destroy();
Also, you don't need to initialize the buffer with a 50 MB size: it isn't used here, and after the text replacement the data is likely longer than 50 MB anyway.
It is good that you close the writeStream when the readStream ends.
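The answer above doesn't address the split-tag problem directly. A minimal sketch of one common approach (my own illustration, not taken from the answer): carry the unprocessed tail of each chunk over and prepend it to the next one, so a tag that was cut in half is reassembled before the replacement runs.

const fs = require("fs");

const readStream = fs.createReadStream("input.xml", { encoding: "utf8" });
const writeStream = fs.createWriteStream("output.xml");

let carry = ""; // unprocessed tail from the previous chunk

readStream.on("data", (chunk) => {
  let data = carry + chunk;
  // Cut after the last complete closing tag; everything after it may be
  // an incomplete tag, so keep it for the next chunk.
  const lastClose = data.lastIndexOf("</");
  const cutAt = lastClose === -1 ? -1 : data.indexOf(">", lastClose);
  if (cutAt === -1) {
    carry = data;
    return;
  }
  carry = data.slice(cutAt + 1);
  writeStream.write(
    data
      .slice(0, cutAt + 1)
      .replace(
        /<female>true<\/female>/g,
        "<female>true</female><canLayEggs>true</canLayEggs>"
      )
  );
});

readStream.on("end", () => {
  // flush whatever is left over and close the output file
  writeStream.end(carry);
});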

Related

How to close the file descriptor opened using fs.readFile/writeFile

I have legacy code running on Node 0.12.7 that normally works fine, but it frequently throws an EMFILE "Too many open files" error.
How can I release the file descriptor opened using:
require("fs").readFile(resobj.name, 'utf8', function (err, data)
{
});
You will most likely need to read the files in batches, like this:
const fs = require('fs/promises');
const files = [...<array of millions of file paths>];
const MAX_FILES_TO_PROCESS = 1000;
let promises;
let contents;
// Process 1000 files at a time
(async () => {
  for (let a = 0; a < files.length; a += MAX_FILES_TO_PROCESS) {
    promises = files.slice(a, a + MAX_FILES_TO_PROCESS).map(path => fs.readFile(path));
    contents = await Promise.all(promises);
    // Process the contents, then continue with the next batch
  }
})();
Two observations that may help you:
Use createReadStream instead of readFile because the latter reads the entire file into memory. If you're handling thousands or millions of objects it's not scalable.
readFile doesn't return a descriptor because it closes the file automatically.
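If you do need explicit control over when the descriptor is released, here is a minimal sketch (my own illustration, not from the answer above) using an fs/promises file handle:

const fs = require('fs/promises');

async function readWithExplicitHandle(path) {
  const handle = await fs.open(path, 'r'); // acquires a file descriptor
  try {
    return await handle.readFile({ encoding: 'utf8' });
  } finally {
    await handle.close(); // always release the descriptor
  }
}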

NodeJS: What's the most efficient way to read the last X bytes of a very large file (+1GB)?

I would like to efficiently read the last X bytes of a very large file using node.js. What's the most efficient way of doing so?
As far as I know, the only way of doing this is by creating a read stream and looping until I hit the byte index.
Example:
// let's assume I want the last 10 bytes;
// I would open a stream, loop until I reach the end of the file,
// and keep only the last 10 bytes I've seen in memory
let f = fs.createReadStream('file.xpto'); // which is a 1gb file
let last = [];
f.on('data', function (chunk) {
  for (const byte of chunk) {
    last.push(byte);
  }
  last = last.slice(-10); // keep only the last 10 bytes
});
f.on('end', function () {
  // check the collected bytes
  console.log('Last ten bytes are', last);
});
f.resume();
You essentially want to seek to a certain position in the file. There's a way to do that. Please consult this question and the answers:
seek() equivalent in javascript/node.js?
Essentially, determine the starting position (using the file size from its metadata and the number of bytes you're interested in), then use one of the following approaches to read the portion you're interested in, either as a stream or via buffers.
Using fs.read
fs.read(fd, buffer, offset, length, position, callback)
position is an argument specifying where to begin reading from in the file.
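For example, a minimal sketch of reading the last 10 bytes this way (the file name is taken from the question; the rest is my own illustration):

var fs = require('fs');

fs.open('file.xpto', 'r', function (err, fd) {
  if (err) throw err;
  fs.fstat(fd, function (err, stats) {
    if (err) throw err;
    var length = 10;
    var position = stats.size - length; // start 10 bytes before the end
    fs.read(fd, Buffer.alloc(length), 0, length, position, function (err, bytesRead, buffer) {
      if (err) throw err;
      console.log('Last ten bytes are', buffer.slice(0, bytesRead));
      fs.close(fd, function () {});
    });
  });
});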
Using fs.createReadStream
Alternatively, if you want to use the createReadStream function, then specify the start and end options: https://nodejs.org/api/fs.html#fs_fs_createreadstream_path_options
fs.createReadStream(path[, options])
options can include start and end values to read a range of bytes from the file instead of the entire file.
Here's sample code based on Arash Motamedi's answer.
This will let you read the last 10 bytes of a very large file in a few ms.
let fs = require('fs');
const _path = 'my-very-large-file.xpto';
const stats = fs.statSync(_path);
let size = stats.size;
let sizeStart = size - 10;
let sizeEnd = size - 1; // end is inclusive, so this is the last byte
let options = {
  start: sizeStart,
  end: sizeEnd
};
let stream = fs.createReadStream(_path, options);
stream.on('data', (data) => {
  console.log({ data });
});
stream.resume();
For a promised version of the read solution:
import FS from 'fs/promises';

async function getLastXBytesBuffer(path) {
  const bytesToRead = 1024; // the X bytes you want to read
  const handle = await FS.open(path, 'r');
  const { size } = await handle.stat();
  // Calculate the position X bytes from the end
  const position = size - bytesToRead;
  // Read the bytes into a freshly allocated buffer
  const { buffer } = await handle.read(Buffer.alloc(bytesToRead), 0, bytesToRead, position);
  // Don't forget to close the file handle
  await handle.close();
  return buffer;
}

How to read large binary files in node js without a blocking loop?

I am trying to learn some basics of event driven programming. So for an exercise I am trying to write a program that reads a large binary file and does something with it but without ever making a blocking call. I have come up with the following:
var fs = require('fs');
var BUFFER_SIZE = 1024;
var path_of_file = "somefile"
fs.open(path_of_file, 'r', (error_opening_file, fd) =>
{
if (error_opening_file)
{
console.log(error_opening_file.message);
return;
}
var buffer = Buffer.alloc(BUFFER_SIZE); // Buffer.alloc replaces the deprecated new Buffer()
fs.read(fd, buffer, 0, BUFFER_SIZE, 0, (error_reading_file, bytesRead, buffer) =>
{
if (error_reading_file)
{
console.log(error_reading_file.message);
return;
}
// do something e.g. print or write to another file
})
})
I know I would need some kind of loop to read the complete file, but the code above reads only the first 1024 bytes, and I can't work out how to keep reading the file without using a blocking loop. How can I do it?
Use fs.createReadStream instead. This will call your callback over and over again until it has finished reading the file, so you don't have to block.
var fs = require('fs');
var readStream = fs.createReadStream('./test.exe');
readStream.on('data', function (chunk) {
console.log(chunk.length);
})
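If you prefer to stay with fs.read, a minimal sketch (my own addition, not part of the answer above; it reuses the placeholders from the question) is to issue the next read from inside the previous read's callback, so no loop ever blocks:

var fs = require('fs');
var BUFFER_SIZE = 1024;
var path_of_file = "somefile";

fs.open(path_of_file, 'r', (error_opening_file, fd) => {
  if (error_opening_file) {
    console.log(error_opening_file.message);
    return;
  }
  var buffer = Buffer.alloc(BUFFER_SIZE);

  function readNext(position) {
    fs.read(fd, buffer, 0, BUFFER_SIZE, position, (error_reading_file, bytesRead) => {
      if (error_reading_file) {
        console.log(error_reading_file.message);
        return;
      }
      if (bytesRead === 0) { // reached the end of the file
        fs.close(fd, () => {});
        return;
      }
      // do something with buffer.slice(0, bytesRead), then keep going
      readNext(position + bytesRead);
    });
  }

  readNext(0);
});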

fs.createReadStream - limit the amount of data streamed at a time

If I only want to read 10 bytes at a time, or one line of data at a time (looking for newline characters), is it possible to pass fs.createReadStream() options like so
var options = {}
var stream = fs.createReadStream('file.txt', options);
so that I can limit the amount of data streamed at a time?
Looking at the fs docs, I don't see any options that would allow me to do that, even though I'm guessing it's possible.
https://nodejs.org/api/fs.html#fs_fs_createreadstream_path_options
You can use .read():
var stream = fs.createReadStream('file.txt', options);
var byteSize = 10;
stream.on("readable", function() {
var chunk;
while ( (chunk = stream.read(byteSize)) ) {
console.log(chunk.length);
}
});
The main benefit of knowing about .read() over just the highWaterMark option is that you can call it on streams you didn't create yourself.
Here are the docs
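For comparison, a small sketch of the highWaterMark option mentioned above (my own illustration): each 'data' event then delivers at most that many bytes.

var fs = require('fs');

// highWaterMark caps how much the stream reads (and emits) at a time
var stream = fs.createReadStream('file.txt', { highWaterMark: 10 });
stream.on('data', function (chunk) {
  console.log(chunk.length); // at most 10 bytes per chunk
});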

Is it possible to register multiple listeners to a child process's stdout data event? [duplicate]

I need to run two commands in series that both need to read data from the same stream.
After piping a stream into another, the buffer is emptied, so I can't read data from that stream again; this doesn't work:
var spawn = require('child_process').spawn;
var fs = require('fs');
var request = require('request');
var inputStream = request('http://placehold.it/640x360');
var identify = spawn('identify',['-']);
inputStream.pipe(identify.stdin);
var chunks = [];
identify.stdout.on('data',function(chunk) {
chunks.push(chunk);
});
identify.stdout.on('end',function() {
var size = getSize(Buffer.concat(chunks)); //width
var convert = spawn('convert',['-','-scale',size * 0.5,'png:-']);
inputStream.pipe(convert.stdin);
convert.stdout.pipe(fs.createWriteStream('half.png'));
});
function getSize(buffer){
return parseInt(buffer.toString().split(' ')[2].split('x')[0]);
}
Request complains about this
Error: You cannot pipe after data has been emitted from the response.
and changing the inputStream to fs.createReadStream yields the same issue, of course.
I don't want to write into a file but reuse in some way the stream that request produces (or any other for that matter).
Is there a way to reuse a readable stream once it finishes piping?
What would be the best way to accomplish something like the above example?
You have to create a duplicate of the stream by piping it into two streams. A simple duplicate can be created with a PassThrough stream, which simply passes its input through to its output.
const spawn = require('child_process').spawn;
const PassThrough = require('stream').PassThrough;
const a = spawn('echo', ['hi user']);
const b = new PassThrough();
const c = new PassThrough();
a.stdout.pipe(b);
a.stdout.pipe(c);
let count = 0;
b.on('data', function (chunk) {
count += chunk.length;
});
b.on('end', function () {
console.log(count);
c.pipe(process.stdout);
});
Output:
8
hi user
The first answer only works if the streams take roughly the same amount of time to process the data. If one takes significantly longer, the faster one will request new data, consequently overwriting the data still being used by the slower one (I had this problem after trying to solve it using a duplicate stream).
The following pattern worked very well for me. It uses a library based on Stream2 streams, Streamz, and Promises to synchronize async streams via a callback. Using the familiar example from the first answer:
const spawn = require('child_process').spawn;
const pass = require('stream').PassThrough;
const streamz = require('streamz').PassThrough;
const Promise = require('bluebird');

const a = spawn('echo', ['hi user']);
const b = new pass();
const c = new pass();

a.stdout.pipe(streamz(combineStreamOperations));

function combineStreamOperations(data, next) {
  Promise.join(b, c, function (b, c) { // perform n operations on the same data
    next(); // request more
  });
}

let count = 0;
b.on('data', function (chunk) { count += chunk.length; });
b.on('end', function () { console.log(count); c.pipe(process.stdout); });
You can use this small npm package I created:
readable-stream-clone
With this you can reuse readable streams as many times as you need.
For the general problem, the following code works fine:
const PassThrough = require('stream').PassThrough;

const a = new PassThrough();
const b1 = new PassThrough();
const b2 = new PassThrough();

a.pipe(b1);
a.pipe(b2);

b1.on('data', function (data) {
  console.log('b1:', data.toString());
});
b2.on('data', function (data) {
  console.log('b2:', data.toString());
});

a.write('text');
I have a different solution that writes to two streams simultaneously. Naturally, the time to write will be the sum of the two, but I use it to answer a download request where I also want to keep a copy of the downloaded file on my server (actually, I use an S3 backup, so I cache the most-used files locally to avoid multiple file transfers).
/**
 * A utility class made to write to a file while answering a file download request
 */
class TwoOutputStreams {
  constructor(streamOne, streamTwo) {
    this.streamOne = streamOne;
    this.streamTwo = streamTwo;
  }

  setHeader(header, value) {
    if (this.streamOne.setHeader)
      this.streamOne.setHeader(header, value);
    if (this.streamTwo.setHeader)
      this.streamTwo.setHeader(header, value);
  }

  write(chunk) {
    this.streamOne.write(chunk);
    this.streamTwo.write(chunk);
  }

  end() {
    this.streamOne.end();
    this.streamTwo.end();
  }
}
You can then use this as a regular output stream:
const twoStreamsOut = new TwoOutputStreams(fileOut, responseStream)
and pass it to your method as if it were a response or a file output stream.
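A hedged sketch of how this could be wired into a download route; app, getFileFromS3, and the cache path are hypothetical placeholders, not from the answer:

const fs = require('fs');

// Hypothetical Express-style route: app and getFileFromS3 are placeholders.
app.get('/download/:name', (req, res) => {
  const fileOut = fs.createWriteStream('./cache/' + req.params.name);
  const out = new TwoOutputStreams(fileOut, res);
  // getFileFromS3 is assumed to call out.setHeader / out.write / out.end
  getFileFromS3(req.params.name, out);
});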
If you have async operations on the PassThrough streams, the answers posted here won't work.
A solution that works for async operations includes buffering the stream content and then creating streams from the buffered result.
To buffer the result you can use concat-stream
const Promise = require('bluebird');
const concat = require('concat-stream');

const getBuffer = function (stream) {
  return new Promise(function (resolve, reject) {
    const gotBuffer = function (buffer) {
      resolve(buffer);
    };
    const concatStream = concat(gotBuffer);
    stream.on('error', reject);
    stream.pipe(concatStream);
  });
};
To create streams from the buffer you can use:
const { Readable } = require('stream');

const getBufferStream = function (buffer) {
  const stream = new Readable();
  stream.push(buffer);
  stream.push(null);
  return Promise.resolve(stream);
};
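A possible way to tie the two helpers together (the file names are assumptions for illustration): buffer the source once, then give each consumer its own stream.

const fs = require('fs');

// input.bin and copy.bin are assumed file names for illustration
const sourceStream = fs.createReadStream('input.bin');

getBuffer(sourceStream)
  .then(function (buffer) {
    return Promise.all([getBufferStream(buffer), getBufferStream(buffer)]);
  })
  .then(function (streams) {
    streams[0].pipe(process.stdout);                   // first consumer
    streams[1].pipe(fs.createWriteStream('copy.bin')); // second consumer
  });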
What about piping into two or more streams not at the same time?
For example:
var PassThrough = require('stream').PassThrough;
var mybinaryStream = stream.start(); // never-ending audio stream
var file1 = fs.createWriteStream('file1.wav', { encoding: 'binary' });
var file2 = fs.createWriteStream('file2.wav', { encoding: 'binary' });
var mypass = new PassThrough();

mybinaryStream.pipe(mypass);
mypass.pipe(file1);

setTimeout(function () {
  mypass.pipe(file2);
}, 2000);
The above code does not produce any errors, but file2 is empty.
