Node.js read/write stream skipping the first line on write - node.js

I wrote a simple utility to convert a somewhat weird json file (multiple objects not in an array) to csv for some system testing purposes. The read and transformation themselves are fine, and the resulting string is logged to the console correctly, but sometimes the resulting csv file is missing the first data line (it shows header, 1 blank line, then rest of data). I'm using read and write streams, without any provisions for backpressure. I don't think the problem is backpressure, since only the 1st line gets skipped, but I could be wrong. Any ideas?
const fs = require('fs');
const readline = require('readline');
const JSONbig = require('json-bigint');
// Create read interface to stream each line
const readInterface = readline.createInterface({
input: fs.createReadStream('./confirm.json'),
// output: process.stdout,
console: false
});
const writeHeader = fs.createWriteStream('./confirm.csv');
const header = "ACTION_TYPE,PROCESS_PICK,TYPE_FLAG,APP_ID,FACILITY_ID,CONTAINER_ID,USER_ID,CONFIRM_DATE_TS,PICK_QTY,REMAINING_QTY,PICK_STATUS,ASSIGNMENT_ID,LOCATION_ID,ITEM_ID,CLUSTER_ID,TOTAL_QTY,TOTAL_ITEMS,WAVE_NBR,QA_FLAG,WORK_DIRECTIVE_ID\n";
writeHeader.write(header);
// Create write interface to save each csv line
const writeDetail = fs.createWriteStream('./confirm.csv', {
flags: 'a'
});
readInterface.on('line', function(line) {
let task = JSONbig.parse(line);
task.businessData.MESSAGE.RECORD[0].DETAIL.REG_DETAIL.forEach(element => {
let csv = "I,PTB,0,VCO,PR9999999011,,cpicker1,2020121000000," + element.QUANTITYTOPICK.toString() + ",0,COMPLETED," +
task.businessData.MESSAGE.RECORD[0].ASSIGNMENTNUMBER.toString() + "," + element.LOCATIONNUMBER.toString() + "," +
element.ITEMNUMBER.toString() + ",,,," +
task.businessData.MESSAGE.RECORD[0].WAVE.toString() + ",N," + element.CARTONNUMBER.toString() + "\n";
console.log(csv);
try {
writeDetail.write(csv);
} catch (err) {
console.error(err);
}
});
});
Edit: Based on the feedback below, I consolidated the write streams into one (the missing line was still happening, but it's better coding anyway). I also added a try block around the JSON parse. Ran the code several times over different files, and no missing line. Maybe the write was happening before the parse was done? In any case, it seems my problem is resolved for the moment. I'll have to research how to properly handle backpressure later. Thanks for the help.

The code you show here is opening two separate writestreams on the same file and then writing to both of them without any timing coordination between them. That will clearly conflict.
You open one here:
const writeHeader = fs.createWriteStream('./confirm.csv');
const header = "ACTION_TYPE,PROCESS_PICK,TYPE_FLAG,APP_ID,FACILITY_ID,CONTAINER_ID,USER_ID,CONFIRM_DATE_TS,PICK_QTY,REMAINING_QTY,PICK_STATUS,ASSIGNMENT_ID,LOCATION_ID,ITEM_ID,CLUSTER_ID,TOTAL_QTY,TOTAL_ITEMS,WAVE_NBR,QA_FLAG,WORK_DIRECTIVE_ID\n";
writeHeader.write(header);
And, you open one here:
// Create write interface to save each csv line
const writeDetail = fs.createWriteStream('./confirm.csv', {
flags: 'a'
});
And, then you write to the second one in your loop. Those clearly conflict. The write from the first is probably not complete when you open the second and it also may not be flushed to disk yet either. The second one opens for append, but doesn't accurately read the file position for appending because the first one hasn't yet succeeded.
This code doesn't show any reason for using separate write streams at all so the cleanest way to address this would be to just use one writestream that will accurately serialize the writes. Otherwise, you have to wait for the first writestream to finish and close before opening the second one.
And your .forEach() loop needs backpressure support, since you're repeatedly calling .write() and, at some data size, the stream's internal buffer will fill up. I agree that backpressure is not likely the cause of the issue you are asking about, but it is something else you need to fix when rapidly writing in a loop.
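A minimal sketch of both suggestions together, one write stream for the header and the rows, with each write waiting for 'drain' when .write() returns false. The names convert, writeWithBackpressure and toCsvRows are made up for illustration; toCsvRows stands in for the JSON-to-CSV mapping from the question.

```javascript
const fs = require('fs');
const readline = require('readline');
const { once } = require('events');

// Write one chunk; if the stream's buffer is full, wait for 'drain'.
async function writeWithBackpressure(stream, chunk) {
  if (!stream.write(chunk)) {
    await once(stream, 'drain');
  }
}

// Hypothetical helper: `toCsvRows(line)` returns an array of CSV row strings
// (each ending in "\n") for one input line.
async function convert(inputPath, outputPath, header, toCsvRows) {
  // The only writer for this file, so header and rows are serialized.
  const out = fs.createWriteStream(outputPath);
  const rl = readline.createInterface({
    input: fs.createReadStream(inputPath),
    crlfDelay: Infinity,
  });

  await writeWithBackpressure(out, header);

  // Lines are handled one at a time; every row of a line is written
  // (and drained if necessary) before the next line is handled.
  for await (const line of rl) {
    if (line.trim() === '') continue; // skip blank lines
    for (const row of toCsvRows(line)) {
      await writeWithBackpressure(out, row);
    }
  }

  out.end();
  await once(out, 'finish'); // all data has been handed to the OS
}
```

Usage would look something like convert('./confirm.json', './confirm.csv', header, buildRows), where buildRows wraps the JSONbig.parse and string building from the question.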

Related

NodeJS + Electron - Optimizing Displaying Large Files

I'm trying to read large files. Currently, I'm following the NodeJS documentation on how to read the large files but when I read a somewhat large file (~1.1 MB, ~20k lines), my Electron app freezes up for about 6 minutes and then the app finishes loading all the lines.
Here's my current code
var fileContents = document.getElementById("fileContents")
//first clear out the existing text
fileContents.innerHTML = ""
if(fs.existsSync(pathToFile)){
const fileLine = readline.createInterface({
input: fs.createReadStream(pathToFile)
})
fileLine.on('line', (line) => {
fileContents.innerHTML += line + "\n"
})
} else {
fileContents.innerHTML += fileNotFound + "\n"
console.log('Could not find file!!')
}
And the tag I'm targeting is a <xmp> tag.
What are some ways that people have displayed large files?
Streams can often be useful for high performance as they allow you to process one line at a time without loading the whole file into memory.
In this case however, you are loading each line and then concatenating onto your existing string (fileContents.innerHTML) with +=. All that concatenating is likely to be slower than just loading the whole contents of the file as one string. Worse still, you are outputting HTML every time you read in a line. So with 20k lines you are asking the rendering engine to render HTML 20,000 times!
Instead, try reading in the file as one string, and outputting the HTML just once.
fs.readFile(pathToFile, (err, data) => {
if (err) throw err;
fileContents.innerHTML = data;
});
The problem with fs.readFile() is that you won't be able to open large files (600 MB, for instance); for very big files you need to use a stream anyway.
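The two points can be combined: read through a stream, but still touch the DOM only once. A rough sketch, assuming the fileContents element and pathToFile variable from the question; readText is a made-up helper name. Note that the whole text still ends up in memory, so this only addresses the repeated rendering, not files too large to hold as one string.

```javascript
const fs = require('fs');

// Collect the stream's chunks and resolve with the whole text.
function readText(pathToFile) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    fs.createReadStream(pathToFile, { encoding: 'utf8' })
      .on('data', (chunk) => chunks.push(chunk)) // cheap array push, no DOM work
      .on('error', reject)
      .on('end', () => resolve(chunks.join('')));
  });
}

// In the renderer (assumes `fileContents` and `pathToFile` exist as in the question):
// readText(pathToFile).then((text) => {
//   fileContents.textContent = text; // one DOM update instead of 20,000
// });
```

Using textContent instead of innerHTML is a choice made here so the file's text is not parsed as markup.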
I'm writing a genomics app called AminoSee using Node and Electron. When I started trying to ingest files bigger than 2 GB, I had to switch to a streaming architecture, as my program was trying to load the entire file into memory. Since I scan the file, this is clearly ludicrous. Here is the core of my processor, from the CLI app:
sourced: https://github.com/tomachinz/AminoSee/blob/master/aminosee-cli.js
try {
var readStream = fs.createReadStream(filename).pipe(es.split()).pipe(es.mapSync(function(line){
readStream.pause(); // curious to test performance of removing
streamLineNr++;
processLine(line); // process line here and call readStream.resume() when ready
readStream.resume();
})
.on('error', function(err){
error('While reading file: ' + filename, err.reason);
error(err)
})
.on('end', function() {
log("Stream ending");
})
.on('close', function() {
log("Stream closed");
setImmediate( () => { // after a 2 GB file give the CPU 1 cycle breather!
calcUpdate() ;
saveDocuments();
});
}));
} catch(e) {
error("ERROR:" + e)
}
I used setImmediate a lot, as my program would get quite far ahead of itself before I learnt about callbacks and promises! It was a great time to learn about race conditions, that's for sure. It still has a million bugs and would make a good learning project.

NODE fs.readFile, JSON.parse and fs.writeFile

I'm writing an app in Node and have been running into a rare but detrimental occurrence.
So I have a schedule.txt and I write to it when the user makes a change but then also read it every second and then parse it for use throughout the program.
Rarely what happens is as a user is writing to the file (asynchronously) the app (based on the timer) reads the same file and attempts to parse it and fails.
I know from a design stand-point maybe this is just bound to happen... but I'm wondering if there is a quick fix I can do now. Would using writeFileSync help my situation? (make it more 'atomic'?) I just want to make sure that the app doesn't read the file while another process is still writing to the file.
TIA!
Niko
Seems like you'd want to serialize your read/writes. If it were me, I might try having a "manager" object which encapsulates the serialization, which you'd use like:
var fileManager = require('./file-manager');
// somewhere in the program
fileManager.scheduleWrite(data, function(err){
// now the write is done
});
// somewhere else in the program
fileManager.scheduleRead(function(err, data){
// `data` contains the data
});
Then implement it using Q or a similar promises lib, like:
// in file-manager.js
var Q = require('q');
var wait = Q(); // an already-resolved promise to start the chain
module.exports = {
scheduleWrite: function(data, cb){
wait = wait.then(function(){
// write data and call cb()
});
},
scheduleRead: function(cb){
wait = wait.then(function(){
// read data and call cb(data)
});
}
};
The wait var will "stack up" into a serialized chain of tasks where the next one won't start until the previous one completes.
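The same idea can be sketched with native promises instead of Q. This is an illustration, not a drop-in module: createFileManager and enqueue are invented names, and it returns promises rather than taking callbacks.

```javascript
const fsp = require('fs').promises;

// Hypothetical factory; `filePath` is the file whose reads/writes are serialized.
function createFileManager(filePath) {
  let tail = Promise.resolve(); // end of the queue of pending tasks

  // Queue a task behind everything already scheduled.
  function enqueue(task) {
    const result = tail.then(task);
    // Swallow errors on the internal chain so one failed task
    // does not block the tasks queued after it.
    tail = result.catch(() => {});
    return result; // the caller still sees the real outcome
  }

  return {
    scheduleWrite: (data) => enqueue(() => fsp.writeFile(filePath, data)),
    scheduleRead: () => enqueue(() => fsp.readFile(filePath, 'utf8')),
  };
}
```

A read scheduled after a write only starts once that write has completed, so the timer-driven read never sees a half-written file. This only coordinates reads and writes made through the same manager in the same process.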

Read File in Node and process the same

I want to read a file and process each line of it. I use a read stream to read the file and then invoke the processRecord method for each line. processRecord needs to make multiple calls and assemble the final data before it is written to the store.
The file has 500K records.
The issue that I'm facing is that the file is read at a significant pace, and I believe Node is not getting enough time to actually run processRecord. Hence the memory shoots up to 800 MB and then everything slows down.
Any help is appreciated.
The code that I'm using is given below -
var instream = fs.createReadStream('C:/data.txt');
var outstream = new stream;
var rl = readline.createInterface({
input: instream,
output: outstream,
terminal: false
});
outstream.readable = true;
rl.on('line', function(line) {
processRecord(line);
});
The Node.js readline module is intended more for user interaction than line-by-line streaming from files. You may have better luck with the popular byline package.
var fs = require('fs');
var byline = require('byline');
// You'll need to check the encoding.
var lineStream = byline(fs.createReadStream('C:/data.txt', { encoding: 'utf8' }));
lineStream.on('data', function (line) {
processRecord(line);
});
You'll have a better chance of avoiding memory leaks if the data is piped to another stream. I'm assuming here that processRecord is feeding into one. If you make it a transform stream object, then you can use pipes.
var out = fs.createWriteStream('output.txt');
lineStream.pipe(processRecordStream).pipe(out);
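What such a transform stream might look like, as a sketch only. To keep it free of third-party packages, a small line splitter stands in for byline here; createLineSplitter, createRecordProcessor and run are illustrative names, and processRecord is assumed to return a promise of the text to write.

```javascript
const fs = require('fs');
const { Transform, pipeline } = require('stream');

// Splits incoming text into lines (a stand-in for byline in this sketch).
function createLineSplitter() {
  let tail = '';
  return new Transform({
    readableObjectMode: true, // emit one line per chunk
    decodeStrings: false,     // keep incoming strings as strings
    transform(chunk, _enc, done) {
      const lines = (tail + chunk).split(/\r?\n/);
      tail = lines.pop(); // last piece may be an incomplete line
      for (const line of lines) this.push(line);
      done();
    },
    flush(done) {
      if (tail) this.push(tail); // final line without a trailing newline
      done();
    },
  });
}

// Assumption: `processRecord(line)` returns a promise of the output text.
function createRecordProcessor(processRecord) {
  return new Transform({
    writableObjectMode: true,
    transform(line, _enc, done) {
      // done() is called only once the record has been processed, so the
      // stream does not accept the next line until this one is finished.
      processRecord(line).then((out) => done(null, out + '\n'), done);
    },
  });
}

function run(inputPath, outputPath, processRecord, callback) {
  pipeline(
    fs.createReadStream(inputPath, { encoding: 'utf8' }),
    createLineSplitter(),
    createRecordProcessor(processRecord),
    fs.createWriteStream(outputPath),
    callback // called with an error, or with nothing on success
  );
}
```

Because every stage is connected through pipeline, a slow processRecord slows the file reader down instead of letting unprocessed lines pile up in memory.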

How should I avoid out of memory using nodejs?

var pass = require('./pass.js');
var fs = require('fs');
var path = "password.txt";
var name ="admin";
var
remaining = "",
lineFeed = "\r\n",
lineNr = 0;
var log =
fs.createReadStream(path, { encoding: 'utf-8' })
.on('data', function (chunk) {
// store the actual chunk into the remaining
remaining = remaining.concat(chunk);
// look that we have a linefeed
var lastLineFeed = remaining.lastIndexOf(lineFeed);
// if we don't have any we can continue the reading
if (lastLineFeed === -1) return;
var
current = remaining.substring(0, lastLineFeed),
lines = current.split(lineFeed);
// store from the last linefeed or empty it out
remaining = (lastLineFeed > remaining.length)
? remaining.substring(lastLineFeed + 1, remaining.length)
: "";
for (var i = 0, length = lines.length; i < length; i++) {
// process the actual line
var account={
username:name,
password:lines[i],
};
pass.test(account);
}
})
.on('end', function (close) {
// TODO I'm not sure this is needed, it depends on your data
// process the reamining data if needed
if (remaining.length > 0) {
var account={
username:name,
password:remaining,
};
pass.test(account);
};
});
I'm trying to test passwords for the account "admin"; pass.test is a function that tests one password. I downloaded a weak-password dictionary with a large number of lines, so I searched for a way to read that many lines. With the code above, though, the lines array becomes too large and the process runs out of memory. What should I do?
Insofar as my limited understanding goes, you need to watch a 1GB limit, which I believe is imposed by the V8 engine, actually. (Here's a link, actually saying the limit is 1.4 GB, currently, and lists the different params used to change this manually.) Depending on where you host your node app(s), you can increase this limit, by a param set on the command line when node is started. Again, see the linked article for a few ways to do this.
Also, you might want to make sure that, whenever possible, you use buffers, instead of converting things like data streams (from a DB or other things, for instance) to arrays/whatever, as this will then load the entire dataset into memory. As long as it lives in a Buffer, it is allocated outside the V8 heap, so it doesn't count against that heap limit (it still uses process memory, of course).
And actually, one thing that doesn't make sense, and that seems to be very inefficient in your app, is that, on reading each chunk of data in, you then check your username against EVERY username you've amassed so far, in your lines array, instead of the LAST one. What your app should do is keep track of the last username and password combo you've read in, and then delete all data before this user, in your remaining variable, so you keep your memory down. And since it's not a hold all repository for every line of your password file anymore, you should probably retitle it something like buffer or something. This means that you'd remove your for loop, since you're already "looping" through the data in your password file, by reading it in, chunk by chunk.
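One way to keep only the current line in memory is to read line by line and wait for each test before moving on. A sketch, assuming the test function returns a promise (the question does not say whether pass.test does); tryPasswords and testPassword are illustrative names.

```javascript
const fs = require('fs');
const readline = require('readline');

// Assumption: `testPassword(account)` returns a promise that settles
// when the attempt has finished.
async function tryPasswords(path, username, testPassword) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path, { encoding: 'utf8' }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  let tried = 0;
  for await (const line of rl) {
    if (line === '') continue; // skip empty lines
    // Wait for this attempt before handling the next line,
    // so pending attempts do not pile up.
    await testPassword({ username: username, password: line });
    tried++;
  }
  return tried;
}
```

With this shape there is no lines array and no remaining variable to manage; the line splitting is left to readline.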

Reading file in segments of X number of lines

I have a file with a lot of entries (10+ million), each representing a partial document that is being saved to a mongo database (based on some criteria, non-trivial).
To avoid overloading the database (which is doing other operations at the same time), I wish to read in chunks of X lines, wait for them to finish, read the next X lines, etc.
Is there any way to use any of the fs callback mechanisms to also "halt" progress at a certain point, without blocking the entire program? From what I can tell, they all run from start to finish with no way of stopping them, unless you stop reading the file entirely.
The issue is that, because of the file size, memory also becomes a problem: because of the time the updates take, a LOT of the data will be held in memory, exceeding the 1 GB limit and causing the program to crash. Secondly, as I said, I don't want to queue 1 million updates and completely stress the mongo database.
Any and all suggestions welcome.
UPDATE: Final solution using line-reader (available via npm) below, in pseudo-code.
var lineReader = require('line-reader');
var filename = <wherever you get it from>;
lineReader.eachLine(filename, function(line, last, cb) {
//
// Do work here, line contains the line data
// last is true if it's the last line in the file
//
function checkProcessed(callback) {
if (doneProcessing()) { // Implement doneProcessing to check whether whatever you are doing is done
callback();
}
else {
setTimeout(function() { checkProcessed(callback) }, 100); // Adjust timeout according to expecting time to process one line
}
}
checkProcessed(cb);
});
This is implemented to make sure doneProcessing() returns true before attempting to work on more lines - this means you can effectively throttle whatever you are doing.
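The "read X lines, wait, read the next X" pattern from the question can also be sketched without polling, using Node's built-in readline and async iteration. Here saveBatch is a hypothetical function that returns a promise which settles once the batch has been stored (for example, after the database updates complete); processInBatches is likewise an invented name.

```javascript
const fs = require('fs');
const readline = require('readline');

// Assumption: `saveBatch(lines)` returns a promise that settles
// when the whole batch has been handled.
async function processInBatches(filename, batchSize, saveBatch) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filename),
    crlfDelay: Infinity,
  });

  let batch = [];
  let batches = 0;
  for await (const line of rl) {
    batch.push(line);
    if (batch.length >= batchSize) {
      await saveBatch(batch); // no further lines are handled until this settles
      batch = [];
      batches++;
    }
  }
  if (batch.length > 0) {
    await saveBatch(batch); // final, possibly smaller, batch
    batches++;
  }
  return batches;
}
```

This code holds at most one batch at a time, and the rest of the program keeps running while a batch is being saved.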
I don't use MongoDB and I'm not an expert in using Lazy, but I think something like below might work or give you some ideas. (note that I have not tested this code)
var fs = require('fs'),
lazy = require('lazy');
var readStream = fs.createReadStream('yourfile.txt');
var file = lazy(readStream)
.lines // ask to read stream line by line
.take(100) // and read 100 lines at a time.
.join(function(onehundredlines){
readStream.pause(); // pause reading the stream
writeToMongoDB(onehundredlines, function(err){
// error checking goes here
// resume the stream 1 second after MongoDB finishes saving.
// (wrapped in a function so resume() is called on readStream itself)
setTimeout(function(){ readStream.resume(); }, 1000);
});
});

Resources