var pass = require('./pass.js');
var fs = require('fs');

var path = "password.txt";
var name = "admin";
var remaining = "",
    lineFeed = "\r\n",
    lineNr = 0;

var log = fs.createReadStream(path, { encoding: 'utf-8' })
    .on('data', function (chunk) {
        // append the new chunk to whatever was left over from the previous one
        remaining = remaining.concat(chunk);
        // check whether we have at least one complete line
        var lastLineFeed = remaining.lastIndexOf(lineFeed);
        // if we don't have any linefeed yet, keep reading
        if (lastLineFeed === -1) return;
        var current = remaining.substring(0, lastLineFeed),
            lines = current.split(lineFeed);
        // keep whatever follows the last linefeed for the next chunk, or empty it out
        remaining = (lastLineFeed + lineFeed.length < remaining.length)
            ? remaining.substring(lastLineFeed + lineFeed.length)
            : "";
        for (var i = 0, length = lines.length; i < length; i++) {
            // process the current line
            var account = {
                username: name,
                password: lines[i],
            };
            pass.test(account);
        }
    })
    .on('end', function () {
        // TODO I'm not sure this is needed, it depends on your data
        // process the remaining data after the last linefeed, if any
        if (remaining.length > 0) {
            var account = {
                username: name,
                password: remaining,
            };
            pass.test(account);
        }
    });
I'm trying to test passwords for the "admin" account; pass.test is a function that tests one password. I downloaded a weak-password dictionary with a very large number of lines, so I searched for a way to read that many lines, but with the code above the lines array gets too large and the process runs out of memory. What should I do?
As far as my limited understanding goes, you need to watch out for a heap limit of about 1 GB, which I believe is imposed by the V8 engine. (There's an article saying the limit is currently around 1.4 GB and listing the different parameters used to change this manually.) Depending on where you host your node app(s), you can raise this limit with a parameter set on the command line when node is started; see that article for a few ways to do this.
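For example, if you run the script directly with node (the flag value is in megabytes, and app.js is just a placeholder for your entry script), something like this raises the old-space heap limit to 4 GB:

node --max-old-space-size=4096 app.js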
Also, whenever possible, keep data in Buffers instead of converting things like data streams (from a DB or anything else) into arrays or strings, since that loads the entire dataset onto the V8 heap. As long as the data lives in a Buffer, it is allocated outside the V8 heap and doesn't count against that limit.
And one thing that doesn't make sense and seems very inefficient in your app: on reading each chunk of data in, you check your username against EVERY password line you've amassed so far in your lines array, instead of only the most recent one. What your app should do is keep track of the last username/password combination it has read, and then delete everything before that point from your remaining variable, so memory stays low. And since it would no longer be a hold-all repository for every line of your password file, you should probably rename it to something like buffer. This also means you could drop the for loop, since you are already "looping" through the password file by reading it in chunk by chunk.
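If you'd rather not buffer and split chunks yourself at all, Node's built-in readline module can hand the dictionary to pass.test one line at a time, so only the current line is held in memory. A minimal sketch, assuming pass.test is synchronous and can simply be called once per candidate password:

var fs = require('fs');
var readline = require('readline');
var pass = require('./pass.js');

var rl = readline.createInterface({
    input: fs.createReadStream('password.txt', { encoding: 'utf-8' }),
    crlfDelay: Infinity // treat \r\n as a single line break
});

rl.on('line', function (line) {
    // one candidate password per line; nothing else accumulates in memory
    pass.test({ username: 'admin', password: line });
});

rl.on('close', function () {
    console.log('dictionary exhausted');
});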
Related
I wrote a simple utility to convert a somewhat weird json file (multiple objects not in an array) to csv for some system testing purposes. The read and transformation themselves are fine, and the resulting string is logged to the console correctly, but sometimes the resulting csv file is missing the first data line (it shows header, 1 blank line, then rest of data). I'm using read and write streams, without any provisions for backpressure. I don't think the problem is backpressure, since only the 1st line gets skipped, but I could be wrong. Any ideas?
const fs = require('fs');
const readline = require('readline');
const JSONbig = require('json-bigint');
// Create read interface to stream each line
const readInterface = readline.createInterface({
input: fs.createReadStream('./confirm.json'),
// output: process.stdout,
console: false
});
const writeHeader = fs.createWriteStream('./confirm.csv');
const header = "ACTION_TYPE,PROCESS_PICK,TYPE_FLAG,APP_ID,FACILITY_ID,CONTAINER_ID,USER_ID,CONFIRM_DATE_TS,PICK_QTY,REMAINING_QTY,PICK_STATUS,ASSIGNMENT_ID,LOCATION_ID,ITEM_ID,CLUSTER_ID,TOTAL_QTY,TOTAL_ITEMS,WAVE_NBR,QA_FLAG,WORK_DIRECTIVE_ID\n";
writeHeader.write(header);
// Create write interface to save each csv line
const writeDetail = fs.createWriteStream('./confirm.csv', {
flags: 'a'
});
readInterface.on('line', function(line) {
    let task = JSONbig.parse(line);
    task.businessData.MESSAGE.RECORD[0].DETAIL.REG_DETAIL.forEach(element => {
        let csv = "I,PTB,0,VCO,PR9999999011,,cpicker1,2020121000000," + element.QUANTITYTOPICK.toString() + ",0,COMPLETED," +
            task.businessData.MESSAGE.RECORD[0].ASSIGNMENTNUMBER.toString() + "," + element.LOCATIONNUMBER.toString() + "," +
            element.ITEMNUMBER.toString() + ",,,," +
            task.businessData.MESSAGE.RECORD[0].WAVE.toString() + ",N," + element.CARTONNUMBER.toString() + "\n";
        console.log(csv);
        try {
            writeDetail.write(csv);
        } catch (err) {
            console.error(err);
        }
    });
});
Edit: Based on the feedback below, I consolidated the write streams into one (the missing line was still happening, but it's better coding anyway). I also added a try block around the JSON parse. Ran the code several times over different files, and no missing line. Maybe the write was happening before the parse was done? In any case, it seems my problem is resolved for the moment. I'll have to research how to properly handle backpressure later. Thanks for the help.
The code you show here is opening two separate writestreams on the same file and then writing to both of them without any timing coordination between them. That will clearly conflict.
You open one here:
const writeHeader = fs.createWriteStream('./confirm.csv');
const header = "ACTION_TYPE,PROCESS_PICK,TYPE_FLAG,APP_ID,FACILITY_ID,CONTAINER_ID,USER_ID,CONFIRM_DATE_TS,PICK_QTY,REMAINING_QTY,PICK_STATUS,ASSIGNMENT_ID,LOCATION_ID,ITEM_ID,CLUSTER_ID,TOTAL_QTY,TOTAL_ITEMS,WAVE_NBR,QA_FLAG,WORK_DIRECTIVE_ID\n";
writeHeader.write(header);
And, you open one here:
// Create write interface to save each csv line
const writeDetail = fs.createWriteStream('./confirm.csv', {
flags: 'a'
});
And, then you write to the second one in your loop. Those clearly conflict. The write from the first is probably not complete when you open the second and it also may not be flushed to disk yet either. The second one opens for append, but doesn't accurately read the file position for appending because the first one hasn't yet succeeded.
This code doesn't show any reason for using separate write streams at all so the cleanest way to address this would be to just use one writestream that will accurately serialize the writes. Otherwise, you have to wait for the first writestream to finish and close before opening the second one.
And, your .forEach() loop needs backpressure support, since you're repeatedly calling .write() and, at some data size, you will get backpressure. I agree that backpressure is not likely the cause of the issue you are asking about, but it is something else you need to fix when rapidly writing in a loop.
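For illustration, a rough sketch of the single-writestream approach with basic backpressure handling (buildCsvLine here is a hypothetical helper standing in for your JSONbig.parse and string-building logic):

const fs = require('fs');
const readline = require('readline');

const out = fs.createWriteStream('./confirm.csv'); // one stream for both header and detail lines
out.write(header); // header string as defined in the question

const rl = readline.createInterface({
    input: fs.createReadStream('./confirm.json')
});

rl.on('line', (line) => {
    const csv = buildCsvLine(line); // hypothetical: parse the JSON line and build the csv row(s)
    if (!out.write(csv)) { // write() returns false when the internal buffer is full
        rl.pause(); // stop reading lines until the write stream drains
        out.once('drain', () => rl.resume());
    }
});

rl.on('close', () => out.end());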
I am currently trying to implement SPIMI index construction method in Node and I have ran into an issue.
The code is the following:
let fs = require("fs");
let path = require("path");
module.exports = {
    fileStream: function (dirPath, fileStream) {
        return buildFileStream(dirPath, fileStream);
    },
    buildSpimi: function (fileStream, outDir) {
        let invIndex = {};
        let sortedInvIndex = {};
        let fileNameCount = 1;
        let outputTXT = "";
        let entryCounter = 0;
        let resString = "";
        fileStream.forEach((filePath, fileIndex) => {
            let data = fs.readFileSync(filePath).toString('utf-8');
            data = data.toUpperCase().split(/[^a-zA-Z]/).filter(function (ch) { return ch.length != 0; });
            data.forEach(token => {
                // CHANGE THE SIZE IF NECESSARY (4e+?)
                if (entryCounter > 100000) {
                    Object.keys(invIndex).sort().forEach((key) => {
                        sortedInvIndex[key] = invIndex[key];
                    });
                    outputTXT = outDir + "block" + fileNameCount;
                    for (let SItoken in sortedInvIndex) {
                        resString += SItoken + "," + sortedInvIndex[SItoken].toString();
                    }
                    fs.writeFile(outputTXT, resString, (err) => { if (err) console.log(err); });
                    resString = "";
                    entryCounter = 0;
                    sortedInvIndex = {};
                    invIndex = {};
                    console.log(outputTXT + " - written;");
                    fileNameCount++;
                }
                if (invIndex[token] == undefined) {
                    invIndex[token] = [];
                    entryCounter++;
                }
                if (!invIndex[token].includes(fileIndex)) {
                    invIndex[token].push(fileIndex);
                    entryCounter++;
                }
            });
        });
        Object.keys(invIndex).sort().forEach((key) => {
            sortedInvIndex[key] = invIndex[key];
        });
        outputTXT = outDir + "block" + fileNameCount;
        for (let SItoken in sortedInvIndex) {
            resString += SItoken + "," + sortedInvIndex[SItoken].toString();
        }
        fs.writeFile(outputTXT, resString, (err) => { if (err) console.log(err); });
        console.log(outputTXT + " - written;");
    }
}
function buildFileStream(dirPath, fileStream) {
    fileStream = fileStream || []; // default to an empty array, not 0, so .push() works
    fs.readdirSync(dirPath).forEach(function (file) {
        let filepath = path.join(dirPath, file);
        let stat = fs.statSync(filepath);
        if (stat.isDirectory()) {
            fileStream = buildFileStream(filepath, fileStream);
        } else {
            fileStream.push(filepath);
        }
    });
    return fileStream;
}
I am using the exported functions in a separate file:
let spimi = require("./spimi");
let outputDir = "/Users/me/Desktop/SPIMI_OUT/"
let inputDir = "/Users/me/Desktop/gutenberg/2/2";
let fileStream = [];
let result = spimi.fileStream(inputDir, fileStream);
console.table(result)
console.log("Finished building the filestream");
let t0 = new Date();
spimi.buildSpimi(result, outputDir);
let t1 = new Date();
console.log(t1 - t0);
While this code kind of works on relatively small volumes of data (I tested up to 1.5 GB), there is obviously a memory leak somewhere: when monitoring the RAM usage I can see it going up as far as 4-5 GB.
I spent quite a lot of time trying to figure out what might be the cause, but I still couldn't find the issue.
I would appreciate any hints on this!
Thanks!
Something to understand about the language and garbage collection in general is that this:
data = data.toUpperCase().split(/[^a-zA-Z]/).filter(...)
creates three additional copies of your data. First, an uppercase copy. Then, a split array copy. Then, a filtered copy of the split array.
So, at this point, you have four copies of your data in memory. All but the filtered array are now eligible for garbage collection when the GC gets a chance to run, but if this data was initially large, you're going to be using at least 3x-4x as much memory as the file size (depending upon how many array items are removed in your .filter() operation).
None of this is a leak, but it's a very big peak memory usage which can be a problem.
A more memory efficient way to process large files is to process them as a stream (not read them all into memory at once). You read a small size chunk (say 1024 bytes), process it, read a chunk, process it while being careful about chunk boundaries. If your file naturally has line boundaries, there are already pre-built solutions for processing line by line. If not, you can create your own chunk processing mechanism. We would have to see a sample of your data to make more specific chunk processing suggestions.
As another point, if you end up with a lot of keys in invIndex, then this line of code starts to become inefficient and you're doing it in your loop:
Object.keys(invIndex).sort()
This takes your object and gets all the keys in a temporary array which you use only for the purposes of updating the sortedInvIndex which is yet another copy of your data. So, right there alone, this set of code makes three copies of all your keys and two copies of all the values. And, it does it every time through your loop. Again, lots of peak memory usage that the GC won't normally clean up until your function is done.
A redesign of the way you process this data could probably reduce the peak memory usage by a factor of 100x. For memory efficiency, you want only the initial data, the final data representation, and just a little extra for temporary transformations to ever be in use at the same time. You don't want to EVER be processing all the data multiple times, because each time you do that, it creates yet another entire copy of all the data that contributes to peak memory usage.
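As one small illustration of that point, the block can be written out in key order without building the extra sortedInvIndex copy at all (a sketch of just that change):

Object.keys(invIndex).sort().forEach((key) => {
    // append straight from invIndex; no second "sorted" object is needed
    resString += key + "," + invIndex[key].toString();
});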
If you show what the data input looks like and what data structure you're trying to end up with, I could probably take a crack at a much more efficient implementation.
Mykhailo, adding on to what jfriend said, it's actually not a memory leak. It's working as intended.
Something to consider is that readFile buffers the entire file! That is what causes the huge memory bloat. The better alternative is fs.createReadStream(), which only buffers the part of the file you're currently reading. Unfortunately, implementing that solution may require a full rewrite of your code, since it returns an fs.ReadStream, which won't behave the way you're currently handling files. Check out this link and read the bottom of the section to see what I'm referencing.
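To illustrate the idea (a rough sketch only, not a drop-in replacement for buildSpimi: the chunk-boundary handling is simplified and the block-flushing logic from your loop is omitted):

const fs = require('fs');

function indexFileStreamed(filePath, fileIndex, invIndex, onDone) {
    let carry = ''; // holds a possibly incomplete token between chunks
    fs.createReadStream(filePath, { encoding: 'utf-8' })
        .on('data', (chunk) => {
            const pieces = (carry + chunk).toUpperCase().split(/[^A-Z]/);
            carry = pieces.pop(); // the last piece may be cut off mid-word
            for (const token of pieces) {
                if (token.length === 0) continue;
                if (!invIndex[token]) invIndex[token] = [];
                if (!invIndex[token].includes(fileIndex)) invIndex[token].push(fileIndex);
            }
        })
        .on('end', () => {
            if (carry.length > 0) {
                if (!invIndex[carry]) invIndex[carry] = [];
                if (!invIndex[carry].includes(fileIndex)) invIndex[carry].push(fileIndex);
            }
            onDone();
        });
}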
I have streaming data coming in with IP address. I want to translate the IP to longitude and latitude before putting the data into my database.
This is what I was doing, but it is causing some issues. I also tried putting locationObject outside the for loop, and that, weirdly, uses a lot of memory. I know this is blocking code, but it should be fast. Still, I see a memory issue, since data objects arrive from a stream continuously and each data object is huge.
for (var i = 0; i < data.length; i++){
if (data.client_ip !== null) {
var locationLookup = maxmind.openSync('./GeoIP2-City.mmdb');
var ip = data.client_ip;
var maxmindObj = locationLookup.get(ip);
locationObject.country = maxmindObj.country.names.en;
locationObject.latitude = maxmindObj.location.latitude;
locationObject.longitude = maxmindObj.location.longitude;
}
}
Again, trying to put maxmind.openSync('./GeoIP2-City.mmdb') outside the for loop causes a huge increase in memory.
The other option is to use nonblocking code:
maxmind.open('/path/to/GeoLite2-City.mmdb', (err, cityLookup) => {
var city = cityLookup.get('66.6.44.4');
});
But I don't think it's a good idea to put this inside a loop.
How can I handle this? I am getting a data object every minute.
https://github.com/runk/node-maxmind
I'm not sure why you think reading the database file for each iteration of the loop would be fast ("blocking code" doesn't equal "fast code"), it's much better to read the database file once and then loop over data.
maxmind.openSync() will read the entire database into memory, which is mentioned in the README:
Be careful with sync version! Since mmdb files are quite large
(city database is about 100Mb) fs.readFileSync blocks whole
process while it reads file into buffer.
If you don't have memory to spare, the only other option would be to open the file asynchronously. Again, not inside the loop, but outside of it:
maxmind.open("./GeoIP2-City.mmdb", (err, locationLookup) => {
for (var i = 0; i < data.length; i++) {
if (data.client_ip !== null) {
var ip = data.client_ip;
var maxmindObj = locationLookup.get(ip);
locationObject.country = maxmindObj.country.names.en;
locationObject.latitude = maxmindObj.location.latitude;
locationObject.longitude = maxmindObj.location.longitude;
}
}
});
The only thing I am worried about is that, over time, I call this function a lot: every time my consumers read a jsonObject from Kafka, which happens every minute. Is there a better way to optimize that as well? Given that I call this function every minute, how can I optimize it further?
function processData(jsonObject) {
maxmind.open('./GeoIP2-City.mmdb', function(err, locationLookup) {
if (err) {
logger.error('something went wrong on maxmind fetch', err);
}
for (var i = 0; i < jsonObject.length; i++) { ...}
})
}
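One way to avoid re-opening the database on every Kafka message (a sketch, assuming the lookup object returned by maxmind.open can safely be reused across calls) is to open it once and cache the result:

const maxmind = require('maxmind');

let cityLookupPromise = null;

function getCityLookup() {
    // open the .mmdb once; every later call reuses the same lookup
    if (!cityLookupPromise) {
        cityLookupPromise = new Promise((resolve, reject) => {
            maxmind.open('./GeoIP2-City.mmdb', (err, lookup) => {
                if (err) return reject(err);
                resolve(lookup);
            });
        });
    }
    return cityLookupPromise;
}

function processData(jsonObject) {
    return getCityLookup().then((locationLookup) => {
        for (var i = 0; i < jsonObject.length; i++) {
            // ...same per-record lookup logic as before...
        }
    });
}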
I have a file with a lot of entries (10+ million), each representing a partial document that is being saved to a mongo database (based on some criteria, non-trivial).
To avoid overloading the database (which is doing other operations at the same time), I wish to read in chunks of X lines, wait for them to finish, read the next X lines, etc.
Is there any way to use any of the fs callback mechanisms to also "halt" progress at a certain point, without blocking the entire program? From what I can tell they will all run from start to finish with no way of stopping them, unless you stop reading the file entirely.
The issue is that, because of the file size, memory also becomes a problem, and because of the time the updates take, a LOT of the data would be held in memory, exceeding the 1 GB limit and causing the program to crash. Secondly, as I said, I don't want to queue a million updates and completely stress the mongo database.
Any and all suggestions welcome.
UPDATE: Final solution using line-reader (available via npm) below, in pseudo-code.
var lineReader = require('line-reader');
var filename = <wherever you get it from>;
lineReader(filename, function (line, last, cb) {
    //
    // Do work here; line contains the line data
    // last is true if it's the last line in the file
    //
    function checkProcessed(callback) {
        if (doneProcessing()) { // implement doneProcessing() to check whether whatever you are doing is done
            callback();
        } else {
            setTimeout(function () { checkProcessed(callback); }, 100); // adjust the timeout to the expected time to process one line
        }
    }
    checkProcessed(cb);
});
This is implemented to make sure doneProcessing() returns true before attempting to work on more lines - this means you can effectively throttle whatever you are doing.
I don't use MongoDB and I'm not an expert in using Lazy, but I think something like below might work or give you some ideas. (note that I have not tested this code)
var fs = require('fs'),
    lazy = require('lazy');

var readStream = fs.createReadStream('yourfile.txt');

var file = lazy(readStream)
    .lines                          // ask to read the stream line by line
    .take(100)                      // and read 100 lines at a time
    .join(function (oneHundredLines) {
        readStream.pause();         // pause reading the stream
        writeToMongoDB(oneHundredLines, function (err) {
            // error checking goes here
            // resume the stream 1 second after MongoDB finishes saving
            setTimeout(readStream.resume, 1000);
        });
    });
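On newer Node versions, the same read-X-lines, wait, then continue pattern can also be written with readline's async iterator (a sketch; saveBatchToMongo is a hypothetical helper that performs your partial-document updates and returns a promise):

const fs = require('fs');
const readline = require('readline');

async function processInBatches(filename, batchSize) {
    const rl = readline.createInterface({
        input: fs.createReadStream(filename),
        crlfDelay: Infinity
    });
    let batch = [];
    for await (const line of rl) { // reading is effectively paused while we await below
        batch.push(line);
        if (batch.length >= batchSize) {
            await saveBatchToMongo(batch); // hypothetical: resolves when the updates have finished
            batch = [];
        }
    }
    if (batch.length > 0) await saveBatchToMongo(batch);
}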
I know that when developing in node, you should always try to avoid blocking (sync) functions and go with async functions, however, I did a little test to see how they compare.
I need to open a JSON file that contains i18n data (like date and time formats, etc) and pass that data to a class that uses this data to format numbers, etc in my view.
It would be kind of awkward to start wrapping all the class's methods inside callbacks, so if possible, I would use the synchronous version instead.
console.time('one');
console.time('two');
fs.readFile( this.dir + "/" + locale + ".json", function (err, data) {
if (err) cb( err );
console.timeEnd('one');
});
var data = fs.readFileSync( this.dir + "/" + locale + ".json" );
console.timeEnd('two');
This results in the following lines in my console:
two: 1ms
one: 159ms
It seems that fs.readFileSync is about 150 times faster than fs.readFile - it takes about 1 ms to load a 50KB JSON file (minified). All my JSON files are around 50-100KB.
I was also thinking of somehow memoizing or saving this JSON data to the session, so that the file is read only once per session (or whenever the user changes their locale). I'm not entirely sure how to do that; it's just an idea.
Is it okay to use fs.readFileSync in my case or will I get in trouble later?
No, it is not OK to use a blocking API call in a node server as you describe. Your site's responsiveness to many concurrent connections will take a huge hit. It's also just blatantly violating the #1 principle of node.
The key to node working is that while it is waiting on IO, it is doing CPU/memory processing at the same time. This requires asynchronous calls exclusively. So if you have 100 clients reading 100 JSON files, node can ask the OS to read those 100 files but while waiting for the OS to return the file data when it is available, node can be processing other aspects of those 100 network requests. If you have a single synchronous call, ALL of your client processing stops entirely while that operation completes. So client number 100's connection waits with no processing whatsoever while you read files for clients 1, 2, 3, 4, and so on sequentially. This is Failville.
Here's another analogy. If you went to a restaurant and were the only customer, you would probably get faster service if a single person sat you, took your order, cooked it, served it to you, and handled the bill without the coordination overhead of dealing with the host/hostess, server, head chef, line cooks, cashiers, etc. However, with 100 customers in the restaurant, the extra coordination means things happen in parallel and the overall responsiveness of the restaurant is increased way beyond what it would be if a single person were trying to handle 100 customers on their own.
You are blocking the callback of the asynchronous read with your synchronous read; remember, there is a single thread.
Now, I understand that the time difference is still striking, but you should try with a file that takes much, much longer to read, and imagine that many, many clients will do the same at once; only then does the overhead pay off.
That should answer your question: yes, you will run into trouble if you are serving thousands of requests with blocking IO.
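As for the memoization idea mentioned in the question, one common compromise (just a sketch using only standard fs and path APIs; the function name and cache shape are my own) is to read each locale file asynchronously the first time it is requested and cache the parsed JSON, so the disk is hit only once per locale:

const fs = require('fs');
const path = require('path');

const localeCache = {}; // locale -> parsed JSON, filled on first use

function getLocaleData(dir, locale, cb) {
    if (localeCache[locale]) {
        return process.nextTick(() => cb(null, localeCache[locale]));
    }
    fs.readFile(path.join(dir, locale + '.json'), 'utf-8', (err, data) => {
        if (err) return cb(err);
        try {
            localeCache[locale] = JSON.parse(data);
        } catch (parseErr) {
            return cb(parseErr);
        }
        cb(null, localeCache[locale]);
    });
}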
After a lot of time and a lot of learning and practice, I tried once more, found the answer, and can show an example:
const fs = require('fs');
const syncTest = () => {
let startTime = +new Date();
const results = [];
const files = [];
for (let i=0, len=4; i<len; i++) {
files.push(fs.readFileSync(`file-${i}.txt`));
};
for (let i=0, len=360; i<len; i++) results.push(Math.sin(i), Math.cos(i));
console.log(`Sync version: ${+new Date() - startTime}`);
};
const asyncTest = () => {
let startTime = +new Date();
const results = [];
const files = [];
for (let i=0, len=4; i<len; i++) {
fs.readFile(`file-${i}.txt`, (err, file) => files.push(file));
};
for (let i=0, len=360; i<len; i++) results.push(Math.sin(i), Math.cos(i));
console.log(`Async version: ${+new Date() - startTime}`);
};
syncTest();
asyncTest();
Yes, it's correct to deal with things asynchronously in a server-side environment. But if the use case is different, such as generating a build in a client-side JS project, where you read and write JSON files for different flavors, it doesn't matter that much. We needed a quick way to create a minified build for deployment, and that's where the synchronous calls come into the picture.
I've tried to check the real, measurable difference in speed between fs.readFileSync() and fs.readFile() for reading 3 different files that are on an SD card, with some math calculation added between the reads, and I don't understand where the speed difference is that is always shown in Node diagrams, where Node is supposedly faster even in a simple operation like reading the same file 3 times in a row, with the total time close to the time needed to read the file once.
I understand that it is undoubtedly useful that the server can do other work while it reads a file, but a lot of the time on YouTube or in books the diagrams are not precise, because in a situation like the one below, async Node is slower than sync at reading small files (like below: 85 kB, 170 kB, 255 kB).
var fs = require('fs');
var startMeasureTime = () => {
var start = new Date().getTime();
return start;
};
// synch version
console.log('Start');
var start = startMeasureTime();
for (var i = 1; i<=3; i++) {
var fileName = `Lorem-${i}.txt`;
var fileContents = fs.readFileSync(fileName);
console.log(`File ${1} was downloaded(${fileContents.length/1000}KB) after ${new Date().getTime() - start}ms from start.`);
if (i === 1) {
var hardMath = 3*54/25*35/46*255/34/9*54/25*35/46*255/34/9*54/25*35/46*255/34/9*54/25*35/46*255/34/9*54/25*35/46*255/34/9;
};
};
// asynch version
setImmediate(() => {
console.log('Start');
var start = startMeasureTime();
for (var i = 1; i<=3; i++) {
var fileName = `Lorem-${i}.txt`;
fs.readFile(fileName, {encoding: 'utf8'}, (err, fileContents) => {
console.log(`File ${1} was downloaded(${fileContents.length/1000}KB) after ${new Date().getTime() - start}ms from start.`);
});
if (i === 1) {
var hardMath = 3*54/25*35/46*255/34/9*54/25*35/46*255/34/9*54/25*35/46*255/34/9*54/25*35/46*255/34/9*54/25*35/46*255/34/9;
};
};
});
This is from console:
Start
File 1 was downloaded(255.024KB) after 2ms from start.
File 1 was downloaded(170.016KB) after 5ms from start.
File 1 was downloaded(85.008KB) after 6ms from start.
Start
File 1 was downloaded(255.024KB) after 10ms from start.
File 1 was downloaded(85.008KB) after 11ms from start.
File 1 was downloaded(170.016KB) after 12ms from start.