I am currently trying to implement the SPIMI index construction method in Node and I have run into an issue.
The code is the following:
let fs = require("fs");
let path = require("path");
module.exports = {
fileStream: function (dirPath, fileStream) {
return buildFileStream(dirPath, fileStream);
},
buildSpimi: function (fileStream, outDir) {
let invIndex = {};
let sortedInvIndex = {};
let fileNameCount = 1;
let outputTXT = "";
let entryCounter = 0;
let resString = "";
fileStream.forEach((filePath, fileIndex) => {
let data = fs.readFileSync(filePath).toString('utf-8');
data = data.toUpperCase().split(/[^a-zA-Z]/).filter(function (ch) { return ch.length != 0; });
data.forEach(token => {
//CHANGE THE SIZE IF NECESSARY (4e+?)
if (entryCounter > 100000) {
Object.keys(invIndex).sort().forEach((key) => {
sortedInvIndex[key] = invIndex[key];
});
outputTXT = outDir + "block" + fileNameCount;
for (let SItoken in sortedInvIndex) {
resString += SItoken + "," + sortedInvIndex[SItoken].toString();
};
fs.writeFile(outputTXT, resString, (err) => { if (err) console.log(err); });
resString = "";
entryCounter = 0;
sortedInvIndex = {};
invIndex = {};
console.log(outputTXT + " - written;");
fileNameCount++;
};
if (invIndex[token] == undefined) {
invIndex[token] = [];
entryCounter++;
};
if (!invIndex[token].includes(fileIndex)) {
invIndex[token].push(fileIndex);
entryCounter++;
};
});
});
Object.keys(invIndex).sort().forEach((key) => {
sortedInvIndex[key] = invIndex[key];
});
outputTXT = outDir + "block" + fileNameCount;
for (let SItoken in sortedInvIndex) {
resString += SItoken + "," + sortedInvIndex[SItoken].toString();
};
fs.writeFile(outputTXT, resString, (err) => { if (err) console.log(err); });
console.log(outputTXT + " - written;");
}
}
function buildFileStream(dirPath, fileStream) {
fileStream = fileStream || [];
fs.readdirSync(dirPath).forEach(function (file) {
let filepath = path.join(dirPath, file);
let stat = fs.statSync(filepath);
if (stat.isDirectory()) {
fileStream = buildFileStream(filepath, fileStream);
} else {
fileStream.push(filepath);
}
});
return fileStream;
}
I am using the exported functions in a separate file:
let spimi = require("./spimi");
let outputDir = "/Users/me/Desktop/SPIMI_OUT/"
let inputDir = "/Users/me/Desktop/gutenberg/2/2";
let fileStream = [];
let result = spimi.fileStream(inputDir, fileStream);
console.table(result)
console.log("Finished building the filestream");
let t0 = new Date();
spimi.buildSpimi(result, outputDir);
let t1 = new Date();
console.log(t1 - t0);
While this code kind of works on relatively small volumes of data (I tested up to 1.5 GB), there is obviously a memory leak somewhere: when monitoring RAM usage, I can see it climb as high as 4-5 GB.
I spent quite a lot of time trying to figure out what might be the cause, but I still couldn't find the issue.
I would appreciate any hints on this!
Thanks!
Something to understand about the language and garbage collection in general is that this:
data = data.toUpperCase().split(/[^a-zA-Z]/).filter(...)
creates three additional copies of your data. First, an uppercase copy. Then, a split array copy. Then, a filtered copy of the split array.
So, at this point, you have four copies of your data all in memory. All but the filtered array are now eligible for garbage collection when the GC gets a chance to run, but if this data was initially large, you're going to be using at least 3x-4x as much memory as the file size (depending upon how many array items are removed in your .filter() operation).
None of this is a leak, but it's a very big peak memory usage which can be a problem.
A more memory efficient way to process large files is to process them as a stream (not read them all into memory at once). You read a small size chunk (say 1024 bytes), process it, read a chunk, process it while being careful about chunk boundaries. If your file naturally has line boundaries, there are already pre-built solutions for processing line by line. If not, you can create your own chunk processing mechanism. We would have to see a sample of your data to make more specific chunk processing suggestions.
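The chunk-boundary handling described above can be sketched roughly as follows. This is a minimal illustration, not a drop-in replacement: `createTokenizer` and `tokenizeFile` are hypothetical names, the 64 KB chunk size is an arbitrary choice, and the tokenization rule (uppercase, split on anything that is not a letter) is copied from the question.

```javascript
const fs = require('fs');

// Tokenizes text that arrives in chunks. A word cut in half by a chunk
// boundary is held in `carry` until the next chunk (or the end) arrives.
function createTokenizer(onToken) {
  let carry = '';
  return {
    push(chunk) {
      const parts = (carry + chunk).toUpperCase().split(/[^A-Z]+/);
      carry = parts.pop(); // last piece may be an incomplete word
      for (const part of parts) {
        if (part.length !== 0) onToken(part);
      }
    },
    end() {
      if (carry.length !== 0) onToken(carry);
      carry = '';
    },
  };
}

// Streams one file through the tokenizer; only one chunk is in memory at a time.
function tokenizeFile(filePath, onToken) {
  return new Promise((resolve, reject) => {
    const tokenizer = createTokenizer(onToken);
    fs.createReadStream(filePath, { encoding: 'utf8', highWaterMark: 64 * 1024 })
      .on('data', (chunk) => tokenizer.push(chunk))
      .on('end', () => { tokenizer.end(); resolve(); })
      .on('error', reject);
  });
}
```

With this shape, peak memory per file is roughly one chunk plus the carried partial word, rather than several full copies of the file.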
As another point, if you end up with a lot of keys in invIndex, then this line of code starts to become inefficient and you're doing it in your loop:
Object.keys(invIndex).sort()
This takes your object and gets all the keys in a temporary array which you use only for the purposes of updating the sortedInvIndex which is yet another copy of your data. So, right there alone, this set of code makes three copies of all your keys and two copies of all the values. And, it does it every time through your loop. Again, lots of peak memory usage that the GC won't normally clean up until your function is done.
A redesign to the way you process this data could probably reduce the peak memory usage by a factor of 100x. For memory efficiency, you want only the initial data, the final data representation and then just a little more used for temporary transformations to ever be in use at the same time. You don't want to EVER be processing all the data multiple times because each time you do that, it creates yet another entire copy of all the data that contributes to peak memory usage.
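As a rough sketch of one such redesign (not a measured one), the block below sorts the keys once at flush time and emits each line directly, instead of building a second sorted object and one large string. The names `addPosting` and `writeBlock` are hypothetical, and the one-term-per-line output format is an assumption, since the original code writes entries with no separator between them.

```javascript
// Inverted index held in a Map: token -> array of document ids.
function addPosting(index, token, docId) {
  let postings = index.get(token);
  if (postings === undefined) {
    postings = [];
    index.set(token, postings);
  }
  // Documents are processed in order, so a duplicate can only be the
  // most recently added id; no need to scan the whole array.
  if (postings[postings.length - 1] !== docId) {
    postings.push(docId);
  }
}

// Sorts the keys once and hands each "token,id,id,..." line to writeLine.
// The only temporary copy is the array of keys.
function writeBlock(index, writeLine) {
  const tokens = Array.from(index.keys()).sort();
  for (const token of tokens) {
    writeLine(token + ',' + index.get(token).join(','));
  }
  index.clear(); // release the block before starting the next one
}
```

In real use, `writeLine` could be something like `(line) => stream.write(line + '\n')` on an `fs.createWriteStream()` for the block file.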
If you show what the data input looks like and what data structure you're trying to end up with, I could probably take a crack at a much more efficient implementation.
Mykhailo, adding on to what jfriend said, it's actually not a memory leak. It's working as intended.
Something to consider is that readFileSync (like readFile) buffers the entire file! This will cause the huge memory bloat. A better alternative is fs.createReadStream(), which only buffers the part of the file you're currently reading. Unfortunately, implementing that solution may require a full rewrite of your code, as it returns an fs.ReadStream, which won't behave the way you're currently handling files. Check out this link and read the bottom of the section to see what I'm referencing.
Related
I'm trying to write a live websocket feed line-by-line to a file - I think for this I should be using a writeable stream.
My problem here is that the data received is in the region of 10 lines per second, which quickly fills the buffer.
I understand when using streams from sources you control, you would normally add some sort of backpressure logic here, but what should I do if I do not control the source? Should I be batching up the writes and writing, say 500 lines at a time, instead of per line, or should I be using some other way to save this data?
How big are the lines? 10 lines per second sounds trivial to stream to a disk unless the lines are gigantic or the disk is really slow. Ultimately, if you have no ability to apply backpressure, the source can overwhelm you if it goes fast or your storage goes slow, and you'd have to decide how much you can reasonably buffer and eventually just drop some of the data if you get behind.
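To make the "buffer up to a limit, then drop" idea concrete, here is a minimal sketch. `BoundedLineQueue` and `flush` are hypothetical names, and dropping the oldest line first is just one possible policy; dropping the newest would work equally well depending on which data matters more.

```javascript
// Holds at most `maxLines` pending lines. When the limit is exceeded,
// the oldest line is discarded and counted in `dropped`.
class BoundedLineQueue {
  constructor(maxLines) {
    this.maxLines = maxLines;
    this.lines = [];
    this.dropped = 0;
  }

  push(line) {
    this.lines.push(line);
    if (this.lines.length > this.maxLines) {
      this.lines.shift();
      this.dropped++;
    }
  }

  // Returns everything queued so far and empties the queue.
  drain() {
    const batch = this.lines;
    this.lines = [];
    return batch;
  }
}

// Writes all queued lines as one chunk. Returns the result of stream.write(),
// i.e. false when the caller should wait for 'drain' before flushing again.
function flush(queue, stream) {
  const batch = queue.drain();
  if (batch.length === 0) return true;
  return stream.write(batch.join(''));
}
```

The websocket handler would call `queue.push(line + '\n')` for each message, and `flush(queue, stream)` would run whenever the stream is ready: once at the start and again on each `'drain'` event.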
But, you should be able to write a lot of data. On my regular hard disk (using the generic stream code below with no additional buffering) I can do sequential writes of 100,000,000 bytes at a speed of 55 MBytes/sec.
So, if you have 10 lines per second coming in, as long as the lines were below roughly 5,500,000 bytes each (55 MBytes/sec divided by 10 lines/sec), my hard drive could keep up.
Here's the code I used to test it:
const fs = require('fs');
const { Bench } = require('../../Github/measure');
const { addCommas } = require("../../Github/str-utils");
const lineData = Buffer.from("012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678\n", 'utf-8');
let stream = fs.createWriteStream("D:\\Temp\\temp.txt");
stream.on('open', function() {
let linesRemaining = 1_000_000;
let b = new Bench();
let bytes = 0;
function write() {
let readyForMore;
do {
linesRemaining--;
bytes += lineData.length;
if (linesRemaining === 0) {
readyForMore = stream.write(lineData, done);
} else {
readyForMore = stream.write(lineData);
}
} while (linesRemaining > 0 && readyForMore);
if (linesRemaining > 0) {
stream.once('drain', write);
}
}
function done() {
b.markEnd();
console.log(`Time to write ${addCommas(bytes)} bytes: ${b.formatSec(3)}`);
console.log(`bytes/sec = ${addCommas((bytes/b.sec).toFixed(0))}`);
console.log(`MB/sec = ${addCommas(((bytes/(1024 * 1024))/b.sec).toFixed(1))}`);
stream.end();
}
b.markBegin();
write();
});
Theoretically, it is more efficient for your disk to do fewer, larger writes than tons of small writes. In practice, because of the way the writeStream works, as soon as an inefficient write gets slow, the next write will get buffered and it kind of self-corrects. If you were really trying to minimize the load on the disk, you would buffer writes until you had at least something like 4k to write. The issue is that each write potentially has to allocate some bytes to the file (which involves writing to a table on the disk), then seek to where the bytes should be written on the disk, then write the bytes. Fewer, larger writes (up to some limit that depends upon internal implementation) will reduce the number of times it has to pay the file allocation overhead.
So, I ran a test. I modified the above code (shown below) to buffer into 4k chunks and write them out in 4k chunks. The write throughput increased from 55 MBytes/sec to 284.2 MBytes/sec.
So, the theory holds true that you will write faster if you buffer into larger chunks.
But, even the simpler, non-buffered version may be plenty fast.
Here's the test code for the buffered version:
const fs = require('fs');
const { Bench } = require('../../Github/measure');
const { addCommas } = require("../../Github/str-utils");
const lineData = Buffer.from("012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678\n", 'utf-8');
let stream = fs.createWriteStream("D:\\Temp\\temp.txt");
stream.on('open', function() {
let linesRemaining = 1_000_000;
let b = new Bench();
let bytes = 0;
let cache = [];
let cacheTotal = 0;
const maxBuffered = 4 * 1024;
stream.myWrite = function(data, callback) {
if (callback) {
cache.push(data);
return stream.write(Buffer.concat(cache), callback);
} else {
cache.push(data);
cacheTotal += data.length;
if (cacheTotal >= maxBuffered) {
let ready = stream.write(Buffer.concat(cache));
cache.length = 0;
cacheTotal = 0;
return ready;
} else {
return true;
}
}
}
function write() {
let readyForMore;
do {
linesRemaining--;
bytes += lineData.length;
if (linesRemaining === 0) {
readyForMore = stream.myWrite(lineData, done);
} else {
readyForMore = stream.myWrite(lineData);
}
} while (linesRemaining > 0 && readyForMore);
if (linesRemaining > 0) {
stream.once('drain', write);
}
}
function done() {
b.markEnd();
console.log(`Time to write ${addCommas(bytes)} bytes: ${b.formatSec(3)}`);
console.log(`bytes/sec = ${addCommas((bytes/b.sec).toFixed(0))}`);
console.log(`MB/sec = ${addCommas(((bytes/(1024 * 1024))/b.sec).toFixed(1))}`);
stream.end();
}
b.markBegin();
write();
});
This code uses a couple of my local libraries for measuring the time and formatting the output. If you want to run this yourself, you can substitute your own logic for those.
I'm generating a CSV file that I'd like to save.
It's a bit large, but the code is very simple.
I use streams to prevent out-of-memory errors, but they are happening regardless.
Any tips?
const fs = require('fs');
var noOfRows = 2000000000;
var stream = fs.createWriteStream('myFile.csv', {flags: 'a'});
for (var i=0;i<=noOfRows;i++){
var col = '';
col += i;
stream.write(col)
}
Add a drain event listener:
const fs = require("fs");
var noOfRows = 2000000000;
var stream = fs.createWriteStream("myFile.csv", { flags: "a" });
var i = 0;
function write() {
var ok = true;
do {
var data = i + "";
if (i === noOfRows) {
// last time!
stream.write(data);
} else {
// see if we should continue, or wait
// don't pass the callback, because we're not done yet.
ok = stream.write(data);
}
i++;
} while (i<=noOfRows && ok);
if (i <= noOfRows) {
// had to stop early!
// write some more once it drains
stream.once("drain", write);
}
}
write();
Also, noOfRows is so large that the resulting .csv file may not fit on your disk.
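As a quick sanity check on the size, the sketch below counts how many characters the loop writes in total, assuming each row is just the decimal digits of `i` with no separator or newline, as in the code above. `totalDigits` is a hypothetical helper name.

```javascript
// Total number of decimal digits needed to write every integer from 0 to n.
function totalDigits(n) {
  let total = 0;
  let lo = 0;      // smallest number with `digits` digits (0 counts as one digit)
  let hi = 9;      // largest number with `digits` digits
  let digits = 1;
  while (lo <= n) {
    const last = Math.min(hi, n);
    total += (last - lo + 1) * digits;
    lo = hi + 1;
    hi = hi * 10 + 9;
    digits++;
  }
  return total;
}

console.log(totalDigits(2000000000)); // 18888888900
```

That is roughly 18.9 GB of output for noOfRows = 2000000000, before adding any separators.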
Your .csv file has too much data to be kept in the stream. A write stream buffers unwritten data in your computer's physical memory, so it can hold only as much as the free physical memory allows. For example, if your computer has 8 GB of RAM, of which, let's say, 6 GB is free, then the stream can't hold more than 6 GB. You can break the data up into chunks and then merge it back at the destination later.
There is no hard size limit on .csv files. The limit in any scenario would be the file system / hdd size.
The maximum file size of any file on a filesystem is determined by the
filesystem itself - not by the file type or filename suffix.
To prevent out-of-memory errors, check the file size limit of your filesystem partition.
I have streaming data coming in with IP address. I want to translate the IP to longitude and latitude before putting the data into my database.
This is what I was doing, but it is causing some issues. I also tried putting locationObject outside the for loop. That, weirdly, uses a lot of memory. I know this is blocking code, but it should be fast. Still, I see a memory issue, as the data object comes from a stream continuously and each data object is huge.
for (var i = 0; i < data.length; i++){
if (data.client_ip !== null) {
var locationLookup = maxmind.openSync('./GeoIP2-City.mmdb');
var ip = data.client_ip;
var maxmindObj = locationLookup.get(ip);
locationObject.country = maxmindObj.country.names.en;
locationObject.latitude = maxmindObj.location.latitude;
locationObject.longitude = maxmindObj.location.longitude;
}
}
Again, trying to put maxmind.openSync('./GeoIP2-City.mmdb'); outside the for loop is causing a huge increase in memory.
The other option is to use non-blocking code:
maxmind.open('/path/to/GeoLite2-City.mmdb', (err, cityLookup) => {
var city = cityLookup.get('66.6.44.4');
});
But I don't think it is a good idea to put this inside a loop.
How can I handle this? I am getting a data object every minute.
https://github.com/runk/node-maxmind
I'm not sure why you think reading the database file for each iteration of the loop would be fast ("blocking code" doesn't equal "fast code"), it's much better to read the database file once and then loop over data.
maxmind.openSync() will read the entire database into memory, which is mentioned in the README:
Be careful with sync version! Since mmdb files are quite large
(city database is about 100Mb) fs.readFileSync blocks whole
process while it reads file into buffer.
If you don't have memory to spare, the only other option would be to open the file asynchronously. Again, not inside the loop, but outside of it:
maxmind.open("./GeoIP2-City.mmdb", (err, locationLookup) => {
for (var i = 0; i < data.length; i++) {
if (data.client_ip !== null) {
var ip = data.client_ip;
var maxmindObj = locationLookup.get(ip);
locationObject.country = maxmindObj.country.names.en;
locationObject.latitude = maxmindObj.location.latitude;
locationObject.longitude = maxmindObj.location.longitude;
}
}
});
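If the processing function is called repeatedly, the "open once, then loop" advice can be taken one step further by caching the opened database across calls. The following is only a sketch under assumptions: `openOnce` and `locate` are hypothetical helpers, `openFn` stands in for any function that returns a promise of the lookup object (for example a promise wrapper around the callback form of maxmind.open shown above), and each record is assumed to carry its own `client_ip` field.

```javascript
// Wraps an expensive async open so that it runs at most once; every caller
// shares the same promise. A failed open is forgotten so it can be retried.
function openOnce(openFn) {
  let pending = null;
  return function getLookup() {
    if (pending === null) {
      pending = openFn().catch((err) => {
        pending = null;
        throw err;
      });
    }
    return pending;
  };
}

// Looks up every record that has an IP, reusing the shared lookup object.
async function locate(records, getLookup) {
  const lookup = await getLookup();
  const results = [];
  for (const record of records) {
    if (record.client_ip == null) continue;
    const hit = lookup.get(record.client_ip);
    if (!hit) continue; // assumption: a miss returns a falsy value
    results.push({
      country: hit.country.names.en,
      latitude: hit.location.latitude,
      longitude: hit.location.longitude,
    });
  }
  return results;
}
```

The trade-off is that the database stays in memory for the life of the process, which is usually the point: the cost of reading the file is paid once instead of on every batch.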
The only thing I am worried about is that, over time, I call this function many times: every time my consumers read a jsonObject from Kafka (which happens every minute). Is there a better way to optimize that as well? How can I optimize this further?
function processData(jsonObject) {
maxmind.open('./GeoIP2-City.mmdb', function(err, locationLookup) {
if (err) {
logger.error('something went wrong on maxmind fetch', err);
}
for (var i = 0; i < jsonObject.length; i++) { ...}
})
}
While attempting to experiment with Node.JS streams I ran into an interesting conundrum. When the input (Readable) stream pushes more data than the destination (Writable) cares about, I was unable to apply back-pressure correctly.
The two methods I attempted were to return false from Writable.prototype._write and to retain a reference to the Readable so I can call Readable.pause() from the Writable. Neither solution helped much, as I'll explain.
In my exercise (which you can view the full source as a Gist) I have three streams:
Readable - PasscodeGenerator
util.inherits(PasscodeGenerator, stream.Readable);
function PasscodeGenerator(prefix) {
stream.Readable.call(this, {objectMode: true});
this.count = 0;
this.prefix = prefix || '';
}
PasscodeGenerator.prototype._read = function() {
var passcode = '' + this.prefix + this.count;
if (!this.push({passcode: passcode})) {
this.pause();
this.once('drain', this.resume.bind(this));
}
this.count++;
};
I thought that the return code from this.push() was enough to self pause and wait for the drain event to resume.
Transform - Hasher
util.inherits(Hasher, stream.Transform);
function Hasher(hashType) {
stream.Transform.call(this, {objectMode: true});
this.hashType = hashType;
}
Hasher.prototype._transform = function(sample, encoding, next) {
var hash = crypto.createHash(this.hashType);
hash.setEncoding('hex');
hash.write(sample.passcode);
hash.end();
sample.hash = hash.read();
this.push(sample);
next();
};
Simply add the hash of the passcode to the object.
Writable - SampleConsumer
util.inherits(SampleConsumer, stream.Writable);
function SampleConsumer(max) {
stream.Writable.call(this, {objectMode: true});
this.max = (max != null) ? max : 10;
this.count = 0;
}
SampleConsumer.prototype._write = function(sample, encoding, next) {
this.count++;
console.log('Hash %d (%s): %s', this.count, sample.passcode, sample.hash);
if (this.count < this.max) {
next();
} else {
return false;
}
};
Here I want to consume the data as fast as possible until I reach my max number of samples and then end the stream. I tried using this.end() instead of return false but that caused the dreaded write called after end problem. Returning false does stop everything if the sample size is small but when it is large I get an out of memory error:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Aborted (core dumped)
According to this SO answer, in theory the Write stream would return false, causing the streams to buffer until the buffers were full (16 by default for objectMode), and eventually the Readable would call its this.pause() method. But 16 + 16 + 16 = 48; that's 48 objects in buffer until things fill up and the system is clogged. Actually fewer, because there is no cloning involved, so the objects passed between them are the same references. Would that not mean only 16 objects in memory until the high water mark halts everything?
Lastly, I realize I could have the Writable reference the Readable to call its pause method using closures. However, this solution means the Writable stream knows too much about another object. I'd have to pass in a reference:
var foo = new PasscodeGenerator('foobar');
foo
.pipe(new Hasher('md5'))
.pipe(new SampleConsumer(samples, foo));
And this feels out of norm for how streams would work. I thought back-pressure was enough to cause a Writable to stop a Readable from pushing data and prevent out of memory errors.
An analogous example would be the Unix head command. Implementing that in Node, I would assume that the destination could end the stream once it has enough data to satisfy the beginning portion of the file, rather than merely ignoring further data while the source keeps pushing.
How do I idiomatically construct custom streams such that when the destination is ready to end the source stream doesn't attempt to push more data?
This is a known issue with how _read() is called internally. Since your _read() is always pushing synchronously/immediately, the internal stream implementation can get into a loop in the right conditions. _read() implementations are generally expected to do some sort of async I/O (e.g. reading from disk or network).
The workaround for this (as noted in the link above) is to make your _read() asynchronous at least some of the time. You could also just make it async every time it's called with:
PasscodeGenerator.prototype._read = function(n) {
var passcode = '' + this.prefix + this.count;
var self = this;
// `setImmediate()` delays the push until the beginning
// of the next tick of the event loop
setImmediate(function() {
self.push({passcode: passcode});
});
this.count++;
};
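For the head-like case in the question, a different technique from the setImmediate() workaround is to consume the stream with async iteration and simply leave the loop when enough samples have arrived. The sketch below assumes a reasonably recent Node version in which `stream.Readable.from()` and `for await` over a Readable are available; `passcodes` and `takeHashes` are illustrative names, not part of the original Gist.

```javascript
const { Readable } = require('stream');
const crypto = require('crypto');

// Endless source of passcodes, written as a generator instead of a
// hand-rolled Readable subclass.
function* passcodes(prefix) {
  let count = 0;
  while (true) {
    yield { passcode: '' + prefix + count };
    count++;
  }
}

// Hashes the first `max` passcodes and then stops. Breaking out of
// `for await` destroys the source stream, so nothing keeps being pushed.
async function takeHashes(prefix, max, hashType) {
  const source = Readable.from(passcodes(prefix));
  const samples = [];
  for await (const sample of source) {
    sample.hash = crypto.createHash(hashType).update(sample.passcode).digest('hex');
    samples.push(sample);
    if (samples.length >= max) break;
  }
  return samples;
}
```

Because the consumer pulls one item at a time, only a small, bounded number of objects is ever buffered, regardless of how large `max` is.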
var pass = require('./pass.js');
var fs = require('fs');
var path = "password.txt";
var name ="admin";
var
remaining = "",
lineFeed = "\r\n",
lineNr = 0;
var log =
fs.createReadStream(path, { encoding: 'utf-8' })
.on('data', function (chunk) {
// store the actual chunk into the remaining
remaining = remaining.concat(chunk);
// look that we have a linefeed
var lastLineFeed = remaining.lastIndexOf(lineFeed);
// if we don't have any we can continue the reading
if (lastLineFeed === -1) return;
var
current = remaining.substring(0, lastLineFeed),
lines = current.split(lineFeed);
// keep whatever follows the last linefeed (a possibly incomplete line)
remaining = remaining.substring(lastLineFeed + lineFeed.length);
for (var i = 0, length = lines.length; i < length; i++) {
// process the actual line
var account={
username:name,
password:lines[i],
};
pass.test(account);
}
})
.on('end', function (close) {
// TODO I'm not sure this is needed, it depends on your data
// process the reamining data if needed
if (remaining.length > 0) {
var account={
username:name,
password:remaining,
};
pass.test(account);
};
});
I am trying to test passwords for the account "admin"; pass.test is a function that tests one password. I downloaded a weak-password dictionary with a very large number of lines, so I looked for a way to read that many lines. With the code above, though, the lines array becomes too large and the process runs out of memory. What should I do?
Insofar as my limited understanding goes, you need to watch a 1GB limit, which I believe is imposed by the V8 engine, actually. (Here's a link, actually saying the limit is 1.4 GB, currently, and lists the different params used to change this manually.) Depending on where you host your node app(s), you can increase this limit, by a param set on the command line when node is started. Again, see the linked article for a few ways to do this.
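Since the exact limit varies with the Node version and the machine, it can be checked rather than assumed. A minimal sketch using the built-in `v8` module follows; `heapLimitMB` is a hypothetical helper name and `app.js` in the comment is a placeholder.

```javascript
const v8 = require('v8');

// Reports the current V8 heap size limit in megabytes.
function heapLimitMB() {
  const bytes = v8.getHeapStatistics().heap_size_limit;
  return Math.round(bytes / (1024 * 1024));
}

console.log(`V8 heap limit: ${heapLimitMB()} MB`);

// To raise the old-generation limit, start Node with the V8 flag
// (value in megabytes), for example:
//   node --max-old-space-size=4096 app.js
```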
Also, you might want to make sure that, whenever possible, you use buffers, instead of converting things like data streams (from a DB or other things, for instance) to arrays/whatever, as this will then load the entire dataset into memory. As long as it lives in a Buffer, it is allocated outside the V8 heap, so it doesn't count against the V8 heap limit (it still uses process memory, though).
And actually, one thing that doesn't make sense, and that seems to be very inefficient in your app, is that, on reading each chunk of data in, you then check your username against EVERY username you've amassed so far, in your lines array, instead of the LAST one. What your app should do is keep track of the last username and password combo you've read in, and then delete all data before this user, in your remaining variable, so you keep your memory down. And since it's not a hold all repository for every line of your password file anymore, you should probably retitle it something like buffer or something. This means that you'd remove your for loop, since you're already "looping" through the data in your password file, by reading it in, chunk by chunk.