Node.js - Out of memory when making calls to search index

I am trying to read from a CSV file and insert the data into an Elasticsearch index. As below, I use a read stream and listen for the "data" event. My problem is that I quickly run out of memory with this approach. I'm guessing it's because the elasticsearch module (elastical) makes a REST request for every row, and the number of in-flight requests builds up.
I'm pretty new to Node, so is there a way for me to fix this so it doesn't run out of memory? Any general patterns or techniques?
stream.on('data', function (doc) {
  // create a json from doc
  client.index('entities', 'command', json, function (err, res) {
    console.log(res);
  });
});

Pause the stream when you get data and resume it when the request completes.
stream.on('data', function (doc) {
  stream.pause();
  // create a json from doc
  client.index('entities', 'command', json, function (err, res) {
    stream.resume();
    console.log(res);
  });
});
Weird thing about your code is you're not using doc anywhere in that function. I'm guessing you're not posting your entire code.
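For completeness, a fleshed-out sketch of that pattern with basic error handling and an end event, assuming stream and client are the CSV record stream and elastical client from the question, and buildJson is a hypothetical helper that turns a row into the document to index:
stream.on('data', function (doc) {
  stream.pause(); // stop reading until this index request has finished
  var json = buildJson(doc); // hypothetical: create a json from doc
  client.index('entities', 'command', json, function (err, res) {
    if (err) {
      console.error(err);
    } else {
      console.log(res);
    }
    stream.resume(); // only now let the next row through
  });
});

stream.on('end', function () {
  console.log('All rows indexed');
});
This keeps at most one index request in flight, so memory stays flat regardless of the file size.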

Related

Node CSV pull parser

I need to parse a CSV document from Node.JS, performing database operations for each record (= each line). However, I'm having trouble finding a suitable CSV parser using a pull approach, or at least a push approach that waits for my record operations before parsing the next row.
I've looked at csv-parse, csvtojson, and csv-streamify, but they all seem to push events in a continuous stream without any flow control. If I parse a 1000-line CSV document, I basically get all 1000 callbacks in quick sequence. For each record, I perform an operation that returns a promise. Currently I've had to resort to pushing all my promises into an array, and after getting the done/end event I also wait for Promise.all(myOperations) to know when the document has been fully processed. But this is not very nice; I'd also prefer to parse one line at a time and fully process it before getting the next record, instead of processing all records concurrently: that is hard to debug and uses a lot of memory compared to simply dealing with each record sequentially.
So, is there a CSV parser that supports pull mode, or a way to get any stream-based CSV parser (preferably csvtojson as that's the one I'm using at the moment) to only produce events for new records when my handler for the previous record is finished (using promises)?
I solved this myself by creating my own Writable stream and piping the CSV parser into it. My _write() method does its work and ties a promise to the Node callback passed to _write() (here implemented using Q.nodeify):
const stream = require('stream');
const Q = require('q');
const csvtojson = require('csvtojson');

class CsvConsumer extends stream.Writable {
  _write(data, encoding, cb) {
    console.log('Got data: ', data);
    // simulate asynchronous per-record work, then signal completion via cb
    Q.delay(1000).then(() => {
      console.log('Waited 1 s');
    }).nodeify(cb);
  }
}

csvtojson()
  .fromStream(is) // is: the CSV input readable stream
  .pipe(new CsvConsumer())
  .on('error', err => {
    console.log('Error!', err);
  })
  .on('finish', () => {
    console.log('Done!');
  });
This will process lines one by one:
Got data: {"a": "1"}
Waited 1 s
Got data: {"a": "2"}
Waited 1 s
Got data: {"a": "3"}
Waited 1 s
Done!
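If you would rather avoid Q, the same Writable can signal completion with a native promise. A sketch, where processRecord is a hypothetical function that returns a promise for whatever per-record work you need:
const stream = require('stream');

class CsvConsumer extends stream.Writable {
  _write(data, encoding, cb) {
    console.log('Got data: ', data.toString());
    processRecord(data)            // hypothetical async per-record operation
      .then(() => cb())            // success: the parser may send the next record
      .catch((err) => cb(err));    // failure: surfaces as an 'error' event on the stream
  }
}
Because the parser is piped into the Writable, backpressure does the flow control for you: the next record is only written once cb has been called.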
If you want to process each line asynchronously you can do that with Node's native readline module.
const fs = require('fs');
const readline = require('readline');

const lineStream = readline.createInterface({
  input: fs.createReadStream('data/test.csv'),
});

lineStream.on('line', (eachLine) => {
  // process each line
});
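On recent Node versions the readline interface is also async iterable, so a for await...of loop gives you sequential processing without manual pause/resume. A sketch, where processLine is a hypothetical async function handling one record:
const fs = require('fs');
const readline = require('readline');

async function run() {
  const rl = readline.createInterface({
    input: fs.createReadStream('data/test.csv'),
    crlfDelay: Infinity,
  });

  for await (const line of rl) {
    await processLine(line); // the next line is not read until this resolves
  }
  console.log('Done!');
}

run().catch(console.error);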
If you want to process the lines one at a time with explicit flow control, you can use line-by-line. It doesn't buffer the entire file into memory, and it lets you pause and resume the emitting of 'line' events.
const LineByLineReader = require('line-by-line');
const lr = new LineByLineReader('data/test.csv');

lr.on('line', function (line) {
  // pause emitting of lines...
  lr.pause();
  // ...do your asynchronous line processing...
  setTimeout(function () {
    // ...and continue emitting lines (1 s delay here)
    lr.resume();
  }, 1000);
});

Redis Node - Get from hash - Not inserting into array

My goal is to collect the values retrieved from a Redis hash into an array. I am using the redis package for Node.js.
My code is the following:
getFromHash(ids) {
  const resultArray = [];
  ids.forEach((id) => {
    common.redisMaster.hget('mykey', id, (err, res) => {
      resultArray.push(res);
    });
  });
  console.log(resultArray);
},
The array logged at the end of the function is empty, even though res is not empty. What could I do to fill this array?
You need to use some control flow, either the async library or Promises (as described in the redis docs).
Put your console.log inside the callback that runs when the results return from the redis call; then you will see the results printed out. Use one of the control flow patterns for your .forEach as well, as that is currently synchronous.
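With the async library, for example, a minimal sketch (assuming the same common.redisMaster client from the question) could look like this; async.map collects the results in order and calls the final callback once every hget has returned:
const async = require('async');

function getFromHash(ids, done) {
  async.map(ids, (id, cb) => {
    common.redisMaster.hget('mykey', id, cb); // cb receives (err, value)
  }, (err, resultArray) => {
    if (err) return done(err);
    console.log('getFromHash complete: ', resultArray);
    done(null, resultArray);
  });
}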
If you modify your code to something like this, it will work nicely:
var getFromHash = function getFromHash(ids) {
  const resultArray = [];
  ids.forEach((id) => {
    common.redisMaster.hget('mykey', id, (err, res) => {
      resultArray.push(res);
      if (resultArray.length === ids.length) {
        // All done.
        console.log('getFromHash complete: ', resultArray);
      }
    });
  });
};
In your original code you're printing the result array before any of the hget calls have returned.
Another approach will be to create an array of promises and then do a Promise.all on it.
You'll see this kind of behavior a lot with Node, remember it uses asynchronous calls for almost all i/o. When you're coming from a language where most function calls are synchronous you get tripped up by this kind of problem a lot!
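A sketch of that Promise-based variant, wrapping each callback-style hget manually and resolving once every lookup has finished:
function getFromHash(ids) {
  const lookups = ids.map((id) => new Promise((resolve, reject) => {
    common.redisMaster.hget('mykey', id, (err, res) => {
      if (err) return reject(err);
      resolve(res);
    });
  }));

  return Promise.all(lookups).then((resultArray) => {
    console.log(resultArray); // all values, in the same order as ids
    return resultArray;
  });
}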

Node.js Iterate CSV File for Requests, but Wait for Response Before Continuing the Iterations

I have some code (somewhat simplified for this discussion) that is something like this
var inputFile = 'inputfile.csv';

var parser = parse({delimiter: ','}, function (err, data) {
  async.eachSeries(data, function (line, callback) {
    SERVER.Request(line[0], line[1]);
    SERVER.on("RequestResponse", function (response) {
      console.log(response);
    });
    callback();
  });
});

SERVER.start();
SERVER.on("ready", function () {
  fs.createReadStream(inputFile).pipe(parser);
});
What I am trying to do is run a CSV file through a command-line Node program that iterates over each line and makes a request to a server, which responds with a RequestResponse event that I then log. The RequestResponse takes a second or so, but the way I have the code set up now it just flies through the CSV file: I get output for each iteration, but it is mostly the output I would expect for the first iteration, with a little of the output of the second. I need to know how to make each iteration wait for a RequestResponse event before continuing on to the next one. Is this possible?
I have based this code in large part on
NodeJs reading csv file
but I am a little lost, to be honest, with Node.js and with async.forEach. Any help would be greatly appreciated.
I suggest that you bite the bullet and take your time learning promises and async/await. Then you can just use a regular for loop and await a web response promise.
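For instance, a sketch of that approach, where requestAsync is a hypothetical wrapper that sends the request and resolves when the matching RequestResponse event arrives (SERVER and parse are the objects from the question):
function requestAsync(a, b) {
  return new Promise((resolve) => {
    SERVER.once('RequestResponse', resolve); // resolve with the next response event
    SERVER.Request(a, b);
  });
}

var parser = parse({delimiter: ','}, async function (err, data) {
  if (err) return console.error(err);
  for (const line of data) {
    const response = await requestAsync(line[0], line[1]); // waits before the next iteration
    console.log(response);
  }
  console.log('All lines processed');
});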
The solution is straightforward: you need to call the callback after the server returns, that's it.
async.eachSeries(data, function (line, callback) {
  SERVER.Request(line[0], line[1]);
  SERVER.on("RequestResponse", function (response) {
    console.log(response);
    SERVER.removeAllListeners("RequestResponse");
    callback();
  });
});
What is happening is that eachSeries expects the callback to be invoked AFTER you are done with the particular call.

How to access data only once ALL of it is ready

I have seen some code such as this:
.on('error', console.error)
.on('data', function (data) {})
.on('info', function(info) {})
.on('end', function() {
// All data retrieved.
});
I have read some docs about streams, but am having trouble understanding them. Say I only want to do the operations once all the data is received (not partial). How can I do this? I would think I would have to read the data object inside of the 'end' function, but the data object is not accessible from there.
From my understanding, if I put some logic inside of the 'data' function I could be operating on incomplete data? Is this true? Say the data is a list of friends (some lists have 1 friend, some can have 10,000, so the size of the data returned will differ). How can I perform the operation only once ALL the friends are returned, no matter the size of the data coming back?
The data handler will usually be called multiple times, each time with a fraction of the complete data.
If you want to perform an action once with all data, the usual way is as follows:
Buffer all the items received in the data handler in some variable (e.g. push them into an array) and perform your final action in the end handler (although the natural idea of a stream is to act on data right away).
var allData = [];
stream
  .on('error', console.error)
  .on('data', function (data) {
    allData.push(data);
  })
  .on('info', function (info) {})
  .on('end', function () {
    // TODO do something more intelligent,
    // where buffering in memory makes sense
    console.log(allData.join());
  });
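If the stream emits raw Buffer chunks (a file or an HTTP response body, say) rather than ready-made objects, the usual buffering idiom is Buffer.concat in the end handler. A sketch along the same lines, assuming the same stream variable:
const chunks = [];
stream
  .on('error', console.error)
  .on('data', (chunk) => {
    chunks.push(chunk); // each chunk is a Buffer holding part of the payload
  })
  .on('end', () => {
    const body = Buffer.concat(chunks); // the complete payload, assembled once
    console.log(body.toString());
  });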

nodejs express fs iterating files into array or object failing

So I'm trying to use Node's fs module in my Express app to iterate over a directory, store each filename in an array that I can pass to my Express view and iterate through, but I'm struggling to do so. When I do a console.log within the files.forEach loop, it prints the filename just fine, but as soon as I try to do anything such as:
var myfiles = [];
var fs = require('fs');

fs.readdir('./myfiles/', function (err, files) {
  if (err) throw err;
  files.forEach(function (file) {
    myfiles.push(file);
  });
});

console.log(myfiles);
it fails and just logs an empty array. I'm not sure exactly what is going on; I think it has to do with callback functions, but if someone could walk me through what I'm doing wrong, why it's not working (and how to make it work), it would be much appreciated.
The myfiles array is empty because the callback hasn't been called before you call console.log().
You'll need to do something like:
var fs = require('fs');
fs.readdir('./myfiles/',function(err,files){
if(err) throw err;
files.forEach(function(file){
// do something with each file HERE!
});
});
// because trying to do something with files here won't work because
// the callback hasn't fired yet.
Remember, nearly everything in Node is asynchronous, in the sense that unless you're doing your processing inside your callbacks, you cannot guarantee those asynchronous functions have completed yet.
One way around this problem for you would be to use an EventEmitter:
var fs = require('fs'),
    EventEmitter = require('events').EventEmitter,
    filesEE = new EventEmitter(),
    myfiles = [];

// this event will be called when all files have been added to myfiles
filesEE.on('files_ready', function () {
  console.dir(myfiles);
});

// read all files from current directory
fs.readdir('.', function (err, files) {
  if (err) throw err;
  files.forEach(function (file) {
    myfiles.push(file);
  });
  filesEE.emit('files_ready'); // trigger files_ready event
});
As several have mentioned, you are using an async method, so you have a nondeterministic execution path.
However, there is an easy way around this. Simply use the Sync version of the method:
var myfiles = [];
var fs = require('fs');
var arrayOfFiles = fs.readdirSync('./myfiles/');

// Yes, the following is not super-smart, but you might want to process the files. This is how:
arrayOfFiles.forEach(function (file) {
  myfiles.push(file);
});

console.log(myfiles);
That should work as you want. However, synchronous calls block the event loop, so you should not use them unless it is vitally important for the operation to be synchronous.
Read more here: fs.readdirSync
fs.readdir is asynchronous (as with many operations in node.js). This means that the console.log line is going to run before readdir has a chance to call the function passed to it.
You need to either:
Put the console.log line within the callback function given to readdir, i.e:
fs.readdir('./myfiles/', function (err, files) {
  if (err) throw err;
  files.forEach(function (file) {
    myfiles.push(file);
  });
  console.log(myfiles);
});
Or simply perform some action with each file inside the forEach.
I think it has to do with callback functions,
Exactly.
fs.readdir makes an asynchronous request to the file system for that information, and calls the callback at some later time with the results.
So function (err, files) { ... } doesn't run immediately, but console.log(myfiles) does.
At some later point in time, myfiles will contain the desired information.
You should note BTW that files is already an Array, so there is really no point in manually appending each element to some other blank array. If the idea is to put together the results from several calls, then use .concat; if you just want to get the data once, then you can just assign myfiles = files directly.
Overall, you really ought to read up on "Continuation-passing style".
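For example, if the goal really were to merge the listings of several directories (the second directory name here is made up), .concat inside the callbacks keeps it simple:
var fs = require('fs');
var myfiles = [];

fs.readdir('./myfiles/', function (err, files) {
  if (err) throw err;
  myfiles = myfiles.concat(files); // files is already an array
  fs.readdir('./otherfiles/', function (err, moreFiles) { // hypothetical second directory
    if (err) throw err;
    myfiles = myfiles.concat(moreFiles);
    console.log(myfiles); // safe here: both callbacks have fired
  });
});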
I faced the same problem, and based on the answers given in this post I've solved it with Promises, which seem to be a perfect fit for this situation:
router.get('/', (req, res) => {
  var viewBag = {}; // It's just my little habit from .NET MVC ;)

  var readFiles = new Promise((resolve, reject) => {
    fs.readdir('./myfiles/', (err, files) => {
      if (err) {
        reject(err);
      } else {
        resolve(files);
      }
    });
  });

  // showcase, just in case you need to implement more async operations before the route responds
  var anotherPromise = new Promise((resolve, reject) => {
    doAsyncStuff((err, anotherResult) => {
      if (err) {
        reject(err);
      } else {
        resolve(anotherResult);
      }
    });
  });

  Promise.all([readFiles, anotherPromise]).then((values) => {
    viewBag.files = values[0];
    viewBag.otherStuff = values[1];
    console.log(viewBag.files); // logs e.g. [ 'file.txt' ]
    res.render('your_view', viewBag);
  }).catch((errors) => {
    res.render('your_view', { errors: errors }); // use the 'errors' property to render errors in the view, or implement a different error handling scheme
  });
});
Note: you don't have to push the found files into a new array, because you already get an array from fs.readdir()'s callback. According to the Node docs:
The callback gets two arguments (err, files) where files is an array
of the names of the files in the directory excluding '.' and '..'.
I believe this is a very elegant and handy solution, and best of all, it doesn't require you to bring in and handle extra modules in your script.
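On current Node versions you can also skip the manual Promise wrapping entirely, because fs ships a promise-based API. A sketch assuming the same './myfiles/' directory and Express-style route as above:
const fs = require('fs').promises;

router.get('/', async (req, res) => {
  try {
    const files = await fs.readdir('./myfiles/');
    res.render('your_view', { files: files });
  } catch (errors) {
    res.render('your_view', { errors: errors });
  }
});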
