Bad performance on combination of streams - node.js

I want to stream the results of a PostgreSQL query to a client via a websocket.
Data is fetched from the database using pg-promise and pg-query-stream. To stream data via a websocket I use socket.io-stream.
Individually, all components perform quite well. But when I pipe the pg-query-stream into the socket.io-stream, performance drops drastically.
I've started with:
var QueryStream = require('pg-query-stream');
var ss = require('socket.io-stream');
// Query with a lot of results
var qs = new QueryStream('SELECT...');
db.stream(qs, s => {
    var socketStream = ss.createStream({objectMode: true});
    ss(socket).emit('data', socketStream);
    s.pipe(socketStream);
})
.then(data => {
    console.log('Total rows processed:', data.processed,
        'Duration in milliseconds:', data.duration);
});
I have tried to use non-object streams:
var socketStream = ss.createStream();
ss(socket).emit('data', socketStream);
s.pipe(JSONStream.stringify()).pipe(socketStream);
Or:
var socketStream = ss.createStream();
ss(socket).emit('data', socketStream);
s.pipe(JSONStream.stringify(false)).pipe(socketStream);
It takes roughly one minute to query and transfer the data for all solutions.
The query results can be written to a file within one second:
s.pipe(fs.createWriteStream('temp.txt'));
And that file can be transmitted within one second:
var socketStream = ss.createStream();
fs.createReadStream('temp.txt').pipe(socketStream);
So somehow, these streams don't seem to combine well.
As a silly experiment, I've tried placing something in between:
var socketStream = ss.createStream();
ss(socket).emit('data', socketStream);
var zip = zlib.createGzip();
var unzip = zlib.createGunzip();
s.pipe(JSONStream.stringify(false)).pipe(zip).pipe(unzip).pipe(socketStream);
And suddenly the data can be queried and transferred within one second...
Unfortunately this is not going to work as my final solution; it would waste far too much CPU. What is causing performance to degrade with this combination of streams? How can it be fixed?
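Update: if the gzip/gunzip pair helps only because it coalesces thousands of tiny per-row writes into a few large chunks before they reach socket.io-stream, then a plain batching Transform should have the same effect without the compression overhead. This is just a sketch of that idea (untested; the batch size of 1000 is arbitrary):
var {Transform} = require('stream');

// Collects stringified rows and emits them as larger chunks, so the socket
// stream receives a few big writes instead of thousands of tiny ones.
function createBatcher(batchSize) {
    var rows = [];
    return new Transform({
        writableObjectMode: true,   // accepts row objects from pg-query-stream
        readableObjectMode: false,  // emits plain strings
        transform(row, encoding, callback) {
            rows.push(JSON.stringify(row));
            if (rows.length >= batchSize) {
                this.push(rows.join('\n') + '\n');
                rows = [];
            }
            callback();
        },
        flush(callback) {
            if (rows.length) this.push(rows.join('\n') + '\n');
            callback();
        }
    });
}

db.stream(qs, s => {
    var socketStream = ss.createStream();   // non-object stream
    ss(socket).emit('data', socketStream);
    s.pipe(createBatcher(1000)).pipe(socketStream);
});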

Related

Stream large JSON from REST API using NodeJS/ExpressJS

I have to return a large JSON, resulting from a query to MongoDB, from a REST API server built with ExpressJS. This JSON has to be converted to .csv so the client can directly save the resulting CSV file. I know that the best solution is to use NodeJS streams and pipe. Could anyone suggest a working example? Thanks.
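A minimal sketch of one way to do this, assuming the native MongoDB driver and the csv-stringify package (the database, collection, and route names here are placeholders):
const express = require('express');
const { MongoClient } = require('mongodb');
const { stringify } = require('csv-stringify');

const app = express();

app.get('/export', async (req, res) => {
    const client = await MongoClient.connect('mongodb://localhost:27017');
    const cursor = client.db('mydb').collection('items').find({});

    res.setHeader('Content-Type', 'text/csv');
    res.setHeader('Content-Disposition', 'attachment; filename="export.csv"');

    // Documents -> CSV rows -> HTTP response, so the full result set is
    // never held in memory at once.
    cursor.stream()
        .pipe(stringify({ header: true }))
        .pipe(res);
});

app.listen(3000);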
Typically, parsing JSON in Node is fairly simple. In the past I would do something like the following.
const fs = require('fs');
const rawdata = fs.readFileSync('file.json');
const data = JSON.parse(rawdata);
Or even simpler with a require statement like this
const data = require('./file.json');
Both of these work great with small or even moderately sized files, but if you need to parse a really large JSON file, one with millions of lines, reading the entire file into memory is no longer a great option.
Because of this I needed a way to “stream” the JSON and process it as it went. There is a nice module named ‘stream-json’ that does exactly what I wanted.
With stream-json, we can use a NodeJS file stream to process our large data file in chunks.
const StreamArray = require('stream-json/streamers/StreamArray');
const fs = require('fs');

const jsonStream = StreamArray.withParser();

// Internal Node readable stream, piped to stream-json so it can do the parsing for us
fs.createReadStream('file.json').pipe(jsonStream.input);

// You'll get JSON objects here; key is the array index
jsonStream.on('data', ({key, value}) => {
    console.log(key, value);
});

// The 'end' event carries no arguments
jsonStream.on('end', () => {
    console.log('All done');
});
Now our data can be processed without running out of memory. However, in the use case I was working on, there was an asynchronous process inside the stream. Because of this, I was still consuming huge amounts of memory, as this simply queued up a very large number of unresolved promises that stayed in memory until they completed.
To solve this I had to also use a custom Writable stream, like this.
const StreamArray = require('stream-json/streamers/StreamArray');
const {Writable} = require('stream');
const fs = require('fs');

const fileStream = fs.createReadStream('file.json');
const jsonStream = StreamArray.withParser();

const processingStream = new Writable({
    write({key, value}, encoding, callback) {
        // Some async operation
        setTimeout(() => {
            console.log(key, value);
            // Runs one at a time; you need to use a callback for that part to work
            callback();
        }, 1000);
    },
    // Don't skip this, as we need to operate with objects, not buffers
    objectMode: true
});

// Pipe the streams as follows
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);

// So we're waiting for the 'finish' event when everything is done
processingStream.on('finish', () => console.log('All done'));
The Writable stream also allows each asynchronous process to complete and its promise to resolve before continuing on to the next, thus avoiding the memory backup.
This Stack Overflow question is where I got the examples for this post:
Parse large JSON file in Nodejs and handle each object independently
Another thing I learned in this process: if you want to start Node with more than the default amount of memory, you can use the following command.
node --max-old-space-size=4096 file.js
By default, Node.js caps the size of its heap (the exact limit depends on the Node version and architecture), and large jobs can run into that cap. The --max-old-space-size flag raises the limit, given in megabytes; the command above gives Node 4 GB of heap to use.

Load 1 million records from a file and save to PSQL database

I have a file of 1 million records, and I have to pass each record to Elasticsearch and save the resulting data into the database.
The issue is that it takes a very long time, because the records are streamed one by one to Elasticsearch and the data is then saved into the PSQL database.
I would like some suggestions on how I can improve on this, or whether I should use other tools.
Right now I am using Nodejs with some packages:
I upload the file in the Node.js application and convert it to JSON using
const csv=require('csvtojson')
I use
const StreamArray = require('stream-json/streamers/StreamArray');
const {Writable} = require('stream');
to read and parse the JSON through these packages as a stream, since the file is too big.
I use this code
const fileStream = fs.createReadStream(this.fileName);
const jsonStream = StreamArray.withParser();
const incomingThis = this;

const processingStream = new Writable({
    write({key, value}, encoding, callback) {
        incomingThis.recordParser(value, (val, data) => { // pass the data to Elasticsearch to get search data
            incomingThis.processQueue(data); // save the data to the PSQL database
            callback();
        });
    },
    // Don't skip this, as we need to operate with objects, not buffers
    objectMode: true
});

// Pipe the streams as follows
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);

// So we're waiting for the 'finish' event when everything is done
processingStream.on('finish', async () => {
    console.log('stream end');
    const statistics = new Statistics(jobId);
    await statistics.update(); // update the job table for completion of data
});
Please suggest how I can improve on this so that a 1 million record file can be parsed in a couple of hours rather than days, or at least in less time.
I am open to using other tools too, like Redis or Spark, if they will help me out.
Thanks.
Instead of processing records one by one from the stream, use a batch approach: build up batches of records, send each batch to Elasticsearch in a single request, and save the results in batches as well. A sketch of that idea follows.
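Reusing the Writable pattern from the question, the batching might look roughly like this; recordParserBulk and processQueueBulk are hypothetical batched versions of the question's recordParser/processQueue (one Elasticsearch _bulk request and one multi-row PSQL insert per batch):
const {Writable} = require('stream');

const BATCH_SIZE = 500; // arbitrary; tune to your payload size

let batch = [];

// Hypothetical batched helpers: recordParserBulk would issue one _bulk call
// to Elasticsearch for the whole array, processQueueBulk one multi-row
// INSERT (or COPY) into PSQL.
async function flushBatch(records) {
    if (!records.length) return;
    const enriched = await recordParserBulk(records);
    await processQueueBulk(enriched);
}

const processingStream = new Writable({
    objectMode: true,
    write({key, value}, encoding, callback) {
        batch.push(value);
        if (batch.length < BATCH_SIZE) return callback();
        const records = batch;
        batch = [];
        flushBatch(records).then(() => callback(), callback);
    },
    final(callback) {
        // Flush whatever is left when the input ends
        flushBatch(batch).then(() => callback(), callback);
    }
});

// Piping stays the same as in the question:
// fileStream.pipe(jsonStream.input);
// jsonStream.pipe(processingStream);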

fs.createReadStream - limit the amount of data streamed at a time

If I only want to read 10 bytes at a time, or one line of data at a time (looking for newline characters), is it possible to pass fs.createReadStream() options like so
var options = {}
var stream = fs.createReadStream('file.txt', options);
so that I can limit the amount of data streamed at a time?
Looking at the fs docs, I don't see any options that would allow me to do that, even though I am guessing it's possible.
https://nodejs.org/api/fs.html#fs_fs_createreadstream_path_options
You can use .read():
var stream = fs.createReadStream('file.txt', options);
var byteSize = 10;

stream.on("readable", function() {
    var chunk;
    while ((chunk = stream.read(byteSize))) {
        console.log(chunk.length);
    }
});
The main benefit of knowing this one over just the highWaterMark option is that you can call it on streams you haven't created.
Here are the docs
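For comparison, this is roughly what the highWaterMark option mentioned above looks like (a sketch; for fs streams the option is given in bytes, and the value of 10 mirrors the question):
var fs = require('fs');

// highWaterMark caps how much data the stream buffers and emits per chunk
var stream = fs.createReadStream('file.txt', { highWaterMark: 10 });

stream.on('data', function(chunk) {
    console.log(chunk.length); // at most 10 bytes per chunk
});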

HTTP POST elastic search event stream bulk

I have a Node.js program that uses streams to read a file (nodejs event stream setting a variable per stream).
I would like to use the same program to write this data into Elasticsearch. I wrote up a small write function:
var writeFunction = function(data) {
    //console.log(data);
    var client = request.newClient("http://localhost:9200");
    client.post('/newtest3/1', data, function(err, res, body) {
        return console.log(res.statusCode);
    });
};
and hooked this up with the streaming
var processMyFile = function(file) {
    var stream = getStream(file);
    var nodeName = stream.nodeName;
    stream
        .pipe(es.split())
        .on('end', endFunction)
        .pipe(es.map(function(data, cb) {
            processFunction(nodeName, data, cb);
        }))
        .pipe(es.map(function(data, cb) {
            writeFunction(data);
        }));
}
The above works as expected asynchronously and writes the data, except that it takes a long time. It also seems to act as a buffer, since the write takes much longer than the read (an advantage of using the pipe).
I know there is a bulk interface in Elasticsearch and I can import using that, like the shakespeare.json example in the Kibana getting started guide (http://www.elasticsearch.org/guide/en/kibana/current/using-kibana-for-the-first-time.html).
That means I would need to create a file in the format needed by the bulk import and then run a curl command, etc. I would like to avoid creating a temporary file.
Is there an easier way to import data into Elasticsearch faster as part of the streaming process?
elasticsearch-streams will help you use the bulk interface with streaming, without the need to write a JSON file first.
I believe that your code would be more or less like this:
var TransformToBulk = require('elasticsearch-streams').TransformToBulk;
var WritableBulk = require('elasticsearch-streams').WritableBulk;
var client = new require('elasticsearch').Client();

var bulkExec = function(bulkCmds, callback) {
    client.bulk({
        index: 'newtest3',
        type: '1',
        body: bulkCmds
    }, callback);
};

var ws = new WritableBulk(bulkExec);
var toBulk = new TransformToBulk(function getIndexTypeId(doc) { return { _id: doc.id }; });

var processMyFile = function(file) {
    var stream = getStream(file);
    stream
        .pipe(toBulk)
        .pipe(ws)
        .on('close', endFunction)
        .on('error', endFunction);
}

In Meteor, how do I get a node read stream from a collection's find cursor?

In Meteor, on the server side, I want to use the .find() function on a Collection and then get a Node ReadStream interface from the cursor that is returned. I've tried using .stream() on the cursor as described in the MongoDB docs (seen here), but I get the error "Object [object Object] has no method 'stream'", so it looks like Meteor collections don't have this option. Is there a way to get a stream from a Meteor Collection's cursor?
I am trying to export some data to CSV and I want to pipe the data directly from the collection's stream into a CSV parser and then into the response going back to the user. I am able to get the response stream from the Router package we are using, and it's all working except for getting a stream from the collection. Fetching the array from the find and pushing it into the stream manually would defeat the purpose of a stream, since it would put everything in memory. I guess my other option is to use a forEach on the collection and push the rows into the stream one by one (sketched after the sample code below), but this seems dirty when I could pipe the stream directly through the parser with a transform on it.
Here's some sample code of what I am trying to do:
response.writeHead(200, {'content-type': 'text/csv'});

// Set up a future
var fut = new Future();

var users = Users.find({}).stream();

CSV().from(users)
    .to(response)
    .on('end', function(count) {
        log.verbose('finished csv export');
        response.end();
        fut.ret();
    });

return fut.wait();
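For reference, the forEach fallback mentioned above might look roughly like this (a sketch only, assuming Meteor's server-side cursor forEach and a Node Readable in object mode; it ignores backpressure, since push() return values are not checked):
var Readable = require('stream').Readable;

// Wraps a Meteor cursor in a Node Readable by pushing one document per
// forEach iteration.
function cursorToStream(cursor) {
    var stream = new Readable({objectMode: true});
    stream._read = function() {};  // pushing is driven by forEach below

    cursor.forEach(function(doc) {
        stream.push(doc);
    });
    stream.push(null);  // signal end of data

    return stream;
}

var userStream = cursorToStream(Users.find({}));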
Have you tried creating a custom function and piping to it?
Though this would only work if Users.find() supported .pipe() (again, only if Users.find() inherited from a Node.js streamable object).
Kind of like
var stream = require('stream');
var util = require('util');

function StreamReader() {
    stream.Writable.call(this);
    this.data = '';
    this.on('finish', function() {
        // this.data contains the raw data as a string, so do whatever you need
        // to make it usable, e.g. split on ',' or newlines
        console.log(this.data);
        db.close();
    });
}

util.inherits(StreamReader, stream.Writable);

StreamReader.prototype._write = function(chunk, encoding, callback) {
    this.data = this.data + chunk.toString('utf8');
    callback();
};

Users.find({}).pipe(new StreamReader());
