I have seen some code such as this:
.on('error', console.error)
.on('data', function (data) {})
.on('info', function (info) {})
.on('end', function () {
  // All data retrieved.
});
I have read some docs about streams, but I'm having trouble understanding them. Say I only want to perform my operations once all the data has been received (not partial data). How can I do this? I would think I'd have to read the data inside the 'end' handler, but the data object isn't accessible from there.
From my understanding, if I put logic inside the 'data' handler I could be operating on incomplete data. Is this true? Say the data is a list of friends (some lists have 1 friend, some have 10,000, so the amount of data returned varies). How can I perform my operation only once ALL the friends have been returned, regardless of how much data comes back?
The data handler will usually be called multiple times, each time with a fraction of the complete data.
If you want to perform an action once with all data, the usual way is as follows:
Buffer every item received in the data handler in some variable (e.g. push it onto an array) and perform your final action in the end handler (although the whole idea of a stream is, naturally, to act on the data right away).
var allData = [];

stream
  .on('error', console.error)
  .on('data', function (data) {
    allData.push(data);
  })
  .on('info', function (info) {})
  .on('end', function () {
    // TODO do something more intelligent,
    // where buffering in memory makes sense
    console.log(allData.join());
  });
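If the chunks arrive as Buffers rather than strings (common when reading files or sockets), the same pattern applies; here is a small sketch of that variant, concatenating in the end handler instead of joining:

var chunks = [];

stream
  .on('error', console.error)
  .on('data', function (chunk) {
    chunks.push(chunk);
  })
  .on('end', function () {
    // Combine all buffered chunks into a single Buffer once everything has arrived.
    var complete = Buffer.concat(chunks);
    console.log(complete.toString('utf8'));
  });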
Related
I've just started playing with streaming data in Expressjs.
I'm not entirely sure, but I think the request starts executing the handler again. For example, here is my handler:
import getDataAsync from "./somewhere";

function handler(req, res) {
  console.log('requesting', req.path);
  getDataAsync()
    .then(data => {
      let stream = renderContent(data);
      stream.pipe(res);
    })
    .catch(err => {
      res.end();
    });
}
What I found is that it keeps printing console.log('requesting', req.path) (which I think means getDataAsync is re-executed).
My question is:
Is it true it will re-execute getDataAsync?
If it does, what's your approach?
Thanks heaps!
Node.js is non-blocking, so if you make another request to an endpoint that uses this handler, the handler will execute again. It calls getDataAsync() and then returns, at which point it is removed from the call stack. This process is repeated for every request.
If you want the handler to wait out the stream before it calls it again you could do:
import getDataAsync from "./somewhere";

// Note: this flag is shared by all requests, so it gates every client, not just one.
let streamComplete = true;

function handler(req, res) {
  if (!streamComplete) {
    // A previous stream is still in progress; end this request without data.
    return res.end();
  }
  console.log('requesting', req.path);
  getDataAsync()
    .then(data => {
      streamComplete = false;
      let stream = renderContent(data);
      stream.pipe(res);
      // Mark the stream as finished once all of its data has been read.
      stream.on('end', () => streamComplete = true);
    })
    .catch(err => {
      res.end();
    });
}
I needed to sort this problem out in one of my projects. Node, or in fact any other environment or language, has the same issue: once you start streaming the data to one client, it's rather hard to stream it to another. This is because once you do this:
inputStream.pipe(outputStream);
...the input data is pushed out to the output and removed from memory. So if you simply pipe the inputStream again, the initial part of the data will be missing.
The solution I came up with was to write a Transform stream that keeps the data in memory so it can be reused afterwards. Such a stream retains all of the original chunks, and once a later reader catches up with the first one, it simply keeps pushing the chunks through directly. I packaged the solution as an npm module and published it, so now you can use it.
This is how you use it:
const {ReReadable} = require("rereadable-stream");

// We'll use this for caching - you can use a Map if you have more streams.
let cachedStream;

// This function fetches the data once, caches the rewindable stream,
// and hands every caller a stream rewound to the beginning.
const getCachedStream = () =>
  (cachedStream || (cachedStream =
    getDataAsync()
      .then(
        data => renderContent(data).pipe(new ReReadable())
      ))
  )
  .then(readable => readable.rewind());
This function will call your getDataAsync once and push the data into the rewindable stream, but every time the function is executed the returned stream will be rewound to the beginning.
You can read a bit more about the rereadable-stream module here.
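For completeness, here is a rough sketch of how the handler from the question might consume it (the wiring here is my assumption, not taken from the module's docs):

function handler(req, res) {
  console.log('requesting', req.path);
  getCachedStream()
    .then(readable => readable.pipe(res)) // every request gets the full, rewound stream
    .catch(err => {
      console.error(err);
      res.end();
    });
}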
A word of warning though: remember that you now keep all of that data in memory, so be careful to clean it up when it's no longer needed and keep an eye on your memory usage.
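One simple way to do that cleanup - purely a hypothetical sketch, not part of the rereadable-stream API - is to drop the cached reference after a while, so the buffered chunks can be garbage-collected and the next request rebuilds the cache:

// Hypothetical cache invalidation - not part of the rereadable-stream module.
const invalidateCachedStream = () => {
  cachedStream = undefined; // the next getCachedStream() call rebuilds it
};

// For example, refresh the cached content every 10 minutes.
setInterval(invalidateCachedStream, 10 * 60 * 1000).unref();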
I'm using the promise-ftp Node.js library to download a file from an FTP server. The function ftpClient.get(...) returns a Promise of a ReadableStream, whose data should be accessible via the data and end events; however, I'm seeing a weird behaviour that is not described in the documentation.
// Some code that returns a to-be-downloaded file's name
.then((fileName) => {
  return ftpClient.get(fileName);
})
.then((fileStream) => {
  let chunks = [];
  return new Promise((resolve, reject) => {
    fileStream.on('data', (chunk) => {
      chunks.push(chunk);
    });
    fileStream.on('end', () => {
      resolve(Buffer.concat(chunks).toString('utf8'));
    });
  });
});
This is a way of reading a stream into a variable that I derived from answers to this question. However, it doesn't work: the data and end events are simply never emitted.
Comparing my code to this example in the library documentation, the only difference is that there the ReadableStream is piped somewhere beforehand.
Now, if I add fileStream.pipe(...),
//...
.then((fileStream) => {
  fileStream.pipe(fs.createWriteStream('foo.txt')); // <-- the added line
  let chunks = [];
  return new Promise((resolve, reject) => {
    //...
everything works perfectly.
So my question is: what exactly is the purpose of adding a .pipe() and how is it connected to data and end events?
Pipe is the correct way to read data from a stream. The idea comes from Linux Pipes. Something like:
fileStream | stdout
Here fileStream is being piped to stdout.
on just registers an event listener, i.e. if some part of the program has the pipe set up, another part can listen in on the same stream.
Not streaming any data when nobody is reading it (i.e. it's not being piped to anything) is an obvious optimisation (there could be other constraints behind it too).
Yes, I understand the answer mentioned above can be misleading. The official Node docs on streams are better, and there is a newer, upward-trending answer worth reading.
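As a rough, generic illustration of "pipe is how you consume a stream" (a sketch, not specific to promise-ftp), you can collect a readable stream's contents by piping it into a small Writable instead of attaching data/end listeners:

const { Writable } = require('stream');

// Collect everything a readable stream produces into a single string.
function collectStream(readable) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    const sink = new Writable({
      write(chunk, encoding, callback) {
        chunks.push(chunk);
        callback();
      }
    });
    readable.pipe(sink);
    sink.on('finish', () => resolve(Buffer.concat(chunks).toString('utf8')));
    readable.on('error', reject);
    sink.on('error', reject);
  });
}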
I am piping the response from a Node request into a transform stream using through2Concurrent. The response comes in as a buffer and is parsed into objects using JSONStream, which are then piped into my transform stream. The transform stream function makes an HTTP request, formats the response and stores the result in MongoDB. We are using concurrent streams because it would take an unacceptably long time to handle everything otherwise.
response Stream -> JSONStream.parse() -> Transform Stream
Problem Description
The initial response stream contains roughly 18,000 objects once parsed. However, the stream terminates and a finish event is received before all 18,000 objects have been handled. No error is thrown, but only about 2,000 - 5,000 objects are actually handled before the stream ends. The exact number handled varies.
Here is the relevant code:
const analyticsTransformer = through2Concurrent.obj({
  maxConcurrency: 15
}, async (doc, enc, cb) => {
  // Make an http request. This is a relatively long request.
  const res = await apim.getAnalytics(doc);
  // Save response to mongo.
  await UsageData.save(res);
  cb();
});
// Kick off the streaming.
broker.getInstances()
  .pipe(JSONStream.parse('*'))
  .pipe(analyticsTransformer)
  .on('finish', () => {
    // We reach this way too quickly, before we have handled all 18,000 objects.
  })
  .on('error', err => {
    // No errors are caught.
  });
What I have Tried
Waiting for an 'end' event: same result - unhandled objects and early termination.
Using through2 (not through2Concurrent): I receive an ETIMEOUT after several thousand objects have come through.
Setting the highWaterMark to 18,000:
This is the only thing that has worked. I can handle all of the objects if I change the highWaterMark value, but this is really just a band-aid on the problem. I want to know why it works and what I can do to fix my streaming problems in a robust way.
Setting the highWaterMark looks like this:
const analyticsTransformer = through2Concurrent.obj({
  highWaterMark: 18000,
  maxConcurrency: 15
}, async (doc, enc, cb) => {
  // ...
});
Why does changing the highWaterMark value work?
What is the real cause of my early terminated stream?
How can I fix it?
Thanks in advance to anyone that can help! :)
I need to parse a CSV document from Node.JS, performing database operations for each record (= each line). However, I'm having trouble finding a suitable CSV parser using a pull approach, or at least a push approach that waits for my record operations before parsing the next row.
I've looked at csv-parse, csvtojson and csv-streamify, but they all seem to push events in a continuous stream without any flow control. When parsing a 1000-line CSV document, I basically get all 1000 callbacks in quick sequence. For each record, I perform an operation that returns a promise. Currently I've had to resort to pushing all my promises into an array, and after getting the done/end event I wait for Promise.all(myOperations) to know when the document has been fully processed. But this is not very nice, and I'd also prefer to parse one line at a time and fully process it before fetching the next record, instead of processing all records concurrently - it's hard to debug, and it uses a lot of memory compared with simply dealing with each record sequentially.
So, is there a CSV parser that supports pull mode, or a way to get any stream-based CSV parser (preferably csvtojson as that's the one I'm using at the moment) to only produce events for new records when my handler for the previous record is finished (using promises)?
I solved this myself by creating my own Writable and piping the CSV parser into it. My write method does its work and then bridges its promise to the Node callback passed to _write() (here implemented using Q's nodeify):
const stream = require('stream');
const Q = require('q');
const csvtojson = require('csvtojson');

class CsvConsumer extends stream.Writable {
  _write(data, encoding, cb) {
    console.log('Got data: ', data);
    // Simulate an asynchronous per-line operation, then signal completion
    // to the stream by bridging the promise to the callback.
    Q.delay(1000).then(() => {
      console.log('Waited 1 s');
    }).nodeify(cb);
  }
}

// "is" is the input CSV stream, e.g. fs.createReadStream('data/test.csv').
csvtojson()
  .fromStream(is)
  .pipe(new CsvConsumer())
  .on('error', err => {
    console.log('Error!');
  })
  .on('finish', () => {
    console.log('Done!');
  });
This will process lines one by one:
Got data: {"a": "1"}
Waited 1 s
Got data: {"a": "2"}
Waited 1 s
Got data: {"a": "3"}
Waited 1 s
Done!
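On recent Node versions, where readable streams are async-iterable, another pull-style option is async iteration. Here is a sketch using csv-parse (one of the parsers mentioned in the question), assuming its v5-style { parse } export and a hypothetical doDatabaseOperation helper; because the parser is an object-mode stream, for await pulls exactly one record at a time and backpressure does the rest:

const fs = require('fs');
const { parse } = require('csv-parse');

async function processCsv(path) {
  const parser = fs.createReadStream(path).pipe(parse({ columns: true }));
  for await (const record of parser) {
    // The next record is not pulled from the parser until this await resolves,
    // so each line is fully processed before the following one is read.
    await doDatabaseOperation(record); // hypothetical per-record operation
  }
  console.log('Done!');
}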
If you want to process each line asynchronously you can do that with Node's native readline module.
const readline = require('readline');
const fs = require('fs');

const lineStream = readline.createInterface({
  input: fs.createReadStream('data/test.csv'),
});

lineStream.on('line', (eachLine) => {
  // process each line
});
If you want to process the lines one at a time, waiting for each to finish before reading the next, you can use line-by-line. It doesn't buffer the entire file into memory, and it lets you pause and resume the emitting of 'line' events.
const LineByLineReader = require('line-by-line');
const lr = new LineByLineReader('data/test.csv');

lr.on('line', function (line) {
  // pause emitting of lines...
  lr.pause();

  // ...do your asynchronous line processing...
  setTimeout(function () {
    // ...and continue emitting lines (1 s delay to simulate async work).
    lr.resume();
  }, 1000);
});
I am trying to read from a CSV file and insert the data into an Elasticsearch index. As shown below, I use a read stream and listen for the "data" event. My problem is that I quickly run out of memory with this approach. I'm guessing it's because the Elasticsearch module (elastical) makes a REST request for every document, and the number of in-flight requests builds up.
I am pretty new to this, so is there a way to fix it so it doesn't run out of memory? Any general patterns or techniques?
stream.on('data', function (doc) {
  // create a json from doc
  client.index('entities', 'command', json, function (err, res) {
    console.log(res);
  });
});
Pause the stream when you get data and resume it when the request completes.
stream.on('data', function (doc) {
  // Stop the flow of 'data' events until this document has been indexed.
  stream.pause();
  // create a json from doc
  client.index('entities', 'command', json, function (err, res) {
    // The request has completed; let the stream flow again.
    stream.resume();
    console.log(res);
  });
});
One odd thing about your code is that you're not using doc anywhere in that function, so I'm guessing you haven't posted your entire code.
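As a side note, an equivalent way to get the same backpressure without manual pause()/resume() is to pipe the stream into a Writable whose callback only fires once the index request completes. A rough sketch, assuming the source stream emits one document object per chunk and a hypothetical buildJson helper:

const { Writable } = require('stream');

const indexer = new Writable({
  objectMode: true,
  write(doc, encoding, callback) {
    const json = buildJson(doc); // hypothetical: build the JSON payload from doc
    // The stream delivers the next doc only after this callback is invoked,
    // so at most one index request is in flight at a time.
    client.index('entities', 'command', json, function (err, res) {
      console.log(res);
      callback(err);
    });
  }
});

stream.pipe(indexer);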