Balancing slow I/O against a fast read stream - node.js

In node.js I have a read stream that I wish to reformat and write to a database. Because the read stream is fast and the write is slow, the node.js process could be overwhelmed as the queue of pending writes builds up (assume the stream is GBs of data). How do I force the read to wait for the write part of the code so this does not happen, without blocking?
var request = http.get({
    host: 'api.geonames.org',
    port: 80,
    path: '/children?' + qs.stringify({
        geonameId: geonameId,
        username: "demo"
    })
}).on('response', function(response) {
    response.setEncoding('utf8');
    var xml = new XmlStream(response, 'utf8');
    xml.on('endElement: geoname', function(input) {
        console.log('geoname');
        var output = new Object();
        output.Name = input.name;
        output.lat = input.lat;
        output.lng = input.lng;
        output._key = input.geonameId;
        data.db.document.create(output, data.doc, function(callback) {
            // this is really slow.
        });
        // I do not want to return from here and receive more data until the 'create' above has completed
    });
});

I just ran into this problem last night, and in my hackathon-induced, sleep-deprived state, here is how I solved it:
I incremented a counter whenever I sent a job out to be processed, and decremented it when the operation completed. To keep the outbound traffic from overwhelming the other service, I paused the stream when there was a certain number of pending outbound requests. The code is very similar to the following.
var fs = require('fs');
var readline = require('readline');
var stream = require('stream');

var instream = fs.createReadStream('./combined.csv');
var outstream = new stream;
var inProcess = 0;
var paused = false;
var rl = readline.createInterface(instream, outstream);

rl.on('line', function(line) {
    inProcess++;
    if (inProcess > 100) {
        console.log('pausing input to clear queue');
        rl.pause();
        paused = true;
    }
    someService.doSomethingSlow(line, function(err) {
        inProcess--;
        if (paused && inProcess < 10) {
            console.log('resuming stream');
            paused = false;
            rl.resume();
        }
        if (err) throw err;
    });
});

rl.on('close', function() {
    // all input lines have been read
    console.log('done reading input');
});
Not the most elegant solution, but it worked and allowed me to process the million+ lines without running out of memory or throttling the other service.

My solution simply extends an empty stream.Writable and is fundamentally identical to @Timothy's, but uses events and doesn't rely on Streams1 .pause() and .resume() (which didn't seem to have any effect on my data pipeline anyway).
var stream = require("stream");

var liveRequests = 0;
var maxLiveRequests = 100;
var streamPaused = false;

var requestClient = new stream.Writable();

function requestCompleted() {
    liveRequests--;
    if (streamPaused && liveRequests < maxLiveRequests) {
        streamPaused = false;
        requestClient.emit("resumeStream");
    }
}

requestClient._write = function (data, enc, next) {
    makeRequest(data, requestCompleted);
    liveRequests++;
    if (liveRequests >= maxLiveRequests) {
        streamPaused = true;
        requestClient.once("resumeStream", function resume() {
            next();
        });
    } else {
        next();
    }
};
A counter, liveRequests, keeps track of the number of concurrent requests; it is incremented whenever makeRequest() is called and decremented when a request completes (i.e., when requestCompleted() is called). If a request has just been made and liveRequests has reached maxLiveRequests, we pause the stream by setting streamPaused. If a request completes, the stream is paused, and liveRequests is now below maxLiveRequests, we can resume the stream. Since subsequent data items are read by _write() when its next() callback is called, we can simply defer the latter with an event listener on our custom "resumeStream" event, which mimics pausing/resuming.
Now, simply readStream.pipe(requestClient).
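As a rough usage sketch, makeRequest() above is left abstract; here it is stubbed out with a hypothetical 50 ms delay, and the CSV path is just a stand-in for whatever slow call you make per chunk:

// Hypothetical slow operation; replace with your real database call or HTTP request.
function makeRequest(data, done) {
    setTimeout(function () {
        console.log("processed", data.length, "bytes");
        done(); // signals completion so requestCompleted() can run
    }, 50);
}

var fs = require("fs");
fs.createReadStream("./combined.csv").pipe(requestClient);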
Edit: I abstracted this solution, along with automatic batching of input data, in a package.

Related

Replay a log file with NodeJS as if it were happening in real-time

I have a log file with about 14,000 aircraft position datapoints captured from a system called Flarm; it looks like this:
{"addr":"A","time":1531919658.578100,"dist":902.98,"alt":385,"vs":-8}
{"addr":"A","time":1531919658.987861,"dist":914.47,"alt":384,"vs":-7}
{"addr":"A","time":1531919660.217471,"dist":925.26,"alt":383,"vs":-7}
{"addr":"A","time":1531919660.623466,"dist":925.26,"alt":383,"vs":-7}
What I need to do is find a way to 'play' this file back in real-time (as if it were occurring right now, even though it's pre-recorded) and emit an event whenever a log entry 'occurs'. The file is not being added to; it's pre-recorded, and the playback would happen at a later stage.
The reason for doing this is that I don't have access to the receiving equipment when I'm developing.
The only way I can think to do it is to set a timeout for every log entry, but that doesn't seem like the right way to do it. Also, this process would have to scale to longer recordings (this one was only an hour long).
Are there other ways of doing this?
If you want to "play them back" with the actual time difference, a setTimeout is pretty much what you have to do.
const processEntry = (entry, index) => {
    index++;
    const nextEntry = getEntry(index);
    if (nextEntry == null) return;
    // the time values are in seconds, setTimeout expects milliseconds
    const timeDiff = (nextEntry.time - entry.time) * 1000;
    emitEntryEvent(entry);
    setTimeout(processEntry, timeDiff, nextEntry, index);
};
processEntry(getEntry(0), 0);
This emits the current entry and then sets a timeout based on the difference until the next entry.
getEntry() could either fetch lines from a prefilled array or fetch lines individually based on the index. In the latter case, only two lines of data would be in memory at the same time.
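As a rough sketch of the prefilled-array variant (the file name and the console-logging emitter are assumptions):

const fs = require('fs');

// Read and parse the whole log up front (fine for ~14,000 lines).
const entries = fs.readFileSync('./flarm.log', 'utf8')
    .split('\n')
    .filter(Boolean)
    .map(function (line) { return JSON.parse(line); });

const getEntry = (index) => entries[index];           // undefined past the end stops processEntry
const emitEntryEvent = (entry) => console.log(entry); // stand-in for your real event emitter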
Got it working in the end! setTimeout turned out to be the answer, and combined with the input of Lucas S. this is what I ended up with:
const EventEmitter = require('events');
const fs = require('fs');

const readable = fs.createReadStream("./data/2018-07-18_1509log.json", {
    encoding: 'utf8',
    fd: null
});

function read_next_line() {
    var chunk;
    var line = '';
    // While this is a thing we can do, assign chunk
    while ((chunk = readable.read(1)) !== null) {
        // If chunk is a newline character, return the line
        if (chunk === '\n') {
            return JSON.parse(line);
        } else {
            line += chunk;
        }
    }
    return false;
}

var lines = [];
var nextline;

const processEntry = () => {
    // If lines is empty, read a line
    if (lines.length === 0) lines.push(read_next_line());
    // Quit here if we've reached the last line
    if ((nextline = read_next_line()) == false) return true;
    // Else push the just read line into our array
    lines.push(nextline);
    // Get the time difference in milliseconds
    var delay = Number(lines[1].time - lines[0].time) * 1000;
    // Remove the first line
    lines.shift();
    module.exports.emit('data', lines[0]);
    // Repeat after the calculated delay
    setTimeout(processEntry, delay);
}

var ready_to_start = false;

// When the stream becomes readable, allow starting
readable.on('readable', function() {
    ready_to_start = true;
});

module.exports = new EventEmitter;
module.exports.start = function() {
    if (ready_to_start) processEntry();
    if (!ready_to_start) return false;
}
Assuming you want to visualize the flight logs, you can use fs.watch() as below to watch the log file for changes:
fs.watch('somefile', function (event, filename) {
    console.log('event is: ' + event);
    if (filename) {
        console.log('filename provided: ' + filename);
    } else {
        console.log('filename not provided');
    }
});
Code excerpt is from here. For more information on fs.watch() check out here
Then, for seamless updates on the frontend, you can set up a WebSocket to your server, watch the log file there, and send each newly added row over that socket to the frontend, as sketched below.
Once you have the data on the frontend you can visualize it there. While I haven't done a flight-visualization project before, I've used D3.js to visualize other things (sound, numerical data, metric analysis, etc.) a couple of times and it did the job every time.
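A minimal sketch of that watch-and-push idea, assuming the ws package, a log file named somefile, and that changes only ever append whole rows:

const fs = require('fs');
const WebSocket = require('ws');

const wss = new WebSocket.Server({ port: 8080 }); // the frontend connects here
const logFile = 'somefile';                       // hypothetical log path
let offset = fs.statSync(logFile).size;           // only push rows added from now on

fs.watch(logFile, function () {
    const size = fs.statSync(logFile).size;
    if (size <= offset) return;                   // ignore truncation or no growth
    // Read only the newly appended bytes and broadcast them to every open socket.
    fs.createReadStream(logFile, { start: offset, end: size - 1, encoding: 'utf8' })
        .on('data', function (newRows) {
            wss.clients.forEach(function (client) {
                if (client.readyState === WebSocket.OPEN) client.send(newRows);
            });
        });
    offset = size;
});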

setTimeout or child_process.spawn?

I have a REST service in Node.js with one specific request running a bunch of DB commands and other file processing that could take 10-15 seconds to run. Since I didn't want to hold up my browser request thread, I wrote a separate .js script to do the needful, called the script using child_process.spawn() in my Node.js code and immediately returned OK back to the client. This works fine, but then so does calling the same script (as a local function) by just using a simple setTimeout.
router.post("/longRequest", function(req, res) {
    console.log("Started long request with id: " + req.body.id);
    var longRunningFunction = function() {
        // Usually runs a bunch of things that take time.
        // Simulating a 10 sec delay for sample code.
        setTimeout(function() {
            console.log("Done processing for 10 seconds")
        }, 10000);
    }
    // Below line used to be
    // child_process.spawn('longRunningFunction.js')
    setTimeout(longRunningFunction, 0);
    res.json({status: "OK"})
})
So, this works for my purpose. But what's the downside? I probably can't monitor the offline process as easily as with child_process.spawn(), which would give me a process id. But does this cause problems in the long run? Will it hold up Node.js processing if the 10-second processing grows to a lot more in the future?
The actual longRunningFunction is something that reads an Excel file, parses it and does a bulk load using tedious to a MS SQL Server.
var XLSX = require('xlsx');
var FileAPI = require('file-api'), File = FileAPI.File, FileList = FileAPI.FileList, FileReader = FileAPI.FileReader;
var Connection = require('tedious').Connection;
var Request = require('tedious').Request;
var TYPES = require('tedious').TYPES;

var importFile = function() {
    var file = new File(fileName);
    if (file) {
        var reader = new FileReader();
        reader.onload = function (evt) {
            var data = evt.target.result;
            var workbook = XLSX.read(data, {type: 'binary'});
            var ws = workbook.Sheets[workbook.SheetNames[0]];
            var headerNames = XLSX.utils.sheet_to_json(ws, { header: 1 })[0];
            var data = XLSX.utils.sheet_to_json(ws);
            var bulkLoad = connection.newBulkLoad(tableName, function (error, rowCount) {
                if (error) {
                    console.log("bulk upload error: " + error);
                } else {
                    console.log('inserted %d rows', rowCount);
                }
                connection.close();
            });
            // setup your columns - always indicate whether the column is nullable
            Object.keys(columnsAndDataTypes).forEach(function(columnName) {
                bulkLoad.addColumn(columnName, columnsAndDataTypes[columnName].dataType, { length: columnsAndDataTypes[columnName].len, nullable: true });
            })
            data.forEach(function(row) {
                var addRow = {}
                Object.keys(columnsAndDataTypes).forEach(function(columnName) {
                    addRow[columnName] = row[columnName];
                })
                bulkLoad.addRow(addRow);
            })
            // execute
            connection.execBulkLoad(bulkLoad);
        };
        reader.readAsBinaryString(file);
    } else {
        console.log("No file!!");
    }
};
So, this works for my purpose. But what's the downside?
If you actually have a long running task capable of blocking the event loop, then putting it on a setTimeout() is not stopping it from blocking the event loop at all. That's the downside. It's just moving the event loop blocking from right now until the next tick of the event loop. The event loop will be blocked the same amount of time either way.
If you just did res.json({status: "OK"}) before running your code, you'd get the exact same result.
If your long-running code (which you describe as file and database operations) is actually blocking the event loop even though it is properly written using async I/O operations, then the only way to stop it from blocking the event loop is to move that CPU-consuming work out of the node.js thread.
That is typically done by clustering, moving the work to worker processes or moving the work to some other server. You have to have this work done by another process or another server in order to get it out of the way of the event loop. A setTimeout() by itself won't accomplish that.
child_process.spawn() will accomplish that. So, if you have an actual event loop blocking problem to solve and the I/O is already as async optimized as possible, then moving it to a worker process is a typical node.js solution. You can communicate with that child process in a number of ways, but one possibility would be via stdin and stdout.
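As a rough sketch of that worker-process approach (the worker script name, the message shape, and the job fields are assumptions; router is the Express router from the question, and this uses child_process.fork(), which gives you a built-in message channel, rather than raw stdin/stdout):

// parent (the REST handler)
const { fork } = require('child_process');

router.post('/longRequest', function (req, res) {
    const worker = fork('./longRunningTask.js'); // hypothetical worker script
    worker.send({ id: req.body.id });            // hand the job to the worker process
    worker.on('message', function (result) {
        console.log('worker finished:', result); // e.g. update a job-status table here
    });
    res.json({ status: 'OK' });                  // respond immediately, work continues elsewhere
});

// longRunningTask.js (the worker process)
process.on('message', function (job) {
    // ...read the Excel file, bulk load into SQL Server, etc...
    process.send({ id: job.id, done: true });
    process.exit(0);
});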

Why does Node.js exit before completing a non-flowing stream copy

I'm just learning node.js and wanted to write a simple test program that copies a file from a source folder to a destination folder. I piped an fs.ReadStream to an fs.WriteStream and that worked perfectly. I next tried to use non-flowing mode, but the following program fails 99% of the time on larger files (anything over 1 MB). I'm assuming that, given certain timing, the event queue becomes empty and so the process exits. Should the following program work?
var sourcePath = "./source/test.txt";
var destinationPath = "./destination/test.txt";

// Display number of times 'readable' callback fired
var callbackCount = 0;
process.on('exit', function() {
    console.log('Readable callback fired %d times', callbackCount);
})

var fs = require('fs');
var sourceStream = fs.createReadStream(sourcePath);
var destinationStream = fs.createWriteStream(destinationPath);

copyStream(sourceStream, destinationStream);

function copyStream(src, dst) {
    var drained = true;
    // read chunk of data when ready
    src.on('readable', function () {
        ++callbackCount;
        if (!drained) {
            dst.once('drain', function () {
                writeToDestination();
            });
        } else {
            writeToDestination();
        }
        function writeToDestination() {
            var chunk = src.read();
            if (chunk !== null) {
                drained = dst.write(chunk);
            }
        }
    });
    src.on('end', function () {
        dst.end();
    });
}
NOTE: If I remove the drain-related code the program always works, but the node.js documentation indicates that you should wait for a drain event if the write function returns false.
So should the above program work as is? If it shouldn't how should I reorganize it to work with both readable and drain events?
It looks like you're most of the way there; there are just a couple of things you need to change.
When writing to dst you need to keep reading from src until either you get a null chunk, or dst.write() returns false.
Instead of listening for all readable events on src, you should only be listening for those events when it's ok to write to dst and you currently have nothing to write.
Something like this:
function copyStream(src, dst) {
    function writeToDestination() {
        var chunk = src.read(),
            drained = true;
        // write until dst is saturated or there's no more data available in src
        while (drained && (chunk !== null)) {
            if (drained = dst.write(chunk)) {
                chunk = src.read();
            }
        }
        if (!drained) {
            // if dst is saturated, wait for it to drain and then write again
            dst.once('drain', function() {
                writeToDestination();
            });
        } else {
            // if we ran out of data in src, wait for more and then write again
            src.once('readable', function() {
                ++callbackCount;
                writeToDestination();
            });
        }
    }
    // trigger the initial write when data is available in src
    src.once('readable', function() {
        ++callbackCount;
        writeToDestination();
    });
    src.on('end', function () {
        dst.end();
    });
}

Is it possible to register multiple listeners to a child process's stdout data event? [duplicate]

I need to run two commands in series that need to read data from the same stream.
After piping a stream into another stream, the buffer is emptied, so I can't read data from that stream again; this doesn't work:
var spawn = require('child_process').spawn;
var fs = require('fs');
var request = require('request');

var inputStream = request('http://placehold.it/640x360');
var identify = spawn('identify', ['-']);

inputStream.pipe(identify.stdin);

var chunks = [];

identify.stdout.on('data', function(chunk) {
    chunks.push(chunk);
});

identify.stdout.on('end', function() {
    var size = getSize(Buffer.concat(chunks)); // width
    var convert = spawn('convert', ['-', '-scale', size * 0.5, 'png:-']);
    inputStream.pipe(convert.stdin);
    convert.stdout.pipe(fs.createWriteStream('half.png'));
});

function getSize(buffer) {
    return parseInt(buffer.toString().split(' ')[2].split('x')[0]);
}
Request complains about this:
Error: You cannot pipe after data has been emitted from the response.
and changing the inputStream to fs.createWriteStream yields the same issue of course.
I don't want to write into a file but reuse in some way the stream that request produces (or any other for that matter).
Is there a way to reuse a readable stream once it finishes piping?
What would be the best way to accomplish something like the above example?
You have to create a duplicate of the stream by piping it into two streams. You can create a simple duplicate with a PassThrough stream; it simply passes the input through to the output.
const spawn = require('child_process').spawn;
const PassThrough = require('stream').PassThrough;

const a = spawn('echo', ['hi user']);
const b = new PassThrough();
const c = new PassThrough();

a.stdout.pipe(b);
a.stdout.pipe(c);

let count = 0;
b.on('data', function (chunk) {
    count += chunk.length;
});
b.on('end', function () {
    console.log(count);
    c.pipe(process.stdout);
});
Output:
8
hi user
The first answer only works if streams take roughly the same amount of time to process data. If one takes significantly longer, the faster one will request new data, consequently overwriting the data still being used by the slower one (I had this problem after trying to solve it using a duplicate stream).
The following pattern worked very well for me. It uses a library based on Streams2, Streamz, and Promises to synchronize async streams via a callback. Using the familiar example from the first answer:
var spawn = require('child_process').spawn;
var pass = require('stream').PassThrough;
var streamz = require('streamz').PassThrough;
var Promise = require('bluebird');

var a = spawn('echo', ['hi user']);
var b = new pass;
var c = new pass;

a.stdout.pipe(streamz(combineStreamOperations));

function combineStreamOperations(data, next) {
    Promise.join(b, c, function(b, c) { // perform n operations on the same data
        next(); // request more
    });
}

var count = 0;
b.on('data', function(chunk) { count += chunk.length; });
b.on('end', function() { console.log(count); c.pipe(process.stdout); });
You can use this small npm package I created:
readable-stream-clone
With this, you can reuse readable streams as many times as you need.
For the general problem, the following code works fine:
var PassThrough = require('stream').PassThrough;

var a = new PassThrough();
var b1 = new PassThrough();
var b2 = new PassThrough();

a.pipe(b1);
a.pipe(b2);

b1.on('data', function(data) {
    console.log('b1:', data.toString());
});
b2.on('data', function(data) {
    console.log('b2:', data.toString());
});

a.write('text');
I have a different solution that writes to two streams simultaneously. Naturally, the time to write will be the sum of the two times, but I use it to answer a download request where I want to keep a copy of the downloaded file on my server (actually I use an S3 backup, so I cache the most-used files locally to avoid multiple file transfers).
/**
 * A utility class made to write to a file while answering a file download request
 */
class TwoOutputStreams {
    constructor(streamOne, streamTwo) {
        this.streamOne = streamOne
        this.streamTwo = streamTwo
    }

    setHeader(header, value) {
        if (this.streamOne.setHeader)
            this.streamOne.setHeader(header, value)
        if (this.streamTwo.setHeader)
            this.streamTwo.setHeader(header, value)
    }

    write(chunk) {
        this.streamOne.write(chunk)
        this.streamTwo.write(chunk)
    }

    end() {
        this.streamOne.end()
        this.streamTwo.end()
    }
}
You can then use this as a regular OutputStream
const twoStreamsOut = new TwoOutputStreams(fileOut, responseStream)
and pass it to your method as if it were a response or a fileOutputStream.
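For instance, in an Express-style download handler it might look like this (app, s3Stream and the cache path are assumptions, not part of the original answer):

const fs = require('fs');

app.get('/download/:name', function (req, res) {
    // keep a local copy while also streaming the response to the client
    const localCopy = fs.createWriteStream('./cache/' + req.params.name);
    const out = new TwoOutputStreams(localCopy, res);

    out.setHeader('Content-Type', 'application/octet-stream'); // forwarded only to res, since file streams have no setHeader

    // s3Stream is a hypothetical readable stream coming from the backup storage
    s3Stream.on('data', function (chunk) { out.write(chunk); });
    s3Stream.on('end', function () { out.end(); });
});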
If you have async operations on the PassThrough streams, the answers posted here won't work.
A solution that works for async operations is to buffer the stream content and then create new streams from the buffered result.
To buffer the result you can use concat-stream:
const Promise = require('bluebird');
const concat = require('concat-stream');

const getBuffer = function(stream) {
    return new Promise(function(resolve, reject) {
        var gotBuffer = function(buffer) {
            resolve(buffer);
        }
        var concatStream = concat(gotBuffer);
        stream.on('error', reject);
        stream.pipe(concatStream);
    });
}
To create streams from the buffer you can use:
const { Readable } = require('stream');

const getBufferStream = function(buffer) {
    const stream = new Readable();
    stream.push(buffer);
    stream.push(null);
    return Promise.resolve(stream);
}
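Putting the two helpers together might look like this (the identify/convert commands are just placeholders borrowed from the question):

const { spawn } = require('child_process');
const request = require('request');

getBuffer(request('http://placehold.it/640x360'))
    .then(function (buffer) {
        // The image is now fully buffered, so it can feed any number of consumers,
        // even ones that only start after an async step.
        return Promise.all([getBufferStream(buffer), getBufferStream(buffer)]);
    })
    .then(function (streams) {
        streams[0].pipe(spawn('identify', ['-']).stdin);
        streams[1].pipe(spawn('convert', ['-', '-scale', '50%', 'png:-']).stdin);
    });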
What about piping into two or more streams not at the same time?
For example:
var PassThrough = require('stream').PassThrough;

var mybinaryStream = stream.start(); // never-ending audio stream
var file1 = fs.createWriteStream('file1.wav', {encoding: 'binary'})
var file2 = fs.createWriteStream('file2.wav', {encoding: 'binary'})
var mypass = new PassThrough()

mybinaryStream.pipe(mypass)
mypass.pipe(file1)

setTimeout(function() {
    mypass.pipe(file2);
}, 2000)
The above code does not produce any errors, but file2 is empty.

Pausing readline in Node.js

Consider the code below ... I am trying to pause the stream after reading the first 5 lines:
var fs = require('fs');
var readline = require('readline');
var stream = require('stream');

var numlines = 0;
var instream = fs.createReadStream("myfile.json");
var outstream = new stream;
var readStream = readline.createInterface(instream, outstream);

readStream.on('line', function(line) {
    numlines++;
    console.log("Read " + numlines + " lines");
    if (numlines >= 5) {
        console.log("Pausing stream");
        readStream.pause();
    }
});
The output (copied below) suggests that it keeps reading lines after the pause. Perhaps readline has queued up a few more lines in the buffer and is feeding them to me anyway; this would make sense if it continues to read asynchronously in the background, but based on the documentation I don't know what the proper behavior should be. Any recommendations on how to achieve the desired effect?
Read 1 lines
Read 2 lines
Read 3 lines
Read 4 lines
Read 5 lines
Pausing stream
Read 6 lines
Pausing stream
Read 7 lines
Somewhat unintuitively, the pause method does not stop queued-up line events:
Calling rl.pause() does not immediately pause other events (including 'line') from being emitted by the readline.Interface instance.
There is, however, a third-party module named line-by-line where pause() does pause the line events until it is resumed.
var LineByLineReader = require('line-by-line'),
    lr = new LineByLineReader('big_file.txt');

lr.on('error', function (err) {
    // 'err' contains error object
});

lr.on('line', function (line) {
    // pause emitting of lines...
    lr.pause();
    // ...do your asynchronous line processing..
    setTimeout(function () {
        // ...and continue emitting lines.
        lr.resume();
    }, 100);
});

lr.on('end', function () {
    // All lines are read, file is closed now.
});
(I have no affiliation with the module, just found it useful for dealing with this issue.)
So, it turns out that the readline stream tends to "drip" (i.e., leak a few extra lines) even after a pause(). The documentation does not make this clear, but it's true.
If you want pause() to appear to take effect immediately, you'll have to create your own line buffer and accumulate the leftover lines yourself, as in the sketch below.
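A minimal sketch of such a buffer, where rl is the readline interface from the question (called readStream there) and processLine() is a hypothetical per-line handler:

var leftover = [];
var hardPaused = false;

rl.on('line', function (line) {
    if (hardPaused) {
        leftover.push(line); // stash lines that still drip out after pause()
    } else {
        processLine(line);
    }
});

function pauseNow() {
    hardPaused = true;
    rl.pause();
}

function resumeNow() {
    hardPaused = false;
    leftover.splice(0).forEach(processLine); // drain the stash first, in order
    rl.resume();
}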
A few additional points. If you add a 'pause' handler:
.on('pause', function() {
    console.log(numlines)
})
you will see 5 logged. As mentioned in the Node.js documentation:
The input stream is not paused and receives the SIGCONT event. (See events SIGTSTP and SIGCONT)
So I created a temporary buffer in the 'line' event handler and use a flag to track whether emitting is paused.
.on('line', function(line) {
    if (paused) {
        putLineInBulkTmp(line);
    } else {
        putLineInBulk(line);
    }
})
then, in the 'pause' and 'resume' handlers:
.on('pause', function() {
    paused = true;
    doSomething(bulk, function(resp) {
        // clean up bulk for the next batch.
        bulk = [];
        // clone tmp buffer.
        bulk = clone(bulktmp);
        bulktmp = [];
        lr.resume();
    });
})
.on('resume', () => {
    paused = false;
})
This approach handles this kind of situation.
