Does the new way to use streams (streams2) in node create blocking?

I realize that node is non-blocking; however, I also realize that because node has only one thread, putting a three-second while loop in the middle of your event loop will cause blocking. I.e.:
var start = new Date();
console.log('Test 1');
function sleep(time, words) {
    while (new Date().getTime() < start.getTime() + time);
    console.log(words);
}
sleep(3000, 'Test 2'); // This will block
console.log('Test 3'); // Logs Test 1, Test 2, Test 3
Many of the examples I have seen dealing with the new "Streams2" interface look like they would cause this same blocking. For instance this one, borrowed from here:
var crypto = require('crypto');
var fs = require('fs');

var readStream = fs.createReadStream('myfile.txt');
var hash = crypto.createHash('sha1');

readStream
    .on('readable', function () {
        var chunk;
        while (null !== (chunk = readStream.read())) {
            hash.update(chunk); // DOESN'T this cause blocking?
        }
    })
    .on('end', function () {
        console.log(hash.digest('hex'));
    });
If I am following this right, the readStream will emit the readable event when there is data in the buffer. So it seems that once the readable event is emitted, the entire event loop would be stopped until readStream.read() returns null. This seems less desirable than the old way (which would not block). Can somebody please tell me why I am wrong? Thanks.

You don't have to read until the internal stream buffer is empty. You could just read once if you wanted and then read another chunk some time later.
readStream.read() itself is not blocking, but hash.update(chunk) is (for a brief amount of time) because the hashing is done on the main thread (there is a github issue about adding an async interface that would execute crypto functions in the thread pool though).
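For illustration, here is one minimal sketch of that idea (not the only way to do it): read a single chunk per turn of the event loop and defer the next read with setImmediate, so other callbacks can run in between chunks:

var crypto = require('crypto');
var fs = require('fs');

var readStream = fs.createReadStream('myfile.txt');
var hash = crypto.createHash('sha1');

// Read one chunk at a time instead of draining the buffer in a tight while
// loop; setImmediate yields back to the event loop between chunks.
function drainOneChunkAtATime() {
    var chunk = readStream.read();
    if (chunk !== null) {
        hash.update(chunk);
        setImmediate(drainOneChunkAtATime);
    }
    // When read() returns null, just wait for the next 'readable' event.
}

readStream.on('readable', drainOneChunkAtATime);
readStream.on('end', function () {
    console.log(hash.digest('hex'));
});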
Also, you can simplify the code you have to use the crypto stream interface:
var crypto = require('crypto'),
    fs = require('fs');

var readStream = fs.createReadStream('myfile.txt'),
    hasher = crypto.createHash('sha1');

readStream.pipe(hasher).on('readable', function () {
    // the hash stream automatically pushes the digest
    // to the readable side once the writable side is ended
    console.log(this.read());
}).setEncoding('hex');

All JS code is single-threaded, so a loop will block, but you are misunderstanding how long that loop will run for. Calling .read() takes a readable item from the stream, just like a 'data' handler would be called with the item. The loop stops executing and unblocks as soon as there are no more items. 'readable' is triggered whenever there is data; the handler then empties the buffer and waits for another 'readable'. So where your first while loop relies on the clock reaching a deadline, which could take an arbitrarily long time, the other loop is basically doing:
while (items.length > 0) items.pop()
which is pretty much the minimum amount of work you need to do to process items from a stream.
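For comparison, the same hashing written against the old flowing ('data') interface does the same per-chunk work; it just receives each chunk as a callback argument instead of pulling it with read():

var crypto = require('crypto');
var fs = require('fs');

var readStream = fs.createReadStream('myfile.txt');
var hash = crypto.createHash('sha1');

// Each 'data' callback handles exactly one chunk and then returns,
// giving the event loop a chance to run other callbacks between chunks.
readStream.on('data', function (chunk) {
    hash.update(chunk);
});

readStream.on('end', function () {
    console.log(hash.digest('hex'));
});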

Related

_read() is not implemented on Readable stream

This question is about how to actually implement the read method of a readable stream.
I have this implementation of a Readable stream:
import {Readable} from "stream";
this.readableStream = new Readable();
I am getting this error:
events.js:136
throw er; // Unhandled 'error' event
^
Error [ERR_STREAM_READ_NOT_IMPLEMENTED]: _read() is not implemented
at Readable._read (_stream_readable.js:554:22)
at Readable.read (_stream_readable.js:445:10)
at resume_ (_stream_readable.js:825:12)
at _combinedTickCallback (internal/process/next_tick.js:138:11)
at process._tickCallback (internal/process/next_tick.js:180:9)
at Function.Module.runMain (module.js:684:11)
at startup (bootstrap_node.js:191:16)
at bootstrap_node.js:613:3
The reason the error occurs is obvious; we need to do this:
this.readableStream = new Readable({
    read(size) {
        return true;
    }
});
I don't really understand how to implement the read method though.
The only thing that works is just calling
this.readableStream.push('some string or buffer');
if I try to do something like this:
this.readableStream = new Readable({
    read(size) {
        this.push('foo'); // call push here!
        return true;
    }
});
then nothing happens - nothing comes out of the readable!
Furthermore, these articles say you don't need to implement the read method:
https://github.com/substack/stream-handbook#creating-a-readable-stream
https://medium.freecodecamp.org/node-js-streams-everything-you-need-to-know-c9141306be93
My question is - why does calling push inside the read method do nothing? The only thing that works for me is just calling readable.push() elsewhere.
why does calling push inside the read method do nothing? The only thing that works for me is just calling readable.push() elsewhere.
I think it's because you are not consuming it; you need to pipe it to a writable stream (e.g. stdout) or consume it through a data event:
const { Readable } = require("stream");

let count = 0;
const readableStream = new Readable({
    read(size) {
        if (count === 5) {
            this.push(null); // end the stream after five chunks
            return;
        }
        this.push('foo');
        count++;
    }
});

// piping
readableStream.pipe(process.stdout);

// through the data event
readableStream.on('data', (chunk) => {
    console.log(chunk.toString());
});
Both of them should print foo five times (they behave slightly differently, though). Which one you should use depends on what you are trying to accomplish.
Furthermore, these articles say you don't need to implement the read method:
You might not need it, this should work:
const { Readable } = require("stream");

const readableStream = new Readable();
for (let i = 0; i <= 5; i++) {
    readableStream.push('foo');
}
readableStream.push(null);

readableStream.pipe(process.stdout);
In this case you can't capture it through the data event. Also, this way is not very useful or efficient, I'd say; we are just pushing all the data into the stream at once (if it's large, everything is going to be in memory) and then consuming it.
From the documentation:
readable._read:
"When readable._read() is called, if data is available from the resource, the implementation should begin pushing that data into the read queue using the this.push(dataChunk) method."
readable.push:
"The readable.push() method is intended to be called only by Readable implementers, and only from within the readable._read() method."
Implement the _read method after your ReadableStream's initialization:
import {Readable} from "stream";
this.readableStream = new Readable();
this.readableStream.read = function () {};
A readableStream is like a pool:
.push(data) is like pumping water into the pool.
.pipe(destination) is like connecting the pool to a pipe that carries the water somewhere else.
_read(size) acts as the pump: it controls how much water flows and when the data ends.
fs.createReadStream() creates a read stream whose _read() function is already implemented to push file data and to end the stream at the end of the file.
_read(size) fires automatically when the pool is attached to a pipe. So if you call this function yourself without connecting a destination, the data is pumped to nowhere in particular, and it can corrupt the internal state that _read() depends on (for example, a cursor moved to the wrong place).
The read function must be created inside new Stream.Readable(); it is a function on the options object. It is not readableStream.read(), and assigning readableStream.read = function (size) {...} will not work.
An easy way to understand the implementation:
const Stream = require('stream');

var Reader = new Object();
Reader.read = function (size) {
    if (this.i == null) { this.i = 1; } else { this.i++; }
    this.push("abc");
    if (this.i > 7) { this.push(null); }
};

const renderStream = new Stream.Readable(Reader);
renderStream.pipe(process.stdout);
You can use it to render whatever stream data you want and POST it to another server.
POSTing stream data with Axios:
require('axios')({
    method: 'POST',
    url: 'http://127.0.0.1:3000',
    headers: {'Content-Length': 1000000000000},
    data: renderStream
});

setTimeout or child_process.spawn?

I have a REST service in Node.js with one specific request running a bunch of DB commands and other file processing that could take 10-15 seconds to run. Since I didn't want to hold up my browser request thread, I wrote a separate .js script to do the needful, called the script using child_process.spawn() in my Node.js code and immediately returned OK back to the client. This works fine, but then so does calling the same script (as a local function) by just using a simple setTimeout.
router.post("/longRequest", function(req, res) {
    console.log("Started long request with id: " + req.body.id);

    var longRunningFunction = function() {
        // Usually runs a bunch of things that take time.
        // Simulating a 10 sec delay for sample code.
        setTimeout(function() {
            console.log("Done processing for 10 seconds");
        }, 10000);
    };

    // Below line used to be
    // child_process.spawn('longRunningFunction.js'
    setTimeout(longRunningFunction, 0);

    res.json({status: "OK"});
});
So, this works for my purpose. But what's the downside? I probably can't monitor the offline process as easily as with child_process.spawn(), which would give me a process id. But does this cause problems in the long run? Will it hold up Node.js processing if the 10-second processing grows to a lot more in the future?
The actual longRunningFunction is something that reads an Excel file, parses it, and does a bulk load into MS SQL Server using tedious.
var XLSX = require('xlsx');
var FileAPI = require('file-api'), File = FileAPI.File, FileList = FileAPI.FileList, FileReader = FileAPI.FileReader;
var Connection = require('tedious').Connection;
var Request = require('tedious').Request;
var TYPES = require('tedious').TYPES;

var importFile = function() {
    var file = new File(fileName);
    if (file) {
        var reader = new FileReader();
        reader.onload = function (evt) {
            var data = evt.target.result;
            var workbook = XLSX.read(data, {type: 'binary'});
            var ws = workbook.Sheets[workbook.SheetNames[0]];
            var headerNames = XLSX.utils.sheet_to_json(ws, { header: 1 })[0];
            var data = XLSX.utils.sheet_to_json(ws);
            var bulkLoad = connection.newBulkLoad(tableName, function (error, rowCount) {
                if (error) {
                    console.log("bulk upload error: " + error);
                } else {
                    console.log('inserted %d rows', rowCount);
                }
                connection.close();
            });
            // setup your columns - always indicate whether the column is nullable
            Object.keys(columnsAndDataTypes).forEach(function(columnName) {
                bulkLoad.addColumn(columnName, columnsAndDataTypes[columnName].dataType, { length: columnsAndDataTypes[columnName].len, nullable: true });
            });
            data.forEach(function(row) {
                var addRow = {};
                Object.keys(columnsAndDataTypes).forEach(function(columnName) {
                    addRow[columnName] = row[columnName];
                });
                bulkLoad.addRow(addRow);
            });
            // execute
            connection.execBulkLoad(bulkLoad);
        };
        reader.readAsBinaryString(file);
    } else {
        console.log("No file!!");
    }
};
So, this works for my purpose. But what's the downside?
If you actually have a long running task capable of blocking the event loop, then putting it on a setTimeout() is not stopping it from blocking the event loop at all. That's the downside. It's just moving the event loop blocking from right now until the next tick of the event loop. The event loop will be blocked the same amount of time either way.
If you just did res.json({status: "OK"}) before running your code, you'd get the exact same result.
If your long running code (which you describe as file and database operations) is blocking the event loop even though it is properly written with async I/O operations, then the only way to stop blocking the event loop is to move that CPU-consuming work out of the node.js thread.
That is typically done by clustering, moving the work to worker processes or moving the work to some other server. You have to have this work done by another process or another server in order to get it out of the way of the event loop. A setTimeout() by itself won't accomplish that.
child_process.spawn() will accomplish that. So, if you have an actual event loop blocking problem to solve and the I/O is already as async optimized as possible, then moving it to a worker process is a typical node.js solution. You can communicate with that child process in a number of ways, but one possibility would be via stdin and stdout.
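As a rough sketch of that idea (reusing the longRunningFunction.js script name from the question; the script itself is not shown), the route could hand the work to a child process and monitor it through its stdout, stderr and exit code:

const { spawn } = require('child_process');

router.post("/longRequest", function(req, res) {
    console.log("Started long request with id: " + req.body.id);

    var worker = spawn('node', ['longRunningFunction.js', String(req.body.id)], {
        stdio: ['ignore', 'pipe', 'pipe']
    });

    // stdout/stderr let you monitor the work without blocking the event loop
    worker.stdout.on('data', function(data) {
        console.log('worker: ' + data);
    });
    worker.stderr.on('data', function(data) {
        console.error('worker error: ' + data);
    });
    worker.on('exit', function(code) {
        console.log('worker exited with code ' + code);
    });

    // respond immediately; the heavy lifting continues in the other process
    res.json({status: "OK"});
});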

Better way to write a simple Node redis loop (using ioredis)?

So, I'm still learning the JS/Node way after a long time in other languages.
I have a tiny micro-service that reads from a redis channel, temp stores it in a working channel, does the work, removes it, and moves on. If there is more in the channel it re-runs immediately. If not, it sets a timeout and checks again in 1 second.
It works fine...but timeout polling doesn't seem to be the "correct" way to approach this. And I haven't found much about using BRPOPLPUSH to try to block (vs. RPOPLPUSH) and wait in Node....or other options like that. (Pub/Sub isn't an option here...this is the only listener, and it may not always be listening.)
Here's the short essence of what I'm doing:
var Redis = require('ioredis');
var redis = new Redis();

var redisLoop = function () {
    redis.rpoplpush('channel', 'channel-working').then(function (result) {
        if (result) {
            processJob(result); // do stuff

            // delete the item from the working channel, and check for another item
            redis.lrem('channel-working', 1, result).then(function (result) { });
            redisLoop();
        } else {
            // no items, wait 1 second and try again
            setTimeout(redisLoop, 1000);
        }
    });
};

redisLoop();
I feel like I'm missing something really obvious. Thanks!
BRPOPLPUSH doesn't block the Node event loop; the blocking happens on the Redis connection, which simply doesn't return a result until one is available. In this instance I think it's exactly what you need to get rid of the polling.
var Redis = require('ioredis');
var redis = new Redis();

var redisLoop = function () {
    redis.brpoplpush('channel', 'channel-working', 0).then(function (result) {
        // because we are using BRPOPLPUSH, the client promise will not resolve
        // until a 'result' becomes available
        processJob(result);

        // delete the item from the working channel, and check for another item
        redis.lrem('channel-working', 1, result).then(redisLoop);
    });
};

redisLoop();
Note that redis.lrem is asynchronous, so you should use lrem(...).then(redisLoop) to ensure that your next tick executes only after the item is successfully removed from channel-working.

What are the semantics of the return value of nodejs stream.Readable.push()

I have this code:
"use strict";
var fs = require("fs");
var stream = require('readable-stream');
var Transform = require('stream').Transform,
util = require('util');
var TransformStream = function() {
Transform.call(this, {objectMode: true});
};
util.inherits(TransformStream, Transform);
TransformStream.prototype._transform = function(chunk, encoding, callback) {
if(this.push(chunk)) {
console.log("push returned true");
} else {
console.log("push returned false");
}
callback();
};
var inStream = fs.createReadStream("in.json");
var outStream = fs.createWriteStream("out.json");
var transform = new TransformStream();
inStream.pipe(transform).pipe(outStream);
in.json is 88679467 bytes in size. The first 144 writes state that push returned true. The remaining writes (1210 of them) all state that push returned false.
out.json ends up being a full copy of in.json - so no bytes were dropped.
Which leaves me with no clue about what to do with the return value of push.
What is the right thing to do?
The push() docs (https://nodejs.org/api/stream.html#stream_readable_push_chunk_encoding) say:
Return: Boolean. Whether or not more pushes should be performed.
The purpose is "backpressure": when push() returns false, there is probably no space left in an outbound buffer. If your stream were reading from somewhere like a file on disk, you would stop reading when push() returns false and wait to be called again. But in your case you are implementing _transform() rather than _read(), so you don't have a lot of choice in the matter: you have received a chunk and should push() it. The Transform stream will buffer any excess internally, and it can take the initiative to delay future calls to your _transform() method.
So when you implement a TransformStream you can safely ignore the return value from push().
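For contrast, here is a sketch of what respecting the return value looks like when you do implement _read() on a Readable; getNextChunk() is a hypothetical synchronous helper standing in for whatever underlying source you are reading from:

var Readable = require('stream').Readable;
var util = require('util');

function SourceStream() {
    Readable.call(this);
}
util.inherits(SourceStream, Readable);

SourceStream.prototype._read = function(size) {
    var chunk;
    while ((chunk = getNextChunk()) !== null) {
        if (!this.push(chunk)) {
            // The internal buffer is full: stop pushing and wait for the
            // stream to call _read() again once the consumer catches up.
            return;
        }
    }
    this.push(null); // no more data
};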
For more on this, see:
Node.js Readable Stream _read Usage
What's the proper way to handle back-pressure in a node.js Transform stream?

Is there a better way to execute a node.js callback at regular intervals than setInterval?

I have a node.js script that writes a stream to an array like this:
var tempCrossSection = [];

stream.on('data', function(data) {
    tempCrossSection.push(data);
});
and another callback that empties the array and does some processing on the data like this:
var crossSection = [];

setInterval(function() {
    crossSection = tempCrossSection;
    tempCrossSection = [];
    someOtherFunction(crossSection, function(data) {
        console.log(data);
    });
}, 30000);
For the most part this works, but sometimes the setInterval callback will execute more than once in a 30000ms interval (and it is not a queued call sitting on the event loop). I have also done this as a cronJob with the same results. I am wondering if there is a way to ensure that setInterval executes only once per 30000ms. Perhaps there is a better solution altogether. Thanks.
When you have something asynchronous, you should use setTimeout instead; otherwise, if the asynchronous function takes too long, you'll end up with issues.
var crossSection = [];

setTimeout(function someFunction() {
    crossSection = tempCrossSection;
    tempCrossSection = [];
    someOtherFunction(crossSection, function(data) {
        console.log(data);
        setTimeout(someFunction, 30000);
    });
}, 30000);
Timers in javascript are not as reliable as many think. It sounds like you found that out already! The key is to measure the time elapsed since the last invocation of the timer's callback to decide if you should run in this cycle or not.
See http://www.sitepoint.com/creating-accurate-timers-in-javascript/
The approach there is to build a timer that fires frequently, say every second (more often than your target interval), and then decides on each tick whether it is time to fire your function.
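A minimal sketch of that idea, applied to the variables from the question (the one-second tick and 30-second threshold are just the values discussed above):

var lastRun = Date.now();

setInterval(function() {
    // fire frequently, but only do the real work once the full
    // 30 seconds have actually elapsed since the last run
    if (Date.now() - lastRun < 30000) return;
    lastRun = Date.now();

    crossSection = tempCrossSection;
    tempCrossSection = [];
    someOtherFunction(crossSection, function(data) {
        console.log(data);
    });
}, 1000);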
