Passing a stream only if digest passes - node.js

I've got a pipeline in an express.js module in which I take a file, decrypt it, pass it through a digest to ensure it is valid, and then want to return it as the response if the digest passes. The code looks something like this:
function GetFile(req, res) {
  ...
}).then(() => {
  var p1 = new Promise(function (resolve, reject) {
    digester = digestStream("md5", "hex", function (md5, len) {
      // compare md5 and length against expected values
      // what do i do if they don't match?
      resolve()
    })
  })
  infile.pipe(decrypter).pipe(digester).pipe(res)
  return p1
}).then(() => {
  ...
}
The problem is that once I pipe the output to res, the data is sent whether or not the digest passes. But if I don't pipe the digester's output to anything, then nothing happens - I guess there isn't pressure from the far end to pull the data through.
I could simply run the decryption pipeline twice, and in fact that is what was previously done, but I'm trying to speed things up so everything happens only once. One idea I had was to pipe the digester output to a buffer, and if the digest matches, then send the buffer to res. This requires memory proportional to the size of the file, which isn't horrible in most cases. However, I couldn't find much on how to .pipe() directly to a buffer. The closest thing I could find was the bl module; however, in the section where it demonstrates piping to a function that collects the data, this caveat is mentioned:
Note that when you use the callback method like this, the resulting
data parameter is a concatenation of all Buffer objects in the list.
If you want to avoid the overhead of this concatenation (in cases of
extreme performance consciousness), then avoid the callback method and
just listen to 'end' instead, like a standard Stream.
I'm not familiar enough with bl to understand what this really means with regard to how efficient this is. Specifically, I don't understand why it is talking about concatenating buffer objects (why is there more than one buffer object that must be concatenated, for example?). I'm not sure how I can follow its advice and still have a simple pipe either.

The bl module is going to collect buffers when it is piped to. How many buffers it collects depends on what the input stream does (each chunk the input emits arrives as its own Buffer). If you don't want to concatenate them together, store them in the BufferList, and if the hash passes, then pipe the BufferList to your output.
Something like this works for me:
function GetFile(req, res) {
  ...
  var bl
}).then(() => {
  var p1 = new Promise(function (resolve, reject) {
    digester = digestStream("md5", "hex", function (md5, len) {
      if (md5 != expectedmd5) throw "bad md5"
      if (len != expectedlen) throw "bad length"
      resolve()
    })
  })
  bl = new BufferList()
  infile.pipe(decrypter).pipe(digester).pipe(bl)
  return p1
}).then(() => {
  bl.pipe(res)
  ...
}
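If you would rather have a mismatch surface through the promise chain instead of being thrown inside the stream callback, a variant along these lines should also work (expectedmd5 and expectedlen stand in for your expected values, as above):
digester = digestStream("md5", "hex", function (md5, len) {
  // reject the promise on a mismatch so a later .catch() can
  // send an error response instead of piping bl to res
  if (md5 !== expectedmd5) {
    reject(new Error("bad md5"))
  } else if (len !== expectedlen) {
    reject(new Error("bad length"))
  } else {
    resolve()
  }
})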

Related

IORedis: how to publish ArrayBuffer

I'm trying to publish an ArrayBuffer to an IORedis stream.
I do so as follows:
const ab = new ArrayBuffer(1); // ArrayBuffer of length = 1 byte
const dv = new DataView(ab);
dv.setInt8(0, 7); // Write the number 7 in the buffer
const buffer = Buffer.from(ab); // Convert to Buffer since that's what `publish` expects
redisPublisher.publish('buffer-test', buffer);
It's a toy example; in practice I'll want to encode complex stuff in the ArrayBuffer, not just a number. Anyway, I then try to read with
redisSubscriber.on('message', async (channel, data) => {
  logger.info(`Redis message: channel: ${channel}, data: ${data}, ${typeof data}`);
  // ... do something with it
});
The problem is that data is empty, and its type is string. As per the documentation I tried redisSubscriber.on('messageBuffer', ...) instead, but it behaves exactly the same, so much so that I'm failing to understand the difference between the two.
Also confusing is that if I encode a Buffer, e.g.
const buffer = Buffer.from("I'm a string!", 'utf-8');
redisPublisher.publish('buffer-test', buffer);
Upon reception, data will again be a string, decoded from the Buffer, which in that toy case is OK but generally is not for me. I'd like to send a Buffer in, containing more complex data than just a string (an ArrayBuffer in my case), and get a Buffer out that I can parse based on my needs, rather than having it automatically read as a string.
Any help is welcome!
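For reference, a minimal sketch of how the two subscriber events are documented to differ in ioredis (the channel name is reused from the snippets above; the handlers are purely illustrative):
redisSubscriber.subscribe('buffer-test');

// 'message' delivers the payload as a string
redisSubscriber.on('message', (channel, data) => {
  console.log(typeof data); // 'string'
});

// 'messageBuffer' delivers both channel and payload as Buffers
redisSubscriber.on('messageBuffer', (channel, data) => {
  console.log(Buffer.isBuffer(data)); // true
  console.log(data.readInt8(0));      // 7 in the toy example above
});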

Reading data a block at a time, synchronously

What is the nodejs (typescript) equivalent of the following Python snippet? I've put an attempt at corresponding nodejs below the Python.
Note that I want to read a chunk at a time (later, that is; in this example I'm just reading the first kilobyte), synchronously.
Also, I do not want to read the entire file into virtual memory at once; some of my input files will (eventually) be too big for that.
The nodejs snippet always returns null. I want it to return a string or buffer or something along those lines. If the file is >= 1024 bytes long, I want a 1024-character-long return; otherwise I want the entire file.
I googled about this for an hour or two, but all I found were examples that synchronously read an entire file at a time, or read pieces at a time asynchronously.
Thanks!
Here's the Python:
def readPrefix(filename: str) -> str:
    with open(filename, 'rb') as infile:
        data = infile.read(1024)
        return data
Here's the nodejs attempt:
const readPrefix = (filename: string): string => {
  const readStream = fs.createReadStream(filename, { highWaterMark: 1024 });
  const data = readStream.read(1024);
  readStream.close();
  return data;
};
To read synchronously, you would use fs.openSync(), fs.readSync() and fs.closeSync().
Here's some regular Javascript code (hopefully you can translate it to TypeScript) that synchronously reads a certain number of bytes from a file and returns a buffer object containing those bytes (or throws an exception in case of error):
const fs = require('fs');

function readBytesSync(filePath, filePosition, numBytesToRead) {
  const buf = Buffer.alloc(numBytesToRead, 0);
  let fd;
  try {
    fd = fs.openSync(filePath, "r");
    fs.readSync(fd, buf, 0, numBytesToRead, filePosition);
  } finally {
    if (fd) {
      fs.closeSync(fd);
    }
  }
  return buf;
}
For your application, you can just pass 1024 as the number of bytes to read, and if there are fewer than that in the file, it will just read up to the end of the file. The returned buffer object will contain the bytes read, which you can access as binary or convert to a string.
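For example, a quick usage sketch (the file path here is just a placeholder):
// read the first 1024 bytes; short files come back zero-padded,
// since the buffer is pre-allocated to numBytesToRead
const prefix = readBytesSync('./some-input.bin', 0, 1024);
console.log(prefix.toString('utf8'));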
For the benefit of others reading this, I mentioned in earlier comments that synchronous I/O should never be used in a server environment (servers should always use asynchronous I/O except at startup time). Synchronous I/O can be used for stand-alone scripts that only do one thing (like build scripts, as an example) and don't need to be responsive to multiple incoming requests.
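Since the advice above is to prefer asynchronous I/O in server code, a rough sketch of an async counterpart using the promise-based fs API (fs.promises, available in newer Node versions) might look like this; it mirrors readBytesSync but does not block the event loop:
const fsp = require('fs').promises;

// async counterpart to readBytesSync(): same read-a-prefix idea, non-blocking
async function readBytes(filePath, filePosition, numBytesToRead) {
  const buf = Buffer.alloc(numBytesToRead, 0);
  const fileHandle = await fsp.open(filePath, 'r');
  try {
    await fileHandle.read(buf, 0, numBytesToRead, filePosition);
  } finally {
    await fileHandle.close();
  }
  return buf;
}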
Do I need to loop on readSync() in case of EINTR or something?
Not that I'm aware of.

How do I apply back-pressure to node streams?

While attempting to experiment with Node.JS streams I ran into an interesting conundrum. When the input (Readable) stream pushes more data than the destination (Writable) cares about, I was unable to apply back-pressure correctly.
The two methods I attempted were to return false from Writable.prototype._write and to retain a reference to the Readable so I could call Readable.pause() from the Writable. Neither solution helped much, as I'll explain.
In my exercise (the full source of which you can view as a Gist) I have three streams:
Readable - PasscodeGenerator
util.inherits(PasscodeGenerator, stream.Readable);

function PasscodeGenerator(prefix) {
  stream.Readable.call(this, {objectMode: true});
  this.count = 0;
  this.prefix = prefix || '';
}

PasscodeGenerator.prototype._read = function() {
  var passcode = '' + this.prefix + this.count;
  if (!this.push({passcode: passcode})) {
    this.pause();
    this.once('drain', this.resume.bind(this));
  }
  this.count++;
};
I thought that the return code from this.push() was enough to self pause and wait for the drain event to resume.
Transform - Hasher
util.inherits(Hasher, stream.Transform);

function Hasher(hashType) {
  stream.Transform.call(this, {objectMode: true});
  this.hashType = hashType;
}

Hasher.prototype._transform = function(sample, encoding, next) {
  var hash = crypto.createHash(this.hashType);
  hash.setEncoding('hex');
  hash.write(sample.passcode);
  hash.end();
  sample.hash = hash.read();
  this.push(sample);
  next();
};
Simply add the hash of the passcode to the object.
Writable - SampleConsumer
util.inherits(SampleConsumer, stream.Writable);

function SampleConsumer(max) {
  stream.Writable.call(this, {objectMode: true});
  this.max = (max != null) ? max : 10;
  this.count = 0;
}

SampleConsumer.prototype._write = function(sample, encoding, next) {
  this.count++;
  console.log('Hash %d (%s): %s', this.count, sample.passcode, sample.hash);
  if (this.count < this.max) {
    next();
  } else {
    return false;
  }
};
Here I want to consume the data as fast as possible until I reach my max number of samples and then end the stream. I tried using this.end() instead of return false, but that caused the dreaded "write called after end" problem. Returning false does stop everything if the sample size is small, but when it is large I get an out-of-memory error:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Aborted (core dumped)
According to this SO answer, in theory the Writable stream would return false, causing the streams to buffer until the buffers were full (16 by default in objectMode), and eventually the Readable would call its this.pause() method. But 16 + 16 + 16 = 48; that's 48 objects in the buffers before things fill up and the system clogs. Actually fewer, because there is no cloning involved, so the object passed between them is the same reference. Would that not mean only 16 objects in memory until the high water mark halts everything?
Lastly, I realize I could have the Writable reference the Readable to call its pause method using closures. However, this solution means the Writable stream knows too much about another object. I'd have to pass in a reference:
var foo = new PasscodeGenerator('foobar');
foo
.pipe(new Hasher('md5'))
.pipe(new SampleConsumer(samples, foo));
And this feels out of the norm for how streams would work. I thought back-pressure was enough to cause a Writable to stop a Readable from pushing data and prevent out-of-memory errors.
An analogous example would be the Unix head command. Implementing that in Node, I would assume the destination could end, rather than just ignoring data and thereby causing the source to keep pushing even though the destination already has enough to satisfy the beginning portion of the file.
How do I idiomatically construct custom streams such that, when the destination is ready to end, the source stream doesn't attempt to push more data?
This is a known issue with how _read() is called internally. Since your _read() is always pushing synchronously/immediately, the internal stream implementation can get into a loop in the right conditions. _read() implementations are generally expected to do some sort of async I/O (e.g. reading from disk or network).
The workaround for this is to make your _read() asynchronous at least some of the time. You could also just make it async every time it's called with:
PasscodeGenerator.prototype._read = function(n) {
  var passcode = '' + this.prefix + this.count;
  var self = this;
  // `setImmediate()` delays the push until the beginning
  // of the next tick of the event loop
  setImmediate(function() {
    self.push({passcode: passcode});
  });
  this.count++;
};
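With that change, a minimal wiring sketch using the constructors defined in the question (no reference passing needed) would be:
var gen = new PasscodeGenerator('foobar');
gen
  .pipe(new Hasher('md5'))
  .pipe(new SampleConsumer(10));
Once the SampleConsumer stops calling next(), the internal buffers fill to their high-water marks and the generator is no longer asked for more data.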

When calling Edge.js from C#, how do you hook stdout and stderr?

Background
I am working on a C# program which currently runs Node via Process.Start(). I am capturing the stdout and stderr from this child process and redirecting it for my own reasons. I am looking into replacing the invocation of Node.exe with a call to Edge.js instead. In order to be able to do this I must be able to reliably capture stdout and stderr from the Javascript running within Edge, and get the messages back into my C# application.
Approach 1
I'll describe this approach for completeness in case anybody recommends it :)
If the Edge process terminates, it is fairly easy to deal with this by simply declaring a msgs array, overwriting process.stdout.write and process.stderr.write with new functions that accumulate messages on that array, and then, at the end, simply returning the msgs array. Example:
var msgs = [];

process.stdout.write = function (string) {
    msgs.push({ stream: 'o', message: string });
};
process.stderr.write = function (string) {
    msgs.push({ stream: 'e', message: string });
};

// Return to caller.
var result = { messages: msgs, ...other stuff... };
callback(null, result);
Obviously this only works if the Edge code terminates, and msgs may grow large in the worst case. However, it is likely to perform well because only one marshalling call is necessary to get all the messages back.
Approach 2
This is a little harder to explain. Instead of accumulating messages, we "hook" stdout and stderr using a delegate we send in from C#. In the C#, we create an object that we will pass into Edge, and that object has a property called stdoutHook:
dynamic payload = new ExpandoObject();
payload.stdoutHook = GetStdoutHook();

public Func<object, Task<object>> GetStdoutHook()
{
    Func<object, Task<object>> hook = (message) =>
    {
        TheLogger.LogMessage((message as string).Trim());
        return Task.FromResult<object>(null);
    };
    return hook;
}
I could really get away with an Action, but Edge appears to require the Func<object, Task<object>>; it won't proxy the function otherwise. Then, in the Javascript, we can detect that function and use it like this:
var func = Edge.Func(#"
return function(payload, callback) {
if (typeof (payload.stdoutHook) === 'function') {
process.stdout.write = payload.stdoutHook;
}
// do lots of stuff while stdout and stderr are hooked...
var what = require('whatever');
what.futz();
// terminate.
callback(null, result);
}");
dynamic result = func(payload).Result;
Questions
Q1. Both of these techniques seem to work, but is there a better way of doing this, something built into Edge perhaps that I have missed? Both solutions are invasive - they require some shim code to wrap the actual work that is to be done in Edge. This is not the end of the world, but it would be better if there were a non-invasive method.
Q2. In approach 2, where I have to return a task here
return Task.FromResult<object>(null);
it feels wrong to be returning an already completed "null task". But is there another way of writing this?
Q3. Do I need to be more rigorous in the Javascript code when hooking stdout and stderr? I note there is this code in double-edge.js; frankly, I am not sure what is happening here, but it is quite a bit more complex than my crude overwriting of process.stdout.write :-)
// Fix #176 for GUI applications on Windows
try {
    var stdout = process.stdout;
}
catch (e) {
    // This is a Windows GUI application without stdout and stderr defined.
    // Define process.stdout and process.stderr so that all output is discarded.
    (function () {
        var stream = require('stream');
        var NullStream = function (o) {
            stream.Writable.call(this);
            this._write = function (c, e, cb) { cb && cb(); };
        }
        require('util').inherits(NullStream, stream.Writable);
        var nullStream = new NullStream();
        process.__defineGetter__('stdout', function () { return nullStream; });
        process.__defineGetter__('stderr', function () { return nullStream; });
    })();
}
Q1: There isn't anything built into Edge that would make capturing stdout or stderr of Node.js code automatic when calling Node from CLR. At some point I thought of writing an extension of Edge that would make marshaling Streams across the CLR/V8 boundary easy. Under the hood it would be very similar to your Approach 2. It could be done as a standalone module on top of Edge.
Q2: Returning a completed task is very appropriate in this case. Your function has captured the Node.js output, processed it, and has in fact "completed" in that sense. Returning a task completed with Null is really the moral equivalent of returning from an Action.
Q3: The code you are pointing to is only relevant in Windows GUI applications, not Console applications. If you are writing a Console application, simply overriding write should suffice at the level of the Node.js code you pass to Edge.js. Note that the signature of write in Node allows an optional encoding parameter to be passed in. You seem to ignore it in both Approach 1 and Approach 2. In particular, in Approach 2 I would suggest wrapping the JavaScript proxy to the C# callback in a JavaScript function that normalizes the parameters before assigning it to process.stdout.write. Otherwise, Edge.js code may assume that the encoding parameter passed to a write call is a callback function, which would follow the Edge.js calling convention.
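To illustrate the suggestion for Q3, here is a rough sketch of such a normalizing wrapper (the node-style callback shape of the proxied C# hook is an assumption here):
if (typeof payload.stdoutHook === 'function') {
    process.stdout.write = function (chunk, encoding, callback) {
        // write(chunk[, encoding][, callback]): shuffle the arguments so an
        // encoding string is never mistaken for an Edge.js callback
        if (typeof encoding === 'function') {
            callback = encoding;
            encoding = undefined;
        }
        payload.stdoutHook(chunk.toString(), function (error) {
            if (callback) callback(error);
        });
        return true;
    };
}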

Basic streams issue: Difficulty sending a string to stdout

I'm just starting to learn about streams in node. I have a string in memory and I want to put it in a stream that applies a transformation and pipe it through to process.stdout. Here is my attempt to do it:
var through = require('through');
var stream = through(function write(data) {
  this.push(data.toUpperCase());
});
stream.push('asdf');
stream.pipe(process.stdout);
stream.end();
It does not work. When I run the script on the cli via node, nothing is sent to stdout and no errors are thrown. A few questions I have:
If you have a value in memory that you want to put into a stream, what is the best way to do it?
What is the difference between push and queue?
Does it matter if I call end() before or after calling pipe()?
Is end() equivalent to push(null)?
Thanks!
Just use the vanilla stream API
var Transform = require("stream").Transform;
// create a new Transform stream
var stream = new Transform({
decodeStrings: false,
encoding: "ascii"
});
// implement the _transform method
stream._transform = function _transform(str, enc, done) {
this.push(str.toUpperCase() + "\n";
done();
};
// connect to stdout
stream.pipe(process.stdout);
// write some stuff to the stream
stream.write("hello!");
stream.write("world!");
// output
// HELLO!
// WORLD!
Or you can build your own stream constructor. This is really the way the stream API is intended to be used:
var Transform = require("stream").Transform;
function MyStream() {
// call Transform constructor with `this` context
// {decodeStrings: false} keeps data as `string` type instead of `Buffer`
// {encoding: "ascii"} sets the encoding for our strings
Transform.call(this, {decodeStrings: false, encoding: "ascii"});
// our function to do "work"
function _transform(str, encoding, done) {
this.push(str.toUpperCase() + "\n");
done();
}
// export our function
this._transform = _transform;
}
// extend the Transform.prototype to your constructor
MyStream.prototype = Object.create(Transform.prototype, {
constructor: {
value: MyStream
}
});
Now use it like this
// instantiate
var a = new MyStream();
// pipe to a destination
a.pipe(process.stdout);
// write data
a.write("hello!");
a.write("world!");
Output
HELLO!
WORLD!
Some other notes about .push vs .write.
.write(str) adds data to the writable buffer. It is meant to be called externally. If you think of a stream like a duplex file handle, it's just like fwrite, only buffered.
.push(str) adds data to the readable buffer. It is only intended to be called from within our stream.
.push(str) can be called many times. Watch what happens if we change our function to
function _transform(str, encoding, done) {
  this.push(str.toUpperCase());
  this.push(str.toUpperCase());
  this.push(str.toUpperCase() + "\n");
  done();
}
Output
HELLO!HELLO!HELLO!
WORLD!WORLD!WORLD!
First, you want to use write(), not push(). write() puts data into the stream, push() pushes data out of the stream; you only use push() when implementing your own Readable, Duplex, or Transform streams.
Second, you'll only want to write() data to the stream after you've set up the pipe() (or added some event listeners). If you write to a stream with nothing wired to the other end, the data you've written will be lost. As @naomik pointed out, this isn't true in general since a Writable stream will buffer write()s. In your example you do need to write() after pipe(), though. Otherwise, the process will end before writing anything to STDOUT. This is possibly due to how the through module is implemented, but I don't know that for sure.
So, with that in mind, you can make a couple simple changes to your example to get it to work:
var through = require('through');
var stream = through(function write(data) {
  this.push(data.toUpperCase());
});
stream.pipe(process.stdout);
stream.write('asdf');
stream.end();
Now, for your questions:
The easiest way to get data from memory into a writable stream is to simply write() it, just like we're doing with stream.write('asdf') in your example.
As far as I know, the stream doesn't have a queue() function; did you mean write()? Like I said above, write() is used to put data into a stream, and push() is used to push data out of the stream. Only call push() in your own stream implementations.
Only call end() after all your data has been written to your stream. end() basically says: "Ok, I'm done now. Please finish what you're doing and close the stream."
push(null) is pretty much equivalent to end(). That being said, don't call push(null) unless you're doing it inside your own stream implementation (as stated above). It's almost always more appropriate to call end().
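To illustrate that last point, here is a tiny sketch (not from the question) of push(null) signalling end-of-stream from inside a custom Readable:
var Readable = require('stream').Readable;

// a one-shot source: emit a single chunk, then signal end-of-stream
var source = new Readable();
source._read = function () {
  this.push('asdf\n');
  this.push(null); // "no more data" - the stream will emit 'end'
};

source.pipe(process.stdout);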
Based on the examples for stream (http://nodejs.org/api/stream.html#stream_readable_pipe_destination_options) and through (https://www.npmjs.org/package/through), it doesn't look like you are using your stream correctly... What happens if you use write(...) instead of push(...)?
