How to write batches of data in a NodeJS stream pipeline?

I have a function in which I read a CSV file as a readable stream using the "pipeline" method, split it by rows, and transform the data of each row, then add the data to an array. When the pipeline finishes, I insert all the data into a database.
This is the relevant part of the code:
pipeline(storageStream as Readable, split(), this.FilterPipe(), this.MapData(result));
public MapData(result: Array<string>): MapStream {
  return mapSync((filteredData: string) => {
    const trimmed: string = filteredData.trim();
    if (trimmed.length !== 0) {
      result.push(trimmed);
    }
  });
}
We have sometimes hit memory limits since we started uploading a large number of very large CSV files, so we decided to split the logic into insertion batches so that we don't hold so much data in memory at once.
So I thought of handling the read data in batches: for every batch (say, 100 rows of the file), I would trigger the "MapData" function and insert the resulting array into the DB.
Is there any way to add a condition so that MapData is triggered every X rows?
Or is there any other solution that might meet the requirement?
Thanks in advance!

The following code shows a transform stream that buffers incoming objects (or arrays of objects) until it has 100 of them and then pushes them onwards as an array:
var t = new stream.Transform({
  objectMode: true,
  transform(chunk, encoding, callback) {
    this.buffer = (this.buffer || []).concat(chunk);
    if (this.buffer.length >= 100) {
      this.push(this.buffer);
      this.buffer = [];
    }
    callback();
  },
  flush(callback) {
    if (this.buffer.length > 0) this.push(this.buffer);
    callback();
  }
}).on("data", console.log);
for (var i = 0; i < 250; i++) t.write(i);
t.end();
You can include such a transform stream in your pipeline.
And here's the same in TypeScript. It can probably be done more elegantly, but I am no TypeScript expert.
class MyTransform extends Transform {
  buffer: Array<any>;
}
var t = new MyTransform({
  objectMode: true,
  transform(chunk, encoding, callback) {
    var that = this as MyTransform;
    that.buffer = (that.buffer || []).concat(chunk);
    if (that.buffer.length >= 100) {
      this.push(that.buffer);
      that.buffer = [];
    }
    callback();
  },
  flush(callback) {
    var that = this as MyTransform;
    if (that.buffer.length > 0) this.push(that.buffer);
    callback();
  }
}).on("data", console.log);
for (var i = 0; i < 250; i++) t.write(i);
t.end();
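For completeness, here is a rough sketch of how such a batching transform could slot into the original pipeline and feed a batched DB insert. BATCH_SIZE and everything marked with declare (storageStream, split, insertBatch) are assumptions standing in for the question's own code, not part of the answer above:
import { pipeline, Readable, Transform, Writable } from "stream";

// Stand-ins for things that already exist in the question's code base.
declare const storageStream: Readable;                        // the CSV source stream
declare function split(): Transform;                          // the line-splitter already used in the question
declare function insertBatch(rows: string[]): Promise<void>;  // hypothetical async DB insert

const BATCH_SIZE = 100; // assumed batch size
let buffer: string[] = [];

const batcher = new Transform({
  objectMode: true,
  transform(line, _enc, callback) {
    const trimmed = String(line).trim();
    if (trimmed.length !== 0) buffer.push(trimmed);
    if (buffer.length >= BATCH_SIZE) {
      this.push(buffer);   // emit a full batch downstream
      buffer = [];
    }
    callback();
  },
  flush(callback) {
    if (buffer.length > 0) this.push(buffer); // emit the final partial batch
    callback();
  },
});

const inserter = new Writable({
  objectMode: true,
  write(batch: string[], _enc, callback) {
    insertBatch(batch)            // back-pressure: next batch is only read after the insert resolves
      .then(() => callback())
      .catch(callback);
  },
});

pipeline(storageStream, split(), batcher, inserter, (err) => {
  if (err) console.error("Pipeline failed:", err);
});
Because the DB insert happens in a Writable inside the pipeline, only one batch is held in memory at a time instead of the whole file.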

Related

Why is the first function call executed two times faster than all other sequential calls?

I have a custom JS iterator implementation and code for measuring the performance of that implementation:
const ITERATION_END = Symbol('ITERATION_END');
const arrayIterator = (array) => {
  let index = 0;
  return {
    hasValue: true,
    next() {
      if (index >= array.length) {
        this.hasValue = false;
        return ITERATION_END;
      }
      return array[index++];
    },
  };
};
const customIterator = (valueGetter) => {
  return {
    hasValue: true,
    next() {
      const nextValue = valueGetter();
      if (nextValue === ITERATION_END) {
        this.hasValue = false;
        return ITERATION_END;
      }
      return nextValue;
    },
  };
};
const map = (iterator, selector) => customIterator(() => {
  const value = iterator.next();
  return value === ITERATION_END ? value : selector(value);
});
const filter = (iterator, predicate) => customIterator(() => {
  if (!iterator.hasValue) {
    return ITERATION_END;
  }
  let currentValue = iterator.next();
  while (iterator.hasValue && currentValue !== ITERATION_END && !predicate(currentValue)) {
    currentValue = iterator.next();
  }
  return currentValue;
});
const toArray = (iterator) => {
  const array = [];
  while (iterator.hasValue) {
    const value = iterator.next();
    if (value !== ITERATION_END) {
      array.push(value);
    }
  }
  return array;
};
const test = (fn, iterations) => {
  const times = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    fn();
    times.push(performance.now() - start);
  }
  console.log(times);
  console.log(times.reduce((sum, x) => sum + x, 0) / times.length);
}
const createData = () => Array.from({ length: 9000000 }, (_, i) => i + 1);
const testIterator = (data) => () => toArray(map(filter(arrayIterator(data), x => x % 2 === 0), x => x * 2))
test(testIterator(createData()), 10);
The output of the test function is very weird and unexpected: the first test run is consistently executed two times faster than all the other runs. Here is one of the results, where the array contains all execution times and the number below it is the mean (I ran it on Node):
[
147.9088459983468,
396.3472499996424,
374.82447600364685,
367.74555300176144,
363.6300039961934,
362.44370299577713,
363.8418449983001,
390.86111199855804,
360.23125199973583,
358.4788999930024
]
348.6312940984964
Similar results can be observed using the Deno runtime; however, I could not reproduce this behaviour on other JS engines. What could be the reason for this in V8?
Environment:
Node v13.8.0, V8 v7.9.317.25-node.28,
Deno v1.3.3, V8 v8.6.334
(V8 developer here.) In short: it's inlining, or lack thereof, as decided by engine heuristics.
For an optimizing compiler, inlining a called function can have significant benefits (e.g.: avoids the call overhead, sometimes makes constant folding possible, or elimination of duplicate computations, sometimes even creates new opportunities for additional inlining), but comes at a cost: it makes the compilation itself slower, and it increases the risk of having to throw away the optimized code ("deoptimize") later due to some assumption that turns out not to hold. Inlining nothing would waste performance, inlining everything would waste performance, inlining exactly the right functions would require being able to predict the future behavior of the program, which is obviously impossible. So compilers use heuristics.
V8's optimizing compiler currently has a heuristic to inline functions only if it was always the same function that was called at a particular place. In this case, that's the case for the first iterations. Subsequent iterations then create new closures as callbacks, which from V8's point of view are new functions, so they don't get inlined. (V8 actually knows some advanced tricks that allow it to de-duplicate function instances coming from the same source in some cases and inline them anyway; but in this case those are not applicable [I'm not sure why]).
So in the first iteration, everything (including x => x % 2 === 0 and x => x * 2) gets inlined into toArray. From the second iteration onwards, that's no longer the case, and instead the generated code performs actual function calls.
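To make the "new closures are new functions" point concrete, here is a tiny illustration (the names are made up and it is not the original benchmark): every call to build() produces a distinct function object from identical source text, so the call site inside consume() stops seeing "always the same function" after the first run and the callback no longer gets inlined there.
// Each call to build() returns a *new* closure object, even though the source text
// is identical; the inlining heuristic described above keys on function identity
// at the call site.
function build() {
  return (x: number) => x * 2;
}
function consume(double: (x: number) => number): number {
  let sum = 0;
  for (let i = 0; i < 1_000_000; i++) {
    sum += double(i); // this call site sees a different callee on every consume(build()) run
  }
  return sum;
}
// First run: only one callee has ever been observed at the call site, so it can be inlined.
// Later runs: new closures from the same source have a different identity, so inlining is skipped.
for (let run = 0; run < 3; run++) {
  console.log(consume(build()));
}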
That's probably fine; I would guess that in most real applications, the difference is barely measurable. (Reduced test cases tend to make such differences stand out more; but changing the design of a larger app based on observations made on a small test is often not the most impactful way to spend your time, and at worst can make things worse.)
Also, hand-optimizing code for engines/compilers is a difficult balance. I would generally recommend not to do that (because engines improve over time, and it really is their job to make your code fast); on the other hand, there clearly is more efficient code and less efficient code, and for maximum overall efficiency, everyone involved needs to do their part, i.e. you might as well make the engine's job simpler when you can.
If you do want to fine-tune performance of this, you can do so by separating code and data, thereby making sure that always the same functions get called. For example like this modified version of your code:
const ITERATION_END = Symbol('ITERATION_END');
class ArrayIterator {
  constructor(array) {
    this.array = array;
    this.index = 0;
  }
  next() {
    if (this.index >= this.array.length) return ITERATION_END;
    return this.array[this.index++];
  }
}
function arrayIterator(array) {
  return new ArrayIterator(array);
}
class MapIterator {
  constructor(source, modifier) {
    this.source = source;
    this.modifier = modifier;
  }
  next() {
    const value = this.source.next();
    return value === ITERATION_END ? value : this.modifier(value);
  }
}
function map(iterator, selector) {
  return new MapIterator(iterator, selector);
}
class FilterIterator {
  constructor(source, predicate) {
    this.source = source;
    this.predicate = predicate;
  }
  next() {
    let value = this.source.next();
    while (value !== ITERATION_END && !this.predicate(value)) {
      value = this.source.next();
    }
    return value;
  }
}
function filter(iterator, predicate) {
  return new FilterIterator(iterator, predicate);
}
function toArray(iterator) {
  const array = [];
  let value;
  while ((value = iterator.next()) !== ITERATION_END) {
    array.push(value);
  }
  return array;
}
function test(fn, iterations) {
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    fn();
    console.log(performance.now() - start);
  }
}
function createData() {
  return Array.from({ length: 9000000 }, (_, i) => i + 1);
};
function even(x) { return x % 2 === 0; }
function double(x) { return x * 2; }
function testIterator(data) {
  return function main() {
    return toArray(map(filter(arrayIterator(data), even), double));
  };
}
test(testIterator(createData()), 10);
Observe how there are no more dynamically created functions on the hot path, and the "public interface" (i.e. the way arrayIterator, map, filter, and toArray compose) is exactly the same as before, only under-the-hood details have changed. A benefit of giving all functions names is that you get more useful profiling output ;-)
Astute readers will notice that this modification only shifts the issue away: if you have several places in your code that call map and filter with different modifiers/predicates, then the inlineability issue will come up again. As I said above: microbenchmarks tend to be misleading, as real apps typically have different behavior...
(FWIW, this is pretty much the same effect as in "Why is the execution time of this function call changing?".)
Just to add to this investigation, I compared the OP's original code, with the predicate and selector functions declared as separate named functions as suggested by jmrk, to two other implementations. So, this code has three implementations:
OP's code with predicate and selector functions declared separately as named functions (not inline).
Using standard array.map() and .filter() (which you would think would be slower because of the extra creation of intermediate arrays)
Using a custom iteration that does both filtering and mapping in one iteration
The OP's attempt at saving time and making things faster is actually the slowest (on average). The custom iteration is the fastest.
I guess the lesson here is that it's not necessarily intuitive how you make things faster with the optimizing compiler, so if you're tuning performance, you have to measure against the "typical" way of doing things (which may benefit from the most optimizations).
Also, note that in method #3, the first two iterations are the slowest and then it gets faster, which is the opposite effect from the original code. Go figure.
The results are here:
[
99.90320014953613,
253.79690098762512,
271.3091011047363,
247.94990015029907,
247.457200050354,
261.9487009048462,
252.95090007781982,
250.8520998954773,
270.42809987068176,
249.340900182724
]
240.59370033740998
[
222.14270091056824,
220.48679995536804,
224.24630093574524,
237.07260012626648,
218.47070002555847,
218.1493010520935,
221.50559997558594,
223.3587999343872,
231.1618001461029,
243.55419993400574
]
226.01488029956818
[
147.81360006332397,
144.57479882240295,
73.13350009918213,
79.41700005531311,
77.38950109481812,
78.40880012512207,
112.31539988517761,
80.87990117073059,
76.7899010181427,
79.79679894447327
]
95.05192012786866
The code is here:
const { performance } = require('perf_hooks');
const ITERATION_END = Symbol('ITERATION_END');
const arrayIterator = (array) => {
  let index = 0;
  return {
    hasValue: true,
    next() {
      if (index >= array.length) {
        this.hasValue = false;
        return ITERATION_END;
      }
      return array[index++];
    },
  };
};
const customIterator = (valueGetter) => {
  return {
    hasValue: true,
    next() {
      const nextValue = valueGetter();
      if (nextValue === ITERATION_END) {
        this.hasValue = false;
        return ITERATION_END;
      }
      return nextValue;
    },
  };
};
const map = (iterator, selector) => customIterator(() => {
  const value = iterator.next();
  return value === ITERATION_END ? value : selector(value);
});
const filter = (iterator, predicate) => customIterator(() => {
  if (!iterator.hasValue) {
    return ITERATION_END;
  }
  let currentValue = iterator.next();
  while (iterator.hasValue && currentValue !== ITERATION_END && !predicate(currentValue)) {
    currentValue = iterator.next();
  }
  return currentValue;
});
const toArray = (iterator) => {
  const array = [];
  while (iterator.hasValue) {
    const value = iterator.next();
    if (value !== ITERATION_END) {
      array.push(value);
    }
  }
  return array;
};
const test = (fn, iterations) => {
  const times = [];
  let result;
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    result = fn();
    times.push(performance.now() - start);
  }
  console.log(times);
  console.log(times.reduce((sum, x) => sum + x, 0) / times.length);
  return result;
}
const createData = () => Array.from({ length: 9000000 }, (_, i) => i + 1);
const cache = createData();
const comp1 = x => x % 2 === 0;
const comp2 = x => x * 2;
const testIterator = (data) => () => toArray(map(filter(arrayIterator(data), comp1), comp2))
// regular array filter and map
const testIterator2 = (data) => () => data.filter(comp1).map(comp2);
// combine filter and map in same operation
const testIterator3 = (data) => () => {
  let result = [];
  for (let value of data) {
    if (comp1(value)) {
      result.push(comp2(value));
    }
  }
  return result;
}
const a = test(testIterator(cache), 10);
const b = test(testIterator2(cache), 10);
const c = test(testIterator3(cache), 10);
function compareArrays(a1, a2) {
  if (a1.length !== a2.length) return false;
  for (let [i, val] of a1.entries()) {
    if (a2[i] !== val) return false;
  }
  return true;
}
console.log(a.length);
console.log(compareArrays(a, b));
console.log(compareArrays(a, c));

NodeJS: write into an array and read simultaneously

I have a while loop that loads about 10000 entries into an array, and then another function pops them one at a time to be used as test inputs. The process of generating and loading those 10000 entries takes a bit of time. I'm looking for a way to do this more asynchronously, i.e. once 50 entries have been created, the method that uses that input can be called, while generation continues until it reaches 10000.
The answer is in TypeScript. The idea is to generate the test cases using a generator (ES6-specific), then use a Readable to buffer the generated test cases. Finally, the tester is represented by a Transform stream that tests each piece of data given to it and either throws an exception (or simply ignores a failing test), or passes on an appropriate message if the test case passes. Simply pipe the test generator (Readable) to the tester (Transform), and possibly pipe that to some output stream to write out passed and failed test cases.
Code (TypeScript):
class InputGen<T> extends Readable {
  constructor(gen: IterableIterator<T>, count: number) {
    super({
      objectMode: true,
      highWaterMark: 50,
      read: (size?: number) => {
        if (count < 0) {
          this.push(null);
        } else {
          count--;
          let testData = gen.next();
          this.push(testData.value);
          if (testData.done) {
            count = -1;
          }
        }
      }
    });
  }
}
class Tester extends Transform {
  constructor() {
    super({
      objectMode: true,
      transform: (data: any, enc: string, cb: Function) => {
        // test data
        if (/* data passes the test */!!data) {
          cb(null, data);
        } else {
          cb(new Error("Data did not pass the test")); // OR cb() to skip the data
        }
      }
    });
  }
}
Usage:
new InputGen(function *() {
  for (let v = 0; v < 100001; v++) {
    yield v; // Some test case
  }
}(), 10000).pipe(new Tester); // pipe to an output stream if necessary
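If you do want that final output stage, here is a possible sketch; the collecting Writable below is my own addition (not part of the answer above) and just shows one way to consume the tested cases downstream:
import { Writable } from "stream"; // Readable and Transform are assumed imported as in the code above

const passed: any[] = [];
const sink = new Writable({
  objectMode: true,
  write: (testCase: any, enc: string, cb: Function) => {
    passed.push(testCase); // collect every case the Tester let through
    cb();
  }
});

new InputGen(function *() {
  for (let v = 0; v < 100001; v++) {
    yield v; // Some test case
  }
}(), 10000)
  .pipe(new Tester())
  .on("error", (err: Error) => console.error("A test case failed:", err.message))
  .pipe(sink)
  .on("finish", () => console.log(passed.length + " test cases passed"));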

How to cache outer scope values to be used inside of Async NodeJS calls?

Something like the code below illustrates my intention, if you can imagine how a naive programmer would probably try to write this the first time:
function (redisUpdatesHash) {
  var redisKeys = Object.keys(redisUpdatesHash);
  for (var i = 0; i < redisKeys.length; i++) {
    var key = redisKeys[i];
    redisClient.get(key, function (err, value) {
      if (value != redisUpdatesHash[key]) {
        redisClient.set(key, redisUpdatesHash[key]);
        redisClient.publish(key + "/notifications", redisUpdatesHash[key]);
      }
    });
  }
}
The problem is, predictably, that key holds the wrong value in the callback scopes because of the asynchronous nature of the node_redis callbacks. The method of detection is really primitive because of security restrictions out of my control, so the only option for me was to resort to polling the source for its state. The intention above is to store that state in Redis so that I can compare it during the next poll to determine whether it changed. If it has, I publish an event and store the new value as the comparison value for the next polling cycle.
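To make the failure mode concrete, here is a minimal sketch (the keys are made up, and setTimeout stands in for the async redisClient.get call): with var there is a single key binding for the whole function, so by the time any callback runs the loop has already finished and every callback sees the last key.
var keys = ['sensor:1', 'sensor:2', 'sensor:3'];
for (var i = 0; i < keys.length; i++) {
  var key = keys[i];
  // setTimeout stands in for the asynchronous redisClient.get call
  setTimeout(function () {
    console.log(key); // prints "sensor:3" three times
  }, 0);
}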
It appears that there's no good way to do this in NodeJS... I'm open to any suggestions, whether that's fixing the above code so it can perform this check, or a different way of doing this entirely.
I solved this problem by using function currying to cache the outer values in a closure.
In vanilla JavaScript/NodeJS:
var asyncCallback = function (newValue, redisKey, redisValue) {
  if (newValue != redisValue) {
    redisClient.set(redisKey, newValue, handleRedisError);
    redisClient.publish(redisKey + '/notifier', newValue, handleRedisError);
  }
};
var curriedAsyncCallback = function (newValue) {
  return function (redisKey) {
    // node_redis invokes the callback as (err, value), so accept err explicitly
    return function (err, redisValue) {
      asyncCallback(newValue, redisKey, redisValue);
    };
  };
};
var newResults = getNewResults(),
    redisKeys = Object.keys(newResults);
for (var i = 0; i < redisKeys.length; i++) {
  redisClient.get(redisKeys[i], curriedAsyncCallback(newResults[redisKeys[i]])(redisKeys[i]));
}
However, I ended up using HighlandJS to help with the currying and iteration.
var _ = require('highland'),
    //...
    asyncCallback = function (newValue, redisKey, redisValue) {
      if (newValue != redisValue) {
        redisClient.set(redisKey, newValue, handleRedisError);
        redisClient.publish(redisKey + '/notifier', newValue, handleRedisError);
      }
    };
var newResults = getNewResults(),
    redisKeys = Object.keys(newResults),
    curriedAsyncCallback = _.curry(asyncCallback),
    // wrapCallback turns the node-style get(key, cb) into a function that returns a Highland stream
    redisGet = _.wrapCallback(redisClient.get.bind(redisClient));
_(redisKeys).each(function (key) {
  redisGet(key).each(curriedAsyncCallback(newResults[key], key));
});

How do you implement a stream that properly handles backpressure in node.js?

I can't for the life of me figure out how to implement a stream that properly handles backpressure. Should you never use pause and resume?
I have this implementation I'm trying to get to work correctly:
var StreamPeeker = exports.StreamPeeker = function(myStream, callback) {
  stream.Readable.call(this, {highWaterMark: highWaterMark})
  this.stream = myStream
  myStream.on('readable', function() {
    var data = myStream.read(5000)
    //process.stdout.write("Eff: "+data)
    if(data !== null) {
      if(!this.push(data)) {
        process.stdout.write("Pause")
        this.pause()
      }
      callback(data)
    }
  }.bind(this))
  myStream.on('end', function() {
    this.push(null)
  }.bind(this))
}
util.inherits(StreamPeeker, stream.Readable)
StreamPeeker.prototype._read = function() {
  process.stdout.write("resume")
  //this.resume() // putting this in for some reason causes the stream to not output???
}
It correctly sends output, but doesn't correctly produce backpressure. How can I change it to properly support backpressure?
OK, I finally figured it out after lots of trial and error. A couple of guidelines:
Never ever use pause or resume (otherwise it'll go into legacy "flowing" mode)
Never add a "data" event listener (otherwise it'll go into legacy "flowing" mode)
It's the implementor's responsibility to keep track of when the source is readable
It's the implementor's responsibility to keep track of when the destination wants more data
The implementation should not read any data until the _read method is called
The argument to read tells the source to give it that many bytes; it's probably best to pass the argument given to this._read into the source's read method. That way you should be able to configure how much to read at a time at the destination, and the rest of the stream chain should be automatic.
So this is what I changed it to:
Update: I created a Readable that is much easier to implement with proper back-pressure, and should have just as much flexibility as node's native streams.
var stream = require('stream') // needed for stream.Readable below
var Readable = stream.Readable
var util = require('util')
// an easier Readable stream interface to implement
// requires that subclasses:
// implement a _readSource function that
//   * gets the same parameter as Readable._read (size)
//   * should return either data to write, or null if the source doesn't have more data yet
// call 'sourceHasData(hasData)' when the source starts or stops having data available
// call 'end()' when the source is out of data (forever)
var Stream666 = {}
Stream666.Readable = function() {
  stream.Readable.apply(this, arguments)
  if(this._readSource === undefined) {
    throw new Error("You must define a _readSource function for an object implementing Stream666")
  }
  this._sourceHasData = false
  this._destinationWantsData = false
  this._size = undefined // can be set by _read
}
util.inherits(Stream666.Readable, stream.Readable)
Stream666.Readable.prototype._read = function(size) {
  this._destinationWantsData = true
  if(this._sourceHasData) {
    pushSourceData(this, size)
  } else {
    this._size = size
  }
}
Stream666.Readable.prototype.sourceHasData = function(_sourceHasData) {
  this._sourceHasData = _sourceHasData
  if(_sourceHasData && this._destinationWantsData) {
    pushSourceData(this, this._size)
  }
}
Stream666.Readable.prototype.end = function() {
  this.push(null)
}
function pushSourceData(stream666Readable, size) {
  var data = stream666Readable._readSource(size)
  if(data !== null) {
    if(!stream666Readable.push(data)) {
      stream666Readable._destinationWantsData = false
    }
  } else {
    stream666Readable._sourceHasData = false
  }
}
// creates a stream that can view all the data in a stream and passes the data through
// correctly supports backpressure
// parameters:
//   stream - the stream to peek at
//   callback - called when there's data sent from the passed stream
var StreamPeeker = function(myStream, callback) {
  Stream666.Readable.call(this)
  this.stream = myStream
  this.callback = callback
  myStream.on('readable', function() {
    this.sourceHasData(true)
  }.bind(this))
  myStream.on('end', function() {
    this.end()
  }.bind(this))
}
util.inherits(StreamPeeker, Stream666.Readable)
StreamPeeker.prototype._readSource = function(size) {
  var data = this.stream.read(size)
  if(data !== null) {
    this.callback(data)
    return data
  } else {
    this.sourceHasData(false)
    return null
  }
}
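For reference, here is a usage sketch (the file names are made up) that assumes the Stream666 and StreamPeeker definitions above:
var fs = require('fs')
// Peek at every chunk of input.txt while piping it through to copy.txt;
// back-pressure is still controlled by the destination stream.
var peeker = new StreamPeeker(fs.createReadStream('input.txt'), function(data) {
  process.stdout.write("peeked " + data.length + " bytes\n")
})
peeker.pipe(fs.createWriteStream('copy.txt'))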
Old Answer:
// creates a stream that can view all the data in a stream and passes the data through
// correctly supports backpressure
// parameters:
//   stream - the stream to peek at
//   callback - called when there's data sent from the passed stream
var StreamPeeker = exports.StreamPeeker = function(myStream, callback) {
  stream.Readable.call(this)
  this.stream = myStream
  this.callback = callback
  this.reading = false
  this.sourceIsReadable = false
  myStream.on('readable', function() {
    this.sourceIsReadable = true
    this._readMoreData()
  }.bind(this))
  myStream.on('end', function() {
    this.push(null)
  }.bind(this))
}
util.inherits(StreamPeeker, stream.Readable)
StreamPeeker.prototype._read = function() {
  this.reading = true
  if(this.sourceIsReadable) {
    this._readMoreData()
  }
}
StreamPeeker.prototype._readMoreData = function() {
  if(!this.reading) return;
  var data = this.stream.read()
  if(data !== null) {
    if(!this.push(data)) {
      this.reading = false
    }
    this.callback(data)
  }
}

Why doesn't this program output any data?

I'm messing with the Node.js 0.10 Stream classes to try and figure out how to use them. I'm not sure why this experiment isn't working. It's supposed to output letters of the alphabet to an HTTP response object, but does not. I've annotated the source with a few comments.
Thank you!
var Readable = require('stream').Readable
  , inherits = require('util').inherits
  , http = require('http');
/**
 * A stream that streams the English alphabet
 */
function AlphabetStream() {
  Readable.call(this);
  this.code = this.offset = 'a'.charCodeAt(0);
  this.last = 'z'.charCodeAt(0);
}
inherits(AlphabetStream, Readable);
AlphabetStream.prototype._read = function(size) {
  for (var i = 0; i < size; i++)
    this.push(this.next_char());
  this.push(null);
};
AlphabetStream.prototype.next_char = function() {
  var cycle = this.last+1;
  return String.fromCharCode((++this.code % cycle) + this.offset);
};
/**
 * An HTTP server, prints the first n letters of the English alphabet
 */
var server = http.createServer(function(req, res) {
  // $ curl localhost:3001/?size=11
  var size = require('url').parse(req.url, true).query.size;
  if (size) {
    var rs = new AlphabetStream;
    rs.pipe(res); // This calls `_read()` with `size` 16kb
    rs.read(parseInt(size)); // This also calls `_read()` with `size` 16kb
  }
  res.end(''); // Nothing gets printed, despite the pipe and the reading.
});
server.listen(3001, function() {
  console.log('Listening on 3001');
});
Have a look at this piece of code:
if (size) {
  var rs = new AlphabetStream;
  rs.pipe(res); // This calls `_read()` with `size` 16kb
  rs.read(parseInt(size)); // This also calls `_read()` with `size` 16kb
}
res.end(''); // Nothing gets printed, despite the pipe and the reading.
You end the response (last line) before the actual piping can take place (this happens because .pipe is asynchronous). What you should do is something like this:
if (size) {
  var rs = new AlphabetStream;
  rs.pipe(res);
  rs.read(parseInt(size));
} else {
  // NOTE THE ELSE STATEMENT
  res.end('');
}
The .pipe function will take care of ending the destination stream (i.e. the response) unless explicitly stated otherwise; see the docs:
http://nodejs.org/api/stream.html#stream_readable_pipe_destination_options
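For reference, the "unless explicitly stated otherwise" part is the end option of .pipe (this one-liner is my own addition, not from the answer above):
rs.pipe(res, { end: false }); // keeps `res` open so you can still write to it after `rs` ends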
EDIT: As for why 16kb? Well, I had to do some tests, and it seems that this is the default behaviour of .pipe (and I'm not sure how to change that, to be honest). First of all, note that this line:
rs.read(parseInt(size));
is totally useless (you can remove it). .pipe will take care of reading the data. Now the default behaviour is to read chunks of 16kb of data. So in order to do what you are trying to do you should probably pass size to the constructor of AlphabetStream, like this:
function AlphabetStream(size) {
  Readable.call(this);
  this.code = this.offset = 'a'.charCodeAt(0);
  this.last = 'z'.charCodeAt(0);
  this.size = size; // <--- store size here
}
inherits(AlphabetStream, Readable);
AlphabetStream.prototype._read = function(size) {
  // this allows the stream to be a true stream
  // it reads only as much data as it can
  // but .read can be called multiple times without issues
  // with a predefined limit
  var chunk = Math.min(size, this.size);
  this.size -= chunk;
  for (var i = 0; i < chunk; i++) {
    this.push(this.next_char());
  }
  if (!this.size) {
    // end the stream only when the limit is reached
    this.push(null);
  }
};
After all, a stream should not depend on how much data you read. Then you do:
if (size) {
  var rs = new AlphabetStream(parseInt(size));
  rs.pipe(res);
} else {
  res.end('');
}
