I have a while loop that loads about 10000 entries into an array, and then another function pops them one at a time to be used as test inputs. Generating and loading those 10000 entries takes a bit of time. I'm looking for a way to do this more asynchronously, i.e. once 50 entries have been created, the method that uses that input can be called while the loop continues to generate data until it reaches 10000.
The answer is in TypeScript. The idea is to produce the test cases with a generator (an ES6 feature) and wrap it in a Readable stream that buffers the generated test cases. The tester is a Transform stream that checks each item it is given: it either raises an error or silently skips a failing test case, and returns an appropriate message if the test case passes. Simply pipe the test generator (the reader) to the tester (the transform), and optionally pipe the result to some output stream to record passed and failed test cases.
Code (typescript):
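import { Readable, Transform } from "stream";

class InputGen<T> extends Readable {
  constructor(gen: IterableIterator<T>, count: number) {
    super({
      objectMode: true,
      highWaterMark: 50,
      // a method rather than an arrow function, so `this` is the stream itself
      read(this: Readable) {
        // stop pulling from the generator once `count` items have been emitted
        const testData = count > 0 ? gen.next() : undefined;
        if (!testData || testData.done) {
          this.push(null); // limit reached or generator exhausted: end the stream
        } else {
          count--;
          this.push(testData.value);
        }
      }
    });
  }
}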
class Tester extends Transform {
constructor() {
super({
objectMode: true,
transform: (data: any, enc: string, cb: Function) => {
// test data
if (/* data passes the test */!!data) {
cb(null, data);
} else {
cb(new Error("Data did not pass the test")); // OR cb() to skip the data
}
}
});
}
}
Usage:
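new InputGen(function* () {
  // starts at 1 because the placeholder test above rejects falsy values such as 0
  for (let v = 1; v <= 10000; v++) {
    yield v; // Some test case
  }
}(), 10000).pipe(new Tester()); // pipe to an output stream (or call .resume()); without a consumer the Transform stalls once its buffer is full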
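The pull-based idea behind the generator can also be seen without any streams. The following is only a minimal sketch: the helper names makeCases and take are made up for illustration and are not part of the answer above. It shows that a test case is generated only when the consumer asks for it, so consumption can start long before all 10000 cases exist.

```typescript
// Records the moment each case is actually generated.
const produced: number[] = [];

// Hypothetical test-case source: generates cases lazily, one per request.
function* makeCases(total: number): Generator<number> {
  for (let v = 1; v <= total; v++) {
    produced.push(v);
    yield v;
  }
}

// Hypothetical helper: yields at most `limit` items, then stops pulling.
function* take<T>(source: Iterator<T>, limit: number): Generator<T> {
  for (let i = 0; i < limit; i++) {
    const next = source.next();
    if (next.done) return;
    yield next.value;
  }
}

// The consumer pulls only three cases out of a possible 10000.
const consumed: number[] = [];
for (const testCase of take(makeCases(10000), 3)) {
  consumed.push(testCase);
}
```

After the loop, both consumed and produced hold exactly three values: nothing beyond what was requested has been generated.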
I have a function in which I read CSV file as a readable stream using the "pipeline" method, splitting it by rows and transforming the data of each row, then I add the data to an array. When the pipeline is finished, I insert all the data to a database.
This is the relevant part of the code:
pipeline(storageStream as Readable, split(), this.FilterPipe(), this.MapData(result));
public MapData(result: Array<string>): MapStream {
return mapSync((filteredData: string) => {
const trimmed: string = filteredData.trim();
if (trimmed.length !== 0) {
result.push(trimmed);
}
});
}
We have sometimes run into memory limits because we upload a large number of very large CSV files, so we decided to split the logic into insertion batches to avoid holding so much in memory at once.
So I thought of handling the read data in batches: for every batch (let's say 100 rows of the file), I would trigger the "MapData" function and insert the result array into the DB.
Is there a way to add a condition so that MapData is triggered every X rows?
Or is there any other solution that might meet this requirement?
Thanks in advance!
The following code shows a transform stream that buffers incoming objects (or arrays of objects) until it has 100 of them and then pushes them onwards as an array:
const stream = require("stream");
var t = new stream.Transform({
  objectMode: true,
  transform(chunk, encoding, callback) {
    // concat() appends a single object or flattens an incoming array of objects
    this.buffer = (this.buffer || []).concat(chunk);
    if (this.buffer.length >= 100) {
      this.push(this.buffer);
      this.buffer = [];
    }
    callback();
  },
  flush(callback) {
    // this.buffer is still undefined if nothing was ever written
    if (this.buffer && this.buffer.length > 0) this.push(this.buffer);
    callback();
  }
}).on("data", console.log);
for (var i = 0; i < 250; i++) t.write(i);
t.end();
You can include such a transform stream in your pipeline.
And here's the same in TypeScript. It can very probably be done more elegantly, but I am no TypeScript expert.
import { Transform } from "stream";
class MyTransform extends Transform {
  buffer: Array<any> = []; // initialized so that flush() is safe even if nothing was written
}
var t = new MyTransform({
  objectMode: true,
  transform(chunk, encoding, callback) {
    var that = this as MyTransform;
    that.buffer = that.buffer.concat(chunk);
    if (that.buffer.length >= 100) {
      this.push(that.buffer);
      that.buffer = [];
    }
    callback();
  },
  flush(callback) {
    var that = this as MyTransform;
    // push whatever is left over as a final, smaller batch
    if (that.buffer.length > 0) this.push(that.buffer);
    callback();
  }
}).on("data", console.log);
for (var i = 0; i < 250; i++) t.write(i);
t.end();
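If it helps to see the batching logic on its own, here is a rough stream-free sketch. The Batcher class and its onBatch callback are hypothetical names introduced for this illustration; in the CSV case the callback is where a database insert for one batch would go.

```typescript
// Hypothetical helper: collects items and hands them over in fixed-size batches.
class Batcher<T> {
  private buffer: T[] = [];
  private readonly size: number;
  private readonly onBatch: (batch: T[]) => void;

  constructor(size: number, onBatch: (batch: T[]) => void) {
    this.size = size;
    this.onBatch = onBatch;
  }

  // Add one item; emit a batch as soon as `size` items have accumulated.
  add(item: T): void {
    this.buffer.push(item);
    if (this.buffer.length >= this.size) this.flush();
  }

  // Emit whatever is left (call once at the end of the input).
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.onBatch(batch);
  }
}

// Usage: 250 rows in batches of 100; the callback only records the batch sizes here.
const batchSizes: number[] = [];
const batcher = new Batcher<number>(100, (batch) => {
  batchSizes.push(batch.length);
});
for (let i = 0; i < 250; i++) batcher.add(i);
batcher.flush();
```

With 250 rows this yields two full batches and one final batch of 50, so at most 100 rows are held in memory at a time.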
Can nodejs streams natively queue objects, if they are not yet piped to a Writable stream?
Part 2: I can no longer process items once super.push(null) has been called. Can I restart a stream once super.push(null) has been called?
I've implemented the desired behaviour in the Readable queue below - it stores events until the output is piped to a stream. It does what I want, but I feel like I'm reinventing the wheel.
import { Readable, ReadableOptions } from 'node:stream'
export class OrderedQueue<EventType = unknown> extends Readable {
// stores a queue of events
queue: EventType[] = []
constructor(opts?: ReadableOptions) {
super({ objectMode: true, highWaterMark: 1024, ...opts })
}
add(event: EventType): boolean {
this.queue.push(event)
return this.queue.length <= this.readableHighWaterMark
}
_read(size: number): void {
super.push(this.queue.shift() || null)
}
}
Not sure about the first part of the question, but for those who want to keep writing data after the queue has been emptied, you will need to call super.push() (e.g. in the add() function), to start the stream going again.
Once the readable._read() method has been called, it will not be called again until more data is pushed through the readable.push() method.
Reference: https://nodejs.org/api/stream.html#stream_readable_read_size_1
i.e.
import { Readable, ReadableOptions } from 'node:stream'
/**
* Readable stream backed by a Queue.
*/
export class OrderedQueue<EventType = unknown> extends Readable {
  // stores a queue of events
  queue: EventType[] = []
  // true while the stream has asked for data via _read() and has not signalled back pressure
  private wantsData = false
  constructor(opts?: ReadableOptions) {
    super({ objectMode: true, highWaterMark: 1024, emitClose: false, ...opts })
  }
  add(event: EventType): boolean {
    // push() buffers the event even when it returns false, so an event is either pushed or queued, never both
    if (this.wantsData && this.queue.length === 0) this.wantsData = super.push(event)
    else this.queue.push(event)
    return this.queue.length <= this.readableHighWaterMark
  }
  waitUntilDrained(): Promise<void> {
    return this.queue.length === 0 ? Promise.resolve() : new Promise((resolve) => this.once('idle', resolve))
  }
  _read(size: number): void {
    this.wantsData = true
    // a false return value means back pressure: stop pushing until _read is called again
    while (this.wantsData && this.queue.length > 0) this.wantsData = super.push(this.queue.shift())
    // if queue is now empty...
    if (this.queue.length === 0) this.emit('idle')
  }
}
Rather than a push based model such as this, a better design would be to pull the data (preferably only the required data) from the Event generator.
I have an Out-Of-Memory Problem in Node.js and see a lot of big strings that can't be garbage collected when I inspect the snapshots of the heap.
I use lowDB and those strings are mainly the content of the lowDb file.
Question in principle...
When I use FileAsync (so the writing to the file is asynchronous) and I do a lot of (fire and forget) writes...is it possible that my heap space is full of waiting stack entries that all wait for the file system to finish writing? (and node can clear the memory for each finished write).
I do a lot of writes as I use lowDB to save log messages of an algorithm that I execute. Later on I want to find the log messages of a specific execution. So basically:
{
executions: [
{
id: 1,
logEvents: [...]
},
{
id: 2,
logEvents: [...]
},
...
]
}
My simplified picture of node processing this is:
my script is the next script on the stack and runs
with each write something is waiting for the file system to return an answer
this something is bloating my memory and each of this 'somethings' hold the whole content of the lowdb file (multiple times?!)
Example typescript code to try it out:
import * as lowDb from 'lowdb';
import * as FileAsync from 'lowdb/adapters/FileAsync';
/* first block just for generating random data... */
const characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
const charactersLength = characters.length;
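const alphanum = (length: number) => {
  const result = Buffer.alloc(length); // new Buffer(size) is deprecated
  for (let i = 0; i < length; i++) {
    // write each character at offset i; without the offset every write lands at position 0
    result.write(characters.charAt(Math.floor(Math.random() * charactersLength)), i);
  }
  return result.toString('utf8');
};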
class TestLowDb {
private adapter = new FileAsync('test.json');
private db;
/* starting the db up, loading with Async FileAdapter */
async startDb(): Promise<void> {
return lowDb(this.adapter).then(db => {
this.db = db;
return this.db.defaults({executions: [], dbCreated: new Date()}).write().then(_ => {
console.log('finished with initialization');
})
});
}
/* fill the database with data, fails quite quickly, finally produces a json like the following:
* { "executions": [ { "id": "<ID>", "data": [ <data>, <data>, ... ] }, <nextItem>, ... ] } */
async fill(): Promise<void> {
for (let i = 0; i < 100; i++) {
const id = alphanum(3);
this.start(id); // add the root id for this "execution"
for (let j = 0; j < 100; j++) {
this.fireAndForget(id, alphanum(1000));
// await this.wait(id, alphanum(1000));
}
}
}
/* for the first item in the list add the id with the empty array */
start(id:string): void {
this.db.get('executions')
.push({id, data:[]})
.write();
}
/* ignores the promise and continues to work */
fireAndForget(id:string, data:string): void {
this.db.get('executions')
.find({id})
.get('data')
.push(data)
.write();
}
/* returns the promise that the caller can handle it "properly" */
async wait(id:string, data:string): Promise<void> {
return this.db.get('executions')
.find({id})
.get('data')
.push(data)
.write();
}
}
const instance = new TestLowDb();
instance.startDb().then(_ => {
instance.fill()
});
I have this snippet of code:
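const fs = require('fs');
const file = fs.createWriteStream('./test.txt');
const time = Date.now(); // start time; not defined in the original snippet, assumed here
let written = true;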
// handler is added before even an attempt to write is made
file.on('drain', function () {
written = true;
console.log('drained');
});
const interval = setInterval(function () {
if (Date.now() - time > 10000) {
clearInterval(interval);
}
if (written) {
written = file.write(new Array(1000000).join('z'));
}
}, 100);
I'm wondering whether it is standard practice to add the handler before any attempt to write is made?
When you use file.on('drain'), you register a general listener for the drain event of your stream.
Notice: This listener will be removed after closing of writable stream.
Generally that code will work properly, but the more common practice in Node.js is to register a stream.once('drain') handler each time the internal buffer fills up. That approach is shown in the Node.js documentation for Event: 'drain':
function writeOneMillionTimes(writer, data, encoding, callback) {
var i = 1000000;
write();
function write() {
var ok = true;
do {
i -= 1;
if (i === 0) {
// last time!
writer.write(data, encoding, callback);
} else {
// see if we should continue, or wait
// don't pass the callback, because we're not done yet.
ok = writer.write(data, encoding);
}
} while (i > 0 && ok);
if (i > 0) {
// had to stop early!
// write some more once it drains
writer.once('drain', write);
}
}
}
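The same wait-for-drain idea can be written with async/await. This is only a sketch under assumptions: the DrainableWriter interface, writeAll and FakeWriter are names invented here, and FakeWriter is a stand-in that reports a full buffer on every second write so the example runs without touching the file system. A real writable stream should fit the same write()/once('drain') shape.

```typescript
// Minimal shape of a writer; an assumption made for this sketch.
interface DrainableWriter<T> {
  write(chunk: T): boolean;
  once(event: "drain", listener: () => void): unknown;
}

// Write chunks in order, pausing whenever write() reports a full buffer.
async function writeAll<T>(writer: DrainableWriter<T>, chunks: T[]): Promise<void> {
  for (const chunk of chunks) {
    if (!writer.write(chunk)) {
      // Resume only after the writer signals that its buffer has emptied.
      await new Promise<void>((resolve) => writer.once("drain", () => resolve()));
    }
  }
}

// Hypothetical stand-in for a stream: signals back pressure on every second write.
class FakeWriter implements DrainableWriter<string> {
  written: string[] = [];
  private pending: (() => void) | null = null;

  write(chunk: string): boolean {
    this.written.push(chunk);
    return this.written.length % 2 !== 0;
  }

  once(_event: "drain", listener: () => void): void {
    this.pending = listener;
  }

  // Simulate the 'drain' event by calling the one-shot listener.
  drain(): void {
    const listener = this.pending;
    this.pending = null;
    if (listener) listener();
  }

  hasPendingListener(): boolean {
    return this.pending !== null;
  }
}

const fake = new FakeWriter();
const finished = writeAll(fake, ["a", "b", "c", "d"]);
```

Right after the call, only "a" and "b" have been written and a one-shot drain listener is waiting; calling fake.drain() lets the loop continue with the remaining chunks.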
I have the following readable stream in typescript:
import {Readable} from "stream";
enum InputState {
NOT_READABLE,
READABLE,
ENDED
}
export class Aggregator extends Readable {
private inputs: Array<NodeJS.ReadableStream>;
private states: Array<InputState>;
private records: Array<any>;
constructor(options, inputs: Array<NodeJS.ReadableStream>) {
// force object mode
options.objectMode = true;
super(options);
this.inputs = inputs;
// set initial state
this.states = this.inputs.map(() => InputState.NOT_READABLE);
this.records = this.inputs.map(() => null);
// register event handlers for input streams
this.inputs.forEach((input, i) => {
input.on("readable", () => {
console.log("input", i, "readable event fired");
this.states[i] = InputState.READABLE;
if (this._readable) { this.emit("_readable"); }
});
input.on("end", () => {
console.log("input", i, "end event fired");
this.states[i] = InputState.ENDED;
// if (this._end) { this.push(null); return; }
if (this._readable) { this.emit("_readable"); }
});
});
}
get _readable () {
return this.states.every(
state => state === InputState.READABLE ||
state === InputState.ENDED);
}
get _end () {
return this.states.every(state => state === InputState.ENDED);
}
_aggregate () {
console.log("calling _aggregate");
let timestamp = Infinity,
indexes = [];
console.log("initial record state", JSON.stringify(this.records));
this.records.forEach((record, i) => {
// try to read missing records
if (!this.records[i] && this.states[i] !== InputState.ENDED) {
this.records[i] = this.inputs[i].read();
if (!this.records[i]) {
this.states[i] = InputState.NOT_READABLE;
return;
}
}
// update timestamp if a better one is found
if (this.records[i] && timestamp > this.records[i].t) {
timestamp = this.records[i].t;
// clean the indexes array
indexes.length = 0;
}
// include the record index if has the required timestamp
if (this.records[i] && this.records[i].t === timestamp) {
indexes.push(i);
}
});
console.log("final record state", JSON.stringify(this.records), indexes, timestamp);
// end prematurely if after trying to read inputs the aggregator is
// not ready
if (!this._readable) {
console.log("end prematurely trying to read inputs", this.states);
this.push(null);
return;
}
// end prematurely if all inputs are ended and there is no remaining
// record values
if (this._end && indexes.length === 0) {
console.log("end on empty indexes", this.states);
this.push(null);
return;
}
// create the aggregated record
let record = {
t: timestamp,
v: this.records.map(
(r, i) => indexes.indexOf(i) !== -1 ? r.v : null
)
};
console.log("aggregated record", JSON.stringify(record));
if (this.push(record)) {
console.log("record pushed downstream");
// remove records already aggregated and pushed
indexes.forEach(i => { this.records[i] = null; });
this.records.forEach((record, i) => {
// try to read missing records
if (!this.records[i] && this.states[i] !== InputState.ENDED) {
this.records[i] = this.inputs[i].read();
if (!this.records[i]) {
this.states[i] = InputState.NOT_READABLE;
}
}
});
} else {
console.log("record failed to push downstream");
}
}
_read () {
console.log("calling _read", this._readable);
if (this._readable) { this._aggregate(); }
else {
this.once("_readable", this._aggregate.bind(this));
}
}
}
It is designed to aggregate multiple input streams in object mode; ultimately it merges several time series data streams into a single one. The problem I'm facing is that when I test the feature, I repeatedly see the message "record failed to push downstream", immediately followed by "calling _read true", with just the 3 messages from the aggregation algorithm in between. So the Readable stream machinery keeps calling _read, and every time the push() call fails. Any idea why this is happening? Do you know of a library that implements this kind of algorithm, or a better way to implement this feature?
I will answer my own question.
The problem was that I misunderstood the meaning of the return value of this.push(). I thought a false return value meant that the current push had failed, but it actually means that the record was buffered and that no further pushes should be made until _read() is called again.
A simple fix to the code shown above is to replace this:
if (this.push(record)) {
console.log("record pushed downstream");
// remove records already aggregated and pushed
indexes.forEach(i => { this.records[i] = null; });
this.records.forEach((record, i) => {
// try to read missing records
if (!this.records[i] && this.states[i] !== InputState.ENDED) {
this.records[i] = this.inputs[i].read();
if (!this.records[i]) {
this.states[i] = InputState.NOT_READABLE;
}
}
});
} else {
console.log("record failed to push downstream");
}
with this:
this.push(record);
console.log("record pushed downstream");
// remove records already aggregated and pushed
indexes.forEach(i => { this.records[i] = null; });
this.records.forEach((record, i) => {
// try to read missing records
if (!this.records[i] && this.states[i] !== InputState.ENDED) {
this.records[i] = this.inputs[i].read();
if (!this.records[i]) {
this.states[i] = InputState.NOT_READABLE;
}
}
});
Note that the only difference is that the subsequent operations no longer depend on the return value of this.push(). Given that the current implementation calls this.push() only once per _read() call, this simple change solves the issue.
A false return value means that the stream is being fed faster than it is consumed. The usual remedy is to enlarge highWaterMark (default: 16384 bytes (16 KB), or 16 objects in objectMode). As long as the internal buffer is big enough, push() keeps returning true. A single _read() call also does not have to make a single push(): you may push as many chunks as highWaterMark allows in one _read().
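To make the meaning of the return value concrete, here is a toy model. It is not Node's implementation: ToyReadableBuffer is a made-up class that only mimics the rule that push() always stores the chunk and returns false once the buffer has reached its highWaterMark.

```typescript
// Toy model only: a bounded buffer whose push() mimics the Readable contract.
class ToyReadableBuffer<T> {
  private items: T[] = [];
  private readonly highWaterMark: number;

  constructor(highWaterMark: number) {
    this.highWaterMark = highWaterMark;
  }

  // Always stores the chunk; the return value only says whether more may be pushed.
  push(chunk: T): boolean {
    this.items.push(chunk);
    return this.items.length < this.highWaterMark;
  }

  // Remove and return the oldest buffered chunk, if any.
  read(): T | undefined {
    return this.items.shift();
  }

  size(): number {
    return this.items.length;
  }
}

const toy = new ToyReadableBuffer<string>(2);
const firstPush = toy.push("a");  // true: still below highWaterMark
const secondPush = toy.push("b"); // false: buffer is full, but "b" was stored
const bufferedAfterTwoPushes = toy.size();
```

Even though the second push returns false, both chunks are in the buffer, which is why re-pushing a record after a false return produces duplicates.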