lowDB - out of memory - Node.js

I have an out-of-memory problem in Node.js: when I inspect heap snapshots, I see a lot of big strings that can't be garbage collected.
I use lowdb, and those strings are mainly the content of the lowdb file.
Question in principle...
When I use FileAsync (so writing to the file is asynchronous) and I do a lot of fire-and-forget writes, is it possible that my heap fills up with pending entries that are all waiting for the file system to finish writing (and Node can only clear the memory for a write once it has finished)?
I do a lot of writes because I use lowdb to save the log messages of an algorithm that I execute. Later on I want to find the log messages of a specific execution. So basically:
{
  executions: [
    {
      id: 1,
      logEvents: [...]
    },
    {
      id: 2,
      logEvents: [...]
    },
    ...
  ]
}
My simplified picture of Node processing this is:
my script is the next script on the stack and runs
with each write, something is waiting for the file system to return an answer
this something is bloating my memory, and each of these "somethings" holds the whole content of the lowdb file (multiple times?!)
Example TypeScript code to try it out:
import * as lowDb from 'lowdb';
import * as FileAsync from 'lowdb/adapters/FileAsync';

/* first block just for generating random data... */
const characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
const charactersLength = characters.length;
const alphanum = (length: number) => {
  const result = Buffer.alloc(length); // Buffer.alloc instead of the deprecated new Buffer()
  for (let i = 0; i < length; i++) {
    // write each random character at its own offset
    result.write(characters.charAt(Math.floor(Math.random() * charactersLength)), i);
  }
  return result.toString('utf8');
};

class TestLowDb {
  private adapter = new FileAsync('test.json');
  private db;

  /* starting the db up, loading with the async file adapter */
  async startDb(): Promise<void> {
    return lowDb(this.adapter).then(db => {
      this.db = db;
      return this.db.defaults({ executions: [], dbCreated: new Date() }).write().then(_ => {
        console.log('finished with initialization');
      });
    });
  }

  /* fill the database with data, fails quite quickly, finally produces a json like the following:
   * { "executions": [ { "id": "<ID>", "data": [ <data>, <data>, ... ] }, <nextItem>, ... ] } */
  async fill(): Promise<void> {
    for (let i = 0; i < 100; i++) {
      const id = alphanum(3);
      this.start(id); // add the root id for this "execution"
      for (let j = 0; j < 100; j++) {
        this.fireAndForget(id, alphanum(1000));
        // await this.wait(id, alphanum(1000));
      }
    }
  }

  /* for the first item in the list add the id with the empty array */
  start(id: string): void {
    this.db.get('executions')
      .push({ id, data: [] })
      .write();
  }

  /* ignores the promise and continues to work */
  fireAndForget(id: string, data: string): void {
    this.db.get('executions')
      .find({ id })
      .get('data')
      .push(data)
      .write();
  }

  /* returns the promise so that the caller can handle it "properly" */
  async wait(id: string, data: string): Promise<void> {
    return this.db.get('executions')
      .find({ id })
      .get('data')
      .push(data)
      .write();
  }
}

const instance = new TestLowDb();
instance.startDb().then(_ => {
  instance.fill();
});

Related

For await x of y using an AsyncIterator causes a memory leak

When using an AsyncIterator I have a substantial memory leak when it is used in a for-await-of loop.
I need this when scraping an HTML page which includes the information about the next HTML page to be scraped:
Scrape Data
Evaluate Data
Scrape Next Data
The async part is needed since axios is used to obtain the HTML.
Here is a repro which shows the memory rising from ~4 MB to ~25 MB by the end of the script. The memory is not freed until the program terminates.
const scraper = async (): Promise<void> => {
  let browser = new BrowserTest();
  let parser = new ParserTest();
  for await (const data of browser) {
    console.log(await parser.parse(data));
  }
};

class BrowserTest {
  private i: number = 0;

  public async next(): Promise<IteratorResult<string>> {
    this.i += 1;
    return {
      done: this.i > 1000,
      value: 'peter '.repeat(this.i)
    };
  }

  [Symbol.asyncIterator](): AsyncIterator<string> {
    return this;
  }
}

class ParserTest {
  public async parse(data: string): Promise<string[]> {
    return data.split(' ');
  }
}

scraper();
It looks like the data of the for-await-of loop is dangling in memory. The call stack gets huge as well.
In the repro the problem is still manageable, but in my actual code a whole HTML page (~250 kB per call) stays in memory.
A heap snapshot of the first iteration compared to one taken after the last iteration shows the growth (screenshot not included).
The expected workflow would be the following:
Obtain Data
Process Data
Extract Info for the next "Obtain Data"
Free all memory from the last "Obtain Data"
Use the extracted information to restart the loop with newly obtained data.
I am unsure whether an AsyncIterator is the right choice here to achieve what is needed.
Any help/hint would be appreciated!
In Short
When using an AsyncIterator the memory rises drastically. It drops once the iteration is done.
The `x` in `for await (x of y)` is not freed until the iteration is done. Also, every promise awaited inside the for loop is not freed.
I came to the conclusion that the garbage collector cannot collect the contents of the iteration, since the promises generated by the AsyncIterator only fully resolve once the iteration is done.
I think this might be a bug.
Workaround Repro
As a workaround to free the contents of the parser, we encapsulate the result in a lightweight container. We then free the contents, so only the container itself remains in memory.
The data object cannot be freed even if you use the same technique to encapsulate it, at least that is what debugging seems to show.
const scraper = async (): Promise<void> => {
  let browser = new BrowserTest();
  for await (const data of browser) {
    let parser = new ParserTest();
    let result = await parser.parse(data);
    console.log(result);
    /**
     * This avoids memory leaks, due to a garbage collector bug
     * of async iterators in js
     */
    result.free();
  }
};

class BrowserTest {
  private i: number = 0;
  private value: string = "";

  public async next(): Promise<IteratorResult<string>> {
    this.i += 1;
    this.value = 'peter '.repeat(this.i);
    return {
      done: this.i > 1000,
      value: this.value
    };
  }

  public [Symbol.asyncIterator](): AsyncIterator<string> {
    return this;
  }
}

/**
 * Result class for wrapping the result of the parser.
 */
class Result {
  // optional so free() is allowed to delete the reference
  private result?: string[] = [];

  constructor(result: string[]) {
    this.setResult(result);
  }

  public setResult(result: string[]) {
    this.result = result;
  }

  public getResult(): string[] | undefined {
    return this.result;
  }

  public free(): void {
    delete this.result;
  }
}

class ParserTest {
  public async parse(data: string): Promise<Result> {
    let result = data.split(' ');
    return new Result(result);
  }
}

scraper();
Workaround in actual context
What is not shown in the repro solution is that we also try to free the result of the iteration itself. This does not seem to have any effect, though.
public static async scrape<D, M>(scraper: IScraper<D, M>, callback: (data: DataPackage<Object, Object> | null) => Promise<void>) {
  let browser = scraper.getBrowser();
  let parser = scraper.getParser();
  for await (const parserFragment of browser) {
    const fragment = await parserFragment;
    const json = await parser.parse(fragment);
    await callback(json);
    json.free();
    fragment.free();
  }
}
See: https://github.com/demokratie-live/scapacra/blob/master/src/Scraper.ts
To test with an actual Application: https://github.com/demokratie-live/scapacra-bt (yarn dev ConferenceWeekDetail)
References
Github NodeJs: https://github.com/nodejs/node/issues/30298
Github DEMOCRACY: https://github.com/demokratie-live/democracy-client/issues/926
Conclusion
We found a feasible solution for us, therefore I am closing this issue. The follow-up is directed towards the Node.js repo in order to fix this potential bug:
https://github.com/nodejs/node/issues/30298

Laravel Excel upload and progress bar

I have a website where I can upload a .xlsx file which contains some rows of information for my database. I read the documentation for laravel-excel, but it looks like it only supports a progress bar if you use the console method, which I don't.
I currently just use a plain HTML upload form, no AJAX yet.
To create this progress bar I would need to convert it to AJAX, which is no hassle; that I can do.
But how would I create the progress bar while uploading the file and iterating through each row in the Excel file?
This is the controller and method where the upload gets done:
/**
 * Import companies
 *
 * @param Import $request
 * @return \Illuminate\Routing\Redirector|\Illuminate\Http\RedirectResponse
 */
public function postImport(Import $request)
{
    # Import using Import class
    Excel::import(new CompaniesImport, $request->file('file'));

    return redirect(route('dashboard.companies.index.get'))->with('success', 'Import successful!');
}
And this is the import file:
public function model(array $row)
{
    # Don't create or validate on empty rows
    # Bad workaround
    # TODO: better solution
    if (!array_filter($row)) {
        return null;
    }

    # Create company
    $company = new Company;
    $company->crn = $row['crn'];
    $company->name = $row['name'];
    $company->email = $row['email'];
    $company->phone = $row['phone'];
    $company->website = (!empty($row['website'])) ? Helper::addScheme($row['website']) : '';
    $company->save();

    # Only create an address if at least one address field is filled
    if (!empty($row['country']) || !empty($row['state']) || !empty($row['postal']) || !empty($row['address']) || !empty($row['zip'])) {
        # Create address
        $address = new CompanyAddress;
        $address->company_id = $company->id;
        $address->country = $row['country'];
        $address->state = $row['state'];
        $address->postal = $row['postal'];
        $address->address = $row['address'];
        $address->zip = $row['zip'];
        $address->save();

        # Attach
        $company->addresses()->save($address);
    }

    return $company;
}
I know this is not much at this point. I just need some help figuring out how I would create this progress bar, because I'm pretty stuck.
My thought is to build an AJAX upload form, but from there I don't know how to proceed.
Just an idea, but you could use the Laravel session to store the total_row_count and processed_row_count during the import execution. Then, you could create a separate AJAX call on a setInterval() to poll those session values (e.g., once per second). This would allow you to calculate your progress as processed_row_count / total_row_count, and output to a visual progress bar. – matticustard
Putting @matticustard's comment into practice. Below is just a sample of how things could be implemented, and there may be areas to improve.
1. Routes
The import route initializes the Excel import.
The import-status route will be used to get the latest import status.
Route::post('import', [ProductController::class, 'import']);
Route::get('import-status', [ProductController::class, 'status']);
2. Controller
The import action validates the uploaded file and passes $id to the ProductsImport class. As the import will be queued and run in the background, there is no access to the current session there, so we will use the cache instead. It would be a good idea to generate a more randomized $id if concurrent imports will be processed; for now a Unix timestamp keeps it simple.
Note: you currently cannot queue xls imports. PhpSpreadsheet's Xls reader contains some non-UTF-8 characters, which makes it impossible to queue (see: XLS imports could not be queued).
public function import()
{
    request()->validate([
        'file' => ['required', 'mimes:xlsx'],
    ]);

    $id = now()->unix();

    session(['import' => $id]);

    Excel::queueImport(new ProductsImport($id), request()->file('file')->store('temp'));

    return redirect()->back();
}
Get the latest import status from the cache, passing $id from the session.
public function status()
{
    $id = session('import');

    return response([
        'started'     => filled(cache("start_date_$id")),
        'finished'    => filled(cache("end_date_$id")),
        'current_row' => (int) cache("current_row_$id"),
        'total_rows'  => (int) cache("total_rows_$id"),
    ]);
}
3. Import class
Using the WithEvents concern: in BeforeImport we store the total row count of the Excel file in the cache; in onRow we store the index of the row currently being processed; and in AfterImport we clear all the data.
<?php

namespace App\Imports;

use App\Models\Product;
use Maatwebsite\Excel\Row;
use Maatwebsite\Excel\Concerns\OnEachRow;
use Maatwebsite\Excel\Events\AfterImport;
use Maatwebsite\Excel\Events\BeforeImport;
use Maatwebsite\Excel\Concerns\WithEvents;
use Illuminate\Contracts\Queue\ShouldQueue;
use Maatwebsite\Excel\Concerns\WithStartRow;
use Maatwebsite\Excel\Concerns\WithChunkReading;
use Maatwebsite\Excel\Concerns\WithMultipleSheets;

class ProductsImport implements OnEachRow, WithEvents, WithChunkReading, ShouldQueue
{
    public $id;

    public function __construct(int $id)
    {
        $this->id = $id;
    }

    public function chunkSize(): int
    {
        return 100;
    }

    public function registerEvents(): array
    {
        return [
            BeforeImport::class => function (BeforeImport $event) {
                $totalRows = $event->getReader()->getTotalRows();

                if (filled($totalRows)) {
                    cache()->forever("total_rows_{$this->id}", array_values($totalRows)[0]);
                    cache()->forever("start_date_{$this->id}", now()->unix());
                }
            },
            AfterImport::class => function (AfterImport $event) {
                cache(["end_date_{$this->id}" => now()], now()->addMinute());
                cache()->forget("total_rows_{$this->id}");
                cache()->forget("start_date_{$this->id}");
                cache()->forget("current_row_{$this->id}");
            },
        ];
    }

    public function onRow(Row $row)
    {
        $rowIndex = $row->getIndex();
        $row = array_map('trim', $row->toArray());

        cache()->forever("current_row_{$this->id}", $rowIndex);

        // sleep(0.2);

        Product::create([ ... ]);
    }
}
4. Front end
On the front-end side this is just a sample of how things could be handled. Here I used Vue.js, ant-design-vue and lodash.
After uploading the file, the handleChange method is called.
On a successful upload, the trackProgress method is called for the first time.
trackProgress is a recursive function, calling itself on completion.
With lodash's _.debounce method we can prevent calling it too often.
<template>
  <a-modal
    title="Upload excel"
    v-model="visible"
    cancel-text="Close"
    ok-text="Confirm"
    :closable="false"
    :maskClosable="false"
    destroyOnClose
  >
    <a-upload-dragger
      name="file"
      :multiple="false"
      :showUploadList="false"
      :action="`/import`"
      @change="handleChange"
    >
      <p class="ant-upload-drag-icon">
        <a-icon type="inbox" />
      </p>
      <p class="ant-upload-text">Click to upload</p>
    </a-upload-dragger>

    <a-progress class="mt-5" :percent="progress" :show-info="false" />
    <div class="text-right mt-1">{{ this.current_row }} / {{ this.total_rows }}</div>

    <template slot="footer">
      <a-button @click="close">Close</a-button>
    </template>
  </a-modal>
</template>

<script>
import axios from 'axios';
import _ from 'lodash';

export default {
  data() {
    // debounce to avoid flooding the status endpoint
    this.trackProgress = _.debounce(this.trackProgress, 1000);

    return {
      visible: true,
      current_row: 0,
      total_rows: 0,
      progress: 0,
    };
  },
  methods: {
    handleChange(info) {
      const status = info.file.status;
      if (status === "done") {
        this.trackProgress();
      } else if (status === "error") {
        this.$message.error(_.get(info, 'file.response.errors.file.0', `${info.file.name} file upload failed.`));
      }
    },
    async trackProgress() {
      const { data } = await axios.get('/import-status');

      if (data.finished) {
        this.current_row = this.total_rows;
        this.progress = 100;
        return;
      }

      this.total_rows = data.total_rows;
      this.current_row = data.current_row;
      this.progress = Math.ceil(data.current_row / data.total_rows * 100);

      this.trackProgress();
    },
    close() {
      if (this.progress > 0 && this.progress < 100) {
        if (confirm('Do you want to close')) {
          this.$emit('close');
          window.location.reload();
        }
      } else {
        this.$emit('close');
        window.location.reload();
      }
    }
  },
};
</script>

NodeJS Cluster: how to reduce data from workers in master?

I'm new to Node.js, and what I want is to read data from a database and do some computation. To make it faster, I use the Node.js cluster module.
There are two global variables, pairMap and nameSet. I allocate the jobs to the workers in the master process, and they do some computation work (modifying the map and the set, just like map-reduce).
However, it seems that pairMap and nameSet are not modified and stay empty (see the code in doMasterAction).
(Another strange thing: when I log the data it did get modified, but in the end it is empty again in the master process.)
The code is as follows (I have extracted the main idea):
const Promise = require('bluebird');
const cluster = require('cluster');
const numCPUs = require('os').cpus().length;
const fs = Promise.promisifyAll(require('fs'));

const utils = {
  mergeMap: (source, dest) => {
    for (let [key, value] of Object.entries(source)) {
      if (!dest.has(key)) dest.set(key, value);
      for (let [type, arr] of Object.entries(value)) {
        const final = new Set([...dest.get(key)[type], ...arr]);
        dest.get(key)[type] = final;
      }
    }
  }
};

/**
 * key: name1@group.com||name2@group.com
 * value: {to: [id1,id2,id3],cc,bcc}
 * @param row
 * @param map
 * @param nameSet
 */
function countLinks(res, map, nameSet) {
  nameSet.add(res);
  map.set(res, { 'test': Math.floor(Math.random() * 10 + 1) });
}

class hackingTeamPrepare {
  constructor(bulk = 100000, total = 1150000) {
    this.bulk = bulk;
    this.count = Math.ceil(total / this.bulk);
    const parallelArr = new Array(this.count).fill(0).map((v, i) => i);
    this.jobs = parallelArr.map(v => 'key' + v);
    this.pairMap = new Map();
    this.nameSet = new Set();
    this.bindThis();
  }

  bindThis() {
    this.doWorkerAction = this.doWorkerAction.bind(this);
    this.doMasterAction = this.doMasterAction.bind(this);
  }

  doMasterAction() {
    const workers = [], result = {};
    const self = this;
    let count = 0, timeout;
    for (let i = 0; i < numCPUs; i++) {
      const worker = cluster.fork();
      workers[i] = worker;
    }
    cluster.on('online', (worker) => {
      worker.send(self.jobs.shift());
    });
    cluster.on('exit', function () {
      if (self.jobs.length === 0) return;
      console.log('A worker process died, restarting...');
    });
    cluster.on('message', function (senderWorkder, info) {
      const { workerId, jobIndex } = info;
      result[jobIndex] = true;
      console.log(`----worker ${workerId} done job: ${jobIndex}----`);
      const finish = !self.jobs.length && Object.keys(result).length === self.count;
      if (finish) {
        // -----------------!!here!!--------------------------
        console.log('-------finished-------', self.pairMap, self.nameSet); // Map {}, Set {}
        for (let id in cluster.workers) {
          const curWorker = cluster.workers[id];
          curWorker.disconnect();
        }
      } else {
        if (!self.jobs.length) return;
        senderWorkder.send(self.jobs.shift());
      }
    });
  }

  /**
   * {[person1,person2]: {to,cc,bcc}}
   */
  doWorkerAction() {
    // the worker process receives jobs from the master
    const self = this;
    process.on('message', (sql) => {
      const jobPromise = Promise.resolve(sql).then(res => {
        countLinks(res, self.pairMap, self.nameSet);
        const data = {
          workerId: process.pid,
          jobIndex: sql,
        };
        // send to master
        process.send(data);
      }).catch(err => {
        console.log('-----query error----', err);
      });
    });
  }

  readFromPG() {
    if (cluster.isMaster) {
      this.doMasterAction();
    } else if (cluster.isWorker) {
      this.doWorkerAction();
    }
  }

  init() {
    this.readFromPG();
  }
}

const test = new hackingTeamPrepare(2, 10);
test.init();
Can anyone help me with this?
I have tried to merge the data manually in the master process; however, the data sent by worker.send seems to lose the objects in it.
In Node.js cluster, objects in memory are not shared between master and workers.
pairMap and nameSet exist separately in master and in every worker. When a worker modifies these objects, they change in the same worker (process), while remain unchanged in master and other workers.
To make your idea work, you need to maintain a single pairMap and a single nameSet inside the master process, send messages containing whatever data you need from workers to master, and update these objects using the received data.
Note that you cannot pass arbitrary objects as messages from worker to master. If you need somewhat complex data, you'll need to send plain JavaScript objects (key-value pairs). For example, if you need to send a Map instance from worker to master, see the following functions taken from here:
// source - http://2ality.com/2015/08/es6-map-json.html
function mapToJson(map) {
  return JSON.stringify([...map]);
}
function jsonToMap(jsonStr) {
  return new Map(JSON.parse(jsonStr));
}

// send message using this example:
process.send(mapToJson(pairMap));

// receive message:
worker.on('message', message => console.log(jsonToMap(message)));
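For completeness, a minimal sketch (not part of the answer above; masterPairMap, masterNameSet and the message shape are assumptions) of how the master could fold the data reported by each worker back into its own pairMap and nameSet:

// master side: keep the only authoritative copies and merge what workers report
cluster.on('message', (worker, info) => {
  const { names, pairs } = info; // assumed shape: names is an array, pairs is a serialized Map
  for (const name of names) {
    masterNameSet.add(name);
  }
  for (const [key, value] of jsonToMap(pairs)) {
    masterPairMap.set(key, value);
  }
});

// worker side: send plain/serialized data instead of relying on shared memory
process.send({
  names: [...self.nameSet],
  pairs: mapToJson(self.pairMap),
});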

NodeJS write into an array and read simultaneously

I have a while loop that loads about 10000 entries into an array, and then another function pops them one at a time to be used as test inputs. The process of generating and loading those 10000 entries takes a bit of time. I'm looking for a way to do this more asynchronously, i.e. once 50 entries have been created, the method that uses that input can already be called, while generation continues until it reaches 10000.
The answer is in TypeScript. The idea is to generate the test cases using a generator (ES6 specific), then a Readable is used to buffer the generated test cases. Finally, the tester is represented by a Transform stream which tests each piece of data given to it and either throws an exception (or ignores a failing test), or returns an appropriate message if the test case passes. Simply pipe the test generator (Readable) to the tester (Transform), and possibly pipe that to some output stream to write passed and failed test cases.
Code (TypeScript):
import { Readable, Transform } from "stream";

class InputGen<T> extends Readable {
  constructor(gen: IterableIterator<T>, count: number) {
    super({
      objectMode: true,
      highWaterMark: 50,
      read: (size?: number) => {
        if (count < 0) {
          this.push(null);
        } else {
          count--;
          let testData = gen.next();
          this.push(testData.value);
          if (testData.done) {
            count = -1;
          }
        }
      }
    });
  }
}

class Tester extends Transform {
  constructor() {
    super({
      objectMode: true,
      transform: (data: any, enc: string, cb: Function) => {
        // test data
        if (/* data passes the test */ !!data) {
          cb(null, data);
        } else {
          cb(new Error("Data did not pass the test")); // OR cb() to skip the data
        }
      }
    });
  }
}
Usage:
new InputGen(function* () {
  for (let v = 0; v < 100001; v++) {
    yield v; // Some test case
  }
}(), 10000).pipe(new Tester()); // pipe to an output stream if necessary
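If you do want to collect the outcomes, a minimal sketch (the Writable sink below is an assumption, not part of the answer) of an output stream to pipe the Tester into:

import { Writable } from "stream";

// logs every test case that made it through the Tester
const sink = new Writable({
  objectMode: true,
  write(passed: any, _enc: string, cb: Function) {
    console.log("passed:", passed);
    cb();
  }
});

// new InputGen(...).pipe(new Tester()).pipe(sink);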

Why does Readable.push() return false every time Readable._read() is called?

I have the following readable stream in TypeScript:
import { Readable } from "stream";

enum InputState {
  NOT_READABLE,
  READABLE,
  ENDED
}

export class Aggregator extends Readable {
  private inputs: Array<NodeJS.ReadableStream>;
  private states: Array<InputState>;
  private records: Array<any>;

  constructor(options, inputs: Array<NodeJS.ReadableStream>) {
    // force object mode
    options.objectMode = true;
    super(options);
    this.inputs = inputs;
    // set initial state
    this.states = this.inputs.map(() => InputState.NOT_READABLE);
    this.records = this.inputs.map(() => null);
    // register event handlers for input streams
    this.inputs.forEach((input, i) => {
      input.on("readable", () => {
        console.log("input", i, "readable event fired");
        this.states[i] = InputState.READABLE;
        if (this._readable) { this.emit("_readable"); }
      });
      input.on("end", () => {
        console.log("input", i, "end event fired");
        this.states[i] = InputState.ENDED;
        // if (this._end) { this.push(null); return; }
        if (this._readable) { this.emit("_readable"); }
      });
    });
  }

  get _readable() {
    return this.states.every(
      state => state === InputState.READABLE ||
               state === InputState.ENDED);
  }

  get _end() {
    return this.states.every(state => state === InputState.ENDED);
  }

  _aggregate() {
    console.log("calling _aggregate");
    let timestamp = Infinity,
        indexes = [];
    console.log("initial record state", JSON.stringify(this.records));
    this.records.forEach((record, i) => {
      // try to read missing records
      if (!this.records[i] && this.states[i] !== InputState.ENDED) {
        this.records[i] = this.inputs[i].read();
        if (!this.records[i]) {
          this.states[i] = InputState.NOT_READABLE;
          return;
        }
      }
      // update timestamp if a better one is found
      if (this.records[i] && timestamp > this.records[i].t) {
        timestamp = this.records[i].t;
        // clean the indexes array
        indexes.length = 0;
      }
      // include the record index if has the required timestamp
      if (this.records[i] && this.records[i].t === timestamp) {
        indexes.push(i);
      }
    });
    console.log("final record state", JSON.stringify(this.records), indexes, timestamp);
    // end prematurely if after trying to read inputs the aggregator is
    // not ready
    if (!this._readable) {
      console.log("end prematurely trying to read inputs", this.states);
      this.push(null);
      return;
    }
    // end prematurely if all inputs are ended and there is no remaining
    // record values
    if (this._end && indexes.length === 0) {
      console.log("end on empty indexes", this.states);
      this.push(null);
      return;
    }
    // create the aggregated record
    let record = {
      t: timestamp,
      v: this.records.map(
        (r, i) => indexes.indexOf(i) !== -1 ? r.v : null
      )
    };
    console.log("aggregated record", JSON.stringify(record));
    if (this.push(record)) {
      console.log("record pushed downstream");
      // remove records already aggregated and pushed
      indexes.forEach(i => { this.records[i] = null; });
      this.records.forEach((record, i) => {
        // try to read missing records
        if (!this.records[i] && this.states[i] !== InputState.ENDED) {
          this.records[i] = this.inputs[i].read();
          if (!this.records[i]) {
            this.states[i] = InputState.NOT_READABLE;
          }
        }
      });
    } else {
      console.log("record failed to push downstream");
    }
  }

  _read() {
    console.log("calling _read", this._readable);
    if (this._readable) { this._aggregate(); }
    else {
      this.once("_readable", this._aggregate.bind(this));
    }
  }
}
It is designed to aggregate multiple input streams in object mode; in the end it aggregates multiple time-series data streams into a single one. The problem I'm facing is that when I test the feature I repeatedly see the message record failed to push downstream, immediately followed by calling _read true, and in between just the 3 messages related to the aggregation algorithm. So the Readable stream machinery is calling _read, and every time the push() call is failing. Any idea why this is happening? Do you know of a library that implements this kind of algorithm, or a better way to implement this feature?
I will answer the question myself.
The problem was that I was misunderstanding the meaning of the this.push() return value. I thought a false return value meant that the current push operation failed, but it really means that the stream does not want anything more pushed for now (the chunk is still buffered); it is the following pushes that should be held back.
A simple fix to the code shown above is to replace this:
if (this.push(record)) {
  console.log("record pushed downstream");
  // remove records already aggregated and pushed
  indexes.forEach(i => { this.records[i] = null; });
  this.records.forEach((record, i) => {
    // try to read missing records
    if (!this.records[i] && this.states[i] !== InputState.ENDED) {
      this.records[i] = this.inputs[i].read();
      if (!this.records[i]) {
        this.states[i] = InputState.NOT_READABLE;
      }
    }
  });
} else {
  console.log("record failed to push downstream");
}
With this:
this.push(record);
console.log("record pushed downstream");
// remove records already aggregated and pushed
indexes.forEach(i => { this.records[i] = null; });
this.records.forEach((record, i) => {
  // try to read missing records
  if (!this.records[i] && this.states[i] !== InputState.ENDED) {
    this.records[i] = this.inputs[i].read();
    if (!this.records[i]) {
      this.states[i] = InputState.NOT_READABLE;
    }
  }
});
You can see that the only difference is that we avoid conditioning operations on the return value of the this.push() call. Given that the current implementation calls this.push() only once per _read() call, this simple change solves the issue.
It means feeding is faster than consuming. The official approach is to enlarge the stream's highWaterMark (default: 16384 bytes, i.e. 16 KB, or 16 for objectMode). As long as the internal buffer is big enough, the push function will keep returning true. It does not have to be a single push() per _read(); you may push as much as the highWaterMark allows in a single _read().
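For illustration, a minimal sketch (the values and the record shape are arbitrary assumptions) of raising highWaterMark on an object-mode Readable so that more records fit in the internal buffer before push() starts returning false:

import { Readable } from "stream";

let i = 0;

// object-mode stream that buffers up to 1000 objects instead of the default 16
const source = new Readable({
  objectMode: true,
  highWaterMark: 1000,
  read() {
    // push() keeps returning true until ~1000 unconsumed objects sit in the buffer
    if (i < 10000) {
      this.push({ t: i++, v: Math.random() });
    } else {
      this.push(null); // end of stream
    }
  }
});

source.on('data', record => console.log(record.t));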
