Got incomplete data when piping a stream to an Express response - node.js

I need to convert a DB table to a CSV report.
If I dump the entire table with one query, the application crashes because it runs out of memory. So I decided to query data from the table in batches of 100 rows, convert each row into a line of the report, and write it into a stream that is piped to an Express response.
It all looks roughly like this:
DB query
const select100Users = (maxUserCreationDateStr) => {
  return db.query(`
    SELECT * FROM users WHERE created_at < to_timestamp(${maxUserCreationDateStr})
    ORDER BY created_at DESC LIMIT 100`);
};
stream initialisation
const { PassThrough } = require('stream');

const getUserReportStream = () => {
  const stream = new PassThrough();
  writeUserReport(stream).catch((e) => stream.emit('error', e));
  return stream;
};
piping the stream to the Express response
app.get('/report', (req, res) => {
  const stream = getUserReportStream();
  res.setHeader('Content-Type', 'application/vnd.ms-excel');
  res.setHeader('Content-Disposition', `attachment; filename="${filename}"`);
  stream.pipe(res);
});
and finally, this is how I write data to the stream
const writeUserReport = async (stream) => {
  let maxUserCreationDateGlobal = Math.trunc(Date.now() / 1000);
  let flag = true;
  stream.write(USER_REPORT_HEADER);
  while (flag) {
    const rows100 = await select100Users(maxUserCreationDateGlobal);
    console.log(rows100.length);
    if (rows100.length === 0) {
      flag = false;
    } else {
      let maxUserCreationDate = maxUserCreationDateGlobal;
      const users100 = await Promise.all(
        rows100.map((r) => {
          const created_at = r.created_at;
          const createdAt = new Date(created_at);
          if (created_at && createdAt.toString() !== 'Invalid Date') {
            const createdAtNumber = Math.trunc(createdAt.valueOf() / 1000);
            maxUserCreationDate = Math.min(maxUserCreationDate, createdAtNumber);
          }
          return mapUser(r); // returns a promise
        })
      );
      users100.forEach((u) => stream.write(generateCsvRowFromUser(u)));
      maxUserCreationDateGlobal = maxUserCreationDate;
      if (rows100.length < 100) {
        flag = false;
        console.log('***');
      }
    }
  }
  console.log('end');
  stream.end();
};
as a result I see this output in the console:
100 // 100
100 // 200
100 // 300
100 // 400
100 // 500
87 // 587
***
end
But in the downloaded file I get 401 lines (the first one with USER_REPORT_HEADER). It feels like stream.end() closes the stream before all values are read from it.
I tried using BehaviorSubject from rxjs instead of PassThrough in a similar way - the result is the same.
How can I wait for reading from the stream of all the data that I wrote there?
Or maybe someone can recommend an alternative way to solve this problem.

stream.write accepts a callback as its second (or third) parameter, which tells you when the write operation has finished. You shouldn't call write again until the previous write operation has finished.
So in general I'd suggest making this whole function async and wrapping every stream.write call in a Promise, like:
await new Promise((resolve, reject) => stream.write(data, (error) => {
  if (error) {
    reject(error);
    return;
  }
  resolve();
}));
Obviously it would make sense to extract this to some method.
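For example, the extracted helper could look something like this (just a sketch; writeAsync is a name I'm introducing here, not something from the question):

const writeAsync = (stream, data) =>
  new Promise((resolve, reject) => {
    // resolves once this particular chunk has been handed off by the stream
    stream.write(data, (error) => (error ? reject(error) : resolve()));
  });

// inside writeUserReport, note that forEach won't await, so use a plain loop:
// for (const u of users100) {
//   await writeAsync(stream, generateCsvRowFromUser(u));
// }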
EDIT: Additionally I don't think that's the actual problem. I assume your http connection is just timing out before all the fetching is completed, so the server will eventually close the stream once the timeout deadline is met.
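If the timeout theory holds, one possible mitigation (a sketch using Node's standard response.setTimeout; whether it is needed depends on your Node version and any proxy in front of the app) would be to relax the inactivity timeout on that route:

app.get('/report', (req, res) => {
  // 0 disables the per-response inactivity timeout while the report streams
  res.setTimeout(0);
  const stream = getUserReportStream();
  res.setHeader('Content-Type', 'application/vnd.ms-excel');
  res.setHeader('Content-Disposition', `attachment; filename="${filename}"`);
  stream.pipe(res);
});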

Related

Chunking axios.get requests with a 1 second delay per chunk - presently getting 429 error

I have a script using axios that hits an API with a limit of 5 requests per second. At present my request array length is 72 and will grow over time. I receive an abundance of 429 errors. The responses per endpoint change with each run of the script; ex: url1 on iteration1 returns 429, then url1 on iteration2 returns 200, url1 on iteration3 returns 200, url1 on iteration4 returns 429.
Admittedly, my understanding of async/await and promises is spotty at best.
What I understand:
I can have multiple axios.get calls running because of async. The variable I set in my main function that calls the async function can use await to ensure all requests have been processed before the script continues.
Promise.all can run multiple axios.get calls, but if a single request fails the chain breaks and no more requests will run.
Because the API will only accept 5 requests per second, I have to chunk my axios.get requests into groups of 5 endpoints and wait for those to finish processing before sending the next chunk of 5.
setTimeout will assign a time limit to a single request; once the time is up the request is done and will not be sent again, no matter whether the return was something other than 200.
setInterval will assign a time limit, but it will send the request again after the time is up and keep requesting until it receives a 200.
async function main() {
  var endpoints = makeEndpoints(boards, whiteList); // returns an array of string API endpoints ['www.url1.com', 'www.url2.com', ...]
  var events = await getData(endpoints);
  ...
}
The getData() function has seen many iterations in an attempt to correct the 429s. Here are a few:
// will return the 200's sometimes and not others, I believe it's the timeout but that won't attempt to hit a failed url (as I understand it)
async function getData(endpoints) {
  let events = [];
  for (x = 0; x < endpoints.length; x++) {
    try {
      let response = await axios.get(endpoints[x], {timeout: 2000});
      if ( response.status == 200 &&
           response.data.hasOwnProperty('_embedded') &&
           response.data._embedded.hasOwnProperty('events')
      ) {
        let eventsArr = response.data._embedded.events;
        eventsArr.forEach(event => {
          events.push(event)
        });
      }
    } catch (error) {
      console.log(error);
    }
  }
  return events;
}
// returns a great many 429 errors via the setInterval, as I understand this function sets a delay of N seconds before attempting the next call
async function getData(endpoints) {
  let data = [];
  let promises = [];
  endpoints.forEach((url) => {
    promises.push(
      axios.get(url)
    )
  })
  setInterval(function() {
    for (i = 0; i < promises.length; i += 5) {
      let requestArr = promises.slice(i, i + 5);
      axios.all(requestArr)
        .then(axios.spread((...res) => {
          console.log(res);
        }))
        .catch(err => {
          console.log(err);
        })
    }
  }, 2000)
}
// Here I hoped Promise.all would allow each request to do its thing and return the data, but after further reading I found that if a single request fails the rest will fail in the Promise.all
async function getData(endpoints) {
  let res;
  try {
    res = await Promise.all(endpoints.map(url => axios.get(url))).catch(err => {});
  } catch {
    throw Error("Promise failed");
  }
  return res;
}
// Returns so many 429 and only 3/4 data I know to expect
async function getData(endpoints) {
  const someFunction = () => {
    return new Promise(resolve => {
      setTimeout(() => resolve('222'), 100)
    })
  }
  const requestArr = endpoints.map(async data => {
    let waitForThisData = await someFunction(data);
    return axios.get(data)
      .then(response => { console.log(response.data)})
      .catch(error => console.log(error.toString()))
  });
  Promise.all(requestArr).then(() => {
    console.log('resolved promise.all')
  })
}
// Seems to get close to solving it, but once an error is hit, Promise.all stops processing the endpoints
async function getData(endpoints) {
  (async () => {
    try {
      const allResponses = await Promise.all(
        endpoints.map(url => axios.get(url).then(res => console.log(res.data)))
      );
      console.log(allResponses[0]);
    } catch(e) {
      console.log(e);
      // handle errors
    }
  })();
}
It seems like I have so many relevant pieces but I cannot connect them into an efficient and working model. Perhaps axios has something completely unknown to me? I've also tried using Bluebird's concurrency option to limit the requests to 5 per attempt, but that still returned 429s from axios.
I've been staring at this for days, and with so much new information swirling in my head I'm at a loss as to how to send 5 requests per second, await the responses, then send another set of 5 requests to the API.
Guidance/links/ways to improve upon the question would be much appreciated.
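For what it's worth, here is a minimal sketch of the chunk-and-wait approach described above (the sleep helper, the use of Promise.allSettled, and the exact pacing are my own assumptions, not a tested solution for this particular API):

const axios = require('axios');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function getData(endpoints) {
  const events = [];
  // process 5 endpoints at a time, then wait a second before the next chunk
  for (let i = 0; i < endpoints.length; i += 5) {
    const chunk = endpoints.slice(i, i + 5);
    // allSettled never short-circuits, so one 429 doesn't drop the whole chunk
    const results = await Promise.allSettled(chunk.map((url) => axios.get(url)));
    for (const r of results) {
      if (r.status === 'fulfilled' && r.value.data._embedded && r.value.data._embedded.events) {
        events.push(...r.value.data._embedded.events);
      }
    }
    await sleep(1000); // stay under 5 requests per second
  }
  return events;
}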

How to make a function wait for data to appear in the DB? NodeJS

I am facing a peculiar situation.
I have a backend system (Node.js) which is being called by the FE (pretty standard :) ). This endpoint needs to call another, external system, get the data it produces, and return it to the FE. Until now it all might seem pretty usual, but here comes the catch.
The external system processes asynchronously, so it responds to my request immediately while it is still processing the data (it saves it in a DB), and I have to get that data from the DB and return it to the FE.
And here is the question: what is the best (most efficient) way of doing this? It usually takes only a couple of seconds, and I am very hesitant to just loop inside the function waiting for the data to appear in the DB.
Another way would be to have the external system call an endpoint at the end of its processing (if possible - I would need to check that with the partner) and have the original function wait until that endpoint is called (I'm not sure exactly how to implement that - so if there is any documentation, article, tutorial, ... I would appreciate it very much if you could share it).
thx for the ideas!
I can give you an example that checks the database and waits for a while if it can't find a record. I mocked the database connection so the example works on its own.
// Mocking starts
const ObjectID = () => {};
const db = {
  collection: {
    find: () => {
      return new Promise((resolve, reject) => {
        // Mock like no record found
        setTimeout(() => { console.log('No record found!'); resolve(false) }, 1500);
      });
    }
  }
}
// Mocking ends

const STANDBY_TIME = 1000; // 1 sec
const RETRY = 5; // Retry 5 times

const test = async () => {
  let haveFound = false;
  let i = 0;
  while (i < RETRY && !haveFound) {
    // Check the database
    haveFound = await checkDb();
    // If no record found, increment the loop count
    i++
  }
}

const checkDb = () => {
  return new Promise((resolve) => {
    setTimeout(async () => {
      const record = await db.collection.find({ _id: ObjectID("12345") });
      // Check whether you've found the record or not
      if (record) return resolve(true);
      resolve(false);
    }, STANDBY_TIME);
  });
}
test();
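For the callback-endpoint idea from the question, here is a rough sketch of how the waiting could work (the routes, the EventEmitter, and the 30-second timeout are all my own illustration, assuming the partner system can POST to a URL you give it):

const express = require('express');
const { EventEmitter } = require('events');

const app = express();
app.use(express.json());

const notifications = new EventEmitter();

// FE calls this; we trigger the external system and wait for its callback.
app.get('/data/:jobId', async (req, res) => {
  const { jobId } = req.params;
  // ...call the external system here, passing it the callback URL below...
  try {
    const data = await waitForCallback(jobId, 30000);
    res.json(data);
  } catch (e) {
    res.status(504).json({ error: 'external system did not respond in time' });
  }
});

// The external system calls this when it has finished processing.
app.post('/callbacks/:jobId', (req, res) => {
  notifications.emit(req.params.jobId, req.body);
  res.sendStatus(204);
});

const waitForCallback = (jobId, timeoutMs) =>
  new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      notifications.removeAllListeners(jobId);
      reject(new Error('timeout'));
    }, timeoutMs);
    notifications.once(jobId, (payload) => {
      clearTimeout(timer);
      resolve(payload);
    });
  });

app.listen(3000);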

Promise around event-stream mapSync() not working

Currently I am trying to create a CSV reader that can handle very large CSV files. I chose a streaming implementation with the event-stream NPM package.
I have created a function getNextP() that should return a promise and give me the next piece of data every time I call it.
"use strict";
const fs = require('fs');
const es = require('event-stream');
const csv = require('csv-parser');
class CsvFileReader {
constructor(file) {
this.file = file;
this.isStreamReading = false;
this.stream = undefined;
}
getNextP() {
return new Promise( (resolve) => {
if (this.isStreamReading === true) {
this.stream.resume();
} else {
this.isStreamReading = true;
// Start reading the stream.
this.stream = fs.createReadStream(this.file)
.pipe(csv())
.pipe(es.mapSync( (row) => {
this.stream.pause();
resolve(row);
}))
.on('error', (err) => {
console.error('Error while reading file.', err);
})
.on("end", () => {
resolve(undefined);
})
}
});
}
}
I then call this with the following code:
const csvFileReader = new CsvFileReader("small.csv");

setInterval( () => {
  csvFileReader.getNextP().then( (frame) => {
    console.log(frame);
  })
}, 1000);
However, every time I try this out I only get the first row and never receive the subsequent rows. I cannot figure out why this is not working. I have tried the same with a good old callback function, and then it works without any problem.
Update:
So what I basically want is a function (getNext()) that returns the next row of the CSV every time I call it. Some rows can be buffered, but so far I could not figure out how to do this with streams. So if somebody could give me a pointer on how to create a correct getNext() function, that would be great.
I would like to ask if somebody understands what is going wrong here, and would kindly ask them to share their knowledge.
Thank you in advance.
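Not an explanation of why the mapSync version stalls, but here is a sketch of an alternative getNextP() built on async iteration of the readable stream (available in Node 10+), which side-steps the manual pause/resume bookkeeping:

const fs = require('fs');
const csv = require('csv-parser');

class CsvFileReader {
  constructor(file) {
    // Readable streams are async iterable, so we can pull rows one at a time.
    this.iterator = fs.createReadStream(file).pipe(csv())[Symbol.asyncIterator]();
  }

  // Resolves with the next row, or undefined when the file is exhausted.
  async getNextP() {
    const { value, done } = await this.iterator.next();
    return done ? undefined : value;
  }
}

// Usage
(async () => {
  const reader = new CsvFileReader('small.csv');
  let row;
  while ((row = await reader.getNextP()) !== undefined) {
    console.log(row);
  }
})();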

Error when doing Promise.all (connection closed / hang up)

I have been working with the google-spreadsheet#2.0.7 package, and I have a lot of data to export to Google Sheets.
For now I use this code:
const insertDataToSheet = async (data, sheet, msg) => {
  let query = []
  try {
    data.map(async item => {
      query.push(promisify(sheet.addRow)(item))
    })
    const result = await Promise.all(query)
    if (result) return result
    throw new Error(`${msg} Unkown Error`)
  } catch (e) {
    throw new Error(`${msg} Failed: ${e.message}`)
  }
}
This code works with 100 rows or fewer, but if I use 150+ rows the connection can't handle it.
Error List
- Client network socket disconnected before secure TLS connection was established
- Socket hang up
- Error: HTTP error 429 (Too Many Requests)
Is there any limitation on Promise.all?
or
Is there any better solution to export batch / bulk data to Google
Spreadsheet?
Promise.all will reject as soon as one of the promises rejects. If you want to proceed even when one promise fails, you do not want to rethrow it as in your code above.
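For instance, one way to keep Promise.all from short-circuiting is to catch each promise individually (a sketch only; addRowP stands in for your promisify(sheet.addRow)):

const results = await Promise.all(
  data.map((item) =>
    addRowP(item).catch((err) => ({ failed: true, item, error: err.message }))
  )
);
// results now holds either row results or { failed, item, error } markers,
// so one bad insert no longer rejects the whole batch.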
You can also re-add the failed item to a pending queue and try it again.
Also, I would consider batching: divide the rows into chunks and upload them chunk by chunk.
Example:
- create a pool of workers (number of workers = number of CPU cores by default)
- run the uploading logic with the worker pool
- simulate errors / retries with Math.random
process.js file
const path = require('path')
const _ = require('lodash')
const Pool = require('piscina')
const BB = require('bluebird')

const workerPool = new Pool({
  filename: path.resolve(__dirname, 'worker.js'),
})

const generateData = (numItems = 5) => {
  return Array.from({ length: numItems }, (v, idx) => 'item ' + idx)
}

const CHUNK_SIZE = 10
const data = generateData(100)
const chunks = _.chunk(data, CHUNK_SIZE)

BB.map(
  chunks,
  (chunk) => workerPool.runTask(chunk), // return the promise so the concurrency limit is respected
  { concurrency: 1 /* 1 chunk at a time */ }
)
worker.js file
const retry = require('p-retry')

// your upload logic here
function process(data) {
  if (Math.random() > 0.5) {
    console.log('processing ', data)
  } else {
    console.log('fail => retry ', data)
    throw new Error('process failed' + data)
  }
}

module.exports = (data) => {
  return retry(() => process(data), { retries: 10 })
}
run with node process.js
In the end I kept working on this and found out there is a new version of the package, google-spreadsheet#3.0.11.
It switched from the Google Drive API to the Google Sheets API.
It has many changes, but in my case I can now do a batch / bulk insert with a single call.
This is my code now:
const insertDataToSheet = async (data, sheet, msg) => {
  try {
    const result = await sheet.addRows(data)
    if (result) return result
    throw new Error(`${msg} Unkown Error`)
  } catch (e) {
    throw new Error(`${msg} Failed: ${e.message}`)
  }
}
I just use sheet.addRows and, tada, it's working.
My problem is solved, but I still need to learn more about promises.
Thanks for all of your suggestions / attention.

NodeJS, promises, streams - processing large CSV files

I need to build a function for processing large CSV files for use in a bluebird.map() call. Given the potential sizes of the file, I'd like to use streaming.
This function should accept a stream (a CSV file) and a function (that processes the chunks from the stream) and return a promise when the file is read to end (resolved) or errors (rejected).
So, I start with:
'use strict';

var _ = require('lodash');
var promise = require('bluebird');
var csv = require('csv');
var stream = require('stream');
var pgp = require('pg-promise')({promiseLib: promise});

api.parsers.processCsvStream = function(passedStream, processor) {
  var parser = csv.parse(passedStream, {trim: true});
  passedStream.pipe(parser);

  // use readable or data event?
  parser.on('readable', function() {
    // call processor, which may be async
    // how do I throttle the amount of promises generated
  });

  var db = pgp(api.config.mailroom.fileMakerDbConfig);

  return new Promise(function(resolve, reject) {
    parser.on('end', resolve);
    parser.on('error', reject);
  });
}
Now, I have two inter-related issues:
I need to throttle the actual amount of data being processed, so as to not create memory pressures.
The function passed as the processor param is often going to be async, such as saving the contents of the file to the db via a promise-based library (right now: pg-promise). As such, it will create a promise in memory and move on, repeatedly.
The pg-promise library has functions to manage this, like page(), but I'm not able to wrap my head around how to mix stream event handlers with these promise methods. Right now, I return a promise in the handler for the readable section after each read(), which means I create a huge number of promised database operations and eventually fault out because I hit a process memory limit.
Does anyone have a working example of this that I can use as a jumping point?
UPDATE: Probably more than one way to skin the cat, but this works:
'use strict';

var _ = require('lodash');
var promise = require('bluebird');
var csv = require('csv');
var stream = require('stream');
var pgp = require('pg-promise')({promiseLib: promise});

api.parsers.processCsvStream = function(passedStream, processor) {
  // some checks trimmed out for example

  var db = pgp(api.config.mailroom.fileMakerDbConfig);
  var parser = csv.parse(passedStream, {trim: true});
  passedStream.pipe(parser);

  var readDataFromStream = function(index, data, delay) {
    var records = [];
    var record;
    do {
      record = parser.read();
      if(record != null)
        records.push(record);
    } while(record != null && (records.length < api.config.mailroom.fileParserConcurrency))
    parser.pause();
    if(records.length)
      return records;
  };

  var processData = function(index, data, delay) {
    console.log('processData(' + index + ') > data: ', data);
    parser.resume();
  };

  parser.on('readable', function() {
    db.task(function(tsk) {
      this.page(readDataFromStream, processData);
    });
  });

  return new Promise(function(resolve, reject) {
    parser.on('end', resolve);
    parser.on('error', reject);
  });
}
Does anyone see a potential problem with this approach?
You might want to look at promise-streams
var ps = require('promise-streams');

passedStream
  .pipe(csv.parse({trim: true}))
  .pipe(ps.map({concurrent: 4}, row => processRowDataWhichMightBeAsyncAndReturnPromise(row)))
  .wait().then(_ => {
    console.log("All done!");
  });
Works with backpressure and everything.
Find below a complete application that correctly executes the same kind of task as you want: It reads a file as a stream, parses it as a CSV and inserts each row into the database.
const fs = require('fs');
const promise = require('bluebird');
const csv = require('csv-parse');
const pgp = require('pg-promise')({promiseLib: promise});

const cn = "postgres://postgres:password@localhost:5432/test_db";
const rs = fs.createReadStream('primes.csv');
const db = pgp(cn);

function receiver(_, data) {
  function source(index) {
    if (index < data.length) {
      // here we insert just the first column value that contains a prime number;
      return this.none('insert into primes values($1)', data[index][0]);
    }
  }
  return this.sequence(source);
}

db.task(t => {
    return pgp.spex.stream.read.call(t, rs.pipe(csv()), receiver);
  })
  .then(data => {
    console.log('DATA:', data);
  })
  .catch(error => {
    console.log('ERROR:', error);
  });
Note the only things I changed: using the csv-parse library instead of csv, as a better alternative, and adding the use of method stream.read from the spex library, which properly serves a Readable stream for use with promises.
I found a slightly better way of doing the same thing, with more control. This is a minimal skeleton with precise parallelism control. With a parallel value of 1, all records are processed in sequence without holding the entire file in memory; we can increase the parallel value for faster processing.
const csv = require('csv');
const csvParser = require('csv-parser')
const fs = require('fs');

const readStream = fs.createReadStream('IN');
const writeStream = fs.createWriteStream('OUT');

const transform = csv.transform({ parallel: 1 }, (record, done) => {
  asyncTask(...) // return Promise
    .then(result => {
      // ... do something when success
      return done(null, record);
    }, (err) => {
      // ... do something when error
      return done(null, record);
    })
});

readStream
  .pipe(csvParser())
  .pipe(transform)
  .pipe(csv.stringify())
  .pipe(writeStream);
This allows doing an async task for each record.
To return a promise instead, we can create a pending promise and complete it when the stream finishes, for example from an end handler:
.on('end', function() {
  // do something with csvData
  console.log(csvData);
});
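A sketch of what such a wrapper could look like for the pipeline above (using stream.finished from Node's standard library; this is my illustration, not part of the original answer):

const { finished } = require('stream');

const processFile = () =>
  new Promise((resolve, reject) => {
    const out = readStream
      .pipe(csvParser())
      .pipe(transform)
      .pipe(csv.stringify())
      .pipe(writeStream);
    // resolve once the final writable has flushed everything, reject on an error in `out`
    finished(out, (err) => (err ? reject(err) : resolve()));
  });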
So you're saying you don't want streaming but some kind of data chunks? ;-)
Do you know https://github.com/substack/stream-handbook?
I think the simplest approach without changing your architecture would be some kind of promise pool. e.g. https://github.com/timdp/es6-promise-pool
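A sketch of what that could look like with es6-promise-pool (processRow is a placeholder for your per-row handler, and the concurrency of 5 is arbitrary):

const PromisePool = require('es6-promise-pool');

const rows = [];   // rows collected from the parser
let index = 0;

// the producer hands out one promise at a time; the pool keeps at most
// `concurrency` of them in flight and asks for more as they settle
const producer = () => {
  if (index >= rows.length) return null; // null signals "no more work"
  return processRow(rows[index++]);      // must return a promise
};

const pool = new PromisePool(producer, 5 /* concurrency */);
pool.start().then(() => console.log('all rows processed'));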
