Currently I am trying to create a CSV reader that can handle very large CSV files. I chose a streaming implementation using the event-stream NPM package.
I have created a function getNextP() that should return a promise and give me the next piece of data every time I call it.
"use strict";
const fs = require('fs');
const es = require('event-stream');
const csv = require('csv-parser');
class CsvFileReader {
    constructor(file) {
        this.file = file;
        this.isStreamReading = false;
        this.stream = undefined;
    }
    getNextP() {
        return new Promise( (resolve) => {
            if (this.isStreamReading === true) {
                this.stream.resume();
            } else {
                this.isStreamReading = true;
                // Start reading the stream.
                this.stream = fs.createReadStream(this.file)
                    .pipe(csv())
                    .pipe(es.mapSync( (row) => {
                        this.stream.pause();
                        resolve(row);
                    }))
                    .on('error', (err) => {
                        console.error('Error while reading file.', err);
                    })
                    .on("end", () => {
                        resolve(undefined);
                    })
            }
        });
    }
}
I then call this with the following code.
const csvFileReader = new CsvFileReader("small.csv");
setInterval( () => {
    csvFileReader.getNextP().then( (frame) => {
        console.log(frame);
    })
}, 1000);
However, every time I try this out I only get the first row, and the subsequent rows never arrive. I cannot figure out why this is not working. I have tried the same thing with a good old callback function, and then it works without any problem.
Update:
So what I basically want is a function (getNext()) that returns the next row of the CSV every time I call it. Some rows can be buffered, but so far I have not been able to figure out how to do this with streams. If somebody could give me a pointer on how to create a correct getNext() function, that would be great.
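To illustrate the direction I am thinking of, here is a rough, untested sketch of a buffering getNext(). The row buffer and the queue of pending resolvers are my own guesses, and it uses csv-parser directly rather than event-stream:
"use strict";
const fs = require('fs');
const csv = require('csv-parser');

class BufferedCsvReader {
    constructor(file) {
        this.rows = [];        // rows parsed but not yet handed out
        this.waiting = [];     // resolvers of getNext() calls that arrived before data
        this.ended = false;
        this.stream = fs.createReadStream(file).pipe(csv());
        this.stream.on('data', (row) => {
            if (this.waiting.length) {
                this.waiting.shift()(row);
            } else {
                this.rows.push(row);
                this.stream.pause(); // crude backpressure: stop until a row is consumed
            }
        });
        this.stream.on('end', () => {
            this.ended = true;
            this.waiting.forEach((resolve) => resolve(undefined));
            this.waiting = [];
        });
    }
    getNext() {
        if (this.rows.length) {
            const row = this.rows.shift();
            if (!this.rows.length) this.stream.resume();
            return Promise.resolve(row);
        }
        if (this.ended) return Promise.resolve(undefined);
        return new Promise((resolve) => this.waiting.push(resolve));
    }
}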
I would like to ask if somebody understands what is going wrong here, and kindly ask them to share their knowledge.
Thank you in advance.
Related
I'm trying to download a bunch of files. Let's say 1.jpg, 2.jpg, 3.jpg and so on. If 1.jpg exists, then I want to try and download 2.jpg. And if that exists, I will try the next, and so on.
But the current "getFile" returns a promise, so I can't loop through it. I thought I had solved it by adding await in front of the http.get method. But it looks like it doesn't wait for the callback method to finish. Is there a more elegant way to solve this than to wrap the whole thing in a new async method?
// this returns a promise
var result = getFile(url, fileToDownload);
const getFile = async (url, saveName) => {
    try {
        const file = fs.createWriteStream(saveName);
        const request = await http.get(url, function(response) {
            const { statusCode } = response;
            if (statusCode === 200) {
                response.pipe(file);
                return true;
            }
            else
                return false;
        });
    } catch (e) {
        console.log(e);
        return false;
    }
}
I don't think your getFile method is returning a promise, and there is no point in awaiting a callback. You should split the functionality into two parts:
- getting the file
- saving the file, if getting the file returned something.
Try code like this:
const getFile = url => {
    return new Promise((resolve, reject) => {
        http.get(url, response => {
            const {statusCode} = response;
            if (statusCode === 200) {
                resolve(response);
            }
            reject(null);
        });
    });
};
async function save(url, saveName) {
    const result = await getFile(url);
    if (result) {
        const file = fs.createWriteStream(saveName);
        result.pipe(file);
    }
}
What you are trying to do is get / request images in a somewhat synchronous fashion.
Possible solutions:
- If you know the exact number of images you want to get, go ahead with the "request" or "http" module and use a promise chain.
- If you do not know the exact number of images, but will stop at image no. N-1 if N is not found, then go ahead with the sync-request module (see the sketch below).
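A rough sketch of the second option, assuming the sync-request package and the numeric naming scheme from the question (the base URL is hypothetical, and network errors are not handled here):
const request = require('sync-request'); // blocking HTTP client; fine for simple scripts
const fs = require('fs');

let i = 1;
while (true) {
    // hypothetical base URL; replace with the real one
    const res = request('GET', 'http://example.com/' + i + '.jpg');
    if (res.statusCode !== 200) break;        // image N not found: stop after N-1
    fs.writeFileSync(i + '.jpg', res.getBody());
    i++;
}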
Your getFile does return a promise, but only because it has the async keyword before it, and it's not the kind of promise you want. http.get uses old callback-style handling; luckily, it's easy to convert it to a Promise to suit your needs:
const tryToGetFile = (url, saveName) => {
    return new Promise((resolve) => {
        http.get(url, response => {
            if (response.statusCode === 200) {
                const stream = fs.createWriteStream(saveName)
                response.pipe(stream)
                resolve(true);
            } else {
                // usually it is better to reject promise and propagate errors further
                // but the function is called tryToGetFile as it expects that some file will not be available
                // and this is not an error. Simply resolve to false
                resolve(false);
            }
        })
    })
}
const fileUrls = [
'somesite.file1.jpg',
'somesite.file2.jpg',
'somesite.file3.jpg',
'somesite.file4.jpg',
]
const downloadInSequence = async () => {
    // using for..of instead of forEach to be able to pause
    // downloadInSequence function execution while getting file
    // can also use classic for
    for (const fileUrl of fileUrls) {
        const success = await tryToGetFile('http://' + fileUrl, fileUrl)
        if (!success) {
            // file with this name wasn't found
            return;
        }
    }
}
This is a basic setup to show how to wrap http.get in a Promise and run it in sequence. Add error handling wherever you want. It's also worth noting that it will proceed to the next file as soon as it has received a 200 status code and started downloading, rather than waiting for a full download before proceeding.
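If you do need to wait for the full download before moving on, a variant of the same function (my addition, not part of the original answer) can resolve on the write stream's finish event instead:
const tryToGetFile = (url, saveName) => {
    return new Promise((resolve, reject) => {
        http.get(url, response => {
            if (response.statusCode !== 200) {
                resolve(false);
                return;
            }
            const stream = fs.createWriteStream(saveName);
            response.pipe(stream);
            // resolve only once the whole file has been flushed to disk
            stream.on('finish', () => resolve(true));
            stream.on('error', reject);
        });
    });
};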
I am learning SSO and trying this out without the conventional User class/object. I am new to asynchronous programming and having difficulty in managing the data flow. I am stuck at a point where I have successfully exported a boolean value, but my import (in another module) gets undefined. I suspect it is because import does not wait for the corresponding export statement to execute first. How do I make it and all subsequent code wait?
I don't know what to try in this case.
Module that is exporting usrFlag
const request = require("request");
let usrFlag = false; // assuming user doesn't already exist.
function workDB(usr_id, usr_name, dateTimeStamp) {
    //some code excluded - preparing selector query on cloudant db
    request(options, function (error, response, body) {
        if (error) throw new Error(error);
        if (body.docs.length == 0) addUsr(usr_id, usr_name, dateTimeStamp);
        else {
            xyz(true); //This user already exists in cloudant
            console.log('User already exists since', body.docs[0].storageTime);
        }
    });
}
async function setUsrFlag(val) { usrFlag = val; }
async function xyz(val) {
    await setUsrFlag(val);
    //module.exports below does not execute until usrFlag has the correct value,
    //so the value is not exported until usrFlag has been properly set.
    console.log(usrFlag);
    module.exports.usrFlag = usrFlag;
}
Module that is importing this value
const usrP = require('../config/passport-setup');
const dbProcess = require('../dbOps/dbProcessLogic'); // <-- This is import
router.get('/google/redirect', passport.authenticate('google'), (req, res) => {
    dbProcess.workDB(usrP.usrPrf.id, usrP.usrPrf.displayName, new Date());
    // Instead of true/false, I see undefined here.
    console.log(dbProcess.usrFlag);
});
I expect the require call in the importing module to wait for the exporting module to send it all the required values. However, I know that is probably not going to happen without me explicitly telling it to do so. My question is: how?
So, I have just modified some of the code, so that I can work on it easily.
Module that is exporting usrFlag
// const request = require("request");
let usrFlag = false; // assuming user doesn't already exist.
function workDB(usr_id, usr_name, dateTimeStamp) {
    return new Promise(function (resolve, reject) {
        setTimeout(function () {
            xyz(true).then(function () {
                resolve('done');
            })
        }, 1000);
    });
}
function setUsrFlag(val) { usrFlag = val; }
function xyz(val) {
    return new Promise(function (resolve, reject) {
        setUsrFlag(val);
        module.exports.usrFlag = usrFlag;
        resolve('done');
    });
}
module.exports = {
    usrFlag,
    workDB
}
Module that is importing this value
const dbProcess = require('../dbOps/dbProcessLogic'); // <-- This is import
dbProcess.workDB().then(function () {
    console.log(dbProcess.usrFlag);
})
Now when you run the second file, you get usrFlag as true.
I have used setTimeout to imitate a request.
Sorry if I butchered up some of your code.
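An arguably cleaner variant (my own suggestion, untested against your cloudant setup) is to resolve the promise with the flag itself instead of re-assigning module.exports:
// exporting module: resolve with the flag instead of mutating module.exports
function workDB(usr_id, usr_name, dateTimeStamp) {
    return new Promise(function (resolve, reject) {
        request(options, function (error, response, body) {
            if (error) return reject(error);
            if (body.docs.length == 0) {
                addUsr(usr_id, usr_name, dateTimeStamp);
                resolve(false); // user did not exist yet
            } else {
                resolve(true);  // user already exists in cloudant
            }
        });
    });
}
module.exports = { workDB };

// importing module: read the value from the resolved promise
dbProcess.workDB(usrP.usrPrf.id, usrP.usrPrf.displayName, new Date()).then(function (usrFlag) {
    console.log(usrFlag); // true or false, never undefined
});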
I have a for loop over an array of promises, so I used Promise.all to go through them and called then afterwards.
let promises = [];
promises.push(promise1);
promises.push(promise2);
promises.push(promise3);
Promise.all(promises).then((responses) => {
    for (let i = 0; i < promises.length; i++) {
        if (promise.property === something) {
            //do something
        } else {
            let file = fs.createWriteStream('./hello.pdf');
            let stream = responses[i].pipe(file);
            /*
            I WANT THE PIPING AND THE FOLLOWING CODE
            TO RUN BEFORE NEXT ITERATION OF FOR LOOP
            */
            stream.on('finish', () => {
                //extract the text out of the pdf
                extract(filePath, {splitPages: false}, (err, text) => {
                    if (err) {
                        console.log(err);
                    } else {
                        arrayOfDocuments[i].text_contents = text;
                    }
                });
            });
        }
    }
});
promise1, promise2, and promise3 are some http requests, and if one of them is an application/pdf, then I write it to a stream and parse the text out of it. But this code runs the next iteration before parsing the text out of the pdf. Is there a way to make the code wait until the piping to the stream and the extraction are finished before moving on to the next iteration?
Without async/await, it's quite nasty. With async/await, just do this:
Promise.all(promises).then(async (responses) => {
    for (...) {
        await new Promise(fulfill => stream.on("finish", fulfill));
        //extract the text out of the PDF
    }
})
Something like the following would also work. I use this pattern fairly often:
let promises = [];
promises.push(promise1);
promises.push(promise2);
promises.push(promise3);
function doNext(){
    if(!promises.length) return;
    promises.shift().then((resolved) =>{
        if(resolved.property === something){
            ...
            doNext();
        }else{
            let file = fs.createWriteStream('./hello.pdf');
            let stream = resolved.pipe(file);
            stream.on('finish', () =>{
                ...
                doNext();
            });
        }
    })
}
doNext();
Or break up the handler into a controller and a promisified handler:
function streamOrNot(obj){
    return new Promise((resolve, reject) => {
        if(obj.property === something){
            resolve();
            return;
        }
        let file = fs.createWriteStream...;
        let stream = obj.pipe(file);
        stream.on('finish', () =>{
            ...
            resolve();
        });
    });
}
function doNext(){
    if(!promises.length) return;
    return promises.shift().then(streamOrNot).then(doNext);
}
doNext()
Use await with stream.pipeline() instead of stream.pipe():
import * as StreamPromises from "stream/promises";
...
await StreamPromises.pipeline(sourceStream, destinationStream);
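Applied to the loop from the question, a rough sketch (reusing responses, filePath, extract and arrayOfDocuments from the original code; untested) could look like this:
const fs = require('fs');
const { pipeline } = require('stream/promises');

Promise.all(promises).then(async (responses) => {
    for (let i = 0; i < responses.length; i++) {
        // wait until the response has been fully written to disk
        await pipeline(responses[i], fs.createWriteStream('./hello.pdf'));
        // wrap the callback-style extract() from the question so it can be awaited
        const text = await new Promise((resolve, reject) => {
            extract(filePath, { splitPages: false }, (err, t) => err ? reject(err) : resolve(t));
        });
        arrayOfDocuments[i].text_contents = text;
    }
});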
You can write the else part inside a self-invoked function, so that the handling of the stream will happen in parallel:
(function(i) {
    let file = fs.createWriteStream('./hello.pdf');
    let stream = responses[i].pipe(file);
    /*
    I WANT THE PIPING AND THE FOLLOWING CODE
    TO RUN BEFORE NEXT ITERATION OF FOR LOOP
    */
    stream.on('finish', () => {
        //extract the text out of the pdf
        extract(filePath, {splitPages: false}, (err, text) => {
            if (err) {
                console.log(err);
            }
            else {
                arrayOfDocuments[i].text_contents = text;
            }
        });
    });
})(i)
Alternatively, you can handle the streaming part as part of the original/individual promise itself.
As of now you are creating each promise and adding it to the array; instead, add promise.then to the array (which is also a promise), and do your streaming work inside that then handler.
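A minimal sketch of that idea, assuming each of the original promises resolves to a response stream as in the question:
const handled = [promise1, promise2, promise3].map((p, i) =>
    p.then(response => new Promise((resolve) => {
        const file = fs.createWriteStream('./hello' + i + '.pdf');
        const stream = response.pipe(file);
        // the outer promise now settles only after the file has been written
        stream.on('finish', resolve);
    }))
);
Promise.all(handled).then(() => {
    // every pdf is on disk here; extraction can follow
});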
I store files in Amazon S3, but also maintain a local file cache. When I need a file I want to check the cache first. I want to avoid testing for local file existence before reading, both because fs.exists will be deprecated and because the file can actually be deleted between the existence check and the file read.
I want to use promises and streams. The below example has a fallback to another local file. My real code will have S3 as fallback.
Would the below be a good solution?
The only way I found to get information about a failed read was to hook up an error handler to the stream. Once I get the "readable" event I unhook my temporary error handler.
I also wonder if I really need to unhook the handler when I use "once" to hook it up.
'use strict';
const fs = require('fs');
function tryToReadLocalFile() {
    return new Promise(function(resolve, reject) {
        let rs = fs.createReadStream('./test1.txt');
        let errorListener = function(err) {
            reject(err);
        };
        rs.once('error', errorListener)
        rs.on('readable', () => {
            rs.removeListener('error', errorListener);
            resolve(rs)
        });
    });
}
function tryToReadAnotherFile() {
    return new Promise(function(resolve, reject) {
        let rs = fs.createReadStream('./test2.txt');
        let errorListener = function(err) {
            reject(err);
        };
        rs.once('error', errorListener)
        rs.on('readable', () => {
            rs.removeListener('error', errorListener);
            resolve(rs)
        });
    });
}
tryToReadLocalFile()
    .catch(function(err) {
        if(err.code === 'ENOENT') {
            console.log('test1.txt not found. Fallback to test2.txt')
            //Reading from another file as a test. Should read from S3 as fallback
            return tryToReadAnotherFile();
        } else {
            return Promise.reject(err);
        }
    }).then(function(file) {
        console.log('writing to test.txt');
        let ws = fs.createWriteStream('./test.txt');
        file.pipe(ws);
    });
Edit:
I have now implemented a more compact version of the above. I would still be grateful for any input on this, though. Is this a good way to solve this?
As you can see I don't bother to check for ENOENT anymore. Whatever the error is, I want to fall back to S3.
function getFileFromStorageP(options) {
    return new Promise(function(resolve, reject) {
        let rs = fs.createReadStream(path.join(env.fsp.cacheDir, options.fileId));
        rs.once('error', (err) => {reject(err)});
        rs.once('readable', () => {resolve(rs)});
    }).catch(function(err) {
        return srvS3.download({
            fileId: options.fileId
        });
    });
}
I'll post an answer for others to read.
The above solution is now implemented and works great. I haven't found any easier way to get hold of the error, though.
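For what it's worth, on newer Node versions events.once can shorten this further, since the promise it returns rejects automatically if the stream emits 'error' before the awaited event. A sketch of that idea (my addition, reusing env.fsp.cacheDir and srvS3 from the code above):
const { once } = require('events');
const fs = require('fs');
const path = require('path');

async function getFileFromStorageP(options) {
    const rs = fs.createReadStream(path.join(env.fsp.cacheDir, options.fileId));
    try {
        await once(rs, 'readable'); // rejects if 'error' is emitted before 'readable'
        return rs;
    } catch (err) {
        return srvS3.download({ fileId: options.fileId });
    }
}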
I need to build a function for processing large CSV files for use in a bluebird.map() call. Given the potential sizes of the file, I'd like to use streaming.
This function should accept a stream (a CSV file) and a function (that processes the chunks from the stream), and return a promise that resolves when the file has been read to the end, or rejects on error.
So, I start with:
'use strict';
var _ = require('lodash');
var promise = require('bluebird');
var csv = require('csv');
var stream = require('stream');
var pgp = require('pg-promise')({promiseLib: promise});
api.parsers.processCsvStream = function(passedStream, processor) {
    var parser = csv.parse(passedStream, {trim: true});
    passedStream.pipe(parser);
    // use readable or data event?
    parser.on('readable', function() {
        // call processor, which may be async
        // how do I throttle the amount of promises generated
    });
    var db = pgp(api.config.mailroom.fileMakerDbConfig);
    return new Promise(function(resolve, reject) {
        parser.on('end', resolve);
        parser.on('error', reject);
    });
}
Now, I have two inter-related issues:
- I need to throttle the actual amount of data being processed, so as to not create memory pressure.
- The function passed as the processor param is often going to be async, such as saving the contents of the file to the db via a promise-based library (right now: pg-promise). As such, it will create a promise in memory and move on, repeatedly.
The pg-promise library has functions to manage this, like page(), but I'm not able to wrap my head around how to mix stream event handlers with these promise methods. Right now, I return a promise in the handler for the readable section after each read(), which means I create a huge amount of promised database operations and eventually fault out because I hit a process memory limit.
Does anyone have a working example of this that I can use as a jumping point?
UPDATE: Probably more than one way to skin the cat, but this works:
'use strict';
var _ = require('lodash');
var promise = require('bluebird');
var csv = require('csv');
var stream = require('stream');
var pgp = require('pg-promise')({promiseLib: promise});
api.parsers.processCsvStream = function(passedStream, processor) {
    // some checks trimmed out for example
    var db = pgp(api.config.mailroom.fileMakerDbConfig);
    var parser = csv.parse(passedStream, {trim: true});
    passedStream.pipe(parser);
    var readDataFromStream = function(index, data, delay) {
        var records = [];
        var record;
        do {
            record = parser.read();
            if(record != null)
                records.push(record);
        } while(record != null && (records.length < api.config.mailroom.fileParserConcurrency))
        parser.pause();
        if(records.length)
            return records;
    };
    var processData = function(index, data, delay) {
        console.log('processData(' + index + ') > data: ', data);
        parser.resume();
    };
    parser.on('readable', function() {
        db.task(function(tsk) {
            this.page(readDataFromStream, processData);
        });
    });
    return new Promise(function(resolve, reject) {
        parser.on('end', resolve);
        parser.on('error', reject);
    });
}
Does anyone see a potential problem with this approach?
You might want to look at promise-streams:
var ps = require('promise-streams');
passedStream
    .pipe(csv.parse({trim: true}))
    .pipe(ps.map({concurrent: 4}, row => processRowDataWhichMightBeAsyncAndReturnPromise(row)))
    .wait().then(_ => {
        console.log("All done!");
    });
Works with backpressure and everything.
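Wrapped into the processCsvStream function from the question, this might look roughly like the following (a sketch, assuming processor returns a promise per row):
api.parsers.processCsvStream = function(passedStream, processor) {
    return passedStream
        .pipe(csv.parse({trim: true}))
        .pipe(ps.map({concurrent: 4}, row => processor(row)))
        .wait(); // resolves once every row has been processed
};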
Find below a complete application that correctly executes the same kind of task as you want: It reads a file as a stream, parses it as a CSV and inserts each row into the database.
const fs = require('fs');
const promise = require('bluebird');
const csv = require('csv-parse');
const pgp = require('pg-promise')({promiseLib: promise});

const cn = "postgres://postgres:password@localhost:5432/test_db";
const rs = fs.createReadStream('primes.csv');

const db = pgp(cn);

function receiver(_, data) {
    function source(index) {
        if (index < data.length) {
            // here we insert just the first column value that contains a prime number;
            return this.none('insert into primes values($1)', data[index][0]);
        }
    }
    return this.sequence(source);
}

db.task(t => {
        return pgp.spex.stream.read.call(t, rs.pipe(csv()), receiver);
    })
    .then(data => {
        console.log('DATA:', data);
    })
    .catch(error => {
        console.log('ERROR:', error);
    });
Note that the only thing I changed was using the csv-parse library instead of csv, as a better alternative.
I also added use of the stream.read method from the spex library, which properly serves a Readable stream for use with promises.
I found a slightly better way of doing the same thing, with more control. This is a minimal skeleton with precise parallelism control. With a parallel value of one, all records are processed in sequence without holding the entire file in memory; we can increase the parallel value for faster processing.
const csv = require('csv');
const csvParser = require('csv-parser')
const fs = require('fs');
const readStream = fs.createReadStream('IN');
const writeStream = fs.createWriteStream('OUT');
const transform = csv.transform({ parallel: 1 }, (record, done) => {
    asyncTask(...) // return Promise
        .then(result => {
            // ... do something when success
            return done(null, record);
        }, (err) => {
            // ... do something when error
            return done(null, record);
        })
});

readStream
    .pipe(csvParser())
    .pipe(transform)
    .pipe(csv.stringify())
    .pipe(writeStream);
This allows doing an async task for each record.
To return a promise instead, we can return an empty promise and complete it when the stream finishes:
.on('end', function() {
    //do something with csvData
    console.log(csvData);
});
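A sketch of that wrapping, reusing the streams defined above (my addition; resolve when the write stream finishes, reject on error):
function processCsv() {
    return new Promise((resolve, reject) => {
        readStream
            .pipe(csvParser())
            .pipe(transform)
            .pipe(csv.stringify())
            .pipe(writeStream)
            .on('finish', resolve) // every record has been transformed and written
            .on('error', reject);
    });
}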
So you're saying you don't want streaming but some kind of data chunks? ;-)
Do you know https://github.com/substack/stream-handbook?
I think the simplest approach, without changing your architecture, would be some kind of promise pool, e.g. https://github.com/timdp/es6-promise-pool
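A minimal sketch with es6-promise-pool (nextParsedRow and processor are placeholders for your row source and per-row async handler, not names from your code):
const PromisePool = require('es6-promise-pool');

// the producer hands out one promise at a time and returns null when nothing is left
const producer = () => {
    const row = nextParsedRow();          // hypothetical: pulls the next parsed row
    return row ? processor(row) : null;   // processor is the async per-row function
};

const pool = new PromisePool(producer, 4); // at most 4 rows processed concurrently
pool.start().then(() => console.log('All rows processed'));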