I'm having trouble parsing an 800k-line CSV file line by line using the npm library csv-parser and promises.
Here is what I am doing: simply pausing on every row and resuming after the user has been upserted in the database.
At around 3,000 users, more than 1 GB of RAM is used and a heap out-of-memory exception appears.
const csv = require('csv-parser');
const fs = require('fs');
const path = require('path');
function parseData() {
  return new Promise((resolve, reject) => {
    const stream = fs.createReadStream(filePath)
      .pipe(csv(options))
      .on('data', row => {
        stream.pause();
        upsertUser(row)
          .then(user => {
            stream.resume();
          })
          .catch(err => {
            console.log(err);
            stream.resume();
          });
      })
      .on('error', err => reject(err))
      .on('end', () => resolve());
  });
}
The upsert function:
function upsertUser(row) {
  return user.find({
    where: {
      mail: row.emailAddress
    }
  });
}
Edit: here is a picture of the Node inspector (screenshot omitted).
On a Node.js GraphQL API, using Express + @graphql-yoga/node + graphql-upload-minimal, I have uploads working very well, but when I upload a huge file and the upload file size limit is reached, the stream continues until it is finished, which imposes an unnecessary wait for the entire stream to complete.
I tried the code below, but the reject doesn't destroy the stream:
stream.on('limit', function () {
  const incomplete_file = `${folder}/${my_filename}`;
  fs.unlink(incomplete_file, function () {});
  reject(
    new GraphQLYogaError(`Error: Upload file size overflow`, {
      code: 'UPLOAD_SIZE_LIMIT_OVERFLOW',
    }),
  );
});
Full module below:
import getFilename from './getFilename';
import fs from 'fs';
import { GraphQLYogaError } from '@graphql-yoga/node';

export default async function uploadSingleFile(
  folder: string,
  file: any,
): Promise<any> {
  return new Promise(async (resolve, reject) => {
    const { createReadStream, filename, fieldName, mimetype, encoding } =
      await file;
    const my_filename = getFilename(filename, fieldName);

    let size = 0;
    let stream = createReadStream();

    stream.on('data', function (chunk: any) {
      size += chunk.length;
      fs.appendFile(`${folder}/${my_filename}`, chunk, function (err) {
        if (err) throw err;
      });
    });

    stream.on('close', function () {
      resolve({
        nameOriginal: filename,
        nameUploaded: my_filename,
        mimetype: mimetype,
      });
    });

    stream.on('limit', function () {
      const incomplete_file = `${folder}/${my_filename}`;
      fs.unlink(incomplete_file, function () {});
      reject(
        new GraphQLYogaError(`Error: Upload file size overflow`, {
          code: 'UPLOAD_SIZE_LIMIT_OVERFLOW',
        }),
      );
    });
  })
    .then((data) => data)
    .catch((e) => {
      throw new GraphQLYogaError(e.message);
    });
}
How can I force an immediate end of the stream? Is there any method to destroy the stream?
Thanks for the help!
I know for sure that my pullData module is getting the data back from the file read, but the function calling it, even though it has an await, is not getting the data.
This is the module (./initialise.js) that reads the data:
const fs = require('fs');

const getData = () => {
  return new Promise((resolve, reject) => {
    fs.readFile('./Sybernika.txt',
      { encoding: 'utf8', flag: 'r' },
      function (err, data) {
        if (err)
          reject(err);
        else
          resolve(data);
      });
  });
};

module.exports = { getData };
And this is where it gets called (app.js):
const init = require('./initialise');

const pullData = async () => {
  init.getData().then((data) => {
    return data;
  }).catch((err) => {
    console.log(err);
  });
};

const start = async () => {
  let data = await pullData();
  console.log(data);
};

start();
Putting a console.log(data) just before the return data in the .then callback shows the data, so I know it's being read OK. However, that final console.log shows my data variable as undefined.
Any suggestions?
It's either
const pullData = async () => {
  return init.getData().then((data) => {
    return data;
  }).catch((err) => {
    console.log(err);
  });
};
or
const pullData = async () =>
  init.getData().then((data) => {
    return data;
  }).catch((err) => {
    console.log(err);
  });
Both versions make sure the promise returned by then/catch is passed back to the caller; without that return, pullData implicitly resolves with undefined.
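A self-contained illustration of the difference (getData here is a stand-in, not the module from the question):

```javascript
const getData = () => Promise.resolve('file contents');

const withoutReturn = async () => {
  getData().then((data) => data); // promise not returned: caller gets undefined
};

const withReturn = async () => {
  return getData().then((data) => data); // promise returned: caller gets the data
};

(async () => {
  console.log(await withoutReturn()); // → undefined
  console.log(await withReturn()); // → file contents
})();
```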
I am reading a CSV file line by line and inserting/updating records in MongoDB. The expected output for each row is:

1. console.log(row);
2. console.log(cursor);
3. console.log("stream");

But instead I am getting output like:

1. console.log(row); console.log(row); console.log(row); console.log(row); ...
2. console.log(cursor);
3. console.log("stream");

i.e. all the rows are logged first, and only then do the cursor and "stream" lines appear. Please let me know what I am missing here.
const csv = require('csv-parser');
const fs = require('fs');
var mongodb = require("mongodb");
var client = mongodb.MongoClient;
var url = "mongodb://localhost:27017/";
var collection;

client.connect(url, { useUnifiedTopology: true }, function (err, client) {
  var db = client.db("UKCompanies");
  collection = db.collection("company");
  startRead();
});

var cursor = {};

async function insertRec(row) {
  console.log(row);
  cursor = await collection.update({ CompanyNumber: 23 }, row, { upsert: true });
  if (cursor) {
    console.log(cursor);
  } else {
    console.log('not exist');
  }
  console.log("stream");
}

async function startRead() {
  fs.createReadStream('./data/inside/6.csv')
    .pipe(csv())
    .on('data', async (row) => {
      await insertRec(row);
    })
    .on('end', () => {
      console.log('CSV file successfully processed');
    });
}
In your startRead() function, the await insertRec() does not stop more data events from flowing while the insertRec() is processing. So, if you don't want the next data event to run until the insertRec() is done, you need to pause, then resume the stream.
async function startRead() {
  const stream = fs.createReadStream('./data/inside/6.csv')
    .pipe(csv())
    .on('data', async (row) => {
      try {
        stream.pause();
        await insertRec(row);
      } finally {
        stream.resume();
      }
    })
    .on('end', () => {
      console.log('CSV file successfully processed');
    });
}
FYI, you also need some error handling if insertRec() fails.
That is expected behavior in this case, because your on('data') listener triggers insertRec asynchronously as and when data is available in the stream, which is why the first line of the insert method executes effectively in parallel for many rows. If you want to control this behavior, you can use the highWaterMark property (https://nodejs.org/api/stream.html#stream_readable_readablehighwatermark) when creating the read stream to limit how much data is buffered at once (note that for fs.createReadStream it is measured in bytes, so a value of 1 reads one byte at a time and will be slow). This way you will get roughly one record at a time, but I am not sure what your use case is.
Something like this:
fs.createReadStream(`somefile.csv`, {
  highWaterMark: 1
})
Also, you are not awaiting your startRead method. I would wrap it in a promise and resolve it in the end listener; otherwise you will not know when processing has finished. Something like:
function startRead() {
  return new Promise((resolve, reject) => {
    fs.createReadStream(`somepath`)
      .pipe(csv())
      .on("data", async row => {
        await insertRec(row);
      })
      .on("error", err => {
        reject(err);
      })
      .on("end", () => {
        console.log("CSV file successfully processed");
        resolve();
      });
  });
}
From Node 10+, readable streams have a Symbol.asyncIterator property, which allows processing a stream with for await...of:
async function startRead() {
  const readStream = fs.createReadStream('./data/inside/6.csv');
  for await (const row of readStream.pipe(csv())) {
    await insertRec(row);
  }
  console.log('CSV file successfully processed');
}
I have some hundreds of JSON files that I need to process in a defined sequence and write back the content as CSV in the same order as in the JSON files:
1. Write a CSV file with a header
2. Collect an array of JSON files to process
3. Read the file and return an array with the required information
4. Append the CSV file, created under #1, with the information
5. Continue with the next JSON file at step #3
'use strict';

const glob = require('glob');
const fs = require('fs');
const fastcsv = require('fast-csv');

function writeHeader(fileName) {
  return new Promise((resolve, reject) => {
    fastcsv
      .writeToStream(fs.createWriteStream(fileName), [['id', 'aa', 'bb']], { headers: true })
      .on('error', (err) => reject(err))
      .on('finish', () => resolve(true));
  });
}

function buildFileList(globPattern) {
  return new Promise((resolve, reject) => {
    glob(globPattern, (err, files) => {
      if (err) {
        reject(err);
      } else {
        resolve(files);
      }
    });
  });
}

function readFromFile(file) {
  return new Promise((resolve, reject) => {
    fs.readFile(file, 'utf8', (err, data) => {
      if (err) {
        reject(err);
      } else {
        const obj = JSON.parse(data);
        const key = Object.keys(obj['776'])[0];
        const solarValues = [];
        obj['776'][key].forEach((item, i) => solarValues.push([i, item[0], item[1][0][0]]));
        resolve(solarValues);
      }
    });
  });
}

function csvAppend(fileName, rows = []) {
  return new Promise((resolve, reject) => {
    const csvFile = fs.createWriteStream(fileName, { flags: 'a' });
    csvFile.write('\n');
    fastcsv
      .writeToStream(csvFile, rows, { headers: false })
      .on('error', (err) => reject(err))
      .on('finish', () => resolve(true));
  });
}

writeHeader('test.csv')
  .then(() => buildFileList('data/*.json'))
  .then(fileList => Promise.all(fileList.map(item => readFromFile(item))))
  .then(result => Promise.all(result.map(item => csvAppend('test.csv', item))))
  .catch(err => console.log(err.message));
JSON examples:
https://gist.github.com/Sineos/a40718c13ad0834b4a0056091e3ac4ca
https://gist.github.com/Sineos/d626c3087074c23a073379ecef84a55c
Question
While the code basically works, my problem is that the CSV is not written back in a defined order, but mixed up as in an asynchronous process.
I tried various combinations with and without Promise.all, resulting in either pending promises or a mixed-up CSV file.
This is my first take on Node.js promises, so any input on how to do it correctly is greatly appreciated. Many thanks in advance.
This code should process your files in order; it uses async/await and a for...of loop to run in sequence:
async function processJsonFiles() {
  try {
    await writeHeader('test.csv');
    let fileList = await buildFileList('data/*.json');
    for (let file of fileList) {
      let rows = await readFromFile(file);
      await csvAppend('test.csv', rows);
    }
  } catch (err) {
    console.error(err.message);
  }
}

processJsonFiles();
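For reference, the mix-up in the original chain was not in the reads: Promise.all preserves input order in its result array, so the files were read back in order; it was the concurrent csvAppend calls in the final .then that interleaved the writes. A minimal sketch of the ordering guarantee (the delays are invented for illustration):

```javascript
const delay = (ms, value) =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

async function demo() {
  // The second promise settles first, yet the results keep input order.
  return Promise.all([delay(30, 'first'), delay(10, 'second')]);
}

demo().then((results) => console.log(results)); // → [ 'first', 'second' ]
```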
Hi guys, I'm facing a problem with my Node.js API (Express) when trying to get files from FTP and then send them over my API as base64.
I'm using promise-ftp (https://www.npmjs.com/package/promise-ftp).
This is how endpoint looks like:
getData = (req, res, next) => {
  const ftp = new PromiseFtp();
  let data = [];
  ftp.connect({ host: 'xxxl', user: 'xxx', password: 'xxx' })
    .then(() => {
      return ftp.get('xxx.pdf');
    }).then((stream) => {
      return new Promise((resolve, reject) => {
        stream.once('close', resolve);
        stream.once('error', reject);
        stream.pipe(fs.createReadStream('test.pdf'));
        stream
          .on('error', (err) => {
            return res.send({ errorMessage: err });
          })
          .on('data', (chunk) => data.push(chunk))
          .on('end', () => {
            const buffer = Buffer.concat(data);
            label = buffer.toString('base64');
            return res.send(label);
          });
      });
    }).then(() => {
      return ftp.end();
    });
}
The problem is that I don't want to save this file locally next to the API files, but when I remove the line stream.pipe(fs.createReadStream('test.pdf')); it doesn't work.
I'm not sure what pipe is doing here.
Could you please help me?
readable.pipe(writable) is part of Node's stream API; it transparently writes the data read from the readable into the writable stream, handling backpressure for you. Piping the data to the filesystem is unnecessary here (incidentally, the code above pipes into fs.createReadStream; saving a copy to disk would require fs.createWriteStream). The Express Response object implements the writable stream interface, so you can pipe the stream returned by the FTP promise directly to the res object:
getData = async (req, res) => {
  const ftp = new PromiseFtp();
  try {
    await ftp.connect({ host: 'xxxl', user: 'xxx', password: 'xxx' });
    const stream = await ftp.get('xxx.pdf');
    res.type('pdf');
    await new Promise((resolve, reject) => {
      res.on('finish', resolve);
      stream.once('error', reject);
      stream.pipe(res);
    });
  } catch (e) {
    console.error(e);
  } finally {
    await ftp.end();
  }
}
If you don't have a Node version that supports async/await, here's a Promise-only version:
getData = (req, res) => {
  const ftp = new PromiseFtp();
  ftp
    .connect({ host: 'xxxl', user: 'xxx', password: 'xxx' })
    .then(() => ftp.get('xxx.pdf'))
    .then(stream => {
      res.type('pdf');
      return new Promise((resolve, reject) => {
        res.on('finish', resolve);
        stream.once('error', reject);
        stream.pipe(res);
      });
    })
    .catch(e => {
      console.error(e);
    })
    .finally(() => ftp.end());
}
Here you have a good use case for a Promise's finally() method, or a try/catch/finally block, which ensures that ftp.end() is called whether or not an error occurs.
Note that I've deliberately left out sending the error back to the client, since doing so could leak sensitive information. A better solution is to set up proper server-side logging with request context.
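If the base64 payload from the question is still wanted rather than streaming the PDF directly, the chunks can be collected in memory without ever touching the filesystem. A sketch of a hypothetical helper (streamToBase64 is not part of promise-ftp, just an illustration):

```javascript
// Hypothetical helper: buffer a readable stream fully, then base64-encode it.
function streamToBase64(stream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    stream.on('data', (chunk) => chunks.push(chunk));
    stream.once('error', reject);
    stream.once('end', () => resolve(Buffer.concat(chunks).toString('base64')));
  });
}
```

In the handler above it would be used as `res.send(await streamToBase64(stream));`, keeping the try/finally around the FTP connection unchanged.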