Make multi-threaded 1 million inserts to MongoDB using Node.js scripts - node.js

I have an isolated sync server that pulls a tab-delimited text file from an external FTP server and updates (saves) it to MongoDB after processing.
My code looks like this:
// this function pulls the file from the external FTP server
async function upstreamFile() {
  try {
    // spawnSync blocks until the FTP pull completes and returns stdout as a string
    let pythonProcess = spawnSync('python3', [configVar.ftpInbound, '/outbound/Items.txt', configVar.dataFiles.items], {encoding: 'utf8'});
    logger.info('FTP SERVER LOGS...' + '\n' + pythonProcess.stdout);
    await readItemFile();
    logger.info('The processing of the file is done');
    process.exit();
  } catch (upstreamError) {
    logger.error(upstreamError);
    process.exit(1);
  }
}
// this function connects to the DB and calls the processing function for each row in the text file
async function readItemFile() {
  try {
    logger.info('Reading Items File');
    let dataArray = fs.readFileSync(configVar.dataFiles.items, 'utf8').toString().split('\n');
    logger.info('No of Rows Read', dataArray.length);
    await dbConnect.connectToDB(configVar.db);
    logger.info('Connected to Database', configVar.db);
    while (dataArray.length) {
      // process the file in batches of 5000 rows
      await Promise.all(dataArray.splice(0, 5000).map(async (f) => {
        let splitValues = f.split('|');
        await processItemsFile(splitValues);
      }));
      logger.info('Current batch finished processing');
    }
    logger.info('ALL batches finished processing');
  } catch (promiseError) {
    logger.error(promiseError);
  }
}
async function processItemsFile(splitValues) {
  try {
    // Processing of the row is done here and 'save' in Mongoose is used to write to the DB.
    // The data is cleaned and assigned to the respective fields (that is where `exists` and `assignedValues` come from).
    if (!exists) {
      let processedValues = new Products(assignedValues);
      let productDetails = await processedValues.save();
    }
    return;
  } catch (error) {
    throw error;
  }
}
upstreamFile()
So this takes about 3 hours to process 100,000 rows and update them in the database.
Is there any way I can speed this up? I am very much limited by the hardware: a Linux EC2 instance with 2 cores and 4 GB of RAM.
Should I use worker threads (e.g. the microjob library) to run multiple threads? If yes, how would I go about doing it?
Or is this the maximum performance?
Note: I can't do a bulk update in MongoDB, because Mongoose pre hooks get triggered on save.

You can always try a bulk update using the updateOne operation.
I would also consider using a read stream (fs.createReadStream) instead of readFileSync.
With an event-driven approach you could push, say, every 100k updates into array chunks and bulk-write each chunk (see the chunked sketch after the example below).
You can register a pre('updateOne') hook (instead of pre('save')) for this operation.
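A minimal sketch of such a hook (the schema name is a placeholder; note that in query middleware this refers to the Query, not the document, and you should verify that your Mongoose version actually fires updateOne middleware for Model.bulkWrite before relying on it):
productSchema.pre('updateOne', function (next) {
  // `this` is the Query here, so read or modify the pending update payload
  const update = this.getUpdate();
  // ...replicate whatever your pre('save') hook did, e.g. stamping a field on `update`
  next();
});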
I solved a similar problem (updating 100k CSV rows) with the following approach:
Create a read stream with fs.createReadStream (thanks to that, your application won't consume much heap memory in the case of huge files).
I use the csv-parser npm library to deconstruct the CSV file into separate rows of data:
const fs = require('fs');
const csv = require('csv-parser');

let updates = [];
fs.createReadStream('/filePath')
  .pipe(csv())
  .on('data', row => {
    // ...do anything with the data
    updates.push({
      updateOne: {
        filter: { /* here put the query */ },
        update: [ /* any data you want to update */ ],
        upsert: true // in my case I want to create the record if it does not exist
      }
    });
  })
  .on('end', async () => {
    await MyCollection.bulkWrite(updates)
      .catch(err => {
        logger.error(err);
      });
    updates = []; // I just clean up the huge array
  });
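If even the updates array grows too large for memory, here is a minimal sketch of flushing it in chunks (the batch size of 100,000 mirrors the suggestion above and is otherwise arbitrary; pause() is advisory, so a few already-buffered rows may still arrive before the flush finishes):
const BATCH_SIZE = 100000;
let updates = [];
const stream = fs.createReadStream('/filePath').pipe(csv());
stream
  .on('data', async row => {
    updates.push({
      updateOne: {
        filter: { /* here put the query */ },
        update: [ /* any data you want to update */ ],
        upsert: true
      }
    });
    if (updates.length >= BATCH_SIZE) {
      stream.pause();                        // stop reading while this chunk is written
      await MyCollection.bulkWrite(updates); // bulk update the current chunk
      updates = [];
      stream.resume();
    }
  })
  .on('end', async () => {
    if (updates.length) {
      await MyCollection.bulkWrite(updates); // flush whatever is left
    }
  });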

Related

nodejs - Async generator/iterator with or without awaiting long operation

I'm trying to understand which setup is the best for doing the following operations:
Read line by line a CSV file
Use the row data as input of a complex function that at the end outputs a file (one file for each row)
When the entire process is finished I need to zip all the files generated during step 2
My goal: a fast and scalable solution able to handle huge files.
I've implemented step 2 using two approaches and I'd like to know which is the best and why (or if there are other, better ways).
Step 1
This is simple and I rely on CSV Parser - async iterator API:
async function* loadCsvFile(filepath, params = {}) {
  try {
    const parameters = {
      ...csvParametersDefault,
      ...params,
    };
    const inputStream = fs.createReadStream(filepath);
    const csvParser = parse(parameters);
    const parser = inputStream.pipe(csvParser);
    for await (const line of parser) {
      yield line;
    }
  } catch (err) {
    throw new Error("error while reading csv file: " + err.message);
  }
}
Step 2
Option 1
Await the long operation handleCsvLine for each line:
// step 1
const csvIterator = loadCsvFile(filePath, options);
// step 2
let counter = 0;
for await (const row of csvIterator) {
  await handleCsvLine(row);
  counter++;
  if (counter % 50 === 0) {
    logger.debug(`Processed label ${counter}`);
  }
}
// step 3
zipFolder(folderPath);
Pro
nice to see the files being generated one after the other
since it waits for each operation to end, I can show the progress nicely
Cons
it waits for each operation; could it be faster?
Option 2
Push the promise returned by the long operation handleCsvLine into an array and then, after the loop, do Promise.all:
// step 1
const csvIterator = loadCsvFile(filePath, options);
// step 2
let counter = 0;
const promises = [];
for await (const row of csvIterator) {
  promises.push(handleCsvLine(row));
  counter++;
  if (counter % 50 === 0) {
    logger.debug(`Processed label ${counter}`);
  }
}
await Promise.all(promises);
// step 3
zipFolder(folderPath);
Pro
I do not wait for each line, so it should be faster, shouldn't it?
Cons
since it does not wait, the for loop is very fast, but then there is a long wait at the end (i.e., a bad progress experience)
Step 3
A simple step in which I use the archiver library to create a zip of the folder in which I saved the files from step 2:
function zipFolder(folderPath, globPath, outputFolder, outputName, logger) {
  return new Promise((resolve, reject) => {
    // create a file to stream archive data to
    const stream = fs.createWriteStream(path.join(outputFolder, outputName));
    const archive = archiver("zip", {
      zlib: { level: 9 }, // sets the compression level
    });
    archive.glob(globPath, { cwd: folderPath });
    // good practice to catch warnings (i.e. stat failures and other non-blocking errors)
    archive.on("warning", function (err) {
      if (err.code === "ENOENT") {
        logger.warning(err);
      } else {
        logger.error(err);
        reject(err);
      }
    });
    // good practice to catch this error explicitly
    archive.on("error", function (err) {
      logger.error(err);
      reject(err);
    });
    // pipe archive data to the file
    archive.pipe(stream);
    // listen for all archive data to be written
    // the 'close' event is fired only when a file descriptor is involved
    stream.on("close", function () {
      resolve();
    });
    archive.finalize();
  });
}
Not using await does not make the operations faster. The code simply does not wait for a response before moving on to the next operation; the work is still queued on the event loop either way, with or without await.
You should use child_process instead to achieve parallel processing. Node.js is not multithreaded, but with child_process you can run work on separate CPU cores. This way you can generate multiple files at a time, based on the number of CPU cores available on the system.
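A minimal sketch of that idea, assuming the rows have already been collected from loadCsvFile and the per-row work lives in a separate worker script (csv-worker.js and processInParallel are hypothetical names):
// parent: split the rows across one child process per CPU core
const { fork } = require('child_process');
const os = require('os');

async function processInParallel(rows) {
  const numWorkers = os.cpus().length;
  const chunkSize = Math.ceil(rows.length / numWorkers);
  const running = [];
  for (let i = 0; i < numWorkers; i++) {
    const chunk = rows.slice(i * chunkSize, (i + 1) * chunkSize);
    if (!chunk.length) continue;
    running.push(new Promise((resolve, reject) => {
      const child = fork('./csv-worker.js'); // hypothetical worker script
      child.send(chunk);                     // hand this worker its share of rows
      child.on('message', resolve);          // worker signals completion
      child.on('error', reject);
    }));
  }
  await Promise.all(running);
}

// csv-worker.js: run the long per-row operation for its chunk
process.on('message', async (rows) => {
  for (const row of rows) {
    await handleCsvLine(row); // the long operation from the question
  }
  process.send('done');
  process.exit(0);
});
zipFolder (step 3) would then run after processInParallel resolves.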

Correct way to organise this process in Node

I need some advice on how to structure this function, as at the moment things are not happening in the correct order due to Node being asynchronous.
This is the flow I want to achieve; I don't need help with the code itself, but with the order needed to achieve the end result, plus any suggestions on how to make it efficient:
1. Node routes a GET request to my controller.
2. The controller reads a .csv file on the local system and opens a read stream using the fs module.
3. Then use the csv-parse module to convert that to an array line by line (many hundreds of thousands of lines).
4. Start a try/catch block.
5. With the current row from the CSV, take a value and try to find it in MongoDB.
6. If found, take the ID and store the line from the CSV and this ID as a foreign ID in a separate database.
7. If not found, create an entry in the DB, take the new ID, and then do step 6.
8. Print to the terminal the row number being worked on (ideally, at some point I would like to be able to send this value to the page and have it update like a progress bar as the rows are completed).
Here is a small part of the code structure that I am currently using:
const fs = require('fs');
const parse = require('csv-parse');

function addDataOne(req, id) {
  const modelOneInstance = new InstanceOne({ ...code });
  const resultOne = modelOneInstance.save();
  return resultOne;
}

function addDataTwo(req, id) {
  const modelTwoInstance = new InstanceTwo({ ...code });
  const resultTwo = modelTwoInstance.save();
  return resultTwo;
}

exports.add_data = (req, res) => {
  const fileSys = 'public/data/';
  const parsedData = [];
  let i = 0;
  fs.createReadStream(`${fileSys}${req.query.file}`)
    .pipe(parse({}))
    .on('data', (dataRow) => {
      let RowObj = {
        one: dataRow[0],
        two: dataRow[1],
        three: dataRow[2],
        etc,
        etc
      };
      try {
        ModelOne.find(
          { propertyone: RowObj.one, propertytwo: RowObj.two },
          '_id foreign_id'
        ).exec((err, searchProp) => {
          if (err) {
            console.log(err);
          } else {
            if (searchProp.length > 1) {
              console.log('too many returned from find function');
            }
            if (searchProp.length === 1) {
              addDataOne(RowObj, searchProp[0]).then((result) => {
                searchProp[0].foreign_id.push(result._id);
                searchProp[0].save();
              });
            }
            if (searchProp.length === 0) {
              let resultAddProp = null;
              addDataTwo(RowObj).then((result) => {
                resultAddProp = result;
                addDataOne(req, resultAddProp._id).then((result) => {
                  resultAddProp.foreign_id.push(result._id);
                  resultAddProp.save();
                });
              });
            }
          }
        });
      } catch (error) {
        console.log(error);
      }
      i++;
      let iString = i.toString();
      process.stdout.clearLine();
      process.stdout.cursorTo(0);
      process.stdout.write(iString);
    })
    .on('end', () => {
      res.send('added');
    });
};
I have tried to make the functions use async/await, but it seems to conflict with fs.createReadStream or the csv-parse functionality, probably due to my inexperience and incorrect use of the code...
I appreciate that this is a long question about the fundamentals of the code, but some tips/advice/pointers on how to get this going would be appreciated. I had it working when the data was sent one record at a time via a POST request from Postman, but I can't implement the next stage, which is to read from the CSV file that contains many records.
First of all you can make the following checks into one query:
if (searchProp.length === 1) {
if (searchProp.length === 0) {
Use the upsert option of MongoDB's findOneAndUpdate query to update or insert.
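A minimal sketch of that, reusing the filter from the question inside an async handler (the $setOnInsert payload is just a placeholder):
const doc = await ModelOne.findOneAndUpdate(
  { propertyone: RowObj.one, propertytwo: RowObj.two },
  { $setOnInsert: { propertyone: RowObj.one, propertytwo: RowObj.two } },
  { upsert: true, new: true } // create the record if it does not exist, return it either way
);
// doc._id can now be used as the foreign ID whether the record existed or was just created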
Secondly, don't do this on the main thread. Use a queue mechanism; it will be much more efficient.
The queue I personally use is Bull.
https://github.com/OptimalBits/bull#basic-usage
This also provides the progress-reporting functionality you need.
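A minimal sketch of the Bull pattern (the queue name and Redis URL are assumptions):
const Queue = require('bull');
const rowQueue = new Queue('csv-rows', 'redis://127.0.0.1:6379');

// producer: add one job per parsed CSV row instead of hitting Mongo in the 'data' handler
rowQueue.add({ row: RowObj });

// consumer: do the find/upsert and save work off the request path
rowQueue.process(async (job) => {
  // ...findOneAndUpdate / addDataOne logic goes here...
  job.progress(100); // Bull's progress API is what you can surface to the page
});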
Also, regarding using async/await with a read stream, a lot of examples can be found on the net, such as: https://humanwhocodes.com/snippets/2019/05/nodejs-read-stream-promise/
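For instance, a minimal sketch of consuming the parse stream with for await...of inside an async handler (mirroring the stream set up in the question):
exports.add_data = async (req, res) => {
  const fileSys = 'public/data/';
  const parser = fs.createReadStream(`${fileSys}${req.query.file}`).pipe(parse({}));
  let i = 0;
  for await (const dataRow of parser) {
    // await the per-row DB work here; the next row is not read until this one finishes
    i++;
  }
  res.send('added');
};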

NodeJS Streams Waiting for all data to be streamed

My code streams data from a Mongo collection and inserts it into another DB after performing a number of operations on it.
I cannot seem to find a way to wait for all of the data to be operated on before closing the DB connections. The data finishes being worked on a few seconds after the stream is 'closed'. I want to wait for ALL the data operations to finish before closing the DB connections. How do I do that?
PS: I do see that all the data gets operated on; this works well. But I want to close the DB connections, and I need to wait for all ops to finish before doing this!
var count = 0; // counter for number of mongo docs that have been inserted into new_db
mongo_db.collection(config.collection, function (err, coll) {
  coll.find(config.mongo_query).count(function (e, coll_docs_count) {
    var stream = coll.find(config.mongo_query).stream();
    stream.on('close', function () {
      if (count === coll_docs_count) {
        // This NEVER executes, since the data finishes being operated on
        // and inserted into the NEW DB a few seconds after this moment.
        mongo_client.close();
        cb_bucket.disconnect();
      }
    });
    stream.on('data', function (doc) {
      // ... do some operations on it
      new_db.insert(doc, function (/* blah blah */) {
        count++;
      });
    });
  });
});
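One way to approach this (a minimal sketch built on the question's own variables) is to keep one promise per insert and only close the connections once they have all settled:
var pending = []; // one promise per insert issued from the 'data' handler
stream.on('data', function (doc) {
  // ... do some operations on it
  pending.push(new Promise(function (resolve, reject) {
    new_db.insert(doc, function (err) {
      if (err) return reject(err);
      count++;
      resolve();
    });
  }));
});
stream.on('close', function () {
  Promise.all(pending).then(function () {
    mongo_client.close();
    cb_bucket.disconnect();
  });
});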

NodeJS Out of memory, using for loop with asynchronous function

exports.updateFullCentralRecordSheet = function (req, _id, type) {
  FullCentralRecordSheet.remove({_ExternalParty: _id, centralRecordType: type, centralSheetType: "Central Sheet"}, function (err) {
    if (err) {
      saveErrorLog(req, err);
    }
    let query = {"structure.externalPartyRelationships": {$elemMatch: {_ExternalParty: _id}}, disabled: {$mod: [2, 0]}, initialized: true, profitLossType: type};
    let fullCentralRecordSheetObjects = [];
    ProfitLossSheet.find(query).sort({profitLossDate: 1}).lean().exec(function (err, profitLossSheetObjects) {
      if (err) {
        saveErrorLog(req, err);
      }
      async.each(profitLossSheetObjects, function (profitLossSheetObject, callback) {
        /// HEAVY COMPUTATION HERE
        callback();
      }, function (err) { // async.each completion callback
        if (err) {
          saveErrorLog(req, err);
        } else {
          query = {centralRecordMode: {$in: ["Payment In", "Payment Out", "Transfer", "General Out"]}, disabled: {$mod: [2, 0]}, centralRecordType: {$in: ["Split", type]}, _ExternalParty: _id, status: {$ne: "Reject"}};
          CentralRecordSheet.find(query).lean().exec(function (err, centralRecordSheetObjects) {
            if (err) {
              saveErrorLog(req, err);
            }
            _.each(centralRecordSheetObjects, function (centralRecordSheetObject) {
              // SOME MORE PROCESSING
            });
            fullCentralRecordSheetObjects = _.sortBy(fullCentralRecordSheetObjects, function (fullCentralRecordSheetObject) {
              return new Date(fullCentralRecordSheetObject.centralRecordDate).getTime();
            });
            let runningBalance = 0;
            _.each(fullCentralRecordSheetObjects, function (fullCentralRecordSheetObject) {
              runningBalance = runningBalance - fullCentralRecordSheetObject.paymentIn.total + fullCentralRecordSheetObject.paymentOut.total + fullCentralRecordSheetObject.moneyIn.total - fullCentralRecordSheetObject.moneyOut.total + fullCentralRecordSheetObject.transferIn.total - fullCentralRecordSheetObject.transferOut.total;
              fullCentralRecordSheetObject.balance = runningBalance;
              const newFullCentralSheetRecordObject = new FullCentralRecordSheet(fullCentralRecordSheetObject);
              newFullCentralSheetRecordObject.save(); // asynchronous save
            });
          });
        }
      });
    });
  });
};
This is my code to process some data and save it to the database. As you can see, there is some computation involved in each async loop, and after the loop there is final processing of the data. It works fine if I pass in one _id at a time. However, it fails when I try to do the task like this:
exports.refreshFullCentralRecordSheetObjects = function (req, next) {
  ExternalParty.find().exec(function (err, externalPartyObjects) {
    if (err) {
      utils.saveErrorLog(req, err);
      return next(err, null, [req.__(err.message)], []);
    }
    _.each(externalPartyObjects, function (externalPartyObject) {
      updateFullCentralRecordSheet(req, externalPartyObject._id, "Malay");
      updateFullCentralRecordSheet(req, externalPartyObject._id, "Thai");
    });
    return next(err, null, ["Ddd"], ["Ddd"]);
  });
};
I have about 273 objects to loop through. This causes a fatal memory error. I tried increasing --max-old-space-size=16000 but it still crashes. I used Task Manager to track the memory of the node.exe process and it goes over 8 GB.
I am not sure why increasing the memory to 16 GB does not help; it still crashes at around 8 GB (according to Task Manager). Another thing: when I only process 10 records instead of 273, Task Manager reports about 500 MB of usage. This 500 MB does not disappear unless I make another request to the server. I find this very odd: why isn't Node.js garbage-collecting after it is done processing 10 records? Those 10 records are successfully processed and saved to the database, yet the memory usage remains unchanged in Task Manager.
I tried using async.forEachLimit, making my update function asynchronous, and playing around with process.nextTick(), but I still get the fatal memory error. What can I do to make sure this runs?
Another thing: when I only process 10 records instead of 273, Task Manager reports about 500 MB of usage. This 500 MB does not disappear unless I make another request to the server. I find this very odd: why isn't Node.js garbage-collecting after it is done processing 10 records? Those 10 records are successfully processed and saved to the database, yet the memory usage remains unchanged in Task Manager.
That's normal; Node's GC is lazy (GC is a synchronous operation that blocks the event loop, so being lazy about it is a good thing).
Try to paginate the query?
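A minimal sketch of that suggestion, assuming updateFullCentralRecordSheet is changed to return a promise so each page can be awaited before the next one is loaded (the page size of 20 is an arbitrary assumption):
exports.refreshFullCentralRecordSheetObjects = async function (req, next) {
  const PAGE_SIZE = 20;
  let page = 0;
  while (true) {
    const externalPartyObjects = await ExternalParty.find()
      .skip(page * PAGE_SIZE)
      .limit(PAGE_SIZE)
      .exec();
    if (externalPartyObjects.length === 0) break;
    for (const externalPartyObject of externalPartyObjects) {
      // only one page of parties (and their sheets) is in memory at any time
      await updateFullCentralRecordSheet(req, externalPartyObject._id, "Malay");
      await updateFullCentralRecordSheet(req, externalPartyObject._id, "Thai");
    }
    page++;
  }
  return next(null, null, ["Ddd"], ["Ddd"]);
};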

Using .batch with list of parameters in pg-promise

I'm running Node.js and pg-promise, and I would like to use the batch function to create a transaction with a BEGIN and COMMIT surrounding the multiple UPDATEs.
This is my code:
db.tx(function (t) {
  return this.batch(function () {
    for (var i = 0; i < cars.length; i++) {
      return db.any('UPDATE ... ', [car_id, cars[i].votes]);
    }
  });
});
However, it does not seem to work; nothing happens. Isn't it possible to create my batch list of inputs like that?
The batch method does not take a function as a parameter; it takes an array of promises to resolve.
There are plenty of examples of how to use it (on Stack Overflow as well), starting from the official documentation: Transactions.
For a set of updates you would simply create an array of update queries and then execute them via batch:
db.tx(t => {
  const queries = cars.map(c => {
    return t.none('UPDATE ... ', [c.car_id, c.votes]);
  });
  return t.batch(queries);
})
  .then(data => {
    // success
  })
  .catch(error => {
    // error
  });
Extra
Multiple updates of the same type can be executed as a single query, for much better performance. See Performance Boost and method helpers.update.
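A minimal sketch of that single-query approach (the table and column names are assumptions, since the original UPDATE is elided):
const pgp = require('pg-promise')();
// '?car_id' marks the column as a condition-only column (used in WHERE, not updated)
const cs = new pgp.helpers.ColumnSet(['?car_id', 'votes'], { table: 'cars' });

db.tx(t => {
  const query = pgp.helpers.update(cars, cs) + ' WHERE v.car_id = t.car_id';
  return t.none(query); // one round trip updates every row in `cars`
})
  .then(() => { /* success */ })
  .catch(error => { /* error */ });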
