NodeJS Streams Waiting for all data to be streamed - node.js

My code is streaming data from a mongo collection and inserting it to another DB after doing some amount of operations on it.
I cannot seem to find a way to wait for the entire data to be operated on before closing the DB connections. The data is finished being worked on a few seconds after the stream is 'closed'. I want to wait for ALL the data
operations to be finished before closing the DB connections. How do i do that?
PS : I do see that all the data has been operated on. This works well. But I want to close the DB connections. I need to wait for all ops to finish before doing this!
var count = 0; // Counter for number of mongo docs that has been inserted in new_db
mongo_db.collection(config.collection, function (err, coll) {
coll.find(config.mongo_query).count(function (e, coll_docs_count) {
var stream = coll.find(config.mongo_query).stream();
stream.on('close', function () {
if (count === coll_docs_count) {
// This NEVER executes since the data is finished
//being operated and inserted in the NEW DB a few seconds after
//this moment.
mongo_client.close();
cb_bucket.disconnect();
}
});
stream.on('data', function (doc) {
... // Do some operations on it
new_db.insert(doc, function (blah blah) {
count++;
})
})
})
})

Related

Make multi threaded 1 million inserts to MongoDb using Node JS scripts

I have a isolated sync server that pulls a tab limited text file from a external ftp server and updates(saves) to mongodb after processing.
My code looks like this
//this function pulls file from external ftp server
async function upsteamFile() {
try {
let pythonProcess = spawn('python3', [configVar.ftpInbound, '/outbound/Items.txt', configVar.dataFiles.items], {encoding: 'utf8'});
logger.info('FTP SERVER LOGS...' + '\n' + pythonProcess.stdout);
await readItemFile();
logger.info('The process of file is done');
process.exit();
} catch (upstreamError) {
logger.error(upstreamError);
process.exit();
}
}
//this function connects to db and calls processing function for each row in the text file.
async function readItemFile(){
try{
logger.info('Reading Items File');
let dataArray = fs.readFileSync(configVar.dataFiles.items, 'utf8').toString().split('\n');
logger.info('No of Rows Read', dataArray.length);
await dbConnect.connectToDB(configVar.db);
logger.info('Connected to Database', configVar.db);
while (dataArray.length) {
await Promise.all( dataArray.splice(0, 5000).map(async (f) => {
splitValues = f.split('|');
await processItemsFile(splitValues)
})
)
logger.info("Current batch finished processing")
}
logger.info("ALL batch finished processing")
}
catch(PromiseError){
logger.error(PromiseError)
}
}
async function processItemsFile(splitValues) {
try {
// Processing of the file is done here and I am using 'save' in moongoose to write to db
// data is cleaned and assigned to respective fields
if(!exists){
let processedValues = new Products(assignedValues);
let productDetails = await processedValues.save();
}
return;
}
catch (error) {
throw error
}
}
upstream()
So this takes about 3 hours to process 100,000 thousand rows and update it in the database.
Is there any way that I can speed this up. I am very much limited from the hardware. I am using a ec2 instance based linux server with 2 core and 4 gb ram.
Should I use worker threads like microjob to run multi-threads . if yes , then how would I go about doing it
Or is this the maximum performance?
Note : I cant do bulk update in mongodb as there is mongoose pre hooks are getting triggered on save
You can always try a bulk update with the use of updateOne method.
I would consider also using the readFileStream instead of readFileSync.
With the event-driven architecture you could push, let's say every 100k updates into array chunks and bulk update on them simultaneously.
You can trigger a pre updateOne() (instead of save()) hook during this operation.
I have solved a similar problem (updating 100k CSV rows) with the following solution:
Create a readFileStream (thanks to that, your application won't consume much heap memory in case of the huge files)
I'm using CSV-parser npm library to deconstruct a CSV file into separate rows of data:
let updates = [];
fs.createReadStream('/filePath').pipe(csv())
.on('data', row => {
// ...do anything with the data
updates.push({
updateOne: {
filter: { /* here put the query */ },
update: [ /* any data you want to update */ ],
upsert: true /* in my case I want to create record if it does not exist */
}
})
})
.on('end', async () => {
await MyCollection.bulkWrite(data)
.catch(err => {
logger.error(err);
})
updates = []; // I just clean up the huge array
})

How to run asynchronous tasks synchronous?

I'm developing an app with the following node.js stack: Express/Socket.IO + React. In React I have DataTables, wherein you can search and with every keystroke the data gets dynamically updated! :)
I use Socket.IO for data-fetching, so on every keystroke the client socket emits some parameters and the server calls then the callback to return data. This works like a charm, but it is not garanteed that the returned data comes back in the same order as the client sent it.
To simulate: So when I type in 'a', the server responds with this same 'a' and so for every character.
I found the async module for node.js and tried to use the queue to return tasks in the same order it received it. For simplicity I delayed the second incoming task with setTimeout to simulate a slow performing database-query:
Declaration:
const async = require('async');
var queue = async.queue(function(task, callback) {
if(task.count == 1) {
setTimeout(function() {
callback();
}, 3000);
} else {
callback();
}
}, 10);
Usage:
socket.on('result', function(data, fn) {
var filter = data.filter;
if(filter.length === 1) { // TEST SYNCHRONOUSLY
queue.push({name: filter, count: 1}, function(err) {
fn(filter);
// console.log('finished processing slow');
});
} else {
// add some items to the queue
queue.push({name: filter, count: filter.length}, function(err) {
fn(data.filter);
// console.log('finished processing fast');
});
}
});
But the way I receive it in the client console, when I search for abc is as follows:
ab -> abc -> a(after 3 sec)
I want it to return it like this: a(after 3sec) -> ab -> abc
My thought is that the queue runs the setTimeout and then goes further and eventually the setTimeout gets fired somewhere on the event loop later on. This resulting in returning later search filters earlier then the slow performing one.
How can i solve this problem?
First a few comments, which might help clear up your understanding of async calls:
Using "timeout" to try and align async calls is a bad idea, that is not the idea about async calls. You will never know how long an async call will take, so you can never set the appropriate timeout.
I believe you are misunderstanding the usage of queue from async library you described. The documentation for the queue can be found here.
Copy pasting the documentation in here, in-case things are changed or down:
Creates a queue object with the specified concurrency. Tasks added to the queue are processed in parallel (up to the concurrency limit). If all workers are in progress, the task is queued until one becomes available. Once a worker completes a task, that task's callback is called.
The above means that the queue can simply be used to priorities the async task a given worker can perform. The different async tasks can still be finished at different times.
Potential solutions
There are a few solutions to your problem, depending on your requirements.
You can only send one async call at a time and wait for the first one to finish before sending the next one
You store the results and only display the results to the user when all calls have finished
You disregard all calls except for the latest async call
In your case I would pick solution 3 as your are searching for something. Why would you use care about the results for "a" if they are already searching for "abc" before they get the response for "a"?
This can be done by giving each request a timestamp and then sort based on the timestamp taking the latest.
SOLUTION:
Server:
exports = module.exports = function(io){
io.sockets.on('connection', function (socket) {
socket.on('result', function(data, fn) {
var filter = data.filter;
var counter = data.counter;
if(filter.length === 1 || filter.length === 5) { // TEST SYNCHRONOUSLY
setTimeout(function() {
fn({ filter: filter, counter: counter}); // return to client
}, 3000);
} else {
fn({ filter: filter, counter: counter}); // return to client
}
});
});
}
Client:
export class FilterableDataTable extends Component {
constructor(props) {
super();
this.state = {
endpoint: "http://localhost:3001",
filters: {},
counter: 0
};
this.onLazyLoad = this.onLazyLoad.bind(this);
}
onLazyLoad(event) {
var offset = event.first;
if(offset === null) {
offset = 0;
}
var filter = ''; // filter is the search character
if(event.filters.result2 != undefined) {
filter = event.filters.result2.value;
}
var returnedData = null;
this.state.counter++;
this.socket.emit('result', {
offset: offset,
limit: 20,
filter: filter,
counter: this.state.counter
}, function(data) {
returnedData = data;
console.log(returnedData);
if(returnedData.counter === this.state.counter) {
console.log('DATA: ' + JSON.stringify(returnedData));
}
}
This however does send unneeded data to the client, which in return ignores it. Somebody any idea's for further optimizing this kind of communication? For example a method to keep old data at the server and only send the latest?

setTimeout or child_process.spawn?

I have a REST service in Node.js with one specific request running a bunch of DB commands and other file processing that could take 10-15 seconds to run. Since I didn't want to hold up my browser request thread, I wrote a separate .js script to do the needful, called the script using child_process.spawn() in my Node.js code and immediately returned OK back to the client. This works fine, but then so does calling the same script (as a local function) by just using a simple setTimeout.
router.post("/longRequest", function(req, res) {
console.log("Started long request with id: " + req.body.id);
var longRunningFunction = function() {
// Usually runs a bunch of things that take time.
// Simulating a 10 sec delay for sample code.
setTimeout(function() {
console.log("Done processing for 10 seconds")
}, 10000);
}
// Below line used to be
// child_process.spawn('longRunningFunction.js'
setTimeout(longRunningFunction, 0);
res.json({status: "OK"})
})
So, this works for my purpose. But what's the downside ? I probably can't monitor the offline process easily as child_process.spawn which would give me a process id. But, does this cause problems in the long run ? Will it hold up Node.js processing if the 10 second processing increases to a lot more in the future ?
The actual longRunningFunction is something that reads an Excel file, parses it and does a bulk load using tedious to a MS SQL Server.
var XLSX = require('xlsx');
var FileAPI = require('file-api'), File = FileAPI.File, FileList = FileAPI.FileList, FileReader = FileAPI.FileReader;
var Connection = require('tedious').Connection;
var Request = require('tedious').Request;
var TYPES = require('tedious').TYPES;
var importFile = function() {
var file = new File(fileName);
if (file) {
var reader = new FileReader();
reader.onload = function (evt) {
var data = evt.target.result;
var workbook = XLSX.read(data, {type: 'binary'});
var ws = workbook.Sheets[workbook.SheetNames[0]];
var headerNames = XLSX.utils.sheet_to_json( ws, { header: 1 })[0];
var data = XLSX.utils.sheet_to_json(ws);
var bulkLoad = connection.newBulkLoad(tableName, function (error, rowCount) {
if (error) {
console.log("bulk upload error: " + error);
} else {
console.log('inserted %d rows', rowCount);
}
connection.close();
});
// setup your columns - always indicate whether the column is nullable
Object.keys(columnsAndDataTypes).forEach(function(columnName) {
bulkLoad.addColumn(columnName, columnsAndDataTypes[columnName].dataType, { length: columnsAndDataTypes[columnName].len, nullable: true });
})
data.forEach(function(row) {
var addRow = {}
Object.keys(columnsAndDataTypes).forEach(function(columnName) {
addRow[columnName] = row[columnName];
})
bulkLoad.addRow(addRow);
})
// execute
connection.execBulkLoad(bulkLoad);
};
reader.readAsBinaryString(file);
} else {
console.log("No file!!");
}
};
So, this works for my purpose. But what's the downside ?
If you actually have a long running task capable of blocking the event loop, then putting it on a setTimeout() is not stopping it from blocking the event loop at all. That's the downside. It's just moving the event loop blocking from right now until the next tick of the event loop. The event loop will be blocked the same amount of time either way.
If you just did res.json({status: "OK"}) before running your code, you'd get the exact same result.
If your long running code (which you describe as file and database operations) is actually blocking the event loop and it is properly written using async I/O operations, then the only way to stop blocking the event loop is to move that CPU-consuming work out of the node.js thread.
That is typically done by clustering, moving the work to worker processes or moving the work to some other server. You have to have this work done by another process or another server in order to get it out of the way of the event loop. A setTimeout() by itself won't accomplish that.
child_process.spawn() will accomplish that. So, if you have an actual event loop blocking problem to solve and the I/O is already as async optimized as possible, then moving it to a worker process is a typical node.js solution. You can communicate with that child process in a number of ways, but one possibility would be via stdin and stdout.

NodeJS writes to MongoDB only once

I have a NodeJS app that is supposed to generate a lot of data sets in a synchronous manner (multiple nested for-loops). Those data sets are supposed to be saved to my MongoDB database to look them up more effectively later on.
I use the mongodb - driver for NodeJS and have a daemon running. The connection to the DB is working fine and according to the daemon window the first group of datasets is being successfully stored. Every ~400-600ms there is another group to store but after the first dataset there is no output in the MongoDB console anymore (not even an error), and as the file sizes doesn't increase i assume those write operations don't work (i cant wait for it to finish as it'd take multiple days to fully run).
If i restart the NodeJS script it wont even save the first key anymore, possibly because of duplicates? If i delete the db folder content the first one will be saved again.
This is the essential part of my script and i wasn't able to find anything that i did wrong. I assume the problem is more in the inner logic (weird duplicate checks/not running concurrent etc).
var MongoClient = require('mongodb').MongoClient, dbBuffer = [];
MongoClient.connect('mongodb://127.0.0.1/loremipsum', function(err, db) {
if(err) return console.log("Cant connect to MongoDB");
var collection = db.collection('ipsum');
console.log("Connected to DB");
for(var q=startI;q<endI;q++) {
for(var w=0;w<words.length;w++) {
dbBuffer.push({a:a, b:b});
}
if(dbBuffer.length) {
console.log("saving "+dbBuffer.length+" items");
collection.insert(dbBuffer, {w:1}, function(err, result) {
if(err) {
console.log("Error on db write", err);
db.close();
process.exit();
}
});
}
dbBuffer = [];
}
db.close();
});
Update
db.close is never called and the connection doesn't drop
Changing to bulk insert doesn't change anything
The callback for the insert is never called - this could be the problem! The MongoDB console does tell me that the insert process was successful but it looks like the communication between driver and MongoDB isn't working properly for insertion.
I "solved" it myself. One misconception that i had was that every insert transaction is confirmed in the MongoDB console while it actually only confirms the first one or if there is some time between the commands. To check if the insert process really works one needs to run the script for some time and wait for MongoDB to dump it in the local file (approx. 30-60s).
In addition, the insert processes were too quick after each other and MongoDB appears to not handle this correctly under Win10 x64. I changed from the Array-Buffer to the internal buffer (see comments) and only continued with the process after the previous data was inserted.
This is the simplified resulting code
db.collection('seedlist', function(err, collection) {
syncLoop(0,0, collection);
//...
});
function syncLoop(q, w, collection) {
batch = collection.initializeUnorderedBulkOp({useLegacyOps: true});
for(var e=0;e<words.length;e++) {
batch.insert({a:a, b:b});
}
batch.execute(function(err, result) {
if(err) throw err;
//...
return setTimeout(function() {
syncLoop(qNew,wNew,collection);
}, 0); // Timer to prevent Memory leak
});
}

is my code blocking node thread

I have some confusions over nodejs and would like some help. I have a table called camps, contacts and camp_contact. I have to show the contact list based on the camp the user belongs to. I used async to loop through the camps, which I save in the user session, and then grab the data from mysql.
var array_myData = [];
async.each(req.session.user.camps, function(camps, callback) {
database.getConnection(function(err, connection) {
// Use the connection
connection.query('SELECT contacts.*, contact_camp.* '+
' FROM contacts JOIN contact_camp '+
' ON contact_camp.contact_id = contacts.id '+
' WHERE contact_camp.camp_id = ?',
[camps.camp_id], function(err,data){
if(err) {
//this will call the err function
callback('error');
}
else {
array_myData = array_myData.concat(data);
callback();
}
connection.release();
});
});
//final function call.
}, function(err){
// if any error happened, this function fires.
if( err ) {
// All processing will now stop.
} else {
res.render('contacts',
{
page
});
}
});
The code works fine. Now the thing I'm wondering about is, does using array.concat block the thread? if so, how can I change that? I read around and according to what I understood, I/O operation that are not asynchronous blocks the thread like reading from file or database. Does having array like this var array = ['a','b', 'c'] and looping through it would block the thread?
Lastly, is there a way to know if a code that I have written has blocked the thread or not? Because I get worried every time I write a function of my own.
I also get confused when a create a function with a callback like:
function test(param, fn) { do something; fn(); }
I'm not sure if this kind of function without a timer would block the thread or not.
Ok, if it helps you somehow:
I read around and according to what I understood, I/O operation that are not asynchronous blocks the thread like reading from file or database.
In NodeJS the operations on file descriptors are asynchronous => non-blocking
The file descriptors would be:
database operations,
open/close files
network operations
ex.:
fs.readFile('<pathToTheFile>', (err, data) => {
if (err) throw err;
console.log(data);
});
// if your file is very big, until it is read, maybe other requests will be finished
Does having array like this var array = ['a','b', 'c'] and looping through it would block the thread?
Yes, it will block the thread, but operations like this are not resource intensive. NodeJs is an event-driven language (single-threaded), but also, in a multi threaded or multi process language, the kernel thread would still be blocked by such operations => the same thing ... this is maybe off topic.
// if you do something like this
while(true){
function test(param, fn) { do something; fn(); }
}
// you will see that you just blocked the thread
I'm not sure if this kind of function without a timer would block the thread or not
normally you would put that function in a error handler, and it shouldn't block the thread if you're not doing an infinite while.
try{}catch(err){
// do something with the error
}

Resources