How to populate mongoose with a large data set

How to populate mongoose with a large data set - node.js

I'm attempting to load a store catalog into MongoDb (2.2.2) using Node.js (0.8.18) and Mongoose (3.5.4) -- all on Windows 7 64bit. The data set contains roughly 12,500 records. Each data record is a JSON string.
My latest attempt looks like this:
var fs = require('fs');
var odir = process.cwd() + '/file_data/output_data/';
var mongoose = require('mongoose');
var Catalog = require('./models').Catalog;
var conn = mongoose.connect('mongodb://127.0.0.1:27017/sc_store');
exports.main = function(callback){
var catalogArray = fs.readFileSync(odir + 'pc-out.json','utf8').split('\n');
var i = 0;
Catalog.remove({}, function(err){
while(i < catalogArray.length){
new Catalog(JSON.parse(catalogArray[i])).save(function(err, doc){
if(err){
console.log(err);
} else {
i++;
}
});
if(i === catalogArray.length -1) return callback('database populated');
}
});
};
I have had a lot of problems trying to populate the database. Under previous scenarios (and this one), node pegs the processor and eventually runs out of memory. Note that in this scenario, I'm trying to allow Mongoose to save a record, and then iterate to the next record once the record saves.
But the iterator inside of the Mongoose save function never gets incremented. In addition, it never throws any errors. But if I put the iterator (i) outside of the asynchronous call to Mongoose, it will work, provided the number of records that I try to load are not too big (I have successfully loaded 2,000 this way).
So my questions are: Why isn't the iterator inside of the Mongoose save call ever incremented? And, more importantly, what is the best way to load a large data set into MongoDb using Mongoose?
Rob

i is your index to where you're pulling input data from in catalogArray, but you're also trying to use it to keep track of how many have been saved which isn't possible. Try tracking them separately like this:
var i = 0;
var saved = 0;
Catalog.remove({}, function(err){
while(i < catalogArray.length){
new Catalog(JSON.parse(catalogArray[i])).save(function(err, doc){
saved++;
if(err){
console.log(err);
} else {
if(saved === catalogArray.length) {
return callback('database populated');
}
}
});
i++;
}
});
UPDATE
If you want to add tighter flow control to the process, you can use the async module's forEachLimit function to limit the number of outstanding save operations to whatever you specify. For example, to limit it to one outstanding save at a time:
Catalog.remove({}, function(err){
async.forEachLimit(catalogArray, 1, function (catalog, cb) {
new Catalog(JSON.parse(catalog)).save(function (err, doc) {
if (err) {
console.log(err);
}
cb(err);
});
}, function (err) {
callback('database populated');
});
}

Rob,
The short answer:
You created an infinite loop. You're thinking synchronously and with blocking, Javascript functions asynchronously and without blocking. What you are trying to do is like trying to directly turn the feeling of hunger into a sandwich. You can't. The closest thing is you use the feeling of hunger to motivate you to go to the kitchen and make it. Don't try to make Javascript block. It won't work. Now, learn async.forEachLimit. It will work for what you want to do here.
You should probably review asynchronous design patterns and understand what it means on a deeper level. Callbacks are not simply an alternative to return values. They are fundamentally different in how and when they are executed. Here is a good primer: http://cs.brown.edu/courses/csci1680/f12/handouts/async.pdf
The long answer:
There is an underlying problem here, and that is your lack of understanding of what non-blocking IO and asynchronous means. Im not sure if you are breaking into node development, or this is just a one-off project, but if you do plan to continue using node (or any asynchronous language) then it is worth the time to understand the difference between synchronous and asynchronous design patterns, and what motivations there are for them. So, that is why you have a logic error putting the loop invariant increment inside an asynchronous callback which is creating an infinite loop.
In non-computer science, that means that your increment to i will never occur. The reason is because Javascript executes a single block of code to completion before any asynchronous callbacks are called. So in your code, your loop will run over and over, without i ever incrementing. And, in the background, you are storing the same document in mongo over and over. Each iteration of the loop starts sending document with index 0 to mongo, the callback can't fire until your loop ends, and all other code outside the loop runs to completion. So, the callback queues up. But, your loop runs again since i++ is never executed (remember, the callback is queued until your code finishes), inserting record 0 again, queueing another callback to execute AFTER your loop is complete. This goes on and on until your memory is filled with callbacks waiting to inform your infinite loop that document 0 has been inserted millions of times.
In general, there is no way to make Javascript block without doing something really really bad. For example, something paramount to setting your kitchen on fire to fry some eggs for that sandwich I talked about in the "short answer".
My advice is to take advantage of libs like async. https://github.com/caolan/async JohnnyHK mentioned it here, and he was correct for doing so.

Related

Could nodejs loop be asynchronous?

While this code is running i cant do anything. Is there asynchronous way for loops?
// This object is very large
var listOfUsers = {};
for(var key in listOfUsers){
delete listOfUsers[key]
}

Is there asynchronous way for loops?
No. Since delete listOfUsers[key] is itself a synchronous operation, then there is no way to do anything else while that loop is running. The JS interpreter is busy executing the loop and executing that delete operation. Because Javascript in node.js is single threaded so there's only ever one set of Javascript executing at a time. You can't execute anything else until the loop is done.
It occurs to me that if you're just trying to get listOfUsers back to an empty object and nobody else holds a reference to the original object, you could perhaps replace your existing loop with just this:
listOfUsers = {};
while would be a lot faster. The old object (and its properties) would then get garbage collected.
In rare circumstances, you can solve problems like this and lessen the impact of a synchronous operation by breaking its operation into chunks and doing one chunk, then letting the event loop run and then do another chunk.
For example, you might be able to do something like this:
// remove all users, chunked to 100 at a time
// allowing the event loop to run between chunks
function removeUsers() {
const chunkSize = 100;
let usersToDelete = Object.keys(listOfUsers).slice(chunkSize);
if (!usersToDelete.length) {
// everything deleted, no more work to do
return;
} else {
for (let key of usersToDelete) {
delete listOfUsers[key];
}
// delete some more after other things get a chance to run in the event loop
setTimeout(removeUsers, 20);
}
A problem with this approach is that you can't add any new users to the listOfUsers until this is done or they will get deleted.

Can this code cause a race condition in socket io?

I am very new to node js and socket io. Can this code lead to a race condition on counter variable. Should I use a locking library for safely updating the counter variable.
"use strict";
module.exports = function (opts) {
var module = {};
var io = opts.io;
var counter = 0;
io.on('connection', function (socket) {
socket.on("inc", function (msg) {
counter += 1;
});
socket.on("dec" , function (msg) {
counter -= 1;
});
});
return module;
};

No, there is no race condition here. Javascript in node.js is single threaded and event driven so only one socket.io event handler is ever executing at a time. This is one of the nice programming simplifications that come from the single threaded model. It runs a given thread of execution to completion and then and only then does it grab the next event from the event queue and run it.
Hopefully you do realize that the same counter variable is accessed by all socket.io connections. While this isn't a race condition, it means that there's only one counter that all socket.io connections are capable of modifying.
If you wanted a per-connection counter (separeate counter for each connection), then you could define the counter variable inside the io.on('connection', ....) handler.
The race conditions you do have to watch out for in node.js are when you make an async call and then continue the rest of your coding logic in the async callback. While the async operation is underway, other node.js code can run and can change publicly accessible variables you may be using. That is not the case in your counter example, but it does occur with lots of other types of node.js programming.
For example, this could be an issue:
var flag = false;
function doSomething() {
// set flag indicating we are in a fs.readFile() operation
flag = true;
fs.readFile("somefile.txt", function(err, data) {
// do something with data
// clear flag
flag = false;
});
}
In this case, immediately after we call fs.readFile(), we are returning control back to the node.js. It is free at that time to run other operations. If another operation could also run this code, then it will tromp on the value of flag and we'd have a concurrency issue.
So, you have to be aware that anytime you make an async operation and then the rest of your logic continues in the callback for the async operation that other code can run and any shared variables can be accessed at that time. You either need to make a local copy of shared data or you need to provide appropriate protections for shared data.
In this particular case, the flag could be incremented and decremented rather than simply set to true or false and it would probably serve the desired purpose of keeping track of whether this file is current being read or not.

Shorter answer:
"Race condition" is when you execute a series of ordered asynchronous functions and because of their async nature they won't finish processing in their original order.
In your code, you are executing a series of ordered synchronous process (increasing or decreasing the counter), So they finish instantly after they start, resulting in ordered output. So no racing here!

NodeJS fs API: Detect Asynchronous Completion

I have a NodeJS application which uses the fs API to read files from a directory tree. I'm using the fs-walk module to walk the tree. For every sub directory encountered, the same function executes again to handle it. (I don't think this is recursion; rather, the same function is bound to an event which is fired each time a directory is handled.) Files are handled by a different function, which does stuff to them.
I'd like to execute arbitrary code once all files have been read without using synchronous or blocking code. I couldn't find any way to keep track of the number of files in a directory (to count down, for instance), nor could I find any attribute in fs.stat to indicate that the entire operation has completed.
Had anyone found a way to do this yet? I could find nothing in the node docs or on stack overflow.

After reviewing the fs-walk library a little closer, it looks like the third argument to the walk() method is actually a final callback. Internally they are using the async library, specifically async.whilst() and async.waterfall() methods which will execute the final callback when everything is complete.
I think the intention of the library creator is for that final callback to be executed when all async actions are completed. If that isn't working, you may want to file an issue in Github for it:
According to the code, you should be able to do:
var walk = require('fs-walk';
walk('/some/dir', someFileOrDirHandler, function(err) {
// This should be a final callback, if the first argument is present,
// then there was an error
if (err) {
/* handle it */
return;
}
// Getting here indicates success
});

As a compromise in performance, I ended up doing a total file count using a recursive function that accessed the file system synchronously. Using the total, I then accessed all the files asynchronously, decrementing the total each time. Once the total reached zero, I executed a function to handle all of the completed data.
var countAllFiles = new Promise(function (resolve, reject) {
var total = 0,
count = function (path) {
var contents = fs.readdirSync(path), file, name;
for (file in contents) {
if (!contents.hasOwnProperty(file)) continue;
name = path + '/' + contents[file];
if (fs.statSync(name).isDirectory())
count(name);
else
++total;
}
};
count('/path/to/tree/');
resolve(total);
}).then(function (total) {
walk.dirs('/path/to/tree/', handlerFunction, errorHandler);
// for every file, decrement total. Then, if it's zero, execute the code that
// depends on all the read/write operations being complete
});

Redis, Transactions and Throughput

Alright, it's been about 10 hours, and I still can't figure this out. Can someone please help? I am writing to both Redis and MongoDB each time my Node/Express API is called. However, when I query each database by the same key, Redis gradually starts to miss records over time. I can minimize this behavior by throttling the overall throughput (reducing # of ops I'm asking Redis to do). Here's the pseudo code:
function (req, res) {
async.parallel {
f {w:1 into MongoDB -- seems to be working fine}
f {write to Redis -- seems to be miss-firing}
And here the Redis code:
var trx = 1; // transaction is 1:pending 0:complete
async.whilst(function(){return trx;},
function(callback){
r.db.watch(key);
r.db.hgetall(key, function(err, result){
// update existing key
if (result !== null) {
update(key, result, req, function(err, result){
if (err) {callback(err);}
else if (result === null) {callback(null);}
else {trx = 0; callback(null);}
});
}
// new key
else {
newSeries(bin, req, function(err, result){
if (err) {callback(err);}
else if (result === null) {callback(null);}
else {trx = 0; callback(null);}
});
}
});
}, function(err){if(err){callback(err);} else{callback(null);}}
);
in the "update" and "newSeries" functions, I'm basically just doing a MULTI/EXEC to redis using the values from HGETALL, and returning the result (to ensure I didn't hit a race condition).
I am using Cluster with Node, so I have multiple threads executing at once to Redis.
Any thoughts would be really helpful. Thanks.

I guess I just needed a bit of sleep, and a bit more log-trolling to figure this out.
Basically, it was the async.each loop above my block of code. Because that runs in parallel, EXEC was sometimes called on a different key! So it would wipe out the WATCH on another key! So, I just needed to switch it to async.eachSeries - which ensures my single node-worker isn't "working" (WATCH'ing and EXEC'ing) multiple keys at once!
So, first critical lessons: first, any EXEC command from a connection with wipe out all WATCH commands (so be very careful with parallel or Async processing).
And second, be very, very careful with async.each, and always default to async.eachSeries! For me, async.each is conceptually very tough - and it can really screw up single-threaded processes (like Redis). This has cost me a lot of time and pain over the past year... beware!
Hope this helps someone out there.

Nodejs asynchronous confusion

I can't seem to grasp how to maintain async control flow with NodeJs. All of the nesting makes the code very hard to read in my opinion. I'm a novice, so I'm probably missing the big picture.
What is wrong with simply coding something like this...
function first() {
var object = {
aProperty: 'stuff',
anArray: ['html', 'html'];
};
second(object);
}
function second(object) {
for (var i = 0; i < object.anArray.length; i++) {
third(object.anArray[i]);
};
}
function third(html) {
// Parse html
}
first();

The "big picture" is that any I/O is non-blocking and is performed asynchronously in your JavaScript; so if you do any database lookups, read data from a socket (e.g. in an HTTP server), read or write files to the disk, etc., you have to use asynchronous code. This is necessary as the event loop is a single thread, and if I/O wasn't non-blocking, your program would pause while performing it.
You can structure your code such that there is less nesting; for example:
var fs = require('fs');
var mysql = require('some_mysql_library');
fs.readFile('/my/file.txt', 'utf8', processFile);
function processFile(err, data) {
mysql.query("INSERT INTO tbl SET txt = '" + data + "'", doneWithSql);
}
function doneWithSql(err, results) {
if(err) {
console.log("There was a problem with your query");
} else {
console.log("The query was successful.");
}
}
There are also flow control libraries like async (my personal choice) to help avoid lots of nested callbacks.
You may be interested in this screencast I created on the subject.

As #BrandonTilley said, I/O is asynchronous, so you need callbacks in Node.js to handle them. This is why Node.js can do so much with just a single thread (it's not actually doing more in a single thread, but rather than having the thread wait around for the data, it just starts processing the next task and when the I/O comes back, then it'll jump back to that task with the callback function you gave it).
But, nested callbacks can be taken care of with a good library like the venerable async or my new little library: queue-flow. They handle the callback issues and let you keep your code un-nested and looking very similar to blocking, synchronous code. :)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string