I would like to know how many documents are going to be processed by my MongoDB cursor before I iterate over the cursor. Do I have to run the query twice, or can I determine the size of a cursor and then iterate over the cursor without running the query twice?
myCount = myCollection.find({complex query}).count(); // will get my count
myCursor = myCollection.find({complex query}); // will get my cursor for iteration
But these two commands will run the query twice, which is presumably inefficient?
What is the most efficient way to determine the number of documents that I will iterate over without running the query twice? My docs are quite large.
Does MongoDB 'know' how many docs are going to be returned before it starts returning them via the cursor? Where can I read up on Mongo internals for this info?
I am running this within node, and I'm using the standard MongoDB node driver. So, the solution needs to deal with the usual node callback mechanisms etc, etc.
Thanks
EDIT
I have amended the original question to state that I'm using the node driver.
EDIT2
yogesh's answer is correct; I just needed to figure out the syntax for node. I've shown the working code below in case it helps anyone.
db.collection('collectionName', function(err, collection) {
  function processItem(err, item) {
    if (item === null) {  // null signals that the cursor is exhausted
      db.close();
      callback(null, {info: "yippee"});
      return;
    }
    cursor.nextObject(processItem); // continue looping
  }
  var cursor = collection.find({foo:1});
  cursor.count(function(err, count) {
    console.log('resultCursor size=' + count);
    cursor.nextObject(processItem); // start of the loop
  });
});
Check cursor.next(): it returns the next document from the cursor returned by the db.collection.find() method. So in your case you would write it as follows (tested in the mongo shell):
var collectionData = myCollection.find({complex query})
collectionData.count() // returns the count of matching documents
collectionData.next() // returns the next document
Or you can also check cursor.hasNext(), which returns true if the cursor returned by the db.collection.find() query can iterate further to return more documents.
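For example, continuing the shell snippet above, you can take the count once and then walk the same cursor with hasNext()/next() (here {complex query} stands for the placeholder from the question):
var collectionData = myCollection.find({complex query})
collectionData.count()                 // count() does not consume the cursor
while (collectionData.hasNext()) {
  var doc = collectionData.next()      // iterate the same cursor
  // ... process doc ...
}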
Related
Hello, I want to loop over a list of games and check whether each game has already been inserted; if it is already present in the database then skip it, otherwise insert the new game into the database. Each game has a unique eventId which I check before inserting a new game.
My code is:
for (var i = 0; i < gamesList.length; i++) {
  var game = gamesList[i];
  // check whether the game is already present in the DB.
  Game.findOne({"eventId": game.ID}, function(err, result) {
    if (err) {
      console.log(err);
      res.json(err);
    }
    if (result) {
      console.log('game is already inserted skip');
    }
    else {
      console.log('new game available insert this into list');
    }
  });
}
But the main problem here is the asynchronous nature of the code, I guess. When Game.findOne is executed, it does not wait for the result to come back; it proceeds to the next game in the loop.
The callback is invoked with only the last game in the list, gamesList.length times. I want to find whether a game is already present in the database and, once the result comes back, move on to the next game in the list.
Please suggest a better solution that achieves this kind of functionality.
What about using synchronous code? Does it block, and is that a disadvantage?
Just a tip: change if (result) to if (result.length), because an empty object (array) is truthy.
And to execute the query asynchronously, see this answer.
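For illustration only, a sequential version using caolan's async module might look roughly like this (eachSeries waits for each callback before moving to the next game; the Game model and eventId field come from the question, while the save call is an assumption about how you insert new games):
var async = require('async');

async.eachSeries(gamesList, function(game, next) {
  Game.findOne({"eventId": game.ID}, function(err, result) {
    if (err) return next(err);
    if (result) {
      console.log('game is already inserted skip');
      return next();
    }
    console.log('new game available insert this into list');
    // assumes the feed item maps directly onto the Game schema
    new Game(game).save(function(saveErr) {
      next(saveErr);
    });
  });
}, function(err) {
  if (err) console.log(err);
  // all games processed sequentially
});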
I'm inserting a lot of documents in bulk with the latest node.js native driver (2.0).
My collection has an index on the URL field, and I'm bound to get duplicates out of the thousands of lines I insert. Is there a way for MongoDB to not crash when it encounters a duplicate?
Right now I'm batching records 1000 at a time and using insertMany. I've tried various things, including adding {continueOnError: true}. I tried inserting my records one by one, but it's just too slow; I have thousands of workers in a queue and can't really afford the delay.
Collection definition :
self.prods = db.collection('products');
self.prods.ensureIndex({url:1},{unique:true}, function() {});
Insert :
MongoProcessor.prototype._batchInsert = function(coll, items) {
  var self = this;
  if (items.length > 0) {
    var batch = [];
    var l = items.length;
    for (var i = 0; i < 999; i++) {
      if (i < l) {
        batch.push(items.shift());
      }
      if (i === 998) {
        coll.insertMany(batch, {continueOnError: true}, function(err, res) {
          if (err) console.log(err);
          if (res) console.log('Inserted products: ' + res.insertedCount + ' / ' + batch.length);
          self._batchInsert(coll, items);
        });
      }
    }
  } else {
    self._terminate();
  }
};
I was thinking of dropping the index before the insert, then reindexing using dropDups, but it seems a bit hacky. My workers are clustered, and I have no idea what would happen if they tried to insert records while another process is reindexing... Does anyone have a better idea?
Edit :
I forgot to mention one thing. The items I insert have a 'processed' field which is set to 'false'. However the items already in the db may have been processed, so the field can be 'true'. Therefore I can't upsert... Or can I select a field to be untouched by upsert?
The 2.6 Bulk API is what you're looking for, which will require MongoDB 2.6+* and node driver 1.4+.
There are 2 types of bulk operations:
Ordered bulk operations. These operations execute all the operations in order and error out on the first write error.
Unordered bulk operations. These operations execute all the operations in parallel and aggregate all the errors. Unordered bulk operations do not guarantee order of execution.
So in your case Unordered is what you want. The previous link provides an example:
MongoClient.connect("mongodb://localhost:27017/test", function(err, db) {
  // Get the collection
  var col = db.collection('batch_write_ordered_ops');
  // Initialize the Unordered Batch
  var batch = col.initializeUnorderedBulkOp();
  // Add some operations (execution order is not guaranteed)
  batch.insert({a:1});
  batch.find({a:1}).updateOne({$set: {b:1}});
  batch.find({a:2}).upsert().updateOne({$set: {b:2}});
  batch.insert({a:3});
  batch.find({a:3}).remove({a:3});
  // Execute the operations
  batch.execute(function(err, result) {
    console.dir(err);
    console.dir(result);
    db.close();
  });
});
*The docs do state that: "for older servers than 2.6 the API will downconvert the operations. However it’s not possible to downconvert 100% so there might be slight edge cases where it cannot correctly report the right numbers."
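Applied to the batching code from the question, a rough sketch could look like the following. The duplicate-key error code 11000 is standard, but exactly how a given driver version surfaces write errors in the execute callback varies, so treat the error handling (and the nInserted field) as assumptions to verify against your driver:
MongoProcessor.prototype._batchInsert = function(coll, items) {
  var self = this;
  if (items.length === 0) return self._terminate();

  // Unordered bulk op: duplicate-key errors do not stop the rest of the batch
  var bulk = coll.initializeUnorderedBulkOp();
  var batch = items.splice(0, 1000);
  batch.forEach(function(item) { bulk.insert(item); });

  bulk.execute(function(err, result) {
    // Duplicate URLs show up as write errors (code 11000); log them and carry on
    if (err) console.log('bulk write finished with errors: ' + (err.message || err));
    if (result) console.log('Inserted products: ' + result.nInserted + ' / ' + batch.length);
    self._batchInsert(coll, items);
  });
};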
I have an array of data which I'll store in the database. When I'm checking whether the data already exists, each() is called twice, even when I'm using limit(1). I have no clue what's going on here...
collection.find({
  month: 'april'
}).limit(1).count(function(err, result) {
  console.log('counter', result);
});

collection.find({
  month: 'april'
}).limit(1).each(function(err, result) {
  console.log('each', result);
});

collection.find({
  month: 'april'
}).limit(1).toArray(function(err, result) {
  console.log('toArray', result);
});
At this time, there is exactly one document for the month of April already stored in the collection.
The above queries will generate an output like this:
counter 1
each {...}
each null
toArray {...}
In the mongo shell I have checked the count() and forEach() methods. Everything works as expected. Is it a driver problem? Am I doing anything wrong?
This is the expected behavior. The driver returns the items in the loop, and then at the end it returns null to indicate that there are no items left. You can see this in the driver's examples too:
// Find returns a Cursor, which is Enumerable. You can iterate:
collection.find().each(function(err, item) {
if(item != null) console.dir(item);
});
If you are interested in the details, you can check the source code for each:
if(this.items.length > 0) {
  // Trampoline all the entries
  while(fn = loop(self, callback)) fn(self, callback);
  // Call each again
  self.each(callback);
} else {
  self.nextObject(function(err, item) {
    if(err) {
      self.state = Cursor.CLOSED;
      return callback(utils.toError(err), item);
    }
    >> if(item == null) return callback(null, null); <<
    callback(null, item);
    self.each(callback);
  })
}
In this code each iterates through the items using loop which shifts items from the array (var doc = self.items.shift();). When this.items.length becomes 0, the else block is executed. This else block tries to get the next document from the cursor. If there are no more documents, nextObject returns null (item's value becomes null) which makes if(item == null) return callback(null, null); to be executed. As you can see the callback is called with null, and this is the null that you can see in the console.
This is needed because MongoDB returns the matching documents using a cursor. If you have millions of documents in the collection and you run find(), not all documents are returned immediately because you would run out of memory. Instead MongoDB iterates through the items using a cursor. "For most queries, the first batch returns 101 documents or just enough documents to exceed 1 megabyte." So this.items.length becomes the number of the items that are in the first batch, but that's not necessarily the total number of the documents resulted by the query. That's why when you iterate through the documents and this.items.length becomes 0, MongoDB uses the cursor to check if there are more matching documents. If there are, it loads the next batch, otherwise it returns null.
It's easier to understand this if you use a large limit. For example in case of limit(100000) you would need a lot of memory if MongoDB returned all 100000 documents immediately. Not to mention how slow processing would be. Instead, MongoDB returns results in batches. Let's say the first batch contains 101 documents. Then this.items.length becomes 101, but that's only the size of the first batch, not the total number of the result. When you iterate through the results and you reach the next item after the last one that is in the current batch (102nd in this case), MongoDB uses the cursor to check if there are more matching documents. If there are, the next batch of documents are loaded, null otherwise.
But you don't have to bother with nextObject() in your code; you only need to check for null, as in the MongoDB example.
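Applied to the query from the question, a minimal sketch of that pattern looks like this (it only restates the null check shown in the driver example above):
collection.find({
  month: 'april'
}).limit(1).each(function(err, item) {
  if (item != null) {
    console.log('each', item);          // a real document
  } else {
    console.log('cursor exhausted');    // the terminating null, not a document
  }
});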
I have a MongoDB collection with more than 1,000,000 documents and I would like to update each document, one by one, with a dedicated piece of information (each doc has information coming from another collection).
Currently I'm using a cursor that fetches all the data from the collection, and I update each record through the async module of Node.js.
Fetch all docs:
inst.db.collection(association.collection, function(err, collection) {
  collection.find({}, {}, function(err, cursor) {
    cursor.toArray(function(err, items) {
      ......
    });
  });
});
Update each doc:
items.forEach(function(item) {
  // *** do some stuff with item, add field etc.
  tasks.push(function(nextTask) {
    inst.db.collection(association.collection, function(err, collection) {
      if (err) callback(err, null);
      collection.save(item, nextTask);
    });
  });
});
call the "save" task in parallel
async.parallel(tasks, function(err, results) {
  callback(err, results);
});
How would you do this type of operation in a more efficient way? I mean, how can I avoid the initial "find" that loads a cursor? Is there a way to operate doc by doc, knowing that all docs should be updated?
Thanks for your support.
Your question inspired me to create a Gist to do some performance testing of different approaches to your problem.
Here are the results running on a small EC2 instance with MongoDB at localhost. The test scenario is to uniquely operate on every document of a 100,000-element collection.
108.661 seconds -- Uses find().toArray to pull in all the items at once then replaces the documents with individual "save" calls.
99.645 seconds -- Uses find().toArray to pull in all the items at once then updates the documents with individual "update" calls.
74.553 seconds -- Iterates on the cursor (find().each) with batchSize = 10, then uses individual update calls.
58.673 seconds -- Iterates on the cursor (find().each) with batchSize = 10000, then uses individual update calls.
4.727 seconds -- Iterates on the cursor with batchSize = 10000, and does inserts into a new collection 10000 items at a time.
Though not included, I also did a test with MapReduce used as a server side filter which ran at about 19 seconds. I would have liked to have similarly used "aggregate" as a server side filter, but it doesn't yet have an option to output to a collection.
The bottom line answer is that if you can get away with it, the fastest option is to pull items from an initial collection via a cursor, update them locally and insert them into a new collection in big chunks. Then you can swap in the new collection for the old.
If you need to keep the database active, then the best option is to use a cursor with a big batchSize, and update the documents in place. The "save" call is slower than "update" because it needs to replace whole document, and probably needs to reindex it as well.
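For illustration, here is a rough sketch of that fastest approach (iterate the cursor with a large batchSize, modify each document locally, and insert into a new collection in 10000-item chunks). The 'source' and 'target' collection names and the transform() function are placeholders, not part of the original test code:
var chunk = [];

db.collection('source').find({}).batchSize(10000).each(function(err, doc) {
  if (err) return console.error(err);

  if (doc === null) {
    // cursor exhausted: flush the remaining documents, then swap collections if desired
    if (chunk.length > 0) db.collection('target').insert(chunk, function(err) {
      if (err) console.error(err);
    });
    return;
  }

  doc.someField = transform(doc);   // hypothetical per-document modification
  chunk.push(doc);

  if (chunk.length === 10000) {
    var toInsert = chunk;
    chunk = [];
    db.collection('target').insert(toInsert, function(err) {
      if (err) console.error(err);
    });
  }
});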
NodeJS + Express, MongoDB + Mongoose
I have a JSON feed where each record has a set of "venue" attributes (things like "venue name" "venue location" "venue phone" etc). I want to create a collection of all venues in the feed -- one instance of each venue, no dupes.
I loop through the JSON and test whether the venue exists in my venue collection. If it doesn't, save it.
jsonObj.events.forEach(function(element, index, array) {
  Venue.findOne({'name': element.vname}, function(err, doc) {
    if (doc == null) {
      var instance = new Venue();
      instance.name = element.vname;
      instance.location = element.location;
      instance.phone = element.vphone;
      instance.save();
    }
  });
});
Desired: A list of all venues (no dupes).
Result: Plenty of dupes in the venue collection.
Basically, the loop created a new Venue record for every record in the JSON feed.
I'm learning Node and its async qualities, so I believe the for loop finishes before even the first save() function finishes -- so the if statement is always checking against an empty collection. Console.logging backs this claim up.
I'm not sure how to rework this so that it performs the desired task. I've tried caolan's async module, but I can't get it to help; there's a good chance I'm using it incorrectly.
Thanks so much for pointing me in the right direction -- I've searched to no avail. If the async module is the right answer, I'd love your help with how to implement it in this specific case.
Thanks again!
Why not go the other way with it? You didn't say what your persistence layer is, but it looks like mongoose or possibly FastLegS. In either case, you can create a Unique Index on your Name field. Then, you can just try to save anything, and handle the error if it's a unique index violation.
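With Mongoose that could look roughly like the sketch below. The schema is a guess based on the fields in the question, and error code 11000 is what MongoDB reports for a unique index violation:
var mongoose = require('mongoose');

var venueSchema = new mongoose.Schema({
  name:     { type: String, unique: true },   // unique index on the venue name
  location: String,
  phone:    String
});
var Venue = mongoose.model('Venue', venueSchema);

jsonObj.events.forEach(function(element) {
  var instance = new Venue({
    name: element.vname,
    location: element.location,
    phone: element.vphone
  });
  instance.save(function(err) {
    // E11000 duplicate key: the venue already exists, which is fine here
    if (err && err.code === 11000) return;
    if (err) console.error(err);
  });
});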
Whatever you do, you must do as #Paul suggests and make a unique index in the database. That's the only way to ensure uniqueness.
But the main problem with your code is that in the instance.save() call, you need a callback that triggers the next iteration, otherwise the database will not have had time to save the new record. It's a race condition. You can solve that problem with caolan's forEachSeries function.
Alternatively, you could get an array of records already in the Venue collection that match an item in your JSON object, then filter the matches out of the object, then iteratively add each item left in the filtered JSON object. This will minimize the number of database operations by not trying to create duplicates in the first place.
Venue.find({'name': { $in: jsonObj.events.map(function(event){ return event.vname; }) }}, function (err, docs) {
  var existingVnames = docs.map(function(doc){ return doc.name; });
  var filteredEvents = jsonObj.events.filter(function(event){
    return existingVnames.indexOf(event.vname) === -1;
  });
  filteredEvents.forEach(function(event){
    var venue = new Venue();
    venue.name = event.vname;
    venue.location = event.location;
    venue.phone = event.vphone;
    venue.save(function (err){
      // Optionally, do some logging here, perhaps.
      if (err) return console.error('Something went wrong!');
      else return console.log('Successfully created new venue %s', venue.name);
    });
  });
});