I have a largish (~100) array of smallish documents (maybe 10 fields each) to insert into MongoDB. But many of them (perhaps all, but typically around 80%) will already exist in the DB. The documents represent upcoming events over the next few months, and I'm updating the database every couple of days, so most of the events are already in there.
Anybody know (or want to guess) if it would be more efficient to:
1. Do the bulk insert but with continueOnError = true, e.g.
   db.collection.insert(myArray, {continueOnError: true}, callback)
2. Do individual inserts, checking first if the _id exists?
3. First do a big remove (something like db.collection.remove({_id: {$in: [array of all the IDs in my new documents]}})), then a bulk insert?
I'll probably do #1 as that is the simplest, and I don't think that 100 documents is all that large so it may not matter, but what if there were 10,000 documents? I'm doing this in JavaScript with the node.js driver, if that matters. My background is in Java, where exceptions are time consuming, and that's the main reason I'm asking - will the "continueOnError" option be time consuming?
ADDED: I don't think "upsert" makes sense. That is for updating an individual document. In my case, the individual document, representing an upcoming event, is not changing. (Well, maybe it is, but that's another issue.)
What's happening is that a few new documents will be added.
My background is in Java, where exceptions are time consuming, and that's the main reason I'm asking - will the "continueOnError" option be time consuming?
The ContinueOnError flag for Bulk Inserts only affects the behaviour of the batch processing: rather than stopping processing on the first error encountered, the full batch will be processed.
In MongoDB 2.4 you will only get a single error for the batch, which will be the last error encountered. This means if you do care about catching errors you would be better doing individual inserts.
The main time savings for bulk insert vs. single insert is reduced network round trips. Instead of sending a message to the MongoDB server per document inserted, drivers can break down bulk inserts into batches of up to the maxMessageSizeBytes accepted by the mongod server (currently 48MB).
Are bulk inserts appropriate for this use case?
Given your use case of only 100s (or even 1000s) of documents to insert where 80% already exist, there may not be a huge benefit in using bulk inserts (especially if this process only happens every few days). Your small inserts will be combined in batches, but 80% of the documents don't actually need to be sent to the server.
I would still favour bulk insert with ContinueOnError over your approach of deletion and re-insertion, but bulk inserts may be an unnecessary early optimisation given the number of documents you are wrangling and the percentage that actually need to be inserted.
I would suggest doing a few runs with the different approaches to see what the actual impact is for your use case.
MongoDB 2.6
As a heads-up, the batch functionality is being significantly improved in the MongoDB 2.5 development series (which will culminate in the 2.6 production release). Planned features include support for bulk upserts and accumulating per-document errors rather than a single error per batch.
The new write commands will require driver changes to support them, but they may change some of the assumptions above. For example, with ContinueOnError using the new batch API you could end up getting a result back listing the 80% of your batch IDs that are duplicate keys.
For more details, see the parent issue SERVER-9038 in the MongoDB issue tracker.
// Insert with continueOnError so duplicate-key errors don't stop the batch;
// ignore E11000 (duplicate key) errors and rethrow anything else.
collection.insert(item, {continueOnError: true, safe: true}, function(err, result) {
  if (err && err.code !== 11000) {
    throw err;
  }
  db.close();
  callBack();
});
For your case, I'd suggest you consider fetching a list of the existing document _ids, and then only sending the documents that aren't in that list already. While you could use update with upsert to update individually, there's little reason to do so. Unless the list of _ids is extremely long (tens of thousands), it would be more efficient to grab the list and do the comparison than do individual updates to the database for each document (with some large percentage apparently failing to update).
I wouldn't use the continueOnError and send all documents ... it's less efficient.
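A minimal sketch of that approach with the node.js driver, reusing the question's myArray and callback (cursor.project() assumes a 2.x+ driver; on older drivers pass a fields projection instead):

// Fetch the _ids that already exist, then insert only the new documents.
var ids = myArray.map(function (doc) { return doc._id; });

collection.find({ _id: { $in: ids } }).project({ _id: 1 }).toArray(function (err, existing) {
  if (err) throw err;

  var existingIds = {};
  existing.forEach(function (doc) { existingIds[doc._id] = true; });

  var newDocs = myArray.filter(function (doc) { return !existingIds[doc._id]; });

  if (newDocs.length === 0) return callback();

  collection.insert(newDocs, callback);
});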
I'd vouch for using an upsert to let Mongo deal with the update-or-insert logic; you can also use multi to update multiple documents that match your criteria:
From the documentation:
upsert
Optional parameter, if set to true, creates a new document when no document matches the query criteria. The default value is false, which does not insert a new document when no match is found. The syntax for this parameter depends on the MongoDB version. See Upsert Parameter.
multi
Optional parameter, if set to true, updates multiple documents that meet the query criteria. If set to false, updates one document. The default value is false. For additional information, see Multi Parameter.
db.collection.update(
  <query>,
  <update>,
  { upsert: <boolean>, multi: <boolean> }
)
Here is the referenced documentation:
http://docs.mongodb.org/manual/reference/method/db.collection.update/
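Applied to the original question's event documents (a sketch, assuming each event in myArray carries its own _id), a per-document upsert could look like this:

// Upsert each event by _id: replace the stored event if it exists,
// insert it if it does not.
myArray.forEach(function (event) {
  db.collection.update(
    { _id: event._id },   // match on the event's _id
    event,                // replacement document
    { upsert: true }      // insert when no match is found
  );
});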
Related
On scaled-out servers, several servers will compete with each other to update the same data.
So I want to prevent multiple updates on the same data.
In CoffeeScript:
CollectionRooms.find(isProcessed: false).forEach (room) ->
  if room.isProcessed then return
  # update something
  CollectionRooms.update { _id: room._id },
    { $set: { isProcessed: true } }
The question is: with two servers (SERVER1 and SERVER2) sharing the same MongoDB, if SERVER2 updates a document to isProcessed = true after SERVER1 has run its find, could the document SERVER1 sees inside its forEach already have isProcessed = true?
To put the question more simply: find() returns a cursor, so inside the .forEach loop, can the data seen on each iteration differ from what existed when find() was started?
Sorry for the awkward wording, and thanks.
Use optimistic locking.
Have an additional field on your document (a timestamp, or version number) which is updated every time the document is written. Then use this version in your update queries. The update will fail if the version has changed since reading.
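A rough sketch of that pattern against the collection from the question (the version field name is an assumption, and this assumes a Meteor-style collection where update returns the number of affected documents):

// Optimistic locking with a numeric version field: only update if the
// version has not changed since the document was read.
var room = CollectionRooms.findOne({ isProcessed: false });
if (room) {
  // ...do the processing work here...

  var updated = CollectionRooms.update(
    { _id: room._id, version: room.version },            // still the version we read?
    { $set: { isProcessed: true }, $inc: { version: 1 } }
  );

  if (updated === 0) {
    // Another server modified the room first -- skip it or retry.
  }
}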
In the documentation for Model.insertMany, it says that when options.ordered == true the method will fail on the 1st error.
https://mongoosejs.com/docs/api.html#model_Model.insertMany
[options.ordered «Boolean» = true] if true, will fail fast on the
first error encountered. If false, will insert all the documents it
can and report errors later. An insertMany() with ordered = false is
called an "unordered" insertMany().
Does it:
have an error and write no documents to the db (what I would like), or
write the documents that occur before the error, hit the error, and then not write any more documents?
options.ordered = true (the default):
Mongoose always validates each document before sending insertMany to
MongoDB. So if one document has a validation error, no documents will
be saved, unless you set the ordered option to false.
Note that this refers to validation, not to how errors during the insert itself are handled.
If all documents pass validation, then from the MongoDB docs:
Excluding Write Concern errors, ordered operations stop after an
error, while unordered operations continue to process any remaining
write operations in the queue.
Note the last paragraph under the examples for insertMany:
Note that one document was inserted: The first document of _id: 13
will insert successfully, but the second insert will fail. This will
also stop additional documents left in the queue from being inserted.
With ordered to false, the insert operation would continue with any
remaining documents.
You seem to imply that you need a transactional approach, for which you should look into MongoDB transactions and check whether your MongoDB version supports them.
options.ordered = false:
Since you explicitly specified that you do not care about the insert order, it will keep inserting and simply skip the ones that raise exceptions.
Also from MongoDB docs:
If ordered is set to false, documents are inserted in an unordered
format and may be reordered by mongod to increase performance.
Applications should not depend on ordering of inserts if using an
unordered insertMany().
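A small sketch of how the two modes behave in practice, inside an async function (MyModel is illustrative, and the exact error shape can vary by driver version):

// ordered: false -- valid documents are still written; the error (typically a
// BulkWriteError) reports the ones that failed.
try {
  await MyModel.insertMany(docs, { ordered: false });
} catch (err) {
  if (err.writeErrors) {
    console.log(err.writeErrors.length + ' documents were not inserted');
  } else {
    throw err;
  }
}

// ordered: true (the default) -- the same call stops at the first error, and
// documents after that point are not written at all.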
I have a web page sending heartbeat events to a nodejs backend to track how long users spend viewing a particular page (or parts of the page). Events are stored in MongoDB in batches using Mongoose's insertMany:
Event.insertMany(events)
Here, events is an array containing multiple events. A single event is structured as follows:
{user_page_id: 1234, time_spent: 30, ...}
Since I'm tracking time spent on page, only most recent time_spent value per user_page_id is meaningful and I don't want to fill MongoDB with unnecessary data. I tried to deal with this by defining user_page_id as unique index:
user_page_id: {type: Number, index: {unique: true, sparse: true}}
Now, a couple of questions:
Is it possible to make user_page_id unique so that existing values are always replaced with new values? (The default functionality seems to just reject duplicates. Something was discussed here but it didn't help.)
Is it possible to make user_page_id unique so that duplicate null values are allowed? (There are also other events where this data is null. The sparse option seems to deal only with missing values.)
Is a unique index even a feasible way to solve this problem, or should I find another approach?
Other possible solutions I could think of are:
Processing heartbeats individually with their own handler using upsert. (Possible, but adds some unnecessary(?) complexity to the processing pipeline. Also, batch mode is highly preferred; a bulk-upsert sketch of this idea follows below.)
Keeping heartbeats in memory and storing them to the DB after some timeout. (The problem is that heartbeat timers should be able to pause. Also, I would like to keep the server stateless.)
Turning the whole thing upside down using websockets. (Possible, but adds some unnecessary(?) complexity, since sometimes tracking is related to parts of the page and there can be multiple concurrent heartbeat timers on one page.)
Storing all events as they are and cleaning up unnecessary events later with some batch processing job. (Not exactly sure, but this might cause some performance issues in MongoDB.) I also thought about using a capped collection here, but it doesn't actually solve the problem.
So as a recap the most important question is how to effectively and elegantly deal with aggregation of this sort of heartbeat data?
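As mentioned in option 1 above, here's a rough sketch of doing the upserts in batch form with Model.bulkWrite, inside an async handler (assuming Mongoose and the event shape above, and giving up the unique-index approach):

// One round trip: an upsert per user_page_id, so only the most recent
// time_spent per user_page_id is kept.
const ops = events.map(event => ({
  updateOne: {
    filter: { user_page_id: event.user_page_id },
    update: { $set: event },
    upsert: true
  }
}));

await Event.bulkWrite(ops);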
Can the size of a MongoDB document affect the performance of a find() query?
I'm running the following query on a collection, in the MongoDB shell
r.find({_id:ObjectId("5552966b380c2dbc29472755")})
The entire document is 3MB. When I run this query the operation takes about 8 seconds. The document has a "salaries" property which makes up the bulk of the document's size (about 2.9MB). So when I omit the salaries property and run the following query, it takes less than a second.
r.find({_id:ObjectId("5552966b380c2dbc29472755")},{salaries:0})
I only notice this performance difference when I run the find() query by itself. When I run a find().count() query there is no difference. It appears that performance degrades only when I want to fetch the entire document.
The collection is never updated (never changes in size), an index is set on _id and I've run repairDatabase() on the database. I've searched around the web but can't find a satisfactory answer to why there is a performance difference. Any insight and recommendations would be appreciated. Thanks.
I think the experiments you've just run are an answer to your own question.
Mongo will index the _id field by default, so document size shouldn't affect the length of time it takes to locate the document, but if it's 3MB then you will likely notice a difference in actually downloading that data. I imagine that's why it's taking less time when you omit some of the fields.
To get a better sense of how long your query is actually taking to run, try this:
r.find({
_id: ObjectId("5552966b380c2dbc29472755")
})
.explain(function(err, explaination) {
if (err) throw err;
console.log(explaination);
});
If salaries is the 3MB culprit, and it's structured data, then to speed things up you could try (A) splitting it up into separate Mongo documents or (B) querying based on sub-properties of that document; in both cases A and B you can build indexes to keep those queries fast.
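A minimal sketch of option B, assuming salaries is an array of subdocuments with a year field (the field name and value are only illustrations):

// Index a sub-property of salaries so queries on it stay fast.
r.createIndex({ "salaries.year": 1 });

// Fetch only the matching salary entry instead of the whole 3MB document.
r.find(
  { _id: ObjectId("5552966b380c2dbc29472755"), "salaries.year": 2015 },
  { "salaries.$": 1 }
);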
I would like to update a document in a way that involves reading another collection and complex modifications, so the update operators in findAndModify() cannot serve my purpose.
Here's what I have:
Collection.findById(id, function (err, doc) {
  // read from other collection, validation
  // modify fields in doc according to user input
  // (with decent amount of logic)
  doc.save(function (err, doc) {
    if (err) {
      return res.json(500, { message: err });
    }
    return res.json(200, doc);
  });
});
My worry is that this flow might cause conflicts if multiple clients happen to modify the same document.
It is said here that:
Operations on a single document are always atomic with MongoDB databases
I'm a bit confused about what "Operations" means.
Does this mean that findById() will acquire a lock until doc goes out of scope (after the response is sent), so there wouldn't be conflicts? (I don't think so.)
If not, how should I modify my code to support multiple clients, knowing that they will modify the Collection?
Will Mongoose report a conflict if it occurs?
How should I handle a possible conflict? Is it possible to manually lock the Collection?
I see suggestions to use Mongoose's versionKey (or a timestamp) and retry for stale documents
Don't use MongoDB altogether...
Thanks.
EDIT
Thanks @jibsales for the pointer. I now use Mongoose's versionKey (a timestamp would also work) to avoid committing conflicting updates.
aaronheckmann — Mongoose v3 part 1 :: Versioning
See this sample code:
https://gist.github.com/anonymous/9dc837b1ef2831c97fe8
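A rough sketch of that versionKey-and-retry approach (the retry helper and computeChanges are illustrative, not the gist's actual code):

// Read the document, compute the changes, then update only if __v is
// unchanged; otherwise another client won, so re-read and retry.
function saveWithRetry(id, attemptsLeft, done) {
  Collection.findById(id, function (err, doc) {
    if (err) return done(err);

    var changes = computeChanges(doc);   // hypothetical: the complex modification logic

    Collection.update(
      { _id: doc._id, __v: doc.__v },
      { $set: changes, $inc: { __v: 1 } },
      function (err, numAffected) {
        if (err) return done(err);
        if (numAffected === 0) {
          // stale read -- someone else updated the doc first
          if (attemptsLeft > 0) return saveWithRetry(id, attemptsLeft - 1, done);
          return done(new Error('conflict: gave up after retries'));
        }
        done(null);
      }
    );
  });
}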
Operations refers to reads/writes. Bear in mind that MongoDB is not an ACID-compliant data layer, and if you need true ACID compliance, you're better off picking another tech. That said, you can achieve atomicity and isolation via the two-phase commit technique outlined in this article in the MongoDB docs. This is no small undertaking, so be prepared for some heavy lifting, as you'll need to work with the native driver instead of Mongoose. Again, my ultimate suggestion is to not drink the NoSQL Kool-Aid if you need transaction support, which it sounds like you do.
When MongoDB receives a request to update a document, it will lock the database until it has completed the operation. Any other requests that MongoDB receives will wait until the locking operation has completed and the database is unlocked. This lock/wait behavior is automatic, so there aren't any conflicts to handle. You can find a lot more information about this behavior in the Concurrency section of the FAQ.
See jibsales' answer for links to MongoDB's recommended technique for doing multi-document transactions.
There are a couple of NoSQL databases that do full ACID transactions, which would make your life a lot easier. FoundationDB is one such database. Data is stored as Key-Value but it supports multiple data models through layers.
Full disclosure: I'm an engineer at FoundationDB.
In my case, my mistake was trying to query a dynamic (non-unique) field with the upsert option. This guide helped me: How to solve error E11000 duplicate
Per the above guide, you're probably making one of two mistakes:
Upserting a document with findOneAndUpdate() where the query matches on a non-unique field.
Inserting many new documents in one go (insertMany) without using ordered = false.
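For illustration, a small sketch of the corrected patterns (the model and field names are assumptions):

// Mistake 1 fix: make the upsert query match on the uniquely indexed field,
// so the upsert cannot collide with an existing document's unique index.
await User.findOneAndUpdate(
  { email: 'alice@example.com' },      // email is the unique field
  { $set: { name: 'Alice' } },
  { upsert: true, new: true }
);

// Mistake 2 fix: insert many documents without stopping at the first
// duplicate-key error.
await User.insertMany(docs, { ordered: false });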