How to bulk save an array of objects in MongoDB? - node.js

I have looked a long time and not found an answer. The Node.JS MongoDB driver docs say you can do bulk inserts using insert(docs), which is good and works well.
I now have a collection with over 4,000,000 items, and I need to add a new field to all of them. Usually MongoDB can only write 1 transaction per 100ms, which means I would be waiting for days to update all those items. How can I do a "bulk save/update" to update them all at once? update() and save() seem to only work on a single object.
Pseudo-code:
var stuffToSave = [];
db.collection('blah').find({}, function(err, stuff) {
  stuff.toArray().forEach(function(item) {
    item.newField = someComplexCalculationInvolvingALookup();
    stuffToSave.push(item);
  });
});
db.saveButNotSuperSlow(stuffToSave);
Sure, I'll need to put some limit on it, like doing 10,000 at once rather than trying all 4 million in one go, but I think you get the point.

MongoDB allows you to update many documents that match a specific query using a single db.collection.update(query, update, options) call, see the documentation. For example,
db.blah.update(
  { },
  {
    $set: { newField: someComplexValue }
  },
  {
    multi: true
  }
)
The multi option allows the command to update all documents that match the query criteria. Note that the exact same thing applies when using the Node.JS driver, see that documentation.
If you're performing many different updates on a collection, you can wrap them all in a Bulk() builder to avoid some of the overhead of sending multiple updates to the database.
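For reference, here is a minimal, untested sketch of how both cases might look with a current Node.js driver; someComplexCalculationInvolvingALookup() is the placeholder from the question, and the connection string, database name and batching size are assumptions:
// Untested sketch; 'blah' and 'newField' come from the question, everything else is assumed.
const { MongoClient } = require('mongodb');

async function run() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const coll = client.db('test').collection('blah');

  // Case 1: every document gets the same value, so one command covers the whole collection.
  await coll.updateMany({}, { $set: { newField: 'someComplexValue' } });

  // Case 2: each document needs its own value, so batch the updates with bulkWrite.
  let ops = [];
  for await (const doc of coll.find({})) {
    ops.push({
      updateOne: {
        filter: { _id: doc._id },
        update: { $set: { newField: someComplexCalculationInvolvingALookup(doc) } }
      }
    });
    if (ops.length === 10000) {            // flush in chunks of 10,000
      await coll.bulkWrite(ops, { ordered: false });
      ops = [];
    }
  }
  if (ops.length > 0) {
    await coll.bulkWrite(ops, { ordered: false });
  }

  await client.close();
}

run().catch(err => console.error(err));
bulkWrite is the modern counterpart of the Bulk() builder; each batch is sent to the server as a single command, which avoids a round trip per document.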

Related

Dealing with race conditions and starvation when generating unique IDs using MongoDB + NodeJS

I am using MongoDB to generate unique IDs of this format:
{ID TYPE}{ZONE}{ALPHABET}{YY}{XXXXX}
Here ID TYPE will be a letter from {U, E, V} depending on the input, ZONE will be from the set {N, S, E, W}, YY will be the last 2 digits of the current year, and XXXXX will be a 5-digit number starting from 0 (padded with 0s to make it 5 digits long). When XXXXX reaches 99999, the ALPHABET part will be incremented to the next letter (starting from A).
I will receive ID TYPE and ZONE as input and will have to give the generated unique ID as output. Every time I have to generate a new ID, I will read the last generated ID for the given ID TYPE and ZONE, increment the number part by 1 (XXXXX + 1), then save the newly generated ID in MongoDB and return the output to the user.
This code will run on a single NodeJS server, and there can be multiple clients calling this method.
Is there a possibility of a race condition like the one described below if I am only running a single server instance:
First client reads last generated ID as USA2100000
Second client reads last generated ID as USA2100000
First client generates the new ID and saves it as USA2100001
Second client generates the new ID and saves it as USA2100001
Since 2 clients have generated IDs, the DB should finally have USA2100002.
To overcome this, I am using MongoDB transactions. My code in Typescript using Mongoose as ODM is something like this:
session = await startSession();
session.startTransaction();
let lastId = (await GeneratedId.findOne({ key: idKeyStr }, "value")).value;
lastId = createNextId(lastId);
const newIdObj: any = {
  key: `Type:${idPrefix}_Zone:${zone_letter}`,
  value: lastId,
};
await GeneratedId.findOneAndUpdate({ key: idKeyStr }, newIdObj, {
  upsert: true,
  new: true,
});
await session.commitTransaction();
session.endSession();
I want to know what exactly will happen when the situation I described above occurs with this code.
Will the second client's transaction throw an exception, so that I have to abort or retry the transaction in my code, or will it handle the retry on its own?
How do MongoDB and other DBs handle transactions? Does MongoDB lock the documents involved in the transaction? Are the locks exclusive (i.e. they won't even allow other clients to read)?
If the same client keeps failing to commit its transaction, this client would be starved. How do I deal with this starvation?
You are using MongoDB to store the ID. That's state. Generating the ID is a function. You would be using MongoDB to generate the ID if the MongoDB process took the function's arguments and returned the generated ID. That's not what you are doing: you are using Node.js to generate the ID.
The number of threads, or rather event loops, is critical, as it defines the architecture, but either way you don't need transactions. Transactions in MongoDB are called "multi-document transactions" precisely to highlight that they are intended for consistent updates of several documents at once. The very first paragraph of https://docs.mongodb.com/manual/core/transactions/ warns you that if you update a single document there is no need for transactions.
A single-threaded application does not require any synchronisation. You can reliably read the latest generated ID on start-up and guarantee the ID is unique within the Node.js process. If you exclude MongoDB and other I/O from the generation function, it becomes synchronous, so you can maintain the state of the ID within the Node.js process and guarantee its uniqueness. Once an ID is generated you can persist it in the db asynchronously. In the worst-case scenario you may end up with a gap in the sequential numbers, but no duplicates.
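As a rough illustration of that single-process approach (a sketch only; loadCounter(), formatId() and the seq field are hypothetical names, not from the question, and GeneratedId is the Mongoose model from it):
// Sketch only: one in-process counter per key, persisted asynchronously.
const counters = {};                               // { [key]: lastSeq }

async function loadCounter(key) {
  const doc = await GeneratedId.findOne({ key });  // run once per key at start-up
  counters[key] = doc ? doc.seq : 0;
}

function nextId(key) {
  counters[key] += 1;                              // synchronous: no await between read and increment
  const seq = counters[key];
  // Persist asynchronously; a crash can leave a gap in the sequence, never a duplicate.
  GeneratedId.updateOne({ key }, { $set: { seq } }, { upsert: true })
    .catch(err => console.error(err));
  return formatId(key, seq);                       // formatId() = your {TYPE}{ZONE}{ALPHABET}{YY}{XXXXX} logic
}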
If there is the slightest chance that you may need to scale up to more than one Node.js process to handle more simultaneous requests, or add another host for redundancy in the future, you will need to synchronise generation of the ID, and you can employ MongoDB unique indexes for that. The function itself doesn't change much: you still generate the ID as in the single-threaded architecture, but add an extra step to save it to Mongo. The document should have a unique index on the ID field, so in the case of concurrent updates one of the queries will successfully add the document and the other will fail with "E11000 duplicate key error". You catch such errors on the Node.js side and run the function again, picking the next number:
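Something along these lines might work (an untested sketch; formatId() and the collection layout are assumptions, the E11000 handling is the point):
// Sketch: rely on a unique index and retry on duplicate-key errors.
// Requires: db.generatedIds.createIndex({ id: 1 }, { unique: true })
async function generateUniqueId(coll, idType, zone) {
  for (let attempt = 0; attempt < 10; attempt++) {
    const last = await coll.find({ idType, zone }).sort({ seq: -1 }).limit(1).next();
    const seq = last ? last.seq + 1 : 0;
    const id = formatId(idType, zone, seq);        // formatId() = your own formatting logic
    try {
      await coll.insertOne({ id, idType, zone, seq });
      return id;                                   // the insert succeeded, so the ID is ours
    } catch (err) {
      if (err.code !== 11000) throw err;           // 11000 = duplicate key; anything else is fatal
      // Another process grabbed this number first; loop and pick the next one.
    }
  }
  throw new Error('Could not generate a unique ID after 10 attempts');
}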
This is what you can try. You need to store only one document in the GeneratedId collection. This document will hold the last generated ID's value. The document must have a known _id field; for example, let's say it will be an integer with value 1. So, the document can be like this:
{ _id: 1, lastGeneratedId: "<some value>" }
In your application, you can use the findOneAndUpdate() method with a filter { _id: 1 }, which means you are targeting a single document update. This update will be an atomic operation; as per the MongoDB documentation, "All write operations in MongoDB are atomic on the level of a single document." Do you need a transaction in this case? No. The update operation is atomic, and it performs better than using a transaction. See Update Documents - Atomicity.
Then, how do you generate the new ID and retrieve it?
I will receive ID TYPE and ZONE...
Using the above input values and the existing lastGeneratedId value you can arrive at the new value and update the document (with the new value). The new value can be calculated / formatted within the Aggregation Pipeline of the update operation - you can use the feature Updates with Aggregation Pipeline (this is available with MongoDB v4.2 or higher).
Note the findOneAndUpdate() method returns the updated (or modified) document when you use the update option new: true. This returned document will have the newly generated lastGeneratedId value.
The update method can look like this (using NodeJS driver or even Mongoose):
const filter = { _id: 1 }
const update = [
  { $set: { lastGeneratedId: { /* your calculation of the new value goes here... */ } } }
]
const options = { new: true, projection: { _id: 0, lastGeneratedId: 1 } }
const newId = (await GeneratedId.findOneAndUpdate(filter, update, options)).lastGeneratedId
Note about the JavaScript function:
With MongoDB v4.4 you can use JavaScript functions within an Aggregation Pipeline; and this is applicable for the Updates with Aggregation Pipeline. For details see $function aggregation pipeline operator.
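As a concrete (untested) illustration of the idea, assuming you keep a plain numeric counter in a lastSeq field and do the string formatting in application code (lastSeq, idType, zone, alphabet and yy are assumptions, not from the question):
// Sketch: atomic counter increment with an update-by-aggregation-pipeline (MongoDB 4.2+).
const doc = await GeneratedId.findOneAndUpdate(
  { _id: 1 },
  [ { $set: { lastSeq: { $add: [ { $ifNull: ['$lastSeq', 0] }, 1 ] } } } ],
  { new: true, upsert: true }
);

// Format the full ID in application code, e.g. 'U' + 'S' + 'A' + '21' + '00042'.
const id = `${idType}${zone}${alphabet}${yy}${String(doc.lastSeq).padStart(5, '0')}`;
Because the read-increment-write happens inside a single findOneAndUpdate, two concurrent callers can never receive the same counter value.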

Mongoose, Nodejs - replace many documents in one I/O?

I have an array of objects and I want to store them in a collection using only one I/O operation if it's possible. If any document already exists in the collection I want to replace it, or insert it otherwise.
These are the solutions that I found, but they don't work exactly as I want:
insertMany(): this doesn't replace documents that already exist, but throws an exception instead (this is what I found in the MongoDB documentation, but I don't know if it's the same in Mongoose).
update() or updateMany() with upsert = true: this doesn't help me either, because here I would have to apply the same update to all the stored documents.
There is no replaceMany() in MongoDB or Mongoose.
Does anyone know an optimal way to do a replaceMany using Mongoose and Node.js?
There is bulkWrite (https://docs.mongodb.com/manual/reference/method/db.collection.bulkWrite/), which makes it possible to execute multiple operations at once. In your case, you can use it to perform multiple replaceOne operations with upsert. The code below shows how you can do it with Mongoose:
// Assuming *data* is an array of documents that you want to insert (or replace)
const bulkData = data.map(item => (
  {
    replaceOne: {
      upsert: true,
      filter: {
        // Filter specification. You must provide a field that
        // identifies *item*
      },
      replacement: item
    }
  }
));

db.bulkWrite(bulkData);
You need to query like this:
db.getCollection('hotspot').update({
  /* your condition */
}, {
  $set: {
    "New Key": "Value"
  }
}, {
  multi: true,
  upsert: true
});
It fulfils your requirements!

MongoDB - two updates in sequence overlap each other

We are building a size calculation mechanism for our system.
In order to calculate sizes, we start with the first atomic operation, findAndModify, to find the object and add lock properties to it (to prevent other calculations for this object from interacting with it; since we could have many parallel calculations, the others should be postponed and wait until the end). Then we calculate the sizes of specific properties, and after this operation we add metadata to the object and delete the locks.
However, it seems that sometimes, when we have a lot of calculations for a single object (especially when we calculate a lot of objects in parallel), some updates aren't executed.
_size metadata during calculation looks like this:
{
  _lockedAt: SomeDate,
  _transactionId: 'abc'
}
And after calculation it should look like this:
{
  somePropertySize: 123,
  anotherPropertySize: 1245,
  (...)
  _total: 131431523 // Some number
  // Notice that both _lockedAt and _transactionId should be missing
}
And this is what our update flow looks like:
return Promise.coroutine(function * () {
  yield object.findOneAndUpdate({
    '_id': gemId,
    '_size._lockedAt': {
      $exists: false
    }
  }, {
    $set: {
      '_size._lockedAt': moment.utc().toDate(),
      '_size._transactionId': transactionId
    }
  }).then(results => results.value);

  // Calculations are performed here, new _size object is built

  yield object.findOneAndUpdate({
    _id: gemId,
    _lockedAt: {
      $exists: true // We tried both with and without this property, does not change anything
    }
  }, {
    $set: {
      _size: newSizeObject
    }
  });
})()
Exemplary real-life object JUST before second update (truncated for brevity):
{
  title: 11,
  description: 2,
  detailedSection: 0,
  tags: 2,
  file: 5625898,
  _total: 5625913
}
For some reason, when we have multiple calculations running next to each other, sometimes (for new objects without a _size property at all) the objects end up with a _size object looking exactly as it does right after locking, despite the fact that the logs show everything went well (the calculations completed, the new size object was calculated, and the second DB update was called).
We use MongoDB 3.0 with two replica sets. Any ideas on what is happening?
Put the second update inside the then callback so it will wait until the first promise resolves:
object.findOneAndUpdate({
  '_id': gemId,
  '_size._lockedAt': {
    $exists: false
  }
}, {
  $set: {
    '_size._lockedAt': moment.utc().toDate(),
    '_size._transactionId': transactionId
  }
}).then(results => {
  // Calculations are performed here, new _size object is built
  return object.findOneAndUpdate({
    _id: gemId,
    _lockedAt: {
      $exists: true // We tried both with and without this property, does not change anything
    }
  }, {
    $set: {
      _size: newSizeObject
    }
  });
}).catch(err => console.error(err));
Also make sure you have error handling for your promises using catch.
If you don't really need the lock or transaction fields then I would remove that stuff. If you do need them, something like RethinkDB may work a little better, or PostgreSQL could give you real transactions.
All in all, I checked the code very carefully, and what was happening in reality was that a completely different part of the code was querying the object from the DB and then, after a few other operations (mine included), writing the whole object back to the DB (hence overwriting my changes).
So, an important note for every MongoDB user: please remember that MongoDB is not transactional, but it is atomic at the single-document level, which means it guarantees that each individual operation is applied atomically, but does not guarantee that data stays consistent between separate operations.
To sum up, things I learned from this example:
NEVER update a whole object in the database with data obtained from it some time before (e.g. by querying it, changing some properties and saving it again).
USE $set, $inc, $unset and the other update operators. If you have a lot of parameters, use e.g. the mongo-dot-notation npm library to flatten your data into a $set selector. See the sketch after this list.
If something unexpected is happening with your data (e.g. missing properties after saving), the first thing to investigate is other pending operations on those specific entities.
The least probable cause of your problems is MongoDB itself. It's usually code that does not follow the atomicity rules (which probably happens to a lot of people used to transactional DBs :)).
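A small, untested sketch of that difference, reusing the field names from this question (the computeNewSize() helper is hypothetical):
// Anti-pattern: read-modify-write of the whole document; anything another operation
// changed between the findOne and the replaceOne gets silently overwritten.
const doc = await collection.findOne({ _id: gemId });
doc._size = computeNewSize(doc);                   // computeNewSize() is a hypothetical helper
await collection.replaceOne({ _id: gemId }, doc);

// Safer: touch only the fields you own, using dot notation with $set / $unset.
await collection.updateOne(
  { _id: gemId },
  {
    $set: { '_size._total': 5625913, '_size.file': 5625898 },
    $unset: { '_size._lockedAt': '', '_size._transactionId': '' }
  }
);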

Comparing ObjectIDs in Mongoose Query

I'm trying to update every document in an expanding Mongo database.
My plan is to start with the youngest, most recently created document and work back from there, one-by-one querying the next oldest document.
The problem is that my Mongoose query is skipping documents that were created in the same second. I thought greater than/less than operators would work on _ids generated in the same second. But though there are 150 documents in the database right now, this function gets from the youngest to the oldest document in only 8 loops.
Here's my Mongoose query within the recursive node loop:
function loopThroughDatabase(i, doc, sizeOfDatabase) {
  if (i < sizeOfDatabase) {
    (function() {
      myMongooseCollection.model(false)
        .find()
        .where("_id")
        .lt(doc._id)
        .sort("id")
        .limit(1)
        .exec(function(err, docs) {
          if (err) {
            console.log(err);
          }
          else {
            updateDocAndSaveToDatabase(docs[0]);
            loopThroughDatabase(i + 1, docs[0], sizeOfDatabase); // recursion here
          }
        });
    })();
  }
}

loopThroughDatabase(1, youngestDoc, sizeOfDatabase);
Error found.
In the Mongoose query, I was sorting by "id" rather than "_id"
If you read the MongoDB documentation, you will see that ObjectId ordering depends on the process in which the item was created (http://docs.mongodb.org/manual/reference/glossary/#term-objectid). Therefore, to guarantee what you need, you should add a date stamp to the records and use that instead of the _id.
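For example, here is a rough sketch of walking the collection newest-to-oldest on an explicit createdAt field, with _id as a tie-breaker for documents created at the same instant (createdAt and the handle() callback are assumptions, not part of the original schema):
// Sketch: keyset pagination on (createdAt, _id) instead of _id alone.
// Assumes a createdAt Date field and an index on { createdAt: -1, _id: -1 }.
async function walkNewestToOldest(Model, handle) {
  let last = null;
  for (;;) {
    const query = last
      ? { $or: [
            { createdAt: { $lt: last.createdAt } },
            { createdAt: last.createdAt, _id: { $lt: last._id } }
          ] }
      : {};
    const doc = await Model.findOne(query).sort({ createdAt: -1, _id: -1 });
    if (!doc) break;               // reached the oldest document
    await handle(doc);             // update and save the document here
    last = doc;
  }
}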

How to do a massive random update with MongoDB / NodeJS

I have a MongoDB collection with more than 1,000,000 documents and I would like to update each document one by one with dedicated information (each doc has information coming from another collection).
Currently I'm using a cursor that fetches all the data from the collection, and I update each record through the async module of Node.js.
Fetch all docs :
inst.db.collection(association.collection, function(err, collection) {
  collection.find({}, {}, function(err, cursor) {
    cursor.toArray(function(err, items) {
      ......
    });
  });
});
Update each doc:
items.forEach(function(item) {
  // *** do some stuff with item, add field etc.
  tasks.push(function(nextTask) {
    inst.db.collection(association.collection, function(err, collection) {
      if (err) callback(err, null);
      collection.save(item, nextTask);
    });
  });
});
call the "save" task in parallel
async.parallel(tasks, function(err, results) {
callback(err, results);
});
How would you do this type of operation in a more efficient way? I mean, how can I avoid the initial "find" that loads a cursor? Is there a way to operate doc by doc, knowing that all docs should be updated?
Thanks for your support.
Your question inspired me to create a Gist to do some performance testing of different approaches to your problem.
Here are the results running on a small EC2 instance with the MongoDB at localhost. The test scenario is to uniquely operate on every document of a 100000 element collection.
108.661 seconds -- Uses find().toArray to pull in all the items at once then replaces the documents with individual "save" calls.
99.645 seconds -- Uses find().toArray to pull in all the items at once then updates the documents with individual "update" calls.
74.553 seconds -- Iterates on the cursor (find().each) with batchSize = 10, then uses individual update calls.
58.673 seconds -- Iterates on the cursor (find().each) with batchSize = 10000, then uses individual update calls.
4.727 seconds -- Iterates on the cursor with batchSize = 10000, and does inserts into a new collection 10000 items at a time.
Though not included, I also did a test with MapReduce used as a server-side filter, which ran at about 19 seconds. I would have liked to use "aggregate" as a server-side filter in the same way, but it doesn't yet have an option to output to a collection.
The bottom line is that if you can get away with it, the fastest option is to pull items from the initial collection via a cursor, update them locally and insert them into a new collection in big chunks. Then you can swap in the new collection for the old.
If you need to keep the database active, then the best option is to use a cursor with a big batchSize and update the documents in place. The "save" call is slower than "update" because it needs to replace the whole document and probably needs to reindex it as well.
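For what it's worth, here is a rough, untested sketch of that fastest option with a present-day driver: stream the source collection with a large batchSize, insert into a new collection in big chunks, then swap it in. The collection names and lookupExtraInfo() are placeholders, not from the question.
// Sketch: rebuild into a new collection in big chunks, then swap it in.
const source = db.collection('items');
const target = db.collection('items_rebuilt');

let batch = [];
for await (const doc of source.find({}).batchSize(10000)) {
  doc.newField = lookupExtraInfo(doc);             // lookupExtraInfo() stands in for your per-doc logic
  batch.push(doc);
  if (batch.length === 10000) {
    await target.insertMany(batch, { ordered: false });
    batch = [];
  }
}
if (batch.length > 0) {
  await target.insertMany(batch, { ordered: false });
}

// Swap the rebuilt collection in place of the old one.
await target.rename('items', { dropTarget: true });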
