Mongodb queries from multiple processes; how to implement atomicity? - node.js

I have a MongoDB database where multiple Node processes read and write documents. I would like to know how I can ensure that only one process can work on a document at a time (some sort of lock that is released after the process has finished updating that entry).
My application should do the following:
Walk through each entry one by one with a cursor.
(Lock the entry so no other processes can work with it)
Fetch information from a third-party site.
Calculate new information and update the entry.
(Unlock the document)
Also, after unlocking the document, there will be no need for other processes to update it for a few hours.
Later on I would like to set up multiple MongoDB clusters to reduce the load on the databases, so the solution should apply to both single and multiple database servers, or at least to a setup with multiple Mongo servers.

An elegant solution that doesn't involve locks is:
Add a version property to your document.
When updating the document, increment the version property.
When updating the document, include the last read version in the find query. If your document has been updated elsewhere, the find query will yield no results and your update will fail.
If your update fails, you can retry the operation.
I have used this pattern with great success in the past.
Example
Imagine you have a document {_id: 123, version: 1}.
Imagine now you have 3 Mongo clients concurrently running db.collection.findAndModify({ query: { _id: 123, version: 1 }, update: { $inc: { version: 1 } } });.
The first update will apply; the remaining ones will fail. Why? Because version is now 2, and the query included version: 1.
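For instance, a minimal sketch of that read-then-conditionally-update loop with the Node.js driver (assuming a connected collection; fetchThirdPartyInfo is a hypothetical helper for the third-party call):

async function updateWithRetry(collection, id, maxRetries = 5) {
    for (let attempt = 0; attempt < maxRetries; attempt++) {
        const doc = await collection.findOne({ _id: id });
        const info = await fetchThirdPartyInfo(doc); // hypothetical third-party call

        // The update only matches if nobody else bumped the version in the meantime
        const result = await collection.updateOne(
            { _id: id, version: doc.version },
            { $set: { info: info }, $inc: { version: 1 } }
        );
        if (result.modifiedCount === 1) return true; // we won the race
        // otherwise another process updated the document first: retry with fresh data
    }
    return false;
}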

Per MongoDB documentation:
$isolated: Prevents a write operation that affects multiple documents from yielding to other reads or writes once the first document is written. The $isolated operator causes write operations to acquire an exclusive lock on the collection, which will make WiredTiger single-threaded for the duration of the operation. So if you are updating multiple documents, you could first get the data from the third-party API, parse the info into an array, and then use something like this in the Mongo shell:
db.foo.update(
    { status: "A", $isolated: 1 },
    { $set: { <your key>: <your info> } },   // use the info in your array
    { multi: true }
)
Or if you have to update the documents one by one, you could use findAndModify() or updateOne() from the Node.js driver for MongoDB. Please note that, per the MongoDB documentation, 'When modifying a single document, both findAndModify() and the update() method atomically update the document.'
An example of updating one by one: first connect to mongod with the Node.js driver, then connect to the third-party API using Node's request module (for example), get and parse the data, and then use the data to modify your documents, something like below:
var request = require('request');
var MongoClient = require('mongodb').MongoClient,
    test = require('assert');

MongoClient.connect('mongodb://localhost:27017/test', function(err, db) {
    var collection = db.collection('simple_query');
    collection.find().forEach(
        function(doc) {
            request('http://www.google.com', function(error, response, body) {
                console.log('body:', body); // parse body for your info
                collection.findAndModify(
                    { /* <query based on your doc> */ },          // query
                    [],                                           // sort
                    { $set: { /* <your key>: <your info> */ } },  // update
                    function(err, result) {}
                );
            });
        },
        function(err) {}
    );
});

Encountered this question today, and I feel like it's been left open.
First, findAndModify really seems like the way to go about it.
But I found vulnerabilities in both of the suggested answers:
In Treefish Zhang's answer: if you run multiple processes in parallel they will query the same documents, because at the beginning you use "find" and not "findAndModify"; "findAndModify" is used only after the processing is done, so during processing the document is still not updated and other processes can pick it up as well.
In arboreal84's answer: what happens if the process crashes in the middle of handling the entry? If you update the version while querying and the process then crashes, you have no clue whether the operation succeeded or not.
Therefore, I think the most reliable approach would be to have multiple fields:
version
locked: [true/false]
lockedAt: [timestamp] (optional - in case the process crashed and was not able to unlock, you may want to retry after x amount of time)
attempts: 0 (optional - increment this if you want to know how many processing attempts were made; good for counting retries)
Then, for your code (a rough sketch follows these steps):
findAndModify: where version=oldVersion and locked=false, set locked=true, lockedAt=now
process the entry
if the process succeeded, set locked=false and version=newVersion
if the process failed, set locked=false
optional: for retry after a TTL you can also query by "or locked=true and lockedAt < now - ttl"
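As an illustration, a rough sketch of that flow with the Node.js driver (assumptions: a connected collection, a driver version where findOneAndUpdate returns a { value } result, and a hypothetical processEntry helper plus an arbitrary TTL constant):

const LOCK_TTL_MS = 10 * 60 * 1000; // hypothetical retry window

async function claimAndProcess(collection, doc) {
    const now = new Date();
    // Step 1: atomically claim the entry we just read (or steal a stale lock)
    const claimed = await collection.findOneAndUpdate(
        {
            _id: doc._id,
            version: doc.version,
            $or: [
                { locked: false },
                { locked: true, lockedAt: { $lt: new Date(now.getTime() - LOCK_TTL_MS) } },
            ],
        },
        { $set: { locked: true, lockedAt: now }, $inc: { attempts: 1 } }
    );
    if (!claimed.value) return; // another process got there first

    try {
        // Step 2: process the entry (fetch third-party data, compute the update)
        const update = await processEntry(claimed.value); // hypothetical helper
        // Step 3: success - unlock and bump the version
        await collection.updateOne(
            { _id: doc._id },
            { $set: { ...update, locked: false }, $inc: { version: 1 } }
        );
    } catch (err) {
        // Step 4: failure - just unlock so another process can retry later
        await collection.updateOne({ _id: doc._id }, { $set: { locked: false } });
    }
}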
And about:
I have a VPS in New York and one in Hong Kong and I would like to apply the lock on both database servers, so those two VPS servers won't perform the same task by any chance.
I think the answer to this depends on why you need two database servers and why they hold the same entries:
If one of them is a secondary in a cross-region replica set for high availability, findAndModify will query the primary, since writing to a secondary replica is not allowed; that's why you don't need to worry about the two servers being in sync (it might have a latency issue, but you'll have that anyway since you're communicating between two regions).
If you want it just for sharding and horizontal scaling, there's no need to worry about it because each shard will hold different entries, so an entry lock is relevant to just one shard.
Hope it will help someone in the future
relevant questions:
MongoDB as a queue service?
Can I trust a MongoDB collection as a task queue?

change stream in NodeJs for elasticsearch

The aim is to synchronize fields from certain collections to Elasticsearch. Every change in MongoDB should also be applied to Elasticsearch. I've looked at the different packages, for example River, but unfortunately it didn't work out for me, so I'm trying without it. Are change streams the right approach for that?
How could you solve this more elegantly? The data must be synchronized to Elasticsearch on every change (insert, update, delete), for several collections but differently for each one (only certain fields per collection). Unfortunately, I don't have the experience to solve this in a way that doesn't take much effort when a collection or fields are added or removed.
const res = await client.connect();
const changeStream = res.watch();
changeStream.on('change', (data) => {
    // check the change (is the change in the right database / collection?)
    // parse
    // push it to the Elasticsearch server
});
I hope you can help me, thanks in advance :)
Yes, it will work, but you have to handle the following scenarios (a sketch covering both follows):
Your Node.js process goes down while MongoDB updates are ongoing. You can use the resume token and keep track of it, so that once your process comes back up it can resume from where it left off.
Inserting a single document on each change will be overwhelming for Elasticsearch and might result in slow inserts, which will eventually cause a sync lag between Mongo and Elastic. It is better to collect multiple documents from the change stream and insert them with a bulk API operation.
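For illustration, here is a minimal sketch of both ideas, assuming a connected MongoDB client, an Elasticsearch 7.x client from @elastic/elasticsearch, and hypothetical loadToken/saveToken helpers that persist the resume token (for example in a small collection). It only handles inserts/updates, for brevity:

const { Client } = require('@elastic/elasticsearch');

async function startSync(client) {           // `client` is the connected MongoClient
    const esClient = new Client({ node: 'http://localhost:9200' });
    const BATCH_SIZE = 500;                  // illustrative batch size
    let buffer = [];

    const resumeToken = await loadToken();   // hypothetical: last persisted token, or null
    const changeStream = client.watch([], {
        fullDocument: 'updateLookup',        // so update events carry the whole document
        ...(resumeToken && { resumeAfter: resumeToken }),
    });

    changeStream.on('change', async (change) => {
        buffer.push(change);
        if (buffer.length < BATCH_SIZE) return;
        const batch = buffer;
        buffer = [];
        // Bulk body: an action/metadata line followed by the document source
        const body = batch.flatMap((c) => [
            { index: { _index: c.ns.coll, _id: String(c.documentKey._id) } },
            c.fullDocument,
        ]);
        await esClient.bulk({ body });
        // Persist the resume token only after the batch has been indexed
        await saveToken(batch[batch.length - 1]._id);
    });
}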

How to optimize mongodb insert query in nodejs?

We are manipulating and inserting data in MongoDB. A single insert in MongoDB takes 28 ms, and I have to insert twice per request. If I get 6000 requests at a time, I have to insert each piece of data individually and it takes a lot more time. How can I optimize this? Kindly help me on this.
var obj = new gnModel({
    id: data.EID,
    val: data.MID,
});
let response = await insertIntoMongo(obj);
If it is not vital for the data to be stored immediately, you can implement some form of batching.
For example, you can have a service which queues operations and commits them to the database every X seconds. In the service itself, you can use Mongo's Bulk and, more specifically for insertion, Bulk.insert(). It lets you queue operations to be executed as a single query (or at least a minimal number of queries/round trips).
It would also be a good idea to serialize and store this operation log/cache somewhere, as server restart will wipe it out if it is stored entirely in memory. A possible solution is Redis as it can both persist data to disk and is also distributed thus enabling you to queue operations from different application instances.
You'll achieve even better performance if the operations are unrelated and not dependent on each other. In this case you can use db.collection.initializeUnorderedBulkOp(), which will allow Mongo to execute the operations in parallel instead of sequentially, and a single failed operation won't affect the execution of the rest of the set (contrary to OrderedBulkOp).
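A minimal sketch of such a batching service, assuming a connected collection from the official Node.js driver (the in-memory queue, enqueueInsert, and the 5-second flush period are all illustrative choices):

const queue = [];

// Called from your request handlers instead of writing to Mongo directly
function enqueueInsert(doc) {
    queue.push({ insertOne: { document: doc } });
}

// Flush the queue to MongoDB periodically as one bulk round trip
setInterval(async () => {
    if (queue.length === 0) return;
    const batch = queue.splice(0, queue.length);
    // ordered: false lets the server process the inserts in parallel and
    // keeps one failed insert from aborting the rest of the batch
    await collection.bulkWrite(batch, { ordered: false });
}, 5000);

If the queue has to survive restarts, the same shape works with the pending operations kept in Redis instead of in memory, as mentioned above.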

MongoDB Transaction, prevent multiple update on multi-server

On scaled-out servers, several servers will compete with each other to update the same data.
So I want to prevent multiple updates on the same data.
(CoffeeScript)
CollectionRooms.find(isProcessed: false).forEach (room) ->
  if room.isProcessed then return
  # update something
  CollectionRooms.update _id: room._id,
    $set: isProcessed: true
The question, between two servers (SERVER1, SERVER2) using the same MongoDB, is:
after the moment of SERVER1's find action, if SERVER2 updates that data to isProcessed = true, could the data seen in SERVER1's forEach already be isProcessed: true?
To put the question more simply: find() returns a cursor, so inside the .forEach loop, can each iteration's actual data differ from what it was when find() started?
Sorry for the ugly wording, and thanks.
Use optimistic locking.
Have an additional field on your document (a timestamp, or version number) which is updated every time the document is written. Then use this version in your update queries. The update will fail if the version has changed since reading.
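A minimal sketch of that idea with the Node.js driver (assuming a connected collection and a numeric version field on each room document):

const room = await collection.findOne({ isProcessed: false });
if (room) {
    // ... process the room ...
    const result = await collection.updateOne(
        { _id: room._id, version: room.version },   // matches only if nothing changed since the read
        { $set: { isProcessed: true }, $inc: { version: 1 } }
    );
    if (result.modifiedCount === 0) {
        // another server updated this room first: skip it or retry
    }
}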

Bulk operation by mongoose

I want to store bulk data (more than 1000 or 10000 records) in a single operation with Mongoose. But Mongoose does not support bulk operations, so I will use the native MongoDB driver for insertion. I know that I will bypass all Mongoose middleware, but that's ok. (Please correct me if I am wrong! :) )
I have the option to store data with the insert method, but MongoDB also provides the Bulk class (ordered and unordered operations). Now I have the following questions:
What is the difference between insert and a bulk operation (both can store bulk data)?
Is there any specific difference between initializeUnorderedBulkOp() (performs operations in parallel) and initializeOrderedBulkOp() (performs operations serially)?
If I use initializeUnorderedBulkOp, will it affect range searches or have any side effects?
Can I do it with promisification (via Bluebird)? (I am trying to do that.)
Thanks
EDIT: I am talking about bulk vs insert with regard to multiple insertions. Which one is better: insertion one by one via the bulk builder, or insertion in batches (of 1000) via the insert method? I hope this clears it up. See also Mongoose (mongodb) batch insert? (linked question).
If you are calling this from a Mongoose model you need the .collection accessor:
var bulk = Model.collection.initializeOrderedBulkOp();

// examples
bulk.insert({ "a": 1 });
bulk.find({ "a": 1 }).updateOne({ "$set": { "a": 2 } });

bulk.execute(function(err, result) {
    // result contains stats of the operations
});
You need to be "careful" when doing this though. Apart from not being bound to the same checks and validation that can be attached to mongoose schemas, when you call .collection you need to be "sure" that the connection to the database has already been made. Mongoose methods look after this for you, but once you use the underlying driver methods you are all on your own.
As for differences, it's all there in the naming:
Ordered: Means that the batched instructions are executed in the same order they are added. They execute one after the other in sequence and one at a time. If an error occurs at any point, the execution of the batch is halted and the error response returned. All operations up until then are "committed". This is not a rollback.
Unordered: Means that batched operations can execute in "any" sequence and often in parallel. This can lead to faster updates, but of course cannot be used where one bulk operation in the batch is meant to occur before another (example above). Any errors that occur are merely "reported" in the result, and the whole batch will complete as sent to the server.
Of course the core difference for either type of execution from the standard methods is that the "whole batch" (actually in lots of 1000 maximum) is sent to the server and you only get one response back. This saves network traffic and waiting for each individual .insert() or other like operation to complete.
As for whether a "promise" can be used: anything with a callback that you can convert to returning a promise follows the same rules here. Remember though that the "callback/promise" is on the .execute() method, and that what you get back complies with the rules of what is returned from Bulk operation results.
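For instance, a rough sketch of wrapping .execute() with Bluebird (assuming the bulk builder from the example above):

var Promise = require('bluebird');

// Promisify .execute(), keeping the bulk object as its `this` context
var executeAsync = Promise.promisify(bulk.execute, { context: bulk });

executeAsync()
    .then(function(result) {
        // result follows the BulkWriteResult format
        console.log(JSON.stringify(result));
    })
    .catch(function(err) {
        console.error(err);
    });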
For more information see "Bulk" in the core documentation.

MongoDB Bulk Insert where many documents already exist

I have a largish (~100) array of smallish documents (maybe 10 fields each) to insert in MongoDB. But many (perhaps all, but typically 80% or so) of them will already exist in the DB. The documents represent upcoming events over the next few months, and I'm updating the database every couple of days. So most of the events are already in there.
Anybody know (or want to guess) if it would be more efficient to:
Do the bulk insert but with continueOnError = true, e.g.
db.collection.insert(myArray, {continueOnError: true}, callback)
Do individual inserts, checking first if the _id exists?
First do a big remove (something like db.collection.remove({ _id: { $in: [array of all the IDs in my new documents] } })), then a bulk insert?
I'll probably do #1 as that is the simplest, and I don't think that 100 documents is all that large so it may not matter, but what if there were 10,000 documents? I'm doing this in JavaScript with the node.js driver if that matters. My background is in Java where exceptions are time consuming and that's the main reason I'm asking - will the "continueOnError" option be time consuming???
ADDED: I don't think "upsert" makes sense. That is for updating an individual document. In my case, the individual document, representing an upcoming event, is not changing. (well, maybe it is, that's another issue)
What's happening is that a few new documents will be added.
My background is in Java where exceptions are time consuming and that's the main reason I'm asking - will the "continueOnError" option be time consuming???
The ContinueOnError flag for Bulk Inserts only affects the behaviour of the batch processing: rather than stopping processing on the first error encountered, the full batch will be processed.
In MongoDB 2.4 you will only get a single error for the batch, which will be the last error encountered. This means if you do care about catching errors you would be better doing individual inserts.
The main time savings for bulk insert vs single insert is reduced network round trips. Instead of sending a message to the MongoDB server per document inserted, drivers can break down bulk inserts into batches of up to the MaxMessageSizeBytes accepted by the mongod server (currently 48Mb).
Are bulk inserts appropriate for this use case?
Given your use case of only 100s (or even 1000s) of documents to insert where 80% already exist, there may not be a huge benefit in using bulk inserts (especially if this process only happens every few days). Your small inserts will be combined in batches, but 80% of the documents don't actually need to be sent to the server.
I would still favour bulk insert with ContinueOnError over your approach of deletion and re-insertion, but bulk inserts may be an unnecessary early optimisation given the number of documents you are wrangling and the percentage that actually need to be inserted.
I would suggest doing a few runs with the different approaches to see what the actual impact is for your use case.
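If it helps, a crude way to time the two approaches on a modern driver (a sketch; `collection` and the `docs` array are assumed, and the console.time labels are arbitrary):

// Variant 1: one bulk insert, continuing past duplicate-key errors
console.time('bulk');
try {
    await collection.insertMany(docs, { ordered: false });
} catch (e) { /* duplicate-key errors are expected here */ }
console.timeEnd('bulk');

// Variant 2: individual inserts, checking for existence first
console.time('individual');
for (const doc of docs) {
    const exists = await collection.findOne({ _id: doc._id }, { projection: { _id: 1 } });
    if (!exists) await collection.insertOne(doc);
}
console.timeEnd('individual');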
MongoDB 2.6
As a heads-up, the batch functionality is being significantly improved in the MongoDB 2.5 development series (which will culminate in the 2.6 production release). Planned features include support for bulk upserts and accumulating per-document errors rather than a single error per batch.
The new write commands will require driver changes to support, but may change some of the assumptions above. For example, with ContinueOnError using the new batch API you could end up getting a result back with the 80% of your batch IDs that are duplicate keys.
For more details, see the parent issue SERVER-9038 in the MongoDB issue tracker.
collection.insert(item, { continueOnError: true, safe: true }, function(err, result) {
    if (err && err.code !== 11000) {   // 11000 = duplicate key error
        throw err;
    }
    db.close();
    callBack();
});
For your case, I'd suggest you consider fetching a list of the existing document _ids, and then only sending the documents that aren't in that list already. While you could use update with upsert to update individually, there's little reason to do so. Unless the list of _ids is extremely long (tens of thousands), it would be more efficient to grab the list and do the comparison than do individual updates to the database for each document (with some large percentage apparently failing to update).
I wouldn't use the continueOnError and send all documents ... it's less efficient.
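A minimal sketch of that approach with the Node.js driver, assuming `collection` is connected and `myArray` holds the candidate documents, each with its _id already set:

const candidateIds = myArray.map((doc) => doc._id);

// One round trip to learn which _ids are already present
const existing = await collection
    .find({ _id: { $in: candidateIds } }, { projection: { _id: 1 } })
    .toArray();
const existingIds = new Set(existing.map((doc) => String(doc._id)));

// Insert only the documents that are genuinely new
const newDocs = myArray.filter((doc) => !existingIds.has(String(doc._id)));
if (newDocs.length > 0) {
    await collection.insertMany(newDocs);
}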
I'd vote for using an upsert to let Mongo deal with the update-or-insert logic; you can also use multi to update multiple documents that match your criteria:
From the documentation:
upsert
Optional parameter, if set to true, creates a new document when no document matches the query criteria. The default value is false, which does not insert a new document when no match is found. The syntax for this parameter depends on the MongoDB version. See Upsert Parameter.
multi
Optional parameter, if set to true, updates multiple documents that meet the query criteria. If set to false, updates one document. The default value is false. For additional information, see Multi Parameter.
db.collection.update(
    <query>,
    <update>,
    { upsert: <boolean>, multi: <boolean> }
)
Here is the referenced documentation:
http://docs.mongodb.org/manual/reference/method/db.collection.update/
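On a modern Node.js driver you could express the same idea for the whole event array in one round trip with bulkWrite (a sketch, assuming each document in myArray carries its own _id):

const ops = myArray.map((doc) => ({
    updateOne: {
        filter: { _id: doc._id },
        update: { $set: doc },
        upsert: true,
    },
}));

const result = await collection.bulkWrite(ops, { ordered: false });
console.log(result.upsertedCount + ' inserted, ' + result.matchedCount + ' already existed');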
