change stream in NodeJs for elasticsearch - node.js

The aim is to synchronize fields from certain collections to Elasticsearch. Every change in MongoDB should also be reflected in Elasticsearch. I've looked at the different packages, for example River, but unfortunately it didn't work out for me, so I'm trying without it. Is a change stream the right approach for this?
How could you solve this more elegantly? The data must be synchronized to Elasticsearch on every change (insert, update, delete), for several collections but differently for each one (only certain fields per collection). Unfortunately, I don't have the experience to build this in a way that takes little effort when a collection or fields are added or removed.
const res = await client.connect();
const changeStream = res.watch();
changeStream.on('change', (data) => {
  // check the change (is the change in the right database / collection?)
  // parse
  // push it to the Elasticsearch server
});
I hope you can help me, thanks in advance :)

Yes, it will work, but you have to handle the following scenarios:
Your Node.js process goes down while MongoDB updates are ongoing. You can use the resume token and keep track of it, so that once your process comes back up it can resume from where it left off.
Inserting a single document on each change will be overwhelming for Elasticsearch and might result in slow inserts, which will eventually cause sync lag between Mongo and Elastic. It is better to collect multiple documents from the change stream and insert them with a bulk API operation, as in the sketch below.
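A minimal sketch of both points, assuming the official mongodb driver and the 7.x @elastic/elasticsearch client; the database, collection, index and field names are placeholders, and persisting the resume token (plus a periodic flush timer) is left out:

const { MongoClient } = require('mongodb');
const { Client } = require('@elastic/elasticsearch');

const es = new Client({ node: 'http://localhost:9200' }); // assumed Elasticsearch endpoint
const BATCH_SIZE = 100;     // arbitrary batch size
let buffer = [];
let lastResumeToken = null; // persist this somewhere durable (file, small collection, ...)

async function flush() {
  if (buffer.length === 0) return;
  // Bulk API expects alternating action and document entries
  const body = buffer.flatMap(doc => [
    { index: { _index: 'users', _id: String(doc._id) } }, // 'users' index is an assumption
    { name: doc.name, email: doc.email }                  // only the fields you want to sync
  ]);
  await es.bulk({ refresh: false, body });
  buffer = [];
}

async function run() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const coll = client.db('mydb').collection('users');

  const options = { fullDocument: 'updateLookup' };           // deliver the full doc on updates
  if (lastResumeToken) options.resumeAfter = lastResumeToken; // resume after a restart

  const changeStream = coll.watch([], options);
  changeStream.on('change', async (change) => {
    lastResumeToken = change._id; // this is the resume token
    if (['insert', 'update', 'replace'].includes(change.operationType)) {
      buffer.push(change.fullDocument);
      if (buffer.length >= BATCH_SIZE) await flush();
    } else if (change.operationType === 'delete') {
      await es.delete({ index: 'users', id: String(change.documentKey._id) }).catch(() => {});
    }
  });
}

run().catch(console.error);

In practice you would drive the collection-to-index and field mapping from one small config object, so adding or removing a collection or field only means editing that map.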

Related

Firestore fetching all documents on a node.js server. Scalability

Every night at 12pm I am fetching all of the users from my firestore database with this code.
const usersRef = db.collection('users');
const snapshot = await usersRef.get();
snapshot.forEach(doc => {
let docData = doc.data()
// some code and evaluations
})
I just want to know if this is a reliable way to read through all of the data each night without overloading the system. For instance, if I have 50k users and I want to update their info each night on the server, will this require a lot of memory server-side? Also, is there a better way to handle what I am attempting to do, given the general premise of updating the users' data each night.
Your code is loading all documents in a collection in one go. Even on a server, that will at some point run out of memory.
You'll want to instead read a limited number of documents, process those documents, then read/process a next batch of documents, until you're done. This is known as paginating through data with queries and ensures you can handle any number of documents, instead of only the number that can fit into memory.
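A rough sketch of that pagination with the Firebase Admin SDK; the page size of 500 is arbitrary and the processing step is just the original loop body:

const admin = require('firebase-admin');
admin.initializeApp(); // assumes default credentials
const db = admin.firestore();

async function processAllUsers() {
  const pageSize = 500; // tune to your memory and rate-limit budget
  let lastDoc = null;

  while (true) {
    let query = db.collection('users')
      .orderBy(admin.firestore.FieldPath.documentId()) // stable order for pagination
      .limit(pageSize);
    if (lastDoc) query = query.startAfter(lastDoc);

    const snapshot = await query.get();
    if (snapshot.empty) break;

    snapshot.forEach(doc => {
      const docData = doc.data();
      // some code and evaluations, as in the original snippet
    });

    lastDoc = snapshot.docs[snapshot.docs.length - 1]; // cursor for the next page
  }
}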

What is the difference between a changeStream and tailable cursor in MongoDB

I am trying to determine what the difference is between a changestream:
https://docs.mongodb.com/manual/changeStreams
https://docs.mongodb.com/manual/reference/method/db.collection.watch/
which looks like so:
const changeStream = collection.watch();
changeStream.next(function(err, next) {
expect(err).to.equal(null);
client.close();
done();
});
and a tailable cursor:
https://docs.mongodb.com/manual/core/tailable-cursors/
which looks like so:
const cursor = coll.find(self.query || query)
.addCursorFlag('tailable', true)
.addCursorFlag('awaitData', true) // true or false?
.addCursorFlag('noCursorTimeout', true)
.addCursorFlag('oplogReplay', true)
.setCursorOption('numberOfRetries', Number.MAX_VALUE)
.setCursorOption('tailableRetryInterval', 200);
const strm = cursor.stream(); // Node.js transform stream
do they have a different use case? when would it be good to use one over the other?
Change Streams (available in MongoDB v3.6+) is a feature that allows you to access real-time data changes without the complexity and risk of tailing the oplog. Key benefits of change streams over tailing the oplog are:
Utilise the built-in MongoDB Role-Based Access Control. Applications can only open change streams against collections they have read access to. Refined and specific authorisation.
Provide a well-defined, reliable API. The change event output returned by change streams is well documented, and all of the official MongoDB drivers follow the same specification when implementing the change streams interface.
Change events returned as part of change streams are committed to at least a majority of the replica set. This means the change events sent to the client are durable; applications don't need to handle data rollbacks in the event of a failover.
Provide a total ordering of changes across shards by utilising a global logical clock. MongoDB guarantees the order of changes are preserved and change events can be safely interpreted in the order received. For example, a change stream cursor opened against a 3-shard sharded cluster returns change events respecting the total order of those changes across all three shards.
Due to the ordering characteristic, change streams are also inherently resumable. The _id of the change event output is a resume token. The official MongoDB drivers automatically cache this resume token, and in the case of a transient network error the driver will retry once. Additionally, applications can also resume manually by utilising the resume_after parameter. See also Resume a Change Stream.
Utilise MongoDB aggregation pipeline. Applications can modify the change events output. Currently there are five pipeline stages available to modify the event output. For example, change event outputs can be filtered out (server side) before being sent out using $match stage. See Modify Change Stream Output for more information.
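For example, a small sketch of server-side filtering with a $match stage; the collection name and the operation types in the filter are placeholders:

// Only inserts and updates are sent to the client; the filter runs on the server
const changeStream = db.collection('orders').watch(
  [
    { $match: { operationType: { $in: ['insert', 'update'] } } }
  ],
  { fullDocument: 'updateLookup' } // also return the full document for updates
);

changeStream.on('change', (change) => {
  // change._id is the resume token the driver caches for you
  console.log(change.operationType, change.documentKey._id);
});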
when would it be good to use one over the other?
If your MongoDB deployment is version 3.6+, I would recommend to utilise MongoDB Change Streams over tailing the oplog.
You may also find Change Streams Production Recommendations a useful resource.
With a tailable cursor on the oplog, you follow ALL changes to all collections. With a change stream, you see only changes to the selected collection. Much less traffic and more reliable.

Mongodb queries from multiple processes; how to implement atomicity?

I have a mongodb database where multiple node processes read and write documents. I would like to know how can I make that so only one process can work on a document at a time. (Some sort of locking) that is freed after the process finished updating that entry.
My application should do the following:
Walk through each entry one by one with a cursor.
(Lock the entry so no other processes can work with it)
Fetch information from a third-party site.
Calculate new information and update the entry.
(Unlock the document)
Also after unlocking the document there will be no need for other processes to update it for a few hours.
Later on I would like to set up multiple MongoDB clusters to reduce the load on the databases. The solution should therefore apply to both a single database server and multiple ones, or at least work with multiple Mongo servers.
An elegant solution that doesn't involve locks is:
Add a version property to your document.
When updating the document, increment the version property.
When updating the document, include the last read version in the find query. If your document has been updated elsewhere, the find query will yield no results and your update will fail.
If your update fails, you can retry the operation.
I have used this pattern with great success in the past.
Example
Imagine you have a document {_id: 123, version: 1}.
Imagine now you have 3 Mongo clients concurrently doing db.collection.findAndModify({ query: {_id: 123, version: 1}, update: { $inc: { version: 1 } }}).
The first update will apply, the remaining ones will fail. Why? Because version is now 2, and the query required version: 1.
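A minimal Node.js sketch of this optimistic pattern with a retry loop, assuming the official driver; applyChanges() is a hypothetical function containing your business logic:

async function updateWithRetry(collection, id, applyChanges, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const doc = await collection.findOne({ _id: id });
    if (!doc) return null;

    const newFields = applyChanges(doc); // compute the changed fields from the current doc

    // The update only matches if nobody bumped the version in the meantime
    const res = await collection.updateOne(
      { _id: id, version: doc.version },
      { $set: newFields, $inc: { version: 1 } }
    );
    if (res.modifiedCount === 1) return { ...doc, ...newFields, version: doc.version + 1 };
    // Lost the race: another client updated the document first, so re-read and retry
  }
  throw new Error('Too many concurrent modifications');
}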
Per MongoDB documentation:
$isolated: Prevents a write operation that affects multiple documents from yielding to other reads or writes once the first document is written... The $isolated operator causes write operations to acquire an exclusive lock on the collection... [and] will make WiredTiger single-threaded for the duration of the operation.
So if you are updating multiple documents, you could first get the data from the third-party API, parse the info into an array, for example, and then use something like this in the Mongo shell:
db.foo.update(
  { status: "A", $isolated: 1 },
  { $set: { < your key >: < your info > } }, // use the info in your array
  { multi: true }
)
Or if you have to update the document one by one, you could use findAndModify() or updateOne() of the Node Driver for MongoDB. Please note that per MongoDB documentation 'When modifying a single document, both findAndModify() and the update() method atomically update the document.'
An example of updating one by one: first you connect to the Mongod with the Node.js driver, then connect to the third-party API using Node.js's Request module, for example, get and parse the data, and then use the data to modify your documents, something like below:
var request = require('request');
var MongoClient = require('mongodb').MongoClient,
    test = require('assert');

MongoClient.connect('mongodb://localhost:27017/test', function(err, db) {
  var collection = db.collection('simple_query');
  collection.find().forEach(
    function(doc) {
      request('http://www.google.com', function(error, response, body) {
        console.log('body:', body); // parse body for your info
        collection.findAndModify({
          <query based on your doc>
        }, {
          $set: { < your key >: < your info > }
        });
      });
    },
    function(err) {
      // all documents have been iterated
    }
  );
});
Encountered this question today and I feel like it's been left open. First, findAndModify really seems like the way to go about it, but I found vulnerabilities in both suggested answers:
In Treefish Zhang's answer: if you run multiple processes in parallel they will query the same documents, because at the beginning you use "find" and not "findAndModify"; you only use "findAndModify" after the processing is done, so while a document is being processed it is still not updated and other processes can query it as well.
In arboreal84's answer: what happens if the process crashes in the middle of handling the entry? If you update the version while querying and the process then crashes, you have no clue whether the operation succeeded or not.
Therefore, I think the most reliable approach is to have multiple fields:
version
locked: [true/false]
lockedAt: [timestamp] (optional - in case the process crashed and was not able to unlock, you may want to retry after x amount of time)
attempts: 0 (optional - increment this if you want to know how many processing attempts were made; useful for counting retries)
Then, for your code (a sketch follows this list):
findAndModify: where version=oldVersion and locked=false, set locked=true, lockedAt=now
process the entry
if processing succeeded, set locked=false, version=newVersion
if processing failed, set locked=false
optional: for retry after a TTL you can also query by "or locked=true and lockedAt<now-ttl"
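A sketch of the claim and release steps with findOneAndUpdate; the field names follow the list above, the TTL is arbitrary, and the res.value access assumes a Node driver version that returns { value } (newer drivers return the document directly):

const LOCK_TTL_MS = 10 * 60 * 1000; // retry locks that are older than 10 minutes

// 1) Atomically claim the entry: only succeeds if the version matches and it is not locked
async function claimEntry(collection, id, oldVersion) {
  const now = new Date();
  const res = await collection.findOneAndUpdate(
    {
      _id: id,
      version: oldVersion,
      $or: [
        { locked: false },
        { locked: true, lockedAt: { $lt: new Date(now.getTime() - LOCK_TTL_MS) } } // stale lock
      ]
    },
    { $set: { locked: true, lockedAt: now }, $inc: { attempts: 1 } }
  );
  return res.value; // null means someone else holds the lock or the version moved on
}

// 2) Release after processing
async function releaseEntry(collection, doc, succeeded) {
  const update = succeeded
    ? { $set: { locked: false }, $inc: { version: 1 } } // bump the version on success
    : { $set: { locked: false } };                      // just unlock on failure
  await collection.updateOne({ _id: doc._id }, update);
}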
and about:
I have a VPS in New York and one in Hong Kong and I would like to apply the lock on both database servers, so those two VPS servers won't perform the same task under any circumstances.
I think the answer to this depends on why you need 2 database servers and why they have the same entries.
If one of them is a secondary in a cross-region replica set for high availability, findAndModify will query the primary, since writing to a secondary replica is not allowed; that's why you don't need to worry about the 2 servers being in sync (it might have latency issues, but you'll have those anyway since you're communicating between 2 regions).
If you want it just for sharding and horizontal scaling, there is no need to worry about it, because each shard will hold different entries, so the entry lock is only relevant for one shard.
Hope it will help someone in the future
relevant questions:
MongoDB as a queue service?
Can I trust a MongoDB collection as a task queue?

MongoDb + Mongoose QueryStream - Following document changes

I'm trying to make use of Mongoose and its query stream in a scheduling application, but maybe I'm misunderstanding how it works. I've read this question here on SO (Mongoose QueryStream new results) and it seems I'm correct, but someone please explain:
If I'm filtering a query like so -
Model.find().stream()
when I add or change something that matches the .find(), it should throw a data event, correct? Or am I completely wrong in my understanding of this issue?
For example, I'm trying to look at some data like so:
Events.find({'title':/^word/}).stream();
I'm changing titles in the mongodb console, and not seeing any changes.
Can anyone explain why?
Your understanding is indeed incorrect as a stream is just an output stream of the current query response and not something that "listens for new data" by itself. The returned result here is basically just a node streaming interface, which is an optional choice as opposed to a "cursor", or indeed the direct translation to an array as mongoose methods do by default.
So a "stream" does not just "follow" anything. It is reall just another way of dealing with the normal results of a query, but in a way that does not "slurp" all of the results into memory at once. It rather uses event listeners to process each result as it is fetched from the server cursor.
What you are in fact talking about is a "tailable cursor", or some variant thereof. In basic MongoDB operations, a "tailable cursor" can be implemented on a capped collection. This is a special type of collection with specific rules, so it might not suit your purposes. They are intended for "insert only" operations which is typically suited to event queues.
On a model that is using a capped collection ( and only where a capped collection has been set ) then you implement like this:
var query = Events.find({'title':/^word/}).sort({ "$natural": -1}).limit(1);
var stream = query.tailable({ "awaitdata": true}).stream();
// fires on data received
stream.on("data",function(data) {
console.log(data);
});
The "awaitdata" there is just as an important option as the "tailable" option itself, as it is the main thing that tells the query cursor to remain "active" and "tail" the additions to the collection that meet the query conditions. But your collection must be "capped" for this to work.
An alternate and more advanced approach to this is to do something like the meteor distribution does, where the "capped collection" that is being tailed is in fact the MongoDB oplog. This requires a replica set configuration; however, just as meteor does out of the box, there is nothing wrong with having a single node as a replica set in itself. It's just not wise to do so in production.
This is more advanced than a simple answer, but the basic concept is that since the "oplog" is a capped collection, you are able to "tail" it for all write operations on the database. That event data is then inspected to determine such details as whether the collection you want to watch for writes has been written to. Then that data can be used to query the new information and do something like return the updated or new results to a client via a websocket or similar.
But a stream in itself is just a stream. To "follow" the changes on a collection you either need to implement it as capped, or consider implementing a process based on watching the changes in the oplog as described.
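For reference, a rough sketch of tailing the oplog for a single namespace (requires a replica set; the database and collection names are placeholders, and in practice you would also filter on the ts field so you start near the tail instead of replaying the whole oplog):

const { MongoClient } = require('mongodb');

async function tailOplog() {
  const client = await MongoClient.connect('mongodb://localhost:27017/?replicaSet=rs0');
  const oplog = client.db('local').collection('oplog.rs');

  const cursor = oplog.find(
    { ns: 'mydb.events' },              // only writes to this namespace
    { tailable: true, awaitData: true } // keep the cursor open and wait for new entries
  );

  for await (const entry of cursor) {
    // entry.op is 'i' (insert), 'u' (update) or 'd' (delete); entry.o holds the data
    console.log(entry.op, entry.ns, entry.o);
  }
}

tailOplog().catch(console.error);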

Handling conflict in find, modify, save flow in MongoDB with Mongoose

I would like to update a document that involves reading other collection and complex modifications, so the update operators in findAndModify() cannot serve my purpose.
Here's what I have:
Collection.findById(id, function (err, doc) {
  // read from other collection, validation
  // modify fields in doc according to user input
  // (with decent amount of logic)
  doc.save(function (err, doc) {
    if (err) {
      return res.json(500, { message: err });
    }
    return res.json(200, doc);
  });
});
My worry is that this flow might cause conflict if multiple clients happens to modify the same document.
It is said here that:
Operations on a single document are always atomic with MongoDB databases
I'm a bit confused about what Operations mean.
Does this means that the findById() will acquire the lock until doc is out of scope (after the response is sent), so there wouldn't be conflicts? (I don't think so)
If not, how to modify my code to support multiple clients knowing that they will modify Collection?
Will Mongoose report conflict if it occurs?
How to handle the possible conflict? Is it possible to manually lock the Collection?
I see suggestion to use Mongoose's versionKey (or timestamp) and retry for stale document
Don't use MongoDB altogether...
Thanks.
EDIT
Thanks #jibsales for the pointer, I now use Mongoose's versionKey (timestamp will also work) to avoid committing conflicts.
aaronheckmann — Mongoose v3 part 1 :: Versioning
See this sample code:
https://gist.github.com/anonymous/9dc837b1ef2831c97fe8
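In case the gist disappears, here is a rough sketch of the versioned save-and-retry flow; the optimisticConcurrency schema option (Mongoose 5.10+) is an assumption, since the default __v versioning only guards array updates:

const mongoose = require('mongoose');

// With optimisticConcurrency, every save() checks __v and throws VersionError on a stale doc
const itemSchema = new mongoose.Schema(
  { title: String, status: String },
  { optimisticConcurrency: true }
);
const Item = mongoose.model('Item', itemSchema);

async function updateItem(id, applyUserInput, retries = 3) {
  for (let i = 0; i < retries; i++) {
    const doc = await Item.findById(id);
    if (!doc) return null;
    // read from other collection, validation, modify fields according to user input...
    applyUserInput(doc);
    try {
      return await doc.save();
    } catch (err) {
      if (err instanceof mongoose.Error.VersionError) continue; // stale document, re-read and retry
      throw err;
    }
  }
  throw new Error('Document kept changing, giving up');
}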
Operations refers to reads/writes. Bear in mind that MongoDB is not an ACID compliant data layer and if you need true ACID compliance, you're better off picking another tech. That said, you can achieve atomicity and isolation via the Two Phase Commit technique outlined in this article in the MongoDB docs. This is no small undertaking, so be prepared for some heavy lifting as you'll need to work with the native driver instead of Mongoose. Again, my ultimate suggestion is to not drink the NoSQL koolaid if you need transaction support, which it sounds like you do.
When MongoDB receives a request to update a document, it will lock the database until it has completed the operation. Any other requests that MongoDB receives will wait until the locking operation has completed and the database is unlocked. This lock/wait behavior is automatic, so there aren't any conflicts to handle. You can find a lot more information about this behavior in the Concurrency section of the FAQ.
See jibsales answer for links to MongoDB's recommended technique for doing multi-document transactions.
There are a couple of NoSQL databases that do full ACID transactions, which would make your life a lot easier. FoundationDB is one such database. Data is stored as Key-Value but it supports multiple data models through layers.
Full disclosure: I'm an engineer at FoundationDB.
In my case, I was wrong when trying to query a dynamic field with the upsert option. This guide helped me: How to solve error E11000 duplicate.
Per the guide above, you're probably making one of two mistakes:
Upserting a document with findOneAndUpdate() while the query matches a non-unique field.
Inserting many new documents in one go without using "ordered = false" (see the sketch below).
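For the second case, a small sketch; with ordered: false the driver keeps inserting past duplicate-key errors and reports them in the resulting bulk-write error:

async function insertManySkippingDuplicates(collection, docs) {
  try {
    await collection.insertMany(docs, { ordered: false }); // do not abort on the first duplicate
  } catch (err) {
    // E11000 duplicates failed, but every other document was still inserted
    if (err.code === 11000 || (err.writeErrors && err.writeErrors.length)) {
      console.warn('Skipped duplicate documents');
    } else {
      throw err;
    }
  }
}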
