MongoDB 3.6 changestream resumeToken timestamp

I am currently using the change stream feature of MongoDB 3.6.
We run a heavy update/insert workload and use change streams to send data for analytics. We need to sync the data in real time, but since the resumeToken is binary, I have a hard time finding the timestamp of the operation and therefore can't calculate the synchronization lag to analytics.
Is there any way to fetch the timestamp from the resumeToken, or any other way to fetch the operation timestamp?

Is there any way to fetch the timestamp from the resumeToken, or any other way to fetch the operation timestamp?
You can't find out the timestamp of the operation in MongoDB 3.6. There is a plan to add a tool that inspects the resumeToken binary and decodes it into something useful outside of the server: SERVER-32283.
In MongoDB 4.0, however, every change stream event will also include a field called clusterTime, which is the timestamp of the oplog entry associated with the event. See also change events.
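For reference, a minimal Node.js sketch of estimating sync lag from clusterTime might look like this; the collection name is a placeholder, and the exact Timestamp accessors depend on your bson/driver version:
// Assumes MongoDB 4.0+ and a connected `db` handle.
const changeStream = db.collection('events').watch();

changeStream.on('change', (event) => {
  // event.clusterTime is a BSON Timestamp; its high 32 bits are seconds since the epoch.
  const opSeconds = event.clusterTime.getHighBits();
  const lagMs = Date.now() - opSeconds * 1000;
  console.log(`operation time: ${new Date(opSeconds * 1000).toISOString()}, sync lag ~ ${lagMs} ms`);
});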

Related

Change stream in Node.js for Elasticsearch

The aim is to synchronize fields from certain collections to Elasticsearch. Every change in MongoDB should also be applied to Elasticsearch. I've seen the different packages, for example River, but unfortunately it didn't work out for me, so I'm trying without it. Is change streams the right approach for this?
How could you solve this more elegantly? The data must be synchronized to Elasticsearch on every change (insert, update, delete), for several collections but differently for each one (only certain fields per collection). Unfortunately, I don't have the experience to solve this in a way that doesn't take much effort when a collection or fields are added or removed.
const { MongoClient } = require('mongodb');
const client = new MongoClient('mongodb://localhost:27017'); // connection string is illustrative

const res = await client.connect();   // resolves to the connected client
const changeStream = res.watch();     // watches every database/collection on the deployment
changeStream.on('change', (data) => {
  // check the change (is the change in the right database / collection?)
  // parse
  // push it to the Elasticsearch server
});
I hope you can help me, thanks in advance :)
Yes, it will work, but you have to handle the following scenarios:

1. Your Node.js process goes down while MongoDB updates are ongoing. You can use the resume token and keep track of it, so that once your process comes back up it can resume from where it left off.

2. Inserting a single document on each change. This will be overwhelming for Elasticsearch and might result in slow inserts, which will eventually cause sync lag between Mongo and Elastic. It is better to collect multiple documents from the change stream and insert them with a bulk API operation, as sketched below.
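A rough sketch of both points, assuming the Node.js MongoDB driver and a hypothetical sendBulkToElasticsearch() helper that wraps your Elasticsearch client's bulk API; collection names, batch size and option values are illustrative:
const { MongoClient } = require('mongodb');

async function sync() {
  const client = await new MongoClient('mongodb://localhost:27017').connect();
  const db = client.db('mydb');
  const tokens = db.collection('resume_tokens');         // where we persist the last resume token

  // 1. Resume from the last persisted token after a restart, if one exists.
  const saved = await tokens.findOne({ _id: 'es-sync' });
  const changeStream = db.collection('orders').watch([], {
    fullDocument: 'updateLookup',                        // include the full document for updates
    ...(saved ? { resumeAfter: saved.token } : {}),
  });

  // 2. Batch changes instead of writing to Elasticsearch one document at a time.
  let batch = [];
  for await (const change of changeStream) {
    if (['insert', 'update', 'replace'].includes(change.operationType) && change.fullDocument) {
      batch.push(change.fullDocument);
    }
    if (batch.length >= 500) {
      await sendBulkToElasticsearch(batch);              // hypothetical wrapper around the bulk API
      await tokens.updateOne(
        { _id: 'es-sync' },
        { $set: { token: change._id } },                 // change._id is the resume token
        { upsert: true }
      );
      batch = [];
    }
  }
}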

Why does .find() show fast MongoDB transaction times & slow node.js transaction times, and .findOne() shows the complete opposite?

I recently stress-tested my express server with the following two queries:
db.collection.find(queryCondition).limit(1).toArray()
// and
db.collection.findOne(queryCondition)
THESE ARE THE NEW RELIC RESULTS
Can someone explain why .find() shows fast transaction times for MongoDB yet slow transaction times for node.js? Then, in complete contrast, .findOne() shows slow MongoDB times but fast node.js times?
For context, my express server is on a t2.micro instance and my database is on another t2.micro instance.
Let's compare the performance of .find() and .findOne() in Node.js and at the MongoDB level.
MongoDB:
Here, find().limit() should emerge as the clear winner, because it fetches a cursor to the result, which is a pointer to the result of the query rather than the data itself, and that is precisely what your observation shows.
Node.js:
Here, theoretically, .find().limit() should also be faster. However, in the New Relic results screenshot you linked, you're actually doing .find().limit().toArray(), which fetches an array of data matching your query rather than just the cursor, while findOne() fetches a single document (as a JS object in Node.js).
As per the MongoDB driver docs for Node.js, .find() quickly returns a cursor and is therefore a synchronous operation that does not require a .then() or await. On the other hand, .toArray() is a method of Cursor that fetches all the documents matching the query into an array (much like fetching the cursor and collecting everything .next() returns into an array yourself). This can be time-consuming depending on the query, and therefore it returns a promise.
In your test, what seems to be happening is that with .findOne() you fetch just one document (which takes time at the MongoDB level and at least as much time in Node.js), whereas with find() you first fetch the cursor (fast at the MongoDB level) and then tell the Node.js driver to fetch the data from that cursor (time-consuming). That is why .find().limit(1).toArray() appears more time-consuming than findOne() in Node.js, and why the bottom graph in your link is almost entirely blue, which represents Node.js.
I suggest you try simply doing .find().limit() and checking the result, but note that you won't get your actual data that way, just a cursor that is of little use until you fetch data from it.
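For example, a quick sketch of the difference (collection name and query are illustrative, assuming a connected db handle):
const queryCondition = { status: 'active' };                              // placeholder query

const cursor = db.collection('items').find(queryCondition).limit(1);      // returns a cursor immediately, no document I/O yet
const viaFind = await cursor.toArray();                                   // only now are documents actually fetched
const viaFindOne = await db.collection('items').findOne(queryCondition);  // fetches a single document directly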
I hope this has been of use.

Neo4j JavaScript driver - Subscribing to changes

I'm trying to subscribe to changes in the database using the neo4j-javascript-driver. Currently driver.rxSession() returns a stream of rows; instead, I want a stream of results that updates as the database changes. Currently I'm using this query:
rxSession.run('match (n) return n')
  .records()
  .pipe(
    toArray()
  )
I'm not sure how resource intensive it's gonna be on Neo4j to update on every change on the query result, but does the driver support such a behavior or is there another way to do that?
You can write your own plugin to monitor all changes to the DB, but it will have to be written in Java.
And you can take a look at how the APOC plugin's trigger procedures are implemented for some ideas.
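If APOC is already installed and triggers are enabled, a rough sketch of registering one of its trigger procedures from the JavaScript driver could look like the following; the URI, credentials, trigger name and Cypher statement are illustrative, and the apoc.trigger.add signature shown assumes APOC 4.x:
const neo4j = require('neo4j-driver');

const driver = neo4j.driver('bolt://localhost:7687', neo4j.auth.basic('neo4j', 'password'));
const session = driver.session();

// Run once: every node created afterwards gets a createdAt property, set server-side.
await session.run(`
  CALL apoc.trigger.add(
    'stampCreatedNodes',
    'UNWIND $createdNodes AS n SET n.createdAt = timestamp()',
    { phase: 'after' }
  )
`);

await session.close();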

What is the difference between a changeStream and tailable cursor in MongoDB

I am trying to determine what the difference is between a changestream:
https://docs.mongodb.com/manual/changeStreams
https://docs.mongodb.com/manual/reference/method/db.collection.watch/
which looks like so:
const changeStream = collection.watch();
changeStream.next(function(err, next) {
expect(err).to.equal(null);
client.close();
done();
});
and a tailable cursor:
https://docs.mongodb.com/manual/core/tailable-cursors/
which looks like so:
const cursor = coll.find(self.query || query)
.addCursorFlag('tailable', true)
.addCursorFlag('awaitData', true) // true or false?
.addCursorFlag('noCursorTimeout', true)
.addCursorFlag('oplogReplay', true)
.setCursorOption('numberOfRetries', Number.MAX_VALUE)
.setCursorOption('tailableRetryInterval', 200);
const strm = cursor.stream(); // Node.js transform stream
do they have a different use case? when would it be good to use one over the other?
Change streams (available in MongoDB v3.6+) are a feature that allows you to access real-time data changes without the complexity and risk of tailing the oplog. Key benefits of change streams over tailing the oplog are:
Utilise the built-in MongoDB Role-Based Access Control. Applications can only open change streams against collections they have read access to. Refined and specific authorisation.
Provide a well-defined, reliable API. The change events output returned by change streams is well documented. Also, all of the official MongoDB drivers follow the same specification when implementing the change streams interface.
Change events that are returned as part of change streams are at least committed to the majority of the replica set. This means the change events that are sent to the client are durable. Applications don't need to handle data rollback in the event of failover.
Provide a total ordering of changes across shards by utilising a global logical clock. MongoDB guarantees the order of changes are preserved and change events can be safely interpreted in the order received. For example, a change stream cursor opened against a 3-shard sharded cluster returns change events respecting the total order of those changes across all three shards.
Due to the ordering characteristic, change streams are also inherently resumable. The _id of the change event output is a resume token. The official MongoDB drivers automatically cache this resume token, and in the case of a transient network error the driver will retry once. Additionally, applications can also resume manually by utilising the resume_after parameter. See also Resume a Change Stream.
Utilise the MongoDB aggregation pipeline. Applications can modify the change events output. Currently there are five pipeline stages available to modify the event output. For example, change event outputs can be filtered out (server side) before being sent out using the $match stage, as sketched below. See Modify Change Stream Output for more information.
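A minimal sketch of that kind of server-side filtering (collection name and filter are illustrative, assuming a connected db handle):
// Only insert and update events on this collection are sent to the client.
const changeStream = db.collection('orders').watch([
  { $match: { operationType: { $in: ['insert', 'update'] } } }
]);

changeStream.on('change', (event) => {
  // handle the pre-filtered events
});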
when would it be good to use one over the other?
If your MongoDB deployment is version 3.6+, I would recommend to utilise MongoDB Change Streams over tailing the oplog.
You may also find Change Streams Production Recommendations a useful resource.
With a tailable cursor on the oplog, you follow ALL changes to all collections. With a change stream, you see only changes to the selected collection. Much less traffic and more reliable.

What's the proper way to keep track of changes to documents so that clients can poll for deltas?

I'm storing key-value documents in a Mongo collection, while multiple clients push updates to this collection (posting to an API endpoint) at a very fast pace (updates will come in faster than once per second).
I need to expose another endpoint so that a watcher can poll all changes, in delta format, since the last poll. Each diff must have a sequence number and/or timestamp.
What I'm thinking is:
For each update I calculate a diff and store it.
I store each diff on a mongo collection, with current timestamp (using Node Date object)
On each poll for changes: get all diffs from the collection, delete them and return.
The questions are:
Is it safe to use timestamps for sequencing changes?
Should I be using Mongo to store all diffs as changes are coming or some kind of message queue would be a better solution?
thanks!
On each poll for changes: get all diffs from the collection, delete them and return.
This sounds terribly fragile. What if the client doesn't receive the data (it crashes, or the network drops in the middle of receiving the response)? It retries the request but, oops, doesn't see anything. What I would do instead is have the client remember the last version it saw and ask for updates like this:
GET /thing/:id/deltas?after_version=XYZ
When it receives a new batch of deltas, it gets the last version of that batch and updates its cached value.
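A rough Express sketch of that endpoint, assuming an app and a connected db handle, with route, collection and field names as placeholders, and deltas stored with monotonically increasing ObjectId _ids:
const { ObjectId } = require('mongodb');

app.get('/thing/:id/deltas', async (req, res) => {
  const query = { thingId: req.params.id };
  if (req.query.after_version) {
    // only return deltas created after the version the client already has
    query._id = { $gt: new ObjectId(req.query.after_version) };
  }
  const deltas = await db.collection('deltas').find(query).sort({ _id: 1 }).toArray();
  const lastVersion = deltas.length ? deltas[deltas.length - 1]._id : req.query.after_version;
  res.json({ lastVersion, deltas });
});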
Is it safe to use timestamps for sequencing changes?
I think so. An ObjectId already contains a timestamp, so you might just use that; there's no need for a separate time field.
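For instance, with the Node.js driver you can read the creation time straight off the _id (second precision):
const { ObjectId } = require('mongodb');

const id = new ObjectId();
console.log(id.getTimestamp()); // Date derived from the seconds embedded in the ObjectId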
Should I be using Mongo to store all diffs as changes are coming or some kind of message queue would be a better solution?
Depends on your requirements. Mongo should work well here. Especially if you'll be cleaning old data.
at a very fast pace (updates will come in faster than once per second)
By modern standards, 1 per second is nothing. 10 per second - same. 10k per second - now we're talking.
