I'm trying to subscribe to changes in the database using neo4j-javascript-driver. Currently the driver.rxSession() is returning a stream or rows, instead I want to get a stream of results as the database changes. Currently I'm using this query:
rxSession.run('match (n) return n')
.records()
.pipe(
toArray()
)
I'm not sure how resource intensive it's gonna be on Neo4j to update on every change on the query result, but does the driver support such a behavior or is there another way to do that?
You can write your own plugin to monitor all changes to the DB, but it will have to be written in Java.
And you can take a look at how the APOC plugin's trigger procedures are implemented for some ideas.
Related
The aim is to synchronize fields from certain collections on elasticsearch. With every change on mongodb, this should also be implemented on elasticsearch. I've seen the different packages. For example River. Unfortunately it didn't work out for me so I try without it. Is that the right approach with change streams?
How could you solve that more beautifully? The data must be synchronized with every change (insert, update, delete) on Elasticsearch. For several collections but different for each one (only certain fields per collection). Unfortunately, I don't have the experience to solve this in such a way that it doesn't take much effort if a collection or fields are added or removed
const res = await client.connect();
const changeStream = res.watch();
changeStream.on('change', (data) => {
// check the change (is the chance in the right database / collection)
// parse
// push it to elastic server
});
I hope you can help me, thanks in advance :)
Yes. it will work but you have to handle following scenarios
when your node js process goes down while mongodb updates are ongoing.
you can use resume token and keep track of that token so once your
process comes up it can resume from there.
inserting single document on each change.
it will be overwhelimg for elasticsearch and might result in slow inserts, which
will eventually result in sync lag between mongo and elastic. so better collect
multiple document in change stream and insert with bulk API operation.
I have an original source, Source A, where I am fetching data from. I am reformatting and filtering the data from Source A, and storing it in Source B.
I have created a REST API using Node/Express that uses Source B.
Source A gets updated once every day. I want to update Source B at the same rate. What is the best way to do it?
Ideas so far:
For every API call I get to my server, before returning the data, check if the data was last updated within a day. If not then update the data and send it. This would mean that one API call per day would be extremely slow.
Perform Cron Scheduling
I would like to know if there are more ways to do this and I would like a comparison of different ways? I would also like if any of you guys do something like that in production and what method has worked?
Note: In my case Source A is a CSV file on a github repo, and Source B is MongoDB collection.
The best case you can achieve here is automatic updates to the MongoDB collection whenever that github CSV file is updated. If you can hook your job execution into whatever event is triggering the CSV upload, you would be golden. If you have to go through github, look into github hooks and see if you can subscribe your workload to one of those events
There is a nice option 3 that you can do with mongo, by the way. Mongo queues are great for scheduling jobs at precise intervals.
Recently I have used Jdbc_streaming filter plugin of logstash, it is very helpful plugin which allows me to connect with my database on the fly and perform checks against my events.
But are there any drawbacks or pitfall of using this filter.
I mean I have the following queries :
For example , I am firing select query against each of my events.
Is it a good idea to query my database for each event. I mean what if I am processing a syslog event of a server which is continuously sending me data, in that case for each event I will be triggering a select query on my database so how will my database will react in terms of load and response time.
What about the number of connections, how they are managed.
How this will behave if I join multiple tables.
I hope I am able to convey my question.
I just want to understand , how exactly it is working in back end and does querying my database at massive speed will degrade my database performance.
I am not sure whether this answer is correct or not.
But as per my experience , logstash works in sequential manner for the above plugin.
It creates only single connection to RDS and query's the DB for each record.
So there is no connection overhead, but then it degrades the performance by many folds.
This answer is just from my experience, it might be possible that this can be a completely wrong answer. Any edits or answers are welcome.
I am trying to determine what the difference is between a changestream:
https://docs.mongodb.com/manual/changeStreams
https://docs.mongodb.com/manual/reference/method/db.collection.watch/
which looks like so:
const changeStream = collection.watch();
changeStream.next(function(err, next) {
expect(err).to.equal(null);
client.close();
done();
});
and a tailable cursor:
https://docs.mongodb.com/manual/core/tailable-cursors/
which looks like so:
const cursor = coll.find(self.query || query)
.addCursorFlag('tailable', true)
.addCursorFlag('awaitData', true) // true or false?
.addCursorFlag('noCursorTimeout', true)
.addCursorFlag('oplogReplay', true)
.setCursorOption('numberOfRetries', Number.MAX_VALUE)
.setCursorOption('tailableRetryInterval', 200);
const strm = cursor.stream(); // Node.js transform stream
do they have a different use case? when would it be good to use one over the other?
Change Streams (available in MongoDB v3.6+) is a feature that allows you to access real-time data changes without the complexity and risk of tailing the oplog. Key benefits of change streams over tailing the oplog are:
Utilise the built-in MongoDB Role-Based Access Control. Applications can only open change streams against collections they have read access to. Refined and specific authorisation.
Provide a well defined API that are reliable. The change events output that are returned by change streams are well documented. Also, all of the official MongoDB drivers follow the same specifications when implementing change streams interface.
Change events that are returned as part of change streams are at least committed to the majority of the replica set. This means the change events that are sent to the client are durable. Applications don't need to handle data rollback in the event of failover.
Provide a total ordering of changes across shards by utilising a global logical clock. MongoDB guarantees the order of changes are preserved and change events can be safely interpreted in the order received. For example, a change stream cursor opened against a 3-shard sharded cluster returns change events respecting the total order of those changes across all three shards.
Due to the ordering characteristic, change streams are also inherently resumable. The _id of change event output is a resume token. MongoDB official drivers automatically cache this resume token, and in the case of network transient error the driver will retry once. Additionally, applications can also resume manually by utilising parameter resume_after. See also Resume a Change Stream.
Utilise MongoDB aggregation pipeline. Applications can modify the change events output. Currently there are five pipeline stages available to modify the event output. For example, change event outputs can be filtered out (server side) before being sent out using $match stage. See Modify Change Stream Output for more information.
when would it be good to use one over the other?
If your MongoDB deployment is version 3.6+, I would recommend to utilise MongoDB Change Streams over tailing the oplog.
You may also find Change Streams Production Recommendations a useful resource.
With tailable cursor, you follow ALL changes to all collections. With changeStream, you see only changes to the selected collection. Much less traffic and more reliable.
I'm trying to make use of Mongoose and its querystream in a scheduling application, but maybe I'm misunderstanding how it works. I've read this question here on SO [Mongoose QueryStream new results and it seems I'm correct, but someone please explain:
If I'm filtering a query like so -
Model.find().stream()
when I add or change something that matches the .find(), it should throw a data event, correct? Or am I completely wrong in my understanding of this issue?
For example, I'm trying to look at some data like so:
Events.find({'title':/^word/}).stream();
I'm changing titles in the mongodb console, and not seeing any changes.
Can anyone explain why?
Your understanding is indeed incorrect as a stream is just an output stream of the current query response and not something that "listens for new data" by itself. The returned result here is basically just a node streaming interface, which is an optional choice as opposed to a "cursor", or indeed the direct translation to an array as mongoose methods do by default.
So a "stream" does not just "follow" anything. It is reall just another way of dealing with the normal results of a query, but in a way that does not "slurp" all of the results into memory at once. It rather uses event listeners to process each result as it is fetched from the server cursor.
What you are in fact talking about is a "tailable cursor", or some variant thereof. In basic MongoDB operations, a "tailable cursor" can be implemented on a capped collection. This is a special type of collection with specific rules, so it might not suit your purposes. They are intended for "insert only" operations which is typically suited to event queues.
On a model that is using a capped collection ( and only where a capped collection has been set ) then you implement like this:
var query = Events.find({'title':/^word/}).sort({ "$natural": -1}).limit(1);
var stream = query.tailable({ "awaitdata": true}).stream();
// fires on data received
stream.on("data",function(data) {
console.log(data);
});
The "awaitdata" there is just as an important option as the "tailable" option itself, as it is the main thing that tells the query cursor to remain "active" and "tail" the additions to the collection that meet the query conditions. But your collection must be "capped" for this to work.
An alternate and more adavanced approach to this is to do something like the meteor distribution does, where the "capped collection" that is being tailed is in fact the MongoDB oplog. This requires a replica set configuration, however just as meteor does out of the box, there is nothing wrong with having a single node as a replica set in itself. It's just not wise to do so in production.
This is more adavnced than a simple answer, but the basic concept is since the "oplog" is a capped collection you are able to "tail" it for all write operations on the database. This event data is then inspected to determine such details as the collection you want to watch for writes has been written to. Then that data can be used to query the new information and do something like return the updated or new results to a client via a websocket or similar.
But a stream in itself is just a stream. To "follow" the changes on a collection you either need to implement it as capped, or consider implementing a process based on watching the changes in the oplog as described.