Node.js - Scaling with Redis atomic updates - node.js

I have a Node.js app that preforms the following:
get data from Redis
preform calculation on data
write new result back to Redis
This process may take place several times per second. The issue I now face is that I wish to run multiple instances of this process, and I am obviously seeing out of date date being updated due to each node updating after another has got the last value.
How would I make the above process atomic?
I cannot add the operation to a transaction within Redis as I need to get the data (which would force a commit) before I can process and update.
Can anyone advise?

Apologies for the lack of clarity with the question.
After further reading, indeed I can use transactions however the area I was struggling to understand was that I need separate out the read from the update, and just wrap the update in the transaction along with using WATCH on the read. This causes the update transaction to fail if another update has taken place.
So the workflow is:
WATCH key
GET key
MULTI
SET key
EXEC
Hopefully this is useful for anyone else looking to an atomic get and update.

Redis supports atomic transactions http://redis.io/topics/transactions

Related

Accidentally deleted a table from redis, is there any roll back like operations?

Been working on the Redis since a year, have not faced this issue. Suddenly went to delete a particular record in the Table and deleted the whole table. I need some help.
According to Redis Documentation, it doesn't support rollback transactions, the fact that Redis commands can fail during a transaction without it rolling back may be odd to you if you have a relational DB background.
However there are good opinions for this behavior:
Redis commands can fail only if called with a wrong syntax (and the problem is not detectable during the command queueing), or against
keys holding the wrong data type: this means that in practical terms a
failing command is the result of a programming errors, and a kind of
error that is very likely to be detected during development, and not
in production.
Redis is internally simplified and faster because it does not need the ability to roll back.
Refer to the Documentation
Redis does not have the rollback feature, except under certain conditions you can cheat with restoring from a file. I mean, you can lock Redis's dump.rdb file for write and restart the service. Redis's state will be rolled back to the time of last fsync to the file. Wouldn't recommend doing this, though. Default timer to save Redis state is 15 to 1 min depending on number of writes.
I mean seriously, don't do this.

Whats the proper way to keep track of changes to documents so that client can poll for deltas?

I'm storing key-value documents on a mongo collection, while multiple clients are pushing updates to this collection (posting to an API endpoint) at a very fast pace (updates will come in faster than once per second).
I need to expose another endpoint so that a watcher can poll all changes, in delta format, since last poll. Each diff must have a sequence number and/or timestamp.
What I'm thinking is:
For each update I calculate a diff and store it.
I store each diff on a mongo collection, with current timestamp (using Node Date object)
On each poll for changes: get all diffs from the collection, delete them and return.
The questions are:
Is it safe to use timestamps for sequencing changes?
Should I be using Mongo to store all diffs as changes are coming or some kind of message queue would be a better solution?
thanks!
On each poll for changes: get all diffs from the collection, delete them and return.
This sounds terribly fragile. What if client didn't receive the data (he crashed/network disappeared in the middle of receiving the response)? He retries the request, but oops, doesn't see anything. What I would do is that client remembers last version it saw and asks for updates like this:
GET /thing/:id/deltas?after_version=XYZ
When it receives a new batch of deltas, it gets the last version of that batch and updates its cached value.
Is it safe to use timestamps for sequencing changes?
I think so. ObjectId already contains a timestamp, so you might use just that, no need for separate time field.
Should I be using Mongo to store all diffs as changes are coming or some kind of message queue would be a better solution?
Depends on your requirements. Mongo should work well here. Especially if you'll be cleaning old data.
at a very fast pace (updates will come in faster than once per second)
By modern standards, 1 per second is nothing. 10 per second - same. 10k per second - now we're talking.

Mongodb, can i trigger secondary replication only at the given time or manually?

I'm not a mongodb expert, so I'm a little unsure about server setup now.
I have a single instance running mongo3.0.2 with wiredtiger, accepting both read and write ops. It collects logs from client, so write load is decent. Once a day I want to process this logs and calculate some metrics using aggregation framework, data set to process is something like all logs from last month and all calculation takes about 5-6 hours.
I'm thinking about splitting write and read to avoid locks on my collections (server continues to write logs while i'm reading, newly written logs may match my queries, but i can skip them, because i don't need 100% accuracy).
In other words, i want to make a setup with a secondary for read, where replication is not performing continuously, but starts in a configured time or better is triggered before all read operations are started.
I'm making all my processing from node.js so one option i see here is to export data created in some period like [yesterday, today] and import it to read instance by myself and make calculations after import is done. I was looking on replica set and master/slave replication as possible setups but i didn't get how to config it to achieve the described scenario.
So maybe i wrong and miss something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondaries optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't have any inconsistencies between them due to data changing between them.
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.

How to compute the time taken in data re balancing in hazelcast, in case of a node failure?

I am trying to figure out, how much time does hazelcast take to re-balance (re-partition) the data in case of a node failure. with varying backup counts.
Is there any way to figure this out.
I tried using the migration listener, but its not notified in case of a node exit. The call back happens only in case of a node being added. I have tried this with three nodes, so as to rule out data being reclaimed from the backup, and thus no migration.
The other approach I tried was using the "isClusterSafe" API. So when a member is notified of node exit (using MembershipListener), I measure the time till the "isClusterSafe" API returns true.
Is there any other way to figure out this? And will my second approach give accurate values?
You should take a look at the MigrationListener (not to be confused with PartitionListener). Seems this has a hook to know when a partition's migration has started and finished. You can calculate the time taken per partition ID.
I would then use this in conjunction with the MembershipListener so that you could figure out when a partition has been migrated to another node due to a node failure (and not just some sort of scheduled rebalancing).

Multiple node instances with a single database

I'm currently writing a Node app and I'm thinking ahead in scaling. As I understand, horizontal scaling is one of the easier ways to scale up an application to handle more concurrent requests. My working copy currently uses MongoDb on the backend.
My question is thus this: I have a data structure that resembles a linked list that requires the order to be strictly maintained. My (imaginary) concern is that when there is a race condition to the database via multiple node instances, it is possible that the resolution of the linked list will be incorrect.
To give an example: Imagine the server having this list a->b. Instance 1 comes in with object c and instance 2 comes in with object d. It is possible that there is a race condition in which both instances read a->b and decides to append their own objects to the list. Instance 1 will then imagine it's insertion to be a->b->c while instance 2 think it's a->b->d when the database actually holds a->b->c->d.
In general, this sounds like a job for optimistic locking, however, as I understand, neither MongoDB or Redis (the other database that I am considering) does transactions in the SQL manner.
I therefore imagine the solution to be one of the below:
Implement my own transaction in MongoDB using flags. The client does a findAndModify on the lock variable and if successful, performs the operations. If unsuccessful, the client retries after a certain timeout.
Use Redis transactions and pubsub to achieve the same effect. I'm not exactly sure how to do this yet, but it sounds like it might be plausible.
Implement some sort of smart load balancing. If multiple clients is operating on the same item, route them to the same instance. Since JS is single threaded, the problem would be solved. Unfortunately, I didn't find a straightforward solution to that.
I sure there exists a better, more elegant way to achieve the above, and I would love to hear any solutions or suggestions. Thank you!
If I understood correctly, and the list is being stored as one single document, you might be looking at row versioning. So add a property to the document that will handle the version, when you update, you increase (or change) the version and you make that a conditional update:
//update(condition, value)
update({version: whateverYouReceivedWhenYouDidFind}, newValue)
Hope it helps.
Gus
You want the findAndModify command on mongodb that will guarantee an atomic modification while returning the newly modified doc. As the changes are serial and atomic instance 1 will have a->b->c and instance 2 will have a->b->c->d
Cheers
If all you are doing is adding new elements to the list, you could use a Redis list and include the time in every value you add. The list may be unsorted on redis but should be quickly sortable when retrieved.

Resources