PouchDB/CouchDB Conflict Resolution Server Side - node.js

I'm new to pouch/couch and looking for some guidance on handling conflicts. Specifically, I have an extension running PouchDB (distributed to two users), and the idea is to have a pouchdb-server or CouchDB instance running remotely (does it matter for this small a use case?). The crux of my concern is handling conflicts: the data will be changing frequently, and although the extensions won't be doing live sync, they will be syncing very often. I have conflict handling written into the data submission functions, but there could still be conflicts when syncing occurs with multiple users.
I was looking at the pouch-resolve-conflicts plugin and immediately saw the author state:
"Conflict resolution should better be done server side to avoid hard to debug loops when multiple clients resolves conflicts on the same documents".
This makes sense to me, but I am unsure how to implement such conflict resolution. The only way I can think of would be to place a REST API layer in front of the remote database that handles all updates/conflicts etc. with custom logic. But then how could I use the pouch sync functionality? At that point I may as well just use a different database.
I've just been unable to find any resources discussing how to implement conflict resolution server-side; in fact, I've found the opposite.

With your use case, you could probably write to a local PouchDB instance and sync it with the master database. Then you could have a daemon that automatically resolves conflicts on your master database.
Below is my approach to solve a similar problem.
I have made a NodeJS daemon that automatically resolves conflicts. It integrates deconflict, a NodeJS library that allows you to resolve a document in three ways:
Merge all revisions together
Keep the latest revision (based on a custom key, e.g. updated_at)
Pick a certain revision (here you can use your own logic)
Revision deconflict
The way I use CouchDB, every write is partial. We always take some changes and apply them to the latest document. With this approach, we can easily take the merge-all-revisions strategy.
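For illustration, here is a minimal sketch of that merge-all-revisions strategy written directly against PouchDB's API (it is not the deconflict library's actual code, and the shallow merge only makes sense because our writes are partial):

```js
// A sketch of merge-all-revisions against PouchDB's API (not the actual
// deconflict code). The shallow merge is only safe because writes are partial.
const PouchDB = require('pouchdb');
const db = new PouchDB('http://localhost:5984/mydb'); // placeholder URL

async function resolveByMerging(docId) {
  // Fetch the winning revision together with the list of conflicting revs
  const winner = await db.get(docId, { conflicts: true });
  if (!winner._conflicts || winner._conflicts.length === 0) return winner;

  // Load every losing revision and shallow-merge it under the winner's fields
  const losers = await Promise.all(
    winner._conflicts.map((rev) => db.get(docId, { rev }))
  );
  const merged = losers.reduce(
    (acc, loser) => Object.assign({}, loser, acc), // winner's fields win
    winner
  );
  delete merged._conflicts; // CouchDB rejects unknown underscore fields

  // Write the merged document, then delete the losing revisions
  await db.put(merged);
  await Promise.all(losers.map((loser) => db.remove(docId, loser._rev)));
  return merged;
}
```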
Conflict scanner
When the daemon boots, two processes are executed. The first goes through all existing changes; if a conflict is detected, it's added to a conflict queue.
The second process is started and remains active: a continuous changes scanner.
It listens to all new changes and adds conflicted documents to the conflict queue.
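Continuing the sketch above, the continuous scanner can be built on the changes feed with conflicts: true (the queue here is just an in-memory array for illustration):

```js
// Continuous changes scanner (continuing the sketch above): any doc that
// arrives with a _conflicts array gets queued for resolution.
const conflictQueue = [];

db.changes({
  live: true,
  since: 'now',
  include_docs: true,
  conflicts: true, // ask CouchDB to include the _conflicts array on each doc
})
  .on('change', (change) => {
    if (change.doc && change.doc._conflicts && change.doc._conflicts.length) {
      conflictQueue.push(change.doc._id);
    }
  })
  .on('error', (err) => console.error('changes feed error', err));
```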
Queue processing
Another process is started and keeps polling the queue for new conflicted documents. It gets conflicted documents in batches and resolves them one by one. If there are no documents, it just waits a certain period and starts polling again.
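A rough sketch of that polling loop, reusing conflictQueue and resolveByMerging from the sketches above (the batch size and interval are arbitrary):

```js
// Queue processor (continuing the sketches above): drain the queue in
// batches, resolve each doc, and sleep when the queue is empty.
const BATCH_SIZE = 10;
const POLL_INTERVAL_MS = 5000;

async function processQueue() {
  while (true) {
    const batch = conflictQueue.splice(0, BATCH_SIZE);
    if (batch.length === 0) {
      await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
      continue;
    }
    for (const docId of batch) {
      await resolveByMerging(docId); // from the first sketch
    }
  }
}

processQueue().catch(console.error);
```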

Having worked a little bit with Redux, I realized that the same concept of unidirectional flow would help me avoid the problem of conflicts altogether.
Redux flows in one direction: the view dispatches actions, a reducer computes the new state from them, and the view re-renders from that state.
So, my client-side code never writes definitive data to the master database; instead it writes insert/update/delete requests locally, which PouchDB then pushes to the CouchDB master database. On the same server as the master CouchDB I have PouchDB in NodeJS replicating these requests. "Supervisor" software in NodeJS examines each new request, changes its status to "processing", writes the requested updates, inserts and deletes, then marks the request "processed". To ensure they're processed one at a time, the code that receives each request stuffs it into a FIFO, and the processing code pulls requests from the other end.
I'm not dealing with super high volume, so the latency is not a concern.
I'm also not facing a situation where numerous people might be trying to update exactly the same record at the same time. If that's your situation, your client-side update requests will need to specify the _rev number, and your "supervisors" will need to reject change requests that refer to a superseded revision. You'll have to figure out how your client code will get and respond to those rejections.
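To make the idea concrete, here is a minimal sketch of such a supervisor; the database names, the request document shape (status, op, targetId, payload) and the polling interval are assumptions for illustration, not my actual code:

```js
// A minimal sketch of the supervisor loop. Document shapes, field names and
// database names are illustrative assumptions.
const PouchDB = require('pouchdb');
const requestsDb = new PouchDB('http://localhost:5984/requests');
const dataDb = new PouchDB('http://localhost:5984/data');

const fifo = [];

// Receive each replicated request document and stuff it into the FIFO.
requestsDb
  .changes({ live: true, since: 'now', include_docs: true })
  .on('change', (change) => {
    if (change.doc && change.doc.status === 'new') fifo.push(change.doc);
  });

// Pull requests from the other end and process them one at a time.
async function supervise() {
  while (true) {
    const request = fifo.shift();
    if (!request) {
      await new Promise((resolve) => setTimeout(resolve, 1000));
      continue;
    }
    await requestsDb.put({ ...request, status: 'processing' });

    // Apply the requested insert/update/delete to the definitive data.
    // (Rev checks / rejection of superseded revisions are elided here.)
    if (request.op === 'delete') {
      const doc = await dataDb.get(request.targetId);
      await dataDb.remove(doc);
    } else {
      await dataDb.put(request.payload);
    }

    const latest = await requestsDb.get(request._id); // pick up the new _rev
    await requestsDb.put({ ...latest, status: 'processed' });
  }
}

supervise().catch(console.error);
```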

Related

Migrate legacy database to CQRS/event sourcing view

We have an old legacy application with complex business logic that we need to rewrite. We are considering CQRS and event sourcing, but it's not clear how to migrate data from the old database. Probably we need to migrate it to the read database only, as we can't reproduce all the events to populate the event store. But do we at least need to create some initial records in the event store for each aggregate, like AggregateCreated? Or do we need to write scripts and use all the commands one by one to recreate aggregates the same way we normally would with event sourcing?
Using the existing database, or a transformed version of it, as the starting point of your read-side persistence is never a good idea. Your event-sourced system needs to have its own starting point, so that you get one of the main benefits of event sourcing: being able to create projections on demand, using polyglot persistence.
Using commands for migration is also not a good idea, for the simple reason that commands, by definition, can fail due to pre- or post-condition checks that protect invariants. It also does not convey the meaning of migration, which is to represent the current system state as it is right now. Remember that the current system state is not something you can accept or deny. It is given to you, and your job is to capture it.
The best practice for such a migration is to emit so-called migration events, like EntityXMigratedFromLegacy. Of course, the work might be substantial, mainly because the legacy system model will most probably not match the new model; otherwise the reason for such a migration wouldn't be entirely clear.
By using migration events you explicitly state the fact that a piece of state was moved from another place, as-is. You will always know how the migrated entity started its lifecycle in the new system - either by being migrated from legacy or by being initialised in the new system.
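As a rough sketch of what emitting such migration events could look like, here is a one-off script; readLegacyOrders and eventStore.appendToStream are placeholders for whatever legacy data access and event-store client you actually use:

```js
// Sketch of a one-off migration script emitting migration events.
// readLegacyOrders and eventStore.appendToStream are placeholders for your
// actual legacy data access and event-store client.
async function migrateLegacyOrders(readLegacyOrders, eventStore) {
  for await (const legacyOrder of readLegacyOrders()) {
    await eventStore.appendToStream(`Order-${legacyOrder.id}`, [
      {
        type: 'OrderMigratedFromLegacy',
        data: {
          orderId: legacyOrder.id,
          customerId: legacyOrder.customer_id,
          lines: legacyOrder.lines, // carry the legacy state over as-is
          migratedAt: new Date().toISOString(),
        },
      },
    ]);
  }
}
```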
Probably we need to migrate it to the read database only
No: your read-model DB can be dropped and recreated at any time from the write side; only the write side is your source of truth.
But do we at least need to create some initial records in the event store for each aggregate, like AggregateCreated?
Of course, and having ONLY the initial event may not be enough. If your current OrderAggregate has reservations, you must create an ItemReservedEvent for each reservation it has.
Or do we need to write scripts and use all the commands one by one to recreate aggregates the same way we normally would with event sourcing?
That feels like the way you should go: read each old aggregate/entity from the DB and try to map it to a new one.

In an Event-Driven Microservice, how do I update my private database with older data

I'm working on a new project, and I am still learning about how to use Microservice/Domain Driven Design.
If the recommended architecture is to have a Database-Per-Service, and use Events to achieve eventual consistency, how does the service's database get initialized with all the data that it needs?
If the events indicating an update to the database occurred before the new service/db was ever designed, do I need to start with a copy of the previous database?
Or should I publish a 'New Service On The Block' event, and allow all the other services to vomit everything back to me again? That could be a LOT of chattiness and cause performance issues.
how does the service's database get initialized with all the data that it needs?
It asks for it; which is to say that you design a protocol so that the service that is spinning up can get copies of all of the information that it needs. That often includes tracking checkpoints, and queries that allow you to ask what has happened since some checkpoint.
Think "pull", rather than "push".
Part of the point of "services" is designing the right data boundaries. The need to copy a lot of data between services often indicates that the service boundaries need to be reconsidered.
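A minimal sketch of such a pull-based catch-up, assuming a hypothetical /events endpoint that returns events after a given checkpoint (the endpoint, query parameters and position field are illustrative, not a standard API):

```js
// Sketch of pull-based catch-up using a checkpoint. The /events endpoint,
// its query parameters and the "position" field are assumptions for
// illustration. Uses the global fetch available in Node 18+.
async function catchUp(baseUrl, applyEvent, loadCheckpoint, saveCheckpoint) {
  let checkpoint = await loadCheckpoint(); // e.g. last processed event number
  while (true) {
    const res = await fetch(`${baseUrl}/events?after=${checkpoint}&limit=100`);
    const events = await res.json();
    if (events.length === 0) break; // caught up with the upstream service
    for (const event of events) {
      await applyEvent(event);       // update this service's own database
      checkpoint = event.position;
      await saveCheckpoint(checkpoint);
    }
  }
}
```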
There is a streaming platform named Apache Kafka that solves something similar.
With Kafka you would publish events for other services to consume. What makes Kafka special is that events never get deleted (depending on configuration) and can be consumed again by new services spinning up. This feature can be used to initially populate the database (by setting the offset for a topic to 0 and re-reading the history of events).
There is also another feature, called GlobalKTable, which is a table view of all events for a particular topic. The GlobalKTable holds the latest value for each key (like a primary key) and can be turned into a state store (RocksDB under the hood), which makes it queryable. This state store initializes itself whenever the application starts up, so the application does not need a database of its own, because the state store is kept up to date automatically (consistency is still something to keep in mind). Only for more complex queries does the state store need to be accompanied by a database (with Kafka you would try to pre-compute the results of those queries and make them accessible in a distinct state store of their own).
This would be a complex endeavor, but if it suits your needs it is a fun thing to do!
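To illustrate just the "re-read the history from offset 0" part (the GlobalKTable itself is a Kafka Streams feature on the JVM), here is a sketch using the kafkajs Node client; the topic name and applyEvent callback are assumptions:

```js
// Sketch: a brand-new service re-reads a topic's full history to build its
// own database. Uses the kafkajs client; topic name and applyEvent are
// placeholders.
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'new-service', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'new-service' });

async function rebuildLocalState(applyEvent) {
  await consumer.connect();
  // fromBeginning makes a brand-new consumer group start at offset 0
  await consumer.subscribe({ topic: 'customer-events', fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const event = JSON.parse(message.value.toString());
      await applyEvent(event); // upsert into this service's own database
    },
  });
}
```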

What is the best way to resolve CouchDB document conflicts across 2 DB instances?

I have one application running over NodeJS and I am trying to make a distributed app. All write requests go to the Node application, which writes to CouchDB A and, on success of that, writes to CouchDB B. We read data through an ELB (which reads from the 2 DBs). It's working fine.
But I faced a problem recently: CouchDB B went down, and after it came back up, there is now a document _rev mismatch between the 2 instances.
What would be the best approach to resolve the above scenario without any down time?
If your CouchDB A & CouchDB B are in the same data centre, then @Flimzy's suggestion of using CouchDB 2.0 in a clustered deployment is a good one. You can have n CouchDB nodes configured in a cluster with a load balancer sitting above the cluster, delivering HTTP(S) traffic to any node that is "up".
If A & B are geographically separated, you can use CouchDB Replication to move data from A-->B and B-->A which would keep both instances perfectly in sync. A & B could each be clusters of 3 or more CouchDB 2.0 nodes, or single instances of CouchDB 1.7.
None of these solutions will "fix" the problem you are seeing when two copies of the database are modified in different ways at the same time. This "conflict" state is CouchDB's way of preventing data loss when two writes clash. Your app can resolve the conflict by picking a winning revision or writing a new one. It's not a fault condition; it's helping your application recover from data loss during concurrent writes in a distributed system.
You can read more about document conflicts in this blog post series.
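As a sketch of the replication setup described above, you can create two continuous replications (A to B and B to A) by writing documents into CouchDB's _replicator database; the URLs and credentials below are placeholders:

```js
// Sketch: two continuous replications via CouchDB's _replicator database.
// URLs and credentials are placeholders; uses the global fetch from Node 18+.
const couchA = 'http://admin:password@couch-a.example.com:5984';

async function createReplication(source, target) {
  await fetch(`${couchA}/_replicator`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ source, target, continuous: true }),
  });
}

async function main() {
  // A --> B and B --> A keeps both instances in sync; conflicts can still
  // appear and must be resolved by the application, as described above.
  await createReplication('http://couch-a.example.com:5984/mydb',
                          'http://couch-b.example.com:5984/mydb');
  await createReplication('http://couch-b.example.com:5984/mydb',
                          'http://couch-a.example.com:5984/mydb');
}

main().catch(console.error);
```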
If both of your 1.6.x nodes are syncing databases using standard replication, turning off one node shouldn't be an issue. When the node comes back up, it receives all updates without conflicts, because there was no way to create them while the node was down.
If you experience conflicts during normal operation, unfortunately there is no general way to resolve them automatically. However, in most cases you can find a strategy of marking affected doc subtrees in a way that allows you to determine which subversion is most recent (or more important).
To detect docs that have conflicts you may use standard views: a doc received by a view function has the _conflicts property if conflicting revisions exist. Using an appropriate view you can detect conflicts and merge docs. Regardless of how you detect conflicts, you need external code to resolve them.
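For example, a design document along these lines (the design doc and view names are just examples) exposes all conflicted documents through a view:

```js
// Sketch of a design document whose view lists conflicted docs; the design
// doc and view names are just examples.
const conflictsDesignDoc = {
  _id: '_design/maintenance',
  views: {
    conflicts: {
      // The map function runs inside CouchDB; docs with conflicting
      // revisions carry a _conflicts array.
      map: "function (doc) { if (doc._conflicts) { emit(doc._id, doc._conflicts); } }",
    },
  },
};

// After saving the design doc (e.g. db.put(conflictsDesignDoc) with PouchDB),
// query it to list documents that still need resolution:
//   GET /mydb/_design/maintenance/_view/conflicts
```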
If your conflicting data is numeric by nature, consider using CRDT structures and standard map/reduce to obtain the final value. If your data is text-like you may also try to use CRDTs, but to obtain reasonable performance you will need to use reducers written in Erlang.
As for 2.x: I do not recommend using 2.x for your case (actually, for any real case except experiments). First, using 2.x will not remove conflicts, so it does not solve your problem. Also, taking into account that 2.x requires a lot of poorly documented manual operations across nodes and is unable to rebalance, you will get more pain than value.
BTW, using any cluster solution makes very little sense for two nodes.
As for CVE-2017-12635 and CouchDB 1.6.x: you can use this patch https://markmail.org/message/kunbxk7ppzoehih6 to cover the vulnerability.

Options for getting a CPU intensive job off my web server?

I have been working on a Web App for visualizing live data. It is crucial that this data is kept up to date on the client side without such updates being invoked directly by the client (e.g. no button presses or refreshing the page). Currently, on page load, I grab the current data set from a database (DynamoDB) via Ajax, and subsequent updates are pushed to any listening clients every 5 minutes via a Websockets connection (using Socket.io).
I had overlooked the computational load of this update job. It has to mine some data, process it, update the database, and send the update out to all clients. As a result, the web server is left unresponsive for about 30 seconds with each update. Furthermore, my current architecture prevents me from putting my server behind a load balancer, which is something I anticipate needing in the future. For both these reasons, I really need to get this update job off my web server.
I am relatively inexperienced in web development, and I don't feel I am knowledgeable enough about these technologies to know the drawbacks of the solutions I have come up with. Currently, I am considering:
Break the update off into a separate process so it does not block the Node event loop. This would solve my issue in the short term, but if I ever want to load balance my application, I can't have the update running on multiple machines.
Drop Websockets entirely and just have the client query the database every 5 minutes, while a separate process (or separate server if I want load balancing) keeps the database up to date without interacting directly with the client. Will this kind of access pattern put too much load on my db?
Have a separate server run the update, and send the result via Websockets (or maybe some other protocol) to my load balanced application servers, which then push that update to all listening clients as usual. Is this even possible?
Perhaps there are other solutions. It seems like this would be a relatively common problem, so I was hoping I could find some guidance here. What are the potential issues with the solutions I have proposed, and are there other possible solutions that may suit my use case better?
It sounds like you want one process sitting somewhere which crunches the data and publishes it to a stream. Clients can then subscribe to the stream as and when they like. Redis handles streams nicely: you could process your data and push it into a Redis stream. You could then create a small Node service which subscribes to the Redis stream and pushes the formatted data out over a websocket or via polling.
In this scenario you can then either scale up the publishing process (the one crunching the numbers) if your data load goes up, or scale up your subscriber process (which serves the data over a websocket to browsers) if you get an influx of clients watching the data.
You can also easily distribute the hosting of these services across other machines, and even write them in different languages if you decide the number crunching needs something like threading.
You're then left with the issue of clients (web browsers) consuming this data with a load balancer in between. This can be a hard problem if you use websockets, and it comes with its own pros and cons. But importantly, you'll have separated your data crunching from your result publishing, which isolates your problem to just the load balancing.
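A rough sketch of that split using the ioredis client (the stream name, field layout and broadcast callback are illustrative):

```js
// Sketch of the crunch-and-publish split with Redis streams via ioredis.
// The stream name, field layout and broadcast callback are illustrative.
const Redis = require('ioredis');

// Publisher side: the number-crunching process appends each result.
async function publishResult(redis, payload) {
  await redis.xadd('live-data', '*', 'payload', JSON.stringify(payload));
}

// Subscriber side: the web-facing Node service tails the stream and forwards
// each entry to connected clients (e.g. broadcast = (p) => io.emit('update', p)).
async function tailStream(redis, broadcast) {
  let lastId = '$'; // start with new entries only
  while (true) {
    const result = await redis.xread('BLOCK', 0, 'STREAMS', 'live-data', lastId);
    if (!result) continue;
    const [, entries] = result[0];
    for (const [id, fields] of entries) {
      lastId = id;
      broadcast(JSON.parse(fields[1])); // fields = ['payload', '<json>']
    }
  }
}

// Each process would create its own connection:
//   const redis = new Redis(); // localhost:6379 by default
```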
I have done pretty much the same thing to check resources on some of our servers.
I have a C# service getting the information on each server that we manage and sending it to a queue (AMQ).
From there, I have a STOMP client fetching data from AMQ and emitting it to a websocket.
My main microservice fetches the data to save it into a DB.
My visualisation web app is connected to the same websocket and displays the data as it is sent.
The AMQ step isn't mandatory at all; it's just something I had to work with (historical reasons).
I don't know what type of data you are working with, so I don't know if my solution can apply to you.
Don't hesitate to ask if I'm not clear or if you have any questions.
This is a big question and I'm not going to try and give you a definitive answer.
For option 2
It really depends on how expensive your queries are. You can make DynamoDB fast if you pay for enough throughput. That said, on the face of it, re-loading your whole dataset, which sounds like it's probably large, probably isn't good engineering.
For option 3
This option seems best to me if it's achievable, although admittedly it's hard to say with such a complex system; obviously you can't share your whole project.
Given you are already using AWS, you might want to look into AWS Lambda. If you can move the update process into a standalone job, you can host it on Lambda and move the load off the web server. Lambda is essentially infinitely scalable and you only pay for the compute you use.
This really depends on you being able to split the update task off into a separate service. It's likely you would need a fair bit of refactoring to isolate it as a service. If you can break little bits off at a time and make the move gradually, even better.
If you consider trying this, and you've not used Lambda before, I would definitely start small with some hello world examples. Then try a very simple service in your application, and build up to taking on the update service.
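As a very rough sketch, the update job as a scheduled Lambda might be nothing more than a thin handler around your existing steps (mineData, processData, writeToDynamo and notifyWebServers are placeholders for your own functions):

```js
// Sketch of the update job as a standalone Lambda handler, triggered by a
// CloudWatch/EventBridge schedule every 5 minutes. mineData, processData,
// writeToDynamo and notifyWebServers are placeholders for your own functions.
exports.handler = async () => {
  const raw = await mineData();         // gather the source data
  const processed = processData(raw);   // CPU-heavy work now runs off the web server
  await writeToDynamo(processed);       // update the table the web app reads from
  await notifyWebServers(processed);    // e.g. POST to the servers or push to SQS/SNS
  return { statusCode: 200 };
};
```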
You might also consider looking into AWS Simple Queue Service (SQS) to handle the comms between clients and server.
Database tuning
If a lot of your update time is spent waiting for database actions to complete, rather than on server processing, you can consider tuning that side of things. Things to consider are:
Buying more throughput
Using batch operations, as these move load from your server to DynamoDB (see the sketch after this list)
Tuning keys, indexes and database access
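As an example of the batch-operations point above, here is a sketch using the AWS SDK for JavaScript v2 DocumentClient; the table name and item shape are illustrative, and batchWrite accepts at most 25 items per call:

```js
// Sketch of batched writes with the AWS SDK for JavaScript v2 DocumentClient.
// Table name and item shape are illustrative.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function saveBatch(items) {
  for (let i = 0; i < items.length; i += 25) {
    const chunk = items.slice(i, i + 25);
    await docClient
      .batchWrite({
        RequestItems: {
          LiveData: chunk.map((item) => ({ PutRequest: { Item: item } })),
        },
      })
      .promise();
    // Production code should also retry any UnprocessedItems in the response.
  }
}
```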

How to handle domain model updates and immutability of stored events?

I understand that events in event sourcing should never be allowed to change. But what about the in-memory state? If the domain model needs to be updated in some way, shouldn't old events still be replayed into the old models? I mean, shouldn't it be possible to always replay events and get the exact same state as before, or is it acceptable if this state evolves too, as long as the stored events remain the same? Ideally, I think I'd like to be able to get the state as it was, with its old models, rules and whatnot. But other than that, I of course also want to replay old events into new models. What does the theory say about this?
Anticipate event structure changes
You should always try to reflect, in your event application mechanism (i.e. where you read events and apply them to the model), the fact that an event had a different structure. After all, the earlier structure of an event was a valid structure at that time.
This means that you need to be prepared for this situation. Design the event application mechanism flexible enough so that you can support this case.
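One common way to build in that flexibility is an "upcasting" step that upgrades old event versions to the current shape before they are applied. Here is a sketch; the event names, versions and fields are made up for illustration:

```js
// Sketch of a tolerant event-application step: old event versions are
// upgraded ("upcast") to the current shape before being applied to the model.
// Event names, versions and fields are illustrative.
const upcasters = {
  // v1 stored a single "name" field; the current model expects first/last.
  'CustomerRegistered.v1': (e) => ({
    ...e,
    version: 2,
    data: {
      firstName: e.data.name.split(' ')[0],
      lastName: e.data.name.split(' ').slice(1).join(' '),
    },
  }),
};

function upcast(event) {
  const upcaster = upcasters[`${event.type}.v${event.version}`];
  return upcaster ? upcast(upcaster(event)) : event; // chain until current
}

function replay(events, applyToModel, initialState) {
  return events.map(upcast).reduce(applyToModel, initialState);
}
```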
Migrating stored events
Only as a very last resort should you migrate the stored events. If you do it, make sure you understand the consequences:
Which other systems consumed the legacy events?
Do we have a problem with them if we change a stored event?
Does the migration work for our system (verify in a QA environment with a full data set)?
