What is the best way to resolve CouchDB document conflicts across 2 DB instances?

I have an application running on NodeJS and I am trying to make it a distributed app. All write requests go to the Node application, which writes to CouchDB A and, on success, writes to CouchDB B. We read data through an ELB (which reads from the 2 DBs). It's working fine.
But I recently faced a problem: CouchDB B went down, and after it came back up there was a document _rev mismatch between the 2 instances.
What would be the best approach to resolve the above scenario without any downtime?

If your CouchDB A & CouchDB B are in the same data centre, then @Flimzy's suggestion of using CouchDB 2.0 in a clustered deployment is a good one. You can have n CouchDB nodes configured in a cluster with a load balancer sitting above the cluster, delivering HTTP(S) traffic to any node that is "up".
If A & B are geographically separated, you can use CouchDB Replication to move data from A-->B and B-->A, which keeps both instances in sync. A & B could each be clusters of 3 or more CouchDB 2.0 nodes, or single instances of CouchDB 1.7.
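For illustration, a minimal sketch (Node.js 18+, global fetch) of setting up continuous two-way replication through the _replicator database; hostnames, database name, and credentials are placeholders, not details from this answer:

    // Each document in _replicator describes one replication job.
    const HEADERS = {
      Authorization: 'Basic ' + Buffer.from('admin:secret').toString('base64'),
      'Content-Type': 'application/json',
    };

    async function replicate(source, target) {
      await fetch('http://couch-a:5984/_replicator', {
        method: 'POST',
        headers: HEADERS,
        body: JSON.stringify({ source, target, continuous: true }),
      });
    }

    (async () => {
      // A --> B and B --> A: both instances converge on the same revisions.
      await replicate('http://couch-a:5984/mydb', 'http://couch-b:5984/mydb');
      await replicate('http://couch-b:5984/mydb', 'http://couch-a:5984/mydb');
    })();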
None of these solutions will "fix" the problem you are seeing when two copies of the database are modified in different ways at the same time. This "conflict" state is CouchDB's way of preventing data loss when two writes clash. Your app can resolve the conflict by picking a winning revision or writing a new one. It's not a fault condition; it's helping your application avoid data loss during concurrent writes in a distributed system.
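As a hedged sketch of what that app-side resolution can look like (the URL and the trivial "keep the current winner" strategy are assumptions, not this answer's prescription): fetch the document together with its conflicts, then delete the losing revisions.

    const DB = 'http://localhost:5984/mydb';

    async function resolveConflict(id) {
      // ?conflicts=true adds a _conflicts array listing the losing revisions.
      const doc = await (await fetch(`${DB}/${id}?conflicts=true`)).json();
      for (const rev of doc._conflicts || []) {
        // Deleting each conflicting revision leaves the current winner in
        // place; alternatively, merge the bodies and PUT a new winner first.
        await fetch(`${DB}/${id}?rev=${rev}`, { method: 'DELETE' });
      }
    }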
You can read more about document conflicts in this blog post series.

If both of your 1.6.x nodes are syncing databases using standard replication, turning off one node shouldn't be an issue. When the node comes back up it receives all updates without conflicts, because there was no way to create them while the node was down.
If you experience conflicts during normal operation, unfortunately there is no general way to resolve them automatically. However, in most cases you can find a strategy of marking affected document subtrees in a way that allows you to determine which version is most recent (or more important).
To detect docs that have conflicts you can use standard views: a doc passed to a view function has the _conflicts property if conflicting revisions exist. Using an appropriate view you can detect conflicts and merge docs. Regardless of how you detect conflicts, you need external code to resolve them.
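For example, such a conflicts view might look like the design document below (names are illustrative); the map function emits a row for every document that currently carries conflicting revisions:

    {
      "_id": "_design/conflicts",
      "views": {
        "all": {
          "map": "function(doc) { if (doc._conflicts) { emit(doc._id, doc._conflicts); } }"
        }
      }
    }

Querying /mydb/_design/conflicts/_view/all then lists every conflicted document, ready for your external resolution code.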
If your conflicting data is numeric by nature, consider using CRDT structures and standard map/reduce to obtain the final value. If your data is text-like you may also try CRDTs, but to obtain reasonable performance you will need reducers written in Erlang.
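For the numeric case, a sketch of the CRDT idea (document shapes are invented for illustration): each change is stored as its own small delta document, so concurrent writers never update the same doc and conflicts cannot arise; the built-in _sum reducer folds the deltas into the final value.

    // Each write creates a new delta document instead of updating a shared one:
    // { "type": "increment", "counter_id": "visits", "delta": 3 }

    {
      "_id": "_design/counter",
      "views": {
        "value": {
          "map": "function(doc) { if (doc.type === 'increment') { emit(doc.counter_id, doc.delta); } }",
          "reduce": "_sum"
        }
      }
    }

Querying the view with ?group=true returns the current total per counter; _sum is one of the built-in Erlang reducers, so it stays fast.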
As for 2.x: I do not recommend using 2.x for your case (actually, for any real case except experiments). First, using 2.x will not remove conflicts, so it does not solve your problem. Also, taking into account that 2.x requires a lot of poorly documented manual operations across nodes and is unable to rebalance, you will get more pain than value.
BTW, using any cluster solution makes very little sense for two nodes.
As for the above-mentioned CVE-2017-12635 and CouchDB 1.6.x: you can use this patch https://markmail.org/message/kunbxk7ppzoehih6 to cover the vulnerability.

Related

Shopware 6 partitioning

Has anyone had any experience with database partitioning? We already have a lot of data, and queries on it are starting to slow down. Maybe someone has some examples? These are tables related to orders.
Shopware, since version 6.4.12.0, allows the use of database clusters; see the relevant documentation. You will have to set up a number of read-only nodes first. The load of reading data will then be distributed among the read-only nodes, while write operations are restricted to the primary node.
Note that in a cluster setup you should also use a lock storage that complements the setup.
Besides using a DB cluster, you can also try to reduce the load on the DB server.
The first thing you should do is enable the HTTP cache; better yet, additionally set up a reverse proxy cache like Varnish. This will greatly decrease the number of requests that hit your web server, and thus your DB server as well.
Besides that, all the measures explained here should improve the overall performance of your shop as well as decrease the load on the DB.
Additionally, you could use Elasticsearch so that costly search requests won't hit the database, use a "real" message queue so that messages are not stored in the database, and use Redis instead of the database for storing performance-critical information, as documented in the articles in this category of the official docs.
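As a rough illustration only: Shopware 6 builds on Symfony Messenger, which reads the standard MESSENGER_TRANSPORT_DSN variable, so moving the queue off the database might look like the placeholder below; check the official docs for the exact configuration your version supports.

    # Hypothetical .env sketch; the DSN value is a placeholder.
    MESSENGER_TRANSPORT_DSN=amqp://guest:guest@rabbitmq:5672/%2f/messages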
The impact of all those measures probably depends on your concrete project setup, so maybe you'll see something in the DB locks that hints at one of the points I mentioned previously; that would be an indicator to start in that direction. E.g. if you see a lot of search-related queries, Elasticsearch would be a great start, but if you see a lot of DB load coming from writing/reading/deleting messages, then the message queue might be a better starting point.
All in all, when you use a DB cluster with a primary and multiple replicas and use the additional services I mentioned here, your shop should be able to scale quite well without the need for partitioning the actual DB.

PouchDB/CouchDB Conflict Resolution Server Side

I'm new to pouch/couch and looking for some guidance on handling conflicts. Specifically, I have an extension running PouchDB (distributed to two users). The idea is then to have a pouchdb-server or CouchDB instance (does it matter for this small a use case?) running remotely. The crux of my concern is handling conflicts: the data will be changing frequently, and though the extensions won't be doing live sync, they will be syncing very often. I have conflict handling written into the data submission functions, but there could still be conflicts when syncing occurs with multiple users.
I was looking at the pouch-resolve-conflicts plugin and see immediately the author state:
"Conflict resolution should better be done server side to avoid hard to debug loops when multiple clients resolves conflicts on the same documents".
This makes sense to me, but I am unsure how to implement such conflict resolution. The only way I can think of would be to place a REST API layer in front of the remote database that handles all updates/conflicts etc. with custom logic. But then how could I use the pouch sync functionality? At that point I may as well just use a different database.
I've just been unable to find any resources discussing how to implement conflict resolution server-side, in fact the opposite.
With your use case, you could probably write to a local PouchDB instance and sync it with the master database. Then you could have a daemon that automatically resolves conflicts on your master database.
Below is my approach to solving a similar problem.
I made a NodeJS daemon that automatically resolves conflicts. It integrates deconflict, a NodeJS library that allows you to resolve a document in three ways:
Merge all revisions together
Keep the latest revision (based on a custom key, e.g. updated_at)
Pick a certain revision (here you can use your own logic)
Revision deconflict
The way I use CouchDB, every write is partial: we always take some changes and apply them to the latest document. With this approach, we can easily take the merge-all-revisions strategy.
Conflict scanner
When the daemon boots, two processes are executed. One goes through all the changes; if a conflict is detected, the document is added to a conflict queue.
Another process is executed and remains active: the continuous changes scanner.
It listens to all new changes and adds conflicted documents to the conflict queue.
Queue processing
Another process is started and keeps polling the queue for new conflicted documents. It fetches conflicted documents in batches and resolves them one by one. If there are no documents, it waits a certain period and then starts polling again.
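A condensed sketch of those two processes (Node.js 18+, global fetch; the database URL, credentials, and the 5-second polling interval are placeholders, and a simple "keep the winner" strategy stands in for deconflict's richer options):

    const DB = 'http://localhost:5984/mydb';
    const HEADERS = {
      Authorization: 'Basic ' + Buffer.from('admin:secret').toString('base64'),
    };

    const queue = [];
    let since = 0;

    // Changes scanner: ask the _changes feed to include docs and their
    // _conflicts, and queue every conflicted document it reports.
    async function scan() {
      const res = await fetch(
        `${DB}/_changes?since=${since}&include_docs=true&conflicts=true`,
        { headers: HEADERS });
      const { results, last_seq } = await res.json();
      for (const row of results) {
        if (row.doc && row.doc._conflicts) queue.push(row.id);
      }
      since = last_seq;
    }

    // Queue processing: resolve conflicted docs one by one. The simplest
    // strategy keeps CouchDB's winning revision and deletes the losers.
    async function processQueue() {
      while (queue.length) {
        const id = queue.shift();
        const doc = await (await fetch(`${DB}/${id}?conflicts=true`,
                                       { headers: HEADERS })).json();
        for (const rev of doc._conflicts || []) {
          await fetch(`${DB}/${id}?rev=${rev}`,
                      { method: 'DELETE', headers: HEADERS });
        }
      }
    }

    setInterval(() => scan().then(processQueue).catch(console.error), 5000);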
Having worked a little bit with Redux, I realized that the same concept of unidirectional flow would help me avoid the problem of conflicts altogether.
Redux flows like this: the view dispatches actions, a single reducer derives the new state, and the store pushes the updated state back to the view.
So my client-side code never writes definitive data to the master database; instead it writes insert/update/delete requests locally, which PouchDB then pushes to the CouchDB master database. On the same server as the master CouchDB, I have PouchDB in NodeJS replicating these requests. "Supervisor" software in NodeJS examines each new request, changes its status to "processing", writes the requested updates, inserts, and deletes, then marks the request "processed". To ensure they're processed one at a time, the code that receives each request stuffs them into a FIFO; the processing code pulls them from the other end.
I'm not dealing with super high volume, so the latency is not a concern.
I'm also not facing a situation where numerous people might be trying to update exactly the same record at the same time. If that's your situation, your client-side update requests will need to specify the rev number, and your "supervisors" will need to reject change requests that refer to a superseded version. You'll have to figure out how your client code would get and respond to those rejections.
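A rough sketch of that supervisor loop, using the nano CouchDB client; all field names (status, target_id, expected_rev, changes) are invented for illustration, and wiring the changes feed into the FIFO is omitted:

    const nano = require('nano')('http://admin:secret@localhost:5984');
    const db = nano.db.use('mydb');
    const fifo = [];   // request docs are pushed here as they replicate in

    async function processRequest(req) {
      req.status = 'processing';
      req._rev = (await db.insert(req)).rev;       // claim the request
      const target = await db.get(req.target_id);
      if (req.expected_rev && req.expected_rev !== target._rev) {
        req.status = 'rejected';                   // superseded version
      } else {
        Object.assign(target, req.changes);
        await db.insert(target);                   // apply the update
        req.status = 'processed';
      }
      await db.insert(req);
    }

    // Drain the FIFO strictly one request at a time so updates serialize.
    async function drain() {
      while (fifo.length) await processRequest(fifo.shift());
      setTimeout(drain, 1000);
    }
    drain();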

How to keep CouchDB efficiently with a lot of DELETE, purge?

I have a CouchDB database with ~2000 documents (50 MB), but 150K deleted documents accumulated in 3 months, and this number will keep increasing.
So, what is the best strategy to keep the performance high?
Use purge + compact, or periodically re-create the entire database?
The CouchDB documentation recommends re-creating the database when storing short-term data. That isn't my case, but deletes are constant for some kinds of documents.
DELETE operation
If your use case creates lots of deleted documents (for example, if you are storing short-term data like log entries, message queues, etc), you might want to periodically switch to a new database and delete the old one (once the entries in it have all expired).
Using Apache CouchDB v. 2.1.1
The purge operation is not implemented at cluster level in the CouchDB 2.x series (from 2.0.0 to 2.2.0), so it doesn't seem to be an option in your case.
It seems this will be supported in the next release, 2.3.0. You can check the related issue here.
The same issue includes a possible workaround based on a database-switch approach, described here.
In your case, with Apache CouchDB 2.1.1, the database switch is the only viable option.
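A sketch of what the switch might look like (database names, the filter, and credentials are illustrative): store a replication filter that drops tombstones, run a one-shot replication into a fresh database, then repoint the application.

    // 1. Store a filter that skips deleted documents (tombstones):
    //    PUT /mydb/_design/repl
    //    { "filters": { "no_deleted": "function(doc, req) { return !doc._deleted; }" } }

    // 2. One-shot replication into a fresh database copies only live docs
    //    (Node.js 18+, global fetch):
    const BASE = 'http://localhost:5984';
    const HEADERS = {
      Authorization: 'Basic ' + Buffer.from('admin:secret').toString('base64'),
      'Content-Type': 'application/json',
    };

    (async () => {
      await fetch(`${BASE}/mydb_v2`, { method: 'PUT', headers: HEADERS });
      await fetch(`${BASE}/_replicate`, {
        method: 'POST',
        headers: HEADERS,
        body: JSON.stringify({
          source: `${BASE}/mydb`,
          target: `${BASE}/mydb_v2`,
          filter: 'repl/no_deleted',
        }),
      });
      // 3. Point the application at mydb_v2, then DELETE /mydb.
    })();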

keeping track of history in graph databases

I am investigating the use of a graph database (like Neo4j - mainly because I need the python bindings) for modeling a real physical network. However, one of the requirements is to be able to track the history of where machines were, the state of network ports etc.
In a relational database, I can quite easily create an 'archive' table that I can use to do historical queries; however, I've been bitten many times by the issues of fixed table schemas and rather awkward left joins all over the place.
Does anyone have any suggestions on how best to maintain the historical relations and node properties in a graph database?
Depending on the number of nodes, you might be able to take snapshots of the graph network. Then index each node so that you can query it in each revision of the network.
You could also try versioning each node. Each time a node or one of its edges changes, copy the node with references to the current version of each node it connects to, then bump the version number of the node you just modified.
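A hypothetical sketch of that copy-on-write versioning in Cypher, via the official neo4j-driver for Node.js (the label, properties, and PREVIOUS_VERSION relationship are all invented; the same idea ports directly to the Python bindings):

    const neo4j = require('neo4j-driver');
    const driver = neo4j.driver('bolt://localhost:7687',
                                neo4j.auth.basic('neo4j', 'secret'));

    async function updateMachine(id, newProps) {
      const session = driver.session();
      try {
        await session.run(
          // Match the latest version (no newer version points back at it),
          // create the new version, and link it to its predecessor.
          `MATCH (m:Machine {id: $id})
           WHERE NOT (m)<-[:PREVIOUS_VERSION]-()
           CREATE (n:Machine {id: $id, version: m.version + 1})
           SET n += $newProps
           CREATE (n)-[:PREVIOUS_VERSION]->(m)`,
          { id, newProps }
        );
      } finally {
        await session.close();
      }
    }

Historical queries then walk the PREVIOUS_VERSION chain instead of joining archive tables.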
Since Neo4j is based on the file system, you can easily keep versions of your graph database via Git, then go back and forth between versions to see how the graph looked.
I know that Sones provides version control within the database.
"... place them under version control and administer various editions ..." Link

do couchdb views replicate?

I don't mean the view sources stored in _design docs (those replicate, since they're just docs). What I mean is: do the view results (the computed B-trees) replicate as well, or do just regular documents replicate (which is how I understand it right now)?
The problematic scenario is:
there's a spike in traffic and I want to bring a temporary server up, replicating a portion of the dataset onto that new server. The views for those (to-be-replicated) docs have already been computed on the old server, and so don't need to be recomputed on the new server... so I want those old computed results to be transferred along with that portion of the docs.
Another scenario is to use a backend cluster to compute complex views, and then replicate those results onto a bunch of front-end servers that are actually hit by user requests.
As Till said, the results are not replicated. For more detail, you actually don't want them to be replicated. The general CouchDB paradigm that you should remember is that each installation is treated as an independent node - that's why _id, _rev, and sequence numbers are so important. This allows each node to work without taking any other node into consideration: if one of your nodes goes down, all of the others will continue to crank away without a care in the world.
Of course, this introduces new considerations around consistency that you might not be used to. For example, if you have multiple web servers that each has its own CouchDB node on it, and those nodes run replication between themselves so that each instance stays up to date, there will be a lag between the nodes. Here's an example flow:
User writes a change to web server A.
User makes a read request to web server B, because your load balancer decided that B was the better choice. The user gets their result.
Web server A sends the updated doc to web server B via replication.
As you can see, the user got the previous version of their document because web server B didn't know about the change yet. This can be mitigated with...
Sticky sessions, so that all of a user's reads and writes go to the same server. This could just end up defeating your load balancer.
Moving the CouchDB nodes off of the web servers and onto their own boxes. If you go with this then you probably want to take a look at the couchdb-lounge project (http://tilgovi.github.com/couchdb-lounge/).
Do your users really care if they get stale results? Your use case might be one where your users won't notice whether their results don't reflect the change that they just made. Make sure you're really getting a marked value out of this work.
Cheers.
The computed result is not replicated.
Here are some additional thoughts though:
When you partition your server and bring up a second server with it, how do you distribute reads/writes and combine view results? This setup requires a proxy of some sort; I suggest you look into CouchDB-Lounge.
If you're doing master-master, you could keep the servers in sync using DRBD. It's been proven to work with MySQL master-master replication, and I don't see why it would not work here. This would also imply that the computed result is automatically in sync on both servers.
Let me know if this helps!
