Surprising behavior with replicated, deleted CouchDb documents - couchdb

We have two CouchDB servers, let's call them A and B. There's one-way replication from A to B, and documents are only created, modified, or deleted on A - basically you can think of B as just a backup. A document on A was deleted. When I tried to retrieve the revision prior to deletion from A, I got {"error":"not_found","reason":"missing"}, even though that database hasn't been compacted (as I understand it, compaction only happens if you start it manually, and that wasn't done). However, while B knew the document had been deleted, the old revision was still available on B.
My understanding is that if we haven't manually run compaction, the old revision should always be available on A. Furthermore, when B replicates, if there were multiple revisions since the last replication, it'll pull metadata for the old revisions but might not pull their contents. Thus, in this setup, the set of revisions available on B should always be a subset of those available on A. So how could B have a revision that A does not?
We're on CouchDB 2.3.0.
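For concreteness, here is roughly the kind of check involved; the host names, database name, document ID, and revision IDs below are placeholders, not the actual values:

# On A, fetching the pre-deletion revision fails:
curl "http://server-a:5984/db/some-doc?rev=2-abc"
# -> {"error":"not_found","reason":"missing"}
# On B, the same revision is still readable:
curl "http://server-b:5984/db/some-doc?rev=2-abc"
# revs_info on the deletion tombstone shows which ancestor revisions each server still considers "available":
curl "http://server-a:5984/db/some-doc?rev=3-def&revs_info=true"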

Related

How to re-replicate documents after they have been purged in remote database

I have an application where the documents available in a remote database are a subset of documents available on the server. When the subset required by the user changes, documents that are no longer needed in the remote database are purged (yes, purged, not deleted) and new documents replicated. If the subset required by the user was changed to include documents that have been previously purged, I can't find a way to make the purged documents replicate again to reinstate them on the client.
A simple scenario to consider is:
Create two databases, A and B
Create a document "D" in A
Replicate database A to B
In B, purge D
Replicate A to B again and notice that D is not replicated
I've tried compacting B, to no avail. I can understand that with continuous replication, D will not be sent again because it has not changed. But I can't get D to be re-replicated using one-time replication either. How can I make a replication copy D from A to B once CouchDB is in this state?
I'm using CouchDB 2.3.
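For reference, a rough curl reproduction of the scenario above on a single node; the admin credentials, port, and the revision ID passed to _purge are assumptions, not values from the original setup:

# Create the two databases and the document
curl -X PUT http://admin:pass@localhost:5984/a
curl -X PUT http://admin:pass@localhost:5984/b
curl -X PUT http://admin:pass@localhost:5984/a/D -H "Content-Type: application/json" -d '{"value":1}'
# Replicate A -> B
curl -X POST http://admin:pass@localhost:5984/_replicate -H "Content-Type: application/json" \
  -d '{"source":"http://admin:pass@localhost:5984/a","target":"http://admin:pass@localhost:5984/b"}'
# Purge D in B (the purge body maps the doc ID to the revisions to purge)
curl -X POST http://admin:pass@localhost:5984/b/_purge -H "Content-Type: application/json" -d '{"D":["1-xxxx"]}'
# Replicate A -> B again: D does not come back
curl -X POST http://admin:pass@localhost:5984/_replicate -H "Content-Type: application/json" \
  -d '{"source":"http://admin:pass@localhost:5984/a","target":"http://admin:pass@localhost:5984/b"}'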
When a replication runs, CouchDB stores a local replication log (a checkpoint) on each node.
It's probably fetching this log and picking up where it left off, thus ignoring changes that happened before the last replication (such as the creation of documents which have since been purged).
I can think of two solutions to this (both sketched below):
Manually delete these replication logs by looking for _local/ documents and deleting them.
Change the replication parameters, even slightly, so that CouchDB generates a new replication ID for its logging. One way to do this is to add a filter function (it can be a filter function that filters out nothing).
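A minimal sketch of both options with curl; the credentials, database names, checkpoint ID, and revision are placeholders (the real _local ID is the replication ID CouchDB computed):

# Option 1: list and delete the replication checkpoint documents (on both source and target)
curl "http://admin:pass@localhost:5984/b/_local_docs"
curl -X DELETE "http://admin:pass@localhost:5984/b/_local/<replication-id>?rev=0-1"
# Option 2: add a pass-everything filter so the replication ID changes
curl -X PUT http://admin:pass@localhost:5984/a/_design/repl -H "Content-Type: application/json" \
  -d '{"filters":{"all":"function(doc, req) { return true; }"}}'
curl -X POST http://admin:pass@localhost:5984/_replicate -H "Content-Type: application/json" \
  -d '{"source":"http://admin:pass@localhost:5984/a","target":"http://admin:pass@localhost:5984/b","filter":"repl/all"}'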

CouchDB with continuous replication reverts document revision instead of deleting

We have a system that uses CouchDB as its database.
We are using continuous replication to create an always-updated copy of our database.
Recently we have discovered a strange behavior (maybe bug?) that I hope someone here could help me with:
We set the system with normal replication (NOT filtered).
We update the same document several times consecutively (each time waiting for CouchDB to return 200 OK) - this part works fine and the document appears correctly updated in the replicated DB.
However, when we try to delete this document, even minutes after the consecutive updates, it is not deleted in the replicated DB and instead just reverts to a revision from before the consecutive updates.
It is important to note that we delete by adding a _deleted field set to true.
I understand there is some problem with deletion using HTTP DELETE combined with filtered replication, but we're not using either.
Also, doing the same updates but waiting a second between one and the next avoids the problem entirely (as does combining them into a single update).
However, neither workaround is possible for us, and in any case they only sidestep the problem.
tl;dr:
1) CouchDB with normal continuous replication
2) Consecutive updates to document
3) _deleted = true written to the document
4) Replicated DB does not delete, instead reverts to _rev before #2
Environment:
CouchDB version is 1.6.1
Windows computer
Using CouchDB-Lucene
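For reference, the update-then-delete sequence looks roughly like this (URL, document ID, and revision IDs are placeholders):

# Several consecutive updates, each supplying the current revision
curl -X PUT http://localhost:5984/db/doc -H "Content-Type: application/json" -d '{"_rev":"1-aaa","value":1}'
curl -X PUT http://localhost:5984/db/doc -H "Content-Type: application/json" -d '{"_rev":"2-bbb","value":2}'
curl -X PUT http://localhost:5984/db/doc -H "Content-Type: application/json" -d '{"_rev":"3-ccc","value":3}'
# Then the "delete": a write whose body sets _deleted to true
curl -X PUT http://localhost:5984/db/doc -H "Content-Type: application/json" -d '{"_rev":"4-ddd","_deleted":true}'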
Most likely you have introduced some conflicts in the documents. When a document is being edited in several replicas, CouchDB chooses a winning revision when replicating, but also keeps the losing revisions. If you delete the winning revision, the losing revision will be displayed again. You can read an introduction in the (now somewhat outdated) CouchDB Guide: http://guide.couchdb.org/draft/conflicts.html and in the CouchDB Docs: http://docs.couchdb.org/en/1.6.1/replication/conflicts.html
But in short, the replication target database might have been edited by someone. It might be that you replicated several databases into one, or that somebody edited the documents manually in the target database.
You can delete the target database and recreate it as an empty db. If you don't edit the target db by hand and don't replicate multiple dbs into one, _deletes will be replicated correctly from then on.
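One way to check whether the target database really has conflicts is to ask for them explicitly; a sketch with curl (host, database, and document names are placeholders):

# Show the losing revisions, if any, alongside the winner
curl "http://localhost:5984/target_db/doc?conflicts=true"
# Fetch every leaf revision, winners and losers alike
curl -H "Accept: application/json" "http://localhost:5984/target_db/doc?open_revs=all"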
Problem solved.
It was the revision limit.
It seems that quickly making more changes than the revision limit allows causes problems for the replication mechanism.
There is an unsolved bug in CouchDB about this issue:
https://issues.apache.org/jira/browse/COUCHDB-1649
Since the revision limit we had was 2, doing 3 consecutive updates to the same document and then deleting it caused this problem.
Setting the revision limit to 5 avoids it.
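Checking and raising the limit is a one-liner against the database (host and database name are placeholders):

curl http://localhost:5984/db/_revs_limit
# -> 2 in our case (the default is 1000)
curl -X PUT -d "5" http://localhost:5984/db/_revs_limit
# -> {"ok":true}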

How does CouchDB compaction affect /db/_changes?

I have an application that pulls from CouchDB, from the first doc to the latest one, batch by batch.
I tried compacting my database from 1.7 GB down to 1.0 GB, and /db/_changes seems the same.
Can anyone please clarify if CouchDB compaction affects /db/_changes ?
All compaction does is remove old document revisions in a given database. The changes feed deals exclusively with write operations, which are unaffected by compaction (since those writes have already happened).
Now, it should be noted that the changes feed will give you the rev numbers as well. Upon compaction, the contents of all but the most recent rev are deleted, so those entries in the changes feed will have "dead" links, so to speak.
See the docs for more information about compaction.
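A quick way to see this for yourself (the database name is a placeholder; compaction requires admin rights):

# Changes feed before compaction
curl "http://localhost:5984/db/_changes?limit=5"
# Trigger compaction; it returns immediately and runs in the background
curl -X POST -H "Content-Type: application/json" http://localhost:5984/db/_compact
# The same sequence entries are still listed afterwards; only old revision bodies are gone
curl "http://localhost:5984/db/_changes?limit=5"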

How many document revisions are kept in CouchDB / Cloudant, and for how long?

In CouchDB and Cloudant, when documents are changed, the database holds on to previous versions. What gets kept, and for how long?
Cloudant and CouchDB keep the document's metadata forever (id, rev, deleted and conflict). Document contents are deleted during compaction (automatic in Cloudant, manual in CouchDB), with one exception: in the case of a conflict, we'll keep the document contents until the conflict is resolved.
For each document, we keep the last X revisions, where X is the number returned by {username}.cloudant.com/{db}/_revs_limit, defaulting to 1000. Revisions older than the last 1000 get dropped. You can change _revs_limit by making a PUT request with a new value to that endpoint. For example:
curl -X PUT -d "1500" https://username.cloudant.com/test/_revs_limit
So, if a document is replicated to two nodes, edited 1001 times on node A, and then replicated again to node B, it will generate a conflict on node B (because we've lost the information necessary to join the old and new edit paths together).
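For completeness, the current limit can also be read back with a GET instead of a PUT (same placeholder database as above, assuming the endpoint supports GET as it does in CouchDB):

curl https://username.cloudant.com/test/_revs_limit
# -> 1000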

In the CouchDB _changes response, why are the "changes" elements arrays?

The response from CouchDB to a _changes request comes back in this format:
{"seq":12,"id":"foo","changes":[{"rev":"1-23202479633c2b380f79507a776743d5"}]}
My question - why is the "changes" element an array? What scenario would return more than one item in the changes element? I have never seen an example online with more than one item, and in my own experience I have only seen one item.
I'm writing code that interacts with changes, and I'd like to understand what to do if, in fact, there were more than one item.
Thanks,
Mike
The changes element is an array so it can reflect all existing revision leaves for the document. As you know, CouchDB doesn't delete a document completely; it sets a tombstone instead, to prevent the document from being accidentally resurrected by replication from a source that still holds an older revision that wasn't yet deleted. It's also possible to have multiple leaves due to update conflicts that occur after replication. For example:
Mike creates a document in database A and replicates it to database B:
{"results":[
{"seq":1,"id":"thing","changes":[{"rev":"1-967a00dff5e02add41819138abb3284d"}]}
],
"last_seq":1}
John receives the document and updates it in database B:
{"results":[
{"seq":2,"id":"thing","changes":[{"rev":"2-7051cbe5c8faecd085a3fa619e6e6337"}]}
],
"last_seq":2}
But at the same time, Mike also makes a few changes to it in database A (say, he forgot to clean up some data or to add something important):
{"results":[
{"seq":2,"id":"thing","changes":[{"rev":"2-13839535feb250d3d8290998b8af17c3"}]}
],
"last_seq":2}
He then replicates it to database B again. John receives the document in a conflicted state, and by looking at the changes feed with the query parameter style=all_docs he sees the following result:
{"results":[
{"seq":3,"id":"thing","changes":[{"rev":"2-7051cbe5c8faecd085a3fa619e6e6337"},{"rev":"2-13839535feb250d3d8290998b8af17c3"}]}
],
"last_seq":3}
While direct access to the document returns the data from the winning revision (the one CouchDB deterministically picks so that all replicas agree on the same winner), the document can carry many conflicting revisions (imagine concurrent writes to a single document across a dozen databases all replicating with each other).
Now John decides to resolve this conflict by updating the winning revision, expecting the other one to be dropped:
{"results":[
{"seq":4,"id":"thing","changes":[{"rev":"3-2502757951d6d7f61ccf48fa54b7e13c"},{"rev":"2-13839535feb250d3d8290998b8af17c3"}]}
],
"last_seq":4}
Wait, Mike's revision is still there? Why? In a panic, John deletes his document:
{"results":[
{"seq":5,"id":"thing","changes":[{"rev":"2-13839535feb250d3d8290998b8af17c3"}{"rev":"4-149c48caacb32c535ee201b6f02b027b"}]}
],
"last_seq":5}
Now his version of the document is deleted, but he can still access Mike's.
Replicating John's changes from database B to database A brings over the tombstone as well:
{"results":[
{"seq":3,"id":"thing","changes":[{"rev":"3-2adcbbf57013d8634c2362630697aab6"},{"rev":"4-149c48caacb32c535ee201b6f02b027b"}]}
],
"last_seq":3}
Why so? Because this is the document's history, the record of how its data evolved: in the real world your documents may have many intermediate leaves distributed among a large number of databases, and to prevent silent data overwrites during replication, CouchDB keeps every leaf to help resolve such conflicts. A longer and probably better explanation can be found in the CouchDB wiki pages about replication and conflicts; the changes feed query parameters are described there as well.
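For completeness, this is roughly how the feeds and documents above would be requested (host and database names are placeholders):

# Changes feed showing only the winning leaf per document
curl "http://localhost:5984/b/_changes"
# Changes feed listing every leaf, as in the examples above
curl "http://localhost:5984/b/_changes?style=all_docs"
# Fetch every leaf revision of the conflicted document, including tombstones
curl -H "Accept: application/json" "http://localhost:5984/b/thing?open_revs=all"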

Resources