CouchDB with continuous replication reverts document revision instead of deleting - couchdb

We have a system that uses CouchDB as its database.
We are using continuous replication to create an always-updated copy of our database.
Recently we have discovered a strange behavior (maybe bug?) that I hope someone here could help me with:
We set the system with normal replication (NOT filtered).
We update the same document several times consecutively (each time waiting for CouchDB to return 200ok) - this part works fine and the document appears to be updated just fine in the replicated DB.
However, when we try to delete this document, even minutes after the consecutive updates, it is not deleted in the replication DB and instead just reverts to a revision before the consecutive updates.
It is important to note that we delete by adding a _deleted field set to true
I understand there is some problem with deletion using HTTP DELETE combined with filtered replication, but we're not using either.
Also, doing the same updates and just waiting a second between one and the other solves the problem just fine (or just combining them to one update).
However, neither workaround is possible for us, and in any case both just sidestep the problem.
tl;dr:
1) CouchDB with normal continuous replication
2) Consecutive updates to document
3) _deleted = true set on the document
4) Replicated DB does not delete, instead reverts to _rev before #2
Environment:
CouchDB version is 1.6.1
Windows computer
Using CouchDB-Lucene
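For reference, the deletion step described above (a PUT whose body carries _deleted set to true) could look roughly like this. It is only a sketch: the host, database name, document ID and revision are made-up placeholders, and the request is built but never sent.

```python
import json
import urllib.request
from urllib.parse import quote


def deletion_request(base_url, db, doc):
    """Build (but do not send) a PUT that marks `doc` as deleted."""
    body = dict(doc, _deleted=True)  # same document, plus the _deleted flag
    url = f"{base_url}/{quote(db, safe='')}/{quote(doc['_id'], safe='')}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder values, not taken from the real system.
req = deletion_request(
    "http://localhost:5984", "mydb", {"_id": "doc1", "_rev": "4-abc", "value": 3}
)
```

Passing the request to urllib.request.urlopen would actually send it; the _rev in the body has to be the current revision of the document.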

Most likely you have introduced some conflicts in the documents. When a document is being edited in several replicas, CouchDB chooses a winning revision when replicating, but also keeps the losing revisions. If you delete the winning revision, the losing revision will be displayed again. You can read an introduction in the (now somewhat outdated) CouchDB Guide: http://guide.couchdb.org/draft/conflicts.html and in the CouchDB Docs: http://docs.couchdb.org/en/1.6.1/replication/conflicts.html
But in short, the replication database might have been edited by someone. It might be that you replicated several databases into one, or somebody edited the documents manually in the target database.
You can delete the target database and recreate an empty db. If you don't edit the target db by hand and don't replicate multiple dbs into one, _deletes will be replicated correctly from then.
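One way to check for such conflicts is to fetch the document with the conflicts=true query parameter and look at the _conflicts field in the response. A minimal sketch, where the host, database and document below are hypothetical and the sample response is hand-written rather than real output:

```python
from urllib.parse import quote


def conflicts_url(base_url, db, doc_id):
    """URL that asks CouchDB to include conflicting revisions in the response."""
    return f"{base_url}/{quote(db, safe='')}/{quote(doc_id, safe='')}?conflicts=true"


def conflicting_revs(doc):
    """Return the losing revisions listed in a fetched document, if any."""
    return list(doc.get("_conflicts", []))


url = conflicts_url("http://localhost:5984", "replica_db", "doc1")

# Hand-written example of what a conflicted document might look like.
sample = {"_id": "doc1", "_rev": "2-aaa", "_conflicts": ["2-bbb"]}
losers = conflicting_revs(sample)
```

If the list is non-empty on the target, deleting only the winning revision will make one of the listed revisions visible again.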

Problem solved.
It was the revision limit.
It seems that quickly making more changes than the revision limit allows causes problems for the replication mechanism.
There is an unsolved bug in CouchDB about this issue:
https://issues.apache.org/jira/browse/COUCHDB-1649
Since the revision limit we had was 2, doing 3 consecutive updates to the same document and then deleting it caused this problem.
Setting the revision limit to 5 avoids it.
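The revision limit is set per database with a PUT to /{db}/_revs_limit whose body is the new limit. A small sketch of building that request (the host and database name are placeholders, and nothing is sent):

```python
import urllib.request
from urllib.parse import quote


def revs_limit_request(base_url, db, limit):
    """Build (but do not send) a PUT that sets the database's revision limit."""
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return urllib.request.Request(
        f"{base_url}/{quote(db, safe='')}/_revs_limit",
        data=str(limit).encode("utf-8"),  # the body is just the number
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = revs_limit_request("http://localhost:5984", "mydb", 5)
```

The limit has to be raised on every database that takes part in the replication.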

Related

Deleting _deleted documents on CouchDB by date

My CouchDB database is getting bigger, and I would like to remove documents by date. I would also like to remove _deleted documents by date.
I know how to replicate my DB while removing documents by date, but:
Is there a way to do the same with _deleted documents? I mean, remove _deleted documents by date.
There's not really a way to conditionally cause a deletion using filtered replication, nor can you replicate a complete removal of a document.
You have a variety of options:
you can avoid replicating updates on old documents by filtering on date, but if they have already been replicated they won't be deleted
you can make a view to return old documents, and use a script to delete them at the source database. The deletions will replicate to any target databases, but all databases will retain at least a {_deleted:true} tombstone of the documents [that's how the deletion gets replicated in the first place]
you can find old documents and _purge them, but you'll have to do that on each replica
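The second and third options boil down to building two different request bodies from the same list of old documents. A rough sketch, assuming each document carries a hypothetical created_at field holding an ISO date (your documents may store the date differently):

```python
from datetime import date


def older_than(docs, cutoff):
    """Return the docs whose (hypothetical) created_at date is before cutoff."""
    return [d for d in docs if date.fromisoformat(d["created_at"]) < cutoff]


def bulk_delete_payload(docs):
    """Body for POST /{db}/_bulk_docs: one tombstone per document."""
    return {
        "docs": [{"_id": d["_id"], "_rev": d["_rev"], "_deleted": True} for d in docs]
    }


def purge_payload(docs):
    """Body for POST /{db}/_purge: document ID mapped to the revisions to purge."""
    return {d["_id"]: [d["_rev"]] for d in docs}


# Made-up documents for illustration.
docs = [
    {"_id": "a", "_rev": "1-x", "created_at": "2014-01-05"},
    {"_id": "b", "_rev": "3-y", "created_at": "2015-06-01"},
]
old = older_than(docs, date(2015, 1, 1))
```

The _bulk_docs body leaves tombstones that replicate; the _purge body removes the revisions outright and has to be sent to every replica separately.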
What is your main goal?
If you have hundreds of objects and you want to hide the old ones from the UI of all replicas, write a script to find and DELETE/_deleted:true them in a source/master replica and the changes will propagate.
If you have bazillions of e.g. log messages and you need to free up space by forgetting old ones, write a script to find and _purge and finally _compact, then run it on every replica. But for a case like that, it might be better to rotate databases instead, e.g. manually "shard" or bin into a different database each week, and every week simply drop the N+1 weeks old database on each replica.
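The database rotation idea could be sketched like this. The naming scheme (prefix, ISO year, ISO week) is just one possible convention, not something CouchDB prescribes, and date.fromisocalendar needs Python 3.8 or newer:

```python
from datetime import date, timedelta


def weekly_db_name(prefix, day):
    """Name of the database that holds documents written on `day`."""
    year, week, _ = day.isocalendar()
    return f"{prefix}-{year}-w{week:02d}"


def week_start(name):
    """Monday of the ISO week encoded in a name made by weekly_db_name."""
    _, year, week = name.rsplit("-", 2)
    return date.fromisocalendar(int(year), int(week[1:]), 1)


def expired_dbs(names, today, keep_weeks):
    """Databases older than the current week plus `keep_weeks` earlier weeks."""
    current_monday = today - timedelta(days=today.weekday())
    cutoff = current_monday - timedelta(weeks=keep_weeks)
    return [n for n in names if week_start(n) < cutoff]


names = ["logs-2014-w52", "logs-2015-w01", "logs-2015-w02", "logs-2015-w03"]
to_drop = expired_dbs(names, date(2015, 1, 14), keep_weeks=2)
```

Each name in to_drop would then be removed with a plain DELETE on the database, on every replica.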
If your database is getting bigger, this is probably due to the versioning of your documents. A simple way to free some space is to run database compaction (see the CouchDB documentation).
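To get a rough idea of how much compaction might free, you can compare the file size with the live data size in the database info returned by GET /{db}. This sketch assumes the disk_size and data_size fields as reported by CouchDB 1.x (other versions may name them differently), and the numbers are invented:

```python
def reclaimable_bytes(info):
    """Rough estimate of the space compaction could free, from GET /{db} info."""
    return max(info["disk_size"] - info["data_size"], 0)


# Invented numbers, shaped like a CouchDB 1.x database info document.
info = {"db_name": "mydb", "disk_size": 95_000_000, "data_size": 31_000_000}
estimate = reclaimable_bytes(info)
```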
As for _deleted documents, the only way to REALLY remove them is to purge them.
Purging is not recommended as routine cleanup, though; it should be reserved for removing sensitive data such as credentials.

Split large database using Lotus Domino replica

I have a 34 GB database (on server A), and I want to delete some of its documents to improve performance, after creating a replica of the database itself.
Followed these steps:
create a local replica of database
deleted several documents from original database
I want to be sure to recover deleted documents into original database, if needed, using replica database.
So I tried to pull into the database from the local replica, or to push from the replica to the database.
Nothing happened: 0 documents were added, and I'm not able to "re-import" the documents.
What's wrong?
They're not supposed to come back! Replication goes both ways, and the most recent change to a document overwrites an older version, but deletion always wins.
Well... almost always.
When a document is deleted in one replica, a 'deletion stub' is left in its place. As long as that stub exists in the replica, a version of that document in another replica will not replicate back. The stub blocks it. That's why deletion wins.
But stubs are purged after a period of time called the 'purge interval'. The default purge interval is 30 days. After a stub has been purged from a replica, deletion can't win any more because there is nothing left to block an old revision from replicating back from another replica. The thing is, usually this is a Bad Thing. Usually when documents are deleted, you want them to stay deleted. You don't want them to reappear just because somebody kept a replica off-line for 31 days.
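A toy model of the behaviour described above, in Python. This is not how Domino is implemented; the dictionaries, field names and day numbers are invented purely to show why a stub blocks an old copy until the stub itself is purged:

```python
def purge_stubs(replica, today, purge_interval_days=30):
    """Drop deletion stubs that are older than the purge interval."""
    return {
        doc_id: doc
        for doc_id, doc in replica.items()
        if not (doc["deleted"] and today - doc["modified"] > purge_interval_days)
    }


def pull(target, source):
    """Pull from source into target: stubs block, otherwise the newer copy wins."""
    merged = dict(target)
    for doc_id, theirs in source.items():
        mine = merged.get(doc_id)
        if mine is None:
            merged[doc_id] = theirs  # nothing here, so the document comes back
        elif mine["deleted"]:
            continue  # our stub blocks the incoming copy
        elif theirs["deleted"] or theirs["modified"] > mine["modified"]:
            merged[doc_id] = theirs
    return merged


production = {"a": {"deleted": True, "modified": 10}}  # stub left on day 10
archive = {"a": {"deleted": False, "modified": 1}}     # old live copy

early = pull(purge_stubs(production, today=20), archive)  # stub still present
late = pull(purge_stubs(production, today=45), archive)   # stub already purged
```

In the early case the document stays deleted; in the late case the old copy replicates back.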
Now, there are some ways that you can try and control this process carefully, purging stubs and using something else (e.g., selective replication settings) to prevent deletions from coming back except when you want them to. There are ways to try, but one slip up with one setting in one replica, and boom! Bad things happen. And that includes any replica, including ones that you are not controlling carefully. It's a bad idea. I agree completely with @Karl-Henry on this.
Also, selective replication is evil and should be avoided at all costs. That's just my opinion, anyhow, but I have a lot of scars left over from the days before I came to that conclusion.
Here are two Lotus tech notes about replica stubs and the purge interval: Purging documents in Lotus Notes, How to purge document deletion stubs immediately. Please use what you learn from these technotes wisely. I urge you not to use this knowledge to try and construct a replication-based backup/restore scheme!
I would be very careful using a replica as an archive like that. I could see someone replicating the wrong way, and that would cause some issues...
I have designed some archive solutions for several of my big databases here at work. I simply have a separate database (same design) designated as the archive. I then have a manually triggered or scheduled agent (different in different databases) that identifies the documents to be archived and moves them from the production database to the archive. I also have functions to move documents back into production if needed.

CouchDB delete and recreate a document

I'm trying to avoid revisions building up in my CouchDB, and also so I can use TouchDB's "bulk pull" for replication (it bulk-pulls on all 1st-revs.) Would it be bad practice to just delete a document, and recreate it rather than modifying it, in order for all documents to stay at rev-1?
Deleting a document in CouchDB will not reset the _rev.
CouchDB never deletes a document, it simply marks the last revision as deleted. Compaction will delete previous revisions, keeping only the last one. This is needed for replication to work properly. And this is why the deleted revision of a document should not contain any data, but only the _id of the document and the _deleted flag.
The only method to completely remove any traces of deleted documents, is to copy all documents to a new database. But keep in mind the consequences on replication.
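To illustrate the point about _rev: the number before the dash is the revision generation, and a delete followed by a recreate keeps counting up from the tombstone instead of starting over. The revision strings below are invented examples, not real output:

```python
def generation(rev):
    """Return the numeric generation prefix of a _rev such as '3-abc'."""
    return int(rev.split("-", 1)[0])


# Invented revisions for one document ID over its lifetime.
created = "1-aaa"    # first write
tombstone = "2-bbb"  # the delete is itself a new revision
recreated = "3-ccc"  # recreating continues from the tombstone

generations = [generation(r) for r in (created, tombstone, recreated)]
```

So a document that is deleted and recreated is not a first-generation revision any more.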
Well, I want to say that your proposal makes me feel dirty, but that wouldn't be an SO answer, so...
You mention TouchDB and bulk pull, so you have a mobile app with data which can be modified externally, and I assume the app wants to be able to modify its own data. So the biggest issue I can think of would be update conflict resolution, i.e. how do you handle changes to the document on both the client and the server while the client is offline? I think you'll start having to do a lot of the synchronisation work that Couch is meant to handle for you.

Replicating CouchDB to local couch reduces size - why?

I recently started using Couch for a large app I'm working on.
I have a database with 7907 documents, and wanted to rename the database. I poked around for a bit, but couldn't figure out how to rename it, so I figured I would just replicate it to a local database of the name I wanted.
The first time I tried, the replication failed, I believe the error was a timeout. I tried again, and it worked very quickly, which was a little disconcerting.
After the replication, I'm showing that the new database has the correct amount of records, but the database size is about 1/3 of the original.
Also a little odd is that if I refresh Futon, the size of the original fluctuates between 94.6 and 95.5 MB.
This leaves me with a few questions:
Is the 2nd database storing references to the first? If so, can I delete the first without causing harm?
Why would the size be so different? Had the original built indexes that the new one eventually will?
Why is the size fluctuating?
edit:
A few things that might be helpful:
This is on a cloudant couchdb install
I checked the first and last record of the new db, and they match, so I don't believe futon is underreporting.
Replicating to a new database is similar to compaction. Both involve certain side-effects (incidentally, and intentionally, respectively) which reduce the size of the new .couch file.
The b-tree indexes get balanced
Data from old document revisions is discarded.
Metadata from previous updates to the DB is discarded.
Replications store to/from checkpoints, so if you re-replicate from the same source, to the same location (i.e. re-run a replication that timed out), it will pick up where it left off.
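For what it's worth, a "rename by replication" like the one in the question is triggered with a POST to /_replicate. A minimal sketch of the request body (the database names are placeholders; create_target asks the server to create the target if it does not exist):

```python
import json


def replication_body(source, target, create_target=True):
    """JSON body for POST /_replicate."""
    return json.dumps(
        {"source": source, "target": target, "create_target": create_target}
    )


body = replication_body("old_name", "new_name")
```

Sending the same body again after a timeout is what lets the replication resume from its last checkpoint.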
Answers:
Replication does not create a reference to another database. You can delete the first without causing harm.
Replicating (and compacting) generally reduces disk usage. If you have any views in any design documents, those will re-build when you first query them. View indexes use their own .view file which also consumes space.
I am not sure why the size is fluctuating. Browser and proxy caches are the bane of CouchDB (and web) development. But perhaps it is also a result of internal Cloudant behavior (for example, different nodes in the cluster reporting slightly different sizes).

CouchDB Compaction and Doc Deletion - Compaction indifferent?

I'm using a simple CouchDB database to test a theory: that CouchDB compaction is totally indifferent to deleted docs.
Deleting a doc from couch via a DELETE method yields the following when trying to retrieve it:
localhost:5984/enq/deleted-doc-id
{"error":"not_found","reason":"deleted"}
Expected.
Now I compact the database:
localhost:5984/enq/_compact
{"ok":true}
And check compaction has finished
"compact_running":false
Now I would expect CouchDB to return not_found, reason "missing" on a simple GET
localhost:5984/enq/deleted-doc-id
{"error":"not_found","reason":"deleted"}
And trying with ?rev=deleted_rev gives me a full doc; yay for worthless data.
So am I correct in thinking that CouchDB compaction shows no special treatment for deleted docs and simply looks at the rev count against the rev limit when deciding what to compact? Is there a special rev_limit we can set for deleted docs?
Surely the only solution can't be a _purge? At the moment we must have thousands of orphaned deleted docs, and whilst we want to maintain some version history for normal docs, we don't want to reduce our rev_limit to 1 to handle this scenario.
What are the replication issues we should be aware of with purge?
Deleted documents are preserved forever (because it's essential to providing eventual consistency between replicas). So, the behaviour you described is intentional.
To delete a document as efficiently as possible use the DELETE verb, since this stores only _id, _rev and the deleted flag. You can, of course, achieve the same more manually via POST or PUT.
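A sketch of the DELETE variant, with the current revision passed as the rev query parameter and no body at all. The host and revision are placeholders, and the request is only built, not sent:

```python
import urllib.request
from urllib.parse import quote, urlencode


def delete_request(base_url, db, doc_id, rev):
    """Build (but do not send) a bodiless DELETE for one document revision."""
    url = (
        f"{base_url}/{quote(db, safe='')}/{quote(doc_id, safe='')}"
        f"?{urlencode({'rev': rev})}"
    )
    return urllib.request.Request(url, method="DELETE")


req = delete_request("http://localhost:5984", "enq", "deleted-doc-id", "2-abc")
```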
Finally, _purge exists only for extreme cases where, for example, you've put an important password into a CouchDB document and need it to be gone from disk. It is not a recommended method for pruning a database; it will typically invalidate any views you have (forcing a full rebuild) and messes with replication too.
Adding a document, deleting it, and then compacting does not return the CouchDB database to a pristine state. A deleted document is retained through compaction, though in the usual case the resulting document is small (just the _id, _rev and _deleted=true). The reason for this is replication. Imagine the following:
Create document.
Replicate DB to remote DB.
Delete document.
Compact DB.
Replicate DB to remote DB again.
If the document is totally removed after deletion+compaction, then the second replication won't know to tell the remote DB that the document has been deleted. This would result in the two DBs being inconsistent.
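The five steps above can be played through with a toy model. This is not CouchDB's replicator, just two dictionaries and a copy loop, to show what would go wrong if compaction threw the tombstone away:

```python
def replicate(source, target):
    """Copy every record (live or tombstone) that the source still holds."""
    for doc_id, doc in source.items():
        target[doc_id] = dict(doc)


def remote_sees_deletion(keep_tombstone):
    """Run the five steps and report whether the remote learned of the delete."""
    local, remote = {}, {}
    local["a"] = {"_deleted": False}  # 1. create document
    replicate(local, remote)          # 2. replicate to remote
    local["a"] = {"_deleted": True}   # 3. delete document
    if not keep_tombstone:            # 4. compact (hypothetically dropping it)
        del local["a"]
    replicate(local, remote)          # 5. replicate again
    return remote["a"]["_deleted"]


with_tombstone = remote_sees_deletion(keep_tombstone=True)
without_tombstone = remote_sees_deletion(keep_tombstone=False)
```

With the tombstone kept, the remote ends up with the document deleted; without it, the remote still holds the live document and the two databases disagree.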
There was an issue reported that could result in the document in the DB not being small; however it did not pertain to the HTTP DELETE method AFAIK (though I could be wrong). The ticket is here:
https://issues.apache.org/jira/browse/COUCHDB-1141
The basic idea is that audit information can be included with the DELETE that will be kept through compaction. Make sure you aren't posting the full doc body with the DELETE method (doing so might explain why the document isn't actually removed).
To clarify... from our experience you have to kick off a DELETE with the id and then a compact in order to fully remove the document data.
As pointed out above you will still have the "header data" in your database afterwards.
