Deleting _deleted documents on CouchDB by date - couchdb

My CouchDB database is getting bigger, and I would like to remove documents by date. I would also like to remove _deleted documents by date.
I know how to replicate my DB while removing documents by date, but:
Is there a way to do the same with _deleted documents? I mean, remove _deleted documents by date.

There's not really a way to conditionally cause a deletion using filtered replication, nor can you replicate a complete removal of a document.
You have a variety of options:
you can avoid replicating updates on old documents by filtering on date, but if they have already been replicated they won't be deleted
you can make a view to return old documents, and use a script to delete them at the source database. The deletions will replicate to any target databases, but all databases will retain at least a {_deleted:true} tombstone of the documents (that's how the deletion gets replicated in the first place)
you can find old documents and _purge them, but you'll have to do that on each replica
What is your main goal?
If you have hundreds of objects and you want to hide the old ones from the UI of all replicas, write a script to find them and DELETE them (or set _deleted:true) in a source/master replica, and the changes will propagate.
If you have bazillions of e.g. log messages and you need to free up space by forgetting old ones, write a script to find and _purge and finally _compact, then run it on every replica. But for a case like that, it might be better to rotate databases instead, e.g. manually "shard" or bin into a different database each week, and every week simply drop the N+1 weeks old database on each replica.
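The find-and-purge option could be sketched roughly as follows, in Python. This is only an illustration under assumptions: it supposes each document carries a hypothetical `created_at` ISO-8601 field and that you have already fetched rows of `id`, `rev` and `created_at` from a view of your own. It only builds the JSON body for `POST /{db}/_purge` (a map of document ID to a list of revisions); sending it, and running `_compact` afterwards, is left out.

```python
from datetime import datetime, timezone

def select_old(rows, cutoff):
    """Return the rows whose created_at timestamp is strictly older than cutoff."""
    return [r for r in rows if datetime.fromisoformat(r["created_at"]) < cutoff]

def purge_body(rows):
    """Build the body for POST /{db}/_purge: {doc_id: [rev, ...]}."""
    body = {}
    for r in rows:
        body.setdefault(r["id"], []).append(r["rev"])
    return body

# Hypothetical rows, shaped the way a custom view might return them.
rows = [
    {"id": "log-1", "rev": "2-abc", "created_at": "2020-01-01T00:00:00+00:00"},
    {"id": "log-2", "rev": "1-def", "created_at": "2024-06-01T00:00:00+00:00"},
]
cutoff = datetime(2023, 1, 1, tzinfo=timezone.utc)
old = select_old(rows, cutoff)
print(purge_body(old))
```

The same body would have to be posted to every replica, since purges do not replicate.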

If your database is getting bigger, this is probably due to the versioning of your documents. A simple way to free some space is to run database compaction (see the CouchDB documentation on compaction).
As for _deleted documents, you can only REALLY delete them by purging.
However, purging _deleted documents is not recommended as routine cleanup. It should only be done to remove very sensitive data such as credentials.

Related

Mass update of all data with nosql

Trying to get my head around best practice for a full update.
An example scenario: I'm storing a document for each file on a hard drive. A process runs daily to update all the file information. No history needs to be kept of deleted files; the documents can just go.
Clearly querying each record would be inefficient, and wouldn't cover deleted files (data flow is one way only).
So I guess the options are:
1) Store all the records with a timestamp, and then delete all yesterday's records
2) Delete the database, but then I would lose all my views etc
3) Something else?
You say that no history needs to be kept of deleted files.
If you use one database, every time you delete documents CouchDB will keep a tombstone for each of them (which doesn't use a lot of space, but still...).
Here's the solution I would use:
Create a Template database with the design documents. Maintain your views there.
Create a script that does the following:
Create a database with today's date
Replicate _design documents from the template database
Add your data to the database
With this solution, each day's content can be completely deleted from CouchDB by dropping that day's database.
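A minimal sketch of the naming and replication parts of such a script, in Python. The database names (`files_template`, the `files_` prefix) and the design document ID are hypothetical, and only the request body for `POST /_replicate` is built here; no HTTP call is made. Depending on your CouchDB version, `source` and `target` may need to be full URLs rather than bare database names.

```python
from datetime import date, timedelta

TEMPLATE_DB = "files_template"  # hypothetical template database holding the design docs

def daily_db_name(day, prefix="files"):
    """Name of the database for a given day, e.g. files_2024_05_17."""
    return f"{prefix}_{day.strftime('%Y_%m_%d')}"

def design_replication(template_db, target_db, ddoc_ids):
    """Body for POST /_replicate that copies only the listed design documents."""
    return {
        "source": template_db,
        "target": target_db,
        "create_target": True,  # create today's database if it does not exist yet
        "doc_ids": list(ddoc_ids),
    }

today = date(2024, 5, 17)
todays_db = daily_db_name(today)
yesterdays_db = daily_db_name(today - timedelta(days=1))  # candidate for DELETE /{db}
body = design_replication(TEMPLATE_DB, todays_db, ["_design/files"])
print(todays_db, yesterdays_db)
print(body)
```

Once today's database is populated and in use, the previous day's database can be dropped with a single `DELETE`, which leaves no tombstones behind.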

rename collection vs updating collection

I have a MongoDB database which I need to update daily (delete non-relevant documents and add new ones).
The DB is not sharded.
I take the data from an external data master which is not so easy to work with.
There are 2 options:
1. Re-ingest the entire DB (not so big) into a temp collection and then rename it to the old collection name (with dropTarget set to true)
2. Do the hard work myself: delete the old entries, figure out from the data master which new documents are relevant, and insert them into the DB
Option 1 is obviously preferable, but what is the impact? I'm doing this maintenance at a late hour, but I don't want the users to get errors when querying the DB during the rename process.
Is using rename to overwrite a collection a standard way to get things done, or am I abusing the API? :)
According to the documentation, renameCollection blocks all database activity for the duration of the operation. If your users have set a sufficiently large timeout, they will not be directly affected by this rename operation; however, as the dataset can change under their feet, there might be side effects. For example, renaming a collection can invalidate open cursors, which interrupts queries that are currently returning data.
Regarding renaming of collections in production, personally I would avoid this where possible, firstly because of the cursor issue above, but more importantly because an incomplete renameCollection operation can leave the target collection in an unusable state and require manual intervention to clean up. Instead I would use an update with upsert:true that overwrites the entire document or inserts a new record if it doesn't exist.

CouchDB delete and recreate a document

I'm trying to avoid revisions building up in my CouchDB, and also so I can use TouchDB's "bulk pull" for replication (it bulk-pulls on all 1st-revs.) Would it be bad practice to just delete a document, and recreate it rather than modifying it, in order for all documents to stay at rev-1?
Deleting a document in CouchDB will not reset the _rev.
CouchDB never deletes a document, it simply marks the last revision as deleted. Compaction will delete previous revisions, keeping only the last one. This is needed for replication to work properly. And this is why the deleted revision of a document should not contain any data, but only the _id of the document and the _deleted flag.
The only method to completely remove any traces of deleted documents is to copy all documents to a new database. But keep in mind the consequences on replication.
Well, I want to say that your proposal makes me feel dirty, but that wouldn't be an SO answer, so...
You mention TouchDB and bulk pull, so you have a mobile app with data which can be modified externally and which, I assume, wants to be able to modify its own data. So the biggest issue I can think of would be update conflict resolution, i.e. how do you handle changes to the document on both the client and the server while the client is offline? I think you'll start having to do a lot of the synchronisation work that Couch is meant to handle for you.

Delete all documents in a CouchDB database *except* the design documents

Is it possible to delete all documents in a couchdb database, except design documents, without creating a specific view for that?
My first approach has been to access the _all_docs standard view, and discard those documents starting with _design. This works but, for large databases, is too slow, since the documents need to be requested from the database (in order to get the document revision) one at a time.
If this is the only valid approach, I think it is much more practical to delete the complete database, and create it from scratch inserting the design documents again.
I can think of a couple of ideas.
Use _all_docs
You do not need to fetch all the documents, only the ID and revisions. By default, that is all that _all_docs returns. You can make a pretty big request in a batch (10k or 100k docs at a time should be fine).
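As a rough sketch of this first idea in Python: the rows returned by `_all_docs` already carry each document's ID and current revision, so they can be turned directly into deletion stubs for `POST /{db}/_bulk_docs`. The function name and the sample rows are made up for illustration, and the code only builds the request bodies; fetching `_all_docs` and posting the batches is left out.

```python
def deletion_batches(rows, batch_size=10000):
    """Yield _bulk_docs bodies that delete every non-design document in rows.

    rows are _all_docs rows: {"id": ..., "key": ..., "value": {"rev": ...}}.
    """
    stubs = [
        {"_id": r["id"], "_rev": r["value"]["rev"], "_deleted": True}
        for r in rows
        if not r["id"].startswith("_design/")  # keep the design documents
    ]
    for start in range(0, len(stubs), batch_size):
        yield {"docs": stubs[start:start + batch_size]}

# Sample rows in the shape _all_docs returns them.
rows = [
    {"id": "_design/app", "key": "_design/app", "value": {"rev": "3-aaa"}},
    {"id": "doc-1", "key": "doc-1", "value": {"rev": "1-bbb"}},
    {"id": "doc-2", "key": "doc-2", "value": {"rev": "2-ccc"}},
    {"id": "doc-3", "key": "doc-3", "value": {"rev": "1-ddd"}},
]
batches = list(deletion_batches(rows, batch_size=2))
for body in batches:
    print(body)
```

Each deleted document still leaves a tombstone behind, which is the trade-off discussed under "Deleting vs "deleting"".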
Replicate then delete
You could use an _all_docs query to get the IDs of all design documents.
GET /db/_all_docs?startkey="_design/"&endkey="_design0"
Then replicate them somewhere temporary.
POST /_replicator
{ "source":"db", "target":"db_ddocs", "create_target":true
, "user_ctx": {"roles":["_admin"]}
, "doc_ids": ["_design/ddoc_1", "_design/ddoc_2", "etc..."]
}
Now you can just delete the original database and replicate the temporary one back by swapping the "source" and "target" values.
Deleting vs "deleting"
Note, these are really apples vs. oranges techniques. By deleting a database, you are wiping out the edit history of all its documents. In other words, you cannot replicate those deletion events to any other database. When you "delete" a document in CouchDB, it stores a record of that deletion. If you replicate that database, those deletions will be reflected in the target. (CouchDB stores "tombstones" indicating the document ID, its revision history, and its deleted state.)
That may or may not be important to you. The first idea is probably considered more "correct" however I can see the value of the second. You can visualize the entire program to accomplish this in your head. It's only a few queries and you're done. No looping through _all_docs batches, no headache. Your specific situation will probably make it obvious which is better.
Install couchapp, pull down the design doc to your hard disk, delete the db in futon, push the design doc back up to your recreated database. =)
You could write a shell script that goes through the list of all documents and deletes them all one by one except design docs. Apparently couch-batch can do that. Note that you don't need to fetch the whole docs to do that, just the id and revision.
Other than that, I think filtered replication (or the replication proposed by JasonSmith) is your best bet.

CouchDB Compaction and Doc Deletion - Compaction indifferent?

I'm using a simple CouchDB database to test a theory: that CouchDB compaction is totally indifferent to deleted docs.
Deleting a doc from couch via a DELETE method yields the following when trying to retrieve it:
localhost:5984/enq/deleted-doc-id
{"error":"not_found","reason":"deleted"}
Expected.
Now I compact the database:
localhost:5984/enq/_compact
{"ok":true}
And check compaction has finished
"compact_running":false
Now I would expect CouchDB to return not_found, reason "missing" on a simple GET
localhost:5984/enq/deleted-doc-id
{"error":"not_found","reason":"deleted"}
And trying with ?rev=deleted_rev gives me a full doc. Yay for worthless data.
So am I correct in thinking that CouchDB compaction gives no special treatment to deleted docs and simply looks at the rev count against the rev limit when deciding what is part of compaction? Is there a special rev_limit we can set for deleted docs?
Surely the only solution can't be a _purge? At the moment we must have thousands of orphaned deleted docs, and whilst we want to maintain some version history for normal docs, we don't want to reduce our rev_limit to 1 to assist in this scenario.
What are the replication issues we should be aware of with purge?
Deleted documents are preserved forever (because it's essential to providing eventual consistency between replicas). So, the behaviour you described is intentional.
To delete a document as efficiently as possible use the DELETE verb, since this stores only _id, _rev and the deleted flag. You can, of course, achieve the same more manually via POST or PUT.
Finally, _purge exists only for extreme cases where, for example, you've put an important password into a CouchDB document and need it to be gone from disk. It is not a recommended method for pruning a database: it will typically invalidate any views you have (forcing a full rebuild), and it messes with replication too.
Adding a document, deleting it, and then compacting does not return the CouchDB database to a pristine state. A deleted document is retained through compaction, though in the usual case the resulting document is small (just the _id, _rev and _deleted=true). The reason for this is replication. Imagine the following:
Create document.
Replicate DB to remote DB.
Delete document.
Compact DB.
Replicate DB to remote DB again.
If the document is totally removed after deletion+compaction, then the second replication won't know to tell the remote DB that the document has been deleted. This would result in the two DBs being inconsistent.
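The scenario above can be illustrated with a toy model in Python. This is not how CouchDB's replicator actually works; databases are modelled as plain dicts, and "replication" simply copies whatever records the source still holds. It only shows why a deletion that leaves no record cannot be propagated.

```python
def replicate(source, target):
    """Toy replication: copy every record (tombstones included) the source still holds."""
    for doc_id, doc in source.items():
        target[doc_id] = dict(doc)

def visible(db):
    """IDs of documents that are not marked as deleted."""
    return sorted(doc_id for doc_id, doc in db.items() if not doc.get("_deleted"))

def run(keep_tombstone):
    local, remote = {}, {}
    local["a"] = {"_rev": "1-x", "value": 42}            # create document
    replicate(local, remote)                             # replicate to remote
    if keep_tombstone:
        local["a"] = {"_rev": "2-y", "_deleted": True}   # delete: tombstone survives compaction
    else:
        del local["a"]                                   # hypothetical: deletion leaves no trace
    replicate(local, remote)                             # replicate again
    return visible(local), visible(remote)

print(run(keep_tombstone=True))   # -> ([], [])
print(run(keep_tombstone=False))  # -> ([], ['a'])
```

With the tombstone, both sides agree the document is gone; without it, the remote keeps serving a document that no longer exists locally.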
There was an issue reported that could result in the document in the DB not being small; however it did not pertain to the HTTP DELETE method AFAIK (though I could be wrong). The ticket is here:
https://issues.apache.org/jira/browse/COUCHDB-1141
The basic idea is that audit information can be included with the DELETE that will be kept through compaction. Make sure you aren't posting the full doc body with the DELETE method (doing so might explain why the document isn't actually removed).
To clarify: in our experience, you have to kick off a DELETE with the id and then a compaction in order to fully remove the document data.
As pointed out above you will still have the "header data" in your database afterwards.
