What is the recommended procedure to purge out all non-current data from a CouchDB database?

Say I have a database with 100 records, each with 1,000 revisions, plus an additional 100,000 deleted documents, each with an extensive revision history of its own. The database also has a design document with views and some Mango indexes.
For this hypothetical situation let's assume I can't delete and rebuild the database. Also replication safety is not a concern.
Suppose I am required to write some kind of script, using curl, that purges the database of all unused data, so that running it leaves the database exactly as if it had been deleted and rebuilt with only the 100 records, each with a single revision on file. How should I go about this?

For your hypothetical situation, you could do the following:
Make a backup of the 100 required documents
Delete all documents in the DB
Use the Purge API to delete the revision history
Re-create the 100 required documents
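A minimal curl sketch of those steps, assuming CouchDB 2.3+ (where the clustered _purge endpoint exists), admin credentials, and jq on the PATH; the server URL, database name and file names are all placeholders:

COUCH=http://admin:password@localhost:5984

# 1. Back up all live documents (this includes any design documents,
#    which get restored, and their indexes rebuilt, in step 4).
curl -s "$COUCH/db/_all_docs?include_docs=true" > backup.json

# 2. Delete every live document in one batch.
curl -s "$COUCH/db/_all_docs" \
  | jq '{docs: [.rows[] | {_id: .id, _rev: .value.rev, _deleted: true}]}' \
  | curl -s -X POST "$COUCH/db/_bulk_docs" -H 'Content-Type: application/json' -d @-

# 3. Purge every tombstone. The changes feed lists all documents,
#    deleted ones included, with their leaf revisions. A real script
#    must split this into chunks: CouchDB caps the number of ids and
#    revisions accepted per _purge request.
curl -s "$COUCH/db/_changes?style=all_docs" \
  | jq '[.results[] | {(.id): [.changes[].rev]}] | add' \
  | curl -s -X POST "$COUCH/db/_purge" -H 'Content-Type: application/json' -d @-

# 4. Re-create the documents, stripping the old _rev fields.
jq '{docs: [.rows[].doc | del(._rev)]}' backup.json \
  | curl -s -X POST "$COUCH/db/_bulk_docs" -H 'Content-Type: application/json' -d @-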
A safer approach for saving disk space and BTree size in a real-life scenario would be:
Properly configure the database's revision limit (_revs_limit) so that compaction does not retain too many revisions (see the example after this list)
Only purge documents that will never be modified again.
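The relevant knob for the first point is the per-database _revs_limit (default 1000), which caps how many historical revision ids each document keeps. A minimal example, with placeholder credentials:

COUCH=http://admin:password@localhost:5984

# Keep at most 50 revisions per document; takes effect at the next compaction.
curl -X PUT "$COUCH/db/_revs_limit" -d '50'

# Verify the current setting.
curl "$COUCH/db/_revs_limit"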

Related

Does deleting Cosmos DB container consume RUs?

There are a lot of junk documents in a Cosmos DB container. Is it better to bulk-delete the documents or to drop the container and re-create it? Bulk deletes consume RUs, but what about deleting the container? Does it consume RUs to delete the documents before dropping the container?
There's really no "right" answer - only you can decide which approach is "better" but... from an objective perspective:
Deletion of a container has negligible cost, and it's a one-time cost (that is... it's one single "delete" cost). And to your question regarding cost of deleting all the documents when taking the "delete collection" route: nope - it's just a collection-drop - you aren't charged RU for every document being removed, when dropping a collection.
Deleting documents, in bulk, would consume RU for every document deleted - it would absolutely cost RU, and runs the risk of throttling, depending on how aggressive your deletion activity is
Deleting all documents requires you to consider that deletes could occur across partitions - plan accordingly
If you delete the collection, there could be a time period where your app is now throwing exceptions due to the collection not existing (until you re-create the collection, which could take several seconds)
When re-creating a collection, be sure to re-create any related attributes of the collection as well (custom indexing, stored procedures, etc)
As an alternative to bulk-deleting, you can also take advantage of TTL (time-to-live) to let old documents expire on their own.
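For example, with a default TTL enabled on the container, a per-item ttl property (in seconds) lets each document expire individually; the id and payload fields here are illustrative:

{
  "id": "scan-2024-01-15-0001",
  "payload": "junk that should age out",
  "ttl": 604800
}

Here 604800 seconds means the item expires 7 days after its last write, after which Cosmos DB removes it in the background using spare request units.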

Deleting _deleted documents on CouchDB by date

My CouchDB database is getting bigger, and I would like to remove documents by date. I would also like to remove _deleted documents by date.
I know how to replicate my DB while filtering out documents by date, but:
Is there a way to do the same with _deleted documents? I mean, remove _deleted documents by date.
There's not really a way to conditionally cause a deletion using filtered replication, nor can you replicate a complete removal of a document.
You have a variety of options:
you can avoid replicating updates on old documents by filtering on date, but if they have already been replicated they won't be deleted
you can make a view to return old documents, and use a script to delete them at the source database (a sketch of such a view follows this list). The deletions will replicate to any target databases, but every database will retain at least a {_deleted:true} tombstone of the documents [that's how the deletion gets replicated in the first place]
you can find old documents and _purge them, but you'll have to do that on each replica
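A minimal sketch of such a view, assuming documents carry a created_at field (the design document name, view name, and field name are all hypothetical):

COUCH=http://admin:password@localhost:5984

# Index documents by their (hypothetical) created_at timestamp.
curl -X PUT "$COUCH/db/_design/maintenance" -H 'Content-Type: application/json' -d '{
  "views": {
    "by_date": {
      "map": "function (doc) { if (doc.created_at) { emit(doc.created_at, null); } }"
    }
  }
}'

# Everything older than a cutoff, ready for a script to delete or purge.
curl -G "$COUCH/db/_design/maintenance/_view/by_date" --data-urlencode 'endkey="2015-01-01"'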
What is your main goal?
If you have hundreds of objects and you want to hide the old ones from the UI of all replicas, write a script to find them and DELETE them (or set _deleted:true) in a source/master replica, and the changes will propagate.
If you have bazillions of e.g. log messages and you need to free up space by forgetting old ones, write a script to find and _purge and finally _compact, then run it on every replica. But for a case like that, it might be better to rotate databases instead, e.g. manually "shard" or bin into a different database each week, and every week simply drop the N+1 weeks old database on each replica.
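A sketch of the weekly-rotation idea (GNU date flags; the logs_ prefix and eight-week retention window are assumptions):

COUCH=http://admin:password@localhost:5984

# Bin writes into one database per ISO week...
WEEK=$(date +%G%V)
curl -X PUT "$COUCH/logs_$WEEK"

# ...and drop the bin that has aged out (use `date -v-8w` on BSD/macOS).
OLD=$(date -d '8 weeks ago' +%G%V)
curl -X DELETE "$COUCH/logs_$OLD"

Run this on every replica, since database creation and deletion do not replicate.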
If your database is getting bigger, this is probably due to the versioning of your documents. A simple way to free some space is to run database compaction (see the compaction documentation).
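Compaction is a per-database POST; view indexes are compacted per design document, and _view_cleanup removes index files for views that no longer exist (the design document name here reuses the hypothetical one from above):

COUCH=http://admin:password@localhost:5984

# Compact the database file.
curl -X POST "$COUCH/db/_compact" -H 'Content-Type: application/json'

# Compact the view index of one design document.
curl -X POST "$COUCH/db/_compact/maintenance" -H 'Content-Type: application/json'

# Delete index files for views that are no longer referenced.
curl -X POST "$COUCH/db/_view_cleanup" -H 'Content-Type: application/json'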
As for _deleted documents, the only way to REALLY delete them is to purge them.
However, purging is not recommended as routine maintenance: it has to be repeated on every replica and can force view rebuilds. It should be reserved for cases such as removing sensitive data, e.g. leaked credentials.

Mass update of all data with nosql

Trying to get my head around best practice for a full update.
Scenario example is if I'm storing a document for each file on a hard drive. A process runs daily to update all the file information. No history needs to be kept of deleted files; the documents can just go.
Clearly querying each record would be inefficient, and wouldn't cover deleted files (data flow is one way only).
So I guess the options are:
1) Store all the records with a timestamp, and then delete all yesterday's records
2) Delete the database, but then I would lose all my views etc
3) Something else?
No history needs to be kept of deleted files
If you use a single database, then every time you delete documents, CouchDB keeps a tombstone for each of them (which doesn't use a lot of space, but still...).
Here's the solution I would use:
Create a Template database with the design documents. Maintain your views there.
Create a script that does the following:
Create a database with today's date
Replicate _design documents from the template database
Add your data to the database
With this solution, each day's content can be completely removed from CouchDB simply by deleting the previous day's database.
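A curl sketch of that script (the files_ naming, the _design/files id, and scan.json are all illustrative):

COUCH=http://admin:password@localhost:5984
TODAY=$(date +%Y%m%d)

# 1. Create a database for today's snapshot.
curl -X PUT "$COUCH/files_$TODAY"

# 2. Pull the views in by replicating only the design documents
#    from the template database.
curl -X POST "$COUCH/_replicate" -H 'Content-Type: application/json' -d "{
  \"source\": \"files_template\",
  \"target\": \"files_$TODAY\",
  \"doc_ids\": [\"_design/files\"]
}"

# 3. Bulk-load the day's file information
#    (scan.json must be of the form {"docs": [...]}).
curl -X POST "$COUCH/files_$TODAY/_bulk_docs" -H 'Content-Type: application/json' -d @scan.json

# Yesterday's snapshot can then simply be dropped, views and all (GNU date).
curl -X DELETE "$COUCH/files_$(date -d yesterday +%Y%m%d)"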

Delete all documents in a CouchDB database *except* the design documents

Is it possible to delete all documents in a couchdb database, except design documents, without creating a specific view for that?
My first approach has been to access the _all_docs standard view, and discard those documents starting with _design. This works but, for large databases, is too slow, since the documents need to be requested from the database (in order to get the document revision) one at a time.
If this is the only valid approach, I think it is much more practical to delete the complete database, and create it from scratch inserting the design documents again.
I can think of a couple of ideas.
Use _all_docs
You do not need to fetch all the documents, only the ID and revisions. By default, that is all that _all_docs returns. You can make a pretty big request in a batch (10k or 100k docs at a time should be fine).
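For example, one batch (the 10,000 page size is arbitrary; a full script would loop with startkey until no rows remain):

COUCH=http://admin:password@localhost:5984

# Read ids and revs only, skip design docs, flag the rest as deleted.
curl -s "$COUCH/db/_all_docs?limit=10000" \
  | jq '{docs: [.rows[]
                | select(.id | startswith("_design/") | not)
                | {_id: .id, _rev: .value.rev, _deleted: true}]}' \
  | curl -s -X POST "$COUCH/db/_bulk_docs" -H 'Content-Type: application/json' -d @-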
Replicate then delete
You could use an _all_docs query to get the IDs of all design documents.
GET /db/_all_docs?startkey="_design/"&endkey="_design0"
Then replicate them somewhere temporary.
POST /_replicator
{ "source":"db", "target":"db_ddocs", "create_target":true
, "user_ctx": {"roles":["_admin"]}
, "doc_ids": ["_design/ddoc_1", "_design/ddoc_2", "etc..."]
}
Now you can just delete the original database and replicate the temporary one back by swapping the "source" and "target" values.
Deleting vs "deleting"
Note, these are really apples vs. oranges techniques. By deleting a database, you are wiping out the edit history of all its documents. In other words, you cannot replicate those deletion events to any other database. When you "delete" a document in CouchDB, it stores a record of that deletion. If you replicate that database, those deletions will be reflected in the target. (CouchDB stores "tombstones" indicating the document ID, its revision history, and its deleted state.)
That may or may not be important to you. The first idea is probably considered more "correct"; however, I can see the value of the second: you can visualize the entire program to accomplish it in your head. It's only a few queries and you're done. No looping through _all_docs batches, no headache. Your specific situation will probably make it obvious which is better.
Install couchapp, pull down the design doc to your hard disk, delete the db in futon, push the design doc back up to your recreated database. =)
You could write a shell script that goes through the list of all documents and deletes them all one by one except design docs. Apparently couch-batch can do that. Note that you don't need to fetch the whole docs to do that, just the id and revision.
Other than that, I think filtered replication (or the replication proposed by JasonSmith) is your best bet.

CouchDB Compaction and Doc Deletion - Compaction indifferent?

Putting a simple CouchDB setup to the test of the theory that CouchDB compaction is totally indifferent to deleted docs.
Deleting a doc from couch via a DELETE method yields the following when trying to retrieve it:
localhost:5984/enq/deleted-doc-id
{"error":"not_found","reason":"deleted"}
Expected.
Now I compact the database:
localhost:5984/enq/_compact
{"ok": true}
And check compaction has finished
"compact_running":false
Now I would expect CouchDB to return not_found, reason "missing" on a simple GET
localhost:5984/enq/deleted-doc-id
{"error":"not_found","reason":"deleted"}
And trying with ?rev=deleted_rev gives me a full doc. Yay for worthless data.
So am I correct in thinking that CouchDB compaction gives no special treatment to deleted docs, and simply checks the revision count against the revision limit when deciding what to keep during compaction? Is there a special rev_limit we can set for deleted docs?
Surely the only solution can't be a _purge? At the moment we must have thousands of orphaned deleted docs, and whilst we want to maintain some version history for normal docs, we don't want to reduce our rev_limit to 1 just to assist in this scenario.
What are the replication issues we should be aware of with purge?
Deleted documents are preserved forever (because it's essential to providing eventual consistency between replicas). So, the behaviour you described is intentional.
To delete a document as efficiently as possible use the DELETE verb, since this stores only _id, _rev and the deleted flag. You can, of course, achieve the same more manually via POST or PUT.
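For example (the database name, document id and revision are placeholders):

COUCH=http://admin:password@localhost:5984

# Leaves only a minimal tombstone: _id, _rev and _deleted=true.
curl -X DELETE "$COUCH/db/some-doc-id?rev=3-917fa2381192822767f010b95b45325b"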
Finally, _purge exists only for extreme cases where, for example, you've put an important password into a couchdb document and need it be gone from disk. It is not a recommended method for pruning a database, it will typically invalidate any views you have (forcing a full rebuild) and messes with replication too.
Adding a document, deleting it, and then compacting does not return the CouchDB database to a pristine state. A deleted document is retained through compaction, though in the usual case the resulting document is small (just the _id, _rev and _deleted=true). The reason for this is replication. Imagine the following:
Create document.
Replicate DB to remote DB.
Delete document.
Compact DB.
Replicate DB to remote DB again.
If the document is totally removed after deletion+compaction, then the second replication won't know to tell the remote DB that the document has been deleted. This would result in the two DBs being inconsistent.
There was an issue reported that could result in the document in the DB not being small; however it did not pertain to the HTTP DELETE method AFAIK (though I could be wrong). The ticket is here:
https://issues.apache.org/jira/browse/COUCHDB-1141
The basic idea is that audit information can be included with the DELETE that will be kept through compaction. Make sure you aren't posting the full doc body with the DELETE method (doing so might explain why the document isn't actually removed).
To clarify... from our experience, you have to kick off a DELETE with the id, followed by a compaction, in order to fully remove the document data.
As pointed out above, you will still have the "header data" (the tombstone) in your database afterwards.
