There are a lot of junk documents in a Cosmos DB container. Is it better to bulk-delete the documents or to drop the container and re-create it? Bulk delete consumes RUs, but what about deleting the container? Does it consume RUs to delete the documents before dropping the container?
There's really no "right" answer - only you can decide which approach is "better" but... from an objective perspective:
Deletion of a container has negligible cost, and it's a one-time cost (that is... it's one single "delete" cost). And to your question regarding the cost of deleting all the documents when taking the "delete collection" route: nope - it's just a collection drop; you aren't charged RUs for every document being removed when dropping a collection.
Deleting documents in bulk would consume RUs for every document deleted - it would absolutely cost RUs, and it runs the risk of throttling, depending on how aggressive your deletion activity is
Deleting all documents requires you to consider that deletes could occur across partitions - plan accordingly
If you delete the collection, there could be a period where your app throws exceptions because the collection no longer exists (until you re-create the collection, which could take several seconds)
When re-creating a collection, be sure to re-create any related attributes of the collection as well (custom indexing, stored procedures, etc)
As an alternative to bulk-deleting, you can also take advantage of TTL to let old documents expire.
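For reference, here's a minimal sketch of the drop-and-recreate route with the Python SDK; the account URL, database, container, and partition-key path are placeholders, not from the question:

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<account-key>")
db = client.get_database_client("mydb")

# Dropping the container is a single metadata operation; you are not charged
# per document removed. Until the re-create below completes, requests against
# "junk" will fail, so plan for that window.
db.delete_container("junk")

# Re-create with the same partition key, and re-apply any custom settings
# (indexing policy, default TTL, throughput, stored procedures, ...).
db.create_container(
    id="junk",
    partition_key=PartitionKey(path="/pk"),
    # indexing_policy=..., default_ttl=..., offer_throughput=...
)
```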
When I insert a new document, I'd like Cosmos to automatically copy _ts into another document property. I think this may be possible with a stored procedure, but going through one increases RU costs.
Script usage: As with queries, stored procedures and triggers consume RUs based on the complexity of the operations that are performed. As you develop your application, inspect the request charge header to better understand how much RU capacity each operation consumes.
Is there another way to copy over _ts? The reason I want to do this is so I can track the original insertion time. _ts gets updated with each document change. I wish there was a way to configure a container so that Azure tracked the original insertion time in a separate system property.
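In case it helps with the comparison, this is a rough sketch of inspecting that request charge from the Python SDK after a write; the database, container, and document names are placeholders, and the charge is read from the `x-ms-request-charge` response header:

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<account-key>")
container = client.get_database_client("mydb").get_container_client("items")

# "pk" is assumed to be the container's partition key path.
container.create_item({"id": "doc1", "pk": "p1"})

# The headers of the last response include the RU charge for the operation,
# which is how you'd compare a plain insert against a stored-procedure call.
charge = container.client_connection.last_response_headers.get("x-ms-request-charge")
print(f"insert cost: {charge} RU")
```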
Say I have a database with 100 records, each with 1,000 revisions, plus an additional 100,000 deleted documents, each with an extensive revision history. We also have a view document and some Mango indexes.
For this hypothetical situation let's assume I can't delete and rebuild the database. Also replication safety is not a concern.
If I am required to create some kind of script utilizing curl to purge the database of all unused data so that the result of running this script is exactly the same as deleting and rebuilding the database with only 100 records with a single revision on-file, how should I go about doing this?
For your hypothetical situation, you could do the following:
Make a backup of the 100 required documents
Delete all documents in the DB
Use the Purge API to delete the revision history
Re-Create the 100 required documents
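A rough sketch of those four steps using Python's requests (the question mentions curl, which maps to the same HTTP calls). The URL, credentials, database name, and the "keep" criterion are all assumptions, and this assumes a recent CouchDB (3.x) where the _purge endpoint is available:

```python
import requests

BASE = "http://admin:password@localhost:5984"
DB = "mydb"

# 1. Back up the documents you need to keep (the selection criterion here is hypothetical).
rows = requests.get(f"{BASE}/{DB}/_all_docs", params={"include_docs": "true"}).json()["rows"]
keepers = [r["doc"] for r in rows if r["doc"].get("type") == "keeper"]

# 2. Delete every live document; this writes a _deleted tombstone revision per doc.
#    (Documents that were already deleted don't appear in _all_docs; their
#    tombstones would have to be located separately, e.g. via _changes.)
tombstones = [{"_id": r["id"], "_rev": r["value"]["rev"], "_deleted": True} for r in rows]
deleted = requests.post(f"{BASE}/{DB}/_bulk_docs", json={"docs": tombstones}).json()

# 3. Purge the tombstone revisions so the revision trees are removed entirely.
#    _purge expects {doc_id: [leaf revisions]}; large sets need to be batched.
to_purge = {d["id"]: [d["rev"]] for d in deleted if d.get("ok")}
requests.post(f"{BASE}/{DB}/_purge", json=to_purge)

# 4. Re-create the kept documents as brand-new docs (strip their old _rev).
for doc in keepers:
    doc.pop("_rev", None)
requests.post(f"{BASE}/{DB}/_bulk_docs", json={"docs": keepers})
```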
A safer approach for saving disk space and BTree size in a real-life scenario would be:
Properly configure CouchDB's compaction settings to not include too many revisions
Only purge documents that won't ever be modified again in the future.
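For the first point, the per-database _revs_limit setting is the knob that bounds how much revision history is retained after compaction; a minimal sketch (URL, credentials, database name, and the chosen limit are assumptions):

```python
import requests

# Keep at most 50 revisions of history per document after compaction
# (CouchDB's default is 1000).
requests.put(
    "http://admin:password@localhost:5984/mydb/_revs_limit",
    data="50",
    headers={"Content-Type": "application/json"},
)
```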
I want to clear all my data before each new data upload, which happens on a regular interval. So I only need to clear the data; I don't want to delete and recreate my collections to do this.
There is no way to delete all documents in a collection in a single operation. You would need to do this on your own, via whatever method you want for enumerating and deleting documents. Alternatively, you can drop and re-create the collection, which should be significantly more efficient than deleting documents individually and won't be subject to RU-based throttling.
One other approach: configure TTL on your documents. With a bit of creativity, you could set the TTL on each of your added documents so they all expire around the same time, effectively giving you an automatic document-deletion mechanism.
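A minimal sketch of the TTL route with the Python SDK; all names and the partition key path are placeholders:

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<account-key>")
db = client.get_database_client("mydb")

# default_ttl=-1 turns TTL on at the container level without expiring anything
# by default; each document then opts in via its own "ttl" property (in seconds).
container = db.create_container_if_not_exists(
    id="staging",
    partition_key=PartitionKey(path="/pk"),
    default_ttl=-1,
)

# Every document in this upload batch expires roughly 24 hours after it is written,
# so the container empties itself before the next scheduled upload.
container.upsert_item({"id": "doc1", "pk": "batch-01", "ttl": 24 * 60 * 60})
```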
I'm a beginner with Azure. I'm using log monitoring to view the logs for a Cosmos DB resource, and I can see one log entry with a Replace operation that is consuming a lot of RUs on average.
Generally, operation names are CREATE/DELETE/UPDATE/READ. Why has a REPLACE operation shown up here? I could not understand it. And why is the REPLACE operation consuming so many RUs?
What can I try next?
Updates in Cosmos are full replacement operations rather than in-place updates; as such, they consume more RU/s than inserts. Also, the larger the document, the more throughput is required for the update.
Strategies to optimize throughput consumption on update operations typically center around splitting a document in two: properties that don't change go into one document, which is typically larger, and properties that change frequently go into another, smaller document. This allows updates to be made against the smaller document, which reduces the RU/s consumed by the operation.
All that said, 12 RU/s is not an inordinate amount for a replace operation. I don't think you will get much, if any, throughput reduction doing this. But you can certainly try.
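To illustrate the split-document idea, here's a hypothetical Python sketch; the names, fields, and partition key are my own placeholders, not from your workload:

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<account-key>")
container = client.get_database_client("mydb").get_container_client("users")

# Large, rarely-changing properties live in their own document...
profile = {"id": "user1-profile", "pk": "user1", "name": "...", "addresses": ["..."]}
# ...while the frequently-updated properties live in a much smaller sibling document.
stats = {"id": "user1-stats", "pk": "user1", "lastLogin": None, "loginCount": 0}

container.upsert_item(profile)
container.upsert_item(stats)

# Frequent updates now replace only the small document, so each replace
# rewrites far fewer bytes and consumes fewer RU/s.
stats["loginCount"] += 1
container.replace_item(item=stats["id"], body=stats)
```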
My CouchDB database is getting bigger, and I would like to remove documents by date. I would also like to remove _deleted documents by date.
I know how to replicate my DB while filtering out documents by date, but:
Is there a way to do the same with _deleted documents? I mean, remove _deleted documents by date.
There's not really a way to conditionally cause a deletion using filtered replication, nor can you replicate a complete removal of a document.
You have a variety of options:
you can avoid replicating updates on old documents by filtering on date, but if they have already been replicated they won't be deleted
you can make a view to return old documents, and use a script to delete them at the source database. The deletions will replicate to any target databases, but all databases will retain at least a {_deleted:true} tombstone of the documents [that's how the deletion gets replicated in the first place]
you can find old documents and _purge them, but you'll have to do that on each replica
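A rough sketch of the second option (find old documents and delete them at the source) using Python's requests; the URL, credentials, database name, "created" field, and cutoff date are all assumptions. A Mango _find query is used here instead of a view, but either works:

```python
import requests

BASE = "http://admin:password@localhost:5984"
DB = "mydb"

# Find documents older than the cutoff (assumes each doc carries a "created" field).
old = requests.post(
    f"{BASE}/{DB}/_find",
    json={"selector": {"created": {"$lt": "2023-01-01"}}, "limit": 1000},
).json()["docs"]

# Write _deleted:true tombstones; these deletions will replicate to any targets,
# but every database still keeps the tombstones unless you _purge them as well.
tombstones = [{"_id": d["_id"], "_rev": d["_rev"], "_deleted": True} for d in old]
requests.post(f"{BASE}/{DB}/_bulk_docs", json={"docs": tombstones})
```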
What is your main goal?
If you have hundreds of objects and you want to hide the old ones from the UI of all replicas, write a script to find them and DELETE/_deleted:true them in a source/master replica, and the changes will propagate.
If you have bazillions of e.g. log messages and you need to free up space by forgetting old ones, write a script to find and _purge them, and finally _compact, then run it on every replica. But for a case like that, it might be better to rotate databases instead: e.g. manually "shard" or bin documents into a different database each week, and every week simply drop the database that is N+1 weeks old on each replica.
If your database is getting bigger, this is probably due to the versioning of your documents. A simple way to free some space is to run database compaction (see the documentation).
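Triggering compaction is a single admin request; a minimal sketch (the URL, credentials, and database name are assumptions):

```python
import requests

# Compaction runs in the background; the request returns immediately with {"ok": true}.
requests.post(
    "http://admin:password@localhost:5984/mydb/_compact",
    headers={"Content-Type": "application/json"},
)
```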
As for _deleted documents, you can only REALLY delete them by purging. Purging removes every trace of a document and is not replicated, so a purged document can reappear from a replica that still holds it. Therefore, it's not recommended to purge _deleted documents as routine cleanup; it should only be done to remove very sensitive data, such as credentials.