Design pattern to expire documents on Cloudant - CouchDB

So when a document is deleted, the metadata is actually preserved forever. For a hosted service like Cloudant, where storage costs money every month, I would instead like to completely purge the deleted documents.
I read somewhere about a design pattern where you use dbcopy in a view to put the docs into a 'current' db and then periodically delete the expired dbs. But I can't find the article, and I don't quite understand how the database naming would work. How would the Cloudant clients always know the 'current' database name?

Cloudant does not expose the _purge endpoint (the loose consistency guarantees between the clustered nodes make purging tricky).
The most common solution to this problem is to create a second database and use replication with a validate_doc_update function so that deleted documents with no existing entry in the target database are rejected. When replication is complete (or acceptably up to date, if using continuous replication), switch your application to use the new database and delete the old one. There is currently no way to rename databases, but you could use a virtual host which points to the "current" database.
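Here's a minimal sketch of that pattern, assuming a Cloudant/CouchDB endpoint reachable over HTTP with admin credentials; the base URL, credentials, and the "mydb"/"mydb-v2" database names are placeholders, and Python's requests library is used purely for illustration.

import requests

COUCH = "https://account.cloudant.com"   # placeholder base URL
AUTH = ("admin", "secret")               # placeholder credentials
OLD_DB, NEW_DB = "mydb", "mydb-v2"       # placeholder database names

# 1. Create the new "current" database.
requests.put(f"{COUCH}/{NEW_DB}", auth=AUTH).raise_for_status()

# 2. Install a validate_doc_update function on the target that rejects tombstones
#    for documents the target has never seen, so old deletions are not carried over.
ddoc = {
    "validate_doc_update":
        "function(newDoc, oldDoc, userCtx) {"
        "  if (newDoc._deleted === true && !oldDoc) {"
        "    throw({forbidden: 'no tombstones for unknown docs'});"
        "  }"
        "}"
}
requests.put(f"{COUCH}/{NEW_DB}/_design/scrub", json=ddoc, auth=AUTH).raise_for_status()

# 3. Replicate old -> new; rejected tombstones are skipped (they show up as
#    doc_write_failures in the replication result). On a secured server the
#    source/target URLs typically need credentials embedded.
requests.post(f"{COUCH}/_replicate", auth=AUTH,
              json={"source": f"{COUCH}/{OLD_DB}", "target": f"{COUCH}/{NEW_DB}"})

# 4. Point the application (or a virtual host) at NEW_DB, then delete OLD_DB:
# requests.delete(f"{COUCH}/{OLD_DB}", auth=AUTH)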
I'd caution that a workload which generates a high ratio of deleted:active documents is generally an anti-pattern in Cloudant. I would first consider whether you can change your document model to avoid it.

Deleted documents are kept forever in CouchDB, even after compaction. The tombstone left behind is pretty small, though, as it contains only three fields:
{"_id": "234wer", "_rev": "123", "_deleted": true}
The reason for this is to make sure that all the replicated databases stay consistent. If a document that is replicated across several databases were removed entirely from one location, there would be no way to tell the other replicas about the deletion.
There is _purge but as explained in the wiki it is only to be used in special cases.
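To see this for yourself, here's a small sketch (assuming a CouchDB at http://localhost:5984 with admin credentials and a placeholder database name): compaction drops old revision bodies, but the tombstone count reported by the database stays the same.

import requests

COUCH = "http://localhost:5984"   # placeholder
AUTH = ("admin", "secret")        # placeholder
DB = "mydb"                       # placeholder

# Live documents vs. tombstones.
info = requests.get(f"{COUCH}/{DB}", auth=AUTH).json()
print(info["doc_count"], info["doc_del_count"])

# Trigger compaction (the JSON content type is required).
requests.post(f"{COUCH}/{DB}/_compact", auth=AUTH,
              headers={"Content-Type": "application/json"}).raise_for_status()

# Once compaction finishes, doc_del_count is unchanged: the tombstones remain.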

Related

How to keep CouchDB efficient with a lot of DELETEs - purge?

I have a CouchDB database with ~2000 documents (50 MB), but 150K deleted documents accumulated in 3 months, and the number will keep increasing.
So, what is the best strategy to keep performance high?
Use purge + compact, or periodically re-create the entire database?
The CouchDB documentation recommends re-creating the database when storing short-term data. That isn't quite my case, but deletes are constant for some kinds of documents.
DELETE operation
If your use case creates lots of deleted documents (for example, if you are storing short-term data like log entries, message queues, etc), you might want to periodically switch to a new database and delete the old one (once the entries in it have all expired).
Using Apache CouchDB v. 2.1.1
The purge operation is not implemented at cluster level in the CouchDB 2.x series (from 2.0.0 to 2.2.0), so it doesn't seem to be an option in your case.
It seems this will be supported in the next release, 2.3.0. You can check the related issue here.
The same issue includes a possible workaround based on a database switch approach described here.
In your case, with Apache CouchDB 2.1.1, a database switch is the only viable option.
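As a rough sketch of that database-switch workaround (assuming admin access to the CouchDB 2.1.1 node at http://localhost:5984; the "olddb"/"newdb" names and the filter name are placeholders), you can copy everything except the tombstones into a fresh database and then drop the old one:

import requests

COUCH = "http://localhost:5984"   # placeholder
AUTH = ("admin", "secret")        # placeholder

# 1. Filter on the SOURCE that excludes deleted documents from replication.
requests.put(f"{COUCH}/olddb/_design/repl", auth=AUTH, json={
    "filters": {"nodeleted": "function(doc, req) { return !doc._deleted; }"}
}).raise_for_status()

# 2. Create the replacement database and run a one-off filtered replication.
requests.put(f"{COUCH}/newdb", auth=AUTH)
requests.post(f"{COUCH}/_replicate", auth=AUTH, json={
    "source": f"{COUCH}/olddb",
    "target": f"{COUCH}/newdb",
    "filter": "repl/nodeleted"
}).raise_for_status()

# 3. Switch the application over to "newdb", then delete the old database and its
#    150K tombstones in one go:
# requests.delete(f"{COUCH}/olddb", auth=AUTH)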

CouchDB - update multiple databases

I'm quite new to CouchDB... let's say I have multiple databases in my CouchDB, one per user, and each database has one config document. I need to add a property to that document across all dbs. Is it possible to update that config document in all databases (without doing it one by one)? If yes, what is the best way to achieve this?
If I'm reading your question correctly, you should be able to update the document in one database, and then use filtered replication to update the other databases (though you can't have modified the document in those other databases, otherwise you'll get a conflict).
Whether it makes sense for your specific use case depends. If it's just a setting shared by all users, I'd probably create a shared settings database instead.
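A rough sketch of the filtered-replication idea (assuming a "master" database holding the canonical config document with the placeholder id "config", a list of per-user database names, and admin access; none of these names come from the question):

import requests

COUCH = "http://localhost:5984"             # placeholder
AUTH = ("admin", "secret")                  # placeholder
USER_DBS = ["userdb-alice", "userdb-bob"]   # placeholder names

# A filter on the source that only lets the config document through.
requests.put(f"{COUCH}/master/_design/sync", auth=AUTH, json={
    "filters": {"config_only": "function(doc, req) { return doc._id === 'config'; }"}
}).raise_for_status()

# Update the config document once in the master database (adding the new property),
# then push it to every user database with a filtered replication.
for db in USER_DBS:
    requests.post(f"{COUCH}/_replicate", auth=AUTH, json={
        "source": f"{COUCH}/master",
        "target": f"{COUCH}/{db}",
        "filter": "sync/config_only"
    }).raise_for_status()

# Caveat from above: if a user database has modified its own copy of the config
# document, this replication will introduce a conflict there.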
I do not think that is possible, and I don't think it is the intended use for CouchDB. Even if you have everything in a single database, it's not possible to do it in a quick way (there is no equivalent of a SQL UPDATE ... WHERE statement).

Azure DocumentDb Consistency level suggestion

I've set up an Azure batch process to read multiple CSV files at the same time and write to Azure DocumentDb. I need a suggestion on the consistency level that fits me best.
I read through the consistency levels document (http://azure.microsoft.com/en-us/documentation/articles/documentdb-consistency-levels/) but am unable to relate my business case to the options provided there.
My process:
Get document by id.
- If found, pull a copy of the document, apply the changes, and replace it.
- If not found, create a new entry.
If your writes and reads are from the same process (or you can share an instance of the DocumentClient), then session consistency will give you the best performance while ensuring you get consistent reads. This is because the SDK manages session tokens, ensuring that reads go to a replica that has seen the write. Even if you don't do this, in your case the write will fail if you reuse the same document id: within a collection, document ids are guaranteed to be unique.
Short version - session consistency (the default) is probably a good choice.
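For what it's worth, a sketch of the read-then-replace-or-create flow with session consistency, using the current azure-cosmos Python SDK rather than the older DocumentDB SDKs the question predates; the endpoint, key, database/container names, and the partition key field are all placeholders.

from azure.cosmos import CosmosClient, exceptions

client = CosmosClient("https://myaccount.documents.azure.com:443/",   # placeholder
                      credential="<key>",                             # placeholder
                      consistency_level="Session")  # SDK tracks session tokens for you
container = client.get_database_client("mydb").get_container_client("items")

def upsert_record(doc: dict) -> dict:
    """Replace the document if it exists, otherwise create it."""
    try:
        existing = container.read_item(item=doc["id"],
                                       partition_key=doc["pk"])  # "pk" is an assumed field
        existing.update(doc)                  # apply the changes from the CSV row
        return container.replace_item(item=existing, body=existing)
    except exceptions.CosmosResourceNotFoundError:
        return container.create_item(body=doc)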

Does Cloudant get conflicts without outside replication into it?

I'm planning on having my database stored in Cloudant.
We do not plan to have replication into Cloudant, only out of it for backup purposes.
Is it safe to assume that there should not be any conflicts in documents arising from the inner workings of BigCouch?
It is safe to assume that the clustered, BigCouch-inspired code we run at Cloudant does not normally create additional conflicts in your documents. If you want to become a power user you can read up on 'quorum' at docs.cloudant.com, but you can safely ignore that to first order.

SaaS, central database, database per user, or combination?

Problem at hand is as follows:
SaaS to keep maintenance records
95% of data would be specific to each user i.e. no need to be accessed by other users
5% of data shared (and contributed by all users), like parts that are used in maintenance
SaaS to be delivered as a CouchApp, i.e. with a public-facing CouchDB
So I am torn between database per user, and single database for all users.
Database per user seems to offer much easier backup and maintenance, smaller data sets, and easier access control. On the negative side, how would I handle the shared data?
Is it possible to have database per user, and one common database for shared information (parts)? Then replicate parts documents from all user databases to central one, from there back to all user databases? How to handle conflicts in that case (or even better avoid if possible)?
Or is there a much simpler approach? Or should I bite the bullet and go with just one central database?
It depends on the nature of the shared data, I guess. It seems natural to have filtered replication flowing from the user databases to the shared database and unfiltered replication from the shared database back to the user databases; I think that covers your requirements? It makes it so that each user only has to read from and write to their own database, while you can still distribute the shared docs to everyone.
It may be easier to query from the shared database directly instead of replicating it back into the user databases, but that really depends on what kind of data would be in there.
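A sketch of that topology (assuming per-user databases, a shared "parts" database, a "type" field on the shared documents, and admin access; all the names are placeholders):

import requests

COUCH = "http://localhost:5984"             # placeholder
AUTH = ("admin", "secret")                  # placeholder
USER_DBS = ["userdb-alice", "userdb-bob"]   # placeholder names

for db in USER_DBS:
    # Filter on each user database that passes only the shared "part" documents.
    requests.put(f"{COUCH}/{db}/_design/shared", auth=AUTH, json={
        "filters": {"parts": "function(doc, req) { return doc.type === 'part'; }"}
    }).raise_for_status()

    # user -> shared: filtered and continuous (users contribute parts).
    requests.post(f"{COUCH}/_replicator", auth=AUTH, json={
        "source": f"{COUCH}/{db}",
        "target": f"{COUCH}/parts",
        "filter": "shared/parts",
        "continuous": True
    }).raise_for_status()

    # shared -> user: unfiltered and continuous (everyone sees all parts).
    requests.post(f"{COUCH}/_replicator", auth=AUTH, json={
        "source": f"{COUCH}/parts",
        "target": f"{COUCH}/{db}",
        "continuous": True
    }).raise_for_status()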
