How to keep CouchDB efficient with a lot of DELETEs: purge? - couchdb

I have a CouchDB database with ~2000 documents (50 MB), but 150K deleted documents after 3 months, and that number will keep increasing.
So, what is the best strategy to keep performance high?
Use purge + compact, or periodically re-create the entire database?
The CouchDB documentation recommends re-creating the database when storing short-term data. That isn't my case, but deletes are constant for certain kinds of documents.
DELETE operation
If your use case creates lots of deleted documents (for example, if you are storing short-term data like log entries, message queues, etc), you might want to periodically switch to a new database and delete the old one (once the entries in it have all expired).
Using Apache CouchDB v. 2.1.1

The purge operation is not implemented at the cluster level in the CouchDB 2.x series (from 2.0.0 to 2.2.0), so it doesn't seem to be an option in your case.
It seems this will be supported in the next release, 2.3.0. You can check the related issue here.
The same issue includes a possible workaround based on a database switch approach, described here.
In your case, with Apache CouchDB 2.1.1, a database switch is the only viable option.
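To make the database switch concrete, here is a minimal sketch (not from the original answer), assuming Node 18+ with its built-in fetch, admin credentials, and made-up database and filter names: create a fresh database, copy only the live documents across with a filtered replication so the tombstones stay behind, then re-point the application and delete the old database.

    const couch = "http://admin:secret@localhost:5984"; // assumption: admin URL
    const oldDb = "mydb";       // assumption: the existing database
    const newDb = "mydb_clean"; // assumption: the replacement database

    async function switchDatabase(): Promise<void> {
      // 1. Create the new, empty database.
      await fetch(`${couch}/${newDb}`, { method: "PUT" });

      // 2. Add a replication filter to the old database that skips tombstones.
      await fetch(`${couch}/${oldDb}/_design/maint`, {
        method: "PUT",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          filters: { no_deleted: "function(doc, req) { return !doc._deleted; }" },
        }),
      });

      // 3. One-off filtered replication: live documents only, no deleted ones.
      await fetch(`${couch}/_replicate`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          source: `${couch}/${oldDb}`,
          target: `${couch}/${newDb}`,
          filter: "maint/no_deleted",
        }),
      });

      // 4. Re-point the application at newDb, then drop the old database.
      await fetch(`${couch}/${oldDb}`, { method: "DELETE" });
    }

    switchDatabase().catch(console.error);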

Related

Shopware 6 partitioning

Has anyone had any experience with database partitioning? We already have a lot of data, and queries on it are starting to slow down. Maybe someone has some examples? These are tables related to orders.
Shopware, since version 6.4.12.0, allows the use of database clusters; see the relevant documentation. You will have to set up a number of read-only nodes first. The load of reading data will then be distributed among the read-only nodes, while write operations are restricted to the primary node.
Note that in a cluster setup you should also use a lock storage that complements the setup.
Besides using a DB cluster, you can also try to reduce the load on the DB server.
The first thing you should do is enable the HTTP cache; better still, additionally set up a reverse proxy cache like Varnish. This will greatly decrease the number of requests that hit your web server, and thus your DB server as well.
Besides that, all the measures explained here should improve the overall performance of your shop as well as decrease the load on the DB.
Additionally, you could use Elasticsearch so that costly search requests won't hit the database, use a "real" message queue so that messages are not stored in the database, and use Redis instead of the database for storing performance-critical information, as documented in the articles in this category of the official docs.
The impact of all those measures probably depends on your concrete project setup, so maybe you will see something in the DB locks that hints at one of the points I mentioned previously; that would be an indicator to start in that direction. For example, if you see a lot of search-related queries, Elasticsearch would be a great start, but if a lot of the DB load comes from writing/reading/deleting messages, then the message queue might be a better starting point.
All in all, when you use a DB cluster with a primary and multiple replicas and the additional services I mentioned here, your shop should be able to scale quite well without the need to partition the actual DB.

Why does CouchDB generate a conflict when syncing a PouchDB document with a _rev difference higher than the revisions limit?

This is strange behavior. First, we sync the CouchDB and PouchDB databases. Second, the PouchDB database goes offline. After many modifications to a document, it goes online and syncs with CouchDB. If the PouchDB document's _rev number is higher than the CouchDB _rev number plus the revs limit, CouchDB generates a 409 "Document update conflict". Why? And what can we do to avoid it?
Unfortunately, this is the expected behaviour for the revs_limit option in PouchDB. The documentation says
revs_limit: Specify how many old revisions we keep track (not a copy) of. Specifying a low value means Pouch may not be able to figure out whether a new revision received via replication is related to any it currently has which could result in a conflict. Defaults to 1000.
If your PouchDB revs_limit is too low, it cannot be determined whether your local revision actually has the server revision in its history and therefore throws a conflict.
The straightforward fix would be to increase the local revision limit. But if you're generating over 1000 revisions between syncs, you should consider changing your data model: split up large JSON documents, store modifications as new documents, and merge the data in the app instead of modifying the full document.
If that's not an option, simply check for conflicts and resolve them by deleting the server version whenever they occur.
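As a rough sketch of both ideas, assuming the pouchdb npm package and a made-up server URL (resolveConflicts is just an illustrative helper name): raise revs_limit when opening the local database, and when a conflict still slips through, delete the losing revisions so the surviving version wins.

    import PouchDB from "pouchdb";

    // Raising revs_limit keeps a longer revision history, so the replicator
    // can still relate a heavily edited offline document to the server copy.
    const local = new PouchDB("app-local", { revs_limit: 5000 });
    const remote = new PouchDB("http://localhost:5984/app"); // assumption: server URL

    local.sync(remote, { live: true, retry: true })
      .on("error", (err) => console.error("sync failed", err));

    // Fallback: fetch a document together with its conflicts and delete the
    // losing revisions; the deletions then replicate back to the server.
    async function resolveConflicts(id: string): Promise<void> {
      const doc = await local.get(id, { conflicts: true });
      for (const rev of doc._conflicts ?? []) {
        await local.remove(id, rev);
      }
    }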

Azure "/_partitionKey"

I created multiple collections through code without realising the importance of having a partition key. I have since read that the only way to add a partition key and redistribute the data is by deleting the collection and recreating it.
I don't really want to have to do this, as I have quite a lot of data already and want to avoid the downtime. When I look at the Scale & Settings menu in Azure for each of my collections, I see this below.
Can someone explain this? I thought my partition key was null, but it looks like MS has given me one called _partitionKey. Can I not just add _partitionKey to my documents and run a script to update them all to the key I want to use (e.g. country)?
This is a new feature which allows non-partitioned collections (now called containers in the latest SDKs) to start using partitions with 0 downtime. The big caveat is that you need to be using the latest SDKs, which will be announced GA really soon (in fact most are already published, just waiting on doc publishing, etc.). The portal got the feature first since it is already using the latest SDK under the covers.

What is the best way to resolve CouchDB document conflicts across 2 DB instances?

I have one application running on NodeJS and I am trying to make a distributed app. All write requests go to the Node application, which writes to CouchDB A and, on success of that, writes to CouchDB B. We read data through an ELB (which reads from the 2 DBs). It's working fine.
But I faced a problem recently: CouchDB B went down, and after CouchDB B came back up, there is now a document _rev mismatch between the 2 instances.
What would be the best approach to resolve the above scenario without any down time?
If your CouchDB A & CouchDB B are in the same data centre, then @Flimzy's suggestion of using CouchDB 2.0 in a clustered deployment is a good one. You can have n CouchDB nodes configured in a cluster with a load balancer sitting above the cluster, delivering HTTP(S) traffic to any node that is "up".
If A & B are geographically separated, you can use CouchDB replication to move data from A-->B and B-->A, which would keep both instances in sync. A & B could each be clusters of 3 or more CouchDB 2.0 nodes, or single instances of CouchDB 1.7.
None of these solutions will "fix" the problem you are seeing when two copies of the database are modified in different ways at the same time. This "conflict" state is CouchDB's way of preventing data loss when two writes clash. Your app can resolve the conflict by picking a winning revision or writing a new one. It's not a fault condition; it's helping your application avoid data loss from concurrent writes in a distributed system.
You can read more about document conflicts in this blog post series.
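For illustration, a minimal sketch of the bidirectional replication setup, assuming Node 18+ fetch and made-up node URLs, credentials, and database name; it writes one persistent replication document per direction into the _replicator database.

    const a = "http://admin:secret@couch-a:5984"; // assumption: node A
    const b = "http://admin:secret@couch-b:5984"; // assumption: node B
    const db = "mydb";                            // assumption: database name

    async function setupReplication(): Promise<void> {
      // One continuous replication per direction keeps both sides in sync
      // and surfaces conflicts instead of silently dropping writes.
      for (const [source, target, name] of [
        [a, b, "a_to_b"],
        [b, a, "b_to_a"],
      ]) {
        await fetch(`${a}/_replicator/${name}`, {
          method: "PUT",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({
            source: `${source}/${db}`,
            target: `${target}/${db}`,
            continuous: true,
          }),
        });
      }
    }

    setupReplication().catch(console.error);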
If both of your 1.6.x nodes are syncing databases using standard replication, turning off one node shouldn't be an issue. When the node comes back up, it receives all updates without conflicts, because there was no way to create them while the node was down.
If you experience conflicts during normal operation, unfortunately there is no general way to resolve them automatically. However, in most cases you can find a strategy of marking affected doc subtrees in a way that allows you to determine which version is the most recent (or more important).
To detect docs that have conflicts you may use standard views: a doc received by a view function has the _conflicts property if conflicting revisions exist. Using an appropriate view, you can detect conflicts and merge docs. In any case, regardless of how you detect conflicts, you need external code to resolve them.
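For example (a sketch only, with made-up database name and credentials, assuming Node 18+ fetch), a design document whose map function emits only conflicted documents, plus a small script that lists the conflicting revisions:

    const couch = "http://admin:secret@localhost:5984"; // assumption
    const db = "mydb";                                   // assumption

    async function listConflicts(): Promise<void> {
      // Map function emits only documents that currently have conflicts.
      await fetch(`${couch}/${db}/_design/conflicts`, {
        method: "PUT",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          views: {
            all: {
              map: "function(doc) { if (doc._conflicts) { emit(doc._id, doc._conflicts); } }",
            },
          },
        }),
      });

      // Each row's value is the list of conflicting revisions for that doc;
      // external code still has to merge them and delete the losers.
      const res = await fetch(`${couch}/${db}/_design/conflicts/_view/all`);
      const { rows } = await res.json();
      for (const row of rows) {
        console.log(`doc ${row.id} has conflicting revs:`, row.value);
      }
    }

    listConflicts().catch(console.error);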
If your conflicting data is numeric by nature, consider using CRDT structures and standard map/reduce to obtain the final value. If your data is text-like, you may also try to use CRDTs, but to obtain reasonable performance you need to use reducers written in Erlang.
As for 2.x: I do not recommend using 2.x for your case (actually, for any real case except experiments). First, using 2.x will not remove conflicts, so it does not solve your problem. Also, taking into account that 2.x requires a lot of poorly documented manual operations across nodes and is unable to rebalance, you will get more pain than value.
BTW, any cluster solution makes very little sense for just two nodes.
As for the above-mentioned CVE-2017-12635 and CouchDB 1.6.x: you can use this patch https://markmail.org/message/kunbxk7ppzoehih6 to cover the vulnerability.

design pattern to expire documents on cloudant

So when a document is deleted, the metadata is actually preserved forever. For a hosted service like Cloudant, where storage costs money every month, I would instead like to completely purge the deleted documents.
I read somewhere about a design pattern where you use dbcopy in a view to put the docs into a 'current' db and then periodically delete the expired dbs. But I can't find the article, and I don't quite understand how database naming would work. How would the Cloudant clients always know the 'current' database name?
Cloudant does not expose the _purge endpoint (the loose consistency guarantees between the clustered nodes make purging tricky).
The most common solution to this problem is to create a second database and use replication with a validate_doc_update function, so that deleted documents with no existing entry in the target database are rejected. When replication is complete (or acceptably up to date, if using continuous replication), switch your application to use the new database and delete the old one. There is currently no way to rename databases, but you could use a virtual host which points to the "current" database.
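A minimal sketch of such a design document, assuming Node 18+ fetch, a made-up account URL and database name, and with authentication omitted: the validate_doc_update function rejects tombstones for documents the target database has never seen, while live documents and deletions of existing documents replicate normally.

    const account = "https://myaccount.cloudant.com"; // assumption: Cloudant URL
    const target = "mydb_current";                    // assumption: new database

    async function protectTarget(): Promise<void> {
      // Reject incoming tombstones for documents the target has never seen.
      await fetch(`${account}/${target}/_design/no_tombstones`, {
        method: "PUT",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          validate_doc_update:
            "function(newDoc, oldDoc, userCtx) {" +
            " if (newDoc._deleted && !oldDoc) {" +
            " throw({ forbidden: 'tombstone for unknown doc' });" +
            " } }",
        }),
      });
    }

    protectTarget().catch(console.error);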
I'd caution that a workload which generates a high ratio of deleted:active documents is generally an anti-pattern in Cloudant. I would first consider whether you can change your document model to avoid it.
Deleted documents are kept forever in CouchDB, even after compaction, though the size of a deleted document is pretty small as it contains only three fields:
    {"_id": "234wer", "_rev": "123", "_deleted": true}
The reason for this is to make sure that all the replicated databases stay consistent. If a document that is replicated to several databases were simply erased in one location, there would be no way to tell the other replicated stores about the deletion.
There is _purge, but as explained in the wiki it is only to be used in special cases.
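For reference, this is roughly what a purge plus compaction looks like on a CouchDB that does expose the endpoint (1.x, or 2.3.0+ where clustered purge arrived), assuming Node 18+ fetch; the URL, database name, and revision value are purely illustrative.

    const couch = "http://admin:secret@localhost:5984"; // assumption
    const db = "mydb";                                   // assumption

    async function purgeTombstone(): Promise<void> {
      // Body maps a document id to the leaf revision(s) to purge.
      await fetch(`${couch}/${db}/_purge`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ "234wer": ["3-917fa23"] }),
      });

      // Compaction afterwards reclaims the space the purged revisions held.
      await fetch(`${couch}/${db}/_compact`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
      });
    }

    purgeTombstone().catch(console.error);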
