I have a 34 GB database (on server A), and I want to delete part of its documents to improve performance, after first creating a replica of the database.
I followed these steps:
created a local replica of the database
deleted several documents from the original database
I want to be sure I can recover the deleted documents into the original database, if needed, using the replica.
So I tried a pull into the original database from the local replica, and a push from the replica to the original database.
Nothing happened: 0 documents added. I'm not able to "re-import" the documents.
What's wrong?
They're not supposed to come back! Replication goes both ways, and the most recent change to a document overwrites an older version, but deletion always wins.
Well... almost always.
When a document is deleted in one replica, a 'deletion stub' is left in its place. As long as that stub exists in the replica, a version of that document in another replica will not replicate back. The stub blocks it. That's why deletion wins.
But stubs are purged after a period of time called the 'purge interval'. The default purge interval is 30 days. After a stub has been purged from a replica, deletion can't win any more because there is nothing left to block an old revision from replicating back from another replica. The thing is, usually this is a Bad Thing. Usually when documents are deleted, you want them to stay deleted. You don't want them to reappear just because somebody kept a replica off-line for 31 days.
Now, there are some ways that you can try to control this process carefully, purging stubs and using something else (e.g., selective replication settings) to prevent deletions from coming back except when you want them to. There are ways to try, but one slip-up with one setting in one replica, and boom! Bad things happen. And that includes any replica, including ones that you are not controlling carefully. It's a bad idea. I agree completely with @Karl-Henry on this.
Also, selective replication is evil and should be avoided at all costs. That's just my opinion, anyhow, but I have a lot of scars left over from the days before I came to that conclusion.
Here are two Lotus tech notes about replica stubs and the purge interval: Purging documents in Lotus Notes, How to purge document deletion stubs immediately. Please use what you learn from these technotes wisely. I urge you not to use this knowledge to try to construct a replication-based backup/restore scheme!
I would be very careful using a replica as an archive like that. I could see someone replicating the wrong way, and that would cause some issues...
I have designed some archive solutions for several of my big databases here at work. I simply have a separate database (same design) designated as the archive. I then have a manually triggered or scheduled agent (different in different databases) that identifies the documents to be archived and moves them from the production database to the archive. I then have functions to move documents back into production if needed.
Related
We have a system that uses CouchDB as its database.
We are using continuous replication to create an always-updated copy of our database.
Recently we have discovered a strange behavior (maybe bug?) that I hope someone here could help me with:
We set the system with normal replication (NOT filtered).
We update the same document several times consecutively (each time waiting for CouchDB to return 200 OK) - this part works fine and the document appears to be updated just fine in the replicated DB.
However, when we try to delete this document, even minutes after the consecutive updates, it is not deleted in the replicated DB and instead just reverts to a revision from before the consecutive updates.
It is important to note that we delete by adding a _deleted field set to true
I understand there is some problem with deletion using HTTP DELETE combined with filtered replication, but we're not using either.
Also, doing the same updates but waiting a second between each one solves the problem just fine (as does combining them into one update).
However, neither solution is possible for us, and in any case they just work around the problem.
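In code, the sequence we run is roughly the following (a simplified sketch in Python with the requests library; the database names, document id and field names are placeholders):

import requests

BASE = "http://localhost:5984"          # assumed local CouchDB
SRC, TGT = "source_db", "target_db"     # placeholder database names

# Start normal (unfiltered) continuous replication from SRC to TGT.
requests.post(BASE + "/_replicate",
              json={"source": SRC, "target": TGT,
                    "create_target": True, "continuous": True})

# Several consecutive updates, each one waiting for CouchDB's response.
doc_id = "some-doc"
rev = requests.put(f"{BASE}/{SRC}/{doc_id}", json={"counter": 0}).json()["rev"]
for i in range(1, 4):
    rev = requests.put(f"{BASE}/{SRC}/{doc_id}",
                       json={"_rev": rev, "counter": i}).json()["rev"]

# Delete by writing a _deleted=true revision (not the HTTP DELETE verb).
requests.put(f"{BASE}/{SRC}/{doc_id}", json={"_rev": rev, "_deleted": True})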
tl;dr:
1) CouchDB with normal continuous replication
2) Consecutive updates to document
3) _deleted = true to document
4) Replicated DB does not delete, instead reverts to _rev before #2
Environment:
CouchDB version is 1.6.1
Windows computer
Using CouchDB-Lucene
Most likely you have introduced some conflicts in the documents. When a document is being edited in several replicas, CouchDB chooses a winning revision when replicating, but also keeps the losing revisions. If you delete the winning revision, the losing revision will be displayed again. You can read an introduction in the (now somewhat outdated) CouchDB Guide: http://guide.couchdb.org/draft/conflicts.html and in the CouchDB Docs: http://docs.couchdb.org/en/1.6.1/replication/conflicts.html
But in short, the replication database might have been edited by someone. It might be that you replicated several databases into one, or somebody edited the documents manually in the target database.
You can delete the target database and recreate an empty db. If you don't edit the target db by hand and don't replicate multiple dbs into one, _deletes will be replicated correctly from then on.
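If you want to check whether that is what happened, you can ask CouchDB for the conflicting revisions directly. A rough sketch in Python with the requests library (database and document names are placeholders):

import requests

BASE = "http://localhost:5984"     # assumed local CouchDB
DB, DOC = "target_db", "some-doc"  # placeholder names

# Ask for the winning revision plus any conflicting revisions.
doc = requests.get(f"{BASE}/{DB}/{DOC}", params={"conflicts": "true"}).json()
print(doc.get("_conflicts", []))   # a non-empty list means the doc has conflicts

# Or fetch every leaf revision, including deleted ones.
leaves = requests.get(f"{BASE}/{DB}/{DOC}",
                      params={"open_revs": "all"},
                      headers={"Accept": "application/json"}).json()
print(leaves)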
Problem solved.
It was the revision limit.
It seems that quickly making more changes to a document than the revision limit allows causes problems for the replication mechanism.
There is an unsolved bug in CouchDB about this issue:
https://issues.apache.org/jira/browse/COUCHDB-1649
Since the revision limit we had was 2, doing 3 consecutive updates to the same document and then deleting it caused this problem.
Setting the revision limit to 5 avoids it.
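For anyone else hitting this: the revision limit is a per-database setting you can read and change over HTTP. A small sketch (the database name is a placeholder; the CouchDB default is 1000):

import requests

BASE = "http://localhost:5984"  # assumed local CouchDB
DB = "source_db"                # placeholder database name

# The revision limit is a plain number stored per database.
print("current _revs_limit:", requests.get(f"{BASE}/{DB}/_revs_limit").text)

# Raise it; in our case going from 2 to 5 was enough to avoid the bug.
requests.put(f"{BASE}/{DB}/_revs_limit", data="5",
             headers={"Content-Type": "application/json"})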
I'm trying to avoid revisions building up in my CouchDB, and I also want to use TouchDB's "bulk pull" for replication (it bulk-pulls all 1st-revs). Would it be bad practice to just delete a document and recreate it, rather than modifying it, so that all documents stay at rev-1?
Deleting a document in CouchDB will not reset the _rev.
CouchDB never deletes a document, it simply marks the last revision as deleted. Compaction will delete previous revisions, keeping only the last one. This is needed for replication to work properly. And this is why the deleted revision of a document should not contain any data, but only the _id of the document and the _deleted flag.
The only method to completely remove any traces of deleted documents, is to copy all documents to a new database. But keep in mind the consequences on replication.
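You can verify the rev behaviour yourself with a throwaway database; a rough sketch in Python with requests (names are made up):

import requests

BASE = "http://localhost:5984"  # assumed local CouchDB
DB = "rev_test"                 # throwaway database for illustration

requests.put(f"{BASE}/{DB}")                                          # create the db
rev1 = requests.put(f"{BASE}/{DB}/doc", json={"a": 1}).json()["rev"]  # 1-...
rev2 = requests.delete(f"{BASE}/{DB}/doc",
                       params={"rev": rev1}).json()["rev"]            # 2-... (deletion stub)

# "Recreating" the document extends the existing revision history;
# the new revision is typically 3-..., not a fresh 1-...
rev3 = requests.put(f"{BASE}/{DB}/doc", json={"a": 2}).json()["rev"]
print(rev1, rev2, rev3)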
Well, I want to say that your proposal makes me feel dirty, but that wouldn't be an SO answer, so..
You mention TouchDB and bulk pull, so you have a mobile app with data that can be modified externally and, I assume, wants to be able to modify its own data. So the biggest issue I can think of would be update conflict resolution, i.e. how do you handle changes to the document on both the client and the server while the client is offline? I think you'll start having to do a lot of the synchronisation work that Couch is meant to handle for you.
Can CouchDB handle thousands of separate databases on the same machine?
Imagine you have a collection of BankTransactions. There are many thousands of records. (EDIT: not actually storing transactions--just think of a very large number of very small, frequently updating records. It's basically a join table from SQL-land.)
Each day you want a summary view of transactions that occurred only at your local bank branch. If all the records are in a single database, regenerating the view will process all of the transactions from all of the branches. This is a much bigger chunk of work, and unnecessary for the user who cares only about his particular subset of documents.
This makes it seem like each bank branch should be partitioned into its own database, in order for the views to be generated in smaller chunks, and independently of each other. But I've never heard of anyone doing this, and it seems like an anti-pattern (e.g. duplicating the same design document across thousands of different databases).
Is there a different way I should be modeling this problem? (Should the partitioning happen between separate machines, not separate databases on the same machine?) If not, can CouchDB handle the thousands of databases it will take to keep the partitions small?
(Thanks!)
[Warning, I'm assuming you're running this in some sort of production environment. Just go with the short answer if this is for a school or pet project.]
The short answer is "yes".
The longer answer is that there are some things you need to watch out for...
You're going to be playing whack-a-mole with a lot of system settings like max file descriptors.
You'll also be playing whack-a-mole with erlang vm settings.
CouchDB has a "max open databases" option. Increase this or you're going to have pending requests piling up.
It's going to be a PITA to aggregate multiple databases to generate reports. You can do it by polling each database's _changes feed, modifying the data, and then throwing it back into a central/aggregating database. The tooling to make this easier is just not there yet in CouchDB's API. Almost, but not quite.
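Roughly, that polling loop looks like the sketch below (database names, the checkpoint handling and the conflict handling are all simplified placeholders; a real version would persist last_seq and deal with existing revisions in the aggregate db):

import requests

BASE = "http://localhost:5984"              # assumed local CouchDB
BRANCH_DBS = ["branch_001", "branch_002"]   # placeholder per-branch databases
AGGREGATE = "aggregate"                     # placeholder central database
checkpoints = {db: 0 for db in BRANCH_DBS}  # would be persisted in a real version

for db in BRANCH_DBS:
    changes = requests.get(f"{BASE}/{db}/_changes",
                           params={"since": checkpoints[db],
                                   "include_docs": "true"}).json()
    docs = []
    for row in changes["results"]:
        doc = row.get("doc")
        if doc and not row.get("deleted"):
            doc["_id"] = f"{db}:{doc['_id']}"  # re-key so branches can't collide
            doc.pop("_rev", None)              # real code must handle existing revs
            docs.append(doc)
    if docs:
        requests.post(f"{BASE}/{AGGREGATE}/_bulk_docs", json={"docs": docs})
    checkpoints[db] = changes["last_seq"]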
However, the biggest problem that you're going to run into if you try to do this is that CouchDB does not horizontally scale [well] by itself. If you add more CouchDB servers they're all going to have duplicates of the data. Sure, your max open dbs count will scale linearly with each node added, but other things like view build time won't (ex., they'll all need to do their own view builds).
In contrast, I've seen thousands of open databases on a BigCouch cluster. Anecdotally, that's because of its Dynamo-style clustering: more nodes doing different things in parallel, versus walled-off CouchDB servers replicating to one another.
Cheers.
I know this question is old, but wanted to note that now with more recent versions of CouchDB (3.0+), partitioned databases are supported, which addresses this situation.
So you can have a single database for transactions, and partition them by bank branch. You can then query all transactions as you would before, or query just for those from a specific branch, and only the shards where that branch's data is stored will be accessed.
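For illustration, a minimal sketch of that layout against a CouchDB 3.x node (database and id naming are placeholders, and authentication is omitted):

import requests

BASE = "http://localhost:5984"   # assumed CouchDB 3.x node, auth omitted
DB = "transactions"              # placeholder database name

# Create a partitioned database; the partition key is the part of the _id before ":".
requests.put(f"{BASE}/{DB}", params={"partitioned": "true"})

# Document ids take the form "<partition>:<doc id>", here partitioned by branch.
requests.post(f"{BASE}/{DB}", json={"_id": "branch-042:txn-0001", "amount": 12.50})

# Query a single branch's partition; only the shards holding that partition are read.
rows = requests.get(f"{BASE}/{DB}/_partition/branch-042/_all_docs").json()
print(rows["rows"])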
Multiple databases are possible, but for most cases I think the aggregate database will actually give better performance to your branches. Keep in mind that you're only optimizing when a document is updated into the view; each document will only be parsed once per view.
For end-of-day polling in an aggregate database, the first branch will cause 100% of the new docs to be processed, and pay 100% of the delay. All other branches will pay 0%. So most branches benefit. For end-of-day polling in separate databases, all branches pay a portion of the penalty proportional to their volume, so most come out slightly behind.
For frequent view updates throughout the day, active branches prefer the aggregate and low-volume branches prefer separate. If one branch in 10 adds 99% of the documents, most of the update work will be done on other branches' polls, so 9 out of 10 prefer separate dbs.
If this latency matters, and assuming couch has some clock cycles going unused, you could write a 3-line loop/view/sleep shell script that updates some documents before any user is waiting.
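For example, a small warm-the-view loop might look like this (design document, view and interval are placeholders):

import time
import requests

BASE = "http://localhost:5984"                            # assumed local CouchDB
DB, DDOC, VIEW = "transactions", "reports", "by_branch"   # placeholder names

while True:
    # Querying the view makes CouchDB fold any new documents into the index,
    # so interactive users later hit an already up-to-date view.
    requests.get(f"{BASE}/{DB}/_design/{DDOC}/_view/{VIEW}", params={"limit": 1})
    time.sleep(60)   # arbitrary interval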
I would add that having a large number of databases creates issues around compaction and replication. Not only do things like continuous replication need to be triggered on a per-database basis (meaning you will have to write custom logic to loop over all the databases), but they also spawn replication daemons per database. This can quickly become prohibitive.
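In practice that per-database loop ends up looking something like the following sketch (the remote target URL is hypothetical):

import requests

BASE = "http://localhost:5984"              # assumed local CouchDB
REMOTE = "http://backup.example.com:5984"   # hypothetical replication target

for db in requests.get(f"{BASE}/_all_dbs").json():
    if db.startswith("_"):
        continue   # skip system databases such as _users and _replicator
    # Trigger compaction for this database.
    requests.post(f"{BASE}/{db}/_compact",
                  headers={"Content-Type": "application/json"})
    # Start (or re-register) a continuous replication for this database.
    requests.post(f"{BASE}/_replicate",
                  json={"source": db, "target": f"{REMOTE}/{db}",
                        "create_target": True, "continuous": True})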
I recently started using Couch for a large app I'm working on.
I have a database with 7907 documents and wanted to rename it. I poked around for a bit but couldn't figure out how to rename it, so I figured I would just replicate it to a local database with the name I wanted.
The first time I tried, the replication failed; I believe the error was a timeout. I tried again, and it worked very quickly, which was a little disconcerting.
After the replication, I'm showing that the new database has the correct number of records, but the database size is about 1/3 of the original.
Also a little odd is that if I refresh Futon, the size of the original fluctuates between 94.6 and 95.5 MB.
This leaves me with a few questions:
Is the 2nd database storing references to the first? If so, can I delete the first without causing harm?
Why would the size be so different? Had the original built indexes that the new one eventually will?
Why is the size fluctuating?
edit:
A few things that might be helpful:
This is on a Cloudant CouchDB install.
I checked the first and last record of the new db, and they match, so I don't believe futon is underreporting.
Replicating to a new database is similar to compaction. Both involve certain side-effects (incidentally, and intentionally, respectively) which reduce the size of the new .couch file.
The b-tree indexes get balanced
Data from old document revisions is discarded.
Metadata from previous updates to the DB is discarded.
Replications store checkpoints on both the source and the target, so if you re-replicate from the same source to the same target (i.e. re-run a replication that timed out), it will pick up where it left off.
Answers:
Replication does not create a reference to another database. You can delete the first without causing harm.
Replicating (and compacting) generally reduces disk usage. If you have any views in any design documents, those will re-build when you first query them. View indexes use their own .view file which also consumes space.
I am not sure why the size is fluctuating. Browser and proxy caches are the bane of CouchDB (and web) development. But perhaps it is also a result of internal Cloudant behavior (for example, different nodes in the cluster reporting slightly different sizes).
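If you want to double-check the copy, comparing the two databases' info documents is usually enough. A quick sketch (database names are placeholders; the exact size fields differ between CouchDB versions and Cloudant):

import requests

BASE = "http://localhost:5984"      # assumed; for Cloudant use your account URL
OLD, NEW = "old_name", "new_name"   # placeholder database names

for db in (OLD, NEW):
    info = requests.get(f"{BASE}/{db}").json()
    # doc_count should match; disk_size usually shrinks after replication,
    # for the same reasons it shrinks after compaction.
    print(db, info.get("doc_count"), info.get("disk_size"), info.get("sizes"))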
I'm putting a simple CouchDB to the test of the theory that CouchDB compaction is totally indifferent to deleted docs.
Deleting a doc from Couch via the DELETE method yields the following when trying to retrieve it:
localhost:5984/enq/deleted-doc-id
{"error":"not_found","reason":"deleted"}
Expected.
Now I compact the database:
localhost:5984/enq/_compact
{"ok": true}
And check compaction has finished
"compact_running":false
Now I would expect CouchDB to return not_found, reason "missing" on a simple GET
localhost:5984/enq/deleted-doc-id
{"error":"not_found","reason":"deleted"}
And trying with ?rev=deleted_rev gives me a full doc, yay for worthless data.
So am I correct in thinking that CouchDB compaction gives no special treatment to deleted docs and simply looks at the rev count against the rev limit when deciding what is part of compaction? Is there a special rev_limit we can set for deleted docs?
Surely the only solution can't be a _purge? At the moment we must have thousands of orphaned deleted docs, and whilst we want to maintain some version history for normal docs, we don't want to reduce our rev_limit to 1 to assist in this scenario.
What are the replication issues we should be aware of with purge?
Deleted documents are preserved forever (because it's essential to providing eventual consistency between replicas). So, the behaviour you described is intentional.
To delete a document as efficiently as possible use the DELETE verb, since this stores only _id, _rev and the deleted flag. You can, of course, achieve the same more manually via POST or PUT.
Finally, _purge exists only for extreme cases where, for example, you've put an important password into a couchdb document and need it be gone from disk. It is not a recommended method for pruning a database, it will typically invalidate any views you have (forcing a full rebuild) and messes with replication too.
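For reference, the two equivalent ways to delete look roughly like this (database and document names are placeholders):

import requests

BASE = "http://localhost:5984"    # assumed local CouchDB
DB, DOC = "enq", "some-doc-id"    # placeholder document id

rev = requests.get(f"{BASE}/{DB}/{DOC}").json()["_rev"]

# Preferred: the DELETE verb leaves only a minimal tombstone (_id, _rev, _deleted).
requests.delete(f"{BASE}/{DB}/{DOC}", params={"rev": rev})

# Equivalent "manual" form: PUT a stub containing only the _deleted flag.
# Do NOT include the rest of the document body, or it will be kept forever.
# requests.put(f"{BASE}/{DB}/{DOC}", json={"_rev": rev, "_deleted": True})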
Adding a document, deleting it, and then compacting does not return the CouchDB database to a pristine state. A deleted document is retained through compaction, though in the usual case the resulting document is small (just the _id, _rev and _deleted=true). The reason for this is replication. Imagine the following:
Create document.
Replicate DB to remote DB.
Delete document.
Compact DB.
Replicate DB to remote DB again.
If the document is totally removed after deletion+compaction, then the second replication won't know to tell the remote DB that the document has been deleted. This would result in the two DBs being inconsistent.
There was an issue reported that could result in the document in the DB not being small; however it did not pertain to the HTTP DELETE method AFAIK (though I could be wrong). The ticket is here:
https://issues.apache.org/jira/browse/COUCHDB-1141
The basic idea is that audit information can be included with the DELETE that will be kept through compaction. Make sure you aren't posting the full doc body with the DELETE method (doing so might explain why the document isn't actually removed).
To clarify... from our experience you have to kick off a DELETE with the id and then a compact in order to fully remove the document data.
As pointed out above you will still have the "header data" in your database afterwards.