I am very new to CouchDB. Missing SQL already.
Anyways, I need to create a view that emits a few attributes of my documents along with all the revision IDs.
Something like this
function(doc) {
if (doc.type == 'template') {
emit(doc.owner, {_id: doc._id, _rev: doc._rev, owner: doc.owner, meta: doc.meta, key: doc.key, revisions_ids: What goes here?});
}
}
But how do I tell it to include all the revisions?
I know I can call
http://localhost:5984/main/94c4db9eb51f757ceab86e4a9b00cddf
for each document (from my app), but that really does not scale well.
Is there a batch way to fetch revision info?
Any help would be appreciated!!
CouchDB revisions are not intended to be a version control system. They are only used for ensuring write consistency (and for avoiding the need for locks during concurrent writes).
That being said, only the most recent _rev number is useful for any given doc. Not only that, but a database compaction will delete all the old revisions as well. (A compaction is never run automatically, but it should be part of routine maintenance.)
As you may have already noticed, your view outputs the most recent _rev number in the value of your view output. Also, if you are using include_docs=true, then the _rev number is also shown in the doc portion of your view result.
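For illustration, here is a minimal sketch (Node 18+ with the global fetch; the database name main and the design document / view names templates and by_owner are assumptions, not anything from the question) showing that the current _rev is already available both in the emitted value and, with include_docs=true, in the returned doc:
// Minimal sketch: query the view and read the current _rev from each row.
// "main", "_design/templates" and "by_owner" are illustrative names only.
const url = 'http://localhost:5984/main/_design/templates/_view/by_owner?include_docs=true';
const result = await (await fetch(url)).json();

for (const row of result.rows) {
  // row.value is whatever the map function emitted (it includes _rev above),
  // and row.doc is the full current document, which also carries _rev.
  console.log(row.id, row.value._rev, row.doc._rev);
}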
Strategies do exist for using CouchDB for revision history, but they are generally complicated, and not usually recommended. (check out this question and this blogpost for more information on that subject)
Related to "Ways to implement data versioning in MongoDB" and "structure of documents for versioning of a time series on mongodb".
What data structure should I adopt for versioning when I also need to be able to handle queries?
Suppose I have 8500 documents of the form
{ _id: '12345-11',
noFTEs: 5
}
Each month I get details of a change to noFTEs in about 30 docs; I want to store the new data along with the previous one(s), together with a date.
That would seem to result in:
{ _id: '12345-11',
noFTEs: {
'2015-10-28T00:00:00+01:00': 5,
'2015-1-8T00:00:00+01:00': 3
}
}
But I also want to be able to do searches on the most recent data (e.g. noFTEs > 4, where the value considered should be 5, not 3). At that stage all I know is that I want to use the most recent data, and I will not know the key. So an alternative would be an array:
{ _id: '12345-11',
noFTEs: [
{date: '2015-10-28T00:00:00+01:00', val: 5},
{date: '2015-1-8T00:00:00+01:00', val: 3}
]
}
Another alternative - as suggested by @thomasbormans in the comments below - would be:
{ _id: '12345-11',
versions: [
{noFTEs: 5, lastModified: '2015-10-28T00:00:00+01:00', other data...},
{noFTEs: 3, lastModified: '2015-1-8T00:00:00+01:00', other...}
]
}
I'd really appreciate some insights into the considerations I need to make before jumping all the way in; I fear I will end up with queries that put a pretty high workload on Mongo. (In practice there are 3 other fields that can be combined for searching, and one of these is also likely to see changes over time.)
When you model a noSQL database, there are some things you need to keep in mind.
First of all is the size of each document. If you use arrays in your documents, be sure they won't push the document past the 16 MB per-document size limit.
Second, you must model your database in order to retrieve things easily. Some "denormalization" is acceptable in favor of speed and ease of use for your application.
So if you need to know the current noFTE value, and you need to keep a history only for audit purposes, you could go with 2 collections:
collection["current"] = [
{
_id: '12345-11',
noFTEs: 5,
lastModified: '2015-10-28T00:00:00+01:00'
}
]
collection["history"] = [
{ _id: ...an object id...
source_id: '12345-11',
noFTEs: 5,
lastModified: '2015-10-28T00:00:00+01:00'
},
{
_id: ...an object id...
source_id: '12345-11',
noFTEs: 3,
lastModified: '2015-1-8T00:00:00+01:00'
}
]
By doing it this way, you keep your most frequently accessed records smaller (I suppose the current version is accessed more often). This makes Mongo more likely to keep the "current" collection in the memory cache, and the documents will be retrieved faster from disk because they are smaller.
This design seems to me the best in terms of memory optimisation, but the decision depends directly on what use you will make of your data.
EDIT: I changed my original response in order to create separate inserts for each history entry. In my original answer, I tried to keep your history entries close to your original solution, to focus on the denormalization topic. However, keeping history in a growing array is a poor design decision, so I decided to make this answer more complete.
The reasons to keep separate inserts in the history collection instead of growing an array are many:
1) Whenever you change the size of a document (for example, by inserting more data into it), Mongo may need to move the document to an empty part of the disk in order to accommodate the larger document. This way, you end up creating storage gaps, making your collections larger.
2) Whenever you insert a new document, Mongo tries to predict how big it may become based on previous inserts/updates. If your history documents are all of similar size, the padding factor will be close to optimal; however, when you maintain growing arrays, this prediction won't be good and Mongo will waste space on padding.
3) In the future, you will probably want to shrink your history collection if it grows too large. Usually we define a retention policy for history (for example, 5 years), and back up and prune data older than that. If you have kept a separate document for each history entry, this operation is much easier.
I could find other reasons, but I believe those 3 are enough to make the point.
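For concreteness, here is a minimal mongo-shell sketch of this two-collection flow; the exact update sequence, field names and new values are assumptions based on the example above, not a definitive implementation.
// Query only the most recent data (the small, cache-friendly collection):
db.current.find({ noFTEs: { $gt: 4 } })

// When a monthly change arrives: archive the old state as its own history
// document, then overwrite the current one.
var old = db.current.findOne({ _id: '12345-11' })
if (old) {
  db.history.insert({
    source_id: old._id,
    noFTEs: old.noFTEs,
    lastModified: old.lastModified
  })
}
db.current.update(
  { _id: '12345-11' },
  { $set: { noFTEs: 6, lastModified: '2015-11-30T00:00:00+01:00' } },
  { upsert: true }
)
Note that this read-then-write sequence is not atomic; the findAndModify approach described in the next answer closes that gap.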
To add versioning without compromising usability and speed of access for the most recent data, consider creating two collections: one with the most recent documents and one to archive the old versions of the documents when they get changed.
You can use currentVersionCollection.findAndModify to update a document while also receiving the previous (or new, depending on parameters) version of said document in one command. You then just need to remove the _id of the returned document, add a timestamp and/or revision number (when you don't have these already) and insert it into the archive collection.
By storing each old version in its own document you also avoid document growth and prevent documents from bursting the 16 MB document limit when they get changed a lot.
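A minimal mongo-shell sketch of that flow (the collection names current and archive, and the revision field, are illustrative assumptions, not part of the answer above):
// Update the current document and get back its previous state in one command.
var previous = db.current.findAndModify({
  query: { _id: '12345-11' },
  update: { $set: { noFTEs: 6, lastModified: new Date().toISOString() },
            $inc: { revision: 1 } },
  new: false   // return the document as it was before the update
})

if (previous) {
  delete previous._id              // let MongoDB assign a fresh _id in the archive
  previous.source_id = '12345-11'
  previous.archivedAt = new Date()
  db.archive.insert(previous)
}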
So I've been trying to wrap my head around this one for weeks, but I just can't seem to figure it out. MongoDB isn't equipped to deal with rollbacks as we typically understand them (i.e. a client adds information to the database, like a username, but quits in the middle of the registration process; now the DB is left with some "hanging" information that isn't associated with anything). How can MongoDB handle that? Or, if no one can answer that question, maybe they can point me to a source/example that can? Thanks.
MongoDB does not support transactions: you can't perform atomic multi-statement transactions to ensure consistency. You can only perform an atomic operation on a single document at a time. When dealing with NoSQL databases you need to validate your data as much as you can; they seldom complain about anything. There are some workarounds or patterns to achieve SQL-like transactions. For example, in your case, you can store the user's information in a temporary collection, check the data's validity, and move it to the users collection afterwards.
This should be straightforward, but things get more complicated when we deal with multiple documents. In that case, you need to create a designated collection for transactions. For instance:
transaction collection
{
id: ..,
state : "new_transaction",
value1 : values From document_1 before updating document_1,
value2 : values From document_2 before updating document_2
}
// update document 1
// update document 2
Ooohh!! something went wrong while updating document 1 or 2? No worries, we can still restore the old values from the transaction collection.
This pattern is known as compensation; it mimics the transactional behavior of SQL.
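A minimal mongo-shell sketch of that compensation pattern, with illustrative collection names (transactions and accounts) and simplified error handling; this is an assumed example, not the only way to implement it:
// 1) Save the pre-update state of both documents in a transaction document.
var txnId = new ObjectId()
var doc1Before = db.accounts.findOne({ _id: 'doc1' })
var doc2Before = db.accounts.findOne({ _id: 'doc2' })
db.transactions.insert({ _id: txnId, state: 'new_transaction',
                         value1: doc1Before, value2: doc2Before })

try {
  // 2) Apply the updates; each one is atomic on its own, but not together.
  db.accounts.update({ _id: 'doc1' }, { $inc: { balance: -100 } })
  db.accounts.update({ _id: 'doc2' }, { $inc: { balance: 100 } })
  db.transactions.update({ _id: txnId }, { $set: { state: 'committed' } })
} catch (e) {
  // 3) Compensation: restore the saved values and mark the transaction failed.
  db.accounts.save(doc1Before)
  db.accounts.save(doc2Before)
  db.transactions.update({ _id: txnId }, { $set: { state: 'rolled_back' } })
}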
The response from CouchDB to a _changes request comes back in this format:
{"seq":12,"id":"foo","changes":[{"rev":"1-23202479633c2b380f79507a776743d5"}]}
My question - why is the "changes" element an array? What scenario would return more than one item in the changes element? I have never seen an example online with more than one item, and in my own experience I have only seen one item.
I'm writing code that interacts with changes, and I'd like to understand what to do if, in fact, there were more than one item.
Thanks,
Mike
The changes element is an array so that it can reflect all existing revision leaves for the document. As you know, CouchDB doesn't delete a document completely, but sets a tombstone instead, to prevent the document from being accidentally resurrected after replication from a source that still holds an older revision that wasn't yet deleted. It's also possible to have multiple leaves due to update conflicts that occur after replication. For example:
Mike has created a document in database A and replicated it to database B:
{"results":[
{"seq":1,"id":"thing","changes":[{"rev":"1-967a00dff5e02add41819138abb3284d"}]}
],
"last_seq":1}
John has received the document and updated it in database B:
{"results":[
{"seq":2,"id":"thing","changes":[{"rev":"2-7051cbe5c8faecd085a3fa619e6e6337"}]}
],
"last_seq":2}
But at the same time Mike also made a few changes to it in database A (he forgot to clean up some data, or added something important):
{"results":[
{"seq":2,"id":"thing","changes":[{"rev":"2-13839535feb250d3d8290998b8af17c3"}]}
],
"last_seq":2}
And replicated it again to database B. John receives the document in a conflicted state, and by looking at the changes feed with the query parameter style=all_docs he sees the following result:
{"results":[
{"seq":3,"id":"thing","changes":[{"rev":"2-7051cbe5c8faecd085a3fa619e6e6337"},{"rev":"2-13839535feb250d3d8290998b8af17c3"}]}
],
"last_seq":3}
While direct access to the document returns the data from the winning revision (the one with the higher seq number, or just the latest), it's possible for the document to have many conflicting revisions (imagine concurrent writes to a single document within a dozen databases that are all replicating with each other).
Now John decides to resolve this conflict by updating the actual revision, expecting the other one to be dropped:
{"results":[
{"seq":4,"id":"thing","changes":[{"rev":"3-2502757951d6d7f61ccf48fa54b7e13c"},{"rev":"2-13839535feb250d3d8290998b8af17c3"}]}
],
"last_seq":4}
Wait, Mike's revision is still there? Why? John, in a panic, deletes his document:
{"results":[
{"seq":5,"id":"thing","changes":[{"rev":"2-13839535feb250d3d8290998b8af17c3"}{"rev":"4-149c48caacb32c535ee201b6f02b027b"}]}
],
"last_seq":5}
Now his version of the document is deleted, but he's still able to access Mike's.
Replicating John's changes from database B to database A will also bring the tombstone along:
{"results":[
{"seq":3,"id":"thing","changes":[{"rev":"3-2adcbbf57013d8634c2362630697aab6"},{"rev":"4-149c48caacb32c535ee201b6f02b027b"}]}
],
"last_seq":3}
Why so? Because this is the document's history, the record of how its data "evolved": in the real world your documents may have many intermediate leaves distributed among a large number of databases, and to prevent silent data overwrites during the replication process CouchDB keeps each leaf to help resolve such conflicts. A longer and probably better explanation can be found in the CouchDB wiki on replication and conflicts; the changes feed query parameters are described there as well.
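If you're writing code against the feed, here is a minimal sketch (Node 18+ fetch; the database name db is an assumption) of one way to react when the changes array holds more than one revision:
// Read the changes feed with style=all_docs so every leaf revision is listed.
const base = 'http://localhost:5984/db';
const feed = await (await fetch(`${base}/_changes?style=all_docs`)).json();

for (const row of feed.results) {
  if (row.changes.length > 1) {
    // More than one leaf: the document has unresolved conflicts.
    // Fetching it with ?conflicts=true lists the losing revisions in _conflicts.
    const doc = await (await fetch(`${base}/${row.id}?conflicts=true`)).json();
    console.log(row.id, 'winning rev:', doc._rev, 'conflicts:', doc._conflicts);
  } else {
    console.log(row.id, 'single leaf:', row.changes[0].rev);
  }
}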
Is it possible to delete all documents in a couchdb database, except design documents, without creating a specific view for that?
My first approach has been to access the _all_docs standard view, and discard those documents starting with _design. This works but, for large databases, is too slow, since the documents need to be requested from the database (in order to get the document revision) one at a time.
If this is the only valid approach, I think it is much more practical to delete the complete database, and create it from scratch inserting the design documents again.
I can think of a couple of ideas.
Use _all_docs
You do not need to fetch all the documents, only the ID and revisions. By default, that is all that _all_docs returns. You can make a pretty big request in a batch (10k or 100k docs at a time should be fine).
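A minimal sketch of that approach (Node 18+ fetch; the database name db and the batch size are assumptions), using _bulk_docs to delete everything that isn't a design document:
// Fetch ids and revs in one batch from _all_docs, then delete via _bulk_docs.
const base = 'http://localhost:5984/db';
const all = await (await fetch(`${base}/_all_docs?limit=10000`)).json();

const deletions = all.rows
  .filter(row => !row.id.startsWith('_design/'))
  .map(row => ({ _id: row.id, _rev: row.value.rev, _deleted: true }));

await fetch(`${base}/_bulk_docs`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ docs: deletions })
});
// For larger databases, repeat with startkey set just past the last id of each batch.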
Replicate then delete
You could use an _all_docs query to get the IDs of all design documents.
GET /db/_all_docs?startkey="_design/"&endkey="_design0"
Then replicate them somewhere temporary.
POST /_replicator
{ "source":"db", "target":"db_ddocs", "create_target":true
, "user_ctx": {"roles":["_admin"]}
, "doc_ids": ["_design/ddoc_1", "_design/ddoc_2", "etc..."]
}
Now you can just delete the original database and replicate the temporary one back by swapping the "source" and "target" values.
Deleting vs "deleting"
Note, these are really apples vs. oranges techniques. By deleting a database, you are wiping out the edit history of all its documents. In other words, you cannot replicate those deletion events to any other database. When you "delete" a document in CouchDB, it stores a record of that deletion. If you replicate that database, those deletions will be reflected in the target. (CouchDB stores "tombstones" indicating the document ID, its revision history, and its deleted state.)
That may or may not be important to you. The first idea is probably considered more "correct"; however, I can see the value of the second. You can visualize the entire program needed to accomplish this in your head. It's only a few queries and you're done. No looping through _all_docs batches, no headache. Your specific situation will probably make it obvious which is better.
Install couchapp, pull down the design doc to your hard disk, delete the db in futon, push the design doc back up to your recreated database. =)
You could write a shell script that goes through the list of all documents and deletes them all one by one except design docs. Apparently couch-batch can do that. Note that you don't need to fetch the whole docs to do that, just the id and revision.
Other than that, I think filtered replication (or the replication proposed by JasonSmith) is your best bet.
I'm testing a simple CouchDB against the theory that CouchDB compaction is totally indifferent to deleted docs.
Deleting a doc from couch via a DELETE method yields the following when trying to retrieve it:
localhost:5984/enq/deleted-doc-id
{"error":"not_found","reason":"deleted"}
Expected.
Now I compact the database:
localhost:5984/enq/_compact
{"ok": true}
And check compaction has finished
"compact_running":false
Now I would expect CouchDB to return not_found, reason "missing" on a simple GET
localhost:5984/enq/deleted-doc-id
{"error":"not_found","reason":"deleted"}
And trying with ?rev=deleted_rev gives me a full doc; yay for worthless data.
So am I correct in thinking that CouchDB compaction gives no special treatment to deleted docs, and simply looks at the rev count against the rev limit when deciding what is part of compaction? Is there a special rev_limit we can set for deleted docs?
Surely the only solution can't be a _purge? At the moment we must have thousands of orphaned deleted docs, and whilst we want to maintain some version history for normal docs we don't want to reduce our rev_limit to 1 just to assist in this scenario.
What are the replication issues we should be aware of with purge?
Deleted documents are preserved forever (because it's essential to providing eventual consistency between replicas). So, the behaviour you described is intentional.
To delete a document as efficiently as possible use the DELETE verb, since this stores only _id, _rev and the deleted flag. You can, of course, achieve the same more manually via POST or PUT.
Finally, _purge exists only for extreme cases where, for example, you've put an important password into a couchdb document and need it to be gone from disk. It is not a recommended method for pruning a database; it will typically invalidate any views you have (forcing a full rebuild) and it messes with replication too.
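For reference, a minimal sketch (Node 18+ fetch; the database and document names are placeholders, and the rev values are just the example revs from earlier in this page) of the two equivalent ways to delete:
const base = 'http://localhost:5984/db';

// 1) DELETE verb: the tombstone keeps only _id, _rev and the deleted flag.
await fetch(`${base}/some-doc?rev=1-23202479633c2b380f79507a776743d5`,
            { method: 'DELETE' });

// 2) The manual equivalent via PUT: keep the body minimal, since any extra
//    fields in it will survive in the tombstone through compaction.
await fetch(`${base}/other-doc`, {
  method: 'PUT',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ _rev: '1-967a00dff5e02add41819138abb3284d', _deleted: true })
});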
Adding a document, deleting it, and then compacting does not return the CouchDB database to a pristine state. A deleted document is retained through compaction, though in the usual case the resulting document is small (just the _id, _rev and _deleted=true). The reason for this is replication. Imagine the following:
Create document.
Replicate DB to remote DB.
Delete document.
Compact DB.
Replicate DB to remote DB again.
If the document is totally removed after deletion+compaction, then the second replication won't know to tell the remote DB that the document has been deleted. This would result in the two DBs being inconsistent.
There was an issue reported that could result in the document in the DB not being small; however it did not pertain to the HTTP DELETE method AFAIK (though I could be wrong). The ticket is here:
https://issues.apache.org/jira/browse/COUCHDB-1141
The basic idea is that audit information can be included with the DELETE that will be kept through compaction. Make sure you aren't posting the full doc body with the DELETE method (doing so might explain why the document isn't actually removed).
To clarify... from our experience you have to kick off a DELETE with the id and a compact in order to fully remove the document data.
As pointed out above you will still have the "header data" in your database afterwards.