Can I retrieve all revisions of a deleted document? - couchdb

I know I can retrieve all revisions of an "available" document, but can I retrieve the last "available" version of a deleted document? I do not know the revision id prior to the delete. This is the command I am currently running; it returns {"error":"not_found","reason":"deleted"}.
curl -X GET http://localhost:5984/test_database/a213ccad?revs_info=true

I've had this problem while trying to recover a deleted document; here is my solution:
0) Until you run compaction, you can get the deletion history, e.g.:
curl http://example.iriscouch.com/test/_changes
1) You'll see deleted documents with their $id and $rev. PUT an empty document as a new revision, e.g.:
curl -X PUT "http://example.iriscouch.com/test/$id?rev=$rev" -H "Content-Type: application/json" -d '{}'
2) Now you can get all revision info, e.g.:
curl "http://example.iriscouch.com/test/$id?revs_info=true"
See also Retrieve just deleted document

Besides _changes, another good way to do this is to use keys with _all_docs:
GET $MYDB/_all_docs?keys=["foo"] ->
{
  "offset": 0,
  "rows": [
    {
      "id": "foo",
      "key": "foo",
      "value": {
        "deleted": true,
        "rev": "2-eec205a9d413992850a6e32678485900"
      }
    }
  ],
  "total_rows": 0
}
Note that it has to be keys; key will not work, because only keys returns info for deleted docs.

You can get the last revision of a deleted document, but first you must determine its revision id. To do that, query the _changes feed and scan for the document's deletion record; it contains the last revision, which you can then fetch with docid?rev=N-XXXXX.
I remember some mailinglist discussion of making this easier (as doing a full scan of the changes feed is obviously not ideal for routine usage), but I'm not sure anything came of it.

I've hit this several times recently, so for anyone else wandering by ...
This question typically results from a programming model that needs to know which document was deleted. Since user keys such as 'type' don't survive deletion and _id is best assigned by CouchDB, it would often be nice to peek under the covers and see something about the doc that was deleted. An alternative is to have a process that sets deleted:true (no underscore) on documents, and to adjust any listener filters, etc., to look for deleted:true. One of the processes can then actually delete the document. This means that any process triggering on the document doesn't need to track an _id for eventual deletion.
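A minimal sketch of that soft-delete pattern (the names markDeleted and deletedFilter, and the deleted flag itself, are illustrative application conventions, not CouchDB API):

```javascript
// Soft-delete pattern: set an ordinary "deleted" flag instead of issuing
// a real DELETE right away, so listeners can still see the full document.

// Mark a document as logically deleted; a separate cleanup process
// performs the real DELETE once all listeners have reacted to the flag.
function markDeleted(doc) {
  doc.deleted = true; // note: no underscore -- this is a normal user field
  return doc;
}

// Example filter in the style of a CouchDB _changes filter function:
// only let logically deleted docs through to the listener.
function deletedFilter(doc, req) {
  return doc.deleted === true;
}
```

A listener subscribed to _changes with such a filter sees the complete document (type, _id, and all other fields) at "deletion" time, which is exactly what a real tombstone does not provide.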

Related

Elasticsearch - Why am I not getting the same search results after updating a document?

Here's what I'm doing:
First, I make a search and get some documents
curl -XPOST 'localhost:9200/index/type/_search' -d '{
  "query": {
    "match_all": {}
  },
  "size": 10
}'
Then I update one of the documents returned by the search:
curl -XPOST 'localhost:9200/index/type/_id/_update' -d '{
  "doc": {
    "some_field": "Some modification goes here."
  }
}'
And finally, I'm doing exactly the same search as above.
But the curious thing is that I get all the previous documents, except the updated one. Why is it no longer among the documents in the search?
Thank you!
Since you're not sorting your documents explicitly, they are sorted by score, and your modification may have changed the updated document's score.
And since you're only taking the first 10 documents, there is no guarantee that the updated document will still be among those 10.
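If the client needs the result set to stay stable across updates, one option is to sort on an explicit field instead of relying on score ordering. A sketch of such a query body (the created_at field is hypothetical; any sortable field in your mapping works):

```json
{
  "query": { "match_all": {} },
  "sort": [ { "created_at": "asc" } ],
  "size": 10
}
```

With a deterministic sort, updating a document's other fields no longer moves it in or out of the first 10 hits (unless the sort field itself changes).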

Using a script to conditionally update a document in Elasticsearch

I have a use case in which concurrent update requests hit my Elasticsearch cluster. In order to make sure that a stale event (one made irrelevant by a newer request) does not update a document after a newer event has already reached the cluster, I would like to pass a script with my update requests that compares a field to determine whether the incoming request is still relevant. The request would look like this:
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '
{
  "script": "IF ctx._source.user_update_time > my_new_time THEN do not update ELSE proceed with update",
  "params": {
    "my_new_time": "2014-09-01T17:36:17.517"
  },
  "doc": {
    "name": "new_name"
  },
  "doc_as_upsert": true
}'
Is the pseudocode I wrote in the "script" field possible in Elasticsearch? If so, I would love some help with the syntax (Groovy, Python or JavaScript).
Any alternative approach suggestions would be greatly appreciated too.
Elasticsearch has built-in optimistic concurrency control.
The way it works is that the Update API lets you use the version parameter to control whether the update should proceed or not.
So, taking your example above, the first index/update operation would create a document with version: 1. Now take the case where you have two concurrent requests. Components A and B will both send an updated document; both initially retrieved the document at version: 1 and will specify that version in their request (see version=1 in the query string below). Elasticsearch will update the document if and only if the provided version is the same as the current one.
Component A and B both send this, but A's request is the first to make it:
curl -XPOST 'localhost:9200/test/type1/1/_update?version=1' -d '{
  "doc": {
    "name": "new_name"
  },
  "doc_as_upsert": true
}'
At this point the version of the document will be 2 and B's request will end up with HTTP 409 Conflict, because B assumed the document was still at version 1, even though the version increased in the meantime due to A's request.
B can of course retrieve the document at its new version (i.e. 2) and try the update again, this time with ?version=2 in the URL. If it's the first request to reach ES, the update will succeed.
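The version check above can be modeled in a few lines. This is a toy in-memory sketch of the rule Elasticsearch applies, not the ES client API; makeStore and its method names are invented for illustration:

```javascript
// Toy model of optimistic concurrency control: an update is accepted
// only if the caller's expected version matches the stored version.
function makeStore() {
  const docs = {}; // id -> { version, doc }
  return {
    index(id, doc) {
      docs[id] = { version: 1, doc: doc };
      return { status: 201, version: 1 };
    },
    update(id, doc, expectedVersion) {
      const current = docs[id];
      if (current.version !== expectedVersion) {
        return { status: 409 }; // conflict, like HTTP 409 from ES
      }
      current.version += 1;
      current.doc = doc;
      return { status: 200, version: current.version };
    },
  };
}
```

Component A's update with expectedVersion 1 succeeds and bumps the version to 2; B's update with the same expectedVersion then gets a 409 and has to re-read and retry.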
I think the script should be like this (the stale case must leave the document untouched via ctx.op, not write the old value back):
"script": "if (ctx._source.user_update_time > my_new_time) { ctx.op = \"none\" } else { ctx._source.user_update_time = my_new_time }"
or, as a one-liner:
"script": "ctx._source.user_update_time > my_new_time ? ctx.op=\"none\" : ctx._source.user_update_time=my_new_time"

validate_on_update prevents deletion of a specified document

I have a simple validate_doc_update function:
if (!newDoc.type) {
  throw({forbidden: "All documents must have a type specified"});
}
If I do
curl -X DELETE $HOST/$DB/$DOC?rev=$REV
I get back
{"error":"forbidden","reason":"All documents must have a type specified"}
This happens even if I do
rev=$REV&type=type
Or if I do
-d'{"type":"type"}'
with curl
How can I bypass validation for deletion of documents?
CouchDB internals only know reads and updates. An update can be the creation of a doc, an edit of a doc, or the deletion of a doc. Validation functions can't be circumvented for any of these. To solve this, exempt deletions from the check: a deletion arrives as a tombstone with _deleted set on the new document, so use if (!newDoc.type && !newDoc._deleted) { ... }
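Put together, a version of the validate function that lets deletions through might look like this (a sketch; in CouchDB this body would live in a design document under validate_doc_update):

```javascript
// validate_doc_update-style function: a deletion arrives as an update
// whose newDoc has _deleted: true, so exempt those from the type check.
function validate(newDoc, oldDoc, userCtx) {
  if (!newDoc.type && !newDoc._deleted) {
    throw({forbidden: "All documents must have a type specified"});
  }
}
```

With this check, curl -X DELETE $HOST/$DB/$DOC?rev=$REV passes validation, while creating or editing a document without a type is still rejected.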

I am using river-couchdb with Elasticsearch and would like to reset its last_seq to 0. Anybody know if that's possible?

I am using a plugin for Elasticsearch called river-couchdb to create a full-text index of my CouchDB. It uses the CouchDB _changes API to listen for documents, so I assume it keeps track of the last seq from the _changes API.
Sometimes we rebuild our CouchDB, which sets our last_seq back to 0. The only way I've found to reset the river-couchdb seq is to delete both its index and the river itself and recreate them. Is there a better way?
As far as I remember, you have a _seq document in your _river index for your river.
This document has a _last_seq entry.
If you want to restart from scratch, I think you can simply delete this document:
curl -XDELETE localhost:9200/_river/yourrivername/_seq
Does it help?
From the couchdb-river manual, under "Starting at a specific sequence":
curl -XDELETE localhost:9200/_river/yourrivername/_seq
curl -XPUT 'localhost:9200/_river/yourrivername/_seq' -d '
{
  "couchdb": {
    "last_seq": "100"
  }
}'
Elasticsearch can't update last_seq without deleting the old document first.

Are old data accessible in CouchDB?

I've read a bit about CouchDB and I'm really intrigued by the fact that it's "append-only". I may be misunderstanding that, but as I understand it, it works a bit like this:
data is added at time t0 to the DB telling that a user with ID 1's name is "Cedrik Martin"
a query asking "what is the name of the user with ID 1?" returns "Cedrik Martin"
at time t1 an update is made to the DB telling: "User with ID 1's name is Cedric Martin" (changing the 'k' to a 'c').
a query asking again "what is the name of the user with ID 1" now returns "Cedric Martin"
It's a silly example, but it's because I'd like to understand something fundamental about CouchDB.
Given that the update was made by appending to the end of the DB, is it possible to query the DB "as it was at time t0", without doing anything special?
Can I ask CouchDB "What was the name of the user with ID 1 at time t0?" ?
EDIT: the first answer is very interesting, so here is a more precise question: as long as I'm not compacting a CouchDB, can I write queries that are somehow "referentially transparent" (i.e. they'll always produce the same result)? For example, if I query for "document d at revision r", am I guaranteed to always get the same answer back as long as I'm not compacting the DB?
Perhaps the most common mistake made with CouchDB is to believe it provides a versioning system for your data. It does not.
Compaction removes all non-latest revisions of all documents and replication only replicates the latest revisions of any document. If you need historical versions, you must preserve them in your latest revision using any scheme that seems good to you.
"_rev" is, as noted, an unfortunate name, but no other word has been suggested that is any clearer. "_mvcc" and "_mcvv_token" have been suggested before. The issue with both is that any description of what's going on there will inevitably include the "old versions remain on disk until compaction" which will still imply that it's a user versioning system.
To answer the question "Can I ask CouchDB "What was the name of the user with ID 1 at time t0?" ?", the short answer is "NO". The long answer is "YES, but then later it won't work", which is just another way of saying "NO". :)
As already said, it is technically possible and you shouldn't count on it. It isn't only about compaction, it's also about replication, one of CouchDB's biggest strengths. But yes, if you never compact and if you don't replicate, then you will be able to always fetch all previous versions of all documents. I think it will not work with queries, though, they can't work with older versions.
Basically, calling it "rev" was the biggest mistake in CouchDB's design, it should have been called "mvcc_token" or something like that -- it really only implements MVCC, it isn't meant to be used for versioning.
Answer to the second question:
Yes. Changed data is always added to the tree with a higher revision number; a given revision is never modified.
For your information: a revision like 1-abcdef is built this way: the 1 is the version number (here: the first version), and the second part is a hash over the document content (I'm not sure whether some extra "salt" goes in there). So the same document content will always produce the same revision number (with the same CouchDB setup), even on other machines, as long as it is at the same change level (1-, 2-, 3-).
Another way: if you need to keep old versions, you can store them inside a bigger container document:
{
  "_id": "docHistoryContainer_5374",
  "doc_id": "5374",
  "versions": [
    {
      "v": 1,
      "date": [2012, 3, 15],
      "doc": { ... doc_content v1 ... }
    },
    {
      "v": 2,
      "date": [2012, 3, 16],
      "doc": { ... doc_content v2 ... }
    }
  ]
}
Then you can ask for revisions with a view "byRev":
function (doc) {
  for (var i = 0; i < doc.versions.length; i++) {
    emit([doc.doc_id, doc.versions[i].v], doc.versions[i]);
  }
}
Call:
/byRev?startkey=["5374"]&endkey=["5374",{}]
Result:
{ "id": "docHistoryContainer_5374", "key": ["5374", 1], "value": { ... doc_content v1 ... } }
{ "id": "docHistoryContainer_5374", "key": ["5374", 2], "value": { ... doc_content v2 ... } }
Additionally, you can now also write a map function that emits the date in the key, so you can ask for revisions in a date range.
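A sketch of such a date-keyed view, assuming the container structure above (in real CouchDB, emit is a global provided by the view server; it is passed as a parameter here only to keep the sketch self-contained):

```javascript
// Map function emitting [doc_id, date] as the key, so revisions can be
// selected by date range with startkey/endkey on the second key element.
function mapByDate(doc, emit) {
  for (var i = 0; i < doc.versions.length; i++) {
    var version = doc.versions[i];
    emit([doc.doc_id, version.date], version);
  }
}
```

A call like /byDate?startkey=["5374",[2012,3,15]]&endkey=["5374",[2012,3,16]] would then return only the revisions written in that date window.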
What you call t0 (t1, ...) is called a "revision" in CouchDB. Each time you change a document, its revision number increases.
A doc's old revisions are stored until you no longer want them and tell the database to "compact".
Look at "Accessing Previous Revisions" in http://wiki.apache.org/couchdb/HTTP_Document_API
