I have some documents I would like to replicate to per-user databases, but only with a subset of the fields on the main document. The "selector" option makes it easy to select the docs I need, but it does not appear to support a "fields" option. I can't find a way to do this without manually filtering the document fields on my Node server, which seems quite inefficient.
It is not possible to replicate part of a document: the CouchDB replication mechanism replicates the whole document as a unit.
I suggest you review your document design so that it fits the replication requirements better. It may be more appropriate to use two separate documents in your case, as sketched below.
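For illustration, a minimal sketch of that split, with made-up IDs, field names, and database names. The public subset lives in its own document, so a filtered replication with a selector picks it up whole and no "fields" option is needed:

Full document, kept in the main database only:

{"_id": "user:42:private", "type": "private_profile", "email": "alice@example.com", "notes": "..."}

Public subset, replicated to the per-user database:

{"_id": "user:42:public", "type": "public_profile", "name": "Alice"}

curl -X POST http://localhost:5984/_replicate \
  -H 'Content-Type: application/json' \
  -d '{"source": "maindb", "target": "userdb-42", "selector": {"type": "public_profile"}}'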
We have about 55,000,000 documents in our Elasticsearch instance. We have a CSV file with user_ids; the biggest CSV has 9M entries. Our documents use user_id as the key, so this is convenient.
I am posting the question because I want to discuss the options and find the best way to get this done, as there are different ways to address this problem. We need to add a new "label" to a document if the user document doesn't have it yet, e.g. tagging the user with "stackoverflow" or "github".
There is the classic partial update endpoint. This sounds very slow, as we would need to iterate over 9M user_ids and issue the API call for each of them.
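For reference, a single partial update looks roughly like this (index, type, and id here are placeholders):

curl -XPOST 'localhost:9200/users/user/42/_update' -d '{
    "doc": {"label": "github"}
}'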
There is the bulk request, which provides better performance but is limited to roughly 1,000-5,000 documents per call, and knowing when a batch is too large is know-how we would have to learn as we go.
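The same update expressed through the bulk endpoint would look something like this, packing many actions into one request body (again with placeholder names):

curl -XPOST 'localhost:9200/_bulk' -d '
{"update": {"_index": "users", "_type": "user", "_id": "42"}}
{"doc": {"label": "github"}}
{"update": {"_index": "users", "_type": "user", "_id": "43"}}
{"doc": {"label": "github"}}
'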
Then there is the official open issue for an /update_by_query endpoint, which has lots of traffic but no confirmation that it was implemented in a standard release.
On this open issue there is a mention of an update_by_query plugin, which should provide better handling, but there are old, still-open issues where users complain about performance problems and memory issues.
I am not sure whether it's doable in Elasticsearch, but I thought I would load all the CSV entries into a separate index, somehow join the two indexes, and apply a script that adds the tag if it doesn't exist yet.
So the question remains: what's the best way to do this? If some of you have done this in the past, please share your numbers/performance and what you would do differently this time.
While waiting for update by query support, I have opted for:
Use the scan/scroll API to loop over the document IDs you want to tag (related answer).
Use the bulk API to perform partial updates to set the tag on every matching doc.
Additionally, I store the tag data (your CSV) in a separate doc type, query from it, and tag all new docs as they are created, i.e. so I don't have to first index and then update.
Python snippet to illustrate the approach:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # assumes a local node; point this at your cluster

def actiongen():
    # myquery, myindex, and tags are placeholders for your own query,
    # index name, and the tag values to set.
    docs = helpers.scan(es, query=myquery, index=myindex, fields=['_id'])
    for doc in docs:
        yield {
            '_op_type': 'update',
            '_index': doc['_index'],
            '_type': doc['_type'],
            '_id': doc['_id'],
            'doc': {'tags': tags},
        }

helpers.bulk(es, actiongen(), index=myindex, stats_only=True)
Using the aforementioned update-by-query plugin, you would simply call:
curl -XPOST 'localhost:9200/index/type/_update_by_query' -d '{
    "query": {"filtered": {"filter": {
        "not": {"term": {"label": "github"}}
    }}},
    "script": "ctx._source.label = \"github\""
}'
The update-by-query plugin only accepts a script, not partial documents.
As for performance and memory issues, I guess the best thing is to give it a try.
I'd go with the bulk API, with the caveat that you should try to update each document the minimal number of times. Updates are just atomic deletes and adds, and they leave behind the deleted document as a tombstone until it can be merged out.
Sending a Groovy script to execute the update probably makes the most sense here, so you don't have to fetch the document first.
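As a rough sketch of what that could look like in a bulk request, assuming the documents already have a tags list field, placeholder index/type names, and that dynamic scripting is enabled:

curl -XPOST 'localhost:9200/_bulk' -d '
{"update": {"_index": "users", "_type": "user", "_id": "42"}}
{"script": "if (!ctx._source.tags.contains(tag)) { ctx._source.tags += tag }", "params": {"tag": "github"}}
'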
Could you create a Parent/Child relationship, adding a 'tags' type that references your 'posts' type as its parent? This way you wouldn't need to perform a full reindex of your data; simply index each of the appropriate tags against the appropriate post ID, as sketched below.
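A minimal sketch of such a mapping and an indexing call, with hypothetical index and type names (note that a _parent mapping can only be set when a type is first created):

curl -XPUT 'localhost:9200/myindex/_mapping/tags' -d '{
    "tags": {
        "_parent": {"type": "posts"}
    }
}'

curl -XPUT 'localhost:9200/myindex/tags/1?parent=SOME_POST_ID' -d '{"tag": "github"}'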
A very old thread. I landed here through the GitHub page for "update by query" to see whether it was implemented in 2.0, but unluckily it is not. Thanks to the plugin from Teka, small updates are very much doable from Sense, but our use case was to update millions of documents daily based on certain complex queries. In the end, we moved to the es-hadoop connector. The infrastructure is a big, big overhead here, but parallelizing the fetching/updating/inserting of documents through Spark helped us anyhow. If anyone has discovered any other suggestion in the past year, I would love to hear about it.
Using /_changes?filter=_design I can get all the changes for design documents.
How do I get all the changes for documents only?
Is there something like /_changes?filter=_docs_only?
There is no built-in filter for this. You will need to write your own filter function (http://couchdb.readthedocs.org/en/latest/couchapp/ddocs.html#filterfun) that excludes design documents from the feed (e.g. check whether the doc's _id starts with "_design/"). You then reference this filter function when you query the changes feed (http://couchdb.readthedocs.org/en/latest/api/database/changes.html?highlight=changes). However, most applications don't run into this often, since design documents are typically only updated when there is an application change.
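A sketch of such a filter function stored in a design document (the database and names here are made up):

curl -XPUT http://localhost:5984/mydb/_design/app -d '{
  "filters": {
    "docs_only": "function(doc, req) { return doc._id.indexOf(\"_design/\") !== 0; }"
  }
}'

curl 'http://localhost:5984/mydb/_changes?filter=app/docs_only'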
It would probably be more efficient to implement this filter on the client side instead of streaming all your changes through the couchjs process (always inefficient). As your application loops through the changes, simply check whether each one is a design doc.
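For example, a client-side version of the same check, sketched in Python with the requests library (the URL and handler are assumptions):

import requests

resp = requests.get('http://localhost:5984/mydb/_changes')
for change in resp.json()['results']:
    # Skip design documents on the client instead of in couchjs
    if change['id'].startswith('_design/'):
        continue
    handle_change(change)  # hypothetical handler for the remaining changes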
Cheers.
I'm new to CouchDB and I know my mindset is probably still too much in the relational DB sphere, but here goes:
It appears that querying on Couch is all done via Views. I read that temporary views are very inefficient and should be avoided in production.
So my question really is: how would one do effective querying with parameters (as views do not accept them)? For example, if I were to use Couch to power a blog site, would I have to create a new view for each post, equivalent to 'select post from posts where id=1'?
I understand that I can use Lucene alongside the querying to perform a full-text search on the results, but this is only really useful for textual content, not numbers.
I'm happy creating a boatload of static views, as they can be created very simply on the fly. My worry is that this is not how Couch was supposed to be used and I'm missing something. Feel free to enlighten me.
Cheers, Chris.
Views do accept URL parameters, key being the one you are looking for. You can even limit how many rows you get, and sort as well.
Your views can be indexed by arbitrary JSON keys. This means you can create a view that emits documents like so: [username, docid] => doc. Then you can query this view with http://url/to/view?key=[username,docid].
You could create a view that emits [username, type, date] => doc. Now you can get all documents of a certain type between certain dates (using the startkey and endkey URL parameters), as sketched below.
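A sketch of such a map function and query, with illustrative database, view, and field names (URL-encoding omitted for readability):

curl -XPUT http://localhost:5984/mydb/_design/app -d '{
  "views": {
    "by_user": {
      "map": "function(doc) { emit([doc.username, doc.type, doc.date], doc); }"
    }
  }
}'

curl 'http://localhost:5984/mydb/_design/app/_view/by_user?startkey=["alice","post","2009-01-01"]&endkey=["alice","post","2009-12-31"]'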
Your example of the blog is one that CouchDB is particularly well suited for. In fact, I believe it's an example in the upcoming CouchDB book from O'Reilly.
That said, some kinds of queries are not easily handled by CouchDB alone. couchdb-lucene can help here. Don't assume that it's only good for full-text search; I've been using it to run general complex queries against the database to good effect.
I come from an RDBMS background, and I have an application here which requires good scalability and low latency. I want to give CouchDB a try. However, I need to detect when a particular INSERT operation fails due to a unique key constraint. Does CouchDB support this? I took a look at the docs, but I could not find anything relevant.
The _id for each document is unique (within the same database), but there are no constraints for other fields in the document.
Particularly, there are no constraints that run across two or more documents.
You can set up validation documents that define validation rules for documents, but again they operate on a document-by-document basis.
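A sketch of such a validation function in a design document (the database name and required field are just examples):

curl -XPUT http://localhost:5984/mydb/_design/validation -d '{
  "validate_doc_update": "function(newDoc, oldDoc, userCtx) { if (!newDoc.type) { throw({forbidden: \"type is required\"}); } }"
}'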
As the above poster says, there are no constraints on fields other than the document _id. The _id can be automatically generated by CouchDB, or you can create your own (for my purposes I have created my own, as I knew I could guarantee the key's uniqueness).
At the lowest API level, if you attempt a PUT request with an existing document id, it will be rejected with an HTTP 409 (conflict) error, unless you supply the correct revision (_rev property) of the existing document.
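To illustrate, assuming a database named mydb (responses abbreviated):

# First write succeeds
curl -XPUT http://localhost:5984/mydb/mykey -d '{"value": 1}'
# {"ok":true,"id":"mykey","rev":"1-..."}

# Same _id again without a _rev is rejected
curl -XPUT http://localhost:5984/mydb/mykey -d '{"value": 2}'
# {"error":"conflict","reason":"Document update conflict."}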
I wouldn't run anything mission-critical with CouchDB, but the code is out of Apache incubation and quite functional. A number of people are running websites with it.