I've been searching all over on this one. I'm running CouchDB 2.0 and understand I have a choice to make between using traditional views or the newer Mango query when retrieving a set of data.
So I'm currently using the Mango query syntax and getting the results I need; however, I now need to implement pagination. When researching pagination in CouchDB 2.0 I found this excellent discussion of the topic:
http://docs.couchdb.org/en/2.0.0/couchapp/views/pagination.html
It suggests that the best way to paginate large data sets is not to use skip but instead to use startkey and perform a kind of linked list pagination from one page to the next.
So this makes sense to me and works for my application, but when I then turn to the Mango/_find API I can't see any way to pass in startkey:
http://docs.couchdb.org/en/2.0.0/api/database/find.html
Confusingly enough, it does accept a skip parameter, but there is no startkey.
Is anybody able to explain what is happening here? Are the performance characteristics much different in Mango/_find such that we can safely use skip on large data sets? Or should we be using views with startkey when traversing larger collections of data?
This particular question doesn't seem to get answered in any recent documentation AFAIK. Any help would be greatly appreciated.
You could perhaps work around the lack of startkey/endkey support by including a constraint in the selector:
"selector": {
"_id": { "$gte": "myStartKey", "$lte": "myEndKey"}
}
(just my two cents; maybe somebody else has a more complete answer)
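To make that concrete, here is a rough sketch (not from the original answer) of a full _find request that pages on _id; the database URL and page size are placeholders, and it assumes results come back in _id order from the primary index:

import requests

DB = "http://localhost:5984/mydb/_find"   # placeholder database URL

def fetch_page(after_id="", page_size=10):
    # Constrain _id instead of passing startkey; $gt resumes strictly after
    # the last _id seen on the previous page ("" starts from the beginning).
    body = {"selector": {"_id": {"$gt": after_id}}, "limit": page_size}
    return requests.post(DB, json=body).json()["docs"]

# Each page's last _id becomes the cursor for the next request,
# mirroring the linked-list style the pagination guide describes.
page = fetch_page()
while page:
    # ... process the page here ...
    page = fetch_page(after_id=page[-1]["_id"])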
The pagination approach documented for CouchDB applies only to map/reduce views and cannot be applied to Mango queries. That is primarily because a view has a single key field used for sorting, so it is easy to skip the documents already seen by passing that key as startkey (and, when keys are not unique, by adding a startkey_docid).
For selector queries, to effectively skip previous records you have to look at the sort keys specified in the original query and add further conditions that exclude the documents already processed. For example, if you sorted ascending on a numeric field and have processed up to value = 10, you could add { "field": { "$gte": 10 } } as $and logic within the original selector (see the sketch below). This becomes complicated if you have multiple sort fields, so skip/limit may be an easier approach to pagination for selector queries.
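A minimal sketch of that bookmark idea, assuming a single numeric sort field called price with a matching Mango index (the field, database URL, and page size are placeholders, not from the answer):

import requests

DB = "http://localhost:5984/mydb/_find"   # placeholder database URL

def pages(page_size=25):
    # Carry the last sort value forward instead of using skip.
    last_price = None
    while True:
        selector = {"type": "Article"}            # placeholder base selector
        if last_price is not None:
            # $gt excludes the boundary value; with duplicate prices you would
            # also need a tie-breaker (e.g. on _id), as the answer notes.
            selector["price"] = {"$gt": last_price}
        body = {"selector": selector, "sort": [{"price": "asc"}], "limit": page_size}
        docs = requests.post(DB, json=body).json()["docs"]
        if not docs:
            return
        yield docs
        last_price = docs[-1]["price"]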
Related
I have been using Solr for my project, but recently I came across Elasticsearch, which seems very promising. My project requires the ability to handle nested documents, and I would like to know which one does the better job. Solr only added child documents recently, but is its support as good as Elasticsearch's? Can Elasticsearch query both parent and children at once? Thanks
I've been looking into the subject recently, and to my understanding ElasticSearch makes life a lot easier when working with nested documents, although Solr also supports nesting (but is less flexible in querying).
So the features of ElasticSearch are:
"Seamlessly" supports nesting: you don't have to change your
nested documents structure or add specific fields. However, you need
to indicate in the mapping what fields are nested when creating the
index
Supports nested query with "nested" and "path":
Supports aggregation and filtering with nested docs: also via
"nested" and "path".
With Solr you will have to:
Modify your schema.xml by adding the _root_ field
Modify your dataset so that parent and child documents have a specific distinguishing field, in particular childDocuments to indicate children (see more at this question)
Aggregation and filtering on nested documents promise to be very complicated, if not impossible.
Also, nested fields are not supported at all.
Recent Solr versions (5.1 and up) can be configured to support nesting (though you'll have to change your input data structure); however, the documentation is not very clear and there is not much information on the Internet because these features are recent.
The bottom line is that, for nested documents, ElasticSearch can do everything Solr can and more, with less effort and a smoother learning curve. So going with ElasticSearch seems more reasonable in this case.
I am not familiar with Elastic Search, so this is only a 50% answer.
Solr works best with denormalized data. However, given that you have nested documents, you can use Solr in two scenarios:
Query for parent, with a child attribute
Query for all children of a parent.
You can use block joins to perform the above queries (a sketch follows below). Even though you deal with nested levels, Solr internally manages them as denormalized documents: when a parent has 2 children, you end up with three top-level documents in Solr, and Solr manages the relation part.
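For illustration only (the collection name, the doc_type discriminator field, and the search fields are assumptions), a block-join query against Solr's HTTP API could look like this:

import requests

SOLR = "http://localhost:8983/solr/articles/select"   # placeholder collection

# {!parent which=...} returns parent documents whose children match the query;
# "which" must be a filter that identifies every parent document in the index.
params = {
    "q": '{!parent which="doc_type:parent"}comment_text:couchdb',
    "wt": "json",
}
parents = requests.get(SOLR, params=params).json()["response"]["docs"]

# The reverse direction: {!child of=...} returns the children of the parents
# that match the query after the closing brace.
params["q"] = '{!child of="doc_type:parent"}title:elasticsearch'
children = requests.get(SOLR, params=params).json()["response"]["docs"]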
CouchDB has a special _all_docs view, which returns documents sorted by ID. But since IDs are random by default, that ordering is not meaningful.
I always need to sort by 'date added'. Now I have two options:
Generate my own IDs and make sure they start with a timestamp
Use standard GUIDs, add a timestamp field to the JSON, and sort on that
Now the second solution is less hackish, but I suspect the first is much more efficient and faster, because all queries are done on the real row ID, which is indexed.
Is it true that both solutions differ in performance? And if it's true, which one is likely to be faster or preferred?
Is it true that both solutions differ in performance?
Your examples describe the primary and secondary index approaches in CouchDB.
_all_docs is the only primary index and it is always up to date. Secondary indexes (views), as in your second solution, only get updated when they are requested.
That's why, from the requester's point of view, _all_docs can feel "faster". In reality there is no difference when the requested index is already up to date. Two workarounds for potentially outdated views (secondary indexes) are the query params stale=ok (respond from the existing index without waiting for an update) and stale=update_after (respond immediately, then update the view), or so-called "view heaters" (a simple HTTP GET to the view to trigger the update process); both are sketched below.
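A tiny sketch of both workarounds (the database, design document, and view names are placeholders):

import requests

VIEW = "http://localhost:5984/mydb/_design/articles/_view/by_date_added"  # placeholder

# Workaround 1: don't wait for the index; update_after triggers a rebuild
# once the response has been sent (stale=ok skips the rebuild entirely).
rows = requests.get(VIEW, params={"stale": "update_after",
                                  "descending": "true",
                                  "limit": 20}).json()["rows"]

# Workaround 2: a "view heater" -- hit the view periodically (e.g. from cron)
# with limit=0 so the index stays warm for real requests.
requests.get(VIEW, params={"limit": 0})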
And if it's true, which one is [...] preferred?
The capabilities for building a useful index and response payload are significantly higher with secondary indexes.
If you want to use the primary index, you have to "design" your IDs as you have described. You can imagine that this decides up front much of what can later be done with the docs and their IDs.
My recommendation would be to use secondary indexes (views). Only if you need real-time reads or face high-concurrency scenarios should you consider the primary index as the best fit for requesting data.
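As a minimal sketch of the recommended approach (database, design document, and field names are placeholders), a date-sorted view can be created and queried like this:

import requests

DB = "http://localhost:5984/mydb"   # placeholder database

# A design document whose view emits date_added as the key; CouchDB keeps
# the secondary index sorted by that key.
design = {
    "views": {
        "by_date_added": {
            "map": "function(doc) { if (doc.date_added) { emit(doc.date_added, null); } }"
        }
    }
}
requests.put(DB + "/_design/articles", json=design)

# Newest documents first; include_docs pulls in the full doc bodies.
rows = requests.get(
    DB + "/_design/articles/_view/by_date_added",
    params={"descending": "true", "limit": 10, "include_docs": "true"},
).json()["rows"]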
Our Elasticsearch instance holds about 55,000,000 documents. We have CSV files of user_ids; the biggest CSV has 9M entries. Our documents use user_id as the key, so this is convenient.
I am posting the question because I want to discuss the options and find the best way to get this done, as there are different ways to address the problem. We need to add a new "label" to a user's document if it doesn't have it yet, e.g. tagging the user with "stackoverflow" or "github".
There is the classic partial update endpoint. This sounds very slow, since we would need to iterate over 9M user_ids and issue an API call for each of them.
There is the bulk request, which performs better but is limited to roughly 1000-5000 documents per call, and knowing when a batch is too large is know-how we would have to pick up as we go.
Then there is the official open issue for the /update_by_query endpoint, which has lots of traffic but no confirmation that it was implemented in a standard release.
On that open issue there is a mention of an update_by_query plugin which should handle this better, but there are old, still-open issues where users complain about performance problems and memory issues.
I am not sure it's doable in Elasticsearch, but I also thought about loading all the CSV entries into a separate index and somehow joining the two indexes, applying a script that adds the tag if it doesn't exist yet.
So the question remains: what's the best way to do this? If some of you have done this in the past, please share your numbers/performance and what you would do differently this time.
While waiting for update by query support, I have opted for:
Use the scan/scroll API to loop over the document IDs you want to tag (related answer).
Use the bulk API to perform partial updates to set the tag on every matching doc.
Additionally, I store the tag data (your CSV) in a separate doc type, query from that, and tag new docs as they are created, i.e. so I don't have to first index and then update.
Python snippet to illustrate the approach:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
# myquery, myindex, tags and args are defined elsewhere in the original script.

def actiongen():
    # Scroll over every matching document, fetching only the IDs.
    docs = helpers.scan(es, query=myquery, index=myindex, fields=['_id'])
    for doc in docs:
        # Emit one partial-update action per document.
        yield {
            '_op_type': 'update',
            '_index': doc['_index'],
            '_type': doc['_type'],
            '_id': doc['_id'],
            'doc': {'tags': tags},
        }

helpers.bulk(es, actiongen(), index=args.index, stats_only=True)
Using the aforementioned update-by-query plugin, you would simply call:
curl -XPOST localhost:9200/index/type/_update_by_query -d '{
  "query": {"filtered": {"filter": {
    "not": {"term": {"tag": "github"}}
  }}},
  "script": "ctx._source.label = \"github\""
}'
The update-by-query plugin only accepts a script, not partial documents.
As for performance and memory issues, I guess the best thing is to give it a try.
I'd go with the bulk API with the caveat that you should try to update each document the minimal number of times. Updates are just atomic deletes and adds and leave behind the deleted document as a tombstone until it can be merged out.
Sending a groovy script to execute the update probably makes the most sense here so you don't have to fetch the document first.
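For illustration only -- the index, type, and field names are assumptions, this is the pre-2.x update API syntax, and dynamic Groovy scripting has to be enabled on the cluster -- a single scripted update could look like this; the same script body can also be carried in update actions sent through the _bulk endpoint:

import requests

# Append the tag in place, without fetching the document first.
# Assumes the document already has a "tags" array.
url = "http://localhost:9200/users/user/42/_update"   # placeholder index/type/id
body = {
    "script": "if (!ctx._source.tags.contains(tag)) { ctx._source.tags += tag }",
    "params": {"tag": "github"},
    "lang": "groovy",
}
requests.post(url, json=body)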
Could you create a parent/child relationship, adding a 'tags' type that references your 'posts' type as its parent? This way you wouldn't need to perform a full reindex of your data; simply index each of the appropriate tags against the appropriate post ID.
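A rough sketch of that idea with the pre-5.x parent/child mapping (index, type, and ID values are assumptions):

import requests

BASE = "http://localhost:9200/myindex"        # placeholder index

# Declare 'tags' as a child type of 'posts' (only possible for a new type).
requests.put(BASE + "/_mapping/tags",
             json={"tags": {"_parent": {"type": "posts"}}})

# Index a tag against an existing post; the parent ID routes the child to
# the same shard as its parent.
requests.put(BASE + "/tags/1", params={"parent": "42"},
             json={"name": "github"})

# Posts can then be filtered by their tags with a has_child query.
query = {"query": {"has_child": {"type": "tags",
                                 "query": {"term": {"name": "github"}}}}}
hits = requests.post(BASE + "/posts/_search", json=query).json()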
A very old thread, but I landed here through the GitHub page for "update by query" to see whether it made it into 2.0; unluckily it did not. Thanks to the plugin from Teka, a small update is very doable from Sense, but our use case was updating millions of documents daily based on certain complex queries. In the end we moved to the es-hadoop connector. The infrastructure is a big overhead here, but parallelizing the fetching/updating/inserting of documents through Spark helped us a lot. If anyone has discovered :) any other suggestion in the past year, I would love to hear about it.
Edit: I added an answer with a more generic approach for NoSQL situations.
I am working on a project using Riak (with LevelDB).
Using the REST API that Riak offers, I am able to get data based on indexes and a range, which returns the results sorted alpha-num by the index, and a continuation hash.
Example call:
http://server/buckets/bucketname/index/someindex_int/333333333/555555555?max_results=10&return_terms=true&continuation=somehashhere
Example results:
{
    "results": [
        { "about_river": "12312" },
        { "balloon_tall": "45345" },
        { "basket_written": "23434523" }
    ],
    "continuation": "g2987392479789879087987asdfasdf="
}
I am also making a separate call without specifying max_results and return_terms to get a count of the docs that are in the set. I will know the number of docs per set and the total number of docs, which easily lets us know the number of "pages".
While I am able to make a call for each set of documents based on the hash, and then receive the next hash along with that result set, I am looking for a way to predict the hashes and therefore pre-populate the client with pagination links.
Is this possible? Are the hashes dynamic based on the index/range info or are they some random value generated by the node your data is returned from?
A coworker has mentioned that the hashes are based on what node you are hitting in the cluster, but I am unable to find documentation on this.
Secondarily, the idea was brought up to cycle through the entire set in the background to get the hashes. This will work, but seems pretty expensive.
I am brand new to Riak and any advice here would be great. I am not able to find any good examples of pagination with Riak. The one that did exist is gone from the internet as far as I can tell.
No, the continuation is not "predictable", nor is what your co-worker is saying correct.
Unfortunately there is no way to know the total number of objects in the range specified except for querying the range without the max_results parameter as you are doing (outside of a 1:1 relation between index key and object key, obviously).
The other answer was the answer I needed, but with some help from CodingHorror, I came up with the answer I wanted.
No pagination. With no pagination, only getting the hash for the next result set is no problem; in fact, it's ideal for my use case. Just merge that next set onto your existing set(s), as sketched below, but don't let it go on forever.
My inspiration: http://blog.codinghorror.com/the-end-of-pagination/
Thanks, Jeff Atwood!
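A minimal sketch of that "load more" style against Riak's secondary-index HTTP API (host, bucket, index, and range are placeholders modelled on the question's example call):

import requests

BASE = ("http://server:8098/buckets/bucketname/index/"
        "someindex_int/333333333/555555555")          # placeholder range query

def load_more(continuation=None, page_size=10):
    # Fetch the next batch plus the continuation for the batch after it.
    params = {"max_results": page_size, "return_terms": "true"}
    if continuation:
        params["continuation"] = continuation
    data = requests.get(BASE, params=params).json()
    return data["results"], data.get("continuation")

# Merge each new batch onto what the client already has, but stop somewhere.
results, cont = load_more()
while cont and len(results) < 100:
    batch, cont = load_more(cont)
    results.extend(batch)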
Ain't the number of results in the response the same?
something like
RiakFuture<SearchOperation.Response, BinaryValue> searchResult = client.executeAsync(searchOp);
searchResult.await();
com.basho.riak.client.core.operations.SearchOperation.Response response = searchResult.get();
logger.debug("number of results {} ", response.numResults());
I am currently trying to create a view and query to fit this SQL query:
SELECT * FROM articles
WHERE articles.location="NY" OR articles.location="CA"
ORDER BY articles.release_date DESC
I tried to create a view with a complex key:
function(doc) {
    if (doc.type == "Article") {
        emit([doc.location, doc.release_date], doc);
    }
}
I then use startkey and endkey to retrieve one location, ordering the results by release date:
.../_view/articles?startkey=["NY", {}]&endkey=["NY"]&limit=5&descending=true
This works fine.
However, how can I send multiple startkeys and endkeys to my view in order to mimic
WHERE articles.location="NY" OR articles.location="CA" ?
My arch nemesis, Dominic, is right.
Furthermore, it is never possible in CouchDB to query by criterion A and then sort by criterion B. In exchange for that inconvenience, CouchDB guarantees scalable, dependable, logarithmic query times. You have a choice:
Store the view output in its own database, and make a new view to sort by criteria B
or, sort the rows afterward, which can be either
Sort client-side, once you receive the rows
Sort server-side, in a _list function. This is great, but remember it's not ultimately scalable: if you have millions of rows, the _list function will probably crash.
The short answer is, you currently cannot use multiple startkey/endkey combinations.
You'll either have to make two separate queries, or you could always add on the Lucene search engine to get much more robust search capabilities. A sketch of the two-query approach follows below.
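A rough sketch of the two-query approach (database and design document names are placeholders; the view is the [location, release_date] one from the question):

import json
import requests

VIEW = "http://localhost:5984/mydb/_design/articles/_view/articles"  # placeholder

def latest_for(location, limit=5):
    # descending=true walks the composite key from [location, {}] down to
    # [location], i.e. newest release_date first for that location.
    params = {
        "startkey": json.dumps([location, {}]),
        "endkey": json.dumps([location]),
        "descending": "true",
        "limit": limit,
    }
    return requests.get(VIEW, params=params).json()["rows"]

# One query per location, then merge and re-sort by release_date client-side.
rows = latest_for("NY") + latest_for("CA")
rows.sort(key=lambda r: r["key"][1], reverse=True)
articles = [r["value"] for r in rows[:5]]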
It is possible to use multiple key parameters in a query. See the Couchbase CouchDB documentation on multi-document fetching.