Couchdb 2 _find query not using index - couchdb

I'm struggling with something that should be easy but it's making no sense to me, I have these 2 documents in a database:
{ "name": "foo", "type": "typeA" },
{ "name": "bar", "type": "typeB" }
And I'm posting this to _find:
{
"selector": {
"type": "typeA"
},
"sort": ["name"]
}
Which works as expected but I get a warning that there's no matching index, so I've tried posting various combinations of the following to _index which makes no difference:
{
"index": {
"fields": ["type"]
}
}
{
"index": {
"fields": ["name"]
}
}
{
"index": {
"fields": ["name", "type"]
}
}
If I remove the sort by name and only index the type it works fine except it's not sorted, is this a limitation with couchdbs' mango implementation or am I missing something?
Using a view and map function works fine but I'm curious what mango is/isn't doing here.

With just the type index, I think it will normally be almost as efficient unless you have many documents of each type (as it has to do the sorting stage in memory.)
But since fields are ordered, it would be necessary to do:
{
"index": {
"fields": ["type", "name"]
}
}
to have a contiguous slice of this index for each type that is already ordered by name. But the query planner may not determine that this index applies.
As an example, the current pouchdb-find (which should be similar) needs the more complicated but equivalent query:
{
selector: {type: 'typeA', name: {$gte: null} },
sort: ['type','name']
}
to choose this index and build a plan that doesn't resort to building in memory for any step.

Related

Mango index "does not contain a valid index for this query" even when specified manually

I'm trying to efficiently query data via Mango (as that seems to be the only option given my requirements Searching for sub-objects with a date range containing the queried date value), but I can't even get a very simple index/query pair to work: although I specify my index manually for the query, I'm told that my index "was not used because it does not contain a valid index for this query. No matching index found, create an index to optimize query time."
(I'm doing all of this via Fauxton on CouchDB v. 3.0.0)
Let's say my documents look like this:
{
"tenant": "TNNT_a",
"$doctype": "JobOpening",
// a bunch of other fields
}
All documents with a $doctype of "JobOpening" are guaranteed to have a tenant property. The searches I wish to perform will only ever be for documents with $doctype of "JobOpening" and a tenant selector will always be provided when querying.
Here's the test index I've configured:
{
"index": {
"fields": [
"tenant",
"$doctype"
],
"partial_filter_selector": {
"\\$doctype": {
"$eq": "JobOpening"
}
}
},
"ddoc": "job-openings-doctype-index",
"type": "json"
}
And here's the query
{
"selector": {
"tenant": "TNNT_a",
"\\$doctype": "JobOpening"
},
"use_index": "job-openings-doctype-index"
}
Why isn't the index being used for the query?
I've tried not using a partial index, and I think the $doctype escaping is done properly in the requisite places, but nothing seems to keep CouchDB from performing a full scan.
The index isn't being used because the $doctype field is not being recognized by the query planner as expected.
Changing the fields declaration from $doctype to \\$doctype in the design document solves the issue.
{
"index": {
"fields": [
"tenant",
"\\$doctype"
],
"partial_filter_selector": {
"\\$doctype": {
"$eq": "JobOpening"
}
}
},
"ddoc": "job-openings-doctype-index",
"type": "json"
}
After that small refactor, the query
{
"selector": {
"tenant": "TNNT_a",
"\\$doctype": "JobOpening"
},
"use_index": "job-openings-doctype-index"
}
Returns the expected result, and produces an "explain" which confirms the job-openings-doctype-index was queried:
{
"dbname": "stack",
"index": {
"ddoc": "_design/job-openings-doctype-index",
"name": "7f5c5cea5acd90f11fffca3e3355b6a03677ad53",
"type": "json",
"def": {
"fields": [
{
"tenant": "asc"
},
{
"\\$doctype": "asc"
}
],
"partial_filter_selector": {
"\\$doctype": {
"$eq": "JobOpening"
}
}
}
},
// etc etc etc
Whether this change is intuitive or not is unclear, however it is consistent - and perhaps reveals leading field names with a "special" character may not be desirable.
Regarding the indexing of the filtered field, as per the documentation regarding partial_filter_selector
Technically, we don’t need to include the filter on the "status" [e.g.
$doctype here] field in the query selector ‐ the partial index
ensures this is always true - but including it makes the intent of the
selector clearer and will make it easier to take advantage of future
improvements to query planning (e.g. automatic selection of partial
indexes).
Despite that, I would not choose to index a field whose value is constant.

How to define an index to use in a Mango Query

I am trying to create a CouchDB Mango Query with an index with the hope that the query runs faster. At the moment I have the following Mango Query which returns what I am looking for but it's slow. Therefore, I assume, I need to create an index to make it faster. I need help figuring out how to create that index.
selector: {
categoryIds: {
$in: categoryIds,
},
},
sort: [{ publicationDate: 'desc' }],
You can assume that my documents are let say news articles from different categories. Therefore in each document I have a field that contains one or more categories that the news article belongs to. For that I have an array of categoryIds for each document. My query needs to be optimized for queries like "Give me all news that have categoryId1 in their array of categoryIds sorted by publicationDate". What I don't know how to do is 1. How to define an index 2. What that index should be 3. How to use that index in "use_index" field of the Mango Query. Any help is appreciated.
Update after "Alexis Côté" answer:
If I define the index like this:
{
"_id": "_design/0f11ca4ef1ea06de05b31e6bd8265916c1bbe821",
"_rev": "6-adce50034e870aa02dc7e1e075c78361",
"language": "query",
"views": {
"categoryIds-json-index": {
"map": {
"fields": {
"categoryIds": "asc"
},
"partial_filter_selector": {}
},
"reduce": "_count",
"options": {
"def": {
"fields": [
"categoryIds"
]
}
}
}
}
}
And run the Mango Query like this:
{
"selector": {
"categoryIds": {
"$in": [
"e0bd5f97ac35bdf6893351337d269230"
]
}
},
"use_index": "categoryIds-json-index"
}
It still does return the results but they are not sorted in the order I want by publicationDate. So I am not clear what you are suggesting the solution is.
You can create an index as documented here
In your case, you will need an index on the "categoryIds" field.
You can specify the index using "use_index": "_design/<name>"
Note:The query planner should automatically pick this index if it's compatible.

Elasticsearch query_string combined with match_phrase

I think it's best if I describe my intent and try to break it down to code.
I want users to have the ability of complex queries should they choose to that query_string offers. For example 'AND' and 'OR' and '~', etc.
I want to have fuzziness in effect, which has made me do things I feel dirty about like "#{query}~" to the sent to ES, in other words I am specifying fuzzy query on the user's behalf because we offer transliteration which could be difficult to get the exact spelling.
At times, users search a number of words that are suppose to be in a phrase. query_string searches them individually and not as a phrase. For example 'he who will' should bring me the top match to be when those three words are in that order, then give me whatever later.
Current query:
{
"indices_boost": {},
"aggregations": {
"by_ayah_key": {
"terms": {
"field": "ayah.ayah_key",
"size": 6236,
"order": {
"average_score": "desc"
}
},
"aggregations": {
"match": {
"top_hits": {
"highlight": {
"fields": {
"text": {
"type": "fvh",
"matched_fields": [
"text.root",
"text.stem_clean",
"text.lemma_clean",
"text.stemmed",
"text"
],
"number_of_fragments": 0
}
},
"tags_schema": "styled"
},
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"_source": {
"include": [
"text",
"resource.*",
"language.*"
]
},
"size": 5
}
},
"average_score": {
"avg": {
"script": "_score"
}
}
}
}
},
"from": 0,
"size": 0,
"_source": [
"text",
"resource.*",
"language.*"
],
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "inna alatheena",
"fuzziness": 1,
"fields": [
"text^1.6",
"text.stemmed"
],
"minimum_should_match": "85%"
}
}
],
"should": [
{
"match": {
"text": {
"query": "inna alatheena",
"type": "phrase"
}
}
}
]
}
}
}
Note: alatheena searched without the ~ will not return anything although I have allatheena in the indices. So I must do a fuzzy search.
Any thoughts?
I see that you're doing ES indexing of Qur'anic verses, +1 ...
Much of your problem domain, if I understood it correctly, can be solved simply by storing lots of transliteration variants (and permutations of their combining) in a separate field on your Aayah documents.
First off, you should make a char filter that replaces all double letters with single letters [aa] => [a], [ll] => [l]
Maybe also make a separate field containing all of [a, e, i] (because of their "vocative"/transcribal ambiguity) replaced with € or something similar, and do the same while querying in order to get as many matches as possible...
Also, TH in "allatheena" (which as a footnote may really be Dhaal, Thaa, Zhaa, Taa+Haa, Taa+Hhaa, Ttaa+Hhaa transcribed ...) should be replaced by something, or both the Dhaal AND the Thaa should be transcribed multiple times.
Then, because it's Qur'anic script, all Alefs without diacritics, Hamza, Madda, etc should be treated as Alef (or Hamzat) ul-Wasl, and that should also be considered when indexing / searching, because of Waqf / Wasl in reading arabic. (consider all the Wasl`s in the first Aayah of Surat Al-Alaq for example)
Dunno if this is answering your question in any way, but I hope it's of some assistance in implementing your application nontheless.
You should use Dis Max Query to achieve that.
A query that generates the union of documents produced by its
subqueries, and that scores each document with the maximum score for
that document as produced by any subquery, plus a tie breaking
increment for any additional matching subqueries.
This is useful when searching for a word in multiple fields with
different boost factors (so that the fields cannot be combined
equivalently into a single search field). We want the primary score to
be the one associated with the highest boost.
Quick example how to use it:
POST /_search
{
"query": {
"dis_max": {
"tie_breaker": 0.7,
"boost": 1.2,
"queries": [
{
"match": {
"text": {
"query": "inna alatheena",
"type": "phrase",
"boost": 5
}
}
},
{
"match": {
"text": {
"query": "inna alatheena",
"type": "phrase",
"fuzziness": "AUTO",
"boost": 3
}
}
},
{
"query_string": {
"default_field": "text",
"query": "inna alatheena"
}
}
]
}
}
}
It will run all of your queries, and the one, which scored highest compared to others, will be taken. So just define your rules using it. You should achieve what you wanted.

Elasticsearch term filter on inner object field not matching

I have just organized my document structure to have a more OO design (e.g. moved top level properties like venueId and venueName into a venue object with id and name fields).
However I can now not get a simple term filter working for fields on the child venue inner object.
Here is my mapping:
{
"deal": {
"properties": {
"textId": {"type":"string","name":"textId","index":"no"},
"displayId": {"type":"string","name":"displayId","index":"no"},
"active": {"name":"active","type":"boolean","index":"not_analyzed"},
"venue": {
"type":"object",
"path":"full",
"properties": {
"textId": {"type":"string","name":"textId","index":"not_analyzed"},
"regionId": {"type":"string","name":"regionId","index":"not_analyzed"},
"displayId": {"type":"string","name":"displayId","index":"not_analyzed"},
"name": {"type":"string","name":"name"},
"address": {"type":"string","name":"address"},
"area": {
"type":"multi_field",
"fields": {
"area": {"type":"string","index":"not_analyzed"},
"area_search": {"type":"string","index":"analyzed"}}},
"location": {"type":"geo_point","lat_lon":true}}},
"tags": {
"type":"multi_field",
"fields": {
"tags":{"type":"string","index":"not_analyzed"},
"tags_search":{"type":"string","index":"analyzed"}}},
"days": {
"type":"multi_field",
"fields": {
"days":{"type":"string","index":"not_analyzed"},
"days_search":{"type":"string","index":"analyzed"}}},
"value": {"type":"string","name":"value"},
"title": {"type":"string","name":"title"},
"subtitle": {"type":"string","name":"subtitle"},
"description": {"type":"string","name":"description"},
"time": {"type":"string","name":"time"},
"link": {"type":"string","name":"link","index":"no"},
"previewImage": {"type":"string","name":"previewImage","index":"no"},
"detailImage": {"type":"string","name":"detailImage","index":"no"}}}
}
Here is an example document:
GET /production/deals/wa-au-some-venue-weekends-some-deal
{
"_index":"some-index-v1",
"_type":"deals",
"_id":"wa-au-some-venue-weekends-some-deal",
"_version":1,
"exists":true,
"_source" : {
"id":"921d5fe0-8867-4d5c-81b4-7c1caf11325f",
"textId":"wa-au-some-venue-weekends-some-deal",
"displayId":"some-venue-weekends-some-deal",
"active":true,
"venue":{
"id":"46a7cb64-395c-4bc4-814a-a7735591f9de",
"textId":"wa-au-some-venue",
"regionId":"wa-au",
"displayId":"some-venue",
"name":"Some Venue",
"address":"sdgfdg",
"area":"Swan Valley & Surrounds"},
"tags":["Lunch"],
"days":["Saturday","Sunday"],
"value":"$1",
"title":"Some Deal",
"subtitle":"",
"description":"",
"time":"5pm - Late"
}
}
And here is an 'explain' test on that same document:
POST /production/deals/wa-au-some-venue-weekends-some-deal/_explain
{
"query": {
"filtered": {
"filter": {
"term": {
"venue.regionId": "wa-au"
}
}
}
}
}
{
"ok":true,
"_index":"some-index-v1",
"_type":"deals",
"_id":"wa-au-some-venue-weekends-some-deal",
"matched":false,
"explanation":{
"value":0.0,
"description":"ConstantScore(cache(venue.regionId:wa-au)) doesn't match id 0"
}
}
Is there any way to get more useful debugging info?
Is there something wrong with the explain result description? Simply saying "doesn't match id 0" does not really make sense to me... the field is called 'regionId' (not 'id') and the value is definitely not 0...???
That happens because the type you submitted the mapping for is called deal, while the type you indexed the document in is called deals.
If you look at the mapping for your type deals, you'll see that was automatically generated and the field venue.regionId is analyzed, thus you most likely have two tokens in your index: wa and au. Only searching for those tokens on that type you would get back that document.
Anything else looks just great! Only a small character is wrong ;)

Query all unique values of a field with Elasticsearch

How do I search for all unique values of a given field with Elasticsearch?
I have such a kind of query like select full_name from authors, so I can display the list to the users on a form.
You could make a terms facet on your 'full_name' field. But in order to do that properly you need to make sure you're not tokenizing it while indexing, otherwise every entry in the facet will be a different term that is part of the field content. You most likely need to configure it as 'not_analyzed' in your mapping. If you are also searching on it and you still want to tokenize it you can just index it in two different ways using multi field.
You also need to take into account that depending on the number of unique terms that are part of the full_name field, this operation can be expensive and require quite some memory.
For Elasticsearch 1.0 and later, you can leverage terms aggregation to do this,
query DSL:
{
"aggs": {
"NAME": {
"terms": {
"field": "",
"size": 10
}
}
}
}
A real example:
{
"aggs": {
"full_name": {
"terms": {
"field": "authors",
"size": 0
}
}
}
}
Then you can get all unique values of authors field.
size=0 means not limit the number of terms(this requires es to be 1.1.0 or later).
Response:
{
...
"aggregations" : {
"full_name" : {
"buckets" : [
{
"key" : "Ken",
"doc_count" : 10
},
{
"key" : "Jim Gray",
"doc_count" : 10
},
]
}
}
}
see Elasticsearch terms aggregations.
Intuition:
In SQL parlance:
Select distinct full_name from authors;
is equivalent to
Select full_name from authors group by full_name;
So, we can use the grouping/aggregate syntax in ElasticSearch to find distinct entries.
Assume the following is the structure stored in elastic search :
[{
"author": "Brian Kernighan"
},
{
"author": "Charles Dickens"
}]
What did not work: Plain aggregation
{
"aggs": {
"full_name": {
"terms": {
"field": "author"
}
}
}
}
I got the following error:
{
"error": {
"root_cause": [
{
"reason": "Fielddata is disabled on text fields by default...",
"type": "illegal_argument_exception"
}
]
}
}
What worked like a charm: Appending .keyword with the field
{
"aggs": {
"full_name": {
"terms": {
"field": "author.keyword"
}
}
}
}
And the sample output could be:
{
"aggregations": {
"full_name": {
"buckets": [
{
"doc_count": 372,
"key": "Charles Dickens"
},
{
"doc_count": 283,
"key": "Brian Kernighan"
}
],
"doc_count": 1000
}
}
}
Bonus tip:
Let us assume the field in question is nested as follows:
[{
"authors": [{
"details": [{
"name": "Brian Kernighan"
}]
}]
},
{
"authors": [{
"details": [{
"name": "Charles Dickens"
}]
}]
}
]
Now the correct query becomes:
{
"aggregations": {
"full_name": {
"aggregations": {
"author_details": {
"terms": {
"field": "authors.details.name"
}
}
},
"nested": {
"path": "authors.details"
}
}
},
"size": 0
}
Working for Elasticsearch 5.2.2
curl -XGET http://localhost:9200/articles/_search?pretty -d '
{
"aggs" : {
"whatever" : {
"terms" : { "field" : "yourfield", "size":10000 }
}
},
"size" : 0
}'
The "size":10000 means get (at most) 10000 unique values. Without this, if you have more than 10 unique values, only 10 values are returned.
The "size":0 means that in result, "hits" will contain no documents. By default, 10 documents are returned, which we don't need.
Reference: bucket terms aggregation
Also note, according to this page, facets have been replaced by aggregations in Elasticsearch 1.0, which are a superset of facets.
The existing answers did not work for me in Elasticsearch 5.X, for the following reasons:
I needed to tokenize my input while indexing.
"size": 0 failed to parse because "[size] must be greater than 0."
"Fielddata is disabled on text fields by default." This means by default you cannot search on the full_name field. However, an unanalyzed keyword field can be used for aggregations.
Solution 1: use the Scroll API. It works by keeping a search context and making multiple requests, each time returning subsequent batches of results. If you are using Python, the elasticsearch module has the scan() helper function to handle scrolling for you and return all results.
Solution 2: use the Search After API. It is similar to Scroll, but provides a live cursor instead of keeping a search context. Thus it is more efficient for real-time requests.

Resources