I18n search and filtering in Elasticsearch - search

tldr;
How to match and filter localized search with a localized index ?
long version
I have an application where the user search must be done in the context of it's language.
In elastic search index, I want documents with both i18n properties and non i18n properties (I want to avoid creating multiple index, one for each language).
The mapping of the document should look like :
'entry': {
'properties': {
'name' : {'type': 'string'}, /* unlocalized properties */
'category': { /* localized properties */
"properties" : {
"lang_fr" : {
"type" : "string"
},
"lang_de" : {
"type" : "string"
}
}
},}}
having that, I have two requirements:
1) Matching: when doing a search, exclude from search the localized fields that are not concerned by the user language (let's say the user's language is 'fr', I want to exclude 'de' fields from search. How to do this without specifying the entire list of fields I want to search on. To start simple, I tried this but it doesn't work :
{
"query": {
"match": {
"*.lang_fr": "full_text"
}
}
}
However, "categories.lang_fr": "full_text" works well. But I don't want to maintain the list of fields in the query. I want a general rule like you can do in SolR.
2) Filtering: when I retrieve my results, I want to filter out all localized fields that doesn't corresponds to my user language. In other words, using the source filter, I'd like to have all unlocalized fields, exclude all fields starting with "lang" , but include all fields being 'lang_fr'. I tried the following but it doesn't work:
{
"_source": {
"include": [ "*", "*.lang_fr" ],
"exclude": [ "*.lang_*" ],
}
...}
the wildcard operator doesn't seems to work. I partially have what I want if I specify "categories.lang_de", but again, I don't want to maintain the list of fields, I want a generic rule. The include/exclude operation doesn't work as I would like. The only thing that actually works is a query where I specify all languages to exclude for all fields specifically, such as :
{
"_source": {
"exclude": [ "categories.lang_de", "categories.lang_en", "categories.lang_it",
"another_field.lang_de", "catanother_fieldgories.lang_en", "another_field.lang_it"],
}
...}
for 'fr' search.
I'm quite surprised I couldn't find anything on google. I see it as a very standard case of i18n applied to elasticsearch. Maybe I'm modelizing i18n the wrong way in ES ?
thank you in advance !

You can achieve the first one using a query_string query which takes advantage of the powerful Lucene expression language and allows to specify wildcard in field names:
{
"query": {
"query_string": {
"query": "\\*.lang_fr:full_text"
}
}
}
or you can also specify the field name in the fields parameter, like this
{
"query": {
"query_string": {
"query": "full_text"
"fields": ["*.lang_fr"]
}
}
}
As for your second one, source filtering is indeed the way to go but I suggest simply excluding all languages but the one you're searching for. For instance, if the search is in French, you'd simply exclude all other languages without necessarily having to enumerate all the fields, just all the languages that you don't want (which would be much less). That would allow you to add localized fields as you go without having to change the query.
{
"_source": {
"exclude": [ "*.lang_de", "*.lang_it" ],
}
...}

Related

Mango index "does not contain a valid index for this query" even when specified manually

I'm trying to efficiently query data via Mango (as that seems to be the only option given my requirements Searching for sub-objects with a date range containing the queried date value), but I can't even get a very simple index/query pair to work: although I specify my index manually for the query, I'm told that my index "was not used because it does not contain a valid index for this query. No matching index found, create an index to optimize query time."
(I'm doing all of this via Fauxton on CouchDB v. 3.0.0)
Let's say my documents look like this:
{
"tenant": "TNNT_a",
"$doctype": "JobOpening",
// a bunch of other fields
}
All documents with a $doctype of "JobOpening" are guaranteed to have a tenant property. The searches I wish to perform will only ever be for documents with $doctype of "JobOpening" and a tenant selector will always be provided when querying.
Here's the test index I've configured:
{
"index": {
"fields": [
"tenant",
"$doctype"
],
"partial_filter_selector": {
"\\$doctype": {
"$eq": "JobOpening"
}
}
},
"ddoc": "job-openings-doctype-index",
"type": "json"
}
And here's the query
{
"selector": {
"tenant": "TNNT_a",
"\\$doctype": "JobOpening"
},
"use_index": "job-openings-doctype-index"
}
Why isn't the index being used for the query?
I've tried not using a partial index, and I think the $doctype escaping is done properly in the requisite places, but nothing seems to keep CouchDB from performing a full scan.
The index isn't being used because the $doctype field is not being recognized by the query planner as expected.
Changing the fields declaration from $doctype to \\$doctype in the design document solves the issue.
{
"index": {
"fields": [
"tenant",
"\\$doctype"
],
"partial_filter_selector": {
"\\$doctype": {
"$eq": "JobOpening"
}
}
},
"ddoc": "job-openings-doctype-index",
"type": "json"
}
After that small refactor, the query
{
"selector": {
"tenant": "TNNT_a",
"\\$doctype": "JobOpening"
},
"use_index": "job-openings-doctype-index"
}
Returns the expected result, and produces an "explain" which confirms the job-openings-doctype-index was queried:
{
"dbname": "stack",
"index": {
"ddoc": "_design/job-openings-doctype-index",
"name": "7f5c5cea5acd90f11fffca3e3355b6a03677ad53",
"type": "json",
"def": {
"fields": [
{
"tenant": "asc"
},
{
"\\$doctype": "asc"
}
],
"partial_filter_selector": {
"\\$doctype": {
"$eq": "JobOpening"
}
}
}
},
// etc etc etc
Whether this change is intuitive or not is unclear, however it is consistent - and perhaps reveals leading field names with a "special" character may not be desirable.
Regarding the indexing of the filtered field, as per the documentation regarding partial_filter_selector
Technically, we don’t need to include the filter on the "status" [e.g.
$doctype here] field in the query selector ‐ the partial index
ensures this is always true - but including it makes the intent of the
selector clearer and will make it easier to take advantage of future
improvements to query planning (e.g. automatic selection of partial
indexes).
Despite that, I would not choose to index a field whose value is constant.

How to sort on multiple fields individually using a single index

I am trying to declare multiple fields in a single index like below and trying to sort on the single field only. is it possible?
Is there any way by which using a single combine fields index I can sort on individual fields dynamically.
{
"index": {
"fields": ["name","createdDate","updatedDate"]
},
"name" : "multi-filter",
"ddoc" : "MultiFilter"
"type" : "json"
}
after that, I can apply sort on the same sequence and list like
{
"selector": {"name": "Robert De Niro"},
"sort": [{"name": "asc"}, {"createdDate": "asc"},{"updatedDate": "asc"}]
}
BUT if I change the sequence or want to use a filter/sort on a single field like
{
"selector": {"name": "Robert De Niro"},
"sort": [{"name": "asc"}]
}
it gives an error saying, my motive is to use the single index, but sort individual fields. It looks like it is a limitation of couch DB and I need to create three separate indexes for the same to make it work, still hoping for the best option
{"error":"no_usable_index","reason":"No index exists for this sort, try indexing by the sort fields."}
I found this answer here: "Unknown Error: mango_idx :: {no_usable_index,missing_sort_index}"}
you could define an index only with the good field, eg:
{
"index": {
"fields": ["name"]
},
"name" : "name_sort",
"type" : "json"
}

CouchDB Search of Multiple Fields

I am trying to use CouchDB 3.1 for the first time. I'm trying to do a dynamic query where multiple fields can be searched and is totally optional. Example of my data:
{
"_id": "464e9db4d9216e1621b354794a0181d4",
"_rev": "1-fade491c3e255bbbfa60f1d7462fa9a2",
"app_id": "0000001",
"username": "john#gmail.com",
"transaction": "registration",
"customer_name": "John Doe",
"status": "complete",
"request_datetime": "2020-01-31 12:05:00"
}
So what I'm trying to do is, the documents can be searched by "transaction", "transaction" and "app_id", or combination of the fields "app_id" / "username" / "transaction" / "username" / "status" / "request_datetime" based on the search input from the user. (Some of the field such as "app_id" might be null based on the "transaction")
I have tried to make View to search by "app_id" and "transaction" :
function (doc) {
if(doc.transaction && doc.app_id) {
emit([doc.transaction, doc.app_id], doc);
}
}
But this is not gonna work when the app_id itself is null due to key in CouchDB is the index.
So my question is whether this can be achieved using vanilla CouchDB without using GeoCouch or Lucene? Do I need to make different views based on different combination of search fields?
Any help is greatly appreciated. Thank you very much.
With /db/_find, you can define a selector that accepts combination operators and condition operators. This lets you create simple and really complex queries. Given your document structure, such a selector could look as follows.
"selector":{
"$and":[
{
"app_id":{
"$eq":"0000001"
}
},
{
"username":{
"$eq":"john#gmail.com"
}
},
{
"request_datetime": {
"$gte": "2020-01-31 12:00:00",
"$lt": "2020-01-31 13:00:00"
}
}
]
}
The $or operator, combined with $eq and $exists may be used for checking fields that can be null. The $regex operator offers you even much more power.
Here's a simple example using CURL (replace with the name of your database).
curl -H 'Content-Type: application/json' -X POST http://localhost:5984/<db>/_find -d '{"selector":{"username":{"$eq": "john#gmail.com"}}}'

How would I query keys such that it would partially match?

Let's take this document for example:
{
"id":1
"planet":"earth-616"
"data":[
["wolverine","mutant"],
["Storm","mutant"],
["Mark Zuckerberg","human"]]
}
I created a search index to index the name and type, for example if searched for name:wolverine or type:mutant I'd get the document that has it. But as per my requirement I don't want the whole document, I only want ["wolverine","mutant"] I've created a view that outputs as:
{
"id":1,
"key":"earth-616",
"value":["earth-616","wolverine","mutant"]
}
Then I found out I can query only with keys. (Is it possible to create search indexes on views?, Couldn't find anything in the documentation)
Or should I create views along with the one above like this:
{
"id":1,
"key":"wolverine",
"value":["earth-616","wolverine","mutant"]
}
And
{
"id":,
"key":"mutant"
"value":["earth-616","wolverine","mutant"]
}
This way I can query with keys that I want but I can't seem to partial match keys(Am I missing something?)
If you need the output to be exactly as described then I believe you have to use views, and to support wildcard searches I believe you will have to index every substring of a key.
One alternative is to use Cloudant Query, although admittedly you cannot get the exact output you are looking for. If you issue a query like so:
{
"selector": {
"_id": {
"$gt": 0
},
"data": {
"$elemMatch": {
"$elemMatch": {
"$regex": "(?i)zuck"
}
}
}
},
"fields": [
"data"
]
}
The result will be the entire data array:
{
"data": [
["wolverine", "mutant"],
["Storm", "mutant"],
["Mark Zuckerberg", "human"]
]
}

ElasticSearch: Full-Text Search made easy

I am investigate possibility to switch to ElasticSearch from SphinxSearch.
What is good about SphinxSearch - full-text search just work out of the bot on pretty good level. Make it work on ElasticSearch appeared not as easy as I expected.
In my project I have search box with typeahead, means I stype Clint E and see dropdown with results including Clint Eastwood on the first place. Type robert down and see Robert Downey Jr. on the first place. All this I achieved with SphinxSearch out of the box just providing it my DB credentials and SQL query to pull the necessary fields.
On the other hand, with ElasticSearch I can't get satisfying results even after a day of reading about Fuzzy Like This Query, matching, partial matching and other. A lot of information but it does not make task easier. I feel like I need to be PhD in search just to make it work at simplest level.
So far I ended up with such configuration
{
"settings": {
"analysis": {
"analyzer": {
"stem": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"stop",
"porter_stem"
]
}
}
}
},
"mappings": {
"movies": {
"dynamic": true,
"properties": {
"title": {
"type": "string",
"analyzer": "stem"
}
}
}
}
}
The Query look like this:
{
"query": {
"query_string": {
"query": "clint eastw"
"default_field": "title"
}
}
}
But quality of search in this case is not satisfying at all - back to my example, it can not find Clint Eastwood profile until I type his name completely.
Then I tried to use
{
"query": {
"fuzzy_like_this": {
"fields": [
"title"
],
"like_text": "clint eastw",
"max_query_terms": 25,
"fuzziness": 0.5
}
}
}
It helps but not much, now I can find what I need with shorter request clint eastwo and after some manipulations with parameters with clint eastw but still not encouraging.
So I wonder, is there a simple recipe how to cook full-text search with ElasticSearch and get decent quality of results. I spend a day reading but didn't find the solution.
Couple of images to demonstrate what I am talking about:
Elastic, name almost complete but no expected result, note that there is no better match as well.
One letter after, elastic found it!
At the same moment Sphinx shining :)
Elasticsearch ships with auto completion suggester.
You need not put this into query functioanility , the way it works is on token level and not on partial token level.
Go for completion suggester , it also have support for fuzzy logic.

Resources