I am investigate possibility to switch to ElasticSearch from SphinxSearch.
What is good about SphinxSearch - full-text search just work out of the bot on pretty good level. Make it work on ElasticSearch appeared not as easy as I expected.
In my project I have search box with typeahead, means I stype Clint E and see dropdown with results including Clint Eastwood on the first place. Type robert down and see Robert Downey Jr. on the first place. All this I achieved with SphinxSearch out of the box just providing it my DB credentials and SQL query to pull the necessary fields.
On the other hand, with ElasticSearch I can't get satisfying results even after a day of reading about Fuzzy Like This Query, matching, partial matching and other. A lot of information but it does not make task easier. I feel like I need to be PhD in search just to make it work at simplest level.
So far I ended up with such configuration
{
"settings": {
"analysis": {
"analyzer": {
"stem": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"stop",
"porter_stem"
]
}
}
}
},
"mappings": {
"movies": {
"dynamic": true,
"properties": {
"title": {
"type": "string",
"analyzer": "stem"
}
}
}
}
}
The Query look like this:
{
"query": {
"query_string": {
"query": "clint eastw"
"default_field": "title"
}
}
}
But quality of search in this case is not satisfying at all - back to my example, it can not find Clint Eastwood profile until I type his name completely.
Then I tried to use
{
"query": {
"fuzzy_like_this": {
"fields": [
"title"
],
"like_text": "clint eastw",
"max_query_terms": 25,
"fuzziness": 0.5
}
}
}
It helps but not much, now I can find what I need with shorter request clint eastwo and after some manipulations with parameters with clint eastw but still not encouraging.
So I wonder, is there a simple recipe how to cook full-text search with ElasticSearch and get decent quality of results. I spend a day reading but didn't find the solution.
Couple of images to demonstrate what I am talking about:
Elastic, name almost complete but no expected result, note that there is no better match as well.
One letter after, elastic found it!
At the same moment Sphinx shining :)
Elasticsearch ships with auto completion suggester.
You need not put this into query functioanility , the way it works is on token level and not on partial token level.
Go for completion suggester , it also have support for fuzzy logic.
Related
I created an Azure index for my DocumentDB collection, and it seems to be working fine. The index has properties for a user account like FirstName, LastName, and Username. The problem is the default tokenizer seems to be tokenizing the Username field. While I want token matches for the first two fields, I'd like character matching for the usernames. Is there an easy way to achieve this through the Azure portal? If not, how can I achieve this?
Adding another answer based on your above comments. So basically in the best case, what you want to do is prefix, suffix and wildcard search. So if the username was user246392, you could find it by typing "use", "392" or even "er246". The prefix is easy, because you could search use* and it would find it.
Kendra Little did a really nice blog post on how to leverage RegEx with Azure Search, which can allow you to do the full wildcard part of your ask (i.e. search for "392").
If you wanted to do the suffix search, you can do a trick that is quite efficient where you create a new field that would be a custom analyzer that would index the words in opposite order. Here is an example of a index schema that would allow this (over suffixName field)
{
"name":"people",
"fields": [
{ "name":"id", "type":"Edm.String", "key":true, "searchable":false },
{"name": "suffixName", "type": "Edm.String", "searchable":true, "indexAnalyzer":"suffixIndexingAnalyzer", "searchAnalyzer":"reverseText"}
],
"analyzers": [
{
"#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "suffixIndexingAnalyzer",
"tokenizer": "keyword_v2",
"tokenFilters": [
"asciifolding",
"lowercase",
"reverse",
"my_edgeNGramForSuffix"
],
"charFilters": []
},
{
"#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "reverseText",
"tokenizer": "classic",
"tokenFilters": [
"lowercase",
"reverse"
],
"charFilters": []
}
],
"tokenFilters":[
{
"#odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
"name": "my_edgeNGramForSuffix",
"minGram": 2,
"maxGram": 25,
"side": "front"
}
]
}
Can you give us an example of what you would want to do over this username field? I am not sure what you mean by character matching. Is it a RegEx based character match? If so, perhaps a custom analyzer that enabled RegEx searched might help for this field? Please note, RegEx is not as performant as typical indexing as we would need to scan the entire content as opposed to going to the inverted index to find token matches.
I'm trying to execute some aggregate queries against data in TSI. For example:
{
"searchSpan": {
"from": "2018-08-25T00:00:00Z",
"to": "2019-01-01T00:00:00Z"
},
"top": {
"sort": [
{
"input": {
"builtInProperty": "$ts"
}
}
]
},
"aggregates": [
{
"dimension": {
"uniqueValues": {
"input": {
"builtInProperty": "$esn"
},
"take": 100
}
},
"measures": [
{
"count": {}
}
]
}
]
}
The above query, however, does not return any record, although there are many events stored in TSI for that specific searchSpan. Here is the response:
{
"warnings": [],
"events": []
}
The query is based on the examples in the documentation which can be found here and which is actually lacking crucial information for requirements and even some examples do not work...
Any help would be appreciated. Thanks!
#Vladislav,
I'm sorry to hear you're having issues. In reviewing your API call, I see two fixes that should help remedy this issue:
1) It looks like you're using our /events API with payload for /aggregates API. Notice the "events" in the response. Additionally, “top” will be redundant for /aggregates API as we don't support top-level limit clause for our /aggregates API.
2) We do not enforce "count" property to be present in limit clause (“take”, “top” or “sample”) and it looks like you did not specify it, so by default, the value was set to 0, that’s why the call is returning 0 events.
I would recommend that you use /aggregates API rather than /events, and that “count” is specified in the limit clause to ensure you get some data back.
Additionally, I'll note your feedback on documentation. We are ramping up a new hire on documentation now, so we hope to improve the quality soon.
I hope this helps!
Andrew
tldr;
How to match and filter localized search with a localized index ?
long version
I have an application where the user search must be done in the context of it's language.
In elastic search index, I want documents with both i18n properties and non i18n properties (I want to avoid creating multiple index, one for each language).
The mapping of the document should look like :
'entry': {
'properties': {
'name' : {'type': 'string'}, /* unlocalized properties */
'category': { /* localized properties */
"properties" : {
"lang_fr" : {
"type" : "string"
},
"lang_de" : {
"type" : "string"
}
}
},}}
having that, I have two requirements:
1) Matching: when doing a search, exclude from search the localized fields that are not concerned by the user language (let's say the user's language is 'fr', I want to exclude 'de' fields from search. How to do this without specifying the entire list of fields I want to search on. To start simple, I tried this but it doesn't work :
{
"query": {
"match": {
"*.lang_fr": "full_text"
}
}
}
However, "categories.lang_fr": "full_text" works well. But I don't want to maintain the list of fields in the query. I want a general rule like you can do in SolR.
2) Filtering: when I retrieve my results, I want to filter out all localized fields that doesn't corresponds to my user language. In other words, using the source filter, I'd like to have all unlocalized fields, exclude all fields starting with "lang" , but include all fields being 'lang_fr'. I tried the following but it doesn't work:
{
"_source": {
"include": [ "*", "*.lang_fr" ],
"exclude": [ "*.lang_*" ],
}
...}
the wildcard operator doesn't seems to work. I partially have what I want if I specify "categories.lang_de", but again, I don't want to maintain the list of fields, I want a generic rule. The include/exclude operation doesn't work as I would like. The only thing that actually works is a query where I specify all languages to exclude for all fields specifically, such as :
{
"_source": {
"exclude": [ "categories.lang_de", "categories.lang_en", "categories.lang_it",
"another_field.lang_de", "catanother_fieldgories.lang_en", "another_field.lang_it"],
}
...}
for 'fr' search.
I'm quite surprised I couldn't find anything on google. I see it as a very standard case of i18n applied to elasticsearch. Maybe I'm modelizing i18n the wrong way in ES ?
thank you in advance !
You can achieve the first one using a query_string query which takes advantage of the powerful Lucene expression language and allows to specify wildcard in field names:
{
"query": {
"query_string": {
"query": "\\*.lang_fr:full_text"
}
}
}
or you can also specify the field name in the fields parameter, like this
{
"query": {
"query_string": {
"query": "full_text"
"fields": ["*.lang_fr"]
}
}
}
As for your second one, source filtering is indeed the way to go but I suggest simply excluding all languages but the one you're searching for. For instance, if the search is in French, you'd simply exclude all other languages without necessarily having to enumerate all the fields, just all the languages that you don't want (which would be much less). That would allow you to add localized fields as you go without having to change the query.
{
"_source": {
"exclude": [ "*.lang_de", "*.lang_it" ],
}
...}
Lucene mentions that -
If The document you are indexing are very large. Lucene by default only indexes the first 10,000 terms of a document to avoid OutOfMemory errors
though we can configure it by IndexWriter.setMaxFieldLength(int).
I created an index in elasticsearch - http://localhost:9200/twitter and posted a document with 40,000 terms in it.
mapping -
{
"twitter": {
"mappings": {
"tweet": {
"properties": {
"filter": {
"properties": {
"term": {
"properties": {
"message": {
"type": "string"
}
}
}
}
},
"message": {
"type": "string",
"analyzer": "standard"
}
}
}
}
} }
i indexed a document with message field has 40,000 terms - message: "text1 text2 .... text40000" .
Since standard analyzer analyzes on space it has indexed 40,000 terms.
My point is Does elasticsearch sets a limit of number of indexed terms on lucene ? If yes what is that limit ?
If no, how my all 40,000 terms got indexed , it shouldn't have indexed terms more than 10000.
The source you're citing doesn't seem up-to-date, as IndexWriter.setMaxFieldLength(int) was deprecated in Lucene 3.4 and now isn't available anymore in Lucene 4+, which ES is based on. It's been replaced by LimitTokenCountAnalyzer. However, I don't think such a limit exists anymore, or at least it is not set explicitly within the Elasticsearch codebase.
The only limit you might encounter while indexing documents would be related to either the HTTP payload size or Lucene's internal buffer size such as explained in this post
With Elasticsearch I have created an index using a custom mapping and custom set of analszers, however I'm not able to do query search on the _all field.
I'm using these analyzers:
{
"analysis": {
"analyzer": {
"case_insensitive": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
],
"char_filter": "punctuation"
}
},
"char_filter": {
"punctuation": {
"type": "mapping",
"mappings": [
".=>\\u0020",
"-=>\\u0020",
"_=>\\u0020"
]
}
}
}
}
and this mapping:
{
"article": {
"_all": {
"enabled": true,
"store": "yes",
"index_analyzer": "case_insensitive",
"search_analyzer": "case_insensitive"
},
"properties": {
"title": {
"type": "string",
"index": "analyzed"
},
"subtitle": {
"type": "string",
"analyzer": "case_insensitive"
},
"comment": {
"type": "string",
"index": "not_analyzed"
},
"review": {
"type":"string",
"index": "not_analyzed",
"include_in_all":false
}
}
}
}
Then I add a document like this:
{
"title": "This is the story of a wonderful man.",
"subtitle":"A man goes on vacation in the worst place possible.",
"comment": "I like the movie very much, however I did not undertand it.",
"review":"Very well"
}
and I expect the following 3 out of 4 fields shall be included in _all, in particular title, subtitle and comment.
The analyzer is working as following (tested using the analyzer test in elasticsearch):
"I like the movie very much, however I did not undertand it." -> "i like the movie very much, however i did not undertand it "
"This is the story of a wonderful man." -> "this is the story of a wonderful man "
I expect that at least searching on _all using the query: "This is the story of a wonderful man." I should be able to find the document.
What am I doing wrong?
How is elasticsearch populating the _all field?
If the field 'title' shall be added to the _all field, which data is used and how? is it using the output of the analyzer selected for the 'title' field as input for the analyzer of the _all or is using the raw data?
How is the flow of data in the _all field? For example
input -> analyzer -> title -> index_analyser -> _all
or
input -> analyzer -> title
-> index_analyser -> _all
Thank you in advance...
Your mapping looks ok to me. The only thing I would try is to set one of the fields explicitly to include_in_all=true and then rerun your query.
According to the docs, it may be that as you are overriding the default value of include_in_all for one of the fields, it may have changed it for all the other fields of the objects. See here _all
Relevant text from the documentation is below:
Inclusion in the _all field can be controlled on a field-by-field basis by using the include_in_all setting, which defaults to true. Setting include_in_all on an object (or on the root object) changes the default for all fields within that object.
UPDATE:
I think I know why its not working. Here is what I did. First, I removed the custom analysers from the _all_ field (so using the standard analyser). With this I was able to query and get the results as expected. Results were returned for terms that were in any of the document attributes but review. At least this confirms that the general behaviour of _all is correct. Next to test the analysers, I did a query on the subtitle field with the exact text(as it is using keyword analyser). This also worked. Then I realised that _all is an aggregated field and then analysed.
So the query should include all the text from all the fields to work. But again, how do we know in which order they were aggregated :)
This link _all custom analyser has some information. Relevant bits extracted below (from Shay).
You don't want to set the analyzer for _all to be keyword, _all is an aggregation of all the other fields int the doc, so you basically treat the whole aggregation of text as a single token.