ElasticSearch autocomplete - search

I have got four documents with a field named "fullname".
Documents:
Abigail Harrison
Abigale Hardison
Abilene Havington
Abilene-Havington
I would like to make an autocompleter for this field. Some examples:
Search: "Abi"
Result: "Abigail Harrison", "Abigale Hardison", "Abilene Havington"
Search: "Abig"
Result: "Abigail Harrison", "Abigale Hardison"
Search: "Abigail Har"
Result: "Abigail Harrison", "Abigale Hardison"
Search: "Abilene Hav"
Result: "Abilene Havington", "Abilene-Havington"
Search: "Har"
Result: "Abigail Harrison", "Abigale Hardison"
I do not want something like this: (!)
Search: "iga"
Result: "Abigail Harrison", "Abigale Hardison"
Whitespaces and hyphens should be ignored and I'd like to have all generated tokens lowercase, so the search query should not be case-sensitive.
My ES settings are the following.
{
"mappings": {
"person": {
"properties": {
"fullname": {
"index": "analyzed",
"index_analyzer": "autocomplete",
"search_analyzer": "standard",
"type": "string"
}
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"autocomplete": {
"filter": [
"lowercase",
"edgengram"
],
"tokenizer": "whitespace"
}
},
"filter": {
"edgengram": {
"max_gram": 50,
"min_gram": 3,
"type": "edgeNGram"
}
}
}
}
}
}

While indexing you should use a standard tokenizer along with lowercase, asciifolding, suggestion_shingle, edgengram and while searching use a keyword analyzer.
Try using something like this:
"index":{
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"suggestions_shingle",
"edgengram"
]
}
},
"filter": {
"suggestions_shingle": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 5
},
"edgengram": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 30,
"side": "front"
}
}
}
}
"mappings": {
"person": {
"properties": {
"fullname": {
"index": "analyzed",
"index_analyzer": "autocomplete",
"search_analyzer": "keyword",
"type": "string"
}
}
}
}
And then try searching using a match query. It should solve your problem.
Thanks

Related

Adding index to mappings in MongoDB

Hey I have a problem with my mongoDb search. I need to add one more index to my search for better result. But since its not in mappings whenever i perform my search as following it returns me nothing.
`
embeddedDocument: {
path: 'produces',
operator: {
compound: {
"must": [
{
autocomplete: {
"path": "produces.name",
"query": val,
"fuzzy": {
"maxEdits": 2,
"prefixLength": 3
}
}
},
{
equals: { path: 'produces.deleted', value: false }
}
],
}
}
}
and here is my index definition
{
"mappings": {
"dynamic": false,
"fields": {
"companyName": [
{
"analyzer": "lucene.standard",
"type": "string"
},
{
"foldDiacritics": true,
"maxGrams": 6,
"minGrams": 2,
"tokenization": "edgeGram",
"type": "autocomplete"
}
],
"produces": {
"type": "embeddedDocuments",
"fields": {
"name": [
{
"analyzer": "lucene.standard",
"type": "string"
},
{
"foldDiacritics": true,
"maxGrams": 6,
"minGrams": 2,
"tokenization": "edgeGram",
"type": "autocomplete"
}
]
}
},
"status": {
"type": "string"
}
}
}
}
So i need to update my index definition as following.
{
"mappings": {
"dynamic": false,
"fields": {
"companyName": [
{
"analyzer": "lucene.standard",
"type": "string"
},
{
"foldDiacritics": true,
"maxGrams": 6,
"minGrams": 2,
"tokenization": "edgeGram",
"type": "autocomplete"
}
],
"produces": {
"type": "embeddedDocuments",
"fields": {
"deleted": {
"type": "boolean"
},
"name": [
{
"analyzer": "lucene.standard",
"type": "string"
},
{
"foldDiacritics": true,
"maxGrams": 6,
"minGrams": 2,
"tokenization": "edgeGram",
"type": "autocomplete"
}
]
}
},
"status": {
"type": "string"
}
}
}
}
`
But my problem is i dont know where to add this index definition on the mongoDB side so i can filter with deleted false criteria.
I tried to add index to mongoDB to produces collection but it doesnt work and i still get empty result.

ElasticSearch query stops working with big amount of data

The problem: I have 2 identical in terms of settings and mappings indexes.
The first index contains only 1 document.
The second index contains the same document + 16M of others.
When I'm running the query on the first index it returns the document, but when I do the same query on the second — I receive nothing.
Indexes settings:
{
"tasks_test": {
"settings": {
"index": {
"analysis": {
"analyzer": {
"tag_analyzer": {
"filter": [
"lowercase",
"tag_filter"
],
"tokenizer": "whitespace",
"type": "custom"
}
},
"filter": {
"tag_filter": {
"type": "word_delimiter",
"type_table": "# => ALPHA"
}
}
},
"creation_date": "1444127141035",
"number_of_replicas": "2",
"number_of_shards": "5",
"uuid": "wTe6WVtLRTq0XwmaLb7BLg",
"version": {
"created": "1050199"
}
}
}
}
}
Mappings:
{
"tasks_test": {
"mappings": {
"Task": {
"dynamic": "false",
"properties": {
"format": "dateOptionalTime",
"include_in_all": false,
"type": "date"
},
"is_private": {
"type": "boolean"
},
"last_timestamp": {
"type": "integer"
},
"name": {
"analyzer": "tag_analyzer",
"type": "string"
},
"project_id": {
"include_in_all": false,
"type": "integer"
},
"user_id": {
"include_in_all": false,
"type": "integer"
}
}
}
}
}
The document:
{
"_index": "tasks_test",
"_type": "Task",
"_id": "1",
"_source": {
"is_private": false,
"name": "135548- test with number",
"project_id": 2,
"user_id": 1
}
}
The query:
{
"query": {
"filtered": {
"query": {
"bool": {
"must": [
[
{
"match": {
"_all": {
"query": "135548",
"type": "phrase_prefix"
}
}
}
]
]
}
},
"filter": {
"bool": {
"must": [
{
"term": {
"is_private": false
}
},
{
"terms": {
"project_id": [
2
]
}
},
{
"terms": {
"user_id": [
1
]
}
}
]
}
}
}
}
}
Also, some findings:
if I replace _all with name everything works
if I replace match_phrase_prefix with match_phrase works too
ES version: 1.5.1
So, the question is: how to make the query work for the second index without mentioned hacks?

Elasticsearch match with stemming

How do I do a search for a stemmed match?
I.e. at the moment I have many documents that contain the word "skateboard" in the item_title field, but only 3 documents that contain the word "skateboards". Because of this, when I do the following search:
POST /my_index/my_type/_search
{
"size": 100,
"query" : {
"multi_match": {
"query": "skateboards",
"fields": [ "item_title^3" ]
}
}
}
I only get 3 results. However, I would like also documents with the word "skateboard" to be returned.
From what I understand from Elasticsearch I would expect that this is done by specifying a mapping on the item_title field that contains an analyser which indexes the stemmed version of each word, but I can't seem to find the documentation on how to do this, which suggests that it's done in a different way.
Suggestions?
Here's one example:
PUT /stem
{
"settings": {
"analysis": {
"filter": {
"filter_stemmer": {
"type": "stemmer",
"language": "english"
}
},
"analyzer": {
"tags_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"filter_stemmer"
],
"tokenizer": "standard"
}
}
}
},
"mappings": {
"test": {
"properties": {
"item_title": {
"analyzer": "tags_analyzer",
"type": "text"
}
}
}
}
}
Index some sample docs:
POST /stem/test/1
{
"item_title": "skateboards"
}
POST /stem/test/2
{
"item_title": "skateboard"
}
POST /stem/test/3
{
"item_title": "skate"
}
Perform the query:
GET /stem/test/_search
{
"query": {
"multi_match": {
"query": "skateboards",
"fields": [
"item_title^3"
]
}
},
"fielddata_fields": [
"item_title"
]
}
And see the results:
"hits": [
{
"_index": "stem",
"_type": "test",
"_id": "1",
"_score": 1,
"_source": {
"item_title": "skateboards"
},
"fields": {
"item_title": [
"skateboard"
]
}
},
{
"_index": "stem",
"_type": "test",
"_id": "2",
"_score": 1,
"_source": {
"item_title": "skateboard"
},
"fields": {
"item_title": [
"skateboard"
]
}
}
]
I have added, also, the fielddata_fields element so that you can see how the content of the field has been indexed. As you can see, in both cases, the indexed term is skateboard.

Prioritize exact words first and then others while using query_string query in elasticsearch

If I search for a term "Liebe" the current query and the analyzers used, returns me the results containing the word "Liebe" as a part of a different word "Verlieben" are prioritized over those with only this word.
It should be the other way.
I am also using some advance filters and aggregations too. But here is the most basic query that I use to search.
{
"query": {
"query_string": {
"query": "Liebe",
"default_operator": "AND",
"analyzer": "my_analyzer1"
}
},
"size": "10",
"from": 0
}
The analyzers and index settings are as follows:
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
},
"my_analyzer1":{
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"my_stop'.$s_id.'",
"asciifolding"
]
}
}
}
},
"mappings": {
"product": {
"_all": {
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"properties": {
'.$mapping.'
}
}
}
}
Please ignore the $mapping. These are the dynamic fields that reside in the index based on some settings in my framework.
Can anyone please point me some direction where I don't need to change more and can get the what I mentioned above?
I have checked many things like match query n all. But, I don't have any fields fixed. So,I cant use that. And I want both the exact search and the search results which has partial match(Using nGrams).
Please help!
Thanks!

Elasticsearch use custom analyzer on filter

I was asking on elasticsearch nested filter return empty result about some error I have in the query and wont getting any results, but in the answer I was pointed out that the expression I use for the filter wasn't analyzed as I expect.
I have a custom analyzer to do the work how can I specify in the next query to the filter to use this custom analyzer:
GET /develop/_search?search_type=dfs_query_then_fetch
{
"query": {
"filtered" : {
"query": {
"bool": {
"must": [
{ "match": { "title": "post" }}
]
}
},
"filter": {
"bool": {
"must": [
{"term": {
"featured": 0
}},
{
"nested": {
"path": "seller",
"filter": {
"bool": {
"must": [
{ "term": { "seller.firstName": "Test 3" } }
]
}
},
"_cache" : true
}}
]
}
}
}
},
"sort": [
{
"_score":{
"order": "desc"
}
},{
"created": {
"order": "desc"
}
}
],
"track_scores": true
}
Here is a setup that seems to do what you want. I used the same basic code as the last answer, but used index_analyzer and search_analyzer in the index definition as follows:
curl -XDELETE "http://localhost:9200/my_index"
curl -XPUT "http://localhost:9200/my_index" -d'
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"snowball": { "type": "snowball", "language": "English" },
"english_stemmer": { "type": "stemmer", "language": "english" },
"english_possessive_stemmer": { "type": "stemmer", "language": "possessive_english" },
"stopwords": { "type": "stop", "stopwords": [ "_english_" ] },
"worddelimiter": { "type": "word_delimiter" }
},
"tokenizer": {
"nGram": { "type": "nGram", "min_gram": 3, "max_gram": 20 }
},
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "nGram",
"filter": [
"stopwords",
"asciifolding",
"lowercase",
"snowball",
"english_stemmer",
"english_possessive_stemmer",
"worddelimiter"
]
},
"custom_search_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"stopwords",
"asciifolding",
"lowercase",
"snowball",
"english_stemmer",
"english_possessive_stemmer",
"worddelimiter"
]
}
}
}
},
"mappings": {
"posts": {
"properties": {
"title": {
"type": "string",
"analyzer": "custom_analyzer",
"boost": 5
},
"seller": {
"type": "nested",
"properties": {
"firstName": {
"type": "string",
"index_analyzer": "custom_analyzer",
"search_analyzer": "custom_search_analyzer",
"boost": 3
}
}
}
}
}
}
}'
Then added the test docs
curl -XPUT "http://localhost:9200/my_index/posts/1" -d'
{"title": "post", "seller": {"firstName":"Test 1"}}'
curl -XPUT "http://localhost:9200/my_index/posts/2" -d'
{"title": "post", "seller": {"firstName":"Test 2"}}'
curl -XPUT "http://localhost:9200/my_index/posts/3" -d'
{"title": "post", "seller": {"firstName":"Test 3"}}'
And then a couple of match queries in a bool, where one is a multiword query, seems to accomplish what you are wanting:
curl -XPOST "http://localhost:9200/my_index/_search" -d'
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "post"
}
},
{
"nested": {
"path": "seller",
"query": {
"match": {
"seller.firstName": {
"query": "Test 3",
"operator": "and"
}
}
}
}
}
]
}
}
}'
...
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 6.8380365,
"hits": [
{
"_index": "my_index",
"_type": "posts",
"_id": "3",
"_score": 6.8380365,
"_source": {
"title": "post",
"seller": {
"firstName": "Test 3"
}
}
}
]
}
}
Here is the code I used:
http://sense.qbox.io/gist/8cd954aa60be8c44f64e4282e15e6b565c945ecb
Does that solve your problem?

Resources