Elasticsearch Completion Suggester field contains comma separated values - search

I have a field that contains comma separated values which I want to perform suggestion on.
{
"description" : "Breakfast,Sandwich,Maker"
}
Is it possible to get only the applicable token while performing suggest-as-you-type?
For ex:
When I type break, how can I get only Breakfast and not the full Breakfast,Sandwich,Maker?
I have tried using a comma tokenizer but it does not seem to help.

As said in the documentation, you can provide multiple possible inputs by indexing like this:
curl -X PUT 'localhost:9200/music/song/1?refresh=true' -d '{
  "description" : "Breakfast,Sandwich,Maker",
  "suggest" : {
    "input": [ "Breakfast", "Sandwich", "Maker" ],
    "output": "Breakfast,Sandwich,Maker"
  }
}'
This way, you can suggest with any word of the list as input.
Getting back only the corresponding word as the suggestion is not possible in Elasticsearch, but as a workaround you could use a tokenizer outside Elasticsearch to split the suggested string and keep only the token that has the input as a prefix.
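For completeness, querying such a completion field would then look something like this (a sketch, assuming the suggest field is mapped with type completion and using the pre-5.x _suggest endpoint, consistent with the indexing example above):
curl -X POST 'localhost:9200/music/_suggest?pretty' -d '{
  "song-suggest": {
    "text": "break",
    "completion": {
      "field": "suggest"
    }
  }
}'
The response still contains the full output string Breakfast,Sandwich,Maker, which is exactly why the splitting workaround described above is needed.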
EDIT: a better solution would be to use an array instead of comma-separated values, but that doesn't meet your specs... (look at this: Elasticsearch autocomplete search on array field)

Related

How to search for a specific dynamic pattern of a field in MongoDB?

I need to search a MongoDB collection for a field matching a specific pattern. I tried using {$exists:true}; however, this gives results only if you provide the exact field name, not a pattern.
{
  "field1": "value1",
  "field2": "value2",
  "field3": {
    "/arjun1/pat1": 1,
    "/arjun2/pat2": 3,
    "/arjun3/pat3": 5
  },
  "field4": "value4"
}
From some other field I get the keys pat3 and field3. From these I need to find out whether the key /arjun3/pat3 exists in the document.
If I use {"field3./arjun3/pat3":{$exists:true}}, this gives me results. But the problem is that I only have field3 and pat3, so I need some pattern matching like field3.*.pat3 and then use $expr or $exists, which I'm not exactly sure how to do. Please help.
You could try something of this kind:
db.arjun.find({
  "field3": {
    "$elemMatch": {
      "$and": [
        { "arjun3.pat3": { "$exists": true } },
        { "arjun3.pat3": 5 }
      ]
    }
  }
});
You can either go for regex (the re module) for SQL-like pattern matching and compile your own custom wildcard, or, if you don't want that, simply use the fnmatch module. It is a built-in Python library that allows wildcard matching of multiple characters (via *) or a single character (via ?).
import fnmatch
a = "hello"
print(fnmatch.fnmatch(a, "h*"))
OUTPUT:-
True

How to match different instances of the same query in Elasticsearch?

Example 1:
My query term is "abc".
My query structure is like this:
{
  "query": {
    "query_string": {
      "query": "abc",
      "fields": ["field1", "field2", "field3"]
    }
  },
  "size": 50,
  "highlight": {
    "fields": {
      "field1": {},
      "field2": {},
      "field3": {}
    }
  }
}
It matches the following instances:
abc, abcs, abc_def_ghi
But it does not match def_abc or def_abc_ghi.
Basically, it misses instances where abc is in the middle of a string.
Example 2:
In the same example above, if my query is abc_def
It does not match abc_def_ghi, although abc_def is present.
I have tried phrase_prefix and it solves example 2 but misses out on example 1's problems.
Any help would be appreciated.
For these usages you should use a wildcard in the query, or a regular expression.
If you are using a term-level query, you can utilize the wildcard query or the regexp query instead.
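For illustration, a wildcard variant of the query from the question might look like this (a sketch; note that a leading wildcard such as *abc* matches abc anywhere in a term but can be expensive on large indices):
{
  "query": {
    "query_string": {
      "query": "*abc*",
      "fields": ["field1", "field2", "field3"]
    }
  }
}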
As the name suggests, phrase_prefix is like a poor man's autocomplete: it searches for fields that start with the given phrase, in your case abc, abcs and abc_def_ghi. Since the field value doesn't start with abc in the case of def_abc and def_abc_ghi, it won't work with phrase prefix.
Try using character filters, specifically the Pattern Replace Character Filter, to replace _ with a space while analyzing the field (check this answer). Your tokens would then come out as [def, abc, ghi] instead of a single token like [def_abc_ghi]. You can then search using cross_fields on the analyzed field, which should satisfy all of the mentioned cases.
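A minimal sketch of such an analyzer, using hypothetical names myindex, underscore_to_space and underscore_analyzer, could look like this:
curl -X PUT 'localhost:9200/myindex' -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "underscore_to_space": {
          "type": "pattern_replace",
          "pattern": "_",
          "replacement": " "
        }
      },
      "analyzer": {
        "underscore_analyzer": {
          "type": "custom",
          "char_filter": ["underscore_to_space"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'
With this analyzer applied to field1, field2 and field3, a value like def_abc_ghi is indexed as the tokens [def, abc, ghi], so a search for abc matches it.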

Elasticsearch: How to get the length of a string field (before analysis)?

My index has a string field containing a variable-length random id. Obviously it shouldn't be analysed.
But I didn't know much about Elasticsearch when I created the index.
Today I tried a lot of things to filter documents based on the length of the id, and I finally ended up with this Groovy script:
doc['myfield'].values.size()
or
doc['myfield'].value.size()
Both return mysterious numbers; I think that's because the field got analysed.
If that's really the case, is there any way to get the original length, or to fix the problem without rebuilding the whole index?
Use _source instead of doc. That's using the source of the document, meaning the initial indexed text:
_source['myfield'].size()
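Wrapped in a complete search request, that might look like the following (a sketch, assuming a pre-2.x setup where dynamic Groovy scripting is enabled and the field is literally named myfield; the script-field name id_length is arbitrary):
{
  "query": { "match_all": {} },
  "script_fields": {
    "id_length": {
      "script": "_source['myfield'].size()"
    }
  }
}
Keep in mind that _source access is slower than doc values, because the whole source has to be loaded and parsed for every hit, which is one more reason to consider the re-indexing options below.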
If possible, try to re-index the documents to:
use doc[field] on a not-analyzed version of that field
even better, find out the size of the field before you index the document and consider adding its size as a regular field in the document itself
Elasticsearch stores a string field in tokenized form in the data structure (the field data cache) that scripts have access to.
So assuming that your field is analyzed (i.e., not not_analyzed), doc['field'].values will look like this:
"In america" => [ "in" , "america" ]
Hence what you get from doc['field'].values is an array and not a string.
The story doesn't change even if you have a single token or the field is not_analyzed:
"america" => [ "america" ]
Now, to see the size of the first token, you can use the following request:
{
  "script_fields": {
    "test1": {
      "script": "doc['field'].values[0].size()"
    }
  }
}

Elasticsearch: mapping text field for search optimization

I have to implement a text search application which indexes news articles and then allows a user to search for keywords, phrases or dates inside these texts.
After some consideration of my options (mainly Solr vs. Elasticsearch), I ended up doing some testing with Elasticsearch.
Now the part that I am stuck on concerns the mapping and query-construction options best suited for some special cases that I have encountered. My current mapping has only one field, which contains all the text and needs to be analyzed in order to be searchable.
The specific part of the mapping with the field:
"txt": {
"type" : "string",
"term_vector" : "with_positions_offsets",
"analyzer" : "shingle_analyzer"
}
where shingle_analyzer is:
"analysis" : {
"filter" : {
"filter_snow": {
"type":"snowball",
"language":"romanian"
},
"shingle":{
"type":"shingle",
"max_shingle_size":4,
"min_shingle_size":2,
"output_unigrams":"true",
"filler_token":""
},
"filter_stop":{
"type":"stop",
"stopwords":["_romanian_"]
}
},
"analyzer" : {
"shingle_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["lowercase","asciifolding", "filter_stop","filter_snow","shingle"]
}
}}
My question regards the following situations:
I have to search for "ING" and there are several "ing." that are returned.
I have to search for "E!" and the analyzer kills the
punctuation and thus no results.
I have to search for certain uppercased common terms that are used as company names (like "Apple" but with multiple words) and the lowercase filter creates useless results.
The idea that I have would be to build different fields with different filters that could cover all these possible issues; a rough sketch of that multi-field mapping is included after the questions below.
Three questions:
Is splitting the field in three fields with different analyzers the correct way?
How would I use the different fields when searching?
Could someone explain how scoring would work to include all these fields?
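To make the multi-field idea above concrete, here is a rough sketch (the sub-field names exact and raw are hypothetical, and the syntax follows the 1.x-style string mapping shown in the question):
"txt": {
  "type": "string",
  "term_vector": "with_positions_offsets",
  "analyzer": "shingle_analyzer",
  "fields": {
    "exact": {
      "type": "string",
      "analyzer": "whitespace"
    },
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}
At query time, one common pattern is a multi_match query across txt, txt.exact and txt.raw (for example with the most_fields type), boosting the stricter sub-fields so that exact matches on terms like "ING" or "E!" rank higher than their analyzed variants.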

Searching by field containing a comma in Elasticsearch

We have users in our system and their nicknames can contain commas, which is a special character that Elasticsearch uses to separate values. I have the following users stored:
{
"nickname" : "John"
}
{
"nickname" : "John,2"
}
If I execute the query nickname:John I get both documents, which is not what I expected.
I am not sure what I need: a tokenizer, an analyzer...
Thanks in advance
String fields are analyzed by default in Elasticsearch; that's why your 2nd user is indexed with two terms, "John" and "2", and matches your nickname:John query.
If you want your nickname not to be analyzed (i.e., treated as a single string), you have to explicitly set the mapping of this field to use the "keyword" analyzer.
More information about that : http://www.elasticsearch.org/guide/reference/index-modules/analysis/keyword-analyzer/ and about the mapping API : http://www.elasticsearch.org/guide/reference/api/admin-indices-put-mapping/
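As an illustration, the mapping for the nickname field might look roughly like this (a sketch; the index name users and type name user are hypothetical, and "index": "not_analyzed" would be an equivalent alternative to the keyword analyzer here):
curl -X PUT 'localhost:9200/users/user/_mapping' -d '{
  "user": {
    "properties": {
      "nickname": {
        "type": "string",
        "analyzer": "keyword"
      }
    }
  }
}'
Note that if the field already exists with a different mapping, you typically have to create a new index with the desired mapping and re-index the documents, since the analyzer of an existing field cannot simply be changed in place.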
