Searching by field containing a comma in Elasticsearch - search

We have users in our system and their nicknames can contain commas, which is a special character that Elasticsearch uses to separate values. I have the following users stored:
{
"nickname" : "John"
}
{
"nickname" : "John,2"
}
If I execute the query nickname:John I get both documents, which is not what I expect.
I am not sure what I need: a tokenizer, an analyzer...
Thanks in advance

String fields are analyzed by default in Elasticsearch; that's why your 2nd user is indexed with two terms, "John" and "2", and matches your nickname:John query.
If you want your nickname not to be analyzed (treated as a single string), you have to explicitly map this field to use the "keyword" analyzer.
More information about the keyword analyzer: http://www.elasticsearch.org/guide/reference/index-modules/analysis/keyword-analyzer/ and about the mapping API: http://www.elasticsearch.org/guide/reference/api/admin-indices-put-mapping/
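For illustration, a mapping along these lines might look roughly like this (the index and type names here are just placeholders):
curl -X PUT 'localhost:9200/myindex/user/_mapping' -d '{
  "user" : {
    "properties" : {
      "nickname" : { "type" : "string", "analyzer" : "keyword" }
    }
  }
}'
With such a mapping, nickname is indexed as a single term and nickname:John should only match the first document.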

Related

How to match different instances of the same query in Elasticsearch?

Example 1:
My query term is "abc".
My query structure is like this:
{
  "query": {
    "query_string": {
      "query": "abc",
      "fields": ["field1", "field2", "field3"]
    }
  },
  "size": 50,
  "highlight": {
    "fields": {
      "field1": {},
      "field2": {},
      "field3": {}
    }
  }
}
It matches the following instances:
abc abcs abc_def_ghi
But it does not match def_abc or def_abc_ghi.
Basically instances where abc is in the middle of a string.
Example 2:
In the same example above, if my query is abc_def, it does not match abc_def_ghi, although abc_def is present.
I have tried phrase_prefix and it solves example 2 but misses out on example 1's problems.
Any help would be appreciated.
For these usages you should use a wildcard or a regular expression in the query.
If you are using term-level queries, you can utilize a wildcard query or a regexp query instead.
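For example, a wildcard query sketch (using field1 from your query; the wildcard is applied against the indexed terms) might look like:
{
  "query": {
    "wildcard": {
      "field1": "*abc*"
    }
  }
}
Keep in mind that queries with leading wildcards can be slow on large indices.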
As the name suggests, phrase_prefix is like a poor man's autocomplete: it searches for fields that start with the given phrase, in your case abc, abcs and abc_def_ghi. Since your field doesn't start with abc in the case of def_abc and def_abc_ghi, it won't work with phrase prefix.
Try using character filters, specifically the Pattern Replace Character Filter, to replace _ with a space while analyzing your field (check this answer). That way your field is tokenized as [def, abc, ghi] instead of a single token like [def_abc_ghi]. Then you can search it using cross_fields on the analyzed field, which should satisfy all of your mentioned cases.
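A rough sketch of such index settings (all names here are made up) could be:
{
  "settings": {
    "analysis": {
      "char_filter": {
        "underscore_to_space": {
          "type": "pattern_replace",
          "pattern": "_",
          "replacement": " "
        }
      },
      "analyzer": {
        "underscore_analyzer": {
          "type": "custom",
          "char_filter": ["underscore_to_space"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
With this analyzer, def_abc_ghi is indexed as the tokens [def, abc, ghi], so a plain match query on abc will find it.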

Elasticsearch: How to get the length of a string field(before analysis)?

My index has a string field containing a variable length random id. Obviously it shouldn't be analysed.
But I don't know much about elasticsearch especially when I created the index.
Today I tried a lot to filter documents based on the length of id, finally I got this groovy script:
doc['myfield'].values.size()
or
doc['myfield'].value.size()
Both return mysterious numbers; I think that's because the field got analysed.
If that's really the case, is there any way to get the original length, or to fix the problem without rebuilding the whole index?
Use _source instead of doc. That's using the source of the document, meaning the original text as it was sent for indexing:
_source['myfield'].size()
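For reference, a full request using this in a script field might look roughly like this (assuming dynamic Groovy scripting is enabled; the field name is the one from the question):
{
  "query": { "match_all": {} },
  "script_fields": {
    "id_length": {
      "script": "_source['myfield'].size()"
    }
  }
}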
If possible, try to re-index the documents to:
use doc[field] on a not-analyzed version of that field
even better, find out the size of the field before you index the document and consider adding its size as a regular field in the document itself
Elasticsearch stores the string in tokenized form in the data structure (the field data cache) that scripts have access to.
So assuming that your field is not not_analyzed, doc['field'].values will look like this:
"In america" => [ "in" , "america" ]
Hence what you get from doc['field'].values is an array and not a string.
The story doesn't change even if you have a single token or have the field as not_analyzed:
"america" => [ "america" ]
Now, to see the size of the first token, you can use the following request:
{
  "script_fields": {
    "test1": {
      "script": "doc['field'].values[0].size()"
    }
  }
}

Elasticsearch Completion Suggester field contains comma separated values

I have a field that contains comma-separated values, on which I want to perform suggestions.
{
"description" : "Breakfast,Sandwich,Maker"
}
Is it possible to get only the applicable token while performing suggest-as-you-type?
For example:
When I type break, how can I get only Breakfast and not Breakfast,Sandwich,Maker?
I have tried using a comma tokenizer but it seems it does not help.
As said in the documentation, you can provide multiple possible inputs by indexing like this:
curl -X PUT 'localhost:9200/music/song/1?refresh=true' -d '{
  "description" : "Breakfast,Sandwich,Maker",
  "suggest" : {
    "input": [ "Breakfast", "Sandwich", "Maker" ],
    "output": "Breakfast,Sandwich,Maker"
  }
}'
This way, the suggester will match any word of the list as input.
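For illustration, the suggest call itself would then be something like this (assuming the suggest field is mapped as a completion field named suggest):
curl -X POST 'localhost:9200/music/_suggest?pretty' -d '{
  "song-suggest" : {
    "text" : "break",
    "completion" : {
      "field" : "suggest"
    }
  }
}'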
Obtaining only the corresponding word as the suggestion from Elasticsearch is not possible, but as a workaround you could split the suggested string outside Elasticsearch and keep only the word that has the input as a prefix.
EDIT: a better solution would be to use an array instead of comma-separated values, but it doesn't meet your specs... (look at this: Elasticsearch autocomplete search on array field)

ElasticSearch - Searching for exact text match without keeping two copies in index?

Exact matching for text is supported in Elasticsearch if the field mapping contains "index" : "not_analyzed". That way, the field won't be tokenized and ES will use the whole string for exact matching (see the documentation).
Is there a way to support both full text searching and exact matching without having to create two fields: one for full-text, and one with not_analyzed mapping for exact matching?
An example use case:
We want to search by book titles.
I like trees should return results of full text search
exact="I like trees" should return only books that have the exact title I like trees and nothing else. Case insensitive is fine.
You can use a term filter to do exact-match searches.
The filter looks like this:
{
  "term" : {
    "key" : "value"
  }
}
A full query would look like this:
{
  "query" : {
    "filtered" : {
      "filter" : {
        "term" : {
          "key" : "value"
        }
      }
    }
  }
}
You don't need to store the data in two separate top-level fields; what you want is an ES multi-field.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html#_multi_fields
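A minimal multi-field mapping sketch for the book title use case (the field and sub-field names are just examples) could be:
{
  "properties": {
    "title": {
      "type": "string",
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
Full-text queries then go against title, while the term filter above targets title.raw. Note that a not_analyzed sub-field is case sensitive; for case-insensitive exact matching you could instead give the sub-field a custom analyzer built from a keyword tokenizer plus a lowercase filter.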

Elasticsearch: mapping text field for search optimization

I have to implement a text search application which indexes news articles and then allows a user to search for keywords, phrases or dates inside these texts.
After some consideration regarding my options (Solr vs. Elasticsearch, mainly), I ended up doing some testing with Elasticsearch.
The part I am stuck on concerns the mapping and search query construction options best suited for some special cases I have encountered. My current mapping has only one field, which contains all the text and needs to be analyzed in order to be searchable.
The specific part of the mapping with the field:
"txt": {
"type" : "string",
"term_vector" : "with_positions_offsets",
"analyzer" : "shingle_analyzer"
}
where shingle_analyzer is:
"analysis" : {
"filter" : {
"filter_snow": {
"type":"snowball",
"language":"romanian"
},
"shingle":{
"type":"shingle",
"max_shingle_size":4,
"min_shingle_size":2,
"output_unigrams":"true",
"filler_token":""
},
"filter_stop":{
"type":"stop",
"stopwords":["_romanian_"]
}
},
"analyzer" : {
"shingle_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["lowercase","asciifolding", "filter_stop","filter_snow","shingle"]
}
}}
My question regards the following situations:
I have to search for "ING" and several hits containing "ing." are returned.
I have to search for "E!" and the analyzer kills the punctuation, so there are no results.
I have to search for certain uppercased common terms that are used as company names (like "Apple" but with multiple words), and the lowercase filter creates useless results.
The idea I have would be to build different sub-fields with different filters that could cover all these possible issues.
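Something like this, perhaps (the sub-field names and analyzer choices are just what I have in mind, not tested):
"txt": {
  "type" : "string",
  "term_vector" : "with_positions_offsets",
  "analyzer" : "shingle_analyzer",
  "fields" : {
    "exact" : { "type" : "string", "analyzer" : "whitespace" },
    "raw" : { "type" : "string", "index" : "not_analyzed" }
  }
}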
Three questions:
Is splitting the field into three fields with different analyzers the correct way?
How would I use the different fields when searching?
Could someone explain how scoring would work to include all these fields?
