How can I search for special characters in Solr?

I'm using Solr 6.6.2.
I need to search for special characters and highlight them in Solr, but it does not work.
My data:
[
  {
    "id" : "test1",
    "title" : "test1# title C# ",
    "dynamic_s": 5
  },
  {
    "id" : "test2",
    "title" : "test2 title C#",
    "dynamic_s": 10
  },
  {
    "id" : "test3",
    "title" : "test3 title",
    "dynamic_s": 0
  }
]
When I search for "C#", the response is "test1# title C# ", but only the word "C" is highlighted; the "#" is neither searched nor highlighted.
How can I make search and highlighting work for special characters?

The StandardTokenizer splits tokens on special characters, meaning that # will split the content into separate tokens - the first token will be C - and that's what's being highlighted. You'll probably get the exact same result if you just search for C.
The tokenization process will make a title like test2 title C# end up as the tokens test2, title and C.
Using a field type with a WhitespaceTokenizer that only splits on whitespace will probably be a better choice for this exact use case, but it's impossible to say whether that will be a good match for your regular search behaviour (i.e. if you actually want 'C' to match 'C-99' etc., splitting on those characters can be needed). But you can use a dedicated field for highlighting, and that field's analysis chain will be used to determine what to highlight. You can ask for both the original field and the more specific field to be highlighted, and then use the best result in your frontend application.
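As a rough illustration, one way to set this up in Solr 6.x is to add a whitespace-based field type and copy the title into a separate field used for highlighting, via the Schema API. The core name mycore and the names text_ws and title_ws below are placeholders, not anything from the question:

curl -X POST -H 'Content-type:application/json' 'http://localhost:8983/solr/mycore/schema' -d '{
  "add-field-type": {
    "name": "text_ws",
    "class": "solr.TextField",
    "analyzer": {
      "tokenizer": { "class": "solr.WhitespaceTokenizerFactory" },
      "filters": [ { "class": "solr.LowerCaseFilterFactory" } ]
    }
  },
  "add-field": {
    "name": "title_ws",
    "type": "text_ws",
    "indexed": true,
    "stored": true
  },
  "add-copy-field": { "source": "title", "dest": "title_ws" }
}'

After reindexing, a request such as q=title_ws:"C#"&hl=true&hl.fl=title,title_ws should return highlighting for both fields, since the whitespace tokenizer keeps C# as a single token (the lowercase filter only changes case, it does not strip the #), and the frontend can then pick the better highlight.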

Related

How to match different instances of the same query in Elasticsearch?

Example 1:
My query term is "abc".
My query structure is like this:
{
  "query": {
    "query_string": {
      "query": "abc",
      "fields": ["field1", "field2", "field3"]
    }
  },
  "size": 50,
  "highlight": {
    "fields": {
      "field1": {},
      "field2": {},
      "field3": {}
    }
  }
}
It matches the following instances:
abc, abcs, abc_def_ghi
But it does not match def_abc or def_abc_ghi, i.e. instances where abc is in the middle of the string.
Example 2:
In the same example above, if my query is abc_def, it does not match abc_def_ghi, although abc_def is present.
I have tried phrase_prefix and it solves example 2 but misses out on example 1's problems.
Any help would be appreciated.
For these usages you should use a wildcard in the query, or a regular expression.
If you are using a term query, you can use a wildcard query or a regexp query instead.
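For illustration, a wildcard query matching abc anywhere in a value might look roughly like this (field1 is just one of the field names from the question; note that leading wildcards can be expensive on large indices):

{
  "query": {
    "wildcard": {
      "field1": {
        "value": "*abc*"
      }
    }
  }
}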
As the name suggests, phrase_prefix is like a poor man's autocomplete: it searches for fields that start with the given phrase, in your case abc, abcs and abc_def_ghi. Since the field value doesn't start with abc in the case of def_abc and def_abc_ghi, it won't work with phrase_prefix.
Try using character filters, specifically the Pattern Replace Character Filter, to replace _ with a space while analyzing your field (check this answer). Your tokens would then end up as [def, abc, ghi] instead of a single token like [def_abc_ghi]. Then you can search using cross_fields on the analyzed fields, which should satisfy all of your mentioned cases.
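A sketch of what such an analyzer definition might look like; the names underscore_to_space and underscore_analyzer are made up for this example, and the exact mapping syntax depends on your Elasticsearch version:

"settings": {
  "analysis": {
    "char_filter": {
      "underscore_to_space": {
        "type": "pattern_replace",
        "pattern": "_",
        "replacement": " "
      }
    },
    "analyzer": {
      "underscore_analyzer": {
        "type": "custom",
        "char_filter": ["underscore_to_space"],
        "tokenizer": "standard",
        "filter": ["lowercase"]
      }
    }
  }
}

With field1, field2 and field3 mapped to use underscore_analyzer, a cross_fields query over them would then look like this:

{
  "query": {
    "multi_match": {
      "query": "abc",
      "type": "cross_fields",
      "fields": ["field1", "field2", "field3"]
    }
  }
}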

Elasticsearch Completion Suggester field contains comma separated values

I have a field that contains comma separated values which I want to perform suggestion on.
{
"description" : "Breakfast,Sandwich,Maker"
}
Is it possible to get only the applicable token while performing suggest-as-you-type?
For example:
When I type break, how can I get only Breakfast and not Breakfast,Sandwich,Maker?
I have tried using a comma tokenizer but it does not seem to help.
As said in the documentation, you can provide multiple possible inputs by indexing like this:
curl -X PUT 'localhost:9200/music/song/1?refresh=true' -d '{
  "description" : "Breakfast,Sandwich,Maker",
  "suggest" : {
    "input": [ "Breakfast", "Sandwich", "Maker" ],
    "output": "Breakfast,Sandwich,Maker"
  }
}'
This way, the suggestion is triggered by any word of the list as input.
Obtaining only the corresponding word as the suggestion from Elasticsearch is not possible, but as a workaround you could use a tokenizer outside Elasticsearch to split the suggested string and choose only the part that has the input as a prefix.
EDIT: a better solution would be to use an array instead of comma-separated values, but it doesn't meet your specs (look at this: Elasticsearch autocomplete search on array field).
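For completeness, a suggest request against the document indexed above might look like this on the older Elasticsearch versions that support the input/output form, assuming the suggest field is mapped with type completion (the request name song-suggest is arbitrary, and the endpoint differs in newer versions):

curl -X POST 'localhost:9200/music/_suggest' -d '{
  "song-suggest": {
    "text": "break",
    "completion": {
      "field": "suggest"
    }
  }
}'

The suggestion text that comes back is the output value Breakfast,Sandwich,Maker, which is why the answer above proposes splitting it afterwards if you only want the matching word.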

ElasticSearch - Searching for exact text match without keeping two copies in index?

Exact matching for text is supported in ElasticSearch if the field mapping contains "index" : "not_analyzed". That way, the field won't be tokenized and ES will use the whole string for exact matching. The Documentation
Is there a way to support both full text searching and exact matching without having to create two fields: one for full-text, and one with not_analyzed mapping for exact matching?
An example use case:
We want to search by book titles.
I like trees should return results of full text search
exact="I like trees" should return only books that have the exact title I like trees and nothing else. Case insensitive is fine.
You can use a term filter to do exact-match searches.
The filter looks like this:
{
  "term" : {
    "key" : "value"
  }
}
a query would look like this:
{
  "query" : {
    "filtered" : {
      "filter" : {
        "term" : {
          "key" : "value"
        }
      }
    }
  }
}
You don't need to store the data in two different fields; what you want is an ES multi-field.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html#_multi_fields
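For illustration, a multi-field mapping for the title example might look like this in the pre-5.x string syntax used elsewhere on this page (the sub-field name raw is just a common convention, not required):

"title": {
  "type": "string",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}

Full-text queries then go against title, while the term filter shown above can target title.raw for the exact, untokenized value.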

How do I find all exact matches within a block of text in Elasticsearch?

I've got an index of hundreds of book titles in Elasticsearch, with documents like:
{"_id": 123, "title": "The Diamond Age", ...}
And I've got a block of freeform text entered by a user. The block of text could contain a number of book titles throughout it, with varying capitalization.
I'd like to find all the book titles in the block of text, so I can link to the specific book pages.
Any idea how I can do this? I've been looking around for exact phrase matches in blocks of text, with no luck.
You need to index the title field as not_analyzed or with the keyword analyzer.
This tells Elasticsearch to do no processing on the field when you index or query it, which makes exact-match searches possible.
I would suggest that you keep an analyzed version as well as a not_analyzed version, in order to be able to do exact searches as well as analyzed searches. Your mapping would look like this; I assume that the type name in your case is movies.
"mappings":{
"movies":{
"properties":{
"title":{
"type": "string",
"fields":{
"row":{
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
This will give you two fields: title, which contains the analyzed title, and title.row, which contains the exact value indexed with no processing at all.
A query against title.row will only match if you enter the exact title.
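As a small illustration, an exact lookup against the raw sub-field would then be a term query along these lines (since title.row is completely unprocessed, the value must match the stored title exactly, including case):

{
  "query": {
    "term": {
      "title.row": "The Diamond Age"
    }
  }
}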

Elasticsearch: mapping text field for search optimization

I have to implement a text search application which indexes news articles and then allows a user to search for keywords, phrases or dates inside these texts.
After some consideration regarding my options (Solr vs. Elasticsearch, mainly), I ended up doing some testing with Elasticsearch.
Now the part that I am stuck on regards the mapping and search query construction options best suited for some special cases that I have encountered. My current mapping has only one field that contains all the text and needs to be analyzed in order to be searchable.
The specific part of the mapping with the field:
"txt": {
"type" : "string",
"term_vector" : "with_positions_offsets",
"analyzer" : "shingle_analyzer"
}
where shingle_analyzer is:
"analysis" : {
"filter" : {
"filter_snow": {
"type":"snowball",
"language":"romanian"
},
"shingle":{
"type":"shingle",
"max_shingle_size":4,
"min_shingle_size":2,
"output_unigrams":"true",
"filler_token":""
},
"filter_stop":{
"type":"stop",
"stopwords":["_romanian_"]
}
},
"analyzer" : {
"shingle_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["lowercase","asciifolding", "filter_stop","filter_snow","shingle"]
}
}}
My question regards the following situations:
I have to search for "ING" and there are several "ing." that are returned.
I have to search for "E!" and the analyzer strips the punctuation, so there are no results.
I have to search for certain uppercased common terms that are used as company names (like "Apple", but with multiple words) and the lowercase filter produces useless results.
The idea that I have would be to build different fields with different filters that could cover all these possible issues (a rough sketch of this idea follows the questions below).
Three questions:
Is splitting the field into three fields with different analyzers the correct way?
How would I use the different fields when searching?
Could someone explain how scoring would work to include all these fields?
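To make the idea above concrete, here is a rough sketch of what such a multi-field mapping and query could look like; the sub-field names exact and case are invented for this example, the choice of analyzers is only one possibility, and how scoring combines the fields is a separate question:

"txt": {
  "type": "string",
  "term_vector": "with_positions_offsets",
  "analyzer": "shingle_analyzer",
  "fields": {
    "exact": {
      "type": "string",
      "index": "not_analyzed"
    },
    "case": {
      "type": "string",
      "analyzer": "whitespace"
    }
  }
}

A search could then query all the variants at once and let the boosts decide which match dominates:

{
  "query": {
    "multi_match": {
      "query": "ING",
      "fields": ["txt", "txt.case^2", "txt.exact^3"]
    }
  }
}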
