We have a simple Solr configuration, like this: https://gist.github.com/118d17e93267123f4870
With that configuration, I'm indexing the title "The Secret of Rhonda Byrne or the Law of Attraction in the Bible" with an id.
Then I go to the Solr Analysis tool to see how it will be indexed in this field. The index analysis of that field type with the given value returns something like:
secret rhonda byrne law attraction bible
Just to make sure, I query the id field of that title to see whether it's there. It is.
But when I query this index with the analysis tool's result, I get no results. My query looks like this:
select?qf=title_stm&q="Secret+Rhonda+Byrne+Law+Attraction+Bible"&fl=id,title&defType=dismax
As mentioned in the comments, the debug output of that query is here:
https://gist.github.com/9b88fd6b5f043c90d539
The issue is with:
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
With enablePositionIncrements="true", the filter increments the position of the following token whenever it removes a stopword, leaving a gap where the stopword was.
Hence an exact phrase query built from the remaining terms does not match, because the indexed terms of the title are no longer at adjacent positions.
Read more of it here.
You should disable position increments to be able to do that phrase match.
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="false"
/>
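The effect described above can be reproduced with a small model. This is only a sketch, not Solr code: the stopword set is an assumption about what stopwords.txt contains, and `analyze` and `phrase_matches` are hypothetical helpers that imitate lowercasing, stop filtering and positional phrase matching.

```python
# Assumed contents of stopwords.txt (only the words needed for this title).
STOPWORDS = {"the", "of", "or", "in"}

def analyze(text, enable_position_increments):
    """Return (term, position) pairs after lowercasing and stopword removal."""
    tokens = []
    position = 0
    for word in text.lower().split():
        if word in STOPWORDS:
            if enable_position_increments:
                position += 1  # keep a gap where the stopword was
            continue
        position += 1
        tokens.append((word, position))
    return tokens

def phrase_matches(indexed, query):
    """True if the query terms occur in the index with the same relative offsets."""
    if not query:
        return False
    index_set = set(indexed)
    first_term, first_pos = query[0]
    for term, pos in indexed:
        if term != first_term:
            continue
        if all((q_term, pos + q_pos - first_pos) in index_set
               for q_term, q_pos in query):
            return True
    return False

title = "The Secret of Rhonda Byrne or the Law of Attraction in the Bible"
query = "Secret Rhonda Byrne Law Attraction Bible"

with_gaps = phrase_matches(analyze(title, True), analyze(query, True))
without_gaps = phrase_matches(analyze(title, False), analyze(query, False))
print(with_gaps, without_gaps)
```

With gaps enabled, "secret" and "rhonda" end up two positions apart in the indexed title but one position apart in the query, so the phrase fails. With gaps disabled, both sides produce six consecutive positions.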
I have a name field, with the following definition:
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
</fieldType>
and I would like to search for contents based on the following criteria:
Example values:
test a value
test of solr
not a value
test me says value
So, if I search for test value, I should only get results containing both test and value. And even if I search for test a value, I should still get the same results, because stopwords should be excluded. But with the current setup, even with edismax in the Solr query, I get all of the results: a record matches if it contains either of the two tokens. Could someone suggest how to update the definition to get the expected result? Also, am I using stopwords correctly? I do not want stopwords in the search consuming execution time.
I updated the definition as suggested, and even then the result does not make any sense to me.
I have a value what a term, and there are other values like what the term; about a term; about the term; description test; Name a Nothing; etc. A search for what a term returns all of these. I also had values consisting of just a and just the, and they were returned as well. Although the query for what a term omits the stopword, as the screenshot below shows, the result does not make any sense to me.
You can ignore stopwords at index time and at search time, but you cannot remove them from the response. The response returns the text exactly as it was stored. The data indexed for searching and the data stored for the response are different: searching happens on the indexed data, while the response returns the stored data.
In your field type definition you are using KeywordTokenizerFactory.
KeywordTokenizerFactory : is used when you don't want to split your text into tokens. This tokenizer treats the entire text field as a single token.
Because the whole value is one token, the StopFilterFactory that follows it has nothing to remove unless the entire value is itself a stopword.
You can use StandardTokenizerFactory instead of KeywordTokenizerFactory.
StandardTokenizerFactory : This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters.
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
</fieldType>
On the analysis page, I analyzed the data for the above field.
Index the data "test a value" and query it with "test value": the result is found. Here you can see that the stopwords are skipped while indexing, because we have applied StopFilterFactory.
Now use "test a value" both for indexing and for the query on the analysis page.
The filter skips the stopword "a" on both sides, and the result matches.
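The difference between the two tokenizers can be sketched in a few lines. This is a rough model under stated assumptions: the stopword set is assumed, `standard_tokenize` is a crude regex stand-in for StandardTokenizerFactory, and all function names are hypothetical.

```python
import re

# Assumed contents of stopwords.txt.
STOPWORDS = {"a", "of", "the"}

def keyword_tokenize(text):
    """Model of a keyword tokenizer: the whole value is one token."""
    return [text]

def standard_tokenize(text):
    """Crude stand-in for a standard tokenizer: split on non-word characters."""
    return re.findall(r"\w+", text)

def analyze(text, tokenizer):
    """Tokenize, drop stopwords (case-insensitively), then lowercase."""
    tokens = tokenizer(text)
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return [t.lower() for t in tokens]

print(analyze("test a value", keyword_tokenize))   # ['test a value']
print(analyze("test a value", standard_tokenize))  # ['test', 'value']
print(analyze("test value", standard_tokenize))    # ['test', 'value']
```

In the keyword case the stop filter never sees "a" as a token of its own, so it stays inside the single indexed value. After word-level tokenization, "test a value" and "test value" analyze to the same two tokens.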
I have been trying to solve this problem for quite some time. I am not a Solr expert; I am still learning it.
I have a special type of ID in my system that has to be searchable by users. The problem is that those IDs contain some Solr special characters. By the way, those IDs are stored together with other search terms in the terms_txt field.
Some ID examples: 292/2017 and 1.2.61-962-37/2017
The first one I will refer to as the 'simple one', and the second as the 'complex one'.
From what I have read, this kind of search should be possible with a phrase search: if we put quotation marks around the ID, it should work. Unfortunately, that is not the case. I will post my Solr 4.0 schema and an example of my query here, hoping that you can spot what is wrong. If phrase search is the answer to my problem, then something must be wrong with either the Solr schema or my query (code).
In my example I am searching for "292/2017" as a phrase. Only one entry in my index has this phrase, because this combination of characters is unique (it is some kind of ID, but we insert it in the terms_txt field with all other terms).
This is the query executed via Solr admin. It finds a lot of results, but there should be only 1. It seems that Solr handles the '/' character as a space and ignores terms shorter than 3 characters (ignoring terms shorter than 3 is what we want, but not in a phrase search):
INFO: [collection1] webapp=/solr-example path=/select params={q=terms_txt:"44/2017"&wt=xml} hits=31343 status=0 QTime=6
So basically, in this example, Solr has found all records containing the term 2017, which is bad...
This is the query executed within the application logic. It is more complex, but the problem is the same:
INFO: [collection1] webapp=/solr-example path=/select params={mm=100%25&json.nl=flat&fl=id&start=0&sort=date_in_i+desc&fq=type_s:2&fq=date_in_i:[20161201+TO+*]&fq=date_in_i:[*+TO+20171011]&fq=subtype_s:(2+4+6+8)&fq=terms_txt:"\"10/2017\""&fq=language_is:0&rows=10&bq=&q=\"10\/2017\"&tie=0.1&defType=edismax&omitHeader=true&qf=terms_txt&wt=json} hits=978 status=0 QTime=2
This is what terms_txt entries look like in the index:
<arr name="terms_txt">
<str>Some string blah blah 292/2017 - more of terms, blah blah</str>
<str>Something else, blah blah</str>
</arr>
This is my solr schema field configuration for the terms_txt field (fields are dynamic):
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(^|\s)([^\-\_&\s]+([\-\_&]+[^\-\_&\s]*)+)(?=(\s|$))" replacement="$1MжџљМ$2 $2" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&]+" replacement="MжџљМ$1" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&]+" replacement="MжџљМ$1" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&]+" replacement="MжџљМ$1" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="MжџљМ" replacement="" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w)&(\w)" replacement="$1and$2" />
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="3" max="99"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\b[\-_]+\b" replacement="" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w)&(\w)" replacement="$1and$2" />
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="3" max="99"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
Does anyone have a clue how I should make special characters like .-/ searchable? Can you spot a flaw in my example or suggest a better solution?
You should start by looking at what the analysis page for your content tells you - my guess is that the StandardTokenizer will remove a lot of special characters when tokenizing (and your PatternReplaces might remove content as well).
The Whitespace Tokenizer is better suited for a field where matching special characters is important, since it'll only break on and remove whitespace.
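To see why the query for "44/2017" matched every document containing 2017, here is a rough model of the two tokenizers followed by the length filter from the schema. It is only a sketch: `standard_like_tokenize` is a simple regex approximation, and the real StandardTokenizer follows Unicode text segmentation rules that differ in details.

```python
import re

def standard_like_tokenize(text):
    """Approximation: keep runs of letters and digits, drop everything else."""
    return re.findall(r"[A-Za-z0-9]+", text)

def whitespace_tokenize(text):
    """Model of a whitespace tokenizer: break only on whitespace."""
    return text.split()

def length_filter(tokens, min_len=3, max_len=99):
    """Model of LengthFilterFactory min="3" max="99"."""
    return [t for t in tokens if min_len <= len(t) <= max_len]

query = "44/2017"
print(length_filter(standard_like_tokenize(query)))  # ['2017']
print(length_filter(whitespace_tokenize(query)))     # ['44/2017']
```

In the first chain the ID is split at "/", "44" is then dropped for being too short, and only "2017" is left to search for. In the second chain the ID survives as one token.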
Define different fields and use different tokenizers for those fields, then prioritize hits in those fields based on a weight. Instead of trying to make one field fit all your query needs, make multiple fields, one with each definition, and query them together. You can adjust weights by using qf together with the (e)dismax handlers. These handlers also allow you to boost phrase matches for two and three shingles.
Use one or more copyField instructions to get your content from one field to the other fields, so you don't have to change your indexing code to adjust how you tweak things in Solr.
If you append debugQuery=true to your query string, you can also see how Solr / Lucene computes the score for each document and what contributes to its ranking, so you can tweak scoring values and see exactly how the final score changes.
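As an illustration of the multi-field approach, the following sketch builds such a request. The field name terms_exact and the weights are hypothetical (terms_exact stands for a copyField target with a whitespace-based type); only qf, defType and debugQuery are actual Solr parameters.

```python
from urllib.parse import urlencode, parse_qs

def build_select_query(user_query, field_weights, debug=False):
    """Build an edismax select URL that queries several weighted fields."""
    params = {
        "q": user_query,
        "defType": "edismax",
        # qf takes space-separated field^weight pairs.
        "qf": " ".join(f"{field}^{weight}" for field, weight in field_weights.items()),
    }
    if debug:
        params["debugQuery"] = "true"
    return "select?" + urlencode(params)

# terms_exact is a hypothetical field; weights are arbitrary examples.
url = build_select_query('"292/2017"', {"terms_exact": 3, "terms_txt": 1}, debug=True)
params = parse_qs(url.split("?", 1)[1])
print(url)
```

Giving the exact-match field the higher weight lets a literal ID hit outrank documents that merely share a year.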
When writing the query, escape any special characters with \.
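A minimal sketch of such escaping, assuming the special-character list of the classic Lucene query parser (`+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /`). The helper name is hypothetical; client libraries often ship their own equivalent.

```python
import re

# Characters (and the two-character operators && and ||) that the
# classic Lucene query parser treats as syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')

def escape_query_term(term):
    """Prefix every query-syntax character in term with a backslash."""
    return _SPECIAL.sub(r"\\\1", term)

print(escape_query_term("292/2017"))
print(escape_query_term("1.2.61-962-37/2017"))
```

Escaping only stops the query parser from interpreting the characters. The field's analyzer still decides whether "/" or "-" survives tokenization.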
Is there a practicable way to do an exact match search on a stemmed fulltext field?
I have a scenario in which I need a field to be indexed and searchable regardless of case or whitespace.
Even using KeywordTokenizerFactory on both index and query, all my searches based on exact match stopped working.
Is there a way to do an exact-match search, as with a string field, and at the same time use custom tokenizers applied to that field?
Below is the schema I am currently using:
<field name="subtipoimovel" type="buscalimpaquery" indexed="true" stored="true" />
<fieldType name="buscalimpaquery" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement="-"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
regards,
Silvio Giuliani
The problem is that at index time you are using KeywordTokenizerFactory, ASCIIFoldingFilterFactory, LowerCaseFilterFactory and PatternReplaceFilterFactory, but at query time you are using only KeywordTokenizerFactory. That will not work well for exact matches.
You need to see these as pipelined processors. You need to have "similar" processing during query time too.
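The pipeline idea can be modeled as plain functions. This is a sketch only: `ascii_fold` approximates ASCIIFoldingFilterFactory with Unicode decomposition, and the sample value "Área Rural" is a made-up example, not taken from the question.

```python
import unicodedata

def ascii_fold(text):
    """Approximate ASCII folding: decompose and drop combining accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_chain(value):
    """Keyword token -> ASCII folding -> lowercase -> replace spaces with '-'."""
    return ascii_fold(value).lower().replace(" ", "-")

def query_chain_as_posted(value):
    """Keyword token with no filters, as in the posted query analyzer."""
    return value

def query_chain_mirrored(value):
    """Query analyzer that applies the same steps as the index analyzer."""
    return index_chain(value)

stored_term = index_chain("Área Rural")
print(stored_term)                                         # area-rural
print(query_chain_as_posted("Área Rural") == stored_term)  # False
print(query_chain_mirrored("Área Rural") == stored_term)   # True
```

The indexed term only equals the query term when both sides pass through the same transformations.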
As Srikanth notes in a comment, you should consider splitting up the different kinds of term analysis in two separate fields. See also my answer to a functionally similar question: Solr: combining EdgeNGramFilterFactory and NGramFilterFactory.
Apparently the problem was this tokenizer:
"solr.KeywordTokenizerFactory"
I changed it to StandardTokenizerFactory and now exact matches work.
I read the description of KeywordTokenizerFactory on the Solr wiki, and it seems to me that for exact matching I should use it instead of StandardTokenizerFactory.
Does anyone know why this happens?
I am using Solr 1.4.1 (Lucene 2.9.3) on Windows and am trying to understand ShingleFilter. I wrote the following code and find that if I provide more words than the actual phrase indexed in the field, then the search on that field fails, i.e. no score is contributed from that field with debugQuery=true.
Here is an example I created to reproduce, with field names and the document indexed:
Id: 1
title_1: Nina Simone
title_2: I put a spell on you
Issue the following Queries (dismax):
- “Nina Simone I put” <- Fails to have a score from title_1 search (using debugQuery)
- “Nina Simone” <- SUCCESS
Trying to analyze the above disparity, when I used Solr’s Field Analysis with the ‘shingle’ field (given below) and tried “Nina Simone I put”, it succeeds. So it’s only during the query that no score is provided. I also checked ‘parsedquery’ and it shows disjunctionMaxQuery issuing the string “Nina_Simone Simone_I I_put” to the title_1 field.
title_1 and title_2 fields are of type ‘shingle’, defined as:
<fieldType name="shingle" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
</analyzer>
</fieldType>
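As a rough model of what this field type produces, the sketch below builds two-word shingles without unigrams. It is not the ShingleFilter implementation; the "_" separator just mirrors the notation of the parsed query quoted above, and the final subset check assumes that a phrase query needs every query shingle to be present in the field.

```python
def shingles(text, size=2):
    """Lowercase, split on whitespace, and join each run of `size` words."""
    words = text.lower().split()
    return ["_".join(words[i:i + size]) for i in range(len(words) - size + 1)]

indexed = shingles("Nina Simone")
short_query = shingles("Nina Simone")
long_query = shingles("Nina Simone I put")

print(indexed)     # ['nina_simone']
print(long_query)  # ['nina_simone', 'simone_i', 'i_put']
print(set(short_query) <= set(indexed))  # True
print(set(long_query) <= set(indexed))   # False
```

Under that assumption, the longer query asks title_1 for shingles such as simone_i that were never indexed there, which would be consistent with the field contributing no score.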
Note that I also have a catchall field which is text. I have qf set to: 'id^2 catchall^0.8' and pf set to: 'title_1^1.5 title_2^1.2'
Is there something I am missing or doing wrong?
In a dismax query, the score of the query is the max of the subqueries, not the sum. I don't really know much about how it parses shingle queries, but if it does something like "(title1:(shingle1 shingle2...)) (title2:(shingle1 shingle2...))" then you should expect to see only one field contribute to the score.
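The max-versus-sum behavior can be written down as a small function. This is a sketch of the disjunction-max combination, with the optional tie parameter of the dismax handler adding a fraction of the non-winning field scores; the per-field scores used here are made-up numbers.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus `tie` times the sum of the remaining scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

# Hypothetical per-field scores for one document.
scores = [2.0, 1.0]
print(dismax_score(scores))           # 2.0  (pure max)
print(dismax_score(scores, tie=0.5))  # 2.5
print(dismax_score(scores, tie=1.0))  # 3.0  (same as a plain sum)
```

With tie at 0, only the best-scoring field shows up in the total, which is why a second field can appear to contribute nothing.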
I want stopwords excluded, except when the search term is within double quotes,
e.g. "just like that" should also search for "that".
Is this possible?
It depends on the configuration of the field you are querying.
If the index analyzer includes a StopFilterFactory, then the stopwords are simply not indexed, so you cannot query for them afterward. But since Solr keeps the positions of the terms in the index, you can instruct it to increment the position of the remaining terms to reflect the fact that, originally, there were other terms in between.
The "enablePositionIncrements" here is the key to achieve that:
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
If the querying analyzer also has the StopFilterFactory configured with the same settings, your query should work as expected.
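A small model of this behavior, assuming "of" is the only stopword and using made-up sample texts. The helpers are hypothetical and only imitate positional phrase matching with the gaps kept on both the index and the query side.

```python
STOPWORDS = {"of"}  # assumed stopword list for this example

def positions(text):
    """(term, position) pairs, leaving a gap for every removed stopword."""
    result = []
    for position, word in enumerate(text.lower().split(), start=1):
        if word not in STOPWORDS:
            result.append((word, position))
    return result

def same_gaps(indexed, query):
    """True if the query terms appear in the index with identical spacing."""
    if not query:
        return False
    index_set = set(indexed)
    first_term, first_pos = query[0]
    return any(
        all((q_term, pos + q_pos - first_pos) in index_set for q_term, q_pos in query)
        for term, pos in indexed
        if term == first_term
    )

query = positions("secret of bible")
print(same_gaps(positions("secret of bible"), query))  # True
print(same_gaps(positions("secret bible"), query))     # False
```

The stopword itself is never matched, but the hole it leaves is: the quoted query only matches documents that had some word in that slot.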
See this link for details:
http://www.lucidimagination.com/search/document/CDRG_ch05_5.6.18
I've also had luck using the CommonGramsFilterFactory to achieve similar results by putting this in the appropriate place in your fieldType declaration.
<filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
Not sure how well it works with enablePositionIncrements="true" set in the StopFilterFactory. You also need to be running Solr 1.4 or later to use this.
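Roughly, the common-grams step keeps every word and additionally emits a joined pair wherever a common word is involved, so a later stop filter can drop the bare stopword while the pair remains searchable. The sketch below is a simplified model that ignores token positions; the word list is an assumption and the function names are hypothetical.

```python
COMMON = {"that", "the", "a"}  # assumed contents of stopwords.txt

def common_grams(words):
    """Emit each word, plus word_next when either of the two is a common word."""
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i + 1 < len(words) and (word in COMMON or words[i + 1] in COMMON):
            out.append(f"{word}_{words[i + 1]}")
    return out

def stop_filter(tokens):
    """Drop bare common words; joined pairs are not in the list and survive."""
    return [t for t in tokens if t not in COMMON]

grams = common_grams("just like that".split())
print(grams)               # ['just', 'like', 'like_that', 'that']
print(stop_filter(grams))  # ['just', 'like', 'like_that']
```

In this model the phrase "just like that" still leaves a trace of "that" in the index through the like_that token.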