Apache Solr suggestion suggest only if search term have missing last char - search

I'm having very strange issue with Broadleaf solr search please see following screen-shot
here is i search with wrong spelled term "mesur" then solr search provide spell correction result but see result all results seems to have last char missing.
now see following second screen-shot
now i have appended "e" to search terms and its "mesure" now then it is not providing any results can any one having good solr experience help me out with this especially why solr have missing last character in suggestion?.

I have resolved my issue by changing schema.xml, i'm having issue with field type, previously it was as follow
<fieldType name="text_general_partial" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<!-- Partial Word matcher -->
<filter class="solr.NGramFilterFactory" minGramSize="3"
maxGramSize="1000" />
<filter class="solr.ReverseStringFilterFactory" />
<filter class="solr.NGramFilterFactory" minGramSize="3"
maxGramSize="1000" />
<filter class="solr.ReverseStringFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="English" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="English" />
</analyzer>
</fieldType>
but i have changed to as follows and it's working fine now
<fieldType name="text_general_partial" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<!-- Partial Word matcher -->
<filter class="solr.NGramFilterFactory" minGramSize="3"
maxGramSize="1000" />
<filter class="solr.ReverseStringFilterFactory" />
<filter class="solr.NGramFilterFactory" minGramSize="3"
maxGramSize="1000" />
<filter class="solr.ReverseStringFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
language="English" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" />language="English" />
</analyzer>
</fieldType>
removed filters as per xml schema and its working fine now

Related

How to handle Arabic characters on Solr

I'm trying to make my site ignore some Arabic characters ex("ه"،"ة") during the search. when user search for word end with "ة" like "مدينة" it brings only word end with that character "ة", it's supposed to bring also the words ends with "ه" like "مدينه".
these characters are the same in the Arabic language so the search result shouldn't be different, but my site produce different results
what I try on my schema_extra_types.xml
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="accents_ar.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ar.txt"/>
<filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" protected="protwords_ar.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Arabic" protected="protwords_ar.txt"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="25" />
<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="accents_ar.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_ar.txt" expand="true" ignoreCase="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ar.txt"/>
<filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="0" generateNumberParts="1" protected="protwords_ar.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="0"/>
<filter class="solr.LengthFilterFactory" min="2" max="100"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Arabic" protected="protwords_ar.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
but the accents_ar.txt empty when I download the config folder from drupal admin interface,and where I can find an example of accents_ar.txt to use on my site? or there is another filter class to handle these kind of issues?

How to search with partial word with Solr

I'm trying search-api and search-api-solr modules and Solr to search with match partial words that are in the middle of a word, and aren't necessarily at the beginning of a word.
I use
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="25" />
but still it doesn't work properly?
Any advice?
my English text field in schema_extra_types.xml
Note: there are other text fields in schema_extra_types.xml but I currently try with the English field.
<!--
English Text Field
7.0.0 --> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
<filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" protected="protwords_en.txt"
splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="25" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_en.txt" expand="true" ignoreCase="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
<filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="0" generateNumberParts="1" protected="protwords_en.txt"
splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="0"/>
<filter class="solr.LengthFilterFactory" min="2" max="100"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer> </fieldType>

Search Last Four Numbers in a Given Number

I am trying to search and match last four numbers against a 10 digit number.
Example
7154226465
7152436464
7152348464
If I search for 646, it should match first two numbers. To be precise, I am looking for suffix search that matches against last 4 digits of indexed number. Below is the schema
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter catenateAll="1" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="0" generateWordParts="0" splitOnCaseChange="0"/>
<filter class="solr.ReverseStringFilterFactory"/>
<!--<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="17"/>-->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="7" maxGramSize="10" side="front"/>
<filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter catenateAll="1" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="0" generateWordParts="0" splitOnCaseChange="0" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
</fieldType>
EdgNGram with side="back" does not works in lucene 4.4. I am using solr v4.9.1
If you only want to search for the last 4 digits, then going for a EdgeNGramFilterFactory is the way to go. Try this:
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="4" />
<filter class="solr.ReverseStringFilterFactory"/>
A small note. Besides the use of ngrams, a traditional approach to efficiently support leading wildcards is to reverse the string and do a prefix query.

Text matches in debugger, but no results returned

I've got an issue where my index and query are exactly the same, however no results are returned. It seems to fail on any words that are longer than the ENGTF max length. Here's my schema.
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" stemEnglishPossessive="0" preserveOriginal="1" types="wdfftypes.txt" protected="protwords.txt"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory" words="mapping-FoldToASCII.txt"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="10" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory" words="mapping-FoldToASCII.txt"/>
</analyzer>
</fieldType>
Here is a screenshot of the analyzer when "Satisfaction" is put into the index, and "Satisfaction" is put into the query.
Any ideas? Thanks
Once obvious option is to increase the nGram length limit. You seem to be aware of this option and probably agree that is is not ideal.
Another option is to create a second field to use the nGram search, and another to use a search without nGram. For exmaple, somewhere in your schema.xml you might see:
<field name="myCoolNGramField" type="text_en_splitting" indexed="true" stored="false"/>
<!-- make a new type, text_en_non_ngram, and use it for this new field below. -->
<field name="myCoolField" type="text_en_non_ngram" indexed="true" stored="false"/>
<copyField source="myCoolNGramField" dest="myCoolField" />

Solr splitOnCaseChange at query time?

I'm getting unexpected results in Solr and hoping someone can help. My schema.xml has splitOnCaseChange="1" for the field I'm searching on (both index & query), and the default search behavior is "OR".
I have a field with the word "Airline" indexed. When i search for "Airline" I get the match. When I search for "Airline Alias", I get the match (as expected, since it's OR). However, when I search for "AirlineAlias", I am not getting a match. I was expecting the splitOnCaseChange property to separate out the term AirlineAlias query into the 2 base words. However, if that was happening, then it should be finding the match to "Airline" (i.e. it should be the exact same query as "Airline Alias").
Is my understanding correct? If so, any ideas on why I would not be getting the correct search results?
I have copied the relevant sections from the schema.xml file below.
Thanks in advance for the help.
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldType>
<fields>
<field name="value" type="text_en_splitting" indexed="true" stored="true" multiValued="true" omitNorms="true" />
/fields>
<solrQueryParser defaultOperator="OR" />
Got the answer from Jack Krupansky on the Solr mailing list, so updating here for future searchers...
Just set autoGeneratePhraseQueries="false" on the ="text_en_splitting" field
type. The current setting treated AirlineAlias as the quoted phrase "Airline
Alias".

Resources