Solr splitOnCaseChange at query time? - search

I'm getting unexpected results in Solr and hoping someone can help. My schema.xml has splitOnCaseChange="1" for the field I'm searching on (both index & query), and the default search behavior is "OR".
I have a field with the word "Airline" indexed. When I search for "Airline", I get the match. When I search for "Airline Alias", I get the match (as expected, since it's OR). However, when I search for "AirlineAlias", I do not get a match. I was expecting the splitOnCaseChange property to split the query term AirlineAlias into its 2 base words. If that were happening, the query should match "Airline" (i.e. it should be the exact same query as "Airline Alias").
Is my understanding correct? If so, any ideas on why I would not be getting the correct search results?
I have copied the relevant sections from the schema.xml file below.
Thanks in advance for the help.
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldType>
<fields>
<field name="value" type="text_en_splitting" indexed="true" stored="true" multiValued="true" omitNorms="true" />
</fields>
<solrQueryParser defaultOperator="OR" />

Got the answer from Jack Krupansky on the Solr mailing list, so updating here for future searchers...
Just set autoGeneratePhraseQueries="false" on the "text_en_splitting" field
type. The current setting treated AirlineAlias as the quoted phrase "Airline
Alias".
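As a minimal sketch of that change, assuming the rest of the field type from the question stays exactly as posted, only the attribute on the opening tag differs:

```xml
<!-- Same field type as in the question; only autoGeneratePhraseQueries changes. -->
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <!-- index and query analyzers unchanged from the question -->
</fieldType>
```

With the attribute set to false, the parts that WordDelimiterFilterFactory produces for AirlineAlias should be combined with the default operator (OR here) instead of being treated as a phrase. The core typically needs to be reloaded before the schema change takes effect.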

Related

Solr schema. exact accented match and accent insensitive match

I'm trying to figure out how to configure a fieldType in Solr's managed-schema to achieve the following:
(a) When searching for non-accented strings, the results will be accent insensitive.
(b) HOWEVER When performing searching on accented strings, the results will ONLY be accent sensitive.
For example:
searchString -> expectedResult
Equipe -> Equipe, Equipé, Equípé, etc...
Equipé -> Equipé
Note: Wildcard (*) is irrelevant and chosen words are for the sake of demonstration purposes only.
My situation is a little uncommon due to some requirement restrictions, but in my schema (below) I have 3 fields: OName, OSearch, ONameSearch. (Note: OSearch and ONameSearch serve different purposes in the backend, so they need to be defined identically.)
The intention is for my Solr to query on OSearch and ONameSearch, and return the OName to UI.
My original understanding was that OName would store the original value ("María") and index it as accent-insensitive ("maria"), so that when querying without solr.ASCIIFoldingFilterFactory the following would be achieved.
Example: {query} -> {OName = result}
q = OSearch:*equipe* OR ONameSearch:*equipe* -> OName = Equipe, Equipé, Equípé, etc
q = OSearch:*equipé* OR ONameSearch:*equipé* -> OName = Equipé
This is my schema so far...
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="text_en_splitting_tight" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<field name="OName" type="lowercase" indexed="true" stored="true" />
<field name="OSearch" type="text_en_splitting_tight" indexed="true" stored="false" multiValued="true" />
<field name="ONameSearch" type="text_en_splitting_tight" indexed="true" stored="false" multiValued="true" />
<copyField source="OName" dest="OSearch" />
<copyField source="OName" dest="ONameSearch" />
Please advise, thanks!
Most if not all relevant resources I've looked into
How to ignore accent search in Solr
How to ignore accents in SOLR search?
SOLR and accented characters
Solr accent removal
SOLR Makes Search with Accented Characters Easy
Solr Ref Guide 6.6 Defining Fields
Solr Ref Guide 6.6 Copying Fields
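One commonly suggested direction, offered here only as a hedged sketch and not as a confirmed answer to this question: index the value twice, once with accents folded and once with accents preserved, and let the application choose which field to query depending on whether the user's input contains accented characters. The type and field names below (text_folded, text_accent_exact, OSearchFolded, OSearchExact) are hypothetical and not part of the original schema.

```xml
<!-- Hypothetical type: accents folded at both index and query time (accent-insensitive). -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical type: accents preserved at both index and query time (accent-sensitive). -->
<fieldType name="text_accent_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical fields, both populated from the stored OName value. -->
<field name="OSearchFolded" type="text_folded" indexed="true" stored="false" multiValued="true"/>
<field name="OSearchExact" type="text_accent_exact" indexed="true" stored="false" multiValued="true"/>
<copyField source="OName" dest="OSearchFolded"/>
<copyField source="OName" dest="OSearchExact"/>
```

Under these assumptions, a query for equipe would go to OSearchFolded and match Equipe, Equipé and Equípé, while a query for equipé would go to OSearchExact and match only Equipé. The decision about which field to query would live in the application, outside the schema.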

Apache Solr suggestion suggest only if search term have missing last char

I'm having a very strange issue with Broadleaf Solr search; please see the following screenshot.
Here I search with the misspelled term "mesur", and Solr returns spell-correction suggestions, but every suggestion seems to have its last character missing.
Now see the second screenshot.
Now I have appended "e" to the search term, making it "mesure", and it returns no results. Can anyone with good Solr experience help me out with this, especially with why the suggestions are missing their last character?
I resolved my issue by changing schema.xml. The problem was with the field type; previously it was as follows:
<fieldType name="text_general_partial" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<!-- Partial Word matcher -->
<filter class="solr.NGramFilterFactory" minGramSize="3"
maxGramSize="1000" />
<filter class="solr.ReverseStringFilterFactory" />
<filter class="solr.NGramFilterFactory" minGramSize="3"
maxGramSize="1000" />
<filter class="solr.ReverseStringFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="English" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="English" />
</analyzer>
</fieldType>
but I changed it to the following, and it's working fine now:
<fieldType name="text_general_partial" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<!-- Partial Word matcher -->
<filter class="solr.NGramFilterFactory" minGramSize="3"
maxGramSize="1000" />
<filter class="solr.ReverseStringFilterFactory" />
<filter class="solr.NGramFilterFactory" minGramSize="3"
maxGramSize="1000" />
<filter class="solr.ReverseStringFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" />
</analyzer>
</fieldType>
I removed the LowerCase, Trim and SnowballPorter filters from both analyzers, and it's working fine now.

Solr: Ignore casing of strings when calculating facet numbers

I have these values for the title field in my database:
"I Am A String"
"I am A string"
I want to make the title field available as facets in my search results.
Current result:
<lst name="title">
<int name="I Am A String">4</int>
<int name="I am A string">3</int>
</lst>
Desired result:
<lst name="title">
<int name="I Am A String">7</int>
</lst>
I actually don't care which of the 2 available string options is chosen for the final result, as long as the same strings (case insensitive) are counted for the same facet.
I tried the following field definitions for the title field. I also added the resulting facet logic.
string = sees casing as different strings
string_exact = sees casing as different strings
text_ws = breaks up into words with casing intact
text = breaks into separate words
textTight = breaks into separate words
textTrue = breaks up in words with casing intact
string_exacttest = breaks up in words with casing intact
Here's my schema.xml
<field name="title" type="string" indexed="true" stored="true"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
<fieldType name="string_exact" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<!-- A text field that uses WordDelimiterFilter to enable splitting and matching of words on case-change, alpha numeric boundaries, and non-alphanumeric chars, so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
Synonyms and stopwords are customized by external files, and stemming is enabled. Duplicate tokens at the same position (which may result from Stemmed Synonyms or WordDelim parts) are removed.-->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!--<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<!-- Less flexible matching, but fewer false matches. Probably not ideal for product names, but may be good for SKUs. Can insert dashes in the wrong place and still match. -->
<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
<!--
this filter can remove any duplicate tokens that appear at the same position - sometimes possible with WordDelimiterFilter in conjunction with
stemming.
-->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="textTrue" class="solr.TextField" positionIncrementGap="100" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
</analyzer>
</fieldType>
How can I make sure that the same strings (ignoring case) are grouped together when calculating the facets?
The string_exact definition is almost what you need, but you also need to apply a LowerCaseFilter so that each value is lowercased. The KeywordTokenizer keeps the whole value as a single token (so you won't see it broken into separate terms on whitespace). A string field doesn't allow any additional processing, but a TextField with a KeywordTokenizer behaves the same way while letting you add filters that process the token afterwards.
<fieldType name="string_facet" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
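A sketch of how this type might be wired up, assuming you want to keep the original title field for display and facet on a separate copy. The field name title_facet is hypothetical, not something from the question.

```xml
<!-- Hypothetical facet-only copy of title, using the string_facet type defined above. -->
<field name="title_facet" type="string_facet" indexed="true" stored="false"/>
<copyField source="title" dest="title_facet"/>
```

Faceting on title_facet (for example with facet.field=title_facet) would then count both spellings under a single lowercased value, "i am a string", because facet values come from the indexed terms rather than the stored text. Existing documents would need to be reindexed for the new field to be populated.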

Search Solr ShingleFilterFactory

I have a data collection on Solr and I need to make a search and look for all typed words.
For example, if a user enters the text "House Tree Spain", Solr should look for "House Tree Spain", "House Tree", "House Spain", "Tree Spain", "House", "Tree", "Spain".
I'm using "solr.ShingleFilterFactory", but only in the query analyzer.
<fieldType name="generic" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- generic -->
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- spanish -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" />
<!-- english -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- generic -->
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- spanish -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" />
<!-- english -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.ShingleFilterFactory" maxShingleSize="10" outputUnigramsIfNoShingles="true"/>
</analyzer>
</fieldType>
How can I change my schema to get the results I'm looking for?
You have to apply the Shingle filter to both the query and index analyzers. In the indexing phase, it creates the tokens "House Tree" and "Tree Spain", and puts them in the index. In the query phase, it creates those tokens out of the query and searches for them in the index. If either of those steps is omitted, then "House Tree" can never match, see?
PS. shingle size of 10 is huge. For this particular example, you only need 2. Set it as low as you can, otherwise, your index size grows very large.
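A minimal sketch of what that could look like, assuming the "generic" type from the question and a shingle size of 2. The stop-word and synonym filters from the question are left out here only to keep the fragment short; they would stay in place in a real schema.

```xml
<fieldType name="generic" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Shingles are now produced at index time as well -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Same shingle settings as the index analyzer so the tokens line up -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

Note that shingles are built from adjacent tokens only, so "House Spain" would not be generated as a shingle; with unigrams kept, "house" and "spain" can still match individually. Because the index analyzer changes, the documents would need to be reindexed.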

Text matches in debugger, but no results returned

I've got an issue where my indexed text and my query text are exactly the same, yet no results are returned. It seems to fail on any word that is longer than the ENGTF (EdgeNGramFilterFactory) maxGramSize. Here's my schema.
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" stemEnglishPossessive="0" preserveOriginal="1" types="wdfftypes.txt" protected="protwords.txt"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory" words="mapping-FoldToASCII.txt"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="10" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory" words="mapping-FoldToASCII.txt"/>
</analyzer>
</fieldType>
Here is a screenshot of the analyzer when "Satisfaction" is put into the index, and "Satisfaction" is put into the query.
Any ideas? Thanks
One obvious option is to increase the nGram length limit. You seem to be aware of this option and probably agree that it is not ideal.
Another option is to create a second field, so that one field uses the nGram search and the other searches without nGram. For example, somewhere in your schema.xml you might see:
<field name="myCoolNGramField" type="text_en_splitting" indexed="true" stored="false"/>
<!-- make a new type, text_en_non_ngram, and use it for this new field below. -->
<field name="myCoolField" type="text_en_non_ngram" indexed="true" stored="false"/>
<copyField source="myCoolNGramField" dest="myCoolField" />
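The text_en_non_ngram type itself is not spelled out above. One way it might look, as a sketch only: the text_en_splitting analyzers from the question with the EdgeNGramFilterFactory step dropped. The type name comes from the comment above; its exact contents here are an assumption, and some attributes from the question (types, protected, the KeywordMarker filter) are omitted for brevity.

```xml
<!-- Assumed definition: text_en_splitting without the EdgeNGramFilterFactory step. -->
<fieldType name="text_en_non_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- No n-gram filter here, so whole tokens of any length are indexed -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With this in place, a 12-character word such as "Satisfaction" would be indexed whole in myCoolField, so a query against that field can match it even though the n-gram field only holds prefixes of up to 10 characters. Queries would then search both fields.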
