SOLR WordDelimiterFilterFactory - search

I use WordDelimiterFilterFactory to split words that contain numbers into Solr tokens. For example, the word "Php5" is split into two tokens, "PHP" and "5". When searching, the request that Solr executes is q="php" and q="5". But this request also returns results that contain only "5". What I want is to find only documents with "PHP5" or "PHP 5".
Does anyone have an idea how to get around this?
I hope this is clear.
Thanks.

You need to get Solr, in addition to indexing "php5", to index "php 5" as a single token. That way a search for "php 5" will match, but a search for "blah 5" will not, for example.
The only way I was able to get this to work well was to use the Auto Phrasing filter by Lucidworks.
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" phrases="autophrases.txt" includeTokens="true" replaceWhitespaceWith="_" />
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
synonyms.txt
php5,php_5
protwords.txt (so the delimiter doesn't break it)
php5,php_5
You also have to change the query parser to use the Lucidworks parser.
solrconfig.xml
<queryParser name="autophrasingParser" class="com.lucidworks.analysis.AutoPhrasingQParserPlugin" >
<str name="phrases">autophrases.txt</str>
<str name="replaceWhitespaceWith">_</str>
<str name="ignoreCase">false</str>
</queryParser>
<requestHandler name="/searchp" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">Keywords</str>
<str name="defType">autophrasingParser</str>
</lst>
</requestHandler>
autophrases.txt
php 5
The filter can be found here: https://github.com/LucidWorks/auto-phrase-tokenfilter
This article was also very helpful: http://lucidworks.com/2014/07/02/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/

This filter splits tokens at word delimiters.
In your case you can opt for splitOnNumerics="0", so it won't split on numbers.
splitOnNumerics:
(integer, default 1) If 0, don't split words on transitions from alpha to numeric: "FemBot3000" -> "Fem", "Bot3000"
The rules for determining delimiters are described at the link below:
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
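As a rough sketch of that suggestion, the delimiter filter could be declared with splitOnNumerics turned off. The other attribute values are simply copied from the configuration in the first answer and are assumptions about the asker's actual setup; the same line would go into both the index and the query analyzer.

```xml
<!-- Sketch only: attributes other than splitOnNumerics are assumed, not taken from the asker's schema -->
<filter class="solr.WordDelimiterFilterFactory"
        splitOnNumerics="0"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        splitOnCaseChange="1"
        preserveOriginal="1"/>
```

With this setting "php5" should stay a single token, so a query for "5" alone no longer matches it. Text written as "PHP 5" is still split into two tokens by the tokenizer, so that spelling is presumably not covered by this change alone, and a reindex is needed after changing the index analyzer.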

Related

Solr: Searching with/without spaces in keywords

I am experiencing an issue when spaces are introduced into keywords. For example:
We have a product with the title "Sony Playstation 4 Camera V2 PS4 (PSVR)"
Searching for "playstation" or "playstation camera" brings back this product
Searching for "play station" or "play station camera" does not bring back this product (notice the space)
Here is the fieldType being used:
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
How can I fix this, and make both "playstation" and "play station" match? My example only involves PlayStation, but this can happen with any search term, e.g. "cyberpunk" vs. "cyber punk". So solutions that require a lot of manual work, such as adding a synonym play station => playstation, are not feasible.
Things I have tried, but not managed to make work:
N-GRAM filter and tokenizer
Fuzzy search
Removing whitespace
Escaping whitespace
You can use a Shingle Filter to combine multiple tokens into one.
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" tokenSeparator=""/>
</analyzer>
If you assume that the terms are spelled correctly when being indexed, you can apply this only when querying. With tokenSeparator set to an empty string (the default separator is a space), it'll concatenate the tokens for you, effectively giving you multiple "merged" tokens:
play station camera => play, station, camera, playstation, stationcamera
... given maxShingleSize=2. If you increase the max size to 3, this will also give you playstationcamera as a single token (in this case). If you have terms where people will possibly split a word multiple times, that might be necessary.
If you assume that your terms are indexed correctly and this is only needed at query time, your index won't change, so you won't have to reindex (and the index size won't change).
You might have to move the filter to a different position in the chain; your stemming filter can break this in unexpected places, since you'll end up concatenating previously stemmed terms.
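One way to read the last remark is to shingle before stemming. The following query analyzer is a trimmed-down sketch, not the answerer's configuration: the filters from the question that do not affect the idea are left out, and it assumes the query parser hands the whole multi-word input to the analyzer (for example sow=false), since shingles cannot form if each word is analyzed separately.

```xml
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Join neighbouring tokens without a separator: "play station" also yields "playstation" -->
  <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
          outputUnigrams="true" tokenSeparator=""/>
  <!-- Stemming runs after shingling, so merged tokens are built from unstemmed words -->
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
```

The Analysis screen in the Solr admin UI is a convenient place to check what each filter emits for "play station camera" before deploying a change like this.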

Solr schema. exact accented match and accent insensitive match

I'm trying to figure out how to configure a fieldType in Solr's managed-schema to achieve the following:
(a) When searching for non-accented strings, the results will be accent insensitive.
(b) HOWEVER, when searching for accented strings, the results will ONLY be accent sensitive.
For example:
searchString -> expectedResult
Equipe -> Equipe, Equipé, Equípé, etc...
Equipé -> Equipé
Note: Wildcard (*) is irrelevant and chosen words are for the sake of demonstration purposes only.
My situation is a little uncommon due to some requirement restrictions, but with my schema (below) I have 3 fields: OName, OSearch, ONameSearch. (Note: OSearch and ONameSearch serve different purposes in the backend, so they need to be defined identically.)
The intention is for my Solr to query on OSearch and ONameSearch, and return the OName to the UI.
My original understanding was that OName will store the original value ("María") and index it as accent-insensitive ("maria"), so that when querying without solr.ASCIIFoldingFilterFactory, the following would be achieved.
Example: {query} -> {OName = result}
q = OSearch:*equipe* OR ONameSearch:*equipe* -> OName = Equipe, Equipé, Equípé, etc
q = OSearch:*equipé* OR ONameSearch:*equipé* -> OName = Equipé
This is my schema so far...
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="text_en_splitting_tight" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<field name="OName" type="lowercase" indexed="true" stored="true" />
<field name="OSearch" type="text_en_splitting_tight" indexed="true" stored="false" multiValued="true" />
<field name="ONameSearch" type="text_en_splitting_tight" indexed="true" stored="false" multiValued="true" />
<copyField source="OName" dest="OSearch" />
<copyField source="OName" dest="ONameSearch" />
Please advise, thanks!
Most if not all relevant resources I've looked into
How to ignore accent search in Solr
How to ignore accents in SOLR search?
SOLR and accented characters
Solr accent removal
SOLR Makes Search with Accented Characters Easy
Solr Ref Guide 6.6 Defining Fields
Solr Ref Guide 6.6 Copying Fields
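One direction that may be worth testing, offered here as a sketch rather than a verified answer: keep both the accented and the folded form of each token in the index, and do no folding at query time. ASCIIFoldingFilterFactory has a preserveOriginal option for this. The field type name below is hypothetical, and the other filters from the question's schema are omitted for brevity.

```xml
<fieldType name="text_accent_aware" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits the folded token ("equipe") and keeps the original ("equipé") at the same position -->
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- No folding here: an accented query only matches the preserved accented token -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Under these assumptions a query for "equipe" would match documents containing Equipe, Equipé or Equípé through the folded token, while "equipé" would match only the documents whose original token is "equipé".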

Search Last Four Numbers in a Given Number

I am trying to search for and match the last four digits of a 10-digit number.
Example
7154226465
7152436464
7152348464
If I search for 646, it should match the first two numbers. To be precise, I am looking for a suffix search that matches against the last 4 digits of the indexed number. Below is the schema:
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter catenateAll="1" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="0" generateWordParts="0" splitOnCaseChange="0"/>
<filter class="solr.ReverseStringFilterFactory"/>
<!--<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="17"/>-->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="7" maxGramSize="10" side="front"/>
<filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter catenateAll="1" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="0" generateWordParts="0" splitOnCaseChange="0" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
</fieldType>
EdgeNGram with side="back" does not work in Lucene 4.4. I am using Solr v4.9.1.
If you only want to search for the last 4 digits, then EdgeNGramFilterFactory is the way to go. Try this:
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="4" />
<filter class="solr.ReverseStringFilterFactory"/>
A small note: besides using n-grams, a traditional approach to efficiently supporting leading wildcards is to reverse the string and do a prefix query.
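Put together, the index analyzer might look like the sketch below. The field type name is hypothetical, and the query analyzer is kept to a plain KeywordTokenizer on the assumption that users type exactly four digits.

```xml
<fieldType name="number_last4" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- 7154226465 -> 5646224517 -->
    <filter class="solr.ReverseStringFilterFactory"/>
    <!-- keep only the first four characters: 5646 -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="4"/>
    <!-- reverse back to the original digit order: 6465 -->
    <filter class="solr.ReverseStringFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With minGramSize="4" only the complete four-digit suffix is indexed, so a three-digit query such as 646 would not match. Lowering minGramSize would index shorter suffixes as well (for 6465 with minGramSize="3", also 465), which still would not cover a fragment like 646 that does not end the number.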

Solr: Ignore casing of strings when calculating facet numbers

I have these values for the title field in my database:
"I Am A String"
"I am A string"
I want to make the title field available as facets in my search results.
Current result:
<lst name="title">
<int name="I Am A String">4</int>
<int name="I am A string">3</int>
</lst>
Desired result:
<lst name="title">
<int name="I Am A String">7</int>
</lst>
I actually don't care which of the two available strings is chosen for the final result, as long as strings that are the same (case-insensitively) are counted under the same facet.
I tried the following field definitions for the title field, and list the resulting facet behavior for each.
string = sees casing as different strings
string_exact = sees casing as different strings
text_ws = breaks up into words with casing intact
text = breaks into separate words
textTight = breaks into separate words
textTrue = breaks up in words with casing intact
string_exacttest = breaks up in words with casing intact
Here's my schema.xml
<field name="title" type="string" indexed="true" stored="true"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
<fieldType name="string_exact" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<!-- A text field that uses WordDelimiterFilter to enable splitting and matching of words on case-change, alpha numeric boundaries, and non-alphanumeric chars, so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
Synonyms and stopwords are customized by external files, and stemming is enabled. Duplicate tokens at the same position (which may result from Stemmed Synonyms or WordDelim parts) are removed.-->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!--<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<!-- Less flexible matching, but fewer false matches. Probably not ideal for product names, but may be good for SKUs. Can insert dashes in the wrong place and still match. -->
<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
<!--
this filter can remove any duplicate tokens that appear at the same position - sometimes possible with WordDelimiterFilter in conjunction with stemming.
-->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="textTrue" class="solr.TextField" positionIncrementGap="100" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
</analyzer>
</fieldType>
How can I make sure that the same strings (ignoring case) are grouped together when calculating the facets?
The string_exact definition is almost what you need, but you also need to apply a LowerCaseFilter so that each value is lowercased. The KeywordTokenizer keeps the whole value as a single token (so you won't see it broken into separate terms on whitespace). A string field doesn't allow any additional processing, whereas a TextField with a KeywordTokenizer behaves the same way but lets you add filters that process the token afterwards.
<fieldType name="string_facet" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
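To use the new type without touching the stored title, a separate facet field fed by copyField is one option. The field name title_facet is hypothetical.

```xml
<!-- Original field stays as it is for display and search -->
<field name="title" type="string" indexed="true" stored="true"/>
<!-- Lowercased single-token copy, used only for faceting -->
<field name="title_facet" type="string_facet" indexed="true" stored="false"/>
<copyField source="title" dest="title_facet"/>
```

Faceting with facet.field=title_facet should then count both spellings under one bucket. Note that the bucket label is the indexed, lowercased value ("i am a string"), not either of the original spellings.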

Text matches in debugger, but no results returned

I've got an issue where my index and query terms are exactly the same, but no results are returned. It seems to fail on any word that is longer than the ENGTF (EdgeNGramTokenFilter) max length. Here's my schema.
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" stemEnglishPossessive="0" preserveOriginal="1" types="wdfftypes.txt" protected="protwords.txt"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory" words="mapping-FoldToASCII.txt"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="10" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory" words="mapping-FoldToASCII.txt"/>
</analyzer>
</fieldType>
Here is a screenshot of the analyzer when "Satisfaction" is put into the index, and "Satisfaction" is put into the query.
Any ideas? Thanks
One obvious option is to increase the n-gram length limit. You seem to be aware of this option and probably agree that it is not ideal.
Another option is to create a second field: one field for the n-gram search and another for a search without n-grams. For example, somewhere in your schema.xml you might have:
<field name="myCoolNGramField" type="text_en_splitting" indexed="true" stored="false"/>
<!-- make a new type, text_en_non_ngram, and use it for this new field below. -->
<field name="myCoolField" type="text_en_non_ngram" indexed="true" stored="false"/>
<copyField source="myCoolNGramField" dest="myCoolField" />
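A possible shape for the new type is sketched below, assuming it should behave roughly like text_en_splitting minus the n-gram step. Only the name text_en_non_ngram comes from the answer; the filter list is a simplified guess based on the question's schema. The reason it should help: "Satisfaction" has 12 characters, while the n-gram field only stores prefixes of 3 to 10 characters, so the full 12-character query term has no indexed counterpart there.

```xml
<fieldType name="text_en_non_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- no EdgeNGramFilterFactory: whole tokens such as "satisfaction" are indexed untruncated -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

A query could then cover both fields, for example myCoolNGramField:satisfaction OR myCoolField:satisfaction, so that short prefixes still match through the n-gram field and long full words match through the plain one.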
