Configuring solr to match over punctuation, e.g. 'tshirt' matches 't-shirt'

I'm using Solr to index products on a clothing website. At the moment I'm trying to get Solr to match t-shirt based on the search term tshirt, but I'm a little bit lost as to which filters I need.
This is the general purpose field type that I'm using to index most fields at the moment:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
</analyzer>
</fieldType>
I tried removing WordDelimiterFilterFactory from the index and query analyzers, but it didn't help. Any advice/best practice would be really appreciated.

You'll want to have the WordDelimiterFilter further up your chain, and you'll want to use the Whitespace Tokenizer instead. The example on the wiki does just that.
The issue now is that the text is split into separate tokens earlier on (the StandardTokenizer already splits on the hyphen), and the WordDelimiterFilter only sees each token by itself. So it sees the t, then shirt, and doesn't really have anything to do.
By using the whitespace tokenizer you'll get the WDF to see "t-shirt", allowing it to generate t, shirt, tshirt, etc.
Use the "Analysis" page under the Solr Admin to see each step in the analysis and what the result is.
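As a rough sketch of the reordering described above (not a tested drop-in replacement), the field type might look like the following. The name text_en_wdf is made up for illustration, the filter attributes are carried over from the question's configuration, and the synonym, possessive and keyword-marker filters are left out for brevity.

```xml
<!-- Hypothetical field type: the whitespace tokenizer runs first, so the
     WordDelimiterFilter receives "t-shirt" as one token. -->
<fieldType name="text_en_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- For "t-shirt" this should emit the parts (t, shirt), the
         catenated form (tshirt) and the original token. -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

After changing the field type the documents need to be re-indexed, and the Analysis page can confirm that the index side of "t-shirt" and the query side of "tshirt" end up sharing a token.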

Related

Filter on solr splitting by list of strings

I've got this fieldType on my Solr implementation
<fieldType name="suggestion_text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
splitOnNumerics="1"
preserveOriginal="1"
/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="100"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
This works fine for almost every model I've got. For example, for model AB1234, I can search 1234 and it finds it. But there's a particular case that I want to include, and I'm trying to find a better solution than the current one:
Let's say AB is the manufacturer and 1234 is the actual part number, but in my database they are saved as AB1234. If I've got an A0 manufacturer and an A01234 part number, with the current implementation a search for 1234 won't find it.
I found a workaround by changing the EdgeNGramFilterFactory into an NGramFilterFactory, but that's not the solution I want. I want Solr to be able to match with the first two characters left out when they are a letter followed by a number, as in this extreme case, but I need it to match both with A0 and without A0.
I don't know if I was clear. Anyway, I tried regular expressions, creating a new field and using this filter on it:
<filter class="solr.PatternReplaceFilterFactory" pattern="(A0)" replacement="" replace="all" />
or
<filter class="solr.PatternReplaceFilterFactory" pattern="[a-zA-Z][0-9]" replacement="" replace="all" />
but this is not giving expected results.
Can you help me? Thank you

Please try the below field type for your problem.
<fieldType name="text_en_splitting_test" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
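<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" splitOnNumerics="1" preserveOriginal="1"/>
<!-- The Solr Reference Guide says to follow the graph filter with FlattenGraphFilterFactory at index time. -->
<filter class="solr.FlattenGraphFilterFactory"/>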
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
This worked for the search terms below.
AB1234
AB
1234
Please find the screenshot of the solr analysis page for the suggested field type with the search terms.
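For the specific A01234 case from the question, one alternative that neither the question nor the answer uses is a capture-group filter that adds the digits after the two-character prefix as an extra token, instead of replacing the original. The sketch below is an assumption-laden illustration: the field type name part_number_suggest and the regular expression are invented for this example and would need adjusting to the real part-number formats.

```xml
<!-- Hypothetical extra field type for part numbers. The pattern and the
     name part_number_suggest are illustrative assumptions. -->
<fieldType name="part_number_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- For a token such as A01234, also emit the digits after the
         two-character prefix (1234) while keeping the original token. -->
    <filter class="solr.PatternCaptureGroupFilterFactory" pattern="^[A-Za-z][A-Za-z0-9](\d+)$" preserve_original="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the original token is preserved, a search for A01234 should still match, and the extra captured token is what lets a search for 1234 match as well. The Analysis page is the quickest way to check that the pattern captures what is intended.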

How can I implement Solr case insensitive and accent insensitive substring search with whitespaces?

I store 120000 wine records in a SQL Server database. Until now I've searched successfully for wine names by performing the following SQL:
WHERE (LOWER(Wine.name) LIKE '%" + (searchString) + "%'")
I am now in the process of switching over to using Solr. I would like to search for "clos rene" and get "Clos Réné" back. However, Solr is returning all records that match 'Clos' and all records that match 'Réné'. I have tried the following field definition:
<fieldType name="c_text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Could someone please help me define the correct field type so that I can reproduce my SQL query above to return case insensitive and accent insensitive results for multiple words with white space in between?
I have also experimented with wildcard searches using field type 'string', but I can't get it to work case-insensitively.

Try,
<fieldType name="c_text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
EDIT: OK, now I get your question. I added an extra filter to the index analyzer: <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>. Try this.
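Note that edge n-grams only cover prefixes of the wine name, so this behaves like LIKE 'term%' rather than LIKE '%term%'. If matching in the middle of a name is required, a variant with NGramFilterFactory is one option, at the cost of a larger index. The following is a minimal sketch under that assumption; the name c_text_substring is hypothetical.

```xml
<!-- Hypothetical variant: full n-grams instead of edge n-grams, so a
     query can match anywhere inside the name. -->
<fieldType name="c_text_substring" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Keep the whole name, including spaces, as a single token. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Fold accented characters to their ASCII equivalents. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Depending on the Solr version and query parser, an unquoted query may be split on whitespace before it reaches the analyzer, which would explain getting every 'Clos' and every 'Réné' record. Sending the search as a quoted phrase, for example name:"clos rene", keeps it as one token for the KeywordTokenizer.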

Solr4 exact word search not working

I am not able to do an exact name search for some of the words in my database.
For example, when I search for "Aimee" I get no results for the full word, while "Aime" fetches some results. It behaves strangely for only some of the words.
I have Solr4 configured with these analyzers in schema.xml:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

The stem filter was causing this problem. I removed the stem filter from the analyzer and it worked.

Could be "Aimee" is being stemmed. So try and add <filter class="solr.PorterStemFilterFactory"/> to <analyzer type="query">
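Both suggestions come down to making the index and query analyzers agree about stemming. A minimal sketch of the first option (no stemmer at all), assuming the rest of the field type stays as in the question, could look like this. StandardFilterFactory is left out here only for brevity; keep it if your Solr version expects it.

```xml
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- No stemmer: the n-grams are built from the unstemmed token, so
         the full word entered at query time is among them. -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Existing documents have to be re-indexed before the change takes effect, since the stemmed n-grams are already stored in the index.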

Search in solr with special characters

I have a problem with a search with special characters in solr.
My document has a field "title" and sometimes it can be like "Titanic - 1999" (it has the character "-").
When I try to search in Solr with "-" I receive a 400 error. I tried escaping the character, with something like "-" and "\-". With those changes Solr no longer responds with an error, but it returns 0 results.
How can I search in the Solr admin with a special character (something like "-" or "'")?
Regards
UPDATE
Here you can see my current Solr schema: https://gist.github.com/cpalomaresbazuca/6269375
My search is to the field "Title".
excerpt from the schema.xml:
...
<!-- A general text field that has reasonable, generic
cross-language defaults: it tokenizes with StandardTokenizer,
removes stop words from case-insensitive "stopwords.txt"
(empty by default), and down cases. At query time only, it
also applies synonyms. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
...
<field name="Title" type="text_general" indexed="true" stored="true"/>

You are using the standard text_general field for the title attribute. This might not be a good choice. text_general is meant to be for huge chunks of text (or at least sentences) and not so much for exact matching of names or titles.
The problem here is that text_general uses the StandardTokenizerFactory.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
StandardTokenizerFactory does the following:
A good general purpose tokenizer that strips many extraneous
characters and sets token types to meaningful values. Token types are
only useful for subsequent token filters that are type-aware of the
same token types.
This means the '-' character is never kept as a token: it is treated as a boundary between tokens and then dropped.
"kong-fu" will be represented as "kong" and "fu". The '-' disappears.
This also explains why select?q=title:\- won't work here.
Choose a better fitting field type:
Instead of the StandardTokenizerFactory you could use the solr.WhitespaceTokenizerFactory, that only splits on whitespace for exact matching of words. So making your own field type for the title attribute would be a solution.
Solr also has a fieldtype called text_ws. Depending on your requirements this might be enough.
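A custom field type along these lines might be sketched as follows. The name text_title_ws is an assumption for illustration, and the lower-casing filter is an addition that keeps matching case-insensitive.

```xml
<!-- Hypothetical field type for titles: split on whitespace only, so a
     standalone "-" survives as its own token. -->
<fieldType name="text_title_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="Title" type="text_title_ws" indexed="true" stored="true"/>
```

With this analysis "Titanic - 1999" is indexed as three tokens, including the hyphen, so an escaped query for the hyphen has something to match. The trade-off is that punctuation attached to a word, such as a trailing comma, stays part of that token.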

To search for your exact phrase put inverted commas round it:
select?q=title:"Titanic - 1999"
If you just want to search for that special character then you will need to escape it:
select?q=title:\-
Also check:
Special characters (-&+, etc) not working in SOLR Query
If you know exactly which special characters you don't want to use, then you can add this to regex-normalize.xml:
<regex>
<pattern>-</pattern>
<substitution>%2D</substitution>
</regex>
This will replace all "-" with %2D, so as long as you search for %2D instead of "-", it will work fine.

I spent a lot of time getting this done. Here are clear step-by-step instructions for querying special characters in Solr. Hope it helps someone.
Edit the schema.xml file and find the solr.TextField that you are
using.
Under both the "index" and "query" analyzers, modify the
WordDelimiterFilterFactory and add types="characters.txt", something like:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
</analyzer>
</fieldType>
Ensure that you use WhitespaceTokenizerFactory as the tokenizer as
shown above.
Your characters.txt file can have entries like-
\# => ALPHA
# => ALPHA
\u0023 => ALPHA
i.e. each entry maps the character to ALPHA only.
Clear the data, re-index and query for the entered characters. It
will work.

Anyone has the best way for synonym search of multi keyword in solr?

I want to use synonym search in Solr for multi-word terms, but it doesn't work correctly.
I set the synonym "multi term" for "multerm" in synonym.txt. I expected Solr to build a phrase query for "multerm", like field:"multi term"~0, but it builds field:multi | field:term instead. So it can't do a proximity (phrase) search for a multi-term synonym.
Does anyone know the best way to do multi-term synonym search in Solr? Help me please.

Here is how I handle multi-word synonyms. In my schema.xml, fieldType definition looks like:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizer="solr.KeywordTokenizerFactory"/>
<fieldType name="custom_text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<!-- We will use synonyms only at index time to keep querying fast-->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizer="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<!-- We will use synonyms only at index time to keep querying fast
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" />
</analyzer>
</fieldType>
Couple of things to note:
I am using synonyms only at index time, to keep queries fast.
I added KeywordTokenizerFactory, it treats the entire field as a single token, and does not split multi-word synonyms
I added expand="true". If expand is true, a synonym will be expanded to all equivalent synonyms. If it is false, all equivalent synonyms will be reduced to the first in the list.
Query time synonyms are commented out.
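On Solr versions that ship SynonymGraphFilterFactory, a query-time alternative exists that is closer to the phrase query the question asks for. This is a different technique from the index-time approach above, shown only as a hedged sketch: the field type name text_syn_graph is hypothetical, and the behaviour depends on the query being sent with sow=false (or on a version where that is the default) so that multi-word input reaches the analyzer unsplit.

```xml
<!-- Hypothetical field type: graph-aware synonyms at query time.
     Example synonyms.txt entry:  multerm => multi term -->
<fieldType name="text_syn_graph" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Produces a token graph so the parser can treat "multi term"
         as a unit rather than as two independent terms. -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

Checking the parsed query with debugQuery=true shows whether the synonym was turned into a phrase on your version. If it was not, the index-time approach from the answer remains the fallback.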
