Solr stopwords magic - search

My stopwords don't works as expected.
Here is part of my schema:
<fieldType name="text_general" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType class="solr.TextField" name="text_auto">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" outputUnigramsIfNoShingles="false"/>
</analyzer>
<analyzer type="query">
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
</analyzer>
</fieldType>
<field name="deal_title_terms" type="text_auto" indexed="true" stored="false" required="false" multiValued="true"/>
<field name="deal_description" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>
In stopwords.txt I have next words: the, is, a;
Also I have next data in my fields:
deal_description - This is the my description
deal_title_terms - This is the deal title a terms (will be splitted in terms)
When I try to search deal_description:
Example 1: "deal_description: his is the m" - I expect that document with deal_description "This is the my description" will be returned
Example 2: "deal_description: is th" - I expect that nothing will be found because "is" and "the" are stopwords.
When I try to search deal_title_terms:
Example 1: "deal_title_terms: is" - I expect that nothing will be found because "is" is stopword.
Example 2: "deal_title_terms: is the deal" - I expect that "is" and "the" will be ignored and term "deal" will be found.
Example 3: "deal_title_terms: title a terms" - I expect that "a" will be ignored and term "title terms" will be found.
Question 1: Why stopwords don't works for "deal_description" field ?
Question 2: Why for field "deal_title_terms" stopwords not removed for my query ?(When I am trying to find title a terms it will not find "title terms" term)
Question 3: Is there any way to show stopwords in search result but prevent them from searching ? Example:
data: This is cool search engine
search query : "is coo" -> return "This is cool search engine"
search query : "is" -> return nothing
search query : "This coll" -> return "This is cool search engine"
Question 4: Where I can find detailed description (maybe with examples) how stopwords works in solr ? Because it looks like magic.

Answer to Question 1 : Replace the "KeywordTokenizerFactory" as it does no actual tokenizing, so the entire input string is preserved as a single token.Use StandardTokenizerFactory instead.
Or use the below fieldType.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Stopwords will work as expected for the "deal_description" field.
Answer to Question 3 : Yes. Add the StopFilterFactory in analyzer of type="query" only. It will prevent them from searching and not adding them while indexing.
Answer to Quesion 4 : https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Answer to Quesion 2 : The custom field created by you seems incorrect. The text has to tokenised first using the tokenizers but you are using filters first.
Check the analysis of it with solr analysis page.

Related

Solr schema. exact accented match and accent insensitive match

I'm trying to figure out how to configure the Solr manage-schema's fieldType to achieve the following:
(a) When searching for non-accented strings, the results will be accent insensitive.
(b) HOWEVER When performing searching on accented strings, the results will ONLY be accent sensitive.
For example:
searchString -> expectedResult
Equipe -> Equipe, Equipé, Equípé, etc...
Equipé -> Equipé
Note: Wildcard (*) is irrelevant and chosen words are for the sake of demonstration purposes only.
My situation is a little uncommon due to some requirement restrictions but with my schema (below), I have 3 fields; OName, OSearch, ONameSearch. (note: OSearch and ONameSearch serve different purposes in the backend, so they need to be defined indentically)
The intention is for my Solr to query on OSearch and ONameSearch, and return the OName to UI.
My original understanding was that OName will store the original value ("María") and index it as accent-insensitive ("maria") such that when query without solr.ASCIIFoldingFilterFactory, the following would be achieved.
Example: {query} -> {OName = result}
q = OSearch:*equipe* OR ONameSearch:*equipe* -> OName = Equipe, Equipé, Equípé, etc
q = OSearch:*equipé* OR ONameSearch:*equipé* -> OName = Equipé
This is my schema so far...
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="text_en_splitting_tight" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer>
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<field name="OName" type="lowercase" indexed="true" stored="true" />
<field name="OSearch" type="text_en_splitting_tight" indexed="true" stored="false" multiValued="true" />
<field name="ONameSearch" type="text_en_splitting_tight" indexed="true" stored="false" multiValued="true" />
<copyField source="OName" dest="OSearch" />
<copyField source="OName" dest="ONameSearch" />
Please advise, thanks!
Most if not all relevant resources I've looked into
How to ignore accent search in Solr
How to ignore accents in SOLR search?
SOLR and accented characters
Solr accent removal
SOLR Makes Search with Accented Characters Easy
Solr Ref Guide 6.6 Defining Fields
Solr Ref Guide 6.6 Copying Fields

Solr: Ignore casing of strings when calculating facet numbers

I have these values for the title field in my database:
"I Am A String"
"I am A string"
I want to make the title field available as facets in my search results.
Current result:
<lst name="title">
<int name="I Am A String">4</int>
<int name="I am A string">3</int>
</lst>
Desired result:
<lst name="title">
<int name="I Am A String">7</int>
</lst>
I actually don't care which of the 2 available string options is chosen for the final result, as long as the same strings (case insenstive) are counted for the same facet.
I tried the following field definitions for the title field. I also added the resulting facet logic.
string = sees casing as different strings
string_exact = sees casing as different strings
text_ws = breaks up into words with casing intact
text = breaks into separate words
textTight = breaks into separate words
textTrue = breaks up in words with casing intact
string_exacttest = breaks up in words with casing intact
Here's my schema.xml
<field name="title" type="string" indexed="true" stored="true"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
<fieldType name="string_exact" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<!-- A text field that uses WordDelimiterFilter to enable splitting and matching of words on case-change, alpha numeric boundaries, and non-alphanumeric chars, so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
Synonyms and stopwords are customized by external files, and stemming is enabled. Duplicate tokens at the same position (which may result from Stemmed Synonyms or WordDelim parts) are removed.-->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!--<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<!-- Less flexible matching, but less false matches. Probably not ideal for product names,but may be good for SKUs. Can insert dashes in the wrong place and still match. -->
<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
<!--
this filter can remove any duplicate tokens that appear at the same position - sometimes possible with WordDelimiterFilter in conjuncton with
stemming.
-->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="textTrue" class="solr.TextField" positionIncrementGap="100" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
</analyzer>
</fieldType>
How can I make sure that the same strings (ignoring case) are grouped together when calculating the facets?
The string_exact definition is almost what you need, but you need to have a LowercaseFilter applied as well, so that each sentence is lowercased. The KeywordTokenizer keeps the whole value as a single token (so you won't see it broken into separate terms based on whitespace), and while a string field doesn't allow any additional processing, a TextField with a KeywordTokenizer behaves the same way - but you can add filters to how the token is processed afterwards.
<fieldType name="string_facet" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

Solr: Integrating Partial Match and Exact Match results

Consider a car database containing something like:
Mercedes C class
Mercedes A class
BMW 3 Series
Mazda 3
I have a schema that would return results for partial matches. As you can see I have limited the minimum character to be considered to 2:
<fieldType class="solr.TextField" name="string_contains" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="2"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="2"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
</analyzer>
<analyzer type="query">
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
So if a user searches for 'ercedes' both Mercedes entries would be returned. If a user searches for 'C' or '3', nothing will be returned since the schema sets a minimum of 2 characters.
I also have the following schema, which will return any exact matches:
<fieldType class="solr.TextField" name="textStemmed" omitNorms="true" positionIncrementGap="0">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="querystopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
Using the above, searching 'C' would return 'Mercedes C class' because it is an exact match, but nothing for a partial match.
Is it possible to somehow have a schema which works similarly to the first one, ie it can return partial matches but can also return matches to single character terms when they are an exact match?
thanks
Mark
you can do this:
declare two (or more) fields 'carpartial' defined as string_contains, 'carexact' as textStemmed.
use copyfield to copy the original field into those additional fields
you use edismax handler to query those two fields, but boosting one more than the other:
qf=string_contains^4 textStemmed^6
You might want to tweak your analysis chains, but you see how it works, use different variants of the same fields(you can add more of course), with different boosts.

How to turn query, which searches SOLR data, into lowercase?

My scheme.xlm looks like this:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!-- The searched field -->
<field name="product_name" type="text" indexed="true" stored="true"/>
This should index the field in lowercase and also transform search query into the lowercase.
Data I want to find is: "Nokia Lumia 610"
When I search "nokia" I get the expected result but
when searching only "Nokia"(upper case N) there aren't any results.
Above "analyzer" performs lowercase only on index but not on search query.
Is this an error?
How to force SOLR indexes and search query to be in lowercase?
Transformation of the search query also depends on the type of query and the analyzer that you are using. For example, the above will not convert your search query to lowercase if you are sending request to the select analyzer. If you are sending request :-
http://url/solr/select?q=Nokia
then the above will not be converted to lowercase since the select analyzer is not present in your fieldtype definition. You will have to modify your code as follows :-
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="select">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
if the above does not work, then please post the request that you are sending and the output of adding debugQuery=true to the request.
Along with
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="select">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
in schema.xml.
In head.vm change return $("#q").val();
to
return $("#q").val().toLowerCase(); for InCaseSensitive autocomplete feature.
So that you can get result if you search with Capital Letters.

apache solr search with *

When I search like this: q=*6205* I got much more results then searching q=6205, and it's good
but my problem is:
when I search for q=6205-2RS or q=6205\-2RS I got some results but when I put * in a search string I receive no results (q=*6205-2RS* or q=*6205\-2RS*) Why?
I want to search for *6205-2RS* but I want solr to search this string also in a middle of items names.
Wildcards queries does not undergo any analysis.
So when you are searching for 6205-2RS with wildcards, it would be searched as is, without any analysis like lower case filter, worddelimiters.
Whats the schema defination for the field. Is it a text field or a String field type ?
The definition for text_general is as below -
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
The tokenizer and lower case filter in analysis at index time for 6205-2RS would generate tokens 6205,2rs
As no analysis takes place during search, its searching for 6205-2RS as is, and will not find any results.
Change the field type to string and that should match the results.

Resources