Apache Solr search with * (wildcard search)

When I search with q=*6205* I get many more results than with q=6205, which is what I want.
My problem is this:
when I search for q=6205-2RS or q=6205\-2RS I get some results, but when I add * to the search string (q=*6205-2RS* or q=*6205\-2RS*) I get no results. Why?
I want to search for *6205-2RS* and have Solr match this string in the middle of item names as well.

Wildcard queries do not go through the field's normal analysis chain.
So when you search for 6205-2RS with wildcards, the term is searched as is, without the tokenizer or filters such as word delimiters being applied (newer Solr versions apply only a reduced set of filters, such as lowercasing, to wildcard terms and still do not tokenize them).
What is the schema definition for the field? Is it a text field or a string field type?
The definition of text_general is as follows:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
At index time, the tokenizer and lowercase filter turn 6205-2RS into the tokens 6205 and 2rs.
Because the wildcard query is not tokenized, Solr looks for a single term containing 6205-2RS. No such term exists in the index, so nothing is found.
Change the field type to string and the wildcard query should match.
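One way to follow this advice without giving up the analyzed field is to keep a second, untokenized copy just for wildcard matching. This is only a minimal sketch: the field names name and name_raw are hypothetical (the question does not say what the field is called), and it assumes the stock string type (solr.StrField) is defined in the schema.

```xml
<!-- Assumed names: "name" is the existing analyzed field, "name_raw" is a new untokenized copy -->
<field name="name_raw" type="string" indexed="true" stored="false"/>
<!-- Copy the original, unanalyzed input value into the string field at index time -->
<copyField source="name" dest="name_raw"/>
```

After reindexing, a query such as q=name_raw:*6205\-2RS* is matched against the whole field value. Keep in mind that string fields are case-sensitive and that leading wildcards can be slow on large indexes.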

Related

Solr: Searching with/without spaces in keywords

I am experiencing an issue when spaces are introduced to keywords, for example:
We have a product with the title "Sony Playstation 4 Camera V2 PS4 (PSVR)"
Searching for "playstation" or "playstation camera" brings back this product
Searching for "play station" or "play station camera" does not bring back this product (notice the space)
Here is the fieldType being used:
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
How can I fix this and make both "playstation" and "play station" match? PlayStation is just my example; it can happen with any search term, e.g. "cyberpunk" vs. "cyber punk". So solutions that require a lot of manual work, such as adding a synonym like play station => playstation, are not feasible.
Things I have tried, but not managed to make work:
N-GRAM filter and tokenizer
Fuzzy search
Removing whitespace
Escaping whitespace
You can use a Shingle Filter to combine multiple tokens into one.
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
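<!-- tokenSeparator="" joins adjacent tokens without a space; the default separator is a single space -->
<filter class="solr.ShingleFilterFactory" tokenSeparator=""/>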
</analyzer>
If you assume that the terms are spelled correctly when they are indexed, you only need to apply this at query time. The filter concatenates adjacent tokens for you (when tokenSeparator is empty; by default shingles are joined with a space), effectively giving you multiple "merged" tokens:
play station camera => play, station, camera, playstation, stationcamera
That is with maxShingleSize=2. If you increase the maximum size to 3, you also get playstationcamera as a single token (in this case). That might be necessary if people are likely to split a word in more than one place.
Because this only happens at query time, your index does not change, so you do not have to reindex and the index size stays the same.
You might have to move the filter to a different position in the chain: placed after the stemming filter, it concatenates already-stemmed terms, which can break matching in hard-to-predict ways.
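The reordering mentioned above could look roughly like the following. This is a sketch trimmed to the relevant filters, not a tested drop-in replacement: the filter order is an assumption, and tokenSeparator="" is set explicitly because the default separator is a space.

```xml
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Build shingles from the unstemmed words: "play station" -> play, station, playstation -->
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
  <!-- Stem after shingling so the merged tokens are stemmed the same way as indexed terms -->
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
```

The shingle filter can only merge words if the whole query string reaches the analyzer at once. If the query parser splits on whitespace before analysis (the sow parameter), each word is analyzed on its own and no shingles are produced, so sow=false is probably needed.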

Solr: Integrating Partial Match and Exact Match results

Consider a car database containing something like:
Mercedes C class
Mercedes A class
BMW 3 Series
Mazda 3
I have a schema that returns results for partial matches. As you can see, I have set the minimum number of characters to be considered to 2:
<fieldType class="solr.TextField" name="string_contains" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="2"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="2"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
</analyzer>
<analyzer type="query">
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
So if a user searches for 'ercedes' both Mercedes entries would be returned. If a user searches for 'C' or '3', nothing will be returned since the schema sets a minimum of 2 characters.
I also have the following schema, which will return any exact matches:
<fieldType class="solr.TextField" name="textStemmed" omitNorms="true" positionIncrementGap="0">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="querystopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
Using the above, searching 'C' would return 'Mercedes C class' because it is an exact match, but nothing for a partial match.
Is it possible to have a schema that works like the first one, i.e. it returns partial matches, but also returns matches for single-character terms when they are an exact match?
thanks
Mark
You can do this:
Declare two (or more) fields: 'carpartial' defined as string_contains and 'carexact' defined as textStemmed.
Use copyField to copy the original field into those additional fields.
Use the edismax handler to query those two fields, boosting one more than the other:
qf=carpartial^4 carexact^6
You might want to tweak your analysis chains, but you can see how it works: use different variants of the same field (you can add more, of course) with different boosts.
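The field setup described above might be declared as follows. This is a sketch: the source field name car is an assumption (the question does not name the original field), while carpartial and carexact use the two field types from the question.

```xml
<!-- Assumed source field "car" holding the original value, e.g. "Mercedes C class" -->
<field name="car" type="string" indexed="true" stored="true"/>
<!-- Partial-match variant (n-grams) and exact-match variant (stemmed text) of the same value -->
<field name="carpartial" type="string_contains" indexed="true" stored="false"/>
<field name="carexact" type="textStemmed" indexed="true" stored="false"/>
<copyField source="car" dest="carpartial"/>
<copyField source="car" dest="carexact"/>
```

A request would then pass something like defType=edismax&q=C&qf=carpartial^4 carexact^6, so that a single-character query can still match through carexact while longer fragments match through carpartial.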

Solr stopwords magic

My stopwords don't work as expected.
Here is part of my schema:
<fieldType name="text_general" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType class="solr.TextField" name="text_auto">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" outputUnigramsIfNoShingles="false"/>
</analyzer>
<analyzer type="query">
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
</analyzer>
</fieldType>
<field name="deal_title_terms" type="text_auto" indexed="true" stored="false" required="false" multiValued="true"/>
<field name="deal_description" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>
In stopwords.txt I have the following words: the, is, a
I also have the following data in my fields:
deal_description - This is the my description
deal_title_terms - This is the deal title a terms (will be split into terms)
When I search deal_description:
Example 1: "deal_description: his is the m" - I expect the document with deal_description "This is the my description" to be returned.
Example 2: "deal_description: is th" - I expect nothing to be found because "is" and "the" are stopwords.
When I search deal_title_terms:
Example 1: "deal_title_terms: is" - I expect nothing to be found because "is" is a stopword.
Example 2: "deal_title_terms: is the deal" - I expect "is" and "the" to be ignored and the term "deal" to be found.
Example 3: "deal_title_terms: title a terms" - I expect "a" to be ignored and the term "title terms" to be found.
Question 1: Why don't stopwords work for the "deal_description" field?
Question 2: Why are stopwords not removed from my query for the "deal_title_terms" field? (When I search for title a terms, it does not find the "title terms" term.)
Question 3: Is there any way to show stopwords in search results but prevent them from being searched? Example:
data: This is cool search engine
search query : "is coo" -> return "This is cool search engine"
search query : "is" -> return nothing
search query : "This coll" -> return "This is cool search engine"
Question 4: Where can I find a detailed description (maybe with examples) of how stopwords work in Solr? It looks like magic.
Answer to Question 1: Replace KeywordTokenizerFactory. It does no actual tokenizing, so the entire input string is kept as a single token and the stop filter never sees an individual word. Use StandardTokenizerFactory instead.
Or use the fieldType below.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
With this, stopwords will work as expected for the "deal_description" field.
Answer to Question 3: Yes. Add the StopFilterFactory to the analyzer of type="query" only. Stopwords are then dropped from queries but are still indexed.
Answer to Question 4: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Answer to Question 2: The custom field type you created seems incorrect. The text has to be tokenized first, but your query analyzer lists a filter before the tokenizer.
Check how the field is analyzed on the Solr analysis page.
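As an illustration of the last point, the query analyzer of text_auto could be written with the tokenizer first. This is a sketch only: the added LowerCaseFilterFactory is an assumption made to mirror the index analyzer (it is not part of the answer above), and enablePositionIncrements is left out on purpose.

```xml
<analyzer type="query">
  <!-- Tokenize first, then filter the resulting tokens -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <!-- Assumption: lowercase at query time too, so query terms match the lowercased index terms -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

The analysis page mentioned above shows, filter by filter, which tokens survive for a given input, which is the quickest way to confirm that "is", "the" and "a" are actually removed.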

Search Solr ShingleFilterFactory

I have a data collection in Solr and I need to run a search that looks for all the typed words.
For example, if a user enters the text "House Tree Spain", Solr should look for "House Tree Spain", "House Tree", "House Spain", "Tree Spain", "House", "Tree", "Spain".
I'm using "solr.ShingleFilterFactory", but only in the query analyzer.
<fieldType name="generic" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- generic -->
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- spanish -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" />
<!-- english -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- generic -->
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- spanish -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" />
<!-- english -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.ShingleFilterFactory" maxShingleSize="10" outputUnigramsIfNoShingles="true"/>
</analyzer>
</fieldType>
How can I change my schema to get the results I'm looking for?
You have to apply the Shingle filter in both the query and index analyzers. In the indexing phase, it creates the tokens "House Tree" and "Tree Spain" and puts them in the index. In the query phase, it creates those tokens from the query and searches for them in the index. If either of those steps is omitted, "House Tree" can never match. Note that shingles are only built from adjacent words, so "House Spain" is not produced as a shingle; those words can still match as single terms as long as unigrams are output.
PS. A shingle size of 10 is huge. For this particular example, you only need 2. Set it as low as you can; otherwise your index grows very large.
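Applied to the index side, the advice above might look like this. It is a sketch under assumptions: the language-specific stop filters from the question are omitted for brevity, and maxShingleSize="2" follows the suggestion above, so the query analyzer would have to use the same value.

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <!-- Same shingle settings as at query time; outputUnigrams keeps single words searchable -->
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
</analyzer>
```

Existing documents have to be reindexed before the shingle tokens appear in the index.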

Solr - Search words within a string

I have a field containing a lot of words, for example:
"hello my name is Nicole and I am working with Solr"
and I need Solr to return this document if I search for these words (note that the word order is not the same as in the indexed text):
"am name with"
I am using this configuration
<fieldType name="propertiesField" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="-" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
and this query:
select/?q=properties_all:am-name-with&version=2.2&start=0&rows=10&indent=on
When I analyze it with the analyzer, those words are highlighted, but no document is found when I run the search.
Thanks for your help!!!!
Unless there is a good reason to use different index-time and query-time analyzers, do not.
I would define the fieldType like this:
<fieldType name="propertiesField" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
Additionally, set the default operator to AND (in the schema file):
<solrQueryParser defaultOperator="AND"/>
Then query Solr with:
select/?q=properties_all:(am name with)&version=2.2&start=0&rows=10&indent=on
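Newer Solr releases no longer support the schema-level solrQueryParser element; there the same effect is usually obtained with the q.op request parameter. A minimal sketch, assuming a standard /select handler in solrconfig.xml and using properties_all from the question as the default field:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Require all query terms to match, regardless of their order in the document -->
    <str name="q.op">AND</str>
    <!-- Assumed default field, so that q=am name with searches properties_all -->
    <str name="df">properties_all</str>
  </lst>
</requestHandler>
```

The parameter can also be passed per request, for example select/?q=properties_all:(am name with)&q.op=AND.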
