Solr Partial And Full String Match - search

I am trying to allow searches on partial strings in Solr so if someone searched for "ppopota" they'd get the same result as if they searched for "hippopotamus." I read the documentation up and down and feel like I have exhausted my options. So far I have the following:
Defining a new field type:
<fieldtype name="testedgengrams" class="solr.TextField">
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
</fieldtype>
Defining a field of type "testedgengrams":
<field name="text_ngrams" type="testedgengrams" indexed="true" stored="false"/>
Copying contents of text_ngrams into text:
<copyField source="text_ngrams" dest="text"/>
Alas, that doesn't work. What am I missing?

You're using EdgeNGramFilterFactory which generates tokens 'hi', 'hip', 'hipp', etc, so it won't match 'ppopota'. Use NGramFilterFactory instead.

To enable partial word searching
you must edit your local schema.xml file, usually under solr/config, to add either:
NGramFilterFactory
EdgeNGramFilterFactory
Here's what mine looks like: sample solr schema.xml
Here's the line to paste:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
EdgeNGram
I went with the EdgeN option. It doesn't allow for searching in the middle of words, but it does allow partial word search starting from the beginning of the word. This cuts way down on false positives / matches you don't want, performs better, and is usually not missed by the users. Also, I like the minGramSize=2 so you must enter a minimum of 2 characters. Some folks set this to 3.
Once your local is setup and working, you must edit the schema.xml used by websolr, otherwise you will get the default behavior which requires the full-word to be entered even if you have full text searching configured for your models.
Take it to the next level
5 ways to speed up indexing
Special instructions for editing the websolr schema.xml if you are using Heroku
Go to the Heroku online dashboard for your app
Go to the resources tab, then click on the Websolr add-on
Click the default link under Indexes
Click on the Advanced Configuration link
Paste in your schema.xml from your local, including the config for your Ngram tokenizer of choice (mentioned above). Save.
Copy the link in the "Configure your Heroku application" box, then paste it into terminal to set your WEBSOLR_URL link in your heroku config.
Click the Index Status link to get nifty stats and see if you are running fast or slow.
Reindex everything
heroku run rake sunspot:reindex[5000]
Don't use heroku run rake sunspot:solr:reindex - it is deprecated, accepts no parameters and is WAY slower
Default batch size is 50, most people suggest using 1000, but I've seen significantly faster results (1000 rows per second as opposed to around 500 rps) by bumping it up to 5000+

Ok I'm doing the same thing with field name
name_de
And I managed to get this thing to work using copyField like this:
schema.xml
<schema name="solr-magento" version="1.2">
<types>
...
<fieldType name="type_name_de_partial" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="1000" side="front" />
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="1000" side="back" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords_de.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords_de.txt"/>
</analyzer>
</fieldType>
</types>
...
<fields>
...
<field name="name_de_partial" type="type_name_de_partial" indexed="true" stored="true"/>
</fields>
....
<copyField source="name_de" dest="name_de_partial" />
</schema>
Then create search condition in solrconfig.xml
<requestHandler name="magento_de" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<str name="tie">0.01</str> <!-- Tie breaker -->
<str name="qf">name_de_partial^1.0 name_de^3.0</str> <!-- Phrase Fields -->
<str name="pf">name_de_partial^1.0 name_de^3.0</str> <!-- Phrase Fields -->
<str name="mm">3<90%</str> <!-- Minimum 'Should' Match [id 1..3 must much all, else 90proc] -->
<int name="ps">100</int> <!-- Phrase Slop -->
<str name="q.alt">*:*</str>
..
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
With this solr is searching in fields name_de_partial with pow 1.0 and in name_de with pow 3.0
So if engine founds specific query word in name_de, then it is put on top of the list.
If he also finds something in name_de_partial then it also counts and is put in results.
And field name_de_partial is using specific solr filters so it can found word "hippie" using query "hip" or "ppie" or "ippi" without a swet.

If you set EdgeNGramFilterFactory or NGramFilterFactory both at index and query time, combined with q.op=AND (or default mm=100% if you are using dismax) you will experience some problems.
Try defining NGramFilterFactory only at index time:
<fieldType name="testedgengrams" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
</fieldType>
or try setting q.op=OR (or mm=1 if you are using dismax)

Related

Solr search does not return exact match

I am using Solr 6 to implement a search engine. The Problem I am facing is that when I searched for the word it returns some other results first and the actual query is at number 6.
For example I am searching for Cafe 9
It returns me this...
NECOS NATURAL STORE & CAFE
SATTAR BUKSH CAFE
THE PINK CADILLAC CAFE & RESTAURANT
CAFE ROCK LAHORE
CAFE CHEF ZAKIR
CAFE 9
What I want is that it show Cafe 9 in 1st place and then other results as Cafe 9 is the exact match..
I have indexed all the fields with type text_general and the schema.xml is attached.
Thanks in advance.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.ApostropheFilterFactory"/>
<!-- <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/> -->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If you want to boost the score of documents containing all the query terms in close proximity then you can pass the pf parameter with the value of the field name. In your case you should be passing pf=name (pf stands for phrase fields). The eDisMax query parser will attempt to make phrase queries out of all the terms in the q parameter, and if it’s able to find the exact phrase in any of the phrase fields, it will apply the specified boost to the match for that document.
In case you're not using the eDisMax query parser by default you can use it temporarily for the current query by passing q={!edismax pf=name}cafe 9.
You could also pass the pf2 parameter (as in pf2=name) which works in a way similar to pf except that the generated phrase queries are the bigrams in your query (that is, every two consecutive terms will be considered a boosting phrase). There's is also a pf3 parameter if that happens to be what you're looking for.
You can also customize the boost and pass more than one field name to the phrase proximity parameters (for instance, pf=name^2 title^3).

How to search arabic words in solr

In my solr schema.xml I defined product arabic name field as below
<field name="productNameArabic" type="text_ar" indexed="true" stored="true"/>
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
</fieldType>
In solr search I want to search with product name using Arabic letters. While searching, Arabic user can feel little default to search some product name. Because some characters need to mention while searching.
Ex: إ أ آ
In the above mentioned characters, user can get combination of shift key. Usually if Arabic people will mention “ ا “ character and will get the below combined words.
Ex: إبرا
In my solr schema.xml I defined product arabic name field as below
I was able to achieve desired functionality by adding ASCIIFoldingFilter, this filter is able to remove accents from different languages, to make them similar in index time.
<fieldType name="arabic" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
</fieldType>
Some more information about this filter - here. Working code example - here

How to ignore whitespaces on solr query

I have the name Audioslave indexed on Solr and I want to match that document to the query string Audio Slave.
I have the following rule configured:
<fieldType name="text_filter" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory"
catenateWords="1"
catenateNumbers="1"
catenateAll="1"
preserveOriginal="1"
generateWordParts="1"
generateNumberParts="1"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory"
catenateWords="1"
catenateNumbers="1"
catenateAll="1"
preserveOriginal="1"
generateWordParts="1"
generateNumberParts="1"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
And a field using it:
<field name="artist_name_filter" type="text_filter" multiValued="false" indexed="true" stored="true" required="false" />
When using Solr analysis tool everything looks good.
The Query part is the following:
The KeywordTokenizerFactory generates Audio Slave,
Then the WordDelimiterFilterFactory splits it into Audio Slave, Audio, AudioSlave and Slave (lets just use the 3rd column (AudioSlave) from here.
The TrimFilterFactory keeps it as AudioSlave
Finally the LowerCaseFilterFactory change it to audioslave
On the other hand, the index part is:
The KeywordTokenizerFactory generates Audioslave,
Then the WordDelimiterFilterFactory and TrimFilterFactory keeps it as Audioslave
Finally the LowerCaseFilterFactory change it to audioslave
So both fields should match, but the query returns no results:
http://localhost:8983/solr/search_api/select?defType=edismax&fq=type:Artist&q=Audio%20slave&qf=artist_name_filter&wt=json
Your problem isn't analysis, it's QueryParser syntax. Spaces are used to separate query clauses, and that isn't affected by the analyzer. When you have q=Audio slave, it applies query syntax rules first, and separates it into clauses "Audio" and "slave", and then analyzes each clause separately.
Escaping the space should do the job, I believe: q=Audio\ slave
A phrase query here seems like it ought to work, such as q="Audio slave", but it doesn't. It generates something like: "(audio slave audio audioslave) slave" for me, which is problematic.
Try by using the WhitespaceTokenizerFactory as a tokenizer for your index part.
Here the KeywordTokenizerFactory keeps the text as it is...it won't create any tokens.
Replace the same with WhitespaceTokenizerFactory.
WhitespaceTokenizerFactory will create tokens at space.

Solr query that contains slash

I've found an interesting query for Solr and it returns search results, but I don't understand, what is the purpose of slash symbol between the words?
duties:health/nurse
Anybody knows? Please, help.
Simple. You can look at the analyzer chain to understand what happens.
My guess is that the analyzer chain turns the / into a space - which makes the query into
duties: health nurse
To find out your analyzer chain from the configuration - start by checking the type of the field
For example
<field name="health" type="text_general" indexed="true" stored="true" required="true"/>
Now we look for the definition of the type
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
As you can see, we have an index analyzer and a query analyzer.
My query analyzer would turn / in the query into something else by using the StandardTokenizerFactory.
From the solr wiki:
solr.StandardTokenizerFactory
A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware of the same token types. There aren't any filters that use StandardTokenizer's types.
I am thinking that health/nurse is being viewed as a string literal as there are no spaces between. Health / nurse should yield different results than health/nurse, correct? If so, then health/nurse must be an indexed term in your documents.

Solr accent removal

i have read various threads about how to remove accents during index/query time. The current fieldtype i have come up with looks like the following:
<fieldType name="text_general" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
After having added a couple of test information to index i have checked via http://localhost:8080/solr/test_core/admin/luke?fl=title
which kind of tokens have been generated.
For instance a title like "Bayern München" has been tokenized into:
<int name="bayern">1</int>
<int name="m">1</int>
<int name="nchen">1</int>
Therefore instead of replacing the character by its ascii pendant, it has been interpret as being a delimiter?! Having that kind of index results into that i neither can search for "münchen" nor m?nchen.
Any idea how to fix?
Thanks in advance.
The issue is you are applying StandardTokenizerFactory before applying the ASCIIFoldingFilterFactory. Instead you should use the MappingCharFilterFactory character filter factory first and the the StandardTokenizerFactory.
As per the Solr Reference guide StandardTokenizerFactory supports <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>. Therefore when you tokenize using StandardTokenizerFactory the umlaut characters are lost and your ASCIIFoldingFilterFactory is of no use after that.
Your fieldType should be like below if you want to go for StandardTokenizerFactory.
<fieldType name="text_general" class="solr.TextField">
<analyzer>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
The mapping-ISOLatin1Accent.txt should have the mappings for such "special" characters. In Solr this file comes pre-populated by default. For e.g. ü -> ue, ä -> ae, etc.

Resources