need help in apache solr suggester for phrases

need help in apache solr suggester for phrases - search

i am trying to use suggeter in my application
example: I have a documents as below
apache solr version 4.2
apache hadoop version 2
cassendra nosql db
mysql rdbms
if i search for "apa" first two result is shown as suggestion and
if the search string is "apache so" only 1st one is shown as suggestion which is as expected
But
if i search for "solr" no result is shown for suggestion( i would expect apache solr version 4.2)
My query is
http://localhost:8983/solr/colletion/suggest?wt=json&indent=true&spellcheck=true&spellcheck.q=solr
below is my field type
<fieldType name="text_general2" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
and suggest request handler in solrconfig.xml is
<searchComponent class="solr.SpellCheckComponent" name="suggest">
<lst name="spellchecker">
<str name="name">suggest</str>
<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
<str name="lookupImpl">org.apache.solr.spelling.suggest.fst.WFSTLookupFactory</str>
<str name="field">title2</str> <!-- the indexed field to derive suggestions from -->
<float name="threshold">0</float>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
<lst name="defaults">
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">suggest</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.count">8</str>
<str name="spellcheck.collate">true</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
my solr version is 4.2 CDH 4.7
please help

You are using a KeywordTokenizerFactory which treats the entire string as a single stream. So in your case, the 1st document would be indexed as
apache solr version 4.2
Since your auto suggest is turned on, your first query apac & others starting with the same prefix apac can match both the entries in index starting with it (as you have suggest turned on)
If you looking to match individual words in a text you should consider using another Tokenizer such as WhitespaceTokenizerFactory.
More details: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory

Related

Trying to implement scoped autosuggestions with solr

I am trying to implement scoped autosuggestions like in ecommerce websites like amazon etc.
eg.
if i type Lego , the suggestions should come like
Legolas in Names
Lego in Toys
where Names and Toys are solr field names.
closest aid i got is from this discussion:
solr autocomplete with scope is it possible?
Which informed me that it isn't possible with the suggester which I am currently using.
Until now, using the suggester I am able to achieve autosuggestions from a single solr field. [the autosuggest field , following guidelines in the suggester documentation]
Any ideas/links to help me with ?
Update
I tried to achieve autosuggestions using facets. My query looks something like:
http://localhost:8983/solr/core1/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=field1&facet.field=field2&facet.prefix=i
This gives me all the facet results starting with letter 'i' and term faceted to field1 and field2.
This gave me the idea.
Any comments?

I am assuming you are storing the Names or Toys data as in a field, let call it category.
You can configure the payloadField parameter in the searchComponent definition and pass the category data into it. Later in the application when you receive the suggestion results from solr, show first suggestion from each category or which ever strategy suits better for your use case.
You can find the more information in Solr Suggester.

Suggester component seems useful but in payload field, one can only return a single field which may not satisfy many of the use cases.
By Facet prefixing, you cannot get suggestions from a word in the middle. So "Lego" will give suggestion of a product whose value in name field is "Legolas Sample" but not from "Sample Legolas".
The third way is to implement autosuggest is by using a index analyzer that has a layer of EdgeNGramFilterFactory and then searching on the required prefix.
So, the solr schema will look like
<field name="names" type="string" multiValued="false" indexed="true" stored="true"/>
<field name="toys" type="string" multiValued="false" indexed="true" stored="true"/>
<field name="names_ngram" type="text_suggest_ngram" multiValued="false" indexed="true" stored="false"/>
<field name="toys_ngram" type="text_suggest_ngram" multiValued="false" indexed="true" stored="false"/>
and the field type would have a definition of
<fieldType name="text_suggest_ngram" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="10" minGramSize="2"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
and these _ngram fields would be a copyfield:
<copyField source="names" dest="names_ngram"/>
<copyField source="toys" dest="toys_ngram"/>
So , once you have reindexed your data, if you query for "Lego" it will give results from both "Sample Legolas" and "Legolas Sample". However, if you have to categorize these results according to n fields they matched, that would be n different queries which is usually not a problem.

You can add multiple suggester components.
Add one for each field.
E.g. :
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">namesSuggester</str>
<str name="lookupImpl">BlendedInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">Names</str>
<str name="weightField">Popularity</str>
<str name="indexPath">namesSuggesterIndexDir</str>
<str name="suggestAnalyzerFieldType">suggester</str>
</lst>
<lst name="suggester">
<str name="name">toysSuggester</str>
<str name="lookupImpl">BlendedInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">Toys</str>
<str name="weightField">Popularity</str>
<str name="indexPath">toysSuggesterIndexDir</str>
<str name="suggestAnalyzerFieldType">suggester</str>
</lst>
</searchComponent>

Solr accent removal

i have read various threads about how to remove accents during index/query time. The current fieldtype i have come up with looks like the following:
<fieldType name="text_general" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
After having added a couple of test information to index i have checked via http://localhost:8080/solr/test_core/admin/luke?fl=title
which kind of tokens have been generated.
For instance a title like "Bayern München" has been tokenized into:
<int name="bayern">1</int>
<int name="m">1</int>
<int name="nchen">1</int>
Therefore instead of replacing the character by its ascii pendant, it has been interpret as being a delimiter?! Having that kind of index results into that i neither can search for "münchen" nor m?nchen.
Any idea how to fix?
Thanks in advance.

The issue is you are applying StandardTokenizerFactory before applying the ASCIIFoldingFilterFactory. Instead you should use the MappingCharFilterFactory character filter factory first and the the StandardTokenizerFactory.
As per the Solr Reference guide StandardTokenizerFactory supports <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>. Therefore when you tokenize using StandardTokenizerFactory the umlaut characters are lost and your ASCIIFoldingFilterFactory is of no use after that.
Your fieldType should be like below if you want to go for StandardTokenizerFactory.
<fieldType name="text_general" class="solr.TextField">
<analyzer>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
The mapping-ISOLatin1Accent.txt should have the mappings for such "special" characters. In Solr this file comes pre-populated by default. For e.g. ü -> ue, ä -> ae, etc.

Solr - example spell checker not working

I have set up the spellchecker for the example installation configuration that comes with Solr. I have followed their instructions for the spellchecker here: [http://wiki.apache.org/solr/SpellCheckComponent][1]
The problem I have is that after following it exactly I still cannot get it to work?
The response when I build (http://localhost:8983/solr/spell?q=:&spellcheck.build=true&spellcheck.q=delll%20ultrashar&spellcheck=true)
looks as follows:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">14</int>
</lst>
<str name="command">build</str>
<result name="response" numFound="17" start="0">
...
</result>
<lst name="spellcheck">
<lst name="suggestions"/>
</lst>
</response>
And when I query with http://localhost:8983/solr/spell?q=:&spellcheck.q=delll+ultrashar&spellcheck=true&spellcheck.extendedResults=true
I get the following response
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<result name="response" numFound="17" start="0">
...
</result>
<lst name="spellcheck">
<lst name="suggestions">
<bool name="correctlySpelled">false</bool>
</lst>
</lst>
</response>
What gives? Am i missing something in my schema.xml?
The schema.xml is here: http://www.developermill.com/schema.xml
The solrConfig.xml is here: http://www.developermill.com/solrconfig.xml
The only change to the example files was the addition of the following in the solrconfig.xml:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<!--
Optional, it is required when more than one spellchecker is configured.
Select non-default name with spellcheck.dictionary in request handler.
-->
<str name="name">default</str>
<!-- The classname is optional, defaults to IndexBasedSpellChecker -->
<str name="classname">solr.IndexBasedSpellChecker</str>
<!--
Load tokens from the following field for spell checking,
analyzer for the field's type as defined in schema.xml are used
-->
<str name="field">spell</str>
<!-- Optional, by default use in-memory index (RAMDirectory) -->
<str name="spellcheckIndexDir">./spellchecker</str>
<!-- Set the accuracy (float) to be used for the suggestions. Default is 0.5 -->
<str name="accuracy">0.7</str>
<!-- Require terms to occur in 1/100th of 1% of documents in order to be included in the dictionary -->
<float name="thresholdTokenFrequency">.0001</float>
</lst>
<!-- Example of using different distance measure -->
<lst name="spellchecker">
<str name="name">jarowinkler</str>
<str name="field">lowerfilt</str>
<!-- Use a different Distance Measure -->
<str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
<str name="spellcheckIndexDir">./spellchecker</str>
</lst>
<!-- This field type's analyzer is used by the QueryConverter to tokenize the value for "q" parameter -->
<str name="queryAnalyzerFieldType">textSpell</str>
</searchComponent>
<!--
The SpellingQueryConverter to convert raw (CommonParams.Q) queries into tokens. Uses a simple regular expression
to strip off field markup, boosts, ranges, etc. but it is not guaranteed to match an exact parse from the query parser.
Optional, defaults to solr.SpellingQueryConverter
-->
<queryConverter name="queryConverter" class="solr.SpellingQueryConverter"/>
<!-- Add to a RequestHandler
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
NOTE: YOU LIKELY DO NOT WANT A SEPARATE REQUEST HANDLER FOR THIS COMPONENT. THIS IS DONE HERE SOLELY FOR
THE SIMPLICITY OF THE EXAMPLE. YOU WILL LIKELY WANT TO BIND THE COMPONENT TO THE /select STANDARD REQUEST HANDLER.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-->
<requestHandler name="/spellCheckCompRH" class="solr.SearchHandler">
<lst name="defaults">
<!-- Optional, must match spell checker's name as defined above, defaults to "default" -->
<str name="spellcheck.dictionary">default</str>
<!-- omp = Only More Popular -->
<str name="spellcheck.onlyMorePopular">false</str>
<!-- exr = Extended Results -->
<str name="spellcheck.extendedResults">false</str>
<!-- The number of suggestions to return -->
<str name="spellcheck.count">1</str>
</lst>
<!-- Add to a RequestHandler
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
REPEAT NOTE: YOU LIKELY DO NOT WANT A SEPARATE REQUEST HANDLER FOR THIS COMPONENT. THIS IS DONE HERE SOLELY FOR
THE SIMPLICITY OF THE EXAMPLE. YOU WILL LIKELY WANT TO BIND THE COMPONENT TO THE /select STANDARD REQUEST HANDLER.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-->
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>

The textSpell field definition is in the wrong place. The following fragment should be within the types tag inside the schema.xml:
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
</fieldType>
After you've fixed that, everything should work I guess, but I'd suggest you to work on cleaning up a little bit your example, since it basically contains everything you can configure. You should keep just what you really need.

Solr Partial And Full String Match

I am trying to allow searches on partial strings in Solr so if someone searched for "ppopota" they'd get the same result as if they searched for "hippopotamus." I read the documentation up and down and feel like I have exhausted my options. So far I have the following:
Defining a new field type:
<fieldtype name="testedgengrams" class="solr.TextField">
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
</fieldtype>
Defining a field of type "testedgengrams":
<field name="text_ngrams" type="testedgengrams" indexed="true" stored="false"/>
Copying contents of text_ngrams into text:
<copyField source="text_ngrams" dest="text"/>
Alas, that doesn't work. What am I missing?

You're using EdgeNGramFilterFactory which generates tokens 'hi', 'hip', 'hipp', etc, so it won't match 'ppopota'. Use NGramFilterFactory instead.

To enable partial word searching
you must edit your local schema.xml file, usually under solr/config, to add either:
NGramFilterFactory
EdgeNGramFilterFactory
Here's what mine looks like: sample solr schema.xml
Here's the line to paste:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
EdgeNGram
I went with the EdgeN option. It doesn't allow for searching in the middle of words, but it does allow partial word search starting from the beginning of the word. This cuts way down on false positives / matches you don't want, performs better, and is usually not missed by the users. Also, I like the minGramSize=2 so you must enter a minimum of 2 characters. Some folks set this to 3.
Once your local is setup and working, you must edit the schema.xml used by websolr, otherwise you will get the default behavior which requires the full-word to be entered even if you have full text searching configured for your models.
Take it to the next level
5 ways to speed up indexing
Special instructions for editing the websolr schema.xml if you are using Heroku
Go to the Heroku online dashboard for your app
Go to the resources tab, then click on the Websolr add-on
Click the default link under Indexes
Click on the Advanced Configuration link
Paste in your schema.xml from your local, including the config for your Ngram tokenizer of choice (mentioned above). Save.
Copy the link in the "Configure your Heroku application" box, then paste it into terminal to set your WEBSOLR_URL link in your heroku config.
Click the Index Status link to get nifty stats and see if you are running fast or slow.
Reindex everything
heroku run rake sunspot:reindex[5000]
Don't use heroku run rake sunspot:solr:reindex - it is deprecated, accepts no parameters and is WAY slower
Default batch size is 50, most people suggest using 1000, but I've seen significantly faster results (1000 rows per second as opposed to around 500 rps) by bumping it up to 5000+

Ok I'm doing the same thing with field name
name_de
And I managed to get this thing to work using copyField like this:
schema.xml
<schema name="solr-magento" version="1.2">
<types>
...
<fieldType name="type_name_de_partial" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="1000" side="front" />
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="1000" side="back" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords_de.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords_de.txt"/>
</analyzer>
</fieldType>
</types>
...
<fields>
...
<field name="name_de_partial" type="type_name_de_partial" indexed="true" stored="true"/>
</fields>
....
<copyField source="name_de" dest="name_de_partial" />
</schema>
Then create search condition in solrconfig.xml
<requestHandler name="magento_de" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<str name="tie">0.01</str> <!-- Tie breaker -->
<str name="qf">name_de_partial^1.0 name_de^3.0</str> <!-- Phrase Fields -->
<str name="pf">name_de_partial^1.0 name_de^3.0</str> <!-- Phrase Fields -->
<str name="mm">3<90%</str> <!-- Minimum 'Should' Match [id 1..3 must much all, else 90proc] -->
<int name="ps">100</int> <!-- Phrase Slop -->
<str name="q.alt">*:*</str>
..
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
With this solr is searching in fields name_de_partial with pow 1.0 and in name_de with pow 3.0
So if engine founds specific query word in name_de, then it is put on top of the list.
If he also finds something in name_de_partial then it also counts and is put in results.
And field name_de_partial is using specific solr filters so it can found word "hippie" using query "hip" or "ppie" or "ippi" without a swet.

If you set EdgeNGramFilterFactory or NGramFilterFactory both at index and query time, combined with q.op=AND (or default mm=100% if you are using dismax) you will experience some problems.
Try defining NGramFilterFactory only at index time:
<fieldType name="testedgengrams" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
</fieldType>
or try setting q.op=OR (or mm=1 if you are using dismax)

Solr exact word search

I want to configure my Solr search engine so I get an exact match for the search term I enter.
eg. 'taxes' should return documents with 'taxes' and not 'tax', 'taxation' etc.
Any help or tips would be appreciated.

I presume your field is a TextField, by default solr does a fuzzy search on this field. What you want is to set up your field as a string field and add no tokenizer then you'll get an exact match.
You can even combine the exact search with a fuzzy search and use DisMax to boost the relative weights.
Example (schema.xml) :
<field name="name" type="string" indexed="true" stored="false" required="true" />
<field name="nameString" type="string" indexed="true" stored="false" required="true" />
<copyField source="name" dest="nameString"/>
Example (solrconfig.xml) :
<requestHandler name="accounts" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">dismax</str>
<str name="qf">
nameString^10.0 name^5.0 description^1.0
</str>
<str name="tie">0.1</str>
</lst>
</requestHandler>

To turn off stemming in your schema.xml, you can define text field like this:
<types>
<!-- other fields definition -->
<fieldType name="text_no_stem" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!-- other fields definition -->
</types>
<fields>
<!-- other fields definition -->
<dynamicField name="*_nostem" type="text_no_stem" indexed="true" stored="true"/>
<!-- other fields definition -->
</fields>
I'm using sunspot to integrate solr with Ruby on Rails. With this in the schema.xml I define my searchable block like this:
searchable do
text(:wants, as: :wants_nostem)
end

Turn off stemming.

Use the quotes for exact match result :
Example :
core Name : core1
Key : namestring
http://localhost:8983/solr/core1/select?q=namestring:"taxes"&wt=json&indent=true

Use solr string field whcih will do an exact value search e.g
<fieldType class="solr.StrField" name="string" omitNorms="true" sortMissingLast="true" />

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string