Solr search returns different output - search

I am using bitnami solr 4.6.0-1 for seraching arabic and english word. While i am trying to search an arabic word say for example:'أبدل خبير' it brought 4 output as follows
<result name="response" numFound="4" start="0">
<doc><str name="Arname">أبدل</str></doc>
<doc><str name="Arname">أبدل يحيا</str></doc>
<doc><str name="Arname">كلسم حبير</str></doc></doc>
<doc><str name="Arname">حبير</str></doc>
Query executed like select?q=Arname%3A++أبدل+خبير~0.75&fl=Arname&wt=xml&indent=true
But when i searchd the word in reverse order 'خبير أبدل' it brought only 2 output
<result name="response" numFound="4" start="0">
<doc><str name="Arname">أبدل</str></doc>
<doc> <str name="Arname">أبدل يحيا</str></doc>
</result>
Query Executed like select?q=Arname%3A++خبير+أبدل~0.75&fl=Arname&wt=xml&indent=true
The field type set to Arname is text_ar
The Schema of text_ar is as follows
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- for any non-arabic -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
<!-- normalizes ﻯ to ﻱ, etc -->
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
</fieldType>
Please tell me a solution for this.

Related

Solr: Ignore casing of strings when calculating facet numbers

I have these values for the title field in my database:
"I Am A String"
"I am A string"
I want to make the title field available as facets in my search results.
Current result:
<lst name="title">
<int name="I Am A String">4</int>
<int name="I am A string">3</int>
</lst>
Desired result:
<lst name="title">
<int name="I Am A String">7</int>
</lst>
I actually don't care which of the 2 available string options is chosen for the final result, as long as the same strings (case insenstive) are counted for the same facet.
I tried the following field definitions for the title field. I also added the resulting facet logic.
string = sees casing as different strings
string_exact = sees casing as different strings
text_ws = breaks up into words with casing intact
text = breaks into separate words
textTight = breaks into separate words
textTrue = breaks up in words with casing intact
string_exacttest = breaks up in words with casing intact
Here's my schema.xml
<field name="title" type="string" indexed="true" stored="true"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
<fieldType name="string_exact" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<!-- A text field that uses WordDelimiterFilter to enable splitting and matching of words on case-change, alpha numeric boundaries, and non-alphanumeric chars, so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
Synonyms and stopwords are customized by external files, and stemming is enabled. Duplicate tokens at the same position (which may result from Stemmed Synonyms or WordDelim parts) are removed.-->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!--<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<!-- Less flexible matching, but less false matches. Probably not ideal for product names,but may be good for SKUs. Can insert dashes in the wrong place and still match. -->
<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
<!--
this filter can remove any duplicate tokens that appear at the same position - sometimes possible with WordDelimiterFilter in conjuncton with
stemming.
-->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="textTrue" class="solr.TextField" positionIncrementGap="100" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
</analyzer>
</fieldType>
How can I make sure that the same strings (ignoring case) are grouped together when calculating the facets?
The string_exact definition is almost what you need, but you need to have a LowercaseFilter applied as well, so that each sentence is lowercased. The KeywordTokenizer keeps the whole value as a single token (so you won't see it broken into separate terms based on whitespace), and while a string field doesn't allow any additional processing, a TextField with a KeywordTokenizer behaves the same way - but you can add filters to how the token is processed afterwards.
<fieldType name="string_facet" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

SOLR WordDelimiterFilterFactory

I use WordDelimiterFilterFactory to split words that have numbers into solr tokens. For example the word Php5 is split in two tokens "PHP", "5".When searching, the request that is executed by SOLR is q="php" and q="5". But this request finds even results with "5" only. What I want is to find documents with "PHP5" or "PHP 5" only.
If someone has any idea to get around this please.
Hope it is clear.
Thank's.
You need to get solr, in addition to indexing "php5", to index "php 5" as a single token. That way a search for "php 5" will match but a search for "blah 5" will not, for example.
The only way I was able to get this to work well was to use the Auto Phrasing filter by lucid works.
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" phrases="autophrases.txt" includeTokens="true" replaceWhitespaceWith="_" />
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
synonyms.txt
php5,php_5
protwords.txt (so the delimiter doesn't break it)
php5,php_5
You also have to change the query parser to use the lucid parser.
solrconfig.xml
<queryParser name="autophrasingParser" class="com.lucidworks.analysis.AutoPhrasingQParserPlugin" >
<str name="phrases">autophrases.txt</str>
<str name="replaceWhitespaceWith">_</str>
<str name="ignoreCase">false</str>
</queryParser>
<requestHandler name="/searchp" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">Keywords</str>
<str name="defType">autophrasingParser</str>
</lst>
</requestHandler>
autophrases.txt
php 5
The filter can be found here: https://github.com/LucidWorks/auto-phrase-tokenfilter
This article was also very helpful: http://lucidworks.com/2014/07/02/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
This filter splits tokens at word delimiters.
In your case you can opt for splitOnNumerics="0", so it wont spilt on numbers.
splitOnNumerics:
(integer, default 1) If 0, don't split words on
transitions from alpha to numeric:"FemBot3000" -> "Fem", "Bot3000"
The rules for determining delimiters are determined in the below link
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter

Solr - termfreq partial matches

I'm using Solr to query a set of documents and I want to get the number of matches for certain term, right now I'm using
termfreq(text,'manage')
However this does not hit on Manager or Management
termfreq(text,'manage*')
returns the same count. I've tried using different tokenizers, some won't even accept the * and I haven't found one that returns the correct number of matches.
Field:
<field name="text" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" required="false"/>
Is there a way I can get termfreq to also count partial matches?
You will need to add some custom tokenizers and and filter classes to the analyzer.
In your /shared/field_types.xml file, create a new type like this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
And in /shared/fields.xml:
<field name="text" stored="true" type="text" multiValued="false" indexed="true"/>
<dynamicField name="*_text" stored="true" type="text" multiValued="false" indexed="true"/>
And use that as "text" as the type of the field.
A more advanced solution:
<fieldType name="startsWith" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!-- remove words/chars we don't care about -->
<filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement="" replace="all"/>
<!-- now remove any extra space we have, since spaces WILL influence matching -->
<filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement="" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
In /shared/fields.xml:
<dynamicField name="*_starts_with" stored="true" type="startsWith" multiValued="false" indexed="true"/>
Then, in the top level of your core's schema.xml add this:
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="../../../shared/fields.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="../../../shared/field_types.xml"/>
And add this to your copyFields in the core's schema.xml:
<copyFields>
<copyField source="yourField" dest="yourField_text"/>
<copyField source="yourField" dest="yourField_starts_with"/>
...
</copyFields>
I have had the same problem. I needed to count the termfreq, which also should match on subparts of words.
Add this FieldType solved it.
<fieldType name="startWith" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

Solr - WhiteSpaceTokenizerFactory works for index but not while querying

Consider the following schema,
<schema>
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" multiValued="false"/>
<fieldType name="stop_analyzer_string" class="solr.TextField" multiValued="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
</fieldType>
</types>
<fields>
<field name="name_search" type="stop_analyzer_string" indexed="true" stored="false"/>
<copyField source="name" dest="name_search"/>
<field name="name" type="string" indexed="true" stored="true"/>
</fields>
</schema>
The name field gets indexed with WhitespaceTokenizerFactory, but it doesn't seem to use the WhitespaceTokenizerFactory while querying with the name field.
For a doc with name as "solr search",
the query name_search:solr - matches the document. //index time WhiteSpace tokenizer works
the query name_search:search - matches the document. //index time WhiteSpace tokenizer works
But the query name_search:solr search - doesn't match the document. //query time WhiteSpace tokenizer doesn't work
But as specified in the schema, the query should also be tokenized with whitespace and matched with the document. no?
Not sure what you are missing, but all the above queries worked for me for the data that you mentioned.
http://localhost:8983/solr/collection1/select?q=name_search%3Asolr+search&wt=xml&indent=true
The above returned result document i indexed.
Just to test do this:
http://localhost:8983/solr/#/collection1/documents
Got to :
And paste below document as is into your Document(s) part and hit Submit Document
{"id":"100001","name_search":"solr search"}
Run you query as:
http://localhost:8983/solr/collection1/select?q=name_search%3Asolr+search&wt=json&indent=true

Solr - Search words within a string

I have a field containing a lot of words, for example:
"hello my name is Nicole and I am working with Solr"
and I need Solr to return this document if I search for this words (note that the word order is not as in the indexed text):
"am name with"
I am using this configuration
<fieldType name="propertiesField" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="-" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
and this query:
select/?q=properties_all:am-name-with&version=2.2&start=0&rows=10&indent=on
when I analize it with the analyzer, those words are highlighted but no document is found when I do the search.
Thanks for your help!!!!
If there is no good reason to use different index and query time analyzer, do not.
I would fieldType like:
<fieldType name="propertiesField" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
Additionally set default operator as AND (schema file)
<solrQueryParser defaultOperator="AND"/>
Then query Solr with:
select/?q=properties_all:(am name with)&version=2.2&start=0&rows=10&indent=on

Resources