Remove diacritics at index time in Solr

I am fine-tuning a Solr search. I'm using Solr 4.0.
I have normally worked with the language analyzers and tokenizers for English, but this time I'm working with Portuguese and the results are not what I expect.
For example: I search for the word 'proteses', but what is indexed is 'próteses', with diacritics, so the search returns the wrong results.
What I need is to remove all diacritics both before indexing and before searching, so that the two forms match. However, I can't find out how to do this.
Can anyone point me in the right direction?

You have to use a char mapping filter on the fields that can contain diacritics. This filter will normalize them.
For example:
<fieldType name="text_with_diacritics" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
The mapping-ISOLatin1Accent.txt file that ships with Solr has mappings for many diacritics.
You will have to reindex your documents after you configure this filter.

Solr also has several ICU filters available, including both a normalization filter and a folding filter, which allow accents and diacritics to be removed across Unicode.
There is also an ASCIIFoldingFilter, which attempts to convert any character above the standard 7-bit ASCII range into an equivalent within that range.
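As a rough sketch of the ICU option, the field type below applies ICU folding at both index and query time (a single analyzer element covers both). The field type name text_pt_folded is made up for illustration, and this assumes the analysis-extras contrib jars that provide the ICU factories are on the classpath.

```xml
<!-- Sketch only: "text_pt_folded" is a hypothetical field type name. -->
<!-- ICUFoldingFilterFactory is part of the analysis-extras contrib. -->
<fieldType name="text_pt_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Folds accents and case across Unicode. -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- ASCII-only alternative that needs no extra jars:
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    -->
  </analyzer>
</fieldType>
```

As with the char filter approach, documents need to be reindexed after the analysis chain changes.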

Related

How to prevent Apache Solr from returning "suites" when searching for "suits" and vice versa?

We are a listings/business directory company that uses Apache Solr 4.7.2. When we do a search for "suits" in "Melbourne", our top two results are hotels that contain the word "suites" and the rest of the results are tailors, clothing retailers, etc., as expected. How do I prevent Solr from including hotels/suites in a search for "suits"?
This is due to stemming. There are two ways to handle it:
Disable stemming completely by removing the stemming filter from schema.xml.
Use KeywordMarkerFilterFactory if you just want to exclude specific keywords from being stemmed. In this particular case you would create a protwords.txt file with two lines, "suits" and "suites" (plus any other keywords you want to protect from stemming).
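A minimal sketch of the second option follows. It assumes the field uses PorterStemFilterFactory as its stemmer (the question does not say which one is configured) and uses a hypothetical field type name. The keyword marker has to come before the stemmer in the chain so that protected tokens are flagged before stemming runs.

```xml
<!-- Sketch only: "text_en_protected" is a hypothetical field type name,
     and the choice of PorterStemFilterFactory is an assumption. -->
<fieldType name="text_en_protected" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- protwords.txt lists one protected word per line, e.g. suits and suites. -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```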

Solr detects the language. How do I search now across multiple description_* fields?

I am trying to get multi-language stemming working with Solr. I have set up language detection with LangDetectLanguageIdentifierUpdateProcessorFactory as per the official Solr guides. The language is recognized and now I have a whole bunch of dynamic fields like:
description_en
description_de
description_fr
...
which are properly stemmed.
The question now is how do I search across so many fields? Building a long query every time that searches across dozens of possible language fields doesn't seem like a smart option. I have tried using copyField like:
<copyField source="description_*" dest="text"/>
but stemming is being lost in the text field when I do that.
The text field is defined as solr.TextField with solr.WhitespaceTokenizerFactory. Maybe I am not setting up the text field properly. How is this supposed to be done?
You have multiple options:
Search over all the fields you mentioned. There will always be some overhead: the more fields you use, the slower the search becomes (gradually).
Try to recognise the query language and search only the necessary fields, for example the recognised language plus a default one. Here you can find a library for this.
Develop a custom solution with multiple languages in one field, which is possible and can work in production according to Trey Grainger.
The question is a bit old, but maybe that answer will help other people.
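For the first option, one way to avoid building a long query by hand each time is to list the language fields once as edismax query fields in solrconfig.xml. This is only a sketch: the handler name /select_multilang is hypothetical, and the qf list has to name every description_* field that actually exists in the index.

```xml
<!-- Sketch only: "/select_multilang" is a hypothetical handler name. -->
<requestHandler name="/select_multilang" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Each field is analyzed with its own language-specific chain. -->
    <str name="qf">description_en description_de description_fr</str>
  </lst>
</requestHandler>
```

Because each field keeps its own analyzer, stemming is preserved, unlike with the copyField into a single whitespace-tokenized text field.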

How can I search a single word in apache Solr?

I am using Apache Solr for indexing using DataImportHandler.
The document structure is as follows:
id(long), title(text), abstract(text), pubDate(date)
I combined the title and abstract fields for text searching. My problem is that when I query
"title: utility" it gives the following results:
id, title
6, Financial Deal Insights Energy & Utilities December 2008
11, Residential utility retail strategies in an economic downturn
16, Financial Deal Insights: Energy & Utilities Review of 2008
41, Solar at the heart of utility corporate strategy
I want to search only for "utility", but it also returns results for "utilities".
I also tried title:"utility" and title:utility~1, but neither worked.
I read about 'stemming' but I don't have any idea how to use it.
Please help me.
Thanks.
This is because of the PorterStemFilterFactory in your text analysis chain.
<filter class="solr.PorterStemFilterFactory"/>
The stemmer reduces words to their root, and hence utility matches utilities as well.
Check whether you need a stemmer for searching; if not, you can remove it from your filter chain.
Otherwise, look for a less aggressive stemmer that fits your needs.
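If stemming is still wanted for general searches, one possible arrangement is to keep the stemmed field and add an unstemmed copy for exact-word queries. The sketch below assumes hypothetical names (text_unstemmed, title_exact) that are not in the original schema.

```xml
<!-- Sketch only: "text_unstemmed" and "title_exact" are hypothetical names. -->
<fieldType name="text_unstemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercasing only; no stem filter in this chain. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_exact" type="text_unstemmed" indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>
```

A query such as title_exact:utility would then be matched against unstemmed tokens, after a reindex.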

Solr, Special Chars, and Latin to Cyrillic char conversion

I am trying to set up a search engine using Solr (or Lucene) which could have text in both Latin with special chars (such as Ö or Ç) and Cyrillic chars (such as Б or б and Ж or ж).
I am trying to find a solution that lets users search for words containing these characters even when they do not have the keys on their keyboard.
Example would be (making up words here, hopefully won't offend anyone):
"BÖÖK" would be found when searching for "book"
"ЖRAY" would be found when searching for XRAY
"ЖRAY" would also be found if searching for ZRAY, ZHRAY, or žray (see GOST 16876-71 for info on transliteration of Cyrillic to Latin chars).
So, how should I go about this? Some theories I have are:
allow multiple text fields to be stored for each original string, one in original form, one in the first pass of transliteration (which, for example, would convert Ö to just O and Ж to ž, but also X) and then one in the third form (from the ž to z or zh) -> means I will be storing a LOT of data...
store in solr as is, and let Solr do the magic -> don't know how well this will work... can't see anything in solr to do this
Magic bullet I have not found yet...
Any ideas? Anyone tried this before?
Take a look at Solr's Analyzers, Tokenizers, and Token Filters which give you a good intro to the type of manipulation you're looking for.
You need to use an accent filter in both your index and query text analysis, which converts accented characters to their unaccented ASCII equivalents.
You can use ISOLatin1AccentFilterFactory or ASCIIFoldingFilterFactory depending upon the Solr version you are using.
e.g.
<filter class="solr.ASCIIFoldingFilterFactory" />
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
So:
"BÖÖK" would be converted and indexed as "book" in Solr.
This would enable users to search for either book or BÖÖK and still get back the document.
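The folding filter on its own does not address the Cyrillic part of the question. As a hedged sketch, a transliteration step could be placed ahead of it using ICUTransformFilterFactory from the analysis-extras contrib. The field type name is hypothetical, and the mapping follows ICU's Cyrillic-Latin rules, which are not necessarily identical to GOST 16876-71 and would not cover alternatives such as matching on X or ZH.

```xml
<!-- Sketch only: "text_translit" is a hypothetical field type name. -->
<!-- ICUTransformFilterFactory requires the analysis-extras contrib jars. -->
<fieldType name="text_translit" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Transliterate Cyrillic letters to Latin using ICU's built-in rules. -->
    <filter class="solr.ICUTransformFilterFactory" id="Cyrillic-Latin"/>
    <!-- Strip remaining accents, then lowercase. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Since the same analyzer runs at index and query time, both the stored text and the user's input go through the same conversion.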

Substring matches within SOLR

I can't seem to figure out how to find substring matches with SOLR. I've figured out matches based on a prefix, so I can get ham to match hamburger.
How would I get a search for 'burger' to match hamburger as well? I tried *burger but this tossed an error: '*' or '?' not allowed as first character in WildcardQuery.
How can I match substrings using SOLR?
If anyone ends up here after searching for "apachesolr substring", there's a simpler solution for this: https://drupal.stackexchange.com/a/27956/10419 (from https://drupal.stackexchange.com/questions/26024/how-can-i-make-search-with-a-substring-of-a-word)
Add an n-gram filter to the text type definition in schema.xml in the Solr config directory.
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- tokenizer and other filters omitted in the original snippet -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25" />
  </analyzer>
</fieldType>
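Note that edge n-grams are built from one end of each token, so they cover prefixes such as ham but not an interior or trailing piece such as burger. For true substring matching, a sketch using NGramFilterFactory, which emits grams from every position, might look like the following. The field type name is hypothetical and the gram sizes are simply carried over from the snippet above.

```xml
<!-- Sketch only: "text_substring" is a hypothetical field type name. -->
<fieldType name="text_substring" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index every substring of length 3 to 25 of each token. -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-grams at query time: the whole query term is looked up as one gram. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

This trades index size for query speed, since each token produces many grams.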
You can enable leading wildcards, but it will be very resource hungry (e.g. search for SuffixQuery).
See: http://lucene.472066.n3.nabble.com/Leading-Wildcard-Search-td522362.html
Quoting the mailing list:
Work arounds? Imagine making a second index (or adding another field) with all of the terms spelled backwards.
=>
See Add ReverseStringFilter https://issues.apache.org/jira/browse/LUCENE-1398
and Support for efficient leading wildcards search: https://issues.apache.org/jira/browse/SOLR-1321
At the moment issues.apache.org seems down. Try to use e.g. google cache.
As stated in the link above, you can use leading wildcards with edismax (ExtendedDismaxQParser). Just try it out to see if it is fast enough.
Some more info about the reversed-string approach mentioned above can also be found here: solr.ReversedWildcardFilterFactory
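As an illustration of the reversed-string approach, the sketch below adds ReversedWildcardFilterFactory to the index analyzer only, so that each token is indexed both as-is and reversed. The field type name is hypothetical, and the attribute values follow the commonly seen example configuration rather than anything tuned for a particular index.

```xml
<!-- Sketch only: "text_rev" is a hypothetical field type name. -->
<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- withOriginal="true" keeps the normal token alongside the reversed one. -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a field of this type, a leading-wildcard query such as *burger can be rewritten against the reversed tokens instead of scanning every term.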

Resources