Solr accent removal

I have read various threads about how to remove accents at index/query time. The current fieldType I have come up with looks like this:
<fieldType name="text_general" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
After adding some test data to the index, I checked via http://localhost:8080/solr/test_core/admin/luke?fl=title
which tokens had been generated.
For instance, a title like "Bayern München" was tokenized into:
<int name="bayern">1</int>
<int name="m">1</int>
<int name="nchen">1</int>
So instead of being replaced by its ASCII counterpart, the character has been interpreted as a delimiter?! With an index like that, I can search for neither "münchen" nor m?nchen.
Any idea how to fix this?
Thanks in advance.

The issue is that you apply StandardTokenizerFactory before ASCIIFoldingFilterFactory. Instead, apply the MappingCharFilterFactory character filter first and then the StandardTokenizerFactory.
As per the Solr Reference Guide, StandardTokenizerFactory emits the token types <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>. In your index the umlaut is already lost at the tokenizer stage, so the ASCIIFoldingFilterFactory that runs afterwards has nothing left to fold.
If you want to keep StandardTokenizerFactory, your fieldType should look like this:
<fieldType name="text_general" class="solr.TextField">
<analyzer>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
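The mapping-ISOLatin1Accent.txt file holds the mappings for such "special" characters, and Solr ships it pre-populated. The stock file folds accented letters to their base letter (for example ü -> u, ä -> a); add your own entries if you prefer German-style transliterations such as ü -> ue.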
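To see why the order matters, here is a minimal Python sketch, not Solr code. The `CHAR_MAP` table is a hypothetical stand-in for a mapping file, and `ascii_tokenize` is a toy tokenizer that only keeps ASCII letters and digits, used here to reproduce the tokens from the question. Mapping characters before tokenizing keeps the word in one piece.

```python
import re

# Hypothetical stand-in for a char filter mapping file; not the contents of Solr's file.
CHAR_MAP = {"ü": "u", "ö": "o", "ä": "a", "ß": "ss"}

def map_chars(text):
    """Replace each mapped character before tokenization, like a char filter does."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)

def ascii_tokenize(text):
    """Toy tokenizer: keep runs of ASCII letters and digits, lowercased."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

title = "Bayern München"
print(ascii_tokenize(title))             # ['bayern', 'm', 'nchen']
print(ascii_tokenize(map_chars(title)))  # ['bayern', 'munchen']
```

The same mapping has to run at query time as well, so that a search for "münchen" is also reduced to "munchen".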

Related

How to make characters that are part of SOLR query syntax searchable?

I have been trying to solve this problem for quite some time. I am not a Solr expert; I am still learning it.
I have a special type of ID in my system that has to be searchable by users. The problem is that those IDs contain some Solr special characters. By the way, those IDs are stored together with other search terms in the terms_txt field.
Some ID examples: 292/2017 and 1.2.61-962-37/2017
I will refer to the first one as the 'simple one' and the second as the 'complex one'.
From what I have read on the internet, this kind of search should be possible with a phrase search: if we put quotation marks around the ID, it should work. Unfortunately that is not the case. I will post my Solr 4.0 schema and an example of my query here, hoping that you can spot what is wrong. If phrase search is the answer to my problem, then something must be wrong with either my Solr schema or my query (code).
In my example I am searching for "292/2017" as a phrase. Only one entry in my index contains this phrase, because this combination of characters is unique (it is a kind of ID, but we insert it into the terms_txt field with all the other terms).
This is the query executed via the Solr admin. It finds a lot of results, but there should be only 1. It seems that Solr treats the '/' character as a space and ignores terms shorter than 3 characters (ignoring terms shorter than 3 is what we want, but not in a phrase search):
INFO: [collection1] webapp=/solr-example path=/select params={q=terms_txt:"44/2017"&wt=xml} hits=31343 status=0 QTime=6
So basically, in this example Solr has found all records containing the term 2017, which is bad...
This is the query executed within the application logic. It is more complex, but the problem is the same:
INFO: [collection1] webapp=/solr-example path=/select params={mm=100%25&json.nl=flat&fl=id&start=0&sort=date_in_i+desc&fq=type_s:2&fq=date_in_i:[20161201+TO+*]&fq=date_in_i:[*+TO+20171011]&fq=subtype_s:(2+4+6+8)&fq=terms_txt:"\"10/2017\""&fq=language_is:0&rows=10&bq=&q=\"10\/2017\"&tie=0.1&defType=edismax&omitHeader=true&qf=terms_txt&wt=json} hits=978 status=0 QTime=2
This is how terms_txt entries look in the index:
<arr name="terms_txt">
<str>Some string blah blah 292/2017 - more of terms, blah blah</str>
<str>Something else, blah blah</str>
</arr>
This is my solr schema field configuration for the terms_txt field (fields are dynamic):
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(^|\s)([^\-\_&\s]+([\-\_&]+[^\-\_&\s]*)+)(?=(\s|$))" replacement="$1MжџљМ$2 $2" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&]+" replacement="MжџљМ$1" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&]+" replacement="MжџљМ$1" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&]+" replacement="MжџљМ$1" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="MжџљМ" replacement="" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w)&(\w)" replacement="$1and$2" />
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="3" max="99"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\b[\-_]+\b" replacement="" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w)&(\w)" replacement="$1and$2" />
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="3" max="99"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
Does anyone have a clue how I can make special characters like .-/ searchable? Can you spot a flaw in my example or suggest a better solution?

You should start by looking at what the analysis page for your content tells you - my guess is that the StandardTokenizer will remove a lot of special characters when tokenizing (and your PatternReplaces might remove content as well).
The Whitespace Tokenizer is better suited for a field where matching special characters is important, since it'll only break on and remove whitespace.
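As a rough illustration of the difference, the Python sketch below compares two toy tokenizers. `toy_standard_tokens` is only a crude approximation that splits on anything that is not an ASCII letter or digit; the real StandardTokenizer follows more detailed Unicode word-break rules. The length limits are the ones from the LengthFilterFactory in the question.

```python
import re

def toy_standard_tokens(text):
    """Crude stand-in for a word-oriented tokenizer: split on non-alphanumerics."""
    return re.findall(r"[A-Za-z0-9]+", text)

def whitespace_tokens(text):
    """Split on whitespace only, keeping punctuation inside the tokens."""
    return text.split()

def length_filter(tokens, min_len=3, max_len=99):
    """Drop tokens outside the length range, like min="3" max="99" in the question."""
    return [t for t in tokens if min_len <= len(t) <= max_len]

query = "44/2017"
print(length_filter(toy_standard_tokens(query)))  # ['2017']
print(length_filter(whitespace_tokens(query)))    # ['44/2017']
```

With the first chain only 2017 survives, which would explain why the phrase query matches every record containing 2017.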
Define different fields and use different tokenizers for those fields, then prioritize hits in those fields based on a weight. Instead of trying to make one field fit all your query needs, create multiple fields, one per definition, and query all of them. You can adjust the weights with qf in the (e)dismax handlers. edismax also allows you to boost phrase matches for two- and three-word shingles (pf2 and pf3).
Use one or more copyField instructions to get your content from one field to the other fields, so you don't have to change your indexing code to adjust how you tweak things in Solr.
If you append debugQuery=true to your query string, you can also see how Solr / Lucene computes the score for each document and what contributes to its ranking, so you can tweak scoring values and see exactly how the final score changes.
When writing the query, escape any special characters with \.
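A small helper for the escaping step could look like the following Python sketch. The function name `escape_solr_term` is hypothetical, and the character list is my reading of the characters that have special meaning in the Lucene/Solr query syntax. Note that escaping only affects how the query parser reads the input; how the term is split afterwards still depends on the field's analyzer.

```python
# Characters treated as special by the Lucene/Solr query syntax (assumed list).
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&/')

def escape_solr_term(term):
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)

print(escape_solr_term("292/2017"))            # 292\/2017
print(escape_solr_term("1.2.61-962-37/2017"))  # 1.2.61\-962\-37\/2017
```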

How to search arabic words in solr

In my Solr schema.xml I defined the product Arabic name field as below:
<field name="productNameArabic" type="text_ar" indexed="true" stored="true"/>
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
</fieldType>
I want to search by product name using Arabic letters. Arabic users can find it a little difficult to search for some product names, because certain characters have to be typed exactly.
Ex: إ أ آ
The characters above require a Shift key combination. Users will usually just type the plain “ ا ” character and expect to find words such as the one below.
Ex: إبرا

I was able to achieve the desired functionality by adding ASCIIFoldingFilterFactory. This filter removes accents in different languages, so that the variants become the same at index time.
<fieldType name="arabic" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
</fieldType>
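The matching behaviour asked for in the question can be sketched in Python as folding the alef variants to the bare alef on both the indexed text and the typed query. The mapping table below is an illustrative assumption, not a copy of any Solr filter's internal table.

```python
# Alef variants mapped to bare alef (U+0627). Illustrative table only.
ALEF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
})

def fold_alef(text):
    """Replace every alef variant with the bare alef."""
    return text.translate(ALEF_VARIANTS)

indexed = fold_alef("\u0625\u0628\u0631\u0627")  # word from the question, written with hamza
typed = fold_alef("\u0627\u0628\u0631\u0627")    # same word typed with the plain alef
print(indexed == typed)  # True
```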

Tune solr phrase query search

We are trying to tune our phrase queries in DSE search.
For example, if we have a column named X with the value "D A T A S T A X", we search for an exact match with X:"T A S T".
Words are tokenized with the WhitespaceTokenizer.
We have a couple hundred million records in the database and all the indexes are in memory (we verified this using pcstat). However, the queries still take 5-15 seconds. Why does it take so much time to pull the results if all the indexes are in memory? How can I tune this?
Any help is appreciated.

Try this fieldType:
<fieldType name="custom_edge_ngram" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^A-Za-z0-9])" replacement="" replace="all"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^A-Za-z0-9])" replacement="" replace="all"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Here the KeywordTokenizerFactory tokenizer passes the text stream to the filters unchanged, as a single token. The PatternReplaceFilterFactory removes everything except letters and digits; you can configure this however you want. Then we lowercase the stream and generate the n-grams. This is the index phase. For the query phase we don't generate n-grams, because we want to match the exact substring.
We use NGram instead of EdgeNGram because NGram produces every substring. EdgeNGram only produces substrings anchored at the start or the end of the token, so it is not helpful in this case.
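The two analysis chains can be imitated in a few lines of Python to check the idea against the example from the question. This is only a sketch of the logic, with the gram sizes taken from the fieldType above; `normalize` and `ngrams` are illustrative helper names.

```python
import re

def normalize(text):
    """Keyword tokenizer + pattern replace + lowercase: one token, letters and digits only."""
    return re.sub(r"[^A-Za-z0-9]", "", text).lower()

def ngrams(token, min_size=2, max_size=15):
    """All substrings of the token whose length is between min_size and max_size."""
    return {
        token[i:i + n]
        for n in range(min_size, max_size + 1)
        for i in range(len(token) - n + 1)
    }

index_terms = ngrams(normalize("D A T A S T A X"))  # index-time analysis
query_term = normalize("T A S T")                   # query-time analysis, no n-grams

print(query_term)                 # tast
print(query_term in index_terms)  # True
```

The phrase query becomes a single term lookup, at the cost of a larger index.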
Hope this helps.

Search Solr: Match exact result Koh S*

I have a schema as below
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
And my document will be:
Koh Samui
Koh Chang
Koh Lanta
When I search for Koh*, it returns 3 results, which is expected. But when I search for koh S*, it returns zero results. I want the result to be only Koh Samui.

I assume you want to use a wildcard query. Whether that works depends on the query parser you are using: the Standard (lucene) query parser and eDisMax support wildcard queries, but DisMax does not.
And to be honest, if you are using Solr mainly for wildcard queries, you might be using the wrong tool: any SQL database with text LIKE 'Koh S%' in the WHERE clause will give you what you want.
But anyway, to use a wildcard query in Solr, go through the Lucene query parser. Because the field is indexed with KeywordTokenizerFactory, each name is a single token, so escape the space to keep the parser from splitting the input into two clauses:
http://localhost:8983/solr/core/select?q={!lucene df=text}Koh\ S*
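When the request is sent from code rather than typed into a browser, the local params, the space and the wildcard need URL encoding. A minimal Python sketch using only the standard library follows; the host, core and field names are simply the ones from the example URL, and the backslash before the space is an assumption that the whole name should be treated as one term.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed values: host, core name and field name come from the example URL.
base = "http://localhost:8983/solr/core/select"
# The backslash escapes the space so the parser sees one term instead of two clauses.
query = r"{!lucene df=text}Koh\ S*"

url = base + "?" + urlencode({"q": query})
print(url)

# Round trip: decoding the URL gives back the original query string.
decoded = parse_qs(urlparse(url).query)["q"][0]
print(decoded == query)  # True
```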

Solr Partial And Full String Match

I am trying to allow searches on partial strings in Solr so if someone searched for "ppopota" they'd get the same result as if they searched for "hippopotamus." I read the documentation up and down and feel like I have exhausted my options. So far I have the following:
Defining a new field type:
<fieldtype name="testedgengrams" class="solr.TextField">
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
</fieldtype>
Defining a field of type "testedgengrams":
<field name="text_ngrams" type="testedgengrams" indexed="true" stored="false"/>
Copying contents of text_ngrams into text:
<copyField source="text_ngrams" dest="text"/>
Alas, that doesn't work. What am I missing?

You're using EdgeNGramFilterFactory which generates tokens 'hi', 'hip', 'hipp', etc, so it won't match 'ppopota'. Use NGramFilterFactory instead.
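The difference is easy to check with a short Python sketch that generates both kinds of grams for the word from the question. It imitates the idea only, using the gram sizes from the fieldType above and front-anchored edge grams.

```python
def edge_ngrams(token, min_size=2, max_size=15):
    """Prefixes of the token, as a front-anchored edge n-gram filter would emit."""
    return [token[:n] for n in range(min_size, min(max_size, len(token)) + 1)]

def ngrams(token, min_size=2, max_size=15):
    """Every substring of the token within the size limits."""
    return [
        token[i:i + n]
        for n in range(min_size, min(max_size, len(token)) + 1)
        for i in range(len(token) - n + 1)
    ]

word = "hippopotamus"
print(edge_ngrams(word)[:3])            # ['hi', 'hip', 'hipp']
print("ppopota" in edge_ngrams(word))   # False
print("ppopota" in ngrams(word))        # True
```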

To enable partial word searching
you must edit your local schema.xml file, usually under solr/config, to add either:
NGramFilterFactory
EdgeNGramFilterFactory
Here's what mine looks like: sample solr schema.xml
Here's the line to paste:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
EdgeNGram
I went with the EdgeN option. It doesn't allow for searching in the middle of words, but it does allow partial word search starting from the beginning of the word. This cuts way down on false positives / matches you don't want, performs better, and is usually not missed by the users. Also, I like the minGramSize=2 so you must enter a minimum of 2 characters. Some folks set this to 3.
Once your local setup is working, you must also edit the schema.xml used by websolr; otherwise you get the default behavior, which requires the full word to be entered even if you have full-text searching configured for your models.
Take it to the next level
5 ways to speed up indexing
Special instructions for editing the websolr schema.xml if you are using Heroku
Go to the Heroku online dashboard for your app
Go to the resources tab, then click on the Websolr add-on
Click the default link under Indexes
Click on the Advanced Configuration link
Paste in your schema.xml from your local, including the config for your Ngram tokenizer of choice (mentioned above). Save.
Copy the link in the "Configure your Heroku application" box, then paste it into terminal to set your WEBSOLR_URL link in your heroku config.
Click the Index Status link to get nifty stats and see if you are running fast or slow.
Reindex everything
heroku run rake sunspot:reindex[5000]
Don't use heroku run rake sunspot:solr:reindex - it is deprecated, accepts no parameters and is WAY slower
Default batch size is 50, most people suggest using 1000, but I've seen significantly faster results (1000 rows per second as opposed to around 500 rps) by bumping it up to 5000+

OK, I'm doing the same thing with the field
name_de
and I managed to get it working using copyField like this:
schema.xml
<schema name="solr-magento" version="1.2">
<types>
...
<fieldType name="type_name_de_partial" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="1000" side="front" />
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="1000" side="back" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords_de.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords_de.txt"/>
</analyzer>
</fieldType>
</types>
...
<fields>
...
<field name="name_de_partial" type="type_name_de_partial" indexed="true" stored="true"/>
</fields>
....
<copyField source="name_de" dest="name_de_partial" />
</schema>
Then create the search configuration in solrconfig.xml:
<requestHandler name="magento_de" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<str name="tie">0.01</str> <!-- Tie breaker -->
<str name="qf">name_de_partial^1.0 name_de^3.0</str> <!-- Query Fields -->
<str name="pf">name_de_partial^1.0 name_de^3.0</str> <!-- Phrase Fields -->
<str name="mm">3&lt;90%</str> <!-- Minimum 'Should' Match [queries with 1..3 clauses must match all, else 90%] -->
<int name="ps">100</int> <!-- Phrase Slop -->
<str name="q.alt">*:*</str>
..
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
With this, Solr searches the field name_de_partial with boost 1.0 and name_de with boost 3.0.
So if the engine finds a query word in name_de, that document is put at the top of the list.
If it also finds something in name_de_partial, that counts as well and is included in the results.
And because name_de_partial uses the n-gram filters, it can find the word "hippie" for the query "hip", "ppie" or "ippi" without breaking a sweat.

If you set EdgeNGramFilterFactory or NGramFilterFactory both at index and query time, combined with q.op=AND (or default mm=100% if you are using dismax) you will experience some problems.
Try defining NGramFilterFactory only at index time:
<fieldType name="testedgengrams" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
</fieldType>
or try setting q.op=OR (or mm=1 if you are using dismax)
