How to search Chinese characters with Solr?

I'm working on a Drupal site that uses Solr as its search engine. It finds some Simplified Chinese words and characters but not others, such as the following:
美国:为美朝峰会同朝鲜进行的磋商取得进展
It isn't found when searching by simple characters.
So I went through both of these:
https://lucene.apache.org/solr/guide/7_4/language-analysis.html
http://www.opencms-wiki.org/wiki/Solr_-_configuration_for_Chinese_and_correct_results_for_german_umlauts
and in my Solr config file I have the following:
<fieldType name="text_chinese" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
<analyzer>
<tokenizer class="solr.HMMChineseTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.StopFilterFactory"
words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
It gives this error:
local:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Could not load conf for core local: Plugin init failure for
[schema.xml] fieldType "text_chinese": Cannot load analyzer:
org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer. Schema file
is /var/solr/cores/local/conf/schema.xml
and it still returns no results.
I'm not sure whether I'm missing something in the config.

The error message is telling you that Solr isn't able to find the implementing class of the analyzer you have defined - Cannot load analyzer: org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer.
The SmartCN analyzer isn't loaded by default, but it's included in the binary build under contrib/analysis-extras/lucene-libs/lucene-analyzers-smartcn-<version number>.jar.
Add the directory to the list of directories that Solr can load libraries from in solrconfig.xml:
<lib dir="../../../contrib/analysis-extras/lucene-libs" regex=".*smartcn.*\.jar" />
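Once the jar loads, the field type itself may still need attention. The definition in the question declares two <analyzer> elements, whereas a field type is normally given a single one (or an index/query pair). Below is a minimal sketch that keeps only the tokenizer chain, which mirrors the SmartCN example in the Solr reference guide. The tokenizer and filters are copied from the question; whether Porter stemming and the bundled stop-word list suit your content is an assumption to verify:

```xml
<!-- Sketch only: assumes the smartcn jar is loadable via the <lib> directive above -->
<fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Segments Chinese text into words using a hidden Markov model -->
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <!-- Normalizes full-width ASCII and half-width Katakana variants -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Stop-word list bundled inside the smartcn jar -->
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <!-- These two only affect Latin-script tokens mixed into the text -->
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Existing documents need to be reindexed after the change before the new analysis takes effect.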

Related

Solr - Exact Match on solr.TextField

Is there a practicable way to do an exact match search on a stemmed fulltext field?
I have a scenario in which I need a field to be indexed and searchable regardless of case or whitespace.
Even with KeywordTokenizerFactory on both index and query, all my exact-match searches stopped working.
Is there a way to search for an exact match, as with a string field, while still applying custom tokenizers to that field?
Below is the schema I am currently using:
<field name="subtipoimovel" type="buscalimpaquery" indexed="true" stored="true" />
<fieldType name="buscalimpaquery" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement="-"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
regards,
Silvio Giuliani
The problem is that at index time you are using KeywordTokenizerFactory, ASCIIFoldingFilterFactory, LowerCaseFilterFactory and PatternReplaceFilterFactory, but at query time you are using only KeywordTokenizerFactory. That will not work well for exact matches.
You need to see these as pipelined processors. You need to have "similar" processing during query time too.
As Srikanth notes in a comment, you should consider splitting up the different kinds of term analysis in two separate fields. See also my answer to a functionally similar question: Solr: combining EdgeNGramFilterFactory and NGramFilterFactory.
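A minimal sketch of what "similar" processing could look like for the field type in the question, assuming the same four steps are wanted on both sides. A single untyped <analyzer> is applied at index time and at query time, so the two pipelines cannot drift apart:

```xml
<fieldType name="buscalimpaquery" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: this one analyzer is used for both indexing and querying -->
  <analyzer>
    <!-- Keeps the whole field value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Folds accented characters to their ASCII equivalents -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Replaces every space with a hyphen -->
    <filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement="-"/>
  </analyzer>
</fieldType>
```

With this in place, a value such as "Casa de Praia" (a made-up example) should be reduced to casa-de-praia on both sides. Depending on the query parser, the value may still need to be quoted, or have its spaces escaped, so that it reaches the analyzer as one string.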
Apparently the problem was this tokenizer:
"solr.KeywordTokenizerFactory"
I changed it to StandardTokenizerFactory and now exact matches work.
I read the description of KeywordTokenizerFactory on the Solr wiki, and it seems to me that for exact matching I should use it rather than StandardTokenizerFactory.
Does anyone know why this happens?

Search Solr: Match exact result Koh S*

I have a schema as below
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
And my document will be:
Koh Samui
Koh Chang
Koh Lanta
When I search for Koh* it returns 3 results, which is fine. But when I search for koh S* it returns zero results. I want the result to be only Koh Samui.
I assume you want to use a wildcard query. How this behaves depends on the query parser you are using: DisMax does not support wildcard queries, whereas the standard (Lucene) query parser does, and eDisMax accepts most of the Lucene syntax as well.
To be honest, if you are using Solr mainly for wildcard queries you might be using the wrong tool; any SQL database would do, with text LIKE 'Koh S%' in the WHERE clause giving you what you want.
But anyway, to use a wildcard query in Solr, use the Lucene query parser directly. Keep in mind that this parser splits the query on whitespace, so the space probably has to be escaped (Koh\ S*) for the whole string to be treated as a single prefix. The basic form of the query is:
http://localhost:8983/solr/core/select?q={!lucene df=text}Koh S*

Solr for Arabic

I'm using Solr to index documents in 3 languages (Arabic, French and English), and I have used this fieldType:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Everything was fine, except for Arabic: when I search for a word like حقل, Solr doesn't find it, but when I enter the word reversed (لقح), as if written from left to right, Solr finds it and returns results.
How can I get results for Arabic words?
I'm going to turn Daniel's clever analysis here to an answer for the record. Don't vote for this, just go find something of his to vote for :-)
There are two ways to get a directionality mismatch with RTL text. You can be indexing it backwards, or you can be querying it backwards. A simple HTML form querying Solr will never mess up directionality. In this case, khaled was extracting text from a PDF using a library that falls victim to the tendency of PDFs to contain 'visual-order' text rather than 'logical-order' text. So the index was full of backwards Arabic. To fix this, he will have to come up with a working library that extracts text from PDFs.
Forcing Apache Tika to use the latest Apache PDFBox might help, or his PDF may be so quirky that even the latest PDFBox can't handle it, in which case he has a hard problem.

Solr Partial And Full String Match

I am trying to allow searches on partial strings in Solr so if someone searched for "ppopota" they'd get the same result as if they searched for "hippopotamus." I read the documentation up and down and feel like I have exhausted my options. So far I have the following:
Defining a new field type:
<fieldtype name="testedgengrams" class="solr.TextField">
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
</fieldtype>
Defining a field of type "testedgengrams":
<field name="text_ngrams" type="testedgengrams" indexed="true" stored="false"/>
Copying contents of text_ngrams into text:
<copyField source="text_ngrams" dest="text"/>
Alas, that doesn't work. What am I missing?
You're using EdgeNGramFilterFactory which generates tokens 'hi', 'hip', 'hipp', etc, so it won't match 'ppopota'. Use NGramFilterFactory instead.
To enable partial word searching
you must edit your local schema.xml file, usually under solr/config, to add either:
NGramFilterFactory
EdgeNGramFilterFactory
Here's what mine looks like: sample solr schema.xml
Here's the line to paste:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
EdgeNGram
I went with the EdgeN option. It doesn't allow for searching in the middle of words, but it does allow partial word search starting from the beginning of the word. This cuts way down on false positives / matches you don't want, performs better, and is usually not missed by the users. Also, I like the minGramSize=2 so you must enter a minimum of 2 characters. Some folks set this to 3.
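For context, here is a sketch of where such a line might sit inside a field type. The type name text_prefix is hypothetical and the tokenizer choice is an assumption. The side attribute is left out because front is the default and, as far as I know, newer Solr releases no longer accept it:

```xml
<!-- Hypothetical field type; adapt the name and tokenizer to your own schema.xml -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "hippopotamus" is indexed as hi, hip, hipp, ... hippopotamus -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-gram filter here: the typed prefix is matched as-is against the indexed grams -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Applying the n-gram filter only at index time keeps a query like "hip" from being split into "hi" and "hip", which would widen the match set.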
Once your local setup is working, you must edit the schema.xml used by websolr; otherwise you will get the default behavior, which requires the full word to be entered even if you have full-text searching configured for your models.
Take it to the next level
5 ways to speed up indexing
Special instructions for editing the websolr schema.xml if you are using Heroku
Go to the Heroku online dashboard for your app
Go to the resources tab, then click on the Websolr add-on
Click the default link under Indexes
Click on the Advanced Configuration link
Paste in your schema.xml from your local, including the config for your Ngram tokenizer of choice (mentioned above). Save.
Copy the link in the "Configure your Heroku application" box, then paste it into terminal to set your WEBSOLR_URL link in your heroku config.
Click the Index Status link to get nifty stats and see if you are running fast or slow.
Reindex everything
heroku run rake sunspot:reindex[5000]
Don't use heroku run rake sunspot:solr:reindex - it is deprecated, accepts no parameters and is WAY slower
Default batch size is 50, most people suggest using 1000, but I've seen significantly faster results (1000 rows per second as opposed to around 500 rps) by bumping it up to 5000+
Ok I'm doing the same thing with field name
name_de
And I managed to get this thing to work using copyField like this:
schema.xml
<schema name="solr-magento" version="1.2">
<types>
...
<fieldType name="type_name_de_partial" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="1000" side="front" />
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="1000" side="back" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords_de.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords_de.txt"/>
</analyzer>
</fieldType>
</types>
...
<fields>
...
<field name="name_de_partial" type="type_name_de_partial" indexed="true" stored="true"/>
</fields>
....
<copyField source="name_de" dest="name_de_partial" />
</schema>
Then create search condition in solrconfig.xml
<requestHandler name="magento_de" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<str name="tie">0.01</str> <!-- Tie breaker -->
<str name="qf">name_de_partial^1.0 name_de^3.0</str> <!-- Query Fields -->
<str name="pf">name_de_partial^1.0 name_de^3.0</str> <!-- Phrase Fields -->
<str name="mm">3&lt;90%</str> <!-- Minimum 'Should' Match [with 1..3 clauses all must match, otherwise 90%] -->
<int name="ps">100</int> <!-- Phrase Slop -->
<str name="q.alt">*:*</str>
..
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
With this, Solr searches name_de_partial with a boost of 1.0 and name_de with a boost of 3.0.
So if the engine finds a query word in name_de, that document is put at the top of the list.
If it also finds something in name_de_partial, that counts too and is included in the results.
And because name_de_partial uses the n-gram filters, it can find the word "hippie" from the query "hip", "ppie" or "ippi" without breaking a sweat.
If you set EdgeNGramFilterFactory or NGramFilterFactory both at index and query time, combined with q.op=AND (or default mm=100% if you are using dismax) you will experience some problems.
Try defining NGramFilterFactory only at index time:
<fieldType name="testedgengrams" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
</fieldType>
or try setting q.op=OR (or mm=1 if you are using dismax)
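As a sketch of the second option, the mm default could be set on the request handler. The handler name /partial is an illustrative assumption, and qf simply reuses the text_ngrams field from the question:

```xml
<!-- Hypothetical handler; only the mm line is the point of this example -->
<requestHandler name="/partial" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">text_ngrams</str>
    <!-- A document qualifies as soon as one query clause matches -->
    <str name="mm">1</str>
  </lst>
</requestHandler>
```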

ShingleFilter search with more terms than indexed phrase fails

I am using Solr 1.4.1 (lucene 2.9.3) on windows and am trying to understand ShingleFilter. I wrote the following code and find that if I provide more words than the actual phrase indexed in the field, then the search on that field fails i.e. no score contributed from that field with debugQuery=true.
Here is an example I created to reproduce, with field names and the document indexed:
Id: 1
title_1: Nina Simone
title_2: I put a spell on you
Issue the following Queries (dismax):
- “Nina Simone I put” <- Fails to have a score from title_1 search (using debugQuery)
- “Nina Simone” <- SUCCESS
Trying to analyze the above disparity, when I used Solr’s Field Analysis with the ‘shingle’ field (given below) and tried “Nina Simone I put”, it succeeds. So it’s only during the query that no score is provided. I also checked ‘parsedquery’ and it shows disjunctionMaxQuery issuing the string “Nina_Simone Simone_I I_put” to the title_1 field.
title_1 and title_2 fields are of type ‘shingle’, defined as:
<fieldType name="shingle" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
</analyzer>
</fieldType>
Note that I also have a catchall field which is text. I have qf set to: 'id^2 catchall^0.8' and pf set to: 'title_1^1.5 title_2^1.2'
Is there something that I am missing or doing something wrong?
In a dismax query, the score of the query is the max of the subqueries, not the sum. I don't really know much about how it parses shingle queries, but if it does something like "(title1:(shingle1 shingle2...)) (title2:(shingle1 shingle2...))" then you should expect to see only one field contribute to the score.
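How much the non-winning fields contribute is governed by the tie parameter. Here is a rough sketch using the qf and pf values from the question; the handler name and the tie value of 0.1 are assumptions chosen for illustration:

```xml
<!-- Hypothetical handler name; qf and pf are taken from the question -->
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">id^2 catchall^0.8</str>
    <str name="pf">title_1^1.5 title_2^1.2</str>
    <!-- tie=0.0: only the best-scoring field counts (pure max)
         tie=1.0: the scores of all matching fields are summed
         values in between add a fraction of the lower scores to the max -->
    <str name="tie">0.1</str>
  </lst>
</requestHandler>
```

With debugQuery=true, a non-zero tie should make the lower-scoring field visible in the explanation instead of dropping out entirely, provided that field matches at all.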

Resources