Solr 4.1: java.lang.IllegalArgumentException: first position increment must be > 0 (got 0)

Hi all!
After I updated Solr to version 4.1, I get the following error when rebuilding the index:
Warning: Error creating document : SolrInputDocument[dop_pos_state=, dop_country=, dop_first_name_onlySort=Rick, dop_first_name=Rick, dop_sync_flag=true, dop_orgid=1522402, dop_last_name=King, dop_last_name_onlySort=King, dop_invite_flag=true, dop_name=Rick King, dop_metro_area=, dop_create_date=2012-12-15 08:53:55.0, dop_address=Greater Boston Area, dop_job_level=1, dop_id=343218, dop_title=at A & J Engineering Inc., dop_update_date=2013-02-19 09:38:38.0, dop_metromap_id=210, dop_facebook_linked=0, dop_linkedin_linked=0, dop_crunchBase_linked=0, dop_twitter_linked=0]
java.lang.IllegalArgumentException: first position increment must be > 0 (got 0)
at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:125)
at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:306)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
When I remove the field dop_title (whose value is "at A & J Engineering Inc.") from schema.xml, indexing works fine. The analyzer for dop_title is below:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
I also ran the value through Solr's analysis page (the screenshot of the result is not reproduced here).
Why does this happen, and how can I avoid it? Thanks for your help!

There is an ongoing migration from stream-based to graph-based processing of tokens. This has uncovered some strange edge-cases in Solr 4.1. It looks like yours is one of those (a regression). You can open an issue if you want and somebody will look at it.
In the meantime, you may find it useful to know that if you tick the little "Verbose Output" button on the right side of the analysis page, it shows a lot more information about each step in the pipeline, including position values. That could help you debug this issue faster and/or avoid it.

Related

Solr: Searching with/without spaces in keywords

I am experiencing an issue when spaces are introduced to keywords, for example:
We have a product with the title "Sony Playstation 4 Camera V2 PS4 (PSVR)".
Searching for "playstation" or "playstation camera" brings back this product.
Searching for "play station" or "play station camera" does not bring back this product (notice the space).
Here is the fieldType being used:
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
How can I fix this so that both "playstation" and "play station" match? PlayStation is only my example; the same can happen with any search term, e.g. "cyberpunk" vs. "cyber punk". Solutions that require a lot of manual work, such as adding a synonym for play station => playstation, are therefore not feasible.
Things I have tried, but not managed to make work:
N-GRAM filter and tokenizer
Fuzzy search
Removing whitespace
Escaping whitespace
You can use a Shingle Filter to combine multiple tokens into one.
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory"/>
</analyzer>
If you assume that the terms are spelled correctly when being indexed, you can apply this only when querying. It'll concatenate the tokens for you, effectively giving you multiple "merged" tokens:
play station camera => play, station, camera, playstation, stationcamera
This assumes maxShingleSize=2. If you increase the max size to 3, you also get playstationcamera as a single token (in this case). That might be necessary if people are likely to split a word in more than one place.
If you assume that your terms are indexed correctly, and this is only necessary on query time, your index won't change and you won't have to reindex (and the size won't change).
You might have to move the filter to a different position in the chain; your stemming filter can break this in surprising ways, since you'll end up concatenating terms that have already been stemmed.
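For reference, the shingle settings discussed above can be written out explicitly. This is a sketch rather than a tested configuration: the attribute names are standard ShingleFilterFactory parameters, but the values are assumptions chosen to match the example output. Note that the default tokenSeparator is a single space, so joined tokens such as playstation are only produced when the separator is set to an empty string.

```xml
<!-- Sketch only: replaces the bare ShingleFilterFactory line in the query analyzer. -->
<!-- minShingleSize, maxShingleSize and outputUnigrams are shown at their default values; -->
<!-- tokenSeparator="" is the assumed change that yields "playstation" instead of "play station". -->
<filter class="solr.ShingleFilterFactory"
        minShingleSize="2"
        maxShingleSize="2"
        outputUnigrams="true"
        tokenSeparator=""/>
```

Running "play station camera" through the analysis page in the Solr admin UI is a quick way to check which tokens this actually produces in your filter chain.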

How to handle Arabic characters on Solr

I'm trying to make my site ignore the difference between some Arabic characters during search, e.g. "ه" and "ة". When a user searches for a word ending in "ة", such as "مدينة", only words ending in "ة" are returned; it should also return words ending in "ه", such as "مدينه".
These characters are treated as the same in Arabic, so the search results shouldn't differ, but my site produces different results.
Here is what I tried in my schema_extra_types.xml:
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="accents_ar.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ar.txt"/>
<filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" protected="protwords_ar.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Arabic" protected="protwords_ar.txt"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="25" />
<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="accents_ar.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_ar.txt" expand="true" ignoreCase="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ar.txt"/>
<filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="0" generateNumberParts="1" protected="protwords_ar.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="0"/>
<filter class="solr.LengthFilterFactory" min="2" max="100"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Arabic" protected="protwords_ar.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
However, accents_ar.txt is empty when I download the config folder from the Drupal admin interface. Where can I find an example accents_ar.txt to use on my site? Or is there another filter class that handles this kind of issue?
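One possible approach, offered as a hedged sketch since no configuration was confirmed for this question: Solr ships an ArabicNormalizationFilterFactory whose normalizer, among other things, folds teh marbuta (ة) to heh (ه). Placing it in both analyzers before the stemmer should make "مدينة" and "مدينه" produce the same token. Alternatively, a mapping line in accents_ar.txt can do the same at the character level; the mapping shown in the comment is an assumption about what the file could contain, not content taken from Drupal.

```xml
<!-- Option 1 (assumed placement): add to BOTH the index and query analyzers, -->
<!-- before SnowballPorterFilterFactory, so the stemmer sees normalized text. -->
<filter class="solr.ArabicNormalizationFilterFactory"/>

<!-- Option 2: keep the existing charFilter and add this line to accents_ar.txt -->
<!-- (MappingCharFilterFactory format is "source" => "target"): -->
<!--   "ة" => "ه" -->
<charFilter class="solr.MappingCharFilterFactory" mapping="accents_ar.txt"/>
```

Either way, the content has to be reindexed after the change, because the index-time analysis is affected.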

Solr: One word query does not match three word indexed value

One of my documents has a title attribute with the value Poésie pour pouvoir. When I query q=title:poesie, no results are found. q=title:poesie pour finds the document, though.
title is of type text. Excerpt from my schema.xml:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
The second query isn't searching the title field only - it's also searching the default search field. The query is parsed as "title:poesie default_field:pour". The second part is what's generating the hit.
You can use the debugQuery parameter to see how your query is being parsed. Use the analysis page under the Solr admin page to see why the title value doesn't match (input "Poésie pour pouvoir" under "indexed" value and "poesie" under query value).
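To illustrate where the default field in that parse comes from: in many setups it is the df parameter in the request handler defaults in solrconfig.xml (older schemas used a defaultSearchField element in schema.xml instead). The handler name and the field name text below are assumptions, not taken from the question. To keep both words on title without touching the configuration, the query can group them as title:(poesie pour).

```xml
<!-- solrconfig.xml sketch: "text" is an assumed default field name. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Unqualified terms, such as "pour" in q=title:poesie pour, are searched in this field. -->
    <str name="df">text</str>
  </lst>
</requestHandler>
```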

How to search abbreviation word "ITS" for "Information Technology Service" in Solr

In my dataset, the word "ITS" means "Information Technology Service". However, when I search for "ITS" in Solr, I get results like "it", "it's" and "its" (adjective). No results are related to "Information Technology Service". How can I change Solr for this purpose?
My schema for the field is listed below. I actually use two fields, one with stemming and the other without, but it still does not work.
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<!-- for no stemming -->
<fieldType name="text_no_stemming" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
You are not telling Solr that ITS is a synonym for "Information Technology Service". You need to do that first; see the SynonymFilter (solr.SynonymFilterFactory).
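A minimal sketch of what that could look like, under a few assumptions: the synonym filter is placed directly after the tokenizer so that it sees the original uppercase ITS before the stop filter and the lowercase filter run, and ignoreCase is set to false so that the ordinary word "its" is left alone. The file name synonyms.txt and the mapping shown in the comment are illustrative.

```xml
<fieldType name="text_no_stemming" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Assumed synonyms.txt entry (case-sensitive because ignoreCase="false"): -->
    <!--   ITS => Information Technology Service -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="false" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Because this single analyzer is used at both index and query time, the documents need to be reindexed after the change.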

Solr SnowballPorterFilterFactory for index and query analyzers

I use SnowballPorterFilterFactory for index and query analyzers.
When I search for the word "profession", Solr only finds articles that contain "profession", but I also want matches for "professional", "professionalism", and so on.
This is the current configuration in schema.xml:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
</analyzer>
</fieldType>
What is happening is that the Porter stemmer is over-stemming your query. When you search for profession, your keyword gets stemmed down to profess, whereas profession, professional and professionalism are all stored in the index as profession.
The only real way you are going to get around this is by adding another fieldType where you do not stem your query.
Something like:
<fieldType name="text_unstemmed_query" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
</analyzer>
</fieldType>
With a copyField like:
<copyField source="your_text_field" dest="text_unstem_query_field"/>
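For the copyField to work, the destination also has to be declared as a field of the new type. A sketch, assuming the placeholder names from the answer and that the copy only needs to be searchable, not stored:

```xml
<!-- Destination field using the text_unstemmed_query fieldType; the field name is a placeholder. -->
<!-- stored="false" is an assumption: the copy is only searched, never returned. -->
<field name="text_unstem_query_field" type="text_unstemmed_query"
       indexed="true" stored="false"/>
```

Queries would then target text_unstem_query_field instead of the original field, and the index has to be rebuilt so the copied values exist.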
