Solr - WhiteSpaceTokenizerFactory works for index but not while querying - search

Consider the following schema,
<schema>
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" multiValued="false"/>
<fieldType name="stop_analyzer_string" class="solr.TextField" multiValued="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
</fieldType>
</types>
<fields>
<field name="name_search" type="stop_analyzer_string" indexed="true" stored="false"/>
<copyField source="name" dest="name_search"/>
<field name="name" type="string" indexed="true" stored="true"/>
</fields>
</schema>
The name field gets indexed with WhitespaceTokenizerFactory, but it doesn't seem to use the WhitespaceTokenizerFactory while querying with the name field.
For a doc with name as "solr search",
the query name_search:solr - matches the document. //index time WhiteSpace tokenizer works
the query name_search:search - matches the document. //index time WhiteSpace tokenizer works
But the query name_search:solr search - doesn't match the document. //query time WhiteSpace tokenizer doesn't work
But as specified in the schema, the query should also be tokenized with whitespace and matched with the document. no?

Not sure what you are missing, but all the above queries worked for me for the data that you mentioned.
http://localhost:8983/solr/collection1/select?q=name_search%3Asolr+search&wt=xml&indent=true
The above returned the document I indexed.
Just to test, do this:
Go to:
http://localhost:8983/solr/#/collection1/documents
and paste the document below as-is into the Document(s) box, then hit Submit Document:
{"id":"100001","name_search":"solr search"}
Run your query as:
http://localhost:8983/solr/collection1/select?q=name_search%3Asolr+search&wt=json&indent=true
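To see why the schema should match here, the two analyzer chains of stop_analyzer_string can be approximated in a few lines of Python. This is only a rough sketch, not Solr code: the helper names are made up, stopwords.txt is assumed to be empty, and it assumes both words actually reach the name_search query analyzer (for example via name_search:(solr search)).

```python
def edge_ngrams(token, min_gram, max_gram):
    """Front edge n-grams of one token (assumed analogue of EdgeNGramFilterFactory, side="front")."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(text, stopwords=frozenset()):
    """Index chain: whitespace tokenizer -> lowercase -> stop filter -> edge n-grams (3..50)."""
    terms = set()
    for tok in text.split():
        tok = tok.lower()
        if tok in stopwords:
            continue
        terms.update(edge_ngrams(tok, 3, 50))
    return terms


def query_terms(text, stopwords=frozenset()):
    """Query chain: whitespace tokenizer -> lowercase -> stop filter (no n-grams)."""
    lowered = (tok.lower() for tok in text.split())
    return [tok for tok in lowered if tok not in stopwords]


indexed = index_terms("solr search")
query = query_terms("solr search")
all_present = all(tok in indexed for tok in query)

print(sorted(indexed))
print(query, all_present)
```

Under these assumptions every query token is among the indexed terms, which is consistent with the document being returned.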

Related

Solr schema. exact accented match and accent insensitive match

I'm trying to figure out how to configure a fieldType in Solr's managed-schema to achieve the following:
(a) When searching for non-accented strings, the results will be accent insensitive.
(b) HOWEVER, when searching for accented strings, the results will ONLY be accent sensitive.
For example:
searchString -> expectedResult
Equipe -> Equipe, Equipé, Equípé, etc...
Equipé -> Equipé
Note: Wildcard (*) is irrelevant and chosen words are for the sake of demonstration purposes only.
My situation is a little uncommon due to some requirement restrictions, but with my schema (below) I have 3 fields: OName, OSearch, ONameSearch. (Note: OSearch and ONameSearch serve different purposes in the backend, so they need to be defined identically.)
The intention is for my Solr to query on OSearch and ONameSearch, and return the OName to the UI.
My original understanding was that OName would store the original value ("María") and index it as accent-insensitive ("maria"), so that when querying without solr.ASCIIFoldingFilterFactory the following would be achieved.
Example: {query} -> {OName = result}
q = OSearch:*equipe* OR ONameSearch:*equipe* -> OName = Equipe, Equipé, Equípé, etc
q = OSearch:*equipé* OR ONameSearch:*equipé* -> OName = Equipé
This is my schema so far...
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="text_en_splitting_tight" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<field name="OName" type="lowercase" indexed="true" stored="true" />
<field name="OSearch" type="text_en_splitting_tight" indexed="true" stored="false" multiValued="true" />
<field name="ONameSearch" type="text_en_splitting_tight" indexed="true" stored="false" multiValued="true" />
<copyField source="OName" dest="OSearch" />
<copyField source="OName" dest="ONameSearch" />
Please advise, thanks!
Most if not all relevant resources I've looked into
How to ignore accent search in Solr
How to ignore accents in SOLR search?
SOLR and accented characters
Solr accent removal
SOLR Makes Search with Accented Characters Easy
Solr Ref Guide 6.6 Defining Fields
Solr Ref Guide 6.6 Copying Fields
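As a rough illustration of what the schema above does with accents, the sketch below folds accents at index time only, as in text_en_splitting_tight. It is an approximation under stated assumptions: unicodedata-based folding stands in for ASCIIFoldingFilterFactory, the other filters (synonyms, stop words, stemming, word delimiter) are ignored, whole-term matching replaces the wildcard queries, and all function names are hypothetical.

```python
import unicodedata


def fold(text):
    """Strip combining marks after NFKD decomposition (rough stand-in for ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_term(value):
    # Index analyzer: lowercase, then fold accents.
    return fold(value.lower())


def query_term(value):
    # Query analyzer: lowercase only, no folding.
    return value.lower()


names = ["Equipe", "Equipé", "Equípé"]
index = {name: index_term(name) for name in names}


def search(q):
    """Return the names whose single indexed term equals the analyzed query term."""
    wanted = query_term(q)
    return [name for name, term in index.items() if term == wanted]


plain_hits = search("Equipe")
accented_hits = search("Equipé")
print(plain_hits)
print(accented_hits)
```

In this simplified model the unaccented query finds all three names, while the accented query finds nothing, because only the folded form is indexed. That suggests requirement (b) would need the unfolded term to be indexed somewhere as well.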

Solr - termfreq partial matches

I'm using Solr to query a set of documents and I want to get the number of matches for certain term, right now I'm using
termfreq(text,'manage')
However this does not hit on Manager or Management
termfreq(text,'manage*')
returns the same count. I've tried using different tokenizers, some won't even accept the * and I haven't found one that returns the correct number of matches.
Field:
<field name="text" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" required="false"/>
Is there a way I can get termfreq to also count partial matches?
You will need to add some custom tokenizers and filter classes to the analyzer.
In your /shared/field_types.xml file, create a new type like this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
And in /shared/fields.xml:
<field name="text" stored="true" type="text" multiValued="false" indexed="true"/>
<dynamicField name="*_text" stored="true" type="text" multiValued="false" indexed="true"/>
Then use "text" as the type of the field.
A more advanced solution:
<fieldType name="startsWith" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!-- remove words/chars we don't care about -->
<filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement="" replace="all"/>
<!-- now remove any extra space we have, since spaces WILL influence matching -->
<filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement="" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
In /shared/fields.xml:
<dynamicField name="*_starts_with" stored="true" type="startsWith" multiValued="false" indexed="true"/>
Then, in the top level of your core's schema.xml add this:
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="../../../shared/fields.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="../../../shared/field_types.xml"/>
And add this to your copyFields in the core's schema.xml:
<copyFields>
<copyField source="yourField" dest="yourField_text"/>
<copyField source="yourField" dest="yourField_starts_with"/>
...
</copyFields>
I had the same problem: I needed termfreq to also match on subparts of words.
Adding this fieldType solved it:
<fieldType name="startWith" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
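The effect of indexing edge n-grams on term counts can be sketched in Python. This is a simplified model rather than Solr itself: the regex \w+ is only a rough stand-in for StandardTokenizerFactory, the sample sentence is invented, and the function names are hypothetical.

```python
import re
from collections import Counter


def edge_ngrams(token, min_gram=2, max_gram=15):
    """Front edge n-grams, mirroring minGramSize="2" maxGramSize="15"."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def indexed_term_counts(text):
    """Lowercase, split into word tokens, and count every edge n-gram as an indexed term."""
    counts = Counter()
    for tok in re.findall(r"\w+", text.lower()):
        counts.update(edge_ngrams(tok))
    return counts


counts = indexed_term_counts("Manage the Manager and the Management team")
print(counts["manage"])
print(counts["manager"], counts["management"])
```

Because "manage" is a prefix of all three words, it is indexed once per word, so a term-frequency lookup for "manage" would count Manager and Management too. Only prefixes are covered; a substring in the middle of a word would not be counted with this approach.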

How to store Pincode field in SOLR and how can I retrieve data according to the pincode?

I have a Pincode field like "389151" and I want to store this in below format.
pincode_analyzed: [
"389151",
"38915",
"3891",
"389"
]
You can copy the Pincode field to the pincode_analyzed field defined with a fieldType similar to this:
<fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="6" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<field name="pincode_analyzed" type="text_general_edge_ngram" indexed="true" stored="true" multiValued="true" />
<copyField source="Pincode" dest="pincode_analyzed"/>
You can read more about tokenizers here: https://cwiki.apache.org/confluence/display/solr/Tokenizers
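A quick way to check which terms minGramSize="3" and maxGramSize="6" would produce for the example pincode is to compute the front edge n-grams by hand. The helper below is a hypothetical Python sketch of that calculation, not Solr code.

```python
def edge_ngrams(token, min_gram=3, max_gram=6):
    """Front edge n-grams of a token between min_gram and max_gram characters long."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


grams = edge_ngrams("389151")
print(grams)  # ['389', '3891', '38915', '389151']
```

These are the indexed terms, so queries for 389, 3891, 38915 or 389151 can match. As far as I know, a stored field still returns the original value rather than this list.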

Solr match entire field

I want to create a field that will only match if the document's value for that field matches the query term with no additions. For instance, a query for "john" should only return results where the name is "john", not "johnson", "johns", etc.
I've seen other posts about exact matching in solr, and the prevailing answer seems to be to create a new field in schema.xml with type string. I've tried it, but that approach seems to also match when the exact query is contained within a field (results containing "johnson" still appear with the query "john").
The schema has fields lastName and lastName_ngram (which we're currently searching with):
<field name="lastName_ngram" type="text_token_ngram" indexed="true" stored="false" omitNorms="true" omitTermFreqAndPositions="true"/>
<fieldType name="text_token_ngram" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
</fieldType>
<field name="lastName" type="text_token" indexed="true" stored="true" omitNorms="true" omitTermFreqAndPositions="true"/>
<fieldType name="text_token" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
</fieldType>
And I'd like to include a field lastNameExact so that documents that exactly match the entire field can be boosted:
<field name="lastNameExact" type="string" indexed="true" stored="false" omitNorms="true" omitTermFreqAndPositions="true"/>
<copyField source="lastName" dest="lastNameExact"/>
Is there a modification I can make to this so that the lastNameExact field will only hit on documents containing a field with the entirety of the search query?
I can propose a fix for that: do not use type string for lastNameExact; use an exact_match field type instead.
<fieldType name="exact_match" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
</fieldType>
The copyField should remain the same.
Link for working schema.xml - https://github.com/MysterionRise/information-retrieval-adventure/blob/dadb683820fe4f1eaf6081185a933a28a5e1e481/lucene5/src/main/resources/solr/cores/test/conf/schema.xml
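The behavior of the exact_match type can be approximated as follows. This is a minimal sketch assuming the keyword tokenizer keeps the whole value as one token, which is then lowercased and trimmed; the sample names and the function name are made up for illustration.

```python
def exact_match_term(value):
    """Whole value as a single token, lowercased and trimmed."""
    return value.lower().strip()


docs = ["John", "Johnson", "johns", " JOHN "]
query = exact_match_term("john")

# A document matches only if its single indexed term equals the query term.
hits = [doc for doc in docs if exact_match_term(doc) == query]
print(hits)  # ['John', ' JOHN ']
```

Since each value yields exactly one term, "johnson" and "johns" cannot match a query for "john", while case and surrounding whitespace are ignored.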

Weird results with Solr 1.4 and EdgeNGrams - some substrings match, some don't

EDIT 3: The workaround I'm using right now is to strip anything but letters, digits, and whitespace from both my queries and my indexed fields. This produces the desired behavior, but it's very much a workaround rather than a true solution, and I would still like to understand why Solr is doing what it's doing...so still interested in an answer, if anyone has one. END EDIT 3
I have a document named "TT-14B" indexed by Solr 1.4 (via Django/Haystack). When I query the content_auto field for "tt-1" or "tt 14" or "tt 14b" I get the document back; when I query "tt-14" or "tt-14b" I get no results. I edited the Haystack-generated Solr schema a bit to try to fix this, to no avail. Using analyze.jsp, it seems to me that I should be getting a match for "tt-14"; I should certainly be getting one for "tt-14b". (Edit: Oh, and changing the default operator from AND to OR doesn't help.)
Can someone help me understand why this isn't working? Thanks.
...
results
QUERY | WORKS
=======|======
tt | yes
tt- | yes
tt-1 | yes
tt-14 | no
tt-14b | no
tt 14 | yes
tt 14b | yes
EDIT 2
Got some more comparably weird results, might help debug the problem. In this case the test document was "abc'def".
QUERY | WORKS
========|======
abc | yes
abc'd | yes
abc'de | no
abc'def | no
Same pattern, obviously, but I don't understand what's causing it.
END EDIT 2
schema.xml relevant part (full file below)
<fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
</analyzer>
</fieldType>
schema.xml (full)
<?xml version="1.0" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<schema name="default" version="1.1">
<types>
<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
<!-- Numeric field types that manipulate the value into
a string value that isn't human-readable in its internal form,
but with a lexicographic ordering the same as the numeric ordering,
so that range queries work correctly. -->
<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="ngram" class="solr.TextField" >
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
</analyzer>
</fieldType>
</types>
<fields>
<!-- general -->
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="django_ct" type="string" indexed="true" stored="true" multiValued="false" />
<field name="django_id" type="string" indexed="true" stored="true" multiValued="false" />
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_l" type="slong" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="sfloat" indexed="true" stored="true"/>
<dynamicField name="*_d" type="sdouble" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
<field name="modelname_exact" type="string" indexed="true" stored="true" multiValued="false" />
<field name="modelname" type="text" indexed="true" stored="true" multiValued="false" />
<field name="name" type="text" indexed="true" stored="true" multiValued="false" />
<field name="text" type="text" indexed="true" stored="true" multiValued="false" />
<field name="name_exact" type="string" indexed="true" stored="true" multiValued="false" />
<field name="content_auto" type="edge_ngram" indexed="true" stored="true" multiValued="true" />
</fields>
<!-- field to use to determine and enforce document uniqueness. -->
<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>text</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="AND" />
</schema>
A screenshot of /admin/analysis.jsp for each case would be interesting.
Is there a reason why positionIncrementGap is set to 1?
tt-14b and tt 14b are handled differently because of the whitespace tokenizer.
That means tt-14b is a single term until the WordDelimiterFilterFactory fires, while tt 14b is two terms from the beginning.
The positionIncrementGap lets you treat different terms as one phrase even if they are not neighbors but sit within the next "n" positions. So try raising the positionIncrementGap.
By the way: the first thing I noticed in your schema.xml is the missing EdgeNGramFilterFactory at query time. That should be okay, but there are also good reasons why using the same filters at query time and index time is considered a best practice.
It depends on the particular situation, but enabling this filter at query time would be worth a try.
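The tokenizer difference described above can be seen with a plain whitespace split. The sketch covers only the tokenizer and lowercase steps; it does not model WordDelimiterFilterFactory or the n-gram filter, and the function name is hypothetical.

```python
def whitespace_tokens(text):
    """Lowercase and split on whitespace only; hyphens stay inside the token."""
    return text.lower().split()


hyphenated = whitespace_tokens("tt-14b")
spaced = whitespace_tokens("tt 14b")
print(hyphenated)  # ['tt-14b']
print(spaced)      # ['tt', '14b']
```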
A little late to the show on this one -- but, as noted above, the WhitespaceTokenizerFactory breaks words with hyphens if it's passed through the StandardAnalyzer. I found this out the hard way too...:
Solr NGramTokenizerFactory and PatternReplaceCharFilterFactory - Analyzer results inconsistent with Query Results
The solution is probably to use the KeywordAnalyzer -- it shouldn't split anything.
I came up with a similar work-around to your "EDIT 3" on the link above (in PHP).
The particularly frustrating thing about the Solr analyzer is that it shows everything is fine and behaving as expected -- which really confused me.
Good luck!
