Solr match entire field - search

I want to create a field that will only match if the document's value for that field matches the query term with no additions. For instance, a query for "john" should only return results where the name is "john", not "johnson", "johns", etc.
I've seen other posts about exact matching in solr, and the prevailing answer seems to be to create a new field in schema.xml with type string. I've tried it, but that approach seems to also match when the exact query is contained within a field (results containing "johnson" still appear with the query "john").
The schema has fields lastName and lastName_ngram (which we're currently searching with):
<field name="lastName_ngram" type="text_token_ngram" indexed="true" stored="false" omitNorms="true" omitTermFreqAndPositions="true"/>
<fieldType name="text_token_ngram" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
</fieldType>
<field name="lastName" type="text_token" indexed="true" stored="true" omitNorms="true" omitTermFreqAndPositions="true"/>
<fieldType name="text_token" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
</fieldType>
And I'd like to include a field lastNameExact so that documents that exactly match the entire field can be boosted:
<field name="lastNameExact" type="string" indexed="true" stored="false" omitNorms="true" omitTermFreqAndPositions="true"/>
<copyField source="lastName" dest="lastNameExact"/>
Is there a modification I can make to this so that the lastNameExact field will only hit on documents containing a field with the entirety of the search query?

Here is one way to fix that: do not use type string for lastNameExact; define an exact_match field type and use it instead.
<fieldType name="exact_match" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
</fieldType>
The copyField definition can remain the same.
Link for working schema.xml - https://github.com/MysterionRise/information-retrieval-adventure/blob/dadb683820fe4f1eaf6081185a933a28a5e1e481/lucene5/src/main/resources/solr/cores/test/conf/schema.xml
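To make the effect of the exact_match analyzer chain concrete, here is a minimal Python sketch that imitates KeywordTokenizerFactory, LowerCaseFilterFactory and TrimFilterFactory. It is only an illustration, not Solr code; the function names keyword_analyze and exact_match are hypothetical.

```python
def keyword_analyze(value):
    """Imitate KeywordTokenizer -> LowerCaseFilter -> TrimFilter.

    The keyword tokenizer keeps the whole value as a single token,
    so the result always contains exactly one term.
    """
    return [value.lower().strip()]


def exact_match(indexed_value, query):
    # A term matches only if the analyzed query equals an indexed term.
    return keyword_analyze(query)[0] in keyword_analyze(indexed_value)


if __name__ == "__main__":
    for name in ["john", " John ", "johnson", "johns"]:
        print(repr(name), exact_match(name, "john"))
```

Because the whole field value is one term, "johnson" and "johns" produce terms that differ from "john" and therefore do not match in this sketch.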

Related

Solr stopwords magic

My stopwords don't work as expected.
Here is part of my schema:
<fieldType name="text_general" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType class="solr.TextField" name="text_auto">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" outputUnigramsIfNoShingles="false"/>
</analyzer>
<analyzer type="query">
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
</analyzer>
</fieldType>
<field name="deal_title_terms" type="text_auto" indexed="true" stored="false" required="false" multiValued="true"/>
<field name="deal_description" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>
In stopwords.txt I have the following words: the, is, a
I also have the following data in my fields:
deal_description - This is the my description
deal_title_terms - This is the deal title a terms (will be split into terms)
When I search deal_description:
Example 1: "deal_description: his is the m" - I expect the document with deal_description "This is the my description" to be returned.
Example 2: "deal_description: is th" - I expect nothing to be found, because "is" and "the" are stopwords.
When I search deal_title_terms:
Example 1: "deal_title_terms: is" - I expect nothing to be found, because "is" is a stopword.
Example 2: "deal_title_terms: is the deal" - I expect "is" and "the" to be ignored and the term "deal" to be found.
Example 3: "deal_title_terms: title a terms" - I expect "a" to be ignored and the term "title terms" to be found.
Question 1: Why don't stopwords work for the "deal_description" field?
Question 2: Why are stopwords not removed from my query on the "deal_title_terms" field? (When I search for "title a terms", it does not find the "title terms" term.)
Question 3: Is there a way to show stopwords in search results but prevent them from being searched? Example:
data: This is cool search engine
search query : "is coo" -> return "This is cool search engine"
search query : "is" -> return nothing
search query : "This coll" -> return "This is cool search engine"
Question 4: Where can I find a detailed description (ideally with examples) of how stopwords work in Solr? At the moment it looks like magic.
Answer to Question 1: KeywordTokenizerFactory does no actual tokenizing, so the entire input string is preserved as a single token and the stop filter never sees the individual words. Replace it with StandardTokenizerFactory.
Or use the fieldType below.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Stopwords will then work as expected for the "deal_description" field.
Answer to Question 3: Yes. Add the StopFilterFactory only to the analyzer with type="query". Stopwords are then removed from the query, but they are not removed while indexing.
Answer to Question 4: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Answer to Question 2: The custom field type you created looks incorrect. The text has to be tokenized first, but your query analyzer lists a filter before the tokenizer.
Check how the field is analyzed on the Solr analysis page.
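The answer to Question 1 can be illustrated with a small Python sketch. It is a rough approximation, not Solr's implementation: standard_tokenize uses a simple regular expression instead of the real StandardTokenizer rules, and all function names here are hypothetical. The stopword set is the one listed in the question.

```python
import re

STOPWORDS = {"the", "is", "a"}  # contents of stopwords.txt in the question


def keyword_tokenize(text):
    # KeywordTokenizer: the entire input is one token.
    return [text]


def standard_tokenize(text):
    # Rough stand-in for StandardTokenizer: split on non-word characters.
    return re.findall(r"\w+", text)


def analyze(text, tokenize):
    # StopFilter (ignoreCase=true) followed by LowerCaseFilter,
    # in the same order as the text_general field type.
    kept = [tok for tok in tokenize(text) if tok.lower() not in STOPWORDS]
    return [tok.lower() for tok in kept]


if __name__ == "__main__":
    text = "This is the my description"
    print(analyze(text, keyword_tokenize))   # one token, stopwords still inside
    print(analyze(text, standard_tokenize))  # stopwords removed per word
```

With the keyword tokenizer, the only token is the whole sentence, which is not itself a stopword, so nothing is removed.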

Solr - termfreq partial matches

I'm using Solr to query a set of documents, and I want to get the number of matches for a certain term. Right now I'm using
termfreq(text,'manage')
However, this does not hit on Manager or Management.
termfreq(text,'manage*')
returns the same count. I've tried different tokenizers; some won't even accept the *, and I haven't found one that returns the correct number of matches.
Field:
<field name="text" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" required="false"/>
Is there a way I can get termfreq to also count partial matches?
You will need to add some custom tokenizer and filter classes to the analyzer.
In your /shared/field_types.xml file, create a new type like this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
And in /shared/fields.xml:
<field name="text" stored="true" type="text" multiValued="false" indexed="true"/>
<dynamicField name="*_text" stored="true" type="text" multiValued="false" indexed="true"/>
Then use "text" as the type of the field.
A more advanced solution:
<fieldType name="startsWith" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!-- remove words/chars we don't care about -->
<filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement="" replace="all"/>
<!-- now remove any extra space we have, since spaces WILL influence matching -->
<filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement="" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
In /shared/fields.xml:
<dynamicField name="*_starts_with" stored="true" type="startsWith" multiValued="false" indexed="true"/>
Then, in the top level of your core's schema.xml add this:
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="../../../shared/fields.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="../../../shared/field_types.xml"/>
And add this to your copyFields in the core's schema.xml:
<copyFields>
<copyField source="yourField" dest="yourField_text"/>
<copyField source="yourField" dest="yourField_starts_with"/>
...
</copyFields>
I had the same problem: I needed termfreq to also count matches on parts of words.
Adding this fieldType solved it.
<fieldType name="startWith" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
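A small Python sketch of why an index-time EdgeNGramFilterFactory makes termfreq count prefix matches. It is an approximation only: whitespace splitting stands in for StandardTokenizerFactory, and the helper names edge_ngrams, index_terms and termfreq are hypothetical, chosen to mirror the fieldType above (minGramSize="2", maxGramSize="15").

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    # Front-edge n-grams: every prefix of the token within the size limits.
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(text):
    # Index-time chain: tokenize, lowercase, then expand into edge n-grams.
    terms = []
    for token in text.lower().split():
        terms.extend(edge_ngrams(token))
    return terms


def termfreq(text, term):
    # Count how often the (lowercased) term occurs among the indexed terms.
    return index_terms(text).count(term.lower())


if __name__ == "__main__":
    doc = "Manage Manager Management"
    print(edge_ngrams("manager"))
    print(termfreq(doc, "manage"))
```

Each of the three words contributes the prefix "manage" to the index, so the count covers Manager and Management as well. The query analyzer has no n-gram filter, so the query term itself is not expanded.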

Solr - WhiteSpaceTokenizerFactory works for index but not while querying

Consider the following schema,
<schema>
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" multiValued="false"/>
<fieldType name="stop_analyzer_string" class="solr.TextField" multiValued="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
</fieldType>
</types>
<fields>
<field name="name_search" type="stop_analyzer_string" indexed="true" stored="false"/>
<copyField source="name" dest="name_search"/>
<field name="name" type="string" indexed="true" stored="true"/>
</fields>
</schema>
The name field gets indexed with WhitespaceTokenizerFactory, but it doesn't seem to use the WhitespaceTokenizerFactory while querying with the name field.
For a doc with name as "solr search",
the query name_search:solr - matches the document. //index time WhiteSpace tokenizer works
the query name_search:search - matches the document. //index time WhiteSpace tokenizer works
But the query name_search:solr search - doesn't match the document. //query time WhiteSpace tokenizer doesn't work
But as specified in the schema, the query should also be tokenized with whitespace and matched with the document. no?
I'm not sure what you are missing, but all of the above queries worked for me with the data you mentioned.
http://localhost:8983/solr/collection1/select?q=name_search%3Asolr+search&wt=xml&indent=true
The above returned the document I indexed.
To test it, go to:
http://localhost:8983/solr/#/collection1/documents
Paste the document below as-is into the Document(s) box and click Submit Document:
{"id":"100001","name_search":"solr search"}
Then run your query as:
http://localhost:8983/solr/collection1/select?q=name_search%3Asolr+search&wt=json&indent=true
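For repeating this test from a script, the request URL can be built with Python's standard library. This is a minimal sketch; the select_url helper is hypothetical, and the host and core name are taken from the URLs above. The second query uses parentheses to group both words under name_search: with Solr's standard query parser a field prefix applies only to the term directly after it, so in name_search:solr search the word "search" goes to the default field. Whether that explains the asker's result is an assumption, since the thread does not confirm it.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/collection1/select"


def select_url(query, wt="json"):
    # urlencode percent-encodes ":" and parentheses and turns spaces into "+".
    params = {"q": query, "wt": wt, "indent": "true"}
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(select_url("name_search:solr search"))
    print(select_url("name_search:(solr search)"))
```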

How to turn query, which searches SOLR data, into lowercase?

My schema.xml looks like this:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!-- The searched field -->
<field name="product_name" type="text" indexed="true" stored="true"/>
This should index the field in lowercase and also convert the search query to lowercase.
The data I want to find is: "Nokia Lumia 610"
When I search for "nokia" I get the expected result, but
when I search for "Nokia" (uppercase N) there are no results.
It looks as if the analyzer above lowercases only at index time, not the search query.
Is this an error?
How can I force both the Solr index and the search query to be lowercase?
How the search query is transformed also depends on the type of query and the analyzer that you are using. For example, the definition above will not convert your search query to lowercase if you are sending the request to the select analyzer. If you send the request
http://url/solr/select?q=Nokia
then the query will not be converted to lowercase, since the select analyzer is not present in your fieldType definition. You will have to modify your definition as follows:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="select">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If the above does not work, please post the request that you are sending and the output you get when adding debugQuery=true to the request.
Along with
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="select">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
in schema.xml, change the following line in head.vm:
return $("#q").val();
to
return $("#q").val().toLowerCase();
This makes the autocomplete feature case-insensitive, so that you get results even if you search with capital letters.

Solr - Search words within a string

I have a field containing a lot of words, for example:
"hello my name is Nicole and I am working with Solr"
and I need Solr to return this document if I search for these words (note that the word order is not the same as in the indexed text):
"am name with"
I am using this configuration
<fieldType name="propertiesField" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="-" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
and this query:
select/?q=properties_all:am-name-with&version=2.2&start=0&rows=10&indent=on
When I analyze it with the analyzer, those words are highlighted, but no document is found when I run the search.
Thanks for your help!!!!
Unless there is a good reason to use different index-time and query-time analyzers, don't.
I would define the fieldType like this:
<fieldType name="propertiesField" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
Additionally, set the default operator to AND (in the schema file):
<solrQueryParser defaultOperator="AND"/>
Then query Solr with:
select/?q=properties_all:(am name with)&version=2.2&start=0&rows=10&indent=on
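The combination of a shared whitespace-plus-lowercase analyzer and the AND default operator can be sketched in Python as follows. This is only an illustration of the matching logic under those assumptions; the function names analyze and matches_all are hypothetical.

```python
def analyze(text):
    # WhitespaceTokenizer followed by LowerCaseFilter, used for index and query.
    return text.lower().split()


def matches_all(indexed_value, query):
    # defaultOperator="AND": every query term must be present; order is irrelevant.
    indexed = set(analyze(indexed_value))
    return all(term in indexed for term in analyze(query))


if __name__ == "__main__":
    doc = "hello my name is Nicole and I am working with Solr"
    print(matches_all(doc, "am name with"))
    print(matches_all(doc, "am name lucene"))
```

Because each query word is looked up as an independent term, the document matches regardless of the order in which the words appear in the query.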
