Solr - Exact Match on solr.TextField - search

Is there a practicable way to do an exact match search on a stemmed fulltext field?
I have a scenario in which I need a field to be indexed and searchable regardless of the case or whitespace used.
Even using KeywordTokenizerFactory on both index and query, all my searches based on exact match stopped working.
Is there a way to search for an exact match like on a string field and at the same time apply custom tokenizers to that field?
I posted below the schema I am currently using:
<field name="subtipoimovel" type="buscalimpaquery" indexed="true" stored="true" />
<fieldType name="buscalimpaquery" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement="-"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
regards,
Silvio Giuliani

The problem is that while indexing you are using KeywordTokenizerFactory, ASCIIFoldingFilterFactory, LowerCaseFilterFactory and PatternReplaceFilterFactory, but at query time you are only using KeywordTokenizerFactory. That will not work well for exact matches.
You need to see these as pipelined processors, and you need to have "similar" processing at query time too.
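For example, a minimal sketch of the same field type where a single analyzer (no type attribute) is applied to both index and query, so the two sides always stay in sync; the filters are copied from the schema in the question:
<fieldType name="buscalimpaquery" class="solr.TextField" positionIncrementGap="100">
<!-- one analyzer without type="index"/"query" is used for both indexing and querying -->
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement="-"/>
</analyzer>
</fieldType>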

As Srikanth notes in a comment, you should consider splitting up the different kinds of term analysis in two separate fields. See also my answer to a functionally similar question: Solr: combining EdgeNGramFilterFactory and NGramFilterFactory.

Apparently the problem was this tokenizer:
"solr.KeywordTokenizerFactory"
I changed it to StandardTokenizerFactory and now exact matches work.
I read the description of KeywordTokenizerFactory on the Solr wiki, and it seems to me that for exact matches I should use it instead of StandardTokenizerFactory.
Does anyone know why this happens?

Related

How to ignore stop words and use remaining words to fetch expected results alone using Solr?

I have a name field, with the following definition:
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
</fieldType>
and I would like to search for contents based on the following criteria:
Example values:
test a value
test of solr
not a value
test me says value
So, if I search for test value, I should only get results containing both test and value. And even if I search for test a value, I should still get the same result, as stop words will be excluded. But with the current setup, even with edismax in the Solr query, I get all of the results: any record containing either of those two tokens matches. Could someone suggest what I should change in the definition to get the expected result? And am I using stopwords as expected? I do not want stopwords in the search consuming execution time.
I updated the definition as per the suggestion, and even then the result does not make sense to me.
I have a value what a term, and there are other values like what the term; about a term; about the term; description test; Name a Nothing, etc. A search for what a term returns all of these. I also had values that were just a and the, and they were also returned in the result. Even though, for what a term, the query omits the stop word (as per the screenshot below), the result does not make sense to me.
You can ignore the stopwords during search and index time, but you cannot ignore these words in the response. The response returns the text exactly as it was stored. The data stored for search and the data returned in the response are different: search happens on the indexed data, while the response gives you the stored data.
In your field type definition you are using KeywordTokenizerFactory.
KeywordTokenizerFactory is used when you don't want to create any tokens from your text. This tokenizer treats the entire text field as a single token.
Applying any other filter after it is of little use here.
You can use StandardTokenizerFactory instead of KeywordTokenizerFactory.
StandardTokenizerFactory: this tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters.
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
</fieldType>
On the analysis page, I analyzed the data for the above field.
Index the data "test a value" and query it with "test value": a match is found. Here you can see that while indexing the data the stopwords are skipped, because we have applied the StopFilterFactory.
Now use "test a value" both as the index value and as the query on the analysis page: it skips the stopword "a" because the filter is applied, and the result still matches.
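For illustration, the index-time chain above processes the stored text roughly like this (assuming "a" is listed in stopwords.txt):
test a value -> StandardTokenizer: [test] [a] [value] -> StopFilter: [test] [value] -> LowerCase/ASCIIFolding: [test] [value]
The query "test value" goes through the same chain and produces the same two tokens, so both terms match.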

How to make characters that are part of SOLR query syntax searchable?

I have this problem that I have been trying to solve for quite some time. I am not a Solr expert, I am still learning it.
I have a special type of IDs in my system that have to be searchable by users. The problem is that those IDs contain some Solr special characters. By the way, those IDs are stored together with other search terms in the terms_txt field.
Some ID examples: 292/2017 and 1.2.61-962-37/2017
I will refer to the first one as the 'simple one' and to the second as the 'complex one'.
From what I read throughout the internet, this kind of search should be possible if we do a phrase search. So if we put quotes around the ID, it should work. But unfortunately that is not the case. I will post here my Solr 4.0 schema and an example of my query, hoping that you can spot what is wrong with it. If phrase search is the answer to my problem, then something must be wrong with either the Solr schema or my query (code).
In my example I am searching for "292/2017" as a phrase. Only one entry in my index contains this phrase, because this combination of characters is unique (it is some kind of ID, but we insert it into the terms_txt field with all the other terms).
This is a query executed via the Solr admin; it finds a lot of results, but there should be only 1. It seems like Solr handles the '/' character as a space and ignores terms shorter than 3 letters (ignoring terms shorter than 3 is what we want, but not in a phrase search):
INFO: [collection1] webapp=/solr-example path=/select params={q=terms_txt:"44/2017"&wt=xml} hits=31343 status=0 QTime=6
So basically, in this example, Solr has found all records containing the term 2017, which is bad...
This is a query executed within the application logic. It is more complex, but the problem is the same:
INFO: [collection1] webapp=/solr-example path=/select params={mm=100%25&json.nl=flat&fl=id&start=0&sort=date_in_i+desc&fq=type_s:2&fq=date_in_i:[20161201+TO+*]&fq=date_in_i:[*+TO+20171011]&fq=subtype_s:(2+4+6+8)&fq=terms_txt:"\"10/2017\""&fq=language_is:0&rows=10&bq=&q=\"10\/2017\"&tie=0.1&defType=edismax&omitHeader=true&qf=terms_txt&wt=json} hits=978 status=0 QTime=2
This is how the terms_txt entries look in the index:
<arr name="terms_txt">
<str>Some string blah blah 292/2017 - more of terms, blah blah</str>
<str>Something else, blah blah</str>
</arr>
This is my solr schema field configuration for the terms_txt field (fields are dynamic):
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(^|\s)([^\-\_&\s]+([\-\_&]+[^\-\_&\s]*)+)(?=(\s|$))" replacement="$1MжџљМ$2 $2" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&]+" replacement="MжџљМ$1" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&]+" replacement="MжџљМ$1" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&]+" replacement="MжџљМ$1" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="MжџљМ" replacement="" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w)&(\w)" replacement="$1and$2" />
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="3" max="99"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\b[\-_]+\b" replacement="" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w)&(\w)" replacement="$1and$2" />
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="3" max="99"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
Does anyone have any clue how I should allow special characters like .-/ to be searchable? Can you spot some flaw in my example or suggest a better solution?
You should start by looking at what the analysis page for your content tells you - my guess is that the StandardTokenizer will remove a lot of special characters when tokenizing (and your PatternReplaces might remove content as well).
The Whitespace Tokenizer is better suited for a field where matching special characters is important, since it'll only break on and remove whitespace.
Define different fields and use different tokenizers for those fields, then prioritize hits in those fields based on a weight. Instead of trying to make one field fit all your query needs, make multiple fields, one with each definition, and query all of them. You can adjust weights by using qf together with the (e)dismax handlers. These handlers also allow you to boost phrase matches for two- and three-token shingles.
Use one or more copyField instructions to get your content from one field to the other fields, so you don't have to change your indexing code to adjust how you tweak things in Solr.
If you append debugQuery=true to your query string, you can also see how Solr / Lucene computes the score for each document and what contributes to its ranking, so you can tweak scoring values and see exactly how the final score changes.
When writing the query, escape any special characters with \.
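A rough sketch of that multi-field approach (the field and type names below are illustrative, not taken from the question):
<fieldType name="text_exactish" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<!-- WhitespaceTokenizer keeps / . - inside tokens such as 292/2017 -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<field name="terms_exact_txt" type="text_exactish" indexed="true" stored="false" multiValued="true"/>
<!-- copy the existing content so the indexing code does not have to change -->
<copyField source="terms_txt" dest="terms_exact_txt"/>
You could then query both fields with edismax, for example q="292/2017"&defType=edismax&qf=terms_txt terms_exact_txt^2 (URL-encode the space in qf in a real request), and add debugQuery=true to see which field produced each match and how much it contributed to the score.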

Tune solr phrase query search

We are trying to tune our phrase queries in DSE search.
For example, if we have a column named X with the value "D A T A S T A X", we are searching for an exact match with X:"T A S T".
Words are tokenized with WhitespaceTokenizer.
We have a couple hundred million records in the database and all the indexes are in memory (we tested this using pcstat). However, the queries are still taking 5-15 sec. Why is it taking so much time to pull the results if all the indexes are in memory? How can I tune this?
Any help is appreciated.
Try this fieldType:
<fieldType name="custom_edge_ngram" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^A-Za-z0-9])" replacement="" replace="all"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^A-Za-z0-9])" replacement="" replace="all"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Here the KeywordTokenizerFactory tokenizer passes the text stream to the filters as-is. The PatternReplaceFilterFactory then removes everything except letters and numbers; you can configure this however you want. Then we lowercase the stream and generate the NGrams. That is the index phase. For the query phase we do not generate NGrams, because we want to match the exact substring.
We use NGram instead of EdgeNGram because NGram produces all substrings. EdgeNGram grams always start from either the beginning or the end of the token, so EdgeNGram is not helpful in this case.
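To make that concrete (assuming this field type is applied to X): at index time the value "D A T A S T A X" is collapsed to the single token "datastax" by the pattern replace and lowercase filters, then expanded into 2-15 character NGrams (da, at, ..., tast, ..., datastax). At query time "T A S T" is collapsed to "tast", which equals one of those NGrams, so the match is a plain term lookup with no wildcard or phrase processing.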
Hope this helps.

Search Solr: Match exact result Koh S*

I have a schema as below
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
And my document will be:
Koh Samui
Koh Chang
Koh Lanta
When I do the search Koh*, it returns 3 results, which is accepted. But when I search koh S*, it returns zero results. I want the result to be only Koh Samui.
I assume you want to use a wildcard query. This depends on the query parser you are using: the classic dismax parser does not support wildcard queries, while the standard (Lucene) parser and edismax do.
And to be honest, if you are using Solr mainly for wildcard searches, you might be using the wrong tool for the job; any SQL database with text like 'Koh S%' in the WHERE clause would give you what you want.
But anyway, to run this wildcard query in Solr you can use the Lucene query parser directly, with the following query:
http://localhost:8983/solr/core/select?q={!lucene df=text}Koh S*
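Note that since the field uses KeywordTokenizerFactory, each document is indexed as one single token such as koh samui, so the space in the query probably needs to be escaped (and the term lowercased) for the whole prefix to apply to one term; the exact behaviour can vary by Solr version, but something like this should work against the same core as above:
http://localhost:8983/solr/core/select?q={!lucene df=text}koh\ s*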

ShingleFilter search with more terms than indexed phrase fails

I am using Solr 1.4.1 (Lucene 2.9.3) on Windows and am trying to understand ShingleFilter. I wrote the following configuration and find that if I provide more words than the actual phrase indexed in the field, then the search on that field fails, i.e. no score is contributed from that field (checked with debugQuery=true).
Here is an example I created to reproduce, with field names and the document indexed:
Id: 1
title_1: Nina Simone
title_2: I put a spell on you
Issue the following Queries (dismax):
- “Nina Simone I put” <- Fails to have a score from title_1 search (using debugQuery)
- “Nina Simone” <- SUCCESS
Trying to analyze the above disparity, I used Solr's Field Analysis page with the 'shingle' field type (given below) and tried "Nina Simone I put", and there it succeeds. So it is only during the query that no score is produced. I also checked 'parsedquery', and it shows the DisjunctionMaxQuery issuing the string "Nina_Simone Simone_I I_put" against the title_1 field.
title_1 and title_2 fields are of type ‘shingle’, defined as:
<fieldType name="shingle" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
</analyzer>
</fieldType>
Note that I also have a catchall field which is text. I have qf set to: 'id^2 catchall^0.8' and pf set to: 'title_1^1.5 title_2^1.2'
Is there something that I am missing or doing something wrong?
In a dismax query, the score of the query is the max of the subqueries, not the sum. I don't really know much about how it parses shingle queries, but if it produces something like "(title_1:(shingle1 shingle2...)) (title_2:(shingle1 shingle2...))" then you should expect to see only one field contribute to the score.
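Roughly, with the pf setting above, the phrase part of the parsed query would look something like DisjunctionMaxQuery((title_1:"Nina_Simone Simone_I I_put"^1.5 | title_2:"Nina_Simone Simone_I I_put"^1.2)); with tie=0 (the dismax default) only the best-matching field clause contributes to the score, so a clause that does not match against title_1 simply drops out rather than being summed with the others.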
