Solr: Wildcards and Case sensitivity search

I've been looking around trying to figure out what's going on here but have thus far come up empty. I'm hoping someone can offer me guidance as to where I can look for a solution. I have a text field that is defined as such:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
I have a few records that have the following key/values
"text":[
"NOFX_SiteTest_4",
"NOFX_SiteTest_4\nNOFX_SiteTest_4\n Fourteen\n Ten\n Thirteen\n Fifteen\n Two\n 3\n Select Fields"
]
"text":[
"NOFX_SiteTest_44",
"NOFX_SiteTest_44\nNOFX_SiteTest_44\n Fourteen\n Ten\n Thirteen\n Fifteen\n Two\n 3\n Select Fields"
]
"text":[
"NOFX_SiteTest_445",
"NOFX_SiteTest_445\nNOFX_SiteTest_445\n Fourteen\n Ten\n Thirteen\n Fifteen\n Two\n 3\n Select Fields"
]
I'm trying various searches to get Solr to return those records. The problem is that, depending on how I structure the query (whether and where I add a wildcard, and where I cut off the search text relative to the underscores), the results I get are unexpected and incorrect. Here are the searches I ran from the Solr Admin query page:
SEARCH
text:(( NOFX_SiteTest_4* )) OR text_exact:(( NOFX_SiteTest_4* ))
RESULT
3 Records (correct)
SEARCH
text:(( NOFX_SiteTest_ )) OR text_exact:(( NOFX_SiteTest_ ))
RESULT
3 Records (correct)
SEARCH
text:(( NOFX_SiteTest )) OR text_exact:(( NOFX_SiteTest ))
RESULT
3 Records (correct)
SEARCH
text:(( NOFX_SiteTest* )) OR text_exact:(( NOFX_SiteTest* ))
RESULT
3 Records (correct)
SEARCH
text:(( nofx_sitetest_4 )) OR text_exact:(( nofx_sitetest_4 ))
RESULT
1 Record (correct)
SEARCH
text:(( nofx_sitetest_4* )) OR text_exact:(( nofx_sitetest_4* ))
RESULT
0 Records (incorrect)
SEARCH
text:(( nofx_sitetest_ )) OR text_exact:(( nofx_sitetest_ ))
RESULT
3 Records (correct)
SEARCH
text:(( nofx_sitetest* )) OR text_exact:(( nofx_sitetest* ))
RESULT
0 Records (incorrect)
From what I can tell, based on the configuration for this field, Solr should treat these two queries as identical:
text:(( NOFX_SiteTest_4* )) OR text_exact:(( NOFX_SiteTest_4* ))
and
text:(( nofx_sitetest_4* )) OR text_exact:(( nofx_sitetest_4* ))
Why does the first search, where the letters are properly capitalized, return the appropriate number of records, while the second search, which is all lower case, does not? Yet, when running these queries:
text:(( NOFX_SiteTest_ )) OR text_exact:(( NOFX_SiteTest_ ))
and
text:(( nofx_sitetest_ )) OR text_exact:(( nofx_sitetest_ ))
the proper number of records is returned. Why is the inclusion of the wildcard causing an issue, particularly when the search consists entirely of lower case letters?
I'm hoping that someone can point me in the right direction. I've been looking through the docs and searching for similar problems, but nothing I've run across seems to help with my issue or to explain why this is occurring in the first place.
EDIT: Some additional information.
Here is the definition of the two fields I'm using in my search above:
<field name="text" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="text_exact" type="text_exact" indexed="false" stored="false" multiValued="true"/>
<!-- copy all fields to the default search field -->
<copyField source="title" dest="text"/>
<copyField source="content" dest="text"/>
<copyField source="Comment" dest="text"/>
<!-- copy all fields to the exact match search field -->
<copyField source="title" dest="text_exact"/>
<copyField source="content" dest="text_exact"/>
<copyField source="Comment" dest="text_exact"/>
The only difference between the text and the text_exact is how the field types are defined. When my search is
text:(( NOFX_SiteTest_4* )) OR text_exact:(( NOFX_SiteTest_4* ))
it will find the 3 records (as I state above) but it does so because of the text_exact field, not the text field. I find that odd. Running the search
text_exact:(( NOFX_SiteTest_4* ))
returns 3 records but running the search
text:(( NOFX_SiteTest_4* ))
returns 0 records. I can see why text_exact is returning data: that exact text is in the text_exact field. But I'm not sure why the search against text yields no records. Shouldn't that field be a bit more open and lenient, and even more accepting of wildcard searches? If I remove the asterisk, it does return the one record where that exact text is in the text field. Why isn't it honoring the asterisk as a wildcard?
Finally, if I remove the wildcard and change the text to all lower case, it will find that record without difficulty when searching against the text field. So, again, whatever the issue may be, it appears that it has something to do with using the asterisk as a wildcard.

First of all, the LowerCaseFilterFactory filter should go before the WordDelimiterFilterFactory filter:
<filter class="solr.LowerCaseFilterFactory"/>
It converts all characters to lowercase first; the WordDelimiterFilter then splits the result.
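As a rough sketch of that reordering, the index-time analyzer from the question could look like the following. The filters and attributes are copied from the question; the only change is the position of LowerCaseFilterFactory, and the query-time analyzer would presumably be changed the same way.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <!-- moved up: lowercase the token before the word delimiter splits it -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

A change to the index-time analyzer only affects documents indexed afterwards, so existing documents have to be reindexed before the effect is visible.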
When you're using wildcards, add an additional text:(( NOFX_SiteTest_4 )) clause, which covers the exact match.
Final query:
text:( NOFX_SiteTest_4* ) OR text_exact:( NOFX_SiteTest_4*) OR text:( NOFX_SiteTest_4 )
Please use the analysis solr tool to see what's happening.

When you're using a wildcard, the analysis chain doesn't run as it usually does.
The only filters invoked are those that implement MultiTermAwareComponent, so the analysis page won't do much good to tell you what's happening there.
This means that when you're doing a wildcard search, if the indexing pipeline has changed the tokens (split them, etc.), that processing will not happen when querying. That's probably why you don't get the hits you'd like with wildcards, while it works without them. The cause here is that the WordDelimiterFilter isn't multi-term aware, so when you're indexing, the input text is split into multiple tokens, while when you're querying, that doesn't happen. Since the tokens don't match (I'd wager a guess that just NOFX* might match, since that would be a single token on both sides), you don't get a hit.
If you do require wildcard matching for analyzed text, you're probably going to have to use an NGramFilter instead, and then tweak that filter to get the results you want for each token. But this will, yet again, behave differently depending on where you add the NGramFilter in your chain (i.e. before or after the word delimiter, etc.).
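A minimal sketch of the n-gram approach, under some assumptions: the field type name text_prefix is hypothetical, the gram sizes are arbitrary, and EdgeNGramFilterFactory (the prefix-only variant of the n-gram filter) is swapped in because the queries in the question are all prefix searches. Grams are generated at index time only, so the query is a plain term with no asterisk.

```xml
<!-- hypothetical field type; name and gram sizes are illustrative -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index every prefix of each token, from 2 up to 30 characters -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <!-- no n-grams here: the typed prefix is matched against the indexed grams -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a field of this type, a query such as nofx_sitetest (without the asterisk) should match NOFX_SiteTest_4, NOFX_SiteTest_44 and NOFX_SiteTest_445, because each of those tokens has that lowercase prefix among its indexed grams. The trade-off is a larger index.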

Related

How to ignore stop words and use remaining words to fetch expected results alone using Solr?

I have a name field, with the following definition:
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
</fieldType>
and I would like to search for contents based on the following criteria:
Example values:
test a value
test of solr
not a value
test me says value
So, if I search for test value, I should only get results containing both test and value. And even if I search for test a value, I should still get the same results, since stop words are excluded. But with the current setup, even with edismax in the Solr query, I get all of the results: any record containing either of those two tokens matches. Could someone suggest what to change in the definition to get the expected result? And am I using stop words correctly? I do not want stop words in the search consuming execution time.
I updated the definition as per the suggestion, and even then the result does not make any sense to me.
I have a value what a term, and there are other values such as what the term; about a term; about the term; description test; Name a Nothing; and so on. A search for what a term returns all of these. I also had values consisting of just a and the, and they were returned in the result as well. Although the query for what a term omits the stop word, as per the below screenshot, the result does not make any sense to me.
You can ignore stop words at search and index time, but you cannot remove them from the response. The response contains the text exactly as it was stored. The data used for search and the data returned in the response are different: search happens on the indexed data, while the response returns the stored data.
In your field type definition you are using KeywordTokenizerFactory.
KeywordTokenizerFactory: used when you don't want to split your text into tokens. This tokenizer treats the entire text field as a single token.
Token filters placed after it, such as StopFilterFactory, only ever see that one token, so a stop word inside a longer value is never removed.
You can use StandardTokenizerFactory instead of KeywordTokenizerFactory.
StandardTokenizerFactory: This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters.
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
</fieldType>
On the analysis page, I analyzed the data for the above field.
If you index the data "test a value" and query it with "test value", the result is found. Here you can see that the stop words are skipped while indexing, because StopFilterFactory is applied.
Now use "test a value" for both indexing and querying on the analysis page.
The stop word "a" is skipped because the filter is applied, and the result matches.

Alfresco SOLR4 not giving results if I use wildcard search on a text field having comma separated numbers

I am using SOLR4 along with the Alfresco 5 application.
I have a text field called field1 with the value: 71,72,73
If I search for
#field1:72
I get the results.
But if I search for
#field1:*72*
I am not getting results.
What changes do I need to make in the configs to get results?
I have the following configuration in my schema.xml:
<fieldType name="text___" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="org.apache.solr.analysis.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="1"
splitOnCaseChange="1"
splitOnNumerics="1"
preserveOriginal="1"
stemEnglishPossessive="1"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
</fieldType>
UPDATE: After further analysis, this looks to me like a limit on the minimum number of characters that Solr accepts for searching. If I use more than 2 characters, I get results. For example, in the case above,
#field1:*72,* gives me results. Using just an asterisk also works, but 1 or 2 characters, like 7* or 72*, won't work.
UPDATE 2: This time I tried with a text field having the value "123456". If I search for
1*
12*
123*
1234*
I get no results. I only get results if I give 12345*
I also get results if I give 123456*
I am sure this worked fine in the older Solr version 4.9 but is broken in 4.10.

How to make characters that are part of SOLR query syntax searchable?

I have a problem that I have been trying to solve for quite some time. I am not a Solr expert; I am still learning it.
I have a special type of ID in my system that has to be searchable by users. The problem is that those IDs contain some Solr special characters. By the way, those IDs are stored together with other search terms in the terms_txt field.
Some ID examples: 292/2017 and 1.2.61-962-37/2017
I will refer to the first one as the 'simple one' and to the second as the 'complex one'.
From what I have read around the internet, this kind of search should be possible with a phrase search. So if we add quotation marks around the ID, it should work. Unfortunately, that is not the case. I will post my Solr 4.0 schema and an example of my query here, hoping that you can spot what is wrong. If phrase search is the answer to my problem, then something must be wrong with either the Solr schema or my query (code).
In my example I am searching for "292/2017" as a phrase. Only one entry in my index has this phrase, because this combination of characters is unique (it is some kind of ID, but we insert it in the terms_txt field with all other terms).
This is the query executed via Solr admin. It finds a lot of results, but there should be only 1. It seems that Solr handles the '/' character as a space and ignores terms shorter than 3 characters (ignoring terms shorter than 3 is what we want, but not in a phrase search):
INFO: [collection1] webapp=/solr-example path=/select params={q=terms_txt:"44/2017"&wt=xml} hits=31343 status=0 QTime=6
So basically, in this example, Solr has found all records with the term 2017, which is bad.
This is the query executed within the application logic. It is more complex, but the problem is the same:
INFO: [collection1] webapp=/solr-example path=/select params={mm=100%25&json.nl=flat&fl=id&start=0&sort=date_in_i+desc&fq=type_s:2&fq=date_in_i:[20161201+TO+*]&fq=date_in_i:[*+TO+20171011]&fq=subtype_s:(2+4+6+8)&fq=terms_txt:"\"10/2017\""&fq=language_is:0&rows=10&bq=&q=\"10\/2017\"&tie=0.1&defType=edismax&omitHeader=true&qf=terms_txt&wt=json} hits=978 status=0 QTime=2
This is what terms_txt entries look like in the index:
<arr name="terms_txt">
<str>Some string blah blah 292/2017 - more of terms, blah blah</str>
<str>Something else, blah blah</str>
</arr>
This is my Solr schema field type configuration for the terms_txt field (fields are dynamic):
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(^|\s)([^\-\_&\s]+([\-\_&]+[^\-\_&\s]*)+)(?=(\s|$))" replacement="$1MжџљМ$2 $2" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&]+" replacement="MжџљМ$1" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&]+" replacement="MжџљМ$1" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&]+" replacement="MжџљМ$1" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="MжџљМ" replacement="" />
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w)&(\w)" replacement="$1and$2" />
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="3" max="99"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\b[\-_]+\b" replacement="" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w)&(\w)" replacement="$1and$2" />
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="3" max="99"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
Does anyone have a clue how I can make special characters like .-/ searchable? Can you spot a flaw in my example or suggest a better solution?
You should start by looking at what the analysis page for your content tells you - my guess is that the StandardTokenizer will remove a lot of special characters when tokenizing (and your PatternReplaces might remove content as well).
The Whitespace Tokenizer is better suited for a field where matching special characters is important, since it'll only break on and remove whitespace.
Define different fields and use different tokenizers for those fields, then prioritize hits in those fields based on a weight. Instead of trying to make one field fit all your query needs, make multiple fields, one with each definition, and query all of them. You can adjust weights by using qf together with the (e)dismax handlers. These handlers also allow you to boost phrase matches for two- and three-word shingles.
Use one or more copyField instructions to get your content from one field to the other fields, so you don't have to change your indexing code to adjust how you tweak things in Solr.
If you append debugQuery=true to your query string, you can also see how Solr / Lucene computes the score for each document and what contributes to its ranking, so you can tweak scoring values and see exactly how the final score changes.
When writing the query, escape any special characters with \.
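The multi-field suggestion above could be sketched roughly as follows. The names text_ws and terms_ws are hypothetical, and the sketch assumes that terms_txt can be used as a copyField source even though it is matched by a dynamic field.

```xml
<!-- hypothetical field type: breaks on whitespace only, so "292/2017" stays one token -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- hypothetical second field holding the same content with the new analysis -->
<field name="terms_ws" type="text_ws" indexed="true" stored="false" multiValued="true"/>

<!-- copy the existing content so the indexing code does not have to change -->
<copyField source="terms_txt" dest="terms_ws"/>
```

With edismax, both fields can then be queried together and weighted, for example qf=terms_ws^5 terms_txt, with the slash escaped in the query as described above. Since the whitespace tokenizer keeps punctuation, an ID directly followed by a comma or period would be indexed with that character attached, so the analysis page is worth checking against real content.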

Solr: ClassicFilterFactory with acronyms & use of Solr's analyzer

I have a schema.xml with a text type that uses one set of tokenizers and filters at index time and another at query time. Now I have the problem that a search query which should return some results doesn't return anything. So I thought using Solr's analyzer would bring me closer to the root of the problem.
I have the following string: Foo Bar Ges.m.b.H
This is my schema.xml definition for the field type text:
<fieldType name="text" class="solr.TextField" omitNorms="false" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.ClassicFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
<filter class="solr.ReverseStringFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
<filter class="solr.ReverseStringFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.ClassicFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="2" catenateAll="1" preserveOriginal="1" splitOnNumerics="0"/>
</analyzer>
</fieldType>
When I search for Foo Bar I get all the results back, so the problem lies within the Ges.m.b.H. (notice the missing dot at the end). I have a few questions about this:
1. ClassicFilterFactory
ClassicFilterFactory only works on acronyms that are in the format LETTER.LETTER.LETTER.. For example, G.m.b.H. -> GmbH. But it doesn't work on acronyms like G.m.b.H (missing dot at the end) or Ges.m.b.H. or Ges.m.b.H . Is there a way to get this to work? For now, I'm doing it with the WordDelimiterFilterFactory, but it would be good to know if there is a better way.
2. Solr's Analyzer
I tried to analyze the index and query time with Solr's analyzer. My text gets split at index and query time, as expected. When I fill out the fields for index and query, I get highlighted fields that look as if there was a hit. Here are some screenshots:
The screenshot above is from index time of Foo Bar Ges.m.b.H, at LowerCaseFilterFactory. I also get "hits" at other filters, like my last filter, ReverseStringFilterFactory:
The next screenshot is from query time:
To me, it looks like Solr takes the last line of my query tokenizer/filter chain and searches for hits in the indexed documents, and any hits get highlighted. But unfortunately, this search doesn't return any hits when used in my normal search.
I drilled it down to exclude any other queries:
http://localhost:8982/solr/atalanda_development/select?q=foo+bar+ges.m.b.h&defType=edismax&qf=vendor_name_search_text
Summing up:
Any ideas why this doesn't work?
Am I right that the highlighted, kind of purple fields are hits? Can someone explain how Solr is doing this, so that I can understand it in the future?
Any suggestions to the ClassicFilterFactory problem would be great!

Solr - Exact Match on solr.TextField

Is there a practicable way to do an exact match search on a stemmed fulltext field?
I have a scenario in which I need a field to be indexed and searchable regardless of the case or white space used.
Even using KeywordTokenizerFactory on both index and query, all my searches based on exact match stopped working.
Is there a way to do exact-match searches, as with a string field, while still using custom tokenizers applied to that field?
Below is the schema I am currently using:
<field name="subtipoimovel" type="buscalimpaquery" indexed="true" stored="true" />
<fieldType name="buscalimpaquery" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement="-"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
regards,
Silvio Giuliani
The problem is that while indexing you are using KeywordTokenizerFactory, ASCIIFoldingFilterFactory, LowerCaseFilterFactory and PatternReplaceFilterFactory, but at query time you are using only KeywordTokenizerFactory. That will not work well for exact matches.
You need to see these as pipelined processors. You need to have "similar" processing during query time too.
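One way to get "similar" processing on both sides, sketched here as an assumption rather than a tested fix, is to declare a single analyzer without a type attribute, which Solr then applies at both index and query time. The filters are the ones from the question; replace="all" is spelled out so that every space is replaced.

```xml
<fieldType name="buscalimpaquery" class="solr.TextField" positionIncrementGap="100">
  <!-- no type attribute: the same chain runs at index time and at query time -->
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement="-" replace="all"/>
  </analyzer>
</fieldType>
```

Depending on the query parser and its settings, a query containing spaces may be split on whitespace before the analyzer ever sees it, so multi-word values may still need to be quoted or have their spaces escaped.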
As Srikanth notes in a comment, you should consider splitting up the different kinds of term analysis in two separate fields. See also my answer to a functionally similar question: Solr: combining EdgeNGramFilterFactory and NGramFilterFactory.
Apparently the problem was this tokenizer:
"solr.KeywordTokenizerFactory"
I changed it to StandardTokenizerFactory and now exact matches work.
I read the description of KeywordTokenizerFactory on the Solr wiki, and it seems to me that for exact matching I should use it rather than StandardTokenizerFactory.
Does anyone know why this happens?
