Substring matches within SOLR - search

I can't seem to figure out how to find substring matches with SOLR. I've figured out matches based on a prefix, so I can get ham to match hamburger.
How would I get a search for 'burger' to match hamburger as well? I tried *burger, but this threw the error '*' or '?' not allowed as first character in WildcardQuery.
How can I match substrings using SOLR?

If anyone ends up here after searching for "apachesolr substring", there's a simpler solution for this : https://drupal.stackexchange.com/a/27956/10419 (from https://drupal.stackexchange.com/questions/26024/how-can-i-make-search-with-a-substring-of-a-word)
Add an n-gram filter to the text type definition in schema.xml in the Solr config directory.
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25" />
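
One thing worth checking with the snippet above: as far as I know, EdgeNGramFilterFactory by default only emits grams anchored at the start of a token, so it helps with prefixes like ham but not with burger inside hamburger. NGramFilterFactory emits grams from every position. The following is a rough Python sketch of the difference, not Solr code; the function names are made up for illustration and the sizes mirror the minGramSize/maxGramSize values above.

```python
def edge_ngrams(token, min_size=3, max_size=25):
    """Grams anchored at the start of the token (prefixes only)."""
    upper = min(max_size, len(token))
    return [token[:n] for n in range(min_size, upper + 1)]


def ngrams(token, min_size=3, max_size=25):
    """Grams starting at every position of the token (all substrings)."""
    grams = []
    for start in range(len(token)):
        upper = min(max_size, len(token) - start)
        for n in range(min_size, upper + 1):
            grams.append(token[start:start + n])
    return grams


edge = edge_ngrams("hamburger")
full = ngrams("hamburger")
print("ham" in edge, "burger" in edge)  # prefix is indexed, inner substring is not
print("burger" in full)                 # full n-grams cover the inner substring
```

The trade-off is index size: full n-grams produce many more terms per token than edge n-grams.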

You can enable leading wildcards, but they will be very resource hungry (e.g. search for SuffixQuery).
See: http://lucene.472066.n3.nabble.com/Leading-Wildcard-Search-td522362.html
Quoting the mailing list:
Work arounds? Imagine making a second index (or adding another field) with all of the terms spelled backwards.
=>
See Add ReverseStringFilter https://issues.apache.org/jira/browse/LUCENE-1398
and Support for efficient leading wildcards search: https://issues.apache.org/jira/browse/SOLR-1321
At the moment issues.apache.org seems down. Try to use e.g. google cache.

As stated in the link above, you can use leading wildcards with edismax (ExtendedDismaxQParser). Just try it out to see if it is fast enough.
More information about the reversed-string approach mentioned above can be found under solr.ReversedWildcardFilterFactory.
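
To illustrate the "terms spelled backwards" idea from the mailing list quote: if every term is also stored reversed, a leading-wildcard search such as *burger becomes a cheap prefix search for regrub* on the reversed terms. This is only a toy Python sketch of the principle, assuming a sorted in-memory list as the term dictionary; it is not how Solr implements it, and the function names are hypothetical.

```python
import bisect


def build_reversed_index(terms):
    """Store every term reversed, sorted so prefixes can be found by bisection."""
    return sorted(term[::-1] for term in terms)


def suffix_search(reversed_index, suffix):
    """Find terms ending in `suffix` via a prefix scan over the reversed terms."""
    prefix = suffix[::-1]
    start = bisect.bisect_left(reversed_index, prefix)
    matches = []
    for reversed_term in reversed_index[start:]:
        if not reversed_term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        matches.append(reversed_term[::-1])
    return matches


index = build_reversed_index(["hamburger", "ham", "cheeseburger", "burgundy"])
print(suffix_search(index, "burger"))
```

Note that this only covers patterns like *burger. A pattern with wildcards on both sides (*urge*) still needs n-grams or a full scan.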

Related

SOLR: Return missing words for multi word searches

I'm trying to receive the words of a search query in solr, which were not included in a match.
Let's say I'm searching for "Red Hat Linux chickpeas" (without quotes) and one of the hits is "Red Hat Enterprise Linux operating system". Then I'd like to get the information that the word "chickpeas" is missing in this result.
I think this should somehow be possible with SOLR, but I haven't been able to come up with the right google/stackoverflow query to find a solution.
You could try using a facet to get the number of results that contain each of the given terms:
q=Red+Hat+Linux+chickpeas&facet=true&facet.field={!terms=red,hat,linux,chickpeas}text
Where text is a catch-all field (tokenized, lowercase filtered). Note that the facets are case-sensitive.
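
If you build the request programmatically, the facet parameter can be derived from the query words. A minimal Python sketch, assuming the catch-all field is called text as above (the helper name is made up); the terms are lowercased because the facet values are case-sensitive and the field is lowercase filtered.

```python
from urllib.parse import urlencode, parse_qs


def facet_params(query, field="text"):
    """Build q plus a terms-restricted facet.field for the words in `query`."""
    terms = ",".join(word.lower() for word in query.split())
    return {
        "q": query,
        "facet": "true",
        "facet.field": "{!terms=%s}%s" % (terms, field),
    }


params = facet_params("Red Hat Linux chickpeas")
query_string = urlencode(params)  # URL-encoded, ready to append to the select URL
print(query_string)
```

Keep in mind that the facet counts are for the whole result set, so a count of 0 for chickpeas says no hit contains it, but it does not tell you which terms are missing from one specific hit.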
The answer to my question is to use an exists function query for each search term.
See here:
https://stackoverflow.com/a/26163945/467944
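
I haven't reproduced the linked answer here, but the general shape, based on Solr's documented exists() and query() functions, would be one pseudo-field per search term in the fl parameter. A Python sketch that builds such a field list; the field name text, the id field and the has_ aliases are assumptions for illustration.

```python
def exists_fields(query, field="text"):
    """One aliased exists() pseudo-field per word, e.g. has_red:exists(...)."""
    fields = []
    for word in query.split():
        term = word.lower()
        fields.append("has_%s:exists(query({!v='%s:%s'}))" % (term, field, term))
    return fields


search = "Red Hat Linux chickpeas"
params = {
    "q": search,
    # Each document should then carry a true/false flag per search term.
    "fl": ",".join(["id"] + exists_fields(search)),
}
print(params["fl"])
```

A document where has_chickpeas comes back false would be one that matched without that word.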

Remove diacritics at index time into Solr

I am fine-tuning a Solr search. I'm using Solr 4.0.
Normally I work with language analyzers and tokenizers for English, but this time I'm working with Portuguese, and I'm facing an issue because it doesn't give the results I expect.
For example: I'm searching for the word 'proteses', but what is indexed is 'próteses', with diacritics. So it gives the wrong results!
What I need to do is remove all diacritics before indexing and searching, so it gives correct results. However, I'm unable to find out how to handle this part.
Can anyone point me in the right direction?
You have to use a char mapping filter on the fields that can contain diacritics. This filter will normalize them.
For example :
<fieldType name="text_with_diacritics" class="solr.TextField">
<analyzer>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
The mapping-ISOLatin1Accent.txt file that comes with Solr has mappings for many diacritics.
Obviously, you'll have to reindex your documents after you configure this filter.
Solr also has several ICU filters available, including both a Normalization and a Folding filter, which allow for removal of accents and diacritics across Unicode.
There is also an ASCIIFoldingFilter available, which will attempt to convert any character above the standard 7-bit ASCII range down into that range.
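
To get a feel for what these filters do to a token, here is a small Python sketch of diacritic removal via Unicode decomposition. It is only an approximation of the idea, not the mapping file or the Solr filters themselves, and the function name is made up.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_diacritics("próteses"))  # -> proteses
```

Because the same folding is applied at index time and at query time, both 'proteses' and 'próteses' end up as the same term.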

Solr, Special Chars, and Latin to Cyrillic char conversion

I am trying to set up a search engine using Solr (or Lucene) which could have text in both Latin with special chars (for example Ö or Ç) and Cyrillic chars (for example Б or б and Ж or ж).
Anyway, I am trying to find a solution that lets me search for words with these characters in them, for users who do not have the key on their keyboard...
Example would be (making up words here, hopefully won't offend anyone):
"BÖÖK" would be found when searching for "book"
"ЖRAY" would be found when searching for XRAY
"ЖRAY" would also be found if searching for ZRAY, ZHRAY, or žray (see GOST 16876-71 for info on transliteration of Cyrillic to Latin chars).
So, how should I go about this? Some theories I have are:
allow multiple text fields to be stored for each original string, one in original form, one in the first pass of transliteration (which, for example, would convert Ö to just O and Ж to ž, but also X) and then one in the third form (from the ž to z or zh) -> means I will be storing a LOT of data...
store in solr as is, and let Solr do the magic -> don't know how well this will work... can't see anything in solr to do this
Magic bullet I have not found yet...
Any ideas? Anyone tried this before?
Take a look at Solr's Analyzers, Tokenizers, and Token Filters which give you a good intro to the type of manipulation you're looking for.
You need to use an accent filter in both your index and query text analysis, which converts accented characters to their unaccented English equivalents.
You can use ISOLatin1AccentFilterFactory or ASCIIFoldingFilterFactory depending upon the Solr version you are using.
e.g.
<filter class="solr.ASCIIFoldingFilterFactory" />
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
So:
"BÖÖK" would be converted and indexed as "book" in Solr (assuming a lowercase filter is also in the chain, since ASCII folding alone yields "BOOK").
This would let users search for either book or BÖÖK and still get back the document.
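
The accent filters cover the Ö case, but they leave the Cyrillic examples from the question untouched. The question's first idea (storing every transliterated form) could be sketched like this in Python. The mapping table is hypothetical and holds only the single letter from the question's examples; a real table would follow a standard such as GOST 16876-71.

```python
import itertools
import unicodedata

# Hypothetical table: each Cyrillic letter maps to every Latin form a user might type.
TRANSLIT = {"ж": ["ž", "zh", "z", "x"]}


def fold_char(ch):
    """Strip diacritics from one character (Ö -> O) via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def variants(word):
    """All lowercase Latin spellings under which `word` could be indexed."""
    options = []
    for ch in word.lower():
        options.append(TRANSLIT.get(ch) or [fold_char(ch)])
    return {"".join(combo) for combo in itertools.product(*options)}


print(sorted(variants("ЖRAY")))
print(variants("BÖÖK"))
```

The number of variants multiplies with every ambiguous letter, which is the "LOT of data" concern raised in the question.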

Solr not matching. Threshold setting, or something weird?

I'm using solr to search for articles. I created 2 test "body" sentences which have the common word "tall", but there is no match.
The Query---> Body:"There are tall people outside" AND !UserId:2
Does not match a post with:
Body: the KU tower is really tall
UserId:3
Is this simply a very low matching score, or is there something else going on here? In the case of a low matching score, should it really be that low? The body sentences are very short and share a common word, so I would have expected some match.
EDIT: I think the matching isn't happening as a result of having the !UserId: 2 condition. If I try to match body sentences without that, it's very liberal. Can anyone explain this, and perhaps how best to structure a query to avoid this type of behavior?
Thanks!
I have seen some funky behavior with the ! operator with Solr. I would suggest you use the - (negative indicator) instead as shown in the SolrQuerySyntax Wiki Page. Try changing your original query to Body:"There are tall people outside" AND -UserId:2 to see if that works as you are expecting.
For those who come after me, I found a solution however not necessarily an explanation for its behavior.
The Solr query:
(PostBody:There are tall people outside) AND !UserId:2
worked as I desired above. Note that if quotes are added around the body, it does not match. I believe this is because quotes turn the query into a phrase query, which Solr matches as a single sequence of words rather than as individual words.
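
The difference between the quoted and unquoted forms can be modelled roughly as follows. This Python sketch is a simplification that ignores analysis (stop words, stemming) and scoring, and the function names are made up. Also note that in standard Lucene syntax only the first word of an unquoted field:word1 word2 query is bound to that field; the remaining words go to the default field.

```python
def tokens(text):
    """Very naive analysis: lowercase and split on whitespace."""
    return text.lower().split()


def matches_any_term(query, body):
    """Unquoted behaviour: a single shared word is enough for a match."""
    return bool(set(tokens(query)) & set(tokens(body)))


def matches_phrase(query, body):
    """Quoted behaviour: the words must appear together, in the same order."""
    q, b = tokens(query), tokens(body)
    return any(b[i:i + len(q)] == q for i in range(len(b) - len(q) + 1))


query = "There are tall people outside"
body = "the KU tower is really tall"
print(matches_any_term(query, body))  # the word "tall" is shared
print(matches_phrase(query, body))    # the full phrase never occurs in the body
```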

How to find a match within a single term using Lucene

I am using the Lucene search engine but it only seems to find matches that occur at the beginning of terms.
For example:
Searching for "one" would match "onematch" or "one day a time" but not "loneranger".
The Lucene doc says it doesn't support wildcards at the front of a search string, so I am not sure whether Lucene can find matches inside a term at all, or can only match terms that start with the search term.
Is this a problem with how I have created my index, how I am building my search query or just a limitation of Lucene?
Found some info in another post here on Stack Overflow: "[LUCENE.NET] Leading wildcard throws an error"
You can call SetAllowLeadingWildcard on your Query Parser (setAllowLeadingWildcard in Java Lucene) to allow leading wildcards during your search. This will of course have a large performance impact, but it will allow users to find matches within a term.
Lucene will find a document if the search term appears anywhere within it, but by default it doesn't allow wildcard queries where the wildcard is at the front of the search term, because they perform horribly. If that is functionality you care about, you will either have to change a config flag (thanks for the interesting link), find a third-party library that supports it, or find a different search implementation (for small enough datasets, the built-in search of many RDBMS engines is sufficient).
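
As a rough illustration of why the leading wildcard is expensive: a pattern like one* can jump straight to the matching part of a sorted term list, while *one* has no fixed start and has to be tested against every term in the index. A toy Python sketch using the examples from the question (this is not Lucene code, and the term list is made up):

```python
from fnmatch import fnmatchcase

TERMS = ["onematch", "one", "loneranger", "day"]


def wildcard_scan(pattern, terms):
    """Test the pattern against every term: a full scan of the term list."""
    return [term for term in terms if fnmatchcase(term, pattern)]


print(wildcard_scan("one*", TERMS))   # prefix pattern: misses "loneranger"
print(wildcard_scan("*one*", TERMS))  # leading wildcard: finds it as well
```

With a handful of terms the scan is free; with millions of distinct terms it is the performance cost mentioned above.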
Your query should be
Query query = new WildcardQuery(new Term("contents", "*one*"));
where contents is the name of the field you are searching.
"one" should be enclosed in asterisks, with no space between the term and either asterisk.

Resources