I want a search for phan to match elephant.
A query like value:*phan* works, so I tried this:
<analyzer type="query">
<filter class="solr.PatternReplaceFilterFactory" pattern="(.+)" replacement="*$1*" replace="all" />
But then the query becomes
"*phan*", a single literal term rather than a wildcard.
How can I do that?
To make Solr find documents by parts of a word, have a look at the NGramTokenizer or the EdgeNGramTokenizer. Since you need to match parts in the middle of a word, the NGramTokenizer is the one to use. If matching only from the start or end of the word would do, the EdgeNGram variant is preferable, as it produces fewer terms and therefore a smaller index.
A good sample is found here on SO within the question Apache solr search part of the word.
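To illustrate why n-grams make the middle of a word searchable, here is a minimal Python sketch. It is not Solr code: the helper names `ngrams` and `edge_ngrams` and the gram sizes 3 to 5 are assumptions chosen for the example.

```python
def ngrams(word, min_size, max_size):
    """Return every substring of `word` whose length is in [min_size, max_size]."""
    return {
        word[i:i + n]
        for n in range(min_size, max_size + 1)
        for i in range(len(word) - n + 1)
    }


def edge_ngrams(word, min_size, max_size):
    """Return only the prefixes of `word` whose length is in [min_size, max_size]."""
    return {word[:n] for n in range(min_size, min(max_size, len(word)) + 1)}


# With plain n-grams, "phan" is stored as its own term at index time,
# so the query term "phan" can match without any wildcard.
print("phan" in ngrams("elephant", 3, 5))       # True
# Edge n-grams only cover the start of the word.
print("phan" in edge_ngrams("elephant", 3, 5))  # False
print("ele" in edge_ngrams("elephant", 3, 5))   # True
```

The plain n-gram set is much larger than the edge n-gram set, which is the index-size trade-off mentioned above.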
Why index time over query time?
Lucene, and therefore Solr, is not meant to do searches with leading wildcards. Even a search for *foo is likely to perform badly, not to mention *foo*. You can read up on this in the FAQ entry 'What wildcard search support is available from Lucene?':
Leading wildcards (e.g. *ook) are not supported by the QueryParser by default. As of Lucene 2.1, they can be enabled by calling QueryParser.setAllowLeadingWildcard( true ). Note that this can be an expensive operation: it requires scanning the list of tokens in the index in its entirety to look for those that match the pattern.
The SO question Understanding Lucene leading wildcard performance has a more detailed write-up on this topic.
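The cost difference can be sketched with a simplified model in Python. This is an assumption-laden illustration, not Lucene's actual implementation: the term dictionary is modelled as a sorted list, and the term list and function names are made up for the example.

```python
import bisect
from fnmatch import fnmatchcase

# Toy stand-in for a sorted term dictionary.
terms = sorted(["book", "brook", "cook", "cookie", "elephant", "look", "phantom"])


def prefix_matches(terms, prefix):
    """Trailing wildcard (prefix*): binary-search to the first candidate, then read forward."""
    start = bisect.bisect_left(terms, prefix)
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        matches.append(term)
    return matches


def wildcard_matches(terms, pattern):
    """Leading wildcard (*suffix): sort order does not help, so every term is tested."""
    return [t for t in terms if fnmatchcase(t, pattern)]


print(prefix_matches(terms, "coo"))     # ['cook', 'cookie']
print(wildcard_matches(terms, "*ook"))  # ['book', 'brook', 'cook', 'look']
```

The prefix lookup touches only the matching range, while the leading-wildcard lookup visits the whole list, which is what the FAQ means by scanning the tokens in their entirety.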
Related
I am looking for a way to do wildcard search only on specific elements when executing a search:search. Specifically, I might have documents that look like the following:
<pdbe:person-envelope xmlns:pdbe="http://schemas.abbvienet.com/people-db/envelope">
<person xmlns="http://schemas.abbvienet.com/people-db/model">
<costcenter>
<code>0000601775</code>
<name>DISC-PLAT INFORM</name>
</costcenter>
<displayName>Tj Tang</displayName>
<upi>10025613</upi>
<firstName>
<preferred>TJ</preferred>
<given>Tze-John</given>
</firstName>
<lastName>
<preferred>Tang</preferred>
<given>Tang</given>
</lastName>
<title>Principal Research Scientist</title>
</person>
<pdbe:raw/>
</pdbe:person-envelope>
When searches happen, I want the search text to be automatically wildcarded, but only for certain elements like displayName, firstName, lastName, and NOT for upi or code. As I understand it, I would have certain wildcard-related indexes enabled in the database, but then I would need a custom query parser that rewrites the query into multiple cts:element-query and cts:element-value-query statements for each element that I want to wildcard search on, OR'd with the originally parsed search query. Alternatively, I could create field constraints and rewrite the query to use those field constraints.
Is there another way to conditionally apply wildcards to some elements but not others when the user enters a simple search query? For example, a partial first and last name such as "TJ Tan" should match, but searching for "100256" should return no partial hits.
You are on the right track. Let's take an element (or maybe field) query on "TS Tan".
With cts:tokenize, you can break this up (read about cts:tokenize, as it is not just a normal tokenizer).
Then I have "TS" and "Tan".
You can then apply business rules on which words should be wildcarded and which should not, and build the appropriate cts query (probably individual word queries in an and-query or a near-query; the tuning depends on your needs).
Now that the search phrase is tokenized, you may also find that building your results relies not on a wildcard index but on an element word lexicon, where you do your term expansion with word-matches and those terms are then sent to the query.
We sometimes take that further and combine the query building with xdmp:estimate and make the query less restrictive if we don't get enough results early on.
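The business-rule step described above can be sketched in Python as a language-neutral illustration. This is not MarkLogic code: the whitespace split stands in for cts:tokenize, the tuples stand in for cts queries, and `build_query` is a hypothetical helper. Only the element names come from the question.

```python
WILDCARD_ELEMENTS = ["displayName", "firstName", "lastName"]  # wildcarded, per the question
EXACT_ELEMENTS = ["upi", "code"]                              # exact match only


def build_query(phrase):
    """Return a nested (operator, operands) description of the rewritten query.

    Each token must match somewhere (outer "and"); per token, any element may
    match (inner "or"), with a trailing wildcard only on the name elements.
    """
    clauses = []
    for token in phrase.split():
        per_token = [("word", element, token + "*") for element in WILDCARD_ELEMENTS]
        per_token += [("value", element, token) for element in EXACT_ELEMENTS]
        clauses.append(("or", per_token))
    return ("and", clauses)


query = build_query("TJ Tan")
print(query[1][1])  # the clause generated for the token "Tan"
```

In a real implementation the tuples would be replaced by the corresponding cts query constructors, and the rule for which tokens to wildcard could be as elaborate as the application needs.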
Where to put this logic?
You mention search:search, so in this case, I would suggest you package this into a custom constraint.
We are using Solr to provide search functionality for our site, and I have the following requirement which has me stumped:
Given the search term "2011 Bolinger", identify that "Bollinger" (note the different spelling) is a valid value for the Producer facet, and automatically apply facet filtering for this value.
It's the fuzzy matching of the search term which I'm stuck on. Does anyone know of a way to include information in a Solr response about synonym matches which have occurred for a query during the search (i.e. a way for Solr to tell me that it saw the word 'Bollinger' in a document and recognised it as equivalent to 'Bolinger')? From what I've read so far of the Solr documentation I can't see a way to do this, but I may have missed something.
How can I search for "ice cube" if I have "icecube" in my index? I have set mm to 2<-1 4<70%. With a shingle filter in the query analyzer, the query "ice cube" produces three tokens: "ice", "cube" and "icecube". But mm is the limitation here: only "ice" and "cube" are searched, not "icecube", i.e. the pair is not matched even though I am using the shingle filter. In the analysis tool, however, all three tokens are created. How can I solve this?
Here is the schema configuration: http://pastebin.com/74xaKEyv
I think you should use the Shingle filter only at index time in your schema.
Try removing <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/> from the query part of your analyzer.
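For readers unfamiliar with shingles, the following Python sketch mimics what a shingle filter with maxShingleSize="2", outputUnigrams="true" and an empty tokenSeparator emits. It is a simplified model written for illustration, not Solr's implementation, and the function name `shingles` is an assumption.

```python
def shingles(tokens, max_size=2, separator="", output_unigrams=True):
    """Emit unigrams (optionally) and joined runs of up to `max_size` adjacent tokens."""
    smallest = 1 if output_unigrams else 2
    out = []
    for i in range(len(tokens)):
        for n in range(smallest, max_size + 1):
            if i + n <= len(tokens):
                out.append(separator.join(tokens[i:i + n]))
    return out


print(shingles(["ice", "cube"]))  # ['ice', 'icecube', 'cube']
```

This matches the three tokens reported by the analysis tool in the question.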
I have stemming enabled in my Solr instance. I had assumed that, in order to perform an exact word search without disabling stemming, it would be as simple as putting the word in quotes. This, however, does not appear to be the case.
Is there a simple way to achieve this?
There is a simple way, if what you're referring to is the "slop" (required similarity) as part of a fuzzy search (see the Lucene Query Syntax here).
For example, if I perform this search:
q=field_name:determine
I see results that contain "determine", "determining", "determined", etc. If I then modify the query like so:
q=field_name:determine~1
I only see results that contain the word "determine". This is because I'm specifying a required similarity of 1, which means "exact match". With this syntax I can specify the value anywhere from 0 to 1. Note that from Lucene 4.0 on, the number after ~ is instead the maximum number of edits allowed (0 to 2), so ~1 there also matches terms that are one edit away.
Another thing you can do is index the same text without stemming in one field and with stemming in another. Boost the non-stemmed field, and exact versions of words should be preferred over stemmed versions. Of course, you could also write your own query parser that directs quoted phrases to the non-stemmed field only.
We have a database of movies and series, and as the data comes from many sources of varying reliability, we'd like to be able to do fuzzy string matching on the titles of episodes. We are using Solr for search in our application, but the default matching mechanisms operate at the word level, which is not good enough for short strings like titles.
I have used n-gram approximate matching in the past, and I was very happy to find that Lucene (and Solr) supports something like this out of the box. Unfortunately, I haven't been able to configure it correctly.
I assumed that I need a special field type for this, so I added the
following field type to my schema.xml:
<fieldType
name="trigrams"
stored="true"
class="solr.StrField">
<analyzer type="index">
<tokenizer
class="solr.analysis.NGramTokenizerFactory"
minGramSize="3"
maxGramSize="5"
/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
and changed the appropriate field in the schema to:
<field name="title" type="trigrams"
indexed="true" stored="true" multiValued="false" />
However, this is not working as I expected. The query analysis looks
correct, but I don't get any results, which makes me believe that
something goes wrong at index time (i.e. the title is indexed like a
default string field instead of a trigram field).
The query I am trying is something like
title:"guy walks into a psychiatrist office"
(with a typo or two) and it should match "Guy Walks into a Psychiatrist Office".
(I am not really sure if the query is correct.)
Moreover, I would in fact like to do something more. I'd like to
lowercase the string, remove all punctuation marks and spaces, remove
English stopwords and THEN change the string into trigrams. However,
the filters are applied only after the string has been tokenized...
Thanks in advance for your answers.
To answer the last part of your question: Solr also has an n-gram filter. So you should not use the n-gram tokenizer (but one like "WhitespaceTokenizer", for example), apply all pre-n-gram filters, and then add this one:
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="3" />
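The order of operations in such a chain (tokenize, normalize, drop stopwords, then build n-grams) can be sketched in Python. This is an illustrative model rather than Solr code: the stopword list is a tiny made-up sample, and `analyze` is a hypothetical name. Note that grams are built per token here, as they would be with a filter placed after a whitespace tokenizer, not across the whole string.

```python
import re

STOPWORDS = {"a", "an", "the", "into", "of"}  # tiny illustrative list, not Solr's


def analyze(text, min_gram=2, max_gram=3):
    """Whitespace-tokenize, normalize each token, drop stopwords, then emit n-grams."""
    grams = []
    for token in text.split():
        token = re.sub(r"[^\w]", "", token.lower())  # lowercase, strip punctuation
        if not token or token in STOPWORDS:
            continue
        for n in range(min_gram, max_gram + 1):
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


print(analyze("Guy!"))  # ['gu', 'uy', 'guy']
```

Because the n-gram step runs last, it only ever sees cleaned tokens, which is the ordering the question asks for.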
The solution turned out to be very simple: AND was set as the default operator, and if any of the ngrams didn't match, the whole query failed. So, it was sufficient to add:
<solrQueryParser defaultOperator="OR" />
in my schema definition.
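Why the default operator matters for n-gram matching can be shown with a small Python sketch. It is a simplified model (character trigrams over the whole lowercased string, set operations instead of a real query), and the misspelled query string is an invented example of "a typo or two".

```python
def trigrams(text):
    """Return the set of character trigrams of the lowercased text."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


indexed = trigrams("Guy Walks into a Psychiatrist Office")
query = trigrams("guy walks into a psychiatrst office")  # typo: missing 'i'

# AND semantics: every query gram must be present in the indexed title.
matches_and = query <= indexed
# OR semantics: a single shared gram is enough; relevance ranking does the rest.
matches_or = bool(query & indexed)

print(matches_and)  # False
print(matches_or)   # True
```

The typo introduces grams such as "trs" that the indexed title does not contain, so requiring all grams fails, while OR still finds the title through the many grams the two strings share.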