Syntax for fuzzy search in Solr 4 - search

I have been trying to enable fuzzy searching for our Solr 4.1 powered search, but all I can find online is:
1- how to do it in the default Lucene query syntax, which doesn't help in my case,
2- that dismax does not support it, and
3- that edismax is going to, or should, support it.
However, I can't find any documentation on how to use it in the edismax query format, not even on the default edismax page for query syntax, which uses the ~ operator for defining the slop factor instead. I did try specifying it in the qf parameter as per some links online, but that didn't work. I am also assuming here that Solr 4.1 uses edismax by default.
So if someone knows how it's supposed to work, or whether it's even supported, any pointers would be greatly appreciated.

This worked for me.
Try adding this fieldType to your schema.xml.
It is case-insensitive and meant for Spanish words, but adapting it to your language should work for you too.
<fieldType name="text_es_fuzzy" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" enablePositionIncrements="true"/>
</analyzer>
</fieldType>
After this, when you perform a query, just add "~0.5" at the end of the search string. The "0.5" is a custom value you can choose; it determines how "fuzzy" your search is. The closer the value is to zero, the "fuzzier" the results will be, and vice versa.
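For what it's worth, here is a minimal sketch of how such a query string could be put together on the client side. It assumes that edismax accepts the Lucene-style ~ operator on individual terms, and that in Solr 4.x the number after ~ can also be written as a whole edit distance of 0, 1 or 2 (fractional values such as 0.5 being the older similarity form). The field name text_es and both helper functions are made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

def fuzzy_terms(user_input, max_edits=1):
    """Append ~N to every whitespace-separated term (N = allowed edits)."""
    if max_edits not in (0, 1, 2):
        raise ValueError("edit distance must be 0, 1 or 2")
    return " ".join(f"{term}~{max_edits}" for term in user_input.split())

def build_query_string(user_input, fields, max_edits=1):
    """URL-encode the parameters for a hypothetical /select request."""
    params = {
        "defType": "edismax",
        "q": fuzzy_terms(user_input, max_edits),
        "qf": " ".join(fields),  # hypothetical field names
    }
    return urlencode(params)

query_string = build_query_string("camisa roja", ["text_es"])
print(parse_qs(query_string)["q"][0])
```

Note that this does not escape query-syntax characters (+, -, :, quotes and so on) in the user input; that would have to happen before the ~ suffix is appended.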

Related

Can Solr add a field and use copyField without reindexing?

I use Solr 6.1 and have just finished indexing my documents.
For some reason I now need searches to be case-insensitive.
I found a solution that uses copyField to make this work,
but it requires adding a new field type, like the one below:
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
Does anybody know whether I can use this solution after indexing has already completed?
Or is there another solution that can fix it?
No. You'll have to reindex the content (at least the field in question) to change the case of the generated tokens. You can do this either from the original source, or write a script that retrieves each document from Solr and re-indexes the single field - as long as all your fields are set as stored. If they're not stored (and do not have docValues that can be used in place of a stored value), you'll have to reindex. Solr has no way to get the original text from the processed tokens.
Also remember that a KeywordTokenizer will keep the value as a single token and not split on whitespace etc.
Make sure you get the correct result before indexing by using the Analysis page under Solr's admin interface.
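If you go the "retrieve each document and send it back" route, the per-document step could look roughly like this. It is only a sketch under the assumption that every field is stored; the field names and the sample document are invented, and fetching from /select and posting to /update are left out.

```python
import json

def prepare_for_reindex(doc, copy_targets=()):
    """Return a copy of a stored Solr document that is safe to send back.

    Drops _version_ (so no optimistic-concurrency check is triggered) and
    any copyField destinations, which Solr regenerates on its own.
    """
    skip = {"_version_", *copy_targets}
    return {name: value for name, value in doc.items() if name not in skip}

# A document as it might come back from a /select request (made-up data).
fetched = {
    "id": "42",
    "title": "Hello World",
    "title_ci": "Hello World",
    "_version_": 1550000000000000000,
}

payload = json.dumps([prepare_for_reindex(fetched, copy_targets=["title_ci"])])
print(payload)  # body for a POST to the core's /update handler
```

Leaving out the copyField destination matters if that field is stored: sending it back would add the value a second time on top of what copyField produces.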

Solr query to emulate an exact match

I use Solr to run short queries against a brand field. I want an exact match, but understand that this is impossible in Lucene.
I tried some hardcoded queries just for testing:
myBrand:2\+2 and myBrand:\+
I get 2:2 back. It seems the and condition is not working, or at least not working the way I expect?
I also tried a filter query (fq):
myBrand:2\+2 with a fq of myBrand:\+
Now there are no results at all.
I use Solr 5 and run all tests in the Solr web interface.
Is there some method to get the best match for short brands, nicknames, etc., when I don't need much in the way of heuristics and want a strict exact match? Or do I have to filter the results in my own code after the Solr query has executed?
UPDATED
Changes to the schema resolved my issue.
Queries such as 2+2 now work like a charm.
<fieldType name="text_general" class="solr.TextField" multiValued="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*"/>
</analyzer>
</fieldType>
Equal matches are not impossible. You just have to use a field type that retains the exact value, such as StrField or a TextField with KeywordTokenizer (if you want to make it case insensitive).
Matching would be field:"Exact value" or any regular query syntax. The reason why Solr/Lucene wouldn't do an "exact" match is that the regular TextField definitions in the example break the text into separate tokens.
For filters you usually want the exact value (both for a facet and for the fq, so you can filter the results exactly), so this is not a Lucene limitation, but something introduced by the type of fields you're working with.
The solution might be to have the same content in several fields: one to search against for regular text queries, and one to filter and facet on. Use copyField to get the same values into several fields from the same source field.
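To illustrate why a tokenized field returned 2:2 for a 2+2 query, here is a toy comparison. The two functions are crude imitations written for this example, not Lucene's actual tokenizers: one keeps only runs of letters and digits, the other keeps the whole value as a single lowercased token the way a KeywordTokenizer plus LowerCaseFilter would.

```python
import re

def word_tokens(value):
    """Crude imitation of a word-splitting analyzer: keep letter/digit runs."""
    return re.findall(r"[0-9a-z]+", value.lower())

def keyword_token(value):
    """Imitation of KeywordTokenizer + LowerCaseFilter: one token, whole value."""
    return [value.strip().lower()]

for brand in ("2+2", "2:2"):
    print(brand, word_tokens(brand), keyword_token(brand))
```

With word splitting both brands end up as the same two tokens, so they cannot be told apart; with a single keyword token they stay distinct.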

Search of integer is slow when compared to string in Solr

I have a file which has integers and strings delimited by pipes, like below:
abc|182|2rt|jd
yre|123|7yd|op
ifs|132|24d|oe
I have created a new field type, pipedelimited, as below:
<fieldType name="pipedelimited" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="|"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
The problem is that when I search for an integer, the search takes far too long to respond,
but if I search for a string, the response comes back in milliseconds.
Please help me understand the reason for this.
Both of your search examples are text as far as Solr is concerned. So, they should be treated identically.
So, either you missed something in your description of the situation, or there is something very odd about particular records. Have you tried searching for string and "integer" values that are supposed to return the same record? Do you get the same speed? You should.
Try using a debug flag and see what you can notice differently.
Basically, side by side comparisons should be evaluated by trying to make all other parameters as equal as possible. And then focusing on the visible differences.
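One thing that might be worth ruling out, though this is a guess rather than a diagnosis: in regular-expression syntax an unescaped | is the alternation operator, so pattern="|" does not mean "split on a literal pipe". The sketch below uses Python's re module (3.7 or later) as a stand-in for the Java regex engine that Solr uses. The two agree on what an unescaped | means, but the exact tokens Solr ends up producing may differ.

```python
import re

line = "abc|182|2rt|jd"

# Unescaped: "empty OR empty", which matches the empty string at every position.
unescaped = re.split("|", line)

# Escaped: a literal pipe character.
escaped = re.split(r"\|", line)

print(unescaped)
print(escaped)
```

The Analysis page will show which tokens the field type really produces. If the values are not being split on the pipe, escaping it as \| in the schema would be the thing to try.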

Amazon like search with Solr

We have an online store where we use Solr for searching products. The basic setup works fine, but currently it's lacking some features. I looked up some online shops like Amazon, and I liked the features they are offering. So I thought, how could I configure Solr to offer some of the features to our end users.
Our product data consists of kinda standard data for products like
title of a product
description
a product is in multiple categories and sub-categories
a product can have multiple variants with options, like a T-Shirt in red, blue, green, S, M, L, XL... or an iPad with 16GB, 32GB...
a product has a brand
a product has a retailer
For now, we are using this schema file to index and perform queries on Solr:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
</analyzer>
</fieldType>
EdgeNGramFilterFactory indexes a word like shirt into sh, shi, shir, shirt
WordDelimiterFilterFactory breaks up words like wi-fi into wi, fi, wifi
PorterStemFilterFactory works well for stemming
PhoneticFilterFactory provides kinda fuzzy search
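To make the EdgeNGramFilterFactory line concrete, here is a small sketch of what edge n-gram generation does with the minGramSize=2 and maxGramSize=15 settings from the schema above. The function is an illustration only, not Solr's implementation.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Return the prefixes of token with lengths min_gram..max_gram."""
    longest = min(len(token), max_gram)
    return [token[:size] for size in range(min_gram, longest + 1)]

print(edge_ngrams("shirt"))  # ['sh', 'shi', 'shir', 'shirt']
```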
One problem is that the fuzzy search doesn't work very well. If I search for the book Inferno and misspell it as Infenro, the search doesn't return any results. I've read about the SpellCheckComponent (http://wiki.apache.org/solr/SpellCheckComponent), but I'm not sure if that's the best way to do a fuzzy search, or a Did you mean? feature.
The second problem is, that it should be possible, to search for Shirts red to find red T-Shirts (where red is an option value of the option type color) or to search for woman shoes or adidas shoes woman. Is it possible to do this with Solr?
And the third problem is, that I'm not sure which of the tokenizer and filters inside the schema.xml are a good choice to achieve such features.
I hope someone has used such features with solr, and can help me in this case. Thx!
EDIT
Here is some data, that we store inside Solr:
<doc>
<str name="id">572</str>
<arr name="taxons">
<str>cat1</str>
<str>cat1/cat2</str>
<str>cat1/cat2/cat3</str>
<str>cat1/cat4</str>
</arr>
<arr name="options">
<str>color_blue</str>
<str>color_red</str>
<str>size_39</str>
<str>size_40</str>
</arr>
<int name="count_on_hand">321</int>
<arr name="name_text">
<str>Riddle-Shirt Tech</str>
</arr>
<arr name="description_text">
<str>The Riddle Shirt Tech Men's Hoodie features signature details, along with ultra-lightweight fleece for optimum warmth.</str>
</arr>
<arr name="brand_text">
<str>Riddle</str>
</arr>
<arr name="retailer_text">
<str>Supershop</str>
</arr>
</doc>
I'm not sure if the options key-value pairs are stored in a proper way, but that's the first approach I came up with.
Disclaimer:
I've made some assumptions about the schema, so please check the gist with the example schema and data - https://gist.github.com/rchukh/7385672#file-19854599
E.g. for taxons I've used special text field with PathHierarchyTokenizerFactory
First problem (fuzzy search):
The reason Inferno doesn't match Infenro is that it's not a phonetic misspelling. The phonetic filter is not meant for that kind of match.
If you're interested in some details - here is a pretty good article about the algorithms supported by lucene/solr: http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html
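As a rough illustration of the point about phonetic keys, here is classic Soundex, which is much simpler than the Double Metaphone encoder in the schema and is used here only as a stand-in. Swapping two consonants changes the key, so the misspelling gets a different code from the correct word.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Classic four-character Soundex key (first letter plus three digits)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset the previous code
    return (result + "000")[:4]

print(soundex("Inferno"), soundex("Infenro"))  # I516 I515
```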
You will probably be interested in the SpellCheck Collate feature
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate
From wiki:
A collation is the original query string with the best suggestions for
each term replaced in it. If spellcheck.collate is true, Solr will
take the best suggestion for each token (if it exists) and construct a
new query from the suggestions. For example, if the input query was
"jawa class lording" and the best suggestion for "jawa" was "java" and
"lording" was "loading", then the resulting collation would be "java
class loading".
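The collation idea in that quote can be mimicked in a few lines. This is a toy model of the behaviour described in the wiki text, not Solr's code, and the suggestion lists are invented (with the best suggestion first).

```python
def collate(query, suggestions):
    """Replace each term with its best suggestion, if it has one."""
    corrected = []
    for term in query.split():
        candidates = suggestions.get(term, [])
        corrected.append(candidates[0] if candidates else term)
    return " ".join(corrected)

# Invented suggestion lists, best suggestion first.
suggestions = {"jawa": ["java", "jaws"], "lording": ["loading"]}
print(collate("jawa class lording", suggestions))  # java class loading
```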
You can also leverage the fuzzy search feature based on the distance algorithms (but as I understand it's more useful for phrase searches, e.g. proximity search).
Here's an example from solr wiki:
roam~
This search will match terms like foam and roams. It will also match the word "roam" itself.
So Infenro~ in query should match Inferno in index... but my bet is to go with "google-like" approach:
That is - notify the user that the following results are for the corrected spelling, but allow them to search for the original spelling as well (as it happens, sometimes the user may be right, and the machine may be wrong).
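To see how far apart Infenro and Inferno are in edit-distance terms, here is a small sketch. It computes plain Levenshtein distance and, optionally, the variant that counts swapping two adjacent characters as a single edit. As far as I know, Lucene 4's fuzzy queries use the transposition-aware variant by default and allow at most 2 edits, but treat that as an assumption to verify.

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance; optionally count adjacent swaps as one edit."""
    two_rows_back = None
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            best = min(previous[j] + 1,          # deletion
                       current[j - 1] + 1,       # insertion
                       previous[j - 1] + cost)   # substitution or match
            if (transpositions and i > 1 and j > 1
                    and char_a == b[j - 2] and a[i - 2] == char_b):
                best = min(best, two_rows_back[j - 2] + 1)  # adjacent swap
            current.append(best)
        two_rows_back, previous = previous, current
    return previous[-1]

print(edit_distance("Infenro", "Inferno"))                       # 2
print(edit_distance("Infenro", "Inferno", transpositions=True))  # 1
```

Either way the misspelling is within two edits, which is consistent with Infenro~ being able to match Inferno.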
Second problem
This problem can be solved with edismax, e.g. if you want to search by name_text AND options:
q=shirt%20AND%20red&defType=edismax&qf=name_text%20options
Here you can see the explain plan of this query - http://explain.solr.pl/explains/w1qb7zie
The issue with storing options as multivalued field with separator is that the search query will start matching the key, e.g. "color".
For example - the following request:
q=shirt%20AND%20color&defType=edismax&qf=name_text%20options
will match all shirts that have "color" option - http://explain.solr.pl/explains/pn6fbpfq
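One possible way around this, offered as a suggestion rather than something from the gist, is to split each key_value string at index time and put the values into one field per option type, so the key never becomes a searchable token. The option_ prefix and _ss suffix below are hypothetical field names and assume a matching multivalued dynamic field exists in the schema.

```python
from collections import defaultdict

def split_options(options, separator="_"):
    """Group 'key_value' strings into {key: [values]}."""
    grouped = defaultdict(list)
    for option in options:
        key, _, value = option.partition(separator)
        grouped[key].append(value)
    return dict(grouped)

def to_solr_fields(options, prefix="option_", suffix="_ss"):
    """Build one multivalued field per option key (field names are hypothetical)."""
    return {f"{prefix}{key}{suffix}": values
            for key, values in split_options(options).items()}

print(to_solr_fields(["color_blue", "color_red", "size_39", "size_40"]))
```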
Third problem
I have some doubts about using any FilterFactory after stemmers, but can't provide any meaningful information at the moment.

Searching names with Apache Solr

I've just ventured into the seemingly simple but extremely complex world of searching. For an application, I am required to build a search mechanism for searching users by their names.
After reading numerous posts and articles including:
How can I use Lucene for personal name (first name, last name) search?
http://dublincore.org/documents/1998/02/03/name-representation/
what's the best way to search a social network by prioritizing a users relationships first?
http://www.gossamer-threads.com/lists/lucene/java-user/120417
Lucene Index and Query Design Question - Searching People
Lucene Fuzzy Search for customer names and partial address
... and a few others I cannot find at the moment, and after getting at least indexing and basic search working on my machine, I have devised the following scheme for user searching:
1) Have a first, second and third name field and index those with Solr
2) Use edismax as the requestParser for multi column searching
3) Use a combination of normalization filters such as: transliteration, latin-to-ascii conversion, etc.
4) Finally use fuzzy search
Evidently, being very new to this I am unsure if the above is the best way to do it and would like to hear from experienced users who have a better idea than me in this field.
I need to be able to match names in the following ways:
1) Accent folding: Jorn matches Jörn and vice versa
2) Alternative spellings: Karl matches Carl and vice versa
3) Shortened representations (I believe I do this with the SynonymFilterFactory): Sue matches Susanne, etc.
4) Levenshtein matching: Jonn matches John, etc.
5) Soundex matching: Elin and Ellen
Any guidance, criticisms or comments are very welcome. Please let me know if this is possible ... or perhaps I'm just day-dreaming. :)
EDIT
I must also add that I also have a fullname field in case some people have long names, as an example from one of the posts: Jon Paul or Del Carmen should also match Jon Paul Del Carmen
And since this is a new project, I can modify the schema and architecture any way I see fit so there are very limited restrictions.
It sounds like you are catering for a corpus with searches that you need to match very loosely?
If you are doing that you will want to choose your fields and set different boosts to rank your results.
So have separate "copied" fields in solr:
one field for exact full name (with filters)
multivalued field with filters ASCIIFolding, Lowercase...
multivalued field with the SynonymFilterFactory ASCIIFolding, Lowercase...
PhoneticFilterFactory (with Caverphone or Double-Metaphone)
See Also: more non-english Soundex discussion
Synonyms for names, I don't know if there is a public synonym db available.
Fuzzy searching: I've not found it useful. It uses Levenshtein distance.
Other filters and indexing strategies give more relevant search results.
Unicode characters in names can be handled with the ASCIIFoldingFilterFactory
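To show what accent folding amounts to, here is a rough Python equivalent using only the standard library. It is an approximation: as far as I know, ASCIIFoldingFilterFactory also maps characters that have no Unicode decomposition (ø, for example), which this sketch leaves untouched.

```python
import unicodedata

def fold_accents(text):
    """Decompose characters and drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_accents("Jörn"))  # Jorn
```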
You are describing solutions up front for expected use cases.
If you want quality results, plan on tuning your Search Relevance
This tuning will be especially valuable when attempting to match on synonyms like MacDonald and McDonald, or Carl and Karl (both pairs have a Levenshtein distance of 1).
Found a nickname db, not sure how good:
http://www.peacockdata2.com/products/pdnickname/
Note that it's not free.
The answer in another post is pretty good:
Training solr to recognize nicknames or name variants
<fieldType name="name_en" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="english_names.txt" ignoreCase="true" expand="true"/>
</analyzer>
</fieldType>
For phonetic name search you might also try the Beider-Morse Filter which works pretty well if you have a mixture of names from different countries.
If you want to use it with a typeahead feature, combine it with an EdgeNGramFilter:
<fieldType name="phoneticNames" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
</analyzer>
</fieldType>
We created a simple 'name' field type that allows mixing both 'key' (e.g., SOUNDEX) and 'pairwise' portions of the answers above.
Here's the overview:
at index time, fields of the custom type are indexed into a set of (sub) fields with respective values used for high-recall matching different kinds of variations
Here's the core of its implementation...
List<IndexableField> createFields(SchemaField field, String name) {
Collection<FieldSpec> nameFields = deriveFieldsForName(name);
List<IndexableField> docFields = new ArrayList<>();
for (FieldSpec fs : nameFields) {
docFields.add(new Field(fs.getName(), fs.getStringValue(),
fs.getLuceneField()));
}
docFields.add(createDocValues(field.getName(), new Name(name)));
return docFields;
}
The heart of this is deriveFieldsForName(name) in which you can include 'keys' from PhoneticFilters, LowerCaseFolding, etc.
at query time, first a custom Lucene query is produced that has been tuned for recall and that uses the same fields as index time
Here's the core of its implementation...
public Query getFieldQuery(QParser parser, SchemaField field, String externalVal) {
Name name = parseNameString(externalVal, parser.getParams());
QuerySpec querySpec = buildQuery(name);
return querySpec.accept(new SolrQueryVisitor(field.getName()));
}
The heart of this is the buildQuery(name) method which should produce a query that is aware of deriveFieldsForName(name) above so for a given query name it will find good candidate names.
then second, Solr’s Rerank feature is used to apply a high-precision re-scoring algorithm to reorder the results
Here's what this looks like in your query...
&rq={!myRerank reRankQuery=$rrq} &rrq={!func}myMatch(fieldName, "John Doe")
The content of myMatch could have a pairwise Levenshtein or Jaro-Winkler implementation.
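The two-stage idea (a cheap high-recall lookup, then a precise pairwise re-scoring) can be sketched outside Solr as well. This is not the implementation described above: the key function is a deliberately simplistic invention, and difflib's SequenceMatcher from the Python standard library stands in for a Levenshtein or Jaro-Winkler scorer.

```python
from difflib import SequenceMatcher

def candidate_key(name):
    """Cheap, recall-oriented key: lowercase initials of each name part."""
    return "".join(part[0] for part in name.lower().split())

def pairwise_score(query, candidate):
    """Stand-in for a precise pairwise matcher such as Jaro-Winkler."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()

def search(query, indexed_names, top_n=3):
    # Stage 1: high recall, keep every name that shares the key.
    key = candidate_key(query)
    candidates = [name for name in indexed_names if candidate_key(name) == key]
    # Stage 2: high precision, reorder the candidates by pairwise similarity.
    ranked = sorted(candidates,
                    key=lambda name: pairwise_score(query, name),
                    reverse=True)
    return ranked[:top_n]

names = ["Jane Dole", "John Doe", "Jon Doe", "Mary Smith"]
print(search("John Doe", names))
```

The design point is that the expensive pairwise comparison only runs on the handful of candidates that survive the first stage, which is what the rerank query does inside Solr.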
N.B. Our own full implementation uses proprietary code for deriveFieldsForName, buildQuery, and myMatch (see http://www.basistech.com/text-analytics/rosette/name-indexer/) to handle more kinds of variations than the ones mentioned above (e.g., missing spaces, cross-language).

Resources