How to improve proximity search in Solr

When I search for company in Solr, the results should also contain similar variants such as "com pany", "comp-any" and "company". How can I get that using Solr?

For the use case you provided, you can use n-grams.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="7"/>
</analyzer>
This filter breaks the tokens into parts of the specified sizes. For example, the word "company" will produce the following tokens: "com", "omp", "mpa", "pan", "any", "comp", "ompa", "mpan", "pany", "compa", "ompan", "mpany", "compan", "ompany", "company"
TAKE CARE: this filter may degrade performance and make your index grow dramatically, possibly running Solr out of memory depending on the size of the fields you apply it to (e.g. if you use it on extracted content). So choose wisely which field to use it on :)
Here is some useful information about it, with examples:
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-N-GramFilter
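For reference, a minimal sketch of how that analyzer could be wired into a field type and applied via copyField, so the original field stays untouched (text_ngram, name and name_ngram are placeholder names, not required ones):
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="7"/>
</analyzer>
</fieldType>
<!-- index an n-grammed copy of whatever field holds the searched text -->
<field name="name_ngram" type="text_ngram" indexed="true" stored="false"/>
<copyField source="name" dest="name_ngram"/>
A query for "com pany" then produces grams that all occur in the indexed grams of "company", so the document matches.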

Related

Using query analyzer in Solr for exact search on words with special characters

I am trying to search in Solr for an exact match. The problem is this kind of data:
test score
test-score
test_score
test+score
If I use an exact search with the query test score, it returns only one record.
I need to find all four.
One way is to copy this field and replace these special characters in the copy, which creates an additional index field but keeps the original content saved separately.
<fieldType name="text_exact_dehyphen" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s*-\s*" replacement=" "/>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
Is there any way I can use my exact query to also match all the variants that contain special characters?
Thanks
Use copyField to copy the content into multiple fields - one that doesn't change the content (i.e. just a string field, or a TextField with a KeywordTokenizer if you need to lowercase the content) and one field with a StandardTokenizer or similar, allowing you to match "test score" against "test+score" etc.
You can then weight these fields differently by using the edismax query parser, and using qf to weigh the fields: qf=field_exact^5 field will score an exact match five times higher than matches in the other field.
Use q.op=AND to ensure that all terms are present in the resulting document.
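A rough sketch of what that could look like in the schema and at query time, reusing the field and field_exact names from above as placeholders:
<fieldType name="text_exact" class="solr.TextField">
<analyzer>
<!-- keep the whole value as a single lowercased token -->
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_general" class="solr.TextField">
<analyzer>
<!-- StandardTokenizer splits "test-score" and "test+score" into "test", "score" -->
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<field name="field" type="text_general" indexed="true" stored="true"/>
<field name="field_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="field" dest="field_exact"/>
A request would then carry parameters along the lines of:
q=test score&defType=edismax&qf=field_exact^5 field&q.op=AND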

Solr query that emulates exact match

I use Solr to run a short query on a brand. I want an exact match, but I understand that this is impossible in Lucene.
I tried some hardcoded queries just for testing:
myBrand:2\+2 and myBrand:\+
I get 2:2 back as well; it seems the AND condition is not working, or at least not the way I expect.
Also, I tried fq:
myBrand:2\+2 with a fq of myBrand:\+
Now there are no results at all.
I use Solr 5 and run all tests in the Solr admin web interface.
Is there some way to get the best match for short brand names, nicknames and so on, when I don't need heavy heuristics and want strict equal matching? Or do I have to filter the results in my own code after the Solr query has executed?
UPDATED
Changes to the schema resolved my issues.
Now queries like 2+2 work like a charm.
<fieldType name="text_general" class="solr.TextField" multiValued="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*"/>
</analyzer>
</fieldType>
Equal matches are not impossible. You just have to use a field type that retains the exact value, such as StrField or a TextField with KeywordTokenizer (if you want to make it case insensitive).
Matching would be field:"Exact value" or any regular query syntax. The reason why Solr/Lucene doesn't do an "exact" match by default is that the regular TextField definitions in the example schema break the text into separate tokens.
For filters you usually want the exact value (both for a facet and for the fq, so you can filter the results exactly), so this is not a Lucene limitation, but something introduced by the type of fields you're working with.
The solution might be to have the same content in many fields (one to search against for regular text queries) and one to filter and facet on. Use copyField to get the same values into several fields from the same source field.
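As a rough illustration (myBrand_exact is an invented field name, and the string type is usually already present in the default schema):
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<!-- verbatim copy of the brand, used only for filtering and faceting -->
<field name="myBrand_exact" type="string" indexed="true" stored="false" docValues="true"/>
<copyField source="myBrand" dest="myBrand_exact"/>
A filter on the exact value would then be something like fq=myBrand_exact:"2+2" (remember to URL-encode the + as %2B if you paste it into a browser), and facet.field=myBrand_exact gives you one bucket per exact brand value.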

Amazon-like search with Solr

We have an online store where we use Solr for searching products. The basic setup works fine, but it is currently lacking some features. I looked at some online shops like Amazon, and I liked the features they offer. So I wondered how I could configure Solr to offer some of those features to our end users.
Our product data consists of fairly standard product attributes, like:
title of a product
description
a product is in multiple categories and sub-categories
a product can have multiple variants with options, like a T-Shirt in red, blue, green, S, M, L, XL... or an iPad with 16GB, 32GB...
a product has a brand
a product has a retailer
For now, we are using this schema file to index and perform queries on Solr:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
</analyzer>
</fieldType>
EdgeNGramFilterFactory indexes a word like shirt into sh, shi, shir, shirt
WordDelimiterFilterFactory breaks up words like wi-fi into wi, fi, wifi
PorterStemFilterFactory works well for stemming
PhoneticFilterFactory provides a kind of fuzzy search
One problem is that the fuzzy search doesn't work very well. If I search for the book Inferno and misspell it as Infenro, the search doesn't return any results. I've read about the SpellCheckComponent (http://wiki.apache.org/solr/SpellCheckComponent), but I'm not sure if that's the best way to do a fuzzy search or a "Did you mean?" feature.
The second problem is that it should be possible to search for Shirts red to find red T-Shirts (where red is an option value of the option type color), or to search for woman shoes or adidas shoes woman. Is it possible to do this with Solr?
And the third problem is that I'm not sure which of the tokenizers and filters inside the schema.xml are a good choice to achieve such features.
I hope someone has used such features with Solr and can help me in this case. Thx!
EDIT
Here is some data, that we store inside Solr:
<doc>
<str name="id">572</str>
<arr name="taxons">
<str>cat1</str>
<str>cat1/cat2</str>
<str>cat1/cat2/cat3</str>
<str>cat1/cat4</str>
</arr>
<arr name="options">
<str>color_blue</str>
<str>color_red</str>
<str>size_39</str>
<str>size_40</str>
</arr>
<int name="count_on_hand">321</int>
<arr name="name_text">
<str>Riddle-Shirt Tech</str>
</arr>
<arr name="description_text">
<str>The Riddle Shirt Tech Men's Hoodie features signature details, along with ultra-lightweight fleece for optimum warmth.</str>
</arr>
<arr name="brand_text">
<str>Riddle</str>
</arr>
<arr name="retailer_text">
<str>Supershop</str>
</arr>
</doc>
I'm not sure if the options key-value pairs are stored in a proper way, but that's the first approach I came up with.
Disclaimer:
I've made some assumptions about the schema, so please check the gist with the example schema and data - https://gist.github.com/rchukh/7385672#file-19854599
E.g. for taxons I've used a special text field with the PathHierarchyTokenizerFactory.
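As an illustration of that idea (the gist above has the actual definition; this is only a rough sketch with assumed names):
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<!-- "cat1/cat2/cat3" is indexed as "cat1", "cat1/cat2" and "cat1/cat2/cat3" -->
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
</analyzer>
<analyzer type="query">
<!-- query values are kept whole, so fq=taxons:"cat1/cat2" matches the whole subtree -->
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
<field name="taxons" type="text_path" indexed="true" stored="true" multiValued="true"/>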
First problem (fuzzy search):
The reason why Inferno doesn't match Infenro is that it's not a phonetic misspelling. The phonetic filter is not meant for that kind of match.
If you're interested in some details - here is a pretty good article about the algorithms supported by lucene/solr: http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html
You will probably be interested in the SpellCheck Collate feature
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate
From wiki:
A collation is the original query string with the best suggestions for each term replaced in it. If spellcheck.collate is true, Solr will take the best suggestion for each token (if it exists) and construct a new query from the suggestions. For example, if the input query was "jawa class lording" and the best suggestion for "jawa" was "java" and "lording" was "loading", then the resulting collation would be "java class loading".
You can also leverage the fuzzy search feature based on edit-distance algorithms (although, as I understand it, the same ~ syntax is more often discussed in the context of phrase searches, i.e. proximity search).
Here's an example from solr wiki:
roam~
This search will match terms like foam and roams. It will also match the word "roam" itself.
So Infenro~ in the query should match Inferno in the index... but my bet would be to go with a "Google-like" approach:
That is, notify the user that the following results are for the corrected spelling, but also allow them to search with the original spelling (as it happens, sometimes the user may be right and the machine may be wrong).
Second problem
This problem can be solved with edismax, e.g. if you want to search by name_text AND options:
q=shirt%20AND%20red&defType=edismax&qf=name_text%20options
Here you can see the explain plan of this query - http://explain.solr.pl/explains/w1qb7zie
The issue with storing options as a multivalued field with a key_value separator is that the search query will also start matching the key part, e.g. "color".
For example - the following request:
q=shirt%20AND%20color&defType=edismax&qf=name_text%20options
will match all shirts that have "color" option - http://explain.solr.pl/explains/pn6fbpfq
Third problem
I have some doubts about using any FilterFactory after stemmers, but I can't provide more meaningful information at the moment.

Searching names with Apache Solr

I've just ventured into the seemingly simple but extremely complex world of searching. For an application, I am required to build a search mechanism for searching users by their names.
After reading numerous posts and articles including:
How can I use Lucene for personal name (first name, last name) search?
http://dublincore.org/documents/1998/02/03/name-representation/
what's the best way to search a social network by prioritizing a users relationships first?
http://www.gossamer-threads.com/lists/lucene/java-user/120417
Lucene Index and Query Design Question - Searching People
Lucene Fuzzy Search for customer names and partial address
... and a few others I cannot find at the moment. After getting at least indexing and basic search working on my machine, I have devised the following scheme for user searching:
1) Have a first, second and third name field and index those with Solr
2) Use edismax as the requestParser for multi column searching
3) Use a combination of normalization filters such as transliteration, Latin-to-ASCII conversion, etc.
4) Finally use fuzzy search
Evidently, being very new to this I am unsure if the above is the best way to do it and would like to hear from experienced users who have a better idea than me in this field.
I need to be able to match names in the following ways:
1) Accent folding: Jorn matches Jörn and vice versa
2) Alternative spellings: Karl matches Carl and vice versa
3) Shortened representations (I believe I do this with the SynonymFilterFactory): Sue matches Susanne, etc.
4) Levenshtein matching: Jonn matches John, etc.
5) Soundex matching: Elin and Ellen
Any guidance, criticisms or comments are very welcome. Please let me know if this is possible ... or perhaps I'm just day-dreaming. :)
EDIT
I must also add that I have a fullname field in case some people have long names; as an example from one of the posts, Jon Paul or Del Carmen should also match Jon Paul Del Carmen.
And since this is a new project, I can modify the schema and architecture any way I see fit so there are very limited restrictions.
It sounds like you are catering for a corpus where searches need to match very loosely?
If you are doing that, you will want to choose your fields and set different boosts to rank your results.
So have separate "copied" fields in Solr (see the sketch after this list):
one field for exact full name (with filters)
multivalued field with filters such as ASCIIFolding, Lowercase...
multivalued field with the SynonymFilterFactory, plus ASCIIFolding, Lowercase...
PhoneticFilterFactory (with Caverphone or Double-Metaphone)
See Also: more non-english Soundex discussion
Synonyms for names: I don't know if there is a public synonym DB available.
Fuzzy searching: I've not found it useful; it uses Levenshtein distance.
Other filters and indexing choices give better, more search-relevant results.
Unicode characters in names can be handled with the ASCIIFoldingFilterFactory
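A rough sketch of the copied fields described in the list above (all field and type names are placeholders):
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
<field name="name_exact" type="string" indexed="true" stored="true"/>
<field name="name_folded" type="text_folded" indexed="true" stored="false" multiValued="true"/>
<copyField source="name_exact" dest="name_folded"/>
<!-- the synonym and phonetic copies follow the same pattern, using the
SynonymFilterFactory and PhoneticFilterFactory field types shown in the answers below -->
The boosting then happens at query time, e.g. defType=edismax&qf=name_exact^10 name_folded^2.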
You are describing solutions up front for expected use cases.
If you want quality results, plan on tuning your Search Relevance
This tuning will be especially valuable, when attempting to match on synonyms, like MacDonald and McDonald (which has a larger Levenshtein distance than Carl and Karl).
Found a nickname DB, not sure how good it is:
http://www.peacockdata2.com/products/pdnickname/
Note that it's not free.
The answer in another post is pretty good:
Training solr to recognize nicknames or name variants
<fieldType name="name_en" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="english_names.txt" ignoreCase="true" expand="true"/>
</analyzer>
</fieldType>
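For reference, english_names.txt is just a standard Solr synonym file; a few made-up lines to illustrate the format (with expand="true", every name in a group expands to all the others):
# english_names.txt - comma-separated groups of equivalent names
sue, susan, susanne
carl, karl
bill, will, william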
For phonetic name search you might also try the Beider-Morse Filter which works pretty well if you have a mixture of names from different countries.
If you want to use it with a typeahead feature, combine it with an EdgeNGramFilter:
<fieldType name="phoneticNames" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
</analyzer>
</fieldType>
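To use it, point a field at this type and copy the name into it, roughly like this (the field names are assumptions):
<field name="name_phonetic" type="phoneticNames" indexed="true" stored="false" multiValued="true"/>
<copyField source="fullname" dest="name_phonetic"/>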
We created a simple 'name' field type that allows mixing both 'key' (e.g., SOUNDEX) and 'pairwise' portions of the answers above.
Here's the overview:
at index time, fields of the custom type are indexed into a set of (sub)fields whose values are used for high-recall matching of different kinds of variations
Here's the core of its implementation...
List<IndexableField> createFields(SchemaField field, String name) {
Collection<FieldSpec> nameFields = deriveFieldsForName(name);
List<IndexableField> docFields = new ArrayList<>();
for (FieldSpec fs : nameFields) {
docFields.add(new Field(fs.getName(), fs.getStringValue(),
fs.getLuceneField()));
}
docFields.add(createDocValues(field.getName(), new Name(name)));
return docFields;
}
The heart of this is deriveFieldsForName(name) in which you can include 'keys' from PhoneticFilters, LowerCaseFolding, etc.
at query time, first a custom Lucene query is produced that has been tuned for recall and that uses the same fields as index time
Here's the core of its implementation...
public Query getFieldQuery(QParser parser, SchemaField field, String val) {
Name name = parseNameString(val, parser.getParams());
QuerySpec querySpec = buildQuery(name);
return querySpec.accept(new SolrQueryVisitor(field.getName()));
}
The heart of this is the buildQuery(name) method which should produce a query that is aware of deriveFieldsForName(name) above so for a given query name it will find good candidate names.
then second, Solr’s Rerank feature is used to apply a high-precision re-scoring algorithm to reorder the results
Here's what this looks like in your query...
&rq={!myRerank reRankQuery=$rrq} &rrq={!func}myMatch(fieldName, "John Doe")
The content of myMatch could be a pairwise Levenshtein or Jaro-Winkler implementation.
N.B. Our own full implementation uses proprietary code for deriveFieldsForName, buildQuery, and myMatch (see http://www.basistech.com/text-analytics/rosette/name-indexer/) to handle more kinds of variations than the ones mentioned above (e.g., missing spaces, cross-language matching).

How to use n-grams approximate matching with Solr?

We have a database of movies and series, and as the data comes from many sources of varying reliability, we'd like to be able to do fuzzy string matching on the titles of episodes. We are using Solr for search in our application, but the default matching mechanisms operate at the word level, which is not good enough for short strings like titles.
I had used n-gram approximate matching in the past, and I was very happy to find that Lucene (and Solr) supports something like this out of the box. Unfortunately, I haven't been able to configure it correctly.
I assumed that I need a special field type for this, so I added the
following field-type to my schema.xml:
<fieldType
name="trigrams"
stored="true"
class="solr.StrField">
<analyzer type="index">
<tokenizer
class="solr.analysis.NGramTokenizerFactory"
minGramSize="3"
maxGramSize="5"
/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
and changed the appropriate field in the schema to:
<field name="title" type="trigrams"
indexed="true" stored="true" multiValued="false" />
However, this is not working as I expected. The query analysis looks correct, but I don't get any results, which makes me believe that something goes wrong at index time (i.e. the title is indexed like a default string field instead of a trigram field).
The query I am trying is something like
title:"guy walks into a psychiatrist office"
(with a typo or two) and it should match "Guy Walks into a Psychiatrist Office".
(I am not really sure if the query is correct.)
Moreover, I would actually like to do a bit more. I'd like to lowercase the string, remove all punctuation marks and spaces, remove English stopwords and THEN turn the string into trigrams. However, the filters are applied only after the string has been tokenized...
Thanks in advance for your answers.
To answer the last part of your question: Solr also has an n-gram filter. So you should not use the n-gram tokenizer (use one like WhitespaceTokenizer instead), apply all pre-ngram filters, and then add this one:
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="3" />
The solution turned out to be very simple: AND was set as the default operator, and if any of the n-grams didn't match, the whole query failed. So it was sufficient to add:
<solrQueryParser defaultOperator="OR" />
in my schema definition.
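Note that the schema-level <solrQueryParser defaultOperator="..."/> setting has been deprecated in newer Solr versions; the same effect can be achieved per request with the q.op parameter, for example:
q=title:(guy walks into a psychiatrist office)&q.op=OR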
