Solr Fuzzy search (max 2 edits) - search

I am using Solr 6.0.0
I am using data-driven-configuration for my configuration related purpose. Most of the configuration is standard.
I have a document in Solr with
name:"aquickbrownfox"
Now if I do a fuzzy search like:
name:aquickbrownfo~0.7
OR
name:aquickbrownf~0.7
It lists out the record in the results.
But if I do a search like:
name:aquickbrown~0.7
It does not list the record.
Does it have to do something with the maxEdits in solrconfig.xml which is set to 2 ?
I tried increasing it. But I could not create a collection with this configuration. It gave an error:
ERROR: Error CREATEing SolrCore 'my-search': Unable to create core
[my-search] Caused by: Invalid maxEdits
Max 2 Edits seems to be a serious limitation. I wonder what is the use of passing the fractional value after the ~ operator.
My Usecase:
I have a contact database. I am supposed to detect the duplicates based on three parameters : Name, Email and Phone. So I rely on Solr for Fuzzy search. Email and Phone are relatively easy to work with simple assumptions. Name seems to be a bit tricky. For each word in the Name, I plan to do a fuzzy search. I expected the optional parameter after ~ to work without the maxEdit distance limitation.

The documentation no longer suggests using a fractional value after the tilde - see http://lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Fuzzy_Searches for more information.
However, you are correct that only 2 changes are allowed to be made to the search string in order to carry out a fuzzy search. I would guess this limitation strikes a balance between efficiency and usefulness.
The maxEdits parameter in solrconfig.xml applies to the DirectSpellChecker configuration, and doesn't affect your searching, unless you're using the spell checker.
For your use case, your best approach may be to index the name field twice, using different field configurations: one using a simple set of analyzers and filters (ie. StandardTokenizerFactory, StandardFilterFactory, LowerCaseFilterFactory), and the other using a phonetic matcher such as the Beider-Morse filter. You can use the first field to carry out fuzzy searches, and the second version to look for names which may be spelled differently but sound the same as the name being checked.

Related

How do you construct an Azure Search query to return a wildcard search based solely on a specific field?

If I may have missed this in some other area of SO please redirect me but I don't think this is a duped question.
I am using Azure Search with an index field called Title which is searchable and filterable using a Standard Lucerne Analyzer.
When using the built-in explorer, if I want to return all Job Titles that are explicitly named Full Stack Developer I can achieve it this way:
$filter=Title eq 'Full Stack Developer'&$count=true
But if I want to retrieve all the Job Titles using a wildcard to return all records having Full Stack in the name this way:
$filter=Title eq 'Full Stack*'&$count=true
The first 20 or so records returned are spot on, but after that point I get a mix of records that have absolutely nothing in common with the Title I specified in the query. My initial assumption was that perhaps Azure was including my specified Title performing an inclusive keyword search on the text as well.
Though I found a few instances where that hypothesis seemed to prove out, so many more of the records returned invalidated that altogether.
Maybe I don't understand fully the mechanics under the hood of Azure Search and so though my query appears to be valid; my expectation of the result is way off.
So how should my query look to perform a wildcard resulting to guarantee the words specified in the search to be included in the Titles returned, if this should be possible? And what would be the correct syntax to condition the return to accommodate for OR operators to be inclusive?
Azure Cognitive Search allows you to perform wildcard searches limited to specific fields only. To do so, you will need to specify the name of the fields in which you want to perform the search in searchFields parameter.
Your search URL in this case would be something like:
https://accountname.search.windows.net/indexes/indexname/docs?api-version=2020-06-30&searchFields=Title&search=Full Stack*
From the link here:

Returning accented as well as normal result set via azure search filters

Does anyone know how to ensure we can return normal result as well as accented result set via the azure search filter. For e.g the below filter query in Azure search returns a name called unicorn when i check for record with name unicorn.
var result= searchServiceClient.Documents.SearchAsync<myDto>("*",new SearchParameters
{
SearchFields = new List<string> {"Name"},
Filter = "Name eq 'unicorn'"
});
This is all good but what i want is i want to write a filter such that it returns record named unicorn as well as record named únicorn (please note the first accented character) provided that both record exist.
This can be achieved when searching for such name via the search query using language or Standard ASCII folding search analyzer as mentioned in this link. What i am struggling to find out is how can we implement the same with azure filters?
Please let me know if anyone has got any solutions around this.
Filters are applied on the non-analyzed representation of the data, so I don’t think there’s any way to do any type of linguistic analysis on filters. One way to work around this is to manually create a field which only do lowercasing + asciifloding (no tokenization) and then search lucene queries that look like this:
"normal search query terms" AND customFilterColumn:"filtérValuèWithÄccents"
Basically the document would both need to match the search terms in any field AND also match the filter term in the “customFilterColumn”. This may not be sufficient for your needs, but at least you understand the art of the possible.
Using filters it won't work unless you specify in advance all the possibilities:
for example:
$filter=name eq 'unicorn' or name eq 'únicorn'
You'd better work with a different analyzer that will change accents to it's root form. As another possibility, you can try fuzzy search:
search=unicorn~&highlight=Name

Lucene wildcard applied to indexed field

I have a set of indexed fields such as these:
submitted_form_2200FA17-AF7A-4E44-9749-79D3A391A1AF:true
submitted_form_2398389-2-32-43242423:true
submitted_form_54543-32SDf-3242340-32422:true
And I get that it's possible to wildcard queries such as
submitted_form_2398389-2-32-43242423:t*e
What I'm trying to do is get "any" submitted form via something like:
submitted_form_*:true
Is this possible? Or will I have to do a stream of "OR"s on the known forms (which seems quite heavy)
That's not the intended use of fields, I think. Field names aren't supposed to be the searchable values, field values are. Field names are supposed to be known a priori.
My suggestion is (if possible) to store the second part of the name as the field value, for instance: submitted_form:2398389-2-32-43242423. submitted_from would be the field known a priori, and the value could eventually be searched with a PrefixQuery.
Anyway, you could access the collection of fields' names using IndexReader.getFieldNames() in Lucene 3.x and this in Lucene 4.x. I wouldn't expect search performance there.

How can I retrieve meta information about synonym matches for a Solr query?

We are using Solr to provide search functionality for our site, and I have the following requirement which has me stumped:
Given the search term "2011 Bolinger", identify that "Bollinger" (note the different spelling) is a valid value for the Producer facet, and automatically apply facet filtering for this value.
It's the fuzzy matching of the search term which I'm stuck on. Does anyone know of a way to include information in a Solr response about synonym matches which have occurred for a query during the search (i.e. a way for Solr to tell me that it saw the word 'Bollinger' in a document and recognised it as equivalent to 'Bolinger')? From what I've read so far of the Solr documentation I can't see a way to do this, but I may have missed something.

How to implement faceted search suggestion with number of relevant items in Solr?

Hi
I have a very specific need in my company for the system's search engine, and I can't seem to find a solution.
We have a SOLR index of items, all of them have the same fields, with one of the fields being "Type", (And ofcourse, "Title", "Text", and so on).
What I need is: I get an Item Type and a Query String, and I need to return a list of search suggestion with each also saying how meny items of the correct type will that suggested string return.
Something like, if the original string is "goo" I'll get
Goo 10
Google 52
Goolag 2
and so on.
now, How do I do it?
I don't want to re-query SOLR for each different suggestion, but if there is no other way, I just might.
Thanks in advance
you can try edge n-gram tokenization
http://search.lucidimagination.com/search/document/CDRG_ch05_5.5.6
You can try facets. Take a look at my more detailed description ('Autocompletion').
This was implemented at http://jetwick.com with Solr ... now using ElasticSearch but the Solr sources are still available and the idea is also the identical https://github.com/karussell/Jetwick
The SpellCheckComponent of Solr (that gives the suggestions) have extended results that can give the frequency of every suggestion in the index - http://wiki.apache.org/solr/SpellCheckComponent#Extended_Results.
However, the .Net component SolrNet, does not currently seem to support the extendedResults option: "All of the SpellCheckComponent parameters are supported, except for the extendedResults option" - http://code.google.com/p/solrnet/wiki/SpellChecking.
This is implemented using a facet field query with a Prefix set. You can test this using the xml handler like this:
http://localhost:8983/solr/select/?rows=0&facet=true&facet.field=type&f.type.prefix=goo

Resources