Azure Search Suggester - fuzzy not returning results - azure

I created a new Azure Search Suggester but implemented Fuzzy search with the following code:
ISearchIndexClient indexClient = CreateSearchIndexClient();
var suggestParameters = new SuggestParameters();
suggestParameters.UseFuzzyMatching = true;
suggestParameters.MinimumCoverage = 100;
DocumentSuggestResult response = indexClient.Documents.Suggest(term, "suggester", suggestParameters);
IList<SuggestResult> results = response.Results;
The index contain the string "China", but when I search using the following "chn", no suggestion is return. I expect that fuzzy search will be able to return "China".
Searching for "chi" will return "China" as a suggestion correctly.
Can anyone advise what I am doing wrong? Thank you

The short answer to your question is that we do not expect "chn" to return "china" as a result.
The long answer is: suggestions with fuzzy matching happen in two steps. The first step is attempting to "complete" the last term of the query by finding matching words that have that term as a prefix. Only then do the edit distance kicks in as the second step which is to expand each term in the query using an edit distance of 1.
Since the word "chn" is not a prefix to "china", it won't be returned part of the first step. Since "chn" is 2 edit distance away from "china", its not found in the second step either. On the other side, "chi" is a prefix to "china", so its found during the first step. I expect that if you run the search query with "chna", "china" would be succesfully returned.
Hope this answers your question.

Related

Search the best match comparing prefixes

I have the numbers codes and text codes like in table1 below. And I have the numbers to search like in table2
for which I want to get the best match for a prefix of minimun length of 3 comparing from left to rigth and show as answer the corresponding TEXT CODE.
If there is an exact match, that would be the answer.
If there is no any value that has at least 3 length prefix then answer would be "not found".
I show some comments explaining the conditions applied in answer expected for each Number to search next to table2.
My current attempt shows the exact matches, but I'm not sure how to compare the values to search for the other conditions, when there is no exact match.
ncode = ["88271","1893","107728","4482","3527","71290","404","5081","7129","33751","3","40489","107724"]
tcode = ["RI","NE","JH","XT","LF","NE","RI","XT","QS","XT","YU","WE","RP"]
tosearch = ["50923","712902","404","10772"]
out = []
out.append([])
out.append([])
for code in tosearch:
for nc in ncode:
if code == nc:
indexOfMatched = ncode.index(nc)
out[0].append(nc)
out[1].append(tcode[indexOfMatched])
>>> out
[['404'], ['RI']]
The expected output would be
out = [
['50923', '712902', '404', '10772'],
['NOT FOUND', 'NE', 'RI', 'JH' ]
]
A simple solution you might consider would be the fuzzy-match library. It compares strings and calculates a similarity score. It really shines with strings rather than numbers, but it could easily be applied to find similar results in your prefix numbers.
Check out fuzzy-match here.
Here is a well written fuzzy-match tutorial.

FuzzyWuzzy for very similar records in Python

I have a dataset with which I want to find the closest string match. For that purpose I'm using FuzzyWuzzy in this way
sol=process.extract(t,dev2,scorer=fuzz.token_sort_ratio)
Where t is the string and dev2 is the list to compare to. My problem is that sometimes it has very similar records and options provided by FuzzyWuzzy seems to be lacking. And I've tested with token_sort, token_set, partial_token sort and set, ratio, partial_ratio, and WRatio.
For example, the string Italy - Serie A gives me the following 2 closest matches.
Token_sort_ratio: (92, 'Italy - Serie D');(86, 'Italian - Serie A')
The one wanted is obviously the second one, but character by character is closer the first one, which is a different league.
This happens as well with teams. If, let's say I have a string Buchtholz I would obtains Buchtholz II before I get TSV Buchtholz.
My main guess now would be to try and weight the presence and absence of several characters more heavily, like single capital letters at the end of the string, so if there is a difference in the letter or an absence it is weighted as less close. Or for () and special characters.
I don't know if there is a way to take this into account or you guys have a better approach to get the string that really matches.
Similarity matches often require knowledge of the data being analysed. i.e. it is not just a blind single round of matching. I recommend that you pass your results through more steps of matching, starting with inclusive/optimistic approaches (like token_set_ratio) with low cut off scores and working toward more exclusive/pessimistic approaches with higher cut off scores until you have a clear winner. If you know more about the text you're analyzing, you can even modify the strings as you progress.
In a case I worked on, I did similarity matches of goods movement descriptions. In the descriptions the numbers sequences were more important than the text. e.g. when looking for a match for "SLURRY VALVE 250MM RAGMAX 2000" the 250 and 2000 part of the string are important, otherwise I get a "SLURRY VALVE 50MM RAGMAX 2000" as the best match instead of "VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON" which is a better result.
I put the similarity match process through two steps: 1. Get a bunch of similar matches using an optimistic matching scorer (token_set_ratio) 2. get the number sequences of these results and pass them through another round of matching with a more strict scorer (token_sort_ratio). Doing this gave me the better result in the example I showed above.
Below is some blocks of code that could be of assistance:
here's a function to get numbers from the sequence. (In your case you might use this to exclude numbers from your string instead?)
def get_numbers_from_string(description):
numbers = ''.join((ch if ch in '0123456789.-' else ' ') for ch in description)
numbers = ' '.join([nr for nr in numbers.split()])
return numbers
and here is a portion of the code I used to put the description match through two rounds:
try:
# get close match from goods move that has material numbers
df_material = pd.DataFrame(process.extract(description,
corpus_material,
scorer=fuzz.token_set_ratio),
columns=['Similar Text','Score']
)
if df_material['Score'][df_material['Score']>=cut_off_accuracy_materials].count()>=1:
similar_text = df_material['Similar Text'].iloc[0]
score = df_material['Score'].iloc[0]
if nr_description_numbers>4:
# if there are multiple matches found, then get best number combination match
df_material = df_material[df_material['Score']>=cut_off_accuracy_materials]
new_corpus = list(df_material['Similar Text'])
new_corpus = np.vectorize(get_numbers_from_string)(new_corpus)
df_material['numbers'] = new_corpus
df_numbers = pd.DataFrame(process.extract(description_numbers,
new_corpus,
scorer=fuzz.token_sort_ratio),
columns=['numbers','Score']
)
similar_text = df_material['Similar Text'][df_material['numbers']==df_numbers['numbers'].iloc[0]].iloc[0]
nr_score = df_numbers['Score'].iloc[0]
hope it helps, and good luck

Autocomplete function with underscore/JS

what is the proper way to implement an autocomplete search with undescore?
i have a simple array (cities) and a text input field ($.autocomplete). when the user enters the first letters in the auto-complete textfield, it should output an array with all the cities starting with the entered letters (term).
cities:
["Graz","Hamburg","Innsbruck","Linz","München","Other","Salzburg","Wien"]
eventlistener:
$.autocomplete.addEventListener("change", function(e){
var cities = cities_array;
var term = $.autocomplete.value;
var results = _.filter(cities, function (city){
return
});
console.log(results + "this is results");
});
I’ve tried it with _.contains, but it only returns the city when its a complete match (e.g Graz is only output when „Graz“ is entered but not when „Gr“ is entered).
the _.filter/._select collection at http://underscorejs.org/docs/underscore.html are not very clear for me and the closest i found here is filtering JSON using underscore.
but i don’t understand the indexOf part.
thx for any suggestions!
By using #filter and #indexOf, you can get quite close to a pretty decent autocomplete.
What #indexOf does is that it checks the string if it contains the inputVal.
If it does not contain it it'll return -1 therefore our predicate below will work without fail.
Another small trick here is that you (read I) wanted it to be possible to search for s and get a hit for Innsbruck and Salzburg alike therefore I threw in #toLowerCase so that you always search in lower case.
return _.filter(cities, function(city) {
return city.toLowerCase().indexOf(inputVal.toLowerCase()) >= 0;
});

Is there a way to prevent partial word matching using Sitecore Search and Lucene?

Is there a way when using Sitecore Search and Lucene to not match partial words? For example when searching for "Bos" I would like to NOT match the word "Boston". Is there a way to require the entire word to match? Here is a code snippet. I am using FieldQuery.
bool _foundHits = false;
_index = SearchManager.GetIndex("product_version_index");
using (IndexSearchContext _searchContext = _index.CreateSearchContext())
{
QueryBase _query = new FieldQuery("title", txtProduct.Text.Trim());
SearchHits _hits = _searchContext.Search(_query, 1000);
...
}
You may want to try something like this to get the query you want to run. It will put the + in (indicating a required term) and quote the term, so it should exactly match what you're looking for, its worked for me. Providing you're passing in BooleanClause.Occur.MUST.
protected BooleanQuery GetBooleanQuery(string fieldName, string term, BooleanClause.Occur occur)
{
QueryParser parser = new QueryParser(fieldName, new StandardAnalyzer());
BooleanQuery query = new BooleanQuery();
query.Add(parser.Parse(term), occur);
return query;
}
Essentially so your query ends up being parsed to +title:"Bos", you could also download Luke and play around with the query syntax in there, its easier if you know what the syntax should be and then work backwards to see what query objects will generate that.
You have to place the query in double quotes for the exact match results. Lucene supports many such opertators and boolean parameters that can be found here: http://lucene.apache.org/core/2_9_4/queryparsersyntax.html
It depends on field type. If you have memo or text field then partial matching is applied. If you want exact matching use string field instead. There you can find some details: https://www.cmsbestpractices.com/bug-how-to-fix-solr-exact-string-matching-with-sitecore/ .

examine stripping out search words

I'm using umbraco and I have examine up and running however my query is having words stripped out
For example:
I am searching on "man on the moon" with the following line of code, the variable "searchTerm" should contain "man on the moon":
var Searcher = ExamineManager.Instance.SearchProviderCollection["MySearcher"];
var searchCriteria = Searcher.CreateSearchCriteria();
var query = searchCriteria.Field("Name", searchTerm).Compile();
however, the query is generated as this when I debug:
{ SearchIndexType: , LuceneQuery: +Name:"man moon" }
Notice how it has removed the words "on the" from the searchTerm?
Presumably these are because they are deemed as STOP/reserved words. However, this means I do not get the search results I expect.
How can I get around this?
Internally the StopAnalyzer class is used by the StandardAnalyzer as part of the standard indexing process. The StopAnalyzer (http://lucenenet.apache.org/docs/3.0.3/d7/df5/_stop_analyzer_8cs_source.html#l00054) contains a method which allows you to substitute a different set of stopwords as an ISet type parameter rather than use the standard ENGLISH_STOP_WORDS_SET (line 134).
And I read here (http://webcache.googleusercontent.com/search?q=cache:sA-uyAC015UJ:our.umbraco.org/m%3Fmode%3Dtopic%26id%3D25600+&cd=2&hl=en&ct=clnk&gl=uk) that you can get Examine to use an empty set of stopwords by adding the following line to your application_start method in global.asax
Lucene.Net.Analysis.StopAnalyzer.ENGLISH_STOP_WORDS_SET = new System.Collections.Hashtable();
So with an empty set of stopwords your man in the moon should be back.
A bit of an odd idea but as an alternative you could also add a StopAnalyzer to ExamineSettings.config to create an index of docs with only the stop words and then AND them with your standardanalyzer result set?

Resources