The scenario is boosting documents based on the values of multiple fields:
I have a field "Category" containing the values "News", "image", "video", "audio".
Based on these field values, I would like to give documents some boosting (priority): for example, "News" gets the highest priority, followed by "video", then "audio", and so on.
Besides Category, there are a few more fields that need to be boosted in the same manner, based on their values.
For example, the boosting rules could be:
Category= News^1000
Category= Image^900
Premium_Contents = True^200
Sponsored = True^300
... so on
I came across a possible solution (Reference). I am trying to find the best approach for calculating relevancy for my search result sets.
Yes, I think your link is a reasonable idea. It is what we use, because we want to enforce our boosts on all searches and we don't change the logic very often. For example, in your case:
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="boost">product(
      map(query($type1query),0,0,1,$type1boost),
      map(query($type2query),0,0,1,$type2boost))</str>
    <str name="type1query">Category:"News"</str>
    <double name="type1boost">1000.0</double>
    <str name="type2query">Category:"Image"</str>
    <double name="type2boost">900.0</double>
  </lst>
</requestHandler>
In this case the query function returns the score for the specific query, that is, for a match on News, Image, etc. in Category.
The map function has the following signature: map(x,min,max,target,value). It maps any value of the function x that falls within min and max (inclusive) to target; min, max, target and value are constants. If x does not fall between min and max, it outputs value (or, in the four-argument form without value, x itself). Here, a query score of 0 (no match) is mapped to 1, which leaves the document's score unchanged, while any positive score (a match) outputs the boost (1000, 900, etc.). You'll need to experiment with the boost values, as they can overwhelm any other ranking logic you have. For example, poor matches on News may rank first even where there is a better match on Video.
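To make the arithmetic concrete, here is a minimal Python sketch that imitates the map/product expression outside of Solr. The function names and the sample scores are made up for illustration; only the boost values come from the configuration above.

```python
# Imitation of Solr's map(x,min,max,target,value); not Solr code.
def solr_map(x, lo, hi, target, value):
    """Return target when lo <= x <= hi, otherwise value."""
    return target if lo <= x <= hi else value

# Boosts from the request handler above.
BOOSTS = {"News": 1000.0, "Image": 900.0}

def boost_factor(category_scores):
    """Multiply the mapped values, like product(map(...), map(...))."""
    factor = 1.0
    for category, boost in BOOSTS.items():
        score = category_scores.get(category, 0.0)  # 0.0 means "no match"
        factor *= solr_map(score, 0, 0, 1.0, boost)
    return factor

# Hypothetical documents: a weak News match and a strong Video match.
weak_news = 0.4 * boost_factor({"News": 0.7})
strong_video = 2.0 * boost_factor({})
print(weak_news, strong_video)
```

The weak News match ends up far ahead of the strong Video match, which is the "overwhelming" effect described above.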
You could create a separate request handler with these boosts so you can bypass them for other searches. Obviously you have to change solrconfig and restart Solr if you make any changes, which may be an issue.
Otherwise look at the bq (boost query) parameter.
bq=Category:News^1000.0+Category:Image^900...
which actually generates something like this under the covers
boost(+*:* (Category:News^1000 + Category:Image^900))
This means the boosts are done in your search code which is nice and flexible. Personally I prefer this way of working.
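As a rough sketch of what "boosts in your search code" could look like, the following Python snippet builds the request parameters from a list of rules. The rule list mirrors the question; the user query text and the helper name are assumptions, and sending the request to a real Solr endpoint is left out.

```python
from urllib.parse import urlencode, parse_qs

# (field, value, boost) rules taken from the question.
BOOST_RULES = [
    ("Category", "News", 1000.0),
    ("Category", "Image", 900.0),
    ("Sponsored", "True", 300.0),
    ("Premium_Contents", "True", 200.0),
]

def build_params(user_query, rules):
    """Build edismax parameters with one space-separated bq value."""
    bq = " ".join(f"{field}:{value}^{boost}" for field, value, boost in rules)
    return {"q": user_query, "defType": "edismax", "bq": bq}

# urlencode turns the spaces into '+', matching the bq=...+... form above.
query_string = urlencode(build_params("election results", BOOST_RULES))
print(query_string)
```

Because the rules live in application code, they can be changed per request without touching solrconfig or restarting Solr.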
I am evaluating OpenNLP for use as a document categorizer. I have a sanitized training corpus with roughly 4k files, in about 150 categories. The documents have many shared, mostly irrelevant words - but many of those words become relevant in n-grams, so I'm using the following parameters:
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 20000);
params.put(TrainingParameters.CUTOFF_PARAM, 10);
DoccatFactory dcFactory = new DoccatFactory(new FeatureGenerator[] { new NGramFeatureGenerator(3, 10) });
params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);
Some of these categories apply to documents that are almost completely identical (think boiler-plate legal documents, with maybe only names and addresses different between document instances) - and will be mostly identical to documents in the test set. However, no matter how I tweak these params, I can't break out of the "1 outcome patterns" result. When running a test, every document in the test set is tagged with "Category A."
I did manage to effect a single minor change in output, by moving from previous use of the BagOfWordsFeatureGenerator to the NGramFeatureGenerator, and from maxent to Naive Bayes; before the change, every document in the test set was assigned "Category A", but after the change, all the documents were now assigned to "Category B." But other than that, I can't seem to move the dial at all.
I've tried fiddling with iterations, cutoff, ngram sizes, using maxent instead of bayes, etc; but all to no avail.
Example code from tutorials that I've found online has used much smaller training sets with fewer iterations, and is able to perform at least some rudimentary differentiation.
Usually in such a situation - bewildering lack of expected behavior - the engineer has forgotten to flip some simple switch, or has some fatal lack of fundamental understanding. I am eminently capable of both those failures. Also, I have no Data Science training, although I have read a couple of O'Reilly books on the subject. So the problem could be procedural. Is the training set too small? Is the number of iterations off by an order of magnitude? Would a different algo be a better fit? I'm utterly surprised that no tweaks have even slightly moved the dial away from the "1 outcome" outcome.
Any response appreciated.
Well, the answer to this one did not come from the direction in which the question was asked. It turns out that there was a code sample in the OpenNLP documentation that was wrong, and no amount of parameter tuning would have solved it. I've submitted a jira to the project so it should be resolved; but for those who make their way here before then, here's the rundown:
Documentation (wrong):
String inputText = ...
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
double[] outcomes = myCategorizer.categorize(inputText);
String category = myCategorizer.getBestCategory(outcomes);
Should be something like:
String inputText = ... // sanitized document to be classified
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
double[] outcomes = myCategorizer.categorize(inputText.split(" "));
String category = myCategorizer.getBestCategory(outcomes);
DocumentCategorizerME.categorize() needs an array; since this is an obviously self-documenting bug the second you run the code, I had assumed the necessary array parameter should be an array of documents in string form; instead it needs an array of tokens from a single document.
I'm using Azure Search with the full Lucene query parser syntax. I append "~1" to the search term as the fuzzy parameter, for an edit distance of one symbol. But I ran into a problem: the results are not ordered with the exact match first, even when there is one. (For example, "blue~1" returns "blues", "blue", "glue"; or, when searching for a product SKU like "P002", I get "P003", "P005", "P004", "P002", "P001", "P006".)
So my question: is there some way to specify that the exact match must be first in the list, or be the single search result, even when I'm using the fuzzy search "~1"?
With Lucene query syntax you can boost individual subqueries, for example: term^2 | term~1. This translates to "find documents that match 'term' OR 'term' with edit distance 1, and score the exact matches higher relative to fuzzy matches by a factor of two."
search=blue^2|blue~1&queryType=full
There is no guarantee that the exact match will always be first in the result set, as the document score is a function of term frequency and inverse document frequency. If the fuzzy sub-query expands the input term to a term that's very rare in your document corpus, you may need to bump the boosting factor (2 in my example). In general, relying on the relevance score for ordering is not a practical idea. Take a look at my answer in the following post for more information: Azure Search scoring
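If you build the search string in code, the boost factor becomes easy to tune. Here is a small Python sketch; the helper name and its defaults are assumptions, and only the search and queryType parameters come from the example above.

```python
from urllib.parse import urlencode, parse_qs

def exact_first_query(term, exact_boost=2, max_edits=1):
    """Combine a boosted exact sub-query with a fuzzy sub-query."""
    return f"{term}^{exact_boost}|{term}~{max_edits}"

params = {"search": exact_first_query("blue"), "queryType": "full"}
query_string = urlencode(params)  # URL-encodes '^' and '|'
print(query_string)

# A larger boost for SKU-like terms where the fuzzy variants are all rare.
print(exact_first_query("P002", exact_boost=10))
```

The right boost value depends on your corpus, so treat the numbers here as starting points rather than recommendations.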
Let me know if this helps
Ok, so I am using many fields with qf, like:
[qf] => frpId^5 fundraise_title^3 fundraiser_display_name^3 charity_name^2 participantFname^2 participantLname^2 participantEmail^1 groupName^3 fundraise_text^ fundraiseTitleExact^15 fundraiserDisplayNameExact^15 charityNameExact^15 participantFnameExact^10 participantLnameExact^10 groupNameExact^10 all^
but I really want exact matches on the field fundraiseTitleExact to be on top.
With this qf setup, they are at position 32.
Let's say that I am boosting fundraiseTitleExact like:
[qf] => frpId^5 fundraise_title^3 fundraiser_display_name^3 charity_name^2 participantFname^2 participantLname^2 participantEmail^1 groupName^3 fundraise_text^ fundraiseTitleExact^15000000000000000 fundraiserDisplayNameExact^15 charityNameExact^15 participantFnameExact^10 participantLnameExact^10 groupNameExact^10 all^
But even now the fundraiseTitleExact exact match is only at position 27 (5 positions up) and does not move any higher.
How can I prioritise this field over the rest?
This looks more like a tuning problem. However, you have several options:
Tune your relevancy by modifying all the boosts until you get the expected results (I would advise working with lower boosts than the ones in your question, and then increasing the boost of the most important field);
If you are using the edismax query parser, you probably want to check the bq and bf parameters in order to boost your term;
If worse comes to worst, you could use the Query Elevation Component to put some entries at the top of the list.
I advise reading the following books to widen your knowledge of Solr boosting and relevancy mechanisms:
Solr in Action
Relevant Search
I wish to create a fuzzy search algorithm.
However, after hours of research I am really struggling.
I want to create an algorithm that performs a fuzzy search on a list of names of schools.
This is what I have looked at so far:
Most of my research on Google and Stack Overflow keeps pointing to "string metrics" such as:
Levenshtein distance
Damerau-Levenshtein distance
Needleman–Wunsch algorithm
However, this just gives a score of how similar two strings are. The only way I can think of to implement it as a search algorithm is to perform a linear search, executing the string metric algorithm for each string and returning the strings with scores above a certain threshold. (Originally I had my strings stored in a trie, but this obviously won't help me here!)
Although this is not such a bad idea for small lists, it would be problematic for a list with, let's say, 100,000 names when the user performs many queries.
Another algorithm I looked at is the spell-checker method, where you just search for all potential misspellings. However, this is also highly inefficient, as it requires more than 75,000 words for a word of length 7 and an error count of just 2.
What do I need?
Can someone please suggest a good, efficient fuzzy search algorithm, with:
Name of the algorithm
How it works or a link to how it works
Pros and cons and when it's best used (optional)
I understand that all algorithms will have their pros and cons and there is no best algorithm.
Considering that you're trying to do a fuzzy search on a list of school names, I don't think you want to go for traditional string similarity like Levenshtein distance. My assumption is that you're taking a user's input (either keyboard input or spoken over the phone), and you want to quickly find the matching school.
Distance metrics tell you how similar two strings are based on substitutions, deletions, and insertions. But those algorithms don't really tell you anything about how similar the strings are as words in a human language.
Consider, for example, the words "smith," "smythe," and "smote". I can go from "smythe" to "smith" in two steps:
smythe -> smithe -> smith
And from "smote" to "smith" in two steps:
smote -> smite -> smith
So the two have the same distance as strings, but as words, they're significantly different. If somebody told you (spoken language) that he was looking for "Smythe College," you'd almost certainly say, "Oh, I think you mean Smith." But if somebody said "Smote College," you wouldn't have any idea what he was talking about.
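The two-step claim is easy to check with a plain Levenshtein implementation. This is a minimal sketch using the standard two-row dynamic programming table; the threshold scan at the end is just an illustration of why a distance cut-off alone cannot separate the two words.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("smythe", "smith"))  # 2
print(levenshtein("smote", "smith"))   # 2

# A linear scan with a threshold of 2 accepts both candidates equally.
candidates = ["smythe", "smote"]
within_two = [w for w in candidates if levenshtein(w, "smith") <= 2]
```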
What you need is a phonetic algorithm like Soundex or Metaphone. Basically, those algorithms break a word down into phonemes and create a representation of how the word is pronounced in spoken language. You can then compare the result against a known list of words to find a match.
Such a system would be much faster than using a distance metric. Consider that with a distance metric, you need to compare the user's input with every word in your list to obtain the distance. That is computationally expensive and the results, as I demonstrated with "smith" and "smote" can be laughably bad.
Using a phonetic algorithm, you create the phoneme representation of each of your known words and place it in a dictionary (a hash map or possibly a trie). That's a one-time startup cost. Then, whenever the user inputs a search term, you create the phoneme representation of his input and look it up in your dictionary. That is a lot faster and produces much better results.
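Here is a minimal sketch of that dictionary approach in Python, using a hand-written American Soundex. The school list and the choice to key on one code per word are assumptions made for the example. Note that Soundex is quite coarse (it also encodes "smote" as S530, the same as "smith"), so a finer-grained encoding such as Metaphone may be needed in practice.

```python
from collections import defaultdict

_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6"))
CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}

def soundex(word):
    """American Soundex: first letter plus three digits, zero padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (encoded + "000")[:4]

def phonetic_key(name):
    """One Soundex code per word of the name."""
    return tuple(soundex(part) for part in name.split())

# Hypothetical school list; the index is built once at startup.
SCHOOLS = ["Smith College", "Jackson Academy", "Lincoln High School"]
INDEX = defaultdict(list)
for school in SCHOOLS:
    INDEX[phonetic_key(school)].append(school)

def lookup(user_input):
    """Hash lookup on the phonetic key of the user's input."""
    return INDEX.get(phonetic_key(user_input), [])

print(lookup("Smythe College"))  # ['Smith College']
```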
Consider also that when people misspell proper names, they almost always get the first letter right, and more often than not pronouncing the misspelling sounds like the actual word they were trying to spell. If that's the case, then the phonetic algorithms are definitely the way to go.
I wrote an article about how I implemented a fuzzy search:
https://medium.com/#Srekel/implementing-a-fuzzy-search-algorithm-for-the-debuginator-cacc349e6c55
The implementation is in Github and is in the public domain, so feel free to have a look.
https://github.com/Srekel/the-debuginator/blob/master/the_debuginator.h#L1856
The basics of it are: split all the strings you'll be searching into parts. So if you have paths, then "C:\documents\lol.txt" is maybe "C", "documents", "lol", "txt".
Lowercase these strings to ensure that the search is case insensitive. (Maybe only do it if the search string is all-lowercase.)
Then match your search string against this. In my case I want to match it regardless of order, so "loldoc" would still match the above path even though "lol" comes after "doc".
The matching needs to have some scoring to be good. The most important part I think is consecutive matching, so the more characters directly after one another that match, the better. So "doc" is better than "dcm".
Then you'll likely want to give extra score for a match that's at the start of a part. So you get more points for "doc" than "ocu".
In my case I also give more points for matching the end of a part.
And finally, you may want to consider giving extra points for matching the last part(s). This makes it so that matching the file name/ending scores higher than the folders leading up to it.
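As a rough illustration of the ideas above (not the code from the linked repository), here is a much simplified Python sketch. It only covers splitting, lowercasing, consecutive-run scoring and the start-of-part bonus; the end-of-part and last-part bonuses are left out, and all names and weights are made up.

```python
import re

def split_parts(path):
    """Lowercase the string and split it on path and extension separators."""
    return [p for p in re.split(r"[\\/:.]+", path.lower()) if p]

def best_run(query, part):
    """Longest run of consecutive characters shared by query and part,
    plus one bonus point when that run begins at the start of the part."""
    best = 0
    for i in range(len(part)):
        for j in range(len(query)):
            k = 0
            while (i + k < len(part) and j + k < len(query)
                   and part[i + k] == query[j + k]):
                k += 1
            if k:
                best = max(best, k + (1 if i == 0 else 0))
    return best

def score(query, path):
    """Sum the best run per part, so the order of parts does not matter."""
    query = query.lower()
    return sum(best_run(query, part) for part in split_parts(path))

path = r"C:\documents\lol.txt"
print(split_parts(path))      # ['c', 'documents', 'lol', 'txt']
print(score("loldoc", path))  # matches even though "lol" comes after "doc"
```

With this scoring, "doc" beats "ocu" (same run length, but at the start of a part), and both beat "dcm" (no consecutive run longer than one character).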
You're confusing fuzzy search algorithms with implementation: a fuzzy search for a word may return 400 results, all the words that have a Levenshtein distance of, say, 2. But you only have to display the top 5-10 to the user.
Implementation-wise, you'll pre-process all the words in the dictionary and save the results into a DB. The popular words (and their fuzzy matches) will be saved in a cache layer, so you won't have to hit the DB for every request.
You may add an AI layer that finds the most common spelling mistakes and adds them to the DB, and so on.
A simple algorithm for "a kind of fuzzy search"
To be honest, in some cases, fuzzy search is mostly useless and I think that a simpler algorithm can improve the search result while providing the feeling that we are still performing a fuzzy search.
Here is my use case: Filtering down a list of countries using "Fuzzy search".
The list I was working with had two countries starting with Z: Zambia and Zimbabwe.
I was using Fusejs.
In this case, when entering the needle "zam", the result set had 19 matches, with the most relevant one for any human (Zambia) at the bottom of the list. And most of the other countries in the result did not even have the letter z in their name.
This was for a mobile app where you can pick a country from a list. It was supposed to be much like when you have to pick a contact from the phone's contacts. You can filter the contact list by entering some term in the search box.
IMHO, this kind of limited content to search from should not be treated in a way that will have people asking "what the heck?!?".
One might suggest to sort by most relevant match. But that's out of the question in this case because the user will then always have to visually find the "Item of Interest" in the reduced list. Keep in mind that this is supposed to be a filtering tool, not a search engine "à la Google". So the result should be sorted in a predictable way. And before filtering, the sorting was alphabetical. So the filtered list should just be an alphabetically sorted subset of the original list.
So I came up with the following algorithm ...
Grab the needle ... in this case: zam
Insert the .* pattern at the beginning and end of the needle
Insert the .* pattern between each letter of the needle
Perform a Regex search in the haystack using the new needle which is now .*z.*a.*m.*
In this case, the user gets a much more predictable result: everything that contains the letters z, a and m, in this order. All the letters of the needle will be present in the matches, in the same order.
This will also match country names like Mozambique ... which is perfect.
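The steps above can be sketched in a few lines of Python. The country list is a tiny made-up sample, and escaping the needle's characters and ignoring case are additions of mine so that input like "." or "Z" behaves sensibly.

```python
import re

def build_pattern(needle):
    """Put .* before, after and between the letters of the needle."""
    middle = ".*".join(re.escape(ch) for ch in needle)
    return re.compile(".*" + middle + ".*", re.IGNORECASE)

def filter_items(haystack, needle):
    """Keep matching items in their original (alphabetical) order."""
    if not needle:
        return list(haystack)
    pattern = build_pattern(needle)
    return [item for item in haystack if pattern.search(item)]

countries = ["France", "Mozambique", "Zambia", "Zimbabwe"]
print(build_pattern("zam").pattern)     # .*z.*a.*m.*
print(filter_items(countries, "zam"))   # ['Mozambique', 'Zambia']
```

Zimbabwe drops out because no "m" follows its "a", so the filtered list stays an alphabetically sorted subset of the original.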
I just think that sometimes, we should not try to kill a fly with a bazooka.
Fuzzy Sort is a JavaScript library that is helpful for string matching against a large collection of data.
The following code shows how to use fuzzysort in React.js.
Install fuzzysort through npm:
npm install fuzzysort
Full demo code in React.js:
import React from 'react';
import './App.css';
import data from './testdata';
const fuzzysort = require('fuzzysort');

class App extends React.Component {
  constructor(props) {
    super(props);
    this.state = {
      keyword: '',
      results: [],
    };
    console.log("data: ", data["steam_games"]);
  }

  search(keyword, category) {
    return fuzzysort.go(keyword, data[category]);
  }

  render() {
    return (
      <div className="App">
        <input
          type="text"
          onChange={(e) => this.setState({keyword: e.target.value})}
          value={this.state.keyword}
        />
        <button onClick={() => this.setState({results: this.search(this.state.keyword, "steam_games")})}>Search</button>
        {this.state.results !== null && this.state.results.length > 0 ?
          <h3>Results:</h3> : null
        }
        <ul>
          {this.state.results.map((item, index) => {
            return (
              <li key={index}>{item.score} : {item.target}</li>
            );
          })}
        </ul>
      </div>
    );
  }
}

export default App;
For more, refer to FuzzySort.
The problem can be broken down into two parts:
1) Choosing the correct string metric.
2) Coming up with a fast implementation of the same.
Choosing the correct metric: This part is largely dependent on your use case. However, I would suggest using a combination of a distance-based score and a phonetic-based encoding for greater accuracy i.e. initially computing a score based on the Levenshtein distance and later using Metaphone or Double Metaphone to complement the results.
Again, you should base your decision on your use case. If you can do with using just the Metaphone or Double Metaphone algorithms, then you needn't worry much about the computational cost.
Implementation: One way to cap the computational cost is to cluster your data into several small groups based on your use case and load them into a dictionary.
For example, if you can assume that your user enters the first letter of the name correctly, you can store the names in a dictionary keyed on this invariant.
So, if the user enters the name "National School", you need to compute the fuzzy matching score only for school names starting with the letter "N".
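A minimal Python sketch of this bucketing idea follows. To keep it short it uses difflib's similarity ratio from the standard library instead of a Levenshtein or Metaphone score, so treat the metric as a stand-in; the school names and the 0.8 cutoff are made up.

```python
from collections import defaultdict
from difflib import get_close_matches

# Hypothetical school list.
SCHOOLS = ["National School", "Nelson High", "Northern Academy",
           "Springfield High"]

# Bucket lowercase names by first letter; remember the original spelling.
BUCKETS = defaultdict(dict)
for name in SCHOOLS:
    BUCKETS[name[0].lower()][name.lower()] = name

def fuzzy_lookup(user_input, limit=3, cutoff=0.8):
    """Score only the names that share the input's first letter."""
    query = user_input.strip().lower()
    if not query:
        return []
    bucket = BUCKETS.get(query[0], {})
    matches = get_close_matches(query, list(bucket), n=limit, cutoff=cutoff)
    return [bucket[m] for m in matches]

print(fuzzy_lookup("Natonal School"))  # ['National School']
```

The trade-off is that a typo in the first letter misses the bucket entirely, which is why the invariant has to hold for your users.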
Here's a text with ambiguous words:
"A man saw an elephant."
Each word has attributes: lemma, part of speech, and various grammatical attributes depending on its part of speech.
For "saw" it is like:
{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}
All these attributes come from 3rd party tools; Lucene itself is not involved in the word disambiguation.
I want to perform a query like "pos=verb & number=singular" and NOT to get "saw" in the result.
I thought of encoding distinct grammatical annotations into strings like "l:see;pos:verb;t:past|l:saw;pos:noun;n:sg" and searching for regexp "pos\:verb[^\|]+n\:sg", but I definitely can't afford regexp queries due to performance issues.
Maybe some hacks with posting list payloads can be applied?
UPD: A draft of my solution
Here are the specifics of my project: there is a fixed maximum of parses a word can have (say, 8).
So, I thought of inserting the parse number into each attribute's payload and using this payload at the posting list intersection stage.
E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular': ...|...|2.1234|...|...
While processing a query like 'pos = Verb AND number = singular' at all stages of posting list processing the 'x.1234' entries would be accepted until the intersection stage where they would be rejected because of non-corresponding parse numbers.
I think this is a pretty compact solution, but how hard would it be to incorporate it into Lucene?
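To show the intended behavior outside of Lucene, here is a toy Python model of the draft. Posting lists are plain sets of (position, parse number) pairs, so this says nothing about how payloads would actually be wired into Lucene; the function names and the sample parses for "man" are assumptions.

```python
from collections import defaultdict

# (attribute, value) -> set of (word position, parse number)
postings = defaultdict(set)

def add_word(position, parses):
    """Index every attribute of every parse, tagged with its parse number."""
    for parse_no, attrs in enumerate(parses, start=1):
        for attr, value in attrs.items():
            postings[(attr, value)].add((position, parse_no))

def search(*conditions):
    """Intersect on (position, parse number), not on position alone."""
    lists = [postings.get(cond, set()) for cond in conditions]
    hits = set.intersection(*lists) if lists else set()
    return sorted({position for position, _ in hits})

# "A man saw an elephant." with "man" at position 1 and "saw" at position 2.
add_word(1, [{"lemma": "man", "pos": "noun", "number": "singular"}])
add_word(2, [{"lemma": "see", "pos": "verb", "tense": "past"},
             {"lemma": "saw", "pos": "noun", "number": "singular"}])

print(search(("pos", "verb"), ("number", "singular")))  # []
print(search(("pos", "noun"), ("number", "singular")))  # [1, 2]
```

"saw" is rejected for pos=verb AND number=singular because the two attributes belong to different parses, while it is still found as a singular noun.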
So... the cheater way of doing this is (indeed) to control how you build the lucene index.
When constructing the lucene index, modify each word before Lucene indexes it so that it includes all the necessary attributes of the word. If you index things this way, you must do a lookup in the same way.
One way:
This means for each type of query you do, you must also build an index in the same way.
Example:
saw becomes noun-saw -- index it as that.
saw also becomes verb-past-see -- index it as that.
saw also becomes verb-past-singular-see -- index it as that.
The other way:
If you want attribute based lookup in a single index, you'd probably have to do something like permutation completion on the word 'saw' so that instead of noun-saw, you'd have all possible permutations of the attributes necessary in a big logic statement.
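One way to read the "permutation completion" idea is sketched below in Python: for each parse, emit one composite token for every non-empty combination of its attributes, in a canonical (sorted) order, and build the query token the same way. The token format and function names are invented for the example, and no Lucene API is involved.

```python
from itertools import combinations

def composite_tokens(parses):
    """All non-empty attribute combinations per parse, in sorted order."""
    tokens = set()
    for attrs in parses:
        pairs = sorted(f"{key}={value}" for key, value in attrs.items())
        for size in range(1, len(pairs) + 1):
            for combo in combinations(pairs, size):
                tokens.add("&".join(combo))
    return tokens

def query_token(**conditions):
    """Build the lookup token with the same canonical ordering."""
    return "&".join(sorted(f"{k}={v}" for k, v in conditions.items()))

saw = composite_tokens([
    {"lemma": "see", "pos": "verb", "tense": "past"},
    {"lemma": "saw", "pos": "noun", "number": "singular"},
])

print(query_token(pos="verb", number="singular") in saw)  # False
print(query_token(pos="verb", tense="past") in saw)       # True
```

Each parse with n attributes produces 2^n - 1 tokens, so this trades index size for cheap exact-term lookups.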
Not sure if this is a good answer, but that's all I could think of.