Lucene search result not optimal

I know the different parts of lucene and how to use them but I have one problem left.
Lucene is running in my online shop and does a good job. Now I want to optimize the search result from my search function with a single search field, where the user can input anything he wants to search for.
So far I've been combining different search approaches to get the results, such as phrase and fuzzy search. The problem is that I always get a huge result set back. I want a smaller result list with only the best hits. I can achieve this when I leave out fuzzy search; then I get a great search result, but when the user types something wrong the result is empty.
There must be a way to get a small result set with great scores and still fall back to fuzzy search when the exact hits are bad?!
What do I have to think of? Which way should I go?
What's the best solution for that situation?

One approach is to first search using a non-fuzzy query. If that search returns a reasonable number of hits, then you are done.
If that first search returns 0 hits, then try searching again by combining searches where each individual search term is fuzzed once -- for example, if you have "A B C", the second search would be something like "(A~ AND B AND C) OR (A AND B~ AND C) OR (A AND B AND C~)". That would minimize the fuzziness introduced while still fuzzing each search term.
Please note that I have not tried this method, but I have tried a similar method where your second search drops each search term once so you handle search terms that are not in the index at all -- "(A AND B) OR (A AND C) OR (B AND C)".
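A minimal sketch of that two-pass idea in Java, assuming a Lucene index with a single "title" field (the field name, hit check and page size here are made up, so adapt them to your index): run the strict query first, and only build the "fuzz one term at a time" fallback when it comes back empty.

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class TwoPassSearch {
    static final String FIELD = "title"; // hypothetical field name

    static TopDocs search(IndexSearcher searcher, String[] terms) throws IOException {
        // Pass 1: strict query, all terms required.
        BooleanQuery.Builder strict = new BooleanQuery.Builder();
        for (String t : terms) {
            strict.add(new TermQuery(new Term(FIELD, t)), BooleanClause.Occur.MUST);
        }
        TopDocs hits = searcher.search(strict.build(), 10);
        if (hits.scoreDocs.length > 0) {
            return hits; // good non-fuzzy hits: stop here
        }

        // Pass 2: OR together variants where exactly one term is fuzzed,
        // e.g. (A~ AND B AND C) OR (A AND B~ AND C) OR (A AND B AND C~).
        BooleanQuery.Builder fallback = new BooleanQuery.Builder();
        for (int fuzzed = 0; fuzzed < terms.length; fuzzed++) {
            BooleanQuery.Builder variant = new BooleanQuery.Builder();
            for (int i = 0; i < terms.length; i++) {
                Query q = (i == fuzzed)
                        ? new FuzzyQuery(new Term(FIELD, terms[i]))
                        : new TermQuery(new Term(FIELD, terms[i]));
                variant.add(q, BooleanClause.Occur.MUST);
            }
            fallback.add(variant.build(), BooleanClause.Occur.SHOULD);
        }
        return searcher.search(fallback.build(), 10);
    }
}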

Related

Find any values from a set/list within a string

Long time lurker, first time poster. I'm hoping to get some advice from the brilliant minds in this community. In the project I'm working in, the goal is to look at a user-provided string and determine if the content of that string contains any (one or many) matches to a list of match criteria. For example:
User-provided string: "I like thing a and thing b"
Match List:

Match Criteria | Match Type                  | Category
Foo            | Exact (Case Insensitive)    | Bar
Thing a        | Contains (Case Insensitive) | Things
Thing b        | Contains (Case Insensitive) | Stuff
In this case, it would return the following matches:
Thing a > Things
Thing b > Stuff
As of now, my approach is to iterate through the match criteria list and check each list item against the user-supplied string using the Match Type specified (Exact, Contains, Regular Expression), returning a list of the matches and then doing some stuff with that list. This approach works, even when matching ~100 rules and handling a 200-record batch, but it seems obvious that the performance will be pretty terrible if a large number of rules is introduced.
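For illustration, the brute-force loop described above looks roughly like this (sketched in Java rather than Apex, with invented rule fields), which is fine for ~100 rules but grows as rules x records:

import java.util.*;
import java.util.regex.Pattern;

class MatchRule {
    String criteria;   // e.g. "thing a"
    String matchType;  // "Exact", "Contains" or "Regex"
    String category;   // e.g. "Things"
}

class NaiveMatcher {
    static List<String> matchCategories(String input, List<MatchRule> rules) {
        List<String> matched = new ArrayList<>();
        String lower = input.toLowerCase();
        for (MatchRule rule : rules) {
            String crit = rule.criteria.toLowerCase();
            boolean hit;
            if ("Exact".equalsIgnoreCase(rule.matchType)) {
                hit = lower.equals(crit);
            } else if ("Contains".equalsIgnoreCase(rule.matchType)) {
                hit = lower.contains(crit);
            } else { // treat anything else as a regular expression
                hit = Pattern.compile(rule.criteria, Pattern.CASE_INSENSITIVE)
                             .matcher(input).find();
            }
            if (hit) matched.add(rule.category);
        }
        return matched; // one pass per rule per record: O(rules * records)
    }
}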
Is there a better way to do this that would be supported in Apex called by a trigger? I would love to learn a more sophisticated approach if there is one.
Thanks in advance!
What do you need it for? Is it a pure Apex exercise, or is it "close" to certain standard sObjects? There are lots of built-in features around "fuzzy matching".
In no specific order...
Have you looked into "all things Einstein", from categorising leads to predicting how likely an opportunity is to close? It might not be the direction you expected to take, but who knows.
Obviously SOSL comes to mind, like what powers the global search. It automatically does some substitutions for you, like Mike -> Michael.
Matching rules, duplicate rules. You'd have a limit of, say, 5 active rules, but you could hook them up from Apex, including creative abuse of the system. "Dear Salesforce, let's pretend I'm making such and such Opportunity, can you find me similar Opportunities?" (plot twist: you're not making an Oppty at all, you're creating an account for some venture capitalist looking for investments that match his preferences). Give matching rules a go, if not for everything then at least for more creative fuzzy matching. You really don't want to implement Soundex, Levenshtein, etc. manually...
Tags? There's a somewhat forgotten feature from SF Classic; it creates a bunch of tables (AccountTag, ContactTag). This plus SOSL could be close to what you need.
Additionally if you need this for anything close to Knowledge Base:
Data Categories come to mind
KB supports synonyms, letting you define your (not very intuitive) "thing b => stuff" mapping.
and it should survive translations

Data structure to index entire document and algorithm for quick search of any size substring

I'm trying to find a data structure (and algorithm) that would allow me to index an entire text document and search for any substring of it, no matter the size of the substring. The data structure should be stored on disk, during or at the end of the indexing procedure.
For instance, given the following sentence:
The book is on the table
The algorithm should quickly (O(log(n))) find the occurrences of any subset of the text.
For instance, if the input is book it should find all occurrences of it, but this should also be true for book is and The book is.
Unfortunately, the majority of solutions work by tokenizing the text and searching using individual tokens. Ordinary databases also index text without worrying about substring searching (is that why SELECT ... LIKE '%foo%' is done with a linear scan and takes so long?).
I could try to develop something from scratch (maybe a variation of an inverted index?), but I'd love to discover that somebody has already done that.
The most similar thing I found is SQLite3 Full-text search.
Thanks!
One approach is to index your document in a suffix tree; each prefix of some suffix is then a substring of the document.
With this approach, all you have to do is build your suffix tree, and upon querying a substring s, follow nodes in the tree. If you can follow through the entire query string, it means there is a suffix whose prefix is the query string, and thus it is also a substring.
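Not the answer's actual code, but here is a minimal sketch of the same idea using a suffix array instead of a suffix tree: every substring is a prefix of some suffix, so a binary search over the sorted suffixes finds all occurrences. The naive construction below is O(n^2 log n) and meant only for illustration; real implementations build the array in O(n log n) or O(n).

import java.util.*;

public class SuffixArraySearch {
    private final String text;
    private final Integer[] sa; // suffix start offsets, sorted by the suffix they point at

    public SuffixArraySearch(String text) {
        this.text = text;
        this.sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        // Naive construction: sort offsets by comparing the suffixes they start.
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    /** Start offsets of every occurrence of query in the text. */
    public List<Integer> find(String query) {
        // Binary search for the first suffix that is >= query.
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (text.substring(sa[mid]).compareTo(query) < 0) lo = mid + 1;
            else hi = mid;
        }
        // All suffixes starting with query are contiguous from lo onwards.
        List<Integer> hits = new ArrayList<>();
        for (int i = lo; i < sa.length && text.startsWith(query, sa[i]); i++) {
            hits.add(sa[i]);
        }
        return hits;
    }

    public static void main(String[] args) {
        SuffixArraySearch idx = new SuffixArraySearch("The book is on the table");
        System.out.println(idx.find("book is")); // [4]
    }
}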
If you are querying only complete words, an inverted index could be enough. An inverted index usually maps a term (word) to the list of documents it appears in; for you, it will instead map to locations in the document.
Upon query, you need to find, for each occurrence of word i of the query, its positions (let one of them be p), and check whether term i+1 of your query also appears at position p+1.
This can be done pretty efficiently, similarly to how an inverted index traditionally handles AND queries, but instead of searching for all terms in the same document, you search for terms at increasing positions.
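A quick sketch of that positional variant (my own illustration, not from the answer): map each token to its positions in the document, then verify that word i+1 appears at p+1 for some position p of word i.

import java.util.*;

public class PositionalIndex {
    private final Map<String, List<Integer>> positions = new HashMap<>();

    public PositionalIndex(String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            positions.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i);
        }
    }

    /** True if the words of the phrase occur consecutively somewhere in the text. */
    public boolean containsPhrase(String phrase) {
        String[] words = phrase.toLowerCase().split("\\s+");
        for (int start : positions.getOrDefault(words[0], Collections.emptyList())) {
            boolean ok = true;
            for (int i = 1; i < words.length; i++) {
                List<Integer> next = positions.getOrDefault(words[i], Collections.emptyList());
                if (!next.contains(start + i)) { ok = false; break; } // linear here; a real index would binary-search
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        PositionalIndex idx = new PositionalIndex("The book is on the table");
        System.out.println(idx.containsPhrase("book is"));  // true
        System.out.println(idx.containsPhrase("book table")); // false
    }
}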

Fuzzy search algorithm (approximate string matching algorithm)

I wish to create a fuzzy search algorithm.
However, upon hours of research I am really struggling.
I want to create an algorithm that performs a fuzzy search on a list of names of schools.
This is what I have looked at so far:
Most of my research keeps pointing to "string metrics" on Google and Stack Overflow, such as:
Levenshtein distance
Damerau-Levenshtein distance
Needleman–Wunsch algorithm
However this just gives a score of how similar 2 strings are. The only way I can think of implementing it as a search algorithm is to perform a linear search and executing the string metric algorithm for each string and returning the strings with scores above a certain threshold. (Originally I had my strings stored in a trie tree, but this obviously won't help me here!)
Although this is not such a bad idea for small lists, it would be problematic for lists of, let's say, 100,000 names where the user performs many queries.
Another algorithm I looked at is the spell-checker method, where you just do a search for all potential misspellings. However, this is also highly inefficient, as it requires generating more than 75,000 candidates for a word of length 7 and an error count of just 2.
What I need?
Can someone please suggest a good, efficient fuzzy search algorithm, with:
Name of the algorithm
How it works or a link to how it works
Pros and cons, and when it's best used (optional)
I understand that all algorithms will have their pros and cons and there is no best algorithm.
Considering that you're trying to do a fuzzy search on a list of school names, I don't think you want to go for traditional string similarity like Levenshtein distance. My assumption is that you're taking a user's input (either keyboard input or spoken over the phone), and you want to quickly find the matching school.
Distance metrics tell you how similar two strings are based on substitutions, deletions, and insertions. But those algorithms don't really tell you anything about how similar the strings are as words in a human language.
Consider, for example, the words "smith," "smythe," and "smote". I can go from "smythe" to "smith" in two steps:
smythe -> smithe -> smith
And from "smote" to "smith" in two steps:
smote -> smite -> smith
So the two have the same distance as strings, but as words, they're significantly different. If somebody told you (spoken language) that he was looking for "Smythe College," you'd almost certainly say, "Oh, I think you mean Smith." But if somebody said "Smote College," you wouldn't have any idea what he was talking about.
What you need is a phonetic algorithm like Soundex or Metaphone. Basically, those algorithms break a word down into phonemes and create a representation of how the word is pronounced in spoken language. You can then compare the result against a known list of words to find a match.
Such a system would be much faster than using a distance metric. Consider that with a distance metric, you need to compare the user's input with every word in your list to obtain the distance. That is computationally expensive, and the results, as I demonstrated with "smith" and "smote", can be laughably bad.
Using a phonetic algorithm, you create the phoneme representation of each of your known words and place it in a dictionary (a hash map or possibly a trie). That's a one-time startup cost. Then, whenever the user inputs a search term, you create the phoneme representation of his input and look it up in your dictionary. That is a lot faster and produces much better results.
Consider also that when people misspell proper names, they almost always get the first letter right, and more often than not the misspelling, when pronounced, sounds like the actual word they were trying to spell. If that's the case, then the phonetic algorithms are definitely the way to go.
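As a concrete illustration, here is a simplified Soundex-style encoder plus a dictionary lookup in Java (a sketch only, not the standard Soundex with all its edge cases; a real system would use a library implementation of Soundex or Metaphone): every known name is encoded once at startup, and a misspelled input is looked up by its phonetic code.

import java.util.*;

public class PhoneticLookup {
    // phonetic code -> known names with that code (built once at startup)
    private final Map<String, List<String>> byCode = new HashMap<>();

    public PhoneticLookup(List<String> knownNames) {
        for (String name : knownNames) {
            byCode.computeIfAbsent(soundex(name), k -> new ArrayList<>()).add(name);
        }
    }

    public List<String> candidates(String userInput) {
        return byCode.getOrDefault(soundex(userInput), Collections.emptyList());
    }

    // Simplified Soundex: keep the first letter, map consonants to digit classes,
    // drop vowels/H/W/Y, collapse adjacent repeats, pad to 4 characters.
    static String soundex(String s) {
        String codes = "01230120022455012623010202"; // A..Z -> digit class, '0' = ignored
        s = s.toUpperCase().replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "0000";
        StringBuilder out = new StringBuilder().append(s.charAt(0));
        char prev = codes.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char code = codes.charAt(s.charAt(i) - 'A');
            if (code != '0' && code != prev) out.append(code);
            prev = code;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    public static void main(String[] args) {
        PhoneticLookup lookup =
                new PhoneticLookup(Arrays.asList("Smith College", "Jones Academy"));
        System.out.println(lookup.candidates("Smythe College")); // [Smith College]
    }
}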
I wrote an article about how I implemented a fuzzy search:
https://medium.com/@Srekel/implementing-a-fuzzy-search-algorithm-for-the-debuginator-cacc349e6c55
The implementation is on GitHub and is in the public domain, so feel free to have a look.
https://github.com/Srekel/the-debuginator/blob/master/the_debuginator.h#L1856
The basics of it: split all strings you'll be searching into parts. So if you have paths, then "C:\documents\lol.txt" is maybe "C", "documents", "lol", "txt".
Lowercase these strings so that the search is case insensitive (maybe only do this if the search string is all-lowercase).
Then match your search string against this. In my case I want to match it regardless of order, so "loldoc" would still match the above path even though "lol" comes after "doc".
The matching needs to have some scoring to be good. The most important part I think is consecutive matching, so the more characters directly after one another that match, the better. So "doc" is better than "dcm".
Then you'll likely want to give extra score for a match that's at the start of a part. So you get more points for "doc" than "ocu".
In my case I also give more points for matching the end of a part.
And finally, you may want to consider giving extra points for matching the last part(s). This makes it so that matching the file name/ending scores higher than the folders leading up to it.
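A toy version of that scoring, loosely following the ideas above (this is not the Debuginator's actual code, and all the weights are invented): match the query characters in order within a part, rewarding consecutive matches and matches at the start of the part.

// Toy scorer: every query character must appear in the part, in order.
static int scorePart(String part, String query) {
    String p = part.toLowerCase();
    String q = query.toLowerCase();
    int score = 0, qi = 0, streak = 0;
    for (int pi = 0; pi < p.length() && qi < q.length(); pi++) {
        if (p.charAt(pi) == q.charAt(qi)) {
            streak++;
            score += streak;            // consecutive matches are worth more and more
            if (pi == 0) score += 3;    // bonus for matching the start of the part
            qi++;
        } else {
            streak = 0;                 // a gap resets the consecutive bonus
        }
    }
    return qi == q.length() ? score : 0; // unmatched query characters -> no match
}

With these made-up weights, scorePart("documents", "doc") scores 9 while scorePart("documents", "dcm") scores 6, which matches the "consecutive characters are better" idea above.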
You're confusing fuzzy search algorithms with implementation: a fuzzy search of a word may return 400 results of all the words that have Levenshtein distance of, say, 2. But, to the user you have to display only the top 5-10.
Implementation-wise, you'll pre-process all the words in the dictionary and save the results into a DB. The popular words (and their fuzzy-likes) will be saved into cache-layer - so you won't have to hit the DB for every request.
You may add an AI layer that adds the most common spelling mistakes to the DB. And so on.
A simple algorithm for "a kind of fuzzy search"
To be honest, in some cases, fuzzy search is mostly useless and I think that a simpler algorithm can improve the search result while providing the feeling that we are still performing a fuzzy search.
Here is my use case: Filtering down a list of countries using "Fuzzy search".
The list I was working with had two countries starting with Z: Zambia and Zimbabwe.
I was using Fusejs.
In this case, when entering the needle "zam", the result set had 19 matches, with the most relevant one for any human (Zambia) at the bottom of the list. And most of the other countries in the result did not even have the letter z in their name.
This was for a mobile app where you can pick a country from a list. It was supposed to be much like when you have to pick a contact from the phone's contacts. You can filter the contact list by entering some term in the search box.
IMHO, this kind of limited content to search from should not be treated in a way that will have people asking "what the heck?!?".
One might suggest to sort by most relevant match. But that's out of the question in this case because the user will then always have to visually find the "Item of Interest" in the reduced list. Keep in mind that this is supposed to be a filtering tool, not a search engine "à la Google". So the result should be sorted in a predictable way. And before filtering, the sorting was alphabetical. So the filtered list should just be an alphabetically sorted subset of the original list.
So I came up with the following algorithm ...
Grab the needle ... in this case: zam
Insert the .* pattern at the beginning and end of the needle
Insert the .* pattern between each letter of the needle
Perform a Regex search in the haystack using the new needle which is now .*z.*a.*m.*
In this case, the user will get a much more expected result: everything that has the letters z, a and m appearing somewhere in that order. All the letters in the needle will be present in the matches, in the same order.
This will also match country names like Mozambique ... which is perfect.
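The original was JavaScript with Fusejs; here is the same trick sketched in Java (my own illustration), quoting each character of the needle before joining with .* so regex metacharacters can't break the pattern:

import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SubsequenceFilter {
    /** Keeps only the haystack entries containing the needle's letters in order. */
    static List<String> filter(List<String> haystack, String needle) {
        // Turn "zam" into ".*z.*a.*m.*", quoting each character.
        String regex = needle.toLowerCase().chars()
                .mapToObj(c -> Pattern.quote(String.valueOf((char) c)))
                .collect(Collectors.joining(".*", ".*", ".*"));
        Pattern p = Pattern.compile(regex);
        return haystack.stream()
                .filter(entry -> p.matcher(entry.toLowerCase()).matches())
                .collect(Collectors.toList()); // original (alphabetical) order is preserved
    }

    public static void main(String[] args) {
        List<String> countries = Arrays.asList("Germany", "Mozambique", "Zambia", "Zimbabwe");
        System.out.println(filter(countries, "zam")); // [Mozambique, Zambia]
    }
}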
I just think that sometimes, we should not try to kill a fly with a bazooka.
Fuzzysort is a JavaScript library that is helpful for performing string matching over a large collection of data.
The following code shows how to use fuzzysort in React.
Install fuzzysort through npm:
npm install fuzzysort
Full demo code in React:
import React from 'react';
import './App.css';
import data from './testdata';
const fuzzysort = require('fuzzysort');

class App extends React.Component {
  constructor(props) {
    super(props);
    this.state = {
      keyword: '',
      results: [],
    };
    console.log("data: ", data["steam_games"]);
  }

  search(keyword, category) {
    return fuzzysort.go(keyword, data[category]);
  }

  render() {
    return (
      <div className="App">
        <input
          type="text"
          onChange={(e) => this.setState({keyword: e.target.value})}
          value={this.state.keyword}
        />
        <button onClick={() => this.setState({results: this.search(this.state.keyword, "steam_games")})}>
          Search
        </button>
        {this.state.results !== null && this.state.results.length > 0 ?
          <h3>Results:</h3> : null
        }
        <ul>
          {this.state.results.map((item, index) => {
            return (
              <li key={index}>{item.score} : {item.target}</li>
            );
          })}
        </ul>
      </div>
    );
  }
}

export default App;
For more, refer to fuzzysort.
The problem can be broken down into two parts:
1) Choosing the correct string metric.
2) Coming up with a fast implementation of the same.
Choosing the correct metric: This part is largely dependent on your use case. However, I would suggest using a combination of a distance-based score and a phonetic-based encoding for greater accuracy i.e. initially computing a score based on the Levenshtein distance and later using Metaphone or Double Metaphone to complement the results.
Again, you should base your decision on your use case. If you can do with using just the Metaphone or Double Metaphone algorithms, then you needn't worry much about the computational cost.
Implementation: One way to cut down the computational cost is to cluster your data into several small groups based on your use case and load them into a dictionary.
For example, if you can assume that your user enters the first letter of the name correctly, you can store the names based on this invariant in a dictionary.
So, if the user enters the name "National School", you need to compute the fuzzy matching score only for school names starting with the letter "N".
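A small sketch of that bucketing idea (my own illustration): group the names by first letter once, then run Levenshtein only inside the relevant bucket.

import java.util.*;

public class BucketedFuzzySearch {
    private final Map<Character, List<String>> buckets = new HashMap<>();

    public BucketedFuzzySearch(List<String> names) {
        for (String name : names) {
            char first = Character.toLowerCase(name.charAt(0));
            buckets.computeIfAbsent(first, k -> new ArrayList<>()).add(name);
        }
    }

    /** Names in the query's first-letter bucket within the given edit distance. */
    public List<String> search(String query, int maxDistance) {
        List<String> result = new ArrayList<>();
        char first = Character.toLowerCase(query.charAt(0));
        for (String name : buckets.getOrDefault(first, Collections.emptyList())) {
            if (levenshtein(query.toLowerCase(), name.toLowerCase()) <= maxDistance) {
                result.add(name);
            }
        }
        return result;
    }

    // Classic two-row dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}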

Search with attribute values correspondence in Lucene

Here's a text with ambiguous words:
"A man saw an elephant."
Each word has attributes: lemma, part of speech, and various grammatical attributes depending on its part of speech.
For "saw" it is like:
{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}
All these attributes come from third-party tools; Lucene itself is not involved in the word disambiguation.
I want to perform a query like "pos=verb & number=singular" and NOT to get "saw" in the result.
I thought of encoding distinct grammatical annotations into strings like "l:see;pos:verb;t:past|l:saw;pos:noun;n:sg" and searching for regexp "pos\:verb[^\|]+n\:sg", but I definitely can't afford regexp queries due to performance issues.
Maybe some hacks with posting list payloads can be applied?
UPD: A draft of my solution
Here are the specifics of my project: there is a fixed maximum of parses a word can have (say, 8).
So, I thought of inserting the parse number into each attribute's payload and using this payload at the posting list intersection stage.
E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular': ...|...|2.1234|...|...
While processing a query like 'pos = Verb AND number = singular', the 'x.1234' entries would be accepted at all stages of posting list processing until the intersection stage, where they would be rejected because of non-corresponding parse numbers.
I think this is a pretty compact solution, but how hard would it be to incorporate into Lucene?
So... the cheater way of doing this is (indeed) to control how you build the Lucene index.
When constructing the Lucene index, modify each word before Lucene indexes it so that it includes all the necessary attributes of the word. If you index things this way, you must do the lookup in the same way.
One way:
This means for each type of query you do, you must also build an index in the same way.
Example:
saw becomes noun-saw -- index it as that.
saw also becomes verb-past-see -- index it as that.
saw also becomes noun-singular-saw -- index it as that.
The other way:
If you want attribute based lookup in a single index, you'd probably have to do something like permutation completion on the word 'saw' so that instead of noun-saw, you'd have all possible permutations of the attributes necessary in a big logic statement.
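A sketch of what that expansion could look like (illustrative only; the token syntax is made up): for every parse, emit one token per subset of its attributes, so any attribute combination can later be looked up as a single exact term and can never mix attributes from two different parses.

import java.util.*;

public class ParseTokenizer {
    /**
     * parse is the attribute map for one reading of the word,
     * e.g. {lemma=see, pos=verb, tense=past}. Returns one token per
     * non-empty attribute subset, e.g. "pos:verb", "pos:verb|tense:past", ...
     */
    static List<String> subsetTokens(Map<String, String> parse) {
        List<String> attrs = new ArrayList<>();
        new TreeMap<>(parse).forEach((k, v) -> attrs.add(k + ":" + v)); // sorted for a canonical order
        List<String> tokens = new ArrayList<>();
        for (int mask = 1; mask < (1 << attrs.size()); mask++) {
            StringBuilder token = new StringBuilder();
            for (int i = 0; i < attrs.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    if (token.length() > 0) token.append('|');
                    token.append(attrs.get(i));
                }
            }
            tokens.add(token.toString());
        }
        return tokens;
    }
}

Index every token of every parse into one field; a query like "pos=verb AND number=singular" then becomes a single exact term lookup on "number:singular|pos:verb", which can only match if some single parse carries both attributes.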
Not sure if this is a good answer, but that's all I could think of.

How does a search index work when querying many words?

I'm trying to build my own search engine for experimenting.
I know about inverted indexes, for example when indexing words: the key is the word, and it has a list of the document ids containing that word. So when you search for that word, you get the documents right away.
How does it work for multiple words? Do you get all the documents for every word and then traverse those documents to see if they have both words?
I feel that is not the case. Does anyone know the real answer for this, without speculating?
An inverted index is very efficient for computing intersections, using a zig-zag algorithm.
Assume your terms are in a list T:
lastDoc <- 0                     // the first doc in the collection
currTerm <- 0                    // the first term in T
while (lastDoc != infinity):
    if (currTerm > T.last):      // we have passed the last term: lastDoc matched every term
        insert lastDoc into result
        currTerm <- 0
        lastDoc <- lastDoc + 1
        continue
    docId <- T[currTerm].getFirstAfter(lastDoc - 1)
    if (docId != lastDoc):
        lastDoc <- docId
        currTerm <- 0
    else:
        currTerm <- currTerm + 1
This algorithm assumes an efficient getFirstAfter(), which gives you the first document that fits the term and whose docId is greater than the specified parameter. It should return infinity if there is none.
The algorithm will be most efficient if the terms are sorted such that the rarest term is first.
The algorithm ensures at most #docs_matching_first_term * #terms iterations, but practically - it will usually be much less iterations.
Note: though this algorithm is efficient, AFAIK Lucene does not use it.
More info can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page].
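A Java version of the zig-zag idea above (my own translation, assuming each posting list is an int array sorted by doc id; getFirstAfter becomes a binary search):

import java.util.*;

public class ZigZagIntersection {
    /** Doc ids present in every posting list (each list sorted ascending). */
    static List<Integer> intersect(int[][] postings) {
        List<Integer> result = new ArrayList<>();
        if (postings.length == 0) return result;
        // Rarest term first makes the zig-zag skip the most documents.
        Arrays.sort(postings, (a, b) -> Integer.compare(a.length, b.length));
        int lastDoc = firstAfter(postings[0], -1);
        int currTerm = 1;
        while (lastDoc != Integer.MAX_VALUE) {
            if (currTerm == postings.length) {   // all terms agreed on lastDoc
                result.add(lastDoc);
                lastDoc = firstAfter(postings[0], lastDoc);
                currTerm = 1;
                continue;
            }
            int docId = firstAfter(postings[currTerm], lastDoc - 1);
            if (docId != lastDoc) {              // this term skips ahead: restart from its doc
                lastDoc = docId;
                currTerm = 0;
            } else {
                currTerm++;                      // this term also contains lastDoc
            }
        }
        return result;
    }

    // First doc id strictly greater than 'after', or MAX_VALUE if there is none.
    static int firstAfter(int[] postings, int after) {
        int lo = 0, hi = postings.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (postings[mid] <= after) lo = mid + 1; else hi = mid;
        }
        return lo < postings.length ? postings[lo] : Integer.MAX_VALUE;
    }
}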
You need to store the position of each word in a document in the index file.
Your index file structure should be like this:
word id - doc id - no. of hits - positions of hits
Now suppose the query contains 4 words, "w1 w2 w3 w4". Choose the files containing most of the words, then calculate their relative distances in the document. The document where most of the words occur and their relative distance is smallest will have high priority in the search results.
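A rough sketch of that proximity idea (my own illustration, not the author's engine): given the sorted position lists of two query words inside one document, find how close together they appear; smaller distances would rank the document higher.

// Classic two-pointer walk over two sorted position lists.
static int minDistance(int[] positionsA, int[] positionsB) {
    int i = 0, j = 0, best = Integer.MAX_VALUE;
    while (i < positionsA.length && j < positionsB.length) {
        best = Math.min(best, Math.abs(positionsA[i] - positionsB[j]));
        if (positionsA[i] < positionsB[j]) i++; else j++;
    }
    return best; // e.g. sum this over adjacent query word pairs (w1,w2), (w2,w3), ...
}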
I have developed a complete search engine without using any crawling or indexing tool available on the internet. You can read a detailed description here: Search Engine.
For more info, read this paper by the Google founders: click here.
You find the intersection of document sets as biziclop said, and you can do it in a fairly fast way. See this post and the papers linked therein for a more formal description.
As pointed out by biziclop, for an AND query you need to intersect the match lists (aka inverted lists) for the two query terms.
In typical implementations, the inverted lists are implemented such that they can be searched for any given document id very efficiently (generally, in logarithmic time). One way to achieve this is to keep them sorted (and use binary search), but note that this is not trivial as there is also a need to store them in compressed form.
Given a query A AND B, assume that there are occ(A) matches for A and occ(B) matches for B (i.e. occ(x) := the length of the match list for term x). Assume, without loss of generality, that occ(A) > occ(B), i.e. A occurs more frequently in the documents than B. What you do then is to iterate through all matches for B and search for each of them in the list for A. If the lists can indeed be searched in logarithmic time, this means you need
occ(B) * log(occ(A))
computational steps to identify all matches that contain both terms.
A great book describing various aspects of the implementation is Managing Gigabytes.
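In code, that occ(B) * log(occ(A)) approach is simply: walk the shorter (sorted) match list and binary-search each of its doc ids in the longer one. A minimal sketch, with plain uncompressed arrays standing in for the inverted lists:

import java.util.*;

public class AndQuery {
    /** Doc ids present in both sorted match lists. */
    static List<Integer> intersect(int[] listA, int[] listB) {
        int[] shorter = listA.length <= listB.length ? listA : listB;
        int[] longer  = listA.length <= listB.length ? listB : listA;
        List<Integer> result = new ArrayList<>();
        for (int docId : shorter) {
            if (Arrays.binarySearch(longer, docId) >= 0) { // log(occ(longer)) per lookup
                result.add(docId);
            }
        }
        return result; // occ(shorter) * log(occ(longer)) steps in total
    }
}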
I don't really understand why people are talking about intersection for this.
Lucene supports combination of queries using BooleanQuery, which you can nest indefinitely if you must.
The QueryParser also supports the AND keyword, which would require both words to be in the document.
Example (Lucene.NET, C#):
var outerQuery = new BooleanQuery();
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word1 ) ), BooleanClause.Occur.MUST );
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word2 ) ), BooleanClause.Occur.MUST );
If you want to split the words (your actual search term) using the same analyzer, there are ways to do that too. Although, a QueryParser might be easier to use.
You can view this answer for example on how to split the string using the same analyzer that you used for indexing:
No hits when searching for "mvc2" with lucene.net
