Data structure for retrieving strings that are close by Levenshtein distance - string

For example, starting with the set of english words, is there a structure/algorithm that allows one fast retrieval of strings such as "light" and "tight", using the word "right" as the query? I.e., I want to retrieve strings with small Levenshtein distance to the query string.

The BK-tree data structure might be appropriate here. It's designed to efficiently support queries of the form "what are all words within edit distance k or less from a query word?" Its performance guarantees are reasonably good, and it's not too difficult to implement.
Hope this helps!

Since calculating Levenshtein distance is O(nm) for strings of length n and m, the naive approach of calculating all Levenshtein distances L(querystring, otherstring) is very expensive.
However, if you visualize the Levenshtein algorithm, it basically fills an n*m table with edit distances. But for words that start with the same few letters (prefix), the first few rows of the Levenshtein tables will be the same. (Fixing the query string, of course.)
This suggests using a trie (also called prefix tree): Read the query string, then build a trie of Levenshtein rows. Afterwards, you can easily traverse it to find strings close to the query string.
(This does mean that you have to build an new trie for a new query string. I don't think there is a similarly intriguing structure for all-pairs distances.)
I thought I recently saw an article about this with a nice python implementation. Will add a link if I can find it. Edit: Here it is, on Steve Hanov's blog.

I'm thinking the fastest way would be to pre-build a cache of similarities which you can index and access in O(1) time. The trick would be to find common misspellings to add to your cache, which could get pretty large.
I imagine Google would do something similar using their wide range of statistical query search data.

Related

Finding the most similar string among a set of millions of strings

Let's say I have a dictionary (word list) of millions upon millions of words. Given a query word, I want to find the word from that huge list that is most similar.
So let's say my query is elepant, then the result would most likely be elephant.
If my word is fentist, the result will probably be dentist.
Of course assuming both elephant and dentist are present in my initial word list.
What kind of index, data structure or algorithm can I use for this so that the query is fast? Hopefully complexity of O(log N).
What I have: The most naive thing to do is to create a "distance function" (which computes the "distance" between two words, in terms of how different they are) and then in O(n) compare the query with every word in the list, and return the one with the closest distance. But I wouldn't use this because it's slow.
The problem you're describing is a Nearest Neighbor Search (NNS). There are two main methods of solving NNS problems: exact and approximate.
If you need an exact solution, I would recommend a metric tree, such as the M-tree, the MVP-tree, and the BK-tree. These trees take advantage of the triangle inequality to speed up search.
If you're willing to accept an approximate solution, there are much faster algorithms. The current state of the art for approximate methods is Hierarchical Navigable Small World (hnsw). The Non-Metric Space Library (nmslib) provides an efficient implementation of hnsw as well as several other approximate NNS methods.
(You can compute the Levenshtein distance with Hirschberg's algorithm)
I made similar algorythm some time ago
Idea is to have an array char[255] with characters
and values is a list of words hashes (word ids) that contains this character
When you are searching 'dele....'
search(d) will return empty list
search(e) will find everything with character e, including elephant (two times, as it have two 'e')
search(l) will brings you new list, and you need to combine this list with results from previous step
...
at the end of input you will have a list
then you can try to do group by wordHash and order by desc by count
Also intresting thing, if your input is missing one or more characters, you will just receive empty list in the middle of the search and it will not affect this idea
My initial algorythm was without ordering, and i was storing for every character wordId and lineNumber and char position.
My main problem was that i want to search
with ee to find 'elephant'
with eleant to find 'elephant'
with antph to find 'elephant'
Every words was actually a line from file, so it's often was very long
And number of files and lines was big
I wanted quick search for directories with more than 1gb text files
So it was a problem even store them in memory, for this idea you need 3 parts
function to fill your cache
function to find by char from input
function to filter and maybe order results (i didn't use ordering, as i was trying to fill my cache in same order as i read the file, and i wanted to put lines that contains input in the same order upper )
I hope it make sense

Fuzzy search algorithm (approximate string matching algorithm)

I wish to create a fuzzy search algorithm.
However, upon hours of research I am really struggling.
I want to create an algorithm that performs a fuzzy search on a list of names of schools.
This is what I have looked at so far:
Most of my research keep pointing to "string metrics" on Google and Stackoverflow such as:
Levenshtein distance
Damerau-Levenshtein distance
Needleman–Wunsch algorithm
However this just gives a score of how similar 2 strings are. The only way I can think of implementing it as a search algorithm is to perform a linear search and executing the string metric algorithm for each string and returning the strings with scores above a certain threshold. (Originally I had my strings stored in a trie tree, but this obviously won't help me here!)
Although this is not such a bad idea for small lists, it would be problematic for lists with lets say a 100,000 names, and the user performed many queries.
Another algorithm I looked at is the Spell-checker method, where you just do a search for all potential misspellings. However this also is highly inefficient as it requires more than 75,000 words for a word of length 7 and error count of just 2.
What I need?
Can someone please suggest me a good efficient fuzzy search algorithm. with:
Name of the algorithm
How it works or a link to how it works
Pro's and cons and when it's best used (optional)
I understand that all algorithms will have their pros and cons and there is no best algorithm.
Considering that you're trying to do a fuzzy search on a list of school names, I don't think you want to go for traditional string similarity like Levenshtein distance. My assumption is that you're taking a user's input (either keyboard input or spoken over the phone), and you want to quickly find the matching school.
Distance metrics tell you how similar two strings are based on substitutions, deletions, and insertions. But those algorithms don't really tell you anything about how similar the strings are as words in a human language.
Consider, for example, the words "smith," "smythe," and "smote". I can go from "smythe" to "smith" in two steps:
smythe -> smithe -> smith
And from "smote" to "smith" in two steps:
smote -> smite -> smith
So the two have the same distance as strings, but as words, they're significantly different. If somebody told you (spoken language) that he was looking for "Symthe College," you'd almost certainly say, "Oh, I think you mean Smith." But if somebody said "Smote College," you wouldn't have any idea what he was talking about.
What you need is a phonetic algorithm like Soundex or Metaphone. Basically, those algorithms break a word down into phonemes and create a representation of how the word is pronounced in spoken language. You can then compare the result against a known list of words to find a match.
Such a system would be much faster than using a distance metric. Consider that with a distance metric, you need to compare the user's input with every word in your list to obtain the distance. That is computationally expensive and the results, as I demonstrated with "smith" and "smote" can be laughably bad.
Using a phonetic algorithm, you create the phoneme representation of each of your known words and place it in a dictionary (a hash map or possibly a trie). That's a one-time startup cost. Then, whenever the user inputs a search term, you create the phoneme representation of his input and look it up in your dictionary. That is a lot faster and produces much better results.
Consider also that when people misspell proper names, they almost always get the first letter right, and more often than not pronouncing the misspelling sounds like the actual word they were trying to spell. If that's the case, then the phonetic algorithms are definitely the way to go.
I wrote an article about how I implemented a fuzzy search:
https://medium.com/#Srekel/implementing-a-fuzzy-search-algorithm-for-the-debuginator-cacc349e6c55
The implementation is in Github and is in the public domain, so feel free to have a look.
https://github.com/Srekel/the-debuginator/blob/master/the_debuginator.h#L1856
The basics of it is: Split all strings you'll be searching for into parts. So if you have paths, then "C:\documents\lol.txt" is maybe "C", "documents", "lol", "txt".
Ensure you lowercase these strings to ensure that you it's case insensitive. (Maybe only do it if the search string is all-lowercase).
Then match your search string against this. In my case I want to match it regardless of order, so "loldoc" would still match the above path even though "lol" comes after "doc".
The matching needs to have some scoring to be good. The most important part I think is consecutive matching, so the more characters directly after one another that match, the better. So "doc" is better than "dcm".
Then you'll likely want to give extra score for a match that's at the start of a part. So you get more points for "doc" than "ocu".
In my case I also give more points for matching the end of a part.
And finally, you may want to consider giving extra points for matching the last part(s). This makes it so that matching the file name/ending scores higher than the folders leading up to it.
You're confusing fuzzy search algorithms with implementation: a fuzzy search of a word may return 400 results of all the words that have Levenshtein distance of, say, 2. But, to the user you have to display only the top 5-10.
Implementation-wise, you'll pre-process all the words in the dictionary and save the results into a DB. The popular words (and their fuzzy-likes) will be saved into cache-layer - so you won't have to hit the DB for every request.
You may add an AI layer that will add the most common spelling mistakes and add them to the DB. And etc.
A simple algorithm for "a kind of fuzzy search"
To be honest, in some cases, fuzzy search is mostly useless and I think that a simpler algorithm can improve the search result while providing the feeling that we are still performing a fuzzy search.
Here is my use case: Filtering down a list of countries using "Fuzzy search".
The list I was working with had two countries starting with Z: Zambia and Zimbabwe.
I was using Fusejs.
In this case, when entering the needle "zam", the result set was having 19 matches and the most relevant one for any human (Zambia) at the bottom of the list. And most of the other countries in the result did not even have the letter z in their name.
This was for a mobile app where you can pick a country from a list. It was supposed to be much like when you have to pick a contact from the phone's contacts. You can filter the contact list by entering some term in the search box.
IMHO, this kind of limited content to search from should not be treated in a way that will have people asking "what the heck?!?".
One might suggest to sort by most relevant match. But that's out of the question in this case because the user will then always have to visually find the "Item of Interest" in the reduced list. Keep in mind that this is supposed to be a filtering tool, not a search engine "à la Google". So the result should be sorted in a predictable way. And before filtering, the sorting was alphabetical. So the filtered list should just be an alphabetically sorted subset of the original list.
So I came up with the following algorithm ...
Grab the needle ... in this case: zam
Insert the .* pattern at the beginning and end of the needle
Insert the .* pattern between each letter of the needle
Perform a Regex search in the haystack using the new needle which is now .*z.*a.*m.*
In this case, the user will have a much expected result by finding everything that has somehow the letters z, a and m appearing in this order. All the letters in the needles will be present in the matches in the same order.
This will also match country names like Mozambique ... which is perfect.
I just think that sometimes, we should not try to kill a fly with a bazooka.
Fuzzy Sort is a javascript library is helpful to perform string matching from a large collection of data.
The following code will helpful to use fuzzy sort in react.js.
Install fuzzy sort through npm,
npm install fuzzysort
Full demo code in react.js
import React from 'react';
import './App.css';
import data from './testdata';
const fuzzysort = require('fuzzysort');
class App extends React.Component {
constructor(props){
super(props)
this.state = {
keyword: '',
results: [],
}
console.log("data: ", data["steam_games"]);
}
search(keyword, category) {
return fuzzysort.go(keyword, data[category]);
}
render(){
return (
<div className="App">
<input type="text" onChange={(e)=> this.setState({keyword: e.target.value})}
value={this.state.keyword}
/>
<button onClick={()=>this.setState({results: this.search(this.state.keyword, "steam_games")})}>Search</button>
{this.state.results !== null && this.state.results.length > 0 ?
<h3>Results:</h3> : null
}
<ul>
{this.state.results.map((item, index) =>{
return(
<li key={index}>{item.score} : {item.target}</li>
)
})
}
</ul>
</div>
);
}
}
export default App;
For more refer FuzzySort
The problem can be broken down into two parts:
1) Choosing the correct string metric.
2) Coming up with a fast implementation of the same.
Choosing the correct metric: This part is largely dependent on your use case. However, I would suggest using a combination of a distance-based score and a phonetic-based encoding for greater accuracy i.e. initially computing a score based on the Levenshtein distance and later using Metaphone or Double Metaphone to complement the results.
Again, you should base your decision on your use case. If you can do with using just the Metaphone or Double Metaphone algorithms, then you needn't worry much about the computational cost.
Implementation: One way to cap down the computational cost is to cluster your data into several small groups based on your use case and load them into a dictionary.
For example, If you can assume that your user enters the first letter of the name correctly, you can store the names based on this invariant in a dictionary.
So, if the user enters the name "National School" you need to compute the fuzzy matching score only for school names starting with the letter "N"

Is there an alternative to SOUNDEX or metaphone that handles names with fewer collisions?

I'm trying to find near duplicates in a large list of names by computing the metaphone key for each string, and then, within each set of possible duplicates, use something like Levenshtein distance to get a more refined estimate of duplicate likelihood.1
However, I'm finding that metaphone is heavily determined by the first characters in the strings, and so if I feed it a long list of people's names, I get huge buckets where everyone's name is "Jennifer X" or "Richard Y", but otherwise haven't got much in common.
If I reverse the string before generating the key, the results are more sensible, in that they group by last name, but still I find that the first names aren't particularly similar.
So is there a similar algorithm that samples more of the input string to produce a sound key, perhaps by using a longer key string?
[1] Ideally, I'd compute the string distances directly, but if my list has 10,000 names, that would mean 100,000,000 computations, which is why I'm trying to divide and conquer by sound keying each name first and only checking for similarities within the buckets. But if there's a better way, I'd love to hear about it!
Try eudex.
It's described as "A blazingly fast phonetic reduction/hashing algorithm."
There are many easy ways to use it, as it encodes a word into a 64-bit integer with most discriminating features towards the MSB. The hamming difference between hashes is also useful as a difference metric between words and spellings.

Clustering string data with ELKI

I need to cluster a large number of strings using ELKI based on the Edit Distance / Levenshtein Distance. Since the data set is too large, I'd like to avoid file based precomputed distance matrices. How can I
(a) load string data in ELKI from a file (only "Labels")?
(b) implement a distance function accessing the labels (extend AbstractDBIDDistanceFunction, but how to get the labels?)
Some code snippets or example input files would be helpful.
It's actually pretty straightforward:
A) write a Parser that is adequate for your input file format (why try to reuse a parser written for numerical vectors with labels?), probably subclassing AbstractStreamingParser, producing a relation of the desired data type (probably you can just use String. If you want to be a bit more general TokenSequence may be a more appropriate concept for these distances. Strings are just the simplest case.
B) implement a DistanceFunction based on this vector type instead of DBIDs, i.e. a PrimitiveDistanceFunction<String>. Again, subclassing AbstractPrimitiveDistanceFunction may be the easiest thing to do.
For performance reasons, you may also want to look into indexing algorithms to retrieve e.g. the k most similar strings efficiently. I'm not sure which index structures exist for string edit distance and levenshtein distance.
A colleague has a student that apparently has some working token edit distances, but I have not seen or reviewed the code yet. As he is processing log files, he will probably be using a token based approach instead of characters.

Getting fuzzy string matches from database very fast

I have a database of ~150'000 words and a pattern (any single word) and I want to get all words from the database which has Damerau-Levenshtein distance between it and the pattern less than given number. I need to do it extremely fast. What algorithm could you suggest? If there's no good algorithm for Damerau-Levenshtein distance, just Levenshtin distance will be welcome as well.
Thank you for your help.
P.S. I'm not going to use SOUNDEX.
I would start with a SQL function to calculate the Levenshtein distance (in T-SQl or .Net) (yes, I'm a MS person...) with a maximum distance parameter that would cause an early exit.
This function could then be used to compare your input with each string to check the distanve and move on to the next if it breaks the threshold.
I was also thinking you could, for example, set the maximum distance to be 2, then filter all words where the length is more than 1 different whilst the first letter is different. With an index this may be slightly quicker.
You could also shortcut to bring back all strings that are perfect matches (indexing will speed this up) as these will actually take longer to calculate the Levenshtein distance of 0.
Just some thoughts....
I do not think you can calculate this kind of function without actually enumerating all rows.
So the solutions are:
Make it a very fast enumeration (but this doesn't really scale)
Filter initial variants somehow (index by a letter, at least x common letters)
Use alternative (indexable) algorithm, such as N-Grams (however I do not have details on result quality of ngrams versus D-L distance).
A solution off the top of my head might be to store the database in a sorted set (e.g., std::set in C++), as it seems to me that strings sorted lexicographically would compare well. To approximate the position of the given string in the set, use std::upper_bound on the string, then iterate over the set outward from the found position in both directions, computing the distance as you go, and stop when it falls below a certain threshold. I have a feeling that this solution would probably only match strings with the same start character, but if you're using the algorithm for spell-checking, then that restriction is common, or at least unsurprising.
Edit: If you're looking for an optimisation of the algorithm itself, however, this answer is irrelevant.
I have used KNIME for string fuzzy matching and has got very fast results. It is also very easy to make visual workflows in it. Just install KNIME free edition from https://www.knime.org/ then use "String Distance" and "Similarity Search" nodes to get your results. I have attached a small fuzzy matching smaple workflow in here (the input data come from top and the patterns to search for come from the bottom in this case):
I would recommend looking into Ankiro.
I'm not certain that it meets your requirements for precision, but it is fast.

Resources