If I have three strings, string1 = Laptop, string2 = Latpop and string3 = Lavmop, then the Levenshtein distance algorithm returns the same distance for string1 vs. string2 as for string1 vs. string3. That is because the Levenshtein algorithm only counts the operations insert, delete and substitute, and does not include the transposition operation; for example, swapping the third and fourth characters of Latpop yields Laptop.
It's obvious that Latpop is more similar to Laptop than Lavmop is, and it's not correct to place them at the same similarity level.
Is there an algorithm that takes the transposition operation into account?
I found the answer in the Damerau–Levenshtein distance and the Jaro–Winkler distance.
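For anyone curious, here is a minimal sketch (my own, not taken from any particular library) of the restricted Damerau–Levenshtein variant known as optimal string alignment. It charges a single edit for an adjacent swap, so Latpop comes out closer to Laptop than Lavmop does:

```python
def osa_distance(a, b):
    """Optimal string alignment: Levenshtein edits plus adjacent transpositions."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[m][n]

print(osa_distance("Laptop", "Latpop"))  # 1 (one transposition); plain Levenshtein gives 2
print(osa_distance("Laptop", "Lavmop"))  # 2 (two substitutions)
```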
Related
I have two documents, for example:
Doc1 = {'python','numpy','machine learning'}
Doc2 = {'python','pandas','tensorflow','svm','regression','R'}
And I also know the similarity (correlation) of each pair of words, e.g.:
Sim('python','python') = 1
Sim('python','pandas') = 0.8
Sim('numpy', 'R') = 0.1
What is the best way to measure the similarity of the two documents?
It seems that the traditional Jaccard distance and cosine distance are not good metrics in this situation.
I like a book by Peter Christen on this issue.
Here he describes a Monge-Elkan similarity measure between two sets of strings.
For each word in the first set you find the closest word in the second set, then you divide the sum of those best-match scores by the number of elements in the first set.
You can see its description on page 30 here.
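Here is a minimal sketch of that idea in Python, assuming you already have a word-level sim() function; the lookup table below is a placeholder that only contains the pairs from the question and falls back to 0 for everything else:

```python
# Placeholder word-level similarities: the pairs given in the question, 0.0 otherwise.
PAIR_SIMS = {
    frozenset(['python', 'pandas']): 0.8,
    frozenset(['numpy', 'R']): 0.1,
}

def sim(w1, w2):
    """Word similarity: 1 for identical words, otherwise look up the pair."""
    if w1 == w2:
        return 1.0
    return PAIR_SIMS.get(frozenset([w1, w2]), 0.0)

def monge_elkan(doc1, doc2):
    """For each word in doc1, take the best sim() against doc2, then average over doc1."""
    return sum(max(sim(w1, w2) for w2 in doc2) for w1 in doc1) / len(doc1)

doc1 = {'python', 'numpy', 'machine learning'}
doc2 = {'python', 'pandas', 'tensorflow', 'svm', 'regression', 'R'}
print(monge_elkan(doc1, doc2))  # (1.0 + 0.1 + 0.0) / 3 ≈ 0.37
```

Note that the measure is asymmetric because it averages over the first set only; if you need a symmetric score you can average monge_elkan(doc1, doc2) and monge_elkan(doc2, doc1).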
I have n strings and I want to find the closest pair.
What's the fastest practical algorithm to find this pair?
The paper "The Closest Pair Problem under the Hamming Metric", Min, Kao, Zhu seems to be what you are looking for, and it applies to finding a single closest pair.
For your case, where n0.294 < D < n, where D is the dimensionality of your data (1000) and n the size of your dataset, the algorithm will run in O(n1.843 D0.533).
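For comparison, the naive baseline that bound improves on is the O(n^2 · D) scan sketched below (just an illustration of the problem, not the paper's algorithm):

```python
from itertools import combinations

def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def closest_pair(strings):
    """Brute-force O(n^2 * D) closest pair under the Hamming metric."""
    return min(combinations(strings, 2), key=lambda pair: hamming(*pair))

print(closest_pair(["abcde", "abfde", "zzzzz", "abfdf"]))  # ('abcde', 'abfde')
```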
I am working on a project that requires calculating the minimum distance between two strings. The maximum length of each string can be 10,000 (m) and we have around 50,000 (n) strings. I need to find the distance between each pair of strings. I also have a weight matrix that contains the weight for each character pair; for example, the weight between (a,a) and (a,b) is 0.
Just iterating over all pairs of strings takes O(n^2) time. I have seen algorithms that take O(m) time for finding the distance, so the overall time complexity becomes O(n^2 * m). Are there any algorithms that can do better than this using some pre-processing? It's actually the same problem as autocorrect.
Do we have algorithms that store all the strings in a data structure and then let us query the approximate distance between two strings from that data structure? Constructing the data structure can take O(n^2), but query processing should take less than O(m).
s1 = abcca, s2 = bdbbe
If we follow the above weight matrix and calculate the Euclidean distance between the two:
sqrt(0^2 + 9^2 + 9^2 + 9^2 + 342^2)
Context: I need to cluster time series, and I have converted them to a SAX representation with around 10,000 points. In order to cluster, I need to define a distance matrix, so I need to calculate the distance between two strings in an efficient way.
Note: All strings are of same length and the alphabet size is 5.
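For concreteness, here is a sketch of the per-pair computation for equal-length strings. The weight values below are placeholders (the real matrix behind the 0/9/342 numbers above isn't shown), and the Euclidean-style combination simply follows the example:

```python
import math

# Placeholder weight matrix over the alphabet {a,b,c,d,e}; the actual values
# from the question's matrix are not shown, so these are made up.
ALPHABET = "abcde"
WEIGHT = {(x, y): float(abs(ALPHABET.index(x) - ALPHABET.index(y)))
          for x in ALPHABET for y in ALPHABET}

def weighted_distance(s1, s2, weight=WEIGHT):
    """Position-wise weighted distance for equal-length strings, combined Euclidean-style."""
    assert len(s1) == len(s2)
    return math.sqrt(sum(weight[(c1, c2)] ** 2 for c1, c2 in zip(s1, s2)))

print(weighted_distance("abcca", "bdbbe"))  # sqrt(1^2 + 2^2 + 1^2 + 1^2 + 4^2) with these placeholders
```

This is still O(m) per pair, so it does not by itself avoid the O(n^2 * m) total the question is asking about.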
https://web.stanford.edu/class/cs124/lec/med.pdf
http://stevehanov.ca/blog/index.php?id=114
I'm unfamiliar with string similarity algorithms other than Levenshtein distance, because that's what I'm using, and it has turned out to be less than ideal.
So I've got a rough idea of a recursive algorithm I'd like to implement, but I want to know if it already exists so I can leverage others' expertise.
Here's the algorithm by example:
string 1: "Paul Johnson"
string 2: "John Paulson"
Step 1: find all longest matches
Match 1: "Paul"
Match 2: "John"
Match 3: "son"
Match 4: " "
Step 2: Calculate a score for each match with this formula: (match.len / string.len) * match.len. This allows longer matches to be weighted more heavily, at a rate balanced by the length of the string.
Match 1: (4/12)*4 = 1.333...
Match 2: 1.333...
Match 3: .75
Match 4: .083
Step 3: do steps 1 and 2 on larger scales (matches of matches). I don't have this figured out exactly, but my thinking is that if "son" comes after "Paul John" and it also comes after "John Paul", then that should count for something.
Step 4: sum all the scores that have been calculated.
Scores: 1.333 + 1.333 + .75 + .083333 = 3.4999... (plus whatever scores step 3 produces)
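In case it helps, here is a rough sketch of steps 1, 2 and 4 as I picture them (step 3 left out). Greedily removing the longest common substring each round, and scoring against the longer input's length, are my own assumptions, and the greedy removal groups leftovers a little differently than the hand-worked example (it finds " son" as one match instead of "son" and " " separately):

```python
def longest_common_substring(a, b):
    """Return the longest substring that occurs in both a and b (first found on ties)."""
    best = ""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1         # extend the common suffix
                if cur[j] > len(best):
                    best = a[i - cur[j]:i]
        prev = cur
    return best

def match_score(s1, s2):
    """Steps 1, 2 and 4: repeatedly take the longest match, score it, and sum the scores."""
    total = 0.0
    base_len = max(len(s1), len(s2))             # assumption: score against the longer string
    while True:
        m = longest_common_substring(s1, s2)
        if not m:
            break
        total += (len(m) / base_len) * len(m)    # (match.len / string.len) * match.len
        s1 = s1.replace(m, "", 1)
        s2 = s2.replace(m, "", 1)
    return total

print(match_score("Paul Johnson", "John Paulson"))  # ≈ 4.0 here; the hand example gets ≈ 3.5
```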
Does this look familiar to anyone? I hope someone else has gone to the trouble of actually making an algorithm along these lines so I don't have to figure it out myself.
What you describe somewhat resembles what the following paper calls the Longest Common Substring (LCS). For a brief description and comparison to other algorithms:
A Comparison of Personal Name Matching
This algorithm [11] repeatedly finds and removes the longest common sub-string in the two strings compared, up to a minimum lengths (normally set to 2 or 3).

...

A similarity measure can be calculated by dividing the total length of the common sub-strings by the minimum, maximum or average lengths of the two original strings (similar to Smith-Waterman).

...

this algorithm is suitable for compound names that have words (like given- and surname) swapped.
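Here is a quick sketch of that procedure, using Python's standard difflib to find each longest common block; the minimum length of 2 and normalising by the average of the original lengths are just two of the options the quoted passage mentions:

```python
from difflib import SequenceMatcher

def lcs_similarity(s1, s2, min_len=2):
    """Repeatedly remove the longest common substring, then normalise the total
    matched length by the average of the two original string lengths."""
    orig_avg = (len(s1) + len(s2)) / 2.0
    common_total = 0
    while True:
        m = SequenceMatcher(None, s1, s2).find_longest_match(0, len(s1), 0, len(s2))
        if m.size < min_len:
            break
        common_total += m.size
        s1 = s1[:m.a] + s1[m.a + m.size:]   # cut the match out of both strings
        s2 = s2[:m.b] + s2[m.b + m.size:]
    return common_total / orig_avg

print(lcs_similarity("Paul Johnson", "John Paulson"))  # 1.0: every character gets matched
```

On the example above ("Paul Johnson" vs "John Paulson") every character ends up in some common substring, so the similarity comes out as 1.0, which illustrates why the measure copes well with swapped name parts.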
I was reading this question:
How to understand Locality Sensitive Hashing?
But then I found that the equation to calculate the cosine similarity is as follows:
Cos(v1, v2) = Cos(theta) = Cos((hamming distance / signature length) * pi) = Cos((h/b) * pi)
Which means that if the vectors are fully similar, the hamming distance will be zero and the cosine value will be 1. But when the vectors are totally dissimilar, the hamming distance will be equal to the signature length, so we get cos(pi), which results in -1. Shouldn't the similarity always be between 0 and 1?
Cosine similarity is the dot product of the vectors divided by the product of their magnitudes, so it's entirely possible to have a negative value for the angle's cosine. For example, if you have unit vectors pointing in opposite directions, then you want the value to be -1. I think what's confusing you is the nature of the representation: the other post is talking about angles between vectors in 2-D space, whereas it's more common to create vectors in a multidimensional space where the number of dimensions is customarily much greater than 2 and the value for each dimension is non-negative (e.g., a word occurs in a document or not), resulting in a 0 to 1 range.
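As a small numeric check of both points, the sketch below (the vector values and signature length are made up for the example) computes the exact cosine and the random-hyperplane estimate cos(pi * h / b) from the question: opposite vectors give -1, while non-negative count vectors stay in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(u, v):
    """Exact cosine similarity: dot product divided by the product of magnitudes."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def lsh_cosine_estimate(u, v, b=1024):
    """Estimate cos(theta) as cos(pi * h / b) from b random-hyperplane signature bits."""
    planes = rng.standard_normal((b, len(u)))   # one random hyperplane per signature bit
    sig_u = planes @ u > 0
    sig_v = planes @ v > 0
    h = np.count_nonzero(sig_u != sig_v)        # hamming distance between the signatures
    return float(np.cos(np.pi * h / b))

u = np.array([1.0, 2.0, 0.5])
v = -u                                          # opposite direction -> cosine is -1
print(cosine(u, v), lsh_cosine_estimate(u, v))  # -1.0 and an estimate close to -1

counts1 = np.array([1.0, 0.0, 2.0])             # non-negative term counts stay in [0, 1]
counts2 = np.array([0.0, 3.0, 1.0])
print(cosine(counts1, counts2))                 # ≈ 0.28
```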