Designing an algorithm to calculate the edit distance between two strings

Please consider the following question:
The edit distance of two strings s and t is the minimum number of single-character operations (insertion, deletion, substitution) needed to convert s into t. Let m and n be the lengths of strings s and t, respectively.
Design an O(nm) time and O(nm) space algorithm to calculate the edit distance between s and t.
My thoughts:
Isn't it easier to just compare the two strings one character at a time:
distance = 0
L = maximum(length(s), length(t))
for i in 0 .. L-1:
    if i >= length(s):            # s exhausted; count the leftover characters of t
        distance += length(t) - i
        break
    if i >= length(t):            # t exhausted; count the leftover characters of s
        distance += length(s) - i
        break
    if s[i] != t[i]:              # mismatch at the aligned position
        distance += 1
If I am wrong, then am I supposed to use the edit distance algorithm table? If so, how do I design an O(nm) time and O(nm) space algorithm?

Consider the strings abcd and bcd. They differ by one deletion, but your approach would count them as distance 4.
What you want to do is find the Longest Common Subsequence (LCS). This is a well-known problem and you can google up a lot of code examples about it, one solution being in fact O(NM).
For example, for the strings abcdqef and xybcdzzzef the LCS is bcdef. Consider the subsequences in the two strings:
a-bcd-q-ef
xy-bcd-zzz-ef
You can transform a into xy with one modification and one insertion, and q into zzz with one modification and two insertions. If you think about it, the number of operations required (i.e. the distance) is, in this example, the number of characters of the longer string that do not belong to the LCS.
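For reference, a minimal sketch of the O(nm) LCS table (the function name is illustrative; this is the standard textbook recurrence, not code from the answer):
def lcs_length(s, t):
    # dp[i][j] = length of the LCS of s[:i] and t[:j]
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

print(lcs_length("abcdqef", "xybcdzzzef"))  # 5 ("bcdef"); 10 - 5 = 5 operations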

Thank you @Roberto Attias for his answer, but the following is the complete algorithm I was looking for:
def edit_distance(s, t):
    L1, L2 = len(s), len(t)
    # table[i][j] = distance between the first i chars of s and the first j chars of t
    table = [[0] * (L2 + 1) for _ in range(L1 + 1)]
    for i in range(L1 + 1):
        table[i][0] = i                                   # i deletions
    for j in range(L2 + 1):
        table[0][j] = j                                   # j insertions
    for i in range(1, L1 + 1):
        for j in range(1, L2 + 1):
            m = min(table[i-1][j], table[i][j-1]) + 1     # deletion or insertion
            subvalue = 0 if s[i-1] == t[j-1] else 1       # substitution cost
            table[i][j] = min(m, table[i-1][j-1] + subvalue)
    return table[L1][L2]
The above algorithm follows the strategy of the standard edit-distance dynamic programming table.
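A quick sanity check with the classic textbook pair (illustrative):
print(edit_distance("kitten", "sitting"))  # 3: k -> s, e -> i, insert g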

Related

From a bunch of n vectors, get all vectors which are mutually orthogonal

Original problem - context: NLP - from a list of n strings, choose all the strings which don't have common words (ignoring the words in a pre-defined list of stop words).
Approach that I tried: using sklearn's CountVectorizer, get the vector for each string and compute the dot product of each vector with every other vector. Vector pairs with a zero dot product are added to a set.
This takes O(n^2) dot-product computations. Is there a better way to approach this problem?
There is little you can do in the worst case: suppose the trivial case where each string has a single unique word. In order to determine that all the intersections are empty, you have to consider all n * (n - 1) / 2 pairs, hence the complexity is O(n^2 * v), where v is the number of unique words in your vocabulary.
For the typical case, however, we can have better approaches. Assuming that the number of words in each string is much less than the number of unique words, it is better to iterate over the words of the strings, maybe even skipping the vectorization. Let 0 <= id[word] < nWords be a unique number for each word;
you could do
import numpy as np

v1 = np.zeros(nWords)
for i in range(len(strings)):
    # mark the words of strings[i]
    for w in getWords(strings[i]):
        v1[id[w]] = 1
    for j in range(i + 1, len(strings)):
        for w in getWords(strings[j]):
            if v1[id[w]]:
                # strings[j] and strings[i] share at least one word
                break
    # unmark the words of strings[i] before moving to the next i
    for w in getWords(strings[i]):
        v1[id[w]] = 0
This is still O(n * C), where C is the total number of words in all your strings.
You may want to precompute the getWords(strings[i]) results.
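For instance, precomputing each string's word set and testing disjointness directly might look like this (a sketch; getWords and strings are the same assumed names as above):
word_sets = [set(getWords(s)) for s in strings]  # computed once per string
disjoint_pairs = []
for i in range(len(strings)):
    for j in range(i + 1, len(strings)):
        if word_sets[i].isdisjoint(word_sets[j]):  # no common word
            disjoint_pairs.append((i, j))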

How does Duval's algorithm handle odd-length strings?

Finding the lexicographically minimal string rotation is a well-known problem, for which a linear-time algorithm was proposed by Jean Pierre Duval in 1983. This blog post is probably the only publicly available resource that talks about the algorithm in detail. However, Duval's algorithm is based on the idea of pairwise comparisons ("duels"), and the blog conveniently uses an even-length string as an example.
How does the algorithm work for odd-length strings, where the last character wouldn't have a competing one to duel with?
One character can get a "bye", where it wins without participating in a "duel". The correctness of the algorithm does not rely on the specific duels that you perform; given any two distinct indices i and j, you can always conclusively rule out that one of them is the start-index of the lexicographically-minimal rotation (unless both are start-indices of identical lexicographically-minimal rotations, in which case it doesn't matter which one you reject).
The reason to perform the duels in a specific order is performance: you get asymptotically linear time by ensuring that half the duels only need to compare one character, half of the rest only need to compare two characters, and so on, until the last duel only needs to compare half the length of the string. But a single odd character here and there doesn't change the asymptotic complexity; it just makes the math (and implementation) a little more complicated. A string of length 2n+1 still requires fewer "duels" than one of length 2n+2.
OP here: I accepted ruakh's answer as it pertains to my question, but I wanted to provide my own explanation for others who might stumble across this post while trying to understand Duval's algorithm.
Problem:
Lexicographically least circular substring is the problem of finding the rotation of a string possessing the lowest lexicographical order of all such rotations. For example, the lexicographically minimal rotation of "bbaaccaadd" would be "aaccaaddbb".
Solution:
An O(n) time algorithm was proposed by Jean Pierre Duval (1983).
Given two indices i and j, Duval's algorithm compares string segments of length j - i starting at i and j (called a "duel"). If a segment would run past the end of the string, it is formed by wrapping around.
For example, consider s = "baabbaba", i = 5 and j = 7. Since j - i = 2, the first segment, starting at i = 5, is "ab". The second segment, starting at j = 7, is constructed by wrapping around, and is also "ab".
If the segments are lexicographically equal, as in the above example, we choose the one starting at the smaller index as the winner, which here is i = 5.
The above process is repeated until we have a single winner. If the input string is of odd length, the last character wins without a comparison in the first iteration.
Time complexity:
The first iteration duels n/2 pairs of length-1 segments (n/2 character comparisons), the second iteration duels n/4 pairs of length-2 segments (again n/2 comparisons), and so on, until the last iteration compares 2 segments of length n/2 (n/2 comparisons). Since the number of winners is halved each time, there are log(n) iterations, giving an O(n log(n)) algorithm. Note that Duval's original algorithm is O(n); this simplified version pays an extra log factor.
Space complexity is O(n) too, since the first iteration stores n/2 winners, the second iteration n/4 winners, and so on. (Wikipedia claims this algorithm uses constant space; I don't understand how.)
Here's a Scala implementation; feel free to convert to your favorite programming language.
import scala.annotation.tailrec

def lexicographicallyMinRotation(s: String): String = {
  @tailrec
  def duel(winners: Seq[Int]): String = {
    if (winners.size == 1) s"${s.slice(winners.head, s.length)}${s.take(winners.head)}"
    else {
      val newWinners: Seq[Int] = winners
        .sliding(2, 2)
        .map {
          case Seq(x, y) =>
            val range = y - x
            Seq(x, y)
              .map { i =>
                // wrap around the end of the string if the segment runs past it
                val segment = if (s.isDefinedAt(i + range - 1)) s.slice(i, i + range)
                              else s"${s.slice(i, s.length)}${s.take(i + range - s.length)}"
                (i, segment)
              }
              .reduce((a, b) => if (a._2 <= b._2) a else b)
              ._1
          case xs => xs.head // odd index out gets a "bye"
        }
        .toSeq
      duel(newWinners)
    }
  }
  duel(s.indices)
}
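For example, lexicographicallyMinRotation("bbaaccaadd") should return "aaccaaddbb", the rotation from the problem statement above.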

Concrete algorithm code for approximate string matching

Approximate string matching is not an unfamiliar problem.
I am learning and trying to understand how to solve it. For now I don't want to get too deep into it; I just want to understand the brute-force way.
In its wiki page (Approximate string matching), it says
A brute-force approach would be to compute the edit distance to P (the pattern) for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time O(m * n^3), where n is the length of T and m is the length of P.
Ok. I understand this statement in the following way:
We find all possible substrings of T.
We compute the edit distance of each pair of strings {P, t1}, {P, t2}, ...
We find which substring has the shortest distance from P; that substring is the answer.
I have the following question:
a. I can use two for-loops to get all possible substrings, and this requires O(n^2). So when I try to compute the edit distance between one substring and the pattern, does it need O(n*m)? Why?
b. How exactly do I compute the distance of one pair (one substring and the pattern)? I know I can insert, delete, and substitute, but can anyone give me an algorithm that does just the calculation for one pair?
Thanks
Edit
Ok, I should use Levenshtein distance, but I don't quite understand its method.
Here is part of the code
for j from 1 to n
{
    for i from 1 to m
    {
        if s[i] = t[j] then
            d[i, j] := d[i-1, j-1]         // no operation required
        else
            d[i, j] := minimum
                       (
                           d[i-1, j] + 1,      // a deletion
                           d[i, j-1] + 1,      // an insertion
                           d[i-1, j-1] + 1     // a substitution
                       )
    }
}
So, assume I am now comparing {"suv", "svi"}.
Since 'v' != 'i', I then have to look at three other pairs:
{"su", "sv"}
{"suv", "sv"}
{"su", "svi"}
How can I understand this part? Why do I need to look at these 3 pairs?
Does the distance between two prefixes mean that we need that many changes in order to make the two prefixes (or strings) equal?
So, let's take a look at {"su", "sv"}. We can see that the distance of {"su", "sv"} is 1. Then how can {"su", "sv"} become {"suv", "svi"} by just adding 1? I think we need to insert 'v' into "su" and 'v' into "sv" and then substitute the last 'i' with 'v', which involves 3 operations, right?
The standard way of measuring the edit distance between two strings is called Levenshtein distance - the wikipedia page contains pseudocode for the algorithm.
As for your edit: You need to look at {"su", "sv"} because it is possible that the best way to change "suv" into "svi" is to replace the last v by i, whose cost will come on top of the cost for changing "su" to "sv". Or, it could be that the best way is to change "suv" into "sv" somehow and then add an i. Or, it could be that the best way is to first delete the v from "suv" and then change "su" into "svi". The first way turns out to be best (or as good as the other options) in this case. The edit distance is indeed 2, and the operations are to change the u into a v and the v into an i.
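To make this concrete, here is a small sketch (names are illustrative) that fills the table from the pseudocode above and confirms the distance for this pair:
def levenshtein(s, t):
    # d[i][j] = edit distance between the first i chars of s and the first j chars of t
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]              # no operation required
            else:
                d[i][j] = min(d[i - 1][j] + 1,         # a deletion
                              d[i][j - 1] + 1,         # an insertion
                              d[i - 1][j - 1] + 1)     # a substitution
    return d[len(s)][len(t)]

print(levenshtein("suv", "svi"))  # 2: substitute u -> v, then v -> i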

Converting N strings to a common target string in maximum of K edits

I have a set of strings [S1, S2, S3, ..., Sn] and I need to count all target strings T such that each one of S1, S2, ..., Sn can be converted into T within a total of K edits. All the strings are of fixed length L, and an edit here is measured in Hamming distance.
All I have is a sort of brute-force approach:
If my alphabet size is 4, I have a sample space of O(4^L), and it takes O(L) time to check each candidate. I can't seem to bring the complexity down from exponential to polynomial or pseudo-polynomial! Is there any way to prune the sample space to do better?
I tried to visualize it as an L-dimensional vector space. I am given N points and have to count all points whose sum of distances from the given N points is less than or equal to K, i.e. d1 + d2 + d3 + ... + dN <= K.
Is there any known geometric algorithm which solves this or a similar problem with better complexity? Kindly point me in the right direction; any hints are appreciated.
Thank you
You can do this efficiently with dynamic programming.
The key idea is that you don't need to enumerate all possible target strings; you just need to know how many targets are possible with a given number of edits remaining, considering only the string positions up to a given index.
from functools import lru_cache

alphabet = 'abcd'
s = ['aabbbb', 'bacaaa', 'dabbbb', 'cabaaa']

# memoization; the original used the memoized decorator from
# http://wiki.python.org/moin/PythonDecoratorLibrary
@lru_cache(maxsize=None)
def count(edits_left, index):
    # every position decided and the budget never exceeded: one valid target
    if index == -1 and edits_left >= 0:
        return 1
    if edits_left < 0:
        return 0
    ret = 0
    for char in alphabet:
        # edits needed to make every input string agree with `char` at `index`
        edits_used = 0
        for mutate_str in s:
            if mutate_str[index] != char:
                edits_used += 1
        ret += count(edits_left - edits_used, index - 1)
    return ret
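A hypothetical entry call, assuming a total edit budget K and the fixed-length strings in s above:
K = 3
print(count(K, len(s[0]) - 1))  # targets reachable from every string within K total edits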
Thinking out loud, it seems to me that this problem boils down to a combinatorial one.
In general, for a string S of length L, there are C(L,K) (binomial coefficient) choices of positions to substitute, and therefore (ALPHABET_SIZE - 1)^K * C(L,K) target strings T at a Hamming distance of exactly K (each substituted position must take one of the other ALPHABET_SIZE - 1 characters).
The binomial coefficient can be computed quite easily using dynamic programming and Pascal's triangle... no need to go crazy with factorials etc...
Now that the one-string case is treated, dealing with multiple strings is a little trickier, since you might double count targets. Intuitively, though, if S1 is K away from S2 then both strings will generate the same set of targets, so you don't double count in this case. This last statement might be a long shot; that's why I made sure to say "intuitively" :)
Hope it helps,
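As a sketch of the Pascal-triangle idea mentioned above (illustrative code; binomial and ALPHABET_SIZE are assumed names):
def binomial(L, K):
    # row r of Pascal's triangle: C(r, c) = C(r-1, c-1) + C(r-1, c); no factorials
    if not 0 <= K <= L:
        return 0
    row = [1]
    for _ in range(L):
        row = [1] + [row[c - 1] + row[c] for c in range(1, len(row))] + [1]
    return row[K]

# targets at Hamming distance exactly K from one string of length L:
# binomial(L, K) * (ALPHABET_SIZE - 1) ** K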

How to find the actual sequence of a Longest Increasing Subsequence?

This is not a homework problem. I am reviewing the Longest Increasing Subsequence problem on my own. I have read about it everywhere online. I understand how to find the "length", but I don't understand how to back-trace the actual sequence. I am using the patience sorting algorithm to find the length. Can anyone explain how to find the actual sequence? I do not really understand the version on Wikipedia. Can someone explain it in a different way?
Thanks.
Let's define max(j) as the length of the longest increasing subsequence up to A[j]. There are two options: either we use A[j] in this subsequence, or we don't.
If we don't use it, then the value is max(j-1). If we do use it, then the value is max*(i)+1, where i is the index with i < j and A[i] < A[j] that maximizes max*(i). (Here max*(j) denotes the longest increasing subsequence up to A[j] that actually uses A[j]; we keep both values for each cell, since the subsequence behind max(i) does not necessarily end at A[i]. max*(j) is calculated each time as max*(i)+1.)
To sum up, the recursive formula for calculating max(j) is:
max(j) = maximum{max(j-1), max*(j)}, where max*(j) = max*(i)+1.
In each array cell you can save a pointer that tells you whether you chose to use the A[j] cell or not, and which A[i] precedes it. This way you can recover the whole sequence by moving backwards through the array.
Time complexity: evaluating the recursive formula and recovering the sequence at the end is O(n). The problem here is finding, for each A[j], the best corresponding A[i] with i < j and A[i] < A[j].
Of course you can do it naively in O(n^2) (from each cell, scan backwards until you find such an i). If you want to do better, then I'm pretty sure you can do it in O(n log n) in the following way:
First, sort your array. Then:
1) Take the smallest integer in the array, and denote its position in the array as k.
2) For A[k+1] we of course have A[k] < A[k+1]. If A[k+1] > A[k+2], then k fits the k+2 cell as well, and so on, until we reach A[k+m] < A[k+m+1]; then k+m fits k+m+1.
3) Delete all the cells whose corresponding cell was found in the previous stage.
4) Return to 1.
Hope that helps. Please note that I worked this out on my own, so there is a small chance of a mistake here; verify it before relying on it, and ask for more clarification if you need it.
This Python code solves the Longest Increasing Subsequence problem, and also returns one such sequence. The trick is that, at the same time the dynamic programming table gets filled, another array is also filled, storing for each element the index of its predecessor in the optimal solution.
from operator import itemgetter

def an_lis(nums):
    table, solution = lis_table(nums)
    if not table:
        return []
    # start from the index whose subsequence ending there is longest
    n, maxLen = max(enumerate(table), key=itemgetter(1))
    lis = [nums[n]]
    # follow the predecessor links backwards
    while solution[n] != -1:
        lis.append(nums[solution[n]])
        n = solution[n]
    return lis[::-1]

def lis_table(nums):
    n = len(nums)
    table, solution = [0] * n, [-1] * n
    for i in range(n):
        maxLen, maxIdx = 0, -1
        for j in range(i):
            if nums[j] < nums[i] and table[j] > maxLen:
                maxLen, maxIdx = table[j], j
        table[i], solution[i] = 1 + maxLen, maxIdx
    return (table, solution)
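A quick usage check (input chosen for illustration):
print(an_lis([3, 1, 4, 1, 5, 9, 2, 6]))  # [3, 4, 5, 9]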
