Concrete algorithm code for approximate string matching

Approximate string matching is not an unfamiliar problem, and I am trying to understand how to solve it. For now I don't want to get too deep into it; I just want to understand the brute-force approach.
In its wiki page (Approximate string matching), it says
A brute-force approach would be to compute the edit distance to P (the pattern) for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time O(m * n^3), where n is the length of T and m is the length of P.
Ok. I understand this statement in the following way:
We find out all possible substrings of T
We compute the edit distance of each pair of strings {P, t1}, {P, t2}, ...
We find out which substring has the shortest distance from P and this substring is the answer.
I have the following question:
a. I can use two for-loops to get all possible substrings, and this requires O(n^2). So when I compute the edit distance between one substring and the pattern, does it need O(n*m)? Why?
b. How exactly do I compute the distance of one pair (one substring and the pattern)? I know I can insert, delete, and substitute, but can anyone give me an algorithm that does the calculation for just one pair?
Thanks
Edit
Ok, I should use Levenshtein distance, but I don't quite understand its method.
Here is part of the code
for j from 1 to n
{
    for i from 1 to m
    {
        if s[i] = t[j] then
            d[i, j] := d[i-1, j-1]   // no operation required
        else
            d[i, j] := minimum
                       (
                           d[i-1, j] + 1,    // a deletion
                           d[i, j-1] + 1,    // an insertion
                           d[i-1, j-1] + 1   // a substitution
                       )
    }
}
So, assume I am now comparing {"suv", "svi"}.
Since 'v' != 'i', I have to look at three other pairs:
{"su", "sv"}
{"suv", "sv"}
{"su", "svi"}
How can I understand this part? Why do I need to look at these 3 pairs?
Does the distance between two prefixes mean that we need that number of changes in order to make the two prefixes (or strings) equal?
So, let's take a look at {"su", "sv"}. We can see that the distance of {"su", "sv"} is 1. Then how can {"su", "sv"} become {"suv", "svi"} by adding just 1? I think we need to insert 'v' into "su" and 'v' into "sv" and then substitute the last 'i' with 'v', which involves 3 operations, right?

The standard way of measuring the edit distance between two strings is called Levenshtein distance - the wikipedia page contains pseudocode for the algorithm.
As for your edit: You need to look at {"su", "sv"} because it is possible that the best way to change "suv" into "svi" is to replace the last v by i, whose cost will come on top of the cost for changing "su" to "sv". Or, it could be that the best way is to change "suv" into "sv" somehow and then add an i. Or, it could be that the best way is to first delete the v from "suv" and then change "su" into "svi". The first way turns out to be best (or as good as the other options) in this case. The edit distance is indeed 2, and the operations are to change the u into a v and the v into an i.
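If it helps, here is a minimal runnable Python version of that table-filling idea (the function name levenshtein and the demo line are mine, not from the answer):

import itertools  # not required; standard library only

def levenshtein(s: str, t: str) -> int:
    m, n = len(s), len(t)
    # d[i][j] = edit distance between the first i chars of s and first j chars of t
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all i characters of s[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]            # no operation required
            else:
                d[i][j] = min(d[i - 1][j] + 1,       # a deletion
                              d[i][j - 1] + 1,       # an insertion
                              d[i - 1][j - 1] + 1)   # a substitution
    return d[m][n]

print(levenshtein("suv", "svi"))  # 2: change u -> v, then v -> i

Running it on the pair from the question prints 2, matching the two operations described above.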

Related

From a bunch of n vectors, get all vectors which are mutually orthogonal

Original problem - context: NLP - from a list of n strings, choose all the strings which don't have common words (without considering the words in a pre-defined list of stop words)
Approach that I tried: using sklearn's count vectoriser, get the vectors for each string and compute dot product for each vector with every other vector. Those vectors with zero dot product will be added to a set.
This is done using O(n^2) dot-product computations. Is there a better way to approach this problem?
There is little you can do: consider the trivial case where each string has a single unique word. In order to determine that all the intersections are empty, you have to consider all n * (n - 1) / 2 pairs, hence the complexity is O(n^2 * v), where v is the number of unique words in your vocabulary.
For the typical case, however, there are better approaches. Assuming that the number of words in each string is much smaller than the number of unique words, it is better to iterate over the words of the strings, maybe even skipping the vectorization. Let 0 <= id[word] < nWords be a unique number for each word; you could do:
v1 = np.zeros(nWords)
for i in range(len(strings)):
    # mark the words of strings[i]
    for w in getWords(strings[i]):
        v1[id[w]] = 1
    for j in range(i + 1, len(strings)):
        for w in getWords(strings[j]):
            if v1[id[w]]:
                # strings[j] and strings[i] share at least one word
                break
    # clear the marks before the next iteration
    for w in getWords(strings[i]):
        v1[id[w]] = 0
This is still O(n * C), where C is the total number of words in all your strings.
You may want to precompute getWords(strings[i]) for each string.
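As a rough Python illustration of the precomputation idea (the tokenizer getWords and the sample strings below are made-up placeholders; stop-word filtering is omitted):

def getWords(s):            # hypothetical tokenizer; replace with your own
    return s.lower().split()

strings = ["red apple pie", "blue sky", "apple tart", "green grass"]
word_sets = [set(getWords(s)) for s in strings]   # precomputed once per string

# Same O(n^2) pair loop; each check costs about the size of the smaller set.
disjoint_pairs = [
    (i, j)
    for i in range(len(strings))
    for j in range(i + 1, len(strings))
    if word_sets[i].isdisjoint(word_sets[j])      # no common word
]
print(disjoint_pairs)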

How does Duval's algorithm handle odd-length strings?

Finding the lexicographically minimal string rotation is a well known problem, for which a linear time algorithm was proposed by Jean Pierre Duval in 1983. This blog post is probably the only publicly available resource that talks about the algorithm in detail. However, Duval's algorithm is based on the idea of pairwise comparisons ("duels"), and the blog conveniently uses an even-length string as an example.
How does the algorithm work for odd-length strings, where the last character wouldn't have a competing one to duel with?
One character can get a "bye", where it wins without participating in a "duel". The correctness of the algorithm does not rely on the specific duels that you perform; given any two distinct indices i and j, you can always conclusively rule out that one of them is the start-index of the lexicographically-minimal rotation (unless both are start-indices of identical lexicographically-minimal rotations, in which case it doesn't matter which one you reject). The reason to perform the duels in a specific order is performance: to get asymptotically linear time by ensuring that half the duels only need to compare one character, half of the rest only need to compare two characters, and so on, until the last duel only needs to compare half the length of the string. But a single odd character here and there doesn't change the asymptotic complexity, it just makes the math (and implementation) a little bit more complicated. A string of length 2n+1 still requires fewer "duels" than one of length 2n+2.
OP here: I accepted ruakh's answer as it pertains to my question, but I wanted to provide my own explanation for others that might stumble across this post trying to understand Duval's algorithm.
Problem:
Lexicographically least circular substring is the problem of finding the rotation of a string possessing the lowest lexicographical order of all such rotations. For example, the lexicographically minimal rotation of "bbaaccaadd" would be "aaccaaddbb".
Solution:
A O(n) time algorithm was proposed by Jean Pierre Duval (1983).
Given two indices i and j, Duval's algorithm compares string segments of length j - i starting at i and j (called a "duel"). If the start index plus j - i exceeds the length of the string, the segment is formed by wrapping around.
For example, consider s = "baabbaba", i = 5 and j = 7. Since j - i = 2, the first segment starting at i = 5 is "ab". The second segment starting at j = 7 is constructed by wrapping around, and is also "ab".
If the strings are lexicographically equal, like in the above example, we choose the one starting at i as the winner, which is i = 5.
The above process is repeated until we have a single winner. If the input string is of odd length, the last character wins without a comparison in the first iteration.
Time complexity:
The first iteration compares n strings of length 1 (n/2 comparisons), the second iteration compares n/2 strings of length 2 (again up to n/2 character comparisons), and so on, until the last iteration compares 2 strings of length n/2 (n/2 comparisons). Since the number of winners is halved each time, there are log(n) iterations, giving an O(n log n) algorithm; note that this tournament formulation is O(n log n), whereas Duval's original algorithm achieves O(n).
Space complexity is O(n) too, since the first iteration has to store n/2 winners, the second n/4 winners, and so on. (Wikipedia claims this algorithm uses constant space; I don't understand how.)
Here's a Scala implementation; feel free to convert to your favorite programming language.
import scala.annotation.tailrec

def lexicographicallyMinRotation(s: String): String = {
  @tailrec
  def duel(winners: Seq[Int]): String = {
    if (winners.size == 1) s"${s.slice(winners.head, s.length)}${s.take(winners.head)}"
    else {
      val newWinners: Seq[Int] = winners
        .sliding(2, 2)
        .map {
          case Seq(x, y) =>
            val range = y - x
            Seq(x, y)
              .map { i =>
                // segment of length `range` starting at i, wrapping around if needed
                val segment =
                  if (s.isDefinedAt(i + range - 1)) s.slice(i, i + range)
                  else s"${s.slice(i, s.length)}${s.take(range - (s.length - i))}"
                (i, segment)
              }
              .reduce((a, b) => if (a._2 <= b._2) a else b)
              ._1
          case xs => xs.head // odd one out gets a "bye"
        }
        .toSeq
      duel(newWinners)
    }
  }
  duel(s.indices)
}
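As a quick check, lexicographicallyMinRotation("bbaaccaadd") should return "aaccaaddbb", the example rotation given above.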

Randomized algorithm for string matching

Question:
Given a text t[1...n, 1...n] and a pattern p[1...m, 1...m], with n = 2m, over the alphabet [0, Sigma-1], we say p matches t at [i, j] if t[i+k-1, j+L-1] = p[k, L] for all k, L. Design a randomized algorithm to find all matches in O(n^2) time with high probability.
Can someone help me understand what this text means? I believe it is saying that 't' has two words in it and the pattern is also two words but the length of both patterns is half of 't'. However, from here I don't understand how the range of [i,j] comes into play. That if statement goes over my head.
This could also be saying that t and p are 2D arrays and you are trying to match a "box" from the pattern in the t 2D array.
Any help would be appreciated, thank you!
The problem asks you to find a 2D pattern, i.e. the p array, in the t array, which is also 2D.
The most obvious randomized solution to this problem would be to generate two random indexes i and j and then start searching for the pattern from that (i, j).
To avoid doing redundant searches you can keep track of which pairs of (i, j) you have visited before, this can be done using a simple look up 2D array.
The complexity of the above would be O(n^3) in the worst case.
You can also use hashing for comparing the strings to reduce the complexity to O(n^2).
You first need to hash the t array row by row and store the values in an array hashT; you can use a rolling hash for that.
You can then hash the p array row by row the same way and store the hashes in the array hashP.
Then, when you generate the random pair (i, j), you can get the hash of the corresponding part of the t array from hashT in linear time, instead of the brute-force comparison that takes quadratic time, and compare. (Note there can be collisions in the hash; you can fall back to brute force when a hash matches to be completely sure.)
To find the corresponding hash using hashT, we can do the following: suppose the current pair (i, j) is (3, 4), and the dimensions of the p array are 2 x 3.
Then we can compare hashT[3][7] - hashT[3][3] with the hash of the corresponding row of p to find the result; this logic comes from the rolling hash technique.
Pseudocode for the check at one random position using hashing:
hashT[][], hashP[]
i = rand(), j = rand();
for (int k = i; k < i + lengthOfColumn(p); k++) {
    if ((hashT[k][j + lengthOfRow(p)] - hashT[k][j - 1]) != hashP[k - i]) {
        // pattern does not match at (i, j)
        return false;
    }
}
return true;
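To make the row-hashing check concrete, here is a rough, self-contained Python sketch (the names, the base/modulus constants, and the tiny 4x4 demo are mine; as noted above, a hash match should still be verified character by character to rule out collisions):

import random

B, MOD = 257, (1 << 61) - 1  # polynomial base and modulus (arbitrary choices)

def prefix_hashes(row):
    # h[k] = rolling hash of row[:k]
    h = [0] * (len(row) + 1)
    for k, ch in enumerate(row):
        h[k + 1] = (h[k] * B + ord(ch)) % MOD
    return h

def substring_hash(h, lo, hi, pw):
    # hash of row[lo:hi] in O(1) from prefix hashes (pw holds powers of B)
    return (h[hi] - h[lo] * pw[hi - lo]) % MOD

def matches_at(t_hashes, p_row_hashes, i, j, p_cols, pw):
    # compare p row by row against t with top-left corner at (i, j)
    for k, ph in enumerate(p_row_hashes):
        if substring_hash(t_hashes[i + k], j, j + p_cols, pw) != ph:
            return False  # mismatching row hash: no match here
    return True

# Tiny demo with a 4x4 text and a 2x2 pattern:
t = ["abca", "bcab", "cabc", "abca"]
p = ["bc", "ca"]                       # occurs with top-left corner (0, 1)
n, m = len(t), len(p)
pw = [1] * (n + 1)
for k in range(1, n + 1):
    pw[k] = (pw[k - 1] * B) % MOD
t_hashes = [prefix_hashes(row) for row in t]
p_row_hashes = [prefix_hashes(row)[m] for row in p]
i, j = random.randrange(n - m + 1), random.randrange(n - m + 1)
print((i, j), matches_at(t_hashes, p_row_hashes, i, j, m, pw))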

Designing an algorithm to calculate the edit distance between two strings

Please consider the following question:
The edit distance of two strings s and t is the minimum number of single character operations (insert, delete, substitution) needed to convert s into t. Let m and n be the lengths of strings s and t.
Design an O(nm) time and O(nm) space algorithm to calculate the edit distance between s and t.
My thoughts:
Isn't it easier to just compare two strings one character at a time:
L = maximum(length(s), length(t))
for i in L:
    if i > length(s):
        distance += length(t) - i
        break
    if i > length(t):
        distance += length(s) - i
        break
    if s[i] != t[i]:
        distance += 1
If I am wrong, then am I supposed to use the edit distance table? If so, how do I design an O(nm) time and O(nm) space algorithm?
Consider the strings abcd and bcd. They differ for one deletion, but your approach would count them as distance 4.
What you want to do is find the Longest Common Subsequence (LCS). This is a well known problem, and you can google up a lot of code examples about it, one solution being in fact O(NM).
For example, for strings abcdqef and xybcdzzzef the LCS is bcdef. Consider the subsequences in the two strings:
a-bcd-q-ef
xy-bcd-zzz-ef
You can transform a into xy with one substitution and one insertion, and q into zzz with one substitution and two insertions. If you think about it, the number of operations required (i.e. the distance) is the number of characters in the longer string that do not belong to the LCS.
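For reference, a minimal Python sketch of the standard O(NM) LCS-length DP mentioned above (the function name is mine):

def lcs_length(a: str, b: str) -> int:
    N, M = len(a), len(b)
    # lcs[i][j] = length of the LCS of a[:i] and b[:j]
    lcs = [[0] * (M + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            if a[i - 1] == b[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1   # extend the common subsequence
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return lcs[N][M]

print(lcs_length("abcdqef", "xybcdzzzef"))  # 5 ("bcdef")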
Thank you @Roberto Attias for the answer, but the following is the complete algorithm I am looking for:
L1 = length(s)
L2 = length(t)
for i in 0..L1:
    table[i][0] = i
for j in 0..L2:
    table[0][j] = j
for i in 1..L1:
    for j in 1..L2:
        m = minimum(table[i-1][j], table[i][j-1]) + 1
        if s[i] == t[j]: subvalue = 0
        else: subvalue = 1
        table[i][j] = minimum(m, table[i-1][j-1] + subvalue)
return table[L1][L2]
The above algorithm follows the strategy of the edit distance table: table[i][j] holds the distance between the first i characters of s and the first j characters of t.

Finding the minimum number of swaps to convert one string to another, where the strings may have repeated characters

I was looking through a programming question, when the following question suddenly seemed related.
How do you convert one string to another using as few swaps as possible? The strings are guaranteed to be interconvertible (they have the same set of characters; this is given), but characters can be repeated. I saw web results on the same question, though without repeated characters.
Any two characters in the string can be swapped.
For instance : "aabbccdd" can be converted to "ddbbccaa" in two swaps, and "abcc" can be converted to "accb" in one swap.
Thanks!
This is an expanded and corrected version of Subhasis's answer.
Formally, the problem is: given an n-letter alphabet V and two m-letter words, x and y, for which there exists a permutation p such that p(x) = y, determine the least number of swaps (permutations that fix all but two elements) whose composition q satisfies q(x) = y. Assuming that m-letter words are maps from the set {1, ..., m} to V and that p and q are permutations on {1, ..., m}, the action p(x) is defined as the composition p followed by x.
The least number of swaps whose composition is p can be expressed in terms of the cycle decomposition of p. When j1, ..., jk are pairwise distinct in {1, ..., m}, the cycle (j1 ... jk) is a permutation that maps ji to ji+1 for i in {1, ..., k - 1}, maps jk to j1, and maps every other element to itself. The permutation p is the composition of every distinct cycle (j p(j) p(p(j)) ... j'), where j is arbitrary and p(j') = j. The order of composition does not matter, since each element appears in exactly one of the composed cycles. A k-element cycle (j1 ... jk) can be written as the product (j1 jk) (j1 jk-1) ... (j1 j2) of k - 1 swaps. In general, every permutation can be written as a composition of (m minus the number of cycles in its cycle decomposition) swaps. A straightforward induction proof shows that this is optimal.
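Here is a small Python sketch of that fact (the list representation, with p[i] the image of i, and the function name are mine): counting cycles gives the minimum number of swaps as m minus the number of cycles.

def min_swaps_for_permutation(p):
    m = len(p)
    seen = [False] * m
    cycles = 0
    for start in range(m):
        if not seen[start]:
            cycles += 1
            j = start
            while not seen[j]:   # walk one full cycle
                seen[j] = True
                j = p[j]
    return m - cycles

# one 3-cycle (2 swaps) plus one 2-cycle (1 swap) = 3 swaps
print(min_swaps_for_permutation([1, 2, 0, 4, 3]))  # 3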
Now we get to the heart of Subhasis's answer. Instances of the asker's problem correspond one-to-one with Eulerian (for every vertex, in-degree equals out-degree) digraphs G with vertices V and m arcs labeled 1, ..., m. For j in {1, ..., m}, the arc labeled j goes from y(j) to x(j). The problem in terms of G is to determine how many parts a partition of the arcs of G into directed cycles can have. (Since G is Eulerian, such a partition always exists.) This is because the permutations q such that q(x) = y are in one-to-one correspondence with the partitions, as follows. For each cycle (j1 ... jk) of q, there is a part whose directed cycle is comprised of the arcs labeled j1, ..., jk.
The problem with Subhasis's NP-hardness reduction is that arc-disjoint cycle packing on Eulerian digraphs is a special case of arc-disjoint cycle packing on general digraphs, so an NP-hardness result for the latter has no direct implications for the complexity status of the former. In very recent work (see the citation below), however, it has been shown that, indeed, even the Eulerian special case is NP-hard. Thus, by the correspondence above, the asker's problem is as well.
As Subhasis hints, this problem can be solved in polynomial time when n, the size of the alphabet, is fixed (it is fixed-parameter tractable). Since there are O(n!) distinguishable cycles when the arcs are unlabeled, we can use dynamic programming on a state space of size O(m^n), the number of distinguishable subgraphs. In practice, that might be sufficient for (let's say) a binary alphabet, but if I were to try to solve this problem exactly on instances with large alphabets, then I likely would try branch and bound, obtaining bounds by using linear programming with column generation to pack cycles fractionally.
@article{DBLP:journals/corr/GutinJSW14,
  author    = {Gregory Gutin and Mark Jones and Bin Sheng and Magnus Wahlstr{\"o}m},
  title     = {Parameterized Directed $k$-Chinese Postman Problem and $k$ Arc-Disjoint Cycles Problem on Euler Digraphs},
  journal   = {CoRR},
  volume    = {abs/1402.2137},
  year      = {2014},
  ee        = {http://arxiv.org/abs/1402.2137},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
You can construct the "difference" strings S and S', i.e. a string which contains the characters at the differing positions of the two strings, e.g. for acbacb and abcabc it will be cbcb and bcbc. Let us say this contains n characters.
You can now construct a "permutation graph" G which will have n nodes and an edge from i to j if S[i] == S'[j]. In the case of all unique characters, it is easy to see that the required number of swaps will be (n - number of cycles in G), which can be found out in O(n) time.
However, in the case where there are any number of duplicate characters, this reduces to the problem of finding the largest number of cycles in a directed graph, which, I think, is NP-hard (e.g. check out: http://www.math.ucsd.edu/~jverstra/dcig.pdf ).
In that paper a few greedy algorithms are pointed out, one of which is particularly simple:
At each step, find the minimum-length cycle in the graph (e.g. Find cycle of shortest length in a directed graph with positive weights)
Delete it
Repeat until all the vertices have been covered.
However, there may be efficient algorithms utilizing the properties of your case (the only one I can think of is that your graphs will be K-partite, where K is the number of unique characters in S). Good luck!
Edit:
Please refer to David's answer for a fuller and correct explanation of the problem.
Do an A* search (see http://en.wikipedia.org/wiki/A-star_search_algorithm for an explanation) for the shortest path through the graph of equivalent strings from one string to the other. Use the Levenshtein distance / 2 as your cost heuristic.
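For illustration, a rough Python sketch of this suggestion (names are mine; as the heuristic I use ceil(mismatched positions / 2) rather than Levenshtein / 2, since a single swap fixes at most two mismatched positions, which keeps the heuristic admissible):

import heapq
from math import ceil

def min_swaps_astar(s: str, t: str) -> int:
    def h(u: str) -> int:
        # admissible heuristic: each swap fixes at most 2 mismatched positions
        return ceil(sum(a != b for a, b in zip(u, t)) / 2)

    best = {s: 0}
    heap = [(h(s), 0, s)]
    while heap:
        f, g, u = heapq.heappop(heap)
        if u == t:
            return g
        if g > best.get(u, float("inf")):
            continue  # stale queue entry
        lu = list(u)
        for i in range(len(lu)):
            for j in range(i + 1, len(lu)):
                if lu[i] == lu[j]:
                    continue  # swapping equal characters changes nothing
                lu[i], lu[j] = lu[j], lu[i]
                v = "".join(lu)
                if g + 1 < best.get(v, float("inf")):
                    best[v] = g + 1
                    heapq.heappush(heap, (g + 1 + h(v), g + 1, v))
                lu[i], lu[j] = lu[j], lu[i]  # undo the swap
    raise ValueError("strings are not interconvertible")

print(min_swaps_astar("aabbccdd", "ddbbccaa"))  # 2
print(min_swaps_astar("abcc", "accb"))          # 1

This is only practical for short strings, since the state space grows with the number of distinct rearrangements.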
