Randomized algorithm for string matching

Question:
Given a text t[1...n, 1...n] and p[1...m, 1...m], n = 2m, from alphabet [0, Sigma-1], we say p matches t at [i,j] if t[i+k-1, j+L-1] = p[k,L] for all k,L. Design a randomized algorithm to find all matches in O(n^2) time with high probability.
Can someone help me understand what this text means? I believe it is saying that 't' has two words in it and the pattern is also two words but the length of both patterns is half of 't'. However, from here I don't understand how the range of [i,j] comes into play. That if statement goes over my head.
This could also be saying that t and p are 2D arrays and you are trying to match a "box" from the pattern in the t 2D array.
Any help would be appreciated, thank you!

The problem asks you to find a 2D pattern, i.e. the pattern defined by the p array, inside the t array, which is also 2D.
The most obvious randomized solution would be to generate two random indices i and j and then check for the pattern at that (i, j).
To avoid redundant checks you can keep track of which pairs (i, j) you have visited before; this can be done with a simple 2D lookup array.
The complexity of the above is O(n^4) in the worst case: there are O(n^2) candidate positions, and verifying one takes O(m^2) = O(n^2) comparisons.
You can use hashing to compare a whole row in O(1), which cuts verification down to O(m) per position; combining the row hashes with a second hash over the columns is what brings the total down to the required O(n^2).
You first need to hash the t array row by row and store the values in an array, say hashT; you can use the rolling-hash technique (as in Rabin-Karp) for that.
You then hash the p array the same way and store the row hashes in an array hashP.
When you generate the random pair (i, j), you can get the hash of the corresponding segment of t from hashT in constant time per row, instead of the brute-force comparison that takes quadratic time. (Note there can be collisions in the hash, so when a hash matches you can verify by brute force to be completely sure.)
To find the corresponding hash using hashT, suppose the current pair (i, j) is (3, 4) and the dimensions of the p array are 2 x 3. The pattern then covers columns 4 through 6 of rows 3 and 4, so we check hashT[3][6] - hashT[3][3] == hashP[1] and hashT[4][6] - hashT[4][3] == hashP[2]. This follows from the rolling-hash construction (with a polynomial hash, the subtracted prefix also has to be scaled by the matching power of the base).
Pseudocode for checking one candidate (i, j) in time linear in the number of pattern rows:
hashT[][], hashP[]
i = rand(), j = rand();
for(int k = i; k < i + numberOfRows(p); k++){
    // hash of t[k][j .. j + rowLength(p) - 1], taken from the row's prefix hashes
    if((hashT[k][j + rowLength(p) - 1] - hashT[k][j - 1]) != hashP[k - i]){
        // pattern does not match at (i, j)
        return false;
    }
}
return true;
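
For concreteness, here is a small Python sketch of the row-hashing scheme described above (the function names and the hash parameters B and M are my own choices, not from the answer). It verifies a candidate (i, j) in time proportional to the number of pattern rows; to reach the O(n^2) total bound of the exercise you would additionally combine the per-row hashes with a second rolling hash down the columns, which this sketch omits:

B = 257                  # polynomial hash base (arbitrary)
M = (1 << 61) - 1        # large prime modulus (arbitrary)

def prefix_hashes(row):
    # h[c] is the hash of row[:c], so h[0] = 0
    h = [0]
    for ch in row:
        h.append((h[-1] * B + ord(ch)) % M)
    return h

def segment_hash(h, pw, l, r):
    # hash of row[l:r], given the row's prefix hashes h and powers pw of B
    return (h[r] - h[l] * pw[r - l]) % M

def matches_at(t_hashes, p_hashes, pw, i, j, p_cols):
    # compare every pattern row hash against the aligned slice of the text row
    return all(segment_hash(t_hashes[i + k], pw, j, j + p_cols) == p_hashes[k]
               for k in range(len(p_hashes)))

def find_matches(t, p):
    n, p_rows, p_cols = len(t), len(p), len(p[0])
    pw = [1] * (n + 1)
    for x in range(n):
        pw[x + 1] = pw[x] * B % M
    t_hashes = [prefix_hashes(row) for row in t]
    p_hashes = [prefix_hashes(row)[p_cols] for row in p]
    return [(i, j) for i in range(n - p_rows + 1)
                   for j in range(n - p_cols + 1)
                   if matches_at(t_hashes, p_hashes, pw, i, j, p_cols)]

# e.g. find_matches(["ab", "ba"], ["b"]) -> [(0, 1), (1, 0)]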

Related

From a bunch of n vectors, get all vectors which are mutually orthogonal

Original problem - context: NLP - from a list of n strings, choose all the strings which don't have common words (without considering the words in a pre-defined list of stop words)
Approach that I tried: using sklearn's count vectoriser, get the vectors for each string and compute dot product for each vector with every other vector. Those vectors with zero dot product will be added to a set.
This is done using O(n^2) dot-product computations. Is there a better way to approach this problem?
In the worst case there is little you can do: suppose the trivial case where each string has a single unique word. In order to determine that all the intersections are empty you have to consider all n * (n - 1) / 2 pairs, hence the complexity is O(n^2 * v), where v is the number of unique words in your vocabulary.
For the typical case, however, there are better approaches. Assuming that the number of words in each string is much smaller than the number of unique words, it is better to iterate over the words of the strings, maybe even skipping the vectorization. Let 0 <= id[word] < nWords be a unique number for each word; you could do:
import numpy as np

v1 = np.zeros(nWords)
for i in range(len(strings)):
    # mark the words of strings[i]
    for w in getWords(strings[i]):
        v1[id[w]] = 1
    for j in range(i + 1, len(strings)):
        for w in getWords(strings[j]):
            if v1[id[w]]:
                # strings[j] and strings[i] share at least one word.
                break
    # unmark the words of strings[i] before moving to the next i
    for w in getWords(strings[i]):
        v1[id[w]] = 0
This is still O(n * C), where C is the total number of words over all your strings.
You may want to precompute the getWords(strings[i]) calls, as sketched below.
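
For example, a one-line sketch of that precomputation (getWords and strings as in the code above):

words = [list(getWords(s)) for s in strings]   # tokenize every string exactly once
# ...then use words[i] wherever the loops above call getWords(strings[i])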

Search and remove algorithm

Say you have an ordered array of values representing x coordinates.
[0,25,50,60,75,100]
You might notice that without the 60, the values would be evenly spaced (25). This would be indicative of a repeating pattern, something that I need to extract using this list (regardless of the length and the values of the list). In this particular example, the algorithm should find and remove the 60.
There are no time or space complexity requirements.
Both the values in the list and the ideal spacing (e.g 25) are unknown. So the algorithm must obtain this by looking at the values. In addition, the number of values, and where the outliers are in the array are not guaranteed. There may be more than one outlier. The algorithm should return a list with the outliers removed. Extra points if the algorithm uses a threshold for the spacing.
Edit: In an example image (not reproduced here), there is one outlier on the x axis (the green line) and two on the y axis; the x coordinates in the array represent the rho of the line on that axis.
arr = [0,25,50,60,75,100]
First construct the distances array
import numpy as np

dist = np.array([arr[i+1] - arr[i] for (i, _) in enumerate(arr) if i < len(arr)-1])
print(dist)
>> [25 25 10 15 25]
Now I'm using np.where and np.percentile to cut the array into three parts: the main values, the upper values, and the lower values. I arbitrarily set the cut-offs at 5%.
cond_sup = np.where(dist > np.percentile(dist, 95))
print(cond_sup)
>> (array([]),)
cond_inf = np.where(dist < np.percentile(dist, 5))
print(cond_inf)
>> (array([2]),)
You now have the indexes where the value differs from the others.
So dist[2] has a problem, which means, by construction, the problem lies between arr[2] and arr[2+1].
I don't know if you want to remove one or more numbers from this array, so I would solve the problem like this:
array A[] = [0,25,50,60,75,100];
1) sort the array (if needed).
2) create a new array B[] whose i-th value is B[i] = A[i+1] - A[i].
3) find the value that appears most often among the elements of B[]; that will be our distance.
4) find i such that A[i+1] - A[i] != distance.
5) find the smallest k > 1 such that A[i+k] - A[i] == distance.
6) remove A[i+1] through A[i+k-1].
I hope it is right; a Python sketch of these steps follows.
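
A rough sketch of these steps (my own code, not the answerer's; it assumes the outliers fall between points of the regular grid, takes the most frequent gap as the spacing, and tests with an exact match where a tolerance could be substituted):

from collections import Counter

def remove_outliers(arr):
    arr = sorted(arr)
    dist = [b - a for a, b in zip(arr, arr[1:])]
    step = Counter(dist).most_common(1)[0][0]   # most frequent spacing, e.g. 25
    kept, expected = [arr[0]], arr[0] + step
    for v in arr[1:]:
        if v == expected:                       # use abs(v - expected) <= eps for
            kept.append(v)                      # the threshold-based variant
            expected = v + step
    return kept

print(remove_outliers([0, 25, 50, 60, 75, 100]))  # -> [0, 25, 50, 75, 100]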

How does Duval's algorithm handle odd-length strings?

Finding the lexicographically minimal string rotation is a well-known problem, for which a linear-time algorithm was proposed by Jean Pierre Duval in 1983. This blog post is probably the only publicly available resource that talks about the algorithm in detail. However, Duval's algorithm is based on the idea of pairwise comparisons ("duels"), and the blog conveniently uses an even-length string as an example.
How does the algorithm work for odd-length strings, where the last character wouldn't have a competing one to duel with?
One character can get a "bye", where it wins without participating in a "duel". The correctness of the algorithm does not rely on the specific duels that you perform; given any two distinct indices i and j, you can always conclusively rule out that one of them is the start-index of the lexicographically-minimal rotation (unless both are start-indices of identical lexicographically-minimal rotations, in which case it doesn't matter which one you reject). The reason to perform the duels in a specific order is performance: to get asymptotically linear time by ensuring that half the duels only need to compare one character, half of the rest only need to compare two characters, and so on, until the last duel only needs to compare half the length of the string. But a single odd character here and there doesn't change the asymptotic complexity; it just makes the math (and implementation) a little more complicated. A string of length 2n+1 still requires fewer "duels" than one of length 2n+2.
OP here: I accepted ruakh's answer as it pertains to my question, but I wanted to provide my own explanation for others that might stumble across this post trying to understand Duval's algorithm.
Problem:
Lexicographically least circular substring is the problem of finding
the rotation of a string possessing the lowest lexicographical order
of all such rotations. For example, the lexicographically minimal
rotation of "bbaaccaadd" would be "aaccaaddbb".
Solution:
An O(n) time algorithm was proposed by Jean Pierre Duval (1983).
Given two indices i and j, Duval's algorithm compares string segments of length j - i starting at i and j (called a "duel"). If a segment would run past the end of the string, it is formed by wrapping around.
For example, consider s = "baabbaba", i = 5 and j = 7. Since j - i = 2, the first segment starting at i = 5 is "ab". The second segment starting at j = 7 is constructed by wrapping around, and is also "ab".
If the strings are lexicographically equal, like in the above example, we choose the one starting at i as the winner, which is i = 5.
The above process is repeated until we have a single winner. If the input string is of odd length, the last character wins without a comparison in the first iteration.
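
As a side note (my own addition, not part of the original description): the wrap-around segments are easy to read off a doubled copy of the string:

s = "baabbaba"
i, j, length = 5, 7, 2
doubled = s + s                                       # doubling makes wrap-around indexing trivial
print(doubled[i:i + length], doubled[j:j + length])   # -> ab ab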
Time complexity:
The first iteration compares n strings each of length 1 (n/2 comparisons), the second iteration may compare n/2 strings of length 2 (n/2 comparisons), and so on, until the i-th iteration compares 2 strings of length n/2 (n/2 comparisons). Since the number of winners is halved each time, the height of the recursion tree is log(n), so this simple tournament formulation gives an O(n log(n)) bound; Duval's actual algorithm sharpens the bookkeeping to reach O(n).
Space complexity is O(n) too, since in the first iteration, we have to store n/2 winners, second iteration n/4 winners, and so on. (Wikipedia claims this algorithm uses constant space, I don't understand how).
Here's a Scala implementation; feel free to convert to your favorite programming language.
import scala.annotation.tailrec

def lexicographicallyMinRotation(s: String): String = {
  @tailrec
  def duel(winners: Seq[Int]): String = {
    if (winners.size == 1) s"${s.slice(winners.head, s.length)}${s.take(winners.head)}"
    else {
      val newWinners: Seq[Int] = winners
        .sliding(2, 2)
        .map {
          case Seq(x, y) =>
            val range = y - x
            Seq(x, y)
              .map { i =>
                // wrap around when the segment runs past the end of s
                val segment = if (s.isDefinedAt(i + range - 1)) s.slice(i, i + range)
                else s"${s.slice(i, s.length)}${s.take(i + range - s.length)}"
                (i, segment)
              }
              .reduce((a, b) => if (a._2 <= b._2) a else b)
              ._1
          case xs => xs.head // odd one out gets a "bye"
        }
        .toSeq
      duel(newWinners)
    }
  }
  duel(s.indices)
}

// lexicographicallyMinRotation("bbaaccaadd") == "aaccaaddbb"

What is an efficient way to compute the Dice coefficient between 900,000 strings?

I have a corpus of 900,000 strings. They vary in length, but have an average character count of about 4,500. I need to find the most efficient way of computing the Dice coefficient of every string as it relates to every other string. Unfortunately, this results in the Dice coefficient algorithm being used some 810,000,000,000 times.
What is the best way to structure this program for increased efficiency? Obviously, I can prevent computing the Dice of sections A and B, and then B and A--but this only halves the work required. Should I consider taking some shortcuts or creating some sort of binary tree?
I'm using the following implementation of the Dice coefficient algorithm in Java:
public static double diceCoefficient(String s1, String s2) {
    Set<String> nx = new HashSet<String>();
    Set<String> ny = new HashSet<String>();
    for (int i = 0; i < s1.length() - 1; i++) {
        char x1 = s1.charAt(i);
        char x2 = s1.charAt(i + 1);
        String tmp = "" + x1 + x2;
        nx.add(tmp);
    }
    for (int j = 0; j < s2.length() - 1; j++) {
        char y1 = s2.charAt(j);
        char y2 = s2.charAt(j + 1);
        String tmp = "" + y1 + y2;
        ny.add(tmp);
    }
    Set<String> intersection = new HashSet<String>(nx);
    intersection.retainAll(ny);
    double totcombigrams = intersection.size();
    return (2 * totcombigrams) / (nx.size() + ny.size());
}
My ultimate goal is to output an ID for every section that has a Dice coefficient of greater than 0.9 with another section.
Thanks for any advice that you can provide!
Make a single pass over all the Strings, and build up a HashMap which maps each bigram to a set of the indexes of the Strings which contain that bigram. (Currently you are building the bigram set 900,000 times, redundantly, for each String.)
Then make a pass over all the sets, and build a HashMap of [index,index] pairs to common-bigram counts. (The latter Map should not contain redundant pairs of keys, like [1,2] and [2,1] -- just store one or the other.)
Both of these steps can easily be parallelized. If you need some sample code, please let me know.
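
As an illustration of those two passes, here is a rough sketch (in Python rather than Java, with names of my own choosing, and assuming the corpus is held in a list called strings). As the note below warns, the pair-counting pass can still blow up if some bigram occurs in most of the strings:

from collections import defaultdict
from itertools import combinations

def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

gram_sets = [bigrams(s) for s in strings]      # pass 1: one bigram set per string
index = defaultdict(set)                       # bigram -> indexes of strings containing it
for i, grams in enumerate(gram_sets):
    for g in grams:
        index[g].add(i)

common = defaultdict(int)                      # pass 2: (i, j) -> number of shared bigrams
for ids in index.values():
    for i, j in combinations(sorted(ids), 2):  # i < j, so no redundant [j, i] keys
        common[(i, j)] += 1

# Dice for any pair then needs no further scanning of the strings:
# dice = 2.0 * common[(i, j)] / (len(gram_sets[i]) + len(gram_sets[j]))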
NOTE one thing, though: from the 26 letters of the English alphabet, a total of 26x26 = 676 bigrams can be formed. Many of these will never or almost never be found, because they don't conform to the rules of English spelling. Since you are building up sets of bigrams for each String, and the Strings are so long, you will probably find almost the same bigrams in each String. If you were to build up lists of bigrams for each String (in other words, if the frequency of each bigram counted), it's more likely that you would actually be able to measure the degree of similarity between Strings, but then the calculation of Dice's coefficient as given in the Wikipedia article wouldn't work; you'd have to find a new formula.
I suggest you continue researching algorithms for determining similarity between Strings, try implementing a few of them, and run them on a smaller set of Strings to see how well they work.
You should be able to come up with some kind of triangle-like inequality: if D(X1,X2) > 1-p and D(X1,X3) < 1-q, then D(X2,X3) < 1-q+p, or something like that. Now, if 1-q+p < 0.9, then you probably don't have to evaluate D(X2,X3).
PS: I am not sure about this exact inequality, but I have a gut feeling that it might be right (though I do not have enough time to do the derivations now). Look at the inequalities known for other similarity measures and see whether any of them are valid for the Dice coefficient.
=== Also ===
If there are a elements in set A, and your threshold is r (= 0.9), then the number of elements b in set B must satisfy r*a/(2-r) <= b <= (2-r)*a/r. This follows because Dice(A,B) = 2|A∩B|/(a+b) <= 2*min(a,b)/(a+b), so with b <= a, requiring 2b/(a+b) >= r forces b >= r*a/(2-r), and the upper bound is symmetric. This should eliminate the need for lots of comparisons, IMHO. You can probably sort the strings according to length and use the window described above to limit comparisons.
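
A quick numeric check of that window (the numbers are my own example):

r, a = 0.9, 1000                       # threshold and |A| (bigram count of A)
lo, hi = r * a / (2 - r), (2 - r) * a / r
print(lo, hi)                          # -> 818.18... 1222.22...; only strings whose
                                       #    bigram count falls in this window can reach 0.9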
Disclaimer first: This will not reduce the number of comparisons you'll have to make. But this should make a Dice comparison faster.
1) Don't build your HashSets every time you do a diceCoefficient() call! It should speed things up considerably if you just do it once for each string and keep the result around.
2) Since you only care about if a particular bigram is present in the string, you could get away with a BitSet with a bit for each possible bigram, rather than a full HashMap. Coefficient calculation would then be simplified to ANDing two bit sets and counting the number of set bits in the result.
3) Or, if you have a huge number of possible bigrams (Unicode, perhaps?) - or monotonous strings with only a handful of bigrams each - a sorted array of bigrams might provide faster, more space-efficient comparisons.
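
To illustrate point 2, here is a small Python analogue of my own (arbitrary-precision ints stand in for BitSet, with one bit per a-z bigram):

def bigram_mask(s):
    m = 0
    for a, b in zip(s, s[1:]):
        if a.isalpha() and b.isalpha():
            m |= 1 << ((ord(a.lower()) - 97) * 26 + (ord(b.lower()) - 97))
    return m

def dice(m1, m2):
    shared = bin(m1 & m2).count("1")   # AND the bit sets, then popcount
    return 2 * shared / (bin(m1).count("1") + bin(m2).count("1"))

print(dice(bigram_mask("night"), bigram_mask("nacht")))   # -> 0.25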
Is their charset limited somehow? If it is, you can compute character counts by their code in each string and compare these numbers. After such a pre-computation (it will occupy 2*900K*S bytes of memory [if we assume no character occurs more than 65K times in the same string], where S is the number of distinct characters), computing the coefficient would take O(S) time. Sure, this is only helpful if S < 4500.
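
A tiny sketch of this idea (my own names; Python's collections.Counter stands in for the packed count table, and corpus is assumed to be the list of strings):

from collections import Counter

counts = [Counter(s) for s in corpus]           # one counting pass per string

def overlap(c1, c2):
    shared = sum((c1 & c2).values())            # multiset intersection size, O(S)
    return 2 * shared / (sum(c1.values()) + sum(c2.values()))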

How to find the actual sequence of a Longest Increasing Subsequence?

This is not a homework problem. I am reviewing the Longest Increasing Subsequence problem by myself. I have read about it everywhere online, and I understand how to find the "length", but I don't understand how to back-trace the actual sequence. I am using the patience sorting algorithm to find the length. Can anyone explain how to find the actual sequence? I do not really understand the version on Wikipedia; can someone explain it in a different way?
Thanks.
Let's define max(j) as the length of the longest increasing subsequence up to A[j]. There are two options: either we use A[j] in this subsequence, or we don't.
If we don't use it, then the value will be max(j-1). If we do use it, then the value will be max(i)+1, where i is the biggest index such that i < j and A[i] < A[j]. (Here we assume that the max(i) sequence uses A[i]; that is not necessarily true, but we can solve this issue by saving two values for each cell: max(j), and max*(j), where max*(j) is the length of the longest increasing subsequence up to A[j] that uses A[j]. max*(j) will be calculated each time as max*(i)+1.)
To sum up, the recursive formulas are:
max(j) = max{max(j-1), max*(i)+1}, and max*(j) = max*(i)+1.
In each array cell you can save a pointer that tells you whether you chose to use the A[j] cell or not. In this way you can recover the whole sequence by moving backwards over the array.
Time complexity: evaluating the recursive formula and recovering the sequence at the end is O(n). The hard part is finding, for each A[j], the corresponding A[i] such that i is the biggest index with i < j and A[i] < A[j].
Of course you can do it naively in O(n^2) (from each cell, go backwards until you find such an i). If you want to do better, then I'm pretty sure you can do it in O(n log n) in the following way:
* Sort your array.
1) Go to the smallest integer in the array, and note its position in the original array as k.
2) For A[k+1] we of course have A[k] < A[k+1]. If A[k+1] > A[k+2], then k fits the k+2 cell as well, and so on, until we have A[k+m] < A[k+m+1]; then k+m fits k+m+1.
3) Delete all the cells whose corresponding cell was found in the previous stage.
4) Return to 1.
Hope that helps. Please note that I worked this out on my own, so there is a small chance of a mistake here; don't just take my word for it, and ask for more clarifications if you need them.
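
Since the question mentions patience sorting: here is a hedged O(n log n) sketch (my own code, following the standard bisect formulation rather than the answer above) that keeps one predecessor link per element, so the actual subsequence can be walked back at the end:

from bisect import bisect_left

def lis(nums):
    if not nums:
        return []
    tail_vals, tail_idx = [], []       # smallest tail value per length, and its index
    prev = [-1] * len(nums)            # predecessor links for reconstruction
    for i, x in enumerate(nums):
        k = bisect_left(tail_vals, x)  # the pile this "card" lands on
        if k > 0:
            prev[i] = tail_idx[k - 1]
        if k == len(tail_vals):
            tail_vals.append(x)
            tail_idx.append(i)
        else:
            tail_vals[k], tail_idx[k] = x, i
    seq, i = [], tail_idx[-1]          # walk the links back from the longest pile
    while i != -1:
        seq.append(nums[i])
        i = prev[i]
    return seq[::-1]

print(lis([3, 1, 4, 1, 5, 9, 2, 6]))   # -> [1, 4, 5, 6]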
This Python code solves the Longest Increasing Subsequence problem, and also returns one such sequence. The trick is that, at the same time the dynamic programming table gets filled, another array is also filled, storing the index of the elements that were used to construct the optimal solution.
from operator import itemgetter

def an_lis(nums):
    table, solution = lis_table(nums)
    if not table:
        return []                      # empty input
    n, _ = max(enumerate(table), key=itemgetter(1))
    lis = [nums[n]]
    while solution[n] != -1:
        lis.append(nums[solution[n]])
        n = solution[n]
    return lis[::-1]

def lis_table(nums):
    n = len(nums)
    table, solution = [0] * n, [-1] * n
    for i in range(n):
        maxLen, maxIdx = 0, -1
        for j in range(i):
            if nums[j] < nums[i] and table[j] > maxLen:
                maxLen, maxIdx = table[j], j
        table[i], solution[i] = 1 + maxLen, maxIdx
    return (table, solution)
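
A quick check of the code above (the example input is mine):

print(an_lis([3, 1, 4, 1, 5, 9, 2, 6]))   # -> [3, 4, 5, 9]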
