Removing minimal letters from a string 'A' to remove all instances of string 'B'

If we have string A of length N and string B of length M, where M < N, can I quickly compute the minimum number of letters I have to remove from string A so that string B does not occur as a substring in A?
If we have tiny string lengths, this problem is pretty easy to brute force: you just iterate a bitmask from 0 to 2^N and see if B occurs as a substring in this subsequence of A. However, when N can go up to 10,000 and M can go up to 1,000, this algorithm obviously falls apart quickly. Is there a faster way to do this?
Example: A=ababaa, B=aba. Answer=1. Removing the second a in A results in abbaa, which does not contain B.
Edit: User n.m. posted a great counterexample: aabcc and abc. We want to remove the single b, because if we remove an a or a c instead, the other a or c takes its place and abc still occurs.

Solve it with dynamic programming. Let dp[i][j] be the minimum number of deletions so that the kept part of A[0...i-1] does not contain B and its longest suffix that is a prefix of B is exactly B[0...j-1]; dp[i][j] = infinity when that state is unreachable. Each character of A is either deleted (cost 1, state unchanged) or kept (cost 0, the match state advances). Roughly:
if (A[i-1] == B[j-1])
    dp[i][j] = min(dp[i-1][j-1],    // keep A[i-1], extending the match by one
                   dp[i-1][j] + 1)  // delete A[i-1]
else
    dp[i][j] = dp[i-1][j] + 1       // delete A[i-1]
return min(dp[N][0], dp[N][1], ..., dp[N][M-1]);
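The recurrence above glosses over the case where a character is kept but does not extend the current match: the match state then falls back to a shorter prefix of B, which is exactly what the KMP failure function computes. Here is a rough Python sketch of that idea (my own naming, not part of the original answer):

def min_deletions(A, B):
    # dp[j] = fewest deletions so far, given that the kept characters of A
    # currently end with a match of exactly the first j characters of B (j < len(B)).
    # Transitions follow the KMP automaton of B so overlapping prefixes are handled.
    N, M = len(A), len(B)
    INF = float('inf')

    # Failure function of B.
    fail = [0] * M
    for i in range(1, M):
        j = fail[i - 1]
        while j and B[i] != B[j]:
            j = fail[j - 1]
        fail[i] = j + 1 if B[i] == B[j] else 0

    def step(j, c):
        # Automaton state after reading character c in state j.
        while j and B[j] != c:
            j = fail[j - 1]
        return j + 1 if B[j] == c else 0

    dp = [INF] * M
    dp[0] = 0
    for c in A:
        ndp = [INF] * M
        for j in range(M):
            if dp[j] == INF:
                continue
            ndp[j] = min(ndp[j], dp[j] + 1)   # delete c
            nj = step(j, c)
            if nj < M:                        # keep c, unless it completes B
                ndp[nj] = min(ndp[nj], dp[j])
        dp = ndp
    return min(dp)

print(min_deletions("ababaa", "aba"))  # 1
print(min_deletions("aabcc", "abc"))   # 1

This needs on the order of N*M table updates, which is comfortable for N = 10,000 and M = 1,000.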

You could do a graph search on the string A. This is probably too slow for large N and adversarial inputs, but it should work better than an exponential brute-force algorithm. Maybe a BFS.

I'm not sure this question is still of interest to anyone, but I have an idea that might work.
Once we agree that the problem is not finding the substring but deciding which letter is most convenient to remove from string A, the solution appears pretty simple to me: when you find an occurrence of B inside A, the best thing you can do is remove a character inside that occurrence, close to its right boundary, say the one before the last. The reason is that when B ends the way it starts, removing a character at the beginning only kills one occurrence of B, while removing one near the end can kill two overlapping occurrences at once.
Algorithm in pseudocode:
String A, B;
int occ_posit = 0;
N = B.length();
occ_posit = A.getOccurrencePosition(B); // pseudo function that gets the first occurrence of B in A and returns the offset (1st byte = 1), or 0 if no occurrence is present
while (occ_posit > 0) // while there are occurrences of B in A
{
    if (B.firstchar == B.lastchar) // if B starts as it ends
    {
        int last = occ_posit + N - 1; // position of the last char of this occurrence
        if (A.charat[last] == A.charat[last + 1])
            A.remove[last - 1]; // no reason to remove A[last] here, the next char would take its place
        else
            A.remove[last];     // remove the last char, so we may kill 2 overlapping occurrences at once
    }
    else
    {
        int x = occ_posit + N - 1;
        while (A.charat[x + 1] == A.charat[x])
            x--; // find the first char different from the last one
        A.remove[x]; // B does not end as it starts, so overlapping instances overlap by more than one char; removing the first char that differs from the one following the occurrence kills both occurrences at once
    }
    occ_posit = A.getOccurrencePosition(B); // look for the next occurrence
}
Let's explain with an example:
A = "123456789000987654321"
B = "890"
read this as a table:
occ_posit: 123456789012345678901
A = "123456789000987654321"
B = "890"
first occurrence is at occ_posit = 8. B does not end as it starts, so we go into the else branch:
int x = 8 + 3 - 1 = 10;
while (A.charat[x + 1] == A.charat[x])
x--; // find the first char different from the last one
A.remove[x];
the while loop finds that A.charat[11] matches A.charat[10] (= "0"), so x becomes 9, and then the while exits because A.charat[10] does not match A.charat[9]. A then becomes:
A = "12345678000987654321"
with no more occurrences in it.
Let's try with another:
A = "abccccccccc"
B = "abc"
first occurrence is at occ_posit = 1. B does not end as it starts, so we go into the else branch:
int x = 1 + 3 - 1 = 3;
while (A.charat[x + 1] == A.charat[x])
x--; // find the first char different from the last one
A.remove[x];
the while loop finds that A.charat[4] matches A.charat[3] (= "c"), so x becomes 2, and then the while exits because A.charat[3] does not match A.charat[2]. A then becomes:
A = "accccccccc"
let's try with overlapping:
A = "abcdabcdabff"
B = "abcdab"
the algorithm results in: A = "abcdacdabff", which has no more occurrences.
Finally, a one-letter overlap:
A = "abbabbabbabba"
B = "abba"
B ends as it starts, so it enters the first branch:
int last = occ_posit + N - 1; // position of the last char of this occurrence
if (A.charat[last] == A.charat[last + 1])
    A.remove[last - 1]; // no reason to remove A[last] here
else
    A.remove[last];     // remove the last char, so we may kill 2 overlapping occurrences at once
which removes the last "a" of the B instance. So:
Step 1: A = "abbbbabbabba"
Step 2: A = "abbbbabbbba" and we are done.
Hope this helps
EDIT: please note that the algorithm must be corrected a little so that it does not error out when the search gets close to the end of A, but this is just an easy programming issue.

Here's a sketch I've come up with.
First, if A contains any symbols that are not found in B, split up A into a bunch of smaller strings containing only those characters found in B. Apply the algorithm on each of the smaller strings, then glue them back together to get the total result. This really functions as an optimization.
Next, check whether A contains any occurrence of B at all. If it doesn't, you're done. If A = B, a single deletion suffices.
I think a relatively greedy algorithm may work.
First, mark all of the symbols in A which belong to at least one occurrence of B. Let A = aabcbccabcaa, B = abc. Bolding indicates these marked characters:
a abc bcc abc aa. If there's an overlap, mark all possible. This operation is naively approximately (A-B) operations, but I believe it can be done in around (A/B) operations.
Consider the deletion of each marked letter in A: a abc bcc abc aa.
Check whether the deletion of that marked letter decreases the number of marked letters. You only need to re-check the substrings which could possibly be affected by the deletion of the letter. If B has a length of 4, only the substrings starting at the following locations would need to be re-checked if x were being deleted:
-------x------
^^^^
Any further left or right will exist regardless of the presence of x.
For instance:
Marking the [a] in the following string: a [a]bc bcc abc aa.
Its deletion yields abcbccabcaa, which when marked produces abc bcc abc aa, which has an equal number of marked characters. Since only the relative number is required for this operation, it can be done in approximately 2B time for each selected letter. For each, assign the relative difference between the two. Pick an arbitrary one which is maximal and delete it. Repeat until done. Each pass is roughly up to 2AB operations, for a maximum of A passes, giving a total time of about 2A^2 B.
In the above example, these values are assigned:
aabcbccabcaa
 033   333
So arbitrarily deleting the first marked b gives you: aacbccabcaa. If you repeat the process, you get:
aacbccabcaa
      333
One more deletion removes the remaining occurrence, and we are done.
I believe the algorithm is correctly minimal. I think it is true that whenever A requires only one deletion, the algorithm must be optimal. In that case, the letter which reduces the most possible matches (ie: all of them) should be best. I can come up with no such proof, though. I'd be interested in finding any counter-examples to optimality.
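Here is a small Python sketch of the procedure described above (names are mine; per the caveat just above, this is a greedy heuristic, not a proven-optimal algorithm):

def occurrences(A, B):
    # All start indices where B occurs in A, overlaps included.
    return [i for i in range(len(A) - len(B) + 1) if A[i:i + len(B)] == B]

def marked_count(A, B):
    # Number of positions of A covered by at least one occurrence of B.
    covered = set()
    for i in occurrences(A, B):
        covered.update(range(i, i + len(B)))
    return len(covered)

def greedy_delete(A, B):
    # Repeatedly delete the marked position whose removal shrinks the set of
    # marked positions the most, until B no longer occurs; return the count.
    count = 0
    while occurrences(A, B):
        base = marked_count(A, B)
        covered = set()
        for i in occurrences(A, B):
            covered.update(range(i, i + len(B)))
        best_pos, best_gain = None, -1
        for p in covered:
            gain = base - marked_count(A[:p] + A[p + 1:], B)
            if gain > best_gain:
                best_pos, best_gain = p, gain
        A = A[:best_pos] + A[best_pos + 1:]
        count += 1
    return count

print(greedy_delete("aabcbccabcaa", "abc"))  # 2, as in the walkthrough above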

Find the indices of each occurrence of B in the main string.
Then, using a dynamic programming algorithm (memoize intermediate values), remove each letter that is part of an occurrence from the main string, add 1 to the count, and repeat.
You can find those letters because they lie within each match index plus the length of B.
A = ababaa
B = aba
count = 0
indices = (0, 2)
A = babaa, aabaa, abbaa, abbaa, abaaa, ababa
B = aba
count = 1
(2nd abbaa is memoized)
indices = (1), (1), (), (), (0), (0, 2)
answer = 1
You can take it a step further, and try to memoize the substring match indices of substrings, but that might not actually be a performance gain.
Not sure on the exact bounds, but shouldn't take too long computationally.
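A rough Python sketch of this memoized search (names are mine); it reproduces the walkthrough above, though the number of distinct intermediate strings can still grow quickly on long inputs:

from functools import lru_cache

def min_removals(A, B):
    @lru_cache(maxsize=None)
    def solve(s):
        # Indices where B still occurs in s.
        starts = [i for i in range(len(s) - len(B) + 1) if s[i:i + len(B)] == B]
        if not starts:
            return 0
        covered = {p for i in starts for p in range(i, i + len(B))}
        # Try deleting each letter that is part of some occurrence.
        return 1 + min(solve(s[:p] + s[p + 1:]) for p in covered)
    return solve(A)

print(min_removals("ababaa", "aba"))  # 1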

Related

Convert S to T by performing K operations (HackerRank)

I was solving a problem on HackerRank. It required me to see if it is possible to convert string s to string t by performing k operations.
https://www.hackerrank.com/challenges/append-and-delete/problem
The operations we can perform are: appending a lowercase letter to the end of s or removing a lowercase letter from the end of s. For example Ash Ashley 2 would return No since we need 3 operations, not 2.
I tried solving the problem as follows:
def appendAndDelete(s, t, k):
    if len(s) > len(t):
        maxs = [s,t]
    else:
        maxs = [t,s]
    maximum = maxs[0]
    minimum = maxs[1]
    k -= len(maximum) - len(minimum)
    substr = maximum[len(minimum): len(maximum)]
    maximum = maximum.replace(substr, '')
    i = 0
    while i < len(maximum):
        if maximum[i] != minimum[i]:
            k -= (len(maximum)-i)*2
            break
        i += 1
    if k < 0:
        return 'No'
    else:
        return 'Yes'
However, it fails at this weird test case. y yu 2. The expected answer is No but according to my code, it would return Yes since only one operation was required. Is there something I do not understand?
Since you don't explain your idea, it's difficult for us to understand
what you mean in your code and debug it to tell you where you went wrong.
However, I would like to share my idea (I solved this on the website too):
len1 => Length of first string s.
len2 => Length of second/target string t.
Performing exactly k operations makes it a bit tricky. So, if len1 + len2 <= k, you can blindly assume it can be accomplished and return true, since deleting from an empty string still leaves an empty string (as the problem statement says), and we can delete the characters of one string entirely and keep appending new letters to get the other.
When we start matching s with t from left to right, this looks more like longest common prefix but this is NOT the case. Let's take an example -
aaaaaaaaa (source)
aaaa (target)
7 (k)
Here, up till aaaa it's common and looks like there are additional 5 a's in the source. So, we can delete those 5 a's and get the target but 5 != 7, hence it appears to be a No. But this ain't the case since we can delete an a from the source just like that and append it again(2 operations) just to satisfy k. So, it need not be longest common prefix all the time, however it gets us closer to the solution.
So, let's match both strings from left to right and stop when there is a mismatch. Let's assume we got this index in a variable called first_unmatched. Initialize first_unmatched = min(len(s),len(t)) at the beginning of your method itself.
Let
rem1 = len1 - first_unmatched
rem2 = len2 - first_unmatched
where rem1 is remaining substring of s and rem2 is the remaining substring of t.
Now, comes the conditions.
if(rem1 + rem2 == k) return true-
This is because rem1 characters to delete and rem2 characters to add. If both sum up to k then it's possible.
if(rem1 + rem2 > k) return false-
This is because rem1 characters to delete and rem2 characters to add. If both sum greater than k then it's not possible.
if(rem1 + rem2 < k) return (k - (rem1 + rem2)) % 2 == 0-
This is because rem1 characters to delete and rem2 characters to add. If both sum less than k, then it depends.
Here, (k - (rem1 + rem2)) gives you the extra moves left over in k. Whether this extra can be absorbed depends on whether it is divisible by 2. We take % 2 because we have 2 operations in our question - delete and append - and a surplus can only be burned in delete+append pairs. If the extra k cannot be paired up, the answer is No; otherwise it's Yes.
You can cross check this with above example.
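Putting the three conditions together, a minimal Python version of this idea could look like the following (function name and structure are mine, not the official solution):

def append_and_delete(s, t, k):
    # If we may blow everything away and rebuild, it is always possible.
    if len(s) + len(t) <= k:
        return "Yes"
    # Length of the common prefix of s and t.
    first_unmatched = 0
    while first_unmatched < min(len(s), len(t)) and s[first_unmatched] == t[first_unmatched]:
        first_unmatched += 1
    rem1 = len(s) - first_unmatched   # characters to delete from s
    rem2 = len(t) - first_unmatched   # characters to append
    need = rem1 + rem2
    if need > k:
        return "No"
    # Leftover moves must be burned in delete+append pairs.
    return "Yes" if (k - need) % 2 == 0 else "No"

print(append_and_delete("Ash", "Ashley", 2))      # No
print(append_and_delete("y", "yu", 2))            # No
print(append_and_delete("aaaaaaaaa", "aaaa", 7))  # Yes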

Combining words in a dictionary to match a single word

I'm working on a problem where I need to check how many words in a dictionary can be combined to match a single word.
For example:
Given the string "hellogoodsir", and the dictionary: {hello, good, sir, go, od, e, l}, the goal is to find all the possible combinations to form the string.
In this case, the result would be hello + good + sir, and hello + go + od + sir, resulting in 3 + 4 = 7 words used, or 1 + 1 = 2 combinations.
What I've come up with is simply to put all the words starting with the first character ("h" in this instance) in one hashmap (startH), and the rest in another hashmap (endH). I then go through every single word in the startH hashmap, and check if "hellogoodsir" contains the new word (start + end), where end is every word in the endH hashmap. If it does, I check if it equals the word to match, and then increments the counter with the value of the number for each word used. If it contains it, but doesn't equal it, I call the same method (recursion) using the new word (i.e. start + end), and proceed to try to append any word in the end hashmap to the new word to get a match.
This is obviously very slow for large number of words (and a long string to match). Is there a more efficient way to solve this problem?
As far as I know, this is an O(n^2) algorithm, but I'm sure this can be done faster.
Let's start with your solution. It is neither linear nor quadratic time, it's actually exponential time. A counterexample that shows this is:
word = "aaa...a"
dictionary = {"a", "aa", "aaa", ..., "aa...a"}
Since your solution goes through each possible matching, and there is an exponential number of them in this example, the solution takes exponential time.
However, this can be done more efficiently (quadratic time worst case) with Dynamic Programming, by following the recursive formula:
D[0] = 1  # the empty prefix has exactly one segmentation
D[i] = sum { D[j] | word.Substring(j,i) is in the dictionary, 0 <= j < i }
Calculating each D[i] (given the previous ones are already known) is done in O(i)
This sums to total O(n^2) time, with O(n) extra space.
Quick note: By iterating the dictionary instead of all (i,j) pairs for each D[i], you can achieve O(k) time for each D[i], which ends up as O(n*k), where k is the dictionary size. This can be optimized for some cases by traversing only potentially valid strings - but for the same counter example as above, it will result in O(n*k).
Example:
dictionary = {hello, good, sir, go, od, e, l}
string = "hellogoodsir"
D[0] = 1
D[1] = 0 (no substring h)
D[2] = 0 (no substring he, and D[1] = 0 for e)
...
D[5] = 1 (hello is the only valid string in dictionary)
D[6] = 0 (no dictionary string ending with g)
D[7] = D[5], because string.substring(5,7)="go" is in dictionary
D[8] = 0, no substring ending with "oo"
D[9] = 2: D[7] for "od", and D[5] for "good"
D[10] = D[11] = 0 (no strings in dictionary ending with "si" or "s")
D[12] = D[9] = 2 for substring "sir"
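For reference, a short Python sketch of this D[i] recurrence, iterating the dictionary for each position as the quick note above suggests (naming is mine):

def count_compositions(word, dictionary):
    n = len(word)
    D = [0] * (n + 1)
    D[0] = 1                      # one way to build the empty prefix
    for i in range(1, n + 1):
        for w in dictionary:
            j = i - len(w)
            if j >= 0 and D[j] and word[j:i] == w:
                D[i] += D[j]
    return D[n]

print(count_compositions("hellogoodsir",
                          {"hello", "good", "sir", "go", "od", "e", "l"}))  # 2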
My suggestion would be to use a prefix tree. The nodes beneath the root would be h, g, s, o, e, and l. You will need nodes for terminating characters as well, to differentiate between go and good.
To find all matches, use a Breadth-first-search approach. The state you will want to keep track of is a composition of: the current index in the search-string, the current node in the tree, and the list of words used so far.
The initial state should be 0, root, []
While the list of states is not empty, dequeue the next state, and see whether the character at the current index matches any of the keys of the children of the current node. If so, modify a copy of the state and enqueue it. Also, if any of the children is the terminating character, do the same, adding the word to the list in the state.
I'm not sure of the exact time complexity of this algorithm, but it should be much faster.
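A possible Python sketch of that prefix-tree BFS (a dictionary-of-dicts trie; names are mine). It returns every word list whose concatenation equals the search string:

from collections import deque

def find_compositions(word, dictionary):
    # Build a simple prefix tree; '$' marks the end of a dictionary word.
    trie = {}
    for w in dictionary:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w

    results = []
    queue = deque([(0, [])])            # (index into word, words used so far)
    while queue:
        i, used = queue.popleft()
        node, j = trie, i
        while j < len(word) and word[j] in node:
            node = node[word[j]]
            j += 1
            if '$' in node:             # a dictionary word ends here
                if j == len(word):
                    results.append(used + [node['$']])
                else:
                    queue.append((j, used + [node['$']]))
    return results

print(find_compositions("hellogoodsir", {"hello", "good", "sir", "go", "od", "e", "l"}))
# [['hello', 'good', 'sir'], ['hello', 'go', 'od', 'sir']]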

Searching a minimal string meeting some conditions

Recently, I was asked the following problem during an interview.
Given a string S, I need to find another string S2 such that S2 is a subsequence of S and also S is a subsequence of S2+reverse(S2). Here '+' means concatenation. I need to output the min possible length of S2 for given S.
I was told that this is a dynamic programming problem however I was unable to solve it. Can somebody help me with this problem?
EDIT-
Is there a way to do this in O(N^2) or less?
There are 2 important aspects in this problem.
Since we need S to be a substring of S2+reverse(S2), S2 must have length at least n/2.
After the concatenation of S2 and reverse(S2), the letters repeat in a mirrored pattern (the original answer illustrated this with an image).
So the solution is to check from the center of S to the end of S for a pair of consecutive equal elements. If you find one, then check the elements on either side, as shown in the illustration.
Now, if you are able to reach the end of the string this way, then the minimum number of elements (the result) is the distance from the start to the point where you found the consecutive elements. In the illustrated example it is C, i.e. 3.
This may not always happen, i.e. you may not find consecutive equal elements exactly at the center. If the consecutive elements occur after the center, we can do the same test there.
(Images in the original answer illustrated the main string, the substring, and the concatenated string.)
Now arrives the major doubt: why do we consider only the part starting from the center? The answer is simple: the concatenated string is built as S2+reverse(S2), so we are sure that the last element of S2 appears twice in a row in the concatenated string. There is no way that a repetition in the first half of the main string can give a better result, because the final concatenated string must contain at least the n letters of S.
Now the matter of complexity:
Searching for consecutive alphabets give a maximum of O(n)
Now checking elements on either side iteratively can give a worst case complexity of O(n). i.e maximum n/2 comparisons.
We may fail many times doing the second check, so we have a multiplicative relation between the complexities, i.e. O(n*n).
I believe this is a correct solution and I haven't found a loophole in it yet.
Let's say that S2 is "apple". Then we can make this assumption:
S2 + reverse(S2) >= S >= S2
"appleelppa" >= S >= "apple"
So the given S will be something ranging from "apple" up to at most "appleelppa". It could be "appleel" or "appleelpp".
String S = "locomotiffitomoc";
// as you can see, the S2 string is "locomotif",
// but we don't know S2 yet, so it's blank
String S2 = "";
for (int a = 0; a < S.length(); a++) {
    try {
        int b = 0;
        while (S.charAt(a - b) == S.charAt(a + b + 1))
            b++;
        // if this while loop exits normally, there is a character that doesn't match the rule;
        // if it doesn't exit but throws an exception, we found it
    } catch (Exception e) {
        // a StringIndexOutOfBoundsException here means we ran off the end of the string;
        // you can of course check this manually instead
        S2 = S.substring(0, a + 1);
        break;
    }
}
System.out.println(S2); // will print out "locomotif"
Congratulations, you found the minimum S2.
Each character of S can either be included in S2 or not. With that we can construct a recursion that tries two cases:
the first character of S is used for the cover,
the first character of S is not used for the cover,
and we take the minimum of these two covers. To implement this, it is enough to track how much of S remains to be covered by the already chosen S2+reverse(S2).
There are shortcuts where we already know the result (cover found, cover impossible), and there is no need to take the first character for the cover if it would not cover anything.
Simple python implementation:
cache = {}

def S2(S, to_cover):
    if not to_cover:  # Covered
        return ''
    if not S:  # Not covered
        return None
    if len(to_cover) > 2*len(S):  # Can't cover
        return None
    key = (S, to_cover)
    if key not in cache:
        without_char = S2(S[1:], to_cover)  # Calculate with first character skipped
        cache[key] = without_char
        _f = to_cover[0] == S[0]
        _l = to_cover[-1] == S[0]
        if _f or _l:
            # Calculate with first character used
            with_char = S2(S[1:], to_cover[int(_f):len(to_cover)-int(_l)])
            if with_char is not None:
                with_char = S[0] + with_char  # Append char to result
                if without_char is None or len(with_char) <= len(without_char):
                    cache[key] = with_char
    return cache[key]

s = '21211233123123213213131212122111312113221122132121221212321212112121321212121132'
c = S2(s, s)
print(len(s), s)
print(len(c), c)

Efficiently counting the number of substrings of a digit string that are divisible by k?

We are given a string which consists of digits 0-9. We have to count number of sub-strings divisible by a number k. One way is to generate all the sub-strings and check if it is divisible by k but this will take O(n^2) time. I want to solve this problem in O(n*k) time.
1 <= n <= 100000 and 2 <= k <= 1000.
I saw a similar question here. But k was fixed as 4 in that question. So, I used the property of divisibility by 4 to solve the problem.
Here is my solution to that problem:
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main()
{
    string s;
    vector<int> v[5];
    int i;
    int x;
    long long int cnt = 0;
    cin >> s;
    x = 0;
    // count single digits divisible by 4
    for (i = 0; i < s.size(); i++) {
        if ((s[i] - '0') % 4 == 0) {
            cnt++;
        }
    }
    // every substring ending in a two-digit multiple of 4 is divisible by 4,
    // so each such pair ending at position i contributes i substrings
    for (i = 1; i < s.size(); i++) {
        int f = s[i - 1] - '0';
        int s1 = s[i] - '0';
        if ((10 * f + s1) % 4 == 0) {
            cnt = cnt + (long long)(i);
        }
    }
    cout << cnt;
}
But I wanted a general algorithm for any value of k.
This is a really interesting problem. Rather than jumping into the final overall algorithm, I thought I'd start with a reasonable algorithm that doesn't quite cut it, then make a series of modifications to it to end up with the final, O(nk)-time algorithm.
This approach combines together a number of different techniques. The major technique is the idea of computing a rolling remainder over the digits. For example, let's suppose we want to find all prefixes of the string that are multiples of k. We could do this by listing off all the prefixes and checking whether each one is a multiple of k, but that would take time at least Θ(n^2), since there are n different prefixes and checking each one from scratch takes time proportional to its length. However, we can do this in time Θ(n) by being a bit more clever. Suppose we know that we've read the first h characters of the string and we know the remainder of the number formed that way. We can use this to say something about the remainder of the first h+1 characters of the string as well, since by appending that digit we're taking the existing number, multiplying it by ten, and then adding in the next digit. This means that if we had a remainder of r, then our new remainder is (10r + d) mod k, where d is the digit that we uncovered.
Here's quick pseudocode to count up the number of prefixes of a string that are multiples of k. It runs in time Θ(n):
remainder = 0
numMultiples = 0
for i = 1 to n:    // n is the length of the string
    remainder = (10 * remainder + str[i]) % k
    if remainder == 0
        numMultiples++
return numMultiples
We're going to use this initial approach as a building block for the overall algorithm.
So right now we have an algorithm that can find the number of prefixes of our string that are multiples of k. How might we convert this into an algorithm that finds the number of substrings that are multiples of k? Let's start with an approach that doesn't quite work. What if we count all the prefixes of the original string that are multiples of k, then drop off the first character of the string and count the prefixes of what's left, then drop off the second character and count the prefixes of what's left, etc? This will eventually find every substring, since each substring of the original string is a prefix of some suffix of the string.
Here's some rough pseudocode:
numMultiples = 0
for i = 1 to n:
    remainder = 0
    for j = i to n:
        remainder = (10 * remainder + str[j]) % k
        if remainder == 0
            numMultiples++
return numMultiples
For example, running this approach on the string 14917 looking for multiples of 7 will turn up these strings:
String 14917: Finds 14, 1491, 14917
String 4917: Finds 49
String 917: Finds 91, 917
String 17: Finds nothing
String 7: Finds 7
The good news about this approach is that it will find all the substrings that work. The bad news is that it runs in time Θ(n^2).
But let's take a look at the strings we're seeing in this example. Look, for example, at the substrings found by searching for prefixes of the entire string. We found three of them: 14, 1491, and 14917. Now, look at the "differences" between those strings:
The difference between 14 and 14917 is 917.
The difference between 14 and 1491 is 91
The difference between 1491 and 14917 is 7.
Notice that the difference of each of these strings is itself a substring of 14917 that's a multiple of 7, and indeed if you look at the other strings that we've matched later on in the run of the algorithm we'll find these other strings as well.
This isn't a coincidence. If you have two numbers with a common prefix that are multiples of the same number k, then the "difference" between them will also be a multiple of k. (It's a good exercise to check the math on this.)
So this suggests another route we can take. Suppose that we find all prefixes of the original string that are multiples of k. If we can find all of them, we can then figure out how many pairwise differences there are among those prefixes and potentially avoid rescanning things multiple times. This won't find everything, necessarily, but it will find all substrings that can be formed by computing the difference of two prefixes. Repeating this over all suffixes - and being careful not to double-count things - could really speed things up.
First, let's imagine that we find r different prefixes of the string that are multiples of k. How many total substrings did we just find if we include differences? Well, we've found r strings, plus one extra string for each (unordered) pair of elements, which works out to r + r(r-1)/2 = r(r+1)/2 total substrings discovered. We still need to make sure we don't double-count things, though.
To see whether we're double-counting something, we can use the following technique. As we compute the rolling remainders along the string, we'll store the remainders we find after each entry. If in the course of computing a rolling remainder we rediscover a remainder we've already computed at some point, we know that the work we're doing is redundant; some previous scan over the string will have already computed this remainder and anything we've discovered from this point forward will have already been found.
Putting these ideas together gives us this pseudocode:
numMultiples = 0
seenRemainders = array of n sets, all initially empty
for i = 1 to n:
    remainder = 0
    prefixesFound = 0
    for j = i to n:
        remainder = (10 * remainder + str[j]) % k
        if seenRemainders[j] contains remainder:
            break
        add remainder to seenRemainders[j]
        if remainder == 0
            prefixesFound++
    numMultiples += prefixesFound * (prefixesFound + 1) / 2
return numMultiples
So how efficient is this? At first glance, this looks like it runs in time O(n^2) because of the outer loops, but that's not a tight bound. Notice that each element can only be passed over in the inner loop at most k times, since after that there aren't any remainders that are still free. Therefore, since each element is visited at most O(k) times and there are n total elements, the runtime is O(nk), which meets your runtime requirements.
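For completeness, here is a direct Python translation of that pseudocode (0-based indexing; naming is mine):

def count_divisible_substrings(s, k):
    n = len(s)
    seen = [set() for _ in range(n)]    # remainders already produced at each end position
    total = 0
    for i in range(n):
        remainder = 0
        prefixes_found = 0
        for j in range(i, n):
            remainder = (10 * remainder + int(s[j])) % k
            if remainder in seen[j]:
                break                   # an earlier scan already covered everything from here on
            seen[j].add(remainder)
            if remainder == 0:
                prefixes_found += 1
        total += prefixes_found * (prefixes_found + 1) // 2
    return total

print(count_divisible_substrings("14917", 7))  # 7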

How to determine string S can be made from string T by deleting some characters, but at most K successive characters

Sorry for the long title :)
In this problem, we have string S of length n, and string T of length m. We can check whether S is a subsequence of string T in time complexity O(n+m). It's really simple.
I am curious about: what if we can delete at most K successive characters? For example, if K = 2, we can make "ab" from "accb", but not from "abcccb". I want to check if it's possible very fast.
I could only find the obvious O(nm): check whether it's possible for every pair of suffixes of S and T. I thought maybe a greedy algorithm could work, but if K = 2, the case S = "abc" and T = "ababbc" is a counterexample.
Is there any fast solution to solve this problem?
(Update: I've rewritten the opening of this answer to include a discussion of complexity and to discussion some alternative methods and potential risks.)
(Short answer, the only real improvement above the O(nm) approach that I can think of is to observe that we don't usually need to compute all n times m entries in the table. We can calculate only those cells we need. But in practice it might be very good, depending on the dataset.)
Clarify the problem: We have a string S of length n, and a string T of length m. The maximum allowed gap is k - this gap is to be enforced at the beginning and end of the string also. The gap is the number of unmatched characters between two matched characters - i.e. if the letters are adjacent, that is a gap of 0, not 1.
Imagine a table with n+1 rows and m+1 columns.
0 1 2 3 4 ... m
--------------------
0 | ? ? ? ? ? ?
1 | ? ? ? ? ? ?
2 | ? ? ? ? ? ?
3 | ? ? ? ? ? ?
... |
n | ? ? ? ? ? ?
At first, we could define that the entry in row r and column c is a binary flag that tells us whether the first r characters of S are a valid k-subsequence of the first c characters of T. (Don't worry yet how to compute these values, or even whether these values are useful, we just need to define them clearly first.)
However, this binary-flag table isn't very useful. It's not possible to easily calculate one cell as a function of nearby cells. Instead, we need each cell to store slightly more information. As well as recording whether the relevant strings are a valid subsequence, we need to record the number of consecutive unmatched characters at the end of our substring of T (the substring with c characters). For example, if the first r=2 characters of S are "ab" and the first c=3 characters of T are "abb", then there are two possible matches here: The first characters obviously match with each other, but the b can match with either of the latter b. Therefore, we have a choice of leaving one or zero unmatched bs at the end. Which one do we record in the table?
The answer is that, if a cell has multiple valid values, then we take the smallest one. It's logical that we want to make life as easy as possible for ourselves while matching the remainder of the string, and therefore that the smaller the gap at the end, the better. Be wary of other incorrect optimizations - we do not want to match as many characters as possible, nor as few as possible. That can backfire. But it is logical, for a given pair of strings S,T, to find the match (if there are any valid matches) that minimizes the gap at the end.
One other observation is that if the string S is much shorter than T, then it cannot match. This depends on k also, obviously. The maximum length of T that the first r characters of S can cover is r + (r+1)k (r matched characters plus at most k unmatched characters in each gap); if this is less than c, then we can easily mark (r,c) as -1.
(Any other optimization statements that can be made?)
We do not need to compute all the values in this table. The number of different possible states is k+3. They start off in an 'undefined' state (?). If a matching is not possible for the pair of (sub)strings, the state is -. If a matching is possible, then the score in the cell will be a number between 0 and k inclusive, recording the smallest possible number of unmatched consecutive characters at the end. This gives us a total of k+3 states.
We are interested only in the entry in the bottom right of the table. If f(r,c) is the function that computes a particular cell, then we are interested only in f(n,m). The value for a particular cell can be computed as a function of the values nearby. We can build a recursive algorithm that takes r and c as input and performs the relevant calculations and lookups in term of the nearby values. If this function looks up f(r,c) and finds a ?, it will go ahead and compute it and then store the answer.
It is important to store the answer as the algorithm may query the same cell many times. But also, some cells will never be computed. We just start off attempting to calculate one cell (the bottom right) and just lookup-and-calculate-and-store as necessary.
This is the "obvious" O(nm) approach. The only optimization here is the observation that we don't need to calculate all the cells, therefore this should bring the complexity below O(nm). Of course, with really nasty datasets, you may end up calculating almost all of the cells! Therefore, it's difficult to put an official complexity estimate on this.
Finally, I should say how to compute a particular cell f(r,c):
If r==0 and c <= k, then f(r,c) = 0. An empty string can match any string with up to k characters in it.
If r==0 and c > k, then f(r,c) = -1. Too long for a match.
There are only two other ways a cell can have a successful state. We first try:
If S[r]==T[c] and f(r-1,c-1) != -1, then f(r,c) = 0. This is the best case - a match with no trailing gap.
If that didn't work, we try the next best thing. If f(r,c-1) != -1 and f(r,c-1) < k, then f(r,c) = f(r,c-1)+1.
If neither of those work, then f(r,c) = -1.
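Here is a bottom-up Python sketch of that table, following the cell rules above (variable names are mine):

def is_k_subsequence(S, T, k):
    # f[r][c] = smallest possible number of unmatched trailing characters of
    # T[0..c) when S[0..r) is matched with every gap at most k; -1 if impossible.
    n, m = len(S), len(T)
    f = [[-1] * (m + 1) for _ in range(n + 1)]
    for c in range(0, min(k, m) + 1):
        f[0][c] = 0                         # empty S matches up to k leading characters
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            if S[r - 1] == T[c - 1] and f[r - 1][c - 1] != -1:
                f[r][c] = 0                 # match here, no trailing gap
            elif f[r][c - 1] != -1 and f[r][c - 1] < k:
                f[r][c] = f[r][c - 1] + 1   # leave T[c-1] unmatched, widening the gap
    return f[n][m] != -1

print(is_k_subsequence("ab", "accb", 2))     # True
print(is_k_subsequence("ab", "abcccb", 2))   # False
print(is_k_subsequence("abc", "ababbc", 2))  # True (the tricky case from the question)

Computing cells lazily, as the answer suggests, would avoid filling parts of the table that are never needed.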
The rest of this answer is my initial, Haskell-based approach. One advantage of it is that it 'understands' that it needn't compute every cell, only computing cells where necessary. But it can suffer the inefficiency of calculating one cell many times.
*Also note that the Haskell approach effectively tackles the problem in mirror image - it tries to build matches from the trailing substrings of S and T while minimizing the leading run of unmatched characters. I don't have the time to rewrite it in its 'mirror image' form!
A recursive approach should work. We want a function that will take three arguments, int K, String S, and String T. However, we don't just want a boolean answer as to whether S is a valid k-subsequence of T.
For this recursive approach, if S is a valid k-subsequence, we also want to know about the best subsequence possible by returning how few characters from the start of T can be dropped. We want to find the 'best' subsequence. If a k-subsequence is not possible for S and T, then we return -1, but if it is possible then we want to return the smallest number of characters we can pull from T while retaining the k-subsequence property.
helloworld
  l    r d
This is a valid 4-subsequence, but the biggest gap has (at most) four characters (lowo). This is the best subsequence because it leaves a gap of just two characters at the start (he). Alternatively, here is another valid k-subsequence with the same strings, but it's not as good because it leaves a gap of three at the start:
helloworld
   l   r d
This is written in Haskell, but it should be easy enough to rewrite in any other language. I'll break it down in more detail below.
best :: Int -> String -> String -> Int
--      K      S         T         return
--      where len(S) <= len(T)
best k [] t_string              -- empty S is a subsequence of anything!
    | length(t_string) <= k = length(t_string)
    | length(t_string) > k  = -1
best k sss@(s:ss) [] = (-1)     -- if T is empty, and S is non-empty, then no subsequence is possible
best k sss@(s:ss) tts@(t:ts)    -- both are non-empty. Various possibilities:
    | s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts /= -1, then we have the best outcome
    | best k sss ts /= -1
      && best k sss ts < k = 1 + (best k sss ts) -- this is the only other possibility for a valid k-subsequence
    | otherwise = -1 -- no more options left, return -1 for failure.
A line-by-line analysis:
(A comment in Haskell starts with --)
best :: Int -> String -> String -> Int
A function that takes an Int, and two Strings, and that returns an Int. The return value is to be -1 if a k-subsequence is not possible. Otherwise it will return an integer between 0 and K (inclusive) telling us the smallest possible gap at the start of T.
We simply deal with the cases in order.
best k [] t    -- empty S is a subsequence of anything!
    | length(t) <= k = length(t)
    | length(t) > k  = -1
Above, we handle the case where S is empty ([]). This is simple, as an empty string is always a valid subsequence. But to test if it is a valid k-subsequence, we must calculate the length of T.
best k sss@(s:ss) [] = (-1)
-- if T is empty, and S is non-empty, then no subsequence is possible
That comment explains it. This leaves us with the situations where both strings are non-empty:
best k sss@(s:ss) tts@(t:ts)    -- both are non-empty. Various possibilities:
    | s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts /= -1, then we have the best outcome
    | best k sss ts /= -1
      && best k sss ts < k = 1 + (best k sss ts) -- this is the only other possibility for a valid k-subsequence
    | otherwise = -1 -- no more options left, return -1 for failure.
tts@(t:ts) matches a non-empty string. The name of the string is tts. But there is also a convenient trick in Haskell that allows you to give names to the first letter in the string (t) and the remainder of the string (ts). Here ts should be read aloud as the plural of t - the s suffix here means 'plural'. We say we have a t and some ts, and together they make the full (non-empty) string.
That last block of code deals with the case where both strings are non-empty. The two strings are called sss and tts. But to save us the hassle of writing head sss and tail sss to access the first letter, and the string remainder, of the string, we simply use @(s:ss) to tell the compiler to store those quantities into variables s and ss. If this was C++ for example, you'd get the same effect with char s = sss[0]; as the first line of your function.
The best situation is that the first characters match, s==t, and the remainder of the strings form a valid k-subsequence, best k ss ts /= -1. This allows us to return 0.
The only other possibility for success is if the current complete string (sss) is a valid k-subsequence of the remainder of the other string (ts). We add 1 to this and return, making an exception if the gap would grow too big.
It's very important not to change the order of those last five lines. They are ordered in decreasing order of how 'good' the score is. We want to test for, and return, the very best possibilities first.
Naive recursive solution. Bonus := return value is the number of ways that the string can be matched.
#include <stdio.h>
#include <string.h>

unsigned skipneedle(char *haystack, char *needle, unsigned skipmax)
{
    unsigned found, skipped;
    // fprintf(stderr, "skipneedle(%s,%s,%u)\n", haystack, needle, skipmax);

    if ( !*needle) return strlen(haystack) <= skipmax ? 1 : 0 ;

    found = 0;
    for (skipped = 0; skipped <= skipmax; haystack++, skipped++) {
        if ( !*haystack ) break;
        if ( *haystack == *needle) {
            found += skipneedle(haystack + 1, needle + 1, skipmax);
        }
    }
    return found;
}

int main(void)
{
    char *ab = "ab";
    char *test[] = {"ab", "accb", "abcccb", "abcb", NULL}
        , **cpp;

    for (cpp = test; *cpp; cpp++) {
        printf( "[%s,%s,%u]=%u \n"
              , *cpp, ab, 2
              , skipneedle(*cpp, ab, 2) );
    }
    return 0;
}
An O(p*n) solution, where p = the number of possible subsequences of S in T.
Scan the string T and maintain a list of the possible partial subsequences of S, each recording:
1. the index of the last character found, and
2. the number of deleted characters found so far.
Continue to update this list at each character of T.
Not sure if this is what you're asking for, but you could create a list of characters from each String, and search for instances of the one list in the other; then if (list2.length - K > list1.length) return false.
The following is a proposed algorithm - O(|T|*k) average case:
1> scan T and store character indices in Hash Table :-
eg. S = "abc" T = "ababbc"
Symbol table entries : -
a = 1 3
b = 2 4 5
c = 6
2.> As we know, isValidSub(S,T) = isValidSub(S(0,j),T) && (isValidSub(S(j+1,N),T) || ... || isValidSub(S(j+K,N),T))
a.> We will use a bottom-up approach to solve the above problem.
b.> We will maintain a Valid array Valid(len(S)) where each record points to a hash table (explained as we go along).
c.> Start from the last element of S and look up the indices stored for that character in the symbol table.
e.g. in the above example S[last] = "c"
in the symbol table c = 6
Now we put records like (5,6), (4,6), ..., (6-k-1,6) into the hash table at Valid(last)
Explanation: as s(6,len(S)) is a valid subsequence, s(0,6-i) ++ s(6,len(S)) (where i is in range(1,k+1)) is also a valid subsequence, provided s(0,6-i) is a valid subsequence.
3.> Start filling up the Valid array from the last element down to element 0:
a.> Take an index from the hash table entry corresponding to S[j], where j is the current index of the Valid array we are analysing.
b.> Check whether that index is present as a key in Valid(j+1); if it is, then add (index-i, index), where i is in range(1,k+1), into the Valid(j) hash table.
example:-
S = "abc" T = "ababbc"
iteration 1 :
j = len(S) = 3
S[3] = 'c'
Symbol Table : c = 6
add (5,6),(4,6),(3,6) as K = 2 in Valid(j)
Valid(3) = {(5,6),(4,6),(3,6)}
j = 2
iteration 2 :
S[j] = 'b'
Symbol table: b = 2 4 5
Look up 2 in Valid(3) => not found => skip
Look up 4 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4)}
Look up 5 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4),(4,5)}
j = 1
iteration 3:
S[j] = "a"
Symbol Table : a = 1 3
Look up 1 in Valid(2) => not found
Look up 3 in Valid(2) => found => stop as it is last iteration
END
As 3 is found in Valid(2), that means there exists a valid subsequence starting at index 3 in T
Start = 3
4.> Reconstruct the solution moving downwards in Valid Array :-
example :
Start = 3
Look up 3 in Valid(2) => found (3,4)
Look up 4 in Valid(3) => found (4,6)
END
reconstructed solution (3,4,6) which is indeed valid subsequence
Remember (3,5,6) can also be a solution if we had added (3,5) instead of (3,4) in that iteration
Analysis of Time complexity & Space complexity : -
Time Complexity :
Step 1 : Scan T = O(|T|)
Step 2 : fill up all Valid entries, O(|T|*k), using hash table lookups which are approximately O(1)
Step 3 : Reconstruct solution O(|S|)
Overall average case Time : O(|T|*k)
Space Complexity:
Symbol table = O(|T|+|S|)
Valid table = O(|T|*k) can be improved with optimizations
Overall space = O(|T|*k)
Java Implementation: -
import java.util.ArrayList;
import java.util.HashMap;

public class Subsequence {

    private ArrayList[] SymbolTable = null;
    private HashMap[] Valid = null;
    private String S;
    private String T;

    public ArrayList<Integer> getSubsequence(String S, String T, int K) {
        this.S = S;
        this.T = T;
        if (S.length() > T.length())
            return(null);
        S = S.toLowerCase();
        T = T.toLowerCase();
        SymbolTable = new ArrayList[26];
        for (int i = 0; i < 26; i++)
            SymbolTable[i] = new ArrayList<Integer>();
        char[] s1 = T.toCharArray();
        char[] s2 = S.toCharArray();
        // Calculate Symbol table
        for (int i = 0; i < T.length(); i++) {
            SymbolTable[s1[i] - 'a'].add(i);
        }
        /* for (int j = 0; j < 26; j++) {
               System.out.println(SymbolTable[j]);
           }
        */
        Valid = new HashMap[S.length()];
        for (int i = 0; i < S.length(); i++)
            Valid[i] = new HashMap<Integer,Integer>();
        int Start = -1;
        for (int j = S.length() - 1; j >= 0; j--) {
            int index = s2[j] - 'a';
            // System.out.println(index);
            for (int m = 0; m < SymbolTable[index].size(); m++) {
                if (j == S.length() - 1 || Valid[j + 1].containsKey(SymbolTable[index].get(m))) {
                    int value = (Integer) SymbolTable[index].get(m);
                    if (j == 0) {
                        Start = value;
                        break;
                    }
                    for (int t = 1; t <= K + 1; t++) {
                        Valid[j].put(value - t, value);
                    }
                }
            }
        }
        /* for (int j = 0; j < S.length(); j++) {
               System.out.println(Valid[j]);
           }
        */
        if (Start != -1) { // Solution exists
            ArrayList subseq = new ArrayList<Integer>();
            subseq.add(Start);
            int prev = Start;
            int next;
            // Reconstruct solution
            for (int i = 1; i < S.length(); i++) {
                next = (Integer) Valid[i].get(prev);
                subseq.add(next);
                prev = next;
            }
            return(subseq);
        }
        return(null);
    }

    public static void main(String[] args) {
        Subsequence sq = new Subsequence();
        System.out.println(sq.getSubsequence("abc", "ababbc", 2));
    }
}
Consider a recursive approach: let int f(int i, int j) denote the minimum possible gap at the beginning for S[i...n] matching T[j...m]. f returns -1 if such matching does not exist. Here's the implementation of f:
int f(int i, int j) {
    if (j == m) {
        if (i == n)
            return 0;
        else
            return -1;
    }
    if (i == n) {
        return m - j;   // trailing gap: all remaining characters of T are unmatched
    }
    if (S[i] == T[j]) {
        int tmp = f(i + 1, j + 1);
        if (tmp >= 0 && tmp <= k)
            return 0;
    }
    int tmp = f(i, j + 1);
    return tmp < 0 ? -1 : tmp + 1;  // propagate failure instead of returning -1 + 1
}
If we convert this recursive approach to a dynamic programming approach, then we can have a time complexity of O(nm).
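Here is a memoized Python sketch of this recursion (my own naming), with the -1 propagation handled and a final check that the leading gap itself is at most k:

from functools import lru_cache

def k_gap_match(S, T, k):
    n, m = len(S), len(T)

    @lru_cache(maxsize=None)
    def f(i, j):
        # Minimum possible leading gap when matching S[i:] inside T[j:], or -1.
        if j == m:
            return 0 if i == n else -1
        if i == n:
            return m - j              # everything left in T is trailing gap
        if S[i] == T[j]:
            tmp = f(i + 1, j + 1)
            if 0 <= tmp <= k:
                return 0
        tmp = f(i, j + 1)
        return -1 if tmp < 0 else tmp + 1

    r = f(0, 0)
    return 0 <= r <= k

print(k_gap_match("ab", "accb", 2))    # True
print(k_gap_match("ab", "abcccb", 2))  # False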
Here's an implementation that usually* runs in O(N) and takes O(m) space, where m is length(S).
It uses the idea of a surveyor's chain:
Imagine a series of poles linked by chains of length k.
Anchor the first pole at the beginning of the string.
Now carry the next pole forward until you find a character match.
Place that pole. If there is slack, move on to the next character;
else the previous pole has been dragged forward, and you need to go back
and move it to the next nearest match.
Repeat until you reach the end or run out of slack.
#include <string.h>   /* strlen */
#include <stdlib.h>   /* calloc, free */

typedef struct chain_t {
    int slack;
    int pole;
} chainlink;

int subsequence_k_impl(char* t, char* s, int k, chainlink* link, int len)
{
    char* match = s;
    int extra = k; //total slack in the chain
    //for all chars to match, including final null
    while (match <= s + len) {
        //advance until we find spot for this post or run out of chain
        while (t[link->pole] && t[link->pole] != *match) {
            link->pole++; link->slack--;
            if (--extra < 0) return 0; //no more slack, can't do it.
        }
        //if we ran out of ground, it's no good
        if (t[link->pole] != *match) return 0;
        //if this link has slack, go to next pole
        if (link->slack >= 0) {
            link++; match++;
            //if next pole was already placed,
            while (link[-1].pole < link->pole) {
                //recalc slack and advance again
                extra += link->slack = k - (link->pole - link[-1].pole - 1);
                link++; match++;
            }
            //if not done
            if (match <= s + len) {
                //current pole is out of order (or unplaced), move it next to prev one
                link->pole = link[-1].pole + 1;
                extra += link->slack = k;
            }
        }
        //else drag the previous pole forward to the limit of the chain.
        else if (match >= s) {
            int drag = (link->pole - link[-1].pole - 1) - k;
            link--; match--;
            link->pole += drag;
            link->slack -= drag;
        }
    }
    //all poles planted. good match
    return 1;
}

int subsequence_k(char* t, char* s, int k)
{
    int l = strlen(s);
    if (strlen(t) > (l + 1) * (k + 1))
        return -1; //easy exit
    else {
        chainlink* chain = calloc(sizeof(chainlink), l + 2);
        chain[0].pole = -1; //first pole is anchored before the string
        chain[0].slack = 0;
        chain[1].pole = 0;  //start searching at first char
        chain[1].slack = k;
        l = subsequence_k_impl(t, s, k, chain + 1, l);
        l = l ? chain[1].pole : -1; //pos of first match or -1;
        free(chain);
    }
    return l;
}
* I'm not sure of the big-O. I initially thought it was something like O(km+N). In testing, it averages less than 2N for good matches and less than N for failed matches.
...but... there is a strange degenerate case. For random strings selected from an alphabet of size A, it gets much slower when k = 2A+1. Even in this case it's better than O(Nm), and the performance returns to O(N) when k is increased or decreased slightly. Gist Here if anyone is curious.
