Understanding Knuth-Morris-Pratt Algorithm - string

Can someone explain this to me? I've been reading about it and it still is hard to follow.
text : ababdbaababa
pattern: ababa
table for ababa is -1 0 0 1 2.
I think I understand how the table is constructed but, I dont understand how to shift once mismatch has occurred. Seems like we dont even use the table when shifting?
when do we use the table?

Here I have briefly described computing the prefix function and shifting through the text here.
For further information: Knuth–Morris–Pratt string search algorithm
Shifting through the text :
Text: ABC ABCDAB ABCDABCDABDE
Pattern : ABCDABD
Scenario 1 - There is/are some matching character/s in Pattern and Text.
e.g 1: In here there are 3 matching characters.
Get the value from table for 3 characters. (index 2, ABC) i.e 0
Therefore shift = 3 - 0 i.e 3
e.g 2: In here there are 6 matching characters.
Get the value from table for 6 characters. (index 5, ABCDAB) i.e 2
Therefore shift = 6 - 2 i.e 4
Scenario 2 - If there is no matching characters then shift by one.

the table is used when your mismatch occurs. Let's apply the pattern to your text:
You start matching text with pattern and test if your pattern could be in text, starting at the first position. You compare text[1] with pattern[1] and that turns out to be a match. You do the same for text[2], text[3] and text[4].
when you want to match text[5] with pattern[5] you don't have a match (d<>a). You then know that your pattern will not start at the first position. You could then start the matching all over again for position 2 but that is not efficient. You can use the table now.
The error occured at pattern[5] so you go to table[5] which is 2. That tells you that you can start matching at the current position again with 2 already matched characters. Instead of having to start matching position 2, you can start at your previous position (1) + table[5] (2)=3. Indeed, If we look at text[3] and text[4], we see that it is equal to pattern[1] and pattern[2], respectivily.
The numbers in table tell you how many positions are already matched when an error occurs. In this case 2 characters of the next pattern were already matched. You can then immediately start matching for position 3 and skip position 2 (as the pattern can not be found starting at position[2]).

Well this is an old topic but hopefully someone who searches for this in the future will see it. Answer given above is good but I worked through an example myself to see what's going on exactly.
First part of the exposition is taken from wiki, the part I really wanted to elaborate on is how this backtracking array is constructed.
Here goes:
we work through a (relatively artificial) run of the algorithm, where
W = "ABCDABD" and
S = "ABC ABCDAB ABCDABCDABDE".
At any given time, the algorithm is in a state determined by two integers:
m which denotes the position within S which is the beginning of a prospective match for W
i the index in W denoting the character currently under consideration.
In each step we compare S[m+i] with W[i] and advance if they are equal. This is depicted, at the start of the run, like
1 2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
We proceed by comparing successive characters of W to "parallel" characters of S, moving from one to the next if they match. However, in the fourth step,
we get S[3] is a space and W[3] = 'D', a mismatch. Rather than beginning to search again at S[1], we note that no 'A' occurs between positions 0 and 3 in S
except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again.
Therefore we move on to the next character, setting m = 4 and i = 0.
1 2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
We quickly obtain a nearly complete match "ABCDAB" when, at W[6] (S[10]), we again have a discrepancy. However, just prior to the end of the current partial
match, we passed an "AB" which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match
the two characters prior to the current position, we need not check them again; we simply reset m = 8, i = 2 and continue matching the current character. Thus,
not only do we omit previously matched characters of S, but also previously matched characters of W.
1 2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we return to the beginning of W and begin
searching at the next character of S: m = 11, reset i = 0.
1 2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
Once again we immediately hit upon a match "ABCDAB" but the next character, 'C', does not match the final character 'D' of the word W. Reasoning as before,
we set m = 15, to start at the two-character string "AB" leading up to the current position, set i = 2, and continue matching from the current position.
1 2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
This time we are able to complete the match, whose first character is S[15].
The above example contains all the elements of the algorithm. For the moment, we assume the existence of a "partial match" table T, described below, which
indicates where we need to look for the start of a new match in the event that the current one ends in a mismatch. The entries of T are constructed so that
if we have a match starting at S[m] that fails when comparing S[m + i] to W[i], then the next possible match will start at index m + i - T[i] in S (that is,
T[i] is the amount of "backtracking" we need to do after a mismatch). This has two implications: first, T[0] = -1, which indicates that if W[0] is a mismatch,
we cannot backtrack and must simply check the next character; and second, although the next possible match will begin at index m + i - T[i], as in the example
above, we need not actually check any of the T[i] characters after that, so that we continue searching from W[T[i]].
BACKTRACKING ARRAY CONSTRUCTION:
so this backtracking array T[] we will call lps[], let's see how we calculate this guy
lps[i] = the longest proper prefix of pat[0..i]
which is also a suffix of pat[0..i].
Examples:
For the pattern “AABAACAABAA”,
lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
//so just going through this real quick
lps[0] is just 0 by default
lps[1] is 1 because it's looking at AA and A is both a prefix and suffix
lps[2] is 0 because it's looking at AAB and suffix is B but there is no prefix equal to B unless you count B itself which I guess is against the rules
lps[3] is 1 because it's looking at AABA and first A matches last A
lps[4] is 2 becuase it's looking at AABAA and first 2 A matches last 2 A
lps[5] is 0 becuase it's looking at AABAAC and nothing matches C
...
For the pattern “ABCDE”, lps[] is [0, 0, 0, 0, 0]
For the pattern “AAAAA”, lps[] is [0, 1, 2, 3, 4]
For the pattern “AAABAAA”, lps[] is [0, 1, 2, 0, 1, 2, 3]
For the pattern “AAACAAAAAC”, lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
And this totally makes sense if you think about it...if you mismatch, you want to go back as far as you can obviously, how far back you go (the suffix
portion) is essentially the prefix since you must start matching from the first character again by definition. so if your string looks like
aaaaaaaaaaaaaaa..b..aaaaaaaaaaaaaaac and you mismatche on the last char c, then you want to reuse aaaaaaaaaaaaaaa as your new head, just think it through

A Complete Solution using Java:
package src.com.recursion;
/*
* This Expains the Search of pattern in text in O(n)
*/
public class FindPatternInText {
public int checkIfExists(char[] text, char[] pattern) {
int index = 0;
int[] lps = new int[pattern.length];
createPrefixSuffixArray(pattern, lps);
int i = 0;
int j = 0;
int textLength = text.length;
while (i < textLength) {
if (pattern[j] == text[i]) {
j++;
i++;
}
if (j == pattern.length)
return i - j;
else if (i < textLength && pattern[j] != text[i]) {
if (j != 0) {
j = lps[j - 1];
} else {
i++;
}
}
}
return index;
}
private void createPrefixSuffixArray(char[] pattern, int[] lps) {
lps[0] = 0;
int index = 0;
int i = 1;
while (i < pattern.length) {
if (pattern[i] == pattern[index]) {
lps[i] = index;
i++;
index++;
} else {
if (index != 0) {
index = lps[index - 1];
} else {
lps[i] = 0;
i++;
}
}
}
}
public static void main(String args[]) {
String text = "ABABDABACDABABCABAB";
String pattern = "ABABCABAB";
System.out.println("Point where the pattern match starts is "
+ new FindPatternInText().checkIfExists(text.toCharArray(), pattern.toCharArray()));
}
}

Related

Leetcode Problem First Unique Character In a string

I am struggling to understand how the leetcode solution for the above problem works. If any help on how the post increment operator is working on the value of the array it would be great.
class Solution {
public int firstUniqChar(String s) {
int [] charArr = new int[26];
for(int i=0;i<s.length();i++){
charArr[s.charAt(i)-'a']++;
}
for(int i=0;i<s.length();i++){
if(charArr[s.charAt(i)-'a']==1) return i;
}
return -1;
}
The problem link here https://leetcode.com/problems/first-unique-character-in-a-string/submissions/!
First, you need to understand there are 26 letters in the English alphabet. So the code creates an array of 26 integers that will hold the count of each letter in the string.
int [] charArr = new int[26];
The count of all a's will be at index 0, the cound of all b's at index 1, etc. The default value for int is 0, so this gives an array of 26 zeros to start with.
Each letter has two character codes; one for upper case and one for lower case. The function String.charAt() returns a char but char is an integral type so you can do math on it. When doing math on a char, it uses the char code. So for example:
char c = 'B';
c -= 'A';
System.out.println((int)c); // Will print 1 since char code of 'A' = 65, 'B' = 66
So this line:
charArr[s.charAt(i)-'a']++;
Takes the char at i and subtracts 'a' from it. The range of lower case codes are 97-122. Subtracting 'a' shifts those values to 0-25 - which gives the indexes into the array. (Note this code only checks lower case letters).
After converting the character to an index, it increments the value at that index. So each item in the array represents the character count of the corresponding letter.
For example, the string "aabbee" will give the array {2, 2, 0, 0, 2, 0, 0....0}

find minimum steps required to change one binary string to another

Given two string str1 and str2 which contain only 0 or 1, there
are some steps to change str1 to str2,
step1: find a substring of str1 of length 2 and reverse the substring, and str1 becomes str1' (str1' != str1)
step2: find a substring of str1' of length 3, and reverse the substring, and str1' becomes str1'' (str1'' != str1')
the following steps are similar.
the string length is in the range [2, 30]
Requirement: each step must be performed once and we can not skip
previous steps and perform the next step.
If it is possible to change str1 to str2, output the minimum steps required, otherwise, output -1
Example 1
str1 = "1010", str2 = "0011", the minimum step required is 2
first, choose substring in range [2, 3], "1010" --> "1001",
then choose substring in the range [0, 2], "1001" --> "0011"
Example 2
str1 = "1001", str2 = "0110", it is impossible to change str1 to str2,
because in step1, str1 can be changed to "0101" or "1010", but in step3, it is impossible to change a length3 substring to make it different. So the output is -1.
Example 3
str1 = "10101010", str2 = "00101011", output is 7
I can not figure out example 3, because there are two many possibilities. Can anyone gives some hint on how to solve this problem? What is the type of this
problem? Is it dynamic programming?
This is in fact a dynamic programming problem. To solve it, we are going to try all possible permutations, but memoize the results along the way. It could seem that there are way too many options - there are 2^30 different binary strings of length 30, but keep in mind that reverting a string doesn't change number of zeroes and ones we have, so the upper bound is in fact 30 choose 15 = 155117520 when we have a string of 15 zeroes and ones. Around 150 million possible results is not too bad.
So starting with our start string, we are going to derive all possible string from each string we derived so far, until we generate end string. We are also going to track predecessors to reconstruct generation. Here's my code:
start = '10101010'
end = '00101011'
dp = [{} for _ in range(31)]
dp[1][start] = '' # Originally only start string is reachable
for i in range(2, len(start) + 1):
for s in dp[i - 1].keys():
# Try all possible reversals for each string in dp[i - 1]
for j in range(len(start) - i + 1):
newstr = s
newstr = newstr[:j] + newstr[j:j+i][::-1] + newstr[j+i:]
dp[i][newstr] = s
if end in dp[i]:
ans = []
cur = end
for j in range(i, 0, -1):
ans.append(cur)
cur = dp[j][cur]
print(ans[::-1])
exit(0)
print('Impossible!')
And for your third example, this gives us sequence ['10101010', '10101001', '10101100', '10100011', '00101011'] - from your str1 to str2. If you check differences between the strings, you'll see which transitions were made. So this transformation can be done in 4 steps rather than 7 like you suggested.
Lastly, this will be a bit slow for 30 in python, but if you rewrite it into C++, it's going to be a couple of seconds tops.
This Question can be solved using Backtracking. here is my C++ Code, Which runs smooth with my testcases. This Question Came in an OA of Persistent systems and i was a bit confused about the steps, but this is simple Backtracking. Wants your suggestions if Dp can Optimize my solution!.
//prabaljainn
#include <bits/stdc++.h>
using namespace std;
string s1,s2;
int ans=1e9; int n;
void rec(string s1,int level){
if(s1==s2){
ans = min(ans,level-2);
return;
}
for(int i=0; i<= n-level; i++){
reverse(s1.begin()+i, s1.begin()+i+level);
rec(s1,level+1);
reverse(s1.begin()+i, s1.begin()+i+level);
}
}
int main(){
cin>>s1>>s2;
n = s1.size();
rec(s1,2);
if(ans==1e9)
cout<<"-1"<<endl;
else
cout<<ans<<endl;
}
Happy coding
This problem can be solved using breadth-first search. The following solution uses a queue which stores a pair having the current string as the first member and current operation length(initially 2) as the second member. A set is used to store already visited strings to prevent entering redundant states. For current string, we reverse every substring of length k where k is current operation length and add it to the queue if it hasn't been seen before. If the current string equals the desired string then answer is 'current operation length-2'. If queue becomes empty, then the answer isn't possible.
string str1,str2;
cin>>str1>>str2;
queue<pair<string, int>> q;
set<string> s;
q.push({str1,2});
s.insert(str1);
while(!q.empty())
{
auto p=q.front();
q.pop();
if(p.first==str2)
{
cout<<p.second-2;
return 0;
}
if(p.second<=p.first.size())
{
for(int i=0;i<=p.first.size()-p.second;i++)
{
string x=p.first;
reverse(x.begin()+i,x.begin()+i+p.second);
if(s.find(x)==s.end())
{
q.push({x,p.second+1});
s.insert(x);
}
}
}
}
cout<<-1;
save str1 as start of BFS and at each step,reverse values of all substrings of length 2 and 3 and see if the new strings formed after reversing have been seen previously or not.....if not seen....push them in the queue and also maintain count of steps...if the string at the front of queue is str2 at any time...that step is the answer

In Place Run Length Encoding Algorithm

I encountered an interview question:
Given a input String: aaaaabcddddee, convert it to a5b1c1d4e2.
One extra constraint is, this needs to be done in-place, means no extra space(array) should be used.
It is guaranteed that the encoded string will always fit in the original string. In other words, string like abcde will not occur, since it will be encoded to a1b1c1d1e1 which occupies more space than the original string.
One hint interviewer gave me was to traverse the string once and find the space that is saved.
Still I am stuck as some times, without using extra variables, some values in the input string may be overwritten.
Any suggestions will be appreciated?
This is a good interview question.
Key Points
There are 2 key points:
Single character must be encoded as c1;
The encoded length will always be smaller than the original array.
Since 1, we know each character requires at least 2 places to be encoded. This is to say, only single character will require more spaces to be encoded.
Simple Approach
From the key points, we notice that the single character causes us a lot problem during the encoding, because they might not have enough place to hold the encoded string. So how about we leave them first, and compressed the other characters first?
For example, we encode aaaaabcddddee from the back while leaving the single character first, we will get:
aaaaabcddddee
_____a5bcd4e2
Then we could safely start from the beginning and encoding the partly encoded sequence, given the key point 2 such that there will be enough spaces.
Analysis
Seems like we've got a solution, are we done? No. Consider this string:
aaa3dd11ee4ff666
The problem doesn't limit the range of characters, so we could use digit as well. In this case, if we still use the same approach, we will get this:
aaa3dd11ee4ff666
__a33d212e24f263
Ok, now tell me, how do you distinguish the run-length from those numbers in the original string?
Well, we need to try something else.
Let's define Encode Benefit (E) as: the length difference between the encoded sequence and the original consecutive character sequence..
For example, aa has E = 0, since aa will be encoded to a2, and they have no length difference; aaa has E = 1, since it will be encoded as a3, and the length difference between the encoded and the original is 1. Let's look at the single character case, what's its E? Yes, it's -1. From the definition, we could deduce the formula for E: E = ori_len - encoded_len.
Now let's go back to the problem. From key point 2, we know the encoded string will always be shorter than the original one. How do we use E to rephrase this key point?
Very simple: sigma(E_i) >= 0, where E_i is the Encode Benefit of the ith consecutive character substring.
For example, the sample you gave in your problem: aaaaabcddddee, can be broken down into 5 parts:
E(0) = 5 - 2 = 3 // aaaaa -> a5
E(1) = 1 - 2 = -1 // b -> b1
E(2) = 1 - 2 = -1 // c -> c1
E(3) = 4 - 2 = 2 // dddd -> d4
E(4) = 2 - 2 = 0 // ee -> e2
And the sigma will be: 3 + (-1) + (-1) + 2 + 0 = 3 > 0. This means there will be 3 spaces left after encoding.
However, from this example, we could see a potential problem: since we are doing summing, even if the final answer is bigger than 0, it's possible to get some negatives in the middle!
Yes, this is a problem, and it's quite serious. If we get E falls below 0, this means we do not have enough space to encode the current character and will overwrite some characters after it.
But but but, why do we need to sum it from the first group? Why can't we start summing from somewhere in the middle to skip the negative part? Let's look at an example:
2 0 -1 -1 -1 1 3 -1
If we sum up from the beginning, we will fall below 0 after adding the third -1 at index 4 (0-based); if we sum up from index 5, loop back to index 0 when we reach the end, we have no problem.
Algorithm
The analysis gives us an insight on the algorithm:
Start from the beginning, calculate E of the current consecutive group, and add to the total E_total;
If E_total is still non-negative (>= 0), we are fine and we could safely proceed to the next group;
If the E_total falls below 0, we need to start over from the current position, i.e. clear E_total and proceed to the next position.
If we reach the end of the sequence and E_total is still non-negative, the last starting point is a good start! This step takes O(n) time. Usually we need to loop back and check again, but since key point 2, we will definitely have a valid answer, so we could safely stop here.
Then we could go back to the starting point and start traditional run-length encoding, after we reach the end we need to go back to the beginning of the sequence to finish the first part. The tricky part is, we need to make use the remaining spaces at the end of the string. After that, we need to do some shifting just in case we have some order issues, and remove any extra white spaces, then we are finally done :)
Therefore, we have a solution (the code is just a pseudo and hasn't been verified):
// find the position first
i = j = E_total = pos = 0;
while (i < s.length) {
while (s[i] == s[j]) j ++;
E_total += calculate_encode_benefit(i, j);
if (E_total < 0) {
E_total = 0;
pos = j;
}
i = j;
}
// do run length encoding as usual:
// start from pos, end with len(s) - 1, the first available place is pos
int last_available_pos = runlength(s, pos, len(s)-1, pos);
// a tricky part here is to make use of the remaining spaces from the end!!!
int fin_pos = runlength(s, 0, pos-1, last_available_pos);
// eliminate the white
eliminate(s, fin_pos, pos);
// update last_available_pos because of elimination
last_available_pos -= pos - fin_pos < 0 ? 0 : pos - fin_pos;
// rotate back
rotate(s, last_available_pos);
Complexity
We have 4 parts in the algorithm:
Find the starting place: O(n)
Run-Length-Encoding on the whole string: O(n)
White space elimination: O(n)
In place string rotation: O(n)
Therefore we have O(n) in total.
Visualization
Suppose we need to encode this string: abccdddefggggghhhhh
First step, we need to find the starting position:
Group 1: a -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 1;
Group 2: b -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 2;
Group 3: cc -> E_total += 0 -> E_total = 0 >= 0 -> proceed;
Group 4: ddd -> E_total += 1 -> E_total = 1 >= 0 -> proceed;
Group 5: e -> E_total += -1 -> E_total = 0 >= 0 -> proceed;
Group 6: f -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 9;
Group 7: ggggg -> E_total += 3 -> E_total = 3 >= 0 -> proceed;
Group 8: hhhhh -> E_total += 3 -> E_total = 6 >= 0 -> end;
So the start position will be 9:
v this is the starting point
abccdddefggggghhhhh
abccdddefg5h5______
^ last_available_pos, we need to make use of these remaining spaces
abccdddefg5h5a1b1c2
d3e1f1___g5h5a1b1c2
^^^ remove the white space
d3e1f1g5h5a1b1c2
^ last_available_pos, rotate
a1b1c2d3e1f1g5h5
Last Words
This question is not trivial, and actually glued several traditional coding interview questions together naturally. A suggested mind flow would be:
observe the pattern and figure out the key points;
realize the reason for insufficient space is because of encoding single character;
quantize the benefit/cost of encoding on each consecutive characters group (a.k.a Encoding Benefit);
use the quantization you proposed to explain the original statement;
figure out the algorithm to find a good starting point;
figure out how to do run-length-encoding with a good starting point;
realize you need to rotate the encoded string and eliminate the white spaces;
figure out the algorithm to do in place string rotation;
figure out the algorithm to do in place white space elimination.
To be honest, it's a bit challenging for an interviewee to come up with a solid algorithm in a short time, so your analysis flow really matters. Don't say nothing, show your mind flow, this helps the interviewer to find out your current stage.
Maybe just encode it normally, but if you see that your output index overtakes the input index, just skip the "1". Then when you finish go backwards and insert 1 after all letters without a count, shifting the rest of the string back. It is O(N^2) in the worst case (no repeating letters), so I assume there might be better solutions.
EDIT: it appears I missed the part that the final string always fits into the source. With that restriction, yeah, this is not the optimal solution.
EDIT2: an O(N) version of it would be during the first pass also compute the final compressed length (which in the general case might be more than the source), set pointer p1 to it, a pointer p2 to the compressed string with 1s omitted (p2 is thus <= p1), then just keep going backwards on both pointers, copying p2 to p1 and adding 1s when necessary (when this happens the difference between p2 and p1 will decrease)
O(n) and in place
set var = 0;
Loop from 1-length and find the first non-matching character.
The count would be the difference of the indices of both characters.
Let's run through an example
s = "wwwwaaadexxxxxxywww"
add a dummy letter to s
s = s + '#'
now our string becomes
s = "wwwwaaadexxxxxxywww#"
we'll come back to this step later.
j gives the first character of the string.
j = 0 // s[j] = w
now loop through 1 - length. The first non-matching character is 'a'
print(s[j], i - j) // i = 4, j = 0
j = i // j = 4, s[j] = a
Output: w4
i becomes the next non-matching character which would be 'd'
print(s[j], i - j) // i = 7, j = 4 => a3
j = i // j = 7, s[j] = d
Output: w4a3
.
. (Skipping to the second last)
.
j = 15, s[j] = y, i = 16, s[i] = w
print(s[j], i - y) => y1
Output: w4a3d1e1x6y1
Okay so now we reached the last, assume that we didn't add any dummy letter
j = 16, s[j] = w and we cannot print it's count
because we've no 'mis-matching' character
That's why need to add a dummy letter.
Here's a C++ implementation
void compress(string s){
int j = 0;
s = s + '#';
for(int i=1; i < s.length(); i++){
if(s[i] != s[j]){
cout << s[j] << i - j;
j = i;
}
}
}
int main(){
string s = "wwwwaaadexxxxxxywww";
compress(s);
return 0;
}
Output: w4a3d1e1x6y1w3
If the use of insert and erase string functions are allowed then you can efficiently get the solution with this implementation.
#include<bits/stdc++.h>
using namespace std;
int dig(int n){
int k=0;
while(n){
k++;
n/=10;
}
return k;
}
void stringEncoding(string &n){
int i=0;
for(int i=0;i<n.size();i++){
while(n[i]==n[i+j])j++;
n.erase((i+1),(j-1));
n.insert(i+1,to_string(j));
i+=(dig(j));
}
}
int main(){
ios_base::sync_with_stdio(0), cin.tie(0);
string n="kaaaabcddedddllllllllllllllllllllllp";
stringEncoding(n);
cout<<n;
}
This will give the following output : k1a4b1c1d2e1d3l22p1

How to determine string S can be made from string T by deleting some characters, but at most K successive characters

Sorry for the long title :)
In this problem, we have string S of length n, and string T of length m. We can check whether S is a subsequence of string T in time complexity O(n+m). It's really simple.
I am curious about: what if we can delete at most K successive characters? For example, if K = 2, we can make "ab" from "accb", but not from "abcccb". I want to check if it's possible very fast.
I could only find obvious O(nm): check if it's possible for every suffix pairs in string S and string T. I thought maybe greedy algorithm could be possible, but if K = 2, the case S = "abc" and T = "ababbc" is a counterexample.
Is there any fast solution to solve this problem?
(Update: I've rewritten the opening of this answer to include a discussion of complexity and to discussion some alternative methods and potential risks.)
(Short answer, the only real improvement above the O(nm) approach that I can think of is to observe that we don't usually need to compute all n times m entries in the table. We can calculate only those cells we need. But in practice it might be very good, depending on the dataset.)
Clarify the problem: We have a string S of length n, and a string T of length m. The maximum allowed gap is k - this gap is to be enforced at the beginning and end of the string also. The gap is the number of unmatched characters between two matched characters - i.e. if the letters are adjacent, that is a gap of 0, not 1.
Imagine a table with n+1 rows and m+1 columns.
0 1 2 3 4 ... m
--------------------
0 | ? ? ? ? ? ?
1 | ? ? ? ? ? ?
2 | ? ? ? ? ? ?
3 | ? ? ? ? ? ?
... |
n | ? ? ? ? ? ?
At first, we we could define that the entry in row r and column c is a binary flag that tells us whether the first r characters of of S is a valid k-subsequence of the first c characters of T. (Don't worry yet how to compute these values, or even whether these values are useful, we just need to define them clearly first.)
However, this binary-flag table isn't very useful. It's not possible to easily calculate one cell as a function of nearby cells. Instead, we need each cell to store slightly more information. As well as recording whether the relevant strings are a valid subsequence, we need to record the number of consecutive unmatched characters at the end of our substring of T (the substring with c characters). For example, if the first r=2 characters of S are "ab" and the first c=3 characters of T are "abb", then there are two possible matches here: The first characters obviously match with each other, but the b can match with either of the latter b. Therefore, we have a choice of leaving one or zero unmatched bs at the end. Which one do we record in the table?
The answer is that, if a cell has multiple valid values, then we take the smallest one. It's logical that we want to make life as easy as possible for ourselves while matching the remainder of the string, and therefore that the smaller the gap at the end, the better. Be wary of other incorrect optmizations - we do not want to match as many characters as possible or as few characters. That can backfire. But it is logical, for a given pair of strings S,T, to find the match (if there are any valid matches) that minimizes the gap at the end.
One other observation is that if the string S is much shorter than T, then it cannot match. This depends on k also obviously. The maximum length that S can cover is rk, if this is less than c, then we can easily mark (r,c) as -1.
(Any other optimization statements that can be made?)
We do not need to compute all the values in this table. The number of different possible states is k+3. They start off in an 'undefined' state (?). If a matching is not possible for the pair of (sub)strings, the state is -. If a matching is possible, then the score in the cell will be a number between 0 and k inclusive, recording the smallest possible number of unmatched consecutive characters at the end. This gives us a total of k+3 states.
We are interested only in the entry in the bottom right of the table. If f(r,c) is the function that computes a particular cell, then we are interested only in f(n,m). The value for a particular cell can be computed as a function of the values nearby. We can build a recursive algorithm that takes r and c as input and performs the relevant calculations and lookups in term of the nearby values. If this function looks up f(r,c) and finds a ?, it will go ahead and compute it and then store the answer.
It is important to store the answer as the algorithm may query the same cell many times. But also, some cells will never be computed. We just start off attempting to calculate one cell (the bottom right) and just lookup-and-calculate-and-store as necessary.
This is the "obvious" O(nm) approach. The only optimization here is the observation that we don't need to calculate all the cells, therefore this should bring the complexity below O(nm). Of course, with really nasty datasets, you may end up calculating almost all of the cells! Therefore, it's difficult to put an official complexity estimate on this.
Finally, I should say how to compute a particular cell f(r,c):
If r==0 and c <= k, then f(r,c) = 0. An empty string can match any string with up to k characters in it.
If r==0 and c > k, then f(r,c) = -1. Too long for a match.
There are only two other ways a cell can have a successful state. We first try:
If S[r]==T[c] and f(r-1,c-1) != -1, then f(r,c) = 0. This is the best case - a match with no trailing gap.
If that didn't work, we try the next best thing. If f(r,c-1) != -1 and f(r,c) < k, then f(r,c) = f(r,c-1)+1.
If neither of those work, then f(r,c) = -1.
The rest of this answer is my initial, Haskell-based approach. One advantage of it is that it 'understands' that it needn't compute every cell, only computing cells where necessary. But it could make the inefficiency of calculating one cell many times.
*Also note that the Haskell approach is effectively approaching the problem in a mirror image - it trying to build matches from the end substrings of S and T where minimal leading bunch of unmatched characters. I don't have the time to rewrite it in its 'mirror image' form!
A recursive approach should work. We want a function that will take three arguments, int K, String S, and String T. However, we don't just want a boolean answer as to whether S is a valid k-subsequence of T.
For this recursive approach, if S is a valid k-subsequence, we also want to know about the best subsequence possible by returning how few characters from the start of T can be dropped. We want to find the 'best' subsequence. If a k-subsequence is not possible for S and T, then we return -1, but if it is possible then we want to return the smallest number of characters we can pull from T while retaining the k-subsequence property.
helloworld
l r d
This is a valid 4-subsequence, but the biggest gap has (at most) four characters (lowo). This is the best subsequence because it leaves a gap of just two characters at the start (he). Alternatively, here is another valid k-subsequence with the same strings, but it's not as good because it leaves a gap of three at the start:
helloworld
l r d
This is written in Haskell, but it should be easy enough to rewrite in any other language. I'll break it down in more detail below.
best :: Int -> String -> String -> Int
-- K S T return
-- where len(S) <= len(T)
best k [] t_string -- empty S is a subsequence of anything!
| length(t_string) <= k = length(t_string)
| length(t_string) > k = -1
best k sss#(s:ss) [] = (-1) -- if T is empty, and S is non-empty, then no subsequence is possible
best k sss#(s:ss) tts#(t:ts) -- both are non-empty. Various possibilities:
| s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts != -1, then we have the best outcome
| best k sss ts /= -1
&& best k sss ts < k = 1+ (best k sss ts) -- this is the only other possibility for a valid k-subsequence
| otherwise = -1 -- no more options left, return -1 for failure.
A line-by-line analysis:
(A comment in Haskell starts with --)
best :: Int -> String -> String -> Int
A function that takes an Int, and two Strings, and that returns an Int. The return value is to be -1 if a k-subsequence is not possible. Otherwise it will return an integer between 0 and K (inclusive) telling us the smallest possible gap at the start of T.
We simply deal with the cases in order.
best k [] t -- empty S is a subsequence of anything!
| length(t) <= k = length(t)
| length(t) > k = -1
Above, we handle the case where S is empty ([]). This is simple, as an empty string is always a valid subsequence. But to test if it is a valid k-subsequence, we must calculate the length of T.
best k sss#(s:ss) [] = (-1)
-- if T is empty, and S is non-empty, then no subsequence is possible
That comment explains it. This leaves us with the situations where both strings are non-empty:
best k sss#(s:ss) tts#(t:ts) -- both are non-empty. Various possibilities:
| s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts != -1, then we have the best outcome
| best k sss ts /= -1
&& best k sss ts < k = 1+ (best k sss ts) -- this is the only other possibility for a valid k-subsequence
| otherwise = -1 -- no more options left, return -1 for failure.
tts#(t:ts) matches a non-empty string. The name of the string is tts. But there is also a convenient trick in Haskell to allow you to give names to the first letter in the string (t) and the remainder of the string (ts). Here ts should be read aloud as the plural of t - the s suffix here means 'plural'. We say have have a t and some ts and together they make the full (non-empty) string.
That last block of code deals with the case where both strings are non-empty. The two strings are called sss and tts. But to save us the hassle of writing head sss and tail sss to access the first letter, and the string-remainer, of the string, we simply use #(s:ss) to tell the compiler to store those quantities into variables s and ss. If this was C++ for example, you'd get the same effect with char s = sss[0]; as the first line of your function.
The best situation is that the first characters match s==t and the remainder of the strings are a valid k-subsequence best k sss ts /= -1. This allows us to return 0.
The only other possibility for success if if the current complete string (sss) is a valid k-subsequence of the remainder of the other string (ts). We add 1 to this and return, but making an exception if the gap would grow too big.
It's very important not to change the order of those last five lines. They are order in decreasing order of how 'good' the score is. We want to test for, and return the very best possibilities first.
Naive recursive solution. Bonus := return value is the number of ways that the string can be matched.
#include <stdio.h>
#include <string.h>
unsigned skipneedle(char *haystack, char *needle, unsigned skipmax)
{
unsigned found,skipped;
// fprintf(stderr, "skipneedle(%s,%s,%u)\n", haystack, needle, skipmax);
if ( !*needle) return strlen(haystack) <= skipmax ? 1 : 0 ;
found = 0;
for (skipped=0; skipped <= skipmax ; haystack++,skipped++ ) {
if ( !*haystack ) break;
if ( *haystack == *needle) {
found += skipneedle(haystack+1, needle+1, skipmax);
}
}
return found;
}
int main(void)
{
char *ab = "ab";
char *test[] = {"ab" , "accb" , "abcccb" , "abcb", NULL}
, **cpp;
for (cpp = test; *cpp; cpp++ ) {
printf( "[%s,%s,%u]=%u \n"
, *cpp, ab, 2
, skipneedle(*cpp, ab, 2) );
}
return 0;
}
An O(p*n) solution where p = number of subsequences possible of S in T.
Scan the string T and maintain a list of possible subsequences of S that would have
1. Index of last character found and
2. Number of characters to be deleted found
Continue to update this list at each character of T.
Not sure if this is what your asking for, but you could create a list of characters from each String, and search for instances of the one list in the other, then if(list2.length-K > list1.length) return false.
Following is a proposed algorithm : - O(|T|*k) average case
1> scan T and store character indices in Hash Table :-
eg. S = "abc" T = "ababbc"
Symbol table entries : -
a = 1 3
b = 2 4 5
c = 6
2.> as we know isValidSub(S,T) = isValidSub(S(0,j),T) && (isValidSub(S(j+1,N),T)||....isValidSub(S(j+K,T),T))
a.> we will use the bottom up approach to solve above problem
b.> we will maintain an valid array Valid(len(S)) where each record points to a Hash Table (Explained as we go along solving further)
c.> Start from the last element of S, Look up for the indices stored corresponding to the character in Symbol Table
eg. in above example S[last] = "c"
in Symbol Table c = 6
Now we put records like (5,6) , (4,6) ,.... (6-k-1,6) into Hash table at Valid(last)
Explanation : - as s(6,len(S)) is valid subsequence hence s(0,6-i) ++ s(6,len(S)) (where i is in range(1,k+1)) is also valid subsequence provided s(0,6-i) is valid subsequence.
3.> start filling up Valid Array from last to 0 element : -
a.> take a indice from hash table entry corresponding to S[j] where j is current indice of Valid Array we are analysing.
b.> Check whether indice is in Valid(j+1) if less then add (indice-i,indice) where i in range(1,k+1) into Valid(j) Hash Table
example:-
S = "abc" T = "ababbc"
iteration 1 :
j = len(S) = 3
S[3] = 'c'
Symbol Table : c = 6
add (5,6),(4,6),(3,6) as K = 2 in Valid(j)
Valid(3) = {(5,6),(4,6),(3,6)}
j = 2
iteration 2 :
S[j] = 'b'
Symbol table: b = 2 4 5
Look up 2 in Valid(3) => not found => skip
Look up 4 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4)}
Look up 5 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4),(4,5)}
j = 1
iteration 3:
S[j] = "a"
Symbol Table : a = 1 3
Look up 1 in Valid(2) => not found
Look up 3 in Valid(2) => found => stop as it is last iteration
END
as 3 is found in Valid(2) that means there exists a valid subsequence starting at in T
Start = 3
4.> Reconstruct the solution moving downwards in Valid Array :-
example :
Start = 3
Look up 3 in Valid(2) => found (3,4)
Look up 4 in Valid(3) => found (4,6)
END
reconstructed solution (3,4,6) which is indeed valid subsequence
Remember (3,5,6) can also be a solution if we had added (3,5) instead of (3,4) in that iteration
Analysis of Time complexity & Space complexity : -
Time Complexity :
Step 1 : Scan T = O(|T|)
Step 2 : fill up all Valid entries O(|T|*k) using HashTable lookup is aprox O(1)
Step 3 : Reconstruct solution O(|S|)
Overall average case Time : O(|T|*k)
Space Complexity:
Symbol table = O(|T|+|S|)
Valid table = O(|T|*k) can be improved with optimizations
Overall space = O(|T|*k)
Java Implementation: -
public class Subsequence {
private ArrayList[] SymbolTable = null;
private HashMap[] Valid = null;
private String S;
private String T;
public ArrayList<Integer> getSubsequence(String S,String T,int K) {
this.S = S;
this.T = T;
if(S.length()>T.length())
return(null);
S = S.toLowerCase();
T = T.toLowerCase();
SymbolTable = new ArrayList[26];
for(int i=0;i<26;i++)
SymbolTable[i] = new ArrayList<Integer>();
char[] s1 = T.toCharArray();
char[] s2 = S.toCharArray();
//Calculate Symbol table
for(int i=0;i<T.length();i++) {
SymbolTable[s1[i]-'a'].add(i);
}
/* for(int j=0;j<26;j++) {
System.out.println(SymbolTable[j]);
}
*/
Valid = new HashMap[S.length()];
for(int i=0;i<S.length();i++)
Valid[i] = new HashMap<Integer,Integer >();
int Start = -1;
for(int j = S.length()-1;j>=0;j--) {
int index = s2[j] - 'a';
//System.out.println(index);
for(int m = 0;m<SymbolTable[index].size();m++) {
if(j==S.length()-1||Valid[j+1].containsKey(SymbolTable[index].get(m))) {
int value = (Integer)SymbolTable[index].get(m);
if(j==0) {
Start = value;
break;
}
for(int t=1;t<=K+1;t++) {
Valid[j].put(value-t, value);
}
}
}
}
/* for(int j=0;j<S.length();j++) {
System.out.println(Valid[j]);
}
*/
if(Start != -1) { //Solution exists
ArrayList subseq = new ArrayList<Integer>();
subseq.add(Start);
int prev = Start;
int next;
// Reconstruct solution
for(int i=1;i<S.length();i++) {
next = (Integer)Valid[i].get(prev);
subseq.add(next);
prev = next;
}
return(subseq);
}
return(null);
}
public static void main(String[] args) {
Subsequence sq = new Subsequence();
System.out.println(sq.getSubsequence("abc","ababbc", 2));
}
}
Consider a recursive approach: let int f(int i, int j) denote the minimum possible gap at the beginning for S[i...n] matching T[j...m]. f returns -1 if such matching does not exist. Here's the implementation of f:
int f(int i, int j){
if(j == m){
if(i == n)
return 0;
else
return -1;
}
if(i == n){
return m - j;
}
if(S[i] == T[j]){
int tmp = f(i + 1, j + 1);
if(tmp >= 0 && tmp <= k)
return 0;
}
return f(i, j + 1) + 1;
}
If we convert this recursive approach to a dynamic programming approach, then we can have a time complexity of O(nm).
Here's an implementation that usually* runs in O(N) and takes O(m) space, where m is length(S).
It uses the idea of a surveyor's chain:
Imagine a series of poles linked by chains of length k.
Achor the first pole at the beginning of the string.
Now cary the next pole forward until you find a character match.
Place that pole. If there is slack, move on to the next character;
else the previous pole has been dragged forward, and you need to go back
and move it to the next nearest match.
Repeat until you reach the end or run out of slack.
typedef struct chain_t{
int slack;
int pole;
} chainlink;
int subsequence_k_impl(char* t, char* s, int k, chainlink* link, int len)
{
char* match=s;
int extra = k; //total slack in the chain
//for all chars to match, including final null
while (match<=s+len){
//advance until we find spot for this post or run out of chain
while (t[link->pole] && t[link->pole]!=*match ){
link->pole++; link->slack--;
if (--extra<0) return 0; //no more slack, can't do it.
}
//if we ran out of ground, it's no good
if (t[link->pole] != *match) return 0;
//if this link has slack, go to next pole
if (link->slack>=0) {
link++; match++;
//if next pole was already placed,
while (link[-1].pole < link->pole) {
//recalc slack and advance again
extra += link->slack = k-(link->pole-link[-1].pole-1);
link++; match++;
}
//if not done
if (match<=s+len){
//currrent pole is out of order (or unplaced), move it next to prev one
link->pole = link[-1].pole+1;
extra+= link->slack = k;
}
}
//else drag the previous pole forward to the limit of the chain.
else if (match>=s) {
int drag = (link->pole - link[-1].pole -1)- k;
link--;match--;
link->pole+=drag;
link->slack-=drag;
}
}
//all poles planted. good match
return 1;
}
int subsequence_k(char* t, char* s, int k)
{
int l = strlen(s);
if (strlen(t)>(l+1)*(k+1))
return -1; //easy exit
else {
chainlink* chain = calloc(sizeof(chainlink),l+2);
chain[0].pole=-1; //first pole is anchored before the string
chain[0].slack=0;
chain[1].pole=0; //start searching at first char
chain[1].slack=k;
l = subsequence_k_impl(t,s,k,chain+1,l);
l=l?chain[1].pole:-1; //pos of first match or -1;
free(chain);
}
return l;
}
* I'm not sure of the big-O. I initially thought it was something like O(km+N). In testing, it averages less than 2N for good matches and less than N for failed matches.
...but.. there is a strange degenerate case. For random strings selected from an alphabet of size A, it gets much slower when k = 2A+1. Even this case it's better than O(Nm), and the performance returns to O(N) when k is increased or decreased slightly. Gist Here if anyone is curious.

Removing minimal letters from a string 'A' to remove all instances of string 'B'

If we have string A of length N and string B of length M, where M < N, can I quickly compute the minimum number of letters I have to remove from string A so that string B does not occur as a substring in A?
If we have tiny string lengths, this problem is pretty easy to brute force: you just iterate a bitmask from 0 to 2^N and see if B occurs as a substring in this subsequence of A. However, when N can go up to 10,000 and M can go up to 1,000, this algorithm obviously falls apart quickly. Is there a faster way to do this?
Example: A=ababaa B=aba. Answer=1.Removing the second a in A will result in abbaa, which does not contain B.
Edit: User n.m. posted a great counter example: aabcc and abc. We want to remove the single b, because removing any a or c will create another instance of the string abc.
Solve it with dynamic programming. Let dp[i][j] the minimum operator to make A[0...i-1] have a suffix of B[0...j-1] as well as A[0...i] doesn't contain B, dp[i][j] = Infinite to index the operator is impossible. Then
if(A[i-1]=B[i-1])
dp[i][j] = min(dp[i-1][j-1], dp[i-1][j])
else dp[i][j]=dp[i-1][j]`,
return min(A[N][0],A[N][1],...,A[N][M-1]);`
Can you do a graph search on the string A. This is probably too slow for large N and special input but it should work better than an exponential brute force algorithm. Maybe a BFS.
I'm not sure this question is still of someone interest, but I have an idea that maybe could work.
once we decided that the problem is not to find the substring, is to decide which letter is more convenient to remove from string A, the solution to me appears pretty simple: if you find an occurrence of B string into A, the best thing you can do is just remove a char that is inside the string, closed to the right bondary...let say the one previous the last. That's why if you have a substring that actually end how it starts, if you remove a char at the beginning you just remove one of the B occurencies, while you can actually remove two at once.
Algorithm in pseudo cose:
String A, B;
int occ_posit = 0;
N = B.length();
occ_posit = A.getOccurrencePosition(B); // pseudo function that get the first occurence of B into A and returns the offset (1° byte = 1), or 0 if no occurence present.
while (occ_posit > 0) // while there are B into A
{
if (B.firstchar == B.lastchar) // if B starts as it ends
{
if (A.charat[occ_posit] == A.charat[occ_posit+1])
A.remove[occ_posit - 1]; // no reason to remove A[occ_posit] here
else
A.remove[occ_posit]; // here we remove the last char, so we could remove 2 occurencies at the same time
}
else
{
int x = occ_posit + N - 1;
while (A.charat[x + 1] == A.charat[x])
x--; // find the first char different from the last one
A.remove[x]; // B does not ends as it starts, so if there are overlapping instances they overlap by more than one char. Removing the first that is not equal to the char following B instance, we kill both occurrencies at once.
}
}
Let's explain with an example:
A = "123456789000987654321"
B = "890"
read this as a table:
occ_posit: 123456789012345678901
A = "123456789000987654321"
B = "890"
first occurrence is at occ_posit = 8. B does not end as it starts, so it get into the second loop:
int x = 8 + 3 - 1 = 10;
while (A.charat[x + 1] == A.charat[x])
x--; // find the first char different from the last one
A.remove[x];
the while find that A.charat11 matches A.charat[10] (="0"), so x become 9 and then while exits as A.charat[10] does not match A.charat9. A then become:
A = "12345678000987654321"
with no more occurencies in it.
Let's try with another:
A = "abccccccccc"
B = "abc"
first occurrence is at occ_posit = 1. B does not end as it starts, so it get into the second loop:
int x = 1 + 3 - 1 = 3;
while (A.charat[x + 1] == A.charat[x])
x--; // find the first char different from the last one
A.remove[x];
the while find that A.charat4 matches A.charat[3] (="c"), so x become 2 and then while exits as A.charat[3] does not match A.charat2. A then become:
A = "accccccccc"
let's try with overlapping:
A = "abcdabcdabff"
B = "abcdab"
the algorithm results in: A = "abcdacdabff" that has no more occurencies.
finally, one letter overlap:
A = "abbabbabbabba"
B = "abba"
B end as it starts, so it enters the first if:
if (A.charat[occ_posit] == A.charat[occ_posit+1])
A.remove[occ_posit - 1]; // no reason to remove A[occ_posit] here
else
A.remove[occ_posit]; // here we remove the last char, so we could remove 2 occurencies at the same time
that lets the last "a" of B instance to be removed. So:
1° step: A= "abbbbabbabba"
2° step: A= "abbbbabbbba" and we are done.
Hope this helps
EDIT: pls note that the algotirhm must be corrected a little not to give error when you are close to the A end with your search, but this is just an easy programming issue.
Here's a sketch I've come up with.
First, if A contains any symbols that are not found in B, split up A into a bunch of smaller strings containing only those characters found in B. Apply the algorithm on each of the smaller strings, then glue them back together to get the total result. This really functions as an optimization.
Next, check if A contains any of B. If there isn't, you're done. If A = B, then delete all of them.
I think a relatively greedy algorithm may work.
First, mark all of the symbols in A which belong to at least one occurrence of B. Let A = aabcbccabcaa, B = abc. Bolding indicates these marked characters:
a abc bcc abc aa. If there's an overlap, mark all possible. This operation is naively approximately (A-B) operations, but I believe it can be done in around (A/B) operations.
Consider the deletion of each marked letter in A: a abc bcc abc aa.
Check whether the deletion of that marked letter decreases the number of marked letters. You only need to check the substrings which could possibly be affected by the deletion of the letter. If B has a length of 4, only the substrings starting at the following locations would need to be deleted if x were being checked:
-------x------
^^^^
Any further left or right will exist regardless of the presence of x.
For instance:
Marking the [a] in the following string: a [a]bc bcc abc aa.
Its deletion yields abcbccabcaa, which when marked produces abc bcc abc aa, which has an equal number of marked characters. Since only the relative number is required for this operation, it can be done in approximately 2B time for each selected letter. For each, assign the relative difference between the two. Pick an arbitrary one which is maximal and delete it. Repeat until done. Each pass is roughly up to 2AB operations, for a maximum of A passes, giving a total time of about 2A^2 B.
In the above example, these values are assigned:
aabcbccabcaa
033 333
So arbitrarily deleting the first marked b gives you: aacbccabcaa. If you repeat the process, you get:
aacbccabcaa
333
The final result is done.
I believe the algorithm is correctly minimal. I think it is true that whenever A requires only one deletion, the algorithm must be optimal. In that case, the letter which reduces the most possible matches (ie: all of them) should be best. I can come up with no such proof, though. I'd be interested in finding any counter-examples to optimality.
Find the indeces of each substring in the main string.
Then using a dynamic programming algorithm (so memoize intermediate values), remove each letter that is part of a substring from the main string, add 1 to the count, and repeat.
You can find the letters, because they are within the indeces of each match index + length of B.
A = ababaa
B = aba
count = 0
indeces = (0, 2)
A = babaa, aabaa, abbaa, abbaa, abaaa, ababa
B = aba
count = 1
(2nd abbaa is memoized)
indeces = (1), (1), (), (), (0), (0, 2)
answer = 1
You can take it a step further, and try to memoize the substring match indeces of substrings, but that might not actually be a performance gain.
Not sure on the exact bounds, but shouldn't take too long computationally.

Resources