What is the best algorithm to find the longest substring with constraints?

Unfortunately I don't know the name of the following problem, but I am sure it is a well-known one. I want to find an efficient algorithm to solve it.
Let S be the input string and K some number (1 <= K <= 26).
The problem is to find the longest substring of S which has only K different characters. What is the best algorithm to solve this problem?
Some examples:
1) S = aaaaabcdef, K = 3, answer = aaaaabc
2) S = acaaba, K = 2, answer = acaa or aaba
3) S = abcde, K = 5, answer = abcde
I have a sketch of a solution to this problem, but it seems too complicated to me and it has quadratic complexity. In a single linear pass I can collapse each run of identical characters into one character with its count. The next step is to use a set which will contain only K characters. The usage looks like this:
std::string max_string;
for (int i = 0; i < s.size(); ++i)
{
    std::set<int> my_set;
    std::string possible_solution;
    for (int j = i; j < s.size(); ++j)
    {
        // filling set and possible_solution
    }
    if (my_set.size() == K && possible_solution.size() > max_string.size())
        max_string = possible_solution;
}

Notation:
s = input string, zero-based index
[start, end) = substring of input from start to end, including start but excluding end
k-substring = a substring that contains at most k different characters
Algorithm: linear complexity O(n)
start = 0
result = empty string
find max(end): [start, end) is a k-substring
LOOP:
// please note in every loop iteration, [start, end) is a k-substring
update result=[start, end) if (end-start) > length(result)
if end >= length(s) then DONE! EXIT
increase start until [start, end) is a (k-1)-substring
increase end while [start, end + 1) is a k-substring
ENDLOOP
To check whether increasing start or end respectively decreases or increases the character pool size (the k property), we can use a count[] array, where count[c] = number of occurrences of c in the current substring [start, end).
C++ Implementation: http://ideone.com/i2JPCq
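In case the link goes stale, here is a minimal Python sketch of the same sliding-window idea. It is not the linked implementation: the function name is made up, a dict stands in for count[], and it shrinks the window only as far as needed instead of down to a (k-1)-substring, which is a common variant of the loop above.

```python
def longest_k_substring(s, k):
    """Return the first longest substring of s with at most k distinct characters."""
    count = {}                      # count[c] = occurrences of c in s[start:end+1]
    start = 0
    best_start, best_len = 0, 0
    for end, ch in enumerate(s):
        count[ch] = count.get(ch, 0) + 1
        # Shrink from the left until the window is a k-substring again.
        while len(count) > k:
            left = s[start]
            count[left] -= 1
            if count[left] == 0:
                del count[left]
            start += 1
        if end + 1 - start > best_len:
            best_start, best_len = start, end + 1 - start
    return s[best_start:best_start + best_len]
```

Both start and end only move forward, so each character is added and removed at most once.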

The best solution I can come up with has time complexity O(n * log(n)) and additional memory complexity O(n). The idea is the following:
First for all 26 characters compute a prefix sum array. For the character C this array has the following property a0 = 0, ai = <number of occurrences of C up to position i>. It is very easy to compute this:
a[0] = 0;
for (int i = 1; i <= n; ++i) {
a[i] = a[i - 1] + (s[i - 1] == C)
}
Now let us assume you have these arrays. It is very easy to compute the number of occurrences of the character C in a closed interval [i, j]: it is precisely a[j + 1] - a[i]. Using this you can also check whether C appears somewhere in the interval [i, j]: simply check whether the count of occurrences is greater than 0.
The last part of my solution is to use binary search. For each index i in the string use binary search to identify what is the longest length of substring starting at position i that has no more than K different characters. The complexity of this part of the algorithm is O(n * log(n)).
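A rough Python sketch of the whole idea, assuming "at most K different characters" is what is wanted. The function and helper names are hypothetical, and the prefix arrays are built only for letters that actually occur in the string.

```python
def longest_k_substring_bsearch(s, k):
    n = len(s)
    letters = sorted(set(s))
    # prefix[c][i] = number of occurrences of c in s[:i]
    prefix = {c: [0] * (n + 1) for c in letters}
    for c in letters:
        row = prefix[c]
        for i in range(1, n + 1):
            row[i] = row[i - 1] + (s[i - 1] == c)

    def distinct(i, j):
        """Number of different characters in the closed interval [i, j]."""
        return sum(1 for c in letters if prefix[c][j + 1] - prefix[c][i] > 0)

    best_start, best_len = 0, 0
    for i in range(n):
        # Binary search for the largest end such that s[i:end] is still valid.
        lo, hi = i, n
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if distinct(i, mid - 1) <= k:
                lo = mid
            else:
                hi = mid - 1
        if lo - i > best_len:
            best_start, best_len = i, lo - i
    return s[best_start:best_start + best_len]
```

The binary search works because the number of different characters in s[i:end] never decreases as end grows.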

Since your alphabet consists of only 26 letters, a linear time algorithm can be as follows:
Scan the string from left to right, at each step maintain two separate arrays startIndex[26], endIndex[26].
startIndex[i] = index of first instance of ('a' + i)th letter in the current active substring.
endIndex[i] = index of last instance of ('a' + i)th letter in the current active substring.
You can initialize the arrays elements to be any strange value (like -1) to check their validity during the algorithm.
Also, maintain the maximum length of sub-string obtained so far and the number of current active unique characters.
Algorithm:
1. i = 0.
   - Mark the startIndex and endIndex of S[0].
   - Initialize maxLength = 1
   - Initialize activeChars = 1.
2. for i = 1 to S.size()-1
   - if (S[i] != any of the activeChars) // can be done in O(26)
       if (activeChars == K)
           update maxLength if maxLength < currLength.
           remove an active char with least startIndex.
           add this new char to startIndex and endIndex
           currLength = i - min (remaining active startIndex) + 1
       else
           activeChars++;
           add this S[i] to startIndex and endIndex
           currLength++.
           update maxLength if maxLength < currLength.
     else
       update endIndex for S[i].
       currLength++.
       update maxLength if maxLength < currLength.
3. again update maxLength if maxLength < currLength.

I'll try to modify Abhishek Bansal's algorithm to keep linear complexity and patch the errors that could arise with repeated characters in the active group.
Scan the string from left to right, at each step maintain two separate arrays startIndex[26], endIndex[26], and a map where you associate each char (key) with all its occurrences in the active substring (value).
startIndex[i] = index of first instance of ('a' + i)th letter in the current active substring
endIndex[i] = index of last instance of ('a' + i)th letter in the current active substring.
map.get(i) = list of occurrences in the considered substring.
Algorithm:
1. i = 0.
   - Mark the startIndex and endIndex of S[0], add the occurrence of S[0] to the map.
   - Initialize maxLength = 1
   - Initialize activeChars = 1.
2. for i = 1 to S.size()-1
   - if (S[i] != any of the activeChars) // can be done in O(26)
       if (activeChars == K)
           update maxLength if maxLength < currLength.
           remove the active char with least endIndex.
           add this new char to startIndex and endIndex, and to the map with this occurrence
           remove from the map all the occurrences of all the chars that come before the removed char's endIndex
           update all the startIndex referring to the edited map
           currLength = i - min (remaining active startIndex) + 1
       else
           activeChars++;
           add this S[i] to startIndex and endIndex and to the map
           currLength++.
           update maxLength if maxLength < currLength.
     else
       update endIndex for S[i], add the occurrence to the map.
       currLength++.
       update maxLength if maxLength < currLength.
3. again update maxLength if maxLength < currLength.
I kept the startIndex and endIndex arrays for clarity's sake, but you could avoid the extra space and the extra work of updating them by using the first and the last element of the list of occurrences stored in the map for the key == char C.
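As a simplified illustration (not the exact bookkeeping described above), the eviction rule "remove the active char with least endIndex" can be sketched in Python while keeping only the last index of each active character. The function name is made up, and "at most K different characters" is assumed.

```python
def longest_k_substring_last_index(s, k):
    if k <= 0:
        return ""
    last = {}                   # last[c] = index of the latest c in the window
    start = 0
    best_start, best_len = 0, 0
    for i, ch in enumerate(s):
        if ch not in last and len(last) == k:
            # Evict the active character whose last occurrence is the oldest;
            # the window restarts right after that occurrence.
            victim = min(last, key=last.get)
            start = last.pop(victim) + 1
        last[ch] = i
        if i + 1 - start > best_len:
            best_start, best_len = start, i + 1 - start
    return s[best_start:best_start + best_len]
```

Every other active character has a later last occurrence than the evicted one, so it is still present in the shortened window. The min over at most 26 entries keeps each step O(26).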

Related

Find the maximum value of K such that sub-sequences A and B exist and should satisfy the mentioned conditions

Given a string S of length n. Choose an integer K and two non-empty sub-sequences A and B of length K such that it satisfies the following conditions:
A = B i.e. for each i the ith character in A is same as the ith character in B.
Let's denote the indices used to construct A as a1,a2,a3,...,an where ai belongs to S and B as b1,b2,b3,...,bn where bi belongs to S. If we denote the number of common indices in A and B by M then M + 1 <= K.
Find the maximum value of K such that it is possible to find the sub-sequences A and B which satisfies the above conditions.
Constraints:
0 < N <= 10^5
Things which I observed are:
The value of K = 0 if the number of characters in the given string are all distinct i.e S = abcd.
K = length of S - 1 if all the characters in the string are same i.e. S = aaaa.
The value of M cannot be equal to K because then M + 1 <= K will not be true, i.e. you cannot have sub-sequences A and B that satisfy A = B and a1 = b1, a2 = b2, a3 = b3, ..., an = bn.
If the string S is a palindrome then K = (total number of times a character is repeated in the string if the repetition count > 1) - 1, i.e. for S = tenet, t is repeated 2 times, e is repeated 2 times, total number of times a character is repeated = 4, K = 4 - 1 = 3.
I am having trouble designing the algorithm to solve the above problem.
Let me know in the comments if you need more clarification.
(Update: see O(n) answer.)
We can modify the classic longest common subsequence recurrence to take an extra parameter.
JavaScript code (not memoised) that I hope is self explanatory:
function f(s, i, j, haveUncommon){
  if (i < 0 || j < 0)
    return haveUncommon ? 0 : -Infinity
  if (s[i] == s[j]){
    if (haveUncommon){
      return 1 + f(s, i-1, j-1, true)
    } else if (i == j){
      return Math.max(
        1 + f(s, i-1, j-1, false),
        f(s, i-1, j, false),
        f(s, i, j-1, false)
      )
    } else {
      return 1 + f(s, i-1, j-1, true)
    }
  }
  return Math.max(
    f(s, i-1, j, haveUncommon),
    f(s, i, j-1, haveUncommon)
  )
}
var s = "aabcde"
console.log(f(s, s.length-1, s.length-1, false))
I believe we are just looking for the closest equal pair of characters since the only characters excluded from A and B would be one of the characters in the pair and any characters in between.
Here's O(n) in JavaScript:
function f(s){
  let map = {}
  let best = -1
  for (let i=0; i<s.length; i++){
    if (!map.hasOwnProperty(s[i])){
      map[s[i]] = i
      continue
    }
    best = Math.max(best, s.length - i + map[s[i]])
    map[s[i]] = i
  }
  return best
}
var strs = [
  "aabcde", // 5
  "aaababcd", // 7
  "aebgaseb", // 4
  "aefttfea",
  // aeft fea
  "abcddbca",
  // abcd bca,
  "a" // -1
]
for (let s of strs)
  console.log(`${ s }: ${ f(s) }`)
O(n) solution in Python3:
def compute_maximum_k(word):
    last_occurences = {}
    max_k = -1
    for i in range(len(word)):
        if(not last_occurences or not word[i] in last_occurences):
            last_occurences[word[i]] = i
            continue
        max_k = max(max_k,(len(word) - i) + last_occurences[word[i]])
        last_occurences[word[i]] = i
    return max_k
def main():
    words = ["aabcde","aaababcd","aebgaseb","aefttfea","abcddbca","a","acbdaadbca"]
    for word in words:
        print(compute_maximum_k(word))
if __name__ == "__main__":
    main()
A solution for the maximum length substring would be the following:
After building a Suffix Array you can derive the LCP Array. The maximum value in the LCP array corresponds to the K you are looking for. With linear-time construction algorithms, the overall complexity of both constructions is O(n).
A suffix array sorts all suffixes of your string S in ascending order. The longest common prefix array then stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in the sorted suffix array. Thus the maximum value in this array corresponds to the length of the longest substring that occurs at least twice in S.
For a nice example using the word "banana", check out the LCP Array Wikipage
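To make the idea concrete, here is a deliberately naive Python sketch. It sorts the suffixes directly and compares neighbours character by character, so it is nowhere near the O(n) constructions mentioned above; the function name is made up for illustration.

```python
def longest_repeated_substring(s):
    # Naive suffix array: sort the suffix start positions by the suffix text.
    suffixes = sorted(range(len(s)), key=lambda i: s[i:])
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two adjacent suffixes (one LCP entry).
        lcp = 0
        while a + lcp < len(s) and b + lcp < len(s) and s[a + lcp] == s[b + lcp]:
            lcp += 1
        if lcp > len(best):
            best = s[a:a + lcp]
    return best
```

For "banana" the sorted suffixes are a, ana, anana, banana, na, nana, and the largest LCP between neighbours is 3 (ana / anana).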
I deleted my previous answer as I don't think we need an LCS-like solution (LCS=longest Common Subsequence).
It is sufficient to find the couple of subsequences (A, B) that differ in one character and share all the others.
The code below finds the solution in O(N) time.
def function(word):
    dp = [0]*len(word)
    lastOccurences = {}
    for i in range(len(dp)-1, -1, -1):
        if i == len(dp)-1:
            dp[i] = 0
        else:
            if dp[i+1] > 0:
                dp[i] = 1 + dp[i+1]
            elif word[i] in lastOccurences:
                dp[i] = len(word)-lastOccurences[word[i]]
        lastOccurences[word[i]] = i
    return dp[0]
dp[i] is equal to 0 when all characters from i to the end of the string are different.
I will explain my code by an example.
For "abcack", there are two cases:
Either the first 'a' will be shared by the two subsequences A and B, in this case the solution will be = 1 + function("bcack")
Or 'a' will not be shared between A and B. In this case the result will be 1 + "ck". Why 1 + "ck" ? It's because we have already satisfied M+1<=K so just add all the remaining characters. In terms of indices, the substrings are [0, 4, 5] and [3, 4, 5].
We take the maximum between these two cases.
The reason I'm scanning right to left is to avoid an O(N) search for the current character in the rest of the string: I maintain the index of the last visited occurrence of each character in the dict lastOccurences.

In Place Run Length Encoding Algorithm

I encountered an interview question:
Given a input String: aaaaabcddddee, convert it to a5b1c1d4e2.
One extra constraint is, this needs to be done in-place, means no extra space(array) should be used.
It is guaranteed that the encoded string will always fit in the original string. In other words, string like abcde will not occur, since it will be encoded to a1b1c1d1e1 which occupies more space than the original string.
One hint interviewer gave me was to traverse the string once and find the space that is saved.
Still I am stuck as some times, without using extra variables, some values in the input string may be overwritten.
Any suggestions will be appreciated?
This is a good interview question.
Key Points
There are 2 key points:
Single character must be encoded as c1;
The encoded length will always be smaller than the original array.
From key point 1, we know each character requires at least 2 places to be encoded. That is to say, only a single character requires more space once encoded.
Simple Approach
From the key points, we notice that single characters cause us a lot of problems during the encoding, because they might not have enough room to hold the encoded string. So how about we leave them alone at first and compress the other characters first?
For example, we encode aaaaabcddddee from the back while leaving the single character first, we will get:
aaaaabcddddee
_____a5bcd4e2
Then we could safely start from the beginning and encoding the partly encoded sequence, given the key point 2 such that there will be enough spaces.
Analysis
Seems like we've got a solution, are we done? No. Consider this string:
aaa3dd11ee4ff666
The problem doesn't limit the range of characters, so we could use digit as well. In this case, if we still use the same approach, we will get this:
aaa3dd11ee4ff666
__a33d212e24f263
Ok, now tell me, how do you distinguish the run-length from those numbers in the original string?
Well, we need to try something else.
Let's define Encode Benefit (E) as: the length difference between the original consecutive character sequence and the encoded sequence.
For example, aa has E = 0, since aa will be encoded to a2, and they have no length difference; aaa has E = 1, since it will be encoded as a3, and the length difference between the encoded and the original is 1. Let's look at the single character case, what's its E? Yes, it's -1. From the definition, we could deduce the formula for E: E = ori_len - encoded_len.
Now let's go back to the problem. From key point 2, we know the encoded string will always be shorter than the original one. How do we use E to rephrase this key point?
Very simple: sigma(E_i) >= 0, where E_i is the Encode Benefit of the ith consecutive character substring.
For example, the sample you gave in your problem: aaaaabcddddee, can be broken down into 5 parts:
E(0) = 5 - 2 = 3 // aaaaa -> a5
E(1) = 1 - 2 = -1 // b -> b1
E(2) = 1 - 2 = -1 // c -> c1
E(3) = 4 - 2 = 2 // dddd -> d4
E(4) = 2 - 2 = 0 // ee -> e2
And the sigma will be: 3 + (-1) + (-1) + 2 + 0 = 3 > 0. This means there will be 3 spaces left after encoding.
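The breakdown above can be reproduced with a few lines of Python. This is just a sketch: encode_benefits is a name invented here, and it assumes the encoded length of a run is one character plus the decimal digits of its count (so 2 for every run shorter than 10).

```python
from itertools import groupby

def encode_benefits(s):
    """Return (char, run_length, E) for each run, with E = ori_len - encoded_len."""
    result = []
    for ch, run in groupby(s):
        ori_len = len(list(run))
        encoded_len = 1 + len(str(ori_len))   # the character plus its count
        result.append((ch, ori_len, ori_len - encoded_len))
    return result

benefits = encode_benefits("aaaaabcddddee")
total = sum(e for _, _, e in benefits)
```

Here benefits holds the five E values 3, -1, -1, 2, 0 and total is 3.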
However, from this example, we could see a potential problem: since we are doing summing, even if the final answer is bigger than 0, it's possible to get some negatives in the middle!
Yes, this is a problem, and it's quite serious. If the running sum of E falls below 0, we do not have enough space to encode the current character and will overwrite some characters after it.
But but but, why do we need to sum it from the first group? Why can't we start summing from somewhere in the middle to skip the negative part? Let's look at an example:
2 0 -1 -1 -1 1 3 -1
If we sum up from the beginning, we will fall below 0 after adding the third -1 at index 4 (0-based); if we sum up from index 5, loop back to index 0 when we reach the end, we have no problem.
Algorithm
The analysis gives us an insight on the algorithm:
Start from the beginning, calculate E of the current consecutive group, and add to the total E_total;
If E_total is still non-negative (>= 0), we are fine and we could safely proceed to the next group;
If the E_total falls below 0, we need to start over from the current position, i.e. clear E_total and proceed to the next position.
If we reach the end of the sequence and E_total is still non-negative, the last starting point is a good start! This step takes O(n) time. Usually we need to loop back and check again, but since key point 2, we will definitely have a valid answer, so we could safely stop here.
Then we could go back to the starting point and start traditional run-length encoding, after we reach the end we need to go back to the beginning of the sequence to finish the first part. The tricky part is, we need to make use the remaining spaces at the end of the string. After that, we need to do some shifting just in case we have some order issues, and remove any extra white spaces, then we are finally done :)
Therefore, we have a solution (the code is just a pseudo and hasn't been verified):
// find the position first
i = j = E_total = pos = 0;
while (i < s.length) {
    while (j < s.length && s[i] == s[j]) j++;
    E_total += calculate_encode_benefit(i, j);
    if (E_total < 0) {
        E_total = 0;
        pos = j;
    }
    i = j;
}
// do run length encoding as usual:
// start from pos, end with len(s) - 1, the first available place is pos
int last_available_pos = runlength(s, pos, len(s)-1, pos);
// a tricky part here is to make use of the remaining spaces from the end!!!
int fin_pos = runlength(s, 0, pos-1, last_available_pos);
// eliminate the white
eliminate(s, fin_pos, pos);
// update last_available_pos because of elimination
last_available_pos -= pos - fin_pos < 0 ? 0 : pos - fin_pos;
// rotate back
rotate(s, last_available_pos);
Complexity
We have 4 parts in the algorithm:
Find the starting place: O(n)
Run-Length-Encoding on the whole string: O(n)
White space elimination: O(n)
In place string rotation: O(n)
Therefore we have O(n) in total.
Visualization
Suppose we need to encode this string: abccdddefggggghhhhh
First step, we need to find the starting position:
Group 1: a -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 1;
Group 2: b -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 2;
Group 3: cc -> E_total += 0 -> E_total = 0 >= 0 -> proceed;
Group 4: ddd -> E_total += 1 -> E_total = 1 >= 0 -> proceed;
Group 5: e -> E_total += -1 -> E_total = 0 >= 0 -> proceed;
Group 6: f -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 9;
Group 7: ggggg -> E_total += 3 -> E_total = 3 >= 0 -> proceed;
Group 8: hhhhh -> E_total += 3 -> E_total = 6 >= 0 -> end;
So the start position will be 9:
v this is the starting point
abccdddefggggghhhhh
abccdddefg5h5______
^ last_available_pos, we need to make use of these remaining spaces
abccdddefg5h5a1b1c2
d3e1f1___g5h5a1b1c2
^^^ remove the white space
d3e1f1g5h5a1b1c2
^ last_available_pos, rotate
a1b1c2d3e1f1g5h5
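The first step of this walkthrough (finding the starting position) can be checked with a short Python sketch. It only covers the search for pos, not the encoding, elimination, or rotation. find_start is a hypothetical name, and E is again taken as run length minus (1 + number of digits in the count).

```python
from itertools import groupby

def find_start(s):
    """Return the index from which run-length encoding never runs out of room."""
    e_total = 0
    pos = 0          # current candidate starting position
    i = 0            # index just past the runs processed so far
    for _, run in groupby(s):
        n = len(list(run))
        e_total += n - (1 + len(str(n)))
        i += n
        if e_total < 0:
            # The running benefit went negative: restart after this run.
            e_total = 0
            pos = i
    return pos
```

For abccdddefggggghhhhh this returns 9, matching the groups listed above. If it returns len(s), the input breaks the guarantee that the encoding fits.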
Last Words
This question is not trivial, and actually glued several traditional coding interview questions together naturally. A suggested mind flow would be:
observe the pattern and figure out the key points;
realize the reason for insufficient space is because of encoding single character;
quantize the benefit/cost of encoding on each consecutive characters group (a.k.a Encoding Benefit);
use the quantization you proposed to explain the original statement;
figure out the algorithm to find a good starting point;
figure out how to do run-length-encoding with a good starting point;
realize you need to rotate the encoded string and eliminate the white spaces;
figure out the algorithm to do in place string rotation;
figure out the algorithm to do in place white space elimination.
To be honest, it's a bit challenging for an interviewee to come up with a solid algorithm in a short time, so your analysis flow really matters. Don't say nothing, show your mind flow, this helps the interviewer to find out your current stage.
Maybe just encode it normally, but if you see that your output index overtakes the input index, just skip the "1". Then when you finish go backwards and insert 1 after all letters without a count, shifting the rest of the string back. It is O(N^2) in the worst case (no repeating letters), so I assume there might be better solutions.
EDIT: it appears I missed the part that the final string always fits into the source. With that restriction, yeah, this is not the optimal solution.
EDIT2: an O(N) version of it would be during the first pass also compute the final compressed length (which in the general case might be more than the source), set pointer p1 to it, a pointer p2 to the compressed string with 1s omitted (p2 is thus <= p1), then just keep going backwards on both pointers, copying p2 to p1 and adding 1s when necessary (when this happens the difference between p2 and p1 will decrease)
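A possible Python rendering of the EDIT2 idea, offered as a sketch only. It works on a list of characters, assumes the input contains no digits (otherwise the backward pass cannot tell counts from letters), and grows the list when the full encoding is longer than the source. The name rle_in_place is made up.

```python
def rle_in_place(buf):
    """Run-length encode a list of characters in place (input assumed digit-free)."""
    n = len(buf)
    read = write = final_len = 0
    # Pass 1: compact form, writing a count only when it exceeds 1.
    while read < n:
        ch = buf[read]
        end = read
        while end < n and buf[end] == ch:
            end += 1
        digits = str(end - read)
        buf[write] = ch
        write += 1
        if end - read > 1:
            for d in digits:
                buf[write] = d
                write += 1
        final_len += 1 + len(digits)
        read = end
    # Resize to the final length (it may exceed n in the general case).
    if final_len > n:
        buf.extend([""] * (final_len - n))
    else:
        del buf[final_len:]
    # Pass 2: walk backwards, p2 over the compact form, p1 over the final form.
    p1, p2 = final_len - 1, write - 1
    while p2 >= 0:
        if buf[p2].isdigit():
            while buf[p2].isdigit():
                buf[p1] = buf[p2]
                p1 -= 1
                p2 -= 1
        else:
            buf[p1] = "1"       # this letter had its count omitted
            p1 -= 1
        buf[p1] = buf[p2]       # the letter itself
        p1 -= 1
        p2 -= 1
    return buf
```

The gap between p1 and p2 equals the number of omitted 1s still to the left, so the write position never overtakes unread data.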
O(n) and in place
set var = 0;
Loop from 1-length and find the first non-matching character.
The count would be the difference of the indices of both characters.
Let's run through an example
s = "wwwwaaadexxxxxxywww"
add a dummy letter to s
s = s + '#'
now our string becomes
s = "wwwwaaadexxxxxxywww#"
we'll come back to this step later.
j gives the first character of the string.
j = 0 // s[j] = w
now loop through 1 - length. The first non-matching character is 'a'
print(s[j], i - j) // i = 4, j = 0
j = i // j = 4, s[j] = a
Output: w4
i becomes the next non-matching character which would be 'd'
print(s[j], i - j) // i = 7, j = 4 => a3
j = i // j = 7, s[j] = d
Output: w4a3
.
. (Skipping to the second last)
.
j = 15, s[j] = y, i = 16, s[i] = w
print(s[j], i - j) => y1
Output: w4a3d1e1x6y1
Okay, so now we have reached the last group. Assume that we didn't add any dummy letter:
j = 16, s[j] = w and we cannot print its count
because we have no 'mismatching' character.
That's why we need to add a dummy letter.
Here's a C++ implementation
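#include <iostream>
#include <string>
using namespace std;
void compress(string s){
    int j = 0;
    s = s + '#'; // dummy letter so that the last run is printed too
    for(int i=1; i < s.length(); i++){
        if(s[i] != s[j]){
            cout << s[j] << i - j;
            j = i;
        }
    }
}
int main(){
    string s = "wwwwaaadexxxxxxywww";
    compress(s);
    return 0;
}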
Output: w4a3d1e1x6y1w3
If the use of the insert and erase string functions is allowed, then you can get the solution with this implementation.
#include<bits/stdc++.h>
using namespace std;
int dig(int n){
    int k=0;
    while(n){
        k++;
        n/=10;
    }
    return k;
}
void stringEncoding(string &n){
    for(int i=0;i<n.size();i++){
        int j=0; // length of the run starting at i
        while(i+j<n.size() && n[i]==n[i+j])j++;
        n.erase((i+1),(j-1)); // keep one copy of the character
        n.insert(i+1,to_string(j)); // write the count right after it
        i+=(dig(j)); // skip over the digits just inserted
    }
}
int main(){
    ios_base::sync_with_stdio(0), cin.tie(0);
    string n="kaaaabcddedddllllllllllllllllllllllp";
    stringEncoding(n);
    cout<<n;
}
This will give the following output : k1a4b1c1d2e1d3l22p1

How to determine string S can be made from string T by deleting some characters, but at most K successive characters

Sorry for the long title :)
In this problem, we have string S of length n, and string T of length m. We can check whether S is a subsequence of string T in time complexity O(n+m). It's really simple.
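For reference, the plain O(n+m) subsequence check mentioned here is the usual two-pointer scan. A small Python sketch (the function name is just for illustration):

```python
def is_subsequence(s, t):
    """Return True if s can be obtained from t by deleting characters."""
    i = 0                              # next character of s still to be matched
    for ch in t:
        if i < len(s) and s[i] == ch:
            i += 1
    return i == len(s)
```

It matches each character of s greedily at its earliest possible position in t, which is safe when gaps are unrestricted.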
I am curious about: what if we can delete at most K successive characters? For example, if K = 2, we can make "ab" from "accb", but not from "abcccb". I want to check if it's possible very fast.
I could only find obvious O(nm): check if it's possible for every suffix pairs in string S and string T. I thought maybe greedy algorithm could be possible, but if K = 2, the case S = "abc" and T = "ababbc" is a counterexample.
Is there any fast solution to solve this problem?
(Update: I've rewritten the opening of this answer to include a discussion of complexity and to discuss some alternative methods and potential risks.)
(Short answer, the only real improvement above the O(nm) approach that I can think of is to observe that we don't usually need to compute all n times m entries in the table. We can calculate only those cells we need. But in practice it might be very good, depending on the dataset.)
Clarify the problem: We have a string S of length n, and a string T of length m. The maximum allowed gap is k - this gap is to be enforced at the beginning and end of the string also. The gap is the number of unmatched characters between two matched characters - i.e. if the letters are adjacent, that is a gap of 0, not 1.
Imagine a table with n+1 rows and m+1 columns.
0 1 2 3 4 ... m
--------------------
0 | ? ? ? ? ? ?
1 | ? ? ? ? ? ?
2 | ? ? ? ? ? ?
3 | ? ? ? ? ? ?
... |
n | ? ? ? ? ? ?
At first, we could define that the entry in row r and column c is a binary flag that tells us whether the first r characters of S are a valid k-subsequence of the first c characters of T. (Don't worry yet how to compute these values, or even whether these values are useful, we just need to define them clearly first.)
However, this binary-flag table isn't very useful. It's not possible to easily calculate one cell as a function of nearby cells. Instead, we need each cell to store slightly more information. As well as recording whether the relevant strings are a valid subsequence, we need to record the number of consecutive unmatched characters at the end of our substring of T (the substring with c characters). For example, if the first r=2 characters of S are "ab" and the first c=3 characters of T are "abb", then there are two possible matches here: The first characters obviously match with each other, but the b can match with either of the latter b. Therefore, we have a choice of leaving one or zero unmatched bs at the end. Which one do we record in the table?
The answer is that, if a cell has multiple valid values, then we take the smallest one. It's logical that we want to make life as easy as possible for ourselves while matching the remainder of the string, and therefore that the smaller the gap at the end, the better. Be wary of other incorrect optimizations: we do not want to match as many characters as possible or as few characters as possible. That can backfire. But it is logical, for a given pair of strings S,T, to find the match (if there are any valid matches) that minimizes the gap at the end.
One other observation is that if the string S is much shorter than T, then it cannot match. This obviously depends on k as well. The first r characters of S can cover at most r + (r+1)k characters of T (r matches plus a gap of up to k before, between, and after them); if this is less than c, then we can easily mark (r,c) as -1.
(Any other optimization statements that can be made?)
We do not need to compute all the values in this table. The number of different possible states is k+3. They start off in an 'undefined' state (?). If a matching is not possible for the pair of (sub)strings, the state is -1. If a matching is possible, then the score in the cell will be a number between 0 and k inclusive, recording the smallest possible number of unmatched consecutive characters at the end. This gives us a total of k+3 states.
We are interested only in the entry in the bottom right of the table. If f(r,c) is the function that computes a particular cell, then we are interested only in f(n,m). The value for a particular cell can be computed as a function of the values nearby. We can build a recursive algorithm that takes r and c as input and performs the relevant calculations and lookups in term of the nearby values. If this function looks up f(r,c) and finds a ?, it will go ahead and compute it and then store the answer.
It is important to store the answer as the algorithm may query the same cell many times. But also, some cells will never be computed. We just start off attempting to calculate one cell (the bottom right) and just lookup-and-calculate-and-store as necessary.
This is the "obvious" O(nm) approach. The only optimization here is the observation that we don't need to calculate all the cells, therefore this should bring the complexity below O(nm). Of course, with really nasty datasets, you may end up calculating almost all of the cells! Therefore, it's difficult to put an official complexity estimate on this.
Finally, I should say how to compute a particular cell f(r,c):
If r==0 and c <= k, then f(r,c) = c. An empty string can match any string with up to k characters in it, and all c characters form the trailing gap.
If r==0 and c > k, then f(r,c) = -1. Too long for a match.
There are only two other ways a cell can have a successful state. We first try:
If S[r]==T[c] and f(r-1,c-1) != -1, then f(r,c) = 0. This is the best case - a match with no trailing gap.
If that didn't work, we try the next best thing. If f(r,c-1) != -1 and f(r,c-1) < k, then f(r,c) = f(r,c-1)+1.
If neither of those work, then f(r,c) = -1.
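Here is how I would sketch those rules in Python, filling the whole table bottom-up instead of computing cells on demand (so it is the plain O(nm) version). The function name is hypothetical. For the empty-S row it stores the gap c, but only whether that entry is -1 is ever consulted.

```python
def is_k_subsequence(s, t, k):
    """True if s is a subsequence of t with every gap (ends included) at most k."""
    n, m = len(s), len(t)
    FAIL = -1
    # f[r][c] = smallest trailing gap when s[:r] is matched inside t[:c], or FAIL
    f = [[FAIL] * (m + 1) for _ in range(n + 1)]
    for c in range(min(k, m) + 1):
        f[0][c] = c                      # empty S: the c characters are one gap
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            if s[r - 1] == t[c - 1] and f[r - 1][c - 1] != FAIL:
                f[r][c] = 0              # best case: match with no trailing gap
            elif f[r][c - 1] != FAIL and f[r][c - 1] < k:
                f[r][c] = f[r][c - 1] + 1
    return f[n][m] != FAIL
```

With K = 2 this accepts "ab" in "accb", rejects "ab" in "abcccb", and accepts the greedy counterexample S = "abc", T = "ababbc" (matching a, the second b, then c).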
The rest of this answer is my initial, Haskell-based approach. One advantage of it is that it 'understands' that it needn't compute every cell, only computing cells where necessary. But it could make the inefficiency of calculating one cell many times.
*Also note that the Haskell approach is effectively approaching the problem in a mirror image: it tries to build matches from the end substrings of S and T with a minimal leading bunch of unmatched characters. I don't have the time to rewrite it in its 'mirror image' form!
A recursive approach should work. We want a function that will take three arguments, int K, String S, and String T. However, we don't just want a boolean answer as to whether S is a valid k-subsequence of T.
For this recursive approach, if S is a valid k-subsequence, we also want to know about the best subsequence possible by returning how few characters from the start of T can be dropped. We want to find the 'best' subsequence. If a k-subsequence is not possible for S and T, then we return -1, but if it is possible then we want to return the smallest number of characters we can pull from T while retaining the k-subsequence property.
helloworld
  l    r d
This is a valid 4-subsequence, as the biggest gap has (at most) four characters (lowo). This is the best subsequence because it leaves a gap of just two characters at the start (he). Alternatively, here is another valid k-subsequence with the same strings, but it's not as good because it leaves a gap of three at the start:
helloworld
   l   r d
This is written in Haskell, but it should be easy enough to rewrite in any other language. I'll break it down in more detail below.
best :: Int -> String -> String -> Int
--      K      S         T         return
-- where len(S) <= len(T)
best k [] t_string -- empty S is a subsequence of anything!
    | length(t_string) <= k = length(t_string)
    | length(t_string) > k = -1
best k sss@(s:ss) [] = (-1) -- if T is empty, and S is non-empty, then no subsequence is possible
best k sss@(s:ss) tts@(t:ts) -- both are non-empty. Various possibilities:
    | s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts != -1, then we have the best outcome
    | best k sss ts /= -1
        && best k sss ts < k = 1+ (best k sss ts) -- this is the only other possibility for a valid k-subsequence
    | otherwise = -1 -- no more options left, return -1 for failure.
A line-by-line analysis:
(A comment in Haskell starts with --)
best :: Int -> String -> String -> Int
A function that takes an Int, and two Strings, and that returns an Int. The return value is to be -1 if a k-subsequence is not possible. Otherwise it will return an integer between 0 and K (inclusive) telling us the smallest possible gap at the start of T.
We simply deal with the cases in order.
best k [] t -- empty S is a subsequence of anything!
    | length(t) <= k = length(t)
    | length(t) > k = -1
Above, we handle the case where S is empty ([]). This is simple, as an empty string is always a valid subsequence. But to test if it is a valid k-subsequence, we must calculate the length of T.
best k sss@(s:ss) [] = (-1)
-- if T is empty, and S is non-empty, then no subsequence is possible
That comment explains it. This leaves us with the situations where both strings are non-empty:
best k sss@(s:ss) tts@(t:ts) -- both are non-empty. Various possibilities:
    | s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts != -1, then we have the best outcome
    | best k sss ts /= -1
        && best k sss ts < k = 1+ (best k sss ts) -- this is the only other possibility for a valid k-subsequence
    | otherwise = -1 -- no more options left, return -1 for failure.
tts@(t:ts) matches a non-empty string. The name of the string is tts. But there is also a convenient trick in Haskell to allow you to give names to the first letter in the string (t) and the remainder of the string (ts). Here ts should be read aloud as the plural of t - the s suffix here means 'plural'. We say we have a t and some ts, and together they make the full (non-empty) string.
That last block of code deals with the case where both strings are non-empty. The two strings are called sss and tts. But to save us the hassle of writing head sss and tail sss to access the first letter and the remainder of the string, we simply use @(s:ss) to tell the compiler to store those quantities in variables s and ss. If this were C++, for example, you'd get the same effect with char s = sss[0]; as the first line of your function.
The best situation is that the first characters match (s==t) and the remainders of the strings form a valid k-subsequence (best k ss ts /= -1). This allows us to return 0.
The only other possibility for success is if the current complete string (sss) is a valid k-subsequence of the remainder of the other string (ts). We add 1 to this and return, but make an exception if the gap would grow too big.
It's very important not to change the order of those last five lines. They are ordered in decreasing order of how 'good' the score is. We want to test for, and return, the very best possibilities first.
Naive recursive solution. Bonus := return value is the number of ways that the string can be matched.
#include <stdio.h>
#include <string.h>
unsigned skipneedle(char *haystack, char *needle, unsigned skipmax)
{
unsigned found,skipped;
// fprintf(stderr, "skipneedle(%s,%s,%u)\n", haystack, needle, skipmax);
if ( !*needle) return strlen(haystack) <= skipmax ? 1 : 0 ;
found = 0;
for (skipped=0; skipped <= skipmax ; haystack++,skipped++ ) {
if ( !*haystack ) break;
if ( *haystack == *needle) {
found += skipneedle(haystack+1, needle+1, skipmax);
}
}
return found;
}
int main(void)
{
char *ab = "ab";
char *test[] = {"ab" , "accb" , "abcccb" , "abcb", NULL}
, **cpp;
for (cpp = test; *cpp; cpp++ ) {
printf( "[%s,%s,%u]=%u \n"
, *cpp, ab, 2
, skipneedle(*cpp, ab, 2) );
}
return 0;
}
An O(p*n) solution where p = number of subsequences possible of S in T.
Scan the string T and maintain a list of possible subsequences of S that would have
1. Index of last character found and
2. Number of characters to be deleted found
Continue to update this list at each character of T.
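One possible reading of the description above, as a hedged sketch rather than the answerer's exact method: keep a set of partial matches, each stored as (number of characters of S matched so far, characters skipped since the last match), and update that set once per character of T. The name is_k_subsequence is made up for illustration, and the convention that the gaps before the first match and after the last match also count against k is an assumption borrowed from the other answers here.

```python
def is_k_subsequence(s, t, k):
    """True if s can be matched inside t with every gap (including the
    leading and trailing one) at most k characters long."""
    # Each state is (matched, gap): how many characters of s are matched
    # and how many characters of t were skipped since the last match.
    states = {(0, 0)}
    for ch in t:
        next_states = set()
        for matched, gap in states:
            if matched < len(s) and s[matched] == ch:
                next_states.add((matched + 1, 0))   # use ch as the next match
            if gap + 1 <= k:
                next_states.add((matched, gap + 1))  # skip ch
        states = next_states
        if not states:
            return False  # every partial match ran out of allowed gap
    return any(matched == len(s) for matched, _ in states)


if __name__ == "__main__":
    for text in ["ab", "accb", "abcccb", "abcb"]:
        print(text, is_k_subsequence("ab", text, 2))
```

The set never holds more than (len(s)+1)*(k+1) states, so this variant is bounded by O(|T|*|S|*k) work even when many partial matches are alive at once.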
Not sure if this is what you're asking for, but you could create a list of characters from each String, and search for instances of the one list in the other, then if(list2.length-K > list1.length) return false.
Following is a proposed algorithm : - O(|T|*k) average case
1> scan T and store character indices in Hash Table :-
eg. S = "abc" T = "ababbc"
Symbol table entries : -
a = 1 3
b = 2 4 5
c = 6
2.> as we know isValidSub(S,T) = isValidSub(S(0,j),T) && (isValidSub(S(j+1,N),T)||....isValidSub(S(j+K,T),T))
a.> we will use the bottom up approach to solve above problem
b.> we will maintain an valid array Valid(len(S)) where each record points to a Hash Table (Explained as we go along solving further)
c.> Start from the last element of S, Look up for the indices stored corresponding to the character in Symbol Table
eg. in above example S[last] = "c"
in Symbol Table c = 6
Now we put records like (5,6) , (4,6) ,.... (6-k-1,6) into Hash table at Valid(last)
Explanation : - as s(6,len(S)) is valid subsequence hence s(0,6-i) ++ s(6,len(S)) (where i is in range(1,k+1)) is also valid subsequence provided s(0,6-i) is valid subsequence.
3.> start filling up Valid Array from last to 0 element : -
a.> take an index from the hash table entry corresponding to S[j], where j is the current index of the Valid array we are analysing.
b.> Check whether the index is in Valid(j+1); if it is found, add (index-i,index), where i is in range(1,k+1), into the Valid(j) hash table
example:-
S = "abc" T = "ababbc"
iteration 1 :
j = len(S) = 3
S[3] = 'c'
Symbol Table : c = 6
add (5,6),(4,6),(3,6) as K = 2 in Valid(j)
Valid(3) = {(5,6),(4,6),(3,6)}
j = 2
iteration 2 :
S[j] = 'b'
Symbol table: b = 2 4 5
Look up 2 in Valid(3) => not found => skip
Look up 4 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4)}
Look up 5 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4),(4,5)}
j = 1
iteration 3:
S[j] = "a"
Symbol Table : a = 1 3
Look up 1 in Valid(2) => not found
Look up 3 in Valid(2) => found => stop as it is last iteration
END
as 3 is found in Valid(2), that means there exists a valid subsequence starting at index 3 in T
Start = 3
4.> Reconstruct the solution moving downwards in Valid Array :-
example :
Start = 3
Look up 3 in Valid(2) => found (3,4)
Look up 4 in Valid(3) => found (4,6)
END
reconstructed solution (3,4,6) which is indeed valid subsequence
Remember (3,5,6) can also be a solution if we had added (3,5) instead of (3,4) in that iteration
Analysis of Time complexity & Space complexity : -
Time Complexity :
Step 1 : Scan T = O(|T|)
Step 2 : fill up all Valid entries O(|T|*k), since a HashTable lookup is approx. O(1)
Step 3 : Reconstruct solution O(|S|)
Overall average case Time : O(|T|*k)
Space Complexity:
Symbol table = O(|T|+|S|)
Valid table = O(|T|*k) can be improved with optimizations
Overall space = O(|T|*k)
Java Implementation: -
import java.util.ArrayList;
import java.util.HashMap;
public class Subsequence {
private ArrayList[] SymbolTable = null;
private HashMap[] Valid = null;
private String S;
private String T;
public ArrayList<Integer> getSubsequence(String S,String T,int K) {
this.S = S;
this.T = T;
if(S.length()>T.length())
return(null);
S = S.toLowerCase();
T = T.toLowerCase();
SymbolTable = new ArrayList[26];
for(int i=0;i<26;i++)
SymbolTable[i] = new ArrayList<Integer>();
char[] s1 = T.toCharArray();
char[] s2 = S.toCharArray();
//Calculate Symbol table
for(int i=0;i<T.length();i++) {
SymbolTable[s1[i]-'a'].add(i);
}
/* for(int j=0;j<26;j++) {
System.out.println(SymbolTable[j]);
}
*/
Valid = new HashMap[S.length()];
for(int i=0;i<S.length();i++)
Valid[i] = new HashMap<Integer,Integer >();
int Start = -1;
for(int j = S.length()-1;j>=0;j--) {
int index = s2[j] - 'a';
//System.out.println(index);
for(int m = 0;m<SymbolTable[index].size();m++) {
if(j==S.length()-1||Valid[j+1].containsKey(SymbolTable[index].get(m))) {
int value = (Integer)SymbolTable[index].get(m);
if(j==0) {
Start = value;
break;
}
for(int t=1;t<=K+1;t++) {
Valid[j].put(value-t, value);
}
}
}
}
/* for(int j=0;j<S.length();j++) {
System.out.println(Valid[j]);
}
*/
if(Start != -1) { //Solution exists
ArrayList subseq = new ArrayList<Integer>();
subseq.add(Start);
int prev = Start;
int next;
// Reconstruct solution
for(int i=1;i<S.length();i++) {
next = (Integer)Valid[i].get(prev);
subseq.add(next);
prev = next;
}
return(subseq);
}
return(null);
}
public static void main(String[] args) {
Subsequence sq = new Subsequence();
System.out.println(sq.getSubsequence("abc","ababbc", 2));
}
}
Consider a recursive approach: let int f(int i, int j) denote the minimum possible gap at the beginning for S[i...n] matching T[j...m]. f returns -1 if such matching does not exist. Here's the implementation of f:
int f(int i, int j){
if(j == m){
if(i == n)
return 0;
else
return -1;
}
if(i == n){
return m - j;
}
if(S[i] == T[j]){
int tmp = f(i + 1, j + 1);
if(tmp >= 0 && tmp <= k)
return 0;
}
int rest = f(i, j + 1);
return rest < 0 ? -1 : rest + 1; // keep -1 as the failure marker instead of turning it into 0
}
If we convert this recursive approach to a dynamic programming approach, then we can have a time complexity of O(nm).
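As a rough illustration of that conversion, here is a memoized Python sketch of the same recurrence. The name min_start_gap is hypothetical, functools.lru_cache stands in for an explicit DP table, and the sketch keeps -1 as the failure marker instead of adding 1 to it. The final check against k at the top level is an assumption about how the caller would use f.

```python
from functools import lru_cache


def min_start_gap(s, t, k):
    """Smallest leading gap for matching s inside t with all gaps <= k,
    or -1 if no such match exists."""
    n, m = len(s), len(t)

    @lru_cache(maxsize=None)
    def f(i, j):
        if j == m:                      # t is used up
            return 0 if i == n else -1
        if i == n:                      # s is used up, the rest of t is a trailing gap
            return m - j
        if s[i] == t[j]:
            rest = f(i + 1, j + 1)
            if 0 <= rest <= k:          # the gap after this match is acceptable
                return 0
        rest = f(i, j + 1)              # skip t[j]
        return -1 if rest < 0 else rest + 1

    gap = f(0, 0)
    return gap if 0 <= gap <= k else -1


if __name__ == "__main__":
    print(min_start_gap("ab", "ccab", 2))
    print(min_start_gap("ab", "abcccb", 2))
```

Each (i, j) pair is computed once, which gives the O(nm) bound mentioned above. For long inputs the recursion depth would need an iterative table instead.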
Here's an implementation that usually* runs in O(N) and takes O(m) space, where m is length(S).
It uses the idea of a surveyor's chain:
Imagine a series of poles linked by chains of length k.
Anchor the first pole at the beginning of the string.
Now carry the next pole forward until you find a character match.
Place that pole. If there is slack, move on to the next character;
else the previous pole has been dragged forward, and you need to go back
and move it to the next nearest match.
Repeat until you reach the end or run out of slack.
#include <stdlib.h>
#include <string.h>
typedef struct chain_t{
int slack;
int pole;
} chainlink;
int subsequence_k_impl(char* t, char* s, int k, chainlink* link, int len)
{
char* match=s;
int extra = k; //total slack in the chain
//for all chars to match, including final null
while (match<=s+len){
//advance until we find spot for this post or run out of chain
while (t[link->pole] && t[link->pole]!=*match ){
link->pole++; link->slack--;
if (--extra<0) return 0; //no more slack, can't do it.
}
//if we ran out of ground, it's no good
if (t[link->pole] != *match) return 0;
//if this link has slack, go to next pole
if (link->slack>=0) {
link++; match++;
//if next pole was already placed,
while (link[-1].pole < link->pole) {
//recalc slack and advance again
extra += link->slack = k-(link->pole-link[-1].pole-1);
link++; match++;
}
//if not done
if (match<=s+len){
//current pole is out of order (or unplaced), move it next to prev one
link->pole = link[-1].pole+1;
extra+= link->slack = k;
}
}
//else drag the previous pole forward to the limit of the chain.
else if (match>=s) {
int drag = (link->pole - link[-1].pole -1)- k;
link--;match--;
link->pole+=drag;
link->slack-=drag;
}
}
//all poles planted. good match
return 1;
}
int subsequence_k(char* t, char* s, int k)
{
int l = strlen(s);
if (strlen(t)>(l+1)*(k+1))
return -1; //easy exit
else {
chainlink* chain = calloc(sizeof(chainlink),l+2);
chain[0].pole=-1; //first pole is anchored before the string
chain[0].slack=0;
chain[1].pole=0; //start searching at first char
chain[1].slack=k;
l = subsequence_k_impl(t,s,k,chain+1,l);
l=l?chain[1].pole:-1; //pos of first match or -1;
free(chain);
}
return l;
}
* I'm not sure of the big-O. I initially thought it was something like O(km+N). In testing, it averages less than 2N for good matches and less than N for failed matches.
...but there is a strange degenerate case. For random strings selected from an alphabet of size A, it gets much slower when k = 2A+1. Even in this case it's better than O(Nm), and the performance returns to O(N) when k is increased or decreased slightly. Gist Here if anyone is curious.

Non increasing and Non Decreasing Subsequence

Finding the longest non-decreasing subsequence is a well-known problem.
But this question is a slight variant of it. In this problem we have to find the length of the longest subsequence which comprises 2 disjoint sequences: 1. non-decreasing 2. non-increasing.
E.g. in the string "aabcazcczba" the longest such sequence is aabczcczba, which is made up of 2 disjoint subsequences: aabcZccZBA (capital letters show the non-increasing sequence).
My algorithm is
length = 0
For i = 0 to length of given string S
let s' = find the longest non-decreasing subsequence starting at position i
let s" = find the longest non-increasing subsequence from S-s'.
if (length of s' + length of s") > length
length = (length of s' + length of s")
But I am not sure whether this would give the correct answer or not. Can you find a bug in this algorithm, and if there is a bug, also suggest a correct algorithm? I also need to optimize the solution. My algorithm would take roughly O(n^4) steps.
Your solution is definitely incorrect. Eg. addddbc. The longest non-decreasing sequence is adddd, but that would never give you a non-increasing sequence. The optimal solution is abc and dddd ( or ab ddddc, or ac ddddb).
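For short strings, a claim like this can be checked by exhaustive search. The following is only a verification sketch (exponential time, so useless as a real solution), and the name brute_force_best is made up for illustration: every character is labelled as belonging to the non-decreasing part, the non-increasing part, or neither.

```python
from itertools import product


def brute_force_best(s):
    """Try all 3^len(s) labelings; return the best combined length."""
    best = 0
    # 'i' = non-decreasing part, 'd' = non-increasing part, 's' = skipped
    for labels in product("ids", repeat=len(s)):
        inc = [c for c, lab in zip(s, labels) if lab == "i"]
        dec = [c for c, lab in zip(s, labels) if lab == "d"]
        if inc == sorted(inc) and dec == sorted(dec, reverse=True):
            best = max(best, len(inc) + len(dec))
    return best


if __name__ == "__main__":
    print(brute_force_best("addddbc"))  # 7: e.g. abc plus dddd uses every letter
```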
One solution is to use dynamic programming.
F(i, x, a, b) = 1, if there is a non-decreasing and non-increasing combo from first i letters of x ( x[:i]) such that last letter of non-decreasing part is a, and non-increasing part is b. Both of these letters equal to NULL if the corresponding sub-sequence is empty.
Otherwise F(i, x, a, b) = 0.
F(i+1,x,x[i+1],b) = 1 if there exists a and b such that
a<=x[i+1] or a=NULL and F(i,x,a,b)=1. 0 otherwise.
F(i+1,x,a,x[i+1]) = 1 if there exists a and b such that
b>=x[i+1] or b=NULL and F(i,x,a,b)=1. 0 otherwise.
Initialize F(0,x,NULL,NULL)=1 and iterate from i=1..n
As you can see, you can get F(i+1, x, a, b) from F(i, x, a, b). Complexity: Linear in length, polynomial in size of the alphabet.
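A hedged Python sketch of this kind of DP follows. It differs from the 0/1 formulation above in one respect, which is an assumption on my part: each state (a, b) stores the longest combined length seen so far, so characters may be skipped and the maximum can be read off directly. The name longest_inc_dec is hypothetical.

```python
def longest_inc_dec(s):
    """Length of the longest subsequence of s that splits into a
    non-decreasing part and a non-increasing part."""
    # best[(a, b)]: longest combined length where the non-decreasing part
    # ends in a and the non-increasing part ends in b (None = empty part).
    best = {(None, None): 0}
    for c in s:
        updated = dict(best)  # skipping c keeps every old state
        for (a, b), length in best.items():
            if a is None or a <= c:  # c can extend the non-decreasing part
                key = (c, b)
                if updated.get(key, -1) < length + 1:
                    updated[key] = length + 1
            if b is None or b >= c:  # c can extend the non-increasing part
                key = (a, c)
                if updated.get(key, -1) < length + 1:
                    updated[key] = length + 1
        best = updated
    return max(best.values())


if __name__ == "__main__":
    print(longest_inc_dec("aabcazcczba"))  # 10
    print(longest_inc_dec("addddbc"))      # 7
```

With a 26-letter alphabet there are at most 27*27 states, so the work per character is bounded by a constant that depends only on the alphabet size.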
I got the answer, and here is how it works, thanks to @ElKamina
maintain a table of 27X27 dimension. 27 = (1 Null character + 26 (alphabets))
table[i][j] denotes the length of the sub sequence whose non decreasing subsequence has last character 'i' and non increasing subsequence has last character 'j' (0th index denote null character and kth index denotes character 'k')
for i = 0 to length of string S
//subsequence whose non decreasing subsequence's last character is smaller than S[i], find such a subsequence of maximum length. Now S[i] can be part of this subsequence's non-decreasing part.
int lim = S[i] - 'a' + 1;
for(int k=0; k<27; k++){
if(lim == k) continue;
int tmax = 0;
for(int j=0; j<=lim; j++){
if(table[k][j] > tmax) tmax = table[k][j];
}
if(k == 0 && tmax == 0) table[0][lim] = 1;
else if (tmax != 0) table[k][lim] = tmax + 1;
}
//Similarly for non-increasing subsequence
Time complexity is O(lengthOf(S)*27*27) and space complexity is O(27*27)

How to find all combinations of a multiset in a string in linear time?

I am given a bag B (multiset) of characters with the size m and a string text S of size n. Is it possible to find all substrings that can be created by B (4!=24 combinations) in S in linear time O(n)?
Example:
S = abdcdbcdadcdcbbcadc (n=19)
B = {b, c, c, d} (m=4)
Result: {cdbc (Position 3), cdcb (Position 10)}
The fastest solution I found is to keep a counter for each character and compare it with the Bag in each step, thus the runtime is O(n*m). Algorithm can be shown if needed.
There is a way to do it in O(n), assuming we're only interested in substrings of length m (otherwise it's impossible, because for the bag that has all characters in the string, you'd have to return all substrings of s, which means a O(n^2) result that can't be computed in O(n)).
The algorithm is as follows:
Convert the bag to a histogram:
hist = []
for c in B do:
hist[c] = hist[c] + 1
Initialize a running histogram that we're going to modify (histrunsum is the total count of characters in histrun):
histrun = []
histrunsum = 0
We need two operations: add a character to the histogram and remove it. They operate as follows:
add(c):
if hist[c] > 0 and histrun[c] < hist[c] then:
histrun[c] = histrun[c] + 1
histrunsum = histrunsum + 1
remove(c):
if histrun[c] > 0 then:
histrun[c] = histrun[c] - 1
histrunsum = histrunsum - 1
Essentially, histrun captures the amount of characters that are present in B in current substring. If histrun is equal to hist, our substring has the same characters as B. histrun is equal to hist iff histrunsum is equal to length of B.
Now add first m characters to histrun; if histrunsum is equal to length of B; emit first substring; now, until we reach the end of string, remove the first character of the current substring and add the next character.
add, remove are O(1) since hist and histrun are arrays; checking if hist is equal to histrun is done by comparing histrunsum to length(B), so it's also O(1). Loop iteration count is O(n), the resulting running time is O(n).
Thanks for the answer. The add() and remove() methods have to be changed to make the algorithm work correctly.
add(c):
if hist[c] > 0 and histrun[c] < hist[c] then
histrunsum++
else
histrunsum--
histrun[c] = histrun[c] + 1
remove(c):
if histrun[c] > hist[c] then
histrunsum++
else
histrunsum--
histrun[c] = histrun[c] - 1
Explanation:
histrunsum can be seen as a score of how identical both multisets are.
add(c): when there are fewer occurrences of a char in the histrun multiset than in the hist multiset, the additional occurrence of that char has to be "rewarded" since the histrun multiset is getting closer to the hist multiset. If histrun already holds at least as many of that char as hist, an additional char is counted negatively.
remove(c): like add(c), where the removal of a char is weighted positively when its count in the histrun multiset is greater than in the hist multiset.
Sample Code (PHP):
function multisetSubstrings($sequence, $mset)
{
$multiSet = array();
$substringLength = 0;
foreach ($mset as $char)
{
$multiSet[$char]++;
$substringLength++;
}
$sum = 0;
$currentSet = array();
$result = array();
for ($i=0;$i<strlen($sequence);$i++)
{
if ($i>=$substringLength)
{
$c = $sequence[$i-$substringLength];
if ($currentSet[$c] > $multiSet[$c])
$sum++;
else
$sum--;
$currentSet[$c]--;
}
$c = $sequence[$i];
if ($currentSet[$c] < $multiSet[$c])
$sum++;
else
$sum--;
$currentSet[$c]++;
echo $sum."<br>";
if ($sum==$substringLength)
$result[] = $i+1-$substringLength;
}
return $result;
}
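For comparison, the same reward/penalty scoring can be sketched in Python. This is an illustrative port under the assumptions of the explanation above, not a tested replacement for the PHP. The name multiset_substrings and the early return for an empty bag are my own choices.

```python
from collections import Counter


def multiset_substrings(text, bag):
    """Start indices of every window of len(bag) characters in text
    that uses exactly the characters of bag."""
    m = len(bag)
    if m == 0:
        return []
    hist = Counter(bag)      # target counts
    histrun = Counter()      # counts inside the current window
    score = 0                # reaches m exactly when histrun == hist
    result = []
    for i, c in enumerate(text):
        if i >= m:
            old = text[i - m]
            # dropping a surplus character is rewarded, dropping a needed one is penalized
            score += 1 if histrun[old] > hist[old] else -1
            histrun[old] -= 1
        # adding a still-missing character is rewarded, adding a surplus one is penalized
        score += 1 if histrun[c] < hist[c] else -1
        histrun[c] += 1
        if score == m:
            result.append(i + 1 - m)
    return result


if __name__ == "__main__":
    print(multiset_substrings("abdcdbcdadcdcbbcadc", "bccd"))  # [3, 10]
```

Each character is added once and removed at most once, so the loop does O(1) work per position.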
Use hashing. For each character in the multiset, assign a UNIQUE prime number. Compute the hash of any string by multiplying together the primes associated with its characters, each taken as many times as that character occurs.
Example : CATTA. Let C = 2, A=3, T = 5. Hash = 2*3*5*5*3 = 450
Hash the multiset ( treat it as a string ). Now go through the input string, and compute the hash of each substring of length k ( where k is the number of characters in the multiset ). Check if this hash matches the multiset hash. If yes, then it is one such occurrence.
The hashes can be computed very easily in linear time as follows :
Let multiset = { A, A, B, C }, A=2, B=3, C=5.
Multiset hash = 2*2*3*5 = 60
Let text = CABBAACCA
(i) CABB = 5*2*3*3 = 90
(ii) Now, the next letter is A, and the letter discarded is the first one, C. So the new hash = ( 90/5 )*2 = 36
(iii) Now, A is discarded, and A is also added, so new hash = ( 36/2 ) * 2= 36
(iv) Now B is discarded, and C is added, so hash = ( 36/3 ) * 5 = 60 = multiset hash. Thus we have found one such required occurrence - BAAC
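This procedure takes O( n ) multiplications and divisions. Note that the product grows quickly, so with fixed-width integers it can overflow for larger multisets.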
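The rolling product from the worked example can be sketched in Python as below. Two things here are assumptions rather than part of the answer: primes are assigned to every distinct character of the text and the multiset in sorted order (so characters outside the multiset are handled too), and the helper names first_primes and prime_hash_matches are invented for illustration. Python integers are arbitrary precision, so overflow is not a concern in this sketch.

```python
def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def prime_hash_matches(text, bag):
    """Start indices of windows of text whose prime product equals that of bag."""
    m = len(bag)
    if m == 0 or m > len(text):
        return []
    alphabet = sorted(set(text) | set(bag))
    prime_of = dict(zip(alphabet, first_primes(len(alphabet))))
    target = 1
    for c in bag:
        target *= prime_of[c]
    window = 1
    for c in text[:m]:
        window *= prime_of[c]
    result = [0] if window == target else []
    for i in range(m, len(text)):
        # divide out the character leaving the window, multiply in the new one
        window = window // prime_of[text[i - m]] * prime_of[text[i]]
        if window == target:
            result.append(i - m + 1)
    return result


if __name__ == "__main__":
    print(prime_hash_matches("CABBAACCA", "AABC"))  # [3], the window BAAC
```

Because prime factorizations are unique, equal products mean equal multisets, so this particular hash has no false positives.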

Resources