In Place Run Length Encoding Algorithm - string

I encountered an interview question:
Given an input string such as aaaaabcddddee, convert it to a5b1c1d4e2.
One extra constraint is that this must be done in place, meaning no extra space (such as an auxiliary array) may be used.
It is guaranteed that the encoded string will always fit in the original string. In other words, a string like abcde will not occur, since it would be encoded to a1b1c1d1e1, which occupies more space than the original string.
One hint the interviewer gave me was to traverse the string once and find the space that is saved.
Still, I am stuck: without using extra variables, some values in the input string may be overwritten before they are read.
Any suggestions will be appreciated.

This is a good interview question.
Key Points
There are 2 key points:
1. A single character must be encoded as c1;
2. The encoded string is never longer than the original string.
From key point 1, we know each group requires at least 2 places once encoded. That is to say, single characters are the only groups that need more space after encoding than before.
Simple Approach
From the key points, we notice that single characters cause most of the trouble during encoding, because they might not have enough room to hold their encoded form. So how about we leave them alone at first and compress the other characters first?
For example, if we encode aaaaabcddddee from the back while leaving the single characters untouched, we get:
aaaaabcddddee
_____a5bcd4e2
Then we could safely start from the beginning and encode the partly encoded sequence; key point 2 guarantees there will be enough space.
Analysis
Seems like we've got a solution, are we done? No. Consider this string:
aaa3dd11ee4ff666
The problem doesn't limit the range of characters, so we could use digit as well. In this case, if we still use the same approach, we will get this:
aaa3dd11ee4ff666
__a33d212e24f263
Ok, now tell me, how do you distinguish the run-length from those numbers in the original string?
Well, we need to try something else.
Let's define Encode Benefit (E) as the length difference between the original consecutive character sequence and its encoded form.
For example, aa has E = 0, since aa will be encoded to a2, and they have no length difference; aaa has E = 1, since it will be encoded as a3, and the length difference between the original and the encoded form is 1. Let's look at the single character case, what's its E? Yes, it's -1. From the definition, we can deduce the formula for E: E = ori_len - encoded_len.
Now let's go back to the problem. From key point 2, we know the encoded string will never be longer than the original one. How do we use E to rephrase this key point?
Very simple: sigma(E_i) >= 0, where E_i is the Encode Benefit of the ith consecutive character substring.
For example, the sample you gave in your problem: aaaaabcddddee, can be broken down into 5 parts:
E(0) = 5 - 2 = 3 // aaaaa -> a5
E(1) = 1 - 2 = -1 // b -> b1
E(2) = 1 - 2 = -1 // c -> c1
E(3) = 4 - 2 = 2 // dddd -> d4
E(4) = 2 - 2 = 0 // ee -> e2
And the sigma will be: 3 + (-1) + (-1) + 2 + 0 = 3 > 0. This means there will be 3 spaces left after encoding.
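The breakdown above can be reproduced with a few lines of Python. This is only a sketch: the function name encode_benefits is made up for illustration, and it assumes a run of length n is encoded as the character followed by the decimal digits of n.

```python
from itertools import groupby

def encode_benefits(s):
    """Return E = ori_len - encoded_len for every run of equal characters in s."""
    benefits = []
    for _, group in groupby(s):
        run = len(list(group))
        encoded_len = 1 + len(str(run))  # one character plus the digits of the count
        benefits.append(run - encoded_len)
    return benefits

print(encode_benefits("aaaaabcddddee"))       # [3, -1, -1, 2, 0]
print(sum(encode_benefits("aaaaabcddddee")))  # 3
```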
However, from this example, we can see a potential problem: since we are summing, even if the final total is non-negative, the running sum can still go negative somewhere in the middle!
Yes, this is a problem, and it's quite serious. If the running sum of E falls below 0, we do not have enough space to encode the current group and will overwrite some characters after it.
But but but, why do we need to sum from the first group? Why can't we start summing from somewhere in the middle to skip the negative part? Let's look at an example:
2 0 -1 -1 -1 1 3 -1
If we sum up from the beginning, we will fall below 0 after adding the third -1 at index 4 (0-based); if we sum up from index 5, loop back to index 0 when we reach the end, we have no problem.
Algorithm
The analysis gives us an insight on the algorithm:
Start from the beginning, calculate E of the current consecutive group, and add to the total E_total;
If E_total is still non-negative (>= 0), we are fine and we could safely proceed to the next group;
If E_total falls below 0, we need to start over after the current group, i.e. clear E_total, remember the start of the next group as the new starting point, and proceed.
If we reach the end of the sequence and E_total is still non-negative, the last starting point is a good start! This step takes O(n) time. Usually we would need to loop back and check again, but because of key point 2 we will definitely have a valid answer, so we can safely stop here.
Then we go back to the starting point and run traditional run-length encoding; after we reach the end, we go back to the beginning of the sequence to finish the first part. The tricky part is that we need to make use of the remaining spaces at the end of the string. After that, we need to do some shifting in case the pieces are out of order, and remove any extra white space; then we are finally done :)
Therefore, we have a solution (this is only pseudocode and hasn't been verified):
// find the starting position first
i = j = E_total = pos = 0;
while (i < s.length) {
    // advance j to the end of the run that starts at i
    while (j < s.length && s[j] == s[i]) j++;
    E_total += calculate_encode_benefit(i, j);
    if (E_total < 0) {
        // not enough room so far: restart the sum after this group
        E_total = 0;
        pos = j;
    }
    i = j;
}
// do run length encoding as usual:
// start from pos, end with len(s) - 1, the first available place is pos
int last_available_pos = runlength(s, pos, len(s)-1, pos);
// a tricky part here is to make use of the remaining spaces from the end!!!
int fin_pos = runlength(s, 0, pos-1, last_available_pos);
// eliminate the white space
eliminate(s, fin_pos, pos);
// update last_available_pos because of elimination
last_available_pos -= pos - fin_pos < 0 ? 0 : pos - fin_pos;
// rotate back
rotate(s, last_available_pos);
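The first phase of the pseudocode (finding the starting position) is small enough to sketch in runnable form. The Python below is an illustration only: find_start_position is a hypothetical name, the benefit of a run is computed inline as run length minus (1 + number of digits of the count), and the encoding and rotation phases are not covered.

```python
def find_start_position(s):
    """Return the index from which encoding can start without running out of room."""
    total = 0  # running sum of encode benefits since the last restart
    pos = 0    # candidate starting position
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                       # j ends one past the current run
        run = j - i
        total += run - (1 + len(str(run)))
        if total < 0:                    # ran out of room: restart after this run
            total = 0
            pos = j
        i = j
    return pos

print(find_start_position("abccdddefggggghhhhh"))  # 9
```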
Complexity
We have 4 parts in the algorithm:
Find the starting place: O(n)
Run-Length-Encoding on the whole string: O(n)
White space elimination: O(n)
In place string rotation: O(n)
Therefore we have O(n) in total.
Visualization
Suppose we need to encode this string: abccdddefggggghhhhh
First step, we need to find the starting position:
Group 1: a -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 1;
Group 2: b -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 2;
Group 3: cc -> E_total += 0 -> E_total = 0 >= 0 -> proceed;
Group 4: ddd -> E_total += 1 -> E_total = 1 >= 0 -> proceed;
Group 5: e -> E_total += -1 -> E_total = 0 >= 0 -> proceed;
Group 6: f -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 9;
Group 7: ggggg -> E_total += 3 -> E_total = 3 >= 0 -> proceed;
Group 8: hhhhh -> E_total += 3 -> E_total = 6 >= 0 -> end;
So the start position will be 9:
         v this is the starting point
abccdddefggggghhhhh
abccdddefg5h5______
             ^ last_available_pos, we need to make use of these remaining spaces
abccdddefg5h5a1b1c2
d3e1f1___g5h5a1b1c2
      ^^^ remove the white space
d3e1f1g5h5a1b1c2
          ^ last_available_pos, rotate
a1b1c2d3e1f1g5h5
Last Words
This question is not trivial; it naturally glues several traditional coding interview questions together. A suggested thought process would be:
observe the pattern and figure out the key points;
realize that the reason for insufficient space is the encoding of single characters;
quantify the benefit/cost of encoding each group of consecutive characters (a.k.a. Encode Benefit);
use that quantity to restate the original guarantee;
figure out the algorithm to find a good starting point;
figure out how to do run-length-encoding with a good starting point;
realize you need to rotate the encoded string and eliminate the white spaces;
figure out the algorithm to do in place string rotation;
figure out the algorithm to do in place white space elimination.
To be honest, it's a bit challenging for an interviewee to come up with a solid algorithm in a short time, so your analysis really matters. Don't stay silent; show your thought process, as this helps the interviewer see how far you have got.

Maybe just encode it normally, but if you see that your output index overtakes the input index, just skip the "1". Then, when you finish, go backwards and insert a 1 after every letter without a count, shifting the rest of the string back. It is O(N^2) in the worst case (no repeating letters), so I assume there might be better solutions.
EDIT: it appears I missed the part that the final string always fits into the source. With that restriction, yeah, this is not the optimal solution.
EDIT2: an O(N) version of it would be to also compute the final compressed length during the first pass (which in the general case might be more than the source). Set a pointer p1 to the end of that final length and a pointer p2 to the end of the compressed string with 1s omitted (p2 is thus <= p1), then keep going backwards on both pointers, copying from p2 to p1 and adding 1s when necessary (each time this happens, the difference between p1 and p2 decreases).
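Here is a rough Python sketch of the two-pass idea from EDIT2, working on a list of characters so that it can be modified in place. It rests on assumptions that are not in the question: the input contains no digit characters (otherwise counts and data cannot be told apart), the name rle_in_place is invented for this example, and str(run) is used for brevity even though it creates a small temporary string.

```python
def rle_in_place(buf):
    """Run-length encode the list of characters buf in place; return the encoded length."""
    n = len(buf)
    write = 0      # end of the compact form (counts of 1 are omitted)
    final_len = 0  # length of the full encoding (counts of 1 included)
    i = 0
    while i < n:
        j = i
        while j < n and buf[j] == buf[i]:
            j += 1
        run, ch = j - i, buf[i]
        digits = str(run)
        final_len += 1 + len(digits)
        buf[write] = ch                  # the compact form never overtakes the read index
        write += 1
        if run > 1:
            for d in digits:
                buf[write] = d
                write += 1
        i = j
    if final_len > n:
        raise ValueError("encoded form does not fit in the buffer")
    src, dst = write - 1, final_len - 1  # p2 and p1 in the description above
    while src >= 0:
        if buf[src].isdigit():
            while src >= 0 and buf[src].isdigit():
                buf[dst] = buf[src]      # copy the count digits
                dst -= 1
                src -= 1
        else:
            buf[dst] = '1'               # this letter had no count: insert the 1
            dst -= 1
        buf[dst] = buf[src]              # copy the letter itself
        dst -= 1
        src -= 1
    return final_len

buf = list("aaaaabcddddee")
length = rle_in_place(buf)
print("".join(buf[:length]))  # a5b1c1d4e2
```

Writing backwards is safe because the write index never falls below the read index: the gap between them equals the number of count-less letters still to be processed.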

O(n) and in place
set var = 0;
Loop from 1-length and find the first non-matching character.
The count would be the difference of the indices of both characters.
Let's run through an example
s = "wwwwaaadexxxxxxywww"
add a dummy letter to s
s = s + '#'
now our string becomes
s = "wwwwaaadexxxxxxywww#"
we'll come back to this step later.
j gives the first character of the string.
j = 0 // s[j] = w
now loop through 1 - length. The first non-matching character is 'a'
print(s[j], i - j) // i = 4, j = 0
j = i // j = 4, s[j] = a
Output: w4
i becomes the next non-matching character which would be 'd'
print(s[j], i - j) // i = 7, j = 4 => a3
j = i // j = 7, s[j] = d
Output: w4a3
.
. (Skipping to the second last)
.
j = 15, s[j] = y, i = 16, s[i] = w
print(s[j], i - j) => y1
Output: w4a3d1e1x6y1
Okay, so now we have reached the last run. Assume that we didn't add any dummy letter:
j = 16, s[j] = w and we cannot print its count
because we have no 'mismatching' character after it.
That's why we need to add a dummy letter.
Here's a C++ implementation
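#include <iostream>
#include <string>
using namespace std;

// prints the run-length encoding of s
void compress(string s){
    size_t j = 0;                    // start index of the current run
    s = s + '#';                     // dummy letter so that the last run gets printed
    for(size_t i = 1; i < s.length(); i++){
        if(s[i] != s[j]){            // first non-matching character ends the run
            cout << s[j] << i - j;
            j = i;
        }
    }
}
int main(){
    string s = "wwwwaaadexxxxxxywww";
    compress(s);
    return 0;
}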
Output: w4a3d1e1x6y1w3

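If the use of the insert and erase string functions is allowed, then you can get a compact solution with this implementation (note that each erase or insert shifts the tail of the string, so the worst case is quadratic).
#include <iostream>
#include <string>
using namespace std;
// number of decimal digits in n
int dig(int n){
    int k=0;
    while(n){
        k++;
        n/=10;
    }
    return k;
}
void stringEncoding(string &n){
    for(size_t i=0;i<n.size();i++){
        int j=1;                             // length of the run starting at i
        while(i+j<n.size() && n[i]==n[i+j])j++;
        n.erase(i+1,j-1);                    // drop the repeated characters
        n.insert(i+1,to_string(j));          // write the count right after the character
        i+=dig(j);                           // skip over the digits just inserted
    }
}
int main(){
    ios_base::sync_with_stdio(0), cin.tie(0);
    string n="kaaaabcddedddllllllllllllllllllllllp";
    stringEncoding(n);
    cout<<n;
}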
This will give the following output : k1a4b1c1d2e1d3l22p1

Related

Convert S to T by performing K operations (HackerRank)

I was solving a problem on HackerRank. It required me to see if it is possible to convert string s to string t by performing k operations.
https://www.hackerrank.com/challenges/append-and-delete/problem
The operations we can perform are: appending a lowercase letter to the end of s or removing a lowercase letter from the end of s. For example Ash Ashley 2 would return No since we need 3 operations, not 2.
I tried solving the problem as follows:
def appendAndDelete(s, t, k):
    if len(s) > len(t):
        maxs = [s,t]
    else:
        maxs = [t,s]
    maximum = maxs[0]
    minimum = maxs[1]
    k -= len(maximum) - len(minimum)
    substr = maximum[len(minimum): len(maximum)]
    maximum = maximum.replace(substr, '')
    i = 0
    while i < len(maximum):
        if maximum[i] != minimum[i]:
            k -= (len(maximum)-i)*2
            break
        i += 1
    if k < 0:
        return 'No'
    else:
        return 'Yes'
However, it fails at this weird test case. y yu 2. The expected answer is No but according to my code, it would return Yes since only one operation was required. Is there something I do not understand?
Since you don't explain your idea, it's difficult for us to understand what you mean in your code and debug it to tell you where you went wrong.
However, I would like to share my idea (I solved this on the website too):
len1 => Length of first string s.
len2 => Length of second/target string t.
The requirement of exactly k operations makes it a bit tricky. If len1 + len2 <= k, you can immediately return true: we can delete every character of s, keep deleting from the empty string (which the problem allows) to use up spare operations, and then append all the letters of t.
When we start matching s with t from left to right, this looks more like longest common prefix but this is NOT the case. Let's take an example -
aaaaaaaaa (source)
aaaa (target)
7 (k)
Here, up till aaaa it's common and looks like there are additional 5 a's in the source. So, we can delete those 5 a's and get the target but 5 != 7, hence it appears to be a No. But this ain't the case since we can delete an a from the source just like that and append it again(2 operations) just to satisfy k. So, it need not be longest common prefix all the time, however it gets us closer to the solution.
So, let's match both strings from left to right and stop when there is a mismatch. Let's assume we got this index in a variable called first_unmatched. Initialize first_unmatched = min(len(s),len(t)) at the beginning of your method itself.
Let
rem1 = len1 - first_unmatched
rem2 = len2 - first_unmatched
where rem1 is the length of the remaining part of s and rem2 is the length of the remaining part of t.
Now come the conditions (checked after the len1 + len2 <= k case above).
if (rem1 + rem2 == k) return true
This is because there are rem1 characters to delete and rem2 characters to add. If they sum to k, it's possible.
if (rem1 + rem2 > k) return false
This is because there are rem1 characters to delete and rem2 characters to add. If they sum to more than k, it's not possible.
if (rem1 + rem2 < k) return (k - (rem1 + rem2)) % 2 == 0
This is because there are rem1 characters to delete and rem2 characters to add. If they sum to less than k, then it depends.
Here, (k - (rem1 + rem2)) gives you the extra operations in k. Whether this extra can be used up depends on whether it's divisible by 2. We take % 2 because spare operations can only be wasted in pairs: delete a character and append it again. If the extra is odd, one operation is left over and the answer is No; otherwise it's a Yes.
You can cross check this with above example.
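Put together, the conditions above could look like the following Python sketch. The function name and the 'Yes'/'No' return values follow the question's code; the variable names follow this answer, and the whole thing is only an illustration of the rules described here.

```python
def append_and_delete(s, t, k):
    """Return 'Yes' if s can be turned into t in exactly k append/delete operations."""
    len1, len2 = len(s), len(t)
    # Enough operations to wipe s completely (deleting from the empty string
    # is allowed) and then rebuild t from scratch.
    if len1 + len2 <= k:
        return 'Yes'
    # Index of the first mismatch, i.e. the length of the common prefix.
    first_unmatched = min(len1, len2)
    for i in range(min(len1, len2)):
        if s[i] != t[i]:
            first_unmatched = i
            break
    rem1 = len1 - first_unmatched  # characters to delete from s
    rem2 = len2 - first_unmatched  # characters to append to reach t
    needed = rem1 + rem2
    if needed > k:
        return 'No'
    # Spare operations can only be burned in delete-then-append pairs.
    return 'Yes' if (k - needed) % 2 == 0 else 'No'

print(append_and_delete("y", "yu", 2))            # No
print(append_and_delete("aaaaaaaaa", "aaaa", 7))  # Yes
```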

What is the best algorithm to find longest substring with constraints?

Unfortunately I don't know the name of the following problem, but I am sure it is a well-known one. I want to find an efficient algorithm to solve it.
Let S be the input string and K some number (1 <= K <= 26).
The problem is to find the longest substring of S that has only K different characters. What is the best algorithm to solve this problem?
Some examples:
1) S = aaaaabcdef, K = 3, answer = aaaaabc
2) S = acaaba, K = 2, answer = acaa or aaba
3) S = abcde, K = 5, answer = abcde
I have a sketch of a solution to this problem, but it seems too complicated to me and it has quadratic complexity. In a single linear pass I can collapse each run of equal characters into one character with its count. The next step is to use a set that will contain only K characters. The usage looks like this:
std::string max_string;
for (int i = 0; i < s.size(); ++i)
{
    std::set<int> my_set;
    std::string possible_solution;
    for (int j = i; j < s.size(); ++j)
    {
        // filling set and possible_solution
    }
    if (my_set.size() == K && possible_solution.size() > max_string.size())
        max_string = possible_solution;
}
Notation:
s = input string, zero-based index
[start, end) = substring of input from start to end, including start but excluding end
k-substring = a substring that contains at most k different characters
Algorithm: linear complexity O(n)
start = 0
result = empty string
find max(end): [start, end) is a k-substring
LOOP:
    // please note in every loop iteration, [start, end) is a k-substring
    update result=[start, end) if (end-start) > length(result)
    if end >= length(s) then DONE! EXIT
    increase start until [start, end) is a (k-1)-substring
    increase end while [start, end] is a k-substring
ENDLOOP
To check whether increasing start or end respectively decreases or increases the character pool size (the k property), we can use a count[] array, where count[c] = number of occurrences of c in the current substring [start, end).
C++ Implementation: http://ideone.com/i2JPCq
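For readers who cannot open the link, here is a minimal Python sketch of the same sliding-window idea. It is not the linked implementation: it uses the common "grow end by one, then shrink start" form, a dictionary plays the role of count[], and the name longest_k_substring is made up. It follows the notation above, so it accepts at most k different characters and assumes k >= 1.

```python
def longest_k_substring(s, k):
    """Return the first longest substring of s with at most k different characters."""
    count = {}             # count[c] = occurrences of c inside the window [start, end]
    start = 0
    best_start = best_len = 0
    for end, ch in enumerate(s):
        count[ch] = count.get(ch, 0) + 1
        while len(count) > k:          # too many different characters: shrink from the left
            left = s[start]
            count[left] -= 1
            if count[left] == 0:
                del count[left]
            start += 1
        if end - start + 1 > best_len:
            best_start, best_len = start, end - start + 1
    return s[best_start:best_start + best_len]

print(longest_k_substring("aaaaabcdef", 3))  # aaaaabc
```

Each index enters and leaves the window at most once, which is where the linear bound comes from.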
The best solution I can come up with has time complexity O(n * log(n)) and additional memory complexity O(n). The idea is the following:
First, for each of the 26 characters compute a prefix sum array. For the character C this array has the following property: a[0] = 0, a[i] = <number of occurrences of C among the first i characters>. It is very easy to compute this:
a[0] = 0;
for (int i = 1; i <= n; ++i) {
    a[i] = a[i - 1] + (s[i - 1] == C);
}
Now let us assume you have these arrays. It is very easy to compute the number of occurrences of the character C in a closed interval [i, j]: it is precisely a[j + 1] - a[i]. Using this you can also check if C appears somewhere in the interval [i, j]: simply check if the count of the occurrences is greater than 0.
The last part of my solution is to use binary search. For each index i in the string use binary search to identify what is the longest length of substring starting at position i that has no more than K different characters. The complexity of this part of the algorithm is O(n * log(n)).
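A possible Python rendering of this prefix-sum plus binary-search idea is below. It is a sketch under assumptions: lowercase letters a-z only, K >= 1, "no more than K different characters" as the acceptance test, and an invented function name.

```python
def longest_k_substring_bsearch(s, k):
    """Longest substring with at most k different characters via prefix sums + binary search."""
    n = len(s)
    # prefix[c][i] = occurrences of letter c among the first i characters of s
    prefix = [[0] * (n + 1) for _ in range(26)]
    for i, ch in enumerate(s):
        idx = ord(ch) - ord('a')
        for c in range(26):
            prefix[c][i + 1] = prefix[c][i] + (1 if c == idx else 0)

    def distinct(i, j):
        """Number of different letters in the closed interval [i, j]."""
        return sum(1 for c in range(26) if prefix[c][j + 1] - prefix[c][i] > 0)

    best = ""
    for i in range(n):
        lo, hi = i, n - 1              # [i, i] always qualifies because k >= 1
        while lo < hi:                 # find the largest j with distinct(i, j) <= k
            mid = (lo + hi + 1) // 2
            if distinct(i, mid) <= k:
                lo = mid
            else:
                hi = mid - 1
        if lo - i + 1 > len(best):
            best = s[i:lo + 1]
    return best

print(longest_k_substring_bsearch("aaaaabcdef", 3))  # aaaaabc
```

The binary search is valid because the number of different letters in [i, j] never decreases as j grows.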
Since your alphabet consists of only 26 letters, a linear time algorithm can be as follows:
Scan the string from left to right, at each step maintain two separate arrays startIndex[26], endIndex[26].
startIndex[i] = index of first instance of ('a' + i)th letter in the current active substring.
endIndex[i] = index of last instance of ('a' + i)th letter in the current active substring.
You can initialize the arrays elements to be any strange value (like -1) to check their validity during the algorithm.
Also, maintain the maximum length of sub-string obtained so far and the number of current active unique characters.
Algorithm:
1. i = 0.
   - Mark the startIndex and endIndex of S[0].
   - Initialize maxLength = 1
   - Initialize activeChars = 1.
2. for i = 1 to S.size()-1
   - if (S[i] != any of the activeChars) // can be done in O(26)
         if (activeChars == K)
             update maxLength if maxLength < currLength.
             remove an active char with least startIndex.
             add this new char to startIndex and endIndex
             currLength = i - min (remaining active startIndex) + 1
         else
             activeChars++;
             add this S[i] to startIndex and endIndex
             currLength++.
             update maxLength if maxLength < currLength.
     else
         update endIndex for S[i].
         currLength++.
         update maxLength if maxLength < currLength.
3. again update maxLength if maxLength < currLength.
I'll try to modify Abhishek Bansal's algorithm to keep linear complexity and patch the errors that could arise with repeated characters in the active group.
Scan the string from left to right, at each step maintaining two separate arrays startIndex[26] and endIndex[26], and a map that associates each char (key) with all its occurrences in the active substring (value).
startIndex[i] = index of first instance of ('a' + i)th letter in the current active substring
endIndex[i] = index of last instance of ('a' + i)th letter in the current active substring.
map.get(i) = list of occurrences in the considered substring.
Algorithm:
1. i = 0.
   - Mark the startIndex and endIndex of S[0], add the occurrence of S[0] to the map.
   - Initialize maxLength = 1
   - Initialize activeChars = 1.
2. for i = 1 to S.size()-1
   - if (S[i] != any of the activeChars) // can be done in O(26)
         if (activeChars == K)
             update maxLength if maxLength < currLength.
             remove the active char with least endIndex.
             add this new char to startIndex and endIndex, and to the map with this occurrence
             remove from the map all the occurrences of all the chars that come before the removed char's endIndex
             update all the startIndex entries according to the edited map
             currLength = i - min (remaining active startIndex) + 1
         else
             activeChars++;
             add this S[i] to startIndex and endIndex and to the map
             currLength++.
             update maxLength if maxLength < currLength.
     else
         update endIndex for S[i], add the occurrence to the map.
         currLength++.
         update maxLength if maxLength < currLength.
3. again update maxLength if maxLength < currLength.
I kept the startIndex and endIndex arrays for clarity's sake, but you could avoid the extra space and the extra work of updating them by using the first and the last element of the list of occurrences stored in the map for the key == char C.
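As an illustration, here is a simplified Python variant of the "remove the active char with least endIndex" step. It is a swapped-in simplification rather than the algorithm exactly as written above: it keeps only the last occurrence of each active character (no startIndex array and no occurrence lists) and moves the window start just past the evicted character's last occurrence. The function name is hypothetical and K >= 1 is assumed.

```python
def longest_k_substring_last_seen(s, k):
    """Longest substring with at most k different characters, tracking last occurrences."""
    last = {}   # active character -> index of its last occurrence in the window
    start = 0   # left end of the current window
    best_start = best_len = 0
    for i, ch in enumerate(s):
        if ch not in last and len(last) == k:
            # Evict the active character whose last occurrence is furthest left;
            # everything after that index belongs to the other active characters.
            evicted = min(last, key=last.get)
            start = last.pop(evicted) + 1
        last[ch] = i
        if i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1
    return s[best_start:best_start + best_len]

print(longest_k_substring_last_seen("aaaaabcdef", 3))  # aaaaabc
```

Finding the minimum over at most 26 entries keeps each step O(26), so the scan stays linear in the length of the string.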

Efficiently counting the number of substrings of a digit string that are divisible by k?

We are given a string which consists of digits 0-9. We have to count number of sub-strings divisible by a number k. One way is to generate all the sub-strings and check if it is divisible by k but this will take O(n^2) time. I want to solve this problem in O(n*k) time.
1 <= n <= 100000 and 2 <= k <= 1000.
I saw a similar question here. But k was fixed as 4 in that question. So, I used the property of divisibility by 4 to solve the problem.
Here is my solution to that problem:
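#include <iostream>
#include <string>
using namespace std;

int main()
{
    string s;
    long long int cnt = 0;
    cin >> s;
    // every single digit divisible by 4 is a valid substring on its own
    for (size_t i = 0; i < s.size(); i++) {
        if ((s[i]-'0') % 4 == 0) {
            cnt++;
        }
    }
    // if the two digits ending at i form a multiple of 4, then all i substrings
    // of length >= 2 that end at i are divisible by 4
    for (size_t i = 1; i < s.size(); i++) {
        int f = s[i-1]-'0';
        int s1 = s[i] - '0';
        if ((10*f+s1)%4 == 0) {
            cnt = cnt + (long long)(i);
        }
    }
    cout << cnt;
}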
But I wanted a general algorithm for any value of k.
This is a really interesting problem. Rather than jumping into the final overall algorithm, I thought I'd start with a reasonable algorithm that doesn't quite cut it, then make a series of modifications to it to end up with the final, O(nk)-time algorithm.
This approach combines a number of different techniques. The major technique is the idea of computing a rolling remainder over the digits. For example, let's suppose we want to find all prefixes of the string that are multiples of k. We could do this by listing off all the prefixes and checking whether each one is a multiple of k, but that would take time at least Θ(n²), since the prefixes contain Θ(n²) digits in total. However, we can do this in time Θ(n) by being a bit more clever. Suppose we know that we've read the first h characters of the string and we know the remainder of the number formed that way. We can use this to say something about the remainder of the first h+1 characters of the string as well, since by appending that digit we're taking the existing number, multiplying it by ten, and then adding in the next digit. This means that if we had a remainder of r, then our new remainder is (10r + d) mod k, where d is the digit that we uncovered.
Here's quick pseudocode to count up the number of prefixes of a string that are multiples of k. It runs in time Θ(n):
remainder = 0
numMultiples = 0
for i = 1 to n: // n is the length of the string
    remainder = (10 * remainder + str[i]) % k
    if remainder == 0
        numMultiples++
return numMultiples
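The pseudocode translates almost directly into Python. In this sketch the function name is invented, and each character is converted to its digit value with int() before being folded into the remainder.

```python
def count_prefix_multiples(digits, k):
    """Count the prefixes of the digit string that are multiples of k."""
    remainder = 0
    num_multiples = 0
    for ch in digits:
        # appending a digit multiplies the number by ten and adds the digit
        remainder = (10 * remainder + int(ch)) % k
        if remainder == 0:
            num_multiples += 1
    return num_multiples

print(count_prefix_multiples("14917", 7))  # 3 (the prefixes 14, 1491 and 14917)
```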
We're going to use this initial approach as a building block for the overall algorithm.
So right now we have an algorithm that can find the number of prefixes of our string that are multiples of k. How might we convert this into an algorithm that finds the number of substrings that are multiples of k? Let's start with an approach that doesn't quite work. What if we count all the prefixes of the original string that are multiples of k, then drop off the first character of the string and count the prefixes of what's left, then drop off the second character and count the prefixes of what's left, etc? This will eventually find every substring, since each substring of the original string is a prefix of some suffix of the string.
Here's some rough pseudocode:
numMultiples = 0
for i = 1 to n:
    remainder = 0
    for j = i to n:
        remainder = (10 * remainder + str[j]) % k
        if remainder == 0
            numMultiples++
return numMultiples
For example, running this approach on the string 14917 looking for multiples of 7 will turn up these strings:
String 14917: Finds 14, 1491, 14917
String 4917: Finds 49
String 917: Finds 91, 917
String 17: Finds nothing
String 7: Finds 7
The good news about this approach is that it will find all the substrings that work. The bad news is that it runs in time Θ(n²).
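For reference, the quadratic version is easy to write down in Python and is handy as a baseline when testing faster variants. The function name is made up for this sketch.

```python
def count_multiples_bruteforce(digits, k):
    """Count substrings of the digit string that are multiples of k, in quadratic time."""
    num_multiples = 0
    for i in range(len(digits)):
        remainder = 0                  # restart the rolling remainder at each suffix
        for j in range(i, len(digits)):
            remainder = (10 * remainder + int(digits[j])) % k
            if remainder == 0:
                num_multiples += 1
    return num_multiples

print(count_multiples_bruteforce("14917", 7))  # 7
```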
But let's take a look at the strings we're seeing in this example. Look, for example, at the substrings found by searching for prefixes of the entire string. We found three of them: 14, 1491, and 14917. Now, look at the "differences" between those strings:
The difference between 14 and 14917 is 917.
The difference between 14 and 1491 is 91
The difference between 1491 and 14917 is 7.
Notice that the difference of each of these strings is itself a substring of 14917 that's a multiple of 7, and indeed if you look at the other strings that we've matched later on in the run of the algorithm we'll find these other strings as well.
This isn't a coincidence. If you have two numbers with a common prefix that are multiples of the same number k, then the "difference" between them will also be a multiple of k. (It's a good exercise to check the math on this.)
So this suggests another route we can take. Suppose that we find all prefixes of the original string that are multiples of k. If we can find all of them, we can then figure out how many pairwise differences there are among those prefixes and potentially avoid rescanning things multiple times. This won't find everything, necessarily, but it will find all substrings that can be formed by computing the difference of two prefixes. Repeating this over all suffixes - and being careful not to double-count things - could really speed things up.
First, let's imagine that we find r different prefixes of the string that are multiples of k. How many total substrings did we just find if we include differences? Well, we've found r strings, plus one extra string for each (unordered) pair of elements, which works out to r + r(r-1)/2 = r(r+1)/2 total substrings discovered. We still need to make sure we don't double-count things, though.
To see whether we're double-counting something, we can use the following technique. As we compute the rolling remainders along the string, we'll store the remainders we find after each entry. If in the course of computing a rolling remainder we rediscover a remainder we've already computed at some point, we know that the work we're doing is redundant; some previous scan over the string will have already computed this remainder and anything we've discovered from this point forward will have already been found.
Putting these ideas together gives us this pseudocode:
numMultiples = 0
seenRemainders = array of n sets, all initially empty
for i = 1 to n:
    remainder = 0
    prefixesFound = 0
    for j = i to n:
        remainder = (10 * remainder + str[j]) % k
        if seenRemainders[j] contains remainder:
            break
        add remainder to seenRemainders[j]
        if remainder == 0
            prefixesFound++
    numMultiples += prefixesFound * (prefixesFound + 1) / 2
return numMultiples
So how efficient is this? At first glance, this looks like it runs in time O(n²) because of the nested loops, but that's not a tight bound. Notice that each element can only be passed over in the inner loop at most k times, since after that there aren't any remainders that are still free. Therefore, since each element is visited at most O(k) times and there are n total elements, the runtime is O(nk), which meets your runtime requirements.
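As a cross-check, here is a different and more direct O(nk) technique: a small dynamic program over remainders. To be clear, this is not the pseudocode above but a swapped-in alternative, and the function name is invented. For each position it keeps counts[r], the number of substrings ending there whose value is r modulo k. Hand-tracing the pseudocode above on the string 20 with k = 4 seems to yield 1, whereas 20 and 0 are both multiples of 4, so inputs where k shares a factor with 10 are worth comparing against a baseline like this one.

```python
def count_divisible_substrings(digits, k):
    """Count substrings of the digit string divisible by k in O(n*k) time, O(k) space."""
    counts = [0] * k   # counts[r] = substrings ending at the previous digit with remainder r
    total = 0
    for ch in digits:
        d = int(ch)
        new_counts = [0] * k
        for r, c in enumerate(counts):
            if c:
                # extending a substring with remainder r by digit d
                new_counts[(10 * r + d) % k] += c
        new_counts[d % k] += 1         # the one-digit substring starting here
        total += new_counts[0]
        counts = new_counts
    return total

print(count_divisible_substrings("14917", 7))  # 7
print(count_divisible_substrings("20", 4))     # 2
```

Substrings are counted by position, so equal digit sequences at different positions count separately, and a substring with leading zeros is treated as the number it denotes.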

How to determine string S can be made from string T by deleting some characters, but at most K successive characters

Sorry for the long title :)
In this problem, we have string S of length n, and string T of length m. We can check whether S is a subsequence of string T in time complexity O(n+m). It's really simple.
I am curious about: what if we can delete at most K successive characters? For example, if K = 2, we can make "ab" from "accb", but not from "abcccb". I want to check if it's possible very fast.
I could only find the obvious O(nm) solution: check whether it's possible for every pair of suffixes of string S and string T. I thought a greedy algorithm might be possible, but for K = 2 the case S = "abc" and T = "ababbc" is a counterexample.
Is there any fast solution to solve this problem?
(Update: I've rewritten the opening of this answer to include a discussion of complexity and to discuss some alternative methods and potential risks.)
(Short answer: the only real improvement over the O(nm) approach that I can think of is to observe that we don't usually need to compute all n times m entries in the table. We can calculate only those cells we need. In practice this might work very well, depending on the dataset.)
Clarify the problem: We have a string S of length n, and a string T of length m. The maximum allowed gap is k - this gap is to be enforced at the beginning and end of the string also. The gap is the number of unmatched characters between two matched characters - i.e. if the letters are adjacent, that is a gap of 0, not 1.
Imagine a table with n+1 rows and m+1 columns.
0 1 2 3 4 ... m
--------------------
0 | ? ? ? ? ? ?
1 | ? ? ? ? ? ?
2 | ? ? ? ? ? ?
3 | ? ? ? ? ? ?
... |
n | ? ? ? ? ? ?
At first, we could define that the entry in row r and column c is a binary flag that tells us whether the first r characters of S are a valid k-subsequence of the first c characters of T. (Don't worry yet about how to compute these values, or even whether these values are useful; we just need to define them clearly first.)
However, this binary-flag table isn't very useful. It's not possible to easily calculate one cell as a function of nearby cells. Instead, we need each cell to store slightly more information. As well as recording whether the relevant strings are a valid subsequence, we need to record the number of consecutive unmatched characters at the end of our substring of T (the substring with c characters). For example, if the first r=2 characters of S are "ab" and the first c=3 characters of T are "abb", then there are two possible matches here: the first characters obviously match with each other, but the b can match with either of the later b's. Therefore, we have a choice of leaving one or zero unmatched b's at the end. Which one do we record in the table?
The answer is that, if a cell has multiple valid values, then we take the smallest one. It's logical that we want to make life as easy as possible for ourselves while matching the remainder of the string, and therefore that the smaller the gap at the end, the better. Be wary of other incorrect optimizations: we do not want to match as many characters as possible or as few characters as possible. That can backfire. But it is logical, for a given pair of strings S,T, to find the match (if there are any valid matches) that minimizes the gap at the end.
One other observation is that if the string S is much shorter than T, then it cannot match. This obviously depends on k as well. The first r characters of S can cover at most r + (r+1)k characters of T (the r matched characters plus up to k unmatched characters before, between, and after them); if this is less than c, then we can easily mark (r,c) as -1.
(Any other optimization statements that can be made?)
We do not need to compute all the values in this table. The number of different possible states is k+3. Cells start off in an 'undefined' state (?). If a matching is not possible for the pair of (sub)strings, the state is -1. If a matching is possible, then the score in the cell will be a number between 0 and k inclusive, recording the smallest possible number of unmatched consecutive characters at the end. This gives us a total of k+3 states.
We are interested only in the entry in the bottom right of the table. If f(r,c) is the function that computes a particular cell, then we are interested only in f(n,m). The value for a particular cell can be computed as a function of the values nearby. We can build a recursive algorithm that takes r and c as input and performs the relevant calculations and lookups in term of the nearby values. If this function looks up f(r,c) and finds a ?, it will go ahead and compute it and then store the answer.
It is important to store the answer as the algorithm may query the same cell many times. But also, some cells will never be computed. We just start off attempting to calculate one cell (the bottom right) and just lookup-and-calculate-and-store as necessary.
This is the "obvious" O(nm) approach. The only optimization here is the observation that we don't need to calculate all the cells, therefore this should bring the complexity below O(nm). Of course, with really nasty datasets, you may end up calculating almost all of the cells! Therefore, it's difficult to put an official complexity estimate on this.
Finally, I should say how to compute a particular cell f(r,c):
If r==0 and c <= k, then f(r,c) = c. An empty string can match any string with up to k characters in it, and all c of those characters form the trailing gap.
If r==0 and c > k, then f(r,c) = -1. Too long for a match.
There are only two other ways a cell can have a successful state. We first try:
If S[r]==T[c] and f(r-1,c-1) != -1, then f(r,c) = 0. This is the best case - a match with no trailing gap.
If that didn't work, we try the next best thing. If f(r,c-1) != -1 and f(r,c-1) < k, then f(r,c) = f(r,c-1)+1.
If neither of those work, then f(r,c) = -1.
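The rules above translate fairly directly into a memoized recursive function. The following is a minimal Python sketch, not part of the original answer. The name min_trailing_gap, the 1-based prefix indices, and the extra base case for a non-empty prefix of S against an empty prefix of T are assumptions made for illustration. It takes the value of an empty-S cell to be c, the length of the trailing gap, and it also reports how many cells were actually computed.

```python
from functools import lru_cache


def min_trailing_gap(s, t, k):
    """Return (f(n, m), cells_computed) for the table described above.

    f(r, c) is the smallest trailing gap when the first r characters of s
    are matched against the first c characters of t, or -1 if no match exists.
    """

    @lru_cache(maxsize=None)  # plays the role of the '?' / stored-answer table
    def f(r, c):
        if r == 0:
            # Empty S: everything in T so far is one gap.
            return c if c <= k else -1
        if c == 0:
            # Assumption: a non-empty S prefix cannot match an empty T prefix.
            return -1
        # Best case: the last characters match and the rest matches too.
        if s[r - 1] == t[c - 1] and f(r - 1, c - 1) != -1:
            return 0
        # Next best: leave t[c-1] unmatched, if the gap may still grow.
        prev = f(r, c - 1)
        if prev != -1 and prev < k:
            return prev + 1
        return -1

    answer = f(len(s), len(t))
    return answer, f.cache_info().currsize
```

For example, min_trailing_gap("ab", "abcb", 2) gives a gap of 0, while min_trailing_gap("ab", "abcccb", 2) gives -1. The second element of the result shows how much of the (n+1) by (m+1) table was touched.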
The rest of this answer is my initial, Haskell-based approach. One advantage of it is that it 'understands' that it needn't compute every cell, only computing cells where necessary. But it has the inefficiency that it may calculate the same cell many times.
*Also note that the Haskell approach effectively tackles the problem in mirror image - it tries to build matches from the end substrings of S and T, with a minimal leading bunch of unmatched characters. I don't have the time to rewrite it in its 'mirror image' form!
A recursive approach should work. We want a function that will take three arguments, int K, String S, and String T. However, we don't just want a boolean answer as to whether S is a valid k-subsequence of T.
For this recursive approach, if S is a valid k-subsequence, we also want to know about the best subsequence possible by returning how few characters from the start of T can be dropped. We want to find the 'best' subsequence. If a k-subsequence is not possible for S and T, then we return -1, but if it is possible then we want to return the smallest number of characters we can pull from T while retaining the k-subsequence property.
helloworld
  l    r d
This is a valid 4-subsequence, because the biggest gap has four characters (lowo). This is the best subsequence because it leaves a gap of just two characters at the start (he). Alternatively, here is another valid k-subsequence with the same strings, but it's not as good because it leaves a gap of three at the start:
helloworld
   l   r d
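To make the two alignments concrete, here is a small Python sketch that lists the gaps for a chosen set of match positions. The helper names gaps and is_valid_k_subsequence are hypothetical, and the sketch assumes that the leading gap, the gaps between matches, and the trailing gap must each be at most k.

```python
def gaps(t, positions):
    """Lengths of the unmatched runs before, between and after the matches."""
    bounds = [-1] + list(positions) + [len(t)]
    return [right - left - 1 for left, right in zip(bounds, bounds[1:])]


def is_valid_k_subsequence(s, t, positions, k):
    """Check that positions (0-based indices into t) spell out s with no gap above k."""
    chars_match = len(positions) == len(s) and all(
        t[p] == ch for p, ch in zip(positions, s)
    )
    increasing = all(a < b for a, b in zip(positions, positions[1:]))
    return chars_match and increasing and max(gaps(t, positions)) <= k


first = gaps("helloworld", [2, 7, 9])   # l at index 2: [2, 4, 1, 0]
second = gaps("helloworld", [3, 7, 9])  # l at index 3: [3, 3, 1, 0]
```

The first alignment has the smaller leading gap (2 against 3), which is why it is the preferred one even though its largest gap is bigger.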
The function below is written in Haskell, but it should be easy enough to rewrite in any other language. I'll break it down in more detail below.
best :: Int -> String -> String -> Int
--      K      S         T         return
--      where len(S) <= len(T)
best k [] t_string -- empty S is a subsequence of anything!
    | length(t_string) <= k = length(t_string)
    | length(t_string) > k = -1
best k sss@(s:ss) [] = (-1) -- if T is empty, and S is non-empty, then no subsequence is possible
best k sss@(s:ss) tts@(t:ts) -- both are non-empty. Various possibilities:
    | s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts != -1, then we have the best outcome
    | best k sss ts /= -1
        && best k sss ts < k = 1+ (best k sss ts) -- this is the only other possibility for a valid k-subsequence
    | otherwise = -1 -- no more options left, return -1 for failure.
A line-by-line analysis:
(A comment in Haskell starts with --)
best :: Int -> String -> String -> Int
A function that takes an Int, and two Strings, and that returns an Int. The return value is to be -1 if a k-subsequence is not possible. Otherwise it will return an integer between 0 and K (inclusive) telling us the smallest possible gap at the start of T.
We simply deal with the cases in order.
best k [] t -- empty S is a subsequence of anything!
    | length(t) <= k = length(t)
    | length(t) > k = -1
Above, we handle the case where S is empty ([]). This is simple, as an empty string is always a valid subsequence. But to test if it is a valid k-subsequence, we must calculate the length of T.
best k sss@(s:ss) [] = (-1)
-- if T is empty, and S is non-empty, then no subsequence is possible
That comment explains it. This leaves us with the situations where both strings are non-empty:
best k sss@(s:ss) tts@(t:ts) -- both are non-empty. Various possibilities:
    | s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts != -1, then we have the best outcome
    | best k sss ts /= -1
        && best k sss ts < k = 1+ (best k sss ts) -- this is the only other possibility for a valid k-subsequence
    | otherwise = -1 -- no more options left, return -1 for failure.
tts@(t:ts) matches a non-empty string. The name of the string is tts. But there is also a convenient trick in Haskell (an as-pattern) that lets you give names to the first letter in the string (t) and the remainder of the string (ts). Here ts should be read aloud as the plural of t - the s suffix here means 'plural'. We say we have a t and some ts, and together they make the full (non-empty) string.
That last block of code deals with the case where both strings are non-empty. The two strings are called sss and tts. But to save us the hassle of writing head sss and tail sss to access the first letter, and the remainder, of the string, we simply use @(s:ss) to tell the compiler to store those quantities into variables s and ss. If this was C++ for example, you'd get the same effect with char s = sss[0]; as the first line of your function.
The best situation is that the first characters match (s==t) and the remainders of the strings form a valid k-subsequence (best k ss ts /= -1). This allows us to return 0.
The only other possibility for success is if the current complete string (sss) is a valid k-subsequence of the remainder of the other string (ts). We add 1 to this and return, but make an exception if the gap would grow too big.
It's very important not to change the order of those last five lines. They are ordered in decreasing order of how 'good' the score is. We want to test for, and return, the very best possibilities first.
Naive recursive solution. Bonus := return value is the number of ways that the string can be matched.
#include <stdio.h>
#include <string.h>
unsigned skipneedle(char *haystack, char *needle, unsigned skipmax)
{
    unsigned found,skipped;
    // fprintf(stderr, "skipneedle(%s,%s,%u)\n", haystack, needle, skipmax);
    if ( !*needle) return strlen(haystack) <= skipmax ? 1 : 0 ;
    found = 0;
    for (skipped=0; skipped <= skipmax ; haystack++,skipped++ ) {
        if ( !*haystack ) break;
        if ( *haystack == *needle) {
            found += skipneedle(haystack+1, needle+1, skipmax);
        }
    }
    return found;
}
int main(void)
{
    char *ab = "ab";
    char *test[] = {"ab" , "accb" , "abcccb" , "abcb", NULL}
        , **cpp;
    for (cpp = test; *cpp; cpp++ ) {
        printf( "[%s,%s,%u]=%u \n"
            , *cpp, ab, 2
            , skipneedle(*cpp, ab, 2) );
    }
    return 0;
}
An O(p*n) solution where p = number of subsequences possible of S in T.
Scan the string T and maintain a list of possible subsequences of S that would have
1. Index of last character found and
2. Number of characters to be deleted found
Continue to update this list at each character of T.
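One possible reading of this outline, sketched in Python. The answer gives no code, so the function name is_k_subsequence, the representation of a candidate as a pair (number of characters of S matched, index in T of the last match), and the rule that leading, inner and trailing gaps are each limited to k are all assumptions. The number of characters skipped since the last match is derived from the stored index rather than kept separately.

```python
def is_k_subsequence(s, t, k):
    """Scan t once, keeping every partial match of s that is still alive."""
    n = len(s)
    # (matched, last): 'matched' chars of s are placed, the last one at index 'last'.
    # The start state behaves like a match just before the string.
    candidates = {(0, -1)}
    for p, ch in enumerate(t):
        # A candidate dies once more than k characters were skipped after its last match.
        candidates = {(j, q) for (j, q) in candidates if p - q - 1 <= k}
        # Every live candidate waiting for this character can be extended.
        extended = {(j + 1, p) for (j, q) in candidates if j < n and s[j] == ch}
        candidates |= extended
        if not candidates:
            return False
    # A full match also needs a trailing gap of at most k.
    return any(j == n and len(t) - q - 1 <= k for (j, q) in candidates)
```

The work per character of T is proportional to the number of live candidates, which is where a bound of the form O(p*n) would come from under this reading.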
Not sure if this is what you're asking for, but you could create a list of characters from each String, and search for instances of the one list in the other, then if(list2.length-K > list1.length) return false.
Following is a proposed algorithm : - O(|T|*k) average case
1> scan T and store character indices in Hash Table :-
eg. S = "abc" T = "ababbc"
Symbol table entries : -
a = 1 3
b = 2 4 5
c = 6
2.> as we know isValidSub(S,T) = isValidSub(S(0,j),T) && (isValidSub(S(j+1,N),T)||....isValidSub(S(j+K,T),T))
a.> we will use the bottom up approach to solve above problem
b.> we will maintain an valid array Valid(len(S)) where each record points to a Hash Table (Explained as we go along solving further)
c.> Start from the last element of S, Look up for the indices stored corresponding to the character in Symbol Table
eg. in above example S[last] = "c"
in Symbol Table c = 6
Now we put records like (5,6) , (4,6) ,.... (6-k-1,6) into Hash table at Valid(last)
Explanation : - as s(6,len(S)) is valid subsequence hence s(0,6-i) ++ s(6,len(S)) (where i is in range(1,k+1)) is also valid subsequence provided s(0,6-i) is valid subsequence.
3.> start filling up Valid Array from last to 0 element : -
a.> take an index from the hash table (Symbol Table) entry corresponding to S[j], where j is the current index of the Valid array we are analysing.
b.> Check whether that index is in Valid(j+1); if so, add (index-i,index), where i is in range(1,k+1), into the Valid(j) Hash Table
example:-
S = "abc" T = "ababbc"
iteration 1 :
j = len(S) = 3
S[3] = 'c'
Symbol Table : c = 6
add (5,6),(4,6),(3,6) as K = 2 in Valid(j)
Valid(3) = {(5,6),(4,6),(3,6)}
j = 2
iteration 2 :
S[j] = 'b'
Symbol table: b = 2 4 5
Look up 2 in Valid(3) => not found => skip
Look up 4 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4)}
Look up 5 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4),(4,5)}
j = 1
iteration 3:
S[j] = "a"
Symbol Table : a = 1 3
Look up 1 in Valid(2) => not found
Look up 3 in Valid(2) => found => stop as it is last iteration
END
as 3 is found in Valid(2), that means there exists a valid subsequence starting at index 3 in T
Start = 3
4.> Reconstruct the solution moving downwards in Valid Array :-
example :
Start = 3
Look up 3 in Valid(2) => found (3,4)
Look up 4 in Valid(3) => found (4,6)
END
reconstructed solution (3,4,6) which is indeed valid subsequence
Remember (3,5,6) can also be a solution if we had added (3,5) instead of (3,4) in that iteration
Analysis of Time complexity & Space complexity : -
Time Complexity :
Step 1 : Scan T = O(|T|)
Step 2 : fill up all Valid entries O(|T|*k), as a HashTable lookup is approximately O(1)
Step 3 : Reconstruct solution O(|S|)
Overall average case Time : O(|T|*k)
Space Complexity:
Symbol table = O(|T|+|S|)
Valid table = O(|T|*k) can be improved with optimizations
Overall space = O(|T|*k)
Java Implementation: -
import java.util.ArrayList;
import java.util.HashMap;
public class Subsequence {
    private ArrayList[] SymbolTable = null;
    private HashMap[] Valid = null;
    private String S;
    private String T;
    public ArrayList<Integer> getSubsequence(String S,String T,int K) {
        this.S = S;
        this.T = T;
        if(S.length()>T.length())
            return(null);
        // Only the letters a-z are supported
        S = S.toLowerCase();
        T = T.toLowerCase();
        SymbolTable = new ArrayList[26];
        for(int i=0;i<26;i++)
            SymbolTable[i] = new ArrayList<Integer>();
        char[] s1 = T.toCharArray();
        char[] s2 = S.toCharArray();
        //Calculate Symbol table
        for(int i=0;i<T.length();i++) {
            SymbolTable[s1[i]-'a'].add(i);
        }
        /* for(int j=0;j<26;j++) {
            System.out.println(SymbolTable[j]);
        }
        */
        Valid = new HashMap[S.length()];
        for(int i=0;i<S.length();i++)
            Valid[i] = new HashMap<Integer,Integer >();
        int Start = -1;
        for(int j = S.length()-1;j>=0;j--) {
            int index = s2[j] - 'a';
            //System.out.println(index);
            for(int m = 0;m<SymbolTable[index].size();m++) {
                if(j==S.length()-1||Valid[j+1].containsKey(SymbolTable[index].get(m))) {
                    int value = (Integer)SymbolTable[index].get(m);
                    if(j==0) {
                        Start = value;
                        break;
                    }
                    // Map each admissible previous index to this index
                    for(int t=1;t<=K+1;t++) {
                        Valid[j].put(value-t, value);
                    }
                }
            }
        }
        /* for(int j=0;j<S.length();j++) {
            System.out.println(Valid[j]);
        }
        */
        if(Start != -1) { //Solution exists
            ArrayList<Integer> subseq = new ArrayList<Integer>();
            subseq.add(Start);
            int prev = Start;
            int next;
            // Reconstruct solution
            for(int i=1;i<S.length();i++) {
                next = (Integer)Valid[i].get(prev);
                subseq.add(next);
                prev = next;
            }
            return(subseq);
        }
        return(null);
    }
    public static void main(String[] args) {
        Subsequence sq = new Subsequence();
        System.out.println(sq.getSubsequence("abc","ababbc", 2));
    }
}
Consider a recursive approach: let int f(int i, int j) denote the minimum possible gap at the beginning for S[i...n] matching T[j...m]. f returns -1 if such matching does not exist. Here's the implementation of f:
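int f(int i, int j){
    if(j == m){
        // T is exhausted: succeed only if S is exhausted too
        if(i == n)
            return 0;
        else
            return -1;
    }
    if(i == n){
        // S is exhausted: the rest of T is one gap (the caller compares it with k)
        return m - j;
    }
    if(S[i] == T[j]){
        int tmp = f(i + 1, j + 1);
        if(tmp >= 0 && tmp <= k)
            return 0;
    }
    // Skip T[j], but propagate failure instead of turning -1 into 0
    int rest = f(i, j + 1);
    return rest < 0 ? -1 : rest + 1;
}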
If we convert this recursive approach to a dynamic programming approach, then we can have a time complexity of O(nm).
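As an illustration of that conversion, here is a bottom-up table in Python (the answer itself only gives the recursive form). The function name min_leading_gap and the final check of the top-level result against k are assumptions of this sketch. Each cell is filled exactly once, which gives the O(nm) bound.

```python
def min_leading_gap(s, t, k):
    """Bottom-up version of f: g[i][j] is the minimum gap at the start
    when s[i:] is matched against t[j:], or -1 if no match exists."""
    n, m = len(s), len(t)
    g = [[-1] * (m + 1) for _ in range(n + 1)]
    g[n][m] = 0  # both strings exhausted
    for i in range(n, -1, -1):
        for j in range(m - 1, -1, -1):
            if i < n and s[i] == t[j] and 0 <= g[i + 1][j + 1] <= k:
                # Characters match and the following gap is acceptable.
                g[i][j] = 0
            elif g[i][j + 1] >= 0:
                # Skip t[j]; failure (-1) is left as it is.
                g[i][j] = g[i][j + 1] + 1
    result = g[0][0]
    # The leading gap itself must also respect the limit k.
    return result if 0 <= result <= k else -1
```

With S = lrd and T = helloworld this returns 2 for k = 4, 3 for k = 3, and -1 for k = 2.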
Here's an implementation that usually* runs in O(N) and takes O(m) space, where m is length(S).
It uses the idea of a surveyor's chain:
Imagine a series of poles linked by chains of length k.
Anchor the first pole at the beginning of the string.
Now carry the next pole forward until you find a character match.
Place that pole. If there is slack, move on to the next character;
else the previous pole has been dragged forward, and you need to go back
and move it to the next nearest match.
Repeat until you reach the end or run out of slack.
#include <stdlib.h>
#include <string.h>
typedef struct chain_t{
    int slack;
    int pole;
} chainlink;
int subsequence_k_impl(char* t, char* s, int k, chainlink* link, int len)
{
    char* match=s;
    int extra = k; //total slack in the chain
    //for all chars to match, including final null
    while (match<=s+len){
        //advance until we find spot for this post or run out of chain
        while (t[link->pole] && t[link->pole]!=*match ){
            link->pole++; link->slack--;
            if (--extra<0) return 0; //no more slack, can't do it.
        }
        //if we ran out of ground, it's no good
        if (t[link->pole] != *match) return 0;
        //if this link has slack, go to next pole
        if (link->slack>=0) {
            link++; match++;
            //if next pole was already placed,
            while (link[-1].pole < link->pole) {
                //recalc slack and advance again
                extra += link->slack = k-(link->pole-link[-1].pole-1);
                link++; match++;
            }
            //if not done
            if (match<=s+len){
                //current pole is out of order (or unplaced), move it next to prev one
                link->pole = link[-1].pole+1;
                extra+= link->slack = k;
            }
        }
        //else drag the previous pole forward to the limit of the chain.
        else if (match>=s) {
            int drag = (link->pole - link[-1].pole -1)- k;
            link--;match--;
            link->pole+=drag;
            link->slack-=drag;
        }
    }
    //all poles planted. good match
    return 1;
}
int subsequence_k(char* t, char* s, int k)
{
    int l = strlen(s);
    if (strlen(t)>(l+1)*(k+1))
        return -1; //easy exit
    else {
        chainlink* chain = calloc(sizeof(chainlink),l+2);
        chain[0].pole=-1; //first pole is anchored before the string
        chain[0].slack=0;
        chain[1].pole=0; //start searching at first char
        chain[1].slack=k;
        l = subsequence_k_impl(t,s,k,chain+1,l);
        l=l?chain[1].pole:-1; //pos of first match or -1;
        free(chain);
    }
    return l;
}
* I'm not sure of the big-O. I initially thought it was something like O(km+N). In testing, it averages less than 2N for good matches and less than N for failed matches.
...but.. there is a strange degenerate case. For random strings selected from an alphabet of size A, it gets much slower when k = 2A+1. Even in this case it's better than O(Nm), and the performance returns to O(N) when k is increased or decreased slightly. Gist Here if anyone is curious.

Non increasing and Non Decreasing Subsequence

Finding the longest non-decreasing subsequence is a well-known problem.
But this question is a slight variant of it. In this problem we have to find the length of the longest subsequence that is made up of 2 disjoint subsequences: 1. a non-decreasing one and 2. a non-increasing one.
e.g. in the string "aabcazcczba" the longest such sequence is aabczcczba. aabczcczba is made up of the 2 disjoint subsequences aabcZccZBA (capital letters show the non-increasing subsequence).
My algorithm is
length = 0
For i = 0 to length of given string S
let s' = find the longest non-decreasing subsequence starting at position i
let s" = find the longest non-increasing subsequence from S-s'.
if (length of s' + length of s") > length
length = (length of s' + length of s")
But I am not sure whether this would give the correct answer or not. Can you find a bug in this algorithm and, if there is a bug, also suggest a correct algorithm? I also need to optimize the solution. My algorithm would take roughly O(n^4) steps.
Your solution is definitely incorrect. E.g. addddbc. The longest non-decreasing subsequence is adddd, but that leaves only bc, from which the non-increasing part can take a single letter, for a total of 6. The optimal solution is abc and dddd (or ab and ddddc, or ac and ddddb), for a total of 7.
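For short strings the counterexample can be checked by exhaustive search. The following Python sketch is not from the answer, and the name longest_split_bruteforce is made up for illustration. It tries every way of assigning each letter to the non-decreasing part, the non-increasing part, or neither, so it is exponential and only meant as a reference for small inputs.

```python
from itertools import product


def longest_split_bruteforce(x):
    """Try all 3**len(x) labelings: I = non-decreasing part,
    D = non-increasing part, N = letter dropped."""
    best = 0
    for labels in product("IDN", repeat=len(x)):
        inc = [ch for ch, lab in zip(x, labels) if lab == "I"]
        dec = [ch for ch, lab in zip(x, labels) if lab == "D"]
        inc_ok = all(a <= b for a, b in zip(inc, inc[1:]))
        dec_ok = all(a >= b for a, b in zip(dec, dec[1:]))
        if inc_ok and dec_ok:
            best = max(best, len(inc) + len(dec))
    return best
```

For addddbc this returns 7, one more than the total of 6 reached by starting from the longest non-decreasing subsequence adddd.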
One solution is to use dynamic programming.
F(i, x, a, b) = 1, if there is a non-decreasing and non-increasing combo from first i letters of x ( x[:i]) such that last letter of non-decreasing part is a, and non-increasing part is b. Both of these letters equal to NULL if the corresponding sub-sequence is empty.
Otherwise F(i, x, a, b) = 0.
F(i+1,x,x[i+1],b) = 1 if there exists a and b such that
a<=x[i+1] or a=NULL and F(i,x,a,b)=1. 0 otherwise.
F(i+1,x,a,x[i+1]) = 1 if there exists a and b such that
b>=x[i+1] or b=NULL and F(i,x,a,b)=1. 0 otherwise.
Initialize F(0,x,NULL,NULL)=1 and iterate from i=1..n
As you can see, you can get F(i+1, x, a, b) from F(i, x, a, b). Complexity: Linear in length, polynomial in size of the alphabet.
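A minimal Python sketch of this dynamic program, with one change that is an assumption of the sketch: instead of a 0/1 flag, each state (a, b) stores the best combined length reached so far, so that the answer to the original question can be read off at the end. The name longest_split is hypothetical, and None stands in for NULL.

```python
def longest_split(x):
    """Longest total length of a non-decreasing plus a disjoint
    non-increasing subsequence of x."""
    # (a, b) -> best length, where a is the last letter of the non-decreasing
    # part and b the last letter of the non-increasing part (None = empty).
    states = {(None, None): 0}
    for ch in x:
        nxt = dict(states)  # dropping ch keeps every existing state
        for (a, b), length in states.items():
            if a is None or a <= ch:
                # ch extends the non-decreasing part
                key = (ch, b)
                if nxt.get(key, -1) < length + 1:
                    nxt[key] = length + 1
            if b is None or b >= ch:
                # ch extends the non-increasing part
                key = (a, ch)
                if nxt.get(key, -1) < length + 1:
                    nxt[key] = length + 1
        states = nxt
    return max(states.values())
```

There are at most 27 * 27 states for lowercase letters, so the loop does a bounded amount of work per character. On the two strings discussed here, longest_split("addddbc") is 7 and longest_split("aabcazcczba") is 10.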
I got the answer, and here is how it works, thanks to @ElKamina.
Maintain a table of dimension 27x27. 27 = (1 null character + 26 letters)
table[i][j] denotes the length of the subsequence whose non-decreasing subsequence has last character 'i' and whose non-increasing subsequence has last character 'j' (the 0th index denotes the null character and the kth index denotes character 'k')
for i = 0 to length of string S
    //Among the subsequences whose non-decreasing subsequence's last character is no larger than S[i], find one of maximum length. Now S[i] can be part of this subsequence's non-decreasing part.
    int lim = S[i] - 'a' + 1;
    for(int k=0; k<27; k++){
        if(lim == k) continue;
        int tmax = 0;
        for(int j=0; j<=lim; j++){
            if(table[k][j] > tmax) tmax = table[k][j];
        }
        if(k == 0 && tmax == 0) table[0][lim] = 1;
        else if (tmax != 0) table[k][lim] = tmax + 1;
    }
    //Similarly for the non-increasing subsequence
Time complexity is O(lengthOf(S)*27*27) and space complexity is O(27*27)

Resources