Recently, I was asked the following problem during an interview.
Given a string S, I need to find another string S2 such that S2 is a subsequence of S and also S is a subsequence of S2+reverse(S2). Here '+' means concatenation. I need to output the min possible length of S2 for given S.
I was told that this is a dynamic programming problem however I was unable to solve it. Can somebody help me with this problem?
EDIT-
Is there a way to do this in O(N^2) or less?
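For checking candidate answers on small inputs, a brute-force reference can help. The sketch below is not the dynamic programming solution asked for; it simply tries every subsequence of S, shortest first, and is exponential. The names `is_subsequence` and `min_s2_bruteforce` are made up for this illustration.

```python
from itertools import combinations

def is_subsequence(small, big):
    """Return True if `small` can be obtained from `big` by deleting characters."""
    it = iter(big)
    # `ch in it` consumes the iterator up to the first match, so order is respected
    return all(ch in it for ch in small)

def min_s2_bruteforce(s):
    """Return a shortest S2 (a subsequence of s) with s a subsequence of S2+reverse(S2)."""
    for size in range(len(s) + 1):
        for idx in combinations(range(len(s)), size):
            s2 = "".join(s[i] for i in idx)
            if is_subsequence(s, s2 + s2[::-1]):
                return s2
    return s  # never reached: s itself always satisfies the condition

print(len(min_s2_bruteforce("abcba")))
```

Because it enumerates all subsequences, this is only usable for strings of a dozen or so characters, but it gives a ground truth to compare faster approaches against.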
There are 2 important aspects in this problem.
Since S must be a subsequence of S2+reverse(S2), S2 must have length at least n/2.
After concatenating S2 and reverse(S2), the result is symmetric around its center, with the last character of S2 appearing twice in a row.
So the solution is to check from the center of S to end of S for any consecutive elements. If you find one then check the elements on either side as shown.
Now if you are able to reach the end of the string, then the minimum number of elements (the result) is the distance from the start to the point where you found the consecutive elements. In this example it's C, i.e. 3.
We know that this may not happen always. i.e you may not be able to find consecutive elements at the center. Let us say the consecutive elements are after the center then we can do the same test.
Main string
Substring
Concatenated string
Now for the main question: why do we only search from the center onward? The answer is simple: the concatenated string is S2+reverse(S2), so we are sure that the last element of S2 appears twice in a row in the concatenated string. A repetition in the first half of the main string cannot give a better result, because the final concatenated string must contain at least n characters.
Now the matter of complexity:
Searching for consecutive identical characters takes at most O(n).
Checking the elements on either side iteratively has a worst-case complexity of O(n), i.e. at most n/2 comparisons.
The second check may fail many times, so the two complexities multiply, giving O(n*n).
I believe this is a correct solution and didn't find any loophole yet.
Let's say that S2 is "apple". Then we can make this assumption:
S2 + reverseS2 >= S >= S2
"appleelppa" >= S >= "apple"
So the given S will be something that includes "apple" and is no longer than "appleelppa". It could be "appleel" or "appleelpp".
String S ="locomotiffitomoc";
// as you see S2 string is "locomotif" but
// we don't know S2 yet, so it's blank
String S2 = "";
for (int a=0; a<S.length(); a++) {
try {
int b = 0;
while (S.charAt(a - b) == S.charAt(a + b + 1))
b++;
// if this while loop ends normally, there is a character that doesn't match the rule
// if it doesn't end normally but throws an exception, we found it.
} catch (Exception e) {
// if StringOutOfBoundsException is thrown this means end of the string.
// you can check this manually of course.
S2 = S.substring(0,a+1);
break;
}
}
System.out.println(S2); // will print out "locomotif"
Congratulations, you found the minimum S2.
Each character of S can either be included in S2 or not. With that we can construct a recursion that tries two cases:
the first character of S is used for the cover,
the first character of S is not used for the cover,
and calculate minimum of these two covers. To implement this, it is enough to track how much of S is covered with already chosen S2+reverse(S2).
There are optimizations where we know what result is (found cover, can't have cover), and it is not needed to take first character for cover if it will not cover something.
Simple python implementation:
cache = {}

def S2(S, to_cover):
    if not to_cover:  # Covered
        return ''
    if not S:  # Not covered
        return None
    if len(to_cover) > 2*len(S):  # Can't cover
        return None
    key = (S, to_cover)
    if key not in cache:
        without_char = S2(S[1:], to_cover)  # Calculate with first character skipped
        cache[key] = without_char
        _f = to_cover[0] == S[0]
        _l = to_cover[-1] == S[0]
        if _f or _l:
            # Calculate with first character used
            with_char = S2(S[1:], to_cover[int(_f):len(to_cover)-int(_l)])
            if with_char is not None:
                with_char = S[0] + with_char  # Prepend char to result
                if without_char is None or len(with_char) <= len(without_char):
                    cache[key] = with_char
    return cache[key]

s = '21211233123123213213131212122111312113221122132121221212321212112121321212121132'
c = S2(s, s)
print(len(s), s)
print(len(c), c)
I have a set of strings
[abcd,
efgh,
abefg]
How do I find the minimum number of strings that covers all the characters (abcdefgh)?
The answer would be abcd and efgh. But what would be the algorithm to find this answer?
The "set cover problem" can be reduced to your problem. You can read about it on Wikipedia. There is no known polynomial-time solution for it.
@j_random_hacker: That's what I meant. Corrected.
@Yuvaraj: Check the following pseudocode:
str = input string
S = input set
for each subset s of S in ascending order of cardinality:
    if s covers str
        return s
return none
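The pseudocode above might be written roughly like this in Python. It is only a sketch: `min_cover_exhaustive` is a name invented here, and the search is exponential in the number of strings, so it is only practical for small inputs.

```python
from itertools import combinations

def min_cover_exhaustive(strings, chars):
    """Return a smallest tuple of strings containing every char, or None."""
    target = set(chars)
    # Subsets are tried in ascending order of cardinality,
    # so the first cover found is a minimum one.
    for size in range(1, len(strings) + 1):
        for combo in combinations(strings, size):
            if target <= set("".join(combo)):
                return combo
    return None

print(min_cover_exhaustive(["abcd", "efgh", "abefg"], "abcdefgh"))
```

`itertools.combinations` yields subsets in the order of the input list, so ties between equally small covers are broken by position.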
python
>>> a="abcd efgh abefg"
>>> set(a)
set(['a', ' ', 'c', 'b', 'e', 'd', 'g', 'f', 'h'])
>>> ''.join(set(a))
'a cbedgfh'
>>> ''.join(set(a)-set(' '))
'acbedgfh'
If you want to check every possible combination of strings to find the shortest combination which covers a set of characters, there are two basic approaches:
Generating every combination of strings, and for each one, checking whether it covers the whole character set.
For each character in the set, making a list of strings it appears in, and then combining those lists to find combinations of strings which cover the character set.
(If the number of characters or strings is too big to check all combinations in reasonable time, you'll have to use an approximation algorithm, which will find a good-enough solution, but can't guarantee to find the optimal solution.)
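As an illustration of the approximation route, here is a minimal sketch of the usual greedy heuristic for set cover: keep picking the string that covers the most still-uncovered characters. The function name `greedy_cover` is an assumption of this example, and the result is not guaranteed to be optimal.

```python
def greedy_cover(strings, chars):
    """Greedy heuristic: returns a list of strings covering chars, or None."""
    uncovered = set(chars)
    chosen = []
    while uncovered:
        # Pick the string that covers the most uncovered characters
        best = max(strings, key=lambda s: len(uncovered & set(s)), default="")
        gained = uncovered & set(best)
        if not gained:
            return None  # some characters appear in no string at all
        chosen.append(best)
        uncovered -= gained
    return chosen

words = ["pack", "my", "bag", "with", "five", "dozen", "beige", "eggs"]
print(greedy_cover(words, "abcdefg"))
```

Each round costs one pass over the strings, so this stays fast even when checking all combinations is out of reach.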
The first approach generates up to 2^N combinations of strings (where N is the number of strings), so e.g. for 32 strings that is more than 4 billion combinations, and for 64 strings more than 10^19. For large numbers of strings, this may become too inefficient. On the other hand, the size of the character set doesn't have much impact on the efficiency of this approach.
The second approach generates N lists of indexes pointing to strings (where N is the number of characters in the set), and each of these lists holds at most M indexes (where M is the number of strings). So there are potentially M^N combinations. However, the number of combinations that are actually considered is much lower; consider this example with 7 characters and 8 strings:
character set: abcdefg
strings: 0:pack, 1:my, 2:bag, 3:with, 4:five, 5:dozen, 6:beige, 7:eggs
string matches for each character:
a: [0,2]
b: [2,6]
c: [0]
d: [5]
e: [4,5,6,7]
f: [4]
g: [2,6,7]
optimal combinations (size 4):
[0,2,4,5] = ["pack","bag","five","dozen"]
[0,4,5,6] = ["pack","five","dozen","beige"]
Potentially there are 2x2x1x1x4x1x3 = 48 combinations. However, if string 0 is selected for character "a", that also covers character "c"; if string 2 is selected for character "a", that also covers characters "b" and "g". In fact, only three combinations are ever considered: [0,2,5,4], [0,6,5,4] and [2,0,5,4].
If the number of strings is much greater than the number of characters, approach 2 is the better choice.
code example 1
This is a simple algorithm which uses recursion to try all possible combinations of strings to find the combinations which contain all characters.
The example below runs the algorithm on 12 strings and the whole alphabet (the output goes to the console).
// FIND COMBINATIONS OF STRINGS WHICH COVER THE CHARACTER SET
function charCover(chars, strings, used) {
used = used || [];
// ITERATE THROUGH THE LIST OF STRINGS
for (var i = 0; i < strings.length; i++) {
// MAKE A COPY OF THE CHARS AND DELETE THOSE WHICH OCCUR IN THE CURRENT STRING
var c = chars.replace(new RegExp("[" + strings[i] + "]","g"), "");
// MAKE A COPY OF THE STRINGS AFTER THE CURRENT STRING
var s = strings.slice(i + 1);
// ADD THE CURRENT STRING TO THE LIST OF USED STRINGS
var u = used.concat([strings[i]]);
// IF NO CHARACTERS ARE LEFT, PRINT THE LIST OF USED STRINGS
if (c.length == 0) console.log(u.length + " strings:\t" + u)
// IF CHARACTERS AND STRINGS ARE LEFT, RECURSE WITH THE REST
else if (s.length > 0) charCover(c, s, u);
}
}
var strings = ["the","quick","brown","cow","fox","jumps","over","my","lazy","cats","dogs","unicorns"];
var chars = "abcdefghijklmnopqrstuvwxyz";
charCover(chars, strings);
You can prune some unnecessary paths by adding this line after the characters are removed with replace():
// IF NO CHARS WERE DELETED, THIS STRING IS UNNECESSARY
if (c.length == chars.length) continue;
code example 2
This is an algorithm which first creates a list of matching strings for every character, and then uses recursion to combine the lists to find combinations of strings that cover the character set.
The example below runs the algorithm on 25 strings and 12 characters (the output goes to the console).
// FIND COMBINATIONS OF STRINGS WHICH COVER THE CHARACTER SET
function charCover(chars, strings) {
// CREATE LIST OF STRINGS MATCHING EACH CHARACTER
var matches = [], min = strings.length, output = [];
for (var i = 0; i < chars.length; i++) {
matches[i] = [];
for (var j = 0; j < strings.length; j++) {
if (strings[j].indexOf(chars.charAt(i)) > -1) {
matches[i].push(j);
}
}
}
combine(matches);
return output;
// RECURSIVE FUNCTION TO COMBINE MATCHES
function combine(matches, used) {
var m = []; used = used || [];
// COPY ONLY MATCHES FOR CHARACTERS NOT ALREADY COVERED
for (var i = 0; i < matches.length; i++) {
for (var j = 0, skip = false; j < matches[i].length; j++) {
if (used.indexOf(matches[i][j]) > -1) {
skip = true;
break;
}
}
if (! skip) m.push(matches[i].slice());
}
// IF ALL CHARACTERS ARE COVERED, STORE COMBINATION
if (m.length == 0) {
// IF COMBINATION IS SHORTER THAN MINIMUM, DELETE PREVIOUSLY STORED COMBINATIONS
if (used.length < min) {
min = used.length;
output = [];
}
// CONVERT INDEXES TO STRINGS AND STORE COMBINATION
var u = [];
for (var i = 0; i < used.length; i++) {
u.push(strings[used[i]]);
}
output.push(u);
}
// RECURSE IF CURRENT MINIMUM NUMBER OF STRINGS HAS NOT BEEN REACHED
else if (used.length < min) {
// ITERATE OVER STRINGS MATCHING NEXT CHARACTER AND RECURSE
for (var i = 0; i < m[0].length; i++) {
combine(m, used.concat([m[0][i]]));
}
}
}
}
var strings = ["the","quick","brown","fox","jumps","over","lazy","dogs","pack","my","bag","with","five","dozen","liquor","jugs","jaws","love","sphynx","of","black","quartz","this","should","do"];
var chars = "abcdefghijkl";
var result = charCover(chars, strings);
for (var i in result) console.log(result[i]);
This algorithm can be further optimised to avoid finding duplicate combinations with the same strings in different order. Sorting the matches by size before combining them may also improve efficiency.
Thanks everyone for the responses.
I finally completed it, and have given the algorithm below in simple words as a reference for others.
Sub optimize_strings()
    Capture the list of strings in an array variable & the number of strings in an integer
    Initialize the array of optimized strings as empty & the pointer to it as zero
    Get the list of all characters in an array & the number of characters in a variable
    Do While number of characters > 0
        Reset the frequency of all characters to zero & then calculate the frequency of all characters in the uncovered strings in a separate array
        Reset the number of uncovered characters for each string to zero & then calculate the number of uncovered characters in each string in a separate array
        Sort the characters in the characters array in ascending order based on the character frequency array
        Fetch the list of strings that contain the character at the top of the characters array & place them in the filtered strings array
        Bubble sort the filtered strings array in descending order based on the number of uncovered characters, which was stored in step 2 of this loop
        Store the top of the filtered strings array in the optimized strings array & increase its pointer by 1
        Iterate through all the characters in the optimized string & remove all the characters present in it from the characters array
    Loop
    Print the result of optimized strings present in the optimized strings array
End Sub
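For readers who prefer code, the steps above could look roughly like this in Python. This is only my reading of the description: I assume the character frequency is counted over all input strings, ties are broken alphabetically, and `rarest_first_cover` is a name invented here. Like any such heuristic it does not guarantee a minimum cover.

```python
from collections import Counter

def rarest_first_cover(strings, chars):
    """Heuristic: serve the rarest uncovered char with the most useful string."""
    uncovered = set(chars)
    chosen = []
    while uncovered:
        # In how many strings does each uncovered character occur?
        freq = Counter(ch for s in strings for ch in set(s) if ch in uncovered)
        rarest = min(uncovered, key=lambda ch: (freq[ch], ch))
        if freq[rarest] == 0:
            return None  # this character occurs in no string
        # Among strings containing it, take the one covering most uncovered chars
        candidates = [s for s in strings if rarest in s]
        best = max(candidates, key=lambda s: len(uncovered & set(s)))
        chosen.append(best)
        uncovered -= set(best)
    return chosen

print(rarest_first_cover(["abcd", "efgh", "abefg"], "abcdefgh"))
```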
I have three strings as the input (A,B,C).
A = "SLOVO", B = "WORD", C =
I need to find an algorithm which decides whether the string C is an interleaving of the infinitely repeated strings A and B. Example of repetition: A^2 = "SLOVOSLOVO", and the string C may contain the first 8 letters "SLOVOSLO" of "SLOVOSLOVO". String B is treated the same way.
My idea for algorithm:
index_A = 0; //index of actual letter of string A
index_B = 0;
Go through the whole string C from 0 to size(C)
{
Pick the actual letter from C (C[i])
if(C[i] == A[index_A] && C[i] != B[index_B])
{
index_A++;
Go to next letter in C
}
else if(C[i] == B[index_B] && C[i] != A[index_A])
{
index_B++;
Go to next letter in C
}
else if(C[i] == B[index_B] && C[i] == A[index_A])
{
Now we can't decide which way to go, so we should test both options (maybe with recursion)
}
else
{
return false;
}
}
This is only a quick description of the algorithm, but I hope you understand the main idea of what it should do. Is this a good way of solving the problem? Do you have a better solution, or some tips?
Basically you've got the problem that every regular expression matcher has. Yes, you would need to test both options, and if one doesn't work you will have to backtrack to the other. Expressing your loop over the string recursively can help here.
However, there is also a way to try both options at the same time. See the popular article Regular Expression Matching Can Be Simple And Fast for the idea - you basically keep track of all possible positions in the two strings during the iteration of c. The required lookup structure would have a size of len(A)*len(B), as you can just use a modulus for the string position instead of storing the position in the infinite, repeated string.
// some (pythonic) pseudocode for this:
isIntermixedRepetition(a, b, c)
    alen = length(a)
    blen = length(b)
    pos = new Set() // to store tuples
    // could be implemented as bool array of dimension alen*blen
    pos.add( [0,0] ) // init start pos
    for ci of c
        totest = pos.getContents() // copy and
        pos.clear() // empty the set
        for [indexA, indexB] of totest
            if a[indexA] == ci
                pos.add( [(indexA + 1) % alen, indexB] )
            // no else
            if b[indexB] == ci
                pos.add( [indexA, (indexB + 1) % blen] )
        if pos.isEmpty
            break
    return !pos.isEmpty
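A runnable version of that pseudocode might look as follows in Python. It is a sketch that assumes both a and b are non-empty; the function name is just a transliteration of the one used above.

```python
def is_intermixed_repetition(a, b, c):
    """True if c interleaves a prefix of a repeated forever with one of b."""
    positions = {(0, 0)}  # all reachable (index in a, index in b) pairs
    for ch in c:
        nxt = set()
        for ia, ib in positions:
            if a[ia] == ch:  # consume the character from a
                nxt.add(((ia + 1) % len(a), ib))
            if b[ib] == ch:  # no else: both options are kept alive
                nxt.add((ia, (ib + 1) % len(b)))
        if not nxt:
            return False  # no position can consume this character
        positions = nxt
    return True

print(is_intermixed_repetition("SLOVO", "WORD", "SWLOORVDO"))
```

The set never holds more than len(a)*len(b) pairs, so the running time is bounded by len(c)*len(a)*len(b) and no backtracking is needed.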
Sorry for the long title :)
In this problem, we have string S of length n, and string T of length m. We can check whether S is a subsequence of string T in time complexity O(n+m). It's really simple.
I am curious about: what if we can delete at most K successive characters? For example, if K = 2, we can make "ab" from "accb", but not from "abcccb". I want to check if it's possible very fast.
I could only find the obvious O(nm) solution: check whether it's possible for every pair of suffixes of S and T. I thought a greedy algorithm might be possible, but for K = 2, the case S = "abc" and T = "ababbc" is a counterexample.
Is there any fast solution to solve this problem?
(Update: I've rewritten the opening of this answer to include a discussion of complexity and to discuss some alternative methods and potential risks.)
(Short answer, the only real improvement above the O(nm) approach that I can think of is to observe that we don't usually need to compute all n times m entries in the table. We can calculate only those cells we need. But in practice it might be very good, depending on the dataset.)
Clarify the problem: We have a string S of length n, and a string T of length m. The maximum allowed gap is k - this gap is to be enforced at the beginning and end of the string also. The gap is the number of unmatched characters between two matched characters - i.e. if the letters are adjacent, that is a gap of 0, not 1.
Imagine a table with n+1 rows and m+1 columns.
0 1 2 3 4 ... m
--------------------
0 | ? ? ? ? ? ?
1 | ? ? ? ? ? ?
2 | ? ? ? ? ? ?
3 | ? ? ? ? ? ?
... |
n | ? ? ? ? ? ?
At first, we could define the entry in row r and column c as a binary flag that tells us whether the first r characters of S are a valid k-subsequence of the first c characters of T. (Don't worry yet about how to compute these values, or even whether these values are useful; we just need to define them clearly first.)
However, this binary-flag table isn't very useful. It's not possible to easily calculate one cell as a function of nearby cells. Instead, we need each cell to store slightly more information. As well as recording whether the relevant strings are a valid subsequence, we need to record the number of consecutive unmatched characters at the end of our substring of T (the substring with c characters). For example, if the first r=2 characters of S are "ab" and the first c=3 characters of T are "abb", then there are two possible matches here: The first characters obviously match with each other, but the b can match with either of the latter b. Therefore, we have a choice of leaving one or zero unmatched bs at the end. Which one do we record in the table?
The answer is that, if a cell has multiple valid values, then we take the smallest one. It's logical that we want to make life as easy as possible for ourselves while matching the remainder of the string, and therefore that the smaller the gap at the end, the better. Be wary of other incorrect optimizations - we do not want to match as many characters as possible or as few characters as possible. That can backfire. But it is logical, for a given pair of strings S,T, to find the match (if there are any valid matches) that minimizes the gap at the end.
One other observation is that if the string S is much shorter than T, then it cannot match. This obviously depends on k as well. The r matched characters leave r+1 gaps of at most k characters each, so the first r characters of S can cover at most r + (r+1)k characters of T; if this is less than c, then we can immediately mark (r,c) as -1.
(Any other optimization statements that can be made?)
We do not need to compute all the values in this table. Each cell is in one of k+3 possible states. Cells start off in an 'undefined' state (?). If a matching is not possible for the pair of (sub)strings, the state is -1. If a matching is possible, then the score in the cell will be a number between 0 and k inclusive, recording the smallest possible number of unmatched consecutive characters at the end. This gives us a total of k+3 states.
We are interested only in the entry in the bottom right of the table. If f(r,c) is the function that computes a particular cell, then we are interested only in f(n,m). The value for a particular cell can be computed as a function of the values nearby. We can build a recursive algorithm that takes r and c as input and performs the relevant calculations and lookups in term of the nearby values. If this function looks up f(r,c) and finds a ?, it will go ahead and compute it and then store the answer.
It is important to store the answer as the algorithm may query the same cell many times. But also, some cells will never be computed. We just start off attempting to calculate one cell (the bottom right) and just lookup-and-calculate-and-store as necessary.
This is the "obvious" O(nm) approach. The only optimization here is the observation that we don't need to calculate all the cells, therefore this should bring the complexity below O(nm). Of course, with really nasty datasets, you may end up calculating almost all of the cells! Therefore, it's difficult to put an official complexity estimate on this.
Finally, I should say how to compute a particular cell f(r,c):
If r==0 and c <= k, then f(r,c) = c. An empty string can match any string with up to k characters in it, and all c of them form the trailing gap.
If r==0 and c > k, then f(r,c) = -1. Too long for a match.
If r > 0 and c==0, then f(r,c) = -1. A non-empty string cannot match an empty one.
There are only two other ways a cell can have a successful state. We first try:
If S[r]==T[c] and f(r-1,c-1) != -1, then f(r,c) = 0. This is the best case - a match with no trailing gap. (Here S[r] and T[c] are the r-th and c-th characters, counting from 1.)
If that didn't work, we try the next best thing. If f(r,c-1) != -1 and f(r,c-1) < k, then f(r,c) = f(r,c-1)+1.
If neither of those work, then f(r,c) = -1.
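To make the rules concrete, here is a minimal memoized sketch in Python. The names `is_k_subsequence` and the inner `f` are only for illustration; for the empty prefix I return the trailing gap c, although only the comparison with -1 matters there. Recursion depth grows with len(s)+len(t), so very long inputs would need an iterative table instead.

```python
from functools import lru_cache

def is_k_subsequence(s, t, k):
    """True if s matches inside t with every gap (both ends included) <= k."""
    @lru_cache(maxsize=None)
    def f(r, c):
        # Smallest trailing gap for s[:r] matched in t[:c], or -1 if impossible
        if r == 0:
            return c if c <= k else -1
        if c == 0:
            return -1
        # Best case: last characters match and the rest is matchable
        if s[r - 1] == t[c - 1] and f(r - 1, c - 1) != -1:
            return 0
        # Otherwise leave t[c-1] unmatched, if the gap may still grow
        prev = f(r, c - 1)
        if prev != -1 and prev < k:
            return prev + 1
        return -1
    return f(len(s), len(t)) != -1

print(is_k_subsequence("ab", "accb", 2), is_k_subsequence("ab", "abcccb", 2))
```

Only the cells actually reached from f(n,m) are computed and cached, which is the "compute only what we need" idea described above.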
The rest of this answer is my initial, Haskell-based approach. One advantage of it is that it 'understands' that it needn't compute every cell, only computing cells where necessary. But it can suffer from the inefficiency of calculating one cell many times.
*Also note that the Haskell approach effectively tackles the problem in mirror image - it tries to build matches from the end substrings (suffixes) of S and T, minimizing the leading run of unmatched characters. I don't have the time to rewrite it in its 'mirror image' form!
A recursive approach should work. We want a function that will take three arguments, int K, String S, and String T. However, we don't just want a boolean answer as to whether S is a valid k-subsequence of T.
For this recursive approach, if S is a valid k-subsequence, we also want to know about the best subsequence possible by returning how few characters from the start of T can be dropped. We want to find the 'best' subsequence. If a k-subsequence is not possible for S and T, then we return -1, but if it is possible then we want to return the smallest number of characters we can pull from T while retaining the k-subsequence property.
helloworld
  l    r d
This is a valid 4-subsequence, because the biggest gap has four characters (lowo). This is the best subsequence because it leaves a gap of just two characters at the start (he). Alternatively, here is another valid k-subsequence with the same strings, but it's not as good because it leaves a gap of three at the start:
helloworld
   l   r d
This is written in Haskell, but it should be easy enough to rewrite in any other language. I'll break it down in more detail below.
best :: Int -> String -> String -> Int
--      K      S         T         return
-- where len(S) <= len(T)
best k [] t_string -- empty S is a subsequence of anything!
    | length(t_string) <= k = length(t_string)
    | length(t_string) > k = -1
best k sss@(s:ss) [] = (-1) -- if T is empty, and S is non-empty, then no subsequence is possible
best k sss@(s:ss) tts@(t:ts) -- both are non-empty. Various possibilities:
    | s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts != -1, then we have the best outcome
    | best k sss ts /= -1
      && best k sss ts < k = 1+ (best k sss ts) -- this is the only other possibility for a valid k-subsequence
    | otherwise = -1 -- no more options left, return -1 for failure.
A line-by-line analysis:
(A comment in Haskell starts with --)
best :: Int -> String -> String -> Int
A function that takes an Int, and two Strings, and that returns an Int. The return value is to be -1 if a k-subsequence is not possible. Otherwise it will return an integer between 0 and K (inclusive) telling us the smallest possible gap at the start of T.
We simply deal with the cases in order.
best k [] t -- empty S is a subsequence of anything!
    | length(t) <= k = length(t)
    | length(t) > k = -1
Above, we handle the case where S is empty ([]). This is simple, as an empty string is always a valid subsequence. But to test if it is a valid k-subsequence, we must calculate the length of T.
best k sss@(s:ss) [] = (-1)
-- if T is empty, and S is non-empty, then no subsequence is possible
That comment explains it. This leaves us with the situations where both strings are non-empty:
best k sss@(s:ss) tts@(t:ts) -- both are non-empty. Various possibilities:
    | s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts != -1, then we have the best outcome
    | best k sss ts /= -1
      && best k sss ts < k = 1+ (best k sss ts) -- this is the only other possibility for a valid k-subsequence
    | otherwise = -1 -- no more options left, return -1 for failure.
tts@(t:ts) matches a non-empty string. The name of the string is tts. But there is also a convenient trick in Haskell to allow you to give names to the first letter in the string (t) and the remainder of the string (ts). Here ts should be read aloud as the plural of t - the s suffix here means 'plural'. We say we have a t and some ts, and together they make the full (non-empty) string.
That last block of code deals with the case where both strings are non-empty. The two strings are called sss and tts. But to save us the hassle of writing head sss and tail sss to access the first letter and the remainder of the string, we simply use @(s:ss) to tell the compiler to store those quantities into variables s and ss. If this was C++ for example, you'd get the same effect with char s = sss[0]; as the first line of your function.
The best situation is that the first characters match (s==t) and the remainders of the strings form a valid k-subsequence (best k ss ts /= -1). This allows us to return 0.
The only other possibility for success is if the current complete string (sss) is a valid k-subsequence of the remainder of the other string (ts). We add 1 to this and return, but make an exception if the gap would grow too big.
It's very important not to change the order of those last five lines. They are ordered in decreasing order of how 'good' the score is. We want to test for, and return, the very best possibilities first.
Naive recursive solution. Bonus := return value is the number of ways that the string can be matched.
#include <stdio.h>
#include <string.h>
unsigned skipneedle(char *haystack, char *needle, unsigned skipmax)
{
unsigned found,skipped;
// fprintf(stderr, "skipneedle(%s,%s,%u)\n", haystack, needle, skipmax);
if ( !*needle) return strlen(haystack) <= skipmax ? 1 : 0 ;
found = 0;
for (skipped=0; skipped <= skipmax ; haystack++,skipped++ ) {
if ( !*haystack ) break;
if ( *haystack == *needle) {
found += skipneedle(haystack+1, needle+1, skipmax);
}
}
return found;
}
int main(void)
{
char *ab = "ab";
char *test[] = {"ab" , "accb" , "abcccb" , "abcb", NULL}
, **cpp;
for (cpp = test; *cpp; cpp++ ) {
printf( "[%s,%s,%u]=%u \n"
, *cpp, ab, 2
, skipneedle(*cpp, ab, 2) );
}
return 0;
}
An O(p*n) solution where p = number of subsequences possible of S in T.
Scan the string T and maintain a list of possible subsequences of S that would have
1. Index of last character found and
2. Number of characters to be deleted found
Continue to update this list at each character of T.
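One way to make this scanning idea concrete is sketched below in Python. It is a variation rather than the exact bookkeeping described above: instead of a list of all partial matches, it keeps, for each prefix length of S, only the latest index in T where that prefix can end, since a later end position always leaves a smaller gap to whatever follows. The function name is invented for this example.

```python
def k_subsequence_scan(s, t, k):
    """True if s matches inside t with every gap (both ends included) <= k."""
    n = len(s)
    last = [None] * (n + 1)  # last[i]: latest index in t where s[:i] can end
    last[0] = -1             # the empty prefix "ends" just before t[0]
    for j, ch in enumerate(t):
        # Go from long prefixes to short ones so t[j] is used at most once per chain
        for i in range(n, 0, -1):
            prev = last[i - 1]
            if s[i - 1] == ch and prev is not None and j - prev - 1 <= k:
                last[i] = j
    # The gap after the final matched character must also be at most k
    return last[n] is not None and len(t) - 1 - last[n] <= k

print(k_subsequence_scan("abc", "ababbc", 2))
```

This still takes O(nm) time in the worst case, but only O(n) extra space.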
Not sure if this is what you're asking for, but you could create a list of characters from each String, and search for instances of the one list in the other, then if(list2.length-K > list1.length) return false.
Following is a proposed algorithm : - O(|T|*k) average case
1> scan T and store character indices in Hash Table :-
eg. S = "abc" T = "ababbc"
Symbol table entries : -
a = 1 3
b = 2 4 5
c = 6
2.> as we know isValidSub(S,T) = isValidSub(S(0,j),T) && (isValidSub(S(j+1,N),T)||....isValidSub(S(j+K,T),T))
a.> we will use the bottom up approach to solve above problem
b.> we will maintain an valid array Valid(len(S)) where each record points to a Hash Table (Explained as we go along solving further)
c.> Start from the last element of S, Look up for the indices stored corresponding to the character in Symbol Table
eg. in above example S[last] = "c"
in Symbol Table c = 6
Now we put records like (5,6) , (4,6) ,.... (6-k-1,6) into Hash table at Valid(last)
Explanation : - as s(6,len(S)) is valid subsequence hence s(0,6-i) ++ s(6,len(S)) (where i is in range(1,k+1)) is also valid subsequence provided s(0,6-i) is valid subsequence.
3.> start filling up Valid Array from last to 0 element : -
a.> take a indice from hash table entry corresponding to S[j] where j is current indice of Valid Array we are analysing.
b.> Check whether indice is in Valid(j+1) if less then add (indice-i,indice) where i in range(1,k+1) into Valid(j) Hash Table
example:-
S = "abc" T = "ababbc"
iteration 1 :
j = len(S) = 3
S[3] = 'c'
Symbol Table : c = 6
add (5,6),(4,6),(3,6) as K = 2 in Valid(j)
Valid(3) = {(5,6),(4,6),(3,6)}
j = 2
iteration 2 :
S[j] = 'b'
Symbol table: b = 2 4 5
Look up 2 in Valid(3) => not found => skip
Look up 4 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4)}
Look up 5 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4),(4,5)}
j = 1
iteration 3:
S[j] = "a"
Symbol Table : a = 1 3
Look up 1 in Valid(2) => not found
Look up 3 in Valid(2) => found => stop as it is last iteration
END
as 3 is found in Valid(2), there exists a valid subsequence starting at index 3 in T
Start = 3
4.> Reconstruct the solution moving downwards in Valid Array :-
example :
Start = 3
Look up 3 in Valid(2) => found (3,4)
Look up 4 in Valid(3) => found (4,6)
END
reconstructed solution (3,4,6) which is indeed valid subsequence
Remember (3,5,6) can also be a solution if we had added (3,5) instead of (3,4) in that iteration
Analysis of Time complexity & Space complexity : -
Time Complexity :
Step 1 : Scan T = O(|T|)
Step 2 : fill up all Valid entries O(|T|*k), since a HashTable lookup is approximately O(1)
Step 3 : Reconstruct solution O(|S|)
Overall average case Time : O(|T|*k)
Space Complexity:
Symbol table = O(|T|+|S|)
Valid table = O(|T|*k) can be improved with optimizations
Overall space = O(|T|*k)
Java Implementation: -
import java.util.ArrayList;
import java.util.HashMap;

public class Subsequence {
private ArrayList[] SymbolTable = null;
private HashMap[] Valid = null;
private String S;
private String T;
public ArrayList<Integer> getSubsequence(String S,String T,int K) {
this.S = S;
this.T = T;
if(S.length()>T.length())
return(null);
S = S.toLowerCase();
T = T.toLowerCase();
SymbolTable = new ArrayList[26];
for(int i=0;i<26;i++)
SymbolTable[i] = new ArrayList<Integer>();
char[] s1 = T.toCharArray();
char[] s2 = S.toCharArray();
//Calculate Symbol table
for(int i=0;i<T.length();i++) {
SymbolTable[s1[i]-'a'].add(i);
}
/* for(int j=0;j<26;j++) {
System.out.println(SymbolTable[j]);
}
*/
Valid = new HashMap[S.length()];
for(int i=0;i<S.length();i++)
Valid[i] = new HashMap<Integer,Integer >();
int Start = -1;
for(int j = S.length()-1;j>=0;j--) {
int index = s2[j] - 'a';
//System.out.println(index);
for(int m = 0;m<SymbolTable[index].size();m++) {
if(j==S.length()-1||Valid[j+1].containsKey(SymbolTable[index].get(m))) {
int value = (Integer)SymbolTable[index].get(m);
if(j==0) {
Start = value;
break;
}
for(int t=1;t<=K+1;t++) {
Valid[j].put(value-t, value);
}
}
}
}
/* for(int j=0;j<S.length();j++) {
System.out.println(Valid[j]);
}
*/
if(Start != -1) { //Solution exists
ArrayList subseq = new ArrayList<Integer>();
subseq.add(Start);
int prev = Start;
int next;
// Reconstruct solution
for(int i=1;i<S.length();i++) {
next = (Integer)Valid[i].get(prev);
subseq.add(next);
prev = next;
}
return(subseq);
}
return(null);
}
public static void main(String[] args) {
Subsequence sq = new Subsequence();
System.out.println(sq.getSubsequence("abc","ababbc", 2));
}
}
Consider a recursive approach: let int f(int i, int j) denote the minimum possible gap at the beginning for S[i...n] matching T[j...m]. f returns -1 if such matching does not exist. Here's the implementation of f:
int f(int i, int j){
if(j == m){
if(i == n)
return 0;
else
return -1;
}
if(i == n){
return m - j;
}
if(S[i] == T[j]){
int tmp = f(i + 1, j + 1);
if(tmp >= 0 && tmp <= k)
return 0;
}
int rest = f(i, j + 1);
return rest < 0 ? -1 : rest + 1; // propagate failure instead of turning -1 into 0
}
If we convert this recursive approach to a dynamic programming approach (memoizing f on the pair (i, j)), then we can have a time complexity of O(nm). A valid match exists exactly when 0 <= f(0, 0) <= k.
Here's an implementation that usually* runs in O(N) and takes O(m) space, where m is length(S).
It uses the idea of a surveyor's chain:
Imagine a series of poles linked by chains of length k.
Anchor the first pole at the beginning of the string.
Now carry the next pole forward until you find a character match.
Place that pole. If there is slack, move on to the next character;
else the previous pole has been dragged forward, and you need to go back
and move it to the next nearest match.
Repeat until you reach the end or run out of slack.
typedef struct chain_t{
int slack;
int pole;
} chainlink;
int subsequence_k_impl(char* t, char* s, int k, chainlink* link, int len)
{
char* match=s;
int extra = k; //total slack in the chain
//for all chars to match, including final null
while (match<=s+len){
//advance until we find spot for this post or run out of chain
while (t[link->pole] && t[link->pole]!=*match ){
link->pole++; link->slack--;
if (--extra<0) return 0; //no more slack, can't do it.
}
//if we ran out of ground, it's no good
if (t[link->pole] != *match) return 0;
//if this link has slack, go to next pole
if (link->slack>=0) {
link++; match++;
//if next pole was already placed,
while (link[-1].pole < link->pole) {
//recalc slack and advance again
extra += link->slack = k-(link->pole-link[-1].pole-1);
link++; match++;
}
//if not done
if (match<=s+len){
//current pole is out of order (or unplaced), move it next to prev one
link->pole = link[-1].pole+1;
extra+= link->slack = k;
}
}
//else drag the previous pole forward to the limit of the chain.
else if (match>=s) {
int drag = (link->pole - link[-1].pole -1)- k;
link--;match--;
link->pole+=drag;
link->slack-=drag;
}
}
//all poles planted. good match
return 1;
}
int subsequence_k(char* t, char* s, int k)
{
int l = strlen(s);
if (strlen(t)>(l+1)*(k+1))
return -1; //easy exit
else {
chainlink* chain = calloc(sizeof(chainlink),l+2);
chain[0].pole=-1; //first pole is anchored before the string
chain[0].slack=0;
chain[1].pole=0; //start searching at first char
chain[1].slack=k;
l = subsequence_k_impl(t,s,k,chain+1,l);
l=l?chain[1].pole:-1; //pos of first match or -1;
free(chain);
}
return l;
}
* I'm not sure of the big-O. I initially thought it was something like O(km+N). In testing, it averages less than 2N for good matches and less than N for failed matches.
...but there is a strange degenerate case. For random strings selected from an alphabet of size A, it gets much slower when k = 2A+1. Even in this case it's better than O(Nm), and the performance returns to O(N) when k is increased or decreased slightly. Gist Here if anyone is curious.