Anagram of String 2 is Substring of String 1 - string

How can I determine whether any anagram of String 1 is a substring of String 2?
E.g.:
String 1 = rove
String 2 = stackoverflow
This should return true, because "over" is an anagram of "rove" and is a substring of String 2.

On edit: my first answer was quadratic in the worst case. I've tweaked it to be strictly linear:
Here is an approach based on the notion of a sliding window: Create a dictionary keyed by the letters of the first string, with the frequency counts of those letters as the corresponding values. Think of this as a dictionary of targets which need to be matched by m consecutive letters in the second string, where m is the length of the first string.
Start by processing the first m letters in the second string. For each such letter if it appears as a key in the target dictionary decrease the corresponding value by 1. The goal is to drive all target values to 0. Define discrepancy to be the sum of the absolute values of the values after processing the first window of m letters.
Repeatedly do the following: check whether discrepancy == 0 and return True if it is. Otherwise, take the character from m letters ago and check whether it is a target key; if so, increase its value by 1. This either increases or decreases the discrepancy by 1, so adjust accordingly. Then get the next character of the second string and process it as well: check whether it is a key in the dictionary and, if so, adjust the value and the discrepancy as appropriate.
Since there are no nested loops and each pass through the main loop involves just a few dictionary lookups, comparisons, additions and subtractions, the overall algorithm is linear.
A Python 3 implementation (which shows the basic logic of how the window slides and the target counts and discrepancy are adjusted):
def subAnagram(s1,s2):
    m = len(s1)
    n = len(s2)
    if m > n: return False
    target = dict.fromkeys(s1,0)
    for c in s1: target[c] += 1
    #process initial window
    for i in range(m):
        c = s2[i]
        if c in target:
            target[c] -= 1
    discrepancy = sum(abs(target[c]) for c in target)
    #repeatedly check then slide:
    for i in range(m,n):
        if discrepancy == 0:
            return True
        else:
            #first process letter from m steps ago from s2
            c = s2[i-m]
            if c in target:
                target[c] += 1
                if target[c] > 0: #just made things worse
                    discrepancy +=1
                else:
                    discrepancy -=1
            #now process new letter:
            c = s2[i]
            if c in target:
                target[c] -= 1
                if target[c] < 0: #just made things worse
                    discrepancy += 1
                else:
                    discrepancy -=1
    #if you get to this stage:
    return discrepancy == 0
Typical output:
>>> subAnagram("rove", "stack overflow")
True
>>> subAnagram("rowe", "stack overflow")
False
To stress-test it, I downloaded the complete text of Moby Dick from Project Gutenberg. This has over 1 million characters. "Formosa" is mentioned in the book, hence an anagram of "moors" appears as a substring of Moby Dick. But, not surprisingly, no anagram of "stackoverflow" appears in Moby Dick:
>>> f = open("moby dick.txt")
>>> md = f.read()
>>> f.close()
>>> len(md)
1235186
>>> subAnagram("moors",md)
True
>>> subAnagram("stackoverflow",md)
False
The last call takes roughly 1 second to process the complete text of Moby Dick and verify that no anagram of "stackoverflow" appears in it.

Let L be the length of String1.
Loop over String2 and check if each substring of length L is an anagram of String1.
In your example, String1 = rove and String2 = stackoverflow.
stackoverflow
stac and rove are not anagrams, so move to the next substring of length L.
stackoverflow
tack and rove are not anagrams, and so on till you find the substring.
A faster method would be to check if the last letter in the current substring is present in String1 i.e., once you find that stac and rove are not anagrams, and see that 'c' (which is the last letter of the current substring) is not present in rove, you can simply skip that substring entirely and get the next substring from 'k'.
i.e. stackoverflow
stac and rove are not anagrams. 'c' is not present in 'rove', so simply skip over this substring and check from 'k':
stackoverflow
This will significantly reduce the number of comparisons.
Edit:
Here is a Python 2 implementation of the method explained above.
NOTE: This implementation works under the assumption that all characters in both strings are in lowercase and they consist only of the characters a -z.
def isAnagram(s1, s2):
    c1 = [0] * 26
    c2 = [0] * 26
    # increase character counts for each string
    for i in s1:
        c1[ord(i) - 97] += 1
    for i in s2:
        c2[ord(i) - 97] += 1
    # if the character counts are same, they are anagrams
    if c1 == c2:
        return True
    return False

def isSubAnagram(s1, s2):
    l = len(s1)
    # s2[start:end] represents the substring in s2
    start = 0
    end = l
    while(end <= len(s2)):
        sub = s2[start:end]
        if isAnagram(s1, sub):
            return True
        elif sub[-1] not in s1:
            start += l
            end += l
        else:
            start += 1
            end += 1
    return False
Output:
>>> print isSubAnagram('rove', 'stackoverflow')
True
>>> print isSubAnagram('rowe', 'stackoverflow')
False

It can be done in O(n^3) pre-processing, and O(klogk) per query where: n is the size of the "given string" (string 2 in your example) and k is the size of the query (string 1 in your example).
Pre process:
For each substring s of string2: //O(n^2) of those
    sort s
    store s in some data base (hash table, for example)
Query:
given a query q:
    sort q
    check if q is in the data base
    if it is - it's an anagram of some substring
    otherwise - it is not.
This answer assumes you are going to check multiple "queries" (string 1's) for a single string (string 2), and thus tries to optimize the complexity for each query.
As discussed in the comments, you can do the pre-processing step lazily: when you first encounter a query of length k, insert all substrings of length k into the data structure, then proceed as in the original suggestion.
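A minimal Python sketch of the lazy variant might look like the following. The class and method names are made up for illustration, and a plain set of sorted substrings stands in for the "data base"; it is meant to show the idea rather than to be a tuned implementation.

```python
class AnagramSubstringIndex:
    """Hypothetical helper: answers many anagram-substring queries against one text."""

    def __init__(self, text):
        self.text = text
        self._by_length = {}  # k -> set of sorted length-k substrings of text

    def _build(self, k):
        # Index every substring of length k by its sorted letters.
        text = self.text
        return {"".join(sorted(text[i:i + k])) for i in range(len(text) - k + 1)}

    def contains_anagram_of(self, query):
        k = len(query)
        if k not in self._by_length:
            # First query of this length: build the index lazily.
            self._by_length[k] = self._build(k)
        return "".join(sorted(query)) in self._by_length[k]
```

After the first query of a given length, later queries of the same length only cost the sort of the query plus one set lookup.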

You may need to generate every possible permutation of String1 (for rove: rove, rvoe, reov, ...) and then check whether any of these permutations occurs in String2.
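As a rough Python sketch of this brute-force idea (the function name is an assumption, not part of the answer): it tries every permutation, so the running time grows factorially with the length of String1 and it is only practical for very short strings.

```python
from itertools import permutations

def sub_anagram_brute_force(s1, s2):
    # Try every ordering of s1's letters and test it as a substring of s2.
    return any("".join(p) in s2 for p in permutations(s1))
```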

//Two strings are considered, and we check whether an anagram of the second string is
//present in the first string as part of it (substring)
//e.g. 'atctv' 'cat' will return true as 'atc' is an anagram of 'cat'
//Similarly 'battex' contains an anagram of 'text' as 'ttex'
public class SubstringIsAnagramOfSecondString {
    public static boolean isAnagram(String str1, String str2){
        //System.out.println(str1+"::" + str2);
        Character[] charArr = new Character[str1.length()];
        for(int i = 0; i < str1.length(); i++){
            char ithChar1 = str1.charAt(i);
            charArr[i] = ithChar1;
        }
        for(int i = 0; i < str2.length(); i++){
            char ithChar2 = str2.charAt(i);
            for(int j = 0; j<charArr.length; j++){
                if(charArr[j] == null) continue;
                if(charArr[j] == ithChar2){
                    charArr[j] = null;
                    break; //consume only one occurrence per character of str2
                }
            }
        }
        for(int j = 0; j<charArr.length; j++){
            if(charArr[j] != null)
                return false;
        }
        return true;
    }
    public static boolean isSubStringAnagram(String firstStr, String secondStr){
        int secondLength = secondStr.length();
        int firstLength = firstStr.length();
        if(secondLength == 0) return true;
        if(firstLength < secondLength || firstLength == 0) return false;
        //System.out.println("firstLength:"+ firstLength +" secondLength:" + secondLength+
        //" firstLength - secondLength:" + (firstLength - secondLength));
        for(int i = 0; i < firstLength - secondLength +1; i++){
            if(isAnagram(firstStr.substring(i, i+secondLength),secondStr )){
                return true;
            }
        }
        return false;
    }
    public static void main(String[] args) {
        System.out.println("isSubStringAnagram(xyteabc,ate): "+ isSubStringAnagram("xyteabc","ate"));
    }
}

Related

Palindrome rearrangement in Python

I am given a string and I have to determine whether it can be rearranged into a palindrome.
For example: "aabb" is true.
We can rearrange "aabb" to make "abba", which is a palindrome.
I have come up with the code below but it fails in some cases. Where is the problem and how to fix this?
def palindromeRearranging(inputString):
    a = sorted(inputString)[::2]
    b = sorted(inputString)[1::2]
    return b == a[:len(b)]

def palindromeRearranging(inputString):
    return sum(map(lambda x: inputString.count(x) % 2, set(inputString))) <= 1
This code counts the occurrences of every character in the string. A string can be rearranged into a palindrome only if at most one character has an odd count: exactly one when the length of the string is odd, and none when the length is even.
def palindromeRearranging(inputString):
    elements = {c:inputString.count(c) for c in set(inputString)}
    even = [e % 2 == 0 for e in elements.values()]
    return all(even) or (len(inputString) % 2 == 1 and even.count(False) == 1)
It counts each character number of appearances, and checks whether all elements appear an even number of times or if the length of the input string is odd, checks whether only one character appears an odd number of times.
Python3
def palindromeArrange (string):
    # count the characters that occur an odd number of times
    odd = sum(string.count(c) % 2 for c in set(string))
    if len(string) % 2 == 0:
        # if the string has even element count, no character may have an odd count
        return odd == 0
    # if the string has odd element count, exactly one character has an odd count
    return odd == 1
One liner using list comprehension in Python3
return len([x for x in set(inputString) if inputString.count(x) % 2 != 0]) <= 1
Basically counts those characters that have counts that aren't divisible by 2.
For even strings it would be zero, and for odd strings, it would be one.
The solution I can think of right away has time complexity O(n). It relies on the fact that a palindrome cannot be formed if more than one character has an odd count.
def solution(inputString):
    string = list(inputString)
    n = len(string)
    s_set= set(string)
    from collections import Counter
    dic = Counter(string)
    k =0 #counter for odd characters
    for char in s_set:
        if dic.get(char)%2!=0:
            k+=1
    if k>1:
        return False
    else:
        return True

Find the maximum value of K such that sub-sequences A and B exist and should satisfy the mentioned conditions

Given a string S of length n. Choose an integer K and two non-empty sub-sequences A and B of length K such that it satisfies the following conditions:
A = B i.e. for each i the ith character in A is same as the ith character in B.
Let's denote the indices used to construct A as a1,a2,a3,...,an where ai belongs to S and B as b1,b2,b3,...,bn where bi belongs to S. If we denote the number of common indices in A and B by M then M + 1 <= K.
Find the maximum value of K such that it is possible to find the sub-sequences A and B which satisfies the above conditions.
Constraints:
0 < N <= 10^5
Things which I observed are:
The value of K = 0 if the number of characters in the given string are all distinct i.e S = abcd.
K = length of S - 1 if all the characters in the string are same i.e. S = aaaa.
The value of M cannot be equal to K because then M + 1 <= K would not hold, i.e. you cannot have sub-sequences A and B that satisfy A = B and a1 = b1, a2 = b2, a3 = b3, ..., an = bn.
If the string S is a palindrome then K = (total number of occurrences of the characters whose repetition count is > 1) - 1, e.g. for S = tenet, t is repeated 2 times and e is repeated 2 times, so the total is 4 and K = 4 - 1 = 3.
I am having trouble designing the algorithm to solve the above problem.
Let me know in the comments if you need more clarification.
(Update: see O(n) answer.)
We can modify the classic longest common subsequence recurrence to take an extra parameter.
JavaScript code (not memoised) that I hope is self explanatory:
function f(s, i, j, haveUncommon){
  if (i < 0 || j < 0)
    return haveUncommon ? 0 : -Infinity
  if (s[i] == s[j]){
    if (haveUncommon){
      return 1 + f(s, i-1, j-1, true)
    } else if (i == j){
      return Math.max(
        1 + f(s, i-1, j-1, false),
        f(s, i-1, j, false),
        f(s, i, j-1, false)
      )
    } else {
      return 1 + f(s, i-1, j-1, true)
    }
  }
  return Math.max(
    f(s, i-1, j, haveUncommon),
    f(s, i, j-1, haveUncommon)
  )
}
var s = "aabcde"
console.log(f(s, s.length-1, s.length-1, false))
I believe we are just looking for the closest equal pair of characters since the only characters excluded from A and B would be one of the characters in the pair and any characters in between.
Here's O(n) in JavaScript:
function f(s){
  let map = {}
  let best = -1
  for (let i=0; i<s.length; i++){
    if (!map.hasOwnProperty(s[i])){
      map[s[i]] = i
      continue
    }
    best = Math.max(best, s.length - i + map[s[i]])
    map[s[i]] = i
  }
  return best
}
var strs = [
  "aabcde", // 5
  "aaababcd", // 7
  "aebgaseb", // 4
  "aefttfea",
  // aeft fea
  "abcddbca",
  // abcd bca,
  "a" // -1
]
for (let s of strs)
  console.log(`${ s }: ${ f(s) }`)
O(n) solution in Python3:
def compute_maximum_k(word):
    last_occurences = {}
    max_k = -1
    for i in range(len(word)):
        if(not last_occurences or not word[i] in last_occurences):
            last_occurences[word[i]] = i
            continue
        max_k = max(max_k,(len(word) - i) + last_occurences[word[i]])
        last_occurences[word[i]] = i
    return max_k

def main():
    words = ["aabcde","aaababcd","aebgaseb","aefttfea","abcddbca","a","acbdaadbca"]
    for word in words:
        print(compute_maximum_k(word))

if __name__ == "__main__":
    main()
A solution for the maximum length substring would be the following:
After building a Suffix Array you can derive the LCP Array. The maximum value in the LCP array corresponds to the K you are looking for. The overall complexity of both constructions is O(n).
A suffix array sorts all suffixes of your string S in ascending order. The longest common prefix array then holds the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in the sorted suffix array. Thus the maximum value in this array corresponds to the length of the longest substring that occurs at least twice in S.
For a nice example using the word "banana", check out the LCP Array Wikipage
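To make the idea concrete, here is a deliberately naive Python sketch. It sorts the suffixes directly and compares neighbours character by character, so it does not achieve the O(n) bound mentioned above; the function name is an assumption for illustration.

```python
def longest_repeated_substring_length(s):
    # Naive "suffix array": the suffixes themselves, in sorted order.
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = 0
    # Naive "LCP array": compare each suffix with its sorted neighbour.
    for a, b in zip(suffixes, suffixes[1:]):
        lcp = 0
        while lcp < min(len(a), len(b)) and a[lcp] == b[lcp]:
            lcp += 1
        best = max(best, lcp)
    return best
```

For "banana" the sorted suffixes are a, ana, anana, banana, na, nana, and the largest neighbouring LCP is 3 (for "ana").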
I deleted my previous answer as I don't think we need an LCS-like solution (LCS=longest Common Subsequence).
It is sufficient to find the couple of subsequences (A, B) that differ in one character and share all the others.
The code below finds the solution in O(N) time.
def function(word):
    dp = [0]*len(word)
    lastOccurences = {}
    for i in range(len(dp)-1, -1, -1):
        if i == len(dp)-1:
            dp[i] = 0
        else:
            if dp[i+1] > 0:
                dp[i] = 1 + dp[i+1]
            elif word[i] in lastOccurences:
                dp[i] = len(word)-lastOccurences[word[i]]
        lastOccurences[word[i]] = i
    return dp[0]
dp[i] is equal to 0 when all characters from i to the end of the string are different.
I will explain my code by an example.
For "abcack", there are two cases:
Either the first 'a' will be shared by the two subsequences A and B, in which case the solution will be 1 + function("bcack").
Or 'a' will not be shared between A and B. In this case the result will be 1 + len("ck"). Why 1 + len("ck")? Because we have already satisfied M+1<=K, so we just add all the remaining characters. In terms of indices, the subsequences are [0, 4, 5] and [3, 4, 5].
We take the maximum between these two cases.
The reason I'm scanning right to left is to not have O(N) search for the current character in the rest of the string, I maintain the index of the last visited occurence of the character in the dict lastOccurences.

How do you check if a given input is a palindrome?

I need to check if the input is a palindrome.
I converted the input to a string and compared the input with the reverse of the input using list slicing. I want to learn a different way without converting input to a string.
def palindrome(n):
    num = str(n)
    if num == num[::-1]:
        return True
Assuming that n is a number, you can get digits from right to left and build a number with those digits from left to right:
n = 3102
m = n
p = 0
while m:
    p = p*10 + m%10 # add the rightmost digit of m to the right of p
    m //= 10 # remove the rightmost digit of m
print(p) # 2013
Hence the function:
def palindrome(n):
    m = n
    p = 0
    while m:
        p = p*10 + m%10
        m //= 10
    return p == n
Note that:
if num == num[::-1]:
    return True
will return None if num != num[::-1] (end of the function). You should write:
if num == num[::-1]:
    return True
else:
    return False
Or (shorter and cleaner):
return num == num[::-1]
There can be 2 more approaches to that as follows:
Iterative Method: Loop from the start of the string to length/2, comparing the first character with the last, the second with the second-to-last, and so on. If any pair mismatches, the string is not a palindrome.
Sample Code Below:
def isPalindrome(str):
    for i in range(0, len(str)//2):
        if str[i] != str[len(str)-i-1]:
            return False
    return True
One Extra Variable Method: Take the characters of the string one by one and prepend each to an initially empty string, which builds the reverse of the input. After processing all the characters, compare the two strings to check whether the input is a palindrome.
Sample Code Below:
def isPalindrome(str):
    w = ""
    for i in str:
        w = i + w
    if (str==w):
        return True
    return False
You can try the following approach:
Extract all the digits from the number n
In each iteration, append the digit to one list (digits) and add that digit at the beginning of another list (reversed_digits)
Once all digits have been extracted, compare both lists
def palindrome(n):
    digits = []
    reversed_digits = []
    while n > 0:
        digit = n % 10
        digits.append(digit)
        reversed_digits.insert(0, digit)
        n //= 10
    return digits == reversed_digits
Note: this might not be the most efficient way to solve this problem, but I think it is very easy to understand.

Strings in python 3.7

How to count sub-strings in a string?
Example: findSubstrings("foxcatfox","fox") # should return 2
If recursion is really a must, you can try dividing the problem first.
Say if you found a matching substring at position i, then the total number of substring is 1 + findSub(string[i+1:], sub), so you can write something like this:
def findSubstringsRecursive(string, substring):
    counter = 0
    substringLength = len(substring)
    for i in range(len(string)):
        if string[i] == substring[0]:
            end = i + substringLength
            sub1 = string[i:end]
            if substring == sub1:
                return 1 + findSubstringsRecursive(string[i+1:], substring)
    return 0
The following pure recursive approach is simple enough (apart from the bool->int coercion):
def findRec(s, pat):
    if len(s) < len(pat): # base case should be obvious
        return 0
    return (pat == s[:len(pat)]) + findRec(s[1:], pat) # recurse with smaller size

>>> findRec('foxcatfox', 'fox')
2
>>> findRec('foxcatfox', 'foxc')
1
>>> findRec('foxcat', 'dog')
0
I should note that this counts overlapping occurrences which may or may not be desired. One might also add protection against or define behaviour for an empty substring.
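As one possible way to handle both points, here is a small iterative sketch. The function name and the `overlapping` flag are assumptions for illustration; rejecting an empty pattern with an exception is just one of several reasonable choices.

```python
def count_substrings(s, pat, overlapping=True):
    if not pat:
        # An empty pattern has no obvious count, so refuse it explicitly.
        raise ValueError("pattern must be non-empty")
    count = 0
    start = s.find(pat)
    while start != -1:
        count += 1
        # Step by 1 to allow overlaps, or past the whole match to forbid them.
        start = s.find(pat, start + (1 if overlapping else len(pat)))
    return count
```

With this sketch, "aa" is counted three times in "aaaa" when overlaps are allowed and twice when they are not.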

How to determine string S can be made from string T by deleting some characters, but at most K successive characters

Sorry for the long title :)
In this problem, we have string S of length n, and string T of length m. We can check whether S is a subsequence of string T in time complexity O(n+m). It's really simple.
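For reference, the linear-time subsequence check mentioned here could look like the following minimal sketch (the function name is made up for illustration):

```python
def is_subsequence(s, t):
    # Walk through t once, advancing in s whenever the next needed character appears.
    i = 0
    for ch in t:
        if i < len(s) and s[i] == ch:
            i += 1
    return i == len(s)
```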
I am curious about: what if we can delete at most K successive characters? For example, if K = 2, we can make "ab" from "accb", but not from "abcccb". I want to check if it's possible very fast.
I could only find obvious O(nm): check if it's possible for every suffix pairs in string S and string T. I thought maybe greedy algorithm could be possible, but if K = 2, the case S = "abc" and T = "ababbc" is a counterexample.
Is there any fast solution to solve this problem?
(Update: I've rewritten the opening of this answer to include a discussion of complexity and to discuss some alternative methods and potential risks.)
(Short answer, the only real improvement above the O(nm) approach that I can think of is to observe that we don't usually need to compute all n times m entries in the table. We can calculate only those cells we need. But in practice it might be very good, depending on the dataset.)
Clarify the problem: We have a string S of length n, and a string T of length m. The maximum allowed gap is k - this gap is to be enforced at the beginning and end of the string also. The gap is the number of unmatched characters between two matched characters - i.e. if the letters are adjacent, that is a gap of 0, not 1.
Imagine a table with n+1 rows and m+1 columns.
      0 1 2 3 4 ... m
    --------------------
  0 | ? ? ? ? ?     ?
  1 | ? ? ? ? ?     ?
  2 | ? ? ? ? ?     ?
  3 | ? ? ? ? ?     ?
... |
  n | ? ? ? ? ?     ?
At first, we could define the entry in row r and column c to be a binary flag that tells us whether the first r characters of S are a valid k-subsequence of the first c characters of T. (Don't worry yet about how to compute these values, or even whether these values are useful; we just need to define them clearly first.)
However, this binary-flag table isn't very useful. It's not possible to easily calculate one cell as a function of nearby cells. Instead, we need each cell to store slightly more information. As well as recording whether the relevant strings are a valid subsequence, we need to record the number of consecutive unmatched characters at the end of our substring of T (the substring with c characters). For example, if the first r=2 characters of S are "ab" and the first c=3 characters of T are "abb", then there are two possible matches here: The first characters obviously match with each other, but the b can match with either of the latter b. Therefore, we have a choice of leaving one or zero unmatched bs at the end. Which one do we record in the table?
The answer is that, if a cell has multiple valid values, then we take the smallest one. It's logical that we want to make life as easy as possible for ourselves while matching the remainder of the string, and therefore that the smaller the gap at the end, the better. Be wary of other incorrect optimizations - we do not want to match as many characters as possible or as few characters as possible. That can backfire. But it is logical, for a given pair of strings S,T, to find the match (if there are any valid matches) that minimizes the gap at the end.
One other observation is that if the string S is much shorter than T, then it cannot match. This obviously depends on k as well. The first r characters of S can cover at most r + (r+1)k characters of T (r matched characters plus up to k unmatched characters in each of the r+1 gaps); if this is less than c, then we can immediately mark (r,c) as -1.
(Any other optimization statements that can be made?)
We do not need to compute all the values in this table. Each cell starts off in an 'undefined' state (?). If a matching is not possible for the pair of (sub)strings, the state is -1. If a matching is possible, then the score in the cell will be a number between 0 and k inclusive, recording the smallest possible number of unmatched consecutive characters at the end. This gives us a total of k+3 possible states.
We are interested only in the entry in the bottom right of the table. If f(r,c) is the function that computes a particular cell, then we are interested only in f(n,m). The value for a particular cell can be computed as a function of the values nearby. We can build a recursive algorithm that takes r and c as input and performs the relevant calculations and lookups in term of the nearby values. If this function looks up f(r,c) and finds a ?, it will go ahead and compute it and then store the answer.
It is important to store the answer as the algorithm may query the same cell many times. But also, some cells will never be computed. We just start off attempting to calculate one cell (the bottom right) and just lookup-and-calculate-and-store as necessary.
This is the "obvious" O(nm) approach. The only optimization here is the observation that we don't need to calculate all the cells, therefore this should bring the complexity below O(nm). Of course, with really nasty datasets, you may end up calculating almost all of the cells! Therefore, it's difficult to put an official complexity estimate on this.
Finally, I should say how to compute a particular cell f(r,c):
If r==0 and c <= k, then f(r,c) = c. An empty string can match any string with up to k characters in it, and all c of those characters form the trailing gap.
If r==0 and c > k, then f(r,c) = -1. Too long for a match.
There are only two other ways a cell can have a successful state. We first try:
If S[r]==T[c] and f(r-1,c-1) != -1, then f(r,c) = 0. This is the best case - a match with no trailing gap.
If that didn't work, we try the next best thing. If f(r,c-1) != -1 and f(r,c-1) < k, then f(r,c) = f(r,c-1)+1.
If neither of those work, then f(r,c) = -1.
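The rules above can be sketched as a memoised Python function. This is only an illustration under the assumptions stated in this answer (the gap limit also applies at the start and end of T); the names are made up, and for long strings the recursion depth would need to be raised or the recursion replaced with an explicit table.

```python
from functools import lru_cache

def is_k_subsequence(s, t, k):
    """Return True if s can be matched inside t with every gap at most k."""

    @lru_cache(maxsize=None)
    def f(r, c):
        # Smallest trailing gap when the first r characters of s are matched
        # within the first c characters of t, or -1 if no valid match exists.
        if r == 0:
            # Empty prefix of s: all c characters of t form the trailing gap.
            return c if c <= k else -1
        if c == 0:
            return -1  # non-empty prefix of s cannot match an empty prefix of t
        # Best case: the last characters match and the rest is valid.
        if s[r - 1] == t[c - 1] and f(r - 1, c - 1) != -1:
            return 0
        # Otherwise leave t's last character unmatched, if the gap allows it.
        prev = f(r, c - 1)
        if prev != -1 and prev < k:
            return prev + 1
        return -1

    return f(len(s), len(t)) != -1
```

Only the cells actually reached from f(n, m) are computed, and `lru_cache` stores each one so it is never computed twice.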
The rest of this answer is my initial, Haskell-based approach. One advantage of it is that it 'understands' that it needn't compute every cell, only computing cells where necessary. But it can be inefficient, because it may calculate the same cell many times.
*Also note that the Haskell approach is effectively approaching the problem in a mirror image - it tries to build matches from the end substrings of S and T with a minimal leading bunch of unmatched characters. I don't have the time to rewrite it in its 'mirror image' form!
A recursive approach should work. We want a function that will take three arguments, int K, String S, and String T. However, we don't just want a boolean answer as to whether S is a valid k-subsequence of T.
For this recursive approach, if S is a valid k-subsequence, we also want to know about the best subsequence possible by returning how few characters from the start of T can be dropped. We want to find the 'best' subsequence. If a k-subsequence is not possible for S and T, then we return -1, but if it is possible then we want to return the smallest number of characters we can pull from T while retaining the k-subsequence property.
helloworld
  l    r d
This is a valid 4-subsequence, because the biggest gap has (at most) four characters (lowo). This is the best subsequence because it leaves a gap of just two characters at the start (he). Alternatively, here is another valid k-subsequence with the same strings, but it's not as good because it leaves a gap of three at the start:
helloworld
   l   r d
This is written in Haskell, but it should be easy enough to rewrite in any other language. I'll break it down in more detail below.
best :: Int -> String -> String -> Int
--      K      S         T         return
--      where len(S) <= len(T)
best k [] t_string    -- empty S is a subsequence of anything!
    | length(t_string) <= k = length(t_string)
    | length(t_string) > k = -1
best k sss@(s:ss) [] = (-1) -- if T is empty, and S is non-empty, then no subsequence is possible
best k sss@(s:ss) tts@(t:ts) -- both are non-empty. Various possibilities:
    | s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts != -1, then we have the best outcome
    | best k sss ts /= -1
      && best k sss ts < k = 1+ (best k sss ts) -- this is the only other possibility for a valid k-subsequence
    | otherwise = -1 -- no more options left, return -1 for failure.
A line-by-line analysis:
(A comment in Haskell starts with --)
best :: Int -> String -> String -> Int
A function that takes an Int, and two Strings, and that returns an Int. The return value is to be -1 if a k-subsequence is not possible. Otherwise it will return an integer between 0 and K (inclusive) telling us the smallest possible gap at the start of T.
We simply deal with the cases in order.
best k [] t    -- empty S is a subsequence of anything!
    | length(t) <= k = length(t)
    | length(t) > k = -1
Above, we handle the case where S is empty ([]). This is simple, as an empty string is always a valid subsequence. But to test if it is a valid k-subsequence, we must calculate the length of T.
best k sss@(s:ss) [] = (-1)
-- if T is empty, and S is non-empty, then no subsequence is possible
That comment explains it. This leaves us with the situations where both strings are non-empty:
best k sss@(s:ss) tts@(t:ts) -- both are non-empty. Various possibilities:
    | s == t && best k ss ts /= -1 = 0 -- if s==t, and if best k ss ts != -1, then we have the best outcome
    | best k sss ts /= -1
      && best k sss ts < k = 1+ (best k sss ts) -- this is the only other possibility for a valid k-subsequence
    | otherwise = -1 -- no more options left, return -1 for failure.
tts@(t:ts) matches a non-empty string. The name of the string is tts. But there is also a convenient trick in Haskell to allow you to give names to the first letter in the string (t) and the remainder of the string (ts). Here ts should be read aloud as the plural of t - the s suffix here means 'plural'. We say we have a t and some ts and together they make the full (non-empty) string.
That last block of code deals with the case where both strings are non-empty. The two strings are called sss and tts. But to save us the hassle of writing head sss and tail sss to access the first letter, and the string-remainder, of the string, we simply use @(s:ss) to tell the compiler to store those quantities into variables s and ss. If this was C++ for example, you'd get the same effect with char s = sss[0]; as the first line of your function.
The best situation is that the first characters match (s==t) and the remainders of the strings form a valid k-subsequence (best k ss ts /= -1). This allows us to return 0.
The only other possibility for success is if the current complete string (sss) is a valid k-subsequence of the remainder of the other string (ts). We add 1 to this and return, but make an exception if the gap would grow too big.
It's very important not to change the order of those last five lines. They are ordered in decreasing order of how 'good' the score is. We want to test for, and return, the very best possibilities first.
Naive recursive solution. Bonus := return value is the number of ways that the string can be matched.
#include <stdio.h>
#include <string.h>

unsigned skipneedle(char *haystack, char *needle, unsigned skipmax)
{
    unsigned found,skipped;
    // fprintf(stderr, "skipneedle(%s,%s,%u)\n", haystack, needle, skipmax);
    if ( !*needle) return strlen(haystack) <= skipmax ? 1 : 0 ;
    found = 0;
    for (skipped=0; skipped <= skipmax ; haystack++,skipped++ ) {
        if ( !*haystack ) break;
        if ( *haystack == *needle) {
            found += skipneedle(haystack+1, needle+1, skipmax);
        }
    }
    return found;
}

int main(void)
{
    char *ab = "ab";
    char *test[] = {"ab" , "accb" , "abcccb" , "abcb", NULL}
        , **cpp;
    for (cpp = test; *cpp; cpp++ ) {
        printf( "[%s,%s,%u]=%u \n"
            , *cpp, ab, 2
            , skipneedle(*cpp, ab, 2) );
    }
    return 0;
}
An O(p*n) solution where p = number of subsequences possible of S in T.
Scan the string T and maintain a list of possible subsequences of S that would have
1. Index of last character found and
2. Number of characters to be deleted found
Continue to update this list at each character of T.
Not sure if this is what you're asking for, but you could create a list of characters from each String, and search for instances of the one list in the other, then if(list2.length-K > list1.length) return false.
Following is a proposed algorithm : - O(|T|*k) average case
1> scan T and store character indices in Hash Table :-
eg. S = "abc" T = "ababbc"
Symbol table entries : -
a = 1 3
b = 2 4 5
c = 6
2.> as we know isValidSub(S,T) = isValidSub(S(0,j),T) && (isValidSub(S(j+1,N),T)||....isValidSub(S(j+K,T),T))
a.> we will use the bottom up approach to solve above problem
b.> we will maintain an valid array Valid(len(S)) where each record points to a Hash Table (Explained as we go along solving further)
c.> Start from the last element of S, Look up for the indices stored corresponding to the character in Symbol Table
eg. in above example S[last] = "c"
in Symbol Table c = 6
Now we put records like (5,6) , (4,6) ,.... (6-k-1,6) into Hash table at Valid(last)
Explanation : - as s(6,len(S)) is valid subsequence hence s(0,6-i) ++ s(6,len(S)) (where i is in range(1,k+1)) is also valid subsequence provided s(0,6-i) is valid subsequence.
3.> start filling up Valid Array from last to 0 element : -
a.> take a indice from hash table entry corresponding to S[j] where j is current indice of Valid Array we are analysing.
b.> Check whether indice is in Valid(j+1) if less then add (indice-i,indice) where i in range(1,k+1) into Valid(j) Hash Table
example:-
S = "abc" T = "ababbc"
iteration 1 :
j = len(S) = 3
S[3] = 'c'
Symbol Table : c = 6
add (5,6),(4,6),(3,6) as K = 2 in Valid(j)
Valid(3) = {(5,6),(4,6),(3,6)}
j = 2
iteration 2 :
S[j] = 'b'
Symbol table: b = 2 4 5
Look up 2 in Valid(3) => not found => skip
Look up 4 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4)}
Look up 5 in Valid(3) => found => add Valid(2) = {(3,4),(2,4),(1,4),(4,5)}
j = 1
iteration 3:
S[j] = "a"
Symbol Table : a = 1 3
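Look up 1 in Valid(2) => found, via (1,4), so a valid subsequence already starts at 1; the walk-through continues with the next index
Look up 3 in Valid(2) => found => stop as it is last iteration
END
as 3 is found in Valid(2) that means there exists a valid subsequence starting at 3 in T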
Start = 3
4.> Reconstruct the solution moving downwards in Valid Array :-
example :
Start = 3
Look up 3 in Valid(2) => found (3,4)
Look up 4 in Valid(3) => found (4,6)
END
reconstructed solution (3,4,6) which is indeed valid subsequence
Remember (3,5,6) can also be a solution if we had added (3,5) instead of (3,4) in that iteration
Analysis of Time complexity & Space complexity : -
Time Complexity :
Step 1 : Scan T = O(|T|)
Step 2 : fill up all Valid entries O(|T|*k), since a HashTable lookup is approx. O(1)
Step 3 : Reconstruct solution O(|S|)
Overall average case Time : O(|T|*k)
Space Complexity:
Symbol table = O(|T|+|S|)
Valid table = O(|T|*k) can be improved with optimizations
Overall space = O(|T|*k)
Java Implementation: -
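import java.util.ArrayList;
import java.util.HashMap;

public class Subsequence {
// SymbolTable[c] lists the positions in T of letter c ('a'..'z' only)
private ArrayList<Integer>[] SymbolTable = null;
// Valid[j] maps a predecessor position to a usable position of S[j]
private HashMap<Integer,Integer>[] Valid = null;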
private String S;
private String T;
public ArrayList<Integer> getSubsequence(String S,String T,int K) {
this.S = S;
this.T = T;
if(S.length()>T.length())
return(null);
S = S.toLowerCase();
T = T.toLowerCase();
SymbolTable = new ArrayList[26];
for(int i=0;i<26;i++)
SymbolTable[i] = new ArrayList<Integer>();
char[] s1 = T.toCharArray();
char[] s2 = S.toCharArray();
//Calculate Symbol table
for(int i=0;i<T.length();i++) {
SymbolTable[s1[i]-'a'].add(i);
}
/* for(int j=0;j<26;j++) {
System.out.println(SymbolTable[j]);
}
*/
Valid = new HashMap[S.length()];
for(int i=0;i<S.length();i++)
Valid[i] = new HashMap<Integer,Integer >();
int Start = -1;
for(int j = S.length()-1;j>=0;j--) {
int index = s2[j] - 'a';
//System.out.println(index);
for(int m = 0;m<SymbolTable[index].size();m++) {
if(j==S.length()-1||Valid[j+1].containsKey(SymbolTable[index].get(m))) {
int value = (Integer)SymbolTable[index].get(m);
if(j==0) {
Start = value;
break;
}
for(int t=1;t<=K+1;t++) {
Valid[j].put(value-t, value);
}
}
}
}
/* for(int j=0;j<S.length();j++) {
System.out.println(Valid[j]);
}
*/
if(Start != -1) { //Solution exists
ArrayList subseq = new ArrayList<Integer>();
subseq.add(Start);
int prev = Start;
int next;
// Reconstruct solution
for(int i=1;i<S.length();i++) {
next = (Integer)Valid[i].get(prev);
subseq.add(next);
prev = next;
}
return(subseq);
}
return(null);
}
public static void main(String[] args) {
Subsequence sq = new Subsequence();
System.out.println(sq.getSubsequence("abc","ababbc", 2));
}
}
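To check a reconstructed index list such as the ones in the walk-through above, a small validator can help. The Python sketch below is not part of the answer; the function name is hypothetical, it uses 0-based indices, and it assumes the same rule as the Valid table: at most K characters of T between consecutive matched characters.

```python
def is_valid_k_subsequence(s, t, indices, k):
    """Check that `indices` (0-based positions in t) spell out s in order
    with at most k skipped characters between consecutive positions."""
    if len(indices) != len(s):
        return False
    for i, pos in enumerate(indices):
        # Each position must exist in t and hold the expected character.
        if not 0 <= pos < len(t) or t[pos] != s[i]:
            return False
        if i > 0:
            gap = pos - indices[i - 1] - 1
            # gap < 0 means the positions are not strictly increasing.
            if gap < 0 or gap > k:
                return False
    return True
```

The 1-based solutions (3,4,6) and (3,5,6) from the walk-through correspond to the 0-based lists [2, 3, 5] and [2, 4, 5].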
Consider a recursive approach: let int f(int i, int j) denote the minimum possible gap at the beginning for S[i...n] matching T[j...m], where n and m are the lengths of S and T. f returns -1 if no such matching exists. Here's the implementation of f:
int f(int i, int j){
if(j == m){
if(i == n)
return 0;
else
return -1;
}
if(i == n){
return m - j;
}
if(S[i] == T[j]){
int tmp = f(i + 1, j + 1);
if(tmp >= 0 && tmp <= k)
return 0;
}
int rest = f(i, j + 1);
return rest < 0 ? -1 : rest + 1; // propagate "no match" instead of turning -1 into 0
}
If we convert this recursive approach to a dynamic programming approach, then we can have a time complexity of O(nm).
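A bottom-up table for f might look like the following Python sketch. It is an illustration rather than the answerer's code: the function names are invented, and it assumes f's own convention that the gaps before the first and after the last matched character also count against k. A "no match" result of -1 is carried through unchanged rather than incremented.

```python
def min_leading_gap(s, t, k):
    """Bottom-up version of f: table[i][j] is the smallest number of
    characters of t[j:] skipped before s[i:] starts matching, or -1."""
    n, m = len(s), len(t)
    table = [[-1] * (m + 1) for _ in range(n + 1)]
    table[n][m] = 0
    for j in range(m - 1, -1, -1):
        table[n][j] = m - j            # only trailing characters remain
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s[i] == t[j] and 0 <= table[i + 1][j + 1] <= k:
                table[i][j] = 0        # match here, next gap fits in k
            elif table[i][j + 1] >= 0:
                table[i][j] = table[i][j + 1] + 1   # skip t[j]
            # otherwise table[i][j] stays -1 (no match possible)
    return table[0][0]


def is_k_subsequence(s, t, k):
    """True if s matches t with every gap, including both ends, <= k."""
    return 0 <= min_leading_gap(s, t, k) <= k
```

Each of the (n+1)*(m+1) cells is filled once in constant time, which gives the O(nm) bound mentioned above.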
Here's an implementation that usually* runs in O(N) and takes O(m) space, where m is length(S).
It uses the idea of a surveyor's chain:
Imagine a series of poles linked by chains of length k.
Anchor the first pole at the beginning of the string.
Now carry the next pole forward until you find a character match.
Place that pole. If there is slack, move on to the next character;
else the previous pole has been dragged forward, and you need to go back
and move it to the next nearest match.
Repeat until you reach the end or run out of slack.
#include <stdlib.h>
#include <string.h>

typedef struct chain_t{
int slack;
int pole;
} chainlink;
int subsequence_k_impl(char* t, char* s, int k, chainlink* link, int len)
{
char* match=s;
int extra = k; //total slack in the chain
//for all chars to match, including final null
while (match<=s+len){
//advance until we find spot for this post or run out of chain
while (t[link->pole] && t[link->pole]!=*match ){
link->pole++; link->slack--;
if (--extra<0) return 0; //no more slack, can't do it.
}
//if we ran out of ground, it's no good
if (t[link->pole] != *match) return 0;
//if this link has slack, go to next pole
if (link->slack>=0) {
link++; match++;
//if next pole was already placed,
while (link[-1].pole < link->pole) {
//recalc slack and advance again
extra += link->slack = k-(link->pole-link[-1].pole-1);
link++; match++;
}
//if not done
if (match<=s+len){
//current pole is out of order (or unplaced), move it next to prev one
link->pole = link[-1].pole+1;
extra+= link->slack = k;
}
}
//else drag the previous pole forward to the limit of the chain.
else if (match>=s) {
int drag = (link->pole - link[-1].pole -1)- k;
link--;match--;
link->pole+=drag;
link->slack-=drag;
}
}
//all poles planted. good match
return 1;
}
int subsequence_k(char* t, char* s, int k)
{
int l = strlen(s);
if (strlen(t)>(l+1)*(k+1))
return -1; //easy exit
else {
chainlink* chain = calloc(l+2, sizeof(chainlink)); //element count first, then element size
chain[0].pole=-1; //first pole is anchored before the string
chain[0].slack=0;
chain[1].pole=0; //start searching at first char
chain[1].slack=k;
l = subsequence_k_impl(t,s,k,chain+1,l);
l=l?chain[1].pole:-1; //pos of first match or -1;
free(chain);
}
return l;
}
* I'm not sure of the big-O. I initially thought it was something like O(km+N). In testing, it averages less than 2N for good matches and less than N for failed matches.
...but there is a strange degenerate case. For random strings selected from an alphabet of size A, it gets much slower when k = 2A+1. Even in this case it's better than O(Nm), and the performance returns to O(N) when k is increased or decreased slightly. Gist here if anyone is curious.
