Non increasing and Non Decreasing Subsequence - string

Finding the longest non-decreasing subsequence is a well-known problem.
This question is a slight variant of it. Here we have to find the length of the longest subsequence that is made up of 2 disjoint subsequences: 1. a non-decreasing one and 2. a non-increasing one.
E.g. in the string "aabcazcczba" the longest such subsequence is aabczcczba, which is made up of the 2 disjoint subsequences aabcZccZBA (capital letters show the non-increasing subsequence).
My algorithm is:
    length = 0
    For i = 0 to length of given string S
        let s' = find the longest non-decreasing subsequence starting at position i
        let s" = find the longest non-increasing subsequence from S-s'.
        if (length of s' + length of s") > length
            length = (length of s' + length of s")
But I am not sure whether this gives the correct answer. Can you find a bug in this algorithm, and if there is one, suggest a correct algorithm? I also need to optimize the solution: my algorithm would take roughly O(n^4) steps.

Your solution is definitely incorrect. E.g. addddbc: the longest non-decreasing subsequence is adddd, but that leaves only b and c, which cannot form a non-increasing subsequence longer than 1. The optimal solution is abc and dddd (or ab and ddddc, or ac and ddddb).
One solution is to use dynamic programming.
F(i, x, a, b) = 1 if the first i letters of x (x[:i]) contain a non-decreasing and non-increasing combo such that the last letter of the non-decreasing part is a and the last letter of the non-increasing part is b. Either letter is NULL if the corresponding subsequence is empty.
Otherwise F(i, x, a, b) = 0.
F(i+1, x, x[i+1], b) = 1 if there exists an a such that
(a <= x[i+1] or a = NULL) and F(i, x, a, b) = 1. 0 otherwise.
F(i+1, x, a, x[i+1]) = 1 if there exists a b such that
(b >= x[i+1] or b = NULL) and F(i, x, a, b) = 1. 0 otherwise.
Initialize F(0, x, NULL, NULL) = 1 and iterate over i = 1..n.
As you can see, you can get F(i+1, x, a, b) from F(i, x, a, b). Complexity: linear in the length of the string, polynomial in the size of the alphabet.

I got the answer, and here is how it works, thanks to #ElKamina.
Maintain a table of dimension 27x27, where 27 = 1 null character + 26 letters.
table[i][j] denotes the length of the subsequence whose non-decreasing part has last character 'i' and whose non-increasing part has last character 'j' (index 0 denotes the null character and index k denotes the k-th letter).
for i = 0 to length of string S
    // Among the subsequences whose non-decreasing part ends in a character no larger than S[i], find one of maximum length. S[i] can then be appended to its non-decreasing part.
    int lim = S[i] - 'a' + 1;
    for(int k=0; k<27; k++){
        if(lim == k) continue;
        int tmax = 0;
        for(int j=0; j<=lim; j++){
            if(table[k][j] > tmax) tmax = table[k][j];
        }
        if(k == 0 && tmax == 0) table[0][lim] = 1;
        else if (tmax != 0) table[k][lim] = tmax + 1;
    }
    // Similarly for the non-increasing subsequence
Time complexity is O(lengthOf(S)*27*27) and space complexity is O(27*27).
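For comparison, here is a minimal Python sketch of the same table idea, written from scratch rather than transcribed from the snippet above. It assumes the input contains only lowercase letters a-z; the function name `longest_two_part_subsequence` and the use of -1 for "state not reachable" are illustrative choices, not part of the original answers.

```python
def longest_two_part_subsequence(s):
    """Length of the longest subsequence that splits into a non-decreasing
    part and a non-increasing part (assumes lowercase a-z input)."""
    size = 27  # index 0 = empty part, 1..26 = 'a'..'z'
    # best[a][b] = longest total length with the non-decreasing part ending
    # in letter a and the non-increasing part ending in letter b; -1 = unreachable.
    best = [[-1] * size for _ in range(size)]
    best[0][0] = 0
    for ch in s:
        c = ord(ch) - ord('a') + 1
        new = [row[:] for row in best]  # copying keeps the "skip ch" option
        for a in range(size):
            for b in range(size):
                cur = best[a][b]
                if cur < 0:
                    continue
                if a <= c:  # a == 0 (empty) always passes because c >= 1
                    new[c][b] = max(new[c][b], cur + 1)
                if b == 0 or b >= c:
                    new[a][c] = max(new[a][c], cur + 1)
        best = new
    return max(max(row) for row in best)


print(longest_two_part_subsequence("aabcazcczba"))
print(longest_two_part_subsequence("addddbc"))
```

Copying the table for every character keeps the update simple at the cost of some extra work; the running time is still proportional to lengthOf(S)*27*27.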

Related

Find the maximum value of K such that sub-sequences A and B exist and should satisfy the mentioned conditions

Given a string S of length n, choose an integer K and two non-empty subsequences A and B of length K such that the following conditions hold:
A = B, i.e. for each i the ith character in A is the same as the ith character in B.
Let's denote the indices used to construct A as a1,a2,a3,...,aK and the indices used to construct B as b1,b2,b3,...,bK, where each ai and bi is an index into S. If we denote the number of common indices in A and B by M, then M + 1 <= K.
Find the maximum value of K for which subsequences A and B satisfying the above conditions exist.
Constraints:
0 < N <= 10^5
Things which I observed are:
K = 0 if all the characters in the given string are distinct, e.g. S = abcd.
K = length of S - 1 if all the characters in the string are the same, e.g. S = aaaa.
The value of M cannot be equal to K, because then M + 1 <= K would not hold; i.e. you cannot have subsequences A and B that satisfy A = B and a1 = b1, a2 = b2, a3 = b3, ..., aK = bK.
If the string S is a palindrome then K = (total number of occurrences of the characters that appear more than once) - 1. E.g. for S = tenet, t is repeated 2 times and e is repeated 2 times, so the total is 4 and K = 4 - 1 = 3.
I am having trouble designing the algorithm to solve the above problem.
Let me know in the comments if you need more clarification.
(Update: see O(n) answer.)
We can modify the classic longest common subsequence recurrence to take an extra parameter.
JavaScript code (not memoised) that I hope is self explanatory:
    function f(s, i, j, haveUncommon){
      if (i < 0 || j < 0)
        return haveUncommon ? 0 : -Infinity

      if (s[i] == s[j]){
        if (haveUncommon){
          return 1 + f(s, i-1, j-1, true)
        } else if (i == j){
          return Math.max(
            1 + f(s, i-1, j-1, false),
            f(s, i-1, j, false),
            f(s, i, j-1, false)
          )
        } else {
          return 1 + f(s, i-1, j-1, true)
        }
      }

      return Math.max(
        f(s, i-1, j, haveUncommon),
        f(s, i, j-1, haveUncommon)
      )
    }

    var s = "aabcde"
    console.log(f(s, s.length-1, s.length-1, false))
I believe we are just looking for the closest equal pair of characters since the only characters excluded from A and B would be one of the characters in the pair and any characters in between.
Here's O(n) in JavaScript:
    function f(s){
      let map = {}
      let best = -1

      for (let i=0; i<s.length; i++){
        if (!map.hasOwnProperty(s[i])){
          map[s[i]] = i
          continue
        }

        best = Math.max(best, s.length - i + map[s[i]])
        map[s[i]] = i
      }

      return best
    }

    var strs = [
      "aabcde", // 5
      "aaababcd", // 7
      "aebgaseb", // 4
      "aefttfea",
      // aeft fea
      "abcddbca",
      // abcd bca,
      "a" // -1
    ]

    for (let s of strs)
      console.log(`${ s }: ${ f(s) }`)
O(n) solution in Python3:
    def compute_maximum_k(word):
        last_occurences = {}
        max_k = -1
        for i in range(len(word)):
            if word[i] not in last_occurences:
                last_occurences[word[i]] = i
                continue
            max_k = max(max_k, (len(word) - i) + last_occurences[word[i]])
            last_occurences[word[i]] = i
        return max_k

    def main():
        words = ["aabcde","aaababcd","aebgaseb","aefttfea","abcddbca","a","acbdaadbca"]
        for word in words:
            print(compute_maximum_k(word))

    if __name__ == "__main__":
        main()
A solution for the maximum-length substring would be the following:
After building a suffix array you can derive the LCP array. The maximum value in the LCP array corresponds to the K you are looking for. Both constructions can be done in O(n) overall.
A suffix array sorts all suffixes of your string S in ascending order. The longest common prefix (LCP) array then holds the lengths of the longest common prefixes between all pairs of consecutive suffixes in the sorted suffix array. Thus the maximum value in this array is the length of the longest substring that occurs at least twice in S.
For a nice example using the word "banana", check out the LCP Array wiki page.
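To make the suffix array / LCP idea concrete, here is a small Python sketch. It is deliberately naive: it builds the suffix array by sorting the suffixes directly and computes each LCP value by character comparison, so it is far from the O(n) construction mentioned above. The function name `longest_repeated_substring_length` is an illustrative choice.

```python
def longest_repeated_substring_length(s):
    """Maximum LCP between adjacent suffixes in sorted order, i.e. the
    length of the longest substring occurring at least twice in s."""
    # Naive suffix array: suffix start positions sorted by suffix text.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    best = 0
    for a, b in zip(suffix_array, suffix_array[1:]):
        # Longest common prefix of two neighbouring suffixes.
        lcp = 0
        while a + lcp < len(s) and b + lcp < len(s) and s[a + lcp] == s[b + lcp]:
            lcp += 1
        best = max(best, lcp)
    return best


print(longest_repeated_substring_length("banana"))  # the repeated substring is "ana"
```

Note that this answers the substring version of the problem (contiguous characters), which is what the answer above describes, not the subsequence version from the question.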
I deleted my previous answer as I don't think we need an LCS-like solution (LCS=longest Common Subsequence).
It is sufficient to find the couple of subsequences (A, B) that differ in one character and share all the others.
The code below finds the solution in O(N) time.
    def function(word):
        dp = [0]*len(word)
        lastOccurences = {}
        for i in range(len(dp)-1, -1, -1):
            if i == len(dp)-1:
                dp[i] = 0
            else:
                if dp[i+1] > 0:
                    dp[i] = 1 + dp[i+1]
                elif word[i] in lastOccurences:
                    dp[i] = len(word)-lastOccurences[word[i]]
            lastOccurences[word[i]] = i
        return dp[0]
dp[i] is equal to 0 when all characters from i to the end of the string are different.
I will explain my code by an example.
For "abcack", there are two cases:
Either the first 'a' is shared by the two subsequences A and B, in which case the solution is 1 + function("bcack").
Or 'a' is not shared between A and B. In this case the result is 1 + len("ck"). Why 1 + len("ck")? Because we have already satisfied M+1<=K, so we just add all the remaining characters. In terms of indices, the subsequences are [0, 4, 5] and [3, 4, 5].
We take the maximum between these two cases.
The reason I'm scanning right to left is to avoid an O(N) search for the current character in the rest of the string; I maintain the index of the last visited occurrence of each character in the dict lastOccurences.

Changing letters of a string to obtain maximum score

You are given a string and can change at most Q letters in the string. You are also given a list of substrings (each two characters long), each with a corresponding score. Each occurrence of a substring within the string adds to your total score. What is the maximum attainable score?
String length <= 150, Q <= 100, Number of Substrings <= 700
Example:
String = bpdcg
Q = 2
Substrings:
bz - score: 2
zd - score: 5
dm - score: 7
ng - score: 10
In this example, you can achieve the maximum score by changing the "p" in the string to a "z" and the "c" to an "n". Thus, your new string is "bzdng", which has a score of 2+5+10 = 17.
I know that given a string which already has the letters changed, the score can be checked in linear time using a dictionary matching algorithm such as aho-corasick (or with a slightly worse complexity, Rabin Karp). However, trying each two letter substitution will take too long and then checking will take too long.
Another possible method I thought was to work backwards, to construct the ideal string from the given substrings and then check whether it differs by at most two characters from the original string. However, I am not sure how to do this, and even if it could be done, I think that it would also take too long.
What is the best way to go about this?
An efficient way to solve this is to use dynamic programming.
Let L be the set of letters that start any of the length-2 scoring substrings, and a special letter "*" which stands for any other letter than these.
Let S(i, j, c) be the maximum score possible in the string (up to index i) using j substitutions, where the string ends with character c (where c in L).
The recurrence relations are a bit messy (or at least, I didn't find a particularly beautiful formulation of them), but here's some code that computes the largest score possible:
    infinity = 100000000

    def S1(L1, L2, s, i, j, c, scores, cache):
        key = (i, j, c)
        if key not in cache:
            if i == 0:
                if c != '*' and s[0] != c:
                    v = 0 if j >= 1 else -infinity
                else:
                    v = 0 if j >= 0 else -infinity
            else:
                v = -infinity
                for d in L1:
                    for c2 in [c] if c != '*' else L2 + s[i]:
                        jdiff = 1 if s[i] != c2 else 0
                        score = S1(L1, L2, s, i-1, j-jdiff, d, scores, cache)
                        score += scores.get(d+c2 , 0)
                        v = max(v, score)
            cache[key] = v
        return cache[key]

    def S(s, Q, scores):
        L1 = ''.join(sorted(set(w[0] for w in scores))) + '*'
        L2 = ''.join(sorted(set(w[1] for w in scores)))
        return S1(L1, L2, s + '.', len(s), Q, '.', scores, {})

    print(S('bpdcg', 2, {'bz': 2, 'zd': 5, 'dm': 7, 'ng': 10}))
There's some room for optimisation:
the computation isn't terminated early if j goes negative
when given a choice, every value of L2 is tried, whereas only letters that can complete a scoring word from d need trying.
Overall, if there's k different letters in the scoring words, the algorithm runs in time O(QN*k^2). With the second optimisation above, this can be reduced to O(QNw) where w is the number of scoring words.

Given a palindromic string, in how many ways can we convert it to a non-palindrome by removing one or more characters from it? [closed]

Given a palindromic string, in how many ways can we convert it to a non-palindrome by removing one or more characters from it?
For example if the string is "b99b". Then we can do it in 6 ways,
i) Remove 1st character : "99b"
ii) Remove 1st, 2nd characters : "9b"
iii) Remove 1st, 3rd characters : "9b"
iv) Remove 2nd, 4th characters : "b9"
v) Remove 3rd, 4th characters : "b9"
vi) Remove 4th character : "b99"
How to approach this one?
PS:Two ways are considered different if there exists an i such that character at index i is removed in one way and not removed in another.
There's an O(n^2) dynamic programming algorithm for counting the number of palindromic subsequences of a string; you can use that to count the number of non-palindromic subsequences by subtracting the number of palindromic subsequences from the number of subsequences (which is simply 2^n).
This algorithm counts subsequences by the criterion in the OP; two subsequences are considered different if there is a difference in the list of indices used to select the elements, even if the resulting subsequences have the same elements.
To count palindromic subsequences, we build up the count based on intervals of the sequence. Specifically, we define:
S_{i,j} = the substring of S starting at index i and ending at index j (inclusive)
P_{i,j} = the number of palindromic subsequences of S_{i,j}
Now, every one-element interval is a palindrome, so:
P_{i,i} = 1 for all i < n
If a substring does not begin and end with the same element (i.e., S_i ≠ S_j) then the palindromic subsequences consist of:
Those which contain S_i but do not contain S_j
Those which contain S_j but do not contain S_i
Those which contain neither S_i nor S_j
Now, note that P_{i,j-1} includes both the first and the third set of subsequences, while P_{i+1,j} includes both the second and the third set; P_{i+1,j-1} is precisely the third set. Consequently:
P_{i,j} = P_{i+1,j} + P_{i,j-1} − P_{i+1,j-1} if S_i ≠ S_j
But what if S_i = S_j? In that case, we have to add the palindromes consisting of S_i followed by a subsequence palindrome from S_{i+1,j-1} followed by S_j, as well as the palindromic subsequence consisting of just the start and end characters. (Technically, an empty sequence is a palindrome, but we don't count those here.) The number of subsequences we add is P_{i+1,j-1} + 1, which cancels out the subtracted double count in the above equation. So:
P_{i,j} = P_{i+1,j} + P_{i,j-1} + 1 if S_i = S_j.
In order to save space, we can actually compute P_{i,i+k} for 0 ≤ i < |S|-k for increasing values of k; we only need to retain two of these vectors in order to generate the final result P_{0,|S|-1}.
EDIT:
Here's a little python program; the first one computes the number of palindromic subsequences, as above, and the driver computes the number of non-palindromic subsequences (i.e. the number of ways to remove zero or more elements and produce a non-palindrome; if the original sequence is a palindrome, then it's the number of ways to remove one or more elements.)
    # Count the palindromic subsequences of s
    def pcount(s):
        length = len(s)
        p0 = [0] * (length + 1)
        p1 = [1] * length
        for l in range(1, length):
            for i in range(length - l):
                p0[i] = p1[i]
                if s[i] == s[i + l]:
                    p1[i] += p1[i+1] + 1
                else:
                    p1[i] += p1[i+1] - p0[i+1]
        # The "+ 1" is to account for the empty sequence, which is a palindrome.
        return p1[0] + 1

    # Count the non-palindromic subsequences of s
    def npcount(s):
        return 2**len(s) - pcount(s)
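As a sanity check for short inputs, the same count can be obtained by brute force. The sketch below enumerates every set of kept indices, so it is exponential and only useful for testing; the name `npcount_bruteforce` is an illustrative choice and not part of the answer above.

```python
from itertools import combinations


def npcount_bruteforce(s):
    """Count index subsets of s whose characters, read in order,
    do not form a palindrome (exponential; for small inputs only)."""
    n = len(s)
    count = 0
    for size in range(n + 1):
        for kept in combinations(range(n), size):
            candidate = "".join(s[i] for i in kept)
            if candidate != candidate[::-1]:
                count += 1
    return count


print(npcount_bruteforce("b99b"))  # matches the 6 ways listed in the question
```

Because two selections count as different whenever their index sets differ, this follows the same counting criterion as the question.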
This is not a complete answer, just a suggestion.
I would count the number of ways you can remove one or more characters and keep the string a palindrome, then subtract that from the total number of ways you can modify the string.
The most obvious way to modify a palindrome and keep it a palindrome is to remove the i'th and the (n-i)'th characters (n being the length of the string). There are 2^(n/2) ways you can do that.
The problem with this approach is that it assumes only a symmetric modification can keep the string a palindrome; you need to find a way to handle cases such as "aaaa", where any sort of modification will still result in a palindrome.
Brute force with memoization is pretty straightforward:
    numWays(str): return 0 if str is empty
                  return memo[str] if it exists
                  memo[str] = numWays(str - firstChar) +
                              numWays(str - secondChar) +
                              ... +
                              1 if str is not a palindrome
                  return memo[str]
Basically, you remove every character in turn and save the answer for the resulting string. The more identical characters you have in the string, the faster this is.
I'm not sure how to do it more efficiently, I will update this if I figure it out.
For a string with N elements, there are 2^N possible subsequences (including the whole string and the empty subsequence). Thus we can encode every subsequence by a number with a '1' bit at the bit position of every omitted (or present) character, and a '0' bit otherwise. (This assumes the length of the string is smaller than the number of bits in an int (size_t here); otherwise you would need another representation for the bitstring.)
    #include <stdio.h>
    #include <string.h>

    char string[] = "AbbA";

    int is_palindrome (char *str, size_t len, size_t mask);

    int main(void)
    {
        size_t len, mask, count;

        len = strlen(string);
        count = 0;
        /* Bit i of mask set means character i is removed. Start at 1 (remove at
           least one character) and stop before the all-ones mask (empty string). */
        for (mask = 1; mask < ((size_t)1 << len) - 1; mask++) {
            if ( is_palindrome (string, len, mask)) continue;
            count++;
        }
        fprintf(stderr, "Len:=%u, Count=%u \n"
            , (unsigned) len , (unsigned) count );
        return 0;
    }

    int is_palindrome (char *str, size_t len, size_t mask)
    {
        size_t l, r;

        for (l = 0, r = len - 1; l < r; ) {
            if ( mask & ((size_t)1 << l)) { l++; continue; }  /* skip removed characters */
            if ( mask & ((size_t)1 << r)) { r--; continue; }
            if ( str[l] != str[r] ) return 0;  /* mismatched pair: not a palindrome */
            l++, r--;
        }
        return 1;
    }
Here's a Haskell version:
    import Data.List

    listNonPalindromes string =
      filter (isNotPalindrome) (subsequences string)
      where isNotPalindrome str
              | fst substr == snd substr = False
              | otherwise = True
              where substr = let a = splitAt (div (length str) 2) str
                             in (reverse (fst a), if even (length str)
                                                     then snd a
                                                     else drop 1 (snd a))

    howManyNonPalindromes string = length $ listNonPalindromes string

    *Main> listNonPalindromes "b99b"
    ["b9","b9","b99","9b","9b","99b"]
    *Main> howManyNonPalindromes "b99b"
    6

String permutations rank + data structure

The problem at hand is:
Given a string. Tell its rank among all its permutations sorted
lexicographically.
The question can be attempted mathematically, but I was wondering if there was some other algorithmic method to calculate it ?
Also if we have to store all the string permutations rankwise , how can we generate them efficiently (and what would be the complexity) . What would be a good data structure for storing the permutations and which is also efficient for retrieval?
EDIT
Thanks for the detailed answers on the permutations generation part, could someone also suggest a good data structure? I have only been able to think of trie tree.
There is an O(n|Σ|) algorithm to find the rank of a string of length n in the list of its permutations. Here, Σ is the alphabet.
Algorithm
Every permutation which is ranked below s can be written uniquely in the form pcx; where:
p is a proper prefix of s
c is a character ranked below the character appearing just after p in s. And c is also a character occurring in the part of s not included in p.
x is any permutation of the remaining characters occurring in s; i.e. not included in p or c.
We can count the permutations included in each of these classes by iterating through each prefix of s in increasing order of length, while maintaining the frequency of the characters appearing in the remaining part of s, as well as the number of permutations x represents. The details are left to the reader.
This assumes that the arithmetic operations involved take constant time, which they won't, since the numbers involved can have n log|Σ| digits. Taking this into account, the algorithm will run in O(n^2 log|Σ| log(n log|Σ|)), since we can add, subtract, multiply and divide two d-digit numbers in O(d log d).
C++ Implementation
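    #include <string>
    #include <vector>

    typedef long long int lli;

    // Assumes s consists of lowercase letters a-z. The factorials are stored
    // in 64-bit integers, so this overflows for strings longer than 20 characters.
    lli rank(std::string s){
        int n = s.length();
        std::vector<lli> factorial(n+1,1);
        for(int i = 1; i <= n; i++)
            factorial[i] = i * factorial[i-1];

        std::vector<int> freq(26);
        lli den = 1;   // product of freq[c]! over the suffix seen so far
        lli ret = 0;
        for(int i = n-1; i >= 0; i--){
            int si = s[i]-'a';
            freq[si]++;
            den *= freq[si];
            // Count permutations of the suffix that put a smaller letter c at position i.
            for(int c = 0; c < si; c++)
                if(freq[c] > 0)
                    ret += factorial[n-i-1] / (den / freq[c]);
        }
        return ret + 1;
    }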
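Since the counts quickly exceed 64 bits, a language with arbitrary-precision integers is convenient here. The following Python sketch follows the same prefix/character/rest decomposition, but walks the string left to right and maintains the number of distinct permutations of the remaining characters; the function name `permutation_rank` and this left-to-right bookkeeping are illustrative choices, not a transcription of the C++ code.

```python
from collections import Counter
from math import factorial


def permutation_rank(s):
    """1-based rank of s among its distinct permutations in sorted order."""
    counts = Counter(s)
    remaining = len(s)
    # total = number of distinct permutations of the remaining multiset.
    total = factorial(remaining)
    for v in counts.values():
        total //= factorial(v)

    rank = 1
    for ch in s:
        for c in sorted(counts):
            if c >= ch:
                break
            if counts[c] > 0:
                # Permutations of the remaining characters that start with c.
                rank += total * counts[c] // remaining
        # Fix ch at this position and shrink the multiset.
        total = total * counts[ch] // remaining
        counts[ch] -= 1
        remaining -= 1
    return rank


for word in ["abc", "cba", "bcaa", "caba", "string"]:
    print(word, permutation_rank(word))
```

The divisions are exact because `total * counts[c] / remaining` is itself a count of permutations.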
This is similar to the quickselect algorithm. In an unsorted array of integers, find the index of some particular array element. The partition element would be the given string.
Edit:
Actually it is similar to the partition method used in QuickSort, with the given string as the partition element. Once all permutations are generated, the complexity of finding the rank for strings of length k would be O(nk). You can generate the string permutations using recursion and store them in a linked list, then pass this linked list to the partition method.
Here's the Java code to generate all String permutations:
    private static int generateStringPermutations(String name,int currIndex) {
        int sum = 0;

        for(int j=name.length()-1;j>=0;j--) {
            for(int i=j-1;((i<j) && (i>currIndex));i--) {
                String swappedString = swapCharsInString(name,i,j);
                list.add(swappedString);
                //System.out.println(swappedString);
                sum++;
                sum = sum + generateStringPermutations(swappedString,i);
            }
        }
        return sum;
    }
Edit:
Generating all permutations is costly. If a string contains distinct characters, the rank can be determined without generating all permutations. Here's the link.
This can be extended for cases where there are repeating characters.
Instead of x * (n-1)! which is for distinct cases mentioned as in the link,
For repeating characters it will be:
if there is 1 character which is repeating twice,
x* (n-1)!/2!
Let's take an example. For string abca the combinations are:
aabc,aacb,abac,abca,acab,acba,baac,baca,bcaa,caab,caba,cbaa (in sorted order)
Total combinations = 4!/2! = 12
If we want to find the rank of 'bcaa', then we know all strings starting with 'a' come before it, which is 3! = 6 strings.
Note that because 'a' is the starting character, the remaining characters are a,b,c and there are no repetitions, so it is 3!. We also know that strings starting with 'ba' come before it, which is 2! = 2 strings, so its rank is 9.
Another example. If we want to find the rank of 'caba':
All strings starting with a come before it = 6.
All strings starting with b come before it = 3!/2! = 3 (because once we choose b, we are left with a,a,c, and because of the repetition it is 3!/2!).
All strings starting with caa will be before which is 1
So the final rank is 11.
From GeeksforGeeks:
Given a string, find its rank among all its permutations sorted
lexicographically. For example, rank of “abc” is 1, rank of “acb” is
2, and rank of “cba” is 6.
For simplicity, let us assume that the string does not contain any
duplicated characters.
One simple solution is to initialize rank as 1, generate all
permutations in lexicographic order. After generating a permutation,
check if the generated permutation is same as given string, if same,
then return rank, if not, then increment the rank by 1. The time
complexity of this solution will be exponential in worst case.
Following is an efficient solution.
Let the given string be “STRING”. In the input string, ‘S’ is the
first character. There are total 6 characters and 4 of them are
smaller than ‘S’. So there can be 4 * 5! smaller strings where first
character is smaller than ‘S’, like following
R X X X X X
I X X X X X
N X X X X X
G X X X X X
Now let us fix ‘S’ and find the smaller strings starting with ‘S’.
Repeat the same process for T, rank is 4*5! + 4*4! +…
Now fix T and repeat the same process for R, rank is 4*5! + 4*4! +
3*3! +…
Now fix R and repeat the same process for I, rank is 4*5! + 4*4! +
3*3! + 1*2! +…
Now fix I and repeat the same process for N, rank is 4*5! + 4*4! +
3*3! + 1*2! + 1*1! +…
Now fix N and repeat the same process for G, rank is 4*5! + 4*4! + 3*3!
+ 1*2! + 1*1! + 0*0!
Rank = 4*5! + 4*4! + 3*3! + 1*2! + 1*1! + 0*0! = 597
Since the value of rank starts from 1, the final rank = 1 + 597 = 598

How to find all combinations of a multiset in a string in linear time?

I am given a bag B (multiset) of characters with the size m and a string text S of size n. Is it possible to find all substrings that can be created by B (4!=24 combinations) in S in linear time O(n)?
Example:
S = abdcdbcdadcdcbbcadc (n=19)
B = {b, c, c, d} (m=4)
Result: {cdbc (Position 3), cdcb (Position 10)}
The fastest solution I found is to keep a counter for each character and compare it with the Bag in each step, thus the runtime is O(n*m). Algorithm can be shown if needed.
There is a way to do it in O(n), assuming we're only interested in substrings of length m (otherwise it's impossible, because for the bag that has all characters in the string, you'd have to return all substrings of s, which means a O(n^2) result that can't be computed in O(n)).
The algorithm is as follows:
Convert the bag to a histogram:
    hist = []
    for c in B do:
        hist[c] = hist[c] + 1
Initialize a running histogram that we're going to modify (histrunsum is the total count of characters in histrun):
    histrun = []
    histrunsum = 0
We need two operations: add a character to the histogram and remove it. They operate as follows:
    add(c):
        if hist[c] > 0 and histrun[c] < hist[c] then:
            histrun[c] = histrun[c] + 1
            histrunsum = histrunsum + 1
    remove(c):
        if histrun[c] > 0 then:
            histrun[c] = histrun[c] - 1
            histrunsum = histrunsum - 1
Essentially, histrun captures the amount of characters that are present in B in current substring. If histrun is equal to hist, our substring has the same characters as B. histrun is equal to hist iff histrunsum is equal to length of B.
Now add first m characters to histrun; if histrunsum is equal to length of B; emit first substring; now, until we reach the end of string, remove the first character of the current substring and add the next character.
add, remove are O(1) since hist and histrun are arrays; checking if hist is equal to histrun is done by comparing histrunsum to length(B), so it's also O(1). Loop iteration count is O(n), the resulting running time is O(n).
Thanks for the answer. The add() and remove() methods have to be changed to make the algorithm work correctly.
    add(c):
        if hist[c] > 0 and histrun[c] < hist[c] then
            histrunsum++
        else
            histrunsum--
        histrun[c] = histrun[c] + 1
    remove(c):
        if histrun[c] > hist[c] then
            histrunsum++
        else
            histrunsum--
        histrun[c] = histrun[c] - 1
Explanation:
histrunsum can be seen as a score of how identical both multisets are.
add(c): when there are fewer occurrences of a char in the histrun multiset than in the hist multiset, the additional occurrence of that char has to be "rewarded", since the histrun multiset is getting closer to the hist multiset. If the histrun multiset already holds at least as many occurrences of that char as hist, an additional one is weighted negatively.
remove(c): like add(c), where the removal of a char is weighted positively when its count in the histrun multiset is greater than its count in the hist multiset.
Sample Code (PHP):
    function multisetSubstrings($sequence, $mset)
    {
        $multiSet = array();
        $substringLength = 0;
        foreach ($mset as $char)
        {
            $multiSet[$char]++;
            $substringLength++;
        }

        $sum = 0;
        $currentSet = array();
        $result = array();

        for ($i=0;$i<strlen($sequence);$i++)
        {
            if ($i>=$substringLength)
            {
                $c = $sequence[$i-$substringLength];

                if ($currentSet[$c] > $multiSet[$c])
                    $sum++;
                else
                    $sum--;

                $currentSet[$c]--;
            }

            $c = $sequence[$i];

            if ($currentSet[$c] < $multiSet[$c])
                $sum++;
            else
                $sum--;

            $currentSet[$c]++;

            echo $sum."<br>";

            if ($sum==$substringLength)
                $result[] = $i+1-$substringLength;
        }

        return $result;
    }
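The same sliding-window idea can be sketched in Python. Note that this sketch uses a slightly different score from the code above: instead of rewarding and penalising, it tracks `matched`, the number of window characters that are still needed by the bag, which equals the bag size exactly when the window is a permutation of the bag. The names `multiset_substrings` and `matched` are illustrative choices.

```python
from collections import Counter


def multiset_substrings(text, bag):
    """Start positions (0-based) of all substrings of text that use
    exactly the characters of bag, in any order."""
    m = len(bag)
    if m == 0 or m > len(text):
        return []
    need = Counter(bag)
    window = Counter()
    matched = 0  # sum over characters of min(window[c], need[c])
    result = []
    for i, ch in enumerate(text):
        if i >= m:
            old = text[i - m]
            # Removing old lowers the score only if it was a needed occurrence.
            if window[old] <= need[old]:
                matched -= 1
            window[old] -= 1
        # Adding ch raises the score only if the bag still needs it.
        if window[ch] < need[ch]:
            matched += 1
        window[ch] += 1
        if matched == m:
            result.append(i - m + 1)
    return result


print(multiset_substrings("abdcdbcdadcdcbbcadc", "bccd"))
```

Each character is added once and removed once, and `Counter` lookups are constant time on average, so the whole scan is O(n).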
Use hashing. For each character in the multiset, assign a UNIQUE prime number. Compute the hash of any string by multiplying together the prime numbers associated with its characters, each taken as many times as that character occurs.
Example: CATTA. Let C = 2, A = 3, T = 5. Hash = 2*3*5*5*3 = 450
Hash the multiset (treat it as a string). Now go through the input string, and compute the hash of each substring of length k (where k is the number of characters in the multiset). Check if this hash matches the multiset hash. If yes, then it is one such occurrence.
The hashes can be computed very easily in linear time as follows :
Let multiset = { A, A, B, C }, A=2, B=3, C=5.
Multiset hash = 2*2*3*5 = 60
Let text = CABBAACCA
(i) CABB = 5*2*3*3 = 90
(ii) Now, the next letter is A, and the letter discarded is the first one, C. So the new hash = ( 90/5 )*2 = 36
(iii) Now, A is discarded, and A is also added, so new hash = ( 36/2 ) * 2= 36
(iv) Now B is discarded, and C is added, so hash = ( 36/3 ) * 5 = 60 = multiset hash. Thus we have found one such required occurence - BAAC
This procedure takes O(n) multiplications and divisions, i.e. O(n) time as long as each of them can be treated as a constant-time operation.
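Here is a short Python sketch of the rolling prime-product hash described above. It assigns primes to every distinct character of the text and the multiset (so characters outside the multiset can also be divided out again), and it relies on Python's arbitrary-precision integers, so the product never overflows. The helper names `first_primes` and `prime_hash_matches` are illustrative choices.

```python
def first_primes(count):
    """Return the first `count` primes by simple trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def prime_hash_matches(text, bag):
    """Start positions (0-based) of windows whose prime-product hash
    equals the hash of bag."""
    m = len(bag)
    if m == 0 or m > len(text):
        return []
    alphabet = sorted(set(text) | set(bag))
    prime_of = dict(zip(alphabet, first_primes(len(alphabet))))

    target = 1
    for ch in bag:
        target *= prime_of[ch]

    window = 1
    for ch in text[:m]:
        window *= prime_of[ch]

    result = [0] if window == target else []
    for i in range(m, len(text)):
        # Divide out the character that leaves, multiply in the one that enters.
        window = window // prime_of[text[i - m]] * prime_of[text[i]]
        if window == target:
            result.append(i - m + 1)
    return result


print(prime_hash_matches("CABBAACCA", "AABC"))  # the window BAAC
```

By unique factorisation, two windows have the same product exactly when they contain the same characters with the same multiplicities, so there are no false positives as long as the products are computed exactly.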

Resources