Minimum number of operations required to create string A by appending subsequences of string B to an empty string C

You are given two strings A and B, and an empty string C. In one operation you can take any number of characters (from anywhere) in string B, keeping their order, and append them to string C. Find the minimum number of operations required to turn string C into string A.
e.g if
A is "ABCDE" and
B is "ABDEC" then
In 1st operation you will choose subsequence ABC from B and in 2nd operation DE.
So two operations are required.
if
A is "ABCDE"
B is "EDCBA" then
5 operations are required.
Linear complexity, O(n), is expected.

Just use a greedy algorithm.
1 - Let i = 0
2 - Let j = 0
3 - Search for the first occurrence of A[i] in B at or after index j
4 - If it exists, let j be its index in B, remove it from B, append it to C, increment i, and repeat from 3
5 - If it doesn't exist, repeat from 2
Each time you get to 5 a new operation begins, so the answer is the number of times you reach 5 plus one for the initial pass.
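The greedy steps above can be sketched in Python as follows. This is only an illustration: the function name is made up, and instead of physically removing characters from B it advances j just past each match, which has the same effect for a single pass. It also assumes every character of A occurs somewhere in B and raises an error otherwise.

```python
def min_operations_greedy(a: str, b: str) -> int:
    """Count the passes over b needed to spell a (hypothetical helper)."""
    if not a:
        return 0
    operations = 1   # the first pass counts as one operation
    j = 0            # next index in b that the current pass may use
    for ch in a:
        pos = b.find(ch, j)
        if pos == -1:
            # Step 5: nothing left in this pass, start a new one from index 0.
            pos = b.find(ch)
            if pos == -1:
                raise ValueError(f"{ch!r} does not occur in b")
            operations += 1
        j = pos + 1
    return operations


print(min_operations_greedy("ABCDE", "ABDEC"))  # 2
print(min_operations_greedy("ABCDE", "EDCBA"))  # 5
```

Because str.find rescans B, this version is not linear in the worst case.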
Assuming all the characters of A (and B) are different, then here is a solution with linear complexity. You need a hashmap or something similar, as well as an array of indices, Y, of equal length to A and B.
1 - Put each character of A in the hashmap as key, with its index as value.
2 - Look up each character of B in the hashmap to get the value i, and put its index into Y at the position i.
3 - Go through Y counting the number of times that Y[i] < Y[i-1]. Each such drop starts a new operation, so the number of operations is that count plus one.
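A minimal Python sketch of these three steps, assuming (as the answer does) that all characters are distinct and that A and B contain the same characters. The function name is hypothetical. For A = "ABCDE" and B = "ABDEC" the array Y is [0, 1, 4, 2, 3], which has one drop, so two operations are needed.

```python
def min_operations_distinct(a: str, b: str) -> int:
    """Linear-time count, assuming a and b are permutations of distinct characters."""
    if not a:
        return 0
    # Step 1: character of A -> its index in A.
    index_in_a = {ch: i for i, ch in enumerate(a)}
    # Step 2: Y[i] is the position in B of the character A[i].
    y = [0] * len(a)
    for pos_in_b, ch in enumerate(b):
        y[index_in_a[ch]] = pos_in_b
    # Step 3: every drop Y[i] < Y[i-1] forces a new operation.
    drops = sum(1 for i in range(1, len(y)) if y[i] < y[i - 1])
    return drops + 1


print(min_operations_distinct("ABCDE", "ABDEC"))  # 2
print(min_operations_distinct("ABCDE", "EDCBA"))  # 5
```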

Related

linear time algorithm for finding most frequent m-letter substring in a string

Suppose we have an n-letter string and we are searching for the most repeated m-letter substring (1 <= m <= n).
I am looking for an algorithm which solves this problem in linear time, and I have arrived at suffix trees. But how can I solve it with a suffix tree?
Thanks a lot.
Idea
You can also solve it with hash function.
We can convert strings to base b numbers where b is a prime number.
For example, if the strings only consist of lowercase alphabet (26 characters a-z) then we can choose b equals 29.
Then we map string characters to corresponding numbers. For example:
a -> 1
b -> 2
c -> 3
...
z -> 26
The string abc then equals 29^2*1 + 29^1*2 + 29^0*3 = 902 when read as a base-29 number.
We map a -> 1 rather than a -> 0 because otherwise aaa and aa would have the same hash value (0) in base b, which shouldn't happen.
Now, instead of comparing two strings, we can compare their hash values in base b. If their hash values are equal, we treat the strings as equal.
Since the hash value can be very large, you can take it modulo a large prime number, for example mod 1e9+7. The probability that two different strings have the same hash value is very low in this case.
Algorithm
The algorithm can be described as below:
Let n-letter string be S
Let hash(s) be function to get hash value of string s
For each m-letter substring of S, call it s:
  Increase the number of occurrences of hash(s); call this count o(hash(s))
The result is the m-letter substring s with the maximum o(hash(s))
To calculate hash(s), first we build array H where:
H[i] = (b^(i-1)*S[1] + b^(i-2)*S[2] + b^(i-3)*S[3] + ... + b^0*S[i]) % mod
Here S[i] is the number mapped to the i-th character of string S.
To calculate b^x, we can calculate array powb where:
powb[0] = 1; powb[i] = (powb[i - 1] * b) % mod
Then for a substring s[l..r] of string S,
hash(s[l..r]) = (H[r] - H[l-1]*b^(r-l+1)) % mod
As we can see, hash(s) can be negative, in this case we should add mod to hash(s) (hash(s) += mod).
Complexity
O(N) to calculate H, powb
O(N) to iterate every substring s
For each s
O(1) to calculate hash(s)
O(log(N)) to calculate total occurrences of hash value (C++ map)
Total complexity: O(N log N)
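As an illustration, here is a small Python sketch of the prefix-hash approach. The function name and defaults are assumptions made for the example: it expects only lowercase a-z, uses b = 29 and mod = 1e9+7 as suggested above, and counts with a hash table instead of a C++ map, so the counting step is expected O(1) rather than O(log N). Like the description, it trusts equal hash values and does not check for collisions.

```python
from collections import Counter


def most_frequent_substring(s: str, m: int, b: int = 29, mod: int = 10**9 + 7) -> str:
    """Return an m-letter substring of s with the most occurrences (by hash)."""
    n = len(s)
    if not 1 <= m <= n:
        raise ValueError("m must satisfy 1 <= m <= len(s)")
    # h[i] is the hash of the prefix s[:i]; letters map to a -> 1 ... z -> 26.
    h = [0] * (n + 1)
    for i, ch in enumerate(s):
        h[i + 1] = (h[i] * b + (ord(ch) - ord("a") + 1)) % mod
    pow_m = pow(b, m, mod)  # b^m mod `mod`, same for every window of length m
    counts = Counter()
    first_start = {}
    for left in range(n - m + 1):
        # Python's % never returns a negative value, so no "+= mod" fix-up is needed.
        value = (h[left + m] - h[left] * pow_m) % mod
        counts[value] += 1
        first_start.setdefault(value, left)
    best = max(counts, key=counts.get)
    start = first_start[best]
    return s[start:start + m]


print(most_frequent_substring("abcabcab", 2))  # ab
```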

Why do we return max({count,v1,v2}) in Longest Common Substring

Why do we return max({count,v1,v2}) in longest common substring? I did longest common subsequence first and then longest common substring, but got confused: in longest common subsequence, if x[n-1] == y[m-1] I return right away, but in longest common substring I don't return a value; instead I wait for the other 2 recursive calls and then return the max. I'm really confused.
In the LC-Subsequence algorithm, we increase the count in the caller, because the counts can be aggregated from levels further down the recursion tree that need not be immediate children. In other words, we can skip levels until we find the next match, and the count is the total over all the levels where we found a match. Here we don't look for discontinuation.
ex: x: a,b,c,d,e,f
y: a,b,f,c,e
lc-subsequence: a,b,c,e
max: 4
here you can jump from b -> c in string y. Similarly, c -> e in string x.
But in the case of the LC-Substring, we pass the count down to the next level, since we want to increase the count if and only if we have a match at the immediate next level. That is, here we do look for discontinuation.
ex: x: a,b,c,d,e,f
y: a,b,f,c,e
lc-substring: a,b | c | e
max: 2
here you can't jump from b -> c in string y or from c -> e in string x.
Therefore, count represents the length of the run that matches exactly in both strings x and y. That's why you take the max of (count,v1,v2).
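To make the contrast concrete, here is a sketch of both recursions in Python, written the way the question describes them (plain recursion, no memoization, so only suitable for short inputs). The function names are made up for the example; count, v1 and v2 follow the question.

```python
def lc_subsequence(x: str, y: str, n: int, m: int) -> int:
    """Longest common subsequence: a match adds 1 in the caller and returns at once."""
    if n == 0 or m == 0:
        return 0
    if x[n - 1] == y[m - 1]:
        return 1 + lc_subsequence(x, y, n - 1, m - 1)
    return max(lc_subsequence(x, y, n - 1, m), lc_subsequence(x, y, n, m - 1))


def lc_substring(x: str, y: str, n: int, m: int, count: int = 0) -> int:
    """Longest common substring: count is the length of the current unbroken run."""
    if n == 0 or m == 0:
        return count
    if x[n - 1] == y[m - 1]:
        # Extend the run only through the immediately preceding pair of characters.
        count = lc_substring(x, y, n - 1, m - 1, count + 1)
    # A longer run may start elsewhere, so these calls restart the run at 0.
    v1 = lc_substring(x, y, n - 1, m, 0)
    v2 = lc_substring(x, y, n, m - 1, 0)
    return max(count, v1, v2)


x, y = "abcdef", "abfce"
print(lc_subsequence(x, y, len(x), len(y)))  # 4
print(lc_substring(x, y, len(x), len(y)))    # 2
```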

Number of substrings of a string with given count of each character

Given a string (s) and an integer k, we need to find the number of substrings in which every distinct character occurs exactly k times.
Example: s = "aabbcc", k = 2
Output : 6
The substrings [aa, bb, cc, aabb, bbcc and aabbcc] contains distinct characters with frequency 2.
The approach I can think of is to traverse through all sub strings and store frequency of current sub string, and increment the result when frequency equals k. This will result in worst case complexity of O(n*n), where n is the length of the string s.
Is there any better approach for this problem?
We can solve this in O(n * log(size_of_alphabet)). Let f(i) represent the number of valid substrings ending at the ith character. Then:
  f(i) ->
    1 + f(j - 1)
      where j is the rightmost index smaller
      than or equal to i where s[j..i] is a
      valid substring and (j - 1) is inside
      the current window. Call s[j..i] the
      "minimal" valid substring ending at
      index i.
An invariant for our window is that if a character is seen k + 1 times, we move the left bound just past that character's leftmost instance in the window. This guarantees that any two substrings in a string of concatenated, valid substrings in the current window cannot have a shared character, and thus remain a valid concatenation.
Each time we reach the kth instance of character c, the rightmost index smaller than or equal to i where s[j..i] is a valid substring must start to the right of all characters in the window whose count is less than k. To find the rightmost such index, we may also need to move ahead of valid neighbouring substrings already seen in the window.
To find that index, we can maintain a max indexed-heap that stores the rightmost instance of each distinct character in our window currently with counts less than k, prioritised by their index, such that our j is always to the right of the heap's root (or the heap is empty). The heap is indexed, which allows us to remove specific elements in O(log(size_of_alphabet)).
We also keep the right and left boundary indexes of valid minimal substrings already seen in the window. We can use a double-ended queue for that for O(1) updates, since a valid substring can appear to the right of another or envelop existing ones. And we keep a hashmap of the left boundaries for O(1) lookup.
Additionally, we must keep a count of each distinct character in the window in order to maintain our invariant, no such count above k, and their leftmost index in the window for the valid substring precondition.
Procedure:
for each index i in s:
  let c be the character s[i]

  if s[i] is the (k+1)th instance of c in the window:
    move the left bound of the window
    just past the leftmost instance of
    c in the window, removing all
    elements in the heap whose rightmost
    instance we passed while updating
    our window; and adding to the heap
    the rightmost instance of characters
    whose count has fallen below k
    as we move the left bound of
    the window. If the boundary moves
    past the left bound of valid minimal
    substrings, remove their boundaries
    from the queue, and their left bound
    from the hashmap.

  if s[i] is the kth instance of c:
    remove the previous instance of c
    from the heap.

    if the leftmost instance of c in the
    window is to the right of the heap
    root:
      if (root_index + 1) is the
      left bound of a valid minimal
      substring in our queue:
        we must be adding to the right
        of all of them, so add a new
        valid minimal substring, starting
        at the next index after the
        rightmost of those that ends
        at i
      otherwise:
        add a new valid minimal substring,
        starting at (root_index + 1)
        and ending at i

  otherwise:
    remove the previous instance of c
    in the heap and insert this one.
For example:
  01234567
  acbbaacc  k = 2

0 a  heap: (0 a)
1 c  heap: (1 c) <- (0 a)
2 b  heap: (2 b) <- (1 c) <- (0 a)
3 b  kth instance, remove (2 b)
     heap: (1 c) <- (0 a)
     leftmost instance of b is to the
     right of the heap root.
     check root + 1 = 2, which points
     to a new valid substring, add the
     substring to the queue
     queue: (2, 3)
     result: 1 + 0 = 1
4 a  kth instance, remove (0 a)
     heap: (1 c)
     queue: (2, 3)
     result: 1
     leftmost instance of a is left
     of the heap root so continue
5 a  (k+1)th instance, move left border
     of the window to index 1
     heap: (1 c)
     queue: (2, 3)
     result: 1
     (5 a) is now the kth instance of
     a and its leftmost instance is to
     the right of the heap root.
     check root + 1 = 2, which points
     to a valid substring in the queue,
     add new substring to queue
     heap: (1 c)
     queue: (2, 3) -> (4, 5)
     result: 1 + 1 + 1 = 3
6 c  kth instance, remove (1 c)
     heap: empty
     add new substring to queue
     queue: (1) -> (2, 3) -> (4, 5) -> (6)
     (for simplicity, the queue here
     is not labeled; labels may be needed
     for the split intervals)
     result: 3 + 1 + 0 = 4
7 c  (k+1)th instance, move left border
     of the window to index 2, update queue
     heap: empty
     queue: (2, 3) -> (4, 5)
     result: 4
     (7 c) is now the kth instance of c
     heap: empty
     add new substring to queue
     queue: (2, 3) -> (4, 5) -> (6, 7)
     result: 4 + 1 + 2 = 7
The length of any such substring has to be an exact multiple of k. This slashes the depth of the search.
(Indeed, the length can only be k multiplied by one of the integers from 1 up to the count of distinct characters.)
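One way to use this length observation is a fixed-size sliding window for each candidate length d*k, where d runs from 1 to the number of distinct characters in s. The sketch below is an illustration under that reading, not necessarily what the answerer had in mind; the function name is hypothetical and it assumes k >= 1. A window of length d*k is valid exactly when d of its characters have count k, since those characters already fill the whole window. The running time is O(n * distinct_characters).

```python
def count_substrings_exact_k(s: str, k: int) -> int:
    """Count substrings in which every distinct character occurs exactly k times."""
    n = len(s)
    total = 0
    for d in range(1, len(set(s)) + 1):
        length = d * k
        if length > n:
            break
        freq = {}
        exact = 0  # number of characters whose count in the window is exactly k

        def bump(ch: str, delta: int) -> None:
            nonlocal exact
            old = freq.get(ch, 0)
            if old == k:
                exact -= 1
            freq[ch] = old + delta
            if freq[ch] == k:
                exact += 1

        for i, ch in enumerate(s):
            bump(ch, 1)                    # character entering the window
            if i >= length:
                bump(s[i - length], -1)    # character leaving the window
            if i >= length - 1 and exact == d:
                total += 1
    return total


print(count_substrings_exact_k("aabbcc", 2))    # 6
print(count_substrings_exact_k("acbbaacc", 2))  # 7
```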

Linear space data-structure supporting subsequence query on a static string

Build a data-structure from a given string S of length n which supports fast queries for checking whether an input string J of length m is a subsequence of S.
S is a static string and pre-processing time of the data-structure can be ignored.
Requirements:
The space consumption should be linear O(n)
The runtime of subsequence(J) should depend on m - not necessarily O(m) but, the faster the better.
What is subsequence?
A is a subsequence of B if A can be constructed by removing zero or more characters from B. For example, ABA is a subsequence of ADBDBAC.
What I tried
A data-structure which supports the Subsequence(J) query stores pointers from each letter in S to the next occurrence in S of every letter in the alphabet.
Let A be an array of length n + 1. A contains hash-tables hashed over the alphabet, σ. Each key-value pair (k,v) in a hash-table contains some letter k as key and the index of its next occurrence as value v.
The hash-table A_0 contains the first occurrence of every letter in the alphabet.
The hash-table A_1 contains the index of second occurrence for the letter at S_0 along with the first occurrence of the other letters.
The hash-table A_2 contains the index of the second occurrence for the letters S_0 and S_1, assuming they are different letters (otherwise A_2 will contain the third index of the letter at S_0), along with the first occurrence of the other letters, and so on...
Example: If T is B C A D B, ¥ represents the hashtable A_0 and Ø represents a null pointer, the data-structure would look like:
|0 1 2 3 4 5
|¥ B C A D B
A|3 3 3 Ø Ø Ø
B|1 5 5 5 5 Ø
C|2 2 Ø Ø Ø Ø
D|4 4 4 4 Ø Ø
The alphabet σ is built from the letters in T and is static. Therefore, perfect hashing (FKS) can be used.
Running the query
To perform the Subsequence(J) query with the string J, we look up the index of the first occurrence of J_0 in S using A_0.
In the example we could query Subsequence("BAB") to test if BAB is a subsequence:
* look-up B in column 0 which returns index 1
* look-up A in column 1 which returns index 3
* look-up B in column 3 which returns index 5
As long as we never hit a null pointer, the string is a subsequence. The hash lookups take constant time and we have to perform at most |J| of them, so the runtime is O(|J|).
The space consumption is O(|σ|·|S|)
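For reference, the next-occurrence structure described in this attempt can be sketched in Python as below. The function names are made up, and ordinary dicts stand in for the perfect hash tables. Positions are 1-based as in the table above, with table[0] playing the role of ¥. Each table[i] is a separate copy holding up to one entry per letter, which is where the non-linear space comes from.

```python
def build_next_table(s: str) -> list:
    """table[i][c] = smallest 1-based position p > i with s[p-1] == c, if any."""
    n = len(s)
    table = [None] * (n + 1)
    nxt = {}
    for i in range(n, -1, -1):
        table[i] = dict(nxt)      # snapshot of the next occurrences after position i
        if i > 0:
            nxt[s[i - 1]] = i     # position i becomes visible to table[i-1]
    return table


def is_subsequence_table(table: list, j: str) -> bool:
    """Follow the pointers; a missing entry is the null pointer."""
    pos = 0
    for ch in j:
        pos = table[pos].get(ch)
        if pos is None:
            return False
    return True


table = build_next_table("BCADB")
print(table[0])                            # {'B': 1, 'D': 4, 'A': 3, 'C': 2}
print(is_subsequence_table(table, "BAB"))  # True
```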
The simple and slow way to check whether or not J is a subsequence of S is:
1 - Start at the beginning of S
2 - For each character c in J, in order, move forward in S to the next occurrence of c.
3 - If and only if you make it to the end of J and find a match for every character, J is a subsequence of S.
You can accelerate these searches by building a map from each character that occurs in S to a sorted array of the positions at which that character occurs.
Then, to find the next occurrence of a character in step 2, you can look up the position array for that character and do a binary search in the array for the next occurrence after the current position.
Total worst-case complexity to do a subsequence check would be O(m log n).
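A short Python sketch of this position-array idea, using the standard bisect module for the binary search. The function names are hypothetical. The index stores each position of S exactly once, so it takes O(n) space.

```python
from bisect import bisect_left
from collections import defaultdict


def build_positions(s: str) -> dict:
    """Map each character of s to the sorted list of indices where it occurs."""
    positions = defaultdict(list)
    for i, ch in enumerate(s):
        positions[ch].append(i)   # indices are appended in increasing order
    return positions


def is_subsequence_bisect(j: str, positions: dict) -> bool:
    """Check whether j is a subsequence of the indexed string in O(m log n)."""
    current = 0   # smallest index of s that may still be used
    for ch in j:
        occurrences = positions.get(ch)
        if not occurrences:
            return False
        k = bisect_left(occurrences, current)   # first occurrence at index >= current
        if k == len(occurrences):
            return False
        current = occurrences[k] + 1
    return True


index = build_positions("ADBDBAC")
print(is_subsequence_bisect("ABA", index))  # True
print(is_subsequence_bisect("CA", index))   # False
```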

Algorithm to form a given pattern using some strings

Given are 6 strings of any length. The words are to be arranged in the pattern shown below. They can be arranged either vertically or horizontally.
--------
|      |
|      |
|      |
---------------
       |      |
       |      |
       |      |
       --------
The pattern need not be symmetric, and there must be two empty areas as shown.
For example:
Given strings
PQF
DCC
ACTF
CKTYCA
PGYVQP
DWTP
The pattern can be
DCC...
W.K...
T.T...
PGYVQP
..C..Q
..ACTF
where dots represent empty areas.
The other example is
RVE
LAPAHFUIK
BIRRE
KZGLPFQR
LLHU
UUZZSQHILWB
Pattern is
LLHU....
A..U....
P..Z....
A..Z....
H..S....
F..Q....
U..H....
I..I....
KZGLPFQR
...W...V
...BIRRE
If multiple patterns are possible then pattern with lexicographically smallest first line, then second line and so on is to be formed. What algorithm can be used to solve this?
Find strings which satisfy these constraints:
strlen(a) + strlen(b) - 1 = strlen(c)
strlen(d) + strlen(e) - 1 = strlen(f)
After that, try every possible arrangement and check whether it is valid. For example:
aaa.....
d.f.....
d.f.....
d.f.....
cccccccc
..f....e
..f....e
..bbbbbb
There will be 2*2*2 = 8 different arrangements.
There are a number of heuristics that you can apply, but before that, let's go over some properties of the puzzle.
+aa+
c  f
+ee+eee+
   f   d
   +bbb+
Let each letter also denote the length of the string drawn with that letter in the diagram above. We have:
a + b - 1 = e
c + d - 1 = f
I will refer to the 2 strings for the cross in the middle as middle strings.
We also infer that the length of the string cannot be less than 2. Therefore, we can infer:
e > a, e > b
f > c, f > d
From this, we know that the 2 shortest strings cannot be middle strings, due to the inequality above.
The 3 largest strings also cannot all have equal length: after choosing any one of them as a middle string, the 2 that are left are as long as it, which is impossible according to the inequality above.
The puzzle is only tricky when the lengths are regular. When the lengths are irregular, you can do direct mapping from length to position.
If we have the 2 largest strings being equal, due to the inequality above, they are the 2 middle strings. The worst case for this one is a "regular" puzzle, where the length a, b, c, d are equal.
If the 2 largest strings are unequal, the largest string's position can be determined immediately (since its length is unique in the puzzle) - as one of the middle string. In worst case, there can be 3 candidates for the other middle string - just brute force and check all of them.
Algorithm:
Try to map unique length string to the position.
Brute force the 2 strings in the middle (taking into consideration what I mentioned above), and brute force to fill in the rest.
Even with stupid brute force, there are only 6! = 720 cases, if the string can only go from left to right, up to down (no reverse). There will be 46080 cases (* 2^6) if the string is allowed to be in any direction.
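As a rough illustration of the plain brute force over all 6! assignments, here is a Python sketch. It rests on several assumptions of mine: words are only read left to right and top to bottom, every word has at least 3 letters (as in the examples, so the two empty areas exist automatically), candidate grids are compared line by line as plain strings (so '.' sorts before letters), and the function and role names are made up.

```python
from itertools import permutations


def solve_pattern(words):
    """Return the lexicographically smallest valid grid as a list of rows, or None."""
    best = None
    for top, left, mid_h, mid_v, right, bottom in permutations(words):
        width, height = len(mid_h), len(mid_v)
        # Length constraints: the two middle words span the whole grid.
        if len(top) + len(bottom) - 1 != width:
            continue
        if len(left) + len(right) - 1 != height:
            continue
        col, row = len(top) - 1, len(left) - 1   # where the middle words cross
        grid = [["."] * width for _ in range(height)]
        # (word, start row, start column, row step, column step)
        placements = [
            (top, 0, 0, 0, 1),
            (left, 0, 0, 1, 0),
            (mid_h, row, 0, 0, 1),
            (mid_v, 0, col, 1, 0),
            (right, row, width - 1, 1, 0),
            (bottom, height - 1, col, 0, 1),
        ]
        ok = True
        for word, r, c, dr, dc in placements:
            for k, ch in enumerate(word):
                cell = grid[r + dr * k][c + dc * k]
                if cell not in (".", ch):   # crossing letters must agree
                    ok = False
                    break
                grid[r + dr * k][c + dc * k] = ch
            if not ok:
                break
        if ok:
            rows = ["".join(line) for line in grid]
            if best is None or rows < best:
                best = rows
    return best


result = solve_pattern(["PQF", "DCC", "ACTF", "CKTYCA", "PGYVQP", "DWTP"])
print("\n".join(result))
```

For the first example in the question this prints the grid shown there, starting with DCC... on the first line.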

Resources