Find the minimal lexicographical string formed by merging two strings - string

Suppose we are given two strings s1 and s2 (both lowercase). We have to find the lexicographically minimal string that can be formed by merging the two strings.
At first it looks pretty simple, like the merge step of the mergesort algorithm. But let us see what can go wrong.
s1: zyy
s2: zy
Now if we merge these two, we must decide which z to pick, since they are equal. If we pick the z of s2 first, the string formed will be:
zyzyy
If we pick z of s1 first, the string formed will be:
zyyzy, which is correct.
As we can see, the merge step of mergesort can lead to a wrong answer.
Here's another example:
s1: zyy
s2: zyb
Now the correct answer is zybzyy, which is obtained only if we pick the z of s2 first.
There are plenty of other cases in which the simple merge will fail. My question is: is there a standard algorithm for performing a merge that produces such output?

You could use dynamic programming. In f[x][y], store the lexicographically minimal string formed by taking x characters from the first string s1 and y characters from the second string s2. You can calculate f bottom-up using the update:
f[x][y] = min(f[x-1][y] + s1[x], f[x][y-1] + s2[y])   // '+' denotes the concatenation of a string and a character
You start with f[0][0] = "" (the empty string). Here s1[x] and s2[y] are 1-indexed, and a term is skipped when x = 0 or y = 0.
For efficiency you can store the strings in f as references. That is, you can store in f objects of the form
class StringRef {
    StringRef prev;
    char c;
}
To extract the string at a given f[x][y], you just follow the references back. To update, you point back to either f[x-1][y] or f[x][y-1], depending on what your update step says.
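As a rough illustration of the recurrence, here is a minimal Python sketch. It stores full strings in the table rather than back-references, so it favours clarity over memory use; the function name `min_merge_dp` is made up for this example.

```python
def min_merge_dp(s1, s2):
    """Lexicographically smallest merge of s1 and s2 via the f[x][y] recurrence."""
    n, m = len(s1), len(s2)
    # f[x][y] = smallest string using the first x chars of s1 and first y chars of s2
    f = [[None] * (m + 1) for _ in range(n + 1)]
    f[0][0] = ""
    for x in range(n + 1):
        for y in range(m + 1):
            if x == 0 and y == 0:
                continue
            candidates = []
            if x > 0:
                candidates.append(f[x - 1][y] + s1[x - 1])  # last char taken from s1
            if y > 0:
                candidates.append(f[x][y - 1] + s2[y - 1])  # last char taken from s2
            f[x][y] = min(candidates)
    return f[n][m]


print(min_merge_dp("zyy", "zy"))
print(min_merge_dp("zyy", "zyb"))
```

Taking the minimum of the extended strings is safe here because every candidate for a given cell has the same length x + y.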

It seems that the solution can be almost the same as the one you described (the "mergesort"-like approach), except with special handling of equality. As long as the current characters of both strings are equal, you look ahead at the second character, the third, and so on. If the end of one string is reached, treat the first character of the other string as the next character of the exhausted string, and continue in the same way for the following characters. If the ends of both strings are reached, it doesn't matter which string you take the first character from. Note that this algorithm is O(N) because after a look-ahead on equal prefixes you know the whole look-ahead sequence (i.e. the string prefix) to include, not just the first character.
EDIT: you look ahead as long as the current i-th characters of both strings are equal and alphabetically not larger than the first character of the current prefix.
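A simpler way to break ties, sketched below, is to compare the whole remaining suffixes after appending a sentinel that sorts after every letter. This is a swapped-in variant, not the exact look-ahead described above, and because it compares slices it is quadratic in the worst case rather than O(N). The name `min_merge_greedy` and the choice of "{" as sentinel are assumptions for this example.

```python
def min_merge_greedy(s1, s2):
    """Greedy merge that resolves ties by comparing the remaining suffixes."""
    sentinel = "{"  # assumption: inputs are lowercase a-z; "{" sorts just after "z"
    a, b = s1 + sentinel, s2 + sentinel
    i = j = 0
    out = []
    while i < len(s1) or j < len(s2):
        # Take from whichever remaining suffix (with sentinel) is smaller.
        # An exhausted string is just the sentinel, so it never wins.
        if a[i:] < b[j:]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return "".join(out)


print(min_merge_greedy("zyy", "zy"))
print(min_merge_greedy("zyy", "zyb"))
```

The sentinel makes a string that runs out early compare as larger, which is what the zyy / zy example needs.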

Related

Minimum number of characters to be inserted in a string to convert it to palindrome

I need to find the minimal number of insertions needed to convert a string into a palindrome. Note: the insertions can happen at any place, at the end, or within. If it was only at the end, we have a question here.
So I found out that this can be done in O(N^2) time with this simple trick:
Let the string be s1. Reverse it. Let it be s2. Say the length is l.
Now find the longest common subsequence of s1 and s2. Let its length be x.
The answer is l-x.
For example, suppose s1 = abcda. Then s2 = adcba and the length is 5. A longest common subsequence is aba, of length 3. So the minimal number of insertions is 5-3 = 2, which is the actual answer, with adcbcda as the resulting string.
However, I cannot understand the logic behind it. Can anyone explain why it works?
And is an O(N) solution possible?
I don't know whether there is an O(N) solution, but by comparing the string with its reverse you find a subsequence that is a palindrome. That leaves l-x letters that are not paired. (You can think of a letter's pair as its reflection in a mirror placed right at the middle of the word, e.g. ab|ba.) The insertions then just complete the picture.
First, how do we find a longest subsequence that is common to two strings? There is a polynomial-time algorithm for it, described here:
https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
When we look for the longest common subsequence (lcs) of s1 and s2 (the reverse of s1), we are actually finding the lcs of the first half of s1 and the first half of s2, and also of the second half of s1 and the second half of s2.
Assume
s1 = abcddzac
so s2 = cazddcba. Here we can see it as a comparison of abcd with cazd (first halves) plus a comparison of dzac with dcba (second halves). The two comparisons are the same except that they are the reverse of each other, so their concatenation has to be a palindrome, and therefore the lcs of s1 and s2 has to be a palindrome.
Once we have the lcs (ad|da), which has length 4, we have 4 more letters that break the symmetry (b, c, z, c). We insert one letter for each of them to restore the symmetry, i.e. to get a palindrome. We take the middle of the lcs as our middle point and break s1 in two there, so we have
s1 = a bc d|d z a c, which we break like a stick at d|d, ending up with:
dzac
dcba
Now we simply fill in between the letters of the lcs so that the two halves become the same. In our case the steps are as follows (each line shows both halves):
dzac / dcba
dzac / dzcba
dzcac / dzcba
dzcbac / dzcba
dzcbac / dzcbac
Now we rejoin the two halves at the same point and get
cabczddzcbac, which is a palindrome.
Note: cddc is also an lcs, but that doesn't change the number of steps.
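The l-x computation from the question can be sketched in Python as follows. This uses the standard O(N^2) LCS table, kept to two rows; the function names are made up for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def min_insertions_for_palindrome(s):
    """l - x, where x is the LCS length of s and its reverse."""
    return len(s) - lcs_length(s, s[::-1])


print(min_insertions_for_palindrome("abcda"))
print(min_insertions_for_palindrome("abcddzac"))
```

For the two worked examples above this gives 2 and 4 insertions respectively.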

algorithm for finding all substrings from a specific alphabet in a string in O(m+n) time

Given a string S, find all maximal substrings that contain only characters from alphabet A, in O(|S|+|A|) time. A "maximal substring" is a substring of S that is surrounded by characters not in alphabet A or by string boundaries.
example:
S = rerwmkwerewkekbvverqwewevbvrewqwmkwe
A = {w,r,e}
answer: rerw, werew, e, er, wewe, rew, w, we
Can you help?
Here is one way to map your input to the output you've shown.
Take the characters of the string one at a time and check each against the characters in A.
Use a boolean hash table with 26 entries, one per letter of the alphabet.
Note: if capital letters are included too, map them to their lowercase counterparts for case-insensitive matching, or double the table size for case-sensitive matching.
If a character matches, append it to the current substring and move on.
If there is a miss, end the current substring, save it, and start a fresh one at the next match.
Without the hash table it would take O(m*n) time, but with it you need O(m) to build the table plus O(n) for the traversal, that is, O(m+n) time.
Similar to what others have suggested, but in pseudocode form:
A = boolean array, initially all false
for each c in the alphabet
    set A[c] = true
L = stack of strings containing your solution, initially holding one empty string
for each character c of S
    if A[c] is true
        append c to the top string of stack L
    else if the top string of stack L is not empty
        push empty string onto stack L
if the top string of stack L is empty
    pop it
return L
Creating A will take O(n) and iteration through S will take O(m).
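The same idea as a runnable Python sketch, using a set for the membership test instead of a boolean array (an implementation choice made here, not something the answers prescribe). The function name `maximal_substrings` is hypothetical.

```python
def maximal_substrings(s, alphabet):
    """Return all maximal runs of s whose characters all belong to alphabet."""
    allowed = set(alphabet)      # O(|A|) to build, O(1) expected lookup
    result = []
    current = []                 # characters of the run being built
    for ch in s:                 # single O(|S|) pass
        if ch in allowed:
            current.append(ch)
        elif current:            # a miss ends the current run, if any
            result.append("".join(current))
            current = []
    if current:                  # a run that reaches the end of the string
        result.append("".join(current))
    return result


print(maximal_substrings("rerwmkwerewkekbvverqwewevbvrewqwmkwe", "wre"))
```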

Finding length of substring

I am given n strings. I have to find a string S such that all n given strings are subsequences of S.
For example, I am given the following 5 strings:
AATT
CGTT
CAGT
ACGT
ATGC
Then the string is "ACAGTGCT", because ACAGTGCT is a supersequence of all the given strings.
To solve this problem I need to know the algorithm, but I have no idea how to approach it. Can you help me by describing a technique for solving this problem?
This is an NP-complete problem called multiple sequence alignment.
The wiki page describes solution methods such as dynamic programming, which works for small n but becomes prohibitively expensive for larger n.
The basic idea is to construct an array f[a,b,c,...] representing the length of the shortest string S that generates the first "a" characters of the first string, the first "b" characters of the second, the first "c" characters of the third, and so on.
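A minimal sketch of that table in Python, written as a memoized recursion over a tuple of prefix lengths. It only returns the length, and the state space grows as the product of the string lengths, so it is usable only for a handful of short strings. The name `scs_length` is made up for this example.

```python
from functools import lru_cache


def scs_length(strings):
    """Length of a shortest common supersequence of the given strings."""
    strings = tuple(strings)

    @lru_cache(maxsize=None)
    def f(pos):
        # pos[k] = how many leading characters of strings[k] must be covered
        if all(p == 0 for p in pos):
            return 0
        # The last character of the supersequence must end some prefix.
        last_chars = {s[p - 1] for p, s in zip(pos, strings) if p > 0}
        best = None
        for ch in last_chars:
            # Every prefix ending in ch gets shortened by one.
            nxt = tuple(
                p - 1 if p > 0 and s[p - 1] == ch else p
                for p, s in zip(pos, strings)
            )
            cost = 1 + f(nxt)
            if best is None or cost < best:
                best = cost
        return best

    return f(tuple(len(s) for s in strings))


print(scs_length(["bcb", "baab", "babc"]))
```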
My approach: using a trie
Build a trie from the given words, then:
create empty string (S)
create empty string (prev)
for each layer in the trie
    create empty string (curr)
    for each character used in the current layer
        if the character is not used in the previous layer (not in prev)
            add the character to S
        add the character to curr
    prev = curr
Hope this helps :)
1 Definitions
A sequence of length n is a concatenation of n symbols taken from an alphabet Σ.
If S is a sequence of length n and T is a sequence of length m and n ≤ m, then S is a subsequence of T if S can be obtained by deleting m-n symbols from T. The symbols need not be contiguous.
A sequence T of length m is a supersequence of S of length n if T can be obtained from S by inserting m-n symbols. That is, T is a supersequence of S if and only if S is a subsequence of T.
A sequence T is a common supersequence of the sequences S1 and S2 if T is a supersequence of both S1 and S2.
2 The problem
The problem is to find a shortest common supersequence (SCS), which is a common supersequence of minimal length. There could be more than one SCS for a given problem.
2.1 Example
Σ = {a, b, c}
S1 = bcb
S2 = baab
S3 = babc
One shortest common supersequence is babcab (others are babacb, baabcb, bcaabc, bacabc, baacbc).
3 Techniques
Dynamic programming: requires too much memory unless the number of input sequences is very small.
Branch and bound: requires too much time unless the alphabet is very small.
Majority merge: the best known heuristic when the number of sequences is large compared to the alphabet size. [1]
Greedy (take two sequences and replace them by their optimal shortest common supersequence until a single string is left): worse than majority merge. [1]
Genetic algorithms: there are indications that they might be better than majority merge. [1]
4 Implemented heuristics
4.1 The trivial solution
The trivial solution is at most |Σ| times the optimal solution length and is obtained by concatenating all the characters of Σ, repeated as many times as the length of the longest sequence. That is, if Σ = {a, b, c} and the longest input sequence has length 4, we get abcabcabcabc.
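In Python the trivial solution is close to a one-liner. The sketch below assumes the alphabet is passed in explicitly and emits its symbols in sorted order; the function name is made up for this example.

```python
def trivial_supersequence(alphabet, sequences):
    """Repeat the whole alphabet once per position of the longest sequence."""
    block = "".join(sorted(set(alphabet)))            # e.g. "abc"
    longest = max((len(s) for s in sequences), default=0)
    return block * longest


print(trivial_supersequence("abc", ["bcb", "baab", "babc"]))
```

Each repetition of the alphabet can absorb the next symbol of every input sequence, which is why the result is always a common supersequence.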
4.2 Majority merge heuristic
The Majority merge heuristic builds up a supersequence S, starting from the empty sequence, in the following way:
WHILE there are non-empty input sequences
    s <- the most frequent symbol at the start of the non-empty input sequences
    Add s to the end of S.
    Remove s from the beginning of each input sequence that starts with s.
END WHILE
Majority merge performs very well when the number of sequences is large compared to the alphabet size.
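A minimal Python sketch of the loop above. The pseudocode does not say how to break ties between equally frequent symbols, so this version picks the alphabetically smallest one; that tie-break and the function name `majority_merge` are assumptions.

```python
from collections import Counter


def majority_merge(sequences):
    """Build a common supersequence with the Majority merge heuristic."""
    remaining = [s for s in sequences if s]
    out = []
    while remaining:
        # Count how often each symbol appears at the front of a sequence.
        counts = Counter(s[0] for s in remaining)
        # Most frequent front symbol; ties go to the smallest symbol (assumption).
        symbol = min(counts, key=lambda c: (-counts[c], c))
        out.append(symbol)
        # Strip the symbol from every sequence that starts with it.
        remaining = [s[1:] if s[0] == symbol else s for s in remaining]
        remaining = [s for s in remaining if s]
    return "".join(out)


print(majority_merge(["bcb", "baab", "babc"]))
```

On the example from section 2.1 this tie-break happens to produce baabcb, one of the shortest common supersequences listed there; in general the heuristic gives no such guarantee.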
5 My approach - Local search
My approach was to apply a local search heuristic to the SCS problem and compare it to the Majority merge heuristic to see if it might do better in the case when the alphabet size is larger than the number of sequences.
Since the length of a valid supersequence may vary and any change to the supersequence may give an invalid string, a direct representation of a supersequence as a feasible solution is not an option.
I chose to view a feasible solution (S) as a sequence of mappings x_1 ... x_Sl, where Sl is the sum of the lengths of all sequences and x_i is a mapping to a sequence number and an index.
That means, if L = {{s_{1,1} ... s_{1,m_1}}, {s_{2,1} ... s_{2,m_2}}, ..., {s_{n,1} ... s_{n,m_n}}} is the set of input sequences and L(i) is the i-th sequence, the mappings are represented like this:
x_i → {k, l}, where k ∈ L and l ∈ L(k)
To be sure that any solution is valid we need to introduce the following constraints:
Every symbol in every sequence may only have one x_i mapped to it.
If x_i → s_{s,k} and x_j → s_{s,l} and k < l, then i < j.
If x_i → s_{s,k} and x_j → s_{s,l} and k > l, then i > j.
The second constraint enforces that the order of each sequence is preserved, but not its position in S. If we have two mappings x_i and x_j, then we may only exchange mappings between them if they map to different sequences.
5.1 The initial solution
There are many ways to choose an initial solution. As long as the order within each sequence is preserved, it is valid. I chose not to randomize the solution in any way, but to try two very different solution types and compare them.
The first one is to create an initial solution by simply concatenating all the sequences.
The second one is to interleave the sequences one symbol at a time. That is to start with the first symbol of every sequence then, in the same order, take the second symbol of every sequence and so on.
5.2 Local change and the neighbourhood
A local change is done by exchanging two mappings in the solution.
One way of doing the iteration is to go from i to Sl and do the best exchange for each mapping.
Another way is to try to exchange the mappings in the order they are defined by the sequences. That is, first exchange s1,1, then s2,1. That is what we do.
There are two variants I have tried.
In the first one, if a single mapping exchange does not yield a better value I return, otherwise I go on.
In the second one, separately for each sequence, I do as many exchanges as there are sequences, so that a symbol in each sequence has a chance to move. I keep the exchange that gives the best value, and if that value is worse than the value from the last step of the algorithm I return, otherwise I go on.
A symbol may move any number of position to the left or to the right as long as the exchange does not change the order of the original sequences.
The neighbourhood in the first variant is the number of valid exchanges that can be made for the symbol. In the second variant it is the sum of valid exchanges of each symbol after the previous symbol has been exchanged.
5.3 Evaluation
Since the length of the solution is always constant, it has to be compressed before the real length of the solution can be obtained.
The solution S, which consists of mappings, is converted to a string by using the symbols each mapping points to. A new, initially empty, solution T is created. Then this algorithm is performed:
T = {}
FOR i = 0 TO Sl
    found = FALSE
    FOR j = 0 TO |L|
        IF first symbol in L(j) = the symbol xi maps to THEN
            Remove first symbol from L(j)
            found = TRUE
        END IF
    END FOR
    IF found = TRUE THEN
        Add the symbol xi maps to to the end of T
    END IF
END FOR
Sl is as before the sum of the lengths of all sequences. L is the set of all sequences and L(j) is sequence number j.
The value of the solution S is obtained as |T|.
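The evaluation step, together with the two initial solutions from section 5.1, can be sketched in Python as below. Mappings are represented as (sequence index, position) pairs, which is one possible encoding and not necessarily the original one; all function names are made up for this example.

```python
def concatenated_solution(sequences):
    """Initial solution 1: all of sequence 0, then all of sequence 1, ..."""
    return [(k, l) for k, seq in enumerate(sequences) for l in range(len(seq))]


def interleaved_solution(sequences):
    """Initial solution 2: first symbols of every sequence, then second symbols, ..."""
    longest = max((len(s) for s in sequences), default=0)
    return [(k, l) for l in range(longest)
            for k, seq in enumerate(sequences) if l < len(seq)]


def compress(mappings, sequences):
    """Turn a list of mappings into the compressed supersequence T."""
    fronts = [0] * len(sequences)   # index of the first unconsumed symbol per sequence
    t = []
    for k, l in mappings:
        symbol = sequences[k][l]
        found = False
        for j, seq in enumerate(sequences):
            if fronts[j] < len(seq) and seq[fronts[j]] == symbol:
                fronts[j] += 1      # "remove first symbol from L(j)"
                found = True
        if found:
            t.append(symbol)
    return "".join(t)


seqs = ["bcb", "baab", "babc"]
print(compress(concatenated_solution(seqs), seqs))
print(compress(interleaved_solution(seqs), seqs))
```

The value of a solution is then `len(compress(mappings, sequences))`. On the example from section 2.1 the concatenated start compresses to length 7 and the interleaved start to length 6.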
With many thanks to Andreas Westling.

complexity of constructing an inverted index list

Given n strings S1, S2, ..., Sn and an alphabet A = {a_1, a_2, ..., a_m}. Assume that the characters within each string are all distinct. I want to create an inverted index for each a_i (i = 1, 2, ..., m). My inverted index has one special property: the characters in A are in a fixed sequential order, and if the inverted list of a_i already includes a string (say S_2), then the lists of a_j (j = i+1, i+2, ..., m) don't need to include S_2 any more. In short, every string appears in the inverted index only once. My question is how to build such an index quickly and efficiently. Is there a known bound on the time complexity?
For example, A = {a,b,e,g}, S1 = {abg}, S2 = {bg}, S3 = {gae}, S4 = {g}. Then my inverted index should be:
a: S1,S3
b: S2 (since S1 has appeared previously, so we don't need to include it here)
e:
g: S4
If I understand your question correctly, a straightforward solution is:
for each string in n strings
    find the "smallest" character in the string
    put the string in the list for the character
The complexity is proportional to the total length of the strings, multiplied by a constant for the order test.
If there is a simple way to test the order (e.g. the characters are in alphabetical order and all lowercase, so a plain < is enough), simply compare them; otherwise, I suggest using a hash table that maps each character to its position in the order, and comparing those positions.
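A short Python sketch of this, using a dictionary for the character order as suggested. It assumes every string is non-empty and only uses characters from A; the function name and the dictionary-of-names input format are choices made for this example.

```python
def build_inverted_index(alphabet, strings):
    """Put each string in the list of its earliest character (by alphabet order)."""
    rank = {ch: i for i, ch in enumerate(alphabet)}   # character -> position in A
    index = {ch: [] for ch in alphabet}
    for name, s in strings.items():
        first = min(s, key=rank.__getitem__)          # "smallest" character of s
        index[first].append(name)
    return index


index = build_inverted_index(
    "abeg", {"S1": "abg", "S2": "bg", "S3": "gae", "S4": "g"}
)
for ch, names in index.items():
    print(ch, names)
```

Each character of each string is looked at once, so the running time is proportional to the total length of the strings plus |A|.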

Burrows-Wheeler Transform without EOF character

I need to perform the well-known Burrows-Wheeler Transform in linear time. I found a solution using suffix sorting and an EOF character, but appending EOF changes the transformation. For example, consider the string bcababa and two of its rotations
s1 = abababc
s2 = ababcab
it's clear that s1 < s2. Now with an EOF character:
s1 = ababa#bc
s2 = aba#bcab
and now s2 < s1. And the resulting transformation will be different. How can I perform BWT without EOF?
You can perform the transform in linear time and space without the EOF character by computing the suffix array of the string concatenated with itself. Then iterate over the suffix array. If the current suffix array value is less than n, add to your output array the last character of the rotation starting at the position denoted by the current value in the suffix array. This approach will produce a slightly different BWT transform result, however, since the string rotations aren't sorted as if the EOF character were present.
A more thorough description can be found here: http://www.quora.com/Algorithms/How-I-can-optimize-burrows-wheeler-transform-and-inverse-transform-to-work-in-O-n-time-O-n-space
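The doubled-string idea can be sketched in Python as follows. For brevity the suffix array is built with a plain sort over suffixes, which is not linear time; a linear-time suffix array construction would have to be substituted to meet the stated bound. The function name is made up for this example.

```python
def bwt_without_eof(s):
    """BWT of s (sorted rotations, no EOF) via the suffix array of s + s."""
    n = len(s)
    doubled = s + s
    # Naive suffix array of the doubled string (NOT linear time).
    suffix_array = sorted(range(2 * n), key=lambda i: doubled[i:])
    # Suffixes starting at i < n begin with the rotation of s starting at i;
    # the last character of that rotation is doubled[i + n - 1].
    return "".join(doubled[i + n - 1] for i in suffix_array if i < n)


print(bwt_without_eof("banana"))
```

Note that "ab" and "ba" both map to "ba" under this function, which is the invertibility problem with EOF-free transforms raised in this thread.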
You need to have an EOF character in the string for BWT to work, because otherwise you can't perform the inverse transform to get the original string back. Without EOF, the strings "ba" and "ab" both have the same transformed version ("ba"). With EOF (written | here), the transforms are different:
ab       ba
a b |    a | b
b | a    b a |
| a b    | b a
i.e. ab transforms to "|ab" and ba to "b|a".
EOF is needed for BWT because it marks the point where the character cycle starts.
Re: doing it without the EOF character, according to Wikipedia,
Since any rotation of the input string will lead to the same
transformed string, the BWT cannot be inverted without adding an 'EOF'
marker to the input or, augmenting the output with information, such
as an index, that makes it possible to identify the input string from
the class of all of its rotations.
There is a bijective version of the transform, by which the
transformed string uniquely identifies the original. In this version,
every string has a unique inverse of the same length.
The bijective transform is computed by first factoring the input into
a non-increasing sequence of Lyndon words; such a factorization exists
by the Chen–Fox–Lyndon theorem, and can be found in linear time.
Then, the algorithm sorts together all the rotations of all of these
words; as in the usual Burrows–Wheeler transform, this produces a
sorted sequence of n strings. The transformed string is then obtained
by picking the final character of each of these strings in this sorted
list.
I know this thread is quite old, but I had the same problem and came up with the following solution:
Find the lexicographically minimal rotation of the string and save the offset (needed to reverse); I use the Lyndon factorization for this.
Use the normal BWT algorithms on the rotated string (this produces the right output because all the algorithms assume that the string is followed by the lexicographically minimal character).
To reverse: undo the BWT using e.g. backward search starting at index 0, and write the corresponding character at the saved offset.

Resources