Shortest uncommon prefix from a set of strings - string

Given a string A and a set of string S. Need to find an optimum method to find a prefix of A which is not a prefix of any of the strings in S.
Example
A={apple}
S={april,apprehend,apprehension}
Output should be "appl" and not "app" since "app" is prefix of both "apple" and "apprehension" but "appl" is not.
I know the trie approach; by making a trie of set S and then traversing in the trie for string A.
But what I want to ask is can we do it without trie?
Like can we compare every pair (A,Si), Si = ith string from set S and get the largest common prefix out of them.In this case that would be "app" , so now the required ans would be "appl".
This would take 2 loops(one for iterating through S and another for comparing Si and A).
Can we improve upon this??
Please suggest an optimum approach.

I'm not sure exactly what you had in mind, but here's one way to do it:
Keep a variable longest, initialised to 0.
Loop over all elements S[i] of S,
setting longest = max(longest, matchingPrefixLength(S[i], A)).
Return the prefix from A of length longest+1.
This uses O(1) space and takes O(length(S)*average length of S[i]) time.
This is optimal (at least for the worst case) since you can't get around needing to look at every character of every element in S.
Example:
A={apple}
S={april,apprehend,apprehension}
longest = 0
The longest prefix for S[0] and A is 2
So longest = max(0,2) = 2
The longest prefix for S[1] and A is 3
So longest = max(2,3) = 3
The longest prefix for S[2] and A is 3
So longest = max(3,3) = 3
Now we return the prefix of length longest+1 = 4, i.e. "appl"
Note that there are actually 2 trie-based approaches:
Store only A in the trie. Iterate through the trie for each element from S to eliminate prefixes.
This uses much less memory than the second approach (but still more than the approach above). At least assuming A isn't much, much longer than S[i], but you can optimise to stop at the longest element in S or construct the tree as we go to avoid this case.
Store all elements from S in the trie. Iterate through the trie with A to find the shortest non-matching prefix.
This approach is significantly faster if you have lots of A's that you want to query for a constant set S (since you only have to set up the trie once, and do a single lookup for each A, where-as you have to create a new trie and run through each S[i] for each A for the first approach).

What is your input size?
Let's model your input as being of N+1 strings whose lengths are about M characters. Your total input size is about M(N+1) character, plus some proportional amount of apparatus to encode that data in a usable format (data structure overhead).
Your algorithm ...
maxlen = 0
for i = 1 to N
for j = 1 to M
if A[j] = S[i][j] then
if j > maxlen then maxlen = j
break
print A[1...maxlen]
... performs up M x N iterations of the innermost loop, reading two characters each time, for a total of 2MN characters read.
Recall our input data size was about M(N+1) also. So our question now is whether we can solve this problem, in the worst case, looking at asymptotically less than the total input (you do a little less than looking at all the input twice, or linear in the input size). The answer is no. Consider this worst case:
length of A is M'
length of all strings in S is M'
A differs from N-1 strings in S by the last two characters
A differs from 1 string in S by only the last character
Any algorithm must look at M'-1 characters of N-1 strings, plus M' characters of 1 string, to correctly determine the answer of this problem instance is A.
(M'-1)(N'-1) + N = M'N - M' - N + 1 + N = M'N - M' + 1
For N >= 2, the dominant terms in both M'(N+1) and M'N' are both M'N, meaning that for N >= 2, both the input size and the amount of that input any correct algorithm must read is O(MN). Your algorithm is O(MN). Any other algorithm cannot be asymptotically better.

Related

Count number of wonderful substrings

I found below problem in one website.
A wonderful string is a string where at most one letter appears an odd number of times.
For example, "ccjjc" and "abab" are wonderful, but "ab" is not.
Given a string word that consists of the first ten lowercase English letters ('a' through 'j'), return the number of wonderful non-empty substrings in word. If the same substring appears multiple times in word, then count each occurrence separately.
A substring is a contiguous sequence of characters in a string.
Example 1 :
Input: word = "aba"
Output: 4
Explanation: The four wonderful substrings are a , b , a(last character) , aba.
I tried to solve it. I implemented a O(n^2) solution (n is input string length). But expected time complexity is O(n). I could not solve it in O(n). I found below solution but could not understood it. Can you please help me to understand below O(n) solution for this problem or come up with an O(n) solution?
My O(N^2) approach - for every substring check whether it has at most one odd count char. This check can be done in O(1) time using an 10 character array.
class Solution {
public:
long long wonderfulSubstrings(string str) {
long long ans=0;
int idx=0; long long xorsum=0;
unordered_map<long long,long long>mp;
mp[xorsum]++;
while(idx<str.length()){
xorsum=xorsum^(1<<(str[idx]-'a'));
// if xor is repeating it means it is having even ouccrences of all elements
// after the previos ouccerence of xor.
if(mp.find(xorsum)!=mp.end())
ans+=mp[xorsum];
mp[xorsum]++;
// if xor will have at most 1 odd character than check by xoring with (a to j)
// check correspondingly in the map
for(int i=0;i<10;i++){
long long temp=xorsum;
temp=temp^(1<<i);
if(mp.find(temp)!=mp.end())
ans+=mp[temp];
}
idx++;
}
return ans;
}
};
There's two main algorithmic tricks in the code, bitmasks and prefix-sums, which can be confusing if you've never seen them before. Let's look at how the problem is solved conceptually first.
For any substring of our string S, we want to count the number of appearances for each of the 10 possible letters, and ask if each number is even or odd.
For example, with a substring s = accjjc, we can summarize it as: odd# a, even# b, odd# c, even# d, even# e, even# f, even# g, even# h, even# i, even# j. This is kind of long, so we can summarize it using a bitmask: for each letter a-j, put a 1 if the count is odd, or 0 if the count is even. This gives us a 10-digit binary string, which is 1010000000 for our example.
You can treat this as a normal integer (or long long, depending on how ints are represented). When we see another character, the count flips whether it was even or odd. On bitmasks, this is the same as flipping a single bit, or an XOR operation. If we add another 'a', we can update the bitmask to start with 'even# a' by XORing it with the number 1000000000.
We want to count the number of substrings where at most one character count is odd. This is the same as counting the number of substrings whose bitmask has at most one 1. There are 11 of these bitmasks: the ten-zero string, and each string with exactly one 1 for each of the ten possible spots. If you interpret these as integers, the last ten strings are the first ten powers of 2: 1<<0, 1<<1, 1<<2, ... 1<<9.
Now, we want to count the bitmasks for all substrings in O(n) time. First, solve a simpler problem: count the bitmasks for just all prefixes, and store these counts in a hashmap. We can do this by keeping a running bitmask from the start, and performing updates by an XOR of the bit corresponding to that letter: xorsum=xorsum^(1<<(str[idx]-'a')). This can clearly be done in a single, O(n) time pass through the string.
How do we get counts of arbitrary substrings? The answer is prefix-sums: the count of letters in any substring can be expressed as a different of two prefix-counts. For example, with s = accjjc, suppose we want the bitmask corresponding to the substring 'jj'. This substring can be written as the difference of two prefixes: 'jj' = 'accjj' - 'acc'.
In the same way, we want to subtract the counts for the two prefix strings. However, we only have the bitmasks telling us whether each letter has an even or odd frequency. In the arithmetic of bitmasks, we treat each position mod 2, so coordinate-wise subtraction becomes XOR.
This means counts(jj) = counts(accjj) - counts(acc) becomes
bitmask(jj) = bitmask(accjj) ^ bitmask(acc).
There's still a problem: the algorithm I've described is still quadratic. If, at every prefix, we iterate over all previous prefix-bitmasks and check if our mask XOR the old mask is one of the 11 goal-bitmasks, we still have a quadratic runtime. Instead, you can use the fact that XOR is its own inverse: if a ^ b = c, then b = a ^ c. So, instead of doing XORs with old prefix masks, you XOR with the 11 goal masks and add the number of times we've seen that mask: ans+=mp[xorsum] counts the substrings ending at our current index whose bitmask is xorsum ^ 0000000000 = xorsum. The loop after that counts substrings whose bitmask is one of the ten goal bitmasks.
Lastly, you just have to add your current prefix-mask to update the counts: mp[xorsum]++.

How to efficiently find identical substrings of a specified length in a collection of strings?

I have a collection S, typically containing 10-50 long strings. For illustrative purposes, suppose the length of each string ranges between 1000 and 10000 characters.
I would like to find strings of specified length k (typically in the range of 5 to 20) that are substrings of every string in S. This can obviously be done using a naive approach - enumerating every k-length substring in S[0] and checking if they exist in every other element of S.
Are there more efficient ways of approaching the problem? As far as I can tell, there are some similarities between this and the longest common subsequence problem, but my understanding of LCS is limited and I'm not sure how it could be adapted to the situation where we bound the desired common substring length to k, or if subsequence techniques can be applied to finding substrings.
Here's one fairly simple algorithm, which should be reasonably fast.
Using a rolling hash as in the Rabin-Karp string search algorithm, construct a hash table H0 of all the |S0|-k+1 length k substrings of S0. That's roughly O(|S0|) since each hash is computed in O(1) from the previous hash, but it will take longer if there are collisions or duplicate substrings. Using a better hash will help you with collisions but if there are a lot of k-length duplicate substrings in S0 then you could end up using O(k|S0|).
Now use the same rolling hash on S1. This time, look each substring up in H0 and if you find it, remove it from H0 and insert it into a new table H1. Again, this should be around O(|S1|) unless you have some pathological case, like both S0 and S1 are just long repetitions of the same character. (It's also going to be suboptimal if S0 and S0 are the same string, or have lots of overlapping pieces.)
Repeat step 2 for each Si, each time creating a new hash table. (At the end of each iteration of step 2, you can delete the hash table from the previous step.)
At the end, the last hash table will contain all the common k-length substrings.
The total run time should be about O(Σ|Si|) but in the worst case it could be O(kΣ|Si|). Even so, with the problem size as described, it should run in acceptable time.
Some thoughts (N is number of strings, M is average length, K is needed substring size):
Approach 1:
Walk through all strings, computing rolling hash for k-length strings and storing these hashes in the map (store tuple {key: hash; string_num; position})
time O(NxM), space O(NxM)
Extract groups with equal hash, check step-by-step:
1) that size of group >= number of strings
2) all strings are represented in this group 3
3) thorough checking of real substrings for equality (sometimes hashes of distinct substrings might coincide)
Approach 2:
Build suffix array for every string
time O(N x MlogM) space O(N x M)
Find intersection of suffix arrays for the first string pair, using merge-like approach (suffixes are sorted), considering only part of suffixes of length k, then continue with the next string and so on
I would treat each long string as a collection of overlapped short strings, so ABCDEFGHI becomes ABCDE, BCDEF, CDEFG, DEFGH, EFGHI. You can represent each short string as a pair of indexes, one specifying the long string and one the starting offset in that string (if this strikes you as naive, skip to the end).
I would then sort each collection into ascending order.
Now you can find the short strings common to the first two collection by merging the sorted lists of indexes, keeping only those from the first collection which are also present in the second collection. Check the survivors of this against the third collection, and so on and the survivors at the end correspond to those short strings which are present in all long strings.
(Alternatively you could maintain a set of pointers into each sorted list and repeatedly look to see if every pointer points at short strings with the same text, then advancing the pointer which points at the smallest short string).
Time is O(n log n) for the initial sort, which dominates. In the worst case - e.g. when every string is AAAAAAAA..AA - there is a factor of k on top of this, because all string compares check all characters and take time k. Hopefully, there is a clever way round this with https://en.wikipedia.org/wiki/Suffix_array which allows you to sort in time O(n) rather than O(nk log n) and the https://en.wikipedia.org/wiki/LCP_array, which should allow you to skip some characters when comparing substrings from different suffix arrays.
Thinking about this again, I think the usual suffix array trick of concatenating all of the strings in question, separated by a character not found in any of them, works here. If you look at the LCP of the resulting suffix array you can split it into sections, splitting at points where where the difference between suffixes occurs less than k characters in. Now each offset in any particular section starts with the same k characters. Now look at the offsets in each section and check to see if there is at least one offset from every possible starting string. If so, this k-character sequence occurs in all starting strings, but not otherwise. (There are suffix array constructions which work with arbitrarily large alphabets so you can always expand your alphabet to produce a character not in any string, if necessary).
I would try a simple method using HashSets:
Build a HashSet for each long string in S with all its k-strings.
Sort the sets by number of elements.
Scan the first set.
Lookup the term in the other sets.
The first step takes care of repetitions in each long string.
The second ensures the minimum number of comparisons.
let getHashSet k (lstr:string) =
let strs = System.Collections.Generic.HashSet<string>()
for i in 0..lstr.Length - k do
strs.Add lstr.[i..i + k - 1] |> ignore
strs
let getCommons k lstrs =
let strss = lstrs |> Seq.map (getHashSet k) |> Seq.sortBy (fun strs -> strs.Count)
match strss |> Seq.tryHead with
| None -> [||]
| Some h ->
let rest = Seq.tail strss |> Seq.toArray
[| for s in h do
if rest |> Array.forall (fun strs -> strs.Contains s) then yield s
|]
Test:
let random = System.Random System.DateTime.Now.Millisecond
let generateString n =
[| for i in 1..n do
yield random.Next 20 |> (+) 65 |> System.Convert.ToByte
|] |> System.Text.Encoding.ASCII.GetString
[ for i in 1..3 do yield generateString 10000 ]
|> getCommons 4
|> fun l -> printfn "found %d\n %A" l.Length l
result:
found 40
[|"PPTD"; "KLNN"; "FTSR"; "CNBM"; "SSHG"; "SHGO"; "LEHS"; "BBPD"; "LKQP"; "PFPH";
"AMMS"; "BEPC"; "HIPL"; "PGBJ"; "DDMJ"; "MQNO"; "SOBJ"; "GLAG"; "GBOC"; "NSDI";
"JDDL"; "OOJO"; "NETT"; "TAQN"; "DHME"; "AHDR"; "QHTS"; "TRQO"; "DHPM"; "HIMD";
"NHGH"; "EARK"; "ELNF"; "ADKE"; "DQCC"; "GKJA"; "ASME"; "KFGM"; "AMKE"; "JJLJ"|]
Here it is in fiddle: https://dotnetfiddle.net/ZK8DCT

Finding the longest double suffix in linear time

Given a string s, find the longest double suffix in time complexity O(|s|).
Example: for string banana, the LDS is na. For abaabaa it's baa.
Obviously I thought about using a suffix tree, but I'm having trouble to find double suffix in it.
Reverse the string and build sparse array P[i][j], where i is from 0 to log(n), j is from 0 to n-1, n is the length of the string. P[i][j] refers to the rank of the suffix starting from position j and length 2^i. So if P[i][j]=P[i][k], the first 2^i chars of the suffixes at indexes j and k are equal.
Now your problem reduces to finding a Longest Common Prefix for 0(start of the reversed string) and another suffix at index i, such that LCP >= i.
Where LCP can be computed by simply using P array in log(n) time, by comparing first 2^x chars of these two suffixes and gradually reducing x.
Total complexity is n*log(n)*log(n).
Here is the working C++ source code: https://ideone.com/aJCAYG
I think that Gene's solution is the simpler to implement and since it does not rely on an arborescent structures but on arrays, it is likely more hardware friendly as well.
But since you mentioned suffix trees, let's look into a solution based on suffix trees! I will assume that you use an end token to mark the end of the string(s) you insert in the tree. To illustrate this, here is a representation of the suffix tree built for your abaabaa example:
$ - ##
b a a - $ - ## // Longest double suffix: P is the first dash, N the second
b a a $ - ## // N' is the dash
a - $ - ##
a - $ - ##
b a a $ - ##
b a a - $ - ##
b a a $ - ##
When N is a node in a suffix tree, we will denote |N| the length of the substring represented by N.
How can you characterize a "double suffix" in a suffix tree? Well it is a terminal node N with a parent that has a specific property: let P be the parent node of a double suffix, then:
P has a transition to the suffix node N that only contains the end token ($ above) of the string.
Let suffix be the substring represented by the node P with an appended end token (baa$ in your example). If we walk down the tree from P, using suffix, we end up in another suffix node N' (walking down the tree won't be actually needed)
The substring represented by the node P is the double suffix (baa in our case).
We have the equalities |N'| = 2.|P| + 1 and |N| = |P| + 1
Given that, you only have to iterate over suffix nodes and test this condition. You can be greedy if you iterate suffixes in decreasing order of length: the first match is necessarily the longest double suffix.
Note that we can stop our search after having inspected the suffix of length |S|/2 and only iterate over suffixes of odd length (do not forget we add an end token to the string)
Complexity analysis
Building the suffix tree is O(|S|).
Let N' be a suffix node and N be the suffix node for the suffix of length (|N'|-1)/2 + 1. Assuming proper construction of the tree:
The suffixes can be stored in an array/vector in increasing order because the creation of the tree adds them in increasing order of length (at least with the Ukkonen's algorithm).
Thus accessing the suffix of length k is O(1)
Accessing the substring represented by a node of the tree is O(1), in particular, this applies to P the parent node of N and N'
Finding out if the transition from P to N only contains the end token ($) is O(1)
Checking if |N'| = 2.|P| + 1 is indeed O(1)
Since we are iterating over the suffix in decreasing order of length, we necessarily focus on the N' suffixes (the doubled suffix, ie baabaa$ in your example), so we just have to:
Get N the suffix node such that |N'| = 2.|N| - 1: O(1)
Get P the parent of the suffix node N: O(1)
Check that the transition from P to N contains only the end token $: O(1)
Proof: (We ignore the end token in the following proof)
The 3 steps above, if leading to a true evaluation, prove the existence of a suffix of length 2.|P| that starts with the substring represented by P, which is also a suffix. Since this substring is a suffix, the suffix of length 2.|P| necessarily ends with it and therefore is made of two occurrences of that substring QED.
Since we will do this step for at most (|S|/2 + 1)/2 suffixes, the identification step is therefore O(|S|) in the worst case.
The overall complexity is thus O(|S|).

Finding length of substring

I have given n strings . I have to find a string S so that, given n strings are sub-sequence of S.
For example, I have given the following 5 strings:
AATT
CGTT
CAGT
ACGT
ATGC
Then the string is "ACAGTGCT" . . Because, ACAGTGCT contains all given strings as super-sequence.
To solve this problem I have to know the algorithm . But I have no idea how to solve this . Guys, can you help me by telling technique to solve this problem ?
This is a NP-complete problem called multiple sequence alignment.
The wiki page describes solution methods such as dynamic programming which will work for small n, but becomes prohibitively expensive for larger n.
The basic idea is to construct an array f[a,b,c,...] representing the length of the shortest string S that generates "a" characters of the first string, "b" characters of the second, and "c" characters of the third.
My Approach: using Trie
Building a Trie from the given words.
create empty string (S)
create empty string (prev)
for each layer in the trie
create empty string (curr)
for each character used in the current layer
if the character not used in the previous layer (not in prev)
add the character to S
add the character to curr
prev = curr
Hope this helps :)
1 Definitions
A sequence of length n is a concatenation of n symbols taken from an alphabet .
If S is a sequence of length n and T is a sequence of length m and n m then S is a subsequence of T if S can be obtained by deleting m-n symbols from T. The symbols need not be contiguous.
A sequence T of length m is a supersequence of S of length n if T can be obtained by inserting m-n symbols. That is, T is a supersequence of S if and only if S is a subsequence of T.
A sequence T is a common supersequence of the sequences S1 and S2 of T is a supersequence of both S1 and S2.
2 The problem
The problem is to find a shortest common supersequence (SCS), which is a common supersequence of minimal length. There could be more than one SCS for a given problem.
2.1 Example
S= {a, b, c}
S1 = bcb
S2 = baab
S3 = babc
One shortest common supersequence is babcab (babacb, baabcb, bcaabc, bacabc, baacbc).
3 Techniques
Dynamic programming Requires too much memory unless the number of input-sequences are very small.
Branch and bound Requires too much time unless the alphabet is very small.
Majority merge The best known heuristic when the number of sequences is large compared to the alphabet size. [1]
Greedy (take two sequences and replace them by their optimal shortest common supersequence until a single string is left) Worse than majority merge. [1]
Genetic algorithms Indications that it might be better than majority merge. [1]
4 Implemented heuristics
4.1 The trivial solution
The trivial solution is at most || times the optimal solution length and is obtained by concatenating the concatenation of all characters in sigma as many times as the longest sequence. That is, if = {a, b, c} and the longest input sequence is of length 4 we get abcabcabcabc.
4.2 Majority merge heuristic
The Majority merge heuristic builds up a supersequence from the empty sequence (S) in the following way:
WHILE there are non-empty input sequences
s <- The most frequent symbol at the start of non-empty input-sequences.
Add s to the end of S.
Remove s from the beginning of each input sequence that starts with s.
END WHILE
Majority merge performs very well when the number of sequences is large compared to the alphabet size.
5 My approach - Local search
My approach was to apply a local search heuristic to the SCS problem and compare it to the Majority merge heuristic to see if it might do better in the case when the alphabet size is larger than the number of sequences.
Since the length of a valid supersequence may vary and any change to the supersequence may give an invalid string a direct representation of a supersequence as a feasible solution is not an option.
I chose to view a feasible solution (S) as a sequence of mappings x1...xSl where Sl is the sum of the lengths of all sequences and xi is a mapping to a sequencenumber and an index.
That means, if L={{s1,1...s1,m1}, {s2,1...s2,m2} ...{sn,1...s3,mn}} is the set of input sequences and L(i) is the ith sequence the mappings are represented like this:
xi {k, l}, where k L and l L(k)
To be sure that any solution is valid we need to introduce the following constraints:
Every symbol in every sequence may only have one xi mapped to it.
If xi ss,k and xj ss,l and k < l then i < j.
If xi ss,k and xj ss,l and k > l then i > j.
The second constraint enforces that the order of each sequence is preserved but not its position in S. If we have two mappings xi and xj then we may only exchange mappings between them if they map to different sequences.
5.1 The initial solution
There are many ways to choose an initial solution. As long as the order of the sequences are preserved it is valid. I chose not to in some way randomize a solution but try two very different solution-types and compare them.
The first one is to create an initial solution by simply concatenating all the sequences.
The second one is to interleave the sequences one symbol at a time. That is to start with the first symbol of every sequence then, in the same order, take the second symbol of every sequence and so on.
5.2 Local change and the neighbourhood
A local change is done by exchanging two mappings in the solution.
One way of doing the iteration is to go from i to Sl and do the best exchange for each mapping.
Another way is to try to exchange the mappings in the order they are defined by the sequences. That is, first exchange s1,1, then s2,1. That is what we do.
There are two variants I have tried.
In the first one, if a single mapping exchange does not yield a better value I return otherwise I go on.
In the second one, I seperately for each sequence do as many exchanges as there are sequences so a symbol in each sequence will have a possibility of moving. The exchange that gives the best value I keep and if that value is worse than the value of the last step in the algorithm I return otherwise I go on.
A symbol may move any number of position to the left or to the right as long as the exchange does not change the order of the original sequences.
The neighbourhood in the first variant is the number of valid exchanges that can be made for the symbol. In the second variant it is the sum of valid exchanges of each symbol after the previous symbol has been exchanged.
5.3 Evaluation
Since the length of the solution is always constant it has to be compressed before the real length of the solution may be obtained.
The solution S, which consists of mappings is converted to a string by using the symbols each mapping points to. A new, initialy empty, solution T is created. Then this algorithm is performed:
T = {}
FOR i = 0 TO Sl
found = FALSE
FOR j = 0 TO |L|
IF first symbol in L(j) = the symbol xi maps to THEN
Remove first symbol from L(j)
found = TRUE
END IF
END FOR
IF found = TRUE THEN
Add the symbol xi maps to to the end of T
END IF
END FOR
Sl is as before the sum of the lengths of all sequences. L is the set of all sequences and L(j) is sequence number j.
The value of the solution S is obtained as |T|.
With Many Many Thanks to : Andreas Westling

find number of string without certain substring

the problem is given N (1 <= N <= 10) string with length no more than 6, how can i calculate number of string with length L (1 <= L <= 1000000) without any of the n string as the substring.
every string only contain uppercase letter.
the best i can think is using dp L * (26^5) but i don't think this will pass the time limit :( can anyone share some idea ? btw here's the original problem http://www.spoj.com/problems/GEN/ if you don't understand what i write above
First, create an NFA (nondeterministic finite automaton) that accepts all of the "bad" strings. Then convert it to a DFA using the subset construction. Then compute the complement of that DFA.
Counting the number of strings accepted by a DFA is rather straightforward; the number of strings of length k+1 ending in a given state can be computed by summing the number of strings of length k ending in each predecessor state.
This will likely run in time if you have a decent implementation. However, if it doesn't, you can use the automaton from Aho-Corasick preprocessing instead of the DFA.

Resources