build shortest string from a collection of its subsequence - string

Given a collection of sub-sequences from a string.
For example:
abc
acd
bcd
The problem is, how to determine the shortest string from these sequences?
For the above example, the shortest string is abcd.
Here sub-sequences means part of a string but not necessarily consecutive. like acd is a sub-sequence of string abcd.
Edit: This problem actually comes from Project Euler problem 79, in that problem, we have 50 subsequences, each has the length of 3. and all characters are digits.

This problem is well studied and coined "Shortest common supersequence"
For two strings, it is in O(n). Suppose n is the maximum length of string. O(n) is used to build the suffix array and then O(n) to solve the Longest Common Subsequence problem.
For more than two strings, it is NP-Complete.

Complexity
In common case, as mentioned above, this problem will be NP-hard. There is a way to resolve a matter with suffix structure, but also you can use directed graph to do that. I can not say for sure if that is better in some sense - but may be some advantage may be found in difficult corner cases.
Graph solution
To realize how you can use graph - you need just to build it properly. Let letters of strings be vertexes and order of letter will define edges. That means, ab will mean vertexes a and b and connection between a and b. Of course, connections are directed (i.e. if a is connected to b it doesn't mean that b is connected to a, so ab is a --> b)
After these definitions you'll be able to build all graph. For your sample it's:
-simple enough. But abcd can also be represented with strings of two length as ab, ac, ad, bc, bd, cd - so I'll show graph for that too (it's just for more clarity):
Then, to find your solution, you need in this graph to find a path with maximal length. Obviously, there's the place from which NP-hardness is "derived". From both cases above maximum length is 4 which will be reached if we'll start from a vertex and traverse graph till found a->b->c->d path.
Corner cases
Non-unique solution: in fact, you may face string sets which can't strictly define a superset. Example: ab, ad, bd, ac, cd. Both abcd and acbd will fit those substrings. But, actually, this is not too bad problem. Look at the graph:
(I've chosen that with reason: it's like second graph, but without one connection, that's why result is ambiguous). The maximum path length is now 3 - but it can be reached with two paths: abd and acd. So how to restore at least one solution? Simple: since result strings will have same length (that comes from definition of way which we've found them) - we can just walk from start of first string and check symbols in second string. If they're matching, then it's a strict symbol position. But if they not - then we're free to chose any order between current symbols. So that will be:
[a] matches [a], so current result is a
[b] mismatches [c], so we can place either bc or cb. Let it be first. Then result is abc
[d] matches [d], so current result is abc+d, i.e. abcd
This is kind of "merge" where we are free to choose any result. However, this is a bit twisted case, because now we can't use just any found path - we should find all paths with maximum length (otherwise we'll not be able to restore full supersequence)
Next, non-existent solution: there may be cases when order of symbols in some strings can not be used to reproduce supersequence. Obviously, that means that one order is violating other order, thus, there will be two strings in which some two symbols have different order. Again, simplest case: abc, cbd. On the graph:
-so imminent consequence will be graph loop - it may not be so simple (like in graph above) - but it always be if order is broken. Thus, all you need to realize that is to find graph cycle. In fact. this isn't a thing that you must do separately. You'll just add cycle detection in graph longest path search.
Third: repeated symbols: this is the most tricky case. If string contains repeated symbols, then graph creation is more complicated, but still can be done. For example, let we have three conditions: aac, abc, bbc. Solution for this is aabbc. How to build graph? Now we can't just add links (because of loops at least). But I suggest following way:
Let we process all our strings, assigning indexes to symbols. Index is reset per string. If symbol appears only once, index will be 1. For our strings that means: a1a2c1, a1b1c1 and b1b2c1. After that, store maximum index for each symbol we've found. For sample above that will be 2 for a, 2 for b and 1 for c
If two connected indexed symbols have same original symbol, then connection is done "as is". For example, a1a2 will produce only one connection from a1 to a2
If two connected indexed symbols have different original symbols, then any first indexed symbol may be connected to any second indexed symbol. For example, b1c1 will result in (b1 to c1) connection and (b2 to c1) connection. How do we know about how many indexed symbols may be? That we've done on first step (i.e. found maximum indexes)
Thus, graph for sample above will look like:
+------------+
| |
| |
+------>[a2]------+ |
| | | |
| V V |
[a1]---->[c1]<----[b2] |
| ^ ^ |
| | | |
+------>[b1]------+ |
^ |
| |
+------------+
-so we'll have maximum length 5 for paths a1a2b2b1c1 and a1a2b1b2c1. We can chose any of them, ignoring indexes in result string aabbc.

Maybe there is no general efficient algorithm this kind of problem. But this is a solution for this particular problemprojecteuler 79. You want to observe the input file. You will find 7 only appears at the beginning of all sequences, so 7 should be put at the beginning. Then you will find digit 3 is the only character at the beginning, then you put 3 on the second position. So on and so forth, you can get 73162890. The special case is the last 2 digits, both 0 and 9 are at the beginning, then you have 2 choices. then you try both and get 90 is the optimal solution.
More generally, you can use the depth-first-search algorithm. The trick is when you find there is one character which only appears at the beginning, just choose it, it will
path to the optimal solution.

I think we can start with graphs .
I am not sure of correctness , but say if we build a graph with a->b if b comes after a , with all paths of length 1.
Now , we have to find longest distance (DFS can help) .
I will try to provide example .
say strings are :
abc
acd
bcd
bce
we form a graph :
Now main thing left will be to combine nodes e and d because the required string can be abcde or abced.
This is what i am not sure how to do , so maybe somebody can help !
Sorry for posting this as answer , comments can't include pictures.

Construct the graph out of the subsequences and then find the topological sort of the graph in O(E) using DFS to get the desired string of shortest length which has all subsequence in it. But as you would have noticed the topo-sort is invalid if the graph has cycles in that cases there are repetitions need for characters in the cycles which is difficult to solve.
Conclusion:- You get lucky if there are no cycles in graph and solve it in O(E) else get unlucky and end up doing brute force.

Related

Collections: How will you find the top 10 longest strings in a list of a billion strings?

I was recently asked a question in an interview. How will you find the top 10 longest strings in a list of a billion strings?
My Answer was that we need to write a Comparator that compares the lengths of 2 strings and then Use the TreeSet(Comparator) constructor.
Once you start adding the strings in the Treeset it will sort as per the sorting order of the comparator defined.
Then just pop the top 10 elements of the Treeset.
The Interviewer wasn't happy with that. The argument was that, to hold billion strings I will have to use a super computer.
Is there any other data stucture than can deal with this kind of data?
Given what you stated about the interviewer saying you would need a super computer, I am going to assume that the strings would come in a stream one string at a time.
Given the immense size due to no knowledge of how large the individual strings are (they could be whole books), I would read them in one at a time from the stream. I would then compare the current string to an ordered list of the top ten longest strings found before it and place it accordingly in the ordered list. I would then remove the smallest length one from the list and proceed to read the next string. That would mean only 11 strings were being stored at one time, the current top 10 and the one currently being processed.
Most languages have a built in sort that is pretty speedy.
stringList.sort(key=len)
in python would work. Then just grab the first 10 elements.
Also your interviewer does sounds behind the times. One billion strings is pretty small now a days
I remember studying similar data structure for such scenarios called as Trie
The height of the tree will give the longest string always.
A special kind of trie, called a suffix tree, can be used to index all suffixes in a text in order to carry out fast full text searches.
The point is you do not need to STORE all strings.
Let's think a simplified version: Find the longest 2 string (assuming no tie case)
You can always do a online algorithm like using 2 variables s1 & s2, where s1 is longest string you encountered so far, s2 is the second longest
Then you use O(N) to read the strings one by one, replace s1 or s2 when it can. This use O(2N) = O(N)
For top 10 strings, it is as dumb as the top 2 case. You can still do it in O(10N) = O(N) and store only 10 strings.
There is a faster way describe as follow but for given constant like 2 or 10, you may not need it.
For top-K strings in general, you can use structure like set in C++ (with longer having higher priority) to store the top-K strings, when a new string comes, you simply insert it, and remove the last one, both use O(lg K). So total you can do it in O(N lg K) with O(K) space.

clustering strings - what algorithm is suitable?

I have some strings and characters will not be repeated in a single string.
for example: "AABC" is not possible.
I want to cluster them into sets by their common sub-strings.
for example: "ABC, CDF, GHP" will be cluster into two sets
{ABC,CDF},{GHP}.
several strings with one or more common sub-strings will be in one set.
a string which has no common sub-string with any other strings will be a set itself.
so keep the number of sets smallest.
for example:
1. "ABC, AHD,AKJ,LAN,WER" will be two sets {ABC, AHD,AKJ,LAN},{WER}.
2. "ABC,BDF, HLK, YHT,PX" will be 3 sets {ABC,BDF}.{HLK, YHT},{PX}.
Finding a string which has nothing common with others is easy I think;
for(i=0; i< strings.num; i++)
{ str1 = strings[i];
bool m_com=false;
for(j=0;j < strings.num; j++ )
{
str2=strings[j];
if(hascommon(str1,str2))
m_com=true;
}
if(!m_com)
{
str1 has no common substring with any string,
}
}
now I am thinking about others, how to classify them, is there any algorithm suitable for this?
Input:
strings (characters are not be repeated)
output:
sets (keep number of sets as small as possible)
I know this involves with finding common sub-string problem and clustering.
but I am not familiar with clustering techniques, so I am hoping some one
could recommend me such algorithm.
while I am looking for good ways to do this, I also appreciate suggestions from others.
Tip: actually these strings are simple paths between two points in a graph. I want to find the edge whose removal cuts all these paths. the number of such edges should be minimum. so, for AB,BC,CD, it means a single path ABCD exist.
and I write down a algorithm to find common substrings in my case(my case much simpler). I think I might use this algorithm during the clustering to measure similarities.
I might have two paths, {ABC, ADC}, both removing A or removing B could split the paths.
or I could have {ABC, ADC,HG}, so removing {A,H}, or {CH}, or {CG},or {AG} all works.
I thought I could solve this by finding common subs-strings, then I decide where to remove edges.
One thing should be pointed out first:
For any two strings, "having common substring" is really equivalent to "having common letter". Thus we can replace the condition by "having common letter".
Consider the graph G whose vertices are the strings, and two strings are connected by an edge if and only if they have a common letter. Then you are really asking for separate the graph G into connected components. This can be done easily, using standard graph operation algorithms, c.f. the wiki page here.
What remains is the task of establishing the graph. This is also easy: first, create 26 boxes, labelled A to Z, and read each string once. If the string contains letter A, then put it (or its index) into box A, etc. Finally, those strings inside one box have edges connecting to each other.
There can be further optimizations, but I guess it will depend on the nature of your input data.
You have to use Heap's algorithm for your job to create permutations https://en.wikipedia.org/wiki/Heap's_algorithm
As opposed to WhatsUp, I assume you want any two strings in a subset to have a common substring. This means that for AB, BC, CD, {AB, BC, CD} is not a valid solution, because AB and CD do not have a common substring.
As Whatsup already pointed out, you can represent your strings as a graph, where vertices are the strings and and edge goes from one to the other if they have a common character.
If we are not accepting chains (as described at the beginning), the problem becomes finding a minimum clique cover, which is unfortunately NP-complete.

How to find the period of a string

I take a input from the user and its a string with a certain substring which repeats itself all through the string. I need to output the substring or its length AKA period.
Say
S1 = AAAA // substring is A
S2 = ABAB // Substring is AB
S3 = ABCAB // Substring is ABC
S4 = EFIEFI // Substring is EFI
I could start with a Single char and check if it is same as its next character if it is not, I could do it with two characters then with three and so on. This would be a O(N^2) algo. I was wondering if there is a more elegant solution to this.
You can do this in linear time and constant additional space by inductively computing the period of each prefix of the string. I can't recall the details (there are several things to get right), but you can find them in Section 13.6 of "Text algorithms" by Crochemore and Rytter under function Per(x).
Let me assume that the length of the string n is at least twice greater than the period p.
Algorithm
Let m = 1, and S the whole string
Take m = m*2
Find the next occurrence of the substring S[:m]
Let k be the start of the next occurrence
Check if S[:k] is the period
if not go to 2.
Example
Suppose we have a string
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
For each power m of 2 we find repetitions of first 2^m characters. Then we extend this sequence to it's second occurrence. Let's start with 2^1 so CD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD CDCD CDCD CDCD CD
We don't extend CD since the next occurrence is just after that. However CD is not the substring we are looking for so let's take the next power: 2^2 = 4 and substring CDCD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD CDCD
Now let's extend our string to the first repetition. We get
CDCDFBF
we check if this is periodic. It is not so we go further. We try 2^3 = 8, so CDCDFBFC
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCDFBFC CDCDFBFC
we try to extend and we get
CDCDFBFCDCDFDF
and this indeed is our period.
I expect this to work in O(n log n) with some KMP-like algorithm for checking where a given string appears. Note that some edge cases still should be worked out here.
Intuitively this should work, but my intuition failed once on this problem already so please correct me if I'm wrong. I will try to figure out a proof.
A very nice problem though.
You can build a suffix tree for the entire string in linear time (suffix tree is easy to look up online), and then recursively compute and store the number of suffix tree leaves (occurences of the suffix prefix) N(v) below each internal node v of the suffix tree. Also recursively compute and store the length of each suffix prefix L(v) at each node of the tree. Then, at an internal node v in the tree, the suffix prefix encoded at v is a repeating subsequence that generates your string if N(v) equals the total length of the string divided by L(v).
We can actually optimise the time complexity by creating a Z Array. We can create Z array in O(n) time and O(n) space. Now, lets say if there is string
S1 = abababab
For this the z array would like
z[]={8,0,6,0,4,0,2,0};
In order to calcutate the period we can iterate over the z array and
use the condition, where i+z[i]=S1.length. Then, that i would be the period.
Well if every character in the input string is part of the repeating substring, then all you have to do is store first character and compare it with rest of the string's characters one by one. If you find a match, string until to matched one is your repeating string.
I too have been looking for the time-space-optimal solution to this problem. The accepted answer by tmyklebu essentially seems to be it, but I would like to offer some explanation of what it's actually about and some further findings.
First, this question by me proposes a seemingly promising but incorrect solution, with notes on why it's incorrect: Is this algorithm correct for finding period of a string?
In general, the problem "find the period" is equivalent to "find the pattern within itself" (in some sense, "strstr(x+1,x)"), but with no constraints matching past its end. This means that you can find the period by taking any left-to-right string matching algorith, and applying it to itself, considering a partial match that hits the end of the haystack/text as a match, and the time and space requirements are the same as those of whatever string matching algorithm you use.
The approach cited in tmyklebu's answer is essentially applying this principle to String Matching on Ordered Alphabets, also explained here. Another time-space-optimal solution should be possible using the GS algorithm.
The fairly well-known and simple Two Way algorithm (also explained here) unfortunately is not a solution because it's not left-to-right. In particular, the advancement after a mismatch in the left factor depends on the right factor having been a match, and the impossibility of another match misaligned with the right factor modulo the right factor's period. When searching for the pattern within itself and disregarding anything past the end, we can't conclude anything about how soon the next right-factor match could occur (part or all of the right factor may have shifted past the end of the pattern), and therefore a shift that preserves linear time cannot be made.
Of course, if working space is available, a number of other algorithms may be used. KMP is linear-time with O(n) space, and it may be possible to adapt it to something still reasonably efficient with only logarithmic space.

Rotate String in-place with O(N)

I encountered a simple (as I thought) exercise of rotating string:
To rotate string by K characters means to cut these characters from
the beginning and transfer them to the end. If K is negative,
characters, on contrary should be transferred from the end to the
beginning.
I at once invented solution which uses concatenation of two substrings. However, I found the following addition:
if you want more serious challenge, you are encouraged to perform rotation "in-place"
task is from here - be sure it is not my college assignment etc
I could not see any simple way to do this (I suppose there should be O(N) algorithm). If I copy 0-th char to temporary variable and copy K-th char to its place, then 2K-th char to the place of K-th etc. - I will succeed provided that K and string length are co-prime. I think I can manage with other Ks adding outer loop to repeat the process from 1-th char, then 2-th etc - I think up to GCD(K, strlen(S))
But it looks too clumsy.
Let's look at rotating it by 6 characters:
Original: 123456789
Desired: 789 123456
Now look what happens when we reverse both parts of the desired string:
987 654321
This is just the reverse of the original string.
So, what we need to do is first reverse the string, then reverse both parts.
It's easy to reverse a string in linear time in-place.
This is a duplicate of another question very recently asked, answering it seems like less effort than finding the duplicate - I'll just make this community wiki.

Looking for ideas: lexicographically sorted suffix array of many different strings compute efficiently an LCP array

I don't want a direct solution to the problem that's the source of this question but it's this one link:
So I take in the strings and add them to a suffix array which is implemented as a sorted set internally, what I obtain then is a lexicographically sorted list of the two given strings.
S1 = "banana"
S2 = "panama"
SuffixArray.add S1, S2
To make searching for the k-th smallest substring efficient I preprocess this sorted set to add in information about the longest common prefix between a suffix and it's predecessor as well as keeping tabs on a cumulative substrings count. So I know that for a given k greater than the cumulative substrings count of the last item, it's an invalid query.
This works really well for small inputs as well as random large inputs of the constraints given in the problem definition, which is at most 50 strings of length 2000. I am able to pass the 4 out of 7 cases and was pretty surprised I didn't get them all.
So I went searching for the bottleneck and it hit me. Given large number of inputs like these
anananananananana.....ananana
bkbkbkbkbkbkbkbkb.....bkbkbkb
The queries for k-th smallest substrings are still fast as expected but not the way I preprocess the sorted set... The way I calculate the longest common prefix between the elements of the set is not efficient and linear O(m), like this, I did the most naïve thing expecting it to be good enough:
m = anananan
n = anananana
Start at 0 and find the point where `m[i] != n[i]`
It is like this because a suffix and his predecessor might no be related (i.e. coming from different input strings) and so I thought I couldn't help but using brute force.
Here is the question then and where I ended up reducing the problem as. Given a list of lexicographically sorted suffix like in the manner I described above (made up of multiple strings):
What is an efficient way of computing the longest common prefix array?.
The subquestion would then be, am I completely off the mark in my approach? Please propose further avenues of investigation if that's the case.
Foot note, I do not want to be shown implemented algorithm and I don't mind to be told to go read so and so book or resource on the subject as that is what I do anyway while attempting these challenges.
Accepted answer will be something that guides me on the right path or in the case that that fails; something that teaches me how to solve these types of problem in a broader sense, a book or something
READING
I would recommend this tutorial pdf from Stanford.
This tutorial explains a simple O(nlog^2n) algorithm with O(nlogn) space to compute suffix array and a matrix of intermediate results. The matrix of intermediate results can be used to compute the longest common prefix between two suffixes in O(logn).
HINTS
If you wish to try to develop the algorithm yourself, the key is to sort the strings based on their 2^k long prefixes.
From the tutorial:
Let's denote by A(i,k) be the subsequence of A of length 2^k starting at position i.
The position of A(i,k) in the sorted array of A(j,k) subsequences (j=1,n) is kept in P(k,i).
and
Using matrix P, one can iterate descending from the biggest k down to 0 and check whether A(i,k) = A(j,k). If the two prefixes are equal, a common prefix of length 2^k had been found. We only have left to update i and j, increasing them both by 2^k and check again if there are any more common prefixes.

Resources