String operations - string

Can anyone help me with this question:.This is not homework,I am preparing for my technical interview.
Given a series of N strings, find a set of repeating string of size 3
e.g. ababadefb

I think we might suffer from not knowing the full problem. I am going to direct you to a blog entry by a friend of mine where he talks about his interview with Microsoft.

A simple solution would be to construct a Suffix array from the string, sort it and compute the longest common prefix between the current suffix and the one before it. Now all LCPs of length 3 or more will give you the answer (aba in this case).
ababadefb 0
abadefb 3
adefb 1
b 0
babadefb 1
badefb 2
defb 0
efb 0
fb 0
As an alternate solution you can build a Radix tree from all suffixes then get all edges that are labeled with strings of length 3 or more.

Related

find the most divisible number in an array?

Got a question in interview, find most divisible by other numbers in the array, say [2,4,8], 8 is able to be divided by 3 numbers, and that is the ans. I have a O(N^2) solution, but is there a better solution than O(N^2)?
I think, something like quick sort will make sense, but not yet get the solution, like if a % b, b%c => a%c, but % operation is not transitional like > operation.
You can work with trees with O(NlogN).
The fastest algorithm i think is to make a heap (AVL tree is the best choice) tree with this rule:
For input x: find x%node[i] == 0 or node[i]%x == 0
If you found this node add x to node[i]'s children collection or replace Node[i] with x and add Node[i] to x's Children Collection.
else add this node to the root Node.

Can dynamic programming problems always be represented as DAG

I am trying to draw a DAG for Longest Increasing Subsequence {3,2,6,4,5,1} but cannot break this into a DAG structure.
Is it possible to represent this in a tree like structure?
As far as I know, the answer to the actual question in the title is, "No, not all DP programs can be reduced to DAGs."
Reducing a DP to a DAG is one of my favorite tricks, and when it works, it often gives me key insights into the problem, so I find it always worth trying. But I have encountered some that seem to require at least hypergraphs, and this paper and related research seems to bear that out.
This might be an appropriate question for the CS Stack Exchange, meaning the abstract question about graph reduction, not the specific question about longest increasing subsequence.
Assuming following Sequence, S = {3,2,6,4,5,1,7,8} and R = root node. Your tree or DAG will look like
R
3 2 4 1
6 5 7
8
And your result is the longest path (from root to the node with the maximum depth) in the tree (result = {r,1,7,8}).
The result above show the longest increasing sequence in S. The Tree for the longest increasing subsequence in S look as follows
R
3 2 6 4 5 1 7 8
6 4 7 5 7 7 8
7 5 8 7 8 8
8 7 8
8
And again the result is the longest path (from root to the node with the maximum depth) in the tree (result = {r,2,4,5,7,8}).
The answer to this question should be YES.
I'd like to cite the following from here: The Soul of Dynamic Programming
Formulations and Implementations.
A DP must have a corresponding DAG (most of the time implicit), otherwise we cannot find a valid order for computation.
For your case, Longest Increasing Subsequence can be represented as some DAG like the following:
The task is amount to finding the longest path in that DAG. For more information please refer to section 6.2 of Algorithms, Dynamic programming.
Yes, It is possible to represent longest Increasing DP Problem as DAG.
The solution is to find the longest path ( a path that contains maximum nodes) from every node to the last node possible for that particular node.
Here, S is the starting node, E is the ending node and C is count of nodes between S and E.
S E C
3 5 3
2 5 3
6 6 1
4 5 2
5 5 1
1 1 1
so the answer is 3 and it is very easy to generate solution as we have to traverse the nodes only.
I think it might help you.
Reference: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/lecture-20-dynamic-programming-ii-text-justification-blackjack/

build shortest string from a collection of its subsequence

Given a collection of sub-sequences from a string.
For example:
abc
acd
bcd
The problem is, how to determine the shortest string from these sequences?
For the above example, the shortest string is abcd.
Here sub-sequences means part of a string but not necessarily consecutive. like acd is a sub-sequence of string abcd.
Edit: This problem actually comes from Project Euler problem 79, in that problem, we have 50 subsequences, each has the length of 3. and all characters are digits.
This problem is well studied and coined "Shortest common supersequence"
For two strings, it is in O(n). Suppose n is the maximum length of string. O(n) is used to build the suffix array and then O(n) to solve the Longest Common Subsequence problem.
For more than two strings, it is NP-Complete.
Complexity
In common case, as mentioned above, this problem will be NP-hard. There is a way to resolve a matter with suffix structure, but also you can use directed graph to do that. I can not say for sure if that is better in some sense - but may be some advantage may be found in difficult corner cases.
Graph solution
To realize how you can use graph - you need just to build it properly. Let letters of strings be vertexes and order of letter will define edges. That means, ab will mean vertexes a and b and connection between a and b. Of course, connections are directed (i.e. if a is connected to b it doesn't mean that b is connected to a, so ab is a --> b)
After these definitions you'll be able to build all graph. For your sample it's:
-simple enough. But abcd can also be represented with strings of two length as ab, ac, ad, bc, bd, cd - so I'll show graph for that too (it's just for more clarity):
Then, to find your solution, you need in this graph to find a path with maximal length. Obviously, there's the place from which NP-hardness is "derived". From both cases above maximum length is 4 which will be reached if we'll start from a vertex and traverse graph till found a->b->c->d path.
Corner cases
Non-unique solution: in fact, you may face string sets which can't strictly define a superset. Example: ab, ad, bd, ac, cd. Both abcd and acbd will fit those substrings. But, actually, this is not too bad problem. Look at the graph:
(I've chosen that with reason: it's like second graph, but without one connection, that's why result is ambiguous). The maximum path length is now 3 - but it can be reached with two paths: abd and acd. So how to restore at least one solution? Simple: since result strings will have same length (that comes from definition of way which we've found them) - we can just walk from start of first string and check symbols in second string. If they're matching, then it's a strict symbol position. But if they not - then we're free to chose any order between current symbols. So that will be:
[a] matches [a], so current result is a
[b] mismatches [c], so we can place either bc or cb. Let it be first. Then result is abc
[d] matches [d], so current result is abc+d, i.e. abcd
This is kind of "merge" where we are free to choose any result. However, this is a bit twisted case, because now we can't use just any found path - we should find all paths with maximum length (otherwise we'll not be able to restore full supersequence)
Next, non-existent solution: there may be cases when order of symbols in some strings can not be used to reproduce supersequence. Obviously, that means that one order is violating other order, thus, there will be two strings in which some two symbols have different order. Again, simplest case: abc, cbd. On the graph:
-so imminent consequence will be graph loop - it may not be so simple (like in graph above) - but it always be if order is broken. Thus, all you need to realize that is to find graph cycle. In fact. this isn't a thing that you must do separately. You'll just add cycle detection in graph longest path search.
Third: repeated symbols: this is the most tricky case. If string contains repeated symbols, then graph creation is more complicated, but still can be done. For example, let we have three conditions: aac, abc, bbc. Solution for this is aabbc. How to build graph? Now we can't just add links (because of loops at least). But I suggest following way:
Let we process all our strings, assigning indexes to symbols. Index is reset per string. If symbol appears only once, index will be 1. For our strings that means: a1a2c1, a1b1c1 and b1b2c1. After that, store maximum index for each symbol we've found. For sample above that will be 2 for a, 2 for b and 1 for c
If two connected indexed symbols have same original symbol, then connection is done "as is". For example, a1a2 will produce only one connection from a1 to a2
If two connected indexed symbols have different original symbols, then any first indexed symbol may be connected to any second indexed symbol. For example, b1c1 will result in (b1 to c1) connection and (b2 to c1) connection. How do we know about how many indexed symbols may be? That we've done on first step (i.e. found maximum indexes)
Thus, graph for sample above will look like:
+------------+
| |
| |
+------>[a2]------+ |
| | | |
| V V |
[a1]---->[c1]<----[b2] |
| ^ ^ |
| | | |
+------>[b1]------+ |
^ |
| |
+------------+
-so we'll have maximum length 5 for paths a1a2b2b1c1 and a1a2b1b2c1. We can chose any of them, ignoring indexes in result string aabbc.
Maybe there is no general efficient algorithm this kind of problem. But this is a solution for this particular problemprojecteuler 79. You want to observe the input file. You will find 7 only appears at the beginning of all sequences, so 7 should be put at the beginning. Then you will find digit 3 is the only character at the beginning, then you put 3 on the second position. So on and so forth, you can get 73162890. The special case is the last 2 digits, both 0 and 9 are at the beginning, then you have 2 choices. then you try both and get 90 is the optimal solution.
More generally, you can use the depth-first-search algorithm. The trick is when you find there is one character which only appears at the beginning, just choose it, it will
path to the optimal solution.
I think we can start with graphs .
I am not sure of correctness , but say if we build a graph with a->b if b comes after a , with all paths of length 1.
Now , we have to find longest distance (DFS can help) .
I will try to provide example .
say strings are :
abc
acd
bcd
bce
we form a graph :
Now main thing left will be to combine nodes e and d because the required string can be abcde or abced.
This is what i am not sure how to do , so maybe somebody can help !
Sorry for posting this as answer , comments can't include pictures.
Construct the graph out of the subsequences and then find the topological sort of the graph in O(E) using DFS to get the desired string of shortest length which has all subsequence in it. But as you would have noticed the topo-sort is invalid if the graph has cycles in that cases there are repetitions need for characters in the cycles which is difficult to solve.
Conclusion:- You get lucky if there are no cycles in graph and solve it in O(E) else get unlucky and end up doing brute force.

Shortest Common Superstring algorithm?

I'm trying to code this problem here:
but I'd like to find an algorithm that breaks down the steps for solving the problem. I can't seem to find anything too useful online so I've come here to ask if anyone knows of a resource which I can use to refer to an algorithm that solves this problem.
This is called the shortest common supersequence problem. The idea is that in order for the supersequence to be the shortest, we want to find as many shared bits of a and b as possible. We can solve the problem in two steps:
Find the longest common subsequence of a and b.
Insert the remaining bits of a and b while preserving the order of these bits.
We can solve the longest common subsequence problem using dynamic programming.
I agree with Terry Li: it is only NP-complete to find the SCS of multiple sequences. For 2 sequences (say s is of length n and t is of length m), my solution (doesn't use LCS but uses something similar) is done in O(nm) time:
1) Run a global alignment, in which you disallow mismatches, don't penalize indels, and give a positive score to matches (I did +1 for matches, -10 for mismatches, and 0 for indels, but these can be adjusted). (This is O(nm))
2) Iterate over the global alignment for both output strings v and w. If v[i] isn't a gap, output it. Otherwise, output w[i]. (This is O(n+m)).

Looking for ideas: lexicographically sorted suffix array of many different strings compute efficiently an LCP array

I don't want a direct solution to the problem that's the source of this question but it's this one link:
So I take in the strings and add them to a suffix array which is implemented as a sorted set internally, what I obtain then is a lexicographically sorted list of the two given strings.
S1 = "banana"
S2 = "panama"
SuffixArray.add S1, S2
To make searching for the k-th smallest substring efficient I preprocess this sorted set to add in information about the longest common prefix between a suffix and it's predecessor as well as keeping tabs on a cumulative substrings count. So I know that for a given k greater than the cumulative substrings count of the last item, it's an invalid query.
This works really well for small inputs as well as random large inputs of the constraints given in the problem definition, which is at most 50 strings of length 2000. I am able to pass the 4 out of 7 cases and was pretty surprised I didn't get them all.
So I went searching for the bottleneck and it hit me. Given large number of inputs like these
anananananananana.....ananana
bkbkbkbkbkbkbkbkb.....bkbkbkb
The queries for k-th smallest substrings are still fast as expected but not the way I preprocess the sorted set... The way I calculate the longest common prefix between the elements of the set is not efficient and linear O(m), like this, I did the most naïve thing expecting it to be good enough:
m = anananan
n = anananana
Start at 0 and find the point where `m[i] != n[i]`
It is like this because a suffix and his predecessor might no be related (i.e. coming from different input strings) and so I thought I couldn't help but using brute force.
Here is the question then and where I ended up reducing the problem as. Given a list of lexicographically sorted suffix like in the manner I described above (made up of multiple strings):
What is an efficient way of computing the longest common prefix array?.
The subquestion would then be, am I completely off the mark in my approach? Please propose further avenues of investigation if that's the case.
Foot note, I do not want to be shown implemented algorithm and I don't mind to be told to go read so and so book or resource on the subject as that is what I do anyway while attempting these challenges.
Accepted answer will be something that guides me on the right path or in the case that that fails; something that teaches me how to solve these types of problem in a broader sense, a book or something
READING
I would recommend this tutorial pdf from Stanford.
This tutorial explains a simple O(nlog^2n) algorithm with O(nlogn) space to compute suffix array and a matrix of intermediate results. The matrix of intermediate results can be used to compute the longest common prefix between two suffixes in O(logn).
HINTS
If you wish to try to develop the algorithm yourself, the key is to sort the strings based on their 2^k long prefixes.
From the tutorial:
Let's denote by A(i,k) be the subsequence of A of length 2^k starting at position i.
The position of A(i,k) in the sorted array of A(j,k) subsequences (j=1,n) is kept in P(k,i).
and
Using matrix P, one can iterate descending from the biggest k down to 0 and check whether A(i,k) = A(j,k). If the two prefixes are equal, a common prefix of length 2^k had been found. We only have left to update i and j, increasing them both by 2^k and check again if there are any more common prefixes.

Resources