Longest repeated substring with at least k occurrences correctness - string

The algorithms for finding the longest repeated substring is formulated as follows
1)build the suffix tree
2)find the deepest internal node with at least k leaf children
But I cannot understand why is this works,so basically what makes this algorithm correct?Also,the source where I found this algorithm says that is find the repeated substring in O(n),where n is the length of the substring,this is also not clear to me!Let's consider the following tree,here the longest repeated substring is "ru" and if we apply DFS it will find it in 5 step but not in 2
Can you explain this stuff to me?
Thanks
image

I suppose you perfectly know O(n) (Big O notation) refers to the order of growth of some quantity as a function of n, and not the equivalence of the quantity with n.
I write this becase reading the question I was in doubt...
I'm writing this as an aswer and not a comment since it's a bit too long for a comment (I suppose...)

Given a string S of N characters, building the corresponding suffix tree is O(N) (using an algorithm such as Ukkonen's).
Now, such a suffix tree can have at most 2N - 1 nodes (root and leaves included).
If you traverse your tree and compute the number of leaves reachable from a given node along with its depth, you'll find the desired result. To do so, you start from the root and explore each of its children.
Some pseudo-code:
traverse(node, depth):
nb_leaves <-- 0
if empty(children(node)):
nb_leaves <-- 1
else:
for child in children(node):
nb_leaves <-- nb_leaves + traverse(child, depth+1)
node.setdepth(depth)
node.setoccurrences(nb_leaves)
return nb_leaves
The initial call is traverse(root, 0). Since the structure is a tree, there is only one call to traverse for each node. This means the maximum number of call to traverse is 2N - 1, therefore the overall traversal is only O(N). Now you just have to keep track of the node with the maximum depth that also verifies: depth > 0 && nb_leaves >= k by adding the relevant bookkeeping mechanism. This does not hinder the overall complexity.
In the end, the complexity of the algorithm to find such a substring is O(N) where N is the length of the input string (and not the length of the matching substring!).
Note: The traversal described above is basically a DFS on the suffix tree.

Related

Algorithm, that finds the k-greatest number in O(n*log(k))

was wondering, if you have given an unsorted list of arrays of any length n >= k,
what is your idea, to find the k-greatest number in O(n*log(k)) time. So the k = 2 -greatest number of an Array containing the numbers 1 to 9 would be 8 for example.
I'm trying to code this in python, if you have an idea how in that time complexity :)
My answer is not python-specific, however you should be able to implement the used concepts in python, or find libraries already implementing them.
The basic idea is to iterate over the list and store the current greatest, second greatest, ... , k-greatest number in a separate data structure. Since you will be iterating over all n entries in your array, the complexity of this is in O(n * insertion_step_complexity)
As seen above, the insertion step needs to not exceed a complexity of O(log(k)) to achieve this you can use a AVL-Tree that has a complexity of O(log(m)) for inserting and deleting items, where m is the number of items stored within the avl-tree.
An algorithm would look like this:
def find_k_greatest_number(k, array):
avl_tree = initialize AVL tree here
avl_items = 0
for number in array:
if (number > avl_tree.smallest_number()):
if (avl_itmes >= k):
avl_tree.delete_smallest_number()
else:
avl_items++
avl_tree.insert(number)
return avl_tree.smallest_number()
Finding the smallest number in a sorted tree is dependent on its height. Since the AVL tree can't exceed the height of log(k) the complexity of finding the smallest number is O(log(k)).

How to determine the smallest common divisor of a string?

I was asked the following question during a job interview and was stumped by it.
Part of the problem I had is making up my mind about what problem I was solving. At first I didn't think the question was internally consistent but then I realized it is asking you to solve two different things - the first task is to figure out whether one string contains a multiple of another string. But the second task is to find a smaller unit of division within both strings.
It's a bit more clear to me now with the pressure of the interview room behind me but I'm still not sure what the ideal algorithm would be here. Any suggestions?
Given two strings s & t, determine if s is divisible by t.
For example: "abab" is divisible by "ab"
But "ababab" is not divisible by "abab".
If it isn't divisible, return -1.
If it is, return the length of the smallest common divisor:
So, for "abababab" and "abab", return 2 as s is divisible
by t and the smallest common divisor is "ab" with length 2.
Oddly, you're asked to return -1 unless s is divisible by t (which is easy to check), and then you're only left with cases where t divides s.
If t divides s, then the smallest common divisor is just the smallest divisor of t.
The simplest way to find the smallest divisor of t is to check all the factors of its length to see if the prefix of that length divides t.
You can do it in linear time by building the Knuth-Morris-Pratt search table for t: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
This will tell you all the suffixes of t that are also prefixes of t. If the length of the remainder divides the length of t, then the remainder divides t.
let n is the length of the string s and m is the length of string t, then first we find the gcd(greatest common divisor) of n & m(the largest length that divides both n & m), now we find the all the divisors of gcd in O(square root of gcd) then, we start checking each of them in increasing order whether the starting substring of s or t of length l(divisors of gcd) exist n/l && (m/l) times(using kmp algorithm or robin karp hashing method or rolling hash), if yes, then we break and return length l otherwise we keep checking it until we run out of the divisors and return -1 if nothing is found.

How does Duval's algorithm handle odd-length strings?

Finding the Lexicographically minimal string rotation is a well known problem, for which a linear time algorithm was proposed by Jean Pierre Duval in 1983. This blog post is probably the only publicly available resource that talks about the algorithm in detail. However, Duval's algorithms is based on the idea of pairwise comparisons ("duels"), and the blog conveniently uses an even-length string as an example.
How does the algorithm work for odd-length strings, where the last character wouldn't have a competing one to duel with?
One character can get a "bye", where it wins without participating in a "duel". The correctness of the algorithm does not rely on the specific duels that you perform; given any two distinct indices i and j, you can always conclusively rule out that one of them is the start-index of the lexicographically-minimal rotation (unless both are start-indices of identical lexicographically-minimal rotations, in which case it doesn't matter which one you reject). The reason to perform the duels in a specific order is performance: to get asymptotically linear time by ensuring that half the duels only need to compare one character, half of the rest only need to compare two characters, and so on, until the last duel only needs to compare half the length of the string. But a single odd character here and there doesn't change the asymptotic complexity, it just makes the math (and implementation) a little bit more complicated. A string of length 2n+1 still requires fewer "duels" than one of length 2n+1.
OP here: I accepted ruakh's answer as it pertains to my question, but I wanted to provide my own explanation for others that might stumble across this post trying to understand Duval's algorithm.
Problem:
Lexicographically least circular substring is the problem of finding
the rotation of a string possessing the lowest lexicographical order
of all such rotations. For example, the lexicographically minimal
rotation of "bbaaccaadd" would be "aaccaaddbb".
Solution:
A O(n) time algorithm was proposed by Jean Pierre Duval (1983).
Given two indices i and j, Duval's algorithm compares string segments of length j - i starting at i and j (called a "duel"). If index + j - i is greater than the length of the string, the segment is formed by wrapping around.
For example, consider s = "baabbaba", i = 5 and j = 7. Since j - i = 2, the first segment starting at i = 5 is "ab". The second segment starting at j = 7 is constructed by wrapping around, and is also "ab".
If the strings are lexicographically equal, like in the above example, we choose the one starting at i as the winner, which is i = 5.
The above process repeated until we have a single winner. If the input string is of odd length, the last character wins without a comparison in the first iteration.
Time complexity:
The first iteration compares n strings each of length 1 (n/2 comparisons), the second iteration may compare n/2 strings of length 2 (n/2 comparisons), and so on, until the i-th iteration compares 2 strings of length n/2 (n/2 comparisons). Since the number of winners is halved each time, the height of the recursion tree is log(n), thus giving us a O(n log(n)) algorithm. For small n, this is approximately O(n).
Space complexity is O(n) too, since in the first iteration, we have to store n/2 winners, second iteration n/4 winners, and so on. (Wikipedia claims this algorithm uses constant space, I don't understand how).
Here's a Scala implementation; feel free to convert to your favorite programming language.
def lexicographicallyMinRotation(s: String): String = {
#tailrec
def duel(winners: Seq[Int]): String = {
if (winners.size == 1) s"${s.slice(winners.head, s.length)}${s.take(winners.head)}"
else {
val newWinners: Seq[Int] = winners
.sliding(2, 2)
.map {
case Seq(x, y) =>
val range = y - x
Seq(x, y)
.map { i =>
val segment = if (s.isDefinedAt(i + range - 1)) s.slice(i, i + range)
else s"${s.slice(i, s.length)}${s.take(s.length - i)}"
(i, segment)
}
.reduce((a, b) => if (a._2 <= b._2) a else b)
._1
case xs => xs.head
}
.toSeq
duel(newWinners)
}
}
duel(s.indices)
}

Longest Suffix-Prefix Overlap Algorithm

I have been tasked with identifying an efficient algorithm [O(n*log(n))] that, given a set of k Strings S = {s-1, s-2, s-3, ..., s-k}, will identify the longest substring T for each pair of strings (s-i, s-j), such that T is a suffix of s-i and a prefix of s-j, as well as the longest substring T for each pair of strings (s-j, s-i). n represents the added lengths of all k strings (n = |s-1| + |s-2| + |s-3| + ... + |s-k|).
Any thoughts? A link to a solution would be fine as well. Thanks in advance!
Algorithm 4.10 on page 61 of the book Algorithmic Aspects of Bioinformatics gives a method of computing the longest common substring of a set of given strings using suffix trees
The article also explains how finding the longest common substring is possible
in linear time with respect to the size of the suffix tree, i.e. in O(n log n).

Finding the minimum number of swaps to convert one string to another, where the strings may have repeated characters

I was looking through a programming question, when the following question suddenly seemed related.
How do you convert a string to another string using as few swaps as follows. The strings are guaranteed to be interconvertible (they have the same set of characters, this is given), but the characters can be repeated. I saw web results on the same question, without the characters being repeated though.
Any two characters in the string can be swapped.
For instance : "aabbccdd" can be converted to "ddbbccaa" in two swaps, and "abcc" can be converted to "accb" in one swap.
Thanks!
This is an expanded and corrected version of Subhasis's answer.
Formally, the problem is, given a n-letter alphabet V and two m-letter words, x and y, for which there exists a permutation p such that p(x) = y, determine the least number of swaps (permutations that fix all but two elements) whose composition q satisfies q(x) = y. Assuming that n-letter words are maps from the set {1, ..., m} to V and that p and q are permutations on {1, ..., m}, the action p(x) is defined as the composition p followed by x.
The least number of swaps whose composition is p can be expressed in terms of the cycle decomposition of p. When j1, ..., jk are pairwise distinct in {1, ..., m}, the cycle (j1 ... jk) is a permutation that maps ji to ji + 1 for i in {1, ..., k - 1}, maps jk to j1, and maps every other element to itself. The permutation p is the composition of every distinct cycle (j p(j) p(p(j)) ... j'), where j is arbitrary and p(j') = j. The order of composition does not matter, since each element appears in exactly one of the composed cycles. A k-element cycle (j1 ... jk) can be written as the product (j1 jk) (j1 jk - 1) ... (j1 j2) of k - 1 cycles. In general, every permutation can be written as a composition of m swaps minus the number of cycles comprising its cycle decomposition. A straightforward induction proof shows that this is optimal.
Now we get to the heart of Subhasis's answer. Instances of the asker's problem correspond one-to-one with Eulerian (for every vertex, in-degree equals out-degree) digraphs G with vertices V and m arcs labeled 1, ..., m. For j in {1, ..., n}, the arc labeled j goes from y(j) to x(j). The problem in terms of G is to determine how many parts a partition of the arcs of G into directed cycles can have. (Since G is Eulerian, such a partition always exists.) This is because the permutations q such that q(x) = y are in one-to-one correspondence with the partitions, as follows. For each cycle (j1 ... jk) of q, there is a part whose directed cycle is comprised of the arcs labeled j1, ..., jk.
The problem with Subhasis's NP-hardness reduction is that arc-disjoint cycle packing on Eulerian digraphs is a special case of arc-disjoint cycle packing on general digraphs, so an NP-hardness result for the latter has no direct implications for the complexity status of the former. In very recent work (see the citation below), however, it has been shown that, indeed, even the Eulerian special case is NP-hard. Thus, by the correspondence above, the asker's problem is as well.
As Subhasis hints, this problem can be solved in polynomial time when n, the size of the alphabet, is fixed (fixed-parameter tractable). Since there are O(n!) distinguishable cycles when the arcs are unlabeled, we can use dynamic programming on a state space of size O(mn), the number of distinguishable subgraphs. In practice, that might be sufficient for (let's say) a binary alphabet, but if I were to try to try to solve this problem exactly on instances with large alphabets, then I likely would try branch and bound, obtaining bounds by using linear programming with column generation to pack cycles fractionally.
#article{DBLP:journals/corr/GutinJSW14,
author = {Gregory Gutin and
Mark Jones and
Bin Sheng and
Magnus Wahlstr{\"o}m},
title = {Parameterized Directed \$k\$-Chinese Postman Problem and \$k\$
Arc-Disjoint Cycles Problem on Euler Digraphs},
journal = {CoRR},
volume = {abs/1402.2137},
year = {2014},
ee = {http://arxiv.org/abs/1402.2137},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
You can construct the "difference" strings S and S', i.e. a string which contains the characters at the differing positions of the two strings, e.g. for acbacb and abcabc it will be cbcb and bcbc. Let us say this contains n characters.
You can now construct a "permutation graph" G which will have n nodes and an edge from i to j if S[i] == S'[j]. In the case of all unique characters, it is easy to see that the required number of swaps will be (n - number of cycles in G), which can be found out in O(n) time.
However, in the case where there are any number of duplicate characters, this reduces to the problem of finding out the largest number of cycles in a directed graph, which, I think, is NP-hard, (e.g. check out: http://www.math.ucsd.edu/~jverstra/dcig.pdf ).
In that paper a few greedy algorithms are pointed out, one of which is particularly simple:
At each step, find the minimum length cycle in the graph (e.g. Find cycle of shortest length in a directed graph with positive weights )
Delete it
Repeat until all vertexes have not been covered.
However, there may be efficient algorithms utilizing the properties of your case (the only one I can think of is that your graphs will be K-partite, where K is the number of unique characters in S). Good luck!
Edit:
Please refer to David's answer for a fuller and correct explanation of the problem.
Do an A* search (see http://en.wikipedia.org/wiki/A-star_search_algorithm for an explanation) for the shortest path through the graph of equivalent strings from one string to the other. Use the Levenshtein distance / 2 as your cost heuristic.

Resources