Heftiest repeated substring

Heftiest repeated substring - string

I am looking for naming/literature/implementations for a variation on the longest repeated substring problem. In the cited problem you find the longest (consecutive) substring with at least 2 (non-overlapping) repetitions:
max len(s) | rep(s) > 1
In my problem, I am looking for a substring with length greater than 1, that is repeated at least 2 times, and has the greatest (length times repetitions), thus "heftiest" (but surely there is a better name):
max len(s)*rep(s) | rep(s) > 1, len(s) > 1

This can also be done with a suffix tree. Annotate each node in the suffix tree with the number of leaves in its subtree. That takes time O(n) since the tree has n nodes. Now, run a DFS over the tree, along the lines of the classical longest repeated substring algorithm, keeping track of the length of the string represented by each node. As you do, compute the product of that length with the number of leaves in the subtree. That will give you the product of the number of occurrences of the substring and the length of the substring. Finally, return the max out of everything you find.

Related

what will be the dp and transitions in this problem

Vasya has a string s of length n consisting only of digits 0 and 1. Also he has an array a of length n.
Vasya performs the following operation until the string becomes empty: choose some consecutive substring of equal characters, erase it from the string and glue together the remaining parts (any of them can be empty). For example, if he erases substring 111 from string 111110 he will get the string 110. Vasya gets ax points for erasing substring of length x.
Vasya wants to maximize his total points, so help him with this!
https://codeforces.com/problemset/problem/1107/E
i was trying to get my head around the editorial,but couldn't understand it... can anyone tell an easy way to do it?
input:
7
1101001
3 4 9 100 1 2 3
output:
109
Explanation
the optimal sequence of erasings is: 1101001 → 111001 → 11101 → 1111 → ∅.

Here, we consider removing prefixes instead of substrings. Why?
We try to remove a consecutive prefix of a particular state which is actually a substring in the main string. So, our DP states will be start index, end index, prefix length.
Let's consider an example str = "1010110". Here, initially start=0, end=7, and prefix=1(the first '1' will be the only prefix now). we iterate over all the indices in the current state except the starting index and check if str[i]==str[start]. Here, for example, str[4]==str[0]. Now we divide the string into "010" with prefix=1(010) && "110" with prefix=2(1010110). These two are now two individual subproblems. So, when there remains a string with length 1, we return aprefix.
Here is my code.

Is it possible to count the number of distinct substrings in a string in O(n)?

Given a string s of length n, is it possible to count the number of distinct substrings in s in O(n)?
Example
Input: abb
Output: 5 ('abb', 'ab', 'bb', 'a', 'b')
I have done some research but i can't seem to find an algorithm that solves this problem in such an efficient way. I know a O(n^2) approach is possible, but is there a more efficient algorithm?
I don't need to obtain each of the substrings, just the total number of distinct ones (in case it makes a difference).

You can use Ukkonen's algorithm to build a suffix tree in linear time:
https://en.wikipedia.org/wiki/Ukkonen%27s_algorithm
The number of substrings of s is then the number of prefixes of strings in the trie, which you can calculate simply in linear time. It's just total number of characters in all nodes.
For instance, your example produces a suffix tree like:
/\
b a
| b
b b
5 characters in the tree, so 5 substrings. Each unique string is a path from the root ending after a different letter: abb, ab, a, bb, b. So the number of strings is the number of letters in the tree.
More precisely:
Every substring is the prefix of some suffix of the string;
All the suffixes are in the trie;
So there is a 1-1 correspondence between substrings and paths through the trie (by the definition of trie); and
There is a 1-1 correspondence between letters in the tree and non-empty paths, because:
each distinct non-empty path ends at a distinct position after its last letter; and
the path to the the position following each letter is unique
NOTE for people who are wondering how it could be possible to build a tree that contains O(N^2) characters in O(N) time:
There's a trick to the representation of a suffix tree. Instead of storing the actual strings in the nodes of the tree, you just store pointers into the orignal string, so the node that contains "abb" doesn't have "abb", it has (0,3) -- 2 integers per node, regardless of how long the string in each node is, and the suffix tree has O(N) nodes.

Construct the LCP array and subtract its sum from the number of substrings (n(n+1)/2).

Binary search - worst/avg case

I'm finding it difficult to understand why/how the worst and average case for searching for a key in an array/list using binary search is O(log(n)).
log(1,000,000) is only 6. log(1,000,000,000) is only 9 - I get that, but I don't understand the explanation. If one did not test it, how do we know that the avg/worst case is actually log(n)?
I hope you guys understand what I'm trying to say. If not, please let me know and I'll try to explain it differently.

Worst case
Every time the binary search code makes a decision, it eliminates half of the remaining elements from consideration. So you're dividing the number of elements by 2 with each decision.
How many times can you divide by 2 before you are down to only a single element? If n is the starting number of elements and x is the number of times you divide by 2, we can write this as:
n / (2 * 2 * 2 * ... * 2) = 1 [the '2' is repeated x times]
or, equivalently,
n / 2^x = 1
or, equivalently,
n = 2^x
So log base 2 of n gives you x, which is the number of decisions being made.
Finally, you might ask, if I used log base 2, why is it also OK to write it as log base 10, as you have done? The base does not matter because the difference is only a constant factor which is "ignored" by Big O notation.
Average case
I see that you also asked about the average case. Consider:
There is only one element in the array that can be found on the first try.
There are only two elements that can be found on the second try. (Because after the first try, we chose either the right half or the left half.)
There are only four elements that can be found on the third try.
You can see the pattern: 1, 2, 4, 8, ... , n/2. To express the same pattern going in the other direction:
Half the elements take the maximum number of decisions to find.
A quarter of the elements take one fewer decision to find.
etc.
Since half of the elements take the maximum amount of time, it doesn't matter how much less time the other elements take. We could assume that all elements take the maximum amount of time, and even if half of them actually take 0 time, our assumption would not be more than double whatever the true average is. We can ignore "double" since it is a constant factor. So the average case is the same as the worst case, as far as Big O notation is concerned.

For binary search, the array should be arranged in ascending or descending order.
In each step, the algorithm compares the search key value with the key value of the middle element of the array.
If the keys match, then a matching element has been found and its index, or position, is returned.
Otherwise, if the search key is less than the middle element's key, then the algorithm repeats its action on the sub-array to the left of the middle element.
Or, if the search key is greater,then the algorithm repeats its action on the sub-array to the right.
If the remaining array to be searched is empty, then the key cannot be found in the array and a special "not found" indication is returned.
So, a binary search is a dichotomic divide and conquer search algorithm. Thereby it takes logarithmic time for performing the search operation as the elements are reduced by half in each of the iteration.

For sorted lists which we can do a binary search, each "decision" made by the binary search compares your key to the middle element, if greater it takes the right half of the list, if less it will take the left half of the list (if it's a match it will return the element at that position) you effectively reduce your list by half for every decision yielding O(logn).
Binary search however, only works for sorted lists. For un-sorted lists you can do a straight search starting with the first element yielding a complexity of O(n).
O(logn) < O(n)
Although it entirely depends on how many searches you'll be doing, your inputs, etc what your best approach would be.

For Binary search the prerequisite is a sorted array as input.
• As the list is sorted:
• Certainly we don't have to check every word in the dictionary to look up a word.
• A basic strategy is to repeatedly halve our search range until we find the value.
• For example, look for 5 in the list of 9 #s below.v = 1 1 3 5 8 10 18 33 42
• We would first start in the middle: 8
• Since 5<8, we know we can look at just the first half: 1 1 3 5
• Looking at the middle # again, narrow down to 3 5
• Then we stop when we're down to one #: 5
How many comparison is needed: 4 =log(base 2)(9-1)=O(log(base2)n)
int binary_search (vector<int> v, int val) {
int from = 0;
int to = v.size()-1;
int mid;
while (from <= to) {
mid = (from+to)/2;
if (val == v[mid])
return mid;
else if (val > v[mid])
from = mid+1;
else
to = mid-1;
}
return -1;
}

What is the fastest way to sort n strings of length n each?

I have n strings, each of length n. I wish to sort them in ascending order.
The best algorithm I can think of is n^2 log n, which is quick sort. (Comparing two strings takes O(n) time). The challenge is to do it in O(n^2) time. How can I do it?
Also, radix sort methods are not permitted as you do not know the number of letters in the alphabet before hand.

Assume any letter is a to z.
Since no requirement for in-place sorting, create an array of linked list with length 26:
List[] sorted= new List[26]; // here each element is a list, where you can append
For a letter in that string, its sorted position is the difference of ascii: x-'a'.
For example, position for 'c' is 2, which will be put to position as
sorted[2].add('c')
That way, sort one string only take n.
So sort all strings takes n^2.
For example, if you have "zdcbacdca".
z goes to sorted['z'-'a'].add('z'),
d goes to sorted['d'-'a'].add('d'),
....
After sort, one possible result looks like
0 1 2 3 ... 25 <br/>
a b c d ... z <br/>
a b c <br/>
c
Note: the assumption of letter collection decides the length of sorted array.

For small numbers of strings a regular comparison sort will probably be faster than a radix sort here, since radix sort takes time proportional to the number of bits required to store each character. For a 2-byte Unicode encoding, and making some (admittedly dubious) assumptions about equal constant factors, radix sort will only be faster if log2(n) > 16, i.e. when sorting more than about 65,000 strings.
One thing I haven't seen mentioned yet is the fact that a comparison sort of strings can be enhanced by exploiting known common prefixes.
Suppose our strings are S[0], S[1], ..., S[n-1]. Let's consider augmenting mergesort with a Longest Common Prefix (LCP) table. First, instead of moving entire strings around in memory, we will just manipulate lists of indices into a fixed table of strings.
Whenever we merge two sorted lists of string indices X[0], ..., X[k-1] and Y[0], ..., Y[k-1] to produce Z[0], ..., Z[2k-1], we will also be given 2 LCP tables (LCPX[0], ..., LCPX[k-1] for X and LCPY[0], ..., LCPY[k-1] for Y), and we need to produce LCPZ[0], ..., LCPZ[2k-1] too. LCPX[i] gives the length of the longest prefix of X[i] that is also a prefix of X[i-1], and similarly for LCPY and LCPZ.
The first comparison, between S[X[0]] and S[Y[0]], cannot use LCP information and we need a full O(n) character comparisons to determine the outcome. But after that, things speed up.
During this first comparison, between S[X[0]] and S[Y[0]], we can also compute the length of their LCP -- call that L. Set Z[0] to whichever of S[X[0]] and S[Y[0]] compared smaller, and set LCPZ[0] = 0. We will maintain in L the length of the LCP of the most recent comparison. We will also record in M the length of the LCP that the last "comparison loser" shares with the next string from its block: that is, if the most recent comparison, between two strings S[X[i]] and S[Y[j]], determined that S[X[i]] was smaller, then M = LCPX[i+1], otherwise M = LCPY[j+1].
The basic idea is: After the first string comparison in any merge step, every remaining string comparison between S[X[i]] and S[Y[j]] can start at the minimum of L and M, instead of at 0. That's because we know that S[X[i]] and S[Y[j]] must agree on at least this many characters at the start, so we don't need to bother comparing them. As larger and larger blocks of sorted strings are formed, adjacent strings in a block will tend to begin with longer common prefixes, and so these LCP values will become larger, eliminating more and more pointless character comparisons.
After each comparison between S[X[i]] and S[Y[j]], the string index of the "loser" is appended to Z as usual. Calculating the corresponding LCPZ value is easy: if the last 2 losers both came from X, take LCPX[i]; if they both came from Y, take LCPY[j]; and if they came from different blocks, take the previous value of L.
In fact, we can do even better. Suppose the last comparison found that S[X[i]] < S[Y[j]], so that X[i] was the string index most recently appended to Z. If M ( = LCPX[i+1]) > L, then we already know that S[X[i+1]] < S[Y[j]] without even doing any comparisons! That's because to get to our current state, we know that S[X[i]] and S[Y[j]] must have first differed at character position L, and it must have been that the character x in this position in S[X[i]] was less than the character y in this position in S[Y[j]], since we concluded that S[X[i]] < S[Y[j]] -- so if S[X[i+1]] shares at least the first L+1 characters with S[X[i]], it must also contain x at position L, and so it must also compare less than S[Y[j]]. (And of course the situation is symmetrical: if the last comparison found that S[Y[j]] < S[X[i]], just swap the names around.)
I don't know whether this will improve the complexity from O(n^2 log n) to something better, but it ought to help.

You can build a Trie, which will cost O(s*n),
Details:
https://stackoverflow.com/a/13109908

Solving it for all cases should not be possible in better that O(N^2 Log N).
However if there are constraints that can relax the string comparison, it can be optimised.
-If the strings have high repetition rate and are from a finite ordered set. You can use ideas from count sort and use a map to store their count. later, sorting just the map keys should suffice. O(NMLogM) where M is the number of unique strings. You can even directly use TreeMap for this purpose.
-If the strings are not random but the suffixes of some super string this can well be done
O(N Log^2N). http://discuss.codechef.com/questions/21385/a-tutorial-on-suffix-arrays

Given string s, find the shortest string t, such that, t^m=s

Given string s, find the shortest string t, such that, t^m=s.
Examples:
s="aabbb" => t="aabbb"
s="abab" => t = "ab"
How fast can it be done?
Of course naively, for every m divides |s|, I can try if substring(s,0,|s|/m)^m = s.
One can figure out the solution in O(d(|s|)n) time, where d(x) is the number of divisors of s. Can it be done more efficiently?

This is the problem of computing the period of a string. Knuth, Morris and Pratt's sequential string matching algorithm is a good place to get started. This is in a paper entitled "Fast Pattern Matching in Strings" from 1977.
If you want to get fancy with it, then check out the paper "Finding All Periods and Initial Palindromes of a String in Parallel" by Breslauer and Galil in 1991. From their abstract:
An optimal O(log log n) time CRCW-PRAM algorithm for computing all
periods of a string is presented. Previous parallel algorithms compute
the period only if it is shorter than half of the length of the
string. This algorithm can be used to find all initial palindromes of
a string in the same time and processor bounds. Both algorithms are
the fastest possible over a general alphabet. We derive a lower bound
for finding palindromes by a modification of a previously known lower
bound for finding the period of a string [3]. When p processors are
available the bounds become \Theta(d n p e + log log d1+p=ne 2p).

I really like this thing called the z-algorithm: http://www.utdallas.edu/~besp/demo/John2010/z-algorithm.htm
For every position it calculates the longest substring starting from there, that is also a prefix of the whole string. (in linear time of course).
a a b c a a b x a a a z
1 0 0 3 1 0 0 2 2 1 0
Given this "z-table" it is easy to find all strings that can be exponentiated to the whole thing. Just check for all positions if pos+z[pos] = n.
In our case:
a b a b
0 2 0
Here pos = 2 gives you 2+z[2] = 4 = n hence the shortest string you can use is the one of length 2.
This will even allow you to find cases where only a prefix of the exponentiated string matches, like:
a b c a
0 0 1
Here (abc)^2 can be cut down to your original string. But of course, if you don't want matches like this, just go over the divisors only.

Yes you can do it in O(|s|) time.
You can search for a "target" string of length n in a "source" string of length m in O(n+m) time. Build a solution based on that.
Let both source and target be s. An additional constraint is that 1 and any positions in the source that do not divide |s| are not valid starting positions for the match. Of course the search per se will always fail. But if there's a partial match and you have reached the end of the sourse string, then you have a solution to the original problem.

a modification to Boyer-Moore could possibly handle this in O(n) where n is length of s
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string