Number of distinct rotated strings

We have a string S and we want to calculate the number of distinct strings that can be formed by rotating it.
For example:
S = "aaaa" yields 1 string: {"aaaa"}
S = "abab" yields 2 strings: {"abab", "baba"}
So, is there an algorithm to solve this in O(|S|) time, where |S| is the length of the string?

Suffix trees, baby!
If the string is S, construct the suffix tree for SS (S concatenated with itself). Then find the number of unique substrings of length |S|. The uniqueness you get automatically; for the length-|S| restriction you may have to change the suffix tree algorithm a little (to maintain depth information), but it's doable.
(Note that the other answer by johnsoe is actually quadratic, or worse, depending on the implementation of Set.)

You can solve this with the rolling hash functions used in the Rabin-Karp algorithm.
The rolling hash lets you update the hash of a sliding window in constant time, so you can insert all substrings of size |S| (obtained by sliding a window of length |S| across SS) into a hash table in O(|S|) total.
Assuming your string comes from an alphabet of constant size, you can then inspect the hash table to obtain the required count.
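Here is a minimal Java sketch of that idea (the names and constants are mine, not from the answer). It uses a single polynomial hash modulo a large prime, so two distinct rotations could in principle collide; a careful version would verify colliding windows or use double hashing.

import java.util.HashSet;
import java.util.Set;

public class DistinctRotations {

    // Counts distinct rotations of s by hashing every window of length |s|
    // in s + s with a rolling polynomial hash. O(|s|) expected time.
    static int countDistinctRotations(String s) {
        int n = s.length();
        String t = s + s;
        final long MOD = 1_000_000_007L;
        final long BASE = 131;

        long pow = 1;                               // BASE^(n-1) mod MOD
        for (int i = 0; i < n - 1; i++) pow = pow * BASE % MOD;

        long h = 0;                                 // hash of t[0 .. n-1]
        for (int i = 0; i < n; i++) h = (h * BASE + t.charAt(i)) % MOD;

        Set<Long> seen = new HashSet<>();
        seen.add(h);
        for (int i = 1; i < n; i++) {               // slide to window t[i .. i+n-1]
            h = ((h - t.charAt(i - 1) * pow) % MOD + MOD) % MOD; // drop left char
            h = (h * BASE + t.charAt(i + n - 1)) % MOD;          // append right char
            seen.add(h);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinctRotations("aaaa")); // 1
        System.out.println(countDistinctRotations("abab")); // 2
    }
}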

Something like this should do the trick.
// requires java.util.Set and java.util.HashSet
public static int uniqueRotations(String phrase){
    Set<String> rotations = new HashSet<String>();
    rotations.add(phrase);
    for(int i = 0; i < phrase.length() - 1; i++){
        // rotate right by one: move the last character to the front
        phrase = phrase.charAt(phrase.length() - 1) + phrase.substring(0, phrase.length() - 1);
        rotations.add(phrase);
    }
    return rotations.size();
}
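For example, uniqueRotations("abab") returns 2. Note that each iteration builds a fresh |S|-character string and hashes it, so this runs in O(|S|²) time overall, which is the quadratic behaviour the suffix tree answer above refers to.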

How to efficiently find identical substrings of a specified length in a collection of strings?

I have a collection S, typically containing 10-50 long strings. For illustrative purposes, suppose the length of each string ranges between 1000 and 10000 characters.
I would like to find strings of specified length k (typically in the range of 5 to 20) that are substrings of every string in S. This can obviously be done using a naive approach - enumerating every k-length substring in S[0] and checking if they exist in every other element of S.
Are there more efficient ways of approaching the problem? As far as I can tell, there are some similarities between this and the longest common subsequence problem, but my understanding of LCS is limited and I'm not sure how it could be adapted to the situation where we bound the desired common substring length to k, or if subsequence techniques can be applied to finding substrings.
Here's one fairly simple algorithm, which should be reasonably fast.
1) Using a rolling hash as in the Rabin-Karp string search algorithm, construct a hash table H0 of all the |S0|-k+1 length-k substrings of S0. That's roughly O(|S0|), since each hash is computed in O(1) from the previous hash, but it will take longer if there are collisions or duplicate substrings. Using a better hash will help with collisions, but if there are a lot of duplicate k-length substrings in S0 then you could end up using O(k|S0|).
2) Now use the same rolling hash on S1. This time, look each substring up in H0 and, if you find it, remove it from H0 and insert it into a new table H1. Again, this should be around O(|S1|) unless you hit some pathological case, like both S0 and S1 being long repetitions of the same character. (It's also going to be suboptimal if S0 and S1 are the same string, or have lots of overlapping pieces.)
3) Repeat step 2 for each Si, each time creating a new hash table. (At the end of each iteration of step 2, you can delete the hash table from the previous step.)
At the end, the last hash table will contain all the common k-length substrings.
The total run time should be about O(Σ|Si|) but in the worst case it could be O(kΣ|Si|). Even so, with the problem size as described, it should run in acceptable time.
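A minimal Java sketch of this scheme (identifiers are mine). It keeps only the hashes, so a final pass should verify the surviving candidates against the actual substrings to rule out collisions; retainAll plays the role of moving survivors from one table into the next:

import java.util.HashSet;
import java.util.Set;

public class CommonKSubstrings {
    static final long MOD = 1_000_000_007L;
    static final long BASE = 131;

    // Rolling hashes of every k-length window of s, computed in O(|s|).
    static Set<Long> windowHashes(String s, int k) {
        Set<Long> out = new HashSet<>();
        long pow = 1;                                   // BASE^(k-1) mod MOD
        for (int i = 0; i < k - 1; i++) pow = pow * BASE % MOD;
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            if (i >= k)                                 // drop the char leaving the window
                h = ((h - s.charAt(i - k) * pow) % MOD + MOD) % MOD;
            h = (h * BASE + s.charAt(i)) % MOD;         // append the new char
            if (i >= k - 1) out.add(h);
        }
        return out;
    }

    // Hash values of the k-length substrings common to all strings.
    static Set<Long> commonWindowHashes(String[] strings, int k) {
        Set<Long> common = windowHashes(strings[0], k);
        for (int i = 1; i < strings.length; i++)
            common.retainAll(windowHashes(strings[i], k));
        return common;
    }
}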
Some thoughts (N is the number of strings, M is the average string length, K is the needed substring size):
Approach 1:
Walk through all strings, computing a rolling hash for every k-length substring and storing these hashes in a map (store tuples {key: hash; string_num; position}).
Time O(N·M), space O(N·M).
Extract the groups with equal hash, then check step by step:
1) that the size of the group >= the number of strings
2) that all strings are represented in the group
3) the actual substrings for equality (hashes of distinct substrings can coincide)
Approach 2:
Build a suffix array for every string.
Time O(N·M·log M), space O(N·M).
Find the intersection of the suffix arrays for the first pair of strings using a merge-like approach (the suffixes are sorted), comparing only the first k characters of each suffix, then continue with the next string, and so on.
I would treat each long string as a collection of overlapping short strings, so ABCDEFGHI becomes ABCDE, BCDEF, CDEFG, DEFGH, EFGHI. You can represent each short string as a pair of indexes, one specifying the long string and one the starting offset in that string (if this strikes you as naive, skip to the end).
I would then sort each collection into ascending order.
Now you can find the short strings common to the first two collections by merging the sorted lists of indexes, keeping only those from the first collection which are also present in the second. Check the survivors against the third collection, and so on; the survivors at the end correspond to the short strings present in all the long strings.
(Alternatively, you could maintain a set of pointers into each sorted list, repeatedly check whether every pointer points at short strings with the same text, and advance the pointer that points at the smallest short string.)
Time is O(n log n) for the initial sort, which dominates. In the worst case - e.g. when every string is AAAAAAAA..AA - there is a factor of k on top of this, because every string comparison checks all k characters. Hopefully there is a clever way around this with the suffix array (https://en.wikipedia.org/wiki/Suffix_array), which allows you to sort in time O(n) rather than O(nk log n), and the LCP array (https://en.wikipedia.org/wiki/LCP_array), which should allow you to skip some characters when comparing substrings from different suffix arrays.
Thinking about this again, I think the usual suffix array trick of concatenating all of the strings in question, separated by a character not found in any of them, works here. If you look at the LCP of the resulting suffix array you can split it into sections, splitting at points where the first difference between adjacent suffixes occurs fewer than k characters in. Now every offset in any particular section starts with the same k characters. Look at the offsets in each section and check whether there is at least one offset from every input string; if so, this k-character sequence occurs in all input strings, but not otherwise. (There are suffix array constructions which work with arbitrarily large alphabets, so you can always expand your alphabet to produce a character not in any string, if necessary.)
I would try a simple method using HashSets:
1) Build a HashSet for each long string in S containing all its k-strings.
2) Sort the sets by number of elements.
3) Scan the first (smallest) set.
4) Look up each term in the other sets.
The first step takes care of repetitions within each long string.
The second ensures the minimum number of comparisons.
let getHashSet k (lstr: string) =
    let strs = System.Collections.Generic.HashSet<string>()
    for i in 0 .. lstr.Length - k do
        strs.Add lstr.[i..i + k - 1] |> ignore
    strs

let getCommons k lstrs =
    let strss = lstrs |> Seq.map (getHashSet k) |> Seq.sortBy (fun strs -> strs.Count)
    match strss |> Seq.tryHead with
    | None -> [||]
    | Some h ->
        let rest = Seq.tail strss |> Seq.toArray
        [| for s in h do
               if rest |> Array.forall (fun strs -> strs.Contains s) then yield s
        |]
Test:
let random = System.Random System.DateTime.Now.Millisecond

let generateString n =
    [| for i in 1 .. n do
           yield random.Next 20 |> (+) 65 |> System.Convert.ToByte
    |] |> System.Text.Encoding.ASCII.GetString

[ for i in 1..3 do yield generateString 10000 ]
|> getCommons 4
|> fun l -> printfn "found %d\n %A" l.Length l
result:
found 40
[|"PPTD"; "KLNN"; "FTSR"; "CNBM"; "SSHG"; "SHGO"; "LEHS"; "BBPD"; "LKQP"; "PFPH";
"AMMS"; "BEPC"; "HIPL"; "PGBJ"; "DDMJ"; "MQNO"; "SOBJ"; "GLAG"; "GBOC"; "NSDI";
"JDDL"; "OOJO"; "NETT"; "TAQN"; "DHME"; "AHDR"; "QHTS"; "TRQO"; "DHPM"; "HIMD";
"NHGH"; "EARK"; "ELNF"; "ADKE"; "DQCC"; "GKJA"; "ASME"; "KFGM"; "AMKE"; "JJLJ"|]
Here it is in fiddle: https://dotnetfiddle.net/ZK8DCT

Using a trie for string segmentation - time complexity?

Problem to be solved:
Given a non-empty string s and a string array wordArr containing a list
of non-empty words, determine if s can be segmented into a
space-separated sequence of one or more dictionary words. You may
assume the dictionary does not contain duplicate words.
For example, given s = "leetcode", wordArr = ["leet", "code"].
Return true because "leetcode" can be segmented as "leet code".
In the above problem, would it work to build a trie containing each string in wordArr, and then, for each char in the given string s, work down the trie? If a trie branch terminates, that substring is complete, so pass the remaining string back to the root and do the same thing recursively.
This should be O(N) time and O(N) space, correct? I ask because the problem I'm working on says the most optimal solution is O(N^2) time, and I'm not sure what's wrong with my approach.
For example, if s = "hello" and wordArr = ["he", "ll", "ee", "zz", "o"], then "he" will be completed in the first branch of the trie, "llo" will be passed up to the root recursively. Then, "ll" will be completed, so "o" gets passed up to root of trie. Then "o" is completed, which is the end of s, so return true. If the end of s isn't completed, return false.
Is this correct?
Your example would indeed suggest a linear time complexity, but look at this example:
s = "hello"
wordArr = ["hell", "he", "e", "ll", "lo", "l", "h"]
Now, first "hell" is tried, but in the next recursion cycle no solution is found (there is no "o"), so the algorithm needs to backtrack and assume "hell" is not suitable (pun not intended). So you try "he", and in the next level you find "ll", but then again it fails, as there is no "o". Again backtracking is needed. Now start with "h", then "e", and then again you hit a failure: you try "ll" without success, so backtrack and use "l" instead; now the solution is available: "h e l lo".
So, no, this does not have O(n) time complexity.
I suspect off-hand that the issue is backtracking. What if the word is not segmentable based on a particular dictionary, or what if there are multiple possible substrings with a common prefix? E.g., suppose the dictionary contains he, llenic, and llo. Failure down one branch of the trie would require backtracking, with some corresponding increase in time complexity.
This is similar to a regex-match problem: the example you give is like testing an input word against
^(he|ll|ee|zz|o)+$
(any number of dictionary members, in any order, and nothing else). I don't know the time complexity of regex matchers offhand, but I know backtracking can get you into serious time trouble.
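For instance, with Java's backtracking regex engine (a small illustration of the analogy, not code from the original answer):

import java.util.regex.Pattern;

public class WordBreakAsRegex {
    public static void main(String[] args) {
        // any number of dictionary members, in any order, and nothing else
        Pattern p = Pattern.compile("^(he|ll|ee|zz|o)+$");
        System.out.println(p.matcher("hello").matches());  // true: "he" + "ll" + "o"
        System.out.println(p.matcher("helloz").matches()); // false: nothing covers the final "z"
    }
}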
I did find this answer which says:
Running a DFA-compiled regular expression against a string is indeed O(n), but can require up to O(2^m) construction time/space (where m = regular expression size).
So maybe it is O(n^2) with reduced construction effort.
Let's start by converting the trie to an NFA: create an accept node at the root and, from every node in the trie that marks the end of a dictionary word, add an edge back to the root for the empty character.
Time complexity: at each step in the trie we can move along only the edge that represents the current character of the input string, plus the edge back to the root, so
T(n) = 2·T(n−1) + c
which gives us O(2^n).
Indeed not O(n). But you can do better using dynamic programming.
We will use a top-down approach.
Before we solve the problem for any string, check whether we have already solved it.
We can use another HashMap to store the results of already-solved strings: whenever a recursive call returns a result for a string, store that result in the HashMap.
The idea is to compute every suffix of the word only once. We have only n suffixes, so it ends up O(n^2).
Code from algorithms.tutorialhorizon.com:
Map<String, String> memoized;
Set<String> dict;

String SegmentString(String input) {
    if (dict.contains(input)) return input;
    if (memoized.containsKey(input)) {
        return memoized.get(input);
    }
    int len = input.length();
    for (int i = 1; i < len; i++) {
        String prefix = input.substring(0, i);
        if (dict.contains(prefix)) {
            String suffix = input.substring(i, len);
            String segSuffix = SegmentString(suffix);
            if (segSuffix != null) {
                memoized.put(input, prefix + " " + segSuffix);
                return prefix + " " + segSuffix;
            }
        }
    }
    return null;
}
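A hypothetical driver for the snippet above (the initialization is mine, assuming import java.util.*;):

memoized = new HashMap<>();
dict = new HashSet<>(Arrays.asList("leet", "code"));
System.out.println(SegmentString("leetcode")); // prints "leet code"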
And you can do better!
Map<String, String> memoized;
Trie<String> dict;

String SegmentString(String input) {
    if (dict.contains(input))
        return input;
    if (memoized.containsKey(input))
        return memoized.get(input);
    int len = input.length();
    // only try prefixes of input that the trie confirms are dictionary words
    for (String word : dict.getAll(input)) {
        String prefix = input.substring(0, word.length());
        String suffix = input.substring(word.length(), len);
        String segSuffix = SegmentString(suffix);
        if (segSuffix != null) {
            memoized.put(input, word + " " + segSuffix);
            return prefix + " " + segSuffix;
        }
    }
    return null;
}
Using the trie so that recursive calls are made only when the trie reaches a word end, you get O(z·n), where z is the length of the trie.

How do I find the largest sequence in a string that is repeated at least once?

Trying to solve the following problem:
Given a string of arbitrary length, find the longest substring that occurs more than one time within the string, with no overlaps.
For example, if the input string was ABCABCAB, the correct output would be ABC. You couldn't say ABCAB, because that only occurs twice where the two substrings overlap, which is not allowed.
Is there any way to solve this reasonably quickly for strings containing a few thousand characters?
(And before anyone asks, this is not homework. I'm looking at ways to optimize the rendering of Lindenmayer fractals, because they tend to take excessive amounts of time to draw at high iteration levels with a naive turtle graphics system.)
Here's an example for a string of length 11, which you can generalize.
Set the chunk length to floor(11/2) = 5.
Scan the string in chunks of 5 characters, left to right, looking for repeats. There will be 3 comparisons:

Left     Right
Offset   Offset
  0        5
  0        6
  1        6

If you find a duplicate, you're done. Otherwise reduce the chunk length to 4 and repeat, until the chunk length reaches zero.
Here's some (obviously untested) pseudocode:
String s
int len = floor(s.length / 2)
for int i = len; i > 0; i--
    for j = 0; j <= s.length - 2*i; j++
        for k = j + i; k <= s.length - i; k++
            if s.substr(j, j+i) == s.substr(k, k+i)
                return s.substr(j, j+i)
return null
There may be an off-by-one error in there, but the approach should be sound (and minimal).
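A direct Java translation of the pseudocode, as a sketch (worst case it is roughly cubic, but for strings of a few thousand characters it may still be workable):

// Longest substring that occurs at least twice with no overlap.
static String longestRepeatedNonOverlapping(String s) {
    int n = s.length();
    for (int len = n / 2; len > 0; len--)              // candidate length, longest first
        for (int j = 0; j + 2 * len <= n; j++)         // start of first occurrence
            for (int k = j + len; k + len <= n; k++)   // start of second, non-overlapping
                if (s.regionMatches(j, s, k, len))
                    return s.substring(j, j + len);
    return null;
}

For the example in the question, longestRepeatedNonOverlapping("ABCABCAB") returns "ABC".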
It looks like a suffix tree problem. Create the suffix tree, then find the deepest compressed branch with more than one child (i.e. a substring that occurs more than once in the original string). The number of letters along that compressed branch is the length of the longest repeated substring.
I found something similar here: http://www.coderanch.com/t/370396/java/java/Algorithm-wanted-longest-repeating-substring
Looks like it can be done in O(n).
First we fix the start position of the substring, then determine its length. Iterate over all possible start positions, and for each one binary-search the length (if you can find a repeated substring of length a, you can also find a shorter one, so the predicate is monotonic and binary search is fine). Then finding an equal substring is linear using KMP or Rabin-Karp; any linear algorithm will do. Total: N·N·log(N). Is that too much complexity?
The code is something like:
for (int i = 0; i < input.length(); ++i)
{
    int l = i;
    int r = input.length();
    while (l <= r)
    {
        int middle = l + ((r - l) >> 1);
        // Check whether the substring [i, middle] can be found elsewhere in
        // the initial string. Should be done in O(n): search the parts
        // [0, i-1] and [middle+1, length()-1].
        if (found)
            l = middle + 1;
        else
            r = middle - 1;
    }
}
Make sense?
This type of analysis is often done on genome sequences. Have a look at this paper, which has an efficient implementation (C++) for finding repeats: http://www.complex-systems.com/pdf/17-4-4.pdf
It might be what you are looking for.

What's the worst case complexity for KMP when the goal is to find all occurrences of a certain string?

I would also like to know which algorithm has the best worst-case complexity of all for finding all occurrences of a string in another. It seems Boyer–Moore's algorithm has linear time complexity.
The KMP algorithm has linear complexity for finding all occurrences of a pattern in a string, like the Boyer-Moore algorithm¹. If you try to find a pattern like "aaaaaa" in a string like "aaaaaaaaa", once you have the first complete match,
aaaaaaaaa
aaaaaa
aaaaaa
^
the border table contains the information that the next-longest possible match of a prefix of the pattern (corresponding to the widest border of the pattern) is just one character shorter (a complete match is equivalent to a mismatch one past the end of the pattern in this respect). Thus the pattern is moved one place further, and since the border table tells us that all characters of the pattern except possibly the last one match, the next comparison is between the last pattern character and the aligned text character. In this particular case (finding occurrences of a^m in a^n), which is the worst case for the naive matching algorithm, the KMP algorithm compares each text character exactly once.
In each step, at least one of
- the position of the text character being compared
- the position of the first character of the pattern with respect to the text
increases, and neither ever decreases. The position of the text character compared can increase at most length(text) - 1 times, and the position of the first pattern character can increase at most length(text) - length(pattern) times, so the algorithm takes at most 2*length(text) - length(pattern) - 1 steps.
The preprocessing (construction of the border table) takes at most 2*length(pattern) steps; thus the overall complexity is O(m+n), and no more than m + 2*n steps are executed, where m is the length of the pattern and n the length of the text.
¹ Note that the Boyer-Moore algorithm as commonly presented has a worst-case complexity of O(m*n) for periodic patterns and texts like a^m and a^n if all matches are required, because after a complete match,
aaaaaaaaa
aaaaaa
aaaaaa
^
<- <-
^
the entire pattern would be re-compared. To avoid that, you need to remember how long a prefix of the pattern still matches after the shift following a complete match and only compare the new characters.
There is a long article on KMP at http://en.wikipedia.org/wiki/Knuth-morris-pratt which ends by saying:
Since the two portions of the algorithm have, respectively, complexities of O(k) and O(n), the complexity of the overall algorithm is O(n + k).
These complexities are the same, no matter how many repetitive patterns are in W or S.
(end quote)
So the total cost of a KMP search is linear in the number of characters of string and pattern. I think this holds even if you need to find multiple occurrences of the pattern in the string - and if not, just consider searching for patternQ, where Q is a character that does not occur in the text, and noting down where the KMP state shows that it has matched everything up to the Q.
You can compute the prefix (π) function of a string in O(length). KMP builds a special string of length n+m+1 and computes the π function on it, so in any case the complexity is O(n+m+1) = O(n+m).
If you think about it, the worst case for matching the pattern is one in which you have to visit each index of the LPS array whenever a mismatch occurs. For example, the pattern "aaaa", whose LPS array is [0,1,2,3], makes this possible.
Now, for the worst-case matching in the text, we want to maximize the mismatches that force us to visit all the indices of the LPS array. That would be a text built from the repeated pattern but with the last character changed into a mismatch, for example "aaabaaacaaabaaacaaabaaac".
Let the length of the text be n and that of the pattern be m. The number of occurrences of such a pattern in the text is n/m, and for each of these occurrences we perform m comparisons. Not to forget that we are also traversing the n characters of the text.
Therefore, the worst-case time for KMP matching is O(n + (n/m)*m), which is basically O(n).
The total worst-case time complexity, including LPS creation, is O(n+m).
KMP Code (for reference):
// requires java.util.List and java.util.ArrayList

// Build the LPS ("longest proper prefix which is also a suffix") table.
void createLPS(char[] pattern, int[] lps) {
    int m = pattern.length;
    int i = 1;
    int j = 0;
    lps[j] = 0;
    while (i < m) {
        if (pattern[j] == pattern[i]) {
            lps[i] = j + 1;
            i++;
            j++;
        } else {
            if (j != 0) {
                j = lps[j - 1];     // fall back to the next-widest border
            } else {
                lps[i] = 0;
                i++;
            }
        }
    }
}

// Return the start indices of all occurrences of pattern in str.
List<Integer> match(char[] str, char[] pattern, int[] lps) {
    int m = pattern.length;
    int n = str.length;
    int i = 0, j = 0;
    List<Integer> idxs = new ArrayList<>();
    while (i < n) {
        if (pattern[j] == str[i]) {
            j++;
            i++;
        } else {
            if (j != 0) {
                j = lps[j - 1];
            } else {
                i++;
            }
        }
        if (j == m) {               // full match ending at i-1
            idxs.add(i - m);
            j = lps[j - 1];         // keep going to find overlapping matches
        }
    }
    return idxs;
}
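A quick hypothetical driver for the code above, using the a^m-in-a^n worst case discussed earlier in this thread:

char[] text = "aaaaaaaaa".toCharArray();   // a^9
char[] pat  = "aaaaaa".toCharArray();      // a^6
int[] lps = new int[pat.length];
createLPS(pat, lps);
System.out.println(match(text, pat, lps)); // [0, 1, 2, 3]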

How to find all cyclic shifted strings in a given input?

This is a coding exercise. Suppose I have to decide whether one string is created by a cyclic shift of another. For example: cab is a cyclic shift of abc, but cba is not.
Given two strings s1 and s2 we can do that as follows:
if (s1.length() != s2.length())
    return false;
for (int i = 0; i < s1.length(); i++)
    if ((s1.substring(i) + s1.substring(0, i)).equals(s2))
        return true;
return false;
Now what if I have an array of strings and want to find all strings that are cyclic shifts of one another? For example: ["abc", "xyz", "yzx", "cab", "xxx"] -> ["abc", "cab"], ["xyz", "yzx"], ["xxx"]
It looks like I have to check all pairs of the strings. Is there a "better" (more efficient) way to do that?
As a start, you can determine whether a string s1 is a rotation of a string s2 of the same length with a single call to contains(), like this:
public boolean isRotation(String s1, String s2){
    // assumes s1 and s2 have the same length
    String s2twice = s2 + s2;
    return s2twice.contains(s1);
}
Namely, if s1 is "rotation" and s2 is "otationr", the concatenation gives you "otationrotationr", which indeed contains s1.
Now, even if we assume contains() is linear, or close to it (which is not impossible, using Rabin-Karp for instance), you are still left with O(n^2) pairwise comparisons, which may be too much.
What you could do is build a hash table where the sorted word is the key and the posting list contains all the words from your list that, when sorted, give that key (i.e. key("bca") and key("cab") should both return "abc"):
private Map<String, List<String>> index;

/* ... */

public void buildIndex(String[] words){
    for(String word : words){
        String sortedWord = sortWord(word); // sorts the word's characters
        if(!index.containsKey(sortedWord)){
            index.put(sortedWord, new ArrayList<String>());
        }
        index.get(sortedWord).add(word);
    }
}
CAVEAT: the hash table will contain, for each key, all the words that have exactly the same letters occurring the same number of times (not just the rotations; e.g. "abba" and "baba" will have the same key, but isRotation("abba", "baba") will return false).
But once you have built this index, you can significantly reduce the number of pairs you need to consider: if you want all the rotations of "bca", you just sort("bca"), look it up in the hash table, and check (using the isRotation method above, if you want) which words in the posting list are actually rotations.
If the strings are short compared to the number of strings in the list, you can do significantly better by rotating all strings to some normal form (the lexicographic minimum, for example). Then sort lexicographically and find runs of the same string. That's O(n log n), I think... neglecting string lengths. Something to try, maybe.
Concerning the way to find the pairs in the table, there could be many better ways, but my first thought is to sort the table and then check each adjacent pair.
This is much better and simpler than checking every string against every other string in the table.
Consider building an automaton for each string against which you wish to test.
Each automaton should have one entry point for each possible character in the string, transitions for each character, plus an extra transition from the end to the start.
You could improve performance even further if you amalgamated the automata.
I think a combination of the answers by Patrick87 and savinos would make a fair amount of sense. Specifically, in a Java-esque pseudo-code:
List<String> inputs = ["abc", "xyz", "yzx", "cab", "xxx"];
Map<String, List<String>> uniques = new Map<String, List<String>>();
for(String value : inputs) {
    String normalized = normalize(value);
    if(!uniques.contains(normalized)) {
        uniques.put(normalized, new List<String>());
    }
    uniques.get(normalized).add(value);
}
// you now have a Map of normalized strings to every string in the input
// that is "equal to" that normalized version
Normalizing the string, as stated by Patrick87, might best be done by picking the rotation of the string that results in the lowest lexicographic ordering.
It's worth noting, however, that the "best" algorithm probably depends heavily on the inputs... the number of strings, the length of those strings, how many duplicates there are, etc.
You can rotate all the strings to a normalized form using Booth's algorithm (https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation) in O(s) time, where s is the length of the string.
You can then use the normalized form as a key in a HashMap (where the value is the set of rotations seen in the input). You can populate this HashMap in a single pass over the data, i.e., for each string:
- calculate the normalized form
- check if the HashMap contains the normalized form as a key; if not, insert an empty Set at this key
- add the string to the Set in the HashMap
You then just need to output the values of the HashMap. This makes the total runtime of the algorithm O(n * s) - where n is the number of words and s is the average word length. The total space usage is also O(n * s).
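A small Java sketch of this grouping (names mine). For brevity it normalizes by taking the smallest rotation naively in O(s²) per string; Booth's algorithm would replace normalize with an O(s) version:

import java.util.*;

public class RotationGroups {

    // Lexicographically smallest rotation, computed naively in O(s^2).
    static String normalize(String s) {
        String best = s;
        String ss = s + s;
        for (int i = 1; i < s.length(); i++) {
            String rot = ss.substring(i, i + s.length());
            if (rot.compareTo(best) < 0) best = rot;
        }
        return best;
    }

    // Groups all strings that are cyclic shifts of one another.
    static Collection<Set<String>> groupRotations(String[] words) {
        Map<String, Set<String>> groups = new HashMap<>();
        for (String w : words)
            groups.computeIfAbsent(normalize(w), k -> new LinkedHashSet<>()).add(w);
        return groups.values();
    }

    public static void main(String[] args) {
        String[] input = {"abc", "xyz", "yzx", "cab", "xxx"};
        System.out.println(groupRotations(input));
        // e.g. [[abc, cab], [xyz, yzx], [xxx]] (map iteration order may vary)
    }
}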
