Find the smallest period of input string in O(n)? - string

Given the following problem:
Definition:
Let S be a string over alphabet Σ. S' is the smallest period of S
if S' is the smallest string such that:
S = (S')^k (S''),
where S'' is a prefix of S. If no such S' exists, then S is
not periodic.
Example : S = abcabcabcabca. Then abcabc is a period since S =
abcabc abcabc a, but the smallest period is abc since S = abc abc
abc abc a.
Give an algorithm to find the smallest period of input string S or
declare that S is not periodic.
Hint : You can do that in O(n) ...
My solution: we use KMP, which runs in O(n).
By the definition of the problem, S = (S')^k (S''), so I think that if we create
an automaton for the shortest period, and find a way to find that shortest period, then I'm done.
The problem is where to put the FAIL arrow of the automaton...
Any ideas would be greatly appreciated,
Regards

Alright so this problem can definitely be solved in O(n), we just have to cleverly use KMP as you suggested.
Solving the longest proper prefix which is also a suffix problem is a vital part of KMP that we will make use of.
The longest proper prefix which is also a suffix problem is a mouthful so let's just call it the prefix suffix problem for now.
The prefix suffix problem can be pretty hard to understand so I'll include some examples.
The prefix suffix solution for "abcabc" is
"abc" since that is the longest string which is both a proper prefix
and a proper suffix (proper prefixes and suffixes cannot be the entire
string).
The prefix suffix solution for "abcabca" is "a"
Hmmmmmmmmm wait a minute if we just chop off "a" from the end of "abcabca" we are left with "abcabc" and if we get the solution("abc") for this new string and chop it off again we are left with "abc" Hmmmmmmmmm. Very interesting.(This is pretty much the solution but I will talk about why this works)
Alright let's try to formalize this intuition a bit more and see if we can arrive at a solution.
I will use one key assumption in my argument:
The smallest period of our pattern is a valid period of every larger period in our pattern
Let us store the prefix suffix solution for the first i characters of our pattern in lps[i]. This lps array can be calculated in O(n) and it is used in the KMP algorithm, you can read more about how to calculate it in O(n) here: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
Just so we are clear I will list some examples of some lps arrays
Pattern:"aaaaa"
lps: [0, 1, 2, 3, 4]
Pattern:"aabbcc"
lps: [0, 1, 0, 0, 0, 0]
Pattern:"abcabcabc"
lps: [0, 0, 0, 1, 2, 3, 4, 5, 6]
Alright, now let's define some variables to help us find out why this lps array is useful.
Let l be the length of our pattern, and let k be the last value in our lps array (k=lps[l-1]).
The value k tells us that the first k characters of our string are the same as the last k characters of our string. And we can use this fact to find a period!
Using this information we can now show that the prefix consisting of the first l-k characters of our string forms a valid period. This is clear because the suffix of length k starts at position l-k, so matching it against the prefix of length k means that character i equals character i+(l-k) for every i < k. In other words, the string repeats with a shift of l-k. Since k is the longest such prefix, l-k is also the smallest period.
In practice you do not need a loop: the smallest period is simply the prefix of length l-k. (Repeatedly shifting back by lps[index] until it hits 0 can overshoot: for "abaabaaba" it stops at "ab", which is not a period, while l-k gives "aba".)
public static void main(String[] args) {
    String pattern = "abcabcabcabca";
    int[] lps = calculateLPS(pattern);
    // k = length of the longest proper prefix that is also a suffix
    int k = lps[lps.length - 1];
    // the prefix of length l - k is the smallest period
    int periodLength = pattern.length() - k;
    // if k == 0 the only period is the whole string,
    // which the problem statement may treat as "not periodic"
    System.out.println(pattern.substring(0, periodLength));
}
And since calculating lps happens in O(n), and the final step is O(1), the time complexity for the whole procedure is simply O(n).
I borrowed heavily from the geeksForGeeks implementation of KMP in my calculateLPS() method. If you would like to see my exact code it is below, but I recommend that you also look at their explanation: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
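// requires: import java.util.Arrays;
static int[] calculateLPS(String pat) {
    int[] lps = new int[pat.length()];
    int len = 0; // length of the current longest prefix-suffix
    int i = 1;
    lps[0] = 0;
    while (i < pat.length()) {
        if (pat.charAt(i) == pat.charAt(len)) {
            len++;
            lps[i] = len;
            i++;
        }
        else {
            if (len != 0) {
                // fall back to the next shorter candidate; do not advance i
                len = lps[len - 1];
            }
            else {
                lps[i] = len;
                i++;
            }
        }
    }
    // debug output of the computed array
    System.out.println(Arrays.toString(lps));
    return lps;
}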
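For comparison, here is a compact Python sketch of the same idea (compute the lps array, then take the prefix of length l-k). The function names are my own and it assumes the whole string counts as its own period when nothing shorter exists.

```python
def prefix_function(s):
    """lps[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    lps = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        # fall back through shorter borders until the next character matches
        while k > 0 and s[i] != s[k]:
            k = lps[k - 1]
        if s[i] == s[k]:
            k += 1
        lps[i] = k
    return lps


def smallest_period(s):
    """Return the shortest prefix p such that s is p repeated, plus a prefix of p."""
    if not s:
        return ""
    return s[:len(s) - prefix_function(s)[-1]]


print(smallest_period("abcabcabcabca"))
```

If the result has the same length as the input, there is no shorter period.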
Last but not least, thanks for posting such an interesting problem it was pretty fun to figure out! Also I am new to this so please let me know if any part of my explanation doesn't make sense.

I'm not sure that I understand your attempted solution. KMP is a useful subroutine, though -- the smallest period is how far KMP moves the needle string (i.e., S) after a complete match.

This problem can be solved using the Z function; this tutorial can help you.
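Since the linked tutorial is not reproduced here, the following is a minimal Python sketch of how the Z function could be used for this. It assumes the usual definition (z[i] is the length of the longest common prefix of s and s[i:]); the function names are my own.

```python
def z_function(s):
    """z[i] = length of the longest common prefix of s and s[i:] (z[0] left as 0)."""
    n = len(s)
    z = [0] * n
    l = r = 0  # [l, r) is the rightmost segment known to match a prefix
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def smallest_period_length(s):
    """Smallest p such that s[i] == s[i + p] wherever both indices exist."""
    n = len(s)
    z = z_function(s)
    for p in range(1, n):
        # the suffix starting at p matches a prefix all the way to the end
        if p + z[p] == n:
            return p
    return n


print(smallest_period_length("abcabcabcabca"))
```

Both steps are linear, so the whole thing stays O(n).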

This problem can easily be solved by KMP
Concatenate the string to itself and run KMP on it.
Let n be the length of the original string.
Search for the first value >= n in the KMP array. That value must be at a position k >= n (0-based).
Then k - n + 1 is the length of the shortest period of the string.
Example:
Original string = abaaba
n = 6
New string = abaabaabaaba
KMP values for this new string: 0 0 1 1 2 3 4 5 6 7 8 9
The first value >= n is 6 which is at position 8. 8 - 6 + 1 = 3 is the length of the shortest period of the string (aba).
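A rough Python sketch of this doubling trick, reusing a standard prefix-function ("KMP array") routine. The names are my own. As far as I can tell, this variant only reports periods that repeat a whole number of times (for a string like abcabcabcabca it would report the full length), so treat it as a sketch of this answer rather than a general solution.

```python
def prefix_function(s):
    """KMP failure array: longest proper prefix of s[:i+1] that is also its suffix."""
    pi = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def shortest_rotation_period(s):
    """Length of the shortest block that, repeated a whole number of times, gives s."""
    n = len(s)
    pi = prefix_function(s + s)
    for k in range(n, 2 * n):
        if pi[k] >= n:  # first position whose value reaches n
            return k - n + 1
    return n


print(shortest_rotation_period("abaaba"))
```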

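See if this solution works. I used rotation of strings. (Note that each rotation and each comparison costs O(n), so the loop as written is O(n^2) in the worst case rather than O(n).)
public static int stringPeriod(String s) {
    String s1 = s;
    String s2 = s1;
    for (int i = 1; i < s1.length(); i++) {
        // rotate left by one more character and compare with the original
        s2 = rotate(s2);
        if (s1.equals(s2)) {
            return i;
        }
    }
    // no proper rotation equals the original string
    return -1;
}

public static String rotate(String s1) {
    // move the first character to the end
    return s1.substring(1) + s1.substring(0, 1);
}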
The complete program is available in this github repository
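The same rotation idea can be written without building every rotation: a rotation of s by i equals s exactly when s occurs at offset i in s+s. Below is a small Python sketch under that observation. The function name is my own, it returns the full length (rather than -1) when no shorter rotation matches, and I make no claim about the worst-case cost of the built-in search.

```python
def rotation_period(s):
    """Smallest i >= 1 such that rotating s left by i characters gives s again."""
    if not s:
        return 0
    # s always occurs at offset len(s) in s + s, so find never returns -1 here
    return (s + s).find(s, 1)


print(rotation_period("abaaba"))
```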

Related

find number of repeating substrings in a string

I am looking for an algorithm that will find the number of repeating substrings in a single string.
For this, I was looking for some dynamic programming algorithms but didn't find any that would help me. I just want some tutorial on how to do this.
Let's say I have a string ABCDABCDABCD. The expected output for this would be 3, because there is ABCD 3 times.
For input AAAA, output would be 4, since A is repeated 4 times.
For input ASDF, output would be 1, since every individual character is repeated 1 time only.
I hope that someone can point me in the right direction. Thank you.
I am making the following assumptions:
The repeating substrings must be consecutive. That is, in case of ABCDABC, ABC would not count as a repeating substring, but it would in case of ABCABC.
The repeating substrings must be non-overlapping. That is, in case of ABABA, ABA would not count as a repeating substring, because its two occurrences overlap.
In case of multiple possible answers, we want the one with the maximum value. That is, in the case of AAAA, the answer should be 4 (A is the substring) rather than 2 (AA is the substring).
Under these assumptions, the algorithm is as follows:
Let the input string be denoted as inputString.
Calculate the KMP failure function array for the input string. Let this array be denoted as failure[]. This operation is of linear time complexity with respect to the length of the string. So, by definition, failure[i] denotes the length of the longest proper prefix of the substring inputString[0....i] that is also a proper suffix of the same substring.
Let len = inputString.length - failure.lastIndexValue. At this point, we know that if there is any repeating string at all, then it has to be of this length len. But we'll need to check for that. First, just check if len perfectly divides inputString.length (that is, inputString.length % len == 0). If yes, then check if every consecutive (non-overlapping) substring of len characters is the same or not; this operation is again of linear time complexity with respect to the length of the input string.
If it turns out that every consecutive non-overlapping substring is the same, then the answer is inputString.length / len. Otherwise, the answer is simply 1, as there is no shorter repeating substring and the whole string occurs once (this matches the expected output for ASDF in the question).
The overall time complexity would be O(n), where n is the number of characters in the input string.
A sample code for calculating the KMP failure array is given here.
For example,
Let the input string be abcaabcaabca.
Its KMP failure array would be - [0, 0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8].
So, our len = (12 - 8) = 4.
And every consecutive non-overlapping substring of length 4 is the same (abca).
Therefore the answer is 12/4 = 3. That is, abca is repeated 3 times consecutively.
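The steps above can be sketched in Python as follows. The function names are my own, and the sketch assumes that a string with no shorter repeating block should give 1, as in the ASDF example from the question.

```python
def prefix_function(s):
    """KMP failure array: longest proper prefix of s[:i+1] that is also its suffix."""
    failure = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = failure[k - 1]
        if s[i] == s[k]:
            k += 1
        failure[i] = k
    return failure


def count_repeats(s):
    """Number of times the shortest repeating block occurs back to back in s."""
    n = len(s)
    if n == 0:
        return 0
    length = n - prefix_function(s)[-1]  # candidate block length
    # the block must tile the string exactly
    if n % length == 0 and s == s[:length] * (n // length):
        return n // length
    return 1


print(count_repeats("abcaabcaabca"))
```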
The solution for this with C# is:
using System;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    public static string CountOfRepeatedSubstring(string str)
    {
        if (str.Length < 2)
        {
            return "-1";
        }
        StringBuilder substr = new StringBuilder();
        // Length of the substring cannot be greater than half of the actual string
        for (int i = 0; i < str.Length / 2; i++)
        {
            // We will iterate through half of the actual string and
            // create a new string by appending the current character to the previous characters
            substr.Append(str[i]);
            String clearedOfNewSubstrings = str.Replace(substr.ToString(), "");
            // We will remove the newly created substring from the actual string and
            // check if the length of the actual string, cleared of the newly created substring, is 0.
            // If 0 it tells us that it is only made of its substring
            if (clearedOfNewSubstrings.Length == 0)
            {
                // Next we will return the count of the newly created substring in the actual string.
                // Regex.Escape keeps characters such as '.' or '*' from being read as regex syntax.
                var countOccurences = Regex.Matches(str, Regex.Escape(substr.ToString())).Count;
                return countOccurences.ToString();
            }
        }
        return "-1";
    }

    static void Main(string[] args)
    {
        // Input: {"abcdaabcdaabcda"}
        // Output: 3
        // Input: {"barrybarrybarry"}
        // Output: 3
        var s = "asdf"; // Output will be -1
        Console.WriteLine(CountOfRepeatedSubstring(s));
    }
}
How do you want to specify the "repeating string"? Is it simply the first group of characters up until either a) the first character is found again, b) the pattern begins to repeat, or c) some other criteria?
So, if your string is "ABBAABBA", is that a 2 because "ABBA" repeats twice or is it 1 because you have "ABB" followed by "AAB"? What about "ABCDABCE" -- does "ABC" count (despite the "D" in between repetitions?) In "ABCDABCABCDABC", is the repeating string "ABCD" (1) or "ABCDABC" (2)?
What about "AAABBAAABB" -- is that 3 ("AAA") or 2 ("AAABB")?
If the end of the repeating string is another instance of the first letter, it's pretty simple:
Work your way through the string character by character, putting each character into another variable as you go, until the next character matches the first one. Then, given the length of the substring in your second variable, check the next bit of your string to see if it matches. Continue until it doesn't match or you hit the end of the string.
If you just want to find any length pattern that repeats regardless of whether the first character is repeated within the pattern, it gets more complicated (but, fortunately, it's the sort of thing computers are good at).
You'll need to go character by character building a pattern in another variable as above, but you'll also have to watch for the first character to reappear and start building a second substring as you go, to see if it matches the first. This should probably go in an array as you might encounter a third (or more) instance of the first character which would trigger the need to track yet another possible match.
It's not difficult but there is a lot to keep track of and it's a rather annoying problem. Is there a particular reason you're doing this?

Using a trie for string segmentation - time complexity?

Problem to be solved:
Given a non-empty string s and a string array wordArr containing a list
of non-empty words, determine if s can be segmented into a
space-separated sequence of one or more dictionary words. You may
assume the dictionary does not contain duplicate words.
For example, given s = "leetcode", wordArr = ["leet", "code"].
Return true because "leetcode" can be segmented as "leet code".
In the above problem, would it work to build a trie that has each string in wordArr. Then, for each char in given string s, work down the trie. If a trie branch terminates, then this substring is complete so pass the remaining string up to the root and do the exact same thing recursively.
This should be O(N) time and O(N) space correct? I ask because the problem I'm working on says this will be O(N^2) time in the most optimal way and I'm not sure what's wrong with my approach.
For example, if s = "hello" and wordArr = ["he", "ll", "ee", "zz", "o"], then "he" will be completed in the first branch of the trie, "llo" will be passed up to the root recursively. Then, "ll" will be completed, so "o" gets passed up to root of trie. Then "o" is completed, which is the end of s, so return true. If the end of s isn't completed, return false.
Is this correct?
Your example would indeed suggest a linear time complexity, but look at this example:
s = "hello"
wordArr = ["hell", "he", "e", "ll", "lo", "l", "h"]
Now, first "hell" is tried, but in the next recursion cycle, no solution is found (there is no "o"), so the algorithm needs to backtrack and assume "hell" is not suitable (pun not intended), so you try "he", and in the next level you find "ll", but then again it fails, as there is no "o". Again backtracking is needed. Now start with "h", then "e" and then again a failure is coming: you try "ll" without success, so backtracking to use "l" instead: the solution is now available: "h e l lo".
So, no this does not have O(n) time complexity.
I suspect off-hand that the issue is backtracking. What if the word is not segmentable based on a particular dictionary, or what if there are multiple possible substrings with a common prefix? E.g., suppose the dictionary contains he, llenic, and llo. Failure down one branch of the trie would require backtracking, with some corresponding increase in time complexity.
This is similar to a regex-match problem: the example you give is like testing an input word against
^(he|ll|ee|zz|o)+$
(any number of dictionary members, in any order, and nothing else). I don't know the time complexity of regex matchers offhand, but I know backtracking can get you into serious time trouble.
I did find this answer which says:
Running a DFA-compiled regular expression against a string is indeed O(n), but can require up to O(2^m) construction time/space (where m = regular expression size).
So maybe it is O(n^2) with reduced construction effort.
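To make the regex comparison concrete, here is a small Python sketch that builds the pattern from the dictionary. The helper name is my own. Python's re module is a backtracking matcher, so this only illustrates the equivalence and does not guarantee linear time.

```python
import re


def segmentable(s, words):
    """True if s is a concatenation of one or more words from the list."""
    # escape each word so it is matched literally, then allow any repetition
    pattern = "(?:" + "|".join(re.escape(w) for w in words) + ")+"
    return re.fullmatch(pattern, s) is not None


print(segmentable("hello", ["he", "ll", "ee", "zz", "o"]))
```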
Let's start by converting the trie to an NFA. We create an accept node on the root and add an edge for the empty char from every word end of the dictionary in the trie back to the root node.
Time complexity: at each step in the trie we can move along at most two edges, the one that represents the current char in the input string and the one back to the root. So in the worst case the running time satisfies:
T(n) = 2×T (n-1)+c
That gives us O(2^n)
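For completeness, the recurrence can be unrolled like this (a sketch, treating c and T(0) as constants):

```latex
\begin{align*}
T(n) &= 2\,T(n-1) + c \\
     &= 2^{2}\,T(n-2) + c\,(1 + 2) \\
     &= \dots \\
     &= 2^{n}\,T(0) + c\,(2^{n} - 1) = O(2^{n}).
\end{align*}
```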
Indeed not O(n), but you can do better using dynamic programming.
We will use a top-down approach.
Before we solve it for any string, check whether we have already solved it.
We can use a HashMap to store the results for already solved strings.
Whenever any recursive call returns false, store that string in the HashMap.
The idea is to process every suffix of the word only once. There are only n suffixes, so this ends up with O(n^2).
Code from algorithms.tutorialhorizon.com:
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

Map<String, String> memoized = new HashMap<>();
Set<String> dict;

String SegmentString(String input) {
    if (dict.contains(input)) return input;
    // containsKey also catches inputs memoized as unsegmentable (null)
    if (memoized.containsKey(input)) {
        return memoized.get(input);
    }
    int len = input.length();
    for (int i = 1; i < len; i++) {
        String prefix = input.substring(0, i);
        if (dict.contains(prefix)) {
            String suffix = input.substring(i, len);
            String segSuffix = SegmentString(suffix);
            if (segSuffix != null) {
                memoized.put(input, prefix + " " + segSuffix);
                return prefix + " " + segSuffix;
            }
        }
    }
    // remember the failure so this suffix is never retried
    memoized.put(input, null);
    return null;
}
And you can do better!
Map<String, String> memoized = new HashMap<>();
// Trie is not a standard library class; GetAll(input) is assumed to return
// every dictionary word that is a prefix of input.
Trie<String> dict;

String SegmentString(String input)
{
    if (dict.contains(input))
        return input;
    if (memoized.containsKey(input))
        return memoized.get(input);
    int len = input.length();
    for (String word : dict.GetAll(input))
    {
        String suffix = input.substring(word.length(), len);
        String segSuffix = SegmentString(suffix);
        if (segSuffix != null)
        {
            memoized.put(input, word + " " + segSuffix);
            return word + " " + segSuffix;
        }
    }
    // remember the failure so this suffix is never retried
    memoized.put(input, null);
    return null;
}
Using the Trie to make recursive calls only when the Trie reaches a word end, you will get O(z×n), where z is the length (depth) of the Trie.
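Here is a self-contained Python sketch of the trie-plus-memoization idea, written bottom-up instead of recursively. It is my own variant, not the code from the linked site: the nested-dict trie, the END marker and the function names are all assumptions made for illustration.

```python
END = ""  # end-of-word marker; cannot collide with a one-character key


def build_trie(words):
    """Build a nested-dict trie from the dictionary words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def can_segment(s, words):
    """True if s can be split into a sequence of dictionary words."""
    trie = build_trie(words)
    n = len(s)
    reachable = [False] * (n + 1)  # reachable[i]: s[:i] can be segmented
    reachable[0] = True
    for start in range(n):
        if not reachable[start]:
            continue
        node = trie
        # walk the trie along s[start:], marking every word end we pass
        for end in range(start, n):
            node = node.get(s[end])
            if node is None:
                break
            if END in node:
                reachable[end + 1] = True
    return reachable[n]


print(can_segment("leetcode", ["leet", "code"]))
```

Each start position walks at most as deep as the longest word, which is where the O(z×n) bound comes from.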

Convert string S to another string T by performing exactly K operations (append to / delete from the end of the string S)

I am trying to solve a problem. But I am missing some corner case. Please help me. The problem statement is:
You have a string, S , of lowercase English alphabetic letters. You can perform two types of operations on S:
Append a lowercase English alphabetic letter to the end of the string.
Delete the last character in the string. Performing this operation on an empty string results in an empty string.
Given an integer, k, and two strings, s and t , determine whether or not you can convert s to t by performing exactly k of the above operations on s.
If it's possible, print Yes; otherwise, print No.
Examples
Input          Output
hackerhappy    Yes
hackerrank
9
Explanation: 5 delete operations (h,a,p,p,y) and 4 append operations (r,a,n,k)
aba            Yes
aba
7
Explanation: 4 delete operations (delete on empty = empty) and 3 append operations
I tried in this way (C language):
int sl = strlen(s); int tl = strlen(t); int diffi=0;
int i;
for(i=0;s[i]&&t[i]&&s[i]==t[i];i++); //going till matching
diffi=i;
((sl-diffi+tl-diffi<=k)||(sl+tl<=k))?printf("Yes"):printf("No");
Please help me to solve this.
Thank You
You also need the number of leftover operations to be even, because the only way to waste operations is to append a letter and then delete it again.
So maybe:
// C language: strcmp(s,t) returns 0 if s==t.
if(strcmp(s,t))
    ((sl-diffi+tl-diffi<=k && (k-(sl-diffi+tl-diffi))%2==0)||(sl+tl<=k))?printf("Yes"):printf("No");
else
    if(sl+tl<=k||k%2==0) printf("Yes"); else printf("No");
You can do it another way using binary search.
Take the shorter string and take a substring (the pattern) of length/2.
1. Do a binary search (by character) on both strings. If you get a match, append length/4 more characters to the pattern; if it still matches, add length/2^n more; otherwise append one character to the original pattern (of length/2) and try again.
2. If you get a mismatch for the pattern of length/2, reduce the length of the pattern to length/4, and if you get a match, append the next character.
Now repeat steps 1 and 2.
Let n1 and n2 be the numbers of mismatched characters in the two strings.
If n1+n2 <= k then the answer is Yes,
else the answer is No.
Example:
s1=Hackerhappy
s2=Hackerrank
pattern=Hacker // s2 is the shorter string: its length is 10, so length/2 = 5
// Do a binary search of the pattern; you will get a match by steps 1 and 2
n1, the number of mismatched characters in s1, is 5
n2, the number of mismatched characters in s2, is 4
Now n1+n2 <= k (with k = 9 from the question) // because we need this many operations to make the strings equal.
So Yes
This should work for all cases:
int sl = strlen(s); int tl = strlen(t); int diffi=0;
int i,m;
for(i=0;s[i]&&t[i]&&s[i]==t[i];i++); //going till matching
diffi=i;
m = sl+tl-2*diffi;
((k>=m&&(k-m)%2==0)||(sl+tl<=k))?printf("Yes"):printf("No");
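The same check, spelled out as a Python sketch with the two cases separated. The function name is my own; it returns a boolean instead of printing Yes/No.

```python
def can_convert(s, t, k):
    """True if s can be turned into t with exactly k append/delete-at-end operations."""
    # length of the common prefix of s and t
    common = 0
    for a, b in zip(s, t):
        if a != b:
            break
        common += 1
    # minimum operations: delete the rest of s, then append the rest of t
    m = len(s) + len(t) - 2 * common
    # with enough operations we can empty s, waste deletes on the empty string, and rebuild t
    if k >= len(s) + len(t):
        return True
    # otherwise the surplus must be spent in append/delete pairs
    return k >= m and (k - m) % 2 == 0


print(can_convert("hackerhappy", "hackerrank", 9))
```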

Find subsequence occurrence in multiple strings

Given one string say S of length m and a set of other strings R all with lengths equal or bigger than m. Find the strings in the set that have S as a subsequence.
So, if S is blr and the set of strings is:
bangalore
booleer
bamboo
It should return the first two strings.
I'm aware that I can find whether a string S of length m is a subsequence of another string T of length n in time complexity O(n+m). So I know I could just run this algorithm for each element in the set, but that would be a time complexity of O(k*(n+m)), k being the size of the set (and assuming all the strings have the same length). This makes me wonder if there's some kind of preprocessing that helps me solve this problem with multiple strings.
So, is there any preprocessing or structure I can use to solve this problem?
What's the best time complexity I can achieve?
Are there any other approaches to solve this problem?
For two strings ch and s, checking whether s is a subsequence of ch has complexity O(n), where n is the length of ch:
public bool function(string ch, string s)
{
    if (ch.Length < s.Length)
        return false;
    // an empty s is a subsequence of every string
    if (s.Length == 0)
        return true;
    int j = 0; // index of the next character of s still to be matched
    for (int i = 0; i < ch.Length; i++)
    {
        if (ch[i] == s[j])
        {
            j++;
            if (s.Length == j)
            {
                return true;
            }
        }
    }
    return false;
}
After that you have to apply it to every string in R.
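For reference, the same per-string check applied to the whole set, as a short Python sketch. The function names are my own, and there is no preprocessing here, so it is still O(k*(n+m)) overall.

```python
def is_subsequence(s, t):
    """True if s can be obtained from t by deleting zero or more characters."""
    it = iter(t)
    # "ch in it" advances the iterator until ch is found, so order is respected
    return all(ch in it for ch in s)


def strings_containing(s, candidates):
    """Return the candidates that have s as a subsequence, in their original order."""
    return [t for t in candidates if is_subsequence(s, t)]


print(strings_containing("blr", ["bangalore", "booleer", "bamboo"]))
```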
I haven't got a code implementation, but I did manage to find a 1984 paper "Computing a longest common subsequence for a set of strings", by W. J. Hsu and M. W. Du.
Their conclusion is that by doing O(L) preprocessing time (where L is the aggregate length of all the strings in the set), it is possible to perform each search in O(P), where P is the number of times the needle appears in the haystacks.

How do I find the largest sequence in a string that is repeated at least once?

Trying to solve the following problem:
Given a string of arbitrary length, find the longest substring that occurs more than one time within the string, with no overlaps.
For example, if the input string was ABCABCAB, the correct output would be ABC. You couldn't say ABCAB, because that only occurs twice where the two substrings overlap, which is not allowed.
Is there any way to solve this reasonably quickly for strings containing a few thousand characters?
(And before anyone asks, this is not homework. I'm looking at ways to optimize the rendering of Lindenmayer fractals, because they tend to take excessive amounts of time to draw at high iteration levels with a naive turtle graphics system.)
Here's an example for a string of length 11, which you can generalize
Set chunk length to floor(11/2) = 5
Scan the string in chunks of 5 characters, left to right, looking for repeats. There will be 3 comparisons:
Left offset    Right offset
0              5
0              6
1              5
If you found a duplicate you're done. Otherwise reduce the chunk length to 4 and repeat until chunk length goes to zero.
Here's some (obviously untested) pseudocode:
String s
int n = s.length
for int i=floor(n/2); i>0; i--
    for j=0; j<=n-(2*i); j++
        for k=j+i; k<=n-i; k++
            if s.substr(j,i) == s.substr(k,i)   // substr(start, length)
                return s.substr(j,i)
return null
There may be an off-by-one error in there, but the approach should be sound (and minimal).
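A runnable Python sketch of the chunk-scanning idea above. The function name is my own; it is a brute-force version with no attempt at optimization, which should still be workable for a few thousand characters only if a long repeat exists.

```python
def longest_nonoverlapping_repeat(s):
    """Longest substring that occurs at least twice in s without overlapping itself."""
    n = len(s)
    # try the longest possible chunk first, so the first hit is the answer
    for size in range(n // 2, 0, -1):
        for j in range(0, n - 2 * size + 1):          # left offset
            for k in range(j + size, n - size + 1):   # right offset, no overlap
                if s[j:j + size] == s[k:k + size]:
                    return s[j:j + size]
    return None


print(longest_nonoverlapping_repeat("ABCABCAB"))
```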
It looks like a suffix tree problem. Create the suffix tree, then find the biggest compressed branch with more than one child (occurs more than once in the original string). The number of letters in that compressed branch should be the size of the biggest subsequence.
I found something similar here: http://www.coderanch.com/t/370396/java/java/Algorithm-wanted-longest-repeating-substring
Looks like it can be done in O(n).
First we need to fix the start position of our substring and determine its length. Iterate over all possible start positions, then find the length by binary search (if a substring of length a starting there is repeated, every shorter one starting there is repeated too, so the predicate is monotonic and binary search is fine). Checking whether an equal substring exists elsewhere takes O(N) using KMP, Rabin-Karp, or any other linear algorithm. Total: N*N*log(N). Is that too much complexity?
The code is something like:
for (int i = 0; i < input.length(); ++i)
{
    int l = i;
    int r = input.length() - 1;
    while (l <= r)
    {
        int middle = l + ((r - l) >> 1);
        // Check whether substring [i; middle] can be found elsewhere in the initial string.
        // This should be done in O(n); only the parts [0; i-1] and [middle+1; length()-1] need to be searched.
        // occursElsewhere is a hypothetical helper, not defined here.
        boolean found = occursElsewhere(input, i, middle);
        if (found)
            l = middle + 1; // remember [i; middle] as the best candidate if it is the longest so far
        else
            r = middle - 1;
    }
}
Make sense?
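To make the binary-search version concrete, here is a Python sketch. The names are my own. It only looks for a second occurrence to the right of the current one, which is enough because the leftmost occurrence of the answer always has its partner on the right, and it uses the built-in substring search in place of KMP or Rabin-Karp.

```python
def longest_repeat_binary_search(s):
    """Longest substring occurring twice without overlap, via binary search per start."""
    n = len(s)
    best = ""
    for i in range(n):
        lo, hi = 0, (n - i) // 2  # a copy to the right needs i + 2 * length <= n
        while lo < hi:
            mid = (lo + hi + 1) // 2
            # is s[i:i+mid] found again, starting no earlier than where it ends?
            if s.find(s[i:i + mid], i + mid) != -1:
                lo = mid
            else:
                hi = mid - 1
        if lo > len(best):
            best = s[i:i + lo]
    return best


print(longest_repeat_binary_search("ABCABCAB"))
```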
This type of analysis is often done in genome sequences. Have a look at this paper. It has an efficient implementation (C++) for solving repeats: http://www.complex-systems.com/pdf/17-4-4.pdf
It might be what you are looking for.

Resources