Detect periodical string

I am trying to solve this problem, but I couldn't get it down to linear time.
A string T is called periodical if it can be represented in the form
T = PPP...P, that is, some string P repeated at least twice.
Design a linear-time algorithm for deciding whether a
given T is periodical and, if it is, finding the shortest period.
My approach:
If T = AB = BA for non-empty A and B, then T is periodical. My algorithm keeps checking whether the string can be represented like that, and if so, it repeats the check on half of it.
It takes O(n*log(n)) time.
Thanks, guys.

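The KMP search algorithm computes, as part of its preprocessing, the longest substring that is both a prefix and a suffix of the string (and shorter than the entire string).
If you apply it to a periodical string you'll get
len(string) - len(substring) = period
The string is periodical only if len(substring) >= len(string) / 2 and the period divides len(string) evenly; otherwise there's no period. For example, "abcabcab" has a prefix-suffix of length 5, which gives a candidate period of 3, but 3 does not divide 8, so the string is not of the form PPP...P.
When the check passes, the period found is also the shortest period.
KMP's preprocessing is linear.
So check it out (Wikipedia).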
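A minimal sketch of this idea in Python, assuming the usual KMP prefix (failure) function; the names `prefix_function` and `shortest_period` are made up for illustration. Note that it also checks that the candidate period divides the length, so a string like "abcabcab" is rejected.

```python
from typing import List, Optional


def prefix_function(s: str) -> List[int]:
    """pi[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    pi = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        # Fall back to shorter borders until the next character matches.
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def shortest_period(t: str) -> Optional[str]:
    """Return the shortest P with t == P repeated at least twice, or None."""
    n = len(t)
    if n == 0:
        return None
    border = prefix_function(t)[-1]  # longest proper prefix that is also a suffix
    p = n - border                   # candidate period length
    if border > 0 and n % p == 0:
        return t[:p]
    return None
```

For instance, `shortest_period("abcabcabc")` gives `"abc"`, while `shortest_period("abcabcab")` gives `None`. Both functions do a single left-to-right pass, so the whole check is linear in len(T).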

Related

What is the difference between KMP and Z algorithm of string pattern matching?

In the KMP algorithm we preprocess the pattern to find, for each position, the longest prefix that is also a suffix, which we use to skip characters while matching.
In the Z algorithm we first make a new string
new_string = pattern + 'x' + string
where x is a character that doesn't occur in either the pattern or the string.
After making new_string we preprocess it to find, at every position, the length of the longest substring starting there that matches a prefix of new_string; wherever that length equals the pattern length, we have found the pattern.
Both have time complexity of O(m+n).
So what is the difference between these two algorithms, and which one is best to use?
It is not always about time complexity; space complexity plays a role here too:
Knuth-Morris-Pratt:
Worst-case performance: Θ(m) preprocessing + Θ(n) matching
Worst-case space complexity: Θ(m)
Z algorithm:
Worst-case performance: Θ(m+n) preprocessing and matching
Worst-case space complexity: Θ(n+m)
Besides, you can use the idea of matching prefixes and suffixes for purposes other than searching for a pattern, so you may have other reasons to compute this information.
I would also recommend considering other matching algorithms, such as Boyer-Moore, for some tasks, even if they have worse worst-case time complexity. It all depends on the situation.
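To make the comparison concrete, here is a rough sketch of Z-algorithm matching as described in the question. The function names and the default separator "\x00" are assumptions for illustration; any character that occurs in neither the pattern nor the text would do. The Z array covers the whole concatenation, which is where the Θ(n+m) space comes from.

```python
from typing import List


def z_array(s: str) -> List[int]:
    """z[i] = length of the longest substring starting at i that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0  # [left, right) is the rightmost prefix match found so far
    for i in range(1, n):
        if i < right:
            # Reuse the value computed for the mirrored position inside the window.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def z_search(pattern: str, text: str, sep: str = "\x00") -> List[int]:
    """Return the start indices of all occurrences of pattern in text."""
    if not pattern:
        return []
    if sep in pattern or sep in text:
        raise ValueError("separator must not occur in pattern or text")
    m = len(pattern)
    z = z_array(pattern + sep + text)
    # A Z value equal to m at position i means a full match starting at text[i - m - 1].
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]
```

For example, `z_search("ab", "abcabab")` returns `[0, 3, 5]`.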

Difference between Knuth–Morris–Pratt (KMP) and suffix tree using Ukkonen's algorithm for time complexity.

Is it possible to find Longest Common Substring, Longest Palindromic Substring, Longest Repeated Substring, Searching All Patterns and Substring Check by both KMP and suffix tree using Ukkonen's algorithm? If yes then which one should I use since both algorithms have a linear-time complexity?
For finding the longest common substring, I would use a generalized suffix tree built with Ukkonen's algorithm, which has linear complexity. For the longest palindromic substring, the choice would be Manacher's algorithm, which also has linear complexity. For repeated strings and searching all patterns, yes, the choice would boil down to KMP or Boyer-Moore.
As to which one: Boyer-Moore compares the pattern against the text starting from the last character of the pattern instead of the first, on the assumption that if there is no match at the end there is no need to try to match at the beginning. KMP searches for occurrences of a word W within a main text string S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
This makes KMP slightly better suited to small alphabets, such as DNA sequences like ACTGT.
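As an illustration of the KMP behaviour described above, here is a small sketch that reports every occurrence of a pattern without ever moving backward in the text. The name `kmp_search` is hypothetical, and this is one common way to write it rather than a reference implementation.

```python
from typing import List


def kmp_search(pattern: str, text: str) -> List[int]:
    """Return the start indices of all occurrences of pattern in text."""
    if not pattern:
        return []
    # Failure table: fail[i] is the length of the longest proper prefix of
    # pattern[:i+1] that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, reuse what the pattern tells us instead of rescanning text.
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return matches
```

For example, `kmp_search("ACTGT", "ACTGACTGTACTGT")` returns `[4, 9]`.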

Number of distinct palindromic substrings

Given a string, I know how to find the number of palindromic substrings in linear time using Manacher's algorithm. But now I need to find the number of distinct/unique palindromic substrings. Now, this might lead to an O(n + n^2) algorithm - one 'n' for finding all such substrings, and n^2 for comparing each of these substrings with the ones already found, to check if it is unique.
I am sure there is an algorithm with better complexity. I was thinking of maybe trying my luck with suffix trees? Is there an algorithm with better time complexity?
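I would just put the substrings you found into a hash table to avoid storing the same result twice.
The expected access time to a hash table is O(1) per lookup, not counting the time to hash the substring itself, which is proportional to its length.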
As of 2015, there is a linear time algorithm for computing the number of distinct palindromic substrings of a given string S. You can use a data structure known as an eertree (or palindromic tree), as described in the linked paper. The idea is fairly complicated, but the premise is to build a trie of palindromes, and augment it with longest proper palindromic suffixes in a similar manner to the failure function of the Aho-Corasick Algorithm. See the original paper for more details: https://arxiv.org/pdf/1506.04862.pdf
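Here is a compact sketch of the eertree idea, assuming the standard two-root layout (an imaginary root of length -1 and an empty-string root of length 0). The function name and the list-based node storage are choices made for this example, not taken from the paper. Every node other than the two roots corresponds to one distinct palindromic substring, so the answer is simply the number of nodes created.

```python
def count_distinct_palindromes(s: str) -> int:
    """Count distinct non-empty palindromic substrings of s with an eertree."""
    # Node 0: imaginary root (length -1). Node 1: empty-string root (length 0).
    length = [-1, 0]
    link = [0, 0]     # suffix links: longest proper palindromic suffix of each node
    edges = [{}, {}]  # edges[v][c] = node for the palindrome c + v + c
    last = 1          # node of the longest palindromic suffix of the processed prefix

    def find_extendable(v: int, i: int) -> int:
        """Walk suffix links from v until s[i] can be prepended to match s[i]."""
        while True:
            j = i - length[v] - 1
            if j >= 0 and s[j] == s[i]:
                return v
            v = link[v]

    for i, ch in enumerate(s):
        cur = find_extendable(last, i)
        if ch in edges[cur]:
            last = edges[cur][ch]  # this palindrome was already seen
            continue
        new = len(length)
        length.append(length[cur] + 2)
        edges.append({})
        if length[new] == 1:
            link.append(1)  # a single character links to the empty string
        else:
            # The target node exists already, because it is a shorter palindrome
            # that occurred earlier in s.
            link.append(edges[find_extendable(link[cur], i)][ch])
        edges[cur][ch] = new
        last = new

    return len(length) - 2  # exclude the two roots
```

For example, `count_distinct_palindromes("abaaa")` returns 5 (a, b, aa, aba, aaa).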

Longest common substring in two sequences of strings

Having just learned the longest common substring algorithm, I was curious about a particular variant of the problem. It is described as follows:
Given two non-empty sequences of strings, X = (x1, x2, x3,....,x(n)) and Y = (y1, y2, y3,..., y(m)), where x(i) and y(i) are strings of characters, find the longest string in X which is a substring of all the strings of Y.
I have a function substring(x, y) which returns a boolean indicating whether x is a substring of y or not. Evidently, I have to concatenate all the strings in Y to form one big string, say, denoted by B. I thought of the following approaches:
Naive: Start by concatenating all strings in X to form a string A(n). Apply substring(A(n), B); this includes iterating backward in string A(n). If true, the algorithm ends here and returns A(n), or whatever portion of it is included in said substring. If not, proceed to apply substring(A(n - 1), B) and so on. If such a string does not exist in X, I return the empty string.
Obviously this approach would take up quite some running time depending on the implementation. Assuming I use an iterative approach, on each iteration I would have to iterate backward through the string at that level/index, and subsequently apply substring(). It would take at least two loops, and O(size(B) * maxlength(x1, x2,...)) worst-case time, or more depending on substring() (correct me if I'm wrong).
I thought of a second approach based on suffix trees/arrays.
Generalized Suffix Tree: I build a GST of sequence Y using Ukkonen's algorithm in O(maxlength(y1, y2,...)) (?). This is where my lack of knowledge of suffix trees bites. I believe a suffix tree approach would substantially reduce running time (at the cost of space) for finding the substring, but I have no idea how to implement the operation.
If there is a better approach, I'd love to know.
EDIT: Apologies if it seemed like I abandoned the topic.
What if I were to use not a GST, but some standard data structure such as a stack, queue, set, heap, priority queue, etc.? The sequence X would have to be sorted, longest string first, naturally. If I store it in a string array, I will have to use a sorting algorithm such as mergesort/quicksort. The goal is to get the most efficient running time possible.
Can I not store X in a structure that automatically sorts its elements as it builds itself? What about a max-heap?
It would seem like the suffix tree is the best way to find substrings in this fashion. Is there any other data structure I could use?
First, sort the array X from the longest string to the shortest. This way, the first string in X that is a substring of all Y strings is the solution.
A multiprocessor algorithm would be the best way to test each X string against all Y strings quickly.
Here is my idea for a solution to your problem; I am not sure about everything, so comments are welcome to improve it if you think it is worth the effort.
Begin by computing all common substrings of all strings in Y. First take two strings, and build a tree of all their common substrings. Then, for each other string in Y, remove from the tree every substring that does not appear in this string. The complexity is linear in the number of strings in Y, but I can't figure out how many elements might be in the tree, so I cannot estimate the final complexity.
Then find the longest string in X which is a substring of one in the tree.
There must be some improvements to make to keep the tree as small as possible, such as keeping only substrings that are not substrings of others.
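A minimal single-threaded sketch of the sort-then-test idea, assuming Python's built-in `in` operator as the substring test (standing in for the question's substring(x, y)). The function name is made up for this example.

```python
from typing import List


def longest_common_to_all(xs: List[str], ys: List[str]) -> str:
    """Return the longest string in xs that is a substring of every string in ys.

    Returns the empty string when no such string exists.
    """
    # Try candidates longest first, so the first hit is the answer.
    for x in sorted(xs, key=len, reverse=True):
        if all(x in y for y in ys):
            return x
    return ""
```

For example, `longest_common_to_all(["ab", "abc", "zzzz"], ["xabcx", "abcabc", "cabc"])` returns `"abc"`. Each candidate is tested independently of the others, which is what makes the work easy to split across processors.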
Writing |Y| for the number of strings in the set Y, and len(Y) for their total length:
Process the strings in Y into a generalized suffix tree (for example, using Ukkonen's algorithm). Takes time O(len(Y)), assuming a constant-size alphabet.
Mark each node in the suffix tree according to whether the string identified by that node belongs to all the strings in Y. Takes time O(|Y| len(Y)).
For each string in X, look it up in the suffix tree and see if the node is marked as belonging to all the strings in Y. Output the longest such marked string. Takes time O(len(X)).
Total time: O(|Y| len(Y)) + O(len(X)).

An efficient algorithm for finding smallest pangrammatic windows?

A pangrammatic window is a substring of a larger piece of text that contains all 26 letters of the alphabet. To quote an example from Wikipedia, given this text:
I sang, and thought I sang very well; but he just looked up into my face with a very quizzical expression, and said, 'How long have you been singing, Mademoiselle?'
The smallest pangrammatic window in the text is this string:
g very well; but he just looked up into my face with a very quizzical ex
Which indeed contains every letter at least once.
My question is this: Given a text corpus, what is the most efficient algorithm for finding the smallest pangrammatic window in the text?
I've given this some thought and come up with the following algorithms already. I have a strong feeling that these are not optimal, but I thought I'd post them as a starting point.
There is a simple naive algorithm that runs in time O(n^2) and space O(1): For each position in the string, scan forward from that position and track what letters you've seen (perhaps in a bit vector, which, since there are only 26 different letters, takes space O(1)). Once you've found all 26 letters, you have the length of the shortest pangrammatic window starting at that given point. Each scan might take time O(n), and there are O(n) scans, for a grand total of O(n^2) time.
We can also solve this problem in time O(n log n) and space O(n) using a modified binary search. Construct 26 arrays, one for each letter of the alphabet, then populate those arrays with the positions of each letter in the input text in sorted order. We can do this by simply scanning across the text, appending each index to the array corresponding to the current character. Once we have this, we can find, in time O(log n), the length of the shortest pangrammatic window beginning at some index by running 26 binary searches in the arrays to find the earliest time that each character appears in the input array at or after the given index. Whichever of these numbers is greatest gives the "long pole" character that appears furthest down in the string, and thus gives the endpoint of the pangrammatic window. Running this search step takes O(log n) time, and since we have to do it for all n characters in the string, the total runtime is O(n log n), with O(n) memory usage for the arrays.
A further refinement for the above approach is to replace the arrays and binary search with van Emde Boas trees and predecessor searches. This increases the creation time to O(n log log n), but reduces each search time to O(log log n) time, for a net runtime of O(n log log n) with O(n) space usage.
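For reference, the binary-search variant might look roughly like this, using plain sorted lists and Python's `bisect` module rather than van Emde Boas trees. The function name is hypothetical, and treating letters case-insensitively while ignoring all other characters is an assumption of this sketch.

```python
from bisect import bisect_left
from typing import Optional
import string


def smallest_pangram_window_bisect(text: str) -> Optional[str]:
    """Shortest substring of text containing all 26 letters, or None."""
    # positions[c] holds the indices of letter c in increasing order.
    positions = {c: [] for c in string.ascii_lowercase}
    for i, ch in enumerate(text):
        c = ch.lower()
        if c in positions:
            positions[c].append(i)

    best = None  # (start, end) with end exclusive
    for start in range(len(text)):
        end = start
        for plist in positions.values():
            k = bisect_left(plist, start)  # first occurrence at or after start
            if k == len(plist):
                end = None
                break
            end = max(end, plist[k])       # the "long pole" letter decides the end
        if end is None:
            break  # some letter no longer occurs; later starts cannot succeed either
        if best is None or end + 1 - start < best[1] - best[0]:
            best = (start, end + 1)
    return None if best is None else text[best[0]:best[1]]
```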
Are there any better algorithms out there?
For every letter, keep track of its most recent sighting. Whenever you process a letter, update the corresponding sighting index and calculate the range (max - min) of the sighting indexes over all letters. Find the location with the minimum range.
Complexity is O(n), or O(n log(m)) if you take the alphabet size m into account.
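A short sketch of this last-sighting approach. The name is hypothetical, and matching letters case-insensitively is an assumption. For simplicity the minimum is recomputed with a plain scan over the 26 stored indexes, which costs O(m) per letter rather than the O(log m) mentioned above.

```python
from typing import Optional
import string

ALPHABET = set(string.ascii_lowercase)


def smallest_pangram_window_last_seen(text: str) -> Optional[str]:
    """Shortest substring of text containing all 26 letters, or None."""
    last = {}    # letter -> index of its most recent sighting
    best = None  # (start, end) with end exclusive
    for i, ch in enumerate(text):
        c = ch.lower()
        if c not in ALPHABET:
            continue
        last[c] = i
        if len(last) == 26:
            # The current index is the max; the oldest sighting is the window start.
            start = min(last.values())
            if best is None or i + 1 - start < best[1] - best[0]:
                best = (start, i + 1)
    return None if best is None else text[best[0]:best[1]]
```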
This algorithm has O(M) space complexity and O(N) time complexity (time does not depend on alphabet size M):
1. Advance the first iterator and increase the counter for each processed letter. Stop when all 26 counters are non-zero.
2. Advance the second iterator and decrease the counter for each processed letter. Stop when any of these counters is zero.
3. Use the difference between the iterators to update the best-so-far result and continue with step 1.
This algorithm may be improved a little bit if, instead of character counters, positions in the string are stored. In this case step 2 should only read these positions and compare them with the current position, and step 1 should update these positions and (most of the time) search for some character in the text.
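The counter-based version of the steps above can be sketched as a standard two-pointer sliding window. The function name is made up, and letters are matched case-insensitively with all other characters ignored, which is an assumption of this example.

```python
from collections import Counter
from typing import Optional
import string

LETTERS = set(string.ascii_lowercase)


def smallest_pangram_window(text: str) -> Optional[str]:
    """Shortest substring of text containing all 26 letters, or None."""
    counts = Counter()  # letter counts inside the current window
    missing = 26        # letters whose count is still zero
    best = None         # (start, end) with end exclusive
    left = 0
    for right, ch in enumerate(text):
        c = ch.lower()
        if c not in LETTERS:
            continue
        # Step 1: extend the window on the right.
        counts[c] += 1
        if counts[c] == 1:
            missing -= 1
        # Steps 2 and 3: shrink from the left while the window stays pangrammatic.
        while missing == 0:
            if best is None or right + 1 - left < best[1] - best[0]:
                best = (left, right + 1)
            d = text[left].lower()
            if d in LETTERS:
                counts[d] -= 1
                if counts[d] == 0:
                    missing += 1
            left += 1
    return None if best is None else text[best[0]:best[1]]
```

Each index is visited at most once by each pointer, so the running time is O(N), with O(M) space for the counters.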

Resources