Longest palindromic substring and suffix trie

Longest palindromic substring and suffix trie - string

I was Googling about a rather well-known problem, namely: the longest palindromic substring
I have found links that recommend suffix tries as a good solution to the problem.
Example SO and Algos
The approach is (as I understand it) e.g. for a string S create Sr (which is S reversed) and then create a generalized suffix trie.
Then find the longest common sustring of S and Sr which is the path from the root to the deepest node that belongs both to S and Sr.
So the solution using the suffix tries approach essentially reduces to Find the longest common substring problem.
My question is the following:
If the input string is: S = “abacdfgdcaba” so , Sr = “abacdgfdcaba” the longest common substring is abacd which is NOT a palindrome.
So my question is: Is the approach of using suffix tries erroneous? Am i missunderstanding/misreading here?

Yes, finding longest palindrome by using LCS like algorithms is not a good way, I didn't read referenced answer carefully but this line in the answer is completely wrong:
So the longest contained palindrome within a string is exactly the longest common substring of this string and its reverse
but if you read it and you have a counter example don't worry about it (you are right in 99%), this is common mistake, But simple way is as follow:
Write down the string (barbapapa) as follow: #b#a#r#b#a#p#a#p#a#, now traverse each character of this new string from left to right, check its left and right to check whether it's a palindrome center or not. This algorithm is O(n^2) in worst case and works perfectly correct. but normally will finds palindrome in O(n) (sure proving this in average case is hard). Worst case is in strings with too many long palindromes like aaaaaa...aaaa.
But there is better approach which takes O(n) time, base of this algorithm is by Manacher. Related algorithm is more complicated than what I saw in your referenced answer. But what I offered is base idea of Manacher algorithm, with clever changes in algorithm you can skip checking all left and rights (also there are algorithms by using suffix trees).
P.S: I couldn't see your Algo link because of my internet limitations, I don't know it's correct or not.
I added my discussion with OP to clarify the algorithm:
let test it with barbapapa-> #b#a#r#b#a#p#a#p#a#, start from first #
there is no left so it's center of palindrome with length 1.
Now "b",has # in left and # in right, but there isn't another left to match with right
so it's a center of palindrome with length 3.
let skip other parts to arrive to first "p":
first left and right is # second left and right is "a", third left and
right is # but forth left and right are not equal so it's center of palindrome
of length 7 #a#p#a# is palindrome but b#a#p#a#p is not
Now let see first "a" after first "p" you have, #a#p#a#p#a# as palindrome and this "a"
is center of this palindrome with length 11 if you calculate all other palindromes
length of all of them are smaller than 11
Also using # is because considering palindromes of even length.
After finding center of palindrome in newly created string, find related palindrom (by knowing the center and its length), then remove # to find out biggest palindrome.

Related

How to reverse a suffix tree of a string (findind the string it represents)

Given a (modified/broken) suffix tree, which stores in each edge the beginning and ending of the current substring, but not the substring itself, i.e a suffix tree that looks like this:
this tree represents the string "banana" over the alphabet: {a, b, n}.
The algorithm I'm looking for is to find the string that a tree of that sort represents, for the example above, I would like the algorithm to find "banana".
I would like to that in a complexity of O(|string|) where |string| is the length of the string that is being searched.
It can be assumed that:
The size of the alphabet is constant and that every string starts from index 1.

Let's start with some polynomial time solution:
Let's divide all characters in the string into classes of equivalence.
We already know: it is a special $ symbol.
Induction hypothesis: let's assume that we have properly divided all characters of the suffix of length k into classes of equivalence. We can do it properly for the suffix of length k + 1, too.
Proof: let's iterate over all suffices of length i <- 1...k and check if the length of longest common prefix of the suffix of length k and the suffix of length i is not zero. It is non-zero iff the lowest common ancestor of the corresponding leaves is not the root of the tree. If we have found such a suffix, we know that it's first letter is equal to the first letter of the current suffix. So we can add the first letter of the suffix of length k + 1 to the appropriate class of equivalence. Otherwise, it belongs to its own equivalence class.
When all characters are divided into equivalence classes, we just need to assign a unique symbol to each class(if we need to maintain a correct lexicographical order, we can check which one of them goes earlier. To do this, we need to look at the order of edges that go from the root).
The time complexity is O(n ^ 3)(there are n suffices, we iterate over O(n) other suffices for each of them and we compute their lca in O(n)(I assume that we use a naive algorithm here)). So far, so good.
Now let's use several observation to get a linear solution:
We don't really need the lca. We just need to check that it is not the root. Thus, we can divide all leaves into classes of equivalence based on their ancestor which is an immediate child of the root. It can done in linear time using a depth-first search. The longest common prefix of two suffices is non-empty iff they are in the same class.
We don't actually need to check all shorter suffices. We only need to check the closest one to the left and to the right in depth first search order. Finding the closest smaller number to the left and to the right from the given is a standard problem and it has a linear solution with a stack.
That's it: we check at most two other suffices for the given one and each check is O(1). We have a linear solution now.
This solution uses an assumption that such a string does exist. If this assumption is not feasible, we can construct some string using this algorithm, then build a suffix tree in linear for it using Ukkonnen's algorithm and check that it is exactly the same as the given one.

find most common substring in given string? overlapping is allow

I already searched for posts on this question. But none of them have clear answers.
Find the occurrence of most common substring with length n in given string.
For example, "deded", we set the length of substring to be 3. "ded" will be the most common substring and its occurrence is 2.
Few post suggest using suffix tree and the time complexity is O(nlgn), space complexity is O(n).
First, I'm not familiar with suffix tree. My idea is to use hashmap store the occurrence of each substring with length of 3. The time is O(n) while space is also O(n). Is this better than suffix tree? Should I take hashmap collison into account?
Extra: if above problem is addressed, how can we solve the problem that length of substring doesn't matter. Just find the most common substring in given string.

If the length of the most common substring doesn't matter (but say, you want it to be greater than 1) then the best solution is to look for the most common substring of length 2. You can do this with a suffix tree in linear time, if you look up suffix trees then it will be clear how to do this. If you want the length M of the most common substring to be an input parameter, then you can hash all substrings of length M in linear time using hashing with multiply-and-add where you multiply the previous string hash value by a constant and then add the value for the next least significant value in the string, and take the modulus modulo a prime P. If you pick your modulus P for the computed string integers to be a randomly chosen prime P such that you can store O(P) memory, then this will do the trick, in linear time if you assume that your hashing has no collisions. If you assume that your hashing might have a lot of collisions, and the substring is of length M and the total string length is N, then the running time would be O(MN) because you have to check all collisions, which in the worst case could be checking all substrings of length M for example if your string is a string of all one character. Suffix trees are better in the worst case, let me know if you want some details (but not completely, because suffix trees are complicated) and I can explain at a high level how to get a faster solution with suffix trees.

How to find the period of a string

I take a input from the user and its a string with a certain substring which repeats itself all through the string. I need to output the substring or its length AKA period.
Say
S1 = AAAA // substring is A
S2 = ABAB // Substring is AB
S3 = ABCAB // Substring is ABC
S4 = EFIEFI // Substring is EFI
I could start with a Single char and check if it is same as its next character if it is not, I could do it with two characters then with three and so on. This would be a O(N^2) algo. I was wondering if there is a more elegant solution to this.

You can do this in linear time and constant additional space by inductively computing the period of each prefix of the string. I can't recall the details (there are several things to get right), but you can find them in Section 13.6 of "Text algorithms" by Crochemore and Rytter under function Per(x).

Let me assume that the length of the string n is at least twice greater than the period p.
Algorithm
Let m = 1, and S the whole string
Take m = m*2
Find the next occurrence of the substring S[:m]
Let k be the start of the next occurrence
Check if S[:k] is the period
if not go to 2.
Example
Suppose we have a string
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
For each power m of 2 we find repetitions of first 2^m characters. Then we extend this sequence to it's second occurrence. Let's start with 2^1 so CD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD CDCD CDCD CDCD CD
We don't extend CD since the next occurrence is just after that. However CD is not the substring we are looking for so let's take the next power: 2^2 = 4 and substring CDCD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD CDCD
Now let's extend our string to the first repetition. We get
CDCDFBF
we check if this is periodic. It is not so we go further. We try 2^3 = 8, so CDCDFBFC
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCDFBFC CDCDFBFC
we try to extend and we get
CDCDFBFCDCDFDF
and this indeed is our period.
I expect this to work in O(n log n) with some KMP-like algorithm for checking where a given string appears. Note that some edge cases still should be worked out here.
Intuitively this should work, but my intuition failed once on this problem already so please correct me if I'm wrong. I will try to figure out a proof.
A very nice problem though.

You can build a suffix tree for the entire string in linear time (suffix tree is easy to look up online), and then recursively compute and store the number of suffix tree leaves (occurences of the suffix prefix) N(v) below each internal node v of the suffix tree. Also recursively compute and store the length of each suffix prefix L(v) at each node of the tree. Then, at an internal node v in the tree, the suffix prefix encoded at v is a repeating subsequence that generates your string if N(v) equals the total length of the string divided by L(v).

We can actually optimise the time complexity by creating a Z Array. We can create Z array in O(n) time and O(n) space. Now, lets say if there is string
S1 = abababab
For this the z array would like
z[]={8,0,6,0,4,0,2,0};
In order to calcutate the period we can iterate over the z array and
use the condition, where i+z[i]=S1.length. Then, that i would be the period.

Well if every character in the input string is part of the repeating substring, then all you have to do is store first character and compare it with rest of the string's characters one by one. If you find a match, string until to matched one is your repeating string.

I too have been looking for the time-space-optimal solution to this problem. The accepted answer by tmyklebu essentially seems to be it, but I would like to offer some explanation of what it's actually about and some further findings.
First, this question by me proposes a seemingly promising but incorrect solution, with notes on why it's incorrect: Is this algorithm correct for finding period of a string?
In general, the problem "find the period" is equivalent to "find the pattern within itself" (in some sense, "strstr(x+1,x)"), but with no constraints matching past its end. This means that you can find the period by taking any left-to-right string matching algorith, and applying it to itself, considering a partial match that hits the end of the haystack/text as a match, and the time and space requirements are the same as those of whatever string matching algorithm you use.
The approach cited in tmyklebu's answer is essentially applying this principle to String Matching on Ordered Alphabets, also explained here. Another time-space-optimal solution should be possible using the GS algorithm.
The fairly well-known and simple Two Way algorithm (also explained here) unfortunately is not a solution because it's not left-to-right. In particular, the advancement after a mismatch in the left factor depends on the right factor having been a match, and the impossibility of another match misaligned with the right factor modulo the right factor's period. When searching for the pattern within itself and disregarding anything past the end, we can't conclude anything about how soon the next right-factor match could occur (part or all of the right factor may have shifted past the end of the pattern), and therefore a shift that preserves linear time cannot be made.
Of course, if working space is available, a number of other algorithms may be used. KMP is linear-time with O(n) space, and it may be possible to adapt it to something still reasonably efficient with only logarithmic space.

Looking for ideas: lexicographically sorted suffix array of many different strings compute efficiently an LCP array

I don't want a direct solution to the problem that's the source of this question but it's this one link:
So I take in the strings and add them to a suffix array which is implemented as a sorted set internally, what I obtain then is a lexicographically sorted list of the two given strings.
S1 = "banana"
S2 = "panama"
SuffixArray.add S1, S2
To make searching for the k-th smallest substring efficient I preprocess this sorted set to add in information about the longest common prefix between a suffix and it's predecessor as well as keeping tabs on a cumulative substrings count. So I know that for a given k greater than the cumulative substrings count of the last item, it's an invalid query.
This works really well for small inputs as well as random large inputs of the constraints given in the problem definition, which is at most 50 strings of length 2000. I am able to pass the 4 out of 7 cases and was pretty surprised I didn't get them all.
So I went searching for the bottleneck and it hit me. Given large number of inputs like these
anananananananana.....ananana
bkbkbkbkbkbkbkbkb.....bkbkbkb
The queries for k-th smallest substrings are still fast as expected but not the way I preprocess the sorted set... The way I calculate the longest common prefix between the elements of the set is not efficient and linear O(m), like this, I did the most naïve thing expecting it to be good enough:
m = anananan
n = anananana
Start at 0 and find the point where `m[i] != n[i]`
It is like this because a suffix and his predecessor might no be related (i.e. coming from different input strings) and so I thought I couldn't help but using brute force.
Here is the question then and where I ended up reducing the problem as. Given a list of lexicographically sorted suffix like in the manner I described above (made up of multiple strings):
What is an efficient way of computing the longest common prefix array?.
The subquestion would then be, am I completely off the mark in my approach? Please propose further avenues of investigation if that's the case.
Foot note, I do not want to be shown implemented algorithm and I don't mind to be told to go read so and so book or resource on the subject as that is what I do anyway while attempting these challenges.
Accepted answer will be something that guides me on the right path or in the case that that fails; something that teaches me how to solve these types of problem in a broader sense, a book or something

READING
I would recommend this tutorial pdf from Stanford.
This tutorial explains a simple O(nlog^2n) algorithm with O(nlogn) space to compute suffix array and a matrix of intermediate results. The matrix of intermediate results can be used to compute the longest common prefix between two suffixes in O(logn).
HINTS
If you wish to try to develop the algorithm yourself, the key is to sort the strings based on their 2^k long prefixes.
From the tutorial:
Let's denote by A(i,k) be the subsequence of A of length 2^k starting at position i.
The position of A(i,k) in the sorted array of A(j,k) subsequences (j=1,n) is kept in P(k,i).
and
Using matrix P, one can iterate descending from the biggest k down to 0 and check whether A(i,k) = A(j,k). If the two prefixes are equal, a common prefix of length 2^k had been found. We only have left to update i and j, increasing them both by 2^k and check again if there are any more common prefixes.

String pattern matching with one or zero mismatch

Given a string and a pattern to be matched, how efficiently can the matches be found having zero or one mismatch.
e.g)
S = abbbaaabbbabab
P = abab
Matches are abbb(index 0),aaab(index 4),abbb(index 6),abab(index 10)
I tried to modify KMP algorithm but I'm not sure about the approach.
Please give me idea to proceed with the problem.
Thanks.

Ok I found it! I found the best algorithm!
This might sound a bit brave, but as long as the algorithm I am going to propose has both running time O(m + n) and memory consumption O(m + n) and the entry data itself has the same properties the algorithm can be optimized only in constant.
Algorithms used
I am going to use mix-up between KMP and Rabin Karp algorithms for my solution. Rabin Karp uses rolling hashes for comparing substrings of the initial strings. It requires linear in time precomputing that uses linear additional memory, but from then on the comparison between substrings of the two strings is constant O(1) (this is amortized if you handle collisions properly).
What my solution will not do
My solution will not find all the occurrences in the first string that match the second string with at most 1 difference. However, the algorithm can be modified so that for every starting index in the first string if there is such matching at least one of them will be found (this is left to the reader).
Observations
Let m be the length of the second string and n - the length of the first string. I am going to split the task in two parts: if I am aiming to find a matching with at most one difference, I want to find to substrings of the first string: PREF is going to be the substring before the single difference and SUFF the substring after the difference. I want len(PREF) + len(SUFF) + 1 = m, where PREF or SUFF will be artificially shortened if required (when the strings match without difference).
I am going to base my solution on one very important observation: suppose there is a substring of the first string starting at index i with length m that matches the second string with at most one difference. Then if we take PREF as long as possible there will still be solution for SUFF. This is obvious: I am just pushing the difference as much to the end as possible.
The algorithm
And now follows the algorithm itself. Start off with usual KMP. Every time when the extension of the prefix fails and the fail links are to be followed, first check whether if you skip the next letter the remaining suffix will match the remaining of the second string. If so the sought match with at most one character difference is found. If not - we go on with the ordinary KMP making the Rabin Karp check every time a fail link is to be followed.
Let me clarify further the Rabin Karp check with an example. Suppose we are at certain step of the KMP and we have found that first.substring[i, i + k - 1] matches the first k letters of the second string. Suppose also that the letter first[i + k] is different from second[k]. Then you check whether first.substring[i + k + 1, i + m - 1] matches exactly second.substring[k + 1, m - 1] using Rabin Karp. This is exactly the case in which you have extended the starting prefix form index i as much as possible and you try now whether there is a match with at most one difference.
Rabin Karp will be used only when a fail link is followed, which moves the starting index of the prefix with at least one, which means that at most O(n) Rabin Karp calls are used, every one with complexity O(1) for a total of linear complexity.

This is known as the approximate string matching problem. In your particular case, you want a maximum edit distance of 1.
The bitap algorithm is a fairly fast way of solving it.

To find all submatches including one mismatch you need 2 z-functions (one for the original P, and another for reversed P).
After that buld array of longest prefix submatches for the original and reversed string S.
Later you need to reverse the second array.
And in the end everything is easy: run through the first array and check if the length of longest prefix is equal to the length of P. If it is, then it is a match without mistakes.
If it is shorter, then check the second array at position (i + length(P) - 1). If sum of
two values is equal to length(P) - 1, then it is a submatch with one mistake.
Complexity is O(len(P) + len(S))

A comprehensive overview of the various algorithms and how they compare to each other is given by Gonzalo Navarro in his A guided tour to approximate string matching. Pages 80, 81 and 82 show complexity results, including worst and average cases, and space requirements for the various algorithms.
(In the notation used there, n refers to the length of the text you search, m to the length of the pattern, σ to the size of the alphabet, and k to the maximum edit distance, which is 1 in your case.)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string