Number of distinct palindromic substrings

Given a string, I know how to find the number of palindromic substrings in linear time using Manacher's algorithm. But now I need to find the number of distinct/unique palindromic substrings. The naive extension leads to an O(n + n^2) algorithm: one 'n' for finding all such substrings, and n^2 for comparing each of them against the ones already found to check uniqueness.
I am sure there is an algorithm with better complexity. I was thinking of maybe trying my luck with suffix trees. Is there an algorithm with better time complexity?

I would just put the substrings you find into a hash table to avoid counting the same result twice.
A hash-table lookup is O(1) expected, although hashing a substring still costs time proportional to its length.
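For illustration, a minimal sketch of that idea in Python: enumerate palindromes by expanding around every center and deduplicate with a set. This is not linear time (there can be Θ(n^2) palindromic occurrences, and each set insertion hashes an O(n) slice), but it shows the deduplication:

    def distinct_palindromes(s):
        # Expand around all 2n-1 centers; collect each palindromic substring.
        found = set()
        for center in range(2 * len(s) - 1):
            lo, hi = center // 2, (center + 1) // 2  # odd and even centers
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                found.add(s[lo:hi + 1])
                lo -= 1
                hi += 1
        return len(found)

    print(distinct_palindromes("abaab"))  # 5: a, b, aa, aba, baab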

As of 2015, there is a linear-time algorithm for computing the number of distinct palindromic substrings of a given string S. You can use a data structure known as an eertree (or palindromic tree), as described in the linked paper. The idea is fairly involved, but the premise is to build a trie of palindromes and augment it with longest-proper-palindromic-suffix links, in a similar manner to the failure function of the Aho-Corasick algorithm. See the original paper for more details: https://arxiv.org/pdf/1506.04862.pdf
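A compact sketch of the eertree idea in Python, assuming the usual layout from the paper (two roots of length -1 and 0, plus suffix links); since each node represents one distinct palindrome, the answer is simply the number of non-root nodes:

    class EertreeNode:
        def __init__(self, length, link=None):
            self.length = length  # length of the palindrome this node represents
            self.link = link      # longest proper palindromic suffix
            self.next = {}        # c -> node for the palindrome c + p + c

    def count_distinct_palindromes(s):
        root_neg = EertreeNode(-1)                 # imaginary root of length -1
        root_neg.link = root_neg
        root_zero = EertreeNode(0, link=root_neg)  # empty-string root
        nodes = [root_neg, root_zero]
        last = root_zero  # longest palindromic suffix of the processed prefix

        def find(node, i):
            # Walk suffix links until s[i - node.length - 1] matches s[i].
            while i - node.length - 1 < 0 or s[i - node.length - 1] != s[i]:
                node = node.link
            return node

        for i, c in enumerate(s):
            cur = find(last, i)
            if c in cur.next:       # this palindrome is already in the tree
                last = cur.next[c]
                continue
            new = EertreeNode(cur.length + 2)
            new.link = root_zero if new.length == 1 else find(cur.link, i).next[c]
            cur.next[c] = new
            nodes.append(new)
            last = new

        return len(nodes) - 2  # don't count the two roots

    print(count_distinct_palindromes("abaab"))  # 5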

Related

What is the difference between KMP and Z algorithm of string pattern matching?

In the KMP algorithm we preprocess the pattern to find, for each prefix, its longest proper prefix that is also a suffix, which we use to skip characters while matching.
In the Z-algorithm we instead build a new string
new_string = pattern + 'x' + string
where x is a character that occurs in neither pattern nor string.
We then compute the Z-array of new_string; wherever a Z-value equals the pattern length, we have found an occurrence (a sketch of this follows below).
Both have time complexity O(m+n).
So what is the difference between these two algorithms, and which one is better to use?
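For concreteness, a minimal Python sketch of the Z-based matching just described; '\x00' stands in for the 'x' separator on the assumption it appears in neither string:

    def z_function(s):
        # z[i] = length of the longest common prefix of s and s[i:]
        n = len(s)
        z = [0] * n
        z[0] = n
        l, r = 0, 0
        for i in range(1, n):
            if i < r:
                z[i] = min(r - i, z[i - l])
            while i + z[i] < n and s[z[i]] == s[i + z[i]]:
                z[i] += 1
            if i + z[i] > r:
                l, r = i, i + z[i]
        return z

    def find_occurrences(pattern, text, sep="\x00"):
        z = z_function(pattern + sep + text)
        m = len(pattern)
        return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]

    print(find_occurrences("aba", "ababa"))  # [0, 2]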
It's not always about time complexity; space complexity also plays a role here:
Knuth-Morris-Pratt:
Worst-case performance: Θ(m) preprocessing + Θ(n) matching
Worst-case space complexity: Θ(m)
Z algorithm:
Worst-case performance: Θ(m+n) preprocessing and matching
Worst-case space complexity: Θ(n+m)
Besides, the idea of matching prefixes against suffixes has uses beyond searching for a pattern, so you may have other reasons to compute one or the other piece of information.
For some tasks I would also recommend other matching algorithms, such as Boyer-Moore, even if their worst-case time complexity is worse; it all depends on the situation.
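For comparison, a minimal KMP sketch; note that only the Θ(m) prefix table for the pattern is stored, which is exactly the space advantage listed above:

    def prefix_function(p):
        # pi[i] = length of the longest proper prefix of p[:i+1] that is also its suffix
        pi = [0] * len(p)
        k = 0
        for i in range(1, len(p)):
            while k and p[i] != p[k]:
                k = pi[k - 1]
            if p[i] == p[k]:
                k += 1
            pi[i] = k
        return pi

    def kmp_search(pattern, text):
        pi = prefix_function(pattern)
        k, hits = 0, []
        for i, c in enumerate(text):
            while k and c != pattern[k]:
                k = pi[k - 1]      # fall back instead of re-examining the text
            if c == pattern[k]:
                k += 1
            if k == len(pattern):
                hits.append(i - k + 1)
                k = pi[k - 1]
        return hits

    print(kmp_search("aba", "ababa"))  # [0, 2]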

Longest Common Subsequence between very large strings

I am trying to solve the Longest Common Subsequence problem, which is the problem of finding the longest subsequence common to all sequences in a set of sequences (often just two). I am doing this to calculate the overlap between 2 strings.
This is a well-known dynamic programming problem. However, in my case the strings are huge. When I tried to use the 2D matrix to memoize, I ran into an out-of-memory problem.
One solution could be using a sparse matrix instead, but I am a little concerned about the performance overhead of that.
I also want to run this algorithm across multiple strings, and it is okay to provide an approximate answer, since I am only trying to measure the overlap between 2 strings.
EDIT: After some investigation I found the following alternatives:
Hirschberg's algorithm: https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm
Original paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.348.4360&rep=rep1&type=pdf
Approximate algorithm: http://cs.haifa.ac.il/~ilan/online-papers/cpm09.pdf
Deposition and Extension Approach to Find Longest Common Subsequence for Multiple Sequences: https://arxiv.org/pdf/0903.2015.pdf
LCS on DNA sequences: http://www.sersc.org/journals/IJAST/vol47/2.pdf
Efficient algorithm: http://www.sciencedirect.com/science/article/pii/S0885064X12000635
To reduce memory you don't need to store the entire 2D table: each row depends only on the row above it, so keeping just the previous row and the current row brings memory down to O(N), while the time complexity remains O(N^2). Note this gives you only the length of the LCS; recovering the subsequence itself in linear space is what Hirschberg's algorithm (linked above) is for.
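A minimal sketch of this two-row trick:

    def lcs_length(a, b):
        # Standard LCS DP, keeping only two rows: O(len(b)) memory.
        prev = [0] * (len(b) + 1)
        for ch in a:
            cur = [0] * (len(b) + 1)
            for j, ch_b in enumerate(b, 1):
                if ch == ch_b:
                    cur[j] = prev[j - 1] + 1
                else:
                    cur[j] = max(prev[j], cur[j - 1])
            prev = cur
        return prev[-1]

    print(lcs_length("AGGTAB", "GXTXAYB"))  # 4 ("GTAB")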

Difference between Knuth–Morris–Pratt (KMP) and a suffix tree built with Ukkonen's algorithm, in terms of time complexity

Is it possible to find the Longest Common Substring, Longest Palindromic Substring, Longest Repeated Substring, all occurrences of a pattern, and substring checks with both KMP and a suffix tree built with Ukkonen's algorithm? If yes, which one should I use, given that both algorithms run in linear time?
For finding the longest common substring, I would use a generalized suffix tree (or suffix automaton), which gives linear complexity; KMP is not the right tool for that problem. For the longest palindromic substring, the choice would be Manacher's algorithm, which also has linear complexity (a sketch follows below). For the longest repeated substring, a suffix tree works directly (deepest internal node). For searching all patterns, yes, the choice boils down to KMP versus Boyer-Moore.
As to which one: Boyer-Moore matches from the last character of the pattern rather than the first, on the assumption that a mismatch at the end means there is no need to try matching at the beginning, which lets it skip ahead. KMP searches for occurrences of a word W within a main text string S by using the observation that when a mismatch occurs, the pattern itself contains enough information to determine where the next match could begin, bypassing re-examination of previously matched characters.
This makes KMP slightly better suited to small alphabets such as DNA's {A, C, G, T}, where Boyer-Moore's skips are less effective.
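For reference, a minimal sketch of Manacher's algorithm as mentioned above; the sentinels '^', '#', '$' are assumed not to occur in the input:

    def manacher(s):
        # Interleave '#' so even- and odd-length palindromes are handled uniformly.
        t = "^#" + "#".join(s) + "#$"
        p = [0] * len(t)
        center = right = 0
        for i in range(1, len(t) - 1):
            if i < right:
                p[i] = min(right - i, p[2 * center - i])  # reuse the mirrored value
            while t[i + p[i] + 1] == t[i - p[i] - 1]:
                p[i] += 1
            if i + p[i] > right:
                center, right = i, i + p[i]
        k = max(range(len(t)), key=lambda i: p[i])
        start = (k - p[k]) // 2  # map the center in t back to an index in s
        return s[start:start + p[k]]

    print(manacher("abaxabaxabb"))  # "baxabaxab"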

Converting and printing a palindrome with minimum insertions via LCS

I am working on an algorithm for converting a string into a palindrome with the minimum number of insertions. I found the LCS-based approach the most understandable (http://isharemylearning.blogspot.com/2012/08/minimum-number-of-insertions-in-string.html), and plenty of implementations can be found that compute the number of insertions, but there is almost no mention of how to generate the actual palindrome. I couldn't figure out how to produce the result by backtracking the LCS scoring matrix, particularly when only insertion is allowed. If someone can help me out, that would be great.
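One way to reconstruct the palindrome, as a sketch: instead of backtracking the LCS matrix, use the equivalent interval DP (min insertions = len(s) - LCS(s, reversed(s)); cost(i, j) below is the minimum number of insertions to make s[i..j] a palindrome). Reconstruction then falls out naturally: when the two end characters differ, mirror whichever end is cheaper to fix:

    import sys
    from functools import lru_cache

    def make_palindrome(s):
        sys.setrecursionlimit(max(10000, 4 * len(s)))

        @lru_cache(maxsize=None)
        def cost(i, j):
            # Minimum insertions to turn s[i..j] into a palindrome.
            if i >= j:
                return 0
            if s[i] == s[j]:
                return cost(i + 1, j - 1)
            return 1 + min(cost(i + 1, j), cost(i, j - 1))

        def build(i, j):
            if i > j:
                return ""
            if i == j:
                return s[i]
            if s[i] == s[j]:
                return s[i] + build(i + 1, j - 1) + s[j]
            if cost(i + 1, j) <= cost(i, j - 1):
                return s[i] + build(i + 1, j) + s[i]  # mirror s[i] on the right
            return s[j] + build(i, j - 1) + s[j]      # mirror s[j] on the left

        return build(0, len(s) - 1)

    print(make_palindrome("abcda"))  # "abcdcba" (2 insertions)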

Naive Suffix Array time complexity

I'm trying to design a programming exercise on suffix arrays. I learned the O(n*log(n)^2) construction algorithm and then started playing with random input strings of varying length, in order to find out when the naive approach becomes too slow, i.e. to choose a string length that forces people to implement the "advanced" algorithm.
To my surprise, I found that the naive algorithm (comparison-sorting all suffixes) is not as slow as O(n^2 * log(n)) suggests. After thinking a bit, I understood why: comparing two suffixes of a randomly generated string is not O(n) on average. We usually compare only the first few characters before hitting a difference and returning from the comparison function. This of course depends on the alphabet size, but it hardly depends on the length of the suffixes.
I tried a simple implementation in PHP that processes a 50000-character string in 2 seconds (despite the slowness of a scripting language). If it really behaved like O(n^2), we would expect it to run for several minutes (at ~1e7 operations per second and ~1e9 operations in total).
So I understand that even if the bound is O(n^2 * log(n)), the constant factor here is a tiny fraction of 1, or rather we should speak of such complexity as worst-case only, right?
But what is the average-case ("amortized"?) time complexity of the naive approach? I'm a bit bewildered about how to assess it.
You seem to be confusing amortized and expected complexity; what you are describing is expected complexity. And yes, the stated complexity is computed assuming that a suffix comparison takes O(n). That is the worst case for a suffix comparison; on randomly generated input you will perform only a constant number of character comparisons in most cases. So O(n^2 * log(n)) is the worst-case complexity.
One more note: on a modern computer you can perform a few billion elementary instructions per second, so executing on the order of 50000^2 operations in 2 seconds is entirely possible. The correct way to benchmark the complexity of an algorithm is to measure the time it takes for inputs of size N, N*2, N*4, ... (as far as you can go) and then fit a function that describes the growth.
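To make that benchmarking suggestion concrete, a small sketch (random binary strings are an assumption, matching the random-input setup in the question):

    import random
    import time
    from functools import cmp_to_key

    def naive_suffix_array(s):
        # Sort suffix start indices with a direct character-by-character
        # comparison: O(n) per comparison in the worst case, but on random
        # strings it usually stops after a few characters.
        def cmp(i, j):
            while i < len(s) and j < len(s):
                if s[i] != s[j]:
                    return -1 if s[i] < s[j] else 1
                i += 1
                j += 1
            return j - i  # a shorter suffix is a prefix of the longer one

        return sorted(range(len(s)), key=cmp_to_key(cmp))

    # Time the construction at doubling sizes and watch how the ratio grows.
    n = 6250
    while n <= 50000:
        s = "".join(random.choice("ab") for _ in range(n))
        t0 = time.perf_counter()
        naive_suffix_array(s)
        print(n, round(time.perf_counter() - t0, 2), "s")
        n *= 2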
