Can someone explain why Longest Palindromic Substring is a dynamic programming problem? I cannot see how to model this question with respect to (1) optimal substructure and (2) overlapping subproblems.
Related
I took a course on Hamming and error-correcting codes two years ago, and I recently had to go back to it for another course. I must admit I have forgotten parts of it. I know the meaning of the terms and how to do the operations separately, but I can't figure out how to visualize the linear code for these questions. Could someone help?
Questions
I am trying to solve the Longest Common Subsequence problem, which is the problem of finding the longest subsequence common to all sequences in a set of sequences (often just two sequences).
I am trying to do this to calculate the overlap between two strings.
This is a well-known dynamic programming problem. However, in my case the strings are too large. When I tried to use a 2D matrix to memoize, I ran out of memory.
One solution could be to use a sparse matrix instead, but I am a little concerned about the performance overhead of that.
I also want to run this algorithm across multiple strings. An approximate answer would be fine, since I am only trying to measure the overlap between two strings.
EDIT: After some investigation I found the following alternatives:
Hirschberg's algorithm: https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm
Original paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.348.4360&rep=rep1&type=pdf
Approximate algorithm: http://cs.haifa.ac.il/~ilan/online-papers/cpm09.pdf
Deposition and Extension Approach to Find Longest Common Subsequence for Multiple Sequences: https://arxiv.org/pdf/0903.2015.pdf
LCS on DNA sequence: http://www.sersc.org/journals/IJAST/vol47/2.pdf
Efficient algorithm: http://www.sciencedirect.com/science/article/pii/S0885064X12000635
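To reduce memory complexity, you don't need to store the entire 2D table. Each cell depends only on the current row and the row above it, so you can keep just those two rows and track the maximum separately. This reduces memory usage from O(N^2) to O(N), while the time complexity remains O(N^2). Note that two rows are enough to compute the length of the LCS, but not to reconstruct the subsequence itself; Hirschberg's algorithm handles that case in linear space.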
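The two-row idea can be sketched as follows. This is a minimal illustration rather than a tuned implementation; the function name `lcs_length` is made up for this example, and it returns only the length of the longest common subsequence.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, using two rows of the DP table."""
    # Put the shorter string along the row so each row is as small as possible.
    if len(b) > len(a):
        a, b = b, a

    prev = [0] * (len(b) + 1)  # row for the previous character of `a`
    for ch in a:
        curr = [0] * (len(b) + 1)  # row for the current character of `a`
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                # Extend the best subsequence that ends before both characters.
                curr[j] = prev[j - 1] + 1
            else:
                # Drop one character from either string and keep the better result.
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(b)]
```

For an overlap measure, one option (an assumption, not something the question specifies) is to normalize the result, for example `lcs_length(a, b) / max(len(a), len(b))`.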
Is it possible to solve Longest Common Substring, Longest Palindromic Substring, Longest Repeated Substring, Searching All Patterns and Substring Check with both KMP and a suffix tree built using Ukkonen's algorithm? If yes, which one should I use, since both algorithms have linear time complexity?
For finding the longest common substring, I would use a generalized suffix tree, which gives linear complexity (Kadane's algorithm solves the maximum-subarray problem, not this one). For the longest palindromic substring, the choice would be Manacher's algorithm, which also has linear complexity. The longest repeated substring is again a suffix tree problem. For searching all occurrences of a pattern and for substring checks, the choice boils down to KMP versus Boyer-Moore.
As to which one: Boyer-Moore compares the pattern starting from its last character instead of the first, on the assumption that if there is no match at the end there is no need to try to match at the beginning, so it can skip ahead. KMP searches for occurrences of a word W within a main text string S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
This makes KMP slightly better suited to small alphabets, such as DNA strings like ACTGT.
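To make the KMP description concrete, here is a small sketch that returns every position where a pattern occurs in a text. The names `build_failure` and `kmp_search` are chosen for this example only.

```python
def build_failure(pattern: str) -> list:
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text: str, pattern: str) -> list:
    """Return the start index of every occurrence of `pattern` in `text`."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # On a mismatch, reuse what the pattern tells us instead of rescanning the text.
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to allow overlapping matches
    return matches
```

The text index `i` never moves backwards, which is where the linear running time comes from.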
Sentences are just sequences of words. These sequences can have a lot of ambiguities. One of the main goals of natural language processing is to represent sentences as something that has more structure and fewer ambiguities.
So, my question is: what are the ways to represent sentences? I assume that there are many alternative approaches. What are the differences between them? What are their advantages and disadvantages?
This is a very broad question, but probably a sufficient answer is that discrete and continuous representations are two different paradigms. In the discrete version, words are represented by indexes corresponding to, e.g., their position in a dictionary. This leads to a vector representation for each sentence where the vector (dimension: |vocabulary|) is very sparse and has 1s for the words in the sentence and 0s elsewhere.
The other paradigm is to replace the vector of discrete values with a vector of continuous real values learned via a neural network. This started from LSA, was the general idea behind word2vec, and has been the basis for many great works over the past 2-3 years in the NLP community.
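As a small illustration of the discrete paradigm, the sketch below builds a vocabulary and a binary sentence vector. The helper names and the whitespace tokenization are simplifying assumptions for this example, not a recommended preprocessing pipeline.

```python
def build_vocabulary(sentences):
    """Map each distinct word to an index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def binary_bow(sentence, vocab):
    """Sparse 0/1 vector of dimension |vocabulary|: 1 where a word occurs in the sentence."""
    vec = [0] * len(vocab)
    for word in sentence.lower().split():
        idx = vocab.get(word)
        if idx is not None:  # words outside the vocabulary are ignored
            vec[idx] = 1
    return vec


vocab = build_vocabulary(["the cat sat", "the dog sat"])
vector = binary_bow("the dog", vocab)  # one slot per vocabulary word
```

Word order is lost in this representation, which is one of the limitations that continuous representations try to address.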
Given a string, I know how to find the number of palindromic substrings in linear time using Manacher's algorithm. But now I need to find the number of distinct (unique) palindromic substrings. This might lead to an O(n + n^2) algorithm: one n for finding all such substrings, and n^2 for comparing each of these substrings with the ones already found, to check whether it is unique.
I am sure there is an algorithm with better complexity. I was thinking of maybe trying my luck with suffix trees? Is there an algorithm with better time complexity?
I would just put the substrings you find into a hash table to avoid storing the same result twice.
Access to a hash table takes O(1) time on average, although hashing a substring still costs time proportional to its length.
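A minimal sketch of the hash-table approach, using Python's built-in `set` and expansion around each center rather than Manacher's algorithm (a simplification for brevity). The function name is made up for this example. Because every slice is copied and hashed, the worst case is cubic in the string length, so this is only suitable for short inputs.

```python
def distinct_palindromes_naive(s: str) -> set:
    """Collect every distinct palindromic substring of `s` in a hash set."""
    found = set()
    n = len(s)
    # There are 2n - 1 centers: n on characters and n - 1 between adjacent characters.
    for center in range(2 * n - 1):
        left = center // 2
        right = left + center % 2
        # Grow the palindrome outwards while the two ends match.
        while left >= 0 and right < n and s[left] == s[right]:
            found.add(s[left:right + 1])  # the set silently drops duplicates
            left -= 1
            right += 1
    return found
```

The number of distinct palindromic substrings is then `len(distinct_palindromes_naive(s))`.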
As of 2015, there is a linear-time algorithm for computing the number of distinct palindromic substrings of a given string S. You can use a data structure known as an eertree (or palindromic tree). The idea is fairly complicated, but the premise is to build a trie of palindromes and augment it with links to longest proper palindromic suffixes, in a similar manner to the failure function of the Aho-Corasick algorithm. See the original paper for more details: https://arxiv.org/pdf/1506.04862.pdf
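The following is a compact sketch of the eertree idea, written from the general description above rather than copied from the paper; variable names such as `length`, `link` and `edges` are choices made for this example. Each node stands for one distinct palindrome, so the answer is the number of nodes excluding the two roots.

```python
def count_distinct_palindromes(s: str) -> int:
    """Count distinct palindromic substrings of `s` by building an eertree."""
    # Node 0 is an imaginary root of length -1; node 1 is the empty palindrome.
    length = [-1, 0]
    link = [0, 0]      # suffix links: longest proper palindromic suffix of each node
    edges = [{}, {}]   # edges[v][c] is the node for the palindrome c + v + c
    last = 1           # node of the longest palindromic suffix of the prefix read so far

    for i, c in enumerate(s):
        # Walk suffix links until the palindrome can be extended by c on both sides.
        cur = last
        while True:
            j = i - length[cur] - 1
            if j >= 0 and s[j] == c:
                break
            cur = link[cur]

        if c in edges[cur]:
            last = edges[cur][c]  # this palindrome has been seen before
            continue

        new = len(length)
        length.append(length[cur] + 2)
        edges.append({})
        if length[new] == 1:
            link.append(1)  # a single character links to the empty palindrome
        else:
            # The suffix link is found the same way, starting one link further down.
            p = link[cur]
            while True:
                j = i - length[p] - 1
                if j >= 0 and s[j] == c:
                    break
                p = link[p]
            link.append(edges[p][c])
        edges[cur][c] = new
        last = new

    return len(length) - 2  # every node except the two roots is a distinct palindrome
```

At most one node is created per character, which is why a string of length n has at most n distinct palindromic substrings.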