converting and priniting LCS palindrome with minimum insertions - string

I am working on a algorithm converting string to palindrome with minimum insertions. I found the LCS approach most understandable (http://isharemylearning.blogspot.com/2012/08/minimum-number-of-insertions-in-string.html) and plenty of implementation can be found to calculate the number but almost no mention as how to generate the actual palindrome. I couldn't figure out the way to generate the result based on backtracking the LCS scoring matrix, particularly when only insertion is allowed. If someone can help me out, that would be great.

Related

Longest Common Subsequence between very large strings

I am trying solve the Longest Common subsequence problem, which is the problem of finding the longest subsequence common to all sequences in a set of sequences (often just two sequences).
I am trying to do this to calculate the overlap between 2 strings.
This is well know Dynamic programming problem. However, In my case the strings are is too huge. When I tried to use the 2D matrix to memoize, I ran into memory out of bound problem.
One solution could be using sparse matrix instead but I am little concerned about the performance overhead with that.
Also I want to perform this algorithm across multiple strings. And it will be okay to provide approximate answer since I am only trying to measure the overlap between 2 strings.
EDIT: After some investigation I found the following alternatives
Hirschberg Algorithm https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm
Original paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.348.4360&rep=rep1&type=pdf
Approximate algorithm : http://cs.haifa.ac.il/~ilan/online-papers/cpm09.pdf
Deposition and Extension Approach to Find Longest Common
Subsequence for Multiple Sequences https://arxiv.org/pdf/0903.2015.pdf
LCS on DNA sequence http://www.sersc.org/journals/IJAST/vol47/2.pdf
Efficient Algorithm http://www.sciencedirect.com/science/article/pii/S0885064X12000635
To reduce memory complexity, you don't need to store the entire 2D table. You can only store the row above and current row and thus you can reduce the memory consumption by O(N) if you store the maximum in another data-structure. This results in O(N) memory usage, but time complexity remains O(N^2).

Naive Suffix Array time complexity

I'm trying to invent programming exercise on Suffix Arrays. I learned O(n*log(n)^2) algorithm for constructing it and then started playing with random input strings of varying length in order to find out when naive approach becomes too slow. E.g. I wanted to choose string length so that people will need to implement "advanced" algorithm.
Suddenly I found that naive algorithm (with using logarithmic sort on all suffixes) is not as slow as O(n^2 * log(n)) means. After thinking a bit, I understand that comparison of suffixes of a randomly generated string is not O(n) amortized. Really, we usually only compare few first characters before we come to difference and there we return from comparison function. This of course depends on the size of the alphabet, but anyway it does not depend much on the length of suffixes.
I tried simple implementation in PHP processing 50000-characters string in 2 seconds (despite slowness of scripting language). If it will work at least as O(n^2) we'll expect it to work at least several minutes (with 1e7 operations per second and ~1e9 operations total).
So I understand that even if it is O(n^2 * log(n)) then the constant factor is a very small fraction of 1, really something close to 0. Or we should say about such complexity as worst-case only, right?
But what is the amortized time complexity of the naive approach? I'm bit bewildered about how to assess it.
You seem to be confusing amortized and expected complexity. In this case you are talking about expected complexity. And yes the stated complexity is computed assuming that the suffix comparison takes O(n). This will be the worst case for suffix comparison and for random generated input you will only perform constant number of comparisons in most cases. Thus O(n^2*log(n)) is worst case complexity.
One more note - on a modern computer you can perform a few billion elementary instructions in a second and it is possible that you execute in the order of 50000^2 in 2 seconds. The correct way to benchmark complexity of an algorithm is to measure the time it takes to complete e.g. for input of size N, N*2, N*4,...(as many as you can go) and then to interpolate the function that would describe the computational complexity

Number of distinct palindromic substrings

Given a string, I know how to find the number of palindromic substrings in linear time using Manacher's algorithm. But now I need to find the number of distinct/unique palindromic substrings. Now, this might lead to an O(n + n^2) algorithm - one 'n' for finding all such substrings, and n^2 for comparing each of these substrings with the ones already found, to check if it is unique.
I am sure there is an algorithm with better complexity. I was thinking of maybe trying my luck with suffix trees? Is there an algorithm with better time complexity?
I would just put substrings you found into the hash table to prevent holding the same results twice.
The access time to hash table is O(1).
As of 2015, there is a linear time algorithm for computing the number of distinct palindromic substrings of a given string S. You can use a data structure known as an eertree (or palindromic tree), as described in the linked paper. The idea is fairly complicated, but the premise is to build a trie of palindromes, and augment it with longest proper palindromic suffixes in a similar manner to the failure function of the Aho-Corasick Algorithm. See the original paper for more details: https://arxiv.org/pdf/1506.04862.pdf

PartitionProblem variation - fixed size of subsets

I have a problem which is a variation of the partition problem which is NP-complete. This is an optimization problem, not a decision problem.
Problem: Partition a list of numbers into two subsets such that their difference of sums is minimum, and find the two subsets. If n even, then the sizes should be n/2, and if odd, then floor[n/2] and ceil[n/2].
Assuming that the pseudo polynomial time DP algorithm is the best for an exact solution, how can it be modified to solve this? And what would be the best approximate algorithms to solve this?
Since you didn't specified which algorithm to use i'll assume you use the one defined here:
http://www.cs.cornell.edu/~wdtseng/icpc/notes/dp3.pdf
Then using this algorithm you add a variable to track the best result, initialize it to N (sum of all the numbers in the list as you can always take one subset to be the empty set) and every time you update T (e.g: T[i]=true) you do something like bestRes = abs(i-n/2)<bestRes : abs(i-n/2) : bestRes. And you return bestRes. This of course doesn't change the complexity of the algorithm.
I've got no idea about your 2nd question.

Speed up text comparisons (with sparse matrices)

I have a function which takes two strings and gives out the cosine similarity value which shows the relationship between both texts.
If I want to compare 75 texts with each other, I need to make 5,625 single comparisons to have all texts compared with each other.
Is there a way to reduce this number of comparisons? For example sparse matrices or k-means?
I don't want to talk about my function or about ways to compare texts. Just about reducing the number of comparisons.
What Ben says it's true, to get better help you need to tell us what's the goal.
For example, one possible optimization if you want to find similar strings is storing the string vectors in a spatial data structure such as a quadtree, where you can outright discard the vectors that are too far away from each other, avoiding many comparisons.
If your algorithm is pair-wise, then you probably can't reduce the number of comparisons, by definition.
You'll need to use a different algorithm, or at the very least pre-process your input if you want to reduce the number of comparisons.
Without the details of your function, it's difficult to give any concrete help.

Resources