Longest Common Subsequence for a series of strings


I have found plenty of examples online for the Longest Common Subsequence of 2 strings, and I believe I understand the solution.
What I don't understand is the proper way to solve this problem for N strings. Is the same solution applied somehow, and if so, how? If the solution is different, what is it?

This problem is NP-hard when the input contains an arbitrary number of strings. It is tractable only when the number of strings is fixed. If the input has k strings, we can apply the same DP technique by using a k-dimensional array to store the optimal solutions of the sub-problems.
Reference: Longest common subsequence problem
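
As a rough illustration of the k-dimensional DP idea, here is a minimal Python sketch (Python is my choice here, not the language of the answers below). It memoizes on a tuple of prefix lengths instead of allocating an explicit k-dimensional array, and it returns only the length of the LCS. The name lcs_length is hypothetical, and the recursion depth grows with the total input length, so it is meant for short strings only.

```python
from functools import lru_cache


def lcs_length(strings):
    """Length of the longest common subsequence of a list of strings."""
    if not strings:
        return 0

    @lru_cache(maxsize=None)
    def solve(idx):
        # idx[i] is the length of the prefix of strings[i] still considered.
        if min(idx) == 0:
            return 0
        last_chars = {s[i - 1] for s, i in zip(strings, idx)}
        if len(last_chars) == 1:
            # Every prefix ends in the same character: use it.
            return 1 + solve(tuple(i - 1 for i in idx))
        # Otherwise drop the last character of one string at a time.
        return max(
            solve(idx[:k] + (idx[k] - 1,) + idx[k + 1:])
            for k in range(len(idx))
        )

    return solve(tuple(len(s) for s in strings))
```

The number of cached states is bounded by the product of the string lengths, which is why the approach is only practical for a small, fixed number of strings.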

To find the Longest Common Subsequence (LCS) of 2 strings A and B, you can traverse a 2-dimensional array diagonally, as shown in the link you posted. Every element in the array corresponds to the problem of finding the LCS of the prefixes A' and B' (A cut at its row number, B cut at its column number). The problem is solved by calculating the values of all elements in the array.
You must be certain that when you calculate the value of an array element, all sub-problems required to calculate that value have already been solved. That is why you traverse the 2-dimensional array diagonally (a row-by-row traversal satisfies the same requirement).
This solution can be scaled to finding the longest common subsequence of N strings, but this requires a general way to iterate over an array of N dimensions such that any element is reached only after all the sub-problems it depends on have been solved.
Instead of iterating over the N-dimensional array in a special order, you can also solve the problem recursively. With recursion it is important to save the intermediate solutions, since many branches will require the same intermediate solutions. I have written a small example in C# that does this:
string lcs(string[] strings)
{
    if (strings.Length == 0)
        return "";
    if (strings.Length == 1)
        return strings[0];
    // One cache slot for every combination of positions in the input strings.
    int cacheSize = 1;
    for (int i = 0; i < strings.Length; i++)
        cacheSize *= strings[i].Length;
    string[] cache = new string[cacheSize];
    // Start at the last character of every string.
    int[] indexes = new int[strings.Length];
    for (int i = 0; i < indexes.Length; i++)
        indexes[i] = strings[i].Length - 1;
    return lcsBack(strings, indexes, cache);
}

string lcsBack(string[] strings, int[] indexes, string[] cache)
{
    // An exhausted string means there is nothing left in common.
    for (int i = 0; i < indexes.Length; i++)
        if (indexes[i] == -1)
            return "";
    // Reuse the stored solution if this sub-problem was solved before.
    int cachePos = calcCachePos(indexes, strings);
    if (cache[cachePos] != null)
        return cache[cachePos];
    bool match = true;
    for (int i = 1; i < indexes.Length; i++)
    {
        if (strings[0][indexes[0]] != strings[i][indexes[i]])
        {
            match = false;
            break;
        }
    }
    string result = "";
    if (match)
    {
        // All current characters agree: take the character and step back in every string.
        int[] newIndexes = new int[indexes.Length];
        for (int i = 0; i < indexes.Length; i++)
            newIndexes[i] = indexes[i] - 1;
        result = lcsBack(strings, newIndexes, cache) + strings[0][indexes[0]];
    }
    else
    {
        // Step back in one string at a time and keep the longest result.
        for (int i = 0; i < strings.Length; i++)
        {
            if (indexes[i] <= 0)
                continue;
            int[] newIndexes = (int[])indexes.Clone();
            newIndexes[i]--;
            string candidate = lcsBack(strings, newIndexes, cache);
            if (candidate.Length > result.Length)
                result = candidate;
        }
    }
    cache[cachePos] = result;
    return result;
}

// Maps a combination of positions to a unique slot in the flat cache array.
int calcCachePos(int[] indexes, string[] strings)
{
    int factor = 1;
    int pos = 0;
    for (int i = 0; i < indexes.Length; i++)
    {
        pos += indexes[i] * factor;
        factor *= strings[i].Length;
    }
    return pos;
}
My code example can be optimized further. Many of the strings being cached are duplicates, and some are duplicates with just one additional character added. This uses more space than necessary when the input strings become large.
On input: "666222054263314443712", "5432127413542377777", "6664664565464057425"
The LCS returned is "54442"

Related

Profit Maximization based on dynamic programming

I have been trying to solve this problem :
" You have to travel to different villages to make some profit.
In each village, you gain some profit. But the catch is, from a particular village i, you can only move to a village j if and only if the profit gain from village j is a multiple of the profit gain from village i.
You have to tell the maximum profit you can gain while traveling."
Here is the link to the full problem:
https://www.hackerearth.com/practice/algorithms/dynamic-programming/introduction-to-dynamic-programming-1/practice-problems/algorithm/avatar-and-his-quest-d939b13f/description/
I have been trying to solve this problem for quite a few hours. I know this is a variant of the longest increasing subsequence but the first thought that came to my mind was to solve it through recursion and then memoize it. Here is a part of the code to my approach. Please help me identify the mistake.
static int[] dp;
static int index;

static int solve(int[] p) {
    int n = p.length;
    int max = 0;
    for (int i = 0; i < n; i++)
    {
        dp = new int[i + 1];
        Arrays.fill(dp, -1);
        index = i;
        max = Math.max(max, profit(p, i));
    }
    return max;
}

static int profit(int[] p, int n)
{
    if (dp[n] == -1)
    {
        if (n == 0)
        {
            if (p[index] % p[n] == 0)
                dp[n] = p[n];
            else
                dp[n] = 0;
        }
        else
        {
            int v1 = profit(p, n - 1);
            int v2 = 0;
            if (p[index] % p[n] == 0)
                v2 = p[n] + profit(p, n - 1);
            dp[n] = Math.max(v1, v2);
        }
    }
    return dp[n];
}

I have used an extra array to get the solution. My code is written in Java.
public static int getmaxprofit(int[] p, int n) {
    // p is the array that contains all the village profits
    // n is the number of villages
    // msis[i] is the best total profit of a route ending at village i;
    // initially it is just a copy of p
    int i, j, max = 0;
    int msis[] = new int[n];
    for (i = 0; i < n; i++) {
        msis[i] = p[i];
    }
    // While iterating through p, look backward for every village that can
    // precede village i: its profit must be smaller, and the current profit
    // must be a multiple of it.
    for (i = 1; i < n; i++) {
        for (j = 0; j < i; j++) {
            if (p[i] > p[j] && p[i] % p[j] == 0 && msis[i] < msis[j] + p[i]) {
                msis[i] = msis[j] + p[i];
            }
        }
    }
    for (i = 0; i < n; i++) {
        if (max < msis[i]) {
            max = msis[i];
        }
    }
    return max;
}
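
For a quick way to experiment with the recurrence used in this answer, here is a small Python sketch of the same idea. It assumes all profits are positive integers and, like the answer, requires each next profit to be strictly larger than the previous one. The function name max_profit is hypothetical.

```python
def max_profit(profits):
    """Best total over routes where each next profit is a larger multiple of the previous one."""
    # best[i] is the best total profit of a route that ends at village i.
    best = list(profits)
    for i in range(1, len(profits)):
        for j in range(i):
            if profits[i] > profits[j] and profits[i] % profits[j] == 0:
                best[i] = max(best[i], best[j] + profits[i])
    return max(best, default=0)
```

For the profits [1, 2, 3, 4, 9, 8], the best route is 1, 2, 4, 8 with a total of 15.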

Find the prefix which is also a suffix

I'm looking for an optimal solution to this problem.
Given a string s of length n, find a prefix, taken from the left, that equals a suffix, taken from the right.
The prefix and suffix can overlap.
Example: given abababa, the prefix is [ababa]ba and the suffix is ab[ababa].
This is as far as I have got:
For each i = 0 to n-1, take the prefix ending at i and check whether there is a matching suffix. This takes O(n^2) time and O(1) space.
I came up with an optimization where we index the positions of all the characters. This way, we can eliminate some of the candidates from the first approach. Again, the worst-case complexity is O(n^2), with O(n) additional space.
Is there a better algorithm for this?

Make use of the KMP algorithm. The state of the algorithm determines "the longest suffix of the haystack that's still a prefix of the needle". So just take your string as needle and the string without the first character as haystack. Runs in O(N) time and O(N) space.
An implementation with some examples :
// Requires: import java.util.Arrays;
public static int[] create(String needle) {
    int[] backFunc = new int[needle.length() + 1];
    backFunc[0] = backFunc[1] = 0;
    for (int i = 1; i < needle.length(); ++i) {
        int testing = i - 1;
        while (backFunc[testing] != testing) {
            if (needle.charAt(backFunc[testing]) == needle.charAt(i - 1)) {
                backFunc[i] = backFunc[testing] + 1;
                break;
            } else {
                testing = backFunc[testing];
            }
        }
    }
    return backFunc;
}

public static int find(String needle, String haystack) {
    // some unused character to ensure that we always return back and never reach the end of the
    // needle
    needle = needle + "$";
    int[] backFunc = create(needle);
    System.out.println(Arrays.toString(backFunc));
    int curpos = 0;
    for (int i = 0; i < haystack.length(); ++i) {
        while (curpos != backFunc[curpos]) {
            if (haystack.charAt(i) == needle.charAt(curpos)) {
                ++curpos;
                break;
            } else {
                curpos = backFunc[curpos];
            }
        }
        if (curpos == 0 && needle.charAt(0) == haystack.charAt(i)) {
            ++curpos;
        }
        System.out.println(curpos);
    }
    return curpos;
}

public static void main(String[] args) {
    String[] tests = {"abababa", "tsttst", "acblahac", "aaaaa"};
    for (String test : tests) {
        System.out.println("Length is : " + find(test, test.substring(1)));
    }
}

Simple implementation in C#:
string S = "azffffaz";
char[] characters = S.ToCharArray();
// cumulativeCharMatches[i] is the length of the longest proper prefix of S
// that is also a suffix of S[0..i].
int[] cumulativeCharMatches = new int[characters.Length];
cumulativeCharMatches[0] = 0;
int matchCount = 0;
// Use a KMP-type algorithm to determine matches.
// matchCount is the number of prefix characters currently matched,
// so characters[matchCount] is the next prefix character to compare.
for (int i = 1; i < characters.Length; i++)
{
    // On a mismatch, fall back to the next shorter candidate prefix
    // instead of restarting from zero, so that overlapping matches are kept.
    while (matchCount > 0 && characters[i] != characters[matchCount])
        matchCount = cumulativeCharMatches[matchCount - 1];
    if (characters[i] == characters[matchCount])
        matchCount += 1;
    cumulativeCharMatches[i] = matchCount;
}
// The last entry is the length of the longest prefix that is also a suffix of the whole string.
return cumulativeCharMatches[characters.Length - 1];

See http://algorithmsforcontests.blogspot.com/2012/08/borders-of-string.html for an O(n) solution.
The code actually calculates the index of the last character in the prefix. For the actual prefix/suffix, you will need to extract the substring from 0 to j (both included, length is j+1).
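
To make the O(n) idea concrete, here is a minimal Python sketch of the KMP failure function (also called the prefix function). It is my own sketch, not the code from the linked page, and the name longest_border is hypothetical. It returns the length of the longest proper prefix that is also a suffix, so the prefix itself is s[:length].

```python
def longest_border(s):
    """Length of the longest proper prefix of s that is also a suffix of s."""
    if not s:
        return 0
    # fail[i] is the border length of the prefix s[0..i].
    fail = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        # On a mismatch, fall back to the next shorter border.
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return fail[-1]
```

For the example from the question, longest_border("abababa") is 5, which corresponds to the prefix "ababa".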

Find the index of a specific combination without generating all ncr combinations

I am trying to find the index of a specific combination without generating the actual list of all possible combinations. For example, the 2-number combinations from 1 to 5 are 1,2; 1,3; 1,4; 1,5; 2,3; 2,4; 2,5; and so on. Each combination has its own index, starting with zero if my guess is right. I want to find the index of a given combination without generating all possible combinations. I am writing in C#, but my code generates all possible combinations on the fly. This would be expensive if n and r are, say, 80 and 9, and I can't even enumerate the actual range. Is there any way to find the index without producing the actual combination for each index?
public int GetIndex(T[] combination)
{
    int index = (from i in Enumerable.Range(0, 9)
                 where AreEquivalentArray(GetCombination(i), combination)
                 select i).SingleOrDefault();
    return index;
}

I found an answer to my own question. It is very simple but seems to be effective in my situation. The Choose method, which returns the number of combinations of n items taken r at a time, is taken from another site:
public long GetIndex(T[] combinations)
{
    long sum = Choose(items.Count(), atATime);
    for (int i = 0; i < combinations.Count(); i++)
    {
        sum = sum - Choose(items.ToList().IndexOf(items.Max()) + 1 - (items.ToList().IndexOf(combinations[i]) + 1), atATime - i);
    }
    return sum - 1;
}

private long Choose(int n, int k)
{
    long result = 0;
    int delta;
    int max;
    if (n < 0 || k < 0)
    {
        throw new ArgumentOutOfRangeException("Invalid negative parameter in Choose()");
    }
    if (n < k)
    {
        result = 0;
    }
    else if (n == k)
    {
        result = 1;
    }
    else
    {
        if (k < n - k)
        {
            delta = n - k;
            max = k;
        }
        else
        {
            delta = k;
            max = n - k;
        }
        result = delta + 1;
        for (int i = 2; i <= max; i++)
        {
            checked
            {
                result = (result * (delta + i)) / i;
            }
        }
    }
    return result;
}
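
For comparison, here is a small Python sketch of one common way to compute such an index directly. It is not a translation of the C# above. It assumes the items are the integers 1..n, that the combination is given in increasing order, and that combinations are indexed in lexicographic order starting at zero, as in the question's example. The name combination_index is hypothetical, and math.comb requires Python 3.8 or later.

```python
from math import comb


def combination_index(combo, n):
    """Zero-based lexicographic index of a sorted r-combination of 1..n."""
    r = len(combo)
    index = 0
    prev = 0
    for pos, value in enumerate(combo):
        # Skip every combination that shares the earlier elements but has
        # a smaller element at this position.
        for smaller in range(prev + 1, value):
            index += comb(n - smaller, r - pos - 1)
        prev = value
    return index
```

With n = 5, the combination (1, 2) has index 0, (2, 3) has index 4 and (4, 5) has index 9. The work is proportional to n rather than to the number of combinations, so n = 80 and r = 9 is not a problem.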

Longest Common Prefix property

I was going through suffix arrays and their use in computing the longest common prefix (lcp) of two suffixes.
The source says:
"The lcp between two suffixes is the minimum of the lcp's of all pairs of adjacent suffixes between them on the array"
i.e. lcp(x,y) = min{ lcp(x,x+1), lcp(x+1,x+2), ..., lcp(y-1,y) }
where x and y are the two indices in the string at which the two suffixes start.
I am not convinced by the statement, for example for the string "abca".
lcp(1,4) = 1 (using 1-based indexing)
but if I apply the above equation then
lcp(1,4) = min{lcp(1,2), lcp(2,3), lcp(3,4)}
and I think lcp(1,2) = 0,
so the answer must be 0 according to the equation.
Am I getting it wrong somewhere?

I think the index the source refers to is not an index into the string itself, but an index into the sorted suffixes:
a
abca
bca
ca
Hence
lcp(1,2) = lcp(a, abca) = 1
lcp(1,4) = min(lcp(1,2), lcp(2,3), lcp(3,4)) = 0
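
The "abca" example can be checked with a few lines of Python. This is only a sketch for small inputs: it sorts the suffixes naively instead of building a real suffix array, and the helper names lcp, sorted_suffixes and lcp_by_rank are hypothetical. Ranks are 1-based, as in the example above.

```python
def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def sorted_suffixes(s):
    """All suffixes of s in lexicographic order (naive, for small inputs)."""
    return sorted(s[i:] for i in range(len(s)))


def lcp_by_rank(suffixes, x, y):
    """lcp of the suffixes at 1-based ranks x < y, via adjacent pairs only."""
    return min(lcp(suffixes[k - 1], suffixes[k]) for k in range(x, y))
```

For "abca" the sorted suffixes are a, abca, bca, ca, and lcp_by_rank gives 1 for ranks (1, 2) and 0 for ranks (1, 4), which agrees with comparing the suffixes "a" and "ca" directly.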

You don't have to scan all pairs of adjacent suffixes between the two suffixes for every query; a range minimum query finds the same minimum much faster.
We can calculate the LCP of any two suffixes (i, j) with the help of the following:
LCP(suffix i, suffix j) = LCP[RMQ(i + 1, j)]
Note that the formula assumes i < j, so swap the arguments if necessary.
RMQ is Range Minimum Query.
See page 3 of this paper.
Details:
Step 1:
First, calculate the LCP of adjacent (consecutive) suffix pairs.
n is the length of the string.
suffixArray[] is the suffix array.
void calculateadjacentsuffixes(int n)
{
    for (int i = 0; i < n; ++i) Rank[suffixArray[i]] = i;
    Height[0] = 0;
    for (int i = 0, h = 0; i < n; ++i)
    {
        if (Rank[i] > 0)
        {
            int j = suffixArray[Rank[i] - 1];
            while (i + h < n && j + h < n && str[i + h] == str[j + h])
            {
                h++;
            }
            Height[Rank[i]] = h;
            if (h > 0) h--;
        }
    }
}
Note: Height[i] is the LCP of the suffixes at ranks i-1 and i, i.e. the Height array contains the LCPs of adjacent suffixes.
Step 2:
Calculate the LCP of any two suffixes i, j using RMQ.
RMQ pre-compute function:
void preprocesses(int N)
{
    int i, j;
    // initialize M for the intervals with length 1
    for (i = 0; i < N; i++)
        M[i][0] = i;
    // compute values from smaller to bigger intervals
    for (j = 1; 1 << j <= N; j++)
    {
        for (i = 0; i + (1 << j) - 1 < N; i++)
        {
            if (Height[M[i][j - 1]] < Height[M[i + (1 << (j - 1))][j - 1]])
            {
                M[i][j] = M[i][j - 1];
            }
            else
            {
                M[i][j] = M[i + (1 << (j - 1))][j - 1];
            }
        }
    }
}
Step 3: Calculate LCP between any two Suffixes i,j
int LCP(int i, int j)
{
    /* Make sure we always have i < j.
       For example, LCP(5,4) is converted to LCP(4,5). */
    if (i > j)
        swap(i, j);
    if (i == j)
    {
        return (Length_of_str - suffixArray[i]);
    }
    else
    {
        return Height[RMQ(i + 1, j)];
        // LCP(suffix i, suffix j) = LCPadj[RMQ(i + 1, j)]
        // LCPadj = LCP of adjacent suffixes = Height.
    }
}
Where RMQ function is:
int RMQ(int i, int j)
{
    int k = log((double)(j - i + 1)) / log((double)2);
    int vv = j - (1 << k) + 1;
    if (Height[M[i][k]] <= Height[M[vv][k]])
        return M[i][k];
    else
        return M[vv][k];
}
Refer to the Topcoder tutorials for RMQ.
You can check the complete implementation in C++ at my blog.
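
The sparse-table RMQ used in the steps above can also be sketched compactly in Python. This is an illustrative sketch rather than a port of the blog implementation: the names build_sparse_table and range_min_index are hypothetical, and height stands for the Height array of adjacent LCPs.

```python
def build_sparse_table(height):
    """table[j][i] is the index of a minimum of height[i : i + 2**j]."""
    n = len(height)
    table = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev = table[j - 1]
        half = 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            # Combine the two half-length intervals that cover this one.
            a, b = prev[i], prev[i + half]
            row.append(a if height[a] <= height[b] else b)
        table.append(row)
        j += 1
    return table


def range_min_index(height, table, i, j):
    """Index of a minimum of height[i..j] (inclusive), assuming i <= j."""
    k = (j - i + 1).bit_length() - 1  # largest k with 2**k <= j - i + 1
    a, b = table[k][i], table[k][j - (1 << k) + 1]
    return a if height[a] <= height[b] else b
```

With 0-based ranks i < j, the LCP of the two suffixes is then height[range_min_index(height, table, i + 1, j)]. Using bit_length avoids the floating-point logarithm when computing k.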

Search an integer in a row-sorted two dim array, is there any better approach?

I recently came across this problem:
you have to find an integer in a sorted two-dimensional array, but the array is sorted within rows, not within columns. I have solved the problem, but I still think there may be a better approach, so I have come here to discuss it with all of you. Your suggestions and improvements will help me grow as a coder. Here is the code:
int searchInteger = Int32.Parse(Console.ReadLine());
int cnt = 0;
for (int i = 0; i < x; i++)
{
    if (intarry[i, 0] <= searchInteger && intarry[i, y - 1] >= searchInteger)
    {
        if (intarry[i, 0] == searchInteger || intarry[i, y - 1] == searchInteger)
            Console.WriteLine("string present {0} times", ++cnt);
        else
        {
            int[] array = new int[y];
            int y1 = 0;
            for (int k = 0; k < y; k++)
                array[k] = intarry[i, y1++];
            bool result;
            if (result = binarySearch(array, searchInteger) == true)
            {
                Console.WriteLine("string present inside {0} times", ++cnt);
                Console.ReadLine();
            }
        }
    }
}
Here searchInteger is the integer we have to find in the array, and binarySearch is the method that returns a boolean indicating whether the value is present in a one-dimensional array (a single row).
Please help: is this optimal, or are there better solutions than this?
Thanks

Provided you have declared the array intarry, x and y as follows:
int[,] intarry =
{
    {0,7,2},
    {3,4,5},
    {6,7,8}
};
var y = intarry.GetUpperBound(0) + 1;
var x = intarry.GetUpperBound(1) + 1;
// intarry.Dump();
You can keep it as simple as:
int searchInteger = Int32.Parse(Console.ReadLine());
var cnt = 0;
for (var r = 0; r < y; r++)
{
    for (var c = 0; c < x; c++)
    {
        if (intarry[r, c].Equals(searchInteger))
        {
            cnt++;
            Console.WriteLine(
                "string present at position [{0},{1}]", r, c);
        } // if
    } // for
} // for
Console.WriteLine("string present {0} times", cnt);
This example assumes that you have no information about whether the array is sorted (if you don't know that it is sorted, you have to go through every element and can't use binary search). Based on this example, you can improve the performance if you know more about how the data in the array is structured:
If the rows are sorted ascending, you can replace the inner for loop with a binary search.
If the entire array is sorted ascending and the data does not repeat, e.g.
int[,] intarry = {{0,1,2}, {3,4,5}, {6,7,8}};
then you can exit the loop as soon as the item is found. The easiest way to do this is to create a function and add a return statement to the inner for loop.
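
The first refinement (binary search within each sorted row) might look like this in Python. This is a sketch that assumes every row is sorted in ascending order; it uses the standard bisect module, and the name count_in_row_sorted is hypothetical.

```python
from bisect import bisect_left, bisect_right


def count_in_row_sorted(matrix, target):
    """Count occurrences of target in a matrix whose rows are sorted ascending."""
    count = 0
    for row in matrix:
        # Skip rows whose value range cannot contain the target.
        if row and row[0] <= target <= row[-1]:
            # The gap between the two insertion points is the number of matches.
            count += bisect_right(row, target) - bisect_left(row, target)
    return count
```

For x rows of y elements this takes O(x log y) time, compared with O(x * y) for the full scan, and it also counts duplicates within a row.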
