Stacking and dynamic programming

Basically I'm trying to solve this problem:
Given N unit cube blocks, find the smallest number of piles needed to use all the blocks. A pile is either a cube or a pyramid. For example, two valid piles are the cube 4·4·4 = 64, which uses 64 blocks, and the pyramid 1²+2²+3²+4² = 30, which uses 30 blocks.
However, I can't find the right angle to approach it. It feels similar to the knapsack problem, but I couldn't find a matching implementation.
Any help would be much appreciated!

First I will give a recurrence relation which permits solving the problem recursively. Given N, let CUBE_NUMS and PYRAMID_NUMS be the subsets of cube numbers (1, 8, 27, 64, …) and square pyramidal numbers (1, 5, 14, 30, …) in {1, ..., N}, respectively; these are the two legal pile sizes from the problem statement. Let PERMITTED_SIZES be their union. Note that, as 1 occurs in PERMITTED_SIZES, every instance is feasible and yields a nonnegative optimum.
The following function in pseudocode solves the problem recursively.
int MinimumNumberOfPiles(int N)
{
    if (N == 0) return 0; // base case: no blocks left, no piles needed
    int Result = 1 + min { MinimumNumberOfPiles(N-i) }
        where i in PERMITTED_SIZES and i not larger than N;
    return Result;
}
The idea is to choose a permitted pile size, remove that many blocks (which makes the problem instance smaller), and solve recursively for the smaller instance. To use dynamic programming and avoid evaluating the same subproblem multiple times, one would use a one-dimensional state space, namely an array A[0..N] where A[i] is the minimum number of piles needed for i unit blocks. Using this state space, the problem can be solved iteratively as follows.
for (int i = 0; i <= N; i++)
{
    if i is 0, set A[i] to 0;
    else if i occurs in PERMITTED_SIZES, set A[i] to 1;
    else set A[i] to positive infinity;
}
This initializes the states which are known beforehand and correspond to the base cases in the above recursion. Next, the missing states are filled using the following loop.
for (int i = 0; i <= N; i++)
{
    if (A[i] is positive infinity)
    {
        A[i] = 1 + min { A[i-j] : j is in PERMITTED_SIZES and j is smaller than i }
    }
}
The desired optimal value will be found in A[N]. Note that this algorithm only calculates the minimum number of piles, but not the piles themselves; if a suitable partition is needed, it has to be found either by backtracking or by maintaining additional auxiliary data structures.
In total, provided that PERMITTED_SIZES is known, the problem can be solved in O(N^2) steps, as PERMITTED_SIZES contains at most N values; in fact it contains only O(N^(1/3)) values, since both the cubes and the pyramidal numbers grow cubically, so this bound is far from tight.
The problem can be seen as an adaptation of the Rod Cutting Problem where each cube or pyramid size has value 1, every other size is forbidden, and the objective is to minimize the total value.
Finally, a small additional computation cost is necessary to generate PERMITTED_SIZES from the input.
More precisely, the corresponding choice of piles, once A is filled, can be generated using backtracking as follows.
int i = N; // i is the total amount still to be distributed
while ( i > 0 )
{
    choose j such that
        j is in PERMITTED_SIZES and j is not larger than i
        and
        A[i] = 1 + A[i-j] is minimized;
    Output "Take a pile of size" + j; // or just output j, which is the pile size
    // the part above can be read as "let's find out how
    // the value in A[i] was generated"
    set i = i-j; // decrease amount to distribute
}
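For concreteness, here is a minimal runnable Java sketch of the whole procedure (size generation, table filling, backtracking). The class and method names are my own, and it assumes, as above, that the permitted sizes are the cube numbers and the square pyramidal numbers up to N:

import java.util.*;

public class MinimumPiles {

    // PERMITTED_SIZES: cube numbers and square pyramidal numbers up to n.
    static List<Integer> permittedSizes(int n) {
        List<Integer> sizes = new ArrayList<>();
        for (int k = 1; k * k * k <= n; k++)
            sizes.add(k * k * k);                          // 1, 8, 27, ...
        for (int k = 1; k * (k + 1) * (2 * k + 1) / 6 <= n; k++)
            sizes.add(k * (k + 1) * (2 * k + 1) / 6);      // 1, 5, 14, 30, ...
        return sizes;
    }

    // A[i] = minimum number of piles needed for i unit blocks.
    static int[] fillTable(int n, List<Integer> sizes) {
        int[] a = new int[n + 1];
        Arrays.fill(a, Integer.MAX_VALUE);
        a[0] = 0;                                          // base case
        for (int i = 1; i <= n; i++)
            for (int s : sizes)
                if (s <= i && a[i - s] != Integer.MAX_VALUE)
                    a[i] = Math.min(a[i], 1 + a[i - s]);
        return a;
    }

    public static void main(String[] args) {
        int n = 94;                                        // 94 = 64 + 30
        List<Integer> sizes = permittedSizes(n);
        int[] a = fillTable(n, sizes);
        System.out.println("Minimum number of piles: " + a[n]);
        // Backtracking: recover one optimal choice of piles.
        int i = n;
        while (i > 0) {
            for (int s : sizes) {
                if (s <= i && a[i - s] != Integer.MAX_VALUE && a[i] == 1 + a[i - s]) {
                    System.out.println("Take a pile of size " + s);
                    i -= s;
                    break;
                }
            }
        }
    }
}

For N = 94 this prints 2 piles (sizes 64 and 30), matching the example from the question.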

Related

Dynamic programming algorithm to find palindromes in a directed acyclic graph

The problem is as follows: given a directed acyclic graph, where each node is labeled with a character, find all the longest paths of nodes in the graph that form a palindrome.
The initial solution that I thought of was to simply enumerate all the paths in the graph. This effectively generates a bunch of strings, on which we can then apply Manacher's algorithm to find all the longest palindromes. However, this doesn't seem that efficient, since the number of paths in a graph can be exponential in the number of nodes.
Then I started thinking of using dynamic programming directly on the graph, but my problem is that I cannot figure out how to structure my "dynamic programming array". My initial try was to use a 2D boolean array, where array[i][j] == true means that the path from node i to node j forms a palindrome, but the problem is that there might be multiple paths from i to j.
I've been stuck on this problem for quite a while now and can't seem to figure it out; any help would be appreciated.
The linear-time trick of Manacher's algorithm relies on the fact that if you know that the longest palindrome centered at character 15 has length 5 (chars 13-17), and there's a palindrome centered at character 19 of length 13 (chars 13-25), then you can skip computing the longest palindrome centered at character 23 (23 = 19 + (19 - 15)), because you know it's just going to be the mirror of the one centered at character 15.
With a DAG, you don't have that kind of guarantee because the palindromes can go off in any direction, not just forwards and backwards. However, if you have a candidate palindrome path from node m to node n, whether you can extend that string to a longer palindrome doesn't depend on the path between m and n, but only on m and n (and the graph itself).
Therefore, I'd do this:
First, sort the graph nodes topologically, so that you have an array s[] of node indexes, and there being an edge from s[i] to s[j] implies that i < j.
I'll also assume that you build up an inverse array or hash structure sinv[] such that s[sinv[j]] == j and sinv[s[n]] == n for all integers j in 0..nodeCount-1 and all node indexes n.
Also, I'll assume that you have functions graphPredecessors, graphSuccessors, and graphLetter that take a node index and return the list of predecessors on the graph, the list of successors, or the letter at that node, respectively.
Then, make a two-dimensional array of integers of size nodeCount by nodeCount called r. When r[i][j] = y, and y > 0, it will mean that if there is a palindrome path from a successor of s[i] to a predecessor of s[j], then that path can be extended by adding s[i] to the front and s[j] to the back, and that the extension can be continued by y more nodes (including s[i] and s[j]) in each direction:
for (i = 0; i < nodeCount; i++) {
    for (j = i; j < nodeCount; j++) {
        if (graphLetter(s[i]) == graphLetter(s[j])) {
            r[i][j] = 1;
            for (pred in graphPredecessors(s[i])) {
                for (succ in graphSuccessors(s[j])) {
                    /* note that by our sorting, sinv[pred] < i <= j < sinv[succ] */
                    if (r[sinv[pred]][sinv[succ]] >= r[i][j]) {
                        r[i][j] = 1 + r[sinv[pred]][sinv[succ]];
                    }
                }
            }
        } else {
            r[i][j] = 0;
        }
    }
}
Then find the maximum value of r[x][x] for x in 0..nodeCount-1, and of r[lft][rgt] where there is an edge from s[lft] to s[rgt]. Call that maximum value M, and say you found it at location [i][j]. Each such i, j pair represents the center of a longest palindrome path. As long as M is greater than 1, you then extend each center by finding a pred in graphPredecessors(s[i]) and a succ in graphSuccessors(s[j]) such that r[sinv[pred]][sinv[succ]] == M - 1 (the palindrome is now pred->s[i]->s[j]->succ). You then extend that by finding the appropriate indexes with an r value of M - 2, and so on, stopping when you reach a spot where the value in r is 1.
I think this algorithm overall ends up with a runtime of O(V^2 + E^2), but I'm not entirely certain of that.
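A compact Java transcription of the table-filling step, run on a toy three-node DAG, might look like this. The example graph, the Kahn's-algorithm topological sort, and all names are my own additions, and for brevity only odd-length centers (r[x][x]) are checked at the end:

import java.util.*;

public class DagPalindromes {
    public static void main(String[] args) {
        // Toy DAG: 0 -> 1 -> 2 with letters "aba"; the path 0->1->2 is a palindrome.
        char[] letter = {'a', 'b', 'a'};
        List<List<Integer>> succ = List.of(List.of(1), List.of(2), List.of());
        int n = letter.length;

        // Derive predecessor lists from the successor lists.
        List<List<Integer>> pred = new ArrayList<>();
        for (int v = 0; v < n; v++) pred.add(new ArrayList<>());
        for (int v = 0; v < n; v++)
            for (int w : succ.get(v)) pred.get(w).add(v);

        // Topological order s[] and inverse sinv[] via Kahn's algorithm.
        int[] indeg = new int[n], s = new int[n], sinv = new int[n];
        for (int v = 0; v < n; v++) indeg[v] = pred.get(v).size();
        Deque<Integer> queue = new ArrayDeque<>();
        for (int v = 0; v < n; v++) if (indeg[v] == 0) queue.add(v);
        for (int k = 0; !queue.isEmpty(); k++) {
            int v = queue.poll();
            s[k] = v;
            sinv[v] = k;
            for (int w : succ.get(v)) if (--indeg[w] == 0) queue.add(w);
        }

        // Fill r exactly as in the pseudocode above; rows < i are complete
        // by the time row i is processed, which is what the recurrence needs.
        int[][] r = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = i; j < n; j++) {
                if (letter[s[i]] != letter[s[j]]) continue; // r[i][j] stays 0
                r[i][j] = 1;
                for (int p : pred.get(s[i]))
                    for (int t : succ.get(s[j]))
                        r[i][j] = Math.max(r[i][j], 1 + r[sinv[p]][sinv[t]]);
            }

        // Best odd-length palindrome: max r[x][x]; the path length is 2*M - 1.
        int m = 0, center = -1;
        for (int x = 0; x < n; x++)
            if (r[x][x] > m) { m = r[x][x]; center = x; }
        System.out.println("Longest odd palindrome path has length "
                + (2 * m - 1) + ", centered at node " + s[center]);
    }
}

On the "aba" graph this prints a length of 3 centered at node 1, corresponding to the path 0->1->2.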

Counting minimum number of swaps to group characters in string

I'm trying to solve a pretty complex problem with strings:
Given is a string with up to 100000 characters, consisting of only two different characters 'L' and 'R'. A sequence 'RL' is considered "bad", and such occurrences have to be reduced by applying swaps.
However, the string is to be considered circular, so even the string 'LLLRRR' has an 'RL' sequence formed by the last 'R' and the first 'L'.
Swaps of two consecutive elements can be made. So we can swap only elements that are at positions i and i+1, or at positions 0 and n-1, where n is the length of the string (the string is 0-indexed).
The goal is to find the minimum number of swaps needed to leave only one bad connection in the string.
Example
For the string 'RLLRRL' the problem can be solved with exactly one swap: swap the first and the last characters (since the string is circular). The string will thus become 'LLLRRR' with one bad connection.
What I tried
My idea is to use dynamic programming: for any given 'L', calculate how many swaps are needed to put all other 'L's to the left of that 'L', and, alternatively, to the right of it. For every 'R' I calculate the same.
This algorithm works in O(N) time, but it doesn't give the correct result.
It doesn't work when I have to swap the first and the last elements. What should I add to my algorithm to make it work also for those swaps?
The problem can be solved in linear time.
Some observations and definitions:
The target of having only one bad connection is another way of saying that the L letters should all be grouped together, as well as the R letters (in a circular string).
Let a group denote a run of consecutive letters of the same kind that cannot be made larger (because the surrounding letters differ). By combining individual swaps you can "move" a group by one or more "steps". An example, where I write . instead of L so it is more easily readable:
RRR...RR....
There are 4 groups here: RRR, ..., RR and ..... Suppose you want to join the group of two "R" with the left-sided "R" group in the above string. Then you could "move" that middle group with 3 steps to the left by performing 6 swaps:
RRR...RR....
RRR..R.R....
RRR..RR.....
RRR.R.R.....
RRR.RR......
RRRR.R......
RRRRR.......
These 6 swaps constitute one group move. The cost of the move is 6: the product of the size of the group (2) and the distance it travels (3). Note that this move is exactly the same as moving the group of three "L" characters (cf. the dots) to the right.
I will use the word "move" in this meaning.
There is always a solution that can be expressed as a series of group moves, where each group move reduces the number of groups by two, i.e. with each such move two R groups are merged into one, and consequently two L groups are merged as well. In other words, there is always a solution in which no group has to split, with one part of it moving left and another moving right. I will not give a proof of this claim here.
There is always a solution that has one group that will not move at all: all other groups of the same letter will move towards it. As a consequence there is also a group of the opposite letter that will not move, somewhere at the other end of the circle. Again, I will not prove this here.
The problem is then equivalent to minimizing the total cost (swaps) of the moves of the groups that represent one of the two letters (so half of all the groups). The other half of the groups move at the same time as was shown in the above example.
Algorithm
An algorithm could go like this:
Create an array of integers, where each value represents the size of a group. The array would list the groups in the order they appear. This would take the circular property into account, so that the first group (with index 0) would also account for the letter(s) at the very end of the string that are the same as the first letter(s). So at even indices you would have groups that represent counts of one particular letter, and at odd indices there would be counts of the other letter. It does not really matter which of the two letters they represent. The array of groups will always have an even number of entries. This array is all we need to solve the problem.
Pick the first group (index 0), and assume it will not move. Call it the "middle group". Determine the group of the opposite colour (with odd index) that will not have to move either. Call this other group the "split group". The split group splits the remaining odd groups into two sections, where the sum of the values (counts) in each section is at most half of the total of all odd-group values. This represents the fact that it is cheaper for each even group to move in one direction than in the other in order to merge with the group at index 0.
Now determine the cost (number of swaps) for moving all even groups towards the middle group.
This may or may not be a solution, since the choice of the middle group was arbitrary.
The above would have to be repeated for the scenarios where any of the other even groups was taken as middle group.
Now the essence of the algorithm is to avoid redoing the whole operation when taking another group as the middle group. It turns out it is possible to take the next even group as middle group (at index 2) and adjust the previously calculated cost in constant time (on average) to arrive at the cost for this choice of the middle group. For this one has to keep a few parameters in memory: the cost for performing the moves in the left direction, and the cost for performing the moves in the right direction. Also the sum of even-group sizes needs to be maintained for each of both directions. And finally the sum of the odd-group sizes needs to be maintained as well for both directions. Each of these parameters can be adjusted when taking the next even group as middle group. Often the corresponding split group has to be re-identified as well, but also that can happen on average in constant time.
Without going too deep into this, here is a working implementation in simple JavaScript:
Code
function minimumSwaps(s) {
    var groups, start, n, i, minCost, halfSpace, splitAt, space,
        cost, costLeft, costRight, distLeft, distRight, itemsLeft, itemsRight;
    // 1. Get group sizes
    groups = [];
    start = 0;
    for (i = 1; i < s.length; i++) {
        if (s[i] != s[start]) {
            groups.push(i - start);
            start = i;
        }
    }
    // ... exit when the number of groups is already optimal
    if (groups.length <= 2) return 0; // zero swaps
    // ... the number of groups should be even (because of circle)
    if (groups.length % 2 == 1) { // last character not same as first
        groups.push(s.length - start);
    } else { // Ends are connected: add to the length of first group
        groups[0] += s.length - start;
    }
    n = groups.length;
    // 2. Get the parameters of the scenario where group 0 is the middle:
    //    i.e. the members of group 0 do not move in that case.
    //    Get sum of odd groups, which we consider as "space", while even
    //    groups are considered items to be moved.
    halfSpace = 0;
    for (i = 1; i < n; i += 2) {
        halfSpace += groups[i];
    }
    halfSpace /= 2;
    // Get split-point between what is "left" from the "middle"
    // and what is "right" from it:
    space = 0;
    for (i = 1; space < halfSpace; i += 2) {
        space += groups[i];
    }
    splitAt = i - 2;
    // Get sum of items, and cost, to the right of group 0
    itemsRight = distRight = costRight = 0;
    for (i = 2; i < splitAt; i += 2) {
        distRight += groups[i - 1];
        itemsRight += groups[i];
        costRight += groups[i] * distRight;
    }
    // Get sum of items, and cost, to the left of group 0
    itemsLeft = distLeft = costLeft = 0;
    for (i = n - 2; i > splitAt; i -= 2) {
        distLeft += groups[i + 1];
        itemsLeft += groups[i];
        costLeft += groups[i] * distLeft;
    }
    cost = costLeft + costRight;
    minCost = cost;
    // 3. Translate the cost parameters by incremental changes for
    //    where the mid-point is set to the next even group
    for (i = 2; i < n; i += 2) {
        distLeft += groups[i - 1];
        itemsLeft += groups[i - 2];
        costLeft += itemsLeft * groups[i - 1];
        costRight -= itemsRight * groups[i - 1];
        itemsRight -= groups[i];
        distRight -= groups[i - 1];
        // See if we need to change the split point. Items that end up
        // on the other side of the split point represent items that
        // have a shorter route via the other half of the circle.
        while (distLeft >= halfSpace) {
            costLeft -= groups[(splitAt + 1) % n] * distLeft;
            distLeft -= groups[(splitAt + 2) % n];
            itemsLeft -= groups[(splitAt + 1) % n];
            itemsRight += groups[(splitAt + 1) % n];
            distRight += groups[splitAt];
            costRight += groups[(splitAt + 1) % n] * distRight;
            splitAt = (splitAt + 2) % n;
        }
        cost = costLeft + costRight;
        if (cost < minCost) minCost = cost;
    }
    return minCost;
}

function validate(s) {
    return new Set(s).size <= 2; // maximum 2 different letters used
}

// I/O
inp.oninput = function () {
    var s, result, start;
    s = inp.value;
    start = performance.now(); // get timing
    if (validate(s)) {
        result = minimumSwaps(s); // apply algorithm
    } else {
        result = 'Please use only 2 different characters';
    }
    outp.textContent = result;
    ms.textContent = Math.round(performance.now() - start);
};

rnd.onclick = function () {
    inp.value = Array.from(Array(100000), _ =>
        Math.random() < 0.5 ? "L" : "R").join('');
    if (inp.value.length != 100000) alert('Your browser truncated the input!');
    inp.oninput(); // trigger input handler
};

inp.oninput(); // trigger input handler
input { width: 100% }
<p>
<b>Enter LR series:</b>
<input id="inp" value="RLLRRL"><br>
<button id="rnd">Produce random of size 100000</button>
</p><p>
<b>Number of swaps: </b><span id="outp"></span><br>
<b>Time used: </b><span id="ms"></span>ms
</p>
Time Complexity
The preprocessing (creating the groups array, etc.), and the calculation of the cost for when the first group is the middle group, all consist of non-nested loops with at most n iterations, so this part is O(n).
The calculation of the cost for when the middle group is any of the other even groups consists of an outer loop (for choosing the middle group) and an inner loop for adjusting the choice of the split group. Even though this inner loop may iterate multiple times for one iteration of the outer loop, in total the inner loop will not iterate more than n times, so the total execution time of this outer loop is still O(n).
Therefore the time complexity is O(n).
Note that the result for a string of 100 000 characters is calculated in a fraction of a second (see the number of milliseconds displayed by the above snippet).
The task is to re-order the items in a circular list like this:
LRLLLRLLLRRLLRLLLRRLRLLLRLLRLRLRLRRLLLRRRLRLLRLLRL
so that we get a list like this:
RRRRRRLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRR
or this:
LLLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRRRRRLLLLL
where the two types of items are grouped together, but the exact position of these two groups is not important.
The first task is to count the number of items in each group, so we iterate over the list once, and for the example above the result would be:
#L = 30
#R = 20
Then, the simple brute-force solution would be to consider every position in the list as the start of the L-zone, starting with position 0, iterate over the whole list and count how many steps each item is away from the border of the zone where it should be:
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRRRRR <- desired output
LRLLLRLLLRRLLRLLLRRLRLLLRLLRLRLRLRRLLLRRRLRLLRLLRL <- input
< < << < >> > > > >< < <<< > >> >> > <- direction to move
We would then consider the L-zone to start at position 1, and do the whole calculation again:
RLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRRRR <- desired output
LRLLLRLLLRRLLRLLLRRLRLLLRLLRLRLRLRRLLLRRRLRLLRLLRL <- input
<< < << < >> > > > > < <<< > >> >> > <- direction to move
After calculating the total number of steps for every position of the L-zone, we would know which position requires the least number of steps. This is of course a method with N² complexity.
If we could calculate the number of required steps with the L-zone at position X, based on the calculation for the L-zone at position X-1 (without iterating over the whole list again), this could bring the complexity down to N.
To do so, we'd need to keep track of the number of wrong items in each half of each zone, and the total number of steps for the wrong items in each of these four half-zones:
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRRRRR <- desired output
<<<<<<<<<<<<<<<>>>>>>>>>>>>>>><<<<<<<<<<>>>>>>>>>> <- half-zones
LRLLLRLLLRRLLRLLLRRLRLLLRLLRLRLRLRRLLLRRLRRLLRLLRL <- input
< < << < >> > > > >< < <<< > >> >> > <- direction to move
5 6 5 6 <- wrong items
43 45 25 31 <- required steps
When we move right to the next position, the total number of steps in left-moving zones will decrease by the number of wrong items in that zone, and the total number of steps in right-moving zones will increase by the number of wrong items in that zone (because every item is now one step closer to, or further from, the edge of the zone).
5 6 5 6 <- wrong items
38 51 20 37 <- required steps
However, we need to check the four border-points to see whether any wrong items have moved from one half-zone to another, and adjust the item and step count accordingly.
In the example, the L that was the first item of the L-zone has now become the last item in the R-zone, so we increment the R> half-zone's item and step count to 7 and 38.
Also, the L that was the first item in the R-zone, has become the last item of the L zone, so we decrement the item count for the R< half-zone to 4.
Also, the L in the middle of the R-zone has moved from the R> to the R< half-zone, so we decrement and increment the item counts for R> and R< to 6 and 5, and decrease and increase the step counts with 10 (the length of the R> and R< half-zones) to 28 and 30.
RLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRRRR <- desired output
><<<<<<<<<<<<<<<>>>>>>>>>>>>>>><<<<<<<<<<>>>>>>>>> <- half-zones
LRLLLRLLLRRLLRLLLRRLRLLLRLLRLRLRLRRLLLRRLRRLLRLLRL <- input
>< < << < >> > > > > < <<< < >> >> > <- direction to move
5 6 5 6 <- wrong items
38 51 30 28 <- required steps
So the total number of required steps when the L-zone starts at position 0 was 144, and we have calculated that when the L-zone starts at position 1, the total is now 147, by looking at what happens at four positions in the list, instead of having to iterate over the whole list again.
UPDATE
While thinking of how to implement this, I realised that the number of wrong items moving right in a zone must be the same as the number of wrong items moving left in the other zone; otherwise the border between the zones ends up in the wrong place. This means that the L and R zones aren't split into two half-zones of equal length, and the "mid" point in a zone moves according to how many wrong items are to the left and right of it. I still think it's possible to turn this into working code with O(N) efficiency, but it's probably not as straightforward as I first described it.
O(n) time solution:
L L R L L R R R L L R R L R
Number of R's to the next group of L's to the left:
1 1 1 1 3 3 2
NumRsToLeft: [1, 1, 3, 2]
Number of swaps needed, where 0 indicates the static L group, and | represents
the point to the right of which L's move right, wrapping only when not at the end
(enough L's must move to their right to replace any R's left of the static group):
2*0 + 2*1 + 2*(3+1) + 1*(2+3+1) |
2*1 + 2*0 + 2*3 | + 1*(1+1)
There are not enough L's to place the static group in the third or fourth position.
Variables: 0 1 4 6 |
1 0 3 | 2
Function: 2*v_1 + 2*v_2 + 2*v_3 + 1*v_4
Coefficients (group sizes): [2, 2, 2, 1]
Change in the total swaps needed when moving the static L group from i to (i+1):
Subtract: PSum(CoefficientsToBeGoingLeft) * NumRsToLeft[i+1]
Subtract: c_j * PSum(NumRsToLeft[i+1...j]) for c_j <- CoefficientsNoLongerGoingLeft
Add: (PSum(CoefficientsAlreadyGoingRight) + Coefficients[i]) * NumRsToLeft[i+1]
Add: c_j * PSum(NumRsToLeft[j+1...i+1]) for c_j <- NewCoefficientsGoingRight
(PSum can be calculated in O(1) time with prefix sums; and the count of coefficients
converting from a left move to a right move throughout the whole calculation is not
more than n. This outline does not include the potential splitting of the last new
group converting from left move to right move.)

Time Complexity of Dependent Nested Loop

Hi, I've been trying for a while now to work out the time complexity of this nested loop.
int i = 1;
while (i < n) {
    int j = 0;
    while (j < n / i) {
        j++;
    }
    i = 2 * i;
}
Based on the couple of calculations I've done I think its Big O notation is O(log(n)), but I'm not sure if that is correct. I've tried looking for some examples where the inner loop speeds up at this rate, but I couldn't find anything.
Thanks
One piece of information that surprisingly few people use when calculating complexity is: the sum of a series of terms equals the average term multiplied by the number of terms. In other words, you can replace a varying term by its average and get the same result.
So, your outer while loop repeats O(log n) times. The inner while loop repeats n, n/2, n/4, n/8, ..., 1 times, depending on which step of the outer loop we are in. But (n, n/2, n/4, ..., 1) is a geometric progression with log(n) terms and ratio 1/2, whose sum is n·(1 - 1/n)/(1/2) = 2n - 2 ∈ O(n). Its average, therefore, is O(n/log(n)). Since the inner loop runs O(log(n)) times, the whole complexity is O(log(n) · n/log(n)) = O(n).
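As a quick empirical check of that bound (my own snippet, not part of the original answer), counting the inner-loop iterations shows the total staying close to 2n:

public class LoopCount {
    public static void main(String[] args) {
        for (int n : new int[]{1000, 10000, 100000, 1000000}) {
            long count = 0;
            int i = 1;
            while (i < n) {
                int j = 0;
                while (j < n / i) {
                    j++;
                    count++;          // one inner-loop iteration
                }
                i = 2 * i;
            }
            System.out.println(n + " -> " + count + " inner iterations (about 2n)");
        }
    }
}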

Algorithm to solve Local Alignment

Local alignment between X and Y, with at least one column aligning a C
to a W.
Given two sequences X of length n and Y of length m, we
are looking for a highest-scoring local alignment (i.e., an alignment
between a substring X' of X and a substring Y' of Y) that has at least
one column in which a C from X' is aligned to a W from Y' (if such an
alignment exists). As scoring model, we use a substitution matrix s
and linear gap penalties with parameter d.
Write a code in order to solve the problem efficiently. If you use dynamic
programming, it suffices to give the equations for computing the
entries in the dynamic programming matrices, and to specify where
traceback starts and ends.
My Solution:
I've taken two sequences, namely "HCEA" and "HWEA", and tried to solve the question.
Here is my code. Have I fulfilled what is asked in the question? If I am wrong, kindly tell me where I've gone wrong so that I can modify my code.
Also, is there any other way to solve the question? If so, can anyone post pseudocode or an algorithm so that I'll be able to code it?
public class Q1 {
    public static void main(String[] args) {
        // Input protein sequences
        String seq1 = "HCEA";
        String seq2 = "HWEA";
        // Array to store the score
        int[][] T = new int[seq1.length() + 1][seq2.length() + 1];
        // Initialize seq1
        for (int i = 0; i <= seq1.length(); i++) {
            T[i][0] = i;
        }
        // Initialize seq2
        for (int i = 0; i <= seq2.length(); i++) {
            T[0][i] = i;
        }
        // Compute the matrix score
        for (int i = 1; i <= seq1.length(); i++) {
            for (int j = 1; j <= seq2.length(); j++) {
                if ((seq1.charAt(i - 1) == seq2.charAt(j - 1))
                        || (seq1.charAt(i - 1) == 'C') && (seq2.charAt(j - 1) == 'W')) {
                    T[i][j] = T[i - 1][j - 1];
                } else {
                    T[i][j] = Math.min(T[i - 1][j], T[i][j - 1]) + 1;
                }
            }
        }
        // Strings to store the aligned sequences
        StringBuilder alignedSeq1 = new StringBuilder();
        StringBuilder alignedSeq2 = new StringBuilder();
        // Build sequences 1 & 2 from the matrix score
        for (int i = seq1.length(), j = seq2.length(); i > 0 || j > 0;) {
            if (i > 0 && T[i][j] == T[i - 1][j] + 1) {
                alignedSeq1.append(seq1.charAt(--i));
                alignedSeq2.append("-");
            } else if (j > 0 && T[i][j] == T[i][j - 1] + 1) {
                alignedSeq2.append(seq2.charAt(--j));
                alignedSeq1.append("-");
            } else if (i > 0 && j > 0 && T[i][j] == T[i - 1][j - 1]) {
                alignedSeq1.append(seq1.charAt(--i));
                alignedSeq2.append(seq2.charAt(--j));
            }
        }
        // Display the aligned sequences
        System.out.println(alignedSeq1.reverse().toString());
        System.out.println(alignedSeq2.reverse().toString());
    }
}
@Shole
The following are the two questions and answers provided in my solved worksheet.
Aligning a suffix of X to a prefix of Y
Given two sequences X and Y, we are looking for a highest-scoring alignment between any suffix of X and any prefix of Y. As a scoring model, we use a substitution matrix s and linear gap penalties with parameter d.
Give an efficient algorithm to solve this problem optimally in time O(nm), where n is the length of X and m is the length of Y. If you use a dynamic programming approach, it suffices to give the equations that are needed to compute the dynamic programming matrix, to explain what information is stored for the traceback, and to state where the traceback starts and ends.
Solution:
Let X_i be the prefix of X of length i, and let Y_j denote the prefix of Y of length j. We compute a matrix F such that F[i][j] is the best score of an alignment of any suffix of X_i and the string Y_j. We also compute a traceback matrix P. The computation of F and P can be done in O(nm) time using the following equations:
F[0][0]=0
for i = 1..n: F[i][0]=0
for j = 1..m: F[0][j]=-j*d, P[0][j]=L
for i = 1..n, j = 1..m:
F[i][j] = max{ F[i-1][j-1]+s(X[i-1],Y[j-1]), F[i-1][j]-d, F[i][j-1]-d }
P[i][j] = D, T or L according to which of the three expressions above is the maximum
Once we have computed F and P, we find the largest value in the bottom row of the matrix F. Let F[n][j0] be that largest value. We start traceback at F[n][j0] and continue traceback until we hit the first column of the matrix. The alignment constructed in this way is the solution.
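A direct Java transcription of these equations might look as follows; the substitution matrix is replaced here by a simple +1/-1 match/mismatch score, which is an assumption made only for the sketch:

public class SuffixPrefixAlignment {
    // Stand-in for the substitution matrix s (an assumption for this sketch).
    static int s(char a, char b) { return a == b ? 1 : -1; }

    // Best score of an alignment of any suffix of X with any prefix of Y,
    // with linear gap penalty d, following the worksheet equations.
    static int bestScore(String x, String y, int d) {
        int n = x.length(), m = y.length();
        int[][] f = new int[n + 1][m + 1];
        for (int j = 1; j <= m; j++) f[0][j] = -j * d; // prefix of Y fully aligned
        // f[i][0] = 0 for all i: the suffix of X may start anywhere at no cost.
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
                f[i][j] = Math.max(f[i - 1][j - 1] + s(x.charAt(i - 1), y.charAt(j - 1)),
                          Math.max(f[i - 1][j] - d, f[i][j - 1] - d));
        int best = f[n][0]; // traceback would start at the best bottom-row cell
        for (int j = 1; j <= m; j++) best = Math.max(best, f[n][j]);
        return best;
    }

    public static void main(String[] args) {
        System.out.println(bestScore("HCEA", "HWEA", 1)); // sequences from the question
    }
}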
Aligning Y to a substring of X, without gaps in Y
Given a string X of length n and a string Y of length m, we want to compute a highest-scoring alignment of Y to any substring of X, with the extra constraint that we are not allowed to insert any gaps into Y. In other words, the output is an alignment of a substring X' of X with the string Y, such that the score of the alignment is the largest possible (among all choices of X') and such that the alignment does not introduce any gaps into Y (but may introduce gaps into X'). As a scoring model, we use again a substitution matrix s and linear gap penalties with parameter d.
Give an efficient dynamic programming algorithm that solves this problem optimally in polynomial time. It suffices to give the equations that are needed to compute the dynamic programming matrix, to explain what information is stored for the traceback, and to state where the traceback starts and ends. What is the running-time of your algorithm?
Solution:
Let X_i be the prefix of X of length i, and let Y_j denote the prefix of Y of length j. We compute a matrix F such that F[i][j] is the best score of an alignment of any suffix of X_i and the string Y_j, such that the alignment does not insert gaps in Y. We also compute a traceback matrix P. The computation of F and P can be done in O(nm) time using the following equations:
F[0][0]=0
for i = 1..n: F[i][0]=0
for j = 1..m: F[0][j]=-j*d, P[0][j]=L
for i = 1..n, j = 1..m:
F[i][j] = max{ F[i-1][j-1]+s(X[i-1],Y[j-1]), F[i][j-1]-d }
P[i][j] = D or L according to which of the two expressions above is the maximum
Once we have computed F and P, we find the largest value in the rightmost column of the matrix F. Let F[i0][m] be that largest value. We start traceback at F[i0][m] and continue traceback until we hit the first column of the matrix. The alignment constructed in this way is the solution.
Hope this gives some idea about what I really need.
I think it's quite easy to find resources, or even the answer, by googling: the first search result is already a thorough DP solution.
However, I appreciate that you would like to think over the solution by yourself and are requesting some hints.
Before I give out some of the hints, I would like to say something about designing a DP solution
(I assume you know this can be solved by a DP solution)
A DP solution basically consists of four parts:
1. The DP state: you have to define the physical meaning of one state, e.g.:
a[i] := the money the i-th person has;
a[i][j] := the number of TV programmes between time i and time j; etc.
2. The transition equations
3. The initial state / base case
4. How to query the answer, e.g.: is the answer a[n]? Or is it max(a[i])?
Just my two cents on DP solutions; let's go back to the question :)
Here are some hints I am able to think of:
What is the DP state? How many dimensions are enough to define such a state?
Considering that you are solving a problem much like the common substring problem (on 2 strings), 1 dimension seems too few and 3 dimensions seem too many, right?
As mentioned in point 1, this problem is very similar to the common substring problem; maybe you should have a look at those problems to get some ideas?
LCS, LIS, Edit Distance, etc.
Supplement part: not directly related to the OP
DP is easy to learn, but hard to master. I know very little about it myself, so I really cannot share much. I think "Introduction to Algorithms" is a quite standard book to start with, and you can find many resources, especially ppt/pdf tutorials from colleges and universities, with basic examples of DP. (Learning these examples is useful, as I'll explain below.)
A problem can be solved by many different DP solutions, some of which are much better (lower time / space complexity) thanks to a well-defined DP state.
So how do you design a better DP state, or even get the sense that a problem can be solved by DP at all? I would say it's a matter of experience and knowledge. There is a set of "well-known" DP problems, and many other DP problems can be solved by modifying one of them a bit. Here is a post I recently had accepted about another DP problem; as stated in that post, that problem is very similar to the "well-known" problem named "matrix chain multiplication". You cannot do much about the "experience" part, as there is no express lane for it, but you can work on the "knowledge" part by studying these standard DP problems first.
Lastly, let's go back to your original question to illustrate my point of view:
As I knew the LCS problem before, I have a sense that for a similar problem I may be able to solve it by designing a similar DP state and transition equation. The state s(i,j) := the optimal cost for A(1..i) and B(1..j), given two strings A and B.
What is "optimal" depends on the question, and how to achieve this "optimal" value in each state is done by the transition equation.
With this state defined, it's easy to see the final answer I would like to query is simply s(len(A), len(B)).
Base case? s(0,0) = 0! We can't really do much with two empty strings, right?
So with the knowledge I have, I can form a rough thought process around the 4 main components of designing a DP solution. I know it's a bit long, but I hope it helps; cheers.
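To make those four components concrete, here is a minimal LCS sketch (my own illustration, reusing the two sequences from the question):

public class Lcs {
    // State: s[i][j] := length of the longest common subsequence
    // of A(1..i) and B(1..j).
    static int lcs(String a, String b) {
        int[][] s = new int[a.length() + 1][b.length() + 1];
        // Base case: s(0, *) = s(*, 0) = 0, which Java arrays give us for free.
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                // Transition equation:
                s[i][j] = (a.charAt(i - 1) == b.charAt(j - 1))
                        ? s[i - 1][j - 1] + 1
                        : Math.max(s[i - 1][j], s[i][j - 1]);
        return s[a.length()][b.length()]; // Query: the answer is s(len(A), len(B)).
    }

    public static void main(String[] args) {
        System.out.println(lcs("HCEA", "HWEA")); // prints 3 ("HEA")
    }
}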

What is an efficient way to compute the Dice coefficient between 900,000 strings?

I have a corpus of 900,000 strings. They vary in length, but have an average character count of about 4,500. I need to find the most efficient way of computing the Dice coefficient of every string as it relates to every other string. Unfortunately, this results in the Dice coefficient algorithm being used some 810,000,000,000 times.
What is the best way to structure this program for increased efficiency? Obviously, I can avoid computing the Dice coefficient of sections A and B and then of B and A, but this only halves the work required. Should I consider taking some shortcuts or creating some sort of binary tree?
I'm using the following implementation of the Dice coefficient algorithm in Java:
public static double diceCoefficient(String s1, String s2) {
    Set<String> nx = new HashSet<String>();
    Set<String> ny = new HashSet<String>();
    for (int i = 0; i < s1.length() - 1; i++) {
        char x1 = s1.charAt(i);
        char x2 = s1.charAt(i + 1);
        String tmp = "" + x1 + x2;
        nx.add(tmp);
    }
    for (int j = 0; j < s2.length() - 1; j++) {
        char y1 = s2.charAt(j);
        char y2 = s2.charAt(j + 1);
        String tmp = "" + y1 + y2;
        ny.add(tmp);
    }
    Set<String> intersection = new HashSet<String>(nx);
    intersection.retainAll(ny);
    double totcombigrams = intersection.size();
    return (2 * totcombigrams) / (nx.size() + ny.size());
}
My ultimate goal is to output an ID for every section that has a Dice coefficient of greater than 0.9 with another section.
Thanks for any advice that you can provide!
Make a single pass over all the Strings, and build up a HashMap which maps each bigram to a set of the indexes of the Strings which contain that bigram. (Currently you are building the bigram set 900,000 times, redundantly, for each String.)
Then make a pass over all the sets, and build a HashMap of [index,index] pairs to common-bigram counts. (The latter Map should not contain redundant pairs of keys, like [1,2] and [2,1] -- just store one or the other.)
Both of these steps can easily be parallelized. If you need some sample code, please let me know.
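A rough single-threaded sketch of those two passes might look like this; the toy corpus and all names are mine, and the index pairs are packed into a single long key for brevity:

import java.util.*;

public class BigramIndex {
    public static void main(String[] args) {
        List<String> strings = List.of("night", "nacht", "nicht"); // toy corpus

        // Pass 1: map each bigram to the set of string indexes containing it,
        // building each string's bigram set exactly once.
        Map<String, Set<Integer>> index = new HashMap<>();
        int[] setSizes = new int[strings.size()];
        for (int id = 0; id < strings.size(); id++) {
            String s = strings.get(id);
            Set<String> bigrams = new HashSet<>();
            for (int i = 0; i + 1 < s.length(); i++)
                bigrams.add(s.substring(i, i + 2));
            setSizes[id] = bigrams.size();
            for (String b : bigrams)
                index.computeIfAbsent(b, k -> new HashSet<>()).add(id);
        }

        // Pass 2: count common bigrams per (i, j) pair with i < j,
        // so no redundant [j, i] keys are stored.
        Map<Long, Integer> common = new HashMap<>();
        for (Set<Integer> ids : index.values()) {
            List<Integer> sorted = new ArrayList<>(ids);
            Collections.sort(sorted);
            for (int x = 0; x < sorted.size(); x++)
                for (int y = x + 1; y < sorted.size(); y++) {
                    long key = ((long) sorted.get(x) << 32) | sorted.get(y);
                    common.merge(key, 1, Integer::sum);
                }
        }

        // Dice coefficient per pair, from the shared-bigram counts.
        common.forEach((key, shared) -> {
            int i = (int) (key >> 32), j = (int) (long) key;
            double dice = 2.0 * shared / (setSizes[i] + setSizes[j]);
            System.out.printf("(%s, %s): %.3f%n", strings.get(i), strings.get(j), dice);
        });
    }
}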
NOTE one thing, though: from the 26 letters of the English alphabet, a total of 26x26 = 676 bigrams can be formed. Many of these will never or almost never be found, because they don't conform to the rules of English spelling. Since you are building up sets of bigrams for each String, and the Strings are so long, you will probably find almost the same bigrams in each String. If you were to build up lists of bigrams for each String (in other words, if the frequency of each bigram counted), it's more likely that you would actually be able to measure the degree of similarity between Strings, but then the calculation of Dice's coefficient as given in the Wikipedia article wouldn't work; you'd have to find a new formula.
I suggest you continue researching algorithms for determining similarity between Strings, try implementing a few of them, and run them on a smaller set of Strings to see how well they work.
You should come up with some kind of triangle-like inequality, e.g.: if D(X1,X2) > 1-p and D(X1,X3) < 1-q, then D(X2,X3) < 1-q+p, or something like that. Now, if 1-q+p < 0.9, then you probably don't have to evaluate D(X2,X3).
PS: I am not sure about this exact inequality, but I have a gut feeling that something like it might be right (I do not have enough time to do the derivations now). Look for inequalities involving other similarity measures and see whether any of them are valid for the Dice coefficient.
=== Also ===
If there are a elements in set A, and your threshold is r (= 0.9), then the number of elements b in set B must satisfy r·a/(2-r) <= b <= (2-r)·a/r; otherwise the coefficient cannot reach the threshold. For example, with a = 100 and r = 0.9, only sets with 82 <= b <= 122 elements need to be compared. This should eliminate the need for lots of comparisons, IMHO. You can probably sort the strings according to length and use the window described above to limit comparisons.
Disclaimer first: This will not reduce the number of comparisons you'll have to make. But this should make a Dice comparison faster.
1) Don't build your HashSets every time you do a diceCoefficient() call! It should speed things up considerably if you just build them once for each string and keep the results around.
2) Since you only care about whether a particular bigram is present in the string, you can get away with a BitSet with one bit for each possible bigram, rather than a full HashSet. The coefficient calculation then simplifies to ANDing two bit sets and counting the number of set bits in the result.
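As a sketch of that idea (assuming, purely for illustration, that the characters fit in 8 bits, so there are at most 256·256 possible bigrams):

import java.util.BitSet;

public class BigramBits {
    // One bit per possible bigram; the two chars are folded into 8 bits
    // each, which only works for a limited charset (an assumption).
    static BitSet bigramBits(String s) {
        BitSet bits = new BitSet(256 * 256);
        for (int i = 0; i + 1 < s.length(); i++)
            bits.set(((s.charAt(i) & 0xFF) << 8) | (s.charAt(i + 1) & 0xFF));
        return bits;
    }

    static double dice(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone();
        and.and(b); // shared bigrams
        return 2.0 * and.cardinality() / (a.cardinality() + b.cardinality());
    }

    public static void main(String[] args) {
        System.out.println(dice(bigramBits("night"), bigramBits("nacht"))); // 0.25
    }
}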
3) Or, if you have a huge number of possible bigrams (Unicode, perhaps?), or monotonous strings with only a handful of bigrams each, a sorted array of bigrams might provide faster, more space-efficient comparisons.
Is the charset limited somehow? If it is, you can compute character counts by character code in each string and compare these counts. Such a pre-computation will occupy 2·900K·S bytes of memory (assuming no character occurs more than 65K times in the same string, so a 16-bit counter suffices), where S is the number of distinct characters. Computing the coefficient would then take O(S) time. Sure, this is only helpful if S < 4500.
