Algorithm to solve Local Alignment

Algorithm to solve Local Alignment - string

Local alignment between X and Y, with at least one column aligning a C
to a W.
Given two sequences X of length n and Y of length m, we
are looking for a highest-scoring local alignment (i.e., an alignment
between a substring X' of X and a substring Y' of Y) that has at least
one column in which a C from X' is aligned to a W from Y' (if such an
alignment exists). As scoring model, we use a substitution matrix s
and linear gap penalties with parameter d.
Write a code in order to solve the problem efficiently. If you use dynamic
programming, it suffices to give the equations for computing the
entries in the dynamic programming matrices, and to specify where
traceback starts and ends.
My Solution:
I've taken 2 sequences namely, "HCEA" and "HWEA" and tried to solve the question.
Here is my code. Have I fulfilled what is asked in the question? If am wrong kindly tell me where I've gone wrong so that I will modify my code.
Also is there any other way to solve the question? If its available can anyone post a pseudo code or algorithm, so that I'll be able to code for it.
public class Q1 {
public static void main(String[] args) {
// Input Protein Sequences
String seq1 = "HCEA";
String seq2 = "HWEA";
// Array to store the score
int[][] T = new int[seq1.length() + 1][seq2.length() + 1];
// initialize seq1
for (int i = 0; i <= seq1.length(); i++) {
T[i][0] = i;
}
// Initialize seq2
for (int i = 0; i <= seq2.length(); i++) {
T[0][i] = i;
}
// Compute the matrix score
for (int i = 1; i <= seq1.length(); i++) {
for (int j = 1; j <= seq2.length(); j++) {
if ((seq1.charAt(i - 1) == seq2.charAt(j - 1))
|| (seq1.charAt(i - 1) == 'C') && (seq2.charAt(j - 1) == 'W')) {
T[i][j] = T[i - 1][j - 1];
} else {
T[i][j] = Math.min(T[i - 1][j], T[i][j - 1]) + 1;
}
}
}
// Strings to store the aligned sequences
StringBuilder alignedSeq1 = new StringBuilder();
StringBuilder alignedSeq2 = new StringBuilder();
// Build for sequences 1 & 2 from the matrix score
for (int i = seq1.length(), j = seq2.length(); i > 0 || j > 0;) {
if (i > 0 && T[i][j] == T[i - 1][j] + 1) {
alignedSeq1.append(seq1.charAt(--i));
alignedSeq2.append("-");
} else if (j > 0 && T[i][j] == T[i][j - 1] + 1) {
alignedSeq2.append(seq2.charAt(--j));
alignedSeq1.append("-");
} else if (i > 0 && j > 0 && T[i][j] == T[i - 1][j - 1]) {
alignedSeq1.append(seq1.charAt(--i));
alignedSeq2.append(seq2.charAt(--j));
}
}
// Display the aligned sequence
System.out.println(alignedSeq1.reverse().toString());
System.out.println(alignedSeq2.reverse().toString());
}
}
#Shole
The following are the two question and answers provided in my solved worksheet.
Aligning a suffix of X to a prefix of Y
Given two sequences X and Y, we are looking for a highest-scoring alignment between any suffix of X and any prefix of Y. As a scoring model, we use a substitution matrix s and linear gap penalties with parameter d.
Give an efficient algorithm to solve this problem optimally in time O(nm), where n is the length of X and m is the length of Y. If you use a dynamic programming approach, it suffices to give the equations that are needed to compute the dynamic programming matrix, to explain what information is stored for the traceback, and to state where the traceback starts and ends.
Solution:
Let X_i be the prefix of X of length i, and let Y_j denote the prefix of Y of length j. We compute a matrix F such that F[i][j] is the best score of an alignment of any suffix of X_i and the string Y_j. We also compute a traceback matrix P. The computation of F and P can be done in O(nm) time using the following equations:
F[0][0]=0
for i = 1..n: F[i][0]=0
for j = 1..m: F[0][j]=-j*d, P[0][j]=L
for i = 1..n, j = 1..m:
F[i][j] = max{ F[i-1][j-1]+s(X[i-1],Y[j-1]), F[i-1][j]-d, F[i][j-1]-d }
P[i][j] = D, T or L according to which of the three expressions above is the maximum
Once we have computed F and P, we find the largest value in the bottom row of the matrix F. Let F[n][j0] be that largest value. We start traceback at F[n][j0] and continue traceback until we hit the first column of the matrix. The alignment constructed in this way is the solution.
Aligning Y to a substring of X, without gaps in Y
Given a string X of length n and a string Y of length m, we want to compute a highest-scoring alignment of Y to any substring of X, with the extra constraint that we are not allowed to insert any gaps into Y. In other words, the output is an alignment of a substring X' of X with the string Y, such that the score of the alignment is the largest possible (among all choices of X') and such that the alignment does not introduce any gaps into Y (but may introduce gaps into X'). As a scoring model, we use again a substitution matrix s and linear gap penalties with parameter d.
Give an efficient dynamic programming algorithm that solves this problem optimally in polynomial time. It suffices to give the equations that are needed to compute the dynamic programming matrix, to explain what information is stored for the traceback, and to state where the traceback starts and ends. What is the running-time of your algorithm?
Solution:
Let X_i be the prefix of X of length i, and let Y_j denote the prefix of Y of length j. We compute a matrix F such that F[i][j] is the best score of an alignment of any suffix of X_i and the string Y_j, such that the alignment does not insert gaps in Y. We also compute a traceback matrix P. The computation of F and P can be done in O(nm) time using the following equations:
F[0][0]=0
for i = 1..n: F[i][0]=0
for j = 1..m: F[0][j]=-j*d, P[0][j]=L
for i = 1..n, j = 1..m:
F[i][j] = max{ F[i-1][j-1]+s(X[i-1],Y[j-1]), F[i][j-1]-d }
P[i][j] = D or L according to which of the two expressions above is the maximum
Once we have computed F and P, we find the largest value in the rightmost column of the matrix F. Let F[i0][m] be that largest value. We start traceback at F[i0][m] and continue traceback until we hit the first column of the matrix. The alignment constructed in this way is the solution.
Hope you get some idea about wot i really need.

I think it's quite easy to find resources or even the answer by google...as the first result of the searching is already a thorough DP solution.
However, I appreciate that you would like to think over the solution by yourself and are requesting some hints.
Before I give out some of the hints, I would like to say something about designing a DP solution
(I assume you know this can be solved by a DP solution)
A dp solution basically consisting of four parts:
1. DP state, you have to self define the physical meaning of one state, eg:
a[i] := the money the i-th person have;
a[i][j] := the number of TV programmes between time i and time j; etc
2. Transition equations
3. Initial state / base case
4. how to query the answer, eg: is the answer a[n]? or is the answer max(a[i])?
Just some 2 cents on a DP solution, let's go back to the question :)
Here's are some hints I am able to think of:
What is the dp state? How many dimensions are enough to define such a state?
Thinking of you are solving problems much alike to common substring problem (on 2 strings),
1-dimension seems too little and 3-dimensions seems too many right?
As mentioned in point 1, this problem is very similar to common substring problem, maybe you should have a look on these problems to get yourself some idea?
LCS, LIS, Edit Distance, etc.
Supplement part: not directly related to the OP
DP is easy to learn, but hard to master. I know a very little about it, really cannot share much. I think "Introduction to algorithm" is a quite standard book to start with, you can find many resources, especially some ppt/ pdf tutorials of some colleges / universities to learn some basic examples of DP.(Learn these examples is useful and I'll explain below)
A problem can be solved by many different DP solutions, some of them are much better (less time / space complexity) due to a well-defined DP state.
So how to design a better DP state or even get the sense that one problem can be solved by DP? I would say it's a matter of experiences and knowledge. There are a set of "well-known" DP problems which I would say many other DP problems can be solved by modifying a bit of them. Here is a post I just got accepted about another DP problem, as stated in that post, that problem is very similar to a "well-known" problem named "matrix chain multiplication". So, you cannot do much about the "experience" part as it has no express way, yet you can work on the "knowledge" part by studying these standard DP problems first maybe?
Lastly, let's go back to your original question to illustrate my point of view:
As I knew LCS problem before, I have a sense that for similar problem, I may be able to solve it by designing similar DP state and transition equation? The state s(i,j):= The optimal cost for A(1..i) and B(1..j), given two strings A & B
What is "optimal" depends on the question, and how to achieve this "optimal" value in each state is done by the transition equation.
With this state defined, it's easy to see the final answer I would like to query is simply s(len(A), len(B)).
Base case? s(0,0) = 0 ! We can't really do much on two empty string right?
So with the knowledge I got, I have a rough thought on the 4 main components of designing a DP solution. I know it's a bit long but I hope it helps, cheers.

Related

Edit distance, with a twist

I'm trying to solve something using dynamic programming, but I'm having some trouble. When I work on dynamic programming, I usually determine a recursive algorithm then go from there to my dynamic solution. This time I'm having trouble
The Problem
Say you have two strings: m and n, such that n.length is greater than m.length, and n does not contain the character '#'. You want the string that turns m into the same length as string n in minimum cost.
Cost is defined as SUM(Penalty(m[i],n[i])), where i is in an index of the strings char array.
Penalty is defined as such
private static int penalty(char x,char y) {
if (x==y) { return 0;}
else if (y=='#') { return 4;}
else { return 2;}
}
The only way I can think of is as follows:
[0] If m and n are the same length, return m
[1] Compute cost of inserting a # at any index of m
[2] determine the string that has the minimum of such cost. Let that string be m'
[3] Run the algorithm on m' and n again.
I don't think this is even the optimal recursive algorithm, leading me to believe that I'm not on the right track for a dynamic algorithm.
I've read up on using a m.length x n.length matrix for normal edit distance, but I don't see how I could easily transform that to fit my algorithm.
Thoughts on my recursive algorithm and the steps I need to take to reach a dynamic solution?

Taking your definitions (python):
def penalty(x, y):
if x == y:
return 0
if y == '#':
return 4
return 2
def cost(n, m):
return sum(penalty(a, b) for a, b in zip(n, m))
Then you can define the distance reassigning to m the lowest cost for each # to be included.
def distance(n, m):
for _ in range(len(n) - len(m)):
m = min((m[:i]+'#'+m[i:] for i in range(len(m)+1)), key=lambda s: cost(n, s))
return m
>>> distance('hello world', 'heloworld')
'he#lo#world'

The only way that I can see the optimality principle to work here is if you solve the problem over growing lengths of n. So the dynamic programming solution would look like this:
For each contiguous substring of length m.length()+1, solve the problem, yielding a list of proposals for the new m.
Select the proposal with the minimum distance to the corresponding substring as the new m, and repeat the process.
You won't need to store anything other than the currently optimal solution in this algorithm, certainly not a distance matrix. It looks to me like you were pretty close to this solution as well, you only missed the 'shrink n to get a subproblem'-part.

Stacking and dynamic programing

Basically I'm trying to solve this problem :
Given N unit cube blocks, find the smaller number of piles to make in order to use all the blocks. A pile is either a cube or a pyramid. For example two valid piles are the cube 4 *4 *4=64 using 64 blocks, and the pyramid 1²+2²+3²+4²=30 using 30 blocks.
However, I can't find the right angle to approach it. I feel like it's similar to the knapsack problem, but yet, couldn't find an implementation.
Any help would be much appreciated !

First I will give a recurrence relation which will permit to solve the problem recursively. Given N, let
SQUARE-NUMS
TRIANGLE-NUMS
be the subset of square numbers and triangle numbers in {1,...,N} respectively. Let PERMITTED_SIZES be the union of these. Note that, as 1 occurs in PERMITTED_SIZES, any instance is feasible and yields a nonnegative optimum.
The follwing function in pseudocode will solve the problem in the question recursively.
int MinimumNumberOfPiles(int N)
{
int Result = 1 + min { MinimumNumberOfPiles(N-i) }
where i in PERMITTED_SIZES and i smaller than N;
return Result;
}
The idea is to choose a permitted bin size for the items, remove these items (which makes the problem instance smaller) and solve recursively for the smaller instances. To use dynamic programming in order to circumvent multiple evaluation of the same subproblem, one would use a one-dimensional state space, namely an array A[N] where A[i] is the minimum number of piles needed for i unit blocks. Using this state space, the problem can be solved iteratively as follows.
for (int i = 0; i < N; i++)
{
if i is 0 set A[i] to 0,
if i occurs in PERMITTED_SIZES, set A[i] to 1,
set A[i] to positive infinity otherwise;
}
This initializes the states which are known beforehand and correspond to the base cases in the above recursion. Next, the missing states are filled using the following loop.
for (int i = 0; i <= N; i++)
{
if (A[i] is positive infinity)
{
A[i] = 1 + min { A[i-j] : j is in PERMITTED_SIZES and j is smaller than i }
}
}
The desired optimal value will be found in A[N]. Note that this algorithm only calculates the minimum number of piles, but not the piles themselves; if a suitable partition is needed, it has to be found either by backtracking or by maintaining additional auxiliary data structures.
In total, provided that PERMITTED_SIZES is known, the problem can be solved in O(N^2) steps, as PERMITTED_SIZES contains at most N values.
The problem can be seen as an adaptation of the Rod Cutting Problem where each square or triangle size has value 0 and every other size has value 1, and the objective is to minimize the total value.
In total, an additional computation cost is necessary to generate PERMITTED_SIZES from the input.
More precisely, the corresponding choice of piles, once A is filled, can be generated using backtracking as follows.
int i = N; // i is the total amount still to be distributed
while ( i > 0 )
{
choose j such that
j is in PERMITTED_SIZES and j is smaller than i
and
A[i] = 1 + A[i-j] is minimized
Output "Take a set of size" + j; // or just output j, which is the set size
// the part above can be commented as "let's find out how
// the value in A[i] was generated"
set i = i-j; // decrease amount to distribute
}

What is an efficient way to compute the Dice coefficient between 900,000 strings?

I have a corpus of 900,000 strings. They vary in length, but have an average character count of about 4,500. I need to find the most efficient way of computing the Dice coefficient of every string as it relates to every other string. Unfortunately, this results in the Dice coefficient algorithm being used some 810,000,000,000 times.
What is the best way to structure this program for increased efficiency? Obviously, I can prevent computing the Dice of sections A and B, and then B and A--but this only halves the work required. Should I consider taking some shortcuts or creating some sort of binary tree?
I'm using the following implementation of the Dice coefficient algorithm in Java:
public static double diceCoefficient(String s1, String s2) {
Set<String> nx = new HashSet<String>();
Set<String> ny = new HashSet<String>();
for (int i = 0; i < s1.length() - 1; i++) {
char x1 = s1.charAt(i);
char x2 = s1.charAt(i + 1);
String tmp = "" + x1 + x2;
nx.add(tmp);
}
for (int j = 0; j < s2.length() - 1; j++) {
char y1 = s2.charAt(j);
char y2 = s2.charAt(j + 1);
String tmp = "" + y1 + y2;
ny.add(tmp);
}
Set<String> intersection = new HashSet<String>(nx);
intersection.retainAll(ny);
double totcombigrams = intersection.size();
return (2 * totcombigrams) / (nx.size() + ny.size());
}
My ultimate goal is to output an ID for every section that has a Dice coefficient of greater than 0.9 with another section.
Thanks for any advice that you can provide!

Make a single pass over all the Strings, and build up a HashMap which maps each bigram to a set of the indexes of the Strings which contain that bigram. (Currently you are building the bigram set 900,000 times, redundantly, for each String.)
Then make a pass over all the sets, and build a HashMap of [index,index] pairs to common-bigram counts. (The latter Map should not contain redundant pairs of keys, like [1,2] and [2,1] -- just store one or the other.)
Both of these steps can easily be parallelized. If you need some sample code, please let me know.
NOTE one thing, though: from the 26 letters of the English alphabet, a total of 26x26 = 676 bigrams can be formed. Many of these will never or almost never be found, because they don't conform to the rules of English spelling. Since you are building up sets of bigrams for each String, and the Strings are so long, you will probably find almost the same bigrams in each String. If you were to build up lists of bigrams for each String (in other words, if the frequency of each bigram counted), it's more likely that you would actually be able to measure the degree of similarity between Strings, but then the calculation of Dice's coefficient as given in the Wikipedia article wouldn't work; you'd have to find a new formula.
I suggest you continue researching algorithms for determining similarity between Strings, try implementing a few of them, and run them on a smaller set of Strings to see how well they work.

You should come up with some kind of inequality like: D(X1,X2) > 1-p, D(X1,X3) < 1-q and p D(X2,X3) < 1-q+p . Or something like that. Now, if 1-q+p < 0.9, then probably you don't have to evaluate D(X2,X3).
PS: I am not sure about this exact inequality, but I have a gut feeling that this might be right (but I do not have enough time to actually do the derivations now). Look for some of the inequalities with other similarity measures and see if any of them are valid for Dice co-efficient.
=== Also ===
If there are a elements in set A, and if your threshold is r (=0.9), then set B should have number of elements b should be such that: r*a/(2-r) <= b <= (2-r)*a/r . This should eliminate need for lots of comparisons IMHO. You can probably sort the strings according to length and use the window describe above to limit comparisons.

Disclaimer first: This will not reduce the number of comparisons you'll have to make. But this should make a Dice comparison faster.
1) Don't build your HashSets every time you do a diceCoefficient() call! It should speed things up considerably if you just do it once for each string and keep the result around.
2) Since you only care about if a particular bigram is present in the string, you could get away with a BitSet with a bit for each possible bigram, rather than a full HashMap. Coefficient calculation would then be simplified to ANDing two bit sets and counting the number of set bits in the result.
3) Or, if you have a huge number of possible bigrams (Unicode, perhaps?) - or monotonous strings with only a handful of bigrams each - a sorted Array of bigrams might provide faster, more space-efficent comparisons.

Is their charset limited somehow? If it is, you can compute character counts by their code in each string and compare these numbers. After such pre-computation (it will occupy 2*900K*S bytes of memory [if we assume no character is found more then 65K time in the same string], where S is different character count). Then computing the coefficent would take O(S) time. Sure, this would be helpful if S<4500.

Converting N strings to a common target string in maximum of K edits

I've a set of string [S1 S2 S3 ... Sn] and I'm to count all such target strings T such that each one of S1 S2... Sn can be converted into T within a total of K edits. All the strings are of fixed length L and an edit here is hamming distance.
All I've is sort of brute force approach.
so, If my alphabet size is 4, I've sample space of O(4^L) and it takes O(L) time to check each one of them. I can't seem to bring down the complexity from exponential to some poly or pseudo-poly! Is there any way to prune down the sample space to do better?
I tried to visualize it as in a L-dimensional vector space. I've been given N points and have to count all the points whose sum of distance from the given N points is less than or equal to K. i.e. d1 + d2 + d3 +...+ dN <= K
Is there any known geometric algorithm which solves this or similar problem with a better complexity? Kindly point me in the right direction or any hints are appreciated.
Thank you

You can do this efficiently with dynamic programming.
The key idea is that you don't need to enumerate all possible target strings, you just need to know how many ways targets are possible with K edits considering only the string indicies after I.
alphabet = 'abcd'
s = [ 'aabbbb', 'bacaaa', 'dabbbb', 'cabaaa']
# use memoized from http://wiki.python.org/moin/PythonDecoratorLibrary
#memoized
def count(edits_left, index):
if index == -1 and edits_left >= 0:
return 1
if edits_left < 0:
return 0
ret = 0
for char in alphabet:
edits_used = 0
for mutate_str in s:
if mutate_str[index] != char:
edits_used += 1
ret += count(edits_left - edits_used, index - 1)
return ret

Thinking out loud, it seems to me that this problem boils down to a combinatorial problem.
In general for a string S of length L, there are a total of C(L,K) (binomial coefficient) positions that can be substituted and therefore (ALPHABET_SIZE^K)*C(L,K) target strings T from a Hamming Distance of K.
Binomial Coefficient can be computed quite easily using Dynamic Programming and the Pascal Triangle... No need to get crazy into factoriel etc...
Now that one string case is treated, dealing with multiple strings is a little bit more tricky since you might double count targets. Intuitively though if S1 is K far from S2 then both string will generate the same set of target so you don't double count in this case. This last statement might be a long shot that's why I made sure to say "intuitively" :)
Hope it helps,

Explain this DSP notation

I'm trying to implement this extenstion of the Karplus-Strong plucked string algorithm, but I don't understand the notation there used. Maybe it will take years of study, but maybe it won't - maybe you can tell me.
I think the equations below are in the frequency domain or something. Just starting with the first equation, Hp(z), the pick direction lowpass filter. For one direction you use p = 0, for the other, perhaps 0.9. This boils down to to 1 in the first case, or 0.1 / (1 - 0.9 z-1) in the second.
alt text http://www.dsprelated.com/josimages/pasp/img902.png
Now, I feel like this might mean, in coding terms, something towards:
H_p(float* input, int time) {
if (downpick) {
return input[time];
} else {
return some_function_of(input[t], input[t-1]);
}
}
Can someone give me a hint? Or is this futile and I really need all the DSP background to implement this? I was a mathematician once...but this ain't my domain.

So the z-1 just means a one-unit delay.
Let's take Hp = (1-p)/(1-pz-1).
If we follow the convention of "x" for input and "y" for output, the transfer function H = y/x (=output/input)
so we get y/x = (1-p)/(1-pz-1)
or (1-p)x = (1-pz-1)y
(1-p)x[n] = y[n] - py[n-1]
or: y[n] = py[n-1] + (1-p)x[n]
In C code this can be implemented
y += (1-p)*(x-y);
without any additional state beyond using the output "y" as a state variable itself. Or you can go for the more literal approach:
y_delayed_1 = y;
y = p*y_delayed_1 + (1-p)*x;
As far as the other equations go, they're all typical equations except for that second equation which looks like maybe it's a way of selecting either HΒ = 1-z-1 OR 1-z-2. (what's N?)
The filters are kind of vague and they'll be tougher for you to deal with unless you can find some prepackaged filters. In general they're of the form
H = H0*(1+az-1+bz-2+cz-3...)/(1+rz-1+sz-2+tz-3...)
and all you do is write down H = y/x, cross multiply to get
H0 * (1+az-1+bz-2+cz-3...) * x = (1+rz-1+sz-2+tz-3...) * y
and then isolate "y" by itself, making the output "y" a linear function of various delays of itself and of the input.
But designing filters (picking the a,b,c,etc.) is tougher than implementing them, for the most part.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string