Sequence alignment given two strings

Sequence alignment given two strings - dynamic-programming

I have two sequences and I need to perform sequence alignment to determine all possible sequence alignments. I have created the matrix and managed to find 16 alignments.
I wanted to understand if I had correctly approached this because I need to provide all possible alignment sequences and then provide the best alignment sequence/sequences out of all the ones found, however, all my found sequences have a score of -9. I have provided 2 sequences I found below, and all 16 found sequences can be found in a pastebin link below since it made the post quite large.
I was wondering if someone could help me understand if I may have gone wrong or if I am missing some sequences since all of them have a score of -9. I do not think I have gone wrong during the matrix creation phase but I am unsure. I have asked for help from my tutor however he has been quite rude and reluctant to explain/help so I have decided to turn to the community for help and guidance. Would appreciate any insight into this. Thanks in advance.
Sequence strings:
ggaatggmeeff
gatge
Alignment matrix
Found sequences: Sequences with score.
1. Score: -9
G G A A T G G M E E F F
G - - A T - G - - E - -
2. Score: -9
G G A A T G G M E E F F
- G - A T - G - - E - -

Your alignment score is low because your sequences are short and one is twice as long as the other. This inescapably introduces gaps in the alignment leading to a low alignment score. This alignment with a -9 score is still the optimal global alignment accordting to the Needleman Wunsch algorithm.

Related

Optimizations for Raycasting

I've been wanting to build a 3D engine starting from scratch as a coding challenge with the end objective of implementing it on a fantasy console.
The best (i.e. most simple?) way I found was raytracing/raycasting. I haven't found much by looking online for raycasting algorithms, only finding point-in-polygon problems (which only tell me whether a ray intersects a polygon or not, not quite my interest since I wouldn't have info about the first intersection nor I'd have the intersection points).
The only solution I could think of is brute forcing the ray by moving it at small intervals and every time check whether that point is occupied by something or not (which would require having filled shapes and wouldn't let me have 2D shapes since they would never be rendered, although none of those is a problem). Still, it looks way too complex performance-wisely.
As far as I know, most of those problems are solved using linear algebra, but I'm not quite as competent as to build up a solution on my own. Does this problem have a practical solution?
EDIT: I just found an algebric solution in 2D which could maybe be expanded in 3D. The idea is:
For each edge, check whether one of the two vertices are in the field of view (i.e. if O is the origin of every ray and P is the vertex, you have to check first that the point is inside the far point of sight, and then whether the angle with the forward vector is less than the angle of vision). If at least one of the two vertices is inside the field of view, add them to an array E.
If we have an array R of rays to shoot and an array of arrays I of info about hit points, we can loop for each ray in R and for each edge in E and then store f(ray, edge) in I, where f is a function that gives us info on whether the ray and the edge collided and where they did.
f uses basic linear algebra: both the ray and the edge are, for all purposes, two segments. Two segments are just parts of two lines. Let's say that if the edge has vertices A and B (AB is the vector that goes from A to B) and if the far point is called P (OP is the vector that goes from O to P). We can create two lines, r and s, defined by A + ηAB and O + λOP. After we do a check to see whether r and s are parallel (check if the absolute value of the dot product of AB and OP is equal to the norm of AB times the norm of OP), it's trivial to get the values for η and λ.
Now, if η < 0 OR η > 1 we have that the two segments are not colliding.
After we've done this for every ray and every edge, we compare every element in each array i in I to see which one had the lowest λ. The lowest λ carries the first collision and hence the data to show on screen.
Everything here is linear algebra, though I fear that it might still be computationally heavy, since there's a lot going on, and it's still only 2D.

Best Alignment of String B in a Substring of A -- Bioinformatics

I have two strings A and B, let's say
A = AATCGGATATAG
B = CGATA
Some of you may know two types of alignments:
Global Alignment
Local Alignment
But I would like to implement an alignment that takes the best whole substring of A which, if aligned with B, yields the best alignment
For example:
A,B -- Alignment algorithm --> AATCGGATATAG
CG-ATA
So far I've been using the Smith-Waterman Algorithm
Does anyone know any suggestions to solve this problem?
Thanks in advance!

Smith-Waterman is still the algorithm you should use. In order to get the full sequence aligned, you should change your gap penalty to 0. This will make S-W favor gaps over mismatches and add as many gaps as are need to include the whole sequence.
For example setting the gap penalty to 0 using the standard nucleotide 4.4 subsitution matrix will make this alignment:
A = AATCGGATATAG
B = C-GATA

searching two identical submatrices in a bigger matrix

I need to find two identical submatices in a larger matrix.
I am not getting how to approach as first I need to have a submatrice and then to find if its identical is present or not.
e.g.
(N X M) 4X5 matrix
xy*yyx*
yx*xyy*
x*yyx*y
x*xyy*x
Identicals are
yyx
xyy
present at bold letters made points.. N,M <=10
not sure where to start with..

What are the limits on the submatrix? E.g., can you have a 1x1 submatrix? If there are no limits, then a dynamic programming approach would seem to be in order.
I.e., find the 1x1 matching submatrices, then use that as a starting point to find 2x2, 3x3, etc.

Calculating margin and bias for SVM's

I apologise for the newbishness of this question in advance but I am stuck. I am trying to solve this question,
I can do parts i)-1v) but I am stuck on v. I know to calculate the margin y, you do
y=2/||W||
and I know that W is the normal to the hyperplane, I just don't know how to calculate it. Is this always
W=[1;1] ?
Similarly, the bias, W^T * x + b = 0
how do I find the value x from the data points? Thank you for your help.

Consider building an SVM over the (very little) data set shown in Picture for an example like this, the maximum margin weight vector will be parallel to the shortest line connecting points of the two classes, that is, the line between and , giving a weight vector of . The optimal decision surface is orthogonal to that line and intersects it at the halfway point. Therefore, it passes through . So, the SVM decision boundary is:
Working algebraically, with the standard constraint that , we seek to minimize . This happens when this constraint is satisfied with equality by the two support vectors. Further we know that the solution is for some . So we have that:
Therefore a=2/5 and b=-11/5, and . So the optimal hyperplane is given by
and b= -11/5 .
The margin boundary is
This answer can be confirmed geometrically by examining picture.

How do I efficiently estimate a probability based on a small amount of evidence?

I've been trying to find an answer to this for months (to be used in a machine learning application), it doesn't seem like it should be a terribly hard problem, but I'm a software engineer, and math was never one of my strengths.
Here is the scenario:
I have a (possibly) unevenly weighted coin and I want to figure out the probability of it coming up heads. I know that coins from the same box that this one came from have an average probability of p, and I also know the standard deviation of these probabilities (call it s).
(If other summary properties of the probabilities of other coins aside from their mean and stddev would be useful, I can probably get them too.)
I toss the coin n times, and it comes up heads h times.
The naive approach is that the probability is just h/n - but if n is small this is unlikely to be accurate.
Is there a computationally efficient way (ie. doesn't involve very very large or very very small numbers) to take p and s into consideration to come up with a more accurate probability estimate, even when n is small?
I'd appreciate it if any answers could use pseudocode rather than mathematical notation since I find most mathematical notation to be impenetrable ;-)
Other answers:
There are some other answers on SO that are similar, but the answers provided are unsatisfactory. For example this is not computationally efficient because it quickly involves numbers way smaller than can be represented even in double-precision floats. And this one turned out to be incorrect.

Unfortunately you can't do machine learning without knowing some basic math---it's like asking somebody for help in programming but not wanting to know about "variables" , "subroutines" and all that if-then stuff.
The better way to do this is called a Bayesian integration, but there is a simpler approximation called "maximum a postieri" (MAP). It's pretty much like the usual thinking except you can put in the prior distribution.
Fancy words, but you may ask, well where did the h/(h+t) formula come from? Of course it's obvious, but it turns out that it is answer that you get when you have "no prior". And the method below is the next level of sophistication up when you add a prior. Going to Bayesian integration would be the next one but that's harder and perhaps unnecessary.
As I understand it the problem is two fold: first you draw a coin from the bag of coins. This coin has a "headsiness" called theta, so that it gives a head theta fraction of the flips. But the theta for this coin comes from the master distribution which I guess I assume is Gaussian with mean P and standard deviation S.
What you do next is to write down the total unnormalized probability (called likelihood) of seeing the whole shebang, all the data: (h heads, t tails)
L = (theta)^h * (1-theta)^t * Gaussian(theta; P, S).
Gaussian(theta; P, S) = exp( -(theta-P)^2/(2*S^2) ) / sqrt(2*Pi*S^2)
This is the meaning of "first draw 1 value of theta from the Gaussian" and then draw h heads and t tails from a coin using that theta.
The MAP principle says, if you don't know theta, find the value which maximizes L given the data that you do know. You do that with calculus. The trick to make it easy is that you take logarithms first. Define LL = log(L). Wherever L is maximized, then LL will be too.
so
LL = hlog(theta) + tlog(1-theta) + -(theta-P)^2 / (2*S^2)) - 1/2 * log(2*pi*S^2)
By calculus to look for extrema you find the value of theta such that dLL/dtheta = 0.
Since the last term with the log has no theta in it you can ignore it.
dLL/dtheta = 0 = (h/theta) + (P-theta)/S^2 - (t/(1-theta)) = 0.
If you can solve this equation for theta you will get an answer, the MAP estimate for theta given the number of heads h and the number of tails t.
If you want a fast approximation, try doing one step of Newton's method, where you start with your proposed theta at the obvious (called maximum likelihood) estimate of theta = h/(h+t).
And where does that 'obvious' estimate come from? If you do the stuff above but don't put in the Gaussian prior: h/theta - t/(1-theta) = 0 you'll come up with theta = h/(h+t).
If your prior probabilities are really small, as is often the case, instead of near 0.5, then a Gaussian prior on theta is probably inappropriate, as it predicts some weight with negative probabilities, clearly wrong. More appropriate is a Gaussian prior on log theta ('lognormal distribution'). Plug it in the same way and work through the calculus.

You can use p as a prior on your estimated probability. This is basically the same as doing pseudocount smoothing. I.e., use
(h + c * p) / (n + c)
as your estimate. When h and n are large, then this just becomes h / n. When h and n are small, this is just c * p / c = p. The choice of c is up to you. You can base it on s but in the end you have to decide how small is too small.

You don't have nearly enough info in this question.
How many coins are in the box? If it's two, then in some scenarios (for example one coin is always heads, the other always tails) knowing p and s would be useful. If it's more than a few, and especially if only some of the coins are only slightly weighted then it is not useful.
What is a small n? 2? 5? 10? 100? What is the probability of a weighted coin coming up heads/tail? 100/0, 60/40, 50.00001/49.99999? How is the weighting distributed? Is every coin one of 2 possible weightings? Do they follow a bell curve? etc.
It boils down to this: the differences between a weighted/unweighted coin, the distribution of weighted coins, and the number coins in your box will all decide what n has to be for you to solve this with a high confidence.
The name for what you're trying to do is a Bernoulli trial. Knowing the name should be helpful in finding better resources.
Response to comment:
If you have differences in p that small, you are going to have to do a lot of trials and there's no getting around it.
Assuming a uniform distribution of bias, p will still be 0.5 and all standard deviation will tell you is that at least some of the coins have a minor bias.
How many tosses, again, will be determined under these circumstances by the weighting of the coins. Even with 500 tosses, you won't get a strong confidence (about 2/3) detecting a .51/.49 split.

In general, what you are looking for is Maximum Likelihood Estimation. Wolfram Demonstration Project has an illustration of estimating the probability of a coin landing head, given a sample of tosses.

Well I'm no math man, but I think the simple Bayesian approach is intuitive and broadly applicable enough to put a little though into it. Others above have already suggested this, but perhaps if your like me you would prefer more verbosity.
In this lingo, you have a set of mutually-exclusive hypotheses, H, and some data D, and you want to find the (posterior) probabilities that each hypothesis Hi is correct given the data. Presumably you would choose the hypothesis that had the largest posterior probability (the MAP as noted above), if you had to choose one. As Matt notes above, what distinguishes the Bayesian approach from only maximum likelihood (finding the H that maximizes Pr(D|H)) is that you also have some PRIOR info regarding which hypotheses are most likely, and you want to incorporate these priors.
So you have from basic probability Pr(H|D) = Pr(D|H)*Pr(H)/Pr(D). You can estimate these Pr(H|D) numerically by creating a series of discrete probabilities Hi for each hypothesis you wish to test, eg [0.0,0.05, 0.1 ... 0.95, 1.0], and then determining your prior Pr(H) for each Hi -- above it is assumed you have a normal distribution of priors, and if that is acceptable you could use the mean and stdev to get each Pr(Hi) -- or use another distribution if you prefer. With coin tosses the Pr(D|H) is of course determined by the binomial using the observed number of successes with n trials and the particular Hi being tested. The denominator Pr(D) may seem daunting but we assume that we have covered all the bases with our hypotheses, so that Pr(D) is the summation of Pr(D|Hi)Pr(H) over all H.
Very simple if you think about it a bit, and maybe not so if you think about it a bit more.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string