Best Alignment of String B in a Substring of A -- Bioinformatics - string

I have two strings A and B, let's say
A = AATCGGATATAG
B = CGATA
Some of you may know two types of alignments:
Global Alignment
Local Alignment
But I would like to implement an alignment that finds the substring of A which, when aligned against the whole of B, yields the best alignment.
For example:
A,B -- Alignment algorithm --> AATCGGATATAG
CG-ATA
So far I've been using the Smith-Waterman algorithm.
Does anyone have any suggestions for solving this problem?
Thanks in advance!

Smith-Waterman is still the algorithm you should use. In order to get the full sequence aligned, you should change your gap penalty to 0. This will make S-W favor gaps over mismatches and add as many gaps as are needed to include the whole sequence.
For example, setting the gap penalty to 0 with the standard nucleotide 4.4 substitution matrix produces this alignment:
A = AATCGGATATAG
B =    C-GATA
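As a rough illustration of the suggestion above, here is a minimal Smith-Waterman sketch in Python with the gap penalty set to 0. The function name `smith_waterman` and the flat scores (match = 5, mismatch = -4, assumed here to mirror the 4.4 matrix for unambiguous bases) are assumptions for this example, not part of the original answer.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap=0):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b).

    The scoring values are assumptions chosen for illustration.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the score matrix, remembering the first cell with the best score.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                          H[i - 1][j] - gap,       # gap in b
                          H[i][j - 1] - gap)       # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            out_a.append(a[i - 1]); out_b.append('-')
            i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1])
            j -= 1
    return best, ''.join(reversed(out_a)), ''.join(reversed(out_b))


score, aligned_a, aligned_b = smith_waterman("AATCGGATATAG", "CGATA")
```

With these scores, every base of B ends up matched, and the aligned part of A is the substring the question asks for. Which of several equally good gap placements is reported depends on the tie-breaking order in the traceback.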


Quickest way to find closest set of points

I have three arrays of points:
A=[[5,2],[1,0],[5,1]]
B=[[3,3],[5,3],[1,1]]
C=[[4,2],[9,0],[0,0]]
I need the most efficient way to find the three points (one for each array) that are closest to each other (within one pixel in each axis).
What I'm doing right now is taking one point as a reference, let's say A[0], and cycling through all the other B and C points looking for a solution. If A[0] gives me no result, I move the reference to A[1], and so on. This approach has a huge problem: if I increase the number of points in each array and/or the number of arrays, it sometimes takes too much time to converge, especially if the solution is among the last members of the arrays. So I'm wondering if there is any way to do this without using a reference, or any quicker way than just looping over all the elements.
The rules that I must follow are the following:
the final solution has to be made by only one element from each array like: S=[A[n],B[m],C[j]]
each selected element has to be within 1 pixel in X and Y from ALL the other members of the solution (so |Xi-Xj|<=1 and |Yi-Yj|<=1 for each pair of members of the solution).
For example in this simplified case the solution would be: S=[A[1],B[2],C[1]]
To clarify the problem further: what I wrote above is just a simplified example to explain what I need. In my real case I don't know a priori the length of the lists nor the number of lists I have to work with; it could be A,B,C, or A,B,C,D,E... (each with a different number of points), etc. So I also need to find a way to make it as general as possible.
This requirement:
each selected element has to be within 1 pixel in X and Y from ALL the other members of the solution (so Xi-Xj<=1 and Yi-Yj<=1 for each member of the solution).
massively simplifies the problem, because it means that for any given (xi, yi), there are only nine possible choices of (xj, yj).
So I think the best approach is as follows:
Copy B and C into sets of tuples.
Iterate over A. For each point (xi, yi):
    Iterate over the values of x from xi−1 to xi+1 and the values of y from yi−1 to yi+1. For each resulting point (xj, yj):
        Check if (xj, yj) is in B. If so:
            Iterate over the values of x from max(xi, xj)−1 to min(xi, xj)+1 and the values of y from max(yi, yj)−1 to min(yi, yj)+1. For each resulting point (xk, yk):
                Check if (xk, yk) is in C. If so, we're done!
If we get to the end without having a match, that means there isn't one.
This requires roughly O(len(A) + len(B) + len(C)) time and O(len(B) + len(C)) extra space.
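The steps above could be sketched in Python roughly as follows. The function name `find_triple` is made up for this example, and integer pixel coordinates are assumed.

```python
from itertools import product


def find_triple(A, B, C):
    """Return one (a, b, c) with all three points pairwise within one pixel
    in both x and y, or None if no such triple exists.

    Assumes integer coordinates; the name is hypothetical.
    """
    set_b = {tuple(p) for p in B}
    set_c = {tuple(p) for p in C}
    for xi, yi in A:
        # The nine candidate positions for a point of B near (xi, yi).
        for xj, yj in product(range(xi - 1, xi + 2), range(yi - 1, yi + 2)):
            if (xj, yj) not in set_b:
                continue
            # Positions that are within one pixel of BOTH chosen points.
            xs = range(max(xi, xj) - 1, min(xi, xj) + 2)
            ys = range(max(yi, yj) - 1, min(yi, yj) + 2)
            for xk, yk in product(xs, ys):
                if (xk, yk) in set_c:
                    return (xi, yi), (xj, yj), (xk, yk)
    return None


A = [[5, 2], [1, 0], [5, 1]]
B = [[3, 3], [5, 3], [1, 1]]
C = [[4, 2], [9, 0], [0, 0]]
result = find_triple(A, B, C)
```

On the arrays from the question this finds the first, second and first elements of A, B and C respectively, which is the solution quoted in the question if its indices are read as 1-based.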
Edited to add (due to a follow-up question in the comments): if you have N lists instead of just 3, then instead of nesting N loops deep (which gives time exponential in N), you'll want to do something more like this:
Copy B, C, etc., into sets of tuples, as above.
Iterate over A. For each point (xi, yi):
    Create a set containing (xi, yi) and its eight neighbors.
    For each of the lists B, C, etc.:
        For each element in the set of nine points, see if it's in the current list.
        Update the set to remove any points that aren't in the current list and don't have any neighbors in the current list.
    If the set still has at least one element, then great, each list contained a point that's within one pixel of that element (with all of those points also being within one pixel of each other). So, we're done!
If we get to the end without having a match, that means there isn't one.
which is much more complicated to implement, but is linear in N instead of exponential in N.
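For the N-list case, here is a sketch that uses a slightly different formulation from the steps above, so treat it as a substitute technique rather than a transcription. It relies on the observation that, for integer coordinates, a group of points is pairwise within one pixel in x and y exactly when they all fit inside some 2x2 block of pixels. For each point of the first list it tries the four 2x2 blocks containing that point. The name `find_cluster` is hypothetical.

```python
def find_cluster(lists):
    """Return one point per list, all pairwise within one pixel in x and y,
    or None if no such selection exists.

    Assumes integer coordinates. Work is linear in the number of lists.
    """
    first, *rest = lists
    rest_sets = [{tuple(p) for p in pts} for pts in rest]
    for x, y in first:
        # (ox, oy) is the lower-left corner of a 2x2 block containing (x, y).
        for ox in (x - 1, x):
            for oy in (y - 1, y):
                block = [(ox + dx, oy + dy) for dx in (0, 1) for dy in (0, 1)]
                picks = [(x, y)]
                for s in rest_sets:
                    hit = next((p for p in block if p in s), None)
                    if hit is None:
                        break          # this block misses one of the lists
                    picks.append(hit)
                else:
                    return picks       # every list has a point in the block
    return None


A = [[5, 2], [1, 0], [5, 1]]
B = [[3, 3], [5, 3], [1, 1]]
C = [[4, 2], [9, 0], [0, 0]]
cluster = find_cluster([A, B, C])
```

Each point of the first list costs at most 4 blocks times 4 set lookups per remaining list, so the running time grows linearly with both the number of lists and the length of the first list.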
Currently, you are finding the solution with a brute-force algorithm that has O(n^2) complexity. If your lists contain 1000 items, your algorithm will need 1000000 iterations to run... (It's even O(n^3), as tobias_k pointed out.)
As you can see at https://en.wikipedia.org/wiki/Closest_pair_of_points_problem, you could improve it by using a divide-and-conquer algorithm, which would run in O(n log n) time.
You should search for Delaunay triangulation and/or Voronoi diagram implementations.
NB: if you can use external libs, you should also consider taking a look at the scipy lib: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.Delaunay.html

Computing generalized mean for extreme values of p

How do I compute the generalized mean for extreme values of p (very close to 0, or very large) with reasonable computational error?
As per your link, the limit for p going to 0 is the geometric mean, for which bounds are derived.
The limit for p going to infinity is the maximum.
I have been struggling with the same problem. Here is how I handled this:
Let gmean_p(x1,...,xn) be the generalized mean where p is real but not 0, and x1,...,xn are nonnegative. For M>0, we have gmean_p(x1,...,xn) = M*gmean_p(x1/M,...,xn/M), and the latter form can be exploited to reduce the computational error. For large p, I use M=max(x1,...,xn), and for p close to 0, I use M=mean(x1,...,xn). In case M=0, just add a small positive constant to it. This did the job for me.
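A minimal Python sketch of this rescaling trick might look like the following. The function name, the cut-off `large_p` between "large" and "small" p, the restriction to p > 0, and the value of the small constant `eps` are all assumptions made for the example.

```python
def generalized_mean(xs, p, large_p=1.0, eps=1e-12):
    """Power mean of nonnegative numbers xs for p > 0, computed as
    M * gmean_p(x1/M, ..., xn/M) to keep intermediate values near 1.

    large_p and eps are illustrative choices, not values from the answer.
    """
    if p <= 0:
        raise ValueError("this sketch only handles p > 0")
    n = len(xs)
    # M = max for large p, M = arithmetic mean for p close to 0.
    m = max(xs) if p >= large_p else sum(xs) / n
    if m == 0:
        m = eps  # all inputs are zero; avoid dividing by zero
    scaled = sum((x / m) ** p for x in xs) / n
    return m * scaled ** (1.0 / p)


big = generalized_mean([1e200, 2e200], 10.0)
```

In the last line, raising 1e200 to the 10th power directly would overflow a double, whereas the rescaled terms never exceed 1.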
I suspect if you're interested in very large or small values of p, it may be best to do some form of algebraic manipulation of the generalized-mean formula before putting in numerical values.
For example, in the small-p limit, one can show that the generalized mean tends to the n'th root of the product x_1*x_2*...x_n. The higher order terms in p involve sums and products of log(x_i), which should also be relatively numerically stable to compute. In fact, I believe the first-order expansion in p has a simple relationship to the variance of log(x_i):
If one applies this formula to a set of 100 random numbers drawn uniformly from the range [0.2, 2], one gets a trend like this:
which here shows the asymptotic formula becoming pretty accurate for p less than about 0.3, and the simple formula only failing when p is less than about 1e-10.
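One form such a small-p expansion can take, obtained from the cumulant expansion of the mean of exp(p*log(x_i)), is log(gmean_p) ≈ mean(log x) + (p/2)*var(log x), where var is the population variance. This is not necessarily the exact expression the answer had in mind. A sketch, assuming strictly positive inputs and a hypothetical function name:

```python
import math


def power_mean_small_p(xs, p):
    """First-order (in p) approximation of the generalized mean for
    strictly positive xs, intended for p close to 0."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n                              # log of the geometric mean
    var = sum((v - mu) ** 2 for v in logs) / n      # population variance
    return math.exp(mu + 0.5 * p * var)


xs = [1.0, 2.0, 4.0]
p = 0.01
approx = power_mean_small_p(xs, p)
direct = (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)
```

At p = 0 the expression reduces to the geometric mean, which is the limit mentioned above.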
The case of large p is dominated by the x_i that has the largest magnitude (let's call that index i_max). One can rearrange the generalized mean formula to take the following form, which has less pathological behaviour for large p:
If this is applied (using standard numpy routines including numpy.log1p) to another 100 uniformly distributed samples over [0.2, 2.0], one finds that the rearranged formula agrees essentially exactly with the simple formula, but remains valid for much larger values of p for which the simple formula overflows when computing powers of x_i.
(Note that the left-hand plot has the blue curve for the simple formula shifted up by 0.1 so that one can see where it ends due to overflows. For p less than about 1000, the two curves would otherwise be indistinguishable.)
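One rearrangement along these lines factors out x_max and writes the result as x_max * exp((log1p(S) - log(n)) / p), where S is the sum of (x_i/x_max)^p over all i other than i_max. The sketch below uses the standard-library `math.log1p` rather than numpy, assumes strictly positive inputs and p > 0, and may differ in detail from the formula the answer used.

```python
import math


def power_mean_large_p(xs, p):
    """Generalized mean for strictly positive xs and p > 0, rearranged
    around the largest element so that no large power is ever formed."""
    n = len(xs)
    i_max = max(range(n), key=lambda i: xs[i])
    x_max = xs[i_max]
    # Every ratio is <= 1, so each term lies in [0, 1] and cannot overflow.
    tail = sum(math.exp(p * math.log(x / x_max))
               for i, x in enumerate(xs) if i != i_max)
    return x_max * math.exp((math.log1p(tail) - math.log(n)) / p)


value = power_mean_large_p([0.2, 1.0, 2.0], 1e6)
```

For very large p the terms in the sum simply underflow to 0 and the result approaches the maximum, as expected from the limit.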
I think the answer here should be to use a recursive solution. In the same way that mean(1,2,3,4)=mean(mean(1,2),mean(3,4)), you can do this kind of recursion for generalized means. What this buys you is that you won't need to do as many sums of really large numbers and you decrease the likelihood of creating an overflow. Also, the other danger when working with floating point numbers is when adding numbers of very different magnitudes (or subtracting numbers of very similar magnitudes). So to avoid these kinds of rounding errors it might help to sort your data before you try and calculate the generalized mean.
Here's a hunch:
First convert all your numbers into a representation in base p. Now to raise to a power of 1/p or p, you just have to shift them --- so you can very easily do all powers without losing precision.
Work out your mean in base p, then convert the result back to base two.
If that doesn't work, an even less practical hunch:
Try working out the discrete Fourier transform, and relating that to the discrete Fourier transform of the input vector.

Sorting a list of colors in one dimension?

I would like to sort a one-dimensional list of colors so that colors that a typical human would perceive as "like" each other are near each other.
Obviously this is a difficult or perhaps impossible problem to get "perfectly", since colors are typically described with three dimensions, but that doesn't mean that there aren't some sorting methods that look obviously more natural than others.
For example, sorting by RGB doesn't work very well, as it will sort in the following order, for example:
(1) R=254 G=0 B=0
(2) R=254 G=255 B=0
(3) R=255 G=0 B=0
(4) R=255 G=255 B=0
That is, it will alternate those colors red, yellow, red, yellow, with the two "reds" being essentially imperceptibly different from each other, and the two yellows also being imperceptibly different from each other.
But sorting by HLS works much better, generally speaking, and I think HSL even better than that; with either, the reds will be next to each other, and the yellows will be next to each other.
But HLS/HSL has some problems, too; things that people would perceive as "black" could be split far apart from each other, as could things that people would perceive as "white".
Again, I understand that I pretty much have to accept that there will be some splits like this; I'm just wondering if anyone has found a better way than HLS/HSL. And I'm aware that "better" is somewhat arbitrary; I mean "more natural to a typical human".
For example, a vague thought I've had, but have not yet tried, is perhaps "L is the most important thing if it is very high or very low", but otherwise it is the least important. Has anyone tried this? Has it worked well? What specifically did you decide "very low" and "very high" meant? And so on. Or has anyone found anything else that would improve upon HSL?
I should also note that I am aware that I can define a space-filling curve through the cube of colors, and order them one-dimensionally as they would be encountered while travelling along that curve. That would eliminate perceived discontinuities. However, it's not really what I want; I want decent overall large-scale groupings more than I want perfect small-scale groupings.
Thanks in advance for any help.
If you want to sort a list of colors in one dimension, you first have to decide by what metric you are going to sort them. What makes the most sense to me is the perceived brightness (related question).
I have come across 4 algorithms to sort colors by brightness and compared them. Here is the result.
I generated colors in cycle where only about every 400th color was used. Each color is represented by 2x2 pixels, colors are sorted from darkest to lightest (left to right, top to bottom).
1st picture - Luminance (relative)
0.2126 * R + 0.7152 * G + 0.0722 * B
2nd picture - http://www.w3.org/TR/AERT#color-contrast
0.299 * R + 0.587 * G + 0.114 * B
3rd picture - HSP Color Model
sqrt(0.299 * R^2 + 0.587 * G^2 + 0.114 * B^2)
4th picture - WCAG 2.0 SC 1.4.3 relative luminance and contrast ratio formula
A pattern can sometimes be spotted in the 1st and 2nd pictures, depending on the number of colors in one row. I never spotted any pattern in the pictures from the 3rd or 4th algorithm.
If I had to choose, I would go with algorithm number 3, since it's much easier to implement and about 33% faster than the 4th.
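For reference, the four brightness measures could be written in Python roughly as below. The function names are made up for this example; the first three take 8-bit R, G, B values as listed above, and the fourth follows the WCAG 2.0 definition of relative luminance, which first linearizes each sRGB channel.

```python
import math


def luminance(r, g, b):
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def w3c_brightness(r, g, b):
    return 0.299 * r + 0.587 * g + 0.114 * b


def hsp_brightness(r, g, b):
    return math.sqrt(0.299 * r ** 2 + 0.587 * g ** 2 + 0.114 * b ** 2)


def wcag_relative_luminance(r, g, b):
    def linear(c):
        # Convert an 8-bit sRGB channel to linear light (WCAG 2.0 formula).
        c /= 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    return 0.2126 * linear(r) + 0.7152 * linear(g) + 0.0722 * linear(b)


colors = [(255, 255, 0), (0, 0, 255), (0, 0, 0), (255, 255, 255), (255, 0, 0)]
# Sort from darkest to lightest using the third (HSP) measure.
by_hsp = sorted(colors, key=lambda c: hsp_brightness(*c))
```

Swapping the key function is enough to compare the orderings that the different measures produce on the same palette.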
You cannot do this without reducing the 3 color dimensions to a single measurement. There are many (infinite) ways of reducing this information, but it is not mathematically possible to do this in a way that ensures that two data points near each other on the reduced continuum will also be near each other in all three of their component color values. As a result, any formula of this type will potentially end up grouping dissimilar colors.
As you mentioned in your question, one way to sort of do this would be to fit a complex curve through the three-dimensional color space occupied by the data points you're trying to sort, and then reduce each data point to its nearest location on the curve and then to that point's distance along the curve. This would work, but in each case it would be a solution custom-tailored to a particular set of data points (rather than a generally applicable solution). It would also be relatively expensive (maybe), and simply wouldn't work on a data set that was not nicely distributed in a curved-line sort of way.
A simpler alternative (that would not work perfectly) would be to choose two "endpoint" colors, preferably on opposite sides of the color wheel. So, for example, you could choose Red as one endpoint color and Blue as the other. You would then convert each color data point to a value on a scale from 0 to 1, where a color that is highly Reddish would get a score near 0 and a color that is highly Bluish would get a score near 1. A score of .5 would indicate a color that either has no Red or Blue in it (a.k.a. Green) or else has equal amounts of Red and Blue (a.k.a. Purple). This approach isn't perfect, but it's the best you can do with this problem.
There are several standard techniques for reducing multiple dimensions to a single dimension with some notion of "proximity".
I think you should in particular check out the z-order transform.
You can implement a quick version of this by interleaving the bits of your three colour components, and sorting the colours based on this transformed value.
The following Java code should help you get started:
public static int zValue(int r, int g, int b) {
    return split(r) + (split(g)<<1) + (split(b)<<2);
}

public static int split(int a) {
    // split out the lowest 10 bits to lowest 30 bits
    // (the masks are octal literals, one octal digit per 3 bits)
    a=(a|(a<<12))&00014000377;
    a=(a|(a<<8)) &00014170017;
    a=(a|(a<<4)) &00303030303;
    a=(a|(a<<2)) &01111111111;
    return a;
}
There are two approaches you could take. The simple approach is to distil each colour into a single value, and the list of values can then be sorted. The complex approach would depend on all of the colours you have to sort; perhaps it would be an iterative solution that repeatedly shuffles the colours around trying to minimise the "energy" of the entire sequence.
My guess is that you want something simple and fast that looks "nice enough" (rather than trying to figure out the "optimum" aesthetic colour sort), so the simple approach is enough for you.
I'd say HSL is the way to go. Something like
sortValue = L * 5 + S * 2 + H
assuming that H, S and L are each in the range [0, 1].
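In Python, this weighting could be tried out with the standard `colorsys` module, which returns hue, lightness and saturation in the range [0, 1] when given RGB components in [0, 1]. The function name `hsl_sort_key` is an assumption for this sketch, and the weights are the ones suggested above.

```python
import colorsys


def hsl_sort_key(rgb):
    """Sort key L*5 + S*2 + H for an (R, G, B) tuple of 8-bit values."""
    r, g, b = (c / 255.0 for c in rgb)
    # Note the order of the returned tuple: hue, lightness, saturation.
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    return l * 5 + s * 2 + h


palette = [(255, 255, 255), (255, 0, 0), (0, 0, 0)]
ordered = sorted(palette, key=hsl_sort_key)
```

Because lightness carries the largest weight, blacks and whites end up at opposite ends of the list, with saturated colors in between.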
Here's an idea I came up with after a couple of minutes' thought. It might be crap, or it might not even work at all, but I'll spit it out anyway.
Define a distance function on the space of colours, d(x, y) (where the inputs x and y are colours and the output is perhaps a floating-point number). The distance function you choose may not be terribly important. It might be the sum of the squares of the differences in R, G and B components, say, or it might be a polynomial in the differences in H, L and S components (with the components differently weighted according to how important you feel they are).
Then you calculate the "distance" of each colour in your list from each other, which effectively gives you a graph. Next you calculate the minimum spanning tree of your graph. Then you identify the longest path (with no backtracking) that exists in your MST. The endpoints of this path will be the endpoints of the final list. Next you try to "flatten" the tree into a line by bringing points in the "branches" off your path into the path itself.
Hmm. This might not work all that well if your MST ends up in the shape of a near-loop in colour space. But maybe any approach would have that problem.

Non-Uniform Random Number Generator Implementation?

I need a random number generator that picks numbers over a specified range with a programmable mean.
For example, I need to pick numbers between 2 and 14 and I need the average of the random numbers to be 5.
I use random number generators a lot. Usually I just need a uniform distribution.
I don't even know what to call this type of distribution.
Thank you for any assistance or insight you can provide.
You might be able to use a binomial distribution, if you're happy with the shape of that distribution. Set n=12 and p=0.25. This will give you a value between 0 and 12 with a mean of 3. Just add 2 to each result to get the range and mean you are looking for.
Edit: As for implementation, you can probably find a library for your chosen language that supports non-uniform distributions (I've written one myself for Java).
A binomial distribution can be approximated fairly easily using a uniform RNG. Simply perform n trials and record the number of successes. So if you have n=10 and p=0.5, it's just like flipping a coin 10 times in a row and counting the number of heads. For p=0.25 just generate uniformly-distributed values between 0 and 3 and only count zeros as successes.
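The trial-counting approach can be sketched in Python as follows, using the parameters suggested above (n = 12, p = 0.25, shifted by 2). The function name `shifted_binomial` is hypothetical, and the fixed seed is only there to make the example repeatable.

```python
import random


def shifted_binomial(rng, n=12, p=0.25, offset=2):
    """Run n independent trials with success probability p and return
    offset plus the number of successes (range offset..offset+n)."""
    return offset + sum(rng.random() < p for _ in range(n))


rng = random.Random(0)
samples = [shifted_binomial(rng) for _ in range(20000)]
average = sum(samples) / len(samples)
```

Every sample lies between 2 and 14, and the sample average should settle close to 5 as the number of samples grows. Values near 14 will be rare, since they require almost every trial to succeed.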
If you want a more efficient implementation, there is a clever algorithm hidden away in the exercises of volume 2 of Knuth's The Art of Computer Programming.
You haven't said what distribution you are after. Regarding your specific example, a function which produced a uniform distribution between 2 and 8 would satisfy your requirements, strictly as you have written them :)
If you want a non-uniform distribution of the random number, then you might have to implement some sort of mapping, e.g:
// returns a number between 0..5 with a custom distribution
int MyCustomDistribution()
{
    int r = rand(100); // random number between 0..100
    if (r < 10) return 1;
    if (r < 30) return 2;
    if (r < 42) return 3;
    ...
}
Based on the Wikipedia sub-article about non-uniform generators, it would seem you want to apply the output of a uniform pseudorandom number generator to an area distribution that meets the desired mean.
You can create a non-uniform PRNG from a uniform one. This makes sense, as you can imagine taking a uniform PRNG that returns 0,1,2 and create a new, non-uniform PRNG by returning 0 for values 0,1 and 1 for the value 2.
There is more to it if you want specific characteristics on the distribution of your new, non-uniform PRNG. This is covered on the Wikipedia page on PRNGs, and the Ziggurat algorithm is specifically mentioned.
With those clues you should be able to search up some code.
My first idea would be:
generate numbers in the range 0..1
scale to the range -9..9 ( x-0.5; x*18)
shift range by 5 -> -4 .. 14 (add 5)
truncate the range to 2..14 (discard numbers < 2)
that should give you numbers in the range you want.
You need a distributed / weighted random number generator. Here's a reference to get you started.
Assign all numbers equal probabilities,
while currentAverage not equal to intendedAverage (within possible margin)
    pickedNumber = pick one of the possible numbers (at random, uniform probability, if you pick intendedAverage pick again)
    if (pickedNumber is greater than intendedAverage and currentAverage<intendedAverage) or (pickedNumber is less than intendedAverage and currentAverage>intendedAverage)
        increase pickedNumber's probability by delta at the expense of all others, conserving sum=100%
    else
        decrease pickedNumber's probability by delta to the benefit of all others, conserving sum=100%
    end if
    delta=0.98*delta (the rate of decrease of delta should probably be experimented with)
end while

How to rewrite the halve function in J?

in the J programming language,
-: i. 5
the above function computes the halves of all integers in [0,4]. Now let's say I'd like to re-write the -: function, just for the fun of it. My best guess so far was
]&%.2
but that doesn't seem to cut it. How do you do it?
%&2 NB. divide by two
0.5&* NB. multiply by one half
Note that ] % 2: would also work, but to ensure proper grammar you would either want to use that as the definition of a name, or you would want to put the expression in parentheses.
I saw you were using %. probably because you were dividing a matrix and thought you needed to do a "matrix divide".
The matrix divide and matrix inverse they are talking about there is for matrix algebra, where you have a list of, well, essentially polynomials, and you want to do transformations on the polynomials all at once, so as to solve the equations. One of the things you can do really easily in J is matrix algebra, there are builtins for matrix divide and for inverting a matrix (as you have seen) and in the phrases section, there are short phrases for doing all of the typical matrix transformations. Taking the determinant, for example.
But when you are simply dividing a vector by a scalar to get a vector, or you are dividing a matrix by the corresponding elements of another matrix, well, that is just the % division symbol.
If you want to try to understand this, look at Project Euler problem 101 (http://projecteuler.net/problem=101) and then google curve fitting on the Jsoftware.com site. Creating the matrices from the observations, together with the basic matrices as shown, allows you to solve ax^2+bx+c = y where you have x and y and you want to determine a, b, and c. Just remember to use extended arithmetic for everything, as the resultant equations are very good but not perfect unless you do, and to solve the equation you need perfect equations.
Just a thought, unless you want to play with Matrix Algebra, you might not care.
