Edit distance, with a twist - string

I'm trying to solve something using dynamic programming, but I'm having some trouble. When I work on dynamic programming, I usually determine a recursive algorithm then go from there to my dynamic solution. This time I'm having trouble
The Problem
Say you have two strings: m and n, such that n.length is greater than m.length, and n does not contain the character '#'. You want the string that turns m into the same length as string n in minimum cost.
Cost is defined as SUM(Penalty(m[i],n[i])), where i is in an index of the strings char array.
Penalty is defined as such
private static int penalty(char x,char y) {
if (x==y) { return 0;}
else if (y=='#') { return 4;}
else { return 2;}
}
The only way I can think of is as follows:
[0] If m and n are the same length, return m
[1] Compute cost of inserting a # at any index of m
[2] determine the string that has the minimum of such cost. Let that string be m'
[3] Run the algorithm on m' and n again.
I don't think this is even the optimal recursive algorithm, leading me to believe that I'm not on the right track for a dynamic algorithm.
I've read up on using a m.length x n.length matrix for normal edit distance, but I don't see how I could easily transform that to fit my algorithm.
Thoughts on my recursive algorithm and the steps I need to take to reach a dynamic solution?

Taking your definitions (python):
def penalty(x, y):
if x == y:
return 0
if y == '#':
return 4
return 2
def cost(n, m):
return sum(penalty(a, b) for a, b in zip(n, m))
Then you can define the distance reassigning to m the lowest cost for each # to be included.
def distance(n, m):
for _ in range(len(n) - len(m)):
m = min((m[:i]+'#'+m[i:] for i in range(len(m)+1)), key=lambda s: cost(n, s))
return m
>>> distance('hello world', 'heloworld')
'he#lo#world'

The only way that I can see the optimality principle to work here is if you solve the problem over growing lengths of n. So the dynamic programming solution would look like this:
For each contiguous substring of length m.length()+1, solve the problem, yielding a list of proposals for the new m.
Select the proposal with the minimum distance to the corresponding substring as the new m, and repeat the process.
You won't need to store anything other than the currently optimal solution in this algorithm, certainly not a distance matrix. It looks to me like you were pretty close to this solution as well, you only missed the 'shrink n to get a subproblem'-part.

Related

Implementing alternative Fibonacci sequence

So I'm struggling with Question 3. I think the representation of L would be a function that goes something like this:
import numpy as np
def L(a, b):
#L is 2x2 Matrix, that is
return(np.dot([[0,1],[1,1]],[a,b]))
def fibPow(n):
if(n==1):
return(L(0,1))
if(n%2==0):
return np.dot(fibPow(n/2), fibPow(n/2))
else:
return np.dot(L(0,1),np.dot(fibPow(n//2), fibPow(n//2)))
Given b I'm pretty sure I'm wrong. What should I be doing? Any help would be appreciated. I don't think I'm supposed to use the golden ratio property of the Fibonacci series. What should my a and b be?
EDIT: I've updated my code. For some reason it doesn't work. L will give me the right answer, but my exponentiation seems to be wrong. Can someone tell me what I'm doing wrong
With an edited code, you are almost there. Just don't cram everything into one function. That leads to subtle mistakes, which I think you may enjoy to find.
Now, L is not function. As I said before, it is a matrix. And the core of the problem is to compute its nth power. Consider
L = [[0,1], [1,1]]
def nth_power(matrix, n):
if n == 1:
return matrix
if (n % 2) == 0:
temp = nth_power(matrix, n/2)
return np.dot(temp, temp)
else:
temp = nth_power(matrix, n // 2)
return np.dot(matrix, np.dot(temp, temp))
def fibPow(n):
Ln = nth_power(L, n)
return np.dot(L, [0,1])[1]
The nth_power is almost identical to your approach, with some trivial optimization. You may optimize it further by eliminating recursion.
First thing first, there is no L(n, a, b). There is just L(a, b), a well defined linear operator which transforms a vector a, b into a vector b, a+b.
Now a huge hint: a linear operator is a matrix (in this case, 2x2, and very simple). Can you spell it out?
Now, applying this matrix n times in a row to an initial vector (in this case, 0, 1), by matrix magic is equivalent to applying nth power of L once to the initial vector. This is what Question 2 is about.
Once you determine how this matrix looks like, fibPow reduces to computing its nth power, and multiplying the result by 0, 1. To get O(log n) complexity, check out exponentiation by squaring.

user def. function modifying argument though it is not supposed to

Just for practice, I am using nested lists (for exaple, [[1, 0], [0, 1]] is the 2*2 identity matrix) as matrices. I am trying to compute determinant by reducing it to an upper triangular matrix and then by multiplying its diagonal entries. To do this:
"""adds two matrices"""
def add(A, B):
S = []
for i in range(len(A)):
row = []
for j in range(len(A[0])):
row.append(A[i][j] + B[i][j])
S.append(row)
return S
"""scalar multiplication of matrix with n"""
def scale(n, A):
return [[(n)*x for x in row] for row in A]
def detr(M):
Mi = M
#the loops below are supossed to convert Mi
#to upper triangular form:
for i in range(len(Mi)):
for j in range(len(Mi)):
if j>i:
k = -(Mi[j][i])/(Mi[i][i])
Mi[j] = add( scale(k, [Mi[i]]), [Mi[j]] )[0]
#multiplies diagonal entries of Mi:
k = 1
for i in range(len(Mi)):
k = k*Mi[i][i]
return k
Here, you can see that I have set M (argument) equal to Mi and and then operated on Mi to take it to upper triangular form. So, M is supposed to stay unmodified. But after using detr(A), print(A) prints the upper triangular matrix. I tried:
setting X = M, then Mi = X
defining kill(M): return M and then setting Mi = kill(M)
But these approaches are not working. This was causing some problems as I was trying to use detr(M) in another function, problems which I was able to bypass, but why is this happening? What is the compiler doing here, why was M modified even though I operated only on Mi?
(I am using Spyder 3.3.2, Python 3.7.1)
(I am sorry if this question is silly, but I have only started learning python and new to coding in general. This question means a lot to me because I still don't have a deep understanding of this language.)
See python documentation about assignment:
https://docs.python.org/3/library/copy.html
Assignment statements in Python do not copy objects, they create bindings between a target and an object. For collections that are mutable or contain mutable items, a copy is sometimes needed so one can change one copy without changing the other.
You need to import copy and then use Mi = copy.deepcopy(M)
See also
How to deep copy a list?

I keep timing out calculating the mean of a subset

I have a working block of code, but the online judge on HackerEarth keeps returning a timing error. I'm new to coding and so i don't know the tricks to speed up my code. any help would be much appreciated!
N, Q = map(int, input().split())
#N is the length of the array, Q is the number of queries
in_list =input().split()
#input is a list of integers separated by a space
array = list(map(int, in_list))
from numpy import mean
means=[]
for i in range(Q):
L, R = map(int, input().split())
m= int(mean(array[L-1:R]))
means.append(m)
for i in means:
print(i)
Any suggestions would be amazing!
You probably need to avoid doing O(N) operations in the loop. Currently both the slicing and the mean call (which needs to sum up the items in the slice) are both that slow. So you need a better algorithm.
I'll suggest that you do some preprocessing on the list of numbers so that you can figure out the sum of the values that would be in the slice (without actually doing a slice and adding them up). By using O(N) space, you can do the calculation of each sum in O(1) time (making the whole process take O(N + Q) time rather than O(N * Q)).
Here's a quick solution I put together, using itertools.accumulate to find a cumulative sum of the list items. I don't actually save the items themselves, as the cumulative sum is enough.
from itertools import accumulate
N, Q = map(int, input().split())
sums = list(accumulate(map(int, input().split())))
for _ in range(Q):
L, R = map(int, input().split())
print((sums[R] - (sums[L-1] if L > 0 else 0)) / (R-L+1))

Algorithm to solve Local Alignment

Local alignment between X and Y, with at least one column aligning a C
to a W.
Given two sequences X of length n and Y of length m, we
are looking for a highest-scoring local alignment (i.e., an alignment
between a substring X' of X and a substring Y' of Y) that has at least
one column in which a C from X' is aligned to a W from Y' (if such an
alignment exists). As scoring model, we use a substitution matrix s
and linear gap penalties with parameter d.
Write a code in order to solve the problem efficiently. If you use dynamic
programming, it suffices to give the equations for computing the
entries in the dynamic programming matrices, and to specify where
traceback starts and ends.
My Solution:
I've taken 2 sequences namely, "HCEA" and "HWEA" and tried to solve the question.
Here is my code. Have I fulfilled what is asked in the question? If am wrong kindly tell me where I've gone wrong so that I will modify my code.
Also is there any other way to solve the question? If its available can anyone post a pseudo code or algorithm, so that I'll be able to code for it.
public class Q1 {
public static void main(String[] args) {
// Input Protein Sequences
String seq1 = "HCEA";
String seq2 = "HWEA";
// Array to store the score
int[][] T = new int[seq1.length() + 1][seq2.length() + 1];
// initialize seq1
for (int i = 0; i <= seq1.length(); i++) {
T[i][0] = i;
}
// Initialize seq2
for (int i = 0; i <= seq2.length(); i++) {
T[0][i] = i;
}
// Compute the matrix score
for (int i = 1; i <= seq1.length(); i++) {
for (int j = 1; j <= seq2.length(); j++) {
if ((seq1.charAt(i - 1) == seq2.charAt(j - 1))
|| (seq1.charAt(i - 1) == 'C') && (seq2.charAt(j - 1) == 'W')) {
T[i][j] = T[i - 1][j - 1];
} else {
T[i][j] = Math.min(T[i - 1][j], T[i][j - 1]) + 1;
}
}
}
// Strings to store the aligned sequences
StringBuilder alignedSeq1 = new StringBuilder();
StringBuilder alignedSeq2 = new StringBuilder();
// Build for sequences 1 & 2 from the matrix score
for (int i = seq1.length(), j = seq2.length(); i > 0 || j > 0;) {
if (i > 0 && T[i][j] == T[i - 1][j] + 1) {
alignedSeq1.append(seq1.charAt(--i));
alignedSeq2.append("-");
} else if (j > 0 && T[i][j] == T[i][j - 1] + 1) {
alignedSeq2.append(seq2.charAt(--j));
alignedSeq1.append("-");
} else if (i > 0 && j > 0 && T[i][j] == T[i - 1][j - 1]) {
alignedSeq1.append(seq1.charAt(--i));
alignedSeq2.append(seq2.charAt(--j));
}
}
// Display the aligned sequence
System.out.println(alignedSeq1.reverse().toString());
System.out.println(alignedSeq2.reverse().toString());
}
}
#Shole
The following are the two question and answers provided in my solved worksheet.
Aligning a suffix of X to a prefix of Y
Given two sequences X and Y, we are looking for a highest-scoring alignment between any suffix of X and any prefix of Y. As a scoring model, we use a substitution matrix s and linear gap penalties with parameter d.
Give an efficient algorithm to solve this problem optimally in time O(nm), where n is the length of X and m is the length of Y. If you use a dynamic programming approach, it suffices to give the equations that are needed to compute the dynamic programming matrix, to explain what information is stored for the traceback, and to state where the traceback starts and ends.
Solution:
Let X_i be the prefix of X of length i, and let Y_j denote the prefix of Y of length j. We compute a matrix F such that F[i][j] is the best score of an alignment of any suffix of X_i and the string Y_j. We also compute a traceback matrix P. The computation of F and P can be done in O(nm) time using the following equations:
F[0][0]=0
for i = 1..n: F[i][0]=0
for j = 1..m: F[0][j]=-j*d, P[0][j]=L
for i = 1..n, j = 1..m:
F[i][j] = max{ F[i-1][j-1]+s(X[i-1],Y[j-1]), F[i-1][j]-d, F[i][j-1]-d }
P[i][j] = D, T or L according to which of the three expressions above is the maximum
Once we have computed F and P, we find the largest value in the bottom row of the matrix F. Let F[n][j0] be that largest value. We start traceback at F[n][j0] and continue traceback until we hit the first column of the matrix. The alignment constructed in this way is the solution.
Aligning Y to a substring of X, without gaps in Y
Given a string X of length n and a string Y of length m, we want to compute a highest-scoring alignment of Y to any substring of X, with the extra constraint that we are not allowed to insert any gaps into Y. In other words, the output is an alignment of a substring X' of X with the string Y, such that the score of the alignment is the largest possible (among all choices of X') and such that the alignment does not introduce any gaps into Y (but may introduce gaps into X'). As a scoring model, we use again a substitution matrix s and linear gap penalties with parameter d.
Give an efficient dynamic programming algorithm that solves this problem optimally in polynomial time. It suffices to give the equations that are needed to compute the dynamic programming matrix, to explain what information is stored for the traceback, and to state where the traceback starts and ends. What is the running-time of your algorithm?
Solution:
Let X_i be the prefix of X of length i, and let Y_j denote the prefix of Y of length j. We compute a matrix F such that F[i][j] is the best score of an alignment of any suffix of X_i and the string Y_j, such that the alignment does not insert gaps in Y. We also compute a traceback matrix P. The computation of F and P can be done in O(nm) time using the following equations:
F[0][0]=0
for i = 1..n: F[i][0]=0
for j = 1..m: F[0][j]=-j*d, P[0][j]=L
for i = 1..n, j = 1..m:
F[i][j] = max{ F[i-1][j-1]+s(X[i-1],Y[j-1]), F[i][j-1]-d }
P[i][j] = D or L according to which of the two expressions above is the maximum
Once we have computed F and P, we find the largest value in the rightmost column of the matrix F. Let F[i0][m] be that largest value. We start traceback at F[i0][m] and continue traceback until we hit the first column of the matrix. The alignment constructed in this way is the solution.
Hope you get some idea about wot i really need.
I think it's quite easy to find resources or even the answer by google...as the first result of the searching is already a thorough DP solution.
However, I appreciate that you would like to think over the solution by yourself and are requesting some hints.
Before I give out some of the hints, I would like to say something about designing a DP solution
(I assume you know this can be solved by a DP solution)
A dp solution basically consisting of four parts:
1. DP state, you have to self define the physical meaning of one state, eg:
a[i] := the money the i-th person have;
a[i][j] := the number of TV programmes between time i and time j; etc
2. Transition equations
3. Initial state / base case
4. how to query the answer, eg: is the answer a[n]? or is the answer max(a[i])?
Just some 2 cents on a DP solution, let's go back to the question :)
Here's are some hints I am able to think of:
What is the dp state? How many dimensions are enough to define such a state?
Thinking of you are solving problems much alike to common substring problem (on 2 strings),
1-dimension seems too little and 3-dimensions seems too many right?
As mentioned in point 1, this problem is very similar to common substring problem, maybe you should have a look on these problems to get yourself some idea?
LCS, LIS, Edit Distance, etc.
Supplement part: not directly related to the OP
DP is easy to learn, but hard to master. I know a very little about it, really cannot share much. I think "Introduction to algorithm" is a quite standard book to start with, you can find many resources, especially some ppt/ pdf tutorials of some colleges / universities to learn some basic examples of DP.(Learn these examples is useful and I'll explain below)
A problem can be solved by many different DP solutions, some of them are much better (less time / space complexity) due to a well-defined DP state.
So how to design a better DP state or even get the sense that one problem can be solved by DP? I would say it's a matter of experiences and knowledge. There are a set of "well-known" DP problems which I would say many other DP problems can be solved by modifying a bit of them. Here is a post I just got accepted about another DP problem, as stated in that post, that problem is very similar to a "well-known" problem named "matrix chain multiplication". So, you cannot do much about the "experience" part as it has no express way, yet you can work on the "knowledge" part by studying these standard DP problems first maybe?
Lastly, let's go back to your original question to illustrate my point of view:
As I knew LCS problem before, I have a sense that for similar problem, I may be able to solve it by designing similar DP state and transition equation? The state s(i,j):= The optimal cost for A(1..i) and B(1..j), given two strings A & B
What is "optimal" depends on the question, and how to achieve this "optimal" value in each state is done by the transition equation.
With this state defined, it's easy to see the final answer I would like to query is simply s(len(A), len(B)).
Base case? s(0,0) = 0 ! We can't really do much on two empty string right?
So with the knowledge I got, I have a rough thought on the 4 main components of designing a DP solution. I know it's a bit long but I hope it helps, cheers.

Converting N strings to a common target string in maximum of K edits

I've a set of string [S1 S2 S3 ... Sn] and I'm to count all such target strings T such that each one of S1 S2... Sn can be converted into T within a total of K edits. All the strings are of fixed length L and an edit here is hamming distance.
All I've is sort of brute force approach.
so, If my alphabet size is 4, I've sample space of O(4^L) and it takes O(L) time to check each one of them. I can't seem to bring down the complexity from exponential to some poly or pseudo-poly! Is there any way to prune down the sample space to do better?
I tried to visualize it as in a L-dimensional vector space. I've been given N points and have to count all the points whose sum of distance from the given N points is less than or equal to K. i.e. d1 + d2 + d3 +...+ dN <= K
Is there any known geometric algorithm which solves this or similar problem with a better complexity? Kindly point me in the right direction or any hints are appreciated.
Thank you
You can do this efficiently with dynamic programming.
The key idea is that you don't need to enumerate all possible target strings, you just need to know how many ways targets are possible with K edits considering only the string indicies after I.
alphabet = 'abcd'
s = [ 'aabbbb', 'bacaaa', 'dabbbb', 'cabaaa']
# use memoized from http://wiki.python.org/moin/PythonDecoratorLibrary
#memoized
def count(edits_left, index):
if index == -1 and edits_left >= 0:
return 1
if edits_left < 0:
return 0
ret = 0
for char in alphabet:
edits_used = 0
for mutate_str in s:
if mutate_str[index] != char:
edits_used += 1
ret += count(edits_left - edits_used, index - 1)
return ret
Thinking out loud, it seems to me that this problem boils down to a combinatorial problem.
In general for a string S of length L, there are a total of C(L,K) (binomial coefficient) positions that can be substituted and therefore (ALPHABET_SIZE^K)*C(L,K) target strings T from a Hamming Distance of K.
Binomial Coefficient can be computed quite easily using Dynamic Programming and the Pascal Triangle... No need to get crazy into factoriel etc...
Now that one string case is treated, dealing with multiple strings is a little bit more tricky since you might double count targets. Intuitively though if S1 is K far from S2 then both string will generate the same set of target so you don't double count in this case. This last statement might be a long shot that's why I made sure to say "intuitively" :)
Hope it helps,

Resources