I am trying to create a list of strings from a Hamming distance matrix. Each string must be 20 characters long, over a 4-letter alphabet (A, B, C, D). For example, say I have the following Hamming distance matrix:
     S1  S2  S3
S1    0   5  12
S2    5   0  14
S3   12  14   0
From this matrix I need to create 3 strings, for example:
S1 = "ABBBBAAAAAAAAAABBBBB"
S2 = "BAAAAAAAAAAAAAABBBBB"
S3 = "CBBBABBBBBBBBBBBBBBB"
I created these strings manually, but I need to do this for a Hamming distance matrix representing 100 strings, which is not practical to do by hand. Can anyone suggest an algorithm that can do this?
Thanks, Chris
That is a fun exercise. :-)
The following Octave script randomly generates n strings of length len and then calculates the Hamming distances between all of them.
Next, strings are compared pairwise. If, for example, you search for [5 12 14], the table N will contain strings that are 5 and 12 apart, as well as strings that are 12 and 14 apart. The remaining challenge is to find the circuit in which the ones that are 5 and 12 apart can be put together with the ones that are 12 and 14 apart in such a way that the circuit "closes".
% We generate n strings of length len
n=50;
len=20;
% We have a categorical variable of size 4 (ABCD)
cat=4;
% We want to generate strings that correspond with the following hamming distance matrix
search=[5 12 14];
%search=[10 12 14 14 14 16];
S=squareform(search);
% Note that we generate each string totally random. If you need small distances it makes sense to introduce
% correlations across the strings
X=randi(cat,n,len); % values 1..cat map to A..D below
% Calculate the hamming distances
t=pdist(X,'hamming')*len;
% The big matrix we have to find our little matrix S within
Y=squareform(t);
% All the following might be replaced by something like submatrix(Y,S) if that existed
R=zeros(size(S,1),size(Y,1));
for j = 1:size(S,1)
M=zeros(size(Y,1),size(S,1));
for i = 1:size(Y,1)
M(i,:)=ismember(S(j,:),Y(i,:));
endfor
R(j,:)=all(M');
endfor
[x,y]=find(R);
% A will be a set of cells that contains the indices of the columns/rows that will make up our submatrices
A = accumarray(x,y,[], @(v) {sort(v).'});
% If for example the distance 5 doesn't occur at all, we can already drop out
if (sum(cellfun(@isempty,A)) > 0)
printf("There are no matches\n");
return
endif
% We are now gonna get all possible submatrices with the values in "search"
C = cell(1, numel(A));
[C{:}] = ndgrid( A{:} );
N = cell2mat( cellfun(@(v)v(:), C, 'UniformOutput',false) );
N = unique(sort(N,2), 'rows');
printf("Found %i potential matches (but contains duplicates)\n", size(N,1));
% We are now further filtering (remove duplicates)
[f,g]=mode(N,2);
h=g==1;
N=N(h,:);
printf("Found %i potential matches\n", size(N,1));
M=zeros(size(N,1),size(search,2));
for i = 1:size(N,1)
f=N(i,:);
M(i,:)=squareform(Y(f,f))';
endfor
F=squareform(S);
% For now we forget about wrong permutations, so for search > 3 you need to filter these out!
M = sort(M,2);
F = sort(F,2);
% Get the sorted search string out of the (large) table M
% We search for the ones that "close" the circuit
D=ismember(M,F,'rows');
mf=find(D);
if (mf)
matches=size(mf,1);
printf("Found %i matches\n", matches);
for i = 1:matches
r=mf(i);
printf("We return match %i (only check permutations now)\n", r);
t=N(r,:)';
str=X(t,:);
check=squareform(pdist(str,'hamming')*len);
strings=char(str+64)
check
endfor
else
printf("There are no matches\n");
endif
It will generate strings such as:
ABAACCBCACABBABBAABA
ABACCCBCACBABAABACBA
CABBCBBBABCBBACAAACC
Related
I have a project where I need to find optimal triplets of numbers (E, F, G) such that E and F are very close to each other (their difference is smallest) and G is bigger than both E and F. I have to make n such triplets.
The way I thought about it: sort the given list of numbers, then search for the smallest difference; those two numbers become E and F. After all n pairs are done, I search for a G for every (E, F) pair such that G is bigger than both. I know this is the greedy way, but my code is very slow; it takes up to a minute when the list has around 300k numbers and I have to make 2k triplets. Any idea how to improve the code?
guests is n (the number of triplets)
sticks is the list of all the numbers
stick_E, stick_F, stick_G = [], [], []  # result lists (not shown in the post, but needed to run)
length = len(sticks)
# We sort the list using the inbuilt function
sticks.sort()
save = guests  # Beginning to search for the best pairs of E and F
efficiency = 0
while save != 0:
    difference = 1000000  # We assign a big value to difference each time
    # Searching for the smallest difference between two elements
    for i in range(0, length - 1):
        if sticks[i+1] - sticks[i] < difference:
            temp_E = i
            temp_F = i + 1
            difference = sticks[i+1] - sticks[i]
    # Saving the two elements in lists stick_E and stick_F
    stick_E.append(sticks[temp_E])
    stick_F.append(sticks[temp_F])
    # Calculating the efficiency
    efficiency += (sticks[temp_F] - sticks[temp_E]) * (sticks[temp_F] - sticks[temp_E])
    # Deleting the two elements from the main list
    sticks.pop(temp_E)
    sticks.pop(temp_E)
    length -= 2
    save -= 1
# Searching for stick_G for every pair made
for i in range(0, len(stick_F)):
    for j in range(0, length):
        if stick_F[i] < sticks[j]:
            stick_G.append(sticks[j])  # Saves the element found
            sticks.pop(j)  # Deletes the element from the main list
            length -= 1
            break
# Output the result to a local file
print_to_file(stick_E, stick_F, stick_G, efficiency, output_file)
I commented the code the best I could so it would be easier for you to understand.
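One way to attack the bottleneck, sketched here rather than taken from the post: the full rescan of the list for every triplet makes the pair search roughly O(n·guests). Keeping the adjacent gaps in a heap with lazy deletion (plus left/right pointers over the live elements) makes the same greedy pair choices in O(n log n) overall; pick_pairs is a hypothetical helper name.
import heapq

def pick_pairs(sticks, guests):
    sticks = sorted(sticks)
    n = len(sticks)
    left = list(range(-1, n - 1))   # index of the previous live element
    right = list(range(1, n + 1))   # index of the next live element
    alive = [True] * n
    heap = [(sticks[i + 1] - sticks[i], i, i + 1) for i in range(n - 1)]
    heapq.heapify(heap)
    pairs = []
    while len(pairs) < guests and heap:
        gap, i, j = heapq.heappop(heap)
        # skip stale entries whose endpoints were already consumed
        if not (alive[i] and alive[j]) or right[i] != j:
            continue
        pairs.append((sticks[i], sticks[j]))  # (E, F)
        alive[i] = alive[j] = False
        l, r = left[i], right[j]
        if l >= 0:
            right[l] = r
        if r < n:
            left[r] = l
        if l >= 0 and r < n:
            # removing the pair makes l and r adjacent: push their new gap
            heapq.heappush(heap, (sticks[r] - sticks[l], l, r))
    return pairs
The G search can then run over the leftover live elements as before.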
I have a task: given a value N, I should generate a list of length L > 1 such that the sum of the squares of its elements is equal to N.
I wrote this code:
import numpy as np

deltas = np.zeros(L)
deltas[0] = np.random.uniform(-N, N)
i = 1
while i < L and np.sum(deltas**2) < N**2:
    bound = np.sqrt(N**2 - np.sum(deltas**2))
    deltas[i] = np.random.uniform(-bound, bound)
    i += 1
But this approach takes a long time if I generate such a list many times (I think because of the loop).
Note that I don't want my list to consist of just one unique value. The distribution of values does not have to be uniform; I took uniform just as an example.
Could you suggest any faster approach? Maybe there is a special function in some library?
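(A common loop-free approach, sketched here as an aside rather than taken from the answer below: draw a random vector and rescale it so the squared norm hits the target exactly. Following the code above, the target is a sum of squares of N**2; the resulting values are not uniformly distributed, but the question says that is acceptable.)
import numpy as np

def random_squares(N, L):
    # draw L values, then rescale so that np.sum(deltas**2) == N**2 exactly
    deltas = np.random.uniform(-1, 1, L)
    return deltas * (N / np.linalg.norm(deltas))

deltas = random_squares(5.0, 10)
print(np.sum(deltas**2))  # ~25.0, up to floating-point rounding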
If you didn't mind a few repeating 1s, you could do something like this:
def square_list(integer):
    components = []
    total = 0
    remaining = integer
    while total != integer:
        component = int(remaining ** 0.5)
        remaining -= component ** 2
        components.append(component)
        total = sum([x ** 2 for x in components])
    return components
This code works by taking the largest remaining square and then decreasing to the next largest square. It continues until the largest square is 1, which could at worst result in three 1s in the list.
If you are looking for a more random distribution, it might make sense to randomly transform remaining as a separate variable before subtracting from it.
IE:
value = transformation(remaining)
component = int(value ** 0.5)
which should give you more "random" values.
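For instance, a minimal sketch of one such transformation (random.randint is an arbitrary choice here, not the only option):
import random

def transformation(remaining):
    # use a random portion of the remaining budget so the
    # components differ from run to run
    return random.randint(1, remaining)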
Say you have an ordered array of values representing x coordinates.
[0,25,50,60,75,100]
You might notice that without the 60, the values would be evenly spaced (25). This would be indicative of a repeating pattern, something that I need to extract using this list (regardless of the length and the values of the list). In this particular example, the algorithm should find and remove the 60.
There are no time or space complexity requirements.
Both the values in the list and the ideal spacing (e.g. 25) are unknown, so the algorithm must infer them from the values. In addition, neither the number of values nor the positions of the outliers in the array are guaranteed; there may be more than one outlier. The algorithm should return a list with the outliers removed. Extra points if the algorithm uses a threshold for the spacing.
Edit: Here is an example image
Here there is one outlier on the x axis (green line), and there are two on the y axis. The x-coordinates of the array represent the rho of the line on that axis.
import numpy as np

arr = [0,25,50,60,75,100]
First construct the distances array
dist = np.array([arr[i+1] - arr[i] for (i, _) in enumerate(arr) if i < len(arr)-1])
print(dist)
>> [25 25 10 15 25]
Now I'm using np.where and np.percentile to cut the array into 3 parts: the main values, the upper values, and the lower values. I arbitrarily set the cutoffs at the 5th and 95th percentiles.
cond_sup = np.where(dist > np.percentile(dist, 95))
print(cond_sup)
>> (array([]),)
cond_inf = np.where(dist < np.percentile(dist, 5))
print(cond_inf)
>> (array([2]),)
You now have the indexes where the value differs from the others.
So dist[2] has a problem, which means by construction that the problem is between arr[2] and arr[2+1].
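To actually remove the flagged values, one possible follow-up (a sketch; it assumes the right endpoint of each too-small gap is the outlier, which holds for this example):
import numpy as np

arr = np.array([0, 25, 50, 60, 75, 100])
dist = np.diff(arr)  # same distances as above
bad = np.where(dist < np.percentile(dist, 5))[0]
# each flagged gap sits between arr[i] and arr[i+1]; drop the right endpoint
cleaned = np.delete(arr, bad + 1)
print(cleaned)  # [  0  25  50  75 100]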
I don't know whether you want to remove one or more numbers from this array. I think the way to solve the problem is like this (a Python sketch follows the steps):
array A[] = [0,25,50,60,75,100];
sort the array (if needed).
create a new array B[] with i-th value B[i] = A[i+1] - A[i]
find the value that appears most often among the elements of B[]. That will be our distance.
find i such that A[i+1]-A[i] != distance
find the smallest k >= 1 such that A[i+k]-A[i] == distance
then we need to remove A[i+1] ... A[i+k-1]
I hope it is right.
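A minimal Python sketch of these steps (two assumptions: a single spacing value dominates the difference array, and the first element is not itself an outlier):
from collections import Counter

def remove_outliers(arr):
    arr = sorted(arr)                              # step 1
    diffs = [b - a for a, b in zip(arr, arr[1:])]  # step 2: B[i] = A[i+1] - A[i]
    spacing = Counter(diffs).most_common(1)[0][0]  # step 3: most frequent distance
    result = [arr[0]]
    for x in arr[1:]:
        # steps 4-6: keep only values that continue the dominant spacing
        if x - result[-1] == spacing:
            result.append(x)
    return result

print(remove_outliers([0, 25, 50, 60, 75, 100]))  # [0, 25, 50, 75, 100]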
I have an initial string: S = 'ABCDEFGH'
How can I generate 100 strings from S such that no string has a repeated character and the characters in each string stay in order from 'A' to 'H'? Every string has a different length, drawn from a normal distribution with mean = 4 and sd = 1.
The expected output (it may differ, because the strings are generated randomly) should be 100 strings like these:
Output = { 'ABEGH'; 'ABE'; 'DH' ; 'BCGH' ..........; 'ABCDEGH'}
Thanks!
It's not clear what distribution you want. This is a generic answer for any length distribution.
S = 'ABCDEFGH'; %// input characters
distr = [.1 .2 .1 .2 .1 .1 .1 .1]; %// probability of getting lengths 1, 2, ..., numel(S)
n = randsample(numel(distr), 1, 1, distr); %// random length with the specified distribution
ind = sort(randperm(numel(S), n)); %// take n sorted values from 1, ..., numel(S);
result = S(ind);
Assuming all permutations produced by randperm are equally likely¹, the above code, conditioned on a given n, generates all possible n-character substrings with the same probability.
¹
In old MATLAB versions randperm was an m-function. From its source code it was clear that it produced all permutations with the same probability. In recent versions it's not an m-function anymore, and its documentation doesn't specify that.
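For comparison, a rough Python sketch of the same idea, using the normal length distribution mentioned in the question (an assumption here: the rounded draw is clamped to a valid length 1..len(S), which slightly distorts the tails):
import random

S = 'ABCDEFGH'

def random_substring(mean=4.0, sd=1.0):
    # round a normal draw and clamp it to a valid length
    n = max(1, min(len(S), round(random.gauss(mean, sd))))
    # sample n distinct positions and keep them in alphabetical order
    return ''.join(sorted(random.sample(S, n)))

strings = [random_substring() for _ in range(100)]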
I have a set of strings [S1 S2 S3 ... Sn] and I need to count all target strings T such that each of S1, S2, ..., Sn can be converted into T within a total of K edits. All the strings are of fixed length L, and an edit here means Hamming distance.
All I have is a sort of brute-force approach.
So, if my alphabet size is 4, I have a sample space of O(4^L) and it takes O(L) time to check each candidate. I can't seem to bring the complexity down from exponential to polynomial or pseudo-polynomial. Is there any way to prune the sample space to do better?
I tried to visualize it as an L-dimensional vector space: I have been given N points and have to count all the points whose sum of distances from the given N points is less than or equal to K, i.e. d1 + d2 + d3 + ... + dN <= K.
Is there any known geometric algorithm which solves this or a similar problem with better complexity? Kindly point me in the right direction; any hints are appreciated.
Thank you
You can do this efficiently with dynamic programming.
The key idea is that you don't need to enumerate all possible target strings; you just need to know how many targets are possible with a given number of edits left, considering only the string indices after the current one.
alphabet = 'abcd'
s = [ 'aabbbb', 'bacaaa', 'dabbbb', 'cabaaa']
# use memoized from http://wiki.python.org/moin/PythonDecoratorLibrary
@memoized
def count(edits_left, index):
    # number of ways to fill positions 0..index within the remaining edit budget
    if index == -1 and edits_left >= 0:
        return 1
    if edits_left < 0:
        return 0
    ret = 0
    for char in alphabet:
        # edits needed so that every input string matches char at this index
        edits_used = 0
        for mutate_str in s:
            if mutate_str[index] != char:
                edits_used += 1
        ret += count(edits_left - edits_used, index - 1)
    return ret
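To run it, call count with the total edit budget and the last string index. A minimal usage sketch (K = 10 is an assumed example value; if you don't want to copy the memoized decorator, functools.lru_cache behaves the same here):
K = 10  # assumed total edit budget for the example
print(count(K, len(s[0]) - 1))  # targets reachable from all of s within K total edits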
Thinking out loud, it seems to me that this problem boils down to a combinatorial one.
In general, for a string S of length L, there are C(L,K) (binomial coefficient) ways to choose the positions to substitute, and therefore ((ALPHABET_SIZE-1)^K)*C(L,K) target strings T at a Hamming distance of exactly K, since each substituted position must change to one of the other ALPHABET_SIZE-1 letters.
The binomial coefficient can be computed quite easily using dynamic programming and Pascal's triangle; no need to go crazy with factorials etc.
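For instance, a short sketch of that Pascal's-triangle DP:
def binomial_row(n):
    # row n of Pascal's triangle holds C(n, 0), ..., C(n, n)
    row = [1]
    for _ in range(n):
        # each entry is the sum of the two entries above it
        row = [a + b for a, b in zip([0] + row, row + [0])]
    return row

print(binomial_row(20)[5])  # C(20, 5) = 15504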
Now that the one-string case is treated, dealing with multiple strings is a little trickier, since you might double-count targets. Intuitively though, if S1 is K away from S2, then both strings will generate the same set of targets, so you don't double-count in this case. This last statement might be a long shot; that's why I made sure to say "intuitively" :)
Hope it helps,