Generate strings whose lengths follow a normal distribution (Matlab)

I have an initial string: S = 'ABCDEFGH'
How can I generate 100 strings from S such that no character is repeated within a string and the characters in each string stay in order from 'A' to 'H'? The strings have different lengths, which should follow a normal distribution with mean = 4 and sd = 1.
The expected output should be 100 strings like the ones below (the actual strings will differ because they are generated randomly):
Output = { 'ABEGH'; 'ABE'; 'DH' ; 'BCGH' ..........; 'ABCDEGH'}
Thanks !

It's not clear what distribution you want. This is a generic answer for any length distribution.
S = 'ABCDEFGH'; %// input characters
distr = [.1 .2 .1 .2 .1 .1 .1 .1]; %// probability of getting lengths 1, 2, ..., numel(S)
n = randsample(numel(distr), 1, 1, distr); %// random length with the specified distribution
ind = sort(randperm(numel(S), n)); %// take n sorted values from 1, ..., numel(S);
result = S(ind);
Assuming all permutations produced by randperm are equally likely [1], the above code, conditioned on a given n, generates every possible n-character subsequence of S with the same probability.
[1] In old Matlab versions randperm was an m-function, and its source code made it clear that it produced all permutations with the same probability. In recent versions it is no longer an m-function, and its documentation doesn't specify that.
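For comparison, here is a minimal sketch of the same idea in Python rather than Matlab, using the normal length distribution from the question. It assumes that "normally distributed length" means rounding a normal draw (mean 4, sd 1) and clamping it to the range 1..8; that reading is an assumption, and the function name random_ordered_string is made up for illustration.

```python
import random

S = "ABCDEFGH"

def random_ordered_string(chars, mean=4.0, sd=1.0, rng=random):
    # Assumption: round a normal draw and clamp it to 1..len(chars).
    n = round(rng.gauss(mean, sd))
    n = max(1, min(len(chars), n))
    # Pick n distinct positions, then sort them so the characters
    # keep their original order and never repeat.
    idx = sorted(rng.sample(range(len(chars)), n))
    return "".join(chars[i] for i in idx)

strings = [random_ordered_string(S) for _ in range(100)]
```

Clamping slightly distorts the tails of the distribution, which may or may not matter for the intended use.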

Related

Search and remove algorithm

Say you have an ordered array of values representing x coordinates.
[0,25,50,60,75,100]
You might notice that without the 60, the values would be evenly spaced (25). This would be indicative of a repeating pattern, something that I need to extract using this list (regardless of the length and the values of the list). In this particular example, the algorithm should find and remove the 60.
There are no time or space complexity requirements.
Both the values in the list and the ideal spacing (e.g. 25) are unknown, so the algorithm must infer the spacing from the values. In addition, the number of values and the positions of the outliers in the array are not guaranteed, and there may be more than one outlier. The algorithm should return a list with the outliers removed. Extra points if the algorithm uses a threshold for the spacing.
Edit: Here is an example image
Here there is one outlier on the x axis. (green-line) There are two on the y axis. The x-coordinates of the array represent the rho of the line on that axis.
import numpy as np
arr = [0,25,50,60,75,100]
First construct the array of distances between consecutive values:
dist = np.diff(arr)  # dist[i] = arr[i+1] - arr[i]
print(dist)
>> [25 25 10 15 25]
Now I use np.where and np.percentile to cut the array into 3 parts: the main values, the upper values and the lower values. I arbitrarily set both cut-offs at 5% (the 95th and 5th percentiles).
cond_sup = np.where(dist > np.percentile(dist, 95))
print(cond_sup)
>> (array([]),)
cond_inf = np.where(dist < np.percentile(dist, 5))
print(cond_inf)
>> (array([2]),)
You now have the indexes where the spacing differs from the others.
So dist[2] has a problem, which by construction means the problem lies between arr[2] and arr[2+1].
I don't know whether you want to remove one or more numbers from this array, so I think the way to solve this problem is as follows:
take the array A[] = [0,25,50,60,75,100];
sort the array (if needed).
create a new array B[] whose i-th value is B[i] = A[i+1] - A[i]
find the value that appears most often in B[]. It will be our distance.
find i such that A[i+1]-A[i] != distance
find the smallest k > 1 such that A[i+k]-A[i] == distance
then we need to remove the elements from A[i+1] to A[i+k-1]
I hope this is right.
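The steps above can be sketched in plain Python as follows. This is only one possible reading of them: it assumes the first value belongs to the pattern and that no grid point is missing, and the function name remove_spacing_outliers and the tol parameter (the spacing threshold asked for in the question) are made up for illustration.

```python
from collections import Counter

def remove_spacing_outliers(values, tol=0):
    """Keep values spaced by the most common gap; drop the rest.

    Assumes the first value is part of the pattern.
    """
    a = sorted(values)
    if len(a) < 3:
        return a
    # B[i] = A[i+1] - A[i]
    gaps = [nxt - cur for cur, nxt in zip(a, a[1:])]
    # The most frequent gap is taken as the ideal spacing.
    distance = Counter(gaps).most_common(1)[0][0]
    kept = [a[0]]
    for x in a[1:]:
        # Keep x only if it sits one spacing (within tol) after the last kept value.
        if abs((x - kept[-1]) - distance) <= tol:
            kept.append(x)
    return kept

print(remove_spacing_outliers([0, 25, 50, 60, 75, 100]))
```

For the example list this drops the 60 and keeps the other five values. If the measured spacing is noisy, a small positive tol lets values such as 24 or 26 count as matching a spacing of 25.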

How to generate 1825 numbers with a step of 0.01 in a specific range

In the following code I want len(a) to be 1825 while keeping the step at 0.01, but when I print len(a) it gives me 73. To get a length of 1825 I have to generate the numbers from 2.275 to 3 with a step of 0.01, 73 times. How can I do that? I tried to use np.linspace, but it doesn't work for this case.
a = np.arange(2.275, 3, 0.01)
Seems like you want to draw 1825 values from a with np.random.choice:
>>> import numpy as np
>>> a = np.arange(2.275,3,0.01)
>>> c = np.random.choice(a, 1825)
>>> c
array([2.995, 2.545, 2.755, ..., 2.875, 2.275, 2.605])
>>> c.shape
(1825,)
Edit
If you want a repeated 25 times (i.e. 1825/73) in sequence, use np.tile()
target = 1825
n = target // len(a)  # 1825 // 73 = 25 full repetitions
np.tile(a, n)
yields
array([2.275, 2.285, 2.295, ..., 2.975, 2.985, 2.995])
Here's a one liner, given a = np.arange(2.275, 3, 0.01) and n = 1825:
a = np.broadcast_to(a, (n // a.size + bool(n % a.size), a.size)).ravel()[:n]
This uses np.broadcast_to to turn a into a matrix where it repeats itself enough times to fill 1825 elements. ravel then flattens the repeated list and the final slice chops off the unwanted elements. The ravel operation is what actually copies the list since the broadcast uses stride tricks to avoid copying the data.
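The same repeat-then-truncate idea can be illustrated without NumPy. The sketch below swaps in itertools from the standard library and uses a plain list as a stand-in for np.arange(2.275, 3, 0.01); the rounding to three decimals is an assumption made only to keep the stand-in values tidy.

```python
from itertools import cycle, islice

# Stand-in for np.arange(2.275, 3, 0.01): 73 values from 2.275 to 2.995.
a = [round(2.275 + 0.01 * i, 3) for i in range(73)]
n = 1825

# cycle() repeats the sequence endlessly; islice() stops after n items,
# so this also works when n is not a multiple of len(a).
repeated = list(islice(cycle(a), n))
```

Unlike the broadcast version, this builds a Python list element by element, so it is a conceptual illustration and not a replacement for the array code on large inputs.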

Python: Increasing one number in a list by one

I am trying to write a program that returns the frequency of a certain pattern. My frequency list is initially a list of zeros, and I want to increase a certain zero by one depending on the pattern. I have tried the code below, but it does not work.
FrequencyArray[j] = FrequencyArray[j]+1
Is there another way to increase one element of the list by 1 without affecting the other elements?
While your approach should work, this would be the alternative:
FrequencyArray[j] += 1
Example:
>>> zeros = [0, 0, 0]
>>> zeros[1] += 1
>>> zeros
[0, 1, 0]

determine strings that satisfy hamming distance matrix

I am trying to create a list of strings from a hamming distance matrix. Each string must be 20 characters long with a 4 letter alphabet (A,B,C,D). For example, say I have the following hamming distance matrix:
    S1  S2  S3
S1   0   5  12
S2   5   0  14
S3  12  14   0
From this matrix I need to create 3 strings, for example:
S1 = "ABBBBAAAAAAAAAABBBBB"
S2 = "BAAAAAAAAAAAAAABBBBB"
S3 = "CBBBABBBBBBBBBBBBBBB"
I created these strings manually, but I need to do this for a hamming distance matrix representing 100 strings which is not practical to do manually. Can anyone suggest an algorithm that can do this?
Thanks, Chris
That is a fun exercise. :-)
The following Octave script randomly generates n strings of length len. It then calculates the Hamming distance between all pairs of these strings.
Next, the strings are compared pairwise. If, for example, you search for [5 12 14], you will find that the table N contains strings that are 5 and 12 apart as well as strings that are 12 and 14 apart. The next challenge is of course to find the circuit in which the ones that are 5 and 12 apart can be put together with the ones that are 12 and 14 apart in such a way that the circuit "closes".
% We generate n strings of length len
n=50;
len=20;
% We have a categorical variable of size 4 (ABCD)
cat=4;
% We want to generate strings that correspond with the following hamming distance matrix
search=[5 12 14];
%search=[10 12 14 14 14 16];
S=squareform(search);
% Note that we generate each string totally random. If you need small distances it makes sense to introduce
% correlations across the strings
X=randi(cat,n,len); % values 1..cat map to the letters A..D
% Calculate the hamming distances
t=pdist(X,'hamming')*len;
% The big matrix we have to find our little matrix S within
Y=squareform(t);
% All the following might be replaced by something like submatrix(Y,S) if that would exist
R=zeros(size(S,1),size(Y,1));
for j = 1:size(S,1)
  M=zeros(size(Y,1),size(S,1));
  for i = 1:size(Y,1)
    M(i,:)=ismember(S(j,:),Y(i,:));
  endfor
  R(j,:)=all(M');
endfor
[x,y]=find(R);
% A will be a set of cells that contains the indices of the columns/rows that will make up our submatrices
A = accumarray(x,y,[], @(v) {sort(v).'});
% If for example the distance 5 doesn't occur at all, we can already drop out
if (sum(cellfun(@isempty,A)) > 0)
  printf("There are no matches\n");
  return
endif
% We are now gonna get all possible submatrices with the values in "search"
C = cell(1, numel(A));
[C{:}] = ndgrid( A{:} );
N = cell2mat( cellfun(@(v)v(:), C, 'UniformOutput',false) );
N = unique(sort(N,2), 'rows');
printf("Found %i potential matches (but contains duplicates)\n", size(N,1));
% We are now further filtering (remove duplicates)
[f,g]=mode(N,2);
h=g==1;
N=N(h,:);
printf("Found %i potential matches\n", size(N,1));
M=zeros(size(N,1),size(search,2));
for i = 1:size(N,1)
  f=N(i,:);
  M(i,:)=squareform(Y(f,f))';
endfor
F=squareform(S)';
% For now we forget about wrong permutations, so for search > 3 you need to filter these out!
M = sort(M,2);
F = sort(F,2);
% Get the sorted search string out of the (large) table M
% We search for the ones that "close" the circuit
D=ismember(M,F,'rows');
mf=find(D);
if (mf)
  matches=size(mf,1);
  printf("Found %i matches\n", matches);
  for i = 1:matches
    r=mf(i);
    printf("We return match %i (only check permutations now)\n", r);
    t=N(r,:)';
    str=X(t,:);
    check=squareform(pdist(str,'hamming')*len);
    strings=char(str+64)
    check
  endfor
else
  printf("There are no matches\n");
endif
It will generate strings such as:
ABAACCBCACABBABBAABA
ABACCCBCACBABAABACBA
CABBCBBBABCBBACAAACC
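Whatever method produces the candidate strings, it helps to verify them against the target matrix, which is what the check variable does at the end of the script. Below is a minimal Python sketch of that verification step, applied to the three hand-made strings from the question; the helper names hamming and distance_matrix are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(strings):
    """Full symmetric matrix of pairwise Hamming distances."""
    return [[hamming(a, b) for b in strings] for a in strings]

strings = [
    "ABBBBAAAAAAAAAABBBBB",  # S1
    "BAAAAAAAAAAAAAABBBBB",  # S2
    "CBBBABBBBBBBBBBBBBBB",  # S3
]
target = [[0, 5, 12], [5, 0, 14], [12, 14, 0]]

# True when the strings reproduce the requested distance matrix.
matches = distance_matrix(strings) == target
```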

Converting N strings to a common target string in maximum of K edits

I have a set of strings [S1 S2 S3 ... Sn] and I need to count all target strings T such that S1, S2, ..., Sn can all be converted into T within a total of K edits. All the strings have a fixed length L, and the number of edits here is the Hamming distance.
All I have is a sort of brute-force approach.
So, if my alphabet size is 4, I have a sample space of O(4^L) and it takes O(L) time to check each candidate. I can't seem to bring the complexity down from exponential to something polynomial or pseudo-polynomial! Is there any way to prune the sample space to do better?
I tried to visualize it in an L-dimensional vector space: I've been given N points and have to count all the points whose sum of distances from the given N points is less than or equal to K, i.e. d1 + d2 + d3 + ... + dN <= K
Is there any known geometric algorithm which solves this or similar problem with a better complexity? Kindly point me in the right direction or any hints are appreciated.
Thank you
You can do this efficiently with dynamic programming.
The key idea is that you don't need to enumerate all possible target strings. You only need to know, for each string index and each number of edits still available, how many ways the remaining positions of the target can be filled in.
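from functools import lru_cache

alphabet = 'abcd'
s = ['aabbbb', 'bacaaa', 'dabbbb', 'cabaaa']

# lru_cache is the standard-library counterpart of the memoized decorator
# from http://wiki.python.org/moin/PythonDecoratorLibrary
@lru_cache(maxsize=None)
def count(edits_left, index):
    # every position has been assigned within the edit budget: one valid target
    if index == -1 and edits_left >= 0:
        return 1
    # budget exceeded: no valid target on this branch
    if edits_left < 0:
        return 0
    ret = 0
    for char in alphabet:
        # edits needed at this position if the target has char here
        edits_used = 0
        for mutate_str in s:
            if mutate_str[index] != char:
                edits_used += 1
        ret += count(edits_left - edits_used, index - 1)
    return ret

# count(K, len(s[0]) - 1) gives the number of targets within K total edits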
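One way to gain confidence in the recurrence is to compare it with the brute-force enumeration described in the question on a small input. The sketch below is a self-contained variant, not the answer's exact code: it precomputes a per-position cost table and walks the indices forwards, and the function names count_targets_dp and count_targets_brute are made up for illustration.

```python
from functools import lru_cache
from itertools import product

def count_targets_dp(strings, alphabet, k):
    length = len(strings[0])
    # cost[i][c]: number of input strings whose i-th character differs from c
    cost = [{c: sum(s[i] != c for s in strings) for c in alphabet}
            for i in range(length)]

    @lru_cache(maxsize=None)
    def count(edits_left, index):
        if edits_left < 0:
            return 0          # budget exceeded on this branch
        if index == length:
            return 1          # every position assigned within budget
        return sum(count(edits_left - cost[index][c], index + 1)
                   for c in alphabet)

    return count(k, 0)

def count_targets_brute(strings, alphabet, k):
    # Enumerate all len(alphabet)**L targets and sum the Hamming distances.
    total = 0
    for target in product(alphabet, repeat=len(strings[0])):
        edits = sum(a != b for s in strings for a, b in zip(s, target))
        if edits <= k:
            total += 1
    return total
```

The memoized version visits at most (K + 1) * L states, each costing one pass over the alphabet once the cost table is built, whereas the brute force grows as 4^L for a four-letter alphabet.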
Thinking out loud, it seems to me that this problem boils down to a combinatorial problem.
In general, for a string S of length L there are C(L,K) (binomial coefficient) ways to choose the K positions to substitute, and each chosen position can change to any of the other ALPHABET_SIZE - 1 letters, so there are ((ALPHABET_SIZE - 1)^K)*C(L,K) target strings T at a Hamming distance of exactly K.
The binomial coefficient can be computed quite easily using dynamic programming and Pascal's triangle, so there is no need to compute factorials.
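The single-string count can be checked by enumeration on a small case. The sketch below counts strings at Hamming distance exactly K from a source string and compares the result with C(L,K) * (ALPHABET_SIZE - 1)^K, where each substituted position takes one of the other letters; the function names count_at_distance and closed_form are made up for illustration.

```python
from itertools import product
from math import comb

def count_at_distance(source, alphabet, k):
    """Count strings over alphabet at Hamming distance exactly k from source."""
    return sum(
        1
        for cand in product(alphabet, repeat=len(source))
        if sum(a != b for a, b in zip(source, cand)) == k
    )

def closed_form(length, alphabet_size, k):
    # Choose the k positions to change, then one of the other letters for each.
    return comb(length, k) * (alphabet_size - 1) ** k
```

Summing the closed form over 0..K gives the number of targets within K edits of a single string.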
Now that the single-string case is treated, dealing with multiple strings is a little trickier, since you might double count targets. Intuitively though, if S1 is K far from S2 then both strings will generate the same set of targets, so you don't double count in this case. This last statement might be a long shot, which is why I made sure to say "intuitively" :)
Hope it helps,

Resources