Say you have an ordered array of values representing x coordinates.
[0,25,50,60,75,100]
You might notice that without the 60, the values would be evenly spaced (by 25). That even spacing indicates a repeating pattern, which I need to recover from the list regardless of its length or its values. In this particular example, the algorithm should find and remove the 60.
There are no time or space complexity requirements.
Both the values in the list and the ideal spacing (e.g. 25) are unknown, so the algorithm must infer the spacing from the values themselves. In addition, neither the number of values nor the positions of the outliers in the array are guaranteed, and there may be more than one outlier. The algorithm should return a list with the outliers removed. Extra points if the algorithm uses a threshold for the spacing.
Edit: in the example image, there is one outlier on the x axis (the green line) and two on the y axis. The x coordinates in the array represent the rho of each line on that axis.
arr = [0,25,50,60,75,100]
First, construct the array of distances between consecutive values:
import numpy as np

# differences between consecutive values (np.diff(arr) does the same)
dist = np.array([arr[i + 1] - arr[i] for i in range(len(arr) - 1)])
print(dist)
>> [25 25 10 15 25]
Now I'm using np.where and np.percentile to cut the array into three parts: the main body, the upper values, and the lower values. I arbitrarily set the cut-offs at 5% on each side.
cond_sup = np.where(dist > np.percentile(dist, 95))
print(cond_sup)
>> (array([]),)
cond_inf = np.where(dist < np.percentile(dist, 5))
print(cond_inf)
>> (array([2]),)
You now have the indexes where the spacing differs from the rest.
So dist[2] has a problem, which means, by construction, that the problem lies between arr[2] and arr[2+1].
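One way to finish from there (my assumption: use the median gap as the spacing estimate, with a threshold of 5, since the question asks for a spacing threshold) is to walk the array and keep only values that continue the dominant spacing:
step = np.median(dist)                    # robust estimate of the true spacing
threshold = 5                             # tolerated deviation, an arbitrary choice
keep = [arr[0]]
for x in arr[1:]:
    # keep only values that continue the dominant spacing
    if abs((x - keep[-1]) - step) <= threshold:
        keep.append(x)
print(keep)   # [0, 25, 50, 75, 100]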
I don't know whether you want to remove one or more numbers from the array. I would solve the problem like this:
array A[] = [0, 25, 50, 60, 75, 100];
Sort the array (if needed).
Create a new array B[] whose i-th value is B[i] = A[i+1] - A[i].
Find the value that appears most often in B[]; that will be our distance.
Find i such that A[i+1] - A[i] != distance.
Find the smallest k > 1 such that A[i+k] - A[i] == distance.
Remove A[i+1] ... A[i+k-1].
I hope it is right.
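A rough Python version of these steps (assuming, as in the example, that each run of outliers sits between two values exactly one distance apart):
from collections import Counter

def remove_outliers(a):
    a = sorted(a)
    b = [a[i + 1] - a[i] for i in range(len(a) - 1)]
    distance = Counter(b).most_common(1)[0][0]   # most frequent gap
    result = [a[0]]
    for x in a[1:]:
        # keep only values that continue the dominant spacing
        if x - result[-1] == distance:
            result.append(x)
    return result

print(remove_outliers([0, 25, 50, 60, 75, 100]))  # [0, 25, 50, 75, 100]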
Original problem - context: NLP - from a list of n strings, choose all the strings that don't share any common words (ignoring the words in a pre-defined list of stop words).
Approach that I tried: using sklearn's CountVectorizer, get the vector for each string and compute the dot product of each vector with every other vector. Vectors with a zero dot product are added to a set.
This takes O(n^2) dot product computations. Is there a better way to approach this problem?
There is little you can do: consider the trivial case where each string has a single unique word. In order to determine that all the intersections are empty, you have to consider all n * (n - 1) / 2 pairs, hence the complexity is O(n^2 * v), where v is the number of unique words in your vocabulary.
For the typical case, however, there are better approaches. Assuming that the number of words in each string is much less than the number of unique words, it is better to iterate over the words of the strings, maybe even skipping the vectorization. Let 0 <= id[word] < nWords be a unique number for each word;
you could do
v1 = np.zeros(nWords)
for i in range(len(strings)):
    # mark the words of strings[i]
    for w in getWords(strings[i]):
        v1[id[w]] = 1
    for j in range(i + 1, len(strings)):
        for w in getWords(strings[j]):
            if v1[id[w]]:
                # strings[j] and strings[i] share at least one word
                break
    # unmark the words of strings[i] before the next iteration
    for w in getWords(strings[i]):
        v1[id[w]] = 0
This is still O(n * C), where C is the total number of words across all your strings.
You may also want to precompute getWords(strings[i]).
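To make the sketch concrete, here is a self-contained runnable version (the tokenizer, the vocabulary construction, and the example strings are my assumptions):
import numpy as np

strings = ["red apple", "green pear", "apple pie"]
getWords = str.split                       # simple whitespace tokenizer

vocab = sorted({w for s in strings for w in getWords(s)})
id = {w: k for k, w in enumerate(vocab)}   # 0 <= id[word] < nWords
nWords = len(vocab)

v1 = np.zeros(nWords)
for i in range(len(strings)):
    for w in getWords(strings[i]):
        v1[id[w]] = 1
    for j in range(i + 1, len(strings)):
        if not any(v1[id[w]] for w in getWords(strings[j])):
            print(strings[i], "|", strings[j])   # no word in common
    for w in getWords(strings[i]):
        v1[id[w]] = 0                            # reset for the next i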
I have a task: given a value N, I should generate a list of length L > 1 such that the sum of the squares of its elements is equal to N.
I wrote this code:
deltas = np.zeros(L)
deltas[0] = np.random.uniform(-N, N)
i = 1
while i < L and np.sum(deltas ** 2) < N ** 2:
    bound = np.sqrt(N ** 2 - np.sum(deltas ** 2))
    deltas[i] = np.random.uniform(-bound, bound)
    i += 1
But this approach takes a long time if I generate such a list many times (I think because of the loop).
Note that I don't want the list to consist of a single repeated value. The distribution of the values does not have to be uniform; I used uniform just as an example.
Could you suggest a faster approach? Maybe there is a special function in some library?
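One vectorized possibility, if (as the code above suggests) the target is a sum of squares equal to N**2: sample any nonzero vector and rescale it to the right norm. This is a minimal sketch; the resulting distribution is not uniform on the sphere, which you said is acceptable:
import numpy as np

def random_squares(N, L):
    v = np.random.uniform(-1.0, 1.0, L)
    # rescale so that np.sum(v**2) == N**2 (up to float rounding)
    return v * (N / np.linalg.norm(v))

deltas = random_squares(10.0, 5)
print(np.sum(deltas ** 2))   # ~100.0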
If you don't mind a few repeated 1s, you could do something like this:
def square_list(integer):
    components = []
    total = 0
    remaining = integer
    while total != integer:
        # largest integer whose square still fits in what remains
        component = int(remaining ** 0.5)
        remaining -= component ** 2
        components.append(component)
        total = sum(x ** 2 for x in components)
    return components
This code works by taking the largest square, then decreasing to the next largest square. It continues until the largest square is 1, which at worst results in three 1s in the list.
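For example (outputs traced by hand from the greedy rule above):
print(square_list(12))   # [3, 1, 1, 1]  ->  9 + 1 + 1 + 1 == 12
print(square_list(100))  # [10]          ->  a perfect square gives a single element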
If you are looking for a more random distribution, it might make sense to randomly transform remaining into a separate variable before taking its square root, i.e.:
value = transformation(remaining)
component = int(value ** 0.5)
which should give you more "random" values.
I am running the k-means algorithm to obtain the lowest cost, which results in a KxN matrix of centroids. K is the number of clusters (centroids) the algorithm creates at optimal cost, and N is the number of features; for example, K=2 implies 2 clusters (2 centroids). K-means is run in a loop for K=1 to 10, and the loop stops when the best optimal cost is obtained for a particular value of K; for example, if the optimal cost is obtained at K=2, the returned centroid matrix is 2xN. I want to store all the centroids returned by the loop in a list. Please note that K increases by 1 on every iteration of the loop, so the returned centroid matrices are of size 1xN, 2xN, 3xN, and so on.
How can I store these in a list so that I get something like this:
List = [[10,12,13], [[10,20,30],[1,2,3]], [[5,6,9],[4,12,20],[40,50,60]], ...]
Each loop iteration returns a KxN matrix that I want to store in the list, so that later I can access it by index, say List[i], to retrieve the KxN matrix.
I am mostly working with numpy.
Any suggestions would be a big help.
import numpy as np

N = 5
lst = []
for K in range(1, 11):
    # placeholder K x N matrix; in practice, append the centroids each run returns
    lst.append(np.empty((K, N)))
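Retrieving a stored matrix later is then just indexing (lst[i] holds an (i+1)xN array):
print(lst[1].shape)   # (2, 5): the 2 x N matrix stored for K = 2

# in the real loop you would append whatever your k-means run returns,
# e.g. with a hypothetical helper:
# centroids = run_kmeans(data, K)   # K x N ndarray
# lst.append(centroids)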
Question:
Given a text t[1...n, 1...n] and a pattern p[1...m, 1...m], with n = 2m, over the alphabet [0, Sigma-1], we say p matches t at [i, j] if t[i+k-1, j+l-1] = p[k, l] for all k, l. Design a randomized algorithm to find all matches in O(n^2) time with high probability.
Can someone help me understand what this text means? I believe it is saying that t has two words in it and that the pattern is also two words, but the length of the pattern is half that of t. However, from there I don't understand how the range [i, j] comes into play; that condition goes over my head.
It could also be saying that t and p are 2D arrays, and you are trying to match a "box" from the pattern inside the t 2D array.
Any help would be appreciated, thank you!
The problem asks you to find a 2D pattern, the p array, inside the t array, which is also 2D.
The most obvious randomized solution would be to generate two random indexes i and j and then start searching for the pattern from that (i, j).
To avoid redundant searches, you can keep track of which pairs (i, j) you have already visited, using a simple 2D lookup array.
The complexity of the above is O(n^3) in the worst case.
You can use hashing to compare the strings and reduce the complexity to O(n^2).
You first hash the t array row by row and store the values in an array, say hashT; you can use the rolling-hash technique for that.
You then hash the p array the same way, row by row, storing the hashes in an array hashP.
When you generate a random pair (i, j), you can get the hash of the corresponding slice of t from hashT in linear time, instead of the brute-force comparison that takes quadratic time. (Note that hashes can collide; when a hash matches, you can brute-force compare to be completely sure.)
To find the corresponding hash using hashT, suppose the current pair (i, j) is (3, 4) and the dimensions of the p array are 2 x 3.
Then row 3 of the window covers columns 4 through 6, so we compare hashT[3][6] - hashT[3][3] (scaled by the appropriate base power, as usual with prefix hashes) against the hash of the first row of p; this logic comes from the rolling-hash algorithm.
Pseudocode for checking one random (i, j) in linear time using hashing:
hashT[][], hashP[]
i = rand(), j = rand();
for (int k = i; k < i + lengthOfColumn(p); k++) {
    // compare t's row k, columns j .. j + lengthOfRow(p) - 1, against the
    // corresponding row of p (base-power scaling omitted for brevity)
    if ((hashT[k][j + lengthOfRow(p) - 1] - hashT[k][j - 1]) != hashP[k - i]) {
        // pattern does not match at (i, j)
        return false;
    }
}
return true;
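Here is a small self-contained Python sketch of the same row-hashing idea (the hash base, the modulus, and the example t and p are my assumptions; collisions are not re-verified here):
B, MOD = 257, (1 << 61) - 1

def prefix_hashes(row):
    # h[c] = polynomial hash of row[:c]
    h = [0] * (len(row) + 1)
    for c, ch in enumerate(row):
        h[c + 1] = (h[c] * B + ord(ch)) % MOD
    return h

def slice_hash(h, a, m, powB):
    # hash of the length-m slice starting at column a
    return (h[a + m] - h[a] * powB[m]) % MOD

t = ["abca", "bcab", "cabc", "abca"]
p = ["ca", "ab"]

powB = [1] * (len(t[0]) + 1)
for k in range(1, len(powB)):
    powB[k] = powB[k - 1] * B % MOD

hashT = [prefix_hashes(row) for row in t]
hashP = [prefix_hashes(row)[-1] for row in p]   # full-row hash of each p row

def matches_at(i, j):
    # compare every row of p against the aligned slice of t's rows i, i+1, ...
    return all(slice_hash(hashT[i + k], j, len(p[0]), powB) == hashP[k]
               for k in range(len(p)))

print(matches_at(0, 2))   # True: "ca" and "ab" start at (0, 2)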
I have a set of strings [S1, S2, S3, ..., Sn] and I need to count all target strings T such that each of S1, S2, ..., Sn can be converted into T within a total of K edits. All the strings are of fixed length L, and an edit here means Hamming distance.
All I have is a sort of brute-force approach:
with an alphabet of size 4, I have a sample space of O(4^L), and it takes O(L) time to check each candidate. I can't seem to bring the complexity down from exponential to polynomial or pseudo-polynomial. Is there any way to prune the sample space to do better?
I tried to visualize it in an L-dimensional vector space: I have been given N points and have to count all points whose sum of distances from those N points is at most K, i.e. d1 + d2 + d3 + ... + dN <= K.
Is there a known geometric algorithm that solves this, or a similar problem, with better complexity? Kindly point me in the right direction; any hints are appreciated.
Thank you
You can do this efficiently with dynamic programming.
The key idea is that you don't need to enumerate all possible target strings: you just need to know, for a given remaining edit budget, how many ways there are to fill in the string positions up to the current index.
import functools

alphabet = 'abcd'
s = ['aabbbb', 'bacaaa', 'dabbbb', 'cabaaa']

# memoize the recursion (the original used "memoized" from
# http://wiki.python.org/moin/PythonDecoratorLibrary)
@functools.lru_cache(maxsize=None)
def count(edits_left, index):
    # consumed every position without exhausting the budget: one valid target
    if index == -1 and edits_left >= 0:
        return 1
    if edits_left < 0:
        return 0
    ret = 0
    for char in alphabet:
        # edits needed to place `char` at this position in every string
        edits_used = sum(1 for mutate_str in s if mutate_str[index] != char)
        ret += count(edits_left - edits_used, index - 1)
    return ret
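For example, to count the targets reachable from all four strings within a total budget of K = 3 edits, start the recursion at the last position:
K = 3
print(count(K, len(s[0]) - 1))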
Thinking out loud, it seems to me that this problem boils down to a combinatorial problem.
In general, for a string S of length L, there are C(L, K) (binomial coefficient) ways to choose the positions to substitute, and each substituted position can take any of the other ALPHABET_SIZE - 1 characters, giving C(L, K) * (ALPHABET_SIZE - 1)^K target strings T at a Hamming distance of exactly K.
The binomial coefficient can be computed quite easily using dynamic programming and Pascal's triangle... no need to go crazy with factorials etc...
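A minimal sketch of that Pascal-triangle computation:
def binomial(L, K):
    # build Pascal's triangle row by row; row r holds C(r, 0) .. C(r, r)
    row = [1]
    for _ in range(L):
        row = [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]
    return row[K]

print(binomial(6, 2))   # 15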
Now that one string case is treated, dealing with multiple strings is a little bit more tricky since you might double count targets. Intuitively though if S1 is K far from S2 then both string will generate the same set of target so you don't double count in this case. This last statement might be a long shot that's why I made sure to say "intuitively" :)
Hope it helps,