I am trying to remove duplicate elements from a numpy array.
Eg:
a = np.array([[0.03,0.32],[0.09,0.26],[0.03,0.32]])
a = np.unique(a,axis=0)
This works perfectly.
But the problem is that this code is part of a function, and I run the function, say, 10 times. On one of those runs the system hangs at exactly this line.
I notice that the array has at most 3500 rows and each element (inner array) has length 60.
Why is this happening, and is there a more efficient way?
There are quite a few issues with what you're doing.
First, observe that np.unique does not work well with floating-point values, and will not in general filter out duplicate arrays of floats:
In [16]: a = np.array([[2.1*3, .123], [6.3, 2.05*.06]])
In [17]: a
Out[17]:
array([[6.3 , 0.123],
[6.3 , 0.123]])
In [18]: np.unique(a, axis=0)
Out[18]:
array([[6.3 , 0.123],
[6.3 , 0.123]])
Note that the duplicates are still in the result after calling np.unique. The reason is that np.unique compares on equality, meaning that the floats must match bit for bit. However, floating-point arithmetic is not exact, so you are not guaranteed to filter out duplicates correctly.
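One common workaround is to round before calling np.unique. The sketch below assumes that values agreeing to 8 decimal places should count as duplicates; that tolerance is my own choice for illustration, not something taken from the question:

```python
import numpy as np

# Two rows that are mathematically equal but differ in the last bits.
a = np.array([[2.1 * 3, 0.123], [6.3, 2.05 * 0.06]])

# Assumption: 8 decimal places is an acceptable tolerance for this data.
rounded = np.round(a, decimals=8)
deduped = np.unique(rounded, axis=0)
print(deduped.shape)
```

Rounding puts values into buckets, so two nearly equal values that straddle a rounding boundary can still come out as distinct rows. Treat this as a pragmatic fix rather than a guarantee.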
Secondly, in terms of performance, you can do better than np.unique with a hashable type. np.unique will always run in O(n log n) since it does a sort. You can verify this in the source code:
if optional_indices:
    perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
    aux = ar[perm]
else:
    ar.sort()
    aux = ar
So, regardless of how the conditional evaluates, a sort is performed over ar (which is the input array, see here for more detail: https://github.com/numpy/numpy/blob/v1.15.0/numpy/lib/arraysetops.py#L277). The reason for this is because np.unique supports a rich set of functionality (like getting the indices of dups, returning the count of dups, etc).
You don't have to sort to get unique elements. If you convert your rows into a hashable type (like tuple), then you can filter out duplicates in O(n) expected time. Here's an example:
In [37]: b
Out[37]:
[(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)]
In [39]: np.unique(b, axis=0)
Out[39]: array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
In [40]: set(b)
Out[40]: {(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)}
In [41]: %timeit np.unique(b, axis=0)
21.9 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [42]: %timeit set(b)
627 ns ± 5.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So, as you can see, just using the built-in set runs about 35x faster than np.unique (21.9 µs versus 627 ns). Please note this will not work correctly for arrays of floats, but I just wanted to show that np.unique is not particularly performant from an algorithmic perspective.
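If you start from a NumPy array rather than a list of tuples, the same idea might look like the sketch below. The helper name unique_rows_hashed is hypothetical, and the function still compares rows by exact equality, so the float caveat above applies unchanged:

```python
import numpy as np

def unique_rows_hashed(arr):
    """Drop duplicate rows of a 2-D array, keeping first occurrences in order."""
    # Tuples are hashable, so dict keys deduplicate in expected linear time.
    keys = dict.fromkeys(map(tuple, arr.tolist()))
    return np.array(list(keys), dtype=arr.dtype)

b = np.array([[0, 1, 2], [3, 4, 5], [0, 1, 2], [3, 4, 5], [6, 7, 8]])
print(unique_rows_hashed(b))
```

Unlike np.unique, this keeps rows in order of first appearance instead of sorted order, which may or may not be what you want.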
Lastly, 3500x60 is not really that big. You can loop through that pretty easily, even with a subpar algorithm, and it should not hang on any modern hardware. It should run pretty fast:
In [43]: np.random.seed(0)
In [46]: x = np.random.random((3500, 60))
In [49]: %timeit np.unique(x, axis=0)
2.57 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So it takes 2.57 milliseconds on my MacBook Pro, which isn't exactly a powerhouse in terms of hardware (2.3 GHz i5, 8GB of RAM). Make sure you're profiling your code, and make sure that the line in this question is actually the trouble line.
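Outside of IPython, a rough way to check whether this line is really the slow one is to time it directly. This is only a sketch with random data of the same shape as in the question; your real data and timings will differ:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((3500, 60))  # same shape as the case in the question

start = time.perf_counter()
result = np.unique(x, axis=0)
elapsed = time.perf_counter() - start

print(f"np.unique on {x.shape} took {elapsed * 1e3:.2f} ms")
```

If the call returns in milliseconds here but your function still hangs, the bottleneck is probably somewhere else in the function.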
HTH.
I'd like to take a vector and get an array of vectors in which the i-th element of each vector are the k neighbors of the i-th element of the original vector. Also, I'm looking for the fastest way to do so.
I've already done that in MATLAB:
a=zeros(k, length(v));
I=cell(1,k);
a(1,:) = v;
for j=2:k
a(k,:)=[a(k-1,2:end),a(k-1,1)];
end
aux1=[a(:,(end-r+1):end),a(:,1:(end-r))];
for j=1:k
I{k}=aux1(k,:);
end
For example, v = [1, 2, 3, 4, 5] and k = 1; and I want to get:
M = [[5, 1, 2, 3, 4], [1, 2, 3, 4, 5], [2, 3, 4, 5, 1]]
so that, for the 1st element of each vector, I get [5; 1; 2], which are the element 1 and its neighbors.
Hope it makes sense. Thanks for reading :)
You could use the numpy roll function:
import numpy as np

def get_neighbors(v, k):
    N = len(v)
    M = np.zeros((k*2+1, N), dtype=int)
    for i in range(-k, k+1):
        M[i+k, :] = np.roll(v, -i)
    return M

v = np.array([1, 2, 3, 4, 5])
k = 1
M = get_neighbors(v, k)
print(M)
Output:
[[5 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 1]]
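The same matrix can also be built without a Python loop by using modular index arithmetic. This is a sketch of an alternative to the roll-based function, and the name get_neighbors_indexed is made up for illustration:

```python
import numpy as np

def get_neighbors_indexed(v, k):
    """Rows are v shifted by -k..k positions, wrapping around at the ends."""
    v = np.asarray(v)
    offsets = np.arange(-k, k + 1)[:, None]      # shape (2k+1, 1)
    positions = np.arange(len(v))[None, :]       # shape (1, N)
    return v[(positions + offsets) % len(v)]     # shape (2k+1, N)

print(get_neighbors_indexed([1, 2, 3, 4, 5], 1))
```

It keeps the dtype of v, at the cost of materialising a (2k+1, N) index array.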
Using sliding_window_view on a repetition of your array does it in a "vectorized" way:
# Example array
a = np.arange(1,16)
k = 2 # Window of neighbors
# My solution
np.lib.stride_tricks.sliding_window_view(np.hstack([a,a,a]), (len(a),))[len(a)-k:len(a)+k+1]
Returns
array([[14, 15, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
[15, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
[ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1],
[ 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1, 2]])
Note that sliding_window_view creates just a view; it doesn't create new data. That is why I do not hesitate to create (in this example) 31 rows (3*15-15+1) and then keep only 5 of them: I do not really create them.
So the only real cost of this solution is in hstack, both CPU-wise and memory-wise.
That subset, by the way, was taken to abide strictly by what you asked. But, depending on what you intend to do, you may drop the subsetting. The important point is that if
T=np.lib.stride_tricks.sliding_window_view(np.hstack([a,a,a]), (len(a),))
Then T[len(a)+k] is the row made of the k-th neighbors, whether k is positive, negative, or 0 (the original row).
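A quick way to convince yourself of that indexing rule is to compare each row against np.roll. This is only a sanity-check sketch on the same example array:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(1, 16)
n = len(a)

# Every window of length n over three copies of a; this is a view, not a copy.
T = sliding_window_view(np.hstack([a, a, a]), (n,))

# T[n + k] should match a circular shift of a by k positions, for either sign of k.
checks = [np.array_equal(T[n + k], np.roll(a, -k)) for k in range(-3, 4)]
print(T.shape, all(checks))
```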
See timings, since it matters for you:

Sizes          This method   Roll method
len=15/k=2     51 μs         132 μs
len=15/k=7     51 μs         383 μs
len=1000/k=7   52 μs         422 μs
len=1M/k=7     6 ms          160 ms
len=1M/k=100   6 ms          2.2 s
The roll method is clearly proportional to the size of the window (O(k): it has one roll to perform per row of output), whereas sliding_window_view is just a view and does not really create rows, so it is O(1) as far as k is concerned. Both methods are equally affected by the length of the data (O(n) really, but it shows only for n big enough).
So, all together, this method is O(n) while the roll method is O(kn).
This is a pretty specific usage case, but I'm hoping someone out there is more familiar with PyTorch tensors than I am and can help me speed this up.
I'm working on implementing a custom similarity metric for a neural network and have successfully gotten it to work, but it is incredibly slow to calculate. Each epoch takes about a minute to run, which simply isn't going to work with how I wanted to compare it with other metrics. So, I've been trying to utilize PyTorch tensors more effectively to speed things up, but haven't had much success.
Basically, I need to sum up the integers in the 'counts' tensor between the min and max indices specified in the 'min' and 'max' tensors for each sample and cluster combination.
As mentioned, my original implementation using loops took about a minute per epoch to run, but I did manage to reduce that to about 18-20 seconds using list comprehensions:
# counts has size (16, 100), max and min have size (2708, 7, 16)
data_mass = torch.sum(torch.tensor([[[torch.pow(torch.sum(counts[k][min[i][j][k]:max[i][j][k]+1]) / divisor, 2) for k in range(len(counts))] for j in range(len(min[i]))] for i in range(len(min))]), 2)
This feels super janky, and I've seen some clever things done with PyTorch functions, but I haven't been able to find anything yet that addresses quite what I want to do. Thanks in advance! I'm happy to clarify anything that may not be clear, I understand the use case is a bit convoluted.
EDIT: I'll try and break down the code snippet above and provide a minimal example. Examples of minimal inputs might look like the following:
'min' and 'max' are both 3-dimensional tensors of shape (num_samples, num_clusters, num_features), such as this one of size (2, 3, 4)
min = tensor([[[1, 2, 3, 1],
[2, 1, 1, 2],
[1, 2, 2, 1]],
[[2, 3, 2, 1],
[3, 3, 1, 2],
[1, 0, 2, 1]]])
max = tensor([[[3, 3, 4, 4],
[3, 2, 3, 4],
[2, 4, 3, 2]],
[[4, 4, 3, 3],
[4, 4, 2, 3],
[2, 1, 3, 2]]])
'counts' is a 2-dimensional tensor of size(num_features, num_bins),
so for this example we'll say size (4, 5)
counts = tensor([[1, 2, 3, 4, 5],
[2, 5, 3, 1, 1],
[1, 2, 3, 4, 5],
[2, 5, 3, 1, 1]])
The core part of the code snippet given above is the summation of the counts tensor between the values given by the min and max tensors for each pair of indices given at each index in max/min. For the first sample/cluster combo above:
mins = [1, 2, 3, 1]
maxes = [3, 3, 4, 4]
#Starting with feature #1 (leftmost element of min/max, top row of counts),
we sum the values in counts between the indices specified by min and max:
min_value = mins[0] = 1
max_value = maxes[0] = 3
counts[0] = [1, 2, 3, 4, 5]
subset = counts[0][mins[0]:maxes[0]+1] = [2, 3, 4]
torch.sum(subset) = 9
#Second feature
min_value = mins[1] = 2
max_value = maxes[1] = 3
counts[1] = [2, 5, 3, 1, 1]
subset = counts[1][mins[1]:maxes[1]+1] = [3, 1]
torch.sum(subset) = 4
In my code snippet, I perform a few additional operations, but if we ignore those and just sum all the index pairs, the output will have the form
pre_sum_output = tensor([[[9, 4, 9, 10],
[7, 8, 9, 5],
[5, 5, 7, 8]],
[[12, 2, 7, 9],
[9, 2, 5, 4],
[5, 7, 7, 8]]])
Finally, I sum the output one final time along the third dimension:
data_mass = torch.sum(pre_sum_output, 2) = torch.tensor([[32, 29, 25],
[30, 20, 27]])
I then need to repeat this for every pair of mins and maxes in 'min' and 'max' (each [i][j][k]), hence the list comprehension above iterating through i and j to get each sample and cluster respectively.
By noticing that torch.sum(counts[0][mins[0]:maxes[0]+1]) is equal to cumsum[maxes[0]] - cumsum[mins[0]-1] (for mins[0] > 0; when mins[0] is 0 it is just cumsum[maxes[0]]), where cumsum = torch.cumsum(counts[0], dim=0), you can get rid of the loops like so:
# Dim of sample, clusters, etc.
S, C, F, B = range(4)
# Copy min and max over bins
min = min.unsqueeze(B)
max = max.unsqueeze(B)
# Copy counts over samples and clusters
counts = counts.reshape(
1, # S
1, # C
*counts.shape # F x B
)
# Number of samples, clusters, etc.
ns, nc, nf, nb = min.size(S), min.size(C), min.size(F), counts.size(B)
# Calculate cumulative sum and copy over samples and clusters
cum_counts = counts.cumsum(dim=B).expand(ns, nc, nf, nb)
# Prevent index error when min index is 0
is_zero = min == 0
lo = (min - 1).masked_fill(is_zero, 0)
# Compute the contiguous sum from min to max (inclusive)
lo_sum = cum_counts.gather(dim=B, index=lo)
hi_sum = cum_counts.gather(dim=B, index=max)
sum_counts = torch.where(is_zero, hi_sum, hi_sum - lo_sum)
pre_sum_output = sum_counts.squeeze(B)
You can then sum over dim 2 (the feature dimension) to get data_mass.
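To check the cumulative-sum identity independently of PyTorch, here is a NumPy sketch on the small example from the question. It uses a slightly different design from the code above: a zero column is prepended to the cumulative sum, which removes the need to special-case min == 0. The variable names are mine, and np.concatenate with fancy indexing stands in for torch.gather:

```python
import numpy as np

mins = np.array([[[1, 2, 3, 1], [2, 1, 1, 2], [1, 2, 2, 1]],
                 [[2, 3, 2, 1], [3, 3, 1, 2], [1, 0, 2, 1]]])
maxs = np.array([[[3, 3, 4, 4], [3, 2, 3, 4], [2, 4, 3, 2]],
                 [[4, 4, 3, 3], [4, 4, 2, 3], [2, 1, 3, 2]]])
counts = np.array([[1, 2, 3, 4, 5],
                   [2, 5, 3, 1, 1],
                   [1, 2, 3, 4, 5],
                   [2, 5, 3, 1, 1]])

# prefix[f, b] holds sum(counts[f, :b]); the leading zero column covers min == 0.
zeros = np.zeros((counts.shape[0], 1), dtype=counts.dtype)
prefix = np.concatenate([zeros, np.cumsum(counts, axis=1)], axis=1)

# Feature index for the last axis of mins/maxs; broadcasts over samples and clusters.
feat = np.arange(counts.shape[0])
range_sums = prefix[feat, maxs + 1] - prefix[feat, mins]   # shape (2, 3, 4)
data_mass = range_sums.sum(axis=2)                          # shape (2, 3)
print(data_mass)
```

The inclusive sum from min to max becomes prefix[max + 1] - prefix[min], so each range costs two lookups regardless of its width.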
I would like to know how to calculate the arithmetic mean of every pair of consecutive elements in a Python NumPy array, and save the values in another array.
col1sortedunique = [0.0610754, 0.27365186, 0.37697331, 0.46547072, 0.69995587, 0.72998093, 0.85794189]
thank you
If I understood you correctly you want to do something like this:
import numpy as np
arr = np.arange(0,10)
>>> array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
conse_mean = (arr[:-1]+arr[1:])/2
>>> array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5])
so that would be a mapping from an array with length N to one with length N-1.
Maybe an additional explanation of the syntax:
arr[1:]
>>> array([1, 2, 3, 4, 5, 6, 7, 8, 9])
would give you your array without the first element and
arr[:-1]
>>> array([0, 1, 2, 3, 4, 5, 6, 7, 8])
without the last.
Therefore you have two smaller arrays where an element and its consecutive neighbor have the same index, and you can just calculate the mean as is done above.
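Applied to the array from the question, the same slicing might look like this. The name pair_means is mine, and the input is converted to a NumPy array first because the arithmetic does not work on a plain list:

```python
import numpy as np

col1sortedunique = np.array([0.0610754, 0.27365186, 0.37697331, 0.46547072,
                             0.69995587, 0.72998093, 0.85794189])

# Mean of each element with its right-hand neighbour: 7 values in, 6 values out.
pair_means = (col1sortedunique[:-1] + col1sortedunique[1:]) / 2
print(pair_means)
```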
I know this has been answered many times and I went through every SO question on this topic, but none of them seemed to tackle my problem.
This code yields an exception:
TypeError: only integer scalar arrays can be converted to a scalar index
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
sindex = np.array([0, 3, 4])
eindex = np.array([2, 5, 6])
r = a[sindex: eindex]
I have an array with start indexes and another one with end indexes, and I simply want to extract whatever is in between them. Notice that the difference between sindex and eindex is constant, for example 2. So eindex is always whatever is in sindex + 2.
So the expected result should be:
[1, 2, 4, 5, 5, 6]
Is there a way to do this without a for loop?
For a constant interval difference, we can set up sliding windows and simply index with the starting indices array. Thus, we can use broadcasting_app or strided_app from this post -
d = 2 # interval difference
out = broadcasting_app(a, L = d, S = 1)[sindex].ravel()
out = strided_app(a, L = d, S = 1)[sindex].ravel()
Or use scikit-image's built-in view_as_windows -
from skimage.util.shape import view_as_windows
out = view_as_windows(a,d)[sindex].ravel()
To set d, we can use -
d = eindex[0] - sindex[0]
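Since broadcasting_app and strided_app are defined in the linked post rather than here, a self-contained sketch of the same windowing idea can use NumPy's own sliding_window_view (available in NumPy 1.20 and later). This is a substitute for those helpers, not their implementation:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
sindex = np.array([0, 3, 4])
eindex = np.array([2, 5, 6])

d = int(eindex[0] - sindex[0])            # constant window length
windows = sliding_window_view(a, d)       # row i is a[i:i+d], as a view
out = windows[sindex].ravel()
print(out)
```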
You can't tell compiled numpy to take multiple slices directly. The alternatives to joining multiple slices involve some sort of advanced indexing.
In [509]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
...:
...: sindex = np.array([0, 3, 4])
...: eindex = np.array([2, 5, 6])
The most obvious loop:
In [511]: np.hstack([a[i:j] for i,j in zip(sindex, eindex)])
Out[511]: array([1, 2, 4, 5, 5, 6])
A variation that uses the loop to construct indices first:
In [516]: a[np.hstack([np.arange(i,j) for i,j in zip(sindex, eindex)])]
Out[516]: array([1, 2, 4, 5, 5, 6])
Since the slice size is all the same, we can generate one arange and step that with sindex:
In [521]: a[np.arange(eindex[0]-sindex[0]) + sindex[:,None]]
Out[521]:
array([[1, 2],
[4, 5],
[5, 6]])
and then ravel. This is a more direct expression of @Divakar's broadcasting_app.
With this small example, timings are similar.
In [532]: timeit np.hstack([a[i:j] for i,j in zip(sindex, eindex)])
13.4 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [533]: timeit a[np.hstack([np.arange(i,j) for i,j in zip(sindex, eindex)])]
21.2 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [534]: timeit a[np.arange(eindex[0]-sindex[0])+sindex[:,None]].ravel()
10.1 µs ± 48.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [535]: timeit strided_app(a, L=2, S=1)[sindex].ravel()
21.8 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
strided_app and view_as_windows use striding tricks to view the array as an array of size d windows, and use sindex to select a subset of them.
In larger cases, relative timings may vary with the size of the slices versus the number of slices.
You can just use sindex. Refer to the following image.
E.g., let A = [3,4]
and let Y be an array of multiple values like
Y = [2,3,2,2,2,2,2,3,3,3,3,3]
Then I want to select all those labels of Y where Y is in A.
So I wrote the following code:
Yij = [Y[Y == x] for x in A]
Output:
[array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4])]
But this gives a list of arrays.
I, on the other hand, want a single flat array.
Any suggestions on how I can make this work?
A list comprehension solution:
>>> A = set([3, 4])
>>> Y = [2,3,2,2,2,2,2,3,3,3,3,3]
>>> Z = [y for y in Y if y in A]
>>> Z
[3, 3, 3, 3, 3, 3]
Here are some timings to show the performance difference between using set lookup and list lookup:
In [21]: A = set(range(0, 1000, 5))
In [22]: B = list(range(0, 1000, 5))
In [23]: C = list(range(0, 1000))
In [24]: %timeit [y for y in C if y in A]
59.6 µs ± 329 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [25]: %timeit [y for y in C if y in B]
2.94 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
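If Y is already a NumPy array, as the question suggests, a boolean mask built with np.isin gives a flat array directly. A small sketch on the question's example data:

```python
import numpy as np

A = np.array([3, 4])
Y = np.array([2, 3, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3])

mask = np.isin(Y, A)   # True where the label in Y is one of the values in A
Z = Y[mask]
print(Z)
```

The result keeps the original order of Y, whereas the per-value list comprehension in the question groups the matches by value.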