Slicing an array with arrays - python-3.x

I know this has been answered many times and I went through every SO question on this topic, but none of them seemed to tackle my problem.
This code yields an exception:
TypeError: only integer scalar arrays can be converted to a scalar index
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
sindex = np.array([0, 3, 4])
eindex = np.array([2, 5, 6])
r = a[sindex: eindex]
I have an array of start indexes and another of end indexes, and I simply want to extract whatever is in between them. Note that the difference between sindex and eindex is constant, for example 2, so eindex is always sindex + 2.
So the expected result should be:
[1, 2, 4, 5, 5, 6]
Is there a way to do this without a for loop?

For a constant interval difference, we can set up sliding windows and simply index with the array of starting indices. Thus, we can use broadcasting_app or strided_app from this post -
d = 2 # interval difference
out = broadcasting_app(a, L = d, S = 1)[sindex].ravel()
out = strided_app(a, L = d, S = 1)[sindex].ravel()
Or use scikit-image's built-in view_as_windows -
from skimage.util.shape import view_as_windows
out = view_as_windows(a,d)[sindex].ravel()
To set d, we can use -
d = eindex[0] - sindex[0]
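In case the helpers from that linked post aren't at hand, here is a minimal sketch of a strided_app-style window builder; the (a, L, S) signature is assumed to match the one referenced above -
import numpy as np

def strided_app(a, L, S):
    # Sliding windows of length L with stride S, built as a strided view (no copy).
    # Sketch only; assumes a is a contiguous 1-D array.
    nrows = (a.size - L) // S + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S * n, n))

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
sindex = np.array([0, 3, 4])
d = 2
out = strided_app(a, L=d, S=1)[sindex].ravel()
# out -> array([1, 2, 4, 5, 5, 6])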

You can't tell compiled numpy to take multiple slices directly. The alternatives to joining multiple slices involve some sort of advanced indexing.
In [509]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
...:
...: sindex = np.array([0, 3, 4])
...: eindex = np.array([2, 5, 6])
The most obvious loop:
In [511]: np.hstack([a[i:j] for i,j in zip(sindex, eindex)])
Out[511]: array([1, 2, 4, 5, 5, 6])
A variation that uses the loop to construct indices first:
In [516]: a[np.hstack([np.arange(i,j) for i,j in zip(sindex, eindex)])]
Out[516]: array([1, 2, 4, 5, 5, 6])
Since the slices all have the same size, we can generate one arange and offset it with sindex:
In [521]: a[np.arange(eindex[0]-sindex[0]) + sindex[:,None]]
Out[521]:
array([[1, 2],
[4, 5],
[5, 6]])
and then ravel. This is a more direct expression of @Divakar's broadcasting_app.
With this small example, timings are similar.
In [532]: timeit np.hstack([a[i:j] for i,j in zip(sindex, eindex)])
13.4 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [533]: timeit a[np.hstack([np.arange(i,j) for i,j in zip(sindex, eindex)])]
21.2 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [534]: timeit a[np.arange(eindex[0]-sindex[0])+sindex[:,None]].ravel()
10.1 µs ± 48.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [535]: timeit strided_app(a, L=2, S=1)[sindex].ravel()
21.8 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
strided_app and view_as_windows use striding tricks to view the array as an array of size d windows, and use sindex to select a subset of them.
In larger cases, relative timings may vary with the size of the slices versus the number of slices.

You can just use sindex. Refer to the following image.


What is wrong when a for loop cannot be finished in the python console?

Can anyone tell me why this for loop does not work?
lst = []
c_lst = []
for i in range(182):
    c_lst.append(df.loc[i, 'LE']
    c_lst = 5 * c_lst
    lst = lst + c_lst
I cannot finish the loop (the prompt does not appear again) in the Python console and I don't see why this wouldn't work.
df is a dataframe with 182 rows, and 'LE' is the name of one of its columns. I want to create a list lst in which every element of the column 'LE' appears 5 times.
Rather than using a for loop, consider using array operations with numpy. numpy.tile will repeat the entire df.LE vector, which you can then flatten with numpy.ndarray.ravel.
Using a sample dataframe which counts from 0 to 499:
In [4]: df = pd.DataFrame({'LE': np.arange(500)})
The array can be repeated 5 times horizontally, then unrolled to get the desired output [0, 0, 0, 0, 0, 1, 1, ..., 499, 499, 499, 499, 499]:
In [5]: np.tile(df[['LE']], (1, 5)).ravel()
Out[5]: array([ 0, 0, 0, ..., 499, 499, 499])
The vectorized method is significantly faster:
In [11]: %timeit np.tile(df[['LE']], (1, 5)).ravel()
453 µs ± 51.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [12]: %%timeit
...: lst = []
...: for i in range(len(df)):
...:     c_lst = []
...:     c_lst.append(df.loc[i, 'LE'])
...:     c_lst = 5 * c_lst
...:     lst = lst + c_lst
...:
4.75 ms ± 57.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Here, the for-loop takes 10x longer. But for a larger array, e.g. one with 10k elements, the difference really appears:
In [13]: df = pd.DataFrame({'LE': np.arange(10000)})
In [14]: %timeit np.tile(df[['LE']], (1, 5)).ravel()
623 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]: %%timeit
...: lst = []
...: for i in range(len(df)):
...:     c_lst = []
...:     c_lst.append(df.loc[i, 'LE'])
...:     c_lst = 5 * c_lst
...:     lst = lst + c_lst
...:
609 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here the for loop is 1000x slower. I tried this with 1 million elements, but got tired of waiting for the for loop to complete... haha.
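Another option worth mentioning: since each value just needs to be repeated five times in a row, np.repeat does this in a single call. A minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'LE': np.arange(500)})

# Repeat each element of the column 5 times consecutively.
lst = np.repeat(df['LE'].values, 5).tolist()
# lst[:7] -> [0, 0, 0, 0, 0, 1, 1]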
Could it be because you have a missing parenthesis ) at the end of the 4th line and you are trying to type the code into the terminal window?
I found what was wrong. The code should be:
lst = []
for i in range(182):
    c_lst = []
    c_lst.append(df.loc[i, 'LE'])
    c_lst = 5 * c_lst
    lst = lst + c_lst
This way the for loop comes to an end and lst is what I wanted. Thanks.

How to find an index of a permutation of an array with package "numpy_indexed"? [duplicate]

I have two numpy arrays, A and B. A contains unique values and B is a sub-array of A.
Now I am looking for a way to get the index of B's values within A.
For example:
A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,7,10])
# I need a function fun() that:
fun(A,B)
>> 0,6,9
You can use np.in1d with np.nonzero -
np.nonzero(np.in1d(A,B))[0]
You can also use np.searchsorted, if you care about maintaining the order -
np.searchsorted(A,B)
For a generic case, when A & B are unsorted arrays, you can bring in the sorter option in np.searchsorted, like so -
sort_idx = A.argsort()
out = sort_idx[np.searchsorted(A,B,sorter = sort_idx)]
I would also throw my favorite broadcasting-based approach into the mix to solve a generic case -
np.nonzero(B[:,None] == A)[1]
Sample run -
In [125]: A
Out[125]: array([ 7, 5, 1, 6, 10, 9, 8])
In [126]: B
Out[126]: array([ 1, 10, 7])
In [127]: sort_idx = A.argsort()
In [128]: sort_idx[np.searchsorted(A,B,sorter = sort_idx)]
Out[128]: array([2, 4, 0])
In [129]: np.nonzero(B[:,None] == A)[1]
Out[129]: array([2, 4, 0])
Have you tried searchsorted?
A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,7,10])
A.searchsorted(B)
# array([0, 6, 9])
Just for completeness: if the values in A are non-negative and reasonably small:
lookup = np.empty((np.max(A) + 1), dtype=int)
lookup[A] = np.arange(len(A))
indices = lookup[B]
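A quick demonstration of that lookup-table idea on the arrays from the question (sketch only; it assumes the values in A are small enough to use as indices) -
import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
B = np.array([1, 7, 10])

# Inverse lookup table: lookup[value] -> position of value in A.
lookup = np.empty(np.max(A) + 1, dtype=int)
lookup[A] = np.arange(len(A))
indices = lookup[B]
# indices -> array([0, 6, 9])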
I had the same question recently. However, timing performance was very critical for me, so I guess a timing comparison of the different solutions may be useful for others.
As Divakar mentioned, you can use np.in1d(A, B) with np.where or np.nonzero. Moreover, you can use np.in1d(A, B) with np.intersect1d (based on this page). Also, np.searchsorted is another useful approach for sorted arrays.
I want to add another simple solution: a list comprehension. It may take longer than the previous ones; however, if you take advantage of the Numba package, it is much less time-consuming.
In [1]: import numpy as np
In [2]: from numba import njit
In [3]: a = np.array([1,2,3,4,5,6,7,8,9,10])
In [4]: b = np.array([1,7,10])
In [5]: np.where(np.in1d(a, b))[0]
...: array([0, 6, 9])
In [6]: np.nonzero(np.in1d(a, b))[0]
...: array([0, 6, 9])
In [7]: np.searchsorted(a, b)
...: array([0, 6, 9])
In [8]: np.searchsorted(a, np.intersect1d(a, b))
...: array([0, 6, 9])
In [9]: [i for i, x in enumerate(a) if x in b]
...: [0, 6, 9]
In [10]: @njit
...: def func(a, b):
...:     return [i for i, x in enumerate(a) if x in b]
In [11]: func(a, b)
...: [0, 6, 9]
Now, let's compare the timing performance of these solutions.
In [12]: %timeit np.where(np.in1d(a, b))[0]
4.26 µs ± 6.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit np.nonzero(np.in1d(a, b))[0]
4.39 µs ± 14.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit np.searchsorted(a, b)
800 ns ± 6.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [15]: %timeit np.searchsorted(a, np.intersect1d(a, b))
8.8 µs ± 73.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [16]: %timeit [i for i, x in enumerate(a) if x in b]
15.4 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [17]: %timeit func(a, b)
336 ns ± 0.579 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

converting sequences of nucleotides into a 2D array of integers

I am trying to convert nucleotides to integers using the following mapping:
A -> 0
C -> 1
G -> 2
T -> 3
The nucleotide sequences are saved in a pandas dataframe, which looks like:
          0
0  GGATAATA
1  CGATAACC
I have used the df.apply() method to do the task. Here is the code:
import numpy as np
import pandas as pd
a = ["GGATAATA","CGATAACC"]
d = dict(zip('A C G T'.split(), range(4)))
df = pd.DataFrame(a)
mapping = df[0].apply(lambda s: np.array([d[i] for i in s]))
It returns the following numpy array, which is one-dimensional:
print(mapping.values)
array([array([2, 2, 0, 3, 0, 0, 3, 0]), array([1, 2, 0, 3, 0, 0, 1, 1])],
dtype=object)
However, the expected output should be a two-dimensional array:
[[2,2,0,3,0,0,3,0],
[1,2,0,3,0,0,1,1]]
IIUC
df[0].apply(list).explode().replace(d).groupby(level=0).agg(list).to_list()
Out[579]: [[2, 2, 0, 3, 0, 0, 3, 0], [1, 2, 0, 3, 0, 0, 1, 1]]
Use map:
list(map(lambda x: list(map(lambda c: d[c], list(x))), df[0]))
Output
[[2, 2, 0, 3, 0, 0, 3, 0], [1, 2, 0, 3, 0, 0, 1, 1]]
or
df[0].agg(list).explode().replace(d).groupby(level=0).agg(list).tolist()
I think the first solution is faster:
%%timeit
list(map(lambda x: list(map(lambda c: d[c], list(x))), df[0]))
11.7 µs ± 392 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df[0].agg(list).explode().replace(d).groupby(level=0).agg(list).tolist()
5.02 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using .str.split() and stack with map:
seq = {'A': 0,
       'C': 1,
       'G': 2,
       'T': 3}
df[0].str.split('',expand=True).stack().map(seq).dropna().groupby(level=0).agg(list)
#out:
0 [2.0, 2.0, 0.0, 3.0, 0.0, 0.0, 3.0, 0.0]
1 [1.0, 2.0, 0.0, 3.0, 0.0, 0.0, 1.0, 1.0]
dtype: object
import pandas as pd
a = ["GGATAATA","CGATAACC"]
d = dict(zip('A C G T'.split(), range(4)))
df = pd.DataFrame(a)
# implement mapping
mapping = str.maketrans('ACGT', '0123')
df[0] = df[0].map(lambda x: x.translate(mapping))
# expected output
output = df[0].map(lambda x: [int(i) for i in list(x)]).tolist()
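If the goal is specifically a 2-D integer numpy array rather than a list of lists, a minimal sketch (assuming all sequences have the same length, which a rectangular array requires) builds it directly from the mapping dictionary:
import numpy as np
import pandas as pd

a = ["GGATAATA", "CGATAACC"]
d = dict(zip('A C G T'.split(), range(4)))
df = pd.DataFrame(a)

# One row per sequence, one column per nucleotide position.
arr = np.array([[d[c] for c in s] for s in df[0]])
# arr -> array([[2, 2, 0, 3, 0, 0, 3, 0],
#               [1, 2, 0, 3, 0, 0, 1, 1]])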

Numpy Array - Drop Duplicates

I am trying to remove duplicate elements from a numpy array.
Eg:
a = np.array([[0.03,0.32],[0.09,0.26],[0.03,0.32]])
a = np.unique(a,axis=0)
This is perfectly working.
But the problem is that this code is part of a function, and I run the function, say, 10 times. On some runs the system hangs at exactly this line.
The array has at most 3500 rows, and each element (inner array) has length 60.
Why is this happening, and is there a more efficient way?
There are quite a few issues with what you're doing.
First, observe that np.unique does not work well with floating-point arithmetic, and will not in general filter out rows of floats that merely look identical:
In [16]: a = np.array([[2.1*3, .123], [6.3, 2.05*.06]])
In [17]: a
Out[17]:
array([[6.3 , 0.123],
[6.3 , 0.123]])
In [18]: np.unique(a, axis=0)
Out[18]:
array([[6.3 , 0.123],
[6.3 , 0.123]])
Note that the duplicates are still in the result after calling np.unique. The reason is that np.unique compares on exact equality, meaning the floats must match bit for bit. However, floating-point arithmetic is not exact, so you are not guaranteed to filter out duplicates correctly.
Secondly, in terms of performance, you can do better than np.unique with a hashable type. np.unique always runs in O(n log n) since it performs a sort. You can verify this in the source code:
if optional_indices:
    perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
    aux = ar[perm]
else:
    ar.sort()
    aux = ar
So, regardless of how the conditional evaluates, a sort is performed over ar (which is the input array; see here for more detail: https://github.com/numpy/numpy/blob/v1.15.0/numpy/lib/arraysetops.py#L277). The reason is that np.unique supports a rich set of functionality (like getting the indices of duplicates, returning the counts of duplicates, etc.).
You don't have to sort to get unique elements. If you convert your rows to a hashable type (like tuple), you can filter out duplicates in O(n), linear time. Here's an example:
In [37]: b
Out[37]:
[(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)]
In [39]: np.unique(b, axis=0)
Out[39]: array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
In [40]: set(b)
Out[40]: {(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)}
In [41]: %timeit np.unique(b, axis=0)
21.9 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [42]: %timeit set(b)
627 ns ± 5.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So, as you can see, just using the built-in set runs about 30x faster than np.unique. Please note this will not work correctly for arrays of floats, but I just wanted to show that np.unique is not particularly performant from an algorithmic perspective.
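For reference, a minimal sketch of applying the set idea to a 2-D numpy array, turning each row into a hashable tuple first (the caveats about exact float equality above still apply):
import numpy as np

a = np.array([[0.03, 0.32], [0.09, 0.26], [0.03, 0.32]])

# dict.fromkeys dedupes in O(n) while keeping first-occurrence order.
unique_rows = np.array(list(dict.fromkeys(map(tuple, a))))
# unique_rows -> array([[0.03, 0.32],
#                       [0.09, 0.26]])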
Lastly, 3500x60 is not really that big. You can loop through that pretty easily, even with a subpar algorithm, and it should not hang on any modern hardware. It should run pretty fast:
In [43]: np.random.seed(0)
In [46]: x = np.random.random((3500, 60))
In [49]: %timeit np.unique(x, axis=0)
2.57 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So it takes 2.57 milliseconds on my MacBook Pro, which isn't exactly a powerhouse in terms of hardware (2.3 GHz i5, 8 GB of RAM). Make sure you're profiling your code, and make sure that the line in this question is actually the trouble line.
HTH.

how to have multiple conditions in a list comprehension when the conditions are in an array

E.g. let A = [3, 4]
and Y be an array of multiple values like
Y = [2,3,2,2,2,2,2,3,3,3,3,3]
Then I want to select all the elements of Y whose values are in A.
So I wrote the following code:
Yij = [Y[Y == x] for x in A]
Output:
[array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4])]
but this produces a list of lists.
I, on the other hand, want a single flat array.
Any suggestion on how can I make this work?
A list comprehension solution:
>>> A = set([3, 4])
>>> Y = [2,3,2,2,2,2,2,3,3,3,3,3]
>>> Z = [y for y in Y if y in A]
>>> Z
[3, 3, 3, 3, 3, 3]
Here are some timings to show the performance difference between using set lookup and list lookup:
In [21]: A = set(range(0, 1000, 5))
In [22]: B = list(range(0, 1000, 5))
In [23]: C = list(range(0, 1000))
In [24]: %timeit [y for y in C if y in A]
59.6 µs ± 329 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [25]: %timeit [y for y in C if y in B]
2.94 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
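Since Y is a numpy array in the question, a boolean-mask alternative using np.isin also returns a flat array directly; a minimal sketch:
import numpy as np

A = [3, 4]
Y = np.array([2, 3, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3])

# Keep only the elements of Y whose values appear in A.
Z = Y[np.isin(Y, A)]
# Z -> array([3, 3, 3, 3, 3, 3])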
