I was trying to implement a Kronecker product function. Below are three ideas that I have:
import numpy as np
import scipy.linalg.blas

def kron(arr1, arr2):
    """Column-wise outer product, avoiding relocating elements."""
    r1, c1 = arr1.shape
    r2, c2 = arr2.shape
    nrows, ncols = r1 * r2, c1 * c2
    res = np.empty((nrows, ncols))
    for idx1 in range(c1):
        for idx2 in range(c2):
            new_c = idx1 * c2 + idx2
            temp = np.zeros((r2, r1))
            temp_kron = scipy.linalg.blas.dger(
                alpha=1.0, x=arr2[:, idx2], y=arr1[:, idx1], incx=1, incy=1,
                a=temp)
            res[:, new_c] = np.ravel(temp_kron, order='F')
    return res
def kron2(arr1, arr2):
    """First outer product, then rearrange items."""
    r1, c1 = arr1.shape
    r2, c2 = arr2.shape
    nrows, ncols = r1 * r2, c1 * c2
    tmp = np.outer(arr2, arr1)
    res = np.empty((nrows, ncols))
    for idx in range(arr1.size):
        for offset in range(c2):
            orig = tmp[offset::c2, idx]
            dest_coffset = idx % c1 * c2 + offset
            dest_roffset = (idx // c1) * r2
            res[dest_roffset:dest_roffset+r2, dest_coffset] = orig
    return res
def kron3(arr1, arr2):
    """First outer product, then rearrange items."""
    r1, c1 = arr1.shape
    r2, c2 = arr2.shape
    nrows, ncols = r1 * r2, c1 * c2
    tmp = np.outer(np.ravel(arr2, 'F'), np.ravel(arr1, 'F'))
    res = np.empty((nrows, ncols))
    for idx in range(arr1.size):
        for offset in range(c2):
            orig_offset = offset * r2
            orig = tmp[orig_offset:orig_offset+r2, idx]
            dest_c = idx // r1 * c2 + offset
            dest_r = idx % r1 * r2
            res[dest_r:dest_r+r2, dest_c] = orig
    return res
Based on this Stack Overflow post I created a MeasureTime decorator. A natural benchmark is to compare against numpy.kron. Below are my test functions:
@MeasureTime
def test_np_kron(arr1, arr2, number=1000):
    for _ in range(number):
        np.kron(arr1, arr2)
    return

@MeasureTime
def test_kron(arr1, arr2, number=1000):
    for _ in range(number):
        kron(arr1, arr2)

@MeasureTime
def test_kron2(arr1, arr2, number=1000):
    for _ in range(number):
        kron2(arr2, arr1)

@MeasureTime
def test_kron3(arr1, arr2, number=1000):
    for _ in range(number):
        kron3(arr2, arr1)
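The MeasureTime decorator from the linked post is not reproduced here; a minimal sketch of a decorator that would produce the output format shown below (details assumed) is:

import time
from functools import wraps

def MeasureTime(func):
    # Minimal timing decorator sketch; the version from the linked post may differ.
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f'"{func.__name__}": {elapsed}s')
        return result
    return wrapper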
It turned out that NumPy's kron function performs much better:
arr1 = np.array([[1,-4,7], [-2, 3, 3]], dtype=np.float64, order='F')
arr2 = np.array([[8, -9, -6, 5], [1, -3, -4, 7], [2, 8, -8, -3], [1, 2, -5, -1]], dtype=np.float64, order='F')
In [243]: test_np_kron(arr1, arr2, number=10000)
Out [243]: "test_np_kron": 0.19688990000577178s
In [244]: test_kron(arr1, arr2, number=10000)
Out [244]: "test_kron": 0.6094115000014426s
In [245]: test_kron2(arr1, arr2, number=10000)
Out [245]: "test_kron2": 0.5699560000066413s
In [246]: test_kron3(arr1, arr2, number=10000)
Out [246]: "test_kron3": 0.7134822000080021s
I would like to know why that is the case. Is it because NumPy's reshape method is much more performant than manually copying elements around (even though the copies also go through NumPy)? I was puzzled, since I was using np.outer / blas.dger as well; the only difference I could see was how the final results get arranged.
How is NumPy's reshape this fast?
Here is the link to NumPy 1.17 kron source.
Updates:
I forgot to mention that I was prototyping in Python and then planning to implement kron in C++ with cblas/lapack; I have an existing 'kron' that needs to be refactored. I then came across NumPy's reshape and was really impressed.
Thanks in advance for your time!
Let's experiment with 2 small arrays:
In [124]: A, B = np.array([[1,2],[3,4]]), np.array([[10,11],[12,13]])
kron produces:
In [125]: np.kron(A,B)
Out[125]:
array([[10, 11, 20, 22],
[12, 13, 24, 26],
[30, 33, 40, 44],
[36, 39, 48, 52]])
outer produces the same numbers, but with a different arrangement:
In [126]: np.outer(A,B)
Out[126]:
array([[10, 11, 12, 13],
[20, 22, 24, 26],
[30, 33, 36, 39],
[40, 44, 48, 52]])
kron reshapes it to a combination of the shapes of A and B:
In [127]: np.outer(A,B).reshape(2,2,2,2)
Out[127]:
array([[[[10, 11],
[12, 13]],
[[20, 22],
[24, 26]]],
[[[30, 33],
[36, 39]],
[[40, 44],
[48, 52]]]])
It then recombines the 4 dimensions into 2 with concatenate:
In [128]: np.concatenate(np.concatenate(_127, 1),1)
Out[128]:
array([[10, 11, 20, 22],
[12, 13, 24, 26],
[30, 33, 40, 44],
[36, 39, 48, 52]])
An alternative is to swap axes, and reshape:
In [129]: _127.transpose(0,2,1,3).reshape(4,4)
Out[129]:
array([[10, 11, 20, 22],
[12, 13, 24, 26],
[30, 33, 40, 44],
[36, 39, 48, 52]])
The first reshape and the transpose produce views, but the second reshape has to make a copy; concatenate also copies. All of those actions, though, are done in compiled NumPy code.
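A quick way to check the view-versus-copy claims is np.shares_memory (a small sketch):

A, B = np.array([[1,2],[3,4]]), np.array([[10,11],[12,13]])
temp = np.outer(A, B)
r4 = temp.reshape(2, 2, 2, 2)
t4 = r4.transpose(0, 2, 1, 3)
print(np.shares_memory(temp, r4))                # True  - the reshape is a view
print(np.shares_memory(temp, t4))                # True  - the transpose is a view
print(np.shares_memory(temp, t4.reshape(4, 4)))  # False - this reshape must copy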
Defining functions:
def foo1(A,B):
    temp = np.outer(A,B)
    temp = temp.reshape(A.shape + B.shape)
    return np.concatenate(np.concatenate(temp, 1), 1)

def foo2(A,B):
    temp = np.outer(A,B)
    nz = temp.shape
    temp = temp.reshape(A.shape + B.shape)
    return temp.transpose(0,2,1,3).reshape(nz)
testing:
In [141]: np.allclose(np.kron(A,B), foo1(A,B))
Out[141]: True
In [142]: np.allclose(np.kron(A,B), foo2(A,B))
Out[142]: True
timing:
In [143]: timeit np.kron(A,B)
42.4 µs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [145]: timeit foo1(A,B)
26.3 µs ± 38.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [146]: timeit foo2(A,B)
13.8 µs ± 19.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
My code may need some generalization, but it demonstrates the validity of the approach.
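For instance, the hard-coded shapes in foo2 could be generalized along these lines (a sketch, not the library's implementation):

def foo2_general(A, B):
    # Same idea as foo2, for arbitrary 2-D shapes: outer product, interleave
    # the row axes (i, k) and column axes (j, l), then flatten back to 2-D.
    r1, c1 = A.shape
    r2, c2 = B.shape
    temp = np.outer(A, B).reshape(r1, c1, r2, c2)
    return temp.transpose(0, 2, 1, 3).reshape(r1 * r2, c1 * c2)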
===
With your kron:
In [150]: kron(A,B)
Out[150]:
array([[10., 11., 20., 22.],
[12., 13., 24., 26.],
[30., 33., 40., 44.],
[36., 39., 48., 52.]])
In [151]: timeit kron(A,B)
55.3 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
edit
einsum can do both the outer and transpose:
In [265]: np.einsum('ij,kl->ikjl',A,B).reshape(4,4)
Out[265]:
array([[10, 11, 20, 22],
[12, 13, 24, 26],
[30, 33, 40, 44],
[36, 39, 48, 52]])
In [266]: timeit np.einsum('ij,kl->ikjl',A,B).reshape(4,4)
9.87 µs ± 33 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
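The same einsum route generalizes in the obvious way (again a sketch):

def kron_einsum(A, B):
    # einsum produces the outer product with the axes already interleaved,
    # so a single reshape yields the Kronecker layout.
    r1, c1 = A.shape
    r2, c2 = B.shape
    return np.einsum('ij,kl->ikjl', A, B).reshape(r1 * r2, c1 * c2)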
Related
I am trying to convert nucleotide to integer using the following mapping:
A -> 0
C -> 1
G -> 2
T -> 3
The sequence of nucleotide is saved in a pandas dataframe and it looks like:
0
0 GGATAATA
1 CGATAACC
I have used the df.apply() method to do the task. Here is the code:
import pandas as pd
a = ["GGATAATA","CGATAACC"]
d = dict(zip('A C G T'.split(), range(4)))
df = pd.DataFrame(a)
mapping = df[0].apply(lambda s: np.array([d[i] for i in s]))
It returns the following NumPy array, which is one-dimensional (an object array of arrays):
print(mapping.values)
array([array([2, 2, 0, 3, 0, 0, 3, 0]), array([1, 2, 0, 3, 0, 0, 1, 1])],
dtype=object)
However, the expected output should be a two-dimensional array:
[[2,2,0,3,0,0,3,0],
[1,2,0,3,0,0,1,1]]
IIUC
df[0].apply(list).explode().replace(d).groupby(level=0).agg(list).to_list()
Out[579]: [[2, 2, 0, 3, 0, 0, 3, 0], [1, 2, 0, 3, 0, 0, 1, 1]]
Use map:
list(map(lambda x: list(map(lambda c: d[c], list(x))), df[0]))
Output
[[2, 2, 0, 3, 0, 0, 3, 0], [1, 2, 0, 3, 0, 0, 1, 1]]
or
df[0].agg(list).explode().replace(d).groupby(level=0).agg(list).tolist()
I think the first solution is faster:
%%timeit
list(map(lambda x: list(map(lambda c: d[c], list(x))), df[0]))
11.7 µs ± 392 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df[0].agg(list).explode().replace(d).groupby(level=0).agg(list).tolist()
5.02 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using .str.split() and stack with map:
seq = {'A' : 0,
       'C' : 1,
       'G' : 2,
       'T' : 3}
df[0].str.split('',expand=True).stack().map(seq).dropna().groupby(level=0).agg(list)
#out:
0 [2.0, 2.0, 0.0, 3.0, 0.0, 0.0, 3.0, 0.0]
1 [1.0, 2.0, 0.0, 3.0, 0.0, 0.0, 1.0, 1.0]
dtype: object
import pandas as pd
a = ["GGATAATA","CGATAACC"]
d = dict(zip('A C G T'.split(), range(4)))
df = pd.DataFrame(a)
# implement mapping
mapping = str.maketrans('ACGT', '0123')
df[0] = df[0].map(lambda x: x.translate(mapping))
# expected output
output = df[0].map(lambda x: [int(i) for i in list(x)]).tolist()
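For comparison, a direct route to the expected 2-D integer array (a sketch, assuming all sequences have the same length):

import numpy as np
import pandas as pd

a = ["GGATAATA", "CGATAACC"]
d = dict(zip('A C G T'.split(), range(4)))
df = pd.DataFrame(a)

# The nested comprehension builds the list of lists; np.array stacks it into a
# 2-D array because all rows have the same length.
arr = np.array([[d[c] for c in s] for s in df[0]])
# array([[2, 2, 0, 3, 0, 0, 3, 0],
#        [1, 2, 0, 3, 0, 0, 1, 1]])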
I know this has been answered many times and I went through every SO question on this topic, but none of them seemed to tackle my problem.
This code yields an exception:
TypeError: only integer scalar arrays can be converted to a scalar index
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
sindex = np.array([0, 3, 4])
eindex = np.array([2, 5, 6])
r = a[sindex: eindex]
I have an array of start indices and another of end indices, and I simply want to extract whatever lies between them. Note that the difference between sindex and eindex is constant, for example 2, so eindex is always sindex + 2.
So the expected result should be:
[1, 2, 4, 5, 5, 6]
Is there a way to do this without a for loop?
For a constant interval difference, we can set up sliding windows and simply index with the array of starting indices. Thus, we can use broadcasting_app or strided_app from this post -
d = 2 # interval difference
out = broadcasting_app(a, L = d, S = 1)[sindex].ravel()
out = strided_app(a, L = d, S = 1)[sindex].ravel()
Or use scikit-image's built-in view_as_windows -
from skimage.util.shape import view_as_windows
out = view_as_windows(a,d)[sindex].ravel()
To set d, we can use -
d = eindex[0] - sindex[0]
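The helpers above are defined in the linked post; for completeness, a minimal sketch of a strided_app-style window builder (name and details assumed) might look like:

import numpy as np
from numpy.lib.stride_tricks import as_strided

def strided_app_sketch(a, L, S):
    # View a 1-D array as overlapping windows of length L, taken every S elements.
    nrows = (a.size - L) // S + 1
    n = a.strides[0]
    return as_strided(a, shape=(nrows, L), strides=(S * n, n))

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
sindex = np.array([0, 3, 4])
d = 2
out = strided_app_sketch(a, L=d, S=1)[sindex].ravel()
# array([1, 2, 4, 5, 5, 6])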
You can't tell compiled NumPy to take multiple slices directly. The alternatives to looping and joining multiple slices all involve some sort of advanced indexing.
In [509]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
...:
...: sindex = np.array([0, 3, 4])
...: eindex = np.array([2, 5, 6])
The most obvious loop:
In [511]: np.hstack([a[i:j] for i,j in zip(sindex, eindex)])
Out[511]: array([1, 2, 4, 5, 5, 6])
A variation that uses the loop to construct indices first:
In [516]: a[np.hstack([np.arange(i,j) for i,j in zip(sindex, eindex)])]
Out[516]: array([1, 2, 4, 5, 5, 6])
Since the slices all have the same size, we can generate one arange and offset it with sindex:
In [521]: a[np.arange(eindex[0]-sindex[0]) + sindex[:,None]]
Out[521]:
array([[1, 2],
[4, 5],
[5, 6]])
and then ravel. This is a more direct expression of @Divakar's `broadcasting_app`.
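Ravelling that 2-D result gives the same flat array as the loop versions:

out = a[np.arange(eindex[0] - sindex[0]) + sindex[:, None]].ravel()
# array([1, 2, 4, 5, 5, 6])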
With this small example, timings are similar.
In [532]: timeit np.hstack([a[i:j] for i,j in zip(sindex, eindex)])
13.4 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [533]: timeit a[np.hstack([np.arange(i,j) for i,j in zip(sindex, eindex)])]
21.2 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [534]: timeit a[np.arange(eindex[0]-sindex[0])+sindex[:,None]].ravel()
10.1 µs ± 48.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [535]: timeit strided_app(a, L=2, S=1)[sindex].ravel()
21.8 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
strided_app and view_as_windows use striding tricks to view the array as an array of size d windows, and use sindex to select a subset of them.
In larger cases, relative timings may vary with the size of the slices versus the number of slices.
You can just use sindex.
I am trying to remove duplicate elements from a numpy array.
Eg:
a = np.array([[0.03,0.32],[0.09,0.26],[0.03,0.32]])
a = np.unique(a,axis=0)
This is perfectly working.
But the problem is that this code is part of a function, and I run that function, say, 10 times. On some runs the system hangs at exactly this line.
The array is at most 3500 elements long, and each element (inner array) has length 60.
Why is this happening, and is there a more efficient way?
There are quite a few issues with what you're doing.
First, observe that np.unique does not work well for floating point arithmetic, and will not in general filter out "unique" arrays of floats:
In [16]: a = np.array([[2.1*3, .123], [6.3, 2.05*.06]])
In [17]: a
Out[17]:
array([[6.3 , 0.123],
[6.3 , 0.123]])
In [18]: np.unique(a, axis=0)
Out[18]:
array([[6.3 , 0.123],
[6.3 , 0.123]])
Note that the duplicates are still in the result after calling np.unique. The reason is that np.unique compares on equality, meaning the floats must match bit for bit. However, floating point arithmetic is not exact, so you are not guaranteed to filter out duplicates correctly.
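A common workaround (a sketch; the number of decimals is an arbitrary tolerance you must pick for your data) is to round before calling np.unique, so nearly-equal values compare bit for bit equal:

a = np.array([[2.1*3, .123], [6.3, 2.05*.06]])
np.unique(np.round(a, 8), axis=0)
# array([[6.3  , 0.123]])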
Secondly, in terms of performance, you can do better than np.unique with a hashable type. np.unique will always run in O(n log n) since it does a sort. You can verify this in the source code:
if optional_indices:
    perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
    aux = ar[perm]
else:
    ar.sort()
    aux = ar
So, regardless of how the conditional evaluates, a sort is performed over ar (which is the input array; see here for more detail: https://github.com/numpy/numpy/blob/v1.15.0/numpy/lib/arraysetops.py#L277). This is because np.unique supports a rich set of functionality (like getting the indices of dups, returning the count of dups, etc.).
You don't have to sort to get unique elements. If you beat your type into a hashable type (like tuple), then you can filter out duplicates in O(n), linear time. Here's an example:
In [37]: b
Out[37]:
[(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)]
In [39]: np.unique(b, axis=0)
Out[39]: array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
In [40]: set(b)
Out[40]: {(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)}
In [41]: %timeit np.unique(b, axis=0)
21.9 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [42]: %timeit set(b)
627 ns ± 5.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So, as you can see, just using the built-in set runs about 30x faster than np.unique. Please note this will not work correctly for arrays of floats, but I just wanted to show that np.unique is not particularly performant from an algorithmic perspective.
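Applying the same idea to a 2-D float array takes one extra step: round to a tolerance, then hash the rows (a sketch; the decimals argument is an assumed knob):

import numpy as np

def unique_rows_hashed(a, decimals=8):
    # O(n) de-duplication: round to a tolerance, then keep the first occurrence
    # of each distinct row, using a set of row tuples for membership tests.
    seen = set()
    keep = []
    for i, row in enumerate(np.round(a, decimals)):
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return a[keep]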
Lastly, 3500x60 is not really that big. You can loop through that pretty easily, even with a subpar algorithm, and it should not hang on any modern hardware. It should run pretty fast:
In [43]: np.random.seed(0)
In [46]: x = np.random.random((3500, 60))
In [49]: %timeit np.unique(x, axis=0)
2.57 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So it takes 2.57 milliseconds on my MacBook Pro, which isn't exactly a powerhouse in terms of hardware (2.3 GHz i5, 8 GB of RAM). Make sure you're profiling your code, and confirm that the line in this question is actually the troublesome line.
HTH.
E.g. let A = [3,4]
and Y be an array of multiple values like
Y = [2,3,2,2,2,2,2,3,3,3,3,3]
Then I want to select all those labels of Y where Y is in A.
So I wrote the following code:
Yij = [Y[Y == x] for x in A]
Output:
[array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4])]
but this leads to a list of arrays.
I, on the other hand, want a single flat array.
Any suggestion on how I can make this work?
A list comprehension solution:
>>> A = set([3, 4])
>>> Y = [2,3,2,2,2,2,2,3,3,3,3,3]
>>> Z = [y for y in Y if y in A]
>>> Z
[3, 3, 3, 3, 3, 3]
Here are some timings to show the performance difference between using set lookup and list lookup:
In [21]: A = set(range(0, 1000, 5))
In [22]: B = list(range(0, 1000, 5))
In [23]: C = list(range(0, 1000))
In [24]: %timeit [y for y in C if y in A]
59.6 µs ± 329 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [25]: %timeit [y for y in C if y in B]
2.94 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
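Since the question is already using NumPy indexing, np.isin is another way to get a flat array directly (a sketch):

import numpy as np

A = [3, 4]
Y = np.array([2, 3, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3])
Z = Y[np.isin(Y, A)]
# array([3, 3, 3, 3, 3, 3])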
I am trying to implement a function that takes each column of a NumPy 2-D array and returns the scalar result of a certain calculation. My current code looks like the following:
img = np.array([
[0, 5, 70, 0, 0, 0 ],
[10, 50, 4, 4, 2, 0 ],
[50, 10, 1, 42, 40, 1 ],
[10, 0, 0, 6, 85, 64],
[0, 0, 0, 1, 2, 90]]
)
def get_y(stride):
    stride_vals = stride[stride > 0]
    pix_thresh = stride_vals.max() - 1.5*stride_vals.std()
    return np.argwhere(stride > pix_thresh).mean()
np.apply_along_axis(get_y, 0, img)
>> array([ 2. , 1. , 0. , 2. , 2.5, 3.5])
It works as expected; however, performance isn't great, as the real dataset has ~2k rows and ~20-50 columns per frame, arriving 60 times a second.
Is there a way to speed-up the process, perhaps by not using np.apply_along_axis function?
Here's one vectorized approach that sets the zeros to NaN, which lets us use np.nanmax and np.nanstd to compute the max and std values while avoiding the zeros, like so -
imgn = np.where(img==0, np.nan, img)
mx = np.nanmax(imgn,0) # np.max(img,0) if all are positive numbers
st = np.nanstd(imgn,0)
mask = img > mx - 1.5*st
# Mean of the row indices where mask is True, computed per column:
out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0)
Runtime test -
In [94]: img = np.random.randint(-100,100,(2000,50))
In [95]: %timeit np.apply_along_axis(get_y, 0, img)
100 loops, best of 3: 4.36 ms per loop
In [96]: %%timeit
...: imgn = np.where(img==0, np.nan, img)
...: mx = np.nanmax(imgn,0)
...: st = np.nanstd(imgn,0)
...: mask = img > mx - 1.5*st
...: out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0)
1000 loops, best of 3: 1.33 ms per loop
Thus, we are seeing a 3x+ speedup.