Neighbors sum of numpy array with mask - python-3.x

I have two large arrays: one containing values, and one that is basically a mask. The code below shows the function I want to implement.
from scipy.signal import convolve2d
import numpy as np
sample = np.array([[6, 4, 5, 5, 5],
                   [7, 1, 0, 8, 3],
                   [2, 5, 4, 8, 4],
                   [2, 0, 2, 6, 0],
                   [5, 7, 2, 3, 2]])
mask = np.array([[1, 0, 1, 1, 0],
                 [0, 0, 1, 0, 1],
                 [0, 1, 0, 0, 0],
                 [0, 0, 0, 1, 0],
                 [1, 1, 0, 0, 1]])
neighbors_sum = convolve2d(sample, np.ones((3,3), dtype=int), mode='same', boundary='wrap')
# neighbors_sum = np.array([[40, 37, 35, 33, 44],
#                           [37, 34, 40, 42, 48],
#                           [24, 23, 34, 35, 40],
#                           [27, 29, 37, 31, 32],
#                           [31, 33, 34, 30, 34]])
result = np.where(mask, neighbors_sum, 0)
print(result)
This code works and gets me what I expect:
np.array([[40,  0, 35, 33,  0],
          [ 0,  0, 40,  0, 48],
          [ 0, 23,  0,  0,  0],
          [ 0,  0,  0, 31,  0],
          [31, 33,  0,  0, 34]])
So far, so good. However, I run into serious trouble when I increase the size of the arrays. In my case, instead of a 5x5 input and a 3x3 summing window, I need a 50,000x20,000 input and a 100x100 summing window. When I move to that size, the convolve2d function struggles badly and the calculation takes extremely long.
Given that I only care about the masked result, and thus only about the summation from convolve2d at those points, can anyone think of a smart approach to take here? Going to a for loop and selecting only the points of interest would lose the speed advantage of vectorization, so I'm not convinced this would be worth it.
Any suggestion welcome!

convolve2d is very inefficient in this case. Since the kernel is np.ones, you can split the filter into two trivial ones thanks to separable filtering: one np.ones((100, 1)) filter and one np.ones((1, 100)) filter. Moreover, a rolling sum can be used to speed up the computation even more.
Here is a simple solution without a rolling sum:
# Simple faster implementation
tmp = convolve2d(sample, np.ones((1,100), dtype=int), mode='same', boundary='wrap')
neighbors_sum = convolve2d(tmp, np.ones((100,1), dtype=int), mode='same', boundary='wrap')
result = np.where(mask, neighbors_sum, 0)
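For the small 5x5 example above, the separable version can be checked against the direct 2D convolution (a quick sanity check, not part of the original answer; with an odd kernel size and boundary='wrap' the two are exactly equal):
direct = convolve2d(sample, np.ones((3, 3), dtype=int), mode='same', boundary='wrap')
tmp = convolve2d(sample, np.ones((1, 3), dtype=int), mode='same', boundary='wrap')
separable = convolve2d(tmp, np.ones((3, 1), dtype=int), mode='same', boundary='wrap')
assert np.array_equal(direct, separable)  # separable filtering gives the same result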
You can compute the rolling sum efficiently using Numba. The strategy is to split the computation into 3 parts: the horizontal rolling sum, the vertical rolling sum, and the final masking. Each step can be fully parallelized using multiple threads (although parallelizing the vertical rolling sum is harder with Numba). Each part needs to work line by line so as to be cache friendly.
# Complex very-fast implementation
import numba as nb

# Numerical results may diverge if the input contains big
# values mixed with many small ones.
# Does not support inputs containing NaN values or +/- Inf ones.
@nb.njit('float64[:,::1](float64[:,::1], int_)', parallel=True, fastmath=True)
def horizontalRollingSum(sample, filterSize):
    n, m = sample.shape
    fs = filterSize

    # Make the wrapping part of the rolling sum much simpler
    assert fs >= 1
    assert n >= fs and m >= fs

    # Horizontal rolling sum.
    tmp = np.empty((n, m), dtype=np.float64)
    for i in nb.prange(n):
        s = 0.0
        lShift = fs//2
        rShift = (fs-1)//2
        for j in range(m-lShift, m):
            s += sample[i, j]
        for j in range(0, rShift+1):
            s += sample[i, j]
        tmp[i, 0] = s
        for j in range(1, m):
            jLeft, jRight = (j-1-lShift)%m, (j+rShift)%m
            s += sample[i, jRight] - sample[i, jLeft]
            tmp[i, j] = s
    return tmp

@nb.njit('float64[:,::1](float64[:,::1], int_)', fastmath=True)
def verticalRollingSum(sample, filterSize):
    n, m = sample.shape
    fs = filterSize

    # Make the wrapping part of the rolling sum much simpler
    assert fs >= 1
    assert n >= fs and m >= fs

    # Vertical rolling sum.
    tmp = np.empty((n, m), dtype=np.float64)
    tShift = fs//2
    bShift = (fs-1)//2
    for j in range(m):
        tmp[0, j] = 0.0
    for i in range(n-tShift, n):
        for j in range(m):
            tmp[0, j] += sample[i, j]
    for i in range(0, bShift+1):
        for j in range(m):
            tmp[0, j] += sample[i, j]
    for i in range(1, n):
        iTop = (i-1-tShift)%n
        iBot = (i+bShift)%n
        for j in range(m):
            tmp[i, j] = tmp[i-1, j] + (sample[iBot, j] - sample[iTop, j])
    return tmp

@nb.njit('float64[:,::1](float64[:,::1], int_[:,::1], int_)', parallel=True, fastmath=True)
def compute(sample, mask, filterSize):
    n, m = sample.shape
    tmp = horizontalRollingSum(sample, filterSize)
    neighbors_sum = verticalRollingSum(tmp, filterSize)
    res = np.empty((n, m), dtype=np.float64)
    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = neighbors_sum[i, j] * mask[i, j]
    return res
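For completeness, here is a hypothetical invocation of the routines above (not part of the original answer); filterSize=100 matches the 100x100 summing window from the question:
sample = np.random.rand(5000, 2000)                         # float64, C-contiguous, as required by the signature
mask = (np.random.rand(5000, 2000) < 0.05).astype(np.int_)  # integer mask matching the int_ signature
result = compute(sample, mask, 100)                         # neighborhood sums, kept only where the mask is set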
Benchmark & Notes
Here is the testing code:
n, m = 5000, 2000
sample = np.random.rand(n, m)
mask = (np.random.rand(n, m) < 0.05).astype(int)
Here are the results on my 6-core machine:
Initial solution: 174366 ms (x1)
With separate filters: 5710 ms (x31)
Final Numba solution: 40 ms (x4359)
Optimal theoretical time: 10 ms (optimistic)
Thus, the Numba implementation is 4359 times faster than the initial one.
That being said, be careful of possible numerical issues that this last implementation can have regarding the input array (see the comments in the code). It should be fine as long as np.std(sample) is relatively small and np.all(np.isfinite(sample)) is true.
Note that the code can be further optimized: the vertical rolling sum can be parallelized; modulus operations can be avoided in the horizontal rolling sum; the vertical rolling sum and the masking steps can be merged together (i.e. by computing res on the fly and not storing tmp); tiling can be used to compute all the steps simultaneously in a more cache-friendly way. However, these optimizations make the code more complex, and some of them are very hard to perform (especially the last one with Numba).
Note that using a boolean mask (instead of an integer-based one) should make the algorithm faster since it takes less memory and processors can fetch values faster.
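A minimal illustration of that tweak for the pure-NumPy version above (for the Numba version, the mask type in the signature would have to be changed to a boolean array type accordingly):
mask_bool = mask.astype(bool)                   # 1 byte per element instead of 4 or 8 for an integer mask
result = np.where(mask_bool, neighbors_sum, 0)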

Related

Efficiently Loop Through Millions of Elements

I have a list of 262144 elements created through "itertools.product". Now I have to loop over these elements and combine each one with all the other elements, which is taking too much time. (I don't have any memory / CPU constraints.)
import itertools

elements = []
for e in itertools.product(range(4), repeat=9):
    elements.append(e)

def do_calculations(ro, co):
    t = {}
    t[0] = [multiply(c=ro[0], r=co[0])]
    for i in range(1, len(ro)):
        _t = []
        for j in range(i+1):
            _t.append(multiply(c=ro[j], r=co[i-j]))
        t[i] = _t
    for vals in t.values():
        nx = len(vals)
        _co = ro[nx:]
        _ro = co[nx:]
        for k in range(len(_ro)):
            vals.append(multiply(c=_co[k], r=_ro[k]))
    _t = []
    for k in t.values():
        s = k[0]
        for j in range(1, len(k)):
            s = addition(c=s, r=k[j])
        _t.append(s)
    return _t

def addition(c, r) -> int:
    __a = [[0, 3, 1, 2],
           [3, 2, 0, 1],
           [0, 3, 2, 1],
           [1, 0, 2, 3]]
    return __a[c][r]

def multiply(c, r) -> int:
    __m = [[0, 0, 0, 0],
           [0, 1, 2, 3],
           [0, 3, 1, 2],
           [0, 2, 3, 1]]
    return __m[c][r]

for row in elements:
    for col in elements:
        do_calculations(row, col)
It is taking too much time to process a single col against all the rows. Can anyone help me with this?
Regards
Not much of a Python guy, but:
make sure col is a higher number than row (a small optimization, but an optimization nevertheless);
use a multiprocessing library; that should cut the calculation time (see the sketch below);
(as noted in the comment by @Skam, multithreading does not increase performance in such a case)
also, you might consider some optimizations in the calculation itself.
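Here is a minimal sketch of the multiprocessing suggestion (names are illustrative; do_calculations is assumed to be the function defined in the question). It parallelizes the outer loop across processes but does not reduce the O(n^2) amount of work:
import itertools
from multiprocessing import Pool

elements = list(itertools.product(range(4), repeat=9))

def process_row(row):
    # combine one row with every element, reusing the question's do_calculations
    return [do_calculations(row, col) for col in elements]

if __name__ == "__main__":
    with Pool() as pool:                       # one worker per CPU core by default
        results = pool.map(process_row, elements, chunksize=64)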

Count number of repeated elements in list considering the ones larger than them

I am trying to do some clustering analysis on a dataset. I am using a number of different approaches to estimate the number of clusters, then I put what every approach gives (number of clusters) in a list, like so:
total_pred = [0, 0, 1, 1, 0, 1, 1]
Now I want to estimate the real number of clusters, so I let the methods above vote; for example, in the list above more models found 1 cluster than 0, so I take 1 as the real number of clusters.
I do this by:
counts = np.bincount(np.array(total_pred))
real_nr_of_clusters = np.argmax(counts)
There is a problem with this method, however. If the above list contains something like:
[2, 0, 1, 0, 1, 0, 1, 0, 1]
I will get 0 clusters as the result, since 0 is repeated most often. However, if one model found 2 clusters, it's safe to assume it considers that at least 1 cluster is there, hence the real number would be 1.
How can I do this by modifying the above snippet?
To make the problem clear, here are a few more examples:
[1, 1, 1, 0, 0, 0, 3]
should return 1,
[0, 0, 0, 1, 1, 3, 4]
should also return 1 (since most of them agree there is AT LEAST 1 cluster).
There is a problem with your logic
Here is an implementation of the described algorithm.
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
l = sorted(l, reverse=True)
votes = {x: i for i, x in enumerate(l, start=1)}
Output
{2: 1, 1: 5, 0: 9}
Notice that since you define a vote as agreeing with anything smaller than itself, then min(l) will always win, because everyone will agree that there are at least min(l) clusters. In this case min(l) == 0.
How to fix it
Mean and median
Before anything else, notice that taking the mean or the median are valid, lightweight options that both produce the desired output on your examples.
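For reference, a minimal sketch of those two options (not in the original answer), using the example list from the question:
import numpy as np

total_pred = [2, 0, 1, 0, 1, 0, 1, 0, 1]
print(int(np.round(np.mean(total_pred))))   # 1
print(int(np.median(total_pred)))           # 1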
Bias
However, taking the mean might not be what you want if, say, you encounter votes with high variance such as [0, 0, 7, 8, 10], where it is unlikely that the answer is 5.
A more general way to fix that is to include a voter's bias toward votes close to theirs. Surely a voter who voted 2 will agree more with a 1 than with a 0.
You do that by implementing a metric (note: this is not a metric in the mathematical sense) that determines how much an instance that voted for x is willing to agree to a vote for y on a scale of 0 to 1.
Note that this approach will allow voters to agree on a number that is not on the list.
We need to update our code to account for applying that pseudometric.
def d(x, y):
    return x >= y
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}
Output
{0: 9, 1: 5, 2: 1}
The above metric is a sanity check. It is the one you provided in your question, and it indeed ends up determining that 0 wins.
Metric choices
You will have to toy a bit with your metrics, but here are a few which may make sense.
Inverse of the linear distance
def d(x, y):
    return 1 / (1 + abs(x - y))
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 6.33, 1: 6.5, 2: 4.33}
Inverse of the nth power of the distance
This one is a generalization of the previous. As n grows, voters tend to agree less and less with distant vote casts.
def d(x, y, n=1):
    return 1 / (1 + abs(x - y)) ** n
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y, n=2) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 5.11, 1: 5.25, 2: 2.44}
Upper-bound distance
Similar to the previous metric, this one is close to what you described at first in the sense that a voter will never agree to a vote higher than theirs.
def d(x, y, n=1):
    return 1 / (1 + abs(x - y)) ** n if x >= y else 0
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y, n=2) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 5.11, 1: 4.25, 2: 1.0}
Normal distribution
Another option that would make sense is a normal distribution or a skewed normal distribution.
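Here is a hedged sketch of such a metric (not in the original answer), using a Gaussian kernel centred on the voter's own vote; sigma controls how quickly agreement decays with distance:
import math

def d(x, y, sigma=1.0):
    # Gaussian agreement: 1 when y == x, decaying smoothly with distance
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 6.56, 1: 7.03, 2: 3.97} (rounded), so 1 wins here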
While the other answer provides a comprehensive review of possible metrics and methods, it seems what you are seeking is simply the number of clusters closest to the mean!
So something as simple as:
cluster_num=int(np.round(np.mean(total_pred)))
Which returns 1 for all your cases as you expect.

calculate the sum of the intervals based on the binary array

I have two matrices:
Binary A = [[1, 0, 1, 0], [0, 0, 1, 0]];
Matrix of values B = [[100, 200, 300, 400], [400, 300, 100, 200]];
I want to calculate the sums over the intervals that are formed by the rows of matrix A. For my example the result will be as follows: R = [[300, 0, 700, 0], [0, 0, 300, 0]] (generally, it is not necessary to keep the zeros; [[300, 700], [300]] is a correct solution too).
I have already written some code, but it is very ugly (although it works correctly):
def find_halfsum(row1, row2):
    i = 0
    result = []
    count = 0
    for j in range(len(row1)):
        if row1[j] == 1 and count == 0:
            i = j
            count += 1
        elif row1[j] == 1:
            count += 1
        if count == 2:
            if j == i + 1:
                result.append(row2[i])
            else:
                result.append(sum(row2[i:j]))
            i = j
            count = 1
        if j == len(row1) - 1:
            result.append(sum(row2[i:j + 1]))
    return result
Does anyone know a more elegant solution (which would also be faster), preferably with the help of NumPy?
Thanks
Not familiar with Python, but I don't think you need that many lines:
def halfSum(matrixA, matrixB):
    total = 0
    for i in range(len(matrixA)):
        if matrixA[i] == 1:
            total += matrixB[i]
    return total
You can use numpy.add.reduceat:
>>> A = np.array([[1, 0, 1, 0], [0, 0, 1, 0]])
>>> B = np.array([[100, 200, 300, 400], [400, 300, 100, 200]])
>>>
>>> [np.add.reduceat(b, np.flatnonzero(a)) for a, b in zip(A, B)]
[array([300, 700]), array([300])]

numpy apply_along_axis vectorisation

I am trying to implement a function that takes each row in a numpy 2d array and returns a scalar result of a certain calculation. My current code looks like the following:
img = np.array([
    [ 0,  5, 70,  0,  0,  0],
    [10, 50,  4,  4,  2,  0],
    [50, 10,  1, 42, 40,  1],
    [10,  0,  0,  6, 85, 64],
    [ 0,  0,  0,  1,  2, 90]]
)

def get_y(stride):
    stride_vals = stride[stride > 0]
    pix_thresh = stride_vals.max() - 1.5*stride_vals.std()
    return np.argwhere(stride > pix_thresh).mean()

np.apply_along_axis(get_y, 0, img)
>> array([ 2. , 1. , 0. , 2. , 2.5, 3.5])
It works as expected; however, performance isn't great, as the real dataset has ~2k rows and ~20-50 columns per frame, coming in 60 times a second.
Is there a way to speed-up the process, perhaps by not using np.apply_along_axis function?
Here's one vectorized approach that sets the zeros to NaN, which lets us use np.nanmax and np.nanstd to compute the max and std values while avoiding the zeros, like so -
imgn = np.where(img==0, np.nan, img)
mx = np.nanmax(imgn,0) # np.max(img,0) if all are positive numbers
st = np.nanstd(imgn,0)
mask = img > mx - 1.5*st
out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0)
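As a quick sanity check (not part of the original answer), the vectorized result can be compared with the apply_along_axis version on the sample img defined in the question:
assert np.allclose(out, np.apply_along_axis(get_y, 0, img))
# both give array([2. , 1. , 0. , 2. , 2.5, 3.5]) for that img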
Runtime test -
In [94]: img = np.random.randint(-100,100,(2000,50))
In [95]: %timeit np.apply_along_axis(get_y, 0, img)
100 loops, best of 3: 4.36 ms per loop
In [96]: %%timeit
...: imgn = np.where(img==0, np.nan, img)
...: mx = np.nanmax(imgn,0)
...: st = np.nanstd(imgn,0)
...: mask = img > mx - 1.5*st
...: out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0)
1000 loops, best of 3: 1.33 ms per loop
Thus, we are seeing a 3x+ speedup.

How to generate all the permutations of a multiset?

A multiset is a set in which the elements may not all be unique. How can I enumerate all possible permutations of its elements?
Generating all the possible permutations and then discarding the repeated ones is highly inefficient. Various algorithms exist to directly generate the permutations of a multiset in lexicographical order or another kind of ordering. Takaoka's algorithm is a good example, but that of Aaron Williams is probably better:
http://webhome.csc.uvic.ca/~haron/CoolMulti.pdf
Moreover, it has been implemented in the R package ''multicool''.
Btw, if you just want the total number of distinct permutations, the answer is the Multinomial coefficient:
e.g., if you have, say, n_a elements 'a', n_b elements 'b', and n_c elements 'c',
the total number of distinct permutations is (n_a + n_b + n_c)! / (n_a! * n_b! * n_c!)
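For example, a small helper to compute that count (a sketch, not part of the original answer):
from collections import Counter
from math import factorial

def count_multiset_perms(items):
    # (n_a + n_b + ...)! / (n_a! * n_b! * ...)
    counts = Counter(items)
    total = factorial(sum(counts.values()))
    for c in counts.values():
        total //= factorial(c)
    return total

print(count_multiset_perms('banana'))   # 60, matching the sympy count further down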
This is my translation of the Takaoka multiset permutations algorithm into Python (available here and at repl.it):
def msp(items):
    '''Yield the permutations of `items` where items is either a list
    of integers representing the actual items or a list of hashable items.
    The output are the unique permutations of the items given as a list
    of integers 0, ..., n-1 that represent the n unique elements in
    `items`.

    Examples
    ========

    >>> for i in msp('xoxox'):
    ...     print(i)

    [1, 1, 1, 0, 0]
    [0, 1, 1, 1, 0]
    [1, 0, 1, 1, 0]
    [1, 1, 0, 1, 0]
    [0, 1, 1, 0, 1]
    [1, 0, 1, 0, 1]
    [0, 1, 0, 1, 1]
    [0, 0, 1, 1, 1]
    [1, 0, 0, 1, 1]
    [1, 1, 0, 0, 1]

    Reference: "An O(1) Time Algorithm for Generating Multiset Permutations", Tadao Takaoka
    https://pdfs.semanticscholar.org/83b2/6f222e8648a7a0599309a40af21837a0264b.pdf
    '''

    def visit(head):
        (rv, j) = ([], head)
        for i in range(N):
            (dat, j) = E[j]
            rv.append(dat)
        return rv

    u = list(set(items))
    E = list(reversed(sorted([u.index(i) for i in items])))
    N = len(E)
    # put E into linked-list format
    (val, nxt) = (0, 1)
    for i in range(N):
        E[i] = [E[i], i + 1]
    E[-1][nxt] = None
    head = 0
    afteri = N - 1
    i = afteri - 1
    yield visit(head)
    while E[afteri][nxt] is not None or E[afteri][val] < E[head][val]:
        j = E[afteri][nxt]  # added to algorithm for clarity
        if j is not None and E[i][val] >= E[j][val]:
            beforek = afteri
        else:
            beforek = i
        k = E[beforek][nxt]
        E[beforek][nxt] = E[k][nxt]
        E[k][nxt] = head
        if E[k][val] < E[head][val]:
            i = k
        afteri = E[i][nxt]
        head = k
        yield visit(head)
sympy provides multiset_permutations.
from the doc:
>>> from sympy.utilities.iterables import multiset_permutations
>>> from sympy import factorial
>>> [''.join(i) for i in multiset_permutations('aab')]
['aab', 'aba', 'baa']
>>> factorial(len('banana'))
720
>>> sum(1 for _ in multiset_permutations('banana'))
60
There are O(1) (per permutation) algorithms for multiset permutation generation, for example, from Takaoka (with implementation)
This is an optimisation of smichr's answer: I unzipped the nxts (the next-pointers) to make the visit function more efficient with an accumulate() (the map() is faster than a list comprehension, and it seemed shallow and pedantic to have to nest it in a second one with a constant index).
from itertools import accumulate

def msp(items):
    def visit(head):
        '''(rv, j) = ([], head)
        for i in range(N):
            (dat, j) = E[j]
            rv.append(dat)
        return(rv)'''
        #print(reduce(lambda e,dontCare: (e[0]+[E[e[1]]],nxts[e[1]]),range(N),([],head))[0])
        #print(list(map(E.__getitem__,accumulate(range(N-1),lambda e,N: nxts[e],initial=head))))
        return(list(map(E.__getitem__,accumulate(range(N-1),lambda e,N: nxts[e],initial=head))))

    u=list(set(items))
    E=list(sorted(map(u.index,items)))
    N=len(E)
    nxts=list(range(1,N))+[None]
    head=0
    i,ai,aai=N-3,N-2,N-1
    yield(visit(head))
    while aai!=None or E[ai]>E[head]:
        beforek=(i if aai==None or E[i]>E[aai] else ai)
        k=nxts[beforek]
        if E[k]>E[head]:
            i=k
        nxts[beforek],nxts[k],head = nxts[k],head,k
        ai=nxts[i]
        aai=nxts[ai]
        yield(visit(head))
Here are the test results (the second case has (13!/2!/3!/3!/4!)/10! = 143/144 times as many permutations but takes longer, due to being more of a multiset, I suppose); mine seems 9% and 7% faster, respectively:
cProfile.run("list(msp(list(range(10))))")
cProfile.run("list(msp([0,1,1,2,2,2,3,3,3,3,4,4,4]))")
original:
43545617 function calls in 28.452 seconds
54054020 function calls in 32.469 seconds
modification:
39916806 function calls in 26.067 seconds
50450406 function calls in 30.384 seconds
I have insufficient reputation to comment on answers, but for an items input list, Martin Böschen's answer has a time complexity greater by the product of the factorials of the number of instances of each element value, i.e.
reduce(int.__mul__,map(lambda n: reduce(int.__mul__,range(1,n+1)),map(items.count,set(items))))
This can grow large quickly for large multisets with many repeated elements. For instance, it will take 1728 times longer per permutation for my second example above than for my first.
You can reduce your problem to enumerating all permutations of a list. The typical permutation generation algorithm takes a list and doesn't check whether elements are equal, so you only need to build a list out of your multiset and feed it to your permutation-generating algorithm.
For example, you have the multiset {1,2,2}.
You transform it to the list [1,2,2].
And generate all permutations, for example in python:
import itertools as it

for i in it.permutations([1, 2, 2]):
    print(i)
And you will get the output
(1, 2, 2)
(1, 2, 2)
(2, 1, 2)
(2, 2, 1)
(2, 1, 2)
(2, 2, 1)
The problem is that you get some permutations repeatedly. A simple solution would be to just filter them out:
import itertools as it

permset = set([i for i in it.permutations([1, 2, 2])])
for x in permset:
    print(x)
Output:
(1, 2, 2)
(2, 2, 1)
(2, 1, 2)
