Splitting a matrix using an array of indices - python-3.x

I have a matrix that I want to split up into two. The rows of the two new matrices are interleaved in the original, but I do have "start" and "stop" arrays indicating which rows belong to each new matrix.
I have given a small example below including my own solution which I do not find satisfying.
Is there a smarter way of splitting the matrix?
Note that there is a certain periodicity in this example, which is not the case in the real matrix.
import numpy as np
np.random.seed(1)
a = np.random.normal(size=[20,2])
print(a)
b_start = np.array([0, 5, 10, 15])
b_stop = np.array([2, 7, 12, 17])
c_start = np.array([2, 7, 12, 17])
c_stop = np.array([5, 10, 15, 20])
b = a[b_start[0]:b_stop[0], :]
c = a[c_start[0]:c_stop[0], :]
for i in range(1, len(b_start)):
    b = np.append(b, a[b_start[i]:b_stop[i], :], axis=0)
    c = np.append(c, a[c_start[i]:c_stop[i], :], axis=0)
print(b)
print(c)

You can use the fancy indexing functionality of numpy.
index_b = np.array([np.arange(b_start[i], b_stop[i]) for i in range(b_start.size)])
index_c = np.array([np.arange(c_start[i], c_stop[i]) for i in range(c_start.size)])
b = a[index_b].reshape(-1, a.shape[1])
c = a[index_c].reshape(-1, a.shape[1])
This will give you the same output.
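One caveat: stacking the per-slice aranges into a 2-D index array only works because all the b slices (and all the c slices) have the same length. If the slice lengths vary, as the question says they do in the real matrix, a flat 1-D index sidesteps the problem. A minimal sketch, reusing the arrays above:
index_b = np.concatenate([np.arange(i, j) for i, j in zip(b_start, b_stop)])
b = a[index_b]  # already 2-D; no reshape needed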
Test run:
import numpy as np
np.random.seed(1)
a = np.random.normal(size=[20,2])
print(a)
b_start = np.array([0, 5, 10, 15])
b_stop = np.array([2, 7, 12, 17])
c_start = np.array([2, 7, 12, 17])
c_stop = np.array([5, 10, 15, 20])
index_b = np.array([np.arange(b_start[i], b_stop[i]) for i in range(b_start.size)])
index_c = np.array([np.arange(c_start[i], c_stop[i]) for i in range(c_start.size)])
b = a[index_b].reshape(-1, a.shape[1])
c = a[index_c].reshape(-1, a.shape[1])
print(b)
print(c)
Output:
[[ 1.62434536 -0.61175641]
 [-0.52817175 -1.07296862]
 [ 1.46210794 -2.06014071]
 [-0.3224172  -0.38405435]
 [-1.10061918  1.14472371]
 [ 0.90159072  0.50249434]
 [-0.69166075 -0.39675353]
 [-0.6871727  -0.84520564]]
[[ 0.86540763 -2.3015387 ]
 [ 1.74481176 -0.7612069 ]
 [ 0.3190391  -0.24937038]
 [ 1.13376944 -1.09989127]
 [-0.17242821 -0.87785842]
 [ 0.04221375  0.58281521]
 [ 0.90085595 -0.68372786]
 [-0.12289023 -0.93576943]
 [-0.26788808  0.53035547]
 [-0.67124613 -0.0126646 ]
 [-1.11731035  0.2344157 ]
 [ 1.65980218  0.74204416]]
I did 100 runs of the two approaches; the running times were:
0.008551359176635742  # python for loop
0.0034341812133789062  # fancy indexing
And 10000 runs:
0.18994426727294922  # python for loop
0.26583170890808105  # fancy indexing

Congratulations on using np.append correctly. A lot of posters have problems with it.
But it is faster to collect values in a list, and do one concatenate. np.append makes a whole new array each time; list append just adds a pointer to the list in-place.
b = []
c = []
for i in range(len(b_start)):  # start at 0; b and c are empty lists here
    b.append(a[b_start[i]:b_stop[i], :])
    c.append(a[c_start[i]:c_stop[i], :])
b = np.concatenate(b, axis=0)
c = np.concatenate(c, axis=0)
or even
b = np.concatenate([a[i:j,:] for i,j in zip(b_start, b_stop)], axis=0)
The other answer does
idx = np.hstack([np.arange(i,j) for i,j in zip(b_start, b_stop)])
a[idx,:]
Based on previous SO questions I expect the two approaches to have about the same speed.
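If you want to verify that on your own data, a quick sketch with timeit (a, b_start and b_stop assumed defined as in the question):
import timeit
import numpy as np

t_concat = timeit.timeit(lambda: np.concatenate([a[i:j, :] for i, j in zip(b_start, b_stop)], axis=0), number=10000)
t_index = timeit.timeit(lambda: a[np.hstack([np.arange(i, j) for i, j in zip(b_start, b_stop)]), :], number=10000)
print(t_concat, t_index)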

Related

Python optimization of time-series data re-indexing based on multiple-parameter multi-variable input and singular value output

I am trying to optimize a funciton that is trying to maximize the correlation between two (pandas) time series arrays (X and Y). This is done by using three parameters (a, b, c) and a third time series array (Z). The Z array is used to reindex the values in the X array (based on the parameters a, b, c) in such a way as to maximize the correlation of the reindexed X array (Xnew) with the Y array.
Below is some pseudo-code to demonstrate what I am trying to do. I have attempted this using LMfit and scipy optimize but I am not sure how to make this task work in those packages. For example, in LMfit, if I try to minimize the MyOpt function (which passes back a single value of the correlation metric), it complains that I have more parameters than outputs. However, if I pass back the time series of the correlation metric (diff), then the parameter values remain fixed at their input values.
I know the reindexing function I am using works, because rather crude methods similar to the code below give significant changes in the mean (diff) metric passed back.
My knowledge of these optimization packages is not up to scratch for this job, so if anyone has a suggestion on how to tackle this, I would be grateful.
def GetNewIndex(Z, a, b, c):
    old_index = np.arange(0, len(Z))
    index_adj = some_func(a, b, c)
    new_index = old_index + index_adj
    max_old = np.max(old_index)
    new_index[new_index > max_old] = max_old
    new_index[new_index < 0] = 0
    return new_index

def MyOpt(params, X, Y, Z):
    a = params['A']
    b = params['B']
    c = params['C']
    # estimate lag (in samples) based on ambient RH
    new_index = GetNewIndex(Z, a, b, c)
    # assign old values to new locations and convert back to pandas series
    Xnew = np.take(X.values, new_index)
    Xnew = pd.Series(Xnew, index=X.index)
    cc = Y.rolling(1201, center=True).corr(Xnew)
    cc = cc.interpolate(limit_direction='both', limit_area=None)
    diff = 1 - np.abs(cc)
    return np.mean(diff)
#==================================================
X = some long pandas time series data
Y = some long pandas time series data
Z = some long pandas time series data
As = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
Bs = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
Cs = [5, 6, 5, 6, 5, 6, 5, 6, 5, 6, 5, 6]
outs = []
for A, B, C in zip(As, Bs, Cs):
    params = {'A': A, 'B': B, 'C': C}
    out = MyOpt(params, X, Y, Z)
    outs.append(out)
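For what it's worth, here is a minimal sketch of how this scalar objective could be wired into scipy.optimize.minimize (X, Y, Z and MyOpt assumed defined as above; the starting point x0 is arbitrary):
from scipy.optimize import minimize

def objective(p, X, Y, Z):
    # repack the flat parameter vector into the dict MyOpt expects
    return MyOpt({'A': p[0], 'B': p[1], 'C': p[2]}, X, Y, Z)

res = minimize(objective, x0=[1, 0, 5], args=(X, Y, Z), method='Nelder-Mead')
print(res.x)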

Find the index location of an element in a Numpy array

If I have:
x = np.array(([1,4], [2,5], [2,6], [3,4], [3,6], [3,7], [4,3], [4,5], [5,2]))
for item in range(3):
    choice = random.choice(x)
How can I get the index number of the random choice taken from the array?
I tried:
indexNum = np.where(x == choice)
print(indexNum[0])
But it didn't work.
I want the output, for example, to be something like:
chosenIndices = [1 5 8]
Another possibility is using np.where and np.intersect1d. Here the random choice is done without repetition.
import random
import numpy as np

x = np.array(([1,4], [2,5], [2,6], [3,4], [3,6], [3,7], [4,3], [4,5], [5,2]))
res = []
cont = 0
while cont < 3:
    choice = random.choice(x)
    ind = np.intersect1d(np.where(choice[0] == x[:, 0]), np.where(choice[1] == x[:, 1]))[0]
    if ind not in res:
        res.append(ind)
        cont += 1
print(res)
# Output [8, 1, 5]
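For comparison, the row lookup can also be done in one step with a vectorized comparison; a sketch, assuming x and choice as above:
ind = np.where((x == choice).all(axis=1))[0][0]  # index of the first matching row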
You can achieve this by converting the numpy array to a list of tuples and then applying the index function.
This would work:
import random
import numpy as np

chosenIndices = []
x = np.array(([1,4], [2,5], [2,6], [3,4], [3,6], [3,7], [4,3], [4,5], [5,2]))
x = x.T
x = list(zip(x[0], x[1]))
while len(chosenIndices) != 3:
    choice = random.choice(x)
    indexNum = x.index(choice)
    if indexNum not in chosenIndices:  # if the index already exists, the loop simply tries again
        chosenIndices.append(indexNum)
print(chosenIndices)  # Thus all different results.
Output:
[1, 3, 2]
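If avoiding duplicate indices is the only goal, a simpler sketch is to sample the row indices directly and index back into the array:
import numpy as np

x = np.array(([1,4], [2,5], [2,6], [3,4], [3,6], [3,7], [4,3], [4,5], [5,2]))
chosenIndices = np.random.choice(len(x), size=3, replace=False)  # no repeats
print(chosenIndices)    # e.g. [1 5 8]
print(x[chosenIndices])  # the corresponding rows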

How to make combination, if any one of the element exists that can be added to make sum?

The goal is to find all possible combinations that can be added to make a given sum.
Combinations can be formed from multiple elements, or from a single element if one equals the target on its own.
Input:
l1 = [9,1, 2, 7, 6, 1, 5]
target = 8
**Constraints**
1 <= len(l1) <= 500000
1 <= each_list_element <= 1000
Output:
Format : {index:element}
{1:1, 5:1, 4:6} #Indices : 1,5,4 Elements : 1,1,6
{1:1, 2:2, 6:5}
{5:1, 2:2, 6:5}
{1:1, 3:7}
{5:1, 3:7}
{2:2, 4:6}
More Scenarios:
Input = [4,6,8,5,3]
target = 3
Output {4:3}
Input = [4,6,8,3,5,3]
target = 3
Output {5:3,3:3}
Input = [1,2,3,15]
target = 15
Output {3:15}
The code below covers all of the above scenarios.
Scenarios still to be handled, in addition to the above:
Input =[1,6,7,1,3]
target=5
Output={0:1,3:1,4:3} , {0:1,0:1,4:3}, {3:1,3:1,4:3}
Input=[9,6,8,1,7]
target=5
Output={3:1,3:1,3:1,3:1,3:1}
As suggested by @Chris Doyle in a previous question, I will be using that code.
(How to find indices and combinations that adds upto given sum?)
Code:
from itertools import combinations

def find_sum_with_index(l1, target):
    # keep (index, value) pairs; <= so a single element equal to target qualifies
    index_vals = [iv for iv in enumerate(l1) if iv[1] <= target]
    for r in range(1, len(index_vals) + 1):
        for perm in combinations(index_vals, r):
            if sum([p[1] for p in perm]) == target:
                yield perm

l1 = [9, 1, 2, 7, 6, 1, 5]
target = 8
for match in find_sum_with_index(l1, target):
    print(dict(match))
You can use a dictionary comprehension, pairing each value with its original index via enumerate:
from itertools import combinations

l1 = [9, 1, 2, 7, 6, 1, 5]
target = 8
for r in range(1, len(l1) + 1):
    for c in combinations(enumerate(l1), r):
        if sum(x for _, x in c) == target:
            res = {i: x for i, x in c}  # keys are positions in l1
            print(res)

Generate a list with two unique elements with specific length [duplicate]

Simple question here:
I'm trying to get an array that alternates values (1, -1, 1, -1.....) for a given length. np.repeat just gives me (1, 1, 1, 1,-1, -1,-1, -1). Thoughts?
I like @Benjamin's solution. An alternative though is:
import numpy as np
a = np.empty((15,))
a[::2] = 1
a[1::2] = -1
This also allows for odd-length lists.
EDIT: Also just to note speeds, for an array of 10000 elements:
import numpy as np
from timeit import Timer
if __name__ == '__main__':
    setupstr = """
import itertools
import numpy as np
N = 10000
"""
    method1 = """
a = np.empty((N,), int)
a[::2] = 1
a[1::2] = -1
"""
    method2 = """
a = np.tile([1,-1], N)
"""
    method3 = """
a = np.array([1,-1]*N)
"""
    method4 = """
a = np.array(list(itertools.islice(itertools.cycle((1,-1)), N)))
"""
    nl = 1000
    t1 = Timer(method1, setupstr).timeit(nl)
    t2 = Timer(method2, setupstr).timeit(nl)
    t3 = Timer(method3, setupstr).timeit(nl)
    t4 = Timer(method4, setupstr).timeit(nl)
    print('method1', t1)
    print('method2', t2)
    print('method3', t3)
    print('method4', t4)
Results in timings of:
method1 0.0130500793457
method2 0.114426136017
method3 4.30518102646
method4 2.84446692467
If N = 100, things start to even out, but starting with the empty numpy array is still significantly faster (nl changed to 10000):
method1 0.05735206604
method2 0.323992013931
method3 0.556654930115
method4 0.46702003479
Numpy arrays are special awesome objects and should not be treated like python lists.
use resize():
In [38]: np.resize([1,-1], 10) # 10 is the length of result array
Out[38]: array([ 1, -1, 1, -1, 1, -1, 1, -1, 1, -1])
it can produce odd-length array:
In [39]: np.resize([1,-1], 11)
Out[39]: array([ 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1])
Use numpy.tile!
import numpy
a = numpy.tile([1,-1], 15)
use multiplication (note that this produces a plain Python list of 2*n elements, not a numpy array):
[1, -1] * n
If you want a memory efficient solution, try this:
def alternator(n):
    for i in range(n):
        if i % 2 == 0:
            yield 1
        else:
            yield -1
Then you can iterate over the answers like so:
for i in alternator(n):
    # do something with i
Maybe you're looking for itertools.cycle?
import itertools

list_ = (1, -1, 2, -2)  # , 3, -3, ...
for n, item in enumerate(itertools.cycle(list_)):
    if n == 30:
        break
    print(item)
I'll just throw these out there because they could be more useful in some circumstances.
If you just want to alternate between positive and negative:
[(-1)**i for i in range(n)]
or for a more general solution
nums = [1, -1, 2]
[nums[i % len(nums)] for i in range(n)]
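If a numpy array is wanted rather than a list, both ideas translate directly; a small sketch:
import numpy as np

n = 15
a = (-1) ** np.arange(n)            # alternating 1, -1, 1, ...
nums = np.array([1, -1, 2])
b = nums[np.arange(n) % len(nums)]  # any repeating pattern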

Roll of different amount along a single axis in a 3D matrix [duplicate]

I have a matrix (2d numpy ndarray, to be precise):
A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
And I want to roll each row of A independently, according to roll values in another array:
r = np.array([2, 0, -1])
That is, I want to do this:
print(np.array([np.roll(row, x) for row, x in zip(A, r)]))
[[0 0 4]
 [1 2 3]
 [0 5 0]]
Is there a way to do this efficiently? Perhaps using fancy indexing tricks?
Sure, you can do it using advanced indexing; whether it is the fastest way probably depends on your array size (if your rows are large it may not be):
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]

# Always use a negative shift, so that column_indices are valid
# (one could also use a modulo operation).
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:, np.newaxis]
result = A[rows, column_indices]
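The modulo variant mentioned in the comment looks like this; as a bonus it leaves r unmodified (a sketch):
def roll_rows(A, r):
    # roll each row of A to the right by the corresponding amount in r
    rows, cols = np.ogrid[:A.shape[0], :A.shape[1]]
    return A[rows, cols - (r % A.shape[1])[:, np.newaxis]]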
numpy.lib.stride_tricks.as_strided stricks (abbrev pun intended) again!
Speaking of fancy indexing tricks, there's the infamous - np.lib.stride_tricks.as_strided. The idea/trick would be to get a sliced portion starting from the first column until the second last one and concatenate at the end. This ensures that we can stride in the forward direction as needed to leverage np.lib.stride_tricks.as_strided and thus avoid the need of actually rolling back. That's the whole idea!
Now, in terms of actual implementation, we would use scikit-image's view_as_windows to elegantly use np.lib.stride_tricks.as_strided under the hood. Thus, the final implementation would be -
from skimage.util.shape import view_as_windows as viewW
def strided_indexing_roll(a, r):
    # Concatenate with sliced to cover all rolls
    a_ext = np.concatenate((a, a[:, :-1]), axis=1)
    # Get sliding windows; use advanced-indexing to select appropriate ones
    n = a.shape[1]
    return viewW(a_ext, (1, n))[np.arange(len(r)), (n - r) % n, 0]
Here's a sample run -
In [327]: A = np.array([[4, 0, 0],
...: [1, 2, 3],
...: [0, 0, 5]])
In [328]: r = np.array([2, 0, -1])
In [329]: strided_indexing_roll(A, r)
Out[329]:
array([[0, 0, 4],
[1, 2, 3],
[0, 5, 0]])
Benchmarking
# @seberg's solution
def advindexing_roll(A, r):
    rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
    r[r < 0] += A.shape[1]
    column_indices = column_indices - r[:, np.newaxis]
    return A[rows, column_indices]
Let's do some benchmarking on an array with a large number of rows and columns -
In [324]: np.random.seed(0)
...: a = np.random.rand(10000,1000)
...: r = np.random.randint(-1000,1000,(10000))
# @seberg's solution
In [325]: %timeit advindexing_roll(a, r)
10 loops, best of 3: 71.3 ms per loop
# Solution from this post
In [326]: %timeit strided_indexing_roll(a, r)
10 loops, best of 3: 44 ms per loop
In case you want a more general solution (dealing with any shape and any axis), I modified @seberg's solution:
def indep_roll(arr, shifts, axis=1):
    """Apply an independent roll for each dimension along a single axis.

    Parameters
    ----------
    arr : np.ndarray
        Array of any shape.
    shifts : np.ndarray
        How many positions to shift each line. Shape: `(arr.shape[axis],)`.
    axis : int
        Axis along which elements are shifted.
    """
    arr = np.swapaxes(arr, axis, -1)
    all_idcs = np.ogrid[[slice(0, n) for n in arr.shape]]
    # Convert to a positive shift
    shifts[shifts < 0] += arr.shape[-1]
    all_idcs[-1] = all_idcs[-1] - shifts[:, np.newaxis]
    result = arr[tuple(all_idcs)]
    arr = np.swapaxes(result, -1, axis)
    return arr
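A quick check on the 2-D example from the question (note the function modifies shifts in place, so pass a copy if you need to keep it):
A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
r = np.array([2, 0, -1])
print(indep_roll(A, r.copy(), axis=1))
# [[0 0 4]
#  [1 2 3]
#  [0 5 0]]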
I implemented a pure numpy.lib.stride_tricks.as_strided solution as follows:
from numpy.lib.stride_tricks import as_strided

def custom_roll(arr, r_tup):
    m = np.asarray(r_tup)
    arr_roll = arr[:, [*range(arr.shape[1]), *range(arr.shape[1] - 1)]].copy()  # need `copy`
    strd_0, strd_1 = arr_roll.strides
    n = arr.shape[1]
    result = as_strided(arr_roll, (*arr.shape, n), (strd_0, strd_1, strd_1))
    return result[np.arange(arr.shape[0]), (n - m) % n]
A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
r = np.array([2, 0, -1])
out = custom_roll(A, r)
Out[789]:
array([[0, 0, 4],
       [1, 2, 3],
       [0, 5, 0]])
By using a fast Fourier transform we can apply a transformation in the frequency domain and then use the inverse fast Fourier transform to obtain the row shift.
So this is a pure numpy solution that takes only one line:
import numpy as np
from numpy.fft import fft, ifft

# The row-shift function using the fast Fourier transform:
# rshift(A, r), where A is a 2D array and r is the row-shift vector
def rshift(A, r):
    return np.real(ifft(fft(A, axis=1) * np.exp(2 * 1j * np.pi / A.shape[1] * r[:, None] * np.r_[0:A.shape[1]][None, :]), axis=1).round())
This applies a left shift, but we can simply negate the exponent of the exponential to turn the function into a right-shift function:
ifft(fft(...)*np.exp(-2*1j...)
It can be used like that:
# Example:
A = np.array([[1,2,3,4],
              [1,2,3,4],
              [1,2,3,4]])
r = np.array([1,-1,3])
print(rshift(A,r))
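Note that rshift returns (rounded) floats; if the input is integer data you may want a final cast, e.g.:
print(rshift(A, r).astype(int))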
Building on Divakar's excellent answer, you can easily apply this logic to a 3D array (which was the problem that brought me here in the first place). Here's an example - basically flatten your data, roll it, and reshape it afterwards:
from skimage.util.shape import view_as_windows as viewW

def applyroll_30(cube, threshold=25, offset=500):
    flattened_cube = cube.copy().reshape(cube.shape[0]*cube.shape[1], cube.shape[2])
    roll_matrix = calc_roll_matrix_flattened(flattened_cube, threshold, offset)
    rolled_cube = strided_indexing_roll(flattened_cube, roll_matrix, cube_shape=cube.shape)
    rolled_cube = rolled_cube.reshape(cube.shape[0], cube.shape[1], cube.shape[2])
    return rolled_cube

def calc_roll_matrix_flattened(cube_flattened, threshold, offset):
    """Calculates the number of positions along the time axis we need to shift
    elements in order to trig the data.
    We return a 1D numpy array of X*Y elements.
    """
    # argmax(...) finds the position in the cube (3d) where we are above threshold
    roll_matrix = np.argmax(cube_flattened > threshold, axis=1) + offset
    # ensure we don't have an index out of bounds
    roll_matrix[roll_matrix > cube_flattened.shape[1]] = cube_flattened.shape[1]
    return roll_matrix

def strided_indexing_roll(cube_flattened, roll_matrix_flattened, cube_shape):
    # Negate the shifts, otherwise we shift in the wrong direction for my application
    roll_matrix_flattened = -1 * roll_matrix_flattened
    # Concatenate with sliced to cover all rolls
    a_ext = np.concatenate((cube_flattened, cube_flattened[:, :-1]), axis=1)
    # Get sliding windows; use advanced-indexing to select appropriate ones
    n = cube_flattened.shape[1]
    result = viewW(a_ext, (1, n))[np.arange(len(roll_matrix_flattened)), (n - roll_matrix_flattened) % n, 0]
    result = result.reshape(cube_shape)
    return result
Divakar's answer doesn't do justice to how much more efficient this is on a large cube of data. I've timed it on 400x400x2000 data formatted as int8. An equivalent for loop takes ~5.5 seconds, Seberg's answer ~3.0 seconds, and strided_indexing_roll ~0.5 seconds.
