I'm writing a python script that uses a model to predict a large number of values by groupID, where efficiency is important (N on the order of 10^8). I initialize a results matrix and am trying to sequentially update a running sum of values in the results matrix.
To be efficient, my current method uses the groupIDs as row indices of the results matrix, to avoid merging (merging is expensive, as far as I understand).
My attempt:
import numpy as np
# Initialize results matrix
results = np.zeros((5,3)) # dimension: number of groups x timestep
# Now I loop over batches, with batch size 4. Here's an example of one iteration:
batch_groupIDs = [3,1,0,1] # Note that multiple values can be generated for same groupID
batch_results = np.ones((4,3))
# My attempt at appending the results (low dimension example):
results[batch_groupIDs] += batch_results
print(results)
This outputs:
[[1. 1. 1.]
[1. 1. 1.]
[0. 0. 0.]
[1. 1. 1.]
[0. 0. 0.]]
My desired output is the following (since group 1 shows up twice, its results should be added twice):
[[1. 1. 1.]
[2. 2. 2.]
[0. 0. 0.]
[1. 1. 1.]
[0. 0. 0.]]
The actual dimensions of my problem are approximately 2000 groupIDs x 100 timesteps, with a batch size of 1 million+.
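(Side note: np.add.at, which appears as the benchmark in the answer below, performs exactly this unbuffered accumulation, adding once per repeated index. A minimal sketch on the toy example:)
import numpy as np
results = np.zeros((5, 3))
batch_groupIDs = [3, 1, 0, 1]
batch_results = np.ones((4, 3))
np.add.at(results, batch_groupIDs, batch_results)  # repeated IDs accumulate
print(results)  # row 1 is now [2. 2. 2.], as desired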
Here is my understanding, and please correct me if I'm wrong:
We want a resultant matrix that has the shape number of groups x timestep which in this case would be 2000 x 100. This matrix needs to be efficiently updated sequentially for a batch size of 10^6.
If the summary is correct, here is my approach: divide the 10^6 sequences into, say, "mini-batches" of 10^4 each. A further assumption is that the timesteps are weighted equally across all groups.
Convert batch_groupIDs to a numpy array of frequencies. For batch_groupIDs = [3,1,0,1], the array would look like [1,2,0,1,0].
Convert the timestep values to a numpy array. For batch_results = np.ones((4,3)), the timestep array would be [1,1,1], since the 3 steps add equal weight.
Find the outer product and add it to the result.
Repeat until all sequences are generated.
Here is the equivalent function:
import random
import numpy as np

GROUPIDs = 2000
MINIBATCH_SIZE = 10000
TIMESTEPS = 100

def getMatrixOuter():
    result = np.zeros((GROUPIDs, TIMESTEPS))
    for x in range(100):  # 100 mini-batches, each of size 10000, to replicate one full batch
        batch_groupIDs = [random.randrange(GROUPIDs) for i in range(MINIBATCH_SIZE)]  # 10000 random IDs in range(0, GROUPIDs)
        counts = {k: 0 for k in range(GROUPIDs)}  # initialize a count for every groupID so the order is fixed
        for i in batch_groupIDs:
            counts[i] = counts.get(i, 0) + 1
        vals = np.fromiter(counts.values(), dtype=int)  # array of frequencies of the groupIDs
        steps = np.ones((TIMESTEPS,))  # weight of each timestep
        np.add(np.outer(vals, steps), result, out=result)  # accumulate the outer product in place
    return result
Here is how it fared when timed:
%timeit getMatrixOuter()
1 loop, best of 3: 704 ms per loop
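As an aside (my addition, not part of the timing above), the Python-level counting loop inside getMatrixOuter can be replaced by np.bincount, which builds the same frequency vector in C:
vals = np.bincount(batch_groupIDs, minlength=GROUPIDs)  # same frequencies as the counts dict, computed in C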
A possible benchmark could be evaluating it against a sequential add using np.add.at()
Here is a possible benchmark function:
GROUPIDs = 2000
MINIBATCH_SIZE = 10000
TIMESTEPS = 100

def getMatrixnp():
    result = np.zeros((GROUPIDs, TIMESTEPS))
    for i in range(int(1e6)):
        np.add.at(result, [random.randrange(GROUPIDs)], 1)  # one unbuffered row update per sequence
    return result
And here is how it fared on the same system:
%timeit getMatrixnp()
1 loop, best of 3: 10.8 s per loop
The solution certainly rests on some assumptions that may very well turn out to be false. Nevertheless, my 2 cents.
Using pandas, you can count the occurrences of each batch_groupID using the groupby method. Once that is done, you can simply add the result to your initial matrix using the add() method (you need to ensure the axis here is set to 0). If you specifically need a numpy.array, you can just call the .to_numpy() method on the DataFrame.
import pandas as pd
s = pd.Series(batch_groupIDs)
array = np.zeros((5,3))
# create a new series counting the occurrences of each index
group = s.groupby(s).count()
# leverage pandas .add() with axis=0 to add each count to the matching row
results = pd.DataFrame(array).add(group, axis=0).fillna(0).to_numpy()
which gives
array([[1., 1., 1.],
[2., 2., 2.],
[0., 0., 0.],
[1., 1., 1.],
[0., 0., 0.]])
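For clarity, the intermediate count series s.groupby(s).count() in this example is:
0    1
1    2
3    1
dtype: int64
Groups 2 and 4 never occur, so their rows come out as NaN after the add, which is why the .fillna(0) is needed.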
Related
I have two general functions, Estability3 and Lstability3, with which I would like to evaluate both two-dimensional slices of arrays and one-dimensional ranges of vectors. I have explored the error outside the functions in a Jupyter notebook, using some of the arguments to the functions.
These functions compute energy and angular momentum. The position and velocity data needed to compute the energy and angular momentum are stored in a two-dimensional matrix called xvec, where the position and velocity run along a row and the three rows represent the three stars. xvec0 is the initial data for the simulation (timestep 0).
xvec0
array([[-5.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00, -2.23606798e+00, 0.00000000e+00],
[ 5.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 2.23606798e+00, 0.00000000e+00],
[ 9.95024876e+02, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 4.46099737e-01, 0.00000000e+00]])
I select the first star of the zeroth timestep by selecting the first row of this matrix. If I were looping over thousands of timesteps like usual, I would build thousands of matrices like these, append them to a list, and then convert the list to a numpy array with thousands of columns (so xvec1_0 would have thousands of columns instead of one).
xvec1_0=xvec0[0]
Since xvec1_0 has only one column, here I am trying to force numpy to recognize it as a matrix. It doesn't work.
np.reshape(xvec1_0,(1,6))
array([[-5. , 0. , 0. , -0. , -2.23606798,
0. ]])
I see that it has two outer brackets, which implies that it is a matrix. But when I try to use the colon index over the one column like I normally do over the 1000s of columns, I get an error.
xvec1_0[:,0:3]
IndexError Traceback (most recent call last)
<ipython-input-115-79d26475ac10> in <module>
----> 1 xvec1_0[:,0:3]
IndexError: too many indices for array
Why can't I use the : operator to obtain the first row of this two dimensional array? How can I do that in this more general code that also applies to matrices?
Thanks,
Steven
I think I misread the function definition for reshape. I thought it changed the array in place. It doesn't; I needed to assign the output, like this:
xvec1_0 = np.reshape(xvec1_0,(1,6))
xvec1_0[:,0:3]
array([[-5., 0., 0.]])
xvec1_0
array([[-5. , 0. , 0. , -0. , -2.23606798,
0. ]])
xvec1_0.shape
(1, 6)
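For reference (my addition), the reshape can be avoided entirely: indexing with a slice keeps the array two-dimensional, whereas indexing with a bare integer drops a dimension:
import numpy as np
xvec0 = np.zeros((3, 6))
print(xvec0[0].shape)            # (6,)   -- integer index drops the row dimension
print(xvec0[0:1].shape)          # (1, 6) -- a slice keeps it two-dimensional
print(xvec0[0:1][:, 0:3].shape)  # (1, 3) -- so the colon indexing works directly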
Thanks to a friend's help, I discovered that the following works just fine.
import numpy as np
x = np.zeros((1,6))
print(x.shape)
print(x[:,0:3])
x[:,0:3]
(1, 6)
[[0. 0. 0.]]
array([[0., 0., 0.]])
x = np.zeros((6,))
print(x.shape)
x = np.reshape(x, (1,6))
print(x[:,0:3])
x[:,0:3]
(6,)
[[0. 0. 0.]]
array([[0., 0., 0.]])
Probably I should have thought of some of these tests, but I thought I already had found the most basic test when I saw the output from np.reshape. I really appreciate the help from my friend, and hope my question did not waste anyone's time too badly.
For whatever reason this only returns 0 or 1 instead of floats between them.
from sklearn import preprocessing
X = [[1.3, 1.6, 1.4, 1.45, 12.3, 63.01,],
[1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]
minmaxscaler = preprocessing.MinMaxScaler()
X_scale = minmaxscaler.fit_transform(X)
print(X_scale) # returns [[0. 1. 0. 0. 1. 0.] [1. 0. 1. 1. 0. 0.]]
MinMaxScaler cannot work with a plain list of lists; it needs to work with a numpy array, for example (or a DataFrame).
You can convert to a numpy array. That gives 2 samples with 6 features each, which I guess is not what you mean, so you also need to reshape.
import numpy
X = numpy.array([[1.3, 1.6, 1.4, 1.45, 12.3, 63.01,],
[1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]).reshape(-1,1)
Results after MinMax Scaler:
[[0.02047619]
[0.0252381 ]
[0.02206349]
[0.02285714]
[0.19507937]
[1. ]
[0.03 ]
[0. ]
[0.06809524]
[0.72047619]
[0.04761905]
[1. ]]
Not exactly sure if you want to min-max scale each list separately or all of them together.
The answer you got from MinMaxScaler is the expected one.
When you have only two datapoints, you will get only 0s and 1s. See the example here for a three-datapoint scenario.
You need to understand that it converts the lowest value in each column to 0 and the highest to 1. When you have more datapoints, the remaining ones are scaled based on the range (max - min); see the formula here.
Also, MinMaxScaler accepts 2D data, which means a list of lists is acceptable. That's the reason why you did not get any error.
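For intuition, here is the per-column formula applied by hand (my sketch, not from the answer above); with only two rows, every non-constant column necessarily maps to exactly 0 and 1:
import numpy as np
X = np.array([[1.3, 1.6, 1.4, 1.45, 12.3, 63.01],
              [1.9, 0.01, 4.3, 45.4, 3.01, 63.01]])
rng = X.max(axis=0) - X.min(axis=0)
rng[rng == 0] = 1  # constant columns (like the last one) would divide by zero; sklearn maps them to 0
X_scale = (X - X.min(axis=0)) / rng
print(X_scale)  # [[0. 1. 0. 0. 1. 0.] [1. 0. 1. 1. 0. 0.]]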
I made a simple function that produces a weighted average of several time series using supplied weights. It is designed to handle missing values (NaNs), which is why I am not using numpy's supplied average function.
However, when I feed it my array containing missing values, the array has its NaN values replaced by 0s! I would have assumed that since I am assigning the array to a new name inside the function and it is not a global variable, this should not happen. I want my X array to retain its original form, including the NaN value.
I am a relative novice using python (obviously).
Example:
X = np.array([[1, 2, 3], [1, 2, 3], [1, 2, np.nan]]) # 3 time series to be weighted together
weights = np.array([[1,1,1]]) # simple example with weights for each series as 1

def WeightedMeanNaN(Tseries, weights):
    ## calculates weighted mean
    N_Tseries = Tseries
    Weights = np.repeat(weights, len(N_Tseries), axis=0) # make a matrix of weights matching size of time series
    loc = np.where(np.isnan(N_Tseries)) # get locations of NaNs
    Weights[loc] = 0
    N_Tseries[loc] = 0
    Weights = Weights/Weights.sum(axis=1)[:,None] # normalize each row so that weights sum to 1
    WeightedAve = np.multiply(N_Tseries, Weights)
    WeightedAve = WeightedAve.sum(axis=1)
    return WeightedAve

WeightedMeanNaN(Tseries=X, weights=weights)
Out[161]: array([2. , 2. , 1.5])
In:X
Out:
array([[1., 2., 3.],
[1., 2., 3.],
[1., 2., 0.]]) # no longer nan!!
Where you call
loc = np.where(np.isnan(N_Tseries)) # get locations of NaNs
Weights[loc] = 0
N_Tseries[loc] = 0
you remove all NaNs and set them to zeros.
To reverse this you could iterate over the array afterwards and replace the zeros with NaNs.
However, that would also turn regular zeros into NaNs.
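One way around that (a sketch on the function above) is to reuse the loc indices that were saved before the zeroing, so that only the positions that were originally NaN are restored and regular zeros are untouched:
loc = np.where(np.isnan(N_Tseries))  # save the NaN locations before zeroing
N_Tseries[loc] = 0
# ... compute the weighted average as before ...
N_Tseries[loc] = np.nan              # restore only the original NaNs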
So it turns out this is a mistake caused by me being used to working in Matlab. Python passes arguments to functions as references to the original object, so in-place modifications inside the function affect the caller's array. In contrast, Matlab creates copies that are discarded when the function ends.
I solved my problem by adding ".copy()" when assigning variables in the function, so that the first line in the function above becomes:
N_Tseries = Tseries.copy()
However, one thing that puzzled me is that some people have suggested that using Tseries[:] should also create a copy of Tseries rather than a pointer to the original variable. This did not work for me. (That is expected for NumPy: slicing a list produces a copy, but slicing a NumPy array returns a view of the same data, so Tseries[:] still aliases the original.)
I found this answer useful:
Python function not supposed to change a global variable
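To illustrate that last point (a minimal sketch): slicing a Python list produces a copy, but slicing a NumPy array returns a view onto the same memory:
import numpy as np
a = np.array([1.0, 2.0, np.nan])
b = a[:]       # a view, not a copy
b[2] = 0.0
print(a)       # [1. 2. 0.] -- the original changed too
c = a.copy()   # an independent copy
c[0] = 99.0
print(a)       # [1. 2. 0.] -- unaffected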
SUM function results explanation when given two 2-d arrays
When I run the code in the Spyder IDE, the sum function and the numpy.add function show different results. Can anyone help me understand how the sum function output comes about when we give two 2-d arrays as its two parameters, instead of an array and a number? Thank you.
import numpy as np
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
print(x)
print(y)
print (x+y)
print(sum(x,y))
print(np.add(x,y))
Output
[[1. 2.]
[3. 4.]]
[[5. 6.]
[7. 8.]]
[[ 6. 8.]
[10. 12.]]
[[ 9. 12.]
[11. 14.]]
[[ 6. 8.]
[10. 12.]]
In Numpy, the + operator is defined to be element-wise addition and in fact equivalent to np.add(...).
The sum(iterable, [start]) built-in function
Sums start and the items of an iterable from left to right and returns the total. start defaults to 0.
So if given only one matrix, it performs a column-wise summation (adding the rows together). If given a second argument, that start value is added (element-wise) to the total. Some smaller examples:
sum(x)
> array([4., 6.])
# i.e. [(1+3), (2+4)]
sum(x, 1)
> array([5., 7.])
# i.e. [(1+1+3), (1+2+4)]
sum(y)
> array([12., 14.])
# i.e. [(5+7), (6+8)]
sum(x, sum(y))
> array([16., 20.])
# i.e. [((5+7)+1+3), ((6+8)+2+4)]
sum(x, y)
> array([[ 9., 12.],
[11., 14.]])
# i.e. [[(5+1+3), (6+2+4)],
# [(7+1+3), (8+2+4)]]
The last sum() is performing the column-wise sum of x, and then adding the result to y, broadcast down each column. Written with Numpy, it's equivalent to
sum(x, y) == x.sum(axis=0) + y
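A quick check of that identity (my addition):
import numpy as np
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
assert (sum(x, y) == x.sum(axis=0) + y).all()  # both give [[9., 12.], [11., 14.]]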
I want to create multiple matrices that have the property that their diagonal is zero and that are symmetric. Matrices of dimension n of this form need n*(n-1)/2 parameters to be completely specified.
These parameters shall later be learned...
In numpy I'm able to compute these by using numpy.triu_indices to get the indices of the upper triangular part, starting at the first diagonal above the main diagonal, and then filling it with the provided parameters, as in the following code snippet:
import numpy as np
R = np.array([[1,2,1,1,2,1], [1,1,1,1,1,1]])
s = R.shape[1]
M = R.shape[0]
iu_r, iu_c = np.triu_indices(s,1)
Q = np.zeros((M,s,s),dtype=float)
Q[:,iu_r,iu_c] = R
Q = Q + np.transpose(Q,(0,2,1))
Output:
[[[0. 1. 2. 1.]
[1. 0. 1. 2.]
[2. 1. 0. 1.]
[1. 2. 1. 0.]]
[[0. 1. 1. 1.]
[1. 0. 1. 1.]
[1. 1. 0. 1.]
[1. 1. 1. 0.]]]
But apparently one cannot directly translate this to TensorFlow, as
import tensorflow as tf
import numpy as np
M = 2
s = 4
iu_r, iu_c = np.triu_indices(s,1)
rates = tf.get_variable(shape=(M,s*(s-1)//2), name="R", dtype=float) # integer division, shape entries must be ints
Q = tf.get_variable(shape=(M,s,s), dtype=float, initializer=tf.initializers.zeros, name="Q")
Q = Q[:,iu_r,iu_c].assign(rates)
fails with
TypeError: Tensors in list passed to 'values' of 'Pack' Op have types [int32, int64, int64] that don't all match.
What would be the correct way to define this tensor of matrices from a tensor of vectors in tensorflow?
EDIT:
My current solution is to embed the values using the scatter_nd function provided by TensorFlow, since it meets the requirement that no redundant variables are allocated (as would be the case with fill_triangular). However, the indexing is not compatible with the indices generated by numpy. With the indices hardcoded, the following example works:
import tensorflow as tf
import numpy as np
M = 2
s = 4
iu_r, iu_c = np.triu_indices(s,1)
rates = tf.get_variable(shape=(M,s*(s-1)//2), name="R", dtype=float)
iupper = [[[0,0,1],[0,0,2],[0,0,3],[0,1,2],[0,1,3],[0,2,3]],[[1,0,1],[1,0,2],[1,0,3],[1,1,2],[1,1,3],[1,2,3]]]
Q = tf.scatter_nd(iupper,rates,shape=(M,s,s), name="rate_matrix")
It should be no problem to translate the indices obtained by
iu_r, iu_c = np.triu_indices(s,1)
But maybe someone has a more elegant solution for that?
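For what it's worth, here is one way (an untested sketch) to build those scatter indices from np.triu_indices instead of hardcoding them:
import numpy as np
M, s = 2, 4
iu_r, iu_c = np.triu_indices(s, 1)
n = len(iu_r)  # s*(s-1)//2 entries per matrix
batch_idx = np.repeat(np.arange(M), n)  # [0,...,0,1,...,1]
iupper = np.stack([batch_idx, np.tile(iu_r, M), np.tile(iu_c, M)], axis=1).reshape(M, n, 3)
# reproduces the hardcoded iupper above for M=2, s=4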
It is unclear to me how this part works:
import numpy as np
R = np.array([[1,2,1,1,2,1], [1,1,1,1,1,1]])
s = R.shape[1]
M = R.shape[0]
iu_r, iu_c = np.triu_indices(s,1)
Q = np.zeros((M,s,s),dtype=float)
Q[:,iu_r,iu_c] = R
Q = Q + np.transpose(Q,(0,2,1))
because for me it fails with an error.
You may use simpler code like this:
import numpy as np
R = [1,2,1,1,2,1]
N = 4
Q = np.zeros((N,N), dtype=float)
for i in range(0, N):
    for j in range(0, N):
        if i < j:
            Q[i][j] = R.pop(0)
Q would be:
[[0. 1. 2. 1.]
[0. 0. 1. 2.]
[0. 0. 0. 1.]
[0. 0. 0. 0.]]
<class 'numpy.ndarray'>
To get the symmetric Q just use this: Q = Q + np.transpose(Q)
Whatever you end up doing with your rates later, you can convert the result to a Tensor like this:
import tensorflow as tf
data_tf = tf.convert_to_tensor(Q, np.float32)
sess = tf.InteractiveSession()
print(data_tf.eval())
sess.close()
The other answers suggest using the convert_to_tensor function to convert your numpy array to a TensorFlow tensor.
This indeed can give you matrices with the desired property of being symmetric with a zero diagonal. However, once you start training, these properties might not hold anymore, as there is no guarantee in general that the weight updates will keep this property fixed.
If you do need to keep the matrices symmetric with a zero diagonal all along the training process, you can do the following:
import tensorflow as tf
from tensorflow.contrib.distributions import fill_triangular
M = 2 # batch size
s = 4 # matrix size
rates = tf.get_variable(shape=(M,s*(s+1)//2), name="R", dtype=float) # integer division; s*(s+1)//2 entries fill a full triangle including the diagonal
# Q will be triangular (with a non-zero diagonal!)
Q = fill_triangular(rates)
# set the diagonal of Q to zero.
Q = tf.linalg.set_diag(Q,tf.zeros((M,s)))
# make Q symmetric
Q = Q + tf.transpose(Q,[0,2,1])
Here is a test that verifies that the matrices hold the required properties, even after training:
import numpy as np
# define some arbitrary loss function
Q_target = tf.constant(np.random.normal(size=(1,s,s)).astype(np.float32))
loss = tf.nn.l2_loss(Q-Q_target)
# a single training step (which will update the matrices)
train_step = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# this is Q before training
print(sess.run(Q))
#[[[ 0. -0.564 0.318 -0.446]
# [-0.564 0. -0.028 0.2 ]
# [ 0.318 -0.028 0. 0.369]
# [-0.446 0.2 0.369 0. ]]
#
# [[ 0. 0.412 0.216 0.063]
# [ 0.412 0. 0.221 -0.336]
# [ 0.216 0.221 0. -0.653]
# [ 0.063 -0.336 -0.653 0. ]]]
# this is Q after training
sess.run(train_step)
print(sess.run(Q))
#[[[ 0. -0.548 0.235 -0.284]
# [-0.548 0. -0.055 0.074]
# [ 0.235 -0.055 0. 0.25 ]
# [-0.284 0.074 0.25 0. ]]
#
# [[ 0. 0.233 0.153 0.123]
# [ 0.233 0. 0.144 -0.354]
# [ 0.153 0.144 0. -0.568]
# [ 0.123 -0.354 -0.568 0. ]]]
Apparently you need something like convert_to_tensor.
This function converts Python objects of various types to Tensor objects. It accepts Tensor objects, numpy arrays, Python lists, and Python scalars.
Note: TensorFlow operations automatically convert NumPy ndarrays to Tensors.
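A minimal sketch of that conversion, in the same TF1-style API used elsewhere in this thread:
import numpy as np
import tensorflow as tf
Q_np = np.zeros((2, 4, 4), dtype=np.float32)   # e.g. the symmetric matrices built with numpy
Q_tf = tf.convert_to_tensor(Q_np, np.float32)  # numpy ndarray -> Tensor
print(Q_tf.shape)  # (2, 4, 4)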