Error converting covariance to correlation using scipy - python-3.x

I am trying to convert a covariance matrix (from scipy.optimize.curve_fit) to a correlation matrix using the method here:
https://math.stackexchange.com/questions/186959/correlation-matrix-from-covariance-matrix
My test data is from here https://blogs.sas.com/content/iml/2010/12/10/converting-between-correlation-and-covariance-matrices.html
My code is here
import numpy as np
S = [[1.0, 1.0, 8.1],
     [1.0, 16.0, 18.0],
     [8.1, 18.0, 81.0]]
S = np.array(S)
diag = np.sqrt(np.diag(np.diag(S)))
gaid = np.linalg.inv(diag)
corl = gaid * S * gaid
print(corl)
I was expecting to see [[1. 0.25 0.9 ], [0.25 1. 0.5 ], [0.9 0.5 1. ]] but instead I get [[1. 0. 0.], [0. 1. 0.], [0. 0. 1.]]. I am obviously doing something silly but just not sure what, so all suggestions gratefully received - thanks!

You've probably figured it out by now, but you have to use the @ operator for matrix multiplication in numpy. The * operator is for element-wise multiplication.
So
corl = gaid @ S @ gaid
gives the answer you are looking for.
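As a side note, the same result can be obtained without the explicit matrix inverse by dividing the covariance matrix element-wise by the outer product of the standard deviations (a small sketch of the same math, not from the original answer):
import numpy as np

S = np.array([[1.0, 1.0, 8.1],
              [1.0, 16.0, 18.0],
              [8.1, 18.0, 81.0]])
d = np.sqrt(np.diag(S))      # standard deviations, here [1, 4, 9]
corl = S / np.outer(d, d)    # divide each S[i, j] by d[i] * d[j]
print(corl)                  # [[1.   0.25 0.9 ] [0.25 1.   0.5 ] [0.9  0.5  1.  ]]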

Related

sklearn.preprocessing.MinMaxScaler() only returns 0 or 1 and not float

For whatever reason this only returns 0 or 1 instead of floats in between.
from sklearn import preprocessing
X = [[1.3, 1.6, 1.4, 1.45, 12.3, 63.01],
     [1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]
minmaxscaler = preprocessing.MinMaxScaler()
X_scale = minmaxscaler.fit_transform(X)
print(X_scale) # returns [[0. 1. 0. 0. 1. 0.] [1. 0. 1. 1. 0. 0.]]
MinMaxScaler cannot work with a list of lists; it needs a numpy array, for example (or a DataFrame).
You can convert to a numpy array. That will give 6 features with 2 samples, which I guess is not what you mean, so you also need to reshape.
import numpy
X = numpy.array([[1.3, 1.6, 1.4, 1.45, 12.3, 63.01],
                 [1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]).reshape(-1, 1)
Results after MinMax Scaler:
[[0.02047619]
 [0.0252381 ]
 [0.02206349]
 [0.02285714]
 [0.19507937]
 [1.        ]
 [0.03      ]
 [0.        ]
 [0.06809524]
 [0.72047619]
 [0.04761905]
 [1.        ]]
Not exactly sure if you want to min-max scale each list separately or all together.
The answer you got from MinMaxScaler is the expected one.
When you have only two data points, you will get only 0s and 1s. See the example here for a three-data-point scenario.
You need to understand that it converts the lowest value in each column to 0 and the highest to 1. When you have more data points, the remaining ones are scaled based on the range (max - min); see the formula here.
Also, MinMaxScaler accepts 2D data, which means a list of lists is acceptable. That is why you did not get any error.
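To illustrate the (x - min) / (max - min) formula with more than two points per column, here is a small example of my own (not from the original answer):
from sklearn import preprocessing
import numpy as np

X = np.array([[1.0], [2.0], [5.0]])   # three data points in one column
scaler = preprocessing.MinMaxScaler()
print(scaler.fit_transform(X))
# [[0.  ]
#  [0.25]    <- (2 - 1) / (5 - 1)
#  [1.  ]]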

How to efficiently append running sum in Python?

I'm writing a python script that uses a model to predict a large number of values by groupID, where efficiency is important (N on the order of 10^8). I initialize a results matrix and am trying to sequentially update a running sum of values in the results matrix.
Trying to be efficient, in my current method I use groupID as row numbers of the results matrix to avoid merging (merging is expensive, as far as I understand).
My attempt:
import numpy as np
# Initialize results matrix
results = np.zeros((5,3)) # dimension: number of groups x timestep
# Now I loop over batches, with batch size 4. Here's an example of one iteration:
batch_groupIDs = [3,1,0,1] # Note that multiple values can be generated for same groupID
batch_results = np.ones((4,3))
# My attempt at appending the results (low dimension example):
results[batch_groupIDs] += batch_results
print(results)
This outputs:
[[1. 1. 1.]
 [1. 1. 1.]
 [0. 0. 0.]
 [1. 1. 1.]
 [0. 0. 0.]]
My desired output is the following (since group 1 shows up twice, and should be appended twice):
[[1. 1. 1.]
 [2. 2. 2.]
 [0. 0. 0.]
 [1. 1. 1.]
 [0. 0. 0.]]
The actual dimensions of my problem are approximately 100 timesteps, a batch size of 1 million+, and 2000 groupIDs.
Here is my understanding, and please correct me if I'm wrong:
We want a resultant matrix that has the shape number of groups x timestep which in this case would be 2000 x 100. This matrix needs to be efficiently updated sequentially for a batch size of 10^6.
If the summary is correct, here is my approach. We divide the 10^6 sequences into, let's say, "mini-batches" of 10^4 each. Another assumption is that the timesteps are weighted equally across all groups:
Convert batch_groupIDs to a numpy array of frequencies. For batch_groupIDs = [3,1,0,1], the array would look like [1,2,0,1,0].
Convert the timestep values to a numpy array. For batch_results = np.ones((4,3)), the timestep array would look like [1,1,1], since the 3 timesteps are weighted equally.
Find the outer product and add it to the result.
Repeat until all sequences are generated.
Here is the equivalent function:
import random
import numpy as np

GROUPIDs = 2000
MINIBATCH_SIZE = 10000
TIMESTEPS = 100

def getMatrixOuter():
    result = np.zeros((GROUPIDs, TIMESTEPS))
    for x in range(100):  # 100 mini-batches each of size 10000 to replicate a sequence
        batch_groupIDs = [random.randrange(GROUPIDs) for i in range(MINIBATCH_SIZE)]  # 10000 random group IDs in range(0, GROUPIDs)
        counts = {k: 0 for k in range(GROUPIDs)}
        for i in batch_groupIDs:
            counts[i] = counts.get(i, 0) + 1
        vals = np.fromiter(counts.values(), dtype=int)  # array of frequencies of GROUPIDs
        steps = np.ones((TIMESTEPS,))  # weight of each timestep
        np.add(np.outer(vals, steps), result, out=result)
    return result
Here is how it fared when timed:
%timeit getMatrixOuter()
1 loop, best of 3: 704 ms per loop
A possible benchmark could be evaluating it against a sequential add using np.add.at()
Here is a possible benchmark function:
GROUPIDs = 2000
MINIBATCH_SIZE = 10000
TIMESTEPS = 100
def getMatrixnp():
    result = np.zeros((GROUPIDs, TIMESTEPS))
    for i in range(int(1e6)):
        np.add.at(result, [random.randrange(2000) for i in range(1)], 1)
    return result
And here is how it fared on the same system:
%timeit getMatrixnp()
1 loop, best of 3: 10.8 s per loop
The solution certainly rests on some assumptions that may very well turn out to be false. Nevertheless, my 2 cents.
Using pandas, you can count the occurrences of each batch_groupID with the groupby method. Once that is done, you can simply add the result to your initial matrix using the add() method (you need to ensure the axis here is set to 0). If you specifically need a numpy.array, you can then use the .to_numpy() method on the DataFrame.
import pandas as pd
s = pd.Series(batch_groupIDs)
array = np.zeros((5,3))
#create a new series counting the occurrence of each index
group = s.groupby(s).count()
# leverage pandas to add each count occurrence
results = pd.DataFrame(array).add(group, 0).fillna(0).to_numpy()
which gives
array([[1., 1., 1.],
       [2., 2., 2.],
       [0., 0., 0.],
       [1., 1., 1.],
       [0., 0., 0.]])
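For reference, the unbuffered np.add.at used in the benchmark above can also be applied directly to the small example from the question (a sketch of my own, not from either answer). Unlike results[batch_groupIDs] += batch_results, it accumulates repeated indices:
import numpy as np

results = np.zeros((5, 3))
batch_groupIDs = [3, 1, 0, 1]
batch_results = np.ones((4, 3))

# unbuffered in-place addition: the duplicated group ID 1 is added twice
np.add.at(results, batch_groupIDs, batch_results)
print(results)
# [[1. 1. 1.]
#  [2. 2. 2.]
#  [0. 0. 0.]
#  [1. 1. 1.]
#  [0. 0. 0.]]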

I am trying slicing but I have the following error message: slice indices must be integers or None or have an __index__ method

I am trying slicing but I have the following error message: slice indices must be integers or None or have an __index__ method
descriptors = numpy.fft.fftshift(descriptors)
center_index = len(descriptors) / 2
descriptors = descriptors[center_index - degree / 2:center_index + degree / 2]
In Python 3 you need to use // for floor division; the / operator always returns a float, unlike Python 2 where / on integers did floor division:
import numpy as np
descriptors = [ 0., 1., 2., 3., 4., -5., -4., -3., -2., -1.]
descriptors = np.fft.fftshift(descriptors)
print(descriptors)
center_index = len(descriptors) // 2
degree = 4
descriptors = descriptors[center_index - degree // 2 : center_index + degree // 2]
print(descriptors)
Output:
[-5. -4. -3. -2. -1. 0. 1. 2. 3. 4.]
[-2. -1. 0. 1.]

Reconstruct Matrix from svd components with Pyspark

I'm working on SVD using pyspark, but neither in the documentation nor anywhere else could I find how to reconstruct the matrix back from the decomposed factors. For example, using the svd of pyspark, I got the U, s and V matrices as below.
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([
    Vectors.sparse(5, {1: 1.0, 3: 7.0}),
    Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
    Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
])
mat = RowMatrix(rows)
# Compute the top 5 singular values and corresponding singular vectors.
svd = mat.computeSVD(5, computeU=True)
U = svd.U # The U factor is a RowMatrix.
s = svd.s # The singular values are stored in a local dense vector.
V = svd.V # The V factor is a local dense matrix.
Now, I want to reconstruct the original matrix by multiplying the factors back together. The equation is:
mat_cal = U @ np.diag(s) @ V.T
In plain numpy we can easily do it, but in pyspark I'm not getting the result.
I found this link, but it's in Scala and I don't know how to convert it to pyspark. If someone can guide me, it will be very helpful.
Thanks!
Convert s to a diagonal matrix Σ:
import numpy as np
from pyspark.mllib.linalg import DenseMatrix
Σ = DenseMatrix(len(s), len(s), np.diag(s).ravel("F"))
Transpose V, convert to column major and then convert back to DenseMatrix
V_ = DenseMatrix(V.numCols, V.numRows, V.toArray().transpose().ravel("F"))
Multiply:
mat_ = U.multiply(Σ).multiply(V_)
Inspect the results:
for row in mat_.rows.take(3):
    print(row.round(12))
[0. 1. 0. 7. 0.]
[2. 0. 3. 4. 5.]
[4. 0. 0. 6. 7.]
Check the norm:
np.linalg.norm(np.array(rows.collect()) - np.array(mat_.rows.collect()))
1.2222842061189339e-14
Of course the last two steps are used only for testing, and won't be feasible on real life data.
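As a side check (a sketch of my own, assuming the factors fit in driver memory), the same reconstruction can be verified locally with numpy after collecting the factors:
import numpy as np

U_local = np.array([r.toArray() for r in U.rows.collect()])  # collect U to the driver
recon = U_local @ np.diag(s.toArray()) @ V.toArray().T       # U * Sigma * V^T computed locally
original = np.array([r.toArray() for r in rows.collect()])
print(np.allclose(recon, original))                          # expected: True (up to float error)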

Embedding a tensor of vectors into a tensor of matrices

I want to create multiple matrices that have the property that their diagonal is zero and that are symmetric. Matrices of dimension n of this form need n*(n-1)/2 parameters to be completely specified.
These parameters shall later be learned...
In numpy I'm able to compute these by using numpy.triu_indices to get the indices of the upper triangular matrix, starting at the first diagonal above the main diagonal, and then filling it with the provided parameters, as in the following code snippet:
import numpy as np
R = np.array([[1,2,1,1,2,1], [1,1,1,1,1,1]])
s = R.shape[1]
M = R.shape[0]
iu_r, iu_c = np.triu_indices(s,1)
Q = np.zeros((M,s,s),dtype=float)
Q[:,iu_r,iu_c] = R
Q = Q + np.transpose(Q,(0,2,1))
Output:
[[[0. 1. 2. 1.]
  [1. 0. 1. 2.]
  [2. 1. 0. 1.]
  [1. 2. 1. 0.]]

 [[0. 1. 1. 1.]
  [1. 0. 1. 1.]
  [1. 1. 0. 1.]
  [1. 1. 1. 0.]]]
But apparently one cannot directly translate this to tensorflow, as
import tensorflow as tf
import numpy as np
M = 2
s = 4
iu_r, iu_c = np.triu_indices(s,1)
rates = tf.get_variable(shape=(M,s*(s-1)/2), name="R", dtype=float)
Q = tf.get_variable(shape=(M,s,s), dtype=float, initializer=tf.initializers.zeros, name="Q")
Q = Q[:,iu_r,iu_c].assign(rates)
fails with
TypeError: Tensors in list passed to 'values' of 'Pack' Op have types [int32, int64, int64] that don't all match.
What would be the correct way to define this tensor of matrices from a tensor of vectors in tensorflow?
EDIT:
My current solution is to embed using the scatter_nd function provided by tensorflow, since it meets the requirement that no redundant variables need to be allocated, as would be the case with fill_triangular. However, the indexing is not compatible with the indices generated by numpy. With hardcoded indices, the following example currently works:
import tensorflow as tf
import numpy as np
M = 2
s = 4
iu_r, iu_c = np.triu_indices(s,1)
rates = tf.get_variable(shape=(M,s*(s-1)/2), name="R", dtype=float)
iupper = [[[0,0,1],[0,0,2],[0,0,3],[0,1,2],[0,1,3],[0,2,3]],[[1,0,1],[1,0,2],[1,0,3],[1,1,2],[1,1,3],[1,2,3]]]
Q = tf.scatter_nd(iupper,rates,shape=(M,s,s), name="rate_matrix")
It should be no problem to translate the indices obtained by
iu_r, iu_c = np.triu_indices(s,1)
But maybe someone has a more elegant solution for that?
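For reference, one way those index triples could be generated from np.triu_indices instead of being hardcoded (a sketch of my own, not from the original thread):
import numpy as np

M, s = 2, 4
iu_r, iu_c = np.triu_indices(s, 1)
n = len(iu_r)                           # s * (s - 1) / 2 entries per matrix
batch_idx = np.repeat(np.arange(M), n)  # [0, 0, ..., 1, 1, ...]
iupper = np.stack([batch_idx, np.tile(iu_r, M), np.tile(iu_c, M)], axis=1).reshape(M, n, 3)
# iupper now matches the hardcoded list above and can be passed to tf.scatter_nd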
It is unclear to me how this part works:
import numpy as np
R = np.array([[1,2,1,1,2,1], [1,1,1,1,1,1]])
s = R.shape[1]
M = R.shape[0]
iu_r, iu_c = np.triu_indices(s,1)
Q = np.zeros((M,s,s),dtype=float)
Q[:,iu_r,iu_c] = R
Q = Q + np.transpose(Q,(0,2,1))
because this will fail with an error.
You may use simpler code like this:
import numpy as np

R = [1, 2, 1, 1, 2, 1]
N = 4
Q = np.zeros((N, N), dtype=float)
for i in range(0, N):
    for j in range(0, N):
        if i < j:
            Q[i][j] = R.pop(0)
Q would be:
[[0. 1. 2. 1.]
 [0. 0. 1. 2.]
 [0. 0. 0. 1.]
 [0. 0. 0. 0.]]
<class 'numpy.ndarray'>
To get the symmetric Q just use this: Q = Q + np.transpose(Q)
Whatever you do with your rates later, you can convert the result to a Tensor like this:
import tensorflow as tf
data_tf = tf.convert_to_tensor(Q, np.float32)
sess = tf.InteractiveSession()
print(data_tf.eval())
sess.close()
The other answers suggest to use the convert_to_tensor function, to convert your numpy array to a TensorFlow tensor.
This indeed can give you matrices with the desired property of being symmetric with a zero diagonal. However, once you start training, these properties might not hold anymore, as there is no guarantee in general that the weight updates will keep this property fixed.
If you do need to keep the matrices symmetric with a zero diagonal all along the training process, you can do the following:
import tensorflow as tf
from tensorflow.contrib.distributions import fill_triangular
M = 2 # batch size
s = 4 # matrix size
rates = tf.get_variable(shape=(M,s*(s+1)/2), name="R", dtype=float)
# Q will be triangular (with a non-zero diagonal!)
Q = fill_triangular(rates)
# set the diagonal of Q to zero.
Q = tf.linalg.set_diag(Q,tf.zeros((M,s)))
# make Q symmetric
Q = Q + tf.transpose(Q,[0,2,1])
Here is a test that verifies that the matrices hold the required properties, even after training:
import numpy as np
# define some arbitrary loss function
Q_target = tf.constant(np.random.normal(size=(1,s,s)).astype(np.float32))
loss = tf.nn.l2_loss(Q-Q_target)
# a single training step (which will update the matrices)
train_step = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# this is Q before training
print(sess.run(Q))
#[[[ 0. -0.564 0.318 -0.446]
# [-0.564 0. -0.028 0.2 ]
# [ 0.318 -0.028 0. 0.369]
# [-0.446 0.2 0.369 0. ]]
#
# [[ 0. 0.412 0.216 0.063]
# [ 0.412 0. 0.221 -0.336]
# [ 0.216 0.221 0. -0.653]
# [ 0.063 -0.336 -0.653 0. ]]]
# this is Q after training
sess.run(train_step)
print(sess.run(Q))
#[[[ 0. -0.548 0.235 -0.284]
# [-0.548 0. -0.055 0.074]
# [ 0.235 -0.055 0. 0.25 ]
# [-0.284 0.074 0.25 0. ]]
#
# [[ 0. 0.233 0.153 0.123]
# [ 0.233 0. 0.144 -0.354]
# [ 0.153 0.144 0. -0.568]
# [ 0.123 -0.354 -0.568 0. ]]]
Apparently you need something like convert_to_tensor.
This function converts Python objects of various types to Tensor objects. It accepts Tensor objects, numpy arrays, Python lists, and Python scalars.
Note: TensorFlow operations automatically convert NumPy ndarrays to Tensors.
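A minimal illustration of my own (not from the original answer) of the kinds of inputs it accepts:
import numpy as np
import tensorflow as tf

# numpy arrays, Python lists and scalars are all accepted
a = tf.convert_to_tensor(np.zeros((4, 4), dtype=np.float32))
b = tf.convert_to_tensor([[1.0, 2.0], [3.0, 4.0]])
c = tf.convert_to_tensor(5.0)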
