I am trying to find the eigenvectors of a matrix A using the QR method. I found the eigenvalues and the eigenvector that corresponds to the largest eigenvalue. How do I find the rest of the eigenvectors without using numpy.linalg.eig?
import numpy as np
A = np.array([
    [1, 0.3],
    [0.45, 1.2]
])

def eig_evec_decomp(A, max_iter=100):
    A_k = A
    Q_k = np.eye(A.shape[1])
    for k in range(max_iter):
        Q, R = np.linalg.qr(A_k)
        Q_k = Q_k.dot(Q)
        A_k = R.dot(Q)
    eigenvalues = np.diag(A_k)
    eigenvectors = Q_k
    return eigenvalues, eigenvectors
evals, evecs = eig_evec_decomp(A)
# array([1.48078866, 0.71921134])
# array([[ 0.52937334, -0.84838898],
# [ 0.84838898, 0.52937334]])
Next I check the condition A x = w x, where:
A - original matrix;
x - eigenvector;
w - eigenvalue.
print(np.allclose(A.dot(evecs[:,0]), evals[0] * evecs[:,0]))
# True
print(np.allclose(A.dot(evecs[:,1]), evals[1] * evecs[:,1]))
# False

The algorithm does not promise that Q_k will have the eigenvectors as its columns. It is in fact rather rare for a matrix to have an orthogonal eigenbasis. This case is special enough to have a name: these are the normal matrices, characterized by the fact that they commute with their transpose.
In general, the A_k you converge to will still be upper triangular with non-trivial content above the diagonal. Check by computing Q_k.T @ A @ Q_k. What is known from the structure is that the i-th eigenvector is a linear combination of the first i columns of Q_k, which is why only the first column passed your check. This can simplify solving the eigenvector equation somewhat. Alternatively, directly determine the eigenvectors of the converged A_k and transform them back with Q_k.
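The last suggestion could be sketched roughly as follows. This is a minimal illustration, assuming A has real, distinct eigenvalues so that the unshifted iteration converges to an upper triangular A_k. The helper names qr_schur and eigvecs_from_triangular are made up for this example and are not NumPy functions.

```python
import numpy as np

def qr_schur(A, max_iter=500):
    """Unshifted QR iteration; returns (A_k, Q_k) with A_k = Q_k.T @ A @ Q_k."""
    A_k = np.array(A, dtype=float)
    Q_k = np.eye(A_k.shape[0])
    for _ in range(max_iter):
        Q, R = np.linalg.qr(A_k)
        Q_k = Q_k @ Q
        A_k = R @ Q
    return A_k, Q_k

def eigvecs_from_triangular(T):
    """Eigenvectors of an upper triangular T by back-substitution.

    Assumes the diagonal entries (the eigenvalues) are distinct.
    Column i solves (T - T[i, i] * I) v = 0 with v[i] = 1 and v[j] = 0 for j > i.
    """
    n = T.shape[0]
    V = np.zeros((n, n))
    for i in range(n):
        V[i, i] = 1.0
        for j in range(i - 1, -1, -1):
            s = T[j, j + 1:i + 1] @ V[j + 1:i + 1, i]
            V[j, i] = -s / (T[j, j] - T[i, i])
    return V

A = np.array([[1, 0.3], [0.45, 1.2]])
A_k, Q_k = qr_schur(A)
evals = np.diag(A_k)
# Drop the (numerically negligible) part below the diagonal, then map back with Q_k.
evecs = Q_k @ eigvecs_from_triangular(np.triu(A_k))
evecs /= np.linalg.norm(evecs, axis=0)

for i in range(len(evals)):
    print(np.allclose(A @ evecs[:, i], evals[i] * evecs[:, i]))
```

The resulting eigenvectors are generally not orthogonal to each other, which is consistent with A not being a normal matrix.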


Is there a way to generate correlated variable array from an existing array in Python 3?

I have a non-generated 1D NumPy array. For now, we will use a generated one.
import numpy as np
arr1 = np.random.uniform(0, 100, 1_000)
I need an array that will be correlated 0.3 with it:
arr2 = '?'
print(np.corrcoef(arr1, arr2))
Out[1]: 0.3
I've adapted this answer by whuber on stats.SE to NumPy. The idea is to generate a second array noise randomly, and then compute the residuals of a least-squares linear regression of noise on arr1. The residuals necessarily have a correlation of 0 with arr1, and of course arr1 has a correlation of 1 with itself, so an appropriate linear combination of a*arr1 + b*residuals will have any desired correlation.
import numpy as np
def generate_with_corrcoef(arr1, p):
    n = len(arr1)
    # generate noise
    noise = np.random.uniform(0, 1, n)
    # least squares linear regression for noise = m*arr1 + c
    # (rcond=None selects the current default and avoids a FutureWarning)
    m, c = np.linalg.lstsq(np.vstack([arr1, np.ones(n)]).T, noise, rcond=None)[0]
    # residuals have 0 correlation with arr1
    residuals = noise - (m*arr1 + c)
    # the right linear combination a*arr1 + b*residuals
    a = p * np.std(residuals)
    b = (1 - p**2)**0.5 * np.std(arr1)
    arr2 = a*arr1 + b*residuals
    # return a scaled/shifted result to have the same mean/sd as arr1
    # this doesn't change the correlation coefficient
    return np.mean(arr1) + (arr2 - np.mean(arr2)) * np.std(arr1) / np.std(arr2)
The last line scales the result so that the mean and standard deviation are the same as arr1's. However, arr1 and arr2 will not be identically distributed.
>>> arr1 = np.random.uniform(0, 100, 1000)
>>> arr2 = generate_with_corrcoef(arr1, 0.3)
>>> np.corrcoef(arr1, arr2)
array([[1. , 0.3],
[0.3, 1. ]])
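To see why the coefficients a and b in generate_with_corrcoef give exactly the requested correlation, here is a short derivation. The symbols $s_1$ and $s_r$ are shorthand introduced here for the standard deviations of arr1 and residuals. It relies only on the fact stated above that the residuals are uncorrelated with arr1.

```latex
\begin{aligned}
a &= p\,s_r, \qquad b = \sqrt{1-p^2}\,s_1, \qquad \operatorname{cov}(\text{arr1}, \text{residuals}) = 0 \\
\operatorname{cov}(\text{arr1}, \text{arr2}) &= a\,s_1^2 = p\,s_r\,s_1^2 \\
\operatorname{var}(\text{arr2}) &= a^2 s_1^2 + b^2 s_r^2 = p^2 s_r^2 s_1^2 + (1-p^2)\,s_1^2 s_r^2 = s_1^2 s_r^2 \\
\operatorname{corr}(\text{arr1}, \text{arr2}) &= \frac{p\,s_r\,s_1^2}{s_1 \cdot s_1 s_r} = p
\end{aligned}
```

The final rescaling to arr1's mean and standard deviation is a positive affine map, so it leaves this correlation unchanged.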

Numpy finding the number of points within a specific distance in absolute value

I have a NumPy array. I want to find the number of points that lie within an epsilon distance of each point.
My current code is (for an n*2 array, but in general I expect the array to be n*m)
epsilon = np.array([0.5, 0.5])
np.array([1 / float(np.sum(np.all(np.abs(X - x) <= epsilon, axis=1))) for x in X])
But this code might not be efficient for an array of, say, 1 million rows and 50 columns. Is there a better and more efficient method?
For example data
X = np.random.rand(10, 2)
you can solve this using broadcasting:
1 / np.sum(np.all(np.abs(X[:, None, ...] - X[None, ...]) <= epsilon, axis=-1), axis=-1)
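Note that the full broadcast builds an n × n × m intermediate array, which will not fit in memory for the million-row case mentioned in the question. One possible workaround, sketched here under the assumption that plain NumPy is all you want to use, is to process the rows in blocks. The function name inverse_neighbor_counts and the chunk_size parameter are hypothetical choices for this example.

```python
import numpy as np

def inverse_neighbor_counts(X, epsilon, chunk_size=1024):
    """Return 1 / (number of rows within epsilon of each row, per coordinate).

    Same result as the full broadcast, but only a (chunk_size, n, m)
    temporary array is alive at any time.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    counts = np.empty(n, dtype=np.int64)
    for start in range(0, n, chunk_size):
        block = X[start:start + chunk_size]
        # within[i, j] is True when row j lies inside the epsilon box around block row i
        within = np.all(np.abs(block[:, None, :] - X[None, :, :]) <= epsilon, axis=-1)
        counts[start:start + chunk_size] = within.sum(axis=-1)
    # every point is within epsilon of itself, so counts >= 1
    return 1.0 / counts

rng = np.random.default_rng(0)
X = rng.random((200, 2))
epsilon = np.array([0.5, 0.5])
result = inverse_neighbor_counts(X, epsilon, chunk_size=64)
```

This is still quadratic in the number of rows; it only bounds the memory use. For very large n, a spatial index may be worth considering instead.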

KNN algorithm that return 2 or more nearest neighbours

For example, I have a vector x, and a is its nearest neighbour. Then b is its next nearest neighbour. Is there any package in Python or R that outputs something like [a, b], meaning that a is its nearest neighbour (maybe by majority vote), while b is its second nearest neighbour?
This is exactly what metric trees are built for.
Your question reads as if you are asking for something as simple as this, using sklearn's KDTree (consider BallTree, depending on the metric in play):
import numpy as np
from sklearn.neighbors import KDTree
X = np.array([[1,1],[2,2], [3,3]]) # 3 points in 2 dimensions
tree = KDTree(X)
dist, ind = tree.query([[1.25, 1.35]], k=2)
print(ind) # indices of 2 closest neighbors
print(dist) # distances to 2 closest neighbors
[[0 1]]
[[ 0.43011626 0.99247166]]
And just to be clear: KNN usually refers to a pre-built algorithm based on metric trees (KDTree, BallTree) for the task of classification. Often those data structures are the only thing one is interested in.
If I interpret your comment correctly, you want to use the Manhattan / taxicab / l1 metric.
Look here for the compatibility lists of those spatial-trees.
You just would use it like that:
X = np.array([[1,1],[2,2], [3,3]]) # 3 points in 2 dimensions
tree = KDTree(X, metric='l1') # !!!
dist, ind = tree.query([[1.25, 1.35]], k=2)
print(ind) # indices of 2 closest neighbors
print(dist) # distances to 2 closest neighbors
[[0 1]]
[[ 0.6 1.4]]

Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

I am reducing the dimensionality of a Spark DataFrame with PCA model with pyspark (using the spark ml library) as follows:
pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)
where data is a Spark DataFrame with one column labeled features which is a DenseVector of 3 dimensions:
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')
After fitting, I transform the data:
transformed = model.transform(data)
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))
How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?
[UPDATE: From Spark 2.2 onwards, PCA and SVD are both available in PySpark - see JIRA ticket SPARK-6227 and PCA & PCAModel for Spark ML 2.2; original answer below is still applicable for older Spark versions.]
Well, it seems incredible, but indeed there is not a way to extract such information from a PCA decomposition (at least as of Spark 1.5). But again, there have been many similar "complaints" - see here, for example, for not being able to extract the best parameters from a CrossValidatorModel.
Fortunately, some months ago, I attended the 'Scalable Machine Learning' MOOC by AMPLab (Berkeley) & Databricks, i.e. the creators of Spark, where we implemented a full PCA pipeline 'by hand' as part of the homework assignments. I have modified my functions from back then (rest assured, I got full credit :-), so as to work with dataframes as inputs (instead of RDD's), of the same format as yours (i.e. Rows of DenseVectors containing the numerical features).
We first need to define an intermediate function, estimateCovariance, as follows:
import numpy as np
def estimateCovariance(df):
    """Compute the covariance matrix for a given dataframe.

    The multi-dimensional covariance array should be calculated using outer products. Don't
    forget to normalize the data by first subtracting the mean.

    Args:
        df: A Spark dataframe with a column named 'features', which (column) consists of DenseVectors.

    Returns:
        np.ndarray: A multi-dimensional array where the number of rows and columns both equal the
            length of the arrays in the input dataframe.
    """
    m = df.select(df['features']).map(lambda x: x[0]).mean()
    dfZeroMean = df.select(df['features']).map(lambda x: x[0]).map(lambda x: x-m) # subtract the mean
    return dfZeroMean.map(lambda x: np.outer(x,x)).sum()/df.count()
Then, we can write a main pca function as follows:
from numpy.linalg import eigh
def pca(df, k=2):
    """Computes the top `k` principal components, corresponding scores, and all eigenvalues.

    All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
    each eigenvector as a column. This function should also return eigenvectors as columns.

    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k (int): The number of principal components to return.

    Returns:
        tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A tuple of (eigenvectors, `RDD` of
            scores, eigenvalues). Eigenvectors is a multi-dimensional array where the number of
            rows equals the length of the arrays in the input `RDD` and the number of columns equals
            `k`. The `RDD` of scores has the same number of rows as `data` and consists of arrays
            of length `k`. Eigenvalues is an array of length d (the number of features).
    """
    cov = estimateCovariance(df)
    col = cov.shape[1]
    eigVals, eigVecs = eigh(cov)
    inds = np.argsort(eigVals)
    eigVecs = eigVecs.T[inds[-1:-(col+1):-1]]
    components = eigVecs[0:k]
    eigVals = eigVals[inds[-1:-(col+1):-1]] # sort eigenvals
    score = df.select(df['features']).map(lambda x: x[0]).map(lambda x: np.dot(x, components.T) )
    # Return the `k` principal components, `k` scores, and all eigenvalues
    return components.T, score, eigVals
Let's see first the results with the existing method, using the example data from the Spark ML PCA documentation (modifying them so as to be all DenseVectors):
from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors
data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data,["features"])
pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
model.transform(df).collect()
[Row(features=DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]), pca_features=DenseVector([1.6486, -4.0133])),
Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), pca_features=DenseVector([-4.6451, -1.1168])),
Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), pca_features=DenseVector([-6.4289, -5.338]))]
Then, with our method:
comp, score, eigVals = pca(df)
score.collect()
[array([ 1.64857282, 4.0132827 ]),
array([-4.64510433, 1.11679727]),
array([-6.42888054, 5.33795143])]
Let me stress that we don't use any collect() methods in the functions we have defined - score is an RDD, as it should be.
Notice that the signs of our second column are all opposite from the ones derived by the existing method; but this is not an issue: according to the (freely downloadable) An Introduction to Statistical Learning, co-authored by Hastie & Tibshirani, p. 382
Each principal component loading vector is unique, up to a sign flip. This
means that two different software packages will yield the same principal
component loading vectors, although the signs of those loading vectors
may differ. The signs may differ because each principal component loading
vector specifies a direction in p-dimensional space: flipping the sign has no
effect as the direction does not change. [...] Similarly, the score vectors are unique
up to a sign flip, since the variance of Z is the same as the variance of −Z.
Finally, now that we have the eigenvalues available, it is trivial to write a function for the percentage of the variance explained:
def varianceExplained(df, k=1):
    """Calculate the fraction of variance explained by the top `k` eigenvectors.

    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k: The number of principal components to consider.

    Returns:
        float: A number between 0 and 1 representing the fraction of variance explained
            by the top `k` eigenvectors.
    """
    components, scores, eigenvalues = pca(df, k)
    return sum(eigenvalues[0:k])/sum(eigenvalues)

varianceExplained(df, 1)
# 0.79439325322305299
As a test, we also check if the variance explained in our example data is 1.0, for k=5 (since the original data are 5-dimensional):
varianceExplained(df, 5)
# 1.0
[Developed & tested with Spark 1.5.0 & 1.5.1]
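For readers without a Spark cluster at hand, the same computation can be cross-checked locally. The following is only a NumPy sketch of the steps above, applied to the three example rows. The function name pca_numpy is made up for this example. As in the Spark version, the scores are computed from the uncentered data.

```python
import numpy as np

X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])

def pca_numpy(X, k=2):
    """Local NumPy counterpart of the pca() function above."""
    Xc = X - X.mean(axis=0)                  # subtract the column means
    cov = Xc.T @ Xc / X.shape[0]             # same normalization as estimateCovariance
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    order = np.argsort(eig_vals)[::-1]       # largest eigenvalue first
    eig_vals = eig_vals[order]
    components = eig_vecs[:, order[:k]]      # eigenvectors as columns
    scores = X @ components                  # uncentered data, as in the Spark version
    return components, scores, eig_vals

components, scores, eig_vals = pca_numpy(X, k=2)
explained = eig_vals[:1].sum() / eig_vals.sum()
print(np.abs(scores))
print(explained)
```

Absolute values are printed for the scores because, as discussed above, each column is only determined up to a sign flip.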
PCA and SVD are finally both available in pyspark starting spark 2.2.0 according to this resolved JIRA ticket SPARK-6227.
Original answer:
The answer given by @desertnaut is actually excellent from a theoretical perspective, but I wanted to present another approach: how to compute the SVD and then extract the eigenvectors.
from pyspark.mllib.common import callMLlibFunc, JavaModelWrapper
from pyspark.mllib.linalg.distributed import RowMatrix
class SVD(JavaModelWrapper):
    """Wrapper around the SVD scala case class"""
    @property
    def U(self):
        """Returns a RowMatrix whose columns are the left singular vectors of the SVD if computeU was set to be True."""
        u = self.call("U")
        if u is not None:
            return RowMatrix(u)
    @property
    def s(self):
        """Returns a DenseVector with singular values in descending order."""
        return self.call("s")
    @property
    def V(self):
        """Returns a DenseMatrix whose columns are the right singular vectors of the SVD."""
        return self.call("V")
This defines our SVD object. We can now define our computeSVD method using the Java wrapper.
def computeSVD(row_matrix, k, computeU=False, rCond=1e-9):
    """
    Computes the singular value decomposition of the RowMatrix.
    The given row matrix A of dimension (m X n) is decomposed into U * s * V^T where
    * s: DenseVector consisting of the square roots of the eigenvalues (singular values) in descending order.
    * U: (m X k) (left singular vectors) is a RowMatrix whose columns are the eigenvectors of (A X A')
    * V: (n X k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A' X A)
    :param k: number of singular values to keep. We might return fewer than k if there are numerically zero singular values.
    :param computeU: Whether or not to compute U. If set to be True, then U is computed by A * V * sigma^-1
    :param rCond: the reciprocal condition number. All singular values smaller than rCond * sigma(0) are treated as zero, where sigma(0) is the largest singular value.
    :returns: SVD object
    """
    java_model = row_matrix._java_matrix_wrapper.call("computeSVD", int(k), computeU, float(rCond))
    return SVD(java_model)
Now, let's apply that to an example :
from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors
data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),), (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),), (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data,["features"])
pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
features = model.transform(df) # this creates a DataFrame with the regular features and pca_features
# We can now extract the pca_features to prepare our RowMatrix.
pca_features = features.select("pca_features").rdd.map(lambda row : row[0])
mat = RowMatrix(pca_features)
# Once the RowMatrix is ready we can compute our Singular Value Decomposition
svd = computeSVD(mat,2,True)
# DenseVector([9.491, 4.6253])
# [DenseVector([0.1129, -0.909]), DenseVector([0.463, 0.4055]), DenseVector([0.8792, -0.0968])]
# DenseMatrix(2, 2, [-0.8025, -0.5967, -0.5967, 0.8025], 0)
In spark 2.2+ you can now easily get the explained variance as:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=<columns of your original dataframe>, outputCol="features")
df = assembler.transform(<your original dataframe>).select("features")
from pyspark.ml.feature import PCA
pca = PCA(k=10, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
model.explainedVariance  # DenseVector with the proportion of variance explained by each component
The easiest answer to your question is to input an identity matrix to your model.
identity_input = [(Vectors.dense([1.0, .0, 0.0, .0, 0.0]),),(Vectors.dense([.0, 1.0, .0, .0, .0]),), \
(Vectors.dense([.0, 0.0, 1.0, .0, .0]),),(Vectors.dense([.0, 0.0, .0, 1.0, .0]),),
(Vectors.dense([.0, 0.0, .0, .0, 1.0]),)]
df_identity = sqlContext.createDataFrame(identity_input,["features"])
identity_features = model.transform(df_identity)
This should give you the principal components.
I think eliasah's answer is better in terms of Spark framework because desertnaut is solving the problem by using numpy's functions instead of Spark's actions. However, eliasah's answer is missing normalizing the data. So, I'd add the following lines to eliasah's answer:
from pyspark.ml.feature import StandardScaler
standardizer = StandardScaler(withMean=True, withStd=False,
                              inputCol='features',
                              outputCol='std_features')
model = standardizer.fit(df)
output = model.transform(df)
pca_features = output.select("std_features").rdd.map(lambda row : row[0])
mat = RowMatrix(pca_features)
svd = computeSVD(mat,5,True)
Eventually, svd.V and identity_features.select("pca_features").collect() should have identical values.
I have summarized PCA and its use in Spark and sklearn in this blog post.

scikit learn: how to check coefficients significance

I tried to do a LR with SKLearn for a rather large dataset with ~600 dummy and only a few interval variables (and 300 K lines in my dataset), and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and ANOVA, but I cannot find how to access it. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!
Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficient significance tests (and much more), you can use the Logit estimator from Statsmodels. This package mimics the interface of glm models in R, so you may find it familiar.
If you still want to stick with scikit-learn's LogisticRegression, you can use the asymptotic approximation to the distribution of maximum likelihood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inverse(H), where H is the Hessian matrix of the negative log-likelihood at theta. This is exactly what the function below does:
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression
def logit_pvalue(model, x):
    """Calculate p-values for the coefficients of a scikit-learn LogisticRegression.

    model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
    x: matrix on which the model was fit
    This function uses asymptotics for maximum likelihood estimates.
    """
    p = model.predict_proba(x)
    n = len(p)
    m = len(model.coef_[0]) + 1
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    x_full = np.insert(np.array(x), 0, 1, axis=1)  # prepend a column of ones for the intercept
    ans = np.zeros((m, m))
    for i in range(n):
        # accumulate x_i x_i^T * p_i * (1 - p_i), the Hessian of the negative log-likelihood
        ans = ans + np.outer(x_full[i, :], x_full[i, :]) * p[i, 1] * p[i, 0]
    vcov = np.linalg.inv(ans)
    se = np.sqrt(np.diag(vcov))
    t = coefs / se
    p = (1 - norm.cdf(abs(t))) * 2
    return p
# test p-values
x = np.arange(10)[:, np.newaxis]
y = np.array([0,0,0,1,0,0,1,1,1,1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))
# compare with statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
The outputs of print() are identical, and they happen to be coefficient p-values.
[ 0.11413093 0.08779978]
[ 0.11413093 0.08779979]
sm_model.summary() also prints a nicely formatted HTML summary.
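If neither scikit-learn nor statsmodels is available, the same asymptotic p-values can be reproduced with NumPy alone. The sketch below is an illustration under stated assumptions rather than either library's implementation. It fits the unpenalized model by Newton's method and then inverts the Hessian of the negative log-likelihood, X^T diag(p(1-p)) X. The function name logit_mle_pvalues and the fixed iteration count are choices made for this example.

```python
import numpy as np
from math import erfc, sqrt

def logit_mle_pvalues(x, y, n_iter=50):
    """Fit an unpenalized logistic regression with intercept by Newton's method
    and return (coefficients, standard errors, two-sided p-values)."""
    X = np.column_stack([np.ones(len(x)), np.asarray(x, dtype=float)])
    y = np.asarray(y, dtype=float)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - p)                        # gradient of the log-likelihood
        info = (X * (p * (1 - p))[:, None]).T @ X   # Hessian of the negative log-likelihood
        theta = theta + np.linalg.solve(info, grad)
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    info = (X * (p * (1 - p))[:, None]).T @ X
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    z = theta / se
    # two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    pvalues = np.array([erfc(abs(v) / sqrt(2)) for v in z])
    return theta, se, pvalues

x = np.arange(10)[:, np.newaxis]
y = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])
theta, se, pvalues = logit_mle_pvalues(x, y)
print(pvalues)
```

Plain Newton iterations can fail when the classes are perfectly separable, because the maximum likelihood estimate does not exist in that case; the example data are not separable.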
