Reconstruct Matrix from svd components with Pyspark - apache-spark

I'm working on SVD using pyspark. But in the documentation as well as any other place I didn't find how to reconstruct the matrix back using the segemented vectors.For example, using the svd of pyspark, I got U, s and V matrix as below.
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([
Vectors.sparse(5, {1: 1.0, 3: 7.0}),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
])
mat = RowMatrix(rows)
# Compute the top 5 singular values and corresponding singular vectors.
svd = mat.computeSVD(5, computeU=True)
U = svd.U # The U factor is a RowMatrix.
s = svd.s # The singular values are stored in a local dense vector.
V = svd.V # The V factor is a local dense matrix.
Now, I want to reconstruct back the original matrix by multiplying it back. The equation is:
mat_cal = U.diag(s).V.T
In python, we can easily do it. But in pyspark I'm not getting the result.
I found this link. But it's in scala and I don't know the how to convert it in pyspark. If someone can guide me, it will be very helpful.
Thanks!

Convert u to diagonal matrix Σ:
import numpy as np
from pyspark.mllib.linalg import DenseMatrix
Σ = DenseMatrix(len(s), len(s), np.diag(s).ravel("F"))
Transpose V, convert to column major and then convert back to DenseMatrix
V_ = DenseMatrix(V.numCols, V.numRows, V.toArray().transpose().ravel("F"))
Multiply:
mat_ = U.multiply(Σ).multiply(V_)
Inspect the results:
for row in mat_.rows.take(3):
print(row.round(12))
[0. 1. 0. 7. 0.]
[2. 0. 3. 4. 5.]
[4. 0. 0. 6. 7.]
Check the norm
np.linalg.norm(np.array(rows.collect()) - np.array(mat_.rows.collect())
1.2222842061189339e-14
Of course the last two steps are used only for testing, and won't be feasible on real life data.

Related

Different values on Tensorflow matrix multiplication vs. manual calculation

I am working on an optimsation on tensorflow where matrix multiplication gives differnt values compared to manual calculation. The difference is just on the 6 decimal and i know its very tiny but as epochs goes on i get quite different elbo values.
Here is a small example:
import tensorflow as tf
import numpy as np
a = np.array([[0.2751678 , 0.00671141, 0.39597315, 0.4966443 , 0.17449665,
0.00671141, 0.32214764, 0.02013423, 1. , 0.40939596,
0. , 0.9597315 , 0.4161074 , 0. , 0.2147651 ,
0.22147651, 0.5771812 , 0.70469797, 0.44966444, 0.36241612]],dtype=np.float32)
b = np.array([[2.6560298e-04, 0.0000000e+00, 7.9084152e-01, 8.2393251e-03,
0.0000000e+00, 9.8140877e-01, 6.5296537e-01, 2.6107374e-01,
1.2936005e-03, 5.2952105e-01, 2.2449312e-01, 9.9892569e-01,
8.4370503e-04, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 9.5679509e-03, 0.0000000e+00]],dtype=np.float32)
a_t = tf.constant(a)
b_t = tf.constant(b.T)
Matrix multiplication
tf.matmul(a_t,b_t)
<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[1.7209427]], dtype=float32)>
Manual calculation
tf.reduce_sum(tf.transpose(a_t)*b_t)
<tf.Tensor: shape=(), dtype=float32, numpy=1.7209429>
What is the reason for this difference? Is ther a fix for this?
You are comparing the results from different algorithms that rely on float arithmetic. It's completely normal to get different results at the last significant decimal digit. Actually, that is the best case scenario. Sometimes the difference will be even higher.
For example, you may try different values for n in the following code:
from numpy import random as np_random
from numpy import matmul as np_matmul
from numpy import multiply as np_multiply
from numpy import transpose as np_transpose
from numpy import sum as np_sum
n = 10
for i in range(10000):
a = np_random.rand(1,n).astype('float32')
b = np_random.rand(n,1).astype('float32')
c = np_matmul(a,b)
d = np_sum(np_multiply(a,np_transpose(b)),axis=1)
e = c-d
if abs(e) > 0:
print("%.16f" % c)
print("%.16f" % d)
print("%e" % e)
break
Anyway, single-precision floating-point format (float32) gives from 6 to 9 significant decimal digits precision. If you need more precision, you still have double precision (float64).
More information can be found in Floating Point Arithmetic: Issues and Limitations from Python Documentation.

sklearn.preprocessing.MinMaxScaler() only returns 0 or 1 and not float

For whatever reason this only returns 0 or 1 instead of float between them.
from sklearn import preprocessing
X = [[1.3, 1.6, 1.4, 1.45, 12.3, 63.01,],
[1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]
minmaxscaler = preprocessing.MinMaxScaler()
X_scale = minmaxscaler.fit_transform(X)
print(X_scale) # returns [[0. 1. 0. 0. 1. 0.] [1. 0. 1. 1. 0. 0.]]
Minmax Scaler can not work with list of lists, it needs to work with numpy array for example (or dataframes).
You can convert to numpy array. It will result 6 features with 2 samples, which I guess is not what you means so you need also reshape.
import numpy
X = numpy.array([[1.3, 1.6, 1.4, 1.45, 12.3, 63.01,],
[1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]).reshape(-1,1)
Results after MinMax Scaler:
[[0.02047619]
[0.0252381 ]
[0.02206349]
[0.02285714]
[0.19507937]
[1. ]
[0.03 ]
[0. ]
[0.06809524]
[0.72047619]
[0.04761905]
[1. ]]
Not exactly sure if you want to minimax each list separatly or all together
The answer which you have got from MinMaxScaler is the expected answer.
When you have only two datapoints, you will get only 0s and 1s. See the example here for three datapoints scenario.
You need to understand that it will convert the lowest value as 0 and highest values as 1 for each column. When you have more datapoints, the remaining ones would calculation based on the range (Max-min). see the formula here.
Also, MinMaxScaler accepts 2D data, which means lists of list is acceptable. Thats the reason why you did not got any error.

Perform matrix multiplication with cosine similarity function

I have two lists:
list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
['scent', 'scents', 'aroma', 'smell', 'odor'],
['mental_illness', 'mental_disorders','bipolar_disorder']
['romance', 'romances', 'romantic', 'budding_romance']]
list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
['also', 'like', 'buy', 'perfumes'],
['suffer', 'from', 'clinical', 'depression'],
['really', 'love', 'my', 'wife']]
I would like to compute the cosine similarity between the two lists above in such a way where the cosine similarity between the first sub-list in list1 and all sublists of list 2 are measured against each other. Then the same thing but with the second sub-list in list 1 and all sub-lists in list 2, etc.
The goal is to create a len(list_2) by len(list_1) matrix, and each entry in that matrix is a cosine similarity score. Currently I've done this the following way:
import gensim
import numpy as np
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
for j, L1 in enumerate(list_1):
similarity_mat[i, j] = model.n_similarity(L2, L1)
However, I'd like to implement this with matrix multiplication and no for loops.
My two questions are:
Is there a way to do some sort of element-wise matrix multiplication but with gensim's n_similiarity() method to generate the required matrix?
Would it be more efficient and faster using the current method or matrix multiplication?
I hope my question was clear enough, please let me know if I can clarify even further.
Here's an approach, but it's not clear from the question whether you understand the underlying mechanics of the calculation, which might be causing the block.
I've changed the input strings to give more exact word matches, and given the two strings different dimensions to make it a bit clearer:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
['scent', 'my', 'aroma', 'smell', 'odor'],
['mental_illness', 'mental_disorders','bipolar_disorder'],
['romance', 'romances', 'romantic', 'budding_romance']]
list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
['also', 'like', 'buy', 'perfumes'],
['suffer', 'from', 'clinical', 'depression'],
['really', 'love', 'my', 'wife'],
['flavor', 'taste', 'romantic', 'aroma', 'what']]
cnt = CountVectorizer()
# Combine each sublist into single str, and join everything into corpus
combined_lists = ([' '.join(item) for item in list_1] +
[' '.join(item) for item in list_2])
count_matrix = cnt.fit_transform(combined_lists).toarray()
# Split them again into list_1 and list_2 word counts
count_matrix_1 = count_matrix[:len(list_1),]
count_matrix_2 = count_matrix[len(list_1):,]
match_matrix = np.matmult(count_matrix_1, count_matrix_2.T)
Output of match_matrix:
array([[0, 0, 0, 0, 2],
[0, 0, 0, 1, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]], dtype=int64)
You can see that the 1st string in list_1 has 2 matches with the 5th string in list_2, and so on.
So the first part of the calculation (the dot product) has been calculated. Now we need the magnitudes:
magnitudes = np.array([np.linalg.norm(count_matrix[i,:])
for i in range(len(count_matrix))])
Now we can use matrix multiplication to turn that into a matrix of divisors (we need to reshape magnitudes into n x 1 and 1 x n matrices for this to produce an n x n matrix:
divisor_matrix = np.matmul(magnitudes.reshape(len(magnitudes),1),
magnitudes.reshape(1,len(magnitudes)))
Now since we didn't compare every single sublist, but only the list_1 with the list_2 sublists, we need to take a subsection of this divisor matrix to get the right magnitudes:
divisor_matrix = divisor_matrix[:len(list_1), len(list_1):]
Output:
array([[4.89897949, 4. , 4. , 4. , 4.47213595],
[5.47722558, 4.47213595, 4.47213595, 4.47213595, 5. ],
[4.24264069, 3.46410162, 3.46410162, 3.46410162, 3.87298335],
[4.89897949, 4. , 4. , 4. , 4.47213595]])
Now we can calculate the final matrix of cosine similarity scores:
cos_sim = match_matrix / divisor_matrix
Output:
array([[0. , 0. , 0. , 0. , 0.4472136],
[0. , 0. , 0. , 0.2236068, 0.2 ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.2236068]])
Note these scores differ from the example given, since in the example every cosine similarity score would be 0.
There are two problems in code, the second last and last line.
import gensim
import numpy as np
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
for j, L1 in enumerate(list_1):
similarity_mat[i, j] = model.n_similarity(L2, L1)
Answers to you questions:
1. You are already using a direct function to calculate the similarity between two sentences(L1 and L2) which are first converted to two vectors and then cosine similarity is calculated of those two vectors. Everything is already done inside the n_similarity() so you can't do any kind of matrix multiplication.
If you want to do your own matrix multiplication then instead of directly using n_similarity() calculates the vectors of the sentences and then apply matrix multiplication while calculating cosine similarity.
2. As I said in (1) that everything is done in n_similarity() and creators of gensim takes care of the efficiency when writing the libraries so any other multiplication method will most likely not make a difference.

KMeans clustering - Value error: n_samples=1 should be >= n_cluster

I am doing an experiment with three time-series datasets with different characteristics for my experiment whose format is as the following.
0.086206438,10
0.086425551,12
0.089227066,20
0.089262508,24
0.089744425,30
0.090036815,40
0.090054172,28
0.090377569,28
0.090514071,28
0.090762872,28
0.090912691,27
The first column is a timestamp. For reproducibility reasons, I am sharing the data here. From column 2, I wanted to read the current row and compare it with the value of the previous row. If it is greater, I keep comparing. If the current value is smaller than the previous row's value, I want to divide the current value (smaller) by the previous value (larger). Accordingly, here is the code:
import numpy as np
import matplotlib.pyplot as plt
protocols = {}
types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
for protname, fname in types.items():
col_time,col_window = np.loadtxt(fname,delimiter=',').T
trailing_window = col_window[:-1] # "past" values at a given index
leading_window = col_window[1:] # "current values at a given index
decreasing_inds = np.where(leading_window < trailing_window)[0]
quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
quotient_times = col_time[decreasing_inds]
protocols[protname] = {
"col_time": col_time,
"col_window": col_window,
"quotient_times": quotient_times,
"quotient": quotient,
}
plt.figure(); plt.clf()
plt.plot(quotient_times,quotient, ".", label=protname, color="blue")
plt.ylim(0, 1.0001)
plt.title(protname)
plt.xlabel("time")
plt.ylabel("quotient")
plt.legend()
plt.show()
And this produces the following three points - one for each dataset I shared.
As we can see from the points in the plots based on the code given above, data1 is pretty consistent whose value is around 1, data2 will have two quotients (whose values will concentrate either around 0.5 or 0.8) and the values of data3 are concentrated around two values (either around 0.5 or 0.7). This way, given a new data point (with quotient and quotient_times), I want to know which cluster it belongs to by building each dataset stacking these two transformed features quotient and quotient_times. I am trying it with KMeans clustering as the following
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)
But this is giving me an error: ValueError: n_samples=1 should be >= n_clusters=3. How can we fix this error?
Update: samlpe quotient data = array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129,
0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 ,
0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581,
0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])
As is, your quotient variable is now one single sample; here I get a different error message, probably due to different Python/scikit-learn version, but the essence is the same:
import numpy as np
quotient = np.array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129, 0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 , 0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581, 0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])
quotient.shape
# (20,)
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)
This gives the following error:
ValueError: Expected 2D array, got 1D array instead:
array=[0.7 0.7 0.4973262 0.7008547 0.71287129 0.704
0.49723757 0.49723757 0.70676692 0.5 0.5 0.70754717
0.5 0.49723757 0.70322581 0.5 0.49723757 0.49723757
0.5 0.49723757].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
which, despite the different wording, is not different from yours - essentially it says that your data look like a single sample.
Following the first advice(i.e. considering that quotient contains a single feature (column) resolves the issue:
k_means.fit(quotient.reshape(-1,1))
# result
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=0, tol=0.0001, verbose=0)
Please try the code below. A brief explanation on what I've done:
First I built the dataset sample = np.vstack((quotient_times, quotient)).T and standardized it, so it would become easier to cluster. Following, I've applied DBScan with multiple hyperparameters (eps and min_samples) until I've found the one that separated the points better. Finally, I've plotted the data with its respective labels, since you are working with 2 dimensional data, it's easy to visualize how good the clustering is.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
dataset = np.empty((0, 2))
for protname, fname in types.items():
col_time,col_window = np.loadtxt(fname,delimiter=',').T
trailing_window = col_window[:-1] # "past" values at a given index
leading_window = col_window[1:] # "current values at a given index
decreasing_inds = np.where(leading_window < trailing_window)[0]
quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
quotient_times = col_time[decreasing_inds]
sample = np.vstack((quotient_times, quotient)).T
dataset = np.append(dataset, sample, axis=0)
scaler = StandardScaler()
dataset = scaler.fit_transform(dataset)
k_means = DBSCAN(eps=0.6, min_samples=1)
k_means.fit(dataset)
colors = [i for i in k_means.labels_]
plt.figure();
plt.title('Dataset 1,2,3')
plt.xlabel("time")
plt.ylabel("quotient")
plt.scatter(dataset[:, 0], dataset[:, 1], c=colors)
plt.legend()
plt.show()
You are trying to make 3 clusters, while you have only 1 np.array i.e n_samples.
Try increasing the no. of arrays.
Decreasing no. of clusters.
Reshaping the array (not sure)

Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

I am reducing the dimensionality of a Spark DataFrame with PCA model with pyspark (using the spark ml library) as follows:
pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)
where data is a Spark DataFrame with one column labeled features which is a DenseVector of 3 dimensions:
data.take(1)
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')
After fitting, I transform the data:
transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))
How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?
[UPDATE: From Spark 2.2 onwards, PCA and SVD are both available in PySpark - see JIRA ticket SPARK-6227 and PCA & PCAModel for Spark ML 2.2; original answer below is still applicable for older Spark versions.]
Well, it seems incredible, but indeed there is not a way to extract such information from a PCA decomposition (at least as of Spark 1.5). But again, there have been many similar "complaints" - see here, for example, for not being able to extract the best parameters from a CrossValidatorModel.
Fortunately, some months ago, I attended the 'Scalable Machine Learning' MOOC by AMPLab (Berkeley) & Databricks, i.e. the creators of Spark, where we implemented a full PCA pipeline 'by hand' as part of the homework assignments. I have modified my functions from back then (rest assured, I got full credit :-), so as to work with dataframes as inputs (instead of RDD's), of the same format as yours (i.e. Rows of DenseVectors containing the numerical features).
We first need to define an intermediate function, estimatedCovariance, as follows:
import numpy as np
def estimateCovariance(df):
"""Compute the covariance matrix for a given dataframe.
Note:
The multi-dimensional covariance array should be calculated using outer products. Don't
forget to normalize the data by first subtracting the mean.
Args:
df: A Spark dataframe with a column named 'features', which (column) consists of DenseVectors.
Returns:
np.ndarray: A multi-dimensional array where the number of rows and columns both equal the
length of the arrays in the input dataframe.
"""
m = df.select(df['features']).map(lambda x: x[0]).mean()
dfZeroMean = df.select(df['features']).map(lambda x: x[0]).map(lambda x: x-m) # subtract the mean
return dfZeroMean.map(lambda x: np.outer(x,x)).sum()/df.count()
Then, we can write a main pca function as follows:
from numpy.linalg import eigh
def pca(df, k=2):
"""Computes the top `k` principal components, corresponding scores, and all eigenvalues.
Note:
All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
each eigenvectors as a column. This function should also return eigenvectors as columns.
Args:
df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
k (int): The number of principal components to return.
Returns:
tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A tuple of (eigenvectors, `RDD` of
scores, eigenvalues). Eigenvectors is a multi-dimensional array where the number of
rows equals the length of the arrays in the input `RDD` and the number of columns equals
`k`. The `RDD` of scores has the same number of rows as `data` and consists of arrays
of length `k`. Eigenvalues is an array of length d (the number of features).
"""
cov = estimateCovariance(df)
col = cov.shape[1]
eigVals, eigVecs = eigh(cov)
inds = np.argsort(eigVals)
eigVecs = eigVecs.T[inds[-1:-(col+1):-1]]
components = eigVecs[0:k]
eigVals = eigVals[inds[-1:-(col+1):-1]] # sort eigenvals
score = df.select(df['features']).map(lambda x: x[0]).map(lambda x: np.dot(x, components.T) )
# Return the `k` principal components, `k` scores, and all eigenvalues
return components.T, score, eigVals
Test
Let's see first the results with the existing method, using the example data from the Spark ML PCA documentation (modifying them so as to be all DenseVectors):
from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors
data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data,["features"])
pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
model.transform(df).collect()
[Row(features=DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]), pca_features=DenseVector([1.6486, -4.0133])),
Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), pca_features=DenseVector([-4.6451, -1.1168])),
Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), pca_features=DenseVector([-6.4289, -5.338]))]
Then, with our method:
comp, score, eigVals = pca(df)
score.collect()
[array([ 1.64857282, 4.0132827 ]),
array([-4.64510433, 1.11679727]),
array([-6.42888054, 5.33795143])]
Let me stress that we don't use any collect() methods in the functions we have defined - score is an RDD, as it should be.
Notice that the signs of our second column are all opposite from the ones derived by the existing method; but this is not an issue: according to the (freely downloadable) An Introduction to Statistical Learning, co-authored by Hastie & Tibshirani, p. 382
Each principal component loading vector is unique, up to a sign flip. This
means that two different software packages will yield the same principal
component loading vectors, although the signs of those loading vectors
may differ. The signs may differ because each principal component loading
vector specifies a direction in p-dimensional space: flipping the sign has no
effect as the direction does not change. [...] Similarly, the score vectors are unique
up to a sign flip, since the variance of Z is the same as the variance of −Z.
Finally, now that we have the eigenvalues available, it is trivial to write a function for the percentage of the variance explained:
def varianceExplained(df, k=1):
"""Calculate the fraction of variance explained by the top `k` eigenvectors.
Args:
df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
k: The number of principal components to consider.
Returns:
float: A number between 0 and 1 representing the percentage of variance explained
by the top `k` eigenvectors.
"""
components, scores, eigenvalues = pca(df, k)
return sum(eigenvalues[0:k])/sum(eigenvalues)
varianceExplained(df,1)
# 0.79439325322305299
As a test, we also check if the variance explained in our example data is 1.0, for k=5 (since the original data are 5-dimensional):
varianceExplained(df,5)
# 1.0
[Developed & tested with Spark 1.5.0 & 1.5.1]
EDIT :
PCA and SVD are finally both available in pyspark starting spark 2.2.0 according to this resolved JIRA ticket SPARK-6227.
Original answer:
The answer given by #desertnaut is actually excellent from a theoretical perspective, but I wanted to present another approach on how to compute the SVD and to extract then eigenvectors.
from pyspark.mllib.common import callMLlibFunc, JavaModelWrapper
from pyspark.mllib.linalg.distributed import RowMatrix
class SVD(JavaModelWrapper):
"""Wrapper around the SVD scala case class"""
#property
def U(self):
""" Returns a RowMatrix whose columns are the left singular vectors of the SVD if computeU was set to be True."""
u = self.call("U")
if u is not None:
return RowMatrix(u)
#property
def s(self):
"""Returns a DenseVector with singular values in descending order."""
return self.call("s")
#property
def V(self):
""" Returns a DenseMatrix whose columns are the right singular vectors of the SVD."""
return self.call("V")
This defines our SVD object. We can define now our computeSVD method using the Java Wrapper.
def computeSVD(row_matrix, k, computeU=False, rCond=1e-9):
"""
Computes the singular value decomposition of the RowMatrix.
The given row matrix A of dimension (m X n) is decomposed into U * s * V'T where
* s: DenseVector consisting of square root of the eigenvalues (singular values) in descending order.
* U: (m X k) (left singular vectors) is a RowMatrix whose columns are the eigenvectors of (A X A')
* v: (n X k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A' X A)
:param k: number of singular values to keep. We might return less than k if there are numerically zero singular values.
:param computeU: Whether of not to compute U. If set to be True, then U is computed by A * V * sigma^-1
:param rCond: the reciprocal condition number. All singular values smaller than rCond * sigma(0) are treated as zero, where sigma(0) is the largest singular value.
:returns: SVD object
"""
java_model = row_matrix._java_matrix_wrapper.call("computeSVD", int(k), computeU, float(rCond))
return SVD(java_model)
Now, let's apply that to an example :
from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors
data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),), (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),), (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data,["features"])
pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
features = model.transform(df) # this create a DataFrame with the regular features and pca_features
# We can now extract the pca_features to prepare our RowMatrix.
pca_features = features.select("pca_features").rdd.map(lambda row : row[0])
mat = RowMatrix(pca_features)
# Once the RowMatrix is ready we can compute our Singular Value Decomposition
svd = computeSVD(mat,2,True)
svd.s
# DenseVector([9.491, 4.6253])
svd.U.rows.collect()
# [DenseVector([0.1129, -0.909]), DenseVector([0.463, 0.4055]), DenseVector([0.8792, -0.0968])]
svd.V
# DenseMatrix(2, 2, [-0.8025, -0.5967, -0.5967, 0.8025], 0)
In spark 2.2+ you can now easily get the explained variance as:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=<columns of your original dataframe>, outputCol="features")
df = assembler.transform(<your original dataframe>).select("features")
from pyspark.ml.feature import PCA
pca = PCA(k=10, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
sum(model.explainedVariance)
The easiest answer to your question is to input an identity matrix to your model.
identity_input = [(Vectors.dense([1.0, .0, 0.0, .0, 0.0]),),(Vectors.dense([.0, 1.0, .0, .0, .0]),), \
(Vectors.dense([.0, 0.0, 1.0, .0, .0]),),(Vectors.dense([.0, 0.0, .0, 1.0, .0]),),
(Vectors.dense([.0, 0.0, .0, .0, 1.0]),)]
df_identity = sqlContext.createDataFrame(identity_input,["features"])
identity_features = model.transform(df_identity)
This should give you principle components.
I think eliasah's answer is better in terms of Spark framework because desertnaut is solving the problem by using numpy's functions instead of Spark's actions. However, eliasah's answer is missing normalizing the data. So, I'd add the following lines to eliasah's answer:
from pyspark.ml.feature import StandardScaler
standardizer = StandardScaler(withMean=True, withStd=False,
inputCol='features',
outputCol='std_features')
model = standardizer.fit(df)
output = model.transform(df)
pca_features = output.select("std_features").rdd.map(lambda row : row[0])
mat = RowMatrix(pca_features)
svd = computeSVD(mat,5,True)
Evantually, svd.V and identity_features.select("pca_features").collect() should have identical values.
I have summarized PCA and its use in Spark and sklearn in this blog post.

Resources