python spark: narrowing down most relevant features using PCA - apache-spark

I am using spark 2.2 with python. I am using PCA from ml.feature module. I am using VectorAssembler to feed my features to PCA. To clarify, let's say I have a table with three columns col1, col2 and col3 then I am doing:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=table.columns, outputCol="features")
df = assembler.transform(table).select("features")
from pyspark.ml.feature import PCA
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
At this time I have run PCA with 2 components and I can look at its values as:
m = model.pc.values.reshape(3, 2)
which corresponds to 3 (= number of columns in my original table) rows and 2 (= number of components in my PCA) columns. My question is are the three rows here in the same order in which I had specified my input columns to the vector assembler above? To clarify it further does the above matrix correspond to:
| PC1 | PC2 |
---------|-----|-----|
col1 | | |
---------|-----|-----|
col2 | | |
---------|-----|-----|
col3 | | |
---------+-----+-----+
Note that the example here is only for clarity. In my real problem I am dealing with ~1600 columns and bunch of selections. I could not find any definitive answer to this in spark documentation. I want to do this to pick best columns / features from my original table to train my model based on the top principal components. Or is there anything else / better in spark ML PCA that I should be looking at to deduce such result?
Or I cannot use PCA for this and have to use other techniques like spearman ranking etc.?

are the (...) rows here in the same order in which I had specified my input columns
Yes, they are. Let's trace what is going on:
from pyspark.ml.feature import PCA, VectorAssembler
data = [
(0.0, 1.0, 0.0, 7.0, 0.0), (2.0, 0.0, 3.0, 4.0, 5.0),
(4.0, 0.0, 0.0, 6.0, 7.0)
]
df = spark.createDataFrame(data, ["u", "v", "x", "y", "z"])
VectorAssembler follows the order of columns:
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
vectors = assembler.transform(df).select("features")
vectors.schema[0].metadata
# {'ml_attr': {'attrs': {'numeric': [{'idx': 0, 'name': 'u'},
# {'idx': 1, 'name': 'v'},
# {'idx': 2, 'name': 'x'},
# {'idx': 3, 'name': 'y'},
# {'idx': 4, 'name': 'z'}]},
# 'num_attrs': 5}}
The principal components follow the same order:
model = PCA(inputCol="features", outputCol="pc_features", k=3).fit(vectors)
?model.pc
# Type: property
# String form: <property object at 0x7feb5bdc1d68>
# Docstring:
# Returns a principal components Matrix.
# Each column is one principal component.
#
# .. versionadded:: 2.0.0
Finally sanity check:
import numpy as np
x = np.array(data)
y = model.pc.values.reshape(3, 5).transpose()
z = np.array(model.transform(vectors).rdd.map(lambda x: x.pc_features).collect())
np.linalg.norm(x.dot(y) - z)
# 8.881784197001252e-16
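To read the loadings per input column, the flat model.pc.values array has to be unpacked in column-major order, which is what the reshape(3, 5).transpose() idiom above does. The sketch below is NumPy only and uses a made-up flat array as a stand-in for model.pc.values; the numbers and the names feature_names, flat and loadings are illustrative assumptions, not Spark output. It labels the rows with the assembler's input order and ranks the columns by absolute loading on the first component.

```python
import numpy as np

# Hypothetical stand-in for model.pc.values: a flat, column-major array
# for 3 input features and 2 principal components (made-up numbers).
feature_names = ["col1", "col2", "col3"]  # order passed to VectorAssembler
k = 2
flat = np.array([0.8, 0.1, -0.6,   # PC1 loadings for col1, col2, col3
                 0.2, -0.9, 0.4])  # PC2 loadings for col1, col2, col3

# Column-major ("Fortran") reshape: rows are features, columns are components.
loadings = flat.reshape((len(feature_names), k), order="F")

# Same result as the reshape(k, n).transpose() idiom.
same_as_transpose = np.array_equal(loadings, flat.reshape(k, len(feature_names)).T)

# A plain row-major reshape(n, k) scrambles the loadings across rows.
row_major_differs = not np.array_equal(loadings, flat.reshape(len(feature_names), k))

# Rank the input columns by absolute loading on the first component.
order = np.argsort(-np.abs(loadings[:, 0]))
ranked_pc1 = [feature_names[i] for i in order]
print(ranked_pc1)
```

With a real model, model.pc.toArray() should return the same (features x components) array directly, which avoids the reshape altogether.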

You can see the actual order of the columns here
df.schema["features"].metadata["ml_attr"]["attrs"]
There will usually be two classes, ["binary"] and ["numeric"]:
pd.DataFrame(df.schema["features"].metadata["ml_attr"]["attrs"]["binary"]+df.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]).sort_values("idx")
This should give the exact order of all the columns.
You can verify that the order of input and output remains the same.
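The lookup above fails with a KeyError when one of the attribute classes is missing (for example when there are no binary columns). A minimal sketch that avoids this, using a hand-written dictionary shaped like the metadata printed earlier; the helper name ordered_feature_names and the sample dictionary are assumptions for illustration:

```python
# Hypothetical metadata, shaped like df.schema["features"].metadata
metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "u"}, {"idx": 2, "name": "x"}],
            "binary": [{"idx": 1, "name": "v"}],
        },
        "num_attrs": 3,
    }
}


def ordered_feature_names(metadata):
    """Return feature names sorted by their position in the assembled vector."""
    groups = metadata["ml_attr"]["attrs"]
    # Merge whatever attribute classes are present (numeric, binary, ...).
    attrs = [attr for group in groups.values() for attr in group]
    return [attr["name"] for attr in sorted(attrs, key=lambda a: a["idx"])]


names = ordered_feature_names(metadata)
print(names)
```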

Related

how to use ColumnTransformer() to return a dataframe?

I have a dataframe like this:
department review projects salary satisfaction bonus avg_hrs_month left
0 operations 0.577569 3 low 0.626759 0 180.866070 0
1 operations 0.751900 3 medium 0.443679 0 182.708149 0
2 support 0.722548 3 medium 0.446823 0 184.416084 0
3 logistics 0.675158 4 high 0.440139 0 188.707545 0
4 sales 0.676203 3 high 0.577607 1 179.821083 0
I want to try ColumnTransformer() and return a transformed dataframe.
ord_features = ["salary"]
ordinal_transformer = OrdinalEncoder()
cat_features = ["department"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
ct = ColumnTransformer(
transformers=[
("ord", ordinal_transformer, ord_features),
("cat", categorical_transformer, cat_features ),
]
)
df_new = ct.fit_transform(df)
df_new
which gives me a 'sparse matrix of type '<class 'numpy.float64'>'
if I use pd.DataFrame(ct.fit_transform(df)) then I'm getting a single column:
0
0 (0, 0)\t1.0\n (0, 7)\t1.0
1 (0, 0)\t2.0\n (0, 7)\t1.0
2 (0, 0)\t2.0\n (0, 10)\t1.0
3 (0, 5)\t1.0
4 (0, 9)\t1.0
however, I was expecting to see the transformed dataframe like this?
review projects salary satisfaction bonus avg_hrs_month operations support ...
0 0.577569 3 1 0.626759 0 180.866070 1 0
1 0.751900 3 2 0.443679 0 182.708149 1 0
2 0.722548 3 2 0.446823 0 184.416084 0 1
3 0.675158 4 3 0.440139 0 188.707545 0 0
4 0.676203 3 3 0.577607 1 179.821083 0 0
Is it possible with ColumnTransformer()?
As quickly sketched in the comment there are a couple of considerations to be done on your example:
The .fit_transform() method generally returns either a sparse matrix or a numpy array. Returning a sparse matrix serves the purpose of saving memory; think of the case where you one-hot-encode a categorical attribute with many categories. You end up with a matrix with many columns and a single non-zero entry per row; with a sparse matrix you only store the locations of the non-zero elements. In this situation you can call .toarray() on the output of .fit_transform() to get a numpy array to pass to the pd.DataFrame constructor.
Actually, on a five-rows dataset similar to the one you provided
df = pd.DataFrame({
'department': ['operations', 'operations', 'support', 'logistics', 'sales'],
'review': [0.577569, 0.751900, 0.722548, 0.675158, 0.676203],
'projects': [3, 3, 3, 4, 3],
'salary': ['low', 'medium', 'medium', 'low', 'high'],
'satisfaction': [0.626759, 0.751900, 0.722548, 0.675158, 0.676203],
'bonus': [0, 0, 0, 0, 1],
'avg_hrs_month': [180.866070, 182.708149, 184.416084, 188.707545, 179.821083],
'left': [0, 0, 1, 0, 0]
})
ord_features = ["salary"]
ordinal_transformer = OrdinalEncoder()
cat_features = ["department"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
ct = ColumnTransformer(transformers=[
("ord", ordinal_transformer, ord_features),
("cat", categorical_transformer, cat_features),
])
I can't reproduce your issue (namely, I directly obtain a numpy array), but basically pd.DataFrame(ct.fit_transform(df).toarray()) should work for your case. This is the output you would get:
As you can see, with respect to your expected output, this only contains the transformed (ordinally encoded) salary column as first column and the transformed (one-hot-encoded) department column from the second to the last column. That's because, as you can see within the docs, parameter remainder is set to 'drop' by default, which implies that all columns which are not subject to transformation are dropped. To avoid this, you should set it to 'passthrough'; this will help you to transform the columns you need and keep the other untouched.
ct = ColumnTransformer(transformers=[
("ord", ordinal_transformer, ord_features),
("cat", categorical_transformer, cat_features )],
remainder='passthrough'
)
This would be the output of your pd.DataFrame(ct.fit_transform(df).toarray()) in such a case:
Again, as you can see also column order is not the one you would expect after the transformation. Long story short, that's because in a ColumnTransformer
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
On this point, I would suggest reading Preserve column order after applying sklearn.compose.ColumnTransformer.
Eventually, for what concerns column names you should probably apply a custom solution passing what you want directly to the columns parameter to be passed to the pd.DataFrame constructor. Indeed, OrdinalEncoder (differently from OneHotEncoder) does not provide a .get_feature_names_out() method that makes it generally easy to pass columns=ct.get_feature_names_out() to the pd.DataFrame constructor. See ColumnTransformer & Pipeline with OHE - Is the OHE encoded field retained or removed after ct is performed? for an example of its usage.
Update 10/2022 - sklearn version 1.2.dev0
With sklearn version 1.2.0 it will be possible to solve the problem of returning a DataFrame when transforming a ColumnTransformer instance much more easily. Such version has not been released yet, but you can test the following in dev (version 1.2.dev0), by installing the nightly builds as such:
pip install --pre --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple scikit-learn -U
The ColumnTransformer (and other transformers as well) now exposes a .set_output() method which gives the possibility to configure a transformer to output pandas DataFrames, by passing parameter transform='pandas' to it.
Therefore, the example becomes:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
df = pd.DataFrame({
'department': ['operations', 'operations', 'support', 'logistics', 'sales'],
'review': [0.577569, 0.751900, 0.722548, 0.675158, 0.676203],
'projects': [3, 3, 3, 4, 3],
'salary': ['low', 'medium', 'medium', 'low', 'high'],
'satisfaction': [0.626759, 0.751900, 0.722548, 0.675158, 0.676203],
'bonus': [0, 0, 0, 0, 1],
'avg_hrs_month': [180.866070, 182.708149, 184.416084, 188.707545, 179.821083],
'left': [0, 0, 1, 0, 0]
})
ord_features = ["salary"]
ordinal_transformer = OrdinalEncoder()
cat_features = ["department"]
categorical_transformer = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
ct = ColumnTransformer(transformers=[
("ord", ordinal_transformer, ord_features),
("cat", categorical_transformer, cat_features )],
remainder='passthrough'
)
ct.set_output(transform='pandas')  # transform is a keyword-only argument
df_pandas = ct.fit_transform(df)
df_pandas
The output also becomes much easier to read as it has proper column names (indeed, at each step, the transformers of which ColumnTransformer is made of do have the attribute feature_names_in_; so you don't lose column names anymore while transforming the input).
Last note. Observe that the example now requires parameter sparse_output=False to be passed to the OneHotEncoder instance in order to work.
This answer skips the workaround and directly provides a solution for scikit-learn version 1.2+
From sklearn version 1.2 on, transformers can return a pandas DataFrame directly without further handling. It is done with set_output, which can be configured per estimator by calling the set_output method or globally by setting set_config(transform_output="pandas"). See Release Highlights for scikit-learn 1.2 - Pandas output with set_output API
In your case the solution would be:
ord_features = ["salary"]
ordinal_transformer = OrdinalEncoder()
cat_features = ["department"]
# pandas output does not support sparse data, so request a dense result
categorical_transformer = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
ct = ColumnTransformer(
transformers=[
("ord", ordinal_transformer, ord_features),
("cat", categorical_transformer, cat_features ),
]
)
# Add the following line to your code
ct.set_output(transform="pandas")
df_new = ct.fit_transform(df)
df_new

Pandas dataframe float index not self-consistent

I need/want to work with float indices in pandas but I get a keyerror when running something like this:
inds = [1.1, 2.2]
cols = [5.4, 6.7]
df = pd.DataFrame(np.random.randn(2, 2), index=inds, columns=cols)
df[df.index[0]]
I have seen some errors regarding precision, but shouldn't this work?
You get the KeyError because df[df.index[0]] would try to access a column with label 1.1 in this case - which does not exist here.
What you can do is use loc or iloc to access rows based on indices:
import numpy as np
import pandas as pd
inds = [1.1, 2.2]
cols = [5.4, 6.7]
df = pd.DataFrame(np.random.randn(2, 2), index=inds, columns=cols)
# to access e.g. the first row use
df.loc[df.index[0]]
# or more general
df.iloc[0]
# 5.4 1.531411
# 6.7 -0.341232
# Name: 1.1, dtype: float64
In principle, if you can, avoid equality comparisons on floating point numbers, for the reason you already came across: precision. The 1.1 displayed to you might be != 1.1 for the computer, simply because representing it exactly would theoretically require infinite precision. Most of the time it will work though, because certain tolerance checks kick in; for example if the difference between the compared numbers is < 10^-6.
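The precision point can be seen with plain Python floats. This is only a sketch of the general issue, using the standard library; the tolerance value is an arbitrary choice for illustration:

```python
import math

a = 0.1 + 0.2

# Exact comparison fails because neither 0.1, 0.2 nor 0.3 is exactly
# representable in binary floating point.
exact = (a == 0.3)

# Comparing with a tolerance is the usual workaround.
close = math.isclose(a, 0.3, rel_tol=1e-9)

print(exact, close)
```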

PySpark: how to aggregate over column arrays with variable width?

I am attempting to aggregate and create an array of means thus (this is a Minimal Working Example):
n = len(allele_freq_total.select("alleleFrequencies").first()[0])
allele_freq_by_site = allele_freq_total.groupBy("contigName", "start", "end", "referenceAllele").agg(
    array(*[mean(col("alleleFrequencies")[i]) for i in range(n)]).alias("mean_alleleFrequencies")
)
using a solution that I got from
Aggregate over column arrays in DataFrame in PySpark?
but the problem is that n is variable, how do I alter
array(*[mean(col("alleleFrequencies")[i]) for i in range(n)])
so that it takes variable length into consideration?
With arrays of unequal size in the different groups (for you, a group is ("contigName", "start", "end", "referenceAllele"), which I'll simply rename to group), you could consider exploding the array column (the alleleFrequencies), with introduction of the position the values had within the arrays. That will give you an additional column you can use in grouping to compute the average you had in mind. At this point you might actually have enough for further computations (see df3.show() below).
If you really must have the result back as an array, that is harder, and I don't have a ready solution. The order must be tracked, and I believe that is easiest with a map (a dictionary, if you like). To build it, I use the aggregation function collect_list on two columns. While collect_list isn't deterministic (you don't know the order in which values will be returned in the list, because rows are shuffled), the two collected lists stay aligned with each other, as the rows get shuffled in their entirety (see df4.show(), below). From there, you can create a mapping from position to average with map_from_arrays.
Example:
>>> from pyspark.sql.functions import mean, col, posexplode, collect_list, map_from_arrays
>>>
>>> df = spark.createDataFrame([
... ("A", [0, 1, 2]),
... ("A", [0, 3, 6]),
... ("B", [1, 2, 4, 5]),
... ("B", [1, 2, 6, 1])],
... schema=("group", "values"))
>>> df2 = df.select(df.group, posexplode(df.values)) # adds the "pos" and "col" columns
>>> df3 = (df2
... .groupBy("group", "pos")
... .agg(mean(col("col")).alias("avg_of_positions"))
... )
>>> df4 = (df3
... .groupBy("group")
... .agg(
... collect_list("pos").alias("pos"),
... collect_list("avg_of_positions").alias("avgs")
... )
... )
>>> df5 = df4.select(
... "group",
... map_from_arrays(col("pos"), col("avgs")).alias("positional_averages")
... )
>>> df5.show(truncate=False)
+-----+----------------------------------------+
|group|positional_averages |
+-----+----------------------------------------+
|B |[0 -> 1.0, 1 -> 2.0, 3 -> 3.0, 2 -> 5.0]|
|A |[0 -> 0.0, 1 -> 2.0, 2 -> 4.0] |
+-----+----------------------------------------+
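The same logic can be followed in plain Python, which also shows how the position-to-average map can be turned back into an ordered array by sorting on the position. This is an emulation on the same toy data, not Spark code; enumerate plays the role of posexplode, and the variable names are my own:

```python
from collections import defaultdict

rows = [
    ("A", [0, 1, 2]),
    ("A", [0, 3, 6]),
    ("B", [1, 2, 4, 5]),
    ("B", [1, 2, 6, 1]),
]

# "posexplode" + groupBy(group, pos): accumulate sum and count per position.
sums = defaultdict(lambda: [0.0, 0])
for group, values in rows:
    for pos, value in enumerate(values):
        acc = sums[(group, pos)]
        acc[0] += value
        acc[1] += 1

# Map position -> average, one map per group (like map_from_arrays).
averages = defaultdict(dict)
for (group, pos), (total, count) in sums.items():
    averages[group][pos] = total / count

# Back to an array: sort the map keys so the order no longer depends on
# the order in which entries were collected.
as_arrays = {g: [m[p] for p in sorted(m)] for g, m in averages.items()}
print(as_arrays)
```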

How does Spark model treat vector column?

How will a Spark model treat a vector assembler column? For example, if I have longitude and latitude columns, is it better to assemble them using VectorAssembler and then put the result into my model, or does it not make any difference if I just put them in directly (separately)?
Example1:
loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')
vector_assembler = VectorAssembler(inputCols=['loc', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[loc_assembler, vector_assembler, lr])
Example2:
vector_assembler = VectorAssembler(inputCols=['long', 'lat', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[vector_assembler, lr])
What is the difference? Which one is better?
There will not be any difference simply because, in both your examples, the final form of the features column will be the same, i.e. in your 1st example, the loc vector will be broken back into its individual components.
Here is short demonstration with dummy data (leaving the linear regression part aside, as it is unnecessary for this discussion):
spark.version
# u'2.3.1'
# dummy data:
df = spark.createDataFrame([[0, 33.3, -17.5, 10., 0.2],
[1, 40.4, -20.5, 12., 2.2],
[2, 28., -23.9, -2., -1.7],
[3, 29.5, -19.0, -0.5, -0.2],
[4, 32.8, -18.84, 1.5, 1.8]
],
["id","lat", "long", "other", "label"])
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.pipeline import Pipeline
loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')
vector_assembler = VectorAssembler(inputCols=['loc', 'other'], outputCol='features')
pipeline = Pipeline(stages=[loc_assembler, vector_assembler])
model = pipeline.fit(df)
model.transform(df).show()
The result is:
+---+----+------+-----+-----+-------------+-----------------+
| id| lat| long|other|label| loc| features|
+---+----+------+-----+-----+-------------+-----------------+
| 0|33.3| -17.5| 10.0| 0.2| [-17.5,33.3]|[-17.5,33.3,10.0]|
| 1|40.4| -20.5| 12.0| 2.2| [-20.5,40.4]|[-20.5,40.4,12.0]|
| 2|28.0| -23.9| -2.0| -1.7| [-23.9,28.0]|[-23.9,28.0,-2.0]|
| 3|29.5| -19.0| -0.5| -0.2| [-19.0,29.5]|[-19.0,29.5,-0.5]|
| 4|32.8|-18.84| 1.5| 1.8|[-18.84,32.8]|[-18.84,32.8,1.5]|
+---+----+------+-----+-----+-------------+-----------------+
i.e. the features column is arguably identical with your 2nd example (not shown here), where you do not use the intermediate assembled feature loc...

Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

I am reducing the dimensionality of a Spark DataFrame with PCA model with pyspark (using the spark ml library) as follows:
pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)
where data is a Spark DataFrame with one column labeled features which is a DenseVector of 3 dimensions:
data.take(1)
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')
After fitting, I transform the data:
transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))
How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?
[UPDATE: From Spark 2.2 onwards, PCA and SVD are both available in PySpark - see JIRA ticket SPARK-6227 and PCA & PCAModel for Spark ML 2.2; original answer below is still applicable for older Spark versions.]
Well, it seems incredible, but indeed there is not a way to extract such information from a PCA decomposition (at least as of Spark 1.5). But again, there have been many similar "complaints" - see here, for example, for not being able to extract the best parameters from a CrossValidatorModel.
Fortunately, some months ago, I attended the 'Scalable Machine Learning' MOOC by AMPLab (Berkeley) & Databricks, i.e. the creators of Spark, where we implemented a full PCA pipeline 'by hand' as part of the homework assignments. I have modified my functions from back then (rest assured, I got full credit :-), so as to work with dataframes as inputs (instead of RDD's), of the same format as yours (i.e. Rows of DenseVectors containing the numerical features).
We first need to define an intermediate function, estimateCovariance, as follows:
import numpy as np
def estimateCovariance(df):
    """Compute the covariance matrix for a given dataframe.
    Note:
        The multi-dimensional covariance array should be calculated using outer products. Don't
        forget to normalize the data by first subtracting the mean.
    Args:
        df: A Spark dataframe with a column named 'features', which (column) consists of DenseVectors.
    Returns:
        np.ndarray: A multi-dimensional array where the number of rows and columns both equal the
            length of the arrays in the input dataframe.
    """
    m = df.select(df['features']).map(lambda x: x[0]).mean()
    dfZeroMean = df.select(df['features']).map(lambda x: x[0]).map(lambda x: x-m)  # subtract the mean
    return dfZeroMean.map(lambda x: np.outer(x,x)).sum()/df.count()
Then, we can write a main pca function as follows:
from numpy.linalg import eigh
def pca(df, k=2):
    """Computes the top `k` principal components, corresponding scores, and all eigenvalues.
    Note:
        All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
        each eigenvectors as a column. This function should also return eigenvectors as columns.
    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k (int): The number of principal components to return.
    Returns:
        tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A tuple of (eigenvectors, `RDD` of
            scores, eigenvalues). Eigenvectors is a multi-dimensional array where the number of
            rows equals the length of the arrays in the input `RDD` and the number of columns equals
            `k`. The `RDD` of scores has the same number of rows as `data` and consists of arrays
            of length `k`. Eigenvalues is an array of length d (the number of features).
    """
    cov = estimateCovariance(df)
    col = cov.shape[1]
    eigVals, eigVecs = eigh(cov)
    inds = np.argsort(eigVals)
    eigVecs = eigVecs.T[inds[-1:-(col+1):-1]]
    components = eigVecs[0:k]
    eigVals = eigVals[inds[-1:-(col+1):-1]]  # sort eigenvals
    score = df.select(df['features']).map(lambda x: x[0]).map(lambda x: np.dot(x, components.T) )
    # Return the `k` principal components, `k` scores, and all eigenvalues
    return components.T, score, eigVals
Test
Let's see first the results with the existing method, using the example data from the Spark ML PCA documentation (modifying them so as to be all DenseVectors):
from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors
data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data,["features"])
pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
model.transform(df).collect()
[Row(features=DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]), pca_features=DenseVector([1.6486, -4.0133])),
Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), pca_features=DenseVector([-4.6451, -1.1168])),
Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), pca_features=DenseVector([-6.4289, -5.338]))]
Then, with our method:
comp, score, eigVals = pca(df)
score.collect()
[array([ 1.64857282, 4.0132827 ]),
array([-4.64510433, 1.11679727]),
array([-6.42888054, 5.33795143])]
Let me stress that we don't use any collect() methods in the functions we have defined - score is an RDD, as it should be.
Notice that the signs of our second column are all opposite from the ones derived by the existing method; but this is not an issue: according to the (freely downloadable) An Introduction to Statistical Learning, co-authored by Hastie & Tibshirani, p. 382
Each principal component loading vector is unique, up to a sign flip. This
means that two different software packages will yield the same principal
component loading vectors, although the signs of those loading vectors
may differ. The signs may differ because each principal component loading
vector specifies a direction in p-dimensional space: flipping the sign has no
effect as the direction does not change. [...] Similarly, the score vectors are unique
up to a sign flip, since the variance of Z is the same as the variance of −Z.
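The sign ambiguity is easy to check numerically. The sketch below is NumPy only and works on the same three example rows; it is not the Spark code path, and the covariance is normalized by n as in estimateCovariance above. It confirms that both v and -v satisfy the eigenvector equation and that flipping the sign leaves the variance of the scores unchanged.

```python
import numpy as np

X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])

Xc = X - X.mean(axis=0)        # subtract the mean
cov = Xc.T @ Xc / X.shape[0]   # covariance, normalized by n

eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
lam, v = eigvals[-1], eigvecs[:, -1]    # leading eigenpair

# Both v and -v are eigenvectors for the same eigenvalue.
both_valid = np.allclose(cov @ v, lam * v) and np.allclose(cov @ (-v), lam * (-v))

# The scores change sign, but their variance does not.
scores, flipped = X @ v, X @ (-v)
same_variance = np.isclose(scores.var(), flipped.var())
print(both_valid, same_variance)
```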
Finally, now that we have the eigenvalues available, it is trivial to write a function for the percentage of the variance explained:
def varianceExplained(df, k=1):
    """Calculate the fraction of variance explained by the top `k` eigenvectors.
    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k: The number of principal components to consider.
    Returns:
        float: A number between 0 and 1 representing the percentage of variance explained
            by the top `k` eigenvectors.
    """
    components, scores, eigenvalues = pca(df, k)
    return sum(eigenvalues[0:k])/sum(eigenvalues)
varianceExplained(df,1)
# 0.79439325322305299
As a test, we also check if the variance explained in our example data is 1.0, for k=5 (since the original data are 5-dimensional):
varianceExplained(df,5)
# 1.0
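The two figures above can be cross-checked without Spark. This is a NumPy-only sketch on the same example rows; the helper name variance_explained is my own, and it takes the eigenvalues directly rather than a dataframe.

```python
import numpy as np

X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])

Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / X.shape[0]

# Eigenvalues of the covariance matrix, largest first.
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]


def variance_explained(eigvals, k):
    """Fraction of total variance carried by the top k eigenvalues."""
    return eigvals[:k].sum() / eigvals.sum()


ratio_1 = variance_explained(eigvals, 1)
ratio_5 = variance_explained(eigvals, 5)
print(ratio_1, ratio_5)
```

The ratio does not depend on whether the covariance is divided by n or n - 1, since the factor cancels.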
[Developed & tested with Spark 1.5.0 & 1.5.1]
EDIT :
PCA and SVD are finally both available in pyspark starting spark 2.2.0 according to this resolved JIRA ticket SPARK-6227.
Original answer:
The answer given by @desertnaut is actually excellent from a theoretical perspective, but I wanted to present another approach: computing the SVD and then extracting the eigenvectors.
from pyspark.mllib.common import callMLlibFunc, JavaModelWrapper
from pyspark.mllib.linalg.distributed import RowMatrix
class SVD(JavaModelWrapper):
    """Wrapper around the SVD scala case class"""
    @property
    def U(self):
        """ Returns a RowMatrix whose columns are the left singular vectors of the SVD if computeU was set to be True."""
        u = self.call("U")
        if u is not None:
            return RowMatrix(u)
    @property
    def s(self):
        """Returns a DenseVector with singular values in descending order."""
        return self.call("s")
    @property
    def V(self):
        """ Returns a DenseMatrix whose columns are the right singular vectors of the SVD."""
        return self.call("V")
This defines our SVD object. We can define now our computeSVD method using the Java Wrapper.
def computeSVD(row_matrix, k, computeU=False, rCond=1e-9):
    """
    Computes the singular value decomposition of the RowMatrix.
    The given row matrix A of dimension (m X n) is decomposed into U * s * V'T where
    * s: DenseVector consisting of square root of the eigenvalues (singular values) in descending order.
    * U: (m X k) (left singular vectors) is a RowMatrix whose columns are the eigenvectors of (A X A')
    * v: (n X k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A' X A)
    :param k: number of singular values to keep. We might return less than k if there are numerically zero singular values.
    :param computeU: Whether or not to compute U. If set to be True, then U is computed by A * V * sigma^-1
    :param rCond: the reciprocal condition number. All singular values smaller than rCond * sigma(0) are treated as zero, where sigma(0) is the largest singular value.
    :returns: SVD object
    """
    java_model = row_matrix._java_matrix_wrapper.call("computeSVD", int(k), computeU, float(rCond))
    return SVD(java_model)
Now, let's apply that to an example :
from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors
data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),), (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),), (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data,["features"])
pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
features = model.transform(df) # this create a DataFrame with the regular features and pca_features
# We can now extract the pca_features to prepare our RowMatrix.
pca_features = features.select("pca_features").rdd.map(lambda row : row[0])
mat = RowMatrix(pca_features)
# Once the RowMatrix is ready we can compute our Singular Value Decomposition
svd = computeSVD(mat,2,True)
svd.s
# DenseVector([9.491, 4.6253])
svd.U.rows.collect()
# [DenseVector([0.1129, -0.909]), DenseVector([0.463, 0.4055]), DenseVector([0.8792, -0.0968])]
svd.V
# DenseMatrix(2, 2, [-0.8025, -0.5967, -0.5967, 0.8025], 0)
In spark 2.2+ you can now easily get the explained variance as:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=<columns of your original dataframe>, outputCol="features")
df = assembler.transform(<your original dataframe>).select("features")
from pyspark.ml.feature import PCA
pca = PCA(k=10, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
sum(model.explainedVariance)
The easiest answer to your question is to input an identity matrix to your model.
identity_input = [(Vectors.dense([1.0, .0, 0.0, .0, 0.0]),),(Vectors.dense([.0, 1.0, .0, .0, .0]),), \
(Vectors.dense([.0, 0.0, 1.0, .0, .0]),),(Vectors.dense([.0, 0.0, .0, 1.0, .0]),),
(Vectors.dense([.0, 0.0, .0, .0, 1.0]),)]
df_identity = sqlContext.createDataFrame(identity_input,["features"])
identity_features = model.transform(df_identity)
This should give you the principal components.
I think eliasah's answer is better in terms of Spark framework because desertnaut is solving the problem by using numpy's functions instead of Spark's actions. However, eliasah's answer is missing normalizing the data. So, I'd add the following lines to eliasah's answer:
from pyspark.ml.feature import StandardScaler
standardizer = StandardScaler(withMean=True, withStd=False,
inputCol='features',
outputCol='std_features')
model = standardizer.fit(df)
output = model.transform(df)
pca_features = output.select("std_features").rdd.map(lambda row : row[0])
mat = RowMatrix(pca_features)
svd = computeSVD(mat,5,True)
Eventually, svd.V and identity_features.select("pca_features").collect() should have identical values.
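Why the mean-centering step matters can be sketched with NumPy alone. On the same example rows, the squared singular values of the centered data, divided by n, match the eigenvalues of the covariance matrix; without centering they do not. This is an illustration of the relationship, not the Spark RowMatrix code, and the n normalization is an assumption that mirrors estimateCovariance above.

```python
import numpy as np

X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])
n = X.shape[0]

Xc = X - X.mean(axis=0)  # same effect as StandardScaler(withMean=True, withStd=False)
cov_eigvals = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / n))[::-1]

# SVD of the centered data: s**2 / n reproduces the covariance eigenvalues.
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
centered_matches = np.allclose(s**2 / n, cov_eigvals[:len(s)], atol=1e-9)

# SVD of the raw data: the singular values no longer correspond to PCA.
_, s_raw, _ = np.linalg.svd(X, full_matrices=False)
raw_matches = np.allclose(s_raw**2 / n, cov_eigvals[:len(s_raw)], atol=1e-9)
print(centered_matches, raw_matches)
```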
I have summarized PCA and its use in Spark and sklearn in this blog post.
