Anomaly detection with PCA in Spark - apache-spark

I read the following article
Anomaly detection with Principal Component Analysis (PCA)
The article states the following:
• PCA algorithm basically transforms data readings from an existing coordinate system into a new coordinate system.
• The closer data readings are to the center of the new coordinate system, the closer these readings are to an optimum value.
• The anomaly score is calculated using the Mahalanobis distance between a reading and the mean of all readings, which is the center of the transformed coordinate system.
Can anyone describe anomaly detection using PCA (using PCA scores and the Mahalanobis distance) in more detail? I'm confused because the definition of PCA is: "PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables". How can the Mahalanobis distance be used when there is no longer any correlation between the variables?
Can anybody explain how to do this in Spark? Does the pca.transform function return the scores for which I should calculate the Mahalanobis distance of every reading to the center?

Let's assume you have a dataset of 3-dimensional points.
Each point has coordinates (x, y, z).
Those (x, y, z) are dimensions.
A point is represented by three values, e.g. (8, 7, 4). This is called the input vector.
When you apply the PCA algorithm, you basically transform your input vector into a new vector. This can be represented as a function that turns (x, y, z) => (v, w).
Example: (8, 7, 4) => (-4, 13)
You have now received a shorter vector (you reduced the number of dimensions), but your point still has coordinates, namely (v, w). This means that you can compute the distance between each point and the mean of all points using the Mahalanobis measure. Points that have a large distance from the mean coordinate are in fact the anomalies.
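To address the correlation part of the question: the Mahalanobis distance of a reading $x$ from the mean $\mu$ is

$$D_M(x) = \sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)}$$

where $\Sigma$ is the covariance matrix. When the variables are uncorrelated (as they are after PCA), $\Sigma$ is (close to) diagonal, and the distance reduces to a Euclidean distance in which each axis is rescaled by its variance:

$$D_M(x) = \sqrt{\sum_i \frac{(x_i - \mu_i)^2}{\sigma_i^2}}$$

so the distance is still perfectly well defined; it just no longer mixes dimensions.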
Example solution:
import breeze.linalg.{DenseVector, inv}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{PCA, StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.{Matrix, Vector}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._

object SparkApp extends App {

  val session = SparkSession.builder()
    .appName("spark-app").master("local[*]").getOrCreate()
  session.sparkContext.setLogLevel("ERROR")
  import session.implicits._

  val df = Seq(
    (1, 4, 0),
    (3, 4, 0),
    (1, 3, 0),
    (3, 3, 0),
    (67, 37, 0) // outlier
  ).toDF("x", "y", "z")

  val vectorAssembler = new VectorAssembler().setInputCols(Array("x", "y", "z")).setOutputCol("vector")
  val standardScaler = new StandardScaler().setInputCol("vector").setOutputCol("normalized-vector")
    .setWithMean(true).setWithStd(true)
  val pca = new PCA().setInputCol("normalized-vector").setOutputCol("pca-features").setK(2)

  val pipeline = new Pipeline().setStages(Array(vectorAssembler, standardScaler, pca))
  val pcaDF = pipeline.fit(df).transform(df)

  // Adds a column with the squared Mahalanobis distance of each PCA score
  // from the origin (the StandardScaler centers the data, so the mean of
  // the transformed readings is the zero vector).
  def withMahalanobis(df: DataFrame, inputCol: String): DataFrame = {
    val Row(coeff1: Matrix) = Correlation.corr(df, inputCol).head
    // 2 x 2 because k = 2 principal components were kept above
    val invCovariance = inv(new breeze.linalg.DenseMatrix(2, 2, coeff1.toArray))
    val mahalanobis = udf[Double, Vector] { v =>
      val vB = DenseVector(v.toArray)
      vB.t * invCovariance * vB
    }
    df.withColumn("mahalanobis", mahalanobis(df(inputCol)))
  }

  val scoredDF: DataFrame = withMahalanobis(pcaDF, "pca-features")

  session.close()
}

Related

Does anybody know how to use the Approximate Nearest Neighbor Search provided by Spark MLlib?

I want to use the Approximate Nearest Neighbor Search provided by Spark MLlib (ref.), but I'm super lost because I didn't find an example or anything to guide me. The only info provided at the previous link is:
Approximate nearest neighbor search takes a dataset (of feature vectors) and a key (a single feature vector), and it approximately returns a specified number of rows in the dataset that are closest to the vector.
Approximate nearest neighbor search accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.
A distance column will be added to the output dataset to show the true distance between each output row and the searched key.
Note: Approximate nearest neighbor search will return fewer than k rows when there are not enough candidates in the hash bucket.
You can find an example at https://spark.apache.org/docs/2.1.0/ml-features.html#lsh-algorithms:
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

val dfA = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0)),
  (2, Vectors.dense(-1.0, -1.0)),
  (3, Vectors.dense(-1.0, 1.0))
)).toDF("id", "keys")

val dfB = spark.createDataFrame(Seq(
  (4, Vectors.dense(1.0, 0.0)),
  (5, Vectors.dense(-1.0, 0.0)),
  (6, Vectors.dense(0.0, 1.0)),
  (7, Vectors.dense(0.0, -1.0))
)).toDF("id", "keys")

val key = Vectors.dense(1.0, 0.0)

val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setInputCol("keys")
  .setOutputCol("values")

val model = brp.fit(dfA)

// Feature transformation
model.transform(dfA).show()

// Cache the transformed columns
val transformedA = model.transform(dfA).cache()
val transformedB = model.transform(dfB).cache()

// Approximate similarity join
model.approxSimilarityJoin(dfA, dfB, 1.5).show()
model.approxSimilarityJoin(transformedA, transformedB, 1.5).show()

// Self join
model.approxSimilarityJoin(dfA, dfA, 2.5).filter("datasetA.id < datasetB.id").show()

// Approximate nearest neighbor search
model.approxNearestNeighbors(dfA, key, 2).show()
model.approxNearestNeighbors(transformedA, key, 2).show()
The code above is from the Spark documentation.
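For PySpark users, here is a minimal sketch of the same idea, assuming Spark 2.2+ and an existing SparkSession named spark (only the nearest-neighbor call is shown):

from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

dfA = spark.createDataFrame([
    (0, Vectors.dense([1.0, 1.0])),
    (1, Vectors.dense([1.0, -1.0])),
    (2, Vectors.dense([-1.0, -1.0])),
    (3, Vectors.dense([-1.0, 1.0]))], ["id", "keys"])

key = Vectors.dense([1.0, 0.0])

brp = BucketedRandomProjectionLSH(inputCol="keys", outputCol="values",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(dfA)

# Return the 2 rows of dfA approximately closest to `key`; the output
# includes a distCol column with the true Euclidean distance to the key.
model.approxNearestNeighbors(dfA, key, 2).show()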

Ax = b solver on coordinate matrix Apache Spark

How can I solve the Ax = b problem using Apache Spark? My input is a coordinate matrix:
import numpy as np
import scipy
from scipy import sparse

row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
A = sparse.coo_matrix((data, (row, col)), shape=(4, 4))

# take the first column of A
b = A.tocsc()[:, 0]

# solve Ax = b (np.linalg.solve needs dense arrays, not sparse matrices;
# note this particular A is singular, so a least-squares solver such as
# scipy.sparse.linalg.lsqr may be needed in practice)
np.linalg.solve(A.toarray(), b.toarray())
Now I want to solve for x in Ax = b using the Python libraries of the Apache Spark framework, so the solution should be [1, 0, 0, 0], since b is the first column of A.
Below is the Apache Spark linear regression. Now, how do I set up the problem so that the input is a coordinate matrix (A) and a coordinate vector (b)?
from pyspark.ml.regression import LinearRegression
# Load training data
training = spark.read.format("libsvm")\
.load("data/mllib/sample_linear_regression_data.txt")
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the model
lrModel = lr.fit(training)
# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))
# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
How can I solve the Ax = b problem using Apache Spark?
Directly (analytically) you cannot: Spark doesn't provide a linear-algebra solver. Indirectly, you can use pyspark.ml.regression to approximately solve the OLS problem; see the sketch below. You can refer to:
API docs
MLlib guide
for details regarding the expected input and required steps.
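For illustration, here is a minimal sketch of that indirect route, assuming an existing SparkSession named spark and the A and b from the question: each row of A becomes a feature vector, the matching entry of b becomes the label, and an unregularized, intercept-free LinearRegression approximates x.

import numpy as np
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression

# densify the scipy matrices for this small example
A_dense = A.toarray()
b_dense = b.toarray().ravel()

rows = [(Vectors.dense(A_dense[i]), float(b_dense[i]))
        for i in range(A_dense.shape[0])]
training = spark.createDataFrame(rows, ["features", "label"])

# regParam=0.0 and fitIntercept=False make the coefficients an (approximate)
# least-squares solution of Ax = b; "l-bfgs" tolerates the singular A above,
# but then x is not unique, so don't expect exactly [1, 0, 0, 0]
lr = LinearRegression(regParam=0.0, fitIntercept=False, solver="l-bfgs")
model = lr.fit(training)
print(model.coefficients)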

How to evaluate an implicitly trained model in Spark MLlib [duplicate]

I'm trying to tune the parameters of an ALS matrix factorization model that uses implicit data. For this, I'm trying to use pyspark.ml.tuning.CrossValidator to run through a parameter grid and select the best model. I believe my problem is in the evaluator, but I can't figure it out.
I can get this to work for an explicit data model with a regression RMSE evaluator, as follows:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import rand
conf = SparkConf() \
    .setAppName("MovieLensALS") \
    .set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

dfRatings = sqlContext.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
    ["user", "item", "rating"])
dfRatingsTest = sqlContext.createDataFrame(
    [(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)], ["user", "item"])

alsExplicit = ALS()
defaultModel = alsExplicit.fit(dfRatings)

paramMapExplicit = ParamGridBuilder() \
    .addGrid(alsExplicit.rank, [8, 12]) \
    .addGrid(alsExplicit.maxIter, [10, 15]) \
    .addGrid(alsExplicit.regParam, [1.0, 10.0]) \
    .build()

evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating")

cvExplicit = CrossValidator(estimator=alsExplicit, estimatorParamMaps=paramMapExplicit, evaluator=evaluatorR)
cvModelExplicit = cvExplicit.fit(dfRatings)

predsExplicit = cvModelExplicit.bestModel.transform(dfRatingsTest)
predsExplicit.show()
When I try to do this for implicit data (let's say counts of views rather than ratings), I get an error that I can't quite figure out. Here's the code (very similar to the above):
dfCounts = sqlContext.createDataFrame(
    [(0, 0, 0), (0, 1, 12), (0, 2, 3), (1, 0, 5), (1, 1, 9), (1, 2, 0), (2, 0, 0), (2, 1, 11), (2, 2, 25)],
    ["user", "item", "rating"])
dfCountsTest = sqlContext.createDataFrame(
    [(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)], ["user", "item"])

alsImplicit = ALS(implicitPrefs=True)
defaultModelImplicit = alsImplicit.fit(dfCounts)

paramMapImplicit = ParamGridBuilder() \
    .addGrid(alsImplicit.rank, [8, 12]) \
    .addGrid(alsImplicit.maxIter, [10, 15]) \
    .addGrid(alsImplicit.regParam, [1.0, 10.0]) \
    .addGrid(alsImplicit.alpha, [2.0, 3.0]) \
    .build()

evaluatorB = BinaryClassificationEvaluator(metricName="areaUnderROC", labelCol="rating")
evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating")

cv = CrossValidator(estimator=alsImplicit, estimatorParamMaps=paramMapImplicit, evaluator=evaluatorR)
cvModel = cv.fit(dfCounts)

predsImplicit = cvModel.bestModel.transform(dfCountsTest)
predsImplicit.show()
I tried doing this with an RMSE evaluator and I get an error. As I understand it, I should also be able to use the AUC metric with the binary classification evaluator, because the predictions of the implicit matrix factorization form a confidence matrix c_ui for the predictions of a binary matrix p_ui, per this paper, which the documentation for pyspark ALS cites.
Using either evaluator gives me an error, and I can't find any fruitful discussion about cross-validating implicit ALS models online. I'm looking through the CrossValidator source code to try to figure out what's wrong, but I'm having trouble. One of my thoughts is that after the process converts the implicit data matrix r_ui to the binary matrix p_ui and the confidence matrix c_ui, I'm not sure what it compares the predicted c_ui matrix against during the evaluation stage.
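For reference, in that paper the binary preference matrix p_ui and the confidence matrix c_ui are derived from the raw observations r_ui as

$$p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases}, \qquad c_{ui} = 1 + \alpha \, r_{ui}$$

where α corresponds to the alpha parameter of Spark's ALS.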
Here is the error:
Traceback (most recent call last):
File "<ipython-input-16-6c43b997005e>", line 1, in <module>
cvModel = cv.fit(dfCounts)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\pipeline.py", line 69, in fit
return self._fit(dataset)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\tuning.py", line 239, in _fit
model = est.fit(train, epm[j])
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\pipeline.py", line 67, in fit
return self.copy(params)._fit(dataset)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\wrapper.py", line 133, in _fit
java_model = self._fit_java(dataset)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\wrapper.py", line 130, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\sql\utils.py", line 45, in deco
return f(*a, **kw)
File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
format(target_id, ".", name), value)
etc.......
UPDATE
I tried scaling the input so it's in the range of 0 to 1 and using an RMSE evaluator. It seems to work well until I try to insert it into the CrossValidator.
The following code works: I get predictions and an RMSE value from my evaluator.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf, col
import pyspark.sql.functions as F
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

conf = SparkConf() \
    .setAppName("ALSPractice") \
    .set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Users 0, 1, 2, 3 - Items 0, 1, 2, 3, 4, 5 - Ratings 0.0-5.0
dfCounts2 = sqlContext.createDataFrame(
    [(0, 0, 5.0), (0, 1, 5.0), (0, 3, 0.0), (0, 4, 0.0),
     (1, 0, 5.0), (1, 2, 4.0), (1, 3, 0.0), (1, 4, 0.0),
     (2, 0, 0.0), (2, 2, 0.0), (2, 3, 5.0), (2, 4, 5.0),
     (3, 0, 0.0), (3, 1, 0.0), (3, 3, 4.0)],
    ["user", "item", "rating"])

dfCountsTest2 = sqlContext.createDataFrame(
    [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4),
     (1, 0), (1, 1), (1, 2), (1, 3), (1, 4),
     (2, 0), (2, 1), (2, 2), (2, 3), (2, 4),
     (3, 0), (3, 1), (3, 2), (3, 3), (3, 4)], ["user", "item"])

# Normalize rating data to the [0, 1] range based on the max rating
colmax = dfCounts2.select(F.max('rating')).collect()[0].asDict().values()[0]
normalize = udf(lambda x: x / colmax, FloatType())
dfCountsNorm = dfCounts2.withColumn('ratingNorm', normalize(col('rating')))

alsImplicit = ALS(implicitPrefs=True)
defaultModelImplicit = alsImplicit.fit(dfCountsNorm)
preds = defaultModelImplicit.transform(dfCountsTest2)

evaluatorR2 = RegressionEvaluator(metricName="rmse", labelCol="ratingNorm")
evaluatorR2.evaluate(defaultModelImplicit.transform(dfCountsNorm))
What I don't understand is why the following doesn't work. I'm using the same estimator and evaluator, and fitting the same data. Why would these work above but not within the CrossValidator?
paramMapImplicit = ParamGridBuilder() \
    .addGrid(alsImplicit.rank, [8, 12]) \
    .addGrid(alsImplicit.maxIter, [10, 15]) \
    .addGrid(alsImplicit.regParam, [1.0, 10.0]) \
    .addGrid(alsImplicit.alpha, [2.0, 3.0]) \
    .build()

cv = CrossValidator(estimator=alsImplicit, estimatorParamMaps=paramMapImplicit, evaluator=evaluatorR2)
cvModel = cv.fit(dfCountsNorm)
Ignoring technical issues, strictly speaking neither method is correct given the input generated by ALS with implicit feedback:
• you cannot use RegressionEvaluator because, as you already know, the prediction can be interpreted as a confidence value and is represented as a floating-point number in the range [0, 1], while the label column is just an unbounded integer. These values are clearly not comparable.
• you cannot use BinaryClassificationEvaluator because even if the prediction can be interpreted as a probability, the label doesn't represent a binary decision. Moreover, the prediction column has an invalid type and couldn't be used directly with BinaryClassificationEvaluator.
You can try to convert one of the columns so the input fits the requirements, but this is not really a justified approach from a theoretical perspective, and it introduces additional parameters which are hard to tune:
• map the label column to the [0, 1] range and use RMSE.
• convert the label column to a binary indicator with a fixed threshold and extend ALS / ALSModel to return the expected column type. Assuming a threshold value of 1, it could be something like this:
from pyspark.ml.recommendation import *
from pyspark.sql.functions import udf, col
from pyspark.mllib.linalg import DenseVector, VectorUDT

class BinaryALS(ALS):
    def fit(self, df):
        assert self.getImplicitPrefs()
        model = super(BinaryALS, self).fit(df)
        return ALSBinaryModel(model._java_obj)

class ALSBinaryModel(ALSModel):
    def transform(self, df):
        transformed = super(ALSBinaryModel, self).transform(df)
        as_vector = udf(lambda x: DenseVector([1 - x, x]), VectorUDT())
        return transformed.withColumn(
            "rawPrediction", as_vector(col("prediction")))

# Add a binary label column
with_binary = dfCounts.withColumn(
    "label_binary", (col("rating") > 0).cast("double"))

als_binary_model = BinaryALS(implicitPrefs=True).fit(with_binary)

evaluatorB = BinaryClassificationEvaluator(
    metricName="areaUnderROC", labelCol="label_binary")
evaluatorB.evaluate(als_binary_model.transform(with_binary))
## 1.0
Generally speaking, material about evaluating recommender systems with implicit feedback is kind of missing from textbooks; I suggest you read eliasah's answer about evaluating this kind of recommender.
With implicit feedback we don't have user reactions to our recommendations, so we cannot use precision-based metrics.
In the already cited paper, the expected percentile ranking metric is used instead.
You can try to implement an Evaluator based on a similar metric in Spark ML and use it in your cross-validation pipeline; a rough sketch follows.
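This sketch is an assumption of mine, not an existing Spark class: it hard-codes a user column, expects the usual prediction column, and returns the expected percentile ranking (lower is better; a random model scores around 0.5).

from pyspark.ml.evaluation import Evaluator
from pyspark.sql import functions as F
from pyspark.sql.window import Window

class ExpectedPercentileRankEvaluator(Evaluator):
    """Expected percentile ranking from Hu, Koren & Volinsky (2008)."""

    def __init__(self, labelCol="rating", predictionCol="prediction"):
        super(ExpectedPercentileRankEvaluator, self).__init__()
        self.labelCol = labelCol
        self.predictionCol = predictionCol

    def _evaluate(self, dataset):
        # rank_ui in [0, 1]: percentile position of item i when user u's
        # items are sorted by descending predicted confidence
        w = Window.partitionBy("user").orderBy(F.col(self.predictionCol).desc())
        ranked = dataset.withColumn("rank_ui", F.percent_rank().over(w))
        # rank_bar = sum(r_ui * rank_ui) / sum(r_ui)
        agg = ranked.agg(
            F.sum(F.col(self.labelCol) * F.col("rank_ui")).alias("num"),
            F.sum(F.col(self.labelCol)).alias("den")).first()
        return float(agg["num"]) / float(agg["den"])

    def isLargerBetter(self):
        return False

Once defined, it can be passed to CrossValidator as the evaluator in place of the built-in ones.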
Very late to the party here, but I'll post in case anyone stumbles upon this question like I did.
I was getting a similar error when trying to use CrossValidator with an ALS model. I resolved it by setting the coldStartStrategy parameter in ALS to "drop". That is:
alsImplicit = ALS(implicitPrefs=True, coldStartStrategy="drop")
and keep the rest of the code the same.
I expect what was happening in my example is that the cross-validation splits created scenarios where items in the validation set did not appear in the training set, which results in NaN prediction values. The best solution is to drop the NaN values when evaluating, as described in the documentation.
I don't know if we were getting the same error, so I can't guarantee this would solve the OP's problem, but it's good practice to set coldStartStrategy="drop" for cross-validation anyway.
Note: my error message was "Params must be either a param map or a list/tuple of param maps", which didn't seem to imply an issue with the coldStartStrategy parameter or NaN values, but this solution resolved the error.
In order to cross-validate my ALS model with implicitPrefs=True, I needed to adapt @zero323's answer slightly for pyspark==2.3.0, where I was getting the following exception:
py4j.Py4JException: Target Object ID does not exist for this gateway :o2733
    at py4j.Gateway.invoke(Gateway.java...java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
ALS extends JavaEstimator which provides the hooks necessary for fitting Estimators that wrap Java/Scala implementations. We need to override _create_model in BinaryALS so PySpark can keep all the Java object references straight:
import pyspark.sql.functions as F
from pyspark.ml.linalg import DenseVector, VectorUDT
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.sql.dataframe import DataFrame

class ALSBinaryModel(ALSModel):
    def transform(self, df: DataFrame) -> DataFrame:
        transformed = super().transform(df)
        as_vector = F.udf(lambda x: DenseVector([1 - x, x]), VectorUDT())
        return transformed.withColumn("rawPrediction", as_vector(F.col("prediction")))

class BinaryALS(ALS):
    def fit(self, df: DataFrame) -> ALSBinaryModel:
        assert self.getImplicitPrefs()
        return super().fit(df)

    def _create_model(self, java_model) -> ALSBinaryModel:
        return ALSBinaryModel(java_model=java_model)

Using Kmeans to cluster small phrases in Spark

I have a list of words/phrases (around a million) that I would like to cluster. Assume it is the following list:
a_list = [u'java',u'javascript',u'python dev',u'pyspark',u'c ++']
a_list_rdd = sc.parallelize(a_list)
and I follow this procedure:
Using a string distance (let's say the Jaro-Winkler metric), I compute all the distances between the words in the list, which creates a 5x5 matrix with ones on the diagonal, since each word is also compared with itself. To compute all the distances I broadcast the whole list. So:
a_list_rdd_broadcasted = sc.broadcast(a_list_rdd.collect())
and the string distance computations:
import jaro
from numpy import array

def ComputeStringDistance(phrase, phrase_list_broadcasted):
    keyvalueDistances = []
    for value in phrase_list_broadcasted:
        distanceValue = jaro.jaro_winkler_metric(phrase, value)
        keyvalueDistances.append(distanceValue)
    return array(keyvalueDistances)

string_distances = (a_list_rdd
                    .map(lambda phrase: ComputeStringDistance(phrase, a_list_rdd_broadcasted.value))
                    )
and using K-means for clustering:
from pyspark.mllib.clustering import KMeans, KMeansModel

clusters = KMeans.train(string_distances, 3, maxIterations=10,
                        runs=10, initializationMode="random")
PredictGroup = string_distances.map(lambda point: clusters.predict(point)).zip(a_list_rdd)
and the results:
PredictGroup.collect()
Out[73]:
[(0, u'java'),
 (0, u'javascript'),
 (2, u'python'),
 (2, u'pyspark'),
 (1, u'c ++')]
Not bad! But what happens if I have 1 million observations and an estimated number of around 10000 clusters? Reading some posts, a large number of clusters is really expensive. Is there a way to get around this issue?
k-means does not operate on a distance matrix (and distance matrices do not scale anyway).
K-means also does not work with arbitrary distance functions.
It's about minimizing variance, the sum of squared deviations from the mean.
What you are doing works because it's halfway to spectral clustering, but it's neither k-means used correctly nor spectral clustering.
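For reference, the objective that k-means minimizes is the within-cluster sum of squares,

$$\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$

where $\mu_j$ is the mean of cluster $C_j$. That mean only makes sense in the original feature space, which is why an arbitrary string metric such as Jaro-Winkler cannot simply be plugged in.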

Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

I am reducing the dimensionality of a Spark DataFrame with the PCA model in pyspark (using the spark ml library) as follows:
pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)
where data is a Spark DataFrame with one column labeled features, each entry of which is a DenseVector of 3 dimensions:
data.take(1)
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')
After fitting, I transform the data:
transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))
How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?
[UPDATE: From Spark 2.2 onwards, PCA and SVD are both available in PySpark - see JIRA ticket SPARK-6227 and PCA & PCAModel for Spark ML 2.2; original answer below is still applicable for older Spark versions.]
Well, it seems incredible, but indeed there is no way to extract such information from a PCA decomposition (at least as of Spark 1.5). But again, there have been many similar "complaints" - see here, for example, for not being able to extract the best parameters from a CrossValidatorModel.
Fortunately, some months ago, I attended the 'Scalable Machine Learning' MOOC by AMPLab (Berkeley) & Databricks, i.e. the creators of Spark, where we implemented a full PCA pipeline 'by hand' as part of the homework assignments. I have modified my functions from back then (rest assured, I got full credit :-) so as to work with dataframes as inputs (instead of RDDs), of the same format as yours (i.e. Rows of DenseVectors containing the numerical features).
We first need to define an intermediate function, estimateCovariance, as follows:
import numpy as np

def estimateCovariance(df):
    """Compute the covariance matrix for a given dataframe.

    Note:
        The multi-dimensional covariance array should be calculated using outer products.
        Don't forget to normalize the data by first subtracting the mean.

    Args:
        df: A Spark dataframe with a column named 'features', which (column) consists of DenseVectors.

    Returns:
        np.ndarray: A multi-dimensional array where the number of rows and columns both equal the
            length of the arrays in the input dataframe.
    """
    m = df.select(df['features']).map(lambda x: x[0]).mean()
    dfZeroMean = df.select(df['features']).map(lambda x: x[0]).map(lambda x: x - m)  # subtract the mean
    return dfZeroMean.map(lambda x: np.outer(x, x)).sum() / df.count()
Then, we can write a main pca function as follows:
from numpy.linalg import eigh
def pca(df, k=2):
"""Computes the top `k` principal components, corresponding scores, and all eigenvalues.
Note:
All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
each eigenvectors as a column. This function should also return eigenvectors as columns.
Args:
df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
k (int): The number of principal components to return.
Returns:
tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A tuple of (eigenvectors, `RDD` of
scores, eigenvalues). Eigenvectors is a multi-dimensional array where the number of
rows equals the length of the arrays in the input `RDD` and the number of columns equals
`k`. The `RDD` of scores has the same number of rows as `data` and consists of arrays
of length `k`. Eigenvalues is an array of length d (the number of features).
"""
cov = estimateCovariance(df)
col = cov.shape[1]
eigVals, eigVecs = eigh(cov)
inds = np.argsort(eigVals)
eigVecs = eigVecs.T[inds[-1:-(col+1):-1]]
components = eigVecs[0:k]
eigVals = eigVals[inds[-1:-(col+1):-1]] # sort eigenvals
score = df.select(df['features']).map(lambda x: x[0]).map(lambda x: np.dot(x, components.T) )
# Return the `k` principal components, `k` scores, and all eigenvalues
return components.T, score, eigVals
Test
Let's see first the results with the existing method, using the example data from the Spark ML PCA documentation (modifying them so as to be all DenseVectors):
from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors

data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data, ["features"])

pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
model.transform(df).collect()

[Row(features=DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]), pca_features=DenseVector([1.6486, -4.0133])),
 Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), pca_features=DenseVector([-4.6451, -1.1168])),
 Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), pca_features=DenseVector([-6.4289, -5.338]))]
Then, with our method:
comp, score, eigVals = pca(df)
score.collect()
[array([ 1.64857282, 4.0132827 ]),
array([-4.64510433, 1.11679727]),
array([-6.42888054, 5.33795143])]
Let me stress that we don't use any collect() methods in the functions we have defined - score is an RDD, as it should be.
Notice that the signs of our second column are all opposite to the ones derived by the existing method; but this is not an issue: according to the (freely downloadable) An Introduction to Statistical Learning, co-authored by Hastie & Tibshirani, p. 382:
Each principal component loading vector is unique, up to a sign flip. This
means that two different software packages will yield the same principal
component loading vectors, although the signs of those loading vectors
may differ. The signs may differ because each principal component loading
vector specifies a direction in p-dimensional space: flipping the sign has no
effect as the direction does not change. [...] Similarly, the score vectors are unique
up to a sign flip, since the variance of Z is the same as the variance of −Z.
Finally, now that we have the eigenvalues available, it is trivial to write a function for the percentage of the variance explained:
def varianceExplained(df, k=1):
    """Calculate the fraction of variance explained by the top `k` eigenvectors.

    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k: The number of principal components to consider.

    Returns:
        float: A number between 0 and 1 representing the percentage of variance explained
            by the top `k` eigenvectors.
    """
    components, scores, eigenvalues = pca(df, k)
    return sum(eigenvalues[0:k]) / sum(eigenvalues)

varianceExplained(df, 1)
# 0.79439325322305299
As a test, we also check if the variance explained in our example data is 1.0, for k=5 (since the original data are 5-dimensional):
varianceExplained(df,5)
# 1.0
[Developed & tested with Spark 1.5.0 & 1.5.1]
EDIT: PCA and SVD are finally both available in pyspark starting with Spark 2.2.0, according to this resolved JIRA ticket SPARK-6227.
Original answer:
The answer given by @desertnaut is actually excellent from a theoretical perspective, but I wanted to present another approach on how to compute the SVD and then extract the eigenvectors.
from pyspark.mllib.common import callMLlibFunc, JavaModelWrapper
from pyspark.mllib.linalg.distributed import RowMatrix

class SVD(JavaModelWrapper):
    """Wrapper around the SVD scala case class"""

    @property
    def U(self):
        """Returns a RowMatrix whose columns are the left singular vectors of the SVD if computeU was set to be True."""
        u = self.call("U")
        if u is not None:
            return RowMatrix(u)

    @property
    def s(self):
        """Returns a DenseVector with singular values in descending order."""
        return self.call("s")

    @property
    def V(self):
        """Returns a DenseMatrix whose columns are the right singular vectors of the SVD."""
        return self.call("V")
This defines our SVD object. We can now define our computeSVD method using the Java wrapper.
def computeSVD(row_matrix, k, computeU=False, rCond=1e-9):
    """
    Computes the singular value decomposition of the RowMatrix.

    The given row matrix A of dimension (m X n) is decomposed into U * s * V^T where

    * s: DenseVector consisting of the square roots of the eigenvalues (singular values) in descending order.
    * U: (m X k) (left singular vectors) is a RowMatrix whose columns are the eigenvectors of (A X A^T)
    * V: (n X k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A^T X A)

    :param k: number of singular values to keep. We might return fewer than k if there are numerically zero singular values.
    :param computeU: Whether or not to compute U. If set to True, then U is computed by A * V * sigma^-1
    :param rCond: the reciprocal condition number. All singular values smaller than rCond * sigma(0) are treated as zero, where sigma(0) is the largest singular value.
    :returns: SVD object
    """
    java_model = row_matrix._java_matrix_wrapper.call("computeSVD", int(k), computeU, float(rCond))
    return SVD(java_model)
Now, let's apply that to an example:
from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors

data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data, ["features"])

pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
features = model.transform(df)  # this creates a DataFrame with the regular features and the pca_features

# We can now extract the pca_features to prepare our RowMatrix.
pca_features = features.select("pca_features").rdd.map(lambda row: row[0])
mat = RowMatrix(pca_features)

# Once the RowMatrix is ready we can compute our singular value decomposition
svd = computeSVD(mat, 2, True)
svd.s
# DenseVector([9.491, 4.6253])
svd.U.rows.collect()
# [DenseVector([0.1129, -0.909]), DenseVector([0.463, 0.4055]), DenseVector([0.8792, -0.0968])]
svd.V
# DenseMatrix(2, 2, [-0.8025, -0.5967, -0.5967, 0.8025], 0)
In Spark 2.2+ you can now easily get the explained variance as:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=<columns of your original dataframe>, outputCol="features")
df = assembler.transform(<your original dataframe>).select("features")
from pyspark.ml.feature import PCA
pca = PCA(k=10, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
sum(model.explainedVariance)
The easiest answer to your question is to input an identity matrix to your model.
identity_input = [(Vectors.dense([1.0, .0, .0, .0, .0]),), (Vectors.dense([.0, 1.0, .0, .0, .0]),),
                  (Vectors.dense([.0, .0, 1.0, .0, .0]),), (Vectors.dense([.0, .0, .0, 1.0, .0]),),
                  (Vectors.dense([.0, .0, .0, .0, 1.0]),)]
df_identity = sqlContext.createDataFrame(identity_input, ["features"])
identity_features = model.transform(df_identity)
This should give you the principal components: since Spark's PCA transform is just multiplication by the principal-components matrix (no centering), each transformed basis vector is the corresponding row of that matrix.
I think eliasah's answer is better in terms of the Spark framework, because desertnaut solves the problem using numpy's functions instead of Spark's actions. However, eliasah's answer is missing the normalization of the data. So I'd add the following lines to eliasah's answer:
from pyspark.ml.feature import StandardScaler

standardizer = StandardScaler(withMean=True, withStd=False,
                              inputCol='features',
                              outputCol='std_features')
model = standardizer.fit(df)
output = model.transform(df)

pca_features = output.select("std_features").rdd.map(lambda row: row[0])
mat = RowMatrix(pca_features)
svd = computeSVD(mat, 5, True)

Eventually, svd.V and identity_features.select("pca_features").collect() should have identical values.
I have summarized PCA and its use in Spark and sklearn in this blog post.
