I have a DataFrame with document ids (doc_id), line ids for the set of lines in each document (line_id), and a dense vector representation of each line (vectors). For each document (doc_id), I want to convert the set of vectors representing its lines into a mllib.linalg.distributed.BlockMatrix.
It is relatively straightforward to convert the vectors of the entire DataFrame, or of the DataFrame filtered by doc_id, into a BlockMatrix by first converting the vectors into an RDD of ((numRows, numCols), DenseMatrix). A coded example of that is below.
However, I am having trouble converting the RDD of Iterator[((numRows, numCols), DenseMatrix)] returned by mapPartitions, which converts the vectors column for each doc_id partition, into a separate BlockMatrix for each doc_id partition.
My cluster has 3 worker nodes with 16 cores and 62 GB of memory each.
Imports and start spark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Row
from pyspark.sql import Window as W
from pyspark.mllib.random import RandomRDDs
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg import VectorUDT
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg import MatrixUDT
from pyspark.mllib.linalg.distributed import BlockMatrix
spark = (
SparkSession.builder
.master('yarn')
.appName("linalg_test")
.getOrCreate()
)
sc = spark.sparkContext
Create test dataframe
nRows = 25000
""" Create ids dataframe """
win = (W
.partitionBy(F.col('doc_id'))
.rowsBetween(W.unboundedPreceding, W.currentRow)
)
df_ids = (
spark.range(0, nRows, 1)
.withColumn('rand1', (F.rand(seed=12345) * 50).cast(T.IntegerType()))
.withColumn('doc_id', F.floor(F.col('rand1')/3).cast(T.IntegerType()) )
.withColumn('int', F.lit(1))
.withColumn('line_id', F.sum(F.col('int')).over(win))
.select('id', 'doc_id', 'line_id')
)
""" Create vector dataframe """
df_vecSchema = T.StructType([
T.StructField('vectors', T.StructType([T.StructField('vectors', VectorUDT())] ) ),
T.StructField('id', T.LongType())
])
vecDim = 50
df_vec = (
spark.createDataFrame(
RandomRDDs.normalVectorRDD(sc, numRows=nRows, numCols=vecDim, seed=54321)
.map(lambda x: Row(vectors=Vectors.dense(x),))
.zipWithIndex(), schema=df_vecSchema)
.select('id', 'vectors.*')
)
""" Create final test dataframe """
df_SO = (
df_ids.join(df_vec, on='id', how='left')
.select('doc_id', 'line_id', 'vectors')
.orderBy('doc_id', 'line_id')
)
numDocs = df_SO.agg(F.countDistinct(F.col('doc_id'))).collect()[0][0]
# numDocs = df_SO.groupBy('doc_id').agg(F.count(F.col('line_id'))).count()
df_SO = df_SO.repartition(numDocs, 'doc_id')
RDD functions to create matrices out of Vector column
def vec2mat(row):
return (
(row.line_id-1, 0),
Matrices.dense(1, vecDim, (row.vectors.toArray().tolist())), )
create dense matrix out of each line_id vector
mat = df_SO.rdd.map(vec2mat)
create distributed BlockMatrix from RDD of DenseMatrix
blk_mat = BlockMatrix(mat, 1, vecDim)
check output
blk_mat
<pyspark.mllib.linalg.distributed.BlockMatrix at 0x7fe1da370a50>
blk_mat.blocks.take(1)
[((273, 0),
DenseMatrix(1, 50, [1.749, -1.4873, -0.3473, 0.716, 2.3916, -1.5997, -1.7035, 0.0105, ..., -0.0579, 0.3074, -1.8178, -0.2628, 0.1979, 0.6046, 0.4566, 0.4063], 0))]
Problem
I cannot get the same thing to work after converting each partition of doc_id with mapPartitions. The mapPartitions function works, but I cannot get the RDD that it returns converted into a BlockMatrix.
RDD function to create dense matrix out of each line_id vector separately for each doc_id partition
def vec2mat_p(iter):
yield [((row.line_id-1, 0),
Matrices.dense(1, vecDim, (row.vectors.toArray().tolist())), )
for row in iter]
create dense matrix out of each line_id vector separately for each doc_id partition
mat_doc = df_SO.rdd.mapPartitions(vec2mat_p, preservesPartitioning=True)
Check
mat_doc
PythonRDD[4991] at RDD at PythonRDD.scala:48
mat_doc.take(1)
[[((0, 0),
DenseMatrix(1, 50, [1.814, -1.1681, -2.1887, -0.5371, -0.7509, 2.3679, 0.2795, 1.4135, ..., -0.3584, 0.5059, -0.6429, -0.6391, 0.0173, 1.2109, 1.804, -0.9402], 0)),
((1, 0),
DenseMatrix(1, 50, [0.3884, -1.451, -0.0431, -0.4653, -2.4541, 0.2396, 1.8704, 0.8471, ..., -2.5164, 0.1298, -1.2702, -0.1286, 0.9196, -0.7354, -0.1816, -0.4553], 0)),
((2, 0),
DenseMatrix(1, 50, [0.1382, 1.6753, 0.9563, -1.5251, 0.1753, 0.9822, 0.5952, -1.3924, ..., 0.9636, -1.7299, 0.2138, -2.5694, 0.1701, 0.2554, -1.4879, -1.6504], 0)),
...]]
Check types
(mat_doc
.filter(lambda p: len(p) > 0)
.map(lambda mlst: [(type(m[0]), (type(m[0][0]),type(m[0][1])), type(m[1])) for m in mlst] )
.first()
)
[(tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
(tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
(tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
...]
This seems correct; however, running:
(mat_doc
.filter(lambda p: len(p) > 0)
.map(lambda mlst: [BlockMatrix((m[0], m[1])[0], 1, vecDim) for m in mlst] )
.first()
)
results in the following type error:
TypeError: blocks should be an RDD of sub-matrix blocks as ((int, int), matrix) tuples, got
Unfortunately, the error stops short and does not tell me what it 'got'.
Also, I cannot call sc.parallelize() inside of a map() call.
How do I convert each item in the RDD iterator that mapPartitions returns into a RDD that BlockMatrix will accept?
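A hedged workaround sketch, not a verified answer: since BlockMatrix needs a real RDD of blocks (and sc.parallelize() cannot be called from inside map()), one option, assuming the number of documents is modest, is to drop mapPartitions entirely and build one BlockMatrix per doc_id on the driver by filtering, reusing vec2mat from above:

doc_ids = [r.doc_id for r in df_SO.select('doc_id').distinct().collect()]
blk_mats = {}
for d in doc_ids:
    # one Spark job per document; each filtered RDD becomes its own BlockMatrix
    blocks = df_SO.filter(F.col('doc_id') == d).rdd.map(vec2mat)
    blk_mats[d] = BlockMatrix(blocks, 1, vecDim)

This launches a separate job per document, so it only makes sense when the number of distinct doc_id values is small; it is not the per-partition approach described above.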
I am trying to plot the feature importances of certain tree-based models with column names. I am using PySpark.
Since I have textual categorical variables as well as numeric ones, I had to use a pipeline, which goes something like this -
use StringIndexer to index the string columns
use OneHotEncoder for all columns
use a VectorAssembler to create the feature column containing the feature vector
Some sample code from the docs for steps 1,2,3 -
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categoricalColumns = ["workclass", "education", "marital_status", "occupation",
                      "relationship", "race", "sex", "native_country"]
stages = []  # stages in our Pipeline
for categoricalCol in categoricalColumns:
    # Category Indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    # encoder = OneHotEncoderEstimator(inputCol=categoricalCol + "Index",
    #                                  outputCol=categoricalCol + "classVec")
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    # Add stages. These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

numericCols = ["age", "fnlwgt", "education_num", "capital_gain",
               "capital_loss", "hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
# Create a Pipeline.
pipeline = Pipeline(stages=stages)
# Run the feature transformations.
# - fit() computes feature statistics as needed.
# - transform() actually transforms the features.
pipelineModel = pipeline.fit(dataset)
dataset = pipelineModel.transform(dataset)
Finally, train the model.
After training and evaluation, I can use model.featureImportances to get the feature rankings; however, I don't get the feature/column names, only the feature indices, something like this -
print dtModel_1.featureImportances
(38895,[38708,38714,38719,38720,38737,38870,38894],[0.0742343395738,0.169404823667,0.100485791055,0.0105823115814,0.0134236162982,0.194124862158,0.437744255667])
How do I map these back to the initial column names and values, so that I can plot them?
Extract metadata as shown here by user6910411
from itertools import chain

attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in (
        chain(*dataset.schema["features"].metadata["ml_attr"]["attrs"].values())
    )
)
and combine with feature importance:
[
(name, dtModel_1.featureImportances[idx])
for idx, name in attrs
if dtModel_1.featureImportances[idx]
]
The transformed dataset metadata has the required attributes. Here is an easy way to do it -
Create a pandas DataFrame (generally the feature list will not be huge, so there are no memory issues in storing it in a pandas DataFrame):
import pandas as pd

pandasDF = pd.DataFrame(
    dataset.schema["features"].metadata["ml_attr"]["attrs"]["binary"]
    + dataset.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]
).sort_values("idx")
Then create a broadcast dictionary for the mapping. A broadcast is necessary in a distributed environment.
feature_dict = dict(zip(pandasDF["idx"],pandasDF["name"]))
feature_dict_broad = sc.broadcast(feature_dict)
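For illustration only, a hedged sketch of how that broadcast dictionary could be used to attach names to the nonzero importances (assuming featureImportances is a SparseVector, as the question's output suggests):

# look up a name for each nonzero importance via the broadcast dictionary
importances = dtModel_1.featureImportances
named_importances = [
    (feature_dict_broad.value[int(idx)], float(val))
    for idx, val in zip(importances.indices, importances.values)
]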
When creating your assembler, you used a list of variables (assemblerInputs). The order is preserved in the 'features' column, so you can just build a pandas DataFrame:
import pandas as pd

features_imp_pd = (
    pd.DataFrame(
        dtModel_1.featureImportances.toArray(),
        index=assemblerInputs,
        columns=['importance'])
)
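From there, a minimal plotting sketch, assuming matplotlib is available on the driver:

import matplotlib.pyplot as plt

# horizontal bar chart of importances, largest at the top
(features_imp_pd
    .sort_values('importance')
    .plot(kind='barh', legend=False, figsize=(6, 10)))
plt.xlabel('importance')
plt.tight_layout()
plt.show()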
I read the following article:
Anomaly detection with Principal Component Analysis (PCA)
The article states the following:
• PCA algorithm basically transforms data readings from an existing coordinate system into a new coordinate system.
• The closer data readings are to the center of the new coordinate system, the closer these readings are to an optimum value.
• The anomaly score is calculated using the Mahalanobis distance between a reading and the mean of all readings, which is the center of the transformed coordinate system.
Can anyone describe anomaly detection using PCA (using PCA scores and the Mahalanobis distance) in more detail? I'm confused because the definition of PCA is: "PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables." How can the Mahalanobis distance be used when there is no longer any correlation between the variables?
Can anybody explain to me how to do this in Spark? Does the pca.transform function return the scores for which I should calculate the Mahalanobis distance of every reading to the center?
Let's assume you have a dataset of 3-dimensional points.
Each point has coordinates (x, y, z).
Those (x, y, z) are dimensions.
A point is represented by three values, e.g. (8, 7, 4). This is called an input vector.
When you apply the PCA algorithm, you basically transform each input vector into a new vector. This can be represented as a function that turns (x, y, z) => (v, w).
Example: (8, 7, 4) => (-4, 13)
Now you have a shorter vector (you have reduced the number of dimensions), but your point still has coordinates, namely (v, w). This means that you can compute the distance between two points using the Mahalanobis measure. Points that lie a long distance from the mean coordinate are in fact anomalies.
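For reference, the anomaly score described in the article is the (squared) Mahalanobis distance of a transformed reading x from the mean mu of all readings, with covariance matrix S:

D^2(x) = (x - mu)^T * S^-1 * (x - mu)

Because the PCA scores are (nearly) uncorrelated, S is close to diagonal, so the score is essentially a sum of squared coordinates, each scaled by its variance. The example below inverts the correlation matrix of the pca-features column and applies this quadratic form row by row (the scores are already centered, so no explicit mean subtraction appears in the code).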
Example solution:
import breeze.linalg.{DenseVector, inv}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{PCA, StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.{Matrix, Vector}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._
object SparkApp extends App {
val session = SparkSession.builder()
.appName("spark-app").master("local[*]").getOrCreate()
session.sparkContext.setLogLevel("ERROR")
import session.implicits._
val df = Seq(
(1, 4, 0),
(3, 4, 0),
(1, 3, 0),
(3, 3, 0),
(67, 37, 0) //outlier
).toDF("x", "y", "z")
val vectorAssembler = new VectorAssembler().setInputCols(Array("x", "y", "z")).setOutputCol("vector")
val standardScalar = new StandardScaler().setInputCol("vector").setOutputCol("normalized-vector").setWithMean(true)
.setWithStd(true)
val pca = new PCA().setInputCol("normalized-vector").setOutputCol("pca-features").setK(2)
val pipeline = new Pipeline().setStages(
Array(vectorAssembler, standardScalar, pca)
)
val pcaDF = pipeline.fit(df).transform(df)
// Append a squared Mahalanobis-style score computed from the PCA features:
// invert the correlation matrix of the scores and apply v^T * S^-1 * v to each row
// (the scores are already centered, so no mean subtraction is needed here).
def withMahalanobois(df: DataFrame, inputCol: String): DataFrame = {
  val Row(coeff1: Matrix) = Correlation.corr(df, inputCol).head
  val invCovariance = inv(new breeze.linalg.DenseMatrix(2, 2, coeff1.toArray))
  val mahalanobois = udf[Double, Vector] { v =>
    val vB = DenseVector(v.toArray)
    vB.t * invCovariance * vB
  }
  df.withColumn("mahalanobois", mahalanobois(df(inputCol)))
}

val pcaDFWithMahalanobois: DataFrame = withMahalanobois(pcaDF, "pca-features")
session.close()
}
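Rows with the largest mahalanobois values (here the (67, 37, 0) point that was marked as the outlier in the input) are the anomaly candidates; in practice you would sort or threshold on that column.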
I'm training a NaiveBayesModel in Spark; however, when I use it to predict a new instance, I need to get the probabilities for each class. I looked at the code of the predict function in NaiveBayesModel and came up with the following code:
val thetaMatrix = new DenseMatrix (model.labels.length,model.theta(0).length,model.theta.flatten,true)
val piVector = new DenseVector(model.pi)
//val prob = thetaMatrix.multiply(test.features)
val x = test.map {p =>
val prob = thetaMatrix.multiply(p.features)
BLAS.axpy(1.0, piVector, prob)
prob
}
Does this work properly? The line BLAS.axpy(1.0, piVector, prob) keeps giving me an error that the value 'axpy' is not found.
In a recent pull request this was added to the Spark trunk and will be released in Spark 1.5 (closing SPARK-4362). You can therefore call
def predictProbabilities(testData: RDD[Vector]): RDD[Vector]
or
def predictProbabilities(testData: Vector): Vector
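Both overloads return the posterior class probabilities in the same order as model.labels, so for a single test point something like model.predictProbabilities(testPoint.features) should give the per-class probabilities directly, without reimplementing the theta/pi arithmetic by hand.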