Spark AUC and PR-AUC not stable

When I use Spark's BinaryClassificationEvaluator, I find that with the same data and the same raw prediction and label columns, the evaluation result changes across multiple runs. This happens for both AUC and PR-AUC. Is this expected behavior?
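A minimal sketch of the setup being described, assuming a cached DataFrame named predictions with rawPrediction and label columns; re-evaluating the same cached data makes it easy to check whether repeated calls agree:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Cache so the input itself cannot change between evaluations.
predictions.cache()

val evaluator = new BinaryClassificationEvaluator()
  .setRawPredictionCol("rawPrediction")
  .setLabelCol("label")

val auc1 = evaluator.setMetricName("areaUnderROC").evaluate(predictions)
val auc2 = evaluator.setMetricName("areaUnderROC").evaluate(predictions)
val pr1  = evaluator.setMetricName("areaUnderPR").evaluate(predictions)
val pr2  = evaluator.setMetricName("areaUnderPR").evaluate(predictions)

println(s"AUC: $auc1 vs $auc2, PR-AUC: $pr1 vs $pr2")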

Related

Spark MLlib predict only if threshold greater than value

I have a multi-class classification problem (38 classes) and implemented a pipeline in Spark ML to solve it. This is how I generated my model.
val nb = new NaiveBayes()
.setLabelCol("id")
.setFeaturesCol("features")
.setThresholds(Array.fill(38)(1.25))
val pipeline = new Pipeline()
.setStages(Array(stages, assembler, nb))
val model = pipeline.fit(trainingSet)
I want my model to predict a class only if its confidence (probability) is greater than 0.8%.
I searched a lot in the Spark documentation to understand better what the thresholds parameter means, but the only relevant piece of information I've found is this one:
Thresholds in multi-class classification to adjust the probability of
predicting each class. Array must have length equal to the number of
classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original
probability of that class and t is the class's threshold.
This is why my thresholds are 1.25.
The problem is that no matter what value I insert for the thresholds, it seems they don't affect my predictions at all.
Do you know if there is a way to predict a class only when its confidence (probability) is greater than a specific threshold?
Another way would be to select only the predictions whose probability is greater than that threshold, but I expect this can be achieved using the framework.
Thanks.
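One observation: since all 38 thresholds are equal, dividing every probability by the same constant does not change which class has the largest p/t, so identical thresholds cannot affect the predictions, and thresholds never make the model abstain. Below is a minimal Scala sketch of the post-filtering approach mentioned above, keeping only rows whose most probable class clears a cutoff; the 0.8 cutoff and the testSet name are assumptions.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Probability of the winning class for each row.
val maxProb = udf((v: Vector) => v.toArray.max)

val predictions = model.transform(testSet) // NaiveBayes adds a "probability" column
val confident = predictions.filter(maxProb(col("probability")) >= 0.8)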

How to extract average metrics with Cross-Validation in PySpark

I'm trying to perform cross-validation over a Random Forest in Spark 1.6.0 and I'm finding it hard to obtain the evaluation metrics (precision, recall, f1...). I want the average of the metrics over all folds. Is it possible to obtain them with CrossValidator and MulticlassClassificationEvaluator?
I only found examples where the evaluation is performed later over an independent test dataset, using the best model from the cross-validation. I'm not planning to use a train and test set, but to use the whole dataframe (df) for the cross-validation, let it make the splits, and then take the average metrics.
paramGrid = ParamGridBuilder().build()
evaluator = MulticlassClassificationEvaluator()
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=5)
model = crossval.fit(df)
evaluator.evaluate(model.transform(df))
For now, I obtain the best model's metric with the last line above, evaluator.evaluate(model.transform(df)), but I'm not totally sure that I'm doing it correctly.
In Spark 2.x, it is possible to get the average metrics using model.avgMetrics. This returns an array of doubles with one entry per parameter map in the grid, each holding the evaluator's metric averaged over the folds.
For MulticlassClassificationEvaluator the metric defaults to f1; it can be switched to weightedPrecision, weightedRecall, or accuracy with the evaluator's setMetricName setter (as documented here).
If you also need to get the best model parameters chosen by the cross validator, please see my answer here.
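A minimal Scala sketch of that usage, assuming a pipeline estimator and a DataFrame df as in the question (the same avgMetrics attribute is what the Spark 2.x answer above refers to):
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val evaluator = new MulticlassClassificationEvaluator().setMetricName("f1")
val paramGrid = new ParamGridBuilder().build() // a single, empty param map

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(evaluator)
  .setNumFolds(5)

val cvModel = cv.fit(df)

// One value per param map; with the empty grid above there is exactly one:
// the f1 score averaged over the 5 folds.
cvModel.avgMetrics.zip(paramGrid).foreach {
  case (metric, params) => println(s"average f1 = $metric for $params")
}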

How can I compare KMeans model performance with GaussianMixture and LDA model performances in pyspark?

I am working on the iris dataset using the pyspark.ml.clustering library in order to understand the fundamentals of PySpark and to create a clustering template for myself.
My Spark version is 2.1.1 and I have Hadoop 2.7.
I know that KMeans and BisectingKMeans have computeCost() method which gives model performance based on the sum of squared distances between the input points and their corresponding cluster centers.
Is there a way to compare KMeans model performance with GaussianMixture and LDA model performances on the iris dataset in order to choose the best model type (KMeans, GaussianMixture or LDA)?
Short answer: no
Long answer:
You are trying to compare apples with oranges here: in Gaussian Mixtures & LDA models there is no concept of cluster center at all; hence, it is not strange that a function similar to computeCost() does not exist.
It is easy to see this, if you look at the actual output of a Gaussian Mixture model; adapting the example from the documentation:
from pyspark.ml.clustering import GaussianMixture
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([-0.1, -0.05]),),
        (Vectors.dense([-0.01, -0.1]),),
        (Vectors.dense([0.9, 0.8]),),
        (Vectors.dense([0.75, 0.935]),),
        (Vectors.dense([-0.83, -0.68]),),
        (Vectors.dense([-0.91, -0.76]),)]
df = spark.createDataFrame(data, ["features"])

gm = GaussianMixture(k=3, tol=0.0001, maxIter=10, seed=10)  # here we ask for k=3 gaussians
model = gm.fit(df)
transformed_df = model.transform(df)  # assign data to gaussian components ("clusters")
transformed_df.collect()
# Here's the output:
[Row(features=DenseVector([-0.1, -0.05]), prediction=1, probability=DenseVector([0.0, 1.0, 0.0])),
Row(features=DenseVector([-0.01, -0.1]), prediction=2, probability=DenseVector([0.0, 0.0007, 0.9993])),
Row(features=DenseVector([0.9, 0.8]), prediction=0, probability=DenseVector([1.0, 0.0, 0.0])),
Row(features=DenseVector([0.75, 0.935]), prediction=0, probability=DenseVector([1.0, 0.0, 0.0])),
Row(features=DenseVector([-0.83, -0.68]), prediction=1, probability=DenseVector([0.0, 1.0, 0.0])),
Row(features=DenseVector([-0.91, -0.76]), prediction=2, probability=DenseVector([0.0, 0.0006, 0.9994]))]
The actual output of a Gaussian Mixture "clustering" is the third field above, i.e. the probability column: it is a 3-dimensional vector (because we asked for k=3) showing the "degree" to which each data point belongs to each of the 3 "clusters". In general, the vector components will be less than 1.0, and that is why Gaussian Mixtures are a classic example of "soft clustering" (data points can belong to more than one cluster, each to some degree). Some implementations (including the one in Spark) go a step further and assign a "hard" cluster membership (the prediction field above) by simply taking the index of the maximum component in probability, but that is merely an add-on.
What about the output of the model itself?
model.gaussiansDF.show()
+--------------------+--------------------+
| mean| cov|
+--------------------+--------------------+
|[0.82500000000150...|0.005625000000006...|
|[-0.4649980711427...|0.133224999996279...|
|[-0.4600024262536...|0.202493122264028...|
+--------------------+--------------------+
Again, it is easy to see that there are no cluster centers, only the parameters (mean and covariance) of our k=3 gaussians.
Similar arguments hold for the LDA case (not shown here).
It is true that the Spark MLlib Clustering Guide claims that the prediction column includes the "Predicted cluster center", but this term is quite unfortunate, to put it mildly (to put it bluntly, it is plain wrong).
Needless to say, the above discussion comes directly from the core concepts & theory behind Gaussian Mixture Models, and it is not specific to the Spark implementation...
Functions like computeCost() are there merely to help you evaluate different realizations of K-Means (due to different initializations and/or random seeds), as the algorithm may converge to a non-optimal local minimum.
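To make that last point concrete, here is a minimal Scala sketch (PySpark's KMeansModel offers the same computeCost method) that fits several K-Means models differing only in their seed and keeps the one with the lowest within-set sum of squared errors; the DataFrame df with a "features" column and k=3 are assumptions:
import org.apache.spark.ml.clustering.KMeans

val costs = Seq(1L, 7L, 42L).map { seed =>
  val model = new KMeans().setK(3).setSeed(seed).fit(df)
  (seed, model, model.computeCost(df)) // WSSSE for this particular run
}

costs.foreach { case (seed, _, cost) => println(s"seed=$seed cost=$cost") }
val (bestSeed, bestModel, _) = costs.minBy(_._3) // keep the lowest-cost run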

Spark ML pipeline: handle unseen labels

To handle new and unseen labels in a Spark ML pipeline I want to use most-frequent imputation.
Suppose the pipeline consists of the following steps:
1. preprocessing
2. learn the most frequent item
3. StringIndexer for each categorical column
4. VectorAssembler
5. estimator, e.g. random forest
Assuming (1), (2, 3) and (4, 5) constitute separate pipelines, I can fit and transform (1) for the train and test data. This means all NaN values are handled, i.e. imputed. (2, 3) will fit nicely, as will (4, 5).
Then I can use the following
val fittedLabels = pipeline23.stages collect { case a: StringIndexerModel => a }

val result = categoricalColumns.zipWithIndex.foldLeft(validationData) {
  (currentDF, colName) =>
    currentDF
      .withColumn(colName._1,
        when(currentDF(colName._1) isin (fittedLabels(colName._2).labels: _*), currentDF(colName._1))
          .otherwise(lit(null)))
}.drop("replace")
to replace new/unseen labels with null.
These deliberately introduced nulls are then imputed by the most-frequent imputer.
However, this setup is very ugly, as tools like CrossValidator no longer work (I can't supply a single pipeline).
How can I access the fitted labels within the pipeline to build a Transformer that lives inside the pipeline and sets new values to null?
Do you see a better approach to accomplish handling new values?
I assume most-frequent imputation is OK, i.e. for a dataset with around 90 columns only very few columns will contain an unseen label.
I finally realized that this functionality needs to reside in the pipeline to work properly, i.e. it requires an additional custom PipelineStage component.
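A hedged sketch of what such a stage could look like: a Transformer that nulls out any value of a column that is not among the labels seen at fit time (for example taken from a fitted StringIndexerModel.labels). Parameter plumbing, multi-column handling and persistence are omitted.
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{lit, when}
import org.apache.spark.sql.types.StructType

// Replaces unseen labels in `column` with null so that a downstream
// most-frequent imputer can handle them.
class UnseenLabelsToNull(override val uid: String,
                         column: String,
                         seenLabels: Array[String]) extends Transformer {

  def this(column: String, seenLabels: Array[String]) =
    this(Identifiable.randomUID("unseenToNull"), column, seenLabels)

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn(
      column,
      when(dataset(column).isin(seenLabels: _*), dataset(column))
        .otherwise(lit(null)))

  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): UnseenLabelsToNull =
    new UnseenLabelsToNull(uid, column, seenLabels)
}
In a single end-to-end pipeline you would more likely wrap this logic in an Estimator whose fit() learns the seen labels itself; note also that later Spark versions added handleInvalid = "keep" to StringIndexer, which maps unseen labels to an extra index rather than to null.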

How to Adjust Classification Threshold with a Spark Decision Tree

I'm using Spark 2.0 and the new spark.ml packages.
Is there a way to adjust the classification threshold so that I reduce the number of false positives?
If it matters I'm also using the CrossValidator.
I see that RandomForestClassifier and DecisionTreeClassifier both output a probability column (which I could use manually), but GBTClassifier does not.
It sounds like you might be looking for the thresholds parameter:
final val thresholds: DoubleArrayParam
Param for Thresholds in multi-class classification to adjust the probability
of predicting each class. Array must have length equal to the number of
classes, with values >= 0. The class with largest value p/t is predicted,
where p is the original probability of that class and t is the class'
threshold.
You will need to set it by calling setThresholds(value: Array[Double]) on your classifier.
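A short, hedged illustration for the binary case (shown on RandomForestClassifier; DecisionTreeClassifier takes the same setter, and the values below are illustrative, not tuned): raising the positive class's threshold makes the model more reluctant to predict it, trading false positives for false negatives.
import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  // Class 1 is predicted only when p1/0.6 > p0/0.4, i.e. when p1 > 0.6,
  // which reduces false positives for class 1 at the cost of recall.
  .setThresholds(Array(0.4, 0.6))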
