How to extract average metrics with Cross-Validation in PySpark

I'm trying to perform cross-validation over a Random Forest in Spark 1.6.0 and I'm finding it hard to obtain the evaluation metrics (precision, recall, f1...). I want the average of the metrics over all folds. Is it possible to obtain them with CrossValidator and MulticlassClassificationEvaluator?
I only found examples where the evaluation is performed later over an independent test dataset, using the best model from the cross-validation. I'm not planning to use a separate train and test set, but to use the whole dataframe (df) for the cross-validation, let it make the splits, and then take the average metrics.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid = ParamGridBuilder().build()
evaluator = MulticlassClassificationEvaluator()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)
model = crossval.fit(df)
evaluator.evaluate(model.transform(df))
For now, I obtain the best model's metric with the last line of the above code, evaluator.evaluate(model.transform(df)), and I'm not totally sure that I'm doing it correctly.

In Spark 2.x, it is possible to get the average metrics using model.avgMetrics. This returns an array of doubles with one entry per ParamMap in the grid: the evaluator's metric averaged over all folds for that parameter combination.
For MulticlassClassificationEvaluator, the reported metric is f1 by default; weightedPrecision, weightedRecall and accuracy are the other options (as documented here), and the metric can be changed with the evaluator's setter.
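For instance, a minimal sketch (assuming the CrossValidator and evaluator are built exactly as in the question, and Spark 2.x or later):
# Optionally switch the metric before fitting; "f1" is the default.
evaluator.setMetricName("weightedRecall")

cvModel = crossval.fit(df)

# One value per ParamMap; with an empty grid there is a single entry,
# which is the chosen metric averaged over the 5 folds.
print(cvModel.avgMetrics[0])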
If you also need to get the best model parameters chosen by the cross-validator, see my answer here.

Related

When I run the spark sample for logistic regression, I get a perfect model. Did I screw up

I'm running the example code from the spark docs for logistic regression using pyspark and the attendant training summary code:
from pyspark.ml.classification import LogisticRegression
# Load training data
training = spark.read.format("libsvm").load("/user/tim/sample_svm/sample_libsvm_data.txt")
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the model
lrModel = lr.fit(training)
# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))
# We can also use the multinomial family for binary classification
mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial")
# Fit the model
mlrModel = mlr.fit(training)
# Print the coefficients and intercepts for logistic regression with multinomial family
print("Multinomial coefficients: " + str(mlrModel.coefficientMatrix))
print("Multinomial intercepts: " + str(mlrModel.interceptVector))
# Extract the summary from the returned LogisticRegressionModel instance trained
# in the earlier example
trainingSummary = lrModel.summary
# Obtain the objective per iteration
objectiveHistory = trainingSummary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)
# Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
trainingSummary.roc.show(500)
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
# Set the model threshold to maximize F-Measure
fMeasure = trainingSummary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']) \
    .select('threshold').head()['threshold']
lr.setThreshold(bestThreshold)
and get:
areaUnderROC: 1.0
which I wouldn't expect. Perhaps it overfit and simply memorized the data, but I've done train and test splits, even randomized labels, and tweaked all the hyperparameters, and they all led to the same thing: AUC=1.0. I tried the sample code for the SVC models, which uses the same dataset, and I get the same thing.
I'd normally post the code, but I literally ran the example code only changing the path to the data file. I've searched and searched and can find no example of anyone having run this example and examined the results. What's odd is that this dataset, sample_libsvm_data.txt, is used throughout the docs yet I can find neither analysis of it nor even an explanation of what the data actually is.
As a result I've switched to the RDD-based MLlib API, because I can't make sense of the results of the sample code. I hope someone can tell me what I'm doing wrong.
EDIT:
As requested, here's the entire datafile.
I've been using PySpark with DataFrames for quite some time, and I've had positive experiences using DataFrames over RDDs, as they resemble pandas DataFrames. So switching away from them, as you did, doesn't strike me as a good solution to your problem.
The hypothesis that the model "simply memorized the data" doesn't explain your problem either. Check it yourself: simply change the names of the objects and variables and you will see the same output.
Looking at the specific piece of code you posted, you say you got AUC_ROC=1.0. My first intuition is that this is because you are analysing the summary statistics of the training set, not a test set. You are most likely right that the model overfits the training set; however, with a test set I doubted the results would be the same. So I went and evaluated it myself:
result = lrModel.transform(training)
result.prediction  # Column reference only; the predictions are shown below
result.show()

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
AUC_ROC = evaluator.evaluate(result, {evaluator.metricName: "areaUnderROC"})
print('AUC ROC: ' + str(AUC_ROC))
Final result:
AUC ROC: 1.0
In other words, you are correct that the model overfits... but these are results for the training set, and the code is working correctly: the AUC really is 1.0 for the piece of code you provided.
Bottom line: it is a bad example to include in the Spark documentation. Try this code on other datasets.
Checking the API documentation, here is another example, but this time you get the expected results... sadly 1.0 again... a really bad choice of examples from Spark, I must admit: https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
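If you want a less degenerate check, one option (just a sketch, not part of the original answer; it reuses lr and training from the question, everything else is assumed) is to hold out part of the data with randomSplit and evaluate there instead of on the training set:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Hold out 20% of the data that the model never sees during fitting.
train, test = training.randomSplit([0.8, 0.2], seed=42)
heldOutModel = lr.fit(train)

evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
test_auc = evaluator.evaluate(heldOutModel.transform(test),
                              {evaluator.metricName: "areaUnderROC"})
print("Test areaUnderROC: " + str(test_auc))
On sample_libsvm_data.txt the held-out AUC may well still be close to 1.0, which would support the "bad example dataset" conclusion rather than point to a bug in the code.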

Adjust Intercept of Spark DataFrame API Logistic Regression Model

I'm training a logistic regression in Spark. However, due to specifics in my training data, I need to manually adjust the model afterwards, namely change the intercept.
That was easy to do with the RDD api - just instantiate a new LogisticRegressionModel:
val intercept = model.intercept + adjustment
val adjustedModel = new LogisticRegressionModel(model.weights, intercept)
However, the LogisticRegressionModel constructor in the DataFrame API was made private. How can I make manual adjustments to the model?
I had the same problem this afternoon. I was in test mode, trying to make it happen no matter what, so I don't care how dirty it is: get the coefficients from your model, get the intercept, adjust it, and then do your predictions by hand with the code Spark itself uses (look for BLAS.dot, margin and score). At some point Spark calls BLAS.dot, and BLAS is private to Spark, so do the same again: retrieve the code for dot, handle SparseVector/DenseVector, and you can make it work. Dirty, but it works.
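As a rough PySpark illustration of that idea (names such as adjustment and df are assumed, and the arithmetic mirrors Spark's internal margin/score computation rather than reusing its private BLAS code):
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Assumed: lrModel is a fitted binary pyspark.ml LogisticRegressionModel.
coefficients = np.array(lrModel.coefficients)
intercept = lrModel.intercept + adjustment  # `adjustment` is your manual correction

def adjusted_probability(features):
    # Dot product plus intercept, then the sigmoid, as Spark does internally.
    margin = float(np.dot(coefficients, features.toArray())) + intercept
    return float(1.0 / (1.0 + np.exp(-margin)))

# toArray() handles both SparseVector and DenseVector feature columns.
proba_udf = udf(adjusted_probability, DoubleType())
scored = df.withColumn("adjusted_probability", proba_udf("features"))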

Spark Cross Validation with Training, Testing and Validation sets

I want to do two cross-validation processes in Spark using random splits, like:
1. CV_global: split the data into a Training Set (90%) and a Testing Set (10%)
1.1. CV_grid: grid search on half of the Training Set, i.e. 45% of the data.
1.2. Fit Model: on the Training Set (90%) using the best settings from CV_grid.
1.3. Test Model: on the Testing Set (10%)
2. Report average metrics per 10 folds and the global metrics.
The problem is I only find examples using CV and Grid search on the whole training set.
How can I get the parameters of the best performing model from CV_grid?
How to do CV without grid search but get stats per fold? e.g.
sklearn.cross_validation.cross_val_score
You have things like
crossval.setEstimatorParamMaps(paramGrid)
and then
cvModel = crossval.fit(trainingSetDF).bestModel
For single models (at least for some of them) there are functions like explainParams(). It's available in Spark 1.6 (maybe it goes back to 1.4.2, I'm not sure).
Hope this helps
You have three questions in one. The answers to each:
1. The problem is I only find examples using CV and Grid search on the whole training set.
If you need just a portion of your training dataset, sample it at the wanted percentage, e.g.
training = training.sample(false, .45, 78L)
2. How can I get the parameters of the best performing model from CV_grid?
crossValidatedModel.bestModel().getParamMap()
From there, get the parameter names and then their values (see the PySpark sketch after this list).
3. How to do CV without grid search but get stats per fold? e.g.
duplicate of How can I access computed metrics for each fold in a CrossValidatorModel
Take a look here: Spark CrossValidatorModel access other models than the bestModel?
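For point 2, a small PySpark sketch (Spark 2.x or later; it assumes crossval wraps a Pipeline whose last stage is the classifier, and reuses the crossval and trainingSetDF names from the answer above, which may not match your setup exactly):
cvModel = crossval.fit(trainingSetDF)

# Parameters actually used by the winning model (e.g. the classifier stage).
bestPipelineModel = cvModel.bestModel
print(bestPipelineModel.stages[-1].extractParamMap())

# For point 3: one averaged metric per ParamMap, in grid order.
for params, metric in zip(crossval.getEstimatorParamMaps(), cvModel.avgMetrics):
    print(params, metric)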

How am I supposed to use RandomizedLogisticRegression in Scikit-learn?

I simply have failed to understand the documentation for this class.
I can fit data using it, and get the scores for features, but is this all that this class is supposed to do?
I can't see how I can use it to actually perform regression using the model that was fit. The example in the documentation above simply creates an instance of the class, so I can't see how that is supposed to help.
There are methods that perform a 'transform' operation, but no mention of what kind of transform it is.
So is it possible to use this class to get actual predictions on new test data, and is it possible to use it in cross-fold validation to compare performance with the other methods I'm using?
I've used the highest ranking features in other classifiers, but I'm not sure if more than that is possible with this classifier.
Update: I've found the use for fit_transform under the feature selection part of the documentation:
When the goal is to reduce the dimensionality of the data to use with another classifier, they expose a transform method to select the non-zero coefficient
Unless I get an answer that says I'm wrong, I'll assume that this classifier indeed does not do prediction. I'll wait before I answer my own question.
Randomized LR is supposed to be a feature selection method, not a classifier in and of itself. Its API matches that of a standard scikit-learn transformer:
from sklearn.linear_model import RandomizedLogisticRegression

randomlr = RandomizedLogisticRegression()
X_train = randomlr.fit_transform(X_train, y_train)
X_test = randomlr.transform(X_test)
Then fit a model to X_train and do classification on X_test as usual.
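For example, a minimal sketch (assuming an older scikit-learn release where RandomizedLogisticRegression still exists, LogisticRegression as an arbitrary downstream classifier, and X_train, y_train, X_test, X, y as your data):
from sklearn.linear_model import LogisticRegression, RandomizedLogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Classify on the reduced feature matrices produced above.
clf = LogisticRegression()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# To compare against other methods under cross-validation, wrap the
# selector and the classifier in a single pipeline.
pipe = make_pipeline(RandomizedLogisticRegression(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())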

advanced feature extraction for cross-validation using sklearn

Given a sample dataset with 1000 samples of data, suppose I would like to preprocess the data in order to obtain 10000 rows of data, so each original row of data leads to 10 new samples. In addition, when training my model I would like to be able to perform cross validation as well.
The scoring function I have uses the original data to compute the score so I would like cross validation scoring to work on the original data as well rather than the generated one. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to correctly split the data according to the original samples.
What I thought about doing:
Create a custom feature extractor to extract features to feed to the classifier.
Add the feature extractor to a pipeline and feed it to, say, GridSearchCV, for example.
Implement a custom scorer which operates on the original data to score the model given a set of selected parameters.
Is there a better method for what I am trying to accomplish?
I am asking this in connection with a competition going on right now on Kaggle.
Maybe you can use stratified cross-validation (e.g. Stratified K-Fold or Stratified Shuffle Split) on the expanded samples, using the original sample index as the stratification info, in combination with a custom score function that ignores the non-original samples in the model evaluation.
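A concrete way to realize that idea with scikit-learn's model_selection module is sketched below; note this uses GroupKFold with the original sample index as a group label (rather than literal stratification), which guarantees that rows generated from the same original sample never end up in both train and test. All names are assumed.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Assumed: X_expanded / y_expanded are the 10000 generated rows, and the
# 10 rows derived from each original sample are consecutive.
original_idx = np.repeat(np.arange(1000), 10)

clf = RandomForestClassifier()
scores = cross_val_score(clf, X_expanded, y_expanded,
                         groups=original_idx, cv=GroupKFold(n_splits=5))
print(scores.mean())
A custom scoring function can then aggregate the per-row predictions back to the original samples, as suggested above.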
