Pyspark retrieve metrics (AUC ROC) from each submodel in CrossValidator - apache-spark

How do I return the individual AUC-ROC score for each fold/submodel when using CrossValidator?
The documentation indicates that collectSubModels=True should save all models rather than just the best or average, but after inspecting model.subModels I can't find how to print their scores.
The example below works; it is only missing something like model.subModels.aucScore.
The desired result would be each fold with its corresponding score, like [fold1:0.85, fold2:0.07, fold3:0.55]
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
#Creating test dataframe
training = spark.createDataFrame([
(1,0,1),
(1,0,0),
(0,1,1),
(0,1,0)], ["label", "feature1", "feature2"])
#Vectorizing features for modelling
assembler = VectorAssembler(inputCols=['feature1','feature2'],outputCol="features")
prepped = assembler.transform(training).select('label','features')
#setting variables and configuring CrossValidator
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
params = ParamGridBuilder().build()
evaluator = BinaryClassificationEvaluator()
folds = 3
cv = CrossValidator(estimator=rf,
estimatorParamMaps=params,
evaluator=evaluator,
numFolds=folds,
collectSubModels=True
)
#Fitting model
model = cv.fit(prepped)
#Print Metrics
print(model)
print()
print(model.avgMetrics)
print()
print(model.subModels)
Output:
CrossValidatorModel_3a5c95c6d8d2
()
[0.8333333333333333]
()
[[RandomForestClassificationModel (uid=RandomForestClassifier_95da3a68af93) with 20 trees], [RandomForestClassificationModel (uid=RandomForestClassifier_95da3a68af93) with 20 trees], [RandomForestClassificationModel (uid=RandomForestClassifier_95da3a68af93) with 20 trees]]
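No answer is recorded for this question. A minimal sketch of one approach, assuming (as the printed output suggests) that model.subModels is a nested list indexed as subModels[fold][paramMapIndex] and that each entry is a fitted model with a transform method. The helper name score_submodels is hypothetical. It relies only on evaluator.evaluate(dataset) and fitted.transform(dataset), which Spark ML evaluators and models provide.

```python
def score_submodels(sub_models, evaluator, dataset):
    """Return {"fold1": [score per param map], ...} for a nested list of fitted models."""
    scores = {}
    for fold_index, fold_models in enumerate(sub_models, start=1):
        # One score per param map that was tried in this fold
        scores["fold%d" % fold_index] = [
            evaluator.evaluate(fitted.transform(dataset)) for fitted in fold_models
        ]
    return scores
```

With the objects from the question this would be called as score_submodels(model.subModels, evaluator, prepped). As far as I can tell, the fitted CrossValidatorModel does not keep the per-fold validation splits, so scoring on the full dataset includes rows each submodel was trained on. The numbers will therefore not match the held-out scores behind avgMetrics; to reproduce those you would need to recreate the folds yourself.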

Related

Issues with One Hot Encoding for model with values not in training data

I would like to use one-hot encoding for my simple model, but it triggers an error no matter how I set it up. First, OneHotEncoder is not converting strings to floats, even though I have version 1.0.2 of sklearn. Second, the categories in my training data differ from those in my test data: training has only 2 values, while testing has all three. How do I fix that? The exact error is "the truth value of a series is ambiguous"; another attempt produced an error telling me to reshape the data.
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [[ 'apple',5],['banana',1],['apple',6],['banana',2]]
X=pd.DataFrame(X).to_numpy()
test = [[ 'pineapple',0],['banana',1],['apple',7],['banana,2']]
y = [1,0,1,0]
y=pd.DataFrame(y).to_numpy()
labels = ['apples','bananas','pineapple']
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(
transformers=[('ohc', ohc, [0])]
,remainder = 'passthrough')
model=lgbm.LGBMClassifier()
mymodel = Pipeline(steps = [('preprocessor', pp),
('model', model)
])
params = {'model__learning_rate':[0.1]
,'model__n_estimators':[2]}
lgbm_gs=GridSearchCV(
estimator = mymodel, param_grid=params, n_jobs = -1,
cv=2, scoring='accuracy'
,verbose=-1)
lgbm_gs.fit(X,y)
The issue should be related to the fact that you're passing categories as a list rather than as a list of array-like (eg a list of list(s)) as the doc states. Therefore, the following adjustment should fix it.
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [['apple',5],['banana',1],['apple',6],['banana',2]]
X = pd.DataFrame(X).to_numpy()
test = [['pineapple',0],['banana',1],['apple',7],['banana',2]]
y = [1,0,1,0]
y = pd.DataFrame(y).to_numpy()
labels = [['apple', 'banana', 'pineapple']] # note that you were also misspelling the categories ('apples' --> 'apple'; 'bananas' --> 'banana')
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(transformers=[('ohc', ohc, [0])], remainder='passthrough')
model=lgbm.LGBMClassifier()
mymodel = Pipeline(steps = [('preprocessor', pp),
('model', model)])
params = {'model__learning_rate':[0.1], 'model__n_estimators':[2]}
lgbm_gs=GridSearchCV(
estimator = mymodel, param_grid=params, n_jobs = -1,
cv=2, scoring='accuracy', verbose=-1)
lgbm_gs.fit(X, y.ravel())
As a further remark, observe what the guide suggests when dealing with cases where test data has categories that cannot be found in the training set.
If there is a possibility that the training data might have missing categorical features, it can often be better to specify handle_unknown='ignore' instead of setting the categories manually as above. When handle_unknown='ignore' is specified and unknown categories are encountered during transform, no error will be raised but the resulting one-hot encoded columns for this feature will be all zeros (handle_unknown='ignore' is only supported for one-hot encoding):
Finally, note that the attribute categories_ (which holds the categories of each feature determined during fitting) is also a list of arrays (a single array here, since you are one-hot encoding only one column). Example with categories='auto':
ohc = OneHotEncoder(handle_unknown='ignore')
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana'], dtype=object)]
Example with your custom categories:
ohc = OneHotEncoder(categories=labels)
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana', 'pineapple'], dtype=object)]
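To make the "all zeros" behaviour of handle_unknown='ignore' concrete, here is a plain-Python illustration of the encoding rule for a single column. It is a sketch of the rule only, not scikit-learn's implementation, and the helper one_hot is hypothetical.

```python
def one_hot(value, categories):
    """Encode one value against a fixed category list; unknown values give all zeros."""
    return [1.0 if value == category else 0.0 for category in categories]

fitted_categories = ["apple", "banana"]         # categories seen during fitting
print(one_hot("banana", fitted_categories))     # [0.0, 1.0]
print(one_hot("pineapple", fitted_categories))  # [0.0, 0.0] (unknown category)
```

With categories set manually to include 'pineapple', the same value would instead get its own column.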

Multiple Evaluators in CrossValidator - Spark ML

Is it possible to have more than 1 evaluator in a CrossValidator to get R2 and RMSE at the same time?
Instead of having two different CrossValidator:
val lr_evaluator_rmse = new RegressionEvaluator()
.setLabelCol("ArrDelay")
.setPredictionCol("predictionLR")
.setMetricName("rmse")
val lr_evaluator_r2 = new RegressionEvaluator()
.setLabelCol("ArrDelay")
.setPredictionCol("predictionLR")
.setMetricName("r2")
val lr_cv_rmse = new CrossValidator()
.setEstimator(lr_pipeline)
.setEvaluator(lr_evaluator_rmse)
.setEstimatorParamMaps(lr_paramGrid)
.setNumFolds(3)
.setParallelism(3)
val lr_cv_r2 = new CrossValidator()
.setEstimator(lr_pipeline)
.setEvaluator(lr_evaluator_r2)
.setEstimatorParamMaps(lr_paramGrid)
.setNumFolds(3)
.setParallelism(3)
Something like this:
val lr_cv = new CrossValidator()
.setEstimator(lr_pipeline)
.setEvaluator(lr_evaluator_rmse)
.setEvaluator(lr_evaluator_r2)
.setEstimatorParamMaps(lr_paramGrid)
.setNumFolds(3)
.setParallelism(3)
Thanks in advance
The PySpark documentation on CrossValidator indicates that the evaluator argument takes a single evaluator: evaluator: Optional[pyspark.ml.evaluation.Evaluator] = None
The solution I went with was to create a separate pipeline for each evaluator. For example,
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Convert inputs to vector assembler
vec_assembler = VectorAssembler(inputCols=[inputs], outputCol="features")
# Create Random Forest Classifier pipeline
rf = RandomForestClassifier(labelCol="label", seed=42)
multiclass_evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", metricName="accuracy")
binary_evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label")
# CrossValidator needs estimatorParamMaps to be set before fit; an empty grid yields one default param map
param_grid = ParamGridBuilder().build()
# Plop model objects into cross validator
cv1 = CrossValidator(estimator=rf, estimatorParamMaps=param_grid, evaluator=multiclass_evaluator, numFolds=3, parallelism=4, seed=42)
cv2 = CrossValidator(estimator=rf, estimatorParamMaps=param_grid, evaluator=binary_evaluator, numFolds=3, parallelism=4, seed=42)
# Put all steps in a pipeline
pipeline1 = Pipeline(stages=[vec_assembler, cv1])
pipeline2 = Pipeline(stages=[vec_assembler, cv2])
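An alternative that avoids fitting twice, sketched under the assumption that only one metric needs to drive model selection: let CrossValidator select with a single evaluator, then score the resulting predictions with as many evaluators as you like. The helper evaluate_all is hypothetical; it relies only on each evaluator having an evaluate(dataset) method, which Spark ML evaluators provide.

```python
def evaluate_all(predictions, evaluators):
    """Score one predictions DataFrame with several named evaluators."""
    return {name: ev.evaluate(predictions) for name, ev in evaluators.items()}
```

A call might look like evaluate_all(cv_model.transform(test_df), {"accuracy": multiclass_evaluator, "auc": binary_evaluator}), where cv_model and test_df are placeholders for your fitted CrossValidatorModel and a held-out DataFrame. These are scores on whatever data you pass in, not cross-validated averages.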

Confusion Matrix to get precision, recall, f1score

I have a dataframe df on which I have run a decision tree classification algorithm. The two columns used by the algorithm are label and features, and the model is called dtc. How can I create a confusion matrix in pyspark?
dtc = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label')
dtcModel = dtc.fit(train)
predictions = dtcModel.transform(test)
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.evaluation import MulticlassMetrics
preds = df.select(['label', 'features']) \
.df.map(lambda line: (line[1], line[0]))
metrics = MulticlassMetrics(preds)
# Confusion Matrix
print(metrics.confusionMatrix().toArray())
You need to convert the DataFrame to an RDD and map each row to a (prediction, label) tuple before constructing MulticlassMetrics and calling metrics.confusionMatrix().toArray().
From the official documentation,
class pyspark.mllib.evaluation.MulticlassMetrics(predictionAndLabels)[source]
Evaluator for multiclass classification.
Parameters: predictionAndLabels – an RDD of (prediction, label) pairs.
Here is an example to guide you.
ML part
import pyspark.sql.functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.types import FloatType
#Note the differences between ml and mllib, they are two different libraries.
#create a sample data frame
data = [(1.54,3.45,2.56,0),(9.39,8.31,1.34,0),(1.25,3.31,9.87,1),(9.35,5.67,2.49,2),\
(1.23,4.67,8.91,1),(3.56,9.08,7.45,2),(6.43,2.23,1.19,1),(7.89,5.32,9.08,2)]
cols = ('a','b','c','d')
df = spark.createDataFrame(data, cols)
assembler = VectorAssembler(inputCols=['a','b','c'], outputCol='features')
df_features = assembler.transform(df)
#df.show()
train_data, test_data = df_features.randomSplit([0.6,0.4])
dtc = DecisionTreeClassifier(featuresCol='features',labelCol='d')
dtcModel = dtc.fit(train_data)
predictions = dtcModel.transform(test_data)
Evaluation part
#important: need to cast to float type, and order by prediction, else it won't work
preds_and_labels = predictions.select(['prediction','d']).withColumn('label', F.col('d').cast(FloatType())).orderBy('prediction')
#select only prediction and label columns
preds_and_labels = preds_and_labels.select(['prediction','label'])
metrics = MulticlassMetrics(preds_and_labels.rdd.map(tuple))
print(metrics.confusionMatrix().toArray())
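The question title also asks for precision, recall and F1. A sketch in plain Python, assuming the matrix has actual labels in rows and predicted labels in columns (which, to my knowledge, is the layout MulticlassMetrics.confusionMatrix() uses). The helper name per_class_scores and the example matrix are hypothetical.

```python
def per_class_scores(matrix):
    """Return [(precision, recall, f1), ...], one tuple per class of a square confusion matrix."""
    n = len(matrix)
    scores = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(n))  # column sum: predicted as k
        actual_k = sum(matrix[k])                               # row sum: actually k
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores

example = [[2, 1],
           [0, 3]]
print(per_class_scores(example))
```

The array from the answer above can be passed as metrics.confusionMatrix().toArray().tolist(). MulticlassMetrics also offers per-label precision(label), recall(label) and fMeasure(label) methods, which may be simpler if you already have the metrics object.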
Use this:
import sklearn
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'label', numTrees=500)
rfModel = rf.fit(train)
predictions_train = rfModel.transform(train)
y_true = predictions_train.select(['label']).collect()
y_pred = predictions_train.select(['prediction']).collect()
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_true, y_pred))
where train is your training data.

How to perform grid search for Random Forest using Apache Spark ML library

I want to perform grid search on my Random Forest Model in Apache Spark. But I am not able to find an example to do so. Is there any example on sample data where I can do hyper parameter tuning using Grid Search?
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)
pipeline = Pipeline(stages=[rf])
paramGrid = ParamGridBuilder().addGrid(rf.numTrees, [10, 30]).build()
crossval = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=BinaryClassificationEvaluator(),
numFolds=2)
cvModel = crossval.fit(training_df)
The hyperparameters and their candidate values are defined with the addGrid method.
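To illustrate what the grid contains: chaining several addGrid calls produces one parameter map per combination of values, and CrossValidator fits each combination once per fold. The plain-Python sketch below mimics that expansion; build_grid is a hypothetical helper, not part of Spark, and the parameter names are only examples.

```python
from itertools import product

def build_grid(grid):
    """Expand {"param": [values...]} into a list of {"param": value} combinations."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]

combos = build_grid({"numTrees": [10, 30], "maxDepth": [5, 10]})
print(len(combos))  # 4
```

The Spark equivalent would be ParamGridBuilder().addGrid(rf.numTrees, [10, 30]).addGrid(rf.maxDepth, [5, 10]).build(), which with numFolds=2 means 8 fits during the search.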

random forest with spark: get predicted values and R²

I am using Spark's MLlib to fit a random forest regression.
I am using the Python code here:
https://spark.apache.org/docs/1.2.0/mllib-ensembles.html#tab_python_1
It works, but now I would like to get the predicted values as well as the R or R² of the prediction model.
How can I get them?
Here is how to load a CSV file into an RDD (Spark's data format):
# Imports
import csv
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
from collections import namedtuple
from operator import add, itemgetter
from pyspark import SparkConf, SparkContext
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
import shutil
import numpy
def parse(row):
    """
    Parses a row and returns a LabeledPoint.
    """
    row[0] = str(row[0])
    row[1] = float(row[1])
    row[2] = float(row[2])
    row[3] = float(row[3])
    row[4] = float(row[4])
    return LabeledPoint(row[4], row[:4])
def split(line):
    """
    Operator function for splitting a line with csv module
    """
    reader = csv.reader(StringIO(line), delimiter=';')
    return next(reader)
# Load the CSV file as an RDD on the Spark cluster
data = sc.textFile("datafile").map(split).map(parse)
Here is how to perform the random forest algorithm and how to get the predicted values:
def random_forest_regression(data):
    """
    Run the random forest (regression) algorithm on the data to perform the prediction
    """
    # Split the data into training and test sets (30% held out for testing)
    (trainingData, testData) = data.randomSplit([0.7, 0.3])
    model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={}, numTrees=100, featureSubsetStrategy="auto", impurity='variance', maxDepth=10, maxBins=32)
    # Increase the number of trees to get a better prediction
    # Evaluate model on TEST instances and compute test error
    predictions_test = model.predict(testData.map(lambda x: x.features))
    real_and_predicted_test = testData.map(lambda lp: lp.label).zip(predictions_test)
    # Get the list of real and predicted values FOR ALL THE POINTS
    predictions = model.predict(data.map(lambda x: x.features))
    real_and_predicted = data.map(lambda lp: lp.label).zip(predictions)
    real_and_predicted = real_and_predicted.collect()
    print("real and predicted values")
    for value in real_and_predicted:
        print(value)
    return model, real_and_predicted
To get the correlation coefficient (R value), I used numpy:
def compute_correlation_coefficient(real_and_predicted):
    """
    Compute and display the correlation coefficient from a list of real and predicted values
    """
    real_values = []
    predicted_values = []
    for real, predicted in real_and_predicted:
        real_values.append(real)
        predicted_values.append(predicted)
    print("correlation coefficient")
    print(numpy.corrcoef(real_values, predicted_values)[0, 1])
To get the R², square the correlation coefficient.
Voilà !
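One caveat worth knowing: the squared correlation coincides with the coefficient of determination for a least-squares linear fit, but for an arbitrary model's predictions the two can differ, because the correlation ignores any bias or scaling error in the predictions. A minimal sketch in plain Python that computes both from the same list of (real, predicted) pairs; the function names and the toy data are hypothetical.

```python
def r_squared(pairs):
    """Coefficient of determination, 1 - SS_res / SS_tot, from (real, predicted) pairs."""
    real = [float(r) for r, _ in pairs]
    mean_real = sum(real) / len(real)
    ss_res = sum((r - p) ** 2 for r, p in pairs)       # squared prediction errors
    ss_tot = sum((r - mean_real) ** 2 for r in real)   # spread of the real values
    return 1.0 - ss_res / ss_tot

def squared_correlation(pairs):
    """Square of the Pearson correlation between real and predicted values."""
    n = len(pairs)
    mean_r = sum(r for r, _ in pairs) / n
    mean_p = sum(p for _, p in pairs) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_p = sum((p - mean_p) ** 2 for _, p in pairs)
    return cov ** 2 / (var_r * var_p)

# Predictions that are perfectly correlated with the real values but twice too large
toy = [(1, 2), (2, 4), (3, 6)]
print(squared_correlation(toy))  # 1.0
print(r_squared(toy))            # -6.0
```

Either function can be fed the real_and_predicted list returned by random_forest_regression above. If the goal is to judge prediction quality on held-out data, the 1 - SS_res / SS_tot form is usually the one meant by R².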
