Confusion Matrix to get precsion,recall, f1score - python-3.x

I have a dataframe df. I have performed decisionTree classification algorithm on the dataframe. The two columns are label and features when algorithm is performed. The model is called dtc. How can I create a confusion matrix in pyspark?
dtc = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label')
dtcModel = dtc.fit(train)
predictions = dtcModel.transform(test)
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.evaluation import MulticlassMetrics
preds = df.select(['label', 'features']) \
.df.map(lambda line: (line[1], line[0]))
metrics = MulticlassMetrics(preds)
# Confusion Matrix
print(metrics.confusionMatrix().toArray())```

You need to cast to an rdd and map to tuple before calling metrics.confusionMatrix().toArray().
From the official documentation,
class pyspark.mllib.evaluation.MulticlassMetrics(predictionAndLabels)[source]
Evaluator for multiclass classification.
Parameters: predictionAndLabels – an RDD of (prediction, label) pairs.
Here is an example to guide you.
ML part
import pyspark.sql.functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.types import FloatType
#Note the differences between ml and mllib, they are two different libraries.
#create a sample data frame
data = [(1.54,3.45,2.56,0),(9.39,8.31,1.34,0),(1.25,3.31,9.87,1),(9.35,5.67,2.49,2),\
(1.23,4.67,8.91,1),(3.56,9.08,7.45,2),(6.43,2.23,1.19,1),(7.89,5.32,9.08,2)]
cols = ('a','b','c','d')
df = spark.createDataFrame(data, cols)
assembler = VectorAssembler(inputCols=['a','b','c'], outputCol='features')
df_features = assembler.transform(df)
#df.show()
train_data, test_data = df_features.randomSplit([0.6,0.4])
dtc = DecisionTreeClassifier(featuresCol='features',labelCol='d')
dtcModel = dtc.fit(train_data)
predictions = dtcModel.transform(test_data)
Evaluation part
#important: need to cast to float type, and order by prediction, else it won't work
preds_and_labels = predictions.select(['predictions','d']).withColumn('label', F.col('d').cast(FloatType())).orderBy('prediction')
#select only prediction and label columns
preds_and_labels = preds_and_labels.select(['prediction','label'])
metrics = MulticlassMetrics(preds_and_labels.rdd.map(tuple))
print(metrics.confusionMatrix().toArray())

Use this:
import sklearn
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'label', numTrees=500)
rfModel = rf.fit(train)
predictions_train = rfModel.transform(train)
y_true = predictions_train.select(['label']).collect()
y_pred = predictions_train.select(['prediction']).collect()
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_true, y_pred))
where train is your training data.

Related

Issues with One Hot Encoding for model with values not in training data

I would like to use One Hot Encoding for my simple model. Yet it seems to trigger an error no matter how I set it up. First, One Hot Encoding is not converting string to float even though I have version 1.0.2 of sklearn. Now the issue is because the values in my training data are not the same length as in test data. Training only has 2 values, testing has all three. How do I fix that? The exact error is the truth value of a series is ambiguous. The error with this other idea is to reshape the data.
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [[ 'apple',5],['banana',1],['apple',6],['banana',2]]
X=pd.DataFrame(X).to_numpy()
test = [[ 'pineapple',0],['banana',1],['apple',7],['banana,2']]
y = [1,0,1,0]
y=pd.DataFrame(y).to_numpy()
labels = ['apples','bananas','pineapple']
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(
transformers=[('ohc', ohc, [0])]
,remainder = 'passthrough')
model=lgbm.LGBMClassifier()
mymodel = Pipeline(steps = [('preprocessor', pp),
('model', model)
])
params = {'model__learning_rate':[0.1]
,'model__n_estimators':[2]}
lgbm_gs=GridSearchCV(
estimator = mymodel, param_grid=params, n_jobs = -1,
cv=2, scoring='accuracy'
,verbose=-1)
lgbm_gs.fit(X,y)
The issue should be related to the fact that you're passing categories as a list rather than as a list of array-like (eg a list of list(s)) as the doc states. Therefore, the following adjustment should fix it.
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [['apple',5],['banana',1],['apple',6],['banana',2]]
X = pd.DataFrame(X).to_numpy()
test = [['pineapple',0],['banana',1],['apple',7],['banana',2]]
y = [1,0,1,0]
y = pd.DataFrame(y).to_numpy()
labels = [['apple', 'banana', 'pineapple']] # observe you were also mispelling categories ('apples' --> 'apple'; 'bananas' --> 'banana')
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(transformers=[('ohc', ohc, [0])], remainder='passthrough')
model=lgbm.LGBMClassifier()
mymodel = Pipeline(steps = [('preprocessor', pp),
('model', model)])
params = {'model__learning_rate':[0.1], 'model__n_estimators':[2]}
lgbm_gs=GridSearchCV(
estimator = mymodel, param_grid=params, n_jobs = -1,
cv=2, scoring='accuracy', verbose=-1)
lgbm_gs.fit(X, y.ravel())
As a further remark, observe what the guide suggests when dealing with cases where test data has categories that cannot be found in the training set.
If there is a possibility that the training data might have missing categorical features, it can often be better to specify handle_unknown='ignore' instead of setting the categories manually as above. When handle_unknown='ignore' is specified and unknown categories are encountered during transform, no error will be raised but the resulting one-hot encoded columns for this feature will be all zeros (handle_unknown='ignore' is only supported for one-hot encoding):
Eventually, you can observe that the attribute categories_ (which specifies the categories of each feature determined during fitting) is a list of array(s) (single array here as you're one-hot-encoding one column only), too. Example with categories='auto':
ohc = OneHotEncoder(handle_unknown='ignore')
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana'], dtype=object)]
Example with your custom categories:
ohc = OneHotEncoder(categories=labels)
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana', 'pineapple'], dtype=object)]

Multiple Evaluators in CrossValidator - Spark ML

Is it possible to have more than 1 evaluator in a CrossValidator to get R2 and RMSE at the same time?
Instead of having two different CrossValidator:
val lr_evaluator_rmse = new RegressionEvaluator()
.setLabelCol("ArrDelay")
.setPredictionCol("predictionLR")
.setMetricName("rmse")
val lr_evaluator_r2 = new RegressionEvaluator()
.setLabelCol("ArrDelay")
.setPredictionCol("predictionLR")
.setMetricName("r2")
val lr_cv_rmse = new CrossValidator()
.setEstimator(lr_pipeline)
.setEvaluator(lr_evaluator_rmse)
.setEstimatorParamMaps(lr_paramGrid)
.setNumFolds(3)
.setParallelism(3)
val lr_cv_r2 = new CrossValidator()
.setEstimator(lr_pipeline)
.setEvaluator(lr_evaluator_rmse)
.setEstimatorParamMaps(lr_paramGrid)
.setNumFolds(3)
.setParallelism(3)
Something like this:
val lr_cv = new CrossValidator()
.setEstimator(lr_pipeline)
.setEvaluator(lr_evaluator_rmse)
.setEvaluator(lr_evaluator_r2)
.setEstimatorParamMaps(lr_paramGrid)
.setNumFolds(3)
.setParallelism(3)
Thanks in advance
The PySpark documentation on CrossValidator indicates that the evaluator argument is a single entity --> evaluator: Optional[pyspark.ml.evaluation.Evaluator] = None
The solution I went with was to create separate pipelines for each evaluator. For example,
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator
# Convert inputs to vector assembler
vec_assembler = VectorAssembler(inputCols=[inputs], outputCol="features")
# Create Random Forest Classifier pipeline
rf = RandomForestClassifier(labelCol="label", seed=42)
multiclass_evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", metricName="accuracy")
binary_evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label")
# Plop model objects into cross validator
cv1 = CrossValidator(estimator=rf, evaluator=multiclass_evaluator, numFolds=3, parallelism=4, seed=42)
cv2 = CrossValidator(estimator=rf, evaluator=binary_evaluator, numFolds=3, parallelism=4, seed=42)
# Put all step in a pipeline
pipeline1 = Pipeline(stages=[vec_assembler, cv1])
pipeline2 = Pipeline(stages=[vec_assembler, cv2])

How to visualize TF-IDF terms?

I have a few thousands of rows of textual data. My sample data is:
I have used sklearn CountVectorizer and TfidfTransformer I calculated top terms with tfidf weights. Below is the code which I used for this:
import pandas as pd
import numpy as np
from itertools import islice
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
data = pd.read_csv('Sample_data.csv')
cvec = CountVectorizer(stop_words='english', min_df=5, max_df=0.95, ngram_range=(1,2))
cvec.fit(data['Text'])
list(islice(cvec.vocabulary_.items(), 30))
len(cvec.vocabulary_)
cvec_count = cvec.transform(data['Text'])
print('Sparse Matrix Shape : ', cvec_count.shape)
print('Non Zero Count : ', cvec_count.nnz)
print('sparsity: %.2f%%' % (100.0 * cvec_count.nnz / (cvec_count.shape[0] * cvec_count.shape[1])))
occ = np.asarray(cvec_count.sum(axis=0)).ravel().tolist()
count_df = pd.DataFrame({'term': cvec.get_feature_names(), 'occurrences' : occ})
term_freq = count_df.sort_values(by='occurrences', ascending=False).head(30)
print(term_freq)
transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(cvec_count)
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
weight_df = pd.DataFrame({'term' : cvec.get_feature_names(), 'weight' : weights})
tf_idf = weight_df.sort_values(by='weight', ascending=False).head(30)
print(tf_idf)
Now I want to plot (bar or line graph) the top 30 terms with their weights using matplotlib. How can I do this?
Thanks in Advance!

Pyspark retrieve metrics (AUC ROC) from each submodel in CrossValidator

How do I return the individual auc-roc score for each fold/submodel when using crossValidator.
The documentation indicates that collectSubModels=True should save all models rather than just the best or average, but after inspecting model.subModels I can't find how to print them.
The below example works just missing the model.subModels.aucScore
Desired Result would be each fold with their corresponding score like [fold1:0.85, fold2:0.07, fold3:0.55]
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
#Creating test dataframe
training = spark.createDataFrame([
(1,0,1),
(1,0,0),
(0,1,1),
(0,1,0)], ["label", "feature1", "feature2"])
#Vectorizing features for modelling
assembler = VectorAssembler(inputCols=['feature1','feature2'],outputCol="features")
prepped = assembler.transform(training).select('label','features')
#setting variables and configuring CrossValidator
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
params = ParamGridBuilder().build()
evaluator = BinaryClassificationEvaluator()
folds = 3
cv = CrossValidator(estimator=rf,
estimatorParamMaps=params,
evaluator=evaluator,
numFolds=folds,
collectSubModels=True
)
#Fitting model
model = cv.fit(prepped)
#Print Metrics
print(model)
print()
print(model.avgMetrics)
print()
print(model.subModels)
>>>>>Return:
>>>>>CrossValidatorModel_3a5c95c6d8d2
>>>>>()
>>>>>[0.8333333333333333]
>>>>>()
>>>>>[[RandomForestClassificationModel (uid=RandomForestClassifier_95da3a68af93) with 20 trees], >>>>>[RandomForestClassificationModel (uid=RandomForestClassifier_95da3a68af93) with 20 trees], >>>>>[RandomForestClassificationModel (uid=RandomForestClassifier_95da3a68af93) with 20 trees]]

Why am I getting the same value for the accuracy and recall when using spark's mllib's MulticlassClassificationEvaluator?

So, I am having a play around with some tree based algorithms from Spark's mllib. The code I have is here;
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from pyspark.ml import Pipeline
from pyspark.ml.classification import (RandomForestClassifier, GBTClassifier, DecisionTreeClassifier)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
conf = SparkConf()
conf.set('spark.logConf', 'true').set("spark.ui.port", "4060")
spark = SparkSession.builder.config(conf=conf).appName("Gradient Boosted Tree").getOrCreate()
data = spark.read.parquet('/mydata/location)
def yt_func(x):
if x <= 10:
yt = 0
else:
yt = 1
return yt
yt_udf = udf(yt_func, IntegerType())
data = data.withColumn('yt_1',yt_udf(data['count']))
datasub = data.select('feature1', 'feature2',
'feature3', 'feature4',
'feature5', 'feature6',
'feature7', 'feature8',
'feature9', 'feature10',
'feature11','feature12',
'feature13')
datasub = datasub.na.fill(0)
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols = ['feature1', 'feature2',
'feature3', 'feature4',
'feature5', 'feature6',
'feature7', 'feature8',
'feature9', 'feature10',
'feature11','feature12',
'feature13'], outputCol = 'features')
output = assembler.transform(datasub)
finaldata = output.select('features','yt_1')
train_data,test_data = finaldata.randomSplit([0.7,0.3])
finaldata.show(20)
dtc = DecisionTreeClassifier(featuresCol='features',labelCol='yt_1')
rfc = RandomForestClassifier(featuresCol='features',labelCol='yt_1', numTrees=70)
gbt = GBTClassifier(featuresCol='features',labelCol='yt_1')
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)
dtc_preds = dtc_model.transform(test_data)
rfc_preds = rfc_model.transform(test_data)
gbt_preds = gbt_model.transform(test_data)
dtc_preds.show()
rfc_preds.show()
gbt_preds.show()
accuracy_eval = MulticlassClassificationEvaluator(metricName = 'accuracy', labelCol='yt_1')
recall_eval = MulticlassClassificationEvaluator(metricName = 'weightedRecall', labelCol='yt_1')
print 'dtc accuracy:', accuracy_eval.evaluate(dtc_preds)
print 'dtc recall', recall_eval.evaluate(dtc_preds)
print 'rfc accuracy:', accuracy_eval.evaluate(rfc_preds)
print 'rfc recall', recall_eval.evaluate(rfc_preds)
print 'gbt accuracy:', accuracy_eval.evaluate(gbt_preds)
print 'gbt recall', recall_eval.evaluate(gbt_preds)
When I run this I get the following;
dtc accuracy: 0.98596755767033761
dtc recall: 0.98596755767033761
rfc accuracy: 0.98551077243825225
rfc recall: 0.98551077243825225
gbt accuracy: 0.98624595624862965
gbt recall: 0.98624595624862965
What is confusing me here is why I am getting the same values for the accuracy and the recall.... they are EXACTLY the same. Surely this isn't correct....??
Any ideas?
An answer to this question can be found where I posted the same question on the Data Science Stack Exchange
https://datascience.stackexchange.com/questions/40160/why-am-i-getting-the-same-value-for-the-accuracy-and-recall-when-using-sparks-m

Resources