I'm using Spark 2.0.1 in Python, and my dataset is in a DataFrame, so I'm using the ML (not MLlib) library for machine learning.
I have a multilayer perceptron classifier and I have only two labels.
My question is: is it possible to get not only the labels, but also (or only) the probability for each label? Not just 0 or 1 for every input, but something like 0.95 for 0 and 0.05 for 1.
If this is not possible with MLP but is possible with another classifier, I can change the classifier. I have only used MLP because I know it should be capable of returning the probability, but I can't find how to get it in PySpark.
I have found a similar topic about this,
How to get classification probabilities from MultilayerPerceptronClassifier?
but it uses Java, and the solution suggested there doesn't work in Python.
Thanks
Indeed, as of version 2.0, MLP in Spark ML does not seem to provide classification probabilities; nevertheless, there are a number of other classifiers that do, e.g. Logistic Regression, Naive Bayes, Decision Tree, and Random Forest. Here is a short example with the first and the last one:
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
df = sqlContext.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(1.0, 0.0))],
    ["label", "features"])
df.show()
# +-----+---------+
# |label| features|
# +-----+---------+
# |  0.0|[0.0,1.0]|
# |  1.0|[1.0,0.0]|
# +-----+---------+
lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)
rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="label", seed=42)
rf_model = rf.fit(df)
# test data:
test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()
lr_result = lr_model.transform(test)
lr_result.show()
# +---------+--------------------+--------------------+----------+
# | features| rawPrediction| probability|prediction|
# +---------+--------------------+--------------------+----------+
# |[0.2,0.5]|[0.98941878916476...|[0.72897310704261...| 0.0|
# |[0.5,0.2]|[-0.9894187891647...|[0.27102689295738...| 1.0|
# +---------+--------------------+--------------------+----------+
rf_result = rf_model.transform(test)
rf_result.show()
# +---------+-------------+--------------------+----------+
# | features|rawPrediction| probability|prediction|
# +---------+-------------+--------------------+----------+
# |[0.2,0.5]| [1.0,2.0]|[0.33333333333333...| 1.0|
# |[0.5,0.2]| [1.0,2.0]|[0.33333333333333...| 1.0|
# +---------+-------------+--------------------+----------+
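If you then need the probability of a specific class as a plain double (e.g. for doing your own thresholding or sorting), you can pull it out of the probability vector with a small UDF; this is just a sketch, and the extract_p0 helper below is my own name, not part of the Spark API:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# hypothetical helper: extract the probability of class 0 from the vector column
extract_p0 = udf(lambda v: float(v[0]), DoubleType())
lr_result.select("features", extract_p0("probability").alias("p0")).show()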
For MLlib, see my answer here; for several undocumented & counter-intuitive features of PySpark classification, see my relevant blog post.
How will a method in Spark treat a VectorAssembler column? For example, if I have longitude and latitude columns, is it better to assemble them using a VectorAssembler and then put the result into my model, or does it make no difference if I just put them in directly (separately)?
Example1:
loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')
vector_assembler = VectorAssembler(inputCols=['loc', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[loc_assembler, vector_assembler, lr])
Example2:
vector_assembler = VectorAssembler(inputCols=['long', 'lat', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[vector_assembler, lr])
What is the difference? Which one is better?
There will not be any difference simply because, in both your examples, the final form of the features column will be the same, i.e. in your 1st example, the loc vector will be broken back into its individual components.
Here is a short demonstration with dummy data (leaving the linear regression part aside, as it is unnecessary for this discussion):
spark.version
# u'2.3.1'
# dummy data:
df = spark.createDataFrame([[0, 33.3, -17.5, 10., 0.2],
                            [1, 40.4, -20.5, 12., 2.2],
                            [2, 28., -23.9, -2., -1.7],
                            [3, 29.5, -19.0, -0.5, -0.2],
                            [4, 32.8, -18.84, 1.5, 1.8]],
                           ["id", "lat", "long", "other", "label"])
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.pipeline import Pipeline
loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')
vector_assembler = VectorAssembler(inputCols=['loc', 'other'], outputCol='features')
pipeline = Pipeline(stages=[loc_assembler, vector_assembler])
model = pipeline.fit(df)
model.transform(df).show()
The result is:
+---+----+------+-----+-----+-------------+-----------------+
| id| lat| long|other|label| loc| features|
+---+----+------+-----+-----+-------------+-----------------+
| 0|33.3| -17.5| 10.0| 0.2| [-17.5,33.3]|[-17.5,33.3,10.0]|
| 1|40.4| -20.5| 12.0| 2.2| [-20.5,40.4]|[-20.5,40.4,12.0]|
| 2|28.0| -23.9| -2.0| -1.7| [-23.9,28.0]|[-23.9,28.0,-2.0]|
| 3|29.5| -19.0| -0.5| -0.2| [-19.0,29.5]|[-19.0,29.5,-0.5]|
| 4|32.8|-18.84| 1.5| 1.8|[-18.84,32.8]|[-18.84,32.8,1.5]|
+---+----+------+-----+-----+-------------+-----------------+
i.e. the features column is identical to the one from your 2nd example (not shown here), where you do not use the intermediate assembled feature loc...
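For comparison, here is a sketch of the single-assembler version of your 2nd example on the same dummy data (not in the original demo); the resulting features column comes out the same, so the model sees identical input either way:

# single-assembler version, equivalent to the two-step pipeline above
direct_assembler = VectorAssembler(inputCols=['long', 'lat', 'other'], outputCol='features')
direct_assembler.transform(df).select('id', 'features').show()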
Has anyone been able to match the sklearn confusion matrix to the one from H2O?
They never match...
Doing something similar with Keras produces a perfect match, but in H2O the matrices are always off. I have tried it every which way...
I borrowed some code from:
Any difference between H2O and Scikit-Learn metrics scoring?
import pandas as pd
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Train and cross-validate a GBM
model = H2OGradientBoostingEstimator(distribution="bernoulli", seed=1)
model.train(x=x, y=y, training_frame=train)
# Test AUC
model.model_performance(test).auc()
# 0.7817203808052897
# Generate predictions on a test set
pred = model.predict(test)
from sklearn.metrics import roc_auc_score, confusion_matrix
pred_df = pred.as_data_frame()
y_true = test[y].as_data_frame()
roc_auc_score(y_true, pred_df['p1'].tolist())
y_true = test[y].as_data_frame().values
cm = pd.DataFrame(confusion_matrix(y_true, pred_df['predict'].values))
print(cm)
      0     1
0  1354   961
1   540  2145
model.model_performance(test).confusion_matrix()
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.353664307031828:
        0       1       Error   Rate
0       964.0   1351.0  0.5836  (1351.0/2315.0)
1       274.0   2411.0  0.102   (274.0/2685.0)
Total   1238.0  3762.0  0.325   (1625.0/5000.0)
h2o.cluster().shutdown()
This does the trick, thanks for the hunch, Vivek. It is still not an exact match, but extremely close.
perf = model.model_performance(train)
threshold = perf.find_threshold_by_max_metric('f1')
model.model_performance(test).confusion_matrix(thresholds=threshold)
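To double-check from the scikit-learn side, one can binarize p1 with that same threshold and rebuild the matrix; a sketch, reusing pred_df and y_true from the snippets above:

import pandas as pd
from sklearn.metrics import confusion_matrix

# apply H2O's max-F1 threshold to the predicted probabilities ourselves
manual_predict = (pred_df['p1'] > threshold).astype(int)
print(pd.DataFrame(confusion_matrix(y_true, manual_predict)))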
I also ran into the same issue. Here is what I would do to make a fair comparison:
model.train(x=x, y=y, training_frame=train, validation_frame=test)
cm1 = model.confusion_matrix(metrics=['F1'], valid=True)
Since the model is trained with both a training and a validation frame, pred['predict'] will use the threshold that maximizes the F1 score on the validation data. To verify, one can use these lines:
threshold = model.find_threshold_by_max_metric(metric='F1', valid=True)
pred_df['predict'] = pred_df['p1'].apply(lambda x: 0 if x < threshold else 1)
Then, to get the corresponding confusion matrix from scikit-learn:
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(y_true, pred_df['predict'])
In my case, though, I still get slightly different results and I don't understand why. For example:
print(cm1)
>> [[3063  176]
    [  94  146]]
print(cm2)
>> [[3063  176]
    [  95  145]]
How to get performance metrics in SparkR classification, e.g., F1 score, precision, recall, confusion matrix?
# Load training data
df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
training <- df
testing <- df
# Fit a random forest classification model with spark.randomForest
model <- spark.randomForest(training, label ~ features, "classification", numTrees = 10)
# Model summary
summary(model)
# Prediction
predictions <- predict(model, testing)
head(predictions)
# Performance evaluation
I've tried caret::confusionMatrix(testing$label, testing$prediction), but it shows an error:
Error in unique.default(x, nmax = nmax) : unique() applies only to vectors
Caret's confusionMatrix will not work, since it needs R dataframes while your data are in Spark dataframes.
One (not recommended) way of getting your metrics is to collect your Spark dataframes locally to R using as.data.frame and then use caret etc.; but this requires that your data fit in the main memory of your driver machine, in which case of course you have absolutely no reason to use Spark...
So, here is a way to get the accuracy in a distributed manner (i.e. without collecting data locally), using the iris data as an example:
sparkR.version()
# "2.1.1"
df <- as.DataFrame(iris)
model <- spark.randomForest(df, Species ~ ., "classification", numTrees = 10)
predictions <- predict(model, df)
summary(predictions)
# SparkDataFrame[summary:string, Sepal_Length:string, Sepal_Width:string, Petal_Length:string, Petal_Width:string, Species:string, prediction:string]
createOrReplaceTempView(predictions, "predictions")
correct <- sql("SELECT prediction, Species FROM predictions WHERE prediction=Species")
count(correct)
# 149
acc <- count(correct)/count(predictions)
acc
# 0.9933333
(Regarding the 149 correct predictions out of 150 samples: if you do a showDF(predictions, numRows=150), you will indeed see that there is a single virginica sample misclassified as versicolor.)
I used cross validation to train a linear regression model using the following code:
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
lr = LinearRegression(maxIter=maxIteration)
modelEvaluator=RegressionEvaluator()
pipeline = Pipeline(stages=[lr])
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=modelEvaluator,
                          numFolds=3)
cvModel = crossval.fit(training)
Now I want to draw the ROC curve. I used the following code, but I get this error:
'LinearRegressionTrainingSummary' object has no attribute 'areaUnderROC'
trainingSummary = cvModel.bestModel.stages[-1].summary
trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
I also want to check the objectiveHistory at each iteration. I know that I can get it at the end:
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
but I want to get it at each iteration; how can I do this?
Moreover, I want to evaluate the model on the test data; how can I do that?
prediction = cvModel.transform(test)
I know for the training data set I can write:
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
but how can I get these metrics for the test data set?
1) The area under the ROC curve (AUC) is defined only for binary classification, hence you cannot use it for regression tasks, as you are trying to do here.
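(For reference, if your problem were actually binary classification, here is a sketch of getting the AUC with a BinaryClassificationEvaluator, whose metric defaults to areaUnderROC; the toy data below are made up for illustration:)

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors

# made-up binary-label data, just to illustrate the evaluator
bin_df = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.5]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])

clf_model = LogisticRegression(maxIter=5).fit(bin_df)
evaluator = BinaryClassificationEvaluator()  # metricName="areaUnderROC" by default
evaluator.evaluate(clf_model.transform(bin_df))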
2) The objectiveHistory for each iteration is only available when the solver argument in the regression is l-bfgs (documentation); here is a toy example:
spark.version
# u'2.1.1'
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.2),
     (Vectors.dense([0.4]), 1.4),
     (Vectors.dense([0.5]), 1.9),
     (Vectors.dense([0.6]), 0.9),
     (Vectors.dense([1.2]), 1.0)] * 10,
    ["features", "label"])
lr = LinearRegression(maxIter=5, solver="l-bfgs") # solver="l-bfgs" here
modelEvaluator=RegressionEvaluator()
pipeline = Pipeline(stages=[lr])
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=modelEvaluator,
                          numFolds=3)
cvModel = crossval.fit(dataset)
trainingSummary = cvModel.bestModel.summary
trainingSummary.totalIterations
# 2
trainingSummary.objectiveHistory # one value for each iteration
# [0.49, 0.4511834723904831]
3) You have already defined a RegressionEvaluator, which you can use for evaluating your test set; if used without arguments, it assumes the RMSE metric. Here is a way to define evaluators with different metrics and apply them to your test set (continuing the code from above):
test = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.2),
     (Vectors.dense([0.4]), 1.1),
     (Vectors.dense([0.5]), 0.9),
     (Vectors.dense([0.6]), 1.0)],
    ["features", "label"])
modelEvaluator.evaluate(cvModel.transform(test)) # rmse by default, if not specified
# 0.35384585061028506
eval_rmse = RegressionEvaluator(metricName="rmse")
eval_r2 = RegressionEvaluator(metricName="r2")
eval_rmse.evaluate(cvModel.transform(test)) # same as above
# 0.35384585061028506
eval_r2.evaluate(cvModel.transform(test))
# -0.001655087952929124
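As a side note, the fitted CrossValidatorModel also keeps the average cross-validation metric for each parameter combination in avgMetrics; a quick sketch (continuing the code from above) to inspect which grid point won:

# average evaluator metric (RMSE here) per ParamGridBuilder combination,
# in the same order as paramGrid:
for params, metric in zip(paramGrid, cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, metric)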
I am using Spark 1.5.1. In pyspark, after I fit the model using:
model = LogisticRegressionWithLBFGS.train(parsedData)
I can print the prediction using:
model.predict(p.features)
Is there a function to print the probability score also along with the prediction?
You have to clear the threshold first, and this works only for binary classification:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint
parsed_data = [LabeledPoint(0.0, [4.6, 3.6, 1.0, 0.2]),
               LabeledPoint(0.0, [5.7, 4.4, 1.5, 0.4]),
               LabeledPoint(1.0, [6.7, 3.1, 4.4, 1.4]),
               LabeledPoint(0.0, [4.8, 3.4, 1.6, 0.2]),
               LabeledPoint(1.0, [4.4, 3.2, 1.3, 0.2])]
model = LogisticRegressionWithLBFGS.train(sc.parallelize(parsed_data))
model.threshold
# 0.5
model.predict(parsed_data[2].features)
# 1
model.clearThreshold()
model.predict(parsed_data[2].features)
# 0.9873840020002339
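If you need the hard 0/1 predictions again afterwards, you can simply restore a threshold (a sketch; setThreshold is part of the same MLlib model API):

model.setThreshold(0.5)
model.predict(parsed_data[2].features)
# 1 (since 0.987... > 0.5)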
I presume the question is about computing the probability score for predicting the entire training set. If so, I did the following to compute it. Not sure if the post is still active, but this is how I did it:
# get the original training data before it was converted to rows of LabeledPoint;
# let us assume it is otd (a Spark DataFrame), with the label in column 0

# extract the feature set as an RDD:
fs = otd.rdd.map(lambda x: x[1:])

# a sample way of creating LabeledPoint rows (reg is pyspark.mllib.regression):
parsedData = otd.rdd.map(lambda x: reg.LabeledPoint(int(x[0] - 1), x[1:]))

# convert otd to a pandas DataFrame:
ptd = otd.toPandas()
m = ptd.shape[0]

# train and get the model
model = LogisticRegressionWithLBFGS.train(parsedData, numClasses=10)

# store the model.predict results
predict = model.predict(fs)
pr = predict.collect()

correct = ((ptd.label - 1) == pr).sum()
print(100.0 * correct / m)
Note that the above is for multi-class classification.