How to evaluate implicit trained model in spark MLlib [duplicate] - apache-spark

I'm trying to tune the parameters of an ALS matrix factorization model that uses implicit data. For this, I'm trying to use pyspark.ml.tuning.CrossValidator to run through a parameter grid and select the best model. I believe my problem is in the evaluator, but I can't figure it out.
I can get this to work for an explicit data model with a regression RMSE evaluator, as follows:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import rand
conf = SparkConf() \
.setAppName("MovieLensALS") \
.set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
dfRatings = sqlContext.createDataFrame([(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
["user", "item", "rating"])
dfRatingsTest = sqlContext.createDataFrame([(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)], ["user", "item"])
alsExplicit = ALS()
defaultModel = alsExplicit.fit(dfRatings)
paramMapExplicit = ParamGridBuilder() \
.addGrid(alsExplicit.rank, [8, 12]) \
.addGrid(alsExplicit.maxIter, [10, 15]) \
.addGrid(alsExplicit.regParam, [1.0, 10.0]) \
.build()
evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating")
cvExplicit = CrossValidator(estimator=alsExplicit, estimatorParamMaps=paramMapExplicit, evaluator=evaluatorR)
cvModelExplicit = cvExplicit.fit(dfRatings)
predsExplicit = cvModelExplicit.bestModel.transform(dfRatingsTest)
predsExplicit.show()
When I try to do this for implicit data (let's say counts of views rather than ratings), I get an error that I can't quite figure out. Here's the code (very similar to the above):
dfCounts = sqlContext.createDataFrame([(0,0,0), (0,1,12), (0,2,3), (1,0,5), (1,1,9), (1,2,0), (2,0,0), (2,1,11), (2,2,25)],
["user", "item", "rating"])
dfCountsTest = sqlContext.createDataFrame([(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)], ["user", "item"])
alsImplicit = ALS(implicitPrefs=True)
defaultModelImplicit = alsImplicit.fit(dfCounts)
paramMapImplicit = ParamGridBuilder() \
.addGrid(alsImplicit.rank, [8, 12]) \
.addGrid(alsImplicit.maxIter, [10, 15]) \
.addGrid(alsImplicit.regParam, [1.0, 10.0]) \
.addGrid(alsImplicit.alpha, [2.0,3.0]) \
.build()
evaluatorB = BinaryClassificationEvaluator(metricName="areaUnderROC", labelCol="rating")
evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating")
cv = CrossValidator(estimator=alsImplicit, estimatorParamMaps=paramMapImplicit, evaluator=evaluatorR)
cvModel = cv.fit(dfCounts)
predsImplicit = cvModel.bestModel.transform(dfCountsTest)
predsImplicit.show()
I tried doing this with an RMSE evaluator and I get an error. As I understand, I should also be able to use the AUC metric for the binary classification evaluator, because the predictions of the implicit matrix factorization are a confidence matrix c_ui for predictions of a binary matrix p_ui per this paper, which the documentation for pyspark ALS cites.
Using either evaluator gives me an error and I can't find any fruitful discussion about cross-validating implicit ALS models online. I'm looking through the CrossValidator source code to try to figure out what's wrong, but am having trouble. One of my thoughts is that after the process converts the implicit data matrix r_ui to the binary matrix p_ui and confidence matrix c_ui, I'm not sure what it's comparing the predicted c_ui matrix against during the evaluation stage.
Here is the error:
Traceback (most recent call last):
File "<ipython-input-16-6c43b997005e>", line 1, in <module>
cvModel = cv.fit(dfCounts)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\pipeline.py", line 69, in fit
return self._fit(dataset)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\tuning.py", line 239, in _fit
model = est.fit(train, epm[j])
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\pipeline.py", line 67, in fit
return self.copy(params)._fit(dataset)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\wrapper.py", line 133, in _fit
java_model = self._fit_java(dataset)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\wrapper.py", line 130, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\sql\utils.py", line 45, in deco
return f(*a, **kw)
File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
format(target_id, ".", name), value)
etc.......
UPDATE
I tried scaling the input so it's in the range of 0 to 1 and using a RMSE evaluator. It seems to work well until I try to insert it into the CrossValidator.
The following code works. I get predictions and i get an RMSE value from my evaluator.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import FloatType
import pyspark.sql.functions as F
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
conf = SparkConf() \
.setAppName("ALSPractice") \
.set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# Users 0, 1, 2, 3 - Items 0, 1, 2, 3, 4, 5 - Ratings 0.0-5.0
dfCounts2 = sqlContext.createDataFrame([(0,0,5.0), (0,1,5.0), (0,3,0.0), (0,4,0.0),
(1,0,5.0), (1,2,4.0), (1,3,0.0), (1,4,0.0),
(2,0,0.0), (2,2,0.0), (2,3,5.0), (2,4,5.0),
(3,0,0.0), (3,1,0.0), (3,3,4.0) ],
["user", "item", "rating"])
dfCountsTest2 = sqlContext.createDataFrame([(0,0), (0,1), (0,2), (0,3), (0,4),
(1,0), (1,1), (1,2), (1,3), (1,4),
(2,0), (2,1), (2,2), (2,3), (2,4),
(3,0), (3,1), (3,2), (3,3), (3,4)], ["user", "item"])
# Normalize rating data to [0,1] range based on max rating
colmax = dfCounts2.select(F.max('rating')).collect()[0].asDict().values()[0]
normalize = udf(lambda x: x/colmax, FloatType())
dfCountsNorm = dfCounts2.withColumn('ratingNorm', normalize(col('rating')))
alsImplicit = ALS(implicitPrefs=True)
defaultModelImplicit = alsImplicit.fit(dfCountsNorm)
preds = defaultModelImplicit.transform(dfCountsTest2)
evaluatorR2 = RegressionEvaluator(metricName="rmse", labelCol="ratingNorm")
evaluatorR2.evaluate(defaultModelImplicit.transform(dfCountsNorm))
preds = defaultModelImplicit.transform(dfCountsTest2)
What I don't understand is why the following doesn't work. I'm using the same estimator, the same evaluator and fitting the same data. Why would these work above but not within the CrossValidator:
paramMapImplicit = ParamGridBuilder() \
.addGrid(alsImplicit.rank, [8, 12]) \
.addGrid(alsImplicit.maxIter, [10, 15]) \
.addGrid(alsImplicit.regParam, [1.0, 10.0]) \
.addGrid(alsImplicit.alpha, [2.0,3.0]) \
.build()
cv = CrossValidator(estimator=alsImplicit, estimatorParamMaps=paramMapImplicit, evaluator=evaluatorR2)
cvModel = cv.fit(dfCountsNorm)

Ignoring technical issues, strictly speaking neither method is correct given the input generated by ALS with implicit feedback.
you cannot use RegressionEvaluator because, as you already know, prediction can be interpreted as a confidence value and is represented as a floating point number in range [0, 1] and label column is just an unbound integer. These values are clearly not comparable.
you cannot use BinaryClassificationEvaluator because even if the prediction can be interpreted as probability label doesn't represent binary decision. Moreover prediction column has invalid type and couldn't be used directly with BinaryClassificationEvaluator
You can try to convert one of the columns so input fit the requirements but this is is not really a justified approach from a theoretical perspective and introduces additional parameters which are hard to tune.
map label column to [0, 1] range and use RMSE.
convert label column to binary indicator with fixed threshold and extend ALS / ALSModel to return expected column type. Assuming threshold value is 1 it could be something like this
from pyspark.ml.recommendation import *
from pyspark.sql.functions import udf, col
from pyspark.mllib.linalg import DenseVector, VectorUDT
class BinaryALS(ALS):
def fit(self, df):
assert self.getImplicitPrefs()
model = super(BinaryALS, self).fit(df)
return ALSBinaryModel(model._java_obj)
class ALSBinaryModel(ALSModel):
def transform(self, df):
transformed = super(ALSBinaryModel, self).transform(df)
as_vector = udf(lambda x: DenseVector([1 - x, x]), VectorUDT())
return transformed.withColumn(
"rawPrediction", as_vector(col("prediction")))
# Add binary label column
with_binary = dfCounts.withColumn(
"label_binary", (col("rating") > 0).cast("double"))
als_binary_model = BinaryALS(implicitPrefs=True).fit(with_binary)
evaluatorB = BinaryClassificationEvaluator(
metricName="areaUnderROC", labelCol="label_binary")
evaluatorB.evaluate(als_binary_model.transform(with_binary))
## 1.0
Generally speaking, material about evaluating recommender systems with implicit feedbacks is kind of missing in textbooks, I suggest you take a read on eliasah's answer about evaluating these kind of recommenders.

With implicit feedbacks we don't have user reactions to our recommendations. Thus, we cannot use precision based metrics.
In the already cited paper, the expected percentile ranking metric is used instead.
You can try to implement an Evaluator based on a similar metric in the Spark ML lib, and use it in your Cross Validation pipeline.

Very late to the party here, but I'll post in case anyone stumbles upon this question like I did.
I was getting a similar error when trying to use CrossValidator with an ALS model. I resolved it by setting the coldStartStrategy parameter in ALS to "drop". That is:
alsImplicit = ALS(implicitPrefs=True, coldStartStrategy="drop")
and keep the rest of the code the same.
I expect what was happening in my example is that the cross-validation splits created scenarios where I had items in the validation set that did not appear in the training set, which results in NaN prediction values. The best solution is to drop the NaN values when evaluating, as described in the documentation.
I don't know if we were getting the same error so can't guarantee this would solve OP's problem, but it's good practice to set coldStartStrategy="drop" for cross validation anyway.
Note: my error message was "Params must be either a param map or a list/tuple of param maps", which didn't seem to imply an issue with the coldStartStrategy parameter or NaN values but this solution resolved the error.

In order to cross validate my ALS model with implicitPrefs=True, I needed to adapt #zero323's answer slightly for pyspark==2.3.0 where I was getting the following exception:
xspy4j.Py4JException: Target Object ID does not exist for this gateway :o2733\\n\tat py4j.Gateway.invoke(Gateway.java...java:79)\\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\\n\tat java.lang.Thread.run(Thread.java:748)\\n
ALS extends JavaEstimator which provides the hooks necessary for fitting Estimators that wrap Java/Scala implementations. We need to override _create_model in BinaryALS so PySpark can keep all the Java object references straight:
import pyspark.sql.functions as F
from pyspark.ml.linalg import DenseVector, VectorUDT
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.sql.dataframe import DataFrame
class ALSBinaryModel(ALSModel):
def transform(self, df: DataFrame) -> DataFrame:
transformed = super().transform(df)
as_vector = F.udf(lambda x: DenseVector([1 - x, x]), VectorUDT())
return transformed.withColumn("rawPrediction", as_vector(F.col("prediction")))
class BinaryALS(ALS):
def fit(self, df: DataFrame) -> ALSBinaryModel:
assert self.getImplicitPrefs()
return super().fit(df)
def _create_model(self, java_model) -> ALSBinaryModel:
return ALSBinaryModel(java_model=java_model)

Related

fit_transform vs transform when doing inference

I have trained a keras model and saved it. I now want to use the model in a web app for inference. I want to preprocess the inputs by scaling them using StandardScaler() from sklearn.
But whenever i run transform(inputs) an error occurs wanting me to do fitting first. This was the code
from sklearn.preprocessing import StandardScaler
inputs = [1,8,0,0,4,18,4,3,576,9,8,8,14,1,0,4,0,0,3,6,0,1,1]
inputs = scale.transform(inputs)
preds = model.predict(inputs, batch_size = 1)
I then changed the code inorder to do fitting
from sklearn.preprocessing import StandardScaler
inputs = [1,8,0,0,4,18,4,3,576,9,8,8,14,1,0,4,0,0,3,6,0,1,1]
inputs = scale.fit_transform(inputs)
preds = model.predict(inputs, batch_size = 1)
It worked but the scaled data are all bunch of zeros regardless of the inputs i provide, making wrong predicitions. Am certain am missing some key concepts here, i am asking for help. Thank you
The standard scaler function has formula:
z = (x - u) / s
Here,
x: Element
u: Mean
s: Standard Deviation
This element transformation is done column-wise.
Therefore, when you call to fit the values of mean and standard_deviation are calculated.
Eg:
from sklearn.preprocessing import StandardScaler
import numpy as np
x = np.random.randint(50,size = (10,2))
x
Output:
array([[26, 9],
[29, 39],
[23, 26],
[29, 22],
[28, 41],
[11, 6],
[42, 40],
[ 1, 25],
[ 0, 39],
[44, 45]])
Now, fitting the standard scaler
scale = StandardScaler()
scale.fit(x)
You can see the mean and standard deviation using the built methods for the StandardScaler object
# Mean
scale.mean_ # array([23.3, 29.2])
# Standard Deviation
scale.scale_ # array([14.36697602, 13.12859475])
You transform these values using the transform method.
scale.transform(x)
Output:
array([[ 0.18793099, -1.53862621],
[ 0.3967432 , 0.74646222],
[-0.02088122, -0.24374277],
[ 0.3967432 , -0.54842122],
[ 0.32713913, 0.89880145],
[-0.85613006, -1.76713506],
[ 1.3015961 , 0.82263184],
[-1.55217075, -0.31991238],
[-1.62177482, 0.74646222],
[ 1.44080424, 1.20347991]])
Calculation for 1st element:
z = (26 - 23.3) / 14.36697602
z = 0.18793099
How to use this?
The transformation should be done before training your model. The training should be done on transformed data. And for the prediction, the test data should use the same mean and standard deviation values as your training data. ie. Do not use fit method on the test data. You should use the object that was used to transform the training data to transform your test data.

Optimization of predictions from sklearn model (e.g. RandomForestRegressor)

Does anyone used any optimization models on fitted sklearn models?
What I'd like to do is fit model based on train data and using this model try to find the best combination of parameters for which model would predict the biggest value.
Some example, simplified code:
import pandas as pd
df = pd.DataFrame({
'temperature': [10, 15, 30, 20, 25, 30],
'working_hours': [10, 12, 12, 10, 30, 15],
'sales': [4, 7, 6, 7.3, 10, 8]
})
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
X = df.drop(['sales'], axis=1)
y = df['sales']
model.fit(X, y);
Our baseline is a simple loop and predict all combination of variables:
results = pd.DataFrame(columns=['temperature', 'working_hours', 'sales_predicted'])
import numpy as np
for temp in np.arange(1,100.01,1):
for work_hours in np.arange(1,60.01,1):
results = pd.concat([
results,
pd.DataFrame({
'temperature': temp,
'working_hours': work_hours,
'sales_predicted': model.predict(np.array([temp, work_hours]).reshape(1,-1))
}
)
]
)
print(results.sort_values(by='sales_predicted', ascending=False))
Using that way it's difficult or impossible to:
* do it fast (brute method)
* implement constraint concerning two or more variables dependency
We tried PuLP library and PyOmo library, but both doesn't allow to put model.predict function as an objective function returning error:
TypeError: float() argument must be a string or a number, not 'LpVariable'
Do anyone have any idea how we can get rid off loop and use some other stuff?
When people talk about optimizing fitted sklearn models, they usually mean maximizing accuracy/performance metrics. So if you are trying to maximize your predicted value, you can definitely improve your code to achieve it more efficiently, like below.
You are collecting all the predictions in a big results dataframe, and then sorting it in ascending order. Instead, you can just search for an increase in your target variable (sales_predicted) on-the-fly, using a simple if logic. So just change your loop into this:
max_sales_predicted = 0
for temp in np.arange(1, 100.01, 1):
for work_hours in np.arange(1, 60.01, 1):
sales_predicted = model.predict(np.array([temp, work_hours]).reshape(1, -1))
if sales_predicted > max_sales_predicted:
max_sales_predicted = sales_predicted
desired_temp = temp
desired_work_hours = work_hours
So that you can only take into account any specification that produces a predictiong that exceeds your current target, and else, do nothing.
The result of my code is the same as yours, i.e. a max_sales_predicted value of 9.2. Also, desired_temp and desired_work_hours now give you the specification that produce that maxima. Hope this helps.

Anomaly detection with PCA in Spark

I read the following article
Anomaly detection with Principal Component Analysis (PCA)
In the article is written following:
• PCA algorithm basically transforms data readings from an existing coordinate system into a new coordinate system.
• The closer data readings are to the center of the new coordinate system, the closer these readings are to an optimum value.
• The anomaly score is calculated using the Mahalanobis distance between a reading and the mean of all readings, which is the center of the transformed coordinate system.
Can anyone describe me more in detail about anomaly detection using PCA (using PCA scores and Mahalanobis distance)? I'm confused because the definition of PCA is: PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables“. How to use Mahalanobis distance when there is no more correlation between the variables?
Can anybody explain me how to do this in Spark? Does the pca.transform function returns the score where i should calculate the Mahalanobis distance for every reading to the center?
Lets assume you have a dataset of 3-dimensional points.
Each point has coordinates (x, y, z).
Those (x, y, z) are dimensions.
Point represented by three values e. g. (8, 7, 4). It called input vector.
When you applying PCA algorithm you basically transform your input vector to new vector. It can be represented as function that turns (x, y, z) => (v, w).
Example: (8, 7, 4) => (-4, 13)
Now you received a vector, shorter one (you reduced an nr. of dimension), but your point still has coordinates, namely (v, w). This means that you can compute the distance between two points using Mahalanobis measure. Points that have a long distance from a mean coordinate are in fact anomalies.
Example solution:
import breeze.linalg.{DenseVector, inv}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{PCA, StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.{Matrix, Vector}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._
object SparkApp extends App {
val session = SparkSession.builder()
.appName("spark-app").master("local[*]").getOrCreate()
session.sparkContext.setLogLevel("ERROR")
import session.implicits._
val df = Seq(
(1, 4, 0),
(3, 4, 0),
(1, 3, 0),
(3, 3, 0),
(67, 37, 0) //outlier
).toDF("x", "y", "z")
val vectorAssembler = new VectorAssembler().setInputCols(Array("x", "y", "z")).setOutputCol("vector")
val standardScalar = new StandardScaler().setInputCol("vector").setOutputCol("normalized-vector").setWithMean(true)
.setWithStd(true)
val pca = new PCA().setInputCol("normalized-vector").setOutputCol("pca-features").setK(2)
val pipeline = new Pipeline().setStages(
Array(vectorAssembler, standardScalar, pca)
)
val pcaDF = pipeline.fit(df).transform(df)
def withMahalanobois(df: DataFrame, inputCol: String): DataFrame = {
val Row(coeff1: Matrix) = Correlation.corr(df, inputCol).head
val invCovariance = inv(new breeze.linalg.DenseMatrix(2, 2, coeff1.toArray))
val mahalanobois = udf[Double, Vector] { v =>
val vB = DenseVector(v.toArray)
vB.t * invCovariance * vB
}
df.withColumn("mahalanobois", mahalanobois(df(inputCol)))
}
val withMahalanobois: DataFrame = withMahalanobois(pcaDF, "pca-features")
session.close()
}

Ax = b solver on coordinate matrix Apache Spark

How can I solve the Ax = b problem using Apache spark. My input is a coordinate matrix:
import numpy as np
import scipy
from scipy import sparse
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
A = sparse.coo_matrix((data, (row, col)), shape=(4, 4))
#take the first column of A
b = sparse.coo_matrix((data, (row, 1)), shape=(4, 1))
#Solve Ax = b
np.linalg.solve(A,b)
Now I want to solve for x in Ax=b using the python libraries of the Apache Spark framework so the solution should be [1,0,0,0] since b is the 1st column of A
Below is the Apache Spark linear regression. Now, how do I set up the problem such that the input is a coordinate matrix (A) and coordinate vector (b)?
from pyspark.ml.regression import LinearRegression
# Load training data
training = spark.read.format("libsvm")\
.load("data/mllib/sample_linear_regression_data.txt")
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the model
lrModel = lr.fit(training)
# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))
# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
How can I solve the Ax = b problem using Apache spark.
Directly (analytically) you cannot. Spark doesn't provide linear algebra library.
Indirectly - use pyspark.ml.regression to approximately solve OLS problem. You can refer to:
API docs.
MLlib guide
for details regarding expected input and required steps.

cross validation in pyspark

I used cross validation to train a linear regression model using the following code:
from pyspark.ml.evaluation import RegressionEvaluator
lr = LinearRegression(maxIter=maxIteration)
modelEvaluator=RegressionEvaluator()
pipeline = Pipeline(stages=[lr])
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
crossval = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=modelEvaluator,
numFolds=3)
cvModel = crossval.fit(training)
now I want to draw the roc curve, I used the following code but I get this error:
'LinearRegressionTrainingSummary' object has no attribute 'areaUnderROC'
trainingSummary = cvModel.bestModel.stages[-1].summary
trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
I also want to check the objectiveHistory at each itaration, I know that I can get it at the end
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
but I want to get it at each iteration, how can I do this?
Moreover I want to evaluate the model on the test data, how can I do that?
prediction = cvModel.transform(test)
I know for the training data set I can write:
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
but how can I get these metrics for testing data set?
1) The area under the ROC curve (AUC) is defined only for binary classification, hence you cannot use it for regression tasks, as you are trying to do here.
2) The objectiveHistory for each iteration is only available when the solver argument in the regression is l-bfgs (documentation); here is a toy example:
spark.version
# u'2.1.1'
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
dataset = spark.createDataFrame(
[(Vectors.dense([0.0]), 0.2),
(Vectors.dense([0.4]), 1.4),
(Vectors.dense([0.5]), 1.9),
(Vectors.dense([0.6]), 0.9),
(Vectors.dense([1.2]), 1.0)] * 10,
["features", "label"])
lr = LinearRegression(maxIter=5, solver="l-bfgs") # solver="l-bfgs" here
modelEvaluator=RegressionEvaluator()
pipeline = Pipeline(stages=[lr])
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
crossval = CrossValidator(estimator=lr,
estimatorParamMaps=paramGrid,
evaluator=modelEvaluator,
numFolds=3)
cvModel = crossval.fit(dataset)
trainingSummary = cvModel.bestModel.summary
trainingSummary.totalIterations
# 2
trainingSummary.objectiveHistory # one value for each iteration
# [0.49, 0.4511834723904831]
3) You have already defined a RegressionEvaluator which you can use for evaluating your test set but, if used without arguments, it assumes the RMSE metric; here is a way to define evaluators with different metrics and apply them to your test set (continuing the code from above):
test = spark.createDataFrame(
[(Vectors.dense([0.0]), 0.2),
(Vectors.dense([0.4]), 1.1),
(Vectors.dense([0.5]), 0.9),
(Vectors.dense([0.6]), 1.0)],
["features", "label"])
modelEvaluator.evaluate(cvModel.transform(test)) # rmse by default, if not specified
# 0.35384585061028506
eval_rmse = RegressionEvaluator(metricName="rmse")
eval_r2 = RegressionEvaluator(metricName="r2")
eval_rmse.evaluate(cvModel.transform(test)) # same as above
# 0.35384585061028506
eval_r2.evaluate(cvModel.transform(test))
# -0.001655087952929124

Resources