Cross-validation outside the pipeline
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val naiveBayes = new NaiveBayes()
val indexer = new StringIndexer() // input/output columns omitted in the original
val pipeLine = new Pipeline().setStages(Array(indexer, naiveBayes))
val paramGrid = new ParamGridBuilder()
.addGrid(naiveBayes.smoothing, Array(1.0, 0.1, 0.3, 0.5))
.build()
val crossValidator = new CrossValidator().setEstimator(pipeLine)
.setEvaluator(new MulticlassClassificationEvaluator)
.setNumFolds(2).setEstimatorParamMaps(paramGrid)
val crossValidatorModel = crossValidator.fit(trainData)
val predictions = crossValidatorModel.transform(testData)
Cross-validation inside the pipeline
val naiveBayes = new NaiveBayes()
val indexer = new StringIndexer() // input/output columns omitted in the original
// param grid with multiple values for the smoothing parameter
val paramGrid = new ParamGridBuilder()
.addGrid(naiveBayes.smoothing, Array(0.35, 0.1, 0.2, 0.3, 0.5))
.build()
// validator for naive bayes
val crossValidator = new CrossValidator().setEstimator(naiveBayes)
.setEvaluator(new MulticlassClassificationEvaluator)
.setNumFolds(2).setEstimatorParamMaps(paramGrid)
// pipeline that chains the indexer and the cross-validated estimator
val pipeLine = new Pipeline().setStages(Array(indexer, crossValidator))
// pipeline model
val pipeLineModel = pipeLine.fit(trainData)
// transform data
val predictions = pipeLineModel.transform(testData)
So I want to know which way is better, and the pros and cons of each.
With both approaches I get the same result and accuracy; the second approach is even a little faster than the first.
As per a training I attended, this should be the best practice:
cv = CrossValidator(estimator=lr, ...)
pipeline = Pipeline(stages=[idx, assembler, cv])
cv_model = pipeline.fit(train)
This way, the pipeline stages before the cross-validator (here idx and assembler) are fit only once, rather than once for every fold and parameter combination in the param grid, which makes it run faster. For example, with numFolds=2 and a 4-value grid as in the question, wrapping the whole pipeline in the cross-validator would refit the indexer 2 x 4 = 8 times, plus once for the final model.
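If you need the tuned model itself later, it can be pulled back out of the fitted pipeline. A minimal sketch, assuming the Scala variable names from the second approach in the question (the cross-validator is stage 1):

import org.apache.spark.ml.tuning.CrossValidatorModel

// stage 0 is the fitted indexer; stage 1 is the fitted CrossValidatorModel
val cvStage = pipeLineModel.stages(1).asInstanceOf[CrossValidatorModel]
val bestNaiveBayesModel = cvStage.bestModel // best model found over the param grid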
Hope this helps!
I am facing the error below while fitting my model. I am trying to run cross-validation with a pipeline inside it.
Below is the code snippet for data transformation:
from pyspark.ml.feature import (OneHotEncoder, QuantileDiscretizer, StandardScaler,
                                StringIndexer, VectorAssembler)

qd = QuantileDiscretizer(relativeError=0.01, handleInvalid="error", numBuckets=4,
                         inputCols=["time"], outputCols=["time_qd"])
#Normalize vector
scaler = StandardScaler()\
    .setInputCol("vectorized_features")\
    .setOutputCol("features")
#Encoder for VesselTypeGroupName
encoder = StringIndexer(handleInvalid='skip')\
    .setInputCols(["type"])\
    .setOutputCols(["type_enc"])
#One-hot encode categorical variables
encoder1 = OneHotEncoder()\
    .setInputCols(["type_enc", "ID1", "ID12", "time_qd"])\
    .setOutputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"])
#Assemble variables
assembler = VectorAssembler(handleInvalid="keep")\
    .setInputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"])\
    .setOutputCol("vectorized_features")
The total number of features after one-hot encoding will not exceed 200. The model code is below:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol='features', labelCol='label',
                        weightCol='classWeightCol')
pipeline_stages = Pipeline(stages=[qd, encoder, encoder1, assembler, scaler, lr])
#Create Logistic Regression parameter grids for parameter tuning
paramGrid_lr = (ParamGridBuilder()
                .addGrid(lr.regParam, [0.01, 0.5, 2.0])  # regularization parameter
                .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])  # elastic net parameter (ridge = 0)
                .addGrid(lr.maxIter, [1, 10, 20])  # number of iterations
                .build())
cv_lr = CrossValidator(estimator=pipeline_stages, estimatorParamMaps=paramGrid_lr,
evaluator=BinaryClassificationEvaluator(), numFolds=5, seed=42)
cv_lr_model = cv_lr.fit(train_df)
The .fit method throws the error below:
I have tried increasing the driver memory but am still facing the same error. Please suggest what might be causing this issue.
I'm trying to extract the feature importances of a random forest classifier model I have trained using PySpark. I referred to the following article to get the feature importance scores for the random forest model I trained.
PySpark & MLLib: Random Forest Feature Importances
However, when I use the method described in this article, I get the following error:
'CrossValidatorModel' object has no attribute 'featureImportances'
Here is the code I used to train my model
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import IndexToString, StringIndexer, VectorAssembler, VectorIndexer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

cols = new_data.columns
stages = []
label_stringIdx = StringIndexer(inputCol = 'Bought_Fibre', outputCol = 'label')
stages += [label_stringIdx]
numericCols = new_data.schema.names[1:-1]
assembler = VectorAssembler(inputCols=numericCols, outputCol="features")
stages += [assembler]
pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(new_data)
new_data = new_data.fillna(0, subset=cols)  # fillna returns a new DataFrame, so reassign
new_data = pipelineModel.transform(new_data)
new_data = new_data.fillna(0, subset=cols)
new_data.printSchema()
train_initial, test = new_data.randomSplit([0.7, 0.3], seed = 1045)
train_initial.groupby('label').count().toPandas()
test.groupby('label').count().toPandas()
train_sampled = train_initial.sampleBy("label", fractions={0: 0.1, 1: 1.0}, seed=0)
train_sampled.groupBy("label").count().orderBy("label").show()
labelIndexer = StringIndexer(inputCol='label',
outputCol='indexedLabel').fit(train_sampled)
featureIndexer = VectorIndexer(inputCol='features',
outputCol='indexedFeatures',
maxCategories=2).fit(train_sampled)
from pyspark.ml.classification import RandomForestClassifier
rf_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
labels=labelIndexer.labels)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf_model, labelConverter])
paramGrid = ParamGridBuilder() \
    .addGrid(rf_model.numTrees, [200, 400, 600, 800, 1000]) \
    .addGrid(rf_model.impurity, ['entropy', 'gini']) \
    .addGrid(rf_model.maxDepth, [2, 3, 4, 5]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=BinaryClassificationEvaluator(),
numFolds=5)
train_model = crossval.fit(train_sampled)
Please help me resolve the above error and extract the feature importances.
That's because CrossValidatorModel doesn't have a feature-importances attribute, but the underlying RandomForestClassificationModel does.
Since you are using a Pipeline and CrossValidator to fit your data, you'll need to get the underlying stage of the best fitted model:
# '2' is the index of the RandomForestClassificationModel inside the Pipeline
your_model = train_model.bestModel.stages[2]
var_imp = your_model.featureImportances
I'm training a NaiveBayesModel in Spark; however, when I use it to predict a new instance I need to get the probabilities for each class. I looked at the code of the predict function in NaiveBayesModel and came up with the following code:
val thetaMatrix = new DenseMatrix(model.labels.length, model.theta(0).length, model.theta.flatten, true)
val piVector = new DenseVector(model.pi)
//val prob = thetaMatrix.multiply(test.features)
val x = test.map {p =>
val prob = thetaMatrix.multiply(p.features)
BLAS.axpy(1.0, piVector, prob)
prob
}
Does this work properly? The line BLAS.axpy(1.0, piVector, prob) keeps giving me an error that the value 'axpy' is not found.
In a recent pull request this was added to the Spark trunk and will be released in Spark 1.5 (closing SPARK-4362). You can therefore call
def predictProbabilities(testData: RDD[Vector]): RDD[Vector]
or
def predictProbabilities(testData: Vector): Vector
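A minimal usage sketch, assuming test is the RDD[LabeledPoint] from the question (Spark 1.5+):

// class-conditional probabilities for each test instance
val classProbabilities = model.predictProbabilities(test.map(_.features))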
I'm new to Spark and machine learning in general.
I have successfully followed some of the MLlib tutorials, but I can't get this one working.
I found the sample code here:
https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression
(section LinearRegressionWithSGD)
Here is the code:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
// Save and load model
model.save(sc, "myModelPath")
val sameModel = LinearRegressionModel.load(sc, "myModelPath")
(that's exactly what's on the website)
The result is
training Mean Squared Error = 6.2087803138063045
and
valuesAndPreds.collect
gives
Array[(Double, Double)] = Array((-0.4307829,-1.8383286021929077),
(-0.1625189,-1.4955700806407322), (-0.1625189,-1.118820892849544),
(-0.1625189,-1.6134108278724875), (0.3715636,-0.45171266551058276),
(0.7654678,-1.861316066986158), (0.8544153,-0.3588282725617985),
(1.2669476,-0.5036812148225209), (1.2669476,-1.1534698170911792),
(1.2669476,-0.3561392231695041), (1.3480731,-0.7347031705813306),
(1.446919,-0.08564658011814863), (1.4701758,-0.656725375080344),
(1.4929041,-0.14020483324910105), (1.5581446,-1.9438858658143454),
(1.5993876,-0.02181165554398845), (1.6389967,-0.3778677315868635),
(1.6956156,-1.1710092824030043), (1.7137979,0.27583044213064634),
(1.8000583,0.7812664902440078), (1.8484548,0.94605507153074),
(1.8946169,-0.7217282082851512), (1.9242487,-0.24422843221437684),...
My problem here is that the predictions look totally random (and wrong), and since this is a perfect copy of the website example with the same input (training) data, I don't know where to look. Am I missing something?
Please give me some advice or a clue about where to search; I can read and experiment.
Thanks
As explained by zero323 here, setting the intercept to true will solve the problem. If it is not set to true, your regression line is forced to go through the origin, which is not appropriate in this case. (I'm not sure why this is not included in the sample code.)
So, to fix your problem, change the following line in your code (PySpark):
model = LinearRegressionWithSGD.train(parsedData, numIterations)
to
model = LinearRegressionWithSGD.train(parsedData, numIterations, intercept=True)
Although not mentioned explicitly, this is also why the code from 'selvinsource' in the other answer works. Changing the step size doesn't help much in this example.
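The question's code is Scala, where the train helper has no intercept flag, so the equivalent fix there goes through the class API (the next answer uses the same pattern):

// Scala equivalent of intercept=True
val regression = new LinearRegressionWithSGD().setIntercept(true)
val model = regression.run(parsedData)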
LinearRegressionWithSGD is SGD-based and requires tweaking the step size; see http://spark.apache.org/docs/latest/mllib-optimization.html for more details.
In your example, if you set the step size to 0.1 you get better results (MSE = 0.5).
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
// Build the model
val regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.1)
val model = regression.run(parsedData)
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
For another example on a more realistic dataset, see
https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/datasets/winequalityred_linearregression.md
https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/spark_shell_exporter/linearregression_winequalityred.scala
Using MLlib LinearRegressionWithSGD for the dummy data set (y, x1, x2), where y = (2*x1) + (3*x2) + 4, produces the wrong intercept and weights. The actual data used is:
x1 x2 y
1 0.1 6.3
2 0.2 8.6
3 0.3 10.9
4 0.6 13.8
5 0.8 16.4
6 1.2 19.6
7 1.6 22.8
8 1.9 25.7
9 2.1 28.3
10 2.4 31.2
11 2.7 34.1
I set the following input parameters and got the model outputs below:
[numIterations, step, miniBatchFraction, regParam] [intercept, [weights]]
[5,9,0.6,5] = [2.36667135839938E13, weights:[1.708772545209758E14, 3.849548062850367E13] ]
[2,default,default,default] = [-2495.5635231554793, weights:[-19122.41357929275,-4308.224496146531]]
[5,default,default,default] = [2.875191315671051E8, weights: [2.2013802074495964E9,4.9593017130199933E8]]
[20,default,default,default] = [-8.896967235537095E29, weights: [-6.811932001659158E30,-1.5346020624812824E30]]
I need to know:
How do I get the correct intercept and weights [4, [2, 3]] for the above dummy data?
Will tuning the step size help with convergence? I need to run this in an automated manner for several hundred variables, so I'm not keen to do that.
Should I scale the data? How will it help?
Below is the code used to generate these results.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object SciBenchTest {
def main(args: Array[String]): Unit = run
def run: Unit = {
val sparkConf = new SparkConf().setAppName("SparkBench")
val sc = new SparkContext(sparkConf)
// Load and parse the dummy data (y, x1, x2) for y = (2*x1) + (3*x2) + 4
// i.e. intercept should be 4, weights (2, 3)?
val data = sc.textFile("data/dummy.csv")
// LabeledPoint is (label, [features])
val parsedData = data.map { line =>
val parts = line.split(',')
val label = parts(2).toDouble
val features = Array(parts(0), parts(1)) map (_.toDouble)
LabeledPoint(label, Vectors.dense(features))
}
//parsedData.collect().foreach(x => println(x));
// Scale the features
/*val scaler = new StandardScaler(withMean = true, withStd = true)
.fit(parsedData.map(x => x.features))
val scaledData = parsedData
.map(x =>
LabeledPoint(x.label,
scaler.transform(Vectors.dense(x.features.toArray))))
scaledData.collect().foreach(x => println(x));*/
// Building the model: SGD = stochastic gradient descent
val numIterations = 20 //5
val step = 9.0 //9.0 //0.7
val miniBatchFraction = 0.6 //0.7 //0.65 //0.7
val regParam = 5.0 //3.0 //10.0
//val model = LinearRegressionWithSGD.train(parsedData, numIterations, step) //scaledData
val algorithm = new LinearRegressionWithSGD() //train(parsedData, numIterations)
algorithm.setIntercept(true)
algorithm.optimizer
//.setMiniBatchFraction(miniBatchFraction)
.setNumIterations(numIterations)
//.setStepSize(step)
//.setGradient(new LeastSquaresGradient())
//.setUpdater(new SquaredL2Updater()) //L1Updater //SimpleUpdater //SquaredL2Updater
//.setRegParam(regParam)
val model = algorithm.run(parsedData)
println(s">>>> Model intercept: ${model.intercept}, weights: ${model.weights}")
// Evaluate model on training examples
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, point.features, prediction)
}
// Print out features, actual and predicted values...
valuesAndPreds.take(10).foreach({ case (v, f, p) =>
println(s"Features: ${f}, Predicted: ${p}, Actual: ${v}")
})
}
}
As described in the documentation
https://spark.apache.org/docs/1.0.2/mllib-optimization.html
selecting the best step-size for SGD methods can often be delicate.
I would try lower values, for example:
// Build linear regression model
val regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.001)
val model = regression.run(parsedData)
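On the scaling question: standardizing the features should also help SGD converge here, since x1 and x2 are on different scales. A minimal sketch using mllib's StandardScaler, essentially the code commented out in the question:

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

// standardize each feature to zero mean and unit variance
val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(parsedData.map(_.features))
val scaledData = parsedData.map(p => LabeledPoint(p.label, scaler.transform(p.features)))
val scaledModel = regression.run(scaledData)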
Adjusting the step size did not help us much.
We used the following parameters to calculate the intercept/weights and loss with L-BFGS instead, and used those to construct a linear regression model for prediction. Thanks @selvinsource for pointing me in the correct direction.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SquaredL2Updater}
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.util.MLUtils

val data = sc.textFile("data/dummy.csv")
// LabeledPoint is (label, [features])
val parsedData = data.map { line =>
val parts = line.split(',')
val label = parts(2).toDouble
val features = Array(parts(0), parts(1)) map (_.toDouble)
(label, MLUtils.appendBias(Vectors.dense(features)))
}.cache()
val numCorrections = 5 //10//5//3
val convergenceTol = 1e-4 //1e-4
val maxNumIterations = 20 //20//100
val regParam = 0.00001 //0.1//10.0
val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
parsedData,
new LeastSquaresGradient(),//LeastSquaresGradient
new SquaredL2Updater(), //SquaredL2Updater(),SimpleUpdater(),L1Updater()
numCorrections,
convergenceTol,
maxNumIterations,
regParam,
Vectors.dense(0.0, 0.0, 0.0))//initialWeightsWithIntercept)
loss.foreach(println)
val model = new LinearRegressionModel(
Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1)),
weightsWithIntercept(weightsWithIntercept.size - 1))
println(s">>>> Model intercept: ${model.intercept}, weights: ${model.weights}")
// Evaluate model on training examples
val valuesAndPreds = parsedData.collect().map { point =>
val prediction = model.predict(Vectors.dense(point._2.apply(0), point._2.apply(1)))
(prediction, point._1)
}
// Print out predicted and actual values...
valuesAndPreds.take(10).foreach { case (predicted, actual) =>
  println(s"Predicted: ${predicted}, Actual: ${actual}")
}