Loading a trained crossValidation model in Spark - apache-spark

I am new to Apache Spark. I trained a LogisticRegression model using a CrossValidator, like this:
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)
val cvModel = cv.fit(data)
I was able to train and test my model without any error. Then I saved the model and the pipeline using:
cvModel.save("/path-to-my-model/spark-log-reg-transfer-model")
pipeline.save("/path-to-my-pipeline/spark-log-reg-transfer-pipeline")
Up to this stage everything worked perfectly. Later, when I tried to load my model back to predict on new data points, the following error occurred:
val sameModel = PipelineModel.load("/path-to-my-model/spark-log-reg-transfer-model")
java.lang.IllegalArgumentException: requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.PipelineModel but found class name org.apache.spark.ml.tuning.CrossValidatorModel
Any idea what I may have done wrong? Thanks.

You saved a CrossValidatorModel but are trying to load it with the PipelineModel loader.
Use the loader that matches the class you saved: CrossValidatorModel for the fitted model and Pipeline for the unfitted pipeline.
val cvModel = CrossValidatorModel.load("/path-to-my-model/spark-log-reg-transfer-model")
val samePipeline = Pipeline.load("/path-to-my-pipeline/spark-log-reg-transfer-pipeline")

To load a CrossValidator, use:
val crossValidator = CrossValidator.load("/path-to-my-model/spark-log-reg-transfer-model")
To load a CrossValidatorModel (a CrossValidator becomes a CrossValidatorModel when you call fit() on it), use:
val crossValidatorModel = CrossValidatorModel.load("/path-to-my-model/spark-log-reg-transfer-model")
Since you are trying to load a model, CrossValidatorModel.load would be the correct one.
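For illustration, here is a minimal round-trip sketch of the point above: fit a CrossValidator, save the resulting CrossValidatorModel, load it back with CrossValidatorModel.load, and pull out the best fitted pipeline. The toy data, the column names "x" and "label", the regParam grid, and the temporary directory are all assumptions made up for this example; it also assumes Spark 2.x or later running in local mode.

```scala
import java.nio.file.Files

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

object CrossValidatorRoundTrip {
  // Fits, saves, and reloads a cross-validated pipeline; returns the best fitted pipeline.
  def roundTrip(spark: SparkSession): PipelineModel = {
    import spark.implicits._

    // Hypothetical toy data: the label is 1.0 when x >= 100.
    val data = (0 until 200)
      .map(i => (i.toDouble, if (i >= 100) 1.0 else 0.0))
      .toDF("x", "label")

    val assembler = new VectorAssembler().setInputCols(Array("x")).setOutputCol("features")
    val lr = new LogisticRegression()
    val pipeline = new Pipeline().setStages(Array(assembler, lr))
    val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    // Save the fitted CrossValidatorModel to a temporary location.
    val path = Files.createTempDirectory("cv-model").resolve("model").toString
    cv.fit(data).write.overwrite().save(path)

    // Load it back with the loader of the class that was saved.
    val cvModel = CrossValidatorModel.load(path)

    // The loaded model can score data directly.
    cvModel.transform(data).select("x", "prediction").show(5)

    // Because the estimator was a Pipeline, bestModel is a PipelineModel.
    cvModel.bestModel match {
      case pm: PipelineModel => pm
      case other =>
        throw new IllegalStateException(s"Unexpected best model type: ${other.getClass.getName}")
    }
  }
}
```

If only the fitted pipeline is needed later, one option is to save cvModel.bestModel on its own; a model saved that way would then be loaded with PipelineModel.load.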

Related

Is there a way to load a native XGBoost model into Spark?

Here is my scenario:
I trained an XGBoost model on a single machine and want to load it into Spark to process data. Is there a way to do it?
The official documentation shows how to train an XGBoost model with Spark and convert it into a native model, but it does not cover the reverse direction.
XGBoostClassificationModel.load only accepts the path of a model saved by the Spark version of XGBoost; passing the path of a native model raises an error.
According to github.com/dmlc/xgboost/issues/3689, there are two steps: (1) read the native booster, (2) construct the model.
That issue only resolves step 2, constructing the model; I can't find a way to read the native booster with xgboost-spark 1.0.0.
I guess loading a native model can be divided into two steps:
load native booster
create XGBModel
import ml.dmlc.xgboost4j.scala.XGBoost
val booster = XGBoost.loadModel(nativeBoostPath)
// create a bridge class according to github.com/dmlc/xgboost/issues/3689
val model = new XGBoostClassificationModelBridge("1",2, booster) // this will report error
But the second step reports an error.
I know this question is rather old, but I ran across the same thing when trying to load my native XGBoost model trained in Python over in Spark/Scala.
This seems to work for the bridge class:
import ml.dmlc.xgboost4j.scala.Booster
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel
import java.lang.reflect.Constructor
class XGBoostClassificationModelBridge(uid: String, numClasses: Int, _booster: Booster) {
  // Look up the (uid, numClasses, booster) constructor reflectively and make it accessible.
  val constructor: Constructor[XGBoostClassificationModel] =
    classOf[XGBoostClassificationModel]
      .getDeclaredConstructor(classOf[String], classOf[Int], classOf[Booster])
  constructor.setAccessible(true)
  val xgbClassificationModel: XGBoostClassificationModel =
    constructor.newInstance(uid, Int.box(numClasses), _booster)
}
I was then able to use it something like this:
import ml.dmlc.xgboost4j.scala.XGBoost
val booster = XGBoost.loadModel("/path/to/model.xgb")
val bridge = new XGBoostClassificationModelBridge(null, 20, booster)
val classifier = bridge.xgbClassificationModel
// if you need params:
classifier.set(classifier.objective, "multi:softprob")
classifier.set(classifier.missing, 0f)
Trying to set the params directly on the Booster object did not seem to work.

How to save the model after doing pipeline fit?

I wrote this code in Spark ML
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.Pipeline
val lr = new LogisticRegression()
val pipeline = new Pipeline()
  .setStages(Array(fooIndexer, fooHotEncoder, assembler, lr))
val model = pipeline.fit(training)
This code takes a long time to run. Can I save the model to HDFS after running pipeline.fit so that I don't have to refit it every time?
Edit: Also, how do I load it back from HDFS when I need to call transform on the model to make predictions?
Straight from the official documentation - saving:
// Now we can optionally save the fitted pipeline to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")
and loading:
// And load it back in during production
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
Related:
Save ML model for future usage

How to load logistic regression model?

I want to train a logistic regression model using Apache Spark in Java. As a first step, I would like to train the model just once and save the model parameters (intercept and coefficients), then use the saved parameters to score at a later point in time. I am able to save the model to a Parquet file using the following code:
LogisticRegressionModel trainedLRModel = logReg.fit(data);
trainedLRModel.write().overwrite().save("mypath");
When I load the model to score, I get the following error.
LogisticRegression lr = new LogisticRegression();
lr.load("//saved_model_path");
Exception in thread "main" java.lang.NoSuchMethodException: org.apache.spark.ml.classification.LogisticRegressionModel.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:325)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:215)
at org.apache.spark.ml.classification.LogisticRegression$.load(LogisticRegression.scala:672)
at org.apache.spark.ml.classification.LogisticRegression.load(LogisticRegression.scala)
Is there a way to train and save model and then evaluate(score) later? I am using Spark ML 2.1.0 in Java.
I faced the same problem with PySpark 2.1.1. When I changed from LogisticRegression to LogisticRegressionModel, everything worked.
LogisticRegression.load("/model/path") # does not work
LogisticRegressionModel.load("/model/path") # works
TL;DR Use LogisticRegressionModel.load.
load(path: String): LogisticRegressionModel Reads an ML instance from the input path, a shortcut of read.load(path).
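As a small sketch of that TL;DR, the following Scala example fits a model, saves it, reloads it with LogisticRegressionModel.load, and returns both so the intercept and coefficients can be compared. The four-row dataset, the hyperparameters, and the temporary directory are assumptions invented for the example, and it assumes Spark 2.x or later in local mode.

```scala
import java.nio.file.Files

import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object LrSaveLoad {
  // Returns the freshly trained model and the copy reloaded from disk.
  def roundTrip(spark: SparkSession): (LogisticRegressionModel, LogisticRegressionModel) = {
    // Hypothetical toy training data.
    val data = spark.createDataFrame(Seq(
      (0.0, Vectors.dense(0.0, 1.1)),
      (1.0, Vectors.dense(2.0, 1.0)),
      (0.0, Vectors.dense(0.1, 1.2)),
      (1.0, Vectors.dense(1.9, 0.8))
    )).toDF("label", "features")

    // fit() on the estimator (LogisticRegression) yields a LogisticRegressionModel.
    val trained = new LogisticRegression().setMaxIter(10).setRegParam(0.01).fit(data)

    val path = Files.createTempDirectory("lr-model").resolve("model").toString
    trained.write.overwrite().save(path)

    // Load with the model class, not the estimator class.
    val loaded = LogisticRegressionModel.load(path)
    (trained, loaded)
  }
}
```

After loading, loaded.coefficients and loaded.intercept expose the saved parameters, and loaded.transform(df) scores a DataFrame that has a "features" column.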
As a matter of fact, as of Spark 2.0.0, the recommended approach to use Spark MLlib, incl. LogisticRegression estimator, is using the brand new and shiny Pipeline API.
import org.apache.spark.ml.classification._
val lr = new LogisticRegression()
import org.apache.spark.ml.feature._
val tok = new Tokenizer().setInputCol("body")
val hashTF = new HashingTF().setInputCol(tok.getOutputCol).setOutputCol("features")
import org.apache.spark.ml._
val pipeline = new Pipeline().setStages(Array(tok, hashTF, lr))
// training dataset
val emails = Seq(("hello world", 1)).toDF("body", "label")
val model = pipeline.fit(emails)
model.write.overwrite.save("mypath")
val loadedModel = PipelineModel.load("mypath")

How to use RandomForest in Spark Pipeline

I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model must be put in a pipeline. The official pipeline demo uses LogisticRegression as the base model, which can be instantiated with new. However, the RandomForest model cannot be instantiated with new by client code, so it seems that RandomForest cannot be used in the pipeline API. I don't want to reinvent the wheel, so can anybody give some advice?
Thanks
However, the RandomForest model cannot be instantiated with new by client code, so it seems that RandomForest cannot be used in the pipeline API.
Well, that is true, but you are simply trying to use the wrong class. Instead of mllib.tree.RandomForest you should use ml.classification.RandomForestClassifier. Here is an example based on the one from the MLlib docs.
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._
case class Record(category: String, features: Vector)
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))
val trainDF = trainData.map(lp => Record(lp.label.toString, lp.features)).toDF
val testDF = testData.map(lp => Record(lp.label.toString, lp.features)).toDF
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")
val rf = new RandomForestClassifier()
  .setNumTrees(3)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("gini")
  .setMaxDepth(4)
  .setMaxBins(32)
val pipeline = new Pipeline()
  .setStages(Array(indexer, rf))
val model = pipeline.fit(trainDF)
model.transform(testDF)
There is one thing I couldn't figure out here. As far as I can tell it should be possible to use labels extracted from LabeledPoints directly, but for some reason it doesn't work and pipeline.fit raises an IllegalArgumentException:
RandomForestClassifier was given input with invalid label column label, without the number of classes specified.
Hence the ugly trick with StringIndexer. After applying it we get the required attributes ({"vals":["1.0","0.0"],"type":"nominal","name":"label"}), but some classes in ml seem to work just fine without them.
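Since the original goal was grid search with cross-validation, here is a hedged sketch of how a pipeline like this one might be wrapped in a CrossValidator. It assumes Spark 2.x or later (so it uses org.apache.spark.ml.linalg rather than mllib.linalg), and the toy data, the "category" values, the grid values, and the accuracy metric are all assumptions chosen for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

object RandomForestGridSearch {
  // Tunes numTrees and maxDepth of a RandomForestClassifier inside a Pipeline.
  def tune(spark: SparkSession): CrossValidatorModel = {
    // Hypothetical toy data: the category is determined by the first feature.
    val rows = (0 until 120).map { i =>
      val category = if (i % 2 == 0) "even" else "odd"
      (category, Vectors.dense((i % 2).toDouble, (i % 7).toDouble))
    }
    val data = spark.createDataFrame(rows).toDF("category", "features")

    // StringIndexer turns the string category into a numeric label column.
    val indexer = new StringIndexer().setInputCol("category").setOutputCol("label")
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setSeed(42L)
    val pipeline = new Pipeline().setStages(Array(indexer, rf))

    // 2 x 2 = 4 parameter combinations to evaluate.
    val paramGrid = new ParamGridBuilder()
      .addGrid(rf.numTrees, Array(3, 10))
      .addGrid(rf.maxDepth, Array(2, 4))
      .build()

    new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("accuracy"))
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
      .fit(data)
  }
}
```

The returned model's avgMetrics array holds one cross-validated score per parameter combination, and bestModel is the pipeline refitted with the best combination.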

Labeled LDA inference in Stanford Topic Modeling Toolbox

I am using Stanford Topic Modeling Toolbox v.0.3 for doing LabeledLDA.
I was able to train a LabeledLDA model using the documentation (example-6-llda-learn.scala) provided.
How can I predict labels for a new dataset?
I tried using code similar to example-3-lda-infer.scala for the inference on new dataset but it was not successful. Can anyone please help me with this issue?
Edit
This is the code I use for inference but it is not working:
// tells Scala where to find the TMT classes
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;
import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;
// the path of the model to load
val modelPath = file("llda-cvb0-2053ec3f-13-a69e079a-5cd58962");
//Labeled LDA model
println("Loading "+modelPath);
val model = LoadCVB0LabeledLDA(modelPath);
// Or, for a Gibbs model, use:
// val model = LoadGibbsLDA(modelPath);
// A new dataset for inference. (Here we use the same dataset
// that we trained against, but this file could be something new.)
val source = CSVFile("test_lab_lda.csv") ~> IDColumn(1);
// Test File
val text = {
  source ~> // read from the source file
  Column(2) ~> // select column containing text
  TokenizeWith(model.tokenizer.get) // tokenize with existing model's tokenizer
}
// Base name of output files to generate
val output = file(modelPath, source.meta[java.io.File].getName.replaceAll(".csv",""));
// turn the text into a dataset ready to be used with LDA
val dataset = LabeledLDADataset(text);
val out_val=InferCVB0LabeledLDADocumentTopicDistributions(model, dataset)
CSVFile(output+"-document-topic-distributuions.csv").write(out_val);
Running this code with java -Xmx3g -jar tmt-0.3.3.jar infer_llda.scala produces the following error:
infer_llda.scala:40: error: overloaded method value apply with alternatives:
(name: String,terms: Iterable[Iterable[String]],labels: Iterable[Iterable[String]],termIndex: Option[scalanlp.util.Index[String]],labelIndex: Option[scalanlp.util.Index[String]],tokenizer: Option[scalanlp.text.tokenize.Tokenizer])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[((Iterable[String], Iterable[String]), Int)] <and>
[ID(in method apply)](text: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],labels: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],termIndex: Option[scalanlp.util.Index[String]],labelIndex: Option[scalanlp.util.Index[String]])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[(ID(in method apply), Iterable[String], Iterable[String])] <and>
[ID(in method apply)](text: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],labels: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[(ID(in method apply), Iterable[String], Iterable[String])]
cannot be applied to (scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[String,Iterable[String]]]])
val dataset = LabeledLDADataset(text);
^
infer_llda.scala:43: error: could not find implicit value for evidence parameter of type scalanlp.serialization.TableWritable[scalanlp.collection.LazyIterable[(String, scalala.collection.sparse.SparseArray[Double])]]
CSVFile(output+"-document-topic-distributuions.csv").write(out_val);
With help from @Skarab, here is the solution for Labeled LDA learning and inference:
Learning
Inference
