How to use RandomForest in Spark Pipeline - apache-spark

I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model has to be put into a pipeline; the official pipeline demo uses LogisticRegression as the base model, which can be created with new as an object. However, the RandomForest model cannot be created with new by client code, so it doesn't seem possible to use RandomForest with the pipeline API. I don't want to reinvent the wheel, so can anybody give some advice?
Thanks

However, the RandomForest model cannot be created with new by client code, so it doesn't seem possible to use RandomForest with the pipeline API.
Well, that is true, but you are simply trying to use the wrong class. Instead of mllib.tree.RandomForest you should use ml.classification.RandomForestClassifier. Here is an example based on the one from the MLlib docs.
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._

case class Record(category: String, features: Vector)

// load the sample data and split it into training and test sets
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))

val trainDF = trainData.map(lp => Record(lp.label.toString, lp.features)).toDF
val testDF = testData.map(lp => Record(lp.label.toString, lp.features)).toDF

// index the string category into the numeric "label" column expected by the classifier
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val rf = new RandomForestClassifier()
  .setNumTrees(3)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("gini")
  .setMaxDepth(4)
  .setMaxBins(32)

val pipeline = new Pipeline()
  .setStages(Array(indexer, rf))

val model = pipeline.fit(trainDF)
model.transform(testDF)
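Since the question is ultimately about grid search and cross-validation, here is a minimal sketch of how this pipeline could be wrapped in a CrossValidator; the grid values and the MulticlassClassificationEvaluator are just illustrative choices, not tuned or required settings.
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// grid over a couple of RandomForestClassifier parameters (values are arbitrary)
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(3, 10, 30))
  .addGrid(rf.maxDepth, Array(4, 8))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("label"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainDF)
cvModel.transform(testDF)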
There is one thing I couldn't figure out in the example above. As far as I can tell it should be possible to use the labels extracted from the LabeledPoints directly, but for some reason it doesn't work and pipeline.fit raises an IllegalArgumentException:
RandomForestClassifier was given input with invalid label column label, without the number of classes specified.
Hence the ugly trick with StringIndexer. After applying it we get the required attributes ({"vals":["1.0","0.0"],"type":"nominal","name":"label"}), but some classes in ml seem to work just fine without them.
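If you want to skip StringIndexer entirely, the missing piece is the ML attribute metadata on the label column. Below is a hedged sketch (untested here, assuming the spark.ml attribute API and the trainData RDD from above) of attaching that metadata to the raw numeric labels so RandomForestClassifier can infer the number of classes:
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.functions.col

// DataFrame built straight from the LabeledPoint labels and features
val rawTrainDF = trainData.map(lp => (lp.label, lp.features)).toDF("label", "features")

// declare "label" as a nominal column with two known values,
// which is what RandomForestClassifier needs to determine numClasses
val labelMeta = NominalAttribute.defaultAttr
  .withName("label")
  .withValues("0.0", "1.0")
  .toMetadata()

val trainWithMeta = rawTrainDF.withColumn("label", col("label").as("label", labelMeta))
// new Pipeline().setStages(Array(rf)).fit(trainWithMeta) should then work without the indexer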

Related

Add new fitted stage to an existing PipelineModel without fitting again

I would like to concatenate several trained Pipelines into one, which is similar to
"Spark add new fitted stage to an existing PipelineModel without fitting again"; however, the solution below is for PySpark.
> pipe_model_new = PipelineModel(stages = [pipe_model , pipe_model2])
> final_df = pipe_model_new.transform(df1)
In Apache Spark 2.0 the constructor of "PipelineModel" is marked as private, so it cannot be called from outside, while in the "Pipeline" class only the "fit" method creates a "PipelineModel".
val pipelineModel = new PipelineModel("randomUID", trainedStages)
val df_final_full = pipelineModel.transform(df)
Error:(266, 26) constructor PipelineModel in class PipelineModel cannot be accessed in class Preprocessor
val pipelineModel = new PipelineModel("randomUID", trainedStages)
There is nothing* wrong with using Pipeline and invoking its fit method. If a stage is a Transformer (and PipelineModel is**), fit works like identity.
You can check relevant Python:
if isinstance(stage, Transformer):
    transformers.append(stage)
    dataset = stage.transform(dataset)
and Scala code:
case t: Transformer =>
  t
This means that the fitting process will only validate the schema and create a new PipelineModel object.
* The only possible concern is the presence of non-lazy Transformers; with the exception of the deprecated OneHotEncoder, though, the Spark core API doesn't provide any.
** In Python:
from pyspark.ml import Transformer, PipelineModel
issubclass(PipelineModel, Transformer)
True
and in Scala:
import scala.reflect.runtime.universe.typeOf
import org.apache.spark.ml._
typeOf[PipelineModel] <:< typeOf[Transformer]
Boolean = true
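Putting it together, a minimal Scala sketch of the concatenation (assuming pipe_model and pipe_model2 are already-fitted PipelineModels and df1 has a schema they both accept):
import org.apache.spark.ml.{Pipeline, PipelineStage}

// both stages are already Transformers, so fit only validates the schema
val combined = new Pipeline()
  .setStages(Array[PipelineStage](pipe_model, pipe_model2))
  .fit(df1)

val finalDF = combined.transform(df1)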

How to load logistic regression model?

I want to train a logistic regression model using Apache Spark in Java. As a first step I would like to train the model just once and save the model parameters (intercept and coefficients). I would then use the saved model parameters to score at a later point in time. I am able to save the model to a parquet file using the following code:
LogisticRegressionModel trainedLRModel = logReg.fit(data);
trainedLRModel.write().overwrite().save("mypath");
When I load the model to score, I get the following error.
LogisticRegression lr = new LogisticRegression();
lr.load("//saved_model_path");
Exception in thread "main" java.lang.NoSuchMethodException: org.apache.spark.ml.classification.LogisticRegressionModel.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:325)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:215)
at org.apache.spark.ml.classification.LogisticRegression$.load(LogisticRegression.scala:672)
at org.apache.spark.ml.classification.LogisticRegression.load(LogisticRegression.scala)
Is there a way to train and save model and then evaluate(score) later? I am using Spark ML 2.1.0 in Java.
I faced the same problem with pyspark 2.1.1; when I changed from LogisticRegression to LogisticRegressionModel, everything worked well.
LogisticRegression.load("/model/path")       # does not work
LogisticRegressionModel.load("/model/path")  # works well
TL;DR Use LogisticRegressionModel.load.
load(path: String): LogisticRegressionModel Reads an ML instance from the input path, a shortcut of read.load(path).
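In Scala terms the fix is a one-liner; here is a minimal sketch, reusing the "mypath" location and the data DataFrame from the question:
import org.apache.spark.ml.classification.LogisticRegressionModel

// load the fitted model (not the LogisticRegression estimator) and score with it
val trainedLRModel = LogisticRegressionModel.load("mypath")
trainedLRModel.transform(data).show()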
As a matter of fact, as of Spark 2.0.0, the recommended approach to using Spark MLlib, including the LogisticRegression estimator, is the brand new and shiny Pipeline API.
import org.apache.spark.ml.classification._
val lr = new LogisticRegression()
import org.apache.spark.ml.feature._
val tok = new Tokenizer().setInputCol("body")
val hashTF = new HashingTF().setInputCol(tok.getOutputCol).setOutputCol("features")
import org.apache.spark.ml._
val pipeline = new Pipeline().setStages(Array(tok, hashTF, lr))
// training dataset
val emails = Seq(("hello world", 1)).toDF("body", "label")
val model = pipeline.fit(emails)
model.write.overwrite.save("mypath")
val loadedModel = PipelineModel.load("mypath")
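Once reloaded, the whole fitted pipeline scores new data in a single call; a quick usage sketch (newEmails is just a made-up DataFrame with the same body column):
// score previously unseen data with the reloaded pipeline
val newEmails = Seq(("hello again", 0)).toDF("body", "label")
loadedModel.transform(newEmails).select("body", "prediction").show()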

Run LDA algorithm on Spark 2.0

I use Spark 2.0.0 and I'd like to train an LDA model on a Tweets dataset; when I try to execute
val ldaModel = new LDA().setK(3).run(corpus)
I get this error
error: reference to LDA is ambiguous;
it is imported twice in the same scope by import org.apache.spark.ml.clustering.LDA and import org.apache.spark.mllib.clustering.LDA
Could someone please help me?
Thanks!
It looks like you have both of the following import statements:
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.mllib.clustering.LDA
You would need to remove one of them.
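Alternatively, if you really need both APIs in the same file, you can keep both imports and rename one of them at the import site (the MLlibLDA alias below is just an arbitrary name):
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.mllib.clustering.{LDA => MLlibLDA}

val lda = new LDA().setK(3)              // DataFrame-based API
// val rddLda = new MLlibLDA().setK(3)   // RDD-based API, if ever needed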
If you are using Spark ML (the DataFrame-based API), the proper syntax would be:
import org.apache.spark.ml.clustering.LDA
/*feature extraction step*/
val lda = new LDA().setK(3)
val model = lda.fit(corpus)
If you are using the RDD-based API, then you would have to write:
import org.apache.spark.mllib.clustering.LDA
/*feature extraction step*/
val lda = new LDA().setK(3)
val model = lda.run(corpus)

How to get Precision/Recall using CrossValidator for training NaiveBayes Model using Spark

Suppose I have a Pipeline like this:
val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val idf = new IDF().setInputCol("features").setOutputCol("idffeatures")
val nb = new org.apache.spark.ml.classification.NaiveBayes()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(nb.smoothing, Array(0.01, 0.1, 1)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator()).setEstimatorParamMaps(paramGrid).setNumFolds(10)
val cvModel = cv.fit(df)
As you can see I defined a CrossValidator using a BinaryClassificationEvaluator. I have seen a lot of examples of getting metrics like Precision/Recall during the testing process, but those metrics are obtained when you use a separate set of data for testing purposes (see for example this documentation).
From my understanding, CrossValidator is going to create folds, and one fold will be used for testing purposes; then CrossValidator will choose the best model. My question is: is it possible to get Precision/Recall metrics during the training process?
Well, the only metric which is actually stored is the one you define when you create an instance of an Evaluator. For BinaryClassificationEvaluator this can take one of two values:
areaUnderROC
areaUnderPR
with the former being the default; it can be set using the setMetricName method.
These values are collected during the training process and can be accessed using CrossValidatorModel.avgMetrics. The order of the values corresponds to the order of the EstimatorParamMaps (CrossValidatorModel.getEstimatorParamMaps).
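A short sketch of reading those averaged metrics back after fitting, pairing each parameter combination with its cross-validated score (continuing from the cvModel above):
// one averaged metric per parameter combination, in the same order as the param maps
cvModel.getEstimatorParamMaps
  .zip(cvModel.avgMetrics)
  .foreach { case (paramMap, metric) =>
    println(s"$paramMap -> $metric")
  }

// the winning model (a fitted PipelineModel) is exposed as well
val bestPipeline = cvModel.bestModel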

Labeled LDA inference in Stanford Topic Modeling Toolbox

I am using the Stanford Topic Modeling Toolbox v0.3 for Labeled LDA.
I was able to train a LabeledLDA model using the provided documentation (example-6-llda-learn.scala).
How can I predict labels for a new dataset?
I tried using code similar to example-3-lda-infer.scala for inference on the new dataset, but it was not successful. Can anyone please help me with this issue?
Edit
This is the code I use for inference but it is not working:
// tells Scala where to find the TMT classes
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;
import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;
// the path of the model to load
val modelPath = file("llda-cvb0-2053ec3f-13-a69e079a-5cd58962");
//Labeled LDA model
println("Loading "+modelPath);
val model = LoadCVB0LabeledLDA(modelPath);
// Or, for a Gibbs model, use:
// val model = LoadGibbsLDA(modelPath);
// A new dataset for inference. (Here we use the same dataset
// that we trained against, but this file could be something new.)
val source = CSVFile("test_lab_lda.csv") ~> IDColumn(1);
// Test File
val text = {
  source ~>                             // read from the source file
  Column(2) ~>                          // select column containing text
  TokenizeWith(model.tokenizer.get)     // tokenize with existing model's tokenizer
}
// Base name of output files to generate
val output = file(modelPath, source.meta[java.io.File].getName.replaceAll(".csv",""));
// turn the text into a dataset ready to be used with LDA
val dataset = LabeledLDADataset(text);
val out_val=InferCVB0LabeledLDADocumentTopicDistributions(model, dataset)
CSVFile(output+"-document-topic-distributuions.csv").write(out_val);
Upon execution as java -Xmx3g -jar tmt-0.3.3.jar infer_llda.scala, this code produces the following error:
infer_llda.scala:40: error: overloaded method value apply with alternatives:
(name: String,terms: Iterable[Iterable[String]],labels: Iterable[Iterable[String]],termIndex: Option[scalanlp.util.Index[String]],labelIndex: Option[scalanlp.util.Index[String]],tokenizer: Option[scalanlp.text.tokenize.Tokenizer])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[((Iterable[String], Iterable[String]), Int)] <and>
[ID(in method apply)](text: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],labels: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],termIndex: Option[scalanlp.util.Index[String]],labelIndex: Option[scalanlp.util.Index[String]])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[(ID(in method apply), Iterable[String], Iterable[String])] <and>
[ID(in method apply)](text: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],labels: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[(ID(in method apply), Iterable[String], Iterable[String])]
cannot be applied to (scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[String,Iterable[String]]]])
val dataset = LabeledLDADataset(text);
^
infer_llda.scala:43: error: could not find implicit value for evidence parameter of type scalanlp.serialization.TableWritable[scalanlp.collection.LazyIterable[(String, scalala.collection.sparse.SparseArray[Double])]]
CSVFile(output+"-document-topic-distributuions.csv").write(out_val);
With help from @Skarab, here is the solution to Labeled LDA learning and inference:
Learning
Inference
