Run LDA algorithm on Spark 2.0 - apache-spark

I use Spark 2.0.0 and I'd like to train an LDA model on a Tweets dataset. When I try to execute
val ldaModel = new LDA().setK(3).run(corpus)
I get this error
error: reference to LDA is ambiguous;
it is imported twice in the same scope by import org.apache.spark.ml.clustering.LDA and import org.apache.spark.mllib.clustering.LDA
Could someone please help me?
Thanks!

It looks like you have both of the following import statements:
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.mllib.clustering.LDA
You would need to remove one of them.
If you are using Spark ML (data frame based API), the proper syntax would be:
import org.apache.spark.ml.clustering.LDA
/*feature extraction step*/
val lda = new LDA().setK(3)
val model = lda.fit(corpus)
If you are using the RDD-based API, then you would have to write:
import org.apache.spark.mllib.clustering.LDA
/*feature extraction step*/
val lda = new LDA().setK(3)
val model = lda.run(corpus)
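If both APIs really are needed in the same file, another option is to rename one of the imports rather than remove it. This is a minimal sketch, assuming Spark 2.x is on the classpath; the alias OldLDA is just an illustrative name chosen here, not something the question uses.

```scala
// DataFrame-based API keeps the plain name.
import org.apache.spark.ml.clustering.LDA
// RDD-based API is imported under an alias (OldLDA is an arbitrary name).
import org.apache.spark.mllib.clustering.{LDA => OldLDA}

// Estimator for the DataFrame-based API; train it with fit(dataFrame).
val mlLda = new LDA().setK(3)

// RDD-based LDA; train it with run(rdd).
val rddLda = new OldLDA().setK(3)
```

With the alias in place, the compiler no longer sees two candidates for the name LDA.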

Related

Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce

The Spark version is 2.2.0 and the Scala version is 2.11. When I use the ML library, this error occurs: "Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce."
This is my code:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
val trainingData = dataSet
.select(col("features"), col("label")).cache()
val lr = new LogisticRegression()
.setMaxIter(maxIter)
.setRegParam(regParam)
.setElasticNetParam(0)
.setThreshold(threshold)
.setFitIntercept(false)
val lrModel = lr.fit(trainingData)
This has confused me for several days. Can anyone help me?
The error message is pretty clear: you are using org.apache.spark.mllib.linalg.VectorUDT (the old MLlib API), while any new API (ML) requires org.apache.spark.ml.linalg.Vector.
You omitted the part of the code where you create dataSet, but you should replace:
org.apache.spark.mllib.linalg._
imports with:
org.apache.spark.ml.linalg._
and adjust upstream code accordingly.
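If the upstream code cannot easily be changed, an existing DataFrame can also be converted after the fact. The sketch below assumes Spark 2.x and a local SparkSession. The two-row dataSet is a hypothetical stand-in for the one in the question, built with old-style vectors on purpose.

```scala
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("vector-conversion")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the question's dataSet, using old mllib vectors.
val dataSet = Seq(
  (OldVectors.dense(1.0, 0.0), 1.0),
  (OldVectors.dense(0.0, 1.0), 0.0)
).toDF("features", "label")

// Convert the named vector column from mllib.linalg to ml.linalg.
val converted = MLUtils.convertVectorColumnsToML(dataSet, "features")
converted.printSchema()
```

The converted DataFrame can then be passed to ML estimators such as LogisticRegression.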
Related:
MatchError while accessing vector column in Spark 2.0

How to load logistic regression model?

I want to train a logistic regression model using Apache Spark in Java. As a first step, I would like to train the model just once and save the model parameters (intercept and coefficients). I would then use the saved model parameters to score at a later point in time. I am able to save the model as a Parquet file using the following code:
LogisticRegressionModel trainedLRModel = logReg.fit(data);
trainedLRModel.write().overwrite().save("mypath");
When I load the model to score, I get the following error.
LogisticRegression lr = new LogisticRegression();
lr.load("//saved_model_path");
Exception in thread "main" java.lang.NoSuchMethodException: org.apache.spark.ml.classification.LogisticRegressionModel.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:325)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:215)
at org.apache.spark.ml.classification.LogisticRegression$.load(LogisticRegression.scala:672)
at org.apache.spark.ml.classification.LogisticRegression.load(LogisticRegression.scala)
Is there a way to train and save a model and then evaluate (score) later? I am using Spark ML 2.1.0 in Java.
I faced the same problem with PySpark 2.1.1. When I changed from LogisticRegression to LogisticRegressionModel, everything worked well.
LogisticRegression.load("/model/path") # does not work
LogisticRegressionModel.load("/model/path") # works
TL;DR Use LogisticRegressionModel.load.
load(path: String): LogisticRegressionModel Reads an ML instance from the input path, a shortcut of read.load(path).
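A minimal sketch of the full save and load round trip in Scala, assuming Spark 2.x and a local SparkSession. The four training rows and the temporary model path are made up for illustration.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lr-save-load")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy training data.
val data = Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, -1.0)),
  (1.0, Vectors.dense(0.1, 1.3)),
  (0.0, Vectors.dense(2.2, -0.5))
).toDF("label", "features")

// Fitting the estimator returns a LogisticRegressionModel.
val trained = new LogisticRegression().setMaxIter(10).fit(data)

// Hypothetical location; any path reachable by Spark would do.
val modelPath = java.nio.file.Files.createTempDirectory("lr-model").resolve("model").toString
trained.write.overwrite().save(modelPath)

// Load through the model class, not the estimator class.
val loaded = LogisticRegressionModel.load(modelPath)
loaded.transform(data).select("label", "prediction").show()
```

The estimator (LogisticRegression) and the fitted model (LogisticRegressionModel) are persisted as different classes, which is why the loader has to match what was saved.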
As a matter of fact, as of Spark 2.0.0, the recommended approach to use Spark MLlib, incl. LogisticRegression estimator, is using the brand new and shiny Pipeline API.
import org.apache.spark.ml.classification._
val lr = new LogisticRegression()
import org.apache.spark.ml.feature._
val tok = new Tokenizer().setInputCol("body")
val hashTF = new HashingTF().setInputCol(tok.getOutputCol).setOutputCol("features")
import org.apache.spark.ml._
val pipeline = new Pipeline().setStages(Array(tok, hashTF, lr))
// training dataset
val emails = Seq(("hello world", 1)).toDF("body", "label")
val model = pipeline.fit(emails)
model.write.overwrite.save("mypath")
val loadedModel = PipelineModel.load("mypath")

How to save IDFmodel with PySpark

I have produced an IDFModel with PySpark and ipython notebook as follows:
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
hashingTF = HashingTF() #this will be used with hashing later
txtdata_train = sc.wholeTextFiles("/home/ubuntu/folder").sortByKey() #this returns RDD of (filename, string) pairs for each file from the directory
split_data_train = txtdata_train.map(parse) #my parse function puts RDD in form I want
tf_train = hashingTF.transform(split_data_train) #creates term frequency sparse vectors for the training set
tf_train.cache()
idf_train = IDF().fit(tf_train) #makes IDFmodel, THIS IS WHAT I WANT TO SAVE!!!
tfidf_train = idf_train.transform(tf_train)
This is based on the guide at https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. I would like to save this model and load it again later in a different notebook. However, I found no information on how to do this; the closest I found is:
Save Apache Spark mllib model in python
But when I tried the suggestion in the answer
idf_train.save(sc, "/home/ubuntu/newfolder")
I get the error code
AttributeError: 'IDFModel' object has no attribute 'save'
Is there something I am missing, or is it not possible to save IDFModel objects? Thanks!
I did something like that in Scala/Java. It seems to work, but might not be very efficient. The idea is to write the model to a file as a serialized object and read it back later. Good luck! :)
import java.io.{FileNotFoundException, FileOutputStream, IOException, ObjectOutputStream}
try {
val fileOut:FileOutputStream = new FileOutputStream(savePath+"/idf.jserialized");
val out:ObjectOutputStream = new ObjectOutputStream(fileOut);
out.writeObject(idf);
out.close();
fileOut.close();
System.out.println("\nSerialization Successful... Checkout your specified output file..\n");
} catch {
case foe:FileNotFoundException => foe.printStackTrace()
case ioe:IOException => ioe.printStackTrace()
}
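The snippet above only covers the write step. Below is a sketch of the complete round trip, including reading the object back. To keep it runnable without Spark, a plain Array[Double] stands in for the model. It assumes the real model class implements java.io.Serializable, and the temporary file name is hypothetical.

```scala
import java.io.{File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for the fitted model; any Serializable object is handled the same way.
val idf: Array[Double] = Array(0.5, 1.25, 2.0)

// Hypothetical output location.
val path = File.createTempFile("idf", ".jserialized")

// Write the object with Java serialization, closing the stream even on failure.
val out = new ObjectOutputStream(new FileOutputStream(path))
try out.writeObject(idf) finally out.close()

// Read it back and cast to the expected type.
val in = new ObjectInputStream(new FileInputStream(path))
val restored = try in.readObject().asInstanceOf[Array[Double]] finally in.close()

println(restored.mkString(", "))
```

When reading a real model back, the same class versions must be on the classpath as when it was written, since Java serialization is tied to the class definition.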

How to use RandomForest in Spark Pipeline

I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model must be put in a pipeline. The official pipeline demo uses LogisticRegression as the base model, which can be instantiated with new. However, the RandomForest model cannot be instantiated with new by client code, so it seems it cannot be used with the pipeline API. I don't want to reinvent the wheel, so can anybody give some advice?
Thanks
However, the RandomForest model cannot be instantiated with new by client code, so it seems it cannot be used with the pipeline API.
Well, that is true, but you are simply trying to use the wrong class. Instead of mllib.tree.RandomForest you should use ml.classification.RandomForestClassifier. Here is an example based on the one from the MLlib docs.
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._
case class Record(category: String, features: Vector)
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))
val trainDF = trainData.map(lp => Record(lp.label.toString, lp.features)).toDF
val testDF = testData.map(lp => Record(lp.label.toString, lp.features)).toDF
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("label")
val rf = new RandomForestClassifier()
.setNumTrees(3)
.setFeatureSubsetStrategy("auto")
.setImpurity("gini")
.setMaxDepth(4)
.setMaxBins(32)
val pipeline = new Pipeline()
.setStages(Array(indexer, rf))
val model = pipeline.fit(trainDF)
model.transform(testDF)
There is one thing I couldn't figure out here. As far as I can tell, it should be possible to use labels extracted from LabeledPoints directly, but for some reason it doesn't work and pipeline.fit raises IllegalArgumentException:
RandomForestClassifier was given input with invalid label column label, without the number of classes specified.
Hence the ugly trick with StringIndexer. After applying it we get the required attributes ({"vals":["1.0","0.0"],"type":"nominal","name":"label"}), but some classes in ml seem to work just fine without it.
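Since the question was about grid search with cross-validation, here is a sketch of how a pipeline like this could be wrapped in a CrossValidator. It assumes Spark 2.x with the ml.linalg vector types and a local SparkSession. The 60-row toy dataset, the grid values and the seed are made up for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rf-grid-search")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy data: category "a" for small first feature, "b" otherwise.
val trainDF = (0 until 60).map { i =>
  (if (i < 30) "a" else "b", Vectors.dense(i.toDouble, (i % 7).toDouble))
}.toDF("category", "features")

val indexer = new StringIndexer().setInputCol("category").setOutputCol("label")
val rf = new RandomForestClassifier().setSeed(42L)
val pipeline = new Pipeline().setStages(Array(indexer, rf))

// 2 x 2 grid over number of trees and tree depth (values chosen arbitrarily).
val grid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(3, 10))
  .addGrid(rf.maxDepth, Array(2, 4))
  .build()

// The whole pipeline is the estimator, so indexing is refit on every fold.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val cvModel = cv.fit(trainDF)
println(cvModel.avgMetrics.mkString(", "))
```

cvModel.avgMetrics holds one averaged score per grid entry, and cvModel.bestModel is the pipeline refit with the best-scoring parameters.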

Labeled LDA inference in Stanford Topic Modeling Toolbox

I am using Stanford Topic Modeling Toolbox v.0.3 for doing LabeledLDA.
I was able to train a LabeledLDA model using the documentation (example-6-llda-learn.scala) provided.
How can I predict labels for a new dataset?
I tried using code similar to example-3-lda-infer.scala for inference on a new dataset, but it was not successful. Can anyone please help me with this issue?
Edit
This is the code I use for inference but it is not working:
// tells Scala where to find the TMT classes
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;
import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;
// the path of the model to load
val modelPath = file("llda-cvb0-2053ec3f-13-a69e079a-5cd58962");
//Labeled LDA model
println("Loading "+modelPath);
val model = LoadCVB0LabeledLDA(modelPath);
// Or, for a Gibbs model, use:
// val model = LoadGibbsLDA(modelPath);
// A new dataset for inference. (Here we use the same dataset
// that we trained against, but this file could be something new.)
val source = CSVFile("test_lab_lda.csv") ~> IDColumn(1);
// Test File
val text = {
source ~> // read from the source file
Column(2) ~> // select column containing text
TokenizeWith(model.tokenizer.get) // tokenize with existing model's tokenizer
}
// Base name of output files to generate
val output = file(modelPath, source.meta[java.io.File].getName.replaceAll(".csv",""));
// turn the text into a dataset ready to be used with LDA
val dataset = LabeledLDADataset(text);
val out_val=InferCVB0LabeledLDADocumentTopicDistributions(model, dataset)
CSVFile(output+"-document-topic-distributuions.csv").write(out_val);
Executing this code with java -Xmx3g -jar tmt-0.3.3.jar infer_llda.scala produces the following error:
infer_llda.scala:40: error: overloaded method value apply with alternatives:
(name: String,terms: Iterable[Iterable[String]],labels: Iterable[Iterable[String]],termIndex: Option[scalanlp.util.Index[String]],labelIndex: Option[scalanlp.util.Index[String]],tokenizer: Option[scalanlp.text.tokenize.Tokenizer])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[((Iterable[String], Iterable[String]), Int)] <and>
[ID(in method apply)](text: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],labels: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],termIndex: Option[scalanlp.util.Index[String]],labelIndex: Option[scalanlp.util.Index[String]])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[(ID(in method apply), Iterable[String], Iterable[String])] <and>
[ID(in method apply)](text: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],labels: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[(ID(in method apply), Iterable[String], Iterable[String])]
cannot be applied to (scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[String,Iterable[String]]]])
val dataset = LabeledLDADataset(text);
^
infer_llda.scala:43: error: could not find implicit value for evidence parameter of type scalanlp.serialization.TableWritable[scalanlp.collection.LazyIterable[(String, scalala.collection.sparse.SparseArray[Double])]]
CSVFile(output+"-document-topic-distributuions.csv").write(out_val);
With help from @Skarab, here is the solution to Labeled LDA learning and inference:
Learning
Inference
