save() on a PySpark ML Word2Vec model is creating empty folders - python-3.x

I'm trying to save a word2vec model that I built in pyspark on spark 2.0.
word2vec_model.write().overwrite().save('filepath/word2vec')
This completes successfully and creates two sub-folders (data and metadata) under the word2vec folder, but both are empty except for an empty file named _SUCCESS.
Subsequently, loading the model fails:
w2vw = Word2Vec.load('filepath/word2vec')
with the exception: java.lang.UnsupportedOperationException: empty collection
The word2vec model itself works fine; I create it via a series of simple transformers. I'm not sure what is going wrong. My model creation code snippet:
tokenizer = Tokenizer(inputCol="input", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered_1")
customRemover = CustomRemover(inputCol="filtered_1",outputCol="filtered")
word2vec = Word2Vec(inputCol="filtered",vectorSize=100, minCount=10)
Any help would be appreciated.

I think you saved the fitted Word2VecModel rather than the Word2Vec estimator, so you must read it back with the model class:
from pyspark.ml.feature import Word2VecModel
w2vw_model = Word2VecModel.load('filepath/word2vec')
If instead you saved only the Word2Vec estimator itself, i.e. this object:
word2vec = Word2Vec(inputCol="filtered",vectorSize=100, minCount=10)
word2vec.write().overwrite().save('filepath_to_just_word2vec_not_its_model')
then you must load it with:
w2vw = Word2Vec.load('filepath_to_just_word2vec_not_its_model')
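For reference, a minimal round trip in PySpark might look roughly like this (df and the column names are placeholders standing in for your own data):
from pyspark.ml.feature import Word2Vec, Word2VecModel
# fit() on the estimator returns a fitted Word2VecModel
word2vec = Word2Vec(inputCol="filtered", outputCol="features", vectorSize=100, minCount=10)
word2vec_model = word2vec.fit(df)  # df must contain the "filtered" array<string> column
# save the fitted model, then load it back with the model class, not the estimator
word2vec_model.write().overwrite().save('filepath/word2vec')
loaded_model = Word2VecModel.load('filepath/word2vec')
loaded_model.getVectors().show(5)  # the learned word vectors survive the round trip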

Related

Is there way to load xgb native model into spark?

Here is my scenario:
I trained an XGB model on a single machine and want to load it into Spark to process data. Is there a way to do it?
The official documentation gives a way to train an xgb model with Spark and convert it into a native model, but it doesn't cover the reverse direction.
XGBoostClassificationModel.load only accepts the path of a Spark-format xgb model; passing the path of a native model raises an error.
According to github.com/dmlc/xgboost/issues/3689, the steps are: 1) read the native booster, 2) construct the model.
That issue only covers step 2 (constructing the model), but I can't find a way to read the native booster with xgboost-spark 1.0.0.
I guess the way to load a native booster can be divided into 2 steps:
load native booster
create XGBModel
import ml.dmlc.xgboost4j.scala.XGBoost
val booster = XGBoost.loadModel(nativeBoostPath)
// create a bridge class according to github.com/dmlc/xgboost/issues/3689
val model = new XGBoostClassificationModelBridge("1",2, booster) // this will report error
But the second step reports an error.
I know this question is rather old, but I ran across the same thing when trying to load my native XGBoost model trained in Python over in Spark/Scala.
This seems to work for the bridge class:
import ml.dmlc.xgboost4j.scala.Booster
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel
import java.lang.reflect.Constructor
class XGBoostClassificationModelBridge(uid: String, numClasses: Int, _booster: Booster) {
  // reflectively grab the private (uid, numClasses, booster) constructor
  val constructor: Constructor[XGBoostClassificationModel] = classOf[XGBoostClassificationModel].getDeclaredConstructor(classOf[String], classOf[Int], classOf[Booster])
  constructor.setAccessible(true)
  // instantiate an XGBoostClassificationModel that wraps the native booster
  val xgbClassificationModel: XGBoostClassificationModel = constructor.newInstance(uid, Int.box(numClasses), _booster)
}
I was then able to use it something like this:
val booster = XGBoost.loadModel("/path/to/model.xgb")
val bridge = new XGBoostClassificationModelBridge(null, 20, booster)
val classifier = bridge.xgbClassificationModel
// if you need params:
classifier.set(classifier.objective, "multi:softprob")
classifier.set(classifier.missing, 0f)
Trying to set the params directly on the Booster object did not seem to work.

Persist BERT model on disk as pickle file

I have managed to get the BERT model to work with the johnsnowlabs spark-nlp library. I am able to save the "trained model" to disk as follows.
Fit Model
df_bert_trained = bert_pipeline.fit(textRDD)
df_bert=df_bert_trained.transform(textRDD)
save model
df_bert_trained.write().overwrite().save("/home/XX/XX/trained_model")
However,
First, as per the docs here https://nlp.johnsnowlabs.com/docs/en/concepts, it's stated that one can load the model as
EmbeddingsHelper.load(path, spark, format, reference, dims, caseSensitive)
but it's unclear to me what the variable "reference" represents at this point.
Second, has anyone managed to save the BERT embeddings as a pickle file in python?
In Spark NLP, BERT comes as a pre-trained model. That means it is already a model that was trained, fitted, etc. and saved in the right format.
That being said, there is no reason to fit or save it again. You can, however, save the result once you transform your DataFrame into a new DataFrame that has BERT embeddings for each token (see the persistence sketch after the model list below).
Example:
Start a Spark Session in spark-shell with Spark NLP package
spark-shell --packages JohnSnowLabs:spark-nlp:2.4.0
import com.johnsnowlabs.nlp.annotators._
import com.johnsnowlabs.nlp.base._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
// Download and load the pretrained BERT model
val embeddings = BertEmbeddings.pretrained(name = "bert_base_cased", lang = "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
.setCaseSensitive(true)
.setPoolingLayer(0)
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentence,
tokenizer,
embeddings
))
// Test and transform
val testData = Seq(
"I like pancakes in the summer. I hate ice cream in winter.",
"If I had asked people what they wanted, they would have said faster horses"
).toDF("text")
val predictionDF = pipeline.fit(testData).transform(testData)
The predictionDF is a DataFrame that contains BERT embeddings for each token inside your dataset. The BertEmbeddings pre-trained models come from TF Hub, which means they are the exact same pre-trained weights published by Google. All 5 models are available:
bert_base_cased (en)
bert_base_uncased (en)
bert_large_cased (en)
bert_large_uncased (en)
bert_multi_cased (xx)
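On the persistence part of the question: a distributed DataFrame of embeddings is not something you would normally pickle; it is more natural to write the transformed DataFrame out with Spark's own writers and read it back later. A minimal PySpark sketch, assuming an equivalent pipeline built through the spark-nlp Python API produced a prediction_df with the same "embeddings" output column (paths are placeholders):
# persist the transformed DataFrame (tokens + BERT embeddings) as Parquet
prediction_df.write.mode("overwrite").parquet("/tmp/bert_embeddings_parquet")
# later, read it back without re-running the pipeline
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
restored_df = spark.read.parquet("/tmp/bert_embeddings_parquet")
restored_df.select("embeddings").show(truncate=False)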
Let me know if you have any questions or problems and I'll update my answer.
References:
https://github.com/JohnSnowLabs/spark-nlp
https://github.com/JohnSnowLabs/spark-nlp-models
https://github.com/JohnSnowLabs/spark-nlp-workshop

How to load logistic regression model?

I want to train a logistic regression model using Apache Spark in Java. As a first step, I would like to train the model just once and save the model parameters (intercept and coefficients), then use the saved parameters to score at a later point in time. I am able to save the model to a parquet file using the following code
LogisticRegressionModel trainedLRModel = logReg.fit(data);
trainedLRModel.write().overwrite().save("mypath");
When I load the model to score, I get the following error.
LogisticRegression lr = new LogisticRegression();
lr.load("//saved_model_path");
Exception in thread "main" java.lang.NoSuchMethodException: org.apache.spark.ml.classification.LogisticRegressionModel.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:325)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:215)
at org.apache.spark.ml.classification.LogisticRegression$.load(LogisticRegression.scala:672)
at org.apache.spark.ml.classification.LogisticRegression.load(LogisticRegression.scala)
Is there a way to train and save model and then evaluate(score) later? I am using Spark ML 2.1.0 in Java.
I faced the same problem with pyspark 2.1.1; when I changed from LogisticRegression to LogisticRegressionModel, everything worked well.
LogisticRegression.load("/model/path") # not works
LogisticRegressionModel.load("/model/path") # works well
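In other words, the estimator (LogisticRegression) and the fitted model (LogisticRegressionModel) have separate reader classes. A minimal PySpark sketch of the full round trip, with train_df/test_df as placeholders for your own DataFrames:
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
# fit() on the estimator returns a LogisticRegressionModel
lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(train_df)
# persist the fitted model
lr_model.write().overwrite().save("/model/path")
# load it back with the model class, then inspect or score
loaded = LogisticRegressionModel.load("/model/path")
print(loaded.intercept, loaded.coefficients)
predictions = loaded.transform(test_df)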
TL;DR Use LogisticRegressionModel.load.
load(path: String): LogisticRegressionModel Reads an ML instance from the input path, a shortcut of read.load(path).
As a matter of fact, as of Spark 2.0.0, the recommended way to use Spark MLlib, incl. the LogisticRegression estimator, is the brand new and shiny Pipeline API.
import org.apache.spark.ml.classification._
val lr = new LogisticRegression()
import org.apache.spark.ml.feature._
val tok = new Tokenizer().setInputCol("body")
val hashTF = new HashingTF().setInputCol(tok.getOutputCol).setOutputCol("features")
import org.apache.spark.ml._
val pipeline = new Pipeline().setStages(Array(tok, hashTF, lr))
// training dataset
val emails = Seq(("hello world", 1)).toDF("body", "label")
val model = pipeline.fit(emails)
model.write.overwrite.save("mypath")
val loadedModel = PipelineModel.load("mypath")

Loading a trained crossValidation model in Spark

I am a beginner in Apache Spark. I trained a LogisticRegression model using cross validation. For instance:
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(new BinaryClassificationEvaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(5)
val cvModel = cv.fit(data)
I was able to train and test my model without any error. Then I saved the model and the pipeline using:
cvModel.save("/path-to-my-model/spark-log-reg-transfer-model")
pipeline.save("/path-to-my-pipeline/spark-log-reg-transfer-pipeline")
Up to this stage, the operations worked perfectly. Later on, when I tried to load my model back for prediction on new data points, the following error occurred:
val sameModel = PipelineModel.load("/path-to-my-model/spark-log-reg-transfer-model")
java.lang.IllegalArgumentException: requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.PipelineModel but found class name org.apache.spark.ml.tuning.CrossValidatorModel
Any idea what I may have done wrong? Thanks.
You are trying to load a CrossValidatorModel with the PipelineModel loader.
You should use the correct loaders...
val crossValidator = CrossValidator.load("/path-to-my-model/spark-log-reg-transfer-model")
val sameModel = PipelineModel.load("/path-to-my-pipeline/spark-log-reg-transfer-pipeline")
To load a Cross Validator it should be
val crossValidator = CrossValidator.load("/path-to-my-model/spark-log-reg-transfer-model")
To load a Cross Validator Model use
(Note: A Cross Validator becomes a Cross Validator model when you call fit() on CrossValidator)
val crossValidatorModel = CrossValidatorModel.load("/path-to-my-model/spark-log-reg-transfer-model")
Since you are trying to load a model, CrossValidatorModel.load would be the correct one.
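The same distinction exists in PySpark (tuning persistence is available there from Spark 2.3 on). A minimal sketch, assuming cv, data, and new_data are the PySpark analogues of the objects in the question:
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel
# cv.fit() returns a CrossValidatorModel; that is the object you save
cv_model = cv.fit(data)
cv_model.write().overwrite().save("/path-to-my-model/spark-log-reg-transfer-model")
# load it back with CrossValidatorModel, not CrossValidator or PipelineModel
loaded_cv_model = CrossValidatorModel.load("/path-to-my-model/spark-log-reg-transfer-model")
predictions = loaded_cv_model.transform(new_data)  # scores with the best model found during tuning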

Labeled LDA inference in Stanford Topic Modeling Toolbox

I am using Stanford Topic Modeling Toolbox v.0.3 for doing LabeledLDA.
I was able to train a LabeledLDA model using the documentation (example-6-llda-learn.scala) provided.
How can I predict labels for a new dataset?
I tried using code similar to example-3-lda-infer.scala for inference on the new dataset, but it was not successful. Can anyone please help me with this issue?
Edit
This is the code I use for inference but it is not working:
// tells Scala where to find the TMT classes
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;
import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;
// the path of the model to load
val modelPath = file("llda-cvb0-2053ec3f-13-a69e079a-5cd58962");
//Labeled LDA model
println("Loading "+modelPath);
val model = LoadCVB0LabeledLDA(modelPath);
// Or, for a Gibbs model, use:
// val model = LoadGibbsLDA(modelPath);
// A new dataset for inference. (Here we use the same dataset
// that we trained against, but this file could be something new.)
val source = CSVFile("test_lab_lda.csv") ~> IDColumn(1);
// Test File
val text = {
source ~> // read from the source file
Column(2) ~> // select column containing text
TokenizeWith(model.tokenizer.get) // tokenize with existing model's tokenizer
}
// Base name of output files to generate
val output = file(modelPath, source.meta[java.io.File].getName.replaceAll(".csv",""));
// turn the text into a dataset ready to be used with LDA
val dataset = LabeledLDADataset(text);
val out_val=InferCVB0LabeledLDADocumentTopicDistributions(model, dataset)
CSVFile(output+"-document-topic-distributuions.csv").write(out_val);
Running this code with java -Xmx3g -jar tmt-0.3.3.jar infer_llda.scala produces the following error:
infer_llda.scala:40: error: overloaded method value apply with alternatives:
(name: String,terms: Iterable[Iterable[String]],labels: Iterable[Iterable[String]],termIndex: Option[scalanlp.util.Index[String]],labelIndex: Option[scalanlp.util.Index[String]],tokenizer: Option[scalanlp.text.tokenize.Tokenizer])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[((Iterable[String], Iterable[String]), Int)] <and>
[ID(in method apply)](text: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],labels: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],termIndex: Option[scalanlp.util.Index[String]],labelIndex: Option[scalanlp.util.Index[String]])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[(ID(in method apply), Iterable[String], Iterable[String])] <and>
[ID(in method apply)](text: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],labels: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[(ID(in method apply), Iterable[String], Iterable[String])]
cannot be applied to (scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[String,Iterable[String]]]])
val dataset = LabeledLDADataset(text);
^
infer_llda.scala:43: error: could not find implicit value for evidence parameter of type scalanlp.serialization.TableWritable[scalanlp.collection.LazyIterable[(String, scalala.collection.sparse.SparseArray[Double])]]
CSVFile(output+"-document-topic-distributuions.csv").write(out_val);
With help from #Skarab here is the solution to Labeled LDA learning and inference:
Learning
Inference
