Persist BERT model on disk as pickle file - apache-spark

I have managed to get the BERT model working with the johnsnowlabs spark-nlp library, and I am able to save the trained model to disk as follows.
Fit the model:
df_bert_trained = bert_pipeline.fit(textRDD)
df_bert=df_bert_trained.transform(textRDD)
Save the model:
df_bert_trained.write().overwrite().save("/home/XX/XX/trained_model")
However, two things are unclear.
First, per the docs here https://nlp.johnsnowlabs.com/docs/en/concepts, it's stated that one can load the model as
EmbeddingsHelper.load(path, spark, format, reference, dims, caseSensitive)
but it's unclear to me what the parameter "reference" represents at this point.
Second, has anyone managed to save the BERT embeddings as a pickle file in Python?

In Spark NLP, BERT comes as a pre-trained model: it is a model that has already been trained, fit, and saved in the right format.
That being said, there is no reason to fit or save it again. You can, however, save the result once you transform your DataFrame into a new DataFrame that has BERT embeddings for each token.
Example:
Start a Spark session in spark-shell with the Spark NLP package:
spark-shell --packages JohnSnowLabs:spark-nlp:2.4.0
import com.johnsnowlabs.nlp.annotators._
import com.johnsnowlabs.nlp.base._
// Pipeline (and BertEmbeddings, if not already in scope) need explicit imports in spark-shell
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

// Download and load the pretrained BERT model
val embeddings = BertEmbeddings.pretrained(name = "bert_base_cased", lang = "en")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(true)
  .setPoolingLayer(0)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentence,
    tokenizer,
    embeddings
  ))
// Test and transform
val testData = Seq(
  "I like pancakes in the summer. I hate ice cream in winter.",
  "If I had asked people what they wanted, they would have said faster horses"
).toDF("text")

val predictionDF = pipeline.fit(testData).transform(testData)
The predictionDF is a DataFrame that contains BERT embeddings for each token in your dataset. The BertEmbeddings pre-trained models come from TF Hub, which means they are the exact same pre-trained weights published by Google. All five models are available:
bert_base_cased (en)
bert_base_uncased (en)
bert_large_cased (en)
bert_large_uncased (en)
bert_multi_cased (xx)
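If you want to persist these results rather than the pipeline, a minimal sketch would be to write the transformed DataFrame out, for example as Parquet (the output path below is illustrative):
// Persist the DataFrame that holds the BERT embeddings (path is illustrative)
predictionDF.write.mode("overwrite").parquet("/home/XX/XX/bert_embeddings")
// Later, read it back without re-running the pipeline
val savedEmbeddingsDF = spark.read.parquet("/home/XX/XX/bert_embeddings")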
Let me know if you have any questions or problems and I'll update my answer.
References:
https://github.com/JohnSnowLabs/spark-nlp
https://github.com/JohnSnowLabs/spark-nlp-models
https://github.com/JohnSnowLabs/spark-nlp-workshop

Related

How to load logistic regression model?

I want to train a logistic regression model using Apache Spark in Java. As a first step I would like to train the model just once and save the model parameters (intercept and coefficients), then use the saved model to score at a later point in time. I am able to save the model in a parquet file, using the following code
LogisticRegressionModel trainedLRModel = logReg.fit(data);
trainedLRModel.write().overwrite().save("mypath");
When I load the model to score, I get the following error.
LogisticRegression lr = new LogisticRegression();
lr.load("//saved_model_path");
Exception in thread "main" java.lang.NoSuchMethodException: org.apache.spark.ml.classification.LogisticRegressionModel.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:325)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:215)
at org.apache.spark.ml.classification.LogisticRegression$.load(LogisticRegression.scala:672)
at org.apache.spark.ml.classification.LogisticRegression.load(LogisticRegression.scala)
Is there a way to train and save model and then evaluate(score) later? I am using Spark ML 2.1.0 in Java.
I faced the same problem with pyspark 2.1.1. When I changed from LogisticRegression to LogisticRegressionModel, everything worked well.
LogisticRegression.load("/model/path")       # does not work
LogisticRegressionModel.load("/model/path")  # works well
TL;DR Use LogisticRegressionModel.load.
load(path: String): LogisticRegressionModel Reads an ML instance from the input path, a shortcut of read.load(path).
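A minimal sketch of the fix in Scala, using the path from the question:
import org.apache.spark.ml.classification.LogisticRegressionModel
// Load the fitted model with the model class, not the LogisticRegression estimator
val trainedLRModel = LogisticRegressionModel.load("//saved_model_path")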
As a matter of fact, as of Spark 2.0.0 the recommended approach to using Spark MLlib, including the LogisticRegression estimator, is the Pipeline API.
import org.apache.spark.ml.classification._
val lr = new LogisticRegression()
import org.apache.spark.ml.feature._
val tok = new Tokenizer().setInputCol("body")
val hashTF = new HashingTF().setInputCol(tok.getOutputCol).setOutputCol("features")
import org.apache.spark.ml._
val pipeline = new Pipeline().setStages(Array(tok, hashTF, lr))
// training dataset
val emails = Seq(("hello world", 1)).toDF("body", "label")
val model = pipeline.fit(emails)
model.write.overwrite.save("mypath")
val loadedModel = PipelineModel.load("mypath")
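To score later, a minimal sketch would be to load the saved PipelineModel and call transform on new data (the newEmails DataFrame below is illustrative):
// Score previously unseen data with the persisted pipeline model
val newEmails = Seq("hello again world").toDF("body")
val scored = loadedModel.transform(newEmails)
scored.select("body", "prediction").show()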

save() on a Pyspark ML Word2vec model is creating empty folders

I'm trying to save a word2vec model that I built in pyspark on spark 2.0.
word2vec_model.write().overwrite().save('filepath/word2vec')
This successfully finishes and creates two sub-folders (data and metadata) under the folder word2vec, but both sub-folders are empty except for an empty file titled _SUCCESS.
And subsequently the load fails.
w2vw = Word2Vec.load('filepath/word2vec')
with the exception: java.lang.UnsupportedOperationException: empty collection
The word2vec model itself works fine and I create it via a series of simple transformers. I'm not sure what is going wrong. My model creation code snippet:
tokenizer = Tokenizer(inputCol="input", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered_1")
customRemover = CustomRemover(inputCol="filtered_1",outputCol="filtered")
word2vec = Word2Vec(inputCol="filtered",vectorSize=100, minCount=10)
Any help would be appreciated.
I think you saved the fitted Word2VecModel rather than the Word2Vec estimator, so
you must read it back with Word2VecModel, as in the code below:
from pyspark.ml.feature import Word2VecModel
w2vw_model = Word2VecModel.load('filepath/word2vec')
And if you saved only the Word2Vec estimator, i.e. this object:
word2vec = Word2Vec(inputCol="filtered",vectorSize=100, minCount=10)
word2vec.write().overwrite().save('filepath_to_just_word2vec_not_its_model')
then you must load it with:
w2vw = Word2Vec.load('filepath_to_just_word2vec_not_its_model')

How to increase the accuracy of neural network model in spark?

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
// Load training data
val data = MLUtils.loadLibSVMFile(sc,"/home/.../neural.txt").toDF()
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
val layers = Array[Int](4, 5, 4, 4)
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
val model = trainer.fit(train)
// compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))
I am using MultilayerPerceptronClassifier to build a neural network in Spark. I am getting 62.5% accuracy. Which parameters should I change to get better accuracy?
As some people have said, the question is too broad and can't be answered without more detail, but some advice (independent of the models/algorithms used or the tools and libraries implementing them) would be:
Use a cross-validation set and perform cross-validation with different network architectures (see the sketch at the end of this answer).
Plot "Learning curves"
Identify if you are having high bias or high variance
See if you can or need to apply feature scaling and/or normalization.
Do some "Error Analysis"(manually verify which examples failed and evaluate or categorize them to see if you can find a pattern)
Not neccesarily in that order, but that could help you identify if you have underfitting, overfitting, if you need more training data, add or remove features, add regularization, etc. In summary , perform machine learning debugging.
Hope that helps. You can find more in-depth details about this in Andrew Ng's series of videos, starting with this one:
https://www.youtube.com/watch?v=qIfLZAa32H0
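As an illustration of the cross-validation point above, here is a minimal sketch using Spark ML's CrossValidator and ParamGridBuilder to compare a few architectures (assuming Spark 2.x; the layer sizes and grid values are illustrative, and train/test are the DataFrames from the question):
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val trainer = new MultilayerPerceptronClassifier()
  .setBlockSize(128)
  .setSeed(1234L)

// Candidate architectures: 4 inputs and 4 output classes, as in the question
val paramGrid = new ParamGridBuilder()
  .addGrid(trainer.layers, Array(Array(4, 5, 4, 4), Array(4, 10, 8, 4), Array(4, 16, 16, 4)))
  .addGrid(trainer.maxIter, Array(100, 300))
  .build()

val evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy")

val cv = new CrossValidator()
  .setEstimator(trainer)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

val cvModel = cv.fit(train)
println("Best average CV accuracy: " + cvModel.avgMetrics.max)
println("Held-out test accuracy: " + evaluator.evaluate(cvModel.transform(test)))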

Loading a trained crossValidation model in Spark

I am a beginner in Apache Spark. I trained a LogisticRegression model using cross-validation. For instance:
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

val cvModel = cv.fit(data)
I was able to train and test my model without any error. Then I saved the model and the pipeline using:
cvModel.save("/path-to-my-model/spark-log-reg-transfer-model")
pipeline.save("/path-to-my-pipeline/spark-log-reg-transfer-pipeline")
Up to this stage, the operations worked perfectly. Later on, I tried to load my model back for prediction on new data points, and the following error occurred:
val sameModel = PipelineModel.load("/path-to-my-model/spark-log-reg-transfer-model")
java.lang.IllegalArgumentException: requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.PipelineModel but found class name org.apache.spark.ml.tuning.CrossValidatorModel
Any idea what I may have done wrong? Thanks.
You are trying to load a CrossValidator with a PipelineModel object.
You should use the correct loaders:
val crossValidator = CrossValidator.load("/path-to-my-model/spark-log-reg-transfer-model")
val sameModel = PipelineModel.load("/path-to-my-pipeline/spark-log-reg-transfer-pipeline")
To load a Cross Validator it should be
val crossValidator = CrossValidator.load("/path-to-my-model/spark-log-reg-transfer-model")
To load a Cross Validator Model use
(Note: a CrossValidator becomes a CrossValidatorModel when you call fit() on it.)
val crossValidatorModel = CrossValidatorModel.load("/path-to-my-model/spark-log-reg-transfer-model")
Since you are trying to load a model, CrossValidatorModel.load would be the correct one.
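For completeness, a minimal sketch of loading the saved model and reusing it (the newData DataFrame is illustrative):
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.tuning.CrossValidatorModel

val cvModel = CrossValidatorModel.load("/path-to-my-model/spark-log-reg-transfer-model")

// Score new data directly with the cross-validated model...
val predictions = cvModel.transform(newData)

// ...or pull out the best fitted pipeline for inspection
val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]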

Labeled LDA inference in Stanford Topic Modeling Toolbox

I am using the Stanford Topic Modeling Toolbox v0.3 for Labeled LDA.
I was able to train a LabeledLDA model using the documentation (example-6-llda-learn.scala) provided.
How can I predict labels for a new dataset?
I tried using code similar to example-3-lda-infer.scala for inference on the new dataset, but it was not successful. Can anyone please help me with this issue?
Edit
This is the code I use for inference, but it is not working:
// tells Scala where to find the TMT classes
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;
import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;
// the path of the model to load
val modelPath = file("llda-cvb0-2053ec3f-13-a69e079a-5cd58962");
//Labeled LDA model
println("Loading "+modelPath);
val model = LoadCVB0LabeledLDA(modelPath);
// Or, for a Gibbs model, use:
// val model = LoadGibbsLDA(modelPath);
// A new dataset for inference. (Here we use the same dataset
// that we trained against, but this file could be something new.)
val source = CSVFile("test_lab_lda.csv") ~> IDColumn(1);
// Test File
val text = {
  source ~>                          // read from the source file
  Column(2) ~>                       // select column containing text
  TokenizeWith(model.tokenizer.get)  // tokenize with existing model's tokenizer
}
// Base name of output files to generate
val output = file(modelPath, source.meta[java.io.File].getName.replaceAll(".csv",""));
// turn the text into a dataset ready to be used with LDA
val dataset = LabeledLDADataset(text);
val out_val=InferCVB0LabeledLDADocumentTopicDistributions(model, dataset)
CSVFile(output+"-document-topic-distributuions.csv").write(out_val);
Running this code with java -Xmx3g -jar tmt-0.3.3.jar infer_llda.scala produces the following error:
infer_llda.scala:40: error: overloaded method value apply with alternatives:
(name: String,terms: Iterable[Iterable[String]],labels: Iterable[Iterable[String]],termIndex: Option[scalanlp.util.Index[String]],labelIndex: Option[scalanlp.util.Index[String]],tokenizer: Option[scalanlp.text.tokenize.Tokenizer])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[((Iterable[String], Iterable[String]), Int)] <and>
[ID(in method apply)](text: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],labels: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],termIndex: Option[scalanlp.util.Index[String]],labelIndex: Option[scalanlp.util.Index[String]])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[(ID(in method apply), Iterable[String], Iterable[String])] <and>
[ID(in method apply)](text: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]],labels: scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[ID(in method apply),Iterable[String]]]])edu.stanford.nlp.tmt.model.llda.LabeledLDADataset[(ID(in method apply), Iterable[String], Iterable[String])]
cannot be applied to (scalanlp.stage.Parcel[scalanlp.collection.LazyIterable[scalanlp.stage.Item[String,Iterable[String]]]])
val dataset = LabeledLDADataset(text);
^
infer_llda.scala:43: error: could not find implicit value for evidence parameter of type scalanlp.serialization.TableWritable[scalanlp.collection.LazyIterable[(String, scalala.collection.sparse.SparseArray[Double])]]
CSVFile(output+"-document-topic-distributuions.csv").write(out_val);
With help from @Skarab, here is the solution to Labeled LDA learning and inference:
Learning
Inference
