How to save a Spark LogisticRegressionModel? - apache-spark

I am using MLlib 1.1.0 and struggling to find a way to save my model. The docs do not seem to mention such a feature in this version. Any ideas?

There is a save/load option:
// Save and load model
model.save(sc, "myModelPath")
val sameModel = LogisticRegressionModel.load(sc, "myModelPath")
But it was introduced in v1.3, so I am not sure it is still valid for 1.1. You could try it and upgrade if it does not work.
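If upgrading is not an option, one workaround on 1.1 is to persist the learned parameters yourself and rebuild the model on load. Here is a minimal sketch in PySpark (assuming the Python API; the same idea works in Scala via model.weights and model.intercept):
import pickle

from pyspark.mllib.classification import LogisticRegressionModel

def save_lr_model(model, path):
    # MLlib 1.1 has no model.save(), so store the learned parameters directly
    with open(path, "wb") as f:
        pickle.dump((model.weights, model.intercept), f)

def load_lr_model(path):
    # Rebuild an equivalent model from the stored weights and intercept
    with open(path, "rb") as f:
        weights, intercept = pickle.load(f)
    return LogisticRegressionModel(weights, intercept)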

Related

Are Apache Spark 2.0 parquet files incompatible with Apache Arrow?

The problem
I have written an Apache Spark DataFrame to a parquet file for a deep learning application in a Python environment. I am currently running into issues implementing basic examples of both the petastorm (following this notebook) and horovod frameworks, specifically when reading the aforementioned file. The DataFrame has the following type: DataFrame[features: array<float>, next: int, weight: int] (much like in Databricks' notebook, I had features be a VectorUDT, which I converted to an array).
In both cases, Apache Arrow throws an ArrowIOError : Invalid parquet file. Corrupt footer. error.
What I found so far
I discovered in this question and in this PR that as of version 2.0, Spark doesn't write _metadata or _common_metadata files unless spark.hadoop.parquet.enable.summary-metadata is set to true in Spark's configuration; those files are indeed missing.
I thus tried rewriting my DataFrame with that setting enabled, but there is still no _common_metadata file. What does work is explicitly passing a schema to petastorm when constructing a reader (passing schema_fields to make_batch_reader, for instance), which is a problem with horovod, as there is no such parameter in horovod.spark.keras.KerasEstimator's constructor.
How would I be able, if at all possible, either to make Spark output those files, or to have Arrow infer the schema, just like Spark seems to do?
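For reference, here is a minimal sketch of the explicit-schema workaround mentioned above for petastorm; the path and the list of column names are illustrative:
from petastorm import make_batch_reader

# Naming the columns explicitly sidesteps the schema discovery that fails
# on the missing summary-metadata files (path and fields are illustrative)
with make_batch_reader('file:///path/to/my.parquet',
                       schema_fields=['features', 'next', 'weight']) as reader:
    for batch in reader:
        pass  # each batch is a named tuple with one array per requested column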
Minimal example with horovod
# Saving df
print(spark.conf.get('spark.hadoop.parquet.enable.summary-metadata'))  # outputs 'true'
df.repartition(10).write.mode('overwrite').parquet(path)
# ...
# Training
import horovod.spark.keras as hvd
from horovod.spark.common.store import Store
from tensorflow.keras.optimizers import Adadelta  # assuming the tf.keras optimizer

model = build_model()
opti = Adadelta(learning_rate=0.015)
loss = 'sparse_categorical_crossentropy'
store = Store().create(prefix_path=prefix_path,
                       train_path=train_path,
                       val_path=val_path)
keras_estimator = hvd.KerasEstimator(
    num_proc=16,
    store=store,
    model=model,
    optimizer=opti,
    loss=loss,
    feature_cols=['features'],
    label_cols=['next'],
    batch_size=auto_steps_per_epoch,
    epochs=auto_nb_epochs,
    sample_weight_col='weight'
)
keras_model = keras_estimator.fit_on_parquet()  # Fails here with ArrowIOError
The problem is solved in pyarrow 0.14+ (issues.apache.org/jira/browse/ARROW-4723); be sure to install the updated version with pip (up until Databricks Runtime 6.5, the included version is 0.13).
Thanks to @joris' comment for pointing this out.
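As a quick sanity check that the fixed pyarrow is the one actually being imported (for instance after pip-installing it on an older Databricks runtime):
import pyarrow

# The fix for ARROW-4723 landed in pyarrow 0.14; anything older will still
# raise the 'Corrupt footer' error shown above
print(pyarrow.__version__)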

Loading a TensorFlow frozen graph (.pb) in Node.js

I saved a TensorFlow model in a frozen PB file, which is suitable for use with TensorFlow Lite.
This file can be loaded in Android and works well with the following code:
import org.tensorflow.contrib.android.TensorFlowInferenceInterface;
…
TensorFlowInferenceInterface inferenceInterface;
inferenceInterface = new TensorFlowInferenceInterface(context.getAssets(), "MODEL_FILE.pb");
Is there any way to load the frozen graph in Node.js?
I found the solution here:
1. The model must be converted to a web-friendly format, which is a JSON file.
2. It can then be loaded using '@tensorflow/tfjs' in Node.js.

CSV Input in gensim LDA via corpora.csvcorpus

I want to use the LDA implementation in gensim for topic modeling over a few thousand documents.
I am therefore using a csv file as input, in the format of a term-document matrix.
Currently, an error occurs when running the following code:
from gensim import corpora
import_path ="TDM.csv"
dictionary = corpora.csvcorpus(import_path, labels='true')
The error is the following:
dictionary = corpora.csvcorpus(import_path, labels='true')
AttributeError: module 'gensim.corpora' has no attribute 'csvcorpus'
Am I using the module correctly, and if not, where is my mistake?
Thanks in advance.
This also bugged me for quite a while.
It looks like csvcorpus is actually still in the experimental stage, as you can see in this GitHub issue: https://github.com/RaRe-Technologies/gensim/issues/1583
I would recommend going the old-fashioned way and using the csv package to read your csv file instead, as in the sketch below.
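A minimal sketch, assuming TDM.csv has a header row of term labels and one row of counts per document (transpose first if your matrix is terms-by-documents); the LdaModel call and num_topics=10 are just illustrative:
import csv

from gensim.models import LdaModel

with open("TDM.csv", newline="") as f:
    reader = csv.reader(f)
    terms = next(reader)  # header row: one label per term
    # Convert each row of counts into gensim's bag-of-words format,
    # keeping only the non-zero entries
    corpus = [[(j, int(c)) for j, c in enumerate(row) if int(c) > 0]
              for row in reader]

id2word = dict(enumerate(terms))  # map term ids back to term labels
lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=10)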
Cheers.

Keras-CNTK saving model-v2 format

I'm using CNTK as the backend for Keras. I'm trying to use a model that I trained with Keras in C++.
I have trained and saved my model with Keras, in HDF5 format. How do I now use the CNTK API to save it in their model-v2 format?
I tried this:
model = load_model('model2.h5')
cntk.ops.functions.Function.save(model, 'CNTK_model2.pb')
but I got the following error:
TypeError: save() missing 1 required positional argument: 'filename'
If TensorFlow were the backend, I would have done this:
model = load_model('model2.h5')
sess = K.get_session()
tf_saver = tf.train.Saver()
tf_saver.save(sess=sess, save_path=checkpoint_path)
How can I achieve the same thing?
As per the comments here, I was able to use this:
import cntk as C
from keras.models import load_model  # keras.backend has no load_model

keras_model = load_model('my_keras_model.h5')
C.combine(keras_model.model.outputs).save('my_cntk_model')
cntk_model = C.load_model('my_cntk_model')
You can do something like this:
model.outputs[0].save('CNTK_model2.pb')
I'm assuming here that you have called model.compile (i.e. that's the only case I have tried :-)
The reason you see this error is that Keras's CNTK backend uses a user-defined function to reshape on the batch axis, which can't be serialized. We have fixed this issue in CNTK v2.2. Please upgrade your CNTK to v2.2, and upgrade Keras to the latest master.
Please see this pull request:
https://github.com/fchollet/keras/pull/7907

Unable to serialize an Apache Spark transformer in MLeap

I use Spark 2.1.0 and Scala 2.11.8.
I am trying to build a Twitter sentiment analysis model in Apache Spark and serve it using MLeap.
Things work smoothly when I run the model without MLeap.
The problem happens only when I try to save the model in MLeap's serialization format, so that I can serve the model later with MLeap.
Here is the code that throws the error:
val modelSavePath = "/tmp/sampleapp/model-mleap/"
val pipelineConfig = json.get("PipelineConfig").get.asInstanceOf[Map[String, Any]]
val loaderConfig = json.get("LoaderConfig").get.asInstanceOf[Map[String, Any]]
val loaderPath = loaderConfig
  .get("DataLocation")
  .get
  .asInstanceOf[String]
var data = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .option("inferSchema", "true")
  .load(loaderPath)
val pipeline = Pipeline(pipelineConfig)
val model = pipeline.fit(data)
val mleapPipeline: Transformer = model
I get java.util.NoSuchElementException: key not found: org.apache.spark.ml.feature.Tokenizer on the last line.
A quick search suggested that MLeap does not support all transformers, but I was not able to find an exhaustive list.
How do I find out whether the transformers I am using are actually unsupported, or whether there is some other error?
I am one of the creators of MLeap, and we do support Tokenizer! I am curious which version of MLeap you are trying to use. I think you may be looking at an outdated codebase from TrueCar; check out our new codebase here:
https://github.com/combust/mleap
We also have fairly complete documentation here, including a full list of supported transformers:
Documentation: http://mleap-docs.combust.ml/
Transformer List: http://mleap-docs.combust.ml/core-concepts/transformers/support.html
I hope this helps, and if things still aren't working, file an issue on GitHub and we can help you debug it from there.
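For what it's worth, here is a minimal sketch of how bundling a fitted pipeline looks from PySpark with a recent MLeap (assuming the mleap pip package is installed; the bundle path is illustrative):
import mleap.pyspark  # importing these patches serializeToBundle onto Spark models
from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

# 'model' is the fitted PipelineModel from above; a transformed frame
# supplies the schema for the bundle
model.serializeToBundle("jar:file:/tmp/sampleapp/model-mleap/model.zip",
                        model.transform(data))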
