PySpark: Save ML Model

Can someone please give an example of how you would save an ML model in PySpark?
For
ml.classification.LogisticRegressionModel
I tried the following:
model.save("path")
but it does not seem to work.

If I understand your question correctly, your method signature is incorrect.
According to the docs, you also need to pass in your SparkContext.
Docs: https://spark.apache.org/docs/1.6.1/api/python/pyspark.mllib.html?highlight=save#pyspark.mllib.classification.LogisticRegressionModel.save
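For reference, here is a minimal sketch of the mllib flow the docs describe, with a toy dataset and a placeholder path:

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext.getOrCreate()

# toy training data, just to have something to fit
data = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]), LabeledPoint(1.0, [1.0, 0.0])])
model = LogisticRegressionWithLBFGS.train(data)

# mllib models take the SparkContext as the first argument to save/load
model.save(sc, "/tmp/lr-mllib-model")
same_model = LogisticRegressionModel.load(sc, "/tmp/lr-mllib-model")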

In Spark 2.3.0, if you are using ML:
model.save("path")
Refer: Spark ML model .save
(I just ran LogisticRegression and saved it.)
But if you are using mllib, then, as the other answer suggests, use:
save(sc, path)
Refer: Spark MLLib model .save
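For completeness, a sketch of the spark.ml (2.x) flow mentioned above; the toy DataFrame and path are placeholders:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# toy training data with the expected "label"/"features" columns
df = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.0])), (1.0, Vectors.dense([1.0, 0.0]))],
    ["label", "features"])
model = LogisticRegression().fit(df)

# spark.ml models only need a path; no SparkContext argument
model.write().overwrite().save("/tmp/lr-ml-model")
same_model = LogisticRegressionModel.load("/tmp/lr-ml-model")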

Related

How to use ONNX models for inference in Spark

I have trained a model for text classification using huggingface/transformers, then I exported it using the built-in ONNX functionality.
Now, I'd like to use it for inference on millions of texts (around 100 million sentences). My idea is to put all the texts in a Spark DataFrame, then bundle the .onnx model into a Spark UDF, and run inference that way on a Spark cluster.
Is there a better way of doing this? Am I doing things "the right way"?
I am not sure if you are aware of and/or allowed to use SynapseML, due to the requirements (cf. "SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+," as of today, per the landing page), but SynapseML does have support for ONNX Inference on Spark. This could probably be the cleanest solution for you.
EDIT. Also, MLflow has support for exporting a python_function model as an Apache Spark UDF. With MLflow, you save your model in, say, the ONNX format, log/register the model via mlflow.onnx.log_model, and later retrieve it in the mlflow.pyfunc.spark_udf call via its path, i.e., models:/<model-name>/<model-version>.
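A rough sketch of the MLflow mechanics only (the model name, version, paths, and column names are placeholders; note that a transformers/ONNX model usually expects tokenized inputs such as input_ids and attention_mask rather than raw text, so the UDF is applied to those columns):

import mlflow
import mlflow.onnx
import mlflow.pyfunc
import onnx
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1) Log/register the exported ONNX model.
onnx_model = onnx.load("model.onnx")
with mlflow.start_run():
    mlflow.onnx.log_model(onnx_model, "model", registered_model_name="text-classifier")

# 2) Wrap the registered model as a Spark UDF and score a DataFrame.
predict = mlflow.pyfunc.spark_udf(spark, "models:/text-classifier/1")
df = spark.read.parquet("/data/tokenized_texts")
scored = df.withColumn("prediction", predict("input_ids", "attention_mask"))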

Spark ml: Is it possible to save trained model in PySpark and read from Java Spark code?

I have a PySpark job which processes input data and trains a logistic regression model. I need to somehow transfer this trained model to production code written in Java Spark. After loading the trained model, the Java code will pass it features to get predictions.
From PySpark side, I'm using the dataframe API (spark.ml), not mllib.
Is it possible to save the trained (fitted) model to a file and read it back from the Java Spark code? If there's a better way, please let me know.
Yes, it is possible. With the single exception of SparkR, which requires additional metadata for model loading, all native ML models (custom guest-language extensions notwithstanding) can be saved and loaded with an arbitrary backend.
Just save the MLWritable object on one side, using its save method or its writer (write), and load it back with a compatible Readable on the other side. For example, in Python:
from pyspark.ml.feature import StringIndexer
StringIndexer(inputCol="foo", outputCol="bar").write().save("/tmp/indexer")
and in Scala
import org.apache.spark.ml.feature.StringIndexer
val indexer = StringIndexer.load("/tmp/indexer")
indexer.getInputCol
// String = foo
That being said, Spark ML models are typically a poor choice for production serving, and more suitable options exist; see How to serve a Spark MLlib model?
Welcome to SO. Have you tried doing this? In general, it should work: if you save a spark.ml model, you can load it with Spark from any language that Spark supports. Anyway, logistic regression is a simple model, so you could also save its weights as an array and recreate it in your code.
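To make the cross-language handoff concrete, here is a Python-side sketch (paths and column names are placeholders) that saves a fitted PipelineModel, which the Java/Scala side can then load with org.apache.spark.ml.PipelineModel.load and apply with transform:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()
train = spark.read.parquet("/data/train")   # assumes "label", "f1", "f2" columns

# Fit the whole pipeline (feature prep + model) and save the fitted PipelineModel.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
pipeline_model = Pipeline(stages=[assembler, lr]).fit(train)
pipeline_model.write().overwrite().save("/models/lr-pipeline")

# Java side (for reference): PipelineModel.load("/models/lr-pipeline").transform(df)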

how to export scala spark CrossValidatorModel to PMML?

I have a problem exporting my model to PMML.
My model uses CrossValidatorModel to get the best params,
but when I try to export it to PMML I get an error like this:
value toPMML is not a member of org.apache.spark.ml.tuning.CrossValidatorModel
So how do I get the best model from the CrossValidatorModel and export it to PMML?
In the Spark docs, CrossValidatorModel does not have a .toPMML method.
Spark 2.3.1 and Scala 2.12.6.
Thanks
Use the JPMML-SparkML library.
Also available in the form of pyspark2pmml and sparklyr2pmml packages.
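For the CrossValidatorModel case specifically, the usual approach is to export its bestModel. A sketch in PySpark, assuming pyspark2pmml is installed with a matching JPMML-SparkML jar on the classpath, and that cv_model and train_df are your fitted CrossValidatorModel and training DataFrame:

from pyspark2pmml import PMMLBuilder

# CrossValidatorModel itself has no PMML export; its best model
# (normally a PipelineModel) is what gets exported.
best_model = cv_model.bestModel
PMMLBuilder(spark.sparkContext, train_df, best_model).buildFile("model.pmml")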

Can we update an existing model in spark-ml/spark-mllib?

We are using spark-ml to build a model from existing data. New data comes in on a daily basis.
Is there a way to read only the new data and update the existing model, without having to read all the data and retrain every time?
It depends on the model you're using, but for some models Spark does exactly what you want. You can look at StreamingKMeans, StreamingLinearRegressionWithSGD, StreamingLogisticRegressionWithSGD, and more broadly StreamingLinearAlgorithm.
To complete Florent's answer: if you are not in a streaming context, some Spark mllib models support an initialModel as a starting point for incremental updates. See KMeans or GMM, for instance.
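A small sketch of the non-streaming, warm-start idea with mllib KMeans (the paths are placeholders, and new_rdd is an assumed RDD of feature vectors for the new day's data):

from pyspark.mllib.clustering import KMeans, KMeansModel

# yesterday's model, previously saved with model.save(sc, path)
old_model = KMeansModel.load(sc, "/models/kmeans-yesterday")

# train only on the new data, starting from the previous cluster centers
new_model = KMeans.train(new_rdd, k=old_model.k, initialModel=old_model)
new_model.save(sc, "/models/kmeans-today")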

How to use MLlib in spark SQL

Lately, I've been learning about Spark SQL, and I want to know: is there any way to use MLlib in Spark SQL, like:
select mllib_methodname(some column) from tablename;
Here, the "mllib_methodname" method is an MLlib method.
Is there an example that shows how to use MLlib methods in Spark SQL?
Thanks in advance.
The new pipeline API is based on DataFrames, which are backed by SQL. See
http://spark.apache.org/docs/latest/ml-guide.html
Or you can simply register the predict method from an MLlib model as a UDF and use it in your SQL statement (a sketch follows the link below). See
http://spark.apache.org/docs/latest/sql-programming-guide.html#udf-registration-moved-to-sqlcontextudf-java--scala
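One hedged sketch of the UDF approach: load a saved mllib LogisticRegressionModel, copy its weights and intercept into plain Python values so the closure serializes cleanly, and register the scoring function for use in SQL. The model path, the table name, and the array<double> features column are all assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
from pyspark.mllib.classification import LogisticRegressionModel

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

model = LogisticRegressionModel.load(sc, "/models/lr")
weights = model.weights.toArray().tolist()
intercept = model.intercept

def predict_label(features):
    # sigmoid(margin) > 0.5 is equivalent to margin > 0
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 if margin > 0 else 0.0

spark.udf.register("predict_label", predict_label, DoubleType())
spark.sql("SELECT predict_label(features) AS prediction FROM tablename").show()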
