How to use ONNX models for inference in Spark

I have trained a model for text classification using huggingface/transformers, then exported it using the built-in ONNX functionality.
Now I'd like to use it for inference on millions of texts (around 100 million sentences). My idea is to put all the texts in a Spark DataFrame, bundle the .onnx model into a Spark UDF, and run inference that way on a Spark cluster.
Is there a better way of doing this? Am I doing things "the right way"?

I am not sure if you are aware of and/or allowed to use SynapseML, due to the requirements (cf. "SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+," as of today, per the landing page), but SynapseML does have support for ONNX Inference on Spark. This could probably be the cleanest solution for you.
EDIT: MLflow also has support for exporting a python_function model as an Apache Spark UDF. With MLflow, you save your model in, say, the ONNX format, log/register it via mlflow.onnx.log_model, and later retrieve it in the mlflow.pyfunc.spark_udf call via its URI, e.g., models:/<model-name>/<model-version>.
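A rough sketch of that MLflow route (the model name, paths, and input column are hypothetical, and the log_model arguments can vary between MLflow versions):

import mlflow
import mlflow.onnx
import onnx
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct

# Log/register the exported ONNX model (hypothetical file and model names).
onnx_model = onnx.load("model.onnx")
with mlflow.start_run():
    mlflow.onnx.log_model(onnx_model, "model",
                          registered_model_name="text-classifier")

spark = SparkSession.builder.getOrCreate()

# Wrap the registered model as a Spark UDF; MLflow loads it on each executor.
predict_udf = mlflow.pyfunc.spark_udf(
    spark, "models:/text-classifier/1", result_type="double")

# Score a DataFrame whose columns match the model's input signature (hypothetical path/schema).
texts = spark.read.parquet("/path/to/texts")
scored = texts.withColumn("prediction", predict_udf(struct("text")))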

Related

Run on spark a weka Classification Algorithm

From what I have seen, Spark MLlib includes only a limited set of classification algorithms.
Is it possible to add the Weka jar to Spark and use its classification algorithms on Spark workers?
I know there is no parallelization benefit in doing this, but I am curious to know whether it is possible.
HamidReza,
You cannot really use the Weka jars directly to distribute your learning algorithms; their implementations are not designed for distributed systems.
Nevertheless, there is a dedicated project named distributedWekaSparkDev, though I haven't tried it yet.
You can learn more about it here: http://markahall.blogspot.com/2017/07/integrating-spark-mllib-into-weka.html

Spark ml: Is it possible to save trained model in PySpark and read from Java Spark code?

I have a PySpark job that processes input data and trains a logistic regression model. I need to transfer this trained model to production code written in Java Spark. After loading the trained model, the Java code will pass features to it to get predictions.
From PySpark side, I'm using the dataframe API (spark.ml), not mllib.
Is it possible to save the trained (fitted) model to a file and read it back from the Java Spark code? If there's a better way, please let me know.
Yes, it is possible. With the single exception of SparkR, which requires additional metadata for model loading, all native ML models (custom guest-language extensions notwithstanding) can be saved and loaded from any supported language.
Just save the MLWritable object on one side, using its save method or its writer (write), and load it back with a compatible Readable on the other side. For example, in Python:
from pyspark.ml.feature import StringIndexer
StringIndexer(inputCol="foo", outputCol="bar").write().save("/tmp/indexer")
and in Scala
import org.apache.spark.ml.feature.StringIndexer
val indexer = StringIndexer.load("/tmp/indexer")
indexer.getInputCol
// String = foo
That being said, Spark ML models are typically a poor choice for production serving, and more suitable options exist; see: How to serve a Spark MLlib model?
Welcome to SO. Have you tried doing this? In general, it should work: if you save a spark.ml model, you can load it with Spark from any language that Spark supports. In any case, logistic regression is a simple model, so you could also just save its weights as an array and recreate it in your code.
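For the fitted-model case in the question, a minimal PySpark-side sketch (paths and column names are hypothetical) could look like the following; the Java side would then load the same directory with org.apache.spark.ml.classification.LogisticRegressionModel.load:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Hypothetical training data with a "features" vector column and a "label" column.
train = spark.read.parquet("/path/to/train")

# Fit the estimator and persist the resulting *model* (the Transformer).
lr_model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
lr_model.write().overwrite().save("/tmp/lr_model")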

Can Spark and the ScalaNLP library Breeze be used together?

I'm developing a Scala-based extreme learning machine in Apache Spark. My model has to be a Spark Estimator and use the Spark framework in order to fit into the machine learning pipeline. Does anyone know if Breeze can be used in tandem with Spark? All of my data is in Spark DataFrames; conceivably I could import it using Breeze, use Breeze DenseVectors as the data structure, then convert to a DataFrame for the Estimator part. The advantage of Breeze is that it has a pinv function for the Moore-Penrose pseudo-inverse, which generalizes the inverse to non-square matrices. There is no equivalent function in Spark MLlib, as far as I can see. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames, so if anyone has experience of this it would be really useful. Thanks!
Breeze can be used with Spark. In fact, it is used internally by many MLlib functions, but the required conversions are not exposed as public API. You can add your own conversions and use Breeze to process individual records.
For example, for Vectors you can find the conversion code in:
SparseVector.asBreeze
DenseVector.asBreeze
Vector.fromBreeze
For matrices, see asBreeze / fromBreeze in Matrices.scala.
It cannot, however, be used on distributed data structures. Breeze objects use low-level native libraries that cannot be used for distributed processing. Therefore, DataFrame-to-Breeze conversions are possible only if you collect the data to the driver, and are limited to scenarios where the data fits in driver memory.
Other libraries, like Apache SystemML, integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.

PySpark with scikit-learn

I have seen that we can use the scikit-learn libraries with PySpark to work on a partition on a single worker.
But what if we want to train on a dataset that is distributed, and the regression algorithm needs to consider the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it cannot run the algorithm on the entire dataset, only on a particular partition. Please correct me if I'm wrong.
And how good is spark-sklearn at solving this problem?
As described in the documentation, spark-sklearn does answer your requirements. It can:
train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default in scikit-learn.
convert Spark's DataFrames seamlessly into numpy ndarrays or sparse matrices.
So, to specifically answer your question:
But what if we want to work on a training dataset that is distributed, and the regression algorithm should consider the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset, but only on a particular partition.
In spark-sklearn, Spark is used as a replacement for the joblib library as the parallelism framework. So going from execution on a single machine to execution on multiple machines is handled seamlessly by Spark for you. In other words, as stated in the "Auto-scaling scikit-learn with Spark" article:
no change is required in the code between the single-machine case and the cluster case.
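A rough sketch of that pattern (the GridSearchCV constructor signature may differ slightly between spark-sklearn versions):

from sklearn import datasets, svm
from spark_sklearn import GridSearchCV  # drop-in replacement for sklearn's GridSearchCV
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

digits = datasets.load_digits()
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01]}

# Same code shape as single-machine scikit-learn; Spark distributes the
# per-parameter-combination fits across the cluster instead of joblib across cores.
clf = GridSearchCV(sc, svm.SVC(), param_grid)
clf.fit(digits.data, digits.target)
print(clf.best_params_)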

Scalable invocation of Spark MLlib 1.6 predictive model w/a single data record

I have a predictive model (logistic regression) built in Spark 1.6 that has been saved to disk for later reuse with new data records. I want to invoke it from multiple clients, with each client passing in a single data record. Using a Spark job to run single records through seems to have far too much overhead and would not be very scalable (each invocation will only pass in a single set of 18 values). The MLlib API to load a saved model requires the SparkContext, though, so I am looking for suggestions on how to do this in a scalable way. Spark Streaming with Kafka input comes to mind (each client request would be written to a Kafka topic). Any thoughts on this idea or alternative suggestions?
Non-distributed models (in practice the majority) from o.a.s.mllib don't require an active SparkContext for single-item predictions. If you check the API docs you'll see that LogisticRegressionModel provides a predict method with signature Vector => Double. This means you can serialize the model using standard Java tools, read it back later, and run predictions on a local o.a.s.mllib.Vector object.
Spark also provides limited PMML support (not for logistic regression), so you can share your models with any other library that supports this format.
Finally, non-distributed models are usually not that complex. For linear models all you need is the intercept, the coefficients, some basic math functions, and a linear algebra library (if you want decent performance).
o.a.s.ml models are slightly harder to handle, but there are some external tools that try to address that. You can check the related discussion on the developers list (Deploying ML Pipeline Model) for details.
For distributed models there is really no good workaround; you'll have to start a full job on a distributed dataset one way or another.
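As a rough illustration of the linear-model route mentioned above, here is a sketch of Spark-free scoring for a binary logistic regression; the export step, file names, intercept value, and the 18-feature shape are all hypothetical:

import math
import numpy as np

# One-time export, done inside a Spark job, e.g.:
#   model = LogisticRegressionModel.load(sc, "/path/to/model")
#   np.save("weights.npy", model.weights.toArray())  # plus record model.intercept
weights = np.load("weights.npy")   # shape (18,) for the 18-value records
intercept = 0.0                    # replace with the exported intercept

def predict(features):
    """Score a single record with no Spark dependency."""
    margin = float(np.dot(weights, features)) + intercept
    prob = 1.0 / (1.0 + math.exp(-margin))
    return 1.0 if prob >= 0.5 else 0.0

print(predict(np.random.rand(18)))  # stand-in for one client request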
