How to train a neural network autoencoder (Keras) on Spark dataframe - apache-spark

I created a very large Spark DataFrame with PySpark on my cluster, which is too big to fit into memory. I also have an autoencoder model built with Keras, which takes a Pandas DataFrame (an in-memory object) as input.
What is the best way to bring those two worlds together?
I found some libraries that provide deep learning on Spark, but it seems they are only for hyperparameter tuning, or they don't support autoencoders (e.g. Apache SystemML).
I am surely not the first one to train a neural network on Spark DataFrames. I have a conceptual gap here, please help!

As you mentioned, a Pandas DataFrame in Spark is an in-memory object, so training on it won't be distributed. For distributed training you have to stay with Spark DataFrames and rely on specific third-party packages that handle the distributed training.
You can find more information here:
https://docs.databricks.com/applications/machine-learning/train-model/distributed-training/index.html
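For example, Petastorm (one of the approaches covered in that documentation) can stream a Spark DataFrame into Keras in batches, so the full dataset never has to fit in memory on a single machine. A minimal sketch, assuming a Spark DataFrame df with a single array<float> column named "features", an already compiled Keras autoencoder model, and a placeholder cache directory:

    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    # Petastorm materializes the DataFrame as Parquet under a cache directory
    # and streams it back as a tf.data.Dataset.
    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
                   "file:///tmp/petastorm_cache")

    converter = make_spark_converter(df)

    with converter.make_tf_dataset(batch_size=256) as dataset:
        # For an autoencoder the input is also the target.
        ae_dataset = dataset.map(lambda batch: (batch.features, batch.features))
        model.fit(ae_dataset, steps_per_epoch=1000, epochs=10)

Note that this distributes the data loading, but the fit itself still runs on one node; for making the training itself distributed, the linked Databricks page also covers options such as HorovodRunner and spark-tensorflow-distributor.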

Related

How to use ONNX models for inference in Spark

I have trained a model for text classification using huggingface/transformers, then I exported it using the built-in ONNX functionality.
Now, I'd like to use it for inference on millions of texts (around 100 million sentences). My idea is to put all the texts in a Spark DataFrame, then bundle the .onnx model into a Spark UDF, and run inference that way, on a Spark cluster.
Is there a better way of doing this? Am I doing things "the right way"?
I am not sure if you are aware of and/or allowed to use SynapseML, due to the requirements (cf. "SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+," as of today, per the landing page), but SynapseML does have support for ONNX Inference on Spark. This could probably be the cleanest solution for you.
EDIT. Also, MLflow has support for exporting a python_function model as an Apache Spark UDF. With MLflow, you save your model in, say, the ONNX format, log/register the model via mlflow.onnx.log_model, and later retrieve it in the mlflow.pyfunc.spark_udf call via its path, i.e., models:/<model-name>/<model-version>.
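A minimal sketch of that MLflow route, assuming the exported ONNX model is in onnx_model, the registered model name "text-classifier" is a placeholder, and texts_df is the Spark DataFrame of sentences with a column matching the model's input signature:

    import mlflow
    import mlflow.onnx
    import mlflow.pyfunc
    from pyspark.sql import functions as F

    # 1) Log/register the exported ONNX model (onnx_model is an onnx.ModelProto).
    with mlflow.start_run():
        mlflow.onnx.log_model(onnx_model, artifact_path="model",
                              registered_model_name="text-classifier")

    # 2) Wrap the registered model as a Spark UDF via the pyfunc flavor.
    predict_udf = mlflow.pyfunc.spark_udf(
        spark,
        model_uri="models:/text-classifier/1",
        result_type="double",
    )

    # 3) Score the sentences in parallel on the cluster.
    scored = texts_df.withColumn("prediction", predict_udf(F.struct("text")))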

Training in Python and Deploying in Spark

Is it possible to train an XGBoost model in Python and use the saved model to predict in a Spark environment? That is, I want to be able to train the XGBoost model using sklearn, save the model, load the saved model in Spark, and predict in Spark. Is this possible?
Edit:
Thanks all for the answers, but my question is really this: I see the issues below when I train and predict with different bindings of XGBoost.
During training I would be using XGBoost in Python, and when predicting I would be using XGBoost in MLlib.
I have to load the saved model from XGBoost Python (e.g. an XGBoost.model file) to be used for prediction in Spark; would this model be compatible with the predict function in MLlib?
The data input formats of XGBoost in Python and of XGBoost in Spark MLlib are different. Spark takes a vector-assembled format, but with Python we can feed the DataFrame as is. So how do I feed the data when I am trying to predict in Spark with a model trained in Python? Can I feed the data without a vector assembler? Would the XGBoost predict function in Spark MLlib take non-vector-assembled data as input?
You can run your Python script on Spark using the spark-submit command so that your Python code runs on the cluster, and then you can predict the values in Spark.
You can:
load/munge the data using PySpark SQL,
then bring the data to the local driver using collect/toPandas (a performance bottleneck),
then train XGBoost on the local driver,
then prepare the test data as a distributed DataFrame/RDD,
then broadcast the XGBoost model to each partition and predict the data in parallel.
This can all be done in one script that you spark-submit, but to keep things concise I would recommend splitting train and test into two scripts.
Because steps 2 and 3 happen at the driver level, without using any cluster resources, your workers are not doing anything there. A sketch of this pattern is shown below.
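A minimal sketch of that flow, assuming Spark 3.x; the table names, feature columns, and label column are placeholders, and the "broadcast and predict in parallel" step is shown here with a pandas UDF rather than raw RDD code:

    import pandas as pd
    import xgboost as xgb
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    feature_cols = ["f1", "f2", "f3"]          # placeholder feature columns

    # Steps 1-3: load with Spark SQL, collect to the driver, train locally.
    train_pdf = spark.table("training_data").toPandas()     # driver-side bottleneck
    model = xgb.XGBClassifier().fit(train_pdf[feature_cols].values,
                                    train_pdf["label"].values)

    # Step 4: broadcast the fitted model to the executors.
    bc_model = spark.sparkContext.broadcast(model)

    # Step 5: predict in parallel with a pandas UDF (one call per Arrow batch).
    @F.pandas_udf(DoubleType())
    def score(f1: pd.Series, f2: pd.Series, f3: pd.Series) -> pd.Series:
        features = pd.concat([f1, f2, f3], axis=1).values
        return pd.Series(bc_model.value.predict_proba(features)[:, 1])

    test_df = spark.table("test_data")
    scored = test_df.withColumn("score", score(*feature_cols))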
Here is a similar implementation of what you are looking for. I have an SO post explaining the details, as I am trying to troubleshoot the errors described in that post to get the code in the notebook working.
XGBoost Spark One Model Per Worker Integration
The idea is to train using XGBoost, then use Spark to orchestrate each model onto a Spark worker; predictions can then be made via XGBoost's predict_proba() or Spark ML's predict().

PySpark with scikit-learn

I have seen that we can use scikit-learn libraries with PySpark to work on a partition on a single worker.
But what if we want to work on a training dataset that is distributed, and say the regression algorithm should be concerned with the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset, but only on that particular partition. Please correct me if I'm wrong.
And how good is spark-sklearn at solving this problem?
As described in the documentation, spark-sklearn does answer your requirements:
train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default in scikit-learn.
convert Spark's DataFrames seamlessly into numpy ndarrays or sparse matrices.
So, to specifically answer your questions:
But what if we want to work on a training dataset that is distributed, and say the regression
algorithm should be concerned with the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset, but only on that particular partition.
In spark-sklearn, Spark is used as a replacement for the joblib library as a multithreading framework. So going from an execution on a single machine to an execution on multiple machines is handled seamlessly by Spark for you. In other terms, as stated in the Auto-scaling scikit-learn with Spark article:
no change is required in the code between the single-machine case and the cluster case.
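For concreteness, a minimal sketch of what spark-sklearn's distributed grid search looks like; the estimator and parameter grid are just illustrative, and spark-sklearn is an older project, so the exact API may vary by version:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from spark_sklearn import GridSearchCV     # distributed analog of sklearn's GridSearchCV

    X, y = make_regression(n_samples=10_000, n_features=20, random_state=0)
    param_grid = {"n_estimators": [50, 100, 200], "max_depth": [4, 8, None]}

    # Each parameter combination is fitted as a Spark task; the training data
    # itself is shipped to every task, not sharded across them.
    search = GridSearchCV(sc, RandomForestRegressor(random_state=0), param_grid, cv=3)
    search.fit(X, y)
    print(search.best_params_)

In other words, it distributes model selection (many fits with different hyperparameters), not the fitting of a single model on data too large for one machine.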

How to re-train models on new batches only (without taking the previous training dataset) in Spark Streaming?

I'm trying to write my first recommendations model (Spark 2.0.2) and I would like to know whether it is possible, after the initial training in which the model processes all of my RDD, to work with just a delta for future training sessions.
Let me explain through an example:
1. The first batch performs the first training session, with the whole RDD (200000 elements), when the system starts.
2. At the end of the training the model is saved.
3. A second batch application (Spark Streaming) loads the previously saved model and listens on a Kinesis queue.
4. When a new element arrives, the second batch should perform a training (in delta mode?!) without loading all of the 200000 elements again, but just with the model and the new element.
5. At the end of the training the updated model is saved.
The question is: is it possible to execute step 4 in some way?
My understanding is that it is only possible with machine learning algorithms that are designed to support streaming training like StreamingKMeans or StreamingLogisticRegressionWithSGD.
Quoting their documentation:
(StreamingLogisticRegressionWithSGD) trains or predicts a logistic regression model on streaming data. Training uses Stochastic Gradient Descent to update the model based on each new batch of incoming data from a DStream (see LogisticRegressionWithSGD for model equation)
StreamingKMeans provides methods for configuring a streaming k-means analysis, training the model on streaming, and using the model to make predictions on streaming data.
What worries me about these algorithms is that they live in the RDD-based org.apache.spark.mllib packages (org.apache.spark.mllib.clustering and org.apache.spark.mllib.classification, respectively), which are now in maintenance mode, as they are RDD-based rather than DataFrame-based. I don't know if there are JIRAs to retrofit them onto DataFrames.
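For reference, a minimal PySpark sketch of the StreamingKMeans route; the source directory, batch interval, and dimensionality are placeholders. The point is that the model is updated from each incoming mini-batch, so the original 200000 elements never need to be replayed:

    from pyspark.streaming import StreamingContext
    from pyspark.mllib.clustering import StreamingKMeans
    from pyspark.mllib.linalg import Vectors

    ssc = StreamingContext(sc, 10)             # 10-second micro-batches

    # Each line is expected to be a comma-separated feature vector.
    training_stream = (ssc.textFileStream("hdfs:///tmp/training")
                          .map(lambda line: Vectors.dense([float(x) for x in line.split(",")])))

    model = (StreamingKMeans(k=5, decayFactor=1.0)
             .setRandomCenters(dim=3, weight=0.0, seed=42))

    # Incremental update on every new batch -- no need to reload old data.
    model.trainOn(training_stream)

    ssc.start()
    ssc.awaitTermination()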

Spark Multi Label classification

I am looking to implement, with Spark, a multi-label classification algorithm with multiple outputs, but I am surprised that there isn't any model in Spark's machine learning libraries that can do this.
How can I do this with Spark ?
Otherwise, scikit-learn's LogisticRegression supports multi-label classification in input/output, but it doesn't scale to a huge amount of training data.
To view the scikit-learn code, see the following link:
https://gist.github.com/mkbouaziz/5bdb463c99ba9da317a1495d4635d0fc
Spark also has LogisticRegression, which, according to the API documentation, supports multilabel classification.
The problem you have in scikit-learn with the huge amount of training data will disappear with Spark, given an appropriate Spark configuration.
Another approach is to use a binary classifier for each of the labels that your problem has, and get multilabel predictions by running a relevant/irrelevant prediction for each label. You can easily do that in Spark using any binary classifier (a sketch is shown below).
Indirectly, what might also be of help is multilabel categorization with nearest neighbours, which is also state of the art; there are some nearest-neighbour Spark extensions, like Spark KNN or Spark KNN graphs, for instance.
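A minimal sketch of the binary-classifier-per-label (binary relevance) approach with Spark ML; the feature/label column names and the DataFrames train_df and test_df are placeholders:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    feature_cols = ["f1", "f2", "f3"]                 # placeholder features
    label_cols = ["label_a", "label_b", "label_c"]    # one 0/1 column per label

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    train = assembler.transform(train_df)

    # Fit one independent binary classifier per label.
    models = {}
    for label in label_cols:
        lr = LogisticRegression(featuresCol="features", labelCol=label,
                                predictionCol=f"pred_{label}",
                                rawPredictionCol=f"raw_{label}",
                                probabilityCol=f"prob_{label}")
        models[label] = lr.fit(train)

    # At prediction time, a row receives every label whose classifier says "relevant".
    scored = assembler.transform(test_df)
    for label, model in models.items():
        scored = model.transform(scored)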
