Is it possible to train an XGBoost model in Python and use the saved model to predict in a Spark environment? That is, I want to train the XGBoost model using sklearn, save the model, then load the saved model in Spark and predict there. Is this possible?
edit:
Thanks all for the answers, but my question is really this: I see the issues below when I train and predict with different bindings of XGBoost.
During training I would be using XGBoost in Python, and when predicting I would be using XGBoost in MLlib.
I have to load the model saved from XGBoost in Python (e.g., an XGBoost.model file) to predict in Spark. Would this model be compatible with the predict function in MLlib?
The data input formats of XGBoost in Python and XGBoost in Spark MLlib are different. Spark takes vector-assembled input, but in Python we can feed the dataframe as is. So how do I feed the data when I am trying to predict in Spark with a model trained in Python? Can I feed the data without a vector assembler? Would the XGBoost predict function in Spark MLlib take non-vector-assembled data as input?
You can run your Python script on Spark with the spark-submit command, which executes your Python code on the cluster, and then predict the values in Spark.
You can:
1. load and munge the data using PySpark SQL,
2. bring the data to the local driver using collect/toPandas (a performance bottleneck),
3. train XGBoost on the local driver,
4. prepare the test data as an RDD,
5. broadcast the XGBoost model to each RDD partition, then predict in parallel.
All of this can live in one script that you spark-submit, but to keep things concise I recommend splitting train and test into two scripts.
Because steps 2 and 3 happen on the driver and use no cluster resources, your workers sit idle during them.
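The broadcast-and-predict step can be sketched roughly as follows. This is a minimal sketch, not a tested pipeline: it assumes the RDD holds (row_id, features) pairs and that the trained model is picklable and exposes a predict method. The names predict_partition, score_rdd, and to_matrix are made up for illustration.

```python
def predict_partition(rows, model, to_matrix=list):
    """Score one partition. `rows` is an iterator of (row_id, features) pairs.

    `to_matrix` converts the collected feature rows into whatever the model
    expects; pass e.g. numpy.asarray if the model needs an array (assumption).
    """
    ids, feats = [], []
    for row_id, features in rows:
        ids.append(row_id)
        feats.append(list(features))
    if not ids:
        return  # empty partition: yield nothing
    # One predict call per partition instead of one per row.
    preds = model.predict(to_matrix(feats))
    for row_id, pred in zip(ids, preds):
        yield row_id, pred


def score_rdd(sc, rdd, model, to_matrix=list):
    """Broadcast the driver-trained model and predict on every partition."""
    bc_model = sc.broadcast(model)  # shipped to each executor once
    return rdd.mapPartitions(
        lambda rows: predict_partition(rows, bc_model.value, to_matrix)
    )
```

With a live SparkContext `sc`, something like `score_rdd(sc, test_rdd, model).collect()` would return the (row_id, prediction) pairs; `test_rdd` here is a hypothetical RDD built in step 4.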
Here is a similar implementation of what you are looking for. I have an SO post explaining the details, as I am trying to troubleshoot the errors described in that post to get the code in the notebook working.
XGBoost Spark One Model Per Worker Integration
The idea is to train using XGBoost and then, via Spark, orchestrate each model to run on a Spark worker; predictions can then be applied via XGBoost's predict_proba() or Spark ML's predict().
Related
I have trained a model for text classification using huggingface/transformers, then I exported it using the built-in ONNX functionality.
Now, I'd like to use it for inference on millions of texts (around 100 million sentences). My idea is to put all the texts in a Spark DataFrame, then bundle the .onnx model into a Spark UDF, and run inference that way on a Spark cluster.
Is there a better way of doing this? Am I doing things "the right way"?
I am not sure if you are aware of and/or allowed to use SynapseML, due to the requirements (cf. "SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+," as of today, per the landing page), but SynapseML does have support for ONNX Inference on Spark. This could probably be the cleanest solution for you.
EDIT. Also, MLflow has support for exporting a python_function model as an Apache Spark UDF. With MLflow, you save your model in, say, the ONNX format, log/register the model via mlflow.onnx.log_model, and later retrieve it in the mlflow.pyfunc.spark_udf call via its path, i.e., models:/<model-name>/<model-version>.
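The retrieval side of the MLflow route might look like the sketch below. It assumes a model has already been registered, and the helper names (model_uri, add_predictions), the "prediction" column, and the "double" result type are illustrative choices, not anything from the question. The mlflow and pyspark imports are deferred so the helpers can be defined without either library installed.

```python
def model_uri(name, version):
    """Build a registry URI of the form models:/<model-name>/<model-version>."""
    return f"models:/{name}/{version}"


def add_predictions(spark, df, input_cols, name, version, result_type="double"):
    """Attach a prediction column to `df` using a registered model as a UDF.

    Assumes the registered model accepts the columns in `input_cols`
    and returns one value per row (assumption).
    """
    import mlflow.pyfunc                      # deferred: needs mlflow installed
    from pyspark.sql.functions import struct  # deferred: needs pyspark installed

    predict = mlflow.pyfunc.spark_udf(
        spark, model_uri(name, version), result_type=result_type
    )
    # struct(...) keeps the column names visible to the model.
    return df.withColumn("prediction", predict(struct(*input_cols)))
```

A text model exported from transformers usually expects tokenized tensors rather than raw strings, so the tokenization step would still have to happen before or inside the logged model.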
I have a PySpark job which processes input data and trains a logistic regression model. I need to somehow transfer this trained model to production code written in Java Spark. After loading the trained model, the Java code will pass features to it to get predictions.
From PySpark side, I'm using the dataframe API (spark.ml), not mllib.
Is it possible to save the trained (fitted) model to a file and read it back from the Java Spark code? If there's a better way, please let me know.
Yes, it is possible. With the single exception of SparkR, which requires additional metadata for model loading, all native ML models (custom guest language extensions notwithstanding) can be saved and loaded with an arbitrary backend.
Just save the MLWritable object on one side, using its save method or its writer (write), and load it back with a compatible MLReadable on the other side. Let's say in Python:
from pyspark.ml.feature import StringIndexer
StringIndexer(inputCol="foo", outputCol="bar").write().save("/tmp/indexer")
and in Scala
import org.apache.spark.ml.feature.StringIndexer
val indexer = StringIndexer.load("/tmp/indexer")
indexer.getInputCol
// String = foo
That being said, ML models are typically bad choices for production use, and more suitable options exist; see "How to serve a Spark MLlib model?".
Welcome to SO. Have you tried doing this? In general it should work: if you save a spark.ml model, you can load it with Spark from any language that supports Spark. In any case, logistic regression is a simple model, so you can just save its weights as an array and recreate the model in your code.
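The "save the weights and recreate it" idea can be sketched as below for the binary case. This is only an illustration: the JSON layout and the function names are made up, and it assumes you have already pulled the numbers out of the fitted model (in PySpark, something like model.coefficients.toArray() and model.intercept). The Java side would read the same file and apply the same formula.

```python
import json
import math


def save_weights(path, coefficients, intercept):
    """Write the fitted weights to a small JSON file (hypothetical layout)."""
    payload = {
        "coefficients": [float(w) for w in coefficients],
        "intercept": float(intercept),
    }
    with open(path, "w") as f:
        json.dump(payload, f)


def load_weights(path):
    """Read the weights back as (coefficients, intercept)."""
    with open(path) as f:
        payload = json.load(f)
    return payload["coefficients"], payload["intercept"]


def predict_proba(features, coefficients, intercept):
    """Probability of the positive class for one feature vector."""
    if len(features) != len(coefficients):
        raise ValueError("feature and coefficient lengths differ")
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    # Numerically stable sigmoid: avoid exp() of a large positive number.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)
```

Any consumer that can read JSON and compute a dot product can reproduce the predictions this way, without depending on Spark's model format.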
I created a very large Spark Dataframe with PySpark on my cluster, which is too big to fit into memory. I also have an autoencoder model with Keras, which takes in a Pandas dataframe (in-memory object).
What is the best way to bring those two worlds together?
I found some libraries that provide deep learning on Spark, but it seems they are only for hyperparameter tuning or, like Apache SystemML, won't support autoencoders.
I am surely not the first one to train a NN on Spark Dataframes. I have a conceptual gap here, please help!
As you mentioned, a Pandas DataFrame is an in-memory object, so training on it won't be distributed. For distributed training you have to rely on Spark DataFrames and specific third-party packages that handle the distribution.
You can find more information here:
https://docs.databricks.com/applications/machine-learning/train-model/distributed-training/index.html
I have a Keras deep learning model, and I now have to run a large dataset through it and calculate the results. The model is already trained, so training is not an issue. I tried exposing the model as a REST service and calling it from Spark, which works fine, but there is a latency cost, and for a huge dataset this is a problem. Is there an example I can use as a reference for using my Keras model in PySpark and processing the data through direct Python calls instead of REST calls?
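One common pattern for direct calls is to load the model once per partition and predict in batches. The sketch below is an assumption-laden illustration rather than a Keras-specific recipe: load_model is any zero-argument callable you supply, the model only needs a predict method that takes a batch, and the names chunks and make_partition_scorer are invented here.

```python
from itertools import islice


def chunks(iterator, size):
    """Yield lists of up to `size` items from an iterator."""
    iterator = iter(iterator)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def make_partition_scorer(load_model, batch_size=256):
    """Build a function suitable for rdd.mapPartitions.

    `load_model` is called on the executor, at most once per partition,
    so the model itself never has to be pickled with the task.
    """
    def score_partition(rows):
        model = None
        for chunk in chunks(rows, batch_size):
            if model is None:
                model = load_model()  # lazy: skipped for empty partitions
            # A real Keras model would likely need `chunk` converted to a
            # NumPy array before predict (assumption).
            for row, pred in zip(chunk, model.predict(chunk)):
                yield row, pred
    return score_partition
```

It would be used along the lines of `rdd.mapPartitions(make_partition_scorer(my_loader))`, where `my_loader` is a hypothetical function that reads the saved model from a path every executor can reach.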
I want to implement a multi-label, multi-output classification algorithm with Spark, but I am surprised that there isn't any model in Spark's machine learning libraries that can do this.
How can I do this with Spark ?
Otherwise, scikit-learn's logistic regression supports multi-label classification in its input/output, but it doesn't support huge training datasets.
To view the scikit-learn code, see the following link:
https://gist.github.com/mkbouaziz/5bdb463c99ba9da317a1495d4635d0fc
Also in Spark there is Logistic Regression that supports multilabel classification based on the api documentation. See also this.
The problem you have with scikit-learn on a huge amount of training data should disappear with Spark, given an appropriate Spark configuration.
Another approach is to use binary classifiers for each of the labels that your problem has, and get multilabel by running relevant-irrelevant predictions for that label. You can easily do that in Spark using any binary classifier.
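The one-binary-classifier-per-label approach can be sketched like this. It is a plain-Python illustration under assumptions: make_classifier is a hypothetical factory returning any object with sklearn-style fit(X, y) and predict(X) methods. Spark ML estimators return a fitted model from fit instead, so a thin adapter would be needed there.

```python
def split_labels(label_sets, all_labels):
    """Turn multi-label targets into one 0/1 target list per label."""
    return {
        label: [1 if label in labels else 0 for labels in label_sets]
        for label in all_labels
    }


def fit_per_label(features, label_sets, all_labels, make_classifier):
    """Train one independent relevant/irrelevant classifier per label."""
    targets = split_labels(label_sets, all_labels)
    models = {}
    for label in all_labels:
        clf = make_classifier()
        clf.fit(features, targets[label])
        models[label] = clf
    return models


def predict_label_sets(models, features):
    """Combine the per-label predictions back into a set of labels per row."""
    per_label = {label: clf.predict(features) for label, clf in models.items()}
    return [
        {label for label, preds in per_label.items() if preds[i] == 1}
        for i in range(len(features))
    ]
```

Because each label is trained independently, the per-label fits can run one after another on the cluster with whatever binary classifier suits the data; the trade-off is that correlations between labels are ignored.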
Indirectly, what might also help is multilabel categorization with nearest neighbors, which is also state of the art. There are some nearest-neighbor Spark extensions for this, such as Spark KNN or Spark KNN graphs.