MLeap and Spark ML SQLTransformer - apache-spark

I am trying to serialize a PySpark ML model to MLeap.
However, the model makes use of the SQLTransformer to do some column-based transformations, e.g. adding log-scaled versions of some columns.
As we all know, MLeap doesn't support SQLTransformer - see here:
https://github.com/combust/mleap/issues/126
so I've implemented the first of these two suggestions:
For non-row operations, move the SQL out of the ML Pipeline that you plan to serialize.
For row-based operations, use the available ML transformers or write a custom transformer <- this is where the custom transformer documentation will help.
I've externalized the SQL transformation on the training data used to build the model, and I do the same for the input data when I run the model for evaluation.
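For reference, the externalized transformation looks roughly like the sketch below (the column names are hypothetical); the important detail is that exactly the same function is applied before both training and evaluation.
# Hypothetical sketch of the externalized log-scaling step.
from pyspark.sql import functions as F

def add_log_columns(df):
    # must be applied identically to training and evaluation data
    return df.withColumn("amount_log", F.log1p("amount"))

train_prepared = add_log_columns(train_df)   # used to fit the pipeline
eval_prepared = add_log_columns(eval_df)     # used when running the model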
The problem I'm having is that I'm unable to obtain the same results across the 2 models.
Model 1 - Pure Spark ML model containing SQLTransformer plus the later transformations: StringIndexer -> OneHotEncoderEstimator -> VectorAssembler -> RandomForestClassifier
Model 2 - Externalized version, with the SQL queries run on the training data before building the model. The pipeline contains everything after SQLTransformer in Model 1: StringIndexer -> OneHotEncoderEstimator -> VectorAssembler -> RandomForestClassifier
I'm wondering how I could go about debugging this problem. Is there a way to somehow compare the results after each stage to see where the differences show up?
Any suggestions are appreciated.
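One way to approach this, sketched below on the assumption that both variants are available as fitted PipelineModels: run the same evaluation rows through each fitted stage in turn and diff the intermediate DataFrames to find the first stage where the outputs diverge.
# Sketch: collect the intermediate output of every fitted stage so the two
# pipelines can be compared step by step (assumes fitted PipelineModels).
def stage_outputs(pipeline_model, df):
    outputs = []
    for stage in pipeline_model.stages:      # fitted transformers, in order
        df = stage.transform(df)
        outputs.append((stage.uid, df))
    return outputs

# outs_1 = stage_outputs(model_1, eval_df)            # SQLTransformer inside
# outs_2 = stage_outputs(model_2, eval_df_prepared)   # SQL externalized
# Pair up the corresponding stages (skipping Model 1's SQLTransformer) and
# compare, e.g. with df_1.exceptAll(df_2).count() or a join on an id column.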

Related

Pyspark ML CrossValidator evaluate several evaluators

In sklearn's GridSearchCV we can give the model different scorings, and with the refit param we refit one of them using the best found parameters on the whole dataset.
Is there any way to do something similar with CrossValidator from the ML package in pyspark?
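For context, CrossValidator accepts a single evaluator that drives model selection; a common workaround, sketched here with illustrative names and parameters, is to select with one metric and score the chosen model with the others afterwards.
# Sketch (illustrative names): CrossValidator optimizes one evaluator; other
# metrics can be computed on the selected model afterwards.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=3)
# cv_model = cv.fit(train_df)                        # selection uses areaUnderROC
# f1 = MulticlassClassificationEvaluator(metricName="f1") \
#          .evaluate(cv_model.bestModel.transform(test_df))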

How do you implement a model built using sklearn pipeline in pyspark?

I would like to use a model I built with an sklearn pipeline in pyspark. The pipeline takes care of imputation, scaling, one-hot encoding and Random Forest classification. I tried broadcasting the model and using a pandas UDF to predict; it did not work, I got a Py4JJavaError.
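For reference, the broadcast-plus-pandas-UDF pattern generally looks like the sketch below (feature names are hypothetical); a frequent cause of a Py4JJavaError here is a missing or incompatible sklearn on the workers, or a model that fails to pickle.
# Sketch (hypothetical column names): score an already-fitted sklearn pipeline
# inside Spark with a scalar pandas UDF. Requires sklearn on every worker.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
# sklearn_pipeline: the fitted sklearn Pipeline, trained or loaded elsewhere
bc_model = spark.sparkContext.broadcast(sklearn_pipeline)

@pandas_udf("double")
def predict_udf(age: pd.Series, income: pd.Series) -> pd.Series:
    X = pd.DataFrame({"age": age, "income": income})  # same columns as at fit time
    return pd.Series(bc_model.value.predict(X))

# scored = df.withColumn("prediction", predict_udf("age", "income"))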

Training in Python and Deploying in Spark

Is it possible to train an XGBoost model in python and use the saved model to predict in a spark environment? That is, I want to be able to train the XGBoost model using sklearn and save the model, then load the saved model in spark and predict in spark. Is this possible?
Edit:
Thanks all for the answers, but my question is really this. I see the below issues when I train and predict with different bindings of XGBoost.
During training I would be using XGBoost in python, and when predicting I would be using XGBoost in mllib.
I have to load the saved model from XGBoost python (e.g. an XGBoost.model file) to be used for prediction in spark. Would this model be compatible with the predict function in mllib?
The data input formats of XGBoost in python and XGBoost in spark mllib are different. Spark takes a vector-assembled format, but with python we can feed the dataframe as such. So how do I feed the data when I am trying to predict in spark with a model trained in python? Can I feed the data without a vector assembler? Would the XGBoost predict function in spark mllib take non-vector-assembled data as input?
You can run your python script on spark using the spark-submit command; that way your python code runs on spark and you can predict the values there.
You can:
load/munge the data using pyspark sql,
then bring the data to the local driver using collect/toPandas (a performance bottleneck),
then train xgboost on the local driver,
then prepare the test data as an RDD,
then broadcast the xgboost model to each RDD partition and predict the data in parallel (see the sketch below).
This can all be in one script that you spark-submit, but to keep things more concise I would recommend splitting train/test into two scripts.
Because steps 2 and 3 happen at driver level, not using any cluster resources, your workers are not doing anything.
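A minimal sketch of those steps (table and column names are illustrative); it assumes the xgboost Python package is installed on the driver and on every worker.
import pandas as pd
import xgboost as xgb
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# steps 1-2: load with Spark SQL, then pull the (small enough) training set to the driver
train_pdf = spark.table("training_data").toPandas()
X, y = train_pdf.drop("label", axis=1), train_pdf["label"]

# step 3: train locally on the driver
model = xgb.XGBClassifier().fit(X, y)

# steps 4-5: broadcast the model and score each partition in parallel
bc_model = spark.sparkContext.broadcast(model)
test_df = spark.table("test_data")
test_cols = test_df.columns          # plain list, safe to capture in the closure
feature_cols = X.columns.tolist()

def predict_partition(rows):
    pdf = pd.DataFrame(list(rows), columns=test_cols)
    return iter(bc_model.value.predict(pdf[feature_cols]).tolist())

predictions = test_df.rdd.mapPartitions(predict_partition)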
Here is a similar implementation of what you are looking for. I have a SO post explaining the details, as I am trying to troubleshoot the errors described in the post to get the code in the notebook working.
XGBoost Spark One Model Per Worker Integration
The idea is to train using xgboost and then, via spark, orchestrate each model to run on a spark worker; predictions can then be applied via xgboost predict_proba() or spark ml predict().

Spark ml: Is it possible to save trained model in PySpark and read from Java Spark code?

I have a PySpark job which processes input data and trains a logistic regression model. I need to somehow transfer this trained model to production code which is written in Java Spark. After loading this trained model from Java code, it will pass features to get predictions from the model.
From PySpark side, I'm using the dataframe API (spark.ml), not mllib.
Is it possible to save the trained (fitted) model to a file and read it back from the Java Spark code? If there's a better way, please let me know.
Yes, it is possible. With the single exception of SparkR, which requires additional metadata for model loading, all native ML models (custom guest language extensions notwithstanding) can be saved and loaded with an arbitrary backend.
Just save the MLWritable object on one side, using its save method or its writer (write), and load it back with a compatible Readable on the other side. Let's say in Python:
from pyspark.ml.feature import StringIndexer
StringIndexer(inputCol="foo", outputCol="bar").write().save("/tmp/indexer")
and in Scala
import org.apache.spark.ml.feature.StringIndexer
val indexer = StringIndexer.load("/tmp/indexer")
indexer.getInputCol
// String = foo
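The same pattern applies to fitted models, which is what the question is ultimately about; a minimal PySpark sketch (the pipeline, columns and path are assumptions), whose Java counterpart is org.apache.spark.ml.PipelineModel.load:
# Sketch: saving a fitted model rather than a bare Estimator (names are illustrative).
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train_df)  # train_df assumed to exist
model.write().overwrite().save("/tmp/lr_pipeline_model")
# Java side: PipelineModel.load("/tmp/lr_pipeline_model")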
That being said, ML models are typically bad choices for production use, and more suitable options exist - How to serve a Spark MLlib model?
Welcome to SO. Have you tried doing this? In general it should work: if you save a spark.ml model, you can load it with Spark from any language that supports Spark. Anyway, logistic regression is a simple model, so you can just save its weights as an array and recreate it in your code.
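To make that last suggestion concrete, a minimal sketch assuming a fitted binary LogisticRegressionModel named lr_model:
# Sketch: export the fitted model's parameters manually and reapply them elsewhere.
weights = lr_model.coefficients.toArray().tolist()   # list of floats
intercept = lr_model.intercept
# Serialize these (e.g. as JSON); on the Java side the prediction is then just
# p = 1 / (1 + exp(-(dot(weights, features) + intercept)))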

Spark: Can I tune a pipeline with 2 estimators simultaneously?

I have a flow (pipeline in Spark) like this:
I have a DataFrame A, which has strings
Create a Word2Vec estimator
Create a Word2VecModel transformer
Apply the Word2VecModel to DataFrame A, to create a DataFrame B, which has vectors
Create a KMeans estimator
Create a KMeansModel transformer
Apply the KMeansModel to DataFrame B, for clustering
In this flow, we have 2 estimators and 2 transformer models, so we would need 2 pipelines and would have to tune each pipeline separately.
But can we do the tuning in one pipeline? I have no idea how to do it, so which method is the best way to tune my flow?
Edit:
In Spark ML, the input to pipeline components is only a dataframe, and the output is a dataframe or a transformer. But if we chain 2 estimators in 1 pipeline, the output from estimator 1 will be a transformer, so it seems you cannot continue to chain estimator 2 in the same pipeline (which accepts only a dataframe as input). So is there any trick for tuning 2 estimators?
There is no conflict here. A Spark ML Pipeline can contain an arbitrary number of Estimators. All you have to do is ensure that the output column names are unique.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.Word2Vec
val word2vec: Word2Vec = ???          // vectorizes the string column of DataFrame A
word2vec.setOutputCol("word2vec_output")
val kmeans: KMeans = ???              // clusters the Word2Vec vectors
kmeans.setFeaturesCol("word2vec_output")
kmeans.setPredictionCol("k_means_prediction")
new Pipeline().setStages(Array(word2vec, kmeans))  // stages run in the given order
However, since different models typically require different feature engineering steps, this is not very useful in practice.
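To actually tune both estimators in that single pipeline, one parameter grid can span the parameters of both stages; a hedged PySpark sketch (column names and grid values are illustrative):
# Sketch (illustrative columns/values): one CrossValidator tuning both stages.
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import Word2Vec
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

word2vec = Word2Vec(inputCol="tokens", outputCol="word2vec_output")
kmeans = KMeans(featuresCol="word2vec_output", predictionCol="k_means_prediction")
pipeline = Pipeline(stages=[word2vec, kmeans])

grid = (ParamGridBuilder()
        .addGrid(word2vec.vectorSize, [50, 100])   # Word2Vec parameters
        .addGrid(kmeans.k, [5, 10, 20])            # KMeans parameters
        .build())

evaluator = ClusteringEvaluator(featuresCol="word2vec_output",
                                predictionCol="k_means_prediction")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
# cv_model = cv.fit(df)   # df needs an array<string> column named "tokens"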
