Unable to serialize logistic regression in mleap - apache-spark

java.lang.AssertionError: assertion failed: This op only supports binary logistic regression
I am trying to serialize a Spark pipeline with MLeap.
I am using Tokenizer, HashingTF and LogisticRegression in my pipeline.
When I try to serialize the pipeline I get the error above.
Here is the code I am using to serialize the pipeline:
val pipeline = Pipeline(pipelineConfig)
val model = pipeline.fit(data)
(for(bf <- managed(BundleFile("jar:file:/tmp/abc.model.twitter.zip"))) yield {
model.writeBundle.format(SerializationFormat.Json).save(bf).get
}).tried.get
sc.stop()
According to the documentation, logistic regression is supported by MLeap, so I am totally clueless about what I might be doing wrong here.

yashdosi,
MLeap defaults to supporting Spark 2.0 (sorry, this isn't well documented). In Spark 2.0 only binary logistic regression was supported; multinomial logistic regression was introduced in 2.1. Because MLeap is meant to support 2.0.0 and up, we have built in a mechanism for selecting which version of Spark you are using (currently MLeap supports 2.0 and 2.1, but defaults to 2.0).
Try adding this line to the application.conf file in your resources directory; it lets MLeap know to use the Spark 2.1 transformers when serializing:
// application.conf in src/main/resources
ml.combust.mleap.spark.registry.default = ${ml.combust.mleap.spark.registry.v21}

Related

Training in Python and Deploying in Spark

Is it possible to train an XGBoost model in Python and use the saved model to predict in a Spark environment? That is, I want to train the XGBoost model using sklearn, save the model, then load the saved model in Spark and predict there. Is this possible?
edit:
Thanks all for the answers, but my question is really this: I see the issues below when I train and predict with different bindings of XGBoost.
During training I would be using XGBoost in Python, and when predicting I would be using XGBoost in MLlib.
I have to load the saved model from Python XGBoost (e.g. an XGBoost.model file) to predict with in Spark. Would this model be compatible with the predict function in MLlib?
The data input formats of XGBoost in Python and XGBoost in Spark MLlib are different. Spark takes a vector-assembled format, but with Python we can feed the DataFrame as is. So how do I feed the data when I am trying to predict in Spark with a model trained in Python? Can I feed the data without VectorAssembler? Would the XGBoost predict function in Spark MLlib take non-vector-assembled data as input?
You can run your Python script on Spark using the spark-submit command; that way your Python code runs on Spark and you can do the prediction there.
You can:
1. load/munge the data using PySpark SQL,
2. bring the data to the local driver using collect()/toPandas() (the performance bottleneck),
3. train XGBoost on the local driver,
4. prepare the test data as an RDD,
5. broadcast the XGBoost model to each RDD partition, then predict the data in parallel.
This can all be in one script that you spark-submit, but to keep things concise I recommend splitting train and test into two scripts.
Because steps 2 and 3 happen at the driver level and use no cluster resources, your workers are idle during training. A sketch of this flow is shown below.
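A minimal sketch of that flow in PySpark; the file paths, feature column names (f1, f2, f3), label column and XGBoost parameters below are assumptions for illustration, not taken from the question:
import pandas as pd
import xgboost as xgb
from pyspark.sql import SparkSession

# Sketch only: paths, column names and parameters are assumptions.
spark = SparkSession.builder.appName("xgb-train-and-predict").getOrCreate()

# 1. Load/munge the data with PySpark SQL.
train_df = spark.read.parquet("/path/to/train.parquet")
test_df = spark.read.parquet("/path/to/test.parquet")

# 2. Bring the training data to the local driver (the performance bottleneck).
train_pdf = train_df.toPandas()

# 3. Train XGBoost locally on the driver.
dtrain = xgb.DMatrix(train_pdf[["f1", "f2", "f3"]], label=train_pdf["label"])
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=50)

# 4./5. Broadcast the trained model and predict partition by partition on the workers.
bc_booster = spark.sparkContext.broadcast(booster)

def predict_partition(rows):
    rows = list(rows)
    if not rows:
        return iter([])
    pdf = pd.DataFrame(rows, columns=["f1", "f2", "f3"])
    preds = bc_booster.value.predict(xgb.DMatrix(pdf))
    return zip((tuple(r) for r in rows), preds.tolist())

test_rdd = test_df.select("f1", "f2", "f3").rdd
predictions = test_rdd.mapPartitions(predict_partition)
predictions.take(5)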
Here is a similar implementation of what you are looking for. I have an SO post explaining the details, as I am trying to troubleshoot the errors described in that post to get the code in the notebook working.
XGBoost Spark One Model Per Worker Integration
The idea is to train using XGBoost and then, via Spark, orchestrate each model to run on a Spark worker; predictions can then be applied via XGBoost predict_proba() or Spark ML predict().

MLeap and Spark ML SQLTransformer

I have a question. I am trying to serialize a PySpark ML model to MLeap.
However, the model makes use of the SQLTransformer to do some column-based transformations, e.g. adding log-scaled versions of some columns.
As we all know, MLeap doesn't support SQLTransformer; see here:
https://github.com/combust/mleap/issues/126
so I've implemented the first of these two suggestions:
For non-row operations, move the SQL out of the ML Pipeline that you plan to serialize.
For row-based operations, use the available ML transformers or write a custom transformer <- this is where the custom transformer documentation will help.
I've externalized the SQL transformation on the training data used to build the model, and I do the same for the input data when I run the model for evaluation.
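A minimal sketch of what that externalization can look like, assuming the log-scaling mentioned above is the row-wise transformation involved; the column name "amount" and the helper name are hypothetical:
from pyspark.sql import functions as F

def add_log_columns(df):
    # "amount" is a placeholder column name; this does the same row-wise work
    # the SQLTransformer did, applied before the (serializable) pipeline.
    return df.withColumn("log_amount", F.log1p(F.col("amount")))

train_prepared = add_log_columns(train_df)  # used for pipeline.fit(...)
eval_prepared = add_log_columns(eval_df)    # same preprocessing at evaluation time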
The problem I'm having is that I'm unable to obtain the same results across the 2 models.
Model 1 - Pure Spark ML model containing SQLTransformer + later transformations: StringIndexer -> OneHotEncoderEstimator -> VectorAssembler -> RandomForestClassifier
Model 2 - Externalized version with the SQL queries run on the training data used in building the model. The transformations are everything after SQLTransformer in Model 1: StringIndexer -> OneHotEncoderEstimator -> VectorAssembler -> RandomForestClassifier
I'm wondering how I can go about debugging this problem. Is there a way to compare the results after each stage to see where the differences show up?
Any suggestions are appreciated.
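One possible way to do that stage-by-stage comparison: walk each fitted PipelineModel, materialize the intermediate DataFrames, and inspect them side by side. The names model1, model2, df1, df2 below are assumptions, and model1's first stage is assumed to be the SQLTransformer:
def stage_outputs(pipeline_model, df):
    # Apply each fitted stage in order and keep every intermediate result.
    outputs = []
    current = df
    for stage in pipeline_model.stages:
        current = stage.transform(current)
        outputs.append((stage.uid, current))
    return outputs

# model1/model2 and df1/df2 are assumed names for the two fitted pipelines and inputs.
outs1 = stage_outputs(model1, df1)  # SQLTransformer + the rest
outs2 = stage_outputs(model2, df2)  # SQL already applied outside the pipeline

# Skip model1's SQLTransformer output so corresponding stages line up,
# then eyeball a few rows after each stage to see where they diverge.
for (uid1, d1), (uid2, d2) in zip(outs1[1:], outs2):
    print(uid1, "vs", uid2)
    d1.show(3)
    d2.show(3)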

Spark ml: Is it possible to save trained model in PySpark and read from Java Spark code?

I have a PySpark job which processes input data and trains a logistic regression model. I need to somehow transfer this trained model to production code written in Java Spark. After loading the trained model, the Java code will pass it features to get predictions.
From PySpark side, I'm using the dataframe API (spark.ml), not mllib.
Is it possible to save the trained (fitted) model to a file and read it back from the Java Spark code? If there's a better way, please let me know.
Yes, it is possible. With the single exception of SparkR, which requires additional metadata for model loading, all native ML models (custom guest-language extensions notwithstanding) can be saved and loaded with an arbitrary backend.
Just save the MLWritable object on one side, using its save method or its writer (write), and load it back with the compatible Readable on the other side. For example, in Python:
from pyspark.ml.feature import StringIndexer
StringIndexer(inputCol="foo", outputCol="bar").write().save("/tmp/indexer")
and in Scala
import org.apache.spark.ml.feature.StringIndexer
val indexer = StringIndexer.load("/tmp/indexer")
indexer.getInputCol
// String = foo
That being said, ML models are typically a bad choice for production use, and more suitable options exist - see How to serve a Spark MLlib model?.
Welcome to SO. Have you tried doing this? In general it should work: if you save a spark.ml model, you can load it with Spark from any language that supports Spark. In any case, logistic regression is a simple model, so you can also just save its weights as an array and recreate it in your code, as in the sketch below.
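A rough sketch of that second option, assuming a fitted binary pyspark.ml LogisticRegressionModel under the hypothetical name lr_model; the output path is arbitrary:
import json

# lr_model is an assumed fitted binary LogisticRegressionModel from the training job.
weights = {
    "coefficients": lr_model.coefficients.toArray().tolist(),  # one weight per feature
    "intercept": float(lr_model.intercept),
}
with open("/tmp/lr_weights.json", "w") as f:
    json.dump(weights, f)
# On the Java side, read the JSON and compute sigmoid(dot(coefficients, features) + intercept).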

PySpark MLlib: AssertionError: Classifier doesn't extend from HasRawPredictionCol

I am a newbie in Spark. I want to do multiclass classification with SVM in PySpark MLlib. I installed Spark 2.3.0 on Windows.
But I found that SVM is implemented only for binary classification in Spark, so we have to use the one-vs-all strategy. I got an error when I tried to use one-vs-all with SVM, and searching for the error did not turn up a solution.
I used the one-vs-all code from this link:
https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#one-vs-rest-classifier-aka-one-vs-all
Here is my code:
from pyspark.mllib.classification import SVMWithSGD , SVMModel
from pyspark.ml.classification import OneVsRest
# instantiate the One Vs Rest Classifier.
svm_model = SVMWithSGD()
ovr = OneVsRest(classifier=svm_model)
# train the multiclass model.
ovrModel = ovr.fit(rdd_train)
# score the model on test data.
predictions = ovrModel.transform(rdd_test)
The error is in the line "ovr.fit(rdd_train)". Here is the error:
File "D:/Mycode-newtrials - Copy/stance_detection -norelieff-lgbm - randomizedsearch - modified - spark.py", line 1460, in computescores
ovrModel = ovr.fit(rdd_train)
File "D:\python27\lib\site-packages\pyspark\ml\base.py", line 132, in fit
return self._fit(dataset)
File "D:\python27\lib\site-packages\pyspark\ml\classification.py", line 1758, in _fit
"Classifier %s doesn't extend from HasRawPredictionCol." % type(classifier)
AssertionError: Classifier <class 'pyspark.mllib.classification.SVMWithSGD'> doesn't extend from HasRawPredictionCol.
You get the error because you are trying to use a model from Spark ML (OneVsRest) with a base binary classifier from Spark MLlib (SVMWithSGD).
Spark MLlib (the old, RDD-based API) and Spark ML (the new, dataframe-based API) are not only different libraries, but they are also incompatible: you cannot mix models between them (looking closer at the examples, you'll see that they import the base classifier from pyspark.ml, and not from pyspark.mllib, as you are trying to do here).
Note, though, that since Spark 2.2 the DataFrame-based API does include a linear SVM, pyspark.ml.classification.LinearSVC, which extends HasRawPredictionCol; you can use it (on a DataFrame, not an RDD) as the base classifier for OneVsRest.
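A sketch of that combination, assuming Spark 2.2 or later and that train_df and test_df are DataFrames with "features" (Vector) and "label" columns:
from pyspark.ml.classification import LinearSVC, OneVsRest

# train_df/test_df are assumed DataFrames with "features" and "label" columns.
svc = LinearSVC(maxIter=100, regParam=0.1)  # binary linear SVM from the DataFrame-based API
ovr = OneVsRest(classifier=svc)             # one-vs-rest wrapper for multiclass
ovr_model = ovr.fit(train_df)               # note: a DataFrame, not an RDD
predictions = ovr_model.transform(test_df)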

In spark mllib, can LogisticRegressionWithSGD do multiple classification tasks?

I want to use LogisticRegressionWithSGD for multiclass classification, but there is no setNumClasses method in org.apache.spark.mllib.classification.LogisticRegressionWithSGD. I know that LogisticRegressionWithLBFGS can do multiclass classification, so why can't LogisticRegressionWithSGD?
Multiclass classification using LogisticRegressionWithSGD is not supported, though it was a requested feature: https://issues.apache.org/jira/browse/SPARK-10179. It was decided not to add this feature, since Spark ML, not Spark MLlib, will be the main machine learning API for Spark in the future.
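For reference, the multiclass path that does exist in MLlib is LogisticRegressionWithLBFGS with its numClasses argument. A PySpark sketch, where training_rdd is an assumed RDD of LabeledPoint with labels 0, 1 and 2:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# training_rdd is an assumed RDD[LabeledPoint]; train a 3-class logistic regression.
model = LogisticRegressionWithLBFGS.train(training_rdd, numClasses=3)
predictions = model.predict(training_rdd.map(lambda lp: lp.features))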
