Persist best model from pipeline in PySpark - python-3.x

I have a question regarding how to extract a pipeline's best model for scoring and further use. For example, I tried saving it to a PMML file using the JPMML pyspark2 library, but I ran into an issue saving the file. Is there another way of saving the pipeline model using PySpark?

Use the bestModel attribute of the fitted cross-validator / grid-search model, like this:
print(spark.version)
# 2.4.3
# fit model on training data to cv/grid search
cvModel = cv_grid.fit(train_df)
# save best model from cv grid search
mPath = "/path/to/model/folder"
cvModel.bestModel.write().overwrite().save(mPath)
# load the persisted model back via the Pipeline API
from pyspark.ml.pipeline import PipelineModel
persistedModel = PipelineModel.load(mPath)
# predict
predictionsDF = persistedModel.transform(test_df)
For further reading, see the source code of the tuning module: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/tuning.html
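For completeness, here is a minimal sketch of how the cv_grid used above might be constructed; the feature columns, estimator and parameter grid below are placeholder choices, not part of the original answer.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# hypothetical schema: train_df has numeric columns f1, f2 and a binary label column
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# small illustrative grid over the regularization parameter
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

cv_grid = CrossValidator(estimator=pipeline,
                         estimatorParamMaps=grid,
                         evaluator=BinaryClassificationEvaluator(),
                         numFolds=3)
Because the estimator here is a Pipeline, cvModel.bestModel is a PipelineModel, which is why it can be read back with PipelineModel.load above.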

Related

Load a Pickled or Joblib pre-trained ML model to SageMaker and host as an endpoint

I have a trained model saved using pickle or joblib.
Let's say it's logistic regression or XGBoost.
I would like to host that model in AWS SageMaker as an endpoint without running a training job.
How can I achieve that?
# Let's say myBucketName contains model.pkl
model = joblib.load('filename.pkl')
# X_test = Numpy Array
model.predict(X_test)
I am not interested in sklearn_estimator.fit('S3 Train', 'S3 Validate'); I already have the trained model.
For Scikit-Learn, for example, you can get inspiration from this public demo: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_randomforest/Sklearn_on_SageMaker_end2end.ipynb
Step 1: Save your artifact (e.g. the joblib file) compressed in S3 at s3://<your path>/model.tar.gz
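A minimal sketch of that packaging step, assuming the already-trained estimator from the question (model) and placeholder bucket/key names:
import tarfile
import boto3
import joblib

# serialize the trained estimator under the name the inference script expects
joblib.dump(model, "model.joblib")

# SageMaker expects the artifact as a gzipped tarball
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.joblib")

# upload to S3; bucket and key are placeholders
boto3.client("s3").upload_file("model.tar.gz", "<your-bucket>", "<your path>/model.tar.gz")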
Step 2: Create an inference script with the deserialization function model_fn. (Note that you could also add custom inference functions input_fn, predict_fn and output_fn, but for scikit-learn the defaults work fine.)
%%writefile inference_script.py  # Jupyter magic to create the file (if you're in Jupyter)
import joblib
import os

def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf
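If you do need a custom predict_fn later, a minimal sketch (assuming the standard input_fn/predict_fn/output_fn signatures of the SageMaker scikit-learn serving container; the probability output is just an illustrative choice):
# optional addition to inference_script.py
def predict_fn(input_data, model):
    # return class probabilities instead of hard labels (illustrative choice)
    return model.predict_proba(input_data)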
Step 3: Create a model associating the artifact with the right container
from sagemaker.sklearn.model import SKLearnModel
model = SKLearnModel(
    model_data='s3://<your path>/model.tar.gz',
    role='<your role>',
    entry_point='inference_script.py',
    framework_version='0.23-1')
Step 4: Deploy!
model.deploy(
    instance_type='ml.c5.large',  # choose the right instance type
    initial_instance_count=1)
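deploy() returns a predictor object you can call directly; a minimal sketch of invoking the endpoint (capturing the return value of the deploy call above, with X_test being the NumPy array from the question):
predictor = model.deploy(
    instance_type='ml.c5.large',
    initial_instance_count=1)

# send the NumPy array to the endpoint and get predictions back
print(predictor.predict(X_test))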

Spark ml: Is it possible to save trained model in PySpark and read from Java Spark code?

I have a PySpark job which processes input data and trains a logistic regression model. I need to somehow transfer this trained model to production code written in Java Spark. After loading this trained model from the Java code, it will pass features to get predictions from the model.
From PySpark side, I'm using the dataframe API (spark.ml), not mllib.
Is it possible to save the trained (fitted) model to a file and read it back from the Java Spark code? If there's a better way, please let me know.
Yes, it is possible. With the single exception of SparkR, which requires additional metadata for model loading, all native ML models (custom guest-language extensions notwithstanding) can be saved and loaded with an arbitrary backend.
Just save the MLWritable object on one side, using its save method or its writer (write), and load it back with a compatible Readable on the other side. For example, in Python:
from pyspark.ml.feature import StringIndexer
StringIndexer(inputCol="foo", outputCol="bar").write().save("/tmp/indexer")
and in Scala
import org.apache.spark.ml.feature.StringIndexer
val indexer = StringIndexer.load("/tmp/indexer")
indexer.getInputCol
// String = foo
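The same pattern applies to fitted models, which is what the question is actually about; a minimal PySpark-side sketch (train_df and its column names are placeholders), where the Java side would then call org.apache.spark.ml.classification.LogisticRegressionModel.load on the same path:
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

# train_df is a placeholder DataFrame with "features" and "label" columns
lr_model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_df)
lr_model.write().overwrite().save("/tmp/lr_model")

# reading it back (here in Python; the Java/Scala side loads the same directory)
same_model = LogisticRegressionModel.load("/tmp/lr_model")
print(same_model.coefficients)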
That being said, ML models are typically a poor choice for production serving, and more suitable options exist; see "How to serve a Spark MLlib model?".
Welcome to SO. Have you tried doing this? In general, it should work: if you save a spark.ml model, you can load it with Spark from any language that Spark supports. In any case, logistic regression is a simple model, so you could also just save its weights as an array and recreate it in your code.
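A minimal sketch of that weight-extraction idea, assuming a fitted binary pyspark.ml LogisticRegressionModel called lr_model; the output file name is a placeholder:
import json

# lr_model is assumed to be a fitted pyspark.ml.classification.LogisticRegressionModel (binary)
weights = {
    "coefficients": lr_model.coefficients.toArray().tolist(),
    "intercept": lr_model.intercept,
}

# persist the raw parameters so any runtime (e.g. plain Java) can rebuild the scoring formula
with open("lr_weights.json", "w") as f:
    json.dump(weights, f)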

Pickle for data preprocessing

I was going through various tutorials and articles on using pickle on an ML model so that it can be used later.
But I cannot find anything like pickle for the data pre-processing. I am doing the following preprocessing:
Changing the data type of a few columns/features.
Feature engineering.
One-hot encoding/dummy variables.
Scaling the data using the code below:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Now, I want to do this for every dataset which I pass for predictions.
Is there any way to use something like pickle to load the data preprocessing steps before I pass the data to the ML model loaded from pickle?
Please guide me.
I created a function and saved it in an independent file, then called that function whenever required.
Below is how I call the data pre-processing function:
import pandas as pd
from DataPreparationv3 import Data_Preprocess

Base_Data = pd.read_csv('Validate.csv')
DataReady = Data_Preprocess(Base_Data)
This solved my problem.
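Note that a plain function does not preserve fitted state such as the scaler's means and variances; a common alternative (a sketch, not part of the original answer, and y_train is assumed to exist alongside X_train) is to persist the fitted transformers themselves, e.g. as a scikit-learn Pipeline dumped with joblib:
import joblib
from sklearn.linear_model import LogisticRegression  # illustrative estimator
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# fit the preprocessing and the model together, then persist the whole thing
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)
joblib.dump(pipe, "pipeline.joblib")

# later, at prediction time, the scaler is reused exactly as fitted
pipe = joblib.load("pipeline.joblib")
predictions = pipe.predict(X_test)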

PySpark MLlib: AssertionError: Classifier doesn't extend from HasRawPredictionCol

I am a newbie in Spark. I want to use multiclass classification for SVM in PySpark MLlib. I installed Spark 2.3.0 on Windows.
But I found that SVM is implemented only for binary classification in Spark, so we have to use the one-vs-all strategy. It gave me an error when I tried to use one-vs-all with SVM. I searched for the error but did not find a solution for it.
I used the one-vs-all code from this link:
https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#one-vs-rest-classifier-aka-one-vs-all
Here is my code:
from pyspark.mllib.classification import SVMWithSGD, SVMModel
from pyspark.ml.classification import OneVsRest
# instantiate the One Vs Rest Classifier.
svm_model = SVMWithSGD()
ovr = OneVsRest(classifier=svm_model)
# train the multiclass model.
ovrModel = ovr.fit(rdd_train)
# score the model on test data.
predictions = ovrModel.transform(rdd_test)
The error occurs on the line ovr.fit(rdd_train). Here is the traceback:
File "D:/Mycode-newtrials - Copy/stance_detection -norelieff-lgbm - randomizedsearch - modified - spark.py", line 1460, in computescores
ovrModel = ovr.fit(rdd_train)
File "D:\python27\lib\site-packages\pyspark\ml\base.py", line 132, in fit
return self._fit(dataset)
File "D:\python27\lib\site-packages\pyspark\ml\classification.py", line 1758, in _fit
"Classifier %s doesn't extend from HasRawPredictionCol." % type(classifier)
AssertionError: Classifier <class 'pyspark.mllib.classification.SVMWithSGD'> doesn't extend from HasRawPredictionCol.
You get the error because you are trying to use a model from Spark ML (OneVsRest) with a base binary classifier from Spark MLlib (SVMWithSGD).
Spark MLlib (the old, RDD-based API) and Spark ML (the new, dataframe-based API) are not only different libraries, but they are also incompatible: you cannot mix models between them (looking closer at the examples, you'll see that they import the base classifier from pyspark.ml, and not from pyspark.mllib, as you are trying to do here).
Note also that Spark ML does include a (binary) linear SVM, pyspark.ml.classification.LinearSVC (available since Spark 2.2); because it extends HasRawPredictionCol, it can be used as the base classifier with OneVsRest.
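A minimal sketch of that fix, keeping the structure of the question's code but using the spark.ml classifier; note that spark.ml works on DataFrames, so train_df/test_df below are assumed to be DataFrames with "features" and "label" columns rather than RDDs:
from pyspark.ml.classification import LinearSVC, OneVsRest

# base binary classifier from spark.ml, which does extend HasRawPredictionCol
svm = LinearSVC(maxIter=100, regParam=0.1)
ovr = OneVsRest(classifier=svm)

# train the multiclass model and score the test set
ovr_model = ovr.fit(train_df)
predictions = ovr_model.transform(test_df)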

Pipeline does not get converted to PMML properly using JPMML and Pyspark

I am using PySpark and the JPMML library to generate PMML documents from my pipeline models, but I don't think the conversion is working properly. To test this, I created two different pipelines using the same dataset and classifier, as below.
pipeline = Pipeline(stages=[assembler, slicer, pca, binarizer, assembler2, formula, classifier])
pipeline2 = Pipeline(stages=[assembler, slicer, binarizer, assembler2, formula, classifier])
But when I generate the PMML files using the following code snippet, it outputs two identical files, which means there is no difference between the models. I am confused. The generated PMML files should be different if the conversion works properly, right?
pipelineModel1 = pipeline.fit(df)
pmmlBytes = toPMMLBytes(spark, df, pipelineModel1)
with open('test.pmml', 'wb') as output:
    output.write(pmmlBytes)

pipelineModel2 = pipeline2.fit(df)
pmmlBytes2 = toPMMLBytes(spark, df, pipelineModel2)
with open('test1.pmml', 'wb') as output:
    output.write(pmmlBytes2)
The generated PMML files should be different if the conversion works properly, right?
Not necessarily. It all depends on your classification function: it may happen that the PCA-generated columns are simply not included in the PMML document, because they do not "contribute" to separating the classes. To test this hypothesis, try different classification functions, such as DecisionTreeClassifier vs. LogisticRegression.
Also, the only way to verify whether a PMML document is correct is to execute it and compare its results against the original Apache Spark ML results.
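A quick way to check the "identical files" observation programmatically, and to test the hypothesis above by swapping in a different classifier, might look like this (a sketch reusing the question's toPMMLBytes helper and spark session, both assumed to be in scope):
import hashlib

def pmml_digest(fitted_model, df):
    # hash the PMML document produced for a fitted pipeline model
    return hashlib.sha256(toPMMLBytes(spark, df, fitted_model)).hexdigest()

# identical digests mean the PCA stage left no visible trace in the PMML;
# rebuild the pipelines with e.g. DecisionTreeClassifier and compare again
print(pmml_digest(pipelineModel1, df) == pmml_digest(pipelineModel2, df))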
