I have created a pipeline based on:
- A custom TfidfVectorizer that returns the TF-IDF vectors as a DataFrame (600 features)
- A custom feature generator that creates new features (5)
- A FeatureUnion to join the two DataFrames. I checked: the output is an array, so there are no feature names (605 features total).
- An XGBoost classifier with seed and random state set (8 classes as label names)
If I fit and use the pipeline in a Jupyter notebook, I obtain good F1 scores.
However, when I save it (using pickle, joblib, or dill) and later load it in another notebook or script, I cannot always reproduce the results. I don't understand it, because the input for testing is always the same, and so is the Python environment.
Could you help me with some suggestions?
Thanks!
Things I have tried:
- Saving the pipeline with different libraries (pickle, joblib, dill)
- Adding a DenseTransformer at various points
- Using ColumnTransformer instead of FeatureUnion
- (I cannot use the PMML library due to some restrictions)
- etc.
The problem remains the same.
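For reference, a minimal, self-contained sketch of the kind of round-trip check described above, with seeds pinned (the toy data and sklearn's TfidfVectorizer stand in for my custom transformers; none of this is my actual code):

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Toy corpus standing in for the real data
X = ["red apple", "green apple", "blue car", "red car"] * 10
y = [0, 0, 1, 1] * 10

# Stand-in for the custom TF-IDF + feature-union stages
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", XGBClassifier(random_state=42, n_jobs=1)),
])
pipeline.fit(X, y)

# Round-trip through joblib and check predictions are identical
joblib.dump(pipeline, "pipeline.joblib")
reloaded = joblib.load("pipeline.joblib")
assert np.array_equal(pipeline.predict(X), reloaded.predict(X))

If a check like this passes in-process but a fresh process still disagrees, one thing worth ruling out (an assumption, not a diagnosis) is a custom transformer that derives its feature order from a Python set, since set iteration order over strings can differ between processes unless PYTHONHASHSEED is fixed.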
I have a question regarding how to extract a pipeline's best model for scoring and further use. For example, I tried saving it to a PMML file using the JPMML pyspark2 library, but I ran into an issue saving the file. Is there another way of saving the pipeline model using PySpark?
Use the bestModel attribute of your trained CrossValidatorModel, like this:
print(spark.version)
2.4.3

# fit the cv/grid search on the training data
cvModel = cv_grid.fit(train_df)

# save the best model from the cv/grid search
mPath = "/path/to/model/folder"
cvModel.bestModel.write().overwrite().save(mPath)

# load the saved model back via the Pipeline API
from pyspark.ml.pipeline import PipelineModel
persistedModel = PipelineModel.load(mPath)

# predict
predictionsDF = persistedModel.transform(test_df)
Source code for an extra read => https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/tuning.html
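For completeness, a minimal sketch of how a cv_grid like the one above might be constructed; the LogisticRegression stage, grid values, and column names here are assumptions for illustration, not part of the original answer:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Assumed setup: a pipeline with a single logistic regression stage
# over an already-assembled "features" column
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[lr])

# Small illustrative grid over the regularization strength
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

cv_grid = CrossValidator(estimator=pipeline,
                         estimatorParamMaps=grid,
                         evaluator=BinaryClassificationEvaluator(),
                         numFolds=3)

Because the estimator is a Pipeline, cvModel.bestModel is a PipelineModel, which is why it can be read back with PipelineModel.load as shown above.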
I am using Spark (core/MLlib) with Java, version 2.3.1.
I am applying three transformations to a Dataset (StringIndexer, OneHotEncoderEstimator, VectorAssembler) to transform a categorical variable in my dataset into individual columns of 1s and 0s for each category. On my training data this transformation works with no issues, everything is as expected, and I am saving this model to file.
My issue comes when I try to use this model on a new datapoint:
public static double loadModel(Obj newData) {
    SparkSession spark = Shots.buildSession();

    // Function which applies transformations
    Dataset<Row> data = buildDataset(spark, Arrays.asList(newData));
    LogisticRegressionModel lrModel = LogisticRegressionModel.load(modelPath);

    // Error is thrown here as the model doesn't seem to understand the input
    Dataset<Row> preds = lrModel.transform(data);
    preds.show();
}
The issue, I believe, is that the transformation is now being applied to only one row of data which outputs only one category for the categorical feature and a vector with only one element after transformation. This causes an error when the LogisticRegressionModel transform is applied, which is expecting a vector with length greater than one for that feature... I think.
I know my error is not knowing how to apply the train transform to the new data... but I am unsure where exactly the error is and, as a result, do not know where to find the answer (is the issue with saving the model, do I need to save something else like the pipeline, etc.).
The actual error being thrown is:
java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y: Vector) was given Vectors with non-matching sizes: x.size = 7, y.size = 2
The reason why I have come to the conclusions above is a visual examination of the data.
An example may help explain: I have a categorical feature with 3 values [Yes, No, Maybe]. My train data includes all three values, I end up with a vector feature of length 3 signifying the category.
I am then using the same pipeline on a single data point to predict a value, but the categorical feature can only take one of the values (Yes, No, or Maybe) since there is only one data point. When you apply the same transformation as above, you end up with a vector with one element, as opposed to three, causing the model transform to throw an error.
In general, you are not using the API correctly. The correct workflow should include preserving the whole set of Models trained in the process (in your case it will be at least the StringIndexerModel; the other components look like Transformers) and reapplying them to the new data.
The most convenient way of doing this is to use a Pipeline:
val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
val pipelineModel = pipeline.fit(data)
pipelineModel.transform(data)
PipelineModels can be saved like any other component, as long as all of their stages are writable.
The issue here was that I was saving the model at the wrong point.
To preserve the effect of the previous transformations, you need to fit the pipeline to the data and then save the fitted model. This means saving the PipelineModel rather than the Pipeline. If you instead fit again after loading, the transformations are re-learned in full from the new data, and you lose the state required for them to work.
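In PySpark terms, the fit-then-save ordering looks roughly like this (a sketch; pipeline, train_df, new_df, and the path are assumed to exist):

from pyspark.ml import PipelineModel

# Fit first: this freezes the fitted state of every stage,
# e.g. the StringIndexerModel's value-to-index mapping
pipeline_model = pipeline.fit(train_df)
pipeline_model.write().overwrite().save("/path/to/model/folder")

# Later, in the scoring job: load the fitted model and only transform, never refit
persisted = PipelineModel.load("/path/to/model/folder")
predictions = persisted.transform(new_df)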
I was also facing the same issue. StringIndexer will fail on new values in the test or new data, so we can choose to skip those unknown values:
new StringIndexer().setHandleInvalid("skip")
Alternatively, pass the union of the train and test data to the pipeline and split it post-transformation.
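For reference, the PySpark equivalent (column names are illustrative):

from pyspark.ml.feature import StringIndexer

# Rows whose category was never seen during fit are dropped instead of failing
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex",
                        handleInvalid="skip")

Since Spark 2.2 there is also handleInvalid="keep", which maps unseen values to an extra index instead of dropping the rows.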
You have two options:
1. handle_data_PipelineModel ==> df ---> split_dataset ==> train_df/test_df --> arithmetic_PipelineModel ---> test_model ---> evaluate
2. df ==> split_dataset ==> train_df/test_df --> PipelineModel(handle_data_stage and arithmetic_stage) ---> probably error
Option 1 is safe: you need to save both handle_data_PipelineModel and arithmetic_PipelineModel.
Option 2 is bad: no matter how you save the model, if you split the data first, the distribution of train_df and test_df will change.
Note: the split datasets must not go through any data processing before being passed to the PipelineModel.
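A rough sketch of option 1 in PySpark, under the assumption that handle_data_pipeline holds the data-preparation stages and arithmetic_pipeline holds the estimator (all names illustrative):

# Fit the data-handling stages on the full dataset, then split,
# then fit the modeling stages on the training split only
handle_data_model = handle_data_pipeline.fit(df)
prepared_df = handle_data_model.transform(df)

train_df, test_df = prepared_df.randomSplit([0.8, 0.2], seed=42)

arithmetic_model = arithmetic_pipeline.fit(train_df)
predictions = arithmetic_model.transform(test_df)

# Both fitted PipelineModels must be saved for later scoring
handle_data_model.write().overwrite().save("/path/to/handle_data_model")
arithmetic_model.write().overwrite().save("/path/to/arithmetic_model")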
I use Spark 2.1.0.
I've been trying to export a Spark MLlib linear regression model as a PMML file, and I have successfully exported it. But in that file I cannot see any field names. All I can see is something like this:
Can anyone let me know the reason for this? Also, please let me know how to obtain the column names in place of the generic ones.
There are two approaches to exporting Apache Spark models into PMML data format. First, when working at Spark ML abstraction level, then you can use the JPMML-SparkML library. Second, when working at Spark MLlib abstraction level, which appears to be the case here, then you can use the built-in PMMLExportable trait.
JPMML-SparkML retrieves column names from the Spark ML data schema via DataFrame#schema(). Unfortunately, there is no such option for Spark MLlib, so feature names "field_{n}" and the label name "target" are simply dummy hard-coded names.
It is fairly easy to rename fields in the PMML document using the JPMML-Model library:
// Export the raw PMML file, then rewrite its field names
pmmlExportable.toPMML("/tmp/raw-pmml-file");
org.dmg.pmml.PMML pmml = org.jpmml.model.JAXBUtil.unmarshalPMML(new javax.xml.transform.stream.StreamSource(new java.io.File("/tmp/raw-pmml-file")));
org.jpmml.model.visitors.FieldRenamer targetRenamer = new org.jpmml.model.visitors.FieldRenamer(org.dmg.pmml.FieldName.create("target"), org.dmg.pmml.FieldName.create("y"));
targetRenamer.applyTo(pmml);
org.jpmml.model.JAXBUtil.marshalPMML(pmml, new javax.xml.transform.stream.StreamResult(new java.io.File("/tmp/final-pmml-file")));
If you marshal this PMML object instance to a PMML file, then you can see that the field "target" (and all its references) has been renamed to "y". Repeat the procedure with features.
I created an ML pipeline with several transformers, including a StringIndexer which is used during training on the data labels.
I then store the resultant PipelineModel which will later be used for data preparation and prediction on a dataset which doesn't have labels.
The issue is that the created pipeline model's transform function cannot be applied to the new DataFrame, since it expects data labels to be available.
What am I missing?
How should this be done?
Note: My goal is to have a single pipeline (i.e. I'd like to keep the various transformations and ML algorithm together)
Thanks!
You should post your source code. Your test data format should be consistent with your training data, including the feature names, but you don't need the label column.
You can refer to the official documentation.
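One common way to structure this, sketched in PySpark (a workaround under the assumption that the label StringIndexer is what forces the label column to exist at prediction time; all names are illustrative): keep the label indexer outside the pipeline that gets saved and reused.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler

# The label indexer is fitted and applied only at training time,
# so the saved pipeline never references the label column
label_indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
train_indexed = label_indexer.fit(train_df).transform(train_df)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="indexedLabel"),
])
model = pipeline.fit(train_indexed)

# transform() now works on unlabeled data: no stage reads the label column
predictions = model.transform(unlabeled_df)

This keeps the feature transformations and the ML algorithm together in one PipelineModel, at the cost of handling the label indexing (and, if needed, an IndexToString on the predictions) outside of it.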