Spark ML pipeline usage - apache-spark

I created an ML pipeline with several transformers, including a StringIndexer which is used during training on the data labels.
I then store the resultant PipelineModel which will later be used for data preparation and prediction on a dataset which doesn't have labels.
The issue is that the created pipeline model's transform function cannot be applied to the new DataFrame, since it expects data labels to be available.
What am I missing?
How should this be done?
Note: My goal is to have a single pipeline (i.e. I'd like to keep the various transformations and ML algorithm together)
Thanks!

You should paste your source code. Your test data format should be consistent with your training data, including the feature names, but you don't need the label column.
You can refer to the official site.

Related

Same sklearn pipeline different results

I have created a pipeline based on:
Custom tfidfvectorizer to transform tf IDF vector as dataframe (600 features)
Custom Features generator to create new features (5)
Feature Union to join the two dataframes. I checked the output is an array, so no feature names. (605)
Xgboost classifier model seed and random state included (8 classes as labels names)
If I fit and use the pipeline in a Jupyter notebook, I obtain good F1 scores.
However, when I save it (using pickle, joblib or dill) and later load it in another notebook or script, I cannot always reproduce the results! I cannot understand it, because the test input is always the same, and so is the Python environment!
Could you help me with some suggestions?
Thanks!
Tried to save the pipeline with different libraries.
DenseTransformer in some points
Column transform instead of feature Union
I cannot use pmml library due to some restrictions
Etc
The problem is the same

Is it possible to keep training the same Azure Translate Custom Model with additional data sets?

I just finished training a Custom Azure Translate Model with a set of 10,000 sentences. I now have the options to review the result and test the data. While I already get a good result score, I would like to continue training the same model with additional data sets before publishing. I can't find any information regarding this in the documentation.
The only remotely close option I can see is to duplicate the first model and add the new data sets but this would create a new model and not advance the original one.
Once the project is created, we can train different models on different datasets. Once a dataset has been uploaded and a model has been trained on it, we cannot modify the content of the dataset or upgrade it.
https://learn.microsoft.com/en-us/azure/cognitive-services/translator/custom-translator/quickstart-build-deploy-custom-model
The above document can help you.

Applying transformation to new data Spark

I am using Spark (core/MLlib) with Java, version 2.3.1.
I am applying three transformations to a Dataset - StringIndexer, OneHotEncoderEstimator, VectorAssembler - in order to transform a categorical variable in my dataset into individual columns of 1 and 0 for each category. On my train data, this transformation works with no issues, everything is as expected, and I am saving this model to file.
My issue comes when I try to use this model on a new datapoint:
public static double loadModel(Obj newData) {
    SparkSession spark = Shots.buildSession();
    //Function which applies transformations
    Dataset<Row> data = buildDataset(spark, Arrays.asList(newData));
    LogisticRegressionModel lrModel = LogisticRegressionModel.load(modelPath);
    //Error is thrown here as the model doesn't seem to understand the input
    Dataset<Row> preds = lrModel.transform(data);
    preds.show();
}
The issue, I believe, is that the transformation is now being applied to only one row of data which outputs only one category for the categorical feature and a vector with only one element after transformation. This causes an error when the LogisticRegressionModel transform is applied, which is expecting a vector with length greater than one for that feature... I think.
I know my error is not knowing how to apply the train transform to the new data... but I am unsure where exactly the error is and, as a result, do not know where to find the answer (is the issue with saving the model, do I need to save something else like the pipeline, etc.).
The actual error being thrown is -
java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes: x.size = 7, y.size = 2 - the reason why I have come to the conclusions above is a visual examination of the data.
An example may help explain: I have a categorical feature with 3 values [Yes, No, Maybe]. My train data includes all three values, I end up with a vector feature of length 3 signifying the category.
I am then using the same pipeline on a single data point to predict a value, but the categorical feature can only be one of Yes, No, or Maybe as there is only one data point. Therefore, when you apply the same transformation as above you end up with a vector with one element, as opposed to three, causing the model transform to throw an error.
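The mismatch described above can be reproduced without Spark. The following is only a library-free Python sketch of the idea, not Spark's implementation: fit_index, one_hot, dot and the weights are hypothetical names and values, categories are ordered alphabetically for simplicity, and no category is dropped from the encoding.

```python
def fit_index(values):
    """Learn a category -> position mapping from the values seen at fit time."""
    return {cat: i for i, cat in enumerate(sorted(set(values)))}


def one_hot(value, index):
    """Encode a value as a 0/1 list with one slot per fitted category."""
    vec = [0] * len(index)
    vec[index[value]] = 1
    return vec


def dot(weights, features):
    """Dot product that rejects vectors of different sizes."""
    if len(weights) != len(features):
        raise ValueError(
            f"non-matching sizes: x.size = {len(weights)}, y.size = {len(features)}"
        )
    return sum(w * f for w, f in zip(weights, features))


train = ["Yes", "No", "Maybe", "Yes", "No"]
new_row = ["Maybe"]

train_index = fit_index(train)    # fitted on all three categories
refit_index = fit_index(new_row)  # wrongly refitted on the single new row

good = one_hot("Maybe", train_index)  # three slots
bad = one_hot("Maybe", refit_index)   # one slot

weights = [0.5, -1.0, 2.0]  # hypothetical coefficients learned on three slots
print(dot(weights, good))
try:
    dot(weights, bad)
except ValueError as err:
    print("error:", err)
```

The encoding only has the width the model expects when the mapping learned on the training data is reused for the new row.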
In general, you are not using the API correctly. The correct workflow should include preserving the whole set of Models trained in the process (in your case at least the StringIndexerModel; the other components look like Transformers) and reapplying them to the new data.
The most convenient way of doing it is using Pipeline:
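import org.apache.spark.ml.Pipeline
// indexer, encoder, assembler and lr are the stages you have already defined
val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
val pipelineModel = pipeline.fit(data)
pipelineModel.transform(data)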
PipelineModels can be saved like any other component, as long as all of their stages are writable.
The issue here was that I was saving the model at the wrong point.
To preserve the effect of previous transformations you need to fit the pipeline to the data and then write/save the model. This means saving PipelineModel rather than Pipeline. If you fit after you load the data, then the transformation will be reapplied in full and you will lose the state required for the transformation to work.
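As a rough illustration of "fit once, save the fitted state, then only transform", here is a library-free Python sketch. It is not the Spark API: fit_index and transform are hypothetical helpers, and a JSON file in a temporary directory stands in for the saved PipelineModel.

```python
import json
import os
import tempfile


def fit_index(values):
    """Learn a category -> position mapping (the fitted state)."""
    return {cat: i for i, cat in enumerate(sorted(set(values)))}


def transform(value, index):
    """Encode a value using an already fitted mapping."""
    vec = [0] * len(index)
    vec[index[value]] = 1
    return vec


# Training time: fit once, then save the fitted mapping.
fitted = fit_index(["Yes", "No", "Maybe", "Yes"])
path = os.path.join(tempfile.mkdtemp(), "fitted_index.json")
with open(path, "w") as fh:
    json.dump(fitted, fh)

# Prediction time: load the fitted mapping and only transform, never refit.
with open(path) as fh:
    loaded = json.load(fh)

encoded = transform("No", loaded)
print(encoded)
```

Saving the unfitted recipe instead (the equivalent of the Pipeline) would force a new fit on whatever data is present at prediction time, which is where the learned categories get lost.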
I was also facing the same issue. StringIndexer will fail for new values in test or new data so we can choose to skip those unknown values.
new StringIndexer().setHandleInvalid("skip")
or pass union of both the train and test data to the pipeline and split it post-transformation.
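For reference, StringIndexer's handleInvalid accepts "error", "skip" and "keep". The Python sketch below only mimics those three policies on a plain dict to show how they differ; index_values is a hypothetical helper, not a Spark function.

```python
def index_values(values, index, handle_invalid="error"):
    """Map values to fitted indices under an 'error', 'skip' or 'keep' policy."""
    out = []
    for v in values:
        if v in index:
            out.append(index[v])
        elif handle_invalid == "skip":
            continue                # drop the row that holds the unseen value
        elif handle_invalid == "keep":
            out.append(len(index))  # one extra bucket shared by all unseen values
        else:
            raise ValueError(f"Unseen label: {v}")
    return out


index = {"Yes": 0, "No": 1, "Maybe": 2}   # mapping learned on the training data
new_values = ["Yes", "Perhaps", "No"]     # "Perhaps" was never seen in training

skipped = index_values(new_values, index, "skip")
kept = index_values(new_values, index, "keep")
print(skipped, kept)
```

Note that "skip" silently removes rows, so the output can contain fewer rows than the input, while "keep" preserves every row at the cost of one additional index.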
You have two options:
1. df --> handle_data_PipelineModel --> split_dataset --> train_df/test_df --> arithmetic_PipelineModel --> test_model --> evaluate
2. df --> split_dataset --> train_df/test_df --> PipelineModel(handle_data_stage and arithmetic_stage) --> probably an error
Option 1 is safe: you need to save both handle_data_PipelineModel and arithmetic_PipelineModel.
Option 2 is bad, no matter how you save the model: when you split the data first, the distribution of train_df and test_df will change.
Note: The data set must not be split before the data-handling PipelineModel has been applied.

Adding "external" features to Pipeline

I'd like to compose a pipeline for textual data. However there is an
extra step in the pipeline where "external" features are added. These
features are stored in external database and are accessed by document id
(the row number in the input).
The custom pipeline stage comes after a tfidf step. Meaning the input to the
stage will be a sparse matrix. Is there a way for me to pass the indices in
the input matrix as well? Or maybe a generic way to pass some metadata between
pipeline stages?
Note that the input to the pipeline is selected by GridSearchCV.
I saw the Feature Union with Heterogeneous Data Sources, but fail to see how to apply it to my situation since I can't compute the features from the input to the stage.

advanced feature extraction for cross-validation using sklearn

Given a sample dataset with 1000 samples of data, suppose I would like to preprocess the data in order to obtain 10000 rows of data, so each original row of data leads to 10 new samples. In addition, when training my model I would like to be able to perform cross validation as well.
The scoring function I have uses the original data to compute the score so I would like cross validation scoring to work on the original data as well rather than the generated one. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to correctly split the data according to the original samples.
What I thought about doing:
Create a custom feature extractor to extract features to feed to the classifier.
Add the feature extractor to a pipeline and feed it to, say, GridSearchCV.
Implement a custom scorer which operates on the original data to score the model given a set of selected parameters.
Is there a better method for what I am trying to accomplish?
I am asking this in connection to a competition going on right now on Kaggle
Maybe you can use stratified cross-validation (e.g. Stratified K-Fold or Stratified Shuffle Split) on the expanded samples, using the original sample index as stratification info, in combination with a custom score function that ignores the non-original samples in the model evaluation.
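A minimal sketch of that idea in plain Python, under stated assumptions: the helper names and the toy data are hypothetical, and the split groups rows by their original sample index rather than stratifying by class, so that all rows generated from one original sample land in the same fold (scikit-learn's GroupKFold is built around the same grouping idea).

```python
def group_k_fold(groups, n_splits):
    """Yield (train_rows, test_rows); rows sharing a group id never straddle a fold."""
    unique = sorted(set(groups))
    for k in range(n_splits):
        test_groups = set(unique[k::n_splits])
        test = [i for i, g in enumerate(groups) if g in test_groups]
        train = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train, test


def score_originals_only(y_true, y_pred, is_original):
    """Accuracy computed only on the rows flagged as original samples."""
    pairs = [(t, p) for t, p, keep in zip(y_true, y_pred, is_original) if keep]
    return sum(t == p for t, p in pairs) / len(pairs)


# Hypothetical toy data: 4 original samples, each expanded into 3 rows.
groups = [g for g in range(4) for _ in range(3)]
is_original = [i % 3 == 0 for i in range(len(groups))]  # first row of each group
folds = list(group_k_fold(groups, n_splits=2))

y_true = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
score = score_originals_only(y_true, y_pred, is_original)
print(folds[0][1], score)
```

Keeping the generated rows of one original sample together avoids evaluating on rows whose siblings were seen during training.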
