Spark ML pipeline: how to handle unseen labels

To handle new and unseen labels in a Spark ML pipeline I want to use most-frequent imputation.
The pipeline consists of the following steps:
1. preprocessing
2. learn the most frequent item
3. StringIndexer for each categorical column
4. VectorAssembler
5. an estimator, e.g. a random forest
Assuming (1), (2, 3), and (4, 5) constitute separate pipelines, I can fit and transform (1) for the train and test data. This means all NaN values are handled, i.e. imputed. (2, 3) will fit nicely, as will (4, 5).
Then I can use the following to replace new/unseen labels with null:
// Collect the fitted StringIndexerModels from the fitted pipeline (2, 3)
val fittedLabels = pipeline23.stages.collect { case a: StringIndexerModel => a }

// For each categorical column, keep only labels seen at fit time;
// replace everything else with null
val result = categoricalColumns.zipWithIndex.foldLeft(validationData) {
  case (currentDF, (colName, idx)) =>
    currentDF.withColumn(
      colName,
      when(currentDF(colName).isin(fittedLabels(idx).labels: _*), currentDF(colName))
        .otherwise(lit(null)))
}.drop("replace")
These deliberately introduced nulls are then imputed by the most-frequent imputer.
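For illustration, such a most-frequent imputation step for a single string column might look like the following sketch (the helper mostFrequent, trainData and the column name are made up for illustration):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, desc}

// Compute the mode of a string column on the training data
def mostFrequent(df: DataFrame, c: String): String =
  df.where(col(c).isNotNull)
    .groupBy(c)
    .count()
    .orderBy(desc("count"))
    .first()
    .getString(0)

// Fill the deliberately introduced nulls with the training-data mode
val mode = mostFrequent(trainData, "someCategoricalColumn")
val imputed = validationData.na.fill(Map("someCategoricalColumn" -> mode))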
However, this setup is very ugly, as tools like CrossValidator no longer work (since I can't supply a single pipeline).
How can I access the fitted labels within the pipeline to build a custom Transformer which sets new values to null?
Do you see a better approach to handling new values?
I assume most-frequent imputation is OK, i.e. for a dataset with around 90 columns only very few columns will contain an unseen label.

I finally realized that this functionality needs to reside in the pipeline to work properly, i.e. it requires an additional, new PipelineStage component.
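For illustration, a minimal sketch of what such a stage could look like; the class name UnseenToNull is hypothetical, and a production version would rather be an Estimator that learns the seen labels during fit (on Spark 2.2+, StringIndexer's handleInvalid option may also cover this case directly):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{lit, when}
import org.apache.spark.sql.types.StructType

// Hypothetical PipelineStage: replaces values not seen at fit time with null
class UnseenToNull(override val uid: String, column: String, seenLabels: Array[String])
    extends Transformer {

  def this(column: String, seenLabels: Array[String]) =
    this(Identifiable.randomUID("unseenToNull"), column, seenLabels)

  // Keep known labels, null out everything else so the imputer can act
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn(
      column,
      when(dataset(column).isin(seenLabels: _*), dataset(column)).otherwise(lit(null)))

  // The column set is unchanged; only values are replaced
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): UnseenToNull =
    new UnseenToNull(uid, column, seenLabels)
}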

Related

Calling function on each row of a DataFrame that requires building a DataFrame

I'm trying to wrap some functionality of the lime Python library around Spark ML models. The general idea is to take a PipelineModel (containing each phase of data transformation and the application of the model) as input and build a function that calls the Spark model, applies the lime algorithm, and gives an explanation for each single row.
Some context
The lime algorithm locally approximates a trained machine learning model. In its implementation, lime basically just needs a function that, given a feature vector as input, evaluates the model's predictions. With this function, lime can slightly perturb the feature input, see how the model's predictions change, and then produce an explanation. So, theoretically, it can be applied to any model evaluated with any engine.
The idea here is to use it with Spark ml models.
The wrapping
In particular, I'm wrapping the LimeTabularExplainer. To work, it needs a feature vector in which each element is an index corresponding to the category. Digging into StringIndexer and similar transformers, it's pretty easy to build such a vector from the "raw" values of the data. I then built a function that, from such a vector (or a 2d array if you have more than one case), creates a Spark DataFrame, applies the PipelineModel, and returns the model's predictions.
The task
Ideally, I would like to build a function that does the following:
process a row of an input DataFrame
from the row, build and collect a numpy vector that works as input for the lime explainer
internally, the lime explainer slightly changes that vector in many ways, building a 2d array of "similar" cases
the above cases are transformed back into a Spark DataFrame
the PipelineModel is applied to the above DataFrame, and the results are collected and brought back to the lime explainer, which continues its work
The problem
As you can see (if you've read this far!), for each row of the DataFrame you build another DataFrame. So you cannot define a udf, since you are not allowed to call Spark functions inside a udf.
So the question is: how can I parallelize the above procedure? Is there another approach that I could follow to avoid the problem?
I think you can still use udfs in this case, followed by explode() to retrieve all the results on different lines. You just have to make sure the input column is already the vector you want to feed lime.
That way you don't even have to collect out of Spark, which is expensive. Maybe you can even use vectorized udfs in your case to gain speed (not sure).
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, DoubleType

def function(base_case):
    # limefunction is the lime wrapper described above
    list_similarCases = limefunction(base_case)
    return list_similarCases

# ArrayType requires an element type, e.g. DoubleType
f_udf = udf(function, ArrayType(DoubleType()))
df_result = df_start.withColumn("similar_cases", explode(f_udf("base_case")))

Using Apache Spark ML, how do you transform (for predictions) a dataset that doesn't have a label?

I'm certain I've developed a gap in my understanding of Spark ML's Pipelines.
I have a pipeline that trains against a set of data with a schema of "label", "comment" (both strings). My pipeline transforms "label", adding "indexedLabel", and vectorizes "comment" by tokenizing and then applying HashingTF (ending with "vectorizedComment"). The pipeline concludes with a LogisticRegression with label column "indexedLabel" and a features column of "vectorizedComment".
And it works great! I can fit my pipeline and get a pipeline model that transforms datasets with "label" and "comment" all day long!
However, my goal is to be able to throw it datasets of just "comment", since "label" is only present for training the model.
I'm confident that I've got a gap in understanding of how predictions with pipelines work - could someone point it out for me?
Transformations of the label can be done outside of the pipeline (i.e. before it). The label is only necessary during training, not during actual usage of the pipeline/model. By performing label transformations inside the pipeline, any dataframe would be required to have a label column, which is undesired.
Small example:
val indexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
val df2 = indexer.fit(df).transform(df)
// Create pipeline with other stages and use df2 to fit it
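To make that comment concrete, a minimal sketch of the rest of the flow (tokenizer, hashingTF, lr and unlabeledComments are placeholder names following the question; lr is assumed to use setLabelCol("indexedLabel")):

// The pipeline itself never references "label", only "indexedLabel"
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(df2)       // df2 already carries "indexedLabel"
model.transform(unlabeledComments)  // prediction needs no label column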
Alternatively, you could have two separate pipelines: one including the label transformations, used during training, and one without them. Make sure the other stages refer to the same objects in both pipelines.
val indexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
// Create feature transformers and add to the pipelines
val pipelineTraining = new Pipeline().setStages(Array(indexer, ...))
val pipelineUsage = new Pipeline().setStages(Array(...))
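One caveat with this approach: fitting pipelineUsage separately will not carry over the LogisticRegression model fitted by pipelineTraining. A hedged alternative sketch (trainingData and unlabeledData are placeholder names) fits only the training pipeline and skips the label indexer at prediction time:

import org.apache.spark.ml.feature.StringIndexerModel

val trainedModel = pipelineTraining.fit(trainingData)
// Apply every fitted stage except the label indexer to the unlabeled data
val predictions = trainedModel.stages
  .filterNot(_.isInstanceOf[StringIndexerModel])
  .foldLeft(unlabeledData.toDF())((df, stage) => stage.transform(df))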

Applying transformations to new data in Spark

I am using Spark (core/mllib) 2.3.1 with Java.
I am applying three transformations to a Dataset (StringIndexer, OneHotEncoderEstimator, VectorAssembler) to transform a categorical variable in my dataset into individual columns of 1s and 0s for each category. On my train data this transformation works with no issues, everything is as expected, and I am saving this model to file.
My issue comes when I try to use this model on a new datapoint:
public static void loadModel(Obj newData) {
    SparkSession spark = Shots.buildSession();
    // Helper which applies the transformations
    Dataset<Row> data = buildDataset(spark, Arrays.asList(newData));
    LogisticRegressionModel lrModel = LogisticRegressionModel.load(modelPath);
    // The error is thrown here, as the model doesn't seem to understand the input
    Dataset<Row> preds = lrModel.transform(data);
    preds.show();
}
The issue, I believe, is that the transformation is now applied to only one row of data, which outputs only one category for the categorical feature and a vector with only one element after the transformation. This causes an error when the LogisticRegressionModel transform is applied, which expects a vector with length greater than one for that feature... I think.
I know my error is not knowing how to apply the training transformations to the new data... but I am unsure where exactly the error is and, as a result, do not know where to find the answer (is the issue with saving the model, do I need to save something else like the pipeline, etc.).
The actual error being thrown is:
java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes: x.size = 7, y.size = 2
The reason why I have come to the conclusions above is a visual examination of the data.
An example may help explain: I have a categorical feature with 3 values [Yes, No, Maybe]. My train data includes all three values, so I end up with a feature vector of length 3 signifying the category.
I then use the same pipeline on a single data point to predict a value, but the categorical feature can only be Yes, No, or Maybe, as there is only one data point. Therefore, when you apply the same transformation as above, you end up with a vector with one element, as opposed to three, causing the model transform to throw an error.
In general, you are not using the API correctly. The correct workflow includes preserving the whole set of Models trained in the process (in your case at least a StringIndexerModel; the other components look like Transformers) and reapplying them to the new data.
The most convenient way of doing this is using a Pipeline:
val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
val pipelineModel = pipeline.fit(data)
pipelineModel.transform(data)
PipelineModels can be saved like any other component, as long as all their stages are writable.
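For example, persisting and restoring the fitted result could look like this sketch (the path is a placeholder):

// Save the fitted PipelineModel, not the unfitted Pipeline
pipelineModel.write.overwrite().save("/tmp/my-pipeline-model")
val restored = PipelineModel.load("/tmp/my-pipeline-model")
restored.transform(newData)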
The issue here was that I was saving the model at the wrong point.
To preserve the effect of previous transformations, you need to fit the pipeline to the data and then write/save the model. This means saving the PipelineModel rather than the Pipeline. If you fit after you load the data, the transformations will be reapplied in full and you will lose the state required for them to work.
I was also facing the same issue. StringIndexer will fail for new values in test or new data, so we can choose to skip those unknown values:
new StringIndexer().setHandleInvalid("skip")
Alternatively, pass the union of the train and test data to the pipeline and split it post-transformation.
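As a small sketch of the first option (column names are placeholders); note that since Spark 2.2, StringIndexer also supports "keep", which maps unseen labels to an extra index instead of dropping the rows:

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip") // or "keep" to retain rows with unseen labels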
You have two options:
1. handle_data_PipelineModel ==> df ---> split_dataset ==> train_df/test_df --> arithmetic_PipelineModel ---> test_model ---> evaluate
2. df ==> split_dataset ==> train_df/test_df --> PipelineModel(handle_data_stage and arithmetic_stage) ---> probably error
Option 1 is safe: you need to save both handle_data_PipelineModel and arithmetic_PipelineModel.
Option 2 is bad: no matter how you save the model, if you split the data first, the distribution of train_df and test_df will change.
Note: the dataset should be processed by the data-handling PipelineModel before it is split.

Spark: Can I tune a pipeline with 2 estimators simultaneously?

I have a flow (pipeline in Spark) like this:
I have a DataFrame A, which contains strings
Create a Word2Vec estimator
Create a Word2VecModel transformer
Apply the Word2VecModel to DataFrame A to create a DataFrame B, which contains vectors
Create a KMeans estimator
Create a KMeansModel transformer
Apply the KMeansModel to DataFrame B for clustering
In this flow we have 2 estimators and 2 transformer models, so we would need 2 pipelines and would have to tune each pipeline separately.
But can we do the tuning in one pipeline? I have no idea how to do it, so what is the best way to tune my flow?
Edit:
In Spark ML, the input to pipeline components is only a DataFrame, and the output is a DataFrame or a transformer. But if we chain 2 estimators in 1 pipeline, the output from estimator 1 will be a transformer, so we cannot continue to chain estimator 2 in the same pipeline (it accepts only a DataFrame as input). So is there any trick for tuning 2 estimators?
There is no conflict here. A Spark ML Pipeline can contain an arbitrary number of Estimators. All you have to do is ensure that output column names are unique.
val word2vec: Word2Vec = ???
word2vec.setOutputCol("word2vec_output")

val kmeans: KMeans = ???
kmeans.setFeaturesCol("word2vec_output")
kmeans.setPredictionCol("k_means_prediction")

new Pipeline().setStages(Array(word2vec, kmeans))
However, since different models typically require different feature engineering steps, this is not very useful in practice.
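That said, to answer the tuning question: because both estimators live in a single Pipeline, one CrossValidator can search over both stages' parameters at once. A minimal sketch, reusing word2vec and kmeans from the snippet above and assuming Spark 2.3+ for ClusteringEvaluator (dataFrameA is a placeholder for DataFrame A from the question):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val pipeline = new Pipeline().setStages(Array(word2vec, kmeans))

// One grid covering parameters of both estimators
val grid = new ParamGridBuilder()
  .addGrid(word2vec.vectorSize, Array(50, 100))
  .addGrid(kmeans.k, Array(5, 10, 20))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new ClusteringEvaluator()
    .setFeaturesCol("word2vec_output")
    .setPredictionCol("k_means_prediction"))
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val cvModel = cv.fit(dataFrameA)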

Spark ML pipeline usage

I created an ML pipeline with several transformers, including a StringIndexer which is used during training on the data labels.
I then store the resultant PipelineModel which will later be used for data preparation and prediction on a dataset which doesn't have labels.
The issue is that the created pipeline model's transform function cannot be applied to the new DataFrame, since it expects data labels to be available.
What am I missing?
How should this be done?
Note: My goal is to have a single pipeline (i.e. I'd like to keep the various transformations and ML algorithm together)
Thanks!
You should paste your source code. Your test data format should be consistent with your train data, including the feature names, but you don't need the label column.
You can refer to the official documentation.

Resources