Applying transformations to new data in Spark - apache-spark

I am using Spark (core/MLlib) with Java, version 2.3.1.
I am applying three transformations to a Dataset - StringIndexer, OneHotEncoderEstimator, VectorAssembler - to transform a categorical variable in my dataset into individual columns of 1s and 0s for each category. On my training data, this transformation works with no issues, everything is as expected, and I am saving this model to a file.
My issue comes when I try to use this model on a new datapoint:
public static double loadModel(Obj newData) {
    SparkSession spark = Shots.buildSession();

    // Function which applies transformations
    Dataset<Row> data = buildDataset(spark, Arrays.asList(newData));
    LogisticRegressionModel lrModel = LogisticRegressionModel.load(modelPath);

    // Error is thrown here as the model doesn't seem to understand the input
    Dataset<Row> preds = lrModel.transform(data);
    preds.show();
}
The issue, I believe, is that the transformation is now being applied to only one row of data which outputs only one category for the categorical feature and a vector with only one element after transformation. This causes an error when the LogisticRegressionModel transform is applied, which is expecting a vector with length greater than one for that feature... I think.
I know my error is not knowing how to apply the train transform to the new data... but I am unsure where exactly the error is and, as a result, do not know where to find the answer (is the issue with saving the model, do I need to save something else like the pipeline, etc.).
The actual error being thrown is:
java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y: Vector) was given Vectors with non-matching sizes: x.size = 7, y.size = 2
The reason I have come to the conclusions above is a visual examination of the data.
An example may help explain: I have a categorical feature with 3 values [Yes, No, Maybe]. My training data includes all three values, so I end up with a feature vector of length 3 signifying the category.
I then use the same pipeline on a single data point to predict a value, but the categorical feature can only take one of Yes, No, or Maybe, since there is only one data point. Therefore, when you apply the same transformation as above, you end up with a vector with one element, as opposed to three, causing the model transform to throw an error.

In general, you are not using the API correctly. The correct workflow should preserve the whole set of models fitted during training (in your case at least the StringIndexerModel; the other components look like plain Transformers) and reapply them to the new data.
The most convenient way of doing this is to use a Pipeline:
val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
val pipelineModel = pipeline.fit(data)
pipelineModel.transform(data)
A PipelineModel can be saved like any other component, as long as all of its stages are writable.
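For illustration, here is a minimal PySpark sketch of that flow (the question is in Java, but the API is analogous); the stage names, path and DataFrame names are placeholders:

from pyspark.ml import Pipeline, PipelineModel

pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
pipeline_model = pipeline.fit(train_df)   # learns the categories, encoder sizes and lr weights
pipeline_model.write().overwrite().save("/models/my_pipeline")

# later, at prediction time
loaded = PipelineModel.load("/models/my_pipeline")
preds = loaded.transform(new_data_df)     # uses the same vocabulary and vector sizes as in training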

The issue here was that I was saving the model at the wrong point.
To preserve the effect of the previous transformations you need to fit the pipeline to the training data and then write/save the fitted model. This means saving the PipelineModel rather than the Pipeline. If you fit again after you load the new data, the transformations are re-learned from that data and you lose the state (for example, the indexer's label mapping) required for the transformation to work.

I was also facing the same issue. StringIndexer will fail for values in the test or new data that were not seen during fitting, so we can choose to skip those unknown values:
new StringIndexer().setHandleInvalid("skip")
or pass the union of the train and test data to the pipeline and split it again after the transformation.
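A minimal PySpark illustration (column names are made up): "skip" silently drops rows with unseen values, while "keep", available in more recent Spark versions, maps them to an extra index instead.

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex",
                        handleInvalid="skip")   # or "keep" to bucket unseen values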

You have two options:
Option 1: handle_data_PipelineModel ==> df ---> split_dataset ==> train_df/test_df ---> arithmetic_PipelineModel ---> test_model ---> evaluate
Option 2: df ==> split_dataset ==> train_df/test_df ---> PipelineModel(handle_data_stage and arithmetic_stage) ---> probably error
Option 1 is safe: you need to save both handle_data_PipelineModel and arithmetic_PipelineModel (a sketch follows below).
Option 2 is bad: no matter how you save the model, when you split the data first, the distribution of train_df and test_df will differ.
Note: the split datasets must not go through any data processing before being passed to the PipelineModel.
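A rough PySpark sketch of option 1, where indexer, encoder, assembler and lr are illustrative stage names and the split ratios are arbitrary:

from pyspark.ml import Pipeline

# fit the data-handling stages on the full DataFrame, then transform it
handle_data_model = Pipeline(stages=[indexer, encoder, assembler]).fit(df)
prepared = handle_data_model.transform(df)

# split only after the data-handling transform
train_df, test_df = prepared.randomSplit([0.8, 0.2], seed=42)

# fit the "arithmetic" (model) stages on the training part only
arithmetic_model = Pipeline(stages=[lr]).fit(train_df)
predictions = arithmetic_model.transform(test_df)

# both handle_data_model and arithmetic_model should be saved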

Related

Why does Spark MLlib need Vectors to work correctly?

I am kind of confused about why Spark's MLlib ETL functions, MinMaxScaler for example, need vectors to be assembled instead of just using the column from the dataframe. I.e., instead of being able to do this:
scaler = MinMaxScaler(inputCol="time_since_live", outputCol="scaledTimeSinceLive")
main_df = scaler.fit(main_df).transform(main_df)
I need to do this:
assembler = VectorAssembler(inputCols=['time_since_live'],outputCol='time_since_liveVect')
main_df = assembler.transform(main_df)
scaler = MinMaxScaler(inputCol="time_since_liveVect", outputCol="scaledTimeSinceLive")
main_df = scaler.fit(main_df).transform(main_df)
It seems like such an unnecessary step because I end up creating a vector with one input column to run the MinMaxScaler on. Why does it need this to be in vector format instead of just a dataframe column?
In machine learning and pattern recognition, a set of such features is always represented as a vector, called a "feature vector" (see the Wikipedia entries on feature and feature vector).
Thus the APIs of all the major ML libraries are built to work with feature vectors.
Now the question becomes more about where the vector-conversion step should live: in the client code (as it is now), or inside the API, so that client code can call the API just by listing the feature columns.
IMHO we can have both; if you have some time to spare, you can add a new API that accepts a list of columns instead of a feature vector and make a pull request.
Let's see what the Spark community thinks about this.
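In the meantime, a minimal PySpark sketch (reusing the column names from the question) that hides the intermediate vector column by chaining the two stages in a Pipeline:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

assembler = VectorAssembler(inputCols=["time_since_live"],
                            outputCol="time_since_liveVect")
scaler = MinMaxScaler(inputCol="time_since_liveVect",
                      outputCol="scaledTimeSinceLive")

# fit and apply both stages in one go, then drop the helper vector column
main_df = (Pipeline(stages=[assembler, scaler])
           .fit(main_df)
           .transform(main_df)
           .drop("time_since_liveVect"))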

Calling function on each row of a DataFrame that requires building a DataFrame

I'm trying to wrap some functionality of the lime Python library over Spark ML models. The general idea is to take a PipelineModel (containing each phase of data transformation and the application of the model) as input and build a function that calls the Spark model, applies the lime algorithm and gives an explanation for each single row.
Some context
The lime algorithm consists of locally approximating a trained machine learning model. In its implementation, lime basically just needs a function that, given a feature vector as input, evaluates the predictions of the model. With this function, lime can slightly perturb the feature input, see how the model predictions change and then give an explanation. So, theoretically, it can be applied to any model, evaluated with any engine.
The idea here is to use it with Spark ml models.
The wrapping
In particular, I'm wrapping the LimeTabularExplainer. In order to work, it needs a feature vector in which each element is an index corresponding to the category. Digging into StringIndexer and similar transformers, it's pretty easy to build such a vector from the "raw" values of the data. I then built a function that, from such a vector (or a 2d array if you have more than one case), creates a Spark DataFrame, applies the PipelineModel and returns the model predictions.
The task
Ideally, I would like to build a function that does the following:
process a row of an input DataFrame
from the row, build and collect a numpy vector that works as input for the lime explainer
internally, the lime explainer slightly changes that vector in many ways, building a 2d array of "similar" cases
the above cases are transformed back into a Spark DataFrame
the PipelineModel is applied to the above DataFrame, and the results are collected and brought back to the lime explainer, which continues its work
The problem
As you can see (if you have read this far!), for each row of the DataFrame you build another DataFrame. So you cannot define a udf, since you are not allowed to call Spark functions inside a udf.
So the question is: how can I parallelize the above procedure? Is there another approach that I could follow to avoid the problem?
I think you can still use udfs in this case, followed by explode() to retrieve all the results on different rows. You just have to make sure the input column is already the vector you want to feed lime.
That way you don't even have to collect data out of Spark, which is expensive. Maybe you can even use vectorized udfs in your case to gain speed (not sure).
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, DoubleType

def function(base_case):
    list_similarCases = limefunction(base_case)
    return list_similarCases

# the return type must match what limefunction actually returns
f_udf = udf(function, ArrayType(ArrayType(DoubleType())))
df_result = df_start.withColumn("similar_cases", explode(f_udf("base_case")))

difference between tf.data.Dataset batch and map and tf.contrib.data.map_and_batch

I created a tf.data.Dataset and want to train a model using this dataset:
dataset = dataset.prefetch()
dataset = dataset.shuffle()
dataset = dataset.repeat()
dataset = dataset.map()
dataset = dataset.filter()
dataset = dataset.batch()
I want to know what the difference is between the above dataset and the one below:
dataset = dataset.prefetch()
dataset = dataset.shuffle()
dataset = dataset.repeat()
dataset = dataset.apply(tf.contrib.data.map_and_batch())
I know that they should not be different except in performance. But I don't know whether I should use the .apply() method or not.
Is the first implementation correct?
First off, most of the tf.contrib.data functions are deprecated and moved to tf.data.experimental. So watch out for that.
Take a look at the input pipeline performance guide to get a good idea of what an optimal ordering of the transformations could be for your application. Regarding map and batch: yes, we pass the result of map_and_batch to the apply function, and this is specified in the return description of map_and_batch, for confirmation.
We want to use map_and_batch for efficiency reasons, which normally depends on what your data is and how costly your map function is. The performance guide has some guidelines on this.
Regarding the difference between your first and second blocks of code, there is a filter function in between, so both blocks might not give the same result depending on what you are filtering.
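A rough sketch of the two (roughly equivalent) forms, ignoring the filter step; the source dataset and map function here are stand-ins:

import tensorflow as tf

BATCH_SIZE = 32
parse_fn = lambda x: x * 2            # stand-in for a real parsing/map function
dataset = tf.data.Dataset.range(100)  # stand-in for the real source

# 1) separate map and batch
ds_a = dataset.map(parse_fn).batch(BATCH_SIZE)

# 2) fused map and batch (tf.contrib.data.map_and_batch lives in
#    tf.data.experimental in newer releases)
ds_b = dataset.apply(
    tf.data.experimental.map_and_batch(parse_fn, batch_size=BATCH_SIZE))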

Spark ML pipeline usage

I created an ML pipeline with several transformers, including a StringIndexer which is used during training on the data labels.
I then store the resultant PipelineModel which will later be used for data preparation and prediction on a dataset which doesn't have labels.
The issue is that the created pipeline model's transform function cannot be applied to the new DataFrame, since it expects data labels to be available.
What am I missing?
How should this be done?
Note: My goal is to have a single pipeline (i.e. I'd like to keep the various transformations and ML algorithm together)
Thanks!
You should paste your source code. Your test data format should be consistent with your training data, including the feature names, but you don't need the label column.
You can refer to the official site.

spark ml pipeline handle unseen labels

To handle new and unseen labels in a Spark ML pipeline I want to use most-frequent imputation.
Suppose the pipeline consists of the following steps:
preprocessing
learn most frequent item
stringIndexer for each categorical column
vector assembler
estimator e.g. random forest
Assuming (1), (2, 3) and (4, 5) constitute separate pipelines:
I can fit and transform (1) for the train and test data. This means all NaN values were handled, i.e. imputed.
(2, 3) will fit nicely, as well as (4, 5).
Then I can use the following
val fittedLabels = pipeline23.stages collect { case a: StringIndexerModel => a }
val result = categoricalColumns.zipWithIndex.foldLeft(validationData) {
  (currentDF, colName) =>
    currentDF
      .withColumn(colName._1,
        when(currentDF(colName._1) isin (fittedLabels(colName._2).labels: _*), currentDF(colName._1))
          .otherwise(lit(null)))
}.drop("replace")
to replace new/unseen labels with null
these deliberately introduced nulls are imputed by the most frequent imputer
However, this setup is very ugly, as tools like CrossValidator no longer work (since I can't supply a single pipeline).
How can I access the fitted labels within the pipeline to build a Transformer which handles setting new values to null?
Do you see a better approach to accomplish handling new values?
I assume most-frequent imputation is ok, i.e. for a dataset with around 90 columns only very few columns will contain an unseen label.
I finally realized that this functionality needs to reside in the pipeline itself to work properly, i.e. it requires an additional, custom PipelineStage component.
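For illustration, a rough PySpark sketch of the kind of custom stage described above (the question is in Scala, but the idea carries over). The class name and constructor are hypothetical; the list of seen labels would come from a fitted StringIndexerModel, and for persistence or CrossValidator support you would also need to expose the settings as Params and mix in DefaultParamsWritable.

from pyspark.ml import Transformer
from pyspark.sql import functions as F

class UnseenToNull(Transformer):
    """Replaces values not seen during training with null."""

    def __init__(self, inputCol, seenLabels):
        super().__init__()
        self.inputCol = inputCol
        self.seenLabels = list(seenLabels)

    def _transform(self, df):
        c = F.col(self.inputCol)
        # keep values that were seen at fit time, null out everything else
        return df.withColumn(
            self.inputCol,
            F.when(c.isin(self.seenLabels), c).otherwise(F.lit(None)))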
