Adding "external" features to Pipeline - scikit-learn

I'd like to compose a pipeline for text data. However, there is an
extra step in the pipeline where "external" features are added. These
features are stored in an external database and are accessed by document id
(the row number in the input).
The custom pipeline stage comes after a tfidf step, meaning the input to the
stage will be a sparse matrix. Is there a way for me to pass the indices of
the input matrix as well? Or maybe a generic way to pass some metadata between
pipeline stages?
Note that the input to the pipeline is selected by GridSearchCV.
I saw the "Feature Union with Heterogeneous Data Sources" example, but I fail to see how to apply it to my situation, since I can't compute the features from the input to the stage.
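One possible sketch of a way around this (not from the original post): instead of attaching the external features after the tf-idf step, branch the raw input with a FeatureUnion, so that one branch computes tf-idf from the text and the other looks the features up by document id. The doc_id/text column layout and the fetch_external_features helper below are assumptions made purely for illustration.

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def fetch_external_features(doc_id):
    # Hypothetical lookup into the external database, keyed by document id.
    return [0.0, 0.0]

class TextSelector(BaseEstimator, TransformerMixin):
    # Passes only the text column on to the TfidfVectorizer.
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X["text"]

class ExternalFeatures(BaseEstimator, TransformerMixin):
    # Looks up the precomputed features for each row by its document id.
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        rows = [fetch_external_features(doc_id) for doc_id in X["doc_id"]]
        return sparse.csr_matrix(np.asarray(rows, dtype=float))

pipe = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", Pipeline([("select", TextSelector()), ("vec", TfidfVectorizer())])),
        ("external", ExternalFeatures()),
    ])),
    ("clf", LogisticRegression()),
])

docs = pd.DataFrame({"doc_id": [0, 1], "text": ["first document", "second document"]})
pipe.fit(docs, [0, 1])

Because GridSearchCV slices the raw DataFrame row-wise before handing it to the pipeline, both branches always see the same subset of documents, so no metadata has to be passed between stages.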

Related

Same sklearn pipeline different results

I have created a pipeline based on:
A custom TfidfVectorizer that returns the TF-IDF vectors as a dataframe (600 features)
A custom feature generator that creates new features (5)
A FeatureUnion to join the two dataframes. I checked: the output is an array, so there are no feature names. (605)
An XGBoost classifier model, with seed and random state included (8 classes as label names)
If I fit and use the pipeline in a Jupyter notebook, I obtain good F1 scores.
However, when I save it (using pickle, joblib or dill) and later load it in another notebook or script, I cannot always reproduce the results. I don't understand it, because the test input is always the same, and so is the Python environment.
Could you help me with some suggestions?
Thanks!
Things I have tried:
Saving the pipeline with different libraries
Adding a DenseTransformer at some points
Using ColumnTransformer instead of FeatureUnion
I cannot use the PMML library due to some restrictions
Etc.
The problem stays the same.
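A rough sketch of the kind of pipeline described above, with the seed pinned and a round-trip check through joblib; the custom feature step is omitted and every name and parameter here is an illustrative assumption, not the poster's actual code. If serialization alone were the problem, the assertion at the end would fail.

import joblib
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

X_train = ["good movie", "bad movie", "great film", "terrible film"] * 5
y_train = [1, 0, 1, 0] * 5

pipe = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer(max_features=600)),
        # ("custom", CustomFeatureGenerator()),  # the poster's 5 extra features would go here
    ])),
    ("clf", XGBClassifier(random_state=42)),
])
pipe.fit(X_train, y_train)

joblib.dump(pipe, "pipeline.joblib")
reloaded = joblib.load("pipeline.joblib")

# The reloaded pipeline should reproduce the original predictions on identical input.
assert np.array_equal(pipe.predict(X_train), reloaded.predict(X_train))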

Tuned model with GroupKFold Cross-Validation requires Group parameter when Predicting

I tuned a RandomForest with GroupKFold (to prevent data leakage because some rows came from the same group).
I get a best fit model, but when I go to make a prediction on the test data it says that it needs the group feature.
Does that make sense? It's odd that the group feature is coming up as one of the most important features as well.
I'm just wondering if there is something I could be doing wrong.
Thanks
A search on the scikit-learn Github repo does not reveal a single instance of the string "group feature" or "group_feature" or anything similar, so I will go ahead and assume you have in your data set a feature called "group" that the prediction model requires as input in order to produce an output.
Remember that a prediction model is basically a function that takes an input (the "predictor" variable) and returns an output (the "predicted" variable). If a variable called "group" was defined as input for your prediction model, then it makes sense that scikit-learn would request it.
Does the group appear as a column in the training set? If so, remove it and re-train. It looks like you are just using it to generate the splits. If it isn't part of the input data you need at prediction time, it shouldn't be in the training set.
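A minimal sketch of that advice, with made-up column names: the group column only drives the cross-validation splits and never enters the feature matrix, so nothing group-related is needed at prediction time.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, GridSearchCV

df = pd.DataFrame({
    "f1": [0.1, 0.4, 0.3, 0.8, 0.5, 0.9, 0.2, 0.7],
    "f2": [1, 0, 1, 1, 0, 0, 1, 0],
    "group": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "y": [0, 1, 0, 1, 0, 1, 0, 1],
})

X = df.drop(columns=["group", "y"])   # the group column is NOT a feature
y = df["y"]
groups = df["group"]                  # ...it only drives the CV splits

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50]},
    cv=GroupKFold(n_splits=2),
)
search.fit(X, y, groups=groups)
search.best_estimator_.predict(X)     # no group needed at prediction time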

Applying transformation to new data Spark

I am using Spark (core/mllib) with Java, version 2.3.1.
I am applying three transformations to a Dataset - StringIndexer, OneHotEncoderEstimator, VectorAssembler - to transform a categorical variable in my dataset into individual columns of 1s and 0s for each category. On my train data, this transformation works with no issues, everything is as expected, and I am saving this model to file.
My issue comes when I try to use this model on a new datapoint:
public static double loadModel(Obj newData) {
    SparkSession spark = Shots.buildSession();
    // Function which applies transformations
    Dataset<Row> data = buildDataset(spark, Arrays.asList(newData));
    LogisticRegressionModel lrModel = LogisticRegressionModel.load(modelPath);
    // Error is thrown here as the model doesn't seem to understand the input
    Dataset<Row> preds = lrModel.transform(data);
    preds.show();
}
The issue, I believe, is that the transformation is now being applied to only one row of data, which yields only one category for the categorical feature and a vector with only one element after transformation. This causes an error when the LogisticRegressionModel transform is applied, which expects a vector with length greater than one for that feature... I think.
I know my error is not knowing how to apply the train transform to the new data... but I am unsure where exactly the error is and, as a result, do not know where to find the answer (is the issue with saving the model, do I need to save something else like the pipeline, etc.).
The actual error being thrown is -
java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes: x.size = 7, y.size = 2 - the reason why I have come to the conclusions above is a visual examination of the data.
An example may help explain: I have a categorical feature with 3 values [Yes, No, Maybe]. My train data includes all three values, so I end up with a feature vector of length 3 representing the category.
I then use the same pipeline on a single data point to predict a value, but only one of Yes, No, or Maybe appears, as there is only one data point. Therefore, when you apply the same transformation as above, you end up with a vector with one element instead of three, which causes the model transform to throw an error.
In general, you are not using the API correctly. The correct workflow should include preserving the whole set of models trained in the process (in your case at least the StringIndexerModel; the other components look like Transformers) and reapplying them to the new data.
The most convenient way of doing it is using a Pipeline:
val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
val pipelineModel = pipeline.fit(data)
pipelineModel.transform(data)
PipelineModels can be saved like any other component, as long as all of their stages are writable.
The issue here was that I was saving the model at the wrong point.
To preserve the effect of the previous transformations, you need to fit the pipeline to the training data and then write/save the resulting model. This means saving the PipelineModel rather than the Pipeline. If you instead save the unfitted Pipeline and fit it after loading the new data, the transformations are re-fit from scratch on that data and you lose the state (such as the learned category-to-index mappings) that the transform needs in order to work.
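A PySpark sketch of the same idea, using the Spark 3.x OneHotEncoder in place of the 2.3 OneHotEncoderEstimator; the column names, data and paths are placeholders, not the poster's actual setup. The point is that the fitted PipelineModel, not the unfitted Pipeline, is what gets saved and later reused.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
train_df = spark.createDataFrame(
    [("Yes", 1.0, 1.0), ("No", 2.0, 0.0), ("Maybe", 3.0, 1.0)],
    ["category", "amount", "label"],
)

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
encoder = OneHotEncoder(inputCols=["categoryIndex"], outputCols=["categoryVec"])
assembler = VectorAssembler(inputCols=["categoryVec", "amount"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
pipeline_model = pipeline.fit(train_df)  # the category-to-index mapping is learned here
pipeline_model.write().overwrite().save("/tmp/lr_pipeline_model")

# Later, in the serving code: load the fitted model and transform a single new row.
reloaded = PipelineModel.load("/tmp/lr_pipeline_model")
new_df = spark.createDataFrame([("Yes", 5.0)], ["category", "amount"])
reloaded.transform(new_df).select("probability", "prediction").show()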
I was also facing the same issue. StringIndexer will fail on values in the test or new data that it did not see during fitting, so we can choose to skip those unknown values:
new StringIndexer().setHandleInvalid("skip")
Alternatively, pass the union of the train and test data to the pipeline and split it post-transformation.
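For reference, a PySpark equivalent (a sketch, with a placeholder column name): "skip" silently drops rows whose category was unseen at fit time, while "keep" assigns them an extra index instead of dropping them.

from pyspark.ml.feature import StringIndexer

# handleInvalid may be "error" (default), "skip" (drop unseen rows) or "keep" (extra bucket).
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex", handleInvalid="skip")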
You have two options:
Option 1: handle_data_PipelineModel ==> df ==> split_dataset ==> train_df/test_df ==> arithmetic_PipelineModel ==> test_model ==> evaluate
Option 2: df ==> split_dataset ==> train_df/test_df ==> PipelineModel(handle_data_stage and arithmetic_stage) ==> probably error
Option 1 is safe: you need to save both handle_data_PipelineModel and arithmetic_PipelineModel.
Option 2 is bad, no matter how you save the model: when you split the data first, the distributions of train_df and test_df will differ.
Note: the divided data sets must not be processed before the PipelineModel.

Spark ML pipeline usage

I created an ML pipeline with several transformers, including a StringIndexer which is used during training on the data labels.
I then store the resultant PipelineModel which will later be used for data preparation and prediction on a dataset which doesn't have labels.
The issue is that the created pipeline model's transform function cannot be applied to the new DataFrame, since it expects data labels to be available.
What am I missing?
How should this be done?
Note: My goal is to have a single pipeline (i.e. I'd like to keep the various transformations and ML algorithm together)
Thanks!
You should paste your source code. Your test data format should be consistent with your train data, including the feature names, but you don't need the label column.
You can refer to the official site.

Non linear (DAG) ML pipelines in Apache Spark

I've set up a simple Spark-ML app, where I have a pipeline of independent transformers that add columns to a dataframe of raw data. Since the transformers don't look at the output of one another, I was hoping I could run them in parallel in a non-linear (DAG) pipeline. All I could find about this feature is this paragraph from the Spark ML-Guide:
It is possible to create non-linear Pipelines as long as the data flow
graph forms a Directed Acyclic Graph (DAG). This graph is currently
specified implicitly based on the input and output column names of
each stage (generally specified as parameters). If the Pipeline forms
a DAG, then the stages must be specified in topological order.
My understanding of the paragraph is that if I set the inputCol(s) and outputCol parameters for each transformer, and specify the stages in topological order when I create the pipeline, then the engine will use that information to build an execution DAG such that the stages of the DAG could run once their input is ready.
Some questions about that:
Is my understanding correct?
What happens if for one of the stages/transformers I don't specify an output column (e.g. the stage only filters some of the rows)? Will it assume that, for DAG creation purposes, the stage is changing all columns, so all subsequent stages should wait for it?
Likewise, what happens if for one of the stages I don't specify an inputCol(s)? Will the stage wait until all previous stages are complete?
It seems I can specify multiple input columns but only one output column. What happens if a transformer adds two columns to a dataframe (Spark itself has no problem with that)? Is there some way to let the DAG creation engine know about it?
Is my understanding correct?
Not exactly. Because stages are provided in a topological order, all you have to do to traverse the graph in the correct order is to apply the PipelineStages from left to right. And this is exactly what happens when you call PipelineTransform.
The sequence of stages is traversed twice:
once to validate the schema, using transformSchema, which is simply implemented as stages.foldLeft(schema)((cur, stage) => stage.transformSchema(cur)). This is the part where the actual schema validation is performed.
once to actually transform the data using the Transformers and to fit the Estimators. This is just a simple for loop which applies the stages sequentially, one by one.
Likewise, what happens if for one of the stages I don't specify an inputCol(s)?
Pretty much nothing interesting. Since stages are applied sequentially, and the only schema validation is performed by the given Transformer, using its transformSchema method, before the actual transformations begin, it will be processed like any other stage.
What happens if a transformer adds two columns to a dataframe
Same as above. As long as it generates a valid input schema for the subsequent stages, it is no different from any other Transformer.
transformers don't look at the output of one another I was hoping I could run them in parallel
Theoretically you could try to build a custom composite Transformer which encapsulates multiple different transformations, but the only part that could be performed independently and benefit from this type of operation is model fitting. At the end of the day you have to return a single transformed DataFrame which can be consumed by downstream stages, and the actual transformations are most likely scheduled as a single data scan anyway.
The question remains whether it is really worth the effort. While it is possible to execute multiple jobs at the same time, it provides an edge only if the amount of available resources is relatively high compared to the amount of work required to handle a single job. It usually requires some low-level management (number of partitions, number of shuffle partitions), which is not the strongest suit of Spark SQL.
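To make the "left to right" point concrete, here is a small PySpark sketch (with toy column names, not from the original answer): two stages that don't read each other's output can be listed in either topological order, yet the Pipeline still applies them sequentially in the listed order.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0)], ["a", "b"])

# Two independent stages: neither reads the other's output column.
double_a = SQLTransformer(statement="SELECT *, a * 2 AS a2 FROM __THIS__")
double_b = SQLTransformer(statement="SELECT *, b * 2 AS b2 FROM __THIS__")

# Both orderings are valid topological orders, and both simply run stage by stage.
Pipeline(stages=[double_a, double_b]).fit(df).transform(df).show()
Pipeline(stages=[double_b, double_a]).fit(df).transform(df).show()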
