I have a flow (a pipeline in Spark) like this:
I have a DataFrame A, which contains strings.
Create a Word2Vec estimator.
Fit it to create a Word2VecModel transformer.
Apply the Word2VecModel to DataFrame A to create a DataFrame B, which contains vectors.
Create a KMeans estimator.
Fit it to create a KMeansModel transformer.
Apply the KMeansModel to DataFrame B for clustering.
In this flow we have 2 estimators and 2 transformer models, so it seems we need 2 pipelines and have to tune each pipeline separately.
But can we do the tuning in one pipeline? I have no idea how to do it, so what is the best way to tune my flow?
Edit:
In spark.ml, the input to pipeline components is only a DataFrame, and the output is a DataFrame or a transformer. But if we chain 2 estimators in 1 pipeline, the output of estimator 1 will be a transformer, so it seems you cannot chain estimator 2 onto the same pipeline (which accepts only a DataFrame as input). So is there any trick for tuning 2 estimators?
There is no conflict here. A Spark ML Pipeline can contain an arbitrary number of Estimators. All you have to do is ensure that the output column names are unique.
val word2vec: Word2Vec = ???
word2vec.setOutputCol("word2vec_output")
val kmeans: KMeans = ???
kmeans.setFeaturesCol("word2vec_output")
kmeans.setPredictionCol("k_means_prediction")
new Pipeline().setStages(Array(word2vec, kmeans))
However, since different models typically require different feature engineering steps, this is not very useful in practice.
Related
I have created a pipeline based on:
A custom TfidfVectorizer to produce the TF-IDF vector as a dataframe (600 features)
A custom feature generator to create new features (5)
A FeatureUnion to join the two dataframes. I checked that the output is an array, so there are no feature names. (605)
An XGBoost classifier model, seed and random state included (8 classes as label names)
If I fit and use the pipeline in a Jupyter notebook, I obtain good F1 scores.
However, when I save it (using pickle, joblib or dill) and later load it in another notebook or script, I cannot always reproduce the results! I cannot understand it, because the input for testing is always the same... and so is the Python environment!
Could you help me with some suggestions?
Thanks!
Things I have tried:
Saving the pipeline with different libraries
A DenseTransformer at some points
ColumnTransformer instead of FeatureUnion
(I cannot use the PMML library due to some restrictions.)
Etc.
The problem is always the same.
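One way to narrow this down is to capture reference predictions on a fixed batch at save time and assert on them right after loading: if the check passes in the same process but fails after reloading elsewhere, the discrepancy comes from the environment (library versions, BLAS, etc.), not the serialization itself. A minimal sketch of that harness, with scikit-learn's RandomForestClassifier standing in for the XGBoost model and illustrative data:

```python
import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

texts = ["spark ml pipeline", "xgboost classifier", "feature union join",
         "tfidf vector features"] * 5
labels = [0, 1, 0, 1] * 5

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=42)),
])
pipe.fit(texts, labels)

# Capture reference predictions on a fixed batch at save time ...
reference = pipe.predict(texts)
blob = pickle.dumps(pipe)

# ... and verify them immediately after loading
loaded = pickle.loads(blob)
assert np.array_equal(loaded.predict(texts), reference)
```

Shipping the reference batch and its expected predictions alongside the pickled artifact makes the check repeatable in the target notebook; note also that pickle makes no cross-version guarantees, so pinning the exact scikit-learn/xgboost/numpy versions between save and load is usually part of the fix.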
I have a question. I am trying to serialize a PySpark ML model to mleap.
However, the model makes use of the SQLTransformer to do some column-based transformations e.g. adding log-scaled versions of some columns.
As we all know, MLeap doesn't support SQLTransformer; see here:
https://github.com/combust/mleap/issues/126
so I've implemented the former of these 2 suggestions:
For non-row operations, move the SQL out of the ML Pipeline that you plan to serialize.
For row-based operations, use the available ML transformers or write a custom transformer <- this is where the custom transformer documentation will help.
I've externalized the SQL transformation on the training data used to build the model, and I do the same for the input data when I run the model for evaluation.
The problem I'm having is that I'm unable to obtain the same results across the 2 models.
Model 1 - a pure Spark ML model containing the SQLTransformer plus the later transformations: StringIndexer -> OneHotEncoderEstimator -> VectorAssembler -> RandomForestClassifier
Model 2 - the externalized version, with the SQL queries run on the training data when building the model. Its transformations are everything after the SQLTransformer in Model 1: StringIndexer -> OneHotEncoderEstimator -> VectorAssembler -> RandomForestClassifier
I'm wondering how I could go about debugging this problem. Is there a way to somehow compare the results after each stage to see where the differences show up?
Any suggestions are appreciated.
I'm certain I've developed a gap in my understanding of Spark ML's Pipelines.
I have a pipeline that trains against a set of data with a schema of "label" and "comment" (both strings). My pipeline transforms "label", adding "indexedLabel", and vectorizes "comment" by tokenizing and then applying HashingTF (ending with "vectorizedComment"). The pipeline concludes with a LogisticRegression, with label column "indexedLabel" and features column "vectorizedComment".
And it works great! I can fit against my pipeline and get a pipeline model that transforms datasets with "label", "comment" all day long!
However, my goal is to be able to throw datasets of just "comment" at the model, since "label" is only present for training purposes.
I'm confident that I've got a gap in understanding of how predictions with pipelines work - could someone point it out for me?
Transformations of the label can be done outside of the pipeline (i.e. before it). The label is only necessary during training, not during actual usage of the pipeline/model. By performing label transformations inside the pipeline, any dataframe would be required to have a label column, which is undesired.
Small example:
val indexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
val df2 = indexer.fit(df).transform(df)
// Create pipeline with other stages and use df2 to fit it
Alternatively, you could have two separate pipelines: one including the label transformations, used during training, and one without them. Make sure the other stages refer to the same objects in both pipelines.
val indexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
// Create feature transformers and add to the pipelines
val pipelineTraining = new Pipeline().setStages(Array(indexer, ...))
val pipelineUsage = new Pipeline().setStages(Array(...))
I created an ML pipeline with several transformers, including a StringIndexer which is used during training on the data labels.
I then store the resultant PipelineModel which will later be used for data preparation and prediction on a dataset which doesn't have labels.
The issue is that the created pipeline model's transform function cannot be applied to the new DataFrame, since it expects data labels to be available.
What am I missing?
How should this be done?
Note: My goal is to have a single pipeline (i.e. I'd like to keep the various transformations and ML algorithm together)
Thanks!
You should paste your source code. Your test data format should be consistent with your training data, including the feature names, but you don't need the label column.
You can refer to the official site.
To handle new and unseen labels in a Spark ML pipeline, I want to use most-frequent imputation.
Suppose the pipeline consists of 5 steps:
1. preprocessing
2. learn the most frequent item
3. a StringIndexer for each categorical column
4. a VectorAssembler
5. an estimator, e.g. a random forest
Assuming (1), (2, 3) and (4, 5) constitute separate pipelines:
I can fit and transform (1) for the train and test data. This means all NaN values were handled, i.e. imputed.
(2, 3) will fit nicely, as will (4, 5).
Then I can use the following
val fittedLabels = pipeline23.stages collect { case a: StringIndexerModel => a }

val result = categoricalColumns.zipWithIndex.foldLeft(validationData) {
  (currentDF, colName) =>
    currentDF.withColumn(
      colName._1,
      when(currentDF(colName._1).isin(fittedLabels(colName._2).labels: _*), currentDF(colName._1))
        .otherwise(lit(null))
    )
}.drop("replace")
to replace new/unseen labels with null.
These deliberately introduced nulls are then imputed by the most-frequent imputer.
However, this setup is very ugly, as tools like CrossValidator no longer work (since I can't supply a single pipeline).
How can I access the fitted labels within the pipeline to build a Transformer which handles setting new values to null?
Do you see a better approach to handling new values?
I assume most-frequent imputation is OK, i.e. for a dataset with around 90 columns, only very few columns will contain an unseen label.
I finally realized that this functionality needs to reside in the pipeline to work properly, i.e. it requires an additional, new PipelineStage component.