On this page: https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines
fit_transform is called to transform the data, as follows:
from sklearn.pipeline import FeatureUnion, Pipeline
feats = FeatureUnion([('text', text),
('length', length),
('words', words),
('words_not_stopword', words_not_stopword),
('avg_word_length', avg_word_length),
('commas', commas)])
feature_processing = Pipeline([('feats', feats)])
feature_processing.fit_transform(X_train)
During training with feature processing, however, it only calls fit and then predict:
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
('features',feats),
('classifier', RandomForestClassifier(random_state = 42)),
])
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
np.mean(preds == y_test)
The question is: in the second case, does fit also apply the transformation to X_train (as transform would), given that we are not calling fit_transform here?
sklearn's Pipeline has some nice features. It performs several tasks in a very clean way: we define our features, their transformations, and the classifier we want to use, all in one object.
In the first step of this
pipeline = Pipeline([
('features',feats),
('classifier', RandomForestClassifier(random_state = 42)),
])
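you have defined the features' name and their transformation function (incorporated in feats); in the second step, you have defined the classifier's name and the classifier itself.
Now, when you call pipeline.fit, it first fits the feature transformers and transforms the data, then fits the classifier on the transformed features. So it does these steps for us. When you later call pipeline.predict, the already fitted transformers only transform the data before the classifier predicts. See the scikit-learn Pipeline documentation for more details.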
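To make this concrete, here is a minimal sketch comparing pipeline.fit / pipeline.predict with the same steps written out by hand. The iris data and the two-transformer FeatureUnion are stand-ins chosen for illustration (the original feats is built from text features); with the same random_state both routes should give the same predictions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stand-in feature union (assumption: any set of transformers behaves the same way)
feats = FeatureUnion([
    ("scaled", StandardScaler()),
    ("log", FunctionTransformer(np.log1p)),
])

# Route 1: let the pipeline handle fitting and transforming
pipeline = Pipeline([
    ("features", clone(feats)),
    ("classifier", RandomForestClassifier(random_state=42)),
])
pipeline.fit(X_train, y_train)
pipeline_preds = pipeline.predict(X_test)

# Route 2: the same steps written out by hand
manual_feats = clone(feats)
X_train_t = manual_feats.fit_transform(X_train)  # what fit() does for the transformer step
manual_clf = RandomForestClassifier(random_state=42).fit(X_train_t, y_train)
X_test_t = manual_feats.transform(X_test)        # predict() only transforms, it does not refit
manual_preds = manual_clf.predict(X_test_t)

same = np.array_equal(pipeline_preds, manual_preds)
print(same)
```

So there is no need to call fit_transform yourself before pipeline.fit; doing so would transform the data twice.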
LightGBM's sklearn API classifier, LGBMClassifier, allows you to designate early_stopping_rounds, eval_metric, and eval_set parameters in its LGBMClassifier.fit() method. While it's convenient, it doesn't play well with a custom data processor and sklearn's grid search. Example:
ml_pipeline = Pipeline(steps=[
('cdf',custom_data_transformer()),
('lgb',LGBMClassifier())])
# You can't throw in lgb__early_stopping_rounds here because that parameter
# is used during the .fit() method, not the instantiation of the LGBMClassifier()
params = {'lgb__max_depth':np.arange(3,10),
'lgb__reg_alpha':np.linspace(0,1,num=11),
}
rgs = RandomizedSearchCV(estimator=ml_pipeline,
param_distributions=params,
n_iter=10,
cv=5)
# So we designate lgb__early_stopping_rounds in the RandomizedSearchCV
# .fit() method, but our eval_set will not have gone through
# custom_data_transformer(), so x_train and x_test will be very different.
rgs.fit(x_train,y_train,
lgb__early_stopping_rounds=10,
lgb__eval_set=[(x_test,y_test)],
lgb__eval_metric='auc')
tl;dr: Has anyone got any tips on making LightGBM's sklearn API play well with sklearn's Pipeline with a custom data transformation step and early stopping with an eval_set?
You can always transform the data for your eval_set manually:
x_train_eval = ml_pipeline['cdf'].fit_transform(x_train)
x_test_eval = ml_pipeline['cdf'].transform(x_test)
eval_set_manual = [(x_train_eval, y_train), (x_test_eval, y_test)]
rgs.fit(x_train,y_train,
lgb__early_stopping_rounds=10,
lgb__eval_set=eval_set_manual,
lgb__eval_metric='auc')
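If the pipeline has more than one preprocessing step, the same idea can be sketched with pipeline slicing: ml_pipeline[:-1] selects every step except the final estimator. The example below is only an illustration that runs without LightGBM; StandardScaler stands in for custom_data_transformer() and LogisticRegression for LGBMClassifier, and the synthetic data is an assumption.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

ml_pipeline = Pipeline(steps=[
    ("cdf", StandardScaler()),       # stand-in for custom_data_transformer()
    ("clf", LogisticRegression()),   # stand-in for LGBMClassifier()
])

# Clone all preprocessing steps so the pipeline used in the search stays unfitted
preprocessor = clone(ml_pipeline[:-1])
x_train_eval = preprocessor.fit_transform(x_train)  # fit on training data only
x_test_eval = preprocessor.transform(x_test)        # reuse the training statistics

eval_set_manual = [(x_train_eval, y_train), (x_test_eval, y_test)]
```

One caveat: inside a cross-validated search the transformer is refit on each training fold, so a manually transformed eval_set is only approximately consistent with what the classifier sees in a given fold.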
I want to run Linear Regression along with K fold cross validation using sklearn library on my training data to obtain the best regression model. I then plan to use the predictor with the lowest mean error returned on my test set.
For example the below piece of code gives me an array of 20 results with different neg mean absolute errors, I am interested in finding the predictor which gives me this (least) error and then use that predictor on my test set.
sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
There is no such thing as "predictor which gives me this (least) error" in cross_val_score; all estimators in:
sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
are the same.
You may wish to check GridSearchCV that will indeed search through different sets of hyperparams and return the best estimator:
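from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
X, y = datasets.make_regression()
lr_model = LinearRegression()
# 'normalize' was removed from LinearRegression in scikit-learn 1.2,
# so search over parameters that still exist
parameters = {'fit_intercept': [True, False], 'positive': [True, False]}
clf = GridSearchCV(lr_model, parameters, refit=True, cv=5)
clf.fit(X, y)
# fit() returns the search object itself; the refitted best model is best_estimator_
best_model = clf.best_estimator_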
Note the refit=True param that ensures the best model is refit on the whole dataset and returned.
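To see why cross_val_score has no single "best predictor", here is a small sketch using cross_validate with return_estimator=True, which keeps the model fitted in each fold. The synthetic data is an assumption. Every fold uses the same hyperparameters, so the scores are normally averaged and a final model is then fitted on all the training data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv_results = cross_validate(
    LinearRegression(), X, y,
    scoring="neg_mean_absolute_error",
    cv=5,
    return_estimator=True,  # keep the estimator fitted on each training fold
)

# Each fold has its own fitted coefficients, but identical hyperparameters
fold_params = [est.get_params() for est in cv_results["estimator"]]
all_same = all(p == fold_params[0] for p in fold_params)

# Summarise the folds with the mean error, then fit one model on everything
mean_mae = -cv_results["test_score"].mean()
final_model = LinearRegression().fit(X, y)
print(all_same, round(mean_mae, 2))
```

The per-fold differences come only from the data each fold was trained on, so picking the fold with the lowest error would select a lucky split rather than a better model.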
I don't have any example data to share in order to replicate the problem, but perhaps someone can provide a high level answer. I've created a lot of logistic regression models in the past, and this is the first time my predict proba scores are showing up as either 1 or 0.
I'm creating a binary classifier to predict one of two labels. I've also used a couple of other algorithms, XGBClassifier and RandomForestClassifier, with the same dataset. For these, predict_proba yields the expected probability results (i.e., float values between 0 and 1).
Also, for the LogisticRegression model, I've tried a variety of parameters including all default params, yet the issue persists. Weirdly enough, using SGDClassifier with loss = 'log' or 'modified_huber' also yields the same binary predict_proba results, so I'm thinking this might be something intrinsic to the dataset, but not sure. Also, this issue only occurs if I standardize training set data. So far I've tried both StandardScaler and MinMaxScaler, same results.
Has anyone ever encountered a problem such as this?
Edit:
The LR parameters are:
LogisticRegression(C=1.7993269963183343, class_weight='balanced', dual=False,
fit_intercept=True, intercept_scaling=1, l1_ratio=.5,
max_iter=100, multi_class='warn', n_jobs=-1, penalty='elasticnet',
random_state=58, solver='saga', tol=0.0001, verbose=0,
warm_start=False)
Again, the issue only occurs when standardizing the data with either StandardScaler() or MinMaxScaler(). Which is odd because the data is not a uniform scale across all features. For instance, some features are represented as percents, others are represented as dollar values, and others are dummy coded representations.
This can happen when you do the following two things in sequence:
Fit an estimator with standardized training data and then later on,
Pass unstandardized data to the same estimator in the validation or testing phase.
Here's an example of predict_proba returning 0 or 1 using the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=123)
# Example 1 [CORRECT]
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)
# Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
print(pipeline)
y_pred = pipeline.predict_proba(X_test)
# [0.37264656 0.62735344]
print(y_pred.mean(axis=0))
# Example 2 [INCORRECT]
# Fit the model with standardized training set
X_scaled = StandardScaler().fit_transform(X_train)
model = LogisticRegression()
model.fit(X_scaled, y_train)
# Test the model with unstandardized test set
y_pred = model.predict_proba(X_test)
# [1.00000000e+000 2.48303123e-204]
print(y_pred.mean(axis=0))
Since the estimator in Example 2 was fitted on scaled data with a unit variance of 1.0 (X_scaled), the variance of the data it's being tested on (X_test) is much higher than expected. It's no surprise then that this results in very extreme probabilities.
You can prevent this from happening by wrapping your estimator within a pipeline and calling the pipeline fit method instead of the estimator's fit method (see Example 1). Doing it this way guarantees that the same transformations are applied to the data in the training, validation and testing phases.
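If you do need to scale outside a pipeline, a minimal sketch of the fix for Example 2 is to keep the fitted scaler and apply it to the test data as well. Assuming the same dataset and split as above, the manual route should then match the pipeline's probabilities.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=123
)

# Manual route: keep the fitted scaler around and reuse it at test time
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_proba = model.predict_proba(scaler.transform(X_test))

# Pipeline route: the scaling is applied automatically in fit and predict_proba
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
pipeline_proba = pipeline.predict_proba(X_test)

matches = np.allclose(manual_proba, pipeline_proba)
print(matches)
```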
The following reproducible script is used to compute the accuracy of a Word2Vec classifier with the W2VTransformer wrapper in gensim:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from gensim.sklearn_api import W2VTransformer
from gensim.utils import simple_preprocess
# Load synthetic data
data = pd.read_csv('https://pastebin.com/raw/EPCmabvN')
data = data.head(10)
# Set random seed
np.random.seed(0)
# Tokenize text
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
# Get labels
y_train = data.label
train_input = [x[0] for x in X_train]
# Train W2V Model
model = W2VTransformer(size=10, min_count=1)
model.fit(X_train)
clf = LogisticRegression(penalty='l2', C=0.1)
clf.fit(model.transform(train_input), y_train)
text_w2v = Pipeline(
[('features', model),
('classifier', clf)])
score = text_w2v.score(train_input, y_train)
score
0.80000000000000004
The problem with this script is that it only works when train_input = [x[0] for x in X_train], which essentially is always the first word only.
Once it is changed to train_input = X_train (or train_input is simply replaced by X_train), the script returns:
ValueError: cannot reshape array of size 10 into shape (10,10)
How can I solve this issue, i.e. how can the classifier work with more than one word of input?
Edit:
Apparently, the W2V wrapper can't work with the variable-length train input, as compared to D2V. Here is a working D2V version:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess, lemmatize
from gensim.sklearn_api import D2VTransformer
data = pd.read_csv('https://pastebin.com/raw/bSGWiBfs')
np.random.seed(0)
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
y_train = data.label
model = D2VTransformer(dm=1, size=50, min_count=2, iter=10, seed=0)
model.fit(X_train)
clf = LogisticRegression(penalty='l2', C=0.1, random_state=0)
clf.fit(model.transform(X_train), y_train)
pipeline = Pipeline([
('vec', model),
('clf', clf)
])
y_pred = pipeline.predict(X_train)
score = accuracy_score(y_train,y_pred)
print(score)
This is technically not an answer, but it cannot be written in comments, so here it is. There are multiple issues here:
The LogisticRegression class (and most other scikit-learn models) works with 2-d data (n_samples, n_features).
That means that it needs a collection of 1-d arrays (one for each row (sample), in which the elements of the array contain the feature values).
In your data, a single word will be a 1-d array, which means that a single sentence (sample) will be a 2-d array. That in turn means that the complete data (a collection of sentences here) will be a collection of 2-d arrays. Even then, since each sentence can have a different number of words, it cannot be combined into a single 3-d array.
Secondly, the W2VTransformer in gensim looks like a scikit-learn compatible class, but it's not. It tries to follow "scikit-learn API conventions" by defining the methods fit(), fit_transform() and transform(), but they are not compatible with scikit-learn's Pipeline.
You can see that the input param requirements of fit() and fit_transform() are different.
fit():
X (iterable of iterables of str) – The input corpus.
X can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from
disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec
module for such examples.
fit_transform():
X (numpy array of shape [n_samples, n_features]) – Training set.
If you want to use scikit-learn, then you will need to have the 2-d shape. You will need to "somehow merge" word-vectors for a single sentence to form a 1-d array for that sentence. That means that you need to form a kind of sentence-vector, by doing:
sum of individual words
average of individual words
weighted averaging of individual words based on frequency, tf-idf etc.
using other techniques like sent2vec, paragraph2vec, doc2vec etc.
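As an illustration of the averaging option, here is a minimal NumPy sketch. The word_vectors dictionary and the sentence_vector helper are hypothetical; in practice the vectors would come from a trained Word2Vec model. The point is that every sentence, whatever its length, ends up as one fixed-size row, which gives the 2-d shape scikit-learn expects.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (a trained model would supply real ones)
word_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
    "bad": np.array([-1.0, 0.0, -2.0]),
}

def sentence_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Sentences of different lengths, including one with no known words
sentences = [["good", "movie"], ["bad"], ["unknown", "words"]]
X = np.vstack([sentence_vector(s, word_vectors, dim=3) for s in sentences])
print(X.shape)  # (3, 3): one row per sentence, one column per vector dimension
```

Summing or tf-idf weighting would only change the body of sentence_vector; the resulting shape stays (n_samples, n_features).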
Note: I just noticed that you were already doing this with D2VTransformer. That should be the correct approach here if you want to use sklearn.
The issue in that question (which has since been deleted) was this line:
X_train = vectorizer.fit_transform(X_train)
Here, you overwrite your original X_train (list of list of words) with already calculated word vectors and hence that error.
Or else, you can use other tools / libraries (keras, tensorflow) which allow sequential input of variable size. For example, LSTMs can be configured here to take a variable input and an ending token to mark the end of sentence (a sample).
Update:
In the above given solution, you can replace the lines:
model = D2VTransformer(dm=1, size=50, min_count=2, iter=10, seed=0)
model.fit(X_train)
clf = LogisticRegression(penalty='l2', C=0.1, random_state=0)
clf.fit(model.transform(X_train), y_train)
pipeline = Pipeline([
('vec', model),
('clf', clf)
])
y_pred = pipeline.predict(X_train)
with
pipeline = Pipeline([
('vec', model),
('clf', clf)
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_train)
No need to fit and transform separately, since pipeline.fit() will automatically do that.
I am working in scikit and I am trying to tune my XGBoost.
I made an attempt to use a nested cross-validation using the pipeline for the rescaling of the training folds (to avoid data leakage and overfitting) and in parallel with GridSearchCV for param tuning and cross_val_score to get the roc_auc score at the end.
from imblearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
std_scaling = StandardScaler()
algo = XGBClassifier()
steps = [('std_scaling', StandardScaler()), ('algo', XGBClassifier())]
pipeline = Pipeline(steps)
parameters = {'algo__min_child_weight': [1, 2],
'algo__subsample': [0.6, 0.9],
'algo__max_depth': [4, 6],
'algo__gamma': [0.1, 0.2],
'algo__learning_rate': [0.05, 0.5, 0.3]}
cv1 = RepeatedKFold(n_splits=2, n_repeats = 5, random_state = 15)
clf_auc = GridSearchCV(pipeline, cv = cv1, param_grid = parameters, scoring = 'roc_auc', n_jobs=-1, return_train_score=False)
cv1 = RepeatedKFold(n_splits=2, n_repeats = 5, random_state = 15)
outer_clf_auc = cross_val_score(clf_auc, X_train, y_train, cv = cv1, scoring = 'roc_auc')
Question 1.
How do I fit cross_val_score to the training data?
Question 2.
Since I included the StandardScaler() in the pipeline does it make sense to include the X_train in the cross_val_score or should I use a standardized form of the X_train (i.e. std_X_train)?
std_scaler = StandardScaler().fit(X_train)
std_X_train = std_scaler.transform(X_train)
std_X_test = std_scaler.transform(X_test)
You chose the right way to avoid data leakage as you say - nested CV.
The thing is, in nested CV what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existing "meta-estimator" which describes your model selection process as well.
Meaning - in every round of the outer cross validation (in your case represented by cross_val_score), the estimator clf_auc undergoes internal CV which selects the best model under the given fold of the external CV.
Therefore, for every fold of the external CV you are scoring a different estimator chosen by the internal CV.
For example, in one external CV fold the model scored can be one that selected the param algo__min_child_weight to be 1, and in another a model that selected it to be 2.
The score of the external CV therefore represents a more high-level score: "under the process of reasonable model selection, how well will my selected model generalize".
Now, if you want to finish the process with a real model in hand you would have to select it in some way (cross_val_score will not do that for you).
The way to do that is to now fit your inner model (the GridSearchCV object) over the entire dataset,
i.e. to perform:
clf_auc.fit(X, y)
This is the moment to understand what you've done here:
You have a model you can use, which is fitted over all the data available.
When you're asked "how well does that model generalize to new data?" the answer is the score you got during your nested CV, which captured the model selection process as part of your model's scoring.
And regarding Question #2 - if the scaler is part of the pipeline, there is no reason to manipulate the X_train externally.
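The whole workflow can be sketched as follows. This is only an illustration under a few assumptions: LogisticRegression stands in for XGBClassifier so the example runs with scikit-learn alone, the data is synthetic, and the names inner_cv and outer_cv are chosen here for clarity. cross_validate with return_estimator=True is used instead of cross_val_score so that the parameters selected in each outer fold can be inspected.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler lives inside the pipeline, so it is refit on every training fold
pipeline = Pipeline([
    ("std_scaling", StandardScaler()),
    ("algo", LogisticRegression(max_iter=1000)),  # stand-in for XGBClassifier
])
parameters = {"algo__C": [0.1, 1.0, 10.0]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=15)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=16)

# Inner loop: model selection
clf_auc = GridSearchCV(pipeline, param_grid=parameters, scoring="roc_auc", cv=inner_cv)

# Outer loop: scores the whole selection procedure on raw, unscaled X
results = cross_validate(
    clf_auc, X, y, cv=outer_cv, scoring="roc_auc", return_estimator=True
)
outer_scores = results["test_score"]
chosen = [search.best_params_["algo__C"] for search in results["estimator"]]

# Final step: fit the search on all the data to get a usable model
clf_auc.fit(X, y)
final_model = clf_auc.best_estimator_
print(outer_scores.mean(), chosen)
```

The mean of outer_scores is the generalization estimate to report for final_model, and chosen may differ from fold to fold, which is exactly the point made above.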