XGBoost: cannot pass validation data for eval_set in pipeline

I want to implement GridSearchCV for an XGBoost model in a pipeline. I have a preprocessor for the data, defined above this code, and some grid params:
XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('XGBmodel', XGBmodel)
])
And I want to pass these fit params:
fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)],
"XGBmodel__early_stopping_rounds": 10,
"XGBmodel__verbose": False}
I am trying to fit the model:
searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)
but I get an error on the line with eval_set: DataFrame.dtypes for data must be int, float or bool
I guess it is because the validation data doesn't go through the preprocessing, but when I google it I find that it is done this way everywhere, and it seems like it should work.
I also tried to find a way to apply the preprocessor to the validation data separately, but it is not possible to transform the validation data without fitting on the train data first.
Full code
columns = num_cols + cat_cols
X_train = X_full_train[columns].copy()
X_valid = X_full_valid[columns].copy()

num_preprocessor = SimpleImputer(strategy='mean')
cat_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(transformers=[
    ('num', num_preprocessor, num_cols),
    ('cat', cat_preprocessor, cat_cols)
])

XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('XGBmodel', XGBmodel)
])

param_grid = {
    "XGBmodel__n_estimators": [10, 50, 100, 500],
    "XGBmodel__learning_rate": [0.1, 0.5, 1],
}
fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)],
              "XGBmodel__early_stopping_rounds": 10,
              "XGBmodel__verbose": False}

searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)
Is there any way to preprocess the validation data in the pipeline? Or maybe a completely different way to implement this?

There is no good way. If you have a long pipeline of transformers before fitting a model, you can consider fitting those in the pipeline and then applying the model separately.
The underlying issue is that a pipeline has no notion of a validation set used during model fitting. You can see a discussion on the LightGBM GitHub here. Their proposal is to pre-train the transformers and apply them to the validation data before you fit the full pipeline. This can be fine if you use fast transformers, but it can double CPU time in an extreme scenario.
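A minimal sketch of that proposal, using the names from the question (hedged: it assumes a recent sklearn, where fit params are passed to fit() rather than to the GridSearchCV constructor):
from sklearn.base import clone

# pre-fit a copy of the preprocessor on the training data only,
# then transform the validation set with it
pre = clone(preprocessor).fit(X_train)
X_valid_transformed = pre.transform(X_valid)

# the eval_set now contains numeric, already-preprocessed data
fit_params = {"XGBmodel__eval_set": [(X_valid_transformed, y_valid)],
              "XGBmodel__early_stopping_rounds": 10,
              "XGBmodel__verbose": False}

searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid)
searchCV.fit(X_train, y_train, **fit_params)
The caveat from the linked discussion applies: the preprocessor refitted inside each CV fold can differ from the pre-fitted one (e.g. in the number of one-hot columns), so this is an approximation rather than an exact fix.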

One way to train a pipeline that uses early stopping is to train the preprocessing and the regressor separately.
The steps are the following:
fit_transform() the transformers
transform() the validation data
fit() the model with the XGBoost fit parameters
dump the fitted pipeline
as follows:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor
import numpy as np
import joblib

rng = np.random.RandomState(0)
X_train, X_val = rng.randn(50, 3), rng.randn(20, 3)
y_train, y_val = rng.randn(50, 1), rng.randn(20, 1)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', XGBRegressor(random_state=0)),
])

# fit the transformers on the training data, then transform the validation data
X_train_transformed = pipeline[:-1].fit_transform(X_train)
X_val_transformed = pipeline[:-1].transform(X_val)

# fit the final estimator with the XGBoost-specific fit parameters
pipeline[-1].fit(
    X=X_train_transformed,
    y=y_train,
    eval_set=[(X_val_transformed, y_val)],
    early_stopping_rounds=10,
)

joblib.dump(pipeline, 'pipeline.pkl')
pipe = joblib.load('pipeline.pkl')
pipe.score(X_val, y_val)
Notes: this will work if you want to fit the pipeline once. However, if you want to perform a grid search with early stopping, you will have to write your own grid search, as in this article.
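A minimal sketch of such a hand-rolled search (hedged: it reuses the transformed arrays from the snippet above, scores with the regressor's default R², and assumes an xgboost version where early_stopping_rounds is still a fit() argument):
from sklearn.model_selection import ParameterGrid
from sklearn.base import clone

param_grid = {'n_estimators': [100, 500], 'learning_rate': [0.1, 0.5]}

best_score, best_params = float('-inf'), None
for params in ParameterGrid(param_grid):
    model = clone(pipeline[-1]).set_params(**params)
    model.fit(
        X_train_transformed, y_train,
        eval_set=[(X_val_transformed, y_val)],
        early_stopping_rounds=10,
        verbose=False,
    )
    score = model.score(X_val_transformed, y_val)  # R² on the validation set
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)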

Related

K-fold cross validation in Python

What I'm trying to do:
Get the K-fold cross-validated scores of an SVM. The data has all-numerical independent variables and a categorical dependent variable. I'm using Python 3, sklearn and feature-engine.
My understanding of the matter:
The independent variables have NA values, all of them below 5% of the total data points, so I imputed them using the median values from the train set, as the variables are not normally distributed. I also scaled the values of the train and test sets using the values from the train set. My train-test split is 80-20.
I understand that it is good practice to scale and impute data using only the train set, as this helps avoid over-fitting and data leakage.
When it comes to K-fold cross validation, the train and test sets change with each fold.
Question:
Is there a way to ensure that I re-impute and re-scale the train and test set based on the train set of each fold?
Any help is appreciated, thank you!
The train-test split uses a random seed. The same random seed is used in the K-fold cross validation.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 3)
NA value imputation:
from feature_engine import missing_data_imputers as mdi
imputer = mdi.MeanMedianImputer(imputation_method = 'median')
imputer.fit(X_train)
X_train = imputer.transform(X_train)
Variable transformation:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_trans = scaler.transform(X_train)
X_test_trans = scaler.transform(X_test)
Below is the SVM:
def svm1(gam, C):
    clf1 = svm.SVC(gamma=gam, C=C)
    clf1.fit(X_train_trans, y_train)
    print('The Trainset Score is {}.'.format(clf1.score(X_train_trans, y_train)))
    print('The Testset Score is {}.'.format(clf1.score(X_test_trans, y_test)))
    print('')
    y_pred1 = clf1.predict(X_test_trans)
    print('The confusion matrix is; \n{}'.format(metrics.confusion_matrix(y_test, y_pred1)))

interactive(svm1, gam=G1, C=cc1)
I then merge the train and test sets to get back a single transformed dataset:
frames3 = [X_test_trans, X_train_trans]
X_Final = pd.concat(frames3)
Now I fit X_Final, the concatenated train and test sets, to get the K-fold cross-validated score.
kfold = KFold(n_splits=10, shuffle=True, random_state=3)  # random_state requires shuffle=True
model = svm.SVC(gamma=0.23, C=3.20)
results = cross_val_score(model, X_Final, y_Final, cv=kfold)
print(results)
print('Accuracy = {}%, Standard Deviation = {}%'.format(round(results.mean(), 4), round(results.std(), 2)))
I would like to know how I can re-scale and re-impute each fold, so that the variables are re-scaled and NA values re-imputed in each fold using that fold's train set, to avoid over-fitting / data leakage.
To impute and scale the data with the parameters derived from each fold in the CV, you first need to establish the engineering steps in a pipeline, and then do the CV over the entire pipeline. For example, something like this:
Set up the engineering pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from feature_engine import missing_data_imputers as mdi

my_pipe = Pipeline([
    # missing data imputation
    ('imputer_num',
     mdi.MeanMedianImputer(imputation_method='mean', variables=['varA', 'varB'])),
    # scaler
    ('scaler', StandardScaler()),
    # gradient boosted machine (or your SVM instead)
    ('gbm', GradientBoostingClassifier(random_state=0))
])
then the CV:
from sklearn.model_selection import GridSearchCV

param_grid = {
    # try different gradient boosted tree model parameters
    'gbm__max_depth': [None, 1, 3],
}

# now we set up the grid search with cross-validation
grid_search = GridSearchCV(my_pipe, param_grid,
                           cv=5, n_jobs=-1, scoring='roc_auc')
More details in this notebook.
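To actually run the search, something like the following (a hedged sketch; X_train, y_train, X_test and y_test are the names from the question, and roc_auc assumes a binary target):
grid_search.fit(X_train, y_train)

# inside each CV fold, the imputer and scaler are fitted on that fold's
# training portion only and then applied to its validation portion -
# exactly the per-fold re-imputation and re-scaling asked about
print(grid_search.best_params_)
print(grid_search.best_score_)

# final check on the held-out test set, transformed with parameters
# learned from the full training set
print(grid_search.score(X_test, y_test))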

Tuning models with grid search

I ran across an example of parameter tuning with grid search and text data, using TfidfVectorizer() in the pipeline.
As far as I understand, when we call grid_search.fit(X_train, y_train) it transforms the data and then fits the model, as described in the dictionary. However, during evaluation I'm a bit confused about the test dataset: when we call grid_search.predict(X_test), I don't know whether (or how) the TfidfVectorizer() is applied to this test chunk.
Thanks
David
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(solver='liblinear'))  # liblinear supports both l1 and l2 penalties
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,
                               verbose=1, scoring='accuracy', cv=3)
    df = pd.read_csv('data/sms.csv')
    X, y = df['message'], df['label']
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    predictions = grid_search.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, predictions))
    print('Precision:', precision_score(y_test, predictions))
    print('Recall:', recall_score(y_test, predictions))
This is an example of scikit-learn pipeline magic. It works like this:
First, you define the elements of the pipeline with the Pipeline constructor. All data, whether at the train or the test (predict) stage, will be processed through all the defined steps - in this case by TfidfVectorizer, whose output is then passed to the LogisticRegression model.
Passing the defined pipeline to the GridSearchCV constructor gives you a fit method that not only performs the grid search but also internally refits the pipeline - both TfidfVectorizer and LogisticRegression - with the best-found parameters, so a later predict runs on the best-found model. At predict time, the vectorizer is not re-fitted: it only applies transform, using the vocabulary and IDF weights learned from the training data.
You can find more info on creating pipelines in the scikit-learn documentation.
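A small sketch to make that concrete (hedged: it assumes grid_search.fit has already run with refit=True, the GridSearchCV default): the pipeline's predict on X_test is equivalent to transforming X_test with the already-fitted vectorizer and then calling the classifier.
import numpy as np

best = grid_search.best_estimator_  # the refitted Pipeline

# what best.predict(X_test) does under the hood:
X_test_tfidf = best.named_steps['vect'].transform(X_test)  # transform only, no re-fit
manual_preds = best.named_steps['clf'].predict(X_test_tfidf)

assert np.array_equal(best.predict(X_test), manual_preds)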

fit vs fit_transform in pipeline

On this page: https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines
it calls fit_transform for transforming the data as follows:
from sklearn.pipeline import FeatureUnion

feats = FeatureUnion([('text', text),
                      ('length', length),
                      ('words', words),
                      ('words_not_stopword', words_not_stopword),
                      ('avg_word_length', avg_word_length),
                      ('commas', commas)])

feature_processing = Pipeline([('feats', feats)])
feature_processing.fit_transform(X_train)
While during training with feature processing, it only uses fit and then predict:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('features', feats),
    ('classifier', RandomForestClassifier(random_state=42)),
])
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)
np.mean(preds == y_test)
The question is: is fit doing the transformation on X_train (as would be achieved by transform, since we are not calling fit_transform here) in the second case?
sklearn's Pipeline has some nice features. It performs several tasks in a very clean way. We define our features, their transformations, and the classifier we want to use, all in one object.
In the first step of this
pipeline = Pipeline([
    ('features', feats),
    ('classifier', RandomForestClassifier(random_state=42)),
])
you have defined the feature names and their transformation functions (incorporated in feats); in the second step, you have defined the classifier's name and the classifier itself.
Now, when you call pipeline.fit, it first fits the feature transformers and transforms the data, then fits the classifier on the transformed features. So it does those steps for us - including the transform of X_train you asked about. You can check more here.
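A hedged sketch of what those two calls do internally (simplified; the real Pipeline also handles fit params, caching and validation):
# roughly what pipeline.fit(X_train, y_train) does:
Xt = feats.fit_transform(X_train, y_train)    # fit + transform each feature step
clf = RandomForestClassifier(random_state=42)
clf.fit(Xt, y_train)                          # fit the final estimator on transformed data

# and roughly what pipeline.predict(X_test) does:
preds = clf.predict(feats.transform(X_test))  # transform only, no re-fit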

Computing Pipeline logistic regression predict_proba in sklearn

I have a dataframe with 3 features and 3 classes that I split into X_train, Y_train, X_test and Y_test, and then run sklearn's Pipeline with PCA, StandardScaler and finally LogisticRegression. I want to be able to calculate the probabilities directly from the LR weights and the raw data without using predict_proba, but I don't know how, because I'm not sure exactly how the pipeline pipes X_test through PCA and StandardScaler into the logistic regression. Is this realistic without being able to use PCA's and StandardScaler's fit methods? Any help would be greatly appreciated!
So far, I have:
pca = PCA(whiten=True)
scaler = StandardScaler()
logistic = LogisticRegression(fit_intercept=True, class_weight='balanced',
                              solver='sag', n_jobs=-1, C=1.0, max_iter=200)

pipe = Pipeline(steps=[('pca', pca), ('scaler', scaler), ('logistic', logistic)])
pipe.fit(X_train, Y_train)
predict_probs = pipe.predict_proba(X_test)

coefficients = pipe.steps[2][1].coef_      # (3 by 30)
intercepts = pipe.steps[2][1].intercept_   # (1 by 3)
This was also a question I couldn't figure out, so thanks to Kumar for the answer.
I assumed the pipeline would fit a new transform on x_test, but when I compared a Pipeline composed of StandardScaler and LogisticRegression against my own function using StandardScaler and LogisticRegression directly, I found that the Pipeline actually applies the transform fitted on x_train. So don't worry about using pipelines - they are really a convenient and useful tool for machine learning!
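Back to the original question of computing predict_proba by hand: a hedged sketch, assuming the fitted pipe above and a multinomial (softmax) logistic regression, which is what recent sklearn versions select for three classes with the sag solver; a one-vs-rest model would instead normalize per-class sigmoids.
import numpy as np

# pipe X_test through the already-fitted transformers, in pipeline order
X_t = pipe.named_steps['pca'].transform(X_test)
X_t = pipe.named_steps['scaler'].transform(X_t)

# logits: one column per class
lr = pipe.named_steps['logistic']
logits = X_t @ lr.coef_.T + lr.intercept_

# softmax over classes
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
manual_probs = exp / exp.sum(axis=1, keepdims=True)

assert np.allclose(manual_probs, pipe.predict_proba(X_test))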

sklearn GridSearchCV: how to get classification report?

I am using GridSearchCV like this:
corpus = load_files('corpus')
with open('stopwords.txt', 'r') as f:
    stop_words = [y for x in f.read().split('\n') for y in (x, x.title())]

x = corpus.data
y = corpus.target

pipeline = Pipeline([
    ('vec', CountVectorizer(stop_words=stop_words)),
    ('classifier', MultinomialNB())])

parameters = {'vec__ngram_range': [(1, 1), (1, 2)],
              'classifier__alpha': [1e-2, 1e-3],
              'classifier__fit_prior': [True, False]}

gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=5, scoring="f1", verbose=10)
gs_clf = gs_clf.fit(x, y)
joblib.dump(gs_clf.best_estimator_, 'MultinomialNB.pkl', compress=1)
Then, in another file, to classify new documents (not from the corpus), I do this:
classifier = joblib.load(filepath) # path to .pkl file
result = classifier.predict(tokenlist)
My question is: where do I get the values needed for the classification_report?
In many other examples, I see people split the corpus into a training set and a test set.
However, since I am using GridSearchCV with k-fold cross-validation, I don't need to do that.
So how can I get those values from GridSearchCV?
If you have the GridSearchCV object:
from sklearn.metrics import classification_report

clf = GridSearchCV(....)
clf.fit(x_train, y_train)
classification_report(y_test, clf.best_estimator_.predict(x_test))

If you have saved the best estimator and loaded it, then:
classifier = joblib.load(filepath)
classification_report(y_test, classifier.predict(x_test))

The best model is in clf.best_estimator_ (refit on the full training data by default). You fit on the training data, then predict your test data and use y_test and the predictions for the classification report.
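If you would rather get the report from the cross-validation itself instead of holding out a test set, one option (a hedged suggestion, not part of the answer above) is cross_val_predict, which yields one out-of-fold prediction per sample:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

# each sample is predicted by a model that never saw it during training
y_pred = cross_val_predict(pipeline, x, y, cv=5)
print(classification_report(y, y_pred))
Note that this is slightly optimistic if the hyperparameters were tuned on the same data; nested cross-validation is the stricter alternative.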
