GridSearchCV, Pipeline and Functional Model - keras

Problem with predict function of Functional Model
I am trying to combine nested cross-validation and a pipeline with my Functional model.
This is the code:
# binaryModel is a build function that returns a Functional ANN
grid = dict(ann__n_neurons=[2], ann__num_hidden=[2], ann__used_optimizer=["adam"],
            ann__l1_reg=[0.0], ann__l2_reg=[0.0], ann__learning_rate=[0.01],
            ann__dropout_rate=[0.0])
X, y = prepare_dataset("", short, bin_categorical, "",
                       continous_to_binary, target)
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)  # outer cross-validation, 10 splits, to test the model
# enumerate splits
outer_results = list()
i = 0
for train_ix, test_ix in cv_outer.split(X):
    print("Outer-Split: ", i)
    i += 1
    # split data
    X_train, X_test = X.iloc[train_ix], X.iloc[test_ix]
    y_train, y_test = y[train_ix], y[test_ix]
    # configure the cross-validation procedure
    cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)  # inner cross-validation, 3 splits, to configure the model
    # define the model
    ann = KerasClassifier(build_fn=binaryModel, input_shape=X_train.shape[1],
                          batch_size=32,
                          epochs=10, validation_split=0.2)
    # define the pipeline
    pipe = Pipeline(steps=[('scaler', StandardScaler()), ('ann', ann)])
    # define the grid search
    cv = GridSearchCV(pipe, grid, n_jobs=1, cv=cv_inner, refit=True)
    # execute search
    cv.fit(X_train, y_train, ann__verbose=0)
    print('Best score and parameter combination = ')
    print(cv.best_score_)
    print(cv.best_params_)
    print(cv.best_estimator_)
    y_predicted = cv.predict(X_test)
Output:
Best score and parameter combination =
0.8449265360832214
{'ann__dropout_rate': 0.0, 'ann__l1_reg': 0.0, 'ann__l2_reg': 0.0, 'ann__learning_rate': 0.01, 'ann__n_neurons': 2, 'ann__num_hidden': 2, 'ann__used_optimizer': 'adam'}
Pipeline(steps=[('scaler', StandardScaler()),
('ann',
<tensorflow.python.keras.wrappers.scikit_learn.KerasClassifier object at 0x7efef01ffd30>)])
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-28-3f1c5b78794d> in <module>
61 print(cv.best_params_)
62 print(cv.best_estimator_[1])
---> 63 y_predicted = cv.predict(X_test)
AttributeError: 'Functional' object has no attribute 'predict_classes'
How can I make predictions with the final best model?
My result is a Pipeline, but I can't use its predict function. Why?
I want to use the predict function to evaluate each fold (accuracy, sensitivity and so on...).

The problem is that a Keras Functional API model doesn't have a predict_classes method, which is what the legacy KerasClassifier wrapper calls when sklearn's GridSearchCV asks it to predict; only Sequential Keras models have it. I have been running into the same problem. What I would suggest is to implement your own grid search, or try out https://github.com/autonomio/talos, which seems promising, though I have not tried it myself yet.
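As a rough sketch of a possible workaround (not part of the original answer, and untested against this exact setup): you can subclass the legacy KerasClassifier so that its predict no longer relies on predict_classes. The import path and attribute names below assume the legacy tensorflow.keras wrapper used in the question; adjust them if your setup differs.
import numpy as np
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

class FunctionalKerasClassifier(KerasClassifier):
    """KerasClassifier variant that works with Functional models (no predict_classes)."""
    def predict(self, x, **kwargs):
        proba = self.model.predict(x, **kwargs)
        if proba.shape[-1] > 1:
            # multi-class softmax output: take the most probable class
            class_indices = np.argmax(proba, axis=-1)
        else:
            # single sigmoid output: threshold the probability at 0.5
            class_indices = (proba > 0.5).astype('int32').reshape(-1)
        # classes_ is set by KerasClassifier.fit, so this maps indices back to labels
        return self.classes_[class_indices]

# Drop-in replacement in the pipeline from the question:
# ann = FunctionalKerasClassifier(build_fn=binaryModel, input_shape=X_train.shape[1],
#                                 batch_size=32, epochs=10, validation_split=0.2)
With that in place, cv.predict(X_test) goes through the overridden predict, and the per-fold evaluation (accuracy, sensitivity and so on) can use y_predicted as usual.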

Related

Kfold cross validation in python

What I'm trying to do:
Get the K-fold cross-validated scores of an SVM. The data has all numerical independent variables and a categorical dependent variable. I'm using Python 3, sklearn and feature_engine.
My understanding of the matter:
The independent variables have NA values, all of them below 5% of the total data points, so I imputed them using the median values from the train set, as the variables are not normally distributed. I also scaled the values of the train and test set using the values from the train set. My train-test split is 80-20.
I understand that it is good practice to scale and impute data using only the train set, as this helps avoid overfitting and data leakage.
When it comes to K-fold cross-validation, the train and test set change on every fold.
Question:
Is there a way to ensure that I can re-impute and re-scale the train and test set based on the train set of each fold?
Any help is appreciated, thank you!
Train-test split using a random seed. Same random seed is used in the K-Fold cross validation.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 3)
NA value imputation;
from feature_engine import missing_data_imputers as mdi
imputer = mdi.MeanMedianImputer(imputation_method = 'median')
imputer.fit(X_train)
X_train = imputer.transform(X_train)
Variable transformation;
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_trans = scaler.transform(X_train)
X_test_trans = scaler.transform(X_test)
Below is the SVM;
def svm1(gam, C):
    clf1 = svm.SVC(gamma=gam, C=C)
    clf1.fit(X_train_trans, y_train)
    print('The Trainset Score is {}.'.format(clf1.score(X_train_trans, y_train)))
    print('The Testset Score is {}.'.format(clf1.score(X_test_trans, y_test)))
    print('')
    y_pred1 = clf1.predict(X_test_trans)
    print('The confusion matrix is; \n{}'.format(metrics.confusion_matrix(y_test, y_pred1)))
interactive(svm1, gam=G1, C=cc1)
I then merge the train and test set, to get back a transformed dataset;
frames3 = [X_test_trans, X_train_trans ]
X_Final = pd.concat(frames3)
Now I fit X_Final, which is the concatenated train and test set, to get the K-fold cross-validated score.
kfold = KFold(n_splits = 10, random_state = 3)
model = svm.SVC(gamma=0.23, C=3.20)
results = cross_val_score(model, PCA_X_Final,y_Final, cv = kfold)
print(results)
print('Accuracy = {}%, Standard Deviation = {}%'.format(round(results.mean(), 4), round(results.std(), 2)))
I would like to know how I can re-scale and re-impute each fold, so that the variables are re-scaled and NA values re-imputed in each fold using that fold's train set, to avoid overfitting / data leakage.
To impute and scale the data with parameters derived from each fold in the CV, you first need to put the engineering steps in a pipeline, and then do the CV over the entire pipeline. For example, something like this:
set up engineering pipeline:
my_pipe = Pipeline([
    # missing data imputation
    ('imputer_num',
     mdi.MeanMedianImputer(imputation_method='mean', variables=['varA', 'varB'])),
    # scaler
    ('scaler', StandardScaler()),
    # Gradient Boosted machine (or your SVM instead)
    ('gbm', GradientBoostingClassifier(random_state=0))
])
then the CV:
param_grid = {
    # try different gradient boosted tree model parameters
    'gbm__max_depth': [None, 1, 3],
}
# now we set up the grid search with cross-validation
grid_search = GridSearchCV(my_pipe, param_grid,
                           cv=5, n_jobs=-1, scoring='roc_auc')
More details in this notebook.
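To map this back onto the SVM from the question, here is a minimal sketch (assuming X_train and y_train still hold the raw, unimputed and unscaled training data from the earlier train_test_split, and the old feature_engine import style from the question): cross-validating the whole pipeline means the imputer and scaler are re-fitted on each fold's training portion only.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, cross_val_score
from sklearn import svm
from feature_engine import missing_data_imputers as mdi

svm_pipe = Pipeline([
    ('imputer', mdi.MeanMedianImputer(imputation_method='median')),  # re-fitted per fold
    ('scaler', StandardScaler()),                                    # re-fitted per fold
    ('svc', svm.SVC(gamma=0.23, C=3.20)),
])

kfold = KFold(n_splits=10, shuffle=True, random_state=3)
# On every split, the imputer and scaler learn their parameters from that fold's
# training part and are only applied to the held-out part before scoring the SVM.
results = cross_val_score(svm_pipe, X_train, y_train, cv=kfold)
print(results.mean(), results.std())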

Why is Random Search showing better results than Grid Search?

I'm playing with the RandomizedSearchCV function from scikit-learn. Some academic papers claim that randomized search can provide 'good enough' results compared with a full grid search, while saving a lot of time.
Surprisingly, on one occasion, RandomizedSearchCV gave me better results than GridSearchCV. I think GridSearchCV is supposed to be exhaustive, so its result should be at least as good as RandomizedSearchCV's, assuming they search the same grid.
For the same dataset and mostly the same settings, GridSearchCV returned the following result:
Best cv accuracy: 0.7642857142857142
Test set score: 0.725
Best parameters: 'C': 0.02
the RandomizedSearchCV returned me the following result:
Best cv accuracy: 0.7428571428571429
Test set score: 0.7333333333333333
Best parameters: 'C': 0.008
To me, the test score of 0.733 is better than 0.725, and the difference between the test score and the training score is smaller for RandomizedSearchCV, which to my knowledge means less overfitting.
So why did GridSearchCV return worse results?
GridSearchCV code:
def linear_SVC(x, y, param, kfold):
    param_grid = {'C': param}
    k = KFold(n_splits=kfold, shuffle=True, random_state=0)
    grid = GridSearchCV(LinearSVC(), param_grid=param_grid, cv=k, n_jobs=4, verbose=1)
    return grid.fit(x, y)
#high C means more chance of overfitting
start = timer()
param = [i/1000 for i in range(1,1000)]
param1 = [i for i in range(1,101)]
param.extend(param1)
#progress = progressbar.bar.ProgressBar()
clf = linear_SVC(x=x_train, y=y_train, param=param, kfold=3)
print('LinearSVC:')
print('Best cv accuracy: {}' .format(clf.best_score_))
print('Test set score: {}' .format(clf.score(x_test, y_test)))
print('Best parameters: {}' .format(clf.best_params_))
print()
duration = timer() - start
print('time to run: {}' .format(duration))
RandomizedSearchCV code:
from sklearn.model_selection import RandomizedSearchCV
def Linear_SVC_Rand(x, y, param, kfold, n):
    param_grid = {'C': param}
    k = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=0)
    randsearch = RandomizedSearchCV(LinearSVC(), param_distributions=param_grid, cv=k, n_jobs=4,
                                    verbose=1, n_iter=n)
    return randsearch.fit(x, y)
start = timer()
param = [i/1000 for i in range(1,1000)]
param1 = [i for i in range(1,101)]
param.extend(param1)
#progress = progressbar.bar.ProgressBar()
clf = Linear_SVC_Rand(x=x_train, y=y_train, param=param, kfold=3, n=100)
print('LinearSVC:')
print('Best cv accuracy: {}' .format(clf.best_score_))
print('Test set score: {}' .format(clf.score(x_test, y_test)))
print('Best parameters: {}' .format(clf.best_params_))
print()
duration = timer() - start
print('time to run: {}' .format(duration))
First, try to understand this:
https://stats.stackexchange.com/questions/49540/understanding-stratified-cross-validation
From it you will see why StratifiedKFold is better than KFold: it preserves the class proportions in every fold, while KFold does not.
Use StratifiedKFold in both GridSearchCV and RandomizedSearchCV. Make sure to leave shuffle=False and not to use the random_state parameter. That way the dataset is not shuffled, your results won't change each time you train, and you might get what you expect.
GridSearchCV code:
def linear_SVC(x, y, param, kfold):
    param_grid = {'C': param}
    k = StratifiedKFold(n_splits=kfold)
    grid = GridSearchCV(LinearSVC(), param_grid=param_grid, cv=k, n_jobs=4, verbose=1)
    return grid.fit(x, y)
RandomizedSearchCV code:
def Linear_SVC_Rand(x, y, param, kfold, n):
    param_grid = {'C': param}
    k = StratifiedKFold(n_splits=kfold)
    randsearch = RandomizedSearchCV(LinearSVC(), param_distributions=param_grid, cv=k, n_jobs=4,
                                    verbose=1, n_iter=n)
    return randsearch.fit(x, y)
From what it looks like, it is performing properly: your train/CV accuracy in grid search is higher than the train/CV accuracy in randomized search. The hyperparameters should not be tuned using the test set, so assuming you are doing that properly, it may just be a coincidence that the hyperparameters chosen by randomized search performed better on the test set.

How to run only one fold of cross validation in sklearn?

I have the following code to run a 10-fold cross-validation in sklearn:
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
scores = model_selection.cross_val_score(MyEstimator(), x_data, y_data, cv=cv, scoring='mean_squared_error') * -1
For debugging purposes, while I am trying to make MyEstimator work, I would like to run only one fold of this cross-validation, instead of all 10. Is there an easy way to keep this code but just say to run the first fold and then exit?
I would still like that data is split into 10 parts, but that only one combination of that 10 parts is fitted and scored, instead of 10 combinations.
No, not with cross_val_score, I suppose. You can set n_splits to its minimum value of 2, but that still gives you a 50:50 train/test split, which you may not want.
If you want to maintain a 90:10 ratio and just test other parts of the code, like MyEstimator(), then you can use a workaround.
You can use KFold.split() to get the first set of train and test indices and then break the loop after the first iteration:
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(x_data):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = x_data[train_index], x_data[test_index]
    y_train, y_test = y_data[train_index], y_data[test_index]
    break
Now use this X_train, y_train to train the estimator and X_test, y_test to score it.
Instead of:
scores = model_selection.cross_val_score(MyEstimator(),
                                         x_data, y_data,
                                         cv=cv,
                                         scoring='mean_squared_error')
Your code becomes:
myEstimator_fitted = MyEstimator().fit(X_train, y_train)
y_pred = myEstimator_fitted.predict(X_test)
from sklearn.metrics import mean_squared_error
# I am appending to a scores list object, because that will be output of cross_val_score.
scores = []
scores.append(mean_squared_error(y_test, y_pred))
Rest assured, cross_val_score does essentially the same thing internally, just with some enhancements for parallel processing.
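An alternative sketch, not from the original answer: if the goal is simply to fit and score once while keeping the cross_val_score call intact, a ShuffleSplit with n_splits=1 yields a single 90:10 split that can be passed as cv. (The scoring name below assumes a recent sklearn, where 'mean_squared_error' has been renamed 'neg_mean_squared_error'; MyEstimator, x_data and y_data are the objects from the question.)
from sklearn import model_selection

# A single 90:10 split; cross_val_score then fits and scores exactly once.
single_split = model_selection.ShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
scores = model_selection.cross_val_score(MyEstimator(), x_data, y_data,
                                         cv=single_split,
                                         scoring='neg_mean_squared_error') * -1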

GridSearchCV - XGBoost - Early Stopping

I am trying to do a hyperparameter search using scikit-learn's GridSearchCV on XGBoost. During the grid search I'd like it to stop early, since that reduces search time drastically and I expect it to improve results on my prediction/regression task. I am using XGBoost via its Scikit-Learn API.
model = xgb.XGBRegressor()
GridSearchCV(model, paramGrid, verbose=verbose ,fit_params={'early_stopping_rounds':42}, cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]), n_jobs=n_jobs, iid=iid).fit(trainX,trainY)
I tried to pass the early stopping parameters using fit_params, but then it throws this error, which is basically due to the lack of a validation set, which early stopping requires:
/opt/anaconda/anaconda3/lib/python3.5/site-packages/xgboost/callback.py in callback(env=XGBoostCallbackEnv(model=<xgboost.core.Booster o...teration=4000, rank=0, evaluation_result_list=[]))
187 else:
188 assert env.cvfolds is not None
189
190 def callback(env):
191 """internal function"""
--> 192 score = env.evaluation_result_list[-1][1]
score = undefined
env.evaluation_result_list = []
193 if len(state) == 0:
194 init(env)
195 best_score = state['best_score']
196 best_iteration = state['best_iteration']
How can I apply GridSearchCV on XGBoost while using early_stopping_rounds?
Note: the model works without grid search, and GridSearchCV works without fit_params={'early_stopping_rounds': 42}.
When using early_stopping_rounds you also have to pass eval_metric and eval_set as input parameters to the fit method. Early stopping works by calculating the error on an evaluation set: if the error does not improve for early_stopping_rounds consecutive rounds, the generation of additional trees is stopped early.
See the documentation of xgboosts fit method for details.
Here you see a minimal fully working example:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
cv = 2
trainX= [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]
# these are the evaluation sets
testX = trainX
testY = trainY
paramGrid = {"subsample" : [0.5, 0.8]}
fit_params = {"early_stopping_rounds": 42,
              "eval_metric": "mae",
              "eval_set": [[testX, testY]]}
model = xgb.XGBRegressor()
gridsearch = GridSearchCV(model, paramGrid, verbose=1,
                          fit_params=fit_params,
                          cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))
gridsearch.fit(trainX,trainY)
An update to #glao's answer and a response to #Vasim's comment/question, as of sklearn 0.21.3 (note that fit_params has been moved out of the instantiation of GridSearchCV and been moved into the fit() method; also, the import specifically pulls in the sklearn wrapper module from xgboost):
import xgboost.sklearn as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
cv = 2
trainX= [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]
# these are the evaluation sets
testX = trainX
testY = trainY
paramGrid = {"subsample" : [0.5, 0.8]}
fit_params = {"early_stopping_rounds": 42,
              "eval_metric": "mae",
              "eval_set": [[testX, testY]]}
model = xgb.XGBRegressor()
gridsearch = GridSearchCV(model, paramGrid, verbose=1,
                          cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))
gridsearch.fit(trainX, trainY, **fit_params)
Here's a solution that works in a Pipeline with GridSearchCV. The challenge occurs when you have a pipeline that needs to pre-process your training data, for example when X is a text document and you need a TfidfVectorizer to vectorize it.
Override the XGBRegressor or XGBClassifier .fit() function
This step uses train_test_split() to select the specified number of validation records from X for the eval_set and then passes the remaining records along to fit().
A new parameter eval_test_size is added to .fit() to control the number of validation records (see the train_test_split test_size documentation).
**kwargs passes along any other parameters added by the user for the XGBRegressor.fit() function.
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import train_test_split

class XGBRegressor_ES(XGBRegressor):

    def fit(self, X, y, *, eval_test_size=None, **kwargs):
        if eval_test_size is not None:
            params = super(XGBRegressor, self).get_xgb_params()
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=eval_test_size, random_state=params['random_state'])
            eval_set = [(X_test, y_test)]
            # Could add (X_train, y_train) to eval_set
            # to get .eval_results() for both train and test
            # eval_set = [(X_train, y_train), (X_test, y_test)]
            kwargs['eval_set'] = eval_set
        else:
            # No validation split requested: train on all of X, y
            X_train, y_train = X, y
        return super(XGBRegressor_ES, self).fit(X_train, y_train, **kwargs)
Example Usage
Below is a multistep pipeline that includes multiple transformations to X. The pipeline's fit() function passes the new evaluation parameter to the XGBRegressor_ES class above as xgbr__eval_test_size=200. In this example:
X_train contains text documents passed to the pipeline.
XGBRegressor_ES.fit() uses train_test_split() to select 200 records from X_train for the validation set and early stopping. (This could also be a percentage such as xgbr__eval_test_size=0.2)
The remaining records in X_train are passed along to XGBRegressor.fit() for the actual fit().
Early stopping may now occur after 75 boosting rounds without improvement for each CV fold in a grid search.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_regression
xgbr_pipe = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                            ('vt', VarianceThreshold()),
                            ('scaler', StandardScaler()),
                            ('Sp', SelectPercentile()),
                            ('xgbr', XGBRegressor_ES(n_estimators=2000,
                                                     objective='reg:squarederror',
                                                     eval_metric='mae',
                                                     learning_rate=0.0001,
                                                     random_state=7))])
X_train = train_idxs['f_text'].values
y_train = train_idxs['Pct_Change_20'].values
Example Fitting the Pipeline:
%time xgbr_pipe.fit(X_train, y_train,
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae',
                    xgbr__early_stopping_rounds=75)
Example Fitting GridSearchCV:
learning_rate = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3]
param_grid = dict(xgbr__learning_rate=learning_rate)
grid_search = GridSearchCV(xgbr_pipe, param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=10)
grid_result = grid_search.fit(X_train, y_train,
                              xgbr__eval_test_size=200,
                              xgbr__eval_metric='mae',
                              xgbr__early_stopping_rounds=75)

GridSearch / make_scorer strange results with xgboost model

I am trying to use sklearn's grid search with a model created by xgboost. To do this, I am creating a custom scorer based on the NDCG evaluation metric. I can successfully use Snippet 1, but it is too messy / hacky; I would prefer to use good old sklearn to simplify the code. I tried to implement GridSearchCV and the result is completely off: for the same X and y sets I get NDCG@k = 0.8 with Snippet 1 versus 0.5 with Snippet 2. Obviously there is something I am not doing right here...
The following pieces of code return very different results:
Snippet1:
kf = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=42)
max_depth = [6]
learning_rate = [0.22]
n_estimators = [43]
reg_alpha = [0.1]
reg_lambda = [10]
for md in max_depth:
    for lr in learning_rate:
        for ne in n_estimators:
            for ra in reg_alpha:
                for rl in reg_lambda:
                    xgb = XGBClassifier(objective='multi:softprob',
                                        max_depth=md,
                                        learning_rate=lr,
                                        n_estimators=ne,
                                        reg_alpha=ra,
                                        reg_lambda=rl,
                                        subsample=0.6, colsample_bytree=0.6, seed=0)
                    print([md, lr, ne])
                    score = []
                    for train_index, test_index in kf:
                        X_train, X_test = X[train_index], X[test_index]
                        y_train, y_test = y[train_index], y[test_index]
                        xgb.fit(X_train, y_train)
                        y_pred = xgb.predict_proba(X_test)
                        score.append(ndcg_scorer(y_test, y_pred))
                    print('all scores: %s' % score)
                    print('average score: %s' % np.mean(score))
Snippet2:
from sklearn.grid_search import GridSearchCV
params = {
    'max_depth': [6],
    'learning_rate': [0.22],
    'n_estimators': [43],
    'reg_alpha': [0.1],
    'reg_lambda': [10],
    'subsample': [0.6],
    'colsample_bytree': [0.6]
}
xgb = XGBClassifier(objective='multi:softprob',seed=0)
scorer = make_scorer(ndcg_scorer, needs_proba=True)
gs = GridSearchCV(xgb, params, cv=5, scoring=scorer, verbose=10, refit=False)
gs.fit(X,y)
gs.best_score_
While Snippet 1 gives me the result I expect, the score returned by Snippet 2 is not consistent with ndcg_scorer.
The problem is with cv in GridSearchCV(xgb, params, cv=5, scoring=scorer, verbose=10, refit=False). It can receive a KFold / StratifiedKFold object instead of an int. Unlike what the docs say, it seems that passing an argument of type int does not end up calling StratifiedKFold here but another splitter, maybe KFold.
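A minimal sketch of that fix (assuming the xgb, params, scorer, X and y objects from Snippet 2, and the newer sklearn.model_selection API rather than the old sklearn.grid_search / cross_validation modules): pass a StratifiedKFold splitter explicitly so the folds match the ones Snippet 1 builds by hand.
from sklearn.model_selection import StratifiedKFold, GridSearchCV

# Same stratified 5-fold scheme that Snippet 1 constructs manually.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
gs = GridSearchCV(xgb, params, cv=skf, scoring=scorer, verbose=10, refit=False)
gs.fit(X, y)
print(gs.best_score_)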

Resources